SlideShare une entreprise Scribd logo
1  sur  60
Télécharger pour lire hors ligne
Microservices, containers,
and machine learning
2015-07-23 • PDX
Paco Nathan, @pacoid

O’Reilly Learning
Licensed under a Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
Spark Brief
Spark Brief: Components
• generalized patterns

unified engine for many use cases
• lazy evaluation of the lineage graph

reduces wait states, better pipelining
• generational differences in hardware

off-heap use of large memory spaces
• functional programming / ease of use

reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
Spark Brief: Key Distinctions vs. MapReduce
databricks.com/blog/2014/11/05/spark-officially-
sets-a-new-record-in-large-scale-sorting.html
Spark Brief: SmashingThe Previous Petabyte Sort Record
GraphX Examples
cost
4
node
0
node
1
node
3
node
2
cost
3
cost
1
cost
2
cost
1
GraphX:
spark.apache.org/docs/latest/graphx-
programming-guide.html
Key Points:
• graph-parallel systems
• emphasis on integrated workflows
• optimizations
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin

graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-
guestrin.pdf
Pregel: Large-scale graph computing at Google

Grzegorz Czajkowski, et al.

googleresearch.blogspot.com/2009/06/large-scale-graph-
computing-at-google.html
GraphX: Graph Analytics in Spark

Ankur Dave, Databricks

spark-summit.org/east-2015/talk/graphx-graph-
analytics-in-spark
Topic modeling with LDA: MLlib meets GraphX

Joseph Bradley, Databricks

databricks.com/blog/2015/03/25/topic-modeling-with-
lda-mllib-meets-graphx.html
GraphX: Further Reading…
GraphX: Compose Node + Edge RDDs into a Graph
val nodeRDD: RDD[(Long, ND)] = sc.parallelize(…)
val edgeRDD: RDD[Edge[ED]] = sc.parallelize(…)
val g: Graph[ND, ED] = Graph(nodeRDD, edgeRDD)
// http://spark.apache.org/docs/latest/graphx-programming-guide.html
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
case class Peep(name: String, age: Int)
val nodeArray = Array(
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
(5L, Peep("Leslie", 45))
)
val edgeArray = Array(
Edge(2L, 1L, 7), Edge(2L, 4L, 2),
Edge(3L, 2L, 4), Edge(3L, 5L, 3),
Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)
val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)
val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
GraphX: Example – simple traversals
GraphX: Example – routing problems
cost
4
node
0
node
1
node
3
node
2
cost
3
cost
1
cost
2
cost
1
What is the cost to reach node 0 from any other
node in the graph? This is a common use case for
graph algorithms, e.g., Dijkstra
GraphX: code examples…
Let’s check
some code!
Graph Analytics
Graph Analytics: terminology
• many real-world problems are often
represented as graphs
• graphs can generally be converted into
sparse matrices (bridge to linear algebra)
• eigenvectors find the stable points in 

a system defined by matrices – which 

may be more efficient to compute
• beyond simpler graphs, complex data 

may require work with tensors
Suppose we have a graph as shown below:
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line
connecting two vertices
Graph Analytics: example
v
u
w
x
We can represent this kind of graph as an
adjacency matrix:
• label the rows and columns based 

on the vertices
• entries get a 1 if an edge connects the
corresponding vertices, or 0 otherwise
Graph Analytics: representation
v
u
w
x
u v w x
u 0 1 0 1
v 1 0 1 1
w 0 1 0 1
x 1 1 1 0
An adjacency matrix always has certain
properties:
• it is symmetric, i.e., A = AT
• it has real eigenvalues
Therefore algebraic graph theory bridges
between linear algebra and graph theory
Graph Analytics: algebraic graph theory
Sparse Matrix Collection… for when you really
need a wide variety of sparse matrix examples,
e.g., to evaluate new ML algorithms
University of Florida
Sparse Matrix Collection

cise.ufl.edu/
research/sparse/
matrices/
Graph Analytics: beauty in sparsity
Algebraic GraphTheory

Norman Biggs

Cambridge (1974)

amazon.com/dp/0521458978
Graph Analysis andVisualization

Richard Brath, David Jonker

Wiley (2015)

shop.oreilly.com/product/9781118845844.do
See also examples in: Just Enough Math
Graph Analytics: resources
Although tensor factorization is considered
problematic, it may provide more general case
solutions, and some work leverages Spark:
TheTensor Renaissance in Data Science

Anima Anandkumar @UC Irvine

radar.oreilly.com/2015/05/the-tensor-
renaissance-in-data-science.html
Spacey RandomWalks and Higher Order Markov Chains

David Gleich @Purdue

slideshare.net/dgleich/spacey-random-walks-
and-higher-order-markov-chains
Graph Analytics: tensor solutions emerging
Although tensor
problematic, it may provide more general case
solutions, and some work leverages Spark:
TheTensor Renaissance in Data Science
Anima Anandkumar
radar.oreilly.com/2015/05/the-tensor-
renaissance-in-data-science.html
Spacey RandomWalks and Higher Order Markov Chains
David Gleich
slideshare.net/dgleich/spacey-random-walks-
and-higher-order-markov-chains
Graph Analytics:
watch
this space
carefully
Data Preparation
Data Prep: Exsto Project Overview
https://github.com/ceteri/exsto/
• insights about dev communities, via data mining
their email forums
• works with any Apache project email archive
• applies NLP and ML techniques to analyze
message threads
• graph analytics surface themes and interactions
• results provide feedback for communities, e.g.,
leaderboards
Data Prep: Exsto Project Overview – four links
https://github.com/ceteri/exsto/
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
http://mail-archives.apache.org/mod_mbox/spark-user/
http://goo.gl/2YqJZK
Data Prep: Scraper pipeline
https://github.com/ceteri/exsto/
+
Data Prep: Scraper pipeline
Typical data rates, e.g., for dev@spark.apache.org:
• ~2K msgs/month
• ~18 MB/month parsed in JSON
Six months’ list activity represents a graph of:
• 1882 senders
• 1,762,113 nodes
• 3,232,174 edges
A large graph?! In any case, it satisfies definition of a 

graph-parallel system – lots of data locality to leverage
Data Prep: idealized ML workflow…
evaluationoptimizationrepresentationcirca 2010
ETL into
cluster/cloud
data
data
visualize,
reporting
Data
Prep
Features
Learners,
Parameters
Unsupervised
Learning
Explore
train set
test set
models
Evaluate
Optimize
Scoring
production
data
use
cases
data pipelines
actionable results
decisions, feedback
bar developers
foo algorithms
Data Prep: Microservices meet Parallel Processing
services
email
archives community
leaderboards
SparkSQL
Data Prep
Features
Explore
Scraper /
Parser
NLTK
data Unique
Word IDs
TextRank,
Word2Vec,
etc.
community
insights
not so big data… relatively big compute…
( we’ll come back to this point! )
Data Prep: Scraper pipeline
message
JSON
Py
filter
quoted
content
Apache
email list
archive
urllib2
crawl
monthly list
by date
Py
segment
paragraphs
Data Prep: Scraper pipeline
message
JSON
Py
filter
quoted
content
Apache
email list
archive
urllib2
crawl
monthly list
by date
Py
segment
paragraphs
{
"date": "2014-10-01T00:16:08+00:00",
"id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
"next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
"next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
"prev_thread": "",
"sender": "Debasish Das <debasish.da...@gmail.com>",
"subject": "Re: memory vs data_size",
"text": "nOnly fit the data in memory where you want to run the iterativenalgorithm....n
}
TextBlob
tag and
lemmatize
words
TextBlob
segment
sentences
TextBlob
sentiment
analysis
Py
generate
skip-grams
parsed
JSON
message
JSON Treebank,
WordNet
Data Prep: Parser pipeline
TextBlob
tag and
lemmatize
words
TextBlob
segment
sentences
TextBlob
sentiment
analysis
Py
generate
skip-grams
parsed
JSON
message
JSON Treebank,
WordNet
Data Prep: Parser pipeline
{
"graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],
"id": “CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
"polr": 0.2,
"sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",
"size": 14,
"subj": 0.7,
"tile": [ [1, 2], [2, 3], [3, 4] ... ]
]
}
{
"date": "2014-10-01T00:16:08+00:00",
"id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
"next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
"next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p
"prev_thread": "",
"sender": "Debasish Das <debasish.da...@gmail.com>",
"subject": "Re: memory vs data_size",
"text": "nOnly fit the data in memory where you want to run the iterativenalgorithm....nnFor
}
Data Prep: Example data
Example data from the Apache Spark email list 

is available as JSON on S3:
• https://s3-us-west-1.amazonaws.com/paco.dbfs.public/
exsto/original/2015_01.json
• https://s3-us-west-1.amazonaws.com/paco.dbfs.public/
exsto/parsed/2015_01.json
Data Prep: code examples…
Let’s check
some code!
WTF ML Frameworks?
services
email
archives
community
leaderboards
SparkSQL
Data Prep
Features
Explore
Scraper /
Parser
NLTK
data
Unique
Word IDs
TextRank,
Word2Vec,
etc.
com
in
WTF ML Frameworks? Microservices meet Parallel Processing
services
email
archives community
leaderboards
SparkSQL
Data Prep
Features
Explore
Scraper /
Parser
NLTK
data Unique
Word IDs
TextRank,
Word2Vec,
etc.
community
insights
not so big data… relatively big compute…
• Big Compute, not Big Data: Mb’s per month,
organized as millions of elements in a graph
• The required libraries (NLTK, etc.) are nearly

1000x larger than the data!
• This data does not change, does not need 

to be recomputed…
• Also: assigning unique IDs to entities during
NLP parsing … that doesn’t readily fit Spark’s
compute model (immutable data)
WTF ML Frameworks? Microservices meet Parallel Processing
Personal observation:
• There’s an unfortunate tendency within machine
learning frameworks to try to subsume all of the
data handling within the framework…
• In the case of Spark, app performance would
already be upside-down just by distributing
NLTK across all of the executors
• Also, installing NLTK data can be “interesting”
WTF ML Frameworks? Microservices meet Parallel Processing
Personal observation:
• There’s an unfortunate tendency within machine
learning frameworks to try to subsume all of the
data handling within the framework…
• In the case of Spark, app performance would
already be upside-down just by distributing
NLTK across all of the executors
• Also, installing
WTF ML Frameworks?
WTF

ML Frameworks

???
Keep in mind that “One Size Fits All” is an 

anti-pattern, especially for Big Data tools:
• consider provisioning cost vs. frequency 

of use
• serialization overhead in workflows
• be mindful of crafting the “working set” 

for memory resources
WTF ML Frameworks? Microservices meet Parallel Processing
WTF ML Frameworks? Microservices meet Parallel Processing
instead of OSFA… 

(yes, yes, Big Data `blasphemy`, sigh)
containerized
microservices
Flask
Redis
email
archives
SparkSQL
Data Prep
Features
Explore
Scraper /
Parser
NLTK
data Unique
Word IDs
Mesos / DCOS
Spark
executors
TextRank in Spark
TextRank: original paper
TextRank: Bringing Order intoTexts


Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural
Language Processing (July 2004)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/papers.html
http://www.cse.unt.edu/~tarau/
TextRank: other implementations
Jeff Kubina (Perl / English):
http://search.cpan.org/~kubina/Text-Categorize-
Textrank-0.51/lib/Text/Categorize/Textrank/En.pm
Paco Nathan (Hadoop / English+Spanish):
https://github.com/ceteri/textrank/
Karin Christiasen (Java / Icelandic):
https://github.com/karchr/icetextsum
TextRank: Spark-based pipeline
Spark
create
word graph
RDD
word
graph
NetworkX
visualize
graph
GraphX
run
TextRank
Spark
extract
phrases
ranked
phrases
parsed
JSON
TextRank: raw text input
TextRank: data results
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
compat
system
linear
constraint
1:
2:
3:
TextRank: dependencies
https://en.wikipedia.org/wiki/PageRank
TextRank: how it works
TextRank: code examples…
Let’s check
some code!
Social Graph
Social Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)
// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
if (a._2 > b._2) a else b
}
// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)
// connected components
val scc = g.stronglyConnectedComponents(10).vertices
node.join(scc).foreach(println)
Social Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen <so...@cloudera.com>))
(857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>))
(652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>))
(101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>))
(471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>))
(931,(8.217073486575732,shahab <shahab.mok...@gmail.com>))
(48,(7.653814912512137,ll <duy.huynh....@gmail.com>))
(1011,(7.602002681952157,Ashic Mahtab <as...@live.com>))
(1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>))
(122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>))
(904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>))
(827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>))
(887,(5.835053915864531,Davies Liu <dav...@databricks.com>))
(303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>))
(206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>))
(483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>))
(185,(5.259438927615685,SK <skrishna...@gmail.com>))
(636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>))
// seaaaaaaaaaan!
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
Social Graph: code examples…
Let’s check
some code!
Whither Next?
Misc., Etc., Maybe:
Feature learning withWord2Vec

Matt Krzus

www.yseam.com/blog/WV.html
ranked
phrases
GraphX
run
Con.Comp.
MLlib
run
Word2Vec
aggregated
by topic
MLlib
run
KMeans
topic
vectors
better than
LDA?
features… models… insights…
O’Reilly Studios, O’Reilly Learning:
O’Reilly Studios, O’Reilly Learning:
Embracing Jupyter Notebooks at O'Reilly
https://beta.oreilly.com/ideas/jupyter-at-oreilly
Andrew Odewahn, 2015-05-07
“O'Reilly Media is using our Atlas platform to 

make Jupyter Notebooks a first class authoring
environment for our publishing program.”
Jupyter, Thebe, Docker, Mesos, etc.
Thank you
presenter:
Just Enough Math
O’Reilly (2014)
justenoughmath.com

preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, 

events, conf summaries, etc.:
liber118.com/pxn/
Intro to Apache Spark

O’Reilly (2015)

shop.oreilly.com/product/
0636920036807.do

Contenu connexe

Tendances

Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.darach
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 

Tendances (20)

Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 

Similaire à Microservices, containers, and machine learning

GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with ScalaChetan Khatri
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache SparkQuantUniversity
 

Similaire à Microservices, containers, and machine learning (20)

GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Scaling Analytics with Apache Spark
Scaling Analytics with Apache SparkScaling Analytics with Apache Spark
Scaling Analytics with Apache Spark
 

Plus de Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 

Plus de Paco Nathan (12)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 

Dernier

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Dernier (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Microservices, containers, and machine learning

  • 1. Microservices, containers, and machine learning 2015-07-23 • PDX Paco Nathan, @pacoid
 O’Reilly Learning Licensed under a Creative Commons Attribution- NonCommercial-NoDerivatives 4.0 International License http://www.oscon.com/open-source-2015/public/schedule/detail/41579
  • 4. • generalized patterns
 unified engine for many use cases • lazy evaluation of the lineage graph
 reduces wait states, better pipelining • generational differences in hardware
 off-heap use of large memory spaces • functional programming / ease of use
 reduction in cost to maintain large apps • lower overhead for starting jobs • less expensive shuffles Spark Brief: Key Distinctions vs. MapReduce
  • 7. GraphX: spark.apache.org/docs/latest/graphx- programming-guide.html Key Points: • graph-parallel systems • emphasis on integrated workflows • optimizations
  • 8. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
 J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
 graphlab.org/files/osdi2012-gonzalez-low-gu-bickson- guestrin.pdf Pregel: Large-scale graph computing at Google
 Grzegorz Czajkowski, et al.
 googleresearch.blogspot.com/2009/06/large-scale-graph- computing-at-google.html GraphX: Graph Analytics in Spark
 Ankur Dave, Databricks
 spark-summit.org/east-2015/talk/graphx-graph- analytics-in-spark Topic modeling with LDA: MLlib meets GraphX
 Joseph Bradley, Databricks
 databricks.com/blog/2015/03/25/topic-modeling-with- lda-mllib-meets-graphx.html GraphX: Further Reading…
  • 9. GraphX: Compose Node + Edge RDDs into a Graph val nodeRDD: RDD[(Long, ND)] = sc.parallelize(…) val edgeRDD: RDD[Edge[ED]] = sc.parallelize(…) val g: Graph[ND, ED] = Graph(nodeRDD, edgeRDD)
  • 10. // http://spark.apache.org/docs/latest/graphx-programming-guide.html import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD case class Peep(name: String, age: Int) val nodeArray = Array( (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), (5L, Peep("Leslie", 45)) ) val edgeArray = Array( Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 5L, 3), Edge(4L, 1L, 1), Edge(5L, 3L, 9) ) val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray) val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD) val results = g.triplets.filter(t => t.attr > 7) for (triplet <- results.collect) { println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}") } GraphX: Example – simple traversals
  • 11. GraphX: Example – routing problems cost 4 node 0 node 1 node 3 node 2 cost 3 cost 1 cost 2 cost 1 What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Dijkstra
  • 14. Graph Analytics: terminology • many real-world problems are often represented as graphs • graphs can generally be converted into sparse matrices (bridge to linear algebra) • eigenvectors find the stable points in 
 a system defined by matrices – which 
 may be more efficient to compute • beyond simpler graphs, complex data 
 may require work with tensors
  • 15. Suppose we have a graph as shown below: We call x a vertex (sometimes called a node) An edge (sometimes called an arc) is any line connecting two vertices Graph Analytics: example v u w x
  • 16. We can represent this kind of graph as an adjacency matrix: • label the rows and columns based 
 on the vertices • entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise Graph Analytics: representation v u w x u v w x u 0 1 0 1 v 1 0 1 1 w 0 1 0 1 x 1 1 1 0
  • 17. An adjacency matrix always has certain properties: • it is symmetric, i.e., A = AT • it has real eigenvalues Therefore algebraic graph theory bridges between linear algebra and graph theory Graph Analytics: algebraic graph theory
  • 18. Sparse Matrix Collection… for when you really need a wide variety of sparse matrix examples, e.g., to evaluate new ML algorithms University of Florida Sparse Matrix Collection
 cise.ufl.edu/ research/sparse/ matrices/ Graph Analytics: beauty in sparsity
  • 19. Algebraic GraphTheory
 Norman Biggs
 Cambridge (1974)
 amazon.com/dp/0521458978 Graph Analysis andVisualization
 Richard Brath, David Jonker
 Wiley (2015)
 shop.oreilly.com/product/9781118845844.do See also examples in: Just Enough Math Graph Analytics: resources
  • 20. Although tensor factorization is considered problematic, it may provide more general case solutions, and some work leverages Spark: TheTensor Renaissance in Data Science
 Anima Anandkumar @UC Irvine
 radar.oreilly.com/2015/05/the-tensor- renaissance-in-data-science.html Spacey RandomWalks and Higher Order Markov Chains
 David Gleich @Purdue
 slideshare.net/dgleich/spacey-random-walks- and-higher-order-markov-chains Graph Analytics: tensor solutions emerging
  • 21. Although tensor problematic, it may provide more general case solutions, and some work leverages Spark: TheTensor Renaissance in Data Science Anima Anandkumar radar.oreilly.com/2015/05/the-tensor- renaissance-in-data-science.html Spacey RandomWalks and Higher Order Markov Chains David Gleich slideshare.net/dgleich/spacey-random-walks- and-higher-order-markov-chains Graph Analytics: watch this space carefully
  • 23. Data Prep: Exsto Project Overview https://github.com/ceteri/exsto/ • insights about dev communities, via data mining their email forums • works with any Apache project email archive • applies NLP and ML techniques to analyze message threads • graph analytics surface themes and interactions • results provide feedback for communities, e.g., leaderboards
  • 24. Data Prep: Exsto Project Overview – four links https://github.com/ceteri/exsto/ http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf http://mail-archives.apache.org/mod_mbox/spark-user/ http://goo.gl/2YqJZK
  • 25. Data Prep: Scraper pipeline https://github.com/ceteri/exsto/ +
  • 26. Data Prep: Scraper pipeline Typical data rates, e.g., for dev@spark.apache.org: • ~2K msgs/month • ~18 MB/month parsed in JSON Six months’ list activity represents a graph of: • 1882 senders • 1,762,113 nodes • 3,232,174 edges A large graph?! In any case, it satisfies definition of a 
 graph-parallel system – lots of data locality to leverage
  • 27. Data Prep: idealized ML workflow… evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms
  • 28. Data Prep: Microservices meet Parallel Processing services email archives community leaderboards SparkSQL Data Prep Features Explore Scraper / Parser NLTK data Unique Word IDs TextRank, Word2Vec, etc. community insights not so big data… relatively big compute… ( we’ll come back to this point! )
  • 29. Data Prep: Scraper pipeline message JSON Py filter quoted content Apache email list archive urllib2 crawl monthly list by date Py segment paragraphs
  • 30. Data Prep: Scraper pipeline message JSON Py filter quoted content Apache email list archive urllib2 crawl monthly list by date Py segment paragraphs { "date": "2014-10-01T00:16:08+00:00", "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw", "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg", "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ "prev_thread": "", "sender": "Debasish Das <debasish.da...@gmail.com>", "subject": "Re: memory vs data_size", "text": "nOnly fit the data in memory where you want to run the iterativenalgorithm....n }
  • 32. TextBlob tag and lemmatize words TextBlob segment sentences TextBlob sentiment analysis Py generate skip-grams parsed JSON message JSON Treebank, WordNet Data Prep: Parser pipeline { "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ], "id": “CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw", "polr": 0.2, "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7", "size": 14, "subj": 0.7, "tile": [ [1, 2], [2, 3], [3, 4] ... ] ] } { "date": "2014-10-01T00:16:08+00:00", "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw", "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg", "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p "prev_thread": "", "sender": "Debasish Das <debasish.da...@gmail.com>", "subject": "Re: memory vs data_size", "text": "nOnly fit the data in memory where you want to run the iterativenalgorithm....nnFor }
  • 33. Data Prep: Example data Example data from the Apache Spark email list 
 is available as JSON on S3: • https://s3-us-west-1.amazonaws.com/paco.dbfs.public/ exsto/original/2015_01.json • https://s3-us-west-1.amazonaws.com/paco.dbfs.public/ exsto/parsed/2015_01.json
  • 34. Data Prep: code examples… Let’s check some code!
  • 35. WTF ML Frameworks? services email archives community leaderboards SparkSQL Data Prep Features Explore Scraper / Parser NLTK data Unique Word IDs TextRank, Word2Vec, etc. com in
  • 36. WTF ML Frameworks? Microservices meet Parallel Processing services email archives community leaderboards SparkSQL Data Prep Features Explore Scraper / Parser NLTK data Unique Word IDs TextRank, Word2Vec, etc. community insights not so big data… relatively big compute…
  • 37. • Big Compute, not Big Data: Mb’s per month, organized as millions of elements in a graph • The required libraries (NLTK, etc.) are nearly
 1000x larger than the data! • This data does not change, does not need 
 to be recomputed… • Also: assigning unique IDs to entities during NLP parsing … that doesn’t readily fit Spark’s compute model (immutable data) WTF ML Frameworks? Microservices meet Parallel Processing
  • 38. Personal observation: • There’s an unfortunate tendency within machine learning frameworks to try to subsume all of the data handling within the framework… • In the case of Spark, app performance would already be upside-down just by distributing NLTK across all of the executors • Also, installing NLTK data can be “interesting” WTF ML Frameworks? Microservices meet Parallel Processing
  • 39. Personal observation: • There’s an unfortunate tendency within machine learning frameworks to try to subsume all of the data handling within the framework… • In the case of Spark, app performance would already be upside-down just by distributing NLTK across all of the executors • Also, installing WTF ML Frameworks? WTF
 ML Frameworks
 ???
  • 40. Keep in mind that “One Size Fits All” is an 
 anti-pattern, especially for Big Data tools: • consider provisioning cost vs. frequency 
 of use • serialization overhead in workflows • be mindful of crafting the “working set” 
 for memory resources WTF ML Frameworks? Microservices meet Parallel Processing
  • 41. WTF ML Frameworks? Microservices meet Parallel Processing instead of OSFA… 
 (yes, yes, Big Data `blasphemy`, sigh) containerized microservices Flask Redis email archives SparkSQL Data Prep Features Explore Scraper / Parser NLTK data Unique Word IDs Mesos / DCOS Spark executors
  • 43. TextRank: original paper TextRank: Bringing Order intoTexts 
 Rada Mihalcea, Paul Tarau Conference on Empirical Methods in Natural Language Processing (July 2004) https://goo.gl/AJnA76 http://web.eecs.umich.edu/~mihalcea/papers.html http://www.cse.unt.edu/~tarau/
  • 44. TextRank: other implementations Jeff Kubina (Perl / English): http://search.cpan.org/~kubina/Text-Categorize- Textrank-0.51/lib/Text/Categorize/Textrank/En.pm Paco Nathan (Hadoop / English+Spanish): https://github.com/ceteri/textrank/ Karin Christiasen (Java / Icelandic): https://github.com/karchr/icetextsum
  • 45. TextRank: Spark-based pipeline Spark create word graph RDD word graph NetworkX visualize graph GraphX run TextRank Spark extract phrases ranked phrases parsed JSON
  • 47. TextRank: data results "Compatibility of systems of linear constraints" [{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'}, {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'}, {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'}, {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'}, {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'}, {'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}] compat system linear constraint 1: 2: 3:
  • 52. Social Graph: use GraphX to run graph analytics // run graph analytics val g: Graph[String, Int] = Graph(nodes, edges) val r = g.pageRank(0.0001).vertices r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println) // define a reduce operation to compute the highest degree vertex def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = { if (a._2 > b._2) a else b } // compute the max degrees val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max) val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max) val maxDegrees: (VertexId, Int) = g.degrees.reduce(max) // connected components val scc = g.stronglyConnectedComponents(10).vertices node.join(scc).foreach(println)
  • 53. Social Graph: PageRank of top dev@spark email, 4Q2014 (389,(22.690229478710016,Sean Owen <so...@cloudera.com>)) (857,(20.832469059298248,Akhil Das <ak...@sigmoidanalytics.com>)) (652,(13.281821379806798,Michael Armbrust <mich...@databricks.com>)) (101,(9.963167550803664,Tobias Pfeiffer <...@preferred.jp>)) (471,(9.614436778460558,Steve Lewis <lordjoe2...@gmail.com>)) (931,(8.217073486575732,shahab <shahab.mok...@gmail.com>)) (48,(7.653814912512137,ll <duy.huynh....@gmail.com>)) (1011,(7.602002681952157,Ashic Mahtab <as...@live.com>)) (1055,(7.572376489758199,Cheng Lian <lian.cs....@gmail.com>)) (122,(6.87247388819558,Gerard Maas <gerard.m...@gmail.com>)) (904,(6.252657820614504,Xiangrui Meng <men...@gmail.com>)) (827,(6.0941062762076115,Jianshi Huang <jianshi.hu...@gmail.com>)) (887,(5.835053915864531,Davies Liu <dav...@databricks.com>)) (303,(5.724235650446037,Ted Yu <yuzhih...@gmail.com>)) (206,(5.430238461114108,Deep Pradhan <pradhandeep1...@gmail.com>)) (483,(5.332452537151523,Akshat Aranya <aara...@gmail.com>)) (185,(5.259438927615685,SK <skrishna...@gmail.com>)) (636,(5.235941228955769,Matei Zaharia <matei.zaha…@gmail.com>)) // seaaaaaaaaaan! maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126) maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170) maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
  • 54. Social Graph: code examples… Let’s check some code!
  • 56. Misc., Etc., Maybe: Feature learning withWord2Vec
 Matt Krzus
 www.yseam.com/blog/WV.html ranked phrases GraphX run Con.Comp. MLlib run Word2Vec aggregated by topic MLlib run KMeans topic vectors better than LDA? features… models… insights…
  • 58. O’Reilly Studios, O’Reilly Learning: Embracing Jupyter Notebooks at O'Reilly https://beta.oreilly.com/ideas/jupyter-at-oreilly Andrew Odewahn, 2015-05-07 “O'Reilly Media is using our Atlas platform to 
 make Jupyter Notebooks a first class authoring environment for our publishing program.” Jupyter, Thebe, Docker, Mesos, etc.
  • 60. presenter: Just Enough Math O’Reilly (2014) justenoughmath.com
 preview: youtu.be/TQ58cWgdCpA monthly newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/ Intro to Apache Spark
 O’Reilly (2015)
 shop.oreilly.com/product/ 0636920036807.do