Spark Runtime Internal
Nan Zhu (McGill University & Faimdata)
WHO AM I
• Nan Zhu, PhD Candidate in School of Computer Science of
McGill University
• Work on computer networks (Software Defined Networks)
and large-scale data processing
• Work with Prof. Wenbo He and Prof. Xue Liu
• PhD is an awesome experience in my life
• Tackle real world problems
• Keep thinking ! Get insights !
When will I graduate ?
WHO AM I
• Do-it-all Engineer in Faimdata (http://www.faimdata.com)
• Faimdata is a new startup located in Montreal
• Build customer-centric analytics solutions based on
Spark for retailers
• My responsibility
• Participate in everything related to data
• Akka, HBase, Hive, Kafka, Spark, etc.
WHO AM I
• My Contribution to Spark
• 0.8.1, 0.9.0, 0.9.1, 1.0.0
• 1,000+ lines of code, 30 patches
• Two examples:
• YARN-like architecture in Spark
• Introduce Actor Supervisor mechanism to
DAGScheduler
I’m CodingCat@GitHub !!!!
What is Spark?
• A distributed computing framework
• Organizes computation as concurrent tasks
• Schedules tasks onto multiple servers
• Handles fault tolerance, load balancing, etc.,
automatically and transparently
Advantages of Spark
• More Descriptive Computing Model
• Faster Processing Speed
• Unified Pipeline
More Descriptive Computing Model (1)
• WordCount in Hadoop (Map & Reduce)
Map function: reads each line of the input file and
transforms each word into a <word, 1> pair
Reduce function: collects the <word, 1> pairs generated
by the Map function and merges them by accumulation
Configure the program
DESCRIPTIVE COMPUTING MODEL (2)
• WordCount in Spark
Scala:
Java:
• Closer look at WordCount in Spark
Scala:
Organize computation into multiple stages in a processing pipeline:
• transformations to get the intermediate results with the expected schema
• an action to get the final output
Computation is expressed with more high-level
APIs, which simplify the logic of the original Map &
Reduce and define the computation as a
processing pipeline
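The slide's WordCount code is an image, so here is a plain-Scala stand-in (no Spark dependency) with the same pipeline shape: flatMap to words, map to (word, 1) pairs, then reduce by key. In Spark, `lines` would be an RDD from sc.textFile(...) and the last two steps would collapse into reduceByKey(_ + _).

```scala
val lines = Seq("to be or not", "to be")

val counts = lines
  .flatMap(_.split(" "))                           // transformation: line -> words
  .map(word => (word, 1))                          // transformation: word -> (word, 1)
  .groupBy(_._1)                                   // stand-in for Spark's shuffle
  .map { case (w, ps) => (w, ps.map(_._2).sum) }   // reduce by key
```

Contrast this with the Hadoop version: the whole job is one expression, with no Mapper/Reducer classes or driver configuration.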
MUCH BETTER PERFORMANCE
• PageRank Algorithm Performance Comparison
Matei Zaharia, et al, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing, NSDI 2012
[Chart: time per iteration (s), y-axis 0–180, comparing Hadoop, basic Spark, and Spark with controlled partitioning]
Unified pipeline
Diverse APIs, Operational Cost, etc.
• With a Single Spark Cluster
• Batch Processing: Spark Core
• Query: Shark & Spark SQL &
BlinkDB
• Streaming: Spark Streaming
• Machine Learning: MLlib
• Graph: GraphX
Understanding Distributed Computing Framework
Understand a distributed computing framework
• DataFlow
• e.g. Hadoop family utilizes HDFS to transfer data within
a job and share data across jobs/applications
[Diagram: three nodes, each running a MapTask alongside an HDFS daemon; data flows between tasks through HDFS]
Understanding a distributed computing framework
• Task Management
• How the computation is executed across multiple
servers
• How the tasks are scheduled
• How the resources are allocated
Spark Data Abstraction Model
Basic Structure of a Spark Program
• A Spark Program

val sc = new SparkContext(…)

val points = sc.textFile("hdfs://...")
  .map(_.split(' ').map(_.toDouble).splitAt(1))
  .map { case (Array(label), features) =>
    LabeledPoint(label, features)
  }

val model = Model.train(points)

The SparkContext includes the components driving the
running of computing tasks (introduced later)
sc.textFile loads data from HDFS, forming an RDD
(Resilient Distributed Dataset) object
The map calls are transformations that generate RDDs
with the expected elements/format
All Computations are around RDDs
Resilient Distributed Dataset
• An RDD is a distributed memory abstraction which is
• a data collection
• immutable
• created either by loading from a stable storage system (e.g. HDFS) or
through transformations on other RDD(s)
• partitioned and distributed
[Figure: sc.textFile(…) → filter() → map(), each transformation producing a new RDD]
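The textFile → filter → map chain above can be mimicked on plain Scala collections (no Spark): each transformation returns a new, immutable collection and leaves its parent untouched, which is the same contract RDD transformations obey.

```scala
// Plain-Scala stand-in for an RDD transformation chain; the log lines and
// prefixes are made-up illustration data.
val lines    = Vector("ERROR disk full", "INFO all good", "ERROR timeout")
val errors   = lines.filter(_.startsWith("ERROR"))    // a new collection
val messages = errors.map(_.stripPrefix("ERROR "))    // another new collection
// `lines` and `errors` are unchanged: immutability is what lets Spark
// recompute any of these safely.
```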
From Data to Computation
• Lineage
• Where do I come from? (dependency)
• How do I come from? (save the functions
calculating the partitions)
• Computation is organized as a DAG (the lineage)
• Lost data can be recovered in parallel with the
help of the lineage DAG
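The two lineage questions above ("where do I come from" and "how do I come from") can be sketched in a few lines of plain Scala. This is illustrative only, not Spark's API: each derived dataset records its parent and the function that produced it, so a lost result is recomputed by replaying the function rather than restored from a replica.

```scala
// Toy one-edge lineage: parent data plus the function that derives the child.
case class Derived[A, B](parent: Seq[A], how: Seq[A] => Seq[B]) {
  def compute: Seq[B] = how(parent)
}

val base    = Seq(1, 2, 3, 4)
val doubled = Derived(base, (xs: Seq[Int]) => xs.map(_ * 2))

val result = doubled.compute     // materialized once
// ...suppose `result` is lost: replay the recorded function on the parent
val recovered = doubled.compute
```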
Cache
• Frequently accessed RDDs can be materialized and
cached in memory
• Cached RDD can also be replicated for fault tolerance
(Spark scheduler takes cached data locality into account)
• Manage the cache space with an LRU algorithm
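The LRU policy mentioned above can be demonstrated with an access-ordered java.util.LinkedHashMap. This is a toy illustration of the eviction policy only (the class name and keys are invented, and Spark's real cache manager works on RDD partitions, not a hash map):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Tiny access-ordered LRU cache: when capacity is exceeded, the least
// recently used entry is evicted.
class LruCache[K, V](capacity: Int) {
  private val map = new JLinkedHashMap[K, V](capacity, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > capacity
  }
  def put(key: K, value: V): Unit = map.put(key, value)
  def get(key: K): Option[V] = Option(map.get(key))
}

val cache = new LruCache[String, Int](2)
cache.put("rdd-1", 1)
cache.put("rdd-2", 2)
cache.get("rdd-1")     // touch rdd-1: it becomes the most recently used
cache.put("rdd-3", 3)  // over capacity: evicts rdd-2, the least recently used
```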
Benefits Brought by the Cache
• Example (Log Mining)
count is an action; the first time it runs, it has to
compute from the start of the DAG (textFile)
Because the data is cached, the second count does not
trigger a "start-from-zero" computation; instead, it
reads from "cachedMsgs" directly
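The log-mining code itself is an image in the deck, but what caching buys can be shown in plain Scala by counting how often the "expensive" part of the pipeline actually runs (the counter and log lines are invented for illustration):

```scala
var runs = 0
val lines = Seq("ERROR disk full", "INFO all good", "ERROR timeout")
// Stand-in for the filter stage of the lineage; `runs` counts recomputations.
def errors(): Seq[String] = { runs += 1; lines.filter(_.startsWith("ERROR")) }

// Without a cache, every action re-runs the pipeline from the start:
val first  = errors().size   // recomputed
val second = errors().size   // recomputed again

// With a cache, the intermediate result is materialized once and reused:
val cachedMsgs = errors()    // computed one last time, then kept
val third  = cachedMsgs.size // no recomputation
val fourth = cachedMsgs.size // no recomputation
```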
Summary
• Resilient Distributed Datasets (RDD)
• Distributed memory abstraction in Spark
• Keep computation running in memory with best effort
• Keep track of the “lineage” of data
• Organize computation
• Support fault-tolerance
• Cache
RDD brings much better performance by
simplifying the data flow
• Share Data among Applications
• In a typical data processing pipeline, data is shared
across applications through stable storage, which adds
overhead at each handoff
RDD brings much better performance by
simplifying the data flow
• Share Data in Iterative Algorithms
• A certain amount of predictive/machine learning
algorithms are iterative
• e.g. K-Means
Step 1: Place random initial group centroids into the space.
Step 2: Assign each object to the group that has the closest
centroid.
Step 3: Recalculate the positions of the centroids.
Step 4: If the positions of the centroids didn't change, go to
the next step; else go back to Step 2.
Step 5: End.
With Hadoop, every iteration reads its input from HDFS and
writes its output back (Assign group: Step 2; Recalculate
group: Steps 3 & 4; Output: Step 5); with RDDs, the working
set stays in memory across iterations.
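The K-Means steps above can be sketched as a minimal 1-D loop in plain Scala (the data points and centroids are invented, and a fixed iteration count replaces the convergence test of Step 4 for brevity). The point for Spark: `points` is re-read on every iteration, so it is exactly the dataset a Spark program would cache as an RDD instead of re-reading from HDFS.

```scala
val points    = Seq(1.0, 1.2, 0.8, 8.0, 8.2, 7.8)  // two obvious clusters
var centroids = Seq(0.0, 10.0)                     // Step 1: initial centroids

for (_ <- 1 to 10) {                               // Steps 2-4 (fixed count)
  // Step 2: assign each point to its closest centroid
  val groups = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
  // Step 3: recalculate each centroid as the mean of its group
  centroids = centroids.map(c =>
    groups.get(c).map(g => g.sum / g.size).getOrElse(c))
}
// Step 5: centroids settle near the two cluster means (~1.0 and ~8.0)
```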
Spark Scheduler
The Structure of the Spark Cluster
• Driver Program: holds the SparkContext; each SparkContext
creates a Spark application; the schedulers (DAGScheduler,
TaskScheduler, ClusterScheduler) live in the driver, which
schedules tasks for the application
• Cluster Manager: the application is submitted to the
Cluster Manager, which can be the master of Spark's
standalone mode, Mesos, or YARN
• Workers: Executors (each with a cache) are started for the
application on the Workers; Executors register with the
ClusterScheduler; tasks run inside the Executors
Scheduling Process
[DAG figure: RDDs A–G connected by map, union, groupBy, and
join, split into Stage 1, Stage 2, and Stage 3]
• RDD objects are connected together in a DAG
• The DAGScheduler splits the DAG into stages and submits
each stage as a TaskSet
• The TaskScheduler's TaskSetManagers monitor the progress of
tasks and handle failed stages (failed stages are resubmitted)
• The ClusterScheduler submits tasks to the Executors
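The stage-splitting idea can be sketched in a few lines (illustrative names and structure, not Spark's code): walk the lineage of the final RDD and cut a stage boundary at every wide (shuffle) dependency, while narrow dependencies stay inside one stage. The DAG below mirrors the slide's figure, assuming A→B is a groupBy, C→D a map, (D,E)→F a union, and (B,F)→G a join.

```scala
case class Dep(parent: String, wide: Boolean)

// groupBy and join are wide (shuffle) dependencies; map and union are narrow.
val deps: Map[String, List[Dep]] = Map(
  "B" -> List(Dep("A", wide = true)),                       // A -groupBy-> B
  "D" -> List(Dep("C", wide = false)),                      // C -map-> D
  "F" -> List(Dep("D", wide = false), Dep("E", wide = false)), // union
  "G" -> List(Dep("B", wide = true), Dep("F", wide = true)) // join
)

// RDDs computed inside the stage that ends at `rdd`: follow narrow deps only;
// wide deps mark the boundary with a parent stage.
def stage(rdd: String): Set[String] =
  Set(rdd) ++ deps.getOrElse(rdd, Nil)
    .collect { case Dep(p, false) => p }
    .flatMap(stage)
```

With this cut, the stage ending at G contains only G (Stage 3), while its wide parents B and F head the stages matching the slide's Stage 1 and Stage 2.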
Scheduling Optimization
[DAG figure: RDDs A–G connected by map, union, groupBy, and
join, split into Stage 1, Stage 2, and Stage 3]
• Within-stage optimization: pipeline the generation of RDD
partitions when they are in a narrow dependency
• Partitioning-based join optimization: avoid a whole shuffle
with best effort
• Cache-awareness to avoid duplicate computation
Summary
• No centralized application scheduler
• Maximize Throughput
• Application specific schedulers (DAGScheduler, TaskScheduler,
ClusterScheduler) are initialized within SparkContext
• Scheduling Abstraction (DAG, TaskSet, Task)
• Support fault-tolerance, pipelining, auto-recovery, etc.
• Scheduling Optimization
• Pipelining, join, caching
We are hiring !
http://www.faimdata.com
nanzhu@faimdata.com
jobs@faimdata.com
Thank you!
Q & A
Credits to my friend LianCheng@Databricks; his slides inspired
me a lot

 

Similaire à BDM25 - Spark runtime internal (20)

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Spark core
Spark coreSpark core
Spark core
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Apache spark
Apache sparkApache spark
Apache spark
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 

Dernier

Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 

Dernier (20)

Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 

BDM25 - Spark runtime internal

  • 1. Runtime Internal Nan Zhu (McGill University & Faimdata)
  • 3. WHO AM I • Nan Zhu, PhD Candidate in School of Computer Science of McGill University • Work on computer networks (Software Defined Networks) and large-scale data processing • Work with Prof. Wenbo He and Prof. Xue Liu • PhD is an awesome experience in my life • Tackle real world problems • Keep thinking ! Get insights !
  • 4. WHO AM I • Nan Zhu, PhD Candidate in School of Computer Science of McGill University • Work on computer networks (Software Defined Networks) and large-scale data processing • Work with Prof. Wenbo He and Prof. Xue Liu • PhD is an awesome experience in my life • Tackle real world problems • Keep thinking ! Get insights ! When will I graduate ?
  • 5. WHO AM I • Do-it-all Engineer in Faimdata (http://www.faimdata.com) • Faimdata is a new startup located in Montreal • Build Customer-centric analysis solution based on Spark for retailers • My responsibility • Participate in everything related to data • Akka, HBase, Hive, Kafka, Spark, etc.
  • 6. WHO AM I • My Contribution to Spark • 0.8.1, 0.9.0, 0.9.1, 1.0.0 • 1000+ lines of code, 30 patches • Two examples: • YARN-like architecture in Spark • Introduce Actor Supervisor mechanism to DAGScheduler
  • 7. WHO AM I • My Contribution to Spark • 0.8.1, 0.9.0, 0.9.1, 1.0.0 • 1000+ lines of code, 30 patches • Two examples: • YARN-like architecture in Spark • Introduce Actor Supervisor mechanism to DAGScheduler I’m CodingCat@GitHub !!!!
  • 8. WHO AM I • My Contribution to Spark • 0.8.1, 0.9.0, 0.9.1, 1.0.0 • 1000+ lines of code, 30 patches • Two examples: • YARN-like architecture in Spark • Introduce Actor Supervisor mechanism to DAGScheduler I’m CodingCat@GitHub !!!!
  • 10. What is Spark? • A distributed computing framework • Organize computation as concurrent tasks • Schedule tasks to multiple servers • Handle fault tolerance, load balancing, etc., automatically and transparently
  • 11. Advantages of Spark • More Descriptive Computing Model • Faster Processing Speed • Unified Pipeline
  • 12. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce)
  • 13. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce)
  • 14. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair
  • 15. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair
  • 16. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair
  • 17. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair Reduce function: collect the <word, 1> pairs generated by the Map function and merge them by accumulation
  • 18. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair Reduce function: collect the <word, 1> pairs generated by the Map function and merge them by accumulation
  • 19. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair Reduce function: collect the <word, 1> pairs generated by the Map function and merge them by accumulation
  • 20. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair Reduce function: collect the <word, 1> pairs generated by the Map function and merge them by accumulation Configure the program
  • 21. More Descriptive Computing Model (1) • WordCount in Hadoop (Map & Reduce) Map function: read each line of the input file and transform each word into a <word, 1> pair Reduce function: collect the <word, 1> pairs generated by the Map function and merge them by accumulation Configure the program
  • 22. DESCRIPTIVE COMPUTING MODEL (2) • WordCount in Spark Scala: Java:
  • 23. DESCRIPTIVE COMPUTING MODEL (2) • Closer look at WordCount in Spark Scala: Organize Computation into Multiple Stages in a Processing Pipeline: transformations to get intermediate results with the expected schema; actions to get the final output Computation is expressed with higher-level APIs, which simplify the logic of the original Map & Reduce and define the computation as a processing pipeline
  • 24. DESCRIPTIVE COMPUTING MODEL (2) • Closer look at WordCount in Spark Scala: Organize Computation into Multiple Stages in a Processing Pipeline: transformations to get intermediate results with the expected schema; actions to get the final output Transformation Computation is expressed with higher-level APIs, which simplify the logic of the original Map & Reduce and define the computation as a processing pipeline
  • 25. DESCRIPTIVE COMPUTING MODEL (2) • Closer look at WordCount in Spark Scala: Organize Computation into Multiple Stages in a Processing Pipeline: transformations to get intermediate results with the expected schema; actions to get the final output Transformation Action Computation is expressed with higher-level APIs, which simplify the logic of the original Map & Reduce and define the computation as a processing pipeline
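The flatMap → map → reduceByKey pipeline on these slides can be sketched in plain Python (no Spark required) to show how the <word, 1> pairs flow; `word_count` here is our illustrative helper, not a Spark or Hadoop API:

```python
from collections import defaultdict

def word_count(lines):
    """Mimic flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(_ + _)."""
    # "Map" phase: emit a <word, 1> pair for every word in every line.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # "Reduce" phase: merge the pairs for the same word by accumulation.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The same logic that takes a page of Hadoop boilerplate collapses into a short transformation chain, which is the point the slides make.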
  • 26. MUCH BETTER PERFORMANCE • PageRank Algorithm Performance Comparison (time per iteration, in seconds: Hadoop vs. basic Spark vs. Spark with controlled partitioning) Matei Zaharia, et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012
  • 27. Unified pipeline Diverse APIs, Operational Cost, etc.
  • 29. Unified pipeline • With a Single Spark Cluster • Batching Processing: Spark Core • Query: Shark & Spark SQL & BlinkDB • Streaming: Spark Streaming • Machine Learning: MLlib • Graph: GraphX
  • 31. Understand a distributed computing framework • DataFlow • e.g. Hadoop family utilizes HDFS to transfer data within a job and share data across jobs/applications HDFS Daemon MapTask HDFS Daemon MapTask HDFS Daemon MapTask
  • 32. Understand a distributed computing framework • DataFlow • e.g. Hadoop family utilizes HDFS to transfer data within a job and share data across jobs/applications
  • 33. Understand a distributed computing framework • DataFlow • e.g. Hadoop family utilizes HDFS to transfer data within a job and share data across jobs/applications
  • 34. Understanding a distributed computing engine • Task Management • How the computation is executed within multiple servers • How the tasks are scheduled • How the resources are allocated
  • 36. Basic Structure of Spark program • A Spark Program val sc = new SparkContext(…) val points = sc.textFile("hdfs://...") .map(_.split(" ").map(_.toDouble).splitAt(1)) .map { case (Array(label), features) => LabeledPoint(label, features) } val model = Model.train(points)
  • 37. Basic Structure of Spark program • A Spark Program val sc = new SparkContext(…) val points = sc.textFile("hdfs://...") .map(_.split(" ").map(_.toDouble).splitAt(1)) .map { case (Array(label), features) => LabeledPoint(label, features) } val model = Model.train(points) Includes the components that drive the running of computing tasks (will introduce later)
  • 38. Basic Structure of Spark program • A Spark Program val sc = new SparkContext(…) val points = sc.textFile("hdfs://...") .map(_.split(" ").map(_.toDouble).splitAt(1)) .map { case (Array(label), features) => LabeledPoint(label, features) } val model = Model.train(points) Includes the components that drive the running of computing tasks (will introduce later) Load data from HDFS, forming an RDD (Resilient Distributed Datasets) object
  • 39. Basic Structure of Spark program • A Spark Program val sc = new SparkContext(…) val points = sc.textFile("hdfs://...") .map(_.split(" ").map(_.toDouble).splitAt(1)) .map { case (Array(label), features) => LabeledPoint(label, features) } val model = Model.train(points) Includes the components that drive the running of computing tasks (will introduce later) Load data from HDFS, forming an RDD (Resilient Distributed Datasets) object Transformations to generate RDDs with the expected element format
  • 40. Basic Structure of Spark program • A Spark Program val sc = new SparkContext(…) val points = sc.textFile("hdfs://...") .map(_.split(" ").map(_.toDouble).splitAt(1)) .map { case (Array(label), features) => LabeledPoint(label, features) } val model = Model.train(points) Includes the components that drive the running of computing tasks (will introduce later) Load data from HDFS, forming an RDD (Resilient Distributed Datasets) object Transformations to generate RDDs with the expected element format All computations are organized around RDDs
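The `splitAt(1)` step in the snippet above (split the parsed array of doubles into the label and the feature vector) can be sketched in plain Python; `parse_point` is an illustrative stand-in for the map closure, not an MLlib API:

```python
def parse_point(line):
    # Equivalent of _.split(" ").map(_.toDouble).splitAt(1) followed by
    # the pattern match { case (Array(label), features) => ... }.
    values = [float(x) for x in line.split()]
    label, features = values[0], values[1:]
    return (label, features)

print(parse_point("1.0 2.5 3.5"))
# → (1.0, [2.5, 3.5])
```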
  • 41. Resilient Distributed Dataset • RDD is a distributed memory abstraction which is • a data collection • immutable • created either by loading from a stable storage system (e.g. HDFS) or through transformations on other RDD(s) • partitioned and distributed sc.textFile(…) filter() map()
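Those properties can be sketched with a toy class; `ToyRDD` below is our own illustration, not Spark code. The key behavior is that a transformation returns a new partitioned dataset and never mutates its parent:

```python
class ToyRDD:
    """A toy stand-in for an RDD: an immutable, partitioned data collection."""
    def __init__(self, partitions):
        self.partitions = [tuple(p) for p in partitions]  # frozen partitions

    def map(self, f):
        # A transformation builds a *new* dataset; the parent stays untouched.
        return ToyRDD([[f(x) for x in p] for p in self.partitions])

    def collect(self):
        return [x for p in self.partitions for x in p]

base = ToyRDD([[1, 2], [3, 4]])           # "loaded" as two partitions
doubled = base.map(lambda x: x * 2)       # derived through a transformation
print(base.collect(), doubled.collect())  # → [1, 2, 3, 4] [2, 4, 6, 8]
```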
  • 42. From data to computation • Lineage
  • 43. From data to computation • Lineage
  • 44. From data to computation • Lineage Where do I come from? (dependency)
  • 45. From data to computation • Lineage Where do I come from? (dependency) How am I computed? (save the functions that calculate the partitions)
  • 46. From data to computation • Lineage Where do I come from? (dependency) How am I computed? (save the functions that calculate the partitions) Computation is organized as a DAG (Lineage)
  • 47. From data to computation • Lineage Where do I come from? (dependency) How am I computed? (save the functions that calculate the partitions) Computation is organized as a DAG (Lineage) Lost data can be recovered in parallel with the help of the lineage DAG
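The recovery idea can be made concrete: each dataset remembers its parent ("where do I come from?") and the producing function ("how am I computed?"), so a lost partition is recomputed rather than restored from a replica. `LineageNode` is a toy illustration, not Spark's implementation:

```python
class LineageNode:
    """Remember where data came from (parent) and how (fn)."""
    def __init__(self, partitions=None, parent=None, fn=None):
        self.cached = partitions      # None until (re)computed
        self.parent = parent          # dependency in the lineage DAG
        self.fn = fn                  # function that produces our partitions

    def compute(self):
        if self.cached is None:       # lost or never materialized:
            # walk the lineage and re-apply the transformation per partition
            self.cached = [[self.fn(x) for x in p]
                           for p in self.parent.compute()]
        return self.cached

source = LineageNode(partitions=[[1, 2], [3, 4]])
derived = LineageNode(parent=source, fn=lambda x: x + 10)
assert derived.compute() == [[11, 12], [13, 14]]
derived.cached = None                 # simulate losing the partitions
print(derived.compute())              # recovered from lineage
# → [[11, 12], [13, 14]]
```

Since each partition is recomputed independently, the recovery can run in parallel, as the slide notes.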
  • 48. Cache • Frequently accessed RDDs can be materialized and cached in memory • Cached RDDs can also be replicated for fault tolerance (the Spark scheduler takes cached-data locality into account) • Manage the cache space with an LRU algorithm
  • 49. Benefits Brought by Cache • Example (Log Mining)
  • 50. Benefits Brought by Cache • Example (Log Mining) Count is an action; the first time, it has to calculate from the start of the DAG (textFile)
  • 51. Benefits Brought by Cache • Example (Log Mining) Count is an action; the first time, it has to calculate from the start of the DAG (textFile) Because the data is cached, the second count does not trigger a “start-from-zero” computation; instead, it works on “cachedMsgs” directly
  • 52. Summary • Resilient Distributed Datasets (RDD) • Distributed memory abstraction in Spark • Keep computation in memory on a best-effort basis • Keep track of the “lineage” of data • Organize computation • Support fault-tolerance • Cache
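The cache benefit from the log-mining example can be sketched by counting recomputations; `CountingDataset` is a toy, not Spark's cache manager:

```python
recomputations = 0

def expensive_filter(lines):
    # Stands in for recomputing from the start of the DAG (textFile + filter).
    global recomputations
    recomputations += 1
    return [l for l in lines if "ERROR" in l]

class CountingDataset:
    def __init__(self, lines):
        self.lines, self.cached = lines, None

    def count_errors(self, use_cache):
        if use_cache:
            if self.cached is None:              # first action computes...
                self.cached = expensive_filter(self.lines)
            return len(self.cached)              # ...later actions reuse it
        return len(expensive_filter(self.lines))

logs = CountingDataset(["ERROR a", "INFO b", "ERROR c"])
logs.count_errors(use_cache=True)
logs.count_errors(use_cache=True)
print(recomputations)  # → 1 (the second count reused the cached data)
```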
  • 53. RDD brings much better performance by simplifying the data flow • Share Data among Applications A typical data processing pipeline
  • 54. RDD brings much better performance by simplifying the data flow • Share Data among Applications A typical data processing pipeline Overhead Overhead
  • 55. RDD brings much better performance by simplifying the data flow • Share Data among Applications A typical data processing pipeline Overhead Overhead
  • 56. RDD brings much better performance by simplifying the data flow • Share Data among Applications A typical data processing pipeline
  • 57. RDD brings much better performance by simplifying the data flow • Share Data among Applications A typical data processing pipeline
  • 58. RDD brings much better performance by simplifying the data flow • Share data in Iterative Algorithms • Many predictive/machine learning algorithms are iterative • e.g. K-Means Step 1: Randomly place the initial group centroids in the space. Step 2: Assign each object to the group that has the closest centroid. Step 3: Recalculate the positions of the centroids. Step 4: If the positions of the centroids didn't change, go to the next step; else go to Step 2. Step 5: End.
  • 59. RDD brings much better performance by simplifying the data flow • Share data in Iterative Algorithms • Many predictive/machine learning algorithms are iterative • e.g. K-Means
  • 60. RDD brings much better performance by simplifying the data flow • Share data in Iterative Algorithms • Many predictive/machine learning algorithms are iterative • e.g. K-Means Assign group (Step 2) Recalculate Group (Step 3 & 4) Output (Step 5) HDFS Read Write Write
  • 61. RDD brings much better performance by simplifying the data flow • Share data in Iterative Algorithms • Many predictive/machine learning algorithms are iterative • e.g. K-Means
  • 62. RDD brings much better performance by simplifying the data flow • Share data in Iterative Algorithms • Many predictive/machine learning algorithms are iterative • e.g. K-Means Write Assign group (Step 2) Recalculate Group (Step 3 & 4) Output (Step 5) HDFS
  • 63. RDD brings much better performance by simplifying the data flow • Share data in Iterative Algorithms • Many predictive/machine learning algorithms are iterative • e.g. K-Means
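The iterative steps above can be sketched in one dimension; the point of the slides is that `points` stays in memory across iterations instead of being written to and re-read from HDFS on every pass. This is a plain-Python illustration, not MLlib:

```python
def kmeans_1d(points, centroids, max_iters=20):
    """Steps 2-4: assign, recalculate, repeat until centroids stop moving."""
    for _ in range(max_iters):
        # Step 2: assign each point to the group with the closest centroid.
        groups = {c: [] for c in centroids}
        for p in points:
            closest = min(centroids, key=lambda c: abs(c - p))
            groups[closest].append(p)
        # Step 3: recalculate the positions of the centroids.
        new_centroids = [sum(g) / len(g) if g else c for c, g in groups.items()]
        # Step 4: stop when the positions no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)

print(kmeans_1d([1.0, 2.0, 9.0, 10.0], centroids=[1.0, 9.0]))
# → [1.5, 9.5]
```

In Hadoop each loop iteration is a separate MapReduce job with an HDFS round trip; with a cached RDD the loop body reuses the in-memory dataset.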
  • 65. Worker Worker The Structure of Spark Cluster Driver Program Spark Context Cluster Manager DAG Sche Task Sche Cluster Sche
  • 66. Worker Worker The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application DAG Sche Task Sche Cluster Sche
  • 67. Worker Worker The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application DAG Sche Task Sche Cluster Sche
  • 68. Worker Worker The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application Submit Application to Cluster Manager DAG Sche Task Sche Cluster Sche
  • 69. Worker Worker The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application Submit Application to Cluster Manager The Cluster Manager can be the master of standalone mode in Spark, Mesos and YARN DAG Sche Task Sche Cluster Sche
  • 70. Worker Worker Executor Cache Executor Cache The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application Submit Application to Cluster Manager The Cluster Manager can be the master of standalone mode in Spark, Mesos and YARN DAG Sche Task Sche Cluster Sche
  • 71. Worker Worker Executor Cache Executor Cache The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application Submit Application to Cluster Manager The Cluster Manager can be the master of standalone mode in Spark, Mesos and YARN Start Executors for the application in Workers; Executors register with the ClusterScheduler; DAG Sche Task Sche Cluster Sche
  • 72. Worker Worker Executor Cache Executor Cache The Structure of Spark Cluster Driver Program Spark Context Cluster Manager Each SparkContext creates a Spark application Submit Application to Cluster Manager The Cluster Manager can be the master of standalone mode in Spark, Mesos and YARN Start Executors for the application in Workers; Executors register with the ClusterScheduler; DAG Sche Task Sche Cluster Sche Driver program schedules tasks for the application Task Task Task Task
  • 73. Scheduling Process join union groupBy map Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: G: Split DAG DAGScheduler RDD objects are connected together in a DAG Submit each stage as a TaskSet TaskScheduler TaskSetManagers (monitor the progress of tasks and handle failed stages) Failed Stages ClusterScheduler Submit Tasks to Executors
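The DAGScheduler's "Split DAG" step can be sketched with a toy rule: walk the operator chain and cut a new stage at every wide (shuffle) dependency, keeping narrow operators in the same stage. `split_stages` is our illustration of the idea, not the DAGScheduler API:

```python
def split_stages(ops):
    """ops: list of (name, is_wide). A wide dependency (shuffle) cuts a stage."""
    stages, current = [], []
    for name, is_wide in ops:
        if is_wide and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

pipeline = [("map", False), ("groupBy", True), ("join", True), ("union", False)]
print(split_stages(pipeline))
# → [['map'], ['groupBy'], ['join', 'union']]
```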
• 75–83. Scheduling Optimization
[Diagram: the same RDD DAG, annotated per optimization]
• Within-stage optimization: the generation of RDD partitions is pipelined when they are in a narrow dependency
• Partitioning-based join optimization: a full shuffle is avoided on a best-effort basis
• Cache-awareness avoids duplicate computation
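The last two optimizations can be demonstrated together: pre-partitioning both sides of a join with the same `HashPartitioner` gives the join narrow dependencies, so no extra shuffle is needed, and `cache()` keeps the partitioned RDD in memory so later jobs skip recomputation. A sketch with made-up data:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

// Sketch of the partitioning-based join optimization: when both sides of a
// join share the same Partitioner, the join runs without re-shuffling either
// side; cache() keeps the pre-partitioned left side in memory for reuse.
object JoinSketch {
  def coPartitionedJoin(sc: SparkContext): Array[(Int, (String, String))] = {
    val p     = new HashPartitioner(4)
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p)
    left.join(right).collect() // co-partitioned: joined with narrow dependencies
  }
}
```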
• 84. Summary
• No centralized application scheduler; this maximizes throughput
• Application-specific schedulers (DAGScheduler, TaskScheduler, ClusterScheduler) are initialized within the SparkContext
• Scheduling abstractions: DAG, TaskSet, Task
• Support for fault tolerance, pipelining, auto-recovery, etc.
• Scheduling optimizations: pipelining, join, caching
• 85. We are hiring! http://www.faimdata.com nanzhu@faimdata.com jobs@faimdata.com
• 86. Thank you! Q & A. Credits to my friend LianCheng@Databricks; his slides inspired me a lot