SPARK - NEW KID ON THE BLOCK
ABOUT ME…
• I designed Bamboo (HP's Big Data Analytics Platform)
• I write software (mostly in Scala, but leaning towards Haskell recently…)
• I like translating sequential algorithms to parallel ones, mostly using CUDA / OpenCL; embedded assembly is an EVIL thing.
• I wrote 2 books:
• OpenCL Parallel Programming Development Cookbook
• Developing an Akka Edge
WHAT'S COVERED TODAY?
• What's Apache Spark?
• What's an RDD? How can I understand it?
• What's Spark SQL?
• What's Spark Streaming?
• References
WHAT'S APACHE SPARK
• For a beginner's guide, you can refer to Tsai Li Ming's talk.
• The API model abstracts:
• how to extract data from 3rd-party software (via JDBC, Cassandra, HBase)
• how to extract and compute over data (via GraphX, MLlib, Spark SQL)
• how to store data (data connectors to "local", "hdfs", "s3")
RESILIENT DISTRIBUTED DATASETS
• Apache Spark works on data broken into chunks.
• These chunks are called RDDs.
• RDDs are chained into a lineage graph => a graph that identifies the relationships between them (see the sketch below).
• RDDs can be queried, grouped, and transformed at anywhere from a coarse-grained to a fine-grained level.
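A minimal sketch of inspecting that lineage (assuming a live SparkContext `sc`): every RDD can print the graph Spark keeps for it.

// toDebugString prints the lineage graph behind an RDD
val doubled = sc.parallelize(1 to 4).map(_ * 2)
println(doubled.toDebugString)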
• An RDD has a lifecycle:
• reification
• lazy compute / lazy re-compute
• destruction
• An RDD's lifecycle is managed by the system unless…
• a program commands the RDD to persist() or unpersist(), which affects the lazy (re-)computation.
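A minimal sketch of taking over that lifecycle (assuming a live SparkContext `sc`): persist() keeps the computed chunks around for reuse, unpersist() hands them back to the system.

// persist() pins the computed data so later actions reuse it
// instead of recomputing the lineage from scratch
val cached = sc.parallelize(1 to 1000000).map(_ * 2).persist()
cached.count()      // first action computes and caches
cached.sum()        // reuses the cached chunks
cached.unpersist()  // back to system-managed lazy re-compute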
"AGGREGATE" IN SPARK
scala> val data = sc.parallelize((1 to 4).toList, 2)
scala> data.aggregate(0)(math.max(_, _), _ + _)
res0: Int = 6

def aggregate[U](zerovalue: U)
                (fbinary: (U, T) => U,
                 fagg: (U, U) => U): U

(Spark's own signature names these zeroValue, seqOp and combOp.)
HOW "AGGREGATE" WORKS IN SPARK
[Diagram: the elements e1…e4 of an RDD sit in two partitions; within each partition, fbinary folds the elements together starting from zerovalue, yielding the partial results res1 and res2; fagg then combines the partial results into the final result.]

Caveat: aggregate is partition-sensitive; the algorithm should work correctly regardless of how the data is partitioned.
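A minimal sketch of why the caveat matters (assuming a live SparkContext `sc`): a non-identity zerovalue is folded in once per partition plus once more at the combine step, so the answer shifts with the number of partitions.

scala> sc.parallelize(List(1, 2, 3, 4), 1).aggregate(1)(_ + _, _ + _)
res0: Int = 12   // 1 + (1 + 1+2+3+4)
scala> sc.parallelize(List(1, 2, 3, 4), 2).aggregate(1)(_ + _, _ + _)
res1: Int = 13   // 1 + (1 + 1+2) + (1 + 3+4)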
"COGROUP" IN SPARK
scala> val x = sc.parallelize(List(1, 2, 1, 3), 1)
scala> val y = x.map((_, "y"))
scala> val z = x.map((_, "z"))
scala> y.cogroup(z).collect
res72: Array[(Int, (Iterable[String], Iterable[String]))] =
  Array((1,(Array(y, y),Array(z, z))), (3,(Array(y),Array(z))), (2,(Array(y),Array(z))))

def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
                        other2: RDD[(K, W2)],
                        other3: RDD[(K, W3)],
                        numPartitions: Int)
  : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
HOW "COGROUP" WORKS IN SPARK
[Diagram: RDDx holds (k1,va), (k2,vb), (k1,vc), (k3,vd), (k1,ve); RDDy holds (k1,vf), (k2,vg), (k1,vh).]

RDDx.cogroup(RDDy) groups the values of both RDDs by key, giving the combined result:

Array[(k1, [va, vc, ve, vf, vh]),
      (k2, [vb, vg]),
      (k3, [vd])]
"COGROUP" IN SPARK
• cogroup works on both RDDs and Spark Streams.
• The ability to combine multiple RDDs lets higher abstractions be constructed (see the sketch below).
• A Stream in Spark is just a list of (Time, RDD[U]) pairs.
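A minimal sketch of one such higher abstraction (assuming the pair RDDs `y` and `z` from the previous example): an inner join expressed in terms of cogroup.

// pair up every y-value with every z-value that shares a key
val joined = y.cogroup(z).flatMapValues {
  case (ys, zs) => for (a <- ys; b <- zs) yield (a, b)
}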
WHAT'S SPARK SQL
• Spark SQL is new and has largely replaced Shark.
• It lets large-scale (inline) queries be embedded into a Spark program.
• Spark SQL supports Apache Hive, JSON, Parquet, and RDDs as data sources.
• Spark SQL's optimizer is clever!
• It supports UDFs from Hive, or you can write your own (see the sketch below)!
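A minimal sketch of a hand-written UDF (assuming Spark 1.3+'s udf.register API and the hiveCtx / tweets table set up in the example below; the strLen name is ours).

// register a Scala function so inline SQL text can call it
hiveCtx.udf.register("strLen", (s: String) => s.length)
val lengths = hiveCtx.sql("SELECT strLen(text) FROM tweets")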
SPARK SQL
[Diagram: Spark SQL sits in the middle, drawing on its data sources: JSON, Parquet, Hive, and RDDs.]
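Each source plugs in much the same way; a minimal sketch for Parquet (assuming `hiveCtx` from the example below and Spark 1.x's parquetFile API; the path is hypothetical):

val parquetInput = hiveCtx.parquetFile("hdfs://.../tweets.parquet")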
SPARK SQL (AN EXAMPLE)

// import spark sql
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// create a spark sql hivecontext
val sc = new SparkContext(…)
val hiveCtx = new HiveContext(sc)

// load JSON and register it as a temporary table
val input = hiveCtx.jsonFile(inputFile)
input.registerTempTable("tweets")

// query the registered table with inline SQL
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")

// pull the tweet text (column 0) out of each row
val topTweetContent = topTweets.map(row => row.getString(0))
WHAT'S SPARK STREAMING
• The core component is a DStream.
• A DStream is an abstraction over RDDs whose basic component is a (key, value) pair, where key = Time and value = RDD.
• Forward and backward queries are supported (see the windowing sketch below).
• Fault tolerance comes from check-pointing RDDs.
• What you can do with RDDs, you can do with DStreams.
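A minimal sketch of a backward-looking query (assuming the `lines` DStream from the example below): a window gathers the last 30 seconds of batches, sliding every 10 seconds.

import org.apache.spark.streaming.Seconds
val last30s = lines.window(Seconds(30), Seconds(10))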
SPARK STREAMING (QUICK EXAMPLE)

import org.apache.spark.streaming.{Duration, Seconds}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream using data received after connecting to
// port 7777 on the local machine
val lines = ssc.socketTextStream("localhost", 7777)

// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))

// Print out the lines with errors
errorLines.print()

// Start our streaming context and wait for it to "finish"
ssc.start()

// Wait for the job to finish
ssc.awaitTermination()
A DSTREAM LOOKS LIKE…
[Diagram: a DStream is a sequence of RDD batches laid out along a timeline: t1 to t2, t2 to t3, t3 to t4, …]
A DSTREAM CAN HAVE TRANSFORMATIONS ON IT!
[Diagram: a transformation f is applied, on the fly, to the t1-to-t2 batch of the input DStream (data-1), producing the transformed DStream (data-2).]
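A minimal sketch (assuming the `lines` DStream from the earlier example): the transformation is applied batch by batch to each underlying RDD.

val lineLengths = lines.map(_.length)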
SPARK STREAM TRANSFORMATION
[Diagram: as time advances (t1 to t2, t2 to t3, t3 to t4, …), f is applied to each incoming batch of data-1 and the transformed data-2 is output in batches; each batch is transformed independently of the others.]
STATEFUL SPARK STREAM TRANSFORMATION
[Diagram: the same pipeline, except that f carries state forward from batch to batch, so the output for t3 to t4 depends on the intervals before it.]
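A minimal sketch of a stateful transformation (assuming `ssc` and `lines` from the earlier example; the checkpoint path is hypothetical): updateStateByKey threads running state across batches, which is why it needs check-pointing.

ssc.checkpoint("hdfs://...")  // hypothetical checkpoint directory
val pairs = lines.flatMap(_.split(" ")).map((_, 1))
val runningCounts = pairs.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0))
}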
HOW DOES SPARK STREAMING HANDLE FAULTS?
• As before, check-pointing is the key to fault tolerance (especially in stateful DStream transformations).
• Programs can recover from check-points => no need to restart all over again (see the sketch below).
• You can use "monit" to restart Spark jobs, or pass the Spark flag "--supervise" in the job config; this is known as driver fault tolerance.
• All incoming data to workers is replicated.
• In-memory RDDs follow the lineage graph to recover.
• The above is known as worker fault tolerance.
• Receiver fault tolerance largely depends on whether the data sources can re-send lost data.
• Streams guarantee exactly-once semantics; caveat: multiple writes can occur to HDFS, which app-specific logic needs to handle.
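A minimal sketch of recovering from a check-point (the factory and path are hypothetical): StreamingContext.getOrCreate rebuilds the context from the checkpoint on restart instead of starting all over again.

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs://...")  // hypothetical checkpoint directory
  // build the DStream graph here, then return the context
  ssc
}
val recovered = StreamingContext.getOrCreate("hdfs://...", createContext _)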
REFERENCES
• Books:
• "Learning Spark: Lightning-Fast Big Data Analytics"
• "Advanced Analytics with Spark: Patterns for Learning from Data at Scale"
• "Fast Data Processing with Spark"
• "Machine Learning with Spark"
• Berkeley Data Bootcamp
• Introduction to Big Data with Apache Spark
• Kien Dang's introduction to Spark and R using Naive Bayes
• Spark Streaming with Scala and Akka
THE END
QUESTIONS?
TWITTER: @RAYMONDTAYBL
GITHUB: @RAYGIT