Apache Spark Streaming: Architecture and Fault Tolerance
© 2015 IBM Corporation
Apache Hadoop Day 2015
Paranth Thiruvengadam – Architect @ IBM
Sachin Aggarwal – Developer @ IBM
© 2015 IBM Corporation
Spark Streaming
 Features of Spark Streaming
 High-level API (joins, windows, etc.)
 Fault-tolerant (exactly-once semantics achievable)
 Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)
© 2015 IBM Corporation
Architecture
© 2015 IBM Corporation
High Level Overview
© 2015 IBM Corporation
Receiving Data
[Diagram] An input source feeds a receiver, which the driver runs as a long-running task on an executor. The receiver divides the stream into data blocks and keeps them in memory; each block is replicated to another executor.
© 2015 IBM Corporation
Processing Data
[Diagram] Every batch interval, the driver launches tasks on the executors to process the in-memory data blocks; the results are stored to the data store.
© 2015 IBM Corporation
What’s different from other streaming applications?
© 2015 IBM Corporation
Traditional Stream Processing
© 2015 IBM Corporation
Load Balancing…
© 2015 IBM Corporation
Node failure / Stragglers…
© 2015 IBM Corporation
Word Count with Kafka
© 2015 IBM Corporation
Fault Tolerance
© 2015 IBM Corporation
Fault Tolerance
 Why Care?
 Different guarantees for Data Loss
 At least Once
 Exactly Once
 What all can fail?
 Driver
 Executor
© 2015 IBM Corporation
What happens when executor fails?
© 2015 IBM Corporation
What happens when Driver fails?
© 2015 IBM Corporation
Recovering Driver – Checkpointing
© 2015 IBM Corporation
Driver restart
© 2015 IBM Corporation
Driver restart – To-Do List
 Configure automatic driver restart
 Spark Standalone: submit in cluster deploy mode with the --supervise flag
 YARN: run in cluster mode (the application master, and with it the driver, is re-attempted on failure)
 Set the checkpoint directory in an HDFS-compatible file system
streamingContext.checkpoint(hdfsDirectory)
 Ensure the code uses checkpoints for recovery
def setupStreamingContext(): StreamingContext = {
  val context = new StreamingContext(…)
  val lines = KafkaUtils.createStream(…)
  …
  context.checkpoint(hdfsDir)
  context
}
val context = StreamingContext.getOrCreate(hdfsDir, setupStreamingContext)
context.start()
© 2015 IBM Corporation
WAL for no data loss
© 2015 IBM Corporation
Recover using WAL
© 2015 IBM Corporation
Configuration – Enabling WAL
 Enable checkpointing
 Enable WAL in the Spark configuration:
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
 The receiver should acknowledge the input source only after data is written to the WAL
 Disable in-memory replication, since the WAL already provides durability (a sketch follows below)
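A minimal sketch putting these settings together (the app name, checkpoint path, and socket source are assumptions for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("WALExample")                                     // assumed app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // receiver WAL on
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///checkpoints/wal-example")               // WAL requires a checkpoint directory

// With the WAL providing durability, drop the in-memory _2 replication:
// receive with a single-replica storage level instead.
val lines = ssc.socketTextStream("localhost", 9998, StorageLevel.MEMORY_AND_DISK_SER)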
© 2015 IBM Corporation
Normal Processing
© 2015 IBM Corporation
Restarting Failed Driver
© 2015 IBM Corporation
Fault-Tolerant Semantics
Source → Receiving → Transforming → Outputting → Sink
Receiving: at least once, with checkpointing / WAL
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
© 2015 IBM Corporation
Fault-Tolerant Semantics
Source → Receiving → Transforming → Outputting → Sink
Receiving: exactly once, with the Kafka Direct API
Transforming: exactly once, as long as received data is not lost
Outputting: exactly once, if outputs are idempotent or transactional
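For the output stage, a sketch of the transactional pattern (wordCounts stands for any output DStream; generateUniqueId and the sink-side commit are hypothetical):

import org.apache.spark.TaskContext

wordCounts.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    // Batch time + partition id are stable across task retries,
    // so the sink can detect and skip an already-committed partition.
    val partitionId = TaskContext.get.partitionId()
    val uniqueId = generateUniqueId(time.milliseconds, partitionId) // hypothetical helper
    // commit partitionIterator's records transactionally under uniqueId
  }
}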
© 2015 IBM Corporation
How to achieve an “exactly once” guarantee?
© 2015 IBM Corporation
Before Kafka Direct API
© 2015 IBM Corporation
Kafka Direct API
Benefits of this approach:
• Simplified parallelism
• Less storage needed
• Exactly-once semantics
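A minimal sketch of creating a direct stream with the 2015-era spark-streaming-kafka (0.8) API; the broker address and topic name are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// No receiver and no WAL: each batch computes its Kafka offset ranges and
// reads them directly, so one Kafka partition maps to one RDD partition.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker
val topics = Set("test")                                          // assumed topic
val directLines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  .map(_._2) // keep the message value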
© 2015 IBM Corporation
Demo
DEMO: SPARK STREAMING
OVERVIEW OF SPARK STREAMING
DISCRETIZED STREAMS (DSTREAMS)
• A DStream is the basic abstraction in Spark Streaming.
• It is represented by a continuous series of RDDs (of the same type).
• Each RDD in a DStream contains data from a certain interval.
• DStreams can either be created from live data (such as data from TCP sockets, Kafka, Flume, etc.) using a StreamingContext, or generated by transforming existing DStreams using operations such as `map`, `window`, and `reduceByKeyAndWindow`.
DISCRETIZED STREAMS (DSTREAMS)
[Diagram]
WORD COUNT
Batch word count:
val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WordCount")
val sc = new SparkContext(sparkConf)
val file = sc.textFile("filePath")
val words = file
  .flatMap(_.split(" "))
val pairs = words
  .map(x => (x, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile(args(1))

Streaming word count:
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("SocketStreaming")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc
  .socketTextStream("localhost", 9998)
val words = lines
  .flatMap(_.split(" "))
val pairs = words
  .map(word => (word, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
DEMO
KAFKA STREAM
val lines = ssc
.socketTextStream("localh
ost", 9998)
val words = lines
.flatMap(_.split(" "))
val pairs = words
.map(word => (word, 1))
val wordCounts = pairs
.reduceByKey(_ + _)
val zkQuorum="localhost:2181”;
val group="test";
val topics="test";
val numThreads="1";
val topicMap = topics
.split(",")
.map((_, numThreads.toInt))
.toMap
val lines = KafkaUtils
.createStream(
ssc, zkQuorum, group, topicMap)
.map(_._2)
val words = lines
.flatMap(_.split(" "))
……..
DEMO
OPERATIONS
• repartition()
• Operation on each RDD (example: print the partition count of each RDD)
val reLines = lines
  .repartition(5)
reLines
  .foreachRDD(rdd => fun(rdd))

def fun(rdd: RDD[String]) = {
  print("partition count " + rdd.partitions.length)
}
DEMO
STATELESS TRANSFORMATIONS
• map() Apply a function to each element in the DStream and return a DStream of the result.
• ds.map(x => x + 1)
• flatMap() Apply a function to each element in the DStream and return a DStream of the contents
of the iterators returned.
• ds.flatMap(x => x.split(" "))
• filter() Return a DStream consisting of only elements that pass the condition passed to filter.
• ds.filter(x => x != 1)
• repartition() Change the number of partitions of the DStream.
• ds.repartition(10)
• reduceByKey() Combine values with the same key in each batch.
• ds.reduceByKey((x, y) => x + y)
• groupByKey() Group values with the same key in each batch.
• ds.groupByKey()
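A short sketch chaining several of these on one stream (here ds is assumed to be a DStream[String] of words):

// assumed: ds is a DStream[String] of words
val counts = ds
  .filter(_.nonEmpty)      // drop empty tokens
  .map(word => (word, 1))  // pair each word with a count
  .reduceByKey(_ + _)      // combine counts per key within each batch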
DEMO
STATEFUL TRANSFORMATIONS
Stateful transformations require checkpointing to be
enabled in your StreamingContext for fault tolerance
• Windowed transformations: windowed computations
allow you to apply transformations over a sliding window
of data
• UpdateStateByKey transformation: enables stateful computation by
providing access to a state variable for DStreams of
key/value pairs
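Because of that checkpointing requirement, enable it on the context before using these operations; a one-line sketch (the directory is an assumption):

ssc.checkpoint("hdfs:///checkpoints/stateful") // required for stateful transformations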
DEMO
WINDOW OPERATIONS
Any window operation needs to specify two parameters:
• window length – the duration of the window.
• sliding interval – the interval at which the window operation is performed.
These two parameters must be multiples of the batch interval of the source DStream.
DEMO
WINDOWED TRANSFORMATIONS
• window(windowLength, slideInterval)
• Return a new DStream, computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval)
• Return a sliding window count of elements in the stream.
• val totalWordCount= words.countByWindow(Seconds(30), Seconds(10))
• reduceByWindow(func, windowLength, slideInterval)
• Return a new single-element stream, created by aggregating elements in the stream over a sliding
interval using func.
• The function should be associative so that it can be computed correctly in parallel.
• val totalWordCount = words.map(_ => 1).reduceByWindow({(x, y) => x + y}, {(x, y) => x - y}, Seconds(30), Seconds(10))
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
• Returns a new DStream of (K, V) pairs where the values for each key are aggregated using the
given reduce function func over batches in a sliding window
• val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30),
Seconds(10))
• countByValueAndWindow(windowLength, slideInterval)
• Returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a
sliding window.
• val eachWordCount = words.countByValueAndWindow(Seconds(30), Seconds(10))
DEMO
UPDATE STATE BY KEY TRANSFORMATION
• updateStateByKey()
• Enables stateful computation by providing access to a state variable for DStreams of key/value pairs
• The user provides a function updateFunc(events, oldState) and an initialRDD
• val initialRDD = ssc.sparkContext.parallelize(
    List(("hello", 1), ("world", 1)))
• val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }
• val stateCount = pairs.updateStateByKey[Int](updateFunc)
DEMO
TRANSFORM OPERATION
• The transform operation allows arbitrary RDD-to-RDD functions to be applied on a DStream.
• It can be used to apply any RDD operation that is not exposed in the DStream API.
• For example, joining every batch in a data stream with another dataset (an RDD, `data` below) is not directly exposed in the DStream API.
• val cleanedDStream = wordCounts.transform(rdd => {
    rdd.join(data)
  })
DEMO
JOIN OPERATIONS
• Stream-stream joins:
• Streams can easily be joined with other streams.
• val stream1: DStream[(String, String)] = ...
• val stream2: DStream[(String, String)] = ...
• val joinedStream = stream1.join(stream2)
• Windowed joins:
• val windowedStream1 = stream1.window(Seconds(20))
• val windowedStream2 = stream2.window(Minutes(1))
• val joinedStream = windowedStream1.join(windowedStream2)
• Stream-dataset joins:
• val dataset: RDD[(String, String)] = ...
• val windowedStream = stream.window(Seconds(20))...
• val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
DEMO
USING FOREACHRDD()
• foreachRDD is a powerful primitive that allows data to be sent out to external systems.
• dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      ConnectionPool.returnConnection(connection)
    }
  }
• Using foreachRDD, each RDD can be converted to a DataFrame, registered as a temporary table, and then queried using SQL:
• words.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    val wordsDataFrame = rdd.toDF("word")
    wordsDataFrame.registerTempTable("words")
    val wordCountsDataFrame =
      sqlContext.sql("select word, count(*) as total from words group by word")
    wordCountsDataFrame.show()
  }
DEMO
DSTREAMS (SPARK CODE)
• Internally, a DStream is characterized by a few basic properties:
• A list of other DStreams that the DStream depends on
• A time interval at which the DStream generates an RDD
• A function that is used to generate an RDD after each time interval
• Methods that must be implemented by subclasses of DStream:
• Time interval after which the DStream generates an RDD
• def slideDuration: Duration
• List of parent DStreams on which this DStream depends
• def dependencies: List[DStream[_]]
• Method that generates an RDD for the given time
• def compute(validTime: Time): Option[RDD[T]]
• This class contains the basic operations available on all DStreams, such as `map`, `filter`, and `window`. In addition, PairDStreamFunctions contains operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and `join`. These operations are automatically available on any DStream of pairs (e.g., DStream[(Int, Int)]) through implicit conversions.
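To make those three members concrete, a minimal sketch of a hypothetical DStream subclass that emits the same RDD every batch (the class and its constructor arguments are illustrative, not Spark source):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream

class ConstantDStream[T: ClassTag](ssc: StreamingContext, rdd: RDD[T], batchInterval: Duration)
  extends DStream[T](ssc) {

  // No parent streams: this behaves like an input DStream
  override def dependencies: List[DStream[_]] = List()

  // Produce one RDD per batch interval
  override def slideDuration: Duration = batchInterval

  // The RDD generated for the given batch time
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}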
© 2015 IBM Corporation
Editor's Notes

  1. Continuous operator processing model: each node continuously receives records, updates internal state, and emits new records. Latency is low, but fault tolerance is typically achieved through replication, using a synchronization protocol like Flux. D-Stream processing model: in each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This dataset is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.
  2. Have sample code ready before coming to this slide.
  3. (i) Reference IDs of the blocks, for locating their data in executor memory; (ii) offset information of the block data in the logs.
  4. Have to read up on the Kafka Direct API.