Agenda:
• Spark Streaming Architecture
• How Spark Streaming differs from other streaming systems
• Fault Tolerance
• Code walkthrough & demo
• Theory concepts will be supplemented with plenty of examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
32. DISCRETIZED STREAMS (DSTREAMS)
• DStream is the basic abstraction in Spark Streaming.
• It is represented by a continuous series of RDDs (of the
same type).
• Each RDD in a DStream contains data from a certain
interval.
• DStreams can either be created from live data (such as
data from TCP sockets, Kafka, Flume, etc.) using a
StreamingContext, or generated by transforming existing
DStreams using operations such as `map`, `window` and
`reduceByKeyAndWindow`.
34. WORD COUNT
val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WordCount")
val sc = new SparkContext(sparkConf)
val file = sc.textFile("filePath")
val words = file
  .flatMap(_.split(" "))
val pairs = words
  .map(x => (x, 1))
val wordCounts = pairs
  .reduceByKey(_ + _)
wordCounts.saveAsTextFile(args(1))
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("SocketStreaming")
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc
.socketTextStream("localhost", 9998)
val words = lines
.flatMap(_.split(" "))
val pairs = words
.map(word => (word, 1))
val wordCounts = pairs
.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
36. KAFKA STREAM
val zkQuorum = "localhost:2181"
val group = "test"
val topics = "test"
val numThreads = "1"
val topicMap = topics
  .split(",")
  .map((_, numThreads.toInt))
  .toMap
val lines = KafkaUtils
  .createStream(ssc, zkQuorum, group, topicMap)
  .map(_._2)
val words = lines
.flatMap(_.split(" "))
……..
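For reference, a complete runnable sketch of the receiver-based Kafka word count, assuming Spark 1.x with the spark-streaming-kafka artifact on the classpath (the object name and the 2-second batch interval are our assumptions; the host, group, and topic values mirror the slide):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))

    val zkQuorum = "localhost:2181"
    val group = "test"
    // one topic ("test"), read by one receiver thread
    val topicMap = "test".split(",").map((_, 1)).toMap

    // createStream yields (key, message) pairs; keep only the message
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}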
40. STATELESS TRANSFORMATIONS
• map() Apply a function to each element in the DStream and return a DStream of the result.
• ds.map(x => x + 1)
• flatMap() Apply a function to each element in the DStream and return a DStream of the contents
of the iterators returned.
• ds.flatMap(x => x.split(" "))
• filter() Return a DStream consisting of only elements that pass the condition passed to filter.
• ds.filter(x => x != 1)
• repartition() Change the number of partitions of the DStream.
• ds.repartition(10)
• reduceByKey() Combine values with the same key in each batch.
• ds.reduceByKey((x, y) => x + y)
• groupByKey() Group values with the same key in each batch.
• ds.groupByKey()
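As a quick illustration, these stateless operations chain just like their RDD counterparts. A minimal sketch, assuming ssc is the StreamingContext from the socket example above:

// chain the stateless transformations listed above
val ds = ssc.socketTextStream("localhost", 9998)
val counts = ds
  .flatMap(x => x.split(" "))    // one line -> many words
  .filter(x => x.nonEmpty)       // drop empty tokens
  .map(x => (x, 1))              // word -> (word, 1)
  .repartition(10)               // change parallelism
  .reduceByKey((x, y) => x + y)  // per-batch counts
counts.print()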
42. STATEFUL TRANSFORMATIONS
Stateful transformations require checkpointing to be
enabled on your StreamingContext for fault tolerance (a
minimal sketch follows this list).
• Windowed transformations: windowed computations
allow you to apply transformations over a sliding window
of data.
• updateStateByKey transformation: maintains arbitrary
per-key state by providing access to a state variable for
DStreams of key/value pairs.
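A minimal sketch of enabling checkpointing before any stateful operation is used (the local checkpoint directory is an assumption; a fault-tolerant store such as HDFS would be used in practice):

val ssc = new StreamingContext(conf, Seconds(2))
// required before updateStateByKey or windowed ops with inverse functions
ssc.checkpoint("/tmp/spark-checkpoint")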
44. WINDOW OPERATIONS
Any window operation needs to specify two parameters:
• window length - the duration of the window.
• sliding interval - the interval at which the window
operation is performed.
These two parameters must be multiples of the batch
interval of the source DStream.
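For instance, with the 2-second batch interval from the earlier examples (conf as defined there), a 30-second window sliding every 10 seconds is valid because both values are multiples of 2:

val ssc = new StreamingContext(conf, Seconds(2))  // batch interval: 2s
val lines = ssc.socketTextStream("localhost", 9998)
// window length 30s and sliding interval 10s are both multiples of 2s
val windowedLines = lines.window(Seconds(30), Seconds(10))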
46. WINDOWED TRANSFORMATIONS
• window(windowLength, slideInterval)
• Return a new DStream, computed based on windowed batches of the source DStream.
• countByWindow(windowLength, slideInterval)
• Return a sliding window count of elements in the stream.
• val totalWordCount = words.countByWindow(Seconds(30), Seconds(10))
• reduceByWindow(func, invFunc, windowLength, slideInterval)
• Return a new single-element stream, created by aggregating elements in the stream over a sliding
window using func (invFunc "inverse reduces" the data that leaves the window).
• The function should be associative so that it can be computed correctly in parallel.
• val totalWordCount = pairs.map(_._2).reduceByWindow((x, y) => x + y, (x, y) => x - y, Seconds(30), Seconds(10))
• reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
• Returns a new DStream of (K, V) pairs where the values for each key are aggregated using the
given reduce function func over batches in a sliding window
• val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
• countByValueAndWindow(windowLength, slideInterval)
• Returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a
sliding window.
• val eachWordCount = words.countByValueAndWindow(Seconds(30), Seconds(10))
48. UPDATE STATE BY KEY
TRANSFORMATION
• updateStateByKey()
• Maintains arbitrary per-key state by providing access to a state variable for DStreams of
key/value pairs.
• The user provides a function updateFunc(events, oldState) and, optionally, an initialRDD (a
sketch of the initialRDD overload follows the code below).
• val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1),
("world", 1)))
• val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.foldLeft(0)(_ + _)
val previousCount = state.getOrElse(0)
Some(currentCount + previousCount)
}
• val stateCount = pairs.updateStateByKey[Int](updateFunc)
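The initialRDD mentioned above is passed through an overload of updateStateByKey that also takes a partitioner; a sketch, reusing the pairs stream and updateFunc from the word-count example:

import org.apache.spark.HashPartitioner

// seed the per-key state with ("hello", 1) and ("world", 1)
val seededStateCount = pairs.updateStateByKey[Int](
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialRDD)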
50. TRANSFORM OPERATION
• The transform operation allows arbitrary RDD-to-RDD
functions to be applied on a DStream.
• It can be used to apply any RDD operation that is not
exposed in the DStream API.
• For example, the functionality of joining every batch in a
data stream with another dataset is not directly exposed
in the DStream API.
• val cleanedDStream = wordCounts.transform(rdd => {
rdd.join(data)
})
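A fuller sketch of the join above, where data is a static RDD of flagged words (the dataset contents and the filtering step are our assumptions for illustration):

// static RDD to join against each batch: (word, isSpam)
val data = ssc.sparkContext.parallelize(Seq(("spam", true), ("ham", false)))

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(data)        // inner join: (word, (count, isSpam)); unknown words dropped
     .filter(!_._2._2)  // keep only words not flagged as spam
     .mapValues(_._1)   // back to (word, count)
}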
54. USING FOREACHRDD()
• foreachRDD is a powerful primitive that allows data to be sent out to
external systems.
• dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection)
}
}
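ConnectionPool above is not a Spark API; a minimal hypothetical sketch of what it could look like, assuming a simple socket sink (a production pool, e.g. commons-pool, would bound its size and handle failures):

import java.io.PrintWriter
import java.net.Socket
import scala.collection.mutable

// thin wrapper so that connection.send(record) above compiles
class Connection(host: String, port: Int) {
  private val out = new PrintWriter(new Socket(host, port).getOutputStream, true)
  def send(record: Any): Unit = out.println(record)
}

object ConnectionPool {
  private val pool = mutable.Queue[Connection]()

  def getConnection(): Connection = synchronized {
    if (pool.isEmpty) new Connection("localhost", 9999) else pool.dequeue()
  }

  def returnConnection(c: Connection): Unit = synchronized {
    pool.enqueue(c)
  }
}

Because the object is initialized lazily on each executor, connections are created where the data is sent rather than being serialized from the driver.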
• Using foreachRDD, each RDD can be converted to a DataFrame, registered
as a temporary table, and then queried using SQL.
• words.foreachRDD { rdd =>
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val wordsDataFrame = rdd.toDF("word")
wordsDataFrame.registerTempTable("words")
val wordCountsDataFrame =
sqlContext.sql("select word, count(*) as total from words group by word")
wordCountsDataFrame.show()
}
56. DSTREAMS (SPARK CODE)
• A DStream is internally characterized by a few basic properties:
• A list of other DStreams that the DStream depends on
• A time interval at which the DStream generates an RDD
• A function that is used to generate an RDD after each time interval
• Methods that must be implemented by subclasses of DStream:
• Time interval after which the DStream generates an RDD
• def slideDuration: Duration
• List of parent DStreams on which this DStream depends
• def dependencies: List[DStream[_]]
• Method that generates an RDD for the given time
• def compute(validTime: Time): Option[RDD[T]]
• The DStream class contains the basic operations available on all DStreams, such as
`map`, `filter` and `window`. In addition, PairDStreamFunctions contains
operations available only on DStreams of key-value pairs, such as
`groupByKeyAndWindow` and `join`. These operations are automatically
available on any DStream of pairs (e.g., DStream[(Int, Int)]) through implicit
conversions.
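A minimal sketch of a custom DStream implementing the three methods above, loosely modeled on Spark's ConstantInputDStream (the class name and constructor are our own, assuming Spark 1.x):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream
import scala.reflect.ClassTag

// emits the same RDD once per interval
class ConstantDStream[T: ClassTag](
    ssc: StreamingContext,
    rdd: RDD[T],
    interval: Duration) extends DStream[T](ssc) {

  // a source DStream: no parent streams
  override def dependencies: List[DStream[_]] = List()

  // time interval at which this DStream generates an RDD
  override def slideDuration: Duration = interval

  // generate the RDD for the given batch time
  override def compute(validTime: Time): Option[RDD[T]] = Some(rdd)
}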
Continuous operator processing model: each node continuously receives records, updates internal state, and emits new records. Latency is low, but fault tolerance is typically achieved through replication, using a synchronization protocol such as Flux.
D-Stream processing model: in each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This dataset is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.
The metadata checkpointed for each received block includes (i) reference IDs of the blocks for locating their data in the executor memory, and (ii) offset information of the block data in the write-ahead logs.
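Write-ahead logging for receivers is enabled with a single configuration flag, and the log files are written under the checkpoint directory; a sketch (the app name and directory path are assumptions):

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WALExample")
  // persist received blocks (and their metadata) to the write-ahead log
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("/tmp/spark-checkpoint")  // WAL files live under this directory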