At its heart, Spark Streaming is a scheduling framework that efficiently collects and delivers data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can operate directly on RDDs, transform them to Datasets to take advantage of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, show practical uses of the often-forgotten ConstantInputDStream, and explain how to combine Spark Streaming with probabilistic data structures to optimize memory use in long-running streaming jobs. Attendees of this session will come away with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
ConstantInputDStream
/**
 * An input stream that always returns the same RDD on each time step. Useful for testing.
 */
class ConstantInputDStream[T: ClassTag](_ssc: StreamingContext, rdd: RDD[T])
  extends InputDStream[T](_ssc)
// Usage
import org.apache.spark.streaming.dstream.ConstantInputDStream
val constantDStream = new ConstantInputDStream(streamingContext, rdd)
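To make the "same RDD on each time step" behavior concrete, here is a minimal self-contained sketch; the local master, 2-second batch interval, and sample data are illustrative choices, not from the talk:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val sparkContext = new SparkContext("local[2]", "constant-dstream-demo") // assumed local setup
val streamingContext = new StreamingContext(sparkContext, Seconds(2))    // assumed 2s batch interval
val rdd = sparkContext.parallelize(1 to 5)                               // illustrative sample data

// The scheduler delivers this same RDD on every 2-second batch
val constantDStream = new ConstantInputDStream(streamingContext, rdd)
constantDStream.print() // prints 1..5 at each interval

streamingContext.start()
streamingContext.awaitTermination()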
ConstantInputDStream: Generate Data
import scala.util.Random

val sensorCount = 100000 // illustrative sensor population size (not defined on the slide)
val n = 100              // illustrative number of records generated per batch

val sensorId: () => Int = () => Random.nextInt(sensorCount)
val data: () => Double = () => Random.nextDouble
val timestamp: () => Long = () => System.currentTimeMillis
// Generates records with random data
val recordFunction: () => String = { () =>
  if (Random.nextDouble < 0.9) {
    Seq(sensorId().toString, timestamp(), data()).mkString(",")
  } else {
    // simulate 10% crap data as well… real-world streams are seldom clean
    "!!~corrupt~^&##$"
  }
}
// An RDD of record-generating functions, evaluated anew on every batch
val sensorDataGenerator = sparkContext.parallelize(1 to n).map(_ => recordFunction)
val sensorData = sensorDataGenerator.map(recordFun => recordFun())
val rawDStream = new ConstantInputDStream(streamingContext, sensorData)
The trick is the type of sensorDataGenerator: RDD[() => String], an RDD of record-generating functions rather than records. Because this RDD is never cached, each batch re-evaluates the functions and produces fresh random data every interval.
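A natural next step, not shown on the slide, is to parse these raw records while tolerating the roughly 10% corrupt ones; the SensorData case class below is a hypothetical record type for the id,timestamp,value format generated above:

import scala.util.Try

// Hypothetical record type for the "id,timestamp,value" lines generated above
case class SensorData(sensorId: Int, timestamp: Long, value: Double)

// flatMap + Option silently drops records that fail to parse (the corrupt ~10%)
val incomingStream = rawDStream.flatMap { record =>
  Try {
    val Array(id, ts, value) = record.split(",")
    SensorData(id.toInt, ts.toLong, value.toDouble)
  }.toOption
}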
ConstantInputDStream + foreachRDD = Reload External Data Periodically
var sensorRef = sparkSession.read.parquet(s"$referenceFile")
sensorRef.cache()
// An empty RDD per batch: this stream is used purely as a clock tick
val refreshDStream = new ConstantInputDStream(streamingContext, sparkContext.emptyRDD[Int])
// Refresh data every 5 minutes (window = slide = 300s)
val refreshIntervalDStream = refreshDStream.window(Seconds(300), Seconds(300))
refreshIntervalDStream.foreachRDD { _ =>
  sensorRef.unpersist(false)
  sensorRef = sparkSession.read.parquet(s"$referenceFile")
  sensorRef.cache()
}
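Since sensorRef is a plain var on the driver and foreachRDD executes its closure on the driver at every interval, any other stream that references sensorRef inside transform picks up the refreshed data automatically. A minimal sketch, assuming a hypothetical incomingDStream of (sensorId, reading) pairs and that the sensor id sits in column 0 of the reference data:

// incomingDStream: DStream[(Int, Double)] -- hypothetical stream of (sensorId, reading)
val enriched = incomingDStream.transform { rdd =>
  // transform's body runs on the driver at every batch, so it always reads the latest sensorRef
  val referenceById = sensorRef.rdd.map(row => (row.getInt(0), row)) // assumed schema: id in column 0
  rdd.join(referenceById)
}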
DStream + transform = Reload External Data with a Trigger
var sensorRef = sparkSession.read.parquet(s"$referenceFile")
sensorRef.cache()
val triggerRefreshDStream: DStream[String] = ??? // create a DStream from a source, e.g. Kafka
val referenceStream = triggerRefreshDStream.transform { rdd =>
  // refresh the cached reference data only when the control stream says so
  if (rdd.take(1).contains("refreshNow")) {
    sensorRef.unpersist(false)
    sensorRef = sparkSession.read.parquet(s"$referenceFile")
    sensorRef.cache()
  }
  sensorRef.rdd
}
incomingStream.join(referenceStream) ...
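Note that DStream.join is defined only on key-value streams, so the last line is schematic: both sides must be keyed first. A minimal sketch, assuming sensor id is the join key and reusing the hypothetical SensorData type from the parsing example:

// Key both streams by sensor id before joining (assumption: id is column 0 of the reference rows)
val keyedReference = referenceStream.map(row => (row.getInt(0), row))
val keyedIncoming = incomingStream.map(reading => (reading.sensorId, reading))
val joined = keyedIncoming.join(keyedReference) // DStream[(Int, (SensorData, Row))]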
Keeping Arbitrary State
import sparkSession.implicits._ // encoders needed for .as[Features] and createDataset
var baseline: Dataset[Features] = sparkSession.read.parquet(targetFile).as[Features]
…
stream.foreachRDD { rdd =>
  val incomingData = sparkSession.createDataset(rdd)
  val incomingFeatures = rawToFeatures(incomingData)
  val analyzed = compare(incomingFeatures, baseline)
  // store analyzed data
  // merge the new features into the running baseline, evicting expired records
  baseline = (baseline union incomingFeatures).filter(feature => !isExpired(feature))
}
https://gist.github.com/maasg/9d51a2a42fc831e385cf744b84e80479
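One caveat worth adding to this pattern: reassigning baseline with a union at every batch grows the logical plan without bound, which eventually slows planning down. A minimal mitigation sketch, assuming Spark 2.1+ (for eager Dataset.checkpoint) and an illustrative checkpoint location and interval:

sparkContext.setCheckpointDir("/tmp/checkpoints") // assumed location
var batchCount = 0
stream.foreachRDD { rdd =>
  // ... same processing and baseline update as above ...
  batchCount += 1
  if (batchCount % 10 == 0) {        // illustrative interval
    baseline = baseline.checkpoint() // eager by default: materializes the data and truncates the lineage
  }
}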