Discusses the code and architecture for building a real-time streaming application using Spark and Kafka. This demo presents use cases and patterns from different streaming frameworks.
2. Spark Core
SQL: Structured Data
Streaming: Real-time
MLlib: Machine Learning
GraphX: Graph Data
Apache Spark: a fast and general-purpose framework for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
For comparison, the classic Hadoop stack: MapReduce, with Hive, Pig, and Mahout, on top of HDFS.
4. How is Spark faster?
RDD: a Resilient Distributed Dataset, the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
Caching plus the DAG execution model is enough to run most workloads efficiently.
Combining libraries into one program is much faster than stitching together separate systems.
DataFrames: RDDs with a schema.
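The RDD ideas above can be sketched in a few lines of Scala (a minimal local example; the app name and data are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[1]"))

    // An immutable collection split into 4 partitions, operated on in parallel
    val nums = sc.parallelize(1 to 1000, 4)

    // Transformations only extend the DAG; nothing executes yet
    val evens = nums.filter(_ % 2 == 0).cache()

    // The first action runs the DAG and fills the cache; the second
    // is served from memory without recomputing the filter
    println(evens.count()) // 500
    println(evens.sum())   // 250500.0

    sc.stop()
  }
}
```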
8. Spark Streaming
A data processing framework for building streaming applications.
Why?
1. Scalable
2. Fault-tolerant
3. Simpler
4. Modular
5. Code reuse
9. But Spark vs Storm..?
● Storm is a stream processing framework that also does
micro-batching (Trident).
● Spark is a batch processing framework that also does
micro-batching (Spark Streaming).
Also read: https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink/answer/Santosh-Sahoo
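Both bullets hinge on micro-batching: slicing an unbounded stream into small time-indexed batches. A toy illustration in plain Scala, with no Spark or Storm involved (the function and data are made up):

```scala
// Slice timestamped events into fixed-width batches, the way Spark
// Streaming turns a stream into one RDD per batch interval.
def microBatch[A](events: Seq[(Long, A)], intervalMs: Long): Map[Long, Seq[A]] =
  events.groupBy(_._1 / intervalMs).map { case (b, evs) => b -> evs.map(_._2) }

val events  = Seq((100L, "a"), (1500L, "b"), (2100L, "c"), (3900L, "d"))
val batches = microBatch(events, 2000L)
// batches(0L) == Seq("a", "b"); batches(1L) == Seq("c", "d")
```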
10. World of Stream Processors
(Credit: http://www.slideshare.net/zbigniew.jerzak)
...and the list keeps growing.
11. Stream.scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val conf = new SparkConf().setAppName("demoapp").setMaster("local[1]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2)) // 2-second micro-batches
val kafkaConfig = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("topic1")
// The direct (receiver-less) API needs explicit key/value types and decoders
val wordstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaConfig, topics)
wordstream.print()
ssc.start()
ssc.awaitTermination()
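To try Stream.scala locally, the topic has to exist and have messages flowing into it. One way, using the stock Kafka CLI tools from a Kafka distribution directory (the paths and single-node settings are assumptions for a local setup):

```shell
# Create the topic the stream subscribes to (ZooKeeper-based CLI of older Kafka)
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic topic1

# Type lines into stdin; each line becomes a record the DStream prints
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic1
```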
20. Composite Example
// Load data using SQL
points = ctx.sql("select latitude, longitude from hive_tweets")
// Train a machine learning model
model = KMeans.train(points, 10)
// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
21. Apache Kafka
A no-nonsense distributed logging platform
● Roughly 100K msgs/s throughput, versus about 20K for RabbitMQ
● Log compaction
● Durable persistence
● Partition tolerance
● Replication
● Best in class integration with Spark
○ http://spark.apache.org/docs/latest/streaming-kafka-integration.html
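Log compaction, from the list above, is opt-in per topic: with cleanup.policy=compact the broker retains at least the latest record for every key instead of discarding purely by age. A minimal example of enabling it at topic creation (the topic name and single-node settings are illustrative):

```shell
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic user-profiles \
  --config cleanup.policy=compact
```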