A lot of players on the market have built successful MapReduce workflows to daily process terabytes of historical data. But who wants to wait for 24h to get updated analytics? This talk will introduce you to the lambda architecture designed to take advantages of both batch and streaming processing methods. So we will leverage fast access to historical data with real-time streaming data using Spark (Core, SQL, Streaming), Twitter, Apache Parquet, etc.
Clear code plus intuitive demo are also included - https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Was presented on JEEConf 2016 in Kyiv on 20/05/2016.
Design by Yarko Filevych: http://www.filevych.com/
3. Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
4. A lot of customers implemented
successful Hadoop-based M/R pipelines
which are operating today
5. Examples from Real Life
• Oozie workflow, operates daily and processes up to
150 TB to generate analytics
• bash managed workflow, operates daily and processes
up to 8 TB to generate analytics
6. Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
7. Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/
10. Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-compute the batch views
Serving layer
• indexes the batch views so that they can be queried in ad-hoc with
low-latency
Speed layer
• deals with recent-data only
http://lambda-architecture.net/
13. Trade-offs
Full recomputation vs. partical recomputation
e.g. using Bloom filters
Additive algorithms vs. approximation algorithms
e.g. HyperLogLog for count-distinct problem
20. Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% users consider it the most important part of Spark
Spark Streaming
http://spark.apache.org/docs/latest/streaming-programming-guide.html
27. Provide hashtags statistics
used in a #jeeconf tweets
All time till today + right now
Sample Application
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
31. Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #jeeconf tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on a fly
* Stream from file system (used for testing) can be used as a backup
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
36. Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
37. Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
.stream("s3://logs")
logs.groupBy(logs.user_id)
.agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk
42. References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly
Media)
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://milinda.pathirage.org/kappa-architecture.com/
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gN
http://milinda.pathirage.org/kappa-architecture.com/
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
http://www.slideshare.net/Typesafe_Inc/four-things-to-know-about-reliable-spark-streaming-with-typesafe-and-databricks
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
Notes de l'éditeur
micro-batch architecture
series of batch computations on small chunks of data
batch interval is configurable
exactly once semantics
Receiver:
Task that collects data from the input source and represents it as RDDs
Is launched automatically for each input source
Replicates data to another executor for fault tolerance
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate (number of records per second)
spark.streaming.blockInterval (default 200ms)
Spark 2.0:
Project Tungsten 2.0
Whole stage code generation
Optimized input / output -> Parquet + built-in cache
Spark Streaming
DataFrame API unified with Dataset API