JEEConf 2016 - Lambda Architecture with Apache Spark

Lambda Architecture
with Apache Spark

About Me
https://ua.linkedin.com/in/tarasmatyashovsky

Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting

Many customers have implemented
successful Hadoop-based M/R pipelines
that are still operating today

Examples from Real Life
• An Oozie workflow that runs daily and processes up to
150 TB to generate analytics
• A Bash-managed workflow that runs daily and processes
up to 8 TB to generate analytics

Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop

Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/

https://www.manning.com/books/big-data

Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-computes the batch views
Serving layer
• indexes the batch views so that they can be queried ad hoc with
low latency
Speed layer
• deals with recent data only
http://lambda-architecture.net/

https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark

Relevance of Data
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
query = function(batch view, real time view)
real time view = function(real time view, new data)
batch view = function(all data)
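
The three definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the talk: the function names, the use of `Counter` as the view type, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def batch_view(all_data):
    """batch view = function(all data): full recomputation over the master dataset."""
    return Counter(all_data)


def update_realtime_view(realtime_view, new_data):
    """real time view = function(real time view, new data): incremental update."""
    updated = Counter(realtime_view)  # copy so the previous view stays untouched
    updated.update(new_data)
    return updated


def query(batch, realtime):
    """query = function(batch view, real time view): merge the views at read time."""
    return batch + realtime


# Hypothetical sample data, for illustration only.
master_dataset = ["spark", "lambda", "spark"]
batch = batch_view(master_dataset)
realtime = update_realtime_view(Counter(), ["spark", "jeeconf"])
result = query(batch, realtime)
```

The batch view is rebuilt from scratch, the real-time view is only ever updated with new data, and the query combines both without modifying either.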

Trade-offs
Full recomputation vs. partial recomputation
e.g. using Bloom filters
Additive algorithms vs. approximation algorithms
e.g. HyperLogLog for count-distinct problem
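
As an illustration of the first trade-off, a Bloom filter can answer "has this record probably been seen before?" in constant space, which is one way to decide what to skip during a partial recomputation. The sketch below is a toy implementation under assumed parameters (`size_bits`, `num_hashes`, and SHA-256-based hashing are choices made for the example, not taken from the talk).

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical record identifiers, for illustration only.
seen = BloomFilter()
for record_id in ["tweet-1", "tweet-2", "tweet-3"]:
    seen.add(record_id)
already_processed = "tweet-2" in seen
```

A negative answer is definitive, so such a record must be processed; a positive answer can occasionally be wrong, which is the accuracy cost paid for the small memory footprint.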

Implementation of Lambda Architecture

Integrated solution for processing
on all lambda architecture layers

Spark Streaming
Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% of users consider it the most important part of Spark
http://spark.apache.org/docs/latest/streaming-programming-guide.html

Streaming Architecture

https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers

http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

DStream as a Continuous Series of RDDs
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
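
The idea of a DStream as a series of small batches can be mimicked without Spark. The sketch below is only a conceptual model in plain Python: `discretize`, the `(timestamp, value)` event shape, and the one-second interval are assumptions for illustration, and plain lists stand in for RDDs.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Emit one batch per interval, including empty intervals, in time order.
    return [batches.get(index, []) for index in range(first, last + 1)]


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
micro_batches = discretize(events, batch_interval=1.0)
# micro_batches == [["a", "b"], ["c"], [], ["d"]]
```

Each inner list plays the role of one RDD: the stream is processed one interval at a time with ordinary batch operations.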

Sample Application
Provide statistics for hashtags
used in #jeeconf tweets:
all time up to today + right now
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv

Batch View
apache – 6
architecture – 12
aws – 3
java – 4
jeeconf – 7
lambda – 6
morningatlohika – 15
simpleworkflow – 14
spark – 5

Real-time View
“Cool presentation by @tmatyashovsky about
#lambda #architecture using #apache #spark
at #jeeconf”
apache – 1
architecture – 1
jeeconf – 1
lambda – 1
spark – 1
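
A real-time view like this one can be built incrementally from each incoming tweet. The following is a minimal sketch in plain Python rather than the Spark Streaming code of the sample application: the regular expression, the lower-casing of tags, and the helper names `hashtags` and `update_view` are assumptions for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtags(tweet):
    """Return the hashtags in a tweet, lower-cased and without the leading '#'."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


def update_view(view, tweet):
    """Return a new real-time view that also counts the hashtags of one more tweet."""
    updated = Counter(view)
    updated.update(hashtags(tweet))
    return updated


tweet = (
    "Cool presentation by @tmatyashovsky about "
    "#lambda #architecture using #apache #spark at #jeeconf"
)
realtime_view = update_view(Counter(), tweet)
```

The mention `@tmatyashovsky` is ignored because only `#`-prefixed words match, so the view holds exactly the five hashtags, each with a count of 1.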

Batch View + Real-time View
apache – 7
architecture – 13
aws – 3
java – 4
jeeconf – 8
lambda – 7
morningatlohika – 15
simpleworkflow – 14
spark – 6
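
The merged numbers can be checked by adding the two views key by key. The sketch below uses Python's `Counter` purely as a stand-in for the views; the counts are the ones shown on the slides, while the variable names are made up for the example.

```python
from collections import Counter

# Counts from the "Batch View" slide.
batch_view = Counter({
    "apache": 6, "architecture": 12, "aws": 3, "java": 4, "jeeconf": 7,
    "lambda": 6, "morningatlohika": 15, "simpleworkflow": 14, "spark": 5,
})

# Counts from the "Real-time View" slide (one new tweet).
realtime_view = Counter({
    "apache": 1, "architecture": 1, "jeeconf": 1, "lambda": 1, "spark": 1,
})

# Query time: merge the two views without modifying either of them.
merged_view = batch_view + realtime_view

for tag in sorted(merged_view):
    print(f"{tag} – {merged_view[tag]}")
```

Hashtags that appear only in the batch view (aws, java, morningatlohika, simpleworkflow) keep their counts, and the five hashtags from the new tweet each increase by one.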

Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #jeeconf tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on the fly
* A stream from the file system (used for testing) can serve as a backup

Demo Time

http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics

Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark

Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
    .stream("s3://logs")
logs.groupBy(logs.user_id)
    .agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk
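
The snippet above uses the syntax shown ahead of the Spark 2.0 release. For comparison, here is a hedged sketch of the same aggregation written against the `readStream` / `writeStream` API of released PySpark 2.x. The function name, the schema (column types for `user_id` and `time`), and the input path parameter are assumptions, and the console sink is swapped in for the slide's JDBC target to keep the sketch self-contained.

```python
def start_time_per_user_query(spark, input_path):
    """Start a streaming query that sums `time` per `user_id` over JSON logs.

    `spark` is an existing SparkSession; `input_path` is a directory of JSON files.
    """
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # Assumed schema: streaming file sources need one declared up front
    # unless schema inference is explicitly enabled.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("time", LongType()),
    ])

    logs = spark.readStream.format("json").schema(schema).load(input_path)
    totals = logs.groupBy("user_id").agg(F.sum("time").alias("total_time"))

    # "complete" mode re-emits the full aggregated table on every trigger.
    return (
        totals.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )
```

With a running session, `start_time_per_user_query(spark, "/path/to/logs")` returns a `StreamingQuery` handle; calling `awaitTermination()` on it keeps the job alive. The path here is a placeholder.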

http://milinda.pathirage.org/kappa-architecture.com/

Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
Thank you!

References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gN
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
http://www.slideshare.net/Typesafe_Inc/four-things-to-know-about-reliable-spark-streaming-with-typesafe-and-databricks
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
