© 2015 MapR Technologies
Implementing the Lambda Architecture
efficiently with Apache Spark
Polyglot Processing
• Combination of different processing engines over DFS/NoSQL stores
• Lambda and Kappa architectures are two prominent examples
datadventures.ghost.io/2014/07/06/polyglot-processing/
Fault tolerance
hardware
software
developer
?
Let’s talk about developers…
xkcd.com/327/
Human fault tolerance
When things go wrong …
• 2010: unfortunate handling of error condition (http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366)
• 2012: cascaded bug (http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm)
• 2012: upgrade of batch processing (http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage)
• 2014: bug/bad config (http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage)
Lambda Architecture to the rescue!
Let’s step back a bit …
• Nathan Marz (Backtype, Twitter, stealth startup)
• Creator of …
– Storm
– Cascalog
– ElephantDB
manning.com/marz/
Lambda Architecture—Requirements
• Fault-tolerant against both hardware failures and human errors
• Support a variety of use cases, including low-latency querying as well as updates
• Linear scale-out capabilities
• Extensible, so that the system is manageable and can accommodate newer features easily
Lambda Architecture—Concept
• Latency: the time it takes to run a query
• Timeliness: how up to date the query results are (consistency)
• Accuracy: tradeoff between performance and scalability (approximations)
query = function(all data)
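The idea that a query is a function of all data can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `Event` shape and the `count_actions` helper are hypothetical names, and the three records are borrowed from the flight example used later in the deck.

```python
from typing import Iterable, Tuple

# Hypothetical event shape: (timestamp, airport, flight, action).
Event = Tuple[str, str, str, str]

ALL_DATA: Tuple[Event, ...] = (
    ("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    ("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    ("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
)


def count_actions(all_data: Iterable[Event], action: str) -> int:
    """query = function(all data): a pure scan over every stored event."""
    return sum(1 for _, _, _, a in all_data if a == action)


takeoffs = count_actions(ALL_DATA, "take-off")
landings = count_actions(ALL_DATA, "landing")
```

Scanning everything on every query keeps the logic simple, but the cost grows with the dataset, which is where latency becomes the concern and precomputed views start to pay off.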
Lambda Architecture
Figure: new data feeds both the batch layer and the speed layer. The batch layer keeps the immutable master data and recomputes (precomputes) the batch views (View 1 … View N), which the serving layer holds. The speed layer processes the stream and increments the real-time views (View 1 … View N). A query merges batch views and real-time views.
Lambda Architecture—Compensate Batch
Figure: timeline ending at now; the most recent stretch is marked as not absorbed.
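How the speed layer compensates for data the batch layer has not yet absorbed can be sketched as follows. This is a minimal illustration, not a reference implementation: `build_view`, `merge_views`, `BATCH_CUTOFF` and the sample events are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str]  # (ISO timestamp, airport), hypothetical shape


def build_view(events: Iterable[Event]) -> Dict[str, int]:
    """Count events per airport; used for both layers in this sketch."""
    return dict(Counter(airport for _, airport in events))


def merge_views(batch_view: Dict[str, int], realtime_view: Dict[str, int]) -> Dict[str, int]:
    """Answer a query by adding the real-time counts to the batch counts."""
    merged = Counter(batch_view)
    merged.update(realtime_view)
    return dict(merged)


# Illustrative events, not taken from the slides.
EVENTS = [
    ("2014-01-01T10:00:00", "DUB"),
    ("2014-01-01T10:05:00", "HEL"),
    ("2014-01-01T10:07:00", "DUB"),
    ("2014-01-01T10:09:00", "LHR"),
]
# Assumed point in time up to which the last batch run absorbed data.
BATCH_CUTOFF = "2014-01-01T10:05:00"

# ISO 8601 timestamps of equal length compare correctly as strings.
batch_view = build_view(e for e in EVENTS if e[0] <= BATCH_CUTOFF)
realtime_view = build_view(e for e in EVENTS if e[0] > BATCH_CUTOFF)
merged_view = merge_views(batch_view, realtime_view)
```

In this sketch the merged result equals a view built over all events, and the real-time entries could be discarded once the next batch run has absorbed them.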
Lambda Architecture—Immutable Data + Views
openflights.org
timestamp            airport  flight  action
2014-01-01T10:00:00  DUB      EI123   take-off
2014-01-01T10:05:00  HEL      SAS45   take-off
2014-01-01T10:07:00  AMS      BA99    take-off
2014-01-01T10:09:00  LHR      LH17    landing
2014-01-01T10:10:00  CDG      AF03    landing
2014-01-01T10:10:00  FCO      AZ501   take-off
immutable master dataset
Lambda Architecture—Immutable Data + Views
views

airport load:
airport  planes
AMS      69
CDG      44
DUB      31
FCO      10
HEL      17
LHR      101

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EI       19
LH       201
SAS      28

air-borne: 2307
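Deriving views from the immutable master dataset might look roughly like the sketch below. It uses only the six rows shown on the slide, so its counts do not match the slide's figures, which presumably come from a much larger dataset. The definitions are assumptions: "airport load" is taken here as the number of events per airport, a flight counts as air-borne when its latest action is a take-off, and the airline is the flight code without its trailing digits.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[str, str, str, str]  # (timestamp, airport, flight, action)

MASTER: List[Event] = [
    ("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    ("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    ("2014-01-01T10:07:00", "AMS", "BA99", "take-off"),
    ("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
    ("2014-01-01T10:10:00", "CDG", "AF03", "landing"),
    ("2014-01-01T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline_of(flight: str) -> str:
    """Assumed convention: airline code is the flight code minus trailing digits."""
    return flight.rstrip("0123456789")


def airport_load(master: List[Event]) -> Dict[str, int]:
    """View 1 (assumed definition): number of events seen per airport."""
    return dict(Counter(airport for _, airport, _, _ in master))


def airborne_per_airline(master: List[Event]) -> Dict[str, int]:
    """View 2: flights whose most recent action is a take-off, per airline."""
    last_action: Dict[str, str] = {}
    for _, _, flight, action in sorted(master):  # sorted by timestamp first
        last_action[flight] = action
    airborne = (f for f, a in last_action.items() if a == "take-off")
    return dict(Counter(airline_of(f) for f in airborne))


load_view = airport_load(MASTER)
airborne_view = airborne_per_airline(MASTER)
airborne_total = sum(airborne_view.values())
```

Because the master dataset is never updated in place, either view can be thrown away and recomputed from scratch if the code that built it turns out to be wrong.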
Implementing the Lambda Architecture
How about an integrated approach?
• Twitter Summingbird
• Lambdoop
• Apache Spark
Apache Spark
Apache Spark—a unified platform …
spark.apache.org
Continued innovation bringing new functionality, such as:
• Tachyon (Shared RDDs, off-heap solution)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
• Spark SQL (SQL/HQL)
• Spark Streaming (stream processing)
• MLlib (machine learning)
• GraphX (graph processing)
• Spark (core execution engine, RDDs)
• Cluster managers: Mesos, YARN, Standalone
• File system (local, MapR-FS, HDFS, S3) or data store (HBase, Elasticsearch, etc.)
Easy and fast Big Data
• Easy to Develop
– Rich APIs available through Java, Scala, Python
– Interactive shell
– 2-5× less code
• Fast to Run
– Advanced data storage model (automated optimization between memory and disk)
– General execution graphs
– Up to 10× faster on disk, 100× in memory
https://amplab.cs.berkeley.edu/benchmark/
… across multiple datasources
• Local Files
• Object Stores (Amazon S3)
• HDFS
– text files, sequence files, any other Hadoop InputFormat
• Key-Value datastores (HBase, C*)
• Elasticsearch
Easy: expressive API
map reduce
Easy: expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
spark.apache.org/docs/latest/programming-guide.html#transformations
… and scale as you go (mentally and physically)
YARN
Standalone
Resilient Distributed Datasets (RDD)
• RDDs are the core of the Spark execution engine
• Collections of elements that can be operated on in parallel
• Persistent in memory between operations
www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Operations
• Lazy evaluation is key to Spark
• Transformations
– Creation of a new dataset from an existing one: map, filter, distinct, union, sample, groupByKey, join, etc.
• Actions
– Return a value after running a computation: collect, count, first, takeSample, foreach, etc.
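The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. This is an analogy only and does not use Spark: `LazyDataset` and `traced_square` are hypothetical names, and only the method names `map`, `filter`, `collect` and `count` are borrowed from the RDD operations listed above.

```python
from typing import Any, Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source: Callable[[], Iterable[Any]]):
        self._source = source  # evaluated anew for every action

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: trigger the computation and return a value.
    def collect(self) -> List[Any]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls: List[int] = []


def traced_square(x: int) -> int:
    calls.append(x)  # records when the function really runs
    return x * x


numbers = LazyDataset(lambda: range(5))
even_squares = numbers.map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)   # still empty: nothing has run
result = even_squares.collect()     # the action triggers the whole chain
```

Only the call to `collect` makes `traced_square` run; until then the chain of transformations is just a description of work to be done.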
RDD persistence
spark.apache.org/docs/latest/scala-programming-guide.html
Spark Streaming
• High-level language operators for streaming data
• Fault-tolerant semantics
• Support for merging streaming data with historical data
spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming
Run a streaming computation as a series of small, deterministic
batch jobs.
• Chop up live stream into batches of X seconds (DStream)
• Spark treats each batch of data as RDDs and processes them using RDD ops
• Finally, processed results of the RDD operations are returned in batches
Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results
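The micro-batch idea described above can be sketched without Spark: timestamped records are chopped into consecutive intervals, and each interval is handed to an ordinary batch function. The names `micro_batches` and `RECORDS`, and the choice of a per-batch word count, are assumptions made for the illustration.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def micro_batches(records: Iterable[Tuple[float, Any]], batch_seconds: float) -> List[List[Any]]:
    """Group (time_in_seconds, value) records into consecutive batches.

    Intervals without records yield an empty batch, so the output has one
    entry per interval between the first and the last record.
    """
    buckets: Dict[int, List[Any]] = defaultdict(list)
    for t, value in records:
        buckets[int(t // batch_seconds)].append(value)
    if not buckets:
        return []
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


# Illustrative stream: (arrival time in seconds, word).
RECORDS = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (3.5, "c")]

batches = micro_batches(RECORDS, batch_seconds=1.0)
# Each batch is processed with the same deterministic batch job.
results = [dict(Counter(batch)) for batch in batches]
```

Because every batch is processed by the same deterministic function, a lost result can be recomputed from its input batch, which is the basis of the fault-tolerance argument for this model.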
Spark Streaming
• Batch sizes as low as ½ second, latency of about 1 second
• Potential for combining batch processing and streaming processing in the same system
Spark Streaming: Execution
Spark Streaming: transformations
• Stateless transformations
• Stateful transformations
– checkpointing
– windowed transformations
• window duration
• sliding duration
spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
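The two parameters of a windowed transformation can be illustrated on a list of already formed batches. This sketch is plain Python rather than the Spark Streaming API: `windowed` is a hypothetical helper, and both durations are expressed as a number of batches, on the assumption that they are multiples of the batch interval.

```python
from typing import Any, List


def windowed(batches: List[List[Any]], window_len: int, slide_len: int) -> List[List[Any]]:
    """Emit one window every slide_len batches, covering the last window_len batches.

    Early windows are shorter while fewer than window_len batches have arrived.
    """
    windows = []
    for end in range(slide_len, len(batches) + 1, slide_len):
        start = max(0, end - window_len)
        # Flatten the batches that fall inside the current window.
        windows.append([x for batch in batches[start:end] for x in batch])
    return windows


# Six batches holding one number each; window of 3 batches, sliding by 2.
BATCHES = [[1], [2], [3], [4], [5], [6]]
windows = windowed(BATCHES, window_len=3, slide_len=2)
window_sums = [sum(w) for w in windows]
```

A window spans several batches, so the computation has to keep earlier batches around, which is why windowed operations are grouped with the stateful transformations.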
Spark Streaming comparison
• Spark Streaming: 670k records/sec/node
• Storm: 115k records/sec/node
• Commercial systems: 100-500k records/sec/node
Figure: throughput per node (MB/s) versus record size (100 and 1000 bytes) for WordCount and Grep, comparing Spark and Storm.
Where to go from here
The book: Learning Spark
shop.oreilly.com/product/0636920028512.do
Apache Spark developer certificate program
oreilly.com/go/sparkcert
http://lambda-architecture.net
http://spark-stack.org
MapR Sandbox with Spark
mapr.com/blog/getting-started-spark-mapr-sandbox
Conclusion
• Let’s scale systems and humans
• How? Lambda Architecture!
• Apache Spark is an efficient way to implement Lambda Architecture
$50M in Free Training
Q&A
@mhausenblas maprtech
mhausenblas@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Contenu connexe

Tendances

Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaSpark Summit
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 PivotalOpenSourceHub
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovBig Data Spain
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Helena Edelson
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Summit
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoTJim Haughwout
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Stratio
 
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...Nathan Bijnens
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 

Tendances (20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-Shma
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 
Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16 Apache Zeppelin Meetup Christian Tzolov 1/21/16
Apache Zeppelin Meetup Christian Tzolov 1/21/16
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming Spark Summit - Stratio Streaming
Spark Summit - Stratio Streaming
 
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne...
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 

En vedette

Big Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in ActionBig Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in ActionGuido Schmutz
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel
[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel
[USI] Lambda-Architecture : comment réconcilier BigData et temps-réelMathieu DESPRIEE
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itnathanmarz
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...SoftServe
 
Architecture data-flow programmable pour le traitement d’image
Architecture data-flow programmable pour le traitement d’imageArchitecture data-flow programmable pour le traitement d’image
Architecture data-flow programmable pour le traitement d’imageCiprian Teodorov
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Altan Khendup
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
Zeta Architecture: The Next Generation Big Data Architecture
Zeta Architecture: The Next Generation Big Data ArchitectureZeta Architecture: The Next Generation Big Data Architecture
Zeta Architecture: The Next Generation Big Data ArchitectureMapR Technologies
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 

En vedette (20)

Big Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in ActionBig Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in Action
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel
[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel
[USI] Lambda-Architecture : comment réconcilier BigData et temps-réel
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
 
Architecture data-flow programmable pour le traitement d’image
Architecture data-flow programmable pour le traitement d’imageArchitecture data-flow programmable pour le traitement d’image
Architecture data-flow programmable pour le traitement d’image
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Zeta Architecture: The Next Generation Big Data Architecture
Zeta Architecture: The Next Generation Big Data ArchitectureZeta Architecture: The Next Generation Big Data Architecture
Zeta Architecture: The Next Generation Big Data Architecture
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 

Similaire à Implementing the Lambda Architecture efficiently with Apache Spark

Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversityAlex Zeltov
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Juan Pedro Moreno
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftLi Gao
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 

Similaire à Implementing the Lambda Architecture efficiently with Apache Spark (20)

Spark Advanced Analytics NJ Data Science Meetup - Princeton University



Implementing the Lambda Architecture efficiently with Apache Spark

  • 1. Implementing the Lambda Architecture efficiently with Apache Spark (© 2015 MapR Technologies)
  • 2. Polyglot Processing
      • Combination of different processing engines over DFS/NoSQL stores
      • Lambda and Kappa architectures are two prominent examples
      Source: datadventures.ghost.io/2014/07/06/polyglot-processing/
  • 4. Let’s talk about developers…
  • 8. Let’s talk about developers… Human fault tolerance
  • 10. When things go wrong … 2010: unfortunate handling of error condition (http://allfacebook.com/the-real-reason-facebook-went-down-yesterday-its-complicated_b19366)
  • 11. When things go wrong … 2012: cascaded bug (http://money.cnn.com/2012/06/21/technology/twitter-down/index.htm)
  • 12. When things go wrong … 2012: upgrade of batch processing (http://www.v3.co.uk/v3-uk/news/2196577/rbs-takes-gbp125m-hit-over-it-outage)
  • 13. When things go wrong … 2014: bug/bad config (http://www.androidcentral.com/google-explains-reasons-behind-today-s-30-minute-service-outage)
  • 14. Lambda Architecture to the rescue!
  • 15. Let’s step back a bit …
      • Nathan Marz (Backtype, Twitter, stealth startup)
      • Creator of …
        – Storm
        – Cascalog
        – ElephantDB
      Source: manning.com/marz/
  • 16. Lambda Architecture: Requirements
      • Fault-tolerant against both hardware failures and human errors
      • Support a variety of use cases that include low-latency querying as well as updates
      • Linear scale-out capabilities
      • Extensible, so that the system is manageable and can accommodate newer features easily
  • 17. Lambda Architecture: Concept
      • Latency: the time it takes to run a query
      • Timeliness: how up to date the query results are (consistency)
      • Accuracy: tradeoff between performance and scalability (approximations)
      query = function(all data)
  • 18. Lambda Architecture [diagram: the NEW DATA STREAM feeds both the BATCH LAYER (immutable master data; batch recompute precomputes views) and the SPEED LAYER (process stream; real-time increment of views). The SERVING LAYER holds BATCH VIEWS 1…N, the speed layer holds REAL-TIME VIEWS 1…N, and a QUERY performs a MERGE over both.]
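The MERGE step in the diagram can be illustrated with a minimal plain-Python sketch. It is not taken from the deck: the view contents, the `query` function and the `batch_recompute` helper are all hypothetical, and it assumes views are simple per-airport counters that can be added together.

```python
from collections import Counter

# Hypothetical batch view: counts precomputed by the last batch recompute.
batch_view = Counter({"DUB": 31, "HEL": 17})

# Hypothetical real-time view: increments the speed layer has seen since then.
realtime_view = Counter({"DUB": 2, "AMS": 1})


def query(airport: str) -> int:
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[airport] + realtime_view[airport]


def batch_recompute(new_batch_view: Counter) -> None:
    """Swap in a freshly recomputed batch view.

    Simplifying assumption: the new batch view already covers every
    increment held by the speed layer, so those can be discarded.
    """
    global batch_view
    batch_view = new_batch_view
    realtime_view.clear()
```

A real system would only discard the increments that the finished batch run has actually absorbed; the sketch ignores that bookkeeping.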
  • 20. Lambda Architecture: Immutable Data + Views (openflights.org)
  • 21. Lambda Architecture: Immutable Data + Views. Immutable master dataset (built up one appended event at a time on the slide):
      timestamp            airport  flight  action
      2014-01-01T10:00:00  DUB      EI123   take-off
      2014-01-01T10:05:00  HEL      SAS45   take-off
      2014-01-01T10:07:00  AMS      BA99    take-off
      2014-01-01T10:09:00  LHR      LH17    landing
      2014-01-01T10:10:00  CDG      AF03    landing
      2014-01-01T10:10:00  FCO      AZ501   take-off
  • 22. Lambda Architecture: Immutable Data + Views. Views derived from the immutable master dataset:
      airport load:
        airport  planes
        AMS      69
        CDG      44
        DUB      31
        FCO      10
        HEL      17
        LHR      101
      air-borne per airline:
        airline  planes
        AF       59
        AZ       23
        BA       167
        EI       19
        LH       201
        SAS      28
      air-borne: 2307
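The idea that a view is a function of the immutable master dataset (query = function(all data)) can be sketched in plain Python. This is an illustration only: it uses just the six events listed on the slide, so it yields the net change in air-borne planes over that excerpt, not the totals shown in the views. The `airline` helper assumes the airline code is the flight number minus its trailing digits.

```python
from collections import Counter

# The six events shown on the slide: (timestamp, airport, flight, action).
# Events are only ever appended, never updated or deleted.
MASTER = [
    ("2014-01-01T10:00:00", "DUB", "EI123", "take-off"),
    ("2014-01-01T10:05:00", "HEL", "SAS45", "take-off"),
    ("2014-01-01T10:07:00", "AMS", "BA99", "take-off"),
    ("2014-01-01T10:09:00", "LHR", "LH17", "landing"),
    ("2014-01-01T10:10:00", "CDG", "AF03", "landing"),
    ("2014-01-01T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline(flight: str) -> str:
    """Assumed convention: airline code = flight number without trailing digits."""
    return flight.rstrip("0123456789")


def airborne_delta_per_airline(events) -> Counter:
    """View recomputed from all events: +1 per take-off, -1 per landing."""
    view = Counter()
    for _timestamp, _airport, flight, action in events:
        view[airline(flight)] += 1 if action == "take-off" else -1
    return view


def airborne_delta(events) -> int:
    """Total view: net change in air-borne planes over the given events."""
    return sum(airborne_delta_per_airline(events).values())
```

Because the views are pure functions of the event list, a buggy view can be fixed by correcting the function and recomputing; the master data itself is never touched.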
  • 23. Implementing the Lambda Architecture
  • 26. How about an integrated approach?
      • Twitter Summingbird
      • Lambdoop
      • Apache Spark
  • 28. Apache Spark: a unified platform … (spark.apache.org)
      • Libraries: Spark SQL (SQL/HQL), Spark Streaming (stream processing), MLlib (machine learning), GraphX (graph processing)
      • Spark (core execution engine: RDDs)
      • Cluster managers: Mesos, YARN, Standalone
      • Storage: file system (local, MapR-FS, HDFS, S3) or data store (HBase, Elasticsearch, etc.)
      • Continued innovation bringing new functionality, such as:
        – Tachyon (shared RDDs, off-heap solution)
        – BlinkDB (approximate queries)
        – SparkR (R wrapper for Spark)
  • 29. Easy and fast Big Data
      • Easy to develop
        – Rich APIs available through Java, Scala, Python
        – Interactive shell
      • Fast to run
        – Advanced data storage model (automated optimization between memory and disk)
        – General execution graphs
      • 2-5× less code; up to 10× faster on disk, 100× in memory
      Source: https://amplab.cs.berkeley.edu/benchmark/
  • 30. … across multiple data sources
      • Local files
      • Object stores (Amazon S3)
      • HDFS: text files, sequence files, any other Hadoop InputFormat
      • Key-value datastores (HBase, C*)
      • Elasticsearch
  • 33. … and scale as you go (mentally and physically): YARN, Standalone
  • 34. Resilient Distributed Datasets (RDD)
      • RDDs are the core of the Spark execution engine
      • Collections of elements that can be operated on in parallel
      • Persistent in memory between operations
      Source: www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 35. RDD Operations
      • Lazy evaluation is key to Spark
      • Transformations: create a new dataset from an existing one (map, filter, distinct, union, sample, groupByKey, join, etc.)
      • Actions: return a value after running a computation (collect, count, first, takeSample, foreach, etc.)
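The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. This is an analogy, not Spark's API: `LazyDataset` is a hypothetical stand-in that only borrows the method names `map`, `filter`, `collect` and `count`, and it runs on a single machine.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations record work, actions run it."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection, e.g. a list
        self._steps = steps    # tuple of (kind, function) pairs, applied in order

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        """Apply the recorded steps to each element, one element at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the element
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())
```

Chaining `map` and `filter` only builds up the list of steps; no function is called on the data until `collect` or `count` is invoked, which is the behaviour the slide calls lazy evaluation.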
  • 37. Spark Streaming
      • High-level language operators for streaming data
      • Fault-tolerant semantics
      • Support for merging streaming data with historical data
      Source: spark.apache.org/docs/latest/streaming-programming-guide.html
  • 38. Spark Streaming: run a streaming computation as a series of small, deterministic batch jobs
      • Chop up the live stream into batches of X seconds (DStream)
      • Spark treats each batch of data as RDDs and processes them using RDD operations
      • Finally, the processed results of the RDD operations are returned in batches
      [diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
  • 39. Spark Streaming: run a streaming computation as a series of small, deterministic batch jobs
      • Batch sizes as low as ½ second, latency of about 1 second
      • Potential for combining batch processing and streaming processing in the same system
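The micro-batch idea (chop the stream into batches of X seconds, then run an ordinary batch job on each) can be simulated offline in plain Python. This sketch is not Spark Streaming code: `micro_batches` and `word_count` are hypothetical helpers, records are assumed to be (timestamp in seconds, text) pairs, and the whole stream is assumed to be available up front.

```python
from collections import defaultdict


def micro_batches(records, batch_seconds):
    """Group (timestamp_seconds, value) records into fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_seconds)].append(value)
    # Yield (batch start time, values) in time order.
    # Intervals that received no data are simply skipped in this sketch.
    for index in sorted(batches):
        yield index * batch_seconds, batches[index]


def word_count(lines):
    """An ordinary batch job, reused unchanged for every micro-batch."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return dict(counts)
```

Since `word_count` takes any collection of lines, the same function can be applied to one micro-batch or to the complete historical data set, which is the kind of code reuse between batch and streaming that the slide points to.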
  • 41. Spark Streaming: transformations
      • Stateless transformations
      • Stateful transformations
        – checkpointing
        – windowed transformations (window duration, sliding duration)
      Source: spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
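How window duration and sliding duration interact can be shown with a small plain-Python sketch over per-batch totals. It is an illustration, not the Spark Streaming API: `windowed_sums` is a hypothetical helper, and both durations are expressed as a number of batches, mirroring the Spark Streaming requirement that they be multiples of the batch interval.

```python
def windowed_sums(batch_totals, window_batches, slide_batches):
    """Sum per-batch totals over a sliding window.

    window_batches: window duration, as a number of batches.
    slide_batches:  sliding duration, as a number of batches.
    """
    results = []
    # Emit one result every `slide_batches` batches,
    # covering at most the last `window_batches` batches.
    for end in range(slide_batches, len(batch_totals) + 1, slide_batches):
        start = max(0, end - window_batches)
        results.append(sum(batch_totals[start:end]))
    return results
```

With a window of 3 batches and a slide of 2, consecutive windows overlap by one batch, so some data is counted in two results. That overlap is why windowed transformations are stateful.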
  • 42. Spark Streaming comparison
      • Spark Streaming: 670k records/sec/node
      • Storm: 115k records/sec/node
      • Commercial systems: 100-500k records/sec/node
      [charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep, Spark vs. Storm]
  • 43. Where to go from here
  • 44. The book: Learning Spark shop.oreilly.com/product/0636920028512.do
  • 45. Apache Spark developer certificate program oreilly.com/go/sparkcert
  • 48. MapR Sandbox with Spark mapr.com/blog/getting-started-spark-mapr-sandbox
  • 49. Conclusion
      • Let’s scale systems and humans
      • How? Lambda Architecture!
      • Apache Spark is an efficient way to implement the Lambda Architecture
  • 51. Q&A. Contact: @mhausenblas, mhausenblas@mapr.com. Engage with us: maprtech, mapr-technologies

Editor's notes

  1. Fault-tolerant: can enforce “exactly once” operations on incoming data streams. The same Spark code works with streaming data sets and with static DFS data sets. You can very easily write applications that test incoming data streams against historical data.