The Future of Real-Time in Spark

Reynold Xin, an Apache Spark PMC member and Chief Architect for Spark at Databricks, outlines the future of real-time in Apache Spark.

1. The Future of Real-Time in Spark
   Reynold Xin @rxin
   Spark Summit, New York, Feb 18, 2016
2. Why Real-Time?
   Making decisions faster is valuable.
   • Preventing credit card fraud
   • Monitoring industrial machinery
   • Human-facing dashboards
   • …
3. Streaming Engine
   Noun. Takes an input stream and produces an output stream.
4. Spark Unified Stack
   [Diagram: SQL, Streaming, MLlib, and GraphX on top of Spark Core]
5. Spark Unified Stack
   [Diagram: the same stack, with the Streaming component called out]
   Streaming was introduced 3 years ago in Spark 0.7.
   50% of users consider it the most important part of Spark.
6. Spark Streaming
   • First attempt at unifying streaming and batch
   • State management built in
   • Exactly-once semantics
   • Features required for large clusters
     • Straggler mitigation, dynamic load balancing, fast fault recovery
7. Streaming computations don’t run in isolation.
8. Use Case: Fraud Detection
   [Diagram: STREAM → ANOMALY]
   A machine learning model continuously updates to detect new anomalies.
   Ad-hoc analysis of historic data.
9. Continuous Application
   noun. An end-to-end application that acts on real-time data.
10. Challenges Building Continuous Applications
    Integration with non-streaming systems is often an afterthought
    • Interactive, batch, relational databases, machine learning, …
    Streaming programming models are complex
11. Integration Example
    A streaming engine reads a stream of page-view events:
      (home.html, 10:08)
      (product.html, 10:09)
      (home.html, 10:10)
      . . .
    and maintains per-minute visit counts in MySQL:
      Page      Minute   Visits
      home      10:09    21
      pricing   10:10    30
      ...       ...      ...
    What can go wrong?
    • Late events
    • Partial outputs to MySQL
    • State recovery on failure
    • Distributed reads/writes
    • ...
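
To make these failure modes concrete, here is a minimal, hypothetical sketch of the kind of hand-rolled integration the slide describes. It is not from the talk: the event stream is hard-coded, and SQLite stands in for MySQL, with autocommit on so each row is written independently, as naive integration code usually does.

    import sqlite3
    from collections import defaultdict

    # Stand-in for MySQL: an embedded SQLite table of per-minute visit counts.
    # isolation_level=None means every INSERT commits on its own (autocommit).
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE visits (page TEXT, minute TEXT, visits INT, "
               "PRIMARY KEY (page, minute))")

    counts = defaultdict(int)      # in-memory state: lost on failure, no recovery story
    current_minute = None

    events = [                     # simulated stream of (page, event-time minute) records
        ("home.html", "10:08"), ("product.html", "10:09"), ("home.html", "10:10"),
        ("home.html", "10:09"),    # a late event for a minute that was already flushed
    ]

    for page, minute in events:
        if current_minute is not None and minute > current_minute:
            # Flush closed minutes row by row. A crash inside this loop leaves partial
            # output in the database, with no way to resume from where we stopped.
            for (p, m), v in list(counts.items()):
                if m <= current_minute:
                    db.execute("INSERT OR REPLACE INTO visits VALUES (?, ?, ?)", (p, m, v))
                    del counts[(p, m)]
        current_minute = max(current_minute or minute, minute)
        counts[(page, minute)] += 1  # the late event only updates memory; the sink never sees it

    print(db.execute("SELECT page, minute, visits FROM visits ORDER BY minute").fetchall())

The point of the next slides is that this bookkeeping (late data, partial writes, state recovery) should be owned by the engine, not by application code.
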
12. Complex Programming Models
    Data: late arrival, varying distribution over time, …
    Processing: business logic changes & new operators (windows, sessions)
    Output: how do we define output over time & correctness?
13. Structured Streaming
14. The simplest way to perform streaming analytics is not having to reason about streaming.
15. Single API!
    Spark 1.3: static DataFrames
    Spark 2.0: infinite DataFrames
16. Structured Streaming
    High-level streaming API built on the Spark SQL engine
    • Runs the same queries on DataFrames
    • Event time, windowing, sessions, sources & sinks
    Unifies streaming, interactive and batch queries
    • Aggregate data in a stream, then serve using JDBC
    • Change queries at runtime
    • Build and apply ML models
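
The "runs the same queries on DataFrames" point is easiest to see in code. The sketch below is not from the talk; it uses the readStream/writeStream API that later shipped, with placeholder paths, to show a single query definition applied to both a static and a streaming DataFrame.

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("single-api").getOrCreate()

    def views_per_page(df: DataFrame) -> DataFrame:
        # One logical query, written once, usable on static or streaming input.
        return df.groupBy("page").count()

    static_df = spark.read.json("s3://logs/history/")        # placeholder batch input
    streaming_df = (spark.readStream
                    .schema(static_df.schema)                # file streams need an explicit schema
                    .json("s3://logs/incoming/"))            # placeholder streaming input

    print(static_df.isStreaming, streaming_df.isStreaming)   # False, True

    batch_result = views_per_page(static_df)                 # finite result: show() or collect()
    stream_query = (views_per_page(streaming_df)             # unbounded result: a streaming query
                    .writeStream
                    .outputMode("complete")
                    .format("memory").queryName("views")     # in-memory table, queryable with spark.sql
                    .start())

Serving the aggregate over JDBC, as the slide suggests, amounts to exposing a continuously maintained table like "views" through a SQL endpoint while the streaming query keeps it up to date.
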
17. Model
    Trigger: every 1 sec
    [Diagram: at each trigger t = 1, 2, 3, the input is all data received up to t, the query runs over that input, and the complete output (the result for all data so far) is emitted.]
18. Model
    Trigger: every 1 sec
    [Diagram: the same setup, but each trigger emits only the delta output, i.e. the output for the data that arrived since the previous trigger.]
19. Model Details
    Input sources: append-only tables
    Queries: new operators for windowing, sessions, etc.
    Triggers: based on time (e.g. every 1 sec)
    Output modes: complete, deltas, update-in-place
20. Example: ETL
    Input: files in S3
    Query: map (transform each record)
    Trigger: “every 5 sec”
    Output mode: “new records”, into an S3 sink
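
A rough sketch of this ETL in the Structured Streaming API that later shipped (readStream/writeStream). The schema, S3 paths, and checkpoint location below are placeholders, not details from the talk.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

    # Input: JSON files landing under an S3 prefix (file streams need an explicit schema).
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("url", StringType()),
        StructField("time", LongType()),
    ])
    raw = spark.readStream.schema(schema).json("s3://logs/input/")

    # Query: transform each record (a simple map-like projection).
    cleaned = raw.select(col("user_id"), upper(col("url")).alias("url"), col("time"))

    # Trigger every 5 seconds; append mode emits only new records, here into an S3 sink.
    query = (cleaned.writeStream
             .format("json")
             .option("path", "s3://logs/output/")
             .option("checkpointLocation", "s3://logs/checkpoints/etl/")
             .outputMode("append")
             .trigger(processingTime="5 seconds")
             .start())
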
21. Example: Page View Count
    Input: records in Kafka
    Query: select count(*) group by page, minute(evtime)
    Trigger: “every 5 sec”
    Output mode: “update-in-place”, into a MySQL sink
    Note: this will automatically update “old” records on late data!
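
A sketch of the page view count in the same shipped API, again with assumed details: the Kafka broker, topic, and event schema are placeholders, and a console sink stands in for the MySQL sink (a JDBC target would normally be fed through foreachBatch, as in the sketch after slide 24).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, window
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("pageview-counts").getOrCreate()

    event_schema = StructType([
        StructField("page", StringType()),
        StructField("evtime", TimestampType()),
    ])

    # Input: records in Kafka (requires the spark-sql-kafka connector on the classpath).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "pageviews")
              .load()
              .select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    # Query: count(*) grouped by page and the minute of the event time.
    counts = events.groupBy(col("page"), window(col("evtime"), "1 minute")).count()

    # Trigger every 5 seconds; update mode re-emits only the rows whose counts changed,
    # which is how late data updates counts for "old" minutes.
    query = (counts.writeStream
             .outputMode("update")
             .format("console")                    # stand-in for a MySQL/JDBC sink
             .trigger(processingTime="5 seconds")
             .start())
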
22. Logically: DataFrame operations on static data (i.e. as easy to understand as batch)
    Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)
    [Diagram: DataFrame → logical plan → Catalyst optimizer → continuous, incremental execution]
23. Example: Batch Aggregation

    logs = ctx.read.format("json").open("s3://logs")

    logs.groupBy(logs.user_id).agg(sum(logs.time))
        .write.format("jdbc")
        .save("jdbc:mysql://...")
24. Example: Continuous Aggregation

    logs = ctx.read.format("json").stream("s3://logs")

    logs.groupBy(logs.user_id).agg(sum(logs.time))
        .write.format("jdbc")
        .stream("jdbc:mysql://...")
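
The read(...).open() and .stream() calls on slides 23 and 24 show the API as it was proposed at the time of the talk. For comparison, here is a rough sketch of the same batch/continuous pair in the API that shipped (spark.read / spark.readStream). The JDBC URL, table, schema, and checkpoint path are placeholders, a MySQL JDBC driver is assumed to be on the classpath, and the streaming write goes through foreachBatch because Spark does not provide a built-in streaming JDBC sink.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sum as sum_
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("aggregation").getOrCreate()
    jdbc_opts = {"url": "jdbc:mysql://db:3306/metrics", "dbtable": "user_time"}  # placeholders

    # Batch version: read once, aggregate, save over JDBC.
    logs = spark.read.format("json").load("s3://logs")
    (logs.groupBy("user_id").agg(sum_("time").alias("total_time"))
         .write.format("jdbc").options(**jdbc_opts).mode("overwrite").save())

    # Continuous version: the same aggregation over a streaming read of the same prefix.
    schema = StructType([StructField("user_id", StringType()), StructField("time", LongType())])
    stream_logs = spark.readStream.schema(schema).format("json").load("s3://logs")

    def write_batch(batch_df, batch_id):
        # Each trigger hands the updated aggregates to an ordinary batch JDBC write.
        batch_df.write.format("jdbc").options(**jdbc_opts).mode("overwrite").save()

    query = (stream_logs.groupBy("user_id").agg(sum_("time").alias("total_time"))
             .writeStream
             .outputMode("complete")
             .foreachBatch(write_batch)
             .option("checkpointLocation", "s3://checkpoints/agg/")
             .trigger(processingTime="5 seconds")
             .start())

The aggregation itself is identical in both versions; only how the input is read and how the output is delivered differ.
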
25. Automatic Incremental Execution
    [Diagram: the aggregation is re-run incrementally at T = 0, T = 1, T = 2, …]
26. Rest of Spark will follow
    • Interactive queries should just work
    • Spark’s data source API will be updated to support seamless streaming integration
    • Exactly-once semantics end-to-end
    • Different output modes (complete, delta, update-in-place)
    • ML algorithms will be updated too
27. What can we do with this that’s hard with other engines?
    Ad-hoc, interactive queries
    Dynamically changing queries
    Benefits of Spark: elastic scaling, straggler mitigation, etc.
28. Use Case: Fraud Detection
    [Diagram: STREAM → ANOMALY]
    A machine learning model continuously updates to detect new anomalies.
    Analyze historic data.
29. Timeline
    Spark 2.0
    • API foundation
    • Kafka, file systems, and databases
    • Event-time aggregations
    Spark 2.1+
    • Continuous SQL
    • BI app integration
    • Other streaming sources/sinks
    • Machine learning
30. Thank you. @rxin