2. 2
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Doing Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fanboy, runner

Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
3. 3
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
7. 7
When to stream, and when not to
• Real-time: constant low milliseconds and under
• Near real-time: low milliseconds to seconds; delay in case of failures
• Batch: tens of seconds or more; re-run in case of failures
9. 9
No free lunch
• Real-time: constant low milliseconds and under
• Near real-time: low milliseconds to seconds; delay in case of failures
• Batch: tens of seconds or more; re-run in case of failures
The lower the latency, the more “difficult” the architecture; the higher the latency, the “easier” the architecture
14. 14
But there are multiple sources
[Diagram: Source Systems 1–3 each ingest into a streaming engine, which feeds the destination system]
15. 15
But…
• Sources, sinks, and ingestion channels may go down
• Sources and sinks produce/consume at different rates (buffering is needed)
• Regular maintenance windows may need to be scheduled
• You need a resilient message broker (pub/sub)
16. 16
Need for a message broker
[Diagram: Source Systems 1–3 ingest into a message broker; the streaming engine extracts from the broker and pushes to the destination system]
18. 18
Destination systems
[Diagram: Source Systems 1–3 ingest into a message broker; the streaming engine extracts and pushes to the destination system]
Most common “destination” is a storage system
19. 19
Architecture diagram with a broker
[Diagram: Source Systems 1–3 ingest into a message broker; the streaming engine extracts and pushes to a storage system]
20. 20
Streaming engines
[Diagram: Source Systems 1–3 ingest into a message broker via Kafka Connect or Apache Flume; a streaming engine such as Apache Beam (incubating) extracts and pushes to the storage system]
21. 21
Storage options
[Diagram: Source Systems 1–3 ingest into a message broker via Kafka Connect or Apache Flume; the streaming engine extracts and pushes to the storage system, whose options are highlighted here]
23. 23
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
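Since at-least-once is the easiest to achieve, a common pattern is to pair it with downstream deduplication to get effectively-exactly-once results. A minimal, broker-free Python sketch of that idea (all names here are hypothetical):

```python
# Sketch: at-least-once delivery produces duplicates on retry;
# the consumer deduplicates on a record ID to process each record once.
def deliver_at_least_once(records, consumer, fail_after=None):
    """Simulate a producer that resends the batch after a lost ack."""
    sent = []
    for i, rec in enumerate(records):
        sent.append(rec)
        if fail_after is not None and i == fail_after:
            sent.extend(records)  # ack was lost: whole batch is resent
            break
    for rec in sent:
        consumer(rec)

class DedupingConsumer:
    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def __call__(self, record):
        if record["id"] in self.seen_ids:  # duplicate from a retry
            return
        self.seen_ids.add(record["id"])
        self.processed.append(record["value"])

records = [{"id": i, "value": i * 10} for i in range(3)]
consumer = DedupingConsumer()
deliver_at_least_once(records, consumer, fail_after=1)
print(consumer.processed)  # each record processed exactly once: [0, 10, 20]
```

The trade-off from the slide shows up directly: the delivery side stays cheap (just retry), and the cost of exactly-once semantics moves into the consumer's dedup state.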
25. 25
Semantics of our architecture
[Diagram: Source Systems 1–3 ingest into the message broker at least once; the broker is ordered and partitioned; the streaming engine extracts at least once; the semantics of the engine itself and of the push to the destination system are “it depends”]
27. 27
Streaming architecture for ingestion
[Diagram: Source Systems 1–3 ingest into a message broker via Kafka Connect or Apache Flume; a streaming ingestion process extracts and pushes to the storage system]
Can be used to do simple transformations
28. 28
Ingestion and/or Transformation
1. Zero Transformation
– No transformation, plain ingest, no schema validation
– Keep the original format: SequenceFiles, text, etc.
– Allows storing data that may have schema errors
2. Format Transformation
– Simply change the format of the data
– e.g. to a structured format like Avro, which does schema validation
3. Enrichment Transformation
– Atomic
– Contextual
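As a sketch of a format transformation, the snippet below parses delimited text into a structured record and type-checks it against a schema. The dict-based schema is a hypothetical stand-in for a real Avro schema:

```python
# Sketch: format transformation from plain delimited text to a
# structured, schema-validated record (dict schema stands in for Avro).
SCHEMA = {"user_id": int, "zip_code": str, "amount": float}  # hypothetical

def to_structured(line):
    fields = line.strip().split(",")
    names = list(SCHEMA)
    if len(fields) != len(names):
        raise ValueError("wrong field count: %r" % line)
    record = {}
    for name, raw in zip(names, fields):
        record[name] = SCHEMA[name](raw)  # raises on a schema violation
    return record

print(to_structured("42,94105,9.99"))
# {'user_id': 42, 'zip_code': '94105', 'amount': 9.99}
```

Note the contrast with zero transformation: here a malformed line fails at ingest time instead of being stored with errors.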
29. 29
#3 - Enrichment transformations
Atomic
• Needs to work with one event at a time
• Mask a credit card number
• Add processing time or offset to the record
Contextual
• Needs to refer to external context
• Example: convert zip code to state by looking up a cache
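Both enrichment kinds can be sketched in a few lines of Python; the zip-to-state dict is a hypothetical stand-in for the cached dimension data:

```python
# Sketch of the two enrichment kinds:
#  - atomic: needs only the event itself (mask a credit card number)
#  - contextual: needs an external lookup (zip code -> state, via a cache)
ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}  # hypothetical cached dim data

def mask_card(event):                       # atomic enrichment
    cc = event["cc"]
    event["cc"] = "*" * (len(cc) - 4) + cc[-4:]
    return event

def add_state(event, cache=ZIP_TO_STATE):   # contextual enrichment
    event["state"] = cache.get(event["zip"], "UNKNOWN")
    return event

event = {"cc": "4111111111111111", "zip": "94105"}
event = add_state(mask_card(event))
print(event["cc"], event["state"])  # ************1111 CA
```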
32. 32
Where to store the context
1. Locally Broadcast Cached Dim Data
– Local to Process (On Heap, Off Heap)
– Local to Node (Off Process)
2. Partitioned Cache
– Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
33. 33
#1a - Locally broadcast cached data
Could be on heap or off heap
34. 34
#1b - Off process cached data
Data is cached on the node, outside of the process, potentially in an external system like RocksDB
35. 35
#2 - Partitioned cache data
Data is partitioned based on field(s) and then cached
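A toy sketch of a partitioned cache: dimension data is split by a partitioner so each partition caches only its own key range, and events are shuffled to the partition that holds their key. The dicts stand in for per-executor caches and the partitioner is a simplified stand-in for a real shuffle:

```python
# Sketch: shuffle by key so each partition only caches the slice of
# dimension data for its own key range.
NUM_PARTITIONS = 2

def partition_of(key, n=NUM_PARTITIONS):
    # Simplified, stable partitioner (a real shuffle uses a hash partitioner)
    return sum(key.encode()) % n

dim_data = {"94105": "CA", "10001": "NY", "60601": "IL"}  # hypothetical

# Each partition caches only its slice of the dimension data.
caches = [dict() for _ in range(NUM_PARTITIONS)]
for zip_code, state in dim_data.items():
    caches[partition_of(zip_code)][zip_code] = state

def enrich(event):
    # The shuffle routes the event to the partition holding its key.
    p = partition_of(event["zip"])
    event["state"] = caches[p].get(event["zip"], "UNKNOWN")
    return event

print(enrich({"zip": "10001"})["state"])  # NY
```

The design trade-off versus options #1 and #3: each node holds only a fraction of the cache, at the cost of a shuffle to move events to the right partition.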
64. 64
We started with Lambda
[Diagram: a pipe feeds both a speed layer and a batch layer; each persists its results (speed results, batch results) into a serving layer]
65. 65
Why did streaming suck?
• Increments with Cassandra
– Double increments
– No strong consistency
• Storm without Kafka
– Not exactly once
– Not even at least once
• Batch would have to re-process EVERY record to remove dups
66. 66
We have come a long way
• We don’t have to use increments any more, and we can have consistency
– HBase
• We can have state in our streaming platform
– Spark Streaming
• We don’t lose data
– Spark Streaming
– Kafka
– Other options
• A full universe of deduping options
– Again, HBase with versions
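The HBase-with-versions point can be sketched with a plain dict standing in for the table: puts keyed by event ID are idempotent, so at-least-once replays overwrite the same cell instead of double-counting the way increments do:

```python
# Sketch: idempotent puts into a key-value store (dict stands in for an
# HBase table). Re-processing the same record overwrites the same cell,
# so replays never double-count, unlike increments.
store = {}

def put(row_key, value):
    store[row_key] = value          # idempotent: same key -> same cell

def replayable_stream():
    events = [("evt-1", 10), ("evt-2", 20)]
    yield from events
    yield from events               # at-least-once replay after a failure

for key, amount in replayable_stream():
    put(key, amount)

print(sum(store.values()))  # 30, not 60: the replay did not double-count
```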
70. 70
Advanced Streaming
• Ad-hoc analysis identifies value
• Ad-hoc becomes batch
• The value demands lower latency from batch
• Batch becomes streaming
71. 71
Advanced Streaming
• Requirements for an ideal batch-to-streaming framework
• Something that can span both paradigms
• Something that can use the tools of ad-hoc analysis
• SQL
• MLlib
• R
• Scala
• Java
• Development through a common IDE
• Debugging
• Unit testing
• Common deployment model
72. 72
Advanced Streaming
• In Spark Streaming
• A DStream is a sequence of RDDs, one per micro-batch interval
• If we can access the RDDs in Spark Streaming
• We can convert to Vectors
• KMeans
• Principal component analysis
• We can convert to LabeledPoints
• NaiveBayes
• Random Forest
• Linear Support Vector Machines
• We can convert to DataFrames
• SQL
• R
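The underlying idea, that a DStream is just a sequence of RDDs so batch code can run once per interval, can be sketched without Spark (lists stand in for RDDs, and `batch_job` is a hypothetical stand-in for an MLlib or DataFrame operation):

```python
# Sketch: a DStream modeled as a sequence of micro-batches (lists stand
# in for RDDs). Any batch-style function can be applied per interval,
# which is how DStreams reuse batch and MLlib code.
def dstream(intervals):
    for batch in intervals:         # one "RDD" per micro-batch interval
        yield batch

def batch_job(rdd):
    # Hypothetical stand-in for an MLlib or DataFrame operation on one RDD
    return sum(rdd) / len(rdd)

micro_batches = [[1, 2, 3], [10, 20, 30]]
results = [batch_job(rdd) for rdd in dstream(micro_batches)]
print(results)  # [2.0, 20.0]
```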