Twitter Storm: Ereignisverarbeitung in Echtzeit (Real-Time Event Processing)

2013 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN 
Guido Schmutz 
Java Forum Stuttgart 2013
4.7.2013

Our Company
Trivadis is a leading provider of IT consulting, system integration, solution engineering and IT services, with a focus on … and … technologies in the D-A-CH region (Germany, Austria, Switzerland).
We deliver our services through our strategic business areas; Trivadis Services takes over the corresponding operation of your IT systems.

With more than 600 IT and subject-matter experts at your site
11 Trivadis branch offices with more than 600 employees
200 service level agreements
More than 4,000 training participants
Research and development budget: CHF 5.0 million / EUR 4 million
Financially independent and sustainably profitable
Experience from more than 1,900 projects per year for more than 800 customers
As of 12/2012
Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart

Guido Schmutz
•  Working for Trivadis for more than 16 years
•  Oracle ACE Director for Fusion Middleware and SOA
•  Co-author of several books
•  Consultant, trainer and software architect for Java, Oracle, SOA, EDA, BigData and FastData
•  Member of the Trivadis Architecture Board
•  Technology Manager @ Trivadis
•  More than 20 years of software development experience
•  Contact: guido.schmutz@trivadis.com
•  Blog: http://guidoschmutz.wordpress.com
•  Twitter: gschmutz

Big Data Definition (Gartner et al.)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
•  Volume: Tera-, Peta-, Exa-, Zetta-, Yottabytes and constantly growing
•  Velocity: "Traditional" computing in RDBMS is not scalable enough. We search for "linear scalability"
•  Variety: "Only … structured information is not enough"; "95% of produced data is unstructured"
+ Veracity (IBM): information uncertainty
+ Time to action? Big Data + Event Processing = Fast Data


Velocity
§  Velocity requirement examples:
§  Recommendation Engine
§  Predictive Analytics
§  Marketing Campaign Analysis
§  Customer Retention and Churn Analysis
§  Social Graph Analysis
§  Capital Markets Analysis
§  Risk Management
§  Rogue Trading
§  Fraud Detection
§  Retail Banking
§  Network Monitoring
§  Research and Development

Agenda
1.  What is Twitter Storm
2.  Main Concepts
3.  Trident
4.  BigData and FastData combined – Lambda Architecture
5.  Summary

What is Twitter Storm?
A platform for doing analysis on streams of data as they come in, so you
can react to data as it happens.
•  A highly distributed real-time computation system
•  Provides general primitives to do real-time computation
•  Simplifies working with queues & workers
•  Scalable and fault-tolerant
•  Complementary to Hadoop
Originated at BackType, which was acquired by Twitter in 2011
Open-sourced in late 2011

Hadoop vs. Storm
Storm
•  Real-time processing
•  Topologies run forever
•  Stateless nodes
•  Scalable
•  Guarantees no data loss
•  Open source
=> Fast, reactive, real-time processing
Hadoop
•  Batch processing
•  Jobs run to completion
•  Stateful nodes
•  Scalable
•  Guarantees no data loss
•  Open source
=> Big batch processing
Both are scalable, guarantee no data loss and are open source; they differ in processing model, job lifetime and node state.

Idea: Simplify dealing with queues & workers
•  Scaling is painful: queue partitioning & worker deployment
•  Operational overhead: worker failures & queue backups
•  No guarantees on data processing

What does Storm provide?
§  At-least-once message processing
§  Fault tolerance
§  Horizontal scalability
§  No intermediate queues
§  Less operational overhead
§  "Just works"
§  Broad set of use cases:
§  Stream processing
§  Continuous computation
§  Distributed RPC


Main Concepts – Shown based on Twitter example
Show the tweets related to Java Forum Stuttgart (use hashtag #JFS2013) and show
the count of the hashtags used
[Figure: hashtag counts for the matching tweets, not showing #JFS2013]

Main Concepts - Streams
Stream
•  an unbounded sequence of tuples
•  core abstraction in Storm
•  defined with a schema that names the fields in the tuple
•  values must be serializable
•  every stream has an id
Topology
•  graph where each node is a spout or bolt
•  edges indicate which bolt subscribes to which streams
[Figure: example topology. Two spouts are the sources of streams A and B. One bolt subscribes to A and emits C; one bolt subscribes to A and emits D; one bolt subscribes to A & B and emits nothing; one bolt subscribes to C & D and emits nothing. A stream is drawn as a sequence of tuples (T).]

Main Concepts – Worker, Executor, Task
Worker
•  executes a subset of a topology; may run one or more executors for one or
more components
Executor
•  thread spawned by a worker
Task
•  performs the actual data processing
[Figure: a worker process containing several tasks]

Main Concepts - Spout
A spout is the component which acts as the source of streams in a
topology
Generally spouts read tuples from an external source and emit them into
the topology
Spouts can emit more than one stream
Step 1: Use a spout to subscribe to the Twitter Filter Streaming API to get
all the tweets containing the hashtag #JFS2013
[Figure: Twitter Stream → Tweet Stream spout → (tweet)]

Main Concepts - Spout
[Code listing: Tweet Stream spout, annotated as follows]
•  open: called when a task for this component is initialized
•  nextTuple: by calling this method, Storm requests that the spout emit tuples to the output collector
•  declareOutputFields: used to declare streams and their schemas. It is possible to declare
several streams and to specify the one to use when emitting tuples
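The lifecycle described by these annotations can be modelled in a few lines of plain Python. This is only a sketch and does not use Storm's API: the class `TweetListSpout`, the snake_case method names and the list used as output collector are assumptions made for illustration.

```python
from collections import deque

class TweetListSpout:
    """Hypothetical spout model that replays a fixed list of tweets."""

    def __init__(self, tweets):
        self._tweets = list(tweets)

    def declare_output_fields(self):
        # Schema of the single output stream: one field named "tweet".
        return ["tweet"]

    def open(self, collector):
        # Called once, when the task for this component is initialized.
        self._collector = collector
        self._pending = deque(self._tweets)

    def next_tuple(self):
        # Called repeatedly; emits at most one tuple per call, or nothing if idle.
        if self._pending:
            self._collector.append((self._pending.popleft(),))

collector = []  # stand-in for the output collector
spout = TweetListSpout(["first #JFS2013 tweet", "second #storm tweet"])
spout.open(collector)
for _ in range(3):  # the third call finds nothing left to emit
    spout.next_tuple()
print(spout.declare_output_fields(), collector)
```

A real spout would read from the Twitter Streaming API instead of a fixed list; the point here is only the split between one-time initialization and the repeated request for the next tuple.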

Main Concepts - Bolt
A bolt does the processing of the messages
•  filtering, functions, aggregations, joins, talking to databases, …
•  Can do simple stream transformations
•  Complex transformations often require multiple steps and therefore multiple
bolts
Bolts can emit one or more streams
Step 2: Use a bolt to retrieve the hashtags for each tweet
[Figure: Twitter Stream → Tweet Stream spout → (tweet) → Hashtag Splitter bolt → (hashtag)]

[Code listing: Hashtag Splitter bolt, annotated as follows]
•  execute is called for each tuple arriving on the input stream
•  declareOutputFields declares the output fields for the component and
is called when the bolt is created
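The work done per tuple by the Hashtag Splitter can be sketched in plain Python. This is a minimal model, not Storm's bolt API: the function name `split_hashtags`, the regular expression and the lower-casing are assumptions for illustration (the sample walkthrough counts #JFS2013 as jfs2013, which suggests some normalization of this kind).

```python
import re

# Hypothetical helper: models what a hashtag-splitting bolt might do per tuple.
HASHTAG = re.compile(r"#(\w+)")

def split_hashtags(tweet_text):
    """Return one lower-cased hashtag per '#tag' found in the tweet, in order."""
    return [tag.lower() for tag in HASHTAG.findall(tweet_text)]

tweet = "Twitter Storm talk at Java Forum Stuttgart #JFS2013 #storm #fastdata CU there!"
for hashtag in split_hashtags(tweet):
    print(hashtag)  # each value corresponds to one emitted tuple
```

A tweet without hashtags yields an empty list, which corresponds to a bolt that emits nothing for that input tuple.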

Step 3: Use a bolt to count the occurrences of each hashtag in Redis (NoSQL)
[Figure: Twitter Stream → Tweet Stream spout → (tweet) → Hashtag Splitter bolt → (hashtag) → Hashtag Counter bolt]

[Code listing: Hashtag Counter bolt, annotated as follows]
•  prepare is called when the bolt is
created
•  execute is called for each tuple
arriving on the input stream

Main Concepts - Topology
Step 4: Set up the topology with the necessary groupings (distribution)
Each spout or bolt runs N instances in parallel
[Figure: Twitter Stream → Tweet Stream spout → (shuffle grouping) → 2 × Hashtag Splitter → (fields grouping) → 2 × Hashtag Counter]
•  Shuffle grouping: tuples are distributed randomly across the tasks
•  Fields grouping: tuples are grouped by field value, so that equal values always go to the same task
•  All grouping: the stream is replicated to all tasks
•  Global grouping: all tuples go to one single task
•  None grouping: you do not care how the stream is grouped; currently equivalent to shuffle grouping, with the intention that Storm eventually runs such a bolt in the same thread as the bolt/spout it subscribes to
•  Direct grouping: the producer (the task that emits) controls which consumer task will receive the tuple
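How a grouping decides which task receives a tuple can be illustrated with a small plain-Python model. It is a sketch under assumptions, not Storm's implementation: the function names are invented, tasks are identified by index, and `zlib.crc32` merely stands in for whatever stable hash Storm uses internally.

```python
import random
import zlib

def shuffle_grouping(num_tasks, rng):
    """Shuffle grouping: pick a target task at random."""
    return [rng.randrange(num_tasks)]

def fields_grouping(value, num_tasks):
    """Fields grouping: equal field values always map to the same task."""
    # crc32 is an assumed stand-in for a stable hash; Storm's own hashing may differ.
    return [zlib.crc32(value.encode("utf-8")) % num_tasks]

def all_grouping(num_tasks):
    """All grouping: replicate the tuple to every task."""
    return list(range(num_tasks))

def global_grouping():
    """Global grouping: every tuple goes to one single task."""
    return [0]  # modelled here as the task with index 0

rng = random.Random(42)
for hashtag in ["storm", "fastdata", "storm"]:
    print(hashtag,
          "shuffle ->", shuffle_grouping(2, rng),
          "fields ->", fields_grouping(hashtag, 2),
          "all ->", all_grouping(2),
          "global ->", global_grouping())
```

The fields grouping is what makes the hashtag count correct with several counter tasks: every occurrence of the same hashtag reaches the same task, so no two tasks hold partial counts for one hashtag.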

Main Concepts - Topology
[Code listing: topology setup, annotated as follows]
•  Create stream called "tweet-stream"
•  Run this spout with 1 worker
•  Subscribe to stream "tweet-stream" using shuffle grouping
•  Run this bolt with 2 workers
•  Create stream called "hashtag-splitter"
•  Subscribe to stream "hashtag-splitter" using fields grouping on field "hashtag"

Sample – How does it work?
1.  The Tweet Stream spout receives a tweet from the Twitter Stream: My presentation "Twitter Storm: Ereignisverarbeitung in Echtzeit" tomorrow at 15:35 at Java Forum Stuttgart #JFS2013 #storm #fastdata CU there!
2.  A Hashtag Splitter emits one tuple per hashtag: storm, fastdata, jfs2013
3.  The Hashtag Counters issue one Redis INCR per hashtag: INCR jfs2013, INCR storm, INCR fastdata
4.  Resulting counts: storm = 1, fastdata = 1, jfs2013 = 1
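The whole walkthrough can be replayed as a plain-Python simulation. This is a sketch, not Storm code: a dictionary stands in for Redis (its `incr` helper mimics the INCR command, which treats a missing key as 0), and the names `process`, `counter_task_for` and `NUM_COUNTER_TASKS` as well as the crc32-based task selection are assumptions for illustration.

```python
import re
import zlib
from collections import defaultdict

NUM_COUNTER_TASKS = 2  # assumed parallelism of the Hashtag Counter

# Stand-in for Redis: a missing key counts as 0, like the Redis INCR command.
store = defaultdict(int)

def incr(key):
    store[key] += 1
    return store[key]

def split_hashtags(tweet_text):
    """Hashtag Splitter: emit one lower-cased hashtag per '#tag'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet_text)]

def counter_task_for(hashtag):
    """Fields grouping on 'hashtag': the same hashtag always reaches the same task."""
    return zlib.crc32(hashtag.encode("utf-8")) % NUM_COUNTER_TASKS

def process(tweet_text):
    """Run one tweet through splitter and counter; return (task, hashtag, count) triples."""
    results = []
    for hashtag in split_hashtags(tweet_text):
        task = counter_task_for(hashtag)
        results.append((task, hashtag, incr(hashtag)))
    return results

tweet = "Talk tomorrow at 15:35 at Java Forum Stuttgart #JFS2013 #storm #fastdata CU there!"
for task, hashtag, count in process(tweet):
    print(f"counter task {task}: INCR {hashtag} -> {count}")
print(dict(store))
```

Feeding a second tweet with #storm through `process` would raise that counter to 2 while leaving the other two at 1.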

What is Trident?
•  High-level abstraction on top of Storm
•  Makes it easier to build topologies
•  Core data model is the stream
•  Processed as a series of batches
•  Stream is partitioned among nodes in cluster
•  5 kinds of operations in Trident
•  Operations that apply locally to each partition and cause no network
transfer
•  Repartitioning operations that don't change the contents
•  Aggregation operations that do network transfer
•  Operations on grouped streams
•  Merges and joins
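The difference between partition-local operations and aggregations that need network transfer can be illustrated in plain Python. The sketch below does not use the Trident API; the batch contents, the two-partition layout and the function names are invented for illustration.

```python
from collections import Counter

# One batch of a hashtag stream, already split into two partitions (assumed layout).
batch = [
    ["storm", "jfs2013", "storm"],   # partition 0
    ["fastdata", "storm"],           # partition 1
]

def local_counts(partition):
    """Partition-local step: needs only the tuples of its own partition."""
    return Counter(partition)

def combine(partials):
    """Aggregation step: partial results from all partitions are brought together."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

partials = [local_counts(p) for p in batch]   # no data leaves its partition
totals = combine(partials)                    # only the small partial counts are moved
print(dict(totals))
```

Counting locally first keeps the amount of data that crosses the network small, because only one partial count per hashtag and partition has to be transferred.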

What is Trident?
Supports
•  Joins
•  Aggregations
•  Grouping
•  Functions
•  Filters
Similar to what Pig or Cascading are for Hadoop

Trident Concepts - Topology
[Figure: Twitter Stream → Tweet Stream spout → (tweet) → Hashtag Splitter → (hashtag) → Hashtag Normalizer → (hashtag) → groupBy → Persistent Aggregate. The diagram marks the operations as "local" and "groupBy" and shows them grouped into two bolts.]

Trident Concepts - Function
•  takes in a set of input fields and emits zero or more tuples as output
•  fields of the output tuple are appended to the original input tuple in the
stream
•  If a function emits no tuples, the original input tuple is filtered out
•  Otherwise the input tuple is duplicated for each output tuple
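These rules can be made concrete with a small plain-Python model. It is a sketch, not the Trident API: tuples are modelled as dictionaries, and `apply_function` and `split_hashtags` are hypothetical names chosen for illustration.

```python
def apply_function(input_tuple, function):
    """Apply a Trident-style function to one tuple (a dict of field -> value).

    Each emitted output is appended to a copy of the input tuple;
    if nothing is emitted, the input tuple disappears from the stream.
    """
    return [{**input_tuple, **emitted} for emitted in function(input_tuple)]

def split_hashtags(t):
    """Hypothetical function: emit one 'hashtag' field per '#word' in the tweet text."""
    return [{"hashtag": w[1:].lower()}
            for w in t["tweet"].split()
            if w.startswith("#") and len(w) > 1]

stream = [{"tweet": "see you at #JFS2013 #storm"}, {"tweet": "no hashtags here"}]
out = [result for t in stream for result in apply_function(t, split_hashtags)]
for t in out:
    print(t)
```

The first tweet appears twice in the output, once per hashtag and each time with the original `tweet` field preserved; the second tweet emits nothing and is therefore filtered out.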

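Lambda Architecture
[Figure: Lambda Architecture. Incoming data is kept as immutable data in the Batch Layer, which produces the batch view, and is fed as a data stream to the Speed Layer, which produces the realtime view; the Serving Layer answers queries from both views. The steps are labelled A to G.]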

Lambda Architecture
[Figure: Lambda Architecture in detail. Batch Layer: incoming data is added to all data, a batch recompute produces precomputed information, exposed as batch views. Speed Layer: the stream is processed, a realtime increment produces incremented information, exposed as real time views. Serving Layer: a query merges batch views and real time views.]
Source: Marz, N. & Warren, J. (2013) Big Data. Manning.
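For a simple metric such as a hashtag count, the merge step in the Serving Layer can be sketched in plain Python. The view contents and the name `query` are assumptions for illustration, and summing the two views is only valid for additive metrics like counts.

```python
# Batch view: recomputed periodically from all data (assumed example values).
batch_view = {"storm": 120, "jfs2013": 45}
# Real-time view: increments for data that arrived after the last batch recompute.
realtime_view = {"storm": 3, "fastdata": 2}

def query(hashtag):
    """Serving-side merge: batch result plus the real-time increment."""
    return batch_view.get(hashtag, 0) + realtime_view.get(hashtag, 0)

for tag in ["storm", "jfs2013", "fastdata", "hadoop"]:
    print(tag, query(tag))
```

After each batch recompute, the real-time view only has to cover the data that the batch view does not yet include, which is why it can stay small.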

Lambda Architecture
[Figure: the same Lambda Architecture diagram (Batch Layer, Speed Layer, Serving Layer), annotated with one possible product/framework mapping]

Summary
Express real-time processing naturally
Not a Hadoop/MapReduce competitor
Supports other languages
Hard to debug
Object serialization
Didn't cover
§  Reliability through guaranteed message processing
§  Distributed RPC
§  Storm cluster setup and deployment

Resources
Website (http://storm-project.net/)
Wiki (https://github.com/nathanmarz/storm/wiki)
Doc (https://github.com/nathanmarz/storm/wiki/Documentation)
Storm-starter project (https://github.com/nathanmarz/storm-starter)
Mailing list (http://groups.google.com/group/storm-user)

Thank You!
Trivadis AG
Guido Schmutz 
guido.schmutz@trivadis.com
