SlideShare une entreprise Scribd logo
1  sur  101
Télécharger pour lire hors ligne
© 2019 Ververica
Konstantin Knauf
Head of Product - Ververica Platform
Apache Flink Committer
@snntrable
© 2019 Ververica2 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Go Live Maintenance
© 2019 Ververica3 = Worst Practice
...choose the right setup!
start with one of you most challenging uses cases
© 2019 Ververica4 = Worst Practice
...choose the right setup!
start with one of you most challenging uses cases
no prior stream processing knowledge
© 2019 Ververica5 = Worst Practice
...choose the right setup!
start with one of you most challenging uses cases
no training
no prior stream processing knowledge
© 2019 Ververica6 = Worst Practice
...choose the right setup!
start with one of you most challenging uses cases
no training
don’t use the community
no prior stream processing knowledge
© 2019 Ververica7 = Worst Practice
● Training
○ @FlinkForward
○ https://www.ververica.com/training
Getting Help!
© 2019 Ververica8 = Worst Practice
● Community
○ user@flink.apache.org
■ ~ 600 threads per month
■ ∅30h until first response
○ www.stackoverflow.com
■ How not to ask questions?
● https://data.stackexchange.com/stackoverflow/query/1115371/most-down-voted-apache-flink-questions
○ (ASF Slack #flink)
● Training
○ @FlinkForward
○ https://www.ververica.com/training
Getting Help!
© 2019 Ververica9 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Go Live Maintenance
© 2019 Ververica10 = Worst Practice
...don’t think too much about your requirements.
don’t think about consistency & delivery guarantees
© 2019 Ververica11 = Worst Practice
...don’t think too much about your requirements.
don’t think about consistency & delivery guarantees
don’t think about upgrades and application evolution
© 2019 Ververica12 = Worst Practice
...don’t think too much about your requirements.
don’t think about consistency & delivery guarantees
don’t think about the scale of your problem
don’t think about upgrades and application evolution
© 2019 Ververica13 = Worst Practice
...don’t think too much about your requirements.
don’t think about consistency & delivery guarantees
don’t think about the scale of your problem
don’t think too much about your actual business requirements
don’t think about upgrades and application evolution
© 2019 Ververica14 = Worst Practice
Do I care about losing records?
© 2019 Ververica15 = Worst Practice
No checkpointing,
any source
No
Do I care about losing records?
© 2019 Ververica16 = Worst Practice
No checkpointing,
any source
NoYes
Do I care about correct results?
Do I care about losing records?
© 2019 Ververica17 = Worst Practice
No checkpointing,
any source
NoYes
CheckpointingMode.AT_LEAST_ONCE
& replayable sources
No
Do I care about correct results?
Do I care about losing records?
© 2019 Ververica18 = Worst Practice
No checkpointing,
any source
NoYes
CheckpointingMode.AT_LEAST_ONCE
& replayable sources
Yes No
Do I care about correct results?
Do I care about duplicate (yet correct)
records downstream?
Do I care about losing records?
© 2019 Ververica19 = Worst Practice
No checkpointing,
any source
NoYes
CheckpointingMode.EXACTLY_ONCE
& replayable sources
CheckpointingMode.AT_LEAST_ONCE
& replayable sources
Yes No
No
Do I care about correct results?
Do I care about duplicate (yet correct)
records downstream?
Do I care about losing records?
© 2019 Ververica20 = Worst Practice
No checkpointing,
any source
NoYes
CheckpointingMode.EXACTLY_ONCE
& replayable sources
CheckpointingMode.AT_LEAST_ONCE
& replayable sources
CheckpointingMode.EXACTLY_ONCE,
replayable sources & transactional sinks
Yes No
NoYes
Do I care about correct results?
Do I care about duplicate (yet correct)
records downstream?
Do I care about losing records?
© 2019 Ververica21 = Worst Practice
No checkpointing,
any source
NoYes
CheckpointingMode.EXACTLY_ONCE
& replayable sources
CheckpointingMode.AT_LEAST_ONCE
& replayable sources
CheckpointingMode.EXACTLY_ONCE,
replayable sources & transactional sinks
Yes No
NoYes
Do I care about correct results?
Do I care about duplicate (yet correct)
records downstream?
Do I care about losing records?
L
A
T
E
N
C
Y
© 2019 Ververica22 = Worst Practice
Flink job
user code
Local state
backends
Persisted
savepoint
local reads / writes that
manipulate state
Basics
© 2019 Ververica23 = Worst Practice
Flink job
user code
Local state
backends
Persisted
savepoint
local reads / writes that
manipulate state
persist to
DFS on
savepoint
Basics
© 2019 Ververica24 = Worst Practice
Flink job
user code
Local state
backends
Persisted
savepoint
Upgrade
Application
1. Upgrade Flink cluster
2. Fix bugs
3. Pipeline topology changes
4. Job reconfigurations
5. Adapt state schema
6. ...
Basics
© 2019 Ververica25 = Worst Practice
Flink job
user code
Local state
backends
Persisted
savepoint
reload state
into local
state backends
Basics
© 2019 Ververica26 = Worst Practice
Flink job
user code
Local state
backends
Persisted
savepoint
continue to
access state
reload state
into local
state backends
Basics
© 2019 Ververica27 = Worst Practice
Topology Changes
© 2019 Ververica28 = Worst Practice
Sink:
LateDataSink
Topology Changes
© 2019 Ververica29 = Worst Practice
flink run --allowNonRestoredState <jar-file
<arguments>
Sink:
LateDataSink
Topology Changes
© 2019 Ververica30 = Worst Practice
new operator starts ➙ empty state
Topology Changes
© 2019 Ververica31 = Worst Practice
new operator starts ➙ empty state
Still the same?
Topology Changes
© 2019 Ververica32 = Worst Practice
new operator starts ➙ empty state
Still the same?
streamOperator
.uid("AggregatePerSensor")
Topology Changes
© 2019 Ververica33 = Worst Practice
● Avro Types ✓
○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
● Flink POJOs ✓
○ https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/schema_evolution.html#pojo-types
● Kryo✗
● Key data types can not be changed
State Schema Evolution
© 2019 Ververica34 = Worst Practice
● Avro Types ✓
○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
● Flink POJOs ✓
○ https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/schema_evolution.html#pojo-types
● Kryo✗
● Key data types can not be changed
State Schema Evolution
State Unlocked
Today, 12:20pm - 1:00pm
Track 2
Seth Wiesman, Tzu-Li (Gordon) Tai
● State Processor API for the rescue
© 2019 Ververica35 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
© 2019 Ververica36 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
FilesystemStatebackend
© 2019 Ververica37 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
FilesystemStatebackend
#key * state_size
© 2019 Ververica38 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
Ingress MB/s
FilesystemStatebackend
#key * state_size
#records/s * size
© 2019 Ververica39 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
Ingress MB/s Shuffle MB/s
Shuffle MB/s
FilesystemStatebackend
#key * state_size
#records/s * size
#records/s * size * #tm-1/#tm
#records/s * size * #tm-1/#tm
© 2019 Ververica40 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
Ingress MB/s Shuffle MB/s
Shuffle MB/s
Egress MB/s
FilesystemStatebackend
#key * state_size
#records/s * size
#records/s * size * #tm-1/#tm
#records/s * size * #tm-1/#tm
#messages/s * size
© 2019 Ververica41 = Worst Practice
State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink
Ingress MB/s Shuffle MB/s
Shuffle MB/s
Egress MB/s
FilesystemStatebackend Checkpoint MB/s
#key * state_size
#records/s * size
#records/s * size * #tm-1/#tm
overall_state_size * 1/checkpoint_interval
#records/s * size * #tm-1/#tm
#messages/s * size
© 2019 Ververica42 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
© 2019 Ververica43 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4
© 2019 Ververica44 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8
© 2019 Ververica45 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13
© 2019 Ververica46 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11
© 2019 Ververica47 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15
© 2019 Ververica48 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15 21
© 2019 Ververica49 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15 21
4 8 13 11 15 21Tumbling Window (10 secs):
© 2019 Ververica50 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 21
4
15
8 13 11 15 21
4 8 13 11 15 21
Tumbling Window (10 secs):
Sliding Window (10/5 secs):
© 2019 Ververica51 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
Tumbling Window (10 secs):
Sliding Window (10/5 secs):
“Look Back” (10 secs) :
© 2019 Ververica52 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
Tumbling Window (10 secs):
Sliding Window (10/5 secs):
“Look Back” (10 secs) :
Fire or wait for watermark?
© 2019 Ververica53 = Worst Practice
Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
Tumbling Window (10 secs):
Sliding Window (10/5 secs):
“Look Back” (10 secs) :
Fire or wait for watermark?
Fire or wait till watermark reaches end of window?
© 2019 Ververica54 = Worst Practice
Batch Processing Requirements
I receive two files every ten minutes: transactions.txt & quotes.txt
● They need to be transformed & joined.
● The output must be exactly one file.
● The operation needs to be atomic.
→ Don’t try to re-implement your batch jobs as a stream processing jobs.
© 2019 Ververica55 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Production Maintenance
© 2019 Ververica56 = Worst Practice
Applications
(physical)
Analytics
(declarative)
DataStream API Table API/SQL
Types are Java / Scala classes Logical Schema for Tables
Transformation Functions Declarative Language (SQL, Table DSL)
State implicit in operationsExplicit control over State
Explicit control over Time SLAs define when to trigger
Executes as described Automatic Optimization
Analytics or Application
© 2019 Ververica57 = Worst Practice
Table API / SQL Red Flags
● When upgrading Apache Flink or my application I want to migrate state.
● I can not loose late data.
● I want to change the behaviour of my application during runtime.
© 2019 Ververica58 = Worst Practice
Worst Practices
make use of deeply-nested, complex data types
KeySelectors#getKey can return any serializable type; use this freedom
© 2019 Ververica59 = Worst Practice
● serialization is not to be underestimated
● the simpler your data types the better
● use Flink POJOs or Avro SpecificRecords
● key types matter most
○ part of every KeyedState
○ part of every Timer
● tune locally
Serializer Ops/s
PojoSerializer 813
Kryo 294
Avro (Reflect API) 114
Avro (SpecificRecord API) 632
https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html
© 2019 Ververica60 = Worst Practice
sourceStream.flatMap(new Deserializer())
.keyBy(“cities”)
.timeWindow()
.count()
.filter(new GeographyFilter(“America”))
.addSink(...)
© 2019 Ververica61 = Worst Practice
sourceStream.flatMap(new Deserializer())
.keyBy(“cities”)
.timeWindow()
.count()
.filter(new GeographyFilter(“America”))
.addSink(...)
don’t process data you don’t need
● project early
● filter early
● don’t deserialize unused fields, e.g.
public class Record {
private City city;
private byte[] enclosedRecord;
}
© 2019 Ververica62 = Worst Practice
static variables to share state between Tasks
© 2019 Ververica63 = Worst Practice
static variables to share state between Tasks ● bugs
● deadlocks & lock contention
(interaction with framework code)
● synchronization overhead
© 2019 Ververica64 = Worst Practice
static variables to share state between Tasks
spawning threads in user functions
● bugs
● deadlocks & lock contention
(interaction with framework code)
● synchronization overhead
© 2019 Ververica65 = Worst Practice
static variables to share state between Tasks
spawning threads in user functions
● bugs
● deadlocks & lock contention
(interaction with framework code)
● synchronization overhead
● complicated error prone
(checkpointing)
● use AsyncStreams to reduce wait times
on external IO
● use Timers to schedule tasks
● increase operator parallelism to
increase parallelism
© 2019 Ververica66 = Worst Practice
stream.keyBy(“key”)
.window(GlobalWindows.create())
.trigger(new CustomTrigger())
.evictor(new CustomEvictor())
.reduce/aggregate/fold/apply()
© 2019 Ververica67 = Worst Practice
stream.keyBy(“key”)
.window(GlobalWindows.create())
.trigger(new CustomTrigger())
.evictor(new CustomEvictor())
.reduce/aggregate/fold/apply()
● Avoid custom windowing
● KeyedProcessFunction usually
○ less error-prone
○ simpler
© 2019 Ververica68 = Worst Practice
stream.keyBy(“key”)
.timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS)
.apply(new MyWindowFunction())
stream.keyBy(“key”)
.window(GlobalWindows.create())
.trigger(new CustomTrigger())
.evictor(new CustomEvictor())
.reduce/aggregate/fold/apply()
● Avoid custom windowing
● KeyedProcessFunction usually
○ less error-prone
○ simpler
© 2019 Ververica69 = Worst Practice
stream.keyBy(“key”)
.timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS)
.apply(new MyWindowFunction())
stream.keyBy(“key”)
.window(GlobalWindows.create())
.trigger(new CustomTrigger())
.evictor(new CustomEvictor())
.reduce/aggregate/fold/apply()
● Avoid custom windowing
● KeyedProcessFunction usually
○ less error-prone
○ simpler
● each record is added to > 500k
windows
● without pre-aggregation
© 2019 Ververica70 = Worst Practice
Queryable State for Inter-Task Communication
© 2019 Ververica71 = Worst Practice
Queryable State for Inter-Task Communication
● Non-Thread Safe State Access
○ RocksDBStatebackend ✓
○ FilesystemStatebackend ✗
● Performance?
● Consistency Guarantees?
● Use Queryable State for debugging and
monitoring only
© 2019 Ververica72 = Worst Practice
stream.keyBy(“key”)
.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
© 2019 Ververica73 = Worst Practice
stream.keyBy(“key”)
.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
● DataStreamUtils#reinterpretAsKeyedStream
● Note: Stream needs to be partitioned exactly as Flink
would partition it.
© 2019 Ververica74 = Worst Practice
stream.keyBy(“key”)
.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
● DataStreamUtils#reinterpretAsKeyedStream
● Note: Stream needs to be partitioned exactly as Flink
would partition it.
public void flatMap(Bar bar, Collector<Foo> out) throws Exception {
MyParserFactory factory = MyParserFactory.newInstance();
MyParser parser = factory.newParser();
out.collect(parser.parse(bar));
}
© 2019 Ververica75 = Worst Practice
stream.keyBy(“key”)
.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
● DataStreamUtils#reinterpretAsKeyedStream
● Note: Stream needs to be partitioned exactly as Flink
would partition it.
public void flatMap(Bar bar, Collector<Foo> out) throws Exception {
MyParserFactory factory = MyParserFactory.newInstance();
MyParser parser = factory.newParser();
out.collect(parser.parse(bar));
}
● use RichFunction#open
for initialization logic
© 2019 Ververica76 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Go Live Maintenance
© 2019 Ververica77 = Worst Practice
That’s a pyramid!
UDF Unit Tests
System Tests
© 2019 Ververica78 = Worst Practice
That’s a pyramid!
UDF Unit Tests
AbstractTestHarness
MiniClusterResource
System Tests
© 2019 Ververica79 = Worst Practice
Testing Stateful and Timely Operators and Functions
@Test
public void testingStatefulFlatMapFunction() throws Exception {
//push (timestamped) elements into the operator (and hence user defined function)
testHarness.processElement(2L, 100L);
//trigger event time timers by advancing the event time of the operator with a watermark
testHarness.processWatermark(100L);
//trigger processing time timers by advancing the processing time of the operator directly
testHarness.setProcessingTime(100L);
//retrieve list of emitted records for assertions
assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L))
//retrieve list of records emitted to a specific side output for assertions (ProcessFunction only)
assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0))
}
© 2019 Ververica80 = Worst Practice
Examples: https://github.com/knaufk/flink-testing-pyramid
Testing Stateful and Timely Operators and Functions
@Test
public void testingStatefulFlatMapFunction() throws Exception {
//push (timestamped) elements into the operator (and hence user defined function)
testHarness.processElement(2L, 100L);
//trigger event time timers by advancing the event time of the operator with a watermark
testHarness.processWatermark(100L);
//trigger processing time timers by advancing the processing time of the operator directly
testHarness.setProcessingTime(100L);
//retrieve list of emitted records for assertions
assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L))
//retrieve list of records emitted to a specific side output for assertions (ProcessFunction only)
assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0))
}
© 2019 Ververica81 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Go Live Maintenance
© 2019 Ververica82 = Worst Practice
Ignore Spiky Loads!
● seasonal fluctuations
processing time
records/s
© 2019 Ververica83 = Worst Practice
Ignore Spiky Loads!
● seasonal fluctuations
● timers
processing time
records/s
© 2019 Ververica84 = Worst Practice
Ignore Spiky Loads!
● seasonal fluctuations
● checkpoint alignment
● watermark interval
● GC pauses
● ...
● timers
processing time
records/s
© 2019 Ververica85 = Worst Practice
Ignore Spiky Loads!
● seasonal fluctuations
● checkpoint alignment
● watermark interval
● GC pauses
● ...
● timers
processing time
records/s
catch-up scenario
© 2019 Ververica86 = Worst Practice
Monitoring & Metrics
if at all start monitoring now
© 2019 Ververica87 = Worst Practice
Monitoring & Metrics
if at all start monitoring now
[1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
● don’t miss the chance to learn about
Flink’s runtime during development
● not sure how to start → read [1]
© 2019 Ververica88 = Worst Practice
use the Flink Web Interface as monitoring system
Monitoring & Metrics
if at all start monitoring now
[1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
● don’t miss the chance to learn about
Flink’s runtime during development
● not sure how to start → read [1]
© 2019 Ververica89 = Worst Practice
use the Flink Web Interface as monitoring system
Monitoring & Metrics
if at all start monitoring now
[1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
● don’t miss the chance to learn about
Flink’s runtime during development
● not sure how to start → read [1]
● not the right tool for the job →
MetricsReporters (e.g Prometheus,
InfluxDB, Datadog)
● too many metrics can bring
JobManagers down
© 2019 Ververica90 = Worst Practice
use the Flink Web Interface as monitoring system
Monitoring & Metrics
if at all start monitoring now
using latency markers in production
[1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
● don’t miss the chance to learn about
Flink’s runtime during development
● not sure how to start → read [1]
● not the right tool for the job →
MetricsReporters (e.g Prometheus,
InfluxDB, Datadog)
● too many metrics can bring
JobManagers down
© 2019 Ververica91 = Worst Practice
use the Flink Web Interface as monitoring system
Monitoring & Metrics
if at all start monitoring now
using latency markers in production
[1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html
● don’t miss the chance to learn about
Flink’s runtime during development
● not sure how to start → read [1]
● not the right tool for the job →
MetricsReporters (e.g Prometheus,
InfluxDB, Datadog)
● too many metrics can bring
JobManagers down
● high overhead (in particular with
metrics.latency.granularity:
subtask)
● measure event time lag instead [2]
[2] https://flink.apache.org/2019/06/05/flink-network-stack.html
© 2019 Ververica92 = Worst Practice
Configuration
choosing RocksDBStatebackend by default
© 2019 Ververica93 = Worst Practice
Configuration
choosing RocksDBStatebackend by default ● FilesystemStatebackend is faster &
easier to configure
● State Processor API can be used to
switch later
© 2019 Ververica94 = Worst Practice
Configuration
choosing RocksDBStatebackend by default
NFS/EBS/etc. as state.backend.rocksdb.localdir
● FilesystemStatebackend is faster &
easier to configure
● State Processor API can be used to
switch later
© 2019 Ververica95 = Worst Practice
Configuration
choosing RocksDBStatebackend by default
NFS/EBS/etc. as state.backend.rocksdb.localdir
● FilesystemStatebackend is faster &
easier to configure
● State Processor API can be used to
switch later
● disk IO ultimately determines how fast
you can read/write state
© 2019 Ververica96 = Worst Practice
Configuration
choosing RocksDBStatebackend by default
playing around with Slots and SlotSharingGroups (too
early)
NFS/EBS/etc. as state.backend.rocksdb.localdir
● FilesystemStatebackend is faster &
easier to configure
● State Processor API can be used to
switch later
● disk IO ultimately determines how fast
you can read/write state
© 2019 Ververica97 = Worst Practice
Configuration
choosing RocksDBStatebackend by default
playing around with Slots and SlotSharingGroups (too
early)
NFS/EBS/etc. as state.backend.rocksdb.localdir
● FilesystemStatebackend is faster &
easier to configure
● State Processor API can be used to
switch later
● disk IO ultimately determines how fast
you can read/write state
● usually not worth it
● leads to less chaining and additional
serialization, more network shuffles
and lower resource utilization
© 2019 Ververica98 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Go Live Maintenance
© 2019 Ververica99 = Worst Practice
with a fast-pace project like Apache Flink, don’t upgrade
© 2019 Ververica
Questions?
© 2019 Ververica
www.ververica.com @snntrablekonstantin@ververica.com

Contenu connexe

Tendances

Building and Evolving a Dependency-Graph Based Microservice Architecture (La...
 Building and Evolving a Dependency-Graph Based Microservice Architecture (La... Building and Evolving a Dependency-Graph Based Microservice Architecture (La...
Building and Evolving a Dependency-Graph Based Microservice Architecture (La...
confluent
 

Tendances (20)

Building and Evolving a Dependency-Graph Based Microservice Architecture (La...
 Building and Evolving a Dependency-Graph Based Microservice Architecture (La... Building and Evolving a Dependency-Graph Based Microservice Architecture (La...
Building and Evolving a Dependency-Graph Based Microservice Architecture (La...
 
Technology choices for Apache Kafka and Change Data Capture
Technology choices for Apache Kafka and Change Data CaptureTechnology choices for Apache Kafka and Change Data Capture
Technology choices for Apache Kafka and Change Data Capture
 
Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &...
Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &...Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &...
Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &...
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
 
Flux is incubating + the road ahead
Flux is incubating + the road aheadFlux is incubating + the road ahead
Flux is incubating + the road ahead
 
Don't Cross the Streams! (or do, we got you)
Don't Cross the Streams! (or do, we got you)Don't Cross the Streams! (or do, we got you)
Don't Cross the Streams! (or do, we got you)
 
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
 
Github Projects Overview and IBM Streams V4.1
Github Projects Overview and IBM Streams V4.1Github Projects Overview and IBM Streams V4.1
Github Projects Overview and IBM Streams V4.1
 
Unified Data Processing with Apache Flink and Apache Pulsar_Seth Wiesman
Unified Data Processing with Apache Flink and Apache Pulsar_Seth WiesmanUnified Data Processing with Apache Flink and Apache Pulsar_Seth Wiesman
Unified Data Processing with Apache Flink and Apache Pulsar_Seth Wiesman
 
Beyond the Brokers | Emma Humber and Andrew Borley, IBM
Beyond the Brokers | Emma Humber and Andrew Borley, IBMBeyond the Brokers | Emma Humber and Andrew Borley, IBM
Beyond the Brokers | Emma Humber and Andrew Borley, IBM
 
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
Flink Forward Berlin 2018: Lasse Nedergaard - "Our successful journey with Fl...
 
Abstractions for managed stream processing platform (Arya Ketan - Flipkart)
Abstractions for managed stream processing platform (Arya Ketan - Flipkart)Abstractions for managed stream processing platform (Arya Ketan - Flipkart)
Abstractions for managed stream processing platform (Arya Ketan - Flipkart)
 
Integrating Funambol with CalDAV and LDAP
Integrating Funambol with CalDAV and LDAPIntegrating Funambol with CalDAV and LDAP
Integrating Funambol with CalDAV and LDAP
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
Evolving from Messaging to Event Streaming
Evolving from Messaging to Event StreamingEvolving from Messaging to Event Streaming
Evolving from Messaging to Event Streaming
 
Evolution of netflix conductor
Evolution of netflix conductorEvolution of netflix conductor
Evolution of netflix conductor
 
Removing Barriers Between Dev and Ops
Removing Barriers Between Dev and OpsRemoving Barriers Between Dev and Ops
Removing Barriers Between Dev and Ops
 
12 Factor, or Cloud Native Apps – What EXACTLY Does that Mean for Spring Deve...
12 Factor, or Cloud Native Apps – What EXACTLY Does that Mean for Spring Deve...12 Factor, or Cloud Native Apps – What EXACTLY Does that Mean for Spring Deve...
12 Factor, or Cloud Native Apps – What EXACTLY Does that Mean for Spring Deve...
 
Service Mesh - Observability
Service Mesh - ObservabilityService Mesh - Observability
Service Mesh - Observability
 

Similaire à Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf

You’re Spiky and We Know It With Ravindra Bhanot | Current 2022
You’re Spiky and We Know It With Ravindra Bhanot | Current 2022You’re Spiky and We Know It With Ravindra Bhanot | Current 2022
You’re Spiky and We Know It With Ravindra Bhanot | Current 2022
HostedbyConfluent
 

Similaire à Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf (20)

Apache Flink Worst Practices
Apache Flink Worst PracticesApache Flink Worst Practices
Apache Flink Worst Practices
 
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
 
Automating Disaster Recovery for Faultless Service Delivery
Automating Disaster Recovery for Faultless Service DeliveryAutomating Disaster Recovery for Faultless Service Delivery
Automating Disaster Recovery for Faultless Service Delivery
 
Goodbye Crontab, Hello CA Workload Automation DE - Verizon Style!
Goodbye Crontab, Hello CA Workload Automation DE - Verizon Style!Goodbye Crontab, Hello CA Workload Automation DE - Verizon Style!
Goodbye Crontab, Hello CA Workload Automation DE - Verizon Style!
 
Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10Real World Problem Solving Using Application Performance Management 10
Real World Problem Solving Using Application Performance Management 10
 
Avoiding disaster recovery disasters
Avoiding disaster recovery disastersAvoiding disaster recovery disasters
Avoiding disaster recovery disasters
 
Avoiding disaster recovery disasters
Avoiding disaster recovery disastersAvoiding disaster recovery disasters
Avoiding disaster recovery disasters
 
MySQL Developer Day conference: MySQL Replication and Scalability
MySQL Developer Day conference: MySQL Replication and ScalabilityMySQL Developer Day conference: MySQL Replication and Scalability
MySQL Developer Day conference: MySQL Replication and Scalability
 
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
 
Case Study: VF Corporation Takes a Practical Approach to Improving its MOJO w...
Case Study: VF Corporation Takes a Practical Approach to Improving its MOJO w...Case Study: VF Corporation Takes a Practical Approach to Improving its MOJO w...
Case Study: VF Corporation Takes a Practical Approach to Improving its MOJO w...
 
Sequencing work for maximum economic benefit using the Scaled Agile Framework...
Sequencing work for maximum economic benefit using the Scaled Agile Framework...Sequencing work for maximum economic benefit using the Scaled Agile Framework...
Sequencing work for maximum economic benefit using the Scaled Agile Framework...
 
Scenarios or Why Some Automation Projects Fail - Webinar Presentation
Scenarios or Why Some Automation Projects Fail - Webinar Presentation Scenarios or Why Some Automation Projects Fail - Webinar Presentation
Scenarios or Why Some Automation Projects Fail - Webinar Presentation
 
2Ring Gadgets for Cisco Finesse & Calabrio Integration
2Ring Gadgets for Cisco Finesse & Calabrio Integration2Ring Gadgets for Cisco Finesse & Calabrio Integration
2Ring Gadgets for Cisco Finesse & Calabrio Integration
 
What Continuous Delivery Means for Version Control
What Continuous Delivery Means for Version ControlWhat Continuous Delivery Means for Version Control
What Continuous Delivery Means for Version Control
 
DR Planning and Testing
DR Planning and TestingDR Planning and Testing
DR Planning and Testing
 
JetAdvice Summit 2014 - Insigths
JetAdvice Summit  2014  - InsigthsJetAdvice Summit  2014  - Insigths
JetAdvice Summit 2014 - Insigths
 
Alert! Event Notification Options for Force.com Apps Webinar
Alert! Event Notification Options for Force.com Apps WebinarAlert! Event Notification Options for Force.com Apps Webinar
Alert! Event Notification Options for Force.com Apps Webinar
 
You’re Spiky and We Know It With Ravindra Bhanot | Current 2022
You’re Spiky and We Know It With Ravindra Bhanot | Current 2022You’re Spiky and We Know It With Ravindra Bhanot | Current 2022
You’re Spiky and We Know It With Ravindra Bhanot | Current 2022
 
Improve Employee Experiences on Cisco RoomOS Devices, Webex, and Microsoft Te...
Improve Employee Experiences on Cisco RoomOS Devices, Webex, and Microsoft Te...Improve Employee Experiences on Cisco RoomOS Devices, Webex, and Microsoft Te...
Improve Employee Experiences on Cisco RoomOS Devices, Webex, and Microsoft Te...
 
End-to-End Continuous Delivery with CA Automic Release Automation and CA Serv...
End-to-End Continuous Delivery with CA Automic Release Automation and CA Serv...End-to-End Continuous Delivery with CA Automic Release Automation and CA Serv...
End-to-End Continuous Delivery with CA Automic Release Automation and CA Serv...
 

Plus de Flink Forward

Plus de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf

  • 1. © 2019 Ververica Konstantin Knauf Head of Product - Ververica Platform Apache Flink Committer @snntrable
  • 2. © 2019 Ververica2 = Worst Practice Don’t use an iterative development process! AnalysisProject Setup Development (Testing) Go Live Maintenance
  • 3. © 2019 Ververica3 = Worst Practice ...choose the right setup! start with one of you most challenging uses cases
  • 4. © 2019 Ververica4 = Worst Practice ...choose the right setup! start with one of you most challenging uses cases no prior stream processing knowledge
  • 5. © 2019 Ververica5 = Worst Practice ...choose the right setup! start with one of you most challenging uses cases no training no prior stream processing knowledge
  • 6. © 2019 Ververica6 = Worst Practice ...choose the right setup! start with one of you most challenging uses cases no training don’t use the community no prior stream processing knowledge
  • 7. © 2019 Ververica7 = Worst Practice ● Training ○ @FlinkForward ○ https://www.ververica.com/training Getting Help!
  • 8. © 2019 Ververica8 = Worst Practice ● Community ○ user@flink.apache.org ■ ~ 600 threads per month ■ ∅30h until first response ○ www.stackoverflow.com ■ How not to ask questions? ● https://data.stackexchange.com/stackoverflow/query/1115371/most-down-voted-apache-flink-questions ○ (ASF Slack #flink) ● Training ○ @FlinkForward ○ https://www.ververica.com/training Getting Help!
  • 9. © 2019 Ververica9 = Worst Practice Don’t use an iterative development process! AnalysisProject Setup Development (Testing) Go Live Maintenance
  • 10. © 2019 Ververica10 = Worst Practice ...don’t think too much about your requirements. don’t think about consistency & delivery guarantees
  • 11. © 2019 Ververica11 = Worst Practice ...don’t think too much about your requirements. don’t think about consistency & delivery guarantees don’t think about upgrades and application evolution
  • 12. © 2019 Ververica12 = Worst Practice ...don’t think too much about your requirements. don’t think about consistency & delivery guarantees don’t think about the scale of your problem don’t think about upgrades and application evolution
  • 13. © 2019 Ververica13 = Worst Practice ...don’t think too much about your requirements. don’t think about consistency & delivery guarantees don’t think about the scale of your problem don’t think too much about your actual business requirements don’t think about upgrades and application evolution
  • 14. © 2019 Ververica14 = Worst Practice Do I care about losing records?
  • 15. © 2019 Ververica15 = Worst Practice No checkpointing, any source No Do I care about losing records?
  • 16. © 2019 Ververica16 = Worst Practice No checkpointing, any source NoYes Do I care about correct results? Do I care about losing records?
  • 17. © 2019 Ververica17 = Worst Practice No checkpointing, any source NoYes CheckpointingMode.AT_LEAST_ONCE & replayable sources No Do I care about correct results? Do I care about losing records?
  • 18. © 2019 Ververica18 = Worst Practice No checkpointing, any source NoYes CheckpointingMode.AT_LEAST_ONCE & replayable sources Yes No Do I care about correct results? Do I care about duplicate (yet correct) records downstream? Do I care about losing records?
  • 19. © 2019 Ververica19 = Worst Practice No checkpointing, any source NoYes CheckpointingMode.EXACTLY_ONCE & replayable sources CheckpointingMode.AT_LEAST_ONCE & replayable sources Yes No No Do I care about correct results? Do I care about duplicate (yet correct) records downstream? Do I care about losing records?
  • 20. © 2019 Ververica20 = Worst Practice No checkpointing, any source NoYes CheckpointingMode.EXACTLY_ONCE & replayable sources CheckpointingMode.AT_LEAST_ONCE & replayable sources CheckpointingMode.EXACTLY_ONCE, replayable sources & transactional sinks Yes No NoYes Do I care about correct results? Do I care about duplicate (yet correct) records downstream? Do I care about losing records?
  • 21. © 2019 Ververica21 = Worst Practice No checkpointing, any source NoYes CheckpointingMode.EXACTLY_ONCE & replayable sources CheckpointingMode.AT_LEAST_ONCE & replayable sources CheckpointingMode.EXACTLY_ONCE, replayable sources & transactional sinks Yes No NoYes Do I care about correct results? Do I care about duplicate (yet correct) records downstream? Do I care about losing records? L A T E N C Y
  • 22. © 2019 Ververica22 = Worst Practice Flink job user code Local state backends Persisted savepoint local reads / writes that manipulate state Basics
  • 23. © 2019 Ververica23 = Worst Practice Flink job user code Local state backends Persisted savepoint local reads / writes that manipulate state persist to DFS on savepoint Basics
  • 24. © 2019 Ververica24 = Worst Practice Flink job user code Local state backends Persisted savepoint Upgrade Application 1. Upgrade Flink cluster 2. Fix bugs 3. Pipeline topology changes 4. Job reconfigurations 5. Adapt state schema 6. ... Basics
  • 25. © 2019 Ververica25 = Worst Practice Flink job user code Local state backends Persisted savepoint reload state into local state backends Basics
  • 26. © 2019 Ververica26 = Worst Practice Flink job user code Local state backends Persisted savepoint continue to access state reload state into local state backends Basics
  • 27. © 2019 Ververica27 = Worst Practice Topology Changes
  • 28. © 2019 Ververica28 = Worst Practice Sink: LateDataSink Topology Changes
  • 29. © 2019 Ververica29 = Worst Practice flink run --allowNonRestoredState <jar-file <arguments> Sink: LateDataSink Topology Changes
  • 30. © 2019 Ververica30 = Worst Practice new operator starts ➙ empty state Topology Changes
  • 31. © 2019 Ververica31 = Worst Practice new operator starts ➙ empty state Still the same? Topology Changes
  • 32. © 2019 Ververica32 = Worst Practice new operator starts ➙ empty state Still the same? streamOperator .uid("AggregatePerSensor") Topology Changes
  • 33. © 2019 Ververica33 = Worst Practice ● Avro Types ✓ ○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution ● Flink POJOs ✓ ○ https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/schema_evolution.html#pojo-types ● Kryo✗ ● Key data types can not be changed State Schema Evolution
  • 34. © 2019 Ververica34 = Worst Practice ● Avro Types ✓ ○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution ● Flink POJOs ✓ ○ https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/schema_evolution.html#pojo-types ● Kryo✗ ● Key data types can not be changed State Schema Evolution State Unlocked Today, 12:20pm - 1:00pm Track 2 Seth Wiesman, Tzu-Li (Gordon) Tai ● State Processor API for the rescue
  • 35. © 2019 Ververica35 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink
  • 36. © 2019 Ververica36 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink FilesystemStatebackend
  • 37. © 2019 Ververica37 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink FilesystemStatebackend #key * state_size
  • 38. © 2019 Ververica38 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink Ingress MB/s FilesystemStatebackend #key * state_size #records/s * size
  • 39. © 2019 Ververica39 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink Ingress MB/s Shuffle MB/s Shuffle MB/s FilesystemStatebackend #key * state_size #records/s * size #records/s * size * #tm-1/#tm #records/s * size * #tm-1/#tm
  • 40. © 2019 Ververica40 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink Ingress MB/s Shuffle MB/s Shuffle MB/s Egress MB/s FilesystemStatebackend #key * state_size #records/s * size #records/s * size * #tm-1/#tm #records/s * size * #tm-1/#tm #messages/s * size
  • 41. © 2019 Ververica41 = Worst Practice State Size & Network Stateful Operator TaskManager (1 Slot) Source keyBy Sink Ingress MB/s Shuffle MB/s Shuffle MB/s Egress MB/s FilesystemStatebackend Checkpoint MB/s #key * state_size #records/s * size #records/s * size * #tm-1/#tm overall_state_size * 1/checkpoint_interval #records/s * size * #tm-1/#tm #messages/s * size
  • 42. © 2019 Ververica42 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds.
  • 43. © 2019 Ververica43 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4
  • 44. © 2019 Ververica44 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8
  • 45. © 2019 Ververica45 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13
  • 46. © 2019 Ververica46 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11
  • 47. © 2019 Ververica47 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 15
  • 48. © 2019 Ververica48 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 15 21
  • 49. © 2019 Ververica49 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 15 21 4 8 13 11 15 21Tumbling Window (10 secs):
  • 50. © 2019 Ververica50 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 21 4 15 8 13 11 15 21 4 8 13 11 15 21 Tumbling Window (10 secs): Sliding Window (10/5 secs):
  • 51. © 2019 Ververica51 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 15 21 4 8 13 11 15 21 4 8 13 11 15 21 4 8 13 11 15 21 Tumbling Window (10 secs): Sliding Window (10/5 secs): “Look Back” (10 secs) :
  • 52. © 2019 Ververica52 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 15 21 4 8 13 11 15 21 4 8 13 11 15 21 4 8 13 11 15 21 Tumbling Window (10 secs): Sliding Window (10/5 secs): “Look Back” (10 secs) : Fire or wait for watermark?
  • 53. © 2019 Ververica53 = Worst Practice Event Time & Out-Of-Orderness I want to send an alarm when the number of transactions per customer exceeds three in ten seconds. 4 8 13 11 15 21 4 8 13 11 15 21 4 8 13 11 15 21 4 8 13 11 15 21 Tumbling Window (10 secs): Sliding Window (10/5 secs): “Look Back” (10 secs) : Fire or wait for watermark? Fire or wait till watermark reaches end of window?
  • 54. © 2019 Ververica54 = Worst Practice Batch Processing Requirements I receive two files every ten minutes: transactions.txt & quotes.txt ● They need to be transformed & joined. ● The output must be exactly one file. ● The operation needs to be atomic. → Don’t try to re-implement your batch jobs as a stream processing jobs.
  • 55. © 2019 Ververica55 = Worst Practice Don’t use an iterative development process! AnalysisProject Setup Development (Testing) Production Maintenance
  • 56. © 2019 Ververica56 = Worst Practice Applications (physical) Analytics (declarative) DataStream API Table API/SQL Types are Java / Scala classes Logical Schema for Tables Transformation Functions Declarative Language (SQL, Table DSL) State implicit in operationsExplicit control over State Explicit control over Time SLAs define when to trigger Executes as described Automatic Optimization Analytics or Application
  • 57. © 2019 Ververica57 = Worst Practice Table API / SQL Red Flags ● When upgrading Apache Flink or my application I want to migrate state. ● I can not loose late data. ● I want to change the behaviour of my application during runtime.
  • 58. © 2019 Ververica58 = Worst Practice Worst Practices make use of deeply-nested, complex data types KeySelectors#getKey can return any serializable type; use this freedom
  • 59. © 2019 Ververica59 = Worst Practice ● serialization is not to be underestimated ● the simpler your data types the better ● use Flink POJOs or Avro SpecificRecords ● key types matter most ○ part of every KeyedState ○ part of every Timer ● tune locally Serializer Ops/s PojoSerializer 813 Kryo 294 Avro (Reflect API) 114 Avro (SpecificRecord API) 632 https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html
  • 60. © 2019 Ververica60 = Worst Practice sourceStream.flatMap(new Deserializer()) .keyBy(“cities”) .timeWindow() .count() .filter(new GeographyFilter(“America”)) .addSink(...)
  • 61. © 2019 Ververica61 = Worst Practice sourceStream.flatMap(new Deserializer()) .keyBy(“cities”) .timeWindow() .count() .filter(new GeographyFilter(“America”)) .addSink(...) don’t process data you don’t need ● project early ● filter early ● don’t deserialize unused fields, e.g. public class Record { private City city; private byte[] enclosedRecord; }
  • 62. © 2019 Ververica62 = Worst Practice static variables to share state between Tasks
  • 63. © 2019 Ververica63 = Worst Practice static variables to share state between Tasks ● bugs ● deadlocks & lock contention (interaction with framework code) ● synchronization overhead
  • 64. © 2019 Ververica64 = Worst Practice static variables to share state between Tasks spawning threads in user functions ● bugs ● deadlocks & lock contention (interaction with framework code) ● synchronization overhead
  • 65. © 2019 Ververica65 = Worst Practice static variables to share state between Tasks spawning threads in user functions ● bugs ● deadlocks & lock contention (interaction with framework code) ● synchronization overhead ● complicated error prone (checkpointing) ● use AsyncStreams to reduce wait times on external IO ● use Timers to schedule tasks ● increase operator parallelism to increase parallelism
  • 66. © 2019 Ververica66 = Worst Practice stream.keyBy(“key”) .window(GlobalWindows.create()) .trigger(new CustomTrigger()) .evictor(new CustomEvictor()) .reduce/aggregate/fold/apply()
  • 67. © 2019 Ververica67 = Worst Practice stream.keyBy(“key”) .window(GlobalWindows.create()) .trigger(new CustomTrigger()) .evictor(new CustomEvictor()) .reduce/aggregate/fold/apply() ● Avoid custom windowing ● KeyedProcessFunction usually ○ less error-prone ○ simpler
  • 68. © 2019 Ververica68 = Worst Practice stream.keyBy(“key”) .timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS) .apply(new MyWindowFunction()) stream.keyBy(“key”) .window(GlobalWindows.create()) .trigger(new CustomTrigger()) .evictor(new CustomEvictor()) .reduce/aggregate/fold/apply() ● Avoid custom windowing ● KeyedProcessFunction usually ○ less error-prone ○ simpler
  • 69. © 2019 Ververica69 = Worst Practice stream.keyBy(“key”) .timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS) .apply(new MyWindowFunction()) stream.keyBy(“key”) .window(GlobalWindows.create()) .trigger(new CustomTrigger()) .evictor(new CustomEvictor()) .reduce/aggregate/fold/apply() ● Avoid custom windowing ● KeyedProcessFunction usually ○ less error-prone ○ simpler ● each record is added to > 500k windows ● without pre-aggregation
  • 70. © 2019 Ververica70 = Worst Practice Queryable State for Inter-Task Communication
  • 71. © 2019 Ververica71 = Worst Practice Queryable State for Inter-Task Communication ● Non-Thread Safe State Access ○ RocksDBStatebackend ✓ ○ FilesystemStatebackend ✗ ● Performance? ● Consistency Guarantees? ● Use Queryable State for debugging and monitoring only
  • 72. © 2019 Ververica72 = Worst Practice stream.keyBy(“key”) .flatmap(..) .keyBy(“key”) .process(..) .keyBy(“key”) .timeWindow(..)
  • 73. © 2019 Ververica73 = Worst Practice stream.keyBy(“key”) .flatmap(..) .keyBy(“key”) .process(..) .keyBy(“key”) .timeWindow(..) ● DataStreamUtils#reinterpretAsKeyedStream ● Note: Stream needs to be partitioned exactly as Flink would partition it.
  • 74. © 2019 Ververica74 = Worst Practice stream.keyBy(“key”) .flatmap(..) .keyBy(“key”) .process(..) .keyBy(“key”) .timeWindow(..) ● DataStreamUtils#reinterpretAsKeyedStream ● Note: Stream needs to be partitioned exactly as Flink would partition it. public void flatMap(Bar bar, Collector<Foo> out) throws Exception { MyParserFactory factory = MyParserFactory.newInstance(); MyParser parser = factory.newParser(); out.collect(parser.parse(bar)); }
  • 75. © 2019 Ververica75 = Worst Practice stream.keyBy(“key”) .flatmap(..) .keyBy(“key”) .process(..) .keyBy(“key”) .timeWindow(..) ● DataStreamUtils#reinterpretAsKeyedStream ● Note: Stream needs to be partitioned exactly as Flink would partition it. public void flatMap(Bar bar, Collector<Foo> out) throws Exception { MyParserFactory factory = MyParserFactory.newInstance(); MyParser parser = factory.newParser(); out.collect(parser.parse(bar)); } ● use RichFunction#open for initialization logic
  • 76. © 2019 Ververica76 = Worst Practice Don’t use an iterative development process! AnalysisProject Setup Development (Testing) Go Live Maintenance
  • 77. © 2019 Ververica77 = Worst Practice That’s a pyramid! UDF Unit Tests System Tests
  • 78. © 2019 Ververica78 = Worst Practice That’s a pyramid! UDF Unit Tests AbstractTestHarness MiniClusterResource System Tests
  • 79. © 2019 Ververica79 = Worst Practice Testing Stateful and Timely Operators and Functions @Test public void testingStatefulFlatMapFunction() throws Exception { //push (timestamped) elements into the operator (and hence user defined function) testHarness.processElement(2L, 100L); //trigger event time timers by advancing the event time of the operator with a watermark testHarness.processWatermark(100L); //trigger processing time timers by advancing the processing time of the operator directly testHarness.setProcessingTime(100L); //retrieve list of emitted records for assertions assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L)) //retrieve list of records emitted to a specific side output for assertions (ProcessFunction only) assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0)) }
  • 80. © 2019 Ververica80 = Worst Practice Examples: https://github.com/knaufk/flink-testing-pyramid Testing Stateful and Timely Operators and Functions @Test public void testingStatefulFlatMapFunction() throws Exception { //push (timestamped) elements into the operator (and hence user defined function) testHarness.processElement(2L, 100L); //trigger event time timers by advancing the event time of the operator with a watermark testHarness.processWatermark(100L); //trigger processing time timers by advancing the processing time of the operator directly testHarness.setProcessingTime(100L); //retrieve list of emitted records for assertions assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L)) //retrieve list of records emitted to a specific side output for assertions (ProcessFunction only) assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0)) }
  • 81. © 2019 Ververica81 = Worst Practice Don’t use an iterative development process! AnalysisProject Setup Development (Testing) Go Live Maintenance
  • 82. © 2019 Ververica82 = Worst Practice Ignore Spiky Loads! ● seasonal fluctuations processing time records/s
  • 83. © 2019 Ververica83 = Worst Practice Ignore Spiky Loads! ● seasonal fluctuations ● timers processing time records/s
  • 84. © 2019 Ververica84 = Worst Practice Ignore Spiky Loads! ● seasonal fluctuations ● checkpoint alignment ● watermark interval ● GC pauses ● ... ● timers processing time records/s
  • 85. © 2019 Ververica85 = Worst Practice Ignore Spiky Loads! ● seasonal fluctuations ● checkpoint alignment ● watermark interval ● GC pauses ● ... ● timers processing time records/s catch-up scenario
  • 86. © 2019 Ververica86 = Worst Practice Monitoring & Metrics if at all start monitoring now
  • 87. © 2019 Ververica87 = Worst Practice Monitoring & Metrics if at all start monitoring now [1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html ● don’t miss the chance to learn about Flink’s runtime during development ● not sure how to start → read [1]
  • 88. © 2019 Ververica88 = Worst Practice use the Flink Web Interface as monitoring system Monitoring & Metrics if at all start monitoring now [1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html ● don’t miss the chance to learn about Flink’s runtime during development ● not sure how to start → read [1]
  • 89. © 2019 Ververica89 = Worst Practice use the Flink Web Interface as monitoring system Monitoring & Metrics if at all start monitoring now [1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html ● don’t miss the chance to learn about Flink’s runtime during development ● not sure how to start → read [1] ● not the right tool for the job → MetricsReporters (e.g Prometheus, InfluxDB, Datadog) ● too many metrics can bring JobManagers down
  • 90. © 2019 Ververica90 = Worst Practice use the Flink Web Interface as monitoring system Monitoring & Metrics if at all start monitoring now using latency markers in production [1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html ● don’t miss the chance to learn about Flink’s runtime during development ● not sure how to start → read [1] ● not the right tool for the job → MetricsReporters (e.g Prometheus, InfluxDB, Datadog) ● too many metrics can bring JobManagers down
  • 91. © 2019 Ververica91 = Worst Practice use the Flink Web Interface as monitoring system Monitoring & Metrics if at all start monitoring now using latency markers in production [1] https://flink.apache.org/news/2019/02/25/monitoring-best-practices.html ● don’t miss the chance to learn about Flink’s runtime during development ● not sure how to start → read [1] ● not the right tool for the job → MetricsReporters (e.g Prometheus, InfluxDB, Datadog) ● too many metrics can bring JobManagers down ● high overhead (in particular with metrics.latency.granularity: subtask) ● measure event time lag instead [2] [2] https://flink.apache.org/2019/06/05/flink-network-stack.html
  • 92. © 2019 Ververica92 = Worst Practice Configuration choosing RocksDBStatebackend by default
  • 93. © 2019 Ververica93 = Worst Practice Configuration choosing RocksDBStatebackend by default ● FilesystemStatebackend is faster & easier to configure ● State Processor API can be used to switch later
  • 94. © 2019 Ververica94 = Worst Practice Configuration choosing RocksDBStatebackend by default NFS/EBS/etc. as state.backend.rocksdb.localdir ● FilesystemStatebackend is faster & easier to configure ● State Processor API can be used to switch later
  • 95. © 2019 Ververica95 = Worst Practice Configuration choosing RocksDBStatebackend by default NFS/EBS/etc. as state.backend.rocksdb.localdir ● FilesystemStatebackend is faster & easier to configure ● State Processor API can be used to switch later ● disk IO ultimately determines how fast you can read/write state
  • 96. © 2019 Ververica96 = Worst Practice Configuration choosing RocksDBStatebackend by default playing around with Slots and SlotSharingGroups (too early) NFS/EBS/etc. as state.backend.rocksdb.localdir ● FilesystemStatebackend is faster & easier to configure ● State Processor API can be used to switch later ● disk IO ultimately determines how fast you can read/write state
  • 97. © 2019 Ververica97 = Worst Practice Configuration choosing RocksDBStatebackend by default playing around with Slots and SlotSharingGroups (too early) NFS/EBS/etc. as state.backend.rocksdb.localdir ● FilesystemStatebackend is faster & easier to configure ● State Processor API can be used to switch later ● disk IO ultimately determines how fast you can read/write state ● usually not worth it ● leads to less chaining and additional serialization, more network shuffles and lower resource utilization
  • 98. © 2019 Ververica98 = Worst Practice Don’t use an iterative development process! AnalysisProject Setup Development (Testing) Go Live Maintenance
  • 99. © 2019 Ververica99 = Worst Practice with a fast-pace project like Apache Flink, don’t upgrade
  • 101. © 2019 Ververica www.ververica.com @snntrablekonstantin@ververica.com