Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf

© 2019 Ververica
Konstantin Knauf
Head of Product - Ververica Platform
Apache Flink Committer
@snntrable

© 2019 Ververica2 = Worst Practice
Don’t use an iterative development process!
AnalysisProject Setup Development (Testing) Go Live Maintenance

...choose the right setup!
start with one of you most challenging uses cases

no prior stream processing knowledge

no training

no training
don’t use the community

● Training
○ @FlinkForward
○ https://www.ververica.com/training
Getting Help!

● Community
○ user@flink.apache.org
■ ~ 600 threads per month
■ ∅30h until first response
○ www.stackoverflow.com
■ How not to ask questions?
● https://data.stackexchange.com/stackoverflow/query/1115371/most-down-voted-apache-flink-questions
○ (ASF Slack #flink)
● Training
○ @FlinkForward
○ https://www.ververica.com/training
Getting Help!

...don’t think too much about your requirements.
don’t think about consistency & delivery guarantees

don’t think about upgrades and application evolution

don’t think about the scale of your problem

don’t think about the scale of your problem
don’t think too much about your actual business requirements

Do I care about losing records?

No checkpointing,
any source
No

No checkpointing,
any source
NoYes
Do I care about correct results?

No checkpointing,
any source
NoYes
CheckpointingMode.AT_LEAST_ONCE
& replayable sources
No

No checkpointing,
any source
NoYes
Yes No
Do I care about duplicate (yet correct)
records downstream?

No checkpointing,
any source
NoYes
CheckpointingMode.EXACTLY_ONCE
Yes No
No
records downstream?

No checkpointing,
any source
NoYes
CheckpointingMode.EXACTLY_ONCE,
replayable sources & transactional sinks
Yes No
NoYes
records downstream?

No checkpointing,
any source
NoYes
CheckpointingMode.EXACTLY_ONCE,
replayable sources & transactional sinks
Yes No
NoYes
records downstream?
L
A
T
E
N
C
Y

Flink job
user code
Local state
backends
Persisted
savepoint
local reads / writes that
manipulate state
Basics

Flink job
user code
Local state
backends
Persisted
savepoint
local reads / writes that
manipulate state
persist to
DFS on
savepoint
Basics

Flink job
user code
Local state
backends
Persisted
savepoint
Upgrade
Application
1. Upgrade Flink cluster
2. Fix bugs
3. Pipeline topology changes
4. Job reconﬁgurations
5. Adapt state schema
6. ...
Basics

Flink job
user code
Local state
backends
Persisted
savepoint
reload state
into local
state backends
Basics

Flink job
user code
Local state
backends
Persisted
savepoint
continue to
access state
reload state
into local
state backends
Basics

Topology Changes

Sink:
LateDataSink
Topology Changes

flink run --allowNonRestoredState <jar-file
<arguments>
Sink:
LateDataSink
Topology Changes

new operator starts ➙ empty state
Topology Changes

Still the same?
Topology Changes

Still the same?
streamOperator
.uid("AggregatePerSensor")
Topology Changes

● Avro Types ✓
○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
● Flink POJOs ✓
○ https://ci.apache.org/projects/ﬂink/ﬂink-docs-release-1.9/dev/stream/state/schema_evolution.html#pojo-types
● Kryo✗
● Key data types can not be changed
State Schema Evolution

● Avro Types ✓
○ https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
● Flink POJOs ✓
○ https://ci.apache.org/projects/ﬂink/ﬂink-docs-release-1.10/dev/stream/state/schema_evolution.html#pojo-types
● Kryo✗
● Key data types can not be changed
State Schema Evolution
State Unlocked
Today, 12:20pm - 1:00pm
Track 2
Seth Wiesman, Tzu-Li (Gordon) Tai
● State Processor API for the rescue

State Size & Network
Stateful Operator
TaskManager (1 Slot)
Source
keyBy
Sink

Stateful Operator
Source
keyBy
Sink
FilesystemStatebackend

Stateful Operator
Source
keyBy
Sink
#key * state_size

Stateful Operator
Source
keyBy
Sink
Ingress MB/s
#key * state_size
#records/s * size

Stateful Operator
Source
keyBy
Sink
Ingress MB/s Shuﬄe MB/s
Shuﬄe MB/s
#key * state_size
#records/s * size
#records/s * size * #tm-1/#tm

Stateful Operator
Source
keyBy
Sink
Shuﬄe MB/s
Egress MB/s
#key * state_size
#records/s * size
#messages/s * size

Stateful Operator
Source
keyBy
Sink
Shuﬄe MB/s
Egress MB/s
FilesystemStatebackend Checkpoint MB/s
#key * state_size
#records/s * size
overall_state_size * 1/checkpoint_interval
#messages/s * size

Event Time & Out-Of-Orderness
I want to send an alarm when the number of transactions per customer
exceeds three in ten seconds.

4

4 8

4 8 13

4 8 13 11

4 8 13 11 15

4 8 13 11 15 21

4 8 13 11 15 21
4 8 13 11 15 21Tumbling Window (10 secs):

4 8 13 11 21
4
15
8 13 11 15 21
4 8 13 11 15 21
Tumbling Window (10 secs):
Sliding Window (10/5 secs):

4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
“Look Back” (10 secs) :

4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
Fire or wait for watermark?

4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
4 8 13 11 15 21
Fire or wait for watermark?
Fire or wait till watermark reaches end of window?

Batch Processing Requirements
I receive two ﬁles every ten minutes: transactions.txt & quotes.txt
● They need to be transformed & joined.
● The output must be exactly one ﬁle.
● The operation needs to be atomic.
→ Don’t try to re-implement your batch jobs as a stream processing jobs.

AnalysisProject Setup Development (Testing) Production Maintenance

Applications
(physical)
Analytics
(declarative)
DataStream API Table API/SQL
Types are Java / Scala classes Logical Schema for Tables
Transformation Functions Declarative Language (SQL, Table DSL)
State implicit in operationsExplicit control over State
Explicit control over Time SLAs deﬁne when to trigger
Executes as described Automatic Optimization
Analytics or Application

Table API / SQL Red Flags
● When upgrading Apache Flink or my application I want to migrate state.
● I can not loose late data.
● I want to change the behaviour of my application during runtime.

Worst Practices
make use of deeply-nested, complex data types
KeySelectors#getKey can return any serializable type; use this freedom

● serialization is not to be underestimated
● the simpler your data types the better
● use Flink POJOs or Avro SpecificRecords
● key types matter most
○ part of every KeyedState
○ part of every Timer
● tune locally
Serializer Ops/s
PojoSerializer 813
Kryo 294
Avro (Reflect API) 114
Avro (SpecificRecord API) 632
https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html

sourceStream.flatMap(new Deserializer())
.keyBy(“cities”)
.timeWindow()
.count()
.filter(new GeographyFilter(“America”))
.addSink(...)

sourceStream.flatMap(new Deserializer())
.keyBy(“cities”)
.timeWindow()
.count()
.filter(new GeographyFilter(“America”))
.addSink(...)
don’t process data you don’t need
● project early
● ﬁlter early
● don’t deserialize unused ﬁelds, e.g.
public class Record {
private City city;
private byte[] enclosedRecord;
}

static variables to share state between Tasks

static variables to share state between Tasks ● bugs
● deadlocks & lock contention
(interaction with framework code)
● synchronization overhead

spawning threads in user functions
● bugs

spawning threads in user functions
● bugs
● complicated error prone
(checkpointing)
● use AsyncStreams to reduce wait times
on external IO
● use Timers to schedule tasks
● increase operator parallelism to
increase parallelism

stream.keyBy(“key”)
.window(GlobalWindows.create())
.trigger(new CustomTrigger())
.evictor(new CustomEvictor())
.reduce/aggregate/fold/apply()

● Avoid custom windowing
● KeyedProcessFunction usually
○ less error-prone
○ simpler

.timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS)
.apply(new MyWindowFunction())
○ simpler

.timeWindow(Time.of(30, DAYS), Time.of(5, SECONDS)
.apply(new MyWindowFunction())
○ simpler
● each record is added to > 500k
windows
● without pre-aggregation

Queryable State for Inter-Task Communication

Queryable State for Inter-Task Communication
● Non-Thread Safe State Access
○ RocksDBStatebackend ✓
○ FilesystemStatebackend ✗
● Performance?
● Consistency Guarantees?
● Use Queryable State for debugging and
monitoring only

.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)

.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
● DataStreamUtils#reinterpretAsKeyedStream
● Note: Stream needs to be partitioned exactly as Flink
would partition it.

.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
would partition it.
public void flatMap(Bar bar, Collector<Foo> out) throws Exception {
MyParserFactory factory = MyParserFactory.newInstance();
MyParser parser = factory.newParser();
out.collect(parser.parse(bar));
}

.flatmap(..)
.keyBy(“key”)
.process(..)
.keyBy(“key”)
.timeWindow(..)
would partition it.
public void flatMap(Bar bar, Collector<Foo> out) throws Exception {
MyParserFactory factory = MyParserFactory.newInstance();
MyParser parser = factory.newParser();
out.collect(parser.parse(bar));
}
● use RichFunction#open
for initialization logic

That’s a pyramid!
UDF Unit Tests
System Tests

That’s a pyramid!
UDF Unit Tests
AbstractTestHarness
MiniClusterResource
System Tests

Testing Stateful and Timely Operators and Functions
@Test
public void testingStatefulFlatMapFunction() throws Exception {
//push (timestamped) elements into the operator (and hence user defined function)
testHarness.processElement(2L, 100L);
//trigger event time timers by advancing the event time of the operator with a watermark
testHarness.processWatermark(100L);
//trigger processing time timers by advancing the processing time of the operator directly
testHarness.setProcessingTime(100L);
//retrieve list of emitted records for assertions
assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L))
//retrieve list of records emitted to a specific side output for assertions (ProcessFunction only)
assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0))
}

Examples: https://github.com/knaufk/ﬂink-testing-pyramid
Testing Stateful and Timely Operators and Functions
@Test
public void testingStatefulFlatMapFunction() throws Exception {
//push (timestamped) elements into the operator (and hence user defined function)
testHarness.processElement(2L, 100L);
//trigger event time timers by advancing the event time of the operator with a watermark
testHarness.processWatermark(100L);
//trigger processing time timers by advancing the processing time of the operator directly
testHarness.setProcessingTime(100L);
//retrieve list of emitted records for assertions
assertThat(testHarness.getOutput(), containsInExactlyThisOrder(3L))
//retrieve list of records emitted to a specific side output for assertions (ProcessFunction only)
assertThat(testHarness.getSideOutput(new OutputTag<>("invalidRecords")), hasSize(0))
}

Ignore Spiky Loads!
● seasonal ﬂuctuations
processing time
records/s

Ignore Spiky Loads!
● timers
processing time
records/s

Ignore Spiky Loads!
● checkpoint alignment
● watermark interval
● GC pauses
● ...
● timers
processing time
records/s

Ignore Spiky Loads!
● checkpoint alignment
● watermark interval
● GC pauses
● ...
● timers
processing time
records/s
catch-up scenario

Monitoring & Metrics
if at all start monitoring now

[1] https://ﬂink.apache.org/news/2019/02/25/monitoring-best-practices.html
● don’t miss the chance to learn about
Flink’s runtime during development
● not sure how to start → read [1]

use the Flink Web Interface as monitoring system

● not the right tool for the job →
MetricsReporters (e.g Prometheus,
InﬂuxDB, Datadog)
● too many metrics can bring
JobManagers down

using latency markers in production
InﬂuxDB, Datadog)
JobManagers down

using latency markers in production
InfluxDB, Datadog)
JobManagers down
● high overhead (in particular with
metrics.latency.granularity:
subtask)
● measure event time lag instead [2]
[2] https://flink.apache.org/2019/06/05/flink-network-stack.html

Conﬁguration
choosing RocksDBStatebackend by default

Conﬁguration
choosing RocksDBStatebackend by default ● FilesystemStatebackend is faster &
easier to conﬁgure
● State Processor API can be used to
switch later

Conﬁguration
NFS/EBS/etc. as state.backend.rocksdb.localdir
● FilesystemStatebackend is faster &
switch later

Conﬁguration
switch later
● disk IO ultimately determines how fast
you can read/write state

Conﬁguration
playing around with Slots and SlotSharingGroups (too
early)
switch later

Conﬁguration
playing around with Slots and SlotSharingGroups (too
early)
switch later
● usually not worth it
● leads to less chaining and additional
serialization, more network shuﬄes
and lower resource utilization

with a fast-pace project like Apache Flink, don’t upgrade

Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf

Similaire à Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf (20)

Plus de Flink Forward

Plus de Flink Forward (20)

Dernier

Dernier (20)

Virtual Flink Forward 2020: Apache Flink Worst Wractices - Konstantin Knauf