During the last two major versions (1.9 & 1.10), the Apache Flink community spent a lot of effort improving the architecture toward truly unified batch & streaming processing. One example is that Flink SQL now supports multiple SQL planners under the same API. This talk first discusses the motivation behind these changes, and more importantly takes a deep dive into Flink SQL. The presentation shows the unified architecture for handling streaming and batch queries, and explains how Flink translates queries into relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. It also describes the lifetime of a query in detail: how the optimizer improves the plan based on relational node patterns, how Flink leverages a binary data format for its basic data structures, and how certain operators work. This gives the audience a better understanding of Flink SQL internals.
16. Physical Planning (Stream)
What is a changelog, and why do we need it?
What makes streaming special: the Changelog Mechanism (aka the Retraction Mechanism)
17. Physical Planning (Stream): Changelog Mechanism
SELECT cnt, COUNT(cnt) AS freq
FROM (
  SELECT word, COUNT(*) AS cnt
  FROM words
  GROUP BY word)
GROUP BY cnt

Source (word): Hello, World, Hello

Expected result (cnt | freq): the result passes through (1, 2) while both
words have count 1, and ends as (1, 1), (2, 1) once Hello's count becomes 2.
18. Physical Planning (Stream): Changelog Mechanism
Pipeline: Source → Word Count → Count Frequency (without changelog)

Word Count:
SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word

Source emits: Hello, World, Hello
word_count emits an append-only stream: (Hello, 1), (World, 1), (Hello, 2)

Count Frequency:
SELECT cnt, COUNT(cnt) AS freq
FROM word_count
GROUP BY cnt

Result: (1, 2), (2, 1) — but freq for cnt = 1 should be "1". Without a
changelog, the update (Hello, 2) is appended as a new row, and the stale
row (Hello, 1) is never retracted from the cnt = 1 group.
19. Physical Planning (Stream): Changelog Mechanism
Pipeline: Source → Word Count → Count Frequency (with changelog)

Word Count:
SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word

Source emits: Hello, World, Hello
word_count now emits a changelog stream:
(Hello, 1) insert
(World, 1) insert
(Hello, 1) update_before
(Hello, 2) update_after

Count Frequency:
SELECT cnt, COUNT(cnt) AS freq
FROM word_count
GROUP BY cnt

Result: (1, 1), (2, 1) — correct, because the update_before message removes
(Hello, 1) from the cnt = 1 group before (Hello, 2) is applied.

① The changelog makes the streaming query result correct
② The query optimizer determines whether update_before (retraction) is needed
③ Users are not aware of it
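The effect of the changelog can be sketched outside of Flink. The following minimal Python simulation (not Flink code; the record shapes and function names are illustrative) chains the two COUNT aggregations and shows that consuming the update_before messages yields the correct frequencies, while dropping them reproduces the wrong result from the previous slide:

```python
from collections import defaultdict

def word_count(words):
    """First aggregate: emits a changelog of (kind, word, cnt) records."""
    counts = defaultdict(int)
    changelog = []
    for word in words:
        if counts[word] > 0:
            # Retract the stale count before emitting the new one.
            changelog.append(("update_before", word, counts[word]))
        counts[word] += 1
        kind = "update_after" if counts[word] > 1 else "insert"
        changelog.append((kind, word, counts[word]))
    return changelog

def count_frequency(changelog):
    """Second aggregate: counts how many words share each cnt value."""
    freq = defaultdict(int)
    for kind, _word, cnt in changelog:
        if kind == "update_before":
            freq[cnt] -= 1      # retraction: undo the stale contribution
        else:
            freq[cnt] += 1
    return {cnt: f for cnt, f in freq.items() if f > 0}

log = word_count(["Hello", "World", "Hello"])
# (insert, Hello, 1), (insert, World, 1),
# (update_before, Hello, 1), (update_after, Hello, 2)
print(count_frequency(log))            # correct: {1: 1, 2: 1}

inserts_only = [r for r in log if r[0] != "update_before"]
print(count_frequency(inserts_only))   # wrong:   {1: 2, 2: 1}
```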
20. Physical Planning (Stream): Changelog Mechanism
Step 1: determine what changes a node will produce
([I]: Insert, [U]: Update, [D]: Delete)

Physical Plan (bottom-up):
Source (words)                      produces [INSERT]                  [I]
Aggregate (word, COUNT(*) AS cnt)   produces [INSERT, UPDATE]          [I, U]
Calc (cnt)                          produces [INSERT, UPDATE]          [I, U]
Aggregate (cnt, COUNT(*))           produces [INSERT, UPDATE, DELETE]  [I, U, D]
UpsertSink
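Step 1 can be sketched as a bottom-up traversal over the plan. The dict-based plan representation and the `produced_kinds` function below are hypothetical (Flink's planner models this with changelog traits on plan nodes), but the derived kinds match the annotations above:

```python
# Hypothetical sketch of Step 1: derive, bottom-up, which change kinds
# (I = insert, U = update, D = delete) each plan node can produce.
def produced_kinds(node):
    if node["type"] == "Source":
        return {"I"}            # an append-only source only inserts
    child = produced_kinds(node["input"])
    if node["type"] == "Calc":
        return child            # projection/filter forwards its input's kinds
    if node["type"] == "Aggregate":
        kinds = {"I", "U"}      # grouped aggregates update their results
        # If the input itself is updating, a group's count can drop to
        # zero, so the aggregate may also emit deletes.
        if "U" in child or "D" in child:
            kinds.add("D")
        return kinds
    raise ValueError("unknown node type: " + node["type"])

plan = {"type": "Aggregate", "input":      # cnt, COUNT(*)
       {"type": "Calc", "input":           # select cnt
       {"type": "Aggregate", "input":      # word, COUNT(*) AS cnt
       {"type": "Source"}}}}               # words

print(sorted(produced_kinds(plan)))        # ['D', 'I', 'U']
```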
21. Physical Planning (Stream): Changelog Mechanism
Step 2: determine how to produce updates
([UB]: Update_Before, [UA]: Update_After)

Requirements are pushed top-down; produced update modes follow:
UpsertSink                          requires only UPDATE_AFTER from its input
Aggregate (cnt, COUNT(*))           produces [UPDATE_AFTER]                   [UA]
                                    requires UPDATE_BEFORE + UPDATE_AFTER
Calc (cnt)                          produces [UPDATE_BEFORE + UPDATE_AFTER]   [UB+UA]
                                    requires UPDATE_BEFORE + UPDATE_AFTER
Aggregate (word, COUNT(*) AS cnt)   produces [UPDATE_BEFORE + UPDATE_AFTER]   [UB+UA]
                                    requires nothing from Source
Source (words)
22. Physical Planning (Stream): Changelog Mechanism
Final Physical Plan:
Source (words)                      [I]
Aggregate (word, COUNT(*) AS cnt)   [I, U] → [UB+UA]
  simple COUNT implementation; generates UPDATE_BEFORE
Calc (cnt)                          [I, U] → [UB+UA]
Aggregate (cnt, COUNT(*))           [I, U, D] → [UA]
  COUNT with retract() implementation; does not generate UPDATE_BEFORE
UpsertSink
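The difference between the two COUNT implementations can be sketched as follows. This is modeled loosely on the accumulate/retract contract of Flink's aggregate functions, not the real API; class names are illustrative:

```python
class SimpleCount:
    """COUNT over an insert-only input; cannot undo a row."""
    def __init__(self):
        self.count = 0

    def accumulate(self):
        self.count += 1

class RetractableCount(SimpleCount):
    """COUNT over an updating input: update_before/delete messages
    are applied via retract(), so stale rows can be removed."""
    def retract(self):
        self.count -= 1

# The cnt = 1 group of the top aggregate, fed by the changelog stream:
acc = RetractableCount()
acc.accumulate()   # (Hello, 1) insert
acc.accumulate()   # (World, 1) insert
acc.retract()      # (Hello, 1) update_before removes the stale row
print(acc.count)   # 1 — freq for cnt = 1 is correct
```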
23. Flink SQL Optimizations
• Internal Data Structure (BinaryRow)
• Mini-Batch Processing
• Aggregation Skew Handling
• Plan Rewrite
25. New Blink Planner: BinaryRow
• Deeply integrated with MemorySegment
• No deserialization needed / compact layout / randomly accessible
• Companion types: BinaryString, BinaryArray, BinaryMap

Layout inside MemorySegments:
Header (Row Kind) | Null info | Fixed-length part | Variable-length part
Fixed-size fields are stored inline in the fixed-length part; variable-length
fields (e.g. the strings "Flink" and "Forward") live in the variable-length
part and are referenced from the fixed-length part by pointer and length.

The Blink planner is 54.6% faster than the old planner when object reuse is enabled:
https://www.ververica.com/blog/a-journey-to-beating-flinks-sql-performance
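The layout idea can be illustrated with a toy encoder. This is not Flink's actual binary format (field widths, header semantics, and the bitmap are simplified assumptions), but it shows why a field is randomly accessible without deserializing the whole row:

```python
import struct

# Toy BinaryRow-style encoding: header byte, null bitmap byte,
# one 8-byte fixed slot per field, then a variable-length part
# that string slots point into.
def encode_row(fields):
    null_bits = 0
    fixed, var = b"", b""
    var_offset = 1 + 1 + 8 * len(fields)  # header + null bitmap + fixed part
    for i, value in enumerate(fields):
        if value is None:
            null_bits |= 1 << i                      # mark null in the bitmap
            fixed += struct.pack("<q", 0)
        elif isinstance(value, int):
            fixed += struct.pack("<q", value)        # stored inline
        else:  # str: the fixed slot holds (pointer, length) into the var part
            data = value.encode()
            fixed += struct.pack("<ii", var_offset + len(var), len(data))
            var += data
    return bytes([0]) + bytes([null_bits]) + fixed + var  # header 0 = insert

row = encode_row(["Flink", 2019, None])
# Field 1 is readable directly at a computed offset, no full deserialization:
print(struct.unpack_from("<q", row, 2 + 8)[0])  # 2019
```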
26. Mini-Batch Processing
Normal aggregation:
SELECT SUM(num) FROM T GROUP BY color
• Each record costs:
  • one state read and one state write
  • one deserialization and one serialization
  • one output
27. Mini-Batch Processing
Mini-Batch aggregation:
table.exec.mini-batch.enabled = true
table.exec.mini-batch.allow-latency = "5000 ms"
table.exec.mini-batch.size = 1000
SELECT SUM(num) FROM T GROUP BY color
• Uses heap memory to hold a bundle of input records
• Aggregates in memory before accessing state and doing serde operations
• Also eases the load on downstream operators
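The saving can be sketched in a few lines. This simulation (not Flink's operator; the dict stands in for the keyed state backend, and only the size trigger is modeled, not the latency timer) pre-aggregates a bundle in memory so state is touched once per distinct key per batch instead of once per record:

```python
from collections import defaultdict

state = defaultdict(int)     # stands in for the keyed state backend
state_accesses = 0

def flush(bundle):
    global state_accesses
    for color, partial_sum in bundle.items():
        state[color] += partial_sum   # one state update per key per batch
        state_accesses += 1

def mini_batch_sum(records, batch_size=4):
    bundle, buffered = defaultdict(int), 0
    for color, num in records:
        bundle[color] += num          # in-memory pre-aggregation
        buffered += 1
        if buffered >= batch_size:    # size trigger (latency trigger omitted)
            flush(bundle)
            bundle, buffered = defaultdict(int), 0
    flush(bundle)                     # flush the final partial bundle

records = [("red", 1), ("blue", 2), ("red", 3), ("red", 4), ("blue", 5)]
mini_batch_sum(records)
print(dict(state))        # {'red': 8, 'blue': 7}
print(state_accesses)     # 3 accesses instead of 5 without mini-batching
```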
29. Plan Rewrite (Top-N)
• A global streaming sort is impractical
• But it becomes feasible if the user only cares about the top n elements
• E.g. calculate the top 3 shops for each category:

SELECT *
FROM (
  SELECT *,  -- shopId and the other columns remain available here
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rowNum
  FROM shop_sales)
WHERE rowNum <= 3
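Why the rewrite makes this tractable can be sketched with a per-partition bounded heap. This is an illustrative simulation (not Flink's Rank operator; class and column names are assumptions): state per key stays at n rows, and each update is O(log n) rather than a global sort:

```python
import heapq
from collections import defaultdict

class TopN:
    """Keep only the n highest-sales shops per category in state."""
    def __init__(self, n):
        self.n = n
        self.heaps = defaultdict(list)  # per-category min-heap of (sales, shop)

    def add(self, category, shop, sales):
        heap = self.heaps[category]
        if len(heap) < self.n:
            heapq.heappush(heap, (sales, shop))
        elif sales > heap[0][0]:
            heapq.heapreplace(heap, (sales, shop))  # evict current minimum

    def top(self, category):
        return sorted(self.heaps[category], reverse=True)

top3 = TopN(3)
for shop, sales in [("a", 10), ("b", 7), ("c", 12), ("d", 9), ("e", 3)]:
    top3.add("books", shop, sales)
print(top3.top("books"))   # [(12, 'c'), (10, 'a'), (9, 'd')]
```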
32. Summary & Future Work
• Flink took a big step towards a truly unified architecture
• We walked through how Flink SQL works, step by step
• Flink SQL applies many optimizations for users automatically
• Future (Flink 1.11+):
  • The Blink planner will become the default planner, ready for production
  • New TableSource and TableSink interfaces (FLIP-95)
  • Support for reading changelogs (FLIP-105)
  • Unified batch and streaming filesystem connector (FLIP-115)
  • Hive DDL & DML compatibility (FLIP-123)