Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the data change log (binlog) of a relational (OLTP) database and replays those changes in near real time into an external store, such as Delta Lake or Kudu, for real-time OLAP. Building a robust CDC streaming pipeline raises several concerns: how to ensure data accuracy, how to handle schema changes in the OLTP source, and how to support a variety of databases with little code.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
2. Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Jun Song (@windpiger)
songjun.sj@alibaba-inc.com
June 24, 2020
3. About Me
• Staff Engineer from Alibaba Cloud E-MapReduce Product Team
• Spark contributor focused on SparkSQL
• HiveOnDelta contributor (https://github.com/delta-io/connectors)
4. Agenda
• What is CDC
• CDC solution using Spark Streaming SQL & Delta Lake
• Future Work
7. Change Data Capture
Drawbacks of using Sqoop (batch mode):
▪ load pressure on the source database
▪ high-latency batch jobs (hourly/daily/…)
▪ cannot handle deleted rows
▪ cannot handle schema changes

sqoop import
--incremental lastmodified
--last-value '2028/01/01 13:00:00'
...
sqoop merge
--new-data newer
--onto older
--merge-key id
...
8. Change Data Capture
Drawbacks of using binlog (streaming mode):
▪ heavy servers & operational support (Kudu & HBase)
▪ HBase cannot support high-throughput analytics
▪ complex merge logic implemented in Java/Scala code
▪ cannot handle schema changes

[Diagram: binlog → Scala/Java merge job → Kudu/HBase]
10. Spark Streaming SQL
[Stack diagram: Spark Core → Spark SQL → Structured Streaming → Spark Streaming SQL]
https://www.alibabacloud.com/help/doc-detail/124684.htm
SQL is a standard declarative language, which can simplify real-time analytics.
▪ DDL
CREATE TABLE, CREATE TABLE AS SELECT, CREATE SCAN, CREATE STREAM
▪ DML
INSERT INTO, MERGE INTO
▪ SELECT
SELECT FROM, WHERE, GROUP BY, JOIN, UNION ALL
▪ UDF
TUMBLING, HOPPING, DELAY, SparkSQL UDF
▪ Data Source
Delta, Kafka, HBase, JDBC, Druid, Redis, Kudu, Alibaba Cloud (Loghub, Tablestore, DataHub)
Design Doc: https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit
11. Spark Streaming SQL

CREATE TABLE kafka_test
USING kafka
OPTIONS(
kafka.bootstrap.servers='',
subscribe='test')

-- batch
CREATE SCAN kafka_test_batch_scan
ON kafka_test
USING batch

-- streaming
CREATE SCAN kafka_test_stream_scan
ON kafka_test
USING stream
OPTIONS(
maxOffsetsPerTrigger='100000'
)

SELECT count(*)
FROM kafka_test_batch_scan
-- batch or streaming? The scan type, not the query, decides.

CREATE SCAN grammar:
CREATE SCAN tbName_alias
ON tbName
USING queryType
OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*)
12. Spark Streaming SQL

CREATE SCAN kafka_test_stream_scan
ON kafka_test
USING stream
OPTIONS(
maxOffsetsPerTrigger='100000'
)

CREATE STREAM kafka_test_stream_job
OPTIONS(
checkpointLocation='/tmp/spark',
outputMode='Append',
triggerType='ProcessingTime',
triggerIntervalMs='3000')
INSERT INTO target_tbl
SELECT * FROM kafka_test_stream_scan
WHERE units > 1000;

CREATE STREAM grammar:
CREATE STREAM queryName
OPTIONS (propertyName=propertyValue[,propertyName=propertyValue]*)
INSERT INTO tbName
queryStatement;
13. Spark Streaming SQL

MERGE INTO grammar:
mergeInto
: MERGE INTO target=tableIdentifier tableAlias
USING (source=tableIdentifier (timeTravel)? | '(' subquery=query ')') tableAlias
mergeCondition?
matchedClauses*
notMatchedClause?

MERGE INTO target_table t
USING source_table s
ON s.id = t.id
WHEN MATCHED AND s.opType = 'delete' THEN DELETE
WHEN MATCHED AND s.opType = 'update' THEN UPDATE SET id = s.id, name = s.name
WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (key, value) VALUES (s.key, s.value)
14. Spark Streaming SQL

DELAY / TUMBLING / HOPPING
-- WHERE delay(colName) < 'duration' is the SQL equivalent of the DataFrame API's
-- withWatermark("colName", "duration")

SELECT avg(inv_quantity_on_hand) qoh
FROM kafka_inventory
WHERE delay(inv_data_time) < '2 minutes'
GROUP BY TUMBLING (inv_data_time, interval 1 minute)
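HOPPING (sliding) windows are listed above but not shown. A minimal sketch, assuming HOPPING takes the time column, the window size, and the slide interval; the exact signature should be verified against the Streaming SQL docs linked above:

SELECT avg(inv_quantity_on_hand) qoh
FROM kafka_inventory
WHERE delay(inv_data_time) < '2 minutes'
GROUP BY HOPPING (inv_data_time, interval 1 minute, interval 30 second)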
17. Delta Lake
Improvement: query Delta tables from Hive (the HiveOnDelta connector)

CREATE EXTERNAL TABLE delta_tbl(a string, b int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION 'oss://testbucket/delta/events'
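With the connector jars on Hive's classpath, the table can then be queried like any other Hive table (a minimal usage sketch against the DDL above):

SELECT a, count(*)
FROM delta_tbl
GROUP BY a;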
18. CDC solution using Spark Streaming SQL & Delta Lake
[Diagram: binlog → Spark Streaming SQL → Delta Lake]
▪ no extra operational support for Delta
▪ no load pressure on the source database
▪ merge implemented easily in SQL
▪ real-time, low latency (minute-level)
19. Spark Streaming SQL

-- 1. define the source Kafka (binlog) table and the target Delta table
CREATE TABLE kafka_cdctest USING KAFKA …
CREATE TABLE delta_cdctest_oss USING DELTA …

-- 2. create a streaming scan over the Kafka table
CREATE SCAN cdctest_incremental_scan ON kafka_cdctest
USING STREAM
OPTIONS(
startingOffsets='earliest',
maxOffsetsPerTrigger='100000',
failOnDataLoss='false'
);

-- 3. create the streaming job: parse the binlog and merge it into the Delta table
CREATE STREAM cdctest_job
OPTIONS(checkpointLocation='/delta/cdctest_checkpoint_oss')
MERGE INTO delta_cdctest_oss AS target
USING (
SELECT
-- binlog parser
…
FROM cdctest_incremental_scan
) AS source
ON target.id = source.before_id
WHEN MATCHED AND source.recordType='UPDATE' THEN
UPDATE SET …
WHEN MATCHED AND source.recordType='DELETE' THEN
DELETE
WHEN NOT MATCHED AND source.recordType='INSERT' THEN
INSERT …

-- 4. submit the job
streaming-sql --master yarn --use-emr-datasource -f cdc_oss.sql
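The binlog parser inside the subquery is elided on the slide. A hypothetical sketch of what that projection could look like, assuming the binlog arrives as Debezium-style JSON in the Kafka message value and that the field names (recordType, before_id, id, name) match the MERGE clauses above; the real projection depends entirely on the binlog format in use:

SELECT
  CASE get_json_object(CAST(value AS STRING), '$.op')  -- map Debezium op codes
    WHEN 'c' THEN 'INSERT'
    WHEN 'u' THEN 'UPDATE'
    WHEN 'd' THEN 'DELETE'
  END AS recordType,
  get_json_object(CAST(value AS STRING), '$.before.id') AS before_id,
  get_json_object(CAST(value AS STRING), '$.after.id') AS id,
  get_json_object(CAST(value AS STRING), '$.after.name') AS name
FROM cdctest_incremental_scan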
20. CDC solution using Spark Streaming SQL & Delta Lake
[Diagram: the Delta table receives a sequence of streaming micro-batches (batch-0, batch-1, …); Spark Streaming SQL's MERGE INTO applies each micro-batch to the Delta table with a DeltaTable.merge call]
21. Long Running Stability Improvement
How to handle small files?
▪ increase the batch interval (minutes)
▪ compaction (changes the data layout, not the data)
▪ adaptive execution mode
22. Long Running Stability Improvement
How to handle small files? — Scheduled Compaction
[Diagram: streaming micro-batches are merged into the Delta table via DeltaTable.merge while a scheduled job (hourly/daily/…) runs compaction alongside]
OPTIMIZE <tbl> [WHERE where_clause]
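For example, a scheduled job might compact only the most recent partition (a sketch; the table name comes from the pipeline above, and the partition column ds is hypothetical):

OPTIMIZE delta_cdctest_oss WHERE ds = '2020-06-24';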
23. Long Running Stability Improvement
How to handle small files? — Scheduled Compaction
Problem: the streaming job fails while a compaction is running
[Diagram: the streaming merge reads files from the Delta table that the compaction's transaction commit has just rewritten, so the merge fails the transaction conflict check]
24. Long Running Stability Improvement
How to handle small files? — Scheduled Compaction
Problem: the streaming job fails while a compaction is running

binlog in batch          | streaming job status | improvement
-------------------------|----------------------|------------
only inserts             | succeeds             | fixed one bug: https://github.com/delta-io/delta/issues/326
includes deletes/updates | fails                | the streaming job retries this batch
25. Long Running Stability Improvement
How to handle small files? — Auto Compaction
[Diagram: merge and compaction alternate sequentially within the streaming job, so there is no transaction conflict]
Strategy: select the files whose file_size < COMPACT_FILE_SIZE; if the number of such files exceeds TRIGGER_FILE_COUNT, run a compaction, otherwise continue streaming.
26. Long Running Stability Improvement
How to handle small files? — Adaptive Execution
[Diagram: per batch, the binlog is joined with the changed files of the target Delta table, and all changed files are rewritten]
spark.sql.adaptive.enabled -> true
Adaptive execution can automatically merge small partitions to decrease the number of reducers, and thereby the number of output files.
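A minimal sketch of enabling this in a streaming-sql session; spark.sql.adaptive.shuffle.targetPostShuffleInputSize is the Spark 2.x knob for the post-shuffle partition size (Alibaba EMR's enhanced adaptive execution may expose further settings):

SET spark.sql.adaptive.enabled=true;
-- target bytes per post-shuffle partition (128 MB here)
SET spark.sql.adaptive.shuffle.targetPostShuffleInputSize=134217728;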
29. Future Work
▪ automatic schema change detection
▪ stable long-running performance (merge on read)
▪ simplified user experience via a SYNC grammar

SYNC kafka_binlog_tbl
TO delta_tbl
OPTIONS(
type='debezium.mysql'
)