Building robust CDC pipeline with Apache Hudi and Debezium


We will cover the need for CDC and the benefits of building a CDC pipeline, and compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail, along with our contributions to the open-source community.


BUILDING ROBUST CDC PIPELINE WITH
APACHE HUDI AND DEBEZIUM @SCALE
• PRATYAKSH
• PURUSHOTHAM
• SYED
• SHAIK
Hadoop Meetup Bangalore
(Dec-2019)

What is CDC?
Benefits of CDC
Comparison of CDC Streaming Systems
Comparison of Reconciler Systems
CDC Platform Architecture @ Tathastu
Challenges
Contribution
Roadmap
Questions

CHANGE DATA CAPTURE (CDC): A set of
software design patterns used to determine
(and track) the data that has changed so that
action can be taken using the changed data.

Low latency
Event processing
Real time analytics and Dashboarding
Audit logging
Distribute the load round the clock

Method | Log-Based | Query-Based
Tools | Debezium | JDBC Connector
Schema Evolution | Yes | Yes
Processing | Stream | Batch
Audit Track | Preserved | Partially Preserved
Latency | Low | High
Cost | High | Low
Delete Track | Yes | No
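
To make the "Audit Track" and "Delete Track" rows concrete, here is a minimal in-memory sketch in Python. It is not Debezium or JDBC connector code: the `Row`, `LogEvent` and `query_based_poll` names and the `updated_at` polling column are hypothetical, and the `c`/`u`/`d` operation codes simply mirror Debezium-style event types. It shows that a polling query only sees the rows that exist at poll time, while a change log also carries the intermediate updates and the delete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Row:
    id: int
    value: str
    updated_at: int  # hypothetical "last modified" column used for polling


@dataclass
class LogEvent:
    op: str               # "c" = create, "u" = update, "d" = delete
    id: int
    value: Optional[str]  # None for deletes
    ts: int


def query_based_poll(table: Dict[int, Row], watermark: int) -> List[Row]:
    """Return rows modified after the watermark, as a polling query would."""
    return [row for row in table.values() if row.updated_at > watermark]


# Changes made to the source table between two polls.
log = [
    LogEvent("c", 1, "a", ts=1),
    LogEvent("c", 2, "b", ts=2),
    LogEvent("u", 1, "a2", ts=3),
    LogEvent("u", 1, "a3", ts=4),
    LogEvent("d", 2, None, ts=5),
]

# State of the source table at poll time, after all five changes.
table_at_poll = {1: Row(1, "a3", updated_at=4)}

# Query-based: only the final state of row 1 is visible; row 2 is simply absent.
polled = query_based_poll(table_at_poll, watermark=0)

# Log-based: the delete and every intermediate version are available.
deletes_seen_by_log = [e.id for e in log if e.op == "d"]
versions_of_row_1 = [e.value for e in log if e.id == 1]

print(polled)
print(deletes_seen_by_log)
print(versions_of_row_1)
```

The poll cannot tell a deleted row from one that never existed, which is why the query-based column shows a partially preserved audit track and no delete tracking.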

Solution | Maxwell | Apache NiFi | Debezium
Bootstrap | Yes | No | Yes
Formats | JSON | JSON | JSON, Avro
Message Queues | Kafka, Kinesis, SQS, Google Pub/Sub, RabbitMQ, Redis, Custom Producer | NiFi connections | Kafka
Schema Evolution | Yes | No | Yes
Latency | Low | Medium | Low
Supported Databases | MySQL | MySQL | MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Cassandra
Onboarding | Command Driven | Config and API Driven | Purely API Driven
State Storage/checkpoints | External Database | Zookeeper, External Cache | Kafka topics

Solution | Delta.io (Databricks) | Apache HUDI | Apache Hive (LLAP)
Updates / Deletes | Yes | Yes | Yes
Compactions | Manual cleanup, No Compaction | Automatic, Manual | Automatic, Manual
File Format | Parquet | Parquet, AVRO | ORC
Engine | Spark, Presto (Recently) | Spark, Presto, Hive, EMR, Athena (with workaround) | Hive, Spark (LLAP)
SQL DML | NO | NO | YES
Write Amplification | HIGH | LOW | LOW
Apache Governance | YES (Recently) | YES | YES

Credits: Qubole

Hudi: Hadoop Upserts Deletes and Incrementals
Consists of a self-contained Spark library
Hudi key = Record key + Partition key
Storage types – COPY_ON_WRITE and MERGE_ON_READ
Query Engines – SparkSQL, Hive, Presto
Multiple Cleaning and Compaction policies supported
Key classes – HoodieDeltaStreamer, HiveSyncTool
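
The "Hudi key = Record key + Partition key" point can be illustrated with a small in-memory model in Python. This is a sketch under stated assumptions, not Hudi's API: the `dedupe_batch` and `upsert` functions and the `id`, `country` and `amount` fields are hypothetical, and `__ts_ms` is borrowed from the `--source-ordering-field` used in the talk's command. Within a batch, the record with the largest ordering value wins for each key; how incoming records merge with data already on storage depends on the configured payload class, which this sketch simplifies to an overwrite.

```python
from typing import Dict, List, Tuple

# (partition path, record key) identifies a record in this toy model.
HudiKey = Tuple[str, str]


def dedupe_batch(batch: List[dict], record_key: str, partition_field: str,
                 ordering_field: str) -> Dict[HudiKey, dict]:
    """Keep, per key, the record with the largest ordering value in the batch."""
    latest: Dict[HudiKey, dict] = {}
    for record in batch:
        key = (str(record[partition_field]), str(record[record_key]))
        current = latest.get(key)
        if current is None or record[ordering_field] > current[ordering_field]:
            latest[key] = record
    return latest


def upsert(table: Dict[HudiKey, dict], batch: List[dict], record_key: str,
           partition_field: str, ordering_field: str) -> None:
    """Insert new keys and overwrite existing ones with the deduplicated batch."""
    table.update(dedupe_batch(batch, record_key, partition_field, ordering_field))


table: Dict[HudiKey, dict] = {}
batch = [
    {"id": 1, "country": "IN", "amount": 10, "__ts_ms": 100},
    {"id": 1, "country": "IN", "amount": 25, "__ts_ms": 300},
    {"id": 1, "country": "IN", "amount": 15, "__ts_ms": 200},  # arrives late, older than 300
    {"id": 1, "country": "US", "amount": 99, "__ts_ms": 150},  # same id, different partition
]
upsert(table, batch, record_key="id", partition_field="country",
       ordering_field="__ts_ms")

for key, record in sorted(table.items()):
    print(key, record["amount"])
```

Because the partition is part of the key in this model, the same `id` in two partitions is stored as two separate records.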

Schema evolution
Handling datatypes (JDBC)
Handling RDS internal commands
Making libraries compatible with latest versions of Kafka and Spark
Multi-table support in DeltaStreamer
Enhancing Kafka Batch read for Bootstrapping (Source Limit)
Hive Metastore settings
Queryable HUDI dataset: making it compatible with Athena

CONTRIBUTION
• HUDI-288
• HUDI-340
• HUDI-259
• HUDI-114
• HUDI-118
• HUDI-245
• DBZ-1521
• DBZ-1492
• 563
• 311
• NIFI-6501
• NIFI-6914
• NIFI-6119

• Build a single-click UI for orchestration
• Data profiler UI for validation and alerts
• Config store for configs and credentials
• ACLs for tables and databases (via Ranger)
• Managing the subscriber list for notifications and alerts

• QUBOLE CDC RECONCILER COMPARISON
• HUDI DETAILED ARCHITECTURE DISCUSSION
• ADVANTAGES OF LOG-BASED OVER QUERY-BASED

HUDI Command

spark-submit --name debz_futurepay --queue etl \
  --files jaas.conf,custom_config.json \
  --master yarn --deploy-mode cluster \
  --driver-memory 4g --executor-memory 4g --num-executors 50 \
  --class org.apache.hudi.utilities.deltastreamer.CDCStreamer \
  hudi-utilities-bundle-0.5.1-SNAPSHOT.jar \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --storage-type COPY_ON_WRITE \
  --source-ordering-field __ts_ms \
  --target-base-path s3://{BASE_PATH}/hudi/${DATABASE}/${TABLE}/ \
  --target-table cdc_flat_cow \
  --props ${HUDI_CONFIG} \
  --enable-hive-sync \
  --custom-props custom_config.json \
  --continuous \
  --source-limit 1000000

Hive Metastore Properties

hive.metastore.disallow.incompatible.col.type.changes=false
parquet.column.index.access='false'

HUDI Properties (for Athena)

# Cleanup policy
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=1
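
As a rough illustration of what a keep-latest-N-file-versions cleaning policy does with a retention of 1, the following Python sketch selects which file versions would be removed. It is a simplified model, not Hudi's cleaner implementation: the `files_to_clean` function, the `fg-1`/`fg-2` file group ids and the commit times are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each file version is modelled as (file_group_id, commit_time).
FileVersion = Tuple[str, str]


def files_to_clean(files: List[FileVersion], versions_retained: int) -> List[FileVersion]:
    """Return the file versions a keep-latest-N policy would delete."""
    by_group: Dict[str, List[str]] = defaultdict(list)
    for group_id, commit_time in files:
        by_group[group_id].append(commit_time)

    doomed: List[FileVersion] = []
    for group_id, commits in by_group.items():
        # Newest first; everything beyond the first N versions is removable.
        for commit_time in sorted(commits, reverse=True)[versions_retained:]:
            doomed.append((group_id, commit_time))
    return sorted(doomed)


files = [
    ("fg-1", "20191201"), ("fg-1", "20191202"), ("fg-1", "20191203"),
    ("fg-2", "20191202"),
]
to_delete = files_to_clean(files, versions_retained=1)
print(to_delete)
```

With a retention of 1, only the newest version of each file group survives, so `fg-1` keeps its `20191203` version and `fg-2`, which has a single version, is untouched.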
