We will cover the need for CDC and the benefits of building a CDC pipeline, then compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail, along with our contributions to the open-source community.
Building robust CDC pipeline with Apache Hudi and Debezium
1. BUILDING ROBUST CDC PIPELINE WITH
APACHE HUDI AND DEBEZIUM @SCALE
• PRATYAKSH
• PURUSHOTHAM
• SYED
• SHAIK
Hadoop Meetup Bangalore
(Dec-2019)
2. What is CDC?
Benefits of CDC
Comparison of CDC Streaming Systems
Comparison of Reconciler Systems
CDC Platform Architecture @ Tathastu
Challenges
Contribution
Roadmap
Questions
3. CHANGE DATA CAPTURE (CDC): A set of
software design patterns used to determine
(and track) the data that has changed so that
action can be taken using the changed data.
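As a minimal sketch of this pattern, assume change events shaped like Debezium's envelope: an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before` and `after` row images. The `apply_change` helper, the `id` key column and the sample rows are hypothetical and only illustrate how a consumer can act on changed data.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change(replica: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one change event to an in-memory replica keyed by the primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, update: store the new row image.
        row = event["after"]
        replica[row[key]] = row
    elif op == "d":
        # Delete: only the old row image is present.
        replica.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical change stream for a two-column table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "bob"}},
    {"op": "u", "before": {"id": 1, "name": "alice"}, "after": {"id": 1, "name": "alicia"}},
    {"op": "d", "before": {"id": 2, "name": "bob"}, "after": None},
]

replica: Dict[Any, Row] = {}
for e in events:
    apply_change(replica, e)
```

After the four events, the replica holds only the updated row for `id` 1, which is the state of the source table without rereading it in full.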
9. Hudi: Hadoop Upserts Deletes and Incrementals
Consists of a self-contained Spark library
Hudi key = Record key + Partition key
Storage types – COPY_ON_WRITE and MERGE_ON_READ
Query Engines – SparkSQL, Hive, Presto
Multiple Cleaning and Compaction policies supported
Key classes – HoodieDeltaStreamer, HiveSyncTool
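The key rule above can be illustrated with a toy in-memory model. This is a sketch, not Hudi's implementation: it assumes records are identified by the pair (partition path, record key) and that an ordering field decides which version wins, similar in spirit to Hudi's precombine field. The `upsert` function and the field names `id`, `dt` and `ts` are hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple

Record = Dict[str, Any]
HudiKey = Tuple[str, str]  # (partition path, record key)


def upsert(
    table: Dict[HudiKey, Record],
    batch: Iterable[Record],
    record_key_field: str,
    partition_field: str,
    ordering_field: str,
) -> Dict[HudiKey, Record]:
    """Insert new records and overwrite existing ones that are not newer."""
    for rec in batch:
        key = (str(rec[partition_field]), str(rec[record_key_field]))
        current = table.get(key)
        # Keep the incoming record only if it is at least as recent.
        if current is None or rec[ordering_field] >= current[ordering_field]:
            table[key] = rec
    return table


table: Dict[HudiKey, Record] = {}
upsert(table, [
    {"id": 1, "dt": "2019-12-01", "ts": 1, "v": "a"},
    {"id": 2, "dt": "2019-12-01", "ts": 1, "v": "b"},
], "id", "dt", "ts")
upsert(table, [
    {"id": 1, "dt": "2019-12-01", "ts": 2, "v": "a2"},     # newer: replaces "a"
    {"id": 1, "dt": "2019-12-02", "ts": 1, "v": "c"},      # other partition: new record
    {"id": 2, "dt": "2019-12-01", "ts": 0, "v": "stale"},  # older: ignored
], "id", "dt", "ts")
```

In this model the same record key in two partitions yields two distinct records, which is why the partition key is part of the identity.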
10.
11. Schema evolution
Handling datatypes (JDBC)
Handling RDS internal commands
Making libraries compatible with the latest versions of Kafka and Spark
Multi-table support in DeltaStreamer
Enhancing Kafka Batch read for Bootstrapping (Source Limit)
Hive Metastore settings
Queryable Hudi dataset – making it compatible with Athena
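For the schema evolution item, one way to catch problems early is to compare the old and new record schemas before writing. The sketch below is a deliberately conservative simplification and not the Schema Registry's compatibility checker: it assumes Avro-style record schemas represented as dicts and accepts only additions of fields that carry a default. The `evolution_problems` function and the sample schemas are hypothetical.

```python
from typing import Any, Dict, List


def evolution_problems(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Return reasons why `new` is not a safe, additive evolution of `old`."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    problems: List[str] = []
    for name, field in new_fields.items():
        if name not in old_fields:
            # Old rows lack this column, so readers need a default value.
            if "default" not in field:
                problems.append(f"added field {name!r} has no default")
        elif field["type"] != old_fields[name]["type"]:
            problems.append(f"field {name!r} changed type")
    for name in old_fields:
        if name not in new_fields:
            problems.append(f"field {name!r} was removed")
    return problems


old = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
]}
new_ok = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}
new_bad = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
]}
```

Here `evolution_problems(old, new_ok)` returns an empty list, while `new_bad` is flagged for the type change on `id` and for adding `age` without a default.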
13. • Build a single-click UI for orchestration
• Data profiler UI for validation and alerts
• Config store for configs and credentials
• ACLs for tables and databases (via Ranger)
• Managing the subscriber list for notifications and alerts
14. • QUBOLE CDC RECONCILER COMPARISON
• HUDI DETAILED ARCHITECTURE DISCUSSION
• ADVANTAGES OF LOG-BASED OVER QUERY-BASED