Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Pulsar Summit NA 2021
Talk link : https://www.youtube.com/watch?v=MWpnVIgcAXw

Summit Link: https://www.na2021.pulsar-summit.org/all-talks/change-data-capture-to-data-lakes-using-apache-pulsar-and-apache-hudi

Slide 1: Change Data Capture To Data Lakes Using Apache Pulsar/Hudi

Slide 2: Speaker Bio
- PMC Chair/Creator of Apache Hudi
- Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking)
- Principal Eng @ Confluent (ksqlDB, Kafka/Streams)
- Staff Eng @ LinkedIn (Voldemort, DDS)
- Sr. Eng @ Oracle (CDC/GoldenGate/XStream)

Slide 3: Agenda
1) Background On CDC
2) CDC to Lakes
3) Hudi Overview
4) Onwards

Slide 4: Background
CDC - What, Why

Slide 5: Change Data Capture
Design Pattern for Data Integration
- Not tied to any particular technology
- Delivers low latency
System for tracking, fetching new data
- Not concerned with how to use such data
- Ideally, incremental updates downstream
- Minimizing the number of bits read/written per change
Change is the ONLY Constant
- Even in computer science
- Data is immutable = myth (well, kinda)
Slide 6: Polling an External API
- Timestamps, status indicators, versions
- Simple; works for small-scale data changes
- E.g.: Polling the GitHub events API
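The polling approach can be sketched as a loop that tracks a timestamp watermark and asks the source only for changes newer than it. This is an illustrative sketch, not a real client: `fetch_since` stands in for whatever external API is being polled (e.g. the GitHub events API), and the `updated_at` field is a hypothetical change timestamp.

```python
from typing import Callable, Iterable, List, Tuple

def poll_changes(fetch_since: Callable[[float], Iterable[dict]],
                 watermark: float) -> Tuple[List[dict], float]:
    """One polling round: fetch records newer than `watermark`,
    return them plus the advanced watermark."""
    new_records = list(fetch_since(watermark))
    if new_records:
        # Advance the watermark to the newest change seen this round.
        watermark = max(r["updated_at"] for r in new_records)
    return new_records, watermark

# Stand-in for the external API: returns rows changed after `ts`.
_rows = [{"id": 1, "updated_at": 10.0}, {"id": 2, "updated_at": 20.0}]
def fake_api(ts: float):
    return [r for r in _rows if r["updated_at"] > ts]

records, wm = poll_changes(fake_api, watermark=10.0)  # only id=2 is new
```

The watermark is the "timestamp, status indicator, version" from the slide: it is the only state the poller needs to keep between rounds.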
Slide 7: Emit Event From App
- Data model to encode deltas
- Scales for high-volume data changes
- E.g.: Emitting sensor state changes to Pulsar
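A minimal sketch of the "data model to encode deltas" idea: an application-level change event carrying the operation type and before/after images, serialized to bytes before publishing. The field names are illustrative, not a standard schema, and the actual Pulsar `producer.send(...)` call is elided.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ChangeEvent:
    key: str                 # primary key of the changed entity
    op: str                  # "insert" | "update" | "delete"
    before: Optional[dict]   # prior image (None for inserts)
    after: Optional[dict]    # new image (None for deletes)
    ts: float                # event time, used for ordering downstream

    def encode(self) -> bytes:
        # Serialize to a byte payload, e.g. for producer.send(payload)
        return json.dumps(asdict(self)).encode("utf-8")

evt = ChangeEvent(key="sensor-42", op="update",
                  before={"temp": 20.1}, after={"temp": 23.7}, ts=1625097600.0)
payload = evt.encode()
decoded = json.loads(payload)
```

Carrying both images makes the stream self-describing: downstream consumers can apply, revert, or audit a change without consulting the source.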
Slide 8: Scan Database Redo Log
- SCN and other watermarks to extract data/metadata changes
- Operationally heavy, very high fidelity
- E.g.: Using Debezium to obtain changelogs from MySQL
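For the Debezium example, registering a MySQL connector looks roughly like the following, shown as a Python dict for illustration. The property names are standard Debezium MySQL connector settings, but exact requirements vary by version, so check the Debezium docs for the release in use; host, credentials, and table names here are placeholders.

```python
# Rough shape of a Debezium MySQL connector registration (illustrative).
debezium_mysql_config = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",      # placeholder host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",    # unique id, like a replica
        "table.include.list": "inventory.customers",
    },
}
```

Debezium tails the binlog as if it were a replica, which is what makes this approach high-fidelity but operationally heavier than polling.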
Slide 9: CDC vs Extract-Transform-Load?
CDC is merely incremental extraction
- Not really competing concepts
- ETL needs a one-time full bootstrap
CDC changes T and L significantly
- T on change streams, not just table state
- L incrementally, not just bulk reloads
- Incremental L = Apply

Slide 10: CDC vs Stream Processing
CDC enables streaming ETL
- Why bulk T & L anymore?
- Process change streams
- Mutable sinks
Reliable stream processing needs distributed logs
- Rewind/replay CDC logs
- Absorb spikes/batch writes to sinks

Slide 11: Ideal CDC Source
Supports reliable incremental consumption
- Offsets are stable, deterministic
- Efficient fetching of new changes
Supports rewinding/replay
- Database redo logs are typically purged frequently
- Event logs offer tiering/large retention
Supports ordering of changes
- Out-of-order apply => incorrect results
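The "ideal source" properties (stable offsets, replay, ordering) are exactly what an append-only log provides. A toy sketch, purely illustrative:

```python
class ChangeLog:
    """Append-only log: offsets are stable and consumption is replayable."""
    def __init__(self):
        self._entries = []

    def append(self, record) -> int:
        self._entries.append(record)
        return len(self._entries) - 1   # offset is deterministic

    def read_from(self, offset: int):
        # Rewind/replay: re-reading from the same offset yields the
        # same records in the same (commit) order, every time.
        return list(self._entries[offset:])

log = ChangeLog()
log.append({"key": "a", "val": 1})
off = log.append({"key": "a", "val": 2})
replay1 = log.read_from(off)
replay2 = log.read_from(off)   # identical: offsets are stable
```

This is the property Pulsar brings over reading a database redo log directly: the log is retained (and tiered), so consumers can rewind long after the database has purged its own redo.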
Slide 12: Ideal CDC Sink
Mutable, transactional
- Reliably apply changes
Quickly absorbs changes
- Sinks are often bottlenecks
- Random I/O
Bonus: also act as a CDC source
- Keep the stream flowing
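A toy keyed sink that applies a change stream in order shows why the sink must be mutable (updates and deletes, not just appends). The event shape is the same hypothetical envelope as above, not a real connector API:

```python
def apply_changes(state: dict, changes) -> dict:
    """Apply an ordered change stream to a mutable keyed sink."""
    for c in changes:
        if c["op"] == "delete":
            state.pop(c["key"], None)
        else:
            # insert or update: upsert the latest image for the key
            state[c["key"]] = c["after"]
    return state

table = {}
stream = [
    {"op": "insert", "key": "u1", "after": {"city": "SF"}},
    {"op": "update", "key": "u1", "after": {"city": "NYC"}},
    {"op": "delete", "key": "u1", "after": None},
]
apply_changes(table, stream)   # applied in order => u1 ends up deleted
```

Note that applying the same three events in a different order would leave a different final state, which is the "out-of-order apply => incorrect results" point from the previous slide.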
Slide 13: CDC to Lakes
Putting Pulsar and Hudi to work

Slide 14: What is a Data Lake?
Architectural Pattern for Analytical Data
- Data Lake != Spark, Flink
- Data Lake != files on S3
- Raw data (OLTP schema)
- Derived data (OLAP/BI, ML schema)
Vast Storage
- Object storage vs dedicated storage nodes
- Open formats (data + metadata)
Scalable Compute
- Many query engine choices
Source: https://martinfowler.com/bliki/images/dataLake/context.png

Slide 15: Why Not? (Diagram)
Operational data infrastructure (databases, apps/services, events, external sources) emits change streams into analytics data infrastructure: tables on DFS/cloud storage, served to queries.

Slide 16: Challenges
Data lakes are often file dumps
- Reliably changing a subset of files
- Transactions, concurrency control
Getting "ALL" data quickly
- Apply updates quickly
- Scalable deletes, to ensure compliance
Lakes think in big batches
- Difficult to align batch intervals, to join
- Large skews for "event_time" streaming joins

Slide 17: Apache Hudi
Transactional writes, MVCC/OCC
- Fully managed file/object storage
- Automatic compaction, clustering, sizing
First-class support for updates, deletes
- Record-level updates/deletes inspired by stream processors
CDC streams from lake storage
- Storage layout optimized for incremental fetches
- Hudi's unique contribution in the space
Slide 18: Pulsar (Source) + Hudi (Sink)
(Diagram) Change streams in Pulsar flow through Pulsar source connectors into Hudi's DeltaStreamer (de-dupe, indexing, transactions), which writes tables to DFS/cloud storage and runs clustering, optimization, and compaction.

Slide 19: Applying Event Logs
PR#3096

Slide 20: Applying Database Changes
Coming soon...

Slide 21: Streaming ETL using Hudi

Slide 22: Hudi Overview
Intro, Components, APIs, Design Choices
Slide 23: Hudi Data Lake
Original pioneer of the transactional data lake movement
Embeddable, serverless, distributed database abstraction layer over DFS
- We invented this!
Hadoop Upserts, Deletes & Incrementals
- Provides transactional updates/deletes
- First-class support for record-level CDC streams

Slide 24: Stream Processing is Fast & Efficient
Streaming stack
+ Intelligent, incremental
+ Fast, efficient
- Row-oriented
- Not scan-optimized
Batch stack
+ Scans, columnar formats
+ Scalable compute
- Naive, inefficient

Slide 25: What If: Streaming Model on Batch Data?
The incremental stack
+ Intelligent, incremental
+ Fast, efficient
+ Scans, columnar formats
+ Scalable compute
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ (2016)

Slide 26: Hudi: Open Sourcing & Evolution
2015: Published core ideas/principles for incremental processing (O'Reilly article)
2016: Project created at Uber; powers all database/business-critical feeds @ Uber
2017: Project open sourced by Uber; work begun on Merge-On-Read, cloud support
2018: Picked up adopters, hardening, async compaction
2019: Incubated into the ASF, community growth, added more platform components
2020: Top-level Apache project; over 10x growth in community, downloads, adoption
2021: SQL DMLs, Flink Continuous Queries, more indexing schemes, Metaserver, caching
Slide 27: Apache Hudi - Adoption
Committers/Contributors: Uber, AWS, Alibaba, Tencent, Robinhood, Moveworks, Confluent, Snowflake, Bytedance, Zendesk, Yotpo and more
https://hudi.apache.org/docs/powered_by.html

Slide 28: The Hudi Stack
- Complete "data" lake platform
- Tightly integrated, self-managing
- Write using Spark, Flink
- Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA, etc.
- Out-of-box tools/services for painless dataops

Slide 29: Our Design Goals
Streaming/Incremental
- Upsert/delete optimized
- Key-based operations
Faster
- Frequent commits
- Design around logs
- Minimize overhead

Slide 30: Delta Logs at File Level over Global
Each file group is its own self-contained log
- Constant metadata size, controlled by "retention" parameters
- Leverage append() when available; lower metadata overhead
Merges are local to each file group
- UUID keys throw off any range pruning
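The per-file-group log idea can be sketched as a base snapshot plus that group's own delta log, merged on read without consulting any other file group. The structures below are illustrative, not Hudi's actual file format:

```python
def merge_file_group(base: dict, delta_log: list) -> dict:
    """Merge a file group's base records with its own delta log.
    The merge is local: no other file group is consulted."""
    merged = dict(base)
    for entry in delta_log:             # log entries in commit order
        if entry.get("deleted"):
            merged.pop(entry["key"], None)
        else:
            merged[entry["key"]] = entry["value"]
    return merged

base = {"k1": {"v": 1}, "k2": {"v": 2}}
log = [{"key": "k1", "value": {"v": 10}},   # update k1
       {"key": "k2", "deleted": True}]      # delete k2
snapshot = merge_file_group(base, log)
```

Because each group's log is self-contained, compaction of one group never blocks or rewrites another, which is what keeps metadata size constant under the retention parameters.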
Slide 31: Record Indexes over Just File/Column Stats
Index maps key to a file group
- During upserts/deletes
- Much like a streaming state store
Workloads have different shapes
- Late-arriving updates; totally random
- Trickle down to derived tables
Many pluggable options
- Bloom filters + key ranges
- HBase, join based
- Global vs local

Slide 32: MVCC Concurrency Control over Only OCC
Frequent commits => more frequent clustering/compaction => more contention
Differentiate writers vs table services
- Much like what databases do
- Table services don't contend with writers
- Async compaction/clustering
Don't be so "Optimistic"
- OCC between writers works, until it doesn't
- Retries, split txns, wasted resources
- MVCC/log based between writers/table services

Slide 33: Record Level Merge API over Only Overwrites
More generalized approach
- Default: overwrite; latest writer wins
- Support business-specific resolution
Log partial updates
- Log just the changed columns
- Drastic reduction in write amplification
Log-based reconciliation
- Delete, undelete based on business logic
- CRDT, Operational Transform-like delayed conflict resolution
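The difference between plain overwrites and a record-level merge API can be sketched as a pluggable combine function. This is an illustration of the concept only; it does not reproduce Hudi's actual payload interface:

```python
def overwrite_latest(old: dict, new: dict) -> dict:
    # Default resolution: latest writer wins wholesale.
    return new

def partial_update(old: dict, new: dict) -> dict:
    # Business-specific resolution: the incoming record carries only
    # the changed columns; merge them into the existing row.
    merged = dict(old)
    merged.update({k: v for k, v in new.items() if v is not None})
    return merged

old = {"id": 7, "email": "a@x.com", "city": "SF"}
delta = {"id": 7, "email": None, "city": "NYC"}   # only city changed
row_default = overwrite_latest(old, delta)        # email is lost
row_partial = partial_update(old, delta)          # email preserved
```

Logging only the changed columns (the `delta` above) is what drives down write amplification: the full row never needs rewriting at log time, only at merge time.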
Slide 34: Specialized Database over Generalized Format
Approach it more like a shared-nothing database
- Daemons aware of each other
- E.g.: compaction, cleaning in RocksDB
E.g.: clustering & compaction know about each other
- Reconcile metadata based on time order
- Compactions avoid redundant scheduling
Self-managing
- Sorting, time-order preservation, file sizing

Slide 35: Record Level CDC over File/Snapshot Diffing
Per-record metadata
- _hoodie_commit_time: Kafka-style compacted change streams in commit order
- _hoodie_commit_seqno: consume large commits in chunks, a la Kafka offsets
File group design => CDC friendly
- Efficient retrieval of old/new values
- Efficient retrieval of all values for a key
Infinite retention/lookback coming later in 2021
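The `_hoodie_commit_time` field enables Kafka-style incremental pulls: a consumer remembers the last commit it processed and asks only for newer records, using `_hoodie_commit_seqno` to order within a commit. The field names come from the slide; the query logic below is a toy sketch, not Hudi's incremental query API:

```python
def incremental_pull(records, last_commit_time: str):
    """Return records committed after the consumer's checkpoint,
    ordered by (commit time, seqno), like a compacted change stream."""
    newer = [r for r in records if r["_hoodie_commit_time"] > last_commit_time]
    return sorted(newer, key=lambda r: (r["_hoodie_commit_time"],
                                        r["_hoodie_commit_seqno"]))

table = [
    {"_hoodie_commit_time": "20210601120000", "_hoodie_commit_seqno": 1, "key": "a"},
    {"_hoodie_commit_time": "20210601130000", "_hoodie_commit_seqno": 1, "key": "b"},
    {"_hoodie_commit_time": "20210601130000", "_hoodie_commit_seqno": 2, "key": "c"},
]
changes = incremental_pull(table, last_commit_time="20210601120000")
```

After processing, the consumer checkpoints the newest commit time it saw, exactly as a Kafka consumer commits offsets.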
Slide 36: Onwards
Ideas, Ongoing Work, Future Plans

Slide 37: Pulsar Source In Hudi
PR#3096 is up and WIP! (Contributions welcome)
- Supports Avro/KEY_VALUE, partitioned/non-partitioned topics, checkpointing
- All Hudi operations, bells and whistles
- Consolidate with the Kafka-Debezium source work
Hudi-facing work
- Adding transformers, record payload for parsing Debezium logs
- Hardening, functional/scale testing
Pulsar-facing work
- Better Spark batch query datasource support in Apache Pulsar
- streamnative/pulsar-spark: upgrade to Spark 3 (we know it's painful)/Scala 2.12, support for KV records

Slide 38: Hudi Sink from Pulsar
Push to the lake in real time
- Today's model is "pull based", micro-batch
- Transactional, concurrency control
Hudi-facing work
- Harden hudi-java-client
- ~1 min commit frequency, while retaining a month of history
Pulsar-facing work
- How to achieve exactly-once across tasks?
- How to perform indexing etc. efficiently without shuffling data around

Slide 39: Pulsar Tiered Storage
Columnar reads off Hudi
- Leverage Hudi's metadata to track ordering/changes
- Push projections/filters down to Hudi
- Faster backfills!
Work/Challenges
- Most open-ended of the lot
- Pluggable tiered storage API in Pulsar (does one exist?)
- Mapping offsets to _hoodie_commit_seqno
- Leverage Hudi's compaction and other machinery

Slide 40: Engage With Our Community
User Docs: https://hudi.apache.org
Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
GitHub: https://github.com/apache/hudi/
Twitter: https://twitter.com/apachehudi
Mailing lists: dev-subscribe@hudi.apache.org (send an empty email to subscribe); dev@hudi.apache.org (actual mailing list)
Slack: https://join.slack.com/t/apache-hudi/signup

Slide 41: Thanks! Questions?
