Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and Vinoth Chandar


Uber needs to provide faster, fresher data to data consumers and products, which run hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of 'Hudi', a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing trade-offs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build notebooks and dashboards on top using Spark SQL.

Data & Analytics

[Figure: changelogs feeding a normal table (Hive/Spark/Presto) dataset]

[Figure: Hudi architecture. Labels: Hudi DataSource (Spark); Spark DAGs (store & index data); dataset on Hadoop FS (index, data files, timeline metadata); Hive queries and Presto queries (read data); storage type; views]

# Command to extract incrementals using sqoop
bin/sqoop import \
  -Dmapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://localhost/users \
  --username root \
  --password ******* \
  --table users \
  --as-avrodatafile \
  --target-dir s3:///tmp/sqoop/import-1/users
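// Spark Datasource
import static com.uber.hoodie.DataSourceWriteOptions.*;
import com.uber.hoodie.config.HoodieWriteConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
// Use Spark datasource to read the avro files written by sqoop
// (assumes the spark-avro package is on the classpath)
Dataset<Row> inputDataset = spark.read()
    .format("com.databricks.spark.avro")
    .load("s3:///tmp/sqoop/import-1/users/*");
// save it as a Hoodie dataset
inputDataset.write().format("com.uber.hoodie")
    .option(HoodieWriteConfig.TABLE_NAME, "hoodie.users")
    .option(RECORDKEY_FIELD_OPT_KEY(), "userID")         // unique key of each record
    .option(PARTITIONPATH_FIELD_OPT_KEY(), "country")    // field used to partition the dataset
    .option(PRECOMBINE_FIELD_OPT_KEY(), "last_mod")      // ordering field used to pick the latest version
    .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
    .mode(SaveMode.Append)
    .save("/path/on/dfs");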
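
The options in the write above suggest how an upsert resolves conflicts: the record key identifies a row, and the precombine field decides which version survives when the same key shows up more than once in a batch. The plain-Java sketch below models only that idea, under simplifying assumptions. It is not Hudi's implementation (the real merge behaviour depends on the configured payload class), and `UpsertSketch`, `User`, `precombine`, `upsert`, and the sample values are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UpsertSketch {
    /** Hypothetical row carrying the three fields named in the write options. */
    static final class User {
        final String userID;  // record key
        final String country; // partition path
        final long lastMod;   // precombine (ordering) field

        User(String userID, String country, long lastMod) {
            this.userID = userID;
            this.country = country;
            this.lastMod = lastMod;
        }
    }

    /** Step 1: within one batch, keep only the version with the largest lastMod per key. */
    static Map<String, User> precombine(List<User> batch) {
        Map<String, User> latest = new HashMap<>();
        for (User u : batch) {
            latest.merge(u.userID, u, (a, b) -> b.lastMod > a.lastMod ? b : a);
        }
        return latest;
    }

    /** Step 2: insert keys that are new, overwrite keys that already exist. */
    static Map<String, User> upsert(Map<String, User> table, List<User> batch) {
        Map<String, User> result = new HashMap<>(table);
        result.putAll(precombine(batch));
        return result;
    }

    public static void main(String[] args) {
        Map<String, User> table = new HashMap<>();
        table.put("u1", new User("u1", "US", 100L));

        List<User> batch = Arrays.asList(
            new User("u1", "US", 200L),  // update to an existing key
            new User("u1", "US", 150L),  // stale duplicate in the same batch
            new User("u2", "IN", 120L)); // brand-new key

        Map<String, User> merged = upsert(table, batch);
        // prints "2 rows, u1.lastMod=200"
        System.out.println(merged.size() + " rows, u1.lastMod=" + merged.get("u1").lastMod);
    }
}
```

Splitting the logic into a precombine step and an overwrite step keeps the two roles separate: `last_mod` only orders versions of the same key, while `userID` alone decides whether a row is an insert or an update.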

# DeltaStreamer command to ingest sqoop incrementals
spark-submit \
  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hoodie-utilities-*-SNAPSHOT.jar \
  --props s3://path/to/dfs-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
  --source-class com.uber.hoodie.utilities.sources.AvroDFSSource \
  --source-ordering-field last_mod \
  --target-base-path s3:///path/on/dfs \
  --target-table uber.employees \
  --op UPSERT

# dfs-source.properties
include=base.properties
# Key generator props
hoodie.datasource.write.recordkey.field=_userID
hoodie.datasource.write.partitionpath.field=country
# Schema provider props
hoodie.deltastreamer.filebased.schemaprovider.source.schema.file=s3:///path/to/users.avsc
# DFS Source
hoodie.deltastreamer.source.dfs.root=s3:///tmp/sqoop

# DeltaStreamer command to ingest Kafka events, filtering out duplicates before the insert
spark-submit \
  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hoodie-utilities-*-SNAPSHOT.jar \
  --props s3://path/to/kafka-source.properties \
  --schemaprovider-class com.uber.hoodie.utilities.schema.SchemaRegistryProvider \
  --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource \
  --source-ordering-field time \
  --target-base-path s3:///hoodie-deltastreamer/impressions \
  --target-table uber.impressions \
  --op BULK_INSERT \
  --filter-dupes

# kafka-source.properties
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=datestr
# Schema provider configs
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=impressions
# Kafka props
metadata.broker.list=localhost:9092
auto.offset.reset=smallest
schema.registry.url=http://localhost:8081
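
The `--filter-dupes` flag in the command above asks for incoming records to be dropped when their key is already present in the target dataset, so that a plain insert does not write the same event twice. A minimal plain-Java sketch of that filtering step follows. It is an illustration under assumed names only: `FilterDupesSketch` and `filterDupes` are hypothetical, the set of existing keys stands in for an index lookup, and duplicates inside the incoming batch itself are deliberately left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterDupesSketch {
    /** Returns the incoming keys that are not yet present in the dataset, in arrival order. */
    static List<String> filterDupes(Set<String> existingKeys, List<String> incomingKeys) {
        List<String> fresh = new ArrayList<>();
        for (String key : incomingKeys) {
            if (!existingKeys.contains(key)) {
                fresh.add(key); // unseen key: keep it for the insert
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        // Hypothetical impression ids already stored in the dataset
        Set<String> existing = new HashSet<>(Arrays.asList("imp-1", "imp-2"));
        // Hypothetical ids read from the Kafka topic, including one replayed event
        List<String> incoming = Arrays.asList("imp-2", "imp-3", "imp-4");

        // prints "[imp-3, imp-4]"
        System.out.println(filterDupes(existing, incoming));
    }
}
```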

[Figure: incremental changes flowing from the source table to ETL table A and ETL table B. Legend: new data, updated data, unaffected data]

// Spark Datasource
import static com.uber.hoodie.DataSourceReadOptions.*;
import static com.uber.hoodie.DataSourceWriteOptions.*;
import com.uber.hoodie.config.HoodieWriteConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
// Use Spark datasource to read only the records that changed since the 8 AM commit
Dataset<Row> hoodieIncViewDF = spark.read().format("com.uber.hoodie")
    .option(VIEW_TYPE_OPT_KEY(), VIEW_TYPE_INCREMENTAL_OPT_VAL())
    .option(BEGIN_INSTANTTIME_OPT_KEY(), commitInstantFor8AM)
    .load("s3://tables/raw_trips");
Dataset<Row> stdDF = standardize_fare(hoodieIncViewDF);
// save the standardized records as a Hoodie dataset
stdDF.write().format("com.uber.hoodie")
    .option(HoodieWriteConfig.TABLE_NAME, "hoodie.std_trips")
    .option(RECORDKEY_FIELD_OPT_KEY(), "id")
    .option(PARTITIONPATH_FIELD_OPT_KEY(), "datestr")
    .option(PRECOMBINE_FIELD_OPT_KEY(), "time")
    .option(OPERATION_OPT_KEY(), UPSERT_OPERATION_OPT_VAL())
    .mode(SaveMode.Append)
    .save("/path/on/dfs");
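
The incremental read above can be thought of as a filter on commit time: only records written by commits after the begin instant are returned, so a downstream ETL job processes the delta rather than the whole table. The plain-Java sketch below illustrates that selection rule and nothing more. `IncrementalPullSketch`, `Trip`, `incrementalView`, and the sample timestamps are hypothetical, and the sketch assumes instant times are fixed-width `yyyyMMddHHmmss` strings so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IncrementalPullSketch {
    /** Hypothetical record tagged with the instant time of the commit that wrote it. */
    static final class Trip {
        final String id;
        final String commitTime; // assumed format: yyyyMMddHHmmss

        Trip(String id, String commitTime) {
            this.id = id;
            this.commitTime = commitTime;
        }
    }

    /** Returns records written by commits strictly after beginInstantTime. */
    static List<Trip> incrementalView(List<Trip> dataset, String beginInstantTime) {
        List<Trip> changed = new ArrayList<>();
        for (Trip t : dataset) {
            // Fixed-width timestamps compare chronologically as plain strings
            if (t.commitTime.compareTo(beginInstantTime) > 0) {
                changed.add(t);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Trip> rawTrips = Arrays.asList(
            new Trip("t1", "20180601070000"),
            new Trip("t2", "20180601080000"),
            new Trip("t3", "20180601093000"));

        String commitInstantFor8AM = "20180601080000";
        for (Trip t : incrementalView(rawTrips, commitInstantFor8AM)) {
            System.out.println(t.id); // prints "t3"
        }
    }
}
```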


1. Session hashtag: #SAISEco10


13. [Figure: REALTIME and READ OPTIMIZED views plotted against cost and latency]


28. Questions?