Hoodie
An Open Source Incremental Processing Framework
Who Am I
Vinoth Chandar
- Founding engineer/architect of the data team at Uber.
- Previously,
- Lead on LinkedIn’s Voldemort key-value store.
- Oracle Database replication, Stream Processing.
- HPC & Grid Computing
Agenda
• Data @ Uber
• Motivation
• Concepts
• Deep Dive
• Use-Cases
• Comparisons
• Open Source
Data @ Uber
Quick Recap of how Uber’s data ecosystem has evolved!
Circa 2014
Reliability
- JSON data, breaking pipelines
- Word-of-mouth schema
Scalability
- Kafka7, No Hadoop
- Growing data volumes
- Multi-datacenter data merges
Inefficiencies
- Several hours of data delays
- Bulk data copies stressing OLTP systems
- Single choice of query engine
Re-Architecture
Schemafication
- Avro as data lingua franca
- Schema enforcement at producers
Horizontally Scalable
- Kafka8
- Hadoop (many PBs & 1000s servers)
- Scalable data pipelines
- Multi-DC Aware data flow
Performant
- 1-3 hrs data latency
- Columnar queries via Parquet
- Multiple query engines
Data Users
Analytics
- Dashboards
- Federated Querying
- Interactive Analysis
Data Apps
- Machine Learning
- Fraud Detection
- Incentive Spends
Data Warehousing
- Traditional ETL
- Curated data feeds
- Data Lake => Data Mart
Query Engines
Presto
> 100K queries/day
Spark
100s of Apps
Hive
20K Pipelines & Queries
Hoodie : Motivations
Use-cases & business needs that led to the birth of the project
Query Engines
Presto
> 100K queries/day
Spark
100s of Apps
Hive
20K Pipelines & Queries
Motivating Use-Case: Late Arriving Updates
(Diagram: trips table partitioned by trip start date into day-level partitions, 2010-2014 through 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX and 2017/04/16. Every 30 min a batch of new/updated trips arrives: new data lands in recent partitions, updates touch older partitions, and most partitions stay unaffected.)
DB Ingestion: Status Quo
(Diagram: the database changelog is bulk-loaded as a snapshot into the trips table (Parquet) and then into derived tables, at 12-18+ hr end-to-end latency. Snapshot job growth: Jan: 6 hr (500 executors), Apr: 8 hr (800 executors), Aug: 10 hr (1000 executors).)
How can we fix this?
Query HBase?
- Bad Fit for scans
- Lack of support for nested data
- Significant operational overhead
Specialized Analytical DBs?
- Joins with other datasets in HDFS
- Not all data will fit into memory
- Lambda architecture & data copies
Don’t Support Snapshots, Only Logs
- Logs ultimately need to be compacted anyway
- Merging done inconsistently & inefficiently by users
Data Modelling Tricks?
- Does not change fundamental nature of problem
Pivotal Question
What do we need to solve this directly on top of a
petabyte scale Hadoop Data Lake?
Let’s Go A Decade Back
How did RDBMSes solve this?
• Update existing row with new value (Transactions)
• Consume a log of changes downstream (Redo log)
• Update again downstream
(Diagram: MySQL (Server A) receives updates; MySQL (Server B) pulls the redo log, applies the same updates, and a downstream transformation consumes the changes.)
Important Differences
• Columnar file formats
• Read-heavy analytical workloads
• Petabytes & 1000s of servers
Pivotal Question
What do we need to solve this directly on top of a
petabyte scale Hadoop Data Lake?
Answer: upserts & incrementals
Challenging Status Quo: upserts & incr pull
(Diagram: instead of snapshot rebuilds growing from 6 hr (500 executors) to 8 hr (800) to 10 hr (1000), the changelog is upserted as replicated trip rows in ~1 hr; derived tables drop from 12-18+ hr to ~8 hr by incrementally pulling only new/updated trip rows.)
Hoodie : Concepts
Incremental Processing Foundations & why it’s important
Anatomy Of Data Pipelines
Core Operations
• Projections (Easy)
• Filtering (Easy)
• Aggregations (Tricky)
• Window (Tricky)
• Joins (Hard)
Operational Levers (Google DataFlow)
• Latency
• Completeness
• Cost
Typically Pick 2/3
(Diagram: Source → Data Pipeline → Sink)
An Artificial Dichotomy
It’s A Spectrum
- Very common use-cases tolerating few mins of latency
- 100x more batch pipelines than streaming pipelines
Incremental Processing : What?
Run Mini-Batch Pipelines
- Provide higher completeness than streaming pipelines
- By supporting things like multi-table joins seamlessly
In Streaming Fashion
- Provide lower latency than typical batch pipelines
- By only consuming new input & being able to update old results
Incremental Processing : Increased Efficiency
- Less IO
- On-demand resource allocation
Incremental Processing : Leverage Hadoop SQL
- Good support for joins
- Columnar file formats
- Covers a wide range of use cases: exploratory, interactive
Incremental Processing : Simplify Architecture
- Efficient pipelines on same batch infrastructure
- Consolidation of storage & compute
Incremental Processing : Primitives
Upsert (Primitive #1)
- Modify processed results
- Like state stores in stream processing
Incremental Pull (Primitive #2)
- Log stream of changes, avoid costly scans
- Enable chaining processing in a DAG
Introducing: Hoodie
(Hadoop Upserts anD Incrementals)
Storage Abstraction to
- Apply mutations to dataset
- Pull changelog incrementally
Spark Library
- Scales horizontally like any job
- Stores dataset directly on HDFS
Open Source
- https://github.com/uber/hoodie
- https://eng.uber.com/hoodie
(Diagram: a changelog is upserted via Spark into the Hoodie dataset, which is queryable as a normal table and also exposes its own changelog for incremental pull, via Hive/Spark/Presto.)
Hoodie: Overview
(Diagram: the Hoodie WriteClient (Spark) stores & indexes data as Index, Data Files and Timeline Metadata within a dataset on HDFS; Hive queries, Presto queries and Spark DAGs read the data through the views offered by the storage type.)
Hoodie: Storage Types & Views
Storage Type (How is data stored?)  | Views (How is data read?)
Copy On Write                       | Read Optimized, LogView
Merge On Read                       | Read Optimized, RealTime, LogView
Hoodie : Deep Dive
Design & Implementation of incremental processing primitives
Storage: Basic Idea
(Diagram: the input changelog arrives as 200 GB, 30-min batches. Within a Hoodie dataset partition (2017/02/15 - 2017/02/17), copy-on-write rewrites File1_v1.parquet into File1_v2.parquet for each 10 GB, 5-min batch, while merge-on-read instead appends the same batch to File1.avro.log next to File1, with an index locating each record's file.)
Back-of-the-envelope, rewrite vs append for a single batch:
● 1825 partitions (365 days * 5 yrs)
● 100 GB partition size
● 128 MB file size
● ~800 files per partition
● Skew spread: 0.5 % of files updated vs 0.005 % new files (single batch)
● 7300 files rewritten vs ~8 new files
● 20 seconds to re-write 1 file (shuffle)
● 100 executors vs 10 executors
● 24 minutes to write vs ~2 minutes to write
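As a sanity check on the arithmetic above (rough numbers; assuming file rewrites dominate and run in parallel across executors):

7300 files x 20 s per rewrite / 100 executors ≈ 1460 s ≈ 24 minutes for the rewrite path,
versus only ~8 new files to write, so even 10 executors finish in about 2 minutes on the append path.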
Index and Storage
Index
- Tag ingested records as updates or inserts
- Index is immutable (record key to File mapping never changes)
- Pluggable
- Bloom Filter
- HBase
Storage
- HDFS or Compatible Filesystem or Cloud Storage
- Block aligned files
- ROFormat (Apache Parquet) & WOFormat (Apache Avro)
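To make the tagging step concrete, here is a minimal sketch of bloom-filter-based index tagging (an illustration only, not Hoodie's actual internals; the class and field names are hypothetical), using Guava's BloomFilter:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class BloomIndexSketch {
  // One bloom filter per data file, built from the record keys stored in that file.
  private final Map<String, BloomFilter<CharSequence>> fileToBloom = new HashMap<>();
  // For verifying bloom hits (in reality, keys are read back from the file itself).
  private final Map<String, Set<String>> fileToKeys = new HashMap<>();

  public void addFile(String fileId, Set<String> keys) {
    BloomFilter<CharSequence> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), keys.size(), 0.01);
    keys.forEach(bloom::put);
    fileToBloom.put(fileId, bloom);
    fileToKeys.put(fileId, keys);
  }

  // Returns the fileId that already holds this key (=> update), or null (=> insert).
  public String tag(String recordKey) {
    for (Map.Entry<String, BloomFilter<CharSequence>> e : fileToBloom.entrySet()) {
      // Bloom filters may report false positives, so confirm against the actual keys.
      if (e.getValue().mightContain(recordKey) && fileToKeys.get(e.getKey()).contains(recordKey)) {
        return e.getKey();
      }
    }
    return null;
  }
}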
Concurrency
● Multi-row atomicity
● Strong consistency (Same as HDFS guarantees)
● Single Writer - Multiple Consumer pattern
● MVCC for isolation
○ Queries run concurrently with ingestion
Data Skew
Why is skew a problem?
- Spark 2GB Remote Shuffle Block limit
- Straggler problem
Hoodie handles data skew automatically
- Index lookup skew
- Data write skew handled by auto sub partitioning based on history
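As an illustration of history-based sub-partitioning (a sketch under assumptions; the heuristic and names here are ours, not Hoodie's), inserts can be split into write buckets sized from the historically observed average record size:

// Split a batch of inserts into write buckets so no single task writes more
// than maxFileBytes, sizing from the historically observed average record size.
static int numInsertBuckets(long insertCount, long avgRecordBytesFromHistory, long maxFileBytes) {
  long estimatedBytes = insertCount * avgRecordBytesFromHistory;
  return (int) Math.max(1L, (estimatedBytes + maxFileBytes - 1) / maxFileBytes);  // ceil division
}

// e.g. 10M inserts * 100 bytes avg ≈ 1 GB  =>  8 buckets of ~128 MB each
int buckets = numInsertBuckets(10_000_000L, 100L, 128L * 1024 * 1024);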
Compaction
Essential for Query performance
- Merge Write Optimized row format with Scan Optimized column format
Scheduled asynchronously to Ingestion
- Ingestion already groups updates per File Id
- Locks down versions of log files to compact
- Pluggable strategy to prioritize compactions
- Base File to Log file size ratio
- Recent partitions compacted first
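A sketch of a pluggable prioritization strategy along the lines above (hypothetical types; Hoodie's real strategy interface may differ): order file groups by base-to-log size ratio, breaking ties toward more recent partitions.

import java.util.Comparator;
import java.util.List;

class FileGroup {
  String fileId;
  String partitionPath;  // e.g. "2017/02/17"; lexicographic order tracks recency
  long baseFileBytes;    // compacted columnar base (Parquet)
  long logFileBytes;     // accumulated row-based log (Avro)
}

class CompactionPrioritizer {
  // Smallest base-to-log ratio first: a log that has grown large relative to
  // its base file benefits most from compaction; ties go to recent partitions.
  static void prioritize(List<FileGroup> groups) {
    groups.sort(
        Comparator.<FileGroup>comparingDouble(
                g -> (double) g.baseFileBytes / Math.max(1L, g.logFileBytes))
            .thenComparing((FileGroup g) -> g.partitionPath, Comparator.reverseOrder()));
  }
}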
Failure recovery
Automatic recovery via Spark RDD
- Resilient Distributed Datasets!!
No Partial writes
- Commit is atomic
- Auto rollback last failed commit
Rollback specific commits
Savepoints/Snapshots
Hoodie Write API
// WriteConfig contains basePath of hoodie dataset (among other configs)
HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig)

// Start a commit and get a commit time to atomically upsert a batch of records
String startCommit()

// Upsert the RDD<Records> into the hoodie dataset
JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)

// Choose to commit
boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses)

// Rollback
boolean rollback(final String commitTime) throws HoodieRollbackException
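Putting the API together, a minimal write loop might look like this (a sketch: the newBuilder()/withPath() calls are assumptions based on the basePath config mentioned above, and jsc/records are an existing JavaSparkContext and JavaRDD<HoodieRecord>):

// A minimal write round-trip with the API above (sketch).
HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
    .withPath("/data/trips")        // basePath of the hoodie dataset
    .build();
HoodieWriteClient client = new HoodieWriteClient(jsc, config);

String commitTime = client.startCommit();                    // begin atomic commit
JavaRDD<WriteStatus> statuses = client.upsert(records, commitTime);
if (!client.commit(commitTime, statuses)) {                  // publish the batch...
  client.rollback(commitTime);                               // ...or undo partial writes
}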
Hoodie Record
HoodieRecordPayload

// Get the Avro IndexedRecord for the dataset schema
IndexedRecord getInsertValue(Schema schema);

// Combine existing value with new incoming value and return the combined value
IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);
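As an example, a payload that resolves updates by keeping the version with the newer timestamp could implement the two methods above like this (a sketch; the 'updated_at' field and class name are hypothetical):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

// Hypothetical payload: on update, keep whichever version has the newer 'updated_at'.
public class LatestWinsPayload {
  private final GenericRecord incoming;

  public LatestWinsPayload(GenericRecord incoming) { this.incoming = incoming; }

  public IndexedRecord getInsertValue(Schema schema) {
    return incoming;   // brand-new record key: write the incoming value as-is
  }

  public IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
    long currentTs  = (Long) ((GenericRecord) currentValue).get("updated_at");
    long incomingTs = (Long) incoming.get("updated_at");
    return incomingTs >= currentTs ? incoming : currentValue;
  }
}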
Hoodie: Overview
(Diagram, repeated: the Hoodie WriteClient (Spark) stores & indexes data as Index, Data Files and Timeline Metadata within the Hoodie dataset on HDFS; Hive queries, Presto queries and Spark DAGs read the data through the views offered by the storage type.)
Hoodie Views
(Chart: query execution time vs data latency; the READ OPTIMIZED view minimizes query execution time, the REALTIME view minimizes data latency.)
3 Logical Views Of Dataset
Read Optimized View
- Raw Parquet Query Performance
- Targets existing Hive tables
Real Time View
- Hybrid of row & columnar data
- Brings near-real time tables
Log View
- Stream of changes to dataset
- Enables Incr. Pull
Hoodie Views
(Diagram: the input changelog is upserted in 10 GB, 5-min batches across partitions 2017/02/15 - 2017/02/17. Copy-on-write versions File1_v1.parquet → File1_v2.parquet, served by the Read Optimized Table in Hive; merge-on-read keeps File1.parquet plus File1.avro.log, served by the Real Time Table; commit metadata feeds the Incremental Log table.)
Read Optimized View
InputFormat picks only Compacted Columnar Files
Optimized for faster query runtime over data latency
- Plugs into query plan generation to filter out older versions
- All optimizations for reading Parquet apply (vectorized reads etc.)
Works out of the box with Presto and Apache Spark
Presto Read Optimized Performance
Real Time View
InputFormat merges Columnar with Row Log at query execution
- Data Latency can approach speed of HDFS appends
Custom RecordReader
- Logs are grouped per FileID
- Single split is usually a single FileID in Hoodie (Block Aligned files)
Works out of the box with Presto and Apache Spark
- Specialized parquet read path optimizations not supported
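A sketch of the merge such a record reader could perform at query time (illustrative structure; real readers stream rather than materialize): overlay the latest log record, if any, onto each columnar base row by record key.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Overlay the newest log record (if any) onto each columnar base row, keyed by record key.
static List<Map<String, Object>> mergeAtRead(
    List<Map<String, Object>> baseRows,            // rows read from the compacted parquet file
    Map<String, Map<String, Object>> logByKey) {   // latest log record per key, from the avro log
  List<Map<String, Object>> merged = new ArrayList<>();
  for (Map<String, Object> row : baseRows) {
    Object key = row.get("_hoodie_record_key");    // Hoodie's record key metadata column
    merged.add(logByKey.getOrDefault(key, row));   // log version wins when present
  }
  return merged;
}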
Incremental Log View
(Diagram: the same trips table partitioned by trip start date, 2010-2014 through 2017/04/16. Every 5 min, new/updated trips land across partitions; the Log View exposes just these changed records for incremental pull, leaving unaffected data untouched.)
Incremental Log View
Pull ONLY changed records in a time range using SQL
- _hoodie_commit_time > ‘startTs’ AND _hoodie_commit_time <= ‘endTs’
Avoids full table/partition scans
Does not rely on a custom sequence ID to tail
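For instance, from Spark SQL (a sketch; assumes the dataset is registered as a Hive table named trips, and that commit times take Hoodie's timestamp-string form, e.g. 20170416120000):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("hoodie-incr-pull").enableHiveSupport().getOrCreate();

// Pull only records committed in (startTs, endTs] - no full scan, no custom sequence ID.
Dataset<Row> changed = spark.sql(
    "SELECT * FROM trips " +
    "WHERE _hoodie_commit_time > '20170416120000' " +
    "AND   _hoodie_commit_time <= '20170416123000'");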
Hoodie : Use Cases
How is it being used in real production environments?
Use Cases
Near Real-Time ingestion / stream into HDFS
- Replicate online state in HDFS within few minutes
- Offload analytics to HDFS
Near Real-Time Ingestion
Use Cases
Near Real-Time ingestion / stream into HDFS
- Replicate online state in HDFS within few minutes
- Offload analytics to HDFS
Incremental Data Pipelines
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Incremental ETL
Use Cases
Near Real-Time ingestion / streaming into HDFS
- Replicate online state in HDFS within few minutes
- Offload analytics to HDFS
Incremental Data Pipelines
- Don't trade off correctness to do incremental processing
- Hoodie integration with Scheduler
Unified Analytical Serving Layer
- Eliminate your specialized serving layer, if tolerated latency is > 5 min
- Simplify serving with HDFS for the entire dataset
Unified Analytics Serving
Adoption @ Uber
Powering ~1000 data ingestion feeds
- Every 30 mins today, several TBs per hour
- Towards < 10 min in the next few months
Incremental ETL for dimension tables
- Data warehouse at large
Reduced resource usage by 10x
- In production for last 6 months
- Hardened across rolling restarts, data node reboots
Hoodie : Comparisons
What trade-offs does hoodie offer compared to other systems?
Comparison: Analytical Storage - Scans
(Chart. Source: CERN Blog, “Performance comparison of different file formats and storage engines in the Hadoop ecosystem”)
Comparison: Analytical Storage - Write Rate
(Chart. Source: same CERN Blog post)
Comparison
                            | Apache HBase         | Apache Kudu                 | Hoodie
Write Latency               | Milliseconds         | Seconds (streaming)         | ~5 min update, ~1 min insert**
Scan Performance            | Not optimal          | Optimized via columnar      | State-of-art Hadoop formats
Query Engines               | Hive*                | Impala/Spark*               | Hive, Presto, Spark at scale
Deployment                  | Extra region servers | Specialized storage servers | Spark jobs on HDFS
Multi-Row Commit/Rollback   | No                   | No                          | Yes
Incremental Pull            | No                   | No                          | Yes
Automatic Hotspot Handling  | No                   | No                          | Yes
Hoodie : Open Source
How to get involved, roadmap..
Community
Shopify evaluating for use
- Incremental DB ingestion onto GCS
- Early interest from multiple companies
Engage with us on Github (uber/hoodie)
- Look for “beginner-task” tagged issues
- Try out tools & utilities
Uber is hiring for “Hoodie”
- “Software Engineer - Data Processing Platform (Hoodie)”
Future Plans
Merge On Read (Project #1)
- Active development, productionizing, shipping!
Global Index (Project #2)
- Fast, lightweight index to map key to fileID, globally (not just partitions)
Spark Datasource (Issue #7) & Presto Plugins (Issue #81)
- Native support for incremental SQL (e.g: where _hoodie_commit_time > ... )
Beam Runner (Issue #8)
- Build incremental pipelines that also port across batch or streaming modes
Takeaways
Fills a big void in Hadoop land
- Upserts & Faster data
Plays well with the Hadoop ecosystem & deployments
- Leverage Spark vs re-inventing yet-another storage silo
Designed for Incremental Processing
- Incremental Pull is a ‘Hoodie’ special
Questions?
Office Hours after talk
5:00pm–5:45pm
Extra Slides
Hoodie: Storage Types & Views
Hoodie Views
(Diagram, repeated: the 200 GB change log feeds File1 in 10 GB batches; copy-on-write versions File1_v1.parquet → File1_v2.parquet for the Read Optimized View, while merge-on-read pairs File1.parquet with File1.avro.log for the Realtime View, queried via Hive across partitions 2017/02/15 - 2017/02/17.)
Hoodie Write Path
(Diagram: the change log goes through index lookup and splits into updates and inserts. Updates to File Id1 append to its log file across commits at 10:06, 10:08 and 10:09 (a failed commit at 10:08 is rolled back), building versions 1 and 2 over the base compacted at 10:05; inserts open a new, initially empty File Id2. Partitions 2017-03-10 through 2017-03-14; current commit time 10:10.)
Hoodie Write Path
(Diagram: the write path running as a Spark application.)
Read Optimized View
Spark SQL Performance Comparison
Realtime View
Incremental Log View
Hoodie: Storage Types & Views
Incremental Log View
Comparison
Comparison
Petabytes to Exabytes
Greater need for Incremental Processing
Exponential Growth is fun ..
Also extremely hard to keep up with …
- Long waits for the queue
- Disks running out of space
Common Pitfalls
- Massive re-computations
- Batch jobs too big to fail