Big Data's Journey to ACID

Owen O'Malley

A comparison of different tools for change management in big data systems.

BIG DATA’S JOURNEY TO ACID
Owen O’Malley
owen@cloudera.com
October 2019
@owen_omalley

WHY IS ACID IMPORTANT?

© 2019 Cloudera, Inc. All rights reserved.
BIG DATA HAS A LOT OF CONCURRENCY
• Your data changes continually
  • Daily, hourly, or faster
• Ad hoc solutions require a lot of work
  • Producers and consumers must agree
• Distributed systems have lots of actors
  • And no global clock

USE CASES
• Updating dimension tables
  • Changing a user’s address
• Deleting old records
  • GDPR user removal
• Update/restate large fact tables
  • Fix problems after they are in the warehouse
• Streaming data ingest
• NOT OLTP

THE SYSTEMS

APACHE HADOOP MAP/REDUCE
• Only supported adding new directories
• Provided isolation via the output committer
  • Task isolation
  • Job isolation
• Used HDFS atomic renames
• Used a _SUCCESS file to mark available directories
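
The commit pattern on this slide can be sketched on a local filesystem. This is a minimal illustration, not Hadoop's output committer: the function names are hypothetical, and local `os.rename` stands in for the HDFS atomic rename.

```python
import os
import shutil
import tempfile


def write_task_output(output_dir: str, task_id: str, data: str) -> str:
    """Write one task's output under a private _temporary directory."""
    task_dir = os.path.join(output_dir, "_temporary", task_id)
    os.makedirs(task_dir)
    path = os.path.join(task_dir, "part-" + task_id)
    with open(path, "w") as out:
        out.write(data)
    return path


def commit_job(output_dir: str, task_ids: list) -> None:
    """Publish every task's file with a rename, then mark the job as done."""
    for task_id in task_ids:
        src = os.path.join(output_dir, "_temporary", task_id, "part-" + task_id)
        # Assumption: a rename within one filesystem is atomic, as on HDFS.
        os.rename(src, os.path.join(output_dir, "part-" + task_id))
    shutil.rmtree(os.path.join(output_dir, "_temporary"))
    # The marker is written last, so it appears only after all data is in place.
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()


def is_available(output_dir: str) -> bool:
    """Consumers treat a directory as readable only once the marker exists."""
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))


if __name__ == "__main__":
    out = tempfile.mkdtemp()
    write_task_output(out, "00000", "first\n")
    print(is_available(out))   # the job has not been committed yet
    commit_job(out, ["00000"])
    print(is_available(out))
```

A consumer that checks `is_available` before reading never sees a directory with only some of its part files.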

APACHE HBASE
• Provided point lookup and edits
• Read & Write performance – low latency, low throughput
• Row level atomicity
• Tephra provided transactions, but lacks adoption
• Write-Ahead Log (WAL)
• Regular compactions
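
How a write-ahead log and compaction fit together can be sketched with a toy in-memory store. This is an assumption-laden illustration, not HBase code: `TinyStore` and its methods are hypothetical, and Python lists and dicts stand in for files on disk.

```python
from typing import Dict, List, Optional, Tuple


class TinyStore:
    """Toy key-value store: log every edit first, flush, then compact."""

    def __init__(self) -> None:
        self.wal: List[Tuple[str, str]] = []      # stands in for the durable log
        self.memstore: Dict[str, str] = {}        # recent edits held in memory
        self.store_files: List[Dict[str, str]] = []  # immutable files, oldest first

    def put(self, row: str, value: str) -> None:
        self.wal.append((row, value))  # log first, so the edit survives a crash
        self.memstore[row] = value

    def flush(self) -> None:
        """Persist the in-memory edits as a new file and truncate the log."""
        if self.memstore:
            self.store_files.append(dict(self.memstore))
            self.memstore = {}
            self.wal = []

    def recover(self) -> None:
        """Rebuild the in-memory state by replaying the log after a crash."""
        for row, value in self.wal:
            self.memstore[row] = value

    def get(self, row: str) -> Optional[str]:
        if row in self.memstore:
            return self.memstore[row]
        for store_file in reversed(self.store_files):  # newest file wins
            if row in store_file:
                return store_file[row]
        return None

    def compact(self) -> None:
        """Merge all files into one so reads touch fewer files."""
        merged: Dict[str, str] = {}
        for store_file in self.store_files:  # later files overwrite earlier ones
            merged.update(store_file)
        self.store_files = [merged] if merged else []
```

Reads stay correct before and after `compact`; compaction only reduces the number of files a lookup has to check.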

TRADITIONAL APACHE HIVE
• Provided the Hive Metastore (HMS) to track tables
• Provided structure for table layout
• Value partitioning
• Only add or remove partition operations were atomic
• Only add partition was isolated
• Provided simplistic locking

APACHE HIVE ACID
• Supports streaming writes
• Integrated with SQL data manipulation commands
  • Insert, delete, update, merge
• Snapshot isolation
• Read & Write performance: high throughput, high latency
• Lockless compaction
• Writes delta directories
• Assumes HDFS consistent directory listings
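
One way to picture snapshot isolation over delta directories: a reader lists the table directory once and keeps only the directories that belong to its snapshot. The sketch below is a simplification, not Hive's implementation. It assumes directory names of the form `base_N` and `delta_N_M`, and the parameter names (`high_watermark`, `invalid_ids`) are hypothetical.

```python
import re
from typing import List, Set

_BASE = re.compile(r"^base_(\d+)$")
_DELTA = re.compile(r"^delta_(\d+)_(\d+)$")


def select_directories(listing: List[str], high_watermark: int,
                       invalid_ids: Set[int]) -> List[str]:
    """Return the base and delta directories visible to one snapshot.

    high_watermark: highest write id the reader is allowed to see.
    invalid_ids: write ids at or below the watermark that are open or aborted.
    """
    base_id, base_name = -1, None
    for name in listing:
        match = _BASE.match(name)
        # Use the newest compacted base that is not newer than the snapshot.
        if match and base_id < int(match.group(1)) <= high_watermark:
            base_id, base_name = int(match.group(1)), name

    deltas = []
    for name in listing:
        match = _DELTA.match(name)
        if not match:
            continue
        low, high = int(match.group(1)), int(match.group(2))
        if low <= base_id or high > high_watermark:
            continue  # already folded into the base, or too new for this reader
        if any(write_id in invalid_ids for write_id in range(low, high + 1)):
            continue  # written by a transaction that is open or aborted
        deltas.append((low, name))

    chosen = [base_name] if base_name is not None else []
    return chosen + [name for _, name in sorted(deltas)]
```

Because the choice depends on a single directory listing, the approach relies on that listing being consistent, which is the HDFS assumption stated on the slide.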

APACHE HUDI
• Designed for streaming data
• Row level updates
• WAL & compaction
• Assumes HDFS
• Provides three reading levels:
  • Compacted
  • Compacted + deltas
  • Deltas
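
The three reading levels can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Hudi's API: the compacted data is a dict keyed by record key, each delta is a hypothetical `(commit_time, key, value)` tuple, and a value of `None` marks a delete.

```python
from typing import Dict, List, Optional, Tuple

# (commit time, record key, new value or None for a delete)
Delta = Tuple[int, str, Optional[str]]


def read_compacted(base: Dict[str, str]) -> Dict[str, str]:
    """Fastest read: only the compacted data, possibly stale."""
    return dict(base)


def read_merged(base: Dict[str, str], deltas: List[Delta]) -> Dict[str, str]:
    """Freshest read: apply the deltas on top of the compacted data."""
    rows = dict(base)
    for _, key, value in sorted(deltas, key=lambda delta: delta[0]):
        if value is None:
            rows.pop(key, None)
        else:
            rows[key] = value
    return rows


def read_deltas(deltas: List[Delta], since: int) -> List[Delta]:
    """Incremental read: only the changes committed after a given time."""
    return [delta for delta in deltas if delta[0] > since]
```

In this model, compaction amounts to replacing the base with `read_merged(base, deltas)` and clearing the deltas.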

APACHE ICEBERG
• Designed to support data in object stores (e.g. S3)
• Avoids inconsistent & slow directory listing
• Tracks tables and partitions to file level
• Supports column min, max, and count per file
• Snapshot isolation
• Writers automatically retry on conflict
• Manifest files use copy on write
• Supports time travel and rollback
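
Several of these points (file-level tracking, per-file min/max, retry on conflict, copy on write, time travel) can be combined in one toy model. This is not Iceberg's API: `Table`, `DataFile`, `commit_with_retry`, and `plan_scan` are hypothetical names, and an in-memory list of immutable snapshots stands in for the metadata files.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    min_value: int   # smallest value of the tracked column in this file
    max_value: int   # largest value of the tracked column in this file
    row_count: int


Snapshot = Tuple[DataFile, ...]  # immutable, so every commit is copy on write


class Table:
    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = [()]  # snapshot 0 is the empty table

    def current_id(self) -> int:
        return len(self.snapshots) - 1

    def compare_and_swap(self, expected_id: int, new_snapshot: Snapshot) -> bool:
        """Commit only if no other writer committed since expected_id."""
        if expected_id != self.current_id():
            return False
        self.snapshots.append(new_snapshot)
        return True


def commit_with_retry(table: Table, change: Callable[[Snapshot], Snapshot],
                      max_attempts: int = 5) -> int:
    """Re-apply the change on the latest snapshot until the swap succeeds."""
    for _ in range(max_attempts):
        base_id = table.current_id()
        if table.compare_and_swap(base_id, change(table.snapshots[base_id])):
            return table.current_id()
    raise RuntimeError("too many conflicting commits")


def plan_scan(table: Table, snapshot_id: int, low: int, high: int) -> List[str]:
    """Time travel to snapshot_id and skip files whose range cannot match."""
    return [f.path for f in table.snapshots[snapshot_id]
            if f.max_value >= low and f.min_value <= high]
```

Old snapshots are never modified, so rollback in this model is just pointing readers at an earlier snapshot id.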

DATABRICKS DELTA
• Open-source, but closed governance
• Ignoring the proprietary version
• Designed for object stores
• Avoids inconsistent & slow directory listings
• Snapshot isolation
• Add, replace, remove data files
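
The add/replace/remove model can be sketched as replaying an ordered log of file actions. The example is loosely modeled on a log of JSON lines with `add` and `remove` actions and is an illustration only: `replay_log` is a hypothetical helper, and the file paths are made up.

```python
import json
from typing import Iterable, Optional, Set


def replay_log(commits: Iterable[str], as_of_version: Optional[int] = None) -> Set[str]:
    """Return the data files that are live after replaying the commits.

    commits: one string per commit, in order, each holding JSON lines.
    as_of_version: stop after this commit (0-based) to read an older snapshot.
    """
    live: Set[str] = set()
    for version, commit in enumerate(commits):
        if as_of_version is not None and version > as_of_version:
            break
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


if __name__ == "__main__":
    log = [
        '{"add": {"path": "part-1.parquet"}}',
        '{"remove": {"path": "part-1.parquet"}}\n{"add": {"path": "part-2.parquet"}}',
    ]
    print(replay_log(log))
```

A replace is a remove and an add in the same commit, so readers see either the old file or the new one, never both. Readers work from the log rather than a directory listing, which is what makes the design suit object stores.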

CONCLUSIONS
• GDPR is huge and is leading to redesigns of the data warehouse
• Support for object stores like S3 is critical
• Streaming ingest and processing is growing quickly
• This area is under active development
  • Will change over the next 6 months
  • Hive ACID is adding Presto & Impala support
  • Iceberg is adding delta files and Hive support

OVERVIEW OF HIGH THROUGHPUT SYSTEMS
System     SQL data ops  Open        Write amp  Object store  Stream ingest  Engines
Hive ACID  Yes           Governance  Low        Poor          Good           RW: Hive; R: Spark, Impala
Hudi       No            Governance  Low        Poor          Good           RW: Spark; R: Hive, Presto
Iceberg    No            Governance  High       Good          Poor           RW: Spark, Presto; R: Pig
Delta      No            Source      High       Good          Poor           RW: Spark; R: Presto

THANK YOU
Owen O’Malley
owen@cloudera.com
@owen_omalley
