Building a Virtual Data Lake with Apache Arrow
Tomer Shiran
Co-Founder, Dremio
@tshiran
Analytics on modern data is incredibly hard
Unprecedented complexity
The demands for data are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine Learning
Segmenting
Fraud prevention
Your analysts are hungry for data
SQL
But your data is everywhere
And it’s not in the shape they need
Today you engineer data flows and reshaping
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
SQL
Today you engineer data flows and reshaping
Data Staging
Data Warehouse
• $$$
• High overhead
• Proprietary lock-in
• Custom ETL
• Fragile transforms
• Slow moving
SQL
Today you engineer data flows and reshaping
Data Staging
Data Warehouse
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
• $$$
• High overhead
• Proprietary lock-in
• Custom ETL
• Fragile transforms
• Slow moving
SQL
Lots of Copies…
How Can We Tackle This Age-Old Problem?
• Direct access to data
• In-memory, GPU, …
• Columnar
• Distributed
Apache Arrow: Process & Move Data Fast
• Top-level Apache project as of Feb 2016
• Collaboration among many open source projects around shared needs
• Three components:
• Language-independent columnar data structures (a minimal sketch follows this list)
• Implementations available for C++, Java, and Python
• Metadata for describing schemas/record batches
• Protocol for moving data between processes without serialization overhead
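As a rough illustration of those components, here is a minimal PyArrow sketch (assuming the pyarrow package is available; the column names and values are made up) that defines a schema and builds one record batch:

```python
import pyarrow as pa

# Schema: language-independent metadata describing the columns.
schema = pa.schema([
    ("session_id", pa.int64()),
    ("url", pa.string()),
])

# Record batch: one contiguous Arrow array per column, conforming to the schema.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1331246351, 1331246352]), pa.array(["/home", "/cart"])],
    schema=schema,
)

print(batch.num_rows, batch.schema.names)
```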
High-Performance Data Interchange
Today:
• Each system has its own internal memory format
• 70–80% of CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
Data is Organized in Record Batches
[Diagram: a schema followed by a sequence of record batches; the same layout is shown as the streaming format and, with the schema and file layout, as the file format.]
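For illustration, a small sketch of the two layouts using PyArrow's IPC API (the tiny schema and values here are invented for the example):

```python
import pyarrow as pa

schema = pa.schema([("x", pa.int64())])
batches = [
    pa.RecordBatch.from_arrays([pa.array([i, i + 1])], schema=schema)
    for i in (0, 2, 4)
]

# Streaming format: schema first, then record batches, one after another.
stream_sink = pa.BufferOutputStream()
with pa.ipc.new_stream(stream_sink, schema) as writer:
    for b in batches:
        writer.write_batch(b)

# File format: same schema + record batches, plus a footer for random access.
file_sink = pa.BufferOutputStream()
with pa.ipc.new_file(file_sink, schema) as writer:
    for b in batches:
        writer.write_batch(b)

reader = pa.ipc.open_file(file_sink.getvalue())
print(reader.num_record_batches)  # -> 3
```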
Each Record Batch is Columnar
Example: SELECT * FROM clickstream WHERE session_id = 1331246351
[Diagram: the same data in a traditional memory buffer vs. an Arrow (columnar) memory buffer on an Intel CPU.]
Arrow leverages the data parallelism (SIMD) in modern Intel CPUs.
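A hedged sketch of what the columnar layout enables in practice: the session_id predicate from the query above can run as a single vectorized kernel over one contiguous column (the clickstream table below is made up for illustration):

```python
import pyarrow as pa
import pyarrow.compute as pc

clickstream = pa.table({
    "session_id": pa.array([1331246351, 17, 1331246351, 42], type=pa.int64()),
    "url": ["/home", "/cart", "/checkout", "/home"],
})

# Equivalent of: SELECT * FROM clickstream WHERE session_id = 1331246351
# The comparison runs as one vectorized kernel over the contiguous column.
mask = pc.equal(clickstream["session_id"], 1331246351)
print(clickstream.filter(mask).to_pydict())
```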
Example: Spark to Pandas via Apache Arrow
Fast Import of Arrow in Pandas & R
Credit: Wes McKinney, Two Sigma
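For a sense of the pandas side, a minimal sketch (assuming pyarrow and pandas are installed; the data is invented) of converting an Arrow table to a DataFrame in one columnar step:

```python
import pyarrow as pa

# An Arrow table standing in for imported data.
table = pa.table({
    "a": pa.array(list(range(1_000)), type=pa.int64()),
    "b": pa.array([float(i) for i in range(1_000)]),
})

# Columnar conversion to pandas; zero-copy for many numeric types.
df = table.to_pandas()
print(df.dtypes)
```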
Fast Export of Arrow in Spark
• Legacy export from Spark to Pandas (toPandas) was extremely slow
• Row-by-row conversion from the Spark driver into Python memory
• SPARK-13534 introduced an Arrow-based implementation
• Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM)
• Set spark.sql.execution.arrow.enable = True (a usage sketch follows the numbers below)

Legacy toPandas → Arrow-based:
• Clock time: 12.5 s → 1.89 s (6.6× faster)
• Deserialization: 88% of the time → 1% of the time
• Peak memory usage: 8× dataset size → 2× dataset size
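A sketch of how this is typically used from PySpark; note that the exact configuration key depends on the Spark version (spark.sql.execution.arrow.enabled in Spark 2.x, spark.sql.execution.arrow.pyspark.enabled in Spark 3.x), and the toy DataFrame here is only for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# Enable the Arrow-based toPandas path (key name varies by Spark version).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")

# With Arrow enabled, the driver receives columnar record batches
# instead of pickled rows, so the conversion is far cheaper.
pdf = df.toPandas()
print(len(pdf), list(pdf.columns))
```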
Designing a Virtual Data Lake Powered by Apache Arrow
Arrow-based Execution and Integration
[Architecture: Pandas, R, and BI clients on top of Arrow-based distributed execution, backed by an in-memory columnar cache (Arrow) and a persistent columnar cache (Parquet), over the data sources (NoSQL, RDBMS, Hadoop, S3).]
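One way to picture the two cache layers is the following sketch (the file name and data are invented): Parquet acts as the persistent columnar cache on disk, and the Arrow table is the in-memory form handed to pandas/R/BI-style consumers:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small Arrow table standing in for a cached, pre-shaped dataset.
hot = pa.table({
    "region": ["us", "eu", "us"],
    "revenue": [1.0, 2.5, 3.2],
})

# Persistent columnar cache: Parquet on disk.
pq.write_table(hot, "cache.parquet")

# In-memory columnar cache: read straight back into Arrow memory,
# then hand it to pandas-style consumers.
cached = pq.read_table("cache.parquet")
print(cached.to_pandas().groupby("region")["revenue"].sum())
```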
Demo
Thank You
• Apache Arrow community
• Strata organizers
• Get involved
• Subscribe to the Arrow ASF lists
• Contribute to the Arrow project
• Want to learn more about Dremio?
• tshiran@dremio.com


Editor's Notes

  1. BI assumes a single relational database, but data now lives in non-relational technologies, is fragmented across many systems, and arrives at massive scale and velocity.
  2. Data is the business, and this is the era of impatient smartphone natives, the rise of self-service BI, and accelerating time to market. Because of the complexity of modern data and the increasing demands for it, IT gets crushed in the middle: slow or non-responsive IT, "shadow analytics," data governance risk, elusive data engineers, immature software, and competing strategic initiatives.
  3. Here's the problem everyone is trying to solve today. You have consumers of data with their favorite tools: BI products like Tableau, Power BI, and Qlik, as well as data science tools like Python, R, Spark, and SQL. Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud storage like S3. So how are you going to get the data to the people asking for it?
  4. Here's how everyone tries to solve it. First you move the data out of the operational systems into a staging area, which might be Hadoop or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they're fragile: when the sources change, the scripts have to change too.
  5. Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or another product. These are all proprietary, and they take DBA experts to make them work. To move the data here you write another set of scripts. But what we see with many customers is that the performance still isn't sufficient for their needs, and so…
  6. You build cubes and aggregation tables to get the performance your users are asking for, and to do this you build yet another set of scripts. In the end you're left with something like this picture. You may have more layers and the technologies may be different, but you're probably living with something like this. Nobody likes it: it's expensive, the data movement is slow, and it's hard to change. Worst of all, every time a consumer wants a new piece of data, they open a ticket with IT, and IT begins an engineering project to build another set of pipelines over several weeks or months.
  6. Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too. Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so … You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts. In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change. But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data: They open a ticket with IT IT begins an engineering project to build another set of pipelines, over several weeks or months