ORC File & Vectorization - Improving Hive Data Storage and Query Performance


Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.

However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so that a query that needs only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query’s predicate.

Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works well with column store formats to do this. Vectorized query execution has been shown to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC through a vectorized iterator.
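The min/max index skipping described in the abstract can be sketched as follows. This is a minimal illustration, not ORC's reader code: the class `RowGroupSkippingSketch`, the nested `RowGroupStats`, and both method names are hypothetical, and the only detail taken from the text is the 10,000-row index granularity. The predicate is fixed to `c >= threshold` for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of min/max row-group skipping; names are hypothetical, not ORC's API. */
final class RowGroupSkippingSketch {
    static final int ROWS_PER_GROUP = 10_000; // index granularity stated in the abstract

    /** Hypothetical index entry: min and max of one column over one group of rows. */
    static final class RowGroupStats {
        final long min;
        final long max;

        RowGroupStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Builds one min/max entry per group of ROWS_PER_GROUP values. */
    static List<RowGroupStats> buildIndex(long[] column) {
        List<RowGroupStats> index = new ArrayList<>();
        for (int start = 0; start < column.length; start += ROWS_PER_GROUP) {
            int end = Math.min(start + ROWS_PER_GROUP, column.length);
            long min = column[start];
            long max = column[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            index.add(new RowGroupStats(min, max));
        }
        return index;
    }

    /** Returns the groups that may hold a value >= threshold; all other groups can be skipped. */
    static List<Integer> groupsToRead(List<RowGroupStats> index, long threshold) {
        List<Integer> selected = new ArrayList<>();
        for (int g = 0; g < index.size(); g++) {
            if (index.get(g).max >= threshold) {
                selected.add(g);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] column = new long[35_000];
        for (int i = 0; i < column.length; i++) {
            column[i] = i; // sorted values give each group a disjoint min/max range
        }
        List<RowGroupStats> index = buildIndex(column);
        System.out.println("groups: " + index.size());
        System.out.println("groups to read for c >= 25000: " + groupsToRead(index, 25_000));
    }
}
```

With the sorted sample data, only the last two of the four groups overlap the predicate, so the first 20,000 rows are never decoded. On unsorted data the ranges overlap more and fewer groups can be skipped.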


ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Copyright 2013 by Hortonworks and Microsoft
Owen O’Malley, owen@hortonworks.com, @owen_omalley
Jitendra Pandey, jitendra@hortonworks.com
Eric Hanson, ehans@microsoft.com

File Layout
[Figure: ORC file layout. The file is a sequence of 256MB stripes, each made up of Index Data, Row Data, and a Stripe Footer, followed by the File Footer and the Postscript. Within a stripe, the index data and the row data are each divided by column (Column 1 through Column 8), and a column is stored as a set of streams (Stream 2.1 through Stream 2.4 for Column 2).]

Hive Compound Types
[Figure: type tree with a column id on every node. Nesting below is inferred from the ids.]
0 Struct
  1 Int
  2 Map
    3 String
    4 Struct
      5 String
      6 Double
  7 Time
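The numbering on this slide is consistent with a pre-order walk of the type tree, where a parent gets its id before its children. The sketch below reproduces the ids 0 through 7 under that assumption. The nesting `struct<int, map<string, struct<string, double>>, time>` is inferred from the ids rather than stated in the extracted text, and `ColumnIdSketch` and `TypeNode` are hypothetical names, not Hive or ORC classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: assigns column ids to a type tree in pre-order. */
final class ColumnIdSketch {

    /** Hypothetical node in a type tree; not a Hive or ORC class. */
    static final class TypeNode {
        final String kind;
        final List<TypeNode> children;
        int columnId = -1;

        TypeNode(String kind, TypeNode... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    /** Assigns ids parent-first, then children left to right; returns the next free id. */
    static int assignIds(TypeNode node, int nextId) {
        node.columnId = nextId++;
        for (TypeNode child : node.children) {
            nextId = assignIds(child, nextId);
        }
        return nextId;
    }

    /** Collects "id kind" labels in the same pre-order for display. */
    static void collect(TypeNode node, List<String> out) {
        out.add(node.columnId + " " + node.kind);
        for (TypeNode child : node.children) {
            collect(child, out);
        }
    }

    /** Assumed nesting for the slide's example. */
    static TypeNode slideExample() {
        return new TypeNode("Struct",
                new TypeNode("Int"),
                new TypeNode("Map",
                        new TypeNode("String"),
                        new TypeNode("Struct",
                                new TypeNode("String"),
                                new TypeNode("Double"))),
                new TypeNode("Time"));
    }

    public static void main(String[] args) {
        TypeNode root = slideExample();
        assignIds(root, 0);
        List<String> labels = new ArrayList<>();
        collect(root, labels);
        labels.forEach(System.out::println);
    }
}
```

Flattening the tree this way gives every primitive and compound node its own column, which is what lets a reader select nested fields individually.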

Comparison
Feature                        RC File  Trevni  Parquet    ORC
Hive Integration               Y        N       N          Y
Active Development             N        N       Y          Y
Hive Type Model                N        N       N          Y
Shred complex columns          N        Y       Y          Y
Splits found quickly           N        Y       Y          Y
Files per bucket               1        many    1 or many  1
Versioned metadata             N        Y       Y          Y
Run-length data encoding       N        N       Y          Y
Store strings in dictionary    N        N       Y          Y
Store min, max, sum, count     N        N       N          Y
Store internal indexes         N        N       N          Y
No overhead for non-null       N        N       N          Y (≥ 0.12)
Predicate Pushdown             N        N       N          Y (≥ 0.12)

Why row-at-a-time execution is slow
• Hive uses Object Inspectors to work on a row
• Enables a level of abstraction
• Costs significant performance
• Exacerbated by using lazy serdes
• Inner loop has many method calls, new() allocations, and if-then-else branches
• Lots of CPU instructions
• Pipeline stalls
• Poor instructions/cycle
• Poor cache locality
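To make the costs in the list above concrete, here is a minimal sketch of row-at-a-time evaluation of "field + scalar". It is an illustration only: `FieldInspector` and `ArrayRowInspector` are hypothetical stand-ins that mimic the shape of an inspector-style abstraction and are not Hive's ObjectInspector API. The comments mark where the per-row method call, branch, boxing, and allocation occur.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of row-at-a-time evaluation; all names are hypothetical. */
final class RowAtATimeSketch {

    /** Hypothetical stand-in for an inspector that extracts a field from an opaque row. */
    interface FieldInspector {
        Object getField(Object row, int fieldIndex);
    }

    /** Rows are stored as Object[]; every access is an interface call plus a cast. */
    static final class ArrayRowInspector implements FieldInspector {
        @Override
        public Object getField(Object row, int fieldIndex) {
            return ((Object[]) row)[fieldIndex];
        }
    }

    /** Adds a scalar to one field of a single row and returns a new one-field row. */
    static Object[] addScalar(Object row, FieldInspector inspector, int field, long scalar) {
        Object value = inspector.getField(row, field); // interface method call per row
        if (value == null) {                           // branch per row
            return new Object[] {null};                // allocation per row
        }
        long result = ((Long) value) + scalar;         // cast and unboxing per row
        return new Object[] {result};                  // boxing and allocation per row
    }

    public static void main(String[] args) {
        FieldInspector inspector = new ArrayRowInspector();
        List<Object[]> rows = new ArrayList<>();
        for (long i = 0; i < 5; i++) {
            rows.add(new Object[] {i});
        }
        for (Object[] row : rows) {
            Object[] out = addScalar(row, inspector, 0, 10L);
            System.out.println(out[0]);
        }
    }
}
```

Each row pays for an interface dispatch, a null check, a cast, and at least one allocation, and the boxed values are scattered across the heap rather than laid out contiguously.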

How the code works (simplified)
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  // Adds the scalar to every live value of the input column.
  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the row positions listed in batch.selected are live.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // Every row position in the batch is live.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}
• No method calls in the inner loop
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8
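The slide code depends on Hive's batch classes, so it does not run on its own. The following self-contained sketch uses hypothetical stand-ins (`MiniBatch`, `MiniLongColumn`) with the same fields the slide reads (`columns`, `selected`, `selectedInUse`, `size`) and adds an assumed filter step to show how a selection vector might come to be in use. The filter is not from the slides.

```java
/** Illustrative, runnable sketch of batch-at-a-time evaluation; names are hypothetical. */
final class VectorizedSketch {

    /** Hypothetical stand-in for a column of longs; not Hive's LongColumnVector. */
    static final class MiniLongColumn {
        final long[] vector;

        MiniLongColumn(int capacity) {
            this.vector = new long[capacity];
        }
    }

    /** Hypothetical stand-in for a row batch with an optional selection vector. */
    static final class MiniBatch {
        static final int MAX_SIZE = 1024; // batch size mentioned on the slide

        final MiniLongColumn[] columns;
        final int[] selected = new int[MAX_SIZE];
        boolean selectedInUse = false;
        int size = 0;

        MiniBatch(int numColumns) {
            columns = new MiniLongColumn[numColumns];
            for (int c = 0; c < numColumns; c++) {
                columns[c] = new MiniLongColumn(MAX_SIZE);
            }
        }
    }

    /** Assumed filter step: keeps rows where the column value >= bound by rewriting selected[]. */
    static void filterGreaterEqual(MiniBatch batch, int col, long bound) {
        long[] in = batch.columns[col].vector;
        int kept = 0;
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                if (in[i] >= bound) {
                    batch.selected[kept++] = i; // kept <= j, so no unread entry is overwritten
                }
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                if (in[i] >= bound) {
                    batch.selected[kept++] = i;
                }
            }
            batch.selectedInUse = true;
        }
        batch.size = kept;
    }

    /** Same loop shape as the slide: out[i] = in[i] + scalar for live rows only. */
    static void addScalar(MiniBatch batch, int inCol, int outCol, long scalar) {
        long[] in = batch.columns[inCol].vector;
        long[] out = batch.columns[outCol].vector;
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                out[i] = in[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        MiniBatch batch = new MiniBatch(2);
        batch.size = 8;
        for (int i = 0; i < batch.size; i++) {
            batch.columns[0].vector[i] = i;
        }
        filterGreaterEqual(batch, 0, 5); // rows 5, 6, 7 stay live
        addScalar(batch, 0, 1, 100);     // only live rows are computed
        for (int j = 0; j < batch.size; j++) {
            int i = batch.selected[j];
            System.out.println(i + " -> " + batch.columns[1].vector[i]);
        }
    }
}
```

Filtering rewrites only the selection vector and leaves the column data in place, so later expressions can keep working on the same primitive arrays without copying rows.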

Preliminary performance results
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t where c >= const group by a, b -- 53 row result

Warm start times:
                   RC non-vectorized           ORC non-vectorized      ORC vectorized
                   (default, not compressed)   (default, compressed)   (default, compressed)
Runtime (sec)      261                         58                      43
Total CPU (sec)    381                         159                     42
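For reference, the ratios implied by these numbers work out as follows. This is plain arithmetic on the figures reported on the slide, not an additional measurement.

```latex
\begin{align*}
\text{Runtime, RC} \to \text{ORC non-vectorized:} \quad & 261 / 58 = 4.5\times \\
\text{Runtime, ORC non-vectorized} \to \text{ORC vectorized:} \quad & 58 / 43 \approx 1.35\times \\
\text{Runtime, RC} \to \text{ORC vectorized:} \quad & 261 / 43 \approx 6.1\times \\
\text{Total CPU, ORC non-vectorized} \to \text{ORC vectorized:} \quad & 159 / 42 \approx 3.8\times \\
\text{Total CPU, RC} \to \text{ORC vectorized:} \quad & 381 / 42 \approx 9.1\times
\end{align*}
```

On this query, vectorization reduces CPU time (about 3.8x) by more than it reduces wall-clock time (about 1.35x). The slides do not break down where the remaining runtime goes.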

Thanks to contributors!
• Microsoft Big Data: Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks: Jitendra Pandey, Owen O’Malley, Gopal V
• Others: Teddy Choi, Tim Chen
• Jitendra and Eric are joint leads

Slide outline

1. ORC File & Vectorization: Improving Hive Data Storage and Query Performance
2. ORC – Optimized RC File
3. History
4. Remaining Challenges
5. Requirements
6. File Structure
7. Stripe Structure
8. File Layout
9. Compression
10. Integer Column Serialization
11. String Column Serialization
12. Hive Compound Types
13. Compound Type Serialization
14. Generic Compression
15. Column Projection
16. How Do You Use ORC
17. Managing Memory
18. TPC-DS File Sizes
19. ORC Predicate Pushdown
20. Additional Details
21. Current work for Hive 0.12
22. Future Work
23. Comparison
24. Vectorization
25. Vectorization
26. Why row-at-a-time execution is slow
27. How the code works (simplified)
28. Vectorization project
29. Preliminary performance results
30. Thanks to contributors!