ORC Files

Hive’s RCFile has been the standard format for storing Hive data for the last three years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects a subset of the columns for reading, so a query that needs only one column reads only that column's bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query. Finally, ORC works together with the upcoming query vectorization work, providing a high-bandwidth reader/writer interface.
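The row-skipping idea in the abstract can be illustrated with a small Python sketch. It is not ORC's reader code: the names `RowGroupStats`, `build_index`, and `groups_to_read` are hypothetical, and the only detail taken from the abstract is the 10,000-row index granularity. The sketch keeps a minimum and maximum per group of rows and skips any group whose range cannot overlap the filter.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # index granularity stated in the abstract


@dataclass
class RowGroupStats:
    """Minimum and maximum of one column over one group of rows (hypothetical type)."""
    first_row: int
    min_value: int
    max_value: int


def build_index(values: List[int], group_size: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record min/max for each consecutive group of `group_size` values."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        index.append(RowGroupStats(start, min(group), max(group)))
    return index


def groups_to_read(index: List[RowGroupStats], low: int, high: int) -> List[int]:
    """Return first-row offsets of groups that may hold a value in [low, high].

    A group is skipped only when its [min, max] range cannot overlap the
    filter range, so no matching row is ever missed.
    """
    return [g.first_row for g in index
            if g.max_value >= low and g.min_value <= high]


# Made-up column: 30,000 sorted values, so each group covers a distinct range.
column = list(range(30_000))
index = build_index(column)
selected = groups_to_read(index, low=12_500, high=12_600)
print(selected)  # [10000]
```

Only the group starting at row 10,000 is read here; the other two groups are skipped without touching their data. The benefit depends on how the data is ordered: on unsorted data the ranges overlap more and fewer groups can be skipped.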

© Hortonworks Inc. 2012
ORC Files
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley

Who Am I?
Page 2

Remaining Challenges
Page 4

Requirements
Page 5

File Structure
Page 6

Stripe Structure
Page 7

File Layout
Page 8
[Figure: layout of an ORC file. Labels: three 256MB stripes, each made up of Index Data, Row Data, and a Stripe Footer; a File Footer; a Postscript. The stripe detail shows two groups of boxes labeled Column 1 through Column 8, and Streams 2.1 through 2.4.]

Compression
Page 9

Integer Column Serialization
Page 10
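The body of this slide is not part of the transcript, so the following is only a generic illustration of run-length encoding, one of the integer techniques named in the abstract. It is a minimal sketch that stores (run length, value) pairs; it is not ORC's actual integer encoding, and `rle_encode` and `rle_decode` are hypothetical names.

```python
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (run_length, value) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][1] == v:
            # Same value as the current run: extend it by one.
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            # Value changed: start a new run.
            runs.append((1, v))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (run_length, value) pairs back into the original sequence."""
    out: List[int] = []
    for count, v in runs:
        out.extend([v] * count)
    return out


data = [7, 7, 7, 7, 3, 3, 9]
encoded = rle_encode(data)
print(encoded)  # [(4, 7), (2, 3), (1, 9)]
assert rle_decode(encoded) == data
```

The scheme pays off when a column has long runs of equal values; a column with no repeats would grow rather than shrink under this simple pairing.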

String Column Serialization
Page 11
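The body of this slide is also missing from the transcript. As a generic illustration of the dictionary encoding mentioned in the abstract, the sketch below replaces each string with an index into a list of distinct values. Sorting the dictionary is an assumption made here for determinism, and `dict_encode` and `dict_decode` are hypothetical names, not ORC APIs.

```python
from typing import List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, ids) where ids[i] indexes the dictionary entry for values[i]."""
    dictionary = sorted(set(values))  # assumption: distinct values kept in sorted order
    position = {s: i for i, s in enumerate(dictionary)}
    ids = [position[s] for s in values]
    return dictionary, ids


def dict_decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Rebuild the original strings from the dictionary and the id list."""
    return [dictionary[i] for i in ids]


states = ["CA", "NY", "CA", "TX", "NY", "CA"]
dictionary, ids = dict_encode(states)
print(dictionary)  # ['CA', 'NY', 'TX']
print(ids)         # [0, 1, 0, 2, 1, 0]
assert dict_decode(dictionary, ids) == states
```

Each distinct string is stored once, and the id list is a plain integer sequence that integer techniques such as run-length encoding or bit packing can compress further.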

Hive Compound Types
Page 12
[Figure: type tree with numbered nodes: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time]
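The numbered type labels on this slide are consistent with numbering a nested type tree in pre-order (parent first, then children left to right). The sketch below shows that numbering under an assumed nesting: the transcript preserves only the labels, so the tree shape built here, and the names `TypeNode`, `assign_ids`, and `flatten`, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """One node of a nested type tree (hypothetical helper type)."""
    kind: str
    children: List["TypeNode"] = field(default_factory=list)
    column_id: int = -1


def assign_ids(node: TypeNode, next_id: int = 0) -> int:
    """Number nodes in pre-order; return the next unused id."""
    node.column_id = next_id
    next_id += 1
    for child in node.children:
        next_id = assign_ids(child, next_id)
    return next_id


def flatten(node: TypeNode) -> List[Tuple[int, str]]:
    """List (column_id, kind) pairs in pre-order."""
    rows = [(node.column_id, node.kind)]
    for child in node.children:
        rows.extend(flatten(child))
    return rows


# Assumed nesting: a struct of an int, a map from string to a
# struct of (string, double), and a time column.
root = TypeNode("Struct", [
    TypeNode("Int"),
    TypeNode("Map", [
        TypeNode("String"),
        TypeNode("Struct", [TypeNode("String"), TypeNode("Double")]),
    ]),
    TypeNode("Time"),
])
assign_ids(root)
for column_id, kind in flatten(root):
    print(column_id, kind)
```

Under this assumed shape the output pairs each id with the same type label that the slide shows, from 0 Struct through 7 Time.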

Compound Type Serialization
Page 13

Generic Compression
Page 14

Column Projection
Page 15

How Do You Use ORC
Page 16

Managing Memory
Page 17

Pavan’s Trick
Page 18

Looking at ORC File Structures
Page 19

Looking at ORC File Structures
Page 20

TPC-DS File Sizes
Page 21

TPC-DS Query Performance
Page 22

Additional Details
Page 23

Current Work
Page 24

Vectorization
Page 25

Vectorization Preliminary Results
Page 26

Future Work
Page 27

Comparison
Page 29
                              RC File   Trevni   Parquet   ORC File
Hive Type Model               N         N        N         Y
Separate complex columns      N         Y        Y         Y
Splits found quickly          N         Y        Y         Y
Default column group size     4MB       64MB*    64MB*     256MB
Files per bucket              1         > 1      1*        1
Store min, max, sum, count    N         N        N         Y
Versioned metadata            N         Y        Y         Y
Run length data encoding      N         N        Y         Y
Store strings in dictionary   N         N        N         Y
Store row count               N         Y        N         Y
Skip compressed blocks        N         N        N         Y
Store internal indexes        N         N        N         Y
