ORC 2015: Faster, Better, Smaller

© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Prasanth Jayachandran
Apache Hive Team, Hortonworks
@prasanth_j

Apache ORC – Optimized Row-Columnar File
 Apache TLP – orc.apache.org
 Type Specific Encodings
 Came out of Apache Hive
 Vectorized Readers (Java, C++)
 Projection and Predicate Pushdown
 Columnar Storage
 Block Compression
 Hive ACID transactions
 Single SerDe Format
 Protobuf Metadata Storage

ORC: Format Specification
How does ORC store data?

ORC File Layout
 File Footer and Postscript
 Stripes
   Indexes (row group indexes and Bloom filters, interleaved)
     Min/max statistics and positions for every 10K rows
   Data
     Multiple streams per column, encoded and compressed independently
   Stripe Footer
     Locations of streams, type of encoding
 Full specification at (1)

ORC Writer
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
 One tree writer per flattened column
 Multiple streams per column
 PRESENT
 DATA
 LENGTH
 DICTIONARY_DATA
 SECONDARY
 ROW_INDEX
 BLOOM_FILTER
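
To make "one tree writer per flattened column" concrete, here is a minimal sketch that assigns column ids to the example schema by pre-order traversal (parent before children). The nested-tuple schema representation and the `flatten` function are assumptions made for this illustration and are not the ORC writer API.

```python
# Hypothetical schema representation: (name, kind, children).
SCHEMA = ("root", "struct", [
    ("i", "int", []),
    ("m", "map", [
        ("k", "string", []),
        ("v", "struct", [
            ("s", "string", []),
            ("d", "double", []),
        ]),
    ]),
    ("t", "time", []),
])


def flatten(node, path="", out=None):
    """Assign consecutive ids to every node in pre-order (parent before children)."""
    if out is None:
        out = []
    name, kind, children = node
    full_name = f"{path}.{name}" if path else name
    out.append((len(out), full_name, kind))
    for child in children:
        flatten(child, full_name, out)
    return out


for column_id, name, kind in flatten(SCHEMA):
    print(column_id, name, kind)
```

Each of the eight flattened columns (the root struct plus seven nested fields) would get its own tree writer and its own set of streams.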

ORC Data Streams
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
 Streams can be suppressed.
 Example: PRESENT stream is suppressed when all values in a stripe are non-null.
[Figure: streams per column (IS_PRESENT, DATA, DICTIONARY, LENGTH, SECONDARY) and their compression buffers]

ORC: Features Timeline
How has ORC improved over time?

Timeline
February 2013
 Stinger Initiative Announcement*
 Roadmap to improve Apache Hive’s performance by 100x
 Delivered in 100% Apache Open Source
 Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100x
* http://hortonworks.com/blog/100x-faster-hive/

Timeline
March 2013
Optimized Row Columnar (ORC) file format committed to Hive
 Hive version: 0.11
 Native data format in Hive

Timeline
March 2013
Predicate Pushdown
 SARG interface
 Prune stripes and row groups based on min/max statistics
Improved Run Length Encoding
 Tighter bit packing
 Longer runs
 Sub-encodings: DELTA, SHORT_REPEAT, DIRECT, PATCHED_BASE
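
As a rough illustration of how a run of integers might be routed to one of these sub-encodings, the sketch below uses deliberately simplified rules. The function names and the run-length limits (3 to 10 values for a short repeat) are assumptions for this example; PATCHED_BASE is left out, and the real writer's selection logic is more involved.

```python
def choose_encoding(run):
    """Pick a sub-encoding name for a list of integers (simplified rules)."""
    # A short run of one repeated value.
    if 3 <= len(run) <= 10 and len(set(run)) == 1:
        return "SHORT_REPEAT"
    # A constant step between neighbours (a step of 0 covers long repeats).
    deltas = [b - a for a, b in zip(run, run[1:])]
    if len(run) >= 3 and len(set(deltas)) == 1:
        return "DELTA"
    # Anything else is bit-packed as-is.
    return "DIRECT"


def bits_needed(run):
    """Smallest bit width that holds every non-negative value in the run."""
    return max(1, max(run).bit_length())


print(choose_encoding([7, 7, 7, 7, 7]))
print(choose_encoding([10, 20, 30, 40]))
print(choose_encoding([3, 1, 4, 1, 5]), bits_needed([3, 1, 4, 1, 5]))
```

Tighter bit packing then amounts to storing DIRECT values with only `bits_needed` bits each instead of a fixed-size integer.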

Run Length Encoding Improvements
(Ratio = compression ratio; Enc/Dec = encoding/decoding time in ms)

Dataset                                          | RLE (hive 0.11)      | RLE (hive >= 0.12)
                                                 | Ratio   Enc    Dec   | Ratio   Enc    Dec
Twitter Census API ID (24,556,361 records)       | 2.32    1770   1263  | 6.97    1558   864
HTTP Archive (bytes.json)                        | 79.4    198    191   | 200.82  263    125
Github Archive (root.payload.name.txt.dict-len)  | 114.05  21     15    | 260.73  23     15
AOL Querylog Epoch (36,389,577 records)          | 2.51    553    364   | 3.7     652    246

Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx

Timeline
April 2013 and June 2013
Vectorized ORC readers
 Read and process columns in batches of size 1024
Null stream suppression
 Suppress PRESENT stream if no nulls in a stripe
 Enables fast path in vectorization
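
The effect of null stream suppression can be sketched as follows. This is a toy model, assuming a column is a Python list with `None` for nulls; `build_streams` and `read_column` are hypothetical helper names, not ORC APIs.

```python
def build_streams(values):
    """Split one column's stripe values into PRESENT and DATA streams.

    None marks a null. The PRESENT stream is omitted when no value is null.
    """
    present = [v is not None for v in values]
    streams = {"DATA": [v for v in values if v is not None]}
    if not all(present):
        streams["PRESENT"] = present
    return streams


def read_column(streams):
    """Rebuild the column; a missing PRESENT stream is the fast path."""
    data = streams["DATA"]
    if "PRESENT" not in streams:
        return list(data)  # fast path: no per-row null checks
    values = iter(data)
    return [next(values) if is_set else None for is_set in streams["PRESENT"]]


print(build_streams([1, 2, 3]))
print(read_column(build_streams([1, None, 3])))
```

A vectorized reader can use the absence of the PRESENT stream to skip null handling for the whole batch.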

Timeline
October 2013 and November 2013
Statistics Interface
 Writer – Update statistics during load time
 Reader – ANALYZE TABLE .. NOSCAN
Split Elimination
 Stripe level column statistics
 Eliminate stripes that do not satisfy predicate conditions
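
Split elimination comes down to a range-overlap test against per-stripe statistics. The sketch below assumes a hypothetical `StripeStats` record holding the minimum and maximum of one integer column; it is an illustration of the idea rather than the reader's actual code.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Hypothetical per-stripe statistics for one column."""
    stripe_id: int
    minimum: int
    maximum: int


def stripes_to_read(stats, low, high):
    """Keep stripes whose [minimum, maximum] range overlaps [low, high]."""
    return [s.stripe_id for s in stats if s.maximum >= low and s.minimum <= high]


stats = [StripeStats(0, 1, 100), StripeStats(1, 101, 200), StripeStats(2, 201, 300)]
print(stripes_to_read(stats, 150, 150))   # predicate: col = 150
print(stripes_to_read(stats, 90, 120))    # predicate: col BETWEEN 90 AND 120
```

Stripes that fail the test are never turned into splits, so no task is scheduled for them.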

Timeline
February 2014 and June 2014
Zero copy read path
 HDFS caching APIs to read directly into memory without extra data copies
Serialization Improvements
 Bit width alignment (trade off space for speed)
 Unrolled bit packing and unpacking
 Buffered double reader and writer
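
Bit width alignment can be sketched as rounding the required width up to a small set of allowed widths, spending a few extra bits so that values fall on friendlier boundaries. The list of aligned widths below is an assumption taken from the bit widths on the integer read-performance chart in this deck; `pack` and `unpack` are plain (not unrolled) reference versions written for this example.

```python
import bisect

# Assumed aligned widths (the bit widths shown on the read-performance chart).
ALIGNED_WIDTHS = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]


def aligned_width(bits):
    """Round a required bit width up to the nearest aligned width."""
    if not 1 <= bits <= 64:
        raise ValueError("bit width must be between 1 and 64")
    return ALIGNED_WIDTHS[bisect.bisect_left(ALIGNED_WIDTHS, bits)]


def pack(values, width):
    """Pack non-negative integers into bytes using `width` bits per value."""
    acc = 0
    for v in values:
        acc = (acc << width) | v
    total_bits = width * len(values)
    pad = (-total_bits) % 8  # zero bits added at the end to fill the last byte
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack(data, width, count):
    """Inverse of pack: read `count` values of `width` bits each."""
    acc = int.from_bytes(data, "big") >> (len(data) * 8 - width * count)
    mask = (1 << width) - 1
    return [(acc >> (width * (count - 1 - i))) & mask for i in range(count)]


width = aligned_width(3)            # 3 bits are enough for 0..7, aligned up to 4
packed = pack([5, 3, 7], width)
print(width, len(packed), unpack(packed, width, 3))
```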

[Figure: ORC Read Integer Performance (smaller is better). Mean time (ms) against bit width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64) for two series: hive 0.13 unpacking and hive-1.0 unpacking (new).]

ORC Read Double Performance (smaller is better)
Mean time (ms) by double read mode:
 hive <= 0.13: 241.679
 buffered + BE: 171.045
 buffered + LE: 174.163
~1.4x improvement

Timeline
June 2014 and July 2014
Adaptive compression buffer size
 For tables with more than 1000 columns, adjust the compression buffer size based on available memory
 Avoids wide table OOMs
Fast stripe level file merging
 Many small files to few large files
 No decompression, no decoding
 ALTER TABLE … CONCATENATE

Fast File Merging
ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better)
Total time in seconds, by file format supporting CONCAT:

Format   Load Time   Merge Time   Total
ORC      1091        245          1336
RCFile   651         816          1467

~3.33x improvement in merge time

Timeline
July 2014
ORC Padding Improvements
 Pad bytes to avoid remote HDFS reads
 Last stripe is adjusted to fit within HDFS block boundary (worst case: 5% wastage)
Decouple stripe size vs block size
 Smaller stripes (64MB)
 More stripes per block (4 per block)
 Better parallelism & split elimination
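
One way to picture the padding rule is as a three-way decision made before each stripe is written. This is a sketch under stated assumptions: the function name, its parameters, and the exact comparison are hypothetical, and the 5% tolerance is taken from the "worst case: 5% wastage" figure above.

```python
MB = 1 << 20


def plan_next_stripe(offset, block_size, stripe_size, tolerance=0.05):
    """Return (padding_bytes, next_stripe_size) for a stripe starting at offset."""
    remaining = block_size - (offset % block_size)
    if remaining >= stripe_size:
        return 0, stripe_size          # the stripe fits in the current block
    if remaining <= tolerance * stripe_size:
        return remaining, stripe_size  # small gap: pad up to the next block
    return 0, remaining                # larger gap: shrink the stripe to fit


print(plan_next_stripe(0, 256 * MB, 64 * MB))
print(plan_next_stripe(254 * MB, 256 * MB, 64 * MB))
print(plan_next_stripe(224 * MB, 256 * MB, 64 * MB))
```

Either way no stripe straddles an HDFS block boundary, so a task reading a stripe can be scheduled where the whole stripe is local.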

Timeline
September 2014
String Dictionary Improvements
 Row group level checking
 Remember decision across stripes
 Avoids expensive RBTree insertions
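
The three bullets above can be sketched as a writer that checks the distinct-value ratio after the first row group and then keeps its decision for later stripes. Everything here is hypothetical: the class, the 0.8 ratio threshold, and the use of a plain `dict` in place of the red-black tree are assumptions for illustration only.

```python
class StringColumnWriter:
    """Toy writer that decides between dictionary and direct encoding."""

    def __init__(self, threshold=0.8, row_group_size=10_000):
        self.threshold = threshold          # assumed max distinct/total ratio
        self.row_group_size = row_group_size
        self.use_dictionary = True          # decision is kept across stripes
        self.decided = False
        self.dictionary = {}
        self.rows_seen = 0

    def add(self, value):
        self.rows_seen += 1
        if self.use_dictionary:
            self.dictionary.setdefault(value, len(self.dictionary))
        if not self.decided and self.rows_seen == self.row_group_size:
            self._check_row_group()

    def _check_row_group(self):
        ratio = len(self.dictionary) / self.rows_seen
        if ratio > self.threshold:
            self.use_dictionary = False     # stop paying for dictionary inserts
            self.dictionary.clear()
        self.decided = True

    def start_new_stripe(self):
        self.dictionary.clear()             # per-stripe dictionary is reset,
        # while use_dictionary and decided are intentionally left untouched


writer = StringColumnWriter(row_group_size=100)
for i in range(100):
    writer.add(f"value-{i}")                # every value is distinct
print(writer.use_dictionary)
```

Deciding after one row group, instead of at the end of the stripe, avoids inserting an entire stripe of mostly unique strings into the dictionary only to discard it.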

String Dictionary Improvements - TPC-H 1000 Scale Lineitem (smaller is better)
Load time in seconds, by Hive version:
 hive <= 0.13: 767
 hive > 0.13: 540
~1.4x improvement

Timeline
September 2014
Improved ZLIB compression
 Different streams compressed with different zlib strategies/levels
 Compress integers and doubles differently
 Data and Dictionary streams
  - Look for smaller byte patterns
 All other streams
  - Less LZ77, more Huffman
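
The idea of per-stream zlib settings can be tried out with Python's standard `zlib` module, which exposes the same level and strategy knobs. The mapping from stream kind to settings below is an assumption made for this sketch (default strategy for data and dictionary streams, `Z_FILTERED`, which favors Huffman coding over string matching, for the rest); it is not the mapping used by the ORC writer.

```python
import zlib


def deflate(data, level, strategy):
    """Compress bytes with an explicit zlib level and strategy."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                                  zlib.DEF_MEM_LEVEL, strategy)
    return compressor.compress(data) + compressor.flush()


# Hypothetical settings per stream kind: (level, strategy).
STREAM_SETTINGS = {
    "DATA": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    "DICTIONARY_DATA": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
}
OTHER_STREAMS = (zlib.Z_BEST_SPEED, zlib.Z_FILTERED)


def compress_stream(kind, data):
    """Compress one stream with the settings chosen for its kind."""
    level, strategy = STREAM_SETTINGS.get(kind, OTHER_STREAMS)
    return deflate(data, level, strategy)


payload = bytes(range(256)) * 64
for kind in ("DATA", "LENGTH"):
    print(kind, len(payload), len(compress_stream(kind, payload)))
```

Every variant still produces a standard zlib stream, so the reader needs no knowledge of which strategy the writer picked.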

ZLIB Improvements
Data Size Improvements - TPC-H 1000 Scale Lineitem (smaller is better)
Data size in GB, by file format + compression codec:
 ORC + (old ZLIB): 178.5
 ORC + (new ZLIB): 172.2
 ORC + SNAPPY: 225.1
New ZLIB: ~4% improvement over old ZLIB, ~1.3x smaller than SNAPPY

ZLIB Improvements
Load Time Improvements - TPC-H 1000 Scale Lineitem (smaller is better)
Load time, by file format + compression codec:
 ORC + (old ZLIB): 674
 ORC + (new ZLIB): 433
 ORC + SNAPPY: 389
New ZLIB: ~1.6x improvement over old ZLIB, only ~10% slower than SNAPPY

Timeline
September 2014
ACID transactions
 Order of millions of rows
 Not designed for OLTP requirements
 Streaming Ingest via Flume or Storm
 Atomically add base and delta directories
 Minor compaction – Merge many delta files
 Major compaction – Re-write base files to incorporate delta file changes
 Broken pattern: Add Partitions for Atomicity

Timeline
January 2015
hasNull flag in ORC internal index
 Better pruning of row groups
 Improves the performance of SELECT .. WHERE column IS NULL;
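
With a per-row-group hasNull flag, an IS NULL predicate can be answered from the index alone for most row groups. A minimal sketch, assuming a hypothetical `RowGroupIndex` record per row group:

```python
from dataclasses import dataclass


@dataclass
class RowGroupIndex:
    """Hypothetical index entry for one row group of one column."""
    first_row: int
    has_null: bool


def row_groups_for_is_null(index):
    """Row groups that may contain a null; the rest are skipped unread."""
    return [entry.first_row for entry in index if entry.has_null]


index = [
    RowGroupIndex(0, False),
    RowGroupIndex(10_000, True),
    RowGroupIndex(20_000, False),
]
print(row_groups_for_is_null(index))
```

Without the flag, min/max statistics say nothing about nulls, so every row group has to be read to evaluate the predicate.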

hasNull in Index Improvement
select * from lineitem where l_shipdate is null (smaller is better)
Execution time in seconds, by Hive version:
 hive < 1.1.0: 66.73
 hive >= 1.1.0: 7.87
Execution time: ~8.5x improvement
Bytes read: 208.77 GB vs 539 MB

Timeline
February 2015
Bloom Filter Index
 Much better row group pruning when compared to min/max
 Bloom filter evaluated after the fast min/max based elimination
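
The two-step check (min/max first, Bloom filter second) can be sketched with a small self-contained Bloom filter. The class below is an illustration only: it uses SHA-256 from the standard library as a stand-in hash, and the sizes are arbitrary; it does not reproduce ORC's on-disk Bloom filter format.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in one Python integer

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def might_match(key, minimum, maximum, bloom):
    """Cheap min/max test first; consult the Bloom filter only if it passes."""
    if key < minimum or key > maximum:
        return False
    return bloom.might_contain(key)


row_group = [10, 20, 30]
bloom = BloomFilter()
for value in row_group:
    bloom.add(value)
print(might_match(20, min(row_group), max(row_group), bloom))
print(might_match(5, min(row_group), max(row_group), bloom))
```

For an equality predicate on a key that lies inside the min/max range but is absent from the row group, the Bloom filter will usually (not always) report a miss, which is where the extra pruning over min/max comes from.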

Bloom Filter Indexes Improvements
select * from tpch_1000.lineitem where l_orderkey = 1212000001;
Rows read (log scale, smaller is better):
 No Indexes: 5,999,989,709
 Min-Max Indexes: 540,000
 Bloomfilter Indexes: 10,000

Bloom Filter Indexes Improvements
select * from tpch_1000.lineitem where l_orderkey=1212000001;
Time taken in seconds (smaller is better):
 No Indexes: 74
 Min-Max Indexes: 4.5 (~16x improvement)
 Bloomfilter Indexes: 1.34 (~3.3x improvement)

Timeline
April 2015
Split Strategies
 BI – Skip reading file footer
 ETL – Read and cache file footer
 HYBRID – Default. Chooses BI/ETL based on number of files and average file size
 Group splits based on columnar projection size instead of file size

Timeline
April 2015
ORC became Apache Top Level Project
 C++ reader with contributions from Hortonworks, HP and Microsoft
 Column encryption to encrypt sensitive columns
http://orc.apache.org/

ORC: In Production

ORC at Facebook
 Compression: Saved more than 1,400 servers' worth of storage. (2)
 Compression: Compression ratio increased from 5x to 8x globally. (2)

ORC at Spotify
 IO: 16x less HDFS read when using ORC versus Avro. (3)
 CPU: 32x less CPU when using ORC versus Avro. (3)

ORC at Yahoo!
 Speedup: 6-50x speedup when using ORC versus Text File. (4)
 Speedup: 1.6-30x speedup when using ORC versus RCFile. (4)

ORC: LLAP and Sub-second
ORC – Pushing for Sub-second

ORC: LLAP
 JIT performance for short queries
 Row-group level caching
 Asynchronous IO Elevator
 Multi-threaded column vector processing

ORC: Vectorization + SIMD
 Allocation-free tight inner loops enable the JDK’s auto-vectorization
 Vectors can be filtered early in ORC
 String dictionary can be used for binary search
 Vectorized SIMD Join
   Improves performance for single key joins
Example:
Query: select ss_ext_tax + 1.0 from store_sales_orc;
JVM Options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
Note: Make sure the HotSpot disassembler (hsdis) is in $JAVA_HOME/jre/lib
Generated Assembly:
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
AVX – Vector Addition Packed Double (vaddpd): 4 doubles loaded into 256-bit registers

ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
select * from tpch_1000.lineitem where l_orderkey=1212000001;

Questions
?
Interested? Stop by the Hortonworks booth to learn more

Endnotes
(1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
(2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
(3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
(4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3
