ORC 2015: Faster, Better, Smaller
- 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORC 2015: Faster, Better, Smaller
Prasanth Jayachandran
Apache Hive Team, Hortonworks
@prasanth_j
- 2.
Apache ORC – Optimized Row-Columnar File
Apache TLP – orc.apache.org
Type Specific Encodings
Came out of Apache Hive
Vectorized Readers (Java, C++)
Projection and Predicate Pushdown
Columnar Storage
Block Compression
Hive ACID transactions
Single SerDe Format
Protobuf Metadata Storage
- 3.
ORC: Format Specification
How does ORC store data?
- 4.
ORC File Layout
File Footer and Postscript
Stripes
Indexes (row group indexes and Bloom filters, interleaved)
Min/max stats, positions for every 10K rows
Data
Multiple streams per column, encoded and compressed independently
Stripe Footer
Locations of streams, type of encoding
Full specification at (1)
- 5.
ORC Writer
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
One tree writer per flattened column
Multiple streams per column
PRESENT
DATA
LENGTH
DICTIONARY_DATA
SECONDARY
ROW_INDEX
BLOOM_FILTER
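A minimal sketch of how the schema above flattens into one column per tree writer, assuming column ids are handed out in pre-order with the root struct as column 0. `TypeNode` and `flatten` are hypothetical helpers for illustration, not part of any ORC API.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    name: str                      # field name ("" for the root struct)
    kind: str                      # e.g. "struct", "map", "int"
    children: list = field(default_factory=list)


def flatten(root):
    """Walk the type tree in pre-order and return [(column_id, path, kind)]."""
    out = []

    def visit(node, path):
        out.append((len(out), path, node.kind))   # next free id goes to this node
        for child in node.children:
            visit(child, f"{path}.{child.name}" if path else child.name)

    visit(root, "")
    return out


# <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
schema = TypeNode("", "struct", [
    TypeNode("i", "int"),
    TypeNode("m", "map", [
        TypeNode("k", "string"),
        TypeNode("v", "struct", [TypeNode("s", "string"), TypeNode("d", "double")]),
    ]),
    TypeNode("t", "time"),
])
columns = flatten(schema)
for column_id, path, kind in columns:
    print(column_id, path or "<root>", kind)
```

Under this numbering the example schema yields eight flattened columns, each of which would get its own tree writer and its own set of streams.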
- 6.
ORC Data Streams
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
Streams can be suppressed.
Example: PRESENT stream is suppressed when all values in a stripe are non-null.
Diagram: IS_PRESENT, DATA, DICTIONARY, LENGTH and SECONDARY streams, each with compression buffers
- 7.
ORC: Features Timeline
How has ORC improved over time?
- 8.
Timeline
February 2013
Stinger Initiative Announcement*
Roadmap to improve Apache Hive’s
performance by 100x
Delivered in 100% Apache Open Source
* http://hortonworks.com/blog/100x-faster-hive/
Vectorized SQL Engine + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100x
- 9.
Timeline
March 2013
Optimized Row Columnar (ORC)
file format committed to Hive
Hive version: 0.11
Native data format in Hive
- 10.
Timeline
March 2013
Predicate Pushdown
SARG interface
Prune stripes and row groups
based on min/max statistics
Improved Run Length Encoding
Tighter bit packing
Longer runs
DELTA, SHORT_REPEATS,
DIRECT, PATCHED_BASE
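A minimal sketch of pruning row groups with min/max statistics for an equality predicate. `RowGroupStats` and `select_row_groups` are hypothetical names for illustration; the real SARG interface is not shown here.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    minimum: int
    maximum: int


def may_contain_equal(stats, value):
    """A row group can only hold `value` if it lies inside [min, max]."""
    return stats.minimum <= value <= stats.maximum


def select_row_groups(index, value):
    """Return the positions of row groups that still have to be read."""
    return [i for i, stats in enumerate(index) if may_contain_equal(stats, value)]


index = [
    RowGroupStats(1, 9_999),
    RowGroupStats(10_000, 19_999),
    RowGroupStats(15_000, 30_000),
]
selected = select_row_groups(index, 17_500)
print(selected)
```

The same check applied to stripe-level statistics would skip whole stripes before any row group index is consulted.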
- 11.
Run Length Encoding Improvements
RLE (hive 0.11) vs RLE (hive >= 0.12); encoding and decoding times in ms
Dataset                                          | 0.11 ratio | 0.11 encode | 0.11 decode | >= 0.12 ratio | >= 0.12 encode | >= 0.12 decode
Twitter Census API ID (24,556,361 records)       | 2.32       | 1770        | 1263        | 6.97          | 1558           | 864
HTTP Archive (bytes.json)                        | 79.4       | 198         | 191         | 200.82        | 263            | 125
Github Archive (root.payload.name.txt.dict-len)  | 114.05     | 21          | 15          | 260.73        | 23             | 15
AOL Querylog Epoch (36,389,577 records)          | 2.51       | 553         | 364         | 3.7           | 652            | 246
Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx
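A simplified sketch of how a writer might pick among the four sub-encodings named earlier (SHORT_REPEATS, DELTA, PATCHED_BASE, DIRECT). The rules and the 8-bit outlier threshold are assumptions chosen for illustration; they do not reproduce the actual ORC heuristics or byte layout.

```python
def bits_needed(n):
    """Bits required to store a non-negative integer (at least 1)."""
    return max(1, n.bit_length())


def choose_encoding(values):
    """Pick a sub-encoding name for one run of integers (illustrative rules only)."""
    if not values:
        raise ValueError("empty run")
    # A short run of one repeated value.
    if 3 <= len(values) <= 10 and len(set(values)) == 1:
        return "SHORT_REPEATS"
    # A constant step between neighbours (including step 0 for long repeats).
    deltas = [b - a for a, b in zip(values, values[1:])]
    if len(values) > 1 and len(set(deltas)) == 1:
        return "DELTA"
    # A few large outliers: store most values narrow and patch the rest.
    base = min(values)
    widths = sorted(bits_needed(v - base) for v in values)
    p90 = widths[int(0.9 * (len(widths) - 1))]
    if widths[-1] - p90 > 8:          # hypothetical threshold
        return "PATCHED_BASE"
    return "DIRECT"


print(choose_encoding([7] * 5))
print(choose_encoding([10, 20, 30, 40]))
print(choose_encoding([3, 1, 4, 1, 5, 9, 2, 6]))
print(choose_encoding([1, 2, 3, 5, 4, 2, 1, 3, 2, 1_000_000, 4, 2]))
```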
- 12.
Timeline
April 2013
Vectorized ORC readers
Read and process columns in
batches of size 1024
Null stream suppression
Suppress PRESENT stream
if no nulls in a stripe
Enables fast path in vectorization
June 2013
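A minimal sketch of batch-at-a-time reading with a null fast path, assuming a batch carries a `no_nulls` flag that is set when the PRESENT stream is absent. `LongColumnBatch`, `read_batches` and `add_scalar` are hypothetical names, not Hive classes.

```python
from dataclasses import dataclass, field

BATCH_SIZE = 1024   # batch size quoted on the slide


@dataclass
class LongColumnBatch:
    values: list
    no_nulls: bool = True
    is_null: list = field(default_factory=list)


def read_batches(column, batch_size=BATCH_SIZE):
    """Yield the column in fixed-size batches; None marks a null value."""
    for start in range(0, len(column), batch_size):
        chunk = column[start:start + batch_size]
        nulls = [v is None for v in chunk]
        if any(nulls):
            yield LongColumnBatch([0 if v is None else v for v in chunk], False, nulls)
        else:
            yield LongColumnBatch(chunk)          # no null mask needed


def add_scalar(batch, scalar):
    """Add a constant to every row of the batch."""
    if batch.no_nulls:
        # Fast path: a tight loop with no per-row null check.
        return [v + scalar for v in batch.values]
    return [None if null else v + scalar
            for v, null in zip(batch.values, batch.is_null)]


batches = list(read_batches(list(range(2500))))
print(len(batches), [len(b.values) for b in batches])
```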
- 13.
Timeline
October 2013
Statistics Interface
Writer – Update statistics during load time
Reader – ANALYZE TABLE .. NOSCAN
Split Elimination
Stripe level column statistics
Eliminate stripes that do not satisfy
predicate conditions
November 2013
- 14.
Timeline
February 2014
Zero copy read path
HDFS caching APIs to read directly into
memory without extra data copies
Serialization Improvements
Bit width alignment (trade-off space
for speed)
Unrolled bit packing and unpacking
Buffered double reader and writer
June 2014
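A minimal sketch of fixed-width bit packing and of rounding a bit width up to an aligned width. The list of aligned widths is assumed to be 1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64; the generic loops below only show the arithmetic, whereas an unrolled implementation would specialize the loop body per width.

```python
ALIGNED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)   # assumed set of widths


def aligned_width(width):
    """Round a bit width up to the next aligned width (spends space to gain speed)."""
    return next(w for w in ALIGNED_WIDTHS if w >= width)


def pack(values, width):
    """Pack non-negative integers into bytes, `width` bits each, most significant bit first."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        if v < 0 or v >> width:
            raise ValueError(f"{v} does not fit in {width} bits")
        acc = (acc << width) | v
        nbits += width
        while nbits >= 8:                 # emit every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1           # keep only the leftover bits
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)   # zero-pad the final byte
    return bytes(out)


def unpack(data, width, count):
    """Inverse of pack: read `count` integers of `width` bits each."""
    acc, nbits, out = 0, 0, []
    stream = iter(data)
    for _ in range(count):
        while nbits < width:
            acc = (acc << 8) | next(stream)
            nbits += 8
        nbits -= width
        out.append((acc >> nbits) & ((1 << width) - 1))
        acc &= (1 << nbits) - 1
    return out


packed = pack([1, 2, 3], 2)
print(packed.hex(), unpack(packed, 2, 3), aligned_width(3), aligned_width(17))
```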
- 15.
Serialization Improvements
Chart: ORC Read Integer Performance (smaller is better). X axis: bit width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64); Y axis: mean time (ms). Series: hive 0.13 unpacking vs hive-1.0 unpacking (new).
- 16.
Serialization Improvements
Chart: ORC Read Double Performance (smaller is better). Mean time (ms) by double read mode:
hive <= 0.13: 241.679
buffered + BE: 171.045
buffered + LE: 174.163
~1.4x improvement
- 17.
Timeline
June 2014
Adaptive compression buffer size
For tables with more than 1000 columns, adjust the compression buffer
size based on available memory
Avoids wide table OOMs
Fast stripe level file merging
Many small files to few large files
No Decompression, No Decoding
ALTER TABLE … CONCATENATE
July 2014
- 18.
Fast File Merging
Chart: ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better). Total time in seconds by CONCAT-supporting file format:
ORC: load time 1091 + merge time 245 = 1336
RCFile: load time 651 + merge time 816 = 1467
~3.33x improvement in merge time
- 19.
Timeline
July 2014
ORC Padding Improvements
Pad bytes to avoid remote HDFS reads
Last stripe is adjusted to fit within HDFS
block boundary (worst case: 5% wastage)
Decouple stripe size vs block size
Smaller stripes (64MB)
More stripes per block (4 per block)
Better parallelism & split elimination
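A rough sketch of the padding decision, assuming a tolerance of 5% of the stripe size (matching the worst-case wastage above). `place_stripe` is a hypothetical helper, and the exact rule used by the ORC writer may differ.

```python
MB = 1024 * 1024


def place_stripe(offset, stripe_size, block_size, tolerance=0.05):
    """Decide how to write the next stripe starting at file offset `offset`.

    Returns (padding_bytes, stripe_bytes): bytes of padding to write first,
    then the size the next stripe should aim for.
    """
    remaining = block_size - (offset % block_size)
    if remaining >= stripe_size:
        return 0, stripe_size              # stripe fits in the current block
    if remaining < stripe_size * tolerance:
        return remaining, stripe_size      # small gap: pad to the block boundary
    return 0, remaining                    # larger gap: shrink the stripe to fit


# 64 MB stripes in a 256 MB block, i.e. 4 stripes per block.
print(place_stripe(192 * MB, 64 * MB, 256 * MB))
print(place_stripe(250 * MB, 64 * MB, 256 * MB))
print(place_stripe(254 * MB, 64 * MB, 256 * MB))
```

Either branch keeps a stripe from straddling two HDFS blocks, which is what would otherwise force a remote read.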
- 20.
Timeline
September 2014
String Dictionary Improvements
Row group level checking
Remember decision across stripes
Avoids expensive RBTree insertions
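A minimal sketch of the dictionary decision: check the ratio of distinct to total values in the first row group, then reuse the decision for later stripes so they skip the dictionary bookkeeping. The 0.8 threshold and the class name are assumptions for illustration.

```python
def use_dictionary(first_row_group, threshold=0.8):
    """Dictionary encoding pays off only if values repeat often enough."""
    if not first_row_group:
        return False
    distinct = len(set(first_row_group))
    return distinct / len(first_row_group) <= threshold


class StringColumnWriter:
    """Hypothetical writer that remembers its encoding decision across stripes."""

    def __init__(self):
        self.decision = None

    def encoding_for_stripe(self, first_row_group):
        if self.decision is None:                  # decide once, on the first stripe
            self.decision = use_dictionary(first_row_group)
        return "DICTIONARY" if self.decision else "DIRECT"


writer = StringColumnWriter()
print(writer.encoding_for_stripe(["a", "b", "a", "a"]))
print(writer.encoding_for_stripe(["w", "x", "y", "z"]))   # decision is reused
```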
- 21.
String Dictionary Improvements
Chart: String Dictionary Improvements - TPC-H 1000 Scale Lineitem (smaller is better). Load time in seconds by Hive version:
hive <= 0.13: 767
hive > 0.13: 540
~1.4x improvement
- 22.
Timeline
September 2014
Improved ZLIB compression
Different streams compressed with
different zlib strategies/levels
Compress integers and doubles
differently
Data and Dictionary stream
- Looks for smaller byte patterns
All other streams
- Less LZ77, More Huffman
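A small sketch of per-stream zlib strategies using Python's `zlib` module. The mapping is an assumption for illustration: DATA and DICTIONARY_DATA streams use the default strategy, and every other stream uses `Z_FILTERED`, which zlib describes as favouring Huffman coding over string matching. The compression level is likewise a placeholder.

```python
import struct
import zlib

# Assumed mapping from stream kind to zlib strategy; anything else gets Z_FILTERED.
STRATEGY_BY_STREAM = {
    "DATA": zlib.Z_DEFAULT_STRATEGY,
    "DICTIONARY_DATA": zlib.Z_DEFAULT_STRATEGY,
}


def compress_stream(kind, data, level=6):
    """Raw-deflate one stream with the strategy chosen for its kind."""
    strategy = STRATEGY_BY_STREAM.get(kind, zlib.Z_FILTERED)
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15, 8, strategy)
    return compressor.compress(data) + compressor.flush()


def decompress_stream(data):
    """Inverse of compress_stream (raw deflate, no zlib header)."""
    return zlib.decompress(data, -15)


text = b"optimized row columnar " * 200
lengths = struct.pack(">1000q", *range(1000))
for kind, payload in (("DATA", text), ("LENGTH", lengths)):
    packed = compress_stream(kind, payload)
    print(kind, len(payload), "->", len(packed))
```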
- 23.
ZLIB Improvements
Chart: Data Size Improvements - TPC-H 1000 Scale Lineitem (smaller is better). Data size in GB by file format + compression codec:
ORC + (old ZLIB): 178.5
ORC + (new ZLIB): 172.2
ORC + SNAPPY: 225.1
New ZLIB: ~4% improvement over old ZLIB, ~1.3x smaller than SNAPPY
- 24.
ZLIB Improvements
Chart: Load Time Improvements - TPC-H 1000 Scale Lineitem (smaller is better). Load time by file format + compression codec (the axis label on the slide reads "Data Size in GBs"):
ORC + (old ZLIB): 674
ORC + (new ZLIB): 433
ORC + SNAPPY: 389
New ZLIB: ~1.6x improvement over old ZLIB, only ~10% slower than SNAPPY
- 25.
Timeline
September 2014
ACID transactions
On the order of millions of rows
Not designed for OLTP requirements
Streaming Ingest via Flume or Storm
Atomically add base and delta directories
Minor compaction – Merge many delta files
Major compaction – Re-write base files to
incorporate delta file changes
Broken pattern: Add Partitions for Atomicity
- 26.
Timeline
January 2015
hasNull flag in ORC internal index
Better pruning of row groups
Improves the performance of
SELECT .. WHERE column IS NULL;
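A minimal sketch of how a per-row-group hasNull flag lets an IS NULL predicate skip row groups. `RowGroupIndexEntry` and `row_groups_for_is_null` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroupIndexEntry:
    minimum: object
    maximum: object
    has_null: bool


def row_groups_for_is_null(index):
    """Only row groups that recorded at least one null can satisfy IS NULL."""
    return [i for i, entry in enumerate(index) if entry.has_null]


index = [
    RowGroupIndexEntry("1992-01-02", "1993-06-30", False),
    RowGroupIndexEntry("1993-07-01", "1995-12-31", True),
    RowGroupIndexEntry("1996-01-01", "1998-12-01", False),
]
to_read = row_groups_for_is_null(index)
print(to_read)
```

Min/max values alone cannot answer this predicate, so without the flag every row group would have to be read.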
- 27.
hasNull in Index Improvement
Chart: select * from lineitem where l_shipdate is null (smaller is better). Execution time in seconds by Hive version:
hive < 1.1.0: 66.73
hive >= 1.1.0: 7.87
Execution time: ~8.5x improvement
Bytes read: 208.77 GB vs 539 MB
- 28.
Timeline
February 2015
Bloom Filter Index
Much better row group pruning when
compared to min/max
Bloom filter evaluated after the
fast Min/Max based elimination
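A minimal sketch of the two-step check: the cheap min/max test runs first, and the Bloom filter is consulted only for row groups that survive it. The Bloom filter here is a toy built on SHA-256 as a stand-in hash; sizes, hash choice and names are assumptions, not the ORC implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        digest = hashlib.sha256(repr(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Derive several bit positions from two base hashes.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def row_group_may_match(minimum, maximum, bloom, value):
    """Evaluate min/max first; fall through to the Bloom filter only if needed."""
    if not (minimum <= value <= maximum):
        return False
    return bloom.might_contain(value)


bloom = BloomFilter()
for key in (10, 20, 30):
    bloom.add(key)
print(row_group_may_match(10, 30, bloom, 20))   # inserted value inside the range
print(row_group_may_match(10, 30, bloom, 99))   # rejected by min/max alone
```

A value that lies inside the min/max range but was never written is where the Bloom filter earns its keep: it rejects most such lookups, which min/max cannot do.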
- 29.
Bloom Filter Indexes Improvements
Chart: rows read (log scale, smaller is better) for select * from tpch_1000.lineitem where l_orderkey = 1212000001;
No indexes: 5,999,989,709
Min-max indexes: 540,000
Bloom filter indexes: 10,000
- 30.
Bloom Filter Indexes Improvements
Chart: time taken in seconds (smaller is better) for select * from tpch_1000.lineitem where l_orderkey=1212000001;
No indexes: 74
Min-max indexes: 4.5
Bloom filter indexes: 1.34
~16x improvement from no indexes to min-max indexes; ~3.3x improvement from min-max to Bloom filter indexes
- 31.
Timeline
April 2015
Split Strategies
BI – Skip reading file footer
ETL – Read and cache file footer
HYBRID – Default. Chooses BI/ETL
based on number of files and
average file size
Group splits based on columnar
projection size instead of file size
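A rough sketch of how a HYBRID strategy could choose between BI and ETL from the number of files and the average file size. The rule and both thresholds are assumptions for illustration; the actual Hive logic and its configuration values are not shown on the slide.

```python
MB = 1024 * 1024


def choose_split_strategy(file_sizes, max_split_size, few_files_threshold):
    """Return "ETL" (read and cache footers) or "BI" (skip footers)."""
    if not file_sizes:
        return "BI"
    average = sum(file_sizes) / len(file_sizes)
    # Large files, or only a handful of files: reading footers is worth the cost.
    if average > max_split_size or len(file_sizes) <= few_files_threshold:
        return "ETL"
    # Many small files: reading every footer would dominate split generation.
    return "BI"


print(choose_split_strategy([1024 * MB] * 3, 256 * MB, 10))
print(choose_split_strategy([1 * MB] * 1000, 256 * MB, 10))
```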
- 32.
Timeline
April 2015
ORC became an Apache Top-Level Project
C++ reader with contributions from
Hortonworks, HP and Microsoft
Column encryption to encrypt
sensitive columns
http://orc.apache.org/
- 33.
ORC: In Production
- 34.
ORC at Facebook
Compression: Saved more than 1,400 servers' worth of storage. (2)
Compression: Compression ratio increased from 5x to 8x globally. (2)
- 35.
ORC at Spotify
IO: 16x less HDFS read when using ORC versus Avro. (3)
CPU: 32x less CPU when using ORC versus Avro. (3)
- 36.
ORC at Yahoo!
Speedup: 6-50x speedup when using ORC versus Text File. (4)
Speedup: 1.6-30x speedup when using ORC versus RCFile. (4)
- 37.
ORC: LLAP and Sub-second
ORC – Pushing for Sub-second
- 38.
ORC: LLAP
JIT Performance for short queries
Row-group level caching
Asynchronous IO Elevator
Multi-threaded Column Vector processing
- 39.
ORC: Vectorization + SIMD
Allocation-free tight inner loops enable the JDK's auto-vectorization
Vectors can be filtered early in ORC
String dictionary can be used to binary-search
Vectorized SIMD Join
Improves performance for single key joins
Example:
Query: select ss_ext_tax + 1.0 from store_sales_orc;
JVM Options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
Note: Make sure to have the HotSpot disassembler in $JAVA_HOME/jre/lib
Generated Assembly:
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
vaddpd: AVX vector addition of packed doubles (4 doubles loaded into 256-bit registers)
- 40.
ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
select * from tpch_1000.lineitem where l_orderkey=1212000001;
- 41.
Questions?
Interested? Stop by the Hortonworks booth to learn more
- 42.
Endnotes
(1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
(2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
(3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
(4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3