Parquet performance tuning:
The missing guide
Ryan Blue
Strata + Hadoop World NY 2016
Contents.
● Big data at Netflix
● Parquet format background
● Optimization basics
● Stats and dictionary filtering
● Format 2 and compression
● Future work
Big data at Netflix.
Big data at Netflix.
40+ PB DW · Read 3 PB · Write 300 TB · 600B events
Strata San Jose results.
[Chart: Strata San Jose results; not preserved in this transcript.]
Metrics dataset.
Based on Atlas, Netflix’s telemetry platform.
● Performance monitoring backend and UI
● http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
Example metrics data.
● Partitioned by day and cluster
● Columns include metric time, name, value, and host
● Measurements for each minute are stored in a Parquet table (schema sketched below)
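For orientation, a minimal Spark SQL sketch of the table follows; the column names and types are illustrative assumptions, not the production schema.

// Hypothetical DDL for the metrics table described above; names and
// types are assumptions for illustration.
sqlContext.sql("""
  CREATE TABLE metrics (
    time  BIGINT,  -- measurement time, epoch milliseconds
    name  STRING,  -- metric name, e.g. 'system.cpu.utilization'
    value DOUBLE,  -- measured value
    host  STRING   -- reporting host
  )
  PARTITIONED BY (day INT, cluster STRING)
  STORED AS PARQUET
""")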
Parquet format background.
Parquet data layout.
ROW GROUPS.
● Data needed for a group of rows to be reassembled
● Smallest task or input split size
● Made of COLUMN CHUNKS
COLUMN CHUNKS.
● Contiguous data for a single column
● Made of DATA PAGES and an optional DICTIONARY PAGE
DATA PAGES.
● Encoded and compressed runs of values
Row groups.
[Diagram: a Parquet file split into row groups; each row group holds rows a1..aN across columns A, B, C, and D, and is sized to fit an HDFS block.]
Column chunks and pages.
[Diagram: within a row group, each column's values form a contiguous column chunk made of data pages, with an optional dictionary page at the front.]
Read less data.
Columnar organization.
● Encoding: make the data smaller
● Column projection: read only the columns you need
Row group filtering.
● Use footer stats to eliminate row groups
● Use dictionary pages to eliminate row groups
Page filtering.
● Use page stats to eliminate pages
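Putting these together: a hedged Spark sketch of a scan that benefits from all three, assuming push-down is enabled as configured in the next section.

// Projection limits the column chunks read; with push-down enabled,
// stats and dictionaries let Parquet skip row groups entirely.
val lowCpu = sqlContext
  .table("metrics")
  .select("name", "value")
  .filter("name = 'system.cpu.utilization'")
  .filter("value < 0.8")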
Basics.
Setup.
Parquet writes:
● Version 1.8.1 or later – includes fix for incorrect statistics, PARQUET-251
● 1.9.0 due in October
Reads:
● Presto: Used 0.139
● Spark: Used version 1.6.1 reading from Hive
● Pig: Used parquet-pig 1.9.0 for predicate push-down
Pig configuration.
-- enable pushdown/filtering
set parquet.pig.predicate.pushdown.enable true;
-- enables stats and dictionary filtering
set parquet.filter.statistics.enabled true;
set parquet.filter.dictionary.enabled true;
Spark configuration.
// turn on Parquet push-down, stats filtering, and dictionary filtering
sqlContext.setConf("parquet.filter.statistics.enabled", "true")
sqlContext.setConf("parquet.filter.dictionary.enabled", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
// use the non-Hive read path
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// turn off schema merging, which turns off push-down
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema",
"false")
Writing the data.
Spark:
sqlContext
.table("raw_metrics")
.write.insertInto("metrics")
Pig:
metricsData = LOAD 'raw_metrics'
USING SomeLoader;
STORE metricsData INTO 'metrics'
USING ParquetStorer;
Both jobs fail: OutOfMemoryError or ParquetRuntimeException.
Writing too many files.
Data doesn’t match partitioning.
● Tasks write a file per partition
Symptoms:
● OutOfMemoryError
● ParquetRuntimeException: New Memory allocation 1047284 bytes is smaller than the
minimum allocation size of 1048576 bytes.
● Successfully write lots of small files, slow split planning
[Diagram: each task writes a file into every partition directory it touches (Task 1 into part=1/ and part=2/, Task 2 into part=3/ and part=4/, and so on), multiplying open files.]
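The memory symptoms follow from the writer's buffering: each open file buffers up to one row group in memory, so a task writing into many partitions multiplies that cost. A rough, illustrative model:

// Back-of-the-envelope write-side memory; numbers are illustrative.
val rowGroupSize   = 128L * 1024 * 1024            // parquet.block.size default (128 MB)
val openPartitions = 50                            // partitions a single task writes into
val bufferedBytes  = rowGroupSize * openPartitions // ~6.4 GB buffered by one task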
Account for partitioning.
Spark.
sqlContext
.table("raw_metrics")
.sort("day", "cluster")
.write.insertInto("metrics")
Pig.
metrics = LOAD 'raw_metrics'
USING SomeLoader;
metricsSorted = ORDER metrics
BY day, cluster;
STORE metricsSorted INTO 'metrics'
USING ParquetStorer;
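A global sort shuffles all rows to build a total order. When grouping rows by partition is enough, a hedged alternative in Spark 1.6 is to repartition by the partition columns and sort within each task:

// Sketch: cluster rows by partition key so each task writes into few
// partitions, without paying for a total ordering (Spark 1.6 API).
import sqlContext.implicits._
sqlContext
  .table("raw_metrics")
  .repartition($"day", $"cluster")
  .sortWithinPartitions("day", "cluster")
  .write.insertInto("metrics")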
Filter to select partitions.
Spark.
val partition = sqlContext
.table("metrics")
.filter("day = 20160929")
.filter("cluster = 'emr_adhoc'")
Pig.
metricsData = LOAD 'metrics'
USING ParquetLoader;
partition = FILTER metricsData BY
  day == 20160929 AND
  cluster == 'emr_adhoc';
Stats filters.
Sample query.
Spark.
val low_cpu_count = partition
  .filter("name = 'system.cpu.utilization'")
  .filter("value < 0.8")
  .count()
Pig.
low_cpu = FILTER partition BY
name == 'system.cpu.utilization' AND
value < 0.8;
low_cpu_count = FOREACH
(GROUP low_cpu ALL) GENERATE
COUNT(name);
My job was 5 minutes faster!
Did it work?
● Success metrics: S3 bytes read, CPU time spent
S3N: Number of bytes read: 1,366,228,942,336
CPU time spent (ms): 280,218,780
● Filter didn’t work. Bytes read shows the entire partition was read.
● What happened?
Inspect the file.
● Stats show what happened:
Row group 0: count: 84756 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 84756 61.52 B 0 "A..." / "z..."
...
Row group 1: count: 85579 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 85579 61.52 B 0 "A..." / "z..."
● Every row group matched the query
Add query columns to the sort.
Spark.
sqlContext
.table("raw_metrics")
.sort("day", "cluster", "name")
.write.insertInto("metrics")
Pig.
metrics = LOAD 'raw_metrics'
USING SomeLoader;
metricsSorted = ORDER metrics
BY day, cluster, name;
STORE metricsSorted INTO 'metrics'
USING ParquetStorer;
Inspect the file, again.
● Stats are fixed:
Row group 0: count: 84756 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 84756 61.52 B 0 "A..." / "F..."
...
Row group 1: count: 85579 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 85579 61.52 B 0 "F..." / "N..."
...
Row group 2: count: 86712 845.42 B records
type encodings count avg size nulls min / max
name BINARY G _ 86712 61.52 B 0 "N..." / "b..."
Dictionary filters.
Dictionary filtering.
Dictionary is a compact list of all the values.
● Search term missing? Skip the row group
● Like a bloom filter without false positives
When dictionary filtering helps:
● When a column is sorted in each file, not globally sorted – one row group matches
● When filtering an unsorted column
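Conceptually the check is simple set membership. A minimal sketch of the decision, not the parquet-mr API:

// Illustrative only, not the parquet-mr API: a row group can be skipped
// when the column is fully dictionary-encoded and the dictionary (the
// complete set of values present) lacks the search term.
def canSkipRowGroup(dictionary: Option[Set[String]], term: String): Boolean =
  dictionary match {
    case Some(values) => !values.contains(term) // term absent: nothing can match
    case None         => false                  // no complete dictionary: must read
  }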
Dictionary filtering overhead.
Read overhead.
● Extra seeks
● Extra page reads
Not a problem in practice.
● Reading both dictionary and row group resulted in < 1% penalty
● Stats filtering prevents unnecessary dictionary reads
Works out of the box, right?
Nope.
● Only works when columns are completely dictionary-encoded
● Plain-encoded pages can contain any value, dictionary is no help
● All pages in a chunk must use the dictionary
Dictionary fallback rules:
● If dictionary + references > plain encoding, fall back
● If dictionary size is too large, fall back (default threshold: 1 MB)
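A sketch of those rules as a writer might evaluate them; the function is illustrative, and only the 1 MB threshold comes from the default above.

// Illustrative sketch of the dictionary fallback rules listed above.
val maxDictSize = 1L * 1024 * 1024 // default parquet.dictionary.page.size (1 MB)

def fallBackToPlain(dictBytes: Long, referenceBytes: Long, plainBytes: Long): Boolean =
  dictBytes > maxDictSize ||              // dictionary outgrew the threshold
  dictBytes + referenceBytes > plainBytes // dictionary encoding stopped paying off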
Fallback to plain encoding.
parquet-tools dump -d
utc_timestamp_ms TV=142990 RL=0 DL=1 DS: 833491 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED V:RLE SZ:72912
page 1: DLE:RLE RLE:BIT_PACKED V:RLE SZ:135022
page 2: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:1048607
page 3: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:1048607
page 4: DLE:RLE RLE:BIT_PACKED V:PLAIN SZ:714941
What’s happening:
● Values repeat, but change over time
● Dictionary gets too large, falls back to plain encoding
● Dictionary encoding is a size win!
Avoid encoding fallback.
Increase max dictionary size.
● 2-3 MB usually worked
● parquet.dictionary.page.size
Decrease row group size.
● 24, 32, or 64 MB
● parquet.block.size
● New dictionary for each row group
● Also lowers memory consumption!
Run several tests to find the right configuration (per table).
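In Spark, these parquet-mr properties can be set on the job's Hadoop configuration before writing; the values below are starting points to test, not recommendations.

// Sketch: tune writer properties through the Hadoop configuration.
sc.hadoopConfiguration.setInt("parquet.dictionary.page.size", 3 * 1024 * 1024) // 3 MB dictionaries
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)          // 32 MB row groups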
Row group size.
Other reasons to decrease row group size:
● Reduce memory consumption – but not to avoid write-side OOM
● Increase number of tasks / parallelism
Results!
Results (from Pig).
CPU and wall time dropped.
● Initial: CPU Time: 280,218,780 ms Wall Time: 15m 27s
● Filtered: CPU Time: 120,275,590 ms Wall Time: 9m 51s
● Final: CPU Time: 9,593,700 ms Wall Time: 6m 47s
Bytes read is much better.
● Initial: S3 bytes read: 1,366,228,942,336 (1.24 TB)
● Filtered: S3 bytes read: 49,195,996,736 (45.82 GB)
Filtered vs. final time.
Row group filtering is parallel.
● Split planning is independent of stats (otherwise it becomes a bottleneck)
● Lots of very small tasks: read footer, read dictionary, stop processing
Combine splits in Pig/MR for better time.
● 1 GB splits tend to work well
Other work.
Format version 2.
What’s included:
● New encodings: delta-integer, prefix-binary
● New page format to enable page-level filtering
New encodings didn’t help with Netflix data.
● Delta-integer didn’t help significantly, even with timestamps (high overhead?)
● Not large enough prefixes in URL and JSON data
Page filtering isn’t implemented (yet).
Brotli compression.
● New compression library, from Google
● Based on LZ77, with compatible license
Faster compression, smaller files, or both.
● brotli-5: 19.7% smaller, 2.7% slower – 1 day of data from Kafka
● brotli-4: 14.8% smaller, 12.5% faster – 1 hour, 4 largest Parquet tables
● brotli-1: 8.1% smaller, 28.3% faster – JSON-heavy dataset
Brotli compression (continued).
[Chart: additional Brotli results; not preserved in this transcript.]
Future work.
Future work.
Short term:
● Release Parquet 1.9.0
● Test Zstd compression
● Convert embedded JSON to Avro – good preliminary results
Long-term:
● New encodings: Zig-zag RLE, patching, and floating point decomposition
● Page-level filtering
Thank you!
Questions?
https://jobs.netflix.com/
rblue@netflix.com