Parquet
Columnar storage for the people
Strata/Hadoop World, New York 2013
Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter
Nong Li nong@cloudera.com Software engineer, Cloudera Impala
http://parquet.io
Outline
• Context from various companies
• Results in production and benchmarks
• Format deep-dive
Twitter Context
• Twitter's data, at huge scale:
  • 230M+ monthly active users generating and consuming 500M+ tweets a day
  • 100TB+ a day of compressed data
  • Instrumentation, user graph, derived data, ...
• Analytics infrastructure:
  • Several 1K+ node Hadoop clusters
  • Log collection pipeline
  • Processing tools

(Slide artwork: "The Parquet Planers", Gustave Caillebotte)
Twitter's use case
• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.

struct LogEvent {
  1: optional logbase.LogBase log_base
  2: optional i64 event_value
  3: optional string context
  4: optional string referring_event
  ...
  18: optional EventNamespace event_namespace
  19: optional list<Item> items
  20: optional map<AssociationType,Association> associations
  21: optional MobileDetails mobile_details
  22: optional WidgetDetails widget_details
  23: optional map<ExternalService,string> external_ids
}

struct LogBase {
  1: string transaction_id,
  2: string ip_address,
  ...
  15: optional string country,
  16: optional string pid,
}
Goal
To have state-of-the-art columnar storage available across the Hadoop platform.
• Hadoop is very reliable for big, long-running queries, but it is also IO heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any framework in particular.
Columnar Storage
• Limits IO to the data actually needed: loads only the columns that need to be accessed.
• Saves space:
  • Columnar layout compresses better
  • Type-specific encodings
• Enables vectorized execution engines.

(Image: @EmrgencyKittens)
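To make the IO point concrete, here is a minimal illustrative sketch (plain Java, not Parquet code): with a columnar layout, an aggregation touches only the bytes of the column it needs, while a row layout drags every field of every record through memory.

final class ColumnarVsRows {
  record Row(long id, double price, String name) {}

  // Row-oriented: every Row object is visited even though only price is needed.
  static double sumPricesRows(Row[] rows) {
    double sum = 0;
    for (Row r : rows) sum += r.price();
    return sum;
  }

  // Column-oriented: only the price column's contiguous array is read;
  // ids and names are never touched.
  static double sumPricesColumnar(double[] priceColumn) {
    double sum = 0;
    for (double p : priceColumn) sum += p;
    return sum;
  }
}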
Collaboration between Twitter and Cloudera:
• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for Map/Reduce:
  • https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala:
  • https://github.com/cloudera/impala
Results in Impala

(Chart: size in GB of the TPC-H lineitem table @ 1TB scale factor, stored as Text, Seq w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy.)

(Chart: Impala query times on TPC-DS in wall-clock seconds, for queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79 and Q96.)
Criteo: The Context
• Billions of new events per day
• ~60 columns per log
• Heavy analytic workload
• BI analysts using Hive and RCFile
• Frequent schema modifications
• Perfect use case for Parquet + Hive!
Parquet + Hive: Basic Requirements
• MapReduce compatibility, due to Hive.
• Correctly handle evolving schemas across Parquet files.
• Read only the columns used by the query, to minimize data read.
• Interoperability with other execution engines (e.g. Pig, Impala, etc.)
Performance of Hive 0.11 with Parquet vs. ORC
Size relative to text:
  orc-snappy: 35%
  parquet-snappy: 33%

Setup: TPC-DS at scale factor 100; all jobs calibrated to run ~50 mappers; nodes: 2 x 6 cores, 96 GB RAM, 14 x 3TB disks.

(Chart: total CPU seconds for orc-snappy vs. parquet-snappy on queries q7, q8, q19, q34, q42, q43, q46, q52, q55, q59, q63, q65, q68, q73, q79, q89 and q98.)
Twitter: production results
Data converted: similar to access logs, 30 columns.
Original format: Thrift binary in block-compressed files (LZO).
New format: Parquet (LZO).

(Charts: space and scan time of Parquet relative to Thrift, for one column and for all 30 columns.)

• Space saving: 30% using the same compression algorithm.
• Scan + assembly time compared to the original:
  • One column: 10%
  • All columns: 110%
Production savings at Twitter
• Petabytes of storage saved.
• Example jobs taking advantage of projection push down:
  • Job 1 (Pig): reading 32% less data => 20% task time saving.
  • Job 2 (Scalding): reading 14 out of 35 columns, i.e. 80% less data => 66% task time saving.
• Terabytes of scanning saved every day.
Format
(Diagram: a row group contains one column chunk per column (a, b, c); each column chunk is divided into pages: Page 0, Page 1, ...)

• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50MB < row group < 1GB.
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8KB < page < 1MB.
Format
• Layout: row groups in columnar format. A footer contains the column chunk offsets and the schema.
• Language independent: well-defined format, supported by Hadoop and Cloudera Impala.
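As a sketch, the on-disk layout this describes looks like the following (the exact footer fields live in the format's Thrift definition):

  4-byte magic "PAR1"
  Row group 0: column chunk a (pages), column chunk b (pages), ...
  Row group 1: ...
  Footer: file metadata (schema, row group and column chunk offsets)
  4-byte length of the footer metadata
  4-byte magic "PAR1"

Because the footer sits at the end, a writer can stream row groups out without seeking back, and a reader finds the schema and chunk offsets by reading the last few bytes first.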
Nested record shredding/assembly
• Algorithm borrowed from the column IO of Google Dremel.
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bounded by the depth of the schema, so they are stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Max levels per column:
Column          Max rep. level  Max def. level
DocId           0               0
Links.Backward  1               2
Links.Forward   1               2

Record:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80

Resulting columns (R = repetition level, D = definition level):
Column          Value  R  D
DocId           20     0  0
Links.Backward  10     0  2
Links.Backward  30     1  2
Links.Forward   80     0  2
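The max levels in that table fall out of the column's path through the schema. A minimal sketch of the rule (illustrative, not parquet-mr code): each repeated field on the path adds one to the max repetition level, and each optional or repeated field adds one to the max definition level.

final class MaxLevels {
  enum Kind { REQUIRED, OPTIONAL, REPEATED }

  // path: the kinds of the fields from the root down to the leaf column.
  static int[] maxLevels(Kind[] path) {
    int rep = 0, def = 0;
    for (Kind k : path) {
      if (k == Kind.REPEATED) rep++;   // repetition: repeated fields only
      if (k != Kind.REQUIRED) def++;   // definition: optional or repeated
    }
    return new int[] { rep, def };
  }
  // Links.Backward = { OPTIONAL, REPEATED } -> rep 1, def 2.
  // DocId = { REQUIRED } -> rep 0, def 0. Matches the table above.
}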
Repetition level

Schema:
message nestedLists {
  repeated group level1 {
    repeated string level2;
  }
}

Records:
[[a, b, c], [d, e, f, g]]
[[h], [i, j]]

Repetition level (R) per value:
level2: a  R=0  new record
level2: b  R=2  new level2 entry
level2: c  R=2  new level2 entry
level2: d  R=1  new level1 entry
level2: e  R=2  new level2 entry
level2: f  R=2  new level2 entry
level2: g  R=2  new level2 entry
level2: h  R=0  new record
level2: i  R=1  new level1 entry
level2: j  R=2  new level2 entry

Columns:
Levels: 0,2,2,1,2,2,2,0,1,2
Data:   a,b,c,d,e,f,g,h,i,j

More details: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
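A minimal sketch of how those levels fall out of a depth-2 traversal (illustrative, not parquet-mr code; it assumes non-empty lists, so definition levels are left out):

import java.util.List;

final class RepetitionLevels {
  // Shred records of schema { repeated level1 { repeated string level2 } }
  // into parallel lists of repetition levels and values.
  static void shred(List<List<List<String>>> records,
                    List<Integer> levels, List<String> data) {
    for (List<List<String>> record : records) {
      int r = 0;               // the first value of a record starts at level 0
      for (List<String> level1 : record) {
        for (String value : level1) {
          levels.add(r);
          data.add(value);
          r = 2;               // the next value repeats the innermost list
        }
        r = 1;                 // the next level1 entry repeats at depth 1
      }
    }
  }
  // For [[a,b,c],[d,e,f,g]] and [[h],[i,j]] this produces
  // levels 0,2,2,1,2,2,2,0,1,2 and data a..j, matching the columns above.
}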
Differences between Parquet and ORC nesting support
• Parquet:
  Repetition/definition levels capture the structure
  => one column per leaf in the schema.
  Array<int> is one column.
  Nullity/repetition of an inner node is stored in each of its children
  => one column independently of nesting, with some redundancy.
• ORC:
  An extra column for each Map or List records its size
  => one column per node in the schema.
  Array<int> is two columns: array size and content
  => an extra column per nesting level.

(Diagram: the Document schema tree: DocId, and Links with Backward and Forward.)
Reading assembled records
• Record-level API to integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles a projection for any subset of the columns: only those are loaded from disk.

(Diagrams: columns a and b reassembled into rows a1/b1, a2/b2, a3/b3; the Document record projected onto DocId only, Links.Backward only, Links.Forward only, and combinations.)
Projection push down
• Automated in Pig and Hive: based on the query being executed, only the columns for the fields accessed will be fetched.
• Explicit in MapReduce, Scalding and Cascading using a globbing syntax (wired up as in the sketch below).

Example: field1;field2/**;field4/{subfield1,subfield2}
Will return:
  field1
  all the columns under field2
  subfield1 and subfield2 under field4
  but not field3
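A hedged sketch of how the explicit form is passed to a job. The configuration key "parquet.thrift.column.filter" is the one parquet-mr's Thrift read support used for this glob syntax around this release; treat the exact key as an assumption and verify it against your parquet-mr version.

import org.apache.hadoop.conf.Configuration;

final class ProjectionConfig {
  static Configuration withProjection() {
    Configuration conf = new Configuration();
    // Assumed key; fetch only field1, field2's subtree, and two subfields of field4.
    conf.set("parquet.thrift.column.filter",
             "field1;field2/**;field4/{subfield1,subfield2}");
    return conf;
  }
}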
Reading columns
To implement a column-based execution engine:
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Encoded or decoded values: computing aggregations on integers is faster than on strings.

(Diagram: a column read as a stream of (R, D, V) triplets across rows 0 to 3, with R=1 => same row and D<1 => null.)
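A minimal sketch of consuming such a triplet stream (illustrative; the TripletReader interface below is hypothetical, not a parquet-mr API):

import java.util.ArrayList;
import java.util.List;

final class TripletScan {
  // Hypothetical triplet source for one column: repetition levels,
  // definition levels and values, Dremel-style.
  interface TripletReader {
    boolean next();           // advance to the next triplet; false at end
    int repetitionLevel();    // 0 starts a new record
    int definitionLevel();    // < max definition level means null
    long value();             // only valid when the cell is defined
  }

  // Group one flat column back into per-record lists of nullable values.
  static List<List<Long>> assemble(TripletReader r, int maxDefinitionLevel) {
    List<List<Long>> records = new ArrayList<>();
    while (r.next()) {
      if (r.repetitionLevel() == 0) {
        records.add(new ArrayList<>());      // R=0: start a new record
      }
      List<Long> current = records.get(records.size() - 1);
      current.add(r.definitionLevel() < maxDefinitionLevel ? null : r.value());
    }
    return records;
  }
}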
Integration APIs
• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however, Impala, Pig, Hive, Thrift, Avro and ProtocolBuffers do.
  • Event-based, SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive
  • Spark
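To make "event-based, SAX-style" concrete, a hedged sketch of the idea (the interfaces are hypothetical; parquet-mr's real converter API differs in detail): the reader pushes primitive values straight into a consumer that builds the engine's own row type, so no intermediate generic record is created.

import java.util.ArrayList;
import java.util.List;

// Hypothetical event-based materialization interface.
interface RecordConsumer {
  void startRecord();
  void addLong(int fieldIndex, long value);
  void addString(int fieldIndex, String value);
  void endRecord();
}

// Example consumer that materializes directly into flat tuples,
// the way a Pig or Cascading binding would build its own row type.
final class TupleConsumer implements RecordConsumer {
  private final int width;
  private Object[] current;
  final List<Object[]> tuples = new ArrayList<>();

  TupleConsumer(int width) { this.width = width; }

  public void startRecord()              { current = new Object[width]; }
  public void addLong(int i, long v)     { current[i] = v; }
  public void addString(int i, String v) { current[i] = v; }
  public void endRecord()                { tuples.add(current); }
}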
Encodings
• Bit packing:
  • Small integers encoded in the minimum bits required.
  • Example: the 2-bit values 1,3,2,0,0,2,2,0 pack into two bytes: 01|11|10|00 00|10|10|00.
  • Useful for repetition levels, definition levels and dictionary keys.
• Run-length encoding:
  • Example: a run of eight 1s is stored as the pair (8, 1).
  • Cheap compression.
  • Used in combination with bit packing.
  • Works well for the definition levels of sparse columns.
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values.
  • When applicable, compresses better and faster than heavyweight algorithms (gzip, lzo, snappy).
• Extensible: defining new encodings is supported by the format.
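A minimal sketch of the bit-packing example (illustrative; Parquet's actual RLE/bit-packing hybrid encoding adds run headers and differs in details such as bit order within groups):

final class BitPacking {
  // Pack 2-bit values four per byte, first value in the most significant bits,
  // matching the example above: 1,3,2,0,0,2,2,0 -> 01|11|10|00 00|10|10|00.
  static byte[] pack2bit(int[] values) {
    byte[] out = new byte[(values.length + 3) / 4];
    for (int i = 0; i < values.length; i++) {
      out[i / 4] |= (values[i] & 0b11) << (6 - 2 * (i % 4));
    }
    return out;
  }
}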
Parquet 2.0
• More encodings, for compact storage without heavyweight compression:
  • Delta encodings for integers, strings and sorted dictionaries.
  • Improved encodings for strings and booleans.
• Statistics, to be used by query planners and for predicate pushdown.
• New page format, to facilitate skipping ahead at a more granular level.
Main contributors
Julien Le Dem (Twitter): Format, Core, Pig, Thrift integration, Encodings
Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): Format, Impala
Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): Encodings, projection push down
Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
Dmitriy Ryaboy (Twitter): Format, Thrift and Scrooge Cascading integration
Tom White (Cloudera): Avro integration
Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
Matt Massie (Berkeley AMP lab): predicate and projection push down
David Chen (LinkedIn): Avro integration improvements

How to contribute
Questions? Ideas? Want to contribute?
Contribute at: github.com/Parquet
Come talk to us: Cloudera, Criteo, Twitter.
http://parquet.io

