1. Parquet overview
Julien Le Dem
Twitter
http://parquet.github.com
2. Format
Schema definition: for the binary representation.
Layout: currently PAX; supports one file per column when Hadoop allows a block placement policy.
Not Java-centric: encodings, compression codecs, etc. are enums, not Java class names, i.e. formally defined. Impala reads Parquet files.
Footer: contains the column chunk offsets.
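For illustration, a minimal sketch of how a reader could locate the footer, assuming the standard Parquet file layout (data, then the serialized footer metadata, then a 4-byte little-endian footer length, then the trailing "PAR1" magic); FooterLocator is a hypothetical name, not a parquet-mr class:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    class FooterLocator {
        // Returns the byte offset where the footer metadata starts.
        static long footerStart(RandomAccessFile file) throws IOException {
            long len = file.length();
            byte[] tail = new byte[8];      // 4-byte footer length + 4-byte "PAR1" magic
            file.seek(len - 8);
            file.readFully(tail);
            int footerLength = ByteBuffer.wrap(tail, 0, 4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .getInt();
            // The footer holds the offset of every column chunk, so a reader
            // can then seek directly to just the columns it needs.
            return len - 8 - footerLength;
        }
    }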
3. Format
• Row group: A group of rows in columnar format.
• Max size buffered in memory while writing (see the sketch after this list).
• One (or more) per split while reading.
• Roughly: 10 MB < row group < 1 GB.
• Column chunk: The data for one column in a row group.
• Column chunks can be read independently for efficient scans.
• Page: Unit of compression in a column chunk.
• Should be big enough for compression to be efficient.
• Minimum size to read to access a single record (when index pages are available).
• Roughly: 8 KB < page < 100 KB.
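As a rough illustration of the buffering rule above, a writer-side buffer could flush a row group once the in-memory size crosses a target; RowGroupBuffer and its flush behavior are hypothetical, not the actual parquet-mr writer:

    import java.util.ArrayList;
    import java.util.List;

    class RowGroupBuffer {
        // Illustrative target, chosen inside the 10 MB .. 1 GB range above.
        private static final long TARGET_ROW_GROUP_BYTES = 256L * 1024 * 1024;

        private final List<byte[]> bufferedRecords = new ArrayList<>();
        private long bufferedBytes = 0;

        void add(byte[] encodedRecord) {
            bufferedRecords.add(encodedRecord);
            bufferedBytes += encodedRecord.length;
            if (bufferedBytes >= TARGET_ROW_GROUP_BYTES) {
                flushRowGroup();
            }
        }

        private void flushRowGroup() {
            // A real writer would pivot the buffered records into column
            // chunks, split each chunk into pages, and append the row group
            // to the file; here we only report and reset the buffer.
            System.out.printf("flushing row group: %d records, %d bytes%n",
                    bufferedRecords.size(), bufferedBytes);
            bufferedRecords.clear();
            bufferedBytes = 0;
        }
    }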
4. Dremel’s shredding/assembly
Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}
Columns:
DocId
Links.Backward
Links.Forward
Name.Language.Code
Name.Language.Country
Name.Url
Reference: http://research.google.com/pubs/pub36632.html
• Each cell is encoded as a triplet: repetition level, definition level, value.
• This allows reconstructing the nested records.
• Level values are bounded by the depth of the schema: the max repetition level of a column is the number of repeated fields on its path, and the max definition level is the number of optional or repeated fields on its path, so levels can be stored in a compact form.
Example:
                        Max repetition level   Max definition level
DocId                   0                      0
Links.Backward          1                      2
Links.Forward           1                      2
Name.Language.Code      2                      2
Name.Language.Country   2                      3
Name.Url                1                      2
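To make the triplets concrete, here is the Name.Language.Code column for the example document r1 from the Dremel paper, encoded as (repetition level, definition level, value); the Triplet class is only for illustration:

    class TripletDemo {
        record Triplet(int repetitionLevel, int definitionLevel, String value) {}

        public static void main(String[] args) {
            // Name.Language.Code: max repetition level 2, max definition
            // level 2, per the table above.
            Triplet[] column = {
                new Triplet(0, 2, "en-us"), // r=0: starts a new record
                new Triplet(2, 2, "en"),    // r=2: another Language in the same Name
                new Triplet(1, 1, null),    // d=1 < max(2): this Name has no Language
                new Triplet(1, 2, "en-gb"), // r=1: another Name in the same record
            };
            for (Triplet t : column) {
                if (t.repetitionLevel() == 0) {
                    System.out.println("-- new record --");
                }
                System.out.printf("r=%d d=%d value=%s%n",
                        t.repetitionLevel(), t.definitionLevel(), t.value());
            }
        }
    }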
5. Abstractions
• Column layer:
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• When dictionary encoding and other compact encodings are implemented, iteration can be over encoded or un-encoded values (see the sketch after this list).
• Record layer:
• Iteration on fully assembled records.
• Provides assembled records for any subset of the columns, so that only the columns actually accessed are loaded.
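A hypothetical shape for the column-layer iteration described above; the names are illustrative, not the actual parquet-column API. It also shows how repetition level 0 marks record boundaries:

    interface TripletIterator {
        boolean hasNext();
        void next();            // advance to the next triplet
        int repetitionLevel();  // 0 means the triplet starts a new record
        int definitionLevel();  // below the max definition level means null
        long longValue();       // the decoded value; once dictionary and other
                                // compact encodings land, a variant of this
                                // could return the encoded value instead
    }

    class RecordCounter {
        // Counting records in a column chunk needs only repetition levels.
        static long countRecords(TripletIterator it) {
            long records = 0;
            while (it.hasNext()) {
                it.next();
                if (it.repetitionLevel() == 0) {
                    records++; // record boundary
                }
            }
            return records;
        }
    }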
6. Extensibility
• Schema conversion:
• Hadoop does not have a notion of schema.
• However Pig, Hive, Thrift, Avro, ProtoBufs, etc. do.
• Record materialization:
• Pluggable record materialization layer.
• No double conversion.
• SAX-style, event-based API.
• Encodings:
• Extensible encoding definitions.
• Planned: dictionary encoding, zigzag, RLE, ...
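As a sketch of the SAX-style materialization idea: the shredded columns drive events, and a pluggable consumer builds records directly in the target model (Pig tuple, Hive row, Thrift object, ...) with no intermediate representation. The RecordEventConsumer interface below is illustrative, not the exact parquet-mr API:

    interface RecordEventConsumer {
        void startRecord();
        void startField(String name, int index);
        void addLong(long value);
        void addString(String value);
        void endField(String name, int index);
        void endRecord();
    }

    class MaterializationExample {
        // Materializing the record { DocId: 10 } fires this event sequence;
        // each binding builds its own object directly from the events,
        // which is what avoids double conversion.
        static void emitDoc(RecordEventConsumer consumer) {
            consumer.startRecord();
            consumer.startField("DocId", 0);
            consumer.addLong(10L);
            consumer.endField("DocId", 0);
            consumer.endRecord();
        }
    }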