SlideShare une entreprise Scribd logo
1  sur  12
Télécharger pour lire hors ligne
© 2015 Dremio Corporation
If you have your own Columnar format,
stop now and use Parquet
😛
Julien Le Dem
Principal Architect, Dremio
VP Apache Parquet
© 2015 Dremio Corporation
About Dremio
Jacques Nadeau
Founder & CTO
‱Apache Drill PMC Chair
‱Recognized SQL & NoSQL expert
‱Quigo (AOL); Offermatica (ADBE);
aQuantive (MSFT)
Tomer Shiran
Founder & CEO
‱Apache Drill Founder
‱MapR (VP Product); Microsoft; IBM
Research
‱Carnegie Mellon, Technion
Julien Le Dem
Architect
‱Apache Parquet Founder
‱Apache Pig PMC Member
‱Twitter (Lead, Analytics Data Pipeline);
Yahoo! (Architect)
Top Silicon Valley VCs‱ Enabling self-service data discovery, exploration and
analysis on modern data
‱ Founded in June 2015
‱ Building on open source technologies including Drill,
Parquet, Spark
© 2015 Dremio Corporation
Background of Parquet
‱ Twitter’s data
‱ Lots of data: Instrumentation, User graph, Derived data, ...
‱ Complex: deeply nested structures
‱ Analytics infrastructure:
‱ Several 1000s nodes Hadoop clusters
‱ Log collection to HDFS in Thrift
‱ Parquet
‱ Columnar: space and query efficient
‱ Inspired from the Google Dremel Paper
‱ supports complex data
‱ interoperable
Caillebotte: The Parquet Planers
© 2015 Dremio Corporation
Parquet timeline
‱ Fall 2012: Twitter & Cloudera’s Impala team
merge efforts to develop columnar formats.
‱ March 2013: OSS announcement; Criteo
signs on for Hive integration.
‱ July 2013: 1.0 release. 18 contributors from
more than 5 organizations.
‱ August 2013: Drill chose Parquet as its
primary storage format.
‱ May 2014: Apache Incubator. 40+
contributors, 18 with 1000+ LOC. 26
incremental releases.
‱ Apr 2015: Parquet graduates from the
Apache Incubator.
© 2015 Dremio Corporation
What does Parquet do?
Interoperability
Space efïŹciency
Query efïŹciency
@EmrgencyKittens
© 2015 Dremio Corporation
Columnar storage
Logical table
representation
Row layout
Column layout
Nested schema
a b c
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5
a1 b1 c1a2 b2 c2a3 b3 c3a4 b4 c4a5 b5 c5
encoded chunk encoded chunk encoded chunk
On Disk:
Encodings: Dictionary, RLE, Delta, PreïŹx
© 2015 Dremio Corporation
Nested representation
Document
DocId Links Name
Backward Forward Language Url
Code Country
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Schema:
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
© 2015 Dremio Corporation
Statistics
Vertical partitioning
(projection push down)
Horizontal partitioning
(predicate push down)
Read only the data
you need!
+ =
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
+ =
© 2015 Dremio Corporation
Interoperability
Library level integration Query engine integration
Avro Thrift
Protocol
Buffer
Pig Tuple Hive SerDe
Assembly/striping (model agnostic)
Parquet ïŹle format (language agnostic)
Object model
parquet-avroConverters parquet-thrift parquet-proto parquet-pig parquet-hive
Column encodings
Impala
...
...
Encodings (C)
PrestoDrill 

© 2015 Dremio Corporation
Query engines, frameworks and libraries
integrated with Parquet (non exhaustive)
Query engines:
Hive, Impala, HAWQ,
IBM Big SQL, Drill, Tajo,
Pig, Presto, SparkSQL
Frameworks:
Spark, MapReduce, Cascading,
Crunch, Scalding, Kite
Data Models:
Avro, Thrift, ProtocolBuffers,
POJOs
© 2015 Dremio Corporation
Loose coupling
‱ Users don’t want to load
their data into every tool.
‱ Many tools are available
and show up every day.
‱ The cost of trying a new
tool should be minimal
Storage (HDFS/S3/
)
Interactive
queries
(Drill, Impala,
Presto, 
)
automated
dashboard
machine
learning
Query-efïŹcient
format
Parquet
Graph
Processing
(Giraph, 
)
Batch
computation
(Pig, Cascading,
Scalding, Spark,

)
© 2015 Dremio Corporation
Get involved
Twitter:
- @ApacheParquet
Mailing list:
- dev@parquet.apache.org
Github repo:
- https://github.com/apache/parquet-mr
Parquet sync ups:
- Regular meetings on google hangout

Contenu connexe

Tendances

HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
DataWorks Summit
 

Tendances (20)

Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017Apache parquet - Apache big data North America 2017
Apache parquet - Apache big data North America 2017
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
HUG France - Apache Drill
HUG France - Apache DrillHUG France - Apache Drill
HUG France - Apache Drill
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage EfficiencyHDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
PyData London 2017 – Efficient and portable DataFrame storage with Apache Par...
 
HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 

En vedette

Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ ᄀᅟᄎᅟᆚ허ᄀᅔ
á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ á„€á…źá„Žá…źá†šá„’á…Ąá„€á…”á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ ᄀᅟᄎᅟᆚ허ᄀᅔ
á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ ᄀᅟᄎᅟᆚ허ᄀᅔ
Young-Ho Cho
 
á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ
á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ
á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ
Young-Ho Cho
 
Processing edges on apache giraph
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraph
DataWorks Summit
 

En vedette (20)

Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ ᄀᅟᄎᅟᆚ허ᄀᅔ
á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ á„€á…źá„Žá…źá†šá„’á…Ąá„€á…”á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ ᄀᅟᄎᅟᆚ허ᄀᅔ
á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œá„Œá…„á†šá„‹á…”á†« 도ᄆᅊᄋᅔᆫ ᄅᅊᄋᅔᄋᅄ ᄀᅟᄎᅟᆚ허ᄀᅔ
 
[NEXT 프연 Week2] UNIX ëȘ…ë čì–Ž 간닚하êȌ ì‚ŽíŽŽëłŽêž°
[NEXT 프연 Week2] UNIX ëȘ…ë čì–Ž 간닚하êȌ ì‚ŽíŽŽëłŽêž°[NEXT 프연 Week2] UNIX ëȘ…ë čì–Ž 간닚하êȌ ì‚ŽíŽŽëłŽêž°
[NEXT 프연 Week2] UNIX ëȘ…ë čì–Ž 간닚하êȌ ì‚ŽíŽŽëłŽêž°
 
á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ
á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ
á„‹á…ąá„‘á…łá†Żá„…á…”á„á…Šá„‹á…”á„‰á…§á†« á„‹á…Ąá„á…”á„á…Šá†šá„Žá…„á„‹á…Ș á„€á…ąá†šá„Žá…Šá„Œá…”á„’á…Łá†Œ
 
Domain Driven Design
Domain Driven DesignDomain Driven Design
Domain Driven Design
 
도ᄆᅊᄋᅔᆫ á„Œá…źá„ƒá…© ᄉᅄᆯᄀᅚᄋᅎ á„‡á…©á†«á„Œá…”á†Ż
도ᄆᅊᄋᅔᆫ á„Œá…źá„ƒá…© ᄉᅄᆯᄀᅚᄋᅎ á„‡á…©á†«á„Œá…”á†Żá„ƒá…©á„†á…Šá„‹á…”á†« á„Œá…źá„ƒá…© ᄉᅄᆯᄀᅚᄋᅎ á„‡á…©á†«á„Œá…”á†Ż
도ᄆᅊᄋᅔᆫ á„Œá…źá„ƒá…© ᄉᅄᆯᄀᅚᄋᅎ á„‡á…©á†«á„Œá…”á†Ż
 
What Makes Great Infographics
What Makes Great InfographicsWhat Makes Great Infographics
What Makes Great Infographics
 
Masters of SlideShare
Masters of SlideShareMasters of SlideShare
Masters of SlideShare
 
STOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
STOP! VIEW THIS! 10-Step Checklist When Uploading to SlideshareSTOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
STOP! VIEW THIS! 10-Step Checklist When Uploading to Slideshare
 
You Suck At PowerPoint!
You Suck At PowerPoint!You Suck At PowerPoint!
You Suck At PowerPoint!
 
10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization10 Ways to Win at SlideShare SEO & Presentation Optimization
10 Ways to Win at SlideShare SEO & Presentation Optimization
 
How To Get More From SlideShare - Super-Simple Tips For Content Marketing
How To Get More From SlideShare - Super-Simple Tips For Content MarketingHow To Get More From SlideShare - Super-Simple Tips For Content Marketing
How To Get More From SlideShare - Super-Simple Tips For Content Marketing
 
Poster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languagesPoster Hadoop summit 2011: pig embedding in scripting languages
Poster Hadoop summit 2011: pig embedding in scripting languages
 
How to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & TricksHow to Make Awesome SlideShares: Tips & Tricks
How to Make Awesome SlideShares: Tips & Tricks
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Processing edges on apache giraph
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraph
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 

Similaire à If you have your own Columnar format, stop now and use Parquet 😛

Advantages of the Cloud_Q2_2017.pptx
Advantages of the Cloud_Q2_2017.pptxAdvantages of the Cloud_Q2_2017.pptx
Advantages of the Cloud_Q2_2017.pptx
SaboneSabone
 
BPM and SOA Are Going Mobile: An Architectural Perspective
BPM and SOA Are Going Mobile: An Architectural PerspectiveBPM and SOA Are Going Mobile: An Architectural Perspective
BPM and SOA Are Going Mobile: An Architectural Perspective
Guido Schmutz
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Pentaho
 

Similaire à If you have your own Columnar format, stop now and use Parquet 😛 (20)

The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
The Big Connection: Integrating Cloud with Enterprise Systems
The Big Connection: Integrating Cloud with Enterprise SystemsThe Big Connection: Integrating Cloud with Enterprise Systems
The Big Connection: Integrating Cloud with Enterprise Systems
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Steve Totman Syncsort Big Data Warehousing hug 23 sept Final
Steve Totman Syncsort Big Data Warehousing hug 23 sept FinalSteve Totman Syncsort Big Data Warehousing hug 23 sept Final
Steve Totman Syncsort Big Data Warehousing hug 23 sept Final
 
Getting Started with HTML5 in Tech Com (STC 2012)
Getting Started with HTML5 in Tech Com (STC 2012)Getting Started with HTML5 in Tech Com (STC 2012)
Getting Started with HTML5 in Tech Com (STC 2012)
 
S2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real WorldS2DS London 2015 - Hadoop Real World
S2DS London 2015 - Hadoop Real World
 
Advantages of the Cloud_Q2_2017.pptx
Advantages of the Cloud_Q2_2017.pptxAdvantages of the Cloud_Q2_2017.pptx
Advantages of the Cloud_Q2_2017.pptx
 
PyData: The Next Generation
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next Generation
 
Level Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationLevel Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop Acceleration
 
Octo and the DevSecOps Evolution at Oracle by Ian Van Hoven
Octo and the DevSecOps Evolution at Oracle by Ian Van HovenOcto and the DevSecOps Evolution at Oracle by Ian Van Hoven
Octo and the DevSecOps Evolution at Oracle by Ian Van Hoven
 
Modernizing i5 Applications
Modernizing i5 ApplicationsModernizing i5 Applications
Modernizing i5 Applications
 
Cloud Native Applications Containers Microservices Platforms CICD Oh my
Cloud Native Applications Containers Microservices Platforms CICD Oh myCloud Native Applications Containers Microservices Platforms CICD Oh my
Cloud Native Applications Containers Microservices Platforms CICD Oh my
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Senior C++ engineer
Senior C++ engineerSenior C++ engineer
Senior C++ engineer
 
BPM and SOA Are Going Mobile: An Architectural Perspective
BPM and SOA Are Going Mobile: An Architectural PerspectiveBPM and SOA Are Going Mobile: An Architectural Perspective
BPM and SOA Are Going Mobile: An Architectural Perspective
 
Big Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big DataBig Data Integration Webinar: Getting Started With Hadoop Big Data
Big Data Integration Webinar: Getting Started With Hadoop Big Data
 
Mastering Docker and Docker Swarm
Mastering Docker and Docker Swarm Mastering Docker and Docker Swarm
Mastering Docker and Docker Swarm
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
 

Plus de Julien Le Dem

Plus de Julien Le Dem (10)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

If you have your own Columnar format, stop now and use Parquet 😛

  • 1. © 2015 Dremio Corporation If you have your own Columnar format, stop now and use Parquet 😛 Julien Le Dem Principal Architect, Dremio VP Apache Parquet
  • 2. © 2015 Dremio Corporation About Dremio Jacques Nadeau Founder & CTO ‱Apache Drill PMC Chair ‱Recognized SQL & NoSQL expert ‱Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT) Tomer Shiran Founder & CEO ‱Apache Drill Founder ‱MapR (VP Product); Microsoft; IBM Research ‱Carnegie Mellon, Technion Julien Le Dem Architect ‱Apache Parquet Founder ‱Apache Pig PMC Member ‱Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect) Top Silicon Valley VCs‱ Enabling self-service data discovery, exploration and analysis on modern data ‱ Founded in June 2015 ‱ Building on open source technologies including Drill, Parquet, Spark
  • 3. © 2015 Dremio Corporation Background of Parquet ‱ Twitter’s data ‱ Lots of data: Instrumentation, User graph, Derived data, ... ‱ Complex: deeply nested structures ‱ Analytics infrastructure: ‱ Several 1000s nodes Hadoop clusters ‱ Log collection to HDFS in Thrift ‱ Parquet ‱ Columnar: space and query efficient ‱ Inspired from the Google Dremel Paper ‱ supports complex data ‱ interoperable Caillebotte: The Parquet Planers
  • 4. © 2015 Dremio Corporation Parquet timeline ‱ Fall 2012: Twitter & Cloudera’s Impala team merge efforts to develop columnar formats. ‱ March 2013: OSS announcement; Criteo signs on for Hive integration. ‱ July 2013: 1.0 release. 18 contributors from more than 5 organizations. ‱ August 2013: Drill chose Parquet as its primary storage format. ‱ May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases. ‱ Apr 2015: Parquet graduates from the Apache Incubator.
  • 5. © 2015 Dremio Corporation What does Parquet do? Interoperability Space efïŹciency Query efïŹciency @EmrgencyKittens
  • 6. © 2015 Dremio Corporation Columnar storage Logical table representation Row layout Column layout Nested schema a b c a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a1 b1 c1a2 b2 c2a3 b3 c3a4 b4 c4a5 b5 c5 encoded chunk encoded chunk encoded chunk On Disk: Encodings: Dictionary, RLE, Delta, PreïŹx
  • 7. © 2015 Dremio Corporation Nested representation Document DocId Links Name Backward Forward Language Url Code Country Columns: docid links.backward links.forward name.language.code name.language.country name.url Schema: Borrowed from the Google Dremel paper https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 8. © 2015 Dremio Corporation Statistics Vertical partitioning (projection push down) Horizontal partitioning (predicate push down) Read only the data you need! + = a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + =
  • 9. © 2015 Dremio Corporation Interoperability Library level integration Query engine integration Avro Thrift Protocol Buffer Pig Tuple Hive SerDe Assembly/striping (model agnostic) Parquet ïŹle format (language agnostic) Object model parquet-avroConverters parquet-thrift parquet-proto parquet-pig parquet-hive Column encodings Impala ... ... Encodings (C) PrestoDrill 

  • 10. © 2015 Dremio Corporation Query engines, frameworks and libraries integrated with Parquet (non exhaustive) Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, SparkSQL Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite Data Models: Avro, Thrift, ProtocolBuffers, POJOs
  • 11. © 2015 Dremio Corporation Loose coupling ‱ Users don’t want to load their data into every tool. ‱ Many tools are available and show up every day. ‱ The cost of trying a new tool should be minimal Storage (HDFS/S3/
) Interactive queries (Drill, Impala, Presto, 
) automated dashboard machine learning Query-efïŹcient format Parquet Graph Processing (Giraph, 
) Batch computation (Pig, Cascading, Scalding, Spark, 
)
  • 12. © 2015 Dremio Corporation Get involved Twitter: - @ApacheParquet Mailing list: - dev@parquet.apache.org Github repo: - https://github.com/apache/parquet-mr Parquet sync ups: - Regular meetings on google hangout