SlideShare une entreprise Scribd logo
1  sur  45
Télécharger pour lire hors ligne
How to use Parquet
as a basis for ETL and analytics
Julien Le Dem @J_
VP Apache Parquet
Analytics Data Pipeline tech lead, Data Platform
@ApacheParquet
Outline
2
- Storing data efficiently for analysis
- Context: Instrumentation and data collection
- Constraints of ETL
Storing data efficiently for analysis
Why do we need to worry about efficiency?
Producing a lot of data is easy
5
Producing a lot of derived data is even easier.

Solution: Compress all the things!
Scanning a lot of data is easy
6
1% completed
… but not necessarily fast.

Waiting is not productive. We want faster turnaround.

Compression but not at the cost of reading speed.
Interoperability not that easy
7
We need a storage format interoperable with all the tools we use
and
keep our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
9
Interoperability
Space efficiency
Query efficiency
Efficiency
Columnar storage
11
Logical table
representation
Row layout
Column layout
encoding
Nested schema
a b c
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5
a1 b1 c1a2 b2 c2a3 b3 c3a4 b4 c4a5 b5 c5
encoded chunk encoded chunk encoded chunk
Properties of efficient encodings
12
- Minimize CPU pipeline bubbles:

	 highly predictable branching
	 reduce data dependency
!
- Minimize CPU cache misses

	 reduce size of the working set
The right encoding for the right job
13
- Delta encodings:

for sorted datasets or signals where the variation is less important than the absolute
value. (timestamp, auto-generated ids, metrics, …) Focuses on avoiding branching.
!
- Prefix coding (delta encoding for strings)

When dictionary encoding does not work.
!
- Dictionary encoding: 

small (60K) set of values (server IP, experiment id, …)
!
- Run Length Encoding:

repetitive data.
Parquet nested representation
14
Document
DocId Links Name
Backward Forward Language Url
Code Country
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
Schema:
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Statistics for filter and query optimization
15
Vertical partitioning
(projection push down)
Horizontal partitioning
(predicate push down)
Read only the data
you need!
+ =
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
+ =
Interoperability
Interoperable
17
Model agnostic
Language agnostic
Java C++
Avro Thrift
Protocol
Buffer
Pig Tuple Hive SerDe
Assembly/striping
Parquet file format
Object model
parquet-avroConverters parquet-thrift parquet-proto parquet-pig parquet-hive
Column encoding
Impala
...
...
Encoding
Query
execution
Frameworks and libraries integrated with Parquet
18
Query engines:
Hive, Impala, HAWQ,
IBM Big SQL, Drill, Tajo,
Pig, Presto, SparkSQL
!
Frameworks:
Spark, MapReduce, Cascading,
Crunch, Scalding, Kite
!
Data Models:
Avro, Thrift, ProtocolBuffers,
POJOs
Context: Instrumentation and data collection
Typical data flow
20
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Happy users
Typical data flow
21
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
Typical data flow
22
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
analysis
Storage (HDFS)
ad-hoc
queries
(Impala,
Hive,
Drill, ...)
automated
dashboard
Batch computation
(Graph, machine
learning, ...)
Streaming
computation
(Storm, Samza,
SparkStreaming..)
Query-efficient
format
Parquet
streaming
analysis
periodic
consolidation
snapshots
Typical data flow
23
Happy
Data Scientist
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
analysis
Storage (HDFS)
ad-hoc
queries
(Impala,
Hive,
Drill, ...)
automated
dashboard
Batch computation
(Graph, machine
learning, ...)
Streaming
computation
(Storm, Samza,
SparkStreaming..)
Query-efficient
format
Parquet
streaming
analysis
periodic
consolidation
snapshots
Schema management
Schema in Hadoop
25
Hadoop does not define a standard
notion of schema but there are
many available:

- Avro

- Thrift

- Protocol Buffers

- Pig

- Hive

- …

And they are all different
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
What they define
26
Schema:

Structure of a record

Constraints on the type

!
Row oriented binary format:
How records are represented one
at a time
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
What they *do not* define
27
	 Column oriented binary format:
Parquet reuses the schema
definitions and provides a common
column oriented binary format
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
Example: address book
28
AddressBook
Address
street
city
state
zip
comment
addresses
Protocol Buffers
29
message AddressBook {!
repeated group addresses = 1 {!
required string street = 2;!
required string city = 3;!
required string state = 4;!
required string zip = 5;!
optional string comment = 6;!
}!
}!
!
- Allows recursive definition

- Types: Group or primitive

- binary format refers to field ids only => Renaming fields does not impact binary format

- Requires installing a native compiler separated from your build
Fields have ids and can be
optional, required or repeated
Lists are repeated fields
Thrift
30
struct AddressBook {!
1: required list<Address> addresses;!
}!
struct Addresses {!
1: required string street;!
2: required string city;!
3: required string state;!
4: required string zip;!
5: optional string comment;!
}!
!
- No recursive definition

- Types: Struct, Map, List, Set, Union or primitive

- binary format refers to field ids only => Renaming fields does not impact binary format

- Requires installing a native compiler separately from the build
Fields have ids and can be
optional or required
explicit collection types
Avro
31
{!
"type": "record", !
"name": "AddressBook",!
"fields" : [{ !
"name": "addresses", !
"type": "array", !
"items": { !
“type”: “record”,!
“fields”: [!
{"name": "street", "type": “string"},!
{"name": "city", "type": “string”}!
{"name": "state", "type": “string"}!
{"name": "zip", "type": “string”}!
{"name": "comment", "type": [“null”, “string”]}!
] !
}!
}]!
}
explicit collection types
- Allows recursive definition

- Types: Records, Arrays, Maps, Unions or primitive

- Binary format requires knowing the write-time schema

➡ more compact but not self descriptive

➡ renaming fields does not impact binary format

- generator in java (well integrated in the build)
null is a type
Optional is a union
Requirements of ETL
Log event collection
33
Initial collection is fundamentally
row oriented:
!
- Sync to disk as early as
possible to minimize event loss
!
- Counting events sent and
received is a good idea Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
Columnar storage conversion
34
Columnar storage requires writes
to be buffered in memory for the
entire row group:
!
- Write many records at a time.
!
- Better executed in batch
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
Columnar storage conversion
35
Not just columnar storage:
!
- Dynamic partitioning
!
- Sort order
!
- Stats generation
Serving
Instrumented
Services
Mutable
Serving
stores
mutation
Data collection
log collection
Streaming log
(Kafka, Scribe,
Chukwa ...)
periodic
snapshots
log
Pull
Pull
streaming
analysis
periodic
consolidation
snapshots
schema
Hive / Impala:
Write to Parquet
36
OutputFormat ProtoParquetOutputFormat ParquetThriftOutputFormat AvroParquetOutputFormat
define schema
setProtobufClass(job,
AddressBook.class)
setThriftClass(job,
AddressBook.class)
setSchema(job,
AddressBook.SCHEMA$)
Scalding: // define the Parquet source!
case class AddressBookParquetSource(override implicit val dateRange: DateRange)!
extends HourlySuffixParquetThrift[AddressBook](“/my/data/address_book", dateRange)!
// load and transform data!
pipe.write(ParquetSource())
Pig: STORE mydata !
! INTO ‘my/data’ !
! USING parquet.pig.ParquetStorer();
MapReduce:
create table parquet_table (x int, y string) stored as parquetfile;!
insert into parquet_table select x, y from some_other_table;
Query engines
Scalding
38
	 loading:
new FixedPathParquetThrift[AddressBook](“my”, “data”) {!
val city = StringColumn("city")!
override val withFilter: Option[FilterPredicate] = !
Some(city === “San Jose”)!
}!
!
operations:
p.map( (r) => r.a + r.b )!
p.groupBy( (r) => r.c )!
p.join !
…
Explicit push
down
Pig
39
loading:
mydata = LOAD ‘my/data’ USING parquet.pig.ParquetLoader();!
!
operations:
A = FOREACH mydata GENERATE a + b;!
B = GROUP mydata BY c;!
C = JOIN A BY a, B BY b;
Projection push
down happens
automatically
SQL engines
40
Load query
Hive
create table as … SELECT city FROM addresses WHERE zip == 95113Impala
Presto
Drill
optional. !
Drill can directly
query parquet files
SELECT city FROM dfs.`/table/addresses` zip == 95113
SparkSQL
val parquetFile =
sqlContext.parquetFile(
"/table/addresses")
val result = sqlContext!
!.sql("SELECT city FROM addresses WHERE zip == 95113”)!
result.map((r) => …)
Projection push
down happens
automatically
Community
Parquet timeline
42
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats

- March 2013: OSS announcement; Criteo signs on for Hive integration

- July 2013: 1.0 release. 18 contributors from more than 5 organizations.

- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.

- Apr 2015: Parquet graduates from the Apache Incubator
Thank you to our contributors
43
Open Source announcement
1.0 release
Get involved
44
Mailing lists:
- dev@parquet.apache.org
!
Github repo:
- https://github.com/apache/parquet-mr
!
Parquet sync ups:
- Regular meetings on google hangout
@ApacheParquet
Questions
45
SELECT answer(question) FROM audience
@ApacheParquet

Contenu connexe

En vedette

Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop ClustersCloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop ClustersCloudera, Inc.
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet FormatYue Chen
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
What is a good technology stack today?
What is a good technology stack today?What is a good technology stack today?
What is a good technology stack today?Netlight Consulting
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksHakky St
 
Technology Stack Discussion
Technology Stack DiscussionTechnology Stack Discussion
Technology Stack DiscussionZaiyang Li
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview externalmattlieber
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Introduction to Web Technology Stacks
Introduction to Web Technology StacksIntroduction to Web Technology Stacks
Introduction to Web Technology StacksPrakarsh -
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...confluent
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectKaufman Ng
 
Introduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) OnlineIntroduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) OnlineDenis Voituron
 
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuBig Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuEmre Sevinç
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connectKnoldus Inc.
 
Building a real-time streaming platform using Kafka Connect + Kafka Streams
Building a real-time streaming platform using Kafka Connect + Kafka StreamsBuilding a real-time streaming platform using Kafka Connect + Kafka Streams
Building a real-time streaming platform using Kafka Connect + Kafka Streamsconfluent
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...Spark Summit
 
The Future of Hadoop Security - Hadoop Summit 2014
The Future of Hadoop Security - Hadoop Summit 2014The Future of Hadoop Security - Hadoop Summit 2014
The Future of Hadoop Security - Hadoop Summit 2014Cloudera, Inc.
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solrpittaya
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...Hadoop / Spark Conference Japan
 

En vedette (20)

Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop ClustersCloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
What is a good technology stack today?
What is a good technology stack today?What is a good technology stack today?
What is a good technology stack today?
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Technology Stack Discussion
Technology Stack DiscussionTechnology Stack Discussion
Technology Stack Discussion
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Introduction to Web Technology Stacks
Introduction to Web Technology StacksIntroduction to Web Technology Stacks
Introduction to Web Technology Stacks
 
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
 
Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
 
Introduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) OnlineIntroduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) Online
 
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetuBig Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
Big Data Governance in Hadoop Environments with Cloudera Navigatorfeb2017meetu
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Building a real-time streaming platform using Kafka Connect + Kafka Streams
Building a real-time streaming platform using Kafka Connect + Kafka StreamsBuilding a real-time streaming platform using Kafka Connect + Kafka Streams
Building a real-time streaming platform using Kafka Connect + Kafka Streams
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming by Ew...
 
The Future of Hadoop Security - Hadoop Summit 2014
The Future of Hadoop Security - Hadoop Summit 2014The Future of Hadoop Security - Hadoop Summit 2014
The Future of Hadoop Security - Hadoop Summit 2014
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solr
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
Project Tungsten Bringing Spark Closer to Bare Meta (Hadoop / Spark Conferenc...
 

Plus de Julien Le Dem

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & MarquezJulien Le Dem
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowJulien Le Dem
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 

Plus de Julien Le Dem (17)

Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 

Dernier

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 

Dernier (20)

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 

How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

  • 1. How to use Parquet as a basis for ETL and analytics Julien Le Dem @J_ VP Apache Parquet Analytics Data Pipeline tech lead, Data Platform @ApacheParquet
  • 2. Outline 2 - Storing data efficiently for analysis - Context: Instrumentation and data collection - Constraints of ETL
  • 4. Why do we need to worry about efficiency?
  • 5. Producing a lot of data is easy 5 Producing a lot of derived data is even easier. Solution: Compress all the things!
  • 6. Scanning a lot of data is easy 6 1% completed … but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression but not at the cost of reading speed.
  • 7. Interoperability not that easy 7 We need a storage format interoperable with all the tools we use and keep our options open for the next big thing.
  • 9. Parquet design goals 9 Interoperability Space efficiency Query efficiency
  • 11. Columnar storage 11 Logical table representation Row layout Column layout encoding Nested schema a b c a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a1 b1 c1a2 b2 c2a3 b3 c3a4 b4 c4a5 b5 c5 encoded chunk encoded chunk encoded chunk
  • 12. Properties of efficient encodings 12 - Minimize CPU pipeline bubbles: highly predictable branching reduce data dependency ! - Minimize CPU cache misses reduce size of the working set
  • 13. The right encoding for the right job 13 - Delta encodings: for sorted datasets or signals where the variation is less important than the absolute value. (timestamp, auto-generated ids, metrics, …) Focuses on avoiding branching. ! - Prefix coding (delta encoding for strings) When dictionary encoding does not work. ! - Dictionary encoding: small (60K) set of values (server IP, experiment id, …) ! - Run Length Encoding: repetitive data.
  • 14. Parquet nested representation 14 Document DocId Links Name Backward Forward Language Url Code Country Columns: docid links.backward links.forward name.language.code name.language.country name.url Schema: Borrowed from the Google Dremel paper https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 15. Statistics for filter and query optimization 15 Vertical partitioning (projection push down) Horizontal partitioning (predicate push down) Read only the data you need! + = a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a b c a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 + =
  • 17. Interoperable 17 Model agnostic Language agnostic Java C++ Avro Thrift Protocol Buffer Pig Tuple Hive SerDe Assembly/striping Parquet file format Object model parquet-avroConverters parquet-thrift parquet-proto parquet-pig parquet-hive Column encoding Impala ... ... Encoding Query execution
  • 18. Frameworks and libraries integrated with Parquet 18 Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, SparkSQL ! Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite ! Data Models: Avro, Thrift, ProtocolBuffers, POJOs
  • 19. Context: Instrumentation and data collection
  • 21. Typical data flow 21 Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 22. Typical data flow 22 Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull analysis Storage (HDFS) ad-hoc queries (Impala, Hive, Drill, ...) automated dashboard Batch computation (Graph, machine learning, ...) Streaming computation (Storm, Samza, SparkStreaming..) Query-efficient format Parquet streaming analysis periodic consolidation snapshots
  • 23. Typical data flow 23 Happy Data Scientist Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull analysis Storage (HDFS) ad-hoc queries (Impala, Hive, Drill, ...) automated dashboard Batch computation (Graph, machine learning, ...) Streaming computation (Storm, Samza, SparkStreaming..) Query-efficient format Parquet streaming analysis periodic consolidation snapshots
  • 25. Schema in Hadoop 25 Hadoop does not define a standard notion of schema but there are many available: - Avro - Thrift - Protocol Buffers - Pig - Hive - … And they are all different Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 26. What they define 26 Schema: Structure of a record Constraints on the type ! Row oriented binary format: How records are represented one at a time Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 27. What they *do not* define 27 Column oriented binary format: Parquet reuses the schema definitions and provides a common column oriented binary format Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 29. Protocol Buffers 29 message AddressBook {! repeated group addresses = 1 {! required string street = 2;! required string city = 3;! required string state = 4;! required string zip = 5;! optional string comment = 6;! }! }! ! - Allows recursive definition - Types: Group or primitive - binary format refers to field ids only => Renaming fields does not impact binary format - Requires installing a native compiler separated from your build Fields have ids and can be optional, required or repeated Lists are repeated fields
  • 30. Thrift 30 struct AddressBook {! 1: required list<Address> addresses;! }! struct Addresses {! 1: required string street;! 2: required string city;! 3: required string state;! 4: required string zip;! 5: optional string comment;! }! ! - No recursive definition - Types: Struct, Map, List, Set, Union or primitive - binary format refers to field ids only => Renaming fields does not impact binary format - Requires installing a native compiler separately from the build Fields have ids and can be optional or required explicit collection types
  • 31. Avro 31 {! "type": "record", ! "name": "AddressBook",! "fields" : [{ ! "name": "addresses", ! "type": "array", ! "items": { ! “type”: “record”,! “fields”: [! {"name": "street", "type": “string"},! {"name": "city", "type": “string”}! {"name": "state", "type": “string"}! {"name": "zip", "type": “string”}! {"name": "comment", "type": [“null”, “string”]}! ] ! }! }]! } explicit collection types - Allows recursive definition - Types: Records, Arrays, Maps, Unions or primitive - Binary format requires knowing the write-time schema ➡ more compact but not self descriptive ➡ renaming fields does not impact binary format - generator in java (well integrated in the build) null is a type Optional is a union
  • 33. Log event collection 33 Initial collection is fundamentally row oriented: ! - Sync to disk as early as possible to minimize event loss ! - Counting events sent and received is a good idea Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 34. Columnar storage conversion 34 Columnar storage requires writes to be buffered in memory for the entire row group: ! - Write many records at a time. ! - Better executed in batch Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 35. Columnar storage conversion 35 Not just columnar storage: ! - Dynamic partitioning ! - Sort order ! - Stats generation Serving Instrumented Services Mutable Serving stores mutation Data collection log collection Streaming log (Kafka, Scribe, Chukwa ...) periodic snapshots log Pull Pull streaming analysis periodic consolidation snapshots schema
  • 36. Hive / Impala: Write to Parquet 36 OutputFormat ProtoParquetOutputFormat ParquetThriftOutputFormat AvroParquetOutputFormat define schema setProtobufClass(job, AddressBook.class) setThriftClass(job, AddressBook.class) setSchema(job, AddressBook.SCHEMA$) Scalding: // define the Parquet source! case class AddressBookParquetSource(override implicit val dateRange: DateRange)! extends HourlySuffixParquetThrift[AddressBook](“/my/data/address_book", dateRange)! // load and transform data! pipe.write(ParquetSource()) Pig: STORE mydata ! ! INTO ‘my/data’ ! ! USING parquet.pig.ParquetStorer(); MapReduce: create table parquet_table (x int, y string) stored as parquetfile;! insert into parquet_table select x, y from some_other_table;
  • 38. Scalding 38 loading: new FixedPathParquetThrift[AddressBook](“my”, “data”) {! val city = StringColumn("city")! override val withFilter: Option[FilterPredicate] = ! Some(city === “San Jose”)! }! ! operations: p.map( (r) => r.a + r.b )! p.groupBy( (r) => r.c )! p.join ! … Explicit push down
  • 39. Pig 39 loading: mydata = LOAD ‘my/data’ USING parquet.pig.ParquetLoader();! ! operations: A = FOREACH mydata GENERATE a + b;! B = GROUP mydata BY c;! C = JOIN A BY a, B BY b; Projection push down happens automatically
  • 40. SQL engines 40 Load query Hive create table as … SELECT city FROM addresses WHERE zip == 95113Impala Presto Drill optional. ! Drill can directly query parquet files SELECT city FROM dfs.`/table/addresses` zip == 95113 SparkSQL val parquetFile = sqlContext.parquetFile( "/table/addresses") val result = sqlContext! !.sql("SELECT city FROM addresses WHERE zip == 95113”)! result.map((r) => …) Projection push down happens automatically
  • 42. Parquet timeline 42 - Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats - March 2013: OSS announcement; Criteo signs on for Hive integration - July 2013: 1.0 release. 18 contributors from more than 5 organizations. - May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases. - Apr 2015: Parquet graduates from the Apache Incubator
  • 43. Thank you to our contributors 43 Open Source announcement 1.0 release
  • 44. Get involved 44 Mailing lists: - dev@parquet.apache.org ! Github repo: - https://github.com/apache/parquet-mr ! Parquet sync ups: - Regular meetings on google hangout @ApacheParquet
  • 45. Questions 45 SELECT answer(question) FROM audience @ApacheParquet