How to use Parquet as a basis for ETL and analytics (Hadoop Summit San Jose 2015)

Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
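To make the description concrete, here is a minimal sketch of the kind of read/transform/write step the talk describes, assuming Spark 1.4+ and hypothetical table paths (the slides below show the same ideas in Scalding, Pig, Hive, Impala, Drill, and SparkSQL):

    // Minimal sketch (assumptions: Spark 1.4+, hypothetical HDFS paths).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ParquetRoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-roundtrip"))
        val sqlContext = new SQLContext(sc)

        // Parquet is self-describing: the schema is read from the files themselves.
        val addresses = sqlContext.read.parquet("/table/addresses")

        // Projection and predicate push down mean only the city and zip columns are scanned.
        addresses.filter(addresses("zip") === "95113")
                 .select("city")
                 .write.parquet("/table/addresses_san_jose")
      }
    }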

  1. 1. How to use Parquet as a basis for ETL and analytics. Julien Le Dem @J_, Analytics Data Pipeline tech lead, Data Platform; VP Apache Parquet. @ApacheParquet
  2. 2. Outline 2 - Storing data efficiently for analysis - Context: Instrumentation and data collection - Constraints of ETL
  3. 3. Storing data efficiently for analysis
  4. 4. Why do we need to worry about efficiency?
  5. 5. Producing a lot of data is easy 5 Producing a lot of derived data is even easier. Solution: Compress all the things!
  6. 6. Scanning a lot of data is easy 6 1% completed … but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression but not at the cost of reading speed.
  7. 7. Interoperability not that easy 7 We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
  8. 8. Enter Apache Parquet
  9. 9. Parquet design goals 9 Interoperability Space efficiency Query efficiency
  10. 10. Efficiency
  11. 11. Columnar storage 11 [diagram: a logical table representation with a nested schema (columns a, b, c), shown first in a row layout (a1 b1 c1, a2 b2 c2, …) and then in a column layout (a1 a2 a3 …, b1 b2 b3 …, c1 c2 c3 …), with each column encoded into chunks]
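     A toy illustration (not Parquet's actual on-disk structures) of the difference between the two layouts, using made-up values:

        // Illustrative only: the same records laid out row-wise vs column-wise.
        case class Record(a: String, b: String, c: String)
        val rows = Seq(Record("a1", "b1", "c1"), Record("a2", "b2", "c2"), Record("a3", "b3", "c3"))

        // Row layout: all values of one record are adjacent.
        val rowLayout: Seq[Seq[String]] = rows.map(r => Seq(r.a, r.b, r.c))

        // Column layout: all values of one column are adjacent, so similar values sit together,
        // which is what makes the encodings on the next slides effective.
        val columnLayout: Map[String, Seq[String]] = Map(
          "a" -> rows.map(_.a),
          "b" -> rows.map(_.b),
          "c" -> rows.map(_.c))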
  12. 12. Properties of efficient encodings 12 - Minimize CPU pipeline bubbles: highly predictable branching, reduced data dependencies. - Minimize CPU cache misses: reduce the size of the working set.
  13. 13. The right encoding for the right job 13 - Delta encodings: for sorted datasets or signals where the variation matters less than the absolute value (timestamps, auto-generated ids, metrics, …); focuses on avoiding branching. - Prefix coding (delta encoding for strings): when dictionary encoding does not work. - Dictionary encoding: for a small (~60K) set of values (server IP, experiment id, …). - Run Length Encoding: for repetitive data.
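     Toy versions of the last two encodings (sketches only, not the real parquet-mr implementations):

        // Run Length Encoding: collapse runs of repeated values into (value, count) pairs.
        def runLengthEncode[T](values: Seq[T]): Seq[(T, Int)] =
          values.foldLeft(List.empty[(T, Int)]) {
            case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
            case (acc, x)                      => (x, 1) :: acc
          }.reverse

        // Dictionary encoding: store each distinct value once, reference it by a small integer id.
        def dictionaryEncode[T](values: Seq[T]): (Seq[T], Seq[Int]) = {
          val dictionary = values.distinct
          (dictionary, values.map(v => dictionary.indexOf(v)))
        }

        runLengthEncode(Seq("a", "a", "a", "b", "b"))   // List((a,3), (b,2))
        dictionaryEncode(Seq("us", "fr", "us", "us"))   // (List(us, fr), List(0, 1, 0, 0))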
  14. 14. Parquet nested representation 14 Schema (tree): Document { DocId, Links { Backward, Forward }, Name { Language { Code, Country }, Url } }. Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url. Borrowed from the Google Dremel paper. https://blog.twitter.com/2013/dremel-made-simple-with-parquet
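     For reference, a sketch of the same Document schema written in Parquet's schema syntax and parsed with parquet-mr (field types follow the Dremel paper; the package name assumes parquet-mr 1.7+, older releases use the parquet.* packages):

        import org.apache.parquet.schema.MessageTypeParser

        // Parse the schema from its textual representation; each leaf becomes one column.
        val documentSchema = MessageTypeParser.parseMessageType("""
          message Document {
            required int64 DocId;
            optional group Links {
              repeated int64 Backward;
              repeated int64 Forward;
            }
            repeated group Name {
              repeated group Language {
                required binary Code (UTF8);
                optional binary Country (UTF8);
              }
              optional binary Url (UTF8);
            }
          }
        """)

        // documentSchema.getColumns lists the six column paths above
        // together with their repetition and definition levels.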
  15. 15. Statistics for filter and query optimization 15 Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need! [diagram: the same table (columns a, b, c; rows 1-5) shown with only the needed columns selected, only the needed rows selected, and their intersection]
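     A toy sketch of why the statistics matter (illustrative only; Parquet keeps min/max statistics per column chunk and per page):

        // Illustrative only: min/max statistics let a reader skip whole chunks
        // without decoding them when a predicate cannot possibly match.
        case class ChunkStats(min: Int, max: Int)

        def canSkip(stats: ChunkStats, wanted: Int): Boolean =
          wanted < stats.min || wanted > stats.max

        val zipChunks = Seq(ChunkStats(10000, 49999), ChunkStats(50000, 89999), ChunkStats(90000, 99999))
        val chunksToRead = zipChunks.filterNot(canSkip(_, 95113))   // only the last chunk is read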
  16. 16. Interoperability
  17. 17. Interoperable 17 [diagram: model agnostic and language agnostic (Java, C++). Object models (Avro, Thrift, Protocol Buffers, Pig Tuple, Hive SerDe, …) plug into converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive); assembly/striping and column encoding sit on top of the Parquet file format; query execution engines such as Impala read it natively]
  18. 18. Frameworks and libraries integrated with Parquet 18 Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto, SparkSQL. Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite. Data models: Avro, Thrift, Protocol Buffers, POJOs.
  19. 19. Context: Instrumentation and data collection
  20. 20. Typical data flow 20 [diagram: instrumented services mutate mutable serving stores; happy users]
  21. 21. Typical data flow 21 [diagram adds data collection: log collection into a streaming log (Kafka, Scribe, Chukwa, …) feeding streaming analysis, plus periodic snapshots of the serving stores pulled into periodic consolidation, all described by a schema]
  22. 22. Typical data flow 22 [diagram adds analysis: storage (HDFS) in a query-efficient format (Parquet) serving ad-hoc queries (Impala, Hive, Drill, …), automated dashboards, batch computation (graph, machine learning, …), and streaming computation (Storm, Samza, Spark Streaming, …)]
  23. 23. Typical data flow 23 [same diagram as the previous slide, now with a happy data scientist]
  24. 24. Schema management
  25. 25. Schema in Hadoop 25 Hadoop does not define a standard notion of schema, but there are many available: - Avro - Thrift - Protocol Buffers - Pig - Hive - … And they are all different. [data flow diagram recap, highlighting where the schema applies]
  26. 26. What they define 26 Schema: structure of a record, constraints on the types. Row-oriented binary format: how records are represented one at a time. [data flow diagram recap]
  27. 27. What they *do not* define 27 A column-oriented binary format: Parquet reuses their schema definitions and provides a common column-oriented binary format. [data flow diagram recap]
  28. 28. Example: address book 28 [diagram: an AddressBook contains a list of addresses; each Address has street, city, state, zip, and comment fields]
  29. 29. Protocol Buffers 29
      message AddressBook {
        repeated group addresses = 1 {
          required string street = 2;
          required string city = 3;
          required string state = 4;
          required string zip = 5;
          optional string comment = 6;
        }
      }
      Fields have ids and can be optional, required or repeated; lists are repeated fields.
      - Allows recursive definition
      - Types: group or primitive
      - The binary format refers to field ids only => renaming fields does not impact the binary format
      - Requires installing a native compiler separate from your build
  30. 30. Thrift 30
      struct AddressBook {
        1: required list<Address> addresses;
      }
      struct Address {
        1: required string street;
        2: required string city;
        3: required string state;
        4: required string zip;
        5: optional string comment;
      }
      Fields have ids and can be optional or required; explicit collection types.
      - No recursive definition
      - Types: struct, map, list, set, union or primitive
      - The binary format refers to field ids only => renaming fields does not impact the binary format
      - Requires installing a native compiler separately from the build
  31. 31. Avro 31
      {
        "type": "record",
        "name": "AddressBook",
        "fields": [{
          "name": "addresses",
          "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "Address",
              "fields": [
                {"name": "street",  "type": "string"},
                {"name": "city",    "type": "string"},
                {"name": "state",   "type": "string"},
                {"name": "zip",     "type": "string"},
                {"name": "comment", "type": ["null", "string"]}
              ]
            }
          }
        }]
      }
      Explicit collection types; null is a type; optional is a union with null.
      - Allows recursive definition
      - Types: records, arrays, maps, unions or primitive
      - The binary format requires knowing the write-time schema => more compact but not self-descriptive; renaming fields does not impact the binary format
      - Code generator in Java (well integrated in the build)
  32. 32. Requirements of ETL
  33. 33. Log event collection 33 Initial collection is fundamentally row-oriented: - Sync to disk as early as possible to minimize event loss. - Counting events sent and received is a good idea. [data flow diagram recap, highlighting log collection]
  34. 34. Columnar storage conversion 34 Columnar storage requires writes to be buffered in memory for an entire row group: - Write many records at a time. - Better executed in batch. [data flow diagram recap]
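     A sketch of what this batch conversion step can look like with parquet-avro's writer (assumptions: parquet-mr 1.7+ package names, an Avro schema such as the one on slide 31, and a batch of already-collected GenericRecords):

        import org.apache.avro.generic.GenericRecord
        import org.apache.hadoop.fs.Path
        import org.apache.parquet.avro.AvroParquetWriter
        import org.apache.parquet.hadoop.metadata.CompressionCodecName

        // Sketch only: convert a batch of collected log records into one Parquet file.
        def writeBatch(records: Seq[GenericRecord], schema: org.apache.avro.Schema, out: Path): Unit = {
          val writer = new AvroParquetWriter[GenericRecord](
            out,
            schema,
            CompressionCodecName.SNAPPY,
            128 * 1024 * 1024,   // row group size: this much data is buffered in memory before a flush
            1 * 1024 * 1024)     // page size
          try records.foreach(r => writer.write(r))
          finally writer.close() // the footer (schema + statistics) is written on close
        }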
  35. 35. Columnar storage conversion 35 Not just columnar storage: - Dynamic partitioning. - Sort order. - Stats generation. [data flow diagram recap]
  36. 36. Write to Parquet 36
      MapReduce: choose the OutputFormat matching your object model and define the schema:
        ProtoParquetOutputFormat   setProtobufClass(job, AddressBook.class)
        ParquetThriftOutputFormat  setThriftClass(job, AddressBook.class)
        AvroParquetOutputFormat    setSchema(job, AddressBook.SCHEMA$)
      Scalding:
        // define the Parquet source
        case class AddressBookParquetSource(override implicit val dateRange: DateRange)
          extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)
        // load and transform data
        pipe.write(ParquetSource())
      Pig:
        STORE mydata INTO 'my/data' USING parquet.pig.ParquetStorer();
      Hive / Impala:
        create table parquet_table (x int, y string) stored as parquetfile;
        insert into parquet_table select x, y from some_other_table;
  37. 37. Query engines
  38. 38. Scalding 38
      Loading (explicit push down):
        new FixedPathParquetThrift[AddressBook]("my", "data") {
          val city = StringColumn("city")
          override val withFilter: Option[FilterPredicate] =
            Some(city === "San Jose")
        }
      Operations:
        p.map( (r) => r.a + r.b )
        p.groupBy( (r) => r.c )
        p.join
        …
  39. 39. Pig 39
      Loading:
        mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();
      Operations:
        A = FOREACH mydata GENERATE a + b;
        B = GROUP mydata BY c;
        C = JOIN A BY a, B BY b;
      Projection push down happens automatically.
  40. 40. SQL engines 40
      Hive / Impala / Presto:
        load:  create table as …
        query: SELECT city FROM addresses WHERE zip == 95113
      Drill (loading is optional: Drill can query Parquet files directly):
        SELECT city FROM dfs.`/table/addresses` WHERE zip == 95113
      SparkSQL:
        val parquetFile = sqlContext.parquetFile("/table/addresses")
        val result = sqlContext.sql("SELECT city FROM addresses WHERE zip == 95113")
        result.map((r) => …)
      Projection push down happens automatically.
  41. 41. Community
  42. 42. Parquet timeline 42 - Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats - March 2013: OSS announcement; Criteo signs on for Hive integration - July 2013: 1.0 release. 18 contributors from more than 5 organizations. - May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases. - Apr 2015: Parquet graduates from the Apache Incubator
  43. 43. Thank you to our contributors 43 [chart: contributor activity over time, annotated with the open source announcement and the 1.0 release]
  44. 44. Get involved 44 Mailing lists: dev@parquet.apache.org. GitHub repo: https://github.com/apache/parquet-mr. Parquet sync-ups: regular meetings on Google Hangout. @ApacheParquet
  45. 45. Questions 45 SELECT answer(question) FROM audience @ApacheParquet
