
How to use Parquet as a basis for ETL and analytics


Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.

  1. 1. How to use Parquet as a basis for ETL and analytics. Julien Le Dem (@J_), Analytics Data Pipeline tech lead, Data Platform. @ApacheParquet
  2. 2. Outline 2 - Instrumentation and data collection - Storing data efficiently for analysis - Openness and Interoperability
  3. 3. Instrumentation and data collection
  4. 4. Typical data flow 4 (diagram): instrumented services apply mutations to mutable serving stores, which serve happy users.
  5. 5. Typical data flow 5 (diagram): adds data collection: logs are collected into a streaming log (Kafka, Scribe, Chukwa, ...) and pulled for streaming analysis; periodic snapshots of the serving stores are pulled and consolidated, with an associated schema.
  6. 6. Typical data flow 6 (diagram): adds storage on HDFS in a query-efficient format, Parquet, feeding ad-hoc queries (Impala, Hive, Drill, ...), automated dashboards, batch computation (graph, machine learning, ...), and streaming computation (Storm, Samza, Spark Streaming, ...).
  7. 7. Typical data flow 7 (diagram): the complete pipeline, from instrumented services through collection, storage, and analysis, ending with a happy data scientist.
  8. 8. Storing data for analysis
  9. 9. Producing a lot of data is easy 9 Producing a lot of derived data is even easier. Solution: Compress all the things!
  10. 10. Scanning a lot of data is easy 10 1% completed … but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression but not at the cost of reading speed.
  11. 11. Interoperability: not that easy 11 We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
  12. 12. Enter Apache Parquet
  13. 13. Parquet design goals 13 - Interoperability - Space efficiency - Query efficiency
  14. 14. Efficiency
  15. 15. Columnar storage 15 (diagram): the same logical table can be stored in a row layout (records one after another) or a column layout (all values of a column together); Parquet takes a nested schema, stripes it into columns, and stores each column as encoded chunks.
  16. 16. Parquet nested representation 16 (diagram): a nested Document schema (DocId; Links with Backward and Forward; Name with Language (Code, Country) and Url) maps to the columns docid, links.backward, links.forward, name.language.code, name.language.country, name.url. Borrowed from the Google Dremel paper. https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  17. 17. Statistics for filter and query optimization 17 (diagram): vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need!
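Not on the slides: a hedged sketch of how those two push downs can be requested when reading Parquet from a plain MapReduce job through parquet-avro. Package names assume a recent parquet-mr; the projection schema and the city column are illustrative.

    import org.apache.avro.Schema;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.parquet.avro.AvroParquetInputFormat;
    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetInputFormat;
    import org.apache.parquet.io.api.Binary;

    public class PushdownExample {
      public static void configure(Job job, Schema addressProjection) {
        job.setInputFormatClass(AvroParquetInputFormat.class);

        // Vertical partitioning: only the projected columns are read from disk.
        AvroParquetInputFormat.setRequestedProjection(job, addressProjection);

        // Horizontal partitioning: row groups whose min/max statistics rule out
        // the predicate are skipped entirely.
        FilterPredicate cityIsSanJose =
            FilterApi.eq(FilterApi.binaryColumn("city"), Binary.fromString("San Jose"));
        ParquetInputFormat.setFilterPredicate(job.getConfiguration(), cityIsSanJose);
      }
    }

Query engines such as Impala, Hive, and Spark SQL can derive the same two optimizations from the SELECT list and the WHERE clause.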
  18. 18. Properties of efficient encodings 18 - Minimize CPU pipeline bubbles: highly predictable branching, reduced data dependencies - Minimize CPU cache misses: reduce the size of the working set
  19. 19. The right encoding for the right job 19 - Delta encodings: for sorted datasets or signals where the variation is small compared to the absolute value (timestamps, auto-generated ids, metrics, …); focuses on avoiding branching. - Prefix coding (delta encoding for strings): when dictionary encoding does not work. - Dictionary encoding: for small sets of distinct values (up to ~60K), e.g. server IP, experiment id, … - Run Length Encoding: for repetitive data.
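Purely illustrative (not Parquet's actual encoder code): a small sketch of the intuition behind two of the encodings above, shown on plain Java arrays.

    import java.util.ArrayList;
    import java.util.List;

    public class EncodingIntuition {

      // Delta encoding: sorted values such as timestamps or auto-generated ids
      // become one absolute value followed by small deltas, which pack into far
      // fewer bits than the raw values.
      static long[] deltaEncode(long[] sorted) {
        long[] deltas = new long[sorted.length];
        long previous = 0;
        for (int i = 0; i < sorted.length; i++) {
          deltas[i] = sorted[i] - previous; // first entry keeps its absolute value
          previous = sorted[i];
        }
        return deltas;
      }

      // Run Length Encoding: repetitive data collapses into (value, runLength) pairs.
      static List<long[]> runLengthEncode(int[] values) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
          int value = values[i];
          int runLength = 1;
          while (i + runLength < values.length && values[i + runLength] == value) {
            runLength++;
          }
          runs.add(new long[] { value, runLength });
          i += runLength;
        }
        return runs;
      }
    }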
  20. 20. Interoperability
  21. 21. Interoperable 21 (diagram): model agnostic and language agnostic (Java, C++). Object models (Avro, Thrift, Protocol Buffers, Pig Tuple, Hive SerDe, ...) plug into the Parquet file format through converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive); assembly/striping and column encoding belong to the format itself, while query execution stays in the engine (e.g. Impala).
  22. 22. Frameworks and libraries integrated with Parquet 22 Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto ! Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite ! Data Models: Avro, Thrift, ProtocolBuffers, POJOs
  23. 23. Schema management
  24. 24. Schema in Hadoop 24 Hadoop does not define a standard notion of schema but there are many available: - Avro - Thrift - Protocol Buffers - Pig - Hive - … And they are all different
  25. 25. What they define 25 - Schema: structure of a record, constraints on the types - Row oriented binary format: how records are represented one at a time
  26. 26. What they *do not* define 26 Column oriented binary format: Parquet reuses the schema definitions and provides a common column oriented binary format
  27. 27. Example: address book 27 (diagram): an AddressBook holds a repeated addresses field; each Address has street, city, state, zip, and comment.
  28. 28. Protocol Buffers 28
    message AddressBook {
      repeated group addresses = 1 {
        required string street = 2;
        required string city = 3;
        required string state = 4;
        required string zip = 5;
        optional string comment = 6;
      }
    }
    - Fields have ids and can be optional, required or repeated; lists are repeated fields
    - Allows recursive definition
    - Types: group or primitive
    - Binary format refers to field ids only => renaming fields does not impact the binary format
    - Requires installing a native compiler separate from your build
  29. 29. Thrift 29
    struct AddressBook {
      1: required list<Address> addresses;
    }
    struct Address {
      1: required string street;
      2: required string city;
      3: required string state;
      4: required string zip;
      5: optional string comment;
    }
    - Fields have ids and can be optional or required; explicit collection types
    - No recursive definition
    - Types: struct, map, list, set, union or primitive
    - Binary format refers to field ids only => renaming fields does not impact the binary format
    - Requires installing a native compiler separately from the build
  30. 30. Avro 30
    {
      "type": "record",
      "name": "AddressBook",
      "fields": [{
        "name": "addresses",
        "type": {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Address",
            "fields": [
              {"name": "street", "type": "string"},
              {"name": "city", "type": "string"},
              {"name": "state", "type": "string"},
              {"name": "zip", "type": "string"},
              {"name": "comment", "type": ["null", "string"]}
            ]
          }
        }
      }]
    }
    - Explicit collection types; null is a type; optional is a union with null
    - Allows recursive definition
    - Types: records, arrays, maps, unions or primitive
    - Binary format requires knowing the write-time schema: more compact but not self-describing; renaming fields does not impact the binary format
    - Generator in Java (well integrated in the build)
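Not on the slides: whichever object model is used, the converters all target Parquet's own schema notation, the common column oriented format of slide 26. A hedged sketch of what the address book looks like as a Parquet message type, built here with parquet-mr's MessageTypeParser; the exact UTF8 annotations are an assumption, and in practice parquet-avro / parquet-thrift / parquet-proto derive this schema automatically.

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class AddressBookParquetSchema {
      // The address book example expressed in Parquet's schema notation.
      static final MessageType ADDRESS_BOOK = MessageTypeParser.parseMessageType(
          "message AddressBook {\n"
          + "  repeated group addresses {\n"
          + "    required binary street (UTF8);\n"
          + "    required binary city (UTF8);\n"
          + "    required binary state (UTF8);\n"
          + "    required binary zip (UTF8);\n"
          + "    optional binary comment (UTF8);\n"
          + "  }\n"
          + "}");
    }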
  31. 31. Write to Parquet
  32. 32. Write to Parquet with Map Reduce 32
    Protocol Buffers:
    job.setOutputFormatClass(ProtoParquetOutputFormat.class);
    ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);
    Thrift:
    job.setOutputFormatClass(ParquetThriftOutputFormat.class);
    ParquetThriftOutputFormat.setThriftClass(job, AddressBook.class);
    Avro:
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);
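Not in the slides: outside MapReduce, the same parquet-avro converter can also write a file directly, which is handy for tests and small tools. A hedged sketch using generic Avro records so no generated class is needed; the output path and sample values are made up for illustration.

    import java.util.Collections;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriteAddressBookDirectly {
      // Same address book schema as on the Avro slide, parsed at runtime.
      private static final Schema ADDRESS_BOOK = new Schema.Parser().parse(
          "{\"type\": \"record\", \"name\": \"AddressBook\", \"fields\": ["
          + " {\"name\": \"addresses\", \"type\": {\"type\": \"array\", \"items\": {"
          + "   \"type\": \"record\", \"name\": \"Address\", \"fields\": ["
          + "     {\"name\": \"street\", \"type\": \"string\"},"
          + "     {\"name\": \"city\", \"type\": \"string\"},"
          + "     {\"name\": \"state\", \"type\": \"string\"},"
          + "     {\"name\": \"zip\", \"type\": \"string\"},"
          + "     {\"name\": \"comment\", \"type\": [\"null\", \"string\"]}]}}}]}");

      public static void main(String[] args) throws Exception {
        Schema addressSchema = ADDRESS_BOOK.getField("addresses").schema().getElementType();

        GenericRecord address = new GenericData.Record(addressSchema);
        address.put("street", "1 Columnar Way");
        address.put("city", "San Jose");
        address.put("state", "CA");
        address.put("zip", "95113");
        address.put("comment", null); // optional field

        GenericRecord book = new GenericData.Record(ADDRESS_BOOK);
        book.put("addresses", Collections.singletonList(address));

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path(args[0]))
                     .withSchema(ADDRESS_BOOK)
                     .build()) {
          writer.write(book); // the schema travels with the file, in the footer
        }
      }
    }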
  33. 33. Write to Parquet with Scalding 33
    // define the Parquet source
    case class AddressBookParquetSource(override implicit val dateRange: DateRange)
      extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)
    // load and transform data
    …
    pipe.write(AddressBookParquetSource())
  34. 34. Write to Parquet with Pig 34
    …
    STORE mydata
      INTO 'my/data'
      USING parquet.pig.ParquetStorer();
  35. 35. Query engines
  36. 36. Scalding 36
    loading:
    new FixedPathParquetThrift[AddressBook]("my", "data") {
      val city = StringColumn("city")
      override val withFilter: Option[FilterPredicate] =
        Some(city === "San Jose")
    }
    operations:
    p.map( (r) => r.a + r.b )
    p.groupBy( (r) => r.c )
    p.join
    …
  37. 37. Pig 37
    loading:
    mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();
    operations:
    A = FOREACH mydata GENERATE a + b;
    B = GROUP mydata BY c;
    C = JOIN A BY a, B BY b;
  38. 38. Hive 38
    loading:
    create table parquet_table_name (x INT, y STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT "parquet.hive.MapredParquetInputFormat"
        OUTPUTFORMAT "parquet.hive.MapredParquetOutputFormat";
    operations: SQL!
  39. 39. Impala 39
    loading:
    create table parquet_table (x int, y string) stored as parquetfile;
    insert into parquet_table select x, y from some_other_table;
    select y from parquet_table where x between 70 and 100;
    operations: SQL!
  40. 40. Drill 40 SELECT * FROM dfs.`/my/data`
  41. 41. Spark SQL 41
    loading:
    val address = sqlContext.parquetFile("/my/data/addresses")
    operations:
    val result = sqlContext
      .sql("SELECT city FROM addresses WHERE zip == 94707")
    result.map((r) => …)
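Also not in the talk: because the file is self-describing (the schema is stored in the footer), a Parquet file can be read back directly, without any query engine. A minimal sketch with parquet-avro and generic records; the input path is whatever was written earlier.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    public class DumpParquetFile {
      public static void main(String[] args) throws Exception {
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
          GenericRecord record;
          while ((record = reader.read()) != null) {
            System.out.println(record); // each record printed using the file's own schema
          }
        }
      }
    }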
  42. 42. Community
  43. 43. Parquet timeline 43 - Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats - March 2013: OSS announcement; Criteo signs on for Hive integration - July 2013: 1.0 release. 18 contributors from more than 5 organizations. - May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases. - Parquet 2.0 coming as an Apache release
  44. 44. Thank you to our contributors 44 (chart of contributions over time, annotated with the open source announcement and the 1.0 release)
  45. 45. Get involved 45 Mailing lists: - dev@parquet.incubator.apache.org Parquet sync ups: - regular meetings on Google Hangout
  46. 46. Questions 46 Questions.foreach( answer(_) ) @ApacheParquet
