SlideShare une entreprise Scribd logo
1  sur  43
1 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement in Apache Spark 2.3
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
April 2018
2 © Hortonworks Inc. 2011–2018. All rights reserved
Dongjoon Hyun
• Hortonworks
• Principal Software Engineer @ Data Science Team
• Apache Project
• Apache REEF Project Management Committee(PMC) Member & Committer
• Apache Spark Project Contributor
• GitHub
• https://github.com/dongjoon-hyun
3 © Hortonworks Inc. 2011–2018. All rights reserved
Agenda
• What’s New in Apache Spark 2.3
• Previous ORC issues in Apache Spark
• Current Approach & Demo
• Performance & Limitation
• Future roadmap
4 © Hortonworks Inc. 2011–2018. All rights reserved
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements
with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming
Processing
Major Features Experimental Features
What’s New in Apache Spark 2.3
5 © Hortonworks Inc. 2011–2018. All rights reserved
Spark’s file-based data sources
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Popular for shared Hive tables
6 © Hortonworks Inc. 2011–2018. All rights reserved
Motivation
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Popular for shared Hive tables
Fast
Flexible
Hive Table Access
7 © Hortonworks Inc. 2011–2018. All rights reserved
Previous ORC Issues in Spark
8 © Hortonworks Inc. 2011–2018. All rights reserved
Background – The history of Spark and ORC
• Before Apache ORC
• Hive 1.2.1 (2015 JUN)  SPARK-2883 (Hive ORC is used since Spark 1.4)
• After Apache ORC
• v1.0.0 (2016 JAN)
• v1.1.0 (2016 JUN)
• v1.2.0 (2016 AUG)
• v1.3.0 (2017 JAN)
• v1.4.0 (2017 MAY)  SPARK-21422 (Apache ORC is added since Spark 2.3)
• v1.4.1 (2017 OCT)  SPARK-22300
• v1.4.3 (2018 FEB)  SPARK-23340 (Spark 2.4)
9 © Hortonworks Inc. 2011–2018. All rights reserved
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
10 © Hortonworks Inc. 2011–2018. All rights reserved
Issues with ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Fix column names in FileSinkOperator
• HIVE_12055(2015) Create row-by-row shims for the write path
• HIVE_13083(2016) Writing HiveDecimal can wrongly suppress
present stream
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer
timezones are different
11 © Hortonworks Inc. 2011–2018. All rights reserved
Issues with performance
• Vectorized ORC Reader (SPARK-16060)
• Fast read partition-column only (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
12 © Hortonworks Inc. 2011–2018. All rights reserved
• `FileNotFoundException` at writing
empty partitions as ORC
• Create structured steam with ORC files
Write (SPARK-15474) Read (SPARK-22781)
Issues with structured streaming
spark.readStream.orc(path)
13 © Hortonworks Inc. 2011–2018. All rights reserved
Issues with column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
14 © Hortonworks Inc. 2011–2018. All rights reserved
Issues with Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
• Introduced at Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
• Return wrong result if ORC file schema is different from Hive MetaStore schema order
• `convertMetastore` ignore storage property (SPARK-22158, Fixed at 2.2.1)
• `convertMetastoreOrc` is introduced in Spark 2.0, but it had several issues.
15 © Hortonworks Inc. 2011–2018. All rights reserved
Issues with robustness
• ORC metadata exceed ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
• `FileNotFound` at file names with special chars (SPARK-22146, Fixed in 2.2.1)
16 © Hortonworks Inc. 2011–2018. All rights reserved
Current Approach
17 © Hortonworks Inc. 2011–2018. All rights reserved
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
FileFormat
TextBasedFileFormat
ParquetFileFormat
OrcFileFormat
HiveFileFormat
JsonFileFormat
LibSVMFileFormat
CSVFileFormat
TextFileFormat
o.a.s.sql.execution.datasources
o.a.s.ml.source.libsvmo.a.s.sql.hive.orc
OrcFileFormat
`hive` OrcFileFormat
from Hive 1.2.1
`native` OrcFileFormat
with ORC 1.4.3
18 © Hortonworks Inc. 2011–2018. All rights reserved
In Reality – Four cases for ORC Reader/Writer
`hive` Reader`native` Reader
`hive` Writer
`native` Writer
• New Data
• New Apps
• Best performance
(Vectorized Reader)
• New Data
• Old Apps
• Improved performance
(Non-vectorized Reader)
• Old Data
• New Apps
• Improved performance
(Vectorized Reader)
• Old Data
• Old Apps
• As-Is performance
(Non-vectorized Reader)
1
2
3
4
19 © Hortonworks Inc. 2011–2018. All rights reserved
Performance – Single column scan from wide tables
Number of columns
Time
(ms)
1M rows with all BIGINT columns
0
200
400
600
800
1000
1200
100 200 300
native writer / native reader hive writer / native reader
native writer / hive reader hive writer / hive reader
4x 1
2
3
4
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
20 © Hortonworks Inc. 2011–2018. All rights reserved
How to specify `native` OrcFileFormat directly
CREATE TABLE people (name string, age int)
USING org.apache.spark.sql.execution.datasources.orc
df.write
.format("org.apache.spark.sql.execution.datasources.orc")
.save(path)
spark.read
.format("org.apache.spark.sql.execution.datasources.orc")
.load(path)
Read Dataset
Write Dataset
Create ORC Table
21 © Hortonworks Inc. 2011–2018. All rights reserved
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
CREATE TABLE people (name string, age int)
USING ORC OPTIONS (orc.compress 'ZLIB')
spark.read.orc(path)
df.write.orc(path)
spark.read.format("orc").load (path)
df.write.format("orc").save(path)
Read/Write Dataset
Read/Write Dataset
Create ORC Table
22 © Hortonworks Inc. 2011–2018. All rights reserved
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)
df.writeStream
.option("checkpointLocation", path1)
.format("orc")
.option("path", path2)
.start
Read/Write
Structured Stream
23 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Readers with `spark.sql.` configurations
orc.impl
# of cols <= codegen.maxFields
`native`
`hive` ORC Reader
`hive`
true
spark.sql.codegen.maxFields=100 (default)
false
`native` ORC Columnar Batch Reader
all atomic types
true
false
`native` ORC Record Reader
orc.enableVectorizedReader false
true
24 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Readers with `spark.sql.` configurations – Cont.
orc.enableVectorizedReader
Wrapping
ORC ColumnVector 
Spark OrcColumnVector
orc.copyBatchToSpark
true
false
Copying
ORC ColumnVector 
Spark OnHeapColumnVector
true
columnVector.offheap.enabled
true
Copying
ORC ColumnVector 
Spark OffHeapColumnVector
false
`native` ORC Columnar Batch Reader
25 © Hortonworks Inc. 2011–2018. All rights reserved
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
• `spark.sql.orc.impl=native` is required, too.
CREATE TABLE people (name string, age int)
STORED AS ORC
CREATE TABLE people (name string, age int)
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
SPARK-23355
26 © Hortonworks Inc. 2011–2018. All rights reserved
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
• Before SPARK-21929, users drop and recreate ORC table with an updated schema.
• User-defined schema reduces schema inference cost and handles upcasting
• boolean -> byte -> short -> int -> long
• float -> double
spark.read.schema("col1 int").orc(path)
spark.read.schema("col1 long, col2 long").orc(path)
27 © Hortonworks Inc. 2011–2018. All rights reserved
Schema evolution at reading file-based data sources – Cont.
1. Native Vectorized ORC Reader
2. Only safe change via upcasting
3. JSON is the most flexible for changing types
File Format TEXT CSV JSON ORC
`hive`
ORC
`native`1
PARQUET
Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️
Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️
Hide Column ✔️ ✔️ ✔️
Change Type2 ✔️ ✔️3 ✔️
Change Position ✔️ ✔️ ✔️
28 © Hortonworks Inc. 2011–2018. All rights reserved
Demo 1
ORC configuration
29 © Hortonworks Inc. 2011–2018. All rights reserved
Demo 2
PySpark with ORC
30 © Hortonworks Inc. 2011–2018. All rights reserved
Performance
31 © Hortonworks Inc. 2011–2018. All rights reserved
Micro Benchmark
• Target
• Apache Spark 2.3.0
• Apache ORC 1.4.1
• Machine
• MacBook Pro (2015 Mid)
• Intel® Core™ i7-4770JQ CPI @ 2.20GHz
• Mac OS X 10.13.4
• JDK 1.8.0_161
32 © Hortonworks Inc. 2011–2018. All rights reserved
Performance – Single column scan from wide tables
Number of columns
Time
(ms)
1M rows with all BIGINT columns
0
200
400
600
800
1000
1200
100 200 300
native writer / native reader hive writer / hive reader
4x
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
33 © Hortonworks Inc. 2011–2018. All rights reserved
Performance – Vectorized Read
0
500
1000
1500
2000
2500
TINYINT SMALLINT INT BIGINT FLOAT DOULBE
native hive
15M rows in a single-column table
Time
(ms)
10x
5x
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
11x
34 © Hortonworks Inc. 2011–2018. All rights reserved
Performance – Partitioned table read
0
500
1000
1500
2000
2500
Data column Partition column Both columns
native hive
Time
(ms)
21x7x
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
15M rows in a partitioned table
35 © Hortonworks Inc. 2011–2018. All rights reserved
Predicate Pushdown
0 2000 4000 6000 8000 10000 12000 14000 16000 18000
Select 10% rows (id < value)
Select 50% rows (id < value)
Select 90% rows (id < value)
Select all rows (id IS NOT NULL)
parquet native Time (ms)
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
15M rows with 5 data columns and 1 sequential id column
36 © Hortonworks Inc. 2011–2018. All rights reserved
Limitation
Future Roadmap
37 © Hortonworks Inc. 2011–2018. All rights reserved
Limitation
• Spark vectorization supports atomic types only
• Limited simple schema evolution. JSON provides more
• boolean -> byte -> short -> int -> long
• float -> double
• `convertMetastore` ignores `STORED AS` table properties (SPARK-23355)
• Both ORC/Parquet
38 © Hortonworks Inc. 2011–2018. All rights reserved
Future Roadmap – Apache Spark 2.4 (2018 Fall)
• Feature Parity for ORC with Parquet (SPARK-20901)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
39 © Hortonworks Inc. 2011–2018. All rights reserved
Future Roadmap – On-going work
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Support CHAR/VARCHAR Types
• Vectorized Writer with DataSource V2
• ALTER TABLE … CHANGE column type (SPARK-18727)
40 © Hortonworks Inc. 2011–2018. All rights reserved
Summary
• Apache Spark 2.3 starts to take advantage of Apache ORC
• Native vectorized ORC reader
• boosts Spark ORC performance
• provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
41 © Hortonworks Inc. 2011–2018. All rights reserved
Reference
• https://youtu.be/ZVSD9EsQl-8, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-
spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-
met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data,
Dataworks Summit 2017 San Jose
42 © Hortonworks Inc. 2011–2018. All rights reserved
Questions?
43 © Hortonworks Inc. 2011–2018. All rights reserved
Thank you

Contenu connexe

Tendances

ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compactionMIJIN AN
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseDataWorks Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 

Tendances (20)

ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
An Introduction to Druid
An Introduction to DruidAn Introduction to Druid
An Introduction to Druid
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Similaire à ORC improvement in Apache Spark 2.3

ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3Dongjoon Hyun
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4Dongjoon Hyun
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4DataWorks Summit
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4DataWorks Summit
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionDataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionDataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesBig Data Spain
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleHortonworks
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark Summit
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyondXiao Li
 

Similaire à ORC improvement in Apache Spark 2.3 (20)

ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
HiveWarehouseConnector
HiveWarehouseConnectorHiveWarehouseConnector
HiveWarehouseConnector
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Dernier (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

ORC improvement in Apache Spark 2.3

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved ORC Improvement in Apache Spark 2.3 Dongjoon Hyun Principal Software Engineer @ Hortonworks Data Science Team April 2018
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Dongjoon Hyun • Hortonworks • Principal Software Engineer @ Data Science Team • Apache Project • Apache REEF Project Management Committee(PMC) Member & Committer • Apache Spark Project Contributor • GitHub • https://github.com/dongjoon-hyun
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Agenda • What’s New in Apache Spark 2.3 • Previous ORC issues in Apache Spark • Current Approach & Demo • Performance & Limitation • Future roadmap
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features What’s New in Apache Spark 2.3
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Spark’s file-based data sources • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Popular for shared Hive tables
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Motivation • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Popular for shared Hive tables Fast Flexible Hive Table Access
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Previous ORC Issues in Spark
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Background – The history of Spark and ORC • Before Apache ORC • Hive 1.2.1 (2015 JUN)  SPARK-2883 (Hive ORC is used since Spark 1.4) • After Apache ORC • v1.0.0 (2016 JAN) • v1.1.0 (2016 JUN) • v1.2.0 (2016 AUG) • v1.3.0 (2017 JAN) • v1.4.0 (2017 MAY)  SPARK-21422 (Apache ORC is added since Spark 2.3) • v1.4.1 (2017 OCT)  SPARK-22300 • v1.4.3 (2018 FEB)  SPARK-23340 (Spark 2.4)
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Six Issue Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved Issues with ORC Writer Versions • ORIGINAL • HIVE_8732 (2014) ORC string statistics are not merged correctly • HIVE_4243 (2015) Fix column names in FileSinkOperator • HIVE_12055(2015) Create row-by-row shims for the write path • HIVE_13083(2016) Writing HiveDecimal can wrongly suppress present stream • ORC_101 (2016) Correct the use of the default charset in bloomfilter • ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Issues with performance • Vectorized ORC Reader (SPARK-16060) • Fast read partition-column only (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved • `FileNotFoundException` at writing empty partitions as ORC • Create structured steam with ORC files Write (SPARK-15474) Read (SPARK-22781) Issues with structured streaming spark.readStream.orc(path)
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Issues with column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Issues with Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) • Introduced at Spark 2.2, but throws AnalysisException for ORC • Support column positional mismatch (SPARK-22267) • Return wrong result if ORC file schema is different from Hive MetaStore schema order • `convertMetastore` ignore storage property (SPARK-22158, Fixed at 2.2.1) • `convertMetastoreOrc` is introduced in Spark 2.0, but it had several issues.
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved Issues with robustness • ORC metadata exceed ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305) • `FileNotFound` at file names with special chars (SPARK-22146, Fixed in 2.2.1)
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved Current Approach
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved Supports two ORC file formats • Adding a new OrcFileFormat (SPARK-20682) FileFormat TextBasedFileFormat ParquetFileFormat OrcFileFormat HiveFileFormat JsonFileFormat LibSVMFileFormat CSVFileFormat TextFileFormat o.a.s.sql.execution.datasources o.a.s.ml.source.libsvmo.a.s.sql.hive.orc OrcFileFormat `hive` OrcFileFormat from Hive 1.2.1 `native` OrcFileFormat with ORC 1.4.3
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved In Reality – Four cases for ORC Reader/Writer `hive` Reader`native` Reader `hive` Writer `native` Writer • New Data • New Apps • Best performance (Vectorized Reader) • New Data • Old Apps • Improved performance (Non-vectorized Reader) • Old Data • New Apps • Improved performance (Vectorized Reader) • Old Data • Old Apps • As-Is performance (Non-vectorized Reader) 1 2 3 4
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / native reader native writer / hive reader hive writer / hive reader 4x 1 2 3 4 https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved How to specify `native` OrcFileFormat directly CREATE TABLE people (name string, age int) USING org.apache.spark.sql.execution.datasources.orc df.write .format("org.apache.spark.sql.execution.datasources.orc") .save(path) spark.read .format("org.apache.spark.sql.execution.datasources.orc") .load(path) Read Dataset Write Dataset Create ORC Table
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB') spark.read.orc(path) df.write.orc(path) spark.read.format("orc").load (path) df.write.format("orc").save(path) Read/Write Dataset Read/Write Dataset Create ORC Table
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) spark.readStream.orc(path) spark.readStream.format("orc").load(path) df.writeStream .option("checkpointLocation", path1) .format("orc") .option("path", path2) .start Read/Write Structured Stream
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved ORC Readers with `spark.sql.` configurations orc.impl # of cols <= codegen.maxFields `native` `hive` ORC Reader `hive` true spark.sql.codegen.maxFields=100 (default) false `native` ORC Columnar Batch Reader all atomic types true false `native` ORC Record Reader orc.enableVectorizedReader false true
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved ORC Readers with `spark.sql.` configurations – Cont. orc.enableVectorizedReader Wrapping ORC ColumnVector  Spark OrcColumnVector orc.copyBatchToSpark true false Copying ORC ColumnVector  Spark OnHeapColumnVector true columnVector.offheap.enabled true Copying ORC ColumnVector  Spark OffHeapColumnVector false `native` ORC Columnar Batch Reader
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved Support vectorized read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) • `spark.sql.orc.impl=native` is required, too. CREATE TABLE people (name string, age int) STORED AS ORC CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip') SPARK-23355
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources • Frequently, new files can have wider column types or new columns • Before SPARK-21929, users drop and recreate ORC table with an updated schema. • User-defined schema reduces schema inference cost and handles upcasting • boolean -> byte -> short -> int -> long • float -> double spark.read.schema("col1 int").orc(path) spark.read.schema("col1 long, col2 long").orc(path)
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources – Cont. 1. Native Vectorized ORC Reader 2. Only safe change via upcasting 3. JSON is the most flexible for changing types File Format TEXT CSV JSON ORC `hive` ORC `native`1 PARQUET Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️ Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️ Hide Column ✔️ ✔️ ✔️ Change Type2 ✔️ ✔️3 ✔️ Change Position ✔️ ✔️ ✔️
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved Demo 1 ORC configuration
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved Demo 2 PySpark with ORC
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved Performance
  • 31. 31 © Hortonworks Inc. 2011–2018. All rights reserved Micro Benchmark • Target • Apache Spark 2.3.0 • Apache ORC 1.4.1 • Machine • MacBook Pro (2015 Mid) • Intel® Core™ i7-4770JQ CPI @ 2.20GHz • Mac OS X 10.13.4 • JDK 1.8.0_161
  • 32. 32 © Hortonworks Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / hive reader 4x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 33. 33 © Hortonworks Inc. 2011–2018. All rights reserved Performance – Vectorized Read 0 500 1000 1500 2000 2500 TINYINT SMALLINT INT BIGINT FLOAT DOULBE native hive 15M rows in a single-column table Time (ms) 10x 5x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala 11x
  • 34. 34 © Hortonworks Inc. 2011–2018. All rights reserved Performance – Partitioned table read 0 500 1000 1500 2000 2500 Data column Partition column Both columns native hive Time (ms) 21x7x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala 15M rows in a partitioned table
  • 35. 35 © Hortonworks Inc. 2011–2018. All rights reserved Predicate Pushdown 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Select 10% rows (id < value) Select 50% rows (id < value) Select 90% rows (id < value) Select all rows (id IS NOT NULL) parquet native Time (ms) https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala 15M rows with 5 data columns and 1 sequential id column
  • 36. 36 © Hortonworks Inc. 2011–2018. All rights reserved Limitation Future Roadmap
  • 37. 37 © Hortonworks Inc. 2011–2018. All rights reserved Limitation • Spark vectorization supports atomic types only • Limited simple schema evolution. JSON provides more • boolean -> byte -> short -> int -> long • float -> double • `convertMetastore` ignores `STORED AS` table properties (SPARK-23355) • Both ORC/Parquet
  • 38. 38 © Hortonworks Inc. 2011–2018. All rights reserved Future Roadmap – Apache Spark 2.4 (2018 Fall) • Feature Parity for ORC with Parquet (SPARK-20901) • Use `native` ORC implementation by default (SPARK-23456) • Use ORC predicate pushdown by default (SPARK-21783) • Use `convertMetastoreOrc` by default (SPARK-22279) • Test ORC as default data source format (SPARK-23553) • Test and support Bloom Filters (SPARK-12417)
  • 39. 39 © Hortonworks Inc. 2011–2018. All rights reserved Future Roadmap – On-going work • Support VectorUDT/MatrixUDT (SPARK-22320) • Support CHAR/VARCHAR Types • Vectorized Writer with DataSource V2 • ALTER TABLE … CHANGE column type (SPARK-18727)
  • 40. 40 © Hortonworks Inc. 2011–2018. All rights reserved Summary • Apache Spark 2.3 starts to take advantage of Apache ORC • Native vectorized ORC reader • boosts Spark ORC performance • provides better schema evolution ability • Structured streaming starts to work with ORC (both reader/writer) • Spark is going to become faster and faster with ORC
  • 41. 41 © Hortonworks Inc. 2011–2018. All rights reserved Reference • https://youtu.be/ZVSD9EsQl-8, ORC configuration in Apache Spark 2.3 • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache- spark-22.html • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc- met-apache-spark-81023199, Dataworks Summit 2017 Sydney • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
  • 42. 42 © Hortonworks Inc. 2011–2018. All rights reserved Questions?
  • 43. 43 © Hortonworks Inc. 2011–2018. All rights reserved Thank you