Adaptive Query Execution:
Speeding Up Spark SQL at Runtime
Maryann Xue Staff Engineer @ Databricks
Ke Jia Software Engineer @ Intel
Agenda
Maryann Xue
What is Adaptive Query Execution (AQE)?
Why AQE?
How Does AQE Work?
The Major Optimizations in Spark 3.0
Ke Jia
Live Demo
TPC-DS Performance
AQE in Production
Background
▪ Well-studied problem in database literature
▪ Primitive version in Spark 1.6
▪ New AQE prototyped and experimented with by Intel Big Data
▪ Databricks and Intel co-engineered new AQE in Spark 3.0
What is Adaptive Query Execution (AQE)?
Dynamic query optimization that happens in the middle of query
execution based on runtime statistics.
Why AQE?
Cost-based optimization (CBO) aims to choose the best plan, but does
NOT work well when:
▪ Stale or missing statistics lead to inaccurate estimates
▪ Statistics collection is too costly (e.g., column histograms)
▪ Predicates contain UDFs
▪ Hints do not work for rapidly evolving data
AQE bases all optimization decisions on accurate runtime statistics
Query Stages
▪ Shuffle or broadcast exchanges divide a query into query stages
▪ Intermediate results are materialized at the end of a query stage
▪ Query stage boundaries are optimal for runtime optimization:
  - They are the inherent break points of operator pipelines
  - Statistics are available, e.g., data size, partition sizes
Example: SELECT x, avg(y) FROM t GROUP BY x ORDER BY avg(y)
[Diagram: SCAN -> AGGREGATE (partial) -> SHUFFLE -> AGGREGATE (final) -> SHUFFLE -> SORT; each SHUFFLE is a pipeline break point that ends a query stage]
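The idea that exchanges cut a plan into query stages can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the plan is modeled as a linear list of operator names (bottom-up), and `EXCHANGES` and `split_into_stages` are hypothetical names, not Spark APIs.

```python
# Operators that end a query stage in this sketch (illustrative names only).
EXCHANGES = {"SHUFFLE", "BROADCAST"}


def split_into_stages(operators_bottom_up):
    """Cut a linear operator pipeline into stages, closing a stage at each exchange."""
    stages, current = [], []
    for op in operators_bottom_up:
        current.append(op)
        if op in EXCHANGES:
            # The exchange materializes its output, so the pipeline breaks here.
            stages.append(current)
            current = []
    if current:
        # Operators after the last exchange form the final stage.
        stages.append(current)
    return stages


# Plan for: SELECT x, avg(y) FROM t GROUP BY x ORDER BY avg(y)
plan = ["SCAN", "AGGREGATE (partial)", "SHUFFLE", "AGGREGATE (final)", "SHUFFLE", "SORT"]
stages = split_into_stages(plan)
for number, stage in enumerate(stages, start=1):
    print(number, " -> ".join(stage))
```

A real plan is a tree rather than a list, so this only mirrors the single-branch example query.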
How AQE works
1. Run leaf stages
2. Optimize when any stage completes -- new stats available
3. Run more stages whose dependencies are satisfied
4. Repeat (2) and (3) until there are no more stages to run
[Flowchart: run query stages with dependencies cleared -> optimize rest of the query -> more stages? -> if yes, repeat; if no, done]
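The run/optimize loop can be sketched as a small scheduler simulation. This is a simplified model, not Spark's implementation: stages run one at a time, "optimize" is only recorded in a log, and `run_adaptively` and the stage names are hypothetical.

```python
def run_adaptively(deps):
    """Simulate the AQE loop.

    deps maps each stage name to the set of stage names it depends on.
    Returns the ordered list of events.
    """
    done = set()
    log = []
    while len(done) < len(deps):
        # Stages whose dependencies have all completed (leaf stages first).
        ready = sorted(s for s, d in deps.items() if s not in done and d <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for stage in ready:
            log.append(f"run {stage}")
            done.add(stage)
            if len(done) < len(deps):
                # New runtime statistics are available: re-optimize the rest.
                log.append(f"optimize after {stage}")
    return log


# Two leaf stages feeding a join stage.
events = run_adaptively({"scan_a": set(), "scan_b": set(), "join": {"scan_a", "scan_b"}})
print(events)
```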
The Major AQE Features in Spark 3.0
▪ Dynamically coalesce shuffle partitions
▪ Dynamically switch join strategies
▪ Dynamically optimize skew joins
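For readers who want to try these features, the sketch below collects the Spark SQL settings that, as far as the Spark 3.0 configuration documentation describes, turn them on. The keys are not taken from the slides, and `to_submit_args` is a hypothetical helper that only formats them as `spark-submit` arguments. Join strategy switching has no flag of its own in this sketch; it is assumed to follow from enabling AQE together with the usual broadcast threshold.

```python
# Spark SQL settings related to the three AQE features (values are strings,
# as they would be passed on the command line).
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                     # master switch for AQE
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # coalesce shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # optimize skew joins
}


def to_submit_args(conf):
    """Format a config dict as a flat list of spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(AQE_CONF)))
```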
Dynamically coalesce shuffle partitions -- Why? (1)
Shuffle partition number and sizes are crucial to query performance
Partition too small:
▪ Inefficient I/O
▪ Scheduler overhead
▪ Task setup overhead
Partition too large:
▪ GC pressure
▪ Disk spilling
Dynamically coalesce shuffle partitions -- Why? (2)
Problem:
▪ One universal partition number throughout the entire query execution
▪ Data size changes at different times of query execution
Solution by AQE:
▪ Set the initial partition number high to accommodate the largest data
size of the entire query execution
▪ Automatically coalesce partitions if needed after each query stage
Dynamically coalesce shuffle partitions -- When?
Example: SELECT x, avg(y) FROM t GROUP BY x ORDER BY avg(y)
1. Run leaf stages: SCAN -> AGGREGATE (partial) -> SHUFFLE (50 part.)
2. Optimize: stage 1 complete (total: 650MB, avg: 13MB) -> COALESCE (10 part.) added above the first shuffle
3. Run more stages: AGGREGATE (final) -> SHUFFLE (50 part.)
4. Optimize: stage 2 complete (total: 300MB, avg: 6MB) -> COALESCE (5 part.) added above the second shuffle, below SORT
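The partition sizes in this example follow from simple division. The short check below reproduces the averages before coalescing and shows the resulting averages after coalescing; `avg_partition_mb` is a hypothetical helper, and the "after" figures are derived here rather than stated on the slide.

```python
def avg_partition_mb(total_mb, num_partitions):
    """Average partition size in MB for a shuffle of the given total size."""
    return total_mb / num_partitions


# Stage 1: 650MB shuffled into 50 partitions, then coalesced to 10.
stage1_before = avg_partition_mb(650, 50)
stage1_after = avg_partition_mb(650, 10)

# Stage 2: 300MB shuffled into 50 partitions, then coalesced to 5.
stage2_before = avg_partition_mb(300, 50)
stage2_after = avg_partition_mb(300, 5)

print(stage1_before, stage1_after, stage2_before, stage2_after)
```

Both stages end up with partitions of roughly 60MB to 65MB, even though they started from the same static partition number.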
Dynamically coalesce shuffle partitions -- How? (1)
Regular shuffle -- no coalescing
▪ Partitioned into statically specified partition number -- in this case, 5
[Diagram: MAP 1 and MAP 2 each write 5 shuffle partitions, read by REDUCE 1 to REDUCE 5]
Dynamically coalesce shuffle partitions -- How? (2)
AQE coalesced shuffle
▪ Combine adjacent small partitions -- in this case, orig. partitions 2, 3, 4
[Diagram: MAP 1 and MAP 2 feed REDUCE 1', REDUCE 2' (COALESCED, reads orig. partitions 2, 3, 4) and REDUCE 3']
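Combining adjacent small partitions can be sketched as a greedy pass over the partition sizes. This is a minimal illustration, assuming a target size per coalesced partition; `coalesce_adjacent`, the sizes, and the 64MB target are made up for the example and are not Spark's actual rule or defaults.

```python
def coalesce_adjacent(sizes_mb, target_mb):
    """Greedily group adjacent partitions so each group stays within target_mb.

    Returns a list of groups, each a list of original partition indices (0-based).
    A single partition larger than the target forms its own group.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        if current and current_size + size > target_mb:
            # Adding this partition would exceed the target: close the group.
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Five original partitions; the three small ones in the middle get merged.
groups = coalesce_adjacent([60, 20, 15, 25, 55], target_mb=64)
print(groups)
```

Only adjacent partitions are merged, so each coalesced reducer reads one contiguous range of the original partitions.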
Dynamically switch join strategies -- Why?
Spark chooses Broadcast Hash Join (BHJ) if either child of the join can fit well in
memory.
Problem: estimates can go wrong and the opportunity to use BHJ can be
missed:
▪ Statistics insufficient for accurate cardinality or selectivity estimates
▪ Child relation being a complex subtree of operators
▪ Blackbox predicates, e.g., UDFs
Solution by AQE: replan joins with runtime data sizes.
Dynamically switch join strategies -- When & How?
Example: SELECT * FROM a JOIN b ON a.key = b.key WHERE b.value LIKE '%xyz%'
1. Run leaf stages: SCAN A -> SHUFFLE (stage 1); SCAN B -> FILTER -> SHUFFLE (stage 2); initial plan: SORT MERGE JOIN with a SORT on each side
2. Optimize: stage 2 complete (est: 25MB, actual: 8MB) -> SORT MERGE JOIN replaced by BROADCAST HASH JOIN, with a BROADCAST above the shuffle of B
3. Run more stages: BROADCAST (stage 3), then BROADCAST HASH JOIN
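The replanning decision can be illustrated with the 25MB estimate and 8MB actual size from this example. The sketch assumes a 10MB broadcast threshold, which is the documented default of `spark.sql.autoBroadcastJoinThreshold`; `choose_join_strategy` is a hypothetical function and ignores everything else the real planner considers.

```python
MB = 1024 * 1024


def choose_join_strategy(build_side_bytes, broadcast_threshold_bytes=10 * MB):
    """Pick a join strategy from the size of the smaller join input."""
    if build_side_bytes <= broadcast_threshold_bytes:
        return "BROADCAST HASH JOIN"
    return "SORT MERGE JOIN"


# Static planning sees the 25MB estimate; AQE sees the 8MB actual size.
planned = choose_join_strategy(25 * MB)
replanned = choose_join_strategy(8 * MB)
print(planned, "->", replanned)
```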
Dynamically optimize skew joins -- Why?
Problem: data skew can lead to significant performance degradation
▪ Individual long-running tasks slow down the entire stage
▪ Especially large partitions cause further slowdown through disk spilling
Solution by AQE: handle skew joins automatically using runtime statistics
▪ Detect skew from partition sizes
▪ Split skewed partitions into smaller subpartitions
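Detecting skew from partition sizes can be sketched as a comparison against the median. The rule below is modeled on how the Spark documentation describes `spark.sql.adaptive.skewJoin.skewedPartitionFactor` and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`; the function name, the sizes, the factor of 4, and the 200MB threshold are illustrative assumptions, not Spark defaults.

```python
from statistics import median


def find_skewed(sizes_mb, factor, threshold_mb):
    """Return indices of partitions that are both much larger than the median
    (by `factor`) and larger than an absolute threshold."""
    med = median(sizes_mb)
    return [
        index
        for index, size in enumerate(sizes_mb)
        if size > factor * med and size > threshold_mb
    ]


# Illustrative partition sizes in MB: median 55, one outlier of 250.
sizes = [40, 55, 55, 250]
skewed = find_skewed(sizes, factor=4, threshold_mb=200)
print(skewed)
```

Requiring both conditions keeps uniformly small partitions from being flagged just because one of them is a few times the median.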
Dynamically optimize skew joins -- When?
Example: SELECT * FROM a JOIN b ON a.col = b.col
1. Run leaf stages: SCAN A -> SHUFFLE (stage 1); SCAN B -> SHUFFLE (stage 2); initial plan: SORT MERGE JOIN with a SORT on each side
2. Optimize: stage 1 complete (med: 55MB, min: 40MB, max: 250MB) and stage 2 complete -> SKEW READER added above each SHUFFLE, below each SORT
Dynamically optimize skew joins -- How? (1)
Regular sort merge join -- no skew optimization:
[Diagram: TABLE A - MAP 1 to MAP 3 write PART. A0 to A3; TABLE B - MAP 1 and MAP 2 write PART. B0 to B3; JOIN pairs A0 with B0, A1 with B1, A2 with B2, A3 with B3]
Dynamically optimize skew joins -- How? (2)
Skew-optimized sort merge join -- with skew shuffle reader:
[Diagram: TABLE A - MAP 1 to MAP 3 and TABLE B - MAP 1 and MAP 2 as before; skewed PART. A0 is split into A0 - S0, A0 - S1 and A0 - S2; B0 is duplicated so that each split joins a full copy of B0; PART. A1 to A3 join PART. B1 to B3 unchanged]
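The split-and-duplicate scheme can be sketched by listing which pieces each join task reads. This is an illustration only: `plan_join_tasks` is a hypothetical function, the labels mirror the diagram, and the way a partition is actually split is left out.

```python
def plan_join_tasks(num_partitions, skewed_left, num_splits):
    """List (left piece, right piece) pairs, one per join task.

    Partitions of table A listed in `skewed_left` are split into `num_splits`
    subpartitions; the matching partition of table B is read once per split.
    """
    tasks = []
    for p in range(num_partitions):
        if p in skewed_left:
            for s in range(num_splits):
                # Each split of A joins a full (duplicated) copy of B's partition.
                tasks.append((f"A{p} - S{s}", f"B{p}"))
        else:
            tasks.append((f"A{p}", f"B{p}"))
    return tasks


# Four partitions per table; A0 is skewed and split three ways.
tasks = plan_join_tasks(4, skewed_left={0}, num_splits=3)
for left, right in tasks:
    print(left, "JOIN", right)
```

Duplicating B0 is what keeps the join result correct: every row of the skewed partition still meets every matching row on the other side.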
About Me
Ke Jia
Big Data Product Engineer at Intel
Contributor of Spark, OAP and Hive
Demo
Try this notebook in Databricks
TPC-DS Performance (3TB) -- Cluster Setup
Hardware: BDW
Slave nodes (5)
  CPU            Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (96 cores)
  Memory         384 GB
  Disk           7× 1 TB SSD
  Network        10 Gigabit Ethernet
Master node
  CPU            Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (96 cores)
  Memory         384 GB
  Disk           7× 1 TB SSD
  Network        10 Gigabit Ethernet
Software
  OS             Fedora release 29
  Kernel         4.20.6-200.fc29.x86_64
  Spark*         Spark master (commit ID: 0b6aae422ba37a13531e98c8801589f5f3cb28e0)
  Hadoop*/HDFS*  hadoop-2.7.5
  JDK            1.8.0_110 (Oracle* Corporation)
TPC-DS Performance (3TB) -- Results
[Bar chart: per-query speedups with AQE: 1.76x, 1.5x, 1.41x, 1.4x, 1.38x, 1.28x, 1.27x, 1.22x, 1.21x, 1.19x]
▪ Over 1.5x speedup on 2 queries; over 1.1x speedup on 37 queries
TPC-DS Performance (3TB) -- Partition Coalescing
• Less scheduler overhead and task startup time.
• Fewer disk I/O requests.
• Less data is written to disk because more data is aggregated.
Partition number: 1000 (Q8 without AQE)
Partition number changed to 658 and 717 (Q8 with AQE)
TPC-DS Performance (3TB) -- Join Strategies
• Random I/O read -> sequential I/O read.
• Remote shuffle read -> local shuffle read.
SortMergeJoin (Q14b without AQE)
Broadcast Hash Join (Q14b with AQE)
AQE in Production
▪ Performance shared by one of the largest e-commerce companies in China
AQE helped them resolve critical data skew issues and achieve significant performance gains for
online business queries. The AQE engine achieved speedups of 17.7x, 14.0x, 1.6x and 1.3x, respectively, on 4 typical
skewed queries.
▪ Performance shared by one of the largest internet companies in China
AQE achieved 5x and 1.38x speedups on two typical queries in their production
environment.
AQE in Production -- Skew Join Optimization
▪ Select the user's invoice details based on the sale order id, invoice
content id and the invoice type id.
key              Avg record   Skew records   comment
sale_order_id    2000         15383717       NULL
ivc_content_id   9231804      4077995632     Not NULL
ivc_type_id      360          3582336345     Not NULL
[Bar chart: Durations(s); speedups with AQE: 17.7x, 14.0x, 1.6x, 1.3x]

Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Adaptive Query Execution: Speeding Up Spark SQL at Runtime

  • 2. Adaptive Query Execution: Speeding Up Spark SQL at Runtime Maryann Xue Staff Engineer @ Databricks Ke Jia Software Engineer @ Intel
  • 3. Agenda Maryann Xue What is Adaptive Query Execution (AQE)? Why AQE? How AQE Works? The Major Optimizations in Spark 3.0 Ke Jia Live Demo TPC-DS Performance AQE in Production
  • 4. Background ▪ Well-studied problem in database literature ▪ Primitive version in Spark 1.6 ▪ New AQE prototyped and experimented with by Intel Big Data ▪ Databricks and Intel co-engineered the new AQE in Spark 3.0
  • 5. What is Adaptive Query Execution (AQE)? Dynamic query optimization that happens in the middle of query execution based on runtime statistics.
  • 6. Why AQE? Cost-based optimization (CBO) aims to choose the best plan, but does NOT work well when: ▪ Stale or missing statistics lead to inaccurate estimates ▪ Statistics collection is too costly (e.g., column histograms) ▪ Predicates contain UDFs ▪ Hints do not work for rapidly evolving data. AQE bases all optimization decisions on accurate runtime statistics.
  • 7. Query Stages ▪ Shuffle or broadcast exchanges divide a query into query stages ▪ Intermediate results are materialized at the end of a query stage ▪ Query stage boundaries optimal for runtime optimization: the inherent break point of operator pipelines; statistics available, e.g., data size, partition sizes [Diagram for SELECT x, avg(y) FROM t GROUP BY x ORDER BY avg(y); plan, bottom to top: SCAN -> AGGREGATE (partial) -> SHUFFLE -> AGGREGATE (final) -> SHUFFLE -> SORT; each SHUFFLE is a pipeline break point that ends a query stage]
  • 8. How AQE works 1. Run leaf stages 2. Optimize when any stage completes -- new stats available 3. Run more stages with dependency requirement satisfied 4. Repeat (2) (3) until no more stages to run [Flowchart: run query stages with dep. cleared -> optimize rest of the query -> more stages? (yes: run again; no: done)]
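
The four steps on slide 8 can be sketched as a small scheduling loop. This is a toy model, not Spark's implementation: `run_adaptively`, `run_stage` and `reoptimize` are hypothetical names, stage statistics are reduced to a single byte count, and ready stages run one after another rather than concurrently.

```python
from typing import Callable, Dict, List, Set


def run_adaptively(
    deps: Dict[str, Set[str]],
    run_stage: Callable[[str], int],
    reoptimize: Callable[[Dict[str, int]], None],
) -> List[str]:
    """Run stages in dependency order, re-optimizing after each completion.

    deps maps a stage name to the names of the stages it depends on.
    run_stage executes one stage and returns its observed output size.
    reoptimize is called with all statistics collected so far.
    """
    stats: Dict[str, int] = {}  # stage name -> observed output size
    order: List[str] = []
    pending = dict(deps)
    while pending:  # step 4: repeat until no more stages to run
        # Steps 1 and 3: stages whose dependencies are all materialized.
        ready = sorted(s for s, d in pending.items() if d.issubset(stats))
        if not ready:
            raise ValueError("cyclic or unsatisfiable stage dependencies")
        for stage in ready:
            stats[stage] = run_stage(stage)
            order.append(stage)
            del pending[stage]
            reoptimize(dict(stats))  # step 2: new stats are available
    return order
```

For the example query on slide 7, `deps` would hold three stages: the partial aggregate (a leaf), the final aggregate that depends on it, and the sort that depends on the final aggregate.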
  • 9. The AQE Major Features in Spark 3.0 ▪ Dynamically coalesce shuffle partitions ▪ Dynamically switch join strategies ▪ Dynamically optimize skew joins
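
As a rough guide to switching these features on, the sketch below collects the related Spark 3.0 SQL settings in a dictionary and renders them as `spark-submit` arguments. The helper `spark_submit_flags` is a hypothetical name introduced here; the configuration keys are the ones used by Spark 3.0, but defaults differ between releases, so check the configuration reference for your version.

```python
from typing import Dict, List

# AQE-related settings as named in Spark 3.0. Join strategy switching has no
# flag of its own: it follows from the master switch together with
# spark.sql.autoBroadcastJoinThreshold.
AQE_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",  # master switch, off by default in 3.0
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.localShuffleReader.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def spark_submit_flags(conf: Dict[str, str]) -> List[str]:
    """Render settings as a flat list of '--conf key=value' arguments."""
    return [arg for k, v in sorted(conf.items()) for arg in ("--conf", f"{k}={v}")]
```

The same keys can also be set on a running session, for example with `spark.conf.set(key, value)` in PySpark.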
  • 10. Dynamically coalesce shuffle partitions -- Why? (1) Shuffle partition number and sizes crucial to query performance ▪ Partition too small: inefficient I/O, scheduler overhead, task setup overhead ▪ Partition too large: GC pressure, disk spilling
  • 11. Dynamically coalesce shuffle partitions -- Why? (2) Problem: ▪ One universal partition number throughout the entire query execution ▪ Data size changes at different times of query execution Solution by AQE: ▪ Set the initial partition number high to accommodate the largest data size of the entire query execution ▪ Automatically coalesce partitions if needed after each query stage
  • 12. Dynamically coalesce shuffle partitions -- When? [Diagram for SELECT x, avg(y) FROM t GROUP BY x ORDER BY avg(y); plan, bottom to top: SCAN -> AGGREGATE (partial) -> SHUFFLE (50 part.) -> AGGREGATE (final) -> SHUFFLE (50 part.) -> SORT] 1. Run leaf stages: stage 1 complete (total: 650MB, avg: 13MB) 2. Optimize: COALESCE (10 part.) added after the first SHUFFLE (50 part.) 3. Run more stages: stage 2 complete (total: 300MB, avg: 6MB) 4. Optimize: COALESCE (5 part.) added after the second SHUFFLE (50 part.)
  • 13. Dynamically coalesce shuffle partitions -- How? (1) Regular shuffle -- no coalescing ▪ Partitioned into statically specified partition number -- in this case, 5 [Diagram: MAP 1 and MAP 2 each write to REDUCE 1 through REDUCE 5]
  • 14. Dynamically coalesce shuffle partitions -- How? (2) AQE coalesced shuffle ▪ Combine adjacent small partitions -- in this case, orig. partitions 2, 3, 4 [Diagram: MAP 1 and MAP 2 write to REDUCE 1', REDUCE 2' (coalesced) and REDUCE 3']
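
The idea of combining adjacent small partitions (slides 13 and 14) can be illustrated with a greedy pass over the partition sizes. This is a minimal sketch under assumptions: `coalesce_partitions` and its `target` parameter are hypothetical names, the sizes are made up, and Spark's own rule takes further settings into account.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Group adjacent partition indices so each group stays within target.

    A partition that is larger than target on its own forms a
    single-partition group; partitions are never reordered or split.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Five shuffle partitions (sizes in MB, illustrative) and a 70 MB target.
print(coalesce_partitions([70, 20, 30, 15, 60], target=70))
# prints [[0], [1, 2, 3], [4]]
```

With zero-based indices, the middle group corresponds to original partitions 2, 3 and 4 on the slide, leaving three reducers instead of five.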
  • 15. Dynamically switch join strategies - Why? Spark chooses Broadcast Hash Join if either child of the join can fit well in memory. Problem: estimates can go wrong and the opportunity of doing BHJ can be missed: ▪ Stats not qualified for accurate cardinality or selectivity estimate ▪ Child relation being a complex subtree of operators ▪ Blackbox predicates, e.g., UDFs Solution by AQE: replan joins with runtime data sizes.
  • 16. Dynamically switch join strategies -- When & How? [Diagram for SELECT * FROM a JOIN b ON a.key = b.key WHERE b.value LIKE '%xyz%'] 1. Run leaf stages: the initial plan is a SORT MERGE JOIN over SCAN A -> SHUFFLE -> SORT (stage 1) and SCAN B -> FILTER -> SHUFFLE -> SORT (stage 2) 2. Optimize: stage 2 complete (est: 25MB, actual: 8MB); the join is replanned as a BROADCAST HASH JOIN over SCAN A -> SHUFFLE and SCAN B -> FILTER -> SHUFFLE -> BROADCAST 3. Run more stages: the replanned join runs
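
The replanning decision on slide 16 comes down to comparing an observed size with a broadcast threshold. The sketch below is a simplification: `choose_join` is a hypothetical function, and the threshold constant is an assumption that mirrors the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`; Spark's planner weighs more than this one comparison.

```python
MB = 1024 * 1024

# Assumed threshold, mirroring Spark's default broadcast threshold of 10 MB.
BROADCAST_THRESHOLD_BYTES = 10 * MB


def choose_join(left_bytes: int, right_bytes: int,
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join strategy from the sizes of the two join inputs."""
    if min(left_bytes, right_bytes) <= threshold:
        return "BROADCAST_HASH_JOIN"
    return "SORT_MERGE_JOIN"


# Planning time: side B is estimated at 25 MB, so no broadcast is chosen.
print(choose_join(left_bytes=500 * MB, right_bytes=25 * MB))
# prints SORT_MERGE_JOIN

# Runtime: the filtered side B turns out to be 8 MB, so the join is switched.
print(choose_join(left_bytes=500 * MB, right_bytes=8 * MB))
# prints BROADCAST_HASH_JOIN
```

The 500 MB figure for side A is invented for the example; the slide only gives the estimate and actual size for side B.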
  • 17. Dynamically optimize skew joins -- Why? Problem: data skew can lead to significant performance degradation ▪ Individual long-running tasks slow down the entire stage ▪ Especially large partitions cause further slowdown through disk spilling. Solution by AQE: handle skew joins automatically using runtime statistics ▪ Detect skew from partition sizes ▪ Split skewed partitions into smaller subpartitions
  • 18. Dynamically optimize skew joins -- When? [Diagram for SELECT * FROM a JOIN b ON a.col = b.col] 1. Run leaf stages: SORT MERGE JOIN over SCAN A -> SHUFFLE -> SORT (stage 1) and SCAN B -> SHUFFLE -> SORT (stage 2) 2. Optimize: stage 1 complete (med: 55MB, min: 40MB, max: 250MB) and stage 2 complete; a SKEW READER is added on each side of the join
  • 19. Dynamically optimize skew joins -- How? (1) Regular sort merge join -- no skew optimization: [Diagram: TABLE A - MAP 1, MAP 2 and MAP 3 write PART. A0 through A3; TABLE B - MAP 1 and MAP 2 write PART. B0 through B3; each PART. of A is JOINed with the matching PART. of B]
  • 20. Dynamically optimize skew joins -- How? (2) Skew-optimized sort merge join -- with skew shuffle reader: [Diagram: PART. A0 is split into A0 - S0, A0 - S1 and A0 - S2 (one per TABLE A map: MAP 1, MAP 2, MAP 3); B0 is duplicated so that each split is JOINed with a copy of B0; PART. A1 through A3 are JOINed with PART. B1 through B3 as before]
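
Skew detection and splitting (slides 17 to 20) can be sketched as follows. Everything here is illustrative: `plan_skew_join` and its parameters `factor`, `min_size` and `target` are hypothetical, their values are chosen so that the slide's numbers produce three splits, and the even split by size is a simplification of how a skew shuffle reader divides a partition.

```python
import math
from statistics import median
from typing import List, Tuple


def plan_skew_join(a_sizes: List[int], factor: float, min_size: int,
                   target: int) -> List[Tuple[str, str]]:
    """Return (A piece, B partition) pairs, one per join task.

    A partition of A counts as skewed when it is larger than both
    factor * median and min_size. A skewed partition is split into
    pieces of roughly target size, and the matching partition of B
    is paired with every piece (that is, duplicated).
    """
    med = median(a_sizes)
    tasks: List[Tuple[str, str]] = []
    for i, size in enumerate(a_sizes):
        if size > factor * med and size > min_size:
            n_splits = math.ceil(size / target)
            tasks.extend((f"A{i}-S{s}", f"B{i}") for s in range(n_splits))
        else:
            tasks.append((f"A{i}", f"B{i}"))
    return tasks


# Sizes in MB chosen to match slide 18: med 55, min 40, max 250.
for task in plan_skew_join([250, 40, 55, 55], factor=4, min_size=100, target=90):
    print(task)
```

With these assumed parameters, partition A0 is split into three pieces that each join against B0, while A1 through A3 keep their one-to-one pairing.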
  • 21. About Me Ke Jia Big Data Product Engineer at Intel Contributor of Spark, OAP and Hive
  • 23. Demo Try this notebook in Databricks
  • 24. TPC-DS Performance (3TB) -- Cluster Setup
      Hardware BDW
        Slave: Node# 5; CPU Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (96 cores); Memory 384 GB; Disk 7× 1 TB SSD; Network 10 Gigabit Ethernet
        Master: CPU Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (96 cores); Memory 384 GB; Disk 7× 1 TB SSD; Network 10 Gigabit Ethernet
      Software
        OS: Fedora release 29
        Kernel: 4.20.6-200.fc29.x86_64
        Spark*: Spark master (commit ID: 0b6aae422ba37a13531e98c8801589f5f3cb28e0)
        Hadoop*/HDFS*: hadoop-2.7.5
        JDK: 1.8.0_110 (Oracle* Corporation)
  • 25. TPC-DS Performance (3TB) -- Results ▪ Over 1.5x speedup on 2 queries; over 1.1x speedup on 37 queries [Chart: per-query speedups shown: 1.76x, 1.5x, 1.41x, 1.4x, 1.38x, 1.28x, 1.27x, 1.22x, 1.21x, 1.19x]
  • 26. TPC-DS Performance (3TB) -- Partition Coalescing • Less scheduler overhead and task startup time. • Fewer disk I/O requests. • Less data is written to disk because more data is aggregated. [Screenshots: partition number 1000 (Q8 without AQE); partition number changed to 658 and 717 (Q8 with AQE)]
  • 27. TPC-DS Performance (3TB) -- Join Strategies • Random I/O read -> sequential I/O read. • Remote shuffle read -> local shuffle read. [Screenshots: SortMergeJoin (Q14b without AQE); Broadcast Hash Join (Q14b with AQE)]
  • 28. AQE in Production ▪ Performance shared by one of the largest e-commerce companies in China: AQE helped them resolve critical data skew issues and achieve significant performance gains for online business queries. AQE achieved speedups of 17.7x, 14.0x, 1.6x and 1.3x respectively on 4 typical skewed queries. ▪ Performance shared by one of the largest internet companies in China: AQE achieved speedups of 5x and 1.38x for two typical queries in their production environment.
  • 29. AQE in Production -- Skew Join Optimization ▪ Select the user's invoice details based on the sale order id, invoice content id and the invoice type id. [Chart: Durations (s); speedups 17.7x, 14.0x, 1.6x, 1.3x]
      key            | Avg record | Skew records | comment
      sale_order_id  | 2000       | 15383717     | NULL
      ivc_content_id | 9231804    | 4077995632   | Not NULL
      ivc_type_id    | 360        | 3582336345   | Not NULL
  • 30. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.