Optimizing Delta and Parquet Data Lakes with Spark


Matthew Powers, Prognos Health
#UnifiedDataAnalytics #SparkAISummit

Agenda
• Why Delta?
• Delta basics and transaction log
• Compacting Delta lake
• Vacuuming old files
• Partitioning Delta lakes
• Deleting rows
• Persisting transformations in columns

About
MungingData
• Time travel
• Compacting
• Vacuuming
• Update columns

Contact me
• GitHub: MrPowers
• Email: matthewkevinpowers@gmail.com
• Delta Slack channel
• Open source hacking

What is Delta lake?
• Parquet + transaction log
• Provides awesome features for free!

Delta Lake =!= Databricks Delta
https://github.com/delta-io/delta/issues/49

TL;DR
• 1 GB files
• No nested directories

The Delta Lake Slack channel recommends 1 GB files

Databricks Delta autoOptimize

Why does compaction speed up lakes?
• Parquet: files need to be listed before they are read. Listing is expensive in object stores.
• Delta: data is read via the transaction log.
• Easier for Spark to read partitioned lakes into memory partitions.

_delta_log/00000000000000000000.json

_delta_log/00000000000000000001.json
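
Each commit file in `_delta_log` holds one JSON action per line, including `add` and `remove` actions for Parquet files. The sketch below replays two commits in plain Python (no Spark) to work out which files a reader should load. The file names and sizes are invented, and real commits carry extra fields (commit info, metadata, partition values, stats) that this sketch ignores.

```python
import json

# Hypothetical commit contents: one JSON action per line, as in _delta_log/*.json.
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-00000.parquet", "size": 100}}),
    json.dumps({"add": {"path": "part-00001.parquet", "size": 120}}),
])
# A compaction commit: the two small files are removed, one larger file is added.
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-00000.parquet"}}),
    json.dumps({"remove": {"path": "part-00001.parquet"}}),
    json.dumps({"add": {"path": "part-00002.parquet", "size": 220}}),
])


def active_files(commits):
    """Replay commits in order and return {path: size} for files still in the table."""
    files = {}
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files[action["add"]["path"]] = action["add"]["size"]
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
    return files


print(active_files([commit_0]))            # table state as of version 0
print(active_files([commit_0, commit_1]))  # table state as of version 1
```

Replaying only the first commit is the idea behind time travel. The removed Parquet files stay on disk until they are vacuumed.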

Compacting Delta lakes without breaking downstream apps
https://github.com/delta-io/delta/issues/146

Delta Lake Vacuum
• Deletes files that are marked for removal and older than the retention period
• Default retention period is 7 days
• Does not improve query performance; it only frees up storage
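
The selection rule above can be sketched in plain Python. This illustrates the rule only and is not Delta's implementation; the `RemovedFile` type, file names, and timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=7)  # default retention period from the slide


@dataclass
class RemovedFile:
    path: str
    removed_at: datetime  # when a commit marked the file for removal


def files_to_vacuum(removed, now, retention=DEFAULT_RETENTION):
    """Return paths that were marked for removal longer ago than the retention period."""
    cutoff = now - retention
    return [f.path for f in removed if f.removed_at < cutoff]


now = datetime(2019, 10, 16)
removed = [
    RemovedFile("part-00000.parquet", datetime(2019, 10, 1)),   # removed 15 days ago
    RemovedFile("part-00001.parquet", datetime(2019, 10, 14)),  # removed 2 days ago
]
print(files_to_vacuum(removed, now))
```

Files that the current table version still references are never candidates, which is why vacuuming does not change what a query reads.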

Optimal number of partitions (Delta)

Optimal number of partitions (Parquet)
https://github.com/MrPowers/spark-daria/blob/master/src/main/scala/com/github/mrpowers/spark/daria/utils/DirHelpers.scala
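
One way to pick the partition count for a Parquet lake is to sum the bytes under the lake directory and divide by the 1 GB target. Below is a rough sketch of that arithmetic in plain Python for a local directory. `dir_num_bytes` and `num_1gb_partitions` are hypothetical names, not the spark-daria API, and an object store would need its own listing call.

```python
import math
import os
import tempfile

ONE_GB = 1024 ** 3


def dir_num_bytes(root):
    """Sum the sizes of all files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def num_1gb_partitions(num_bytes):
    """Number of roughly 1 GB files needed to hold num_bytes (at least one)."""
    return max(1, math.ceil(num_bytes / ONE_GB))


# Small demonstration on a throwaway directory with three 1,000-byte files.
with tempfile.TemporaryDirectory() as lake:
    for i in range(3):
        with open(os.path.join(lake, f"part-{i}.parquet"), "wb") as f:
            f.write(b"x" * 1000)
    demo_bytes = dir_num_bytes(lake)

print(demo_bytes, num_1gb_partitions(demo_bytes))
print(num_1gb_partitions(5 * ONE_GB + 1))
```

The resulting count would typically be passed to `repartition` on the DataFrame before the lake is rewritten.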

Why partition data lakes?
• Data skipping
• Massively improve query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
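
Data skipping on a partitioned lake works because the partition value is encoded in the directory name (for example `country=Russia/`), so whole directories can be ruled out without opening a file. Here is a toy illustration in plain Python. The paths are invented and `prune` is a hypothetical helper, not how Spark implements pruning.

```python
def partition_values(path):
    """Extract key=value pairs from the directory segments of a file path."""
    values = {}
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(paths, column, wanted):
    """Keep only files whose partition directory matches column=wanted."""
    return [p for p in paths if partition_values(p).get(column) == wanted]


paths = [
    "lake/country=China/part-00000.parquet",
    "lake/country=Russia/part-00000.parquet",
    "lake/country=Russia/part-00001.parquet",
    "lake/country=France/part-00000.parquet",
]
print(prune(paths, "country", "Russia"))
```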

Filtering unpartitioned lake
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct

_delta_log/00000000000000000000.json

Filtering partitioned lake
== Physical Plan ==
*(1) Project [first_name#662, last_name#663, country#664]
+- *(1) Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- *(1) FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>

Comparing physical plans
Unpartitioned
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[….],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct

Partitioned
+- Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>

Directly grabbing the partitions is faster for Parquet lakes…
Directly grabbing partitions was 83 times faster than relying on partition filters for a simple query
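
A likely reason for the gap is file listing: filtering on the partition column of a Parquet lake still makes Spark discover every partition directory first, while pointing at one partition directory lists only that directory. The local sketch below counts what each approach has to list. The directory names and file counts are invented for the demonstration.

```python
import os
import tempfile


def count_listed_files(root):
    """Count the files a full recursive listing of root would return."""
    return sum(len(filenames) for _dirpath, _dirnames, filenames in os.walk(root))


with tempfile.TemporaryDirectory() as lake:
    # Build a tiny Hive-style partitioned layout: 3 countries x 4 files each.
    for country in ["China", "France", "Russia"]:
        partition_dir = os.path.join(lake, f"country={country}")
        os.makedirs(partition_dir)
        for i in range(4):
            open(os.path.join(partition_dir, f"part-{i}.parquet"), "w").close()

    # Reading the lake root and filtering: every partition gets listed.
    whole_lake = count_listed_files(lake)
    # Reading one partition directory directly: only its files get listed.
    one_partition = count_listed_files(os.path.join(lake, "country=Russia"))

print(whole_lake, one_partition)
```

In Spark the direct read would point at the partition directory itself; the `basePath` read option keeps the partition column in the resulting DataFrame.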

Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries

Creating partitioned lake (2/3)

Partitioned lake on disk (2/3)

Creating partitioned lake (3/3)

Incrementally updating partitioned lakes
• Small file problem grows quickly
• Compaction is hard
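
To put numbers on how quickly small files pile up, here is back-of-the-envelope arithmetic using the figures from the real lake described earlier (an update every 3 hours, 15,000 new files a day, 5 million files in total). It assumes a constant ingest rate, which the talk does not state.

```python
HOURS_BETWEEN_UPDATES = 3
FILES_ADDED_PER_DAY = 15_000
TOTAL_FILES = 5_000_000

updates_per_day = 24 // HOURS_BETWEEN_UPDATES
files_per_update = FILES_ADDED_PER_DAY / updates_per_day
# Days needed to reach the current file count, assuming a constant rate.
days_to_reach_total = TOTAL_FILES / FILES_ADDED_PER_DAY

print(updates_per_day)             # 8
print(files_per_update)            # 1875.0
print(round(days_to_reach_total))  # 333
```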

We can delete rows in Delta lakes

Delta lake downsides… not many

