Powering Interactive Data Analysis at Pinterest by Amazon Redshift

Powering interactive data analysis by Amazon Redshift
Jie Li
Data Infra at Pinterest

Pinterest: a place to get inspired and plan for the future

Data Infra at Pinterest
[Diagram: production data pipeline and ad-hoc data analysis on Amazon Web Services. Components shown: Kafka, MySQL, HBase, Redis, S3, Hadoop, Hive, Cascading, Pinball (*), plus a separate MySQL and the Analytics Dashboard.]
* Pinball is our own workflow manager that we plan to open source.

We need a low latency data warehouse!
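[Diagram: production data pipeline and ad-hoc data analysis on Amazon Web Services. Components shown: Kafka, MySQL, HBase, Redis, S3, Hadoop, Hive, Cascading, Pinball, plus a separate MySQL and the Analytics Dashboard.]
Annotations on the diagram:
• High latency!
• MySQL not a viable data warehouse.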

Low-latency data warehouse
• SQL on Hadoop
– Shark, Impala, Drill, Tez, Presto, …
– Open source and free 
– Immature? 

• Massively Parallel Processing (MPP)
– Asterdata, Vertica, ParAccel, …
– Built on mature technologies like Postgres 
– Expensive and only available on-premises

• Amazon Redshift
– ParAccel on AWS 
– Mature but also cost-effective 

Highlights of Redshift
High cost efficiency
• On-demand: $0.85 per hour
• 3-year reserved instances: $999/TB/year
• Free snapshots on S3

Low maintenance overhead
• Fully managed service
• Automated maintenance & upgrade
• Built-in admin dashboard
Superior performance
25-100x over Hive
• Columnar layout
• Index
• Advanced optimizer
• Efficient execution

[Bar chart: query time in seconds (axis 0 to 6000) for queries Q1 to Q4, Hive vs. Redshift]
Note: based on our own dataset and queries.

Cool, but how to integrate Redshift with Hive/Hadoop?

First, get data from Hive into Redshift
[Diagram: Hive (unstructured, unclean) → Extract & Transform → S3 (structured, clean) → Load → Redshift (columnar, compact, compressed)]

Hadoop/Hive is perfect for heavy-lifting ETL workloads
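
The load step from S3 into Redshift can be sketched with Redshift's COPY command. This is a minimal sketch, not the pipeline described in the talk: the table name, S3 prefix, IAM role ARN, and the tab-delimited gzip file format are all assumptions made for illustration.

```python
def build_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Return a Redshift COPY statement for gzip'd, tab-delimited files.

    All three arguments are hypothetical placeholders supplied by the caller.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "DELIMITER '\\t'\n"
        "GZIP;"
    )


# Example with made-up names; run the resulting SQL with any Redshift client.
example_sql = build_copy_sql(
    "events",
    "s3://example-bucket/events/dt=2014-01-01/",
    "arn:aws:iam::123456789012:role/example-redshift-role",
)
```

Keeping the Hive output on S3 already structured and clean means the COPY itself stays this simple, with the heavy transformation work done beforehand in Hadoop.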

Building ETL from Hive to Redshift
• Schematizing Hive tables
– What worked: writing column-mapping scripts to generate ETL queries
– What didn’t work: N/A

• Cleaning data
– What worked: filtering out non-ASCII characters
– What didn’t work: loading all characters

• Loading big tables with sortkey
– What worked: sorting externally in Hadoop/Hive and loading in chunks
– What didn’t work: loading unordered data directly

• Loading time-series tables
– What worked: appending to the table in the order of time (sortkey)
– What didn’t work: a table per day connected with a view (performed poorly)

• Table retention
– What worked: insert into a new table
– What didn’t work: delete and vacuum (poor performance)
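
Two of the approaches that worked, filtering out non-ASCII characters and handling retention by inserting into a new table, can be sketched as follows. This is an illustrative sketch only: the function names, the `_new`/`_old` table suffixes, and the table-swap sequence are assumptions, not the scripts used in the talk.

```python
from typing import List


def strip_non_ascii(text: str) -> str:
    """Drop every character outside the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def retention_statements(table: str, sortkey: str, cutoff: str) -> List[str]:
    """SQL that keeps only rows with sortkey >= cutoff by copying them
    into a fresh table, instead of DELETE followed by VACUUM.

    The "_new"/"_old" suffixes are hypothetical naming choices.
    """
    new, old = f"{table}_new", f"{table}_old"
    return [
        f"CREATE TABLE {new} (LIKE {table});",
        f"INSERT INTO {new} SELECT * FROM {table} "
        f"WHERE {sortkey} >= '{cutoff}' ORDER BY {sortkey};",
        f"ALTER TABLE {table} RENAME TO {old};",
        f"ALTER TABLE {new} RENAME TO {table};",
        f"DROP TABLE {old};",
    ]
```

Inserting in sortkey order leaves the new table already sorted, which is the reason this route avoids the vacuum step that performed poorly.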

But it’s just the beginning.
Make sure you audit the ETL from Day 1

Audit ETL for Data Consistency
Everything was good until one day we noticed that one table was only half of its expected size.
S3 is only eventually consistent (EC)!

[Diagram: Hive → S3 → Redshift, with an audit between each pair of stages]

Solutions:
① Audit the data at each stage.
② Also reduce the number of files on S3 to alleviate EC issues.
Also, a recently added feature lets you specify a manifest for the files on S3.
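
Both solutions can be sketched in a few lines. The manifest below follows the JSON layout that Redshift's COPY accepts with the MANIFEST option (an `entries` list of `url` and `mandatory` fields); the audit helper and its inputs are hypothetical and assume row counts have already been collected from Hive and Redshift by some other means.

```python
import json
from typing import Dict, List


def build_manifest(s3_urls: List[str]) -> str:
    """Build a COPY manifest listing every file that must be loaded.

    Marking entries as mandatory makes the load fail if a file is not
    visible yet, rather than silently loading a partial table.
    """
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def audit_counts(expected: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Return the tables whose row count is missing or differs.

    `expected` holds counts from the source side (e.g. Hive) and
    `actual` holds counts from the destination (e.g. Redshift).
    """
    return sorted(t for t, n in expected.items() if actual.get(t) != n)
```

A non-empty result from `audit_counts` is the signal to alert or reload before anyone queries the affected table.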

Now we have the data.
Is it ready for superior performance?

Understand the performance
[Diagram: a leader node holding the system stats, above three compute nodes]

① Understand the query execution plan (via “explain”). Always update system stats after data loading by running “analyze”.

② Optimize the data layout by choosing consistent distkeys across tables, and always choose a sortkey. Watch out for bad distkeys with skew (e.g. a distkey with null values).
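
The effect of a skewed distkey can be illustrated with a small simulation. This is a sketch under stated assumptions: `zlib.crc32` stands in for Redshift's internal hash function (which is not specified here), and sending all nulls to slice 0 simply models the fact that identical key values always land on the same slice.

```python
import zlib
from collections import Counter
from typing import Iterable, Optional


def slice_skew(values: Iterable[Optional[object]], num_slices: int) -> float:
    """Ratio of the busiest slice's row count to the mean rows per slice.

    1.0 means perfectly even; larger values mean one slice does more work.
    """
    counts: Counter = Counter()
    for value in values:
        if value is None:
            slice_id = 0  # every null hashes to the same place
        else:
            slice_id = zlib.crc32(str(value).encode("utf-8")) % num_slices
        counts[slice_id] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return max(counts.values()) / (total / num_slices)


# A column that is 90% null piles most rows onto a single slice.
mostly_null = [None] * 90 + list(range(10))
skew = slice_skew(mostly_null, num_slices=4)
```

Since a query finishes only when its slowest slice finishes, a high ratio here translates directly into wasted capacity on the other nodes.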

What if a query takes too long?
It’s worth doing your own homework
Filing tickets doesn’t work well for perf issues
• Requires a lot of information exchange
• May be caused by minor issues
Case: we optimized a query from 3 hours to 7 seconds after studying the query plan and fixing the system stats (with stale stats, the broadcast join treated the larger table as the smaller one).
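
A first pass over a query plan for the broadcast problem in this case can be automated. The sketch below assumes the EXPLAIN output has already been fetched as text; it looks for `DS_BCAST_INNER`, the label Redshift's EXPLAIN uses when the inner table of a join is broadcast to every node. The function name is hypothetical.

```python
from typing import List


def broadcast_steps(plan_text: str) -> List[str]:
    """Return the plan lines where the inner table of a join is broadcast.

    A broadcast is cheap for a small table, so any hit on a large table
    suggests stale statistics and is a cue to run ANALYZE on it.
    """
    return [
        line.strip()
        for line in plan_text.splitlines()
        if "DS_BCAST_INNER" in line
    ]
```

Each returned line still needs a human to check which table is being broadcast; the helper only narrows down where to look.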

Educate users with best practices 
• Select only the columns you need
– Redshift is a columnar database and only scans the columns you ask for, which speeds things up. “SELECT *” is usually bad.

• Use the sortkey (dt or created_at)
– Using the sortkey can skip unnecessary data. Most of our tables use dt or created_at as the sortkey.

• Avoid slow data transfers
– Transferring a large query result from Redshift to the local client may be slow. Try saving the result as a Redshift table or using the command line client on EC2.

• Apply selective filters before join
– A join can be significantly faster if we filter out as much irrelevant data as possible first.

• Run one query at a time
– Performance gets diluted with more concurrent queries. So be patient.

• Understand the query plan with EXPLAIN
– EXPLAIN gives you an idea of why a query may be slow. For advanced users only.
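
The first two practices are easy to check mechanically. The following is a rough, regex-based sketch, not a SQL parser: the function name and warning texts are made up, and the only facts taken from the slides are that “SELECT *” is usually bad and that dt or created_at is the usual sortkey.

```python
import re
from typing import List

SORTKEYS = ("dt", "created_at")  # sortkey columns named in the slides


def lint_query(sql: str) -> List[str]:
    """Return best-practice warnings for a query (heuristic only)."""
    warnings = []
    if re.search(r"select\s+\*", sql, flags=re.IGNORECASE):
        warnings.append("avoid SELECT *; list only the columns you need")
    # Look at everything after the first WHERE, if there is one.
    parts = re.split(r"\bwhere\b", sql, maxsplit=1, flags=re.IGNORECASE)
    where_clause = parts[1] if len(parts) > 1 else ""
    has_sortkey = any(
        re.search(rf"\b{key}\b", where_clause, flags=re.IGNORECASE)
        for key in SORTKEYS
    )
    if not has_sortkey:
        warnings.append("filter on the sortkey (dt or created_at)")
    return warnings
```

For example, a query that selects named columns and filters on `dt` passes cleanly, while `SELECT * FROM some_table` triggers both warnings. Subqueries and joins will fool a heuristic like this, so it is better suited to nudging users than to blocking queries.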

Hopefully users will follow the best practices
But Redshift is a shared service

One query may slow down the whole cluster 

Proactive monitoring
It’s easy to write scripts on top of the system tables (e.g. stl_query) for:

• Real-time monitoring of slow queries
– Ping users with best practices
– Send alerts to admin

• Analyzing patterns
– Who needs help
– Who was “abusing”

Hint: manually back up these system tables, as they will be cleaned up weekly.
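
A monitoring script of this kind could look like the sketch below. The query reads columns that exist in `stl_query` (`userid`, `query`, `starttime`, `endtime`, `querytxt`); the one-hour window, the `QueryRow` type, and the threshold are assumptions, and fetching the rows with a database client is left out.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple

# Recently completed queries; run this with any Redshift client.
RECENT_QUERIES_SQL = """
SELECT userid, query, starttime, endtime, TRIM(querytxt) AS querytxt
FROM stl_query
WHERE starttime >= DATEADD(hour, -1, GETDATE())
"""


class QueryRow(NamedTuple):
    """One fetched row of RECENT_QUERIES_SQL (hypothetical helper type)."""
    userid: int
    query: int
    starttime: datetime
    endtime: datetime
    querytxt: str


def slow_queries(rows: Iterable[QueryRow], threshold: timedelta) -> List[QueryRow]:
    """Return the rows whose run time reached the threshold."""
    return [row for row in rows if row.endtime - row.starttime >= threshold]
```

The returned rows carry the `userid` needed both to ping the user and, aggregated over time, to see who needs help most often.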

Optimizing workload management
• Run heavy ETL at night
– ETL is resource intensive
– No easy way to limit the resource usage (IO/CPU)

• Time out user queries during peak hours
– Long queries (>= 30 mins) likely have mistakes
– Sacrifice a few users for the majority

• Unlike Hadoop, there is no preemption in Redshift
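
One way to time out user queries during peak hours is the `statement_timeout` session setting, which takes a value in milliseconds and treats 0 as no limit. The sketch below is an assumption about how this could be wired up, not the setup described in the talk: the peak-hour window is made up, and a workload management (WLM) queue timeout would be an alternative.

```python
PEAK_HOURS = range(9, 18)         # assumed peak window, 09:00 to 17:59
LONG_QUERY_MS = 30 * 60 * 1000    # the 30-minute threshold from the slide


def statement_timeout_sql(hour: int) -> str:
    """Return the SET statement to issue for a session opened at `hour`."""
    timeout_ms = LONG_QUERY_MS if hour in PEAK_HOURS else 0
    return f"SET statement_timeout TO {timeout_ms};"
```

Because there is no preemption, cancelling a runaway query outright is the only lever available, which is why the threshold is set high enough to hit only queries that are likely mistakes.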

Current status of Redshift at Pinterest
• 16 node 256TB cluster with 100TB+ core data
• Ingesting 1.5TB data per day with retention
• 30+ daily users
• 500+ ad-hoc queries per day
– 75% <= 35 seconds, 90% <= 2 minutes
• Operational effort <= 5 hours/week

Redshift integrated at Pinterest
[Diagram: production data pipeline and ad-hoc data analysis on Amazon Web Services, now including Redshift. Components shown: Kafka, MySQL, HBase, Redis, S3, Hadoop, Hive, Cascading, Pinball, Redshift, plus a separate MySQL and the Analytics Dashboard.]

Next step
• Next generation of analytics dashboards
– Replace offline MySQL with Redshift
– Replace custom dashboards with Tableau

[Diagram: production data pipeline and ad-hoc data analysis on Amazon Web Services. Components shown: Kafka, MySQL, HBase, Redis, S3, Hadoop, Hive, Cascading, Pinball, Redshift, and Tableau.]

Remaining risks
• SLA for low latency queries
– Due to the lack of preemption, Redshift cannot guarantee that mission-critical queries finish fast

• High availability
– Takes hours to restore clusters from snapshots
– May need a standby cluster in the future

Questions?
• Quora: http://qr.ae/TwRJf
• Twitter: @jay23jack
