SlideShare une entreprise Scribd logo
1  sur  25
Powering interactive data analysis by
Amazon Redshift
Jie Li
Data Infra at Pinterest
Pinterest: a place to get inspired and
plan for the future
Data Infra at Pinterest
Production
data pipeline

Kafka

Pinball (*)
Hive

S3

MySQL
HBase
Redis

Cascading

Hadoop

MySQL

Amazon Web Service
* Pinball is our own workflow manager that we plan to open source.

Ad-hoc data
analysis

Analytics Dashboard
We need a low latency data warehouse!
Production
data pipeline

Kafka

Pinball
Hive

S3

MySQL
HBase
Redis

Cascading

Hadoop

High latency!

Ad-hoc data
analysis

MySQL not a viable data warehouse.
MySQL

Amazon Web Service

Analytics Dashboard
Low-latency data warehouse
• SQL on Hadoop
– Shark, Impala, Drill, Tez, Presto, …
– Open source and free 
– Immature? 

• Massive Parallel Processing (MPP)
– Asterdata, Vertica, ParAccel, …
– Built on mature technologies like Postgres 
– Expensive and only available on-premise 

• Amazon Redshift
– ParAccel on AWS 
– Mature but also cost-effective 
Highlights of Redshift
High cost efficiency
on-demand $0.85 per hour
3yr reserved instances $999/TB/year
Free snapshot on S3
Highlights of Redshift
Low maintenance overhead
Fully self-managed
Automated maintenance & upgrade
Built-in admin dashboard
Highlights of Redshift
Superior performance
6000

5000

25-100x over Hive
• Columnar layout
• Index
• Advanced optimizer
• Efficient execution

second

4000

3000

2000

1000

0

Q1

Q2
Hive

Q3
RedShift

Note: based on our own dataset and queries.

Q4
Cool, but how to integrate Redshift with
Hive/Hadoop
First, get data from Hive into Redshift
Unstructure
d
Unclean

Structured
Clean

Extract & Transform
Hive

Columnar
Compact
Compressed

Load
S3

Hadoop/Hive is perfect for heavy-lifting ETL workloads

Redshift
Building ETL from Hive to Redshift
What worked

What didn’t work

Schematizing Hive tables

Writing column-mapping
scripts to generate ETL
queries

N/A

Cleaning data

Filtering out non-ASCII
characters

Loading all characters

Loading big tables with
sortkey

Sorting externally in
Hadoop/Hive and loading in
chunks

Loading unordered data
directly

Loading time-series tables

Appending to the table in
the order of time (sortkey)

A table per day connected
with view performing poorly

Table retention

Insert into a new table

Delete and vacuum (poor
performance)
But it’s just the beginning.
Make sure you audit the ETL from Day 1
Audit ETL for Data Consistency
Everything was good until one day we noticed
one table was only half of its size 
S3 is only eventual consistent (EC) ! 

Hive

Solutions:

S3

① Audit

Redshift

Audit

② Also reduce number of files on S3 to alleviate EC.
Also, recently there is a new feature to specify a manifest for files on S3.
Now we got the data.
Is it ready for superior performance?
Understand the performance
Leader

① Understand the query execution
plan (via “explain”). Always
update system stats after data
loading by running “analyze”.

System
Stats

Compute

Compute

Compute

② Optimize the data layout by choosing consistent
distkeys across tables, and always choose a
sortkey. Watch out for bad distkey with skew (e.g.
distkey with null values).
What if a query took long
It’s worth doing your own homework
Filing tickets doesn’t work well for perf issues
• Requires a lot of information exchange
• May be caused by minor issues
Case: we optimized a query from 3 hours to 7 seconds
after studying the query plan and fixing the system
stats (the broadcast join regarded the larger table as
the smaller one).
Educate users with best practices 
Best Practice

Details

Select only the columns you need

Redshift is a columnar database and it only scans the
columns you need to speed things up. “SELECT *” is
usually bad.

Use the sortkey (dt or created_at)

Using sortkey can skip unnecessary data. Most of our
tables are using dt or created_at as the sortkey.

Avoid slow data transferring

Transferring large query result from Redshift to the local
client may be slow. Try saving the result as a Redshift
table or using the command line client on EC2.

Apply selective filters before join

Join operation can be significantly faster if we filter out
irrelevant data as much as possible.

Run one query at a time

The performance gets diluted with more queries. So be
patient.

Understand the query plan by
EXPLAIN

EXPLAIN gives you idea why a query may be slow. For
advanced users only.
Hopefully users will follow the best practice 
But Redshift is a shared service

One query may slow down the whole cluster 
Proactive monitoring
System tables
(e.g. stl_query)

It’s easy to write scripts for
Real-time
monitoring
slow queries

• Ping users with best practice
• Send alerts to admin

Analyzing
patterns

• Who need help
• Who was “abusing”

Hint: manually backup these system tables as they will be cleaned up weekly.
Optimizing workload management
• Run heavy ETL during night
– ETL is resource intensive
– No easy way to limit the resource usage (IO/CPU)

• Time out user queries during peak hours
– Long queries (>= 30 mins) likely have mistakes
– Sacrifice a few users for the majority

• Unlike Hadoop, there is no preemption in
Redshift 
Current status of Redshift at Pinterest
•
•
•
•

16 node 256TB cluster with 100TB+ core data
Ingesting 1.5TB data per day with retention
30+ daily users
500+ ad-hoc queries per day
– 75% <= 35 seconds, 90% <= 2 minute

• operational effort <= 5 hours/week
Redshift integrated at Pinterest
Pinball
Hive

Kafka

Cascading

Production
data pipeline

Hadoop

Ad-hoc data
analysis

S3

MySQL
HBase
Redis

Redshift

MySQL

Amazon Web Service

Analytics
Dashboard
Next step
• Next generation of analytics dashboards
– Replace offline MySQL with Redshift
– Replace custom dashboards with Tableau
Pinball
Hive

Kafka

Cascading

Production
data pipeline

Hadoop

Ad-hoc data
analysis

S3
MySQL
HBase
Redis

Redshift

Amazon Web Service

Tableau
Remaining risks
• SLA for low latency queries
– Due to the lack of preemption, it can not
guarantee mission-critical queries to finish fast

• High availability
– Takes hours to restore clusters from snapshots
– May need a standby cluster in future
Questions?
• Quora: http://qr.ae/TwRJf
• Twitter: @jay23jack

Contenu connexe

Tendances

Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAmazon Web Services
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Amazon Web Services
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceAmazon Web Services
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon AthenaJulien SIMON
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Amazon Web Services
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauLynn Langit
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Web Services
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseAmazon Web Services
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationVolodymyr Rovetskiy
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 

Tendances (20)

Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performance
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and Tableau
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 

En vedette

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Googleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスGoogleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスEtsuji Nakai
 
Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12todmoore
 
Continuous Delivery in Ruby
Continuous Delivery in RubyContinuous Delivery in Ruby
Continuous Delivery in RubyBrian Guthrie
 
Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...C4Media
 
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013Amazon Web Services
 
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...CloudIDSummit
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAmazon Web Services
 
OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?Oliver Pfaff
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014Amazon Web Services
 
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014Amazon Web Services
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...Amazon Web Services
 
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Web Services Korea
 
Hybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsHybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsSteven Francia
 
Core MIDI and Friends
Core MIDI and FriendsCore MIDI and Friends
Core MIDI and FriendsChris Adamson
 
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...Amazon Web Services
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Amazon Web Services
 

En vedette (20)

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Googleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービスGoogleにおける機械学習の活用とクラウドサービス
Googleにおける機械学習の活用とクラウドサービス
 
Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12Data Center In Healthcare Presentation 02 12
Data Center In Healthcare Presentation 02 12
 
Continuous Delivery in Ruby
Continuous Delivery in RubyContinuous Delivery in Ruby
Continuous Delivery in Ruby
 
Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...Better Together - Using Spark and Redshift to Combine Your Data with Public D...
Better Together - Using Spark and Redshift to Combine Your Data with Public D...
 
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013From One to Many:  Evolving VPC Design (ARC401) | AWS re:Invent 2013
From One to Many: Evolving VPC Design (ARC401) | AWS re:Invent 2013
 
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
CIS13: Bootcamp: Ping Identity OAuth and OpenID Connect In Action with PingFe...
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions Showcase
 
OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?OpenID Connect - An Emperor or Just New Cloths?
OpenID Connect - An Emperor or Just New Cloths?
 
Presto - SQL on anything
Presto  - SQL on anythingPresto  - SQL on anything
Presto - SQL on anything
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
 
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
(WEB302) Best Practices for Running WordPress on AWS | AWS re:Invent 2014
 
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
(GAM301) Real-Time Game Analytics with Amazon Kinesis, Amazon Redshift, and A...
 
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB DayAmazon Redshift의 이해와 활용 (김용우) - AWS DB Day
Amazon Redshift의 이해와 활용 (김용우) - AWS DB Day
 
Hybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS ApplicationsHybrid MongoDB and RDBMS Applications
Hybrid MongoDB and RDBMS Applications
 
Core MIDI and Friends
Core MIDI and FriendsCore MIDI and Friends
Core MIDI and Friends
 
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
 

Similaire à Powering Interactive Data Analysis at Pinterest by Amazon Redshift

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataSchubert Zhang
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsKeeyong Han
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systemselliando dias
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platformmartinbpeters
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftLars Kamp
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftAmazon Web Services
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...Amazon Web Services
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 

Similaire à Powering Interactive Data Analysis at Pinterest by Amazon Redshift (20)

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Case Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of DataCase Study - How Rackspace Query Terabytes Of Data
Case Study - How Rackspace Query Terabytes Of Data
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Amazon Redshift Deep Dive
Amazon Redshift Deep Dive Amazon Redshift Deep Dive
Amazon Redshift Deep Dive
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systems
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon Redshift
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon Redshift
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 

Dernier

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Dernier (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Powering Interactive Data Analysis at Pinterest by Amazon Redshift

  • 1. Powering interactive data analysis by Amazon Redshift Jie Li Data Infra at Pinterest
  • 2. Pinterest: a place to get inspired and plan for the future
  • 3. Data Infra at Pinterest Production data pipeline Kafka Pinball (*) Hive S3 MySQL HBase Redis Cascading Hadoop MySQL Amazon Web Service * Pinball is our own workflow manager that we plan to open source. Ad-hoc data analysis Analytics Dashboard
  • 4. We need a low latency data warehouse! Production data pipeline Kafka Pinball Hive S3 MySQL HBase Redis Cascading Hadoop High latency! Ad-hoc data analysis MySQL not a viable data warehouse. MySQL Amazon Web Service Analytics Dashboard
  • 5. Low-latency data warehouse • SQL on Hadoop – Shark, Impala, Drill, Tez, Presto, … – Open source and free  – Immature?  • Massive Parallel Processing (MPP) – Asterdata, Vertica, ParAccel, … – Built on mature technologies like Postgres  – Expensive and only available on-premise  • Amazon Redshift – ParAccel on AWS  – Mature but also cost-effective 
  • 6. Highlights of Redshift High cost efficiency on-demand $0.85 per hour 3yr reserved instances $999/TB/year Free snapshot on S3
  • 7. Highlights of Redshift Low maintenance overhead Fully self-managed Automated maintenance & upgrade Built-in admin dashboard
  • 8. Highlights of Redshift Superior performance 6000 5000 25-100x over Hive • Columnar layout • Index • Advanced optimizer • Efficient execution second 4000 3000 2000 1000 0 Q1 Q2 Hive Q3 RedShift Note: based on our own dataset and queries. Q4
  • 9. Cool, but how to integrate Redshift with Hive/Hadoop
  • 10. First, get data from Hive into Redshift Unstructure d Unclean Structured Clean Extract & Transform Hive Columnar Compact Compressed Load S3 Hadoop/Hive is perfect for heavy-lifting ETL workloads Redshift
  • 11. Building ETL from Hive to Redshift What worked What didn’t work Schematizing Hive tables Writing column-mapping scripts to generate ETL queries N/A Cleaning data Filtering out non-ASCII characters Loading all characters Loading big tables with sortkey Sorting externally in Hadoop/Hive and loading in chunks Loading unordered data directly Loading time-series tables Appending to the table in the order of time (sortkey) A table per day connected with view performing poorly Table retention Insert into a new table Delete and vacuum (poor performance)
  • 12. But it’s just the beginning. Make sure you audit the ETL from Day 1
  • 13. Audit ETL for Data Consistency Everything was good until one day we noticed one table was only half of its size  S3 is only eventual consistent (EC) !  Hive Solutions: S3 ① Audit Redshift Audit ② Also reduce number of files on S3 to alleviate EC. Also, recently there is a new feature to specify a manifest for files on S3.
  • 14. Now we got the data. Is it ready for superior performance?
  • 15. Understand the performance Leader ① Understand the query execution plan (via “explain”). Always update system stats after data loading by running “analyze”. System Stats Compute Compute Compute ② Optimize the data layout by choosing consistent distkeys across tables, and always choose a sortkey. Watch out for bad distkey with skew (e.g. distkey with null values).
  • 16. What if a query took long It’s worth doing your own homework Filing tickets doesn’t work well for perf issues • Requires a lot of information exchange • May be caused by minor issues Case: we optimized a query from 3 hours to 7 seconds after studying the query plan and fixing the system stats (the broadcast join regarded the larger table as the smaller one).
  • 17. Educate users with best practices  Best Practice Details Select only the columns you need Redshift is a columnar database and it only scans the columns you need to speed things up. “SELECT *” is usually bad. Use the sortkey (dt or created_at) Using sortkey can skip unnecessary data. Most of our tables are using dt or created_at as the sortkey. Avoid slow data transferring Transferring large query result from Redshift to the local client may be slow. Try saving the result as a Redshift table or using the command line client on EC2. Apply selective filters before join Join operation can be significantly faster if we filter out irrelevant data as much as possible. Run one query at a time The performance gets diluted with more queries. So be patient. Understand the query plan by EXPLAIN EXPLAIN gives you idea why a query may be slow. For advanced users only.
  • 18. Hopefully users will follow the best practice  But Redshift is a shared service One query may slow down the whole cluster 
  • 19. Proactive monitoring System tables (e.g. stl_query) It’s easy to write scripts for Real-time monitoring slow queries • Ping users with best practice • Send alerts to admin Analyzing patterns • Who need help • Who was “abusing” Hint: manually backup these system tables as they will be cleaned up weekly.
  • 20. Optimizing workload management • Run heavy ETL during night – ETL is resource intensive – No easy way to limit the resource usage (IO/CPU) • Time out user queries during peak hours – Long queries (>= 30 mins) likely have mistakes – Sacrifice a few users for the majority • Unlike Hadoop, there is no preemption in Redshift 
  • 21. Current status of Redshift at Pinterest • • • • 16 node 256TB cluster with 100TB+ core data Ingesting 1.5TB data per day with retention 30+ daily users 500+ ad-hoc queries per day – 75% <= 35 seconds, 90% <= 2 minute • operational effort <= 5 hours/week
  • 22. Redshift integrated at Pinterest Pinball Hive Kafka Cascading Production data pipeline Hadoop Ad-hoc data analysis S3 MySQL HBase Redis Redshift MySQL Amazon Web Service Analytics Dashboard
  • 23. Next step • Next generation of analytics dashboards – Replace offline MySQL with Redshift – Replace custom dashboards with Tableau Pinball Hive Kafka Cascading Production data pipeline Hadoop Ad-hoc data analysis S3 MySQL HBase Redis Redshift Amazon Web Service Tableau
  • 24. Remaining risks • SLA for low latency queries – Due to the lack of preemption, it can not guarantee mission-critical queries to finish fast • High availability – Takes hours to restore clusters from snapshots – May need a standby cluster in future

Notes de l'éditeur

  1. Add Hadoop Logo
  2. To verify machine generated queries
  3. Use logo for hadoop/hive. Add Pinball?