Amazon Redshift
Spend time with your data, not your database…
Data Warehouse Challenges
Cost
Complexity
Performance
Rigidity
[Chart: Enterprise Data vs. Data in Warehouse, 1990 to 2020]
Amazon Redshift powers Clickstream Analytics for Amazon.com
• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB
• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion
• Solution
– Best scale-out solution—query across 1 week
– Hadoop—query across 1 month
Amazon Redshift benefits realized
• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hr
• 10B-row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hrs
• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr
• Complexity
– 20% time of one DBA
• Backup
• Restore
• Resizing
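The figures above can be turned into rough rates with simple arithmetic. The sketch below only divides the numbers quoted on the slide; the variable names are illustrative, and treating 1 PB as 1000 TB is an assumption.

```python
# Back-of-the-envelope rates derived from the figures quoted on the slide.
SECONDS_PER_MINUTE = 60
SECONDS_PER_HOUR = 3600


def rows_per_second(rows, seconds):
    """Average throughput for a job that handled `rows` rows in `seconds`."""
    return rows / float(seconds)


scan_rate = rows_per_second(2.25e12, 14 * SECONDS_PER_MINUTE)    # scan
load_rate = rows_per_second(5e9, 10 * SECONDS_PER_MINUTE)        # load
backfill_rate = rows_per_second(150e9, 9.75 * SECONDS_PER_HOUR)  # backfill

pig_speedup = (2 * 24) / 1.0   # 2 days down to 1 hour
oracle_speedup = 90 / 8.0      # 90 hours down to 8 hours

cost_per_node_hour = 180.0 / 100         # $180/hr spread over 100 nodes
cost_per_tb_hour = 180.0 / (1.6 * 1000)  # assumption: 1 PB = 1000 TB

print("scan rows/s:     %.3e" % scan_rate)
print("load rows/s:     %.3e" % load_rate)
print("backfill rows/s: %.3e" % backfill_rate)
print("speedup vs Pig: %.0fx, vs Oracle: %.2fx" % (pig_speedup, oracle_speedup))
print("$ per node-hour: %.2f, $ per TB-hour: %.4f" % (cost_per_node_hour, cost_per_tb_hour))
```

These are averages over the whole job, so they say nothing about peak rates or about how the work was distributed across nodes.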
Expanding Amazon Redshift Functionality
Scalar User-Defined Functions (UDF)
• Scalar UDFs using Python 2.7
– Return single result value for each input value
– Executed in parallel across cluster
– Syntax largely identical to PostgreSQL
– Function names that begin with f_ are reserved for customers
• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run
statistical analyses
– Build end-to-end modeling workflow
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
Scalar UDF Security
• Run in restricted container that is fully isolated
– Cannot make system and network calls
– Cannot corrupt your cluster or negatively impact its performance
• Current limitations
– Can’t access file system - functions that write files won’t work
– Don’t yet cache stable and immutable functions
– Slower than built-in functions compiled to machine code
• Haven’t fully optimized some cases, including nested functions
Scalar UDF example - URL parsing
CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
-- Using the UDF:
SELECT f_hostname(url) FROM table;
-- The same extraction without a UDF, using a regular expression:
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
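The Python body of f_hostname can be tried outside the cluster before the function is created. This is a minimal sketch: the helper name `hostname` and the sample URLs are made up for illustration, and the import fallback is only there because the UDF runs on Python 2.7 (`urlparse` module) while a local interpreter may be Python 3 (`urllib.parse`).

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7, as inside the UDF


def hostname(url):
    """Same logic as the body of f_hostname: return the host part of a URL."""
    return urlparse(url).hostname


# Hypothetical sample inputs, for illustration only.
print(hostname("https://user@www.example.com:8080/path?q=1"))
print(hostname("http://example.org/index.html"))
print(hostname("not a url"))  # no network location, so the result is None
```

Note that `hostname` drops the user info and port and returns `None` when the input has no network location, which becomes NULL in the query result.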
Scalar UDF example – Distance
CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE
AS $$
import math
r = 3963.1676 # earth's radius, in miles
phi_orig = math.radians(orig_lat)
phi_dest = math.radians(dest_lat)
delta_lat = math.radians(dest_lat - orig_lat)
delta_long = math.radians(dest_long - orig_long)
a = (math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig)
* math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2))
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
d = r * c
return d
$$ LANGUAGE plpythonu;
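The body of f_distance is the haversine great-circle formula, and it can be checked locally as ordinary Python. The sketch below mirrors the UDF body; the function name `distance`, the constant name, and the test coordinates are illustrative assumptions, not part of the original.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same radius as in the UDF


def distance(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    # Haversine term: parentheses let the expression span two lines.
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


# Sanity checks: zero distance to itself, and a quarter of the equator.
print(distance(10.0, 20.0, 10.0, 20.0))
print(distance(0.0, 0.0, 0.0, 90.0), EARTH_RADIUS_MILES * math.pi / 2)
```

A quarter turn along the equator should equal the radius times pi/2, which makes a convenient check that the radians conversion and the formula are wired up correctly.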
Redshift Github UDF Repository
Script                          Purpose
f_encryption.sql                Uses pyaes library to encrypt/decrypt strings using passphrase
f_next_business_day.sql         Uses pandas library to return dates which are US Federal Holiday aware
f_null_syns.sql                 Uses python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql    Uses urlparse to parse the field-value pairs from a url query string
f_parse_xml.sql                 Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql       Uses pandas library to convert a unix timestamp to UTC datetime
github.com/awslabs/amazon-redshift-udfs
Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: Capture and deliver streaming data into Redshift without writing an
application
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: Seamlessly scales to match data throughput w/o intervention
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3 and Redshift
Analyze streaming data using Chartio
• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder – holds manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If Firehose cannot reach the cluster, it retries every 5 min for 60 min and then moves on to the next batch of objects
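For reference, a Redshift COPY manifest is a small JSON document listing the S3 objects to load. The sketch below builds one in that general format; it is not the code Firehose runs, the helper name, bucket, and object keys are hypothetical, and marking every entry as mandatory is an assumption.

```python
import json


def build_copy_manifest(s3_urls):
    """Return COPY manifest text (JSON) listing the given S3 objects."""
    # Assumption: every file is marked mandatory, so a missing file fails the load.
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical bucket and object names, for illustration only.
manifest = build_copy_manifest([
    "s3://example-bucket/firehose/part-0001.gz",
    "s3://example-bucket/firehose/part-0002.gz",
])
print(manifest)
```

A COPY command that points at such a manifest loads exactly the listed files, which is what makes the "whole batch succeeds or fails" behavior described above possible to track per manifest.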
Multi-Column Sort
• Compound sort keys
– Filter data by one leading column
• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty
Compound sort keys illustrated
• Four records fill a block, sorted by customer.
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, shown next to a 4×4 grid of [cust_id, prod_id] pairs from [1,1] to [4,4]]
Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys are equal.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, with records in interleaved sort order]
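The block counts from the two illustrations can be reproduced with a small simulation. This is a sketch under stated assumptions: 16 records (every pairing of cust_id 1 to 4 with prod_id 1 to 4), four records per block, and a Z-order bit interleaving standing in for the interleaved sort. The slides do not describe Redshift's internal algorithm, so the interleaving here is only an illustration of the idea.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # four records fill a block, as on the slides

# 16 records: every combination of cust_id 1..4 and prod_id 1..4.
records = [(c, p) for c in range(1, 5) for p in range(1, 5)]


def interleave_bits(c, p, bits=2):
    """Z-order key: alternate the bits of c and p, most significant first."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((c >> i) & 1)
        key = (key << 1) | ((p >> i) & 1)
    return key


def blocks_touched(sorted_records, column):
    """Count how many blocks hold each value of `column` (0 = cust_id, 1 = prod_id)."""
    touched = defaultdict(set)
    for position, record in enumerate(sorted_records):
        touched[record[column]].add(position // BLOCK_SIZE)
    return {value: len(blocks) for value, blocks in touched.items()}


compound = sorted(records)  # sort by cust_id, then prod_id
interleaved = sorted(records, key=lambda r: interleave_bits(r[0] - 1, r[1] - 1))

print("compound    cust_id:", blocks_touched(compound, 0))
print("compound    prod_id:", blocks_touched(compound, 1))
print("interleaved cust_id:", blocks_touched(interleaved, 0))
print("interleaved prod_id:", blocks_touched(interleaved, 1))
```

With the compound order a filter on cust_id reads one block while a filter on prod_id reads all four; with the interleaved order either filter reads two, which is the trade-off the two slides describe.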
Interleaved Sort Key Considerations
• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, interleaved sort order will skew over time
– You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved sort keys
SAN FRANCISCO
Questions/Comments?
Please contact us at redshift-feedback@amazon.com

Contenu connexe

Tendances

Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizationsSzehon Ho
 
AWS Segment XO Group Joint webinar
AWS Segment XO Group Joint webinarAWS Segment XO Group Joint webinar
AWS Segment XO Group Joint webinarArti Bhatia
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAmazon Web Services
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 
LLAP: Locality is dead (in the cloud)
LLAP: Locality is dead (in the cloud)  LLAP: Locality is dead (in the cloud)
LLAP: Locality is dead (in the cloud) Future of Data Meetup
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Databricks
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...Ontico
 
phoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetupphoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetupMaryann Xue
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
 
Meet Hadoop Family: part 4
Meet Hadoop Family: part 4Meet Hadoop Family: part 4
Meet Hadoop Family: part 4caizer_x
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 

Tendances (20)

Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
AWS Segment XO Group Joint webinar
AWS Segment XO Group Joint webinarAWS Segment XO Group Joint webinar
AWS Segment XO Group Joint webinar
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
LLAP: Locality is dead (in the cloud)
LLAP: Locality is dead (in the cloud)  LLAP: Locality is dead (in the cloud)
LLAP: Locality is dead (in the cloud)
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
Columnar Table Performance Enhancements Of Greenplum Database with Block Meta...
 
phoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetupphoenix-on-calcite-nyc-meetup
phoenix-on-calcite-nyc-meetup
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
 
Meet Hadoop Family: part 4
Meet Hadoop Family: part 4Meet Hadoop Family: part 4
Meet Hadoop Family: part 4
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 

En vedette

Optimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 MinutesOptimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 MinutesAlexandra Sasha Blumenfeld
 
Using cohort analysis to understand your SaaS business | Growth Hacking Brussels
Using cohort analysis to understand your SaaS business | Growth Hacking BrusselsUsing cohort analysis to understand your SaaS business | Growth Hacking Brussels
Using cohort analysis to understand your SaaS business | Growth Hacking BrusselsUniversem
 
The Vital Metrics Every Sales Team Should Be Measuring
The Vital Metrics Every Sales Team Should Be MeasuringThe Vital Metrics Every Sales Team Should Be Measuring
The Vital Metrics Every Sales Team Should Be MeasuringChartio
 
How To Drive Exponential Growth Using Unconventional Data Sources
How To Drive Exponential Growth Using Unconventional Data SourcesHow To Drive Exponential Growth Using Unconventional Data Sources
How To Drive Exponential Growth Using Unconventional Data SourcesChartio
 
Producing and Analyzing Rich Data with PostgreSQL
Producing and Analyzing Rich Data with PostgreSQLProducing and Analyzing Rich Data with PostgreSQL
Producing and Analyzing Rich Data with PostgreSQLChartio
 
From Data to Insight: Uncovering the 'Aha' Moments That Matter
From Data to Insight: Uncovering the 'Aha' Moments That MatterFrom Data to Insight: Uncovering the 'Aha' Moments That Matter
From Data to Insight: Uncovering the 'Aha' Moments That MatterQualtrics
 
Learn How to Run Python on Redshift
Learn How to Run Python on RedshiftLearn How to Run Python on Redshift
Learn How to Run Python on RedshiftChartio
 
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
Using the PostgreSQL Extension Ecosystem for Advanced AnalyticsUsing the PostgreSQL Extension Ecosystem for Advanced Analytics
Using the PostgreSQL Extension Ecosystem for Advanced AnalyticsChartio
 
WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?
WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?
WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?Totango
 

En vedette (9)

Optimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 MinutesOptimize Your Reporting In Less Than 10 Minutes
Optimize Your Reporting In Less Than 10 Minutes
 
Using cohort analysis to understand your SaaS business | Growth Hacking Brussels
Using cohort analysis to understand your SaaS business | Growth Hacking BrusselsUsing cohort analysis to understand your SaaS business | Growth Hacking Brussels
Using cohort analysis to understand your SaaS business | Growth Hacking Brussels
 
The Vital Metrics Every Sales Team Should Be Measuring
The Vital Metrics Every Sales Team Should Be MeasuringThe Vital Metrics Every Sales Team Should Be Measuring
The Vital Metrics Every Sales Team Should Be Measuring
 
How To Drive Exponential Growth Using Unconventional Data Sources
How To Drive Exponential Growth Using Unconventional Data SourcesHow To Drive Exponential Growth Using Unconventional Data Sources
How To Drive Exponential Growth Using Unconventional Data Sources
 
Producing and Analyzing Rich Data with PostgreSQL
Producing and Analyzing Rich Data with PostgreSQLProducing and Analyzing Rich Data with PostgreSQL
Producing and Analyzing Rich Data with PostgreSQL
 
From Data to Insight: Uncovering the 'Aha' Moments That Matter
From Data to Insight: Uncovering the 'Aha' Moments That MatterFrom Data to Insight: Uncovering the 'Aha' Moments That Matter
From Data to Insight: Uncovering the 'Aha' Moments That Matter
 
Learn How to Run Python on Redshift
Learn How to Run Python on RedshiftLearn How to Run Python on Redshift
Learn How to Run Python on Redshift
 
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
Using the PostgreSQL Extension Ecosystem for Advanced AnalyticsUsing the PostgreSQL Extension Ecosystem for Advanced Analytics
Using the PostgreSQL Extension Ecosystem for Advanced Analytics
 
WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?
WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?
WHAT DATA DO YOU NEED TO BUILD A COMPREHENSIVE HEALTH SCORE?
 

Similaire à Redshift Chartio Event Presentation

Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon AthenaJulien SIMON
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftAmazon Web Services LATAM
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseAmazon Web Services
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersSpeck&Tech
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseAmazon Web Services
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From UberChester Chen
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Big data dive amazon emr processing
Big data dive amazon emr processingBig data dive amazon emr processing
Big data dive amazon emr processingOlga Lavrentieva
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 

Similaire à Redshift Chartio Event Presentation (20)

Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon Redshift
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data Warehouse
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 
SF Big Analytics meetup : Hoodie From Uber
SF Big Analytics meetup : Hoodie  From UberSF Big Analytics meetup : Hoodie  From Uber
SF Big Analytics meetup : Hoodie From Uber
 
Amazon Redshift Deep Dive
Amazon Redshift Deep Dive Amazon Redshift Deep Dive
Amazon Redshift Deep Dive
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Big data dive amazon emr processing
Big data dive amazon emr processingBig data dive amazon emr processing
Big data dive amazon emr processing
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 

Dernier

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Dernier (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Redshift Chartio Event Presentation

  • 1. Amazon Redshift Spend time with your data, not your database….
  • 2. Data Warehouse Challenges
      • Cost
      • Complexity
      • Performance
      • Rigidity
      [Chart: Enterprise Data vs. Data in Warehouse, 1990 to 2020]
  • 3. Amazon Redshift powers Clickstream Analytics for Amazon.com
      • Web log analysis for Amazon.com
        – Petabyte workload
        – Largest table: 400 TB
      • Understand customer behavior
        – Who is browsing but not buying
        – Which products/features are winners
        – What sequence led to higher customer conversion
      • Solution
        – Best scale-out solution: query across 1 week
        – Hadoop: query across 1 month
  • 4. Amazon Redshift benefits realized
      • Performance
        – Scan 2.25 trillion rows of data: 14 minutes
        – Load 5 billion rows of data: 10 minutes
        – Backfill 150 billion rows of data: 9.75 hours
        – Pig → Amazon Redshift: 2 days to 1 hr
          • 10B row join with 700M rows
        – Oracle → Amazon Redshift: 90 hours to 8 hrs
      • Cost
        – 1.6 PB cluster
        – 100 8xl HDD nodes
        – $180/hr
      • Complexity
        – 20% time of one DBA
          • Backup
          • Restore
          • Resizing
  • 6. Scalar User-Defined Functions (UDF)
      • Scalar UDFs using Python 2.7
        – Return single result value for each input value
        – Executed in parallel across cluster
        – Syntax largely identical to PostgreSQL
        – The f_ function-name prefix is reserved for customers
      • Pandas, NumPy, SciPy pre-installed
        – Do matrix operations, build optimization algorithms, and run statistical analyses
        – Build end-to-end modeling workflow
      • Import your own libraries

      CREATE FUNCTION f_function_name
      ( [ argument_name arg_type, ... ] )
      RETURNS data_type
      { VOLATILE | STABLE | IMMUTABLE }
      AS $$
      python_program
      $$ LANGUAGE plpythonu;
  • 7. Scalar UDF Security
      • Run in restricted container that is fully isolated
        – Cannot make system and network calls
        – Cannot corrupt your cluster or negatively impact its performance
      • Current limitations
        – Can’t access file system - functions that write files won’t work
        – Don’t yet cache stable and immutable functions
        – Slower than built-in functions compiled to machine code
          • Haven’t fully optimized some cases, including nested functions
  • 8. Scalar UDF example - URL parsing

      CREATE FUNCTION f_hostname (url varchar)
      RETURNS varchar
      IMMUTABLE AS $$
          import urlparse
          return urlparse.urlparse(url).hostname
      $$ LANGUAGE plpythonu;

      SELECT f_hostname(url) FROM table;

      SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
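The hostname logic of f_hostname can be tried outside the cluster as an ordinary Python function. This is a minimal sketch, assuming Python 3, where the Python 2 urlparse module used on the slide lives at urllib.parse; the function name hostname and the sample URL are illustrative and not part of the deck.

```python
from urllib.parse import urlparse  # Python 3 home of the Python 2 "urlparse" module


def hostname(url):
    """Return the host part of a URL, or None when there is none."""
    if url is None:
        # Mirror SQL semantics: a NULL input gives a NULL result.
        return None
    # .hostname drops the scheme, any user info, and the port.
    return urlparse(url).hostname


print(hostname("https://user@www.example.com:8080/path?q=1"))
```

A string without a scheme and network location, such as "not a url", yields None rather than raising, which is the behavior a scalar UDF would need for dirty log data.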
  • 9. Scalar UDF example – Distance

      CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
      RETURNS float
      STABLE AS $$
          import math
          r = 3963.1676  # earth's radius, in miles
          phi_orig = math.radians(orig_lat)
          phi_dest = math.radians(dest_lat)
          delta_lat = math.radians(dest_lat - orig_lat)
          delta_long = math.radians(dest_long - orig_long)
          a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig) \
              * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2)
          c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
          d = r * c
          return d
      $$ LANGUAGE plpythonu;
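The body of f_distance is plain Python, so it can be checked locally before being wrapped in CREATE FUNCTION. The sketch below repeats the same haversine calculation as a standalone function; the name distance_miles and the sample coordinates are illustrative assumptions, while the radius constant is the one from the slide.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same constant as on the slide


def distance_miles(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two points given in degrees."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    # Haversine term: squared half-chord length between the two points.
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    # Angular distance in radians.
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


# Approximate coordinates for Seattle and San Francisco (illustrative input).
print(distance_miles(47.6062, -122.3321, 37.7749, -122.4194))
```

A quick sanity check is a quarter of the equator: from (0, 0) to (0, 90) the result should equal the radius times pi / 2.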
  • 10. Redshift GitHub UDF Repository
      Script                         Purpose
      f_encryption.sql               Uses pyaes library to encrypt/decrypt strings using passphrase
      f_next_business_day.sql        Uses pandas library to return dates which are US Federal Holiday aware
      f_null_syns.sql                Uses python sets to match strings, similar to a SQL IN condition
      f_parse_url_query_string.sql   Uses urlparse to parse the field-value pairs from a url query string
      f_parse_xml.sql                Uses xml.etree.ElementTree to parse XML
      f_unixts_to_timestamp.sql      Uses pandas library to convert a unix timestamp to UTC datetime
      github.com/awslabs/amazon-redshift-udfs
  • 11. Amazon Kinesis Firehose to Amazon Redshift
      Load massive volumes of streaming data into Amazon Redshift
      • Zero administration: Capture and deliver streaming data into Redshift without writing an application
      • Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
      • Seamless elasticity: Seamlessly scales to match data throughput without intervention
      [Diagram: 1. Capture and submit streaming data to Firehose; 2. Firehose loads streaming data continuously into S3 and Redshift; 3. Analyze streaming data using Chartio]
  • 12. Amazon Kinesis Firehose to Amazon Redshift
      • Uses your S3 bucket as an intermediate destination
      • S3 bucket has ‘manifests’ folder, which holds the manifest of files to be copied
      • Issues COPY command synchronously
      • Single delivery stream loads into a single Redshift cluster, database, and table
      • Continuously issues COPY once previous one is finished
      • Frequency of COPYs determined by how fast your cluster can load files
      • No partial loads. If a single record fails, whole file or batch fails
      • Info on skipped files delivered to S3 bucket as manifest in errors folder
      • If Firehose cannot reach the cluster, it retries every 5 min for 60 min and then moves on to the next batch of objects
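The retry behavior in the last bullet can be pictured with a small simulation. This is only a sketch of the policy as worded on the slide, not Firehose's actual implementation: the callable try_copy, the simulated clock in minutes, and the choice to attempt at minutes 0 through 60 inclusive are all assumptions made for illustration.

```python
RETRY_INTERVAL_MIN = 5   # from the slide: retry every 5 minutes
RETRY_WINDOW_MIN = 60    # from the slide: give up on the batch after 60 minutes


def deliver_batch(batch, try_copy, interval=RETRY_INTERVAL_MIN, window=RETRY_WINDOW_MIN):
    """Attempt one batch; return (delivered, minutes_spent).

    try_copy(batch, minute) is a hypothetical callable that returns True when
    the cluster was reachable and the COPY succeeded at that relative minute.
    """
    minute = 0
    while minute <= window:
        if try_copy(batch, minute):
            return True, minute
        minute += interval
    return False, window


def deliver_all(batches, try_copy):
    """Process batches one at a time, skipping any batch that times out."""
    delivered, skipped = [], []
    clock = 0  # simulated wall-clock time in minutes
    for batch in batches:
        ok, spent = deliver_batch(batch, lambda b, m: try_copy(b, clock + m))
        clock += spent
        (delivered if ok else skipped).append(batch)
    return delivered, skipped


# Simulated outage: the cluster only becomes reachable at minute 70.
delivered, skipped = deliver_all(["a", "b", "c"], lambda batch, minute: minute >= 70)
print(delivered, skipped)
```

In this run batch "a" exhausts its 60-minute window and is skipped, while "b" and "c" load once the cluster is back. The skipped list stands in for the manifest that the slide says is written to the errors folder.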
  • 13. Multi-Column Sort
      • Compound sort keys
        – Filter data by one leading column
      • Interleaved sort keys
        – Filter data by up to eight columns
        – No storage overhead, unlike an index or projection
        – Lower maintenance penalty
  • 14. Compound sort keys illustrated
      • Four records fill a block, sorted by customer
      • Records with a given customer are all in one block.
      • Records with a given product are spread across four blocks.
      [Figure: 4 x 4 grid of [cust_id, prod_id] pairs, cust_id 1 to 4 by prod_id 1 to 4, and the resulting blocks with columns cust_id, prod_id, other columns]
  • 15. Interleaved sort keys illustrated
      • Records with a given customer are spread across two blocks.
      • Records with a given product are also spread across two blocks.
      • Both keys are equal.
      [Figure: 4 x 4 grid of [cust_id, prod_id] pairs, cust_id 1 to 4 by prod_id 1 to 4, and the resulting blocks with columns cust_id, prod_id, other columns]
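The block counts on the two "illustrated" slides can be reproduced with a toy model of the 4 x 4 grid. This is a sketch under stated assumptions: the interleaved order is modeled as a Z-order (Morton) curve, which the slides do not spell out, and the helper names z_order and blocks_touched are made up for the example.

```python
from collections import defaultdict

RECORDS_PER_BLOCK = 4   # "four records fill a block"
IDS = range(1, 5)       # cust_id and prod_id both run 1..4, as on the slides


def z_order(cust_id, prod_id, bits=2):
    """Interleave the bits of the two zero-based keys (Morton code)."""
    c, p = cust_id - 1, prod_id - 1
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((c >> i) & 1)
        z = (z << 1) | ((p >> i) & 1)
    return z


def blocks_touched(sort_key):
    """Sort all 16 records, cut them into blocks, count blocks per key value."""
    records = sorted(((c, p) for c in IDS for p in IDS), key=sort_key)
    by_cust, by_prod = defaultdict(set), defaultdict(set)
    for position, (c, p) in enumerate(records):
        block = position // RECORDS_PER_BLOCK
        by_cust[c].add(block)
        by_prod[p].add(block)
    return ({c: len(b) for c, b in by_cust.items()},
            {p: len(b) for p, b in by_prod.items()})


compound = blocks_touched(lambda r: r)               # sort by cust_id, then prod_id
interleaved = blocks_touched(lambda r: z_order(*r))  # sort along the Z-order curve
print("compound:   ", compound)
print("interleaved:", interleaved)
```

Under this model the compound order puts each customer in one block and each product in four, while the interleaved order puts every customer and every product in two blocks, matching the counts stated on the slides.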
  • 16. Interleaved Sort Key Considerations
      • Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
      • If data increases monotonically, such as dates, interleaved sort order will skew over time
        – You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
      • A query that filters on the leading sort column runs faster with compound sort keys than with interleaved sort keys
  • 17. SAN FRANCISCO Questions/Comments? Please contact us at redshift-feedback@amazon.com

Editor's notes

  1. Can’t add a file. When you run something like python script.py, the script is compiled to bytecode, and the interpreter (CPython, which is really just a C program) reads that bytecode and executes it. The code is not translated to machine code. Other implementations, such as PyPy, use JIT compilation, i.e. they translate Python to machine code on the fly.
  2. urlparse is part of the Python standard library.
  3. The distance function uses the haversine formula.
  4. The data producer sends data blobs as large as 1,000 KB to a delivery stream.
  5. With 1,000,000 blocks (1 TB per column) and an interleaved sort key on both customer ID and page ID, you scan 1,000 blocks when you filter on a specific customer or page, a 1,000x speedup compared to the unsorted case.
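The arithmetic behind note 5 can be written out as follows. This is a sketch, assuming Redshift's 1 MB block size with decimal units, and assuming that two interleaved keys share the pruning evenly so that a filter on either key leaves about the square root of the block count; that reading is consistent with the note's numbers but is not stated in the deck.

```python
import math

BLOCK_SIZE_MB = 1            # Redshift stores column data in 1 MB blocks
column_size_mb = 1_000_000   # 1 TB per column, in decimal megabytes
total_blocks = column_size_mb // BLOCK_SIZE_MB

# Assumption: with two interleaved keys, a filter on either key prunes
# the scan down to roughly sqrt(N) blocks.
blocks_scanned = math.isqrt(total_blocks)

# An unsorted table would have to scan every block.
speedup = total_blocks // blocks_scanned

print(total_blocks, blocks_scanned, speedup)
```

The same square-root relationship shows up in the 16-record illustration on slide 15, where 4 blocks shrink to 2 per key value.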