Redshift Chartio Event Presentation

Amazon Redshift
Spend time with your data, not your database.

Data Warehouse Challenges
Cost
Complexity
Performance
Rigidity
[Chart: enterprise data vs. data in warehouse, 1990 to 2020]

Amazon Redshift powers Clickstream Analytics for Amazon.com
• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB
• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion
• Solution
– Best scale-out solution—query across 1 week
– Hadoop—query across 1 month

Amazon Redshift benefits realized
• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hr
• 10B row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hrs
• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr
• Complexity
– 20% time of one DBA
• Backup
• Restore
• Resizing
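The figures above imply some simple rates. A minimal sketch of the arithmetic, assuming the cluster runs continuously at the quoted $180/hr and that 1 PB = 1,000 TB (neither assumption is stated on the slide):

```python
# Back-of-the-envelope rates derived from the slide's figures.

def rows_per_second(rows, seconds):
    """Average throughput for a job that handled `rows` in `seconds`."""
    return rows / seconds

scan_rate = rows_per_second(2.25e12, 14 * 60)        # 2.25 trillion rows in 14 minutes
load_rate = rows_per_second(5e9, 10 * 60)            # 5 billion rows in 10 minutes
backfill_rate = rows_per_second(150e9, 9.75 * 3600)  # 150 billion rows in 9.75 hours

HOURLY_COST = 180.0        # $/hr for the whole cluster (from the slide)
NODES = 100                # 8xl HDD nodes (from the slide)
CLUSTER_TB = 1.6 * 1000    # assumption: 1 PB = 1,000 TB
HOURS_PER_YEAR = 24 * 365  # assumption: cluster is always on

cost_per_node_hour = HOURLY_COST / NODES
cost_per_tb_year = HOURLY_COST * HOURS_PER_YEAR / CLUSTER_TB

print("scan rows/s:     %.3e" % scan_rate)
print("load rows/s:     %.3e" % load_rate)
print("backfill rows/s: %.3e" % backfill_rate)
print("$ per node-hour: %.2f" % cost_per_node_hour)
print("$ per TB-year:   %.2f" % cost_per_tb_year)
```

Under these assumptions the scan works out to roughly 2.7 billion rows per second, and the cluster to about $1.80 per node-hour, or a little under $1,000 per TB per year.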

Expanding Amazon Redshift Functionality

Scalar User-Defined Functions (UDF)
• Scalar UDFs using Python 2.7
– Return single result value for each input value
– Executed in parallel across cluster
– Syntax largely identical to PostgreSQL
– The f_ prefix is reserved for customer functions, so names starting with f_ will not collide with built-in functions
• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run statistical analyses
– Build end-to-end modeling workflow
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;

Scalar UDF Security
• Run in restricted container that is fully isolated
– Cannot make system and network calls
– Cannot corrupt your cluster or negatively impact its performance
• Current limitations
– Can’t access file system - functions that write files won’t work
– Don’t yet cache stable and immutable functions
– Slower than built-in functions compiled to machine code
• Haven’t fully optimized some cases, including nested functions

Scalar UDF example - URL parsing
CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
  import urlparse
  return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;

-- Using the UDF
SELECT f_hostname(url) FROM table;
-- Comparable expression using a built-in regular expression function
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
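The body of f_hostname can be tried outside the cluster before it is registered. The sketch below is an approximation, not the deck's code. It assumes Python 3, where the Python 2.7 urlparse module lives in urllib.parse. It uses Python's re module as a stand-in for Redshift's regular expression engine, which may differ in details. It also assumes the regex replacement string is a backreference to the third group. The helper names hostname_udf and hostname_regex are hypothetical.

```python
import re
from urllib.parse import urlparse  # Python 3 location of the Python 2.7 urlparse module

# Same pattern as the REGEXP_REPLACE call; group 3 is intended to capture the host.
HOST_PATTERN = r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)'

def hostname_udf(url):
    """Mirror of the f_hostname body: let the URL parser find the host."""
    return urlparse(url).hostname

def hostname_regex(url):
    """Mirror of the regex approach: keep only the third captured group."""
    return re.sub(HOST_PATTERN, r'\3', url)

for url in ("https://user@www.example.com:8080/path?x=1",
            "http://example.com/a@b"):
    print(url, "->", hostname_udf(url), "|", hostname_regex(url))
```

Both approaches agree on the first URL. On the second, the "@" in the path makes the regex treat everything before it as user info, so it returns "b" while the parser returns "example.com". This is one reason a parsing UDF can be easier to trust than a hand-written pattern.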

Scalar UDF example – Distance
CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE
AS $$
  import math
  r = 3963.1676 # earth's radius, in miles
  phi_orig = math.radians(orig_lat)
  phi_dest = math.radians(dest_lat)
  delta_lat = math.radians(dest_lat - orig_lat)
  delta_long = math.radians(dest_long - orig_long)
  # haversine formula; parentheses allow the expression to span two lines
  a = (math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig)
       * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2))
  c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  d = r * c
  return d
$$ LANGUAGE plpythonu;
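The distance calculation is the haversine formula, and its body can be checked locally against cases with known answers. This is a minimal sketch, assuming a local Python interpreter rather than the Redshift UDF container. The function name distance and the check values are illustrative, not from the deck.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same radius as the UDF

def distance(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two points given in degrees."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c

# Sanity checks with analytically known answers:
same_point = distance(10.0, 20.0, 10.0, 20.0)    # identical points
half_equator = distance(0.0, 0.0, 0.0, 180.0)    # half way around the equator
one_degree = distance(0.0, 0.0, 0.0, 1.0)        # one degree of longitude at the equator

print(same_point, half_equator, one_degree)
```

The three checks should give zero, pi times the radius, and the radius times pi/180 respectively, which is a quick way to catch a transcription error such as a misplaced operator.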

Redshift GitHub UDF Repository
Script: Purpose
• f_encryption.sql: Uses pyaes library to encrypt/decrypt strings using passphrase
• f_next_business_day.sql: Uses pandas library to return dates which are US Federal Holiday aware
• f_null_syns.sql: Uses python sets to match strings, similar to a SQL IN condition
• f_parse_url_query_string.sql: Uses urlparse to parse the field-value pairs from a url query string
• f_parse_xml.sql: Uses xml.etree.ElementTree to parse XML
• f_unixts_to_timestamp.sql: Uses pandas library to convert a unix timestamp to UTC datetime
github.com/awslabs/amazon-redshift-udfs

Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: Capture and deliver streaming data into Redshift without writing an application
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: Scales to match data throughput without intervention
1. Capture and submit streaming data to Firehose
2. Firehose loads streaming data continuously into S3 and Redshift
3. Analyze streaming data using Chartio

• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder – holds manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If Firehose cannot reach the cluster, it retries every 5 min for 60 min and then moves on to the next batch of objects
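The retry behavior in the last bullet can be pictured as a small loop. The sketch below is a simulation only, not Firehose code. The names deliver and try_copy are hypothetical, and it assumes one initial attempt followed by a retry every 5 minutes until 60 minutes have elapsed. The slide does not spell out the exact attempt count. No real sleeping or AWS calls take place.

```python
RETRY_INTERVAL_S = 5 * 60   # from the slide: retry every 5 min
RETRY_WINDOW_S = 60 * 60    # from the slide: keep retrying for 60 min

def deliver(batches, try_copy):
    """Simulate sequential COPYs; returns (loaded, skipped) batch lists.

    try_copy(batch) is a caller-supplied function returning True when the
    cluster was reachable and the COPY succeeded.
    """
    loaded, skipped = [], []
    for batch in batches:
        waited = 0  # simulated seconds spent waiting on this batch
        while True:
            if try_copy(batch):
                loaded.append(batch)
                break
            if waited >= RETRY_WINDOW_S:
                skipped.append(batch)  # give up and move on to the next batch
                break
            waited += RETRY_INTERVAL_S  # a real loop would sleep here
    return loaded, skipped

attempts = {}

def flaky_copy(batch):
    """Pretend the cluster is unreachable whenever batch 'b' is being loaded."""
    attempts[batch] = attempts.get(batch, 0) + 1
    return batch != "b"

loaded, skipped = deliver(["a", "b", "c"], flaky_copy)
print(loaded, skipped, attempts)
```

Because each COPY starts only after the previous one finishes, a single unreachable window delays every later batch by up to an hour in this model.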

Multi-Column Sort
• Compound sort keys
– Filter data by one leading column
• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty

Compound sort keys illustrated
• Four records fill a block, sorted by customer
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.
[Figure: 4×4 grid of [cust_id, prod_id] pairs (cust_id 1-4 by prod_id 1-4), next to a table with columns cust_id, prod_id, other columns, and blocks]

Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys are weighted equally.
[Figure: 4×4 grid of [cust_id, prod_id] pairs (cust_id 1-4 by prod_id 1-4), next to a table with columns cust_id, prod_id, other columns, and blocks]
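The block counts in the two illustrations can be reproduced with a toy model. This sketch assumes 16 rows, one per (cust_id, prod_id) pair, and four rows per block as on the slides. It uses a Z-order (bit-interleaving) curve as a stand-in for the interleaved ordering. That choice is an assumption for illustration, not a description of Redshift internals, and the helper names are hypothetical.

```python
from itertools import product

BLOCK_SIZE = 4  # four records fill a block, as on the slides

def to_blocks(rows, key):
    """Sort rows by key and cut the result into fixed-size blocks."""
    ordered = sorted(rows, key=key)
    return [ordered[i:i + BLOCK_SIZE] for i in range(0, len(ordered), BLOCK_SIZE)]

def z_order(row):
    """Interleave the bits of (cust_id - 1) and (prod_id - 1) into one sort value."""
    cust, prod = row[0] - 1, row[1] - 1
    z = 0
    for bit in range(2):  # ids 1..4 need two bits each
        z |= ((cust >> bit) & 1) << (2 * bit + 1)
        z |= ((prod >> bit) & 1) << (2 * bit)
    return z

def blocks_touched(layout, column, value):
    """Number of blocks a filter `column = value` would have to read."""
    return sum(1 for block in layout if any(row[column] == value for row in block))

rows = list(product(range(1, 5), range(1, 5)))  # (cust_id, prod_id) pairs
compound = to_blocks(rows, key=lambda r: r)     # sort by cust_id, then prod_id
interleaved = to_blocks(rows, key=z_order)

for name, layout in (("compound", compound), ("interleaved", interleaved)):
    print(name,
          "cust_id=1 ->", blocks_touched(layout, 0, 1), "blocks;",
          "prod_id=1 ->", blocks_touched(layout, 1, 1), "blocks")
```

In this model a filter on the leading column reads one block under the compound key and two under the interleaved key, while a filter on the second column reads four blocks versus two.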

Interleaved Sort Key Considerations
• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, interleaved sort order will skew over time
– You’ll need to run a vacuum operation (VACUUM REINDEX) to re-analyze the distribution and re-sort the data.
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved sort keys

SAN FRANCISCO
Questions/Comments?
Please contact us at redshift-feedback@amazon.com
