13. Pay per query - $5 per TB scanned
• You pay for the amount of data each query scans
• Ways to reduce cost:
• Compress your data
• Convert to a columnar format
• Use partitioning
• Free: DDL queries and failed queries
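At $5 per TB scanned, the effect of compression and columnar formats is easy to quantify. A minimal sketch in Python (the byte counts are illustrative examples, not AWS-published figures):

```python
# Sketch: estimate Athena query cost at $5 per TB of data scanned.
PRICE_PER_TB = 5.00
TB = 10 ** 12  # Athena bills on decimal terabytes

def query_cost(bytes_scanned: int) -> float:
    """Cost in USD for a single query, rounded to the cent."""
    return round(bytes_scanned / TB * PRICE_PER_TB, 2)

# Full scan of 1 TB of raw text logs:
raw_cost = query_cost(10 ** 12)
# Same query against compressed, columnar data that scans only ~2.69 GB:
columnar_cost = query_cost(2_690_000_000)
print(raw_cost, columnar_cost)
```

Because pricing is linear in bytes scanned, every technique on this slide (compression, columnar formats, partitioning) cuts cost by exactly the fraction of data it avoids scanning.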
Dataset                                Size on Amazon S3       Query Run Time   Data Scanned            Cost
Logs stored as text files              1 TB                    237 seconds      1.15 TB                 $5.75
Logs stored in Apache Parquet format*  130 GB                  5.13 seconds     2.69 GB                 $0.013
Savings                                87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
14. Converting to ORC and Parquet
• You can use a Hive CTAS query to convert data:
CREATE TABLE new_key_value_store
STORED AS PARQUET
AS
SELECT col_1, col2, col3 FROM noncolumartable
SORT BY new_key, key_value_pair;
• You can also use Spark to convert files to Parquet / ORC
• ~20 lines of PySpark code running on EMR converted 1 TB of text data into 130 GB of Parquet
• Total cost of the conversion: about $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
15. How to define your partitions
CREATE EXTERNAL TABLE Employee (
Id INT,
Name STRING,
Address STRING
) PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION
's3://mybucket/athena/inputdata/';

-- Incorrect: the partition column must not also appear in the column list
CREATE EXTERNAL TABLE Employee (
Id INT,
Name STRING,
Address STRING,
Year INT
) PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION
's3://mybucket/athena/inputdata/';
16. How to define your partitions
s3://elasticmapreduce/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
CREATE EXTERNAL TABLE impressions (requestBeginTime string, ......)
PARTITIONED BY (dt string)
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';
When a new partition such as dt=2009-04-12-14-10/ appears in S3, load it into the table metadata with:
MSCK REPAIR TABLE impressions
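Hive-style partition prefixes like the dt=... keys above can be generated mechanically when writing data. A minimal Python sketch reusing the sample bucket path from this slide (nothing here calls AWS; it only builds the key strings):

```python
from datetime import datetime, timedelta

# Sample table location from the slide.
BASE = "s3://elasticmapreduce/samples/hive-ads/tables/impressions/"

def partition_prefix(ts: datetime) -> str:
    """Hive-style partition prefix for a timestamp, e.g. .../dt=2009-04-12-13-05/."""
    return f"{BASE}dt={ts.strftime('%Y-%m-%d-%H-%M')}/"

# Generate the five 5-minute partitions shown on the slide.
start = datetime(2009, 4, 12, 13, 0)
prefixes = [partition_prefix(start + timedelta(minutes=5 * i)) for i in range(5)]
for p in prefixes:
    print(p)
```

Because the directory names follow the `key=value` convention, a single `MSCK REPAIR TABLE` call discovers all of them at once; prefixes that do not follow the convention would need explicit `ALTER TABLE ... ADD PARTITION` statements instead.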
24. Key migration and TCO considerations
• Do not lift and shift
• Separate storage and compute by using S3
• Deconstruct workloads and map them to open-source tools
• Use ephemeral clusters and auto scaling
• Choose instance types carefully and use EC2 Spot Instances
25. Decouple compute and storage - use S3 as your data layer
[Diagram: Amazon EMR clusters on EC2 instances read from and write to Amazon S3, which is designed for 11 9's of durability and is massively scalable; intermediates are stored in memory, on local disk, or on HDFS.]
30. Options for submitting jobs
• Amazon EMR Step API - submit a Spark application
• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2 - create a pipeline to schedule job submission or create complex workflows
• AWS Lambda - use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
• Apache Oozie - use Oozie on your cluster to build DAGs of jobs
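For the Step API option, a step is just a small JSON document handed to EMR. A sketch of the dict shape that boto3's `add_job_flow_steps` accepts (the s3:// application path, arguments, and job-flow ID are hypothetical; `command-runner.jar` is EMR's generic command runner):

```python
# Sketch: build a Spark step for the EMR Step API. Pure dict construction;
# no AWS call is made here.
def spark_step(name: str, app_s3_path: str, *app_args: str) -> dict:
    """Return a step definition that runs spark-submit in cluster deploy mode."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's built-in command runner
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     app_s3_path, *app_args],
        },
    }

step = spark_step("nightly-etl", "s3://mybucket/jobs/etl.py", "--date", "2009-04-12")
# With boto3 this would be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXX", Steps=[step])
print(step["Name"])
```

The same dict works whether the caller is a scheduler on EC2, an AWS Lambda function, or a one-off script, which is why the options on this slide all converge on the Step API.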
33. Use Spot and Reserved Instances to reduce cost
• Spot for task nodes - up to 80% off EC2 On-Demand pricing
• On-Demand for core nodes - standard Amazon EC2 pricing for on-demand capacity
• Meet SLAs at a predictable cost; exceed SLAs at a lower cost
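The blended cost of On-Demand core nodes plus Spot task nodes can be estimated in a few lines. A sketch with illustrative prices (the 80% discount matches the figure above; actual Spot prices fluctuate):

```python
# Sketch: blended hourly cost of an EMR cluster with On-Demand core nodes
# and Spot task nodes. All prices here are illustrative, not real quotes.
def cluster_hourly_cost(core_nodes: int, task_nodes: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """spot_discount is the fraction off On-Demand, e.g. 0.80 for 80% off."""
    spot_price = on_demand_price * (1 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price

# 5 On-Demand core nodes plus 20 Spot task nodes at 80% off a $1.00/hr rate:
print(cluster_hourly_cost(5, 20, 1.00, 0.80))
```

Keeping core nodes (which hold HDFS data) On-Demand protects the cluster from Spot interruptions, while the interruptible task nodes carry the bulk of the compute at the discounted rate.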
37. DataXu - 180 TB of log data per day
[Diagram: the CDN and the real-time bidding / retargeting platform stream events through Kinesis into S3; Data Pipeline drives ETL (Spark SQL); attribution & ML, reporting, and data visualization run on an ecosystem of tools and services, including Amazon Athena.]
38. FINRA: Migrating from on-prem to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
[Diagram: web applications serve analysts and regulators with flexible interactive queries, predefined queries, and surveillance analytics, on top of a data-management layer (data movement, data registration, version management) backed by Amazon S3.]