Building a Sustainable Data Platform on AWS

Takumi Sakamoto
2016.01.27

Takumi Sakamoto
@takus
😍 = ⚽ ✈ 📷

http://bit.ly/1MCOyBX
JAWSDAYS 2015

Mentioned by @jeffbarr
https://twitter.com/jeffbarr/status/649575575787454464
http://www.slideshare.net/smartnews/smart-newss-journey-into-microservices

AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/

What is SmartNews?
• News Discovery App
• Launched in 2012
• 15M+ downloads worldwide
https://www.smartnews.com/en/

Our Mission
the world's quality information?
the people who need it?
How?

Machine Learning
[Diagram: Internet → URLs Found (100,000+ /day) → Structure Analysis → Semantics Analysis → Importance Estimation → Diversification → Trending Stories (1000+ /day) → Deliver, with Feedback flowing back]

Data Platform Use Cases
• Product development
• track KPI such as DAU and MAU
• A/B test for new feature, on-boarding, etc...
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking news articles
• CTR prediction of Ads system
• dashboard service for media partners

Data & Its Numbers
• User activities
• ~100 GBs per day (compressed)
• 60+ record types
• User demographics or configurations etc...
• 15M+ records
• Articles metadata
• 100K+ records per day

Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running cost
• Be open to uncertain future

Lambda Architecture
http://lambda-architecture.net/

Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid wasting too much time
• Empower brilliant engineers in SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness

λ Architecture at SmartNews
[Diagram: Input → Batch layer and Speed layer → Serving layer → Output]

Design Principles
• Decoupled "Computation" and "Storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data loss with minimal effort
• Use the right tool for the job
• leverage AWS managed services as much as possible
• fill in the missing pieces with Presto & PipelineDB

An Example
[Diagram: Amazon S3 as shared storage, read by separate Amazon EMR clusters (AMI 3.x, AMI 4.x, Hive, Spark)]
• General Users: "We're satisfied with the current version"
• Application Engineer: "I wanna upgrade Hive"
• Ad Engineer: "I wanna combine news data with ad data"
• Data Scientist: "I wanna test my algorithm with the latest Spark"
Batch Layer
Run multiple EMR clusters, one for each usage

[Diagram: Kinesis Stream, consumed by Spark on EMR and AWS Lambda]
• Data Scientist: "I wanna consume streaming data with Spark"
• Application Engineer: "I wanna add a streaming monitor with Lambda"
Speed Layer
Consume the same data for each usage
• AWS managed services
• Data replicated across multiple AZs
• High availability
Collect Events by Fluentd
• Forwarder (running on each instance)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• send events to Kinesis & PipelineDB
• other reporting tasks (not mentioned today)

Forwards to S3
# td-agent.conf
<source>
  @type tail
  format json
  path /data/log/user_activity.log
  pos_file /data/log/pos/user_activity.pos
  tag smartnews.user_activity
  time_key timestamp
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type relabel
    @label @s3
  </store>
  <store>
    # relabel (not forward) so that the @forward label in conf.d/forward.conf handles the events
    @type relabel
    @label @forward
  </store>
</match>
@include conf.d/s3.conf
@include conf.d/forward.conf

# conf.d/s3.conf (ERB template)
<label @s3>
<% node[:td_agent][:s3].each do |c| -%>
  <match <%= c[:tag] %>>
    @id s3.<%= c[:tag] %>
    @type s3
    ...
    path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
    time_slice_format dt=%Y-%m-%d/hh=%H
    time_key timestamp
    include_time_key
    time_as_epoch
    reduced_redundancy true
    format json
    utc
    buffer_chunk_limit 2048m
  </match>
<% end -%>
</label>
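The `time_slice_format dt=%Y-%m-%d/hh=%H` setting together with `utc` determines the Hive-style partition directory each event lands in. The following is a minimal Python sketch of that mapping, not part of the original deck; the function name and the example prefix are hypothetical.

```python
from datetime import datetime, timezone


def s3_partition_path(prefix: str, epoch_seconds: int) -> str:
    """Return the hourly partition path (UTC) for an event timestamp.

    Mirrors `time_slice_format dt=%Y-%m-%d/hh=%H` with `utc` enabled.
    """
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{prefix}/{ts.strftime('dt=%Y-%m-%d/hh=%H')}"


if __name__ == "__main__":
    # Hypothetical prefix standing in for fluentd/<env>/<role>/<tag>
    prefix = "fluentd/prd/api/smartnews.user_activity"
    # 1453852800 is 2016-01-27 00:00:00 UTC
    print(s3_partition_path(prefix, 1453852800))
```

Because the directory names follow the `key=value` convention, Hive can register each hour as a partition (`dt`, `hh`) without moving any files.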

Capture DynamoDB Streams
<source>
  type dynamodb_streams
  stream_arn YOUR_DDB_STREAMS_ARN
  pos_file /path/to/table.pos
  fetch_interval 1
  fetch_size 100
</source>
https://github.com/takus/fluent-plugin-dynamodb-streams
[Diagram: DynamoDB, DynamoDB Streams, Fluentd, AWS Lambda, Amazon S3]

Recommended Practices
• Keep configuration as simple as possible
• fluentd can cover everything, but shouldn't
• keep it stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration

Monitor Fluentd Status
• Monitor traffic volume & retry count by Datadog
• Datadog's fluentd integration
• fluent-plugin-flowcounter
• fluent-plugin-dogstatsd

Archive to Amazon S3
• I recommend 2 settings
• versioning
• lets you recover from human error
• lifecycle policy
• minimizes storage cost
Archive to IA or Glacier
xx days after the creation date
Keep previous versions for xx days
These will save you in the future!!
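As a rough illustration of the two settings above, the sketch below builds a lifecycle rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The slide leaves the day counts as "xx", so the values used here (30 and 60), the rule ID, and the function name are purely hypothetical.

```python
def lifecycle_rule(archive_after_days: int, keep_versions_days: int,
                   storage_class: str = "STANDARD_IA") -> dict:
    """Build one S3 lifecycle rule: archive current objects, expire old versions."""
    return {
        "ID": "archive-and-expire-old-versions",  # hypothetical rule name
        "Filter": {"Prefix": ""},                 # apply to the whole bucket
        "Status": "Enabled",
        # Move objects to a cheaper storage class N days after creation
        "Transitions": [
            {"Days": archive_after_days, "StorageClass": storage_class}
        ],
        # Keep previous (noncurrent) versions for N days, then delete them
        "NoncurrentVersionExpiration": {"NoncurrentDays": keep_versions_days},
    }


if __name__ == "__main__":
    # Hypothetical values; the original slide does not state the day counts
    config = {"Rules": [lifecycle_rule(30, 60)]}
    print(config)
    # With boto3 installed and credentials configured, this could be applied as:
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="your-bucket", LifecycleConfiguration=config)
```

Noncurrent-version expiration only has an effect when versioning is enabled on the bucket, which is why the two settings are recommended together.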

Various ETL Tasks
• Extract
• dump MySQL records by Embulk
• make files on S3 readable to Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora

Hive
• Most popular project in the Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• converts SQL-like queries into MR jobs
• Tez engine not adopted yet
• Amazon EMR doesn't support it for now
• limited improvement for our queries

How to process JSON?
A. Transform into a columnar table periodically
• requires a conversion job
• better performance
B. Use JSON-SerDe for temporary analysis
• easy way to query raw JSON text files
• requires "drop table" to change the schema
• performance is not good

Transform Tables
-- Make S3 files readable by Hive
ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION
(dt='${DATE}', hh='${HOUR}');
-- Transform text files into columnar files (Flatten JSON)
INSERT OVERWRITE TABLE activities
PARTITION (dt='${DATE}', action)
SELECT
user_id, timestamp, os, country,
data,
action
FROM raw_activities
LATERAL VIEW json_tuple(
raw_activities.json,
'userId','timestamp','platform','country','action','data'
) a as user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
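To make the flattening step concrete, here is a small Python sketch of what the `json_tuple` lateral view does to one raw record: it picks top-level keys and renames them to column names. It is only an approximation of Hive's behavior (Hive returns every extracted value as a string), and the sample record is invented for illustration.

```python
import json

# JSON keys and the column names they map to, as in the query above
FIELDS = ("userId", "timestamp", "platform", "country", "action", "data")
COLUMNS = ("user_id", "timestamp", "os", "country", "action", "data")


def flatten(raw_json: str) -> dict:
    """Extract top-level fields from a JSON line into a flat row."""
    record = json.loads(raw_json)
    row = {}
    for field, column in zip(FIELDS, COLUMNS):
        value = record.get(field)  # missing keys become None (NULL in Hive)
        if isinstance(value, (dict, list)):
            # Nested values stay as JSON text, to be parsed later if needed
            value = json.dumps(value)
        row[column] = value
    return row


if __name__ == "__main__":
    # Invented sample event
    line = '{"userId": 1, "platform": "ios", "action": "viewArticle", "data": {"url": "x"}}'
    print(flatten(line))
```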

JSON-SerDe
-- Define table with SERDE
CREATE TABLE json_table (
country string,
languages array<string>,
religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Result: 10
SELECT religions['catholic'][0] FROM json_table;

cf. hive-ruby-scripting
-- Define your ruby (JRuby) script
SET rb.script=
require 'json'
def parse (json)
j = JSON.load(json)
j['profile']['attribute1']
end
;
-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
https://github.com/gree/hive-ruby-scripting

Spark
http://www.slideshare.net/smartnews/aws-meetupapache-spark-on-emr

Self-Serve via AWS CLI
# Create an EMR cluster that runs Hive & Spark & Ganglia
aws emr create-cluster \
  --name "My Cluster" \
  --release-label emr-4.2.0 \
  --applications Name=Hive Name=Spark Name=GANGLIA \
  --ec2-attributes KeyName=myKey \
  --instance-type c3.4xlarge \
  --instance-count 4 \
  --use-default-roles

Minimize expenses
• Use Spot Instances as much as possible
• typically a 50-90% discount
• select an instance type with a stable price
• C3 family prices spike often :(
• Dynamic cluster resizing
• x2 capacity during the daily batch job
• 1/2 capacity during midnight
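The resizing rule above can be sketched as a small function that picks a node count by hour. This is only an illustration: the slide does not say when the batch job or the midnight window runs, so the hour sets and the base size below are hypothetical.

```python
# Hypothetical windows (hours of the day); the original slide does not specify them
BATCH_HOURS = frozenset({1, 2, 3})       # daily batch job
MIDNIGHT_HOURS = frozenset({15, 16, 17})  # low-traffic period


def target_capacity(base_nodes: int, hour: int) -> int:
    """Return the desired node count for the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if hour in BATCH_HOURS:
        return base_nodes * 2            # x2 capacity during the daily batch job
    if hour in MIDNIGHT_HOURS:
        return max(1, base_nodes // 2)   # 1/2 capacity, but never below one node
    return base_nodes


if __name__ == "__main__":
    for hour in (2, 10, 16):
        print(hour, target_capacity(8, hour))
```

A scheduler could call such a function periodically and pass the result to the cluster's resize API.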

Typical Anti-Pattern
5 * * * * app hive -f query_1.hql

Workflow Management
• Define dependencies
• task E is executed after finishing task C and task D
• Scheduling
• task A is kicked after 09:00 AM
• throttle concurrent running of the same task
• Monitoring
• notification on failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
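The difference from a bare cron entry is that a workflow manager runs a task only after its upstream tasks have finished. A minimal sketch of that dependency resolution, using Python's standard `graphlib` (Python 3.9+), is shown below. Only "E runs after C and D" comes from the slide; the other edges are made up for the example.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# "E" after "C" and "D" is from the slide; the other edges are hypothetical.
dependencies = {
    "C": {"A"},
    "D": {"B"},
    "E": {"C", "D"},
}


def execution_order(deps: dict) -> list:
    """Return one valid run order in which every task follows its upstreams."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(execution_order(dependencies))
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which is the kind of mistake a cron-only setup cannot detect.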

Airflow
• A workflow management system
• define workflows in Python
• built-in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/

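Define Tasks
# Import paths vary by Airflow version; default_args is assumed to be defined elsewhere
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('tutorial', default_args=default_args)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

t3 = BashOperator(
    task_id='templated',
    bash_command="""
    {% for i in range(5) %}
        echo "{{ ds }}"
        echo "{{ macros.ds_add(ds, 7)}}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

# t2 and t3 both run after t1
t2.set_upstream(t1)
t3.set_upstream(t1)
[Annotations: tasks and dependencies are written as Python code; Airflow renders them as a DAG]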

Workflow as Code
Deploy code automatically after merging into master

Alerting to Slack
• SLA Violation
• task A should be done by 00:00 PM
• another team's task K depends on task A
• Output validation failure
• stop the following tasks if the output is doubtful
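One simple form of output validation is to compare today's row count with the previous day's and fail the task when the change is implausibly large, so that downstream tasks do not run. The sketch below is an assumption about how such a check might look, not the check used at SmartNews; the 50% tolerance is hypothetical.

```python
def validate_row_count(today: int, yesterday: int, tolerance: float = 0.5) -> None:
    """Raise ValueError if today's output differs too much from yesterday's.

    Raising inside a workflow task marks it as failed, which prevents
    downstream tasks from consuming doubtful output.
    """
    if yesterday <= 0:
        raise ValueError("no baseline to compare against")
    change = abs(today - yesterday) / yesterday
    if change > tolerance:
        raise ValueError(
            f"row count changed by {change:.0%} (today={today}, yesterday={yesterday})"
        )


if __name__ == "__main__":
    validate_row_count(today=105_000, yesterday=100_000)  # passes silently
    try:
        validate_row_count(today=10_000, yesterday=100_000)
    except ValueError as err:
        print("validation failed:", err)
```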

Retry from Web UI
Once you clear the history, the Airflow scheduler backfills it

Retry from CLI
# Clear some history from 2016-01-01
airflow clear etl_smartnews \
  --task_regex user_ \
  --downstream \
  --start_date 2016-01-01
# Backfill uncompleted tasks
airflow backfill etl_smartnews \
  --start_date 2016-01-01

Pluggable Architecture
• Built-in plugins
• operator: bash, hive, presto, mysql
• transfer: hive_to_mysql
• sensor: wait_hive_partition, wait_s3_file
• Wrote our own plugin
• mysql_partition

Examples
# Wait for an S3 file to be created
user_sensor = S3KeySensor(
    task_id='wait_user',
    bucket_name='smartnews',
    bucket_key='user/dt={{ ds }}/dump.csv',
)
# After the file is created, run the ETL query
etl = HiveOperator(
    task_id="task1",
    hql="INSERT OVERWRITE INTO...."
)
etl.set_upstream(user_sensor)
# After that, import into MySQL
# (variable renamed from `import`, which is a reserved word in Python)
import_task = HiveToMySqlTransfer(
    task_id=name,
    mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table,
    sql="SELECT country, count(*) FROM %s" % table,
    mysql_table=table
)
import_task.set_upstream(etl)

Provide batch views
in a low-latency, ad-hoc way

Presto
• A distributed SQL query engine
• join multiple data sources (Hive + MySQL)
• support standard ANSI SQL
• designed to handle TB- or PB-scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014

Presto Architecture
[Diagram: Client, Presto Coordinator, Presto Workers (x6); data sources: Amazon S3, Kinesis Stream (※), Amazon RDS, Amazon Aurora; Amazon EMR (Hive Metastore) provides Hive table metadata (S3 access only)]
1. Query with standard SQL
2. Generate execution plan
3. Dispatch tasks to multiple workers
4. Scan data concurrently
5. Aggregate data without disk I/O
6. Return result to client
※ https://github.com/qubole/presto-kinesis

Why Presto?
• Join multiple data sources
• skip large parts of ETL process
• can merge Hive/MySQL/Kinesis/PipelineDB data
• Low latency
• ~30s to scan billions of records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling

Use case: A/B Test
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action varchar
abtest array<map<varchar, bigint>>
url varchar
-- Summarize page view per A/B Test identifier
-- for comparing two algorithms v1 & v2
SELECT
dt,
t['behaviorId'],
count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;
2015-12-01 | algorithm_v1 | 40000
2015-12-01 | algorithm_v2 | 62000
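For readers less familiar with `CROSS JOIN UNNEST`, the query expands each row into one row per element of the `abtest` array and then filters and groups. The Python sketch below reproduces that logic on in-memory rows; the sample rows and the integer behavior IDs are invented for illustration.

```python
from collections import Counter


def pageviews_by_behavior(rows, definition_id):
    """Count viewArticle events per (dt, behaviorId) for one A/B test definition."""
    counts = Counter()
    for row in rows:
        if row["action"] != "viewArticle":
            continue
        # Equivalent of CROSS JOIN UNNEST(abtest): one iteration per array element
        for t in row["abtest"]:
            if t.get("definitionId") == definition_id:
                counts[(row["dt"], t["behaviorId"])] += 1
    return counts


if __name__ == "__main__":
    # Invented sample rows
    rows = [
        {"dt": "2016-01-01", "action": "viewArticle",
         "abtest": [{"definitionId": 163, "behaviorId": 1}]},
        {"dt": "2016-01-01", "action": "viewArticle",
         "abtest": [{"definitionId": 163, "behaviorId": 2},
                    {"definitionId": 200, "behaviorId": 9}]},
        {"dt": "2016-01-01", "action": "openApp",
         "abtest": [{"definitionId": 163, "behaviorId": 1}]},
    ]
    for key, pv in sorted(pageviews_by_behavior(rows, 163).items()):
        print(key, pv)
```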

Use case: Troubleshoot
-- Store access logs in S3, and query them
-- Summarize access & 95pct response time by SQL
SELECT
from_unixtime(timestamp),
count(*) as access,
approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx'
GROUP BY timestamp ORDER BY timestamp
;
2015-11-04 22:00:00.000 | 6377 | 0.522
2015-11-04 22:00:01.000 | 3580 | 0.422

Scheduled Auto Scaling
$ aws autoscaling describe-scheduled-actions
{
"ScheduledUpdateGroupActions": [
{
"DesiredCapacity": 2,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "59 14 * * *",
"ScheduledActionName": "scalein-2359-jst"
},
{
"DesiredCapacity": 20,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "45 0 * * 1-5",
"ScheduledActionName": "scaleout-0945-jst"
}
]
}
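The `Recurrence` fields above are cron expressions evaluated in UTC, while the action names refer to JST (UTC+9). A small Python sketch of the conversion shows how the two line up; the function name is hypothetical, and it only handles the simple "minute hour" form used here.

```python
JST_OFFSET_HOURS = 9  # Japan Standard Time is UTC+9, with no daylight saving


def utc_cron_to_jst(recurrence: str) -> str:
    """Convert the minute/hour fields of a UTC cron expression to a JST clock time."""
    minute, hour, *_ = recurrence.split()
    jst_hour = (int(hour) + JST_OFFSET_HOURS) % 24
    return f"{jst_hour:02d}:{int(minute):02d} JST"


if __name__ == "__main__":
    print(utc_cron_to_jst("59 14 * * *"))   # matches scalein-2359-jst
    print(utc_cron_to_jst("45 0 * * 1-5"))  # matches scaleout-0945-jst
```

In other words, the worker group scales out to 20 instances on weekday mornings and back in to 2 just before midnight, Japan time.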

Presto Covers Everything? No!
• Fixed system on Amazon Aurora (or other RDB)
• provides KPIs for products & business
• requires high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all datasets on the data platform
• requires high scalability
• has flexibility (joins various data sources)

Why Fixed vs Ad-hoc?
• Difficulties with an ad-hoc-only solution
• difficult to prevent heavy queries
• large distinct counts exhaust computing resources
• decreases Presto maintainability

Chartio
• Dashboard as a Service
• helps businesses analyze and track their critical data
• one of AWS partners (※)
• Combine multiple data sources in one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• can join BigQuery + MySQL internally
• Easy to use for everyone
• everyone can make their own dashboard
• write SQL directly / generate query by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959

Creating dashboard
1. Build a query
(drag & drop / SQL)
2. Add steps
(filter, sort, modify)
3. Select a visualization
(table, graph)

Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintained an in-house dashboard written in Rails
• everyone got tired of maintaining it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• very important factor for dashboard tool

Missing Pieces of Chartio
• No programmable API provided
• need to edit dashboards / charts manually
• No rollback feature
• all changes are recorded, but you cannot roll back to the previous state
• workaround: clone => edit => rename

Today’s News is Wrapping
Tomorrow’s Fish and Chips

↑
Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/

How Does News Behave?
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/

Use cases
• Re-rank news articles by user feedback
• track user's positive/negative signal
• consider gender, age, location, interests
• Realtime article monitoring
• detect high bounce rate (may be broken?)
• make realtime reporting dashboard for A/B test

Realtime Re-Ranking
ref. Integrating stream processing (Spark Streaming + Kinesis) with offline processing (Hive)
www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
[Diagram: API Gateway, Search API, Amazon CloudSearch, Kinesis Stream, Amazon S3, Amazon EMR, DynamoDB; data flows labeled Realtime Feedback, Re-rank Articles, Article Metadata, User Interests, User Behaviors; Offline Process by Hive / Spark]

Realtime Monitoring
[Diagram: API Gateway, Record, Stream, Continuous View (x3), PipelineDB, Chartio, AWS Lambda (※1), Slack]
• Discard raw records soon after they are consumed by a Continuous View
• Continuous Views are incrementally updated in realtime
• Access Continuous Views with a PostgreSQL client
※1 Using cron as of 26 Feb. 2016; will migrate soon after VPC is supported

PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/

Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day,hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[90, 95, 99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
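The first view above counts distinct users in constant space by using HyperLogLog. To show the idea behind that data structure, here is a minimal, self-contained Python sketch. It is an illustrative implementation under simplifying assumptions (SHA-1 as the hash, 2^12 registers, only the small-range correction), not PipelineDB's actual code.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a hash of the item
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50_000):
        hll.add(f"user-{i}")
        hll.add(f"user-{i}")  # duplicates do not change the estimate
    print(round(hll.count()))
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, regardless of how many distinct users are seen.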

Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running cost
• be open to uncertain future

My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL to DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region
