SlideShare une entreprise Scribd logo
1  sur  81
Télécharger pour lire hors ligne
Building a Sustainable
Data Platform on AWS
Takumi Sakamoto
2016.01.27
Takumi Sakamoto
@takus
😍 = ⚽ ✈ 📷
http://bit.ly/1MCOyBX
JAWSDAYS 2015
Mentioned by @jeffbarr
https://twitter.com/jeffbarr/status/649575575787454464
http://www.slideshare.net/smartnews/smart-newss-journey-into-microservices
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
Data Platform at
SmartNews
What is SmartNews?
• News Discovery App
• Launched in 2012
• 15M+ Downloads in World Wide
https://www.smartnews.com/en/
Our Mission
the world's quality information?
the people who need it?
How?
Machine Learning
URLs Found
Structure Analysis
Semantics Analysis
Importance Estimation
Diversification
Internet
100,000+ /day
1000+ /day
Feedback
Deliver
Trending Stories
Data Platform Use Cases
• Product development
• track KPI such as DAU and MAU
• A/B test for new feature, on-boarding, etc...
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking news articles
• CTR prediction of Ads system
• dashboard service for media partners
Data & Its Numbers
• User activities
• ~100 GBs per day (compressed)
• 60+ record types
• User demographics or configurations etc...
• 15M+ records
• Articles metadata
• 100K+ records per day
Sustainable
Data Platform?
Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running cost
• Be open to uncertain future
Lambda Architecture
http://lambda-architecture.net/
Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid to waste too much time
• Empower brilliant engineers in SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness
System Design
λ Architecture at SmartNews
Input Batch Serving
Speed
Output
Design Principles
• Decoupled "Computation" and "Storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data lost with minimum effort
• Use the right tool for the job
• leverage AWS managed service as possible
• fill in the missing pieces by Presto & PipelineDB
An Example
Amazon EMR
AMI 3.x
Amazon S3
Amazon EMR
Hive
General
Users
Application
Engineer
I wanna
upgrade hive
Ad
Engineer
I wanna combine
news data with
ad data
Amazon EMR
AMI 4.x
Amazon EMR
Spark
We’re satisfied
with current
version
Data
Scientist
I wanna test my
algorithm with the
latest spark
Batch Layer
Run multiple EMR clusters for each usages
Kinesis
Stream
Spark
on EMR
AWS
Lambda
Data
Scientist
I wanna consume
streaming data by
Spark
Application
Engineer
I wanna add a
streaming monitor
by Lambda
Speed Layer
Consume the same data for each usages
• AWS managed services
• Replicated data into Multiple AZs
• High availability
Input Data
Collect Events by Fluentd
• Forwarder (running on each instances)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• input events into Kinesis & PipelineDB
• other reporting tasks (not mentioned today)
Forwards to S3
<source>
@type tail
format json
path /data/log/user_activity.log
pos_file /data/log/pos/user_activity.pos
tag smartnews.user_activity
time_key timestamp
</source>
<match smartnews.user_activity>
@type copy
<store>
@type relabel
@label @s3
</store>
<store>
@type forward
@label @forward
</store>
</match>
@include conf.d/s3.conf
@include conf.d/forward.conf
<label @s3>
<% node[:td_agent][:s3].each do |c| -%>
<match <%= c[:tag] %>>
@id s3.<%= c[:tag] %>
@type s3
...
path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
time_slice_format dt=%Y-%m-%d/hh=%H
time_key timestamp
include_time_key
time_as_epoch
reduced_redundancy true
format json
utc
buffer_chunk_limit 2048m
</match>
<% end -%>
</label>
td-agent.conf conf.d/s3.conf
Capture DynamoDB Streams
<source>
type dynamodb_streams
stream_arn YOUR_DDB_STREAMS_ARN
pos_file /path/to/table.pos
fetch_interval 1
fetch_size 100
</source>
https://github.com/takus/fluent-plugin-dynamodb-streams
DynamoDB DynamoDB
Streams
Amazon S3
AWS
Lambda
Fluentd
Recommended Practices
• Make configuration simple as possible
• fluentd can cover everything, but shouldn't
• keep stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration
Monitor Fluentd Status
• Monitor traffic volume & retry count by Datadog
• Datadog's fluentd integration
• fluent-plugin-flowcounter
• fluent-plugin-dogstatsd
Archive to Amazon S3
• I have 2 recommended settings
• versioning
• enable to recover from human error
• lifecycle policy
• minify storage cost
Archives to IA or Gracier
xx days after the creation date
Keep previous versions xx days
Save you in the future!!
Batch Layer
Various ETL Tasks
• Extract
• dump MySQL records by Embulk
• make files on S3 readable to Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora
Hive
• Most popular project on Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• convert SQL-like query into MR jobs
• Not adopt Tez engine yet
• Amazon EMR doesn't support now
• limited improvement to our queries
How to process JSON?
A. Transform into columnar table periodically
• required converting job
• better performance
B. Use JSON-SerDe for temporary analysis
• easy way for querying raw json text files
• required to "drop table" for change schema
• performance is not good
Transform Tables
-- Make S3 files readable by Hive
ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION
(dt='${DATE}', hh='${HOUR}');
-- Transform text files into columnar files (Flatten JSON)
INSERT OVERWRITE TABLE activities
PARTITION (dt='${DATE}', action)
SELECT
user_id, timestamp, os, country,
data,
action
FROM raw_activities
LATERAL VIEW json_tuple(
raw_activities.json,
'userId','timestamp','platform','country','action','data'
) a as user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
JSON-SerDe
-- Define table with SERDE
CREATE TABLE json_table (
country string,
languages array<string>,
religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Result: 10
SELECT religions['catholic'][0] FROM json_table;
cf. hive-ruby-scripting
-- Define your ruby (JRuby) script
SET rb.script=
require 'json'
def parse (json)
j = JSON.load(json)
j['profile']['attribute1']
end
;
-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
https://github.com/gree/hive-ruby-scripting
Spark
http://www.slideshare.net/smartnews/aws-meetupapache-spark-on-emr
Self-Serve via AWS CLI
# Create EMR clusters that runs Hive & Spark & Ganglia
aws emr create-cluster 
--name "My Cluster" 
--release-label emr-4.2.0 
--applications Name=Hive Name=Spark Name=GANGLIA 
--ec2-attributes KeyName=myKey 
--instance-type c3.4xlarge 
--instance-count 4 
--use-default-roles
Minimize expenses
• Use Spot Instances as possible
• typically discount 50-90%
• select instance type with stable price
• C3 families spike often :(
• Dynamic cluster resizing
• x2 capacity during daily batch job
• 1/2 capacity during midnight
Handle Data Dependencies
Typical Anti-Pattern
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
0 * * * * app hive -f query_4.hql
1 * * * * app hive -f query_5.hql
Workflow Management
• Define dependencies
• task E is executed after finishing task C and task D
• Scheduling
• task A is kicked after 09:00 AM
• throttle concurrent running of the same task
• Monitoring
• notification in failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
Airflow
• A workflow management systems
• define workflow by Python
• built in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/
Define Tasks
dag = DAG('tutorial', default_args=default_args)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='sleep',
bash_command='sleep 5',
retries=3,
dag=dag)
t3 = BashOperator(
task_id='templated',
bash_command="""
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
""",
params={'my_param': 'Parameter I passed in'},
dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)
Task
Dependencies
Python code
DAG
Workflow as Code
Deploy codes automatically after merging into master
Visualize Dependencies
What is done or not?
Alerting to Slack
• SLA Violation
• task A should be done till 00:00 PM
• other team's task K has dependency into task A
• Output validation failure
• stop the following tasks if the output is doubtful
Retry from Web UI
Once clear histories, airflow scheduler back fill the histories
Retry from CLI
// Clear some histories from 2016-01-01
airflow clear etl_smartnews 
--task_regex user_ 
--downstream 
--start_date 2016-01-01
// Backfill uncompleted tasks
airflow backfill etl_smartnews 
--start_date 2016-01-01
Check Rendered Query
How Long Each Tasks?
Pluggable Architecture
• Built-in plugins
• operator: bash, hive, preto, mysql
• transfer: hive_to_mysql
• sensor: wait_hive_partition, wait_s3_file
• Written our own plugin
• mysql_partition
Examples
user_sensor = S3KeySensor(
task_id='wait_user',
bucket_name='smartnews',
bucket_key='user/dt={{ ds }}/dump.csv',
)
etl = HiveOperator(
task_id="task1",
hql="INSERT OVERWRITE INTO...."
)
etl.set_upstream(user_sensor)
import = HiveToMySqlTransfer(
task_id=name,
mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table,
sql="SELECT country, count(*) FROM %s" % table,
mysql_table=table
)
import.set_upstream(etl)
Wait a S3 file creation
After the file is created,
Run ETL Query
After that,
Import into MySQL
Serving Layer
Provides batch views
in low-latency and ad-hoc way
Presto
• A distributed SQL query engine
• join multiple data sources (Hive + MySQL)
• support standard ANSI SQL
• designed to handle TBs or PBs scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
Presto Architecture
Amazon S3 Kinesis
Stream
Amazon
RDS
Amazon
Aurora
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Coordinator
Client
1. Query with Standard SQL
4. Scan data concurrently
5. Aggregate data without disk I/O
6. Return result to client
2. Generate execution plan
3. Dispatch tasks into multiple workers
Amazon EMR
(Hive Metastore)
Provides Hive table metadata
(S3 access only)
※ https://github.com/qubole/presto-kinesis
※
Why Presto?
• Join multiple data sources
• skip large parts of ETL process
• enable to merge Hive/MySQL/Kinesis/PipelineDB
• Low latency
• ~30s to scan billions records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling
Use case: A/B Test
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action varchar
abtest array<map<varchar, bigint>>
url varchar
-- Summarize page view per A/B Test identifier
-- for comparing two algorithms v1 & v2
SELECT
  dt,
  t['behaviorId'],
  count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;
2015-12-01 | algorithm_v1 | 40000
2015-12-01 | algorithm_v2 | 62000
Use case: Troubleshoot
-- Store access logs to S3, and query to them
-- Summarize access & 95pct response time by SQL
SELECT
from_unixtime(timestamp),
count(*) as access,
approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx'
GROUP BY timestamp ORDER BY timestamp
;
2015-11-04 22:00:00.000 | 6377 | 0.522
2015-11-04 22:00:01.000 | 3580 | 0.422
Scheduled Auto Scaling
$ aws autoscaling describe-scheduled-actions
{
"ScheduledUpdateGroupActions": [
{
"DesiredCapacity": 2,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "59 14 * * *",
"ScheduledActionName": "scalein-2359-jst"
},
{
"DesiredCapacity": 20,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "45 0 * * 1-5",
"ScheduledActionName": "scaleout-0945-jst"
}
]
}
Presto Covers Everything? No!
• Fixed system on Amazon Aurora (or other RDB)
• provides KPI for products & business
• require high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all dataset on data platform
• require high scalability
• has flexibility (join various data sources)
Why Fixed vs Ad-hoc?
• Difficulties on the Ad-hoc only solution
• difficult to prevent heavy queries
• large distinct count exhausts computing resources
• decrease presto maintainability
Output Data
Chartio
• Dashboard as A Service
• helps businesses analyze and track their critical data
• one of AWS partners (※)
• Combine multiple data sources at one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• enable to join BigQuery + MySQL internally
• Easy to use for every one
• everyone can make their own dashboard
• write SQL directly / generate query by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
Creating dashboard
1. Building query
(Drag&Drop / SQL)
2. Add step
(filter、sort、modify)
3. Select visualize way
(table、graph)
Examples
Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintain in-house dashboard written by rails
• everyone got tired to maintain it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• very important factor for dashboard tool
Missing Pieces of Chartio
• No programable API provides
• need to edit dashboard / chart manually
• No rollback feature
• all changes are recorded, but not rollback to the
previous state
• work around : clone => edit => rename
Speed Layer
Why Speed is Matter?
Today’s News is Wrapping
Tomorrow’s Fish and Chips
↑
Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
How News Behaves?
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Use cases
• Re-rank news articles by user feedback
• track user's positive/negative signal
• consider gender, age, location, interests
• Realtime article monitoring
• detect high bounce rate (may be broken?)
• make realtime reporting dashboard for A/B test
Realtime Re-Ranking
ref. Stream 処理 (Spark Streaming + Kinesis) と Offline 処理 (Hive) の統合
www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
Amazon
CloudSearch
Search
API
API
Gateway
Kinesis
Stream
Amazon S3
Amazon EMR
Amazon S3 Amazon EMR
DynamoDB
Realtime
Feedback
Re-rank
Articles
Article
Metadata
User
Interests
User
Behaviors
Offline Procees
by Hive / Spark
Realtime Monitoring
API
Gateway
Stream
Continuous
View
Continuous
View
Continuous
View
Discard raw record soon after
consumed by Continuous View
Incrementally
updated in realtime
PipelineDB Chartio
AWS
Lambda
Slack
Access Continuous View
by PostgreSQL Client
Record
※1
※1
Use cron on 26 Feb. 2016
Migrate it soon after supporting VPC
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day,hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[90, 95, 99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
Summary
Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running cost
• be open to uncertain future
My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL to DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region
Thank you!!

Contenu connexe

Tendances

コンテナ時代にインフラエンジニアは何をするのか
コンテナ時代にインフラエンジニアは何をするのかコンテナ時代にインフラエンジニアは何をするのか
コンテナ時代にインフラエンジニアは何をするのかgree_tech
 
20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...
20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...
20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...Amazon Web Services Japan
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報
20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報
20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報Amazon Web Services Japan
 
これから始めるSAPのクラウド化 Azure が選ばれる理由
これから始めるSAPのクラウド化 Azure が選ばれる理由これから始めるSAPのクラウド化 Azure が選ばれる理由
これから始めるSAPのクラウド化 Azure が選ばれる理由Hitoshi Ikemoto
 
20200617 AWS Black Belt Online Seminar Amazon Athena
20200617 AWS Black Belt Online Seminar Amazon Athena20200617 AWS Black Belt Online Seminar Amazon Athena
20200617 AWS Black Belt Online Seminar Amazon AthenaAmazon Web Services Japan
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and PerformanceMineaki Motohashi
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
AWS Black Belt Tech シリーズ 2015 - AWS IoT
AWS Black Belt Tech シリーズ 2015 - AWS IoTAWS Black Belt Tech シリーズ 2015 - AWS IoT
AWS Black Belt Tech シリーズ 2015 - AWS IoTAmazon Web Services Japan
 
VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話Noritaka Sekiyama
 
Five Connectivity and Security Use Cases for Azure VNets
Five Connectivity and Security Use Cases for Azure VNetsFive Connectivity and Security Use Cases for Azure VNets
Five Connectivity and Security Use Cases for Azure VNetsKhash Nakhostin
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
20220409 AWS BLEA 開発にあたって検討したこと
20220409 AWS BLEA 開発にあたって検討したこと20220409 AWS BLEA 開発にあたって検討したこと
20220409 AWS BLEA 開発にあたって検討したことAmazon Web Services Japan
 
10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)
10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)
10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)Amazon Web Services Korea
 
20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本Amazon Web Services Japan
 

Tendances (20)

AWS Cost Management Workshop
AWS Cost Management WorkshopAWS Cost Management Workshop
AWS Cost Management Workshop
 
コンテナ時代にインフラエンジニアは何をするのか
コンテナ時代にインフラエンジニアは何をするのかコンテナ時代にインフラエンジニアは何をするのか
コンテナ時代にインフラエンジニアは何をするのか
 
20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...
20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...
20180425 AWS Black Belt Online Seminar Amazon Relational Database Service (Am...
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報
20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報
20211203 AWS Black Belt Online Seminar AWS re:Invent 2021アップデート速報
 
これから始めるSAPのクラウド化 Azure が選ばれる理由
これから始めるSAPのクラウド化 Azure が選ばれる理由これから始めるSAPのクラウド化 Azure が選ばれる理由
これから始めるSAPのクラウド化 Azure が選ばれる理由
 
20200617 AWS Black Belt Online Seminar Amazon Athena
20200617 AWS Black Belt Online Seminar Amazon Athena20200617 AWS Black Belt Online Seminar Amazon Athena
20200617 AWS Black Belt Online Seminar Amazon Athena
 
Snowflake Architecture and Performance
Snowflake Architecture and PerformanceSnowflake Architecture and Performance
Snowflake Architecture and Performance
 
Cloud Migration Workshop
Cloud Migration WorkshopCloud Migration Workshop
Cloud Migration Workshop
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
AWS Black Belt Tech シリーズ 2015 - AWS IoT
AWS Black Belt Tech シリーズ 2015 - AWS IoTAWS Black Belt Tech シリーズ 2015 - AWS IoT
AWS Black Belt Tech シリーズ 2015 - AWS IoT
 
AWS Black Belt Online Seminar 2016 AWS IoT
AWS Black Belt Online Seminar 2016 AWS IoTAWS Black Belt Online Seminar 2016 AWS IoT
AWS Black Belt Online Seminar 2016 AWS IoT
 
VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話VPC Reachability Analyzer 使って人生が変わった話
VPC Reachability Analyzer 使って人生が変わった話
 
Five Connectivity and Security Use Cases for Azure VNets
Five Connectivity and Security Use Cases for Azure VNetsFive Connectivity and Security Use Cases for Azure VNets
Five Connectivity and Security Use Cases for Azure VNets
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
2020.02.06 우리는 왜 glue를 버렸나?
2020.02.06 우리는 왜 glue를 버렸나?2020.02.06 우리는 왜 glue를 버렸나?
2020.02.06 우리는 왜 glue를 버렸나?
 
20220409 AWS BLEA 開発にあたって検討したこと
20220409 AWS BLEA 開発にあたって検討したこと20220409 AWS BLEA 開発にあたって検討したこと
20220409 AWS BLEA 開発にあたって検討したこと
 
10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)
10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)
10월 웨비나 - AWS에서 Active Directory 구축 및 연동 옵션 살펴보기 (김용우 솔루션즈 아키텍트)
 
20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本20210526 AWS Expert Online マルチアカウント管理の基本
20210526 AWS Expert Online マルチアカウント管理の基本
 

En vedette

[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例Amazon Web Services Japan
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Minero Aoki
 
AWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAmazon Web Services Japan
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
AWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハックAWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハックAmazon Web Services Japan
 

En vedette (20)

20170725 black belt_monitoring_on_aws
20170725 black belt_monitoring_on_aws20170725 black belt_monitoring_on_aws
20170725 black belt_monitoring_on_aws
 
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
20170726 black belt_stepfunctions
20170726 black belt_stepfunctions20170726 black belt_stepfunctions
20170726 black belt_stepfunctions
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築
 
AWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon Connect
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
AWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS Shield
 
20170621 aws-black belt-ads-sms
20170621 aws-black belt-ads-sms20170621 aws-black belt-ads-sms
20170621 aws-black belt-ads-sms
 
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
 
AWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWS
 
AWS Black Belt online seminar 2017 Snowball
AWS Black Belt online seminar 2017 SnowballAWS Black Belt online seminar 2017 Snowball
AWS Black Belt online seminar 2017 Snowball
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-Ray
 
AWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon Aurora
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLift
 
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
 
AWS BlackBelt AWS上でのDDoS対策
AWS BlackBelt AWS上でのDDoS対策AWS BlackBelt AWS上でのDDoS対策
AWS BlackBelt AWS上でのDDoS対策
 
AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR
 
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
 
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハックAWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
 

Similaire à Building a Sustainable Data Platform on AWS

Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformChester Chen
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Amazon Web Services
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql databasePARIKSHIT SAVJANI
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션Amazon Web Services Korea
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsSteve Jamieson
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB
 
React state management with Redux and MobX
React state management with Redux and MobXReact state management with Redux and MobX
React state management with Redux and MobXDarko Kukovec
 
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoTAmazon Web Services
 
Orchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsOrchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsChris Shenton
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Amazon Web Services
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Zhenxiao Luo
 
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel PartnersCraeg Strong
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB
 

Similaire à Building a Sustainable Data Platform on AWS (20)

Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql database
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.js
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
 
React state management with Redux and MobX
React state management with Redux and MobXReact state management with Redux and MobX
React state management with Redux and MobX
 
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
 
Orchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsOrchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functions
 
DW on AWS
DW on AWSDW on AWS
DW on AWS
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
 

Plus de SmartNews, Inc.

SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNews, Inc.
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤SmartNews, Inc.
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへSmartNews, Inc.
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNews, Inc.
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysSmartNews, Inc.
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側SmartNews, Inc.
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews, Inc.
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews, Inc.
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews, Inc.
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews, Inc.
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews, Inc.
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合SmartNews, Inc.
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews, Inc.
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」SmartNews, Inc.
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager NightSmartNews, Inc.
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews, Inc.
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法SmartNews, Inc.
 

Plus de SmartNews, Inc. (19)

SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへ
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdays
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解
 
NLP in SmartNews
NLP in SmartNewsNLP in SmartNews
NLP in SmartNews
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservices
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォーム
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager Night
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
 

Dernier

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 

Dernier (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 

Building a Sustainable Data Platform on AWS

  • 1. Building a Sustainable Data Platform on AWS Takumi Sakamoto 2016.01.27
  • 7. What is SmartNews? • News Discovery App • Launched in 2012 • 15M+ Downloads in World Wide https://www.smartnews.com/en/
  • 8. Our Mission the world's quality information? the people who need it? How?
  • 9. Machine Learning URLs Found Structure Analysis Semantics Analysis Importance Estimation Diversification Internet 100,000+ /day 1000+ /day Feedback Deliver Trending Stories
  • 10. Data Platform Use Cases • Product development • track KPI such as DAU and MAU • A/B test for new feature, on-boarding, etc... • ad-hoc analysis • Provide data to applications • realtime re-ranking news articles • CTR prediction of Ads system • dashboard service for media partners
  • 11. Data & Its Numbers • User activities • ~100 GBs per day (compressed) • 60+ record types • User demographics or configurations etc... • 15M+ records • Articles metadata • 100K+ records per day
  • 13. Sustainable Data Platform • Provide a reliable and scalable "Lambda Architecture" • Minimize both operation & running cost • Be open to uncertain future
  • 15. Why Sustainable? • Do a lot with a few engineers • no one is a full-time maintainer • avoid to waste too much time • Empower brilliant engineers in SmartNews • everything should be as self-serve as possible • don't ask for permission, beg for forgiveness
  • 17. λ Architecture at SmartNews Input Batch Serving Speed Output
  • 18. Design Principles • Decoupled "Computation" and "Storage" layers • multiple consumers can use the same data • run consumers on Spot Instances • prevent serious data lost with minimum effort • Use the right tool for the job • leverage AWS managed service as possible • fill in the missing pieces by Presto & PipelineDB
  • 19. An Example Amazon EMR AMI 3.x Amazon S3 Amazon EMR Hive General Users Application Engineer I wanna upgrade hive Ad Engineer I wanna combine news data with ad data Amazon EMR AMI 4.x Amazon EMR Spark We’re satisfied with current version Data Scientist I wanna test my algorithm with the latest spark Batch Layer Run multiple EMR clusters for each usages Kinesis Stream Spark on EMR AWS Lambda Data Scientist I wanna consume streaming data by Spark Application Engineer I wanna add a streaming monitor by Lambda Speed Layer Consume the same data for each usages • AWS managed services • Replicated data into Multiple AZs • High availability
  • 21. Collect Events by Fluentd • Forwarder (running on each instances) • store JSON events to S3 • forward events to aggregators • collect metrics and post them to Datadog • Aggregator • input events into Kinesis & PipelineDB • other reporting tasks (not mentioned today)
  • 22. Forwards to S3 <source> @type tail format json path /data/log/user_activity.log pos_file /data/log/pos/user_activity.pos tag smartnews.user_activity time_key timestamp </source> <match smartnews.user_activity> @type copy <store> @type relabel @label @s3 </store> <store> @type forward @label @forward </store> </match> @include conf.d/s3.conf @include conf.d/forward.conf <label @s3> <% node[:td_agent][:s3].each do |c| -%> <match <%= c[:tag] %>> @id s3.<%= c[:tag] %> @type s3 ... path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %> time_slice_format dt=%Y-%m-%d/hh=%H time_key timestamp include_time_key time_as_epoch reduced_redundancy true format json utc buffer_chunk_limit 2048m </match> <% end -%> </label> td-agent.conf conf.d/s3.conf
  • 23. Capture DynamoDB Streams <source> type dynamodb_streams stream_arn YOUR_DDB_STREAMS_ARN pos_file /path/to/table.pos fetch_interval 1 fetch_size 100 </source> https://github.com/takus/fluent-plugin-dynamodb-streams DynamoDB DynamoDB Streams Amazon S3 AWS Lambda Fluentd
  • 24. Recommended Practices • Make configuration simple as possible • fluentd can cover everything, but shouldn't • keep stateless • Use v0.12 or later • "Filter" : better performance • "Label": eliminate 'output_tag' configuration
  • 25. Monitor Fluentd Status • Monitor traffic volume & retry count by Datadog • Datadog's fluentd integration • fluent-plugin-flowcounter • fluent-plugin-dogstatsd
  • 26. Archive to Amazon S3 • I have 2 recommended settings • versioning • enable to recover from human error • lifecycle policy • minify storage cost Archives to IA or Gracier xx days after the creation date Keep previous versions xx days Save you in the future!!
  • 28. Various ETL Tasks • Extract • dump MySQL records by Embulk • make files on S3 readable to Hive • Transform • transform text files into columnar files (RCFile, ORC) • generate features for machine learning • aggregate records (by country, by channel) • Load • load aggregated metrics into Amazon Aurora
  • 29. Hive • Most popular project on Hadoop ecosystem • famous for its lovely logo :) • HiveQL and MapReduce • convert SQL-like query into MR jobs • Not adopt Tez engine yet • Amazon EMR doesn't support now • limited improvement to our queries
  • 30. How to process JSON? A. Transform into columnar table periodically • required converting job • better performance B. Use JSON-SerDe for temporary analysis • easy way for querying raw json text files • required to "drop table" for change schema • performance is not good
  • 31. Transform Tables -- Make S3 files readable by Hive ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION (dt='${DATE}', hh='${HOUR}'); -- Transform text files into columnar files (Flatten JSON) INSERT OVERWRITE TABLE activities PARTITION (dt='${DATE}', action) SELECT user_id, timestamp, os, country, data, action FROM raw_activities LATERAL VIEW json_tuple( raw_activities.json, 'userId','timestamp','platform','country','action','data' ) a as user_id, timestamp, os, country, action, data WHERE dt = '${DATE}' CLUSTER BY os, country, action, user_id ;
  • 32. JSON-SerDe -- Define table with SERDE CREATE TABLE json_table ( country string, languages array<string>, religions map<string,array<int>> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE; -- Result: 10 SELECT religions['catholic'][0] FROM json_table;
  • 33. cf. hive-ruby-scripting -- Define your ruby (JRuby) script SET rb.script= require 'json' def parse (json) j = JSON.load(json) j['profile']['attribute1'] end ; -- Use the script in HQL SELECT rb_exec('&parse', json) FROM user; https://github.com/gree/hive-ruby-scripting
  • 35. Self-Serve via AWS CLI # Create EMR clusters that runs Hive & Spark & Ganglia aws emr create-cluster --name "My Cluster" --release-label emr-4.2.0 --applications Name=Hive Name=Spark Name=GANGLIA --ec2-attributes KeyName=myKey --instance-type c3.4xlarge --instance-count 4 --use-default-roles
  • 36. Minimize expenses • Use Spot Instances as possible • typically discount 50-90% • select instance type with stable price • C3 families spike often :( • Dynamic cluster resizing • x2 capacity during daily batch job • 1/2 capacity during midnight
  • 38. Typical Anti-Pattern 5 * * * * app hive -f query_1.hql 15 * * * * app hive -f query_2.hql 30 * * * * app hive -f query_3.hql 0 * * * * app hive -f query_4.hql 1 * * * * app hive -f query_5.hql
  • 39. Workflow Management • Define dependencies • task E is executed after finishing task C and task D • Scheduling • task A is kicked after 09:00 AM • throttle concurrent running of the same task • Monitoring • notification in failure • task C must finish before 01:00 PM (SLA) cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
  • 40. Airflow • A workflow management systems • define workflow by Python • built in shiny UI & CLI • pluggable architecture http://nerds.airbnb.com/airflow/
  • 41. Define Tasks dag = DAG('tutorial', default_args=default_args) t1 = BashOperator( task_id='print_date', bash_command='date', dag=dag) t2 = BashOperator( task_id='sleep', bash_command='sleep 5', retries=3, dag=dag) t3 = BashOperator( task_id='templated', bash_command=""" {% for i in range(5) %} echo "{{ ds }}" echo "{{ macros.ds_add(ds, 7)}}" echo "{{ params.my_param }}" {% endfor %} """, params={'my_param': 'Parameter I passed in'}, dag=dag) t2.set_upstream(t1) t3.set_upstream(t1) Task Dependencies Python code DAG
  • 42. Workflow as Code Deploy codes automatically after merging into master
  • 44. What is done or not?
  • 45. Alerting to Slack • SLA Violation • task A should be done till 00:00 PM • other team's task K has dependency into task A • Output validation failure • stop the following tasks if the output is doubtful
  • 46. Retry from Web UI Once clear histories, airflow scheduler back fill the histories
  • 47. Retry from CLI // Clear some histories from 2016-01-01 airflow clear etl_smartnews --task_regex user_ --downstream --start_date 2016-01-01 // Backfill uncompleted tasks airflow backfill etl_smartnews --start_date 2016-01-01
  • 49. How Long Each Tasks?
  • 50. Pluggable Architecture • Built-in plugins • operator: bash, hive, preto, mysql • transfer: hive_to_mysql • sensor: wait_hive_partition, wait_s3_file • Written our own plugin • mysql_partition
  • 51. Examples user_sensor = S3KeySensor( task_id='wait_user', bucket_name='smartnews', bucket_key='user/dt={{ ds }}/dump.csv', ) etl = HiveOperator( task_id="task1", hql="INSERT OVERWRITE INTO...." ) etl.set_upstream(user_sensor) import = HiveToMySqlTransfer( task_id=name, mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table, sql="SELECT country, count(*) FROM %s" % table, mysql_table=table ) import.set_upstream(etl) Wait a S3 file creation After the file is created, Run ETL Query After that, Import into MySQL
  • 53. Provides batch views in low-latency and ad-hoc way
  • 54. Presto • A distributed SQL query engine • join multiple data sources (Hive + MySQL) • support standard ANSI SQL • designed to handle TBs or PBs scale data cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
  • 55. Presto Architecture Amazon S3 Kinesis Stream Amazon RDS Amazon Aurora Presto Worker Presto Worker Presto Worker Presto Worker Presto Worker Presto Worker Presto Coordinator Client 1. Query with Standard SQL 4. Scan data concurrently 5. Aggregate data without disk I/O 6. Return result to client 2. Generate execution plan 3. Dispatch tasks into multiple workers Amazon EMR (Hive Metastore) Provides Hive table metadata (S3 access only) ※ https://github.com/qubole/presto-kinesis ※
  • 56. Why Presto? • Join multiple data sources • skip large parts of ETL process • enable to merge Hive/MySQL/Kinesis/PipelineDB • Low latency • ~30s to scan billions records in S3 • Low maintenance cost • stateless, and easy to integrate with Auto Scaling
  • 57. Use case: A/B Test -- Suppose that this table exists DESC hive.default.user_activities; user_id bigint action varchar abtest array<map<varchar, bigint>> url varchar -- Summarize page view per A/B Test identifier -- for comparing two algorithms v1 & v2 SELECT   dt,   t['behaviorId'],   count(*) as pv FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t) WHERE dt like '2016-01-%' AND action = 'viewArticle' AND t['definitionId'] = 163 GROUP BY dt, t['behaviorId'] ORDER BY dt ; 2015-12-01 | algorithm_v1 | 40000 2015-12-01 | algorithm_v2 | 62000
  • 58. Use case: Troubleshoot -- Store access logs to S3, and query to them -- Summarize access & 95pct response time by SQL SELECT from_unixtime(timestamp), count(*) as access, approx_percentile(reqtime, 0.95) as pct95_reqtime FROM hive.default.access_log WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx' GROUP BY timestamp ORDER BY timestamp ; 2015-11-04 22:00:00.000 | 6377 | 0.522 2015-11-04 22:00:01.000 | 3580 | 0.422
  • 59. Scheduled Auto Scaling $ aws autoscaling describe-scheduled-actions { "ScheduledUpdateGroupActions": [ { "DesiredCapacity": 2, "AutoScalingGroupName": "presto-worker-prd", "Recurrence": "59 14 * * *", "ScheduledActionName": "scalein-2359-jst" }, { "DesiredCapacity": 20, "AutoScalingGroupName": "presto-worker-prd", "Recurrence": "45 0 * * 1-5", "ScheduledActionName": "scaleout-0945-jst" } ] }
  • 60. Presto Covers Everything? No! • Fixed system on Amazon Aurora (or other RDB) • provides KPI for products & business • require high availability & low latency • has no flexibility • Ad-hoc system on Presto • provides access to all dataset on data platform • require high scalability • has flexibility (join various data sources)
  • 61. Why Fixed vs Ad-hoc? • Difficulties on the Ad-hoc only solution • difficult to prevent heavy queries • large distinct count exhausts computing resources • decrease presto maintainability
  • 63. Chartio • Dashboard as A Service • helps businesses analyze and track their critical data • one of AWS partners (※) • Combine multiple data sources at one dashboard • Presto, MySQL, Redshift, BigQuery, Elasticsearch ... • enable to join BigQuery + MySQL internally • Easy to use for every one • everyone can make their own dashboard • write SQL directly / generate query by drag & drop ※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
  • 64. Creating dashboard 1. Building query (Drag&Drop / SQL) 2. Add step (filter、sort、modify) 3. Select visualize way (table、graph)
  • 66. Why Chartio? • Chartio saves a lot of engineering resources • before • maintain in-house dashboard written by rails • everyone got tired to maintain it • after • everyone can build their own dashboard easily • Chartio's UI is cool • very important factor for dashboard tool
  • 67. Missing Pieces of Chartio • No programable API provides • need to edit dashboard / chart manually • No rollback feature • all changes are recorded, but not rollback to the previous state • work around : clone => edit => rename
  • 69. Why Speed is Matter?
  • 70. Today’s News is Wrapping Tomorrow’s Fish and Chips
  • 73. Use cases • Re-rank news articles by user feedback • track user's positive/negative signal • consider gender, age, location, interests • Realtime article monitoring • detect high bounce rate (may be broken?) • make realtime reporting dashboard for A/B test
  • 74. Realtime Re-Ranking ref. Stream 処理 (Spark Streaming + Kinesis) と Offline 処理 (Hive) の統合 www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive Amazon CloudSearch Search API API Gateway Kinesis Stream Amazon S3 Amazon EMR Amazon S3 Amazon EMR DynamoDB Realtime Feedback Re-rank Articles Article Metadata User Interests User Behaviors Offline Procees by Hive / Spark
  • 75. Realtime Monitoring API Gateway Stream Continuous View Continuous View Continuous View Discard raw record soon after consumed by Continuous View Incrementally updated in realtime PipelineDB Chartio AWS Lambda Slack Access Continuous View by PostgreSQL Client Record ※1 ※1 Use cron on 26 Feb. 2016 Migrate it soon after supporting VPC
  • 76. PipelineDB • OSS & enterprise streaming SQL database • PostgreSQL compatible • connect to Chartio 😍 • join stream to normal PostgreSQL table • Support probabilistic data structures • e.g. HyperLogLog https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
  • 77. Continuous View -- Calculate unique users seen per media each day -- Using only a constant amount of space (HyperLogLog) CREATE CONTINUOUS VIEW uniques AS SELECT day(arrival_timestamp), substring(url from '.*://([^/]*)') as hostname, COUNT(DISTINCT user_id::integer) FROM activity_stream GROUP BY day,hostname; -- How many impressions have we served in the last five minutes? CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS SELECT COUNT(*) FROM imps_stream; -- What are the 90th, 95th, 99th percentiles of request latency? CREATE CONTINUOUS VIEW latency AS SELECT percentile_cont(array[90, 95, 99]) WITHIN GROUP (ORDER BY latency::integer) FROM latency_stream;
  • 79. Sustainable Data Platform • build a reliable and scalable lambda architecture • minimize operation & running cost • be open to uncertain future
  • 80. My Wishlist to AWS • Support Reduced Redundancy Storage (RRS) on EMR • Faster EMR Launch • Set TTL to DynamoDB records • Auto-scale Kinesis Stream • Launch Kinesis Analytics in Tokyo region