The document provides an overview of the Hadoop platform at Yahoo over the past year. It discusses the evolution of the platform infrastructure and metrics, including growth in storage from 12 PB to 65 PB and in compute capacity from 23 TB to 240 TB. It highlights new technologies added to the platform, such as CaffeOnSpark for distributed deep learning, Apache Storm for streaming analytics, and sketch-based approximate algorithms (DataSketches). It also covers enhancements to existing technologies, such as transactions on HBase with Omid and Oozie improvements for data pipelines. The document aims to provide insight into how the Hadoop platform at Yahoo has scaled to support growing analytics needs through consolidation, new services, and ease-of-use features.
3. Platform Evolution
[Chart: raw HDFS storage (PB) and number of servers per year, 2006–2016]
Milestones:
- Yahoo! commits to scaling Hadoop for production use
- Research workloads in Search and Advertising
- Production modeling with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Next-gen Hadoop (Hadoop 0.23, YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache Hadoop 2.7 (scalable ML, latency, utilization, productivity)
3
4. Deployment Models
On-Premise
- Private (dedicated) clusters
  - Large, demanding use cases
  - New technology not yet platformized
  - Data movement and regulation issues
- Hosted multi-tenant (private cloud) clusters
  - Source of truth for all of the org's data
  - App delivery agility
  - Operational efficiency and cost savings through economies of scale
Public Cloud
- Hosted compute clusters
  - When more cost effective than on-premise
  - Time to market / results matter
  - Data already in public cloud
Purpose-built big data clusters
- For performance and tighter integration with the tech stack
- Value-added services such as monitoring, alerts, tuning, and common tools
4
9. Consolidated Research Cluster Characteristics — One Month Sample (2016)
[Chart: total and used compute (TB) over the month]
Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. utilization 70%
Before: 10,500 servers → After: 2,200 servers
- 40% decrease in TCO
- 65% increase in compute capacity
- 50% increase in avg. utilization
9
10. Common Hadoop Cluster Configuration
[Diagram: racks 1 through N of CPU servers with JBODs & 10GbE, connected by a network backplane]
10
11. New Hadoop Cluster Configuration
[Diagram: racks 1 through N of CPU servers with JBODs & 10GbE on a network backplane, augmented with GPU servers and high-memory servers interconnected over 100 Gbps InfiniBand]
11
12. YARN Node Labels
[Diagram: a Hadoop cluster with nodes labeled x and y; Queue 1 (40%) has access to label x, Queue 2 (40%) to labels x and y, Queue 3 (20%) is unlabeled; jobs J1–J4 are placed on matching nodes]
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label requested by the queue
Hadoop Cluster
12
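As an illustrative sketch, these settings go in capacity-scheduler.xml; the queue name "queue1" and label "x" below are hypothetical:

```xml
<configuration>
  <!-- This queue may run containers on nodes labeled "x" -->
  <property>
    <name>yarn.scheduler.capacity.root.queue1.accessible-node-labels</name>
    <value>x</value>
  </property>
  <!-- Containers from this queue request label "x" by default -->
  <property>
    <name>yarn.scheduler.capacity.root.queue1.default-node-label-expression</name>
    <value>x</value>
  </property>
</configuration>
```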
18. CaffeOnSpark Architecture – Common Cluster
[Diagram: a Spark Driver coordinates N identical Spark Executors. Each executor handles data feeding and control, runs Caffe (enhanced with multi-GPU/CPU support), and hosts a Model Synchronizer that exchanges model state across nodes via MPI on RDMA / TCP. Executors read datasets from HDFS; the trained model is output to HDFS.]
18
19. CaffeOnSpark Architecture – Incremental Learning
Feature Engineering: Deep Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // train the DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL
19
20. CaffeOnSpark Architecture – Incremental Learning
Feature Engineering: Deep Learning → Train Classifiers: Non-deep Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // train the DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL
lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                    .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df) …
20
24. Hadoop Compute Sources
- HDFS (file system and storage)
- YARN (resource management and scheduling)
- Execution engines: MapReduce (legacy), Tez (execution engine for Pig and Hive), Spark (alternate execution engine)
- Data processing: Pig (scripting), Hive (SQL), Java MR APIs, ML, custom apps on Slider
- Data management: Oozie
24
34. Security
Authentication
- Kerberos (users, processes)
- Delegation tokens (MapReduce, YARN, etc.)
Authorization
- HBase ACLs (Read, Write, Create, Admin)
- Grant permissions to a user or Unix group
- ACLs at table, column family, or column level
34
35. Region Server Groups
- Dedicated region servers for a set of tables
- Resource isolation (CPU, memory, IO, etc.)
[Diagram: RegionServer Group Foo = region servers 1–5 serving TableA–TableF; RegionServer Group Bar = region servers 6–10 serving Table1–Table6]
35
36. Namespaces
- Analogous to a "database"
- Namespace ACL to create tables
- Default region server group
- Quotas on tables and regions
[Diagram: Namespace → group, tables, quota, ACL]
36
37. Split Meta to Spread Load and Avoid Large Regions
37
40. Scaling HBase to Handle Millions of Regions on a Cluster
- Region server groups
- Split meta
- Split ZK
- Favored nodes
- Humongous tables
40
41. Transactions on HBase with Omid¹
- Highly performant and fault-tolerant ACID transactional framework
- New Apache Incubator project: incubator.apache.org/projects/omid.html
- Handles millions of transactions per day for search and personalization products
¹ Omid means "hope" in Persian
41
45. Oozie Data Pipelines
[Flow diagram:]
1. Oozie queries/polls HCatalog for the partition
2. Oozie registers a topic on the message bus
3. The data producer writes data to HDFS (distcp, Pig, M/R, ...) under /data/click/2014/06/02 and updates metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02'); HCatalog pushes a <New Partition> notification
4. The message bus notifies Oozie of the new partition, and Oozie starts the workflow
45
46. Large Scale Data Pipeline Requirements
Administrative
- Start, stop, and pause all related pipelines at the same time
Dependency Management
- Output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency)
- If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
- Start as soon as mandatory data is available; other feeds are optional
- Data is not guaranteed: start processing even if only partial data is available
SLA Management
- Monitor pipeline processing to take immediate action on failures or SLA misses
- Pipeline owners should be notified if an SLA is missed
Multiple Providers
- If data is available from multiple providers, specify the provider priority
- Combine datasets from multiple providers to fill the gaps a single provider may have
46
48. BCP and Mandatory / Optional Feeds
Pull data from A or B: specify the dataset as "AorB". The action starts running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Dataset B is optional: Oozie starts processing as soon as A is available, including all of A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
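The gating behavior of the two patterns above can be sketched in Python (a hypothetical simulation of the semantics, not Oozie code; `available` maps dataset names to availability):

```python
def or_ready(available):
    """<or name="AorB">: start as soon as either A or B is available."""
    return available.get("A", False) or available.get("B", False)

def and_optional_ready(available):
    """<and> with <data-in dataset="B" min="0"/>: B is optional,
    so only A gates the start; B contributes whatever instances exist."""
    return available.get("A", False)

print(or_ready({"A": False, "B": True}))            # True: B alone suffices
print(and_optional_ready({"A": False, "B": True}))  # False: A is mandatory
print(and_optional_ready({"A": True, "B": False}))  # True: B is optional
```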
48
49. Data Not Guaranteed / Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
Oozie starts processing once at least 10 instances of A are available. min can also be combined with wait (as shown for dataset B).
<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>
49
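The min/wait interaction can be sketched in Python (a hypothetical simulation, not Oozie's implementation: once min instances exist, Oozie keeps waiting for more until the wait window elapses or all expected instances arrive):

```python
def ready(n_available, minutes_elapsed, min_required, wait=None, expected=None):
    """Start once min_required instances exist; with a wait window set,
    keep waiting for more instances until it elapses or all arrive."""
    if n_available < min_required:
        return False                      # min not met: never start
    if wait is None:
        return True                       # min met, no wait window
    if expected is not None and n_available >= expected:
        return True                       # everything arrived early
    return minutes_elapsed >= wait        # otherwise wait out the window

print(ready(12, 0, 10))                       # True: min met, no wait window
print(ready(12, 5, 10, wait=20, expected=24))   # False: still waiting for more
print(ready(12, 25, 10, wait=20, expected=24))  # True: wait window elapsed
print(ready(24, 5, 10, wait=20, expected=24))   # True: all instances arrived
```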
50. Combining Datasets From Multiple Providers
The combine function first takes instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
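The combine semantics can be sketched in Python (a hypothetical simulation, not Oozie code; instance keys and paths are made up for illustration):

```python
def combine(primary, secondary, nominal_times):
    """<combine>: take each instance from the primary provider (A) when
    present, and fill the gaps from the secondary provider (B)."""
    out = {}
    for t in nominal_times:
        if t in primary:
            out[t] = primary[t]
        elif t in secondary:
            out[t] = secondary[t]
    return out

a = {"2014060100": "A/00"}                      # provider A is missing hour 01
b = {"2014060100": "B/00", "2014060101": "B/01"}
print(combine(a, b, ["2014060100", "2014060101"]))
# {'2014060100': 'A/00', '2014060101': 'B/01'}
```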
50
66. THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)
Editor's Notes
JIRA 1976 (Oozie 4.3)
While ${coord:latest} allows skipping to available instances, the workflow will not trigger unless the specified number of instances is found. min can also be combined with wait: once the min dependencies are met, Oozie keeps waiting for more instances until the wait time elapses or all data dependencies are met.
Protocols
- REST – Use python-requests and a custom client to streamline RESTful interface calls
- Thrift – Custom connection pooling and socket multiplexing to streamline Thrift calls
Accessibility
- Middleware – Make Hadoop interfaces accessible in request objects
Hue uses the CherryPy web server. You can use the following options to change the IP address and port the web server listens on; the default setting is port 8888 on all configured IP addresses.
If you don't specify a secret key, your session cookies will not be secure. Hue will run, but it will display error messages telling you to set the secret key.
You can configure Hue to serve over HTTPS. To do so, you must install "pyOpenSSL" within Hue's context and configure your keys.