The document provides an overview of the Hadoop platform at Yahoo over the past year. It discusses the evolution of the platform infrastructure and metrics, including growth in storage from 12 PB to 65 PB and in compute capacity from 23 TB to 240 TB. It highlights new technologies added to the platform, such as CaffeOnSpark for distributed deep learning, Apache Storm for streaming analytics, and sketch-based approximate algorithms (DataSketches). It also covers enhancements to existing technologies, such as transactions on HBase with Omid and Oozie improvements for data pipelines. The document aims to provide insight into how the Hadoop platform at Yahoo has scaled to support growing analytics needs through consolidation, new services, and ease-of-use features.
3. Platform Evolution
[Chart: raw HDFS storage (PB) and number of servers per year, 2006–2016]
Milestones:
- Yahoo! commits to scaling Hadoop for production use
- Research workloads in Search and Advertising
- Production modeling with machine learning & WebMap
- Revenue systems with security, multi-tenancy, and SLAs
- Open sourced with Apache
- Hortonworks spinoff for enterprise hardening
- Next-gen Hadoop (Hadoop 0.23, YARN)
- New services (HBase, Storm, Spark, Hive)
- Increased user base with partitioned namespaces
- Apache Hadoop 2.7 (scalable ML, latency, utilization, productivity)
3
4. Deployment Models
On-Premise
- Private (dedicated) clusters
  - Large, demanding use cases
  - New technology not yet platformized
  - Data movement and regulation issues
- Hosted multi-tenant (private cloud) clusters
  - Source of truth for all of the org's data
  - App delivery agility
  - Operational efficiency and cost savings through economies of scale
Public Cloud
- Hosted compute clusters
  - When more cost effective than on-premise
  - Time to market / results matter
  - Data already in public cloud
Purpose-built big data clusters
- For performance and tighter integration with the tech stack
- Value-added services such as monitoring, alerts, tuning, and common tools
4
9. Consolidated Research Cluster Characteristics — One Month Sample (2016)
[Chart: total and used compute (TB) over the month]
Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. utilization 70%
Before: 10,500 servers → After: 2,200 servers
- 40% decrease in TCO
- 65% increase in compute capacity
- 50% increase in avg. utilization
9
10. Common Hadoop Cluster Configuration
[Diagram: racks 1 through N of CPU servers with JBODs & 10GbE, connected by a network backplane]
10
11. New Hadoop Cluster Configuration
[Diagram: racks 1 through N of CPU servers with JBODs & 10GbE on a network backplane, augmented with GPU servers and high-memory servers interconnected over 100 Gbps InfiniBand]
11
12. YARN Node Labels
[Diagram: a Hadoop cluster with nodes labeled x and y; Queue 1 (40%) has access to label x, Queue 2 (40%) to labels x and y, Queue 3 (20%) is unlabeled; jobs J1–J4 are placed on matching nodes]
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label requested by the queue
Hadoop Cluster
12
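As an illustrative sketch, these settings go in capacity-scheduler.xml; the queue name "queue1" and label "x" below are hypothetical:

```xml
<configuration>
  <!-- This queue may run containers on nodes labeled "x" -->
  <property>
    <name>yarn.scheduler.capacity.root.queue1.accessible-node-labels</name>
    <value>x</value>
  </property>
  <!-- Containers from this queue request label "x" by default -->
  <property>
    <name>yarn.scheduler.capacity.root.queue1.default-node-label-expression</name>
    <value>x</value>
  </property>
</configuration>
```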
18. CaffeOnSpark Architecture – Common Cluster
[Diagram: a Spark Driver coordinates N identical Spark Executors. Each executor handles data feeding and control, runs Caffe (enhanced with multi-GPU/CPU support), and hosts a Model Synchronizer that exchanges model state across nodes via MPI on RDMA / TCP. Executors read datasets from HDFS; the trained model is output to HDFS.]
18
19. CaffeOnSpark Architecture – Incremental Learning
Feature Engineering: Deep Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // train the DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL
19
20. CaffeOnSpark Architecture – Incremental Learning
Feature Engineering: Deep Learning → Train Classifiers: Non-deep Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)            // train the DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)  // extract features via DL
lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                    .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df) …
20
24. Hadoop Compute Sources
- HDFS (file system and storage)
- YARN (resource management and scheduling)
- Execution engines: MapReduce (legacy), Tez (execution engine for Pig and Hive), Spark (alternate execution engine)
- Data processing: Pig (scripting), Hive (SQL), Java MR APIs, ML, custom apps on Slider
- Data management: Oozie
24
34. Security
Authentication
- Kerberos (users, processes)
- Delegation tokens (MapReduce, YARN, etc.)
Authorization
- HBase ACLs (Read, Write, Create, Admin)
- Grant permissions to a user or Unix group
- ACLs at table, column family, or column level
34
35. Region Server Groups
- Dedicated region servers for a set of tables
- Resource isolation (CPU, memory, IO, etc.)
[Diagram: RegionServer Group Foo = region servers 1–5 serving TableA–TableF; RegionServer Group Bar = region servers 6–10 serving Table1–Table6]
35
36. Namespaces
- Analogous to a "database"
- Namespace ACL to create tables
- Default region server group
- Quotas on tables and regions
[Diagram: Namespace → group, tables, quota, ACL]
36
37. Split Meta to Spread Load and Avoid Large Regions
37
40. Scaling HBase to Handle Millions of Regions on a Cluster
- Region server groups
- Split meta
- Split ZK
- Favored nodes
- Humongous tables
40
41. Transactions on HBase with Omid¹
- Highly performant and fault-tolerant ACID transactional framework
- New Apache Incubator project: incubator.apache.org/projects/omid.html
- Handles millions of transactions per day for search and personalization products
¹ Omid means "hope" in Persian
41
45. Oozie Data Pipelines
[Flow diagram:]
1. Oozie queries/polls HCatalog for the partition
2. Oozie registers a topic on the message bus
3. The data producer writes data to HDFS (distcp, Pig, M/R, ...) under /data/click/2014/06/02 and updates metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02'); HCatalog pushes a <New Partition> notification
4. The message bus notifies Oozie of the new partition, and Oozie starts the workflow
45
46. Large Scale Data Pipeline Requirements
Administrative
- Start, stop, and pause all related pipelines at the same time
Dependency Management
- Output of coordinator action "n+1" depends on coordinator action "n" (dataset dependency)
- If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
- Start as soon as mandatory data is available; other feeds are optional
- Data is not guaranteed: start processing even if only partial data is available
SLA Management
- Monitor pipeline processing to take immediate action on failures or SLA misses
- Pipeline owners should be notified if an SLA is missed
Multiple Providers
- If data is available from multiple providers, specify the provider priority
- Combine datasets from multiple providers to fill the gaps a single provider may have
46
48. BCP and Mandatory / Optional Feeds
Pull data from A or B: specify the dataset as "AorB". The action starts running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Dataset B is optional: Oozie starts processing as soon as A is available, including all of A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
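The gating behavior of the two patterns above can be sketched in Python (a hypothetical simulation of the semantics, not Oozie code; `available` maps dataset names to availability):

```python
def or_ready(available):
    """<or name="AorB">: start as soon as either A or B is available."""
    return available.get("A", False) or available.get("B", False)

def and_optional_ready(available):
    """<and> with <data-in dataset="B" min="0"/>: B is optional,
    so only A gates the start; B contributes whatever instances exist."""
    return available.get("A", False)

print(or_ready({"A": False, "B": True}))            # True: B alone suffices
print(and_optional_ready({"A": False, "B": True}))  # False: A is mandatory
print(and_optional_ready({"A": True, "B": False}))  # True: B is optional
```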
48
49. Data Not Guaranteed / Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
Oozie starts processing once at least 10 instances of A are available. min can also be combined with wait (as shown for dataset B).
<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>
49
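The min/wait interaction can be sketched in Python (a hypothetical simulation, not Oozie's implementation: once min instances exist, Oozie keeps waiting for more until the wait window elapses or all expected instances arrive):

```python
def ready(n_available, minutes_elapsed, min_required, wait=None, expected=None):
    """Start once min_required instances exist; with a wait window set,
    keep waiting for more instances until it elapses or all arrive."""
    if n_available < min_required:
        return False                      # min not met: never start
    if wait is None:
        return True                       # min met, no wait window
    if expected is not None and n_available >= expected:
        return True                       # everything arrived early
    return minutes_elapsed >= wait        # otherwise wait out the window

print(ready(12, 0, 10))                       # True: min met, no wait window
print(ready(12, 5, 10, wait=20, expected=24))   # False: still waiting for more
print(ready(12, 25, 10, wait=20, expected=24))  # True: wait window elapsed
print(ready(24, 5, 10, wait=20, expected=24))   # True: all instances arrived
```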
50. Combining Datasets From Multiple Providers
The combine function first takes instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
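The combine semantics can be sketched in Python (a hypothetical simulation, not Oozie code; instance keys and paths are made up for illustration):

```python
def combine(primary, secondary, nominal_times):
    """<combine>: take each instance from the primary provider (A) when
    present, and fill the gaps from the secondary provider (B)."""
    out = {}
    for t in nominal_times:
        if t in primary:
            out[t] = primary[t]
        elif t in secondary:
            out[t] = secondary[t]
    return out

a = {"2014060100": "A/00"}                      # provider A is missing hour 01
b = {"2014060100": "B/00", "2014060101": "B/01"}
print(combine(a, b, ["2014060100", "2014060101"]))
# {'2014060100': 'A/00', '2014060101': 'B/01'}
```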
50
66. THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)
Editor's Notes
JIRA 1976 (Oozie 4.3)
While ${coord:latest} allows skipping to available instances, the workflow will not trigger unless the specified number of instances is found. min can also be combined with wait: once the min dependencies are met, Oozie keeps waiting for more instances until the wait time elapses or all data dependencies are met.
Protocols
- REST – Use python-requests and a custom client to streamline RESTful interface calls
- Thrift – Custom connection pooling and socket multiplexing to streamline Thrift calls
Accessibility
- Middleware – Make Hadoop interfaces accessible in request objects
Hue uses the CherryPy web server. You can use the following options to change the IP address and port the web server listens on; the default setting is port 8888 on all configured IP addresses.
If you don't specify a secret key, your session cookies will not be secure. Hue will run, but it will display error messages telling you to set the secret key.
You can configure Hue to serve over HTTPS. To do so, you must install "pyOpenSSL" within Hue's context and configure your keys.