HADOOP PLATFORM
AT YAHOO
A YEAR IN REVIEW
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Agenda
1. Platform Overview
2. Infrastructure and Metrics
3. CaffeOnSpark for Distributed DL
4. Compute and Sketches
5. HBase and Omid
6. Oozie
7. Ease of Use
8. Q&A
Platform Evolution
[Chart: number of servers (0 to 50,000) and raw HDFS storage in PB (0 to 800) by year, 2006 to 2016]
Milestones annotated on the chart:
• Yahoo! commits to scaling Hadoop for production use
• Research workloads in Search and Advertising
• Production (modeling) with machine learning and WebMap
• Revenue systems with security, multi-tenancy, and SLAs
• Open sourced with Apache
• Hortonworks spinoff for enterprise hardening
• Next-gen Hadoop (H 0.23, YARN)
• New services (HBase, Storm, Spark, Hive)
• Increased user base with partitioned namespaces
• Apache H2.7 (scalable ML, latency, utilization, productivity)
Deployment Models
On-Premise
Private (dedicated) clusters
• Large, demanding use cases
• New technology not yet platformized
• Data movement and regulation issues
Hosted multi-tenant (private cloud) clusters
• Source of truth for all of the organization's data
• App delivery agility
• Operational efficiency and cost savings through economies of scale
Public Cloud
Hosted compute clusters
• When more cost-effective than on-premise
• Time to market/results matter
• Data already in public cloud
Purpose-built big data clusters
• For performance, tighter integration with tech stack
• Value-added services such as monitoring, alerts, tuning, and common tools
Platform Today
Legend: Apache / Open Source Projects, Yahoo Projects
Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
Compute: YARN, CS, MR, Tez, Spark, Storm
Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
Technology Stack Assembly
Legend: Apache Projects, Yahoo Projects; Platformized Tech with Production Support; In-progress, Unmet needs or Apache Alignment
Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS
Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
Compute: YARN, CS, MR, Tez, Spark, Storm
Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH
Base layers:
• YARN (Scheduling, Resource Management)
• HDFS (File System)
• Common
• RHEL6 64-bit, JDK8
Common Backplane
[Diagram: clusters and shared services attached to a common network backplane]
• Hadoop clusters: NameNode, RM, DataNode, NodeManager
• HBase clusters: NameNode, HBase Master, DataNodes, RegionServers
• Storm clusters: Nimbus, Supervisor
• Administration, Management and Monitoring
• ZooKeeper Pools
• HTTP/HDFS/GDM Load Proxies
• Oozie Server
• HS2/HCat
• Applications and Data: Data Feeds, Data Stores
Research Cluster Consolidation
[Charts: compute total and used (TB), one month sample (2015), one panel per cluster]
• Cluster 1 (2,000 servers): HDFS 12 PB, Compute 23 TB, Avg. Util: 26%
• Cluster 2 (3,100 servers): HDFS 21 PB, Compute 52 TB, Avg. Util: 40%
• Cluster 3 (5,400 servers): HDFS 36 PB, Compute 70 TB, Avg. Util: 59%
Consolidated Research Cluster Characteristics
[Chart: compute total and used (TB), one month sample (2016)]
• Consolidated Cluster: HDFS 65 PB, Compute 240 TB, Avg. Util: 70%
• Servers: 10,500 before, 2,200 after
• 40% decrease in TCO
• 65% increase in compute capacity
• 50% increase in avg. utilization
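The consolidation percentages can be cross-checked from the per-cluster figures on the two slides. The sketch below is only an arithmetic check; it assumes that the "before" average utilization is weighted by each cluster's compute capacity, which the slides do not state.

```python
# Per-cluster figures from the 2015 slide:
# (servers, HDFS in PB, compute in TB, average utilization)
before = {
    "Cluster 1": (2000, 12, 23, 0.26),
    "Cluster 2": (3100, 21, 52, 0.40),
    "Cluster 3": (5400, 36, 70, 0.59),
}
# Figures from the 2016 consolidated-cluster slide
after_compute_tb = 240
after_util = 0.70

total_servers = sum(c[0] for c in before.values())
total_compute_tb = sum(c[2] for c in before.values())

# Assumption: weight each cluster's utilization by its compute capacity
used_tb = sum(c[2] * c[3] for c in before.values())
weighted_util = used_tb / total_compute_tb

capacity_increase = after_compute_tb / total_compute_tb - 1
util_increase = after_util / weighted_util - 1

print(f"servers before: {total_servers}")
print(f"compute before: {total_compute_tb} TB, weighted util: {weighted_util:.1%}")
print(f"capacity increase: {capacity_increase:.1%}")
print(f"utilization increase: {util_increase:.1%}")
```

Under that weighting assumption, the three clusters add up to the 10,500 servers quoted as "before", and the computed increases come out close to the 65% and 50% figures on the slide.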
Common Hadoop Cluster Configuration
[Diagram: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs and 10GbE, connected by the network backplane]
New Hadoop Cluster Configuration
[Diagram: Rack 1, Rack 2, ... Rack N on the network backplane; CPU servers with JBODs and 10GbE, plus GPU servers and Hi-Mem servers, with 100Gbps InfiniBand]
YARN Node Labels
[Diagram: a Hadoop cluster whose nodes are labeled x or y; jobs J1 to J4 are submitted to three queues]
• Queue 1 (40%): label x
• Queue 2 (40%): labels x, y
• Queue 3 (20%)
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label requested by applications submitted to the queue
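The queue-to-label mapping on the slide can be written out as Capacity Scheduler properties. The following is a minimal sketch, not a YARN API: the queue names queue1 to queue3 and the helper functions are hypothetical, and only the property names come from the slide and the standard Capacity Scheduler configuration.

```python
# Hypothetical in-memory model of the slide's three queues (not a YARN API).
QUEUES = {
    "queue1": {"capacity": 40, "accessible_labels": {"x"}},
    "queue2": {"capacity": 40, "accessible_labels": {"x", "y"}},
    "queue3": {"capacity": 20, "accessible_labels": set()},
}


def capacity_scheduler_properties(queues):
    """Render the queue model as capacity-scheduler.xml style key/value pairs."""
    props = {}
    for name, queue in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(queue["capacity"])
        if queue["accessible_labels"]:
            labels = ",".join(sorted(queue["accessible_labels"]))
            props[f"{prefix}.accessible-node-labels"] = labels
    return props


def can_run_on(queue_name, node_label):
    """True if the queue may use nodes carrying node_label.

    Nodes without a label (node_label is None) form the default partition,
    which this simplified model treats as open to every queue.
    """
    if node_label is None:
        return True
    return node_label in QUEUES[queue_name]["accessible_labels"]


for key, value in sorted(capacity_scheduler_properties(QUEUES).items()):
    print(f"{key} = {value}")
```

The model ignores per-label capacity shares, which a real configuration would also need to set for each accessible label.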
CaffeOnSpark – Distributed Deep Learning
[Diagram: stack, top to bottom]
• CaffeOnSpark for DL, MLlib for non-DL, Hive or SparkSQL, . . .
• Spark
• YARN (RM and Scheduling)
• HDFS (Datasets)
Few Use Cases – Yahoo Weather
Few Use Cases – Flickr Facial Recognition
Few Use Cases – Flickr Scene Detection
CaffeOnSpark Architecture – Common Cluster
[Diagram: a Spark Driver and three identical Spark Executors; the model synchronizers communicate over MPI on RDMA / TCP]
Each Spark Executor (for data feeding and control) contains:
• Caffe (enhanced with multi-GPU/CPU)
• Model Synchronizer (across nodes)
• HDFS Datasets as input
Model O/P on HDFS
CaffeOnSpark Architecture – Incremental Learning
import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}
import org.apache.spark.ml.classification.LogisticRegression
// Feature engineering: deep learning
val cos = new CaffeOnSpark(ctx)
val conf = new Config(ctx, args).init()
val dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) // train the DL model
val lr_raw_source = DataSource.getSource(conf, false)
val ext_df = cos.features(lr_raw_source) // extract features via DL
// Train classifiers: non-deep learning
val lr_input_df = ext_df
  .withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
val lr_model = lr.fit(lr_input_df) …
CaffeOnSpark Architecture – Single Command
spark-submit \
    --num-executors ${NUM_EXECUTORS} \
    --class CaffeOnSpark \
    my-caffe-on-spark.jar \
    -devices ${NUM_GPUS} \
    -model dl_model_file \
    -output lr_model_file
Distributed Deep Learning
CaffeOnSpark (github.com/yahoo/caffeonspark)
• Apache License
• Existing Clusters
• Powerful DL Platform
• Fully Distributed
• High-level API
• Incremental Learning
Hadoop Compute Sources
[Diagram: compute stack, top to bottom]
• Data Processing: Pig (Scripting), Hive (SQL), Java MR APIs
• Other sources: ML, Custom App on Slider, Oozie, Data Management
• Execution engines: Tez (Execution Engine for Pig and Hive), Spark (Alternate Exec Engine), MapReduce (Legacy)
• YARN (Resource Management and Scheduling)
• HDFS (File System and Storage)
Compute Growth
[Chart: number of MR, Tez, and Spark jobs per month (in millions), Mar-13 to Mar-16; labeled data points: 13.3, 20.4, 23.8, 27.2, 32.3, 34.1, 39.1]
Pushing Batch Compute Boundaries
[Chart: % of total compute (memory-sec) by engine (MapReduce, Tez, Spark), Q1 2016]
112 Million Batch Jobs in Q1’16
• MapReduce: Jan 78%, Mar 67%
• Tez and Spark (remaining shares): Jan 8% and 14%, Mar 21% and 12%
Multi-tenant Apache Storm
Recent Apache Storm Developments at Yahoo
• MT & RA Scheduler
• Dist. Cache API
• 8x Throughput
• Improved Debuggability
• Pacemaker Server
• Streaming Benchmark (1)
(1) github.com/yahoo/streaming-benchmarks
Data Sketches Algorithms
Data Sketches Algorithms Library: datasketches.github.io
Characteristics
• Good enough approximate answers for problem queries
• Streamable
• Approximate with predictable error
• Sub-linear in size
• Mergeable / additive
• Highly parallelizable
• Maven deployable
Distinct Count Sketch, High-level View
Basic Sketch Elements
Big Data Stream → Transform (White Noise) → Data Structure → Estimator → Result +/- ε
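The transform, data structure, and estimator stages can be illustrated with a K-Minimum-Values (KMV) distinct-count sketch. This is a minimal sketch of one well-known technique and not the Data Sketches library's API (that library is Java); the class name, the choice of SHA-256 as the transform, and the default k are assumptions made for illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap of the k smallest values
        self._members = set()  # the same hashes, for duplicate detection

    @staticmethod
    def _hash01(item):
        # Transform: map each item to a pseudo-random point in [0, 1)
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _offer(self, h):
        # Data structure: retain only the k smallest distinct hash values
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._offer(self._hash01(item))

    def merge(self, other):
        """Return a new sketch equivalent to one built over both streams."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self):
        # Estimator: exact while fewer than k distinct values have been seen,
        # otherwise (k - 1) divided by the k-th smallest hash value
        if len(self._heap) < self.k:
            return float(len(self._heap))
        return (self.k - 1) / -self._heap[0]
```

The sketch holds at most k values regardless of stream length (sub-linear in size), processes one item at a time (streamable), and two sketches built on different partitions can be merged without revisiting the data (mergeable), which is what makes the approach parallelizable.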
Apache HBase at Yahoo
• Security
• Isolated Deployment
• Multi-tenant
• Region Server Group
• Namespace
• Unsupported Features
[Diagram: a Compute Cluster (JobTracker, Namenode, TaskTracker/DataNode nodes running M/R tasks) next to a separate HBase Cluster (HBase Master, Zookeeper Quorum, Namenode, RegionServer/DataNode nodes); access is through HBase clients, an MR client, a Gateway/Launcher, and an HTTP client via a Rest Proxy]
Security
• Authentication
  ◦ Kerberos (users, processes)
  ◦ Delegation Token (MapReduce, YARN, etc.)
• Authorization
  ◦ HBase ACLs (Read, Write, Create, Admin)
  ◦ Grant permissions to User or Unix Group
  ◦ ACL for Table, Column Family or Column
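The authorization rules above can be modeled as grants scoped to a table, column family, or column, held by a user or a group. The sketch below is a simplified illustration and not HBase's AccessController; the `Grant` class, `is_allowed` function, and all user, group, and table names are hypothetical. The `@` prefix for groups mirrors the convention used by the HBase shell's grant command.

```python
from dataclasses import dataclass
from typing import Optional

PERMISSIONS = {"READ", "WRITE", "CREATE", "ADMIN"}


@dataclass(frozen=True)
class Grant:
    principal: str                   # a user name, or "@group" for a Unix group
    permissions: frozenset           # subset of PERMISSIONS
    table: str
    family: Optional[str] = None     # None means every column family
    qualifier: Optional[str] = None  # None means every column in the family

    def covers(self, table, family, qualifier):
        """A grant covers a request at its own scope or any narrower scope."""
        if self.table != table:
            return False
        if self.family is not None and self.family != family:
            return False
        if self.qualifier is not None and self.qualifier != qualifier:
            return False
        return True


def is_allowed(grants, user, groups, action, table, family=None, qualifier=None):
    """Check whether the user, directly or via a group, may perform the action."""
    principals = {user} | {"@" + group for group in groups}
    return any(
        grant.principal in principals
        and action in grant.permissions
        and grant.covers(table, family, qualifier)
        for grant in grants
    )


# Hypothetical grants: one table-level grant to a group, one column-level grant to a user
grants = [
    Grant("@analysts", frozenset({"READ"}), "clicks"),
    Grant("alice", frozenset({"READ", "WRITE"}), "clicks", "cf", "score"),
]
print(is_allowed(grants, "bob", ["analysts"], "READ", "clicks", "cf", "url"))
print(is_allowed(grants, "alice", [], "WRITE", "clicks", "cf", "url"))
```

A table-level grant covers every column family and column in the table, while a column-level grant covers only that column.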
Region Server Groups
• Dedicated region servers for a set of tables
• Resource Isolation (CPU, Memory, IO, etc.)
RegionServer Group Foo: Region Servers 1...5, hosting TableA, TableB, TableC, TableD, TableE, TableF
RegionServer Group Bar: Region Servers 6…10, hosting Table1, Table2, Table3, Table4, Table5, Table6
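The isolation idea can be sketched as a lookup from table to group, followed by region placement restricted to that group's servers. The group and table names come from the slide; the server names rs1 to rs10, the function names, and the round-robin placement are assumptions for illustration and do not describe HBase's actual balancer.

```python
from itertools import cycle

# Group layout from the slide; server names rs1..rs10 are hypothetical.
RS_GROUPS = {
    "Foo": {
        "servers": [f"rs{i}" for i in range(1, 6)],
        "tables": ["TableA", "TableB", "TableC", "TableD", "TableE", "TableF"],
    },
    "Bar": {
        "servers": [f"rs{i}" for i in range(6, 11)],
        "tables": ["Table1", "Table2", "Table3", "Table4", "Table5", "Table6"],
    },
}


def group_of(table):
    """Return the name of the region server group that owns the table."""
    for name, group in RS_GROUPS.items():
        if table in group["tables"]:
            return name
    raise KeyError(f"{table} is not assigned to a region server group")


def place_regions(table, num_regions):
    """Spread a table's regions round-robin over its own group's servers only."""
    servers = cycle(RS_GROUPS[group_of(table)]["servers"])
    return {f"{table},region-{i}": next(servers) for i in range(num_regions)}


print(place_regions("TableA", 7))
```

Because a table's regions never land outside its group, load from tables in group Foo cannot consume CPU, memory, or IO on the servers of group Bar.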
Namespaces
• Analogous to “Database”
• Namespace ACL to create tables
• Default group
• Quota
  ◦ Tables
  ◦ Regions
[Diagram: a Namespace comprises Group, Tables, Quota, and ACL]
Split Meta to Spread Load and Avoid Large Regions
Favored Nodes for HDFS Locality
Humongous Tables
Scaling HBase to Handle Millions of Regions on a Cluster
• Region Server Groups
• Split Meta
• Split ZK
• Favored Nodes
• Humongous Tables
Transactions on HBase with Omid (1)
• Highly performant and fault-tolerant ACID transactional framework
• New Apache Incubator project: incubator.apache.org/projects/omid.html
• Handles millions of transactions per day for search and personalization products
(1) Omid stands for “Hope” in Persian
Omid Components
Omid Data Model
Oozie Data Pipelines
[Diagram: Data Producer, HDFS, HCatalog, Message Bus, and Oozie]
Data Producer:
• Produce data (distcp, pig, M/R...) into HDFS at /data/click/2014/06/02
• Update metadata in HCatalog:
  ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'
Oozie:
1. Query/Poll Partition (Oozie to HCatalog)
2. Register Topic (Oozie to Message Bus)
3. Push notification <New Partition> (HCatalog to Message Bus)
4. Notify New Partition (Message Bus to Oozie)
Start workflow
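The metadata update that the data producer issues can be derived from the HDFS directory it just wrote. The helper below is a hypothetical illustration that assumes the directory layout /data/<table>/<yyyy>/<mm>/<dd> shown on the slide; the function name and parameters are not part of Oozie or HCatalog.

```python
def add_partition_statement(table, partition_key, hdfs_dir):
    """Build the ALTER TABLE statement for a freshly produced directory.

    Assumes the last three path components are year, month, and day,
    as in the slide's example path.
    """
    parts = hdfs_dir.strip("/").split("/")
    partition_value = "/".join(parts[-3:])
    location = "hdfs://" + hdfs_dir.lstrip("/")
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"({partition_key}='{partition_value}') LOCATION '{location}'"
    )


print(add_partition_statement("click", "data", "/data/click/2014/06/02"))
```

Registering the partition is what triggers the push notification, so a coordinator waiting on that partition can start without repeatedly polling HDFS.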
Large Scale Data Pipeline Requirements
Administrative
• One should be able to start, stop, and pause all related pipelines at the same time
Dependency Management
• The output of coordinator action “n+1” depends on coordinator action “n” (dataset dependency)
• If a dataset has a BCP instance, the workflow should run with either, whichever arrives first
• Start as soon as mandatory data is available; other feeds are optional
• Data is not guaranteed; start processing even if only partial data is available
SLA Management
• Monitor pipeline processing to take immediate action in case of failures or SLA misses
• Pipeline owners should get notified if an SLA is missed
Multiple Providers
• If data is available from multiple providers, it should be possible to specify the provider priority
• Combine datasets from multiple providers to fill the gaps in data a single provider may have
BCP And Mandatory / Optional Feeds
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="10"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Dataset B is optional; Oozie will start processing as soon as A is available. It will include the dataset from A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
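The two input-logic patterns above can be sketched as small resolution functions. This is not Oozie code: the function names and data shapes are hypothetical, and the sketch assumes that `wait` is the number of minutes to hold out for a dataset before falling back to the next one, and that `min` is the smallest number of instances that must be present.

```python
def resolve_or(candidates, available, elapsed_min):
    """Pick the dataset to run with, or None to keep waiting.

    candidates: list of (dataset, wait_minutes) in priority order.
    available: set of dataset names that have arrived.
    elapsed_min: minutes since the action became eligible.
    """
    for dataset, wait_min in candidates:
        if dataset in available:
            return dataset
        if elapsed_min < wait_min:
            return None  # still holding out for this higher-priority dataset
    return None


def resolve_and(requirements, available_counts):
    """Return the instance count to use per dataset, or None if not ready.

    requirements: list of (dataset, needed_instances, min_instances or None).
    A min of 0 makes the dataset optional.
    """
    chosen = {}
    for dataset, needed, min_instances in requirements:
        have = available_counts.get(dataset, 0)
        threshold = needed if min_instances is None else min_instances
        if have < threshold:
            return None
        chosen[dataset] = min(have, needed)
    return chosen


# "AorB": prefer A, fall back to B once A's 10-minute wait has passed
print(resolve_or([("A", 10), ("B", 0)], {"B"}, elapsed_min=5))
print(resolve_or([("A", 10), ("B", 0)], {"B"}, elapsed_min=10))
# "optional": A is mandatory, B contributes whatever is there
print(resolve_and([("A", 1, None), ("B", 1, 0)], {"A": 1}))
```

Listing more candidates in `resolve_or` gives a precedence order, since an earlier dataset always wins when it is available.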
Data Not Guaranteed / Priority Among Dataset Instances
A will have higher precedence over B, and B will have higher precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
Oozie will start processing if the available A instances are >= 10. Min can also be combined with wait (as shown for dataset B).
<input-logic>
  <data-in dataset="A" min="10"/>
  <data-in dataset="B" min="10" wait="20"/>
</input-logic>
Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:latest(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
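The combine behavior described above can be sketched as a per-instance fallback from provider A to provider B. This is an illustration, not Oozie's implementation: the function name is hypothetical, and instances are represented as integer offsets like those passed to `coord:current`.

```python
def combine(instances, available_a, available_b):
    """Resolve each required instance from A first, then from B.

    instances: required instance offsets, e.g. range(-5, 0).
    available_a, available_b: sets of offsets each provider has delivered.
    Returns a list of (provider, instance), or None if some instance is
    missing from both providers.
    """
    resolved = []
    for instance in instances:
        if instance in available_a:
            resolved.append(("A", instance))
        elif instance in available_b:
            resolved.append(("B", instance))
        else:
            return None  # a gap that neither provider can fill yet
    return resolved


# A is missing instances -4 and -2; B fills those gaps
print(combine(range(-5, 0), {-5, -3, -1}, {-5, -4, -3, -2}))
```

If an instance is missing from both providers, the function returns None, which corresponds to the action continuing to wait.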
Automated Onboarding / Collaboration Portal
Built for Tenant Transparency
Queue Utilization Dashboard
Data Discovery and Access
Audits, Compliance, and Efficiency
[Diagram: Starling, the Log Warehouse, collecting from several log sources]
Log Sources:
• FS, Job, Task logs (Cluster 1, Cluster 2, ... Cluster n)
• CF, Region, Action, Query Stats (Cluster 1, Cluster 2, ... Cluster n)
• DB, Tbl., Part., Colmn. Access Stats (MS 1, MS 2, ... MS n)
• GDM: Data Defn., Flow, Feed, Source (F 1, F 2, ... F n)
Audits, Compliance, and Efficiency (cont’d)
Data Discovery and Access
Public
Non-sensitive
Financial $
Governance
Classification
No addn. reqmt.
LMS Integration
Stock Admin Integration
Approval Flow
Restricted
Hosted UI – Hue as a Service
[Diagram: full-stack HA deployment of Hue for a Hadoop cluster]
• Users connect through a VIP; SAML Auth. against an IdP
• Two hot instances, Hue-1.Cluster-1 and Hue-2.Cluster-1, each running under WSGI and serving pages and static content (jQuery, Bootstrap, Knockout.js, Love)
• Hue MySQL DB (HA): cookies, saved queries, workflows, etc.
• REST / Thrift to the Hadoop Cluster: HS2, HCat Meta, Oozie Server, YARN RM, WebHDFS, NMs
Going Forward
• Increased Intelligence
• Greater Speed
• Higher Efficiency
• Necessary Scale
Increased Intelligence
Proven to Work at Scale; Solve Complex Problems
• Applications: Click Prediction, Search Ranking, Keyword Auctions, Ad Relevance, Abuse Detection
• ML Libraries: GBDT, FTRL, SGD, Deep Learning, Random Forests
• Compute Engines: Distributed Processing, Parameter Server (Globally Shared Parameters), …
• Core Grid Enhancements: Heterogeneous Scheduling, Long-running Services, GPUs, Large Memory Support, …
• YARN (Resource Manager)
Greater Speed
Productivity Dimensions and Improvements
• Data Management: Real-time Pipelines, Unified Metadata & Lineage, Fine-grained Access Control, Self-serve Data Movement
• Ease of Use: SLA & Cost Transparency, Intuitive UIs, Planning & Collab. Tools, Central Grid Portal
• Analytics, BI & Reporting: Query times < 1 sec, 4x Speedups in ETL, SQL on HBase, Limitless BI Clients
Higher Efficiency
Achieve five 9’s availability and 70% average compute utilization across clusters
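To make the availability target concrete, the downtime budget implied by a given number of nines can be computed directly. This is a small worked calculation; the function name and the 365-day year are assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget for an availability of `nines` nines (5 means 99.999%)."""
    unavailability = 10 ** -nines
    return period_minutes * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Five nines leaves a little over five minutes of downtime per year, which is why it is usually treated as a goal for the whole stack rather than for any single component.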
Hadoop Users at Yahoo
• Slingstone & Aviate
• Mail Anti-Spam
• Gemini Campaign Mgmt.
• Search Assist
• Audience Analytics
• Flickr
• YAM+ & Targeting
• Membership Abuse
… and many more.
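Yahoo at the Apache Open Source Foundation
[Eight project logos, numbered 1 to 8 on the slide; project names are not recoverable from the extracted text]
• 10 Committers (6 PMC)
• 3 Committers (3 PMC)
• 3 Committers (2 PMC)
• 6 Committers (5 PMC)
• 1 Committer
• 3 Committers (2 PMC)
• 7 Committers (6 PMC)
• 1 Committer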
Join Us @ yahoohadoop.tumblr.com
THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)

Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI Platform
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Hadoop Platform at Yahoo: A Year in Review

  • 1. Hadoop Platform at Yahoo: A Year in Review. Sumeet Singh (@sumeetksingh), Sr. Director, Cloud and Big Data Platforms
  • 2. Agenda: (1) Platform Overview; (2) Infrastructure and Metrics; (3) CaffeOnSpark for Distributed DL; (4) Compute and Sketches; (5) HBase and Omid; (6) Oozie; (7) Ease of Use; (8) Q&A
  • 3. Platform Evolution. [Chart: number of servers (0 to 50,000) and raw HDFS storage in PB (0 to 800) by year, 2006 to 2016.] Milestone labels: Yahoo! commits to scaling Hadoop for production use; research workloads in search and advertising; production (modeling) with machine learning & WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; nextgen Hadoop (H 0.23 YARN); new services (HBase, Storm, Spark, Hive); increased user base with partitioned namespaces; Apache H2.7 (scalable ML, latency, utilization, productivity)
  • 4. Deployment Models. On-premise, private (dedicated) clusters: large demanding use cases; new technology not yet platformized; data movement and regulation issues. On-premise, hosted multi-tenant (private cloud) clusters: source of truth for all of the org's data; app delivery agility; operational efficiency and cost savings through economies of scale. Public cloud, hosted compute clusters: when more cost effective than on-premise; time to market/results matter; data already in public cloud. Public cloud, purpose-built big data clusters: for performance, tighter integration with tech stack; value-added services such as monitoring, alerts, tuning, and common tools
  • 5. Platform Today (Apache / open source projects and Yahoo projects). Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS. Compute: YARN, CS, MR, Tez, Spark, Storm. Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH. Tools: Pig, Hive, Oozie, Hue, GDM, Big ML
  • 6. Technology Stack Assembly (Apache projects and Yahoo projects). Services: ZK, DBMS, MON, SSHOP, LOG, WH, TOOLS. Compute: YARN, CS, MR, Tez, Spark, Storm. Storage / Msg.: HDFS, HBase, HCat, Kafka, CMS, DH. Tools: Pig, Hive, Oozie, Hue, GDM, Big ML. Base layers: HDFS (file system); YARN (scheduling, resource management); Common; RHEL6 64-bit, JDK8. Legend: platformized tech with production support; in-progress, unmet needs, or Apache alignment
  • 7. Common Backplane. [Diagram labels, all on one network backplane:] DataNode, NodeManager, NameNode, RM; DataNodes, RegionServers, NameNode, HBase Master; Nimbus, Supervisor; administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; applications and data; data feeds; data stores; Oozie server; HS2/HCat
  • 8. Research Cluster Consolidation (one-month sample, 2015). [Charts: total and used compute in TB per cluster.] Cluster 1 (2,000 servers): HDFS 12 PB, compute 23 TB, avg. util. 26%. Cluster 2 (3,100 servers): HDFS 21 PB, compute 52 TB, avg. util. 40%. Cluster 3 (5,400 servers): HDFS 36 PB, compute 70 TB, avg. util. 59%
  • 9. Consolidated Research Cluster Characteristics (one-month sample, 2016). [Chart: total and used compute in TB.] Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. util. 70%. Before: 10,500 servers; after: 2,200 servers. 40% decrease in TCO; 65% increase in compute capacity; 50% increase in avg. utilization
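The consolidation percentages can be checked against the per-cluster figures on the two slides. The sketch below is only a sanity check; it assumes the "before" average utilization is weighted by each cluster's compute capacity, which the slides do not state.

```python
# Figures copied from the 2015 per-cluster slide and the 2016 consolidated slide.
before = [
    {"servers": 2000, "compute_tb": 23, "util": 0.26},
    {"servers": 3100, "compute_tb": 52, "util": 0.40},
    {"servers": 5400, "compute_tb": 70, "util": 0.59},
]
after = {"servers": 2200, "compute_tb": 240, "util": 0.70}

servers_before = sum(c["servers"] for c in before)      # total servers before
compute_before = sum(c["compute_tb"] for c in before)   # total compute (TB) before

# Assumption: weight each cluster's utilization by its compute capacity.
util_before = sum(c["compute_tb"] * c["util"] for c in before) / compute_before

compute_gain = after["compute_tb"] / compute_before - 1  # relative capacity increase
util_gain = after["util"] / util_before - 1              # relative utilization increase

print(f"servers: {servers_before} -> {after['servers']}")
print(f"compute capacity: +{compute_gain:.0%}")
print(f"avg. utilization: +{util_gain:.0%}")
```

Under that weighting the results land within a point of the slide's rounded 65% and 50% figures.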
  • 10. Common Hadoop Cluster Configuration: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs and 10GbE, on a network backplane
  • 11. New Hadoop Cluster Configuration: Rack 1, Rack 2, ... Rack N of CPU servers with JBODs and 10GbE, on a network backplane, plus 100 Gbps InfiniBand, GPU servers, and hi-mem servers
  • 12. YARN Node Labels. [Diagram: a Hadoop cluster whose nodes carry label x or label y, with jobs J1 to J4 submitted to three queues.] Queue 1: 40%, label x. Queue 2: 40%, labels x, y. Queue 3: 20%. yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name> sets the labels a queue may access. yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label asked for by the queue
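How the two properties interact can be illustrated with a toy model. This is not YARN code: the queue names (`queue1`, `queue2`, `queue3`), the choice of `y` as the default label for queue 2, and both helper functions are hypothetical, chosen to mirror the three queues on the slide.

```python
# Hypothetical capacity-scheduler properties mirroring the slide's three queues.
conf = {
    "yarn.scheduler.capacity.root.queue1.accessible-node-labels": "x",
    "yarn.scheduler.capacity.root.queue2.accessible-node-labels": "x,y",
    "yarn.scheduler.capacity.root.queue2.default-node-label-expression": "y",
}

PREFIX = "yarn.scheduler.capacity.root."


def accessible_labels(queue):
    """Labels the queue is allowed to request (empty set if none configured)."""
    raw = conf.get(f"{PREFIX}{queue}.accessible-node-labels", "")
    return {label.strip() for label in raw.split(",") if label.strip()}


def resolve_label(queue, requested=None):
    """Return the label a request lands on; '' stands for unlabeled nodes."""
    if requested is None:
        # No label asked for: fall back to the queue's default expression.
        requested = conf.get(f"{PREFIX}{queue}.default-node-label-expression", "")
    if requested and requested not in accessible_labels(queue):
        raise ValueError(f"queue {queue} cannot access label {requested!r}")
    return requested


print(resolve_label("queue1", "x"))  # explicit request for an accessible label
print(resolve_label("queue2"))       # falls back to the queue default
```

A request from `queue1` for label `y` raises an error in this model, which is the isolation the accessible-labels setting is meant to provide.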
  • 14. CaffeOnSpark: Distributed Deep Learning. [Stack diagram:] CaffeOnSpark for DL, MLlib for non-DL, and Hive or SparkSQL on Spark; YARN (RM and scheduling); HDFS (datasets)
  • 15. Few Use Cases: Yahoo Weather
  • 16. Few Use Cases: Flickr Facial Recognition
  • 17. Few Use Cases: Flickr Scene Detection
  • 18. CaffeOnSpark Architecture: Common Cluster. [Diagram:] a Spark driver and three Spark executors (for data feeding and control), each with HDFS datasets, Caffe (enhanced with multi-GPU/CPU), and a model synchronizer (across nodes); MPI on RDMA / TCP; model output on HDFS
  • 19. CaffeOnSpark Architecture: Incremental Learning. Feature engineering (deep learning):
      val cos = new CaffeOnSpark(ctx)
      val conf = new Config(ctx, args).init()
      val dl_train_source = DataSource.getSource(conf, true)
      cos.train(dl_train_source)  // train the DL model
      val lr_raw_source = DataSource.getSource(conf, false)
      val ext_df = cos.features(lr_raw_source)  // extract features via DL
  • 20. CaffeOnSpark Architecture: Incremental Learning. Feature engineering (deep learning), then training classifiers (non-deep learning):
      val cos = new CaffeOnSpark(ctx)
      val conf = new Config(ctx, args).init()
      val dl_train_source = DataSource.getSource(conf, true)
      cos.train(dl_train_source)  // train the DL model
      val lr_raw_source = DataSource.getSource(conf, false)
      val ext_df = cos.features(lr_raw_source)  // extract features via DL
      val lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
        .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
      val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
      val lr_model = lr.fit(lr_input_df)
      …
  • 21. CaffeOnSpark Architecture: Single Command.
      spark-submit --num-executors #Exes --class CaffeOnSpark my-caffe-on-spark.jar -devices #GPUs -model dl_model_file -output lr_model_file
  • 22. CaffeOnSpark (github.com/yahoo/caffeonspark): distributed deep learning; Apache license; existing clusters; powerful DL platform; fully distributed; high-level API; incremental learning
  • 24. Hadoop Compute Sources. [Stack diagram:] HDFS (file system and storage); YARN (resource management and scheduling); Tez (execution engine for Pig and Hive); Spark (alternate exec engine); MapReduce (legacy); Pig (scripting); Hive (SQL); Java MR APIs; data processing; ML; custom app on Slider; Oozie; data management
  • 26. Pushing Batch Compute Boundaries. [Chart: % of total compute (memory-sec) for MapReduce, Tez, and Spark, Q1 2016.] 112 million batch jobs in Q1'16. Data labels: Jan 78%, 8%, 14%; Mar 67%, 21%, 12%
  • 28. Recent Apache Storm Developments at Yahoo: MT & RA scheduler; dist. cache API; 8x throughput; improved debuggability; Pacemaker server; streaming benchmark (github.com/yahoo/streaming-benchmarks)
  • 29. Data Sketches Algorithms Library (datasketches.github.io): good enough approximate answers for problem queries. Characteristics: streamable; approximate with predictable error; sub-linear in size; mergeable / additive; highly parallelizable; Maven deployable
  • 30. Distinct Count Sketch, High-level View. Basic sketch elements: big data stream; transform (white noise); data structure; estimator; result +/- ε
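The sketch elements listed above (transform, data structure, estimator) and the "mergeable" property can be illustrated with a K-Minimum-Values sketch. This is a generic textbook technique written for illustration, not the Data Sketches library's implementation; the class name, the choice of SHA-1 as the transform, and the default `k` are all assumptions.

```python
import hashlib


class KMVSketch:
    """K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # the k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Transform: map each item to a pseudo-random point in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        # Data structure: keep only the k smallest hashes (duplicates collapse).
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        # Mergeable: the k smallest of the union summarize the combined stream.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        # Estimator: exact below k distinct values, (k - 1) / kth-minimum above.
        if len(self.mins) < self.k:
            return float(len(self.mins))
        return (self.k - 1) / max(self.mins)


sketch = KMVSketch(k=1024)
for i in range(5000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")  # repeats do not change the estimate
print(round(sketch.estimate()))
```

The sketch's size is bounded by `k` regardless of stream length, and a larger `k` trades memory for a tighter error bound. Calling `max` on every update keeps the code short; a heap would be the usual choice for throughput.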
  • 31. Data Sketches Algorithms Library (datasketches.github.io)
  • 33. Apache HBase at Yahoo: security; isolated deployment; multi-tenant; region server group; namespace; unsupported features. [Diagram labels:] compute cluster (JobTracker, Namenode, TaskTracker/DataNode nodes running M/R tasks); HBase cluster (Namenode, HBase Master, Zookeeper quorum, RegionServer/DataNode nodes); gateway/launcher (HBase client, MR client); REST proxy; HTTP client
  • 34. Security. Authentication: Kerberos (users, processes); delegation token (MapReduce, YARN, etc.). Authorization: HBase ACLs (Read, Write, Create, Admin); grant permissions to a user or Unix group; ACL for table, column family, or column
  • 35. Region Server Groups: dedicated region servers for a set of tables; resource isolation (CPU, memory, IO, etc.). [Diagram:] RegionServer group Foo (region servers 1...5) hosts TableA to TableF; RegionServer group Bar (region servers 6...10) hosts Table1 to Table6
  • 36. Namespaces: analogous to "database"; namespace ACL to create tables; default group; quota (tables, regions). [Diagram: a namespace with group, tables, quota, and ACL]
  • 37. Split Meta to Spread Load and Avoid Large Regions
  • 38. Favored Nodes for HDFS Locality
  • 40. Scaling HBase to Handle Millions of Regions on a Cluster: region server groups; split meta; split ZK; favored nodes; humongous tables
  • 41. Transactions on HBase with Omid (Omid stands for "Hope" in Persian): highly performant and fault-tolerant ACID transactional framework; new Apache Incubator project (incubator.apache.org/projects/omid.html); handles millions of transactions per day for search and personalization products
  • 45. Oozie Data Pipelines. [Diagram: data producer, HDFS, HCatalog, message bus, Oozie.] The data producer produces data on HDFS (distcp, pig, M/R..) under /data/click/2014/06/02 and updates metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02'). Numbered steps: 1. Query/poll partition; 2. Register topic; 3. Push notification <New Partition>; 4. Notify new partition; then start workflow
  • 46. Large Scale Data Pipeline Requirements. Administrative: one should be able to start, stop, and pause all related pipelines at the same time. Dependency management: output of a coordinator "n+1" action is dependent on the coordinator "n" action (dataset dependency); if a dataset has a BCP instance, the workflow should run with either, whichever arrives first; start as soon as mandatory data is available, other feeds are optional; data is not guaranteed, so start processing even if only partial data is available. SLA management: monitor pipeline processing to take immediate action in case of failures or SLA misses; pipeline owners should get notified if an SLA is missed. Multiple providers: if data is available from multiple providers, it should be possible to specify the provider priority; combine datasets from multiple providers to fill the gaps in data a single provider may have
  • 48. BCP and Mandatory / Optional Feeds. Pull data from A or B: specify the dataset as AorB, and the action will start running as soon as either dataset A or B is available. <input-logic> <or name="AorB"> <data-in dataset="A" wait="10"/> <data-in dataset="B"/> </or> </input-logic> Dataset B is optional: Oozie will start processing as soon as A is available, including the data from A and whatever is available from B. <input-logic> <and name="optional"> <data-in dataset="A"/> <data-in dataset="B" min="0"/> </and> </input-logic> 48
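To make the `or` / `and` semantics on this slide concrete, here is a minimal sketch that evaluates such a tree against a count of available instances per dataset. It is a simplified model, not Oozie code: the `DataIn`, `Or`, and `And` classes are hypothetical, each dataset is assumed to request a single instance by default, and the `wait` attribute is ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class DataIn:
    dataset: str
    min: int = 1  # assumed default: one instance required; 0 makes the feed optional


@dataclass
class Or:
    children: List["Node"]


@dataclass
class And:
    children: List["Node"]


Node = Union[DataIn, Or, And]


def satisfied(node: Node, available: Dict[str, int]) -> bool:
    """Return True if the input-logic tree is met by the available instance counts."""
    if isinstance(node, DataIn):
        return available.get(node.dataset, 0) >= node.min
    results = [satisfied(child, available) for child in node.children]
    return any(results) if isinstance(node, Or) else all(results)
```

With `And([DataIn("A"), DataIn("B", min=0)])`, only A gates the start, which mirrors the "B is optional" example above.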
  • 49. Data Not Guaranteed / Priority Among Dataset Instances. A has higher precedence than B, and B has higher precedence than C. <input-logic> <or name="AorBorC"> <data-in dataset="A"/> <data-in dataset="B"/> <data-in dataset="C"/> </or> </input-logic> Oozie will start processing if the available A instances are >= 10. Min can also be combined with wait (as shown for dataset B). <input-logic> <data-in dataset="A" min="10"/> <data-in dataset="B" min="10" wait="20"/> </input-logic> 49
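The two behaviours on this slide, precedence among datasets and `min` combined with `wait`, can be sketched as two small functions. This is one reading of the rules, not Oozie's implementation: the function names are hypothetical, and `wait` is assumed to be expressed in minutes.

```python
from typing import Dict, List, Optional


def pick_by_priority(order: List[str], available: Dict[str, int]) -> Optional[str]:
    """Return the first dataset, in precedence order, that has at least one instance."""
    for name in order:
        if available.get(name, 0) > 0:
            return name
    return None


def ready(found: int, requested: int, minimum: int,
          elapsed_min: int, wait_min: int) -> bool:
    """Decide whether processing may start under a min/wait rule."""
    if found >= requested:
        return True   # all requested instances have arrived
    if found < minimum:
        return False  # never start below the minimum
    # Minimum met but data incomplete: keep waiting until the wait time elapses.
    return elapsed_min >= wait_min
```

For example, with `min="10"` and `wait="20"`, ten instances out of twelve start the action only once twenty minutes have passed without the remaining two arriving.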
  • 50. Combining Datasets From Multiple Providers. The combine function first checks instances from A, then goes to B for whatever is missing in A. <data-in name="A" dataset="dataset_A"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:latest(-1)}</end-instance> </data-in> <data-in name="B" dataset="dataset_B"> <start-instance>${coord:current(-5)}</start-instance> <end-instance>${coord:current(-1)}</end-instance> </data-in> <input-logic> <combine name="AB"> <data-in dataset="A"/> <data-in dataset="B"/> </combine> </input-logic> 50
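The gap-filling behaviour of `combine` can be illustrated with a short function that resolves each wanted instance against providers in priority order. This is a sketch only: the function name, the instance labels, and the use of `None` for an instance nobody has are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set, Tuple


def combine(instances: List[str],
            providers: List[Tuple[str, Set[str]]]) -> Dict[str, Optional[str]]:
    """Map each wanted instance to the first provider (in priority order) that has it."""
    resolved: Dict[str, Optional[str]] = {}
    for instance in instances:
        resolved[instance] = next(
            (name for name, have in providers if instance in have), None
        )
    return resolved
```

Calling it with provider A listed before provider B takes every instance A has from A and uses B only for the gaps, which is the order the slide describes.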
  • 51. Agenda: 1. Platform Overview; 2. Infrastructure and Metrics; 3. CaffeOnSpark for Distributed DL; 4. Compute and Sketches; 5. HBase and Omid; 6. Oozie; 7. Ease of Use; 8. Q&A 51
  • 52. Automated Onboarding / Collaboration Portal 52
  • 53. Built for Tenant Transparency 53
  • 55. Data Discovery and Access 55
  • 56. Audits, Compliance, and Efficiency. Starling (Log Warehouse) collects from Log Sources: FS, Job, Task logs (Cluster 1, Cluster 2, ... Cluster n); CF, Region, Action, Query Stats (Cluster 1, Cluster 2, ... Cluster n); DB, Tbl., Part., Colmn. Access Stats (MS 1, MS 2, ... MS n); GDM Data Defn., Flow, Feed, Source (F 1, F 2, ... F n) 56
  • 57. Audits, Compliance, and Efficiency (cont’d). Data Discovery and Access. Governance Classification: Public, Non-sensitive, Financial $, Restricted. Requirements shown: No addn. reqmt., LMS Integration, Stock Admin Integration, Approval Flow 57
  • 58. Hosted UI – Hue as a Service. Users reach a VIP in front of WSGI Hue-1.Cluster-1 (hot) and WSGI Hue-2.Cluster-1 (hot), which serve pages and static content (jQuery, Bootstrap, Knockout.js, Love). IdP: SAML Auth. Hue MySQL DB (HA): cookies, saved queries, workflows, etc. REST / Thrift to the Hadoop Cluster: HS2, HCat Meta, Oozie Server, YARN RM, WebHDFS, NMs. Full-stack HA 58
  • 60. Increased Intelligence. ML Libraries: GBDT, FTRL, SGD, Deep Learning, Random Forests. Applications: Click Prediction, Search Ranking, Keyword Auctions, Ad Relevance, Abuse Detection. Proven to Work at Scale; Solve Complex Problems. Core Grid Enhancements: YARN (Resource Manager), Heterogeneous Scheduling, Long-running Services, GPUs, Large Memory Support, … Compute Engines (Distributed Processing), Parameter Server (Globally Shared Parameters), … 60
  • 61. Greater Speed. Productivity Dimensions: Data Management, Ease of Use, Analytics, BI & Reporting. Improvements: Real-time Pipelines, Unified Metadata & Lineage, Fine-grained Access Control, Self-serve Data Movement; SLA & Cost Transparency, Intuitive UIs, Planning & Collab. Tools, Central Grid Portal; Query times < 1 sec, 4x Speedups in ETL, SQL on HBase, Limitless BI Clients 61
  • 62. Higher Efficiency Achieve five 9’s availability and 70% average compute utilization across clusters 62
  • 63. Hadoop Users at Yahoo: Slingstone & Aviate, Mail Anti-Spam, Gemini Campaign Mgmt., Search Assist, Audience Analytics, Flickr, YAM+ & Targeting, Membership Abuse, … and many more. 63
  • 64. Yahoo at the Apache Software Foundation. Counts for eight projects (shown as logos on the slide): 10 Committers (6 PMC); 3 Committers (3 PMC); 3 Committers (2 PMC); 6 Committers (5 PMC); 1 Committer; 3 Committers (2 PMC); 7 Committers (6 PMC); 1 Committer 64
  • 65. Join Us @ yahoohadoop.tumblr.com 65
  • 66. THANK YOU SUMEET SINGH (@sumeetksingh) Sr. Director, Cloud and Big Data Platforms Icon Courtesy – iconfinder.com (under Creative Commons)

Editor's Notes

  1. JIRA 1976 (Oozie 4.3)
  2. While ${coord:latest} allows skipping to the available instances, the workflow will never trigger unless the specified number of instances is found. Min can also be combined with wait: if not all dependencies are met but the min dependencies are, Oozie keeps waiting for more instances until the wait time elapses or all data dependencies are met.
  4. Protocols: REST – use python-requests and a custom client to streamline RESTful interface calls; Thrift – custom connection pooling and socket multiplexing to streamline Thrift calls. Accessibility: Middleware – make Hadoop interfaces accessible in request objects. Hue uses the CherryPy web server. The IP address and port that the web server listens on are configurable; the default is port 8888 on all configured IP addresses. If you don’t specify a secret key, your session cookies will not be secure: Hue will run, but it will display error messages telling you to set the secret key. You can configure Hue to serve over HTTPS; to do so, you must install "pyOpenSSL" within Hue’s context and configure your keys.
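The Hue settings mentioned in the notes (listen address, port, secret key, HTTPS keys) can be sketched as a generated configuration fragment. The snippet below assumes the `[desktop]` section and the key names `http_host`, `http_port`, `secret_key`, `ssl_certificate`, and `ssl_private_key` as they are commonly used in hue.ini; verify them against your Hue version. The helper name and the file paths are hypothetical.

```python
import configparser
import io
import secrets
from typing import Optional


def render_hue_overrides(host: str = "0.0.0.0", port: int = 8888,
                         cert: Optional[str] = None,
                         key: Optional[str] = None) -> str:
    """Render a [desktop] fragment with a fresh random secret key."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["desktop"] = {
        "http_host": host,        # assumed key name: address to listen on
        "http_port": str(port),   # assumed key name: 8888 is the documented default
        "secret_key": secrets.token_urlsafe(48),  # random value for session security
    }
    if cert and key:
        # Assumed key names for serving over HTTPS.
        cfg["desktop"]["ssl_certificate"] = cert
        cfg["desktop"]["ssl_private_key"] = key
    buffer = io.StringIO()
    cfg.write(buffer)
    return buffer.getvalue()
```

Generating the secret key rather than hard-coding it avoids shipping the same value to every deployment.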