This document summarizes lessons learned from monitoring a data pipeline at Hulu. It discusses shortcomings of the initial monitoring approach, both from the users' perspective and in how problems were detected. A new approach is proposed that uses a graph data structure to provide contextual troubleshooting, connecting any issue to its impact on business units and user needs. This makes troubleshooting easier by querying the relationships between components and resources, and small independent services are easier to create and maintain within this approach.
5. HULU’S MISSION
Help people find and enjoy the world’s premium content when, where and how they want it.
6. WHY IS HULU EFFECTIVE?
PREMIUM CONTENT
• Premium Content
• 485+ Content Partners
• 6 of 6 Broadcast Networks
QUALITY AD EXPERIENCE
• Ads can’t be skipped
• Less ad load than TV
• 100% video completion rate guarantee
USER CONTROL
• On Demand
• Across Devices
• Choice-Based Ad Formats
7.
• Service Oriented
• Small teams, specialized scopes
• Build tools for other developers
• Right tool for the job
16. Hulu MapReduce Metrics Jobs
• Definitions of beacons and base-facts: BeaconSpec DSL
• BeaconSpec compiler: JFlex & CUP
• MapReduce code, including metadata lookups: Java (generated)
• Job Scheduler: Scala / Akka
• Documentation: in progress…
• Automated validations for beacon generators: in progress…
19. Data API Service and Reporting Flow
Components: Reporting Portal UI (RP2), Report Controller, Scheduler, HiveRunner, Published DB’s, RP2 DB
Data API Service provides: available columns, date range checks
Flow steps: Submit Report → Execute Report → Check Status → Queue → Run → Generate Query
24. Some Issues…
BIG DATA PIPELINE? I’LL BET THAT’S GOING GREAT FOR YOU
• Email explosions
• Gatekeeping
• Overhead
• Consumption
• Change
25. Lots of Monitoring Tools Available
• Ingest
• Jobs
• Cluster
OpenTSDB & Graphite
26. WHAT’S GOING ON??!??
• How is our cluster? Will we meet our SLAs?
• How fast did a job run?
• How did runtime compare to historical?
• How is this component? How is our system?
27. The Design…
Access all your tools in one place… but avoid multitasking
• Service Oriented Architecture
• Comprehensive Web UI
32. Does this solve our problems?
• Single Point of Access?
• Maintain services separately?
TAKE THAT, DATA PIPELINE ISSUES!!
33. Our Users’ Perspective?
• We detect platform issues
• We quickly troubleshoot errors
• We track relative performance
• We know where we are re: SLAs
…but is detection of a problem enough?
A problem → detection → users
We need to think of things from the report users’ perspectives.
35. Contextual Troubleshooting Model
• Connect issues to business units
• Better impact assessment
• Tune performance per user needs
We need a graph data structure, populated with the stuff we care about. Something like this.
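As a rough sketch of what "populated with the stuff we care about" could mean (every node name below is made up for the example, not Hulu's actual schema), such a graph can start as a simple adjacency map linking pipeline resources to the reports and business units that depend on them:

```python
from collections import defaultdict

# Illustrative dependency graph; node names are hypothetical.
graph = defaultdict(set)

def link(upstream, downstream):
    """Record that `downstream` depends on `upstream`."""
    graph[upstream].add(downstream)

# Populate with the things we care about:
# tables -> jobs -> reports -> business units.
link("hive_table:playback_facts", "job:hourly_minutes_watched")
link("job:hourly_minutes_watched", "report:minutes_watched")
link("report:minutes_watched", "team:content_partnerships")
link("report:minutes_watched", "team:ad_sales")
```

With the edges in place, "which business units does this table feed?" becomes a walk over `graph` rather than a hand-written chain of joins.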
36. Why a Graph?
…instead of an RDBMS?
• Indeterminate number of joins
• Querying for graph connectedness is trivial and short
• Querying for connectedness with SQL relies on knowing the intermediate resources
…instead of a tree?
• Data is sometimes recombinant (e.g. the same metric appears in multiple reports to the same user)
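To illustrate why the connectedness query is trivial on a graph, here is a minimal breadth-first traversal over a toy dependency graph (node names are invented for the example); the equivalent SQL would need one join per intermediate hop, and the number of hops is not known up front:

```python
from collections import deque

# Edges point from a resource to the things that depend on it.
# All names are illustrative.
edges = {
    "hive_table:playback_facts": ["job:minutes_watched"],
    "job:minutes_watched": ["report:engagement", "report:partners"],
    "report:engagement": ["team:ad_sales"],
    "report:partners": ["team:content_partnerships"],
}

def downstream_impact(start):
    """Return every node reachable from `start`, i.e. everything
    a failure in `start` could affect, via breadth-first search."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `downstream_impact("hive_table:playback_facts")` reaches both teams without the caller naming any of the intermediate jobs or reports.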
38. Let’s investigate… These failed before getting to a data store. Most of the Hive failures were for the same table, but it’s a common table. As we filter, the matched reports show up at the bottom of the page; the log link shows us the details.
39. Each service implements a log-fetching interface,
specific to the resources used for a particular report
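A minimal sketch of what such a per-service interface could look like, assuming a Python service (the class and method names are hypothetical, not the deck's actual code):

```python
from abc import ABC, abstractmethod

class LogFetcher(ABC):
    """Hypothetical interface each service implements to fetch
    logs for the resources a particular report used."""

    @abstractmethod
    def fetch_logs(self, report_id: str) -> list[str]:
        """Return log lines relevant to the given report run."""

class HiveLogFetcher(LogFetcher):
    """Example implementation for a Hive-backed service."""

    def __init__(self, log_store: dict[str, list[str]]):
        # Stand-in for a real log backend (e.g. files on HDFS).
        self._store = log_store

    def fetch_logs(self, report_id: str) -> list[str]:
        return self._store.get(report_id, [])
```

Because every service exposes the same interface, the troubleshooting UI can show a "log link" per report without knowing which backend each service uses.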
41. In Summary…
• Find the Important Questions => Measure the Right Data
• Make troubleshooting easy
• Small distinct services are easy to create, maintain, and wire together
42. Questions?
Thanks to…
• Muthu…the Platform GrandMaster
• All of Metrics Platform, Tools, Reporting for making this stuff
• Mohamed, Chris, Charlie, Robert, Phong, AJ, Ratheesh, Adi, Matt, Shashank, Joanne, Siddhartha, Tamir, Jun, James, Dr. Kevin, Hang
• All of the Hulu DEV team for general awesomeness
• Prasan…thanks for the impetus to do this. I’ll look u up
• Kevin…thanks for Hulu. I’ll send u a snap
Editor’s notes
Each of the log collection machines runs Nginx. The Nginx access logs are then processed by Flume, bucketed by beacon type, partitioned by hour, and stored on HDFS.
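A small sketch of what that bucketing could look like as an HDFS path convention; the exact layout is an assumption for illustration, not shown in the deck:

```python
from datetime import datetime

def hdfs_path(beacon_type: str, ts: datetime) -> str:
    """Hypothetical HDFS layout: bucketed by beacon type and
    partitioned by hour, as the note describes."""
    return f"/beacons/{beacon_type}/{ts:%Y/%m/%d/%H}/"

# e.g. hdfs_path("playback", datetime(2014, 6, 1, 13))
# -> "/beacons/playback/2014/06/01/13/"
```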
The majority of our MapReduce jobs:
• Select a set of dimensions that we are concerned about
• Clean up any incomplete/malformed beacons
• Perform some lookups against metadata tables (for example, mapping a video id to a show name)
• Group by the selected dimensions and aggregate on some attribute (for example, the number of minutes watched)
We have about 100 different MR jobs that run every hour; if we hand-wrote each MR job, that would be painful.
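The four steps above can be sketched as a single pass in plain Python (the field names and the metadata table are illustrative, not the real schema):

```python
from collections import defaultdict

# Illustrative metadata lookup table: video id -> show name.
SHOW_LOOKUP = {"v1": "ShowA", "v2": "ShowB"}

def aggregate_minutes(beacons):
    """1) select dimensions, 2) drop malformed beacons,
    3) metadata lookup, 4) group + aggregate minutes watched."""
    totals = defaultdict(float)
    for b in beacons:
        # Step 2: skip incomplete/malformed beacons.
        if "video_id" not in b or "minutes" not in b:
            continue
        # Step 3: metadata lookup (video id -> show name).
        show = SHOW_LOOKUP.get(b["video_id"], "unknown")
        # Steps 1 & 4: group by the selected dimensions, aggregate.
        totals[(show, b.get("device", "unknown"))] += b["minutes"]
    return dict(totals)
```

The real jobs are generated Java MapReduce; this sketch only shows the shape of the logic each generated job repeats.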
The BeaconSpec tool parses a beacon specification file and provides an object model of beacons and base-facts. The tool also supports useful tasks, like generating base-fact scrubber code, harpy data definitions, and validation tests. The MetStat dashboard uses BeaconSpec to automate the creation of processing jobs.
Three basic components of any modern compiler:
- Lexer
- Parser
- Code generator
JFlex and CUP are modeled on Flex and Bison, which are in turn modeled on lex and yacc
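As a toy illustration of those three stages (unrelated to the real BeaconSpec grammar), here is a tiny lexer/parser/code-generator pipeline for a "name: type" declaration language that emits Java-ish fields:

```python
import re

def lex(src):
    """Lexer: turn source text into (kind, text) tokens."""
    tokens = []
    for m in re.finditer(r"(\w+)|(:)", src):
        if m.group(1):
            tokens.append(("IDENT", m.group(1)))
        else:
            tokens.append(("COLON", ":"))
    return tokens

def parse(tokens):
    """Parser: group tokens into (name, type) declarations."""
    decls = []
    for i in range(0, len(tokens), 3):
        name, colon, typ = tokens[i:i + 3]
        assert colon == ("COLON", ":"), "expected ':' between name and type"
        decls.append((name[1], typ[1]))
    return decls

def codegen(decls):
    """Code generator: emit a Java-ish field per declaration."""
    return [f"private {typ} {name};" for name, typ in decls]
```

For example, `codegen(parse(lex("user_id: string")))` yields `['private string user_id;']`; JFlex and CUP play the lexer and parser roles for the actual BeaconSpec compiler, which generates Java MapReduce code.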
“There will always be problems…make it easy to troubleshoot”