This document summarizes lessons learned from monitoring a data pipeline at Hulu. It discusses shortcomings of the initial monitoring approach, both from the users' perspective and in how problems were detected. A new approach is proposed that uses a graph data structure to provide contextual troubleshooting, connecting any issue to its impact on business units and user needs. This makes troubleshooting easier by querying the relationships between components and resources, and small independent services are easier to create and maintain within this approach.
5. HULU’S MISSION
Help people find and enjoy the world’s premium content when, where and how they want it.
6. WHY IS HULU EFFECTIVE?
PREMIUM CONTENT
• Premium Content
• 485+ Content Partners
• 6 of 6 Broadcast Networks
QUALITY AD EXPERIENCE
• Ads can’t be skipped
• Less ad load than TV
• 100% video completion rate guarantee
USER CONTROL
• On Demand
• Across Devices
• Choice-Based Ad Formats
7.
• Service Oriented
• Small teams, specialized scopes
• Build tools for other developers
• Right tool for the job
16. Hulu MapReduce Metrics Jobs
• Definitions of beacons and base-facts: BeaconSpec DSL
• BeaconSpec compiler: JFlex & CUP
• MapReduce code, including metadata lookups: Java (generated)
• Job Scheduler: Scala / Akka
• Documentation: in progress…
• Automated validations for beacon generators: in progress…
19. Data API Service and Reporting Flow
Components: Reporting Portal UI (RP2), Report Controller, Scheduler, HiveRunner, Published DB’s, RP2 DB
Data API Service provides: available columns, date range checks
Flow steps: Submit Report → Execute Report → Check Status → Queue → Run → Generate Query
24. Some Issues…
BIG DATA PIPELINE? I’LL BET THAT’S GOING GREAT FOR YOU
• Email explosions
• Gatekeeping
• Overhead
• Consumption
• Change
25. Lots of Monitoring Tools Available
• Ingest
• Jobs
• Cluster
OpenTSDB & Graphite
26. WHAT’S GOING ON??!??
• How is our cluster? Will we meet our SLAs?
• How fast did a job run?
• How did runtime compare to historical?
• How is this component? How is our system?
27. The Design…
Access all your tools in one place… but avoid multitasking
• Service Oriented Architecture
• Comprehensive Web UI
32. Does this solve our problems?
• Single Point of Access?
• Maintain services separately?
TAKE THAT, DATA PIPELINE ISSUES!!
33. Our Users’ Perspective?
• We detect platform issues
• We quickly troubleshoot errors
• We track relative performance
• We know where we are re: SLAs
…but is detection of a problem enough?
A problem → detection → users
We need to think of things from the report users’ perspectives.
35. Contextual Troubleshooting Model
• Connect issues to business units
• Better impact assessment
• Tune performance per user needs
We need a graph data structure, populated with the stuff we care about. Something like this.
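As a rough sketch of what "populated with the stuff we care about" could mean (every node name below is made up for the example, not Hulu's actual schema), such a graph can start as a simple adjacency map linking pipeline resources to the reports and business units that depend on them:

```python
from collections import defaultdict

# Illustrative dependency graph; node names are hypothetical.
graph = defaultdict(set)

def link(upstream, downstream):
    """Record that `downstream` depends on `upstream`."""
    graph[upstream].add(downstream)

# Populate with the things we care about:
# tables -> jobs -> reports -> business units.
link("hive_table:playback_facts", "job:hourly_minutes_watched")
link("job:hourly_minutes_watched", "report:minutes_watched")
link("report:minutes_watched", "team:content_partnerships")
link("report:minutes_watched", "team:ad_sales")
```

With the edges in place, "which business units does this table feed?" becomes a walk over `graph` rather than a hand-written chain of joins.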
36. Why a Graph?
…instead of an RDBMS?
• Indeterminate number of joins
• Querying for graph connectedness is trivial and short
• Querying for connectedness with SQL relies on knowing the intermediate resources
…instead of a tree?
• Data is sometimes recombinant (e.g. the same metric appears in multiple reports to the same user)
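To illustrate why the connectedness query is trivial on a graph, here is a minimal breadth-first traversal over a toy dependency graph (node names are invented for the example); the equivalent SQL would need one join per intermediate hop, and the number of hops is not known up front:

```python
from collections import deque

# Edges point from a resource to the things that depend on it.
# All names are illustrative.
edges = {
    "hive_table:playback_facts": ["job:minutes_watched"],
    "job:minutes_watched": ["report:engagement", "report:partners"],
    "report:engagement": ["team:ad_sales"],
    "report:partners": ["team:content_partnerships"],
}

def downstream_impact(start):
    """Return every node reachable from `start`, i.e. everything
    a failure in `start` could affect, via breadth-first search."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `downstream_impact("hive_table:playback_facts")` reaches both teams without the caller naming any of the intermediate jobs or reports.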
38. Let’s investigate… These failed before getting to a data store. Most of the Hive failures were for the same table, but it’s a common table. As we filter, the matched reports show up at the bottom of the page; the log link shows us the details.
39. Each service implements a log-fetching interface,
specific to the resources used for a particular report
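A minimal sketch of what such a per-service interface could look like, assuming a Python service (the class and method names are hypothetical, not the deck's actual code):

```python
from abc import ABC, abstractmethod

class LogFetcher(ABC):
    """Hypothetical interface each service implements to fetch
    logs for the resources a particular report used."""

    @abstractmethod
    def fetch_logs(self, report_id: str) -> list[str]:
        """Return log lines relevant to the given report run."""

class HiveLogFetcher(LogFetcher):
    """Example implementation for a Hive-backed service."""

    def __init__(self, log_store: dict[str, list[str]]):
        # Stand-in for a real log backend (e.g. files on HDFS).
        self._store = log_store

    def fetch_logs(self, report_id: str) -> list[str]:
        return self._store.get(report_id, [])
```

Because every service exposes the same interface, the troubleshooting UI can show a "log link" per report without knowing which backend each service uses.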
41. In Summary…
• Find the Important Questions => Measure the Right Data
• Make troubleshooting easy
• Small distinct services are easy to create, maintain, and wire together
42. Questions?
Thanks to…
• Muthu…the Platform GrandMaster
• All of Metrics Platform, Tools, Reporting for making this stuff
• Mohamed, Chris, Charlie, Robert, Phong, AJ, Ratheesh, Adi, Matt, Shashank, Joanne, Siddhartha, Tamir, Jun, James, Dr. Kevin, Hang
• All of the Hulu DEV team for general awesomeness
• Prasan…thanks for the impetus to do this. I’ll look u up
• Kevin…thanks for Hulu. I’ll send u a snap
Editor’s notes
Each of the log collection machines runs Nginx. The Nginx access logs are then processed by Flume, bucketed by beacon type, partitioned by hour, and stored on HDFS.
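A small sketch of what that bucketing could look like as an HDFS path convention; the exact layout is an assumption for illustration, not shown in the deck:

```python
from datetime import datetime

def hdfs_path(beacon_type: str, ts: datetime) -> str:
    """Hypothetical HDFS layout: bucketed by beacon type and
    partitioned by hour, as the note describes."""
    return f"/beacons/{beacon_type}/{ts:%Y/%m/%d/%H}/"

# e.g. hdfs_path("playback", datetime(2014, 6, 1, 13))
# -> "/beacons/playback/2014/06/01/13/"
```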
The majority of our MapReduce jobs:
• Select a set of dimensions that we are concerned about
• Clean up any incomplete/malformed beacons
• Perform some lookups against metadata tables (for example, mapping a video id to a show name)
• Group by the selected dimensions and aggregate on some attribute (for example, the number of minutes watched)
We have about 100 different MR jobs that run every hour; if we hand-wrote each MR job, that would be painful.
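The four steps above can be sketched as a single pass in plain Python (the field names and the metadata table are illustrative, not the real schema):

```python
from collections import defaultdict

# Illustrative metadata lookup table: video id -> show name.
SHOW_LOOKUP = {"v1": "ShowA", "v2": "ShowB"}

def aggregate_minutes(beacons):
    """1) select dimensions, 2) drop malformed beacons,
    3) metadata lookup, 4) group + aggregate minutes watched."""
    totals = defaultdict(float)
    for b in beacons:
        # Step 2: skip incomplete/malformed beacons.
        if "video_id" not in b or "minutes" not in b:
            continue
        # Step 3: metadata lookup (video id -> show name).
        show = SHOW_LOOKUP.get(b["video_id"], "unknown")
        # Steps 1 & 4: group by the selected dimensions, aggregate.
        totals[(show, b.get("device", "unknown"))] += b["minutes"]
    return dict(totals)
```

The real jobs are generated Java MapReduce; this sketch only shows the shape of the logic each generated job repeats.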
The BeaconSpec tool parses a beacon specification file and provides an object model of beacons and base-facts. The tool also supports useful tasks, like generating base-fact scrubber code, harpy data definitions, and validation tests. The MetStat dashboard uses BeaconSpec to automate the creation of processing jobs.
Three basic components of any modern compiler:
- Lexer
- Parser
- Code generator
JFlex and CUP are modeled on Flex and Bison, which are in turn modeled on lex and yacc
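As a toy illustration of those three stages (unrelated to the real BeaconSpec grammar), here is a tiny lexer/parser/code-generator pipeline for a "name: type" declaration language that emits Java-ish fields:

```python
import re

def lex(src):
    """Lexer: turn source text into (kind, text) tokens."""
    tokens = []
    for m in re.finditer(r"(\w+)|(:)", src):
        if m.group(1):
            tokens.append(("IDENT", m.group(1)))
        else:
            tokens.append(("COLON", ":"))
    return tokens

def parse(tokens):
    """Parser: group tokens into (name, type) declarations."""
    decls = []
    for i in range(0, len(tokens), 3):
        name, colon, typ = tokens[i:i + 3]
        assert colon == ("COLON", ":"), "expected ':' between name and type"
        decls.append((name[1], typ[1]))
    return decls

def codegen(decls):
    """Code generator: emit a Java-ish field per declaration."""
    return [f"private {typ} {name};" for name, typ in decls]
```

For example, `codegen(parse(lex("user_id: string")))` yields `['private string user_id;']`; JFlex and CUP play the lexer and parser roles for the actual BeaconSpec compiler, which generates Java MapReduce code.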
“There will always be problems…make it easy to troubleshoot”