Enabling real-time exploration and analytics at scale to drive operational intelligence at Hulu by Indrasis Mondal, Director, Data Engineering and Data Products, Hulu
Data is one of the most powerful assets companies have today and a key driver of innovation, product development, and business efficiency. Operational intelligence allows a modern organization to use that asset in real time, delivering immediate insight into business operations and enabling rapid decision making for strategic advantage. In this presentation we will walk through the operational intelligence capabilities Hulu has built to process tens of millions of events per minute, enabling fast exploration of data and real-time decision making.
❏ Operational Intelligence at Hulu
❏ Buy vs Build
❏ High Level Architecture
❏ Use Cases
❏ Conclusion/Challenges
Agenda
Operational Intelligence
● Operational intelligence (OI) is a category of
real-time, dynamic business analytics that
delivers visibility and insight into data, streaming
events, and business operations.
Business Intelligence
● Business intelligence solutions help
organizations improve their business
performance over time.
What is Operational Intelligence?
Source: https://www.linkedin.com/pulse/operational-reporting-vs-business-intelligence-whats-sean-williams/
The operational intelligence tool at Hulu is known as Glyph. It empowers Hulu to easily draw insights from
real-time and historical event-driven data
● Capabilities
○ Real-time data exploration, analytics, and visualization
○ Interactive query
○ Real-time dashboards
○ Dynamic real-time funnels
○ API Service
● Primary Usages
○ Operational Intelligence and Reporting
○ User interaction
○ Product usage
○ Video quality
○ App and device performance, etc.
Operational Intelligence Capability at Hulu
● Guiding Principles
○ Need for a data visualization tool
○ The tool must be capable of answering event-driven questions
○ Questions about user interaction, app health, or quality of service
○ The tool serves as the serving layer in a lambda architecture
● Key Assumptions
○ Primary stakeholders are Product, Technology, Engineering Operation, and Analysts
○ Ad hoc questions related to events are time-bound
○ Aggregations at other levels are not optimized: risk of slow response
○ The data is available in real-time
○ The data is aggregated, not sampled
○ Result delay is not the same as query delay; it is a data availability question!
○ CAP: Availability and partition tolerance guaranteed
○ CAP: Eventual consistency achieved through batch layer
Glyph Introduction
• ~8TB of device and app data produced each day, with ~4PB available in Hadoop
• ~150K events/second flowing through the pipeline
• ~1.5TB of Druid data generated each day
• ~5-second delta from data emitted by the client to data being queryable
• ~150K Glyph queries per day, resulting in ~450K Druid queries
• 250ms average response time
• 1100ms P95 response time across all queries
• ~50% of Hulu employees use Glyph, with ~10% using it on any given day
• Largest single data source in Druid produces ~0.5TB per day
Glyph by Numbers
● Open-source, column-oriented time-series datastore: http://druid.io/
● Supports streaming data, which is immediately queryable
● Built for lambda-style architectures
● Highly distributed
● Sub-second response times
● Built in time-based tiering for data storage
● All-in-one system split across several roles
○ Historical: Data node. Loads segments determined by the coordinator and makes them available
for querying. Executes queries on the portions of data owned by the given node.
○ Coordinator: Data availability node. Manages the Historical nodes and performs segment balancing
and handoffs.
○ Middle Manager: Indexing node. Spawns many workers (peons) on each host which ingest
streaming data.
○ Overlord: “Middle manager” manager. Distributes indexing jobs across the many available middle
managers.
○ Broker: Query node. Farms user queries out over many Historicals and peons, then aggregates the
per-node results.
Druid at a glance
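To make the query path through these roles concrete, here is a minimal sketch of building a native Druid timeseries query, the JSON document a client POSTs to the Broker's `/druid/v2` endpoint. The data source name, interval, and aggregator column are hypothetical examples, not Hulu's actual schema.

```python
import json

def build_timeseries_query(datasource, start, end, granularity="minute"):
    """Build a native (JSON-over-HTTP) Druid timeseries query.

    A client POSTs this document to the Broker at /druid/v2; the Broker
    farms it out to Historical nodes (and indexing peons for the most
    recent, still-streaming data), then merges the per-node results.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "intervals": [f"{start}/{end}"],
        "aggregations": [
            # 'count' is a hypothetical roll-up counter column
            {"type": "longSum", "name": "events", "fieldName": "count"}
        ],
    }

query = build_timeseries_query(
    "playback_events", "2018-06-01T00:00:00Z", "2018-06-01T01:00:00Z"
)
print(json.dumps(query, indent=2))
```

Because the Broker fans work out across many nodes, the same JSON shape serves both fresh streaming data and historical segments, which is what makes the lambda-style layout transparent to callers.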
● Cluster: ~80 Druid nodes with ~160 TB of total storage
● Map events to data sources
○ High-volume and/or complexity events get their own dedicated data sources
■ Can individually scale and modify the retention to fit the requirements
■ Higher cost as require dedicated indexing capacity
○ Low-volume events get merged into shared data sources
■ Reduces indexing resources used in favor of potentially worse storage size
● Simplify column types for ingestion
○ Dimension: Some column people want to filter/split over
○ Metric: Some column people want to aggregate. Each metric generates every aggregation type,
allowing users to execute any query
○ High-cardinality: Some column which would normally be a dimension, but whose cardinality limits its
value at full fidelity. These columns only support count-distinct queries, as they have been
aggregated away into a sketch representation
Druid - How we use it
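The three simplified column types above can be expanded mechanically into a Druid ingestion metricsSpec. This is an illustrative sketch, not Hulu's actual config: the column names are invented, and Druid's built-in `hyperUnique` aggregator stands in as the sketch type for high-cardinality columns.

```python
def build_metrics_spec(metric_cols, high_cardinality_cols):
    """Expand simplified column types into a Druid metricsSpec.

    Every plain metric column gets every aggregation type at ingestion,
    so any later query (sum, min, max) can be answered from the roll-up.
    High-cardinality columns are aggregated away into a hyperUnique
    sketch, which only supports count-distinct.
    """
    spec = [{"type": "count", "name": "count"}]  # roll-up row counter
    for col in metric_cols:
        for agg in ("doubleSum", "doubleMin", "doubleMax"):
            spec.append({"type": agg, "name": f"{col}_{agg}", "fieldName": col})
    for col in high_cardinality_cols:
        spec.append(
            {"type": "hyperUnique", "name": f"{col}_unique", "fieldName": col}
        )
    return spec

spec = build_metrics_spec(["bitrate_kbps"], ["device_id"])
```

Generating every aggregation per metric trades storage for flexibility: users never have to predict which aggregation they will need at query time.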
● Core problems:
○ Data definitions and requirements constantly change
○ Many data sources, hard to maintain consistency across them all
○ Need flexibility in order to change things like segment size, granularity and schemas as data ages
○ Hard to tell if data is getting dropped, or if there is just no data due to ingestion setup
● What we tried:
○ Ingest data via blacklist
■ PRO: Easiest setup imaginable
■ CON: Easiest database failure mode imaginable
○ Ingest data via whitelist shipped along with ingestion services
■ PRO: Easy to add a new event: just modify two config files
■ CON: Was hard to implement a config schema that didn’t involve complex logic across
multiple projects
■ CON: Each project ended up having config differences due to forgotten / delayed
deployments
● Current: Ingest data via whitelist served via micro-service
○ PRO: Guaranteed consistency across ingestion services
○ PRO: Allowed development of configuration as a service, rather than as an afterthought
○ CON: Introduced a single point of failure: if the config service went down, ingestion would fail
Ingestion Configs
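A minimal sketch of the whitelist-as-a-service approach, with a last-known-good cache that softens (but does not remove) the single-point-of-failure risk noted above. The service URL and event names are hypothetical, and the fetch function is injectable so the sketch can be exercised without a live config service.

```python
import json
from urllib.request import urlopen

class WhitelistClient:
    """Serve the ingestion whitelist from a central config micro-service.

    Keeping the last successfully fetched copy means a brief config-service
    outage degrades to slightly stale config instead of halting ingestion.
    """

    def __init__(self, url, fetch=None):
        self.url = url
        # Injectable fetcher; defaults to a plain HTTP GET returning JSON.
        self._fetch = fetch or (lambda u: json.load(urlopen(u)))
        self._last_good = None

    def whitelist(self):
        try:
            self._last_good = frozenset(self._fetch(self.url))
        except Exception:
            if self._last_good is None:
                raise  # never fetched successfully: nothing to fall back on
        return self._last_good

    def filter_events(self, events):
        """Drop any event whose type is not on the whitelist."""
        allowed = self.whitelist()
        return [e for e in events if e.get("event") in allowed]
```

Because every ingestion service pulls from the same endpoint, the config drift seen with per-project config files disappears; the cache only papers over short outages, not the fundamental dependency.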
● Problems:
○ Druid syntax is decidedly *not* SQL
○ Use case requirements were for very simple querying; Don’t bring a firetruck to a water gun fight
○ Wanted query descriptions to be as simple as possible
■ Not building a generic SQL engine, so able to define a simple data model to describe the queries
■ A simple query description allowed us to easily fit it into a rich UI, and to work with internally
○ Given use cases, common to see the same query issued many times, but we have had some
difficulties with Druid query caching at scale
○ Need to abstract consumers away from our druid implementation choices
○ Sometimes people want joins
● Solutions:
○ Build an API to do simplified data model -> query translation
○ Build in query-aware caching logic
○ Abstract away our implementations during query translations
○ Build in API-side query time lookups to project properties on top of existing data
Glyph API
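A sketch of the two central ideas above: translating a small, fixed data model into a native Druid query, and deriving a query-aware cache key so that repeated logical queries hit the cache regardless of field ordering. The model fields (`source`, `measures`, `split_by`, ...) are invented for illustration, not Glyph's actual schema.

```python
import hashlib
import json

def translate(model):
    """Translate a simplified query model into a native Druid query.

    No generic SQL engine: a handful of model fields cover the supported
    use cases, and the choice of Druid queryType stays an implementation
    detail hidden from the caller.
    """
    query = {
        "queryType": "groupBy" if model.get("split_by") else "timeseries",
        "dataSource": model["source"],
        "granularity": model.get("granularity", "minute"),
        "intervals": ["{}/{}".format(model["start"], model["end"])],
        "aggregations": [
            {"type": "doubleSum", "name": m, "fieldName": m}
            for m in model["measures"]
        ],
    }
    if model.get("split_by"):
        query["dimensions"] = model["split_by"]
    return query

def cache_key(query):
    """Query-aware cache key: identical logical queries always collide,
    because the JSON is canonicalised (sorted keys) before hashing."""
    canonical = json.dumps(query, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Caching at the API layer, keyed on the canonical query, sidesteps the Druid query-cache difficulties mentioned above while keeping consumers entirely unaware of the Druid implementation underneath.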
Over the first half of 2018, 6-10% of Hulu's employees
viewed Glyph dashboards each month. What are these
dashboards used for?
1. Real time monitoring of special events
2. Real time monitoring of feature launch
3. Real time monitoring of device health
Use Cases
Glyph itself is instrumented, so Glyph data usage can be
queried in Glyph.
Real-Time Monitoring of Special Events
Why monitor:
● Keep on top of quality of service issues
● Evaluate effect on sign up
● Measure the concurrent users watching the event
● Measure the percent of users watching the event
● Determine if there are platform-specific issues
How do we monitor:
● Set up dashboard with relevant metrics
● Circulate widely before the event
When to monitor:
● During the event, refer to the dashboard when people have questions related to usage and quality of
service
Real-Time Monitoring of Feature Launch
Why monitor:
● Determine how fast new features are adopted by users
● Determine common usage patterns related to new features
● Determine platform-specific performance
What to monitor
● Fields or events related to the new features
How do we monitor:
● Set up dashboard with relevant metrics
○ Include data from beta testing, if possible
● Share widely among product managers and client teams
When to monitor:
● Set up the dashboard right before the feature launch, and monitor it as data starts to
roll in
Real-Time Monitoring of Device Health
Why monitor:
● Continuously understanding app behavior allows us to detect issues in new app versions
● Compare user behavior and app performance of different app versions
What to monitor:
● Adoption rates of new app versions
● App performance by app version
● Performance comparison week over week and day over day
How do we monitor:
● Set up dashboards for each app client
When to monitor:
● Always, but especially when new app versions are released