People, Platform, Projects: these slides give an overview of how Netflix works with Big Data. I share how our teams are organized, the roles we typically have on those teams, a summary of our Big Data Platform, and two example projects.
31. PROJECTS
Partner Ecosystem Dashboard
〉 This dashboard’s dimensional grain >1B rows
〉 Current solution is
〉 ETL in Pig
〉 Prepare Druid indexed dataset with Hadoop job
〉 Load to 100 historical nodes
〉 Queries are typically <2 seconds
〉 Solution v1 was Tableau + Redshift SSD
〉 Some views rendered in <10 seconds
〉 But multiple views with filters took >1 minute
〉 Future solution is...
33. PROJECTS
Efficient compute - Counting
〉 Easy: playback events, units of time, other discrete events
〉 Hard: uniques, especially over windows or groups
〉 Solution: estimates, i.e. HyperLogLog(++)
〉 Where we’ll implement:
〉 Distribution arrays in staging tables
〉 In Druid data sources
〉 Merge functions in query layer, so UDFs for Pig/Hive/Presto/JS
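To make the counting idea concrete, here is a minimal sketch of a HyperLogLog estimator with a merge function, written in Python. It is an illustration under stated assumptions, not the implementation described in the slides: it is plain HyperLogLog without the bias-correction tables or sparse encoding of the "++" variant, and the class name, the `p=12` register setting, the SHA-1 hash, and the `user-N` identifiers are all hypothetical choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Plain HyperLogLog sketch with 2**p registers (no HLL++ bias tables)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m    # the array a staging table would store

    def add(self, item):
        # Hash the item to 64 bits (hash choice is an assumption of this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Return a new sketch for the union: element-wise max of registers."""
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        """Estimate the number of distinct items added."""
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two hypothetical daily sketches with overlapping users.
day1 = HyperLogLog()
day2 = HyperLogLog()
for user_id in range(0, 50_000):
    day1.add(f"user-{user_id}")
for user_id in range(25_000, 75_000):
    day2.add(f"user-{user_id}")

# Uniques over the two-day window come from merging, not from adding counts.
two_day_uniques = day1.merge(day2).count()   # true distinct count is 75,000
```

Because merging is an element-wise maximum over fixed-size register arrays, the same small function can be reimplemented as a UDF in each query engine, and uniques over any window or group can be derived from stored per-day arrays without rescanning raw events.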
34. PROJECTS
Efficient compute - Measuring time
〉 Easy: average, but not accurate
〉 Hard: percentiles
〉 Solution: estimates, most likely T-digest
〉 Where we’ll implement:
〉 Digests in staging tables
〉 In Druid data sources
〉 Merge functions in query layer, so UDFs for Pig/Hive/Presto/JS
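The gap between "easy" averages and "hard" percentiles can be shown with a small Python sketch. This is not T-digest; it only illustrates, with made-up timing values and hypothetical function names, why a mergeable digest is needed: an average can be merged exactly from per-partition (sum, count) pairs, while per-partition medians cannot be combined into the true median.

```python
import math


def mean_state(values):
    """Partial state for an average: (sum, count). Small and mergeable."""
    return (sum(values), len(values))


def merge_mean_states(states):
    """Combine partial states into the exact overall average."""
    total = sum(s for s, _ in states)
    count = sum(n for _, n in states)
    return total / count


def nearest_rank_percentile(values, q):
    """Exact percentile by the nearest-rank method; needs all raw values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical timings in seconds, split across two partitions.
partition_a = [1.0, 1.2, 1.1, 0.9, 1.0, 1.3, 0.8]
partition_b = [9.0, 12.0, 1.0]
all_values = partition_a + partition_b

# Average: merging partial states gives the same answer as a full scan.
merged_mean = merge_mean_states([mean_state(partition_a), mean_state(partition_b)])

# Median: the true value requires every raw value in one place.
true_median = nearest_rank_percentile(all_values, 0.5)

# Naive shortcut: averaging per-partition medians gives a different number.
naive_median = (
    nearest_rank_percentile(partition_a, 0.5)
    + nearest_rank_percentile(partition_b, 0.5)
) / 2
```

In this made-up data the merged average is pulled well above the typical value by two slow outliers, which is one reading of "easy, but not accurate", and the naive median is far from the true one. A digest structure such as T-digest is meant to fill that gap by storing a compact, mergeable summary of the distribution from which percentiles can be estimated.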