Fast real-time approximations using Spark streaming

By Kevin Schmidt (Head of Data Science at Mind Candy) and Luis Vicente (Senior Data Engineer at Mind Candy)

For mobile games, constant tweaks are the difference between success and failure. Data and analytics have to be available in real time, but calculating, for example, the uniqueness or newness of a data point requires a list of all data points seen so far, which is both memory-intensive and tricky to maintain in a real-time stream processor such as Spark Streaming. Probabilistic data structures approximate these properties with a fixed-size memory representation and are well suited to this kind of stream processing. Getting from the theory of approximation to a useful metric with a low error rate, even for many millions of users, is another story. In our talk we will look at practical ways of achieving this: which approximations we used for a selection of useful metrics, why we picked a specific probabilistic data structure, how we stored it in Cassandra as a time series, and how we implemented it in Spark Streaming.

Fast > Perfect
Practical real-time approximations
using Spark Streaming
Kevin Schmidt
@kevinschmidtbiz
Luis Vicente
@lvicentesanchez

A Bit of Context: Free To Play
Sum Arbitrary Values
Count Uniques It’s Complicated

A Bit of Context: Requirements
• Constant storage space usage, independent of the number of users
• Handle delayed or duplicate data
• Error rate under 3%

Counting Users: Basics
How To Count IDs Uniquely

Counting Users: HyperLogLog
HyperLogLog
addIdentifier(value: String)
merge(other: HyperLogLog): HyperLogLog
zero(): HyperLogLog
countUniques(): Long
12-bit size:
Error Rate = 1.6%
Fixed Size = 4KB
14-bit size:
Error Rate = 0.9%
Fixed Size = 16KB
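
The four operations on this slide can be sketched in plain Scala as follows. This is a minimal illustration, not the speakers' implementation (see the repository linked in the deck for that). The method names come from the slide; the `bits` parameter, the `Vector[Byte]` register layout, the MurmurHash3-based 64-bit hash, and the linear-counting correction for small counts are assumptions made for this example. With one byte per register, 2^12 registers take 4KB and 2^14 take 16KB, and the commonly cited standard error of about 1.04 / sqrt(number of registers) gives roughly 1.6% for 12 bits, which is in line with the figures above.

```scala
import scala.util.hashing.MurmurHash3

// Immutable HyperLogLog sketch. `bits` is the precision (12 or 14 on the slide);
// the sketch keeps 2^bits one-byte registers regardless of how many IDs it has seen.
final class HyperLogLog private (val bits: Int, private val registers: Vector[Byte]) {
  private val m = 1 << bits

  def addIdentifier(value: String): HyperLogLog = {
    val h = HyperLogLog.hash64(value)
    val idx = (h >>> (64 - bits)).toInt   // top `bits` bits select a register
    val rest = h << bits                  // remaining bits, left-aligned
    // Rank = position of the first 1-bit in the remaining bits (capped at their width).
    val rank = (math.min(java.lang.Long.numberOfLeadingZeros(rest), 64 - bits) + 1).toByte
    if (rank > registers(idx)) new HyperLogLog(bits, registers.updated(idx, rank)) else this
  }

  // Register-wise maximum: the result is the sketch of the union of both inputs.
  def merge(other: HyperLogLog): HyperLogLog = {
    require(bits == other.bits, "precision must match")
    val merged = registers.zip(other.registers).map { case (a, b) => if (a >= b) a else b }
    new HyperLogLog(bits, merged)
  }

  def countUniques(): Long = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)   // bias constant, valid for m >= 128
    val raw = alpha * m * m / registers.iterator.map(r => math.pow(2.0, -r.toDouble)).sum
    val zeros = registers.count(_ == 0)
    // Small-range correction (linear counting) while many registers are still empty.
    val estimate = if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    math.round(estimate)
  }
}

object HyperLogLog {
  // Empty sketch; the default of 14 bits is an assumption taken from the result slide.
  def zero(bits: Int = 14): HyperLogLog =
    new HyperLogLog(bits, Vector.fill(1 << bits)(0.toByte))

  // Assumption: two seeded 32-bit MurmurHash3 values combined into one 64-bit hash.
  private def hash64(s: String): Long =
    (MurmurHash3.stringHash(s, 0x1234567).toLong << 32) |
      (MurmurHash3.stringHash(s, 0x7654321).toLong & 0xffffffffL)
}
```

Because adding an identifier only ever raises a register to a maximum, adding the same ID twice leaves the sketch unchanged, and merging the sketches of two batches yields the same registers as sketching both batches together.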

Counting Users: Some Scala
https://github.com/lvicentesanchez/fast-gt-perfect

Counting Users: Result
• Constant storage for one day of data using 14-bit HyperLogLogs: 288 * 16KB = 4608KB
• HyperLogLogs count each user only once, even if data is duplicated or repeated
• Time bucketing ensures delayed data is counted correctly
• Difference of <1% between HyperLogLog estimates and the real count
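
The time bucketing and the storage arithmetic above can be sketched as follows. The slide does not state the bucket width; 5-minute buckets are an assumption inferred from 288 sketches per day (24 * 60 / 5 = 288). The `Event` shape and the `TimeBuckets` helper are hypothetical, and a plain `Set` stands in for the per-bucket HyperLogLog to keep the example short.

```scala
// Hypothetical event shape; the talk does not show its schema.
final case class Event(userId: String, eventTimeMillis: Long)

object TimeBuckets {
  // Assumption: 5-minute buckets, inferred from 288 sketches per day.
  val bucketMillis: Long = 5 * 60 * 1000L
  val bucketsPerDay: Int = (24 * 60 * 60 * 1000L / bucketMillis).toInt

  // Start of the bucket an event belongs to, keyed by event time rather than arrival time,
  // so a delayed event still lands in the bucket where it happened.
  def bucketStart(eventTimeMillis: Long): Long =
    eventTimeMillis - Math.floorMod(eventTimeMillis, bucketMillis)

  // Group user IDs by bucket. A Set stands in for the per-bucket sketch here.
  def bucketize(events: Seq[Event]): Map[Long, Set[String]] =
    events.groupBy(e => bucketStart(e.eventTimeMillis))
      .map { case (start, es) => start -> es.map(_.userId).toSet }

  // Storage for one day at a fixed sketch size, e.g. 16KB per 14-bit HyperLogLog.
  def dailyStorageKb(sketchKb: Double): Double = bucketsPerDay * sketchKb
}
```

Under these assumptions `TimeBuckets.dailyStorageKb(16)` reproduces the 4608KB figure, and an event that arrives late is merged into the sketch of its original bucket instead of the current one.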

Counting Revenue: Basics
How To Sum Arbitrary Values

Counting Revenue: BloomFilter
BloomFilter
addIdentifier(value: String)
merge(other: BloomFilter): BloomFilter
zero(): BloomFilter
contains(): Boolean
Configurable Size:
Capacity = 10k
Error Rate = 1%
Size = 11.7KB
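
A minimal sketch of such a filter, and of using it to sum only sales that have not been counted yet, might look like this. It is an illustration under stated assumptions rather than the speakers' code: the method names follow the slide, but `contains` is given a `value: String` parameter here, and the `java.util.BitSet` storage, the double-hashing scheme, the `Sale` type and the `Revenue.sumNewSales` helper are all hypothetical. The sizing uses the standard Bloom filter formulas, which for a capacity of 10,000 and a 1% error rate give about 95,851 bits, i.e. the 11.7KB shown on the slide.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

object BloomFilter {
  // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
  def optimalBits(capacity: Int, errorRate: Double): Int =
    math.ceil(-capacity * math.log(errorRate) / (math.log(2) * math.log(2))).toInt

  def optimalHashes(capacity: Int, bits: Int): Int =
    math.max(1, math.round(bits.toDouble / capacity * math.log(2)).toInt)

  // Empty filter; the defaults mirror the slide (capacity 10k, error rate 1%).
  def zero(capacity: Int = 10000, errorRate: Double = 0.01): BloomFilter = {
    val bits = optimalBits(capacity, errorRate)
    new BloomFilter(bits, optimalHashes(capacity, bits), new BitSet(bits))
  }
}

final class BloomFilter private (val numBits: Int, val numHashes: Int, private val bitSet: BitSet) {
  // Assumption: double hashing, index_i = h1 + i * h2 (mod numBits).
  private def indexes(value: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(value, 0x1234567)
    val h2 = MurmurHash3.stringHash(value, 0x7654321)
    (0 until numHashes).map(i => Math.floorMod(h1 + i * h2, numBits))
  }

  def addIdentifier(value: String): BloomFilter = {
    val copy = bitSet.clone().asInstanceOf[BitSet]
    indexes(value).foreach(i => copy.set(i))
    new BloomFilter(numBits, numHashes, copy)
  }

  // May return a false positive, never a false negative.
  def contains(value: String): Boolean = indexes(value).forall(i => bitSet.get(i))

  // Bitwise OR: the result answers `contains` for everything added to either filter.
  def merge(other: BloomFilter): BloomFilter = {
    require(numBits == other.numBits && numHashes == other.numHashes, "shape must match")
    val copy = bitSet.clone().asInstanceOf[BitSet]
    copy.or(other.bitSet)
    new BloomFilter(numBits, numHashes, copy)
  }

  def sizeKb: Double = numBits / 8.0 / 1024.0
}

// Hypothetical sale record: a unique ID plus an arbitrary amount.
final case class Sale(id: String, amount: BigDecimal)

object Revenue {
  // Add up only the sales whose ID the filter has not seen; return the updated filter too.
  def sumNewSales(seen: BloomFilter, sales: Seq[Sale]): (BloomFilter, BigDecimal) =
    sales.foldLeft((seen, BigDecimal(0))) { case ((bf, total), sale) =>
      if (bf.contains(sale.id)) (bf, total)
      else (bf.addIdentifier(sale.id), total + sale.amount)
    }
}
```

A false positive makes the filter report a new sale as already counted, so with this approach the approximated revenue can only fall short of the real figure, never exceed it.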

Counting Revenue: Some Scala
https://github.com/lvicentesanchez/fast-gt-perfect

Counting Revenue: Result
• Constant storage for one day of data using a 10k BloomFilter: 288 * 11.7KB = 3370KB
• BloomFilter eliminates sales already counted
• Time bucketing ensures delayed data is counted correctly and keeps BloomFilters small
• Difference of <1% between approximated and real revenue

Trending: StreamSummary
StreamSummary
addIdentifier(value: String)
merge(other: SS): SS
topK(k: Int): Seq[(String, Long)]
Configurable Size:
Capacity = 400
Max Size = 21.9KB
Metwally, Agrawal & Abbadi: Efficient Computation of Frequent
and Top-k Elements in Data Streams (2005)
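
The Space-Saving idea from the cited paper can be sketched in a few lines of Scala. This is a simplified illustration, not the speakers' implementation: it keeps the counters in an immutable `Map` and finds the minimum by scanning, whereas the paper's Stream-Summary structure is designed to do that in constant time. The `merge` shown here (sum the counters, keep the largest `capacity` of them) is also a simplifying assumption and may loosen the error bounds.

```scala
// Space-Saving sketch: never holds more than `capacity` counters.
final case class StreamSummary(capacity: Int, counts: Map[String, Long] = Map.empty) {
  require(capacity > 0, "capacity must be positive")

  def addIdentifier(value: String): StreamSummary =
    if (counts.contains(value)) copy(counts = counts.updated(value, counts(value) + 1))
    else if (counts.size < capacity) copy(counts = counts.updated(value, 1L))
    else {
      // Full: evict the item with the smallest count; the newcomer inherits that count plus one,
      // so counts may overestimate but a frequent item is never undercounted.
      val (minKey, minCount) = counts.minBy(_._2)
      copy(counts = (counts - minKey).updated(value, minCount + 1))
    }

  // Simplified merge (assumption): sum counters per item, then keep the largest `capacity`.
  def merge(other: StreamSummary): StreamSummary = {
    val summed = other.counts.foldLeft(counts) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
    copy(counts = summed.toSeq.sortBy(-_._2).take(capacity).toMap)
  }

  def topK(k: Int): Seq[(String, Long)] = counts.toSeq.sortBy(-_._2).take(k)
}
```

Every call to `addIdentifier` increments a counter, so an event that is delivered twice is counted twice; the structure has no notion of which events it has already seen.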

Trending: Result
• Constant storage for one day of data using a Top400 StreamSummary: 288 * 21.9KB = 6307KB
• StreamSummary will not eliminate duplicates
• Time bucketing ensures delayed data is counted correctly
• Difference of <2% between StreamSummary trending items and the real trending items

Questions?
Kevin Schmidt
@kevinschmidtbiz
Luis Vicente
@lvicentesanchez
https://github.com/lvicentesanchez/fast-gt-perfect

Fast real-time approximations using Spark streaming

1. Fast > Perfect: Practical real-time approximations using Spark Streaming

2. A Bit of Context: Mind Candy

3. A Bit of Context: Free To Play

4. A Bit of Context: Setup

5. A Bit of Context: Requirements

6. Counting Users: Basics

7. Counting Users: HyperLogLog

8. Counting Users: DStream

9. Counting Users: RDD

10. Counting Users: Transform

11. Counting Users: Storing

12. Counting Users: Some Scala

13. Counting Users: Adding Up

14. Counting Users: Performance

15. Counting Users: Result

16. Counting Revenue: Basics

17. Counting Revenue: BloomFilter

18. Counting Revenue: Transform

19. Counting Revenue: Transform

20. Counting Revenue: Storing

21. Counting Revenue: Some Scala

22. Counting Revenue: Adding Up

23. Counting Revenue: Result

24. Trending: Basics - How To Find the Top K

25. Trending: StreamSummary

26. Trending: Transform

27. Trending: Storing

28. Trending: Adding Up

29. Trending: Result

30. Questions?