Counting Big Data
by Streaming Algorithms
2013/10/26 @ Rakuten Technology Conference 2013
Rakuten Institute of Technology, Rakuten, Inc.,
Yusaku Kaneta
http://www.rakuten.co.jp/
Who am I?
• Yusaku Kaneta (@yusakukaneta)
– Joined Rakuten in April 2012.
– Rakuten Institute of Technology (RIT)

• Interests:
– String processing (esp., Pattern matching)
– Hardware design using FPGA
– Bitwise tricks & techniques
• Love TAOCP 7.1.3 & Hacker's Delight
Problem: Count Big Data
• Counting:
– Fundamental operation in data analysis.

• Big data is difficult to count naively
– Because it needs a huge amount of memory.
– E.g., 400GB+ is needed for one-year access logs.
Batch Processing
• Batch processing can solve this.
– E.g.,

• Two issues:
– High latency
– Requirement for a cluster of machines = High cost
Our Goals
1. Reduce memory
– Cost reduction.

2. Reduce latency
– Quick business decisions.

3. Achieve high accuracy
– Correct business decisions.
Our Approach
• Streaming algorithms
– Can fulfill all our goals!
– Are becoming common in Web companies.
• See the paper on Google’s PowerDrill & the code of Twitter’s Algebird for examples of how to use them.

• Keys:
– Limited memory
– Low latency
– Theoretical guarantee for accuracy
Streaming Algorithm Library
• RIT internally provides a C library for streaming algorithms, libsketch.
• Three advantages:
– Memory efficient
– High speed
– High accuracy
• Bindings for Python & Ruby
Why C?
• Our target: Python & Ruby users (for data analysis and for stream processing)!
– But most existing libraries are written in Scala (Algebird), Java (stream-lib), ...

This is one reason why our library is written in C: it is easy to incorporate C libraries into Python & Ruby.
Application
Count Query in Rakuten
• Example: We want to know...
1. How many unique users checked an item in one day (month, or year)?
2. How many products were sold in one day (month, or year)?

• Streaming algorithms for the queries
1. HyperLogLog algorithm
2. Count-Min Sketch algorithm
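
To make the second query concrete, here is a minimal Count-Min Sketch in Python. It is an illustrative sketch only, not the libsketch API: the class name, the `width` and `depth` defaults, and the use of SHA-1 as the row hash are all assumptions chosen for readability. Each key is hashed into one counter per row, and the estimate is the minimum over those counters, so a count can be overestimated by collisions but never underestimated.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counts in a fixed depth x width table (hypothetical example)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One column per row, derived by salting the key with the row number.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.table[row][col] += count

    def query(self, key):
        # Collisions only ever inflate a counter, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._columns(key))


cms = CountMinSketch()
for product, sold in [("item-a", 5), ("item-b", 2)]:
    cms.add(product, sold)
print(cms.query("item-a"))
```

Memory is fixed at `width * depth` counters regardless of how many distinct products appear in the stream.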
Problem: Unique Item Count
• Naïve approach:
– Uses dict in Python: "dict[key] += 1"
– This can require a large amount of memory.

• Streaming algorithm: HyperLogLog
– Counts unique items approximately.
– This needs a fixed amount of memory.
• Google recently proposed an improved version of
HyperLogLog, called HyperLogLog++.
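
For reference, the naïve approach can be sketched as follows. The function name and the sample keys are made up for illustration. The dictionary keeps one entry per distinct key, which is why memory grows with the number of unique items.

```python
from collections import defaultdict


def count_unique_naive(items):
    """Exact unique count: stores every distinct key in memory."""
    counts = defaultdict(int)
    for key in items:
        counts[key] += 1
    return len(counts), counts


n_unique, counts = count_unique_naive(["user-1", "user-2", "user-1"])
print(n_unique, dict(counts))
```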

HyperLogLog
• Basic ideas:

– Hash function
– Harmonic mean
– Stochastic averaging
HyperLogLog
• Algorithm
1. Set i to the upper bits of hash(item).
2. Set A[i] to max(j, A[i]), where j = (# leading 0s in the lower bits) + 1.

Example:
hash(Item1) = 0001 00000110 (upper bits | lower bits)
i = $(0001)_2 = 1$, j = (# leading 0s) + 1 = 6
array A after the update: A[0] = 2, A[1] = 6, ···

3. Estimate # unique items from $E = 1/\sum_i 2^{-A[i]}$.
(In practice, we use heuristics for corrections.)
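
The three steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not libsketch's implementation: the names `split_hash` and `HyperLogLog`, the 64-bit SHA-1 prefix used as the hash, and the default of p = 12 index bits are all choices made here. The scaling constant alpha and the small-range (linear counting) correction follow the original HyperLogLog paper, and stand in for the "heuristics for corrections" mentioned above.

```python
import hashlib
import math


def split_hash(h, total_bits, p):
    """Split a total_bits-wide hash into (i, j) as in steps 1 and 2."""
    low_bits = total_bits - p
    i = h >> low_bits                       # upper p bits: register index
    low = h & ((1 << low_bits) - 1)         # remaining lower bits
    j = low_bits - low.bit_length() + 1     # (# leading 0s) + 1
    return i, j


class HyperLogLog:
    def __init__(self, p=12):
        # Assumption: p >= 7, so that the alpha approximation below applies (m >= 128).
        self.p = p
        self.m = 1 << p                     # number of registers
        self.A = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash value
        i, j = split_hash(h, 64, self.p)
        self.A[i] = max(j, self.A[i])

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        # Step 3, scaled by alpha * m^2 (normalized harmonic mean of 2^A[i]).
        e = alpha * m * m / sum(2.0 ** -a for a in self.A)
        zeros = self.A.count(0)
        if e <= 2.5 * m and zeros > 0:
            e = m * math.log(m / zeros)     # small-range correction: linear counting
        return e


# The 12-bit example from the slide: upper 4 bits 0001, lower 8 bits 00000110.
print(split_hash(0b000100000110, 12, 4))
```

With p = 12 the sketch keeps 4096 small registers however many items are added, and the expected relative error is roughly 1.04 / sqrt(4096), i.e. about 1.6%.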
Demo
• Naïve vs. HyperLogLog

Performance
• Task: Count unique items in an item set.

            Naïve     HyperLogLog   HyperLogLog vs. naïve
Memory      1193MB    5MB           1%
Speed-up    419sec    108sec        4x
Accuracy    100%      99%           -1%

This data set is small, but we are using HyperLogLog for bigger data.
Conclusion
• Streaming algorithms in Rakuten
– We are using them for data analysis.
– We have an internal C library with bindings.
• HyperLogLog, Count-Min Sketch, and so on.

– Future: Plan to implement other algorithms.
Reference
• HyperLogLog & HyperLogLog++
– [Flajolet et al., AOFA 2007], [Heule et al., EDBT 2013]

• Count-Min Sketch
– [Cormode, Muthukrishnan, J. Algorithms, 2005]

• An excellent slide by Alex Smola
– http://alex.smola.org/teaching/berkeley2012/slides/3_Streams.pdf

• AK TECH BLOG by Aggregate Knowledge
– http://blog.aggregateknowledge.com/

• Stream-lib by Clearspring
– https://github.com/clearspring/stream-lib

[RakutenTechConf2013] [D-3_2] Counting Big Data by Streaming Algorithms
