HyperLogLog in Hive - How to count sheep efficiently?
Phillip Capper: Whitecliffs Sheep
@bzamecnik
Agenda
● the problem – count distinct elements
● exact counting
● fast approximate counting – using HLL in Hive
● comparing performance and accuracy
● appendix – a bit of theory of probabilistic counting
○ how does it work?
The problem: count distinct elements
● e.g. the number of unique visitors
● each visitor can make a lot of clicks
● typically grouped in various ways
● "set cardinality estimation" problem
Small data solutions
● sort the data O(N*log(N)) and skip duplicates O(N)
○ O(N) space
● put data into a hash or tree set and iterate
○ hash set: O(N) expected build (O(N^2) worst case), O(N) iteration
○ tree set: O(N*log(N)) build, O(N) iteration
○ both O(N) space
● but: we have big data
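The two small-data approaches above can be sketched in a few lines of Python. This is only an illustration; the function names and the sample `clicks` list are made up for the example.

```python
def count_distinct_sorted(values):
    """Sort the data, then count the positions where the value changes."""
    ordered = sorted(values)      # O(N*log(N)) time, O(N) extra space
    count = 0
    previous = object()           # sentinel that equals no real value
    for v in ordered:             # O(N) scan that skips duplicates
        if v != previous:
            count += 1
            previous = v
    return count


def count_distinct_hashed(values):
    """Put everything into a hash set and take its size."""
    return len(set(values))       # O(N) space for the set


# hypothetical click log: each entry is the visitor who clicked
clicks = ["u1", "u2", "u1", "u3", "u2", "u1"]
print(count_distinct_sorted(clicks), count_distinct_hashed(clicks))
```

Both variants need memory proportional to the number of values, which is the limitation the next slides address.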
Example:
~100M unique values in 5B rows each day
32 bytes per value -> 3 GB unique, 150 GB total
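A quick back-of-the-envelope check of these sizes, assuming that GB on the slide means binary gigabytes (2^30 bytes); in decimal units the same figures come to roughly 3.2 GB and 160 GB.

```python
unique_values = 100_000_000     # ~100M unique values per day
rows = 5_000_000_000            # ~5B rows per day
bytes_per_value = 32

GIB = 2 ** 30                   # assumption: binary gigabytes
unique_gib = unique_values * bytes_per_value / GIB
total_gib = rows * bytes_per_value / GIB
print(f"unique: {unique_gib:.1f} GiB, total: {total_gib:.1f} GiB")
```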
Problems with counting big data
● data is partitioned
○ across many machines
○ in time
● we can't sum cardinality of each partition
○ since the subsets are generally not disjoint
○ we would overestimate
count(part1) + count(part2) >= count(part1 ∪ part2)
● we need to merge estimators and then estimate cardinality
count(estimator(part1) ∪ estimator(part2))
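A tiny illustration of why per-partition counts cannot simply be summed. Exact Python sets stand in for the estimators here, and the partition contents are invented for the example.

```python
# visitors seen in two partitions (e.g. two days); u2 and u3 appear in both
part1 = {"u1", "u2", "u3"}
part2 = {"u2", "u3", "u4"}

sum_of_counts = len(part1) + len(part2)   # counts the overlap twice
merged_count = len(part1 | part2)         # merge first, then count

print(sum_of_counts, merged_count)
```

The sum is only correct when the partitions are disjoint; merging before counting is always correct.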
Exact counting in Hive
SELECT COUNT(DISTINCT user_id)
FROM events;
single reducer!
Exact counting in Hive – subquery
SELECT COUNT(*) FROM (
  SELECT 1 FROM events
  GROUP BY user_id
) unique_guids;
Or more concisely:
SELECT COUNT(*) FROM (
  SELECT DISTINCT user_id
  FROM events
) unique_guids;
many reducers
two phases
cannot combine more aggregations
Exact counting in Hive
● hive.optimize.distinct.rewrite
○ allows rewriting COUNT(DISTINCT) into a subquery
○ since Hive 1.2.0
Probabilistic counting
● fast results, but approximate
● practical example of using HLL in Hive
● more theory in the appendix
Implementations of HLL as Hive UDFs
● klout/brickhouse
○ single option
○ no JAR, some tests
○ based on HLL++ from stream-lib (quite fast)
● jdmaturen/hive-hll
○ no options (they are in API, but not implemented!)
○ no JAR, no tests
○ compatible with java-hll, pg-hll, js-hll
● t3rmin4t0r/hive-hll-udf
○ no options, no JAR, no tests
UDFs in Hive
● User-Defined Functions
● function registered from a class (loaded from JAR)
● JAR needs to be on HDFS (otherwise it fails)
● you can choose the UDF name at will
● works in both HiveServer2/Beeline and Hive CLI
ADD JAR hdfs:///path/to/the/library.jar;
CREATE TEMPORARY FUNCTION foo_func
  AS 'com.example.foo.FooUDF';
● Usage:
SELECT foo_func(...) FROM ...;
General UDFs API for HLL
● to_hll(value)
○ aggregate values to HLL
○ UDAF (aggregation function)
○ + hash each value
○ can optionally be configured (e.g. for precision)
● union_hlls(hll)
○ union multiple HLLs
○ UDAF
● hll_approx_count(hll)
○ estimate cardinality from an HLL
○ UDF
HLL can be stored as binary or string type.
Example usage
● Estimate of total unique visitors:
SELECT hll_approx_count(to_hll(user_id))
FROM events;
● Estimate of total events + unique visitors at once:
SELECT
  count(*) AS total_events,
  hll_approx_count(to_hll(user_id)) AS unique_visitors
FROM events;
Example usage
● Compute each daily estimator once:
CREATE TABLE daily_user_hll AS
SELECT date, to_hll(user_id) AS users_hll
FROM events
GROUP BY date;
● Then quickly aggregate and estimate:
SELECT hll_approx_count(union_hlls(users_hll))
AS user_count
FROM daily_user_hll
WHERE date BETWEEN '2015-01-01' AND '2015-01-31';
Brickhouse – installation
https://github.com/klout/brickhouse - Hive UDF
https://github.com/addthis/stream-lib - HLL++
$ git clone https://github.com/klout/brickhouse
$ cd brickhouse
# disable maven-javadoc-plugin in pom.xml (since it fails)
$ mvn package
$ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar
$ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host:
cluster-host$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs
Brickhouse – usage
ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar;
ADD JAR /user/zamecnik/lib/stream-2.3.0.jar;
CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll.HyperLogLogUDAF';
CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse.udf.hll.EstimateCardinalityUDF';
to_hll(value, [bit_precision])
● bit_precision: 4 to 16 (default 6)
Hive-hll usage
ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar;
CREATE TEMPORARY FUNCTION hll_hash AS 'com.kresilas.hll.HashUDF';
CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll.AddAggUDAF';
CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll.UnionAggUDAF';
CREATE TEMPORARY FUNCTION hll_approx_count AS 'com.kresilas.hll.CardinalityUDF';
We have to explicitly hash the value:
SELECT
hll_approx_count(to_hll(hll_hash(user_id)))
FROM events;
Hive-hll usage
Options for creating HLL:
to_hll(x, [log2m, regwidth, expthresh, sparseon])
hardcoded to:
[log2m=11, regwidth=5, expthresh=-1, sparseon=true]
Nice things
● HLLs are additive
○ can be computed once
○ various partitions can be merged and estimated for cardinality later
● we can count multiple unique columns at once
○ no need to subquery
○ we can do wild grouping (by country, browser, …)
● HLLs take very little space
Rolling window
-- keep a reasonable number of tasks for a month of data
SET mapreduce.input.fileinputformat.split.maxsize=5368709120;
-- keep low number of output files (HLLs are quite small)
SET hive.merge.mapredfiles=true;
-- maximum precision
SET hivevar:hll_precision=16;
-- HLL for each day
CREATE TABLE guids_parquet_hll AS
SELECT
'${year}' AS year,
'${month}' AS month,
day,
to_hll(guid, ${hll_precision}) AS guid_hll
FROM parquet.dump_${year}_${month}
GROUP BY day;
Rolling window
-- for each day, estimate the number of guids over the last 7 days
CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count
AS
SELECT
  `date`,
  hll_approx_count(guids_union) AS guid_count
FROM (
  SELECT
    concat(`year`, '-', `month`, '-', `day`) AS `date`,
    union_hlls(guid_hll) OVER w AS guids_union
  FROM guids_parquet_hll
  WINDOW w AS (
    ORDER BY `year`, `month`, `day` ROWS 6 PRECEDING
  )
) rolling_guids;
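What the window query computes can be mimicked in plain Python. In this sketch exact sets stand in for the daily HLLs (both are merged by union), and `rolling_union_counts` and the sample data are hypothetical names made up for the illustration.

```python
def rolling_union_counts(daily_sets, preceding=6):
    """For each day, count distinct items in that day plus `preceding` days before it."""
    counts = []
    for i in range(len(daily_sets)):
        # same frame as ROWS 6 PRECEDING: up to 6 earlier rows plus the current row
        window = daily_sets[max(0, i - preceding): i + 1]
        counts.append(len(set().union(*window)))
    return counts


# four days of visitors, with a 2-day window to keep the example small
daily = [{1, 2}, {2, 3}, {4}, {1}]
print(rolling_union_counts(daily, preceding=1))
```

Because each daily sketch is computed once and only merged afterwards, the window can be changed without touching the raw data again.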
Pitfalls
● when JARs are not on HDFS the query fails (why?)
● computing on many days of raw clickstream fails in Beeline (works in Hive CLI), parquet is ok
● HIVE-9073 WINDOW + custom UDAF → NPE
○ fixed in Hive 1.2.0
● DISTRO-631
Approximation error
● Typically < 1-2 %
● Can be controlled by the parameters
● Example: 1 year of guids
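As a rough guide to how the precision parameter controls the error: the original HyperLogLog analysis (Flajolet et al.) gives a standard error of about 1.04 / sqrt(m) for m = 2^p registers. The sketch below just evaluates that formula; the UDF libraries above may use refinements (such as HLL++) whose actual error differs somewhat, and `hll_relative_error` is a name made up for the example.

```python
import math


def hll_relative_error(precision_bits):
    """Theoretical standard error of plain HyperLogLog with 2**precision_bits registers."""
    m = 2 ** precision_bits
    return 1.04 / math.sqrt(m)


for p in (6, 11, 16):
    print(p, f"{hll_relative_error(p):.2%}")
```

Each extra bit of precision doubles the number of registers and shrinks the error by a factor of sqrt(2).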
Appendix – more interesting things
Probabilistic counting
● trade-off: some approximation error for far better performance and memory consumption
● sketch - streaming & probabilistic algorithm
● KMV - k minimal values
● linear counter
● loglog counter
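To make the KMV idea concrete, here is a minimal sketch under a few assumptions: values are hashed to pseudo-uniform numbers in [0, 1) with SHA-1 (chosen only because it is in the standard library), the k smallest hashes are kept, and the cardinality is estimated as (k - 1) divided by the k-th smallest hash. The function names are hypothetical.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def kmv_estimate(values, k=256):
    heap = []         # max-heap (negated hashes) holding the k smallest hashes
    members = set()   # hashes currently in the heap, to ignore duplicates
    for v in values:
        h = hash01(v)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heapreplace(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)           # fewer than k distinct values: exact count
    return (k - 1) / -heap[0]      # -heap[0] is the k-th smallest hash


print(round(kmv_estimate(range(100_000), k=1024)))
```

Memory stays bounded by k regardless of the stream length, and duplicates never change the result because they hash to the same number.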
LogLog counter
● run length of initial zeros
● multiple estimators (registers)
● stochastic averaging
○ single hash function
○ multiple buckets
● hash → (register index, run length)
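The mapping hash → (register index, run length) can be sketched as follows. The assumptions are a 64-bit hash, the top `b` bits used as the register index, and the "rank" defined as the number of leading zeros in the remaining bits plus one; `split_hash` and `build_registers` are names made up for this example, and SHA-1 is used only because it ships with Python.

```python
import hashlib


def hash64(value):
    """64-bit hash of a value (first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def split_hash(h, b=4):
    """Split a 64-bit hash into (register index, rank)."""
    width = 64 - b
    index = h >> width                      # top b bits select the register
    rest = h & ((1 << width) - 1)           # remaining bits
    rank = width - rest.bit_length() + 1    # leading zeros in `rest`, plus one
    return index, rank


def build_registers(values, b=4):
    """Stochastic averaging: one hash function, 2**b registers keeping the max rank."""
    registers = [0] * (1 << b)
    for v in values:
        index, rank = split_hash(hash64(v), b)
        registers[index] = max(registers[index], rank)
    return registers


print(build_registers(range(1000)))
```

A long run of zeros is rare, so a large register value hints that many distinct values were hashed into that bucket.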
Linear counter
import math
import mmh3 # MurmurHash3 bindings
from bitarray import bitarray

m = 20 # size of the register
register = bitarray(m) # register, m bits
register.setall(0) # make sure all bits start at zero

def add(value):
    h = mmh3.hash(value) % m # select bit index
    register[h] = 1 # = max(1, register[h])

def cardinality():
    u_n = register.count(0) # number of zeros
    v_n = u_n / m # relative number of zeros (must be > 0)
    n_hat = -m * math.log(v_n) # estimate of the set cardinality
    return n_hat
HyperLogLog (HLL)
● structure like loglog counter
● harmonic mean to combine registers
● correction for small and large cardinalities
● values need to be hashed well – murmur3
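Putting the bullets together, a compact HyperLogLog might look like the sketch below. It follows the estimator from the original HyperLogLog paper (32-bit hash, harmonic mean, small- and large-range corrections), not the HLL++ variant used by stream-lib, and it substitutes SHA-1 for murmur3 so that it runs with the standard library only. The class and its parameter `b` are illustrative, not any library's API.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b=10):
        if b < 7:
            raise ValueError("this sketch uses the alpha constant valid for b >= 7")
        self.b = b
        self.m = 1 << b                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")          # 32-bit hash
        width = 32 - self.b
        index = h >> width                             # register index
        rest = h & ((1 << width) - 1)
        rank = width - rest.bit_length() + 1           # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias correction for m >= 128
        # harmonic mean of 2**register across all registers
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)             # small range: linear counting
        if raw > 2 ** 32 / 30:
            return -(2 ** 32) * math.log(1 - raw / 2 ** 32)   # large range
        return raw


hll = HyperLogLog(b=10)
for i in range(50_000):
    hll.add(f"user-{i % 20_000}")      # 50k clicks from 20k distinct visitors
print(round(hll.cardinality()))
```

With b=10 the sketch holds 1024 small registers, however many clicks are added.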
HLL union
● just take max of each register value
● no loss – same result as HLL of union of streams
● parallelizable
● union preserves error bound, intersection/diff do not
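The "no loss" property can be checked directly: taking the per-register maximum of two sketches gives exactly the registers of a sketch built over the combined stream. The snippet is a self-contained illustration with a hypothetical helper `registers_for` (32-bit SHA-1 based hash, 2^b registers); the overlapping ranges stand in for two days of visitors.

```python
import hashlib


def registers_for(values, b=8):
    """Build HLL-style registers (max rank per bucket) for a stream of values."""
    width = 32 - b
    regs = [0] * (1 << b)
    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")
        index = h >> width
        rest = h & ((1 << width) - 1)
        regs[index] = max(regs[index], width - rest.bit_length() + 1)
    return regs


def union(regs_a, regs_b):
    """Merge two sketches of the same size: max of each register."""
    return [max(a, b) for a, b in zip(regs_a, regs_b)]


day1 = range(0, 6000)
day2 = range(4000, 10000)      # overlaps with day1
merged = union(registers_for(day1), registers_for(day2))
print(merged == registers_for(range(0, 10000)))
```

Since max is associative and commutative, partial sketches can be merged in any order, which is what makes the computation parallelizable.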
Further reading
● very nice explanation of HLL
● Probabilistic Data Structures For Web Analytics And Data Mining
● Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure
● HyperLogLog in Pure SQL
● Use Subqueries to Count Distinct 50X Faster
● It is possible to combine HLL of different sizes
Papers
● HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
● https://github.com/addthis/stream-lib#cardinality
Other problems & structures
● set membership – bloom filter
● top-k elements – count-min-sketch, stream-summary

 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 

HyperLogLog in Hive - How to count sheep efficiently?

  • 1. HyperLogLog in Hive How to count sheep efficiently? Phillip Capper: Whitecliffs Sheep @bzamecnik
  • 2. Agenda ● the problem – count distinct elements ● exact counting ● fast approximate counting – using HLL in Hive ● comparing performance and accuracy ● appendix – a bit of theory of probabilistic counting ○ how it works?
  • 3. The problem: count distinct elements ● eg. the number of unique visitors ● each visitor can make a lot of clicks ● typically grouped in various ways ● "set cardinality estimation" problem
  • 4. Small data solutions ● sort the data O(N*log(N)) and skip duplicates O(N) ○ O(N) space ● put data into a hash or tree set and iterate ○ hash set: O(N^2) worst case build, O(N) iteration ○ tree set: O(N*log(N)) build, O(N) iteration ○ both O(N) space ● but: we have big data Example: ~100M unique values in 5B rows each day 32 bytes per value -> 3 GB unique, 150 GB total
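Both small-data approaches from this slide fit in a few lines. This is a minimal sketch for in-memory data only; the function names and the sample click list are made up for illustration.

```python
def count_distinct_sorted(values):
    """Sort (O(N log N)), then count positions where the value changes (O(N))."""
    ordered = sorted(values)
    return sum(1 for i, v in enumerate(ordered) if i == 0 or v != ordered[i - 1])


def count_distinct_hashed(values):
    """Put everything into a hash set; its size is the distinct count."""
    return len(set(values))


# Hypothetical click log: three visitors, six clicks.
clicks = ["a", "b", "a", "c", "b", "a"]
print(count_distinct_sorted(clicks), count_distinct_hashed(clicks))
```

Both variants need O(N) memory, which is what stops them from scaling to the data volumes in the example above.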
  • 5. Problems with counting big data ● data is partitioned ○ across many machines ○ in time ● we can't sum the cardinalities of the partitions ○ since the subsets are generally not disjoint ○ we would overestimate: count(part1) + count(part2) >= count(part1 ∪ part2) ● we need to merge the estimators first and then estimate the cardinality: count(estimator(part1) ∪ estimator(part2))
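The overestimate is easy to see on toy data. In the sketch below, plain Python sets stand in for the per-partition estimators, and the user names are invented.

```python
# Hypothetical toy data: user IDs seen in two partitions (e.g. two days).
part1 = {"alice", "bob", "carol"}
part2 = {"bob", "carol", "dave"}

# Summing per-partition distinct counts double-counts returning users.
sum_of_counts = len(part1) + len(part2)

# Merging first and counting afterwards gives the true cardinality.
count_of_union = len(part1 | part2)

print(sum_of_counts, count_of_union)
```

The two numbers agree only when the partitions are disjoint, which is rarely the case for returning visitors.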
  • 6. Exact counting in Hive
        SELECT COUNT(DISTINCT user_id)
        FROM events;
      single reducer!
  • 7. Exact counting in Hive – subquery
        SELECT COUNT(*) FROM (
          SELECT 1 FROM events
          GROUP BY user_id
        ) unique_guids;
      Or more concisely:
        SELECT COUNT(*) FROM (
          SELECT DISTINCT user_id
          FROM events
        ) unique_guids;
      ● many reducers ● two phases ● cannot combine more aggregations
  • 8. Exact counting in Hive ● hive.optimize.distinct.rewrite ○ allows rewriting COUNT(DISTINCT) into a subquery ○ since Hive 1.2.0
  • 9.
  • 10. Probabilistic counting ● fast results, but approximate ● practical example of using HLL in Hive ● more theory in the appendix
  • 11. Implementations of HLL as Hive UDFs ● klout/brickhouse ○ single option ○ no JAR, some tests ○ based on HLL++ from stream-lib (quite fast) ● jdmaturen/hive-hll ○ no options (they are in the API, but not implemented!) ○ no JAR, no tests ○ compatible with java-hll, pg-hll, js-hll ● t3rmin4t0r/hive-hll-udf ○ no options, no JAR, no tests
  • 12. UDFs in Hive ● User-Defined Functions ● function registered from a class (loaded from a JAR) ● the JAR needs to be on HDFS (otherwise it fails) ● you can choose the UDF name at will ● works both in HiveServer2/Beeline and Hive CLI
        ADD JAR hdfs:///path/to/the/library.jar;
        CREATE TEMPORARY FUNCTION foo_func AS 'com.example.foo.FooUDF';
      ● Usage:
        SELECT foo_func(...) FROM ...;
  • 13. General UDFs API for HLL ● to_hll(value) ○ aggregates values into an HLL ○ UDAF (aggregation function) ○ also hashes each value ○ can optionally be configured (e.g. for precision) ● union_hlls(hll) ○ unions multiple HLLs ○ UDAF ● hll_approx_count(hll) ○ estimates cardinality from an HLL ○ UDF ● An HLL can be stored as a binary or string type.
  • 14. Example usage
      ● Estimate of total unique visitors:
        SELECT hll_approx_count(to_hll(user_id))
        FROM events;
      ● Estimate of total events + unique visitors at once:
        SELECT
          count(*) AS total_events,
          hll_approx_count(to_hll(user_id)) AS unique_visitors
        FROM events;
  • 15. Example usage
      ● Compute each daily estimator once:
        CREATE TABLE daily_user_hll AS
        SELECT date, to_hll(user_id) AS users_hll
        FROM events
        GROUP BY date;
      ● Then quickly aggregate and estimate:
        SELECT hll_approx_count(union_hlls(users_hll)) AS user_count
        FROM daily_user_hll
        WHERE date BETWEEN '2015-01-01' AND '2015-01-31';
  • 16. Brickhouse – installation
      https://github.com/klout/brickhouse - Hive UDF
      https://github.com/addthis/stream-lib - HLL++
        $ git clone https://github.com/klout/brickhouse
        $ cd brickhouse
        # disable maven-javadoc-plugin in pom.xml (since it fails)
        $ mvn package
        $ wget http://central.maven.org/maven2/com/clearspring/analytics/stream/2.3.0/stream-2.3.0.jar
        $ scp target/brickhouse-0.7.1-SNAPSHOT.jar stream-2.3.0.jar cluster-host:
        cluster-host$ hdfs dfs -copyFromLocal *.jar /user/me/hive-libs
  • 17. Brickhouse – usage
        ADD JAR /user/zamecnik/lib/brickhouse-0.7.1-15f5e8e.jar;
        ADD JAR /user/zamecnik/lib/stream-2.3.0.jar;
        CREATE TEMPORARY FUNCTION to_hll AS 'brickhouse.udf.hll.HyperLogLogUDAF';
        CREATE TEMPORARY FUNCTION union_hlls AS 'brickhouse.udf.hll.UnionHyperLogLogUDAF';
        CREATE TEMPORARY FUNCTION hll_approx_count AS 'brickhouse.udf.hll.EstimateCardinalityUDF';
      to_hll(value, [bit_precision]) ● bit_precision: 4 to 16 (default 6)
  • 18. Hive-hll usage
        ADD JAR /user/zamecnik/lib/hive-hll-0.1-2807db.jar;
        CREATE TEMPORARY FUNCTION hll_hash AS 'com.kresilas.hll.HashUDF';
        CREATE TEMPORARY FUNCTION to_hll AS 'com.kresilas.hll.AddAggUDAF';
        CREATE TEMPORARY FUNCTION union_hlls AS 'com.kresilas.hll.UnionAggUDAF';
        CREATE TEMPORARY FUNCTION hll_approx_count AS 'com.kresilas.hll.CardinalityUDF';
  • 19. Hive-hll usage
      We have to explicitly hash the value:
        SELECT hll_approx_count(to_hll(hll_hash(user_id)))
        FROM events;
      Options for creating an HLL: to_hll(x, [log2m, regwidth, expthresh, sparseon])
      hardcoded to: [log2m=11, regwidth=5, expthresh=-1, sparseon=true]
  • 20. Nice things ● HLLs are additive ○ can be computed once ○ various partitions can be merged and their cardinality estimated later ● we can count multiple unique columns at once ○ no need for a subquery ○ we can do wild grouping (by country, browser, …) ● HLLs take very little space
  • 21. Rolling window
        -- keep a reasonable number of tasks for a month of data
        SET mapreduce.input.fileinputformat.split.maxsize=5368709120;
        -- keep a low number of output files (HLLs are quite small)
        SET hive.merge.mapredfiles=true;
        -- maximum precision
        SET hivevar:hll_precision=16;
        -- HLL for each day
        CREATE TABLE guids_parquet_hll AS
        SELECT
          '${year}' AS year,
          '${month}' AS month,
          day,
          to_hll(guid, ${hll_precision}) AS guid_hll
        FROM parquet.dump_${year}_${month}
        GROUP BY day;
  • 22. Rolling window
        -- for each day, estimate the number of guids over the last 7 days
        CREATE TABLE zamecnik.guids_parquet_rolling_30_day_count AS
        SELECT
          `date`,
          hll_approx_count(guids_union) AS guid_count
        FROM (
          SELECT
            concat(`year`, '-', `month`, '-', `day`) AS `date`,
            union_hlls(guid_hll) OVER w AS guids_union
          FROM guids_parquet_hll
          WINDOW w AS (
            ORDER BY `year`, `month`, `day`
            ROWS 6 PRECEDING
          )
        ) rolling_guids;
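The windowed union can be mimicked outside Hive to see what `ROWS 6 PRECEDING` computes. In this sketch exact Python sets stand in for the daily HLLs (an assumption for illustration only, since sets are exact and HLLs are not), and `rolling_distinct` is a hypothetical helper, not part of any of the libraries above.

```python
def rolling_distinct(daily_sets, window=7):
    """For each day, count distinct IDs over that day and the window-1 days before it."""
    counts = []
    for i in range(len(daily_sets)):
        start = max(0, i - (window - 1))          # same frame as ROWS (window-1) PRECEDING
        merged = set().union(*daily_sets[start:i + 1])
        counts.append(len(merged))
    return counts


# Hypothetical daily visitor IDs, ordered by day.
daily = [{1, 2}, {2, 3}, {3, 4}, {5}]
print(rolling_distinct(daily, window=2))
```

As in the Hive query, the first days of the series see a shorter window because there are no earlier rows to merge.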
  • 23. Pitfalls ● when JARs are not on HDFS the query fails (why?) ● computing on many days of raw clickstream fails in Beeline (works in Hive CLI), parquet is ok ● HIVE-9073 WINDOW + custom UDAF → NPE ○ fixed in Hive 1.2.0 ● DISTRO-631
  • 24. Approximation error ● Typically < 1-2 % ● Can be controlled by the parameters ● Example: 1 year of guids
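As a rough guide to how the parameters control the error: the original HyperLogLog analysis gives a relative standard error of about 1.04/sqrt(m) for m registers. The sketch below assumes that m = 2^bit_precision and that this formula carries over to the HLL++ implementation behind Brickhouse; treat the numbers as ballpark figures, not as measurements from this deck.

```python
import math


def hll_relative_error(bit_precision: int) -> float:
    """Approximate relative standard error of an HLL with 2**bit_precision registers."""
    m = 2 ** bit_precision
    return 1.04 / math.sqrt(m)


for p in (6, 11, 16):
    print(f"p={p:2d}  registers={2 ** p:6d}  error ~ {hll_relative_error(p):.2%}")
```

Under these assumptions, 6 bits gives roughly 13 % error, 11 bits about 2.3 % and 16 bits about 0.4 %, so every two extra bits of precision halve the error while quadrupling the register count.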
  • 25. Appendix – more interesting things
  • 26. Probabilistic counting ● trade-off: some approximation error for far better performance and memory consumption ● sketch - streaming & probabilistic algorithm ● KMV - k minimal values ● linear counter ● loglog counter
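Of the sketches listed here, KMV is the quickest to illustrate: hash every value to [0, 1), keep only the k smallest distinct hashes, and estimate the cardinality as (k - 1) divided by the k-th smallest hash. The code below is a minimal sketch; the SHA-1 based hash, the function names and the default k are illustrative assumptions.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a value to a pseudo-uniform float in [0, 1) (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def kmv_estimate(values, k=256):
    """Estimate the number of distinct values from the k smallest hashes."""
    smallest = []     # max-heap (stored negated) of the k smallest hashes so far
    members = set()   # hashes currently in the heap, used to skip duplicates
    for v in values:
        h = hash01(v)
        if h in members:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            members.add(h)
        elif h < -smallest[0]:
            evicted = -heapq.heappushpop(smallest, -h)   # drop the current largest
            members.discard(evicted)
            members.add(h)
    if len(smallest) < k:
        return float(len(smallest))   # fewer than k distinct values: count is exact
    return (k - 1) / -smallest[0]     # -smallest[0] is the k-th smallest hash


# Hypothetical stream: 10,000 distinct users, each seen five times.
stream = [f"user{i % 10000}" for i in range(50000)]
print(round(kmv_estimate(stream)))
```

Memory stays bounded by k regardless of the stream length, which is the trade-off named in the first bullet.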
  • 27. LogLog counter ● run length of initial zeros ● multiple estimators (registers) ● stochastic averaging ○ single hash function ○ multiple buckets ● hash → (register index, run length)
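The "hash → (register index, run length)" step can be sketched as plain bit manipulation. The code assumes a fixed-width integer hash whose top bits select the register; the function names and the 8-bit example hash are hypothetical, and the stored value is the run of leading zeros plus one, a common convention that the slide does not spell out.

```python
def index_and_rank(h: int, index_bits: int, hash_bits: int = 32):
    """Split a hash into (register index, rank).

    The top index_bits bits pick the register; rank is the number of
    leading zeros in the remaining bits, plus one.
    """
    rest_bits = hash_bits - index_bits
    index = h >> rest_bits
    rest = h & ((1 << rest_bits) - 1)
    rank = rest_bits - rest.bit_length() + 1
    return index, rank


def add(registers, h: int, index_bits: int, hash_bits: int = 32) -> None:
    """Stochastic averaging: each register keeps the largest rank it has seen."""
    index, rank = index_and_rank(h, index_bits, hash_bits)
    registers[index] = max(registers[index], rank)


# Toy 8-bit hash 1010|0010: register 10, two leading zeros in the rest.
registers = [0] * 16
add(registers, 0b10100010, index_bits=4, hash_bits=8)
print(index_and_rank(0b10100010, 4, 8), registers[10])
```

A single hash function therefore feeds many small estimators at once, which is what the "single hash function, multiple buckets" bullets refer to.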
  • 28. Linear counter
        import math
        import mmh3                       # MurmurHash3 bindings
        from bitarray import bitarray

        m = 20                            # size of the register
        register = bitarray(m)            # register, m bits
        register.setall(0)                # clear all bits (they may start uninitialized)

        def add(value):
            h = mmh3.hash(value) % m      # select bit index
            register[h] = 1               # = max(1, register[h])

        def cardinality():
            u_n = register.count(0)       # number of zeros (must be > 0)
            v_n = u_n / float(m)          # relative number of zeros
            n_hat = -m * math.log(v_n)    # estimate of the set cardinality
            return n_hat
  • 29. HyperLogLog (HLL) ● structure like the loglog counter ● harmonic mean to combine registers ● correction for small and large cardinalities ● values need to be hashed well – murmur3
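The bullets above translate into a compact estimator. This is a sketch under stated assumptions, not the code of any of the Hive libraries: it uses a 64-bit slice of SHA-1 instead of murmur3 to stay within the standard library, applies only the small-range (linear counting) correction, and skips the large-range correction, which matters for 32-bit hashes. The class and method names are made up.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 11):
        self.p = p                    # index bits; m = 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        rest_bits = 64 - self.p
        index = h >> rest_bits                        # top p bits pick the register
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1      # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)              # bias constant, valid for m >= 128
        # Harmonic mean of 2**register over all registers, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)            # small range: linear counting
        return raw


hll = HyperLogLog(p=11)
for i in range(60000):
    hll.add(f"user{i % 20000}")       # hypothetical stream: 20,000 distinct users
print(round(hll.cardinality()))
```

With p = 11 the expected relative error is around 2 %, so the printed estimate should land close to 20,000 while using only 2,048 small registers.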
  • 30. HLL union ● just take max of each register value ● no loss – same result as HLL of union of streams ● parallelizable ● union preserves error bound, intersection/diff do not
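The "no loss" property can be checked directly: merging two register arrays with an element-wise max gives exactly the registers that a single pass over both streams would produce. The sketch below builds bare register arrays with a SHA-1 based hash (an illustrative stand-in for murmur3); `build_registers`, `union` and the two overlapping "days" are hypothetical.

```python
import hashlib

P = 11          # index bits
M = 1 << P      # number of registers


def build_registers(values):
    """Fill HLL-style registers from an iterable of strings."""
    registers = [0] * M
    for v in values:
        h = int.from_bytes(hashlib.sha1(v.encode("utf-8")).digest()[:8], "big")
        rest_bits = 64 - P
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    return registers


def union(a, b):
    """HLL union: take the max of each pair of registers."""
    return [max(x, y) for x, y in zip(a, b)]


# Two overlapping partitions of a hypothetical user base.
day1 = [f"user{i}" for i in range(0, 6000)]
day2 = [f"user{i}" for i in range(4000, 10000)]

merged = union(build_registers(day1), build_registers(day2))
print(merged == build_registers(day1 + day2))
```

Because max is associative and commutative, partitions can be merged in any order and on any machine, which is what makes the union parallelizable.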
  • 31. Further reading ● very nice explanation of HLL ● Probabilistic Data Structures For Web Analytics And Data Mining ● Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure ● HyperLogLog in Pure SQL ● Use Subqueries to Count Distinct 50X Faster ● It is possible to combine HLL of different sizes
  • 32. Papers ● HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm ● https://github.com/addthis/stream-lib#cardinality
  • 33. Other problems & structures ● set membership – bloom filter ● top-k elements – count-min-sketch, stream-summary