Probabilistic algorithms for fun and pseudorandom profit

PROBABILISTIC ALGORITHMS
for fun and pseudorandom profit
Tyler Treat / 12.5.2015

ABOUT THE SPEAKER
➤ Backend engineer at Workiva
➤ Messaging platform tech lead
➤ Distributed systems
➤ bravenewgeek.com
@tyler_treat 
tyler.treat@workiva.com

[Figure: data volume plotted against processing latency]
➤ Time axis: Batch (days, hours); Streaming (minutes, seconds); Real-Time™ (I need it now, dammit!)
➤ Data axis: Meh, data (I can store this on my laptop); Oi, data! (We’re gonna need a bigger boat…); Big data™ (IoT, sensors)
➤ Regions: Not Interesting; Kinda Interesting; Pretty Interesting; /dev/null

http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

THIS TALK IS NOT
➤ About Samza, Storm, Spark Streaming et al.
➤ Strictly about stream-processing techniques
➤ Mathy
➤ Statistics-y

THIS TALK IS
➤ About basic probability theory
➤ About practical design trade-offs
➤ About algorithms & data structures
➤ About dealing with large or unbounded datasets
➤ A marriage of CS & engineering

OUTLINE
➤ Terminology & context
➤ Why probabilistic algorithms?
➤ Bloom filters & variants
➤ Count-min sketch
➤ HyperLogLog

Randomized Algorithms
➤ Random input
➤ Las Vegas algorithms: correct result, gamble on speed
➤ Monte Carlo algorithms: deterministic speed, gamble on result

DEFINING SOME TERMINOLOGY
➤ Online - processing elements as they arrive
➤ Offline - entire dataset is known ahead of time
➤ Real-time - hard constraint on response time
➤ A priori knowledge - something known beforehand

BATCH VS STREAMING
➤ Batch
➤ Offline
➤ Heuristics/multiple passes
➤ Data structures less important
documents search index

BATCH VS STREAMING
➤ Streaming
➤ Online, one pass
➤ Usually real-time (but not necessarily)
➤ Potentially unbounded
transactions
caches
fraud
analytics

3 DATA INTEGRATION QUESTIONS
➤ How do you get the data?
➤ How do you disseminate the data?
➤ How do you process the data?

3 DATA INTEGRATION QUESTIONS
➤ How do you get the data (quickly)?
➤ How do you disseminate the data (quickly)?
➤ How do you process the data (quickly)?

Denormalization is critical to
performance at scale.

How to count the number of
distinct document views
across Wikipedia?

10b531cb-914c-4b3e-ac1d-11678dd72f7a → 3,042,568
(16-byte GUID → 8-byte integer)

Document GUID                         View count
10b531cb-914c-4b3e-ac1d-11678dd72f7a  3,042,568
5d5d5a78-f98f-4eee-bc83-762b3c78f1ea  1,250,763
3558d299-45ef-4fc9-b9ec-902e4943c7f8  982,531
6febb745-c987-4c51-afd2-90a55f357d7b  24,703,289
6f3f199e-4cc3-4c68-9d2a-00c31eb199f3  7,401,050

Wikipedia has ~38 million pages.

38,000,000 pages × (16-byte GUID + 8-byte integer) ≈ 1 GB

➤ Not unreasonable for modern hardware
➤ Held in memory for the lifetime of the process, so it will move to old GC generations, which are expensive to collect!
➤ Now we want to track views per unique IP address
➤ >4 billion IPv4 addresses
➤ Naive solutions quickly become intractable
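
A quick back-of-the-envelope check of these numbers, as a sketch. The 24-byte entry ignores the per-entry overhead a real hash table adds, and the per-IP figure assumes (for illustration only) a 4-byte IPv4 key with an 8-byte counter for every possible address.

```python
# Back-of-the-envelope memory estimates; real hash tables add per-entry overhead.
PAGES = 38_000_000
GUID_BYTES = 16
COUNTER_BYTES = 8

page_bytes = PAGES * (GUID_BYTES + COUNTER_BYTES)
print(f"per-page counters: {page_bytes / 1e9:.2f} GB")

# Assumption for illustration: one 8-byte counter per possible IPv4 address,
# keyed by the 4-byte address itself.
IPV4_ADDRESSES = 2 ** 32
IPV4_KEY_BYTES = 4
ip_bytes = IPV4_ADDRESSES * (IPV4_KEY_BYTES + COUNTER_BYTES)
print(f"per-IP counters:   {ip_bytes / 1e9:.2f} GB")
```

Tracking views per (page, IP) pair would multiply the two key spaces, which is where exact counting stops being practical.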

DISTRIBUTED SYSTEMS TRADE-OFFS
Consistency
Availability
Partition Tolerance

DATA PROCESSING TRADE-OFFS
Time
Accuracy
Space

HAVE YOUR CAKE AND EAT IT TOO?
Stream
Processing
Batch
Processing
App
The “Lambda Architecture”

Probabilistic algorithms trade
accuracy for space and performance.

“Sketching” data structures make this
trade by storing a summary of the dataset
when storing it entirely is prohibitively
expensive.

Bloom Filters
B. H. Bloom. 
Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.

Answers a simple question:
is this element a member of a set?
S ⊆ 𝕌 
x ∈ S

SET MEMBERSHIP
➤ Is this URL malicious?
➤ Is this IP address blacklisted?
➤ Is this word contained in the document?
➤ Is this record in the database?
➤ Has this transaction been processed?

Hash Table
entry for each member

Bit Array
bit for each element in universe
0101101000101110010100100110101…

BLOOM FILTERS
➤ Bloom filters store set memberships
➤ Answers “not in set” or “probably in set”

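[Diagram: a Bloom filter in front of a secondary store]
➤ Do you have key 1? Filter: no. The store is never consulted.
➤ Do you have key 2? Filter: yes. Necessary access: the store returns key 2.
➤ Do you have key 3? Filter: yes (false positive). Unnecessary access: the store answers no.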

BLOOM FILTERS
➤ 2 operations: add, lookup
➤ Allocate bit array of length m
➤ k hash functions
➤ Configure m and k for desired false-positive rate
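
One common way to pick m and k, sketched below, uses the standard sizing formulas m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2. The function name `bloom_parameters` and the inputs `n` (expected number of elements) and `p` (target false-positive rate) are names chosen here for illustration, not from the talk.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): bit-array length and hash count for n elements at FP rate p."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function.
    k = max(1, round((m / n) * math.log(2)))
    return m, k


m, k = bloom_parameters(n=1_000_000, p=0.01)
print(f"m = {m} bits ({m / 1_000_000:.2f} bits per element), k = {k}")
```

For a 1% false-positive rate this works out to a little under 10 bits per element, regardless of how large each element is.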

BLOOM FILTERS
➤ Add element:
➤ Hash with k functions to
get k indices
➤ Set bits at each index
➤ Lookup:
➤ Hash with k functions to
get k indices
➤ Check bit at each index
➤ If any bit is unset, element
not in set
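
The add and lookup steps can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: deriving the k indices from two halves of one SHA-256 digest (double hashing) is an assumption made here for brevity, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bit array, all zeros

    def _indices(self, item: bytes) -> list[int]:
        # Assumption: simulate k hash functions with double hashing,
        # index_i = (h1 + i * h2) mod m, from one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # Set the bit at each of the k indices.
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item: bytes) -> bool:
        # Any unset bit means "definitely not in set";
        # all bits set means "probably in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add(b"key 2")
print(bf.lookup(b"key 2"))  # True: added elements are always found
print(bf.lookup(b"key 1"))  # usually False; a false positive is possible
```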

BLOOM FILTERS
➤ Benefits:
➤ More space-efficient than hash table or bit array
➤ Can determine trade-off between accuracy and space
➤ Drawbacks:
➤ Some elements potentially more sensitive to false positives
than others (solvable by partitioning)
➤ Can’t remove elements
➤ Requires a priori knowledge of the dataset
➤ Over-provisioned filter wastes space

Bloom filters are great for efficient offline
processing, but what about streaming?

BLOOM FILTERS WITH A TWIST
➤ Rotating Bloom filters
➤ e.g. remember everything in the last hour
➤ Scalable Bloom Filters
➤ Dynamically allocating chained filters
➤ Stable Bloom Filters
➤ Continuously evict stale data

Scalable Bloom Filters
P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison. 
Scalable Bloom Filters. 2007.

[Diagram: l chained filters, each with error probability P0]
P0 = error prob. of 1 filter
l = # filters
P = compound error prob.
$$P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$$

P0 = 0.1
$$P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$$

SCALABLE BLOOM FILTERS
➤ Questions:
➤ When to add a new filter?
➤ How to place a tight upper bound on P?

➤ When to add a new filter?
➤ Fill ratio p = # set bits / # bits
➤ Add new filter when target p is reached
➤ Optimal target p = 0.5 (math follows from paper)

➤ How to place a tight upper bound on P?
➤ Apply tightening ratio r to P0, where 0 < r < 1
➤ Start with 1 filter, error probability P0
➤ When full, add new filter, error probability P1 = P0·r
➤ Results in geometric series: $P \le \sum_{i=0}^{l-1} P_0 r^i \le P_0 / (1 - r)$
➤ Series converges on target error probability P

P0 = 0.1
r = 0.5
$$P = 1 - \prod_{i=0}^{l-1} (1 - P_0 r^i)$$
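
To see what the tightening ratio buys, the compound error probability can be evaluated numerically for the slide's values P0 = 0.1 and r = 0.5. This is a small sketch; the function name `compound_error` is chosen here for illustration, and r = 1 stands in for the untightened case where every filter has error P0.

```python
def compound_error(p0: float, r: float, filters: int) -> float:
    """P = 1 - prod_{i=0}^{l-1} (1 - p0 * r**i) for l chained filters."""
    all_correct = 1.0
    for i in range(filters):
        all_correct *= 1.0 - p0 * r ** i   # filter i has error p0 * r**i
    return 1.0 - all_correct


for l in (1, 2, 4, 8, 16):
    untightened = compound_error(0.1, 1.0, l)  # every filter has error P0
    tightened = compound_error(0.1, 0.5, l)    # error halves with each new filter
    print(f"l={l:2d}  r=1.0: {untightened:.3f}  r=0.5: {tightened:.3f}")
```

Without tightening, P climbs toward 1 as filters are added; with r = 0.5 it stays below P0 / (1 - r) = 0.2 no matter how many filters are chained.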

➤ Add elements to last filter
➤ Check each filter on lookups
➤ Tightening ratio r controls m and k for new filters

➤ Benefits:
➤ Can grow dynamically to accommodate dataset
➤ Provides tight upper bound on false-positive rate
➤ Can control growth rate
➤ Drawbacks:
➤ Size still proportional to dataset
➤ Additional computation on adds (negligible amortized)

Stable Bloom Filters
F. Deng, D. Rafiei. 
Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.

DUPLICATE DETECTION
➤ Query processing
➤ URL crawling
➤ Monitoring distinct IP addresses
➤ Advertiser click streams
➤ Graph processing

Bloom filters are remarkably useful for
dealing with graph data.

GRAPH PROCESSING
➤ Detecting cycles
➤ Pruning search space
➤ E.g. often used in bioinformatics
➤ Storing chemical structures, properties, and molecular
fingerprints in filters to optimize searches and determine
structural similarities
➤ Rapid classification of DNA sequences as large as the
human genome

GRAPH PROCESSING
➤ Store crawled nodes in memory
➤ Set of nodes may be too large to fit in memory
➤ Store crawled nodes in secondary storage
➤ Too many searches to perform in limited time

Precisely eliminating duplicates in an
unbounded stream isn’t feasible with
limited space and time.

Efficacy/Efficiency Conjecture: 
In many situations, a quick answer with
an allowable error rate is better than a
precise one that is slow.

Staleness Conjecture: 
In many situations, more recent data has
more value than stale data.

STABLE BLOOM FILTERS
➤ Discards old data to make room for new data
➤ Replace bit array with array of d-bit counters
➤ Initialize counters to zero
➤ Maximum counter value $Max = 2^d - 1$

➤ Add element:
➤ Select P random counters and
decrement by one
➤ Hash with k functions to get k indices
➤ Set counters at each index to Max
➤ Lookup:
➤ Hash with k functions to get k indices
➤ Check counter at each index
➤ If any counter is zero, element not in
set
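
A minimal sketch of these add and lookup steps follows. It decrements P independently chosen random counters, as the slide describes; the hashing scheme (double hashing over SHA-256), the class name, and the `seed` parameter are assumptions made for this example.

```python
import hashlib
import random


class StableBloomFilter:
    def __init__(self, m: int, k: int, d: int, p: int, seed=None) -> None:
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1          # Max = 2^d - 1
        self.cells = [0] * m             # d-bit counters, initialized to zero
        self.rng = random.Random(seed)   # seedable for reproducible runs

    def _indices(self, item: bytes) -> list[int]:
        # Assumption: k indices via double hashing of one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # Evict: decrement P randomly selected counters, never below zero.
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Then set the element's k counters to Max.
        for idx in self._indices(item):
            self.cells[idx] = self.max

    def lookup(self, item: bytes) -> bool:
        # Any zero counter means the element is not (or no longer) in the set.
        return all(self.cells[idx] > 0 for idx in self._indices(item))


sbf = StableBloomFilter(m=1000, k=3, d=3, p=10, seed=42)
sbf.add(b"10.0.0.1")
print(sbf.lookup(b"10.0.0.1"))  # True immediately after the add
```

Because eviction runs before the element's own counters are set, a freshly added element is always found; older elements fade as later adds decrement their counters.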

➤ Classic Bloom filter a special case of SBF w/ d=1, P=0
➤ Tight upper bound on false positives
➤ FP rate asymptotically approaches configurable fixed constant
(stable-point property)
➤ See paper for math and parameter settings
➤ Evicting data introduces false negatives

➤ Benefits:
➤ Fixed memory allocation
➤ Evicts old data to make room for new data
➤ Provides tight upper bound on false positives
➤ Drawbacks:
➤ Introduces false negatives
➤ Additional computation on adds

Count-Min Sketch
G. Cormode, S. Muthukrishnan. 
An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.

Can we count element frequencies using
sub-linear space?
IP address      Page views
94.136.205.1    7
132.208.90.15   4
54.222.151.15   11

COUNT-MIN SKETCH
➤ Approximates frequencies in sub-linear
space
➤ Matrix with w columns and d rows
➤ Each row has a hash function
➤ Each cell initialized to zero
➤ When element arrives:
➤ Hash for each row
➤ Increment each counter by 1
➤ freq(element) = min counter value
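
The steps above can be sketched as follows. Giving each row its own hash function by salting BLAKE2b with the row number is an implementation choice made for this example, and the class and method names are hypothetical.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int, depth: int) -> None:
        self.width = width                                  # w columns
        self.depth = depth                                  # d rows
        self.table = [[0] * width for _ in range(depth)]    # all cells start at zero

    def _cells(self, item: bytes):
        # One hash function per row: BLAKE2b salted with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=row.to_bytes(8, "big")
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item: bytes, count: int = 1) -> None:
        # Increment one counter in every row.
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: bytes) -> int:
        # Collisions only ever inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch(width=1024, depth=4)
views = {b"94.136.205.1": 7, b"132.208.90.15": 4, b"54.222.151.15": 11}
for ip, n in views.items():
    for _ in range(n):
        cms.add(ip)
for ip in views:
    print(ip.decode(), cms.estimate(ip))
```

Each estimate is at least the true count and can only exceed it when the element collides with others in every row.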

COUNT-MIN SKETCH
➤ Why the minimum?
➤ Possibility for collisions between elements
➤ Counter may be incremented by multiple elements
➤ Taking minimum counter value gives closer approximation

COUNT-MIN SKETCH
➤ Benefits:
➤ Simple!
➤ Sub-linear space
➤ Useful for detecting “heavy hitters”
➤ Easy to track top-k by adding a min-heap
➤ Drawbacks:
➤ Biased estimator: may overestimate, never underestimates
➤ Better suited to Zipfian distributions & rare events

HyperLogLog
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier. 
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.

How do we count distinct things in a stream?

COUNTING PROBLEMS
➤ E.g. how many different words are used in Wikipedia?
➤ Counter per element explodes memory
➤ Usually requires memory proportional to cardinality
➤ Can we approximate cardinality with constant space?

HYPERLOGLOG
➤ The name: can estimate cardinality of set w/ cardinality Nmax
using loglog(Nmax) + O(1) bits
➤ Hash element to integer
➤ Count number of leading 0’s in binary form of hash
➤ Track highest number of leading 0’s, n
➤ Cardinality ≈ $2^{n+1}$

HYPERLOGLOG
➤ stream = [“foo”, “bar”, “baz”, “qux”]
➤ h(“foo”) = 10100001
➤ h(“bar”) = 01110111
➤ h(“baz”) = 01110100
➤ h(“qux”) = 10100011
➤ n = 1
➤ |stream| ≈ $2^{n+1} = 2^2 = 4$
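
The worked example can be reproduced directly. The 8-bit hash values are the ones from the slide; the helper name `leading_zeros` is chosen here for illustration.

```python
def leading_zeros(value: int, bits: int) -> int:
    """Number of leading 0 bits in `value` viewed as a `bits`-wide integer."""
    return bits - value.bit_length()


# Hash values from the slide, written as 8-bit binary literals.
hashes = {
    "foo": 0b10100001,
    "bar": 0b01110111,
    "baz": 0b01110100,
    "qux": 0b10100011,
}

n = max(leading_zeros(h, 8) for h in hashes.values())
estimate = 2 ** (n + 1)
print(f"n = {n}, estimated cardinality = {estimate}")
```

A single maximum like this is a very noisy estimator: one unlucky hash with many leading zeros would inflate the estimate by orders of magnitude.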

It’s actually not magic but just a few
really clever observations.

With 50/50 odds, how long will it take to flip
3 heads in a row? 20? 100?
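
For reference, the expected number of fair coin flips needed to see n heads in a row has a closed form. A sketch of the derivation: to get n heads, first get n - 1 heads, then flip once more; half the time the run is complete, otherwise it starts over.

```latex
E_0 = 0, \qquad
E_n = E_{n-1} + 1 + \tfrac{1}{2} E_n
\;\Longrightarrow\;
E_n = 2E_{n-1} + 2
\;\Longrightarrow\;
E_n = 2^{n+1} - 2

E_3 = 14, \qquad
E_{20} = 2^{21} - 2 = 2{,}097{,}150, \qquad
E_{100} = 2^{101} - 2 \approx 2.5 \times 10^{30}
```

Long runs are exponentially rare, so observing one is evidence that many flips took place.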

HYPERLOGLOG
➤ Replace “heads” and “tails” with 0’s and 1’s
➤ Count leading consecutive 0’s in binary form of hash
➤ E.g. imagine a 4-bit hash, 16 possible values:
➤ 0000: 4 leading 0’s
➤ 0001: 3 leading 0’s
➤ 0011, 0010: 2 leading 0’s
➤ 0100, 0111, 0110, 0101: 1 leading 0
➤ 1111, 1110, 1001, 1010, 1101, 1100, 1011, 1000: 0 leading 0’s
➤ Assume good hash function → 1/16 odds for each value

HYPERLOGLOG
➤ Track highest number of leading 0’s, n
➤ n = 0 → 8/16=1/2 odds
➤ n = 1 → 4/16=1/4 odds
➤ n = 2 → 2/16=1/8 odds
➤ n = 3 → 1/16 odds
➤ Cardinality ≈ how many things did we have to look at?
➤ E.g. highest count = 1 → 1/4 odds → cardinality 4
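
The odds for the 4-bit example can be checked by enumerating all 16 hash values. This is a small sketch; the names `BITS` and `leading_zeros` are chosen here for illustration.

```python
from collections import Counter

BITS = 4


def leading_zeros(value: int, bits: int = BITS) -> int:
    """Number of leading 0 bits in `value` viewed as a `bits`-wide integer."""
    return bits - value.bit_length()


# Count how many of the 16 possible values have each number of leading zeros.
distribution = Counter(leading_zeros(v) for v in range(2 ** BITS))
for zeros in sorted(distribution):
    print(f"{zeros} leading 0's: {distribution[zeros]}/{2 ** BITS}")
```

Each extra leading zero halves the number of matching values, except for the all-zeros value, which shares the 1/16 odds of 0001.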

HYPERLOGLOG
➤ 1/2 of all binary numbers start with 1
➤ Each additional bit cuts the probability in half:
➤ 1/4 start with 01
➤ etc.
➤ P(run of length n) = $1 / 2^{n+1}$
➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things until we saw it (cardinality 8)
➤ Cardinality ≈ $prob^{-1}$ (reciprocal of probability)

HYPERLOGLOG
➤ Use multiple buckets
➤ Use first few bits of hash to determine bucket
➤ Use remaining bits to count 0’s
➤ Each bucket tracks its own count
➤ Take harmonic mean of all buckets to get cardinality
➤ $\min(x_1 \ldots x_n) \le H(x_1 \ldots x_n) \le n \cdot \min(x_1 \ldots x_n)$
01011010001011100101001001101010
[leading bits: bucket | remaining bits: counting space]
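
A compact sketch of the bucketed estimator follows. It departs from the slide in the ways the HyperLogLog paper does: each bucket stores the rank (leading zeros plus one), the harmonic mean is scaled by a bias constant alpha, and small counts fall back to linear counting. The 64-bit hash taken from SHA-1, the default of 2^12 buckets, and all names are assumptions made for this example; the alpha formula used here is only valid for 128 or more buckets.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12) -> None:
        self.p = p                      # first p bits of the hash pick the bucket
        self.m = 1 << p                 # number of buckets (needs m >= 128 here)
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        # Assumption: use the first 64 bits of SHA-1 as the hash value.
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        rest_bits = 64 - self.p
        bucket = h >> rest_bits                     # first p bits
        rest = h & ((1 << rest_bits) - 1)           # remaining bits
        rank = rest_bits - rest.bit_length() + 1    # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias correction constant
        # m * (harmonic mean of 2^register), scaled by alpha.
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting on empty buckets.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}".encode())
print(round(hll.count()))
```

Adding the same elements again leaves every bucket unchanged, which is why duplicates do not affect the estimate.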

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
Number of distinct words in all of Shakespeare's work

HYPERLOGLOG
➤ Benefits:
➤ Constant memory
➤ Super fast (calculating MSB is cheap)
➤ Can give accurate count with <1% error
➤ Drawbacks:
➤ Has a margin of error (albeit small)

Data processing has trade-offs.

Probabilistic algorithms trade accuracy for
speed and space.

Often we only care about answers that are
mostly correct but available now.

Sometimes the “right” answer is impossible
to compute or simply doesn’t exist.

Probabilistic algorithms are
just damn cool.

ALGORITHM IMPLEMENTATIONS
➤ Algebird - https://github.com/twitter/algebird
➤ Bloom filter
➤ HyperLogLog
➤ stream-lib - https://github.com/addthis/stream-lib
➤ Bloom filter
➤ HyperLogLog
➤ Boom Filters - https://github.com/tylertreat/BoomFilters
➤ Bloom filter
➤ Scalable Bloom filter
➤ Stable Bloom filter
➤ HyperLogLog

OTHER COOL PROBABILISTIC ALGORITHMS
➤ Counting Bloom filter (and many other Bloom variations)
➤ Bloomier filter (encode functions instead of sets)
➤ Cuckoo filter (Bloom filter w/ cuckoo hashing)
➤ q-digest (quantile approximation)
➤ t-digest (online accumulation of rank-based statistics)
➤ Locality-sensitive hashing (hash similar items to same buckets)
➤ MinHash (set similarity)
➤ Miller–Rabin (primality testing)
➤ Karger’s algorithm (min cut of connected graph)

@tyler_treat
github.com/tylertreat
bravenewgeek.com
Thanks
We’re hiring!

BIBLIOGRAPHY
Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf
Cormode, G., Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
Deng, F., Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf
Tarkoma, S., Rothenberg, C., Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf
Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods
