PROBABILISTIC ALGORITHMS
for fun and pseudorandom profit
Tyler Treat / 12.5.2015
ABOUT THE SPEAKER
➤ Backend engineer at Workiva
➤ Messaging platform tech lead
➤ Distributed systems
➤ bravenewgeek.com
@tyler_treat

tyler.treat@workiva.com
[Figure: Time vs. Data]
Time axis: Batch (days, hours); Streaming (minutes, seconds); Real-Time™ (I need it now, dammit!)
Data axis: Meh, data (I can store this on my laptop); Oi, data! (We’re gonna need a bigger boat…); Big data™ (IoT, sensors)
Region labels: Not Interesting; Kinda Interesting; Pretty Interesting; /dev/null
http://bravenewgeek.com/stream-processing-and-probabilistic-methods/
THIS TALK IS NOT
➤ About Samza, Storm, Spark Streaming et al.
➤ Strictly about stream-processing techniques
➤ Mathy
➤ Statistics-y
THIS TALK IS
➤ About basic probability theory
➤ About practical design trade-offs
➤ About algorithms & data structures
➤ About dealing with large or unbounded datasets
➤ A marriage of CS & engineering
OUTLINE
➤ Terminology & context
➤ Why probabilistic algorithms?
➤ Bloom filters & variants
➤ Count-min sketch
➤ HyperLogLog
Randomized Algorithms
Random Input
➤ Las Vegas Algorithms
➤ Correct result
➤ Gamble on speed
➤ Monte Carlo Algorithms
➤ Deterministic speed
➤ Gamble on result
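To make the Las Vegas / Monte Carlo split concrete, here is a minimal sketch that is not from the talk. It uses the classic toy problem of finding any slot holding "a" in a list where half the slots hold "a". The function names and the max_probes parameter are illustrative assumptions.

```python
import random


def las_vegas_find(items, target, rng):
    """Las Vegas: the answer is always correct, the running time is random.

    Assumes target occurs at least once; otherwise this loops forever.
    """
    while True:
        i = rng.randrange(len(items))
        if items[i] == target:
            return i


def monte_carlo_find(items, target, rng, max_probes):
    """Monte Carlo: the running time is bounded, the answer may be missing."""
    for _ in range(max_probes):
        i = rng.randrange(len(items))
        if items[i] == target:
            return i
    return None  # gave up: the gamble on the result did not pay off


rng = random.Random(42)
items = ["a", "b"] * 50  # half of the slots hold "a"

idx = las_vegas_find(items, "a", rng)
maybe_idx = monte_carlo_find(items, "a", rng, max_probes=5)
```

The structures in the rest of the talk sit on the Monte Carlo side: bounded time and space, with a controlled chance of a wrong answer.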
DEFINING SOME TERMINOLOGY
➤ Online - processing elements as they arrive
➤ Offline - entire dataset is known ahead of time
➤ Real-time - hard constraint on response time
➤ A priori knowledge - something known beforehand
BATCH VS STREAMING
➤ Batch
➤ Offline
➤ Heuristics/multiple passes
➤ Data structures less important
documents search index
BATCH VS STREAMING
➤ Streaming
➤ Online, one pass
➤ Usually real-time (but not necessarily)
➤ Potentially unbounded
transactions
caches
fraud
analytics
3 DATA INTEGRATION QUESTIONS
➤ How do you get the data?
➤ How do you disseminate the data?
➤ How do you process the data?
3 DATA INTEGRATION QUESTIONS
➤ How do you get the data (quickly)?
➤ How do you disseminate the data (quickly)?
➤ How do you process the data (quickly)?
Denormalization is critical to performance at scale.
How to count the number of distinct document views across Wikipedia?
16-byte GUID                            8-byte integer
10b531cb-914c-4b3e-ac1d-11678dd72f7a    3,042,568
5d5d5a78-f98f-4eee-bc83-762b3c78f1ea    1,250,763
3558d299-45ef-4fc9-b9ec-902e4943c7f8    982,531
6febb745-c987-4c51-afd2-90a55f357d7b    24,703,289
6f3f199e-4cc3-4c68-9d2a-00c31eb199f3    7,401,050
Wikipedia has ~38 million pages.
38,000,000 pages × (16-byte GUID + 8-byte integer) ≈ 1 GB
➤ Not unreasonable for modern hardware
➤ Held in memory for lifetime of process, so it will move to old GC generations, which are expensive to collect!
➤ Now we want to track views per unique IP address
➤ >4 billion IPv4 addresses
➤ Naive solutions quickly become intractable
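The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. This sketch counts only the raw key and counter bytes and ignores hash table and object overhead. The 4-byte key for an IPv4 address and the one-counter-per-address layout are assumptions made for illustration, not figures from the talk.

```python
# Per-page counters: 16-byte GUID key + 8-byte integer counter.
PAGES = 38_000_000
page_bytes = PAGES * (16 + 8)
page_gb = page_bytes / 10**9    # decimal gigabytes
page_gib = page_bytes / 2**30   # binary gibibytes

# Per-IP counters (assumed layout): 4-byte IPv4 key + 8-byte counter,
# one entry for every possible IPv4 address.
IPV4_ADDRESSES = 2**32
ip_bytes = IPV4_ADDRESSES * (4 + 8)
ip_gb = ip_bytes / 10**9

print(f"pages: {page_bytes:,} bytes = {page_gb:.3f} GB ({page_gib:.3f} GiB)")
print(f"IPv4:  {ip_bytes:,} bytes = {ip_gb:.1f} GB")
```

Under these assumptions the per-page table is a little under 1 GB, while a dense per-IP table is roughly 50 times larger before any per-page breakdown is added.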
DISTRIBUTED SYSTEMS TRADE-OFFS
Consistency
Availability
Partition Tolerance
DATA PROCESSING TRADE-OFFS
Time
Accuracy
Space
HAVE YOUR CAKE AND EAT IT TOO?
[Figure: The “Lambda Architecture” (Stream Processing, Batch Processing, App)]
Probabilistic algorithms trade accuracy for space and performance.
“Sketching” data structures make this trade by storing a summary of the dataset when storing it entirely is prohibitively expensive.
Bloom Filters
B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.
Answers a simple question:
is this element a member of a set?
S ⊆ 𝕌

x ∈ S
SET MEMBERSHIP
➤ Is this URL malicious?
➤ Is this IP address blacklisted?
➤ Is this word contained in the document?
➤ Is this record in the database?
➤ Has this transaction been processed?
Hash Table: entry for each member
Bit Array: bit for each element in universe
0101101000101110010100100110101…
BLOOM FILTERS
➤ Bloom filters store set memberships
➤ Answer “not in set” or “probably in set”
[Figure: a Bloom filter in front of a secondary store]
➤ Do you have key 1? Filter: no. Answer: no (store not consulted)
➤ Do you have key 2? Filter: yes. Store: “Here’s key 2” (necessary access)
➤ Do you have key 3? Filter: yes. Store: no (unnecessary access). Answer: no
BLOOM FILTERS
➤ 2 operations: add, lookup
➤ Allocate bit array of length m
➤ k hash functions
➤ Configure m and k for desired false-positive rate
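The slide does not say how to pick m and k. A common approach uses the standard textbook approximations m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2, where n is the expected number of elements and p is the target false-positive rate. The sketch below applies them. The function name and the symbols n and p are assumptions introduced here, not names from the talk.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k) for n expected elements and target false-positive rate p.

    Uses the usual approximations; real filters often round m up further,
    for example to a multiple of the word size.
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


# One million elements at a 1% false-positive rate needs roughly
# 9.6 bits per element and 7 hash functions.
m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k, f"{m / 8 / 10**6:.2f} MB")
```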
BLOOM FILTERS
➤ Add element:
➤ Hash with k functions to get k indices
➤ Set bits at each index
➤ Lookup:
➤ Hash with k functions to get k indices
➤ Check bit at each index
➤ If any bit is unset, element not in set
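The add and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from any of the libraries listed later. Deriving the k indices from a single SHA-256 digest by double hashing is an assumption made to keep the example short; any k independent hash functions would do.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bit array, all zeros

    def _indices(self, item):
        # Assumption: simulate k hash functions as h1 + i*h2 (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item):
        # False means "not in set"; True means "probably in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("http://example.com/malicious")
print(bf.lookup("http://example.com/malicious"))  # an added element is always found
```

Bits are only ever set, never cleared, which is why there are no false negatives and also why elements cannot be removed.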
BLOOM FILTERS
➤ Benefits:
➤ More space-efficient than hash table or bit array
➤ Can determine trade-off between accuracy and space
➤ Drawbacks:
➤ Some elements potentially more sensitive to false positives than others (solvable by partitioning)
➤ Can’t remove elements
➤ Requires a priori knowledge of the dataset
➤ Over-provisioned filter wastes space
Bloom filters are great for efficient offline processing, but what about streaming?
BLOOM FILTERS WITH A TWIST
➤ Rotating Bloom filters
➤ e.g. remember everything in the last hour

➤ Scalable Bloom Filters
➤ Dynamically allocating chained filters

➤ Stable Bloom Filters
➤ Continuously evict stale data
Scalable Bloom Filters
P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison. Scalable Bloom Filters. 2007.
[Figure: a chain of l filters, each with error probability P0]
P0 = error prob. of 1 filter
l = # filters
P = compound error prob.
$P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$
With P0 = 0.1: $P = 1 - \prod_{i=0}^{l-1} (1 - 0.1)$
SCALABLE BLOOM FILTERS
➤ Questions:
➤ When to add a new filter?
➤ How to place a tight upper bound on P?
SCALABLE BLOOM FILTERS
➤ When to add a new filter?
➤ Fill ratio p = # set bits / # bits
➤ Add new filter when target p is reached
➤ Optimal target p = 0.5 (math follows from paper)
SCALABLE BLOOM FILTERS
➤ How to place a tight upper bound on P?
➤ Apply tightening ratio r to P0, where 0 < r < 1
➤ Start with 1 filter, error probability P0
➤ When full, add new filter, error probability P1 = P0·r
➤ Results in geometric series: $\sum_{i=0}^{l-1} P_0 r^i \le P_0 / (1 - r)$
➤ Series converges on target error probability P
P0 = 0.1, r = 0.5
$P = 1 - \prod_{i=0}^{l-1} (1 - P_0 r^i)$
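Plugging numbers into the compound error formula shows why the tightening ratio matters. The sketch below evaluates P for the slide's values P0 = 0.1 and r = 0.5, and for r = 1 (no tightening) as a comparison. The function name and the choice of l values are illustrative assumptions.

```python
def compound_error(p0, r, l):
    """P = 1 - prod_{i=0}^{l-1} (1 - p0 * r**i) for a chain of l filters."""
    no_false_positive = 1.0
    for i in range(l):
        no_false_positive *= 1.0 - p0 * r ** i
    return 1.0 - no_false_positive


filter_counts = (1, 5, 10, 20)
untightened = [compound_error(0.1, 1.0, l) for l in filter_counts]
tightened = [compound_error(0.1, 0.5, l) for l in filter_counts]
bound = 0.1 / (1 - 0.5)  # limit of the geometric series P0 / (1 - r)

for l, a, b in zip(filter_counts, untightened, tightened):
    print(f"l={l:2d}  r=1: {a:.4f}  r=0.5: {b:.4f}  (bound {bound})")
```

Without tightening, P keeps climbing toward 1 as filters are added; with r = 0.5 it levels off below P0 / (1 - r) = 0.2.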
SCALABLE BLOOM FILTERS
➤ Add elements to last filter
➤ Check each filter on lookups
➤ Tightening ratio r controls m and k for new filters
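Put together, a scalable Bloom filter might look like the sketch below. It is a simplified illustration under stated assumptions, not the paper's algorithm in full. Every new filter is sized for the same fixed capacity (the paper also grows filter sizes), m and k come from the standard sizing approximations, and hashing uses a SHA-256 double-hashing trick. All class, method, and parameter names are hypothetical.

```python
import hashlib
import math


class _Filter:
    """Plain Bloom filter sized for a given capacity and error probability."""

    def __init__(self, capacity, error):
        self.m = math.ceil(-capacity * math.log(error) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = [False] * self.m
        self.set_bits = 0

    def _indices(self, item):
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            if not self.bits[idx]:
                self.bits[idx] = True
                self.set_bits += 1

    def lookup(self, item):
        return all(self.bits[idx] for idx in self._indices(item))

    def fill_ratio(self):
        return self.set_bits / self.m


class ScalableBloomFilter:
    def __init__(self, capacity=100, p0=0.1, r=0.5, target_fill=0.5):
        self.capacity, self.p0, self.r = capacity, p0, r
        self.target_fill = target_fill
        self.filters = [_Filter(capacity, p0)]

    def add(self, item):
        # When the last filter reaches the target fill ratio, chain a new
        # one with a tightened error probability P0 * r^i.
        if self.filters[-1].fill_ratio() >= self.target_fill:
            error = self.p0 * self.r ** len(self.filters)
            self.filters.append(_Filter(self.capacity, error))
        self.filters[-1].add(item)

    def lookup(self, item):
        # An element may live in any filter, so check them all.
        return any(f.lookup(item) for f in self.filters)


sbf = ScalableBloomFilter()
for i in range(500):
    sbf.add(f"item-{i}")
print(len(sbf.filters), [f.k for f in sbf.filters])
```

Each new filter gets a smaller error probability, so its m and k grow; that is the sense in which r controls m and k for new filters.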
SCALABLE BLOOM FILTERS
➤ Benefits:
➤ Can grow dynamically to accommodate dataset
➤ Provides tight upper bound on false-positive rate
➤ Can control growth rate
➤ Drawbacks:
➤ Size still proportional to dataset
➤ Additional computation on adds (negligible amortized)
Stable Bloom Filters
F. Deng, D. Rafiei. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.
DUPLICATE DETECTION
➤ Query processing
➤ URL crawling
➤ Monitoring distinct IP addresses
➤ Advertiser click streams
➤ Graph processing
Bloom filters are remarkably useful for dealing with graph data.
GRAPH PROCESSING
➤ Detecting cycles
➤ Pruning search space
➤ E.g. often used in bioinformatics
➤ Storing chemical structures, properties, and molecular fingerprints in filters to optimize searches and determine structural similarities
➤ Rapid classification of DNA sequences as large as the human genome
GRAPH PROCESSING
➤ Store crawled nodes in memory
➤ Set of nodes may be too large to fit in memory
➤ Store crawled nodes in secondary storage
➤ Too many searches to perform in limited time
Precisely eliminating duplicates in an unbounded stream isn’t feasible with limited space and time.
Efficacy/Efficiency Conjecture: In many situations, a quick answer with an allowable error rate is better than a precise one that is slow.
Staleness Conjecture: In many situations, more recent data has more value than stale data.
STABLE BLOOM FILTERS
➤ Discards old data to make room for new data
➤ Replace bit array with array of d-bit counters
➤ Initialize counters to zero
➤ Maximum counter value Max = $2^d - 1$
STABLE BLOOM FILTERS
➤ Add element:
➤ Select P random counters and decrement by one
➤ Hash with k functions to get k indices
➤ Set counters at each index to Max
➤ Lookup:
➤ Hash with k functions to get k indices
➤ Check counter at each index
➤ If any counter is zero, element not in set
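The add and lookup steps above translate almost line for line into code. The sketch below is an illustration under assumptions rather than the paper's reference implementation. Counters are floored at zero when decremented, the P counters are picked independently at random, and hashing reuses a SHA-256 double-hashing trick. The class and parameter names are hypothetical.

```python
import hashlib
import random


class StableBloomFilter:
    def __init__(self, m, k, d, p, seed=None):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1          # Max = 2^d - 1
        self.cells = [0] * m             # d-bit counters, initialized to zero
        self.rng = random.Random(seed)

    def _indices(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Evict: decrement P random counters (assumed floor at zero).
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Insert: set the element's k counters to Max.
        for idx in self._indices(item):
            self.cells[idx] = self.max

    def lookup(self, item):
        # Any zero counter means "not in set" (possibly a false negative
        # if the element was evicted).
        return all(self.cells[idx] > 0 for idx in self._indices(item))


stable = StableBloomFilter(m=1000, k=3, d=3, p=10, seed=1)
stable.add("click-1")
seen_just_now = stable.lookup("click-1")

for i in range(5000):                    # a long stream of other elements
    stable.add(f"other-{i}")
still_seen = stable.lookup("click-1")    # may be False once evicted
```

With d = 1 and p = 0 the structure never decrements and behaves like the classic Bloom filter.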
STABLE BLOOM FILTERS
➤ Classic Bloom filter a special case of SBF w/ d=1, P=0
➤ Tight upper bound on false positives
➤ FP rate asymptotically approaches configurable fixed constant (stable-point property)
➤ See paper for math and parameter settings
➤ Evicting data introduces false negatives
STABLE BLOOM FILTERS
➤ Benefits:
➤ Fixed memory allocation
➤ Evicts old data to make room for new data
➤ Provides tight upper bound on false positives
➤ Drawbacks:
➤ Introduces false negatives
➤ Additional computation on adds
Count-Min Sketch
G. Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.
Can we count element frequencies using sub-linear space?
page views
94.136.205.1     7
132.208.90.15    4
54.222.151.15    11
COUNT-MIN SKETCH
➤ Approximates frequencies in sub-linear space
➤ Matrix with w columns and d rows
➤ Each row has a hash function
➤ Each cell initialized to zero
➤ When element arrives:
➤ Hash for each row
➤ Increment each counter by 1
➤ freq(element) = min counter value
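A count-min sketch following the steps above fits in a few lines. This is a minimal sketch, not the API of any library named in the talk. The per-row hash functions are simulated by salting SHA-256 with the row number, which is an assumption, and the w and d values are arbitrary. The IP addresses and counts are the ones from the page-views example, paired in the order they appear.

```python
import hashlib


class CountMinSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d                        # w columns, d rows
        self.table = [[0] * w for _ in range(d)]     # every cell starts at zero

    def _column(self, row, item):
        # Assumption: one hash function per row, simulated with a row salt.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum across
        # rows is the closest available approximation.
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.d))


cms = CountMinSketch(w=272, d=5)
views = {"94.136.205.1": 7, "132.208.90.15": 4, "54.222.151.15": 11}
for ip, n in views.items():
    for _ in range(n):
        cms.add(ip)

for ip, n in views.items():
    print(ip, "true:", n, "estimate:", cms.estimate(ip))
```

The memory used is w × d counters regardless of how many distinct elements arrive.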
COUNT-MIN SKETCH
➤ Why the minimum?
➤ Possibility for collisions between elements
➤ Counter may be incremented by multiple elements
➤ Taking minimum counter value gives closer approximation
COUNT-MIN SKETCH
➤ Benefits:
➤ Simple!
➤ Sub-linear space
➤ Useful for detecting “heavy hitters”
➤ Easy to track top-k by adding a min-heap
➤ Drawbacks:
➤ Biased estimator: may overestimate, never underestimates
➤ Better suited to Zipfian distributions & rare events
HyperLogLog
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.
How do we count distinct things in a stream?
COUNTING PROBLEMS
➤ E.g. how many different words are used in Wikipedia?
➤ Counter per element explodes memory
➤ Usually requires memory proportional to cardinality
➤ Can we approximate cardinality with constant space?
HYPERLOGLOG
➤ The name: can estimate cardinality of set w/ cardinality $N_{max}$ using $\log\log(N_{max}) + O(1)$ bits
➤ Hash element to integer
➤ Count number of leading 0’s in binary form of hash
➤ Track highest number of leading 0’s, n
➤ Cardinality ≈ $2^{n+1}$
HYPERLOGLOG
➤ stream = [“foo”, “bar”, “baz”, “qux”]
➤ h(“foo”) = 10100001
➤ h(“bar”) = 01110111
➤ h(“baz”) = 01110100
➤ h(“qux”) = 10100011
➤ n = 1
➤ |stream| ≈ $2^{n+1} = 2^2 = 4$
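The worked example can be replayed directly. The sketch below takes the four 8-bit hash values from the slide as given (they are illustrative values, not the output of a real hash function) and applies the single-register estimate. The helper name is an assumption.

```python
def leading_zeros(bits):
    """Number of leading '0' characters in a binary string."""
    return len(bits) - len(bits.lstrip("0"))


# Hash values copied from the slide.
hashes = {
    "foo": "10100001",
    "bar": "01110111",
    "baz": "01110100",
    "qux": "10100011",
}

n = max(leading_zeros(h) for h in hashes.values())
estimate = 2 ** (n + 1)
print(n, estimate)
```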
It’s actually not magic but just a few really clever observations.
With 50/50 odds, how long will it take to flip 3 heads in a row? 20? 100?
HYPERLOGLOG
➤ Replace “heads” and “tails” with 0’s and 1’s
➤ Count leading consecutive 0’s in binary form of hash
➤ E.g. imagine a 4-bit hash, 16 possible values:
➤ 0000: 4 leading 0’s
➤ 0001: 3 leading 0’s
➤ 0010, 0011: 2 leading 0’s
➤ 0100, 0101, 0110, 0111: 1 leading 0
➤ 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111: 0 leading 0’s
➤ Assume good hash function → 1/16 odds for each permutation
HYPERLOGLOG
➤ Track highest number of leading 0’s, n
➤ n = 0 → 8/16=1/2 odds
➤ n = 1 → 4/16=1/4 odds
➤ n = 2 → 2/16=1/8 odds
➤ n = 3 → 1/16 odds
➤ Cardinality ≈ how many things did we have to look at?
➤ E.g. highest count = 1 → 1/4 odds → cardinality 4
HYPERLOGLOG
➤ 1/2 of all binary numbers start with 1
➤ Each additional bit cuts the probability in half:
➤ 1/4 start with 01
➤ 1/8 start with 001
➤ 1/16 start with 0001
➤ etc.
➤ P(run of length n) = $1 / 2^{n+1}$
➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things til we saw it (cardinality 8)
➤ Cardinality ≈ $prob^{-1}$ (reciprocal of probability)
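The halving pattern can be checked exhaustively for a small hash width. The sketch below enumerates every 8-bit value and confirms that the fraction with exactly n leading zeros is 1 / 2^(n+1) for every n below the width. The 8-bit width is an arbitrary assumption for illustration.

```python
from fractions import Fraction

BITS = 8


def leading_zeros(value, bits=BITS):
    """Leading zeros of value when written as a fixed-width binary number."""
    return bits - value.bit_length()


counts = [0] * (BITS + 1)
for value in range(2 ** BITS):
    counts[leading_zeros(value)] += 1

probs = [Fraction(c, 2 ** BITS) for c in counts]
for n, p in enumerate(probs):
    print(f"{n} leading zeros: probability {p}")
```

The only exception is the all-zero value, which has probability 1 / 2^BITS rather than 1 / 2^(BITS+1), because there is no bit left to terminate the run.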
What about outliers?
HYPERLOGLOG
➤ Use multiple buckets
➤ Use first few bits of hash to determine bucket
➤ Use remaining bits to count 0’s
➤ Each bucket tracks its own count
➤ Take harmonic mean of all buckets to get cardinality
➤ min(x1…xn) ≤ H(x1…xn) ≤ n min(x1…xn)
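The bucketing idea above can be sketched as follows. This is a simplified illustration, not the code of any library named in the talk. It uses a 64-bit slice of SHA-256 as the hash and takes the bias constant and the small-range (linear counting) correction from the HyperLogLog paper. Each register stores the leading-zero count plus one, as in the paper. The class name and the parameter b are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b=12):
        self.b = b                      # first b bits of the hash pick the bucket
        self.m = 1 << b                 # number of buckets
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # paper's constant, m >= 128

    def add(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash
        rest_bits = 64 - self.b
        bucket = x >> rest_bits                      # leading b bits
        rest = x & ((1 << rest_bits) - 1)            # counting space
        rank = rest_bits - rest.bit_length() + 1     # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        # alpha * m * (harmonic mean of 2^register over all buckets)
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(b=12)
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the registers
estimate = hll.count()
print(round(estimate))
```

With 4096 buckets the paper's expected relative error is about 1.04 / sqrt(4096), roughly 1.6%, and the registers take a few kilobytes no matter how many elements are added.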
[Figure: the hash 01011010001011100101001001101010 split into leading bucket bits and the remaining counting space]
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
Number of distinct words in all of Shakespeare's work
HYPERLOGLOG
➤ Benefits:
➤ Constant memory
➤ Super fast (calculating MSB is cheap)
➤ Can give accurate count with <1% error
➤ Drawbacks:
➤ Has a margin of error (albeit small)
What did we learn?
Data processing has trade-offs.
Probabilistic algorithms trade accuracy for speed and space.
Often we only care about answers that are mostly correct but available now.
Sometimes the “right” answer is impossible to compute or simply doesn’t exist.
But mostly…
Probabilistic algorithms are just damn cool.
What about the code?
ALGORITHM IMPLEMENTATIONS
➤ Algebird - https://github.com/twitter/algebird
➤ Bloom filter
➤ Count-min sketch
➤ HyperLogLog
➤ stream-lib - https://github.com/addthis/stream-lib
➤ Bloom filter
➤ Count-min sketch
➤ HyperLogLog
➤ Boom Filters - https://github.com/tylertreat/BoomFilters
➤ Bloom filter
➤ Scalable Bloom filter
➤ Stable Bloom filter
➤ Count-min sketch
➤ HyperLogLog
OTHER COOL PROBABILISTIC ALGORITHMS
➤ Counting Bloom filter (and many other Bloom variations)
➤ Bloomier filter (encode functions instead of sets)
➤ Cuckoo filter (Bloom filter w/ cuckoo hashing)
➤ q-digest (quantile approximation)
➤ t-digest (online accumulation of rank-based statistics)
➤ Locality-sensitive hashing (hash similar items to same buckets)
➤ MinHash (set similarity)
➤ Miller–Rabin (primality testing)
➤ Karger’s algorithm (min cut of connected graph)
@tyler_treat
github.com/tylertreat
bravenewgeek.com
Thanks
We’re hiring!
BIBLIOGRAPHY
Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf
Cormode, G., & Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
Deng, F., & Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf
Tarkoma, S., Rothenberg, C., & Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf
Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods

State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...inside-BigData.com
 
Year of the #WiFiCactus
Year of the #WiFiCactusYear of the #WiFiCactus
Year of the #WiFiCactusDefCamp
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
 

Probabilistic algorithms for fun and pseudorandom profit

  • 1. PROBABILISTIC ALGORITHMS for fun and pseudorandom profit Tyler Treat / 12.5.2015
  • 2. ABOUT THE SPEAKER ➤ Backend engineer at Workiva ➤ Messaging platform tech lead ➤ Distributed systems ➤ bravenewgeek.com ➤ @tyler_treat, tyler.treat@workiva.com
  • 3. [Figure: Time vs. Data] Time: Batch (days, hours), Streaming (minutes, seconds), Real-Time™ (I need it now, dammit!). Data: Meh, data (I can store this on my laptop), Oi, data! (We’re gonna need a bigger boat…), Big data™ (IoT, sensors). /dev/null
  • 4. [Figure: Time vs. Data, same labels as slide 3] Regions: Not Interesting, Kinda Interesting, Pretty Interesting. /dev/null
  • 6. THIS TALK IS NOT ➤ About Samza, Storm, Spark Streaming et al. ➤ Strictly about stream-processing techniques ➤ Mathy ➤ Statistics-y
  • 7. THIS TALK IS ➤ About basic probability theory ➤ About practical design trade-offs ➤ About algorithms & data structures ➤ About dealing with large or unbounded datasets ➤ A marriage of CS & engineering
  • 8. OUTLINE ➤ Terminology & context ➤ Why probabilistic algorithms? ➤ Bloom filters & variants ➤ Count-min sketch ➤ HyperLogLog
  • 9. Randomized Algorithms (random input) ➤ Las Vegas Algorithms: correct result, gamble on speed ➤ Monte Carlo Algorithms: deterministic speed, gamble on result
  • 11. DEFINING SOME TERMINOLOGY ➤ Online - processing elements as they arrive ➤ Offline - entire dataset is known ahead of time ➤ Real-time - hard constraint on response time ➤ A priori knowledge - something known beforehand
  • 12. BATCH VS STREAMING ➤ Batch ➤ Offline ➤ Heuristics/multiple passes ➤ Data structures less important documents search index
  • 13. BATCH VS STREAMING ➤ Streaming ➤ Online, one pass ➤ Usually real-time (but not necessarily) ➤ Potentially unbounded transactions caches fraud analytics
  • 14. 3 DATA INTEGRATION QUESTIONS ➤ How do you get the data? ➤ How do you disseminate the data? ➤ How do you process the data?
  • 15. 3 DATA INTEGRATION QUESTIONS ➤ How do you get the data (quickly)? ➤ How do you disseminate the data (quickly)? ➤ How do you process the data (quickly)?
  • 16. Denormalization is critical to performance at scale.
  • 17. How to count the number of distinct document views across Wikipedia?
  • 20. Wikipedia has ~38 million pages.
  • 21. 38,000,000 pages x (16-byte guid + 8-byte integer) ≈ 1GB
  • 22. ➤ Not unreasonable for modern hardware ➤ Held in memory for lifetime of process so will move to old GC generations—expensive to collect! ➤ Now we want to track views per unique IP address ➤ >4 billion IPv4 addresses ➤ Naive solutions quickly become intractable
  • 25. HAVE YOUR CAKE AND EAT IT TOO? The “Lambda Architecture” [Figure: Stream Processing, Batch Processing, App]
  • 26. Probabilistic algorithms trade accuracy for space and performance.
  • 27. “Sketching” data structures make this trade by storing a summary of the dataset when storing it entirely is prohibitively expensive.
  • 28. Bloom Filters B. H. Bloom.
 Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.
  • 29. Answers a simple question: is this element a member of a set? S ⊆ 𝕌
 x ∈ S
  • 30. SET MEMBERSHIP ➤ Is this URL malicious? ➤ Is this IP address blacklisted? ➤ Is this word contained in the document? ➤ Is this record in the database? ➤ Has this transaction been processed?
  • 31. Hash Table entry for each member
  • 32. Bit Array bit for each element in universe 0101101000101110010100100110101…
  • 33. BLOOM FILTERS ➤ Bloom filters store set memberships ➤ Answers “not in set” or “probably in set”
  • 34. [Figure: Bloom filter in front of a secondary store] ➤ Do you have key 1? Filter: no (no store access) ➤ Do you have key 2? Filter: yes; store: yes, here’s key 2 (necessary access) ➤ Do you have key 3? Filter: yes; store: no (unnecessary access)
  • 35. BLOOM FILTERS ➤ 2 operations: add, lookup ➤ Allocate bit array of length m ➤ k hash functions ➤ Configure m and k for desired false-positive rate
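As a rough illustration of "configure m and k for desired false-positive rate", the sketch below uses the standard textbook sizing formulas, which are not shown on the slides: m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, where n is the expected number of elements and p the target false-positive rate. The function name `bloom_parameters` is hypothetical.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k) for n expected elements and target false-positive rate p.

    Uses the standard optimal-sizing formulas (an assumption; the slides
    only say that m and k are configurable).
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the array
    k = max(1, round(m / n * math.log(2)))                # number of hash functions
    return m, k


m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)
```

For one million elements at a 1% false-positive rate this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.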
  • 36. BLOOM FILTERS ➤ Add element: ➤ Hash with k functions to get k indices ➤ Set bits at each index ➤ Lookup: ➤ Hash with k functions to get k indices ➤ Check bit at each index ➤ If any bit is unset, element not in set
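The add and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from any of the libraries named in the talk: the class and method names are hypothetical, and deriving the k indices from one SHA-256 digest via double hashing is an assumption made to keep the example short.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                              # number of bits
        self.k = k                              # number of hash functions
        self.bits = bytearray((m + 7) // 8)     # bit array, all zero

    def _indices(self, item):
        # Assumption: simulate k hash functions with double hashing
        # over two 64-bit halves of a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set the bit at each of the k indices.
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item):
        # If any bit is unset, the element is definitely not in the set;
        # otherwise it is "probably in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add("http://example.com/malicious")
print(bf.lookup("http://example.com/malicious"))  # True: no false negatives
```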
  • 37. BLOOM FILTERS ➤ Benefits: ➤ More space-efficient than hash table or bit array ➤ Can determine trade-off between accuracy and space ➤ Drawbacks: ➤ Some elements potentially more sensitive to false positives than others (solvable by partitioning) ➤ Can’t remove elements ➤ Requires a priori knowledge of the dataset ➤ Over-provisioned filter wastes space
  • 38. Bloom filters are great for efficient offline processing, but what about streaming?
  • 39. BLOOM FILTERS WITH A TWIST ➤ Rotating Bloom filters ➤ e.g. remember everything in the last hour
 ➤ Scalable Bloom Filters ➤ Dynamically allocating chained filters
 ➤ Stable Bloom Filters ➤ Continuously evict stale data
  • 40. Scalable Bloom Filters P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison.
 Scalable Bloom Filters. 2007.
  • 46. [Figure: l chained filters, each with error probability P0] P0 = error prob. of 1 filter; l = # filters; P = compound error prob.: $P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$
  • 47. P0 = 0.1: $P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$
  • 48. SCALABLE BLOOM FILTERS ➤ Questions: ➤ When to add a new filter? ➤ How to place a tight upper bound on P?
  • 49. SCALABLE BLOOM FILTERS ➤ When to add a new filter? ➤ Fill ratio p = # set bits / # bits ➤ Add new filter when target p is reached ➤ Optimal target p = 0.5 (math follows from paper)
  • 50. SCALABLE BLOOM FILTERS ➤ How to place a tight upper bound on P? ➤ Apply tightening ratio r to P0, where 0 < r < 1 ➤ Start with 1 filter, error probability P0 ➤ When full, add new filter, error probability $P_1 = P_0 r$ ➤ Results in geometric series: ➤ Series converges on target error probability P
  • 51. P0 = 0.1, r = 0.5: $P = 1 - \prod_{i=0}^{l-1} (1 - P_0 r^i)$
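To see why the tightening ratio matters, the sketch below evaluates the compound error probability numerically for a growing number of filters. The function name `compound_error` is hypothetical. The bound P0 / (1 - r) used for comparison comes from a standard union-bound argument (the compound probability is at most the sum of the per-filter probabilities, a geometric series); it is stated here as an assumption and is not taken from the slides.

```python
def compound_error(p0, r, l):
    """Compound false-positive probability of l chained filters,
    where filter i has error probability p0 * r**i."""
    all_correct = 1.0
    for i in range(l):
        all_correct *= 1.0 - p0 * r ** i
    return 1.0 - all_correct


p0 = 0.1
for l in (1, 5, 20, 50):
    untightened = compound_error(p0, 1.0, l)   # r = 1: every filter uses P0
    tightened = compound_error(p0, 0.5, l)     # r = 0.5: errors shrink geometrically
    print(l, round(untightened, 4), round(tightened, 4))

print("geometric-series bound:", p0 / (1 - 0.5))
```

Without tightening (r = 1) the compound error climbs toward 1 as filters are added; with r = 0.5 it levels off below the geometric-series bound of 0.2.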
  • 52. SCALABLE BLOOM FILTERS ➤ Add elements to last filter ➤ Check each filter on lookups ➤ Tightening ratio r controls m and k for new filters
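A simplified sketch of these rules follows. It is an illustration under stated assumptions rather than the algorithm from the paper: every name is hypothetical, each new filter is sized with the standard formulas for a fixed per-filter capacity (the paper also grows filter sizes, which is omitted here), and indices come from double hashing over a SHA-256 digest.

```python
import hashlib
import math


class _Filter:
    """One plain Bloom filter sized for `capacity` items at `error` FP rate."""

    def __init__(self, capacity, error):
        self.m = math.ceil(-capacity * math.log(error) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)
        self.set_bits = 0

    def _indices(self, item):
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            byte, mask = idx // 8, 1 << (idx % 8)
            if not self.bits[byte] & mask:
                self.bits[byte] |= mask
                self.set_bits += 1          # track fill ratio incrementally

    def lookup(self, item):
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item))

    def fill_ratio(self):
        return self.set_bits / self.m       # p = # set bits / # bits


class ScalableBloomFilter:
    def __init__(self, capacity=1000, p0=0.01, r=0.5, target_fill=0.5):
        self.capacity, self.p0, self.r = capacity, p0, r
        self.target_fill = target_fill
        self.filters = [_Filter(capacity, p0)]

    def add(self, item):
        # Add a new, tighter filter once the last one reaches the target fill ratio.
        if self.filters[-1].fill_ratio() >= self.target_fill:
            error = self.p0 * self.r ** len(self.filters)
            self.filters.append(_Filter(self.capacity, error))
        self.filters[-1].add(item)          # adds always go to the last filter

    def lookup(self, item):
        return any(f.lookup(item) for f in self.filters)  # check every filter


sbf = ScalableBloomFilter(capacity=100)
for i in range(1000):
    sbf.add("user-%d" % i)
print(len(sbf.filters), sbf.lookup("user-42"))
```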
  • 53. SCALABLE BLOOM FILTERS ➤ Benefits: ➤ Can grow dynamically to accommodate dataset ➤ Provides tight upper bound on false-positive rate ➤ Can control growth rate ➤ Drawbacks: ➤ Size still proportional to dataset ➤ Additional computation on adds (negligible amortized)
  • 54. Stable Bloom Filters F. Deng, D. Rafiei.
 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.
  • 55. DUPLICATE DETECTION ➤ Query processing ➤ URL crawling ➤ Monitoring distinct IP addresses ➤ Advertiser click streams ➤ Graph processing
  • 57. Bloom filters are remarkably useful for dealing with graph data.
  • 58. GRAPH PROCESSING ➤ Detecting cycles ➤ Pruning search space ➤ E.g. often used in bioinformatics ➤ Storing chemical structures, properties, and molecular fingerprints in filters to optimize searches and determine structural similarities ➤ Rapid classification of DNA sequences as large as the human genome
  • 59. GRAPH PROCESSING ➤ Store crawled nodes in memory ➤ Set of nodes may be too large to fit in memory ➤ Store crawled nodes in secondary storage ➤ Too many searches to perform in limited time
  • 60. Precisely eliminating duplicates in an unbounded stream isn’t feasible with limited space and time.
  • 61. Efficacy/Efficiency Conjecture:
 In many situations, a quick answer with an allowable error rate is better than a precise one that is slow.
  • 62. Staleness Conjecture:
 In many situations, more recent data has more value than stale data.
  • 63. STABLE BLOOM FILTERS ➤ Discards old data to make room for new data ➤ Replace bit array with array of d-bit counters ➤ Initialize counters to zero ➤ Maximum counter value $Max = 2^d - 1$
  • 64. STABLE BLOOM FILTERS ➤ Add element: ➤ Select P random counters and decrement by one ➤ Hash with k functions to get k indices ➤ Set counters at each index to Max ➤ Lookup: ➤ Hash with k functions to get k indices ➤ Check counter at each index ➤ If any counter is zero, element not in set
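The add and lookup steps for a Stable Bloom filter can be sketched as below. This follows the steps as listed on the slide (P independently chosen random counters are decremented on each add); the class name, the seeded `random.Random` instance, and the double-hashing scheme are assumptions for illustration. See the paper for how to choose m, k, d and P.

```python
import hashlib
import random


class StableBloomFilter:
    def __init__(self, m, k, d, p, seed=None):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1             # Max = 2^d - 1
        self.cells = [0] * m                # d-bit counters, initialized to zero
        self.rng = random.Random(seed)

    def _indices(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Evict: decrement P randomly selected counters (never below zero).
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Insert: set the k hashed counters to Max.
        for idx in self._indices(item):
            self.cells[idx] = self.max

    def lookup(self, item):
        # Any zero counter means "not in set" (possibly a false negative
        # if the element was evicted).
        return all(self.cells[idx] > 0 for idx in self._indices(item))


sbf = StableBloomFilter(m=1000, k=3, d=3, p=10, seed=1)
sbf.add("click-123")
print(sbf.lookup("click-123"))  # True immediately after the add
```

With d=1 and p=0 nothing is ever evicted and the structure behaves like a classic Bloom filter, which matches the special case noted in the talk.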
  • 65. STABLE BLOOM FILTERS ➤ Classic Bloom filter a special case of SBF w/ d=1, P=0 ➤ Tight upper bound on false positives ➤ FP rate asymptotically approaches configurable fixed constant (stable-point property) ➤ See paper for math and parameter settings ➤ Evicting data introduces false negatives
  • 66. STABLE BLOOM FILTERS ➤ Benefits: ➤ Fixed memory allocation ➤ Evicts old data to make room for new data ➤ Provides tight upper bound on false positives ➤ Drawbacks: ➤ Introduces false negatives ➤ Additional computation on adds
  • 67. Count-Min Sketch G. Cormode, S. Muthukrishnan.
 An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.
  • 68. Can we count element frequencies using sub-linear space? [Table: page views] 94.136.205.1: 7; 132.208.90.15: 4; 54.222.151.15: 11
  • 69. COUNT-MIN SKETCH ➤ Approximates frequencies in sub-linear space ➤ Matrix with w columns and d rows ➤ Each row has a hash function ➤ Each cell initialized to zero ➤ When element arrives: ➤ Hash for each row ➤ Increment each counter by 1 ➤ freq(element) = min counter value
  • 70. COUNT-MIN SKETCH ➤ Why the minimum? ➤ Possibility for collisions between elements ➤ Counter may be incremented by multiple elements ➤ Taking minimum counter value gives closer approximation
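A minimal count-min sketch along these lines might look as follows. The class and method names are hypothetical, and salting a SHA-256 hash with the row number to get one hash function per row is an assumption made for brevity. The example IP addresses and counts are the ones from the page-views slide.

```python
import hashlib


class CountMinSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d                       # w columns, d rows
        self.rows = [[0] * w for _ in range(d)]     # every cell starts at zero

    def _index(self, row, item):
        # Assumption: one hash function per row, obtained by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        # Increment one counter in each row.
        for row in range(self.d):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the closest approximation (never an underestimate).
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.d))


views = {"94.136.205.1": 7, "132.208.90.15": 4, "54.222.151.15": 11}
cms = CountMinSketch(w=256, d=4)
for ip, n in views.items():
    for _ in range(n):
        cms.add(ip)
for ip in views:
    print(ip, cms.estimate(ip))
```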
  • 71. COUNT-MIN SKETCH ➤ Benefits: ➤ Simple! ➤ Sub-linear space ➤ Useful for detecting “heavy hitters” ➤ Easy to track top-k by adding a min-heap ➤ Drawbacks: ➤ Biased estimator: may overestimate, never underestimates ➤ Better suited to Zipfian distributions & rare events
  • 72. HyperLogLog P. Flajolet, É. Fusy, O. Gandouet, F. Meunier.
 HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.
  • 73. How do we count distinct things in a stream?
  • 74. COUNTING PROBLEMS ➤ E.g. how many different words are used in Wikipedia? ➤ Counter per element explodes memory ➤ Usually requires memory proportional to cardinality ➤ Can we approximate cardinality with constant space?
  • 75. HYPERLOGLOG ➤ The name: can estimate cardinality of set w/ cardinality Nmax using loglog(Nmax) + O(1) bits ➤ Hash element to integer ➤ Count number of leading 0’s in binary form of hash ➤ Track highest number of leading 0’s, n ➤ Cardinality ≈ $2^{n+1}$
  • 76. HYPERLOGLOG ➤ stream = [“foo”, “bar”, “baz”, “qux”] ➤ h(“foo”) = 10100001 ➤ h(“bar”) = 01110111 ➤ h(“baz”) = 01110100 ➤ h(“qux”) = 10100011 ➤ n = 1 ➤ |stream| ≈ $2^{n+1} = 2^2 = 4$
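The worked example on this slide can be reproduced directly. The sketch below takes the four 8-bit hash values from the slide as given (they are illustrative values, not the output of a real hash function); the helper name `leading_zeros` is hypothetical.

```python
def leading_zeros(bits):
    """Number of leading '0' characters in a binary string."""
    return len(bits) - len(bits.lstrip("0"))


# Hash values as given on the slide.
hashes = {
    "foo": "10100001",
    "bar": "01110111",
    "baz": "01110100",
    "qux": "10100011",
}

n = max(leading_zeros(h) for h in hashes.values())  # highest run of leading 0's
estimate = 2 ** (n + 1)
print(n, estimate)  # 1 4
```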
  • 78. It’s actually not magic but just a few really clever observations.
  • 79. With 50/50 odds, how long will it take to flip 3 heads in a row? 20? 100?
  • 80. HYPERLOGLOG ➤ Replace “heads” and “tails” with 0’s and 1’s ➤ Count leading consecutive 0’s in binary form of hash ➤ E.g. imagine a 4-bit hash, 16 possible values: ➤ 0000: 4 leading 0’s ➤ 0001: 3 leading 0’s ➤ 0011, 0010: 2 leading 0’s ➤ 0100, 0111, 0110, 0101: 1 leading 0 ➤ 1111, 1110, 1001, 1010, 1101, 1100, 1011, 1000: 0 leading 0’s ➤ Assume good hash function → 1/16 odds for each permutation
  • 81. HYPERLOGLOG ➤ Track highest number of leading 0’s, n ➤ n = 0 → 8/16 = 1/2 odds ➤ n = 1 → 4/16 = 1/4 odds ➤ n = 2 → 2/16 = 1/8 odds ➤ n = 3 → 1/16 odds ➤ Cardinality ≈ how many things did we have to look at? ➤ E.g. highest count = 1 → 1/4 odds → cardinality 4
  • 83. HYPERLOGLOG ➤ 1/2 of all binary numbers start with 1 ➤ Each additional bit cuts the probability in half: ➤ 1/4 start with 01 ➤ 1/8 start with 001 ➤ 1/16 start with 0001 ➤ etc. ➤ P(run of length n) = $1 / 2^{n+1}$ ➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things til we saw it (cardinality 8) ➤ Cardinality ≈ $\text{prob}^{-1}$ (reciprocal of probability)
  • 85. HYPERLOGLOG ➤ Use multiple buckets ➤ Use first few bits of hash to determine bucket ➤ Use remaining bits to count 0’s ➤ Each bucket tracks its own count ➤ Take harmonic mean of all buckets to get cardinality ➤ $\min(x_1 \ldots x_n) \le H(x_1 \ldots x_n) \le n \min(x_1 \ldots x_n)$ ➤ [Figure: hash 01011010001011100101001001101010 split into leading bucket bits and the remaining counting space]
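Putting the bucket idea together, a compact HyperLogLog sketch could look like this. It is an illustration, not the implementation from any library mentioned in the talk. Assumptions beyond the slides: a 64-bit hash taken from SHA-256, the bias-correction constant 0.7213 / (1 + 1.079 / m) (valid for m ≥ 128) and the small-range linear-counting correction, both as described in the HyperLogLog paper. Each register stores the leading-zero count plus one, which corresponds to the n + 1 exponent used earlier.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b=12):
        self.b = b                       # first b bits of the hash pick the bucket
        self.m = 1 << b                  # number of buckets (needs m >= 128 here)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        bucket = x >> (64 - self.b)                      # leading b bits
        rest = x & ((1 << (64 - self.b)) - 1)            # remaining counting space
        # rank = number of leading 0's in the remaining bits, plus one
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)                 # bias correction, m >= 128
        # alpha * m * (harmonic mean of 2^rank over all buckets)
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            estimate = m * math.log(m / zeros)           # small-range correction
        return estimate


hll = HyperLogLog(b=12)
for i in range(50_000):
    hll.add("visitor-%d" % i)
print(round(hll.count()))
```

With 4,096 buckets the expected relative error is on the order of 1.04 / sqrt(4096), a little under 2%, and re-adding the same elements does not change the estimate.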
  • 87. HYPERLOGLOG ➤ Benefits: ➤ Constant memory ➤ Super fast (calculating MSB is cheap) ➤ Can give accurate count with <1% error ➤ Drawbacks: ➤ Has a margin of error (albeit small)
  • 88. What did we learn?
  • 89. Data processing has trade-offs.
  • 90. Probabilistic algorithms trade accuracy for speed and space.
  • 91. Often we only care about answers that are mostly correct but available now.
  • 92. Sometimes the “right” answer is impossible to compute or simply doesn’t exist.
  • 95. What about the code?
  • 96. ALGORITHM IMPLEMENTATIONS ➤ Algebird - https://github.com/twitter/algebird ➤ Bloom filter ➤ Count-min sketch ➤ HyperLogLog ➤ stream-lib - https://github.com/addthis/stream-lib ➤ Bloom filter ➤ Count-min sketch ➤ HyperLogLog ➤ Boom Filters - https://github.com/tylertreat/BoomFilters ➤ Bloom filter ➤ Scalable Bloom filter ➤ Stable Bloom filter ➤ Count-min sketch ➤ HyperLogLog
  • 97. OTHER COOL PROBABILISTIC ALGORITHMS ➤ Counting Bloom filter (and many other Bloom variations) ➤ Bloomier filter (encode functions instead of sets) ➤ Cuckoo filter (Bloom filter w/ cuckoo hashing) ➤ q-digest (quantile approximation) ➤ t-digest (online accumulation of rank-based statistics) ➤ Locality-sensitive hashing (hash similar items to same buckets) ➤ MinHash (set similarity) ➤ Miller–Rabin (primality testing) ➤ Karger’s algorithm (min cut of connected graph)
  • 99. BIBLIOGRAPHY
 Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
 Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf
 Cormode, G., & Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
 Deng, F., & Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
 Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
 Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf
 Tarkoma, S., Rothenberg, C., & Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf
 Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods