A Gentle Introduction to Locality Sensitive Hashing with Apache Spark

François Garillot

An implementation war story of locality-sensitive hashing with Apache Spark, with performance lessons.

A GENTLE INTRODUCTION TO
APACHE SPARK AND
LOCALITY-SENSITIVE
HASHING
1

FRANCOIS GARILLOT
(FORMERLY) TYPESAFE
francois@garillot.net
@huitseeker
2

LOCALITY-SENSITIVE HASHING
▸ A story : Why LSH
▸ How it works & hash families
▸ LSH distribution
▸ Beware : WIP
3

SPARK TENETS
▸ broadcast variables
▸ per-partition commands
▸ shuffle sparsely
4

SEGMENTATION
▸ small sample: 289421 users
▸ larger sample : 5684403 users
46K websites, ultimately users
4 personal laptops, 4 provided laptops
8

K-MEANS COMPLEXITY
Find $k$ with the 'elbow method' on the within-cluster sum of squares.
Then
9

EM - GAUSSIAN MIXTURE
With dimensions, mixtures,
10

LOCALITY-SENSITIVE HASHING FUNCTIONS
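A family $H$ of hashing functions is $(d_1, d_2, p_1, p_2)$-sensitive if, for $h$ drawn at random from $H$:
▸ if $d(x, y) \le d_1$ then $P[h(x) = h(y)] \ge p_1$
▸ if $d(x, y) \ge d_2$ then $P[h(x) = h(y)] \le p_2$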
11

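DISTANCES! (THOSE AND MANY OTHERS)
▸ Hamming distance: $h_i(x) = x_i$, where $i$ is a randomly chosen index
▸ Jaccard: $h_\pi(S) = \min_{s \in S} \pi(s)$, where $\pi$ is a random permutation
▸ Cosine distance: $h_r(x) = \mathrm{sign}(r \cdot x)$, where $r$ is a random vector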
12
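
The three families named on this slide can be sketched in a few lines of plain Scala. This is a minimal illustration, not the code from the talk: the object and function names are hypothetical, and the MinHash below replaces a true random permutation with the universal hash (a * x + b) mod p, which is a common approximation.

```scala
import scala.util.Random

object HashFamilies {
  // Hamming distance: bit sampling, i.e. return the bit at a fixed, randomly chosen index.
  def bitSampling(index: Int): Array[Boolean] => Boolean = bits => bits(index)

  // Jaccard: MinHash. Assumption of this sketch: (a * x + b) mod prime
  // stands in for a random permutation of the element ids.
  private val prime = 2147483647L // 2^31 - 1
  def minHash(seed: Long): Set[Int] => Long = {
    val rng = new Random(seed)
    val a = 1L + rng.nextInt(Int.MaxValue - 1)
    val b = rng.nextInt(Int.MaxValue).toLong
    // Requires a non-empty set: min of an empty iterator throws.
    elems => elems.iterator.map(x => Math.floorMod(a * x + b, prime)).min
  }

  // Cosine distance: sign of the dot product with a random Gaussian vector.
  def signProjection(seed: Long, dim: Int): Array[Double] => Boolean = {
    val rng = new Random(seed)
    val r = Array.fill(dim)(rng.nextGaussian())
    v => v.indices.map(i => v(i) * r(i)).sum >= 0.0
  }
}
```

Each function is rebuilt from a seed or an index, so two sets can be compared by counting how many seeds give them the same MinHash value.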

EARTH MOVER'S DISTANCE
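Find optimal flow $F = (f_{ij})$ minimizing: $\sum_{i,j} f_{ij} d_{ij}$
Then: $\mathrm{EMD}(P, Q) = \frac{\sum_{i,j} f_{ij} d_{ij}}{\sum_{i,j} f_{ij}}$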
14

A WORD ON MODULARITY
LSH for EMD introduced by Charikar in the Simhash paper (2002).
Yet there is no place to plug in your own LSH family in existing implementations (e.g. scikit, mrsqueeze)!
15

LSH AMPLIFICATION : CONCATENATIONS AND PARALLEL
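▸ basic LSH: $P[h(x) = h(y)] = p$
▸ AND (series) construction: $P[h_1(x) = h_1(y) \wedge \dots \wedge h_r(x) = h_r(y)] = p^r$
▸ OR (parallel) construction: $P[\exists i \le b,\ h_i(x) = h_i(y)] = 1 - (1 - p)^b$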
16
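
To make the effect of amplification concrete: if one hash function collides on a pair with probability p, then r concatenated hashes collide with probability p^r, and at least one of b independent hashes collides with probability 1 - (1 - p)^b. The sketch below computes both, plus their combination into b bands of r hashes. The names are hypothetical and chosen for this example only.

```scala
object Amplification {
  // AND (series): all r concatenated hashes must agree.
  def andConstruction(p: Double, r: Int): Double = math.pow(p, r)

  // OR (parallel): at least one of b independent hashes agrees.
  def orConstruction(p: Double, b: Int): Double = 1.0 - math.pow(1.0 - p, b)

  // Banding: b parallel bands, each made of r concatenated hashes.
  def banded(p: Double, r: Int, b: Int): Double =
    orConstruction(andConstruction(p, r), b)
}
```

Plotting `banded(p, r, b)` against p gives the usual S-curve: increasing r pushes the threshold towards 1, increasing b pulls it back towards 0.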

BASIC LSH
// hash every record, keyed by its id
val hashCollection = records.map(s => (getId(s), s))
  .mapValues(s => getHash(s, hashers))
// cut each hash into bands, keyed by band index
val subArray = hashCollection.flatMap {
  case (recordId, hash) =>
    hash.grouped(hashLength / numberBands).zipWithIndex.map {
      case (band, bandIndex) => (bandIndex, (band, recordId))
    }
}
18

LOOKUP
def findCandidates(record: Iterable[String], hashers: Array[Int => Int], mBands: BandType) = {
  val hash = getHash(record, hashers)
  val subArrays = partitionArray(hash).zipWithIndex
  subArrays.flatMap { case (band, bandIndex) =>
    val hashedBucket = mBands.lookup(bandIndex).
      headOption.
      flatMap{_.get(band)}
    hashedBucket
  }.flatten.toSet
}
19
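
The index construction and lookup shown on the last two slides rely on helpers that are not defined in the deck (`getId`, `getHash`, `partitionArray`, `BandType`). The same idea can be tried without a cluster on plain Scala collections. This is a sketch under assumptions: signatures are already computed, their length is a multiple of `numberBands`, and the `Banding` object and its `Buckets` type are hypothetical stand-ins for the RDD-based structures.

```scala
object Banding {
  // (bandIndex, band content) -> ids of the records that have this band
  type Buckets = Map[(Int, List[Int]), Set[String]]

  // Cut a signature into numberBands consecutive bands, tagged with their index.
  private def bands(hash: Array[Int], numberBands: Int): Iterator[(Int, List[Int])] =
    hash.grouped(hash.length / numberBands).zipWithIndex.map {
      case (band, bandIndex) => (bandIndex, band.toList)
    }

  // Index every record under each of its bands.
  def buildBuckets(signatures: Map[String, Array[Int]], numberBands: Int): Buckets =
    signatures.toSeq
      .flatMap { case (recordId, hash) => bands(hash, numberBands).map(key => (key, recordId)) }
      .groupBy(_._1)
      .map { case (key, entries) => key -> entries.map(_._2).toSet }

  // Candidates are the records that share at least one band with the query.
  def findCandidates(query: Array[Int], numberBands: Int, buckets: Buckets): Set[String] =
    bands(query, numberBands)
      .flatMap(key => buckets.getOrElse(key, Set.empty[String]))
      .toSet
}
```

For example, with signatures `a = [1, 2, 3, 4]` and `b = [1, 2, 9, 9]` cut into 2 bands, a query `[1, 2, 0, 0]` matches both records on band 0.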

getHash(record,hashers)
DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS
records.mapPartitions { iter =>
  // one RNG per partition, created on the executor
  val rng = new scala.util.Random()
  iter.map(x => hashers.flatMap { h => getHashFunction(rng, h)(x) })
}
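
One way to read the slide title: a seed is a single number, while a materialized permutation or hash function can be large, so it is cheaper to ship seeds and rebuild the functions where they are used. The sketch below illustrates this in plain Scala. It is not the `getHashFunction(rng, h)` from the slide: the signature taking a `Long` seed, and the affine function with an odd multiplier (a permutation of the 32-bit integers), are assumptions made for this example.

```scala
import scala.util.Random

object SeededHashers {
  // Rebuild a hash function deterministically from its seed.
  // With an odd multiplier, x => a * x + b permutes the 32-bit integers.
  def getHashFunction(seed: Long): Int => Int = {
    val rng = new Random(seed)
    val a = rng.nextInt() | 1
    val b = rng.nextInt()
    x => a * x + b
  }

  // MinHash-style signature: one minimum per seed. Requires a non-empty record.
  def getHash(record: Iterable[Int], seeds: Array[Long]): Array[Int] = {
    val hashers = seeds.map(getHashFunction)
    hashers.map(h => record.iterator.map(h).min)
  }
}
```

Because each function depends only on its seed, every worker that receives the same `seeds` array computes identical signatures.
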
20

AND YET, OOM
21

BASIC LSH
WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With data points, choose and
, to solve the problem
22
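
The slide does not spell out the hash function. The usual construction for Euclidean distance with a 2-stable (Gaussian) distribution is h(v) = floor((a · v + b) / w), with a drawn from a standard Gaussian and b uniform in [0, w). The sketch below assumes that construction; the names and the bucket width `w` are hypothetical parameters of this example.

```scala
import scala.util.Random

object StableHash {
  // h(v) = floor((a . v + b) / w), with a ~ N(0, I) and b uniform in [0, w).
  def gaussianHash(seed: Long, dim: Int, w: Double): Array[Double] => Int = {
    val rng = new Random(seed)
    val a = Array.fill(dim)(rng.nextGaussian())
    val b = rng.nextDouble() * w
    v => {
      val dot = a.indices.map(i => a(i) * v(i)).sum
      math.floor((dot + b) / w).toInt
    }
  }
}
```

A larger `w` makes buckets wider, so more distant points collide. Concatenating several such hashes narrows the buckets again.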

WEB LOGS ARE SPARSE
Input : hits per user, over 6 months, 2x50-ish integers/user (4GB)
Output of length 1000 integers per user : 10 (parallel) bands, 100
(concatenated) hashes
64-bit integers : 40 GB
Yet !
23
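
The 4 GB and 40 GB figures can be checked with back-of-the-envelope arithmetic. The sketch assumes they refer to the larger sample of 5684403 users from the segmentation slide and that input integers are also stored on 64 bits. Both are assumptions, not statements from the talk.

```scala
object SignatureSize {
  val users = 5684403L        // larger sample (assumed to be the one measured)
  val inputIntsPerUser = 100L // "2x50-ish integers/user"
  val bands = 10L             // parallel bands
  val hashesPerBand = 100L    // concatenated hashes
  val bytesPerInt = 8L        // 64-bit integers

  val inputBytes: Long = users * inputIntsPerUser * bytesPerInt
  val outputBytes: Long = users * bands * hashesPerBand * bytesPerInt

  def gigabytes(bytes: Long): Double = bytes / 1e9
}
```

Under these assumptions the input is about 4.5 GB and the signatures about 45 GB, in line with the rounded figures on the slide: the hashed output is ten times larger than the sparse input it summarizes.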

ENTROPY LSH (PANIGRAHY 2006)
REPLACE TABLES BY OFFSETS
Query offsets $q + \delta_1, \dots, q + \delta_L$, chosen randomly from the surface
of $B(q, r)$, the sphere of radius $r$ centered at $q$
24
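
Drawing offsets from the surface of a sphere can be done by normalizing a Gaussian vector, which is uniformly distributed in direction. The sketch below generates such offsets around a query point. The names `q` (query), `r` (radius) and `count` are assumptions of this example, since the symbols are missing from the slide text.

```scala
import scala.util.Random

object EntropyOffsets {
  // A point uniformly distributed on the sphere of radius r centered at q.
  def offset(q: Array[Double], r: Double, rng: Random): Array[Double] = {
    val g = Array.fill(q.length)(rng.nextGaussian())
    val norm = math.sqrt(g.map(x => x * x).sum)
    q.indices.map(i => q(i) + r * g(i) / norm).toArray
  }

  // Several offsets for one query, reproducible from a seed.
  def offsets(q: Array[Double], r: Double, count: Int, seed: Long): Seq[Array[Double]] = {
    val rng = new Random(seed)
    Seq.fill(count)(offset(q, r, rng))
  }
}
```

Each offset is then hashed and looked up in the same table, in place of looking up the query itself in many tables.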

ENTROPY LSH
WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With data points, choose and
, to solve the problem with as
few as hash tables
25

BUT ... NETWORK COSTS
▸ Basic LSH : look up buckets,
▸ Entropy LSH : search for offsets
26

LAYERED LSH (BAHMANI ET AL. 2012)
Output of your LSH family is in , with e.g. a cosine norm.
For closer points, the chance of hashes hashing to the same bucket is
high!
27

LAYERED LSH
Have an LSH family for your norm on
Likely that for all offsets
28

LAYERED LSH
Output of hash generation is (GH(p), (H(p), p)) for all p.
In Spark, group, or custom partitioner for (H(p), p) RDD.
Network cost :
29
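
The pair layout on this slide can be illustrated on plain collections: key each point by the outer hash of its bucket, keep the bucket and the point as the value, and group by key. This sketch reads GH(p) as G applied to H(p), which is an assumption. The `layer` function and its toy hash functions are hypothetical, and `groupBy` stands in for a Spark group or custom partitioner.

```scala
object LayeredGrouping {
  // Produce (G(H(p)), (H(p), p)) for every p, then group by the outer key so that
  // points whose H-buckets hash together under G land in the same group.
  def layer[P](points: Seq[P], h: P => Seq[Int], g: Seq[Int] => Int): Map[Int, Seq[(Seq[Int], P)]] =
    points
      .map { p => val bucket = h(p); (g(bucket), (bucket, p)) }
      .groupBy(_._1)
      .map { case (key, entries) => key -> entries.map(_._2) }
}
```

With such a grouping, the buckets of a query's offsets are likely to fall under the same outer key, so a lookup mostly touches one group.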

PERFORMANCE
30

FUTURE WORK
HAVE A (BIG) WEBLOG ?
▸ Weve
▸ Yandex
31

FUTURE WORK
LOCALITY-SENSITIVE HASHING FORESTS !
32

RELEASE
github.com/huitseeker/spark-lsh
1 SEPT 2015
33
