15. A WORD ON MODULARITY
LSH for the Earth Mover's Distance (EMD) was introduced by Charikar in the SimHash paper (2002).
Yet existing implementations (e.g. scikit, mrsqueeze) offer no place to plug in
your own LSH family!
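As an illustration of what a plug-in point could look like, here is a minimal sketch. The trait `LshFamily`, the class `HyperplaneFamily` and the helper `Signatures.signature` are hypothetical names made up for this example; they are not the API of scikit, mrsqueeze or any other library. The family shown is the random-hyperplane (sign of a Gaussian projection) one, chosen only because it is short.

```scala
// Hypothetical plug-in interface: a family is anything that can turn a seed
// into a hash function over values of type T.
trait LshFamily[T] {
  def sample(seed: Long): T => Int
}

// Example family (an assumption of this sketch): one random hyperplane per
// seed, hashing a dense vector to the side of the hyperplane it falls on.
final class HyperplaneFamily(dim: Int) extends LshFamily[Array[Double]] {
  def sample(seed: Long): Array[Double] => Int = {
    val rng = new scala.util.Random(seed)
    val normal = Array.fill(dim)(rng.nextGaussian())
    v => {
      val dot = v.zip(normal).map { case (x, a) => x * a }.sum
      if (dot >= 0) 1 else 0
    }
  }
}

object Signatures {
  // Indexing code written against LshFamily never needs to know which
  // concrete family it was handed.
  def signature[T](family: LshFamily[T], seeds: Seq[Long])(x: T): Seq[Int] =
    seeds.map(s => family.sample(s)(x))
}
```

Swapping in another family (MinHash, an EMD scheme, ...) would then only mean providing another `LshFamily` instance.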
16. LSH AMPLIFICATION: CONCATENATION AND PARALLEL CONSTRUCTIONS
▸ basic LSH:
▸ AND (series) construction:
▸ OR (parallel) construction :
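The effect of the two constructions on collision probability can be sketched with the textbook formulas: if a single hash collides with probability p, concatenating r hashes (AND) collides with probability p^r, and using b of them in parallel (OR) collides with probability 1 - (1 - p)^b. The names `and`, `or`, `banded`, `r` and `b` below are choices made for this sketch, not notation recovered from the slides.

```scala
object Amplification {
  // AND (series): all r concatenated hashes must agree.
  def and(p: Double, r: Int): Double = math.pow(p, r)

  // OR (parallel): at least one of b independent hashes must agree.
  def or(p: Double, b: Int): Double = 1.0 - math.pow(1.0 - p, b)

  // b parallel bands, each a concatenation of r hashes.
  def banded(p: Double, r: Int, b: Int): Double = or(and(p, r), b)
}
```

Evaluating `Amplification.banded` over a range of p shows the familiar S-curve: the combination pushes high collision probabilities towards 1 and low ones towards 0.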
18. BASIC LSH
// Key each record by its id, then compute its hash signature.
val hashCollection = records
  .map(s => (getId(s), s))
  .mapValues(s => getHash(s, hashers))

// Cut each signature into numberBands bands, keyed by band position.
val subArray = hashCollection.flatMap {
  case (recordId, hash) =>
    hash.grouped(hashLength / numberBands).zipWithIndex.map {
      case (band, bandIndex) => (bandIndex, (band, recordId))
    }
}
19. LOOKUP
def findCandidates(record: Iterable[String], hashers: Array[Int => Int],
                   mBands: BandType) = {
  val hash = getHash(record, hashers)
  val subArrays = partitionArray(hash).zipWithIndex
  // For each band of the query, fetch the bucket stored under the same band index.
  subArrays.flatMap { case (band, bandIndex) =>
    mBands.lookup(bandIndex)
      .headOption
      .flatMap(_.get(band))
  }.flatten.toSet
}
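For experimenting without a cluster, the same band-then-lookup idea can be sketched on plain Scala collections. Everything here (`BandIndex`, `bands`, `build`, the `Map`-based index) is an assumption of this sketch, standing in for the RDD-based `mBands` of the slides, and it assumes the signature length is a multiple of the number of bands.

```scala
object BandIndex {
  type Band = Vector[Int]
  type Index = Map[Int, Map[Band, Set[String]]]

  // Split a signature into numberBands equal-length bands, tagged by position.
  def bands(hash: Vector[Int], numberBands: Int): Vector[(Int, Band)] =
    hash.grouped(hash.length / numberBands).toVector.zipWithIndex.map(_.swap)

  // bandIndex -> (band contents -> ids of the records sharing that band).
  def build(signatures: Map[String, Vector[Int]], numberBands: Int): Index =
    signatures.toVector
      .flatMap { case (id, hash) =>
        bands(hash, numberBands).map { case (i, band) => (i, band, id) }
      }
      .groupBy(_._1)
      .map { case (i, rows) =>
        i -> rows.groupBy(_._2).map { case (band, rs) => band -> rs.map(_._3).toSet }
      }

  // A record is a candidate if it shares at least one band with the query.
  def findCandidates(hash: Vector[Int], numberBands: Int, index: Index): Set[String] =
    bands(hash, numberBands).flatMap { case (i, band) =>
      index.get(i).flatMap(_.get(band)).getOrElse(Set.empty[String])
    }.toSet
}
```

Candidates returned this way still need an exact similarity check, since sharing a band only makes similarity likely.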
20. getHash(record,hashers)
DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS
records.mapPartitions { iter =>
  // One generator per partition, created on the executor.
  val rng = new scala.util.Random()
  iter.map(x => hashers.flatMap { h => getHashFunction(rng, h)(x) })
}
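One way to read the advice above: ship only the seeds, and let every worker rebuild the same hash functions locally. The sketch below is an assumption about how that could be done, not the talk's code; `hashFunctionFromSeed`, `signature`, the modulus and the MinHash-style minimum are all choices made for this example.

```scala
object SeededHashers {
  // Modulus 2^31 - 1 (a prime); an arbitrary choice for this sketch.
  private val Prime = 2147483647L

  // The same seed yields the same function on every machine, so only the
  // seed (a Long) ever needs to be serialized.
  def hashFunctionFromSeed(seed: Long): Int => Int = {
    val rng = new scala.util.Random(seed)
    val a = 1 + rng.nextInt(Int.MaxValue - 1) // non-zero multiplier
    val b = rng.nextInt(Int.MaxValue)         // offset
    x => (((a.toLong * x + b) % Prime + Prime) % Prime).toInt
  }

  // MinHash-style signature of a non-empty record: for each seed, the
  // minimum hash value over the record's elements.
  def signature(record: Iterable[Int], seeds: Seq[Long]): Seq[Int] =
    seeds.map { s =>
      val h = hashFunctionFromSeed(s)
      record.map(h).min
    }
}
```

Because the functions are rebuilt deterministically, two workers hashing the same record with the same seeds produce the same signature.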
22. BASIC LSH
WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With n data points, choose k = O(log n) and
L = O(n^(1/c)), to solve the (c, r)-NN problem
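For reference, a hash function built on a 2-stable Gaussian distribution is usually written h(v) = floor((a . v + b) / w), with a drawn from a standard Gaussian and b uniform in [0, w). The sketch below follows that standard form; the bucket width `w`, the dimension `dim` and the use of seeds are parameters assumed for the example, not values from the slides.

```scala
object GaussianLsh {
  // One hash function h(v) = floor((a . v + b) / w),
  // with a ~ N(0, I) and b ~ Uniform[0, w).
  def hashFunction(dim: Int, w: Double, seed: Long): Array[Double] => Int = {
    val rng = new scala.util.Random(seed)
    val a = Array.fill(dim)(rng.nextGaussian())
    val b = rng.nextDouble() * w
    v => {
      require(v.length == dim, "dimension mismatch")
      val dot = a.indices.map(i => a(i) * v(i)).sum
      math.floor((dot + b) / w).toInt
    }
  }

  // Concatenate one hash per seed into a single bucket key (AND construction).
  def bucket(dim: Int, w: Double, seeds: Seq[Long]): Array[Double] => Vector[Int] = {
    val hs = seeds.map(s => hashFunction(dim, w, s)).toVector
    v => hs.map(h => h(v))
  }
}
```

A larger `w` makes buckets wider, so more distant points collide; the number of seeds plays the role of the concatenation length.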
23. WEB LOGS ARE SPARSE
Input: hits per user over 6 months, 2x50-ish integers/user (4 GB)
Output: 1000 integers per user, i.e. 10 (parallel) bands of 100
(concatenated) hashes
As 64-bit integers: 40 GB
Yet !
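The arithmetic behind these sizes can be checked quickly. The sketch assumes decimal gigabytes; the user count it derives is an inference from the figures above, not a number stated in the slides.

```scala
object SignatureSize {
  val bands = 10
  val hashesPerBand = 100
  val bytesPerHash = 8L // 64-bit integers

  // 10 bands x 100 hashes x 8 bytes = 8000 bytes of signature per user.
  val bytesPerUser: Long = bands * hashesPerBand * bytesPerHash

  // Number of users implied by a total signature size (inferred, not quoted).
  def impliedUsers(totalBytes: Long): Long = totalBytes / bytesPerUser
}
```

With a 40 GB output, `SignatureSize.impliedUsers(40L * 1000 * 1000 * 1000)` gives about five million users, and the signatures are roughly ten times larger than the sparse input they summarize.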
24. ENTROPY LSH (PANIGRAHY 2006)
REPLACE TABLES BY OFFSETS
Offsets q + δ_i, 1 ≤ i ≤ L, chosen randomly from the surface
of B(q, r), the sphere of radius r centered at q
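Drawing points uniformly from the surface of a sphere around a query can be sketched by normalizing Gaussian vectors. The names `q` (query point), `r` (radius), `numOffsets` and `seed` are assumptions of this sketch.

```scala
object EntropyOffsets {
  // numOffsets points drawn uniformly from the surface of the sphere of
  // radius r centered at q.
  def queryOffsets(q: Array[Double], r: Double, numOffsets: Int, seed: Long): Vector[Array[Double]] = {
    val rng = new scala.util.Random(seed)
    Vector.fill(numOffsets) {
      // A normalized Gaussian vector is uniformly distributed on the unit sphere.
      val g = Array.fill(q.length)(rng.nextGaussian())
      val norm = math.sqrt(g.map(x => x * x).sum)
      q.indices.map(i => q(i) + r * g(i) / norm).toArray
    }
  }
}
```

Each offset point is then hashed like the query itself, and the buckets it lands in are searched in a single table instead of looking the query up in many tables.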
25. ENTROPY LSH
WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With n data points, choose k ≥ log n / log(1/p2) and
L = O(n^(2/c)), to solve the (c, r)-NN problem with as
few as Õ(1) hash tables
26. BUT ... NETWORK COSTS
▸ Basic LSH: look up one bucket per hash table,
▸ Entropy LSH: search one bucket per query offset
27. LAYERED LSH (BAHMANI ET AL. 2012)
The output of your LSH family lives in a space that has its own metric, e.g. cosine distance.
For close points, the chance that their hashes themselves hash to the same
bucket is high!
28. LAYERED LSH
Have a second LSH family G for your norm on the output space of H.
Likely that GH(q + δ) = GH(q) for all offsets δ.
29. LAYERED LSH
Output of hash generation is (GH(p), (H(p), p)) for all p.
In Spark: group by key, or use a custom partitioner, for the (H(p), p) RDD.
Network cost :
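The (GH(p), (H(p), p)) keying can be tried out on plain collections before moving to Spark. In this sketch `h` and `g` are passed in as ordinary functions, and `keyed` and `place` are hypothetical helpers; the outer `groupBy` only stands in for the shuffle that a Spark group or partitioner would perform.

```scala
object LayeredKeys {
  type Bucket = Vector[Int]

  // Emit (GH(p), (H(p), p)) for every point p.
  def keyed[P](points: Seq[P], h: P => Bucket, g: Bucket => Int): Seq[(Int, (Bucket, P))] =
    points.map { p =>
      val hp = h(p)
      (g(hp), (hp, p))
    }

  // GH(p) picks the machine, H(p) picks the bucket on that machine.
  def place[P](rows: Seq[(Int, (Bucket, P))]): Map[Int, Map[Bucket, Seq[P]]] =
    rows.groupBy(_._1).map { case (machine, onMachine) =>
      machine -> onMachine.map(_._2).groupBy(_._1).map { case (b, ps) => b -> ps.map(_._2) }
    }
}
```

Since nearby buckets tend to share a GH value, a query and its offsets are likely to be answered by a single machine, which is where the network savings come from.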