In a talk at the Chinese Academy of Sciences Institute of Automation, I discuss some of the MapReduce and community detection methods I've worked on.
Big data matrix factorizations and Overlapping community detection in graphs
1. Big data matrix factorizations and overlapping community detection in graphs.
David F. Gleich, Purdue University
Joint work with Paul Constantine, Austin Benson, Jason Lee, Jeremy Templeton, Yangyang Hou, C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon; supported by NSF CAREER 1149756-CCF and a DOE ASCR award.
Code: bit.ly/dgleich-codes
2. Tall-and-skinny matrices (m ≫ n)
A (from the tinyimages collection): many rows (like a billion), a few columns (under 10,000).
Used in: regression and general linear models with many samples; block iterative methods; panel factorizations; approximate kernel k-means; big-data SVD/PCA.
David Gleich · Purdue
3. A graphical view of the MapReduce programming model
[Diagram: data blocks feed parallel Map tasks, which emit (key, value) pairs; a Shuffle stage groups the values by key; Reduce tasks consume each (key, values) group and emit data.]
Map tasks read batches of data in parallel and do some initial filtering. The shuffle is a global communication, like a group-by or MPI Alltoall. Reduce is often where the computation happens.
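The three phases above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop itself; the names `map_reduce`, `mapper`, and `reducer` are my own.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each input record yields (key, value) pairs.
    mapped = [kv for rec in records for kv in mapper(rec)]
    # Shuffle: group values by key (a global group-by).
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce: each key's group of values is combined independently.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count, the classic illustration.
docs = ["a b a", "b c"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda k, vs: sum(vs),
)
```

In a real cluster the map tasks, the shuffle, and the reduce tasks each run distributed and in parallel; only the grouping semantics are captured here.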
4. PCA of 80,000,000 images
A is 80,000,000 images by 1000 pixels. Shown: the first 16 columns of V as images, and the fraction of variance captured by the first 100 principal components.
Constantine & Gleich, MapReduce 2010.
5. Regression with 80,000,000 images
The goal was to approximate how much red there was in a picture from the values of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole.
From the paper: we model the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if r_i is the sum of the red components in all pixels of image i, and G_{i,j} is the gray value of the jth pixel in image i, then we wanted to find min_s Σ_i (r_i − Σ_j G_{i,j} s_j)². There is no particular importance to this regression problem; we use it merely as a demonstration.
The coefficients s_j are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of G_{i,j}'s from the regression and let u_i be the mean of the ith …
A is 80,000,000 images by 1000 pixels.
6. Models and algorithms for high performance matrix and network computations
[Figure: the model compared to the prediction standard deviation of bubble locations at the final time for two values of the bubble radius, s = 0.39 cm and s = 1.95 cm; the model took approximately twenty minutes to construct. Working with the data involved a few pre- and post-processing steps: extract from Aria, globally transpose the data, compute the predictions and errors.]
Tensor eigenvalues and a power method
FIGURE 6 – Previous work from the PI tackled network alignment with matrix methods for edge overlap; this proposal is for matching triangles using tensor methods.
maximize Σ_{ijk} T_{ijk} x_i x_j x_k subject to ‖x‖₂ = 1
[x^(next)]_i = ρ · (Σ_{jk} T_{ijk} x_j x_k + x_i), where ρ ensures the unit 2-norm.
SSHOPM method due to Kolda and Mayo.
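As a rough in-memory sketch of the SSHOPM iteration above, using a small dense NumPy tensor (my own toy example with an assumed shift of 1; real problems need sparse tensors):

```python
import numpy as np

def sshopm(T, x0, shift=1.0, iters=200):
    # Shifted symmetric higher-order power method (Kolda & Mayo):
    # repeatedly apply x <- normalize(T x x + shift * x).
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = np.einsum('ijk,j,k->i', T, x, x) + shift * x
        x = y / np.linalg.norm(y)
    # The generalized Rayleigh quotient gives the tensor eigenvalue.
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)
    return lam, x

# Tiny symmetric 2x2x2 tensor with eigenvectors at the coordinate axes.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 2.0
T[1, 1, 1] = 1.0
lam, x = sshopm(T, np.array([1.0, 0.5]))
```

For this tensor the iteration converges to the eigenpair λ = 2, x = e₁.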
Big data methods: SIMAX ’09, SISC ’11, MapReduce ’11, ICASSP ’12
Network alignment: ICDM ’09, SC ’11, TKDE ’13
Fast & scalable network centrality: SC ’05, WAW ’07, SISC ’10, WWW ’10, …
Data clustering: WSDM ’12, KDD ’12, CIKM ’13, …
Massive matrix computations on multi-threaded and distributed architectures: Ax = b, min ‖Ax − b‖, Ax = λx
7. PCA of 80,000,000 images
A is 80,000,000 images by 1000 pixels. MapReduce: zero-mean the rows, then TSQR to get R. Post-processing: the SVD of R gives V and the singular values. Shown: the first 16 columns of V as images, and the top 100 singular values (principal components).
Constantine & Gleich, MapReduce 2010.
8. Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute colsum(A): 161 sec.
Time to compute R in qr(A): 387 sec.
9. How to store tall-and-skinny matrices in Hadoop
A : m x n, m ≫ n, split into row blocks A1, A2, A3, A4.
The key is an arbitrary row-id. The value is the 1 x n array for a row (or a b x n block). Each submatrix Ai is the input to a map task.
10. Numerical stability was a problem for prior approaches
[Figure: norm(QᵀQ − I) against the condition number, both on log scales: AR⁻¹ loses orthogonality as the condition number grows; AR⁻¹ with iterative refinement and Direct TSQR do not.]
Previous methods couldn’t ensure that the matrix Q was orthogonal.
Methods: 1. Constantine & Gleich, MapReduce 2011; 2. and 3. Benson, Gleich, Demmel, BigData ’13; 4. Direct TSQR, Benson, Gleich, Demmel, BigData ’13.
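The instability is easy to reproduce in NumPy. The sketch below is my own toy construction, not the paper's experiment: it forms Q = AR⁻¹ with R taken from a Cholesky factorization of AᵀA and compares norm(QᵀQ − I) against Householder QR on a moderately ill-conditioned matrix.

```python
import numpy as np

def cholesky_qr(A):
    # AR^{-1} approach: get R from A^T A = R^T R, then Q = A R^{-1}.
    # The squaring of the condition number in A^T A is what hurts.
    R = np.linalg.cholesky(A.T @ A).T
    Q = A @ np.linalg.inv(R)
    return Q, R

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((1000, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = U @ np.diag(np.logspace(0, -5, 20)) @ V.T  # condition number ~1e5

Qc, _ = cholesky_qr(A)
Qh, _ = np.linalg.qr(A)
err_chol = np.linalg.norm(Qc.T @ Qc - np.eye(20))
err_house = np.linalg.norm(Qh.T @ Qh - np.eye(20))
```

Here `err_chol` is orders of magnitude larger than `err_house`, which stays near machine precision; this is the gap the figure on the slide illustrates.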
11. Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011)
Algorithm: the data are the rows of a matrix; each map task runs a QR factorization of its rows; the reduce task runs a QR factorization of the stacked R factors.
[Diagram: Mapper 1 runs a serial TSQR over blocks A1–A4 (qr of stacked A1, A2 gives Q2, R2; stack R2 with A3, qr gives Q3, R3; stack with A4, qr gives Q4, R4) and emits R4. Mapper 2 does the same over A5–A8 and emits R8. Reducer 1 runs a serial TSQR on R4 and R8 and emits the final R.]
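The map/reduce QR structure above fits in a few lines of NumPy as an in-memory toy (my own sketch, not the Hadoop code):

```python
import numpy as np

def tsqr_R(A, nblocks=4):
    # Communication-avoiding TSQR: QR each row block (the map step),
    # then stack the small R factors and QR the stack (the reduce step).
    Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(A, nblocks)]
    return np.linalg.qr(np.vstack(Rs), mode='r')

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))
R_tsqr = tsqr_R(A)
R_full = np.linalg.qr(A, mode='r')
# The two R factors agree up to the sign of each row.
```

Each intermediate R is only n-by-n, which is why the communication volume is tiny compared with the original matrix.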
12. More about how to compute a regression
min ‖Ax − b‖² = min Σ_i (Σ_j A_{ij} x_j − b_i)²
[Diagram: the serial TSQR in each mapper carries the right-hand side along with the matrix blocks, applying each local Qᵀ to it as it goes: b2 = Q2ᵀ b1, and so on.]
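A minimal in-memory sketch of this regression-via-TSQR idea (my own toy code, not the Hadoop implementation): factoring the augmented matrix [A | b] block by block leaves both R and Qᵀb in the final small R factor, which is enough to solve the least-squares problem.

```python
import numpy as np

def tsqr_lstsq(A, b, nblocks=4):
    # Carry b through the TSQR by factoring the augmented matrix [A | b].
    Ab = np.hstack([A, b.reshape(-1, 1)])
    Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(Ab, nblocks)]
    Raug = np.linalg.qr(np.vstack(Rs), mode='r')
    n = A.shape[1]
    R, Qtb = Raug[:n, :n], Raug[:n, n]   # top-left block and last column
    return np.linalg.solve(R, Qtb)       # back-substitute for x

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 8))
x_true = rng.standard_normal(8)
b = A @ x_true                           # consistent system for the demo
x = tsqr_lstsq(A, b)
```

Any per-row sign ambiguity in the R factor cancels when solving, so the recovered x matches the direct solution.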
13. Too many maps cause too much data to go to one reducer!
Each image is 5 KB, each HDFS block holds 12,800 images, and there are 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50 GB).
14. Too many maps cause too much data to one reducer!
[Diagram: the fix is a second iteration. Iteration 1: Mappers 1-1 through 1-4 run serial TSQR on their blocks and emit R1–R4; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1–R2,3. Iteration 2: an identity map shuffles those to Reducer 2-1, which runs serial TSQR and emits the final R.]
15. The rest of the talk: full TSQR code in hadoopy

import random, numpy, hadoopy
class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper
    def compress(self):
        # compute a QR factorization of the buffered rows, keep only R
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()
    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row
    def mapper(self, key, value):
        self.collect(key, value)
    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
16. Non-negative matrix factorization
NMF: find W, H ≥ 0 where A ≈ WH.
Separable NMF: find H ≥ 0 and a column subset A(:, K) where A ≈ A(:, K)H.
[Figure: (b) NMF: data projected on the 1st and 2nd nonnegative factors; (c) Manifold learning: first and second manifold parameters.]
17. There are good algorithms for separable NMF that avoid alternating between W and H.
NMF: find W, H ≥ 0 where A ≈ WH.
Separable NMF: find H ≥ 0 and A(:, K) where A ≈ A(:, K)H.
18. Separable NMF algorithms
1. Find the columns of A.
2. Find the values of W.
Separable NMF: find H ≥ 0 and A(:, K) where A ≈ A(:, K)H.
19. Separable NMF algorithms are really geometry
1. Finding the columns of A is equivalent to finding the extreme points of a convex set.
2. These extreme points are preserved under linear transformations.
Separable NMF: find H ≥ 0 and A(:, K) where A ≈ A(:, K)H.
20. We use our tall-and-skinny QR to get an orthogonal transformation that makes the problem easily solvable.
21. SVD: A = U S Vᵀ; NMF: A ≈ A(:, K) H.
1. Compute QR using the TSQR method.
2. Run a separable NMF method on SVᵀ.
3. Find H by solving a small non-negative least-squares problem in each column. These are tiny.
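A hedged sketch of step 2: SPA, one of the separable NMF methods compared later in the talk, fits in a few lines of NumPy. The synthetic separable matrix below is my own construction for illustration.

```python
import numpy as np

def spa(A, r):
    # Successive Projection Algorithm: greedily pick the column with
    # the largest norm, project it out of all columns, and repeat.
    X = A.astype(float).copy()
    cols = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(X, axis=0)))
        cols.append(j)
        u = X[:, j] / np.linalg.norm(X[:, j])
        X -= np.outer(u, u @ X)   # project onto the orthogonal complement
    return cols

# Synthetic separable matrix A = W H where H contains an identity block,
# so the first 3 columns of A are the true extreme columns.
rng = np.random.default_rng(3)
W = rng.random((50, 3))
H = np.hstack([np.eye(3), rng.dirichlet(np.ones(3), size=7).T])
A = W @ H
K = spa(A, 3)   # recovers the extreme column indices
```

The geometric picture from the previous slide is exactly why this works: the largest-norm column of a set of convex combinations is always an extreme point.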
22. All of the hard analysis is on the small dimension of the matrix, which makes this very useful in practice.
23. Our methods vs. the competition
200 million rows, 200 columns, separation rank 20.
Figure 1: Relative error in the separable factorization as a function of nonnegative rank (r) for the three algorithms. The matrix was synthetically generated to be separable. SPA and GP capture all of the true extreme columns when r = 20 (where the residual is zero). Since we are using the greedy variant of XRAY, it takes r = 21 to capture all of them.
Figure 2: First 20 extreme columns selected by SPA, XRAY, and GP along with the true columns used in the synthetic matrix generation. A marker is present for a given column index if and only if that column is a selected extreme column. SPA and GP capture all of the true extreme columns.
24. Nonlinear heat transfer model in random media
Each run takes 5 hours on 8 processors and outputs a 4M (node) by 9 (time-step) simulation. We did 8192 runs (128 samples of bubble locations, 64 bubble radii): 4.5 TB of data in Exodus II (NetCDF). Apply heat; look at the temperature.
https://www.opensciencedatacloud.org/publicdata/heat-transfer/
26. Each simulation is a column: a 5B-by-64 matrix, 2.2 TB. Compute the SVD A = U S Vᵀ, then run a “standard” NMF algorithm on SVᵀ to get A(:, K) and H.
27. Figure 9: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that “sandwich” them in the matrix. See Figure 10 for a closer look at the coefficients.
Figure 8: First 10 extreme columns selected by SPA, XRAY, and GP for the heat transfer simulation.
Figure 10: Value of the H matrix for columns 1 through 34 for the SPA algorithm on the heat transfer simulation.
33. We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]
PPR is a Markov chain on nodes:
1. with probability α, follow a random edge;
2. with probability 1 − α, restart at a seed.
Also known as the random surfer, or a random walk with restart. It has a unique stationary distribution.
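That stationary distribution can be sketched densely with power iteration (a toy 3-node chain of my own, just to make the definition concrete; the talk's actual algorithm is a local push method):

```python
import numpy as np

def ppr(P, seed, alpha=0.85, iters=1000):
    # Stationary distribution of the restart chain:
    # x = alpha * P^T x + (1 - alpha) * s, solved by power iteration.
    n = P.shape[0]
    s = np.zeros(n)
    s[seed] = 1.0
    x = s.copy()
    for _ in range(iters):
        x = alpha * (P.T @ x) + (1 - alpha) * s
    return x

# P[i, j] = probability of stepping from node i to node j (a 3-cycle).
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
x = ppr(P, seed=0)
```

The result is a probability vector that is largest at the seed and decays with distance from it, which is exactly the locality the community methods exploit.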
34. Personalized PageRank community detection
1. Given a seed, approximate the stationary distribution.
2. Extract the community.
Both are local operations.
35. Conductance communities
Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving the set to the total edges in the set:
φ(S) = cut(S) / min(vol(S), vol(S̄))
where cut(S) counts the edges leaving the set and vol(S) the total edges in the set. Equivalently, it’s the probability that a random edge leaves the set. Small conductance ⇔ good community.
Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.
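The formula is easy to compute directly on the dictionary-of-sets graph format the talk's PPR code uses (the two-triangles example below is my own construction):

```python
def conductance(G, S):
    # G: undirected graph as a dictionary-of-sets.
    # phi(S) = cut(S) / min(vol(S), vol(complement of S)).
    S = set(S)
    cut = sum(1 for v in S for u in G[v] if u not in S)
    volS = sum(len(G[v]) for v in S)
    volT = sum(len(G[v]) for v in G) - volS
    return cut / min(volS, volT)

# Two triangles joined by one edge: {0, 1, 2} is a natural community.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi = conductance(G, {0, 1, 2})  # cut = 1, vol = 7, so phi = 1/7
```

The single cut edge against a volume of 7 gives the small conductance a good community should have.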
37. import collections
# G is graph as dictionary-of-sets
# seed is the list of seed vertices
alpha = 0.99
tol = 1e-4

x = {} # Store x, r as dictionaries
r = {} # initialize residual
Q = collections.deque() # initialize queue
for s in seed:
    r[s] = 1.0/len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft() # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1-alpha)*r[v]
    mass = alpha*r[v]/(2*len(G[v]))
    for u in G[v]: # for neighbors of v
        if u not in r: r[u] = 0.
        if (r[u] < len(G[u])*tol and
                r[u] + mass >= len(G[u])*tol):
            Q.append(u) # add u to queue if large
        r[u] = r[u] + mass
    r[v] = mass*len(G[v])
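To “extract the community” from the computed x, the standard companion step (not shown on the slide) is a sweep cut over x[v]/deg(v); here is a sketch in the same dictionary-of-sets format, with an assumed PPR-like vector as input:

```python
def sweep_cut(G, x):
    # Sort vertices by x[v]/deg(v) and take the prefix set with the
    # smallest conductance -- the standard sweep used to extract a
    # community from an approximate PPR vector (Andersen et al. 2006).
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    total_vol = sum(len(G[v]) for v in G)
    best, best_phi = set(), float('inf')
    S, cut, vol = set(), 0, 0
    for v in order:
        S.add(v)
        vol += len(G[v])
        if total_vol - vol == 0:
            break  # the full vertex set is not a community
        # edges from v into S leave the cut; edges out of S join it
        cut += len(G[v]) - 2 * sum(1 for u in G[v] if u in S and u != v)
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best, best_phi

# Two triangles joined by one edge; x is an assumed PPR-like vector
# concentrated on the left triangle.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = {0: 0.4, 1: 0.35, 2: 0.2, 3: 0.05}
S, phi = sweep_cut(G, x)
```

The sweep returns the left triangle {0, 1, 2} with conductance 1/7, the best prefix set under this ordering.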
39. Whang-Gleich-Dhillon, CIKM2013 [upcoming…]
1. Extract part of the graph that might have overlapping communities.
2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.
3. Find the centers of these partitions.
4. Use PPR to grow egonets of these centers.
40. A good partitioning helps
[Figure: maximum conductance vs. coverage (percentage) for egonet, graclus centers, spread hubs, random, and bigclam seedings on (a) AstroPh and (d) Flickr.]
Flickr social network: 2M vertices, 22M edges. We can cover 95% of the network with communities of conductance ~0.15.
41. And helps to find real-world overlapping communities too
[Figure 3: F1 and F2 measures on DBLP comparing our algorithmic communities (graclus centers, spread hubs, random, egonet) against demon and bigclam; higher indicates better communities.]
Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure. Our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!
42. Seed Set Expansion
Carefully select seeds; greedily expand communities around the seed sets.
The algorithm: Filtering Phase → Seeding Phase → Seed Set Expansion Phase → Propagation Phase.
Joyce Jiyoung Whang, The University of Texas at Austin, Conference on Information and Knowledge Management (8/44)
43. Filtering Phase
44. Seed Set Expansion Phase
Run clustering, and choose centers or pick an independent set of high degree nodes. Then run personalized PageRank.
45. Propagation Phase
We can prove that this only improves the objective.
46. Conclusion & Discussion
PPR community detection is fast [Andersen et al. FOCS06].
PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013].
Partitioning for seeding yields high coverage & real communities.
“Caveman” communities? Best conductance cut at the intersection of communities?
References: Gleich & Seshadhri, KDD2012; Whang, Gleich & Dhillon, CIKM2013.
PPR sample: bit.ly/18khzO5
Egonet seeding: bit.ly/dgleich-code