In a talk at the Chinese Academy of Sciences Institute of Automation, I discuss some of the MapReduce and community detection methods I've worked on.
Big data matrix factorizations and Overlapping community detection in graphs
1. Big data matrix factorizations and overlapping community detection in graphs.
David F. Gleich, Purdue University
Joint work with Paul Constantine, Austin Benson, Jason Lee, Jeremy Templeton, Yangyang Hou, C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon; supported by NSF CAREER 1149756-CCF and a DOE ASCR award.
Code: bit.ly/dgleich-codes
2. Tall-and-skinny matrices (m ≫ n)
A (from the tinyimages collection): many rows (like a billion), a few columns (under 10,000).
Used in: regression and general linear models with many samples; block iterative methods; panel factorizations; approximate kernel k-means; big-data SVD/PCA.
David Gleich · Purdue
3. A graphical view of the MapReduce programming model
[Diagram: data blocks feed parallel Map tasks, which emit (key, value) pairs; a Shuffle stage groups the values by key; Reduce tasks consume each (key, values) group and emit data.]
Map tasks read batches of data in parallel and do some initial filtering. The shuffle is a global communication, like a group-by or MPI Alltoall. Reduce is often where the computation happens.
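The three phases above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop itself; the names `map_reduce`, `mapper`, and `reducer` are my own.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each input record yields (key, value) pairs.
    mapped = [kv for rec in records for kv in mapper(rec)]
    # Shuffle: group values by key (a global group-by).
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce: each key's group of values is combined independently.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count, the classic illustration.
docs = ["a b a", "b c"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda k, vs: sum(vs),
)
```

In a real cluster the map tasks, the shuffle, and the reduce tasks each run distributed and in parallel; only the grouping semantics are captured here.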
4. PCA of 80,000,000 images
A is 80,000,000 images by 1000 pixels. Shown: the first 16 columns of V as images, and the fraction of variance captured by the first 100 principal components.
Constantine & Gleich, MapReduce 2010.
5. Regression with 80,000,000 images
The goal was to approximate how much red there was in a picture from the values of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole.
From the paper: we model the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if r_i is the sum of the red components in all pixels of image i, and G_{i,j} is the gray value of the jth pixel in image i, then we wanted to find min_s Σ_i (r_i − Σ_j G_{i,j} s_j)². There is no particular importance to this regression problem; we use it merely as a demonstration.
The coefficients s_j are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of G_{i,j}'s from the regression and let u_i be the mean of the ith …
A is 80,000,000 images by 1000 pixels.
6. Models and algorithms for high performance matrix and network computations
[Figure: the model compared to the prediction standard deviation of bubble locations at the final time for two values of the bubble radius, s = 0.39 cm and s = 1.95 cm; the model took approximately twenty minutes to construct. Working with the data involved a few pre- and post-processing steps: extract from Aria, globally transpose the data, compute the predictions and errors.]
Tensor eigenvalues and a power method
FIGURE 6 – Previous work from the PI tackled network alignment with matrix methods for edge overlap; this proposal is for matching triangles using tensor methods.
maximize Σ_{ijk} T_{ijk} x_i x_j x_k subject to ‖x‖₂ = 1
[x^(next)]_i = ρ · (Σ_{jk} T_{ijk} x_j x_k + x_i), where ρ ensures the unit 2-norm.
SSHOPM method due to Kolda and Mayo.
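As a rough in-memory sketch of the SSHOPM iteration above, using a small dense NumPy tensor (my own toy example with an assumed shift of 1; real problems need sparse tensors):

```python
import numpy as np

def sshopm(T, x0, shift=1.0, iters=200):
    # Shifted symmetric higher-order power method (Kolda & Mayo):
    # repeatedly apply x <- normalize(T x x + shift * x).
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = np.einsum('ijk,j,k->i', T, x, x) + shift * x
        x = y / np.linalg.norm(y)
    # The generalized Rayleigh quotient gives the tensor eigenvalue.
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)
    return lam, x

# Tiny symmetric 2x2x2 tensor with eigenvectors at the coordinate axes.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 2.0
T[1, 1, 1] = 1.0
lam, x = sshopm(T, np.array([1.0, 0.5]))
```

For this tensor the iteration converges to the eigenpair λ = 2, x = e₁.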
Big data methods: SIMAX ’09, SISC ’11, MapReduce ’11, ICASSP ’12
Network alignment: ICDM ’09, SC ’11, TKDE ’13
Fast & scalable network centrality: SC ’05, WAW ’07, SISC ’10, WWW ’10, …
Data clustering: WSDM ’12, KDD ’12, CIKM ’13, …
Massive matrix computations on multi-threaded and distributed architectures: Ax = b, min ‖Ax − b‖, Ax = λx
7. PCA of 80,000,000 images
A is 80,000,000 images by 1000 pixels. MapReduce: zero-mean the rows, then TSQR to get R. Post-processing: the SVD of R gives V and the singular values. Shown: the first 16 columns of V as images, and the top 100 singular values (principal components).
Constantine & Gleich, MapReduce 2010.
8. Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute colsum(A): 161 sec.
Time to compute R in qr(A): 387 sec.
9. How to store tall-and-skinny matrices in Hadoop
A : m x n, m ≫ n, split into row blocks A1, A2, A3, A4.
The key is an arbitrary row-id. The value is the 1 x n array for a row (or a b x n block). Each submatrix Ai is the input to a map task.
10. Numerical stability was a problem for prior approaches
[Figure: norm(QᵀQ − I) against the condition number, both on log scales: AR⁻¹ loses orthogonality as the condition number grows; AR⁻¹ with iterative refinement and Direct TSQR do not.]
Previous methods couldn’t ensure that the matrix Q was orthogonal.
Methods: 1. Constantine & Gleich, MapReduce 2011; 2. and 3. Benson, Gleich, Demmel, BigData ’13; 4. Direct TSQR, Benson, Gleich, Demmel, BigData ’13.
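The instability is easy to reproduce in NumPy. The sketch below is my own toy construction, not the paper's experiment: it forms Q = AR⁻¹ with R taken from a Cholesky factorization of AᵀA and compares norm(QᵀQ − I) against Householder QR on a moderately ill-conditioned matrix.

```python
import numpy as np

def cholesky_qr(A):
    # AR^{-1} approach: get R from A^T A = R^T R, then Q = A R^{-1}.
    # The squaring of the condition number in A^T A is what hurts.
    R = np.linalg.cholesky(A.T @ A).T
    Q = A @ np.linalg.inv(R)
    return Q, R

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((1000, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = U @ np.diag(np.logspace(0, -5, 20)) @ V.T  # condition number ~1e5

Qc, _ = cholesky_qr(A)
Qh, _ = np.linalg.qr(A)
err_chol = np.linalg.norm(Qc.T @ Qc - np.eye(20))
err_house = np.linalg.norm(Qh.T @ Qh - np.eye(20))
```

Here `err_chol` is orders of magnitude larger than `err_house`, which stays near machine precision; this is the gap the figure on the slide illustrates.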
11. Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011)
Algorithm: the data are the rows of a matrix; each map task runs a QR factorization of its rows; the reduce task runs a QR factorization of the stacked R factors.
[Diagram: Mapper 1 runs a serial TSQR over blocks A1–A4 (qr of stacked A1, A2 gives Q2, R2; stack R2 with A3, qr gives Q3, R3; stack with A4, qr gives Q4, R4) and emits R4. Mapper 2 does the same over A5–A8 and emits R8. Reducer 1 runs a serial TSQR on R4 and R8 and emits the final R.]
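The map/reduce QR structure above fits in a few lines of NumPy as an in-memory toy (my own sketch, not the Hadoop code):

```python
import numpy as np

def tsqr_R(A, nblocks=4):
    # Communication-avoiding TSQR: QR each row block (the map step),
    # then stack the small R factors and QR the stack (the reduce step).
    Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(A, nblocks)]
    return np.linalg.qr(np.vstack(Rs), mode='r')

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))
R_tsqr = tsqr_R(A)
R_full = np.linalg.qr(A, mode='r')
# The two R factors agree up to the sign of each row.
```

Each intermediate R is only n-by-n, which is why the communication volume is tiny compared with the original matrix.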
12. More about how to compute a regression
min ‖Ax − b‖² = min Σ_i (Σ_j A_{ij} x_j − b_i)²
[Diagram: the serial TSQR in each mapper carries the right-hand side along with the matrix blocks, applying each local Qᵀ to it as it goes: b2 = Q2ᵀ b1, and so on.]
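A minimal in-memory sketch of this regression-via-TSQR idea (my own toy code, not the Hadoop implementation): factoring the augmented matrix [A | b] block by block leaves both R and Qᵀb in the final small R factor, which is enough to solve the least-squares problem.

```python
import numpy as np

def tsqr_lstsq(A, b, nblocks=4):
    # Carry b through the TSQR by factoring the augmented matrix [A | b].
    Ab = np.hstack([A, b.reshape(-1, 1)])
    Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(Ab, nblocks)]
    Raug = np.linalg.qr(np.vstack(Rs), mode='r')
    n = A.shape[1]
    R, Qtb = Raug[:n, :n], Raug[:n, n]   # top-left block and last column
    return np.linalg.solve(R, Qtb)       # back-substitute for x

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 8))
x_true = rng.standard_normal(8)
b = A @ x_true                           # consistent system for the demo
x = tsqr_lstsq(A, b)
```

Any per-row sign ambiguity in the R factor cancels when solving, so the recovered x matches the direct solution.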
13. Too many maps cause too much data to go to one reducer!
Each image is 5 KB, each HDFS block holds 12,800 images, and there are 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50 GB).
14. Too many maps cause too much data to one reducer!
[Diagram: the fix is a second iteration. Iteration 1: Mappers 1-1 through 1-4 run serial TSQR on their blocks and emit R1–R4; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1–R2,3. Iteration 2: an identity map shuffles those to Reducer 2-1, which runs serial TSQR and emits the final R.]
15. The rest of the talk: full TSQR code in hadoopy

import random, numpy, hadoopy
class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper
    def compress(self):
        # compute a QR factorization of the buffered rows, keep only R
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()
    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row
    def mapper(self, key, value):
        self.collect(key, value)
    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
16. Non-negative matrix factorization
NMF: find W, H ≥ 0 where A ≈ WH.
Separable NMF: find H ≥ 0 and a column subset A(:, K) where A ≈ A(:, K)H.
[Figure: (b) NMF: data projected on the 1st and 2nd nonnegative factors; (c) Manifold learning: first and second manifold parameters.]
17. There are good algorithms for separable NMF that avoid alternating between W and H.
NMF: find W, H ≥ 0 where A ≈ WH.
Separable NMF: find H ≥ 0 and A(:, K) where A ≈ A(:, K)H.
18. Separable NMF algorithms
1. Find the columns of A.
2. Find the values of W.
Separable NMF: find H ≥ 0 and A(:, K) where A ≈ A(:, K)H.
19. Separable NMF algorithms are really geometry
1. Finding the columns of A is equivalent to finding the extreme points of a convex set.
2. These extreme points are preserved under linear transformations.
Separable NMF: find H ≥ 0 and A(:, K) where A ≈ A(:, K)H.
20. We use our tall-and-skinny QR to get an orthogonal transformation that makes the problem easily solvable.
21. SVD: A = U S Vᵀ; NMF: A ≈ A(:, K) H.
1. Compute QR using the TSQR method.
2. Run a separable NMF method on SVᵀ.
3. Find H by solving a small non-negative least-squares problem in each column. These are tiny.
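A hedged sketch of step 2: SPA, one of the separable NMF methods compared later in the talk, fits in a few lines of NumPy. The synthetic separable matrix below is my own construction for illustration.

```python
import numpy as np

def spa(A, r):
    # Successive Projection Algorithm: greedily pick the column with
    # the largest norm, project it out of all columns, and repeat.
    X = A.astype(float).copy()
    cols = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(X, axis=0)))
        cols.append(j)
        u = X[:, j] / np.linalg.norm(X[:, j])
        X -= np.outer(u, u @ X)   # project onto the orthogonal complement
    return cols

# Synthetic separable matrix A = W H where H contains an identity block,
# so the first 3 columns of A are the true extreme columns.
rng = np.random.default_rng(3)
W = rng.random((50, 3))
H = np.hstack([np.eye(3), rng.dirichlet(np.ones(3), size=7).T])
A = W @ H
K = spa(A, 3)   # recovers the extreme column indices
```

The geometric picture from the previous slide is exactly why this works: the largest-norm column of a set of convex combinations is always an extreme point.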
22. All of the hard analysis is on the small dimension of the matrix, which makes this very useful in practice.
23. Our methods vs. the competition
200 million rows, 200 columns, separation rank 20.
Figure 1: Relative error in the separable factorization as a function of nonnegative rank (r) for the three algorithms. The matrix was synthetically generated to be separable. SPA and GP capture all of the true extreme columns when r = 20 (where the residual is zero). Since we are using the greedy variant of XRAY, it takes r = 21 to capture all of them.
Figure 2: First 20 extreme columns selected by SPA, XRAY, and GP along with the true columns used in the synthetic matrix generation. A marker is present for a given column index if and only if that column is a selected extreme column. SPA and GP capture all of the true extreme columns.
24. Nonlinear heat transfer model in random media
Each run takes 5 hours on 8 processors and outputs a 4M (node) by 9 (time-step) simulation. We did 8192 runs (128 samples of bubble locations, 64 bubble radii): 4.5 TB of data in Exodus II (NetCDF). Apply heat; look at the temperature.
https://www.opensciencedatacloud.org/publicdata/heat-transfer/
26. Each simulation is a column: a 5B-by-64 matrix, 2.2 TB. Compute the SVD A = U S Vᵀ, then run a “standard” NMF algorithm on SVᵀ to get A(:, K) and H.
27. Figure 9: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that “sandwich” them in the matrix. See Figure 10 for a closer look at the coefficients.
Figure 8: First 10 extreme columns selected by SPA, XRAY, and GP for the heat transfer simulation.
Figure 10: Value of the H matrix for columns 1 through 34 for the SPA algorithm on the heat transfer simulation.
33. We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]
PPR is a Markov chain on nodes:
1. with probability α, follow a random edge;
2. with probability 1 − α, restart at a seed.
Also known as the random surfer, or a random walk with restart. It has a unique stationary distribution.
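That stationary distribution can be sketched densely with power iteration (a toy 3-node chain of my own, just to make the definition concrete; the talk's actual algorithm is a local push method):

```python
import numpy as np

def ppr(P, seed, alpha=0.85, iters=1000):
    # Stationary distribution of the restart chain:
    # x = alpha * P^T x + (1 - alpha) * s, solved by power iteration.
    n = P.shape[0]
    s = np.zeros(n)
    s[seed] = 1.0
    x = s.copy()
    for _ in range(iters):
        x = alpha * (P.T @ x) + (1 - alpha) * s
    return x

# P[i, j] = probability of stepping from node i to node j (a 3-cycle).
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
x = ppr(P, seed=0)
```

The result is a probability vector that is largest at the seed and decays with distance from it, which is exactly the locality the community methods exploit.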
34. Personalized PageRank community detection
1. Given a seed, approximate the stationary distribution.
2. Extract the community.
Both are local operations.
35. Conductance communities
Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving the set to the total edges in the set:
φ(S) = cut(S) / min(vol(S), vol(S̄))
where cut(S) counts the edges leaving the set and vol(S) the total edges in the set. Equivalently, it’s the probability that a random edge leaves the set. Small conductance ⇔ good community.
Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.
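The formula is easy to compute directly on the dictionary-of-sets graph format the talk's PPR code uses (the two-triangles example below is my own construction):

```python
def conductance(G, S):
    # G: undirected graph as a dictionary-of-sets.
    # phi(S) = cut(S) / min(vol(S), vol(complement of S)).
    S = set(S)
    cut = sum(1 for v in S for u in G[v] if u not in S)
    volS = sum(len(G[v]) for v in S)
    volT = sum(len(G[v]) for v in G) - volS
    return cut / min(volS, volT)

# Two triangles joined by one edge: {0, 1, 2} is a natural community.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi = conductance(G, {0, 1, 2})  # cut = 1, vol = 7, so phi = 1/7
```

The single cut edge against a volume of 7 gives the small conductance a good community should have.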
37. import collections
# G is graph as dictionary-of-sets
# seed is the list of seed vertices
alpha = 0.99
tol = 1e-4

x = {} # Store x, r as dictionaries
r = {} # initialize residual
Q = collections.deque() # initialize queue
for s in seed:
    r[s] = 1.0/len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft() # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1-alpha)*r[v]
    mass = alpha*r[v]/(2*len(G[v]))
    for u in G[v]: # for neighbors of v
        if u not in r: r[u] = 0.
        if (r[u] < len(G[u])*tol and
                r[u] + mass >= len(G[u])*tol):
            Q.append(u) # add u to queue if large
        r[u] = r[u] + mass
    r[v] = mass*len(G[v])
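To “extract the community” from the computed x, the standard companion step (not shown on the slide) is a sweep cut over x[v]/deg(v); here is a sketch in the same dictionary-of-sets format, with an assumed PPR-like vector as input:

```python
def sweep_cut(G, x):
    # Sort vertices by x[v]/deg(v) and take the prefix set with the
    # smallest conductance -- the standard sweep used to extract a
    # community from an approximate PPR vector (Andersen et al. 2006).
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    total_vol = sum(len(G[v]) for v in G)
    best, best_phi = set(), float('inf')
    S, cut, vol = set(), 0, 0
    for v in order:
        S.add(v)
        vol += len(G[v])
        if total_vol - vol == 0:
            break  # the full vertex set is not a community
        # edges from v into S leave the cut; edges out of S join it
        cut += len(G[v]) - 2 * sum(1 for u in G[v] if u in S and u != v)
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best, best_phi

# Two triangles joined by one edge; x is an assumed PPR-like vector
# concentrated on the left triangle.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = {0: 0.4, 1: 0.35, 2: 0.2, 3: 0.05}
S, phi = sweep_cut(G, x)
```

The sweep returns the left triangle {0, 1, 2} with conductance 1/7, the best prefix set under this ordering.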
39. Whang-Gleich-Dhillon, CIKM2013 [upcoming…]
1. Extract part of the graph that might have overlapping communities.
2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.
3. Find the centers of these partitions.
4. Use PPR to grow egonets of these centers.
40. A good partitioning helps
[Figure: maximum conductance vs. coverage (percentage) for egonet, graclus centers, spread hubs, random, and bigclam seedings on (a) AstroPh and (d) Flickr.]
Flickr social network: 2M vertices, 22M edges. We can cover 95% of the network with communities of conductance ~0.15.
41. And helps to find real-world overlapping communities too
[Figure 3: F1 and F2 measures on DBLP comparing our algorithmic communities (graclus centers, spread hubs, random, egonet) against demon and bigclam; higher indicates better communities.]
Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure. Our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!
42. Seed Set Expansion
Carefully select seeds; greedily expand communities around the seed sets.
The algorithm: Filtering Phase → Seeding Phase → Seed Set Expansion Phase → Propagation Phase.
Joyce Jiyoung Whang, The University of Texas at Austin, Conference on Information and Knowledge Management (8/44)
43. Filtering Phase
44. Seed Set Expansion Phase
Run clustering, and choose centers or pick an independent set of high degree nodes. Then run personalized PageRank.
45. Propagation Phase
We can prove that this only improves the objective.
46. Conclusion & Discussion
PPR community detection is fast [Andersen et al. FOCS06].
PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013].
Partitioning for seeding yields high coverage & real communities.
“Caveman” communities? Best conductance cut at the intersection of communities?
References: Gleich & Seshadhri, KDD2012; Whang, Gleich & Dhillon, CIKM2013.
PPR sample: bit.ly/18khzO5
Egonet seeding: bit.ly/dgleich-code