Big data matrix factorizations and overlapping community detection in graphs
David F. Gleich
Purdue University
Joint work with Paul Constantine, Austin Benson, Jason Lee, Jeremy Templeton, Yangyang Hou, C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF and DOE ASCR award
Code: bit.ly/dgleich-codes

Tall-and-Skinny matrices (m ≫ n)
Many rows (like a billion)
A few columns (under 10,000)
Example: the matrix A from the tinyimages collection

Used in
regression and general linear models with many samples
block iterative methods
panel factorizations
approximate kernel k-means
big-data SVD/PCA
David Gleich · Purdue

A graphical view of the MapReduce programming model
[Diagram: data blocks feed Map tasks, which emit (key, value) pairs; Shuffle groups the values by key; Reduce tasks turn each (key, values) group into output data]
Map tasks read batches of data in parallel and do some initial filtering.
Shuffle is a global communication step, like a group-by or MPI_Alltoall.
Reduce is often where the computation happens.

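The three stages can be mimicked in a few lines of plain Python. This is only a single-process sketch of the programming model, not Hadoop; the names map_reduce, colsum_mapper and colsum_reducer and the tiny matrix are made up for illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over records, group values by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: a global group-by on the key
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

# Hypothetical example: column sums of a small matrix stored one row per record.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

def colsum_mapper(row):
    for j, v in enumerate(row):
        yield j, v                         # key = column index, value = entry

def colsum_reducer(key, values):
    return sum(values)

colsums = map_reduce(rows, colsum_mapper, colsum_reducer)
```

Here the reducer for key j sees every entry of column j, which is the group-by behaviour the slide attributes to the shuffle.
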
PCA of 80,000,000 images
A: 80,000,000 images by 1000 pixels
[Figure: the first 16 columns of V shown as images]
[Figure: fraction of variance (0 to 1) versus number of principal components (1 to 100)]
Figure 5: The 16 most impo... nent basis functions (by row...
Constantine & Gleich, MapReduce 2011.

Regression with 80,000,000 images
A: 80,000,000 images by 1000 pixels
The goal was to approximate how much red there was in a picture from the values of the grayscale pixels only.
We get a measure of how much "redness" each pixel contributes to the whole.

... the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if $r_i$ is the sum of the red components in all pixels of image $i$, and $G_{i,j}$ is the gray value of the $j$th pixel in image $i$, then we wanted to find $\min_s \sum_i \bigl(r_i - \sum_j G_{i,j} s_j\bigr)^2$. There is no particular importance to this regression problem; we use it merely as a demonstration.
The coefficients $s_j$ are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light-blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.
We also solved a principal component problem to find a principal component basis for each image. Let $G$ be the matrix of $G_{i,j}$'s from the regression and let $u_i$ be the mean of the $i$th ...

Models and algorithms for high performance matrix and network computations

Tensor eigenvalues and a power method
maximize $\sum_{ijk} T_{ijk} x_i x_j x_k$ subject to $\|x\|_2 = 1$
$[x^{(\mathrm{next})}]_i = \rho \cdot \bigl(\sum_{jk} T_{ijk} x_j x_k + x_i\bigr)$, where $\rho$ ensures the 2-norm constraint
SSHOPM method due to Kolda and Mayo

Network alignment
ICDM '09, SC '11, TKDE '13
Figure 6: Previous work from the PI tackled network alignment with matrix methods for edge overlap. This proposal is for matching triangles using tensor methods.
[Diagram: edge overlap between graphs A and B through L (i, j matched to i', j'); triangle overlap (i, j, k matched to i', j', k')]

Big data methods
SIMAX '09, SISC '11, MapReduce '11, ICASSP '12
[Figure: error and standard deviation panels, (b) Std, s = 0.39 cm and (d) Std, s = 1.95 cm]

Fast & Scalable Network centrality
SC '05, WAW '07, SISC '10, WWW '10, …

Data clustering
WSDM '12, KDD '12, CIKM '13 …

Massive matrix computations on multi-threaded and distributed architectures
$Ax = b$, $\min \|Ax - b\|$, $Ax = \lambda x$

PCA of 80,000,000 images
A: 80,000,000 images by 1000 pixels
MapReduce: zero-mean the rows of A to get X, then run TSQR on X to get R.
Post processing: compute the SVD of R to get V.
Outputs: the first 16 columns of V as images, and the top 100 singular values (principal components).
Constantine & Gleich, MapReduce 2011.

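The pipeline works because X and its small factor R share singular values and right singular vectors. A minimal NumPy sketch, assuming a random stand-in matrix in place of the image data and a single in-memory QR in place of the MapReduce TSQR step:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 8))        # stand-in for the tall image matrix

X = A - A.mean(axis=1, keepdims=True)     # zero-mean rows, as on the slide
R = np.linalg.qr(X, mode='r')             # only the small 8-by-8 factor R
_, sigma, Vt = np.linalg.svd(R)           # post-processing on a tiny matrix

# Cumulative fraction of variance captured by the leading components.
frac_var = np.cumsum(sigma**2) / np.sum(sigma**2)

# Reference: singular values computed from the tall matrix directly.
sigma_direct = np.linalg.svd(X, compute_uv=False)
```

The rows of Vt play the role of the columns of V on the slide; only R ever needs to leave the cluster.
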
Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute colsum(A): 161 sec.
Time to compute R in qr(A): 387 sec.

How to store tall-and-skinny matrices in Hadoop
A : m x n, m ≫ n, split by rows into submatrices A1, A2, A3, A4
Key is an arbitrary row-id.
Value is the 1 x n array for a row (or a b x n block).
Each submatrix Ai is the input to a map task.

Numerical stability was a problem for prior approaches
[Figure: norm(QᵀQ − I), from 10^-15 to 10^5, versus condition number, from 10^0 to 10^20, for the methods AR⁻¹, AR⁻¹ + iterative refinement, and Direct TSQR]
Prior work
1. Constantine & Gleich, MapReduce 2011
2. Benson, Gleich, Demmel, BigData’13
3. Benson, Gleich, Demmel, BigData’13
4. Direct TSQR: Benson, Gleich, Demmel, BigData’13
Previous methods couldn’t ensure that the matrix Q was orthogonal.

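The loss of orthogonality in the AR⁻¹ approach is easy to reproduce on a small example. The sketch below is an illustration under assumed sizes (500 by 10) and a synthetic matrix with prescribed singular values; the helper name orthogonality_error is made up, and this is not the experiment behind the slide's plot.

```python
import numpy as np

def orthogonality_error(cond, m=500, n=10, seed=0):
    """Build A with the given condition number, form Q = A R^{-1},
    and return norm(Q^T Q - I)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)    # singular values from 1 down to 1/cond
    A = (U * s) @ V.T
    R = np.linalg.qr(A, mode='r')
    Q = np.linalg.solve(R.T, A.T).T           # Q = A R^{-1} without forming the inverse
    return np.linalg.norm(Q.T @ Q - np.eye(n))

err_well = orthogonality_error(1e1)           # well-conditioned input
err_ill = orthogonality_error(1e10)           # ill-conditioned input
```

Comparing err_well and err_ill shows the trend in the figure: the computed Q drifts away from orthogonality as the condition number grows.
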
Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011)
Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
[Diagram: Mapper 1 (Serial TSQR) reads A1 to A4: qr of A1 stacked on A2 gives Q2, R2; qr of R2 stacked on A3 gives Q3, R3; qr of R3 stacked on A4 gives Q4, R4; emit R4. Mapper 2 (Serial TSQR) does the same with A5 to A8 and emits R8. Reducer 1 (Serial TSQR) takes R4 and R8, runs qr, and emits R.]

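The map and reduce steps in the diagram are the same operation: stack what you have on top of the next block and keep only R. A minimal in-memory sketch, assuming a random 800-by-5 matrix split into eight row blocks (the function name serial_tsqr mirrors the diagram's label but is otherwise an illustration):

```python
import numpy as np

def serial_tsqr(blocks):
    """Fold row blocks into a single R factor, one small QR at a time."""
    R = np.empty((0, blocks[0].shape[1]))
    for block in blocks:
        R = np.linalg.qr(np.vstack([R, block]), mode='r')
    return R

rng = np.random.default_rng(1)
A = rng.standard_normal((800, 5))
row_blocks = np.array_split(A, 8)         # A1 .. A8

R4 = serial_tsqr(row_blocks[:4])          # what "Mapper 1" emits
R8 = serial_tsqr(row_blocks[4:])          # what "Mapper 2" emits
R = serial_tsqr([R4, R8])                 # what "Reducer 1" emits

# R is unique only up to the sign of each row, so compare R^T R with A^T A.
gram_gap = np.linalg.norm(R.T @ R - A.T @ A)
```

No task ever holds more than one block plus a 5-by-5 triangle, which is why the method suits matrices with many rows and few columns.
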
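More about how to compute a regression
$\min \|Ax - b\|^2 = \min \sum_i \bigl(\sum_j A_{ij} x_j - b_i\bigr)^2$
[Diagram: Mapper 1 (Serial TSQR) processes the blocks A1, A2, A3, A4 of A together with b; each qr step (for example Q2, R2) also updates the right-hand side, b2 = Q2ᵀ b1]
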
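One way to see the idea in code is to carry b through the same block reduction. The sketch below swaps in an equivalent trick rather than the diagram's explicit b2 = Q2ᵀ b1 updates: it appends b as an extra column, so the last column of the reduced triangle holds the needed part of Qᵀb. Sizes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((600, 4))
b = rng.standard_normal(600)

# Append b as an extra column and reduce the row blocks as in serial TSQR.
R_aug = np.empty((0, 5))
for block in np.array_split(np.column_stack([A, b]), 6):
    R_aug = np.linalg.qr(np.vstack([R_aug, block]), mode='r')

R = R_aug[:4, :4]                    # R factor of A
qtb = R_aug[:4, 4]                   # leading entries of Q^T b
x = np.linalg.solve(R, qtb)          # solve the small system R x = Q^T b

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]   # reference least-squares solution
```

The final solve involves only a 4-by-4 triangular system, regardless of how many rows A has.
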
Too many maps cause too much data to one reducer!
Each image is 5k.
Each HDFS block has 12,800 images.
6,250 total blocks.
Each map outputs a 1000-by-1000 matrix.
One reducer gets a 6.25M-by-1000 matrix (50GB).

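The numbers on this slide follow from simple arithmetic. The check below assumes 8-byte floating-point entries, which the slide does not state but which reproduces its 50GB figure.

```python
images = 80_000_000
images_per_block = 12_800
pixels = 1000
bytes_per_entry = 8            # assumption: entries stored as 64-bit floats

blocks = images // images_per_block              # HDFS blocks, one map task each
rows_at_reducer = blocks * pixels                # each map emits a 1000-by-1000 R
gigabytes = rows_at_reducer * pixels * bytes_per_entry / 1e9
```
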
Too many maps cause too much data to one reducer!
[Diagram: a two-iteration job. Iteration 1: Mappers 1-1 to 1-4 (Serial TSQR) map the blocks A1 to A4 of A and emit R1 to R4; a shuffle groups them as S1, S2, S3 for Reducers 1-1 to 1-3 (Serial TSQR), which emit R2,1, R2,2, R2,3. Iteration 2: an identity map and a shuffle collect these as S(2) for Reducer 2-1 (Serial TSQR), which emits R.]

The rest of the talk
Full TSQR code in hadoopy
import random, numpy, hadoopy
class SerialTSQR:
    def __init__(self,blocksize,isreducer):
        self.bsize=blocksize
        self.data = []
        # Binding __call__ per instance relies on Python 2 old-style classes
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper
    def compress(self):
        # Replace the buffered rows by the R factor of their QR factorization
        R = numpy.linalg.qr(
            numpy.array(self.data),'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])
    def collect(self,key,value):
        self.data.append(value)
        # Compress once the buffer holds more than bsize*n rows
        if len(self.data)>self.bsize*len(self.data[0]):
            self.compress()
    def close(self):
        # Called at the end of the task: emit the rows of the final R
        self.compress()
        for row in self.data:
            key = random.randint(0,2000000000)
            yield key, row
    def mapper(self,key,value):
        self.collect(key,value)
    def reducer(self,key,values):
        for value in values: self.mapper(key,value)
if __name__=='__main__':
    mapper = SerialTSQR(blocksize=3,isreducer=False)
    reducer = SerialTSQR(blocksize=3,isreducer=True)
    hadoopy.run(mapper, reducer)

Non-negative matrix factorization

NMF: Find W, H ≥ 0 where A ≈ WH
Separable NMF: Find H ≥ 0, A(:, K) where A ≈ A(:, K)H
There are good algorithms for separable NMF that avoid alternating between W, H.
[Figure panels: (b) NMF, projection on 1st NNF versus 2nd NNF; (c) Manifold Learning, first manifold parameter versus second]

Separable NMF algorithms
1.  Find the columns of A.
2.  Find the values of W.

Separable NMF algorithms are really geometry
1.  Find the columns of A. Equivalent to “Find the extreme points of a convex set.”
2.  These are preserved under linear transformations.

We use our tall-and-skinny QR to get an orthogonal transformation to make the problem easily solvable.

[Diagram: SVD, A = U S VT; NMF, A ≈ A(:, K) H]
1. Compute QR using the TSQR method.
2. Run a separable NMF method on S VT.
3. Find H by solving a small non-negative least-squares problem in each column. These are tiny.
All of the hard analysis is on the small dimension of the matrix, which makes this very useful in practice.

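Because an orthogonal transformation preserves column norms and angles, the extreme columns can be found from the small triangular factor alone. The sketch below uses the successive projection algorithm (SPA, one of the methods compared in the figures) on a synthetic separable matrix; the data, the helper name spa, and the use of a plain QR in place of TSQR are assumptions for illustration, and the non-negative least-squares step for H is left out.

```python
import numpy as np

def spa(M, r):
    """Successive projection: repeatedly take the column with the largest
    norm, then project every column onto its orthogonal complement."""
    residual = M.astype(float).copy()
    selected = []
    for _ in range(r):
        j = int(np.argmax(np.sum(residual**2, axis=0)))
        selected.append(j)
        u = residual[:, j] / np.linalg.norm(residual[:, j])
        residual -= np.outer(u, u @ residual)
    return selected

rng = np.random.default_rng(3)
W = rng.random((300, 3))                        # three extreme columns
mix = rng.dirichlet(np.ones(3), size=7).T       # convex weights, columns sum to 1
A = W @ np.hstack([np.eye(3), mix])             # separable: columns 0, 1, 2 are extreme

R = np.linalg.qr(A, mode='r')                   # stands in for the TSQR output
K = sorted(spa(R, 3))                           # column selection on the 10-by-10 R only
```

The selection never touches the 300-row matrix after the QR step, which is the point of the slide: the hard analysis happens on the small dimension.
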
Our methods vs. the competition
200 million rows, 200 columns, separation rank 20.
Figure 1: Relative error in the separable factorization as a function of nonnegative rank (r) for the three algorithms. The matrix was synthetically generated to be separable. SPA and GP capture all of the true extreme columns when r = 20 (where the residual is zero). Since we are using the greedy variant of XRAY, it takes r = 21 to capture all of the ...
Figure 2: First 20 extreme columns selected by SPA, XRAY, and GP along with the true columns used in the synthetic matrix generation. A marker is present for a given column index if and only if that column is a selected extreme column. SPA and GP capture all of the true extreme columns. Since we are using the greedy variant of XRAY, it does se...

Nonlinear heat transfer model in random media
Each run takes 5 hours on 8 processors and outputs a 4M (node) by 9 (time-step) simulation.
We did 8192 runs (128 samples of bubble locations, 64 bubble radii).
4.5 TB of data in Exodus II (NetCDF)
https://www.opensciencedatacloud.org/publicdata/heat-transfer/
[Diagram: apply heat, look at temperature]
[Figure: proportion of temperature > 475K versus bubble radius (0 to 60), with an inset for radii 15 to 25; curves: True, ROM, RS]
[Figure: proportion of temperature > 475K versus bubble radius (0 to 60), with the insulator regime and the non-insulator regime marked]

Each simulation is a column of A, a 5B-by-64 matrix (2.2TB).
[Diagram: SVD, A = U S VT; NMF, A ≈ A(:, K) H]
Run a “standard” NMF algorithm on S VT.

Figure 8: First 10 extreme columns selected by SPA, XRAY, and GP for the heat transfer simulation ...
Figure 9: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that “sandwich” them in the matrix. See Figure 10 for a closer look at the coefficients.
Figure 10: Value of H matrix for columns 1 through 34 for the SPA algorithm on the heat transfer sim...

A bunch of papers
Constantine & Gleich, MapReduce 2011
Benson, Gleich & Demmel, BigData 2013
Benson, Gleich, Rajwa & Lee, arXiv 2014
Constantine, Gleich, Hou, Templeton, SISC in press

Code online: github.com/arbenson

Next talk
1.  Personalized PageRank based community detection
2.  The best community detection algorithm?

A community is a set of vertices that is denser inside than out.
[Figure: 250 node GEOP network in 2 dimensions]

We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]

PPR is a Markov chain on nodes
1.  with probability 𝛼, follow a random edge
2.  with probability 1-𝛼, restart at a seed
aka random surfer
aka random walk with restart
unique stationary distribution

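The stationary distribution of this chain can be approximated by simply iterating the two rules. A minimal sketch on a made-up six-node graph, assuming a dictionary-of-sets graph and alpha = 0.85 (both are illustrative choices, not values from the talk):

```python
def ppr_power_iteration(G, seeds, alpha=0.85, iters=200):
    """Iterate x <- alpha * P x + (1 - alpha) * s, where P spreads each
    node's mass uniformly over its neighbors and s is uniform on the seeds."""
    s = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in G}
    x = dict(s)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * s[v] for v in G}     # restart at a seed
        for v, nbrs in G.items():
            share = alpha * x[v] / len(nbrs)         # follow a random edge
            for u in nbrs:
                nxt[u] += share
        x = nxt
    return x

# Hypothetical toy graph: two triangles joined by the edge 2-3.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = ppr_power_iteration(G, seeds={0})
```

This version touches every node on every pass, so it is not the local method discussed in the talk; it only shows what quantity the local method approximates.
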
Personalized PageRank community detection

1.  Given a seed, approximate the stationary distribution.
2.  Extract the community.

Both are local operations.

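The slide does not say how the community is extracted. A common choice, used by Andersen et al., is a sweep cut: order nodes by PageRank value divided by degree and keep the prefix with the smallest conductance. The sketch below assumes that choice; the graph and the scores in x are made up.

```python
def sweep_cut(G, x):
    """Order nodes by x[v] / degree(v) and return the prefix set with the
    smallest conductance, together with that conductance."""
    total_vol = sum(len(nbrs) for nbrs in G.values())
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    inside, vol, cut = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order:
        vol += len(G[v])
        links_in = sum(1 for u in G[v] if u in inside)
        cut += len(G[v]) - 2 * links_in      # new boundary edges minus absorbed ones
        inside.add(v)
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_set, best_phi = set(inside), cut / denom
    return best_set, best_phi

# Two triangles joined by the edge 2-3, with made-up PPR-like scores.
G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
x = {0: 0.40, 1: 0.25, 2: 0.25, 3: 0.05, 4: 0.03, 5: 0.02}
community, phi = sweep_cut(G, x)
```

Only nodes with a score in x are visited, so the cost depends on the size of the approximate solution rather than on the whole graph.
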
Conductance communities
Conductance is one of the most important community scores [Schaeffer07].
The conductance of a set of vertices is the ratio of edges leaving to total edges:
$\phi(S) = \dfrac{\operatorname{cut}(S)}{\min\bigl(\operatorname{vol}(S), \operatorname{vol}(\bar{S})\bigr)}$
with the numerator counting the edges leaving the set and the denominator the total edges in the set.
Equivalently, it’s the probability that a random edge leaves the set.
Small conductance ⇔ Good community
Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.

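The definition translates directly into code. A small sketch, assuming an undirected graph stored as a dictionary of neighbor sets and taking vol as the sum of degrees; the example graph is made up and is not the one drawn on the slide.

```python
def conductance(G, S):
    """cut(S) / min(vol(S), vol(complement of S)) for an undirected graph
    stored as a dictionary of neighbor sets."""
    S = set(S)
    cut = sum(1 for v in S for u in G[v] if u not in S)    # edges leaving S
    vol_S = sum(len(G[v]) for v in S)                      # degree sum inside S
    vol_rest = sum(len(G[v]) for v in G if v not in S)     # degree sum outside S
    return cut / min(vol_S, vol_rest)

# Hypothetical toy graph: a 4-clique with a two-node path hanging off node 3.
G = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
phi = conductance(G, {0, 1, 2, 3})
```

Here one edge leaves the clique, the clique has volume 13 and the rest has volume 3, so the minimum in the denominator is taken by the smaller side, just as in the slide's example.
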
Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]

Informally
Suppose the seeds are in a set of good conductance, then the personalized PageRank method will find a set with conductance that’s nearly as good.
… also, it’s really fast.

import collections

# G is graph as dictionary-of-sets; seed is a list of starting vertices
alpha=0.99
tol=1e-4

x = {} # Store x, r as dictionaries
r = {} # initialize residual
Q = collections.deque() # initialize queue
for s in seed:
    r[s] = 1./len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft() # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1-alpha)*r[v]
    mass = alpha*r[v]/(2*len(G[v]))
    for u in G[v]: # for neighbors of v
        if u not in r: r[u] = 0.
        if r[u] < len(G[u])*tol and \
           r[u] + mass >= len(G[u])*tol:
            Q.append(u) # add u to queue if large
        r[u] = r[u] + mass
    r[v] = mass*len(G[v]) # half of the pushed mass stays at v
    if r[v] >= len(G[v])*tol: Q.append(v) # re-queue v if still large

Problem 1, which seeds?

Whang-Gleich-Dhillon, CIKM2013 [upcoming…]
1.  Extract part of the graph that might have overlapping communities.
2.  Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.
3.  Find the center of these partitions.
4.  Use PPR to grow egonets of these centers.

A good partitioning helps
[Figure: maximum conductance (0 to 1) versus coverage (percentage, 0 to 100) for egonet, graclus centers, spread hubs, random, and bigclam; panels (a) AstroPh and (d) Flickr]
Flickr social network (flickr sample): 2M vertices, 22M edges
We can cover 95% of the network with communities of cond. ~0.15.

And helps to find real-world overlapping communities too.
[Figure 3: F1 and F2 measures on DBLP for demon, bigclam, graclus centers, spread hubs, random, and egonet. Caption fragments: "F1 and F2 measures comparing our algorithmic co..." and "indicates better communities."]
Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure.
Our method outperforms current state of the art overlapping community detection methods. Even randomly seeded!

Seed Set Expansion
Carefully select seeds.
Greedily expand communities around the seed sets.
The algorithm
Filtering Phase
Seeding Phase
Seed Set Expansion Phase
Propagation Phase
Joyce Jiyoung Whang, The University of Texas at Austin, Conference on Information and Knowledge Management

Filtering Phase
[Figure: filtering phase illustration]

Seeding Phase: run clustering and choose centers, or pick an independent set of high degree nodes.
Seed Set Expansion Phase: run personalized PageRank.

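The "independent set of high degree nodes" option can be sketched as a greedy pass over nodes in decreasing degree order. This is an illustration of the idea only; the function name spread_hubs echoes the legend label in the plots, and details such as tie-breaking are assumptions rather than the published procedure.

```python
def spread_hubs(G, k):
    """Greedily pick up to k high-degree nodes such that no two picked
    nodes are adjacent (an independent set of hubs)."""
    seeds, blocked = [], set()
    for v in sorted(G, key=lambda v: (-len(G[v]), v)):   # highest degree first
        if len(seeds) == k:
            break
        if v not in blocked:
            seeds.append(v)
            blocked.add(v)
            blocked.update(G[v])          # neighbors can no longer be seeds
    return seeds

# Hypothetical toy graph with two hubs, node 0 and node 4, bridged by node 3.
G = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4},
     4: {3, 5, 6, 7}, 5: {4}, 6: {4}, 7: {4}}
seeds = spread_hubs(G, 2)
```

Each selected seed would then be expanded with personalized PageRank, as described for the seed set expansion phase.
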
Propagation Phase
We can prove that this only improves the objective.

Conclusion & Discussion
PPR community detection is fast [Andersen et al. FOCS06].
PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013].
Partitioning for seeding yields high coverage & real communities.
“Caveman” communities? Best conductance cut at intersection of communities?

References
Gleich & Seshadhri, KDD2012
Whang, Gleich & Dhillon, CIKM2013
PPR Sample: bit.ly/18khzO5
Egonet seeding: bit.ly/dgleich-code
Contenu connexe

Tendances

Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networksDavid Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresDavid Gleich
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detectionDavid Gleich
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017Fred J. Hickernell
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksDavid Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
High-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHigh-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHolistic Benchmarking of Big Linked Data
 
Tensor Train decomposition in machine learning
Tensor Train decomposition in machine learningTensor Train decomposition in machine learning
Tensor Train decomposition in machine learningAlexander Novikov
 
Reduction of the small gain condition
Reduction of the small gain conditionReduction of the small gain condition
Reduction of the small gain conditionMKosmykov
 
[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributionsWooSung Choi
 
Green’s Function Solution of Non-homogenous Singular Sturm-Liouville Problem
Green’s Function Solution of Non-homogenous Singular Sturm-Liouville ProblemGreen’s Function Solution of Non-homogenous Singular Sturm-Liouville Problem
Green’s Function Solution of Non-homogenous Singular Sturm-Liouville ProblemIJSRED
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...MLconf
 
Neural Art (English Version)
Neural Art (English Version)Neural Art (English Version)
Neural Art (English Version)Mark Chang
 

Tendances (18)

Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
Spectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structuresSpectral clustering with motifs and higher-order structures
Spectral clustering with motifs and higher-order structures
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Personalized PageRank based community detection
Personalized PageRank based community detectionPersonalized PageRank based community detection
Personalized PageRank based community detection
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017
 
Correlation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networksCorrelation clustering and community detection in graphs and networks
Correlation clustering and community detection in graphs and networks
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
High-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHigh-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K Characters
 
Tensor Train decomposition in machine learning
Tensor Train decomposition in machine learningTensor Train decomposition in machine learning
Tensor Train decomposition in machine learning
 
Reduction of the small gain condition
Reduction of the small gain conditionReduction of the small gain condition
Reduction of the small gain condition
 
[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions
 
Green’s Function Solution of Non-homogenous Singular Sturm-Liouville Problem
Green’s Function Solution of Non-homogenous Singular Sturm-Liouville ProblemGreen’s Function Solution of Non-homogenous Singular Sturm-Liouville Problem
Green’s Function Solution of Non-homogenous Singular Sturm-Liouville Problem
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
 
Neural Art (English Version)
Neural Art (English Version)Neural Art (English Version)
Neural Art (English Version)
 

En vedette

Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social MediaSymeon Papadopoulos
 
Applications of community detection in bibliometric network analysis
Applications of community detection in bibliometric network analysisApplications of community detection in bibliometric network analysis
Applications of community detection in bibliometric network analysisNees Jan van Eck
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Community Detection
Community Detection Community Detection
Community Detection Kanika Kanwal
 
Finding Emerging Topics Using Chaos and Community Detection in Social Media G...
Finding Emerging Topics Using Chaos and Community Detection in Social Media G...Finding Emerging Topics Using Chaos and Community Detection in Social Media G...
Finding Emerging Topics Using Chaos and Community Detection in Social Media G...Paragon_Science_Inc
 
Community Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewCommunity Detection in Social Networks: A Brief Overview
Community Detection in Social Networks: A Brief OverviewSatyaki Sikdar
 
Citation analysis for research evaluation
Citation analysis for research evaluationCitation analysis for research evaluation
Citation analysis for research evaluationWouter Gerritsma
 
153-Russo Multilayer network analysis of innovation intermediaries activities
153-Russo Multilayer network analysis of innovation intermediaries activities153-Russo Multilayer network analysis of innovation intermediaries activities
153-Russo Multilayer network analysis of innovation intermediaries activitiesinnovationoecd
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 
A multithreaded method for network alignment
A multithreaded method for network alignmentA multithreaded method for network alignment
A multithreaded method for network alignmentDavid Gleich
 
A history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveA history of PageRank from the numerical computing perspective
A history of PageRank from the numerical computing perspectiveDavid Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
Direct tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDirect tall-and-skinny QR factorizations in MapReduce architectures
Direct tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsDavid Gleich
 
Tall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesTall-and-skinny QR factorizations in MapReduce architectures
Tall-and-skinny QR factorizations in MapReduce architecturesDavid Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...David Gleich
 
Community detection algorithms
Community detection algorithmsCommunity detection algorithms
Community detection algorithmsAlireza Andalib
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationDavid Gleich
 

Big data matrix factorizations and Overlapping community detection in graphs

  • 1. Big data matrix factorizations and Overlapping community detection in graphs. David F. Gleich, Purdue University. Joint work with Paul Constantine, Austin Benson, Jason Lee, Jeremy Templeton, Yangyang Hou, C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF and DOE ASCR award. Code: bit.ly/dgleich-codes
  • 2. Tall-and-skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000). Used in: regression and general linear models with many samples; block iterative methods; panel factorizations; approximate kernel k-means; big-data SVD/PCA. [Figure: matrix A, from the tinyimages collection.] David Gleich · Purdue
  • 3. A graphical view of the MapReduce programming model. Map tasks read batches of data in parallel and do some initial filtering. Shuffle is a global communication, like group-by or MPI_Alltoall. Reduce is often where the computation happens. [Diagram: data → Map → (key, value) pairs → Shuffle → (key, values) groups → Reduce → data.]
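The three phases on this slide can be modelled in a few lines of single-process Python. This is only a toy sketch, not Hadoop; `run_mapreduce`, `column_sum_map`, and `column_sum_reduce` are hypothetical names chosen for illustration. It computes the column sums of a small matrix whose rows arrive as (row-id, row) records.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory model of map -> shuffle -> reduce."""
    # Map: each input record may emit any number of (key, value) pairs.
    emitted = []
    for key, value in records:
        emitted.extend(mapper(key, value))
    # Shuffle: group all emitted values by key (the global group-by step).
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    # Reduce: each key is processed together with all of its values.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

def column_sum_map(row_id, row):
    # Emit one (column index, entry) pair per matrix entry.
    for j, entry in enumerate(row):
        yield j, entry

def column_sum_reduce(col, entries):
    yield col, sum(entries)

rows = [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0])]
col_sums = run_mapreduce(rows, column_sum_map, column_sum_reduce)
print(col_sums)  # {0: 9.0, 1: 12.0}
```

In a real job the mappers would usually pre-sum their own rows before emitting, so the shuffle moves far less data.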
  • 4. PCA of 80,000,000 images. A is 80,000,000 images by 1000 pixels. [Figure: first 16 columns of V as images. Plot: fraction of variance vs. principal components, 0 to 100.] Constantine & Gleich, MapReduce 2010.
  • 5. Regression with 80,000,000 images. The goal was to approximate how much red there was in a picture from the value of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole. A is 80,000,000 images by 1000 pixels. Paper excerpt: “… the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if $r_i$ is the sum of the red components in all pixels of image $i$, and $G_{i,j}$ is the gray value of the $j$th pixel in image $i$, then we wanted to find $\min \sum_i (r_i - \sum_j G_{i,j} s_j)^2$. There is no particular importance to this regression problem, we use it merely as a demonstration. The coefficients $s_j$ are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light-blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers. We also solved a principal component problem to find a principal component basis for each image. Let $G$ be matrix of $G_{i,j}$’s from the regression and let $u_i$ be the mean of the $i$th …”
  • 6. Models and algorithms for high performance matrix and network computations. Big data methods: SIMAX ’09, SISC ’11, MapReduce ’11, ICASSP ’12. Network alignment: ICDM ’09, SC ’11, TKDE ’13. Fast & scalable network centrality: SC ’05, WAW ’07, SISC ’10, WWW ’10, … Data clustering: WSDM ’12, KDD ’12, CIKM ’13, … Massive matrix computations ($Ax = b$, $\min \|Ax - b\|$, $Ax = \lambda x$) on multi-threaded and distributed architectures. Tensor eigenvalues and a power method: maximize $\sum_{ijk} T_{ijk} x_i x_j x_k$ subject to $\|x\|_2 = 1$, using $[x^{(\text{next})}]_i = \rho \cdot (\sum_{jk} T_{ijk} x_j x_k + x_i)$, where $\rho$ ensures the 2-norm constraint (SSHOPM method due to Kolda and Mayo). Figure 6: Previous work from the PI tackled network alignment with matrix methods for edge overlap; this proposal is for matching triangles using tensor methods. [Figure: error and std panels, including (b) Std, s = 0.39 cm and (d) Std, s = 1.95 cm; the accompanying paper text is clipped.]
  • 7. PCA of 80,000,000 images. A is 80,000,000 images by 1000 pixels. Pipeline: in MapReduce, zero the mean of the rows of A to get X, then run TSQR on X to get R; in post-processing, take the SVD of R to get V. Outputs: first 16 columns of V as images; top 100 singular values (principal components). Constantine & Gleich, MapReduce 2010.
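A small NumPy sketch of this pipeline follows, on a random stand-in matrix rather than the image data. It assumes that "zero mean rows" means subtracting each row's own mean, and it uses `numpy.linalg.qr` in place of the MapReduce TSQR. The point it illustrates is that the small factor R carries the same singular values and right singular vectors as the tall matrix X.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 8))      # tall-and-skinny stand-in for the image matrix

# Step 1 (MapReduce in the talk): zero the mean of each row (assumed meaning).
X = A - A.mean(axis=1, keepdims=True)

# Step 2 (TSQR in the talk): only the small n-by-n factor R is needed.
R = np.linalg.qr(X, mode='r')

# Step 3 (post-processing): SVD of the small matrix R.
_, s, Vt = np.linalg.svd(R)
V = Vt.T                                # columns of V are the principal directions

# Fraction of variance captured by the leading principal components.
fraction = np.cumsum(s**2) / np.sum(s**2)

# Check against the SVD of the full tall matrix.
s_direct = np.linalg.svd(X, compute_uv=False)
print(np.allclose(s, s_direct))
```

Because X = QR with orthonormal Q, the post-processing cost depends only on the number of columns, not on the number of rows.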
  • 8. Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB. Time to compute colsum(A): 161 sec. Time to compute R in qr(A): 387 sec.
  • 9. How to store tall-and-skinny matrices in Hadoop. A : m x n, m ≫ n. Key is an arbitrary row-id. Value is the 1 x n array for a row (or b x n block). Each submatrix Ai is the input to a map task. [Diagram: A split into row blocks A1, A2, A3, A4.]
  • 10. Numerical stability was a problem for prior approaches. Previous methods couldn’t ensure that the matrix Q was orthogonal. [Plot: norm(QᵀQ − I) vs. condition number, with curves labelled AR⁻¹, AR⁻¹ + iterative refinement, and Direct TSQR.] Prior work: 1. Constantine & Gleich, MapReduce 2011. 2. Benson, Gleich, Demmel, BigData’13. 3. Benson, Gleich, Demmel, BigData’13. 4. Direct TSQR: Benson, Gleich, Demmel, BigData’13.
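The effect on this slide can be reproduced at small scale. The sketch below is an illustration under assumptions, not the experiment from the papers: it builds a test matrix with a prescribed condition number, forms Q indirectly as A R⁻¹, and compares the orthogonality error norm(QᵀQ − I) with that of the Q returned directly by a Householder QR. The function name `orthogonality_loss` and the matrix sizes are hypothetical choices.

```python
import numpy as np

def orthogonality_loss(cond, m=500, n=10, seed=0):
    """Return norm(Q^T Q - I) for Q = A R^{-1} and for Q from a direct QR."""
    rng = np.random.default_rng(seed)
    # Build A = U diag(sigma) V^T with singular values from 1 down to 1/cond.
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    sigma = np.logspace(0, -np.log10(cond), n)
    A = (U * sigma) @ V.T

    R = np.linalg.qr(A, mode='r')
    # Indirect Q: solve Q R = A, i.e. R^T Q^T = A^T.
    Q_indirect = np.linalg.solve(R.T, A.T).T
    # Direct Q: taken straight from the Householder factorization.
    Q_direct, _ = np.linalg.qr(A)

    I = np.eye(n)
    return (np.linalg.norm(Q_indirect.T @ Q_indirect - I),
            np.linalg.norm(Q_direct.T @ Q_direct - I))

for cond in (1e2, 1e6, 1e10):
    indirect, direct = orthogonality_loss(cond)
    print(f"cond={cond:.0e}  A*inv(R): {indirect:.1e}  direct: {direct:.1e}")
```

The indirect error is expected to grow with the condition number while the direct error stays near machine precision, which is the behaviour the plot describes.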
  • 11. Communication avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011). Algorithm. Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows. [Diagram: Mapper 1 (Serial TSQR) reads blocks A1 to A4, repeatedly stacking the current R on top of the next block and factoring it (qr → Q2, R2; Q3, R3; Q4, R4), then emits R4. Mapper 2 (Serial TSQR) does the same for A5 to A8 and emits R8. Reducer 1 (Serial TSQR) stacks R4 and R8, factors them, and emits R.]
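The mapper and reducer steps in this diagram can be sketched with NumPy as follows. This is a single-process illustration with a hypothetical helper `serial_tsqr`; the block count and matrix size are arbitrary. Two "mappers" each fold four row blocks into one R factor, and a "reducer" folds the two R factors together.

```python
import numpy as np

def serial_tsqr(blocks):
    """Fold a sequence of row blocks into one n-by-n R factor."""
    R = None
    for block in blocks:
        # Stack the running R on top of the next block and refactor.
        stacked = block if R is None else np.vstack([R, block])
        R = np.linalg.qr(stacked, mode='r')
    return R

rng = np.random.default_rng(1)
A = rng.standard_normal((8000, 5))
blocks = np.array_split(A, 8)           # A1 ... A8

R4 = serial_tsqr(blocks[:4])            # Mapper 1 emits R4
R8 = serial_tsqr(blocks[4:])            # Mapper 2 emits R8
R = serial_tsqr([R4, R8])               # Reducer 1 emits the final R

# R agrees with a direct QR of A up to the signs of its rows.
R_direct = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R), np.abs(R_direct)))
```

Only 5-by-5 matrices cross the shuffle in this example, no matter how many rows each mapper reads.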
  • 12. More about how to compute a regression. $\min \|Ax - b\|^2 = \min \sum_i (\sum_j A_{ij} x_j - b_i)^2$. [Diagram: Mapper 1 (Serial TSQR) carries the right-hand side b along with the blocks A1, A2, A3, A4 of A: after each qr step, $b_2 = Q_2^T b_1$.]
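One way to read the diagram is that each QR step also applies Qᵀ to the right-hand side accumulated so far, so that at the end a small triangular system gives the regression coefficients. The sketch below follows that reading with NumPy; `tsqr_least_squares` is a hypothetical name and the data are synthetic.

```python
import numpy as np

def tsqr_least_squares(A_blocks, b_blocks):
    """Solve min ||Ax - b|| by folding (A_k, b_k) blocks into (R, c)."""
    R, c = None, None
    for A_k, b_k in zip(A_blocks, b_blocks):
        if R is None:
            stacked_A, stacked_b = A_k, b_k
        else:
            stacked_A = np.vstack([R, A_k])
            stacked_b = np.concatenate([c, b_k])
        Q, R = np.linalg.qr(stacked_A)   # reduced QR of the stacked block
        c = Q.T @ stacked_b              # b_next = Q^T b_prev
    # The minimizer of ||Ax - b|| satisfies R x = c.
    return np.linalg.solve(R, c)

rng = np.random.default_rng(2)
A = rng.standard_normal((6000, 4))
b = A @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(6000)

x_tsqr = tsqr_least_squares(np.array_split(A, 6), np.array_split(b, 6))
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_tsqr, x_ref))
```

The part of each stacked right-hand side that Qᵀ discards only contributes a constant to the residual, which is why dropping it does not change the minimizer.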
  • 13. Too many maps cause too much data to one reducer. Each image is 5k. Each HDFS block has 12,800 images. 6,250 total blocks. Each map outputs a 1000-by-1000 matrix. One reducer gets a 6.25M-by-1000 matrix (50GB).
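The numbers on this slide follow from simple arithmetic, worked through below. The 8 bytes per matrix entry is an assumption (double precision); the other figures come from the slides.

```python
num_images = 80_000_000        # rows of A
images_per_block = 12_800      # images in one HDFS block
n = 1000                       # pixels per image, so each map emits a 1000-by-1000 R
bytes_per_entry = 8            # assumption: double-precision entries

num_blocks = num_images // images_per_block      # one map task per block
rows_at_reducer = num_blocks * n                 # every map sends n rows of R
gb_at_reducer = rows_at_reducer * n * bytes_per_entry / 1e9

print(num_blocks)        # 6250
print(rows_at_reducer)   # 6250000
print(gb_at_reducer)     # 50.0
```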
  • 14. Too many maps cause too much data to one reducer. [Diagram: a two-iteration TSQR job. Iteration 1: Mappers 1-1 to 1-4 (Serial TSQR) map the blocks of A and emit R1 to R4; a shuffle distributes these as S(1) to Reducers 1-1 to 1-3 (Serial TSQR), which emit R2,1, R2,2, R2,3. Iteration 2: an identity map and a shuffle send S(2) to Reducer 2-1 (Serial TSQR), which emits R.]
  • 15. The rest of the talk. Full TSQR code in hadoopy:

      import random, numpy, hadoopy

      class SerialTSQR:
          def __init__(self, blocksize, isreducer):
              self.bsize = blocksize
              self.data = []
              self.isreducer = isreducer

          def __call__(self, key, value):
              # Dispatch here: Python 3 ignores a __call__ set on the instance.
              if self.isreducer:
                  return self.reducer(key, value)
              return self.mapper(key, value)

          def compress(self):
              # Replace the buffered rows by the R factor of their QR.
              R = numpy.linalg.qr(numpy.array(self.data), 'r')
              # reset data and re-initialize to R
              self.data = []
              for row in R:
                  self.data.append([float(v) for v in row])

          def collect(self, key, value):
              self.data.append(value)
              # Compress once more than bsize * ncols rows are buffered.
              if len(self.data) > self.bsize * len(self.data[0]):
                  self.compress()

          def close(self):
              # Emit the rows of the final R under random keys.
              self.compress()
              for row in self.data:
                  key = random.randint(0, 2000000000)
                  yield key, row

          def mapper(self, key, value):
              self.collect(key, value)

          def reducer(self, key, values):
              for value in values:
                  self.mapper(key, value)

      if __name__ == '__main__':
          mapper = SerialTSQR(blocksize=3, isreducer=False)
          reducer = SerialTSQR(blocksize=3, isreducer=True)
          hadoopy.run(mapper, reducer)
  • 16. Non-negative matrix factorization. NMF: find $W, H \ge 0$ where $A \approx WH$. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$. [Figure panels: (b) NMF, (c) Manifold Learning; axis labels: projection on 1st NNF, 2nd NNF, first manifold parameter, second …]
  • 17. There are good algorithms for separable NMF that avoid alternating between W, H. NMF: find $W, H \ge 0$ where $A \approx WH$. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$.
  • 18. Separable NMF algorithms. 1. Find the columns of A. 2. Find the values of W. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$. [Figure panels: (b) NMF, (c) Manifold Learning.]
  • 19. Separable NMF algorithms are really geometry. 1. Find the columns of A. Equivalent to “Find the extreme points of a convex set.” 2. These are preserved under linear transformations. Separable NMF: find $H \ge 0$ and $A(:, K)$ where $A \approx A(:, K)H$. [Figure panels: (b) NMF, (c) Manifold Learning.]
  • 20. We use our tall-and-skinny QR to get an orthogonal transformation to make the problem easily solvable.
  • 21. [Diagram: SVD, $A = U S V^T$; NMF, $A \approx A_K H$.] 1. Compute QR using TSQR method. 2. Run a separable NMF method on $SV^T$. 3. Find H by solving a small non-negative least-squares problem in each column. These are tiny.
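Steps 1 and 2 can be sketched on a synthetic separable matrix. This is an illustration under assumptions: the separable method used here is the successive projection algorithm (SPA, one of the methods named later in the deck), written as a hypothetical helper `spa`; `numpy.linalg.qr` stands in for TSQR; and step 3 (the small non-negative least-squares solves) is left out. The sketch checks that selecting extreme columns from the small matrix $SV^T$ gives the same columns as selecting them from the tall matrix A.

```python
import numpy as np

def spa(X, r):
    """Successive projection: greedily pick r extreme columns of X."""
    residual = np.array(X, dtype=float)
    selected = []
    for _ in range(r):
        norms = np.sum(residual**2, axis=0)
        j = int(np.argmax(norms))            # column farthest from the origin
        selected.append(j)
        u = residual[:, j] / np.sqrt(norms[j])
        residual -= np.outer(u, u @ residual)  # project out that direction
    return selected

rng = np.random.default_rng(3)
m, n, r = 5000, 12, 3
K_true = [2, 5, 9]

# Separable A: the columns in K_true are extreme, and every other column is a
# convex combination of them.
W = rng.random((m, r))
H = np.zeros((r, n))
H[:, K_true] = np.eye(r)
for j in range(n):
    if j not in K_true:
        H[:, j] = rng.dirichlet(np.ones(r))
A = W @ H

# Step 1: QR of the tall matrix (TSQR in the talk), then SVD of the small R.
R = np.linalg.qr(A, mode='r')
_, s, Vt = np.linalg.svd(R)
SVt = s[:, None] * Vt                        # n-by-n, independent of m

# Step 2: run the separable method on the small matrix only.
K_small = sorted(spa(SVt, r))
K_tall = sorted(spa(A, r))
print(K_small, K_tall)
```

The two selections agree because A and $SV^T$ differ by a transformation with orthonormal columns, which preserves column norms and inner products.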
  • 22. All of the hard analysis is on the small dimension of the matrix, which makes this very useful in practice.
  • 23. Our methods vs. the competition. 200 million rows, 200 columns, separation rank 20. Figure 1: Relative error in the separable factorization as a function of nonnegative rank (r) for the three algorithms. The matrix was synthetically generated to be separable. SPA and GP capture all of the true extreme columns when r = 20 (where the residual is zero). Since we are using the greedy variant of XRAY, it takes r = 21 to capture all of the … Figure 2: First 20 extreme columns selected by SPA, XRAY, and GP along with the true columns used in the synthetic matrix generation. A marker is present for a given column index if and only if that column is a selected extreme column. SPA and GP capture all of the true extreme columns. Since we are using the greedy variant of XRAY, it does se- …
  • 24. Nonlinear heat transfer model in random media. Each run takes 5 hours on 8 processors and outputs a 4M (node) by 9 (time-step) simulation. We did 8192 runs (128 samples of bubble locations, 64 bubble radii). 4.5 TB of data in Exodus II (NetCDF). [Figure: apply heat; look at temperature.] https://www.opensciencedatacloud.org/publicdata/heat-transfer/
  • 25. [Plot: proportion of temperature > 475 K vs. bubble radius, with curves True, ROM, and RS and an inset near radius 15 to 25; the insulator regime and the non-insulator regime are marked.]
  • 26. A: each simulation is a column; 5B-by-64 matrix; 2.2TB. [Diagram: SVD, $A = U S V^T$; NMF, $A \approx A_K H$.] Run a “standard” NMF algorithm on $SV^T$.
  • 27. Figure 9: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that “sandwich” them in the matrix. See Figure 10 for a closer look at the coefficients. Figure 8: First 10 extreme columns selected by SPA, XRAY, and GP for the heat transfer simulation … Figure 10: Value of H matrix for columns 1 through 34 for the SPA algorithm on the heat transfer sim- …
  • 28. A bunch of papers: Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, BigData 2013; Benson, Gleich, Rajwa & Lee, arXiv 2014; Constantine, Gleich, Hou, Templeton, SISC, in press. Code online: github.com/arbenson
  • 29. Next talk: 1. Personalized PageRank based community detection. 2. The best community detection algorithm?
  • 30. A community is a set of vertices that is denser inside than out.
  • 31. 250-node GEOP network in 2 dimensions
  • 32. 250-node GEOP network in 2 dimensions
  • 33. We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]. PPR is a Markov chain on nodes: 1. with probability 𝛼, follow a random edge; 2. with probability 1-𝛼, restart at a seed. Also known as the random surfer or random walk with restart. It has a unique stationary distribution.
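The two-step chain on this slide can be sketched as a plain power iteration. This is a minimal illustration, not the local method used later in the talk: the function name `personalized_pagerank`, the default `alpha=0.85`, and the fixed iteration count are assumptions, and the graph is assumed undirected with no isolated nodes, stored as a dictionary of neighbor sets.

```python
def personalized_pagerank(G, seeds, alpha=0.85, iters=100):
    """Power iteration for the random walk with restart.

    G maps each node to the set of its neighbors (every node needs
    at least one neighbor); seeds is a non-empty list of restart nodes.
    alpha and iters are illustrative defaults, not values from the talk.
    """
    # Restart distribution: uniform over the seeds.
    restart = {v: 0.0 for v in G}
    for s in seeds:
        restart[s] += 1.0 / len(seeds)

    x = dict(restart)  # start the chain at the seeds
    for _ in range(iters):
        # With probability 1-alpha, restart at a seed.
        nxt = {v: (1 - alpha) * restart[v] for v in G}
        # With probability alpha, follow a random edge out of v.
        for v in G:
            share = alpha * x[v] / len(G[v])
            for u in G[v]:
                nxt[u] += share
        x = nxt
    return x
```

Each pass touches every edge, so this global iteration is only practical on small graphs; it is useful mainly as a reference to compare a local approximation against.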
  • 34. Personalized PageRank community detection: 1. Given a seed, approximate the stationary distribution. 2. Extract the community. Both are local operations.
  • 35. Conductance communities. Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving to total edges: $\phi(S) = \mathrm{cut}(S) / \min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))$, that is, (edges leaving the set) / (total edges in the set). Equivalently, it’s the probability that a random edge leaves the set. Small conductance ⇔ good community. Example: $\mathrm{cut}(S) = 7$, $\mathrm{vol}(S) = 33$, $\mathrm{vol}(\bar{S}) = 11$, so $\phi(S) = 7/11$.
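The conductance score can be computed directly from its definition. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets with no self-loops; the function name `conductance` and the choice to return 1.0 when one side has zero volume are conventions picked for this example.

```python
def conductance(G, S):
    """Conductance of vertex set S in an undirected graph.

    G maps each node to the set of its neighbors (no self-loops).
    """
    S = set(S)
    # cut(S): edges with exactly one endpoint in S.
    cut = sum(1 for v in S for u in G[v] if u not in S)
    # vol(S): sum of degrees of the vertices in S.
    vol_S = sum(len(G[v]) for v in S)
    vol_rest = sum(len(G[v]) for v in G) - vol_S
    denom = min(vol_S, vol_rest)
    # Convention for this sketch: an empty side scores 1.0 (worst case).
    return cut / denom if denom > 0 else 1.0
```

For two triangles joined by a single edge, one triangle has cut 1 and volume 7 on each side, so its conductance is 1/7.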
  • 36. Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]. Informally: suppose the seeds are in a set of good conductance; then the personalized PageRank method will find a set with conductance that’s nearly as good. … Also, it’s really fast.
  • 37.
      import collections

      # G is the graph as a dictionary of sets; seed is a list of seed vertices
      alpha = 0.99
      tol = 1e-4

      x = {}  # approximate PageRank vector, stored as a dictionary
      r = {}  # residual vector, stored as a dictionary
      Q = collections.deque()  # queue of vertices with a large residual
      for s in seed:
          r[s] = 1.0/len(seed)
          Q.append(s)
      while len(Q) > 0:
          v = Q.popleft()  # v has r[v] >= tol*deg(v)
          if v not in x: x[v] = 0.
          x[v] += (1-alpha)*r[v]
          mass = alpha*r[v]/(2*len(G[v]))
          for u in G[v]:  # for neighbors of v
              if u not in r: r[u] = 0.
              if (r[u] < len(G[u])*tol and
                      r[u] + mass >= len(G[u])*tol):
                  Q.append(u)  # add u to the queue once its residual is large
              r[u] = r[u] + mass
          r[v] = mass*len(G[v])  # half of the pushed mass stays at v
          if r[v] >= len(G[v])*tol: Q.append(v)  # re-queue v if still large
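The second step of the method, extracting the community, is commonly done with a sweep cut: sort the vertices by their PageRank value divided by degree and keep the prefix with the smallest conductance. The sketch below is one way to write that, assuming `x` is a dictionary of approximate PageRank values such as the one built above and `G` is an undirected dictionary-of-sets graph without self-loops; the name `sweep_cut` is illustrative.

```python
def sweep_cut(G, x):
    """Return (best_set, best_conductance) over degree-normalized prefixes.

    G maps each node to its neighbor set; x maps nodes to PageRank mass.
    """
    total_vol = sum(len(G[v]) for v in G)
    # Visit vertices in decreasing order of x[v] / deg(v).
    order = sorted((v for v in x if x[v] > 0),
                   key=lambda v: x[v] / len(G[v]), reverse=True)
    S = set()
    vol = 0
    cut = 0
    best_set, best_cond = set(), float("inf")
    for v in order:
        inside = sum(1 for u in G[v] if u in S)  # edges from v into S
        S.add(v)
        vol += len(G[v])
        # Edges into S stop being cut; the remaining edges of v become cut.
        cut += len(G[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break  # the prefix now covers the whole graph
        cond = cut / denom
        if cond < best_cond:
            best_set, best_cond = set(S), cond
    return best_set, best_cond
```

Because only vertices with nonzero mass in `x` are visited, the work depends on the size of the approximate vector rather than on the whole graph, which keeps the extraction step local.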
  • 38. Problem 1: which seeds?
  • 39. Whang-Gleich-Dhillon, CIKM2013 [upcoming…]: 1. Extract part of the graph that might have overlapping communities. 2. Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus. 3. Find the center of these partitions. 4. Use PPR to grow egonets of these centers.
  • 40. A good partitioning helps. [Figure: maximum conductance versus coverage (percentage) for the seeding strategies egonet, graclus centers, spread hubs, random, and bigclam; panels (a) AstroPh and (d) Flickr.] Flickr social network: 2M vertices, 22M edges. We can cover 95% of the network with communities of conductance ~0.15.
  • 41. And helps to find real-world overlapping communities too. [Figure 3: F1 and F2 measures on DBLP comparing demon, bigclam, graclus centers, spread hubs, random, and egonet; caption truncated: “F1 and F2 measures comparing our algorithmic co… indicates better communities.”] Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure. Our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!
  • 42. Seed Set Expansion: carefully select seeds, then greedily expand communities around the seed sets. The algorithm: Filtering Phase, Seeding Phase, Seed Set Expansion Phase, Propagation Phase. Joyce Jiyoung Whang, The University of Texas at Austin, Conference on Information and Knowledge Management (8/44)
  • 43. Filtering Phase
  • 44. Seed Set Expansion Phase. Run clustering and choose centers, or pick an independent set of high-degree nodes. Run personalized PageRank.
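The "independent set of high-degree nodes" option can be illustrated with a simple greedy pass. This is a simplified sketch and not necessarily the exact procedure from the paper: the name `spread_hubs`, the parameter `k`, and the tie-breaking by dictionary order are assumptions made for the example.

```python
def spread_hubs(G, k):
    """Greedily pick up to k high-degree seeds, no two of them adjacent.

    G maps each node to the set of its neighbors.
    """
    seeds = []
    blocked = set()  # chosen seeds and their neighbors
    # Consider vertices from highest to lowest degree.
    for v in sorted(G, key=lambda v: len(G[v]), reverse=True):
        if len(seeds) == k:
            break
        if v in blocked:
            continue
        seeds.append(v)
        blocked.add(v)
        blocked.update(G[v])  # neighbors of a seed cannot be seeds
    return seeds
```

Each returned seed can then be handed to a personalized PageRank routine to grow one community per seed.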
  • 45. Propagation Phase. We can prove that this only improves the objective.
  • 46. Conclusion & Discussion. PPR community detection is fast [Andersen et al. FOCS06]. PPR communities look real [Abrahao et al. KDD2012; Zhu et al. ICML2013]. Partitioning for seeding yields high coverage & real communities. “Caveman” communities? Best conductance cut at intersection of communities? References: Gleich & Seshadhri KDD2012; Whang, Gleich & Dhillon CIKM2013. PPR sample: bit.ly/18khzO5. Egonet seeding: bit.ly/dgleich-code