Engineering Data Science Objective Functions for Social Network Analysis
David F. Gleich, Purdue University
With Nate Veldt (Purdue -> Cornell) and Tony Wirth (Melbourne)
Paper: arXiv:1903.05246 · Code: github.com/nveldt/LearnResParams
Somewhere too close and very recently…
Application expert: “Hi, I see you work on clustering. I want to cluster my data… what algorithm should I use?”
The dreaded question for people who study clustering, community detection, etc.:
“What algorithm should I use?”

Why is this such a hard question?
Journal of Biomedicine and Biotechnology • 2005:2 (2005) 215–225 • DOI: 10.1155/JBB.2005.215
REVIEW ARTICLE
Finding Groups in Gene Expression Data
David J. Hand and Nicholas A. Heard
Department of Mathematics, Faculty of Physical Sciences, Imperial College, London SW7 2AZ, UK
Received 11 June 2004; revised 24 August 2004; accepted 24 August 2004
The vast potential of the genomic insight offered by microarray technologies has led to their widespread use since they were introduced a decade ago. Application areas include gene function discovery, disease diagnosis, and inferring regulatory networks. Microarray experiments enable large-scale, high-throughput investigations of gene activity and have thus provided the data analyst with a distinctive, high-dimensional field of study. Many questions in this field relate to finding subgroups of data profiles which are very similar. A popular type of exploratory tool for finding subgroups is cluster analysis, and many different flavors of algorithms have been used and indeed tailored for microarray data. Cluster analysis, however, implies a partitioning of the entire data set, and this does not always match the objective. Sometimes pattern discovery or bump hunting tools are more appropriate. This paper reviews these various tools for finding interesting subgroups.
INTRODUCTION
Microarray gene expression studies are now routinely used to measure the transcription levels of an organism’s genes at a particular instant of time. These mRNA levels serve as a proxy for either the level of synthesis of proteins encoded by a gene or perhaps its involvement in a metabolic pathway. Differential expression between a control organism and an experimental or diseased organism can thus highlight genes whose function is related to the experimental challenge.
An often cited example is the classification of cancer types (Golub et al [1], Alizadeh et al [2], Bittner et al [3], […]). A microarray slide can typically hold tens of thousands of gene fragments whose responses here act as the predictor variables (p), whilst the number of patient tissue samples (n) available in such studies is much less (for the above examples, 38 in Golub et al, 96 in Alizadeh et al, 38 in Bittner et al, 41 in Nielsen et al, 63 in Tibshirani et al, and 80 in Parmigiani et al).
More generally, beyond such “supervised” classification problems, there is interest in identifying groups of genes with related expression level patterns over time or across repeated samples, say, even within the same classification label type. Typically one will be looking for coregulated […] between neighbouring frequencies; analogously for microarray data, there is evidence of correlation of expression of genes residing closely to one another on the chromosome (Turkheimer et al [17]). Thus when we come to look at cluster analysis for microarray data, we will see a large emphasis on methods which are computationally suited to cope with the high-dimensional data.
CLUSTER ANALYSIS
The need to group or partition objects seems fundamental to human understanding: once one can identify a class of objects, one can discuss the properties of the class members as a whole, without having to worry about individual differences. As a consequence, there is a vast literature on cluster analysis methods, going back at least as far as the earliest computers. In fact, at one point in the early 1980s new ad hoc clustering algorithms were being developed so rapidly that it was suggested there should be a moratorium on the development of new algorithms while some understanding of the properties of the existing ones […]
[The remainder of the excerpted page is truncated.]
Why is this such a hard question?
There are many reasons people want to cluster data:
• Help understand it
• Bin items for some downstream process
• …
There are many methods and strategies to cluster data:
• Linkage methods from stats
• Partitioning methods
• Objective functions (K-means) and updating algorithms
• …
I can’t psychically intuit what you need from your data!
I don’t like studying clustering…
… so let’s do exactly that.
Let’s do some warm up. What are the clusters in this graph?
[Repeated over a sequence of example graphs.]
Let’s consult an expert!
Graph clustering seeks “communities” of nodes.
Objective functions (modularity, densest subgraph, maximum clique, conductance, sparsest cut, etc.) all seek to balance high internal density against low external connectivity.
Two objectives at opposite ends of the spectrum

Sparsest cut:
$$\min_{S}\; \frac{\mathrm{cut}(S)}{|S|} + \frac{\mathrm{cut}(S)}{|\bar{S}|}$$
Cluster deletion:
Minimize the number of edges removed to partition the graph into cliques.
We show sparsest cut and cluster deletion are two special cases of the same new clustering framework:
LAMBDACC = λ-correlation clustering
This framework also leads to
- new connections to other objectives (including modularity!)
- new approximation algorithms (a 2-approximation for cluster deletion)
- several experiments/applications (social network analysis)
- (aside) a fast method for LPs with metric constraints (used in the approximation algorithms)
And now you are thinking… is this talk really going to propose another new method?!??!?
I’m going to advocate for flexible clustering frameworks, which we can then engineer to “fit” example data.
Our framework is based on correlation clustering.
Edges in a signed graph on nodes i, j, k, … indicate similarity (+) or dissimilarity (−).
Edges can be weighted (w+_ij, w−_jk), but the weighted problems become harder.
A “mistake” is a positive edge cut between two clusters, or a negative edge kept within one cluster.
Objective: minimize the total weight of mistakes.
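As a concrete sketch of this objective (Python; not from the talk, and the pair-dictionary representation of the signed graph is an illustrative assumption):

```python
def cc_mistakes(pos, neg, assign):
    """Total weight of mistakes for a clustering of a signed graph.

    pos and neg map node pairs (i, j) to weights w+_ij and w-_ij;
    assign maps each node to a cluster id. A positive edge split
    across clusters is a mistake; so is a negative edge kept inside
    a single cluster.
    """
    total = 0.0
    for (i, j), w in pos.items():
        if assign[i] != assign[j]:  # similar pair separated: mistake
            total += w
    for (i, j), w in neg.items():
        if assign[i] == assign[j]:  # dissimilar pair grouped: mistake
            total += w
    return total
```

For the i, j, k example above: putting all three nodes in one cluster makes the negative edge (j, k) a mistake, while splitting i from j makes the positive edge (i, j) one.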
You can use correlation clustering to cluster unsigned graphs:
given G = (V, E), construct a signed graph G′ = (V, E+, E−), an instance of correlation clustering.
To model sparsest cut or cluster deletion, set a resolution parameter λ ∈ (0, 1).
LAMBDACC: each edge of G becomes a positive pair with weight 1 − λ, and each non-edge becomes a negative pair with weight λ.
Without weights, unweighted correlation clustering is the same as cluster editing.
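A minimal sketch of that construction, reusing the pair-dictionary representation assumed above (again not the released code):

```python
import itertools

def lambda_cc_instance(n, edges, lam):
    """Standard (unit-weight) LambdaCC instance on nodes 0..n-1.

    Every edge of G becomes a positive pair of weight 1 - lam and
    every non-edge a negative pair of weight lam, for 0 < lam < 1.
    """
    E = {tuple(sorted(e)) for e in edges}
    pos, neg = {}, {}
    for i, j in itertools.combinations(range(n), 2):
        if (i, j) in E:
            pos[(i, j)] = 1.0 - lam
        else:
            neg[(i, j)] = lam
    return pos, neg
```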
Consider a restriction to two clusters, S and S̄.
Positive mistakes: $(1 - \lambda)\,\mathrm{cut}(S)$
Negative mistakes: $\lambda|E^-| - \lambda\left[\,|S||\bar{S}| - \mathrm{cut}(S)\,\right]$
Total weight of mistakes $= (1-\lambda)\,\mathrm{cut}(S) + \lambda|E^-| - \lambda|S||\bar{S}| + \lambda\,\mathrm{cut}(S) = \mathrm{cut}(S) - \lambda|S||\bar{S}| + \lambda|E^-|$
This is a scaled version of sparsest cut!
Two-cluster LAMBDACC can be written
$$\text{minimize}\quad \mathrm{cut}(S) - \lambda|S||\bar{S}| + \lambda|E^-|,$$
where the $\lambda|E^-|$ term is constant. Note that
$$\mathrm{cut}(S) - \lambda|S||\bar{S}| < 0 \iff \frac{\mathrm{cut}(S)}{|S||\bar{S}|} < \lambda$$
and
$$\frac{\mathrm{cut}(S)}{|S|} + \frac{\mathrm{cut}(S)}{|\bar{S}|} = |V|\,\frac{\mathrm{cut}(S)}{|S||\bar{S}|}.$$
We can write the objective in terms of cuts to get a relationship with sparsest cut.
The general LAMBDACC objective can be written
$$\text{minimize}\quad \frac{1}{2}\sum_{i=1}^{k} \mathrm{cut}(S_i) \;-\; \frac{\lambda}{2}\sum_{i=1}^{k} |S_i||\bar{S}_i| \;+\; \lambda|E^-|.$$
THEOREM. Minimizing this objective produces clusters with scaled sparsest cut at most λ (if they exist). There exists some λ′ such that minimizing LAMBDACC will return the minimum sparsest cut partition.
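Under the same assumed representation, the cut form can be evaluated directly and cross-checked against `cc_mistakes` on the instance built by `lambda_cc_instance`; a sketch:

```python
from collections import Counter

def lambda_cc_cut_form(n, edges, lam, assign):
    """Evaluate (1/2) sum_i cut(S_i) - (lam/2) sum_i |S_i||S_i-bar|
    + lam |E^-| for the clustering given by assign (node -> cluster id)."""
    E = {tuple(sorted(e)) for e in edges}
    cut = sum(1 for i, j in E if assign[i] != assign[j])
    sizes = Counter(assign.values())
    # (1/2) sum_i |S_i|(n - |S_i|) = number of node pairs split across clusters
    pairs_across = (n * n - sum(s * s for s in sizes.values())) // 2
    num_neg = n * (n - 1) // 2 - len(E)  # |E^-| = number of non-edges
    return cut - lam * pairs_across + lam * num_neg
```

The two evaluations agree, which is exactly the algebra of the two-cluster derivation carried out for k clusters.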
For large λ, LAMBDACC generalizes cluster deletion.
Cluster deletion is correlation clustering with infinite penalties on negative edges; we show this is equivalent to LAMBDACC for the right choice of λ, one with λ ≫ (1 − λ).
Degree-weighted LAMBDACC is related to modularity.
Positive weight: $1 - \lambda d_i d_j$. Negative weight: $\lambda d_i d_j$.
[Figure: a small example graph with its degree-weighted signed pairs.]
With this weighting, the LAMBDACC objective is a linear function of modularity, though this does not preserve approximations…
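A sketch of the degree-weighted construction in the same assumed style (note that at λ = 1/(2m) the positive weight 1 − λ d_i d_j can go negative for very high-degree pairs; this sketch simply records the signed weights as given):

```python
import itertools

def degree_weighted_instance(n, edges, lam):
    """Degree-weighted LambdaCC: an edge (i, j) gets positive weight
    1 - lam * d_i * d_j and a non-edge gets negative weight
    lam * d_i * d_j. With lam = 1/(2m), the resulting objective is a
    linear function of modularity."""
    E = {tuple(sorted(e)) for e in edges}
    deg = [0] * n
    for i, j in E:
        deg[i] += 1
        deg[j] += 1
    pos, neg = {}, {}
    for i, j in itertools.combinations(range(n), 2):
        w = lam * deg[i] * deg[j]
        if (i, j) in E:
            pos[(i, j)] = 1.0 - w
        else:
            neg[(i, j)] = w
    return pos, neg
```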
Many other objectives are special cases of LAMBDACC (here m = |E|).
[Figure: the λ spectrum from 0 to 1 for the standard and degree-weighted variants.]
Standard weighting: sparsest cut at λ = ρ*, correlation clustering (cluster editing) at λ* = 1/2, cluster deletion at λ = m/(m + 1).
Degree-weighted: modularity at λ = 1/(2m), and normalized cut.
And now, an answer to one of the most frequently asked questions in clustering:
“What method should I use?”
Changing your method (implicitly) changes the value of λ that you are using.
[Figure: ratio to the LP bound as a function of λ (from 1e-05 to 0.85) for Graclus, Louvain, InfoMap, RMQC, and RMC, running from the sparse cut regime at small λ to the dense subgraph regime at large λ.]
This figure shows that if you use one of these algorithms (Graclus, Louvain, InfoMap, recursive max-quasi-clique, or recursive max-clique) then you implicitly minimize λ-CC for some choice of λ.
This turns the question “what method should I use?” into “what λ should I use?”
We wrote an entire SIMODS paper explaining how we made this figure!
The LP bound involves an LP with 12 billion constraints.
35
How should I set ! for
my new clustering
application?
Can you give me an example
of what you want your
clusters to look like?
I want communities
that look like this!
LambdaCC inspires an approach for learning the“right”
objective function to use for new applications.
David Gleich · Purdue LLNL
The goal is not to reproduce the example clusters.
The goal to find sets with similar properties size and density tradeoffs.
LLNLDavid Gleich · Purdue 36
Let’s go back to the figure we just saw.
Each clustering traces out a bowl-shaped curve.
The minimum point on each curve tells us the λ regime where the clustering optimizes LambdaCC.
[Figure: the ratio-to-LP-bound curves for Graclus, Louvain, InfoMap, RMQC, and RMC.]
So the “example” clustering will also correspond to some such curve.
[Figure: the ratio-to-LP-bound curve of the example clustering, over λ from 0.13 to 0.5.]
As will any other clustering.
[Figure: repeated over several slides, each adding the curve of another clustering.]
Strategy.
Start with a fixed “good” example clustering.
Find the minimizer of its curve, to get a λ that is designed to produce similar clusterings!
Challenge.
We want to do this without computing the entire curve.
This is a new optimization problem where we are optimizing over λ!
What function is tracing out these curves?
The “parameter fitness function”
$$P_C(\lambda) = \frac{F_C(\lambda)}{G(\lambda)},$$
where $F_C(\lambda)$ is the LambdaCC score of the clustering C, a linear function of λ, and $G(\lambda)$ is the LambdaCC LP bound for fixed λ, a parametric LP that is concave and piecewise linear in λ (Adler & Monteiro 1992).
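As a sketch, F_C(λ) can be assembled from two counts that depend only on the clustering C, while G(λ) must come from an LP solver; `lp_bound` below is a stand-in for that solver, not a real API:

```python
def parameter_fitness(cut_edges, bad_pairs, lp_bound):
    """Build P_C as a function of lam.

    cut_edges: number of edges of G that C cuts;
    bad_pairs: number of non-adjacent pairs C keeps inside clusters;
    so F_C(lam) = (1 - lam) * cut_edges + lam * bad_pairs is linear.
    lp_bound(lam) must return the LambdaCC LP bound G(lam).
    """
    def P(lam):
        return ((1.0 - lam) * cut_edges + lam * bad_pairs) / lp_bound(lam)
    return P
```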
We prove two useful properties about P.
Since $F_C$ is linear and $G$ is concave and piecewise linear, P satisfies the following:
1. If $\lambda_- < \lambda < \lambda_+$, then $P(\lambda) \le \max\{P(\lambda_-), P(\lambda_+)\}$.
2. If $P(\lambda_-) = P(\lambda_+)$, then P achieves its minimum in $[\lambda_-, \lambda_+]$.
Translation…
1. Once P goes up, it can’t go back down.
2. There are no “flat” regions where we might get stuck.
This allows us to minimize P without seeing all of it.
[Figure sequence: evaluating P at a few points of the curve.]
After evaluating P at two points, we know the minimizer can’t be to the left of the lower one, but several shapes for the unseen part of the curve remain possible. Evaluating P at a new point rules some of them out: now we know the minimizer can’t be to the right of that point either. And if two input λ have the same fitness score, the minimizer is between them, so it is not out beyond either of them.
We developed a bisection-like approach for minimizing P by evaluating it at carefully selected points.
One-branch scenario: the minimizer isn’t in [m, r].
Two-branch scenario: evaluate a couple more points to rule out [m, r].
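The two properties of P make a simple ternary-search-style interval shrink valid; a simplified sketch of the idea (the paper’s actual procedure chooses evaluation points more carefully and reuses LP information, since every evaluation of P solves an LP):

```python
def minimize_fitness(P, lo, hi, tol=1e-4):
    """Shrink [lo, hi] around the minimizer of the fitness function P.

    Property 1 (no interior point exceeds both endpoints) and
    property 2 (ties bracket the minimum) guarantee that comparing P
    at two interior points always lets us discard one end.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if P(m1) < P(m2):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return 0.5 * (lo + hi)
```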
A simple synthetic test case to demonstrate that having an example helps.
Modularity (a special case of LambdaCC with λ = 1/(2m)) wasn’t able to get the community structure right for the graph G. Let’s fix that!
1. Generate a new random graph G′ from the same distribution.
2. Using the ground truth of G′, learn a resolution parameter λ′.
3. Cluster G using LambdaCC with λ = λ′.
We’ve captured the community structure for a specific class of graphs and can detect the right answer!
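The three steps compose directly; in the sketch below, `fit_lambda` and `lambda_cc_cluster` are hypothetical stand-ins for the learning routine and a LambdaCC solver (the released code at github.com/nveldt/LearnResParams is Julia, not Python):

```python
def learn_and_transfer(sample_graph, sample_truth, target_graph,
                       fit_lambda, lambda_cc_cluster):
    """Learn a resolution parameter on G' and apply it to G.

    sample_graph (G') is drawn from the same distribution as
    target_graph (G); sample_truth is the ground-truth clustering of
    G'. Both callables are assumed, hypothetical interfaces.
    """
    lam = fit_lambda(sample_graph, sample_truth)  # step 2
    return lambda_cc_cluster(target_graph, lam)   # step 3
```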
We tested this on a regime of synthetic graphs that is hard for modularity.
µ is a “mixing parameter”: smaller µ → the ground truth is easier to detect.
For each µ, we trained on one graph and tested on 5 others.
One example when µ = 0.3: modularity often fails to separate ground-truth clusters.
We can use this to test if a metadata attribute seems to be reflected in some characteristic graph structure.

            S/F    Gen    Maj.   Maj. 2  Res.   Yr     HS
min P_real  1.30   1.73   2.03   2.12    1.35   1.57   2.11
min P_fake  1.65   1.80   2.12   2.12    2.11   2.09   2.12

(Listen, don’t read!) For the Caltech network, find the minimum value of λ for a clustering X induced by a metadata attribute. Then look at the objective function P(λ, X) = F(λ, X)/G(λ) at the minimizer. Do this for the real attribute and for a randomized attribute (just shuffle the labels); that gives a null score where there is no relationship with graph structure.
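A sketch of that null test; `min_fitness_for` is a hypothetical stand-in for a routine returning the minimum over λ of P(λ, X) for the clustering X induced by a label vector:

```python
import random

def metadata_null_test(graph, labels, min_fitness_for):
    """Compare min-P for a real attribute against shuffled labels.

    An attribute that tracks graph structure should score noticeably
    below its shuffled copy, which by construction has no
    relationship with the structure.
    """
    real = min_fitness_for(graph, labels)
    shuffled = list(labels)
    random.shuffle(shuffled)
    fake = min_fitness_for(graph, shuffled)
    return real, fake
```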
We can also investigate metadata sets in social networks. This led to a fun story!
[Figure: scatter plot; x-axis: the objective ratio at the minimum, i.e. how close you get to the lower bound; y-axis: how well you do at finding those same sets again.]
[The same scatter plot, with points split by year: 2006–2008 vs. 2009.]
A quick summary of other work from our research team on data-driven scientific computing.
Our team’s overall goal is to design algorithms and methods tuned to the evolving needs and nature of scientific data analysis.
Low-rank methods for network alignment – Huda Nassar -> Stanford
• Principled methods that scale to aligning thousands of networks.
Spectral properties and generation of realistic networks – Nicole Eikmeier -> Grinnell College
• “Power laws” in the top singular values of the adjacency matrix are more robust than degree “power laws.”
• Fast sampling for hypergraph models with higher-order structure.
Local analysis of network data – Meng Liu
• Applications in bioinformatics; software: https://github.com/kfoynt/LocalGraphClustering
[Figure: a Kronecker graph with a 2 × 2 initiator, “⊗-powered” three times into an 8 × 8 probability matrix.]
Paper: arXiv:1903.05246 (at WWW2019) · Code: github.com/nveldt/LearnResParams
Earlier: (at WWW2018), arXiv:1806.01678
Software on github: nveldt/LamCC, nveldt/MetricOptimization

Don’t ask what algorithm, ask what kind of clusters!
Issues.
• Yeah, this is still slow ☹
• Needs to be generalized beyond lambda-CC (ongoing work with Meng Liu at Purdue)
See the paper and code!

With Nate Veldt (Purdue), Tony Wirth (Melbourne), Cameron Ruggles (Purdue), and James Saunderson (Monash).
More Related Content

What's hot

Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningDavid Gleich
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisDavid Gleich
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...David Gleich
 
Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansDavid Gleich
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structuresDavid Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...David Gleich
 
Uncertainty Modeling in Deep Learning
Uncertainty Modeling in Deep LearningUncertainty Modeling in Deep Learning
Uncertainty Modeling in Deep LearningSungjoon Choi
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Modeling uncertainty in deep learning
Modeling uncertainty in deep learning Modeling uncertainty in deep learning
Modeling uncertainty in deep learning Sungjoon Choi
 
High-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHigh-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHolistic Benchmarking of Big Linked Data
 
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...dhruvgairola
 
Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AIFlorian Wilhelm
 
A new generalized lindley distribution
A new generalized lindley distributionA new generalized lindley distribution
A new generalized lindley distributionAlexander Decker
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLDavid Gleich
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 

What's hot (20)

Using Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based LearningUsing Local Spectral Methods to Robustify Graph-Based Learning
Using Local Spectral Methods to Robustify Graph-Based Learning
 
Spacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysisSpacey random walks and higher-order data analysis
Spacey random walks and higher-order data analysis
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
Anti-differentiating approximation algorithms: A case study with min-cuts, sp...
 
Non-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-meansNon-exhaustive, Overlapping K-means
Non-exhaustive, Overlapping K-means
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
Uncertainty in Deep Learning
Uncertainty in Deep LearningUncertainty in Deep Learning
Uncertainty in Deep Learning
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
Uncertainty Modeling in Deep Learning
Uncertainty Modeling in Deep LearningUncertainty Modeling in Deep Learning
Uncertainty Modeling in Deep Learning
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Modeling uncertainty in deep learning
Modeling uncertainty in deep learning Modeling uncertainty in deep learning
Modeling uncertainty in deep learning
 
High-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K CharactersHigh-Performance Approach to String Similarity using Most Frequent K Characters
High-Performance Approach to String Similarity using Most Frequent K Characters
 
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
A Generic Algebraic Model for the Analysis of Cryptographic Key Assignment Sc...
 
Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AI
 
Cs36565569
Cs36565569Cs36565569
Cs36565569
 
Deep learning networks
Deep learning networksDeep learning networks
Deep learning networks
 
A new generalized lindley distribution
A new generalized lindley distributionA new generalized lindley distribution
A new generalized lindley distribution
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 

Similar to Engineering Data Science Objectives for Social Network Analysis

15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Rich Heimann
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown BagDataTactics
 
Multilevel techniques for the clustering problem
Multilevel techniques for the clustering problemMultilevel techniques for the clustering problem
Multilevel techniques for the clustering problemcsandit
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptxGandhiMathy6
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
Simplicial closure and higher-order link prediction
Simplicial closure and higher-order link predictionSimplicial closure and higher-order link prediction
Simplicial closure and higher-order link predictionAustin Benson
 
Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)Austin Benson
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataIOSR Journals
 
Clustering Algorithms.pptx
Clustering Algorithms.pptxClustering Algorithms.pptx
Clustering Algorithms.pptxIssra'a Almgoter
 
[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalization[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalizationJaeJun Yoo
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 

Similar to Engineering Data Science Objectives for Social Network Analysis (20)

15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Multilevel techniques for the clustering problem
Multilevel techniques for the clustering problemMultilevel techniques for the clustering problem
Multilevel techniques for the clustering problem
 
Az36311316
Az36311316Az36311316
Az36311316
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
Unsupervised Learning.pptx
Unsupervised Learning.pptxUnsupervised Learning.pptx
Unsupervised Learning.pptx
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Simplicial closure and higher-order link prediction
Simplicial closure and higher-order link predictionSimplicial closure and higher-order link prediction
Simplicial closure and higher-order link prediction
 
Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)Simplicial closure and higher-order link prediction (SIAMNS18)
Simplicial closure and higher-order link prediction (SIAMNS18)
 
Pca part
Pca partPca part
Pca part
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
Clustering Algorithms.pptx
Clustering Algorithms.pptxClustering Algorithms.pptx
Clustering Algorithms.pptx
 
[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalization[PR12] understanding deep learning requires rethinking generalization
[PR12] understanding deep learning requires rethinking generalization
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 

More from David Gleich

Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential David Gleich
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsDavid Gleich
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksDavid Gleich
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...David Gleich
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationDavid Gleich
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...David Gleich
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsDavid Gleich
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...David Gleich
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for HadoopDavid Gleich
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignmentDavid Gleich
 

More from David Gleich (12)

Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential Fast relaxation methods for the matrix exponential
Fast relaxation methods for the matrix exponential
 
MapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applicationsMapReduce Tall-and-skinny QR and applications
MapReduce Tall-and-skinny QR and applications
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
Vertex neighborhoods, low conductance cuts, and good seeds for local communit...
 
A dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportationA dynamical system for PageRank with time-dependent teleportation
A dynamical system for PageRank with time-dependent teleportation
 
How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...How does Google Google: A journey into the wondrous mathematics behind your f...
How does Google Google: A journey into the wondrous mathematics behind your f...
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
The power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulantsThe power and Arnoldi methods in an algebra of circulants
The power and Arnoldi methods in an algebra of circulants
 
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
What you can do with a tall-and-skinny QR factorization in Hadoop: Principal ...
 
Matrix methods for Hadoop
Matrix methods for HadoopMatrix methods for Hadoop
Matrix methods for Hadoop
 
Iterative methods for network alignment
Iterative methods for network alignmentIterative methods for network alignment
Iterative methods for network alignment
 

Recently uploaded

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Recently uploaded (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Engineering Data Science Objectives for Social Network Analysis

  • 1. Engineering Data Science Objective Functions for Social Network Analysis David F. Gleich Purdue University With Nate Veldt (Purdue -> Cornell), Tony Wirth (Melbourne) Paper arXiv:1903.05246 Code github.com/nveldt/LearnResParams LLNL 1David Gleich · Purdue
  • 2. Somewhere too close and very recently… Application expert. “Hi, I see you work on clustering. I want to cluster my data … … what algorithm should I use?” LLNLDavid Gleich · Purdue 2
  • 3. The dreaded question for people who study clustering, community detection, etc. “What algorithm should I use?”
  • 4. Why is this such a hard question? LLNLDavid Gleich · Purdue 4
  • 5. [Slide shows the opening page of: David J. Hand and Nicholas A. Heard, “Finding Groups in Gene Expression Data,” Journal of Biomedicine and Biotechnology 2005:2 (2005) 215–225, DOI: 10.1155/JBB.2005.215, a review of cluster analysis for microarray data. Its history of the field notes that in the early 1980s new ad hoc clustering algorithms were being developed so rapidly that it was suggested there should be a moratorium on new algorithms until the properties of the existing ones were understood.]
  • 6. Why is this such a hard question? There are many reasons people want to cluster data • Help understand it • Bin items for some downstream process • … There are many methods and strategies to cluster data • Linkage methods from stats • Partitioning methods • Objective functions (K-means) and updating algorithms • … I can’t psychically intuit what you need from your data!
  • 7. I don’t like studying clustering…
  • 8. I don’t like studying clustering… … so let’s do exactly that.
  • 9. Let’s do some warm up. What are the clusters in this graph?
  • 10. Let’s do some warm up. What are the clusters in this graph?
  • 11. Let’s do some warm up. What are the clusters in this graph?
  • 12. Let’s do some warm up. What are the clusters in this graph?
  • 13. Let’s do some warm up. What are the clusters in this graph? Let’s consult an expert!
  • 14. Let’s do some warm up. What are the clusters in this graph?
  • 15. Let’s do some warm up. What are the clusters in this graph?
  • 16. Graph clustering seeks “communities” of nodes. Objective functions (modularity, densest subgraph, maximum clique, conductance, sparsest cut, etc.) all seek to balance high internal density with low external connectivity.
  • 17. Two objectives at opposite ends of the spectrum. Sparsest cut: min_S cut(S)/|S| + cut(S)/|S̄|.
  • 18. Two objectives at opposite ends of the spectrum. Sparsest cut: min_S cut(S)/|S| + cut(S)/|S̄|. Cluster deletion: minimize the number of edges removed to partition the graph into cliques.
  • 19. We show sparsest cut and cluster deletion are two special cases of the same new clustering framework: LambdaCC (λ-correlation clustering). This framework also leads to - new connections to other objectives (including modularity!) - new approximation algorithms (a 2-approximation for cluster deletion) - several experiments/applications (social network analysis) - (aside) a fast method for LPs with metric constraints (used in the approximation algorithms)
  • 20. And now you are thinking… is this talk really going to propose another new method?!?
  • 21. I’m going to advocate for flexible clustering frameworks, which we can then engineer to “fit” example data
  • 22. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (-).
  • 23. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (-). Edges can be weighted (w⁺_ij, w⁻_jk), but weighted problems become harder. [Figure: a signed triangle on nodes i, j, k with edge-weight labels.]
  • 24. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (-). Objective: minimize the total weight of “mistakes” (a positive edge whose endpoints are separated, or a negative edge whose endpoints share a cluster). [Figure: the signed triangle on i, j, k with two mistakes marked.]
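To make the mistake objective concrete, here is a minimal sketch in Python (mine, not the paper’s code) that scores a clustering of a small signed graph; the node names and weights are illustrative.

    # Minimal sketch (not the authors' code): score a clustering of a signed
    # graph by the correlation clustering objective, the total weight of mistakes.

    def cc_mistakes(pos_edges, neg_edges, cluster):
        """pos_edges / neg_edges map pairs (i, j) to nonnegative weights;
        cluster maps each node to a cluster id."""
        total = 0.0
        for (i, j), w in pos_edges.items():
            if cluster[i] != cluster[j]:   # positive edge split apart: mistake
                total += w
        for (i, j), w in neg_edges.items():
            if cluster[i] == cluster[j]:   # negative edge kept together: mistake
                total += w
        return total

    # The i-j-k triangle from the slide: any clustering makes at least one mistake.
    pos = {("i", "j"): 1.0, ("j", "k"): 1.0}
    neg = {("i", "k"): 1.0}
    print(cc_mistakes(pos, neg, {"i": 0, "j": 0, "k": 0}))  # 1.0 (negative edge inside)
    print(cc_mistakes(pos, neg, {"i": 0, "j": 0, "k": 1}))  # 1.0 (positive edge cut)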
  • 25. You can use correlation clustering to cluster unsigned graphs. Given G = (V,E), construct a signed graph G’ = (V, E+, E-), an instance of correlation clustering. To model sparsest cut or cluster deletion, set a resolution parameter λ ∈ (0,1): this is LambdaCC. (Unweighted correlation clustering is the same problem as cluster editing.)
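A minimal sketch of the LambdaCC construction as described on the slide (my rendering, not the released code): edges of the unsigned graph become positive pairs of weight 1 − λ, and non-adjacent pairs become negative pairs of weight λ.

    from itertools import combinations

    def lambda_cc_instance(nodes, edges, lam):
        edge_set = {frozenset(e) for e in edges}
        pos, neg = {}, {}
        for i, j in combinations(nodes, 2):
            if frozenset((i, j)) in edge_set:
                pos[(i, j)] = 1.0 - lam    # positive edge: cost 1 - lam if cut
            else:
                neg[(i, j)] = lam          # non-edge: cost lam if kept together
        return pos, neg

    # Small lam tolerates sparse clusters; lam near 1 pushes toward cliques
    # (the cluster deletion regime).
    pos, neg = lambda_cc_instance(range(5), [(0, 1), (1, 2), (3, 4)], lam=0.3)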
  • 26. Consider a restriction to two clusters, S and S̄. Positive mistakes: (1 − λ)·cut(S). Negative mistakes: λ|E⁻| − λ(|S||S̄| − cut(S)). Total weight of mistakes = cut(S) − λ|S||S̄| + λ|E⁻|. (A one-line check of this arithmetic follows.)
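Spelling out the arithmetic behind the slide’s total, using the counts above; note that |S||S̄| − cut(S) counts the negative pairs that cross the cut (and so are not mistakes):

    \underbrace{(1-\lambda)\,\mathrm{cut}(S)}_{\text{positive mistakes}}
    + \underbrace{\lambda\bigl(|E^-| - (|S|\,|\bar S| - \mathrm{cut}(S))\bigr)}_{\text{negative mistakes}}
    = \mathrm{cut}(S) - \lambda\,|S|\,|\bar S| + \lambda\,|E^-|.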
  • 27. This is a scaled version of sparsest cut! Two-cluster LambdaCC can be written: minimize cut(S) − λ|S||S̄| + λ|E⁻|, where the λ|E⁻| term is constant. Note that cut(S) − λ|S||S̄| < 0 ⟺ cut(S)/(|S||S̄|) < λ, and that cut(S)/|S| + cut(S)/|S̄| = |V|·cut(S)/(|S||S̄|).
  • 28. We can write the objective in terms of cuts to get a relationship with sparsest cut. The general LambdaCC objective can be written: minimize (1/2)·Σᵢ cut(Sᵢ) − (λ/2)·Σᵢ |Sᵢ||S̄ᵢ| + λ|E⁻|. THEOREM. Minimizing this objective produces clusters with scaled sparsest cut at most λ (if they exist), and there exists some λ’ such that minimizing LambdaCC will return the minimum sparsest cut partition.
  • 29. For large λ, LambdaCC generalizes cluster deletion. Cluster deletion is correlation clustering with infinite penalties on negative edges; we show this is equivalent to LambdaCC for the right choice of λ (large enough that λ ≫ 1 − λ).
  • 30. Degree-weighted LambdaCC is related to modularity. Positive weight: 1 − λd_i d_j; negative weight: λd_i d_j. LambdaCC is then a linear function of modularity, though this does not preserve approximations… [Figure: a small example graph with its degree-weighted pair weights.]
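A minimal sketch of the degree-weighted variant, following my reading of the slide (not the released code): pair (i, j) gets positive weight 1 − λ·d_i·d_j when it is an edge and negative weight λ·d_i·d_j when it is not; with λ = 1/(2m) the mistake objective is a linear function of modularity.

    from itertools import combinations

    def degree_weighted_instance(nodes, edges, lam):
        deg = {v: 0 for v in nodes}
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        edge_set = {frozenset(e) for e in edges}
        pos, neg = {}, {}
        for i, j in combinations(nodes, 2):
            w = lam * deg[i] * deg[j]
            if frozenset((i, j)) in edge_set:
                pos[(i, j)] = 1.0 - w   # can be negative for high-degree pairs
            else:
                neg[(i, j)] = w
        return pos, neg

    edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
    pos, neg = degree_weighted_instance(range(4), edges, lam=1.0 / (2 * len(edges)))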
  • 31. Many other objectives are special cases of LambdaCC (m = |E|). Standard weighting: sparsest cut at λ = ρ* (the optimal scaled sparsest cut value), correlation clustering (cluster editing) at λ = 1/2, cluster deletion at λ = m/(m + 1). Degree weighting: normalized cut at λ = ρ*, modularity at λ = 1/(2m).
  • 32. And now, an answer to one of the most frequently asked questions in clustering: “What method should I use?”
  • 33. Changing your method (implicitly) changes the value of λ that you are using. [Figure: ratio to the LP bound as a function of λ for Graclus, Louvain, InfoMap, RMQC, and RMC; a sparse-cut regime at small λ and a dense-subgraph regime at large λ.] This figure shows that if you use one of these algorithms (Graclus, Louvain, InfoMap, recursive max-quasi-clique, or recursive max-clique), then you implicitly minimize λ-CC for some choice of λ. It turns the question “what method should I use?” into “what λ should I use?”
  • 34. Changing your method (implicitly) changes the value of λ that you are using. [Same figure as the previous slide.] We wrote an entire SIMODS paper explaining how we made this figure! The LP bound involves an LP with 12 billion constraints.
  • 35. “How should I set λ for my new clustering application?” “Can you give me an example of what you want your clusters to look like?” “I want communities that look like this!” LambdaCC inspires an approach for learning the “right” objective function to use for new applications.
  • 36. The goal is not to reproduce the example clusters. The goal is to find sets with similar properties: similar tradeoffs of size and density.
  • 37. Let’s go back to the figure we just saw. Each clustering traces out a bowl-shaped curve. The minimum point on each curve tells us the λ regime where the clustering optimizes LambdaCC. [Figure: the ratio-to-LP-bound curves for Graclus, Louvain, InfoMap, RMQC, and RMC.]
  • 38. So the “example” clustering will also correspond to some type of curve. [Figure: a single ratio-to-LP-bound curve for the example clustering.]
  • 39. As will any other clustering. [Figure: more curves added to the plot.]
  • 40. As will any other clustering.
  • 41. As will any other clustering.
  • 42. As will any other clustering.
  • 43. Strategy: start with a fixed “good” clustering example and find the minimizer of its curve, to get a λ that is designed to produce similar clusterings! Challenge: we want to do this without computing the entire curve. This is a new optimization problem where we are optimizing over λ! [Figure: one bowl-shaped curve with its minimizer marked.]
  • 44. What function is tracing out these curves? The “parameter fitness function” P_C(λ) = F_C(λ) / G(λ), where F_C(λ) is the LambdaCC score of a clustering C (a linear function of λ) and G(λ) is the LambdaCC LP bound for fixed λ: a parametric LP, concave and piecewise linear in λ (Adler & Monteiro 1992).
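For the numerator, a minimal sketch assumed from the definitions on the slide (not the released code); the denominator G(λ) requires solving the LambdaCC LP relaxation and is not reproduced here.

    from itertools import combinations

    def f_c(nodes, edges, cluster, lam):
        """F_C(lam): LambdaCC score of a fixed clustering of an unsigned graph.
        Linear in lam: (1 - lam)*(cut edges) + lam*(non-edges inside clusters)."""
        edge_set = {frozenset(e) for e in edges}
        cut_edges, internal_nonedges = 0, 0
        for i, j in combinations(nodes, 2):
            same = cluster[i] == cluster[j]
            if frozenset((i, j)) in edge_set:
                cut_edges += not same           # positive mistake
            else:
                internal_nonedges += same       # negative mistake
        return (1 - lam) * cut_edges + lam * internal_nonedges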
  • 45. We prove two useful properties about P. Since F_C is linear and G is concave and piecewise linear, P satisfies: (1) if λ⁻ < λ < λ⁺, then P(λ) ≤ max{P(λ⁻), P(λ⁺)}; (2) if P(λ⁻) = P(λ⁺), then P achieves its minimum in [λ⁻, λ⁺]. Translation… (1) once P goes up, it can’t go back down.
  • 46. We prove two useful properties about P. Since F_C is linear and G is concave and piecewise linear, P satisfies: (1) if λ⁻ < λ < λ⁺, then P(λ) ≤ max{P(λ⁻), P(λ⁺)}; (2) if P(λ⁻) = P(λ⁺), then P achieves its minimum in [λ⁻, λ⁺]. Translation… (1) once P goes up, it can’t go back down; (2) there are no “flat” regions where we might get stuck.
  • 47. This allows us to minimize P without seeing all of it. [Figure: after evaluating P at a point, property 1 tells us the minimizer can’t be to the left of that point.]
  • 48. This allows us to minimize P without seeing all of it. [Figure: so this shape of curve is possible…]
  • 49. This allows us to minimize P without seeing all of it. [Figure: …but so is this one.]
  • 50. This allows us to minimize P without seeing all of it. [Figure: evaluating P at a new point rules out one possibility; now we know the minimizer can’t be to the right of that point.]
  • 51. This allows us to minimize P without seeing all of it. If two input λ have the same fitness score, the minimizer is between them (property 2). [Figure: …so it’s not over here.]
  • 52. We developed a bisection-like approach for minimizing P by evaluating it at carefully selected points. One-branch scenario: the minimizer isn’t in [m, r]. Two-branch scenario: evaluate a couple more points to rule out [m, r]. (A simplified sketch follows.)
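A minimal sketch of how the two properties enable this search: a simplified ternary-search variant of my own, standing in for the paper’s branch-based procedure. `P` is assumed to be a callable that evaluates the fitness function at one λ (each evaluation solves an LP in the real method).

    def minimize_p(P, lo, hi, tol=1e-4):
        """Shrink a bracket around the minimizer of a function satisfying
        property 1 (quasiconvexity) and property 2 (no flat regions)."""
        while hi - lo > tol:
            m1 = lo + (hi - lo) / 3.0
            m2 = hi - (hi - lo) / 3.0
            p1, p2 = P(m1), P(m2)
            if p1 < p2:
                hi = m2          # property 1: minimizer cannot lie right of m2
            elif p1 > p2:
                lo = m1          # symmetric: minimizer cannot lie left of m1
            else:
                lo, hi = m1, m2  # property 2: the minimum lies in [m1, m2]
        return 0.5 * (lo + hi)

    print(minimize_p(lambda x: abs(x - 0.17), 1e-5, 1.0))  # ~0.17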
  • 53. A simple synthetic test case to demonstrate that having an example helps. Modularity (a special case of LambdaCC with λ = 1/(2m)) wasn’t able to get the community structure right for the graph G. Let’s fix that! 1. Generate a new random graph G’ from the same distribution. 2. Using the ground truth of G’, learn a resolution parameter λ’. 3. Cluster G using LambdaCC with λ = λ’. We’ve captured the community structure for a specific class of graphs and can detect the right answer! [Figure: the training graph G’ and the test graph G.]
  • 54. We tested this on a regime of synthetic graphs that is hard for modularity. The “mixing parameter” µ controls difficulty: smaller µ → ground truth easier to detect. For each µ, we train on one graph and test on 5 others. [Figure: one example at µ = 0.3; modularity often fails to separate ground-truth clusters.]
  • 55. We can use this to test whether a metadata attribute is reflected in some characteristic graph structure. (Listen, don’t read!) For the Caltech network, find the minimizing λ for the clustering X induced by a metadata attribute, then look at the objective function P(λ, X) = F(λ, X)/G(λ) at the minimizer. Do this for the real attribute and for a randomized attribute (just shuffle the labels); the shuffled version gives a null score where there is no relationship with graph structure. Results (min P_real / min P_fake): S/F 1.30/1.65, Gen 1.73/1.80, Maj. 2.03/2.12, Maj. 2 2.12/2.12, Res. 1.35/2.11, Yr 1.57/2.09, HS 2.11/2.12.
  • 56. [Build slide: repeats the previous slide’s table and explanation.]
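A minimal sketch of the shuffle test just described (my rendering, not the released code); `min_p` is a hypothetical stand-in for the paper’s LP-based computation of the minimum over λ of P(λ, X).

    import random

    def shuffle_null_test(nodes, labels, min_p, trials=10, seed=0):
        """labels: node -> attribute value; min_p: clustering dict -> min_lambda P."""
        real_score = min_p(dict(labels))
        rng = random.Random(seed)
        vals = [labels[v] for v in nodes]
        fake_scores = []
        for _ in range(trials):
            rng.shuffle(vals)                      # destroy any graph relationship
            fake_scores.append(min_p(dict(zip(nodes, vals))))
        # A real score well below the null average suggests the attribute
        # is reflected in graph structure.
        return real_score, sum(fake_scores) / trials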
  • 57. We can also investigate metadata sets in social networks. This led to a fun story! [Figure: for each metadata set, the objective ratio at its minimizer (i.e., how close you get to the LP lower bound) versus how well you do at finding those same sets again.]
  • 58. We can also investigate metadata sets in social networks. This led to a fun story! [Same figure, with the sets split into 2006–2008 and 2009.]
  • 59. A quick summary of other work from our research team on data-driven scientific computing. Our team’s overall goal is to design algorithms and methods tuned to the evolving needs and nature of scientific data analysis. Low-rank methods for network alignment – Huda Nassar -> Stanford • Principled methods that scale to aligning thousands of networks. Spectral properties and generation of realistic networks – Nicole Eikmeier -> Grinnell College • “Power laws” in the top singular values of the adjacency matrix are more robust than degree “power laws” • Fast sampling for hypergraph models with higher-order structure. Local analysis of network data – Meng Liu • Applications in bioinformatics; software https://github.com/kfoynt/LocalGraphClustering [Figure: a Kronecker graph with a 2×2 initiator, ⊗-powered three times to an 8×8 probability matrix.]
  • 60. Don’t ask what algorithm, ask what kind of clusters! Paper: arXiv:1903.05246 (at WWW2019). Code: github.com/nveldt/LearnResParams. Earlier work: the LambdaCC framework (at WWW2018) and arXiv:1806.01678. Software: github.com/nveldt/LamCC, github.com/nveldt/MetricOptimization. Issues: yeah, this is still slow ☹, and it needs to be generalized beyond LambdaCC (ongoing work with Meng Liu at Purdue). See the paper and code! With Nate Veldt (Purdue), Tony Wirth (Melbourne), Cameron Ruggles (Purdue), and James Saunderson (Monash).