Engineering Data Science Objectives for Social Network Analysis
1. Engineering Data Science Objective Functions for Social Network Analysis
David F. Gleich
Purdue University
With Nate Veldt (Purdue -> Cornell),
Tony Wirth (Melbourne)
Paper arXiv:1903.05246 Code github.com/nveldt/LearnResParams
LLNL · David Gleich · Purdue
2. Somewhere too close and very recently…
Application expert. “Hi, I see you work on
clustering. I want to cluster my data …
… what algorithm should I use?”
3. The dreaded question for people
who study clustering, community
detection, etc.
“What algorithm should I use?”
4. Why is this such a hard question?
5. Journal of Biomedicine and Biotechnology • 2005:2 (2005) 215–225 • DOI: 10.1155/JBB.2005.215
REVIEW ARTICLE
Finding Groups in Gene Expression Data
David J. Hand and Nicholas A. Heard
Department of Mathematics, Faculty of Physical Sciences, Imperial College, London SW7 2AZ, UK
Received 11 June 2004; revised 24 August 2004; accepted 24 August 2004
The vast potential of the genomic insight offered by microarray technologies has led to their widespread use since they were introduced a decade ago. Application areas include gene function discovery, disease diagnosis, and inferring regulatory networks. Microarray experiments enable large-scale, high-throughput investigations of gene activity and have thus provided the data analyst with a distinctive, high-dimensional field of study. Many questions in this field relate to finding subgroups of data profiles which are very similar. A popular type of exploratory tool for finding subgroups is cluster analysis, and many different flavors of algorithms have been used and indeed tailored for microarray data. Cluster analysis, however, implies a partitioning of the entire data set, and this does not always match the objective. Sometimes pattern discovery or bump hunting tools are more appropriate. This paper reviews these various tools for finding interesting subgroups.
INTRODUCTION
Microarray gene expression studies are now routinely used to measure the transcription levels of an organism's genes at a particular instant of time. These mRNA levels serve as a proxy for either the level of synthesis of proteins encoded by a gene or perhaps its involvement in a metabolic pathway. Differential expression between a control organism and an experimental or diseased organism can thus highlight genes whose function is related to the experimental challenge.
An often cited example is the classification of cancer types (Golub et al [1], Alizadeh et al [2], Bittner et al [3], […]). A microarray slide can typically hold tens of thousands of gene fragments whose responses here act as the predictor variables (p), whilst the number of patient tissue samples (n) available in such studies is much less (for the above examples, 38 in Golub et al, 96 in Alizadeh et al, 38 in Bittner et al, 41 in Nielsen et al, 63 in Tibshirani et al, and 80 in Parmigiani et al).
More generally, beyond such "supervised" classification problems, there is interest in identifying groups of genes with related expression level patterns over time or across repeated samples, say, even within the same classification label type. Typically one will be looking for coreg- […] between neighbouring frequencies; analogously for microarray data, there is evidence of correlation of expression of genes residing closely to one another on the chromosome (Turkheimer et al [17]). Thus when we come to look at cluster analysis for microarray data, we will see a large emphasis on methods which are computationally suited to cope with the high-dimensional data.
CLUSTER ANALYSIS
The need to group or partition objects seems fundamental to human understanding: once one can identify a class of objects, one can discuss the properties of the class members as a whole, without having to worry about individual differences. As a consequence, there is a vast literature on cluster analysis methods, going back at least as far as the earliest computers. In fact, at one point in the early 1980s new ad hoc clustering algorithms were being developed so rapidly that it was suggested there should be a moratorium on the development of new algorithms while some understanding of the properties of the existing ones [remainder of the excerpt truncated in extraction]
6. Why is this such a hard question?
There are many reasons people want to cluster data
• Help understand it
• Bin items for some downstream process
• …
There are many methods and strategies to cluster data
• Linkage methods from stats
• Partitioning methods
• Objective functions (K-means) and updating algorithms
• …
I can’t psychically intuit what you need from your data!
7. I don’t like studying clustering…
8. I don’t like studying clustering…
… so let’s do exactly that.
9.-13. Let's do some warm up. What are the clusters in this graph? (repeated over a sequence of example graphs)
Let's consult an expert!
14.-15. Let's do some warm up. What are the clusters in this graph? (two more examples)
16. Graph clustering seeks "communities" of nodes.
Objective functions (modularity, densest subgraph, maximum clique, conductance, sparsest cut, etc.) all seek to balance high internal density against low external connectivity.
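Several of the objectives named above can be computed directly for a candidate set. As a small illustration (a minimal sketch, not code from the talk), conductance balances external connectivity, cut(S), against internal volume: phi(S) = cut(S) / min(vol(S), vol(V \ S)).

```python
def conductance(edges, S):
    """Conductance of vertex set S in an undirected graph given as an edge list."""
    S = set(S)
    # Edges with exactly one endpoint in S are cut edges.
    cut = sum(1 for i, j in edges if (i in S) != (j in S))
    # vol(S) = sum of degrees of vertices in S; count endpoint membership per edge.
    vol_S = sum((i in S) + (j in S) for i, j in edges)
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)
```

For two triangles joined by a bridge, the triangle {0,1,2} has cut 1 and volume 7, giving conductance 1/7.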
17. Two objectives at opposite ends of the spectrum.
Sparsest cut:
  min over S of  cut(S)/|S| + cut(S)/|S̄|
18. Two objectives at opposite ends of the spectrum.
Sparsest cut:
  min over S of  cut(S)/|S| + cut(S)/|S̄|
Cluster Deletion: Minimize the number of edges removed to partition the graph into cliques.
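The sparsest cut score on this slide is easy to evaluate for a given set; a minimal sketch (illustrative, not the talk's code):

```python
def sparsest_cut_score(n, edges, S):
    """cut(S)/|S| + cut(S)/|S_bar| for a vertex set S in an n-node graph."""
    S = set(S)
    cut = sum(1 for i, j in edges if (i in S) != (j in S))
    return cut / len(S) + cut / (n - len(S))
```

For two triangles joined by a bridge, taking S to be one triangle gives cut 1 and score 1/3 + 1/3 = 2/3.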
19. We show sparsest cut and cluster deletion are two special
cases of the same new clustering framework:
LAMBDACC = λ-Correlation Clustering
This framework also leads to
- new connections to other objectives (including modularity!)
- new approximation algorithms (2-approx for cluster deletion)
- several experiments/applications (social network analysis)
- (aside) fast method for LPs w/ metric constraints (for approx. algs)
20. And now you are thinking…
… is this talk really going to propose
another new method?!??!?
21. I’m going to advocate for flexible
clustering frameworks, which we can
then engineer to “fit” example data
22. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (−).
23. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (−).
Edges can be weighted (e.g. w+_ij, w−_jk), but problems become harder.
24. Our framework is based on correlation clustering. Edges in a signed graph indicate similarity (+) or dissimilarity (−).
[Figure: nodes i, j, k with a positive edge w+_ij and a negative edge w−_jk, each marked "Mistake"]
Objective: Minimize the weight of "mistakes".
25. You can use correlation clustering to cluster unsigned graphs.
Given G = (V, E), construct a signed graph G′ = (V, E+, E−), an instance of correlation clustering.
LAMBDACC: To model sparsest cut or cluster deletion, set the resolution parameter λ ∈ (0, 1).
Unweighted correlation clustering is the same problem as cluster editing.
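The construction above can be made concrete. In this minimal sketch (illustrative names, not the LearnResParams code), each edge of G becomes a positive pair with weight 1 − λ, each non-edge a negative pair with weight λ, and the objective sums the weights of mistakes for a given clustering:

```python
from itertools import combinations

def lambdacc_objective(n, edges, labels, lam):
    """Weight of LambdaCC mistakes for a clustering `labels` of an unsigned graph."""
    E = {frozenset(e) for e in edges}
    total = 0.0
    for i, j in combinations(range(n), 2):
        same = labels[i] == labels[j]
        if frozenset((i, j)) in E:
            if not same:           # positive edge split apart: weight 1 - lam
                total += 1 - lam
        elif same:                 # non-edge clustered together: weight lam
            total += lam
    return total
```

For a triangle {0,1,2} with a pendant node 3 and λ = 0.5, putting everything in one cluster costs 1.0 (two non-edges inside), while splitting node 3 off costs 0.5 (one cut edge).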
26. Consider a restriction to two clusters, S and S̄.
Positive mistakes: (1 − λ) cut(S)
Negative mistakes: λ|E−| − λ[|S||S̄| − cut(S)]
Total weight of mistakes = cut(S) − λ|S||S̄| + λ|E−|
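The closed form above can be checked numerically against a direct count of mistakes; a small sketch with a hypothetical helper, assuming the unweighted construction from the previous slide:

```python
from itertools import combinations

def two_cluster_costs(n, edges, S, lam):
    """Return (direct mistake count, closed form) for the split S vs. its complement."""
    E = {frozenset(e) for e in edges}
    S = set(S)
    cut = sum(1 for e in E if len(e & S) == 1)
    # Non-edge pairs whose endpoints land in the same cluster.
    neg_inside = sum(1 for i, j in combinations(range(n), 2)
                     if frozenset((i, j)) not in E and (i in S) == (j in S))
    direct = (1 - lam) * cut + lam * neg_inside
    E_neg = n * (n - 1) // 2 - len(E)          # total non-edge pairs
    closed = cut - lam * len(S) * (n - len(S)) + lam * E_neg
    return direct, closed
```

On the triangle-plus-pendant example with S = {0, 1} and λ = 0.3, both expressions give 1.4.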
27. This is a scaled version of sparsest cut!
Two-cluster LAMBDACC can be written
  minimize cut(S) − λ|S||S̄| + λ|E−|   (the λ|E−| term is constant)
and cut(S) − λ|S||S̄| < 0 ⟺ cut(S)/(|S||S̄|) < λ.
Note: cut(S)/|S| + cut(S)/|S̄| = |V| · cut(S)/(|S||S̄|).
28. We can write the objective in terms of cuts to get a relationship with sparsest cut.
The general LAMBDACC objective can be written
  minimize (1/2) Σ_{i=1}^{k} cut(S_i) − (λ/2) Σ_{i=1}^{k} |S_i||S̄_i| + λ|E−|
THEOREM. Minimizing this objective produces clusters with scaled sparsest cut at most λ (if they exist). There exists some λ′ such that minimizing LAMBDACC will return the minimum sparsest cut partition.
29. For large λ, LAMBDACC generalizes cluster deletion.
Cluster deletion is correlation clustering with infinite penalties on negative edges. We show this is equivalent to LAMBDACC for the right choice of λ, i.e. λ ≫ (1 − λ).
30. Degree-weighted LAMBDACC is related to Modularity.
Positive weight: 1 − λ d_i d_j
Negative weight: λ d_i d_j
LAMBDACC is a linear function of Modularity, though this does not preserve approximations…
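The linear relationship can be sanity-checked numerically. In this sketch (illustrative, not the paper's derivation or code) we take λ = 1/(2m), score each pair with the degree-weighted weights above, and also accumulate the unnormalized modularity sum M = Σ over same-cluster pairs of (A_ij − λ d_i d_j); objective + M then comes out the same constant for every clustering, so one is a linear function of the other.

```python
from itertools import combinations

def lcc_and_mod(n, edges, labels):
    """Degree-weighted LambdaCC mistake weight and unnormalized modularity sum."""
    E = {frozenset(e) for e in edges}
    d = [0] * n
    for i, j in edges:
        d[i] += 1
        d[j] += 1
    lam = 1.0 / (2 * len(edges))               # modularity's resolution
    obj = mod = 0.0
    for i, j in combinations(range(n), 2):
        a = 1.0 if frozenset((i, j)) in E else 0.0
        same = labels[i] == labels[j]
        if a and not same:                     # positive mistake: 1 - lam*d_i*d_j
            obj += 1 - lam * d[i] * d[j]
        elif not a and same:                   # negative mistake: lam*d_i*d_j
            obj += lam * d[i] * d[j]
        if same:
            mod += a - lam * d[i] * d[j]
    return obj, mod
```

On a small example, obj + mod equals Σ over edges of (1 − λ d_i d_j) regardless of the labeling.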
32. And now, an answer to one of the
most frequently asked questions in
clustering.
“What method should I use”?
33. Changing your method (implicitly) changes the value of λ that you are using.
[Figure: ratio to LP bound vs. λ (1e-05 to 0.85) for Graclus, Louvain, InfoMap, RMQC, and RMC, spanning the sparse cut regime through the dense subgraph regime]
This figure shows that if you use one of these algorithms (Graclus, Louvain, InfoMap, recursive max-quasi-clique, or recursive max-clique), then you implicitly minimize λ-CC for some choice of λ.
Turns the question "what method should I use?" into "what λ should I use?"
34. (Same figure and text as the previous slide.)
We wrote an entire SIMODS paper explaining how we made this figure! The LP bound involves an LP with 12 billion constraints.
35. LambdaCC inspires an approach for learning the "right" objective function to use for new applications.
"How should I set λ for my new clustering application?"
"Can you give me an example of what you want your clusters to look like?"
"I want communities that look like this!"
36. The goal is not to reproduce the example clusters.
The goal is to find sets with similar properties: size and density tradeoffs.
37. Let's go back to the figure we just saw.
[Figure: ratio to LP bound vs. λ (1e-05 to 0.85) for Graclus, Louvain, InfoMap, RMQC, and RMC]
Each clustering traces out a bowl-shaped curve. The minimum point on each curve tells us the λ regime where the clustering optimizes LambdaCC.
38. So the "example" clustering will also correspond to some type of curve.
[Figure: ratio to LP bound vs. λ (0.13 to 0.5), showing the example clustering's bowl-shaped curve]
39. As will any other clustering.
[Figure: the same plot with additional clusterings' curves added]
40.-42. (Builds of the previous slide: each additional clustering traces out its own curve.)
43. Strategy.
Start with a fixed "good" clustering example. Find the minimizer for its curve, to get a λ that is designed to produce similar clusterings!
Challenge.
We want to do this without computing the entire curve.
This is a new optimization problem where we are optimizing over λ!
[Figure: ratio to LP bound vs. λ for the example clustering's curve]
44. What function is tracing out these curves?
[Figure: ratio to LP bound vs. λ]
The "parameter fitness function": P_C(λ) = F_C(λ) / G(λ)
• F_C(λ): score for a clustering C; a linear function in λ.
• G(λ): LambdaCC LP bound for fixed λ; a parametric LP, concave and piecewise linear in λ (Adler & Monteiro 1992).
45. We prove two useful properties about P.
Since F_C is linear and G is concave and piecewise linear, P satisfies the following two properties:
1. If λ⁻ < λ < λ⁺, then P(λ) ≤ max{P(λ⁻), P(λ⁺)}.
2. If P(λ⁻) = P(λ⁺), then P achieves its minimum in [λ⁻, λ⁺].
Translation…
1. Once P goes up, it can't go back down.
46. (Same properties as the previous slide.)
Translation…
1. Once P goes up, it can't go back down.
2. There are no "flat" regions where we might get stuck.
47. This allows us to minimize P without seeing all of it.
[Figure: the curve with one evaluated λ marked]
We know the minimizer can't be to the left of this point.
48. This allows us to minimize P without seeing all of it.
[Figure: same point marked] We know the minimizer can't be to the left of this point. So this is possible.
49. This allows us to minimize P without seeing all of it.
[Figure: same point marked] We know the minimizer can't be to the left of this point. So this is possible. But so is this.
50. This allows us to minimize P without seeing all of it.
[Figure: a second evaluated λ marked] Evaluate P at a new point. So we've ruled out this possibility! Now we know the minimizer can't be to the right of this one.
51. This allows us to minimize P without seeing all of it.
[Figure: two evaluated points with equal values] If two input λ have the same fitness score, the minimizer is between them. …so it's not over here.
52. We developed a bisection-like approach for minimizing P by evaluating it at carefully selected points.
One-branch scenario: the minimizer isn't in [m, r].
Two-branch scenario: evaluate a couple more points to rule out [m, r].
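Property 1 (once P goes up, it can't come back down) is a quasiconvexity property, so a ternary-search-style bisection converges; here is a minimal sketch of that idea (the paper's method additionally reuses LP information across evaluations, which this toy does not model):

```python
def quasiconvex_min(P, lo, hi, iters=100):
    """Narrow in on the minimizer of a quasiconvex function P over [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if P(m1) <= P(m2):
            hi = m2          # by property 1, no minimizer lies in (m2, hi]
        else:
            lo = m1          # by property 1, no minimizer lies in [lo, m1)
    return (lo + hi) / 2
```

Each iteration shrinks the interval by a factor of 2/3 using only two evaluations, which is the sense in which P can be minimized "without seeing all of it."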
53. A simple synthetic test case to demonstrate that having an example helps.
Modularity (a special case of LambdaCC with λ = 1/(2m)) wasn't able to get the community structure right for the graph G. Let's fix that!
1. Generate a new random graph G′ from the same distribution.
2. Using the ground truth of G′, learn a resolution parameter λ′.
3. Cluster G using LambdaCC with λ = λ′.
We've captured the community structure for a specific class of graphs and can detect the right answer!
54. We tested this on a regime of synthetic graphs that is hard for modularity.
Smaller "mixing parameter" µ → ground truth easier to detect. For each µ, we trained on one graph and tested on 5 others.
One example when µ = 0.3: modularity often fails to separate ground truth clusters.
55. We can use this to test if a metadata attribute seems to be reflected in some characteristic graph structure.
            S/F    Gen    Maj.   Maj. 2  Res.   Yr     HS
min P_real  1.30   1.73   2.03   2.12    1.35   1.57   2.11
min P_fake  1.65   1.80   2.12   2.12    2.11   2.09   2.12
(Listen, don't read!)
For the Caltech network, find the minimum value of lambda for a clustering X induced by a metadata attribute. Then look at the objective function P(λ,X) = F(λ,X)/G(λ) at the minimizer. Do this for the real attribute and a randomized attribute (just shuffle the labels); that gives a null score where there is no relationship with graph structure.
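The "shuffle the labels" null test reads, in outline, like a standard permutation test; a sketch with a hypothetical `score` callable standing in for min over λ of P(λ, X), which in the paper requires LP solves:

```python
import random

def permutation_null(labels, score, trials=200, seed=0):
    """Distribution of `score` over random label shufflings (the null model)."""
    rng = random.Random(seed)
    fake = list(labels)
    scores = []
    for _ in range(trials):
        rng.shuffle(fake)            # randomize which node gets which label
        scores.append(score(fake))
    return scores
```

If the real attribute's score sits well below this null distribution, the attribute plausibly reflects graph structure.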
56. (Repeat of the previous slide's table and explanation.)
57. We can also investigate metadata sets in social networks. This led to a fun story!
[Figure: x-axis, the objective ratio at a minimum (1 to 1.8), i.e. how close you get to the lower bound; y-axis, how well you do at finding those same sets again (0 to 1)]
58. (Same figure, with points split into 2006-2008 and 2009.)
59. A quick summary of other work from our research team on data-driven scientific computing.
Our team's overall goal is to design algorithms and methods tuned to the evolving needs and nature of scientific data analysis.
Low-rank methods for network alignment – Huda Nassar -> Stanford.
• Principled methods that scale to aligning thousands of networks.
Spectral properties and generation of realistic networks – Nicole Eikmeier -> Grinnell College.
• "Power-laws" in the top singular values of the adjacency matrix are more robust than degree "power-laws".
• Fast sampling for hypergraph models with higher-order structure.
Local analysis of network data – Meng Liu.
• Applications in bioinformatics; software: https://github.com/kfoynt/LocalGraphClustering
[Fig. 5: a Kronecker graph with a 2×2 initiator "⊗-powered" three times to an 8×8 probability matrix]
60. Don't ask what algorithm, ask what kind of clusters!
Paper: arXiv:1903.05246 (at WWW2019); arXiv:1806.01678 (at WWW2018)
Code: github.com/nveldt/LearnResParams
Software: github.com/nveldt/LamCC, github.com/nveldt/MetricOptimization
Issues.
• Yeah, this is still slow ☹
• Needs to be generalized beyond lambda-CC (ongoing work with Meng Liu at Purdue)
See the paper and code!
With Nate Veldt (Purdue), Tony Wirth (Melbourne), Cameron Ruggles (Purdue), James Saunderson (Monash).