This is my KDD 2015 talk on robustness in semi-supervised learning. The paper is already on Michael Mahoney's website: http://www.stat.berkeley.edu/~mmahoney/pubs/robustifying-kdd15.pdf See the KDD paper for all the details; this talk is a bit light on them.
Using Local Spectral Methods to Robustify Graph-Based Learning
1. Using Local Spectral Methods to Robustify Graph-Based Learning
David F. Gleich, Purdue University
Joint work with Michael Mahoney @ Berkeley
Supported by NSF CAREER CCF-1149756
Code: www.cs.purdue.edu/homes/dgleich/codes/robust-diffusions
KDD 2015
2. The graph-based data analysis pipeline
[Figure: a 0/1 adjacency matrix illustrating raw relational data]
"
Raw data!
• Relationships
• Images
• Text records
• Etc.
"
Convert to a graph!
• Nearest neighs
• Kernels
• 2-mode to 1-mode
• Etc.
"
Algorithm/Learning!
• Important nodes
• Infer features
• Clustering
• Etc.
3. “Noise” in the initial data modeling decisions

Explicit graphs are those that are given to a data analyst.
“A social network”
• Known spam accounts included?
• Users not logged in for a year?
• Etc.
A type of noise.

Constructed graphs are built from some other primary data.
“Nearest neighbor graphs”
• k-NN or ε-NN
• Thresholding correlations to zero
Often made for computational convenience! (Graph too big.)
A different type of noise.

Labeled graphs occur in information diffusion/propagation.
“Function prediction”
• Labeled nodes
• Labeled edges
• Some are wrong
A direct type of noise.

Do these decisions matter? Our experience: Yes! Dramatically so!
4. The graph-based data analysis pipeline
[Figure: a 0/1 adjacency matrix illustrating raw relational data]
"
Raw data!
• Relationships
• Images
• Text records
• Etc.
"
Convert to a graph!
• Nearest neighs
• Kernels
• 2-mode to 1-mode
• Etc.
"
Algorithm/Learning!
• Important nodes
• Infer features
• Clustering
• Etc.
Most algorithmic and statistical research happens in the algorithm/learning step.
The potential downstream signal is determined by the graph-construction step.
5. Our goal: towards an integrative analysis
[Figure: a 0/1 adjacency matrix illustrating raw relational data]
• How does the graph creation process affect the outcomes of graph-based learning?
• Is there anything we can do to make this process more robust?
7. Scalable graph analytics
Local methods are one of the most successful classes of scalable graph analytics.
They don't even look at the entire graph.
• Andersen-Chung-Lang (ACL) Push method
Conjecture: local methods regularize some variant of the original algorithm or problem.
Justification: for ACL and a few relatives this is exact!
Impact? Improved robustness to noise?
For instance, to answer "what function" is shared by the starting node, we'd only look at the circled region.
8. Our contributions
We study these issues in the case of semi-supervised learning (SSL) on graphs.
1. We illustrate a common mincut framework for a variety of SSL methods.
2. We show how to "localize" one (and make it scalable!).
3. We provide a more robust SSL labeling method.
4. We identify a weakness in SSL methods: they cannot use extra edges! We find one useful way to do so.
9. Semi-supervised graph-based learning
Given a graph and a few labeled nodes, predict the labels on the rest of the graph.
Algorithm
1. Run a diffusion for each label (possibly with negative info from other classes).
2. Assign new labels based on the value of each diffusion.
12. The diffusions proposed for semi-supervised learning are s,t-cut minorants
[Figure: an s,t-cut instance on a small example graph with nodes 1-10 plus a source s and sink t]

In the unweighted case, solve via max-flow.
In the weighted case, solve via network simplex or an industrial LP solver.

MINCUT LP:
minimize $\sum_{ij \in E} C_{i,j} |x_i - x_j|$
subject to $x_s = 1, \; x_t = 0$.

Spectral minorant (a linear system):
minimize $\sqrt{\sum_{ij \in E} C_{i,j} |x_i - x_j|^2}$
subject to $x_s = 1, \; x_t = 0$.
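Aside: the spectral minorant above is just a graph-Laplacian linear system with $x_s$ and $x_t$ as boundary conditions. A minimal dense-matrix sketch of that solve (C, s, and t are the same objects as in the formulas; everything else is illustrative):

```python
import numpy as np

def spectral_minorant(C, s, t):
    """Solve  min sum_{ij} C[i,j] (x_i - x_j)^2  s.t.  x_s = 1, x_t = 0.

    C is a symmetric nonnegative weight matrix (dense, for illustration only).
    The minimizer satisfies a graph-Laplacian linear system on the
    non-terminal nodes, with the terminal values held as boundary conditions.
    """
    n = C.shape[0]
    L = np.diag(C.sum(axis=1)) - C                    # graph Laplacian
    free = [i for i in range(n) if i not in (s, t)]
    x = np.zeros(n)
    x[s], x[t] = 1.0, 0.0                             # boundary conditions
    # Move the fixed terminal values to the right-hand side and solve
    # the reduced system  L[free, free] x_free = -L[free, {s, t}] [1, 0]^T.
    rhs = -L[np.ix_(free, [s, t])] @ np.array([1.0, 0.0])
    x[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
    return x
```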
13. Representative cut problems
[Figure: two representative cut constructions on the same example graph. ZGL attaches the labeled nodes to s and t with infinite-weight edges; Zhou et al. attach every node to s or t with edges whose weight scales with the node's degree (the α-weighted edges in the figure). Legend: positive label, negative label, unlabeled.]
There is an Andersen-Lang weighting variation too, and Joachims has a variation as well.
(Zhou et al., NIPS 2003; Zhu et al., ICML 2003; Andersen-Lang, SODA 2008; Joachims, ICML 2003)
These help our intuition about the solutions.
All spectral minorants are linear systems.
14. Implicit regularization views on the Zhou et al. diffusion
[Figure: the Zhou et al. cut construction again, with every node attached to s or t by degree-scaled, α-weighted edges]
RESULT
The spectral minorant of Zhou is equivalent to the weakly-local MOV solution.
PROOF
The two linear systems are the same (after working out a few equivalences).
IMPORTANCE
We'd expect Zhou to be "more robust."

Spectral minorant:
minimize $\sqrt{\sum_{ij \in E} C_{i,j} |x_i - x_j|^2}$
subject to $x_s = 1, \; x_t = 0$.

The Mahoney-Orecchia-Vishnoi (MOV) vector is a localized variation on the Fiedler vector used to find a small-conductance set near a seed set.
15. A scalable, localized algorithm for Zhou et al.'s diffusion
RESULT
We can use a variation on coordinate descent methods related to the Andersen-Chung-Lang PUSH procedure to solve Zhou's diffusion in a scalable manner.
PROOF
See Gleich-Mahoney, ICML 2014.
IMPORTANCE (1)
We should be able to make Zhou et al. scale.
IMPORTANCE (2)
Using this algorithm adds another implicit regularization term that should further improve robustness!
Original spectral minorant:
minimize $\sqrt{\sum_{ij \in E} C_{i,j} |x_i - x_j|^2}$
subject to $x_s = 1, \; x_t = 0$.

Localized, implicitly regularized variant:
minimize $\sum_{ij \in E} C_{i,j} |x_i - x_j|^2 + \tau \sum_{i \in V} d_i x_i$
subject to $x_s = 1, \; x_t = 0, \; x_i \ge 0$.
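For concreteness, here is a minimal sketch of the kind of ACL-style push loop the slide refers to, written for a generic seeded, PageRank-style diffusion. The paper's actual solver for the regularized Zhou objective differs in its update rule and parameters (see the robust-diffusions code linked on the title slide); `adj`, `alpha`, and `eps` are illustrative names.

```python
from collections import deque

def acl_push(adj, seed, alpha=0.85, eps=1e-6):
    """ACL-style push: approximate a seeded, PageRank-style diffusion
    without touching most of the graph.

    adj  : dict mapping node -> list of neighbors (undirected graph)
    seed : the labeled/seed node
    alpha: teleportation-style parameter of the diffusion
    eps  : per-degree residual tolerance; smaller = more accurate, less local

    This is a sketch of the standard push idea, not the paper's exact solver.
    """
    x = {}                      # approximate diffusion values (sparse)
    r = {seed: 1.0}             # residual mass, starts entirely at the seed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        du = len(adj[u])
        if du == 0:
            continue            # isolated node; nothing to push
        ru = r.get(u, 0.0)
        if ru < eps * du:
            continue            # residual too small; skip stale queue entry
        r[u] = 0.0
        x[u] = x.get(u, 0.0) + alpha * ru
        spread = (1.0 - alpha) * ru / du
        for v in adj[u]:        # push the remaining mass to the neighbors
            r[v] = r.get(v, 0.0) + spread
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
    return x                    # only nodes the diffusion actually reached
```

For multi-class SSL, one would run a diffusion like this from each class's labeled nodes and then round the resulting vectors, as in the slides that follow.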
16. Semi-supervised graph-based learning
Given a graph and a few labeled nodes, predict the labels on the rest of the graph.
Algorithm
1. Run a diffusion for each label (possibly with negative info from other classes).
2. Assign new labels based on the value of each diffusion.
17. Traditional rounding methods for SSL are value-based
[Figure: Zhou's diffusion run separately for Class 1, Class 2, and Class 3 on a three-class example, and the resulting CLASS 1 / CLASS 2 / CLASS 3 assignments]

VALUE-BASED: use the largest value of the diffusion to pick the label.
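As a sketch (not code from the paper), value-based rounding is a single argmax once the per-class diffusion vectors are stacked into an assumed (n_nodes, n_classes) array:

```python
import numpy as np

def value_based_labels(diffusions):
    """diffusions: (n_nodes, n_classes) array with one diffusion per class.
    Each node gets the class whose diffusion value at that node is largest."""
    return np.argmax(diffusions, axis=1)
```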
18. But value-based rounding doesn't work for all diffusions
[Figure: value-based rounding of the three-class diffusions for (b) Zhou et al., (c) Andersen-Lang, (d) Joachims, and (e) ZGL, shown with l = 3 labels (top row) and l = 15 labels (bottom row)]
VALUE-BASED rounding fails for most of these diffusions, BUT there is still a signal there!
Adding more labels doesn't help either; see the paper for those details.
19. Rank-based rounding is far more robust
[Figure: the three-class example rounded by rank]

NEW IDEA
Look at the RANK of the item in each diffusion instead of its VALUE.
JUSTIFICATION
Based on the idea of sweep-cut rounding in spectral methods (use the order induced by the eigenvector, not its values).
IMPACT
Much more robust rounding to labels.
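A small sketch of the rank-based idea under the same assumed (n_nodes, n_classes) layout; the paper's exact normalization and tie-breaking may differ:

```python
import numpy as np

def rank_based_labels(diffusions):
    """Compare each node's rank within each class's diffusion, not its value.
    A smaller rank means the node is nearer the top of that diffusion's order;
    each node gets the class in which it ranks best."""
    n = diffusions.shape[0]
    order = np.argsort(-diffusions, axis=0)      # sort each column, descending
    ranks = np.empty_like(order)
    for c in range(diffusions.shape[1]):
        ranks[order[:, c], c] = np.arange(n)     # rank of each node in class c
    return np.argmin(ranks, axis=1)              # best (smallest) rank wins
```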
20. Rank-based rounding has a big impact on a real study
[Figure: error rate vs. average training samples per class (2 to 10) for Zhou and Zhou+Push; one panel uses value-based rounding, the other rank-based rounding]

We used the digit prediction task from Zhou's paper, added just a bit of noise as label errors, and switched parameters.
21. Main empirical results
1. Zhou's diffusion seems to work best for sparse graphs, whereas the ZGL diffusion works best for dense graphs.
2. On the digits dataset, dense graph constructions yield higher error rates.
3. Densifying a super-sparse graph construction on the digits dataset yields lower error.
4. A similar fact holds on an Amazon co-purchasing network.
22. An illustrative synthetic problem shows the differences
Two-class block model, 150 nodes per class; between-class edge probability = 0.02, within-class probability = 0.35 (dense) or 0.06 (sparse).
We reveal labels for k nodes (varied) and use different label-error rates: 0%/10% (low/high) for the sparse graphs and 20%/60% (low/high) for the dense graphs.
[Figure: number of mistakes vs. number of labels for Joachims, Zhou, and ZGL in four settings: sparse graph with low label error; sparse graph with high label error (the "real-world scenario"); dense graph with low error rate; dense graph with high error rate]
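A short sketch that samples the two-block model described above, using the slide's probabilities as defaults (the function name and seeding are illustrative):

```python
import numpy as np

def two_block_model(n_per_class=150, p_within=0.06, p_between=0.02, seed=0):
    """Sample the two-class stochastic block model from the slide:
    150 nodes per class, within-class edge probability 0.35 (dense)
    or 0.06 (sparse), between-class probability 0.02."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    labels = np.repeat([0, 1], n_per_class)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_within, p_between)
    A = (rng.random((n, n)) < probs).astype(float)
    A = np.triu(A, 1)            # keep the upper triangle, no self-loops
    A = A + A.T                  # symmetrize for an undirected graph
    return A, labels
```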
23. Varying density in an SSL construction
$A_{i,j} = \exp\!\left( -\frac{\| d_i - d_j \|^2}{2 \sigma^2} \right)$
(illustrated for kernel widths σ = 2.5 and σ = 1.25, where the d_i are the raw data points)
We use the digits experiment from Zhou et al. 2003: 10 digit classes and a few label errors.
We vary the density either by the number of nearest neighbors or by the kernel width.
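A sketch of the two density knobs, assuming the raw data points are the rows of a matrix X: the Gaussian kernel weight follows the formula above, and an optional k keeps only each node's k strongest neighbors (the parameter names are illustrative, not from the paper):

```python
import numpy as np

def gaussian_kernel_graph(X, sigma=2.5, k=None):
    """Build a weighted graph from data points X (rows) with the Gaussian
    kernel A[i,j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    If k is given, additionally sparsify by keeping only each node's k
    nearest neighbors (then symmetrizing), which is one of the two ways
    the slide varies the graph's density.
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # pairwise squared distances
    A = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)                          # no self-loops
    if k is not None:
        keep = np.zeros_like(A, dtype=bool)
        nn = np.argsort(-A, axis=1)[:, :k]            # k largest weights per row
        rows = np.arange(A.shape[0])[:, None]
        keep[rows, nn] = True
        A = A * np.maximum(keep, keep.T)              # symmetrize the kNN mask
    return A
```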
24. As density increases, the results just get worse
[Figure: error rate for Zhou and Zhou+Push as the kernel width σ varies (0.8 to 2.5) and as the number of nearest neighbors varies (5 to 250)]
• Adding "more" edges seems to only hurt (unless there is no signal).
• Zhou+Push seems to be slightly more robust (maybe).
25. Some observations and a question
Adding "more data" yields "worse results" for this procedure (in a simple setting).
Suppose I have a real-world system that can work with up to E edges on some graph.
Is there a way I can create new edges?
26. Densifying the graph with path expansions
$A_k = \sum_{\ell=1}^{k} A^{\ell}$
If A is the adjacency matrix, then this counts the total weight on all paths up to length k.
We now repeat the nearest neighbor computation, but with paired parameters such that we have the same average degree.
Avg. Deg. | Zhou, k = 1 | Zhou, k > 1 | Zhou w. Push, k = 1 | Zhou w. Push, k > 1
19        | 0.163       | 0.114       | 0.156               | 0.117
41        | 0.156       | 0.132       | 0.158               | 0.113
53        | 0.183       | 0.142       | 0.179               | 0.136
104       | 0.193       | 0.145       | 0.178               | 0.144
138       | 0.216       | 0.102       | 0.204               | 0.101
(k = 4, nn = 3)
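A sparse-matrix sketch of the path expansion itself; the slide's follow-up step of re-running the nearest-neighbor selection so the average degree matches the k = 1 construction is omitted here:

```python
import scipy.sparse as sp

def path_expansion(A, k):
    """Densify an adjacency matrix by summing powers:
    A_k = A + A^2 + ... + A^k, i.e. the total weight of all paths of
    length up to k between each pair of nodes."""
    A = sp.csr_matrix(A)
    Ak = A.copy()
    power = A.copy()
    for _ in range(2, k + 1):
        power = power @ A          # next power of A
        Ak = Ak + power
    return Ak
```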
27. The same result holds for Amazon's co-purchasing network
Mean F1 (with confidence intervals):
k | Zhou               | Zhou w. Push
1 | 0.173 [0.15, 0.19] | 0.229 [0.21, 0.25]
2 | 0.197 [0.18, 0.22] | 0.231 [0.21, 0.25]
3 | 0.221 [0.17, 0.27] | 0.238 [0.19, 0.28]
Amazon's co-purchasing network (on SNAP) is effectively a highly sparse nearest-neighbor network derived from their (denser) co-purchasing graph.
We attempt to predict the items in a product category based on a small sample and study the F1 score of the predictions.
Some small details are omitted here; see the full paper.
$A_k = \sum_{\ell=1}^{k} A^{\ell}$
28. Towards some theory: why are densified sparse graphs better?

[Figure (the paper's Figure 2): panels (a) K2 sparse, (b) K2 dense, (c) RK2, showing the graph artificially densified to A_k with sparse and dense diffusions and regularization; annotations mark the labels and the error]

How do sparsity, density, and regularization of a diffusion play into the results in a controlled setting?
29. Towards some theory: why are densified sparse graphs better?

[Figure (the paper's Figure 2, annotated): panels (a) K2 sparse, (b) K2 dense, (c) RK2; annotations mark the labels, the error, the densified matrix $\sum_{\ell=1}^{5} A^{\ell}$, the dense diffusions, the regularization, and the use of the push algorithm]
30. Recap, discussion, future work
Contributions
1. Flow setup for SSL diffusions
2. New robust rounding rule for class selection
3. Localized Zhou's diffusion
4. Empirical insights on the density of graph constructions

Observations
• Many of these insights translate to directed, weighted graphs with fuzzy labels and/or some parallel architectures.
• Weakness: the density results are mainly empirical.
• We need a theoretical basis for the densification!
Supported by NSF, ARO, DARPA
CODE www.cs.purdue.edu/homes/dgleich/codes/robust-diffusions