High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 2011, Big Learning)
1. High-Performance Computing Needs Machine Learning... And Vice Versa
(was “GPU Metaprogramming: A Case Study in Large-Scale Convolutional Neural Networks”)
Nicolas Pinto
NIPS “Big Learning” | December 16th, 2011
The Rowland Institute at Harvard, Harvard University
34–38. How are things done normally?
Usual Formula:
1) One grad student
2) One Model (size limited by runtime)
3) Performance numbers on a few standard test sets
4) yay. we. rock.
5) One Ph.D.
39. What do you call this?
“This is graduate student descent”
- David McAllester
41. What’s better than this?
“Conjugate graduate student descent?”
- Nicolas Poilvert
43–48. Doing things a little bit differently
1) One grad student
2) One Hundreds of Thousands of BIG Models
3) Performance numbers on a few standard test sets
4) yay. we. rock.
5) One Hundreds of Thousands of PhDs?
49. “If you want to have good ideas you must have many ideas. Most of them will be wrong, and what you have to learn is which ones to throw away.”
Linus Pauling (double Nobel Prize winner)
56. The curse of speed
thousands of big models
large amounts of unsupervised
learning experience
57–58. The curse of speed
...and the blessing of massively parallel computing
No off-the-shelf solution? DIY!
Engineering (Hardware/SysAdmin/Software) + Science
Leverage non-scientific high-tech markets and their $billions of R&D...
Gaming: Graphics Cards (GPUs), PlayStation 3
Web 2.0: Cloud Computing (Amazon, Google)
62. speed
(in billion floating point operations per second)
Q9450 (Matlab/C)        [2008]    0.3
Q9450 (C/SSE)           [2008]    9.0
7900GTX (OpenGL/Cg)     [2006]   68.2
PS3/Cell (C/ASM)        [2007]  111.4
8800GTX (CUDA1.x)       [2007]  192.7
GTX280 (CUDA2.x)        [2008]  339.3
GTX480/Fermi (CUDA3.x)  [2010]  974.3
>1000X speedup is game changing...
Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
71. Human vs. Machine
8-way object categorization (% correct):
baseline: 31.3 | best model: 64 | best human: 99.1 | chance: 12.5%
72–76. What does it all mean? what have we learned?
[Model architecture: Grayscale Input → Normalize → L1, L2, L3 (each layer: Filter (Φ1, Φ2, ..., Φk) → Threshold & Saturate → Pool → Normalize) → Linear SVM (simple classifier)]
➡ dimensionality: more filters is better
➡ learning is difficult
➡ non-linearities are important
➡ normalization is very important
(missed in previous modeling efforts; now confirmed by LeCun et al., Poggio et al., Ng et al.)
77. What are these models not good for?
[Example images: objects, low-level backgrounds, faces]
81. Facebook
Really Real World Problem
enormous scale: billions of photos, 3TB+ uploaded every day
dense, collaborative face labels
collab. with Zak Stone & Todd Zickler @ Harvard
86. High-Throughput Screening
Labeled Faces in the Wild (LFW) View 1
> 30,000 large-scale models (1 to 3 layers) screened in only 3 days
HT L3s (3 layers): top 5 models by LFW View 1 performance
No Unsupervised Learning!
Pinto, Cox (FG 2011); Pinto, Stone, Zickler, Cox (CVPR 2011)
87. Generalization
Performance on LFW View 2 (hold out)
Face Verification Performance (% correct):
V1-like (one layer): 79.4 | Kumar et al. ICCV 2009: 85.3 | Wolf et al. ACCV 2009: 86.8 | Ours (HT): 88.1 (also compared: face.com)
Pinto, Cox (FG 2011)
89. Auto-tagging
a network of 100 Facebook friends
> 86%
accurate
(w/ 90 training examples)
collab. with Zak Stone & Todd Zickler @ Harvard
Pinto, Stone, Zickler, Cox (CVPR 2011)
91. vs face.com
comparison with a heavily-specialized commercial system
[Bar chart: Performance (% correct) vs. training example(s) / friend, comparing L3 (hardware-accelerated brute-force random model), face.com (best technology around), and V1-like (one layer)]
Pinto, Stone, Zickler, Cox (CVPR 2011)
95–98. Two conflicting requirements
The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run (need: FAST)
Neural data only provides weak constraints
➡ Lots of parameters – hard to explore (need: FLEXIBLE)
How to optimize?
105. Meta-programming !
Leave the grunt-programming to the
computer (i.e. auto-tuning like ATLAS or FFTW)
• Dynamically compile specialized versions
of the same kernel for different conditions
• Empirical run-time tuning
• For free: smooth syntactic ugliness: unroll
loops, index un-indexable registers, etc.
106. Meta-programming !
“Instrument” your solutions:
• Block size
• Work size
• Loop unrolling
• Pre-fetching
• Spilling
• etc.
... and let the computer find the optimal code
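The "instrumented kernel" idea above can be sketched in a few lines of plain Python: treat the kernel source as a template, and stamp out one specialized variant per tuning parameter. This is a minimal illustration, not the deck's actual PyCUDA/Cheetah code; the kernel name and parameters are invented for the example.

```python
from string import Template

# Sketch of template metaprogramming: generate specialized source for
# the same (hypothetical) CUDA kernel under different unroll factors.
KERNEL_TMPL = Template("""
__global__ void saxpy_${unroll}(float a, const float *x, float *y, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * ${unroll};
${body}
}
""")

def generate_saxpy(unroll):
    # Emit one explicit statement per unrolled iteration: no loop
    # overhead, and the compiler sees straight-line code.
    body = "\n".join(
        f"    if (i + {u} < n) y[i + {u}] += a * x[i + {u}];"
        for u in range(unroll)
    )
    return KERNEL_TMPL.substitute(unroll=unroll, body=body)

if __name__ == "__main__":
    print(generate_saxpy(4))
```

At tuning time, each generated variant would be compiled and benchmarked; the generator itself stays tiny while covering the whole parameter space.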
113. Basic GPU Meta-programming System
A Case Study
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
121. we are not alone...
“Using GPUs for Signal Correlation: don’t trust compilers”
Daniel A. Mitchell, Michael Clark, Lincoln Greenhill, Paul La Plante (The Murchison Widefield Array)
• Compare these “identical” code fragments:
  a += b*c + d*c + e*f + g*h;   → 770 GFLOPS
vs.
  a += b*c;
  a += d*c;
  a += e*f;
  a += g*h;                     → 20 GFLOPS
122. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• variable-length argument lists
123. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• index un-indexable resources (e.g. regs)
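"Indexing un-indexable resources" is exactly the kind of trick a code generator makes painless: CUDA registers cannot be indexed dynamically (a dynamically indexed local array may spill to slow local memory), but a generator can emit distinct named scalars with the indices resolved at generation time. A hedged, purely illustrative sketch (the emitted identifiers `in`, `w`, `out`, `tid` are hypothetical):

```python
def emit_register_accumulators(k):
    """Emit straight-line code that uses k named scalar 'registers'
    r0..r{k-1} instead of a local array. All indices are constants
    decided at code-generation time, so nothing forces a spill."""
    lines = [f"float r{i} = 0.0f;" for i in range(k)]
    # one statement per 'register' -- the loop lives in Python, not CUDA
    lines += [f"r{i} += in[base + {i}] * w[{i}];" for i in range(k)]
    lines.append("out[tid] = " + " + ".join(f"r{i}" for i in range(k)) + ";")
    return "\n".join(lines)

if __name__ == "__main__":
    print(emit_register_accumulators(3))
```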
125. Basic GPU Meta-programming System
A Case Study
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
126. ... too many optimizations?
[Word cloud: bank conflicts · coalescing · caching · precision · partition camping · loop unrolling · mixed precision · clamping · broadcasting · streams · zero-copy]
128. Exploring design decision space more freely
Meta-programming:
• enables efficient learning of the GPU
hardware/software
• allows full exploitation of the GPU
architecture
131. speed
(in billion floating point operations per second)
Q9450 (Matlab/C)        [2008]    0.3
Q9450 (C/SSE)           [2008]    9.0
7900GTX (OpenGL/Cg)     [2006]   68.2
PS3/Cell (C/ASM)        [2007]  111.4
8800GTX (CUDA1.x)       [2007]  192.7
GTX280 (CUDA2.x)        [2008]  339.3
GTX480/Fermi (CUDA3.x)  [2010]  974.3
>1000X speedup is game changing...
Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
132. Analysis
➡ Different hardware?

Table 33.2 Performance of Auto-Tuned Implementations on Two Hardware
Platforms, Including Performance Tuned on One Platform and Run on the Other

          Optimized for:
Run on:   9400M    GTX480   Tuning Speedup
9400M     0.32s    2.52s    675%
GTX480    0.016s   0.011s   52%

Significant performance gains are observed for the auto-tuned meta-kernels as compared to the reference kernel, which was hand-picked to allow correct execution of all input ranges without running up against hardware limitations.
133. Analysis
➡ Different input configurations

Table 33.3 Performance of Auto-Tuned Implementations on Two Input
Configurations, Including Performance Tuned for One Configuration
and Run with the Other

          Optimized for:
Run on:   Config1   Config2   Tuning Speedup
config1   11.1ms    15.7ms    41%
config2   fails     10.8ms    not comparable

In Table 33.3 we show the effect of tuning on one input configuration and running with the other; again, significant speedups are obtained using kernels tailored to a specific input configuration.
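As a quick sanity check on the "Tuning Speedup" column: it is the extra runtime you pay for running a mistuned kernel, relative to the properly tuned one. Using the first row of Table 33.3:

```python
def tuning_speedup(tuned, mistuned):
    """Relative slowdown avoided by in-place tuning: (mistuned - tuned) / tuned."""
    return (mistuned - tuned) / tuned

# config1 runs in 11.1 ms with its own tuning, 15.7 ms with the
# kernel tuned for config2:
print(f"{tuning_speedup(11.1, 15.7):.0%}")  # -> 41%
```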
137–140. Summary
Meta-programming:
• can assist exploration and manual optimization
• can de-clutter highly-optimized code
• is easy and flexible with the right tools (e.g. Python, PyCUDA/CL, Cheetah, decuda)
➡ helps get drastic speed-ups!
➡ facilitates “auto-tuning”!
146–150. Auto-tuning: two approaches
• Analytical model-based optimization:
  - pros: very generic (dominant in compilers), fast “inference”
  - cons: hard to build, domain expertise required, auto-tuned code far from peak
• Empirical optimization:
  - pros: auto-tuned code close to peak (dominant in specialized libraries e.g. ATLAS, FFTW), easier to build
  - cons: very slow “inference” (for new inputs, etc.)
151. Empirical Auto-Tuning
The goal is to empirically optimize execution
time given both
• the environment
- hardware (GPU, CPU, Memory, Mobo, etc.)
- software (SDK, Compiler suite, etc.)
• the data (input dimensions, repetitions, etc.)
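The empirical loop described above is simple enough to fit on one slide: run every candidate on the real environment and data, keep the fastest. A toy stand-in, with a pure-Python "kernel" in place of compiled GPU variants (the chunked-sum kernel and its `step` parameter are invented for illustration):

```python
import timeit

# Toy empirical auto-tuner: benchmark every candidate configuration on
# the actual environment + data, keep the fastest. On a GPU the
# candidates would be compiled variants of the same templated kernel.

def make_kernel(step):
    def kernel(data):
        total = 0.0
        for i in range(0, len(data), step):   # chunk size = tuning knob
            total += sum(data[i:i + step])
        return total
    return kernel

def autotune(data, steps=(1, 8, 64, 512)):
    best_step, best_time = None, float("inf")
    for step in steps:
        k = make_kernel(step)
        # min over repeats: the least-noisy runtime estimate
        t = min(timeit.repeat(lambda: k(data), number=3, repeat=3))
        if t < best_time:
            best_step, best_time = step, t
    return best_step, best_time

if __name__ == "__main__":
    step, t = autotune(list(range(10_000)))
    print(f"best step={step} ({t:.4f}s)")
```

The cost is exactly the "slow inference" from the previous slide: every new input or platform means re-running all the benchmarks.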
152. Empirical Auto-Tuning with Meta-programming
A Case Study
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
153. Intelligent
and fast
Auto-Tuning
with Machine Learning
155–158. Auto-tuning: best of both approaches?
• Empirically-learned model-based optimization:
  - pros: auto-tuned code close to peak*, easier to build (?), fast “inference” (for new inputs, hardware, etc.)
  - cons: unexplored!
* could be dominant in specialized libraries (e.g. machine learning!)
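The hybrid idea can be sketched in miniature: measure a few configurations for real, fit a regression model on (configuration → runtime), then minimize the model instead of benchmarking everything. A 1-nearest-neighbor regressor stands in here for the boosted regression trees used in the actual work, and `measure` is a made-up synthetic runtime function, not real hardware data:

```python
import random

def measure(config):
    # Hypothetical runtime surface for a kernel: fastest near
    # unroll=8, block=128. Stands in for a real benchmark run.
    unroll, block = config
    return abs(unroll - 8) * 0.5 + abs(block - 128) / 64.0 + 1.0

def fit_1nn(samples):
    # Predict a config's runtime as the runtime of the nearest
    # sampled config (a stand-in for boosted regression trees).
    def predict(config):
        nearest = min(samples, key=lambda s: (s[0][0] - config[0]) ** 2
                      + ((s[0][1] - config[1]) / 32.0) ** 2)
        return nearest[1]
    return predict

random.seed(0)
space = [(u, b) for u in (1, 2, 4, 8, 16) for b in (64, 128, 256)]
train = [(c, measure(c)) for c in random.sample(space, 6)]  # few real timings
model = fit_1nn(train)
best = min(space, key=model)  # cheap "inference": no kernels launched
print("predicted-best config:", best)
```

The payoff is the asymmetry: benchmarking is expensive and happens a handful of times; querying the model over the whole design space is nearly free.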
160. Machine Learning for Predictive Auto-Tuning with Boosted Regression Trees
James Bergstra, Nicolas Pinto, David Cox [submitted]

ABSTRACT
The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as “auto-tuning” has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations; model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically-derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model (“predictive auto-tuning”) can achieve an average of 95% of the speed-up brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.
162–163. NVIDIA GTX 580 (Fermi)
[Scatter plot: GFLOP/s of predictive auto-tuning (y) vs. GFLOP/s of empirical auto-tuning (x), 0–1400, with “equality”, “2x faster”, and “2x slower” guide lines; auto-tuned mean and reference mean marked; > 1.1 TERAFLOP/s!]
ML-based: < 0.1 sec per training problem (the old way: minutes!)
166–171. What else could we do for HPC?
• Minimize failures (exascale supercomputers)
• Minimize mixed-precision errors
• Help better understand hardware features and their complex interactions
• Help design better architectures?
• $$$
• etc.
172. It would be a
win-win-win situation!
(The Office Season 2, Episode 27: Conflict Resolution)