How AlphaGo Works

Presenter: Shane (Seungwhan) Moon
PhD student
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
3/2/2016

How it works

AlphaGo vs European Champion (Fan Hui 2-Dan)
October 5 – 9, 2015
<Official match>
- Time limit: 1 hour
- AlphaGo wins (5:0)

AlphaGo vs World Champion (Lee Sedol 9-Dan)
March 9 – 15, 2016
<Official match>
- Time limit: 2 hours
Venue: Seoul, Four Seasons Hotel
Image source: Josun Times, Jan 28th, 2015

Lee Sedol
Photo source: Maeil Economics 2013/04, wiki

Computer Go AI?

Computer Go AI – Definition
s (state)
d = 1

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

= (e.g. we can represent the board in a matrix-like form)
* The actual model uses other features than board positions as well
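
Below is a minimal sketch (not AlphaGo's actual feature encoding) of what such a matrix-like state s could look like in code; the numpy array and the 0/1/-1 convention are illustrative assumptions.

```python
import numpy as np

# A 19x19 board state s, with 0 = empty, 1 = black, -1 = white (illustrative convention).
board = np.zeros((19, 19), dtype=np.int8)   # the state at d = 1: an almost empty board
board[2, 6] = 1                              # one black stone, as in the matrix above

# The actual model stacks additional feature planes (liberties, ko, move history, etc.)
# on top of the raw stone positions, as noted on the slide.
```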

Computer Go AI – Definition
s (state)
d = 1   d = 2
a (action)
Given s, pick the best a
Computer Go Artificial Intelligence: s → a → s'

Computer Go AI – An Implementation Idea?
d = 1   d = 2   ...
How about simulating all possible board positions?

Computer Go AI – An Implementation Idea?
d = 1   d = 2   d = 3   ...

Computer Go AI – An Implementation Idea?
d = 1   d = 2   d = 3   ...   d = maxD
Process the simulation until the game ends,
then report win / lose results

Computer Go AI – An Implementation Idea?
d = 1   d = 2   d = 3   ...   d = maxD
Process the simulation until the game ends,
then report win / lose results
e.g. it wins 13 times if the next stone gets placed here
37,839 times
431,320 times
Choose the "next action / stone"
that has the most win-counts in the full-scale simulation
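
A sketch of this brute-force idea in code; `legal_moves`, `play`, `is_terminal`, and `black_wins` are hypothetical helpers, and the recursion is exactly what makes the approach intractable.

```python
def count_wins(state, to_play):
    """Exhaustively play out every continuation and count terminal wins for black."""
    if is_terminal(state):
        return (1, 1) if black_wins(state) else (0, 1)   # (wins, games simulated)
    wins, games = 0, 0
    for move in legal_moves(state, to_play):
        w, g = count_wins(play(state, move, to_play), -to_play)
        wins, games = wins + w, games + g
    return wins, games

def best_next_stone(state):
    # Choose the next stone with the most win-counts in the full-scale simulation.
    return max(legal_moves(state, +1),
               key=lambda m: count_wins(play(state, m, +1), -1)[0])
```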

This is NOT possible; it is said that the possible configurations of the board exceed the number of atoms in the universe
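
A quick back-of-envelope check of this claim (3 states per intersection gives a loose upper bound on configurations; ~10^80 is a common estimate for atoms in the observable universe):

```python
upper_bound_configs = 3 ** (19 * 19)   # ~1.7e172; most of these are not even legal positions
atoms_in_universe = 10 ** 80           # rough order-of-magnitude estimate
print(upper_bound_configs > atoms_in_universe)   # True, by roughly 92 orders of magnitude
```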

Key: To Reduce Search Space

Reducing Search Space
1. Reducing "action candidates" (Breadth Reduction)
d = 1   d = 2   d = 3   ...   d = maxD
Win? Loss?
IF there is a model that can tell you that these moves
are not common / probable (e.g. by experts, etc.) ...

Reducing Search Space
1. Reducing "action candidates" (Breadth Reduction)
d = 1   d = 2   d = 3   ...   d = maxD
Win? Loss?
Remove these from search candidates in advance (breadth reduction)

Reducing Search Space
2. Position evaluation ahead of time (Depth Reduction)
d = 1   d = 2   d = 3   ...   d = maxD
Win? Loss?
Instead of simulating until the maximum depth ...

Reducing Search Space
2. Position evaluation ahead of time (Depth Reduction)
d = 1   d = 2   d = 3   ...
V = 1
V = 2
V = 10
IF there is a function that can measure:
V(s): "board evaluation of state s"

Reducing Search Space
1. Reducing "action candidates" (Breadth Reduction)
2. Position evaluation ahead of time (Depth Reduction)
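
A sketch of how the two reductions could plug into a plain look-ahead search; `policy`, `value`, `legal_moves`, and `play` are hypothetical stand-ins for the two ideas above and the game rules.

```python
def reduced_search(state, depth, to_play, n_candidates=5, max_depth=3):
    if depth == max_depth:
        return value(state)   # depth reduction: V(s) replaces simulating to the end of the game
    # Breadth reduction: only keep the few moves the policy considers most probable.
    candidates = sorted(legal_moves(state, to_play),
                        key=lambda a: policy(state)[a], reverse=True)[:n_candidates]
    scores = [reduced_search(play(state, a, to_play), depth + 1, -to_play)
              for a in candidates]
    return max(scores) if to_play == +1 else min(scores)
```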

1. Reducing "action candidates"
Learning: P ( next action | current state ) = P ( a | s )

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Current State → Prediction Model → Next State
s1 → s2
s2 → s3
s3 → s4
Data: Online Go experts (5~9 dan)
160K games, 30M board positions
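
A sketch of how those 160K game records could be flattened into ~30M supervised examples; `games` and `replay` (which steps through a record and yields each board together with the expert's next move) are hypothetical.

```python
def make_training_pairs(games):
    pairs = []
    for game in games:                  # each record is one online expert game (5~9 dan)
        for s_t, a_t in replay(game):   # board before the move, move the expert played
            pairs.append((s_t, a_t))    # supervised target: imitate the expert's action
    return pairs
```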

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Prediction Model
Current Board → Next Board

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Prediction Model
Current Board → Next Action
There are 19 x 19 = 361 possible actions
(with different probabilities)

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Prediction Model
f: s → a
Current Board (s):
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  0  0
 0 -1  0  0  1 -1  1  0  0
 0  1  0  0  1 -1  0  0  0
 0  0  0  0 -1  0  0  0  0
 0  0  0  0  0  0  0  0  0
 0 -1  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0
Next Action (a):
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  0  0
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Prediction Model
g: s → p(a|s), then a = argmax p(a|s)
Current Board (s): [same example board as above]
p(a|s):
 0  0  0  0  0  0    0    0  0
 0  0  0  0  0  0    0    0  0
 0  0  0  0  0  0    0    0  0
 0  0  0  0  0  0.2  0.1  0  0
 0  0  0  0  0  0.4  0.2  0  0
 0  0  0  0  0  0.1  0    0  0
 0  0  0  0  0  0    0    0  0
 0  0  0  0  0  0    0    0  0
Next Action (a): [the argmax of p(a|s)]

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Prediction Model
g: s → p(a|s), then a = argmax p(a|s)
Current Board → Next Action [same example boards as above]

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Deep Learning (13-Layer CNN)
g: s → p(a|s), then a = argmax p(a|s)
Current Board → Next Action [same example boards and p(a|s) as above]
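
A rough sketch of what a 13-layer convolutional policy network could look like in PyTorch; the input-plane and filter counts here are illustrative choices, not a faithful reproduction of the paper's architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):  # 11 more hidden convolution layers
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]  # 13th layer: one score per point
        self.net = nn.Sequential(*layers)

    def forward(self, s):                    # s: (batch, in_planes, 19, 19) feature planes
        logits = self.net(s).flatten(1)      # (batch, 361)
        return torch.softmax(logits, dim=1)  # p(a|s) over all 19 x 19 = 361 points

p = PolicyNet()(torch.zeros(1, 48, 19, 19))  # example: p.shape == (1, 361)
```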

Convolutional Neural Network (CNN)
A CNN is a powerful model for image recognition tasks; it abstracts out the input image through convolution layers
Image source

Convolutional Neural Network (CNN)
They use a CNN model with a similar architecture to evaluate the board position; it learns "some" spatial invariance

Go: abstraction is the key to winning
CNN: abstraction is its forte

1. Reducing "action candidates"
(1) Imitating expert moves (supervised learning)
Training:
Expert Moves Imitator Model (w/ CNN)
Current Board → Next Action
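
A minimal sketch of this supervised training step, assuming the PolicyNet sketch above and a hypothetical `loader` yielding batches of (board features, expert move index):

```python
import torch

model = PolicyNet()
opt = torch.optim.SGD(model.parameters(), lr=0.003)

for boards, expert_moves in loader:          # expert_moves: indices in 0..360
    p = model(boards)                        # p(a|s), shape (batch, 361)
    # Maximize the probability of the move the expert actually played.
    loss = -torch.log(p[torch.arange(len(expert_moves)), expert_moves]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```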

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Expert Moves Imitator Model (w/ CNN)  VS  Expert Moves Imitator Model (w/ CNN)
Improving by playing against itself

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Expert Moves Imitator Model (w/ CNN)  VS  Expert Moves Imitator Model (w/ CNN)
Return: board positions, win/lose info

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Training:
Expert Moves Imitator Model (w/ CNN)
Board position → win/loss
Loss: z = -1

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Training:
Expert Moves Imitator Model (w/ CNN)
Board position → win/loss
Win: z = +1
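
A sketch of this self-play update: every position from a finished game is nudged with z = +1 or z = -1, a REINFORCE-style policy-gradient step; `game_history` is hypothetical, and `model` / `opt` follow the supervised sketch above.

```python
for board, move, z in game_history:    # z = +1 if this player won the game, -1 if it lost
    p = model(board.unsqueeze(0))      # p(a|s) for the position, batch of one
    loss = -z * torch.log(p[0, move])  # raise p(move|s) after wins, lower it after losses
    opt.zero_grad()
    loss.backward()
    opt.step()
```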

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Updated Model ver 1.1  VS  Updated Model ver 1.3
Return: board positions, win/lose info
It uses the same topology as the expert moves imitator model, just with updated parameters
Older models vs. newer models

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Updated Model ver 1.3  VS  Updated Model ver 1.7
Return: board positions, win/lose info

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Updated Model ver 1.5  VS  Updated Model ver 2.0
Return: board positions, win/lose info

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Updated Model ver 3204.1  VS  Updated Model ver 46235.2
Return: board positions, win/lose info

1. Reducing "action candidates"
(2) Improving through self-play (reinforcement learning)
Expert Moves Imitator Model  VS  Updated Model ver 1,000,000
The final model wins 80% of the time when playing against the first model
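
A sketch of this self-play schedule: the paper notes the current network plays against a randomly selected earlier version (rather than only its latest self) to stabilize training; `play_self_play_games` and `reinforce_update` are hypothetical wrappers around the pieces sketched above.

```python
import copy, random

pool = [copy.deepcopy(model)]                        # ver 1.0: the expert-moves imitator
for it in range(1_000_000):
    opponent = random.choice(pool)                   # an older model vs. the newest model
    history = play_self_play_games(model, opponent)  # yields (board, move, z) triples
    reinforce_update(model, history)                 # the REINFORCE step sketched above
    if it % 500 == 0:
        pool.append(copy.deepcopy(model))            # checkpoint the updated model into the pool
```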

2. Board Evaluation

2. Board Evaluation
Training:
Updated Model ver 1,000,000
Board Position → Win / Loss
Value Prediction Model (Regression): Win (0~1)
Adds a regression layer to the model
Predicts values between 0~1
Close to 1: a good board position
Close to 0: a bad board position
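
A sketch of a value network with a regression head, in the same PyTorch style as the policy sketch above; the trunk here is much shallower than the real one and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, filters, 3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(filters * 19 * 19, 256),
                                  nn.ReLU(), nn.Linear(256, 1))

    def forward(self, s):
        return torch.sigmoid(self.head(self.trunk(s)))  # V(s) in (0, 1)

# Regression against observed outcomes (1 = win, 0 = loss) from self-play games:
# loss = torch.nn.functional.mse_loss(value_net(boards), outcomes)
```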

Reducing Search Space
1. Reducing "action candidates" (Breadth Reduction) → Policy Network
2. Board Evaluation (Depth Reduction) → Value Network

Looking ahead (w/ Monte Carlo Tree Search)
Action Candidates Reduction (Policy Network)
Board Evaluation (Value Network)
Rollout: a faster version of estimating p(a|s)
→ uses shallow networks (3 ms → 2 µs)

Results
Elo rating system
Performance with different combinations of AlphaGo components

Takeaways
Reuse networks trained for one task (with different loss objectives) for several other tasks

Lee Sedol 9-dan vs AlphaGo

Lee Sedol 9-dan vs AlphaGo
Energy Consumption
Lee Sedol:
- Recommended calories for a man per day: ~2,500 kCal
- Assumption: Lee consumes the entire amount of per-day calories in this one game
2,500 kCal * 4,184 J/kCal ~= 10M [J]
AlphaGo:
- Assumption: CPU: ~100 W, GPU: ~300 W
- 1,202 CPUs, 176 GPUs
170,000 J/sec * 5 hr * 3,600 sec/hr ~= 3,000M [J]
A very, very rough calculation ;)
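
The same back-of-envelope arithmetic, spelled out in code (all inputs are the slide's own rough assumptions):

```python
lee_joules = 2_500 * 4_184          # 2,500 kCal/day * 4,184 J/kCal ≈ 10 MJ

watts = 1_202 * 100 + 176 * 300     # 1,202 CPUs at ~100 W + 176 GPUs at ~300 W ≈ 173 kW
alphago_joules = watts * 5 * 3_600  # over a 5-hour game ≈ 3,000 MJ (slide rounds to 170,000 J/s)

print(f"{lee_joules:.1e} J vs {alphago_joules:.1e} J")   # ~1.0e7 J vs ~3.1e9 J
```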

AlphaGo is estimated to be around ~5-dan
= multiple machines
European champion

Taking CPU / GPU resources to virtually infinity?
But Google has promised not to use more CPUs/GPUs for the game with Lee than they used against Fan Hui
No one knows how it will converge

AlphaGo learns millions of Go games every day
AlphaGo will presumably converge to some point eventually.
However, the Nature paper does not report how AlphaGo's performance improves
as a function of the number of games it plays against itself (self-play).

What if AlphaGo learns Lee's game strategy?
Google said they won't use Lee's game plays as AlphaGo's training data.
Even if it did, it would not be easy to modify a model trained over millions of
data points with just a few games played against Lee
(prone to over-fitting, etc.)
AlphaGo’s Weakness?

AlphaGo – How It Works
Presenter: Shane (Seungwhan) Moon
PhD student
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
me@shanemoon.com
3/2/2016

Reference
• Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
