SlideShare une entreprise Scribd logo
1  sur  46
Télécharger pour lire hors ligne
Human-­‐in-­‐a-­‐loop:	
  a	
  design	
  pattern	
  for	
  managing	
  teams	
  that	
  leverage	
  ML
Paco	
  Nathan	
  	
  @pacoid	
  
Director,	
  Learning	
  Group	
  @	
  O’Reilly	
  Media	
  
Strata	
  SG,	
  Singapore	
  	
  2017-­‐12-­‐06
Framing
Imagine	
  having	
  a	
  mostly-­‐automated	
  system	
  where	
  

people	
  and	
  machines	
  collaborate	
  together…	
  
May	
  sound	
  a	
  bit	
  Sci-­‐Fi,	
  though	
  arguably	
  commonplace.	
  

One	
  challenge	
  is	
  whether	
  we	
  can	
  advance	
  beyond	
  just	
  
handling	
  rote	
  tasks.	
  	
  
Instead	
  of	
  simply	
  running	
  code	
  libraries,	
  can	
  machines	
  

make	
  difficult	
  decisions,	
  exercise	
  judgement	
  in	
  complex	
  
situations?	
  	
  
Can	
  we	
  build	
  systems	
  in	
  which	
  people	
  who	
  aren’t	
  

AI	
  experts	
  can	
  “teach”	
  machines	
  to	
  perform	
  complex	
  

work	
  –	
  based	
  on	
  examples,	
  not	
  code?
Research	
  questions
▪ How	
  do	
  we	
  personalize	
  learning	
  experiences,	
  across	
  

ebooks,	
  videos,	
  conferences,	
  computable	
  content,	
  

live	
  online	
  courses,	
  case	
  studies,	
  expert	
  AMAs,	
  etc.	
  
▪ How	
  do	
  we	
  help	
  experts	
  (by	
  definition,	
  really	
  busy	
  

people)	
  share	
  their	
  knowledge	
  with	
  peers	
  in	
  industry?	
  
▪ How	
  do	
  we	
  manage	
  the	
  role	
  of	
  editors	
  at	
  human	
  scale,	
  

while	
  technology	
  and	
  delivery	
  media	
  evolve	
  rapidly?	
  
▪ How	
  do	
  we	
  help	
  organizations	
  learn	
  and	
  transform	
  
continuously?
3
UX	
  for	
  content	
  discovery:	
  
▪ partly	
  generated	
  +	
  curated	
  by	
  people	
  
▪ partly	
  generated	
  +	
  curated	
  by	
  AI	
  apps
AI	
  in	
  Media
▪ content	
  which	
  can	
  represented	
  as	
  

text	
  can	
  be	
  parsed	
  by	
  NLP,	
  then	
  
manipulated	
  by	
  available	
  AI	
  tooling	
  	
  
▪ labeled	
  images	
  get	
  really	
  interesting	
  
▪ assumption:	
  text	
  or	
  images	
  –	
  within	
  

a	
  context	
  –	
  have	
  inherent	
  structure	
  
▪ representation	
  of	
  that	
  kind	
  of	
  structure	
  
is	
  rare	
  in	
  the	
  Media	
  vertical	
  –	
  so	
  far
6
{"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49],
"take", "take", "VB", 1, 50], [0, "a", "a", "DT", 0, 51], [23, "look", "l
"NN", 1, 52], [0, "at", "at", "IN", 0, 53], [0, "a", "a", "DT", 0, 54], [
"few", "few", "JJ", 1, 55], [25, "examples", "example", "NNS", 1, 56], [0
"often", "often", "RB", 0, 57], [0, "when", "when", "WRB", 0, 58], [11,
"people", "people", "NNS", 1, 59], [2, "are", "be", "VBP", 1, 60], [26, "
"first", "JJ", 1, 61], [27, "learning", "learn", "VBG", 1, 62], [0, "abou
"about", "IN", 0, 63], [28, "Docker", "docker", "NNP", 1, 64], [0, "they"
"they", "PRP", 0, 65], [29, "try", "try", "VBP", 1, 66], [0, "and", "and"
0, 67], [30, "put", "put", "VBP", 1, 68], [0, "it", "it", "PRP", 0, 69],
"in", "in", "IN", 0, 70], [0, "one", "one", "CD", 0, 71], [0, "of", "of",
0, 72], [0, "a", "a", "DT", 0, 73], [24, "few", "few", "JJ", 1, 74], [31,
"existing", "existing", "JJ", 1, 75], [18, "categories", "category", "NNS
76], [0, "sometimes", "sometimes", "RB", 0, 77], [11, "people", "people",
1, 78], [9, "think", "think", "VBP", 1, 79], [0, "it", "it", "PRP", 0, 80
"'s", "be", "VBZ", 1, 81], [0, "a", "a", "DT", 0, 82], [32, "virtualizati
"virtualization", "NN", 1, 83], [19, "tool", "tool", "NN", 1, 84], [0, "l
"like", "IN", 0, 85], [33, "VMware", "vmware", "NNP", 1, 86], [0, "or", "
"CC", 0, 87], [34, "virtualbox", "virtualbox", "NNP", 1, 88], [0, "also",
"also", "RB", 0, 89], [35, "known", "know", "VBN", 1, 90], [0, "as", "as"
0, 91], [0, "a", "a", "DT", 0, 92], [36, "hypervisor", "hypervisor", "NN"
93], [0, "these", "these", "DT", 0, 94], [2, "are", "be", "VBP", 1, 95],
"tools", "tool", "NNS", 1, 96], [0, "which", "which", "WDT", 0, 97], [2,
"be", "VBP", 1, 98], [37, "emulating", "emulate", "VBG", 1, 99], [38,
"hardware", "hardware", "NN", 1, 100], [0, "for", "for", "IN", 0, 101], [
"virtual", "virtual", "JJ", 1, 102], [40, "software", "software", "NN", 1
103]], "id": "001.video197359", "sha1":
"4b69cf60f0497887e3776619b922514f2e5b70a8"}
AI	
  in	
  Media
7
{"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"}
{"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"}
{"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"}
{"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"}
{"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"}
{"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"}
{"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"}
{"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"}
{"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"}
Transcript: let's take a look at a few examples often when
people are first learning about Docker they try and put it in
one of a few existing categories sometimes people think it's
a virtualization tool like VMware or virtualbox also known as
a hypervisor these are tools which are emulating hardware for
virtual software
Confidence: 0.973419129848
39 KUBERNETES
0.8747 coreos
0.8624 etcd
0.8478 DOCKER CONTAINERS
0.8458 mesos
0.8406 DOCKER
0.8354 DOCKER CONTAINER
0.8260 KUBERNETES CLUSTER
0.8258 docker image
0.8252 EC2
0.8210 docker hub
0.8138 OPENSTACK
orm:Docker a orm:Vendor;
a orm:Container;
a orm:Open_Source;
a orm:Commercial_Software;
owl:sameAs dbr:Docker_%28software%29;
skos:prefLabel "Docker"@en;
Knowledge	
  Graph
▪ used	
  to	
  construct	
  an	
  ontology	
  about	
  
technology,	
  based	
  on	
  learning	
  
materials	
  from	
  200+	
  publishers	
  
▪ uses	
  SKOS	
  as	
  a	
  foundation,	
  ties	
  into	
  

US	
  Library	
  of	
  Congress	
  and	
  DBpedia	
  

as	
  upper	
  ontologies	
  
▪ primary	
  structure	
  is	
  “human	
  scale”,	
  

used	
  as	
  control	
  points	
  
▪ majority	
  (>90%)	
  of	
  the	
  graph	
  

comes	
  from	
  machine	
  generated	
  

data	
  products
8
AI	
  is	
  real,	
  but	
  why	
  now?
▪ Big	
  Data:	
  machine	
  data	
  (1997-­‐ish)	
  
▪ Big	
  Compute:	
  cloud	
  computing	
  (2006-­‐ish)	
  
▪ Big	
  Models:	
  deep	
  learning	
  (2009-­‐ish)	
  
The	
  confluence	
  of	
  three	
  factors	
  created	
  a	
  business	
  

environment	
  where	
  AI	
  could	
  become	
  mainstream	
  
What	
  else	
  is	
  needed?
9
Background:	
  

helping	
  machines	
  learn
Machine	
  learning
supervised	
  ML:	
  
▪ take	
  a	
  dataset	
  where	
  each	
  element	
  
has	
  a	
  label	
  
▪ train	
  models	
  on	
  a	
  portion	
  of	
  the	
  
data	
  to	
  predict	
  the	
  labels,	
  then	
  

evaluate	
  on	
  the	
  holdout	
  
▪ deep	
  learning	
  is	
  a	
  popular	
  example,	
  

but	
  only	
  if	
  you	
  have	
  lots	
  of	
  labeled	
  
training	
  data	
  available
Machine	
  learning
unsupervised	
  ML:	
  
▪ run	
  lots	
  of	
  unlabeled	
  data	
  through	
  
an	
  algorithm	
  to	
  detect	
  “structure”	
  
or	
  embedding	
  
▪ for	
  example,	
  clustering	
  algorithms	
  
such	
  as	
  K-­‐means	
  
▪ unsupervised	
  approaches	
  for	
  AI	
  

are	
  an	
  open	
  research	
  question
Active	
  learning
special	
  case	
  of	
  semi-­‐supervised	
  ML:	
  
▪ send	
  difficult	
  decisions/edge	
  cases	
  

to	
  experts;	
  let	
  algorithms	
  handle	
  
routine	
  decisions	
  (automation)	
  
▪ works	
  well	
  in	
  use	
  cases	
  which	
  have	
  
lots	
  of	
  inexpensive,	
  unlabeled	
  data	
  
▪ e.g.,	
  abundance	
  of	
  content	
  to	
  be	
  
classified,	
  where	
  the	
  cost	
  of	
  
labeling	
  is	
  the	
  expense
The	
  reality	
  of	
  data	
  rates
“If	
  you	
  only	
  have	
  10	
  examples	
  of	
  something,	
  it’s	
  going

	
  	
  to	
  be	
  hard	
  to	
  make	
  deep	
  learning	
  work.	
  If	
  you	
  have

	
  	
  100,000	
  things	
  you	
  care	
  about,	
  records	
  or	
  whatever,

	
  	
  that’s	
  the	
  kind	
  of	
  scale	
  where	
  you	
  should	
  really	
  start

	
  	
  thinking	
  about	
  these	
  kinds	
  of	
  techniques.”	
  
Jeff	
  Dean	
  	
  Google

VB	
  Summit	
  2017-­‐10-­‐23	
  
venturebeat.com/2017/10/23/google-­‐brain-­‐chief-­‐says-­‐100000-­‐
examples-­‐is-­‐enough-­‐data-­‐for-­‐deep-­‐learning/
The	
  reality	
  of	
  data	
  rates
Use	
  cases	
  for	
  deep	
  learning	
  must	
  have	
  large,	
  carefully	
  
labeled	
  data	
  sets,	
  while	
  reinforcement	
  learning	
  needs	
  
much	
  more	
  data	
  than	
  that.	
  
Active	
  learning	
  can	
  yield	
  good	
  results	
  with	
  substantially	
  
smaller	
  data	
  rates,	
  while	
  leveraging	
  an	
  organization’s	
  
expertise	
  to	
  bootstrap	
  toward	
  larger	
  labeled	
  data	
  sets,	
  
e.g.,	
  as	
  preparation	
  for	
  deep	
  learning,	
  etc.
reinforcement
learning
supervised
learning
active
learning
deep
learning
data rates
(log scale)
Case	
  studies:	
  

practices	
  in	
  industry
On-­‐demand	
  humans
17
Active	
  learning
Real-­‐World	
  Active	
  Learning:	
  Applications	
  and	
  
Strategies	
  for	
  Human-­‐in-­‐the-­‐Loop	
  Machine	
  Learning

radar.oreilly.com/2015/02/human-­‐in-­‐the-­‐loop-­‐
machine-­‐learning.html

Ted	
  Cuzzillo

O’Reilly	
  Media,	
  2015-­‐02-­‐05	
  
Develop	
  a	
  policy	
  for	
  how	
  human	
  experts	
  select	
  exemplars:	
  
▪ bias	
  toward	
  labels	
  most	
  likely	
  to	
  influence	
  the	
  classifier	
  
▪ bias	
  toward	
  ensemble	
  disagreement	
  
▪ bias	
  toward	
  denser	
  regions	
  of	
  training	
  data
18
Active	
  learning
Active	
  learning	
  and	
  transfer	
  learning

safaribooksonline.com/library/view/oreilly-­‐
artificial-­‐intelligence/9781491985250/
video314919.html

Luke	
  Biewald	
  	
  CrowdFlower

The	
  AI	
  Conf,	
  2017-­‐09-­‐17	
  
breakthroughs	
  lag	
  algorithm	
  invention,	
  waiting	
  for	
  
“killer	
  data	
  set”	
  to	
  emerge,	
  often	
  decade+
19
Design	
  pattern:	
  Human-­‐in-­‐the-­‐loop
Building	
  a	
  business	
  that	
  combines	
  human	
  experts	
  
and	
  data	
  science

oreilly.com/ideas/building-­‐a-­‐business-­‐that-­‐
combines-­‐human-­‐experts-­‐and-­‐data-­‐science-­‐2

Eric	
  Colson	
  	
  StitchFix

O’Reilly	
  Data	
  Show,	
  2016-­‐01-­‐28	
  
“what	
  machines	
  can’t	
  do	
  are	
  things	
  around	
  cognition,

	
  	
  things	
  that	
  have	
  to	
  do	
  with	
  ambient	
  information,	
  or

	
  	
  appreciation	
  of	
  aesthetics,	
  or	
  even	
  the	
  ability	
  to

	
  	
  relate	
  to	
  another	
  human”



20
Design	
  pattern:	
  Human-­‐in-­‐the-­‐loop
Unsupervised	
  fuzzy	
  labeling	
  using	
  deep	
  
learning	
  to	
  improve	
  anomaly	
  detection

conferences.oreilly.com/strata/strata-­‐
sg/public/schedule/detail/63304

Adam	
  Gibson	
  	
  Skymind

Strata	
  SG,	
  2017-­‐12-­‐07	
  
large-­‐scale	
  use	
  case	
  for	
  telecom	
  in	
  Asia	
  
method:	
  overfit	
  variational	
  autoencoders,	
  
then	
  send	
  outliers	
  to	
  human	
  analysts
21
4:15pm	
  Thu	
  7	
  Dec

Summit	
  2
Design	
  pattern:	
  Human-­‐in-­‐the-­‐loop
EY,	
  Deloitte	
  And	
  PwC	
  Embrace	
  Artificial	
  
Intelligence	
  For	
  Tax	
  And	
  Accounting

forbes.com/sites/adelynzhou/2017/11/14/ey-­‐
deloitte-­‐and-­‐pwc-­‐embrace-­‐artificial-­‐intelligence-­‐
for-­‐tax-­‐and-­‐accounting

Adelyn	
  Zhou

Forbes,	
  2017-­‐11-­‐14	
  
compliance	
  use	
  case:	
  review	
  lease	
  accounting	
  standards	
  
3x	
  more	
  consistent,	
  2x	
  efficient	
  as	
  previous	
  humans-­‐
only	
  teams	
  
break-­‐even	
  ROI	
  within	
  less	
  than	
  a	
  year
22
Design	
  pattern:	
  Human-­‐in-­‐the-­‐loop
Strategies	
  for	
  integrating	
  people	
  and	
  machine	
  
learning	
  in	
  online	
  systems

safaribooksonline.com/library/view/oreilly-­‐
artificial-­‐intelligence/9781491976289/
video311857.html

Jason	
  Laska	
  	
  Clara	
  Labs

The	
  AI	
  Conf,	
  2017-­‐06-­‐29	
  
how	
  to	
  create	
  a	
  two-­‐sided	
  marketplace	
  where	
  machines	
  
and	
  people	
  compete	
  on	
  a	
  spectrum	
  of	
  relative	
  expertise	
  
and	
  capabilities



23
Design	
  pattern:	
  Human-­‐in-­‐the-­‐loop
Building	
  human-­‐assisted	
  AI	
  applications

oreilly.com/ideas/building-­‐human-­‐
assisted-­‐ai-­‐applications

Adam	
  Marcus	
  	
  B12

O’Reilly	
  Data	
  Show,	
  2016-­‐08-­‐25	
  
Orchestra:	
  a	
  platform	
  for	
  building	
  human-­‐
assisted	
  AI	
  applications,	
  e.g.,	
  to	
  create	
  
business	
  websites

https://github.com/b12io/orchestra	
  
example	
  http://www.coloradopicked.com/
24
Design	
  pattern:	
  Flash	
  teams
Expert	
  Crowdsourcing	
  with	
  Flash	
  Teams

hci.stanford.edu/publications/2014/
flashteams/flashteams-­‐uist2014.pdf

Daniela	
  Retelny,	
  et	
  al.	
  

Stanford	
  HCI	
  
“A	
  flash	
  team	
  is	
  a	
  linked	
  set	
  of	
  modular	
  tasks	
  

	
  	
  that	
  draw	
  upon	
  paid	
  experts	
  from	
  the	
  crowd,	
  

	
  	
  often	
  three	
  to	
  six	
  at	
  a	
  time,	
  on	
  demand”	
  
http://stanfordhci.github.io/flash-­‐teams/
25
Weak	
  supervision	
  /	
  Data	
  programming
Creating	
  large	
  training	
  data	
  sets	
  quickly

oreilly.com/ideas/creating-­‐large-­‐training-­‐
data-­‐sets-­‐quickly

Alex	
  Ratner	
  	
  Stanford

O’Reilly	
  Data	
  Show,	
  2017-­‐06-­‐08	
  
Snorkel:	
  “weak	
  supervision”	
  and	
  “data	
  
programming”	
  as	
  another	
  instance	
  of	
  

human-­‐in-­‐the-­‐loop

github.com/HazyResearch/snorkel	
  
conferences.oreilly.com/strata/strata-­‐ny/public/
schedule/detail/61849
26
Prodigy	
  by	
  Explosion.ai
https://explosion.ai/blog/prodigy-­‐
annotation-­‐tool-­‐active-­‐learning
27
Problem:	
  
disambiguating	
  contexts
Disambiguating	
  contexts
Overlapping	
  contexts	
  pose	
  hard	
  problems	
  in	
  natural	
  language	
  understanding.	
  
That	
  runs	
  counter	
  to	
  the	
  correlation	
  emphasis	
  of	
  big	
  data.

NLP	
  libraries	
  lack	
  features	
  for	
  disambiguation.
Disambiguating	
  contexts
30
Suppose	
  someone	
  publishes	
  a	
  book	
  which	
  uses	
  the	
  term	
  
`IOS`:	
  are	
  they	
  talking	
  about	
  an	
  operating	
  system	
  for	
  an	
  
Apple	
  iPhone,	
  or	
  about	
  an	
  operating	
  system	
  for	
  a	
  Cisco	
  
router?	
  	
  
We	
  handle	
  lots	
  of	
  content	
  about	
  both.	
  Disambiguating	
  those	
  
contexts	
  is	
  important	
  for	
  good	
  UX	
  in	
  personalized	
  learning.	
  
In	
  other	
  words,	
  how	
  do	
  machines	
  help	
  people	
  

distinguish	
  that	
  content	
  within	
  search?	
  
Potentially	
  a	
  good	
  case	
  for	
  deep	
  learning,	
  

except	
  for	
  the	
  lack	
  of	
  labeled	
  data	
  at	
  scale.
Active	
  learning	
  through	
  Jupyter
31
Jupyter	
  notebooks	
  are	
  used	
  to	
  manage	
  ML	
  

pipelines	
  for	
  disambiguation,	
  where	
  machines	
  

and	
  people	
  collaborate:	
  
▪ ML	
  based	
  on	
  examples	
  –	
  most	
  all	
  of	
  the	
  feature	
  
engineering,	
  model	
  parameters,	
  etc.,	
  has	
  been	
  
automated	
  
▪ https://github.com/ceteri/nbtransom	
  
▪ based	
  on	
  use	
  of	
  nbformat,	
  pandas,	
  scikit-­‐learn
Active	
  learning	
  through	
  Jupyter
32
Jupyter	
  notebooks	
  are	
  used	
  to	
  manage	
  ML	
  
pipelines
and	
  people	
  collaborate:	
  
▪ ML	
  based	
  on	
  examples	
  –	
  most	
  all	
  of	
  the	
  feature	
  
engineering,	
  model	
  parameters,	
  etc.,	
  has	
  been	
  
automated	
  
▪ https://github.com/ceteri/nbtransom
▪ based	
  on	
  use	
  of	
  
Jupyter	
  notebook	
  as…	
  
▪ one	
  part	
  configuration	
  file	
  
▪ one	
  part	
  data	
  sample	
  
▪ one	
  part	
  structured	
  log	
  
▪ one	
  part	
  data	
  visualization	
  tool	
  
plus,	
  subsequent	
  data	
  mining	
  of	
  these	
  

notebooks	
  helps	
  augment	
  our	
  ontology
Active	
  learning	
  through	
  Jupyter
33
ML#Pipelines
Jupyter#kernel
Browser
SSH#tunnel
Active	
  learning	
  through	
  Jupyter
▪ Notebooks	
  allow	
  the	
  human	
  experts	
  to	
  access	
  the	
  
internals	
  of	
  a	
  mostly	
  automated	
  ML	
  pipeline,	
  rapidly	
  
▪ Stated	
  another	
  way,	
  both	
  the	
  machines	
  and	
  the	
  people	
  
become	
  collaborators	
  on	
  shared	
  documents	
  
▪ Anticipates	
  upcoming	
  collaborative	
  document	
  features	
  
in	
  JupyterLab
Active	
  learning	
  through	
  Jupyter
1. Experts	
  use	
  notebooks	
  to	
  provide	
  examples	
  of	
  book	
  chapters,	
  video	
  
segments,	
  etc.,	
  for	
  each	
  key	
  phrase	
  that	
  has	
  overlapping	
  contexts	
  
2. Machines	
  build	
  ensemble	
  ML	
  models	
  based	
  on	
  those	
  examples,	
  
updating	
  notebooks	
  with	
  model	
  evaluation	
  
3. Machines	
  attempt	
  to	
  annotate	
  labels	
  for	
  millions	
  of	
  pieces	
  of	
  content,	
  

e.g.,	
  `AlphaGo`,	
  `Golang`,	
  versus	
  a	
  mundane	
  use	
  of	
  the	
  verb	
  `go`	
  
4. Disambiguation	
  can	
  run	
  mostly	
  automated,	
  in	
  parallel	
  at	
  scale	
  –	
  

through	
  integration	
  with	
  Apache	
  Spark	
  
5. In	
  cases	
  where	
  ensembles	
  disagree,	
  ML	
  pipelines	
  defer	
  to	
  human	
  
experts	
  who	
  make	
  judgement	
  calls,	
  providing	
  further	
  examples	
  
6. New	
  examples	
  go	
  into	
  training	
  ML	
  pipelines	
  to	
  build	
  better	
  models	
  
7. Rinse,	
  lather,	
  repeat
Nuances
▪ No	
  Free	
  Lunch	
  theorem:	
  it	
  is	
  better	
  to	
  err	
  on	
  the	
  
side	
  of	
  less	
  false	
  positives	
  /	
  more	
  false	
  negatives	
  
in	
  use	
  cases	
  about	
  learning	
  materials	
  
▪ Employ	
  a	
  bias	
  toward	
  exemplars	
  policy,	
  i.e.,	
  those	
  
most	
  likely	
  to	
  influence	
  the	
  classifier	
  
▪ Potentially,	
  “AI	
  experts”	
  may	
  be	
  Customer	
  Service	
  
staff	
  who	
  review	
  edge	
  cases	
  within	
  search	
  results	
  
or	
  recommended	
  content	
  –	
  as	
  an	
  integral	
  part	
  of	
  
our	
  UX	
  –	
  then	
  re-­‐train	
  the	
  ML	
  pipelines	
  through	
  
examples	
  
Summary:	
  
how	
  this	
  matters
Management	
  strategy	
  –	
  before
Generally	
  with	
  Big	
  Data,	
  we	
  are	
  considering:	
  
▪ DAG	
  workflow	
  execution	
  –	
  which	
  is	
  linear	
  
▪ data-­‐driven	
  organizations	
  
▪ ML	
  based	
  on	
  optimizing	
  for	
  

objective	
  functions	
  
▪ questions	
  of	
  correlation	
  

versus	
  causation	
  
▪ avoiding	
  “garbage	
  in,	
  garbage	
  out”
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
38
Management	
  strategy	
  –	
  after
HITL	
  introduces	
  circularities:	
  
▪ aka,	
  second-­‐order	
  cybernetics	
  
▪ leverage	
  feedback	
  loops	
  

as	
  conversations	
  
▪ focus	
  on	
  human	
  scale,	
  

design	
  thinking	
  
▪ people	
  and	
  machines	
  

work	
  together	
  on	
  teams	
  
▪ budget	
  experts’	
  time	
  on	
  

handling	
  the	
  exceptions
AI team
content
ontology
ML models attempt
to label the data
automatically
Expert judgement
about edge cases,
provides examples
ML models trained
using examples
Expert decisions
to extend vocabulary
ML models
have consensus,
confidence
labels
39
Essential	
  takeaway	
  idea:	
  
Depending	
  on	
  the	
  organization,	
  key	
  ingredients	
  
needed	
  to	
  enable	
  effective	
  AI	
  apps	
  may	
  come	
  
from	
  non-­‐traditional	
  “tech”	
  sources	
  …	
  
In	
  other	
  words,	
  based	
  on	
  human-­‐in-­‐the-­‐loop	
  
design	
  pattern,	
  AI	
  expertise	
  may	
  emerge	
  from	
  
your	
  Sales,	
  Marketing,	
  and	
  Customer	
  Service	
  
teams	
  –	
  which	
  have	
  crucial	
  insights	
  about	
  your	
  
customers’	
  needs.
Summary
Ahead	
  in	
  AI:	
  hardware	
  advances	
  force	
  abrupt	
  
changes	
  in	
  software	
  practices	
  –	
  which	
  has	
  
lagged	
  due	
  to	
  lack	
  of	
  infrastructure,	
  data	
  
quality,	
  outdated	
  process,	
  etc.	
  
HITL	
  (active	
  learning)	
  as	
  management	
  strategy	
  
for	
  AI	
  addresses	
  broad	
  needs	
  across	
  industry,	
  
especially	
  for	
  enterprise	
  organizations.	
  
Big	
  Team	
  begins	
  to	
  take	
  its	
  place	
  in	
  the	
  formula	
  
Big	
  Data	
  +	
  Big	
  Compute	
  +	
  Big	
  Models.
Summary
The	
  “game”	
  is	
  not	
  to	
  replace	
  people	
  –	
  instead	
  it	
  
is	
  about	
  leveraging	
  AI	
  to	
  augment	
  staff,	
  so	
  that	
  
organizations	
  can	
  retain	
  people	
  with	
  valuable	
  
domain	
  expertise,	
  making	
  their	
  contributions	
  
and	
  experience	
  even	
  more	
  vital.	
  
This	
  is	
  a	
  personal	
  opinion,	
  which	
  does	
  not	
  
necessarily	
  reflect	
  the	
  views	
  of	
  my	
  employer.	
  
However,	
  the	
  views	
  of	
  my	
  employer…
Why	
  we’ll	
  never	
  run	
  out	
  of	
  jobs
43
Strata	
  Data	
  
SG,	
  Dec	
  4-­‐7

SJ,	
  Mar	
  5-­‐8

UK,	
  May	
  21-­‐24

CN,	
  Jul	
  12-­‐15	
  
The	
  AI	
  Conf	
  
CN	
  Apr	
  10-­‐13

NY,	
  Apr	
  29-­‐May	
  2

SF,	
  Sep	
  4-­‐7

UK,	
  Oct	
  8-­‐11	
  
JupyterCon	
  
NY,	
  Aug	
  21-­‐24	
  
OSCON	
  
PDX,	
  Jul	
  16-­‐19,	
  2018
44
45
Get	
  Started	
  with	
  
NLP	
  in	
  Python
Just	
  Enough	
  Math Building	
  Data	
  
Science	
  Teams
Hylbert-­‐Speys How	
  Do	
  You	
  Learn?
updates,	
  reviews,	
  conference	
  summaries…	
  
liber118.com/pxn/

@pacoid
Human-in-the-loop: a design pattern for managing teams that leverage ML

Contenu connexe

Tendances

Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The PeopleDaniel Tunkelang
 
Data Scientist 101 BI Dutch
Data Scientist 101 BI DutchData Scientist 101 BI Dutch
Data Scientist 101 BI DutchJos van Dongen
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Did you mean crowdsourcing for recommender systems?
Did you mean crowdsourcing for recommender systems?Did you mean crowdsourcing for recommender systems?
Did you mean crowdsourcing for recommender systems?oralonso
 
Data Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OData Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OSri Ambati
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsMarcel Kurovski
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesJongwook Woo
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksJongwook Woo
 
Findability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligenceFindability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligenceFindwise
 
Three Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceThree Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceAditya Parameswaran
 
Dealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIDealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIData Products Meetup
 
Yahoo Microstrategy 2008
Yahoo Microstrategy 2008Yahoo Microstrategy 2008
Yahoo Microstrategy 2008Amr Awadallah
 
Km cognitive computing overview by ken martin 19 jan2015
Km   cognitive computing overview by ken martin 19 jan2015Km   cognitive computing overview by ken martin 19 jan2015
Km cognitive computing overview by ken martin 19 jan2015HCL Technologies
 
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Amr Awadallah
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive AnalysisJongwook Woo
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamGreg Goltsov
 
2015 04 bio it world
2015 04 bio it world2015 04 bio it world
2015 04 bio it worldChris Dwan
 

Tendances (20)

Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Data Scientist 101 BI Dutch
Data Scientist 101 BI DutchData Scientist 101 BI Dutch
Data Scientist 101 BI Dutch
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Did you mean crowdsourcing for recommender systems?
Did you mean crowdsourcing for recommender systems?Did you mean crowdsourcing for recommender systems?
Did you mean crowdsourcing for recommender systems?
 
Data Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2OData Science, Machine Learning, and H2O
Data Science, Machine Learning, and H2O
 
Data Scientist Enablement roadmap 1.0
Data Scientist Enablement roadmap 1.0Data Scientist Enablement roadmap 1.0
Data Scientist Enablement roadmap 1.0
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use Cases
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on Networks
 
Findability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligenceFindability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligence
 
Three Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data ScienceThree Tools for "Human-in-the-loop" Data Science
Three Tools for "Human-in-the-loop" Data Science
 
Dealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIDealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AI
 
Yahoo Microstrategy 2008
Yahoo Microstrategy 2008Yahoo Microstrategy 2008
Yahoo Microstrategy 2008
 
AI on Big Data
AI on Big DataAI on Big Data
AI on Big Data
 
Km cognitive computing overview by ken martin 19 jan2015
Km   cognitive computing overview by ken martin 19 jan2015Km   cognitive computing overview by ken martin 19 jan2015
Km cognitive computing overview by ken martin 19 jan2015
 
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)Cloudera/Stanford EE203 (Entrepreneurial Engineer)
Cloudera/Stanford EE203 (Entrepreneurial Engineer)
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive Analysis
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
2015 04 bio it world
2015 04 bio it world2015 04 bio it world
2015 04 bio it world
 

Similaire à Human-in-the-loop: a design pattern for managing teams that leverage ML

Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02BIWUG
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointJoris Poelmans
 
Diagnosability vs The Cloud
Diagnosability vs The CloudDiagnosability vs The Cloud
Diagnosability vs The CloudBob Rhubart
 
Diagnosability versus The Cloud, Redwood Shores 2011-08-30
Diagnosability versus The Cloud, Redwood Shores 2011-08-30Diagnosability versus The Cloud, Redwood Shores 2011-08-30
Diagnosability versus The Cloud, Redwood Shores 2011-08-30Cary Millsap
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeVasu S
 
Build 2015 – Azure overview
Build 2015 – Azure overviewBuild 2015 – Azure overview
Build 2015 – Azure overviewLars Yde
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDBDenny Lee
 
Applicare patterns di sviluppo con Azure
Applicare patterns di sviluppo con AzureApplicare patterns di sviluppo con Azure
Applicare patterns di sviluppo con AzureMarco Parenzan
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
The lab on your laptop: Technical growth with virtualization
The lab on your laptop: Technical growth with virtualizationThe lab on your laptop: Technical growth with virtualization
The lab on your laptop: Technical growth with virtualizationjpiwowar
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonTom Dierickx
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Julien SIMON
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 
Serverless vs. Developers – the real crash
Serverless vs. Developers – the real crashServerless vs. Developers – the real crash
Serverless vs. Developers – the real crashAWS Germany
 
How We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by VersentHow We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by VersentAmazon Web Services
 

Similaire à Human-in-the-loop: a design pattern for managing teams that leverage ML (20)

Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
Zarneger "Supporting AI: Best Practices for Content Delivery Platforms"
 
Diagnosability vs The Cloud
Diagnosability vs The CloudDiagnosability vs The Cloud
Diagnosability vs The Cloud
 
Diagnosability versus The Cloud, Redwood Shores 2011-08-30
Diagnosability versus The Cloud, Redwood Shores 2011-08-30Diagnosability versus The Cloud, Redwood Shores 2011-08-30
Diagnosability versus The Cloud, Redwood Shores 2011-08-30
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data Lake
 
Build 2015 – Azure overview
Build 2015 – Azure overviewBuild 2015 – Azure overview
Build 2015 – Azure overview
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Applicare patterns di sviluppo con Azure
Applicare patterns di sviluppo con AzureApplicare patterns di sviluppo con Azure
Applicare patterns di sviluppo con Azure
 
On nosql
On nosqlOn nosql
On nosql
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
The lab on your laptop: Technical growth with virtualization
The lab on your laptop: Technical growth with virtualizationThe lab on your laptop: Technical growth with virtualization
The lab on your laptop: Technical growth with virtualization
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 
Serverless vs. Developers – the real crash
Serverless vs. Developers – the real crashServerless vs. Developers – the real crash
Serverless vs. Developers – the real crash
 
How We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by VersentHow We end the Walking Dead in the Enterprise - Session Sponsored by Versent
How We end the Walking Dead in the Enterprise - Session Sponsored by Versent
 

Plus de Paco Nathan

Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 

Plus de Paco Nathan (20)

Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 

Dernier

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Dernier (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Human-in-the-loop: a design pattern for managing teams that leverage ML

  • 1. Human-­‐in-­‐a-­‐loop:  a  design  pattern  for  managing  teams  that  leverage  ML Paco  Nathan    @pacoid   Director,  Learning  Group  @  O’Reilly  Media   Strata  SG,  Singapore    2017-­‐12-­‐06
  • 2. Framing Imagine  having  a  mostly-­‐automated  system  where  
 people  and  machines  collaborate  together…   May  sound  a  bit  Sci-­‐Fi,  though  arguably  commonplace.  
 One  challenge  is  whether  we  can  advance  beyond  just   handling  rote  tasks.     Instead  of  simply  running  code  libraries,  can  machines  
 make  difficult  decisions,  exercise  judgement  in  complex   situations?     Can  we  build  systems  in  which  people  who  aren’t  
 AI  experts  can  “teach”  machines  to  perform  complex  
 work  –  based  on  examples,  not  code?
  • 3. Research  questions ▪ How  do  we  personalize  learning  experiences,  across  
 ebooks,  videos,  conferences,  computable  content,  
 live  online  courses,  case  studies,  expert  AMAs,  etc.   ▪ How  do  we  help  experts  (by  definition,  really  busy  
 people)  share  their  knowledge  with  peers  in  industry?   ▪ How  do  we  manage  the  role  of  editors  at  human  scale,  
 while  technology  and  delivery  media  evolve  rapidly?   ▪ How  do  we  help  organizations  learn  and  transform   continuously? 3
  • 4.
  • 5. UX  for  content  discovery:   ▪ partly  generated  +  curated  by  people   ▪ partly  generated  +  curated  by  AI  apps
  • 6. AI  in  Media ▪ content  which  can  represented  as  
 text  can  be  parsed  by  NLP,  then   manipulated  by  available  AI  tooling     ▪ labeled  images  get  really  interesting   ▪ assumption:  text  or  images  –  within  
 a  context  –  have  inherent  structure   ▪ representation  of  that  kind  of  structure   is  rare  in  the  Media  vertical  –  so  far 6
  • 7. {"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49], "take", "take", "VB", 1, 50], [0, "a", "a", "DT", 0, 51], [23, "look", "l "NN", 1, 52], [0, "at", "at", "IN", 0, 53], [0, "a", "a", "DT", 0, 54], [ "few", "few", "JJ", 1, 55], [25, "examples", "example", "NNS", 1, 56], [0 "often", "often", "RB", 0, 57], [0, "when", "when", "WRB", 0, 58], [11, "people", "people", "NNS", 1, 59], [2, "are", "be", "VBP", 1, 60], [26, " "first", "JJ", 1, 61], [27, "learning", "learn", "VBG", 1, 62], [0, "abou "about", "IN", 0, 63], [28, "Docker", "docker", "NNP", 1, 64], [0, "they" "they", "PRP", 0, 65], [29, "try", "try", "VBP", 1, 66], [0, "and", "and" 0, 67], [30, "put", "put", "VBP", 1, 68], [0, "it", "it", "PRP", 0, 69], "in", "in", "IN", 0, 70], [0, "one", "one", "CD", 0, 71], [0, "of", "of", 0, 72], [0, "a", "a", "DT", 0, 73], [24, "few", "few", "JJ", 1, 74], [31, "existing", "existing", "JJ", 1, 75], [18, "categories", "category", "NNS 76], [0, "sometimes", "sometimes", "RB", 0, 77], [11, "people", "people", 1, 78], [9, "think", "think", "VBP", 1, 79], [0, "it", "it", "PRP", 0, 80 "'s", "be", "VBZ", 1, 81], [0, "a", "a", "DT", 0, 82], [32, "virtualizati "virtualization", "NN", 1, 83], [19, "tool", "tool", "NN", 1, 84], [0, "l "like", "IN", 0, 85], [33, "VMware", "vmware", "NNP", 1, 86], [0, "or", " "CC", 0, 87], [34, "virtualbox", "virtualbox", "NNP", 1, 88], [0, "also", "also", "RB", 0, 89], [35, "known", "know", "VBN", 1, 90], [0, "as", "as" 0, 91], [0, "a", "a", "DT", 0, 92], [36, "hypervisor", "hypervisor", "NN" 93], [0, "these", "these", "DT", 0, 94], [2, "are", "be", "VBP", 1, 95], "tools", "tool", "NNS", 1, 96], [0, "which", "which", "WDT", 0, 97], [2, "be", "VBP", 1, 98], [37, "emulating", "emulate", "VBG", 1, 99], [38, "hardware", "hardware", "NN", 1, 100], [0, "for", "for", "IN", 0, 101], [ "virtual", "virtual", "JJ", 1, 102], [40, "software", "software", "NN", 1 103]], "id": "001.video197359", "sha1": "4b69cf60f0497887e3776619b922514f2e5b70a8"} AI  in  Media 7 {"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"} {"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"} {"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"} {"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"} {"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"} {"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"} {"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"} {"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"} {"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"} Transcript: let's take a look at a few examples often when people are first learning about Docker they try and put it in one of a few existing categories sometimes people think it's a virtualization tool like VMware or virtualbox also known as a hypervisor these are tools which are emulating hardware for virtual software Confidence: 0.973419129848 39 KUBERNETES 0.8747 coreos 0.8624 etcd 0.8478 DOCKER CONTAINERS 0.8458 mesos 0.8406 DOCKER 0.8354 DOCKER CONTAINER 0.8260 KUBERNETES CLUSTER 0.8258 docker image 0.8252 EC2 0.8210 docker hub 0.8138 OPENSTACK orm:Docker a orm:Vendor; a orm:Container; a orm:Open_Source; a orm:Commercial_Software; owl:sameAs dbr:Docker_%28software%29; skos:prefLabel "Docker"@en;
  • 8. Knowledge  Graph ▪ used  to  construct  an  ontology  about   technology,  based  on  learning   materials  from  200+  publishers   ▪ uses  SKOS  as  a  foundation,  ties  into  
 US  Library  of  Congress  and  DBpedia  
 as  upper  ontologies   ▪ primary  structure  is  “human  scale”,  
 used  as  control  points   ▪ majority  (>90%)  of  the  graph  
 comes  from  machine  generated  
 data  products 8
  • 9. AI  is  real,  but  why  now? ▪ Big  Data:  machine  data  (1997-­‐ish)   ▪ Big  Compute:  cloud  computing  (2006-­‐ish)   ▪ Big  Models:  deep  learning  (2009-­‐ish)   The  confluence  of  three  factors  created  a  business  
 environment  where  AI  could  become  mainstream   What  else  is  needed? 9
  • 11. Machine  learning supervised  ML:   ▪ take  a  dataset  where  each  element   has  a  label   ▪ train  models  on  a  portion  of  the   data  to  predict  the  labels,  then  
 evaluate  on  the  holdout   ▪ deep  learning  is  a  popular  example,  
 but  only  if  you  have  lots  of  labeled   training  data  available
  • 12. Machine  learning unsupervised  ML:   ▪ run  lots  of  unlabeled  data  through   an  algorithm  to  detect  “structure”   or  embedding   ▪ for  example,  clustering  algorithms   such  as  K-­‐means   ▪ unsupervised  approaches  for  AI  
 are  an  open  research  question
  • 13. Active  learning special  case  of  semi-­‐supervised  ML:   ▪ send  difficult  decisions/edge  cases  
 to  experts;  let  algorithms  handle   routine  decisions  (automation)   ▪ works  well  in  use  cases  which  have   lots  of  inexpensive,  unlabeled  data   ▪ e.g.,  abundance  of  content  to  be   classified,  where  the  cost  of   labeling  is  the  expense
  • 14. The  reality  of  data  rates “If  you  only  have  10  examples  of  something,  it’s  going
    to  be  hard  to  make  deep  learning  work.  If  you  have
    100,000  things  you  care  about,  records  or  whatever,
    that’s  the  kind  of  scale  where  you  should  really  start
    thinking  about  these  kinds  of  techniques.”   Jeff  Dean    Google
 VB  Summit  2017-­‐10-­‐23   venturebeat.com/2017/10/23/google-­‐brain-­‐chief-­‐says-­‐100000-­‐ examples-­‐is-­‐enough-­‐data-­‐for-­‐deep-­‐learning/
  • 15. The  reality  of  data  rates Use  cases  for  deep  learning  must  have  large,  carefully   labeled  data  sets,  while  reinforcement  learning  needs   much  more  data  than  that.   Active  learning  can  yield  good  results  with  substantially   smaller  data  rates,  while  leveraging  an  organization’s   expertise  to  bootstrap  toward  larger  labeled  data  sets,   e.g.,  as  preparation  for  deep  learning,  etc. reinforcement learning supervised learning active learning deep learning data rates (log scale)
  • 18. Active  learning Real-­‐World  Active  Learning:  Applications  and   Strategies  for  Human-­‐in-­‐the-­‐Loop  Machine  Learning
 radar.oreilly.com/2015/02/human-­‐in-­‐the-­‐loop-­‐ machine-­‐learning.html
 Ted  Cuzzillo
 O’Reilly  Media,  2015-­‐02-­‐05   Develop  a  policy  for  how  human  experts  select  exemplars:   ▪ bias  toward  labels  most  likely  to  influence  the  classifier   ▪ bias  toward  ensemble  disagreement   ▪ bias  toward  denser  regions  of  training  data 18
  • 19. Active  learning Active  learning  and  transfer  learning
 safaribooksonline.com/library/view/oreilly-­‐ artificial-­‐intelligence/9781491985250/ video314919.html
 Luke  Biewald    CrowdFlower
 The  AI  Conf,  2017-­‐09-­‐17   breakthroughs  lag  algorithm  invention,  waiting  for   “killer  data  set”  to  emerge,  often  decade+ 19
  • 20. Design  pattern:  Human-­‐in-­‐the-­‐loop Building  a  business  that  combines  human  experts   and  data  science
 oreilly.com/ideas/building-­‐a-­‐business-­‐that-­‐ combines-­‐human-­‐experts-­‐and-­‐data-­‐science-­‐2
 Eric  Colson    StitchFix
 O’Reilly  Data  Show,  2016-­‐01-­‐28   “what  machines  can’t  do  are  things  around  cognition,
    things  that  have  to  do  with  ambient  information,  or
    appreciation  of  aesthetics,  or  even  the  ability  to
    relate  to  another  human”
 
 20
  • 21. Design  pattern:  Human-­‐in-­‐the-­‐loop Unsupervised  fuzzy  labeling  using  deep   learning  to  improve  anomaly  detection
 conferences.oreilly.com/strata/strata-­‐ sg/public/schedule/detail/63304
 Adam  Gibson    Skymind
 Strata  SG,  2017-­‐12-­‐07   large-­‐scale  use  case  for  telecom  in  Asia   method:  overfit  variational  autoencoders,   then  send  outliers  to  human  analysts 21 4:15pm  Thu  7  Dec
 Summit  2
  • 22. Design  pattern:  Human-­‐in-­‐the-­‐loop EY,  Deloitte  And  PwC  Embrace  Artificial   Intelligence  For  Tax  And  Accounting
 forbes.com/sites/adelynzhou/2017/11/14/ey-­‐ deloitte-­‐and-­‐pwc-­‐embrace-­‐artificial-­‐intelligence-­‐ for-­‐tax-­‐and-­‐accounting
 Adelyn  Zhou
 Forbes,  2017-­‐11-­‐14   compliance  use  case:  review  lease  accounting  standards   3x  more  consistent,  2x  efficient  as  previous  humans-­‐ only  teams   break-­‐even  ROI  within  less  than  a  year 22
  • 23. Design  pattern:  Human-­‐in-­‐the-­‐loop Strategies  for  integrating  people  and  machine   learning  in  online  systems
 safaribooksonline.com/library/view/oreilly-­‐ artificial-­‐intelligence/9781491976289/ video311857.html
 Jason  Laska    Clara  Labs
 The  AI  Conf,  2017-­‐06-­‐29   how  to  create  a  two-­‐sided  marketplace  where  machines   and  people  compete  on  a  spectrum  of  relative  expertise   and  capabilities
 
 23
  • 24. Design  pattern:  Human-­‐in-­‐the-­‐loop Building  human-­‐assisted  AI  applications
 oreilly.com/ideas/building-­‐human-­‐ assisted-­‐ai-­‐applications
 Adam  Marcus    B12
 O’Reilly  Data  Show,  2016-­‐08-­‐25   Orchestra:  a  platform  for  building  human-­‐ assisted  AI  applications,  e.g.,  to  create   business  websites
 https://github.com/b12io/orchestra   example  http://www.coloradopicked.com/ 24
  • 25. Design  pattern:  Flash  teams Expert  Crowdsourcing  with  Flash  Teams
 hci.stanford.edu/publications/2014/ flashteams/flashteams-­‐uist2014.pdf
 Daniela  Retelny,  et  al.  
 Stanford  HCI   “A  flash  team  is  a  linked  set  of  modular  tasks  
    that  draw  upon  paid  experts  from  the  crowd,  
    often  three  to  six  at  a  time,  on  demand”   http://stanfordhci.github.io/flash-­‐teams/ 25
  • 26. Weak  supervision  /  Data  programming Creating  large  training  data  sets  quickly
 oreilly.com/ideas/creating-­‐large-­‐training-­‐ data-­‐sets-­‐quickly
 Alex  Ratner    Stanford
 O’Reilly  Data  Show,  2017-­‐06-­‐08   Snorkel:  “weak  supervision”  and  “data   programming”  as  another  instance  of  
 human-­‐in-­‐the-­‐loop
 github.com/HazyResearch/snorkel   conferences.oreilly.com/strata/strata-­‐ny/public/ schedule/detail/61849 26
  • 29. Disambiguating  contexts Overlapping  contexts  pose  hard  problems  in  natural  language  understanding.   That  runs  counter  to  the  correlation  emphasis  of  big  data.
 NLP  libraries  lack  features  for  disambiguation.
  • 30. Disambiguating  contexts 30 Suppose  someone  publishes  a  book  which  uses  the  term   `IOS`:  are  they  talking  about  an  operating  system  for  an   Apple  iPhone,  or  about  an  operating  system  for  a  Cisco   router?     We  handle  lots  of  content  about  both.  Disambiguating  those   contexts  is  important  for  good  UX  in  personalized  learning.   In  other  words,  how  do  machines  help  people  
 distinguish  that  content  within  search?   Potentially  a  good  case  for  deep  learning,  
 except  for  the  lack  of  labeled  data  at  scale.
  • 31. Active  learning  through  Jupyter 31 Jupyter  notebooks  are  used  to  manage  ML  
 pipelines  for  disambiguation,  where  machines  
 and  people  collaborate:   ▪ ML  based  on  examples  –  most  all  of  the  feature   engineering,  model  parameters,  etc.,  has  been   automated   ▪ https://github.com/ceteri/nbtransom   ▪ based  on  use  of  nbformat,  pandas,  scikit-­‐learn
  • 32. Active  learning  through  Jupyter 32 Jupyter  notebooks  are  used  to  manage  ML   pipelines and  people  collaborate:   ▪ ML  based  on  examples  –  most  all  of  the  feature   engineering,  model  parameters,  etc.,  has  been   automated   ▪ https://github.com/ceteri/nbtransom ▪ based  on  use  of   Jupyter  notebook  as…   ▪ one  part  configuration  file   ▪ one  part  data  sample   ▪ one  part  structured  log   ▪ one  part  data  visualization  tool   plus,  subsequent  data  mining  of  these  
 notebooks  helps  augment  our  ontology
  • 33. Active  learning  through  Jupyter 33 ML#Pipelines Jupyter#kernel Browser SSH#tunnel
  • 34. Active  learning  through  Jupyter ▪ Notebooks  allow  the  human  experts  to  access  the   internals  of  a  mostly  automated  ML  pipeline,  rapidly   ▪ Stated  another  way,  both  the  machines  and  the  people   become  collaborators  on  shared  documents   ▪ Anticipates  upcoming  collaborative  document  features   in  JupyterLab
  • 35. Active  learning  through  Jupyter 1. Experts  use  notebooks  to  provide  examples  of  book  chapters,  video   segments,  etc.,  for  each  key  phrase  that  has  overlapping  contexts   2. Machines  build  ensemble  ML  models  based  on  those  examples,   updating  notebooks  with  model  evaluation   3. Machines  attempt  to  annotate  labels  for  millions  of  pieces  of  content,  
 e.g.,  `AlphaGo`,  `Golang`,  versus  a  mundane  use  of  the  verb  `go`   4. Disambiguation  can  run  mostly  automated,  in  parallel  at  scale  –  
 through  integration  with  Apache  Spark   5. In  cases  where  ensembles  disagree,  ML  pipelines  defer  to  human   experts  who  make  judgement  calls,  providing  further  examples   6. New  examples  go  into  training  ML  pipelines  to  build  better  models   7. Rinse,  lather,  repeat
  • 36. Nuances ▪ No  Free  Lunch  theorem:  it  is  better  to  err  on  the   side  of  less  false  positives  /  more  false  negatives   in  use  cases  about  learning  materials   ▪ Employ  a  bias  toward  exemplars  policy,  i.e.,  those   most  likely  to  influence  the  classifier   ▪ Potentially,  “AI  experts”  may  be  Customer  Service   staff  who  review  edge  cases  within  search  results   or  recommended  content  –  as  an  integral  part  of   our  UX  –  then  re-­‐train  the  ML  pipelines  through   examples  
  • 38. Management  strategy  –  before Generally  with  Big  Data,  we  are  considering:   ▪ DAG  workflow  execution  –  which  is  linear   ▪ data-­‐driven  organizations   ▪ ML  based  on  optimizing  for  
 objective  functions   ▪ questions  of  correlation  
 versus  causation   ▪ avoiding  “garbage  in,  garbage  out” Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R 38
  • 39. Management  strategy  –  after HITL  introduces  circularities:   ▪ aka,  second-­‐order  cybernetics   ▪ leverage  feedback  loops  
 as  conversations   ▪ focus  on  human  scale,  
 design  thinking   ▪ people  and  machines  
 work  together  on  teams   ▪ budget  experts’  time  on  
 handling  the  exceptions AI team content ontology ML models attempt to label the data automatically Expert judgement about edge cases, provides examples ML models trained using examples Expert decisions to extend vocabulary ML models have consensus, confidence labels 39
  • 40. Essential  takeaway  idea:   Depending  on  the  organization,  key  ingredients   needed  to  enable  effective  AI  apps  may  come   from  non-­‐traditional  “tech”  sources  …   In  other  words,  based  on  human-­‐in-­‐the-­‐loop   design  pattern,  AI  expertise  may  emerge  from   your  Sales,  Marketing,  and  Customer  Service   teams  –  which  have  crucial  insights  about  your   customers’  needs.
  • 41. Summary Ahead  in  AI:  hardware  advances  force  abrupt   changes  in  software  practices  –  which  has   lagged  due  to  lack  of  infrastructure,  data   quality,  outdated  process,  etc.   HITL  (active  learning)  as  management  strategy   for  AI  addresses  broad  needs  across  industry,   especially  for  enterprise  organizations.   Big  Team  begins  to  take  its  place  in  the  formula   Big  Data  +  Big  Compute  +  Big  Models.
  • 42. Summary The  “game”  is  not  to  replace  people  –  instead  it   is  about  leveraging  AI  to  augment  staff,  so  that   organizations  can  retain  people  with  valuable   domain  expertise,  making  their  contributions   and  experience  even  more  vital.   This  is  a  personal  opinion,  which  does  not   necessarily  reflect  the  views  of  my  employer.   However,  the  views  of  my  employer…
  • 43. Why  we’ll  never  run  out  of  jobs 43
  • 44. Strata  Data   SG,  Dec  4-­‐7
 SJ,  Mar  5-­‐8
 UK,  May  21-­‐24
 CN,  Jul  12-­‐15   The  AI  Conf   CN  Apr  10-­‐13
 NY,  Apr  29-­‐May  2
 SF,  Sep  4-­‐7
 UK,  Oct  8-­‐11   JupyterCon   NY,  Aug  21-­‐24   OSCON   PDX,  Jul  16-­‐19,  2018 44
  • 45. 45 Get  Started  with   NLP  in  Python Just  Enough  Math Building  Data   Science  Teams Hylbert-­‐Speys How  Do  You  Learn? updates,  reviews,  conference  summaries…   liber118.com/pxn/
 @pacoid