PyTextRank: Graph algorithms for enhanced natural language processing

Paco Nathan @pacoid
Dir, Learning Group @ O’Reilly Media
2017-09-28
Background: helping people learn
Research questions

▪ How do we personalize learning experiences across ebooks, videos, conferences, computable content, live online courses, case studies, expert AMAs, etc.?
▪ How do we help experts (by definition, really busy people) share their knowledge with peers in industry?
▪ How do we manage the role of editors at human scale, while technology and delivery media evolve rapidly?
▪ How do we help organizations learn and transform continuously?
UX for content discovery:

▪ partly generated + curated by humans
▪ partly generated + curated by AI apps
O’Reilly Media produces ~20 conferences per year, each with ~200 hours of video content. To review it in its entirety, an acquisitions editor would have a finger permanently on the fast-forward button 10 months out of the year.

The introduction of AR/VR for learning materials accelerates that disparity.

Plus, that’s only one among several major needs for applying AI in Media – to augment staff.
AI in Media

▪ content which can be represented as text can be parsed by NLP, then manipulated by available AI tooling
▪ labeled images get really interesting
▪ text or images within a context have inherent structure
▪ representation of that kind of structure is rare in the Media vertical – so far
Working with text and NLP

▪ parsing
▪ named entity recognition
▪ smarter indexing (semantic search)
▪ summarization, especially for video
▪ query expansion
▪ augmenting ontology
▪ semantic similarity for recommendations
▪ quicker development of assessments
▪ complexity estimates for learning materials
{"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49], …,
  [11, "people", "people", "NNS", 1, 59], [2, "are", "be", "VBP", 1, 60], …,
  [28, "Docker", "docker", "NNP", 1, 64], …,
  [38, "hardware", "hardware", "NN", 1, 100], …],
 "id": "001.video197359",
 "sha1": "4b69cf60f0497887e3776619b922514f2e5b70a8"}
Video transcription
{"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"}
{"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"}
{"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"}
{"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"}
{"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"}
{"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"}
{"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"}
{"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"}
{"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"}
Transcript: let's take a look at a few examples often when
people are first learning about Docker they try and put it in
one of a few existing categories sometimes people think it's
a virtualization tool like VMware or virtualbox also known as
a hypervisor these are tools which are emulating hardware for
virtual software
Confidence: 0.973419129848
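The ranked keyphrases above are emitted as JSON lines, which downstream tooling can consume with the standard library alone. A minimal sketch, using the sample records shown (the field names follow that sample output):

```python
# The records above are JSON lines; the stdlib json module is enough to
# filter and rank them (field names follow the sample output shown).
import json

lines = '''
{"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"}
{"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"}
{"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"}
'''.strip().splitlines()

phrases = [json.loads(line) for line in lines]
top_np = [p["text"]
          for p in sorted(phrases, key=lambda p: p["rank"], reverse=True)
          if p["pos"] == "np"]
print(top_np)  # noun phrases only, highest rank first
```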
KUBERNETES
0.8747 coreos
0.8624 etcd
0.8478 DOCKER CONTAINERS
0.8458 mesos
0.8406 DOCKER
0.8354 DOCKER CONTAINER
0.8260 KUBERNETES CLUSTER
0.8258 docker image
0.8252 EC2
0.8210 docker hub
0.8138 OPENSTACK

orm:Docker a orm:Vendor;
  a orm:Container;
  a orm:Open_Source;
  a orm:Commercial_Software;
  owl:sameAs dbr:Docker_%28software%29;
  skos:prefLabel "Docker"@en;
Which parts do people or machines do best?

Ontology layers:
▪ human scale: primary structure, control points, testability
▪ machine generated data products: ~80% of the graph

team goal: maintain structural correspondence between the layers
big win for AI: inferences across the graph
Motivations: a new generation of NLP

Statistical parsing
Long, long ago, the field of NLP used to be concerned with transformational grammars, linguistic theory by Chomsky, etc. Machine learning techniques allowed simpler approaches, i.e., statistical parsing. With the rise of Big Data, those techniques grew more effective – Google won the NIST 2005 Machine Translation Evaluation and remains a leader.

More recently, deep learning helps extend performance (read: fewer errors) for statistical parsing and downstream uses.
ML applied

Models allow NLP packages to:

▪ determine the natural language being used
▪ segment texts into paragraphs and sentences
▪ annotate words with part-of-speech
▪ find roots for each word (lemmas)
▪ chunk the noun phrases
▪ identify named entities
▪ correct spelling and grammar errors
▪ estimate scores for sentiment, readability, complexity
▪ etc.
Common misconceptions

▪ generational lag: that keyword analysis (“bag of words”), stemming, co-occurrence, LDA, and other techniques from a prior generation remain current best practices
▪ vendor FUD: that NLP requires Big Data frameworks to run at scale
▪ either/or fallacy: that NLP results feeding into AI apps either must be fully automated or require lots of manual intervention
What changed?

Hardware: multicore and large memory spaces, plus wide availability of cloud computing resources. The same data + algorithms which required a large Hadoop cluster in 2007 run faster in Python on my laptop in 2017.

NLP benefited as hardware advances eliminated the need for computational shortcuts (e.g., stemming) and made more advanced approaches (e.g., graph algorithms) feasible.

Also, subtle shifts as approximation algorithms became more integrated, e.g., substituting skip-grams for n-grams.
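That last shift is easy to picture. A minimal sketch contrasting contiguous bigrams with skip-grams (toy helper functions, not from any particular library):

```python
# Contiguous bigrams vs. skip-grams: skip-grams admit small gaps between
# tokens, capturing co-occurrence pairs that strict n-grams miss.
def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, k=1):
    # k-skip bigrams: pairs of tokens at most k positions apart beyond adjacency
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "machine learning helps extend NLP".split()
print(ngrams(tokens))     # 4 contiguous pairs
print(skipgrams(tokens))  # those 4, plus pairs that skip one word
```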
Bag-of-words is not your friend

Continued use of outdated NLP approaches (bag-of-words, stemming, n-grams, plus the general class of Apache Solr-style search) increases the risk of “fake news bot” problems
Advances: graph algorithms for NLP
TextRank

“TextRank: Bringing Order into Texts”
Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural Language Processing (2004-07-25)

https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/
http://www.cse.unt.edu/~tarau/
TextRank

“In this paper, we introduced TextRank – a graph-based ranking model for text processing, and show how it can be successfully used for natural language applications. In particular, we proposed and evaluated two innovative unsupervised approaches for keyword and sentence extraction, and showed that the accuracy achieved by TextRank in these applications is competitive with that of previously proposed state-of-the-art algorithms. An important aspect of TextRank is that it does not require deep linguistic knowledge, nor domain or language specific annotated corpora, which makes it highly portable to other domains, genres, or languages.”
TextRank

1. segment document into paragraphs and sentences
2. parse each sentence, find PoS annotations, lemmas
3. construct a graph where nouns, adjectives, verbs are vertices
   • links based on skip-grams
   • links based on repeated instances of the same root
4. stochastic link analysis (e.g., PageRank) identifies noun phrases which have a high probability of inbound references
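The four steps above can be sketched in pure Python — a toy illustration, not the PyTextRank implementation: hard-coded (lemma, keep) pairs stand in for a real parser, and a hand-rolled power-iteration PageRank stands in for a graph library.

```python
# Toy TextRank: window-based co-occurrence links over content-word lemmas,
# then PageRank by power iteration.
def pagerank(graph, d=0.85, iters=50):
    ranks = {v: 1.0 / len(graph) for v in graph}
    for _ in range(iters):
        ranks = {
            v: (1 - d) / len(graph)
               + d * sum(ranks[u] / len(graph[u]) for u in graph if v in graph[u])
            for v in graph
        }
    return ranks

# (lemma, keep) pairs: nouns/adjectives/verbs kept, stopwords dropped (steps 1-2)
tagged = [("graph", True), ("algorithm", True), ("improve", True),
          ("the", False), ("graph", True), ("ranking", True),
          ("of", False), ("text", True)]
lemmas = [w for w, keep in tagged if keep]

# step 3: link lemmas co-occurring within a window of 2 (stand-in for skip-grams)
graph = {w: set() for w in lemmas}
for i, w in enumerate(lemmas):
    for u in lemmas[max(0, i - 2):i]:
        if u != w:
            graph[w].add(u)
            graph[u].add(w)

ranks = pagerank(graph)  # step 4: stochastic link analysis
print(sorted(ranks, key=ranks.get, reverse=True))  # "graph" ranks highest
```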
TextRank leverages NLP and graph algorithms at scale, improving the means to integrate Deep Learning plus Knowledge Graphs – for enhanced machine intelligence in text handling
Graph terminology

▪ many real-world problems can have their data represented as graphs
▪ graphs can be converted into sparse matrices (a bridge into linear algebra)
▪ eigenvectors find the stable points in a system defined by matrices – often more efficient to compute
▪ beyond simple graphs, complex data may require work with tensors
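The second and third bullets can be illustrated directly: store a small graph as a sparse matrix (a dict-of-dicts) and find its stable point — the dominant eigenvector — by power iteration. A toy sketch, not tied to any particular library:

```python
# A small graph stored as a sparse matrix (dict-of-dicts), and its stable
# point -- the dominant eigenvector -- found by power iteration.
A = {"a": {"b": 1.0, "c": 1.0, "d": 1.0},  # "a" is the best-connected vertex
     "b": {"a": 1.0, "c": 1.0},
     "c": {"a": 1.0, "b": 1.0},
     "d": {"a": 1.0}}

x = {v: 1.0 for v in A}
for _ in range(100):
    # sparse matrix-vector product, then renormalize to unit length
    y = {v: sum(row.get(u, 0.0) * x[u] for u in A) for v, row in A.items()}
    norm = sum(val * val for val in y.values()) ** 0.5
    x = {v: val / norm for v, val in y.items()}

print(max(x, key=x.get))  # "a" dominates the eigenvector
```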
Unsupervised summarization

By the numbers

"Compatibility of systems of linear constraints"

[{'index': 0, 'root': 'compatibility', 'tag': 'NNP', 'word': 'compatibility'},
 {'index': 1, 'root': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 2, 'root': 'system', 'tag': 'NNS', 'word': 'systems'},
 {'index': 3, 'root': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 4, 'root': 'linear', 'tag': 'JJ', 'word': 'linear'},
 {'index': 5, 'root': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]

[graph diagram] vertices: compat, system, linear, constraint
Important outcomes

Unsupervised methods which do not require subject domain info, but can use it when available.

Extracting key phrases from a text, in lieu of keywords, provides significantly more information content. For example, consider “deep learning” vs. “deep” and “learning”.

Also, automated summarization can reduce human (editorial) time dramatically.
Implementations: various trade-offs
Other implementations

Paco Nathan, 2008
Java, Apache Hadoop / English+Spanish
https://github.com/ceteri/textrank/

Radim Řehůřek, 2009
Python, NumPy, SciPy
https://radimrehurek.com/gensim/

Jeff Kubina, 2009
Perl / English
http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm

Karin Christiansen, 2014
Java / Icelandic
https://github.com/karchr/icetextsum

Paco Nathan, 2014
Scala, Apache Spark / English
https://github.com/ceteri/spark-exercises

Got any others to add to this list?
PyTextRank

Paco Nathan, 2016
Python implementation atop spaCy, NetworkX, datasketch:

▪ https://pypi.python.org/pypi/pytextrank/
▪ https://github.com/ceteri/pytextrank/
PyTextRank: motivations

▪ spaCy is moving fast and benefits greatly by having an abstraction layer above it
▪ NetworkX allows use of multicore + large memory spaces for graph algorithms, obviating the need in many NLP use cases for Hadoop, Spark, etc.
▪ datasketch leverages approximation algorithms which make key features feasible, given the environments described above
▪ monolithic NLP frameworks often attempt to control the entire pipeline (instead of using modules or services) and are basically tech debt waiting to happen
PyTextRank: innovations

▪ fixed a bug in the original paper’s pseudocode; see the 2008 Java impl, used in production by ShareThis, Rolling Stone, etc.
▪ included verbs in the graph (though not in keyphrase output)
▪ added text scrubbing, unicode normalization
▪ replaced stemming with lemmatization
▪ can leverage named entity recognition, noun phrase chunking
▪ provides extractive summarization
▪ makes use of approximation algorithms, where appropriate
▪ applies use of ontology: a random walk of hypernyms/hyponyms helps enrich the graph, which in turn improves performance
▪ integrates vector embedding + other graph algorithms to help construct or augment an ontology from a corpus
PyTextRank: summarization

1. parse a text to create its feature vector, i.e., the ranked keyphrases from TextRank
2. calculate semantic similarity between the feature vector and each sentence in the text
3. select the N top-ranked sentences, listed in order, to summarize
Almost a year ago, we published our now-annual landscape of machine
intelligence companies, and goodness have we seen a lot of activity since
then. As has been the case for the last couple of years, our fund still
obsesses over 'problem first' machine intelligence -- we've invested in 

35 machine intelligence companies solving 35 meaningful problems in areas
from security to recruiting to software development. At the same time, 

the hype around machine intelligence methods continues to grow: the words
'deep learning' now equally represent a series of meaningful breakthroughs
(wonderful) but also a hyped phrase like 'big data' (not so good!). 

What's the biggest change in the last year?
oreilly.com/ideas/the-current-state-of-machine-intelligence-3-0
NLP + AI, by the numbers

1. pull data across various silos by federating queries
2. scrub text; normalize unicode
3. synthesize missing tables; impute missing values for metadata; parse/repair TOCs
4. extract document text from DOM + metadata; segment sentences
5. parse to get PoS tagging; lemmatize nouns and verbs; noun phrase chunking; named entity recognition
6. construct a graph for each document
7. enrich graph using ontology to add links among lemmas, named entities, etc.
8. build feature vectors with TextRank
9. use feature vectors + semantic similarity for extractive summarization
10. use feature vectors + vector embedding to expand queries and augment ontology
11. apply various graph algorithms as high-pass filters
12. build priors based on authors, publishers, titles, publication dates
13. disambiguate topic annotations using active learning
14. extract key features from annotation models for subsequent graph enrichment
15. use classifiers to predict language and complexity
16. build in-memory semantic indexing, auto-completion, content recommendations, gap analysis, etc.
17. run Monte Carlo methods to rank topics by usage
18. rinse, lather, repeat
Looking ahead

O’Reilly Media is actively working with other like-minded organizations to drive machine intelligence use cases in Media, through open source.
PyTextRank: upcoming

Package installation is currently integrated with PyPI.
2017-Q4: adding support for Anaconda.

Also working on:

▪ more efficient integration with spaCy, e.g., spans
▪ more coding examples in the wiki, especially using Jupyter
▪ support for ReadTheDocs
▪ better handling for character encoding issues; see “Character Filtering” by Daniel Tunkelang, along with his entire series Query Understanding
▪ more integration into coursework; see Get Started with NLP in Python on O’Reilly
PyTextRank: upcoming

One compelling reason for using TextRank is to leverage graph algorithms. Using an ontology to enrich the graph (prior to the “PageRank” step) has been shown to improve NLP results.

A major upcoming feature in PyTextRank will take an ontology (e.g., SKOS) as an input parameter.

Working now to train/focus spaCy NER using an input ontology, as an automated preprocessing step for TextRank.
PyTextRank: upcoming

Disambiguating context is a hard problem in NLP, and one relatively lacking in the available open source packages. For example: “ios operating system” – Apple or Cisco?

O’Reilly Media leverages a human-in-the-loop design pattern to build disambiguation pipelines through active learning, which in turn uses and helps extend our ontology work.
PyTextRank: upcoming

Working on improved summarization:

▪ currently provides extractive summarization
▪ working toward abstractive summarization; evaluating integration with TensorFlow

See also:

▪ Text Summarization in Python, 2017-04-05
▪ Text summarization, topic models and RNNs, 2016-09-25

Overall, we’re focused on improving work with video for O’Reilly Media use cases, e.g., parsing and repairing automated transcriptions.
Strata Data
SG, Dec 4-7
SJ, Mar 5-8
UK, May 21-24
CN, Jul 12-15

The AI Conf
CN, Apr 10-13
NY, Apr 29-May 2
SF, Sep 4-7
UK, Oct 8-11

JupyterCon
NY, Aug 21-24

OSCON (returns!)
PDX, Jul 16-19, 2018
42
Learn	
  Alongside

Innovators
Just	
  Enough	
  Math Building	
  Data	
  
Science	
  Teams
Hylbert-­‐Speys How	
  Do	
  You	
  Learn?
updates,	
  reviews,	
  conference	
  summaries…	
  
liber118.com/pxn/

@pacoid
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

PyTextRank: Graph algorithms for enhanced natural language processing

  • 1. PyTextRank: Graph algorithms for enhanced natural language processing
Paco Nathan @pacoid
Dir, Learning Group @ O'Reilly Media
2017-09-28
  • 3. Research questions
▪ How do we personalize learning experiences, across ebooks, videos, conferences, computable content, live online courses, case studies, expert AMAs, etc.?
▪ How do we help experts (by definition, really busy people) share their knowledge with peers in industry?
▪ How do we manage the role of editors at human scale, while technology and delivery media evolve rapidly?
▪ How do we help organizations learn and transform continuously?
  • 4. (image-only slide)
  • 5. UX for content discovery:
▪ partly generated + curated by humans
▪ partly generated + curated by AI apps
  • 6. O'Reilly Media produces ~20 conferences per year, each with ~200 hours of video content. To review it in its entirety, an acquisitions editor would have a finger permanently on the fast-forward button 10 months out of the year. Introduction of AR/VR for learning materials accelerates that disparity. Plus, that's only one among several major needs for applying AI in Media – to augment staff.
  • 7. AI in Media
▪ content which can be represented as text can be parsed by NLP, then manipulated by available AI tooling
▪ labeled images get really interesting
▪ text or images within a context have inherent structure
▪ representation of that kind of structure is rare in the Media vertical – so far
  • 8. Working with text and NLP
▪ parsing
▪ named entity recognition
▪ smarter indexing (semantic search)
▪ summarization, especially for video
▪ query expansion
▪ augmenting ontology
▪ semantic similarity for recommendations
▪ quicker development of assessments
▪ complexity estimates for learning materials
  • 9. Video transcription
{"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49], …], "id": "001.video197359", "sha1": "4b69cf60f0497887e3776619b922514f2e5b70a8"}
{"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"}
{"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"}
{"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"}
{"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"}
{"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"}
{"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"}
{"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"}
{"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"}
{"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"}
Transcript: let's take a look at a few examples often when people are first learning about Docker they try and put it in one of a few existing categories sometimes people think it's a virtualization tool like VMware or virtualbox also known as a hypervisor these are tools which are emulating hardware for virtual software
Confidence: 0.973419129848
KUBERNETES: 0.8747 coreos, 0.8624 etcd, 0.8478 DOCKER CONTAINERS, 0.8458 mesos, 0.8406 DOCKER, 0.8354 DOCKER CONTAINER, 0.8260 KUBERNETES CLUSTER, 0.8258 docker image, 0.8252 EC2, 0.8210 docker hub, 0.8138 OPENSTACK
orm:Docker a orm:Vendor; a orm:Container; a orm:Open_Source; a orm:Commercial_Software; owl:sameAs dbr:Docker_%28software%29; skos:prefLabel "Docker"@en;
  • 10. Which parts do people or machines do best?
human scale: primary structure, control points, testability
machine generated: data products, ~80% of the graph
team goal: maintain structural correspondence between the layers
big win for AI: inferences across the graph
  • 12. Motivations: a new generation of NLP
  • 13. Statistical parsing
Long, long ago, the field of NLP used to be concerned with transformational grammars, linguistic theory by Chomsky, etc. Machine learning techniques allowed simpler approaches, i.e., statistical parsing. With the rise of Big Data, those techniques grew more effective – Google won the NIST 2005 Machine Translation Evaluation and remains a leader. More recently, deep learning helps extend performance (read: fewer errors) for statistical parsing and downstream uses.
  • 14. ML applied
Models allow NLP packages to:
▪ determine the natural language being used
▪ segment texts into paragraphs and sentences
▪ annotate words with part-of-speech
▪ find roots for each word (lemmas)
▪ chunk the noun phrases
▪ identify named entities
▪ correct spelling and grammar errors
▪ estimate scores for sentiment, readability, complexity
▪ etc.
  • 15. Common misconceptions
▪ generational lag: that keyword analysis ("bag of words"), stemming, co-occurrence, LDA, and other techniques from a prior generation remain current best practices
▪ vendor FUD: that NLP requires Big Data frameworks to run at scale
▪ either/or fallacy: that NLP results feeding into AI apps either must be fully automated or require lots of manual intervention
  • 16. What changed?
Hardware: multicore and large memory spaces, plus wide availability of cloud computing resources. The same data + algorithms which required a large Hadoop cluster in 2007 run faster in Python on my laptop in 2017. NLP benefited as hardware advances eliminated the need for computational shortcuts (e.g., stemming) and made more advanced approaches (e.g., graph algorithms) feasible. Also, subtle shifts as approximation algorithms became more integrated, e.g., substituting n-grams with skip-grams.
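To make the n-gram vs. skip-gram substitution concrete, here is a minimal pure-Python sketch (my own toy illustration, not code from the talk): contiguous bigrams link only adjacent tokens, while skip-grams also link tokens separated by up to k intervening tokens, which is what gives co-occurrence graphs their extra edges.

```python
def ngrams(tokens, n=2):
    """Contiguous n-grams: only adjacent windows of the token stream."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_grams(tokens, k=2):
    """Token pairs allowing up to k skipped tokens between the two members."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "graph algorithms enhance natural language processing".split()
print(ngrams(tokens))      # adjacent pairs only
print(skip_grams(tokens))  # also pairs bridging up to 2 intervening tokens
```

With k=0 the skip-gram pairs collapse back to plain bigrams; larger k trades precision for recall in the resulting graph.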
  • 17. Bag-of-words is not your friend
  • 18. Continued use of the outdated NLP approaches (bag-of-words, stemming, n-grams, plus the general class of Apache Solr) increases risk for "fake news bots" problems
  • 20. TextRank
"TextRank: Bringing Order into Texts"
Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural Language Processing (2004-07-25)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/
http://www.cse.unt.edu/~tarau/
  • 21. TextRank
"In this paper, we introduced TextRank – a graph-based ranking model for text processing, and show how it can be successfully used for natural language applications. In particular, we proposed and evaluated two innovative unsupervised approaches for keyword and sentence extraction, and showed that the accuracy achieved by TextRank in these applications is competitive with that of previously proposed state-of-the-art algorithms. An important aspect of TextRank is that it does not require deep linguistic knowledge, nor domain or language specific annotated corpora, which makes it highly portable to other domains, genres, or languages."
  • 22. TextRank
1. segment document into paragraphs and sentences
2. parse each sentence, find PoS annotations, lemmas
3. construct a graph where nouns, adjectives, verbs are vertices
   • links based on skip-grams
   • links based on repeated instances of the same root
4. stochastic link analysis (e.g., PageRank) identifies noun phrases which have high probability of inbound references
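The four steps above can be sketched in a few dozen lines of dependency-free Python. This is a toy illustration, not PyTextRank itself: it assumes tokens arrive already lemmatized and PoS-tagged (real implementations use spaCy for that step), keeps only nouns/adjectives/verbs as vertices, links co-occurrences within a small window, and runs a hand-rolled PageRank instead of NetworkX.

```python
from collections import defaultdict

def textrank_keywords(tagged, window=3, d=0.85, iters=30):
    """Toy TextRank: rank lemmas from a list of (lemma, pos) pairs."""
    # steps 1-2 assumed done: tokens are already lemmatized + PoS-tagged
    keep = {"NOUN", "ADJ", "VERB"}
    words = [w for w, pos in tagged if pos in keep]
    # step 3: undirected edges between vertices co-occurring within `window`
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    # step 4: hand-rolled PageRank over the co-occurrence graph
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        rank = {w: (1 - d) + d * sum(rank[u] / len(neighbors[u])
                                     for u in neighbors[w])
                for w in neighbors}
    return sorted(rank.items(), key=lambda kv: -kv[1])

# pre-tagged tokens for the example phrase used later in these slides
tagged = [("compatibility", "NOUN"), ("of", "ADP"), ("system", "NOUN"),
          ("of", "ADP"), ("linear", "ADJ"), ("constraint", "NOUN")]
for word, score in textrank_keywords(tagged):
    print(word, round(score, 3))
```

Vertices with more (and better-connected) inbound links accumulate higher rank; adjacent high-ranking nouns would then be merged into keyphrases.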
  • 23. TextRank leverages NLP and graph algorithms at scale, improving means to integrate Deep Learning plus Knowledge Graphs – for enhanced machine intelligence in text handling
  • 24. Graph terminology
▪ many real-world problems can have their data represented as graphs
▪ graphs can be converted into sparse matrices (a bridge into linear algebra)
▪ eigenvectors find the stable points in a system defined by matrices – often more efficient to compute
▪ beyond simple graphs, complex data may require work with tensors
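As a concrete illustration of the matrix/eigenvector bullets (a hypothetical toy example, not from the slides): represent a small graph as an adjacency matrix, then recover a per-vertex centrality score as the dominant eigenvector, computed by simple power iteration. The matrix is dense here for brevity; at scale it would be sparse.

```python
def power_iteration(A, iters=50):
    """Dominant eigenvector of a square matrix, L1-normalized."""
    n = len(A)
    v = [1.0 / n] * n
    for _ in range(iters):
        # multiply v by A, then renormalize
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in w)
        v = [x / norm for x in w]
    return v

# adjacency matrix: triangle 0-1-2 with vertex 3 hanging off vertex 2
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
centrality = power_iteration(A)
print(centrality)  # vertex 2 (highest degree) scores highest
```

The fixed point of this iteration is the "stable point" the slide refers to: repeated multiplication by the matrix converges to its dominant eigenvector, which is also the idea underneath PageRank.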
  • 26. By the numbers
"Compatibility of systems of linear constraints"
[{'index': 0, 'root': 'compatibility', 'tag': 'NNP', 'word': 'compatibility'},
 {'index': 1, 'root': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 2, 'root': 'system', 'tag': 'NNS', 'word': 'systems'},
 {'index': 3, 'root': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 4, 'root': 'linear', 'tag': 'JJ', 'word': 'linear'},
 {'index': 5, 'root': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]
  • 27. Important outcomes
Unsupervised methods which do not require subject domain info, but can use it when available.
Extracting key phrases from a text, in lieu of keywords, provides significantly more information content. For example, consider "deep learning" vs. "deep" and "learning".
Also, automated summarization can reduce human (editorial) time dramatically.
  • 29. Other implementations
Paco Nathan, 2008 – Java, Apache Hadoop / English+Spanish
https://github.com/ceteri/textrank/
Radim Řehůřek, 2009 – Python, NumPy, SciPy
https://radimrehurek.com/gensim/
Jeff Kubina, 2009 – Perl / English
http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm
Karin Christiansen, 2014 – Java / Icelandic
https://github.com/karchr/icetextsum
Paco Nathan, 2014 – Scala, Apache Spark / English
https://github.com/ceteri/spark-exercises
Got any others to add to this list??
  • 30. PyTextRank
Paco Nathan, 2016
Python implementation atop spaCy, NetworkX, datasketch:
▪ https://pypi.python.org/pypi/pytextrank/
▪ https://github.com/ceteri/pytextrank/
  • 31. PyTextRank: motivations
▪ spaCy is moving fast and benefits greatly by having an abstraction layer above it
▪ NetworkX allows use of multicore + large memory spaces for graph algorithms, obviating the need in many NLP use cases for Hadoop, Spark, etc.
▪ datasketch leverages approximation algorithms which make key features feasible, given the environments described above
▪ monolithic NLP frameworks often attempt to control the entire pipeline (instead of using modules or services) and are basically tech debt waiting to happen
  • 32. PyTextRank: innovations
▪ fixed a bug in the original paper's pseudocode; see the 2008 Java impl, used in production by ShareThis, Rolling Stone, etc.
▪ included verbs in the graph (though not in keyphrase output)
▪ added text scrubbing, unicode normalization
▪ replaced stemming with lemmatization
▪ can leverage named entity recognition, noun phrase chunking
▪ provides extractive summarization
▪ makes use of approximation algorithms, where appropriate
▪ applies use of ontology: random walk of hypernyms/hyponyms helps enrich the graph, which in turn improves performance
▪ integrates vector embedding + other graph algorithms to help construct or augment ontology from a corpus
  • 33. PyTextRank: summarization
1. parse a text to create its feature vector, i.e., the ranked keyphrases from TextRank
2. calculate semantic similarity between the feature vector and each sentence in the text
3. select the N top-ranked sentences, listed in order, to summarize
"Almost a year ago, we published our now-annual landscape of machine intelligence companies, and goodness have we seen a lot of activity since then. As has been the case for the last couple of years, our fund still obsesses over 'problem first' machine intelligence -- we've invested in 35 machine intelligence companies solving 35 meaningful problems in areas from security to recruiting to software development. At the same time, the hype around machine intelligence methods continues to grow: the words 'deep learning' now equally represent a series of meaningful breakthroughs (wonderful) but also a hyped phrase like 'big data' (not so good!). What's the biggest change in the last year?"
oreilly.com/ideas/the-current-state-of-machine-intelligence-3-0
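The three steps can be sketched as below. This is a simplified stand-in, not PyTextRank's API: the function name, the bag-of-words scoring, and the length-normalized overlap score are my own assumptions; the real implementation computes a richer semantic similarity against the TextRank feature vector.

```python
import math
import re

def summarize(text, keyphrases, n=2):
    """Toy extractive summarization: score sentences against ranked
    keyphrases, return the top-n sentences in document order."""
    # step 2 input: naive sentence segmentation on terminal punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # step 1: feature vector from (rank, phrase) pairs: word -> rank weight
    feat = {}
    for rank, phrase in keyphrases:
        for w in phrase.lower().split():
            feat[w] = feat.get(w, 0.0) + rank
    # step 2: overlap score, normalized by sentence length
    def score(sent):
        words = re.findall(r"\w+", sent.lower())
        return sum(feat.get(w, 0.0) for w in words) / (math.sqrt(len(words)) or 1.0)
    # step 3: top-n sentences, re-sorted into document order
    top = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))[:n]
    return [sentences[i] for i in sorted(top)]

keyphrases = [(0.19, "machine intelligence"), (0.12, "deep learning")]
text = ("Almost a year ago we published our landscape of machine intelligence companies. "
        "The weather was nice. "
        "The hype around deep learning and machine intelligence continues to grow.")
print(summarize(text, keyphrases, n=2))
```

Sentences sharing words with the top keyphrases rank highest, while the off-topic filler sentence drops out of the summary.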
  • 34. NLP + AI, by the numbers
1. pull data across various silos by federating queries
2. scrub text; normalize unicode
3. synthesize missing tables; impute missing values for metadata; parse/repair TOCs
4. extract document text from DOM + metadata; segment sentences
5. parse to get PoS tagging; lemmatize nouns and verbs; noun phrase chunking; named entity recognition
6. construct a graph for each document
7. enrich graph using ontology to add links among lemmas, named entities, etc.
8. build feature vectors with TextRank
9. use feature vectors + semantic similarity for extractive summarization
10. use feature vectors + vector embedding to expand queries and augment ontology
11. apply various graph algorithms as high-pass filters
12. build priors based on authors, publishers, titles, publication dates
13. disambiguate topic annotations using active learning
14. extract key features from annotation models for subsequent graph enrichment
15. use classifiers to predict language and complexity
16. build in-memory semantic indexing, auto-completion, content recommendations, gap analysis, etc.
17. run monte carlo methods to rank topics by usage
18. rinse, lather, repeat
  • 36. O'Reilly Media is actively working with other like-minded organizations to drive machine intelligence use cases in Media, through open source
  • 37. PyTextRank: upcoming
Package installation is currently integrated with PyPi; 2017-Q4: adding support for Anaconda.
Also working on:
▪ more efficient integration with spaCy, e.g., spans
▪ more coding examples in the wiki, especially using Jupyter
▪ support for ReadTheDocs
▪ better handling for character encoding issues; see "Character Filtering" by Daniel Tunkelang, along with his entire series Query Understanding
▪ more integration into coursework; see Get Started with NLP in Python on O'Reilly
  • 38. PyTextRank: upcoming
One compelling reason for using TextRank is to leverage graph algorithms. Using an ontology to enrich the graph (prior to the "PageRank" step) has been shown to improve NLP results – thus leveraging graph algorithms. A major upcoming feature in PyTextRank will take an ontology (e.g., SKOS) as an input parameter. Working now to train/focus spaCy NER using an input ontology, as an automated preprocessing step for TextRank.
  • 39. PyTextRank: upcoming
Disambiguating context is a hard problem in NLP, and relatively lacking in the available open source packages. For example: "ios operating system" – Apple or Cisco? O'Reilly Media leverages a human-in-the-loop design pattern to build disambiguation pipelines through active learning, which in turn uses and helps extend our ontology work.
  • 40. PyTextRank: upcoming
Working on improved summarization:
▪ currently provides extractive summarization
▪ working toward abstractive summarization; evaluating integration with TensorFlow
See also:
▪ Text Summarization in Python, 2017-04-05
▪ Text summarization, topic models and RNNs, 2016-09-25
Overall, we're focused on improving work with video for O'Reilly Media use cases, e.g., parsing and repairing automated transcriptions.
  • 41. Strata Data: SG, Dec 4-7; SJ, Mar 5-8; UK, May 21-24; CN, Jul 12-15
The AI Conf: CN, Apr 10-13; NY, Apr 29-May 2; SF, Sep 4-7; UK, Oct 8-11
JupyterCon: NY, Aug 21-24
OSCON (returns!): PDX, Jul 16-19, 2018
  • 42. Learn Alongside Innovators · Just Enough Math · Building Data Science Teams · Hylbert-Speys · How Do You Learn?
updates, reviews, conference summaries… liber118.com/pxn/ @pacoid