SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
PyTextRank:	
  

graph	
  algorithms	
  for	
  enhanced	
  NLP
Paco	
  Nathan	
  @pacoid	
  
Dir,	
  Learning	
  Group	
  @	
  O’Reilly	
  Media	
  
DSSG’17,	
  Singapore	
  2017-­‐12-­‐06
Background:	
  
helping	
  people	
  learn
2
AI	
  in	
  Media
▪ content	
  which	
  can	
  represented	
  as	
  

text	
  can	
  be	
  parsed	
  by	
  NLP,	
  then	
  
manipulated	
  by	
  available	
  AI	
  tooling	
  	
  
▪ labeled	
  images	
  get	
  really	
  interesting	
  
▪ text	
  or	
  images	
  within	
  a	
  context	
  have	
  

inherent	
  structure	
  
▪ representation	
  of	
  that	
  kind	
  of	
  structure	
  
is	
  rare	
  in	
  the	
  Media	
  vertical	
  –	
  so	
  far
3
{"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49],
"take", "take", "VB", 1, 50], [0, "a", "a", "DT", 0, 51], [23, "look", "l
"NN", 1, 52], [0, "at", "at", "IN", 0, 53], [0, "a", "a", "DT", 0, 54], [
"few", "few", "JJ", 1, 55], [25, "examples", "example", "NNS", 1, 56], [0
"often", "often", "RB", 0, 57], [0, "when", "when", "WRB", 0, 58], [11,
"people", "people", "NNS", 1, 59], [2, "are", "be", "VBP", 1, 60], [26, "
"first", "JJ", 1, 61], [27, "learning", "learn", "VBG", 1, 62], [0, "abou
"about", "IN", 0, 63], [28, "Docker", "docker", "NNP", 1, 64], [0, "they"
"they", "PRP", 0, 65], [29, "try", "try", "VBP", 1, 66], [0, "and", "and"
0, 67], [30, "put", "put", "VBP", 1, 68], [0, "it", "it", "PRP", 0, 69],
"in", "in", "IN", 0, 70], [0, "one", "one", "CD", 0, 71], [0, "of", "of",
0, 72], [0, "a", "a", "DT", 0, 73], [24, "few", "few", "JJ", 1, 74], [31,
"existing", "existing", "JJ", 1, 75], [18, "categories", "category", "NNS
76], [0, "sometimes", "sometimes", "RB", 0, 77], [11, "people", "people",
1, 78], [9, "think", "think", "VBP", 1, 79], [0, "it", "it", "PRP", 0, 80
"'s", "be", "VBZ", 1, 81], [0, "a", "a", "DT", 0, 82], [32, "virtualizati
"virtualization", "NN", 1, 83], [19, "tool", "tool", "NN", 1, 84], [0, "l
"like", "IN", 0, 85], [33, "VMware", "vmware", "NNP", 1, 86], [0, "or", "
"CC", 0, 87], [34, "virtualbox", "virtualbox", "NNP", 1, 88], [0, "also",
"also", "RB", 0, 89], [35, "known", "know", "VBN", 1, 90], [0, "as", "as"
0, 91], [0, "a", "a", "DT", 0, 92], [36, "hypervisor", "hypervisor", "NN"
93], [0, "these", "these", "DT", 0, 94], [2, "are", "be", "VBP", 1, 95],
"tools", "tool", "NNS", 1, 96], [0, "which", "which", "WDT", 0, 97], [2,
"be", "VBP", 1, 98], [37, "emulating", "emulate", "VBG", 1, 99], [38,
"hardware", "hardware", "NN", 1, 100], [0, "for", "for", "IN", 0, 101], [
"virtual", "virtual", "JJ", 1, 102], [40, "software", "software", "NN", 1
103]], "id": "001.video197359", "sha1":
"4b69cf60f0497887e3776619b922514f2e5b70a8"}
Video	
  transcription
4
{"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"}
{"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"}
{"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"}
{"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"}
{"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"}
{"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"}
{"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"}
{"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"}
{"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"}
Transcript: let's take a look at a few examples often when
people are first learning about Docker they try and put it in
one of a few existing categories sometimes people think it's
a virtualization tool like VMware or virtualbox also known as
a hypervisor these are tools which are emulating hardware for
virtual software
Confidence: 0.973419129848
39 KUBERNETES
0.8747 coreos
0.8624 etcd
0.8478 DOCKER CONTAINERS
0.8458 mesos
0.8406 DOCKER
0.8354 DOCKER CONTAINER
0.8260 KUBERNETES CLUSTER
0.8258 docker image
0.8252 EC2
0.8210 docker hub
0.8138 OPENSTACK
orm:Docker a orm:Vendor;
a orm:Container;
a orm:Open_Source;
a orm:Commercial_Software;
owl:sameAs dbr:Docker_%28software%29;
skos:prefLabel "Docker"@en;
Knowledge	
  Graph
▪ used	
  to	
  construct	
  an	
  ontology	
  about	
  
technology,	
  based	
  on	
  learning	
  
materials	
  from	
  200+	
  publishers	
  
▪ uses	
  SKOS	
  as	
  a	
  foundation,	
  ties	
  into	
  

US	
  Library	
  of	
  Congress	
  and	
  DBpedia	
  

as	
  upper	
  ontologies	
  
▪ primary	
  structure	
  is	
  “human	
  scale”,	
  

used	
  as	
  control	
  points	
  
▪ majority	
  (>90%)	
  of	
  the	
  graph	
  

comes	
  from	
  machine	
  generated	
  

data	
  products
5
6
7
UX	
  for	
  content	
  discovery:	
  
▪ partly	
  generated	
  +	
  curated	
  by	
  humans	
  
▪ partly	
  generated	
  +	
  curated	
  by	
  AI	
  apps
Motivations:	
  
a	
  new	
  generation	
  of	
  NLP
8
Statistical	
  parsing
9
Long,	
  long	
  ago,	
  the	
  field	
  of	
  NLP	
  used	
  to	
  be	
  concerned	
  about	
  
transformational	
  grammars,	
  linguistic	
  theory	
  by	
  Chomsky,	
  etc.	
  
Machine	
  learning	
  techniques	
  allowed	
  simpler	
  approaches,	
  

i.e.,	
  statistical	
  parsing.	
  With	
  the	
  rise	
  of	
  Big	
  Data,	
  those	
  
techniques	
  grew	
  more	
  effective	
  –	
  Google	
  won	
  the	
  NIST	
  2005	
  
Machine	
  Translation	
  Evaluation	
  and	
  remains	
  a	
  leader.	
  
More	
  recently,	
  deep	
  learning	
  helps	
  extend	
  performance	
  

(read:	
  less	
  errors)	
  for	
  statistical	
  parsing	
  and	
  downstream	
  uses.
What	
  changed?
10
Hardware:	
  multicore	
  and	
  large	
  memory	
  spaces,	
  plus	
  wide	
  
availability	
  of	
  cloud	
  computing	
  resources.	
  
The	
  same	
  data	
  +	
  algorithms	
  which	
  required	
  a	
  large	
  Hadoop	
  
cluster	
  in	
  2007,	
  run	
  faster	
  in	
  Python	
  on	
  my	
  laptop	
  in	
  2017.	
  
NLP	
  benefited	
  as	
  hardware	
  advances	
  eliminated	
  the	
  need	
  for	
  
computational	
  shortcuts	
  (e.g.	
  stemming)	
  and	
  made	
  more	
  
advanced	
  approaches	
  (e.g.,	
  graph	
  algorithms)	
  become	
  feasible.	
  
Also,	
  subtle	
  shifts	
  as	
  approximation	
  algorithms	
  became	
  more	
  
integrated,	
  e.g.,	
  substituting	
  n-­‐grams	
  with	
  skip-­‐grams.
Bag-­‐of-­‐words	
  is	
  not	
  your	
  friend
11
Advances:	
  
graph	
  algorithms	
  for	
  NLP
12
TextRank
13
“TextRank:	
  Bringing	
  Order	
  into	
  Texts”

Rada	
  Mihalcea,	
  Paul	
  Tarau

Conference	
  on	
  Empirical	
  Methods	
  in	
  
Natural	
  Language	
  Processing	
  (2004-­‐07-­‐25)

https://goo.gl/AJnA76	
  
http://web.eecs.umich.edu/~mihalcea/	
  
http://www.cse.unt.edu/~tarau/
TextRank
14
“In	
  this	
  paper,	
  we	
  introduced	
  TextRank	
  –	
  a	
  graph-­‐based	
  ranking	
  
model	
  for	
  text	
  processing,	
  and	
  show	
  how	
  it	
  can	
  be	
  successfully	
  used	
  
for	
  natural	
  language	
  applications.	
  In	
  particular,	
  we	
  proposed	
  and	
  
evaluated	
  two	
  innovative	
  unsupervised	
  approaches	
  for	
  keyword	
  and	
  
sentence	
  extraction,	
  and	
  showed	
  that	
  the	
  accuracy	
  achieved	
  by	
  
TextRank	
  in	
  these	
  applications	
  is	
  competitive	
  with	
  that	
  of	
  previously	
  
proposed	
  state-­‐of-­‐the-­‐art	
  algorithms.	
  An	
  important	
  aspect	
  of	
  
TextRank	
  is	
  that	
  it	
  does	
  not	
  require	
  deep	
  linguistic	
  knowledge,	
  nor	
  
domain	
  or	
  language	
  specific	
  annotated	
  corpora,	
  which	
  makes	
  it	
  
highly	
  portable	
  to	
  other	
  domains,	
  genres,	
  or	
  languages.”	
  
TextRank
1. segment	
  document	
  into	
  paragraphs	
  and	
  
sentences	
  
2. parse	
  each	
  sentence,	
  find	
  PoS	
  annotations,	
  
lemmas	
  
3. construct	
  a	
  graph	
  where	
  nouns,	
  adjectives,	
  
verbs	
  are	
  vertices	
  
• links	
  based	
  on	
  skip-­‐grams	
  
• links	
  based	
  on	
  repeated	
  instances	
  of	
  the	
  same	
  root	
  
4. stochastic	
  link	
  analysis	
  (e.g.,	
  PageRank)	
  
identifies	
  noun	
  phrases	
  which	
  have	
  high	
  
probability	
  of	
  inbound	
  references
15
How	
  does	
  this	
  work?
graph	
  analysis,	
  e.g.,	
  centrality,	
  causes	
  the	
  
most	
  “central”	
  phrases	
  to	
  be	
  ranked	
  higher
16
Unsupervised	
  summarization
17
Important	
  outcomes
18
Unsupervised	
  methods	
  which	
  do	
  not	
  require	
  subject	
  domain	
  
info,	
  but	
  can	
  use	
  it	
  when	
  available	
  
Extracting	
  key	
  phrases	
  from	
  a	
  text,	
  in	
  lieu	
  of	
  keywords,	
  provides	
  
significantly	
  more	
  information	
  content.	
  
For	
  example,	
  consider	
  “deep	
  learning”	
  vs.	
  “deep”	
  and	
  “learning”.	
  
Also,	
  automated	
  summarization	
  can	
  reduce	
  human	
  (editorial)	
  
time	
  dramatically.	
  
Implementations:	
  
various	
  trade-­‐offs
19
Other	
  implementations
20
Paco	
  Nathan,	
  2008

Java,	
  Apache	
  Hadoop	
  /	
  English+Spanish

https://github.com/ceteri/textrank/	
  
Radim	
  Řehůřek,	
  2009

Python,	
  NumPy,	
  SciPy

https://radimrehurek.com/gensim/	
  
Jeff	
  Kubina,	
  2009

Perl	
  /	
  English

http://search.cpan.org/~kubina/Text-­‐Categorize-­‐Textrank-­‐0.51/lib/Text/Categorize/Textrank/En.pm	
  
Karin	
  Christiansen,	
  2014

Java	
  /	
  Icelandic

https://github.com/karchr/icetextsum	
  
Paco	
  Nathan,	
  2014

Scala,	
  Apache	
  Spark	
  /	
  English

https://github.com/ceteri/spark-­‐exercises	
  
Got	
  any	
  others	
  to	
  add	
  to	
  this	
  list??
PyTextRank
Paco	
  Nathan,	
  2016	
  
Python	
  implementation	
  atop	
  spaCy,	
  NetworkX,	
  datasketch:	
  
▪ https://pypi.python.org/pypi/pytextrank/	
  
▪ https://github.com/ceteri/pytextrank/
PyTextRank:	
  motivations
▪ spaCy	
  is	
  moving	
  fast	
  and	
  benefits	
  greatly	
  by	
  having	
  an	
  
abstraction	
  layer	
  above	
  it	
  
▪ NetworkX	
  allows	
  use	
  of	
  multicore	
  +	
  large	
  memory	
  spaces	
  
for	
  graph	
  algorithms,	
  obviating	
  the	
  need	
  in	
  many	
  NLP	
  use	
  
cases	
  for	
  Hadoop,	
  Spark,	
  etc.	
  
▪ datasketch	
  leverages	
  approximation	
  algorithms	
  which	
  
make	
  key	
  features	
  feasible,	
  given	
  the	
  environments	
  
described	
  above	
  
▪ monolithic	
  NLP	
  frameworks	
  often	
  attempt	
  to	
  control	
  

the	
  entire	
  pipeline	
  (instead	
  of	
  using	
  modules	
  or	
  services)	
  

and	
  are	
  basically	
  tech	
  debt	
  waiting	
  to	
  happen	
  
PyTextRank:	
  innovations
▪ fixed	
  bug	
  in	
  the	
  original	
  paper’s	
  pseudocode;	
  see	
  Java	
  impl	
  2008	
  
used	
  in	
  production	
  by	
  ShareThis,	
  Rolling	
  Stone,	
  etc.	
  
▪ included	
  verbs	
  into	
  the	
  graph	
  (though	
  not	
  in	
  keyphrase	
  output)	
  
▪ added	
  text	
  scrubbing,	
  unicode	
  normalization	
  
▪ replaced	
  stemming	
  with	
  lemmatization	
  
▪ can	
  leverage	
  named	
  entity	
  recognition,	
  noun	
  phrase	
  chunking	
  
▪ provides	
  extractive	
  summarization	
  
▪ makes	
  use	
  of	
  approximation	
  algorithms,	
  where	
  appropriate	
  
▪ applies	
  use	
  of	
  ontology:	
  random	
  walk	
  of	
  hypernyms/hyponyms	
  
helps	
  enrich	
  the	
  graph,	
  which	
  in	
  turn	
  improves	
  performance	
  
▪ integrates	
  vector	
  embedding	
  +	
  other	
  graph	
  algorithms	
  to	
  help	
  
construct	
  or	
  augment	
  ontology	
  from	
  a	
  corpus
PyTextRank:	
  upcoming
Package	
  installation	
  is	
  currently	
  integrated	
  with	
  PyPi	
  
2017-­‐Q4:	
  adding	
  support	
  for	
  Anaconda	
  
Also	
  working	
  on:	
  
▪ integration	
  with	
  spaCy	
  2.x,	
  e.g.,	
  spans	
  
▪ more	
  coding	
  examples	
  in	
  the	
  wiki,	
  especially	
  using	
  Jupyter	
  
▪ better	
  handling	
  for	
  character	
  encoding	
  issues;	
  

see	
  “Character	
  Filtering”	
  by	
  Daniel	
  Tunkelang,	
  

along	
  with	
  his	
  entire	
  series	
  Query	
  Understanding	
  
▪ more	
  integration	
  into	
  coursework;	
  

see	
  Get	
  Started	
  with	
  NLP	
  in	
  Python	
  on	
  O’Reilly
Strata	
  Data	
  
SG,	
  Dec	
  4-­‐7

SJ,	
  Mar	
  5-­‐8

UK,	
  May	
  21-­‐24

CN,	
  Jul	
  12-­‐15	
  
The	
  AI	
  Conf	
  
CN	
  Apr	
  10-­‐13

NY,	
  Apr	
  29-­‐May	
  2

SF,	
  Sep	
  4-­‐7

UK,	
  Oct	
  8-­‐11	
  
JupyterCon	
  
NY,	
  Aug	
  21-­‐24	
  
OSCON	
  
PDX,	
  Jul	
  16-­‐19,	
  2018
25
26
Learn	
  Alongside

Innovators
Just	
  Enough	
  Math Building	
  Data	
  
Science	
  Teams
Hylbert-­‐Speys How	
  Do	
  You	
  Learn?
updates,	
  reviews,	
  conference	
  summaries…	
  
liber118.com/pxn/

@pacoid
PyTextRank: Graph algorithms for enhanced natural language processing

Contenu connexe

Plus de Paco Nathan

Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 

Plus de Paco Nathan (20)

Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 

Dernier

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Dernier (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

PyTextRank: Graph algorithms for enhanced natural language processing

  • 1. PyTextRank:  
 graph  algorithms  for  enhanced  NLP Paco  Nathan  @pacoid   Dir,  Learning  Group  @  O’Reilly  Media   DSSG’17,  Singapore  2017-­‐12-­‐06
  • 3. AI  in  Media ▪ content  which  can  represented  as  
 text  can  be  parsed  by  NLP,  then   manipulated  by  available  AI  tooling     ▪ labeled  images  get  really  interesting   ▪ text  or  images  within  a  context  have  
 inherent  structure   ▪ representation  of  that  kind  of  structure   is  rare  in  the  Media  vertical  –  so  far 3
  • 4. {"graf": [[21, "let", "let", "VB", 1, 48], [0, "'s", "'s", "PRP", 0, 49], "take", "take", "VB", 1, 50], [0, "a", "a", "DT", 0, 51], [23, "look", "l "NN", 1, 52], [0, "at", "at", "IN", 0, 53], [0, "a", "a", "DT", 0, 54], [ "few", "few", "JJ", 1, 55], [25, "examples", "example", "NNS", 1, 56], [0 "often", "often", "RB", 0, 57], [0, "when", "when", "WRB", 0, 58], [11, "people", "people", "NNS", 1, 59], [2, "are", "be", "VBP", 1, 60], [26, " "first", "JJ", 1, 61], [27, "learning", "learn", "VBG", 1, 62], [0, "abou "about", "IN", 0, 63], [28, "Docker", "docker", "NNP", 1, 64], [0, "they" "they", "PRP", 0, 65], [29, "try", "try", "VBP", 1, 66], [0, "and", "and" 0, 67], [30, "put", "put", "VBP", 1, 68], [0, "it", "it", "PRP", 0, 69], "in", "in", "IN", 0, 70], [0, "one", "one", "CD", 0, 71], [0, "of", "of", 0, 72], [0, "a", "a", "DT", 0, 73], [24, "few", "few", "JJ", 1, 74], [31, "existing", "existing", "JJ", 1, 75], [18, "categories", "category", "NNS 76], [0, "sometimes", "sometimes", "RB", 0, 77], [11, "people", "people", 1, 78], [9, "think", "think", "VBP", 1, 79], [0, "it", "it", "PRP", 0, 80 "'s", "be", "VBZ", 1, 81], [0, "a", "a", "DT", 0, 82], [32, "virtualizati "virtualization", "NN", 1, 83], [19, "tool", "tool", "NN", 1, 84], [0, "l "like", "IN", 0, 85], [33, "VMware", "vmware", "NNP", 1, 86], [0, "or", " "CC", 0, 87], [34, "virtualbox", "virtualbox", "NNP", 1, 88], [0, "also", "also", "RB", 0, 89], [35, "known", "know", "VBN", 1, 90], [0, "as", "as" 0, 91], [0, "a", "a", "DT", 0, 92], [36, "hypervisor", "hypervisor", "NN" 93], [0, "these", "these", "DT", 0, 94], [2, "are", "be", "VBP", 1, 95], "tools", "tool", "NNS", 1, 96], [0, "which", "which", "WDT", 0, 97], [2, "be", "VBP", 1, 98], [37, "emulating", "emulate", "VBG", 1, 99], [38, "hardware", "hardware", "NN", 1, 100], [0, "for", "for", "IN", 0, 101], [ "virtual", "virtual", "JJ", 1, 102], [40, "software", "software", "NN", 1 103]], "id": "001.video197359", "sha1": "4b69cf60f0497887e3776619b922514f2e5b70a8"} Video  transcription 4 {"count": 2, "ids": [32, 19], "pos": "np", "rank": 0.0194, "text": "virtualization tool"} {"count": 2, "ids": [40, 69], "pos": "np", "rank": 0.0117, "text": "software applications"} {"count": 4, "ids": [38], "pos": "np", "rank": 0.0114, "text": "hardware"} {"count": 2, "ids": [33, 36], "pos": "np", "rank": 0.0099, "text": "vmware hypervisor"} {"count": 4, "ids": [28], "pos": "np", "rank": 0.0096, "text": "docker"} {"count": 4, "ids": [34], "pos": "np", "rank": 0.0094, "text": "virtualbox"} {"count": 10, "ids": [11], "pos": "np", "rank": 0.0049, "text": "people"} {"count": 4, "ids": [37], "pos": "vbg", "rank": 0.0026, "text": "emulating"} {"count": 2, "ids": [27], "pos": "vbg", "rank": 0.0016, "text": "learning"} Transcript: let's take a look at a few examples often when people are first learning about Docker they try and put it in one of a few existing categories sometimes people think it's a virtualization tool like VMware or virtualbox also known as a hypervisor these are tools which are emulating hardware for virtual software Confidence: 0.973419129848 39 KUBERNETES 0.8747 coreos 0.8624 etcd 0.8478 DOCKER CONTAINERS 0.8458 mesos 0.8406 DOCKER 0.8354 DOCKER CONTAINER 0.8260 KUBERNETES CLUSTER 0.8258 docker image 0.8252 EC2 0.8210 docker hub 0.8138 OPENSTACK orm:Docker a orm:Vendor; a orm:Container; a orm:Open_Source; a orm:Commercial_Software; owl:sameAs dbr:Docker_%28software%29; skos:prefLabel "Docker"@en;
  • 5. Knowledge  Graph ▪ used  to  construct  an  ontology  about   technology,  based  on  learning   materials  from  200+  publishers   ▪ uses  SKOS  as  a  foundation,  ties  into  
 US  Library  of  Congress  and  DBpedia  
 as  upper  ontologies   ▪ primary  structure  is  “human  scale”,  
 used  as  control  points   ▪ majority  (>90%)  of  the  graph  
 comes  from  machine  generated  
 data  products 5
  • 6. 6
  • 7. 7 UX  for  content  discovery:   ▪ partly  generated  +  curated  by  humans   ▪ partly  generated  +  curated  by  AI  apps
  • 8. Motivations:   a  new  generation  of  NLP 8
  • 9. Statistical  parsing 9 Long,  long  ago,  the  field  of  NLP  used  to  be  concerned  about   transformational  grammars,  linguistic  theory  by  Chomsky,  etc.   Machine  learning  techniques  allowed  simpler  approaches,  
 i.e.,  statistical  parsing.  With  the  rise  of  Big  Data,  those   techniques  grew  more  effective  –  Google  won  the  NIST  2005   Machine  Translation  Evaluation  and  remains  a  leader.   More  recently,  deep  learning  helps  extend  performance  
 (read:  less  errors)  for  statistical  parsing  and  downstream  uses.
  • 10. What  changed? 10 Hardware:  multicore  and  large  memory  spaces,  plus  wide   availability  of  cloud  computing  resources.   The  same  data  +  algorithms  which  required  a  large  Hadoop   cluster  in  2007,  run  faster  in  Python  on  my  laptop  in  2017.   NLP  benefited  as  hardware  advances  eliminated  the  need  for   computational  shortcuts  (e.g.  stemming)  and  made  more   advanced  approaches  (e.g.,  graph  algorithms)  become  feasible.   Also,  subtle  shifts  as  approximation  algorithms  became  more   integrated,  e.g.,  substituting  n-­‐grams  with  skip-­‐grams.
  • 11. Bag-­‐of-­‐words  is  not  your  friend 11
  • 13. TextRank 13 “TextRank:  Bringing  Order  into  Texts”
 Rada  Mihalcea,  Paul  Tarau
 Conference  on  Empirical  Methods  in   Natural  Language  Processing  (2004-­‐07-­‐25)
 https://goo.gl/AJnA76   http://web.eecs.umich.edu/~mihalcea/   http://www.cse.unt.edu/~tarau/
  • 14. TextRank 14 “In  this  paper,  we  introduced  TextRank  –  a  graph-­‐based  ranking   model  for  text  processing,  and  show  how  it  can  be  successfully  used   for  natural  language  applications.  In  particular,  we  proposed  and   evaluated  two  innovative  unsupervised  approaches  for  keyword  and   sentence  extraction,  and  showed  that  the  accuracy  achieved  by   TextRank  in  these  applications  is  competitive  with  that  of  previously   proposed  state-­‐of-­‐the-­‐art  algorithms.  An  important  aspect  of   TextRank  is  that  it  does  not  require  deep  linguistic  knowledge,  nor   domain  or  language  specific  annotated  corpora,  which  makes  it   highly  portable  to  other  domains,  genres,  or  languages.”  
  • 15. TextRank 1. segment  document  into  paragraphs  and   sentences   2. parse  each  sentence,  find  PoS  annotations,   lemmas   3. construct  a  graph  where  nouns,  adjectives,   verbs  are  vertices   • links  based  on  skip-­‐grams   • links  based  on  repeated  instances  of  the  same  root   4. stochastic  link  analysis  (e.g.,  PageRank)   identifies  noun  phrases  which  have  high   probability  of  inbound  references 15
  • 16. How  does  this  work? graph  analysis,  e.g.,  centrality,  causes  the   most  “central”  phrases  to  be  ranked  higher 16
  • 18. Important  outcomes 18 Unsupervised  methods  which  do  not  require  subject  domain   info,  but  can  use  it  when  available   Extracting  key  phrases  from  a  text,  in  lieu  of  keywords,  provides   significantly  more  information  content.   For  example,  consider  “deep  learning”  vs.  “deep”  and  “learning”.   Also,  automated  summarization  can  reduce  human  (editorial)   time  dramatically.  
  • 20. Other  implementations 20 Paco  Nathan,  2008
 Java,  Apache  Hadoop  /  English+Spanish
 https://github.com/ceteri/textrank/   Radim  Řehůřek,  2009
 Python,  NumPy,  SciPy
 https://radimrehurek.com/gensim/   Jeff  Kubina,  2009
 Perl  /  English
 http://search.cpan.org/~kubina/Text-­‐Categorize-­‐Textrank-­‐0.51/lib/Text/Categorize/Textrank/En.pm   Karin  Christiansen,  2014
 Java  /  Icelandic
 https://github.com/karchr/icetextsum   Paco  Nathan,  2014
 Scala,  Apache  Spark  /  English
 https://github.com/ceteri/spark-­‐exercises   Got  any  others  to  add  to  this  list??
  • 21. PyTextRank Paco  Nathan,  2016   Python  implementation  atop  spaCy,  NetworkX,  datasketch:   ▪ https://pypi.python.org/pypi/pytextrank/   ▪ https://github.com/ceteri/pytextrank/
  • 22. PyTextRank:  motivations ▪ spaCy  is  moving  fast  and  benefits  greatly  by  having  an   abstraction  layer  above  it   ▪ NetworkX  allows  use  of  multicore  +  large  memory  spaces   for  graph  algorithms,  obviating  the  need  in  many  NLP  use   cases  for  Hadoop,  Spark,  etc.   ▪ datasketch  leverages  approximation  algorithms  which   make  key  features  feasible,  given  the  environments   described  above   ▪ monolithic  NLP  frameworks  often  attempt  to  control  
 the  entire  pipeline  (instead  of  using  modules  or  services)  
 and  are  basically  tech  debt  waiting  to  happen  
  • 23. PyTextRank:  innovations ▪ fixed  bug  in  the  original  paper’s  pseudocode;  see  Java  impl  2008   used  in  production  by  ShareThis,  Rolling  Stone,  etc.   ▪ included  verbs  into  the  graph  (though  not  in  keyphrase  output)   ▪ added  text  scrubbing,  unicode  normalization   ▪ replaced  stemming  with  lemmatization   ▪ can  leverage  named  entity  recognition,  noun  phrase  chunking   ▪ provides  extractive  summarization   ▪ makes  use  of  approximation  algorithms,  where  appropriate   ▪ applies  use  of  ontology:  random  walk  of  hypernyms/hyponyms   helps  enrich  the  graph,  which  in  turn  improves  performance   ▪ integrates  vector  embedding  +  other  graph  algorithms  to  help   construct  or  augment  ontology  from  a  corpus
  • 24. PyTextRank:  upcoming Package  installation  is  currently  integrated  with  PyPi   2017-­‐Q4:  adding  support  for  Anaconda   Also  working  on:   ▪ integration  with  spaCy  2.x,  e.g.,  spans   ▪ more  coding  examples  in  the  wiki,  especially  using  Jupyter   ▪ better  handling  for  character  encoding  issues;  
 see  “Character  Filtering”  by  Daniel  Tunkelang,  
 along  with  his  entire  series  Query  Understanding   ▪ more  integration  into  coursework;  
 see  Get  Started  with  NLP  in  Python  on  O’Reilly
  • 25. Strata  Data   SG,  Dec  4-­‐7
 SJ,  Mar  5-­‐8
 UK,  May  21-­‐24
 CN,  Jul  12-­‐15   The  AI  Conf   CN  Apr  10-­‐13
 NY,  Apr  29-­‐May  2
 SF,  Sep  4-­‐7
 UK,  Oct  8-­‐11   JupyterCon   NY,  Aug  21-­‐24   OSCON   PDX,  Jul  16-­‐19,  2018 25
  • 26. 26 Learn  Alongside
 Innovators Just  Enough  Math Building  Data   Science  Teams Hylbert-­‐Speys How  Do  You  Learn? updates,  reviews,  conference  summaries…   liber118.com/pxn/
 @pacoid