See 2020 update: https://derwen.ai/s/h88s
DSSG'17 meetup 2017-12-07
https://www.meetup.com/DataScience-SG-Singapore/events/244885954/
PyTextRank: Graph algorithms for enhanced natural language processing by Paco Nathan (O’Reilly Media). In this talk, Paco illustrates PyTextRank use cases in media and learning to enable semi-supervised word sense disambiguation, move from natural language parsing to natural language understanding, and implement AI-based video search and approximation algorithms for content recommendation based on semantic similarity.
3. AI in Media
▪ content which can be represented as text can be parsed by NLP, then manipulated by available AI tooling
▪ labeled images get really interesting
▪ text or images within a context have inherent structure
▪ representation of that kind of structure is rare in the Media vertical – so far
5. Knowledge Graph
▪ used to construct an ontology about technology, based on learning materials from 200+ publishers
▪ uses SKOS as a foundation, ties into US Library of Congress and DBpedia as upper ontologies
▪ primary structure is “human scale”, used as control points
▪ majority (>90%) of the graph comes from machine-generated data products
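To make the SKOS foundation concrete, here is a minimal sketch of one concept entry built with rdflib; the namespace, concept names, and links are hypothetical illustrations, not entries from the actual ontology described above.

# Minimal sketch of a SKOS-style concept entry using rdflib.
# The URIs, labels, and links below are hypothetical examples,
# not drawn from the ontology described in the talk.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
EX = Namespace("http://example.org/concepts/")   # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)

# one "human scale" control point: a concept about deep learning
c = EX["deep_learning"]
g.add((c, RDF.type, SKOS.Concept))
g.add((c, SKOS.prefLabel, Literal("deep learning", lang="en")))
g.add((c, SKOS.altLabel, Literal("deep neural networks", lang="en")))
g.add((c, SKOS.broader, EX["machine_learning"]))   # link within the ontology
# tie into an upper ontology, e.g., DBpedia
g.add((c, SKOS.exactMatch, URIRef("http://dbpedia.org/resource/Deep_learning")))

print(g.serialize(format="turtle"))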
9. Statistical parsing
Long, long ago, the field of NLP was concerned with transformational grammars, Chomskyan linguistic theory, etc. Machine learning techniques allowed simpler approaches, i.e., statistical parsing. With the rise of Big Data, those techniques grew more effective – Google won the NIST 2005 Machine Translation Evaluation and remains a leader. More recently, deep learning helps extend performance (read: fewer errors) for statistical parsing and downstream uses.
10. What changed?
Hardware: multicore processors and large memory spaces, plus wide availability of cloud computing resources. The same data + algorithms that required a large Hadoop cluster in 2007 run faster in Python on my laptop in 2017. NLP benefited as hardware advances eliminated the need for computational shortcuts (e.g., stemming) and made more advanced approaches (e.g., graph algorithms) feasible. Also, subtle shifts came as approximation algorithms became more integrated, e.g., substituting n-grams with skip-grams.
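The shift from n-grams to skip-grams can be shown in a few lines of generic Python; this is only an illustration of the idea, not code from PyTextRank.

# Generic illustration of n-grams vs. skip-grams (not PyTextRank code).
# A skip-gram relaxes adjacency: tokens may be up to k positions apart.
from itertools import combinations

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n=2, k=1):
    # pairs (for n=2) drawn from a window of n + k tokens
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i:i + n + k]
        for combo in combinations(range(len(window)), n):
            if combo[0] == 0:               # anchor on the leading token
                grams.add(tuple(window[j] for j in combo))
    return sorted(grams)

tokens = "graph algorithms enhance natural language processing".split()
print(ngrams(tokens))      # only adjacent pairs
print(skipgrams(tokens))   # adjacent pairs plus pairs that skip one token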
13. TextRank
“TextRank: Bringing Order into Texts”
Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural Language Processing (2004-07-25)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/
http://www.cse.unt.edu/~tarau/
14. TextRank
“In this paper, we introduced TextRank – a graph-based ranking model for text processing, and show how it can be successfully used for natural language applications. In particular, we proposed and evaluated two innovative unsupervised approaches for keyword and sentence extraction, and showed that the accuracy achieved by TextRank in these applications is competitive with that of previously proposed state-of-the-art algorithms. An important aspect of TextRank is that it does not require deep linguistic knowledge, nor domain or language specific annotated corpora, which makes it highly portable to other domains, genres, or languages.”
15. TextRank
1. segment document into paragraphs and sentences
2. parse each sentence, find PoS annotations, lemmas
3. construct a graph where nouns, adjectives, verbs are vertices
   • links based on skip-grams
   • links based on repeated instances of the same root
4. stochastic link analysis (e.g., PageRank) identifies noun phrases which have high probability of inbound references
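Here is a minimal sketch of steps 2–4, assuming spaCy (with the en_core_web_sm model) and NetworkX are installed; it uses a simple co-occurrence window for links and is not the actual PyTextRank implementation.

# Simplified TextRank-style keyword ranking (not the actual PyTextRank code).
# Assumes spaCy with the "en_core_web_sm" model and NetworkX are installed.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def rank_words(text, window=3, top_n=10):
    doc = nlp(text)
    # steps 2-3: keep nouns (incl. proper nouns), adjectives, verbs as vertices,
    # keyed by lemma so repeated instances of the same root collapse
    words = [t for t in doc
             if t.pos_ in {"NOUN", "PROPN", "ADJ", "VERB"} and not t.is_stop]
    g = nx.Graph()
    for i, w in enumerate(words):
        # link lemmas that co-occur within a small window (skip-gram style)
        for other in words[i + 1:i + window]:
            g.add_edge(w.lemma_, other.lemma_)
    # step 4: stochastic link analysis via PageRank
    ranks = nx.pagerank(g)
    return sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

text = ("TextRank builds a graph from a parsed text, then ranks the "
        "vertices of the graph to extract key phrases from the text.")
for lemma, score in rank_words(text):
    print(f"{score:.4f}  {lemma}")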
16. How does this work?
Graph analysis, e.g., centrality, causes the most “central” phrases to be ranked higher.
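For intuition, a tiny hand-built graph (hypothetical lemmas and edges, not output from any real document) shows how PageRank favors the most connected, i.e., most central, vertex.

# Toy illustration: the most "central" vertex earns the highest PageRank.
# The words and edges are hypothetical, chosen only to show the effect.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("language", "natural"), ("language", "processing"),
    ("language", "parse"),   ("language", "text"),
    ("parse", "text"),       ("graph", "algorithm"),
])

for word, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {word}")
# "language" links to the most neighbors, so it ranks highest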
18. Important outcomes
Unsupervised methods which do not require subject domain info, but can use it when available.
Extracting key phrases from a text, in lieu of keywords, provides significantly more information content. For example, consider “deep learning” vs. “deep” and “learning”. Also, automated summarization can reduce human (editorial) time dramatically.
22. PyTextRank: motivations
▪ spaCy is moving fast and benefits greatly by having an abstraction layer above it
▪ NetworkX allows use of multicore + large memory spaces for graph algorithms, obviating the need for Hadoop, Spark, etc. in many NLP use cases
▪ datasketch leverages approximation algorithms which make key features feasible, given the environments described above
▪ monolithic NLP frameworks often attempt to control the entire pipeline (instead of using modules or services) and are basically tech debt waiting to happen
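As an illustration of the approximation algorithms mentioned above, here is a minimal datasketch sketch using MinHash to estimate Jaccard similarity between token sets; the example data is hypothetical and this is not how PyTextRank itself invokes the library.

# Minimal sketch of the kind of approximation datasketch provides:
# MinHash estimates Jaccard similarity between token sets without
# comparing them exhaustively. Example data is hypothetical.
from datasketch import MinHash

def minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for tok in tokens:
        m.update(tok.encode("utf8"))
    return m

a = minhash("graph algorithms for natural language processing".split())
b = minhash("graph algorithms enhance natural language understanding".split())

print("estimated Jaccard:", a.jaccard(b))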
23. PyTextRank: innovations
▪ fixed a bug in the original paper’s pseudocode; see the 2008 Java impl used in production by ShareThis, Rolling Stone, etc.
▪ included verbs in the graph (though not in keyphrase output)
▪ added text scrubbing, unicode normalization
▪ replaced stemming with lemmatization
▪ can leverage named entity recognition, noun phrase chunking
▪ provides extractive summarization
▪ makes use of approximation algorithms, where appropriate
▪ applies use of ontology: a random walk of hypernyms/hyponyms helps enrich the graph, which in turn improves performance
▪ integrates vector embedding + other graph algorithms to help construct or augment an ontology from a corpus
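A small sketch of what the scrubbing and lemmatization bullets involve, using Python's unicodedata module plus spaCy; this is illustrative only, not the library's internal code, and assumes the en_core_web_sm model.

# Illustrative sketch of unicode normalization plus lemmatization
# (not PyTextRank's internal code). Assumes spaCy's "en_core_web_sm" model.
import unicodedata
import spacy

nlp = spacy.load("en_core_web_sm")

raw = "The ﬁrst parsers were trained on annotated corpora"
clean = unicodedata.normalize("NFKC", raw)   # fold ligatures, normalize forms

doc = nlp(clean)
print([(t.text, t.lemma_) for t in doc if t.is_alpha])
# lemmatization maps inflected forms ("trained" -> "train") rather than
# stemming, which would crudely chop suffixes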
24. PyTextRank: upcoming
Package installation is currently integrated with PyPi.
2017-Q4: adding support for Anaconda.
Also working on:
▪ integration with spaCy 2.x, e.g., spans
▪ more coding examples in the wiki, especially using Jupyter
▪ better handling for character encoding issues; see “Character Filtering” by Daniel Tunkelang, along with his entire Query Understanding series
▪ more integration into coursework; see Get Started with NLP in Python on O’Reilly
25. Strata Data
SG, Dec 4-7
SJ, Mar 5-8
UK, May 21-24
CN, Jul 12-15
The AI Conf
CN, Apr 10-13
NY, Apr 29-May 2
SF, Sep 4-7
UK, Oct 8-11
JupyterCon
NY, Aug 21-24
OSCON
PDX, Jul 16-19, 2018
26. Learn Alongside Innovators
Just Enough Math
Building Data Science Teams
Hylbert-Speys
How Do You Learn?
updates, reviews, conference summaries…
liber118.com/pxn/
@pacoid