SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
PyTextRank
2017-02-08 SF Python Meetup
Paco Nathan, @pacoid

Dir, Learning Group @ O’Reilly Media
Just Enough Graph
3
• many real-world problems are often
represented as graphs
• graphs can generally be converted into
sparse matrices (bridge to linear algebra)
• eigenvectors find the stable points in 

a system defined by matrices – which 

may be more efficient to compute
• beyond simpler graphs, complex data 

may require work with tensors
Graph Analytics: terminology
4
Suppose we have a graph as shown below:
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line
connecting two vertices
v
u
w
x
Graph Analytics: concepts
5
We can represent this kind of graph as an
adjacency matrix:
• label the rows and columns based 

on the vertices
• entries get a 1 if an edge connects the
corresponding vertices, or 0 otherwise
v
u
w
x
u v w x
u 0 1 0 1
v 1 0 1 1
w 0 1 0 1
x 1 1 1 0
Graph Analytics: representation
6
An adjacency matrix always has certain properties:
• it is symmetric, i.e., A = AT
• it has real eigenvalues
Therefore algebraic graph theory bridges between
linear algebra and graph theory
Algebraic Graph Theory
7
Sparse Matrix Collection… for when you really need
a wide variety of sparse matrix examples, e.g., to
evaluate new ML algorithms
SuiteSparse 

Matrix Collection

faculty.cse.tamu.edu/
davis/matrices.html
Beauty in Sparsity
8
See examples in: Just Enough Math
Algebraic Graph Theory

Norman Biggs

Cambridge (1974)

amazon.com/dp/0521458978
Graph Analysis and Visualization

Richard Brath, David Jonker

Wiley (2015)

shop.oreilly.com/product/9781118845844.do
Resources
TextRank
10
TextRank: Bringing Order into Texts

Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural
Language Processing (July 2004)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/papers.html
http://www.cse.unt.edu/~tarau/
TextRank: original paper
11
Jeff Kubina (Perl / English):
http://search.cpan.org/~kubina/Text-Categorize-
Textrank-0.51/lib/Text/Categorize/Textrank/En.pm
Paco Nathan (Hadoop / English+Spanish):
https://github.com/ceteri/textrank/
Karin Christiasen (Java / Icelandic):
https://github.com/karchr/icetextsum
TextRank: other impl
12
TextRank: raw text input
13
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'},
{'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
{'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
{'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
{'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}]
compat
system
linear
constraint
1:
2:
3:
TextRank: data results
14
https://en.wikipedia.org/wiki/PageRank
TextRank: how it works
PyTextRank
16
A pure Python implementation of TextRank, 

based on Mihalcea 2004
Because monolithic NLP frameworks which 

control the entire pipeline (instead of APIs/services)
are wretched
Because keywords…

(note that ngrams aren’t much better)
Motivations
17
Using keywords leads to surprises!
18
Better to have keyphrases and summaries
Leads toward integration with the Williams 2016 

talk on text summarization with deep learning:

http://mike.place/2016/summarization/
Motivations
19
Modifications to the original Mihalcea algorithm include:
• fixed bug in the original paper’s pseudocode; 

see Java impl 2008 used by ShareThis, etc.
• uses lemmatization instead of stemming
• verbs included in graph (not in key phrase output)
• integrates named entity resolution
• keyphrase ranks used in MinHash to approximate
semantic similarity, which produces summarizations
• allow use of ontology, i.e., AI knowledge graph to
augment the parsing
Enhancements to TextRank
20
Use of stemming, e.g., with Porter Stemmer, has 

long been a standard way to “normalize” text data: 

a computationally efficient approach to reduce words 

to their “stems” by removing inflections.
A better approach is to lemmatize, i.e., use part-of-speech
tags to lookup the root for a word in WordNet – related to
looking up its synsets, hypernyms, hyponyms, etc.
Lemmatization vs. Stemming
Lexeme PoS Stem Lemma
interact VB interact interact
comments NNS comment comment
provides VBZ provid provide
21
https://github.com/ceteri/pytextrank
• TextBlob – a Python 2/3 library that provides a
consistent API for leveraging other resources
• WordNet – think of it as somewhere between a
large thesaurus and a database
• spaCy
• NetworkX
• datasketch
• graphviz
PytTextRank: repo + dependencies
22
BTW, some really cool stuff to leverage, once you have
ranked keyphrases as feature vectors:
• Happening – semantic search

http://www.happening.technology/
• DataRefiner – topological data analysis w/ DL

https://datarefiner.com/
The Beyond
O’Reilly Media conferences + training:
NLP in Python

repeated live online courses
Strata

SJ Mar 13-16

Deep Learning sessions, 2-day training
Artificial Intelligence

NY Jun 26-29, SF Sep 17-20

SF CFP opens soon, follow @OreillyAI for updates
JupyterCon

NY Aug 22-25
speaker:
periodic newsletter for updates, 

events, conf summaries, etc.:
liber118.com/pxn/

@pacoid
airships
e.g., JP Aerospace, 40 km
atmosats
e.g.,Titan Aerospace, 20 km
microsats
e.g., Planet Labs, 400 km
robots
e.g., Blue River, 1 m
sensors
e.g., Hortau, -0.3 m
drones
e.g., HoneyComb, 120 m
Ag + DataJust Enough Math Building Data
Science Teams
Hylbert-Speys How Do You Learn?

Contenu connexe

Tendances

Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Doug Needham
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioOpen Knowledge Belgium
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability MahoutJake Mannix
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad DataSteffen Staab
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonJen Stirrup
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...తేజ దండిభట్ల
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scalaAndy Petrella
 

Tendances (20)

Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
 
Data visualization
Data visualizationData visualization
Data visualization
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad Data
 
Introduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and PythonIntroduction to Analytics with Azure Notebooks and Python
Introduction to Analytics with Azure Notebooks and Python
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil
 
Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...Achieving time effective federated information from scalable rdf data using s...
Achieving time effective federated information from scalable rdf data using s...
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
The Power of Machine Learning and Graphs
The Power of Machine Learning and GraphsThe Power of Machine Learning and Graphs
The Power of Machine Learning and Graphs
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
 

Similaire à SF Python Meetup: TextRank in Python

Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Ken Mwai
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationTravis Oliphant
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Ravi Okade
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysisstat
 
Hash tables and hash maps in python | Edureka
Hash tables and hash maps in python | EdurekaHash tables and hash maps in python | Edureka
Hash tables and hash maps in python | EdurekaEdureka!
 
PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...
PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...
PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...Puppet
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptxsalutiontechnology
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebNuxeo
 
postgres loader
postgres loaderpostgres loader
postgres loaderINRIA-OAK
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
Tapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkMichael Häusler
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 

Similaire à SF Python Meetup: TextRank in Python (20)

Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysis
 
Hash tables and hash maps in python | Edureka
Hash tables and hash maps in python | EdurekaHash tables and hash maps in python | Edureka
Hash tables and hash maps in python | Edureka
 
PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...
PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...
PuppetConf 2017: Custom Types & Providers: Modeling Modern REST Interfaces an...
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptx
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
postgres loader
postgres loaderpostgres loader
postgres loader
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Tapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and FlinkTapping into Scientific Data with Hadoop and Flink
Tapping into Scientific Data with Hadoop and Flink
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Linking library data
Linking library dataLinking library data
Linking library data
 

Plus de Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 

Plus de Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 

Dernier

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 

SF Python Meetup: TextRank in Python

  • 1. PyTextRank 2017-02-08 SF Python Meetup Paco Nathan, @pacoid
 Dir, Learning Group @ O’Reilly Media
  • 3. 3 • many real-world problems are often represented as graphs • graphs can generally be converted into sparse matrices (bridge to linear algebra) • eigenvectors find the stable points in 
 a system defined by matrices – which 
 may be more efficient to compute • beyond simpler graphs, complex data 
 may require work with tensors Graph Analytics: terminology
  • 4. 4 Suppose we have a graph as shown below: We call x a vertex (sometimes called a node) An edge (sometimes called an arc) is any line connecting two vertices v u w x Graph Analytics: concepts
  • 5. 5 We can represent this kind of graph as an adjacency matrix: • label the rows and columns based 
 on the vertices • entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise v u w x u v w x u 0 1 0 1 v 1 0 1 1 w 0 1 0 1 x 1 1 1 0 Graph Analytics: representation
  • 6. 6 An adjacency matrix always has certain properties: • it is symmetric, i.e., A = AT • it has real eigenvalues Therefore algebraic graph theory bridges between linear algebra and graph theory Algebraic Graph Theory
  • 7. 7 Sparse Matrix Collection… for when you really need a wide variety of sparse matrix examples, e.g., to evaluate new ML algorithms SuiteSparse 
 Matrix Collection
 faculty.cse.tamu.edu/ davis/matrices.html Beauty in Sparsity
  • 8. 8 See examples in: Just Enough Math Algebraic Graph Theory
 Norman Biggs
 Cambridge (1974)
 amazon.com/dp/0521458978 Graph Analysis and Visualization
 Richard Brath, David Jonker
 Wiley (2015)
 shop.oreilly.com/product/9781118845844.do Resources
  • 10. 10 TextRank: Bringing Order into Texts
 Rada Mihalcea, Paul Tarau Conference on Empirical Methods in Natural Language Processing (July 2004) https://goo.gl/AJnA76 http://web.eecs.umich.edu/~mihalcea/papers.html http://www.cse.unt.edu/~tarau/ TextRank: original paper
  • 11. 11 Jeff Kubina (Perl / English): http://search.cpan.org/~kubina/Text-Categorize- Textrank-0.51/lib/Text/Categorize/Textrank/En.pm Paco Nathan (Hadoop / English+Spanish): https://github.com/ceteri/textrank/ Karin Christiasen (Java / Icelandic): https://github.com/karchr/icetextsum TextRank: other impl
  • 13. 13 "Compatibility of systems of linear constraints" [{'index': 0, 'stem': 'compat', 'tag': 'NNP','word': 'compatibility'}, {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'}, {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'}, {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'}, {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'}, {'index': 5, 'stem': 'constraint', 'tag': 'NNS','word': 'constraints'}] compat system linear constraint 1: 2: 3: TextRank: data results
  • 16. 16 A pure Python implementation of TextRank, 
 based on Mihalcea 2004 Because monolithic NLP frameworks which 
 control the entire pipeline (instead of APIs/services) are wretched Because keywords…
 (note that ngrams aren’t much better) Motivations
  • 17. 17 Using keywords leads to surprises!
  • 18. 18 Better to have keyphrases and summaries Leads toward integration with the Williams 2016 
 talk on text summarization with deep learning:
 http://mike.place/2016/summarization/ Motivations
  • 19. 19 Modifications to the original Mihalcea algorithm include: • fixed bug in the original paper’s pseudocode; 
 see Java impl 2008 used by ShareThis, etc. • uses lemmatization instead of stemming • verbs included in graph (not in key phrase output) • integrates named entity resolution • keyphrase ranks used in MinHash to approximate semantic similarity, which produces summarizations • allow use of ontology, i.e., AI knowledge graph to augment the parsing Enhancements to TextRank
  • 20. 20 Use of stemming, e.g., with Porter Stemmer, has 
 long been a standard way to “normalize” text data: 
 a computationally efficient approach to reduce words 
 to their “stems” by removing inflections. A better approach is to lemmatize, i.e., use part-of-speech tags to lookup the root for a word in WordNet – related to looking up its synsets, hypernyms, hyponyms, etc. Lemmatization vs. Stemming Lexeme PoS Stem Lemma interact VB interact interact comments NNS comment comment provides VBZ provid provide
  • 21. 21 https://github.com/ceteri/pytextrank • TextBlob – a Python 2/3 library that provides a consistent API for leveraging other resources • WordNet – think of it as somewhere between a large thesaurus and a database • spaCy • NetworkX • datasketch • graphviz PytTextRank: repo + dependencies
  • 22. 22 BTW, some really cool stuff to leverage, once you have ranked keyphrases as feature vectors: • Happening – semantic search
 http://www.happening.technology/ • DataRefiner – topological data analysis w/ DL
 https://datarefiner.com/ The Beyond
  • 23. O’Reilly Media conferences + training: NLP in Python
 repeated live online courses Strata
 SJ Mar 13-16
 Deep Learning sessions, 2-day training Artificial Intelligence
 NY Jun 26-29, SF Sep 17-20
 SF CFP opens soon, follow @OreillyAI for updates JupyterCon
 NY Aug 22-25
  • 24. speaker: periodic newsletter for updates, 
 events, conf summaries, etc.: liber118.com/pxn/
 @pacoid airships e.g., JP Aerospace, 40 km atmosats e.g.,Titan Aerospace, 20 km microsats e.g., Planet Labs, 400 km robots e.g., Blue River, 1 m sensors e.g., Hortau, -0.3 m drones e.g., HoneyComb, 120 m Ag + DataJust Enough Math Building Data Science Teams Hylbert-Speys How Do You Learn?