NLP for the Web
Dr. Matthew Peters
@mattthemathman
(with thanks to Rutu Mulkar, Erin Renshaw, Dan Lecocq,
Chris Whitten, Jay Leary and many others at Moz)
Moz is a SaaS company that sells SEO and Content Marketing software to
professional marketers
We crawl a lot: > 1 Billion pages / day
We’d like to extract structured information from these pages
Author identification
Keyword / topic extraction
Measure of page’s “Reach”
Measure of page’s “Reach” + structured information → most important topics, associate authors with areas of expertise, etc.
Most NLP tasks (parsing, POS tagging, Q&A, sentiment, etc.) focus exclusively on text content.
First two
paragraphs
http://www.seattletimes.com/seattle-news/transportation/berthas-delays-prompt-state-to-sue-tunnel-contractor/
NLP: POS tagging, question answering, language modeling, parsing, sentiment
Web: HTML, XML, Microdata (schema.org), CSS, JavaScript
Interesting problems at the intersection
3 extraction tasks
•  Main article
•  Keywords: Bertha, lawsuit, Seattle Tunnel Partners, STP, WSDOT, tunnel construction, etc.
•  Author
Outline
•  Motivation – what are some of the unique challenges and opportunities in doing NLP on the web?
•  Main article extraction / page de-chroming
•  Keyword extraction
•  Author identification
•  Conclusion
Challenges & Opportunities
C1: Pages have clutter and unrelated text
•  Navigation aids
•  Ads
•  Links to other articles
C2: Text segments can confuse NLP components
Text: Mike Lindblom Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker:
•  Mike Lindblom Bertha
•  State sues
•  the lawsuit
C3: The web has a lot of cruft
•  Many different standards and only partial adoption
•  Wide variety of templates and sites
•  Broken HTML
O1: Pages have additional structure
•  Web pages have attributes other than the visual text: URL string, page title, meta description, etc.
•  The HTML/XML has a tree structure we can use
O2: Hyperlinks
O3: CSS
<div class="article-columnist-name vcard">
<a class="author url fn" rel="author"
href="/author/mike-lindblom/">Mike Lindblom</a>
</div>
General approach
1.  Use HTML parser to represent page as a tree
2.  Split the tree into small pieces and analyze each piece
separately
3.  Run NLP pipelines or other machine learning models on these
small pieces to:
- focus attention on the important pieces
- extract structured information
- other task dependent objectives
4.  Need algorithms that efficiently process only raw HTML
(without JS, image, CSS, etc.)
Content extraction /
Web page de-chroming
Extract main
article content
(and optionally
comments) from
a web page
Main article
Dragnet
•  Combine diverse features with machine learning
•  Open source: https://github.com/seomoz/dragnet
•  v1 (2013): link/text density + CETR
blog: https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/
paper: http://www2013.org/companion/p89.pdf
•  v2 (2015): added Readability
blog: https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/
10,000 foot view
•  Split page into distinct visual elements called “blocks”
•  Use machine learning to classify each block as content
or no content
Inspired by:
Kohlschütter et al, Boilerplate Detection using Shallow Text Features, WSDM ’10
Weninger et al, CETR -- Content Extraction with Tag Ratios, WWW ’10
Readability: (https://github.com/buriy/python-readability)
Splitting the text into blocks
•  Use new lines in HTML (\n) (CETR)
•  Use subtrees (Readability)
•  Flatten HTML and break on <div>, <p>, <h1>, etc.
(Kohlschütter et al. and us)
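The third option can be sketched with the standard library alone. This is a minimal illustration, not Dragnet's actual blockifier: the `Blockifier` class, the `blockify` helper and the exact set of block-level tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags that start a new block; the exact set is an assumption for illustration.
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "blockquote"}


class Blockifier(HTMLParser):
    """Flatten HTML into text blocks, breaking at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: {"text": ..., "link_text": ...}
        self._text = []       # text fragments of the current block
        self._link_text = []  # fragments that sit inside <a> tags
        self._in_link = 0     # nesting depth inside <a>
        self._skip = 0        # nesting depth inside <script>/<style>

    def _flush(self):
        """Close the current block and store it if it has any text."""
        text = " ".join(" ".join(self._text).split())
        if text:
            link_text = " ".join(" ".join(self._link_text).split())
            self.blocks.append({"text": text, "link_text": link_text})
        self._text, self._link_text = [], []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore JavaScript and CSS source
        self._text.append(data)
        if self._in_link:
            self._link_text.append(data)


def blockify(html):
    """Return the list of text blocks found in an HTML string."""
    parser = Blockifier()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last block tag
    return parser.blocks
```

Keeping the anchor text of each block separate makes it cheap to compute link-based features per block later on.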
Text and link “density” are two important features.
High text density and low link density → more likely to be content
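A rough sketch of the two densities for a single block follows. The definitions are simplified assumptions in the spirit of Kohlschütter et al. (words per wrapped line, and the share of words inside links); the function names and the 80-character wrap width are illustrative choices, not Dragnet's exact feature code.

```python
import textwrap


def text_density(text, width=80):
    """Words per line after wrapping the block text at `width` characters."""
    lines = textwrap.wrap(text, width=width)
    if not lines:
        return 0.0
    return len(text.split()) / len(lines)


def link_density(text, link_text):
    """Fraction of the block's words that sit inside <a> tags."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return len(link_text.split()) / n_words
```

A navigation bar typically scores a link density near 1.0 with few words per line, while a paragraph of article text scores a link density near 0 with many words per line.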
Compute “smoothed tag ratio”, ratio of #
non-tag chars to # tags
Compute “smoothed absolute difference
tag ratio”, dTR / dblock.
Captures intuition that main
content occurs together
Run k-means with 3 clusters on blocks,
with one centroid always pinned
to (0, 0)
Blocks in (0, 0) cluster are non-content,
remainder content
Encode CETR
predicted class as
0-1 feature
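The CETR steps can be sketched in pure Python as below. Several details are assumptions for illustration: the tag ratio follows the CETR paper's definition (non-tag characters per tag), smoothing is a simple moving average rather than the paper's Gaussian kernel, the derivative is a plain neighbor difference, and the k-means initialization is a deterministic choice made so the example is reproducible.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(lines):
    """Non-tag characters per tag for each line (tag count floored at 1)."""
    ratios = []
    for line in lines:
        n_tags = len(TAG_RE.findall(line))
        n_chars = len(TAG_RE.sub("", line).strip())
        ratios.append(n_chars / max(n_tags, 1))
    return ratios


def smooth(values, radius=1):
    """Moving average over a window of 2 * radius + 1 values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def abs_derivative(values):
    """Smoothed absolute difference between neighboring values."""
    last = len(values) - 1
    diffs = [abs(values[min(i + 1, last)] - values[i]) for i in range(len(values))]
    return smooth(diffs)


def kmeans_pinned(points, k=3, iters=20):
    """k-means on 2-D points where centroid 0 stays fixed at the origin."""
    if not points:
        return []
    ranked = sorted(points, key=lambda p: p[0] ** 2 + p[1] ** 2)
    centroids = [(0.0, 0.0)] + [
        ranked[(j * (len(ranked) - 1)) // (k - 1)] for j in range(1, k)
    ]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        for j in range(1, k):  # centroid 0 is never updated
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def cetr_content_mask(lines):
    """True for lines predicted to be content, False for the origin cluster."""
    tr = smooth(tag_ratios(lines))
    dtr = abs_derivative(tr)
    labels = kmeans_pinned(list(zip(tr, dtr)))
    return [label != 0 for label in labels]
```

The boolean mask is the kind of 0-1 value that can be handed to a downstream classifier as one feature among many.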
Use a simplified version of Readability:
•  Compute score for each subtree using:
- parent id/class attributes
- length of text
•  Find subtree with highest score
•  Block feature = maximum subtree score for all subtrees containing block
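A toy version of this scoring is sketched below. The `Node` class, the regular expressions and the weights (25 points for a promising or suspicious id/class, one point per 100 characters of text) are hypothetical values chosen for illustration, not the ones used by Readability or Dragnet.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical patterns, loosely modeled on Readability-style heuristics.
POSITIVE = re.compile(r"article|body|content|entry|main|post|story", re.I)
NEGATIVE = re.compile(r"comment|footer|nav|sidebar|sponsor|widget|menu", re.I)


@dataclass(eq=False)
class Node:
    tag: str
    attrs: str = ""  # concatenated id/class attribute values
    text: str = ""   # text directly inside this node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child):
        """Attach a child node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def text_length(node):
    """Total number of text characters in the subtree rooted at `node`."""
    return len(node.text) + sum(text_length(c) for c in node.children)


def subtree_score(node):
    """Score a subtree from its text length and its id/class attributes."""
    score = text_length(node) / 100.0
    if POSITIVE.search(node.attrs):
        score += 25
    if NEGATIVE.search(node.attrs):
        score -= 25
    return score


def block_feature(block_node):
    """Maximum subtree score over all subtrees that contain the block."""
    best = subtree_score(block_node)
    node = block_node.parent
    while node is not None:
        best = max(best, subtree_score(node))
        node = node.parent
    return best
```

Walking up the ancestors means a short paragraph inside a high-scoring article container still receives that container's score.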
Random
Forest
Model performance
From 2013 paper (v1)
Task: Extract content
and comments
Keyword / topic extraction
Extract a ranked list of
keywords from a page with
relevancy score

(91, 'bertha')
(61, 'stp')
(59, 'state sues')
(44, 'tunnel')
(37, 'wsdot')
(30, 'tunnel construction')
(28, 'the seattle times')
(17, 'seattle tunnel partners')
(13, 'repair bertha')
(10, 'transportation lawsuit')
Prior work
Many prior papers on similar task
Most use small data sets (hundreds of labeled examples) → unsupervised +
supervised methods
Wide range of previous approaches and almost always tailored to specific
type of document (academic papers, etc.)
Requirements for “gold standard” are fuzzy
Our approach:
Build a web specific algorithm to leverage unique aspects of domain
Combine many different features / approaches
Overcome data limitations and build complex model by gathering lots of
data automatically
Pipeline: Raw HTML → Generate Candidates → CANDIDATES → Rank Candidates → Ranked Topics
Main article
Run dragnet to
extract the main
article content.
Keep track of
individual blocks
and process each
separately.
Text & Token normalization
Include web specific logic in tokenizer / normalizer
•  This displays as a dash but is the Unicode character U+2014
•  Need to special case @twitter, email@domain.com, dates, etc.
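A minimal sketch of such a normalizer and tokenizer is shown below. The punctuation map, the token patterns and the function names are assumptions for illustration; a production tokenizer would need many more special cases.

```python
import re
import unicodedata

# Map typographic punctuation to ASCII equivalents (a small, illustrative subset).
PUNCT_MAP = str.maketrans({
    "\u2014": "-",  # em dash
    "\u2013": "-",  # en dash
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u00a0": " ",  # non-breaking space
})

# Order matters: web-specific patterns are tried before generic words.
TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email addresses
  | @\w+                           # @twitter handles
  | \d{1,2}/\d{1,2}/\d{2,4}        # numeric dates such as 7/1/2016
  | \w+(?:'\w+)*                   # words, allowing internal apostrophes
  | [^\w\s]                        # any other single symbol
""", re.VERBOSE)


def normalize(text):
    """Apply Unicode normalization and replace typographic punctuation."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP)


def tokenize(text):
    """Split normalized text into tokens, keeping handles, emails and dates whole."""
    return TOKEN_RE.findall(normalize(text))
```

For example, `tokenize("Bertha\u2014the tunnel machine")` yields `['Bertha', '-', 'the', 'tunnel', 'machine']`, so the U+2014 character no longer glues two words together.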
Pipeline: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize) → CANDIDATES → Rank Candidates → Ranked Topics
Processing individual blocks helps NP chunker
Block 1: Mike Lindblom
Block 2: Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker:
•  Mike Lindblom
•  Bertha
•  State sues
•  the lawsuit
Wikipedia lookup
Treat “Statue of Liberty”
as a single candidate
instead of splitting into
“Statue” “of Liberty”
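One simple way to do this kind of lookup is a greedy longest-match scan over the token stream, sketched below. The `titles` set is a toy stand-in for an index of lowercased Wikipedia article titles, and `merge_known_phrases` is a hypothetical helper, not the talk's implementation.

```python
def merge_known_phrases(tokens, titles, max_len=5):
    """Greedy longest-match merge of token n-grams that appear in `titles`."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first, down to bigrams.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in titles:
                merged.append(phrase)
                i += n
                break
        else:
            # No multi-word title starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Toy stand-in for a set of lowercased Wikipedia titles.
titles = {"statue of liberty", "seattle tunnel partners"}
print(merge_known_phrases("the Statue of Liberty reopened".split(), titles))
# prints ['the', 'Statue of Liberty', 'reopened']
```

Merging before candidate generation keeps the chunker from proposing fragments such as "of Liberty" as separate keywords.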
Pipeline: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize → POS tag/Noun phrase chunk → Wikipedia lookup) → CANDIDATES → Rank Candidates → Ranked Topics
Pipeline: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize → POS tag/Noun phrase chunk → Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL) → Ranked Topics
Ranking model features
Shallow: relative position in document, number of tokens
Occurrence: does candidate occur in title, H1, meta description, etc
Term frequency: count of occurrences, average token count, sum(in degree),
etc
QDR: information retrieval motivated “query-document relevance” ranking
models. TF-IDF (term frequency X inverse document frequency),
probabilistic approaches, language models
POS tags: is the keyword a proper noun, etc
URL features: does the keyword appear in URL
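A few of these feature groups can be sketched for a single candidate as follows. The `page` dictionary layout, the `doc_freq` table, the feature names and the smoothed IDF formula are all assumptions for illustration; the POS features are omitted because they need a tagger.

```python
import math
import re


def candidate_features(candidate, page, doc_freq, n_docs):
    """Build a small feature dict for one keyword candidate.

    `page` is a dict with "text", "title", "h1", "meta_description" and "url"
    keys; `doc_freq` maps a term to the number of corpus documents containing
    it. Both structures are hypothetical stand-ins.
    """
    cand = candidate.lower()
    text = page["text"].lower()
    tokens = cand.split()
    count = len(re.findall(r"\b" + re.escape(cand) + r"\b", text))
    first = text.find(cand)
    # TF-IDF summed over the candidate's tokens, with a smoothed IDF.
    text_tokens = re.findall(r"\w+", text)
    tfidf = sum(
        text_tokens.count(t) * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0)))
        for t in tokens
    )
    url_text = re.sub(r"[^a-z0-9]+", " ", page["url"].lower())
    return {
        # Shallow: where the candidate first appears, and how long it is.
        "shallow_rel_position": first / len(text) if first >= 0 else 1.0,
        "shallow_num_tokens": len(tokens),
        # Occurrence: presence in prominent page fields.
        "occ_in_title": cand in page["title"].lower(),
        "occ_in_h1": cand in page["h1"].lower(),
        "occ_in_meta": cand in page["meta_description"].lower(),
        # Term frequency and a query-document relevance score.
        "tf_count": count,
        "qdr_tfidf": tfidf,
        # URL: does the candidate appear in the URL string?
        "url_match": cand in url_text,
    }
```

Each candidate's dictionary can then be vectorized and scored by whatever ranking model is used downstream.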
Pipeline: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize → POS tag/Noun phrase chunk → Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL → Classifier (probability of relevance)) → Ranked Topics
Generating training data
List of high volume keywords → Commercial Search Engine → Top 10 results → Crawl pages → Training Data: HTML with relevant keyword
PU learning
Learning classifiers from only positive and unlabeled
data, Elkan and Noto, SIGKDD 2008
●  Most ML classifiers have both
positive and negative
examples in training data
●  We only have one keyword
per page that is relevant
(“positive”) and many others
that may or may not be
positive
●  Use result from this paper
applied to our data
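The slides do not say which result from the paper is applied. One well-known result from Elkan and Noto is that a classifier g(x) trained to separate labeled from unlabeled examples can be rescaled into a positive-class probability by dividing by a constant c, estimated as the mean of g(x) on held-out labeled positives. A minimal sketch under that assumption, with hypothetical function names:

```python
def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = p(labeled | positive).

    The input holds scores g(x) = p(labeled | x) from a classifier trained on
    labeled-vs-unlabeled data, evaluated on held-out labeled positives.
    """
    if not scores_on_labeled_positives:
        raise ValueError("need at least one labeled positive example")
    return sum(scores_on_labeled_positives) / len(scores_on_labeled_positives)


def positive_probability(score, c):
    """Convert g(x) into p(positive | x) = g(x) / c, clipped to [0, 1]."""
    return min(1.0, max(0.0, score / c))


# Illustrative numbers only: three held-out labeled positives.
c = estimate_label_frequency([0.2, 0.3, 0.25])
print(round(positive_probability(0.1, c), 3))  # prints 0.4
```

The rescaling matters here because unlabeled candidates on a page are not necessarily irrelevant, so raw labeled-vs-unlabeled scores understate relevance.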
Keyword extraction review
Resulting algorithm is:
●  Robust across different content types – worst case still extracts
reasonable topics
●  Reasonably fast, about 25 pages / second end-to-end
●  Subjectively outperforms other commercial APIs (e.g. Alchemy, etc).
●  In production for a year+, processed many millions of pages
Author extraction
Author
Extract a list of author names
(or an empty list if no authors)
from a given web page.
See: https://moz.com/devblog/web-page-author-extraction/
Do we need a ML algorithm for this?
(why isn’t this trivial?)
Heuristics do an adequate job:
●  The microformat rel="author" attribute in link tags (a) is commonly used
to specify the page author
●  Some sites specify page authors with a meta author tag.
●  Many sites use names like “author” or “byline” for class attributes in their
CSS.
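The three heuristics above can be sketched with the standard library's HTML parser. The `AuthorHeuristics` class, the `extract_authors` helper and the stripping of a leading "by" are assumptions for illustration, not the baseline evaluated in the talk.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class AuthorHeuristics(HTMLParser):
    """Collect author names from rel="author", meta author and byline classes."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._depth = 0    # nesting depth inside an element being captured
        self._buffer = []  # text collected from that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self._add(attrs.get("content") or "")
            return
        if tag in VOID_TAGS:
            return  # void tags never get a closing tag
        if self._depth:
            self._depth += 1
            return
        rel = (attrs.get("rel") or "").lower().split()
        classes = (attrs.get("class") or "").lower()
        if (tag == "a" and "author" in rel) or "author" in classes or "byline" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self._add("".join(self._buffer))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def _add(self, name):
        name = " ".join(name.split())
        name = re.sub(r"^by\s+", "", name, flags=re.I)  # drop a leading "by"
        if name and name not in self.authors:
            self.authors.append(name)


def extract_authors(html):
    """Return the de-duplicated author names found by the heuristics."""
    parser = AuthorHeuristics()
    parser.feed(html)
    parser.close()
    return parser.authors
```

Rules like these return an empty list for pages without byline markup and can pick up sidebar bylines, which is where they break down.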
Sometimes heuristics work well
<div class="article-columnist-name vcard">
<a class="author url fn" rel="author"
href="/author/mike-lindblom/">Mike Lindblom</a>
</div>
But sometimes heuristics fall over
Some pages do not have any special markup for byline...
But sometimes heuristics fall over
Some pages have
misleading or wrong
markup
But sometimes heuristics fall over
Links to related stories and
sidebar bylines look nearly
identical to the main byline
Machine learning approach
Use machine learning approach
Crowd source labeled data using
Spare5
Approx 9,000 labeled pages
Case study
http://arstechnica.com/science/2016/07/algorithms-used-to-study-brain-activity-may-be-exaggerating-results/
Page HTML → Dragnet HTML Blockifier → Block representation → Block ranking model → Ranked blocks → Author chunker → Tags for highest ranked block
Block ranking model
Combines NLP and web features
•  Tokens in block text (similar to bag-of-words classification)
•  Tokens in block HTML tag attributes (e.g. class="byline")
•  The HTML tags in block (e.g. many author names are links)
•  rel="author" and other markup inspired features
Put all features through Random Forest classifier that predicts probability a
block contains author
Block model performance
Overall block model is pretty good –
captures intuition that “bylines are easy
to spot”
The table lists Precision@K for whether a block
actually contains the author’s name.
Author Chunker
Modified IOB tagger similar to NP chunker or POS tagger
3 – class classification problem (Beginning of name, Inside name, Outside)
To make predictions at next token:
uni-, bi- and tri-gram tokens from previous/next few tokens
uni-, bi- and tri-gram POS tags
previous predicted IOB labels
HTML tags preceding and following the token
rel="author" and other markup inspired features
Overall, 85.6% accuracy when chunking the top block.
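A sketch of how per-token features and IOB decoding might look is given below. The feature names, the window size and the `html_context` representation (a list of tag descriptors preceding each token) are hypothetical; only unigram features and one POS bigram are shown to keep the example short.

```python
def token_features(tokens, pos_tags, html_context, prev_labels, i, window=2):
    """Sparse features for predicting the IOB label of tokens[i]."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        inside = 0 <= j < len(tokens)
        tok = tokens[j].lower() if inside else "<pad>"
        pos = pos_tags[j] if inside else "<pad>"
        feats[f"tok[{off}]={tok}"] = 1
        feats[f"pos[{off}]={pos}"] = 1
    # Bigram of POS tags ending at the current token.
    prev_pos = pos_tags[i - 1] if i > 0 else "<pad>"
    feats[f"pos_bigram={prev_pos}_{pos_tags[i]}"] = 1
    # Previously predicted labels (only tokens to the left are available).
    for k, label in enumerate(reversed(prev_labels[-2:]), start=1):
        feats[f"label[-{k}]={label}"] = 1
    # HTML tags immediately preceding the token.
    for tag in html_context[i]:
        feats[f"html_before={tag}"] = 1
    return feats


def decode_names(tokens, labels):
    """Turn a B/I/O label sequence into a list of name strings."""
    names, current = [], []
    for tok, label in zip(tokens, labels):
        if label == "B":
            if current:
                names.append(" ".join(current))
            current = [tok]
        elif label == "I" and current:
            current.append(tok)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


tokens = ["by", "John", "Timmer", "-", "Jul", "1"]
print(decode_names(tokens, ["O", "B", "I", "O", "O", "O"]))
# prints ['John Timmer']
```

The tagger runs left to right, so features for a token can include the labels already predicted for the tokens before it.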
Author Chunker
<p class="byline">
by
<a href="http://arstechnica.com/author/john-timmer/" rel="author">
<span>John Timmer</span>
</a>
- <span class="date">Jul 1, 2016 6:55 pm UTC</span>
</p>
Author Chunker using HTML features
Tokens:       by   John   Timmer   -   Jul   1
POS tags:     IN   NNP    NNP      -   NN    CD
IOB labels:   O    ??
HTML context: <p class="byline"> <a rel="author"><span> </span></a>
To make the prediction at the ?? position, we can use tokens,
POS tags and HTML structure between tokens
Overall author model performance
Overall accuracy on the test set is good, outperforming the
alternatives: heuristics, a commercial API and an open source (OS) Python library.
Conclusion
NLP: POS tagging, question answering, language modeling, parsing, sentiment
Web: HTML, XML, Microdata (schema.org), CSS, JavaScript

  • 2. Moz is a SaaS company that sells SEO and Content Marketing software to professional marketers
  • 3. We crawl a lot: > 1 billion pages / day. We’d like to extract structured information from these pages.
  • 8. Measure of page’s “Reach” + structured information → most important topics, associate authors with areas of expertise, etc.
  • 9. Most NLP tasks (parsing, POS tagging, Q&A, sentiment, etc.) focus exclusively on text content.
  • 16. Outline: (1) Motivation: what are some of the unique challenges and opportunities in doing NLP on the web? (2) Main article extraction / page de-chroming (3) Keyword extraction (4) Author identification (5) Conclusion
  • 18. C1: Pages have clutter and unrelated text: navigation aids, ads, links to other articles.
  • 19. C2: Text segments can confuse NLP components. Input text: “Mike Lindblom Bertha: State sues - but even the lawsuit is delayed”. NP chunks extracted by our chunker: “Mike Lindblom Bertha”, “State sues”, “the lawsuit”.
  • 20. C3: The web has a lot of cruft: many different standards and only partial adoption; wide variety of templates and sites; broken HTML.
  • 21. O1: Pages have additional structure. Web pages have attributes other than the visible text: URL string, page title, meta description, etc. The HTML/XML has a tree structure we can use.
  • 23. O3: CSS <div class="article-columnist-name vcard"> <a class="author url fn" rel="author" href="/author/mike-lindblom/">Mike Lindblom</a> </div>
  • 24. General approach: 1. Use an HTML parser to represent the page as a tree. 2. Split the tree into small pieces and analyze each piece separately. 3. Run NLP pipelines or other machine learning models on these small pieces to focus attention on the important pieces, extract structured information, and meet other task-dependent objectives. 4. Need algorithms that efficiently process only raw HTML (without JS, images, CSS, etc.).
  • 25. Content extraction / Web page de-chroming
  • 26. Extract main article content (and optionally comments) from a web page Main article
  • 27. Dragnet: combine diverse features with machine learning. Open source (https://github.com/seomoz/dragnet). v1 (2013): link/text density + CETR (blog: https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/ paper: http://www2013.org/companion/p89.pdf). v2 (2015): added Readability (blog: https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/).
  • 28. 10,000 foot view: split the page into distinct visual elements called “blocks”, then use machine learning to classify each block as content or non-content. Inspired by: Kohlschütter et al, Boilerplate Detection using Shallow Text Features, WSDM ’10; Weninger et al, CETR -- Content Extraction with Tag Ratios, WWW ’10; Readability (https://github.com/buriy/python-readability).
  • 29. Splitting the text into blocks: use new lines in HTML (\n) (CETR); use subtrees (Readability); flatten HTML and break on <div>, <p>, <h1>, etc. (Kohlschütter et al. and us).
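The third splitting strategy (flatten the HTML and break on block-level tags) can be sketched with Python's standard-library parser. This is a minimal illustration, not Dragnet's actual blockifier: the `BLOCK_TAGS` set, the `Blockifier` class and the `blockify` helper are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Assumption: which tags start a new block. A real system would tune this set.
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "blockquote"}
SKIP_TAGS = {"script", "style"}  # text inside these is never shown to the reader


class Blockifier(HTMLParser):
    """Flatten HTML into text blocks, breaking on block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished blocks
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of anchor text in the current block
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, collapsing runs of whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def blockify(html):
    """Return a list of {'text', 'link_chars'} dicts, one per block."""
    parser = Blockifier()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that was not closed by a block tag
    return parser.blocks
```

Each block keeps its anchor-text character count so that later features can compare linked text against total text.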
  • 33. Text and link “density” are two important features. High text density and low link density → more likely to be content.
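A rough sketch of the two density features follows. The definitions here are simplified assumptions for illustration (words per wrapped line for text density, fraction of words inside anchors for link density); they are not claimed to match Dragnet's or Kohlschütter et al.'s exact formulas, and the function names are hypothetical.

```python
import textwrap


def text_density(text, width=80):
    """Words per wrapped line: a simplified stand-in for text density."""
    words = text.split()
    if not words:
        return 0.0
    lines = textwrap.wrap(" ".join(words), width=width)
    return len(words) / len(lines)


def link_density(text, anchor_text):
    """Fraction of the block's words that sit inside <a> tags."""
    words = text.split()
    if not words:
        return 0.0
    return len(anchor_text.split()) / len(words)
```

A navigation bar such as "Home News Sports Weather" is all anchor text, so its link density is 1.0, while a long paragraph with no links scores 0.0 and has a much higher text density.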
  • 34. Compute “smoothed tag ratio”, the ratio of # chars to # tags. Compute “smoothed absolute difference tag ratio”, dTR / dblock; this captures the intuition that main content occurs together. Run k-means with 3 clusters on blocks, with one centroid always pinned to (0, 0). Blocks in the (0, 0) cluster are non-content; the remainder are content.
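The CETR-style feature can be sketched in pure Python as below. It assumes the tag ratio is visible characters per tag (as in the CETR paper), uses a plain moving average as the smoother, and initialises the free centroids from the points farthest from the origin. Those choices and all function names are assumptions made for this example, not the exact procedure used in Dragnet.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def tag_ratios(lines):
    """Visible characters per tag for each line (tag count floored at 1)."""
    ratios = []
    for line in lines:
        n_tags = len(TAG_RE.findall(line))
        n_chars = len(TAG_RE.sub("", line).strip())
        ratios.append(n_chars / max(n_tags, 1))
    return ratios


def smooth(values, radius=1):
    """Moving average; a simple stand-in for the smoothing step."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def abs_difference(values):
    """Absolute change to the next line (the last line compares with itself)."""
    last = len(values) - 1
    return [abs(values[min(i + 1, last)] - values[i]) for i in range(len(values))]


def kmeans_pinned(points, k=3, iters=20):
    """k-means on 2-D points where centroid 0 stays fixed at the origin."""
    by_norm = sorted(points, key=lambda p: p[0] ** 2 + p[1] ** 2, reverse=True)
    centroids = [(0.0, 0.0)] + by_norm[: k - 1]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        for c in range(1, k):  # centroid 0 is never updated
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def classify_lines(lines):
    """True for lines predicted as content (not in the origin cluster)."""
    tr = smooth(tag_ratios(lines))
    points = list(zip(tr, smooth(abs_difference(tr))))
    return [label != 0 for label in kmeans_pinned(points)]
```

The boolean returned per line is the kind of 0-1 value that can then be fed to a downstream classifier as one feature among many.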
  • 35. Encode CETR predicted class as 0-1 feature
  • 36. Use a simplified version of Readability: •  Compute score for each subtree using: - parent id/class attributes - length of text •  Find subtree with highest score •  Block feature = maximum subtree score for all subtrees containing block
  • 38. Model performance, from the 2013 paper (v1). Task: extract content and comments.
  • 41. Keyword / topic extraction
  • 42. Extract a ranked list of keywords from a page with relevancy score: (91, 'bertha') (61, 'stp') (59, 'state sues') (44, 'tunnel') (37, 'wsdot') (30, 'tunnel construction') (28, 'the seattle times') (17, 'seattle tunnel partners') (13, 'repair bertha') (10, 'transportation lawsuit')
  • 43. Prior work: many prior papers on similar tasks. Most use small data sets (hundreds of labeled examples) → unsupervised + supervised methods. Wide range of previous approaches, almost always tailored to a specific type of document (academic papers, etc.). Requirements for “gold standard” are fuzzy. Our approach: build a web-specific algorithm to leverage unique aspects of the domain; combine many different features / approaches; overcome data limitations and build a complex model by gathering lots of data automatically.
  • 44. [Pipeline diagram] Raw HTML → Generate Candidates → CANDIDATES → Rank Candidates → Ranked Topics
  • 45. Main article: run dragnet to extract the main article content. Keep track of individual blocks and process each separately.
  • 46. Text & token normalization: include web-specific logic in the tokenizer / normalizer. A character may display as a dash but actually be the Unicode character U+2014. Need to special-case @twitter, email@domain.com, dates, etc.
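A small sketch of web-aware normalization and tokenization, using only the standard library. The character maps, the token patterns and the function names are assumptions for illustration; the production tokenizer described in the talk is not shown in the slides.

```python
import re
import unicodedata

# Assumption: map typographic dashes and quotes to ASCII equivalents.
_TRANSLATE = str.maketrans({
    "\u2012": "-", "\u2013": "-", "\u2014": "-", "\u2015": "-",
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
})

TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email address
  | @\w+                            # twitter handle
  | \d{1,2}/\d{1,2}/\d{2,4}         # numeric date such as 7/1/2016
  | \w+(?:'\w+)?                    # word, optionally with an apostrophe
  | [^\w\s]                         # any other single non-space character
""", re.VERBOSE)


def normalize(text):
    """Apply Unicode NFKC normalization, then ASCII-fold dashes and quotes."""
    return unicodedata.normalize("NFKC", text).translate(_TRANSLATE)


def tokenize(text):
    """Split normalized text into tokens, keeping web entities whole."""
    return TOKEN_RE.findall(normalize(text))
```

The order of the alternatives matters: the email and handle patterns come before the generic word pattern so that `email@domain.com` is not split at the `@`.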
  • 47. [Pipeline diagram] Raw HTML → Generate Candidates (Parse / Dechroming, Normalize, Sentence / word tokenize) → CANDIDATES → Rank Candidates → Ranked Topics
  • 48. Processing individual blocks helps the NP chunker. Input text: “Mike Lindblom Bertha: State sues - but even the lawsuit is delayed”. NP chunks extracted by our chunker: “Mike Lindblom Bertha”, “State sues”, “the lawsuit”.
  • 49. Wikipedia lookup: treat “Statue of Liberty” as a single candidate instead of splitting into “Statue” and “of Liberty”.
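One simple way to get this behaviour is a greedy longest-match against a set of known titles. The sketch below assumes a pre-built, lower-cased set of titles (`titles`) and a hypothetical helper name; how the lookup is actually implemented at Moz is not described in the slides.

```python
def merge_known_phrases(tokens, titles, max_len=5):
    """Greedily merge the longest token span that matches a known title."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in titles:
                out.append(phrase)
                i += n
                break
        else:
            # No multi-token title starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out
```

With `titles = {"statue of liberty"}`, the tokens of "We visited the Statue of Liberty today" come back with "Statue of Liberty" as one candidate.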
  • 50. [Pipeline diagram] Raw HTML → Generate Candidates (Parse / Dechroming, Normalize, Sentence / word tokenize, POS tag / Noun phrase chunk, Wikipedia lookup) → CANDIDATES → Rank Candidates → Ranked Topics
  • 51. [Pipeline diagram] Raw HTML → Generate Candidates (Parse / Dechroming, Normalize, Sentence / word tokenize, POS tag / Noun phrase chunk, Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL) → Ranked Topics
  • 52. Ranking model features. Shallow: relative position in document, number of tokens. Occurrence: does the candidate occur in title, H1, meta description, etc. Term frequency: count of occurrences, average token count, sum(in degree), etc. QDR: information-retrieval-motivated “query-document relevance” ranking models: TF-IDF (term frequency × inverse document frequency), probabilistic approaches, language models. POS tags: is the keyword a proper noun, etc. URL features: does the keyword appear in the URL.
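A few of these feature families can be sketched for a single candidate as follows. The function name, the feature names, the smoothed IDF formula and the `doc_freq` / `n_docs` inputs (document frequencies from some background corpus) are all assumptions for illustration; the talk does not give the exact definitions.

```python
import math


def candidate_features(candidate, tokens, title, url, doc_freq, n_docs):
    """Compute a handful of illustrative ranking features for one candidate."""
    cand = candidate.lower().split()
    toks = [t.lower() for t in tokens]
    n = len(cand)
    # Start offsets where the candidate's tokens occur in the document.
    positions = [i for i in range(len(toks) - n + 1) if toks[i:i + n] == cand]
    tf = len(positions)
    df = doc_freq.get(" ".join(cand), 0)
    idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
    return {
        "n_tokens": n,                                                      # shallow
        "first_position": positions[0] / len(toks) if positions else 1.0,  # shallow
        "in_title": " ".join(cand) in title.lower(),                        # occurrence
        "in_url": "-".join(cand) in url.lower(),                            # URL
        "tf": tf,                                                           # term frequency
        "tfidf": tf * idf,                                                  # QDR-style
    }
```

A dictionary like this, computed per candidate, is the kind of input a downstream classifier would consume to estimate the probability of relevance.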
  • 53. [Pipeline diagram] Raw HTML → Generate Candidates (Parse / Dechroming, Normalize, Sentence / word tokenize, POS tag / Noun phrase chunk, Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL; Classifier: probability of relevance) → Ranked Topics
  • 54. Generating training data: list of high-volume keywords → commercial search engine → top 10 results → crawl pages → training data: HTML with relevant keyword.
  • 55. PU learning (Learning classifiers from only positive and unlabeled data, Elkan and Noto, SIGKDD 2008). Most ML classifiers have both positive and negative examples in the training data. We only have one keyword per page that is relevant (“positive”) and many others that may or may not be positive. Use the result from this paper applied to our data.
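The central result of Elkan and Noto is that a classifier trained to separate labeled from unlabeled examples, g(x) ≈ p(s=1|x), differs from the true relevance probability only by a constant factor: p(y=1|x) = g(x) / c, where c = p(s=1|y=1) can be estimated as the mean of g over held-out labeled positives. A minimal sketch, with hypothetical function names and assuming the scores come from any probabilistic classifier trained with unlabeled examples treated as negatives:

```python
def estimate_label_frequency(scores_on_heldout_positives):
    """Estimate c = p(labeled | positive) as the mean score on known positives."""
    scores = list(scores_on_heldout_positives)
    if not scores:
        raise ValueError("need at least one held-out positive example")
    return sum(scores) / len(scores)


def calibrated_probability(score, c):
    """Convert g(x) into an estimate of p(y=1|x), clipped to at most 1."""
    return min(1.0, score / c)
```

For example, if known-relevant keywords score 0.25 on average, a candidate scoring 0.1 is estimated to be relevant with probability 0.4.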
  • 56. [Pipeline diagram] Raw HTML → Generate Candidates (Parse / Dechroming, Normalize, Sentence / word tokenize, POS tag / Noun phrase chunk, Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL; Classifier: probability of relevance) → Ranked Topics
  • 57. Keyword extraction review. The resulting algorithm is: robust across different content types (worst case still extracts reasonable topics); reasonably fast, about 25 pages / second end-to-end; subjectively better than other commercial APIs (e.g. Alchemy); in production for a year+, having processed many millions of pages.
  • 59. Author: extract a list of author names (or an empty list if no authors) from a given web page. See: https://moz.com/devblog/web-page-author-extraction/
  • 60. Do we need a ML algorithm for this? (Why isn’t this trivial?) Heuristics do an adequate job: the microformat rel="author" attribute in link tags (a) is commonly used to specify the page author; some sites specify page authors with a meta author tag; many sites use names like “author” or “byline” for class attributes in their CSS.
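The three heuristics can be sketched with the standard-library HTML parser. The class and function names are hypothetical, and the matching rules (substring match on class names, a small set of void tags) are simplifying assumptions made for this example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "link"}  # never have closing tags


class AuthorHeuristics(HTMLParser):
    """Collect author candidates from rel=author, meta author and class names."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.candidates = []
        self._tag = None    # tag name of the element being captured
        self._depth = 0     # nesting depth of that same tag name
        self._buf = []

    def _add(self, text):
        name = " ".join(text.split())
        if name and name not in self.candidates:
            self.candidates.append(name)

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            if a.get("name", "").lower() == "author":
                self._add(a.get("content", ""))
            return
        if tag in VOID_TAGS:
            return
        if self._tag is not None:
            if tag == self._tag:
                self._depth += 1
            return
        classes = a.get("class", "").lower()
        rel = a.get("rel", "").lower().split()
        if "author" in rel or "author" in classes or "byline" in classes:
            self._tag, self._depth, self._buf = tag, 1, []

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._add("".join(self._buf))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)


def extract_author_candidates(html):
    parser = AuthorHeuristics()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Note that a byline element often yields a string like "By Sam Roe" rather than a clean name, which is one reason a heuristic pass alone still needs a name-tagging step afterwards.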
  • 61. Sometimes heuristics work well <div class="article-columnist-name vcard"> <a class="author url fn" rel="author" href="/author/mike-lindblom/">Mike Lindblom</a> </div>
  • 62. But sometimes heuristics fall over: some pages do not have any special markup for the byline.
  • 63. But sometimes heuristics fall over: some pages have misleading or wrong markup.
  • 64. But sometimes heuristics fall over: links to related stories and sidebar bylines look nearly identical to the main byline.
  • 65. Machine learning approach: crowdsource labeled data using Spare5; approx. 9,000 labeled pages.
  • 67. [Pipeline diagram] Page HTML → Dragnet HTML Blockifier → Block representation → Block ranking model → Ranked blocks → Author chunker → Tags for highest ranked block
  • 68. Block ranking model: combines NLP and web features. Tokens in block text (similar to bag-of-words classification); tokens in block HTML tag attributes (e.g. class="byline"); the HTML tags in the block (e.g. many author names are links); rel="author" and other markup-inspired features. Put all features through a Random Forest classifier that predicts the probability that a block contains the author.
  • 69. Block model performance: the overall block model is pretty good and captures the intuition that “bylines are easy to spot”. The table lists Precision@K for whether the block actually contains the author’s name.
  • 70. Author Chunker: a modified IOB tagger similar to an NP chunker or POS tagger. A 3-class classification problem (Beginning of name, Inside name, Outside). Features used to make the prediction at the next token: uni-, bi- and tri-gram tokens from the previous/next few tokens; uni-, bi- and tri-gram POS tags; previous predicted IOB labels; HTML tags preceding and following the token; rel="author" and other markup-inspired features. Overall 85.6% accuracy chunking the top block.
  • 71. Author Chunker <p class="byline"> by <a href="http://arstechnica.com/author/john-timmer/" rel="author"> <span>John Timmer</span> </a> - <span class="date">Jul 1, 2016 6:55 pm UTC</span> </p>
  • 72. Author Chunker using HTML features. Tokens: by John Timmer - Jul 1. POS tags: IN NNP NNP - NN CD. IOB labels predicted so far: O ??. HTML between tokens: <p class="byline"> <a rel="author"><span> </span></a>. To make a prediction here, we can use tokens, POS tags and HTML structure between tokens.
  • 73. Overall author model performance: overall accuracy on the test set is good, outperforming the alternatives (heuristics, a commercial API, an open-source Python library).
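To make the tagging setup concrete, here is a sketch of a per-token feature extractor and an IOB decoder. The feature names, the `html_before` representation (a list of tag strings preceding each token) and both function names are assumptions for this example; the actual feature set is only summarised in the slides.

```python
def token_features(i, tokens, pos_tags, html_before, prev_labels):
    """Features for predicting the IOB label of tokens[i]."""
    feats = {
        "tok": tokens[i].lower(),
        "pos": pos_tags[i],
        "prev_tok": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_tok": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "prev_label": prev_labels[-1] if prev_labels else "<s>",
        "is_title_case": tokens[i].istitle(),
    }
    # One indicator per HTML tag that appears just before this token.
    for tag in html_before[i]:
        feats["html_before=" + tag] = True
    return feats


def decode_iob(tokens, labels):
    """Turn B/I/O labels into a list of name strings."""
    names, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                names.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:
            current.append(tok)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names
```

For the byline in the example, the feature dictionary for "John" records that the previous token is "by", the previous label is O, and an `a rel=author` tag opened just before it; a classifier would consume such dictionaries token by token, feeding each prediction back in as `prev_labels`.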
  • 76. NLP for the Web Dr. Matthew Peters @mattthemathman