1. NLP for the Web
Dr. Matthew Peters
@mattthemathman
(with thanks to Rutu Mulkar, Erin Renshaw, Dan Lecocq,
Chris Whitten, Jay Leary and many others at Moz)
2. Moz is a SaaS company that sells SEO and Content Marketing software to
professional marketers
3. We crawl a lot: > 1 Billion pages / day
We’d like to extract structured information from these pages
16. Outline
• Motivation – what are some of the unique challenges and opportunities in doing NLP on the web?
• Main article extraction / page de-chroming
• Keyword extraction
• Author identification
• Conclusion
18. C1: Pages have clutter and unrelated text
Examples: navigation aids, ads, links to other articles
19. C2: Text segments can confuse NLP components
Mike Lindblom Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker: "Mike Lindblom Bertha", "State sues", "the lawsuit"
20. C3: The web has a lot of cruft
• Many different standards and only partial adoption
• Wide variety of templates and sites
• Broken HTML
21. O1: Pages have additional structure
• Web pages have attributes other than the visual text: URL string, page title, meta description, etc.
• The HTML/XML has a tree structure we can use
24. General approach
1. Use HTML parser to represent page as a tree
2. Split the tree into small pieces and analyze each piece
separately
3. Run NLP pipelines or other machine learning models on these
small pieces to:
- focus attention on the important pieces
- extract structured information
- other task dependent objectives
4. Use algorithms that efficiently process only the raw HTML
(without JS, images, CSS, etc.)
27. Dragnet
• Combine diverse features with machine learning
• Open source: https://github.com/seomoz/dragnet
• v1 (2013): link/text density + CETR:
blog: https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/
paper: http://www2013.org/companion/p89.pdf
• v2 (2015): added Readability
blog: https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/
28. 10,000 foot view
• Split page into distinct visual elements called “blocks”
• Use machine learning to classify each block as content or non-content
Inspired by:
Kohlschütter et al, Boilerplate Detection using Shallow Text Features, WSDM ’10
Weninger et al, CETR -- Content Extraction with Tag Ratios, WWW ‘10
Readability: (https://github.com/buriy/python-readability)
29. Splitting the text into blocks
• Use newlines in the HTML (\n) (CETR)
• Use subtrees (Readability)
• Flatten HTML and break on <div>, <p>, <h1>, etc.
(Kohlschütter et al. and us)
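The flatten-and-break strategy can be sketched with Python's stdlib HTML parser. This is a minimal sketch, not Dragnet's actual blockifier; the set of block-breaking tags here is illustrative and would be tuned per corpus.

```python
from html.parser import HTMLParser

# Tags that start a new visual block when flattening the HTML
# (illustrative subset; the real list is tuned per corpus).
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "ul", "ol", "table"}

class Blockifier(HTMLParser):
    """Flatten HTML into text 'blocks', breaking on block-level tags."""
    def __init__(self):
        super().__init__()
        self.blocks = [[]]

    def _break(self):
        if self.blocks[-1]:          # avoid creating empty blocks
            self.blocks.append([])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._break()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks[-1].append(text)

def blockify(html):
    parser = Blockifier()
    parser.feed(html)
    return [" ".join(b) for b in parser.blocks if b]
```

For example, `blockify('<div><h1>Title</h1><p>Body text.</p></div>')` yields two blocks, `['Title', 'Body text.']`, which can then be classified independently.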
33. Text and link "density" are two important features.
High text density and low link density → more likely to be content
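Link density, for instance, can be approximated with a short regex sketch (this is illustrative, not Dragnet's actual feature code, which works over the parsed block representation):

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.S | re.I)

def link_density(block_html):
    """Fraction of a block's visible characters that sit inside <a> tags.
    High link density suggests navigation/boilerplate rather than content."""
    text = TAG_RE.sub("", block_html)
    anchor_text = "".join(TAG_RE.sub("", m) for m in ANCHOR_RE.findall(block_html))
    return len(anchor_text) / len(text) if text else 0.0
```

A navigation block made entirely of links scores near 1.0, while a paragraph with a single inline link scores low.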
34. Compute the "smoothed tag ratio", the ratio of # tags to # chars, for each block.
Compute the "smoothed absolute difference tag ratio", dTR/dblock; this captures the intuition that main content occurs together.
Run k-means with 3 clusters on the blocks, with one centroid always pinned at (0, 0).
Blocks in the (0, 0) cluster are non-content, the remainder content.
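The pinned-centroid clustering step can be sketched in pure Python (a sketch of the CETR-style step, not Dragnet's exact code; the deterministic far-point initialization is our own simplification):

```python
def kmeans_pinned(points, k=3, iters=20):
    """K-means over (tag ratio, absolute difference) points where centroid 0
    is pinned at the origin; blocks landing in that cluster are non-content."""
    def d2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    # Deterministic init: pin centroid 0, seed the rest with far-out points.
    far = sorted(points, key=lambda p: p[0] ** 2 + p[1] ** 2, reverse=True)
    centroids = [(0.0, 0.0)] + far[:k - 1]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: d2(p, centroids[i]))].append(p)
        for i in range(1, k):                      # centroid 0 never moves
            if clusters[i]:
                centroids[i] = (sum(p[0] for p in clusters[i]) / len(clusters[i]),
                                sum(p[1] for p in clusters[i]) / len(clusters[i]))
    return [min(range(k), key=lambda i: d2(p, centroids[i])) for p in points]
```

Blocks assigned label 0 (the origin cluster) are treated as boilerplate; everything else is kept as content.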
36. Use a simplified version of Readability:
• Compute score for each subtree using:
- parent id/class attributes
- length of text
• Find subtree with highest score
• Block feature = maximum subtree score over all subtrees containing the block
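The subtree scoring might look like the following sketch (the keyword lists and weights are illustrative, not Readability's or Dragnet's actual values):

```python
import re

# Readability-style hints: class/id substrings suggesting content vs boilerplate.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|header|menu|nav|sidebar|widget", re.I)

def subtree_score(id_class, text):
    """Score a subtree from its parent id/class attributes and text length."""
    score = 0
    if POSITIVE.search(id_class):
        score += 25
    if NEGATIVE.search(id_class):
        score -= 25
    score += min(len(text) // 100, 3)   # reward longer text, capped
    return score
```

A `<div class="post-content">` holding several paragraphs scores high, while a `<div class="sidebar">` scores negative regardless of length.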
42. Extract a ranked list of keywords from a page, each with a relevancy score:
(91, 'bertha')
(61, 'stp')
(59, 'state sues')
(44, 'tunnel')
(37, 'wsdot')
(30, 'tunnel construction')
(28, 'the seattle times')
(17, 'seattle tunnel partners')
(13, 'repair bertha')
(10, 'transportation lawsuit')
43. Prior work
Many prior papers on a similar task
Most use small data sets (hundreds of labeled examples) → unsupervised + supervised methods
Wide range of previous approaches, almost always tailored to a specific type of document (academic papers, etc.)
Requirements for a "gold standard" are fuzzy
Our approach:
Build a web-specific algorithm to leverage unique aspects of the domain
Combine many different features / approaches
Overcome data limitations and build a complex model by gathering lots of data automatically
45. Main article
Run Dragnet to extract the main article content. Keep track of individual blocks and process each separately.
46. Text & token normalization
Include web-specific logic in the tokenizer / normalizer:
• This displays as a dash but is the Unicode character U+2014
• Need to special-case @twitter handles, email@domain.com, dates, etc.
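A web-aware tokenizer along these lines can be sketched with the stdlib `re` module (the patterns here are illustrative, not our production tokenizer): normalize exotic Unicode dashes first, then keep @handles and email addresses as single tokens.

```python
import re

# Map Unicode dash variants (figure dash, en/em dash, horizontal bar) to "-".
DASHES = dict.fromkeys(map(ord, "\u2012\u2013\u2014\u2015"), "-")

TOKEN = re.compile(r"""
    [\w.+-]+@[\w-]+\.[\w.-]+   # email addresses
  | @\w+                       # twitter handles
  | \w+(?:['.]\w+)*            # ordinary words, contractions, abbreviations
  | \S                         # any leftover symbol
""", re.VERBOSE)

def tokenize(text):
    """Normalize dashes, then tokenize with web-specific special cases."""
    return TOKEN.findall(text.translate(DASHES))
```

For example, `tokenize("Email me@moz.com \u2014 or ping @mattthemathman")` keeps the email address and the handle intact as single tokens.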
47. Generate Candidates → Rank Candidates
Raw HTML → Parse / De-chroming → Normalize → Sentence / word tokenize → Candidates → Ranked Topics
48. Processing individual blocks helps the NP chunker
Block 1: Mike Lindblom
Block 2: Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker: "Mike Lindblom", "Bertha", "State sues", "the lawsuit"
50. Generate Candidates → Rank Candidates
Raw HTML → Parse / De-chroming → Normalize → Sentence / word tokenize → POS tag / Noun phrase chunk + Wikipedia lookup → Candidates → Ranked Topics
51. Generate Candidates → Rank Candidates
Raw HTML → Parse / De-chroming → Normalize → Sentence / word tokenize → POS tag / Noun phrase chunk + Wikipedia lookup → Candidates
Candidate features: Shallow, Occurrence, TF, QDR, POS, URL → Ranked Topics
52. Ranking model features
Shallow: relative position in document, number of tokens
Occurrence: does the candidate occur in the title, H1, meta description, etc.
Term frequency: count of occurrences, average token count, sum(in degree), etc.
QDR: information-retrieval-motivated "query-document relevance" ranking models: TF-IDF (term frequency × inverse document frequency), probabilistic approaches, language models
POS tags: is the keyword a proper noun, etc.
URL features: does the keyword appear in the URL
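The TF-IDF piece of the QDR feature group, for example, reduces to a few lines (a textbook sketch; the document-frequency table and the exact smoothing are assumptions, not our production scorer):

```python
import math

def tf_idf(candidate, doc_tokens, doc_freq, n_docs):
    """TF-IDF relevance of a candidate term to one page.
    doc_freq maps term -> number of corpus documents containing it;
    add-one smoothing avoids division by zero for unseen terms."""
    tf = doc_tokens.count(candidate)
    idf = math.log(n_docs / (1 + doc_freq.get(candidate, 0)))
    return tf * idf
```

A term that appears often on the page but rarely in the corpus (like "bertha") scores far higher than a common word with the same on-page count.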
53. Generate Candidates → Rank Candidates
Raw HTML → Parse / De-chroming → Normalize → Sentence / word tokenize → POS tag / Noun phrase chunk + Wikipedia lookup → Candidates
Candidate features: Shallow, Occurrence, TF, QDR, POS, URL → Classifier (probability of relevance) → Ranked Topics
54. Generating training data
List of high volume keywords → Commercial Search Engine → Top 10 results → Crawl pages → Training Data: HTML with relevant keyword
55. PU learning
Learning classifiers from only positive and unlabeled data, Elkan and Noto, SIGKDD 2008
● Most ML classifiers have both positive and negative examples in training data
● We only have one keyword per page that is relevant ("positive") and many others that may or may not be positive
● Use the result from this paper applied to our data
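The core correction from Elkan & Noto can be sketched in a few lines (names are illustrative): train any classifier g to separate labeled positives from unlabeled examples, estimate c = E[g(x) | x labeled positive] on held-out positives, and divide.

```python
def elkan_noto_correct(holdout_positive_scores, score):
    """Elkan & Noto (2008) calibration sketch.
    g() is a classifier trained on labeled-positive vs unlabeled data;
    c = E[g(x) | x positive] estimates the labeling probability, and the
    corrected probability is p(positive | x) ~= g(x) / c, capped at 1."""
    c = sum(holdout_positive_scores) / len(holdout_positive_scores)
    return min(1.0, score / c)
```

Intuitively, because only some true positives get labeled, the raw classifier under-predicts; dividing by c rescales its scores back toward true positive probabilities.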
56. Generate Candidates → Rank Candidates
Raw HTML → Parse / De-chroming → Normalize → Sentence / word tokenize → POS tag / Noun phrase chunk + Wikipedia lookup → Candidates
Candidate features: Shallow, Occurrence, TF, QDR, POS, URL → Classifier (probability of relevance) → Ranked Topics
57. Keyword extraction review
Resulting algorithm is:
● Robust across different content types – worst case still extracts
reasonable topics
● Reasonably fast, about 25 pages / second end-to-end
● Subjectively outperforms other commercial APIs (e.g. Alchemy)
● In production for a year+, processed many millions of pages
59. Author
Extract a list of author names
(or an empty list if no authors)
from a given web page.
See: https://moz.com/devblog/web-page-author-extraction/
60. Do we need an ML algorithm for this?
(Why isn't it trivial?)
Heuristics do an adequate job:
● The microformat rel="author" attribute in anchor tags (<a>) is commonly used to specify the page author
● Some sites specify page authors with a meta author tag.
● Many sites use names like "author" or "byline" for class attributes in their CSS.
61. Sometimes heuristics work well
<div class="article-columnist-name vcard">
<a class="author url fn" rel="author"
href="/author/mike-lindblom/">Mike Lindblom</a>
</div>
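Heuristics like these can be sketched with Python's stdlib HTML parser (a minimal sketch covering just the rel="author" and meta-author rules, not a production extractor):

```python
from html.parser import HTMLParser

class AuthorHeuristics(HTMLParser):
    """Collect authors from rel="author" anchors and <meta name="author">."""
    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author_link = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("rel") == "author":
            self._in_author_link = True
        elif tag == "meta" and a.get("name", "").lower() == "author":
            if a.get("content"):
                self.authors.append(a["content"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_author_link = False

    def handle_data(self, data):
        if self._in_author_link and data.strip():
            self.authors.append(data.strip())

def extract_authors(html):
    parser = AuthorHeuristics()
    parser.feed(html)
    return parser.authors
```

Run against the markup above, this returns `["Mike Lindblom"]`; the point of the ML model in the following slides is handling the many pages where no such markup exists.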
67. Author extraction pipeline
Page HTML → Dragnet HTML Blockifier → Block representation → Block ranking model → Ranked blocks → Author chunker → Tags for highest ranked block
68. Block ranking model
Combines NLP and web features:
• Tokens in the block text (similar to bag-of-words classification)
• Tokens in the block's HTML tag attributes (e.g. class="byline")
• The HTML tags in the block (e.g. many author names are links)
• rel="author" and other markup-inspired features
All features go through a Random Forest classifier that predicts the probability a block contains an author
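Assembling the feature vector for one block might look like the sketch below (the feature names and cue words are illustrative, not Dragnet's actual features; in production these dictionaries would be vectorized and fed to the Random Forest):

```python
def block_features(block_text, block_attrs, block_tags):
    """Feature sketch for the block ranking model: token cues, HTML-attribute
    cues (e.g. class="byline"), tag cues, and rel="author" markup."""
    attr_text = " ".join(block_attrs).lower().replace('"', "")
    tokens = [t.lower() for t in block_text.split()]
    return {
        "has_by_token": "by" in tokens,                 # bag-of-words style cue
        "has_byline_attr": "byline" in attr_text or "author" in attr_text,
        "n_links": block_tags.count("a"),               # author names are often links
        "has_rel_author": "rel=author" in attr_text,    # markup-inspired feature
    }
```

A typical byline block ("by John Timmer" inside `class="byline"`) lights up several of these features at once, which is why bylines are easy for the model to spot.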
69. Block model performance
The overall block model is quite good, capturing the intuition that "bylines are easy to spot".
The table lists Precision@K: whether the top-ranked blocks actually contain the author's name.
70. Author Chunker
Modified IOB tagger, similar to an NP chunker or POS tagger
3-class classification problem (Beginning of name, Inside name, Outside)
To make a prediction at the next token, features include:
uni-, bi- and tri-gram tokens from the previous/next few tokens
uni-, bi- and tri-gram POS tags
previously predicted IOB labels
HTML tags preceding and following the token
rel="author" and other markup-inspired features
Overall, 85.6% accurate when chunking the top block.
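Per-token feature extraction for such a chunker might look like this sketch (feature names are illustrative; a real tagger would use wider n-gram windows and a trained sequence model):

```python
def chunker_features(tokens, pos_tags, html_tags, i, prev_labels):
    """Features for predicting the IOB label of token i: the token and its
    POS tag, the previously predicted label, the HTML tag preceding the
    token, capitalization, and a token bigram."""
    feats = {
        "token": tokens[i].lower(),
        "pos": pos_tags[i],
        "prev_label": prev_labels[i - 1] if i > 0 else "START",
        "html_before": html_tags[i],
        "is_capitalized": tokens[i][:1].isupper(),
    }
    if i > 0:
        feats["bigram"] = f"{tokens[i-1].lower()} {tokens[i].lower()}"
    return feats
```

For the byline "by John Timmer", the features at "John" (previous token "by", POS NNP, an `<a rel="author">` tag just before it, capitalized) strongly suggest the Beginning-of-name label.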
72. Author Chunker using HTML features
Tokens:   by John Timmer - Jul 1
POS tags: IN NNP NNP - NN CD
Labels:   O ??
HTML:     <p class="byline"> <a rel="author"><span> </span></a>
To make a prediction here, we can use the tokens, POS tags and HTML structure between tokens
73. Overall author model performance
Overall accuracy on the test set is good, outperforming the alternatives: heuristics, a commercial API, and an open-source Python library.