Natural Language Processing in R (rNLP)

Fridolin Wild, The Open University, UK
Tutorial to the Doctoral School
at the Institute of Business Informatics
of the Goethe University Frankfurt

Structure of this tutorial
• An introduction to R and cRunch
• Language basics in R
• Basic I/O in R
• Social Network Analysis
• Latent Semantic Analysis
• Twitter
• Sentiment
• (Advanced I/O in R: MySQL, SparQL)

cRunch
• is an infrastructure
• for computationally-intense learning analytics
• supporting researchers
• in investigating big data
• generated in the co-construction of knowledge
… and beyond
…

Architecture
(Thiele & Lehner, 2011)
Living Reports
data shop
cron jobs
R webservices

Living reports
• reports with embedded scripts and data
• knitr and Sweave
• render to html, PDF, …
• visualisations:
– ggplot2, trellis, graphix
– jpg, png, eps, pdf
png(file="n.png"); plot(network(m)); dev.off()
• Fill-in-the-blanks:
Drop-out rate went down to
<<echo=FALSE>>=
doquote["OU","2011"]
@
\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle
In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
# kruskal.test() ships with the stats package (formerly ctest)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone
distribution varies significantly from month to month. Finally we
include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}

Example html5 report
Example Report
=============
This is an example of embedded scripts and
data.
```{r}
a = "hello world"
print(a)
```
And here is an example of how to embed a chart.
```{r fig.width=7, fig.height=6}
plot( 5:20 )
```

Shiny Widgets (1)
• Widgets: use-case sized encapsulations of mini apps
• HTML5
• Two files: ui.R, server.R
• Still missing: manifest files (info.plist, config.xml)

Shiny Widgets (2)
From http://www.rstudio.com/shiny/

Web Services
harmonization &
data warehousing

Example R web service
print("hello world")

More complex R web service
setContentType("image/png")
a = c(1,3,5,12,13,15)
image_file = tempfile()
png(file=image_file)
plot(a,
main = "The magic image",
ylab = "", xlab = "",
col = c("darkred", "darkblue", "darkgreen")
)
dev.off()
sendBin(readBin(image_file,'raw',n=file.info(image_file)$size))
unlink(image_file)

R web services
• Uses the Apache module mod_R.so
• See http://Rapache.net
• Common server functions:
– GET and POST variables
– setContentType
– sendBin
– …

A word on memory mgmt.
• Advanced memory management
(see p. 70 of Dietl diploma thesis):
– Use package bigmemory (for shared memory across threads)
– Use package Rserve (for shared read-only access across threads)
– Swap out memory objects with save() and load()
– The latter is typically sufficient (hard disks are fast!)
• Data management abstraction layer for mod_R.so:
configure the handler in httpd.conf, specify a directory match, and load
specific data management routines at start-up:
REvalOnStartup "source('/dbal.R');"

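The save()/load() swap-out named on the memory management slide can be sketched in base R as follows. The object name `space`, its contents, and the temporary file are illustrative assumptions, not part of the cRunch setup.

```r
# Hypothetical large object standing in for, e.g., a latent-semantic space.
space <- matrix(runif(1e4), nrow = 100)
checksum <- sum(space)

# Swap out: write the object to disk, then free the memory it occupied.
swap_file <- tempfile(fileext = ".rda")
save(space, file = swap_file)
rm(space)
invisible(gc())

# Swap in: load() restores the object under its original name and
# returns the names of everything it restored.
restored_names <- load(swap_file)
unlink(swap_file)
```

Because load() recreates objects under the names they were saved with, a web service only needs to know the file path and the expected object name.
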
Job scheduling
• crontab entries for R webservices
• e.g. harvest feeds
• e.g. store in local DB

Data shop and the community
• You have a 'public/' folder :)
– 'public/data': save() any .rda file and it will be indexed within the hour
– 'public/services': use this to execute your scripts; indexed within the hour
– 'public/gallery': use this to store your public visualisations
– code sharing: any .R script in your 'public/' folder is source readable by
the web

Not covered
The useful pointer

More NLP packages
install.packages("NaturalLanguageProcessing")
library("NaturalLanguageProcessing")

studio
exploratory
programming

Social Network Analysis


The basic concept
• Precursors date back to the 1920s, the math to
Euler's 'Seven Bridges of Königsberg'
• Social Networks are:
• Actors (people, groups, media, tags, …)
• Ties (interactions, relationships, …)
• Actors and ties form a graph
• The graph has measurable structural properties
• Betweenness,
• Degree of Centrality,
• Density,
• Cohesion
• Structural Patterns

Forum Messages
message_id forum_id parent_id author
130 2853483 2853445 N 2043
131 1440740 785876 N 1669
132 2515257 2515256 N 5814
133 4704949 4699874 N 5810
134 2597170 2558273 N 2054
135 2316951 2230821 N 5095
136 3407573 3407568 N 36
137 2277393 2277387 N 359
138 3394136 3382201 N 1050
139 4603931 4167338 N 453
140 6234819 6189254 6231352 5400
141 806699 785877 804668 2177
142 4430290 3371246 3380313 48
143 3395686 3391024 3391129 35
144 6270213 6024351 6265378 5780
145 2496015 2491522 2491536 2774
146 4707562 4699873 4707502 5810
147 2574199 2440094 2443801 5801
148 4501993 4424215 4491650 5232
message_id forum_id parent_id author
60 734569 31117 N 2491
221 762702 31117 1
317 762717 31117 762702 1927
1528 819660 31117 793408 1197
1950 840406 31117 839998 1348
1047 841810 31117 767386 1879
2239 862709 31117 N 1982
2420 869839 31117 862709 2038
2694 884824 31117 N 5439
2503 896399 31117 862709 1982
2846 901691 31117 895022 992
3321 951376 31117 N 5174
3384 952895 31117 951376 1597
1186 955595 31117 767386 5724
3604 958065 31117 N 716
2551 960734 31117 862709 1939
4072 975816 31117 N 584
2574 986038 31117 862709 2043
2590 987842 31117 862709 1982

Incidence Matrix
• msg_id = incident, authors appear in incidents

Derive Adjacency Matrix
= t(im) %*% im
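A minimal sketch of the step from incidence matrix to adjacency matrix, assuming that an incident is the thread root a message belongs to. The author and message ids are taken from the forum 31117 listing; treating the root post as part of its own thread is an assumption made for illustration.

```r
# One row per post: the thread root (incident) and the posting author.
posts <- data.frame(
  incident = c(862709, 862709, 862709, 862709, 862709, 862709, 951376, 951376),
  author   = c(1982,   2038,   1982,   1939,   2043,   1982,   5174,   1597)
)

# Incidence matrix: incidents in rows, authors in columns, 1 = author appears.
counts <- unclass(table(posts$incident, posts$author))
im <- (counts > 0) * 1

# Adjacency matrix: authors x authors, counting shared incidents.
am <- t(im) %*% im
diag(am) <- 0  # drop self-ties
```

Before the diagonal is zeroed it holds the number of incidents each author took part in, which can be worth keeping as a separate activity measure.
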

Network Density
• Total edges = 29
• Possible edges =
18 * (18-1)/2 = 153
• Density = 0.19
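The density figure can be checked with a few lines of R. The helper `network_density()` and the four-node example graph are hypothetical; the sketch assumes an undirected network stored as a symmetric adjacency matrix.

```r
# Density of an undirected graph: observed edges / possible edges.
network_density <- function(am) {
  n <- nrow(am)
  edges <- sum(am[upper.tri(am)] > 0)  # count each undirected edge once
  edges / (n * (n - 1) / 2)
}

# Hypothetical example: a path graph 1 - 2 - 3 - 4.
path <- matrix(0, 4, 4)
path[cbind(1:3, 2:4)] <- 1
path <- path + t(path)
path_density <- network_density(path)

# The numbers from the slide: 29 edges among 18 actors.
slide_density <- 29 / (18 * (18 - 1) / 2)
```
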

Analysis
• Mix
• Match
• Optimise

Tutorials
• Starter: sna-simple.Rmd
• Real: sna-blog.Rmd
• Advanced: sna-forum.Rmd

Latent Semantic Analysis

Latent Semantic Analysis
• “Humans learn word meanings and how to combine
them into passage meaning through experience
with ~paragraph unitized verbal environments.”
• “They don't remember all the separate words of a
passage; they remember its overall gist or
meaning.”
• “LSA learns by 'reading' ~paragraph unitized
texts that represent the environment.”
• “It doesn't remember all the separate words of a
text; it remembers its overall gist or meaning.”
(Landauer, 2007)

Word choice is over-rated
• An educated adult understands ~100,000 word forms
• An average sentence contains 20 tokens
• Thus 100,000^20 possible combinations of words in a
sentence
• Maximum of log2(100,000^20)
= 332 bits in word choice alone
• 20! = 2.4 x 10^18 possible orders of 20 words
= maximum of 61 bits from the order of the words
• 332/(61 + 332) = 84% word choice
(Landauer, 2007)
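The arithmetic behind the 84% figure can be reproduced directly; the variable names below are illustrative only.

```r
vocabulary <- 1e5   # word forms an educated adult understands
tokens     <- 20    # tokens in an average sentence

# Bits carried by word choice: log2(vocabulary^tokens) = tokens * log2(vocabulary)
bits_choice <- tokens * log2(vocabulary)

# Bits carried by word order: log2(20!)
bits_order <- log2(factorial(tokens))

# Share of the information that comes from word choice alone
share_choice <- bits_choice / (bits_choice + bits_order)
```
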

LSA (2)
• Assumption: texts have a semantic structure
• However, this structure is obscured by word
usage (noise, synonymy, polysemy, …)
• Proposed LSA solution:
– map the doc-term matrix
– using conceptual indices
– derived statistically (truncated SVD)
– and make similarity comparisons using angles

Input (e.g., documents)
{ M } =
Deerwester, Dumais, Furnas, Landauer, and Harshman (1990):
Indexing by Latent Semantic Analysis, In: Journal of the American
Society for Information Science, 41(6):391-407
Only the red terms appear in more
than one document, so strip the rest.
term = feature
vocabulary = ordered set of features
TEXTMATRIX

Singular Value Decomposition
M = T S D'

Truncated SVD
latent-semantic space
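A minimal sketch of the decomposition and its truncation with base R's svd(). The toy term-by-document matrix is hypothetical (it is not the Deerwester example), and k = 2 is chosen only for illustration.

```r
# Hypothetical term-by-document matrix: 5 terms, 4 documents.
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("term", 1:5), paste0("doc", 1:4)))

# Full decomposition M = T S D'
dec <- svd(M)
M_full <- dec$u %*% diag(dec$d) %*% t(dec$v)

# Truncated SVD: keep only the k largest singular values.
k  <- 2
Tk <- dec$u[, 1:k, drop = FALSE]
Sk <- diag(dec$d[1:k], nrow = k)
Dk <- dec$v[, 1:k, drop = FALSE]
Mk <- Tk %*% Sk %*% t(Dk)
dimnames(Mk) <- dimnames(M)
```

The squared reconstruction error of Mk equals the sum of the squared singular values that were dropped, which gives a quick sanity check on the truncation.
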

Reconstructed, Reduced Matrix
m4: Graph minors: A survey

Similarity in a Latent-Semantic Space
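[Figure: Query, Target 1 and Target 2 drawn as vectors in a plane spanned by
the X dimension and the Y dimension; Angle 1 and Angle 2 separate the query
from the two targets]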
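Comparing a query with targets by angle can be sketched as follows. The two-dimensional vectors and the helper names `cosine_sim()` and `angle_deg()` are hypothetical.

```r
# Cosine of the angle between two vectors, clamped against rounding error.
cosine_sim <- function(a, b) {
  value <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  max(-1, min(1, value))
}

# The angle itself, in degrees.
angle_deg <- function(a, b) acos(cosine_sim(a, b)) * 180 / pi

# Hypothetical vectors in a two-dimensional latent-semantic space.
query   <- c(1, 1)
target1 <- c(2, 1)
target2 <- c(0, 3)

angle1 <- angle_deg(query, target1)
angle2 <- angle_deg(query, target2)
```

The smaller angle marks the target closer in meaning to the query; vector length plays no role in the comparison.
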

doc2doc - similarities
• Unreduced = pure vector space model
– based on M = TSD'
– Pearson correlation over document vectors
• Reduced
– based on M2 = T S2 D'
– Pearson correlation over document vectors
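The difference between the two similarity tables can be sketched with base R's cor(), which correlates the columns (documents) of a matrix. The toy matrix is hypothetical and k = 2 mirrors the two-dimensional space on the slide.

```r
# Hypothetical term-by-document matrix: 5 terms, 4 documents.
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("term", 1:5), paste0("doc", 1:4)))

# Reduced matrix from the two largest singular values.
dec <- svd(M)
k <- 2
M2 <- dec$u[, 1:k] %*% diag(dec$d[1:k], nrow = k) %*% t(dec$v[, 1:k])
dimnames(M2) <- dimnames(M)

# doc2doc similarities: Pearson correlation over document vectors.
sim_unreduced <- cor(M,  method = "pearson")
sim_reduced   <- cor(M2, method = "pearson")
```

In this toy example doc1 and doc2 share only one term and correlate weakly in the unreduced table, while in the reduced space they end up on the same vector.
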

Ex Post Updating: Folding-In
• SVD factor stability
– SVD calculates factors over a given text base
– Different texts – different factors
– Challenge: avoid unwanted factor changes
(e.g., bad essays)
– Solution: folding-in of essays instead of recalculating
• SVD is computationally expensive

Folding-In in Detail
(1) Convert the original vector v to 'Dk' format:
$d_i = v^T T_k S_k^{-1}$
(2) Convert the 'Dk'-format vector to 'Mk' format:
$m_i = T_k S_k d_i^T$
where $M_k = T_k S_k D_k^T$
(Berry et al., 1995)
LSA Process & Driving Parameters
4 x 12 x 7 x 2 x 3
= 2016 Combinations

Pre-Processing
• Stemming
– Porter Stemmer (snowball.tartarus.org)
– 'move', 'moving', 'moves' => 'move'
– even more important in German (more inflections)
• Stop Word Elimination
– 373 Stop Words in German
• Stemming plus Stop Word Elimination
• Unprocessed ('raw') Terms
Term Weighting Schemes
$weight_{ij} = lw(tf_{ij}) \cdot gw(tf_{ij})$
• Local Weights (LW)
– None ('raw' tf)
– Binary Term Frequency
– Logarithmized Term Frequency (log)
• Global Weights (GW)
– None ('raw' tf)
– Normalisation: $norm_i = 1 / \sqrt{\sum_j tf_{ij}^2}$
– Inverse Document Frequency (IDF): $idf_i = \log_2(numdocs / docfreq(i)) + 1$
– 1 + Entropy: $entplusone_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log numdocs}$,
where $p_{ij} = tf_{ij} / \sum_j tf_{ij}$
SVD-Dimensionality
• Many different proposals (see package)
• 80% variance is a good estimator
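One way to read the 80% rule is to keep the smallest number of dimensions whose squared singular values reach that share of the total. This reading, the helper name `dims_for_share()` and the singular values below are assumptions; the dimensionality helpers in the package may define the share differently.

```r
# Smallest k whose leading squared singular values reach the requested share.
dims_for_share <- function(d, share = 0.8) {
  explained <- cumsum(d^2) / sum(d^2)
  which(explained >= share)[1]
}

# Hypothetical singular values, sorted in decreasing order as svd() returns them.
d <- c(4, 2, 1, 1)
k80 <- dims_for_share(d)         # share = 0.8
k70 <- dims_for_share(d, 0.7)
```
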

Proximity Measures
• Pearson Correlation
• Cosine Similarity
• Spearman's Rho
pics: http://davidmlane.com/hyperstat/A62891.html
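The three measures react differently to shifts and to monotone but non-linear relationships, which a short base R sketch can show. The vectors and the helper `cosine()` are hypothetical.

```r
# Cosine is not built into base R's cor(), so define it by hand.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

a <- c(1, 2, 3, 4, 5)

# Monotone but non-linear relationship: ranks agree perfectly,
# raw values do not lie on a straight line.
cubed <- a^3
pearson_cubed  <- cor(a, cubed, method = "pearson")
spearman_cubed <- cor(a, cubed, method = "spearman")

# Constant shift: Pearson centres the vectors first, cosine does not.
shifted <- a + 10
pearson_shifted <- cor(a, shifted, method = "pearson")
cosine_shifted  <- cosine(a, shifted)
```

Which measure fits depends on whether absolute weights, linear co-variation or rank order should drive the comparison.
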

Pair-wise dis/similarity
Convergence expected: ‘eu’, ‘österreich’
Divergence expected: ‘jahr’, ‘wien’

The Package
• Available via CRAN, e.g.:
http://cran.r-project.org/web/packages/lsa/index.html
• Higher-level Abstraction to Ease Use
– Core methods:
textmatrix() / query()
lsa()
fold_in()
as.textmatrix()
– Support methods for term weighting, dimensionality
calculation, correlation measurement, …

Core Workflow
• tm = textmatrix("dir/")
• tm = lw_logtf(tm) * gw_idf(tm)
• space = lsa(tm, dims=dimcalc_share())
• tm3 = fold_in(tm, space)
• as.textmatrix(space)

Tutorials
• Starter: lsa-indexing.Rmd
• Real: lsa-essayscoring.Rmd
• Advanced: lsa-sparse.Rmd

Additional tutorials

Tutorials
• Advanced I/O: twitter.Rmd
• Advanced I/O: sparql.Rmd
• Advanced NLP: twitter-sentiment.Rmd
• Evaluation: interrater-agreement.Rmd
