The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials
Natural Language Processing in R (rNLP)
1. Natural Language Processing in R (rNLP)
Fridolin Wild, The Open University, UK
Tutorial to the Doctoral School at the Institute of Business Informatics of the Goethe University Frankfurt
2. Structure of this tutorial
• An introduction to R and cRunch
• Language basics in R
• Basic I/O in R
• Social Network Analysis
• Latent Semantic Analysis
• Twitter
• Sentiment
• (Advanced I/O in R: MySQL, SPARQL)
4. cRunch
• is an infrastructure
• for computationally intense learning analytics
• supporting researchers
• in investigating big data
• generated in the co-construction of knowledge
… and beyond
…
8. Living reports
• reports with embedded scripts and data
• knitr and Sweave
• render to html, PDF, …
• visualisations:
– ggplot2, trellis, graphix
– jpg, png, eps, pdf
png(file="n.png"); plot(network(m)); dev.off()
• Fill-in-the-blanks:
Drop-out rate went down to
<<echo=FALSE>>=
doquote["OU","2011"]
@
\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle
In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
# kruskal.test() ships with the stats package, so no extra library() call is needed
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone
distribution varies significantly from month to month. Finally we
include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}
10. Example html5 report
Example Report
=============
This is an example of embedded scripts and
data.
```{r}
a = "hello world"
print(a)
```
And here is an example of how to embed a chart.
```{r fig.width=7, fig.height=6}
plot( 5:20 )
```
11. Shiny Widgets (1)
• Widgets: use-case sized encapsulations of mini apps
• HTML5
• Two files: ui.R, server.R
• Still missing: manifest files (info.plist, config.xml)
15. More complex R web service
setContentType("image/png")
a = c(1,3,5,12,13,15)
image_file = tempfile()
png(file=image_file)
plot(a,
main = "The magic image",
ylab = "", xlab = "",
col = c("darkred", "darkblue", "darkgreen")
)
dev.off()
sendBin(readBin(image_file,'raw',n=file.info(image_file)$size))
unlink(image_file)
16. R web services
• Uses the Apache module mod_R.so
• See http://Rapache.net
• Common server functions:
– GET and POST variables
– setContentType
– sendBin
– …
17. A word on memory mgmt.
• Advanced memory management (see p.70 of Dietl diploma thesis):
– Use package bigmemory (for shared memory across threads)
– Use package Rserve (for shared read-only access across threads)
– Swap out memory objects with save() and load()
– The latter is typically sufficient (hard disks are fast!)
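The save() and load() swapping mentioned above can be sketched in base R as follows. This is only an illustration: the object name `big_object`, its size, and the temporary file location are assumptions made for the example.

```r
# Hypothetical large object that we want to move out of memory for a while
big_object <- matrix(rnorm(1e4), nrow = 100)
original_sum <- sum(big_object)   # kept only to check the round trip

# Swap out: write the object to disk, then free the memory it used
swap_file <- tempfile(fileext = ".rda")
save(big_object, file = swap_file)
rm(big_object)

# ... other memory-hungry work could happen here ...

# Swap in: load() restores the object under its original name
load(swap_file)
unlink(swap_file)                 # remove the swap file again

print(dim(big_object))
```

Note that load() restores objects under the names they were saved with, so a script should not reuse those names for something else in between.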
• Data management abstraction layer for mod_R.so: configure the handler in httpd.conf, specify a directory match, and load specific data management routines at start-up:
REvalOnStartup "source('/dbal.R');"
21. Data shop and the community
• You have a 'public/' folder :)
– 'public/data': save() any .rda file and it will be indexed within the hour
– 'public/services': use this to execute your scripts; indexed within the hour
– 'public/gallery': use this to store your public visualisations
– code sharing: any .R script in your 'public/' folder is source readable by the web
28. The basic concept
• Precursors date back to the 1920s, the math to Euler's 'Seven Bridges of Koenigsberg'
• Social Networks are:
– Actors (people, groups, media, tags, …)
– Ties (interactions, relationships, …)
• Actors and ties form a graph
• The graph has measurable structural properties:
– Betweenness
– Degree centrality
– Density
– Cohesion
– Structural patterns
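As a minimal sketch of two of the structural properties listed above, degree and density can be computed from an adjacency matrix in base R. The four actors and their ties are invented for illustration. Betweenness and cohesion would normally come from a dedicated network package and are not shown here.

```r
# Hypothetical actors and undirected ties (illustrative data only)
actors <- c("Ann", "Bob", "Cem", "Dee")
ties <- rbind(c("Ann", "Bob"),
              c("Ann", "Cem"),
              c("Ann", "Dee"),
              c("Bob", "Cem"))

# Build a symmetric adjacency matrix: 1 where two actors share a tie
adj <- matrix(0, nrow = length(actors), ncol = length(actors),
              dimnames = list(actors, actors))
adj[ties] <- 1            # tie from first to second actor
adj[ties[, 2:1]] <- 1     # and back, because ties are undirected

# Degree: number of ties per actor
degree <- rowSums(adj)

# Density: observed ties divided by possible ties in an undirected graph
n <- length(actors)
density <- sum(adj) / (n * (n - 1))

print(degree)
print(round(density, 2))
```

The adjacency matrix is the same kind of object that network packages accept as input, so the sketch can serve as a starting point before switching to a full SNA toolkit.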
42. Latent Semantic Analysis
• “Humans learn word meanings and how to combine them into passage meaning through experience with ~paragraph unitized verbal environments.”
• “They don't remember all the separate words of a passage; they remember its overall gist or meaning.”
• “LSA learns by 'reading' ~paragraph unitized texts that represent the environment.”
• “It doesn't remember all the separate words of a text; it remembers its overall gist or meaning.”
(Landauer, 2007)
43. Word choice is over-rated
• An educated adult understands ~100,000 word forms
• An average sentence contains 20 tokens
• Thus $100{,}000^{20}$ possible combinations of words in a sentence
• Maximum of $\log_2 100{,}000^{20} = 332$ bits in word choice alone
• $20! = 2.4 \times 10^{18}$ possible orders of 20 words = maximum of 61 bits from the order of the words
• 332 / (61 + 332) = 84% word choice
(Landauer, 2007)
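The arithmetic on this slide can be checked in a few lines of R. The variable names are chosen for this sketch only. The vocabulary size and sentence length are the figures quoted above.

```r
vocabulary      <- 1e5   # word forms an educated adult understands
sentence_length <- 20    # tokens in an average sentence

# Bits needed to pick 20 words from the vocabulary: log2(100000^20)
bits_choice <- sentence_length * log2(vocabulary)

# Bits needed to pick one of the 20! possible word orders
bits_order <- log2(factorial(sentence_length))

# Share of the information carried by word choice
share_choice <- bits_choice / (bits_choice + bits_order)

print(round(bits_choice))          # about 332 bits
print(round(bits_order))           # about 61 bits
print(round(100 * share_choice))   # about 84 percent
```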
44. LSA (2)
• Assumption: texts have a semantic structure
• However, this structure is obscured by word usage (noise, synonymy, polysemy, …)
• Proposed LSA solution:
– map the doc-term matrix
– using conceptual indices
– derived statistically (truncated SVD)
– and make similarity comparisons using angles
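The steps above can be sketched with base R's svd(). The small term-by-document matrix and the choice of k = 2 dimensions are assumptions made for illustration. They are not taken from the tutorial's data.

```r
# Hypothetical term-by-document count matrix (rows = terms, columns = documents)
M <- matrix(c(2, 1, 0, 0,
              1, 2, 0, 0,
              1, 1, 0, 1,
              0, 0, 2, 1,
              0, 0, 1, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("human", "computer", "user", "graph", "tree"),
                            c("d1", "d2", "d3", "d4")))

# Full singular value decomposition: M = T S D^T
dec <- svd(M)

# Truncate to the k largest singular values (the "conceptual indices")
k  <- 2
Tk <- dec$u[, 1:k, drop = FALSE]
Sk <- diag(dec$d[1:k], nrow = k)
Dk <- dec$v[, 1:k, drop = FALSE]

# Rank-k approximation of the original matrix
Mk <- Tk %*% Sk %*% t(Dk)
dimnames(Mk) <- dimnames(M)

# Cosine of the angle between two document vectors
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

cos_raw     <- cosine(M[, "d1"],  M[, "d3"])   # original space
cos_reduced <- cosine(Mk[, "d1"], Mk[, "d3"])  # latent-semantic space

print(c(raw = cos_raw, reduced = cos_reduced))
```

Documents d1 and d3 share no terms, so their cosine in the original space is zero. In the reduced space the value can differ, because the truncation mixes in indirect co-occurrence through the other documents.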
45. Input (e.g., documents)
{ M } =
Deerwester, Dumais, Furnas, Landauer, and Harshman (1990): Indexing by Latent Semantic Analysis. In: Journal of the American Society for Information Science, 41(6):391-407
Only the red terms appear in more
than one document, so strip the rest.
term = feature
vocabulary = ordered set of features
TEXTMATRIX
49. Similarity in a Latent-Semantic Space
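[Figure: Query, Target 1, and Target 2 vectors plotted over an X dimension and a Y dimension, with Angle 1 and Angle 2 marking the angles used for the similarity comparison]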
50. doc2doc - similarities
Unreduced = pure vector space model
- based on $M = T S D^T$
- Pearson correlation over document vectors
Reduced
- based on $M_2 = T S_2 D^T$
- Pearson correlation over document vectors
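A hedged sketch of both variants in base R follows. It assumes that S2 denotes S with all but the two largest singular values set to zero, and it uses a made-up term-by-document matrix. cor() correlates columns, which are the document vectors here.

```r
# Hypothetical term-by-document matrix (rows = terms, columns = documents)
M <- matrix(c(2, 1, 0, 0,
              1, 2, 0, 0,
              1, 1, 0, 1,
              0, 0, 2, 1,
              0, 0, 1, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("term", 1:5), paste0("d", 1:4)))

# Unreduced: Pearson correlation over the original document vectors
doc2doc_raw <- cor(M)

# Reduced: keep T and D, zero all but the two largest singular values (assumed S2)
dec <- svd(M)
s2  <- dec$d
s2[-(1:2)] <- 0
M2  <- dec$u %*% diag(s2) %*% t(dec$v)
dimnames(M2) <- dimnames(M)

doc2doc_reduced <- cor(M2)

print(round(doc2doc_raw, 2))
print(round(doc2doc_reduced, 2))
```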
51. Ex Post Updating: Folding-In
• SVD factor stability
– SVD calculates factors over a given text base
– Different texts – different factors
– Challenge: avoid unwanted factor changes
(e.g., bad essays)
– Solution: folding-in of essays instead of recalculating
• SVD is computationally expensive
52. Folding-In in Detail
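(1) convert the original vector $v$ to "Dk" format: $d_i = v^T T_k S_k^{-1}$
(2) convert the "Dk"-format vector to "Mk" format: $m_i = T_k S_k d_i^T$
[Figure: $M_k$ shown as the product of $T_k$, $S_k$, and $D_k$, with the new vector $v^T$ being folded in]
(Berry et al., 1995)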
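The two folding-in steps can be sketched in base R as below. The matrix, the new document vector, and the helper name `fold_in_vector` are hypothetical and only serve the illustration. The lsa package provides its own fold_in() for real use.

```r
# Hypothetical term-by-document matrix used to build the space
M <- matrix(c(2, 1, 0, 0,
              1, 2, 0, 0,
              1, 1, 0, 1,
              0, 0, 2, 1,
              0, 0, 1, 2),
            nrow = 5, byrow = TRUE)

# Truncated SVD with k dimensions: Mk = Tk Sk Dk^T
dec <- svd(M)
k   <- 2
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)
Dk  <- dec$v[, 1:k, drop = FALSE]

# Hypothetical helper implementing steps (1) and (2)
fold_in_vector <- function(v, Tk, Sk) {
  d <- t(v) %*% Tk %*% solve(Sk)   # (1) d = v^T Tk Sk^-1, a 1 x k row
  m <- Tk %*% Sk %*% t(d)          # (2) m = Tk Sk d^T, a term vector
  list(d = d, m = m)
}

# Sanity check: folding in a document that was part of M
known <- fold_in_vector(M[, 1], Tk, Sk)

# Folding in a new, unseen document without recomputing the SVD
new_doc <- c(1, 0, 1, 0, 0)
folded  <- fold_in_vector(new_doc, Tk, Sk)

print(round(folded$d, 3))
print(round(folded$m, 3))
```

Folding in a document that already belongs to M reproduces its row of Dk and its column of Mk, which is a convenient check that the two formulas are wired up correctly.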
53. LSA Process & Driving Parameters
4 × 12 × 7 × 2 × 3 = 2016 combinations
54. Pre-Processing
• Stemming
– Porter Stemmer (snowball.tartarus.org)
– 'move', 'moving', 'moves' => 'move'
– even more important in German (more inflections)
• Stop Word Elimination
– 373 Stop Words in German
• Stemming plus Stop Word Elimination
• Unprocessed ('raw') Terms
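Stop word elimination can be sketched in base R as follows. The sentence and the three-word stop list are invented for the example, since a real list (such as the 373 German stop words mentioned above) would be loaded from a resource. Stemming is left out because it needs a stemmer implementation.

```r
# Hypothetical input text and a tiny illustrative stop word list
text       <- "The cat moves and the dog is moving"
stop_words <- c("the", "and", "is")

# Lower-case the text and split it into tokens on non-letter characters
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]   # drop empty strings from leading separators

# Keep only the tokens that are not stop words
content_terms <- tokens[!tokens %in% stop_words]

print(content_terms)
```

A stemmer would then map 'moves' and 'moving' to the same stem, which is the step that matters even more for highly inflected languages.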
55. Term Weighting Schemes
• Global Weights (GW)
– None ('raw' tf)
– Normalisation: $norm_i = 1 / \sqrt{\sum_j tf_{ij}^2}$
– Inverse Document Frequency (IDF): $idf_i = \log_2(numdocs / docfreq(i)) + 1$
– 1 + Entropy: $entplusone_i = 1 + \sum_j (p_{ij} \log p_{ij}) / \log(numdocs)$, where $p_{ij} = tf_{ij} / \sum_j tf_{ij}$
• Local Weights (LW)
– None ('raw' tf)
– Binary Term Frequency
– Logarithmized Term Frequency (log)
• $weight_{ij} = lw(tf_{ij}) \cdot gw(tf_{ij})$
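The weighting schemes above can be written out in base R as a minimal sketch. The term frequency matrix is made up, and log(tf + 1) is an assumed form of the logarithmized local weight, since the slide only says "(log)". The lsa package ships ready-made weighting functions for real use.

```r
# Hypothetical term frequency matrix (rows = terms i, columns = documents j)
tf <- matrix(c(2, 2, 2,
               0, 3, 0,
               1, 0, 1,
               4, 1, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("a", "b", "c", "d"), c("d1", "d2", "d3")))
numdocs <- ncol(tf)

# Local weights
lw_raw    <- tf
lw_binary <- (tf > 0) * 1
lw_log    <- log(tf + 1)            # assumption: log(tf + 1)

# Global weights, one value per term
gw_norm <- 1 / sqrt(rowSums(tf^2))
gw_idf  <- log2(numdocs / rowSums(tf > 0)) + 1

p     <- tf / rowSums(tf)           # p_ij = tf_ij / sum_j tf_ij
plogp <- ifelse(p > 0, p * log(p), 0)   # treat 0 * log(0) as 0
gw_entplusone <- 1 + rowSums(plogp) / log(numdocs)

# weight_ij = lw(tf_ij) * gw(tf_ij): each row is scaled by its global weight
weighted <- lw_log * gw_idf

print(round(gw_entplusone, 3))
print(round(weighted, 3))
```

Term "a" is spread evenly over all documents and receives a global entropy weight of about zero, while term "b" occurs in a single document and receives a weight of one, which is the intended effect of the scheme.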
59. The Package
• Available via CRAN, e.g.:
http://cran.r-project.org/web/packages/lsa/index.html
• Higher-level Abstraction to Ease Use
– Core methods:
textmatrix() / query()
lsa()
fold_in()
as.textmatrix()
– Support methods for term weighting, dimensionality calculation, correlation measurement, …