Text Mining with R


       Aleksei Beloshytski
           Kyiv, 2012-Feb
Table of Contents
I.   Goal of research and limitations

II.  Data Preparation
     1. Scrape text from blogs (blogs.korrespondent.net)
     2. Stemming and cleaning
     3. Bottlenecks mining Cyrillic

III. Text Mining & clustering
     1. Term normalization (TF-IDF). Most Frequent and Correlated terms
     2. Hierarchical clustering with hclust
     3. Clustering with k-means and k-medoids
     4. Comparing clusters

IV.  Conclusion
Goals..
demonstrate the most popular practices for mining dissimilar texts with a low
number of observations

mine blogs on https://blogs.korrespondent.net and identify the most discussed topics

identify bottlenecks when mining Cyrillic

perform hierarchical clustering with the hclust method

perform clustering using the k-means and k-medoids methods

compare results
Limitations..
no initial blog categorization by date range, subject(s), author(s), etc.*

  last 245 blogs** from blogs.korrespondent.net as of the day of analysis
  blogs with less than 1 KB of plain text excluded




* There is no goal to achieve the best cluster accuracy, but the most discussed
subjects (clusters) should be identified.
** 245 – after excluding empty and small blogs (<1 KB) from the initial 400 blogs
Step 1.
Scrape text from blogs
How to scrape blogs..
HTML parsing
parse each page and get urls
not transparent
RSS feed
keeps only 1 day history


Twitter (@Korr_blog)
each tweet has blog URL
easy and transparent for R
Parse tweets

   Get tweets
   Extract URL from text
   Remove empty URLs
   Unshorten double-shorted URLs
   Validate URLs
   Remove duplicates



                        ..
                        [269]   "http://blogs.korrespondent.net/journalists/blog/anna-radio/a51779"
                        [270]   "http://blogs.korrespondent.net/celebrities/blog/gritsenko/a51727"
                        [271]   "http://blogs.korrespondent.net/celebrities/blog/press13/a51764"
                        [272]   "http://blogs.korrespondent.net/celebrities/blog/olesdoniy/a51736"
                        [273]   "http://blogs.korrespondent.net/journalists/blog/raimanelena/a51724"
                        ..




   * Full R code is available at the end
Step 2.
Stemming and Cleaning
Clean texts
   Translate all blogs into English
   Extract the translated text from the HTML code
   Load the texts into a Corpus
   Map to lower case; remove punctuation, stop words, and numbers; strip white space
   Stem documents
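
A minimal sketch of this cleaning pipeline with the tm package (assuming corp already
holds the translated English texts; the deck itself uses the older Rstem/Snowball
packages, so the exact calls below are an approximation with current tm/SnowballC):

require(tm)
require(SnowballC)
corp <- tm_map(corp, content_transformer(tolower))       # map to lower case
corp <- tm_map(corp, removePunctuation)                  # rm punctuation
corp <- tm_map(corp, removeWords, stopwords("english"))  # rm stop words
corp <- tm_map(corp, removeNumbers)                      # rm numbers
corp <- tm_map(corp, stripWhitespace)                    # strip white spaces
corp.stem <- tm_map(corp, stemDocument)                  # Porter stemming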
Bottlenecks mining Cyrillic texts
declensions in RU/UA words: after stemming, the same word still has several forms


0xFF problem ("я" is byte 0xFF in windows-1251): DocumentTermMatrix (in R) crops such texts.
E.g. 'янукович' – filtered out, 'объявлять' – 'объ', 'братья' – 'брать' (the sense changes), etc.


Cyrillic texts with pseudo-graphic or special symbols can't be encoded properly with the
windows-1251 charset (an additional urlencode filter is required, which is not supported in R)
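
One common workaround for the 0xFF problem (an assumption, not the fix used in this
deck) is to re-encode the raw texts to UTF-8 before loading them into the Corpus, so
the 0xFF byte of "я" no longer collides with anything:

# blog.txt is a hypothetical raw file saved in windows-1251
txt <- readLines("blog.txt", encoding = "windows-1251")
txt <- iconv(txt, from = "windows-1251", to = "UTF-8")  # "я" becomes a valid multi-byte character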
Translate texts into English
     #see the full code in Appendix F
     go_tr <- function(url) {
         src.url<-URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
         html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
         frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
         params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
         #...
         dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
         #...
         dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
         return(dest.url)
       }




[1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"


[1] "http://translate.googleusercontent.com/translate_c?
rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJ
rhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
Original Blog Text. Example
До официального старта Евро остается 150 дней. В разгаре, так называемая, операционная
подготовка к Чемпионату. Речь идет о налаживании коммуникаций между принимающими городами,
обучении персонала и наведении марафета в целом. Ни для кого не секрет, что, по сравнению с
украинцами, поляки получили гораздо больше дивидендов в процессе подготовки к ЧЕ. В первую
очередь, речь идет о привлечении немалых ресурсов за счет финансирования из фондов ЕС.
...

 Translated & extracted Text
Before the official launch of the Euro is 150 days.In the midst of the so-called operational
preparation for the championship.It is about establishing communication between the host cities,
staff training and marafet hover as a whole.It's no secret that, in comparison with the
Ukrainians, the Poles were far more dividends in preparation for the Championship.First of all,
we are talking about bringing considerable resources through financing from EU funds.
...

 Cleaned Text
official launch euro days midst called operational preparation championship establishing
communication host cities staff training marafet hover secret comparison ukrainians poles
dividends preparation championship talking bringing considerable resources financing eu funds
...

 Stemmed Text
offici launch euro day midst call oper prepar championship establish communic host citi staff
train marafet hover secret comparison ukrainian pole dividend prepar championship talk bring
consider resourc financ eu fund
...
Step 3.
Text Mining & Clustering
Text Mining and Clustering
    Build normalized TermDocumentMatrix. Remove Sparse Terms
    Hierarchical Clustering, Dendrogram
    Kmeans. Perform Clustering and visualize clusters
    Kmedoids. Perform Clustering and visualize clusters
Term Normalization
DocumentTermMatrix Structure


                                  Terms
                                ncol=4101

                0.0175105020782697,   ...   0.019135397913606,
                0.0095258656396137,   ...   0.017510502078269,
                0.0099078198722524,   ...   0.014062173579334,
       Docs     0.0163576201358285,   ...   0.014114967574557,
     nrow=237   ...
                0.0113371897967796,   ...   0.014732724300492,


Each cell is the TF-IDF weight of a term in a document.
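
The weights above come from tm's weightTfIdf, which multiplies the length-normalized
term frequency by log2(N / df). A minimal sketch of building such a matrix (assuming
corp from the cleaning step; the sparse threshold mirrors the later slides):

dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
dim(dtm)                                       # 237 docs x 4101 terms here
dtm.k <- removeSparseTerms(dtm, sparse = 0.9)  # drop very sparse terms (349 remain in this deck)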
Most Frequent & Correlated Terms.
Why is that important?
Most Frequent Terms (Latin)

       Non-stemmed terms
       > findFreqTerms(dtm, lowfreq=1)


[1] "country"         "euro"         "european"   "government" "internet"   "kiev"       "kyiv"  "money"
[9] "opposition"      "party"        "people"     "political"  "power"      "president" "russia" "social"
[17] "society"        "tymoshenko"   "ukraine"    "ukrainian" "world"       "yanukovych"




       Stemmed terms
       > findFreqTerms(dtm, lowfreq=1)

[1]    "chang"    "countri"      "elect"      "euro"       "european"   "govern"       "internet"   "kiev"
[9]    "kyiv"     "leader"       "money"      "opposit"    "parti"      "peopl"        "polit"      "power"
[17]   "presid"   "russia"       "russian"    "social"     "societi"    "tymoshenko"   "ukrain"     "ukrainian"
[25]   "world"    "yanukovych"




 * See the full R code in the Appendixes
Correlated Terms (Cyrillic vs Latin). Example
     >findAssocs(dtm, 'евр', 0.35) #correlation with the term "евро"



      евр     старт   гарант     хлеб     тыс   талисман   официальн   воплощен   будущ   чемпионат   живет
     1.00      0.76     0.74     0.71    0.62       0.55        0.49       0.48    0.35        0.31    0.22
подготовк    реплик   секрет   футбол
     0.22      0.22     0.21     0.21




     >findAssocs(dtm, 'euro', 0.35)


   euro      championship      footbal   tourist     airport    tournament         fan    poland
   1.00              0.68         0.57      0.49        0.45          0.43        0.42      0.42
horribl     infrastructur      foreign    patrol     unhappi        prepar    flashmob
   0.38              0.38         0.37      0.37        0.37          0.36        0.35
Correlation Matrix (Latin vs Cyrillic). Example




   English Terms: higher correlation, better term accuracy
Hierarchical Clustering (hclust)
Cluster Dendrogram*
           #input – DTM normalized with TF-IDF (349 terms, sparse=0.7)
           d <- dist(dtm2.df.scale, method = "euclidean") # dissimilarity matrix
           #clustering with Ward's method
           fit <- hclust(d=d, method="ward") #compare: "complete","single","mcquitty","median", "centroid"




* Full result of h-clustering is available in pdf
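
To turn the dendrogram into flat clusters comparable with the k-means/k-medoids runs
below, the tree can be cut; a minimal sketch (cutting at k=20 is an assumption that
mirrors those runs):

groups <- cutree(fit, k = 20)               # assign each blog to one of 20 clusters
table(groups)                               # cluster sizes
plot(fit, cex = 0.6)
rect.hclust(fit, k = 20, border = "red")    # outline the 20 clusters on the dendrogram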
Hierarchical Clustering Summary

    universal hierarchical clustering with different algorithms, e.g. Ward's objective
    function based on squared Euclidean distance (it's worth playing with other methods)

    good with a large number of terms and a small number of observations

    gives an understanding of the correlation between terms in the Corpus

    provides a visual representation of how clusters are nested within each other




* Full result of h-clustering is available in pdf
Clustering with kmeans
Description of the k-means algorithm*




 1) k initial "means" (in this case k=3) are randomly selected from the data set
    (shown in color).
 2) k clusters are created by associating every observation with the nearest mean.
    The partitions here represent the Voronoi diagram generated by the means.
 3) The centroid of each of the k clusters becomes the new means.
 4) Steps 2 and 3 are repeated until convergence has been reached.




* Source: http://en.wikipedia.org/wiki/K-means
Assess number of clusters using kmeans$withinss




fewer terms in the DTM
 higher sum of squares
  better cluster quality


more terms in the DTM
 lower sum of squares
  lower cluster quality




                  Unexpected expected results
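
A sketch of the elbow assessment behind this chart, extending the single-k formula
from Appendix L across a range of k (dtm.k as on the next slide):

m <- as.matrix(dtm.k)                            # dense matrix for kmeans
wss <- numeric(20)
wss[1] <- (nrow(m) - 1) * sum(apply(m, 2, var))  # k=1: total sum of squares
for (k in 2:20)
  wss[k] <- sum(kmeans(m, centers = k, iter.max = 40, nstart = 10)$withinss)
plot(1:20, wss, type = "b", xlab = "k (clusters)", ylab = "within-groups sum of squares")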
Clustering with 20 centers
  #dtm.k – DocumentTermMatrix (TF-IDF) with 349 terms (cleaned with sparse=0.9)
  #nstart – let's try 10 random starts to generate centroids
  #algorithm – "Hartigan-Wong" (default)
  > dtm.clust<-kmeans(x=dtm.k, centers=20, iter.max=40, nstart=10, algorithm="Hartigan-Wong")




   Cluster sizes
     > dtm.clust$size

       [1] 41 21 4 1 1 5 1 7 12 5 98 2 3 7 10 1 4 2 1 11



   Sum of squares
     > dtm.clust$withinss

 [1] 0.75166171 0.37998302 0.08702162 0.00000000 0.00000000 0.10884947 0.00000000 0.21350480 0.22052166
 [10] 0.07426058 1.35245927 0.03003547 0.05145358 0.12662083 0.25722734 0.00000000 0.08037547 0.02691182
 [19] 0.00000000 0.22561816




* See the full R code in Appendixes
kmeans. Cluster Visualization




                           Distance matrix (Euclidean)
                           Scale the multi-dimensional DTM down to 2 dimensions
k-means clustering Summary
Clustering with kmedoids
Assess number of clusters with pam$silinfo$avg.width




Recommended number of clusters: 2. However …
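
A sketch of how this assessment can be produced with cluster::pam (dtm.dist is the
Euclidean distance matrix from Appendix N; the range of k is an assumption):

require(cluster)
asw <- numeric(40)
for (k in 2:40)
  asw[k] <- pam(dtm.dist, k = k, diss = TRUE)$silinfo$avg.width
k.best <- which.max(asw)   # 2 on this data, as the slide reports
plot(2:40, asw[2:40], type = "b", xlab = "k-medoids (# clusters)",
     ylab = "average silhouette width")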
Perform clustering with 20 centers
                       #max_diss, av_diss – maximum/average dissimilarity
                       between observations in the cluster and the
                       cluster's medoid

                       #diameter – maximum dissimilarity between two
                       observations in the cluster

                       #separation – minimal dissimilarity between an
                       observation in the cluster and an observation of
                       another cluster




Result: 4 clusters
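
A minimal sketch of the run this slide summarizes (k=20 mirrors the k-means setup;
dtm.dist as above):

dtm.clust.m <- pam(x = dtm.dist, k = 20, diss = TRUE)
dtm.clust.m$clusinfo   # per cluster: size, max_diss, av_diss, diameter, separation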
kmedoids. Cluster Visualization
k-medoids clustering Summary
Recognize clusters
Recognized clusters* ([cluster - # of blogs])

   "Ukrainian elections"                      [2-21]
   "Ukrainian democracy"                      [3-4]
   "social networks, ex.ua"                   [6-5]
   "tymoshenko, opposition, court"            [8-7]
   "Euro-2012"                                [9-12]
   "Ukraine-Russia relations, gas"            [10-5]
   "Ukrainian taxes"                          [12-2]
   "Ukraine-EU relations"                     [14-7]
   "protests, demonstrations, human rights"   [15-10]
   "culture, regulation"                      [17-4], [13-3]
   "journalist investigations"                [20-11]
   "all other blogs with various topics"      (unrecognized)

    Total blogs recognized: 91 of 236 (~40%)

* Based on kmeans
Conclusion


the number of elements in the data vector (349) must be significantly smaller than
the number of observations (245)

some resulting clusters include "unlike" blogs (see the sum of squares)

try kmeans for better precision when mining big, dissimilar texts with a low number
of observations; in other cases kmedoids is the more robust model

focus on similar texts (by category, date range) for the best accuracy

sentiment analysis would make the analysis even more tasteful
Questions & Answers



                 Aleksei Beloshytski
     Aleksei.Beloshytski@gmail.com
Appendix A. kmeans. Voronoi Diagram (“Euclidean”)
Appendix B. kmeans. Voronoi Diagram (“Manhattan”)
Appendix C. kmeans. Heatmap (most freq. terms). TF-IDF
Appendix D. kmedoids. Heatmap (most freq. terms). TF
Appendix E. R packages required for analysis

                  require(twitteR)
                  require(XML)
                  require(plyr)
                  require(tm)
                  require(Rstem)
                  require(Snowball)
                  require(corrplot)
                  require(RWeka)
                  require(RCurl)
                  require(wordcloud)
                  require(ggplot2)
                  require(vegan)
                  require(reshape2)
                  require(cluster)
                  require(alphahull)
Appendix F. R Code. Translate texts into English

     go_tr <- function(url) {
         src.url<-URLencode(paste("http://translate.google.com/translate?sl=auto&tl=en&u=", url, sep=""))
         html.parse <- htmlTreeParse(getURL(src.url), useInternalNodes = TRUE)
         frame.c <- getNodeSet(html.parse, '//frameset//frame[@name="c"]')
         params <- sapply(frame.c, function(t) t <- xmlAttrs(t)[[1]])
         src.url <- paste("http://translate.google.com", params, sep = "")
         dest.url <- getURL(src.url, followlocation = TRUE)
         html <- htmlTreeParse(dest.url, useInternalNodes = TRUE)
         dest.url <- capture.output(getNodeSet(html, '//meta[@http-equiv="refresh"]')[[1]])
         dest.url <- strsplit(dest.url, "URL=", fixed = TRUE)[[1]][2]
         dest.url <- gsub(""/>", "", dest.url, fixed = TRUE)
         dest.url <- gsub(" ", "", dest.url, fixed = TRUE)
         dest.url <- xmlValue(getNodeSet(htmlParse(dest.url, asText = TRUE), "//p")[[1]])
         return(dest.url)
       }




[1] "http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268"


[1] "http://translate.googleusercontent.com/translate_c?
rurl=translate.google.com&sl=auto&tl=en&u=http://blogs.korrespondent.net/celebrities/blog/gknbu/a50268&usg=ALkJ
rhisevp7b7yg4CxX6_iTDxyBAk4PCQ"
Appendix G. R Code. Parse tweets and extract URLs
require(twitteR)
kb_tweets<-userTimeline('Korr_Blogs', n=400)
#get text of tweets
urls<-laply(kb_tweets, function(t) t$getText())
#extract urls from text
url_expr<-regexec("http://[a-zA-Z0-9]\\S*$", urls);
urls<-regmatches(urls, url_expr)
#remove empty elements from the list
urls[lapply(urls, length)<1]<-NULL
#unshorten double-shorted urls
for(i in 1:length(urls)) { urls[i]<-decode_short_url(decode_short_url(urls[[i]])) }
#remove duplicates
urls<-as.list(unique(unlist(urls)))

#...

#contact me for the rest part of the code

#...
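
decode_short_url is used above but not defined in the deck; a minimal stand-in with
RCurl that follows the redirect chain and returns the final URL (an assumption, not
the author's helper):

decode_short_url <- function(u) {
  curl <- getCurlHandle()
  # HEAD-style request; follow redirects so curl records the final location
  getURL(u, curl = curl, followlocation = TRUE, nobody = TRUE)
  getCurlInfo(curl)$effective.url
}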
Appendix H. R Code. Handle blogs
for(i in 1:length(urls))
{
     #translate blogs into English
     url<-go_tr(urls[i])
     blogs<-readLines(tc<-textConnection(url));
     close (tc)

     pagetree<-try(htmlTreeParse(blogs, useInternalNodes=TRUE, encoding="windows-1251"))
          if(class(pagetree)=="try-error") next;
     x<-xpathSApply(pagetree,
"//div[@class='article_box']//*[@onmouseover]/text()|//div[@class='article_box']//*[@onmouseover]/a/text()",
xmlValue)
     x <- unlist(strsplit(x, "\n"))
     x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
     x <- x[!(x %in% c("", "|"))]
#...
}

#...

#contact me for the rest part of the code

#...
Appendix I. R Code. Manage TermDocumentMatrix

#...
corp <- Corpus(DirSource("//kor_blogs/en"), readerControl=list(language="en", encodeString="windows-1251"))
#..
#Clean texts, stemming and so on
#...
#Create DTM for both stemmed and not-stemmed Corpuses
dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, sparse=0.995) #0.995 - for both EN and RU
#...
#Find Most Frequent and Associated terms
#Build Correlation Matrix
#..
corrplot(type="lower", tl.cex=.6, corr=corr_stem_ru, title="Correlation matrix", tl.col="grey20",
method="circle", order="FPC", addtextlabel = "ld", outline=TRUE)

#...

#contact me for the rest part of the code

#...
Appendix J. R Code. Hierarchical clustering
#...
dtm2<-as.TermDocumentMatrix(dtm)
#...
dtm2.df<-as.data.frame(inspect(dtm2))
#...
(d <- dist(dtm2.df.scale, method = "euclidean")) # distance matrix
fit <- hclust(d=d, method="ward")
#..
dev.off()

#...

#contact me for the rest part of the code

#...
Appendix K. R Code. Wordcloud (most frequent terms)

require(wordcloud)
#...
dtm.m <- as.matrix(dtm)
v <- apply(dtm.m,2,sum) #calculate the number of occurrences of each word
v <- sort(v, decreasing=TRUE)
#..
wordcloud(d$word, d$freq, scale=c(9,.1), min.freq=3, max.words=Inf, random.order=F, rot.per=.3, colors=pal2)


#...

#contact me for the rest part of the code

#...
Appendix L. R Code. kmeans analysis

#...
# assess number of clusters
wss <- (nrow(dtm)-1)*sum(apply(dtm,2,var)) #for stemmed DTM
dtm_orig <- DocumentTermMatrix(corp, control = list(weighting = weightTfIdf)) # non-stemmed DTM
dtm_orig <- removeSparseTerms(dtm_orig, sparse=0.995)
#...
# visualize withinss

# perform clustering
#dtm.k – DocumentTermMatrix(TF-IDF) with 349 terms (cleaned with sparse=0.9)
#nstart – let‟s try 10 random starts to generate centroids
#algorithm – “Hartigan-Wong” (default)
dtm.clust<-kmeans(x=dtm.k,centers=20,iter.max=40, nstart=10, algorithm="Hartigan-Wong")
dtm.clust$size

#...

#contact me for the rest part of the code

#...
Appendix M. R Code. kmedoids analysis
#...
# assess number of clusters
# visualize withinss
ggplot()+geom_line(aes(x=1:236, y=asw),size=1,colour="red4") + opts(axis.text.x=theme_text(hjust=0,
colour="grey20", size=14), axis.text.y=theme_text(size=14, colour="grey20"),
axis.title.x=theme_text(size=20, colour="grey20"), axis.title.y=theme_text(angle=90, size=20,
colour="grey20")) + labs(y="average silhouette width", x="k-medoids (# clusters)",size=16) +
scale_x_continuous(breaks=c(k.best,20,40,60,80,100,120,140,160,180,200,220))

# perform kmedoids clustering
#...
dtm.clust.m$clusinfo

#...

#contact me for the rest part of the code

#...
Appendix N. R Code. Visualize clusters

#...
#define which cluster to visualize
dtm.clust.v<-dtm.clust # for kmeans
dtm.clust.v<-dtm.clust.m # for kmedoids
#...
require(vegan)
# distance matrix (computed before scaling, since cmdscale needs it)
dtm.dist <- dist(dtm.k, method="euclidean")
dtm_scaled <- cmdscale(dtm.dist) # scale from multiple dimensions down to two
#...
for(i in seq_along(groups)){
  points(dtm_scaled[factor(dtm.clust.v$cluster) == groups[i], ], col = cols[i], pch = 20)
}
# draw ordihull
ordihull(dtm_scaled, factor(dtm.clust.v$cluster), lty = "dotted", draw="polygon", label=TRUE)

#draw Voronoi diagram

#...

#contact me for the rest part of the code

#...
Appendix O. R Code. Visualize heatmaps

#...
dtm.clust.v<-dtm.clust # for kmeans
dtm.clust.v<-dtm.clust.m # for kmedoids

dtm0 <- dtm.k #dtm for kmeans clustering
dtm0 <- removeSparseTerms(dtm0, sparse=0.7) #get terms which exist in 70% of blogs
dtm.df <- as.data.frame(inspect(dtm0))
dfc <- cbind(dtm.df, id=seq(nrow(dtm.df)), cluster=dtm.clust.v$cluster) #Append id and cluster
#...
require(ggplot2)
dev.off()
dev.new()
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value)) +
opts(axis.text.x=theme_text(angle=90, hjust=0, colour="grey20", size=14)) + labs(x="", y="")

#...

#contact me for the rest part of the code

#...
Appendix P. Most Frequent Terms (Cyrillic)*

  Non-stemmed terms (cropped terms, 0xFF)
   > findFreqTerms(dtm, lowfreq=3)

[1]    "брать"      "вли"       "вопрос"   "врем"    "выборы"    "высоцкий"   "действи"   "евро"       "знакома"
[10]   "истори"     "написал"   "непри"    "нова"    "объ"       "остаетс"    "попул"     "прав"       "прин"
[19]   "прочитал"   "прошла"    "сегодн"   "суд"     "течение"   "третий"     "украине"   "украинцы"   "украины"
[28]   "хочу"




  Stemmed terms
   > findFreqTerms(dtm, lowfreq=4)

[1]    "виктор"          "власт"      "вли"         "вопрос"     "врем"       "выбор"     "высоцк"
[8]    "государствен"    "интересн"   "непри"       "объ"        "очередн"    "попул"     "последн"
[15]   "посто"           "прав"       "прин"        "прочита"    "росси"      "сегодн"    "страниц"
[22]   "течен"           "украин"




* Bold words – cropped; blue – terms that don't exist in the non-stemmed variant