SlideShare a Scribd company logo
1 of 45
Download to read offline
An Introduction to Data Mining with R
Yanchang Zhao
http://www.RDataMining.com

6 September 2013

1 / 43
Questions

Do you know data mining and techniques for it?

2 / 43
Questions

Do you know data mining and techniques for it?
Have you used R before?

2 / 43
Questions

Do you know data mining and techniques for it?
Have you used R before?
Have you used R in your data mining research or projects?

2 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
3 / 43
What is R?
R 1 is a free software environment for statistical computing
and graphics.
R can be easily extended with 4,728 packages available on
CRAN2 (as of Sept 6, 2013).
Many other packages provided on Bioconductor3 , R-Forge4 ,
GitHub5 , etc.
R manuals on CRAN6
An Introduction to R
The R Language Definition
R Data Import/Export
...
1

http://www.r-project.org/
http://cran.r-project.org/
3
http://www.bioconductor.org/
4
http://r-forge.r-project.org/
5
https://github.com/
6
http://cran.r-project.org/manuals.html
2

4 / 43
Why R?

R is widely used in both academia and industry.
R is ranked no. 1 again in the KDnuggets 2013 poll on Top
Languages for analytics, data mining, data science 7 .
The CRAN Task Views 8 provide collections of packages for
different tasks.
Machine learning & atatistical learning
Cluster analysis & finite mixture models
Time series analysis
Multivariate statistics
Analysis of spatial data
...

7
8

http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html
http://cran.r-project.org/web/views/
5 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
6 / 43
Classification with R

Decision trees: rpart, party
Random forest: randomForest, party
SVM: e1071, kernlab
Neural networks: nnet, neuralnet, RSNNS
Performance evaluation: ROCR

7 / 43
The Iris Dataset
# iris data
str(iris)

## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 .
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1
## $ Species
: Factor w/ 3 levels "setosa","versicolor",..:
# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))
iris.train <- iris[ind==1, ]
iris.test <- iris[ind==2, ]

8 / 43
Build a Decision Tree

# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data=iris.train)

9 / 43
plot(iris.ctree)
1
Petal.Length
p < 0.001
≤ 1.9

> 1.9
3
Petal.Width
p < 0.001
≤ 1.7

> 1.7

4
Petal.Length
p = 0.026
≤ 4.4

> 4.4

Node 2 (n = 40)

Node 5 (n = 21)

Node 6 (n = 19)

Node 7 (n = 32)

1

1

1

1

0.8

0.8

0.8

0.8

0.6

0.6

0.6

0.6

0.4

0.4

0.4

0.4

0.2

0.2

0.2

0.2

0

0

0

setosa

setosa

0
setosa

setosa

10 / 43
Prediction

# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)
##
## pred
setosa versicolor virginica
##
setosa
10
0
0
##
versicolor
0
12
2
##
virginica
0
0
14

11 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
12 / 43
Clustering with R

k-means: kmeans(), kmeansruns()9
k-medoids: pam(), pamk()
Hierarchical clustering: hclust(), agnes(), diana()
DBSCAN: fpc
BIRCH: birch

9

Functions are followed with “()”, and others are packages.
13 / 43
k-means Clustering
set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##
##
##
##

1 2 3
setosa
0 50 0
versicolor 2 0 48
virginica 36 0 14

14 / 43
*

3.0

*

2.5

*

2.0

Sepal.Width

3.5

4.0

# plot clusters and their centers
plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
col=1:3, pch="*", cex=5)

4.5

5.0

5.5

6.0

6.5

7.0

7.5

8.0

15 / 43
Density-based Clustering

library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
##
##
##
##
##
##

0
1
2
3

setosa versicolor virginica
2
10
17
48
0
0
0
37
0
0
3
33

16 / 43
# 1-3: clusters; 0: outliers or noise
plotcluster(iris2, ds$cluster)

0
3

3 33
0
3
3
03 3
1

2

1

dc 2

1

1
1

3
3
3 3
0
0 2 2
2
0 2 2
2
2
2

0

2

−2

0
0

−8

−6

3

3
3
3 3 0 33
3
3
3 3
3

3 30
3
3
0
3
2
3
22
2 2 22 2 2
0
3
0
2 20 2 2
2
3
2
2 22
02
0
22
30
0
3
2 20
2
0 0
0
0
2 2

−1

0

1

1

0 1
1
1 1
1 1 1
1 11
1
1
1
1
1
11 1 1 1 11
111 1
1 1 11 1
11 1
1
1
1 11
1
11

−4

−2
dc 1

0

0

0

0
0

2

17 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
18 / 43
Association Rule Mining with R

Association rules: apriori(), eclat() in package arules
Sequential patterns: arulesSequence
Visualisation of associations: arulesViz

19 / 43
The Titanic Dataset
load("./data/titanic.raw.rdata")
dim(titanic.raw)
## [1] 2201

4

idx <- sample(1:nrow(titanic.raw), 8)
titanic.raw[idx, ]
##
##
##
##
##
##
##
##
##

501
477
674
766
1485
1388
448
590

Class
Sex
Age Survived
3rd
Male Adult
No
3rd
Male Adult
No
3rd
Male Adult
No
Crew
Male Adult
No
3rd Female Adult
No
2nd Female Adult
No
3rd
Male Adult
No
3rd
Male Adult
No
20 / 43
Association Rule Mining

# find association rules with the APRIORI algorithm
library(arules)
rules <- apriori(titanic.raw, control=list(verbose=F),
parameter=list(minlen=2, supp=0.005, conf=0.8),
appearance=list(rhs=c("Survived=No", "Survived=Yes"),
default="lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
# have a look at rules
# inspect(rules.sorted)

21 / 43
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#

lhs
{Class=2nd,
Age=Child}
2 {Class=2nd,
Sex=Female,
Age=Child}
3 {Class=1st,
Sex=Female}
4 {Class=1st,
Sex=Female,
Age=Adult}
5 {Class=2nd,
Sex=Male,
Age=Adult}
6 {Class=2nd,
Sex=Female}
7 {Class=Crew,
Sex=Female}
8 {Class=Crew,
Sex=Female,
Age=Adult}
9 {Class=2nd,
Sex=Male}
10 {Class=2nd,

rhs

support confidence

lift

1

=> {Survived=Yes}

0.011

1.000 3.096

=> {Survived=Yes}

0.006

1.000 3.096

=> {Survived=Yes}

0.064

0.972 3.010

=> {Survived=Yes}

0.064

0.972 3.010

=> {Survived=No}

0.070

0.917 1.354

=> {Survived=Yes}

0.042

0.877 2.716

=> {Survived=Yes}

0.009

0.870 2.692

=> {Survived=Yes}

0.009

0.870 2.692

=> {Survived=No}

0.070

0.860 1.271
22 / 43
library(arulesViz)
plot(rules, method = "graph")
Graph for 12 rules

width: support (0.006 − 0.192)
color: lift (1.222 − 3.096)

{Class=2nd,Age=Child}
{Class=2nd,Sex=Female}
{Class=1st,Sex=Female,Age=Adult}

{Class=Crew,Sex=Female}
{Survived=Yes}
{Class=2nd,Sex=Female,Age=Child}

{Class=Crew,Sex=Female,Age=Adult}
{Class=1st,Sex=Female}

{Class=2nd,Sex=Male}

{Class=2nd,Sex=Female,Age=Adult}

{Class=3rd,Sex=Male}
{Survived=No}

{Class=3rd,Sex=Male,Age=Adult}
{Class=2nd,Sex=Male,Age=Adult}

23 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
24 / 43
Text Mining with R

Text mining: tm
Topic modelling: topicmodels, lda
Word cloud: wordcloud
Twitter data access: twitteR

25 / 43
Fetch Twitter Data
## retrieve tweets from the user timeline of @rdatammining
library(twitteR)
# tweets <- userTimeline('rdatamining')
load(file = "./data/rdmTweets.RData")
(nDocs <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 50)
##
##
##
##

[1]
[2]
[3]
[4]

"An R Reference Card for Data Mining is now"
"available on CRAN. It lists many useful R"
"functions and packages for data mining"
"applications."

# convert tweets to a data frame
df <- do.call("rbind", lapply(tweets, as.data.frame))
26 / 43
Text Cleaning
library(tm)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation & numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

27 / 43
Stemming
# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion,
dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern="miners",
replacement="mining")
strwrap(myCorpus[320], width=50)
## [1] "r reference card data mining now available cran"
## [2] "list used r functions package data mining"
## [3] "applications"

28 / 43
Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus,
control=list(wordLengths=c(1,Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq=20))
## [1] "analysis"
## [4] "data"
## [7] "network"
## [10] "postdoctoral"
## [13] "slides"
## [16] "university"

"big"
"examples"
"package"
"r"
"social"
"used"

"computing"
"mining"
"position"
"research"
"tutorial"

29 / 43
Associations

# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
## examples
##
0.32

code
0.29

package
0.20

# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##
##
##
##

data
0.47
supports
0.30

mahout recommendation
0.30
0.30
frequent
itemset
0.26
0.26

sets
0.30

30 / 43
Network of Terms
library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)

university

tutorial

social

network

analysis

mining

research

postdoctoral

position

used

r

data

big

package

examples

computing

slides

31 / 43
Word Cloud
library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=T)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)
provided melbourne

analysis outlier
map

mining network

open
graphics
thanks
conference users
processing
cfp text

analyst

exampleschapter
postdoctoral

slides used big

job

analytics join

high
sydney
topic

china

large
snowfall
casesee available poll draft
performance applications
group now
reference course code can via
visualizing
series tenuretrack
industrial center due introduction
association clustering access
information
page distributed
sentiment videos techniques tried
youtube
top presentation science
classification southern
wwwrdataminingcom
canberra added experience
management

predictive

talk

r

linkedin

vacancy

research
package
notes card

get

data

database

statistics
rdatamining
knowledge list
graph

free online

using
recent

published

workshop find

position

fast call

studies

tutorial

california

cloud

frequent
week tools

document

technology

nd

australia social university
datasets

google
short software

time learn
details
lecture

book

forecasting functions follower submission
business events
kdnuggetsinteractive
detection programmingcanada
spatial
search
ausdm pdf modelling machine
twitter
starting fellow
web
scientist
computing parallel ibm
amp rules
dmapps
handling

32 / 43
Topic Modelling
library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)
##
##
##
##
##
##
##
##
##
##
##
##

[1,]
[2,]
[3,]
[4,]
[5,]
[1,]
[2,]
[3,]
[4,]
[5,]

Topic 1
Topic 2
Topic 3 Topic 4
"data"
"r"
"r"
"research"
"mining"
"package" "time"
"position"
"big"
"examples" "series" "data"
"association" "used"
"users" "university"
"rules"
"code"
"talk"
"postdoctoral"
Topic 5
Topic 6
Topic 7
Topic 8
"mining"
"group"
"data"
"analysis"
"data"
"data"
"r"
"network"
"slides"
"used"
"mining"
"social"
"modelling" "software" "analysis" "text"
"tools"
"kdnuggets" "book"
"slides"

33 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
34 / 43
Time Series Analysis with R

Time series decomposition: decomp(), decompose(), arima(),
stl()
Time series forecasting: forecast
Time Series Clustering: TSclust
Dynamic Time Warping (DTW): dtw

35 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
36 / 43
Social Network Analysis with R

Packages: igraph, sna
Centrality measures: degree(), betweenness(), closeness(),
transitivity()
Clusters: clusters(), no.clusters()
Cliques: cliques(), largest.cliques(), maximal.cliques(),
clique.number()
Community detection: fastgreedy.community(),
spinglass.community()

37 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
38 / 43
R and Hadoop
Packages: RHadoop, RHive
RHadoop10 is a collection of 3 R packages:
rmr2 - perform data analysis with R via MapReduce on a
Hadoop cluster
rhdfs - connect to Hadoop Distributed File System (HDFS)
rhbase - connect to the NoSQL HBase database

You can play with it on a single PC (in standalone or
pseudo-distributed mode), and your code developed on that
will be able to work on a cluster of PCs (in full-distributed
mode)!
Step by step to set up my first R Hadoop system
http://www.rdatamining.com/tutorials/rhadoop

10

https://github.com/RevolutionAnalytics/RHadoop/wiki
39 / 43
An Example of MapReducing with R
library(rmr2)
map <- function(k, lines) {
words.list <- strsplit(lines, "s")
words <- unlist(words.list)
return(keyval(words, 1))
}
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
mapreduce(input = input, output = output, input.format = "text",
map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)
11
11

From Jeffrey Breen’s presentation on Using R with Hadoop

http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/
40 / 43
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources
41 / 43
Online Resources
RDataMining website
http://www.rdatamining.com

R Reference Card for Data Mining
R and Data Mining: Examples and Case Studies

RDataMining Group on LinkedIn (3100+ members)
http://group.rdatamining.com

RDataMining on Twitter (1200+ followers)
http://twitter.com/rdatamining

Free online courses
http://www.rdatamining.com/resources/courses

Online documents
http://www.rdatamining.com/resources/onlinedocs

42 / 43
The End

Thanks!
Email: yanchang(at)rdatamining.com
43 / 43

More Related Content

What's hot

Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With REdureka!
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with RYanchang Zhao
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factorskrishna singh
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
R Programming: Introduction to Matrices
R Programming: Introduction to MatricesR Programming: Introduction to Matrices
R Programming: Introduction to MatricesRsquared Academy
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaEdureka!
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPyDevashish Kumar
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia dataminingKrish_ver2
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programmingizahn
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysisDataminingTools Inc
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithmKIRAN R
 
Chart and graphs in R programming language
Chart and graphs in R programming language Chart and graphs in R programming language
Chart and graphs in R programming language CHANDAN KUMAR
 
Datastructures in python
Datastructures in pythonDatastructures in python
Datastructures in pythonhydpy
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Simplilearn
 

What's hot (20)

Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With R
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with R
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
R Programming: Introduction to Matrices
R Programming: Introduction to MatricesR Programming: Introduction to Matrices
R Programming: Introduction to Matrices
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPy
 
5desc
5desc5desc
5desc
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia datamining
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithm
 
Chart and graphs in R programming language
Chart and graphs in R programming language Chart and graphs in R programming language
Chart and graphs in R programming language
 
Datastructures in python
Datastructures in pythonDatastructures in python
Datastructures in python
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 

Viewers also liked

Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Revolution Analytics
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data MiningYanchang Zhao
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slidesYanchang Zhao
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
Ontology-driven KDD Process Composition
Ontology-driven KDD Process CompositionOntology-driven KDD Process Composition
Ontology-driven KDD Process CompositionEmanuele Storti
 
PPT file
PPT filePPT file
PPT filebutest
 

Viewers also liked (20)

Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)Introduction to R for Data Mining (Feb 2013)
Introduction to R for Data Mining (Feb 2013)
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
Data mining
Data miningData mining
Data mining
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data Mining
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slides
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
R refcard-data-mining
R refcard-data-miningR refcard-data-mining
R refcard-data-mining
 
Exploring Data
Exploring DataExploring Data
Exploring Data
 
Ontology-driven KDD Process Composition
Ontology-driven KDD Process CompositionOntology-driven KDD Process Composition
Ontology-driven KDD Process Composition
 
Data clustering
Data clustering Data clustering
Data clustering
 
PPT file
PPT filePPT file
PPT file
 
26.docking
26.docking26.docking
26.docking
 

Similar to An Introduction to Data Mining with R

Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in RSujaAldrin
 
TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RFayan TAO
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationYanchang Zhao
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rYanchang Zhao
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Machine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification ChallengesMachine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification ChallengesMarc Borowczak
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with RDr Nisha Arora
 
Ggplot2 work
Ggplot2 workGgplot2 work
Ggplot2 workARUN DN
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
 
Root cause of community problem for this discussion, you will i
Root cause of community problem for this discussion, you will iRoot cause of community problem for this discussion, you will i
Root cause of community problem for this discussion, you will issusere73ce3
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Structured Streaming with Apache Spark
Structured Streaming with Apache SparkStructured Streaming with Apache Spark
Structured Streaming with Apache SparkDataya Nolja
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methodsijcsity
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxcarliotwaycave
 

Similar to An Introduction to Data Mining with R (20)

Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
TAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with RTAO Fayan_Report on Top 10 data mining algorithms applications with R
TAO Fayan_Report on Top 10 data mining algorithms applications with R
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisation
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
Session 02
Session 02Session 02
Session 02
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Machine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification ChallengesMachine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification Challenges
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with R
 
Ggplot2 work
Ggplot2 workGgplot2 work
Ggplot2 work
 
Ruby Programming
Ruby ProgrammingRuby Programming
Ruby Programming
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?
 
Root cause of community problem for this discussion, you will i
Root cause of community problem for this discussion, you will iRoot cause of community problem for this discussion, you will i
Root cause of community problem for this discussion, you will i
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Easy R
Easy REasy R
Easy R
 
Statistical computing 01
Statistical computing 01Statistical computing 01
Statistical computing 01
 
Structured Streaming with Apache Spark
Structured Streaming with Apache SparkStructured Streaming with Apache Spark
Structured Streaming with Apache Spark
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
 
Modern association rule mining methods
Modern association rule mining methodsModern association rule mining methods
Modern association rule mining methods
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 

More from Yanchang Zhao

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisYanchang Zhao
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programmingYanchang Zhao
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rYanchang Zhao
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rYanchang Zhao
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-cardYanchang Zhao
 

More from Yanchang Zhao (7)

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysis
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programming
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-r
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-r
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-card
 

Recently uploaded

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Recently uploaded (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

An Introduction to Data Mining with R

  • 1. An Introduction to Data Mining with R Yanchang Zhao http://www.RDataMining.com 6 September 2013 1 / 43
  • 2. Questions Do you know data mining and techniques for it? 2 / 43
  • 3. Questions Do you know data mining and techniques for it? Have you used R before? 2 / 43
  • 4. Questions Do you know data mining and techniques for it? Have you used R before? Have you used R in your data mining research or projects? 2 / 43
  • 5. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 3 / 43
  • 6. What is R? R 1 is a free software environment for statistical computing and graphics. R can be easily extended with 4,728 packages available on CRAN2 (as of Sept 6, 2013). Many other packages provided on Bioconductor3 , R-Forge4 , GitHub5 , etc. R manuals on CRAN6 An Introduction to R The R Language Definition R Data Import/Export ... 1 http://www.r-project.org/ http://cran.r-project.org/ 3 http://www.bioconductor.org/ 4 http://r-forge.r-project.org/ 5 https://github.com/ 6 http://cran.r-project.org/manuals.html 2 4 / 43
  • 7. Why R? R is widely used in both academia and industry. R is ranked no. 1 again in the KDnuggets 2013 poll on Top Languages for analytics, data mining, data science 7 . The CRAN Task Views 8 provide collections of packages for different tasks. Machine learning & atatistical learning Cluster analysis & finite mixture models Time series analysis Multivariate statistics Analysis of spatial data ... 7 8 http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html http://cran.r-project.org/web/views/ 5 / 43
  • 8. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 6 / 43
  • 9. Classification with R Decision trees: rpart, party Random forest: randomForest, party SVM: e1071, kernlab Neural networks: nnet, neuralnet, RSNNS Performance evaluation: ROCR 7 / 43
  • 10. The Iris Dataset # iris data str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 . ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: # split into training and test datasets set.seed(1234) ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3)) iris.train <- iris[ind==1, ] iris.test <- iris[ind==2, ] 8 / 43
  • 11. Build a Decision Tree # build a decision tree library(party) iris.formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width iris.ctree <- ctree(iris.formula, data=iris.train) 9 / 43
  • 12. plot(iris.ctree) 1 Petal.Length p < 0.001 ≤ 1.9 > 1.9 3 Petal.Width p < 0.001 ≤ 1.7 > 1.7 4 Petal.Length p = 0.026 ≤ 4.4 > 4.4 Node 2 (n = 40) Node 5 (n = 21) Node 6 (n = 19) Node 7 (n = 32) 1 1 1 1 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0 0 0 setosa setosa 0 setosa setosa 10 / 43
  • 13. Prediction # predict on test data pred <- predict(iris.ctree, newdata = iris.test) # check prediction result table(pred, iris.test$Species) ## ## pred setosa versicolor virginica ## setosa 10 0 0 ## versicolor 0 12 2 ## virginica 0 0 14 11 / 43
  • 14. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 12 / 43
  • 15. Clustering with R k-means: kmeans(), kmeansruns()9 k-medoids: pam(), pamk() Hierarchical clustering: hclust(), agnes(), diana() DBSCAN: fpc BIRCH: birch 9 Functions are followed with “()”, and others are packages. 13 / 43
  • 16. k-means Clustering set.seed(8953) iris2 <- iris # remove class IDs iris2$Species <- NULL # k-means clustering iris.kmeans <- kmeans(iris2, 3) # check result table(iris$Species, iris.kmeans$cluster) ## ## ## ## ## 1 2 3 setosa 0 50 0 versicolor 2 0 48 virginica 36 0 14 14 / 43
  • 17. * 3.0 * 2.5 * 2.0 Sepal.Width 3.5 4.0 # plot clusters and their centers plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster) points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")], col=1:3, pch="*", cex=5) 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 15 / 43
  • 18. Density-based Clustering library(fpc) iris2 <- iris[-5] # remove class IDs # DBSCAN clustering ds <- dbscan(iris2, eps = 0.42, MinPts = 5) # compare clusters with original class IDs table(ds$cluster, iris$Species) ## ## ## ## ## ## 0 1 2 3 setosa versicolor virginica 2 10 17 48 0 0 0 37 0 0 3 33 16 / 43
  • 19. # 1-3: clusters; 0: outliers or noise plotcluster(iris2, ds$cluster) 0 3 3 33 0 3 3 03 3 1 2 1 dc 2 1 1 1 3 3 3 3 0 0 2 2 2 0 2 2 2 2 2 0 2 −2 0 0 −8 −6 3 3 3 3 3 0 33 3 3 3 3 3 3 30 3 3 0 3 2 3 22 2 2 22 2 2 0 3 0 2 20 2 2 2 3 2 2 22 02 0 22 30 0 3 2 20 2 0 0 0 0 2 2 −1 0 1 1 0 1 1 1 1 1 1 1 1 11 1 1 1 1 1 11 1 1 1 11 111 1 1 1 11 1 11 1 1 1 1 11 1 11 −4 −2 dc 1 0 0 0 0 0 2 17 / 43
  • 20. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 18 / 43
  • 21. Association Rule Mining with R Association rules: apriori(), eclat() in package arules Sequential patterns: arulesSequence Visualisation of associations: arulesViz 19 / 43
  • 22. The Titanic Dataset load("./data/titanic.raw.rdata") dim(titanic.raw) ## [1] 2201 4 idx <- sample(1:nrow(titanic.raw), 8) titanic.raw[idx, ] ## ## ## ## ## ## ## ## ## 501 477 674 766 1485 1388 448 590 Class Sex Age Survived 3rd Male Adult No 3rd Male Adult No 3rd Male Adult No Crew Male Adult No 3rd Female Adult No 2nd Female Adult No 3rd Male Adult No 3rd Male Adult No 20 / 43
  • 23. Association Rule Mining # find association rules with the APRIORI algorithm library(arules) rules <- apriori(titanic.raw, control=list(verbose=F), parameter=list(minlen=2, supp=0.005, conf=0.8), appearance=list(rhs=c("Survived=No", "Survived=Yes"), default="lhs")) # sort rules quality(rules) <- round(quality(rules), digits=3) rules.sorted <- sort(rules, by="lift") # have a look at rules # inspect(rules.sorted) 21 / 43
  • 24. # # # # # # # # # # # # # # # # # # # # # # # # lhs {Class=2nd, Age=Child} 2 {Class=2nd, Sex=Female, Age=Child} 3 {Class=1st, Sex=Female} 4 {Class=1st, Sex=Female, Age=Adult} 5 {Class=2nd, Sex=Male, Age=Adult} 6 {Class=2nd, Sex=Female} 7 {Class=Crew, Sex=Female} 8 {Class=Crew, Sex=Female, Age=Adult} 9 {Class=2nd, Sex=Male} 10 {Class=2nd, rhs support confidence lift 1 => {Survived=Yes} 0.011 1.000 3.096 => {Survived=Yes} 0.006 1.000 3.096 => {Survived=Yes} 0.064 0.972 3.010 => {Survived=Yes} 0.064 0.972 3.010 => {Survived=No} 0.070 0.917 1.354 => {Survived=Yes} 0.042 0.877 2.716 => {Survived=Yes} 0.009 0.870 2.692 => {Survived=Yes} 0.009 0.870 2.692 => {Survived=No} 0.070 0.860 1.271 22 / 43
  • 25. library(arulesViz) plot(rules, method = "graph") Graph for 12 rules width: support (0.006 − 0.192) color: lift (1.222 − 3.096) {Class=2nd,Age=Child} {Class=2nd,Sex=Female} {Class=1st,Sex=Female,Age=Adult} {Class=Crew,Sex=Female} {Survived=Yes} {Class=2nd,Sex=Female,Age=Child} {Class=Crew,Sex=Female,Age=Adult} {Class=1st,Sex=Female} {Class=2nd,Sex=Male} {Class=2nd,Sex=Female,Age=Adult} {Class=3rd,Sex=Male} {Survived=No} {Class=3rd,Sex=Male,Age=Adult} {Class=2nd,Sex=Male,Age=Adult} 23 / 43
  • 26. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 24 / 43
  • 27. Text Mining with R Text mining: tm Topic modelling: topicmodels, lda Word cloud: wordcloud Twitter data access: twitteR 25 / 43
  • 28. Fetch Twitter Data ## retrieve tweets from the user timeline of @rdatammining library(twitteR) # tweets <- userTimeline('rdatamining') load(file = "./data/rdmTweets.RData") (nDocs <- length(tweets)) ## [1] 320 strwrap(tweets[[320]]$text, width = 50) ## ## ## ## [1] [2] [3] [4] "An R Reference Card for Data Mining is now" "available on CRAN. It lists many useful R" "functions and packages for data mining" "applications." # convert tweets to a data frame df <- do.call("rbind", lapply(tweets, as.data.frame)) 26 / 43
  • 29. Text Cleaning library(tm) # build a corpus myCorpus <- Corpus(VectorSource(df$text)) # convert to lower case myCorpus <- tm_map(myCorpus, tolower) # remove punctuation & numbers myCorpus <- tm_map(myCorpus, removePunctuation) myCorpus <- tm_map(myCorpus, removeNumbers) # remove URLs removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) myCorpus <- tm_map(myCorpus, removeURL) # remove 'r' and 'big' from stopwords myStopwords <- setdiff(stopwords("english"), c("r", "big")) # remove stopwords myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 27 / 43
  • 30. Stemming # keep a copy of corpus myCorpusCopy <- myCorpus # stem words myCorpus <- tm_map(myCorpus, stemDocument) # stem completion myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) # replace "miners" with "mining", because "mining" was # first stemmed to "mine" and then completed to "miners" myCorpus <- tm_map(myCorpus, gsub, pattern="miners", replacement="mining") strwrap(myCorpus[320], width=50) ## [1] "r reference card data mining now available cran" ## [2] "list used r functions package data mining" ## [3] "applications" 28 / 43
  • 31. Frequent Terms myTdm <- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(1,Inf))) # inspect frequent words (freq.terms <- findFreqTerms(myTdm, lowfreq=20)) ## [1] "analysis" ## [4] "data" ## [7] "network" ## [10] "postdoctoral" ## [13] "slides" ## [16] "university" "big" "examples" "package" "r" "social" "used" "computing" "mining" "position" "research" "tutorial" 29 / 43
  • 32. Associations # which words are associated with 'r'? findAssocs(myTdm, "r", 0.2) ## examples ## 0.32 code 0.29 package 0.20 # which words are associated with 'mining'? findAssocs(myTdm, "mining", 0.25) ## ## ## ## data 0.47 supports 0.30 mahout recommendation 0.30 0.30 frequent itemset 0.26 0.26 sets 0.30 30 / 43
  • 33. Network of Terms library(graph) library(Rgraphviz) plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T) university tutorial social network analysis mining research postdoctoral position used r data big package examples computing slides 31 / 43
  • 34. Word Cloud library(wordcloud) m <- as.matrix(myTdm) freq <- sort(rowSums(m), decreasing=T) wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F) provided melbourne analysis outlier map mining network open graphics thanks conference users processing cfp text analyst exampleschapter postdoctoral slides used big job analytics join high sydney topic china large snowfall casesee available poll draft performance applications group now reference course code can via visualizing series tenuretrack industrial center due introduction association clustering access information page distributed sentiment videos techniques tried youtube top presentation science classification southern wwwrdataminingcom canberra added experience management predictive talk r linkedin vacancy research package notes card get data database statistics rdatamining knowledge list graph free online using recent published workshop find position fast call studies tutorial california cloud frequent week tools document technology nd australia social university datasets google short software time learn details lecture book forecasting functions follower submission business events kdnuggetsinteractive detection programmingcanada spatial search ausdm pdf modelling machine twitter starting fellow web scientist computing parallel ibm amp rules dmapps handling 32 / 43
  • 35. Topic Modelling library(topicmodels) set.seed(123) myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8) terms(myLda, 5) ## ## ## ## ## ## ## ## ## ## ## ## [1,] [2,] [3,] [4,] [5,] [1,] [2,] [3,] [4,] [5,] Topic 1 Topic 2 Topic 3 Topic 4 "data" "r" "r" "research" "mining" "package" "time" "position" "big" "examples" "series" "data" "association" "used" "users" "university" "rules" "code" "talk" "postdoctoral" Topic 5 Topic 6 Topic 7 Topic 8 "mining" "group" "data" "analysis" "data" "data" "r" "network" "slides" "used" "mining" "social" "modelling" "software" "analysis" "text" "tools" "kdnuggets" "book" "slides" 33 / 43
  • 36. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 34 / 43
  • 37. Time Series Analysis with R Time series decomposition: decomp(), decompose(), arima(), stl() Time series forecasting: forecast Time Series Clustering: TSclust Dynamic Time Warping (DTW): dtw 35 / 43
  • 38. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 36 / 43
  • 39. Social Network Analysis with R Packages: igraph, sna Centrality measures: degree(), betweenness(), closeness(), transitivity() Clusters: clusters(), no.clusters() Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number() Community detection: fastgreedy.community(), spinglass.community() 37 / 43
  • 40. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 38 / 43
  • 41. R and Hadoop Packages: RHadoop, RHive RHadoop10 is a collection of 3 R packages: rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster rhdfs - connect to Hadoop Distributed File System (HDFS) rhbase - connect to the NoSQL HBase database You can play with it on a single PC (in standalone or pseudo-distributed mode), and your code developed on that will be able to work on a cluster of PCs (in full-distributed mode)! Step by step to set up my first R Hadoop system http://www.rdatamining.com/tutorials/rhadoop 10 https://github.com/RevolutionAnalytics/RHadoop/wiki 39 / 43
  • 42. An Example of MapReducing with R library(rmr2) map <- function(k, lines) { words.list <- strsplit(lines, "s") words <- unlist(words.list) return(keyval(words, 1)) } reduce <- function(word, counts) { keyval(word, sum(counts)) } wordcount <- function(input, output = NULL) { mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce) } ## Submit job out <- wordcount(in.file.path, out.file.path) 11 11 From Jeffrey Breen’s presentation on Using R with Hadoop http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/ 40 / 43
  • 43. Outline Introduction Classification with R Clustering with R Association Rule Mining with R Text Mining with R Time Series Analysis with R Social Network Analysis with R R and Hadoop Online Resources 41 / 43
  • 44. Online Resources RDataMining website http://www.rdatamining.com R Reference Card for Data Mining R and Data Mining: Examples and Case Studies RDataMining Group on LinkedIn (3100+ members) http://group.rdatamining.com RDataMining on Twitter (1200+ followers) http://twitter.com/rdatamining Free online courses http://www.rdatamining.com/resources/courses Online documents http://www.rdatamining.com/resources/onlinedocs 42 / 43