Introduction to Data Mining with R and Data Import/Export in R

RDataMining Slides Series: Introduction to Data Mining with R and Data Import/Export in R


  1. Introduction to Data Mining with R and Data Import/Export in R. Yanchang Zhao, http://www.RDataMining.com, 30 September 2014
  2. Questions
     - Do you know data mining and its algorithms and techniques?
     - Have you heard of R?
     - Have you used R in your research or projects?
  5. Outline
     - Introduction to R
     - R Packages and Functions for Data Mining
     - Data Import and Export
     - Online Resources
  6. What is R?
     - R (http://www.r-project.org/) is a free software environment for statistical computing and graphics.
     - R can be easily extended with 5,800+ packages available on CRAN (http://cran.r-project.org/), as of 13 Sept 2014.
     - Many other packages are provided on Bioconductor (http://www.bioconductor.org/), R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), etc.
     - R manuals on CRAN (http://cran.r-project.org/manuals.html):
       - An Introduction to R
       - The R Language Definition
       - R Data Import/Export
       - ...
  8. Why R?
     - R is widely used in both academia and industry.
     - R was ranked no. 1 in the KDnuggets 2014 poll on top languages for analytics, data mining and data science (http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html). It was also no. 1 in 2011, 2012 and 2013.
     - The CRAN Task Views (http://cran.r-project.org/web/views/) provide collections of packages for different tasks:
       - Machine learning and statistical learning
       - Cluster analysis and finite mixture models
       - Time series analysis
       - Multivariate statistics
       - Analysis of spatial data
       - ...
  11. Classification with R
      - Decision trees: rpart, party
      - Random forest: randomForest, party
      - SVM: e1071, kernlab
      - Neural networks: nnet, neuralnet, RSNNS
      - Performance evaluation: ROCR
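
As an illustration of the decision-tree entry above, here is a minimal sketch using rpart on the built-in iris data. The 70/30 split, the seed and the variable names are assumptions made for this example and do not come from the slides.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)  # arbitrary seed so the split is reproducible

# Hold out 30% of the rows for testing (illustrative split)
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Fit a classification tree that predicts Species from all other columns
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict class labels for the held-out rows and tabulate against the truth
pred <- predict(fit, newdata = test, type = "class")
conf_matrix <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

print(conf_matrix)
print(accuracy)
```

The other packages in the list follow a similar fit-then-predict pattern, although their argument names differ.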
  13. Clustering with R
      - k-means: kmeans(), kmeansruns()
      - k-medoids: pam(), pamk()
      - Hierarchical clustering: hclust(), agnes(), diana()
      - DBSCAN: fpc
      - BIRCH: birch
      Note: names followed by () are functions; the others are packages.
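
A short sketch of three of the functions listed above, again on the iris measurements. Choosing k = 3, average linkage and the seed are assumptions for illustration only; the slides do not prescribe them.

```r
library(cluster)  # provides pam(); recommended package shipped with R

set.seed(42)  # k-means starts from random centres, so fix the seed

# Use only the four numeric columns; drop the Species label
iris_num <- iris[, 1:4]

# k-means with 3 clusters and 10 random restarts
km <- kmeans(iris_num, centers = 3, nstart = 10)

# k-medoids (Partitioning Around Medoids) with 3 clusters
pm <- pam(iris_num, k = 3)

# Agglomerative hierarchical clustering, then cut the tree into 3 groups
hc <- hclust(dist(iris_num), method = "average")
hc_groups <- cutree(hc, k = 3)

# Compare each clustering with the known species
print(table(km$cluster, iris$Species))
print(table(pm$clustering, iris$Species))
print(table(hc_groups, iris$Species))
```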
  14. Association Rule Mining with R
      - Association rules: apriori(), eclat() in package arules
      - Sequential patterns: arulesSequence
      - Visualisation of associations: arulesViz
  15. Text Mining with R
      - Text mining: tm
      - Topic modelling: topicmodels, lda
      - Word cloud: wordcloud
      - Twitter data access: twitteR
  16. Time Series Analysis with R
      - Time series decomposition: decomp(), decompose(), arima(), stl()
      - Time series forecasting: forecast
      - Time series clustering: TSclust
      - Dynamic Time Warping (DTW): dtw
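
The base-R functions named above can be tried on the built-in AirPassengers series. This is a sketch only: the log transform and the ARIMA orders are illustrative assumptions, not tuned values and not taken from the slides.

```r
# AirPassengers: monthly airline passenger totals, 1949-1960, shipped with R
ap <- AirPassengers

# Classical decomposition into trend, seasonal and random components
dec <- decompose(ap, type = "multiplicative")

# Loess-based decomposition (STL); the log scale makes the seasonality additive
fit_stl <- stl(log(ap), s.window = "periodic")

# Seasonal ARIMA on the log scale (illustrative orders)
fit <- arima(log(ap), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the next 12 months; back-transform to the original scale
fore <- predict(fit, n.ahead = 12)
forecast_passengers <- exp(fore$pred)

plot(dec)
print(forecast_passengers)
```

The forecast package listed above wraps this kind of workflow with automatic model selection; the sketch deliberately stays within the stats package.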
  17. Social Network Analysis with R
      - Packages: igraph, sna
      - Centrality measures: degree(), betweenness(), closeness(), transitivity()
      - Clusters: clusters(), no.clusters()
      - Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
      - Community detection: fastgreedy.community(), spinglass.community()
  18. R and Big Data
      - Hadoop
        - Hadoop (or YARN): a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
        - R packages: RHadoop, RHIPE
      - Spark
        - Spark: a fast and general engine for large-scale data processing, which its developers report can be up to 100 times faster than Hadoop MapReduce when the data fits in memory
        - SparkR: R frontend for Spark
      - H2O
        - H2O: an open-source in-memory prediction engine for big data science
        - R package: h2o
      - MongoDB
        - MongoDB: an open-source document database
        - R packages: rmongodb, RMongo
  19. R and Hadoop
      - Packages: RHadoop, RHive
      - RHadoop (https://github.com/RevolutionAnalytics/RHadoop/wiki) is a collection of R packages:
        - rmr2: perform data analysis with R via MapReduce on a Hadoop cluster
        - rhdfs: connect to the Hadoop Distributed File System (HDFS)
        - rhbase: connect to the NoSQL HBase database
        - ...
      - You can try it on a single PC (in standalone or pseudo-distributed mode), and code developed there will also work on a cluster of PCs (in fully distributed mode).
      - Step-by-Step Guide to Setting Up an R-Hadoop System: http://www.rdatamining.com/big-data/r-hadoop-setup-guide
  21. Data Import and Export
      Read data from and write data to:
      - R native formats (incl. Rdata and RDS)
      - CSV files
      - Excel files
      - ODBC databases
      - SAS databases
      R Data Import/Export manual: http://cran.r-project.org/doc/manuals/R-data.pdf
      See also Chapter 2, "Data Import and Export", in the book R and Data Mining: Examples and Case Studies, http://www.rdatamining.com/docs/RDataMining.pdf
  24. Save and Load R Objects
      - save(): save R objects into a .Rdata file
      - load(): read R objects from a .Rdata file
      - rm(): remove objects from R

          a <- 1:10
          # write object a to disk (the ./data directory must already exist)
          save(a, file = "./data/dumData.Rdata")
          rm(a)
          a
          ## Error: object 'a' not found
          # restore a under its original name
          load("./data/dumData.Rdata")
          a
          ##  [1]  1  2  3  4  5  6  7  8  9 10
  27. Save and Load R Objects: More Functions
      - save.image(): save the current workspace to a file. It saves everything!
      - readRDS(): read a single R object from a .rds file
      - saveRDS(): save a single R object to a file
      - Advantage of readRDS() and saveRDS(): you can restore the data under a different object name.
      - Advantage of load() and save(): you can save multiple R objects to one file.
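
The two advantages listed above can be seen side by side in a small sketch. The object names and the use of tempfile() are assumptions made so the example runs without an existing ./data directory.

```r
# --- saveRDS()/readRDS(): one object, restored under any name ---
rds_file <- tempfile(fileext = ".rds")
scores <- c(a = 1.5, b = 2.5)
saveRDS(scores, rds_file)

# The object name is not stored, so the caller chooses it on reading
restored_scores <- readRDS(rds_file)
print(identical(scores, restored_scores))

# --- save()/load(): several objects in one file, names are fixed ---
rdata_file <- tempfile(fileext = ".RData")
x <- 1:3
y <- letters[1:3]
save(x, y, file = rdata_file)
rm(x, y)

# load() recreates x and y and returns the names of the restored objects
loaded_names <- load(rdata_file)
print(loaded_names)
```

Because load() silently overwrites any existing objects with the same names, readRDS() is often the safer choice when only one object is involved.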
  32. Import from and Export to .CSV Files
      - write.csv(): write an R object to a .CSV file
      - read.csv(): read an R object from a .CSV file

          # create a data frame
          var1 <- 1:5
          var2 <- (1:5) / 10
          var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
          df1 <- data.frame(var1, var2, var3)
          names(df1) <- c("VarInt", "VarReal", "VarChar")
          # save to a csv file
          write.csv(df1, "./data/dummmyData.csv", row.names = FALSE)
          # read from a csv file
          df2 <- read.csv("./data/dummmyData.csv")
          print(df2)
          ##   VarInt VarReal      VarChar
          ## 1      1     0.1            R
          ## 2      2     0.2          and
          ## 3      3     0.3  Data Mining
          ## 4      4     0.4     Examples
          ## 5      5     0.5 Case Studies
  35. Import from and Export to Excel Files
      Package xlsx: read, write and format Excel 2007 and Excel 97/2000/XP/2003 files

          library(xlsx)
          xlsx.file <- "./data/dummmyData.xlsx"
          # write data frame df2 to a worksheet named sheet1
          write.xlsx(df2, xlsx.file, sheetName = "sheet1", row.names = FALSE)
          # read the same worksheet back
          df3 <- read.xlsx(xlsx.file, sheetName = "sheet1")
          df3
          ##   VarInt VarReal      VarChar
          ## 1      1     0.1            R
          ## 2      2     0.2          and
          ## 3      3     0.3  Data Mining
          ## 4      4     0.4     Examples
          ## 5      5     0.5 Case Studies
  37. Read from Databases
      - Package RODBC: provides connection to ODBC databases.
      - odbcConnect(): sets up a connection to a database
      - sqlQuery(): sends an SQL query to the database
      - odbcClose(): closes the connection
      - sqlFetch(), sqlSave() and sqlUpdate(): read, write or update a table in an ODBC database

          library(RODBC)
          db <- odbcConnect(dsn = "servername", uid = "userid", pwd = "******")
          sql <- "SELECT * FROM lib.table WHERE ..."
          # or read query from file
          sql <- readChar("myQuery.sql", nchars = 99999)
          myData <- sqlQuery(db, sql, errors = TRUE)
          odbcClose(db)
  39. Import Data from SAS
      Package foreign provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R. It calls a local SAS installation, so the path to the SAS executable must be supplied.

          library(foreign) # for importing SAS data
          # the path of SAS on your computer
          sashome <- "C:/Program Files/SAS/SASFoundation/9.2"
          filepath <- "./data"
          # filename should be no more than 8 characters, without extension
          fileName <- "dumData"
          # read data from a SAS dataset
          a <- read.ssd(file.path(filepath), fileName,
                        sascmd = file.path(sashome, "sas.exe"))

      Another way: use function read.xport() to read a file in SAS Transport (XPORT) format.
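
A minimal sketch of the read.xport() route mentioned above. Unlike read.ssd(), it does not call SAS itself. The file path is hypothetical, so the code checks that the file exists before reading it.

```r
library(foreign)  # recommended package shipped with standard R installations

# Hypothetical path to a SAS transport file; replace with a real one
xpt_file <- "./data/dumData.xpt"

if (file.exists(xpt_file)) {
  # Inspect which datasets and variables the transport file contains
  info <- lookup.xport(xpt_file)
  print(names(info))

  # Returns a data frame, or a named list of data frames
  # if the file holds more than one dataset
  sas_data <- read.xport(xpt_file)
  str(sas_data)
} else {
  message("No file found at ", xpt_file)
}
```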
  45. Online Resources
      - RDataMining website: http://www.rdatamining.com
        - R Reference Card for Data Mining
        - R and Data Mining: Examples and Case Studies
      - RDataMining Group on LinkedIn (7,000+ members): http://group.rdatamining.com
      - RDataMining on Twitter (1,700+ followers): @RDataMining
      - Free online courses: http://www.rdatamining.com/resources/courses
      - Online documents: http://www.rdatamining.com/resources/onlinedocs
  46. The End. Thanks! Email: yanchang(at)rdatamining.com
