The document discusses integrating R and Hadoop for big data analytics. It notes that existing statistical applications like R are incapable of handling big data, while data management tools lack analytical capabilities. Integrating R with Hadoop bridges this gap by leveraging R's analytics and statistics functionality with Hadoop's ability to process and store distributed data. RHadoop is introduced as an open source project that allows R programmers to directly use MapReduce functionality in R code. Specific RHadoop packages like rhdfs and rmr2 are described that enable interacting with HDFS and performing statistical analysis via MapReduce on Hadoop clusters. Text analytics use cases with R and Hadoop like sentiment analysis are also briefly outlined.
2. Why R on Hadoop?
Storing and processing large amounts of data is a challenge for
existing statistical applications such as R.
• Statistical applications on their own cannot handle Big Data
• Data management tools lack analytical and statistical capabilities
• R and Hadoop each have their own working environment
• R provides the analytics and statistics functionality
• Hadoop provides a framework for storing and processing distributed data
Integrating R with Hadoop bridges the gap between the two.
3. Analyse Hadoop data using R
Because R is one of the best-known statistical software packages, an analyst
working with Hadoop may also want to use existing R packages with
Hadoop.
• R is among the most comprehensive statistical analysis packages available
• R is free and open source software
• R packages are powerful and widely used for statistical and data analysis
• R can be used for parallel computing across multiple cores and clusters
Integration combines the processing power of R and Hadoop, making
the pair suitable for Big Data analytics.
4. Enabling R on Hadoop
Options for R on Hadoop
Functionality from open source R packages can be used when
writing mapper and reducer functions.
R and Hadoop can be integrated by:
• RHadoop
• RHIPE
• Segue
• R with Hadoop Streaming
5. RHadoop Overview
RHadoop is an open source project that allows programmers to
use MapReduce functionality directly in R code.
• Collection of R packages: rhdfs, rmr2, rhbase, plyrmr
• Mostly implemented in native R
6. When to use RHadoop
• For data exploration
• For data aggregation
• To make use of the parallel processing framework in Hadoop
• To sample data
RHadoop is mainly used for managing data and performing data
analysis tasks with the Hadoop framework.
7. RHadoop Packages Overview
About rhdfs
The rhdfs package provides basic connectivity to HDFS.
• Helps to browse, read, write, and modify files stored in HDFS
• Its functions closely mirror the standard HDFS commands
• File manipulation: hdfs.copy, hdfs.move, hdfs.delete, hdfs.put, hdfs.get
• Directory handling: hdfs.dircreate, hdfs.mkdir
8. RHadoop Packages Overview
Sample rhdfs functions
• library(rhdfs) # Load the rhdfs library
• hdfs.init() # Initialize the rhdfs package
• hdfs.ls("/") # List the files and directories under the HDFS root
• hdfs.mkdir() # Create a new directory in HDFS (takes the directory path)
• hdfs.rm() # Remove a directory from HDFS (takes the directory path)
• help(package = "rhdfs") # List all functions of the rhdfs package
More examples later...
9. RHadoop Packages Overview
About rmr2
The rmr2 package allows an R programmer to perform statistical analysis via
MapReduce on a Hadoop cluster.
• Focuses on the analysis of very large data sets
• An alternative to Java for writing MapReduce programs
• Uses the Hadoop Streaming API to run MapReduce jobs written in R
• All components communicate via key-value pairs
• Includes functions for loading data into and out of HDFS by default
10. MapReduce workflow in rmr2
The rmr2 package creates a client-side environment in which MapReduce
executes the map and reduce functions. It:
• Allows these functions to access variables outside their scope
• Works with the inputs and outputs of MapReduce
• Enables programmers to write R variables to HDFS and read them back
11. Function Categories in rmr2
• For storing and retrieving data
  ✓ to.dfs: writes R objects to HDFS
  ✓ from.dfs: reads data such as MapReduce output from HDFS into R
• For MapReduce
  ✓ mapreduce(): defines and executes MapReduce jobs
  ✓ keyval(): creates key-value pairs; keys() and values() extract them
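The storage helpers above can be tried without a cluster. The following is a minimal sketch, assuming the rmr2 package is installed and using its local backend (`rmr.options(backend = "local")`) so that the "DFS" is just temporary local storage; the variable names are illustrative only.

```r
library(rmr2)

# Assumption: use the local backend so the example runs without Hadoop.
rmr.options(backend = "local")

# to.dfs() writes an R object out and returns a handle to the stored data.
ints <- to.dfs(1:5)

# from.dfs() reads the stored data back into R as a key-value object.
kv <- from.dfs(ints)
stored_values <- values(kv)

# keyval() builds key-value pairs; keys() and values() take them apart again.
pairs <- keyval(c("a", "b"), c(10, 20))
pair_keys <- keys(pairs)
pair_values <- values(pairs)
```

On a real cluster the same calls would read from and write to HDFS once the backend is left at its Hadoop setting.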
12. MapReduce function syntax in rmr2
Syntax of the rmr2 mapreduce function:
mapreduce(input, output, map, reduce, input.format, output.format)
• input: HDFS path of the input data
• output: HDFS path for the output data
• map/reduce: the map and reduce functions applied to the data
• input.format/output.format: data format, e.g. text, csv, json
• Typically, the map and reduce functions call the keyval helper
function to ensure that their output is key-value pairs
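As an illustration of this syntax, here is a small word-count sketch. It assumes rmr2 is installed and runs on the local backend, so no Hadoop cluster is required; the two input lines are made-up sample data, and `output` and the format arguments are left at their defaults.

```r
library(rmr2)

# Assumption: local backend, so the job runs in the current R session.
rmr.options(backend = "local")

# Made-up sample input, written out with to.dfs() instead of an HDFS path.
lines <- to.dfs(c("r and hadoop", "hadoop and hdfs"))

counts <- mapreduce(
  input = lines,
  # Map: split each line into words and emit (word, 1) pairs.
  map = function(k, v) keyval(unlist(strsplit(v, " ")), 1),
  # Reduce: for each word, add up the ones emitted by the map step.
  reduce = function(word, ones) keyval(word, sum(ones))
)

# Bring the result back into R as a named vector of counts.
result <- from.dfs(counts)
word_counts <- setNames(values(result), keys(result))
```

Both functions return their output through `keyval()`, which is what keeps every stage of the job in key-value form.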
14. How Text Mining Works with R and Hadoop
• Lexical statistics: the study of word frequencies
• Data mining techniques are used to identify relationships and patterns
• Sentiment analysis is used to understand the underlying attitude
• Tools like R and SAS offer the statistical functionality
• Handling large databases requires newer technologies such as Hadoop
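To make the lexical statistics step concrete, the sketch below counts word frequencies in base R. The two sample sentences are invented for illustration, and the tokenizer (lowercase, split on anything that is not a letter) is a deliberately simple assumption.

```r
# Invented sample text; real input would come from documents or reviews.
docs <- c("R provides analytics", "Hadoop provides storage")

# Lowercase the text and split on any run of non-letter characters.
words <- unlist(strsplit(tolower(docs), "[^a-z]+"))
words <- words[nzchar(words)]  # drop empty tokens, if any

# Tabulate the words and sort by frequency, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
```

For data too large for one machine, the same counting logic is what a MapReduce word-count job distributes across the cluster.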
16. Sentiment Analysis
The study of people’s opinions, sentiments,
attitudes, appraisals, and evaluations
• Also known as opinion mining
• An important component of text mining
• Extracts opinions and sentiment from end-user reviews
• Sentiment is further classified as positive, negative, or neutral
17. Parameters used in Sentiment Analysis
Sentiment analysis classifies a given text
on the basis of the following parameters:
• Polarity, which can be positive, negative, or neutral
• Emotional states, which can be sad, angry, or happy
• Scaling system or numeric values
• Subjectivity/objectivity
• Features based on key entities, such as the durability of furniture,
the screen size of a cell phone, the lens quality of a camera, etc.
18. How Sentiment Analysis Works
A Simple Sentiment Algorithm: This algorithm assigns a sentiment score by simply
counting the occurrences of “positive” and “negative” words in a
sentence.
“I bought an iPhone few days back. It is really nice. The touch screen and voice quality are really cool. It is so better than my old Blackberry phone which was so hard to type with tiny keys. However iPhone is a bit expensive.”
Positive Words: nice, cool, better
Negative Words: hard, expensive
Sentence Sentiment Score: Total Positive - Total Negative (3 - 2 = 1)
Sentence Sentiment Polarity: Positive
Overall Score: Sum of all sentence sentiment scores
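The counting scheme above can be sketched in base R as follows. The word lists contain only the words from the slide's example (a real lexicon would be far larger), and the function names `score_sentence` and `polarity` are hypothetical helpers introduced for this illustration.

```r
# Word lists limited to the slide's example; an assumption, not a real lexicon.
positive_words <- c("nice", "cool", "better")
negative_words <- c("hard", "expensive")

# Score one sentence: number of positive words minus number of negative words.
score_sentence <- function(sentence) {
  words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Map a numeric score to a polarity label.
polarity <- function(score) {
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

review <- paste(
  "I bought an iPhone few days back.",
  "It is really nice.",
  "The touch screen and voice quality are really cool.",
  "It is so better than my old Blackberry phone which was so hard to type with tiny keys.",
  "However iPhone is a bit expensive."
)

# Split the review into sentences on terminal punctuation, then score each one.
sentences <- unlist(strsplit(review, "[.!?]+\\s*"))
sentence_scores <- vapply(sentences, score_sentence, integer(1), USE.NAMES = FALSE)

# Overall score: the sum of all sentence scores.
overall_score <- sum(sentence_scores)
overall_polarity <- polarity(overall_score)
```

Scored sentence by sentence, the review gives 0, 1, 1, 0, and -1, which sums to the same overall score of 1 (positive) as counting 3 positive and 2 negative words across the whole text.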