SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Data 
Manipulation 
Using R 
Cleaning 
& 
Summarizing 
Datasets 
ACM 
DataScience 
Camp 
Packages 
Useful 
for 
this 
Presenta<on 
dplyr 
Ram 
Narasimhan 
@ramnarasimhan 
Oct 
25, 
2014
hHp://goo.gl/DXe1zs 
3a-­‐2
What 
will 
we 
be 
covering 
today? 
Basics 
of 
Data 
Manipula<on 
• What 
do 
we 
mean 
by 
Data 
Manipula<on? 
• 4 
Reserved 
Words 
in 
R 
(NA, 
NaN, 
Inf 
& 
NULL) 
• Data 
Quality: 
Cleaning 
up 
data 
– Missing 
Values 
| 
Duplicate 
Rows| 
FormaLng 
Columns 
• SubseTng 
Data 
• “Factors” 
in 
R 
Data 
Manipula<on 
Made 
Intui<ve 
• dplyr 
• The 
“pipe” 
operator 
%>% 
(‘and 
then’) 
3a-­‐3
A 
note 
about 
Built-­‐in 
datasets 
• Many 
datasets 
come 
bundled 
with 
R 
• Many 
packages 
have 
their 
own 
data 
sets 
• To 
find 
what 
you 
have, 
type 
data()! 
> data()! 
#Examples: mtcars, iris, quakes, faithful, airquality, 
women! 
#In ggplot2! 
> movies; diamonds! 
Note 
Important: 
You 
won’t 
permanently 
damage 
these, 
so 
feel 
free 
to 
experiment! 
3a-­‐4
Why Data 
Carpentry?
Good 
data 
scienYsts 
understand, 
in 
a 
deep 
way, 
that 
the 
heavy 
liZing 
of 
cleanup 
and 
preparaYon 
isn’t 
something 
that 
gets 
in 
the 
way 
of 
solving 
the 
problem 
– 
it 
is 
the 
problem. 
DJ 
PaYl, 
Building 
Data 
Science 
Teams
What 
are 
the 
ways 
to 
manipulate 
data? 
Missing 
values 
Data 
Summariza<on 
Group 
By 
Factors 
Aggregate 
Subset 
/ 
Exclude 
Bucke<ng 
Values 
Rearrange 
(Shape) 
Merge 
Datasets 
3a-­‐7
Data Quality
Data 
Quality 
Datasets 
in 
real 
life 
are 
never 
perfect… 
How 
to 
handle 
these 
real-­‐life 
data 
quality 
issues? 
• Missing 
Values 
• Duplicate 
Rows 
• Inconsistent 
Dates 
• Impossible 
values 
(NegaYve 
Sales) 
– Check 
using 
if 
condiYons 
– Outlier 
detecYon 
3a-­‐9
NA, 
NULL, 
Inf 
& 
NaN 
• NA # missing! 
• NULL !# undefined! 
• Inf # infinite 3/0! 
• NaN # Not a number Inf/Inf! 
From 
R 
DocumentaYon 
• NULL 
represents 
the 
null 
object 
in 
R: 
it 
is 
a 
reserved 
word. 
NULL 
is 
oZen 
returned 
by 
expressions 
and 
funcYons 
whose 
values 
are 
undefined. 
• NA 
is 
a 
logical 
constant 
of 
length 
1 
which 
contains 
a 
missing 
value 
indicator. 
3a-­‐10
Dealing 
with 
NA’s 
(Unavailable 
Values) 
• To 
check 
if 
any 
value 
is 
NA: 
is.na()! 
Usage: is.na(variable) ! 
is.na(vector)! 
> x <- c(3, NA, 4, NA, NA)! 
> is.na(x[2])! 
[1] TRUE! 
> is.na(x)! 
[1] FALSE TRUE FALSE TRUE TRUE! 
> !is.na(x)! 
[1] TRUE FALSE TRUE FALSE FALSE! 
! 
! 
! 
3a-­‐11 
Let’s 
use 
the 
built-­‐in 
dataset 
> is.na(airquality$Ozone)! 
#TRUE if the value is NA, FALSE otherwise! 
>!is.na(airquality$Ozone) #note the !(not)! 
Prints FALSE if any value is NA! 
! 
airquality! 
How 
to 
Convert 
these 
NA’s 
to 
0’s? 
tf <- is.na(airquality$Solar.R) # TRUE FALSE 
conditional vector! 
(TRUE if the values of the Solar.R variable is 
NA, FALSE otherwise)! 
airquality$Solar.R[tf] <- 0! 
! 
!
Cleaning 
the 
data 
• Duplicate 
Rows! 
– Which rows are duplicated?! 
> duplicated(iris)! 
FormaLng 
Columns 
• as.numeric()! 
• as.character()! 
3a-­‐12 
“iris” 
is 
a 
built-­‐in 
dataset 
in 
R
Subsetting 
Summarizing 
& Aggregation
“Factors” 
in 
R 
• Categorical 
Variables 
in 
StaYsYcs 
– Example: 
“Gender” 
= 
{Male, 
Female} 
– “Meal” 
= 
{Breakfast, 
Lunch, 
Dinner} 
– Hair 
Color 
= 
{blonde, 
brown, 
bruneme, 
red} 
R 
concept 
Note: 
There 
is 
no 
intrinsic 
ordering 
to 
the 
categories 
• In 
R, 
Categorical 
variables 
are 
called 
“Factors” 
– The 
limited 
set 
of 
values 
they 
can 
take 
on 
are 
called 
“Levels” 
class(iris$Species)! 
iris$Species[1:5] #notice that all Levels are listed! 
str(mtcars)! 
#Let's make the "gear" column into a factor! 
mtcars$gear <- as.factor(mtcars$gear)! 
str(mtcars$gear)! 
3a-­‐14
The 
subset() 
func<on 
Usage: 
subset(dataframe, condition) 
• Very 
easy 
to 
use 
syntax 
• One 
of 
the 
most 
useful 
commands 
small_iris <- subset(iris, Sepal.Length > 7)! 
subset(movies, mpaa=='R')! 
Things 
to 
keep 
in 
mind 
• Note 
that 
we 
don’t 
need 
to 
say 
df$column_name! 
• Note 
that 
equals 
condiYon 
is 
wrimen 
as 
==! 
• Usually 
a 
good 
idea 
to 
verify 
the 
number 
of 
rows 
in 
the 
smaller 
data 
frame 
(using 
nrow()) 
! 
3a-­‐15
Aggrega<ng 
using 
table() 
Table 
counts 
the 
#Observa<ons 
in 
each 
level 
of 
a 
factor 
table(vector)! 
! 
table(iris$Species)! 
table(mtcars$gear)! 
table(mtcars$cyl)! 
#put it together to create a summary table! 
table(mtcars$gear, mtcars$cyl) ! 
These 
resulYng 
tables 
are 
someYmes 
referred 
to 
as 
“frequency 
tables” 
#Using "with”: note that we don't need to use $! 
with(movies, table(year))! 
with(movies, table(length))! 
with(movies, table(length>200))! 
3a-­‐16
Data 
Manipula<on 
-­‐ 
Key 
Takeaways 
1. Data 
Quality: 
is.na(), 
na.rm(), 
is.nan(), 
is.null() 
2. Table() 
to 
get 
frequencies 
3. Subset(df, 
var==value) 
3a-­‐17
dplyr 
dplyr-­‐18
Why 
Use 
dplyr? 
• Very 
intuiYve, 
once 
you 
understand 
the 
basics 
• Very 
fast 
– Created 
with 
execuYon 
Ymes 
in 
mind 
• Easy 
for 
those 
migraYng 
from 
the 
SQL 
world 
• When 
wrimen 
well, 
your 
code 
reads 
like 
a 
‘recipe’ 
• “Code 
the 
way 
you 
think” 
dplyr-­‐19
SAC 
– 
Split-­‐Apply-­‐Combine 
• Let’s 
understand 
the 
SAC 
idiom 
dplyr-­‐20
tbl_df() 
and 
glimpse() 
tbl_df 
is 
a 
‘wrapper’ 
that 
preLfies 
a 
data 
frame 
> glimpse(movies)! 
Variables:! 
$ title (chr) "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$...! 
$ year (int) 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, ...! 
$ length (int) 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10...! 
$ budget (int) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...! 
$ rating (dbl) 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4,...! 
$ votes (int) 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11...! 
$ r1 (dbl) 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4....! 
$ r2 (dbl) 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0...! 
$ r3 (dbl) 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5,...! 
$ r4 (dbl) 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 1...! 
$ r5 (dbl) 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5,...! 
$ r6 (dbl) 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44....! 
$ r7 (dbl) 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5...! 
$ r8 (dbl) 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, ...! 
$ r9 (dbl) 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 1...! 
$ r10 (dbl) 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4...! 
$ mpaa (fctr) , , , , , , R, , , , , , , , PG-13, PG-13, , , , , , ...! 
$ Action (int) 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...! 
$ Animation (int) 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...! 
$ Comedy (int) 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, ...! 
$ Drama (int) 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, ...! 
$ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...! 
$ Romance (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...! 
$ Short (int) 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...! 
dplyr-­‐21 
> library(ggplot2)! 
> glimpse(movies)! 
> pretty_movies <- tbl_df(movies)! 
> movies! 
> pretty_movies! 
> pretty_movies! 
Source: local data frame [58,788 x 24]! 
! 
title year length budget rating votes r1 r2 r3 r4! 
1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5! 
2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5! 
3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0! 
4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0! 
5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5! 
6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5! 
7 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5! 
8 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5! 
9 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0! 
10 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5! 
.. ... ... ... ... ... ... ... ... ... ...! 
Variables not shown: r5 (dbl), r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl), r10! 
(dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama! 
(int), Documentary (int), Romance (int), Short (int)! 
> !
Understanding 
the 
Pipe 
Operator 
• On 
January 
first 
of 
2014, 
a 
new 
R 
package 
was 
launched 
on 
github 
– maggritr 
• A 
“magic” 
operator 
called 
the 
PIPE 
was 
introduced 
%>% 
(Read 
aloud 
as: 
THEN, 
“AND 
THEN”, 
“PIPE 
TO”) 
round(sqrt(1000), 3)! 
! 
library(magrittr)! 
1000 %>% sqrt %>% round()! 
1000 %>% sqrt %>% round(.,3)! 
Take 
1000, 
and 
then 
its 
sqrt 
And 
then 
round 
it 
1000 
Sqrt 
funcYon 
Round 
funcYon 
32 
31.62278
dplyr 
takes 
advantage 
of 
Pipe 
• Dplyr 
takes 
the 
%>% 
operator 
and 
uses 
it 
to 
great 
effect 
for 
manipulaYng 
data 
frames 
• Works 
ONLY 
with 
Data 
Frames 
A 
belief 
that 
90% 
of 
data 
manipulaYon 
can 
be 
accomplished 
with 
5 
basic 
“verbs”
dplyr 
Package 
• The 
five 
Basic 
“Verbs” 
! 
Verbs! What 
does 
it 
do? 
filter()! Select 
a 
subset 
of 
ROWS 
by 
condiYons 
arrange()! Reorders 
ROWS 
in 
a 
data 
frame 
select()! Select 
the 
COLUMNS 
of 
interest 
mutate()! Create 
new 
columns 
based 
on 
exisYng 
columns 
(mutaYons!) 
summarise()! Aggregate 
values 
for 
each 
group, 
reduces 
to 
single 
value 
dplyr-­‐24
Remember 
these 
Verbs 
(Mnemonics) 
• FILTERows 
• SELECT 
Column 
Types 
• ArRange 
Rows 
(SORT) 
• Mutate 
(into 
something 
new) 
• Summarize 
by 
Groups 
dplyr-­‐25
filter() 
• Usage: 
– Returns 
a 
subset 
of 
rows 
– MulYple 
condiYons 
can 
be 
supplied. 
– They 
are 
combined 
with 
an 
AND 
movies_with_budgets <- filter(movies_df, !is.na(budget))! 
filter(movies, Documentary==1)! 
filter(movies, Documentary==1) %>% nrow() ! 
good_comedies <- filter(movies, rating > 9, Comedy==1) ! 
dim(good_comedies) #171 movies! 
! 
#' Let us say we only want highly rated comdies, which a lot 
of people have watched, made after year 2000.! 
movies %>%! 
filter(rating >8, Comedy==1, votes > 100, year > 2000)! 
! 
dplyr-­‐26 
filter(data, condition)!
Select() 
• Usage: 
movies_df <- tbl_df(movies)! 
select(movies_df, title, year, rating) #Just the columns we want to see! 
select(movies_df, -c(r1:r10)) #we don't want certain columns! 
! 
#You can also select a range of columns from start:end! 
select(movies_df, title:votes) # All the columns from title to votes ! 
select(movies_df, -c(budget, r1:r10, Animation, Documentary, Short, Romance))! 
! 
select(movies_df, contains("r")) # Any column that contains 'r' in its name! 
select(movies_df, ends_with("t")) # All vars ending with ”t"! 
! 
select(movies_df, starts_with("r")) # Gets all vars staring with “r”! 
#The above is not quite what we want. We don't want the Romance column! 
select(movies_df, matches("r[0-9]")) # Columns that match a regex.! 
dplyr-­‐27 
select(data, columns)!
arrange() 
Usage: 
– Returns 
arrange(data, column_to_sort_by)! 
a 
reordered 
set 
of 
rows 
– MulYple 
inputs 
are 
arranged 
from 
leZ-­‐to-­‐right 
movies_df <- tbl_df(movies)! 
arrange(movies_df, rating) #but this is not what we want! 
arrange(movies_df, desc(rating)) ! 
#Show the highest ratings first and the latest year…! 
#Sort by Decreasing Rating and Year! 
arrange(movies_df, desc(rating), desc(year)) ! 
dplyr-­‐28 
What’s 
the 
difference 
between 
these 
two? 
arrange(movies_df, desc(rating), desc(year)) ! 
arrange(movies_df, desc(year), desc(rating)) !
mutate() 
• Usage: 
mutate(data, new_col = func(oldcolumns)! 
• Creates 
new 
columns, 
that 
are 
funcYons 
of 
exisYng 
variables 
mutate(iris, aspect_ratio = Petal.Width/Petal.Length)! 
! 
movies_with_budgets <- filter(movies_df, !is.na(budget))! 
mutate(movies_with_budgets, costPerMinute = budget/length) %>%! 
select(title, costPerMinute)! 
! 
dplyr-­‐29
group_by() 
& 
summarize() 
group_by(data, column_to_group) %>%! 
summarize(function_of_variable)! 
• Group_by 
creates 
groups 
of 
data 
• Summarize 
aggregates 
the 
data 
for 
each 
group 
by_rating <- group_by(movies_df, rating)! 
! 
by_rating %>% summarize(n())! 
! 
avg_rating_by_year <- ! 
!group_by(movies_df, year) %>%! 
!summarize(avg_rating = mean(rating))! 
! 
! 
! 
dplyr-­‐30
Chaining 
the 
verbs 
together 
• Let’s 
put 
it 
all 
together 
in 
a 
logical 
fashion 
• Use 
a 
sequence 
of 
steps 
to 
find 
the 
most 
expensive 
movie 
per 
minute 
of 
eventual 
footage 
producers_nightmare <- ! 
filter(movies_df, !is.na(budget)) %>%! 
mutate(costPerMinute = budget/length) %>%! 
arrange(desc(costPerMinute)) %>%! 
select(title, costPerMinute)! 
! 
dplyr-­‐31
Bonus: 
Pipe 
into 
Plot 
• The 
output 
of 
a 
series 
of 
“pipes” 
can 
also 
be 
fed 
to 
a 
“plot” 
command 
movies %>%! 
group_by(rating) %>%! 
summarize(n()) %>%! 
plot() # plots the histogram of movies by Each value of rating! 
! 
movies %>%! 
group_by(year) %>%! 
summarise(y=mean(rating)) %>%! 
with(barplot(y, names.arg=year, main="AVG IMDB Rating by Year"))! 
! 
dplyr-­‐32
References 
• Dplyr 
vignemes: 
hmp://cran.rstudio.com/web/packages/dplyr/ 
vignemes/introducYon.html 
• Kevin 
Markham’s 
dplyr 
tutorial 
– hmp://rpubs.com/justmarkham/dplyr-­‐tutorial 
– His 
YouTube 
video 
(38-­‐minutes) 
– hmps://www.youtube.com/watch? 
feature=player_embedded&v=jWjqLW-­‐u3hc 
• hmp://paYlv.com/dplyr/ 
– Use 
arrows 
to 
move 
forward 
and 
back 
dplyr-­‐33
Aggrega<ng 
Data 
Using 
“Cut” 
What 
does 
“cut” 
do? 
– BuckeYng 
– Cuts 
a 
conYnuous 
variable 
into 
groups 
• Extremely 
useful 
for 
grouping 
values 
Take 
the 
airquality 
Temperature 
Data 
and 
group 
into 
buckets 
range(airquality$Temp)! 
#First let's cut this vector into 5 groups:! 
cut(airquality$Temp, 5)! 
cut(airquality$Temp, 5, labels=FALSE)! 
#How many data points fall in each of the 5 intervals?! 
table(cut(airquality$Temp, 5))! 
! 
Tempbreaks=seq(50,100, by=10)! 
TempBuckets <- cut(airquality$Temp, breaks=Tempbreaks)! 
summary(TempBuckets)! 
! 
! 
3a-­‐34
aggregate() 
How 
many 
of 
each 
species 
do 
we 
have? 
Usage: 
Replaced 
by 
dplyr 
aggregate(y ~ x, data, FUN)! 
aggregate(numeric_variable ~ grouping variable, data)! 
!How 
to 
read 
this? 
“Split 
the 
<numeric_variable> 
by 
the 
<grouping 
variable>” 
Split 
y 
into 
groups 
of 
x, 
and 
apply 
the 
funcYon 
to 
each 
group 
aggregate(Sepal.Length ~ Species, data=iris, FUN='mean') ! 
Note 
the 
Crucial 
Difference 
between 
the 
two 
lines: 
aggregate(Sepal.Length~Species, data=iris, 
FUN='length')! 
aggregate(Species ~ Sepal.Length, data=iris, 
FUN='length') # caution!! 
! Note: 
If 
you 
are 
doing 
lots 
of 
summarizing, 
the 
“doBy” 
package 
is 
worth 
looking 
into 
3a-­‐35

Contenu connexe

Tendances

R programming Fundamentals
R programming  FundamentalsR programming  Fundamentals
R programming FundamentalsRagia Ibrahim
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
Data preprocessing in Machine Learning
Data preprocessing in Machine LearningData preprocessing in Machine Learning
Data preprocessing in Machine LearningPyingkodi Maran
 
R for data visualization and graphics
R for data visualization and graphicsR for data visualization and graphics
R for data visualization and graphicsRobert Kabacoff
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programmingizahn
 
Single linear regression
Single linear regressionSingle linear regression
Single linear regressionKen Plummer
 
Generalized Nonlinear Models in R
Generalized Nonlinear Models in RGeneralized Nonlinear Models in R
Generalized Nonlinear Models in Rhtstatistics
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with PythonDavis David
 
Categorical data analysis
Categorical data analysisCategorical data analysis
Categorical data analysisSumit Das
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
Introduction To Data Science Using R
Introduction To Data Science Using RIntroduction To Data Science Using R
Introduction To Data Science Using RANURAG SINGH
 

Tendances (20)

R programming Fundamentals
R programming  FundamentalsR programming  Fundamentals
R programming Fundamentals
 
Pca ppt
Pca pptPca ppt
Pca ppt
 
Vector in R
Vector in RVector in R
Vector in R
 
Data visualization with R
Data visualization with RData visualization with R
Data visualization with R
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Data preprocessing in Machine Learning
Data preprocessing in Machine LearningData preprocessing in Machine Learning
Data preprocessing in Machine Learning
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
R for data visualization and graphics
R for data visualization and graphicsR for data visualization and graphics
R for data visualization and graphics
 
Model selection
Model selectionModel selection
Model selection
 
Getting Started with R
Getting Started with RGetting Started with R
Getting Started with R
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
Single linear regression
Single linear regressionSingle linear regression
Single linear regression
 
Time series
Time seriesTime series
Time series
 
Generalized Nonlinear Models in R
Generalized Nonlinear Models in RGeneralized Nonlinear Models in R
Generalized Nonlinear Models in R
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Categorical data analysis
Categorical data analysisCategorical data analysis
Categorical data analysis
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
Introduction To Data Science Using R
Introduction To Data Science Using RIntroduction To Data Science Using R
Introduction To Data Science Using R
 

En vedette

Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyrRomain Francois
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply FunctionSakthi Dasans
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in RFlorian Uhlitz
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text filesEdwin de Jonge
 
Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Spencer Fox
 
Data and donuts: Data Visualization using R
Data and donuts: Data Visualization using RData and donuts: Data Visualization using R
Data and donuts: Data Visualization using RC. Tobin Magle
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R StudioSusan Johnston
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementDr. Volkan OBAN
 
RHive tutorials - Basic functions
RHive tutorials - Basic functionsRHive tutorials - Basic functions
RHive tutorials - Basic functionsAiden Seonghak Hong
 
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016Penn State University
 
20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料安隆 沖
 

En vedette (20)

Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
dplyr
dplyrdplyr
dplyr
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
 
Tokyor36
Tokyor36Tokyor36
Tokyor36
 
Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016
 
Data and donuts: Data Visualization using R
Data and donuts: Data Visualization using RData and donuts: Data Visualization using R
Data and donuts: Data Visualization using R
 
Reproducible Research in R and R Studio
Reproducible Research in R and R StudioReproducible Research in R and R Studio
Reproducible Research in R and R Studio
 
Dplyr and Plyr
Dplyr and PlyrDplyr and Plyr
Dplyr and Plyr
 
Fast data munging in R
Fast data munging in RFast data munging in R
Fast data munging in R
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
 
RHive tutorials - Basic functions
RHive tutorials - Basic functionsRHive tutorials - Basic functions
RHive tutorials - Basic functions
 
Building powerful dashboards with r shiny
Building powerful dashboards with r shinyBuilding powerful dashboards with r shiny
Building powerful dashboards with r shiny
 
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
WF ED 540, Class Meeting 3 - Introduction to dplyr, 2016
 
20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料20160611 kintone Café 高知 Vol.3 LT資料
20160611 kintone Café 高知 Vol.3 LT資料
 
Rlecturenotes
RlecturenotesRlecturenotes
Rlecturenotes
 

Similaire à Data Manipulation Using R (& dplyr)

From Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn IntroductionFrom Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn Introductionelliando dias
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 
Parallel Computing in R
Parallel Computing in RParallel Computing in R
Parallel Computing in Rmickey24
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceHeroku
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfErin Shellman
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersVitomir Kovanovic
 
[Www.pkbulk.blogspot.com]dbms13
[Www.pkbulk.blogspot.com]dbms13[Www.pkbulk.blogspot.com]dbms13
[Www.pkbulk.blogspot.com]dbms13AnusAhmad
 
R Workshop for Beginners
R Workshop for BeginnersR Workshop for Beginners
R Workshop for BeginnersMetamarkets
 
wadar_poster_final
wadar_poster_finalwadar_poster_final
wadar_poster_finalGiorgio Orsi
 
Long Memory presentation to SURF
Long Memory presentation to SURFLong Memory presentation to SURF
Long Memory presentation to SURFRichard Hunt
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashingSEMINARGROOT
 
Dbrec - Music recommendations using DBpedia
Dbrec - Music recommendations using DBpediaDbrec - Music recommendations using DBpedia
Dbrec - Music recommendations using DBpediaAlexandre Passant
 
Amir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseAmir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseit-people
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R languagechhabria-nitesh
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumertirlukachaitanya
 
C*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraC*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraDataStax
 

Similaire à Data Manipulation Using R (& dplyr) (20)

From Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn IntroductionFrom Lisp to Clojure/Incanter and RAn Introduction
From Lisp to Clojure/Incanter and RAn Introduction
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Parallel Computing in R
Parallel Computing in RParallel Computing in R
Parallel Computing in R
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourself
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics Researchers
 
[Www.pkbulk.blogspot.com]dbms13
[Www.pkbulk.blogspot.com]dbms13[Www.pkbulk.blogspot.com]dbms13
[Www.pkbulk.blogspot.com]dbms13
 
R Workshop for Beginners
R Workshop for BeginnersR Workshop for Beginners
R Workshop for Beginners
 
Basic Graphics with R
Basic Graphics with RBasic Graphics with R
Basic Graphics with R
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
wadar_poster_final
wadar_poster_finalwadar_poster_final
wadar_poster_final
 
Long Memory presentation to SURF
Long Memory presentation to SURFLong Memory presentation to SURF
Long Memory presentation to SURF
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Locality sensitive hashing
Locality sensitive hashingLocality sensitive hashing
Locality sensitive hashing
 
Dbrec - Music recommendations using DBpedia
Dbrec - Music recommendations using DBpediaDbrec - Music recommendations using DBpedia
Dbrec - Music recommendations using DBpedia
 
Amir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseAmir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's database
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
 
C*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraC*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with Cassandra
 

Dernier

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Dernier (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Data Manipulation Using R (& dplyr)

  • 1. Data Manipulation Using R Cleaning & Summarizing Datasets ACM DataScience Camp Packages Useful for this Presenta<on dplyr Ram Narasimhan @ramnarasimhan Oct 25, 2014
  • 3. What will we be covering today? Basics of Data Manipula<on • What do we mean by Data Manipula<on? • 4 Reserved Words in R (NA, NaN, Inf & NULL) • Data Quality: Cleaning up data – Missing Values | Duplicate Rows| FormaLng Columns • SubseTng Data • “Factors” in R Data Manipula<on Made Intui<ve • dplyr • The “pipe” operator %>% (‘and then’) 3a-­‐3
  • 4. A note about Built-­‐in datasets • Many datasets come bundled with R • Many packages have their own data sets • To find what you have, type data()! > data()! #Examples: mtcars, iris, quakes, faithful, airquality, women! #In ggplot2! > movies; diamonds! Note Important: You won’t permanently damage these, so feel free to experiment! 3a-­‐4
  • 6. Good data scienYsts understand, in a deep way, that the heavy liZing of cleanup and preparaYon isn’t something that gets in the way of solving the problem – it is the problem. DJ PaYl, Building Data Science Teams
  • 7. What are the ways to manipulate data? Missing values Data Summariza<on Group By Factors Aggregate Subset / Exclude Bucke<ng Values Rearrange (Shape) Merge Datasets 3a-­‐7
  • 9. Data Quality Datasets in real life are never perfect… How to handle these real-­‐life data quality issues? • Missing Values • Duplicate Rows • Inconsistent Dates • Impossible values (NegaYve Sales) – Check using if condiYons – Outlier detecYon 3a-­‐9
  • 10. NA, NULL, Inf & NaN • NA # missing! • NULL !# undefined! • Inf # infinite 3/0! • NaN # Not a number Inf/Inf! From R DocumentaYon • NULL represents the null object in R: it is a reserved word. NULL is oZen returned by expressions and funcYons whose values are undefined. • NA is a logical constant of length 1 which contains a missing value indicator. 3a-­‐10
  • 11. Dealing with NA’s (Unavailable Values) • To check if any value is NA: is.na()! Usage: is.na(variable) ! is.na(vector)! > x <- c(3, NA, 4, NA, NA)! > is.na(x[2])! [1] TRUE! > is.na(x)! [1] FALSE TRUE FALSE TRUE TRUE! > !is.na(x)! [1] TRUE FALSE TRUE FALSE FALSE! ! ! ! 3a-­‐11 Let’s use the built-­‐in dataset > is.na(airquality$Ozone)! #TRUE if the value is NA, FALSE otherwise! >!is.na(airquality$Ozone) #note the !(not)! Prints FALSE if any value is NA! ! airquality! How to Convert these NA’s to 0’s? tf <- is.na(airquality$Solar.R) # TRUE FALSE conditional vector! (TRUE if the values of the Solar.R variable is NA, FALSE otherwise)! airquality$Solar.R[tf] <- 0! ! !
  • 12. Cleaning the data • Duplicate Rows! – Which rows are duplicated?! > duplicated(iris)! FormaLng Columns • as.numeric()! • as.character()! 3a-­‐12 “iris” is a built-­‐in dataset in R
  • 14. “Factors” in R • Categorical Variables in StaYsYcs – Example: “Gender” = {Male, Female} – “Meal” = {Breakfast, Lunch, Dinner} – Hair Color = {blonde, brown, bruneme, red} R concept Note: There is no intrinsic ordering to the categories • In R, Categorical variables are called “Factors” – The limited set of values they can take on are called “Levels” class(iris$Species)! iris$Species[1:5] #notice that all Levels are listed! str(mtcars)! #Let's make the "gear" column into a factor! mtcars$gear <- as.factor(mtcars$gear)! str(mtcars$gear)! 3a-­‐14
  • 15. The subset() func<on Usage: subset(dataframe, condition) • Very easy to use syntax • One of the most useful commands small_iris <- subset(iris, Sepal.Length > 7)! subset(movies, mpaa=='R')! Things to keep in mind • Note that we don’t need to say df$column_name! • Note that equals condiYon is wrimen as ==! • Usually a good idea to verify the number of rows in the smaller data frame (using nrow()) ! 3a-­‐15
  • 16. Aggrega<ng using table() Table counts the #Observa<ons in each level of a factor table(vector)! ! table(iris$Species)! table(mtcars$gear)! table(mtcars$cyl)! #put it together to create a summary table! table(mtcars$gear, mtcars$cyl) ! These resulYng tables are someYmes referred to as “frequency tables” #Using "with”: note that we don't need to use $! with(movies, table(year))! with(movies, table(length))! with(movies, table(length>200))! 3a-­‐16
  • 17. Data Manipula<on -­‐ Key Takeaways 1. Data Quality: is.na(), na.rm(), is.nan(), is.null() 2. Table() to get frequencies 3. Subset(df, var==value) 3a-­‐17
  • 19. Why Use dplyr? • Very intuiYve, once you understand the basics • Very fast – Created with execuYon Ymes in mind • Easy for those migraYng from the SQL world • When wrimen well, your code reads like a ‘recipe’ • “Code the way you think” dplyr-­‐19
  • 20. SAC – Split-­‐Apply-­‐Combine • Let’s understand the SAC idiom dplyr-­‐20
  • 21. tbl_df() and glimpse() tbl_df is a ‘wrapper’ that preLfies a data frame > glimpse(movies)! Variables:! $ title (chr) "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$...! $ year (int) 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, ...! $ length (int) 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10...! $ budget (int) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...! $ rating (dbl) 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4,...! $ votes (int) 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11...! $ r1 (dbl) 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4....! $ r2 (dbl) 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0...! $ r3 (dbl) 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5,...! $ r4 (dbl) 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 1...! $ r5 (dbl) 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5,...! $ r6 (dbl) 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44....! $ r7 (dbl) 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5...! $ r8 (dbl) 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, ...! $ r9 (dbl) 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 1...! $ r10 (dbl) 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4...! $ mpaa (fctr) , , , , , , R, , , , , , , , PG-13, PG-13, , , , , , ...! $ Action (int) 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...! $ Animation (int) 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...! $ Comedy (int) 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, ...! $ Drama (int) 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, ...! $ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...! $ Romance (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...! $ Short (int) 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...! dplyr-­‐21 > library(ggplot2)! > glimpse(movies)! > pretty_movies <- tbl_df(movies)! > movies! > pretty_movies! > pretty_movies! Source: local data frame [58,788 x 24]! ! title year length budget rating votes r1 r2 r3 r4! 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5! 2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5! 3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0! 4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0! 5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5! 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5! 7 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5! 8 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5! 9 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0! 10 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5! .. ... ... ... ... ... ... ... ... ... ...! Variables not shown: r5 (dbl), r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl), r10! (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama! (int), Documentary (int), Romance (int), Short (int)! > !
  • 22. Understanding the Pipe Operator • On January first of 2014, a new R package was launched on github – maggritr • A “magic” operator called the PIPE was introduced %>% (Read aloud as: THEN, “AND THEN”, “PIPE TO”) round(sqrt(1000), 3)! ! library(magrittr)! 1000 %>% sqrt %>% round()! 1000 %>% sqrt %>% round(.,3)! Take 1000, and then its sqrt And then round it 1000 Sqrt funcYon Round funcYon 32 31.62278
  • 23. dplyr takes advantage of Pipe • Dplyr takes the %>% operator and uses it to great effect for manipulaYng data frames • Works ONLY with Data Frames A belief that 90% of data manipulaYon can be accomplished with 5 basic “verbs”
  • 24. dplyr Package • The five Basic “Verbs” ! Verbs! What does it do? filter()! Select a subset of ROWS by condiYons arrange()! Reorders ROWS in a data frame select()! Select the COLUMNS of interest mutate()! Create new columns based on exisYng columns (mutaYons!) summarise()! Aggregate values for each group, reduces to single value dplyr-­‐24
  • 25. Remember these Verbs (Mnemonics) • FILTERows • SELECT Column Types • ArRange Rows (SORT) • Mutate (into something new) • Summarize by Groups dplyr-­‐25
  • 26. filter() • Usage: – Returns a subset of rows – MulYple condiYons can be supplied. – They are combined with an AND movies_with_budgets <- filter(movies_df, !is.na(budget))! filter(movies, Documentary==1)! filter(movies, Documentary==1) %>% nrow() ! good_comedies <- filter(movies, rating > 9, Comedy==1) ! dim(good_comedies) #171 movies! ! #' Let us say we only want highly rated comdies, which a lot of people have watched, made after year 2000.! movies %>%! filter(rating >8, Comedy==1, votes > 100, year > 2000)! ! dplyr-­‐26 filter(data, condition)!
  • 27. Select() • Usage: movies_df <- tbl_df(movies)! select(movies_df, title, year, rating) #Just the columns we want to see! select(movies_df, -c(r1:r10)) #we don't want certain columns! ! #You can also select a range of columns from start:end! select(movies_df, title:votes) # All the columns from title to votes ! select(movies_df, -c(budget, r1:r10, Animation, Documentary, Short, Romance))! ! select(movies_df, contains("r")) # Any column that contains 'r' in its name! select(movies_df, ends_with("t")) # All vars ending with ”t"! ! select(movies_df, starts_with("r")) # Gets all vars staring with “r”! #The above is not quite what we want. We don't want the Romance column! select(movies_df, matches("r[0-9]")) # Columns that match a regex.! dplyr-­‐27 select(data, columns)!
  • 28. arrange() Usage: – Returns arrange(data, column_to_sort_by)! a reordered set of rows – MulYple inputs are arranged from leZ-­‐to-­‐right movies_df <- tbl_df(movies)! arrange(movies_df, rating) #but this is not what we want! arrange(movies_df, desc(rating)) ! #Show the highest ratings first and the latest year…! #Sort by Decreasing Rating and Year! arrange(movies_df, desc(rating), desc(year)) ! dplyr-­‐28 What’s the difference between these two? arrange(movies_df, desc(rating), desc(year)) ! arrange(movies_df, desc(year), desc(rating)) !
  • 29. mutate() • Usage: mutate(data, new_col = func(oldcolumns)! • Creates new columns, that are funcYons of exisYng variables mutate(iris, aspect_ratio = Petal.Width/Petal.Length)! ! movies_with_budgets <- filter(movies_df, !is.na(budget))! mutate(movies_with_budgets, costPerMinute = budget/length) %>%! select(title, costPerMinute)! ! dplyr-­‐29
  • 30. group_by() & summarize() group_by(data, column_to_group) %>%! summarize(function_of_variable)! • Group_by creates groups of data • Summarize aggregates the data for each group by_rating <- group_by(movies_df, rating)! ! by_rating %>% summarize(n())! ! avg_rating_by_year <- ! !group_by(movies_df, year) %>%! !summarize(avg_rating = mean(rating))! ! ! ! dplyr-­‐30
  • 31. Chaining the verbs together • Let’s put it all together in a logical fashion • Use a sequence of steps to find the most expensive movie per minute of eventual footage producers_nightmare <- ! filter(movies_df, !is.na(budget)) %>%! mutate(costPerMinute = budget/length) %>%! arrange(desc(costPerMinute)) %>%! select(title, costPerMinute)! ! dplyr-­‐31
  • 32. Bonus: Pipe into Plot • The output of a series of “pipes” can also be fed to a “plot” command movies %>%! group_by(rating) %>%! summarize(n()) %>%! plot() # plots the histogram of movies by Each value of rating! ! movies %>%! group_by(year) %>%! summarise(y=mean(rating)) %>%! with(barplot(y, names.arg=year, main="AVG IMDB Rating by Year"))! ! dplyr-­‐32
  • 33. References • Dplyr vignemes: hmp://cran.rstudio.com/web/packages/dplyr/ vignemes/introducYon.html • Kevin Markham’s dplyr tutorial – hmp://rpubs.com/justmarkham/dplyr-­‐tutorial – His YouTube video (38-­‐minutes) – hmps://www.youtube.com/watch? feature=player_embedded&v=jWjqLW-­‐u3hc • hmp://paYlv.com/dplyr/ – Use arrows to move forward and back dplyr-­‐33
  • 34. Aggrega<ng Data Using “Cut” What does “cut” do? – BuckeYng – Cuts a conYnuous variable into groups • Extremely useful for grouping values Take the airquality Temperature Data and group into buckets range(airquality$Temp)! #First let's cut this vector into 5 groups:! cut(airquality$Temp, 5)! cut(airquality$Temp, 5, labels=FALSE)! #How many data points fall in each of the 5 intervals?! table(cut(airquality$Temp, 5))! ! Tempbreaks=seq(50,100, by=10)! TempBuckets <- cut(airquality$Temp, breaks=Tempbreaks)! summary(TempBuckets)! ! ! 3a-­‐34
  • 35. aggregate() How many of each species do we have? Usage: Replaced by dplyr aggregate(y ~ x, data, FUN)! aggregate(numeric_variable ~ grouping variable, data)! !How to read this? “Split the <numeric_variable> by the <grouping variable>” Split y into groups of x, and apply the funcYon to each group aggregate(Sepal.Length ~ Species, data=iris, FUN='mean') ! Note the Crucial Difference between the two lines: aggregate(Sepal.Length~Species, data=iris, FUN='length')! aggregate(Species ~ Sepal.Length, data=iris, FUN='length') # caution!! ! Note: If you are doing lots of summarizing, the “doBy” package is worth looking into 3a-­‐35