SlideShare une entreprise Scribd logo
1  sur  121
Télécharger pour lire hors ligne
dev.Objective() ~ Thursday, 14 May 2015
Getting to KnowYour Data with R
Steve Withington
01
Steve
Withington
✤ Blue River / Mura CMS
Application Engineer & Instructor!
✤ dev.Objective() Steering
Committee Member since 2010!
✤ Author!
✤ Marathoner & Triathlete!
✤ Amateur Data Enthusiast
Steps in a Data Analysis
✤ Define the question!
✤ Define the ideal data set!
✤ Determine what data you can access!
✤ Obtain the data!
✤ Clean the data!
✤ Exploratory data analysis!
✤ Statistical prediction/modeling!
✤ Interpret results!
✤ Challenge results!
✤ Synthesize/write up results!
✤ Create reproducible code
Steps in a Data Analysis
✤ Define the question!
✤ Define the ideal data set!
✤ Determine what data you can access!
✤ Obtain the data!
✤ Clean the data!
✤ Exploratory data analysis!
✤ Statistical prediction/modeling!
✤ Interpret results!
✤ Challenge results!
✤ Synthesize/write up results!
✤ Create reproducible code
Types of Data Science Questions
✤ In approximate order of difficulty!
✤ Descriptive!
✤ Exploratory!
✤ Inferential!
✤ Predictive!
✤ Causal!
✤ Mechanistic
Descriptive Analysis
✤ Goal: Describe a set of data!
✤ The first kind of data analysis performed!
✤ Commonly applied to census data!
✤ The description and interpretation are different steps!
✤ Descriptions can usually not be generalized without
additional statistical modeling
Descriptive Analysis
– http://www.census.gov/2010census/
Exploratory Analysis
✤ Goal: Find relationships you didn’t know about!
✤ Exploratory models are good for discovering new connections!
✤ They are also useful for defining future studies!
✤ Exploratory analyses are usually not the final say!
✤ Exploratory analyses alone should not be used for
generalizing/predicting!
✤ Correlation does not imply causation
Correlation is Not Causation
✤ Even if you observe two variables are correlated with
each other, you have to prove to yourself that they’re
not related because of some other variable you didn’t
measure!
✤ Example: shoe size and literacy
Exploratory Analysis
– http://www.sdss.org
Inferential Analysis
✤ Goal: Use a relatively small sample of data to say
something about a bigger population!
✤ Inference is commonly the goal of statistical models!
✤ Inference involves estimating both the quantity you
care about and your uncertainty about your estimate!
✤ Inference depends heavily on both the population
and the sampling scheme
Inferential Analysis
– http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521092/
Predictive Analysis
✤ Goal: To use the data on some objects to predict the values for
another object!
✤ If X predicts Y it does not mean that X causes Y!
✤ Accurate prediction depends heavily on measuring the right
variables!
✤ Although there are better and worse prediction models,
more data and a simple model works really well!
✤ Prediction is very hard, especially about the future references
Predictive Analysis
– http://fivethirtyeight.com/interactives/uk-general-election-predictions/
Predictive Analysis
– http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
Causal Analysis
✤ Goal: To find out what happens to one variable when you make
another variable change!
✤ Usually randomized studies are required to identify causation!
✤ There are approaches to inferring causation in non-randomized
studies, but they are complicated and sensitive to assumptions!
✤ Causal relationships are usually identified as average effects,
but may not apply to every individual!
✤ Causal models are usually the “gold standard” for data analysis
Causal Analysis
– http://www.nejm.org/doi/full/10.1056/NEJM200007133430201
Mechanistic Analysis
✤ Goal: Understand the exact changes in variables that lead to changes
in other variables for individual objects!
✤ Incredibly hard to infer, except in simple situations!
✤ Usually modeled by a deterministic set of equations (physical/
engineering science)!
✤ Generally the random component of the data is measurement error!
✤ If the equations are known but the parameters are not, they may be
inferred with data analysis
Mechanistic Analysis
– http://www.ellisonfoundation.org/content/mechanistic-analysis-transcriptional-switch-regulating-p53-activity-and-premature-aging
“You can have data without information, 

but you cannot have information without data.”
–Daniel Keys Moran
01
NewTerminology
– http://www.dailymail.co.uk/sciencetech/article-2247081/There-soon-words-data-stored-world.html
Yottabyte
✤ 1000 kB kilobyte

1000
2
MB megabyte

1000
3
GB gigabyte

1000
4
TB terabyte

1000
5
PB petabyte

1000
6
EB exabyte

1000
7
ZB zettabyte

1000
8
YB yottabyte!
✤ 1 YB 

= 1000
8
bytes 

= 10
24
bytes 

= 1 000 000 000 000 000 000 000 000 bytes 

= 1000 zettabytes 

= 1 trillion terabytes
NSA’s Bumblehive
– http://www.bbc.com/news/business-26383058 (Capable of storing one thousand trillion gigabytes!)
Data
✤ Companies have terabytes of data about the consumers they interact with!
✤ Governmental, academic, and private research institutions have extensive
archival and survey data on every manner of research topic
The Challenge
✤ Presenting information in easily accessible and
digestible ways
Visualizations
✤ We’re really good at gleaning useful information from
visual presentations!
✤ Graphical presentations are an excellent way to
convey results and uncover meaning
Visualizing Friendships
– https://www.facebook.com/note.php?note_id=469716398919
Visualizing Friendships
– https://www.facebook.com/note.php?note_id=469716398919
“Data is a set of values of 

qualitative or quantitative variables; restated, 

pieces of data are individual pieces of information.”
– http://en.wikipedia.org/wiki/Data
Definition of Data
“Data is a set of values of qualitative or quantitative variables; restated,
pieces of data are individual pieces of information.” !
✤ Qualitative: properties that are observed and can generally not
be measured with a numerical result.

(Sex, country of origin, etc.)!
✤ Quantitative: properties that can exist as a magnitude or
multitude. 

(Height, weight, blood pressure, etc.)
Definition of Data
“Data is a set of values of qualitative or quantitative variables; restated,
pieces of data are individual pieces of information.” !
✤ Variable: a logical set of attributes!
✤ Attribute: a characteristic of an object (person, thing, etc.)
Definition of Data
“Data is a set of values of qualitative or quantitative variables; restated,
pieces of data are individual pieces of information.” !
✤ Information: an answer to a question, as well as that from
which knowledge and data can be derived (as data represents
values attributed to parameters, and knowledge signifies
understanding of real things or abstract concepts). As it regards
data, the information’s existence is not necessarily coupled to
an observer (it exists beyond an event horizon, for example),
while in the case of knowledge, information requires a
cognitive observer.
“The data may not contain the answer. 

The combination of some data and an aching desire 

for an answer does not ensure that a reasonable answer 

can be extracted from a given body of data.”
–John Wilder Tukey
What Data Looks Like
– http://brianknaus.com/software/srtoolbox/s_4_1_sequence80.txt
What Data Looks Like
– https://developers.facebook.com/docs/graph-api
What Data Looks Like
– http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html
What Data Looks Like
– http://www.baseball-reference.com/leagues/AL/2015-standard-pitching.shtml
Data is Second Most Important
✤ The most important thing in data science is the
question!
✤ The second most important is the data!
✤ Often the data will limit or enable the questions!
✤ But having data can’t save you if you don’t have a
question
01
“We wanted users to be able to begin in an interactive
environment, where they did not consciously think of
themselves as programming. Then as their needs became
clearer and their sophistication increased, they should be able
to slide gradually into programming, when the language and
system aspects would become more important.”
–John Chambers, “Stages in the Evolution of S”
R
✤ Comprehensive statistical platform (just about any type of data analysis can be done in R)!
✤ Useful for interactive work, and also contains a powerful programming language for
developing new tools (user -> programmer)!
✤ Open source & free (see http://www.fsf.org)!
✤ State of the art graphics!
✤ Easily import data from a wide variety of sources!
✤ RStudio IDE!
✤ Developer Ecosystem and active community (mailing lists, StackOverflow, etc.)!
✤ Can be integrated into other languages, including C++, Java, Python, PHP, etc.
Design of the R System
✤ The “base” R system that you download from CRAN!
✤ Everything else
The R System
✤ R functionality is divided into a number of packages!
✤ The “base” R system contains, among other things, the base package
which is required to run R and contains the most fundamental functions!
✤ There are several other packages contained in the “base” system as well
as several “recommended” packages!
✤ Over 6,000 other packages are available on CRAN developed by users
and programmers around the world!
✤ In addition, sites such as http://bioconductor.org (for genomic and
biological data analysis) host a plethora of R packages
Where to Get R
✤ Comprehensive R Archive Network (CRAN)

http://cran.r-project.org!
✤ Linux, Mac OS X, and Windows
Getting Started
✤ If you think of R as having a programming language instead of
being one, you’ll be better off in the long run!
✤ It’s a case-sensitive, interpreted language!
✤ Enter commands one at a time via the command line, or sets of
commands from a source file!
✤ Wide variety of data types, including vectors, matrices, data
frames (similar to recordsets), and lists (collections of objects)!
✤ An object is basically anything that can be assigned a value
demo()
RStudio
RStudio’s Interface
text-file.R
Some Important R Functions
✤ Access help file
Some Important R Functions
✤ Search help files
Some Important R Functions
✤ Get arguments





✤ See code
R Reference Card
– http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Entering Input
✤ At the R prompt we type expressions. The <- symbol is the assignment operator









✤ The grammar of the language determines whether an expression is complete or
not



✤ The # character indicates a comment. Anything to the right of the # (including
the # itself) is ignored
Evaluation
✤ When a complete expression is entered at the prompt, it is
evaluated and the result of the evaluated expression is
returned. The result may be auto-printed.











✤ The [1] indicates that x is a vector and 5 is the first element
Printing
✤ The : operator is used to create integer sequences
Objects
✤ R has five basic or “atomic” classes of objects:!
✤ character!
✤ numeric (real numbers)!
✤ integer!
✤ complex!
✤ logical (True/False)
Objects
✤ The most basic object is a vector!
✤ Vectors are one-dimensional arrays that can hold data!
✤ A vector can only contain objects of the same class!
✤ The one exception is a list, which is represented as a
vector but can contain objects of different classes (that’s
usually why they’re used)!
✤ Empty vectors can be created with the vector() function
Numbers
✤ Numbers in R are generally treated as numeric objects (i.e., double
precision real numbers)!
✤ If you explicitly want an integer, you need to specify the L suffix!
✤ For example, entering 1 gives you a numeric object; entering 1L
explicitly gives you an integer!
✤ There is also a special number Inf which represents infinity; e.g., 1 /
0; Inf can be used in ordinary calculations; e.g., 1 / Inf is 0!
✤ The value NaN represents an undefined value (“not a number”); e.g., 0
/ 0; NaN can also be thought of as a missing value
Attributes
✤ R objects can have attributes!
✤ names, dimnames!
✤ dimensions (e.g., matrices, arrays)!
✤ class!
✤ length!
✤ other user-defined attributes/metadata!
✤ Attributes of an object can be accessed using the attributes()
function
Vectors
✤ Vectors are one-dimensional arrays that can hold numeric data, character
data, or logical data. The combine function c() is used to form the
vector.!
!
!
✤ You can also use the vector() function

Vectors
✤ You can refer to elements of a vector using a numeric vector of
positions within brackets. For example, a[c(2, 4)] refers to the
second and fourth elements of vector a
Mixing Objects
✤ What about the following?







✤ When different objects are mixed in a vector, coercion
occurs so that every element in the vector is of the
same class
Explicit Coercion
✤ Objects can be explicitly coerced from one class to
another using the as.* functions, if available
Explicit Coercion
✤ Nonsensical coercion results in NAs
Matrices
✤ A matrix is a two-dimensional array. It has a dimension attribute
which is an integer vector of length 2 (nrow, ncol)
Matrices
✤ Matrices are constructed column-wise, so entries can be thought
of starting in the “upper left” corner an running down the
columns
Matrices
✤ Matrices can also be created directly from vectors by adding a
dimension attribute
cbind-ing and rbind-ing
✤ Matrices can also be created by column-binding or row-
binding with cbind() and rbind()
Arrays
✤ Arrays are similar to matrices, however they can have
more than two dimensions
Factors
✤ Variables can be described as nominal, ordinal, or continuous!
✤ Nominal, categorical, without an implied order (e.g., blood type)!
✤ Ordinal, also categorical and imply order but not amount (e.g., grades)!
✤ Continuous, take on any value within some range, and both order and
amount are implied (e.g., age)!
✤ Categorical (nominal) and ordered categorical (ordinal) variables in R are
called factors!
✤ Factors are crucial in R because they determine how data is analyzed and
presented visually
Factors
✤ The factor() function stores the categorical values as a vector of
integers in the range [1… k], (where k is the number of unique values in
the nominal variable) and an internal vector of character strings (the
original values) mapped to these integers
Factors
✤ By default, factor levels for character vectors are created in alphabetical
order and can be overridden by specifying a levels option
Lists
✤ Lists are the most complex of the R data types. Basically, a list is an ordered
collection of objects. A list allows you to gather a variety of (possibly
unrelated) objects under one name. For example, a list may contain a
combination of vectors, matrices, data frames, and even other lists.
Data Frames
✤ A data frame is more general than a matrix in that columns can contain different modes of data
(numeric, character, etc.). They store tabular data. These are the most common data types in R.!
✤ They are represented as a special type of list where every element of the list has to have
the same length!
✤ Each element of the list can be thought of as a column and the length of each element
of the list is the number of rows!
✤ Unlike matrices, data frames can store different classes of objects in each column (just
like lists); matrices must have every element be the same class!
✤ Data frames also have a special attribute called row.names!
✤ Data frames are usually created by calling read.table() or read.csv()!
✤ Can be converted to a matrix by calling data.matrix()
Data Frames
✤ Each column must have only one mode, but you can
put columns of different modes together
Names
✤ R objects can also have names, which is very useful for
writing readable code and self-describing objects
Names
✤ Lists can also have names
Names
✤ And matrices
MissingValues
✤ Missing values are denote by NA or NaN for undefined
mathematical operations!
✤ is.na() is used to test objects if they are NA!
✤ is.nan() is used to test objects for NaN!
✤ NA values have a class also, so there are integer NA,
character NA, etc.!
✤ A NaN value is also NA but the converse is not true
MissingValues
DataTypes Summary
✤ atomic classes: numeric, logical, character, integer, complex!
✤ vectors!
✤ matrices!
✤ arrays!
✤ lists!
✤ data frames!
✤ names!
✤ missing values
DataTypes Summary
Oddities
✤ The period (.) has no special significant in object names. The dollar sign ($) has a
somewhat similar meaning to the period in other object-oriented languages and
can be used to identify the parts of a data frame or list (e.g., mydata
$columnname)!
✤ R don’t have multiline or block comments (must use the #)!
✤ Assigning a value to a nonexistent element of a vector, matrix, array, or list
expands that structure to accommodate the new value!
✤ R doesn’t have scalar values, they’re represented as one-element vectors!
✤ Indices start at 1, not 0!
✤ Variables can’t be declared, they come into existence on the first assignment
Helpful Methods
✤ getwd()!
✤ setwd(“some/directory”)!
✤ list.files()!
✤ ls() or objects()!
✤ rm(object/objectlist)!
✤ rm(list=ls())!
✤ edit(objectname)!
✤ source(“filename”)!
✤ sink(“filename.ext”, append=TRUE, split=TRUE)!
✤ data() ~ list available data sets!
✤ search() ~ list loaded packages!
✤ q()
Reading Data
✤ There are a few principal functions reading data into R!
✤ read.table, read.csv, for reading tabular data!
✤ readLines, for reading lines of a text file!
✤ source, for reading in R code files (inverse of dump)!
✤ dget, for reading in R code files (inverse of dput)!
✤ load, for reading in saved workspaces!
✤ unserialize, for reading single R objects in binary form
Writing Data
✤ There are similar functions for writing data to files!
✤ write.table!
✤ writeLines!
✤ dump!
✤ dput!
✤ save!
✤ serialize
read.table
✤ The read.table function is one of the most commonly used functions for reading data!
✤ It has a few important arguments:!
✤ file, the name of the file, or a connection!
✤ header, logical indicating if the file has a header line!
✤ sep, a string indicating how the columns are separated!
✤ colClasses, a character vector indicating the class of each column in the dataset!
✤ nrows, the number of rows in the dataset!
✤ comment.char, a character string indicating the comment character!
✤ skip, the number of lines to skip from the beginning!
✤ stringsAsFactors, should character variables be coded as factors?
read.table
✤ For small to moderately sized datasets, you can usually call read.table without
specifying any other arguments



✤ R will automatically!
✤ skip lines that begin with #!
✤ figure out how many rows there are (and how much memory needs to be allocated)!
✤ figure what time of variable is in each column of the table (telling R all these things
directly makes R run much faster and more efficiently)!
✤ read.csv is identical to read.table except that the default separator is a comma
Larger Datasets
✤ With much larger datasets, doing the following things will make your
life easier and will prevent R from choking!
✤ Read the help page for read.table, which contains many hints!
✤ Make a rough calculation of the memory required to store your
dataset!
✤ If the dataset is larger than the amount of RAM on your computer,
you can probably stop right here!
✤ Set comment.char = “” if there are no commented lines in your
file
Larger Datasets
✤ Use the colClasses argument!
✤ Specifying this option instead of using the default can make
read.table run much faster, often twice as fast!
✤ In order to use this option, you have to know the class of each column
in your data farm!
✤ If all of the columns are numeric, for example, then you can just set
colClasses = “numeric”!
✤ Set nrows!
✤ This doesn’t make R run faster but it helps with memory usage
KnowThy System
✤ In general, when using R with larger datasets, it’s useful to
know a few things about your system!
✤ How much memory is available?!
✤ What other applications are in use?!
✤ Are there other users logged into the same system?!
✤ What is the Operating System?!
✤ Is the OS 32 or 64 bit?
Calculating Memory Requirements
✤ I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric
data. Roughly, how much memory is required to store this data frame? Note: each
alphanumeric character requires 8 bits (1 byte) of memory to store.!
✤ 1,500,000 rows * 120 columns = 180,000,000 alphanumeric characters!
✤ 180,000,000 * 8 bits = 1,440,000,000 bytes!
✤ 1,440,000,000 bytes / 1024 = 1,406,250 kb!
✤ 1,406,250 kb / 1024 = 1,373.29 MB!
✤ 1,373.29 MB / 1024 = 1.34 GB!
✤ If you only have 2 GB of RAM, you may want to think twice before doing it. If you've
got 4, 6, or 8+ GB of RAM, you should be fine.
Textual Formats
✤ dumping and dputing are useful because the resulting textual format is
editable, and in the case of corruption, potentially recoverable!
✤ Unlike writing out a table or csv file, dump and dput preserve the metadata
(sacrificing some readability), so that another user doesn't have to specify it
all over again!
✤ Can work much better with version control programs like SVN or Git which
can only track changes meaningfully in text files!
✤ Can be longer-lived; if there is corruption somewhere in the file, it can be
easier to fix the problem!
✤ Downside: Format is not very space-efficient
dput-ting R Objects
✤ Another way to pass data around is by deparsing the
R object with dput and reading it back in using dget
Dumping R Objects
✤ Multiple objects can be deparsed using the dump
function and read back in using source
Interfaces to the OutsideWorld
✤ Data are read in using connection interfaces!
✤ Connections can be made to files (most common) or to other
more exotic things!
✤ file, opens a connection to a file!
✤ gzfile, opens a connection to a file compressed with gzip!
✤ bzfile, opens a connection to a file compressed with bzip2!
✤ url, opens a connection to a webpage
File Connections
✤ description is the name of the file!
✤ open is a code indicating!
✤ “r” read only!
✤ “w” writing (and initializing a new file)!
✤ “a” appending!
✤ “rb”, “wb”, “ab” reading, writing, or appending in binary mode (Windows)
Connections
✤ In general, connections are powerful tools that let you
navigate files or other external objects!
✤ In practice, we often don’t need to deal with the
connection interface directly
Reading Lines of aText File
✤ readLines can be useful for reading in lines of
webpages
Connecting and Listing Databases
Connecting and Listing Databases
R Packages
✤ When you download R from the Comprehensive Archive
Network (CRAN), you get the “base” R system!
✤ The base R system comes with basic functionality and
implements the R language!
✤ One reason R is so useful is the large collection of packages
that extend the basic functionality of R!
✤ R packages are developed and published by the larger R
community
Obtaining R Packages
✤ The primary location for obtaining R packages is
CRAN!
✤ For biological applications, many packages are
available from the Bioconductor Project!
✤ You can obtain information about the available
packages on CRAN with the
available.packages() function
Obtaining R Packages
✤ There are over 6,000 packages on CRAN covering a wide
range of topics!
✤ A list of some topics is available through the Task Views link
(http://cran.r-project.org/web/views/), which groups
together many R packages related to a given topic
Installing an R Package
✤ Packages can be installed with the install.packages() function in R!
✤ To install a single package, pass the name of the package to the
install.packages() function as the first argument!
✤ The following code installs the zipcode package from CRAN
✤ This command downloads the zipcode package from CRAN and installs it on your
computer!
✤ Any packages on which this package depends will also be downloaded and installed
Installing Multiple R Packages
✤ You can install multiple R packages at once with a single call to
install.packages()!
✤ Place the names of the R packages in a character vector
Installing Packages via RStudio
Installing Packages via RStudio
Loading R Packages
✤ Installing a package does not make it immediately available to
your in R; you must load the package!
✤ The library() function is used to load packages into R!
✤ The following code is used to load the ggplot2 package into R
Loading R Packages
✤ Any packages that need to be loaded as dependencies
will be loaded first, before the named package is
loaded!
✤ Note: Do not put the package name in quotes!!
✤ Some packages produce messages when they are
loaded (but some don’t)
Loading R Packages
✤ After loading a package, the functions exported by that packages will be
attached to the top of the search() list (after the workspace)
example()
R Packages Summary
✤ R packages provide a powerful mechanism for extending the
functionality of R!
✤ R packages can be obtained from CRAN or other repositories!
✤ The install.packages() can be used to install packages
at the R console!
✤ The library() function loads packages that have been
installed so that you may access the functionality in the
packages
Learning Resources
✤ The R Manuals

http://cran.r-project.org/manuals.html!
✤ QuickR

http://www.statmethods.net !
✤ R Mailing Lists

https://stat.ethz.ch/mailman/listinfo!
✤ SwiRl

http://swirlstats.com!
✤ Introduction to Data Science

https://www.udemy.com/introduction-to-data-science/!
✤ Johns Hopkins Data University Data Science Specialization

https://www.coursera.org/specialization/jhudatascience/1
More Learning Resources
✤ R By Example

http://www.mayin.org/ajayshah/KB/R/index.html!
✤ R Data Import/Export Manual

http://cran.r-project.org/doc/manuals/r-release/R-data.html!
✤ Data Analysis with R: Visually Analyze and Summarize Data Sets

https://www.udacity.com/course/data-analysis-with-r--ud651!
✤ Intro to Descriptive Statistics: Mathematics for Understanding Data

https://www.udacity.com/course/intro-to-descriptive-statistics--ud827!
✤ Intro to Inferential Statistics: Making Predictions from Data

https://www.udacity.com/course/intro-to-inferential-statistics--ud201
Good Stuff
✤ FiveThirtyEight

http://fivethirtyeight.com !
✤ Kaggle

https://www.kaggle.com!
✤ Figshare

http://figshare.com!
✤ OpenCPU

https://www.openspu.org!
✤ RPy

http://rpy.sourceforge.net!
✤ rCharts

http://rcharts.io!
✤ Facebook Data Science Research Service

https://www.facebook.com/data!
✤ Google’s R Style Guide

http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml
Books
✤ Chambers. Software for Data Analysis. 

Springer, 2008.!
✤ Matloff. The Art of R Programming: A Tour of Statistical Software Design.

No Starch Press, 2011.!
✤ Kabacoff. R in Action: Data analysis and graphics with R. 

Second Edition. Manning Publications, 2015.!
✤ Cielen & Meysman. Introducing Data Science. 

Manning Publications, 2015.!
✤ Tuckey. Exploratory Data Analysis. 

Pearson, 1977.
Books
✤ Manning Publications Discount Code

ctwdevob!
✤ Packt Discount Code

DEV.Objective
01

Contenu connexe

Tendances

The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 
Probabilistic Programming: Why, What, How, When?
Probabilistic Programming: Why, What, How, When?Probabilistic Programming: Why, What, How, When?
Probabilistic Programming: Why, What, How, When?Salesforce Engineering
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data ManagementC. Tobin Magle
 
Open Data, Big Data and Machine Learning
Open Data, Big Data and Machine LearningOpen Data, Big Data and Machine Learning
Open Data, Big Data and Machine LearningSteven Van Vaerenbergh
 
Internet Research Skills-A dey irc 2
Internet Research Skills-A dey irc 2Internet Research Skills-A dey irc 2
Internet Research Skills-A dey irc 2Apu Dey
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Onlinesfdatascience
 
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learningSara Hooker
 
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015Mark Wilkinson
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538Krishna Sankar
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopCosmoAIMS Bassett
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
WDAqua introduction presentation
WDAqua introduction presentationWDAqua introduction presentation
WDAqua introduction presentationHady Elsahar
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at ScaleDavid Simons
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine LearningCorey Chivers
 
Entities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web SearchEntities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web SearcheXascale Infolab
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?Frank van Harmelen
 

Tendances (20)

The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Probabilistic Programming: Why, What, How, When?
Probabilistic Programming: Why, What, How, When?Probabilistic Programming: Why, What, How, When?
Probabilistic Programming: Why, What, How, When?
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data Management
 
Open Data, Big Data and Machine Learning
Open Data, Big Data and Machine LearningOpen Data, Big Data and Machine Learning
Open Data, Big Data and Machine Learning
 
Introduction to open-data
Introduction to open-dataIntroduction to open-data
Introduction to open-data
 
Internet Research Skills-A dey irc 2
Internet Research Skills-A dey irc 2Internet Research Skills-A dey irc 2
Internet Research Skills-A dey irc 2
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Online
 
Module 1 introduction to machine learning
Module 1  introduction to machine learningModule 1  introduction to machine learning
Module 1 introduction to machine learning
 
Streaming Outlier Analysis for Fun and Scalability
Streaming Outlier Analysis for Fun and Scalability Streaming Outlier Analysis for Fun and Scalability
Streaming Outlier Analysis for Fun and Scalability
 
NLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-TextNLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-Text
 
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
Mauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshopMauritius Big Data and Machine Learning JEDI workshop
Mauritius Big Data and Machine Learning JEDI workshop
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
WDAqua introduction presentation
WDAqua introduction presentationWDAqua introduction presentation
WDAqua introduction presentation
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Data Modelling at Scale
Data Modelling at ScaleData Modelling at Scale
Data Modelling at Scale
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine Learning
 
Entities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web SearchEntities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web Search
 
The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?The Web of Data: do we actually understand what we built?
The Web of Data: do we actually understand what we built?
 

Similaire à Getting to Know Your Data with R

Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchGreg Landrum
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Thinkful
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data ScienceThinkful
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science TJ Stalcup
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Thinkful
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsHugo Bowne-Anderson
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansJameel Syed
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data CommonsSimon Twigger
 

Similaire à Getting to Know Your Data with R (20)

Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Getting Started in Data Science
Getting Started in Data ScienceGetting Started in Data Science
Getting Started in Data Science
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)Career in Data Science (July 2017, DTLA)
Career in Data Science (July 2017, DTLA)
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientists
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
00-01 DSnDA.pdf
00-01 DSnDA.pdf00-01 DSnDA.pdf
00-01 DSnDA.pdf
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable Workflows
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake Fans
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 

Dernier

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 

Dernier (20)

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 

Getting to Know Your Data with R

  • 1. dev.Objective() ~ Thursday, 14 May 2015 Getting to KnowYour Data with R Steve Withington
  • 2. 01 Steve Withington ✤ Blue River / Mura CMS Application Engineer & Instructor! ✤ dev.Objective() Steering Committee Member since 2010! ✤ Author! ✤ Marathoner & Triathlete! ✤ Amateur Data Enthusiast
  • 3. Steps in a Data Analysis ✤ Define the question! ✤ Define the ideal data set! ✤ Determine what data you can access! ✤ Obtain the data! ✤ Clean the data! ✤ Exploratory data analysis! ✤ Statistical prediction/modeling! ✤ Interpret results! ✤ Challenge results! ✤ Synthesize/write up results! ✤ Create reproducible code
  • 4. Steps in a Data Analysis ✤ Define the question! ✤ Define the ideal data set! ✤ Determine what data you can access! ✤ Obtain the data! ✤ Clean the data! ✤ Exploratory data analysis! ✤ Statistical prediction/modeling! ✤ Interpret results! ✤ Challenge results! ✤ Synthesize/write up results! ✤ Create reproducible code
  • 5. Types of Data Science Questions ✤ In approximate order of difficulty! ✤ Descriptive! ✤ Exploratory! ✤ Inferential! ✤ Predictive! ✤ Causal! ✤ Mechanistic
  • 6. Descriptive Analysis ✤ Goal: Describe a set of data! ✤ The first kind of data analysis performed! ✤ Commonly applied to census data! ✤ The description and interpretation are different steps! ✤ Descriptions can usually not be generalized without additional statistical modeling
  • 8. Exploratory Analysis ✤ Goal: Find relationships you didn’t know about! ✤ Exploratory models are good for discovering new connections! ✤ They are also useful for defining future studies! ✤ Exploratory analyses are usually not the final say! ✤ Exploratory analyses alone should not be used for generalizing/predicting! ✤ Correlation does not imply causation
  • 9. Correlation is Not Causation ✤ Even if you observe two variables are correlated with each other, you have to prove to yourself that they’re not related because of some other variable you didn’t measure! ✤ Example: shoe size and literacy
  • 11. Inferential Analysis ✤ Goal: Use a relatively small sample of data to say something about a bigger population! ✤ Inference is commonly the goal of statistical models! ✤ Inference involves estimating both the quantity you care about and your uncertainty about your estimate! ✤ Inference depends heavily on both the population and the sampling scheme
  • 13. Predictive Analysis ✤ Goal: To use the data on some objects to predict the values for another object! ✤ If X predicts Y it does not mean that X causes Y! ✤ Accurate prediction depends heavily on measuring the right variables! ✤ Although there are better and worse prediction models, more data and a simple model works really well! ✤ Prediction is very hard, especially about the future references
  • 16. Causal Analysis ✤ Goal: To find out what happens to one variable when you make another variable change! ✤ Usually randomized studies are required to identify causation! ✤ There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions! ✤ Causal relationships are usually identified as average effects, but may not apply to every individual! ✤ Causal models are usually the “gold standard” for data analysis
  • 18. Mechanistic Analysis ✤ Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects! ✤ Incredibly hard to infer, except in simple situations! ✤ Usually modeled by a deterministic set of equations (physical/ engineering science)! ✤ Generally the random component of the data is measurement error! ✤ If the equations are known but the parameters are not, they may be inferred with data analysis
  • 20. “You can have data without information, 
 but you cannot have information without data.” –Daniel Keys Moran
  • 21. 01
  • 23. Yottabyte ✤ 1000 kB kilobyte
 1000 2 MB megabyte
 1000 3 GB gigabyte
 1000 4 TB terabyte
 1000 5 PB petabyte
 1000 6 EB exabyte
 1000 7 ZB zettabyte
 1000 8 YB yottabyte! ✤ 1 YB 
 = 1000 8 bytes 
 = 10 24 bytes 
 = 1 000 000 000 000 000 000 000 000 bytes 
 = 1000 zettabytes 
 = 1 trillion terabytes
  • 24. NSA’s Bumblehive – http://www.bbc.com/news/business-26383058 (Capable of storing one thousand trillion gigabytes!)
  • 25. Data ✤ Companies have terabytes of data about the consumers they interact with! ✤ Governmental, academic, and private research institutions have extensive archival and survey data on every manner of research topic
  • 26. The Challenge ✤ Presenting information in easily accessible and digestible ways
  • 27. Visualizations ✤ We’re really good at gleaning useful information from visual presentations! ✤ Graphical presentations are an excellent way to convey results and uncover meaning
  • 30. “Data is a set of values of 
 qualitative or quantitative variables; restated, 
 pieces of data are individual pieces of information.” – http://en.wikipedia.org/wiki/Data
  • 31. Definition of Data “Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.” ! ✤ Qualitative: properties that are observed and can generally not be measured with a numerical result.
 (Sex, country of origin, etc.)! ✤ Quantitative: properties that can exist as a magnitude or multitude. 
 (Height, weight, blood pressure, etc.)
  • 32. Definition of Data “Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.” ! ✤ Variable: a logical set of attributes! ✤ Attribute: a characteristic of an object (person, thing, etc.)
  • 33. Definition of Data “Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.” ! ✤ Information: an answer to a question, as well as that from which knowledge and data can be derived (as data represents values attributed to parameters, and knowledge signifies understanding of real things or abstract concepts). As it regards data, the information’s existence is not necessarily coupled to an observer (it exists beyond an event horizon, for example), while in the case of knowledge, information requires a cognitive observer.
  • 34. “The data may not contain the answer. 
 The combination of some data and an aching desire 
 for an answer does not ensure that a reasonable answer 
 can be extracted from a given body of data.” –John Wilder Tukey
  • 35. What Data Looks Like – http://brianknaus.com/software/srtoolbox/s_4_1_sequence80.txt
  • 36. What Data Looks Like – https://developers.facebook.com/docs/graph-api
  • 37. What Data Looks Like – http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html
  • 38. What Data Looks Like – http://www.baseball-reference.com/leagues/AL/2015-standard-pitching.shtml
  • 39. Data is Second Most Important ✤ The most important thing in data science is the question! ✤ The second most important is the data! ✤ Often the data will limit or enable the questions! ✤ But having data can’t save you if you don’t have a question
  • 40. 01
  • 41. “We wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.” –John Chambers, “Stages in the Evolution of S”
  • 42. R ✤ Comprehensive statistical platform (just about any type of data analysis can be done in R)! ✤ Useful for interactive work, and also contains a powerful programming language for developing new tools (user -> programmer)! ✤ Open source & free (see http://www.fsf.org)! ✤ State of the art graphics! ✤ Easily import data from a wide variety of sources! ✤ RStudio IDE! ✤ Developer Ecosystem and active community (mailing lists, StackOverflow, etc.)! ✤ Can be integrated into other languages, including C++, Java, Python, PHP, etc.
  • 43. Design of the R System ✤ The “base” R system that you download from CRAN! ✤ Everything else
  • 44. The R System ✤ R functionality is divided into a number of packages! ✤ The “base” R system contains, among other things, the base package which is required to run R and contains the most fundamental functions! ✤ There are several other packages contained in the “base” system as well as several “recommended” packages! ✤ Over 6,000 other packages are available on CRAN developed by users and programmers around the world! ✤ In addition, sites such as http://bioconductor.org (for genomic and biological data analysis) host a plethora of R packages
  • 45. Where to Get R ✤ Comprehensive R Archive Network (CRAN)
 http://cran.r-project.org! ✤ Linux, Mac OS X, and Windows
  • 46. Getting Started ✤ If you think of R as having a programming language instead of being one, you’ll be better off in the long run! ✤ It’s a case-sensitive, interpreted language! ✤ Enter commands one at a time via the command line, or sets of commands from a source file! ✤ Wide variety of data types, including vectors, matrices, data frames (similar to recordsets), and lists (collections of objects)! ✤ An object is basically anything that can be assigned a value
  • 51. Some Important R Functions ✤ Access help file
  • 52. Some Important R Functions ✤ Search help files
  • 53. Some Important R Functions ✤ Get arguments
 
 
 ✤ See code
  • 54. R Reference Card – http://cran.r-project.org/doc/contrib/Short-refcard.pdf
  • 55. Entering Input ✤ At the R prompt we type expressions. The <- symbol is the assignment operator
 
 
 
 
 ✤ The grammar of the language determines whether an expression is complete or not
 
 ✤ The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored
  • 56. Evaluation ✤ When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.
 
 
 
 
 
 ✤ The [1] indicates that x is a vector and 5 is the first element
  • 57. Printing ✤ The : operator is used to create integer sequences
  • 58. Objects ✤ R has five basic or “atomic” classes of objects:! ✤ character! ✤ numeric (real numbers)! ✤ integer! ✤ complex! ✤ logical (True/False)
  • 59. Objects ✤ The most basic object is a vector! ✤ Vectors are one-dimensional arrays that can hold data! ✤ A vector can only contain objects of the same class! ✤ The one exception is a list, which is represented as a vector but can contain objects of different classes (that’s usually why they’re used)! ✤ Empty vectors can be created with the vector() function
  • 60. Numbers ✤ Numbers in R are generally treated as numeric objects (i.e., double precision real numbers)! ✤ If you explicitly want an integer, you need to specify the L suffix! ✤ For example, entering 1 gives you a numeric object; entering 1L explicitly gives you an integer! ✤ There is also a special number Inf which represents infinity; e.g., 1 / 0; Inf can be used in ordinary calculations; e.g., 1 / Inf is 0! ✤ The value NaN represents an undefined value (“not a number”); e.g., 0 / 0; NaN can also be thought of as a missing value
  • 61. Attributes ✤ R objects can have attributes! ✤ names, dimnames! ✤ dimensions (e.g., matrices, arrays)! ✤ class! ✤ length! ✤ other user-defined attributes/metadata! ✤ Attributes of an object can be accessed using the attributes() function
  • 62. Vectors ✤ Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector.! ! ! ✤ You can also use the vector() function

  • 63. Vectors ✤ You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a
  • 64. Mixing Objects ✤ What about the following?
 
 
 
 ✤ When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class
  • 65. Explicit Coercion ✤ Objects can be explicitly coerced from one class to another using the as.* functions, if available
  • 66. Explicit Coercion ✤ Nonsensical coercion results in NAs
  • 67. Matrices ✤ A matrix is a two-dimensional array. It has a dimension attribute which is an integer vector of length 2 (nrow, ncol)
  • 68. Matrices ✤ Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner an running down the columns
  • 69. Matrices ✤ Matrices can also be created directly from vectors by adding a dimension attribute
  • 70. cbind-ing and rbind-ing ✤ Matrices can also be created by column-binding or row- binding with cbind() and rbind()
  • 71. Arrays ✤ Arrays are similar to matrices, however they can have more than two dimensions
  • 72. Factors ✤ Variables can be described as nominal, ordinal, or continuous! ✤ Nominal, categorical, without an implied order (e.g., blood type)! ✤ Ordinal, also categorical and imply order but not amount (e.g., grades)! ✤ Continuous, take on any value within some range, and both order and amount are implied (e.g., age)! ✤ Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors! ✤ Factors are crucial in R because they determine how data is analyzed and presented visually
  • 73. Factors ✤ The factor() function stores the categorical values as a vector of integers in the range [1… k], (where k is the number of unique values in the nominal variable) and an internal vector of character strings (the original values) mapped to these integers
  • 74. Factors ✤ By default, factor levels for character vectors are created in alphabetical order and can be overridden by specifying a levels option
  • 75. Lists ✤ Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects. A list allows you to gather a variety of (possibly unrelated) objects under one name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists.
  • 76. Data Frames ✤ A data frame is more general than a matrix in that columns can contain different modes of data (numeric, character, etc.). They store tabular data. These are the most common data types in R.! ✤ They are represented as a special type of list where every element of the list has to have the same length! ✤ Each element of the list can be thought of as a column and the length of each element of the list is the number of rows! ✤ Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class! ✤ Data frames also have a special attribute called row.names! ✤ Data frames are usually created by calling read.table() or read.csv()! ✤ Can be converted to a matrix by calling data.matrix()
  • 77. Data Frames ✤ Each column must have only one mode, but you can put columns of different modes together
  • 78. Names ✤ R objects can also have names, which is very useful for writing readable code and self-describing objects
  • 79. Names ✤ Lists can also have names
  • 81. MissingValues ✤ Missing values are denote by NA or NaN for undefined mathematical operations! ✤ is.na() is used to test objects if they are NA! ✤ is.nan() is used to test objects for NaN! ✤ NA values have a class also, so there are integer NA, character NA, etc.! ✤ A NaN value is also NA but the converse is not true
  • 83. DataTypes Summary ✤ atomic classes: numeric, logical, character, integer, complex! ✤ vectors! ✤ matrices! ✤ arrays! ✤ lists! ✤ data frames! ✤ names! ✤ missing values
  • 85. Oddities ✤ The period (.) has no special significant in object names. The dollar sign ($) has a somewhat similar meaning to the period in other object-oriented languages and can be used to identify the parts of a data frame or list (e.g., mydata $columnname)! ✤ R don’t have multiline or block comments (must use the #)! ✤ Assigning a value to a nonexistent element of a vector, matrix, array, or list expands that structure to accommodate the new value! ✤ R doesn’t have scalar values, they’re represented as one-element vectors! ✤ Indices start at 1, not 0! ✤ Variables can’t be declared, they come into existence on the first assignment
  • 86. Helpful Methods ✤ getwd()! ✤ setwd(“some/directory”)! ✤ list.files()! ✤ ls() or objects()! ✤ rm(object/objectlist)! ✤ rm(list=ls())! ✤ edit(objectname)! ✤ source(“filename”)! ✤ sink(“filename.ext”, append=TRUE, split=TRUE)! ✤ data() ~ list available data sets! ✤ search() ~ list loaded packages! ✤ q()
  • 87. Reading Data ✤ There are a few principal functions reading data into R! ✤ read.table, read.csv, for reading tabular data! ✤ readLines, for reading lines of a text file! ✤ source, for reading in R code files (inverse of dump)! ✤ dget, for reading in R code files (inverse of dput)! ✤ load, for reading in saved workspaces! ✤ unserialize, for reading single R objects in binary form
  • 88. Writing Data ✤ There are similar functions for writing data to files! ✤ write.table! ✤ writeLines! ✤ dump! ✤ dput! ✤ save! ✤ serialize
  • 89. read.table ✤ The read.table function is one of the most commonly used functions for reading data! ✤ It has a few important arguments:! ✤ file, the name of the file, or a connection! ✤ header, logical indicating if the file has a header line! ✤ sep, a string indicating how the columns are separated! ✤ colClasses, a character vector indicating the class of each column in the dataset! ✤ nrows, the number of rows in the dataset! ✤ comment.char, a character string indicating the comment character! ✤ skip, the number of lines to skip from the beginning! ✤ stringsAsFactors, should character variables be coded as factors?
  • 90. read.table ✤ For small to moderately sized datasets, you can usually call read.table without specifying any other arguments
 
 ✤ R will automatically! ✤ skip lines that begin with #! ✤ figure out how many rows there are (and how much memory needs to be allocated)! ✤ figure what time of variable is in each column of the table (telling R all these things directly makes R run much faster and more efficiently)! ✤ read.csv is identical to read.table except that the default separator is a comma
  • 91. Larger Datasets ✤ With much larger datasets, doing the following things will make your life easier and will prevent R from choking! ✤ Read the help page for read.table, which contains many hints! ✤ Make a rough calculation of the memory required to store your dataset! ✤ If the dataset is larger than the amount of RAM on your computer, you can probably stop right here! ✤ Set comment.char = “” if there are no commented lines in your file
  • 92. Larger Datasets ✤ Use the colClasses argument! ✤ Specifying this option instead of using the default can make read.table run much faster, often twice as fast! ✤ In order to use this option, you have to know the class of each column in your data farm! ✤ If all of the columns are numeric, for example, then you can just set colClasses = “numeric”! ✤ Set nrows! ✤ This doesn’t make R run faster but it helps with memory usage
  • 93. KnowThy System ✤ In general, when using R with larger datasets, it’s useful to know a few things about your system! ✤ How much memory is available?! ✤ What other applications are in use?! ✤ Are there other users logged into the same system?! ✤ What is the Operating System?! ✤ Is the OS 32 or 64 bit?
  • 94. Calculating Memory Requirements ✤ I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame? Note: each alphanumeric character requires 8 bits (1 byte) of memory to store.! ✤ 1,500,000 rows * 120 columns = 180,000,000 alphanumeric characters! ✤ 180,000,000 * 8 bits = 1,440,000,000 bytes! ✤ 1,440,000,000 bytes / 1024 = 1,406,250 kb! ✤ 1,406,250 kb / 1024 = 1,373.29 MB! ✤ 1,373.29 MB / 1024 = 1.34 GB! ✤ If you only have 2 GB of RAM, you may want to think twice before doing it. If you've got 4, 6, or 8+ GB of RAM, you should be fine.
  • 95. Textual Formats ✤ dumping and dputing are useful because the resulting textual format is editable, and in the case of corruption, potentially recoverable! ✤ Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn't have to specify it all over again! ✤ Can work much better with version control programs like SVN or Git which can only track changes meaningfully in text files! ✤ Can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem! ✤ Downside: Format is not very space-efficient
  • 96. dput-ting R Objects ✤ Another way to pass data around is by deparsing the R object with dput and reading it back in using dget
  • 97. Dumping R Objects ✤ Multiple objects can be deparsed using the dump function and read back in using source
  • 98. Interfaces to the OutsideWorld ✤ Data are read in using connection interfaces! ✤ Connections can be made to files (most common) or to other more exotic things! ✤ file, opens a connection to a file! ✤ gzfile, opens a connection to a file compressed with gzip! ✤ bzfile, opens a connection to a file compressed with bzip2! ✤ url, opens a connection to a webpage
  • 99. File Connections ✤ description is the name of the file! ✤ open is a code indicating! ✤ “r” read only! ✤ “w” writing (and initializing a new file)! ✤ “a” appending! ✤ “rb”, “wb”, “ab” reading, writing, or appending in binary mode (Windows)
  • 100. Connections ✤ In general, connections are powerful tools that let you navigate files or other external objects! ✤ In practice, we often don’t need to deal with the connection interface directly
  • 101. Reading Lines of aText File ✤ readLines can be useful for reading in lines of webpages
  • 104. R Packages ✤ When you download R from the Comprehensive Archive Network (CRAN), you get the “base” R system! ✤ The base R system comes with basic functionality and implements the R language! ✤ One reason R is so useful is the large collection of packages that extend the basic functionality of R! ✤ R packages are developed and published by the larger R community
  • 105. Obtaining R Packages ✤ The primary location for obtaining R packages is CRAN! ✤ For biological applications, many packages are available from the Bioconductor Project! ✤ You can obtain information about the available packages on CRAN with the available.packages() function
  • 106. Obtaining R Packages ✤ There are over 6,000 packages on CRAN covering a wide range of topics! ✤ A list of some topics is available through the Task Views link (http://cran.r-project.org/web/views/), which groups together many R packages related to a given topic
  • 107. Installing an R Package ✤ Packages can be installed with the install.packages() function in R! ✤ To install a single package, pass the name of the package to the install.packages() function as the first argument! ✤ The following code installs the zipcode package from CRAN ✤ This command downloads the zipcode package from CRAN and installs it on your computer! ✤ Any packages on which this package depends will also be downloaded and installed
  • 108. Installing Multiple R Packages ✤ You can install multiple R packages at once with a single call to install.packages()! ✤ Place the names of the R packages in a character vector
  • 111. Loading R Packages ✤ Installing a package does not make it immediately available to your in R; you must load the package! ✤ The library() function is used to load packages into R! ✤ The following code is used to load the ggplot2 package into R
  • 112. Loading R Packages ✤ Any packages that need to be loaded as dependencies will be loaded first, before the named package is loaded! ✤ Note: Do not put the package name in quotes!! ✤ Some packages produce messages when they are loaded (but some don’t)
  • 113. Loading R Packages ✤ After loading a package, the functions exported by that packages will be attached to the top of the search() list (after the workspace)
  • 115. R Packages Summary ✤ R packages provide a powerful mechanism for extending the functionality of R! ✤ R packages can be obtained from CRAN or other repositories! ✤ The install.packages() can be used to install packages at the R console! ✤ The library() function loads packages that have been installed so that you may access the functionality in the packages
  • 116. Learning Resources ✤ The R Manuals
 http://cran.r-project.org/manuals.html! ✤ QuickR
 http://www.statmethods.net ! ✤ R Mailing Lists
 https://stat.ethz.ch/mailman/listinfo! ✤ SwiRl
 http://swirlstats.com! ✤ Introduction to Data Science
 https://www.udemy.com/introduction-to-data-science/! ✤ Johns Hopkins Data University Data Science Specialization
 https://www.coursera.org/specialization/jhudatascience/1
  • 117. More Learning Resources ✤ R By Example
 http://www.mayin.org/ajayshah/KB/R/index.html! ✤ R Data Import/Export Manual
 http://cran.r-project.org/doc/manuals/r-release/R-data.html! ✤ Data Analysis with R: Visually Analyze and Summarize Data Sets
 https://www.udacity.com/course/data-analysis-with-r--ud651! ✤ Intro to Descriptive Statistics: Mathematics for Understanding Data
 https://www.udacity.com/course/intro-to-descriptive-statistics--ud827! ✤ Intro to Inferential Statistics: Making Predictions from Data
 https://www.udacity.com/course/intro-to-inferential-statistics--ud201
  • 118. Good Stuff ✤ FiveThirtyEight
 http://fivethirtyeight.com ! ✤ Kaggle
 https://www.kaggle.com! ✤ Figshare
 http://figshare.com! ✤ OpenCPU
 https://www.openspu.org! ✤ RPy
 http://rpy.sourceforge.net! ✤ rCharts
 http://rcharts.io! ✤ Facebook Data Science Research Service
 https://www.facebook.com/data! ✤ Google’s R Style Guide
 http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml
  • 119. Books ✤ Chambers. Software for Data Analysis. 
 Springer, 2008.! ✤ Matloff. The Art of R Programming: A Tour of Statistical Software Design.
 No Starch Press, 2011.! ✤ Kabacoff. R in Action: Data analysis and graphics with R. 
 Second Edition. Manning Publications, 2015.! ✤ Cielen & Meysman. Introducing Data Science. 
 Manning Publications, 2015.! ✤ Tuckey. Exploratory Data Analysis. 
 Pearson, 1977.
  • 120. Books ✤ Manning Publications Discount Code
 ctwdevob! ✤ Packt Discount Code
 DEV.Objective
  • 121. 01