testdat: An R package for unit testing of tabular data

Karthik Ram¹, Hilary Parker², Alyssa Frazee³

¹ The rOpenSci project, University of California, Berkeley. Berkeley, CA 94720 USA, karthik.ram@berkeley.edu
² Etsy Inc., Brooklyn, NY. USA, hilary@etsy.com
³ Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD. USA, afrazee@jhsph.edu

Contribute

The testdat package, like rOpenSci, is an open-source, community-supported project!

Motivation

Improve data preprocessing:
Data preprocessing is an important and under-discussed step in data analysis. By providing functions to easily test for and correct common pitfalls, we aim to help researchers overcome these stumbling blocks.

Encourage reproducibility:
By providing a suite of functions that easily test and correct data for common errors, we hope to encourage researchers to perform data preprocessing as part of a reproducible workflow, rather than in tools such as Excel.

Communicate analytical steps:
By providing readable functions for preprocessing, we aim for researchers to include the data preprocessing code in their analyses or papers, to communicate that they took exhaustive steps to remove artifacts from data.

Workflow

Obtain

    > dat
             date num name
    1  2014-01-01   1 NULL
    2  2014-01-01   2  naa
    3  2014-01-01   3  foo
    4  2014-01-01   4  foo
    5  2014-01-01   5  foo
    6  2014-01-01   6  foo
    7  2014-01-01   7  foo
    8  2014-01-01   8  foo
    9  2014-01-01 999  foo
    10 2014-01-01 n/a  foo
    > class(dat$num)
    [1] "factor"
    > class(dat$name)
    [1] "factor"

Test

    > test_NA(dat)
    Now checking 3 columns...
    999 was identified as a possible NA alias -- please verify this is not a data value!
      row column value
    1   9      2   999
    2  10      2   n/a
    3   1      3  NULL

Fix

    > clean_dat <- fix_NA(dat, custom_NAs="naa")
    Now fixing 3 columns...
    > clean_dat
             date num name
    1  2014-01-01   1 <NA>
    2  2014-01-01   2 <NA>
    3  2014-01-01   3  foo
    4  2014-01-01   4  foo
    5  2014-01-01   5  foo
    6  2014-01-01   6  foo
    7  2014-01-01   7  foo
    8  2014-01-01   8  foo
    9  2014-01-01  NA  foo
    10 2014-01-01  NA  foo
    > class(clean_dat$num)
    [1] "numeric"
    > class(clean_dat$name)
    [1] "character"

Example Functions

test_utf8.R, clean_utf8.R
Test for and correct UTF-8 characters that cannot be read into R.

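To make the idea concrete, a check of this kind could be sketched in base R as below. The helper names `find_invalid_utf8` and `drop_invalid_utf8` are hypothetical and are not the testdat functions; the sketch only assumes that "cannot be read" means bytes that are not valid UTF-8.

```r
# Hypothetical helpers for illustration; NOT the testdat implementations.

# Return the positions of strings that are not valid UTF-8.
find_invalid_utf8 <- function(x) {
  which(!validUTF8(x))
}

# Drop bytes that cannot be interpreted as UTF-8 (assumed repair strategy).
drop_invalid_utf8 <- function(x) {
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
}

# "caf" followed by the single Latin-1 byte 0xE9, which is invalid in UTF-8.
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
x <- c("foo", bad, "bar")

find_invalid_utf8(x)   # position of the offending element
drop_invalid_utf8(x)   # same vector with the invalid byte removed
```

Dropping bytes loses information, so re-encoding from the true source encoding is usually preferable when that encoding is known.
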
test_NA.R, fix_NA.R
Test for and correct common missing-value indicators that R does not convert to NA.

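As a rough illustration of what such a test and fix involve, the following base-R sketch scans every column for a list of alias strings and then replaces them. The helper names and the alias list are assumptions made for this example; they are not the testdat implementation, and the real package may detect aliases differently.

```r
# Hypothetical helpers for illustration; NOT the testdat implementations.
na_aliases <- c("NULL", "n/a", "N/A", "999", "")   # assumed alias list

# Report every cell whose text matches one of the aliases.
find_na_aliases <- function(df, aliases = na_aliases) {
  hits <- lapply(seq_along(df), function(j) {
    values <- as.character(df[[j]])
    rows <- which(values %in% aliases)
    data.frame(row = rows,
               column = rep(j, length(rows)),
               value = values[rows],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Replace aliases with NA, then let R re-infer each column's type.
replace_na_aliases <- function(df, aliases = na_aliases) {
  df[] <- lapply(df, function(col) {
    values <- as.character(col)
    values[values %in% aliases] <- NA
    type.convert(values, as.is = TRUE)
  })
  df
}

# A data frame shaped like the poster's example, with factor columns.
dat <- data.frame(
  date = rep("2014-01-01", 10),
  num  = c(1:8, "999", "n/a"),
  name = c("NULL", "naa", rep("foo", 8)),
  stringsAsFactors = TRUE
)

find_na_aliases(dat)
clean_dat <- replace_na_aliases(dat, aliases = c(na_aliases, "naa"))
```

Converting through `as.character()` first matters because the columns are factors; comparing or coercing factor codes directly would silently give wrong numbers.
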
test_continuous_date.R, fix_continuous_date.R
Test for and correct unexpected gaps in date ranges.

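One way to picture a gap check is to compare the observed dates against a complete daily sequence. The sketch below does this in base R; the helper names are hypothetical, the daily granularity is an assumption, and filling gaps with NA rows is only one possible correction, not necessarily what testdat does.

```r
# Hypothetical helpers for illustration; NOT the testdat implementations.

# Return the dates missing from an assumed daily sequence.
find_date_gaps <- function(dates) {
  dates <- as.Date(dates)
  expected <- seq(min(dates), max(dates), by = "day")
  expected[!(expected %in% dates)]
}

# Add a row of NA values for every missing date.
fill_date_gaps <- function(df, date_col = "date") {
  df[[date_col]] <- as.Date(df[[date_col]])
  full <- data.frame(seq(min(df[[date_col]]), max(df[[date_col]]), by = "day"))
  names(full) <- date_col
  # Left join on the complete sequence; unmatched dates get NA elsewhere.
  merge(full, df, by = date_col, all.x = TRUE)
}

obs <- data.frame(
  date = c("2014-01-01", "2014-01-02", "2014-01-05"),
  num  = c(1, 2, 5)
)

gaps <- find_date_gaps(obs$date)
filled <- fill_date_gaps(obs)
```
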
test_white_spaces.R, fix_white_spaces.R
Test for and correct white space in character vectors.

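A minimal base-R sketch of this kind of check follows, assuming that the white space of interest is leading or trailing padding. The helper names are hypothetical and do not come from testdat.

```r
# Hypothetical helpers for illustration; NOT the testdat implementations.

# Return the positions of strings with leading or trailing white space.
find_extra_whitespace <- function(x) {
  which(!is.na(x) & x != trimws(x))
}

# Remove leading and trailing white space; NA values are left untouched.
strip_whitespace <- function(x) {
  trimws(x)
}

x <- c("foo", " foo", "bar ", NA)
find_extra_whitespace(x)
strip_whitespace(x)
```

Padding like this matters because `"foo"` and `" foo"` would otherwise be treated as two distinct levels or groups.
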
test_outliers.R
Test for outliers in your numeric data. A corresponding correction function is not supplied, as this has statistical implications.

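The poster does not say which outlier rule is used, so the sketch below swaps in a common one, Tukey's 1.5 × IQR fences, purely as an assumption. The helper name `find_outliers` and the multiplier `k` are hypothetical. In keeping with the text above, the sketch only flags values and does not alter them.

```r
# Hypothetical helper for illustration; NOT the testdat implementation.
# Flags values outside Tukey's fences (an assumed rule, not from the poster).
find_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

num <- c(1:8, 999)
find_outliers(num)   # position of the suspicious value
```

Whether a flagged value such as 999 is a true measurement, a data-entry error, or a missing-value code is a judgment the analyst has to make.
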
