testdat: An R package for unit testing of tabular data
Karthik Ram¹, Hilary Parker², Alyssa Frazee³

¹ The rOpenSci project, University of California, Berkeley. Berkeley, CA 94720 USA, karthik.ram@berkeley.edu
² Etsy Inc., Brooklyn, NY. USA, hilary@etsy.com
³ Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD. USA, afrazee@jhsph.edu

Contribute
The testdat package, like rOpenSci, is an open-source, community-supported project!

Motivation
Improve data preprocessing: Data preprocessing is an important and under-discussed step in data analysis. By providing functions to easily test for and correct common pitfalls, we aim to help researchers overcome these stumbling blocks.
Encourage reproducibility: By providing a suite of functions that easily test and correct data for common errors, we hope to encourage researchers to perform data preprocessing as part of a reproducible workflow, rather than in tools such as Excel.
Communicate analytical steps: By providing readable functions for preprocessing, we aim for researchers to include the data preprocessing code in their analyses or papers, to communicate that they took exhaustive steps to remove artifacts from data.
Workflow: Obtain, Test, Fix

Example
> dat
         date num name
1  2014-01-01   1 NULL
2  2014-01-01   2  naa
3  2014-01-01   3  foo
4  2014-01-01   4  foo
5  2014-01-01   5  foo
6  2014-01-01   6  foo
7  2014-01-01   7  foo
8  2014-01-01   8  foo
9  2014-01-01 999  foo
10 2014-01-01 n/a  foo
> class(dat$num)
[1] "factor"
> class(dat$name)
[1] "factor"
> test_NA(dat)
Now checking 3 columns...
999 was identified as a possible NA alias -- please verify this is not a data value!
  row column value
1   9      2   999
2  10      2   n/a
3   1      3  NULL
> clean_dat <- fix_NA(dat, custom_NAs="naa")
Now fixing 3 columns...
> clean_dat
         date num name
1  2014-01-01   1 <NA>
2  2014-01-01   2 <NA>
3  2014-01-01   3  foo
4  2014-01-01   4  foo
5  2014-01-01   5  foo
6  2014-01-01   6  foo
7  2014-01-01   7  foo
8  2014-01-01   8  foo
9  2014-01-01  NA  foo
10 2014-01-01  NA  foo
> class(clean_dat$num)
[1] "numeric"
> class(clean_dat$name)
[1] "character"
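
To make the test-then-fix idea in this example concrete, here is a minimal base R sketch of how NA aliases could be located and replaced. The helper names (find_na_aliases, replace_na_aliases) and the alias list are assumptions made for illustration; they are not the testdat functions, and the column types they produce may differ from what fix_NA returns.

```r
# Hypothetical helpers for illustration only; not the testdat implementation.
# Assumed list of strings that commonly stand in for missing values.
na_aliases <- c("NULL", "n/a", "N/A", "NA", "999", "")

# Report (row, column, value) for every cell that matches an alias.
find_na_aliases <- function(df, aliases = na_aliases) {
  hits <- lapply(seq_along(df), function(j) {
    values <- as.character(df[[j]])
    rows <- which(!is.na(values) & values %in% aliases)
    data.frame(row = rows, column = rep(j, length(rows)),
               value = values[rows], stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Replace aliases with NA, then let type.convert() re-infer each column type.
replace_na_aliases <- function(df, aliases = na_aliases) {
  df[] <- lapply(df, function(col) {
    values <- as.character(col)
    values[values %in% aliases] <- NA
    type.convert(values, as.is = TRUE)
  })
  df
}

# Rebuild the example data with factor columns, as on the poster.
dat <- data.frame(
  date = rep("2014-01-01", 10),
  num  = c(1:8, "999", "n/a"),
  name = c("NULL", "naa", rep("foo", 8)),
  stringsAsFactors = TRUE
)

hits <- find_na_aliases(dat)
clean_dat <- replace_na_aliases(dat, aliases = c(na_aliases, "naa"))
```

Reporting suspicious cells separately from replacing them mirrors the Test and Fix steps: a value such as 999 may be a real measurement, so it is worth reviewing the report before overwriting anything.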
Functions

test_utf8.R, clean_utf8.R
Test for and correct UTF-8 characters that cannot be read into R.

test_NA.R, fix_NA.R
Test for and correct common missing-value indicators that are not converted to NA in R.

test_continuous_date.R, fix_continuous_date.R
Test for and correct unexpected gaps in date ranges.

test_white_spaces.R, fix_white_spaces.R
Test for and correct white space in character vectors.

test_outliers.R
Test for outliers in your numeric data. A corresponding fix function is not supplied, because correcting outliers has statistical implications.
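
As a rough illustration of the kinds of checks listed above, the following base R sketch covers white space, date gaps, and outliers. All three helper names are hypothetical and are not part of the testdat API. The Tukey-fence rule used for outliers is one common convention chosen here as an assumption, not necessarily what test_outliers.R does.

```r
# Hypothetical helpers for illustration only; not the testdat API.

# Indices of strings that carry leading or trailing white space.
has_outer_whitespace <- function(x) {
  which(!is.na(x) & x != trimws(x))
}

# Dates absent from an otherwise daily sequence between the first and last date.
missing_dates <- function(dates) {
  dates <- as.Date(dates)
  expected <- seq(min(dates), max(dates), by = "day")
  expected[!expected %in% dates]
}

# Indices of values outside the Tukey fences (k * IQR beyond the quartiles).
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

ws_idx   <- has_outer_whitespace(c("foo", " bar", "baz "))
gaps     <- missing_dates(c("2014-01-01", "2014-01-02", "2014-01-04"))
outl_idx <- flag_outliers(c(1, 2, 3, 4, 5, 100))
```

Each helper only reports positions and leaves the data untouched, which keeps the decision about how to correct a value, particularly an outlier, with the analyst.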