2021-02-24
When starting an analysis with R, it is surprisingly difficult to organise
files in a robust way. R is flexible enough to let the user choose any
kind of organisation, and when programming we often wonder:
What is the best place to save my data? What should I do with
the data cleaning steps? What about that commented-out code? The script is getting long;
should it be split into several files which are then sourced?
How and where should the logic and rationale of the analysis be explained?
Six months from now, will the analysis still be clear enough to be
developed further?
Of course, if the analysis fits in less than a few dozen lines of
code, these questions may not be relevant. An
rmarkdown
approach may answer most of the questions above, as it provides
user-friendly literate programming options. But when the code grows beyond
that, an rmarkdown document becomes hard to maintain;
this is where the R package workflow comes in handy.
An R package is not much more than a well standardised way to store, share and use the elements necessary for a specific task. At first, R users associate a package with a set of useful utilities that help them carry out an analysis; from that perspective, a package is something you download, install and use. But an R package can actually be much more: for statistical programming, it is also a relevant structure to shelter a complete analysis.
I aim here at explaining how I use the R package for my statistical programming activities. An R package can be understood as the implementation of rules which help the programmer share code. Much information exists online explaining what an R package is, but the standpoint matters: it is often presented from the software developer's point of view, while I will try to present it from the perspective of the end-user statistical programmer.
For this, not much is strictly required, but there are helper packages that
will definitely improve the experience. I would suggest installing
the R packages devtools, usethis and pkgdown; they help in initiating,
documenting, checking and building your package through R commands
(wrapping R CMD calls), standard file templates and much more.
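If these helper packages are not available yet, they can all be installed from CRAN beforehand:
> install.packages(c("devtools", "usethis", "pkgdown"))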
All you need to create the package rpack
is to run, in an R console, the
few following commands for a clean fresh start:
> devtools::create("rpack")
> usethis::use_mit_license()
If R complains about a missing name argument:
> options(usethis.full_name = "My name")
> usethis::use_mit_license()
Then any potential issues with the package can be tested for:
> devtools::check()
── R CMD check results ─────────────────────────────────── rpack 0.0.0.9000 ────
Duration: 13.6s
0 errors ✔ | 0 warnings ✔ | 0 notes ✔
Finally, to bring the package information to a non-R user:
> pkgdown::build_site()
First things first: after opening R, the command devtools::create("rpack")
sets up the structure for a package called rpack. The output below
details what happened: it creates a folder rpack
with an R
subfolder inside it, and writes DESCRIPTION
and NAMESPACE
files with minimal content.
> devtools::create("rpack")
✔ Creating 'rpack/'
✔ Setting active project to '/home/fcollin/rpack'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: rpack
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
* First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to
pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
✔ Writing 'NAMESPACE'
The package is all set up and can even already be built and/or installed.
> devtools::build()
✔ checking for file ‘/home/fcollin/rpack/DESCRIPTION’ (658ms)
─ preparing ‘rpack’:
✔ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Removed empty directory ‘rpack/R’
─ building ‘rpack_0.0.0.9000.tar.gz’
[1] "/home/fcollin/rpack_0.0.0.9000.tar.gz"
However, at that point, the package obviously does nothing but document
a good intention.
It is useful to verify that the package is well formed with devtools::check(),
as it informs the user about any potential problems in the package.
The last lines of the check results are reported below,
with the numbers of errors, warnings and notes; part
of the game is to keep the three counters at 0.
> devtools::check()
── R CMD check results ─────────────────────────────────── rpack 0.0.0.9000 ────
Duration: 14.7s
❯ checking DESCRIPTION meta-information ... WARNING
Non-standard license specification:
`use_mit_license()`, `use_gpl3_license()` or friends to pick a
license
Standardizable: FALSE
0 errors ✔ | 1 warning ✖ | 0 notes ✔
And already, a warning has been detected: indeed, the License
field in
the DESCRIPTION file is not valid. As suggested by the result, it is
possible to use the command usethis::use_mit_license()
to get an MIT
license. This updates the License field in DESCRIPTION and adds two
standard files describing the license.
> usethis::use_mit_license()
✔ Setting License field in DESCRIPTION to 'MIT + file LICENSE'
✔ Writing 'LICENSE'
✔ Writing 'LICENSE.md'
✔ Adding '^LICENSE\\.md$' to '.Rbuildignore'
If R complains about a missing name argument:
> options(usethis.full_name = "My name")
> usethis::use_mit_license()
Now, running devtools::check() again, all counters should be at 0; the
package is clean and ready for the next step.
> devtools::check()
── R CMD check results ─────────────────────────────────── rpack 0.0.0.9000 ────
Duration: 13.6s
0 errors ✔ | 0 warnings ✔ | 0 notes ✔
Finally, the function pkgdown::build_site()
is relevant as it builds
a full website bringing together the package documentation in a comfortable
way; the package then makes one step outside of the R environment, so
someone can consult it before diving, if necessary, into the package source.
> pkgdown::build_site()
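By default the website is generated in a docs/ folder at the root of the package. Should you want to customise it later, usethis can scaffold the optional _pkgdown.yml configuration file read by pkgdown, after which the site can be rebuilt:
> usethis::use_pkgdown()
> pkgdown::build_site()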
See the usethis documentation, "License a package", for other licensing
possibilities. If you version your work with git
, this is a good time to use git init
so the first commit
represents a clean state of the package, and features can start being added. That is the good side of a package: there is always a good place for every kind of information.
I can think of three ways of managing the data in a package: as a raw data file, as an R data file, or as an R function.
As a raw data file: the file can be stored in the inst
folder (a standard folder), possibly in a subfolder
which is often referred to as extdata
. That way, your data is never lost and remains
available exactly as it was when doing the analysis. But a raw data set is
often not directly usable by R.
As an R data set: any R object can be saved (save()
) and made directly
available within the data
directory. If I want to include my preprocessed
t2d
data in my data analysis package, then I save it in the package
with usethis::use_data(t2d)
; it will appear as t2d.rda
within the data/
folder. You also need to add a file in R/
, for instance R/data-t2d.R
, which describes the dataset with useful information and helps
record the metadata around the data in a complete way:
#' T2D
#'
#' My preprocessed dataset.
#' @format 5 cols and 50 rows.
#' @source received from ...
#' @details This dataset includes .... Note that ....
#'
"t2d"
It is even possible to save the data preprocessing steps in a dedicated
folder, data-raw/
. This is not a standard R folder, so it should be
added to the list of files to ignore when building the package: in
.Rbuildignore
you should add ^data-raw$
.
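usethis offers a shortcut for this too: usethis::use_data_raw() creates the data-raw/ folder, opens a script where the preprocessing code can live, and updates .Rbuildignore in one go:
> usethis::use_data_raw("t2d")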
Finally, you may not necessarily store the data, but simply a function
that picks up the data and preprocesses it for you. That function would
be stored in the R/
folder. That way, you keep your package slim. The
advantage is that you are always pointing at the original file; however, if the
preprocessing steps take too long to run, the solution above might be more satisfactory.
#' T2D Preprocessing
#'
#' Preprocess the dataset and make it ready for the analysis.
#' @param file (`character`)\cr the location of the source file.
#' @export
#' @examples
#' t2d <- preprocess_t2d()
#' head(t2d)
#'
preprocess_t2d <- function(file = "/predictable/file/location.csv") {
y <- read.table(file)
y$col <- ... # all the preprocessing steps
y
}
Identify the simple actions required for the analysis and translate them into functions with minimal working examples. An analysis broken down into its small fundamental bricks results in code that is easier to maintain, adapt and fix. The slight overhead pays off in truly reusable code.
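To make this concrete, here is a purely illustrative sketch of such a brick: a hypothetical helper, documented with its own minimal working example.
#' Summarise a Numeric Vector
#'
#' Compute a few descriptive statistics; one small, reusable step of an analysis.
#' @param x (`numeric`)\cr the values to summarise.
#' @export
#' @examples
#' summarise_values(rnorm(50))
summarise_values <- function(x) {
  c(n = length(x), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}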