R - Data Analysis as R packages

2021-02-24

Lifecycle: maturing

What is this topic?

When starting an analysis with R, it is somewhat difficult to order files in a robust way. R is flexible enough to allow the user to chose any kind of organisation. And when programming, we often wonder: What is the best place to save my data? What to do with the data cleaning steps? What about that commented code? The script is long, should, it be splitted in different files and source them? How / Where to explain the logic / rational of the analysis? In 6 months from now, will the analysis remain clear enough so it can be developed further? Of course, if the analysis can be programmed in less than few dozen of code lines, these questions may not be relevant. Eventually, a rmarkdown approach answers most of the questions above as it provides user-friendly literacy programming options. But when the code extends more than that, this might not be easy to maintain, this is where the R package workflow becomes handy.

A R package is not much more than a well standardised way to store, share and use elements necessary for a specific task. At first, R users would associate a package with a set of useful utilities that will help them in fulfilling the analysis; therefore a package is something you download, install and use. But a R package can actually be much more and for statistical programming, it is also a relevant structure to shelter a complete analysis.

I aim here at explaining how I use the R package for my statistical programming activities. A R package can be understood as the implementation of rules which will help the programmer to share the code. Much information exists online explaining what a R package is. But the standpoint matters, and I believe this often presented from the software developer point of view; I will try to present it from the end-user statistical programmer perspective.

Material and methods

For this, not much is strictly required but there are helper packages that will definitelly improve the experience. I would suggest installing the R packages devtools, usethis, pkgdown; they help in initiating, documenting, checking and building your package with R commands (R CMD) calls and standard file templates and much more.

1. Lets start: Set Up the package

In short

All you need to create the package rpack is, in a R console, run the 4 or 5 following commands for a clean fresh start:

> devtools::create("rpack")
> usethis::use_mit_license()

If R complains about a missing name argument:

> options(usethis.full_name = "My name")
> usethis::use_mit_license()

Then any potential issues with the package can be tested for:

> devtools::check()
── R CMD check results ─────────────────────────────────── rpack 0.0.0.9000 ────
Duration: 13.6s

0 errors ✔ | 0 warnings ✔ | 0 notes ✔

Finally, to bring the package information to a non-R user:

> pkgdown::build_site()

In details

First things first, after opening R, the command devtools::create("rpack") is going to set the structure for a package called rpack. The output below details what happened: it is creating a folders rpack and R as part of the former, write a DESCRIPTION and NAMESPACE files with minimum content.

> devtools::create("rpack")
✔ Creating 'rpack/'
✔ Setting active project to '/home/fcollin/rpack'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: rpack
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
    * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to
    pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
✔ Writing 'NAMESPACE'

The package is all set-up and can even already be built and/or installed.

> devtools::build()
✔  checking for file ‘/home/fcollin/rpack/DESCRIPTION’ (658ms)
─  preparing ‘rpack’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Removed empty directory ‘rpack/R’
─  building ‘rpack_0.0.0.9000.tar.gz’

[1] "/home/fcollin/rpack_0.0.0.9000.tar.gz"

However, at that point, the package does obviously do nothing but documenting a good intention. It is useful to verify the package is well formed with devtools::check(), as it informs the user about any potential problems in the package. The last lines of the chek results are reported below. with the numbers of errors, warnings and notes; part of the game being to keep the three counters at 0.

> devtools::check()
── R CMD check results ─────────────────────────────────── rpack 0.0.0.9000 ────
Duration: 14.7s

❯ checking DESCRIPTION meta-information ... WARNING
  Non-standard license specification:
    `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
  Standardizable: FALSE

0 errors ✔ | 1 warning ✖ | 0 notes ✔

And already, a warning has been detected; indeed the License field in the DESCRIPTION file is not valid. As suggested by the result, it is possible to use the command usethis::use_mit_license() to get a MIT license. This is going to update the LICENSE file, and add two standard files describing it.

> usethis::use_mit_license()
✔ Setting License field in DESCRIPTION to 'MIT + file LICENSE'
✔ Writing 'LICENSE'
✔ Writing 'LICENSE.md'
✔ Adding '^LICENSE\\.md$' to '.Rbuildignore'

If R complains about a missing name argument:

> options(usethis.full_name = "My name")
> usethis::use_mit_license()

Now, running again devtools::check(), all counters should be at 0, the package is clean and ready for the next step.

> devtools::check()
── R CMD check results ─────────────────────────────────── rpack 0.0.0.9000 ────
Duration: 13.6s

0 errors ✔ | 0 warnings ✔ | 0 notes ✔

Finally, the function pkgdown::build_site() is relevant as it build a full website bringing together the package documentation in a comfortable way; the package is then making one step outside of the R environment so someone can consult it before diving if necessary into the package source.

> pkgdown::build_site()

To complete

  • The files generated are templates; it is good advise to update few fields in the DESCRIPTION file which will provide useful information to anyone looking at that package: Title, Version, Authors and Description.
  • The License above does not have to be a MIT one, check usethis - License a package for other possiblities.
  • If using git, that is the good time to use git init so the commit represents a clean state of the package and features can start being added.

2. What goes where?

That is the good side with a package, there is always a good place for every kind of information.

The Data

I can think of three ways of managing the data in a package. Either as a raw file of data, either as a R data file or as a R function.

  • raw file: if the raw file needs to be packed with the analysis, it can be added within the inst folder (standard folder), eventually in a subfolder which is often refered as extdata. That way, your data is never lost, always available as was when doing the analysis. But a raw data set, is not often directly edible for R.
  • as a R data set: anyt R object can be saved (save()), and can be directly available within the data directory. If I want to include my preprocessed t2d data into my data analysis package, then I will save in the package with usethis::use_data(t2d), it will appear as t2d.rda within the data/ folder. You also need to add a file in R/, for instance R/data-t2d.R which will describe the dataset with useful information which will help recording in a complete way the metadata around the data:

    #' T2D
    #'
    #' My preprocessed dataset.
    #' @format 5 cols and 50 rows.
    #' @source received from ... the.
    #' @detail This dataset includes .... Note that ....
    #'
    "t2d"
    

It is even possible to save the data preprocessing steps into a dedicated folder data-raw/, this is not a standard R folder, and it should be added to the list of files to ignore when building the package. In .Rbuildignore you should add ^data-raw$.

  • Finally, you may not necessarily store the data, but simply a function that will pick the data and preprocess them for you. That function, would be stored in the R/ folder. That way, your keep your package slim. The advantage is that you are always pointing at the original file, however, if the preprocessing steps are too long, the solution above might be more satisfactory.

    #' T2D Preprocessing
    #'
    #' Preprocess the dataset and make it ready for the analysis.
    #' @param file (`character`)\cr the location of the source file.
    #' @export
    #' @examples
    #' t2d <- preprocess_t2d()
    #' head(t2d)
    #'
    preprocess_t2d <- (file = "/predictable/file/location.csv") {
    y <- read.table(file)
    y$col <- ... # all the preprocessing steps
    y
    }
    

Functions

Identify the simple actions required for the analysis and translate them into functions with minimal working examples. An analysis broken down to its small fundamental bricks will result in a code easier to maintain, adapt and fix. The slight hoverhead will result in really reusable code.

Examples

Assessment of Disability Progression Independent of Relapse and Brain MRI Activity in Patients with Multiple Sclerosis in Poland

Credits

Image by siala from Pixabay

comments powered by Disqus