For a general introduction to the topic in French, see Pouzat, Davison, and Hinsen (2015) [Pouzat, Christophe, Andrew Davison, and Konrad Hinsen. 2015. “La Recherche Reproductible : Une Communication Scientifique Explicite.” Statistique Et Société 3 (1). http://www.publications-sfds.fr/index.php/stat_soc/article/view/448]. To explore reproducible research further in French, the MOOC Recherche reproductible : principes méthodologiques pour une science transparente is of interest.
Reproducible research refers to research that can be reproduced under various conditions and by different people. It applies to every area of research, both experimental and computational, although it is often (but not always) easier to implement for computational work. The different levels of reproducibility can be formalised as follows:
Repeat my experiment, i.e. obtain the same tables/graphs/results using the same setup (data, software, …) in the same lab or on the same computer. That is basically re-running one of my analyses some time after I originally developed it.
Reproduce an experiment (not one's own), i.e. obtain the same tables/graphs/results in a different lab or on a different computer, using the same setup (the data would be downloaded from a public repository, and the same software would be used, but possibly in a different version or on a different operating system).
Replicate an experiment, i.e. obtain the same (or similar enough) tables/graphs/results with a different setup. The data could still be downloaded from the public repository, or possibly re-generated/re-simulated, and the analysis would be re-implemented based on the original description.
Finally, re-use the information/knowledge from one experiment to run a different experiment with the aim of confirming results from scratch.
The table below summarises these concepts, focusing on data and code in computational projects.
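There are many reasons to work reproducibly, and Markowetz (2015) [Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biology 16 (1): 274. https://doi.org/10.1186/s13059-015-0850-7] nicely summarises five good reasons. Importantly, he stresses that the first beneficiaries of reproducible work are the students/researchers who apply these principles:
Reproducibility helps to avoid disaster.
Reproducibility makes it easier to write papers.
Reproducibility helps reviewers see it your way.
Reproducibility enables continuity of your work.
Reproducibility helps to build your reputation.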
Reproducible research is an essential part of any data analysis. With the tools that are available, one can argue that it has become more difficult not to produce reproducible reports than to produce them.
Reproducible documents have been a part of R since the very
beginning. See for example R. Gentleman and Temple Lang (2004) [Gentleman, Robert, and Duncan Temple Lang. 2004. “Statistical Analyses and Reproducible Research.” Bioconductor Project Working Papers. Working Paper 2. https://biostats.bepress.com/bioconductor/paper2] to see how such compendia play a
central role within the Bioconductor
project (more about Bioconductor in its dedicated
chapter). Originally, these were written in LaTeX, interleaved with R
code chunks, forming so-called Sweave documents (with extension .Rnw).
Figure 8.1: The rmarkdown sticker
More recently, it has become common to use the markdown
markup language, rather than LaTeX. Once interleaved with R code
chunks, these documents become Rmarkdown files (.Rmd). They can be
converted into markdown using knitr::knit, which executes the code
chunks and incorporates their output in the resulting markdown
document, which itself is converted to one of many output formats,
typically pdf or html, using pandoc. In R, this
final conversion is done using rmarkdown::render (which relies on
pandoc).
knitr::knit converts the Rmd file into md by executing the code
chunks and replacing the code by its output (text, tables, figures,
…).
The md file is then compiled into the desired output
format (typically html or pdf) using pandoc.
In practice, in R, these two steps are automatically handled in one
go by rmarkdown::render.
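The two steps described above can be sketched as follows. This is a minimal illustration rather than the course's own example: the file name demo.Rmd and its content are made up, the built-in cars dataset stands in for real data, and the html conversion assumes that pandoc is installed (it ships with RStudio).

```r
library(knitr)
library(rmarkdown)

# Write a tiny, hypothetical Rmd file into a temporary directory
work_dir <- tempfile("render-demo-")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "demo.Rmd")
writeLines(c(
  "The mean speed in the `cars` data:",
  "",
  "```{r mean-speed}",
  "mean(cars$speed)",
  "```"
), rmd_file)

# Step 1 only: execute the code chunk and write a markdown file
md_file <- knit(rmd_file, output = file.path(work_dir, "demo.md"), quiet = TRUE)
cat(readLines(md_file), sep = "\n")

# Steps 1 and 2 in one call: knit, then convert to html with pandoc
html_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)
html_file
```

Inspecting the intermediate md file is rarely needed, but it helps to see that the chunk has been replaced by its code and output before pandoc takes over.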
The rmarkdown package is developed and maintained by RStudio. It benefits from excellent documentation and support, and integrates into the RStudio editor.
An Rmarkdown document is composed of:
An optional YAML header, delimited by ---.
Text in simple markdown format.
One or more R code chunks delimited by three backticks. Each code chunk can be uniquely named and parametrised with a set of code chunk options.
These respective parts of an
Rmd file are shown below and will be
demonstrated during the course.
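As a rough sketch of these three parts, the R code below writes a minimal Rmd file and prints it. The title, the chunk name summary-cars and the file name are hypothetical, chosen only for illustration.

```r
# Lines of a minimal, hypothetical Rmd file
rmd_lines <- c(
  "---",                                # start of the optional YAML header
  'title: "A minimal report"',
  "output: html_document",
  "---",                                # end of the YAML header
  "",
  "Some text in *simple* markdown format.",
  "",
  "```{r summary-cars, echo = TRUE}",   # named chunk with one chunk option
  "summary(cars)",
  "```"                                 # three backticks close the chunk
)

# Write the file to a temporary location and display its content
rmd_file <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```

In day-to-day work the file would of course be typed directly in an editor. Generating it from R here only keeps the example self-contained.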
The following video, R Markdown: The bigger picture, presented by Garrett Grolemund at a 2019 RStudio conference, provides a very nice introduction to the many reasons why writing reproducible documents is essential in data science and biomedical research.
Among the most useful options that can be set for code chunks is
cache. Setting cache = TRUE caches the results of that specific code
chunk, so that it is not recomputed every time the document is knitted,
unless the code chunk was modified. This is an important feature
when long computations are necessary.
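The effect of caching can be observed with a small experiment, sketched below under a few assumptions: the chunk name slow-step and the file name cached.Rmd are made up, and Sys.sleep(2) stands in for a genuinely long computation. The document is knitted twice, and the second run should reuse the cached result instead of sleeping again.

```r
library(knitr)

# Work in a temporary directory so that cache files do not clutter a project
work_dir <- tempfile("cache-demo-")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# A hypothetical Rmd file with a single cached chunk
writeLines(c(
  "```{r slow-step, cache = TRUE}",
  "Sys.sleep(2)  # stands in for a long computation",
  "x <- 42",
  "x",
  "```"
), "cached.Rmd")

# First knit runs the chunk; second knit loads its results from the cache
first  <- unname(system.time(knit("cached.Rmd", quiet = TRUE))["elapsed"])
second <- unname(system.time(knit("cached.Rmd", quiet = TRUE))["elapsed"])
c(first = first, second = second)

setwd(old_wd)
```

Editing the code inside the chunk invalidates the cache, so the chunk is recomputed on the next knit.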
The DT::datatable function creates dynamic tables
directly from R, as shown below.
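A minimal sketch, assuming the DT package is installed and using the built-in iris dataset purely for illustration:

```r
library(DT)

# Build a dynamic table from the first rows of a built-in dataset
tbl <- datatable(head(iris, 20))
tbl  # in an html report, this is displayed as a sortable, searchable table
```

Such tables are html widgets, so they are mainly useful when the report is compiled to html.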
It is good practice to finish a report with the output of the
sessionInfo() function, as is done at the end of this material. This
allows readers to review the version of R itself and of all the
packages that were used to produce the report.
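In practice, this amounts to a final code chunk along the following lines (a minimal sketch; the variable name si is arbitrary):

```r
# Typically the last code chunk of a report
si <- sessionInfo()
si                            # R version, platform and attached packages
si$R.version$version.string   # the R version alone, as a character string
```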
Prepare an Rmarkdown report summarising the portal ecology data. The report should include a Material and methods section where the data is read in (ideally from the online file) and briefly described, a Data preparation section where rows with missing values are filtered out, and a Visualisation section where one or two plots are rendered. Finish your report with a Session information section.
Note: When you prepare an
Rmd report, it is advised to start
with a code chunk that loads all packages, and update that code chunk
as you proceed with your analysis and use new packages. This avoids
the situation where some commands work in your R console (where you
initially loaded the package) but fail when you compile your
report. Indeed, an
Rmd file is compiled in a new, clean R session,
without access to your working session, the packages that were loaded,
and the variables that were created.
There are other tools for reproducible research that aim to disseminate more than code and data. Docker containers, for example, make it possible to share the complete image of an operating system, including all system dependencies and the software/data needed to repeat a complete analysis. These are useful tools, even though they aren't necessarily ideal, and they are beyond the scope of this course. In the annex (chapter 12), we show how to use a pre-built course-specific RStudio cloud instance based on the Renku infrastructure.
To test if your report is fully reproducible, uniquely name the
Rmd file surname_surname_beers_report.Rmd, post it on the course
forum and ask your neighbour to download it, compile it into pdf and
provide feedback on whether the document was reproducible and easy to
understand.
Create a new Rmd file named student_results.Rmd, to be
compiled in either pdf or html. Remove everything but the lines
containing the header. In the example below, the header is composed
of the first 6 lines, starting and ending with ---.
title: "Student results"
date: "March 14, 2020"
From the rWSBIM1207 package, use the
interroA.csv() function to get the
name of a csv file containing test results for a set of students,
and read these data into R using
read_csv. Display the
first few observations and write a short sentence explaining the
data.
Make sure that you can compile your
Rmd file into either pdf or html.
Create a new section called Data visualisation.
Here, the goal is to visualise the score distributions for the four
tests using ggplot2. These distributions will be visualised using
boxplots. You will need to visualise these distributions for each
test separately, and for male and female students.
As discussed during the course, we need data in a long format to be
able to use
ggplot2. Start by converting these data into a long
format using pivot_longer(). Display the first rows of these new
data and write a short sentence describing them and the
transformation you just applied.
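As a reminder of how pivot_longer() works, here is a small sketch on made-up data. The table scores_wide and its column names are hypothetical and do not correspond to the actual test results used in this exercise.

```r
library(tidyr)

# Hypothetical wide table: one row per student, one column per test
scores_wide <- data.frame(
  student = c("A", "B", "C"),
  gender  = c("F", "M", "F"),
  test1   = c(12, 15, 9),
  test2   = c(14, 11, 17)
)

# Stack the test columns: one row per student and per test
scores_long <- pivot_longer(
  scores_wide,
  cols      = c(test1, test2),  # columns to stack
  names_to  = "test",           # new column holding the former column names
  values_to = "score"           # new column holding the values
)
head(scores_long)
```

The long table has one row per student and test, which is the layout that ggplot2 expects when a variable such as test is mapped to an aesthetic or a facet.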
Use ggplot2 to visualise the score distributions as boxplots
for each test and for female and male students.
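One possible layout is sketched below on simulated data. The data frame scores_long, its column names and the random scores are assumptions made for illustration only, and other designs (for example mapping gender to a fill colour) would work equally well.

```r
library(ggplot2)

# Simulated long-format scores (hypothetical, not the course data)
set.seed(1)
scores_long <- data.frame(
  gender = rep(c("F", "M"), times = 20),
  test   = rep(c("test1", "test2"), each = 20),
  score  = round(runif(40, min = 0, max = 20))
)

# One boxplot per gender, one panel per test
p <- ggplot(scores_long, aes(x = gender, y = score)) +
  geom_boxplot() +
  facet_wrap(~ test)
p
```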
And course reports! The exam of this course will consist of a reproducible report using the tools described below.
Not only reviewers, but also professors who read exams. See previous footnote.
This is an important exercise, as it will mimic the exam
situation, where you will hand in your
Rmd reports, which will
need to be compiled (to pdf) before marking. In addition, it
demonstrates the challenges of writing and reproducing an
understandable and reproducible document.
It is important here not to copy/paste the filename returned by
interroA.csv(). That file is distributed with the
rWSBIM1207 package, and has a computer-specific path. Because we
want our reports to be reproducible, we want to use the filename
as returned by the function on any computer, not the one of a
specific computer. Make sure you either pass interroA.csv() directly to
read_csv() or store its output into a variable that
is passed to read_csv().
Page built: 2024-02-22 using R version 4.3.2 Patched (2023-12-27 r85757)