Preamble

The WSBIM1207 course is an introduction to bioinformatics (and data science) for biology and biomedical students. It introduces bioinformatics methodology and technologies without relying on any prerequisites. The aim of this course is for students to be in a position to understand important notions of bioinformatics and tackle simple bioinformatics-related problems in R, in particular to develop simple R analysis scripts and reproducible analysis reports to interrogate, visualise and understand data in a tidy tabular format.

The course is followed by Bioinformatics (WSBIM1322) and Omics data analysis (WSBIM2122).

It is interesting to start this course by asking the students, who have likely not yet been exposed to bioinformatics

What is bioinformatics?

While Wikipedia defines it as

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data.

this course won’t try to answer that question by providing an overview of the many context where methods and software tools are developed in the frame of biological data. We will focus on the getting hands-on experience in data manipulation. Hence, the material and how it is taught will be focused on practice.

Motivation

Today, it is difficult to overestimate the very broad importance and impact of data. Given the abundance of data around us, and the sophistication of tools for their analysis and interpretation that are readily available, data has become a tool of profound social change. Research in general, and biomedical research in particular, is at the centre of this evolution. And while bioinformatics has been playing a central role in bio-medical research for many years now, bioinformatics skills aren’t well integrated in life science curricula, limiting students in their career prospects and research horizon (Wilson Sayres et al. 2018Wilson Sayres, M A, C Hauser, M Sierk, S Robic, A G Rosenwald, T M Smith, E W Triplett, et al. 2018. “Bioinformatics Core Competencies for Undergraduate Life Sciences Education.” PLoS One 13 (6): e0196878. https://doi.org/10.1371/journal.pone.0196878.). It is important for young researchers to acquire quantitative, computational and data skills to address the challenges that lie ahead.

This first course will focus on the data skills. It will not teach to use any specific piece of bioinformatics software. Once one has identified a relevant piece of software, running it shouldn’t be a major problem1 Although, in all fairness, some bioinformatics software are notoriously difficult to install and run.. The important part will be to understand what it does, why it does it and how to assess if the output can be trusted. The latter requires to explore and understand the data and the results, i.e. data skills. As described by Buffalo (2015Buffalo, Vince. 2015. Bioinformatics Data Skills. O’Reilly Media, Inc.), equating learning a piece of bioinformatics software to learning bioinformatics is like learning pipetting as a means to learn molecular biology.

Critical thinking is essential in any aspect of research, and bioinformatics and data analysis doesn’t escape that rule. Teaching critical thinking is however very difficult, and arguably impossible without first possessing the skills and master the tools to be critical about data. Rather than teaching a limited set of software tools, this course aims at teaching a set of hands-on skills and methodologies that allow to interrogate and visualise data, and hence be in a position to be critical with respect to new data. In addition, while specific software become obsolete and are replaced, or are specific to one specific field of research, critical investigation of data will always be required and will never be replaced.

We will be using the R language and environment (R Core Team 2019R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.) and the RStudio integrated development environment to acquire these data skills. Other interactive language such as Python and the interactive jupyer notebooks would also have been a good fit. One motivation of this choice is the availability of numerous R/Bionductor packages (Huber et al. 2015Huber, W, V J Carey, R Gentleman, S Anders, M Carlson, B S Carvalho, H C Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.) for the analysis of high throughput biology data.

Another reason why the focus of the course ought to be on data skills is that a notable difficulty in modern, multidisciplinary research is communication. Wetlab biomedical scientists aren’t required to become bioinformaticians, statisticians, programmers, … to be outstanding researchers, but they will need to communicate efficiently with these experts (and vice versa). What most often unites all these experts is data, and communication around data is critical. The importance of critical thinking and communication around data becomes more evident when one realises that, when tracking work in bioinformatics core facilities, only a minority of projects were purely routine and that most researchers came to the bioinformatics core seeking customized analysis, rather than a standardized package (Chang 2015Chang, J. 2015. “Core Services: Reward Bioinformaticians.” Nature 520 (7546): 151–52. https://doi.org/10.1038/520151a.).

To illustrate the reasons why R in general (and in the case of biomedical sciences, Bioconductor in particular) are worth learning, I provide here some examples where these software are used. From a bioinformatics point of view:

In addition, here are three short videos (in French, with subtitles in multiple languages) by PhD students that assist in the teaching of this course. The videos illustrate real-work applications of some of the concepts taught in the two bachelor bioinformatics course (this one and WSBIM1322):

  • Julie Devis is pursuing a PhD in the Computational Biology and Bioinformatics Unit with Prof Laurent Gatto. She uses R and Bioconductor to study the methylation in cancer germ-line genes.
  • Valentine Robaux is pursuing a PhD in the Cardiovascular Research Unit with Profs Sandrine Horman and Christophe Beauloye. She uses R and Bioconductor to investigate platelet function.
  • Jean Fain is pursuing a PhD in the Epigenetics Unit with Prof Charles De Smet. He uses R and Bioconductor to study DNA methylation and its role in cancer.

And more generally, at Google, Pfizer, Merck, GSK, Bank of America, the InterContinental Hotels Group, Shell, …

In summary, the overall learning objectives of this course are:

  • for students to apply and adapt the general data analysis techniques and principles that are presented to new data and new contexts;

  • establish links between different concepts seen in the course such as, for example, the importance of tidy data in general, how it applies to dataframes in R, and how it enables reasoning on the data;

  • become autonomous when being presented with new data and be in a position to explore and understand them.

References and credits

References are provided throughout the course. Several stand out however, as they cover large parts of the material or provide complementary resources.

The material for the first chapters, covering the Introduction to data science with R, was originally based on the Data Carpentry Ecology curiculum (von Hardenberg et al. 2019von Hardenberg, Achaz, Adam Obeng, Aleksandra Pawlik, Alex Pletzer, Alexey Shiklomanov, Anne Fouilloux, April Wright, et al. 2019. Data Carpentry: R for Data Analysis and Visualization of Ecological Data.” Edited by Francois Michonneau and Auriel Fournier. https://doi.org/10.5281/zenodo.569338.). The main dataset by Blackmore et al. (2017), introduced in 2021, was prepared by the Bioconductor education group as part of their lesson development.

General references for this course are R for Data Science (Grolemund and Wickham 2017Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. O’Reilly Media. https://r4ds.had.co.nz/.) and Bioinformatics Data Skills (Buffalo 2015Buffalo, Vince. 2015. Bioinformatics Data Skills. O’Reilly Media, Inc.).

The RStudio Cheat Sheets are also a handy resource and readers will be pointed to specific sheets in the respective chapters.

This course is being taught by Prof Laurent Gatto with invaluable assistance from Mr Jean Fain (since 2019), Ms Valetine Robaux (since 2020), Ms Julie Devis (since 2022), Dr Axelle Loriot (2018 - 2021) and Mr Kevin Missault (2018 - 2020) at the Faculty of Pharmacy and Biomedical Sciences (FASB) at the UCLouvain, Belgium.

Pre-requisites

There are no programming or technical pre-requisities for this course, other than basic computer usage, such as general knowledge about files (binary and text files) and folders and as well as downloading files. Familiarity with a spreadsheet editor is helpful for the first chapter.

Software requirements are documented in the Setup section below.

About this course material

This material is written in R markdown (Allaire et al. 2024Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. Rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.) and compiled as a book using knitr (Xie 2024bXie, Yihui. 2024b. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.) bookdown (Xie 2024aXie, Yihui. 2024a. Bookdown: Authoring Books and Technical Documents with r Markdown. https://github.com/rstudio/bookdown.). The source code is publicly available in a Github repository https://github.com/uclouvain-cbio/WSBIM1207 and the compiled material can be read at http://bit.ly/WSBIM1207.

Contributions to this material are welcome. The best way to contribute or contact the maintainers is by means of pull requests and issues. Please familiarise yourself with the code of conduct. By participating in this project you agree to abide by its terms.

Citation

If you use this course, please cite it as

Laurent Gatto, Kevin Missault, Manon Martin & Axelle Loriot. (2021, September 24). UCLouvain-CBIO/WSBIM1207: Introduction to bioinformatics (Version v2.0.0). Zenodo. DOI: 10.5281/zenodo.5532658

DOI

License

This material is licensed under the Creative Commons Attribution-ShareAlike 4.0 License.

Setup

For chapter 1 about Data organisation with Spreadsheets, a spreadsheet programme is necessary.

We will be using the R environment for statistical computing as main data science language. We will also use the RStudio interface to interact with R and write scripts and reports. Both R and RStudio are easy to install and works on all major operating systems.

Once R and RStudio are installed, a set of packages will need to be installed. See section 13.1 for details.

Page built: 2024-08-27 using R version 4.4.1 (2024-06-14)