Chapter 11 Conclusions

In this course, we have seen the importance of structured data. Good data structure starts with simple, tidy tabular data, whether it is manually encoded in a spreadsheet or handled in R as data frames or tibbles. More complex data that doesn’t fit in a table can be modelled into dedicated objects that display specialised behaviour. Structured data allows us to reason about that data without having to look at it. Reasoning about and generalising data in turn allow us to manipulate and visualise it, i.e. to explore, analyse and understand it. The cherry on top of the data analysis cake is being able to reproduce an analysis, either by redoing it oneself or by sharing it in a way that lets others do so.

As mentioned in the preamble, the goal of this course is not for students to qualify as bioinformaticians by the end. What matters is to appreciate the importance of data and their analysis, and to become fluent in exploring, discussing and communicating around data. A shared appreciation of data and their complexity will hopefully narrow the gap between bioinformaticians and experimental scientists. Indeed, at the end of the day, it’s useful to remember that:

We are all biologists, in that we study biology. Some use wet lab experiments, others dry lab techniques.

11.1 Next steps

  • Statistics and machine learning (see your statistics courses and the follow-up course WSBIM1322).
  • Getting better at programming and data analysis (Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. O’Reilly Media. https://r4ds.had.co.nz/; Wickham, Hadley. 2014a. Advanced R. Chapman & Hall/CRC The R Series. Taylor & Francis).
  • Evolving scripts into tools/packages (Wickham, Hadley. 2015. R Packages. 1st ed. O’Reilly Media, Inc.).
  • Other tools: the Unix command line and git/GitHub (Perez-Riverol, Yasset, Laurent Gatto, Rui Wang, Timo Sachsenberg, Julian Uszkoreit, Felipe da Veiga Leprevost, Christian Fufezan, et al. 2016. “Ten Simple Rules for Taking Advantage of Git and GitHub.” PLOS Computational Biology 12 (7): 1–11. https://doi.org/10.1371/journal.pcbi.1004947). See also this short tutorial.
  • Omics data analysis (see upcoming WSBIM2122 course).

11.2 Additional exercises

To answer the following exercises, you’ll need to draw on what you have learnt in the various chapters.

► Question

Make sure you have rWSBIM1207 version >= 0.1.16, then load the 2022 Belgian road accidents statistics and the associated metadata describing the variables. The road_accidents_be_2022.rds() function returns the path to the statistics as an rds file, and road_accidents_be_meta.csv() returns the path to the metadata csv file.

The data provide the number of killed, seriously injured, slightly injured and uninjured victims of road accidents, by age group, type of user, sex and various characteristics of the accident, in Belgium in 2022.

  • Using the appropriate functions, load both files into R and familiarise yourself with the data.

  • Visualise the numbers for men and women over the hours of the day for all age classes. Ignore any unknown information. Do you see a difference between men and women?

  • Visualise the number of victims in the different provinces. Do this comparison for the different types of victims. Ignore any unknown information. Use lines and points for this visualisation.

  • Come up with additional visualisations that you could produce with these data. Use bar plots for this visualisation.

► Question

For this exercise, make sure you have rWSBIM1207 version >= 0.1.17. Using the population_be.csv() function, get the paths to 35 files with the population numbers across multiple regions of Belgium up to 2023.

Tip: given that all files contain equivalent data (i.e. for the same variables), you can use read_csv() to load all files at once into a long table.
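As a minimal sketch of this tip, the example below writes two small toy files and reads them in a single call. The file names, column names and values are made up for illustration and are not those of the population files; it assumes readr version 2.0 or later, where read_csv() accepts a vector of paths.

```r
library(readr)

# Two toy files with identical columns (hypothetical content, for illustration only)
tmp <- tempfile()
dir.create(tmp)
f1 <- file.path(tmp, "pop_2022.csv")
f2 <- file.path(tmp, "pop_2023.csv")
write_csv(data.frame(region = c("A", "B"), population = c(10, 20)), f1)
write_csv(data.frame(region = c("A", "B"), population = c(11, 21)), f2)

# A vector of paths is read and row-bound into one long table;
# `id` adds a column recording the source file of each row
pop <- read_csv(c(f1, f2), id = "file", show_col_types = FALSE)
dim(pop)
```

The `id` column is useful when information such as the year is only encoded in the file name.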

  • How many regions have been surveyed over the years?

  • What differences are there in terms of region?

Focusing on Belgium and the Brussels, Walloon and Flemish regions, and from 1991 on, generate one figure (possibly with multiple facets) that allows you to answer the following questions:

  • Has the population increased since 1991?

  • Are there more women or men living in these regions? Has this changed since 1991?

  • Have the changes in population been driven by men, women or both equally?

  • Which region has the largest (smallest) population?

► Question

Here, we are going to examine the evolution of bankruptcies in different regions of Belgium between 2005 and 2023. The original data are available on the web page of Statbel, the Belgian office for statistics.

Two files are needed:

  • The full dataset, TF_BANKRUPTCIES.zip, contains over 175000 records and is relatively large. Instead, you can use TF_BANKRUPTCIES_subset.txt, which contains just over 10000 records and is made available through the rWSBIM1207 package (see below).

  • The Method_BANKRUPTCIES.xlsx file describes the variables in the bankruptcies dataset. It can be either downloaded from the statbel page, or read from the rWSBIM1207 package (see below). Note that you are not asked to load the xlsx file in R (although you can with readxl::read_xlsx()); feel free to open it with your favourite spreadsheet programme.

You will see that there are some minor discrepancies between the metadata file and the data variables, such as the CD_WEEK variable, which is documented in the metadata file but absent from the actual data. The other, non-documented, time-related variables are self-explanatory.

The (compressed) TF_BANKRUPTCIES_subset.txt and Method_BANKRUPTCIES.xlsx are available in the rWSBIM1207 package (version >= 0.1.19). Their paths are available with the faillites_be() function.

You are asked to:

  • Visualise the number of bankruptcies over time in the different regions. Make sure to remove observations that don’t have any region. Note that the dates are recorded in a different format from what we have seen so far, but the usual conversion functions still apply.

  • To improve the readability of these busy plots, compute the number of bankruptcies per year and repeat the above visualisation.

  • We expect that bankruptcies of large companies will lead to more (full-time) unemployment. Check this hypothesis by (1) computing and displaying a table of average job losses by company size, and (2) producing a figure that uses the distribution of job losses (as opposed to only the averages). To facilitate the interpretation, order the companies by their size in the table and figure.
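The date conversion and per-year counting mentioned above can be sketched on toy data as follows. The column names, the region labels and the "%d/%m/%Y" date layout are assumptions chosen for illustration; check the actual layout in the bankruptcies file and adapt the format string accordingly.

```r
# Toy records; the "%d/%m/%Y" date format and the column names are assumptions
toy <- data.frame(
  date   = c("03/01/2005", "17/06/2005", "09/02/2006"),
  region = c("Region A", "Region B", "Region A")
)

# as.Date() takes a `format` describing how the text is laid out
toy$date <- as.Date(toy$date, format = "%d/%m/%Y")

# Extract the year, then count records per year and region
toy$year <- as.integer(format(toy$date, "%Y"))
per_year <- as.data.frame(table(year = toy$year, region = toy$region))
per_year
```

The same two steps (parse the date, then summarise by a coarser time unit) can equally be written with the tidyverse functions seen in earlier chapters.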

► Question

This question was contributed by Mr William Vanloo in August 2024.

We will explore the road accidents that occurred in Belgium in 2023, as available from the Statbel website: the .txt file contains the dataset and the .xlsx file contains the metadata describing the different variables.

  • Using this dataset, visualise the number of accidents occurring in different light conditions by severity of injury in the province of Liège over time. Be sure to remove the ‘not available’ data.

  • Visualise the number of accidents in the different provinces during the month of January according to the different light conditions and only for fatal accidents.

► Question

  • The results of the 2020 to 2023 bioinformatics exams are provided in an XLS file that does not adhere to the tidy data principles defined in chapter 1. Clean the data, export it to a format of your choice (csv, txt, tsv, etc.), and import it into R. Be sure to retain all the information encoded in the original file during your cleanup. Display the first elements of the data in R, a summary of the data, and their dimensions.

  • How many students have taken WSBIM1207, WSBIM1322, and WSBIM2122?

  • Display a first table showing the course grades and the overall average for the top three students (i.e., those with the highest cycle averages) who completed the undergraduate program (WSBIM1207 and WSBIM1322), and a second table showing the top three students who also took the master’s course (WSBIM2122).

  • Visualize the distribution of results for the three courses, highlighting the results obtained in the first and second semesters. Interpret your figure.

► Question

  • Load the quantitative proteomics data from the se20250616.rds file into R. Indicate the number of proteins, samples, and subcellular locations included in the dataset.
  • Calculate the log fold change between the COND and CTRL conditions for each protein. Display the fold changes for a few proteins. Note that the quantitative data are already log-transformed.
  • Visualize the distributions of the log fold changes as a function of subcellular locations.
  • The cells in the COND group were treated with an agent that disrupts protein trafficking to a specific location. Based on the previous figure, what is this location? Why?
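Because the quantitative data are already log-transformed, a log fold change is a difference rather than a ratio. The sketch below illustrates this on a made-up matrix; the protein names, sample names and values are hypothetical, and with the real dataset the matrix and sample groups would first have to be extracted from the loaded object.

```r
# Toy log-transformed intensities: 3 proteins x 4 samples (made-up values)
x <- matrix(c(10, 10, 12, 12,
               8,  8,  8,  8,
               9, 11,  7,  9),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("prot", 1:3),
                            c("CTRL_1", "CTRL_2", "COND_1", "COND_2")))
group <- c("CTRL", "CTRL", "COND", "COND")

# On the log scale, log(a / b) = log(a) - log(b), so the log fold change
# is the difference between the group means
lfc <- rowMeans(x[, group == "COND"]) - rowMeans(x[, group == "CTRL"])
lfc
```

A positive value indicates higher abundance in COND than in CTRL, a negative value the opposite, and zero no change.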

Page built: 2025-07-17 using R version 4.5.0 (2025-04-11)