Chapter 11 Conclusions

In this course, we have seen the importance of structured data. Good data structure starts with simple, tidy tabular data, whether it is manually encoded in a spreadsheet or handled in R as dataframes or tibbles. More complex data that doesn’t fit in a table can be modelled as dedicated objects that display specialised behaviour. Structured data allows us to reason about the data without having to look at it. Reasoning about and generalising data in turn allows us to manipulate and visualise it, i.e. to explore, analyse and understand it. The cherry on top of the data analysis cake is to be able to reproduce an analysis, either by repeating it oneself or by sharing it in a way that lets others do so.

As mentioned in the preamble, the goal of this course is obviously not for the students who take it to qualify as bioinformaticians by the end. What is important, however, is to appreciate the importance of data and their analysis, and to become fluent in exploring, discussing and communicating around data. A shared appreciation of data and their complexity will hopefully reduce the distinction between bioinformaticians and experimental scientists. Indeed, at the end of the day, it’s useful to remember that

We are all biologists, in that we study biology. Some use wet lab experiments, others dry lab techniques.

11.1 Next steps

  • Statistics and machine learning (see your statistics courses and the follow-up course WSBIM1322).
  • Getting better at programming and data analysis. See (Grolemund and Wickham 2017) [Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. O’Reilly Media. https://r4ds.had.co.nz/] and (Wickham 2014a) [Wickham, Hadley. 2014a. Advanced R. Chapman & Hall/CRC The R Series. Taylor & Francis].
  • Evolving scripts into tools/packages (Wickham 2015) [Wickham, Hadley. 2015. R Packages. 1st ed. O’Reilly Media, Inc.].
  • Other tools: Unix command line and git/GitHub (Perez-Riverol et al. 2016) [Perez-Riverol, Yasset, Laurent Gatto, Rui Wang, Timo Sachsenberg, Julian Uszkoreit, Felipe da Veiga Leprevost, Christian Fufezan, et al. 2016. “Ten Simple Rules for Taking Advantage of Git and GitHub.” PLOS Computational Biology 12 (7): 1–11. https://doi.org/10.1371/journal.pcbi.1004947]. See also this short tutorial.
  • Omics data analysis (see upcoming WSBIM2122 course).

11.2 Additional exercises

To answer the following exercises, you’ll need to draw on what you have learnt across the various chapters.

► Question

Make sure you have rWSBIM1207 version >= 0.1.16 and load the 2022 Belgian road accidents statistics and the associated metadata, which describe the variables. The path to the former, an rds file, is returned by road_accidents_be_2022.rds(); road_accidents_be_meta.csv() returns the path to the metadata csv file.

The data provide the number of killed, seriously injured, slightly injured and uninjured victims of road accidents, by age group, type of user, sex and various characteristics of the accident, in Belgium in 2022.

  • Using the appropriate functions, load both files into R and familiarise yourself with the data.

  • Visualise the numbers for men and women over the hours of the day for all age classes. Ignore any unknown information. Do you see a difference between men and women?

  • Visualise the number of victims in the different provinces. Do this comparison for the different types of victims. Ignore any unknown information. Use lines and points for this visualisation.

  • Come up with additional visualisations that you could produce with these data. Use bar plots for this visualisation.
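As a starting point for the first bullet above, here is a minimal sketch of the loading step. It assumes that rWSBIM1207 is installed and that road_accidents_be_2022.rds() and road_accidents_be_meta.csv() return file paths as described in the question. The object names `acc` and `meta` are illustrative, and the metadata file is assumed to be comma-separated. The guard only makes the snippet safe to run when the package is missing or too old.

```r
library(readr)

# Minimum package version required by this exercise (stated in the question)
required <- "0.1.16"

# TRUE only if rWSBIM1207 is installed and recent enough
has_pkg <- requireNamespace("rWSBIM1207", quietly = TRUE) &&
    packageVersion("rWSBIM1207") >= required

if (has_pkg) {
    # An rds file is read with readRDS(), a csv file with read_csv()
    acc <- readRDS(rWSBIM1207::road_accidents_be_2022.rds())
    meta <- read_csv(rWSBIM1207::road_accidents_be_meta.csv())
    # First look at the variables, their types and their descriptions
    str(acc)
    print(meta)
} else {
    message("Please install or update rWSBIM1207 (>= ", required, ")")
}
```

Reading the metadata alongside the output of str() is usually the quickest way to match each column to its meaning before plotting anything.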

► Question

For this exercise, make sure you have rWSBIM1207 version >= 0.1.17. Using the population_be.csv() function, get the paths to 35 files with the population numbers across multiple regions of Belgium up to 2023.

Tip: given that all files contain equivalent data (i.e. for the same variables), you can use read_csv() to load all files at once into a long table.
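The tip can be illustrated with a small sketch on stand-in files. The two temporary files, their column names and their values are made up for illustration; in the exercise, the vector of paths would come from population_be.csv() instead. Passing several paths to read_csv(), and the `id` argument used here, assume readr 2.0 or later.

```r
library(readr)

# Two small stand-in files with identical columns (hypothetical content)
dir <- tempfile("population_")
dir.create(dir)
write_csv(data.frame(region = c("A", "B"), population = c(10, 20)),
          file.path(dir, "pop_2022.csv"))
write_csv(data.frame(region = c("A", "B"), population = c(11, 21)),
          file.path(dir, "pop_2023.csv"))

# A character vector of file paths
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# read_csv() accepts several paths and row-binds the files into one
# long table; `id` adds a column recording the source file of each row
pop <- read_csv(files, id = "file", show_col_types = FALSE)
pop
```

Keeping the source file in its own column is optional, but it helps when tracing a surprising value back to the file it came from.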

  • How many regions have been surveyed over the years?

  • What differences are there in terms of region?

Focusing on Belgium and the Brussels, Walloon and Flemish regions, and from 1991 on, generate one figure (possibly with multiple facets) that allows you to answer the following questions:

  • Has the population increased since 1991?

  • Are there more women or men living in these regions? Has this changed since 1991?

  • Have the changes in population been driven by men, women or both equally?

  • Which region has the largest (smallest) population?
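The general shape of such a faceted figure can be sketched on toy data. Everything in the table below (column names, region labels and numbers) is invented for illustration and does not reflect the actual population files.

```r
library(ggplot2)

# Toy long table: one row per year, sex and region (made-up values)
toy <- expand.grid(year = c(1991, 2007, 2023),
                   sex = c("Women", "Men"),
                   region = c("Region A", "Region B"),
                   stringsAsFactors = FALSE)
toy$population <- seq_len(nrow(toy)) * 100

# One panel per region, one coloured line (with points) per sex
p <- ggplot(toy, aes(x = year, y = population, colour = sex)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ region, scales = "free_y")
p
```

Free y scales make the trend within each region easier to read, but they hide size differences between regions; fixed scales (the default of facet_wrap()) are the better choice when comparing regions with each other.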

► Question

Here, we are going to examine the evolution of bankruptcies in different regions of Belgium between 2005 and 2023. The original data are available on the Belgian office for statistics, statbel, webpage.

Two files are needed:

  • The full dataset, TF_BANKRUPTCIES.zip, contains over 175000 records and is relatively large. Instead, you can use TF_BANKRUPTCIES_subset.txt, which contains just over 10000 records and is made available through the rWSBIM1207 package (see below).

  • The Method_BANKRUPTCIES.xlsx file describes the variables in the bankruptcies dataset. It can be either downloaded from the statbel page, or read from the rWSBIM1207 package (see below). Note that you are not asked to load the xlsx file in R (although you can with readxl::read_xlsx()); feel free to open it with your favourite spreadsheet programme.

You will see that there are some minor discrepancies between the metadata file and the data variables, such as the CD_WEEK variable, which is documented in the metadata file but absent from the actual data. The other, non-documented, time-related variables are self-explanatory.

The (compressed) TF_BANKRUPTCIES_subset.txt and Method_BANKRUPTCIES.xlsx are available in the rWSBIM1207 package (version >= 0.1.19). Their paths are available with the faillites_be() function.

You are asked to:

  • Visualise the number of bankruptcies over time in the different regions. Make sure to remove observations that don’t have any region. Note that the dates are recorded in a different format from what we have seen so far, but the usual conversion functions still apply.

  • To improve the readability of these busy plots, compute the number of bankruptcies per year and repeat the above visualisation.

  • We expect that bankruptcies of large companies will lead to more (full-time) unemployment. Check this hypothesis by (1) computing and displaying a table of average job losses by company size, and (2) producing a figure that uses the distribution of job losses (as opposed to only the averages). To facilitate the interpretation, order the companies by their size in both the table and the figure.
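The three data-handling steps mentioned above (converting dates, counting per year, and ordering a categorical variable) can be sketched on a toy table. The column names, the day/month/year date format, the size categories and all values are assumptions made for illustration; the actual bankruptcies file uses its own variable names and date format, so the parsing function has to be chosen accordingly.

```r
library(dplyr)
library(lubridate)

# Toy records (made-up names, date format and values)
toy <- tibble(
    date = c("15/03/2005", "02/11/2005", "20/01/2006",
             "07/07/2006", "30/09/2006"),
    size = c("small", "large", "medium", "small", "large"),
    jobs_lost = c(1, 40, 8, 2, 60))

toy <- toy %>%
    mutate(date = dmy(date),    # day/month/year strings to Date objects
           year = year(date),   # year of each record, for aggregation
           # explicit level order, so tables and plots go small to large
           size = factor(size, levels = c("small", "medium", "large")))

# Number of bankruptcies per year
per_year <- count(toy, year)
per_year

# Average job losses by company size, listed in level order
by_size <- toy %>%
    group_by(size) %>%
    summarise(mean_jobs_lost = mean(jobs_lost))
by_size
```

Because `size` is a factor with explicit levels, a ggplot2 figure of the distributions (for example with geom_boxplot()) would place the categories in the same order as the table.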

► Question

This question was contributed by Mr William Vanloo in August 2024.

We will explore the road accidents that occurred in Belgium in 2023, as available from the Statbel website: the .txt file contains the dataset and the .xlsx file contains the metadata describing the different variables.

  • Using this dataset, visualise the number of accidents occurring in different light conditions by severity of injury in the province of Liège over time. Be sure to remove the ‘not available’ data.

  • Visualise the number of accidents in the different provinces during the month of January according to the different light conditions and only for fatal accidents.

Page built: 2024-08-27 using R version 4.4.1 (2024-06-14)