In this course, we have seen the importance of structured data. Good data structure starts with simple, tidy tabular data, whether it is manually encoded in a spreadsheet or handled in R as dataframes or tibbles. More complex data that doesn’t fit in a table can be modelled with dedicated objects that display specialised behaviour. Structured data allows us to reason about that data without having to look at it. Reasoning about and generalising data in turn allows us to manipulate and visualise it, i.e. to explore, analyse and understand it. The cherry on top of the data analysis cake is to be able to reproduce an analysis, either by repeating it oneself or by sharing it in a way that lets others do so.
As mentioned in the preamble, the goal of this course is not for students to qualify as bioinformaticians by the end. What matters is to appreciate the importance of data and their analysis, and to become fluent in exploring, discussing and communicating around data. A shared appreciation of data and their complexity will hopefully reduce the distinction between bioinformaticians and experimental scientists. Indeed, at the end of the day, it’s useful to remember that
We are all biologists, in that we study biology. Some use wet lab experiments, others dry lab techniques.
To answer the following exercises, you’ll need to resort to what you have learnt in various chapters.
► Question
Make sure you have rWSBIM1207 version >= 0.1.16 and load the 2022 Belgian road accidents statistics and the associated metadata, describing the variables. The path to the former as an rds file is available with road_accidents_be_2022.rds(). The road_accidents_be_meta.csv() function returns the path to the metadata csv file.
The data provide the number of killed, seriously injured, slightly injured and uninjured victims of road accidents, by age group, type of user, sex and various characteristics of the accident in Belgium in 2022.
Using the appropriate functions, load both files into R and familiarise yourself with the data.
Visualise the numbers for men and women over the hours of the day for all age classes. Ignore any unknown information. Do you see a difference between men and women?
Visualise the number of victims in the different provinces. Do this comparison for the different types of victims. Ignore any unknown information. Use lines and points for this visualisation.
Come up with additional visualisations that you could produce with these data. Use bar plots for this visualisation.
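As a starting point for the loading step, here is a minimal sketch that uses base R only. The toy data frames, their column names and their values are hypothetical stand-ins for the real files; with the package installed, the two paths would come from road_accidents_be_2022.rds() and road_accidents_be_meta.csv() instead of the temporary files created here.

```r
# Toy stand-ins for the accidents data and its metadata
# (hypothetical columns and values, for illustration only).
accidents <- data.frame(
  hour = c(8, 8, 17, 17),
  sex = c("Female", "Male", "Female", "Unknown"),
  victims = c(3, 5, 4, 1)
)
meta <- data.frame(
  variable = c("hour", "sex", "victims"),
  description = c("Hour of the day", "Sex of the victim", "Number of victims")
)

# Write them to temporary files so that the example is self-contained.
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(accidents, rds_path)
write.csv(meta, csv_path, row.names = FALSE)

# An rds file is read with readRDS(), a csv file with read.csv().
acc <- readRDS(rds_path)
acc_meta <- read.csv(csv_path)

# Familiarise yourself with the data and the variable descriptions.
str(acc)
acc_meta

# Drop records with unknown information before visualising.
acc_known <- acc[acc$sex != "Unknown", ]
```

The readr function read_csv() could be used in place of read.csv() and would return a tibble rather than a data frame.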
► Question
For this exercise, make sure you have rWSBIM1207 version >= 0.1.17. Using the population_be.csv() function, get the path to 35 files with the population numbers across multiple regions of Belgium up to 2023.
Tip: given that all files contain equivalent data (i.e. for the same variables), you can use read_csv() to load all files at once into a long table.
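The tip can be illustrated with a minimal sketch, assuming readr version 2.0.0 or later, where read_csv() accepts a vector of file paths. The two toy files, their names and their columns are hypothetical; for the exercise, the vector of paths would presumably be the value returned by population_be.csv().

```r
library(readr)

# Two toy files with identical columns (hypothetical names and values).
tmp <- tempfile()
dir.create(tmp)
f1 <- file.path(tmp, "region_a.csv")
f2 <- file.path(tmp, "region_b.csv")
write_csv(data.frame(year = c(2022, 2023), population = c(100, 110)), f1)
write_csv(data.frame(year = c(2022, 2023), population = c(200, 190)), f2)

# A vector of paths is read and row-bound into one long table.
# The id argument adds a column recording the file each row came from.
pop <- read_csv(c(f1, f2), id = "file", show_col_types = FALSE)
pop
```

Keeping the file of origin in its own column is useful when the file name is the only place where a piece of information, such as the year or the region, is stored.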
How many regions have been surveyed over the years?
What differences are there in terms of region?
Focusing on Belgium and the Brussels, Walloon and Flemish regions, and from 1991 on, generate one figure (possibly with multiple facets) that allows you to answer the following questions:
Has the population increased since 1991?
Are there more women or men living in these regions? Has this changed since 1991?
Have the changes in population been driven by men, women or both equally?
Which region has the largest (smallest) population?
► Question
Here, we are going to examine the evolution of bankruptcies in different regions of Belgium between 2005 and 2023. The original data are available on the webpage of statbel, the Belgian office for statistics.
Two files are needed:
The full dataset, TF_BANKRUPTCIES.zip, contains over 175000 records and is relatively large. Instead, you can use TF_BANKRUPTCIES_subset.txt, which contains just over 10000 records and is made available through the rWSBIM1207 package (see below).
The Method_BANKRUPTCIES.xlsx file describes the variables in the bankruptcies dataset. It can be either downloaded from the statbel page, or read from the rWSBIM1207 package (see below). Note that you are not asked to load the xlsx file in R (although you can with readxl::read_xlsx()); feel free to open it with your favourite spreadsheet programme.
You will see that there are some minor discrepancies between the metadata file and the data variables, such as the CD_WEEK variable, which is documented in the metadata file but absent from the actual data. The other, non-documented, time-related variables are self-explanatory.
The (compressed) TF_BANKRUPTCIES_subset.txt and Method_BANKRUPTCIES.xlsx are available in the rWSBIM1207 package (version >= 0.1.19). Their paths are available with the faillites_be() function.
You are asked to:
Visualise the number of bankruptcies over time in the different regions. Make sure to remove observations that don’t have any region. Note that the dates are recorded in a different format from what we have seen so far, but the usual conversion functions still apply.
To improve the readability of these busy plots, compute the number of bankruptcies per year and repeat the above visualisation.
We expect that bankruptcies of large companies will lead to more (full-time) unemployment. Check this hypothesis by (1) computing and displaying a table of average job losses by company size, and (2) producing a figure that uses the distribution of job losses (as opposed to only the averages). To facilitate the interpretation, order the companies by their size, in the table and figure.
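Two of the steps above, converting dates stored as text and ordering categories by size, can be sketched in base R. Everything in the toy data below is an assumption made for illustration: the column names, the day/month/year date format and the size labels are hypothetical and will differ from those in the bankruptcies file.

```r
# Toy data (hypothetical column names, date format and size labels).
bk <- data.frame(
  date = c("15/01/2005", "03/03/2005", "21/07/2006", "09/11/2006"),
  size = c("Large", "Small", "Medium", "Small"),
  jobs_lost = c(120, 2, 15, 4)
)

# Convert the text dates by describing their layout with a format string.
bk$date <- as.Date(bk$date, format = "%d/%m/%Y")

# Extract the year and count the records per year.
bk$year <- as.integer(format(bk$date, "%Y"))
per_year <- table(bk$year)
per_year

# Set the factor levels explicitly so that tables and plots follow
# company size rather than alphabetical order.
bk$size <- factor(bk$size, levels = c("Small", "Medium", "Large"))

# Average job losses by company size, reported in the order of the levels.
avg_loss <- aggregate(jobs_lost ~ size, data = bk, FUN = mean)
avg_loss
```

The lubridate parsers and dplyr's group_by() and summarise() would be the tidyverse counterparts of as.Date() and aggregate() here.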
► Question
This question was contributed by Mr William Vanloo in August 2024.
We will explore the road accidents that occurred in Belgium in 2023, as available from the Statbel website: the .txt file contains the dataset and the .xlsx file contains the metadata describing the different variables.
Using this dataset, visualise the number of accidents occurring in different light conditions by severity of injury in the province of Liège over time. Be sure to remove the ‘not available’ data.
Visualise the number of accidents in the different provinces during the month of January according to the different light conditions and only for fatal accidents.
Page built: 2024-08-27 using R version 4.4.1 (2024-06-14)