Chapter 10 Additional programming concepts

Learning Objectives

Learn programming concepts, including

how to handle conditions
iteration of data structures
good coding practice
code re-use through functions

When the size and the complexity of the data increases, or the data science question of interest becomes more complex, the data analysis techniques as we have seen them so far need to be complemented with programming techniques. From a data science point of view, there is no clear delimitation between data analysis and (data) programming, both morphing into each other1919 A fundamental difference however is how data analysis and programming are taught. When it comes to researchers, and biomedical researchers in particular, teaching programming to analyse data isn’t successful. Teaching data analysis to eventually programme with data, however, has proven a successful strategy..

This chapter will introduce some additional programming skills and demonstrate how to use them in the context of high troughput omics data.

10.1 Writing clean code

Writing clean code means writing easily readable code, hence easily understandable code and, eventually code with less bugs. One easy way to achieve this is through consistency, i.e. stick to a style guide. The issue is that there are several style guides available, often with conflicting suggestions. Two widely used ones are the Bioconductor style guide and Hadley Wickhams’s R Style Guide. The advice from for naming variables seen in the first and second chapters are also relevant.

Here are a couple of suggestions:

Use <- to assign variables. Use = is also valid, but make sure that you stick with one.
Use either camel case (camelCaseNaming) of snake case (snake_case_naming), and avoid using dots (don’t use dot.variable.names). These conventions apply to functions, variables and files (for the latter, a - instead of _ is also acceptable).
Always spell out TRUE and FALSE, and resist the temptation to use T and F instead2020 TRUE and FALSE are reserved words; one can’t use them as variable names. This however doesn’t hold for T and F. One can use T and F as regular variable names such, as for example, T <- FALSE and F <- TRUE!.
Use 4 spaces for indenting. No tabs.
No lines longer than 80 characters.
Use spaces around binary operators.
No spaces between a function name and the opening parenthesis.

Another good advise is to avoid re-writing the same code many times. We will see below two strategies to do that, namely iteration and writing new functions. This firstly reduces the amount of code typed and hence the number of bugs, but more importantly enables consistency. If something in your code changes, there’s only one change and it applies everywhere, rather than doing that same change repeatedly (at the first of adding bugs and to miss some updates).

When writing code, keep it simple and short2121 The official motto is KISS, keep your functions simple, stupid, widely used in programming.. Ideally, the code should be evident. But when the questions tackled are not trivial, is becomes essential to add comments to clarify aspects of the script/program. Make sure to use them to describe why something is done, rather than explaining how things are done (which is tyically best done by the code itself).

A general guideline to avoid bugs is to apply defensive programming:

making the code work in a predicable manner
writing code that fails in a well-defined manner
if something weird happens, either properly deal with it, or it fail quickly and loudly

Here are some examples of how to do this:

use functions like is.numeric(x), is.character(x), is.data.frame(x), … but make sure that the variable you are going to use is of the expected type.
Make sure the length or dimensions of what you are going to use are what you expect:

length(x) > 0
nrow(x) > 0
ncol(x) > 0

Failing fast and well! Wrap such test in stopifnot(), so that if they fail, you get immediately an error, rather than risking that you code fails later with obscure error messages or, worse, the code runs to completion but returns meaningless results.

stopifnot(length(x) > 0)
stopifnot(dim(x) > 0) ## same as next one
stopifnot(ncol(x) > 0, nrow(x) > 0)

10.2 Iteration

Iteration describes the situtation when a specific operation has to be repeated many times on different inputs of the same type. For example, if we have a vector of numeric x shown below,

(x <- 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

and we wanted to apply the logarithm operation on each element of x, it wouldn’t be convenient to type

log(1)
log(2)
log(3)
...
log(10)

The concept of iteration allows us to program the following command:

Repeat log for each value of my input x.

or more formally

Repeat log(i) where i takes in turn each value of my input x.

We will see different ways of implementing such an iteration.

Using a `for` loop

for (i in x) {
    print(log(i))
}

## [1] 0
## [1] 0.6931472
## [1] 1.098612
## [1] 1.386294
## [1] 1.609438
## [1] 1.791759
## [1] 1.94591
## [1] 2.079442
## [1] 2.197225
## [1] 2.302585

The loop above only prints the results on screen. They aren’t stored and are lost for any further re-use, which would be very annoying if it took much more time to perform all the calculations. In the code chunk below, we are going to first inititalise a vector with the appropriate number of NA values and, at each iteration, we then store the result. We however now need to change our loop and iterate of the indices of the input vector, so that we can re-use these indices to save the results in the output vector.

res_loop <- rep(NA, length(x))
for (i in seq_along(x)) {
    res_loop[i] <- log(x[i])
}
res_loop

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

Using the `apply` function

The apply family of functions implements our defintion of iteration quite literally

Repeat log for each value of my input x

is reformulated as

For each value of my input x, apply log

and coded as

res_apply <- sapply(x, log)
res_apply

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

There are three such function that apply a function iteratively:

sapply iterates over a vector and returns a new vector of the same length as the input vector2222 This is a simplification of how sapply works, that is partly defined by the the simplify argument and whether the result of applying the function on each element of the input can be returned as a vector. The alternative is to return a list, like lapply..
lapply iterates over a linear input (a vector of a list) and returns a list of the same length as the input.
apply iterates of the rows or the columns of a data.frame or matrix and returns a list or vector or approritate length (number of rows or columns). The dimension over which the iterations proceeds is defined by the second argument, where 1 defines rows and 2 defined columns2323 This generalises to arrays with more than 2 dimensions..

When performing the same operations in different ways (using different implementations), it is essential to verify that the results are identical or, if not, at least compatible. The most direct way to do the former is to use the identical function:

identical(res_loop, res_apply)

## [1] TRUE

The purrr package a set of map functions similar to the apply set described above.

Vectorisation

We must not forget the obvious, which is vectorisation. Many R functions work by default iteratively on every element of a vectors, i.e. they work irrespectively whether the vector is of length 12424 A vector of length 1 would be called a scalar in other programming languages. or longer.

res_vec <- log(x)
res_vec

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

And, as before, we check that we obtain identical results:

identical(res_loop, res_vec)

## [1] TRUE

Which iteration to use?

Even though they produce the same result, the iteration strategies above aren’t equal, and some should be preferred in different situations.

When a vectorised solution exists, this is the one that should be chosen. It is by far the fastest solution, but only applicable to existing vectorised functions. If you were to write your own function to iterate over, it is advisable to write a vectorised function.
The apply functions are extremely convenient and concise, and hence widely used. They have a couple of advantages, including that there is no need to explicitly initialise a result variable and that they can easily be parallelised.
For loops are the most generic solution for iteration, and they require to initialise the result variable that will be populated during the loop. As opposed to popular belief, they aren’t slower than using apply functions. They are however the best solution if, during the iteration, one has the access another element than the one currently processed (typically, if i is the counter, accessing i + 1 or i - 1).

► Question

Use the expressions.csv() function from the rWSBIM1207 package to get the path to 5 csv files containing gene expression data for a single gene and hundreds of patience. Read all data in and combine them into a single dataframe.

Before starting, open at least two of these files to familiarise yourself with their structure and identify how and with what function they need to be combined.

► Solution

Let’s start by loading the required packages and check two file. We will also use the tidyverse to work with the data.

library("tidyverse")
library("rWSBIM1207")
fls <- expressions.csv()
read_csv(fls[1])

## Rows: 570 Columns: 4
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sampleID, patient, type
## dbl (1): A1BG
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 570 × 4
##    sampleID         patient      type   A1BG
##    <chr>            <chr>        <chr> <dbl>
##  1 TCGA-05-4244-01A TCGA-05-4244 tumor  26.0
##  2 TCGA-05-4249-01A TCGA-05-4249 tumor 120. 
##  3 TCGA-05-4250-01A TCGA-05-4250 tumor  50.9
##  4 TCGA-05-4382-01A TCGA-05-4382 tumor 146. 
##  5 TCGA-05-4384-01A TCGA-05-4384 tumor 127. 
##  6 TCGA-05-4389-01A TCGA-05-4389 tumor  67.1
##  7 TCGA-05-4390-01A TCGA-05-4390 tumor 165. 
##  8 TCGA-05-4395-01A TCGA-05-4395 tumor  22.0
##  9 TCGA-05-4396-01A TCGA-05-4396 tumor  17.4
## 10 TCGA-05-4397-01A TCGA-05-4397 tumor 127. 
## # ℹ 560 more rows

read_csv(fls[2])

## Rows: 570 Columns: 4
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sampleID, patient, type
## dbl (1): A1CF
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 570 × 4
##    sampleID         patient      type    A1CF
##    <chr>            <chr>        <chr>  <dbl>
##  1 TCGA-05-4244-01A TCGA-05-4244 tumor  0    
##  2 TCGA-05-4249-01A TCGA-05-4249 tumor  0.322
##  3 TCGA-05-4250-01A TCGA-05-4250 tumor  0    
##  4 TCGA-05-4382-01A TCGA-05-4382 tumor  0    
##  5 TCGA-05-4384-01A TCGA-05-4384 tumor  0    
##  6 TCGA-05-4389-01A TCGA-05-4389 tumor 36.2  
##  7 TCGA-05-4390-01A TCGA-05-4390 tumor  0    
##  8 TCGA-05-4395-01A TCGA-05-4395 tumor  0    
##  9 TCGA-05-4396-01A TCGA-05-4396 tumor 15.9  
## 10 TCGA-05-4397-01A TCGA-05-4397 tumor  0    
## # ℹ 560 more rows

We see that the data will have to be combined column-wise. We could do it manually, adding news columns one by on to the first data, but this would require to check that the observations are in the same order. Instead, we will use a full_join, as seen in chapter 7.

## read the first one
res <- read_csv(fls[1])
## iterate over the other data
for (i in 2:length(fls)) {
    data_i <- read_csv(fls[i], progress = FALSE)
    res <- full_join(res, data_i)
}
res

## # A tibble: 570 × 8
##    sampleID         patient      type   A1BG   A1CF A2BP1 A2LD1   A2ML1
##    <chr>            <chr>        <chr> <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 TCGA-05-4244-01A TCGA-05-4244 tumor  26.0  0      1.75 136.    0.349
##  2 TCGA-05-4249-01A TCGA-05-4249 tumor 120.   0.322  1.61  89.1   1.61 
##  3 TCGA-05-4250-01A TCGA-05-4250 tumor  50.9  0      0    151.    0    
##  4 TCGA-05-4382-01A TCGA-05-4382 tumor 146.   0      0    112.    4.79 
##  5 TCGA-05-4384-01A TCGA-05-4384 tumor 127.   0      0     87.6   0    
##  6 TCGA-05-4389-01A TCGA-05-4389 tumor  67.1 36.2    0    112.   36.6  
##  7 TCGA-05-4390-01A TCGA-05-4390 tumor 165.   0     97.7   64.9   0.639
##  8 TCGA-05-4395-01A TCGA-05-4395 tumor  22.0  0      0    181.  391.   
##  9 TCGA-05-4396-01A TCGA-05-4396 tumor  17.4 15.9    0    134.    0.758
## 10 TCGA-05-4397-01A TCGA-05-4397 tumor 127.   0      0    229.   19.2  
## # ℹ 560 more rows

10.3 Conditionals

When an operation has to be executed when a condition is met, one typically uses a if and else construct:

if (CONDITION) {
   ## DO SOMETHING
} else {
  ## DO SOMETHING ELSE
}

For examples

(x <- rnorm(1))

## [1] -0.6264538

if (x > 0) {
   print(log(x))
} else {
   print(log(-x))
}

## [1] -0.4676802

But note that in the example above, it would be much better to simplify the code and use the absolute value of x before taking the log, which will generalise the calculation for positive and negative values …

log(abs(x))

## [1] -0.4676802

… and works for vectors of any length thanks to vectorisation.

x <- rnorm(10)
log(abs(x))

##  [1] -1.6947599 -0.1795710  0.4670498 -1.1101553 -0.1978799 -0.7186105
##  [7] -0.3033716 -0.5520273 -1.1861709  0.4132885

There are also in-line, vectorised versions of the if/else statements seen above: the ifelse function that ships with the base package, and if_else from dplyr. Both work similarly (the differences are beyond the scope of this course) and have the form:

if_else(condition, true, false)

where condition is the condition to be tested, and will return either TRUE or FALSE;
true is the expression that is executed if the condition is TRUE;
false is the expression that is executed otherwise (i.e if the condition is FALSE).

Here is an example, that will return 1 if x is strictly positive, and 0 otherwise:

x <- 0.5
if_else(x > 0, 1, 0)

## [1] 1

The function being vectorised, the condition can be a vector of length greater than 1:

x <- rnorm(10)
x

##  [1]  0.38984324 -0.62124058 -2.21469989  1.12493092 -0.04493361 -0.01619026
##  [7]  0.94383621  0.82122120  0.59390132  0.91897737

if_else(x > 0, 1, 0)

##  [1] 1 0 0 1 0 0 1 1 1 1

The vectorised conditional functions can directly be used in a standard data analysis pipeline:

x <- tibble(x = letters[1:5],
            y = rnorm(5))
x

## # A tibble: 5 × 2
##   x          y
##   <chr>  <dbl>
## 1 a     -0.626
## 2 b      0.184
## 3 c     -0.836
## 4 d      1.60 
## 5 e      0.330

mutate(x, z = if_else(y > 0, 1, 0))

## # A tibble: 5 × 3
##   x          y     z
##   <chr>  <dbl> <dbl>
## 1 a     -0.626     0
## 2 b      0.184     1
## 3 c     -0.836     0
## 4 d      1.60      1
## 5 e      0.330     1

10.4 Writing new functions

A function is composed of a name, inputs (inside the parenthesis), a body (between curly brackets) and an ouput (last statement or variably inside the return statement).

my_fun <- function(x, y) {
    message("First input: ", x)
    message("Second input: ", y)
    z <- x * abs(y)
    return(z)
}
my_fun(2, -5)

## First input: 2

## Second input: -5

## [1] 10

► Question

Complete the following function. It is supposed to take two inputs, x and y and, depending whether the x > y or x <= y, it generates the permutation sample(x, y) in the first case or draws a sample from rnorm(1000, x, y) in the second case. Finally, it returns the sum of all values.

fun <- function(x, y) {
    res <- NA
    if (   ) {
        res <- sample(, )
    } else {
        res <- rnorm(, , )
    }
    return()
}

To check your answer, run it with inputs 5, 15 and 15, 5, after setting the random number generator to 1 each time with set.seed(1) and you should get 4825.2778709 and 23.

► Solution

fun

## function (x, y) 
## {
##     res <- NA
##     if (x > y) {
##         res <- sample(x, y)
##     }
##     else {
##         res <- rnorm(1000, x, y)
##     }
##     return(sum(res))
## }
## <bytecode: 0x5bf3b5061068>

set.seed(1L)
fun(5, 15)

## [1] 4825.278

set.seed(1L)
fun(15, 5)

## [1] 23

10.5 Pattern matching

One skill that comes handy more often than not is the ability to find patterns in text and replace these. We are going to see two functions out of many to perform such tasks. As an illustration, we are going to mine a vector or peptides identified by mass spectrometry. The data can be loaded from the rWSBIM1207 data:

library("rWSBIM1207")
data(peptides)
length(peptides)

## [1] 26796

head(peptides)

## [1] "aAAGEAk" "VTEEAk"  "aSGDPGk" "sAAAEAk" "aANAEGk" "aAAEEAR"

Among the 26796 petides, which ones contain the pattern "AAGE"?

pattern <- "AAGE"
## indices of matching peptides
grep(pattern, peptides)

## [1]     1  7026 17140 21912 23423 24947 25257 26658

## matching peptides
grep(pattern, peptides, value = TRUE)

## [1] "aAAGEAk"              "aFHGAAGEAGAAFGNHGANR" "eAIDAAGEGGR"         
## [4] "sAAGEFADDPcSSVkR"     "eAAGEGPVLYEDPPDQk"    "eIEAAGEk"            
## [7] "dVQELYAAGENR"         "sAAGEFADDPcSSVk"

## position of matching peptides
head(grepl(pattern, peptides))

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

Replace a pattern by another string.

head(peptides)

## [1] "aAAGEAk" "VTEEAk"  "aSGDPGk" "sAAAEAk" "aANAEGk" "aAAEEAR"

head(sub("AE", "XX", peptides))

## [1] "aAAGEAk" "VTEEAk"  "aSGDPGk" "sAAXXAk" "aANXXGk" "aAXXEAR"

But careful that sub only replaced the first occurrence in a string!

head(peptides)

## [1] "aAAGEAk" "VTEEAk"  "aSGDPGk" "sAAAEAk" "aANAEGk" "aAAEEAR"

head(sub("A", "X", peptides))

## [1] "aXAGEAk" "VTEEXk"  "aSGDPGk" "sXAAEAk" "aXNAEGk" "aXAEEAR"

head(gsub("A", "X", peptides))

## [1] "aXXGEXk" "VTEEXk"  "aSGDPGk" "sXXXEXk" "aXNXEGk" "aXXEEXR"

10.6 Analysing data from multiple files

This exercise recapitulates the most important material that we have seen in this course. We are going to analyse the student tests A and B results using functionality from the dplyr package (seen in chapter 5). The student test files are available in the rWSBIM1207 package (the interroA.csv and interroB.csv functions return their respective paths). We want to compare how the male and female students in groups A and B have performed. To do this, we want to calculate the mean scores and visualise the score distributions in each groups.

► Question

Before writing any code to answer the questions above, take a couple of minutes to think about what packages are needed to answer the questions above and identify the individual steps needed to first prepare the data needed to find the answers and then to produce the answers.

Start by loading the rWSBIM1207 and tidyverse package, to get all the functionality we will need.

► Solution

Using the interroA.csv() and interroB.csv() functions from the rWSBIM1207 package, create a vector containing the two csv file names.

► Solution

fls <- c(interroA.csv(),
         interroB.csv())
fls

## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/Bioc3.20/rWSBIM1207/extdata/interroA.csv"
## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/Bioc3.20/rWSBIM1207/extdata/interroB.csv"

Iterate over the files and read each into an element of a list. Then, bind the two elements of the list into a single dataframe containing all results.

► Solution

l <- as.list(fls)
for (i in seq_along(fls))
    l[[i]] <- read_csv(fls[i])

## Rows: 100 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, gender
## dbl (6): height, X, interro1, interro2, interro3, interro4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 105 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, gender
## dbl (6): height, X, interro1, interro2, interro3, interro4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

interros <- rbind(l[[1]], l[[2]])
interros

## # A tibble: 205 × 8
##    id     height gender      X interro1 interro2 interro3 interro4
##    <chr>   <dbl> <chr>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 A74890    168 M       1.43        16       18        7       10
##  2 A85494    167 M       1.05        15       18       13       NA
##  3 A51820    166 M       0.435        4       10       NA        7
##  4 A98669    164 M       0.715       15       15       18       13
##  5 A75521    171 M       0.917       18       10       17       NA
##  6 A96704    178 F      -2.66        11       20       14       17
##  7 A23214    155 M       1.11        12        2        8       14
##  8 A31124    177 M      -0.485       19        4        8       20
##  9 A80471    187 F       0.231       19       16       16        8
## 10 A21783    195 F      -0.295       13       11        8       20
## # ℹ 195 more rows

dim(interros)

## [1] 205   8

l <- lapply(fls, read_csv)

## Rows: 100 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, gender
## dbl (6): height, X, interro1, interro2, interro3, interro4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 105 Columns: 8
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, gender
## dbl (6): height, X, interro1, interro2, interro3, interro4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

interros2 <- rbind(l[[1]], l[[2]])
identical(interros, interros2)

## [1] FALSE

What are the mean scores for male and female students in each test (interro1 to interro4).

► Solution

interros %>%
    group_by(gender) %>%
    summarise(m1 = mean(interro1, na.rm = TRUE),
              m2 = mean(interro2, na.rm = TRUE),
              m3 = mean(interro3, na.rm = TRUE),
              m4 = mean(interro4, na.rm = TRUE))

## # A tibble: 2 × 5
##   gender    m1    m2    m3    m4
##   <chr>  <dbl> <dbl> <dbl> <dbl>
## 1 F       10.2  11.2  13.8  13.9
## 2 M       10.2  10.8  12.8  13.7

What are the mean scores for male and female students in each test (interro1 to interro4) and in each group (A and B).

► Solution

To answer this question, we need to be able to tell the two groups apart. This information is however not available in the full dataset. We could analyse the two test individually to answer this particular question, but a preferred approach is the add this information, as it will be needed later anyway.

One way would be to check the number of students that tool the two tests, and add a new column based on these values. We still have the two dataframes in the l list, which allows us to calculate the respective number of rows and use this information to create a group variables.

sapply(l, nrow)

## [1] 100 105

interros$group <- c(rep("A", nrow(l[[1]])),
                    rep("B", nrow(l[[2]])))
interros

## # A tibble: 205 × 9
##    id     height gender      X interro1 interro2 interro3 interro4 group
##    <chr>   <dbl> <chr>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr>
##  1 A74890    168 M       1.43        16       18        7       10 A    
##  2 A85494    167 M       1.05        15       18       13       NA A    
##  3 A51820    166 M       0.435        4       10       NA        7 A    
##  4 A98669    164 M       0.715       15       15       18       13 A    
##  5 A75521    171 M       0.917       18       10       17       NA A    
##  6 A96704    178 F      -2.66        11       20       14       17 A    
##  7 A23214    155 M       1.11        12        2        8       14 A    
##  8 A31124    177 M      -0.485       19        4        8       20 A    
##  9 A80471    187 F       0.231       19       16       16        8 A    
## 10 A21783    195 F      -0.295       13       11        8       20 A    
## # ℹ 195 more rows

The assumption here is that the order of the observations in the large dataframe hasn’t changed!

An alternative is to add the group information before combining the dataframes.

l[[1]]$group <- "A"
l[[2]]$group <- "B"
interros2 <- rbind(l[[1]], l[[2]])
identical(interros, interros2)

## [1] FALSE

Now we can group by gender and group to perform our summary calculations:

interros %>%
    group_by(gender, group) %>%
    summarise(m1 = mean(interro1, na.rm = TRUE),
              m2 = mean(interro2, na.rm = TRUE),
              m3 = mean(interro3, na.rm = TRUE),
              m4 = mean(interro4, na.rm = TRUE))

## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 6
## # Groups:   gender [2]
##   gender group    m1    m2    m3    m4
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl>
## 1 F      A      9.6   10.5  13.2  13.8
## 2 F      B     10.9   12.0  14.4  14.0
## 3 M      A     10.4   10.6  12.4  14.9
## 4 M      B      9.98  10.9  13.1  12.6

Instead of comparing the means for the four tests in each group and gender, visualise their distributions using boxplots. In addition to the boxplots, overlay the actual score values (using the jitter geom) and colour the points based on the students height.

► Solution

The produce this visualisation, we first need to convert the data we have in a long format using pivot_longer(). Below, we create two new variables: test, that contains interro1, …, interro4, and score, that contains the students marks the actual test.

We obtain a new tibble that contains 4 times more rows that initially, as the tests, that were in 4 different columns, have now been gathered as a single new variable.

interros2 <- pivot_longer(interros,
                          names_to = "test",
                          values_to = "score",
                          starts_with("interro"))
interros2

## # A tibble: 820 × 7
##    id     height gender     X group test     score
##    <chr>   <dbl> <chr>  <dbl> <chr> <chr>    <dbl>
##  1 A74890    168 M      1.43  A     interro1    16
##  2 A74890    168 M      1.43  A     interro2    18
##  3 A74890    168 M      1.43  A     interro3     7
##  4 A74890    168 M      1.43  A     interro4    10
##  5 A85494    167 M      1.05  A     interro1    15
##  6 A85494    167 M      1.05  A     interro2    18
##  7 A85494    167 M      1.05  A     interro3    13
##  8 A85494    167 M      1.05  A     interro4    NA
##  9 A51820    166 M      0.435 A     interro1     4
## 10 A51820    166 M      0.435 A     interro2    10
## # ℹ 810 more rows

We can now map the new test and score variables to the x and y axes and construct the visualisation.

ggplot(interros2, aes(y = score, x = test)) +
    geom_boxplot() +
    geom_jitter(aes(colour = height)) +
    facet_grid(gender ~ group)

## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 36 rows containing missing values or values outside the scale
## range (`geom_point()`).

10.7 Additional exercises

► Question

Complete the following function that converts meters to yards, knowing that that 1 yard corresponds to 0.9144 meters

meters_to_yards <- function(argument goes here) {
  yards <- _code goes here!_
  return(yards)
}

► Question

Write functions that convert temperatures:

from Fahrenheit to Kelvin, where \(T_K = (T_F - 32) \times 5/9 + 273.15\)
fom Celsius to Kelvin, where \(T_K = T_C + 273.15\)
from Kelvin to Celsis
from Fahrenheit to Celsius

Page built: 2025-04-03 using R version 4.4.1 (2024-06-14)

Chapter 10 Additional programming concepts

10.1 Writing clean codeCopy link

10.2 IterationCopy link

Using a for loopCopy link

Using the apply functionCopy link

VectorisationCopy link

Which iteration to use?Copy link

10.3 ConditionalsCopy link

10.4 Writing new functionsCopy link

10.5 Pattern matchingCopy link

10.6 Analysing data from multiple filesCopy link

10.7 Additional exercisesCopy link

10.1 Writing clean code

10.2 Iteration

Using a `for` loop

Using the `apply` function

Vectorisation

Which iteration to use?

10.3 Conditionals

10.4 Writing new functions

10.5 Pattern matching

10.6 Analysing data from multiple files

10.7 Additional exercises