Data Carpentry - additional R exercises

RNASeq data and annotation

Import the data from two tab-separated files into R. The kem_counts.tsv file contains RNA-Seq expression counts for 13 genes and 18 samples and kem_annot.tsv contains annotation about each sample. Read the data into two tibbles named kem and annot respectively and familiarise yourself with the content of the two new tables.
Convert the counts data into a long table format and annotation each sample using the experimental design.
Identity the three transcript identifiers that have the highest expression count over all samples.
Visualise the distribution of the expression for the three transcripts selected above in cell types A and B under both treatments.
For all genes, calculate the mean intensities in each experimental group (as defined by the cell_type and treatment variables).
Focusing only on the three most expressed transcripts and cell type A, calculate the fold-change induced by the treatment. The fold-change is the ratio between the average expressions in two conditions.

library("tidyverse")

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

annot <- read_tsv("data/kem_annot.tsv")

## Parsed with column specification:
## cols(
##   sample_id = col_character(),
##   jurkat = col_character(),
##   cell_type = col_character(),
##   treatment = col_character()
## )

kem <- read_tsv("data/kem_counts.tsv") %>%
    gather(key = sample_id, value = expression, -ref) %>%
    left_join(annot)

## Parsed with column specification:
## cols(
##   ref = col_character(),
##   `KEM182-01` = col_double(),
##   `KEM182-02` = col_double(),
##   `KEM182-03` = col_double(),
##   `KEM182-04` = col_double(),
##   `KEM182-05` = col_double(),
##   `KEM182-06` = col_double(),
##   `KEM182-07` = col_double(),
##   `KEM182-08` = col_double(),
##   `KEM182-09` = col_double(),
##   `KEM182-10` = col_double(),
##   `KEM182-11` = col_double(),
##   `KEM182-12` = col_double(),
##   `KEM182-13` = col_double(),
##   `KEM182-14` = col_double(),
##   `KEM182-15` = col_double(),
##   `KEM182-16` = col_double()
## )

## Joining, by = "sample_id"

k <- kem %>%
    group_by(ref) %>%
    summarise(tot_exprs = sum(expression)) %>%
    arrange(desc(tot_exprs)) %>%
    select(ref) %>%
    head(3)
kem3 <- right_join(kem, k)

## Joining, by = "ref"

ggplot(kem3, aes(x = treatment, y = expression)) +
    geom_boxplot() +
    geom_jitter() +
    facet_grid(ref ~ cell_type)

kem %>%
    group_by(ref, cell_type, treatment) %>%
    summarise(mean_expression = mean(expression))

## # A tibble: 52 x 4
## # Groups:   ref, cell_type [26]
##    ref             cell_type treatment  mean_expression
##    <chr>           <chr>     <chr>                <dbl>
##  1 ENSG00000055917 A         none                 2074 
##  2 ENSG00000055917 A         stimulated           1743 
##  3 ENSG00000055917 B         none                 2202.
##  4 ENSG00000055917 B         stimulated           2101.
##  5 ENSG00000063245 A         none                 1290 
##  6 ENSG00000063245 A         stimulated           1717.
##  7 ENSG00000063245 B         none                 1214.
##  8 ENSG00000063245 B         stimulated           1636 
##  9 ENSG00000090263 A         none                  425.
## 10 ENSG00000090263 A         stimulated            414 
## # … with 42 more rows

kem3 %>%
    filter(cell_type == "A") %>%
    group_by(ref, cell_type, treatment) %>%
    summarise(mean_expression = mean(expression)) %>%
    spread(key = treatment, value = mean_expression) %>%
    mutate(fold_change = stimulated/none)

## # A tibble: 3 x 5
## # Groups:   ref, cell_type [3]
##   ref             cell_type    none stimulated fold_change
##   <chr>           <chr>       <dbl>      <dbl>       <dbl>
## 1 ENSG00000109971 A          59334.     62578.        1.05
## 2 ENSG00000198712 A          90102.    116475         1.29
## 3 ENSG00000198804 A         118486.    134135         1.13

Quantitative proteomics statistical results

Download and load the prot.csv file into R.
Produce a histogram of p-value. Discuss the distribution of these p-values.
Produce a volcano plot, showing the log2 fold-change along the x axis and -log10 of the adjusted p-value along the y axis. Discuss.
Produce an MA plot, showing the average intensity (over all samples) along the x axis, and the log2 fold-change along the y axis. Discuss.

prot <- read_csv("./data/prot.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   gene = col_character()
## )

## See spec(...) for full column specifications.

ggplot(prot, aes(x = p.value)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(prot, aes(x = LogFC,
                 y = -log10(adjp))) +
             geom_point()

ggplot(prot, aes(x = AvgExpr,
                 y = LogFC)) +
             geom_point()