Contents

1 What is a SummarizedExperiment?

Alt SummarizedExperiment

Alt SummarizedExperiment

2 Working with an existing SummarizedExperiment object

library(SummarizedExperiment)

The airway experiment data package summarizes an RNA-seq experiment investigating human smooth-muscle airway cell lines treated with dexamethasone. Load the library and data set.

library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

2.1 Three main parts of a SummarizedExperiment

Exercise What are the dim() and dimnames() of the airway object? What do the rows and columns correspond to in terms of the original experiment? Hint: use dplyr::glimpse() to look at the first few elements of each dimnames.

Exercise Use the functions assay(), colData(), rowRanges() to extract the three major components of the airway data set. assay() is large, so interogate it using class(), dim() and head() rather than simply printing it to the screen. What do each of these parts corresspond to in terms of the RNASeq experiment summarized in this object?

2.2 The matrix-like behavior of SummarizedExperiment

As a quick refresher, an R matrix

set.seed(123)
m <- matrix(
    rnorm(12), nrow = 4, ncol = 3,
    dimnames = list(letters[1:4], LETTERS[1:3])
)
m
##             A          B          C
## a -0.56047565  0.1292877 -0.6868529
## b -0.23017749  1.7150650 -0.4456620
## c  1.55870831  0.4609162  1.2240818
## d  0.07050839 -1.2650612  0.3598138

has three main components to it’s interface, dim(), dimnames(), and two-dimensional [-subsetting.

dim(m)
## [1] 4 3
dimnames(m)
## [[1]]
## [1] "a" "b" "c" "d"
## 
## [[2]]
## [1] "A" "B" "C"
m[1:2, 2:1]
##           B          A
## a 0.1292877 -0.5604756
## b 1.7150650 -0.2301775
m[1:2,]
##            A         B          C
## a -0.5604756 0.1292877 -0.6868529
## b -0.2301775 1.7150650 -0.4456620

A tricky, and unfortunate, behavior is that subsetting a single row or column of a matrix results, by default in the matrix being coerced to a vector

m[1,]
##          A          B          C 
## -0.5604756  0.1292877 -0.6868529
m[,1]
##           a           b           c           d 
## -0.56047565 -0.23017749  1.55870831  0.07050839

Avoid this behavior using the drop = FALSE argument.

m[ 1, , drop = FALSE]
##            A         B          C
## a -0.5604756 0.1292877 -0.6868529

SummarizedExperiment implements the ‘matrix-like’ interface. This means that it supports the functions dim(), dimnames(), and two-dimensional [-subsetting.

Exercise Subset airway to contain the first 5 rows and first 4 columns. Check out the colData(), rowRanges(), and assay() of the resulting object.

2.3 Basic summaries of assay() values

Returning to our simple matrix

m
##             A          B          C
## a -0.56047565  0.1292877 -0.6868529
## b -0.23017749  1.7150650 -0.4456620
## c  1.55870831  0.4609162  1.2240818
## d  0.07050839 -1.2650612  0.3598138

It is straight-forward to transform a matrix, e.g., by adding 1

m + 1
##           A          B         C
## a 0.4395244  1.1292877 0.3131471
## b 0.7698225  2.7150650 0.5543380
## c 2.5587083  1.4609162 2.2240818
## d 1.0705084 -0.2650612 1.3598138

or applying a transformation

abs(m + 1)
##           A         B         C
## a 0.4395244 1.1292877 0.3131471
## b 0.7698225 2.7150650 0.5543380
## c 2.5587083 1.4609162 2.2240818
## d 1.0705084 0.2650612 1.3598138

There are special functions for row- and column-wise operations that are often performed on matrices (check out the matrixStats package for further mathematical operations), e.g.,

colSums(m)
##         A         B         C 
## 0.8385636 1.0402077 0.4513808
rowMeans(m)
##          a          b          c          d 
## -0.3726803  0.3464085  1.0812354 -0.2782463

Exercise Extract the assay() component of airway and calculate the column sums. What do these numbers represent?

Exercise Transform the assay() matrix by adding 1 and taking the natural log(), then create a vector representing the average of each row of this transformed matrix. What does this vector represent? Visualize it in a histogram, density plot, or other representation.

2.4 Subsetting SummarizedExperiment

We mentioned that the summarized experiment implements the matrix-like two-dimensional [ subsetting interface.

The colData() of airway contains information about experimental design, for instance the cell line and dexamethasone treatment level of each sample.

colData(airway)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677

There’s a short-cut to accessing each column, e.g.,

airway$dex
## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: trt untrt

Exercies Use this shortcut to quickly subset airway so that it has only the untrt samples present. Verify that the assay() data have also been subset correctly.

When working with a matrix, common scenario is to manipulate groups of rows based on some criterion, e.g., if one wished to keep rows with mean above 0, one might

ridx <- rowMeans(m) > 0
m[ridx, , drop = FALSE]
##            A         B         C
## b -0.2301775 1.7150650 -0.445662
## c  1.5587083 0.4609162  1.224082

Exercise Remove all the rows from airway that had no gene expression across all samples. How many rows had no expression? Re-calculate the histogram of log-transoformed average expression with these rows excluded.

2.5 The list-like interface of GRangesList

An R list() is an object that contain different types of vectors, including other lists. Here’s a simple example

l <- list(a = 1:5, b = month.abb)

The interface to a list allows for length(), [ (to return a subset of the original list) and [[ (to extract a single element of the list, either by name or by position). The elements of the list can, but do not have to be, named.

names(l)
## [1] "a" "b"
length(l)
## [1] 2
l[c(2, 1)]
## $b
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
## 
## $a
## [1] 1 2 3 4 5
l[2]        # list of length 1, containing element 2 of original list
## $b
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
l[[2]]      # element 2 of original list
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"

One useful function is lengths(), which returns a vector of the length of each element in the lsit

lengths(l)
##  a  b 
##  5 12

The [IRanges], [S4Vectors], and GenomicRanges packages implement several classes that implement this list-like interface; the classes all end with name *List, e.g., GRangesList.

Exercise Extract the row ranges from airway, use class() to discover the type of object. What does each element of the list represent? What does the entire list represent?

r <- rowRanges(airway)

Exercise Prove to yourself that the object implements a list-like interface, supporting length(), [ and [[ subsetting.

Exercise Use lengths() to determine the number of exons in each gene. How many genes are there with a single exon? What gene has the largest number of exons?

2.6 Provenance in GRangesList

GRanges / GRangesList objects contain provenance information about the genome to which the ranges refer, accessible using seqinfo().

Exercise Use seqinfo(r) to extract the information about the sequences the genomic ranges are based on.

Exercise Note that the genome is always NA. Take a quick look at the vignette browseVignettes("airway"), scaning down to the ‘Aligning reads’ section. We’re told that the reference genome used for alignments is “GRCh37”. Update the row range to contain this information

genome(r) <- "GRCh37"
seqinfo(r)
## Seqinfo object with 722 sequences (1 circular) from GRCh37 genome:
##   seqnames seqlengths isCircular genome
##   1         249250621      FALSE GRCh37
##   2         243199373      FALSE GRCh37
##   3         198022430      FALSE GRCh37
##   4         191154276      FALSE GRCh37
##   5         180915260      FALSE GRCh37
##   ...             ...        ...    ...
##   LRG_94        12428      FALSE GRCh37
##   LRG_96        93210      FALSE GRCh37
##   LRG_97        25996      FALSE GRCh37
##   LRG_98        18750      FALSE GRCh37
##   LRG_99        13294      FALSE GRCh37

Update the rowRanges() of airway with this more complete information.

rowRanges(airway) <- r

Note that SummarizedExperiment has convenience functions that allow direct access to provenance

seqinfo(airway)
## Seqinfo object with 722 sequences (1 circular) from GRCh37 genome:
##   seqnames seqlengths isCircular genome
##   1         249250621      FALSE GRCh37
##   2         243199373      FALSE GRCh37
##   3         198022430      FALSE GRCh37
##   4         191154276      FALSE GRCh37
##   5         180915260      FALSE GRCh37
##   ...             ...        ...    ...
##   LRG_94        12428      FALSE GRCh37
##   LRG_96        93210      FALSE GRCh37
##   LRG_97        25996      FALSE GRCh37
##   LRG_98        18750      FALSE GRCh37
##   LRG_99        13294      FALSE GRCh37

Exercise Use as() to coerce the sequence info to a GRanges object, and select the genomic range corressponding to chromosome 14.

library(dplyr)    # %>%
chr14 <- 
    seqinfo(r) %>%
    as("GRanges") %>%
    subset(seqnames == "14")

2.7 Subsetting by overlaps

The %over% function returns TRUE for each range on the left of the operator that overlaps a range on the right of the operator. Thus the row ranges overlapping chromosome 14 are

idx <- r %over% chr14
r[idx]
## GRangesList object of length 2244:
## $ENSG00000006432
## GRanges object with 26 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]       14 71189243-71197581      - |    453195 ENSE00002597852
##    [2]       14 71196368-71197581      - |    453196 ENSE00002485643
##    [3]       14 71196383-71197581      - |    453197 ENSE00002510454
##    [4]       14 71196653-71197581      - |    453198 ENSE00002249042
##    [5]       14 71196852-71197581      - |    453199 ENSE00001518020
##    ...      ...               ...    ... .       ...             ...
##   [22]       14 71249940-71250162      - |    453216 ENSE00002524190
##   [23]       14 71267384-71267797      - |    453217 ENSE00001137145
##   [24]       14 71275483-71275888      - |    453218 ENSE00001518024
##   [25]       14 71275483-71276223      - |    453219 ENSE00002468641
##   [26]       14 71275483-71276251      - |    453220 ENSE00002511719
##   -------
##   seqinfo: 722 sequences (1 circular) from GRCh37 genome
## 
## ...
## <2243 more elements>

Exercise How many ranges of r overlap chromosome 14?

Exercise Use idx to subset airway to contain only genes on chromosome 14.

Exercise We took a long route to get to this subset-by-overlap; summarize our story with two lines of code that allow us to subset airway by overlap.

3 Constructing a SummarizedExperiment object ‘by hand’

library(readr)
library(tibble)
pdata_file <- file.choos()    # airway-sample-sheet.csv
counts_file <- file.choose()  # airway-read-counts.csv

Exercise Use read_csv() to import the sample sheet and read counts.

pdata <- read_csv(pdata_file)
## Parsed with column specification:
## cols(
##   SampleName = col_character(),
##   cell = col_character(),
##   dex = col_character(),
##   albut = col_character(),
##   Run = col_character(),
##   avgLength = col_double(),
##   Experiment = col_character(),
##   Sample = col_character(),
##   BioSample = col_character()
## )
counts <- read_csv(counts_file)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Run = col_character()
## )
## See spec(...) for full column specifications.

Exercise Assemble SummarizedExperiment from transposed counts and pdata. The "Run" column needs to be made into rownames on both the pdata and counts object.

pdata <- column_to_rownames(pdata, "Run")
counts <- column_to_rownames(counts, "Run")
se <- SummarizedExperiment(t(counts), colData = pdata)
se
## class: SummarizedExperiment 
## dim: 33469 8 
## metadata(0):
## assays(1): ''
## rownames(33469): ENSG00000000003 ENSG00000000419 ...
##   ENSG00000273492 ENSG00000273493
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(8): SampleName cell ... Sample BioSample

Exercise Following examples from the lectures, calculate total library size and plot average log gene expression from you newly-created SummarizedExperiment.

4 Provenance

sessionInfo()
## R version 3.6.0 Patched (2019-04-26 r76431)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] tibble_2.1.3                readr_1.3.1                
##  [3] dplyr_0.8.2                 airway_1.5.0               
##  [5] SummarizedExperiment_1.15.5 DelayedArray_0.11.2        
##  [7] BiocParallel_1.19.0         matrixStats_0.54.0         
##  [9] Biobase_2.45.0              GenomicRanges_1.37.14      
## [11] GenomeInfoDb_1.21.1         IRanges_2.19.10            
## [13] S4Vectors_0.23.17           BiocGenerics_0.31.4        
## [15] BiocStyle_2.13.2           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1             pillar_1.4.2           compiler_3.6.0        
##  [4] BiocManager_1.30.5.1   XVector_0.25.0         bitops_1.0-6          
##  [7] tools_3.6.0            zlibbioc_1.31.0        digest_0.6.19         
## [10] evaluate_0.14          lattice_0.20-38        pkgconfig_2.0.2       
## [13] rlang_0.4.0            Matrix_1.2-17          yaml_2.2.0            
## [16] xfun_0.8               GenomeInfoDbData_1.2.1 stringr_1.4.0         
## [19] knitr_1.23             hms_0.4.2              tidyselect_0.2.5      
## [22] grid_3.6.0             glue_1.3.1             R6_2.4.0              
## [25] rmarkdown_1.13         bookdown_0.11          purrr_0.3.2           
## [28] magrittr_1.5           codetools_0.2-16       htmltools_0.3.6       
## [31] assertthat_0.2.1       stringi_1.4.3          RCurl_1.95-4.12       
## [34] crayon_1.3.4