SummarizedExperiment
?assay()
, assays()
: A matrix
-like or list of matrix
-like
objects of identical dimension
matrix
-like: implements dim()
, dimnames()
, and 2-dimensional
[
, [<-
methods.colData()
: Annotations on each column, as a DataFrame
.
rowData()
and / or rowRanges()
: Annotations on each row.
rowRanges()
: coordinates of gene / exons in transcripts /
etc.rowData()
:P-values and log-fold change of each gene
after differential expresssion analysis.metadata()
: List of unstructured metadata describing the overall
content of the object.SummarizedExperiment
objectlibrary(SummarizedExperiment)
The airway experiment data package summarizes an RNA-seq experiment investigating human smooth-muscle airway cell lines treated with dexamethasone. Load the library and data set.
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment
## dim: 64102 8
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
SummarizedExperiment
Exercise What are the dim()
and dimnames()
of the airway
object? What do the rows and columns correspond to in terms of the
original experiment? Hint: use dplyr::glimpse()
to look at the first
few elements of each dimnames
.
Exercise Use the functions assay()
, colData()
, rowRanges()
to extract the three major components of the airway
data
set. assay()
is large, so interogate it using class()
, dim()
and
head()
rather than simply printing it to the screen. What do each of
these parts corresspond to in terms of the RNASeq experiment
summarized in this object?
SummarizedExperiment
As a quick refresher, an R matrix
set.seed(123)
m <- matrix(
rnorm(12), nrow = 4, ncol = 3,
dimnames = list(letters[1:4], LETTERS[1:3])
)
m
## A B C
## a -0.56047565 0.1292877 -0.6868529
## b -0.23017749 1.7150650 -0.4456620
## c 1.55870831 0.4609162 1.2240818
## d 0.07050839 -1.2650612 0.3598138
has three main components to it’s interface, dim()
, dimnames()
,
and two-dimensional [
-subsetting.
dim(m)
## [1] 4 3
dimnames(m)
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "A" "B" "C"
m[1:2, 2:1]
## B A
## a 0.1292877 -0.5604756
## b 1.7150650 -0.2301775
m[1:2,]
## A B C
## a -0.5604756 0.1292877 -0.6868529
## b -0.2301775 1.7150650 -0.4456620
A tricky, and unfortunate, behavior is that subsetting a single row or column of a matrix results, by default in the matrix being coerced to a vector
m[1,]
## A B C
## -0.5604756 0.1292877 -0.6868529
m[,1]
## a b c d
## -0.56047565 -0.23017749 1.55870831 0.07050839
Avoid this behavior using the drop = FALSE
argument.
m[ 1, , drop = FALSE]
## A B C
## a -0.5604756 0.1292877 -0.6868529
SummarizedExperiment
implements the ‘matrix-like’ interface. This
means that it supports the functions dim()
, dimnames()
, and
two-dimensional [
-subsetting.
Exercise Subset airway
to contain the first 5 rows and first 4
columns. Check out the colData()
, rowRanges()
, and assay()
of
the resulting object.
assay()
valuesReturning to our simple matrix
m
## A B C
## a -0.56047565 0.1292877 -0.6868529
## b -0.23017749 1.7150650 -0.4456620
## c 1.55870831 0.4609162 1.2240818
## d 0.07050839 -1.2650612 0.3598138
It is straight-forward to transform a matrix, e.g., by adding 1
m + 1
## A B C
## a 0.4395244 1.1292877 0.3131471
## b 0.7698225 2.7150650 0.5543380
## c 2.5587083 1.4609162 2.2240818
## d 1.0705084 -0.2650612 1.3598138
or applying a transformation
abs(m + 1)
## A B C
## a 0.4395244 1.1292877 0.3131471
## b 0.7698225 2.7150650 0.5543380
## c 2.5587083 1.4609162 2.2240818
## d 1.0705084 0.2650612 1.3598138
There are special functions for row- and column-wise operations that are often performed on matrices (check out the matrixStats package for further mathematical operations), e.g.,
colSums(m)
## A B C
## 0.8385636 1.0402077 0.4513808
rowMeans(m)
## a b c d
## -0.3726803 0.3464085 1.0812354 -0.2782463
Exercise Extract the assay()
component of airway
and calculate
the column sums. What do these numbers represent?
Exercise Transform the assay()
matrix by adding 1 and taking the
natural log()
, then create a vector representing the average of each
row of this transformed matrix. What does this vector represent?
Visualize it in a histogram, density plot, or other representation.
SummarizedExperiment
We mentioned that the summarized experiment implements the matrix-like
two-dimensional [
subsetting interface.
The colData()
of airway
contains information about experimental
design, for instance the cell line and dexamethasone treatment level
of each sample.
colData(airway)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
There’s a short-cut to accessing each column, e.g.,
airway$dex
## [1] untrt trt untrt trt untrt trt untrt trt
## Levels: trt untrt
Exercies Use this shortcut to quickly subset airway
so that it
has only the untrt
samples present. Verify that the assay()
data
have also been subset correctly.
When working with a matrix, common scenario is to manipulate groups of rows based on some criterion, e.g., if one wished to keep rows with mean above 0, one might
ridx <- rowMeans(m) > 0
m[ridx, , drop = FALSE]
## A B C
## b -0.2301775 1.7150650 -0.445662
## c 1.5587083 0.4609162 1.224082
Exercise Remove all the rows from airway
that had no gene
expression across all samples. How many rows had no expression?
Re-calculate the histogram of log-transoformed average expression with
these rows excluded.
GRangesList
An R list()
is an object that contain different types of vectors,
including other lists. Here’s a simple example
l <- list(a = 1:5, b = month.abb)
The interface to a list allows for length()
, [
(to return a subset
of the original list) and [[
(to extract a single element of the
list, either by name or by position). The elements of the list can,
but do not have to be, named.
names(l)
## [1] "a" "b"
length(l)
## [1] 2
l[c(2, 1)]
## $b
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## $a
## [1] 1 2 3 4 5
l[2] # list of length 1, containing element 2 of original list
## $b
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
l[[2]] # element 2 of original list
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
One useful function is lengths()
, which returns a vector of the
length of each element in the lsit
lengths(l)
## a b
## 5 12
The [IRanges], [S4Vectors], and GenomicRanges packages
implement several classes that implement this list-like interface; the
classes all end with name *List
, e.g., GRangesList
.
Exercise Extract the row ranges from airway
, use class()
to
discover the type of object. What does each element of the list
represent? What does the entire list represent?
r <- rowRanges(airway)
Exercise Prove to yourself that the object implements a list-like
interface, supporting length()
, [
and [[
subsetting.
Exercise Use lengths()
to determine the number of exons in each
gene. How many genes are there with a single exon? What gene has the
largest number of exons?
GRangesList
GRanges
/ GRangesList
objects contain provenance information about
the genome to which the ranges refer, accessible using seqinfo()
.
Exercise Use seqinfo(r)
to extract the information about the
sequences the genomic ranges are based on.
Exercise Note that the genome
is always NA
. Take a quick look
at the vignette browseVignettes("airway")
, scaning down to the
‘Aligning reads’ section. We’re told that the reference genome used
for alignments is “GRCh37”. Update the row range to contain this
information
genome(r) <- "GRCh37"
seqinfo(r)
## Seqinfo object with 722 sequences (1 circular) from GRCh37 genome:
## seqnames seqlengths isCircular genome
## 1 249250621 FALSE GRCh37
## 2 243199373 FALSE GRCh37
## 3 198022430 FALSE GRCh37
## 4 191154276 FALSE GRCh37
## 5 180915260 FALSE GRCh37
## ... ... ... ...
## LRG_94 12428 FALSE GRCh37
## LRG_96 93210 FALSE GRCh37
## LRG_97 25996 FALSE GRCh37
## LRG_98 18750 FALSE GRCh37
## LRG_99 13294 FALSE GRCh37
Update the rowRanges()
of airway
with this more complete
information.
rowRanges(airway) <- r
Note that SummarizedExperiment
has convenience functions that
allow direct access to provenance
seqinfo(airway)
## Seqinfo object with 722 sequences (1 circular) from GRCh37 genome:
## seqnames seqlengths isCircular genome
## 1 249250621 FALSE GRCh37
## 2 243199373 FALSE GRCh37
## 3 198022430 FALSE GRCh37
## 4 191154276 FALSE GRCh37
## 5 180915260 FALSE GRCh37
## ... ... ... ...
## LRG_94 12428 FALSE GRCh37
## LRG_96 93210 FALSE GRCh37
## LRG_97 25996 FALSE GRCh37
## LRG_98 18750 FALSE GRCh37
## LRG_99 13294 FALSE GRCh37
Exercise Use as()
to coerce the sequence info to a GRanges
object, and select the genomic range corressponding to chromosome 14.
library(dplyr) # %>%
chr14 <-
seqinfo(r) %>%
as("GRanges") %>%
subset(seqnames == "14")
The %over%
function returns TRUE for each range on the left of the
operator that overlaps a range on the right of the operator. Thus the
row ranges overlapping chromosome 14 are
idx <- r %over% chr14
r[idx]
## GRangesList object of length 2244:
## $ENSG00000006432
## GRanges object with 26 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 14 71189243-71197581 - | 453195 ENSE00002597852
## [2] 14 71196368-71197581 - | 453196 ENSE00002485643
## [3] 14 71196383-71197581 - | 453197 ENSE00002510454
## [4] 14 71196653-71197581 - | 453198 ENSE00002249042
## [5] 14 71196852-71197581 - | 453199 ENSE00001518020
## ... ... ... ... . ... ...
## [22] 14 71249940-71250162 - | 453216 ENSE00002524190
## [23] 14 71267384-71267797 - | 453217 ENSE00001137145
## [24] 14 71275483-71275888 - | 453218 ENSE00001518024
## [25] 14 71275483-71276223 - | 453219 ENSE00002468641
## [26] 14 71275483-71276251 - | 453220 ENSE00002511719
## -------
## seqinfo: 722 sequences (1 circular) from GRCh37 genome
##
## ...
## <2243 more elements>
Exercise How many ranges of r
overlap chromosome 14?
Exercise Use idx
to subset airway
to contain only genes on
chromosome 14.
Exercise We took a long route to get to this subset-by-overlap;
summarize our story with two lines of code that allow us to subset
airway
by overlap.
SummarizedExperiment
object ‘by hand’library(readr)
library(tibble)
pdata_file <- file.choos() # airway-sample-sheet.csv
counts_file <- file.choose() # airway-read-counts.csv
Exercise Use read_csv()
to import the sample sheet and
read counts.
pdata <- read_csv(pdata_file)
## Parsed with column specification:
## cols(
## SampleName = col_character(),
## cell = col_character(),
## dex = col_character(),
## albut = col_character(),
## Run = col_character(),
## avgLength = col_double(),
## Experiment = col_character(),
## Sample = col_character(),
## BioSample = col_character()
## )
counts <- read_csv(counts_file)
## Parsed with column specification:
## cols(
## .default = col_double(),
## Run = col_character()
## )
## See spec(...) for full column specifications.
Exercise Assemble SummarizedExperiment
from transposed counts
and pdata
. The "Run"
column needs to be made into rownames on both
the pdata
and counts
object.
pdata <- column_to_rownames(pdata, "Run")
counts <- column_to_rownames(counts, "Run")
se <- SummarizedExperiment(t(counts), colData = pdata)
se
## class: SummarizedExperiment
## dim: 33469 8
## metadata(0):
## assays(1): ''
## rownames(33469): ENSG00000000003 ENSG00000000419 ...
## ENSG00000273492 ENSG00000273493
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(8): SampleName cell ... Sample BioSample
Exercise Following examples from the lectures, calculate total
library size and plot average log gene expression from you
newly-created SummarizedExperiment
.
sessionInfo()
## R version 3.6.0 Patched (2019-04-26 r76431)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] tibble_2.1.3 readr_1.3.1
## [3] dplyr_0.8.2 airway_1.5.0
## [5] SummarizedExperiment_1.15.5 DelayedArray_0.11.2
## [7] BiocParallel_1.19.0 matrixStats_0.54.0
## [9] Biobase_2.45.0 GenomicRanges_1.37.14
## [11] GenomeInfoDb_1.21.1 IRanges_2.19.10
## [13] S4Vectors_0.23.17 BiocGenerics_0.31.4
## [15] BiocStyle_2.13.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.2 compiler_3.6.0
## [4] BiocManager_1.30.5.1 XVector_0.25.0 bitops_1.0-6
## [7] tools_3.6.0 zlibbioc_1.31.0 digest_0.6.19
## [10] evaluate_0.14 lattice_0.20-38 pkgconfig_2.0.2
## [13] rlang_0.4.0 Matrix_1.2-17 yaml_2.2.0
## [16] xfun_0.8 GenomeInfoDbData_1.2.1 stringr_1.4.0
## [19] knitr_1.23 hms_0.4.2 tidyselect_0.2.5
## [22] grid_3.6.0 glue_1.3.1 R6_2.4.0
## [25] rmarkdown_1.13 bookdown_0.11 purrr_0.3.2
## [28] magrittr_1.5 codetools_0.2-16 htmltools_0.3.6
## [31] assertthat_0.2.1 stringi_1.4.3 RCurl_1.95-4.12
## [34] crayon_1.3.4