Chapter 2 SummarizedExperiments

Objectives

  • Present the SummarizedExperiment class, extensively used in omics analyses.
  • Show how to build SummarizedExperiment objects.

Data in bioinformatics is often more complex than the basic data types we have seen so far. To deal with this, developers define specialised data containers (termed classes) that match the properties of the data they need to handle.

This aspect is central to the Bioconductor5 project which uses the same core data infrastructure across packages. This certainly contributed to Bioconductor’s success. Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole.

To illustrate such an omics data container, we’ll present the SummarizedExperiment class.

2.1 SummarizedExperiment

The figure below represents the anatomy of the SummarizedExperiment class.

Objects of the class SummarizedExperiment contain :

  • One (or more) assay(s) containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, transcripts, proteins, …) are defined along the rows, and samples along the columns.

  • A sample metadata slot containing sample co-variates, stored as a data frame. Rows from this table represent samples (rows match exactly the columns of the expression data).

  • A feature metadata slot containing feature co-variates, stored as a data frame. The rows of this data frame match exactly the rows of the expression data.

The coordinated nature of the SummarizedExperiment guarantees that during data manipulation, the dimensions of the different slots will always match (i.e the columns in the expression data and then rows in the sample metadata, as well as the rows in the expression data and feature metadata) during data manipulation. For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation.

The metadata slots can grow additional co-variates (columns) without affecting the other structures.

2.1.1 Creating a SummarizedExperiment

In order to create a SummarizedExperiment, we will create the individual components, i.e the count matrix, the sample and gene metadata from csv files. These are typically how RNA-Seq data are provided (after raw data have been processed).

  • An expression matrix: we load the count matrix, specifying that the first columns contains row/gene names, and convert the data.frame to a matrix. You can download it here.
count_matrix <- read.csv("data/count_matrix.csv",
                         row.names = 1) %>%
    as.matrix()

count_matrix[1:5, ]
##         GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## Asl           1170        361        400        586        626        988
## Apod         36194      10347       9173      10620      13021      29594
## Cyp2d22       4060       1616       1603       1901       2171       3349
## Klk6           287        629        641        578        448        195
##         GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
## Asl            836        535        586        597        938       1035
## Apod         24959      13668      13230      15868      27769      34301
## Cyp2d22       3122       2008       2254       2277       2985       3452
## Klk6           186       1101        537        567        327        233
##         GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
## Asl            494        481        666        937        803        541
## Apod         11258      11812      15816      29242      20415      13682
## Cyp2d22       1883       2014       2417       3678       2920       2216
## Klk6           742        881        828        250        798        710
##         GSM2545354 GSM2545362 GSM2545363 GSM2545380
## Asl            473        748        576       1192
## Apod         11088      15916      11166      38148
## Cyp2d22       1821       2842       2011       4019
## Klk6           894        501        598        259
##  [ reached getOption("max.print") -- omitted 1 row ]
dim(count_matrix)
## [1] 1474   22
  • A table describing the samples, available here.
sample_metadata <- read.csv("data/sample_metadata.csv")
sample_metadata
##        sample     organism age    sex   infection  strain time     tissue mouse
## 1  GSM2545336 Mus musculus   8 Female  InfluenzaA C57BL/6    8 Cerebellum    14
## 2  GSM2545337 Mus musculus   8 Female NonInfected C57BL/6    0 Cerebellum     9
## 3  GSM2545338 Mus musculus   8 Female NonInfected C57BL/6    0 Cerebellum    10
## 4  GSM2545339 Mus musculus   8 Female  InfluenzaA C57BL/6    4 Cerebellum    15
## 5  GSM2545340 Mus musculus   8   Male  InfluenzaA C57BL/6    4 Cerebellum    18
## 6  GSM2545341 Mus musculus   8   Male  InfluenzaA C57BL/6    8 Cerebellum     6
## 7  GSM2545342 Mus musculus   8 Female  InfluenzaA C57BL/6    8 Cerebellum     5
## 8  GSM2545343 Mus musculus   8   Male NonInfected C57BL/6    0 Cerebellum    11
## 9  GSM2545344 Mus musculus   8 Female  InfluenzaA C57BL/6    4 Cerebellum    22
## 10 GSM2545345 Mus musculus   8   Male  InfluenzaA C57BL/6    4 Cerebellum    13
## 11 GSM2545346 Mus musculus   8   Male  InfluenzaA C57BL/6    8 Cerebellum    23
##  [ reached 'max' / getOption("max.print") -- omitted 11 rows ]
dim(sample_metadata)
## [1] 22  9
  • A table describing the genes, available here.
gene_metadata <- read.csv("data/gene_metadata.csv")
gene_metadata[1:10, 1:4]
##       gene ENTREZID
## 1      Asl   109900
## 2     Apod    11815
## 3  Cyp2d22    56448
## 4     Klk6    19144
## 5    Fcrls    80891
## 6   Slc2a4    20528
## 7     Exd2    97827
## 8     Gjc2   118454
## 9     Plp1    18823
## 10    Gnb4    14696
##                                                                        product
## 1                                argininosuccinate lyase transcript variant X1
## 2                                        apolipoprotein D transcript variant 3
## 3     cytochrome P450 family 2 subfamily d polypeptide 22 transcript variant 2
## 4                          kallikrein related-peptidase 6 transcript variant 2
## 5                  Fc receptor-like S scavenger receptor transcript variant X1
## 6           solute carrier family 2 (facilitated glucose transporter) member 4
## 7                                        exonuclease 3'-5' domain containing 2
## 8                            gap junction protein gamma 2 transcript variant 1
## 9                          proteolipid protein (myelin) 1 transcript variant 1
## 10 guanine nucleotide binding protein (G protein) beta 4 transcript variant X2
##       ensembl_gene_id
## 1  ENSMUSG00000025533
## 2  ENSMUSG00000022548
## 3  ENSMUSG00000061740
## 4  ENSMUSG00000050063
## 5  ENSMUSG00000015852
## 6  ENSMUSG00000018566
## 7  ENSMUSG00000032705
## 8  ENSMUSG00000043448
## 9  ENSMUSG00000031425
## 10 ENSMUSG00000027669
dim(gene_metadata)
## [1] 1474    9

We will create a SummarizedExperiment from these tables:

  • The count matrix that will be used as the assay

  • The table describing the samples will be used as the sample metadata slot

  • The table describing the genes will be used as the features metadata slot

To do this we can put the different parts together using the SummarizedExperiment constructor:

## BiocManager::install("SummarizedExperiment")
library("SummarizedExperiment")

First, we make sure that the samples are in the same order in the count matrix and the sample annotation, and the same for the genes in the count matrix and the gene annotation.

stopifnot(rownames(count_matrix) == gene_metadata$gene)
stopifnot(colnames(count_matrix) == sample_metadata$sample)
se <- SummarizedExperiment(assays = list(counts = count_matrix),
                           colData = sample_metadata,
                           rowData = gene_metadata)
se
## class: SummarizedExperiment 
## dim: 1474 22 
## metadata(0):
## assays(1): counts
## rownames(1474): Asl Apod ... Pbx1 Rgs4
## rowData names(9): gene ENTREZID ... phenotype_description
##   hsapiens_homolog_associated_gene_name
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(9): sample organism ... tissue mouse

Using this data structure, we can access the expression matrix with the assay function:

head(assay(se))
##         GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## Asl           1170        361        400        586        626        988
## Apod         36194      10347       9173      10620      13021      29594
## Cyp2d22       4060       1616       1603       1901       2171       3349
## Klk6           287        629        641        578        448        195
##         GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
## Asl            836        535        586        597        938       1035
## Apod         24959      13668      13230      15868      27769      34301
## Cyp2d22       3122       2008       2254       2277       2985       3452
## Klk6           186       1101        537        567        327        233
##         GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
## Asl            494        481        666        937        803        541
## Apod         11258      11812      15816      29242      20415      13682
## Cyp2d22       1883       2014       2417       3678       2920       2216
## Klk6           742        881        828        250        798        710
##         GSM2545354 GSM2545362 GSM2545363 GSM2545380
## Asl            473        748        576       1192
## Apod         11088      15916      11166      38148
## Cyp2d22       1821       2842       2011       4019
## Klk6           894        501        598        259
##  [ reached getOption("max.print") -- omitted 2 rows ]
dim(assay(se))
## [1] 1474   22

We can access the sample metadata using the colData function:

colData(se)
## DataFrame with 22 rows and 9 columns
##                 sample     organism       age         sex   infection
##            <character>  <character> <integer> <character> <character>
## GSM2545336  GSM2545336 Mus musculus         8      Female  InfluenzaA
## GSM2545337  GSM2545337 Mus musculus         8      Female NonInfected
## GSM2545338  GSM2545338 Mus musculus         8      Female NonInfected
## GSM2545339  GSM2545339 Mus musculus         8      Female  InfluenzaA
## GSM2545340  GSM2545340 Mus musculus         8        Male  InfluenzaA
## ...                ...          ...       ...         ...         ...
## GSM2545353  GSM2545353 Mus musculus         8      Female NonInfected
## GSM2545354  GSM2545354 Mus musculus         8        Male NonInfected
## GSM2545362  GSM2545362 Mus musculus         8      Female  InfluenzaA
## GSM2545363  GSM2545363 Mus musculus         8        Male  InfluenzaA
##                 strain      time      tissue     mouse
##            <character> <integer> <character> <integer>
## GSM2545336     C57BL/6         8  Cerebellum        14
## GSM2545337     C57BL/6         0  Cerebellum         9
## GSM2545338     C57BL/6         0  Cerebellum        10
## GSM2545339     C57BL/6         4  Cerebellum        15
## GSM2545340     C57BL/6         4  Cerebellum        18
## ...                ...       ...         ...       ...
## GSM2545353     C57BL/6         0  Cerebellum         4
## GSM2545354     C57BL/6         0  Cerebellum         2
## GSM2545362     C57BL/6         4  Cerebellum        20
## GSM2545363     C57BL/6         4  Cerebellum        12
##  [ reached getOption("max.print") -- omitted 1 row ]
dim(colData(se))
## [1] 22  9

We can also access the feature metadata using the rowData function:

head(rowData(se))
## DataFrame with 6 rows and 9 columns
##                gene  ENTREZID                product    ensembl_gene_id
##         <character> <integer>            <character>        <character>
## Asl             Asl    109900 argininosuccinate ly.. ENSMUSG00000025533
## Apod           Apod     11815 apolipoprotein D tra.. ENSMUSG00000022548
## Cyp2d22     Cyp2d22     56448 cytochrome P450 fami.. ENSMUSG00000061740
## Klk6           Klk6     19144 kallikrein related-p.. ENSMUSG00000050063
## Fcrls         Fcrls     80891 Fc receptor-like S s.. ENSMUSG00000015852
## Slc2a4       Slc2a4     20528 solute carrier famil.. ENSMUSG00000018566
##         external_synonym chromosome_name   gene_biotype  phenotype_description
##              <character>     <character>    <character>            <character>
## Asl        2510006M18Rik               5 protein_coding abnormal circulating..
## Apod                  NA              16 protein_coding abnormal lipid homeo..
## Cyp2d22             2D22              15 protein_coding abnormal skin morpho..
## Klk6                Bssp               7 protein_coding abnormal cytokine le..
## Fcrls      2810439C17Rik               3 protein_coding decreased CD8-positi..
## Slc2a4            Glut-4              11 protein_coding abnormal circulating..
##         hsapiens_homolog_associated_gene_name
##                                   <character>
## Asl                                       ASL
## Apod                                     APOD
## Cyp2d22                                CYP2D6
## Klk6                                     KLK6
## Fcrls                                   FCRL4
## Slc2a4                                 SLC2A4
dim(rowData(se))
## [1] 1474    9

2.1.2 Subsetting a SummarizedExperiment

SummarizedExperiment can be subset just like with data frames, with numerics or with characters of logicals.

Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.

se1 <- se[1:5, 1:3]
se1
## class: SummarizedExperiment 
## dim: 5 3 
## metadata(0):
## assays(1): counts
## rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
## rowData names(9): gene ENTREZID ... phenotype_description
##   hsapiens_homolog_associated_gene_name
## colnames(3): GSM2545336 GSM2545337 GSM2545338
## colData names(9): sample organism ... tissue mouse
colData(se1)
## DataFrame with 3 rows and 9 columns
##                 sample     organism       age         sex   infection
##            <character>  <character> <integer> <character> <character>
## GSM2545336  GSM2545336 Mus musculus         8      Female  InfluenzaA
## GSM2545337  GSM2545337 Mus musculus         8      Female NonInfected
## GSM2545338  GSM2545338 Mus musculus         8      Female NonInfected
##                 strain      time      tissue     mouse
##            <character> <integer> <character> <integer>
## GSM2545336     C57BL/6         8  Cerebellum        14
## GSM2545337     C57BL/6         0  Cerebellum         9
## GSM2545338     C57BL/6         0  Cerebellum        10
rowData(se1)
## DataFrame with 5 rows and 9 columns
##                gene  ENTREZID                product    ensembl_gene_id
##         <character> <integer>            <character>        <character>
## Asl             Asl    109900 argininosuccinate ly.. ENSMUSG00000025533
## Apod           Apod     11815 apolipoprotein D tra.. ENSMUSG00000022548
## Cyp2d22     Cyp2d22     56448 cytochrome P450 fami.. ENSMUSG00000061740
## Klk6           Klk6     19144 kallikrein related-p.. ENSMUSG00000050063
## Fcrls         Fcrls     80891 Fc receptor-like S s.. ENSMUSG00000015852
##         external_synonym chromosome_name   gene_biotype  phenotype_description
##              <character>     <character>    <character>            <character>
## Asl        2510006M18Rik               5 protein_coding abnormal circulating..
## Apod                  NA              16 protein_coding abnormal lipid homeo..
## Cyp2d22             2D22              15 protein_coding abnormal skin morpho..
## Klk6                Bssp               7 protein_coding abnormal cytokine le..
## Fcrls      2810439C17Rik               3 protein_coding decreased CD8-positi..
##         hsapiens_homolog_associated_gene_name
##                                   <character>
## Asl                                       ASL
## Apod                                     APOD
## Cyp2d22                                CYP2D6
## Klk6                                     KLK6
## Fcrls                                   FCRL4

We can also use the colData() function to subset on something from the sample metadata or the rowData() to subset on something from the feature metadata.

► Question

Create a new object keeping only miRNAs (see gene_biotype feature variable) and the non infected samples.

► Solution

► Question

Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8.

► Solution

2.1.2.1 Adding variables to metadata

We can also add information to the metadata. Suppose that you want to add the center where the samples were collected …

colData(se)$center <- rep("University of Illinois", nrow(colData(se)))
colData(se)
## DataFrame with 22 rows and 10 columns
##                 sample     organism       age         sex   infection
##            <character>  <character> <integer> <character> <character>
## GSM2545336  GSM2545336 Mus musculus         8      Female  InfluenzaA
## GSM2545337  GSM2545337 Mus musculus         8      Female NonInfected
## GSM2545338  GSM2545338 Mus musculus         8      Female NonInfected
## GSM2545339  GSM2545339 Mus musculus         8      Female  InfluenzaA
## GSM2545340  GSM2545340 Mus musculus         8        Male  InfluenzaA
## ...                ...          ...       ...         ...         ...
## GSM2545353  GSM2545353 Mus musculus         8      Female NonInfected
## GSM2545354  GSM2545354 Mus musculus         8        Male NonInfected
## GSM2545362  GSM2545362 Mus musculus         8      Female  InfluenzaA
##                 strain      time      tissue     mouse                 center
##            <character> <integer> <character> <integer>            <character>
## GSM2545336     C57BL/6         8  Cerebellum        14 University of Illinois
## GSM2545337     C57BL/6         0  Cerebellum         9 University of Illinois
## GSM2545338     C57BL/6         0  Cerebellum        10 University of Illinois
## GSM2545339     C57BL/6         4  Cerebellum        15 University of Illinois
## GSM2545340     C57BL/6         4  Cerebellum        18 University of Illinois
## ...                ...       ...         ...       ...                    ...
## GSM2545353     C57BL/6         0  Cerebellum         4 University of Illinois
## GSM2545354     C57BL/6         0  Cerebellum         2 University of Illinois
## GSM2545362     C57BL/6         4  Cerebellum        20 University of Illinois
##  [ reached getOption("max.print") -- omitted 2 rows ]

This illustrates that the metadata slots can grow indefinitely without affecting the other structures!

2.2 RangedSummarizedExperiment

In addition to the SummarizedExperiment slots we have seen above, it is also possible to store information about the features’ genomic ranges. Below, we load the airway RangedSummarizedExperiment data to illustrate this:

library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
rowRanges(airway)
## GRangesList object of length 63677:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome
## 
## ...
## <63676 more elements>

This GRangesList contains a list of 64102 range elements, one for each gene/feature. Each element is a GRanges object that contains the positions of the exons or the genes:

rowRanges(airway)[[1]]
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome

It is possible to use these row genomic ranges to subset features in regions of interest, defined as GRanges object. Below, for example, we subset features between bases 1M and 2M on chromosome 7:

roi <- GRanges(seqnames = "7", ranges = 1e6:2e6)
subsetByOverlaps(airway, roi)
## class: RangedSummarizedExperiment 
## dim: 34 8 
## metadata(1): ''
## assays(1): counts
## rownames(34): ENSG00000002822 ENSG00000073067 ... ENSG00000266391
##   ENSG00000273230
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

2.3 Creating a count matrix from multiple files

Raw HTS fastq files from different sample are processed into raw counts one by one (see next chapter). These data meed to be merged in a single count matrix, were each column represents a sample, and each line represents a gene. This count matrix will be the starting point of a differential expression analysis.

##                 sample1 sample2 sample3 sample4 sample5 sample6
## ENSG00000223972       0       0       0       0       0       1
## ENSG00000227232      14      28      17      40      16      13
## ENSG00000278267       8       4       3       1       1       6
## ENSG00000243485       0       0       0       0       0       0
## ENSG00000284332       0       0       0       0       0       0
## ENSG00000237613       0       0       0       0       0       0

► Question

Imagine you had initially 6 samples (3 control and 3 treated samples) that you have processed independently giving the 6 counts files generated by FeatureCounts located in the folder wsbim_data/count_data/.

samples <- list.files("wsbim2122_data/count_data",
                      pattern = "*tsv.gz",
                      full.names = TRUE)
samples
## [1] "wsbim2122_data/count_data/sample1_counts.tsv.gz"
## [2] "wsbim2122_data/count_data/sample2_counts.tsv.gz"
## [3] "wsbim2122_data/count_data/sample3_counts.tsv.gz"
## [4] "wsbim2122_data/count_data/sample4_counts.tsv.gz"
## [5] "wsbim2122_data/count_data/sample5_counts.tsv.gz"
## [6] "wsbim2122_data/count_data/sample6_counts.tsv.gz"

Use these files to create a count matrix.

► Solution


  1. The Bioconductor was initiated by Robert Gentleman, one of the two creators of the R language. Bioconductor provides tools for the analysis and comprehension of omics data. Bioconductor uses the R statistical programming language, and is open source and open development.↩︎

Page built: 2023-10-05 using R version 4.3.1 Patched (2023-07-10 r84676)