Objectives
SummarizedExperiment
class, extensively used in omics
analyses.SummarizedExperiment
objects.Data in bioinformatics is often more complex than the basic data types we have seen so far. To deal with this, developers define specialised data containers (termed classes) that match the properties of the data they need to handle.
This aspect is central to the Bioconductor5 project which uses the same core data infrastructure across packages. This certainly contributed to Bioconductor’s success. Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole.
To illustrate such an omics data container, we’ll present the
SummarizedExperiment
class.
The figure below represents the anatomy of the SummarizedExperiment class.
Objects of the class SummarizedExperiment contain :
One (or more) assay(s) containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, transcripts, proteins, …) are defined along the rows, and samples along the columns.
A sample metadata slot containing sample co-variates, stored as a data frame. Rows from this table represent samples (rows match exactly the columns of the expression data).
A feature metadata slot containing feature co-variates, stored as a data frame. The rows of this data frame match exactly the rows of the expression data.
The coordinated nature of the SummarizedExperiment guarantees that during data manipulation, the dimensions of the different slots will always match (i.e the columns in the expression data and then rows in the sample metadata, as well as the rows in the expression data and feature metadata) during data manipulation. For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation.
The metadata slots can grow additional co-variates (columns) without affecting the other structures.
In order to create a SummarizedExperiment
, we will create the
individual components, i.e the count matrix, the sample and gene
metadata from csv files. These are typically how RNA-Seq data are
provided (after raw data have been processed).
data.frame
to a matrix
. You can download it
here.count_matrix <- read.csv("data/count_matrix.csv",
row.names = 1) %>%
as.matrix()
count_matrix[1:5, ]
## GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## Asl 1170 361 400 586 626 988
## Apod 36194 10347 9173 10620 13021 29594
## Cyp2d22 4060 1616 1603 1901 2171 3349
## Klk6 287 629 641 578 448 195
## GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
## Asl 836 535 586 597 938 1035
## Apod 24959 13668 13230 15868 27769 34301
## Cyp2d22 3122 2008 2254 2277 2985 3452
## Klk6 186 1101 537 567 327 233
## GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
## Asl 494 481 666 937 803 541
## Apod 11258 11812 15816 29242 20415 13682
## Cyp2d22 1883 2014 2417 3678 2920 2216
## Klk6 742 881 828 250 798 710
## GSM2545354 GSM2545362 GSM2545363 GSM2545380
## Asl 473 748 576 1192
## Apod 11088 15916 11166 38148
## Cyp2d22 1821 2842 2011 4019
## Klk6 894 501 598 259
## [ reached getOption("max.print") -- omitted 1 row ]
## [1] 1474 22
## sample organism age sex infection strain time tissue mouse
## 1 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 14
## 2 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 9
## 3 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebellum 10
## 4 GSM2545339 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 15
## 5 GSM2545340 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 18
## 6 GSM2545341 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebellum 6
## 7 GSM2545342 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebellum 5
## 8 GSM2545343 Mus musculus 8 Male NonInfected C57BL/6 0 Cerebellum 11
## 9 GSM2545344 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebellum 22
## 10 GSM2545345 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebellum 13
## 11 GSM2545346 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebellum 23
## [ reached 'max' / getOption("max.print") -- omitted 11 rows ]
## [1] 22 9
## gene ENTREZID
## 1 Asl 109900
## 2 Apod 11815
## 3 Cyp2d22 56448
## 4 Klk6 19144
## 5 Fcrls 80891
## 6 Slc2a4 20528
## 7 Exd2 97827
## 8 Gjc2 118454
## 9 Plp1 18823
## 10 Gnb4 14696
## product
## 1 argininosuccinate lyase transcript variant X1
## 2 apolipoprotein D transcript variant 3
## 3 cytochrome P450 family 2 subfamily d polypeptide 22 transcript variant 2
## 4 kallikrein related-peptidase 6 transcript variant 2
## 5 Fc receptor-like S scavenger receptor transcript variant X1
## 6 solute carrier family 2 (facilitated glucose transporter) member 4
## 7 exonuclease 3'-5' domain containing 2
## 8 gap junction protein gamma 2 transcript variant 1
## 9 proteolipid protein (myelin) 1 transcript variant 1
## 10 guanine nucleotide binding protein (G protein) beta 4 transcript variant X2
## ensembl_gene_id
## 1 ENSMUSG00000025533
## 2 ENSMUSG00000022548
## 3 ENSMUSG00000061740
## 4 ENSMUSG00000050063
## 5 ENSMUSG00000015852
## 6 ENSMUSG00000018566
## 7 ENSMUSG00000032705
## 8 ENSMUSG00000043448
## 9 ENSMUSG00000031425
## 10 ENSMUSG00000027669
## [1] 1474 9
We will create a SummarizedExperiment
from these tables:
The count matrix that will be used as the assay
The table describing the samples will be used as the sample metadata slot
The table describing the genes will be used as the features metadata slot
To do this we can put the different parts together using the
SummarizedExperiment
constructor:
First, we make sure that the samples are in the same order in the count matrix and the sample annotation, and the same for the genes in the count matrix and the gene annotation.
stopifnot(rownames(count_matrix) == gene_metadata$gene)
stopifnot(colnames(count_matrix) == sample_metadata$sample)
se <- SummarizedExperiment(assays = list(counts = count_matrix),
colData = sample_metadata,
rowData = gene_metadata)
se
## class: SummarizedExperiment
## dim: 1474 22
## metadata(0):
## assays(1): counts
## rownames(1474): Asl Apod ... Pbx1 Rgs4
## rowData names(9): gene ENTREZID ... phenotype_description
## hsapiens_homolog_associated_gene_name
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(9): sample organism ... tissue mouse
Using this data structure, we can access the expression matrix with
the assay
function:
## GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## Asl 1170 361 400 586 626 988
## Apod 36194 10347 9173 10620 13021 29594
## Cyp2d22 4060 1616 1603 1901 2171 3349
## Klk6 287 629 641 578 448 195
## GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
## Asl 836 535 586 597 938 1035
## Apod 24959 13668 13230 15868 27769 34301
## Cyp2d22 3122 2008 2254 2277 2985 3452
## Klk6 186 1101 537 567 327 233
## GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
## Asl 494 481 666 937 803 541
## Apod 11258 11812 15816 29242 20415 13682
## Cyp2d22 1883 2014 2417 3678 2920 2216
## Klk6 742 881 828 250 798 710
## GSM2545354 GSM2545362 GSM2545363 GSM2545380
## Asl 473 748 576 1192
## Apod 11088 15916 11166 38148
## Cyp2d22 1821 2842 2011 4019
## Klk6 894 501 598 259
## [ reached getOption("max.print") -- omitted 2 rows ]
## [1] 1474 22
We can access the sample metadata using the colData
function:
## DataFrame with 22 rows and 9 columns
## sample organism age sex infection
## <character> <character> <integer> <character> <character>
## GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA
## GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA
## ... ... ... ... ... ...
## GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
## GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
## GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA
## GSM2545363 GSM2545363 Mus musculus 8 Male InfluenzaA
## strain time tissue mouse
## <character> <integer> <character> <integer>
## GSM2545336 C57BL/6 8 Cerebellum 14
## GSM2545337 C57BL/6 0 Cerebellum 9
## GSM2545338 C57BL/6 0 Cerebellum 10
## GSM2545339 C57BL/6 4 Cerebellum 15
## GSM2545340 C57BL/6 4 Cerebellum 18
## ... ... ... ... ...
## GSM2545353 C57BL/6 0 Cerebellum 4
## GSM2545354 C57BL/6 0 Cerebellum 2
## GSM2545362 C57BL/6 4 Cerebellum 20
## GSM2545363 C57BL/6 4 Cerebellum 12
## [ reached getOption("max.print") -- omitted 1 row ]
## [1] 22 9
We can also access the feature metadata using the rowData
function:
## DataFrame with 6 rows and 9 columns
## gene ENTREZID product ensembl_gene_id
## <character> <integer> <character> <character>
## Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533
## Apod Apod 11815 apolipoprotein D tra.. ENSMUSG00000022548
## Cyp2d22 Cyp2d22 56448 cytochrome P450 fami.. ENSMUSG00000061740
## Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063
## Fcrls Fcrls 80891 Fc receptor-like S s.. ENSMUSG00000015852
## Slc2a4 Slc2a4 20528 solute carrier famil.. ENSMUSG00000018566
## external_synonym chromosome_name gene_biotype phenotype_description
## <character> <character> <character> <character>
## Asl 2510006M18Rik 5 protein_coding abnormal circulating..
## Apod NA 16 protein_coding abnormal lipid homeo..
## Cyp2d22 2D22 15 protein_coding abnormal skin morpho..
## Klk6 Bssp 7 protein_coding abnormal cytokine le..
## Fcrls 2810439C17Rik 3 protein_coding decreased CD8-positi..
## Slc2a4 Glut-4 11 protein_coding abnormal circulating..
## hsapiens_homolog_associated_gene_name
## <character>
## Asl ASL
## Apod APOD
## Cyp2d22 CYP2D6
## Klk6 KLK6
## Fcrls FCRL4
## Slc2a4 SLC2A4
## [1] 1474 9
SummarizedExperiment can be subset just like with data frames, with numerics or with characters of logicals.
Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.
## class: SummarizedExperiment
## dim: 5 3
## metadata(0):
## assays(1): counts
## rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
## rowData names(9): gene ENTREZID ... phenotype_description
## hsapiens_homolog_associated_gene_name
## colnames(3): GSM2545336 GSM2545337 GSM2545338
## colData names(9): sample organism ... tissue mouse
## DataFrame with 3 rows and 9 columns
## sample organism age sex infection
## <character> <character> <integer> <character> <character>
## GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## strain time tissue mouse
## <character> <integer> <character> <integer>
## GSM2545336 C57BL/6 8 Cerebellum 14
## GSM2545337 C57BL/6 0 Cerebellum 9
## GSM2545338 C57BL/6 0 Cerebellum 10
## DataFrame with 5 rows and 9 columns
## gene ENTREZID product ensembl_gene_id
## <character> <integer> <character> <character>
## Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533
## Apod Apod 11815 apolipoprotein D tra.. ENSMUSG00000022548
## Cyp2d22 Cyp2d22 56448 cytochrome P450 fami.. ENSMUSG00000061740
## Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063
## Fcrls Fcrls 80891 Fc receptor-like S s.. ENSMUSG00000015852
## external_synonym chromosome_name gene_biotype phenotype_description
## <character> <character> <character> <character>
## Asl 2510006M18Rik 5 protein_coding abnormal circulating..
## Apod NA 16 protein_coding abnormal lipid homeo..
## Cyp2d22 2D22 15 protein_coding abnormal skin morpho..
## Klk6 Bssp 7 protein_coding abnormal cytokine le..
## Fcrls 2810439C17Rik 3 protein_coding decreased CD8-positi..
## hsapiens_homolog_associated_gene_name
## <character>
## Asl ASL
## Apod APOD
## Cyp2d22 CYP2D6
## Klk6 KLK6
## Fcrls FCRL4
We can also use the colData()
function to subset on something from
the sample metadata or the rowData()
to subset on something from the
feature metadata.
► Question
Create a new object keeping only miRNAs (see gene_biotype
feature
variable) and the non infected samples.
► Solution
► Question
Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8.
► Solution
We can also add information to the metadata. Suppose that you want to add the center where the samples were collected …
## DataFrame with 22 rows and 10 columns
## sample organism age sex infection
## <character> <character> <integer> <character> <character>
## GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA
## GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA
## ... ... ... ... ... ...
## GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
## GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
## GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA
## strain time tissue mouse center
## <character> <integer> <character> <integer> <character>
## GSM2545336 C57BL/6 8 Cerebellum 14 University of Illinois
## GSM2545337 C57BL/6 0 Cerebellum 9 University of Illinois
## GSM2545338 C57BL/6 0 Cerebellum 10 University of Illinois
## GSM2545339 C57BL/6 4 Cerebellum 15 University of Illinois
## GSM2545340 C57BL/6 4 Cerebellum 18 University of Illinois
## ... ... ... ... ... ...
## GSM2545353 C57BL/6 0 Cerebellum 4 University of Illinois
## GSM2545354 C57BL/6 0 Cerebellum 2 University of Illinois
## GSM2545362 C57BL/6 4 Cerebellum 20 University of Illinois
## [ reached getOption("max.print") -- omitted 2 rows ]
This illustrates that the metadata slots can grow indefinitely without affecting the other structures!
In addition to the SummarizedExperiment
slots we have seen above, it
is also possible to store information about the features’ genomic
ranges. Below, we load the airway
RangedSummarizedExperiment
data
to illustrate this:
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
## GRangesList object of length 63677:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X 99883667-99884983 - | 667145 ENSE00001459322
## [2] X 99885756-99885863 - | 667146 ENSE00000868868
## [3] X 99887482-99887565 - | 667147 ENSE00000401072
## [4] X 99887538-99887565 - | 667148 ENSE00001849132
## [5] X 99888402-99888536 - | 667149 ENSE00003554016
## ... ... ... ... . ... ...
## [13] X 99890555-99890743 - | 667156 ENSE00003512331
## [14] X 99891188-99891686 - | 667158 ENSE00001886883
## [15] X 99891605-99891803 - | 667159 ENSE00001855382
## [16] X 99891790-99892101 - | 667160 ENSE00001863395
## [17] X 99894942-99894988 - | 667161 ENSE00001828996
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
##
## ...
## <63676 more elements>
This GRangesList
contains a list of 64102 range elements, one for
each gene/feature. Each element is a GRanges
object that contains
the positions of the exons or the genes:
## GRanges object with 17 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] X 99883667-99884983 - | 667145 ENSE00001459322
## [2] X 99885756-99885863 - | 667146 ENSE00000868868
## [3] X 99887482-99887565 - | 667147 ENSE00000401072
## [4] X 99887538-99887565 - | 667148 ENSE00001849132
## [5] X 99888402-99888536 - | 667149 ENSE00003554016
## ... ... ... ... . ... ...
## [13] X 99890555-99890743 - | 667156 ENSE00003512331
## [14] X 99891188-99891686 - | 667158 ENSE00001886883
## [15] X 99891605-99891803 - | 667159 ENSE00001855382
## [16] X 99891790-99892101 - | 667160 ENSE00001863395
## [17] X 99894942-99894988 - | 667161 ENSE00001828996
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
It is possible to use these row genomic ranges to subset features in
regions of interest, defined as GRanges
object. Below, for example,
we subset features between bases 1M and 2M on chromosome 7:
## class: RangedSummarizedExperiment
## dim: 34 8
## metadata(1): ''
## assays(1): counts
## rownames(34): ENSG00000002822 ENSG00000073067 ... ENSG00000266391
## ENSG00000273230
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Raw HTS fastq files from different sample are processed into raw counts one by one (see next chapter). These data meed to be merged in a single count matrix, were each column represents a sample, and each line represents a gene. This count matrix will be the starting point of a differential expression analysis.
## sample1 sample2 sample3 sample4 sample5 sample6
## ENSG00000223972 0 0 0 0 0 1
## ENSG00000227232 14 28 17 40 16 13
## ENSG00000278267 8 4 3 1 1 6
## ENSG00000243485 0 0 0 0 0 0
## ENSG00000284332 0 0 0 0 0 0
## ENSG00000237613 0 0 0 0 0 0
► Question
Imagine you had initially 6 samples (3 control and 3 treated samples)
that you have processed independently giving the 6 counts files
generated by FeatureCounts located in the folder
wsbim_data/count_data/
.
## [1] "wsbim2122_data/count_data/sample1_counts.tsv.gz"
## [2] "wsbim2122_data/count_data/sample2_counts.tsv.gz"
## [3] "wsbim2122_data/count_data/sample3_counts.tsv.gz"
## [4] "wsbim2122_data/count_data/sample4_counts.tsv.gz"
## [5] "wsbim2122_data/count_data/sample5_counts.tsv.gz"
## [6] "wsbim2122_data/count_data/sample6_counts.tsv.gz"
Use these files to create a count matrix.
► Solution
The Bioconductor was initiated by Robert Gentleman, one of the two creators of the R language. Bioconductor provides tools for the analysis and comprehension of omics data. Bioconductor uses the R statistical programming language, and is open source and open development.↩︎
Page built: 2023-10-05 using R version 4.3.1 Patched (2023-07-10 r84676)