Objectives
Introduce the notion of data containers
Give an overview of the SummarizedExperiment
, extensively used in omics analyses
Data in bioinformatics is often complex. To deal with this, developers define specialised data containers (termed classes) that match the properties of the data they need to handle.
This aspect is central to the Bioconductor10 The Bioconductor was initiated by Robert Gentleman, one of the two creators of the R language. Bioconductor provides tools dedicated to omics data analysis. Bioconductor uses the R statistical programming language, and is open source and open development. project which uses the same core data infrastructure across packages. This certainly contributed to Bioconductor’s success. Bioconductor package developers are advised to make use of existing infrastructure to provide coherence, interoperability and stability to the project as a whole.
To illustrate such an omics data container, we’ll present the SummarizedExperiment
class.
The figure below represents the anatomy of SummarizedExperiment.
Objects of the class SummarizedExperiment contain :
One (or more) assay(s) containing the quantitative omics data (expression data), stored as a matrix-like object. Features (genes, transcripts, proteins, …) are defined along the rows and samples along the columns.
A sample metadata slot containing sample co-variates, stored as a data frame. Rows from this table represent samples (rows match exactly the columns of the expression data).
A feature metadata slot containing feature co-variates, stored as data frame. The rows of this dataframe’s match exactly the rows of the expression data.
The coordinated nature of the SummarizedExperiment guarantees that during data manipulation, the dimensions of the different slots will always match (i.e the columns in the expression data and then rows in the sample metadata, as well as the rows in the expression data and feature metadata) during data manipulation. For example, if we had to exclude one sample from the assay, it would be automatically removed from the sample metadata in the same operation.
The metadata slots can grow additional co-variates (columns) without affecting the other structures.
Remember the rna
dataset that we have used previously.
From this table we have already created 3 different tables.
1:5, ] count_matrix[
## GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## Asl 1170 361 400 586 626 988
## Apod 36194 10347 9173 10620 13021 29594
## Cyp2d22 4060 1616 1603 1901 2171 3349
## Klk6 287 629 641 578 448 195
## Fcrls 85 233 244 237 180 38
## GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
## Asl 836 535 586 597 938 1035
## Apod 24959 13668 13230 15868 27769 34301
## Cyp2d22 3122 2008 2254 2277 2985 3452
## Klk6 186 1101 537 567 327 233
## Fcrls 68 375 199 177 89 67
## GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
## Asl 494 481 666 937 803 541
## Apod 11258 11812 15816 29242 20415 13682
## Cyp2d22 1883 2014 2417 3678 2920 2216
## Klk6 742 881 828 250 798 710
## Fcrls 300 233 231 81 303 285
## GSM2545354 GSM2545362 GSM2545363 GSM2545380
## Asl 473 748 576 1192
## Apod 11088 15916 11166 38148
## Cyp2d22 1821 2842 2011 4019
## Klk6 894 501 598 259
## Fcrls 248 179 184 68
dim(count_matrix)
## [1] 1474 22
sample_metadata
## # A tibble: 22 × 9
## sample organism age sex infection strain time tissue mouse
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 GSM2545336 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebel… 14
## 2 GSM2545337 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebel… 9
## 3 GSM2545338 Mus musculus 8 Female NonInfected C57BL/6 0 Cerebel… 10
## 4 GSM2545339 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebel… 15
## 5 GSM2545340 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebel… 18
## 6 GSM2545341 Mus musculus 8 Male InfluenzaA C57BL/6 8 Cerebel… 6
## 7 GSM2545342 Mus musculus 8 Female InfluenzaA C57BL/6 8 Cerebel… 5
## 8 GSM2545343 Mus musculus 8 Male NonInfected C57BL/6 0 Cerebel… 11
## 9 GSM2545344 Mus musculus 8 Female InfluenzaA C57BL/6 4 Cerebel… 22
## 10 GSM2545345 Mus musculus 8 Male InfluenzaA C57BL/6 4 Cerebel… 13
## # … with 12 more rows
gene_metadata
## # A tibble: 1,474 × 9
## gene ENTREZID product ensembl_gene_id external_synonym chromosome_name
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Asl 109900 argininosu… ENSMUSG0000002… 2510006M18Rik 5
## 2 Apod 11815 apolipopro… ENSMUSG0000002… <NA> 16
## 3 Cyp2d22 56448 cytochrome… ENSMUSG0000006… 2D22 15
## 4 Klk6 19144 kallikrein… ENSMUSG0000005… Bssp 7
## 5 Fcrls 80891 Fc recepto… ENSMUSG0000001… 2810439C17Rik 3
## 6 Slc2a4 20528 solute car… ENSMUSG0000001… Glut-4 11
## 7 Exd2 97827 exonucleas… ENSMUSG0000003… 4930539P14Rik 12
## 8 Gjc2 118454 gap juncti… ENSMUSG0000004… B230382L12Rik 11
## 9 Plp1 18823 proteolipi… ENSMUSG0000003… DM20 X
## 10 Gnb4 14696 guanine nu… ENSMUSG0000002… 6720453A21Rik 3
## # … with 1,464 more rows, and 3 more variables: gene_biotype <chr>,
## # phenotype_description <chr>, hsapiens_homolog_associated_gene_name <chr>
We will create a SummarizedExperiment
from these tables:
The count matrix that will be used as the assay
The table describing the samples will be used as the sample metadata slot
The table describing the genes will be used as the features metadata slot
To do this we can put the different parts together using the
SummarizedExperiment
constructor:
#BiocManager::install("SummarizedExperiment")
library("SummarizedExperiment")
<- SummarizedExperiment(assays = count_matrix,
se colData = sample_metadata,
rowData = gene_metadata)
se
## class: SummarizedExperiment
## dim: 1474 22
## metadata(0):
## assays(1): ''
## rownames(1474): Asl Apod ... Pbx1 Rgs4
## rowData names(9): gene ENTREZID ... phenotype_description
## hsapiens_homolog_associated_gene_name
## colnames(22): GSM2545336 GSM2545337 ... GSM2545363 GSM2545380
## colData names(9): sample organism ... tissue mouse
Using this data structure, we can access the expression matrix with
the assay
function:
head(assay(se))
## GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
## Asl 1170 361 400 586 626 988
## Apod 36194 10347 9173 10620 13021 29594
## Cyp2d22 4060 1616 1603 1901 2171 3349
## Klk6 287 629 641 578 448 195
## Fcrls 85 233 244 237 180 38
## Slc2a4 782 231 248 265 313 786
## GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
## Asl 836 535 586 597 938 1035
## Apod 24959 13668 13230 15868 27769 34301
## Cyp2d22 3122 2008 2254 2277 2985 3452
## Klk6 186 1101 537 567 327 233
## Fcrls 68 375 199 177 89 67
## Slc2a4 528 249 266 357 654 693
## GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
## Asl 494 481 666 937 803 541
## Apod 11258 11812 15816 29242 20415 13682
## Cyp2d22 1883 2014 2417 3678 2920 2216
## Klk6 742 881 828 250 798 710
## Fcrls 300 233 231 81 303 285
## Slc2a4 271 304 349 715 513 320
## GSM2545354 GSM2545362 GSM2545363 GSM2545380
## Asl 473 748 576 1192
## Apod 11088 15916 11166 38148
## Cyp2d22 1821 2842 2011 4019
## Klk6 894 501 598 259
## Fcrls 248 179 184 68
## Slc2a4 248 350 317 796
dim(assay(se))
## [1] 1474 22
We can access the sample metadata using the colData
function:
colData(se)
## DataFrame with 22 rows and 9 columns
## sample organism age sex infection
## <character> <character> <numeric> <character> <character>
## GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA
## GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA
## ... ... ... ... ... ...
## GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
## GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
## GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA
## GSM2545363 GSM2545363 Mus musculus 8 Male InfluenzaA
## GSM2545380 GSM2545380 Mus musculus 8 Female InfluenzaA
## strain time tissue mouse
## <character> <numeric> <character> <numeric>
## GSM2545336 C57BL/6 8 Cerebellum 14
## GSM2545337 C57BL/6 0 Cerebellum 9
## GSM2545338 C57BL/6 0 Cerebellum 10
## GSM2545339 C57BL/6 4 Cerebellum 15
## GSM2545340 C57BL/6 4 Cerebellum 18
## ... ... ... ... ...
## GSM2545353 C57BL/6 0 Cerebellum 4
## GSM2545354 C57BL/6 0 Cerebellum 2
## GSM2545362 C57BL/6 4 Cerebellum 20
## GSM2545363 C57BL/6 4 Cerebellum 12
## GSM2545380 C57BL/6 8 Cerebellum 19
dim(colData(se))
## [1] 22 9
We can also access the feature metadata using the rowData
function:
head(rowData(se))
## DataFrame with 6 rows and 9 columns
## gene ENTREZID product ensembl_gene_id
## <character> <numeric> <character> <character>
## Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533
## Apod Apod 11815 apolipoprotein D tra.. ENSMUSG00000022548
## Cyp2d22 Cyp2d22 56448 cytochrome P450 fami.. ENSMUSG00000061740
## Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063
## Fcrls Fcrls 80891 Fc receptor-like S s.. ENSMUSG00000015852
## Slc2a4 Slc2a4 20528 solute carrier famil.. ENSMUSG00000018566
## external_synonym chromosome_name gene_biotype phenotype_description
## <character> <character> <character> <character>
## Asl 2510006M18Rik 5 protein_coding abnormal circulating..
## Apod NA 16 protein_coding abnormal lipid homeo..
## Cyp2d22 2D22 15 protein_coding abnormal skin morpho..
## Klk6 Bssp 7 protein_coding abnormal cytokine le..
## Fcrls 2810439C17Rik 3 protein_coding decreased CD8-positi..
## Slc2a4 Glut-4 11 protein_coding abnormal circulating..
## hsapiens_homolog_associated_gene_name
## <character>
## Asl ASL
## Apod APOD
## Cyp2d22 CYP2D6
## Klk6 KLK6
## Fcrls FCRL4
## Slc2a4 SLC2A4
dim(rowData(se))
## [1] 1474 9
SummarizedExperiment can be subset just like with data frames, with numerics or with characters of logicals.
Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.
<- se[1:5, 1:3]
se1 se1
## class: SummarizedExperiment
## dim: 5 3
## metadata(0):
## assays(1): ''
## rownames(5): Asl Apod Cyp2d22 Klk6 Fcrls
## rowData names(9): gene ENTREZID ... phenotype_description
## hsapiens_homolog_associated_gene_name
## colnames(3): GSM2545336 GSM2545337 GSM2545338
## colData names(9): sample organism ... tissue mouse
colData(se1)
## DataFrame with 3 rows and 9 columns
## sample organism age sex infection
## <character> <character> <numeric> <character> <character>
## GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## strain time tissue mouse
## <character> <numeric> <character> <numeric>
## GSM2545336 C57BL/6 8 Cerebellum 14
## GSM2545337 C57BL/6 0 Cerebellum 9
## GSM2545338 C57BL/6 0 Cerebellum 10
rowData(se1)
## DataFrame with 5 rows and 9 columns
## gene ENTREZID product ensembl_gene_id
## <character> <numeric> <character> <character>
## Asl Asl 109900 argininosuccinate ly.. ENSMUSG00000025533
## Apod Apod 11815 apolipoprotein D tra.. ENSMUSG00000022548
## Cyp2d22 Cyp2d22 56448 cytochrome P450 fami.. ENSMUSG00000061740
## Klk6 Klk6 19144 kallikrein related-p.. ENSMUSG00000050063
## Fcrls Fcrls 80891 Fc receptor-like S s.. ENSMUSG00000015852
## external_synonym chromosome_name gene_biotype phenotype_description
## <character> <character> <character> <character>
## Asl 2510006M18Rik 5 protein_coding abnormal circulating..
## Apod NA 16 protein_coding abnormal lipid homeo..
## Cyp2d22 2D22 15 protein_coding abnormal skin morpho..
## Klk6 Bssp 7 protein_coding abnormal cytokine le..
## Fcrls 2810439C17Rik 3 protein_coding decreased CD8-positi..
## hsapiens_homolog_associated_gene_name
## <character>
## Asl ASL
## Apod APOD
## Cyp2d22 CYP2D6
## Klk6 KLK6
## Fcrls FCRL4
We can also use the colData()
function to subset on something from the sample metadata, or the rowData()
to subset on something from the feature metadata.
For example, here we keep only miRNAs and the non infected samples:
<- se[rowData(se)$gene_biotype == "miRNA",
se1 colData(se)$infection == "NonInfected"]
se1
## class: SummarizedExperiment
## dim: 7 7
## metadata(0):
## assays(1): ''
## rownames(7): Mir1901 Mir378a ... Mir128-1 Mir7682
## rowData names(9): gene ENTREZID ... phenotype_description
## hsapiens_homolog_associated_gene_name
## colnames(7): GSM2545337 GSM2545338 ... GSM2545353 GSM2545354
## colData names(9): sample organism ... tissue mouse
assay(se1)
## GSM2545337 GSM2545338 GSM2545343 GSM2545348 GSM2545349 GSM2545353
## Mir1901 45 44 74 55 68 33
## Mir378a 11 7 9 4 12 4
## Mir133b 4 6 5 4 6 7
## Mir30c-2 10 6 16 12 8 17
## Mir149 1 2 0 0 0 0
## Mir128-1 4 1 2 2 1 2
## Mir7682 2 0 4 1 3 5
## GSM2545354
## Mir1901 60
## Mir378a 8
## Mir133b 3
## Mir30c-2 15
## Mir149 2
## Mir128-1 1
## Mir7682 5
colData(se1)
## DataFrame with 7 rows and 9 columns
## sample organism age sex infection
## <character> <character> <numeric> <character> <character>
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## GSM2545343 GSM2545343 Mus musculus 8 Male NonInfected
## GSM2545348 GSM2545348 Mus musculus 8 Female NonInfected
## GSM2545349 GSM2545349 Mus musculus 8 Male NonInfected
## GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
## GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
## strain time tissue mouse
## <character> <numeric> <character> <numeric>
## GSM2545337 C57BL/6 0 Cerebellum 9
## GSM2545338 C57BL/6 0 Cerebellum 10
## GSM2545343 C57BL/6 0 Cerebellum 11
## GSM2545348 C57BL/6 0 Cerebellum 8
## GSM2545349 C57BL/6 0 Cerebellum 7
## GSM2545353 C57BL/6 0 Cerebellum 4
## GSM2545354 C57BL/6 0 Cerebellum 2
rowData(se1)
## DataFrame with 7 rows and 9 columns
## gene ENTREZID product ensembl_gene_id
## <character> <numeric> <character> <character>
## Mir1901 Mir1901 100316686 microRNA 1901 ENSMUSG00000084565
## Mir378a Mir378a 723889 microRNA 378a ENSMUSG00000105200
## Mir133b Mir133b 723817 microRNA 133b ENSMUSG00000065480
## Mir30c-2 Mir30c-2 723964 microRNA 30c-2 ENSMUSG00000065567
## Mir149 Mir149 387167 microRNA 149 ENSMUSG00000065470
## Mir128-1 Mir128-1 387147 microRNA 128-1 ENSMUSG00000065520
## Mir7682 Mir7682 102466847 microRNA 7682 ENSMUSG00000106406
## external_synonym chromosome_name gene_biotype phenotype_description
## <character> <character> <character> <character>
## Mir1901 Mirn1901 18 miRNA NA
## Mir378a Mirn378 18 miRNA abnormal mitochondri..
## Mir133b mir 133b 1 miRNA no abnormal phenotyp..
## Mir30c-2 mir 30c-2 1 miRNA NA
## Mir149 Mirn149 1 miRNA increased circulatin..
## Mir128-1 Mirn128 1 miRNA no abnormal phenotyp..
## Mir7682 mmu-mir-7682 1 miRNA NA
## hsapiens_homolog_associated_gene_name
## <character>
## Mir1901 NA
## Mir378a MIR378A
## Mir133b MIR133B
## Mir30c-2 MIR30C2
## Mir149 NA
## Mir128-1 MIR128-1
## Mir7682 NA
For the following exercise, you should download the SE.rda object
(that contains the se
object), and open the file using the
‘load()’ function.
download.file(url = "https://raw.githubusercontent.com/UCLouvain-CBIO/bioinfo-training-01-intro-r/master/data/SE.rda",
destfile = "data/SE.rda")
load(file = "data/SE.rda")
► Question
Extract the gene expression levels of the 3 first genes in samples at time 0 and at time 8.
► Solution
We can also add information to the metadata. Suppose that you want to add the center where the samples were collected…
colData(se)$center <- rep("University of Illinois", nrow(colData(se)))
colData(se)
## DataFrame with 22 rows and 10 columns
## sample organism age sex infection
## <character> <character> <numeric> <character> <character>
## GSM2545336 GSM2545336 Mus musculus 8 Female InfluenzaA
## GSM2545337 GSM2545337 Mus musculus 8 Female NonInfected
## GSM2545338 GSM2545338 Mus musculus 8 Female NonInfected
## GSM2545339 GSM2545339 Mus musculus 8 Female InfluenzaA
## GSM2545340 GSM2545340 Mus musculus 8 Male InfluenzaA
## ... ... ... ... ... ...
## GSM2545353 GSM2545353 Mus musculus 8 Female NonInfected
## GSM2545354 GSM2545354 Mus musculus 8 Male NonInfected
## GSM2545362 GSM2545362 Mus musculus 8 Female InfluenzaA
## GSM2545363 GSM2545363 Mus musculus 8 Male InfluenzaA
## GSM2545380 GSM2545380 Mus musculus 8 Female InfluenzaA
## strain time tissue mouse center
## <character> <numeric> <character> <numeric> <character>
## GSM2545336 C57BL/6 8 Cerebellum 14 University of Illinois
## GSM2545337 C57BL/6 0 Cerebellum 9 University of Illinois
## GSM2545338 C57BL/6 0 Cerebellum 10 University of Illinois
## GSM2545339 C57BL/6 4 Cerebellum 15 University of Illinois
## GSM2545340 C57BL/6 4 Cerebellum 18 University of Illinois
## ... ... ... ... ... ...
## GSM2545353 C57BL/6 0 Cerebellum 4 University of Illinois
## GSM2545354 C57BL/6 0 Cerebellum 2 University of Illinois
## GSM2545362 C57BL/6 4 Cerebellum 20 University of Illinois
## GSM2545363 C57BL/6 4 Cerebellum 12 University of Illinois
## GSM2545380 C57BL/6 8 Cerebellum 19 University of Illinois
This illustrates that the metadata slots can grow indefinitely without affecting the other structures!
Take-home message
SummarizedExperiment
represent an efficient way to store and to handle omics data.
They are used in many Bioconductor packages.
If you follow next training focused on RNA sequencing analysis, you will learn to
use the Bioconductor DESeq2
package to do some differential expression analyses.
DESeq2
’s whole analysis is handled in a SummarizedExperiment
.
Page built: 2021-10-01 using R version 4.1.1 (2021-08-10)