Chapter 3 High-level data structures

Learning objectives

Understand the difference between basic data structures and high-level objects.
Understand the concept of high-level object.
Know how to learn about objects and access data from them.
Learn about how to import Bioconductor objects widely used in omics data analysis.
Become familiar with how objects are implemented.

3.1 Introduction

The term object stems from the widely used object-oriented programming (OOP) paradigm. OOP is however outside of scope of this course (see part III of Wickham (⊕2014Wickham, Hadley. 2014. Advanced R. Chapman & Hall/CRC the r Series. Taylor & Francis.), which is also freely available online). In the R refresher chapter 1, we have seen the base types of R, which we have efficiently used in many situations. Sometimes, however, these aren’t enough.

We have also use object, for example in chapter 2, when generating ggplot visualisation. We have seen that it is possible to store the output of ggplot2 in a variable, that contain all the data and information necessary to produce the figure.

3.2 Why objects?

Combine many parts of a complex data, for example store data (and possibly different forms thereof (see chapter 5) and metadata together.
Objects are self describing: an object to store high-throughput sequencing (HTS) data contains high-throughput sequencing data. Replace high-throughput sequencing data by DNA sequences (see chapter 4), proteomics, … and it will still hold.
Using object enables consistency33 This consistency comes however at the cost of flexibility, hence the importance of well designed objects, that fit several use cases for the type of data at hand.: whoever prepares the HTS data, it will always have the same structure.

If all information is stored in a single object44 This idea is consistent with the notion that you would like to place all the data relevant to an expriment into a single folder., it is easier to save it, to share it with others, and to use it as an input to a function.

The above properties lead to greater robustness and interoperability when working with objects modelling complex data.

Common behaviour (methods) can be specialised to match the particular data type (this is called polymorphism): for example plotting (a vector, a matrix, NGS data), summary, subsetting ([ and [[), …
Another import benefit of using objects to model complex data is abstraction: it is possible to hide all the (unnecessary for its usage) complexity of a particular type of data and present/give access to its most imporant parts in a user-friendly interface and efficient way.
Relations between object can be recorded (this is called inheritance).

To avoid confusion, it is useful to define and separate objects and classes. A class describes (models) the general properties for groups of objects. For example, we could define a class to describe people in this course as follows:

Class person:
- Name: a character
- Surname: a character
- Noma: an optional numeric
- Role: a character (student, teaching assistant, instructor)

An object is an instantiation of a class with specific values in its fields:

Object of class person:
- Name: Laurent
- Surname: Gatto
- Noma: NA
- Role: instructor

► Question

Write down two objects that would describe a teaching assistant and a student.

► Solution

Object of class person:
- Name: Axelle
- Surname: Loriot
- Noma: NA
- Role: teaching assistant

Object of class person:
- Name: Alejandra
- Surname: Sanchez-Cortez
- Noma: 123456
- Role: student

We will focus on objects and their usage here.

3.3 Examples in base R

Let’s start by generating a testing dataset composed of two correlated variables, x and y:

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
df <- data.frame(x, y)
head(df)

##            x          y
## 1 -0.6264538 -1.2468205
## 2  0.1836433  0.2257592
## 3 -0.8356286 -1.7465503
## 4  1.5952808  1.7533096
## 5  0.3295078 -0.3250769
## 6 -0.8204684  0.9468189

plot(y ~ x, df)

Figure 3.1: The x and y variables display a linear relationship.

We are now going to model the linear relation between x and y using the lm (linear model) function. The resulting variable fit is the result of computing the linear regression \(y = \beta_{0} + \beta_{1} x + \epsilon\). When printing the object, we see the call we performed, the intercept (\(\beta_0\)) and the slope (\(\beta_1\)) of the regression.

fit <- lm(y ~ x, df)
fit

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Coefficients:
## (Intercept)            x  
##    -0.03769      0.99894

We can plot the linear regression line with abline and passing the fit variable.

plot(y ~ x, df)
abline(fit, col = "red")

Figure 3.2: The regression line modelling the linear regression between the two x and y variables.

The fit variable is an object of class lm, which we can check with the class function.

class(fit)

## [1] "lm"

It contains much more than simply the intercept and the slope of the regression line.

► Question

Find out what an object of class lm contains by exploring the content of the fit object (try the names and str functions) and reading the lm manual page.

Finally, to demonstrate polymorphism, let’s try to plot the fit variable.

par(mfrow = c(2, 2))
plot(fit)

Figure 3.3: Plotting an instance of class lm produces a set of diagnostic plots, that inform us about the quality of the regression.

3.4 Examples from the Bioconductor project

We have already briefly seen the Bioconductor project. It initiated by Robert Gentleman (Gentleman et al. (⊕2004Gentleman, Robert C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biol 5 (10): –80. https://doi.org/10.1186/gb-2004-5-10-r80.); Huber et al. (⊕2015Huber, W, V J Carey, R Gentleman, S Anders, M Carlson, B S Carvalho, H C Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.)), one of the two creators of the R language, and centrally offers dedicated R packages for bioinformatics.

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, and an active user community.

This video provides a great overview of the project at large.

Bioconductor provides a rich set of classes for omics data, summarised below.

3.4.1 Common Bioconductor classes

Rectangular feature x sample data – SummarizedExperiment::SummarizedExperiment() (RNAseq count matrix, microarray, …)
Genomic coordinates – GenomicRanges::GRanges() (1-based, closed interval)
DNA / RNA / AA sequences – Biostrings::*StringSet()
Gene sets – GSEABase::GeneSet() GSEABase::GeneSetCollection()
Multi-omics data – MultiAssayExperiment::MultiAssayExperiment()
Single cell data – SingleCellExperiment::SingleCellExperiment()
Mass spectrometry data – Spectra::Spectra()
Quantitative proteomics data – QFeatures::QFeatures()

3.4.2 The `SummarizedExperiment` class

Let’s start by installing55 Before installing the package, we first check if they are available with require. If they are missing, the call to require will return FALSE, otherwise it returns TRUE and loads the packages. Given that we negate the return value of require using !, the packages will be installed only when missing. theSummarizedExperiment package, that implements the class, and the airway data package, that provides an RNA-Seq example data.

if (!require("SummarizedExperiment"))
    BiocManager::install("SummarizedExperiment")
if (!require("airway"))
    BiocManager::install("airway")

If you know you have the packages, it is more straightforward to directly load the packages.

library("SummarizedExperiment")
library("airway")

The airway package provides a data for an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethason published by Himes et al. (⊕2014Himes, B E, X Jiang, P Wagner, R Hu, Q Wang, B Klanderman, R M Whitaker, et al. 2014. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene That Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One 9 (6): e99625. https://doi.org/10.1371/journal.pone.0099625.). It contains the RNA expression data for 64102 transcripts and 8 samples. These 8 samples correspond to treated and untreated condition in 4 different cell lines.

Objects of the class SummarizedExperiment contain one or more assays (i.e. sets if quantitative expresson data), each represented by a matrix-like object (typically) of numeric mode. The rows represent features such as genes, transcripts, proteins or more generally genomic ranges and the columns represent samples. The figure below represents its anatomy in greater details.

Figure 3.4: Schematic representation of the anatomy if a SummarizedExperiment object. (Figure taken from the SummarizedExperiment package vignette.)

Schematic representation of the anatomy if a `SummarizedExperiment` object. (Figure taken from the `SummarizedExperiment` package vignette.)

The sample (columns) metadata can be accessed with the colData() function.
The features (rows) metadata can be accessed with the rowData() column.
If the features represent ranges along genomic coordinates, these can be accessed with rowRanges()
Additional metadata describing the overall experiment can be accessed with metadata().
The quantiative data can be accessed with assay().
assays() returns a list of matrix-like assays.

► Question

Load the airway data and display it in your R console. Make sure you understand all parts in the short text summary that is displayed. What class is airway of.

► Question

Extract the quantitative assay data, the rows and columns metadata, the row ranges and the experiment metadata. What are the classes of these individual parts of the SummarizedExperiment object.

► Solution

data(airway)
airway

## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

class(airway)

## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"

assay(airway)

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
## ENSG00000000971       3251       3679       6177       4252       6721
## ENSG00000001036       1433       1062       1733        881       1424
## ENSG00000001084        519        380        595        493        820
## ENSG00000001167        394        236        464        175        658
## ENSG00000001460        172        168        264        118        241
## ENSG00000001461       2112       1867       5137       2657       2735
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0
## ENSG00000000971      11027       5176       7995
## ENSG00000001036       1439       1359       1109
## ENSG00000001084        714        696        704
## ENSG00000001167        584        360        269
## ENSG00000001460        210        155        177
## ENSG00000001461       2751       2467       2905
##  [ reached 'max' / getOption("max.print") -- omitted 63665 rows ]

class(assay(airway))

## [1] "matrix" "array"

colData(airway)

## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677

class(colData(airway))

## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"

rowData(airway)

## DataFrame with 63677 rows and 10 columns
##                         gene_id     gene_name  entrezid   gene_biotype
##                     <character>   <character> <integer>    <character>
## ENSG00000000003 ENSG00000000003        TSPAN6        NA protein_coding
## ENSG00000000005 ENSG00000000005          TNMD        NA protein_coding
## ENSG00000000419 ENSG00000000419          DPM1        NA protein_coding
## ENSG00000000457 ENSG00000000457         SCYL3        NA protein_coding
## ENSG00000000460 ENSG00000000460      C1orf112        NA protein_coding
## ...                         ...           ...       ...            ...
## ENSG00000273489 ENSG00000273489 RP11-180C16.1        NA      antisense
## ENSG00000273490 ENSG00000273490        TSEN34        NA protein_coding
## ENSG00000273491 ENSG00000273491  RP11-138A9.2        NA        lincRNA
##                 gene_seq_start gene_seq_end              seq_name seq_strand
##                      <integer>    <integer>           <character>  <integer>
## ENSG00000000003       99883667     99894988                     X         -1
## ENSG00000000005       99839799     99854882                     X          1
## ENSG00000000419       49551404     49575092                    20         -1
## ENSG00000000457      169818772    169863408                     1         -1
## ENSG00000000460      169631245    169823221                     1          1
## ...                        ...          ...                   ...        ...
## ENSG00000273489      131178723    131182453                     7         -1
## ENSG00000273490       54693789     54697585 HSCHR19LRC_LRC_J_CTG1          1
## ENSG00000273491      130600118    130603315          HG1308_PATCH          1
##                 seq_coord_system        symbol
##                        <integer>   <character>
## ENSG00000000003               NA        TSPAN6
## ENSG00000000005               NA          TNMD
## ENSG00000000419               NA          DPM1
## ENSG00000000457               NA         SCYL3
## ENSG00000000460               NA      C1orf112
## ...                          ...           ...
## ENSG00000273489               NA RP11-180C16.1
## ENSG00000273490               NA        TSEN34
## ENSG00000273491               NA  RP11-138A9.2
##  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

class(rowData(airway))

## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"

rowRanges(airway)

## GRangesList object of length 63677:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome
## 
## ...
## <63676 more elements>

class(rowRanges(airway))

## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"

The row and column metadata are stored as a special data.frame implementation66 Like the tibble, that is an implememtation specific to the tidyverse packages., that has been developed in the Bioconductor project. It behaves like like a classical data.frame.

► Question

What biological sample names in the airway dataset were treated with dexamethason? Which ones weren’t?

► Solution

dex_treated <- airway$dex == "trt"
## shorthand for colData(airway)$dex == "trt"
## Treated samples
colData(airway)[dex_treated, "BioSample"]

## [1] SAMN02422675 SAMN02422670 SAMN02422673 SAMN02422677
## 8 Levels: SAMN02422669 SAMN02422670 SAMN02422673 SAMN02422675 ... SAMN02422683

## Control samples
colData(airway)[!dex_treated, "BioSample"]

## [1] SAMN02422669 SAMN02422678 SAMN02422682 SAMN02422683
## 8 Levels: SAMN02422669 SAMN02422670 SAMN02422673 SAMN02422675 ... SAMN02422683

It is also possible to use the tidyverse for this by first coercing the DataFrame to a tibble.

as_tibble(colData(airway)) %>%
    filter(dex == "trt") %>%
    dplyr::select(BioSample)

## # A tibble: 4 × 1
##   BioSample   
##   <fct>       
## 1 SAMN02422675
## 2 SAMN02422670
## 3 SAMN02422673
## 4 SAMN02422677

as_tibble(colData(airway)) %>%
    filter(dex != "trt") %>%
    dplyr::select(BioSample)

## # A tibble: 4 × 1
##   BioSample   
##   <fct>       
## 1 SAMN02422669
## 2 SAMN02422678
## 3 SAMN02422682
## 4 SAMN02422683

► Question

What are the dimensions of the assay data, and the row and column metadata? What can you way about these?

► Solution

dim(airway)

## [1] 63677     8

dim(colData(airway))

## [1] 8 9

dim(rowData(airway))

## [1] 63677    10

3.4.3 Subsetting `SummarizedExperiment` objects

As mentioned in the introduction, it is possible to redefine common functions to have a special behaviour with objects. This is the case for the [ operator, that will automatically subset all parts of the object. Below, we create a new instance of class SummarizedExperiment that contains only the 5 first features for the 3 first samples.

se <- airway[1:5, 1:3]
se

## class: RangedSummarizedExperiment 
## dim: 5 3 
## metadata(1): ''
## assays(1): counts
## rownames(5): ENSG00000000003 ENSG00000000005 ENSG00000000419
##   ENSG00000000457 ENSG00000000460
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(3): SRR1039508 SRR1039509 SRR1039512
## colData names(9): SampleName cell ... Sample BioSample

► Question

Verfiy that the assay data, the column and row metadata and the row ranges have been preserved.

Subsetting can be performed with numeric vectors (as shown above), as well as character of logical vectors, just like with dataframes or matrices.

► Question

Create a new object that contains the data for features (rows) "ENSG00000166963", "ENSG00000267970" and "ENSG00000254663" and samples (columns) "SRR1039517" and "SRR1039521".

► Solution

airway[c("ENSG00000166963", "ENSG00000267970", "ENSG00000254663"),
       c("SRR1039517", "SRR1039521")]

## class: RangedSummarizedExperiment 
## dim: 3 2 
## metadata(1): ''
## assays(1): counts
## rownames(3): ENSG00000166963 ENSG00000267970 ENSG00000254663
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(2): SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

The row ranges are defined as a list of genomic ranges (GRanges objects), one genomic range per row in the SummarizedExperiment. This GRangesList object behaves like a list, albeit a list that specifically contains GRanges objects.

If we look at the first feature, corresponding to gene ENSG00000000003, we see that is is composed of 17 exons (ENSE00001459322 to ENSE00001828996) and their detailed positions on chromosome X.

rowRanges(airway)[[1]]

## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome

The position of genes can also be used to subset a SummarizedExperiment. In the example below, we select all features that overlap with positions 100000 to 1100000 on chromosome 1. To do this we first create a new GRanges object that matches the region of interest:

roi <- GRanges(seqnames="1", ranges=100000:1100000)
roi

## GRanges object with 1 range and 0 metadata columns:
##       seqnames         ranges strand
##          <Rle>      <IRanges>  <Rle>
##   [1]        1 100000-1100000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can now use roi to extract the feature of interest with the subsetByOverlaps function, which returns a new SummarizedExperiment object that contains 74 features.

se <- subsetByOverlaps(airway, roi)
se

## class: RangedSummarizedExperiment 
## dim: 74 8 
## metadata(1): ''
## assays(1): counts
## rownames(74): ENSG00000131591 ENSG00000177757 ... ENSG00000272512
##   ENSG00000273443
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

► Question

How many genes are present on chromosome X between position 100000 et 1000000.

► Solution

roi <- GRanges(seqnames = "X", ranges = 1e5:1e6)
nrow(subsetByOverlaps(airway, roi))

## [1] 11

► Question

Visit the Ensembl page and look up the Ensembl gene identifier (ENSG00...) of the BRCA1 gene. Verify if these match the information in the airway data. If not, why could this be? How many exons are there in the airway data for this gene?

► Solution

gr <- rowRanges(airway["ENSG00000012048", ])
gr

## GRangesList object of length 1:
## $ENSG00000012048
## GRanges object with 99 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]       17 41196312-41197819      - |    550890 ENSE00001814242
##    [2]       17 41196313-41197819      - |    550891 ENSE00001312675
##    [3]       17 41196822-41197819      - |    550892 ENSE00001831829
##    [4]       17 41197580-41197819      - |    550893 ENSE00002914501
##    [5]       17 41197646-41197819      - |    550894 ENSE00001937547
##    ...      ...               ...    ... .       ...             ...
##   [95]       17 41277294-41277376      - |    550984 ENSE00001896310
##   [96]       17 41277294-41277387      - |    550985 ENSE00001878709
##   [97]       17 41277294-41277419      - |    550986 ENSE00001869098
##   [98]       17 41277294-41277467      - |    550987 ENSE00001954850
##   [99]       17 41277294-41277468      - |    550988 ENSE00001601762
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome

length(gr[[1]])

## [1] 99

3.4.4 Constructing a `SummarizedExperiment`

Complex object such as SummarizedExperiment are generally constructed by functions (called constructors) from files. It is possible to construct a SummarizedExperiment by hand. To do so, we first need to construct the different parts of the object.

Let’s create an expression matrix for 200 genes and 6 samples by randomly sample values between 1 and 10000.

nrows <- 200
ncols <- 6
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)

Below, we create a the colData part using the appropriate constructor for the class of data needed:

cd <- DataFrame(Treatment = rep(c("ChIP", "Input"), 3),
                row.names = LETTERS[1:6])

Let’s also create some ranges. Here, we will have one range per features, as opposed to the airway data, that has multiple ranges (exons) per feature (gene). The ranges below to chromosomes 1 and 2 (50 and 150 respectively), with ranges (constructure with the IRanges constructor) starting between 1e5 and 1e6 with width 100, and strand chosing randomly as the positive or negative strand.

rng <- GRanges(rep(c("chr1", "chr2"), c(50, 150)),
               IRanges(floor(runif(200, 1e5, 1e6)), width = 100),
               strand = sample(c("+", "-"), 200, TRUE))

We can now put the different parts together using the SummarizedExperiment constructor:

se <- SummarizedExperiment(assays = list(counts = counts),
                           colData = cd,
                           rowRanges = rng)
se

## class: RangedSummarizedExperiment 
## dim: 200 6 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(6): A B ... E F
## colData names(1): Treatment

We can also create a rowData object and add it to our se object:

rowData(se) <- DataFrame(some_value = sample(200))
se

## class: RangedSummarizedExperiment 
## dim: 200 6 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(1): some_value
## colnames(6): A B ... E F
## colData names(1): Treatment

3.5 Additional exercises

► Question

Using the function kem2.tsv from the rWSBIM1207 package, import the count and annotation data into R and use them to create an object of class SummarizedExperiment.

Page built: 2025-07-22 using R version 4.5.0 (2025-04-11)

Chapter 3 High-level data structures

3.1 IntroductionCopy link

3.2 Why objects?Copy link

3.3 Examples in base RCopy link

3.4 Examples from the Bioconductor projectCopy link

3.4.1 Common Bioconductor classesCopy link

3.4.2 The SummarizedExperiment classCopy link

3.4.3 Subsetting SummarizedExperiment objectsCopy link

3.4.4 Constructing a SummarizedExperiment Copy link

3.5 Additional exercisesCopy link

3.1 Introduction

3.2 Why objects?

3.3 Examples in base R

3.4 Examples from the Bioconductor project

3.4.1 Common Bioconductor classes

3.4.2 The `SummarizedExperiment` class

3.4.3 Subsetting `SummarizedExperiment` objects

3.4.4 Constructing a `SummarizedExperiment`

3.5 Additional exercises