Motivation: at the end of this chapter, the students will be able to address the following points:
Differential expression analysis is univariate - each gene is tested on its own. This probably doesn’t reflect the underlying biology - genes work in conjunction, not in isolation. One wouldn’t expect that the effect of a drug, or a mutation, … would lead to the perturbation of a single gene expression.
The univariate approach expects (or tests for) significant changes in single genes. Moderate effects in many (related) genes cannot, by definition be identified as statistically significant, even it such an effect may be biologically more plausible that one of a few large ones.
The goal of an enrichment analysis is to test for a group of related genes, called gene sets, and test whether the genes within these sets are enriched for differentially expression.
While nothing would stop a user to create their own gene sets, the sets that are generally used stem from community-maintained resources.
The Gene Ontology (Ashburner et al. 2000Ashburner, M, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium.” Nat Genet 25 (1): 25–29. https://doi.org/10.1038/75556.) defines GO terms. These terms are based on a controlled vocabulary (to resuse the same words in a consistent way) and relations13 that define the directed links between terms. These relations define a hierarchy between GO terms: all term can be represented as a (direct acyclic) graph (DAG). There exist thus very general and very specific terms.
These terms are classed into three categories, called namespaces:
Here’s an example of GO term GO:0007155
, that describes cell adhesion.
A Term from the GO ontology: GO:0007155
Label: cell adhesion
The attachment of a cell, either to another cell or to an underlying
substrate such as the extracellular matrix, via cell adhesion
molecules.
If you look at the GO term GO:0007155
entry on the Gene Ontology
page, you can
find more details and, if you scroll down, example genes that are
annotated with that term.
The whole Gene Ontology is can be accessed in R with the GO.db package.
In the code chunk below, we query the GO.db
package through the
org.Hs.eg.db
interface. This org.db-type of packages for Homo
sapiens enables to perform various queries for human genes, such as
retrieving all gene symbols and ENTREZ identifiers (the columns
below) that are annotated with a GO term (the keys
below) of
interest.
The GO term of interest here is focal adhesion:
A cell-substrate junction that anchors the cell to the extracellular
matrix and that forms a point of termination of actin filaments. In
insects focal adhesion has also been referred to as hemi-adherens
junction (HAJ).
library("GO.db")
library("org.Hs.eg.db")
GO_0005925 <- AnnotationDbi::select(org.Hs.eg.db,
keys = "GO:0005925",
columns = c("ENTREZID", "SYMBOL"),
keytype = "GO") %>%
as_tibble %>%
filter(!duplicated(ENTREZID))
## 'select()' returned 1:many mapping between keys and columns
## # A tibble: 422 × 5
## GO EVIDENCE ONTOLOGY ENTREZID SYMBOL
## <chr> <chr> <chr> <chr> <chr>
## 1 GO:0005925 HDA CC 60 ACTB
## 2 GO:0005925 HDA CC 70 ACTC1
## 3 GO:0005925 ISS CC 71 ACTG1
## 4 GO:0005925 HDA CC 81 ACTN4
## 5 GO:0005925 HDA CC 87 ACTN1
## 6 GO:0005925 IMP CC 88 ACTN2
## 7 GO:0005925 IMP CC 89 ACTN3
## 8 GO:0005925 HDA CC 102 ADAM10
## 9 GO:0005925 HDA CC 118 ADD1
## 10 GO:0005925 HDA CC 214 ALCAM
## # ℹ 412 more rows
We have 422 genes matching this GO term. There are thus 422 genes in the GO_0005925 GO set.
► Question
Repeat the code above to extract the genes annotated with the
GO:0005813
term for the centrosome:
A structure comprised of a core structure (in most organisms, a pair
of centrioles) and peripheral material from which a
microtubule-based structure, such as a spindle apparatus, is
organized. Centrosomes occur close to the nucleus during interphase
in many eukaryotic cells, though in animal cells it changes
continually during the cell-division cycle.
► Solution
KEGG pathway is a collection of manually drawn and curated pathway maps representing current knowledge of the molecular interaction, reaction and relation networks.
The figure below shows the pathways for the cell cycle in humans.
The KEGGREST package provides a client interface to the KEGG server.
Alike KEGG patway, Reactome is a free, open-source, curated and peer-reviewed pathway database. The Bioconductor reactome.db package provides access to reactome maps and annotations within R.
MSigDB is a collection of annotated gene sets for use with GSEA software. The MSigDB gene sets are divided into 9 collections:
The msigdbr CRAN package provides the MSigDB gene sets in a standard R data frame with key-value pairs.
To illustrate enrichment analyses, we will use the DESeq2 results
stored in the res_tbl
variable, computed in the previous
chapter.
We will focus on the genes that have an adjusted p-value (those that have been tested) and that have unique ENTREZ gene identifiers.
res_tbl <- res_tbl %>%
filter(!is.na(ENTREZID),
!is.na(padj),
!duplicated(ENTREZID)) %>%
mutate(ENTREZID = as.character(ENTREZID))
res_tbl
## # A tibble: 13,999 × 9
## ENSEMBL gene ENTREZID baseMean log2FoldChange lfcSE stat pvalue
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000101856 PGRMC1 10857 1209. -4.67 0.178 -26.2 8.24e-152
## 2 ENSG00000177494 ZBED2 79413 277. 6.25 0.351 17.8 6.25e- 71
## 3 ENSG00000136159 NUDT15 55270 630. 2.24 0.154 14.6 3.32e- 48
## 4 ENSG00000175315 CST6 1474 220. 4.29 0.300 14.3 2.20e- 46
## 5 ENSG00000128245 YWHAH 7533 2692. 2.05 0.144 14.2 5.64e- 46
## 6 ENSG00000113272 THG1L 54974 381. 2.33 0.171 13.6 3.45e- 42
## 7 ENSG00000213853 EMP2 2013 506. 2.53 0.191 13.2 8.45e- 40
## 8 ENSG00000115107 STEAP3 55240 193. 3.44 0.271 12.7 7.18e- 37
## 9 ENSG00000008513 ST3GA… 6482 1419. 3.00 0.240 12.5 6.39e- 36
## 10 ENSG00000105855 ITGB8 3696 6134. 2.18 0.178 12.2 2.26e- 34
## # ℹ 13,989 more rows
## # ℹ 1 more variable: padj <dbl>
An over representation analysis relies on the hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the probability of \(k\) successes (random draws for which the object drawn has a specified feature) in \(n\) draws, without replacement, from a finite population of size \(N\) that contains exactly \(K\) objects with that feature, wherein each draw is either a success or a failure.
\[ P(X = k) = \frac{\binom{N}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]
where
The example used to describe the distribution is an urn contain \(N\) marbles of two colours. There are \(K\) green and and \(N-K\) red marbles in the urn. Drawing a green marble is defined as success, and a red marble failure. Using the formula above, we can compute the probability to draw \(k\) green marbles from the urn.
In the frame of an enrichment analysis (Rivals et al. 2007Rivals, I, L Personnaz, L Taing, and M C Potier. 2007. “Enrichment or Depletion of a GO Category Within a Class of Genes: Which Test?” Bioinformatics 23 (4): 401–7. https://doi.org/10.1093/bioinformatics/btl633.), we use the following formulation to calculate a probabiliy that we have more green marbles than we would expect by change, i.e. there to be an over representation of green marbles.
\[ p = 1 - \sum_{i = 0}^{k - 1} \frac{\binom{N}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]
To perform an over representation analysis, we thus need to define:
And fill out the following table and count the number of DE genes that are in the set of interest, the non-DE that are in the set, and the DE and non-DE genes that are not in the set:
GO | not_GO | |
---|---|---|
DE | n | p |
not_DE | m | q |
## DE and GO
n <- length(intersect(res_tbl$ENTREZID[res_tbl$padj < 0.05],
GO_0005925$ENTREZID))
## not DE and GO
m <- length(intersect(res_tbl$ENTREZID[res_tbl$padj >= 0.05],
GO_0005925$ENTREZID))
## DE and not GO
p <- length(setdiff(res_tbl$ENTREZID[res_tbl$padj < 0.05],
GO_0005925$ENTREZID))
## not DE not not GO
q <- length(setdiff(res_tbl$ENTREZID[res_tbl$padj >= 0.05],
GO_0005925$ENTREZID))
cont_mat <- matrix(c(n, m, p, q), nrow = 2)
rownames(cont_mat) <- c("DE", "not_DE")
colnames(cont_mat) <- c("GO", "not_GO")
cont_mat
## GO not_GO
## DE 159 4029
## not_DE 209 9602
We can now apply a Fisher’s exact (or hypergeometric test) that will test whether we can identify a statistically enrichment of DE genes in the GO category.
##
## Fisher's Exact Test for Count Data
##
## data: cont_mat
## p-value = 3.082e-08
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 1.511311 Inf
## sample estimates:
## odds ratio
## 1.812898
Note that we could keep the default alternative = "two.sided"
to
test to over or under representation.
► Question
Repeat the ORA analysis above the GO:0005813 term.
This approach is straightfoward and very fast. Its major drawback however is that we need to define a cutoff to differentiate DE from non-DE genes. Setting this threshold might have a effect on the results.
► Question
Try setting different DE genes and check if, in the cases above, this has and effect on the GO terms of interest.
Gene set enrichment analysis refers to a broad family of tests. Here, we will define the principles based on (Subramanian et al. 2005Subramanian, A, P Tamayo, V K Mootha, S Mukherjee, B L Ebert, M A Gillette, A Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc Natl Acad Sci U S A 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.), keeping in mind that the exact implementation will differ in different tools.
The major advantage of GSEA approaches is that they don’t rely on
defining DE genes. The first step is to order the genes of
interest based on the statistics used; here, we will use the
p-values. This is already the case for our res_tbl
table. We also
need to know which genes are in our set of interest.
res_tbl$inGO <- res_tbl$ENTREZID %in% GO_0005925$ENTREZID
dplyr::select(res_tbl, ENTREZID, padj, inGO)
## # A tibble: 13,999 × 3
## ENTREZID padj inGO
## <chr> <dbl> <lgl>
## 1 10857 1.36e-147 FALSE
## 2 79413 5.17e- 67 FALSE
## 3 55270 1.83e- 44 FALSE
## 4 1474 9.11e- 43 FALSE
## 5 7533 1.87e- 42 FALSE
## 6 54974 9.51e- 39 FALSE
## 7 2013 2.00e- 36 FALSE
## 8 55240 1.48e- 33 FALSE
## 9 6482 1.17e- 32 FALSE
## 10 3696 3.75e- 31 TRUE
## # ℹ 13,989 more rows
We are now going to compute a score by traversing to ordered gene list
and count a positive score when we encounter a gene in the gene set,
and a -1 when the gene is not in the gene set. The positive score,
computed by set_ratio()
below, is defined by \(\frac{n_{genes} - n_{genes~in~set}}{n_{genes~in~set}}\) so that the sum of all genes in
the set and those not in the set becomes zero.
The cumulative sum of these scores along the ordered gene list becomes the GSEA path which is plotted below, and the maximun score obtained along this path is the GSEA score.
gsea_path <- ifelse(res_tbl$inGO, set_ratio(res_tbl), -1)
gsea_path <- cumsum(gsea_path)
gsea_score <- max(gsea_path)
plot(gsea_path, type = "l",
xlab = "Ordered genes (padj)",
ylab = "score",
main = paste0("GSEA score: ", round(gsea_score)))
abline(h = gsea_score, col = "steelblue", lty = "dotted")
To be able to compute a statistical significance, we need to compute a null distribution of GSEA scores, i.e. a distribution of scores reflecting the absence of any enrichment. This is done by permuting the samples in the original data, repeating the statistical analysis, reorder the genes according to the new statistics (we used the adjusted p-value above) and compute a new GSEA score.
## DataFrame with 6 rows and 4 columns
## Cell Type Condition sizeFactor
## <character> <character> <factor> <numeric>
## sample1 Cell1 Epithelial mock 0.726883
## sample2 Cell1 Epithelial mock 1.308664
## sample3 Cell1 Epithelial mock 1.077368
## sample4 Cell1 Epithelial KD 0.966025
## sample5 Cell1 Epithelial KD 1.042111
## sample6 Cell1 Epithelial KD 1.012664
## DataFrame with 6 rows and 4 columns
## Cell Type Condition sizeFactor
## <character> <character> <factor> <numeric>
## sample1 Cell1 Epithelial mock 0.726883
## sample2 Cell1 Epithelial KD 1.308664
## sample3 Cell1 Epithelial KD 1.077368
## sample4 Cell1 Epithelial KD 0.966025
## sample5 Cell1 Epithelial mock 1.042111
## sample6 Cell1 Epithelial mock 1.012664
ensembl_to_geneName <- tibble(ENSEMBL = rownames(dds)) %>%
mutate(gene = AnnotationDbi::mapIds(org.Hs.eg.db, ENSEMBL, "SYMBOL",
"ENSEMBL")) %>%
mutate(ENTREZID = AnnotationDbi::mapIds(org.Hs.eg.db, ENSEMBL, "ENTREZID",
"ENSEMBL"))
res_tmp <- DESeq(dds_tmp) %>%
results(name = "Condition_KD_vs_mock") %>%
as_tibble(rownames = "ENSEMBL") %>%
left_join(ensembl_to_geneName) %>%
mutate(inGO = ENTREZID %in% GO_0005925$ENTREZID) %>%
filter(!is.na(ENTREZID),
!is.na(padj),
!duplicated(ENTREZID)) %>%
dplyr::select(ENSEMBL, ENTREZID, padj, inGO) %>%
arrange(padj)
res_tmp <- res_tmp %>%
mutate(score = ifelse(inGO, set_ratio(res_tmp), -1)) %>%
mutate(score = cumsum(score))
res_tmp
## # A tibble: 22,108 × 5
## ENSEMBL ENTREZID padj inGO score
## <chr> <chr> <dbl> <lgl> <dbl>
## 1 ENSG00000278267 102466751 0.999 FALSE -1
## 2 ENSG00000233750 100420257 0.999 FALSE -2
## 3 ENSG00000236679 728517 0.999 FALSE -3
## 4 ENSG00000230021 101928626 0.999 FALSE -4
## 5 ENSG00000225972 100887749 0.999 FALSE -5
## 6 ENSG00000225630 100652939 0.999 FALSE -6
## 7 ENSG00000237973 107075141 0.999 FALSE -7
## 8 ENSG00000229344 107075310 0.999 FALSE -8
## 9 ENSG00000248527 106480796 0.999 FALSE -9
## 10 ENSG00000198744 107075270 0.999 FALSE -10
## # ℹ 22,098 more rows
## [1] 84.44526
This approach is very time consuming, given that the statistical tests for all the genes need to be recomputed at each permutation. In addition, for experiments with limited number of samples, the number of permutations would be limited.
We start by defining a function that will repeat the steps above and return a list containing the GSEA score and path.
##' A function that assigns the permuted Conditions,
##' runs DESeq and a GSEA analysis
##'
##' @param x a DESeq object.
##' @param perm `character()` of length `ncol(x)` indicating the
##' permutation to be tested.
##'
##' @return `numeric()` containing the GSEA path.
gsea_perm <- function(x, perm) {
## Set the Condition based on the permutation
x$Condition[perm] <- "mock"
x$Condition[-perm] <- "KD"
## Run DESeq2, extract and annotate results
suppressMessages(
tbl <- DESeq(x) %>%
results(name = "Condition_KD_vs_mock") %>%
as_tibble(rownames = "ENSEMBL") %>%
left_join(ensembl_to_geneName) %>%
mutate(inGO = ENTREZID %in% GO_0005925$ENTREZID) %>%
filter(!is.na(ENTREZID),
!is.na(padj),
!duplicated(ENTREZID)) %>%
dplyr::select(ENSEMBL, ENTREZID, padj, inGO) %>%
arrange(padj)
)
## Compute GSEA results
tbl <- tbl %>%
mutate(score = if_else(inGO, set_ratio(tbl), -1)) %>%
mutate(score = cumsum(score))
return(tbl$score)
}
We will execute this function on all unique permutations, corresponding on the 10 first columns (including the actual design, in the first column). Note that columns 11 to 20 are simply the opposite of the first 10.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## [2,] 2 2 2 2 3 3 3 4 4 5 3 3 3 4
## [3,] 3 4 5 6 4 5 6 5 6 6 4 5 6 5
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 2 2 3 3 3 4
## [2,] 4 5 4 4 5 5
## [3,] 6 6 5 6 6 6
We not apply our function on every permutation.
And calculate the scores
If we compare the actual score to the permutation scores, we see that it is the largest one.
An empirical p-value can be computer by dividing the number of null scores that are greater than the real score divided by the number of permutations (the number of null scores). In our case, given that no null scores are greater, the nominal p-value would be 0.
## [1] 0
Below, we illustrate all GSEA path and confirm that the actual score is indeed the largest one.
plot(paths[[1]], type = "l",
ylim = c(min(unlist(paths)), max(scores)),
col = "steelblue")
grid()
for (i in 2:length(paths))
lines(paths[[i]], type = "l", col = "#00000060")
► Question
Given that we have 10 permutations, how many null scores greater that the actual GSEA score do we need for our results to become non significant (assuming we set an alpha of 0.05)?
► Solution
There exist various approaches to the GSEA analysis, that apply different permutation approaches to increate the number of possible permutations and reduce the running time.
► Question
Repeat the GSEA analysis above the GO:0005813 term.
While there is not need to define a list of differentially expressed genes for the GSEA analysis, the genes need to be ordered, which can be done in different ways and lead to different results.
clusterProfiler
package
I practice, the analyses presented above are executed using any of the very many packages that are available. Here, we will demonstrate clusterProfiler14, that itself relies on other packages to perform the GSEA-related computations.
These dedicated packages will perform either of the two tests (or variations thereof) to all sets, adjust the computed p-values for multiplicity, and provide additional details and visualisations for further exploration of the results.
de_genes <- res_tbl$ENTREZID[res_tbl$padj < 0.05]
go_ora <- enrichGO(gene = de_genes,
universe = res_tbl$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "CC",
readable = TRUE) %>%
as_tibble
go_ora
## # A tibble: 19 × 9
## ID Description GeneR…¹ BgRatio pvalue p.adj…² qvalue geneID Count
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
## 1 GO:0022626 cytosolic r… 59/3847 91/120… 9.55e-11 6.51e-8 6.02e-8 RPL27… 59
## 2 GO:0022625 cytosolic l… 37/3847 54/120… 3.40e- 8 1.16e-5 1.07e-5 RPL27… 37
## 3 GO:0044391 ribosomal s… 83/3847 168/12… 1.46e- 6 3.30e-4 3.05e-4 DAP3/… 83
## 4 GO:0005925 focal adhes… 159/38… 367/12… 1.93e- 6 3.30e-4 3.05e-4 ITGB8… 159
## 5 GO:0030055 cell-substr… 160/38… 372/12… 3.01e- 6 4.11e-4 3.80e-4 ITGB8… 160
## 6 GO:0042788 polysomal r… 22/3847 30/120… 3.71e- 6 4.22e-4 3.90e-4 RPL38… 22
## 7 GO:0098857 membrane mi… 98/3847 210/12… 4.49e- 6 4.37e-4 4.04e-4 EMP2/… 98
## 8 GO:0045121 membrane ra… 97/3847 209/12… 6.70e- 6 5.71e-4 5.28e-4 EMP2/… 97
## 9 GO:0005840 ribosome 89/3847 195/12… 3.49e- 5 2.64e-3 2.44e-3 DAP3/… 89
## 10 GO:0022627 cytosolic s… 23/3847 37/120… 1.42e- 4 8.97e-3 8.29e-3 RPS24… 23
## 11 GO:0098984 neuron to n… 99/3847 228/12… 1.45e- 4 8.97e-3 8.29e-3 NETO2… 99
## 12 GO:0009986 cell surface 170/38… 424/12… 1.67e- 4 9.49e-3 8.78e-3 EMP2/… 170
## 13 GO:0015935 small ribos… 33/3847 63/120… 5.64e- 4 2.69e-2 2.48e-2 DAP3/… 33
## 14 GO:0015934 large ribos… 51/3847 108/12… 5.87e- 4 2.69e-2 2.48e-2 MRPL1… 51
## 15 GO:0010008 endosome me… 167/38… 425/12… 5.91e- 4 2.69e-2 2.48e-2 STEAP… 167
## 16 GO:0014069 postsynapti… 89/3847 211/12… 9.49e- 4 3.99e-2 3.69e-2 NETO2… 89
## 17 GO:0032279 asymmetric … 90/3847 214/12… 9.96e- 4 3.99e-2 3.69e-2 NETO2… 90
## 18 GO:0099572 postsynapti… 92/3847 220/12… 1.09e- 3 4.14e-2 3.83e-2 NETO2… 92
## 19 GO:0043235 receptor co… 86/3847 205/12… 1.39e- 3 5.00e-2 4.62e-2 ITGB8… 86
## # … with abbreviated variable names ¹GeneRatio, ²p.adjust
The GeneRatio
and BgRatio
correspond respectively to
the ration between the number of genes of interest in the gene set and the number of (unique) genes-to-interest: \(\frac{|genes\_of\_interest~in~gene\_set|}{|genes\_of\_interest|}\)
the ratio between the number of genes of interest in the gene set and size of the universe, i.e. the number of (unique) genes across all gene sets: \(\frac{|genes~in~gene\_set|}{|universe|}\)
Below, we have 12 genes of interest, "a"
, to "l"
:
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
And two gene sets, "X"
and "Y"
:
## geneset entrez_gene
## 1 X a
## 2 X b
## 3 X c
## 4 X d
## 5 X e
## 6 X f
## 7 X g
## 8 X h
## 9 X i
## 10 X j
## geneset entrez_gene
## 1 Y b
## 2 Y c
## 3 Y d
## 4 Y e
## 5 Y f
## 6 Y g
## 7 Y h
## 8 Y i
## 9 Y j
## 10 Y k
## 11 Y l
## 12 Y m
## 13 Y n
## 14 Y o
## 15 Y p
## 16 Y q
## 17 Y r
## 18 Y s
## 19 Y t
## 20 Y u
## 21 Y v
## 22 Y w
## 23 Y x
## 24 Y y
## 25 Y z
We get:
## GeneRatio BgRatio geneID Count
## X 10/12 10/26 a/b/c/d/e/f/g/h/i/j 10
## Y 11/12 25/26 b/c/d/e/f/g/h/i/j/k/l 11
► Question
Repeat the ORA analysis above using all GO namespaces. See ?enrichGO
for details.
The functions that perform GSEA in clusterProfiler
require the genes
to be ordered in decreasing order. This indicates that the p-values
can’t be used without transformations. One could also use the log2
fold-change or the (absolute value) test statistics.
ordered_genes <- abs(res_tbl$stat)
names(ordered_genes) <- res_tbl$ENTREZID
ordered_genes <- sort(ordered_genes, decreasing = TRUE)
go_gsea <- gseGO(gene = ordered_genes,
OrgDb = org.Hs.eg.db,
scoreType = "pos") %>%
as_tibble
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.62% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## # A tibble: 463 × 11
## ID Descr…¹ setSize enric…² NES pvalue p.adj…³ qvalue rank leadi…⁴
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>
## 1 GO:00308… epithe… 411 0.467 1.36 4.21e-9 2.24e-5 1.75e-5 3284 tags=3…
## 2 GO:00015… blood … 480 0.451 1.32 1.12e-8 2.97e-5 2.33e-5 4093 tags=4…
## 3 GO:00485… blood … 427 0.454 1.33 1.09e-7 1.53e-4 1.20e-4 4059 tags=4…
## 4 GO:00082… negati… 465 0.445 1.30 1.15e-7 1.53e-4 1.20e-4 2478 tags=2…
## 5 GO:00455… negati… 407 0.451 1.32 1.70e-7 1.80e-4 1.41e-4 4659 tags=4…
## 6 GO:00015… angiog… 364 0.459 1.33 4.56e-7 3.03e-4 2.37e-4 4093 tags=4…
## 7 GO:00430… positi… 377 0.458 1.33 3.62e-7 3.03e-4 2.37e-4 3702 tags=4…
## 8 GO:00400… positi… 376 0.450 1.31 4.56e-7 3.03e-4 2.37e-4 5085 tags=5…
## 9 GO:00512… positi… 379 0.453 1.32 6.51e-7 3.85e-4 3.01e-4 5085 tags=5…
## 10 GO:00423… taxis 337 0.458 1.33 9.67e-7 3.95e-4 3.10e-4 4468 tags=4…
## # … with 453 more rows, 1 more variable: core_enrichment <chr>, and abbreviated
## # variable names ¹Description, ²enrichmentScore, ³p.adjust, ⁴leading_edge
► Question
Repeat the GSEA analysis above using all GO namespaces. See
?enrichGO
for details.
► Question
Compare and discuss the use of these different ways to order the genes for the GSEA analysis.
► Question
Compare the GO and GSEA results. Which GO terms are shared or unique to each approach?
► Solution
The enrichKEGG()
and gseKEGG()
functions can be used to perform
the ORA and GSEA analyses againt the KEGG sets.
► Question
Perform ORA and GSEA analysis using the KEGG set.
We will use the hallmark gene set as an example here, but any set described above can be utilised.
library("msigdbr")
msig_h <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, entrez_gene) %>%
dplyr::rename(ont = gs_name, gene = entrez_gene)
msig_h
## # A tibble: 8,209 × 2
## ont gene
## <chr> <int>
## 1 HALLMARK_ADIPOGENESIS 19
## 2 HALLMARK_ADIPOGENESIS 11194
## 3 HALLMARK_ADIPOGENESIS 10449
## 4 HALLMARK_ADIPOGENESIS 33
## 5 HALLMARK_ADIPOGENESIS 34
## 6 HALLMARK_ADIPOGENESIS 35
## 7 HALLMARK_ADIPOGENESIS 47
## 8 HALLMARK_ADIPOGENESIS 50
## 9 HALLMARK_ADIPOGENESIS 51
## 10 HALLMARK_ADIPOGENESIS 112
## # ℹ 8,199 more rows
The msig_h
table can now be used with the enricher()
and GSEA()
functions from the clusterProfiler
package to perform ORA and GSEA
analyses on the MSigDB hallmark gene set.
msig_ora <- enricher(gene = de_genes,
universe = res_tbl$ENTREZID,
TERM2GENE = msig_h) %>%
as_tibble
msig_ora
## # A tibble: 3 × 9
## ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
## 1 HALLMARK_… HALLMARK_H… 82/1211 164/33… 1.42e-4 0.00709 0.00627 10370… 82
## 2 HALLMARK_… HALLMARK_M… 34/1211 58/3347 3.72e-4 0.00931 0.00823 1844/… 34
## 3 HALLMARK_… HALLMARK_A… 53/1211 106/33… 2.13e-3 0.0355 0.0314 3566/… 53
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.62% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## # A tibble: 17 × 11
## ID Description setSize enrichmentScore NES pvalue p.adjust qvalue
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HALLMARK_… HALLMARK_E… 129 0.545 1.54 1.65e-7 8.27e-6 4.70e-6
## 2 HALLMARK_… HALLMARK_H… 164 0.503 1.44 2.94e-6 7.35e-5 4.18e-5
## 3 HALLMARK_… HALLMARK_M… 58 0.600 1.62 1.49e-5 1.90e-4 1.08e-4
## 4 HALLMARK_… HALLMARK_P… 177 0.485 1.39 1.52e-5 1.90e-4 1.08e-4
## 5 HALLMARK_… HALLMARK_G… 175 0.471 1.35 5.25e-5 5.25e-4 2.98e-4
## 6 HALLMARK_… HALLMARK_T… 161 0.475 1.36 1.09e-4 9.07e-4 5.16e-4
## 7 HALLMARK_… HALLMARK_U… 122 0.486 1.37 2.21e-4 1.58e-3 8.98e-4
## 8 HALLMARK_… HALLMARK_K… 124 0.481 1.36 8.01e-4 4.50e-3 2.56e-3
## 9 HALLMARK_… HALLMARK_E… 152 0.465 1.33 8.11e-4 4.50e-3 2.56e-3
## 10 HALLMARK_… HALLMARK_X… 134 0.472 1.34 9.85e-4 4.92e-3 2.80e-3
## 11 HALLMARK_… HALLMARK_A… 106 0.485 1.36 1.32e-3 5.98e-3 3.40e-3
## 12 HALLMARK_… HALLMARK_A… 124 0.463 1.31 3.61e-3 1.50e-2 8.54e-3
## 13 HALLMARK_… HALLMARK_I… 113 0.463 1.30 4.50e-3 1.73e-2 9.84e-3
## 14 HALLMARK_… HALLMARK_T… 50 0.526 1.41 6.21e-3 2.22e-2 1.26e-2
## 15 HALLMARK_… HALLMARK_I… 147 0.440 1.25 7.47e-3 2.49e-2 1.42e-2
## 16 HALLMARK_… HALLMARK_M… 116 0.460 1.30 9.04e-3 2.75e-2 1.56e-2
## 17 HALLMARK_… HALLMARK_E… 159 0.430 1.23 9.34e-3 2.75e-2 1.56e-2
## # ℹ 3 more variables: rank <int>, leading_edge <chr>, core_enrichment <chr>
The clusterProfiler
documentation provides a chapter on the
visualization of functional enrichment
results.
Another useful visualisation, that links the enrichment results back to the whole set of results is to highlight the genes in a particular set of interest on the volcano plot.
sel <- res_tbl$ENTREZID %in% GO_0005925$ENTREZID
plot(res_tbl$log2FoldChange, -log10(res_tbl$padj))
points(res_tbl$log2FoldChange[sel],
-log10(res_tbl$padj)[sel],
col = "red")
grid()
► Question
Extract the Entrez ids from the top hit in the msig_gsea
results
above (available in core_enrichment
) and visualise the results on a
volcano plot.
► Solution
Enrichment analyses are powerful techniques, as they allow to integrate thousands of univariate results into biologically driven sets.
Their interpretation isn’t however always straightforward when different methods (ORA and GSEA) and implementations (different GSEA permutation strategies) or user-set parameters (DE cutoffs in ORA, ordering criteria in GSEA) provide different results.
Example of relations between terms are, for example, is_a (mitosis is_a cell cycle phase) or part_of (mitosis part_of M phase of mitotic cell cycle).↩︎
Page built: 2023-10-05 using R version 4.3.1 Patched (2023-07-10 r84676)