scp
vignettes/advanced.Rmd
advanced.Rmd
This vignette is dedicated to advanced users and to method
developers. It assumes that you are already familiar with
QFeatures
and scp
and that you are looking for
more flexibility in the analysis of your single-cell proteomics (SCP)
data. In fact, scp
provides wrapper functions around
generic functions and metrics. However, advanced users may want to apply
or develop their own features. The QFeatures
class offers a
flexible data container while guaranteeing data consistency.
In this vignette, you will learn how to:
scp
As a general guideline, you can add/remove/update data in a
QFeatures
in 4 main steps:
QFeatures
object.QFeatures
object is still
valid.To illustrate the different topics, we will load the
scp1
example data.
library(scp)
data("scp1")
scp1
#> An instance of class QFeatures containing 5 assays:
#> [1] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 166 rows and 11 columns
#> [2] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 176 rows and 11 columns
#> [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 215 rows and 16 columns
#> [4] peptides: SingleCellExperiment with 539 rows and 38 columns
#> [5] proteins: SingleCellExperiment with 292 rows and 38 columns
To illustrate how to modify quantitative data, we will implement a
normByType
function that will normalize the feature (row)
for each cell type separately. This function is probably not relevant
for a real case analysis, but it provides a good example of a custom
data processing. The process presented in this section is applicable to
any custom function that takes at least a matrix-like
object as input and returns a matrix-like object as output.
normByType <- function(x, type) {
## Check argument
stopifnot(length(type) == ncol(x))
## Normalize for each type separately
for (i in unique(type)) {
## Get normalization factor
nf <- rowMedians(x[, type == i], na.rm = TRUE)
## Perform normalization
x[, type == i] <- x[, type == i] / nf
}
## Return normalized data
x
}
Suppose we want to apply the function to the proteins
assay, we need to first extract that assay. We here need to transfer the
sample annotations from the QFeatures
object to the
extracted SingleCellExperiment
in order to get the sample
types required by the normByType
function. We therefore use
getWithColData
.
sce <- getWithColData(scp1, "proteins")
sce
#> class: SingleCellExperiment
#> dim: 292 38
#> metadata(0):
#> assays(2): assay aggcounts
#> rownames(292): A1A519 A5D8V6 ... REV__CON__Q32PI4 REV__CON__Q3MHN5
#> rowData names(9): protein Match.time.difference ...
#> Potential.contaminant .n
#> colnames(38): 190321S_LCA10_X_FP97AG_RI1 190321S_LCA10_X_FP97AG_RI2 ...
#> 190914S_LCB3_X_16plex_Set_21_RI15 190914S_LCB3_X_16plex_Set_21_RI16
#> colData names(7): Set Channel ... sortday digest
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
Next, we can apply the data transformation to the quantitative data.
As mentioned above, our function expects a matrix-like object as an
input, so we use the assay
function. We then update the
SingleCellExperiment
object.
We are now faced with 2 possibilities: either we want to create a new assay or we want to overwrite an existing assay. In both cases we need to make sure your data is still valid after data transformation.
Creating a new assay has the advantage that you don’t modify an existing assay and hence limit the risk of introducing inconsistency in the data and avoid losing intermediate steps of the data processing.
We add the transformed assay using the addAssay
function, then link the parent assay to the transformed assay using
addAssayLinkOneToOne
. Note that if each row name in the
parent assay does not match exactly one row in the child assay, you can
also use addAssayLink
that will require a linking variable
in the rowData
.
scp1 <- addAssay(scp1, sce, name = "proteinsNorm")
scp1 <- addAssayLinkOneToOne(scp1, from = "proteins", to = "proteinsNorm")
scp1
#> An instance of class QFeatures containing 6 assays:
#> [1] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 166 rows and 11 columns
#> [2] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 176 rows and 11 columns
#> [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 215 rows and 16 columns
#> [4] peptides: SingleCellExperiment with 539 rows and 38 columns
#> [5] proteins: SingleCellExperiment with 292 rows and 38 columns
#> [6] proteinsNorm: SingleCellExperiment with 292 rows and 38 columns
Overwriting an existing assay has the advantage to limit the memory consumption as compared to adding a new assay. You can overwrite an assay simply by replacing the target assay in its corresponding slot.
scp1[["proteins"]] <- sce
Applying custom changes to the data increases the risk for data
inconsistencies. You can however verify that a QFeatures
object is still valid after some processing thanks to the
validObject
function.
validObject(scp1)
#> [1] TRUE
If the function detects no issues in the data, it will return
TRUE
. Otherwise the function will throw an informative
error that should guide the user to identifying the issue.
To illustrate how to modify the sample annotations, we will compute
the average expression in each sample and include to the
colData
of the QFeatures
object. This is
typically performed when computing QC metrics for sample filtering. So,
we first extract the required data, in this case the quantitative
values, and compute the sample-wise average protein expression.
m <- assay(scp1, "proteins")
meanExprs <- colMeans(m, na.rm = TRUE)
meanExprs
#> 190321S_LCA10_X_FP97AG_RI1 190321S_LCA10_X_FP97AG_RI2
#> 45368.3628 7566.8782
#> 190321S_LCA10_X_FP97AG_RI3 190321S_LCA10_X_FP97AG_RI4
#> 3271.0402 2158.4230
#> 190321S_LCA10_X_FP97AG_RI5 190321S_LCA10_X_FP97AG_RI6
#> 1162.3936 1835.7939
#> 190321S_LCA10_X_FP97AG_RI7 190321S_LCA10_X_FP97AG_RI8
#> 1293.4738 1415.4205
#> 190321S_LCA10_X_FP97AG_RI9 190321S_LCA10_X_FP97AG_RI10
#> 1965.6531 1264.4064
#> 190321S_LCA10_X_FP97AG_RI11 190222S_LCA9_X_FP94BM_RI1
#> 1238.5979 48901.4654
#> 190222S_LCA9_X_FP94BM_RI2 190222S_LCA9_X_FP94BM_RI3
#> 5189.5912 2802.1850
#> 190222S_LCA9_X_FP94BM_RI4 190222S_LCA9_X_FP94BM_RI5
#> 1204.2350 715.7667
#> 190222S_LCA9_X_FP94BM_RI6 190222S_LCA9_X_FP94BM_RI7
#> 1032.7122 403.8267
#> 190222S_LCA9_X_FP94BM_RI8 190222S_LCA9_X_FP94BM_RI9
#> 1089.0548 1076.2613
#> 190222S_LCA9_X_FP94BM_RI10 190222S_LCA9_X_FP94BM_RI11
#> 1572.3337 1186.4431
#> 190914S_LCB3_X_16plex_Set_21_RI1 190914S_LCB3_X_16plex_Set_21_RI2
#> 96881.8350 2731.9430
#> 190914S_LCB3_X_16plex_Set_21_RI3 190914S_LCB3_X_16plex_Set_21_RI4
#> 8208.1958 208.2857
#> 190914S_LCB3_X_16plex_Set_21_RI5 190914S_LCB3_X_16plex_Set_21_RI6
#> 3209.7730 2014.4889
#> 190914S_LCB3_X_16plex_Set_21_RI7 190914S_LCB3_X_16plex_Set_21_RI8
#> 2897.1103 3006.6741
#> 190914S_LCB3_X_16plex_Set_21_RI9 190914S_LCB3_X_16plex_Set_21_RI10
#> 2760.8564 2549.9348
#> 190914S_LCB3_X_16plex_Set_21_RI11 190914S_LCB3_X_16plex_Set_21_RI12
#> 2388.5289 2533.8006
#> 190914S_LCB3_X_16plex_Set_21_RI13 190914S_LCB3_X_16plex_Set_21_RI14
#> 2134.5615 3125.1980
#> 190914S_LCB3_X_16plex_Set_21_RI15 190914S_LCB3_X_16plex_Set_21_RI16
#> 3276.5618 3195.8827
Next, we insert the computed averages into the colData
.
You need to make sure to match sample names because an extracted assay
may not contain all samples and may be in a different order compared to
the colData
.
The new sample variable meanProtExprs
is now accessible
for filtering or plotting. The $
operator makes it
straightforward to access the new data.
To make sure that the process did not corrupt the
colData
, we advise to verify the data is still valid.
validObject(scp1)
#> [1] TRUE
We will again illustrate how to modify the feature annotations with an example. We here demonstrate how to add the number of samples in which each feature is detected for the three first assays (PSM assays). The challenge here is that the metric needs to be computed for each assay separately and inserted in the corresponding assay.
We will take advantage of the replacement function for
rowData
as implemented in QFeatures
. It
expects a list-like object where names indicate in which assays we want
to modify the rowData
and each element contains a table
with the replacement values.
We therefore compute the metrics for each assay separately and store
the results in a named List
.
## Initialize the List object that will store the computed values
res <- List()
## We compute the metric for the first 3 assays
for (i in 1:3) {
## We get the quantitative values for the current assay
m <- assay(scp1[[i]])
## We compute the number of samples in which each features is detected
n <- rowSums(!is.na(m) & m != 0)
## We store the result as a DataFrame in the List
res[[i]] <- DataFrame(nbSamples = n)
}
names(res) <- names(scp1)[1:3]
res
#> List of length 3
#> names(3): 190321S_LCA10_X_FP97AG 190222S_LCA9_X_FP94BM 190914S_LCB3_X_16plex_Set_21
res[[1]]
#> DataFrame with 166 rows and 1 column
#> nbSamples
#> <numeric>
#> PSM3773 11
#> PSM9078 11
#> PSM9858 11
#> PSM11744 11
#> PSM21752 0
#> ... ...
#> PSM732069 11
#> PSM735396 11
#> PSM744756 10
#> PSM745037 11
#> PSM745130 11
Now that we have a List
of DataFrame
s, we
can update the object.
rowData(scp1) <- res
The new feature variable nbSamples
is now accessible for
filtering or plotting. The rbindRowData
function
facilitates the access the new data.
library("ggplot2")
rd <- rbindRowData(scp1, i = 1:3)
ggplot(data.frame(rd)) +
aes(x = nbSamples) +
geom_histogram(bins = 16) +
facet_wrap(~ assay)
To make sure that the process did not corrupt the
rowData
in any assay, we advise to verify the data is still
valid.
validObject(scp1)
#> [1] TRUE
scp
The modifying data in a QFeatures
involves a
multiple-step process. Creating a wrapper function that would take care
of those different steps in a single line of code is a good habit to
reduce the length of analysis scripts and hence making it easier to
understand and less error-prone.
We will wrap the last example in a new function that we call
computeNbDetectedSamples
.
computeNbDetectedSamples <- function(object, i) {
res <- List()
for (ii in i) {
m <- assay(object[[ii]])
n <- rowSums(!is.na(m) & m != 0)
res[[ii]] <- DataFrame(nbSamples = n)
}
names(res) <- names(object)[i]
rowData(object) <- res
stopifnot(validObject(object))
object
}
Thanks to this new function, the previous section now simply boils down to running:
scp1 <- computeNbDetectedSamples(scp1, i = 1:3)
Keep in mind a few recommendations when creating a new function for
scp
:
QFeatures
object as input
(and more arguments if needed) and return a QFeatures
object as output. This will make workflows much easier to
understand.i
argument, selecting
rowData
variables is passed through rowvars
and selecting colData
variables is passed through
colvars
.rformassspectrometry
coding
style
So you developed a new metric or method and believe it might benefit
the community? We would love to hear about your improvements and
eventually include your new functionality into scp
or
associate your new package to our documentation. Please, raise an issue
in our Github repository to suggest your improvements or, better, submit
your code as a pull
request.
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_3.5.1 scp_1.15.1
[3] QFeatures_1.15.2 MultiAssayExperiment_1.31.4
[5] SummarizedExperiment_1.35.1 Biobase_2.65.0
[7] GenomicRanges_1.57.1 GenomeInfoDb_1.41.1
[9] IRanges_2.39.2 S4Vectors_0.43.2
[11] BiocGenerics_0.51.0 MatrixGenerics_1.17.0
[13] matrixStats_1.3.0 BiocStyle_2.33.1
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 farver_2.1.2
[3] dplyr_1.1.4 bitops_1.0-8
[5] fastmap_1.2.0 SingleCellExperiment_1.27.2
[7] lazyeval_0.2.2 RCurl_1.98-1.16
[9] nipals_0.8 digest_0.6.36
[11] lifecycle_1.0.4 cluster_2.1.6
[13] ProtGenerics_1.37.0 magrittr_2.0.3
[15] compiler_4.4.1 rlang_1.1.4
[17] sass_0.4.9 tools_4.4.1
[19] igraph_2.0.3 utf8_1.2.4
[21] yaml_2.3.10 knitr_1.48
[23] labeling_0.4.3 S4Arrays_1.5.5
[25] htmlwidgets_1.6.4 DelayedArray_0.31.10
[27] plyr_1.8.9 RColorBrewer_1.1-3
[29] abind_1.4-5 withr_3.0.0
[31] purrr_1.0.2 desc_1.4.3
[33] grid_4.4.1 fansi_1.0.6
[35] colorspace_2.1-1 scales_1.3.0
[37] MASS_7.3-61 cli_3.6.3
[39] rmarkdown_2.27 crayon_1.5.3
[41] ragg_1.3.2 generics_0.1.3
[43] metapod_1.13.0 httr_1.4.7
[45] reshape2_1.4.4 BiocBaseUtils_1.7.0
[47] cachem_1.1.0 stringr_1.5.1
[49] zlibbioc_1.51.1 AnnotationFilter_1.29.0
[51] BiocManager_1.30.23 XVector_0.45.0
[53] vctrs_0.6.5 Matrix_1.7-0
[55] slam_0.1-51 jsonlite_1.8.8
[57] bookdown_0.40 IHW_1.33.0
[59] ggrepel_0.9.5 clue_0.3-65
[61] systemfonts_1.1.0 jquerylib_0.1.4
[63] tidyr_1.3.1 glue_1.7.0
[65] pkgdown_2.1.0.9000 stringi_1.8.4
[67] gtable_0.3.5 UCSC.utils_1.1.0
[69] munsell_0.5.1 lpsymphony_1.33.0
[71] tibble_3.2.1 pillar_1.9.0
[73] htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
[75] R6_2.5.1 textshaping_0.4.0
[77] evaluate_0.24.0 lattice_0.22-6
[79] highr_0.11 bslib_0.8.0
[81] fdrtool_1.2.17 Rcpp_1.0.13
[83] SparseArray_1.5.27 xfun_0.46
[85] MsCoreUtils_1.17.0 fs_1.6.4
[87] pkgconfig_2.0.3
This vignette is distributed under a CC BY-SA license license.