williams2020_tmt.Rmd
Williams et al. present an auto-sampling device to interface between the nanoPOTS processing workflow and the LC-MS/MS. This allows for increased automation of the technology that is critical for advancing the field to real single-cell proteomics applications.
The authors did not provided the code required to fully reproduce their analysis. We therefore based our replication solely on the experimental section. The authors mention:
For TMT-based quantification, corrected reporter ion intensities were extracted. Reporter ion intensities from single cells were normalized to the reference channel containing 0.2 ng of apeptide mixture from the three AML cell lines at PSM level using MaxQuant (v. 1.6.12.0). To minimize the batch effect from multiple TMT experiments, the relative abundances from 19 TMT plexes were log2-transformed and the data matrices from all of the TMT experiments were combined after sequentially centering the column and row values to their median values. A criterion of >70% valid values and at least two identified peptides per protein were required to generate the “quantifiable’ protein list. The data matrix was further analyzed by Perseus for statistical analysis including princip[al] component analysis (PCA) and heatmap analysis.
Furthermore, the processed data by the authors is not provided, so we will benchmark the replication based on the figures in the article.
The authors however distribute the MaxQuant output that we have
already converted to the scp
data framework (Vanderaa and Gatto (2021)). See the scp
vignette for more information about the framework. The
QFeatures
object containing the data sets from Williams et
al. is available from the scpdata
package. Let’s load the
required packages before starting the replication.
We can now load the multiplexed (TMT) SCP data set from Williams et
al. This is performed by calling the williams2020_tmt()
function from scpdata
(Vanderaa and
Gatto (2022)).
(williams <- williams2020_tmt())
## An instance of class QFeatures containing 4 assays:
## [1] peptides_intensity: SingleCellExperiment with 17966 rows and 209 columns
## [2] peptides_corrected: SingleCellExperiment with 17966 rows and 209 columns
## [3] proteins_intensity: SingleCellExperiment with 2646 rows and 209 columns
## [4] proteins_corrected: SingleCellExperiment with 2646 rows and 209 columns
The data contain 4 different SingleCellExperiment
objects that we refer to as assays. Each assay contains
expression data along with feature metadata. Each row in an assay
represents a feature that is either a peptide (assay
1-2) or a protein (assay 3-4). The protein and peptide data are split in
two assays, containing either the summed peptide/protein intensities, or
the corrected (reference normalization) quantifications. We also store
the relationships between features of different assays. For instance,
peptides_intensity
are linked to
proteins_intensity
. We can plot these relationships and get
an overveiew of the data set.
plot(williams)
Each column (n = 209) in an assay represents a
sample. We can easily retrieve the sample annotations
from the colData
.
colData(williams)
## DataFrame with 209 rows and 5 columns
## Channel SampleType ChannelIndex Amount Batch
## <character> <character> <character> <character> <character>
## 1.A1 TMT126 Boost 1 10ng A1
## 2.A1 TMT127N Reference 2 0.2ng A1
## 3.A1 TMT127C Empty 3 0 A1
## 4.A1 TMT128N MOLM-14 4 1cell A1
## 5.A1 TMT128C K562 5 1cell A1
## ... ... ... ... ... ...
## 7.C4 TMT129C MOLM-14 7 1cell C4
## 8.C4 TMT130N K562 8 1cell C4
## 9.C4 TMT130C CMK 9 1cell C4
## 10.C4 TMT131N MOLM-14 10 1cell C4
## 11.C4 TMT131C K562 11 1cell C4
table(williams$SampleType)
##
## Boost CMK Empty K562 MOLM-14 Reference
## 19 38 19 57 57 19
Most of the samples are single cells, either CMK, K562 or MOLM-14.
The remaining assays are either booster channels (Boost
)
with 10ng of protein lysate used to enhance peptide identification,
reference channels
(Reference') with 0.2ng of protein lysate used for reference normalization by MaxQuant, and empty channels ('Empty
)
with no material but containing spillover signal from the booster
channel.
The first step of the workflow is to remove reverse hits and
contaminants. Before performing this step, we subsest the data to only
single-cell samples. This subset is performed on the Amount
variable from the colData
, so we use
subsetByColData()
.
(williams <- subsetByColData(williams, williams$Amount == "1cell"))
## An instance of class QFeatures containing 4 assays:
## [1] peptides_intensity: SingleCellExperiment with 17966 rows and 152 columns
## [2] peptides_corrected: SingleCellExperiment with 17966 rows and 152 columns
## [3] proteins_intensity: SingleCellExperiment with 2646 rows and 152 columns
## [4] proteins_corrected: SingleCellExperiment with 2646 rows and 152 columns
We are left with a data set containing 152 single-cells. We can now
filter out the reverse hits and contaminants. The required information
is contained in the rowData
of each assay
(Reverse
and Potential.contaminant
,
respectively). We use filterFeatures()
to filter on the
rowData
.
(williams <- filterFeatures(williams,
~ Reverse != "+" &
Potential.contaminant != "+"))
## An instance of class QFeatures containing 4 assays:
## [1] peptides_intensity: SingleCellExperiment with 17735 rows and 152 columns
## [2] peptides_corrected: SingleCellExperiment with 17735 rows and 152 columns
## [3] proteins_intensity: SingleCellExperiment with 2583 rows and 152 columns
## [4] proteins_corrected: SingleCellExperiment with 2583 rows and 152 columns
We are left with 17,735 peptides and 2,583 proteins that pass the quality control.
The authors next mention they log2-transformed the protein data. They
however do not mention how they deal with 0 values that will become
infinite after any logarithmic transformation. In this replication
study, we replace 0’s with missing values using
zeroIsNA()
.
williams <- zeroIsNA(williams, i = "proteins_corrected")
We then apply logTransform()
on the
proteins_corrected
assay containing the protein data
normalized by MaxQuant.
(williams <- logTransform(williams,
i = "proteins_corrected",
name = "proteins_log",
base = 2))
## An instance of class QFeatures containing 5 assays:
## [1] peptides_intensity: SingleCellExperiment with 17735 rows and 152 columns
## [2] peptides_corrected: SingleCellExperiment with 17735 rows and 152 columns
## [3] proteins_intensity: SingleCellExperiment with 2583 rows and 152 columns
## [4] proteins_corrected: SingleCellExperiment with 2583 rows and 152 columns
## [5] proteins_log: SingleCellExperiment with 2583 rows and 152 columns
We stored the results in a new assay called
proteins_log
.
The authors then further normalize the data by sequentially
centering the column and row values to their median
values. Centering columns to the median is readily available in
normalizeSCP()
.
williams <- normalizeSCP(williams,
i = "proteins_log",
name = "proteins_norm",
method = "center.median")
The author mention they perform row centering for each batch
separately. This functionality is not available from the package we rely
on. So, we’ll perform this step manually. First, we extract the
normalized assay along with the associated colData
using
getWithColData()
. <>
sceNorm <- getWithColData(williams, "proteins_norm")
## Warning: 'experiments' dropped; see 'metadata'
Then, we build a for loop that iterates through each batch and centers the rows based on the median within each batch.
for (batch in sceNorm$Batch) {
ind <- which(sceNorm$Batch == batch)
rowMeds <- rowMedians(assay(sceNorm[, ind]), na.rm = TRUE)
assay(sceNorm[, ind]) <- sweep(assay(sceNorm[, ind]), FUN = "-",
STATS = rowMeds, MARGIN = 1)
}
Finally, we add the row-normalized assay back in the data set and add the relationships from the previous assay to this new assay.
williams <- addAssay(williams, sceNorm, name = "proteins_bc")
williams <- addAssayLinkOneToOne(williams, from = "proteins_norm",
to = "proteins_bc")
The authors remove proteins that have more than 30 % missing data. We
use filterNA()
and provide the threshold of maximum 30%
missing values.
williams <- filterNA(williams, i = "proteins_bc", pNA = 0.3)
The authors next keep only proteins that are identified with at least
2 peptides. We can retrieve which proteins are made of several peptides
from the rowData
of the peptides_corrected
assay. The peptides were mapped to the protein data thanks to the
Leading.razor.protein
. All proteins that are present at
least 2 are kept.
## Get `Leading.razor.protein` from the peptide `rowData`
prots <- rowData(williams[["peptides_corrected"]])$Leading.razor.protein
## Get all proteins that are present at least twice
prots <- unique(prots[duplicated(prots)])
## Keep only proteins that were kept after missing data filtering
prots <- prots[prots %in% rownames(williams)[["proteins_bc"]]]
head(prots)
## [1] "TPM3_HUMAN" "XPO2_HUMAN" "PTMA_HUMAN" "MIC19_HUMAN" "RL1D1_HUMAN"
## [6] "TLN1_HUMAN"
We generated a vector with the proteins identified at least twice. We
can now use it to subset the data set. Note that a
QFeatures
object can be seen as a three-order array: \(features \times samples \times assay\).
Hence, QFeatures
supports three-order subsetting
x[rows, columns, assays]
. The great advantage of this
subsetting is that all peptides linked to these desired proteins are
also retrieved thanks to the relationships stored between proteins and
peptides.
(williams <- williams[prots, , ])
## An instance of class QFeatures containing 7 assays:
## [1] peptides_intensity: SingleCellExperiment with 12538 rows and 152 columns
## [2] peptides_corrected: SingleCellExperiment with 12538 rows and 152 columns
## [3] proteins_intensity: SingleCellExperiment with 1197 rows and 152 columns
## [4] proteins_corrected: SingleCellExperiment with 1197 rows and 152 columns
## [5] proteins_log: SingleCellExperiment with 1197 rows and 152 columns
## [6] proteins_norm: SingleCellExperiment with 1197 rows and 152 columns
## [7] proteins_bc: SingleCellExperiment with 1197 rows and 152 columns
This step concludes the quantitative data processing. We update the
QFeatures
plot that provides an overview of the
processing.
plot(williams)
The authors used the processed data to perform two types of downstream analysis: single-cell correlation analysis and principal component analysis. We here provide the published figure that we aim to reproduce in this vignette.
To run the downstream analysis, we only need the last assay generated
during the quantitative data processing. We extract the assay as well as
the associated sample annotations using
getWithColData()
.
sce <- getWithColData(williams, "proteins_bc")
## Warning: 'experiments' dropped; see 'metadata'
## Warning: Ignoring redundant column names in 'colData(x)':
We first compute the correlation matrix. We use pairwise
We use the ComplexHeatmap
package to plot the heatmap.
This package allows easy annotation of the plot, namely labeling each
sample with its sample type and channel.
cols <- c(CMK = "green4", `K562` = "dodgerblue", `MOLM-14` = "red3")
colAnnot <- columnAnnotation(Population = sce$SampleType,
Channel = sce$Channel,
col = list(Population = cols))
We can now plot the heatmap.
Heatmap(corMat, name = "Pearson correlation",
show_row_names = FALSE, show_column_names = FALSE,
top_annotation = colAnnot)
Like in the figure from the article, we observe a positive correlation between samples from the same cell type. We also observe with sample type clusters. While this is not shown in the original figure, we can clearly link this effect to TMT channel bias. This should be corrected for, but a the poor experimental design causes sample type and TMT channel to be confounded, and hence entangling the TMT effects with bioligical effects.
PCA can directly be computed from a SingleCellExperiment
object thanks to the scater
package. We therefore use
runPCA()
. Note that we also impute the missing values with
zeros as PCA does not support missing value.
sce <- runPCA(sce, exprs_values = 1)
We generate the PCA plots using plotPCA()
, and use
different colouring schemes to highlight different samples
annotations.
plotPCA(sce, colour_by = "SampleType")
The sample types can be differentiated on the PCA plot, although the CMK and MOLM-14 slightly overlap what is not observed in the published work. Another small difference is that the first to PC explaine 15 % of the variance while the authors report 22 % explained variance.
We were able to reproduce the general results from the analysis by Williams et al. (2020). However, we notice small discrepancies indicating small differences between workflows. Without a script or the intermediate results from the authors, we are unable to pinpoint which steps differ.
You can reproduce this vignette using Docker
:
docker pull cvanderaa/scp_replication_docker:v1
docker run \
-e PASSWORD=bioc \
-p 8787:8787 \
cvanderaa/scp_replication_docker:v1
Open your browser and go to http://localhost:8787. The USER is rstudio
and the password is bioc
. You can find the vignette in the
vignettes
folder.
See the website home page for more information.
The system details of the machine that built the vignette are:
## Machine: Linux (5.15.0-48-generic)
## R version: R.4.2.1 (svn: 82513)
## RAM: 16.5 GB
## CPU: 16 core(s) - 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] benchmarkme_1.0.8 scater_1.25.7
## [3] ggplot2_3.3.6 scuttle_1.7.4
## [5] SingleCellExperiment_1.19.1 ComplexHeatmap_2.13.1
## [7] scp_1.7.4 scpdata_1.5.4
## [9] ExperimentHub_2.5.0 AnnotationHub_3.5.2
## [11] BiocFileCache_2.5.0 dbplyr_2.2.1
## [13] QFeatures_1.7.3 MultiAssayExperiment_1.23.9
## [15] SummarizedExperiment_1.27.3 Biobase_2.57.1
## [17] GenomicRanges_1.49.1 GenomeInfoDb_1.33.7
## [19] IRanges_2.31.2 S4Vectors_0.35.4
## [21] BiocGenerics_0.43.4 MatrixGenerics_1.9.1
## [23] matrixStats_0.62.0 BiocStyle_2.25.0
##
## loaded via a namespace (and not attached):
## [1] circlize_0.4.15 BiocBaseUtils_0.99.12
## [3] systemfonts_1.0.4 igraph_1.3.5
## [5] lazyeval_0.2.2 BiocParallel_1.31.12
## [7] digest_0.6.29 foreach_1.5.2
## [9] htmltools_0.5.3 magick_2.7.3
## [11] viridis_0.6.2 fansi_1.0.3
## [13] magrittr_2.0.3 memoise_2.0.1
## [15] ScaledMatrix_1.5.1 cluster_2.1.4
## [17] doParallel_1.0.17 Biostrings_2.65.6
## [19] pkgdown_2.0.6 colorspace_2.0-3
## [21] ggrepel_0.9.1 blob_1.2.3
## [23] rappdirs_0.3.3 textshaping_0.3.6
## [25] xfun_0.33 dplyr_1.0.10
## [27] crayon_1.5.2 RCurl_1.98-1.9
## [29] jsonlite_1.8.2 iterators_1.0.14
## [31] glue_1.6.2 gtable_0.3.1
## [33] zlibbioc_1.43.0 XVector_0.37.1
## [35] GetoptLong_1.0.5 DelayedArray_0.23.2
## [37] BiocSingular_1.13.1 shape_1.4.6
## [39] scales_1.2.1 DBI_1.1.3
## [41] Rcpp_1.0.9 viridisLite_0.4.1
## [43] xtable_1.8-4 clue_0.3-61
## [45] bit_4.0.4 rsvd_1.0.5
## [47] MsCoreUtils_1.9.1 httr_1.4.4
## [49] RColorBrewer_1.1-3 ellipsis_0.3.2
## [51] farver_2.1.1 pkgconfig_2.0.3
## [53] sass_0.4.2 utf8_1.2.2
## [55] labeling_0.4.2 tidyselect_1.1.2
## [57] rlang_1.0.6 later_1.3.0
## [59] AnnotationDbi_1.59.1 munsell_0.5.0
## [61] BiocVersion_3.16.0 tools_4.2.1
## [63] cachem_1.0.6 cli_3.4.1
## [65] generics_0.1.3 RSQLite_2.2.17
## [67] evaluate_0.16 stringr_1.4.1
## [69] fastmap_1.1.0 yaml_2.3.5
## [71] ragg_1.2.3 knitr_1.40
## [73] bit64_4.0.5 fs_1.5.2
## [75] purrr_0.3.4 KEGGREST_1.37.3
## [77] AnnotationFilter_1.21.0 sparseMatrixStats_1.9.0
## [79] mime_0.12 compiler_4.2.1
## [81] beeswarm_0.4.0 filelock_1.0.2
## [83] curl_4.3.2 png_0.1-7
## [85] interactiveDisplayBase_1.35.0 tibble_3.1.8
## [87] bslib_0.4.0 stringi_1.7.8
## [89] highr_0.9 desc_1.4.2
## [91] lattice_0.20-45 ProtGenerics_1.29.0
## [93] Matrix_1.5-1 vctrs_0.4.2
## [95] pillar_1.8.1 lifecycle_1.0.2
## [97] BiocManager_1.30.18 jquerylib_0.1.4
## [99] GlobalOptions_0.1.2 BiocNeighbors_1.15.1
## [101] cowplot_1.1.1 bitops_1.0-7
## [103] irlba_2.3.5.1 httpuv_1.6.6
## [105] R6_2.5.1 bookdown_0.29
## [107] promises_1.2.0.1 gridExtra_2.3
## [109] vipor_0.4.5 codetools_0.2-18
## [111] benchmarkmeData_1.0.4 MASS_7.3-58.1
## [113] assertthat_0.2.1 rprojroot_2.0.3
## [115] rjson_0.2.21 withr_2.5.0
## [117] GenomeInfoDbData_1.2.9 parallel_4.2.1
## [119] beachmat_2.13.4 rmarkdown_2.16
## [121] DelayedMatrixStats_1.19.1 Cairo_1.6-0
## [123] shiny_1.7.2 ggbeeswarm_0.6.0
This vignette is distributed under a CC BY-SA licence licence.