Reproducing the multiplexed SCP analysis by Williams et al. 2020

Introduction

Williams et al. present an auto-sampling device to interface between the nanoPOTS processing workflow and the LC-MS/MS. This allows for increased automation of the technology that is critical for advancing the field to real single-cell proteomics applications.

The authors did not provided the code required to fully reproduce their analysis. We therefore based our replication solely on the experimental section. The authors mention:

For TMT-based quantification, corrected reporter ion intensities were extracted. Reporter ion intensities from single cells were normalized to the reference channel containing 0.2 ng of apeptide mixture from the three AML cell lines at PSM level using MaxQuant (v. 1.6.12.0). To minimize the batch effect from multiple TMT experiments, the relative abundances from 19 TMT plexes were log2-transformed and the data matrices from all of the TMT experiments were combined after sequentially centering the column and row values to their median values. A criterion of >70% valid values and at least two identified peptides per protein were required to generate the “quantifiable’ protein list. The data matrix was further analyzed by Perseus for statistical analysis including princip[al] component analysis (PCA) and heatmap analysis.

Furthermore, the processed data by the authors is not provided, so we will benchmark the replication based on the figures in the article.

The authors however distribute the MaxQuant output that we have already converted to the scp data framework (Vanderaa and Gatto (2021)). See the scp vignette for more information about the framework. The QFeatures object containing the data sets from Williams et al. is available from the scpdata package. Let’s load the required packages before starting the replication.

library(scpdata)
library(scp)
library(ComplexHeatmap)
library(scater)

Load the data

We can now load the multiplexed (TMT) SCP data set from Williams et al. This is performed by calling the williams2020_tmt() function from scpdata (Vanderaa and Gatto (2022)).

(williams <- williams2020_tmt())
## An instance of class QFeatures containing 4 assays:
##  [1] peptides_intensity: SingleCellExperiment with 17966 rows and 209 columns 
##  [2] peptides_corrected: SingleCellExperiment with 17966 rows and 209 columns 
##  [3] proteins_intensity: SingleCellExperiment with 2646 rows and 209 columns 
##  [4] proteins_corrected: SingleCellExperiment with 2646 rows and 209 columns

The data contain 4 different SingleCellExperiment objects that we refer to as assays. Each assay contains expression data along with feature metadata. Each row in an assay represents a feature that is either a peptide (assay 1-2) or a protein (assay 3-4). The protein and peptide data are split in two assays, containing either the summed peptide/protein intensities, or the corrected (reference normalization) quantifications. We also store the relationships between features of different assays. For instance, peptides_intensity are linked to proteins_intensity. We can plot these relationships and get an overveiew of the data set.

plot(williams)

Each column (n = 209) in an assay represents a sample. We can easily retrieve the sample annotations from the colData.

colData(williams)
## DataFrame with 209 rows and 5 columns
##           Channel  SampleType ChannelIndex      Amount       Batch
##       <character> <character>  <character> <character> <character>
## 1.A1       TMT126       Boost            1        10ng          A1
## 2.A1      TMT127N   Reference            2       0.2ng          A1
## 3.A1      TMT127C       Empty            3           0          A1
## 4.A1      TMT128N     MOLM-14            4       1cell          A1
## 5.A1      TMT128C        K562            5       1cell          A1
## ...           ...         ...          ...         ...         ...
## 7.C4      TMT129C     MOLM-14            7       1cell          C4
## 8.C4      TMT130N        K562            8       1cell          C4
## 9.C4      TMT130C         CMK            9       1cell          C4
## 10.C4     TMT131N     MOLM-14           10       1cell          C4
## 11.C4     TMT131C        K562           11       1cell          C4
table(williams$SampleType)
## 
##     Boost       CMK     Empty      K562   MOLM-14 Reference 
##        19        38        19        57        57        19

Most of the samples are single cells, either CMK, K562 or MOLM-14. The remaining assays are either booster channels (Boost) with 10ng of protein lysate used to enhance peptide identification, reference channels (Reference') with 0.2ng of protein lysate used for reference normalization by MaxQuant, and empty channels ('Empty) with no material but containing spillover signal from the booster channel.

Feature quality control

The first step of the workflow is to remove reverse hits and contaminants. Before performing this step, we subsest the data to only single-cell samples. This subset is performed on the Amount variable from the colData, so we use subsetByColData().

(williams <- subsetByColData(williams, williams$Amount == "1cell"))
## An instance of class QFeatures containing 4 assays:
##  [1] peptides_intensity: SingleCellExperiment with 17966 rows and 152 columns 
##  [2] peptides_corrected: SingleCellExperiment with 17966 rows and 152 columns 
##  [3] proteins_intensity: SingleCellExperiment with 2646 rows and 152 columns 
##  [4] proteins_corrected: SingleCellExperiment with 2646 rows and 152 columns

We are left with a data set containing 152 single-cells. We can now filter out the reverse hits and contaminants. The required information is contained in the rowData of each assay (Reverse and Potential.contaminant, respectively). We use filterFeatures() to filter on the rowData.

(williams <- filterFeatures(williams,
                            ~ Reverse != "+" &
                                Potential.contaminant != "+"))
## An instance of class QFeatures containing 4 assays:
##  [1] peptides_intensity: SingleCellExperiment with 17735 rows and 152 columns 
##  [2] peptides_corrected: SingleCellExperiment with 17735 rows and 152 columns 
##  [3] proteins_intensity: SingleCellExperiment with 2583 rows and 152 columns 
##  [4] proteins_corrected: SingleCellExperiment with 2583 rows and 152 columns

We are left with 17,735 peptides and 2,583 proteins that pass the quality control.

Log-transformation

The authors next mention they log2-transformed the protein data. They however do not mention how they deal with 0 values that will become infinite after any logarithmic transformation. In this replication study, we replace 0’s with missing values using zeroIsNA().

williams <- zeroIsNA(williams, i = "proteins_corrected")

We then apply logTransform() on the proteins_corrected assay containing the protein data normalized by MaxQuant.

(williams <- logTransform(williams, 
                         i = "proteins_corrected",
                         name = "proteins_log", 
                         base = 2))
## An instance of class QFeatures containing 5 assays:
##  [1] peptides_intensity: SingleCellExperiment with 17735 rows and 152 columns 
##  [2] peptides_corrected: SingleCellExperiment with 17735 rows and 152 columns 
##  [3] proteins_intensity: SingleCellExperiment with 2583 rows and 152 columns 
##  [4] proteins_corrected: SingleCellExperiment with 2583 rows and 152 columns 
##  [5] proteins_log: SingleCellExperiment with 2583 rows and 152 columns

We stored the results in a new assay called proteins_log.

Normalization

The authors then further normalize the data by sequentially centering the column and row values to their median values. Centering columns to the median is readily available in normalizeSCP().

williams <- normalizeSCP(williams,
                         i = "proteins_log",
                         name = "proteins_norm",
                         method = "center.median")

Batch correction

The author mention they perform row centering for each batch separately. This functionality is not available from the package we rely on. So, we’ll perform this step manually. First, we extract the normalized assay along with the associated colData using getWithColData(). <>

sceNorm <- getWithColData(williams, "proteins_norm")
## Warning: 'experiments' dropped; see 'metadata'

Then, we build a for loop that iterates through each batch and centers the rows based on the median within each batch.

for (batch in sceNorm$Batch) {
    ind <- which(sceNorm$Batch == batch) 
    rowMeds <- rowMedians(assay(sceNorm[, ind]), na.rm = TRUE)
    assay(sceNorm[, ind]) <- sweep(assay(sceNorm[, ind]), FUN = "-", 
                                   STATS = rowMeds, MARGIN = 1)
}

Finally, we add the row-normalized assay back in the data set and add the relationships from the previous assay to this new assay.

williams <- addAssay(williams, sceNorm, name = "proteins_bc")
williams <- addAssayLinkOneToOne(williams, from = "proteins_norm", 
                                 to = "proteins_bc")

Handle missing data

The authors remove proteins that have more than 30 % missing data. We use filterNA() and provide the threshold of maximum 30% missing values.

williams <- filterNA(williams, i = "proteins_bc", pNA = 0.3)

Keep confident proteins

The authors next keep only proteins that are identified with at least 2 peptides. We can retrieve which proteins are made of several peptides from the rowData of the peptides_corrected assay. The peptides were mapped to the protein data thanks to the Leading.razor.protein. All proteins that are present at least 2 are kept.

## Get `Leading.razor.protein` from the peptide `rowData`
prots <- rowData(williams[["peptides_corrected"]])$Leading.razor.protein
## Get all proteins that are present at least twice
prots <- unique(prots[duplicated(prots)])
## Keep only proteins that were kept after missing data filtering
prots <- prots[prots %in% rownames(williams)[["proteins_bc"]]]
head(prots)
## [1] "TPM3_HUMAN"  "XPO2_HUMAN"  "PTMA_HUMAN"  "MIC19_HUMAN" "RL1D1_HUMAN"
## [6] "TLN1_HUMAN"

We generated a vector with the proteins identified at least twice. We can now use it to subset the data set. Note that a QFeatures object can be seen as a three-order array: \(features \times samples \times assay\). Hence, QFeatures supports three-order subsetting x[rows, columns, assays]. The great advantage of this subsetting is that all peptides linked to these desired proteins are also retrieved thanks to the relationships stored between proteins and peptides.

(williams <- williams[prots, , ])
## An instance of class QFeatures containing 7 assays:
##  [1] peptides_intensity: SingleCellExperiment with 12538 rows and 152 columns 
##  [2] peptides_corrected: SingleCellExperiment with 12538 rows and 152 columns 
##  [3] proteins_intensity: SingleCellExperiment with 1197 rows and 152 columns 
##  [4] proteins_corrected: SingleCellExperiment with 1197 rows and 152 columns 
##  [5] proteins_log: SingleCellExperiment with 1197 rows and 152 columns 
##  [6] proteins_norm: SingleCellExperiment with 1197 rows and 152 columns 
##  [7] proteins_bc: SingleCellExperiment with 1197 rows and 152 columns

This step concludes the quantitative data processing. We update the QFeatures plot that provides an overview of the processing.

plot(williams)

Downstream analysis

The authors used the processed data to perform two types of downstream analysis: single-cell correlation analysis and principal component analysis. We here provide the published figure that we aim to reproduce in this vignette.

Figure 6 from Williams et al. 2020

To run the downstream analysis, we only need the last assay generated during the quantitative data processing. We extract the assay as well as the associated sample annotations using getWithColData().

sce <- getWithColData(williams, "proteins_bc")
## Warning: 'experiments' dropped; see 'metadata'
## Warning: Ignoring redundant column names in 'colData(x)':

Correlation analysis

We first compute the correlation matrix. We use pairwise

m <- assay(sce)
corMat <- cor(m, method = "pearson", use = "pairwise.complete.obs")

We use the ComplexHeatmap package to plot the heatmap. This package allows easy annotation of the plot, namely labeling each sample with its sample type and channel.

cols <- c(CMK = "green4", `K562` = "dodgerblue", `MOLM-14` = "red3")
colAnnot <- columnAnnotation(Population = sce$SampleType,
                             Channel = sce$Channel,
                             col = list(Population = cols))

We can now plot the heatmap.

Heatmap(corMat, name = "Pearson correlation",
        show_row_names = FALSE, show_column_names = FALSE,
        top_annotation = colAnnot)

Like in the figure from the article, we observe a positive correlation between samples from the same cell type. We also observe with sample type clusters. While this is not shown in the original figure, we can clearly link this effect to TMT channel bias. This should be corrected for, but a the poor experimental design causes sample type and TMT channel to be confounded, and hence entangling the TMT effects with bioligical effects.

PCA

PCA can directly be computed from a SingleCellExperiment object thanks to the scater package. We therefore use runPCA(). Note that we also impute the missing values with zeros as PCA does not support missing value.

sce <- runPCA(sce, exprs_values = 1)

We generate the PCA plots using plotPCA(), and use different colouring schemes to highlight different samples annotations.

plotPCA(sce, colour_by = "SampleType")

The sample types can be differentiated on the PCA plot, although the CMK and MOLM-14 slightly overlap what is not observed in the published work. Another small difference is that the first to PC explaine 15 % of the variance while the authors report 22 % explained variance.

Conclusion

We were able to reproduce the general results from the analysis by Williams et al. (2020). However, we notice small discrepancies indicating small differences between workflows. Without a script or the intermediate results from the authors, we are unable to pinpoint which steps differ.

Reproduce this vignette

You can reproduce this vignette using Docker:

docker pull cvanderaa/scp_replication_docker:v1
docker run \
    -e PASSWORD=bioc \
    -p 8787:8787 \
    cvanderaa/scp_replication_docker:v1

Open your browser and go to http://localhost:8787. The USER is rstudio and the password is bioc. You can find the vignette in the vignettes folder.

See the website home page for more information.

Requirements

Hardware and software

The system details of the machine that built the vignette are:

## Machine: Linux (5.15.0-48-generic)
## R version: R.4.2.1 (svn: 82513)
## RAM: 16.5 GB
## CPU: 16 core(s) - 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz

Timing

The total time required to compile this vignette is:

## 54.3122 secs

Memory

The final williams memory size is:

## [1] "0.1 Gb"

Session info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] benchmarkme_1.0.8           scater_1.25.7              
##  [3] ggplot2_3.3.6               scuttle_1.7.4              
##  [5] SingleCellExperiment_1.19.1 ComplexHeatmap_2.13.1      
##  [7] scp_1.7.4                   scpdata_1.5.4              
##  [9] ExperimentHub_2.5.0         AnnotationHub_3.5.2        
## [11] BiocFileCache_2.5.0         dbplyr_2.2.1               
## [13] QFeatures_1.7.3             MultiAssayExperiment_1.23.9
## [15] SummarizedExperiment_1.27.3 Biobase_2.57.1             
## [17] GenomicRanges_1.49.1        GenomeInfoDb_1.33.7        
## [19] IRanges_2.31.2              S4Vectors_0.35.4           
## [21] BiocGenerics_0.43.4         MatrixGenerics_1.9.1       
## [23] matrixStats_0.62.0          BiocStyle_2.25.0           
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.15               BiocBaseUtils_0.99.12        
##   [3] systemfonts_1.0.4             igraph_1.3.5                 
##   [5] lazyeval_0.2.2                BiocParallel_1.31.12         
##   [7] digest_0.6.29                 foreach_1.5.2                
##   [9] htmltools_0.5.3               magick_2.7.3                 
##  [11] viridis_0.6.2                 fansi_1.0.3                  
##  [13] magrittr_2.0.3                memoise_2.0.1                
##  [15] ScaledMatrix_1.5.1            cluster_2.1.4                
##  [17] doParallel_1.0.17             Biostrings_2.65.6            
##  [19] pkgdown_2.0.6                 colorspace_2.0-3             
##  [21] ggrepel_0.9.1                 blob_1.2.3                   
##  [23] rappdirs_0.3.3                textshaping_0.3.6            
##  [25] xfun_0.33                     dplyr_1.0.10                 
##  [27] crayon_1.5.2                  RCurl_1.98-1.9               
##  [29] jsonlite_1.8.2                iterators_1.0.14             
##  [31] glue_1.6.2                    gtable_0.3.1                 
##  [33] zlibbioc_1.43.0               XVector_0.37.1               
##  [35] GetoptLong_1.0.5              DelayedArray_0.23.2          
##  [37] BiocSingular_1.13.1           shape_1.4.6                  
##  [39] scales_1.2.1                  DBI_1.1.3                    
##  [41] Rcpp_1.0.9                    viridisLite_0.4.1            
##  [43] xtable_1.8-4                  clue_0.3-61                  
##  [45] bit_4.0.4                     rsvd_1.0.5                   
##  [47] MsCoreUtils_1.9.1             httr_1.4.4                   
##  [49] RColorBrewer_1.1-3            ellipsis_0.3.2               
##  [51] farver_2.1.1                  pkgconfig_2.0.3              
##  [53] sass_0.4.2                    utf8_1.2.2                   
##  [55] labeling_0.4.2                tidyselect_1.1.2             
##  [57] rlang_1.0.6                   later_1.3.0                  
##  [59] AnnotationDbi_1.59.1          munsell_0.5.0                
##  [61] BiocVersion_3.16.0            tools_4.2.1                  
##  [63] cachem_1.0.6                  cli_3.4.1                    
##  [65] generics_0.1.3                RSQLite_2.2.17               
##  [67] evaluate_0.16                 stringr_1.4.1                
##  [69] fastmap_1.1.0                 yaml_2.3.5                   
##  [71] ragg_1.2.3                    knitr_1.40                   
##  [73] bit64_4.0.5                   fs_1.5.2                     
##  [75] purrr_0.3.4                   KEGGREST_1.37.3              
##  [77] AnnotationFilter_1.21.0       sparseMatrixStats_1.9.0      
##  [79] mime_0.12                     compiler_4.2.1               
##  [81] beeswarm_0.4.0                filelock_1.0.2               
##  [83] curl_4.3.2                    png_0.1-7                    
##  [85] interactiveDisplayBase_1.35.0 tibble_3.1.8                 
##  [87] bslib_0.4.0                   stringi_1.7.8                
##  [89] highr_0.9                     desc_1.4.2                   
##  [91] lattice_0.20-45               ProtGenerics_1.29.0          
##  [93] Matrix_1.5-1                  vctrs_0.4.2                  
##  [95] pillar_1.8.1                  lifecycle_1.0.2              
##  [97] BiocManager_1.30.18           jquerylib_0.1.4              
##  [99] GlobalOptions_0.1.2           BiocNeighbors_1.15.1         
## [101] cowplot_1.1.1                 bitops_1.0-7                 
## [103] irlba_2.3.5.1                 httpuv_1.6.6                 
## [105] R6_2.5.1                      bookdown_0.29                
## [107] promises_1.2.0.1              gridExtra_2.3                
## [109] vipor_0.4.5                   codetools_0.2-18             
## [111] benchmarkmeData_1.0.4         MASS_7.3-58.1                
## [113] assertthat_0.2.1              rprojroot_2.0.3              
## [115] rjson_0.2.21                  withr_2.5.0                  
## [117] GenomeInfoDbData_1.2.9        parallel_4.2.1               
## [119] beachmat_2.13.4               rmarkdown_2.16               
## [121] DelayedMatrixStats_1.19.1     Cairo_1.6-0                  
## [123] shiny_1.7.2                   ggbeeswarm_0.6.0

Licence

This vignette is distributed under a CC BY-SA licence licence.

Reference

Vanderaa, Christophe, and Laurent Gatto. 2021. “Replication of Single-Cell Proteomics Data Reveals Important Computational Challenges.” Expert Rev. Proteomics, October.

———. 2022. “The Current State of Single-Cell Proteomics Data Analysis,” October. https://arxiv.org/abs/2210.01020.

Williams, Sarah M, Andrey V Liyu, Chia-Feng Tsai, Ronald J Moore, Daniel J Orton, William B Chrisler, Matthew J Gaffrey, et al. 2020. “Automated Coupling of Nanodroplet Sample Preparation with Liquid Chromatography-Mass Spectrometry for High-Throughput Single-Cell Proteomics.” Anal. Chem. 92 (15): 10588–96.

Christophe Vanderaa, Computational Biology, UCLouvain

Laurent Gatto, Computational Biology, UCLouvain

5 October 2022