The scp data framework

Our data structure is relying on two curated data classes: QFeatures (Gatto (2020)) and SingleCellExperiment (Amezquita et al. (2019)). QFeatures is dedicated to the manipulation and processing of MS-based quantitative data. It explicitly records the successive steps to allow users to navigate up and down the different MS levels. SingleCellExperiment is another class designed as an efficient data container that serves as an interface to state-of-the-art methods and algorithms for single-cell data. Our framework combines the two classes to inherit from their respective advantages.

Because mass spectrometry (MS)-based single-cell proteomics (SCP) only captures the proteome of between one and a few tens of single-cells in a single run, the data is usually acquired across many MS batches. Therefore, the data for each run should conceptually be stored in its own container, that we here call an assay. The expected input for working with the scp package is quantification data of peptide to spectrum matches (PSM). These data can then be processed to reconstruct peptide and protein data. The links between related features across different assays are stored to facilitate manipulation and visualization of of PSM, peptide and protein data. This is conceptually shown below.

`scp` relies on `SingleCellExperiment` and `QFeatures` objects.

scp relies on SingleCellExperiment and QFeatures objects.

There are two input tables required for starting an analysis with scp:

  1. The feature data
  2. The sample data

Feature data

The feature data are generated after the identification and quantification of the MS spectra by a pre-processing software such as MaxQuant, ProteomeDiscoverer or MSFragger (the list of available software is actually much longer). We will here use as an example a data table that has been generated by MaxQuant. The table is available from the scp package and is called mqScpData (for MaxQuant generated SCP data).

library(scp)
data("mqScpData")
dim(mqScpData)
#> [1] 1361  149

In this toy example, there are 1361 rows corresponding to features (quantified PSMs) and 149 columns corresponding to different data fields recorded by MaxQuant during the processing of the MS spectra. The columns can be divided into three categories:

  • Columns holding feature quantifications
  • Columns holding feature metadata
  • Columns holding MS run metadata

Feature quantifications

The quantification data can be composed of one (in case of label-free acquisition) up to 16 columns (in case of TMT-16 multiplexing). The columns holding the quantification start with Reporter.intensity. followed by a number.

(quantCols <- grep("Reporter.intensity.\\d", colnames(mqScpData),
                  value = TRUE))
#>  [1] "Reporter.intensity.1"  "Reporter.intensity.2"  "Reporter.intensity.3" 
#>  [4] "Reporter.intensity.4"  "Reporter.intensity.5"  "Reporter.intensity.6" 
#>  [7] "Reporter.intensity.7"  "Reporter.intensity.8"  "Reporter.intensity.9" 
#> [10] "Reporter.intensity.10" "Reporter.intensity.11" "Reporter.intensity.12"
#> [13] "Reporter.intensity.13" "Reporter.intensity.14" "Reporter.intensity.15"
#> [16] "Reporter.intensity.16"

As you may notice, the example data was acquired using a TMT-16 protocol since we retrieve 16 quantification columns. Actually, some runs were acquired using a TMT-11 protocol (11 labels) but we will come back to this later.

head(mqScpData[, quantCols])
#>   Reporter.intensity.1 Reporter.intensity.2 Reporter.intensity.3
#> 1                61251               501.71               3731.3
#> 2                58648              1099.80               2837.7
#> 3               150350              3705.00               9361.0
#> 4                27347               405.90               1525.2
#> 5                84035               583.09               4092.3
#> 6                44895               700.23               2283.0
#>   Reporter.intensity.4 Reporter.intensity.5 Reporter.intensity.6
#> 1              1643.30               871.84               981.87
#> 2               494.32               349.26              1030.50
#> 3                 0.00              1945.40              1188.60
#> 4                 0.00                 0.00               318.74
#> 5               530.13               718.13              2204.50
#> 6              1109.60                 0.00               675.79
#>   Reporter.intensity.7 Reporter.intensity.8 Reporter.intensity.9
#> 1              1200.10               939.06              1457.50
#> 2                 0.00              1214.10               800.58
#> 3              1574.00              2302.10              2176.10
#> 4                 0.00               519.81                 0.00
#> 5               960.51               453.77              1188.40
#> 6                 0.00               809.38               668.88
#>   Reporter.intensity.10 Reporter.intensity.11 Reporter.intensity.12
#> 1               1329.80                981.83                    NA
#> 2                807.79                391.38                    NA
#> 3               1399.50               1307.50                2192.4
#> 4                507.23                370.79                    NA
#> 5                740.99                  0.00                    NA
#> 6               1467.50                901.38                    NA
#>   Reporter.intensity.13 Reporter.intensity.14 Reporter.intensity.15
#> 1                    NA                    NA                    NA
#> 2                    NA                    NA                    NA
#> 3                1791.4                1727.5                2157.3
#> 4                    NA                    NA                    NA
#> 5                    NA                    NA                    NA
#> 6                    NA                    NA                    NA
#>   Reporter.intensity.16
#> 1                    NA
#> 2                    NA
#> 3                  1398
#> 4                    NA
#> 5                    NA
#> 6                    NA

Feature metadata

Most columns in the mqScpData table contain information used or generated during the identification of the MS spectra. For instance, you may find the charge of the parent ion, the score and probability of a correct match between the MS spectrum and a peptide sequence, the sequence of the best matching peptide, its length, its modifications, the retention time of the peptide on the LC, the protein(s) the peptide originates from and much more.

head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
                   "Retention.time", "Proteins")])
#>   Charge  Score        PEP    Sequence Length Retention.time
#> 1      2 41.029 5.2636e-04   ATNFLAHEK      9         65.781
#> 2      2 44.349 5.8789e-04   ATNFLAHEK      9         63.787
#> 3      2 51.066 4.0315e-24 SHTILLVQPTK     11         71.884
#> 4      2 63.816 4.7622e-06 SHTILLVQPTK     11         68.633
#> 5      2 74.464 6.8709e-09 SHTILLVQPTK     11         71.946
#> 6      2 41.502 5.3705e-02     SLVIPEK      7         76.204
#>               Proteins
#> 1 sp|P29692|EF1D_HUMAN
#> 2 sp|P29692|EF1D_HUMAN
#> 3  sp|P84090|ERH_HUMAN
#> 4  sp|P84090|ERH_HUMAN
#> 5  sp|P84090|ERH_HUMAN
#> 6 sp|P62269|RS18_HUMAN

MS run metadata

This type of metadata is related to the MS instrument. In MaxQuant, only the file name generated by the MS instrument is stored. There is one file for each MS run, hence the file name can be used as a batch identifier.

unique(mqScpData$Raw.file)
#> [1] "190321S_LCA10_X_FP97AG"        "190222S_LCA9_X_FP94BM"        
#> [3] "190914S_LCB3_X_16plex_Set_21"  "190321S_LCA10_X_FP97_blank_01"

Sample data

Next to the quantification data and the feature data, sample metadata contains the experimental design generated by the researcher. The rows of sample metadata correspond to a sample in the experiment and the columns correspond to the available information about the sample. We will here use the second example table:

data("sampleAnnotation")
head(sampleAnnotation)
#>                Raw.file              Channel SampleType lcbatch sortday digest
#> 1 190222S_LCA9_X_FP94BM Reporter.intensity.1    Carrier    LCA9      s8      N
#> 2 190222S_LCA9_X_FP94BM Reporter.intensity.2  Reference    LCA9      s8      N
#> 3 190222S_LCA9_X_FP94BM Reporter.intensity.3     Unused    LCA9      s8      N
#> 4 190222S_LCA9_X_FP94BM Reporter.intensity.4   Monocyte    LCA9      s8      N
#> 5 190222S_LCA9_X_FP94BM Reporter.intensity.5      Blank    LCA9      s8      N
#> 6 190222S_LCA9_X_FP94BM Reporter.intensity.6   Monocyte    LCA9      s8      N

This table may contain any information about the samples. For example, useful information could be the type of sample that is analysed, a phenotype known from the experimental design, the MS batch, the acquisition date, MS settings used to acquire the sample, the LC batch, the sample preparation batch, etc… However, scp requires 2 specific fields in the sample metadata:

  1. One column containing the MS run names (Raw.file in this case). It must have the same name as the name of the column containing the MS run names in the quantification table. This will allow scp to correctly match and split data that were acquired separately. This is illustrated below by the linked circles.
  2. One column that tells scp the names of the columns in the feature data holds the quantification of the corresponding sample. This is illustrated below by the arrow.
Linking the sample data to the feature data

Linking the sample data to the feature data

readSCP

readSCP is the function that converts the sample and the feature data into a QFeatures object following the data structure described above, that is storing the data belonging to each MS batch in a separate SingleCellExperiment object. We therefore provide the feature data, the sample data to the function as well as the name of the column that holds the batch name in both tables and the name of the column in the sample data that points to the quantification columns in the feature data.

Sample names and suffix

readSCP automatically assigns names that are unique across all samples in all assays. This is performed by appending the name of the batch where a given sample is found in with the name of the quantification column for that sample. Suppose a sample belongs to batch 190222S_LCA9_X_FP94BM and the quantification values in the feature data are found in the column called Reporter.intensity.3, then the sample name will become 190222S_LCA9_X_FP94BMReporter.intensity.3. Optionally, to improve the readability of sample names, readSCP can take a suffix instead of the quantification column name. For instance, in the example below, we will provide a short suffix with the TMT index to remind that samples were multiplexed using TMT.

Special case: empty samples

In some rare cases, it can be beneficial to remove empty samples (all quantifications are NA) from the assays. Such samples can occur when samples that were acquired with different multiplexing labels are merged in a single table. For instance, the SCoPE2 data we provide as an example contains runs that were acquired with two TMT protocols. The 3 first assays were acquired using the TMT-11 protocol and the last assay was acquired using a TMT-16 protocol. When exporting the table, the authors combined the data in a single table, were missing channels in the TMT-11 data are filled with NA. This is essential when working in table format, but since scp keeps the runs separated we can allow for different numbers of channels per run. When setting removeEmptyCols = TRUE, readSCP automatically detects and removes columns that contain only NAs,

Running readSCP

We convert the sample and the feature data into a QFeatures object in a single command thanks to readSCP.

scp <- readSCP(featureData = mqScpData,
               colData = sampleAnnotation,
               batchCol = "Raw.file",
               channelCol = "Channel",
               suffix = paste0("_TMT", 1:16),
               removeEmptyCols = TRUE)
#> Loading data as a 'SingleCellExperiment' object
#> Splitting data based on 'Raw.file'
#> Formatting sample metadata (colData)
#> Formatting data as a 'QFeatures' object

As indicated by the output on the console, readSCP proceeds as follows:

  1. If featureData is the path to a CSV file, it reads the file using read.csv. The table is converted to a SingleCellExperiment object. readSCP needs to know in which field(s) the quantitative data is stored. Those field name(s) is/are provided by the channelCol field in the metaData table. So in this example, the column names holding the quantitative data in mqScpData are stored in the column named Channel in sampleAnnotation.

  2. The SingleCellExperiment object is then split according to batch. The split is performed depending on the batchCol field in featureData, in this case the data will be split according to the Raw.file column in mqScpData. Raw.file contains the names of the acquisition runs that was used by MaxQuant to retrieve the raw data files.

  3. The sample metadata is generated from the supplied colData. Note that in order for readSCP to correctly match the feature data with the metadata, colData must contain the same batchCol field with batch names.

  4. Finally, the split feature data and the sample metadata are stored in a single QFeatures object.

Here is a compact overview of the data:

scp
#> An instance of class QFeatures containing 4 assays:
#>  [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 11 columns 
#>  [2] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 11 columns 
#>  [3] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 11 columns 
#>  [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns

We can see that the object returned by readSCP is a QFeatures object containing 4 SingleCellExperiment assays that have been named after the 4 MS batches. Each assay contains either 11 or 16 columns (samples) depending on the TMT labelling strategy and a variable number of rows (quantified PSMs). Each piece of information can easily be retrieved thanks to the Qfeatures architectures. As mentioned in the previous vignette, sample data is retrieved from the colData. Note that unique sample names were automatically generated by combining the batch name and the suffix provided to readSCP:

head(colData(scp))
#> DataFrame with 6 rows and 6 columns
#>                                 Raw.file       Channel  SampleType     lcbatch
#>                              <character>   <character> <character> <character>
#> 190222S_LCA9_X_FP94BM_TMT1 190222S_LC... Reporter.i...     Carrier        LCA9
#> 190222S_LCA9_X_FP94BM_TMT2 190222S_LC... Reporter.i...   Reference        LCA9
#> 190222S_LCA9_X_FP94BM_TMT3 190222S_LC... Reporter.i...      Unused        LCA9
#> 190222S_LCA9_X_FP94BM_TMT4 190222S_LC... Reporter.i...    Monocyte        LCA9
#> 190222S_LCA9_X_FP94BM_TMT5 190222S_LC... Reporter.i...       Blank        LCA9
#> 190222S_LCA9_X_FP94BM_TMT6 190222S_LC... Reporter.i...    Monocyte        LCA9
#>                                sortday      digest
#>                            <character> <character>
#> 190222S_LCA9_X_FP94BM_TMT1          s8           N
#> 190222S_LCA9_X_FP94BM_TMT2          s8           N
#> 190222S_LCA9_X_FP94BM_TMT3          s8           N
#> 190222S_LCA9_X_FP94BM_TMT4          s8           N
#> 190222S_LCA9_X_FP94BM_TMT5          s8           N
#> 190222S_LCA9_X_FP94BM_TMT6          s8           N

Notice that the sample names were suffixed with TMT indexes.

The feature metadata is retrieved from the rowData. Since the feature metadata is specific to each assay, we need to tell from which assay we want to get the rowData:

head(rowData(scp[["190222S_LCA9_X_FP94BM"]]))[, 1:5]
#> DataFrame with 6 rows and 5 columns
#>                 uid      Sequence    Length Modifications Modified.sequence
#>         <character>   <character> <integer>   <character>       <character>
#> PSM2  _(Acetyl (...     ATNFLAHEK         9 Acetyl (Pr...     _(Acetyl (...
#> PSM4  _(Acetyl (... SHTILLVQPT...        11 Acetyl (Pr...     _(Acetyl (...
#> PSM6  _(Acetyl (...       SLVIPEK         7 Acetyl (Pr...     _(Acetyl (...
#> PSM9  _AAGLALK_ ...       AAGLALK         7    Unmodified         _AAGLALK_
#> PSM12 _AALSAGK_ ...       AALSAGK         7    Unmodified         _AALSAGK_
#> PSM15 _AAPEASGTP... AAPEASGTPS...        16    Unmodified     _AAPEASGTP...

Finally, we can also retrieve the quantification matrix for an assay of interest:

head(assay(scp, "190222S_LCA9_X_FP94BM"))
#>       190222S_LCA9_X_FP94BM_TMT1 190222S_LCA9_X_FP94BM_TMT2
#> PSM2                     58648.0                    1099.80
#> PSM4                     27347.0                     405.90
#> PSM6                     44895.0                     700.23
#> PSM9                    122070.0                    1153.50
#> PSM12                    58605.0                     895.25
#> PSM15                     5006.5                     517.86
#>       190222S_LCA9_X_FP94BM_TMT3 190222S_LCA9_X_FP94BM_TMT4
#> PSM2                     2837.70                     494.32
#> PSM4                     1525.20                       0.00
#> PSM6                     2283.00                    1109.60
#> PSM9                     7361.90                    1732.30
#> PSM12                    2763.80                     867.82
#> PSM15                     446.19                     458.17
#>       190222S_LCA9_X_FP94BM_TMT5 190222S_LCA9_X_FP94BM_TMT6
#> PSM2                      349.26                    1030.50
#> PSM4                        0.00                     318.74
#> PSM6                        0.00                     675.79
#> PSM9                     1515.60                    2252.00
#> PSM12                    1050.30                    1268.70
#> PSM15                     467.90                     649.50
#>       190222S_LCA9_X_FP94BM_TMT7 190222S_LCA9_X_FP94BM_TMT8
#> PSM2                        0.00                    1214.10
#> PSM4                        0.00                     519.81
#> PSM6                        0.00                     809.38
#> PSM9                      444.48                    2343.80
#> PSM12                     532.30                    1073.10
#> PSM15                     259.84                     533.55
#>       190222S_LCA9_X_FP94BM_TMT9 190222S_LCA9_X_FP94BM_TMT10
#> PSM2                      800.58                      807.79
#> PSM4                        0.00                      507.23
#> PSM6                      668.88                     1467.50
#> PSM9                     3100.20                     1825.20
#> PSM12                     911.30                     1300.00
#> PSM15                     393.53                      463.26
#>       190222S_LCA9_X_FP94BM_TMT11
#> PSM2                       391.38
#> PSM4                       370.79
#> PSM6                       901.38
#> PSM9                      2372.50
#> PSM12                     1185.90
#> PSM15                      353.04

What about label-free SCP?

The scp package is meant for both label-free and multiplexed SCP data. Like in the example above, the label-free data should contain the batch names in both the feature data and the sample data. The sample data must also contain a column that points to the columns of the feature data that contains the quantifications. Since label-free SCP acquires one single-cell per run, this sample data column should point the same column for all samples. Moreover, this means that each PSM assay will contain a single column.

What about other input formats?

readSCP should work with any PSM quantification table that is output by a pre-processing software. For instance, you can easily import the PSM tables generated by ProteomeDiscoverer. The batch names are contained in the File ID column (that should be supplied as the batchCol argument to readSCP). The quantification columns are contained in the columns starting with Abundance, eventually followed by a multiplexing tag name. These columns should be stored in a dedicated column of the sample data to be supplied as channelCol to readSCP.

If your input cannot be loaded using the procedure described in this vignette, you can submit a feature request (see next section).

Need help?

You can open an issue on the GitHub repository in case of troubles when loading your SCP data with readSCP. Any suggestion or feature request about the function or the documentation are also warmly welcome.

Session information

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scp_1.3.3                    QFeatures_1.3.7             
 [3] MultiAssayExperiment_1.19.16 SummarizedExperiment_1.23.5 
 [5] Biobase_2.53.0               GenomicRanges_1.45.0        
 [7] GenomeInfoDb_1.29.8          IRanges_2.27.2              
 [9] S4Vectors_0.31.5             BiocGenerics_0.39.2         
[11] MatrixGenerics_1.5.4         matrixStats_0.61.0          
[13] BiocStyle_2.21.3            

loaded via a namespace (and not attached):
 [1] lattice_0.20-45             png_0.1-7                  
 [3] utf8_1.2.2                  rprojroot_2.0.2            
 [5] digest_0.6.28               SingleCellExperiment_1.15.2
 [7] R6_2.5.1                    evaluate_0.14              
 [9] highr_0.9                   pillar_1.6.3               
[11] zlibbioc_1.39.0             rlang_0.99.0.9000          
[13] lazyeval_0.2.2              jquerylib_0.1.4            
[15] Matrix_1.3-4                rmarkdown_2.11             
[17] pkgdown_1.9000.9000.9000    textshaping_0.3.5          
[19] desc_1.4.0                  stringr_1.4.0              
[21] ProtGenerics_1.25.1         igraph_1.2.6               
[23] RCurl_1.98-1.5              DelayedArray_0.19.4        
[25] compiler_4.1.1              xfun_0.26                  
[27] pkgconfig_2.0.3             systemfonts_1.0.2          
[29] htmltools_0.5.2             tidyselect_1.1.1           
[31] tibble_3.1.5                GenomeInfoDbData_1.2.7     
[33] bookdown_0.24               fansi_0.5.0                
[35] crayon_1.4.1                dplyr_1.0.7                
[37] MASS_7.3-54                 bitops_1.0-7               
[39] grid_4.1.1                  jsonlite_1.7.2             
[41] lifecycle_1.0.1             AnnotationFilter_1.17.1    
[43] magrittr_2.0.1              MsCoreUtils_1.5.0          
[45] cli_3.0.1                   stringi_1.7.5              
[47] cachem_1.0.6                XVector_0.33.0             
[49] fs_1.5.0                    bslib_0.3.1                
[51] ellipsis_0.3.2              ragg_1.1.3                 
[53] vctrs_0.3.8                 generics_0.1.0             
[55] tools_4.1.1                 glue_1.4.2                 
[57] purrr_0.3.4                 fastmap_1.1.0              
[59] yaml_2.2.1                  clue_0.3-60                
[61] cluster_2.1.2               BiocManager_1.30.16        
[63] memoise_2.0.0               knitr_1.36                 
[65] sass_0.4.0                 

Reference

Amezquita, Robert A, Aaron T L Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Martini, et al. 2019. “Orchestrating Single-Cell Analysis with Bioconductor.” Nat. Methods, December, 1–9.
Gatto, Laurent. 2020. QFeatures: Quantitative Features for Mass Spectrometry Data.”