class: center, middle, inverse, title-slide

.title[
# Reading single-cell proteomics data
]
.author[
### Laurent Gatto and Christophe Vanderaa
]

---
class: middle
name: cc-by

### Get the slides at [https://bit.ly/read_scp_data](https://bit.ly/read_scp_data)

These slides are available under a **Creative Commons [CC-BY license](http://creativecommons.org/licenses/by/4.0/)**. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially <img height="20px" alt="CC-BY" src="https://raw.githubusercontent.com/UCLouvain-CBIO/scp-teaching/main/img/cc1.jpg" />.

???

In this presentation, you will learn how to convert data tables into `QFeatures` objects that can then be used for data processing. I highly recommend watching the previous video about handling quantitative proteomics features, if you have not already done so. The slides are available at the given link and are shared under a CC-BY license.

---
class: middle, center, inverse

# How can I convert my single-cell proteomics data to a QFeatures object?

???

So the focus of this presentation is answering the question: how can I convert my single-cell proteomics data to a QFeatures object?

---
class: middle

## How can I convert my single-cell proteomics data to a QFeatures object?

The `readSCP()` function converts quantified mass spectrometry data tables to `QFeatures` objects.

<br>

<img src="./figs/read_scp_data_readSCP.svg" width="90%" style="display: block; margin: auto;" />

???

You can do that thanks to the `readSCP()` function. The function combines an input table and a sample table into a ready-to-process `QFeatures` object. Let me explain what those two tables correspond to.
---
class:

## Input table

.panelset[
.panel[.panel-name[Description]

.left-column[
<img src="./figs/read_scp_data_inputTable.png" width="100%" style="display: block; margin: auto;" />
]

.right-column[
Input table = output table from pre-processing software, such as MaxQuant (*e.g.* `evidence.txt`) or ProteomeDiscoverer (*e.g.* `PSMs.txt`).

In general, 3 types of columns:

- feature annotations: *e.g.* peptide sequence, ion charge, protein name
- quantification columns: 1 to n (depending on technology)
- acquisition data: *e.g.* file name
]
]

.panel[.panel-name[Example]
]
]

???

### Description

The input table is typically generated by pre-processing software, such as MaxQuant or ProteomeDiscoverer. The input table usually contains 3 types of columns:

- First, some columns hold the feature annotations: think of the peptide sequence, the charge of the analyzed ion, or the protein name.
- You also have columns that hold the quantification data. The number of columns may vary depending on the technology and the pre-processing software.
- The last type of column contains data associated with the mass spectrometry acquisition, for instance the name of the file where the instrument has stored the data.

Let's have a look at some example data.

### Example

The first columns here are annotations of the features, with the peptide sequence, its length, the ion charge, etc. Then, there is the quantitative data, and you can see all elements are numbers that correspond to ion intensities. You can identify the quantitative data in this specific example because the column names start with `Reporter.intensity.` followed by a number. The last column is the name of the acquisition and relates to the mass spectrometry run.

---
class:

## Sample annotation

.panelset[
.panel[.panel-name[Description]

.left-column[
<img src="./figs/read_scp_data_sampleTable.png" width="80%" style="display: block; margin: auto;" />
]

.right-column[
Sample table = table generated by the researcher.

1 line = 1 sample

Two columns are **required**:

- Names of the quantification columns from the input table
- Acquisition name, same as in the input table

Other columns can contain additional sample annotations, such as:

- Experiment metadata (date, researcher's name, instruments, ...)
- Sample preparation (cell culture batch, LC batch, TMT label, ...)
- Sample metadata (species, treatment, disease, sex, age, ...)
- Sample type (single-cells, carrier, blanks, ...)
- Other data (FACS data, microscopy data, phenotypic data, ...)
- ...
]
]

.panel[.panel-name[Example]
]
]

???

### Description

In addition to the input table, the `readSCP` function requires a sample table. This is generated by the researcher. Each line contains information about a single sample. 2 columns are required to work with `readSCP`. The first column tells the software the names of the columns containing the quantification data in the input table. The other column contains the acquisition names, just like we saw for the input table. Besides those 2 required columns, you can include any sample data you may have, be it experimental metadata, sample preparation information, sample types or other data collected during sample preparation. These data are valuable when it comes to data modelling!

### Example

In this example, the first 2 columns are the required columns. The first column is very similar to the acquisition name column in the input table, and you may notice that the second column holds the names that I pointed out as quantitative columns in the input table, those starting with `Reporter.intensity.`. The remaining columns are examples of additional information that could be available.

---
class: middle, center, inverse

# What happens under the hood?

???

You may ask yourself: what exactly is `readSCP` doing?

---
class:

## What happens under the hood?

.panelset[
.panel[.panel-name[Step1]

.left-column[
<img src="./figs/read_scp_data_step1.svg" width="80%" style="display: block; margin: auto;" />
]

.right-column[
The feature annotations are separated from the quantitative data.

The two table pieces are converted to a [`SingleCellExperiment`](https://www.bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) object [1], a specialized **Bioconductor** data container that creates an interface to existing functions to analyse single-cell data.

<br><br><br>

<p style="color:grey;font-size:0.75em;">
[1] Amezquita, Robert A., Aaron T. L. Lun, Etienne Becht, Vince J. Carey, Lindsay N. Carpp, Ludwig Geistlinger, Federico Martini, et al. 2019. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods, December, 1–9.
</p>
]
]

.panel[.panel-name[Step2]

.left-column[
<img src="./figs/read_scp_data_step2.svg" width="100%" style="display: block; margin: auto;" />
]

.right-column[
The data is then split based on the acquisition name.

Each quantitative column now corresponds to a **single** and **unique** sample.
]
]

.panel[.panel-name[Step3]

.pull-left[
<img src="./figs/read_scp_data_step3.svg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
The sample table is matched to the split feature data.

This is performed based on the two **required** columns:

- Acquisition name to match each data piece
- Quantification column names to match columns in each data piece

**Unique sample IDs** are created.
]
]

.panel[.panel-name[Step4]

.pull-left[
<img src="./figs/read_scp_data_step4.svg" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
All the data pieces are wrapped into a `QFeatures` object.

Overall, the `QFeatures` format enables seamless **data management and access**, important for **downstream** data processing and visualisation.
]
]
]

???

Well, `readSCP` prepares the data in 4 main steps.

### Step1

First, it takes the input table and separates the feature annotations from the quantitative data. The two tables can then be converted to a `SingleCellExperiment` object, a specialized Bioconductor data container that creates an interface to existing functions to analyse single-cell data.

### Step2

Next, the data is split further, this time along the rows, based on the mass spectrometry run. Because the acquisitions are now separated, each quantitative column corresponds to a single and unique sample.

### Step3

In step 3, the sample table comes in. The sample annotations are linked to the quantification data, based on the two required columns I just described.
First, the acquisition names are matched between the input table and the sample table. Next, the quantification column names from the sample table are used to link each sample to its corresponding quantification column. The combination of quantification column name and acquisition name creates unique sample identifiers that are stored in the sample table.

### Step4

The final step is to wrap all those data pieces into a `QFeatures` object. Overall, the `QFeatures` format enables seamless data management and access, which are important for downstream data processing and visualisation.

---
class: middle, center, inverse

# readSCP() in practice

???

Let me now show you how to use `readSCP` in practice.

---
class:

## `readSCP()` in practice

.panelset[
.panel[.panel-name[Data]

.pull-left[
`sampleTable`
]

.pull-right[
`inputTable`
]
]

.panel[.panel-name[Code]

```r
readSCP(inputTable,
        sampleTable,
        batchCol = "Raw.file",
        channelCol = "Channel")
```

Overview of the resulting `QFeatures` object:

```
## An instance of class QFeatures containing 4 assays:
## [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 16 columns
## [2] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 16 columns
## [3] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 16 columns
## [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns
```
]
]

???

### Data

Consider the two example tables here, which are very similar to the ones I showed previously. Note that both tables contain the `Raw.file` column necessary for matching the acquisition runs. See also that the `Channel` column in the `sampleTable` contains the names of the quantification columns in the `inputTable`.

### Code

You can convert those two tables to a `QFeatures` object using `readSCP()` by running the command shown here. We call `readSCP()` and provide the 2 tables, but we also need to specify how the matching is performed. Here, we tell the function which column in both tables should be used to match the acquisition batch; remember that in this case it was `Raw.file`. Then, we need to tell the function which column in the sample table contains the names of the quantification columns; it is the column called `Channel` in this example.

Running the command creates a new object, and you can find an overview of it here. You can see that we indeed have a `QFeatures` object with 4 assays because there are 4 acquisitions. Each assay is a `SingleCellExperiment` object with a variable number of rows, here corresponding to the different quantified PSMs, and 16 columns because multiplexing with 16 labels was used to acquire the data.

---
class: middle, inverse, center

# What if I acquired a single acquisition or already processed my data?

???
As a side note, you can also import data from a single acquisition or import already processed single-cell proteomics data.

---
class: middle

## What if I acquired a single acquisition or already processed my data?

`readQFeatures()`: import pre-processed data for a single acquisition as a `QFeatures` object.

```r
readQFeatures(inputTable,
              ecol = paste0("Reporter.intensity.", 1:16),
              colData = sampleTable)
```

`readSingleCellExperiment()`: import data as a `SingleCellExperiment` object.

```r
readSingleCellExperiment(inputTable,
                         ecol = paste0("Reporter.intensity.", 1:16),
                         colData = sampleTable)
```

Both functions require:

- input table
- quantification column names

Optional: sample annotation table.

???

Suppose you acquired data from a single acquisition: you may not want to bother with splitting your data using `readSCP()` and may prefer the simpler function `readQFeatures()`. Also, if you have already processed your data or if you downloaded data that is ready to analyse, you may simply want to load it as a `SingleCellExperiment` object for downstream analyses (we will come to that in the next slide deck). You can do so using `readSingleCellExperiment()`. Both functions require an input table, similar to `readSCP`, and you need to point out which columns are the quantification columns. Optionally, you can provide sample annotations.

I hope this presentation helped you understand how to get your single-cell proteomics data into R for data processing and analysis using our software.

---
class: middle, inverse, center

# Exercise

???

I suggest testing your understanding with a small exercise.

---
class: middle

#### Given the input and sample tables, what command creates a single-cell proteomics QFeatures object?

.pull-left[
`inputTable`
]

.pull-right[
`sampleTable`
]

```r
1. readSCP(inputTable, sampleTable,
           batchCol = "Label", channelCol = c("lab1", "lab2"))
2. readSCP(inputTable, sampleTable,
           batchCol = "MSrun", channelCol = c("lab1", "lab2"))
3. readSCP(inputTable, sampleTable,
           batchCol = "MSrun", channelCol = "Label")
4. readSCP(inputTable, sampleTable,
           batchCol = "CellType", channelCol = "Label")
```

???

Given the input and sample tables, what command creates a single-cell proteomics QFeatures object? Command 1, 2, 3 or 4? You can pause the video to think about it...

The solution is command 3. Note that we can match the acquisition runs using the `MSrun` column in both tables, and that the `Label` column in the sample table contains the quantification column names of the input table.

---
class: middle

### Further information

Learn more about loading single-cell proteomics data as a `QFeatures` object in our dedicated vignette at https://uclouvain-cbio.github.io/scp/articles/read_scp.html.

### Funding

Fonds de la Recherche Scientifique (FNRS), Belgium

???

I hope you found the correct solution. If you're looking for more detailed information, you can have a look at our vignette dedicated to loading single-cell proteomics data.

Thank you very much for watching and see you in another video.
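
---
class: middle

### Appendix: the four steps as a base R sketch

The four steps described in "What happens under the hood?" can be mimicked with base R alone. This is only an illustrative sketch under stated assumptions, not the actual `readSCP()` implementation: the toy tables and all their values are made up, the ID format is arbitrary, and a plain named list stands in for the `SingleCellExperiment` and `QFeatures` containers.

```r
# Hypothetical toy tables: 2 acquisitions, 2 quantification columns
inputTable <- data.frame(
    Sequence = c("PEPA", "PEPB", "PEPC"),
    Raw.file = c("run1", "run1", "run2"),
    Reporter.intensity.1 = c(10, 20, 30),
    Reporter.intensity.2 = c(11, 21, 31),
    stringsAsFactors = FALSE
)
sampleTable <- data.frame(
    Raw.file = rep(c("run1", "run2"), each = 2),
    Channel = rep(c("Reporter.intensity.1", "Reporter.intensity.2"), 2),
    SampleType = c("Carrier", "Cell", "Carrier", "Cell"),
    stringsAsFactors = FALSE
)
quantCols <- unique(sampleTable$Channel)

# Step 1: separate the feature annotations from the quantitative data
annot <- inputTable[, setdiff(colnames(inputTable), quantCols), drop = FALSE]
quant <- as.matrix(inputTable[, quantCols, drop = FALSE])

# Step 2: split the rows by acquisition name
rowsByRun <- split(seq_len(nrow(inputTable)), inputTable$Raw.file)

# Step 3: match the sample table to each piece and create unique sample IDs
# (acquisition name pasted to quantification column name)
sampleTable$SampleID <- paste0(sampleTable$Raw.file, sampleTable$Channel)
pieces <- lapply(names(rowsByRun), function(run) {
    m <- quant[rowsByRun[[run]], , drop = FALSE]
    colnames(m) <- paste0(run, colnames(m))
    list(quant = m,
         features = annot[rowsByRun[[run]], , drop = FALSE],
         samples = sampleTable[sampleTable$Raw.file == run, ])
})

# Step 4: wrap all pieces into one object (here simply a named list)
names(pieces) <- names(rowsByRun)
```

Each element of `pieces` holds the data for one acquisition, and the column names of its `quant` matrix match the `SampleID` values of its `samples` table, which is the kind of bookkeeping that `readSCP()` takes care of for you.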