Mass spectrometry-based proteomics

LSTAT2340 – Traitement de données -omiques

Laurent Gatto

2026-03-16

This material available under a creative common CC-BY license. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially.

Biological information flow

Information flow in biological systems (Source *Central dogma of biology* on Wikipedia). Information flow in biological systems (Source Central dogma of biology on Wikipedia).

There are three main biological entities that respectively store information, act as data intermediates, and the functional units, and the figure below show how information flows between these three levels.

DNA, that lives in the nucleus of cells, is the central information storage mechanism, and encodes the blueprint of the functional units as genes. DNA is transcribed into messenger RNA (mRNA), that re-localises outside the nucleus and is further processed into its mature exon-only form after removal of the non-coding introns sequences. Finally, the mRNA is translated by the ribosomial machinery into proteins directly into the endoplasmic reticulum (ER) where they are then redirected to their final destination.

Sub-cellular structure of an animal cell (Source Cell biology on Wikipedia).

Sub-cellular structure of an animal cell (Source *Cell biology* on Wikipedia).

In addition to the standard information worflow where DNA is transcribed into RNA that itself is translated into proteins, there is also reverse transcription, that generates (complementary) DNA from and RNA molecule, as well as replication of DNA (during cell division) and RNA molecules.

Why studying proteins

Proteins as the functional units in all living organisms, and they are highly dynamic. The caterpillar and the resulting butterfly have the same genome. The complement of all the expressed proteins, termed the proteome is however very different.

The metamorphosis from a caterpilar to a monarch butterfly. (Image from Phys.org)

The metamorphosis from a caterpilar to a monarch butterfly. (Image from Phys.org)

There are different modalities of the proteome that are of interest. In addition to the presence or amount of protein in a biological samples, it is also important to study the interactions between proteins forming protein-protein complexes, the presence of post-transcriptional modification (such as, for example, phosphorylations), the rate at which proteins are produced and degraded, or where the proteins reside inside a cell.

The technique of choice to study proteins in a high throughput way is mass spectrometry.

A beginner’s guide to mass spectrometry–based proteomics (Sinha and Mann 2020) is an approachable introduction to sample preparation, mass spectrometry and data analysis.

Setup

We are going to use Bioconductor (Huber et al. 2015) packages from the R for Mass Spectrometry initiative. The associated book provide further details and state-of-the-art infrastructure for MS and proteomics data analysis.

The *R for Mass Spectrometry* intiative. The R for Mass Spectrometry intiative.

Packages can be install with the BiocManager package, available from CRAN. If BiocManager isn’t available on your computer, install it with:

install.packages("BiocManager")

Any packages (such as Spectra) and their dependencies can be installed with:

BiocManager::install("Spectra")

For additional information on how to analyse mass spectrometry-based proteomics data, refer to (L. Gatto and Christoforou 2014) and (Laurent Gatto 2019), or explore the proteomics- and mass spectrometry-related packages on the Bioconductor page

How does mass spectrometry work?

Mass spectrometry (MS) is a technology that separates charged molecules (ions) based on their mass to charge ratio (M/Z). It is often coupled to chromatography (liquid LC, but can also be gas-based GC). The time an analytes takes to elute from the chromatography column is the retention time.

A chromatogram, illustrating the total amount of analytes over the retention time.

A chromatogram, illustrating the total amount of analytes over the retention time.

An mass spectrometer is composed of three components:

  1. The source, that ionises the molecules: examples are Matrix-assisted laser desorption/ionisation (MALDI) or electrospray ionisation. (ESI)
  2. The analyser, that separates the ions: Time of flight (TOF) or Orbitrap.
  3. The detector that quantifies the ions.

When using mass spectrometry for proteomics, the proteins are first digested with a protease such as trypsin. In mass shotgun proteomics, the analytes assayed in the mass spectrometer are peptides.

Often, ions are subjected to more than a single MS round. After a first round of separation, the peaks in the spectra, called MS1 spectra, represent peptides. At this stage, the only information we possess about these peptides are their retention time and their mass-to-charge (we can also infer their charge be inspecting their isotopic envelope, i.e the peaks of the individual isotopes, see below), which is not enough to infer their identify (i.e. their sequence).

In MSMS (or MS2), the settings of the mass spectrometer are set automatically to select a certain number of MS1 peaks (for example 20). Once a narrow M/Z range has been selected (corresponding to one high-intensity peak, a peptide, and some background noise), it is fragmented (using for example collision-induced dissociation (CID), higher energy collisional dissociation (HCD) or electron-transfer dissociation (ETD)). The fragment ions are then themselves separated in the analyser to produce a MS2 spectrum. The unique fragment ion pattern can then be used to infer the peptide sequence using de novo sequencing (when the spectrum is of high enough quality) of using a search engine such as, for example Mascot, MSGF+, …, that will match the observed, experimental spectrum to theoratical spectra (see details below).

Schematics of a mass spectrometer and two rounds of MS.

Schematics of a mass spectrometer and two rounds of MS.

The animation below show how 25 ions different ions (i.e. having different M/Z values) are separated throughout the MS analysis and are eventually detected (i.e. quantified). The final frame shows the hypothetical spectrum.

Separation and detection of ions in a mass spectrometer.

Separation and detection of ions in a mass spectrometer.

The figures below illustrate the two rounds of MS. The spectrum on the left is an MS1 spectrum acquired after 21 minutes and 3 seconds of elution. 10 peaks, highlited by dotted vertical lines, were selected for MS2 analysis. The peak at M/Z 460.79 (488.8) is highlighted by a red (orange) vertical line on the MS1 spectrum and the fragment spectra are shown on the MS2 spectrum on the top (bottom) right figure.

Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).

Parent ions in the MS1 spectrum (left) and two sected fragment ions MS2 spectra (right).

The figures below represent the 3 dimensions of MS data: a set of spectra (M/Z and intensity) of retention time, as well as the interleaved nature of MS1 and MS2 (and there could be more levels) data.

MS1 spectra (blue) over retention time (left). MS2 spectra (pink) interleaved between two MS1 spectra (right),

MS1 spectra (blue) over retention time (left). MS2 spectra (pink) interleaved between two MS1 spectra (right),

Reading and accessing MS data

Let’s read a very small raw MS data file into R using the Spectra() from the Spectra package (Rainer et al. 2026). The file that we are going to load is available in the MsDataHub data package.

  1. Load the Spectra package
library("Spectra")
  1. Get the path to the raw MS data file. When you run this command for the first time, it will download the MS data file in a local cache so that next time you need it, it can be retrieved directly.
library("MsDataHub")
(rawf <- TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.20141210.mzML.gz())
##                                                   EH7808 
## "/home/lgatto/.cache/R/ExperimentHub/4189228548251_7858"
  1. Load the MS data into R using the Spectra() importer function.
sp <- Spectra(rawf)
sp
## MSn data (Spectra) with 7534 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    0.4584         1
## 2            1    0.9725         2
## 3            1    1.8524         3
## 4            1    2.7424         4
## 5            1    3.6124         5
## ...        ...       ...       ...
## 7530         2   3600.47      7530
## 7531         2   3600.83      7531
## 7532         2   3601.18      7532
## 7533         2   3601.57      7533
## 7534         2   3601.98      7534
##  ... 34 more variables/columns.
## 
## file(s):
## 4189228548251_7858

The object that is returned by Spectra() is of class Specta, that can store, access and manipulate raw MS data. Note that here we are focusing on MS-based proteomics data, but this also applied to MS-based metabolomics data.

class(sp)
## [1] "Spectra"
## attr(,"package")
## [1] "Spectra"
  1. We can find out how many spectra are available in that data using the function length. Full MS acquisitions would contain hundreds of thousands spectra.
length(sp)
## [1] 7534
  1. We can use various accessor function to get the MS level of these spectra, their retention time, or the M/Z and intensity of the precursor peaks of the ion corresponding to the MS2 spectra.
msLevel(sp)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [ reached 'max' / getOption("max.print") -- omitted 7434 entries ]
rtime(sp)
##   [1]  0.4584  0.9725  1.8524  2.7424  3.6124  4.4925  5.3825  6.2525  7.1324
##  [10]  8.0324  8.8925  9.7724 10.6624 11.5324 12.4125 13.2924 14.1724 15.0524
##  [19] 15.9324 16.8124 17.6925 18.5724 19.4524 20.3226 21.2124 22.0824 22.9624
##  [28] 23.8524 24.7224 25.6025 26.4825 27.3624 28.2424 29.1224 30.0024 30.8725
##  [37] 31.7624 32.6324 33.5125 34.3924 35.4096 36.2627 37.2628 38.2528 39.2427
##  [46] 40.2327 41.2227 42.2228 43.2027 44.2027 45.1927 46.1828 47.1727 48.1627
##  [55] 49.1527 50.1428 51.1428 52.1610 53.0124 53.8924 54.7724 55.6525 56.5324
##  [64] 57.4124 58.3025 59.1824 60.0825 60.9424 61.8324 62.7124 63.6033 64.4725
##  [73] 65.3524 66.2324 67.1227 68.0033 68.8825 69.7624 70.6324 71.5124 72.4033
##  [82] 73.2724 74.1624 75.0427 75.9432 76.8032 77.6824 78.5624 79.4425 80.3324
##  [91] 81.2127 82.1024 82.9924 83.8735 84.7625 85.6424 86.5324 87.4124 88.3024
## [100] 89.1824
##  [ reached 'max' / getOption("max.print") -- omitted 7434 entries ]
precursorMz(sp)
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [ reached 'max' / getOption("max.print") -- omitted 7434 entries ]
precursorIntensity(sp)
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [ reached 'max' / getOption("max.print") -- omitted 7434 entries ]
  1. We can also extract subsets of the full experiment or individual spectra using [ and plot them.
sp[3]
## MSn data (Spectra) with 1 spectra in a MsBackendMzR backend:
##     msLevel     rtime scanIndex
##   <integer> <numeric> <integer>
## 1         1    1.8524         3
##  ... 34 more variables/columns.
## 
## file(s):
## 4189228548251_7858
plotSpectra(sp[1119])

Visualisation of the 3rd MS spectrum in our small test data set.

Visualisation of the 3rd MS spectrum in our small test data set.

Exercises

Spectra objects also contains additional metadata for each spectrum, that can be accessed, as a DataFrame, with spectraData. What variables are available?

spectraData(sp)
## DataFrame with 7534 rows and 35 columns
##        msLevel     rtime acquisitionNum scanIndex            dataStorage
##      <integer> <numeric>      <integer> <integer>            <character>
## 1            1    0.4584              1         1 /home/lgatto/.cache/..
##                  dataOrigin centroided  smoothed  polarity precScanNum
##                 <character>  <logical> <logical> <integer>   <integer>
## 1    /home/lgatto/.cache/..      FALSE        NA         1          NA
##      precursorMz precursorIntensity precursorCharge collisionEnergy
##        <numeric>          <numeric>       <integer>       <numeric>
## 1             NA                 NA              NA              NA
##      isolationWindowLowerMz isolationWindowTargetMz isolationWindowUpperMz
##                   <numeric>               <numeric>              <numeric>
## 1                        NA                      NA                     NA
##      peaksCount totIonCurrent basePeakMZ basePeakIntensity electronBeamEnergy
##       <integer>     <numeric>  <numeric>         <numeric>          <numeric>
## 1         25800       9187505     445.12            984171                 NA
##      ionisationEnergy     lowMZ    highMZ mergedScan mergedResultScanNum
##             <numeric> <numeric> <numeric>  <integer>           <integer>
## 1                   0   399.999   2008.45         NA                  NA
##      mergedResultStartScanNum mergedResultEndScanNum injectionTime
##                     <integer>              <integer>     <numeric>
## 1                          NA                     NA       50.3826
##                filterString             spectrumId ionMobilityDriftTime
##                 <character>            <character>            <numeric>
## 1    FTMS + p NSI Full ms.. controllerType=0 con..                   NA
##      scanWindowLowerLimit scanWindowUpperLimit
##                 <numeric>            <numeric>
## 1                     400                 2000
##  [ reached 'max' / getOption("max.print") -- omitted 10 rows ]
spectraVariables(sp)
##  [1] "msLevel"                  "rtime"                   
##  [3] "acquisitionNum"           "scanIndex"               
##  [5] "dataStorage"              "dataOrigin"              
##  [7] "centroided"               "smoothed"                
##  [9] "polarity"                 "precScanNum"             
## [11] "precursorMz"              "precursorIntensity"      
## [13] "precursorCharge"          "collisionEnergy"         
## [15] "isolationWindowLowerMz"   "isolationWindowTargetMz" 
## [17] "isolationWindowUpperMz"   "peaksCount"              
## [19] "totIonCurrent"            "basePeakMZ"              
## [21] "basePeakIntensity"        "electronBeamEnergy"      
## [23] "ionisationEnergy"         "lowMZ"                   
## [25] "highMZ"                   "mergedScan"              
## [27] "mergedResultScanNum"      "mergedResultStartScanNum"
## [29] "mergedResultEndScanNum"   "injectionTime"           
## [31] "filterString"             "spectrumId"              
## [33] "ionMobilityDriftTime"     "scanWindowLowerLimit"    
## [35] "scanWindowUpperLimit"

The actual MS data, i.e. the M/Z and intensity pairs are accessed with the peaksData() function, that returns a list of matrices. Extract the peaks of the spectra 1, 1119, 2235 and 7500.

peaksData(sp[c(1, 1119, 2235, 7500)])
## List of length 4

Identification

The raw data is still a long way of obtaining biologically relevant proteomics data. The first step to obtain proteomics data is to identify the peptides that have been acquired in the MS. Peptide identification work by comparing expected and observed spectra. As shown below, when a precursor peptide ion is fragmented in a CID cell, it breaks at specific bonds, producing sets of peaks (a, b, c and x, y, z) that can be predicted.

Peptide fragmentation.

Peptide fragmentation.

It is thus possible to calculate the expected set of fagment peaks for a given peptide, such as SIGFEGDSIGR below. We will make use of the PSMatch package (Deflandre, Gibb, and Gatto 2026) to handle identification data.

library("PSMatch")
calculateFragments("SIGFEGDSIGR")
## Fixed modifications used: C=57.02146
## Variable modifications used: None
##           mz ion type pos z        seq     peptide
## 1   88.03931  b1    b   1 1          S SIGFEGDSIGR
## 2  201.12337  b2    b   2 1         SI SIGFEGDSIGR
## 3  258.14483  b3    b   3 1        SIG SIGFEGDSIGR
## 4  405.21324  b4    b   4 1       SIGF SIGFEGDSIGR
## 5  534.25583  b5    b   5 1      SIGFE SIGFEGDSIGR
## 6  591.27729  b6    b   6 1     SIGFEG SIGFEGDSIGR
## 7  706.30423  b7    b   7 1    SIGFEGD SIGFEGDSIGR
## 8  793.33626  b8    b   8 1   SIGFEGDS SIGFEGDSIGR
## 9  906.42032  b9    b   9 1  SIGFEGDSI SIGFEGDSIGR
## 10 963.44178 b10    b  10 1 SIGFEGDSIG SIGFEGDSIGR
## 11 175.11895  y1    y   1 1          R SIGFEGDSIGR
## 12 232.14041  y2    y   2 1         GR SIGFEGDSIGR
## 13 345.22447  y3    y   3 1        IGR SIGFEGDSIGR
## 14 432.25650  y4    y   4 1       SIGR SIGFEGDSIGR
##  [ reached 'max' / getOption("max.print") -- omitted 18 rows ]

To illustrate this, we are going to use data from the following file. Once downloaded, let’s load it to obtain a Spectra object that contains identified spectra.

spi <- readRDS("itraqdata2-spectra.rds")

The identification process aims to to compare obseved and expected peaks. If there is a good match, the MS2 spectrum is assigned the peptide sequence. Below, we use the plotSpectraPTM() function that automatically annotates the peaks based on the

s <- "SIGFEGDSIGR"
plotSpectraPTM(spi[14])

Matching observed and expected peaks.

Matching observed and expected peaks.

It is also possible to plot 2 spectra (that for example match the same sequence) to compare them directly

spi[c(25, 28)]$sequence
## [1] "IMIDLDGTENK" "IMIDLDGTENK"
plotSpectraMirror(spi[25], spi[28])

Direct comparison of 2 MS2 spectra.

Direct comparison of 2 MS2 spectra.

In a full experiment, all possible peptides from the known (or relevant) proteome of interest (such as databases that can be downloaded from the UniProt site1 The Universal Protein Resource (UniProt) is a freely and accessible comprehensive resource for protein sequence and annotation data.) are compared to the millions of observed spectra.

Statistical challenge The matching between millions of observed and possible spectra causes real challenges due to the large search space and the risk of false positives. See (Käll et al. 2008) for further reading.

From the list of identified peptides, it is then necessary to infer the most provable proteins that were present in the biological sample.

Statistical challenge Protein inference is a difficult task, as peptides often match multiple proteins (either different isoforms or proteins stemming from different gene but with identical domains), which leads to the definition of protein groups, i.e. sets of proteins that can’t be distinguished with the set of identified peptides at hand. See (Nesvizhskii and Aebersold 2005) for further reading.

Quantitation

The last step of MS data processing is to quantify peptide abundances in the biological samples. The table below summarises the different possibilites depending whether the proteins or peptides are labelled, and whether the quantitation is performed in MS1 or MS2.

Label-free Labelled
MS1 XIC SILAC, 15N
MS2 Counting iTRAQ, TMT

Label-free MS2: Spectral counting

In spectral counting, on simply counts the number of quantified peptides that are assigned to a protein.

Labelled MS2: Isobaric tagging

Isobaric tagging refers to the labelling using isobaric tags, i.e. chemical tags that have the same mass and hence can’t be distinguish by the spectrometer. The peptides of different samples (4, 6, 10 or 11) are labelled with different tags and combined prior to mass spectrometry acquisition. Given that they are isobaric, all identical peptides, irrespective of the tag and this the sample of origin, are co-analysed, up to fragmentation prior to MS2 analysis. During fragmentation, the isobaric tags fall of, fragment themselves, and result in a set of sample specific peaks. These specific peaks can be used to infer sample-specific quantitation, while the rest of the MS2 spectrum is used for identification.

Label-free MS1: extracted ion chromatograms

In label-free quantitation, the precursor peaks that match an identified peptide are integrated of retention time and the area under that extracted ion chromatogram is used to quantify that peptide in that sample.

Figure: credit Johannes Rainer.

Labelled MS1: SILAC

In SILAc quantitation, sample are grown in a medium that contains heavy amino acids (typically arginine and lysine). All proteins gown in this heavy growth medium contain the heavy form of these amino acids. Two samples, one grown in heavy medium, and one grown in normal (light) medium are then combined and analysed together. The heavy peptides precursor peaks are systematically shifted compared to the light ones, and the ratio between the height of a heavy and light peaks can be used to calculate peptide and protein fold-changes.

Figure: credit Wikimedia Commons.

Statistical challenge These different quantitation techniques come with their respective and distinct challenges, such as large quantities of raw data processing, data transformation and normalisation, and different underlying statistical models for the quantitative data (count data for spectral counting, continuous data for the others).

Exercise

As the name of the file implies, the data is an iTRAQ-based isobar quantitation experiment. We can visualise the reporter peaks as follows:

plotSpectra(spi[14])

Visualisation of the iTRAQ reporter peaks.

Visualisation of the iTRAQ reporter peaks.
plotSpectra(spi[14], xlim = c(114, 117.5))

Visualisation of the iTRAQ reporter peaks.

Visualisation of the iTRAQ reporter peaks.

From the closed-up figure, we see that the 4 reporter ions indicate that there was slightly less of that peptide in the two first samples, i.e. those labelled with the 114 and 115 reporters.

The rest of the spectrum (i.e m/z > 120) correspond to the dissociated fragment ions, used for identification.

Quantitative data processing

Statistical challenge Quantitative data processing come with numerous statistical challenges such as missing data handling, aggregation of quantitation data, data normalisation, … and the effect of these on the downstream statistical tests.

We will use the QFeatures package to handle quantitative data.

library("QFeatures")

Let’s start by loading a PSM-level quantitation table into a QFeatures object:

data(hlpsms) ## from the QFeature package
hlpsms
##           X126      X127C      X127N      X128C      X128N      X129C     X129N
## 383 0.12283431 0.08045915 0.07080406 0.09386901 0.05181569 0.13034383 0.1754010
## 475 0.35268185 0.14162381 0.16752388 0.07843497 0.07108744 0.03214548 0.0668626
## 478 0.01546089 0.16142297 0.08693813 0.23120844 0.11466435 0.09610188 0.1597782
##          X130C      X130N       X131 Sequence
## 383 0.04006866 0.11478839 0.11961594  SQGEIDk
## 475 0.03196179 0.02810434 0.02957384  YEAQGDk
## 478 0.01012712 0.08059400 0.04370403  TTScDTk
##                                                                                   ProteinDescriptions
## 383          Tetratricopeptide repeat protein 39B OS=Mus musculus GN=Ttc39b PE=1 SV=1 - [TT39B_MOUSE]
## 475 Vacuolar protein sorting-associated protein 4B OS=Mus musculus GN=Vps4b PE=1 SV=2 - [VPS4B_MOUSE]
## 478                        C-type mannose receptor 2 OS=Mus musculus GN=Mrc2 PE=1 SV=3 - [MRC2_MOUSE]
##     NbProteins ProteinGroupAccessions                     Modifications qValue
## 383          1                 Q8BYY4                      K7(TMT6plex)  0.008
## 475          1                 P46467                      K7(TMT6plex)  0.001
## 478          1                 Q64449 C4(Carbamidomethyl); K7(TMT6plex)  0.008
##        PEP IonScore NbMissedCleavages IsolationInterference IonInjectTimems
## 383 0.1180       27                 0                     0              70
## 475 0.0107       27                 0                     0              70
## 478 0.1180       11                 0                     0              70
##     Intensity Charge     mzDa     MHDa DeltaMassPPM RTmin markers
## 383    335000      2 503.2742 1005.541        -0.38 24.02 unknown
## 475    926000      2 520.2668 1039.526         0.61 18.85 unknown
## 478    159000      2 521.2584 1041.510         1.11 10.17 unknown
##  [ reached 'max' / getOption("max.print") -- omitted 3007 rows ]
## Get columns with quantitative data
## (others contain metadata)
(quantCols <- grep("^X", colnames(hlpsms)))
##  [1]  1  2  3  4  5  6  7  8  9 10
## Parse and load the data
qf <- readQFeatures(hlpsms, quantCols = quantCols,
                    name = "psms")
## Checking arguments.
## Loading data as a 'SummarizedExperiment' object.
## Formatting sample annotations (colData).
## Formatting data as a 'QFeatures' object.
qf
## An instance of class QFeatures (type: bulk) with 1 set:
## 
##  [1] psms: SummarizedExperiment with 3010 rows and 10 columns

In label-free experiments, given that each sample is acquired independently, the proportion of missing values can be as high several tens of percent. In such situations, removing features/rows with missing values can be too harsh. Imputation is possible (see the impute() function and documentation), albeit tricky, as different mechanisms can be responsible for missing value that appear either at random or not at random (Lazar et al. 2016).

In MS-baesd proteomics, we don’t acquire data at the protein level, given that we cleave our peptides. We thus need to aggregate spectrum-level quantitation values into peptide and protein-level data using.

Below, we illustrate this aggregating values using median aggregation and the aggregateFeatures function, first from PSMs to peptides, then from peptides to proteins.

qf <- qf |>
    aggregateFeatures(
        "psms",
        name = "peptides",
        fcol = "Sequence",
        fun = colMedians) |>
    aggregateFeatures(
        "peptides",
        name = "proteins",
        fcol = "ProteinDescriptions",
        fun = colMedians)
## Your row data contain missing values. Please read the relevant
## section(s) in the aggregateFeatures manual page regarding the effects
## of missing values on data aggregation.
## Aggregated: 1/1
## Aggregated: 1/1
qf
## An instance of class QFeatures (type: bulk) with 3 sets:
## 
##  [1] psms: SummarizedExperiment with 3010 rows and 10 columns 
##  [2] peptides: SummarizedExperiment with 2923 rows and 10 columns 
##  [3] proteins: SummarizedExperiment with 1600 rows and 10 columns

See ?aggregateFeatures for more details.

Following on from here, many data processing such as normalisation, non-specific filtering, and hypothesis testing is very similar to other omics data.

You can find more details in the many Bioconductor package vignettes. The R for Mass Spectrometry book provide further details and state-of-the-art infrastructure for MS and proteomics data analysis.

Applications in statistical learning

Proteomics can be used for many applications. Simple identification and quantitation of proteins in various different biological samples to identify differentially abundant proteins (this is possible in bulk and single-cells - see the scp package), but also analysis of post-translational modifications, interactions between proteins, and protein sub-cellular localisation, as illustrated below.

The `pRoloc` package. The pRoloc package.

In this section, I illustrate a use case of mass spectrometry-based proteomics to infer sub-cellular protein localisation and the application of statistical learning. This section requires the pRoloc and pRolocdata packages (L. Gatto et al. 2014; Breckels et al. 2016). Both packages can be installed with the BiocManager::install() function, as illustrated above.

library("pRoloc")
library("pRolocdata")

The hyperLOPIT2015 data contains localisation data for 5032 proteins in mouse embryonic stem cells (Christoforou et al. 2016). The protein information is collected along a 20 fraction density gradient - our data matrix has dimensions 5032 rows and 20 columns. We use PCA to easily visualise it in 2 dimensions.

data(hyperLOPIT2015)
## set colours
setStockcol(paste0(getStockcol(), 80))
## produce PCA plot
plot2D(hyperLOPIT2015)
## Warning in prcomp.default(x = structure(c(0.028, 0.039, 0.021, 0.12, 0.055, :
## partial argument match of 'scale' to 'scale.'

Mouse stem cell spatial proteomics data from Christoforou et al.

Mouse stem cell spatial proteomics data from Christoforou et al.

Each dot on the PCA plot represents a single protein. The protein is either of unknown localisation, and represented by a grey circle, or is of known localisation (and called an organelle marker) and is coloured according to its expected sub-cellular localisation.

In the next figure, we have trained an support vector machine (SVM) model and classified proteins of unknown localisation using an SVM model. The size of the dots are scaled according to the classifier score and only assignments corresponding to a false discovery rate of 5% have been considered.

sz <- exp(fData(hyperLOPIT2015)$svm.score) - 1
plot2D(hyperLOPIT2015, fcol = "final.assignment", cex = sz)
## Warning in prcomp.default(x = structure(c(0.028, 0.039, 0.021, 0.12, 0.055, :
## partial argument match of 'scale' to 'scale.'

New sub-cellular assignment after using support vector machine classifier.

New sub-cellular assignment after using support vector machine classifier.

Session information

## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /opt/R-4.5/lib/R/lib/libRblas.so 
## LAPACK: /opt/R-4.5/lib/R/lib/libRlapack.so;  LAPACK version 3.12.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Brussels
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pRolocdata_1.48.0           pRoloc_1.50.0              
##  [3] MLInterfaces_1.90.0         cluster_2.1.8.1            
##  [5] annotate_1.88.0             XML_3.99-0.20              
##  [7] AnnotationDbi_1.72.0        MSnbase_2.36.0             
##  [9] ProtGenerics_1.42.0         mzR_2.44.0                 
## [11] Rcpp_1.1.0                  QFeatures_1.20.0           
## [13] MultiAssayExperiment_1.36.1 SummarizedExperiment_1.40.0
## [15] Biobase_2.70.0              GenomicRanges_1.62.1       
## [17] Seqinfo_1.0.0               IRanges_2.44.0             
## [19] MatrixGenerics_1.22.0       matrixStats_1.5.0          
## [21] PSMatch_1.14.0              MsDataHub_1.10.0           
## [23] Spectra_1.20.0              BiocParallel_1.44.0        
## [25] S4Vectors_0.48.0            BiocGenerics_0.56.0        
## [27] generics_0.1.4             
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.5.0           filelock_1.0.3          tibble_3.3.0           
##   [4] hardhat_1.4.2           preprocessCore_1.72.0   pROC_1.19.0.1          
##   [7] rpart_4.1.24            lifecycle_1.0.4         httr2_1.2.2            
##  [10] doParallel_1.0.17       globals_0.18.0          lattice_0.22-7         
##  [13] MASS_7.3-65             dendextend_1.19.1       magrittr_2.0.4         
##  [16] limma_3.66.0            plotly_4.11.0           sass_0.4.10            
##  [19] rmarkdown_2.30          jquerylib_0.1.4         yaml_2.3.12            
##  [22] otel_0.2.0              MsCoreUtils_1.22.1      DBI_1.2.3              
##  [25] RColorBrewer_1.1-3      lubridate_1.9.4         abind_1.4-8            
##  [28] purrr_1.2.0             mixtools_2.0.0.1        AnnotationFilter_1.34.0
##  [31] nnet_7.3-20             rappdirs_0.3.3          ipred_0.9-15           
##  [34] lava_1.8.2              listenv_0.10.0          BiocStyle_2.38.0       
##  [37] parallelly_1.46.0       ncdf4_1.24              codetools_0.2-20       
##  [40] DelayedArray_0.36.0     tidyselect_1.2.1        farver_2.1.2           
##  [43] viridis_0.6.5           BiocFileCache_3.0.0     jsonlite_2.0.0         
##  [46] caret_7.0-1             e1071_1.7-17            survival_3.8-3         
##  [49] iterators_1.0.14        foreach_1.5.2           segmented_2.1-4        
##  [52] tools_4.5.0             progress_1.2.3          glue_1.8.0             
##  [55] prodlim_2025.04.28      gridExtra_2.3           SparseArray_1.10.7     
##  [58] BiocBaseUtils_1.12.0    xfun_0.55               tufte_0.14.0           
##  [61] dplyr_1.1.4             withr_3.0.2             BiocManager_1.30.27    
##  [64] fastmap_1.2.0           digest_0.6.39           timechange_0.3.0       
##  [67] R6_2.6.1                colorspace_2.1-2        gtools_3.9.5           
##  [70] lpSolve_5.6.23          biomaRt_2.66.0          RSQLite_2.4.5          
##  [73] tidyr_1.3.1             hexbin_1.28.5           data.table_1.17.8      
##  [76] recipes_1.3.1           FNN_1.1.4.1             class_7.3-23           
##  [79] prettyunits_1.2.0       httr_1.4.7              htmlwidgets_1.6.4      
##  [82] S4Arrays_1.10.1         ModelMetrics_1.2.2.2    pkgconfig_2.0.3        
##  [85] gtable_0.3.6            timeDate_4051.111       blob_1.2.4             
##  [88] S7_0.2.1                impute_1.84.0           XVector_0.50.0         
##  [91] htmltools_0.5.9         MALDIquant_1.22.3       clue_0.3-66            
##  [94] scales_1.4.0            png_0.1-8               gower_1.0.2            
##  [97] knitr_1.50              MetaboCoreUtils_1.18.1  reshape2_1.4.5         
## [100] coda_0.19-4.1          
##  [ reached 'max' / getOption("max.print") -- omitted 48 entries ]

References

Breckels, L M, C M Mulvey, K S Lilley, and L Gatto. 2016. “A Bioconductor Workflow for Processing and Analysing Spatial Proteomics Data.” F1000Res 5: 2926. https://doi.org/10.12688/f1000research.10411.2.
Christoforou, A, C M Mulvey, L M Breckels, A Geladaki, T Hurrell, P C Hayward, T Naake, et al. 2016. “A Draft Map of the Mouse Pluripotent Stem Cell Spatial Proteome.” Nat Commun 7 (January): 8992. https://doi.org/10.1038/ncomms9992.
Deflandre, Guillaume, Sebastian Gibb, and Laurent Gatto. 2026. PSMatch: An R/Bioconductor Package to Explore Proteomics Identification Data.” OSF Preprints. osf.io/62v9p_v3.
Gatto, Laurent. 2019. Bioconductor Tools for Mass Spectrometry and Proteomics. https://rawgit.com/lgatto/bioc-ms-prot/master/lab.html.
Gatto, L, L M Breckels, S Wieczorek, T Burger, and K S Lilley. 2014. “Mass-Spectrometry-Based Spatial Proteomics Data Analysis Using pRoloc and pRolocdata.” Bioinformatics 30 (9): 1322–24. https://doi.org/10.1093/bioinformatics/btu013.
Gatto, L, and A Christoforou. 2014. “Using R and Bioconductor for Proteomics Data Analysis.” Biochim. Biophys. Acta 1844 (1 Pt A): 42–51.
Huber, W, V J Carey, R Gentleman, S Anders, M Carlson, B S Carvalho, H C Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21.
Käll, L, J D Storey, M J MacCoss, and W S Noble. 2008. “Posterior Error Probabilities and False Discovery Rates: Two Sides of the Same Coin.” J Proteome Res 7 (1): 40–44. https://doi.org/10.1021/pr700739d.
Lazar, C, L Gatto, M Ferro, C Bruley, and T Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” J Proteome Res 15 (4): 1116–25. https://doi.org/10.1021/acs.jproteome.5b00981.
Nesvizhskii, A I, and R Aebersold. 2005. “Interpretation of Shotgun Proteomic Data: The Protein Inference Problem.” Mol Cell Proteomics 4 (10): 1419–40. https://doi.org/10.1074/mcp.R500012-MCP200.
Rainer, Johannes, Marilyn De Graeve, Philippine Louail, Sebastian Gibb, and Laurent Gatto. 2026. “An Open Infrastructure for Mass Spectrometry Data in R,” February. osf.io/cwt2v_v3.
Sinha, Ankit, and Matthias Mann. 2020. A beginner’s guide to mass spectrometry–based proteomics.” The Biochemist, September. https://doi.org/10.1042/BIO20200057.