1 Background

This case-study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/\(\mu\)L Saccharomyces cerevisiae strain BY4741. Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/\(\mu\)L) and 6B (0.74 fmol UPS1 proteins/\(\mu\)L) [5]. We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016) [1]. Three replicates are available for each concentration.

2 Data

We first import the peptides.txt file. This is the file that contains your peptide-level intensities. For a MaxQuant search [6], this peptides.txt file can be found by default in the “path_to_raw_files/combined/txt/” folder from the MaxQuant output, with “path_to_raw_files” the folder where raw files were saved. In this tutorial, we will use a MaxQuant peptides file from MaxQuant that can be found on the pdaData repository. We will use the MSnbase package to import the data.

We generate the object peptideFile with the path to the peptides.txt file. In this file we will fetch the data from the github repository linked to the pda course: https://statomics.github.io/pda. You can also replace the peptideFile with a string that points to the path of a file on your local hard drive. With the grepEcols function we find the columns that are containing the expression data of the peptides in the peptides.txt file.

library(MSqRob)
library(MSnbase)
library(tidyverse)
library(limma)
peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda/data/quantification/cptacAvsB_lab3/peptides.txt"
proteinGroupsFile <-"https://raw.githubusercontent.com/statOmics/pda/data/quantification/cptacAvsB_lab3/proteinGroups.txt"
#peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda/data/quantification/cancer/peptides3vs3.txt"
#proteinGroupsFile <-"https://raw.githubusercontent.com/statOmics/pda/data/quantification/cancer/proteinGroups.txt"
#peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda/data/quantification/cancer/peptides9vs9.txt"
#proteinGroupsFile <-"https://raw.githubusercontent.com/statOmics/pda/data/quantification/cancer/proteinGroups.txt"
ecols<-grepEcols(peptidesFile, "Intensity ", split = "\t")

Next we import the peptide intensities

pepData<-readMSnSet2(peptidesFile,ecol=ecols,fnames="Sequence",sep="\t")
pepData

## MSnSet (storageMode: lockedEnvironment)
## assayData: 11466 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNRVGFAEAAR
##     (11466 total)
##   fvarLabels: Sequence N.term.cleavage.window ... MS.MS.Count (65
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.10.1

The pepData object is an MSnSet, a container for the data, features information and experimental annotation. They can be accessed using the accessor functions ‘exprs’ (matrix of intensities, features in rows, samples in columns), ‘fData’ (properties for each feature, peptide or protein, on the rows) and ‘pData’ (properties for the samples on the columns).

We will make use from data wrangling functionalities from the tidyverse package. The %>% operator allows us to pipe the output of one function to the next function.

head(exprs(pepData))

##                    Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## AAAAGAGGAGDSGDAVTK              0              0          66760
## AAAALAGGK                 2441300        1220000        1337600
## AAAALAGGKK                1029200         668040         638990
## AAADALSDLEIK               515460         670780         712140
## AAADALSDLEIKDSK            331130         420900         365560
## AAAEEFQR                        0              0          51558
##                    Intensity.6B_7 Intensity.6B_8 Intensity.6B_9
## AAAAGAGGAGDSGDAVTK              0          37436              0
## AAAALAGGK                 2850900         935580        1606200
## AAAALAGGKK                 777030         641270         562840
## AAADALSDLEIK               426580         620510         737780
## AAADALSDLEIKDSK            329250         380820         414490
## AAAEEFQR                        0              0          76970

pepData %>% exprs %>% head

##                    Intensity.6A_7 Intensity.6A_8 Intensity.6A_9
## AAAAGAGGAGDSGDAVTK              0              0          66760
## AAAALAGGK                 2441300        1220000        1337600
## AAAALAGGKK                1029200         668040         638990
## AAADALSDLEIK               515460         670780         712140
## AAADALSDLEIKDSK            331130         420900         365560
## AAAEEFQR                        0              0          51558
##                    Intensity.6B_7 Intensity.6B_8 Intensity.6B_9
## AAAAGAGGAGDSGDAVTK              0          37436              0
## AAAALAGGK                 2850900         935580        1606200
## AAAALAGGKK                 777030         641270         562840
## AAADALSDLEIK               426580         620510         737780
## AAADALSDLEIKDSK            329250         380820         414490
## AAAEEFQR                        0              0          76970

pepData %>% sampleNames

## [1] "Intensity.6A_7" "Intensity.6A_8" "Intensity.6A_9" "Intensity.6B_7"
## [5] "Intensity.6B_8" "Intensity.6B_9"

The sample names are rather long and contain information on the spike-in concentration and the repeat. We will remove “Intensity.6” from the string

sampleNames(pepData) <- pepData %>% sampleNames %>% str_replace(., pattern="Intensity.6", replacement="")

Next we will add information on the proteins to the feature data. The feature data of the object pepData contains information for each peptide in the experiment. This info on the proteins can be found in the proteingroups.txt file of maxquant. The peptides.txt file contains many data on each feature (peptide). We will only retain specified columns Protein, Sequence, Reverse and Potential.Contaminant, indicating the protein, peptide sequence, if the peptide is a decoy and if it is a contaminant, respectively.

In older and more recent versions of maxquant the names of the fields might vary. We can therefore explore the names we want to select with the shiny app that can be launched from the selectFeatureData from the MSnbase package, i.e. by running the function without an fcol argument that specifies the columns you would like to select.

The code can be found below, but is not executed by this Rmd script (note, that the flag eval=FALSE has been used as option for the R chunk).

pepData<-selectFeatureData(pepData)

If we know the columns, we can specify them as a vector in the fcol slot.

pepData<-selectFeatureData(pepData,fcol=c("Proteins","Potential.contaminant","Reverse","Sequence"))

Calculate how many non zero intensities we have per peptide. This will be useful for filtering.

fData(pepData)$nNonZero<- rowSums(exprs(pepData) > 0)

Peptides with zero intensities are missing peptides and should be represent with a NA value instead of 0.

exprs(pepData)[exprs(pepData)==0]<-NA

Next we create the data on the experimental layout. We can do this based on the samplenames. The information on the spike in condition is available as the first letter of the sample name. We extract this information and recast the character vector in a factor.

pData(pepData)$condition <- pepData %>% sampleNames %>% substr(1,1) %>% as.factor

2.1 Data exploration

We can inspect the missingness in our data with the plotNA() function provided with MSnbase. 45% of all peptide intensities are missing and for some peptides we don’t even measure a signal in any sample. The missingness is similar across samples. Note, that we plot the peptide data, so the label protein in the plot refers to peptides.

plotNA(pepData)

When we look at density plots of the intensities, we see that most intensities are low and that there is a long tail to the right. It is more useful to explore the data upon log transformation.

plotDensities(exprs(pepData))

nrow(pepData)

## [1] 11466

There are 11466 peptides before preprocessing.

3 Preprocessing

We will log transform, normalize, filter and summarize the data.

3.1 Log transform the data

pepData <- log(pepData, base = 2)
plotDensities(exprs(pepData))

When we look at the densities we see small shifts in location and shape of the density distributions. So normalisation will be required. But, we will first perform filtering.

3.2 Filtering

3.2.1 Handling overlapping protein groups

In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.

pepData<-pepData[fData(pepData)$Proteins %in% smallestUniqueGroups(fData(pepData)$Proteins),]

3.2.2 Remove reverse sequences (decoys) and contaminants

We now remove the contaminants, peptides that map to decoy sequences and proteins, which were only identified by peptides with modifications.

pepData <- pepData[fData(pepData)$Reverse!="+",]
pepData <- pepData[fData(pepData)$Potential.contaminant!="+",]

3.2.3 Remove peptides of proteins that were only identified with modified peptides

Proteins for which all peptides are carrying modifications (PTMs) can be considered as unreliable. We will filter out these proteins. This information is included in the Only.identified.by.site column of proteinGroups.txt file of maxquant. The variable is containing a “+” character if the protein is only identified by modified peptides and an empty string “” if this is not the case. Sometimes an NA value is also present. We will replace these by the empty character string. The information of the peptides.txt file can be linked to the proteinGroups.txt file by using the Proteins column from the peptides.txt file and the Protein.IDs column in the proteinGroups.txt file. If NAs are included in the

proteinGroups<-read.table(proteinGroupsFile, sep = "\t", header = TRUE, quote = "", comment.char = "")
only_site <- proteinGroups$Only.identified.by.site
only_site[is.na(only_site)] <- ""
proteinsOnlyIdentifiedWithPTMs <- proteinGroups$Protein.IDs[only_site=="+"]
pepData<-pepData[!(fData(pepData)$Proteins %in% proteinsOnlyIdentifiedWithPTMs),]

We will now remove the proteinGroups object from the R session to free memory.

rm(proteinGroups,only_site)
gc()

##           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells 4191128 223.9    8160118 435.8         NA  5421600 289.6
## Vcells 7362700  56.2   14786712 112.9      16384 12255594  93.6

3.2.4 Drop peptides that were only identified in one sample

We want to keep peptide that were at least observed twice.

pepData<-pepData[fData(pepData)$nNonZero>=2,]
nrow(pepData)

## [1] 7006

We keep 7006 peptides upon filtering.

3.3 Quantile normalize the data

pepData <- normalise(pepData, "quantiles")

3.4 Explore quantile normalized data

Upon normalisation the density curves for all samples coincide.

plotDensities(exprs(pepData))

We can visualize our data using a Multi Dimensional Scaling plot, eg. as provided by the limma package.

plotMDS(exprs(pepData),col=as.double(pData(pepData)$condition))

The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences (log FC) in the peptide data seems to be driven by technical variability. Indeed the samples do not seem to be clearly separated according to the spike in condition. Because there are missing values for the peptide level data we use the na.rm=TRUE argument to summarize the data based on the observed peptides for every protein.

3.5 Summarization to protein level

protData<-combineFeatures(pepData,fcol="Proteins",method="robust",na.rm=TRUE)

## Your data contains missing values. Please read the relevant section
## in the combineFeatures manual page for details the effects of
## missing values on data aggregation.

If you summarize using robust summarization please refer to Sticker et al. (2019) https://www.biorxiv.org/content/10.1101/668863v1

We filter proteins for which we have at least two protein expression values in both conditions:

protData<-protData[
protData %>% exprs %>% is.na %>% `!` %>% rowSums>4,]

We notice that the leading differences (log FC) in the protein data is still according to technical variation. On the second dimension, however, we also observe a clear separation according to the spike-in condition. Hence, the summarization that accounts for peptide specific effects makes the effects due to the spike-in condition more prominent!

plotMDS(exprs(protData),col=as.double(pData(pepData)$condition))

4 Data Analysis

4.1 Estimation

MSqRob is currently working with a format where we have one dataframe for each protein. This will be changed in the next release. Therefore we first have to reorganise the data.

Next the models are fitted. This is done using the fit.model function. We only have to model the data using the factor condition from the pheno data of the protein level MSnSet. The name of the factor variable is specified in the fixed argument (if multiple predictors have to be incorporated in the model, a vector of variable names has to be provided in this argument.). The argument shrinkage is used to specify if ridge regression has to be adopted. For the sake of speed we do not do this in the tutorial. The shrinkage has to be specified for each variable in the fixed effects. We also have to indicate this for the intercept (which we never shrink). So we specify it at c(0,0) to indicate that the intercept (first 0) and the parameters for the factor condition (second 0) are not penalized. We set the robust_var function equal to FALSE, this functionality will be removed from the package in the next release.

protMSqRob <- MSnSet2protdata(protData, "Proteins")
models <- fit.model(protdata=protMSqRob, response="quant_value", fixed="condition",shrinkage.fixed=c(0,0),robust_var=FALSE)

4.2 Inference

Many biologists have problems with the reference coding. In MSqRob we have opted to formulate contrasts using all levels of a factor. Internally, the contrasts are than recasted according to the factor level that is the reference class.

L <- makeContrast("conditionB - conditionA",  levels=c("conditionA","conditionB"))
tests <- test.contrast_adjust(models, L)
nSig <- sum(tests$signif,na.rm=TRUE)
head(tests,nSig)

##                           estimate         se       df      Tval
## P99999ups|CYC_HUMAN_UPS   2.050636 0.11605868 6.654575 17.668960
## P63279ups|UBC9_HUMAN_UPS  1.822164 0.10313343 6.654575 17.668029
## P08311ups|CATG_HUMAN_UPS  1.619617 0.09702194 6.654575 16.693308
## P10636-8ups|TAU_HUMAN_UPS 1.359550 0.09607743 6.654575 14.150566
## P02787ups|TRFE_HUMAN_UPS  1.450411 0.11489626 6.654575 12.623657
## P06732ups|KCRM_HUMAN_UPS  1.635609 0.14163621 6.654575 11.547958
## Q06830ups|PRDX1_HUMAN_UPS 1.530474 0.13843135 6.654575 11.055830
## P51965ups|UB2E1_HUMAN_UPS 1.673807 0.16336963 6.654575 10.245524
## P01344ups|IGF2_HUMAN_UPS  1.375413 0.11320467 5.654575 12.149786
## P01127ups|PDGFB_HUMAN_UPS 1.455661 0.15661185 6.654575  9.294706
## O00762ups|UBE2C_HUMAN_UPS 1.564630 0.14566881 5.654575 10.741008
## P63165ups|SUMO1_HUMAN_UPS 1.225013 0.14024360 6.654575  8.734891
## P07339ups|CATD_HUMAN_UPS  1.984265 0.20559617 5.654575  9.651272
## P01031ups|CO5_HUMAN_UPS   1.784246 0.18621568 5.654575  9.581611
## P00918ups|CAH2_HUMAN_UPS  1.481828 0.18585267 6.654575  7.973132
## P12081ups|SYHC_HUMAN_UPS  1.460529 0.16157996 5.654575  9.039047
## P02144ups|MYG_HUMAN_UPS   1.434464 0.21707277 6.654575  6.608219
## P01375ups|TNFA_HUMAN_UPS  1.858506 0.25357755 5.654575  7.329141
## P08263ups|GSTA1_HUMAN_UPS 1.214455 0.19778835 6.654575  6.140176
## P09211ups|GSTP1_HUMAN_UPS 1.200100 0.17612676 5.654575  6.813842
## P61769ups|B2MG_HUMAN_UPS  2.013064 0.33931273 6.654575  5.932768
##                                   pval         qval signif
## P99999ups|CYC_HUMAN_UPS   7.695925e-07 0.0004422204   TRUE
## P63279ups|UBC9_HUMAN_UPS  7.698573e-07 0.0004422204   TRUE
## P08311ups|CATG_HUMAN_UPS  1.114841e-06 0.0004422204   TRUE
## P10636-8ups|TAU_HUMAN_UPS 3.259668e-06 0.0009697512   TRUE
## P02787ups|TRFE_HUMAN_UPS  6.802262e-06 0.0016189384   TRUE
## P06732ups|KCRM_HUMAN_UPS  1.202521e-05 0.0023849998   TRUE
## Q06830ups|PRDX1_HUMAN_UPS 1.586476e-05 0.0026970094   TRUE
## P51965ups|UB2E1_HUMAN_UPS 2.568011e-05 0.0038199157   TRUE
## P01344ups|IGF2_HUMAN_UPS  2.890451e-05 0.0038218185   TRUE
## P01127ups|PDGFB_HUMAN_UPS 4.729006e-05 0.0056275169   TRUE
## O00762ups|UBE2C_HUMAN_UPS 5.657889e-05 0.0061208070   TRUE
## P63165ups|SUMO1_HUMAN_UPS 6.954121e-05 0.0068961704   TRUE
## P07339ups|CATD_HUMAN_UPS  1.008135e-04 0.0089091090   TRUE
## P01031ups|CO5_HUMAN_UPS   1.048130e-04 0.0089091090   TRUE
## P00918ups|CAH2_HUMAN_UPS  1.217723e-04 0.0096606026   TRUE
## P12081ups|SYHC_HUMAN_UPS  1.431931e-04 0.0106499868   TRUE
## P02144ups|MYG_HUMAN_UPS   3.749826e-04 0.0262487840   TRUE
## P01375ups|TNFA_HUMAN_UPS  4.319106e-04 0.0285540867   TRUE
## P08263ups|GSTA1_HUMAN_UPS 5.751165e-04 0.0360204519   TRUE
## P09211ups|GSTP1_HUMAN_UPS 6.288343e-04 0.0374156384   TRUE
## P61769ups|B2MG_HUMAN_UPS  7.005605e-04 0.0396984292   TRUE

There are 21 proteins with a significant effect at the 5% FDR level.

4.3 Plots

4.3.1 Volcano-plot

volcano<- ggplot(tests,aes(x=estimate,y=-log10(pval),color=signif)) + geom_point() + scale_color_manual(values=c("black","red"))
volcano

An interactive volcano plot can be obtained by using the plotly library. We alter the object tests to add the protein name to the data frame in ggplot, this will allow us to see the protein name in the interactive plot.

library(plotly)
volcano<- ggplot(tests %>% rownames_to_column("protein"),aes(x=estimate,y=-log10(pval),color=signif,protein=protein)) + geom_point() + scale_color_manual(values=c("black","red"))
ggplotly(volcano)

4.3.2 Heatmap

We first select the names of the significant proteins.

sigNames<- tests %>% rownames_to_column("protein") %>% filter(signif) %>% pull(protein)
heatmap(exprs(protData)[sigNames,])

4.3.3 Detail plot for the top 5 proteins.

We first extract the normalized protein expression values for a particular protein and we create a dataframe that contains the protein expression values and the relevant variables of the experimental design.

for (protName in rownames(tests)[1:5])
{
plotDat<-pepData[fData(pepData)$Proteins==protName] %>% exprs
plotDataStack<-data.frame(quant_value=c(plotDat),sequence=rep(rownames(plotDat),rep=ncol(plotDat)),sample=rep(colnames(plotDat),each=nrow(plotDat)),condition=rep(pData(pepData)$condition,each=nrow(plotDat)))
plot1 <- ggplot(plotDataStack, aes(x=sample, y=quant_value,fill=condition))
print(plot1 +geom_boxplot(outlier.shape=NA) + geom_point(position=position_jitter(width=.1),aes(shape=sequence)) + scale_shape_manual(values=1:nrow(plotDat)) +labs(title = protName, x="sample", y="Peptide intensity (log2)"))
}

4.3.4 Sensitivity FDP plot

Because we know the ground truth for the cptac study, i.e. we know that only the spike-in proteins (UPS) are differentially expressed, we can evalute the fold changes. Yeast proteins should be not differentially expressed and their log fold changes should be centered around 0. These of UPS proteins are spiked at differt concentrations and their log2 fold changes should be centered around log2(concB/concA)=log2(0.74/0.25)=1.5655972.

We will first create a novel factor in the results table that is called spike to indicate which proteins are spiked.

tests$spike <- tests %>% rownames %>% grepl(.,pattern="UPS")
ggplot(tests, aes(x=spike, y=estimate)) +
  geom_boxplot() +
  geom_hline(yintercept=c(0,log(0.74/0.25,base=2)),color="red")+
  ylab("log2 FC")

4.3.5 Sensitivity FDP plot

Because we know the ground truth for the cptac study, i.e. we know that only the spike-in proteins (UPS) are differentially expressed, we can calculate

the sensitivity or true positive rate (TPR), the proportion of actual positives that are correctly identified, in the protein list that we return \[TPR=\frac{TP}{\text{#actual positives}},\] here TP are the true positives in the list. The TPR is thus the fraction of ups proteins that we can recall.
false discovery proportion (FPD): fraction of false positives in the protein list that we return: \[FPD=\frac{FP}{FP+TP},\] with FP the false positives. In our case the yeast proteins that are in our list.

Instead of only calculating that for the protein list that is returned for the chosen FDR level, we can do this for all possible FDR cutoffs so that we get an overview of the quality of the ranking of the proteins in the protein list.

tests<-tests %>% rownames_to_column("protein") %>%
mutate(
FDP=cumsum(!spike)/(1:length(spike)),
TPR=cumsum(spike)/sum(spike)) %>%
column_to_rownames("protein")

ggplot(tests,aes(x=FDP,y=TPR)) + geom_path() +geom_vline(xintercept=0.05,lty=2) + geom_point(data=tests[sum(tests$signif,na.rm=TRUE),],aes(x=FDP,y=TPR),cex=2)

We observe that there is a good FDR control: the FDP at the 5% FDR level is indeed close to 0.05.

Introduction to proteomics data analysis

5 July 2019

Contents