Contents

1 Background

Eighteen Estrogen Receptor Positive Breast cancer tissues from from patients treated with tamoxifen upon recurrence have been assessed in a proteomics study. Nine patients had a good outcome (or) and the other nine had a poor outcome (pd). The proteomes have been assessed using an LTQ-Orbitrap and the thermo output .RAW files were searched with MaxQuant (version 1.4.1.2) against the human proteome database (FASTA version 2012-09, human canonical proteome).

2 Data

We first import the peptides.txt file. This is the file that contains your peptide-level intensities. For a MaxQuant search [6], this peptides.txt file can be found by default in the “path_to_raw_files/combined/txt/” folder from the MaxQuant output, with “path_to_raw_files” the folder where raw files were saved. In this tutorial, we will use a MaxQuant peptides file from MaxQuant that can be found on the pdaData repository. We will use the MSnbase package to import the data.

We generate the object peptideFile with the path to the peptides.txt file. In this file we will fetch the data from the github repository linked to the pda course: https://statomics.github.io/pda. You can also replace the peptideFile with a string that points to the path of a file on your local hard drive. With the grepEcols function we find the columns that are containing the expression data of the peptides in the peptides.txt file.

library(MSqRob)
library(MSnbase)
library(tidyverse)
library(limma)
peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda/data/quantification/cancer/peptides3vs3.txt"
proteinGroupsFile <-"https://raw.githubusercontent.com/statOmics/pda/data/quantification/cancer/proteinGroups.txt"
ecols<-grepEcols(peptidesFile, "Intensity ", split = "\t")

Next we import the peptide intensities

pepData<-readMSnSet2(peptidesFile,ecol=ecols,fnames="Sequence",sep="\t")
pepData
## MSnSet (storageMode: lockedEnvironment)
## assayData: 34205 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: AAAAAAAAAAAAAAAGAGAGAK AAAAAAAAAAGAAGGR ...
##     YYYDGKDYIEFNK (34205 total)
##   fvarLabels: Sequence Proteins ... Oxidation..M..site.IDs (38
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.10.1

The pepData object is an MSnSet, a container for the data, features information and experimental annotation. They can be accessed using the accessor functions ‘exprs’ (matrix of intensities, features in rows, samples in columns), ‘fData’ (properties for each feature, peptide or protein, on the rows) and ‘pData’ (properties for the samples on the columns).

We will make use from data wrangling functionalities from the tidyverse package. The %>% operator allows us to pipe the output of one function to the next function.

head(exprs(pepData))
##                            Intensity.OR.01 Intensity.OR.04 Intensity.OR.07
## AAAAAAAAAAAAAAAGAGAGAK                   0        17923000         6237900
## AAAAAAAAAAGAAGGR                   8439600        15859000         5783100
## AAAAAAAGDSDSWDADAFSVEDPVRK         3352800         6592900         5295700
## AAAAAAALQAK                              0        22380000        14302000
## AAAAAAGAASGLPGPVAQGLK              3867400        13661000         3764300
## AAAAAAGAGPEMVR                     4113700         4431500         4944400
##                            Intensity.PD.02 Intensity.PD.03 Intensity.PD.04
## AAAAAAAAAAAAAAAGAGAGAK            11510000         3003800               0
## AAAAAAAAAAGAAGGR                  11056000         4491600         5673700
## AAAAAAAGDSDSWDADAFSVEDPVRK         4721100         6766500         5119600
## AAAAAAALQAK                       31208000        15994000        18562000
## AAAAAAGAASGLPGPVAQGLK              5036100         9125600         2497600
## AAAAAAGAGPEMVR                           0         3424900         2745700
pepData %>% exprs %>% head
##                            Intensity.OR.01 Intensity.OR.04 Intensity.OR.07
## AAAAAAAAAAAAAAAGAGAGAK                   0        17923000         6237900
## AAAAAAAAAAGAAGGR                   8439600        15859000         5783100
## AAAAAAAGDSDSWDADAFSVEDPVRK         3352800         6592900         5295700
## AAAAAAALQAK                              0        22380000        14302000
## AAAAAAGAASGLPGPVAQGLK              3867400        13661000         3764300
## AAAAAAGAGPEMVR                     4113700         4431500         4944400
##                            Intensity.PD.02 Intensity.PD.03 Intensity.PD.04
## AAAAAAAAAAAAAAAGAGAGAK            11510000         3003800               0
## AAAAAAAAAAGAAGGR                  11056000         4491600         5673700
## AAAAAAAGDSDSWDADAFSVEDPVRK         4721100         6766500         5119600
## AAAAAAALQAK                       31208000        15994000        18562000
## AAAAAAGAASGLPGPVAQGLK              5036100         9125600         2497600
## AAAAAAGAGPEMVR                           0         3424900         2745700
pepData %>% sampleNames
## [1] "Intensity.OR.01" "Intensity.OR.04" "Intensity.OR.07" "Intensity.PD.02"
## [5] "Intensity.PD.03" "Intensity.PD.04"

The sample names are rather long and contain information on the spike-in concentration and the repeat. We will remove “Intensity.” from the string

sampleNames(pepData) <- pepData %>% sampleNames %>% str_replace(., pattern="Intensity.", replacement="")

Next we will add information on the proteins to the feature data. The feature data of the object pepData contains information for each peptide in the experiment. This info on the proteins can be found in the proteingroups.txt file of maxquant. The peptides.txt file contains many data on each feature (peptide). We will only retain specified columns Protein, Sequence, Reverse and Potential.Contaminant, indicating the protein, peptide sequence, if the peptide is a decoy and if it is a contaminant, respectively.

In older and more recent versions of maxquant the names of the fields might vary. We can therefore explore the names we want to select with the shiny app that can be launched from the selectFeatureData from the MSnbase package, i.e. by running the function without an fcol argument that specifies the columns you would like to select.

The code can be found below, but is not executed by this Rmd script (note, that the flag eval=FALSE has been used as option for the R chunk).

pepData<-selectFeatureData(pepData)

If we know the columns, we can specify them as a vector in the fcol slot.

pepData<-selectFeatureData(pepData,fcol=c("Proteins","Protein.names","Gene.names", "Contaminant","Reverse","Sequence"))

Calculate how many non zero intensities we have per peptide. This will be useful for filtering.

fData(pepData)$nNonZero<- rowSums(exprs(pepData) > 0)

Peptides with zero intensities are missing peptides and should be represent with a NA value instead of 0.

exprs(pepData)[exprs(pepData)==0]<-NA

Next we create the data on the experimental layout. We can do this based on the samplenames. The information on the spike in condition is available as the first letter of the sample name. We extract this information and recast the character vector in a factor.

pData(pepData)$condition <- pepData %>% sampleNames %>% substr(1,2) %>% as.factor

2.1 Data exploration

We can inspect the missingness in our data with the plotNA() function provided with MSnbase. 44% of all peptide intensities are missing and for some peptides we don’t even measure a signal in any sample. The missingness is similar across samples. Note, that we plot the peptide data, so the label protein in the plot refers to peptides.

plotNA(pepData)

When we look at density plots of the intensities, we see that most intensities are low and that there is a long tail to the right. It is more useful to explore the data upon log transformation.

plotDensities(exprs(pepData))

nrow(pepData)
## [1] 34205

There are 34205 peptides before preprocessing.

3 Preprocessing

We will log transform, normalize, filter and summarize the data.

3.1 Log transform the data

pepData <- log(pepData, base = 2)
plotDensities(exprs(pepData))