Duguet et al. 2017 compared the proteomes of mouse regulatory T cells (Treg) and conventional T cells (Tconv) in order to discover differentially regulated proteins between these two cell populations. For each biological repeat the proteomes were extracted for both Treg and Tconv cell pools, which were purified by flow cytometry. The data in data/quantification/mouseTcell on the pdaData repository are a subset of the data PXD004436 on PRIDE.
We first analyse a subset of the data that have a completely randomized design, i.e. 4 mice per group for the first group only regulatory T cells are assessed for the second group only conventional T cells are assessed.
We first import the peptides.txt file. This is the file that contains your peptide-level intensities. For a MaxQuant search [6], this peptides.txt file can be found by default in the “path_to_raw_files/combined/txt/” folder from the MaxQuant output, with “path_to_raw_files” the folder where raw files were saved. In this tutorial, we will use a MaxQuant peptides file from MaxQuant that can be found on the pdaData repository. We will use the MSnbase package to import the data.
We generate the object peptideFile with the path to the peptides.txt file. In this file we will fetch the data from the github repository linked to the pda course: https://statomics.github.io/pda. You can also replace the peptideFile with a string that points to the path of a file on your local hard drive.
With the grepEcols function we find the columns that are containing the expression data of the peptides in the peptides.txt file.
library(MSqRob)
library(MSnbase)
library(tidyverse)
library(limma)
peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda/data/quantification/mouseTcell/peptidesCRD.txt"
proteinGroupsFile <-"https://raw.githubusercontent.com/statOmics/pda/data/quantification/mouseTcell/proteinGroups.txt"
ecols<-grepEcols(peptidesFile, "Intensity ", split = "\t")
Next we import the peptide intensities
pepData<-readMSnSet2(peptidesFile,ecol=ecols,fnames="Sequence",sep="\t")
pepData
## MSnSet (storageMode: lockedEnvironment)
## assayData: 55814 features, 8 samples
## element names: exprs
## protocolData: none
## phenoData: none
## featureData
## featureNames: AAAAAAAAAAGAAGGR AAAAAAAAAAGDSDSWDADTFSMEDPVRK ...
## YYYDKNIIHK (55814 total)
## fvarLabels: Sequence N.term.cleavage.window ... MS.MS.Count (74
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## MSnbase version: 2.10.1
The pepData object is an MSnSet, a container for the data, features information and experimental annotation. They can be accessed using the accessor functions ‘exprs’ (matrix of intensities, features in rows, samples in columns), ‘fData’ (properties for each feature, peptide or protein, on the rows) and ‘pData’ (properties for the samples on the columns).
We will make use from data wrangling functionalities from the tidyverse package. The %>% operator allows us to pipe the output of one function to the next function.
head(exprs(pepData))
## Intensity.Tconv.M11_R2_inj1
## AAAAAAAAAAGAAGGR 47982000
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 0
## AAAAAAAGGAAIAVSTGIETATIQK 0
## AAAAAAGAASGIPGPVAQGIK 0
## AAAAAAGPEMVR 63831000
## AAAAAAPGGGGGEPR 0
## Intensity.Tconv.M12_1 Intensity.Tconv.M12_2
## AAAAAAAAAAGAAGGR 171220000 111470000
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 22907000 13303000
## AAAAAAAGGAAIAVSTGIETATIQK 0 0
## AAAAAAGAASGIPGPVAQGIK 18284000 0
## AAAAAAGPEMVR 0 104440000
## AAAAAAPGGGGGEPR 25372000 11910000
## Intensity.Tconv.M6_inj1
## AAAAAAAAAAGAAGGR 0
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 6744500
## AAAAAAAGGAAIAVSTGIETATIQK 314120
## AAAAAAGAASGIPGPVAQGIK 2904800
## AAAAAAGPEMVR 12747000
## AAAAAAPGGGGGEPR 2166100
## Intensity.Treg.M11_R1_inj1
## AAAAAAAAAAGAAGGR 52952000
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 0
## AAAAAAAGGAAIAVSTGIETATIQK 0
## AAAAAAGAASGIPGPVAQGIK 10387000
## AAAAAAGPEMVR 21421000
## AAAAAAPGGGGGEPR 0
## Intensity.Treg.M12_3 Intensity.Treg.M5_inj1
## AAAAAAAAAAGAAGGR 120130000 1668800
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 14499000 437460
## AAAAAAAGGAAIAVSTGIETATIQK 0 173070
## AAAAAAGAASGIPGPVAQGIK 14162000 520550
## AAAAAAGPEMVR 70512000 1604600
## AAAAAAPGGGGGEPR 13873000 361240
## Intensity.Treg.M5_inj2
## AAAAAAAAAAGAAGGR 4730400
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 0
## AAAAAAAGGAAIAVSTGIETATIQK 382430
## AAAAAAGAASGIPGPVAQGIK 1090900
## AAAAAAGPEMVR 3327900
## AAAAAAPGGGGGEPR 665860
pepData %>% exprs %>% head
## Intensity.Tconv.M11_R2_inj1
## AAAAAAAAAAGAAGGR 47982000
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 0
## AAAAAAAGGAAIAVSTGIETATIQK 0
## AAAAAAGAASGIPGPVAQGIK 0
## AAAAAAGPEMVR 63831000
## AAAAAAPGGGGGEPR 0
## Intensity.Tconv.M12_1 Intensity.Tconv.M12_2
## AAAAAAAAAAGAAGGR 171220000 111470000
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 22907000 13303000
## AAAAAAAGGAAIAVSTGIETATIQK 0 0
## AAAAAAGAASGIPGPVAQGIK 18284000 0
## AAAAAAGPEMVR 0 104440000
## AAAAAAPGGGGGEPR 25372000 11910000
## Intensity.Tconv.M6_inj1
## AAAAAAAAAAGAAGGR 0
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 6744500
## AAAAAAAGGAAIAVSTGIETATIQK 314120
## AAAAAAGAASGIPGPVAQGIK 2904800
## AAAAAAGPEMVR 12747000
## AAAAAAPGGGGGEPR 2166100
## Intensity.Treg.M11_R1_inj1
## AAAAAAAAAAGAAGGR 52952000
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 0
## AAAAAAAGGAAIAVSTGIETATIQK 0
## AAAAAAGAASGIPGPVAQGIK 10387000
## AAAAAAGPEMVR 21421000
## AAAAAAPGGGGGEPR 0
## Intensity.Treg.M12_3 Intensity.Treg.M5_inj1
## AAAAAAAAAAGAAGGR 120130000 1668800
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 14499000 437460
## AAAAAAAGGAAIAVSTGIETATIQK 0 173070
## AAAAAAGAASGIPGPVAQGIK 14162000 520550
## AAAAAAGPEMVR 70512000 1604600
## AAAAAAPGGGGGEPR 13873000 361240
## Intensity.Treg.M5_inj2
## AAAAAAAAAAGAAGGR 4730400
## AAAAAAAAAAGDSDSWDADTFSMEDPVRK 0
## AAAAAAAGGAAIAVSTGIETATIQK 382430
## AAAAAAGAASGIPGPVAQGIK 1090900
## AAAAAAGPEMVR 3327900
## AAAAAAPGGGGGEPR 665860
pepData %>% sampleNames
## [1] "Intensity.Tconv.M11_R2_inj1" "Intensity.Tconv.M12_1"
## [3] "Intensity.Tconv.M12_2" "Intensity.Tconv.M6_inj1"
## [5] "Intensity.Treg.M11_R1_inj1" "Intensity.Treg.M12_3"
## [7] "Intensity.Treg.M5_inj1" "Intensity.Treg.M5_inj2"
The sample names are rather long and contain information on the spike-in concentration and the repeat. We will remove “Intensity.” from the string
sampleNames(pepData) <- pepData %>% sampleNames %>% str_replace(., pattern="Intensity.", replacement="")
Next we will add information on the proteins to the feature data. The feature data of the object pepData contains information for each peptide in the experiment. This info on the proteins can be found in the proteingroups.txt file of maxquant. The peptides.txt file contains many data on each feature (peptide). We will only retain specified columns Protein, Sequence, Reverse and Potential.Contaminant, indicating the protein, peptide sequence, if the peptide is a decoy and if it is a contaminant, respectively.
In older and more recent versions of maxquant the names of the fields might vary. We can therefore explore the names we want to select with the shiny app that can be launched from the selectFeatureData from the MSnbase package, i.e. by running the function without an fcol argument that specifies the columns you would like to select.
The code can be found below, but is not executed by this Rmd script (note, that the flag eval=FALSE has been used as option for the R chunk).
pepData<-selectFeatureData(pepData)
If we know the columns, we can specify them as a vector in the fcol slot.
pepData<-selectFeatureData(pepData,fcol=c("Proteins","Protein.names","Gene.names", "Potential.contaminant","Reverse","Sequence"))
Calculate how many non zero intensities we have per peptide. This will be useful for filtering.
fData(pepData)$nNonZero<- rowSums(exprs(pepData) > 0)
Peptides with zero intensities are missing peptides and should be represent with a NA
value instead of 0
.
exprs(pepData)[exprs(pepData)==0]<-NA
Next we create the data on the experimental layout. We can do this based on the samplenames.
pData(pepData)$cellType <- pepData %>% sampleNames %>% substr(1,4) %>% as.factor
We will also replace the sample names to make them shorter
sampleNames(pepData)<-pepData %>% sampleNames %>% substr(1,4) %>% paste0(.,"_m",1:8)
We can inspect the missingness in our data with the plotNA()
function provided with MSnbase
. 41% of all peptide intensities are missing and for some peptides we don’t even measure a signal in any sample. The missingness is similar across samples. Note, that we plot the peptide data, so the label protein in the plot refers to peptides.
plotNA(pepData)
When we look at density plots of the intensities, we see that most intensities are low and that there is a long tail to the right. It is more useful to explore the data upon log transformation.
plotDensities(exprs(pepData))
nrow(pepData)
## [1] 55814
There are 55814 peptides before preprocessing.
We will log transform, normalize, filter and summarize the data.
pepData <- log(pepData, base = 2)
plotDensities(exprs(pepData))
When we look at the densities we see small shifts in location and shape of the density distributions. So normalisation will be required. But, we will first perform filtering.
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pepData<-pepData[fData(pepData)$Proteins %in% smallestUniqueGroups(fData(pepData)$Proteins),]
We now remove the contaminants, peptides that map to decoy sequences and proteins, which were only identified by peptides with modifications.
pepData <- pepData[fData(pepData)$Reverse!="+",]
pepData <- pepData[fData(pepData)$Potential.contaminant!="+",]
Proteins for which all peptides are carrying modifications (PTMs) can be considered as unreliable. We will filter out these proteins. This information is included in the Only.identified.by.site column of proteinGroups.txt file of maxquant. The variable is containing a “+” character if the protein is only identified by modified peptides and an empty string “” if this is not the case. Sometimes an NA value is also present. We will replace these by the empty character string. The information of the peptides.txt file can be linked to the proteinGroups.txt file by using the Proteins column from the peptides.txt file and the Protein.IDs column in the proteinGroups.txt file. If NAs are included in the
proteinGroups<-read.table(proteinGroupsFile, sep = "\t", header = TRUE, quote = "", comment.char = "")
only_site <- proteinGroups$Only.identified.by.site
only_site[is.na(only_site)] <- ""
proteinsOnlyIdentifiedWithPTMs <- proteinGroups$Protein.IDs[only_site=="+"]
pepData<-pepData[!(fData(pepData)$Proteins %in% proteinsOnlyIdentifiedWithPTMs),]
We will now remove the proteinGroups object from the R session to free memory.
rm(proteinGroups,only_site)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 4158115 222.1 8594064 459.0 NA 5446005 290.9
## Vcells 8237739 62.9 24432497 186.5 16384 24428835 186.4
We want to keep peptide that were at least observed twice.
pepData<-pepData[fData(pepData)$nNonZero>=2,]
nrow(pepData)
## [1] 43346
We keep 43346 peptides upon filtering.
pepData <- normalise(pepData, "quantiles")
Upon normalisation the density curves for all samples coincide.
plotDensities(exprs(pepData))
We can visualize our data using a Multi Dimensional Scaling plot, eg. as provided by the limma package.
plotMDS(exprs(pepData),col=as.double(pData(pepData)$cellType))
The first axis in the plot is showing the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences (log FC) in the peptide data seems to be driven by technical variability. Indeed the samples do not seem to be clearly separated according to the spike in condition. Because there are missing values for the peptide level data we use the na.rm=TRUE
argument to summarize the data based on the observed peptides for every protein.
protData<-combineFeatures(pepData,fcol="Proteins",method="robust",na.rm=TRUE)
## Your data contains missing values. Please read the relevant section
## in the combineFeatures manual page for details the effects of
## missing values on data aggregation.
If you summarize using robust summarization please refer to Sticker et al. (2019) https://www.biorxiv.org/content/10.1101/668863v1
We filter proteins for which we have at least two protein expression values in both conditions:
protData<-protData[
protData %>% exprs %>% is.na %>% `!` %>% rowSums>4,]
plotMDS(exprs(protData),col=as.double(pData(pepData)$cellType))
MSqRob is currently working with a format where we have one dataframe for each protein. This will be changed in the next release. Therefore we first have to reorganise the data.
Next the models are fitted. This is done using the fit.model function. We only have to model the data using the factor condition from the pheno data of the protein level MSnSet. The name of the factor variable is specified in the fixed argument (if multiple predictors have to be incorporated in the model, a vector of variable names has to be provided in this argument.). The argument shrinkage is used to specify if ridge regression has to be adopted. For the sake of speed we do not do this in the tutorial. The shrinkage has to be specified for each variable in the fixed effects. We also have to indicate this for the intercept (which we never shrink). So we specify it at c(0,0) to indicate that the intercept (first 0) and the parameters for the factor condition (second 0) are not penalized. We set the robust_var function equal to FALSE, this functionality will be removed from the package in the next release.
protMSqRob <- MSnSet2protdata(protData, "Proteins",annotations=c("Protein.names","Gene.names"))
models <- fit.model(protdata=protMSqRob, response="quant_value", fixed="cellType",shrinkage.fixed=c(0,0),robust_var=FALSE)
Many biologists have problems with the reference coding. In MSqRob we have opted to formulate contrasts using all levels of a factor. Internally, the contrasts are than recasted according to the factor level that is the reference class.
L <- makeContrast("cellTypeTreg - cellTypeTcon", levels=c("cellTypeTreg","cellTypeTcon"))
tests <- test.contrast_adjust(models, L)
nSig <- sum(tests$signif,na.rm=TRUE)
head(tests,nSig)
## Protein.names
## P24452 Macrophage-capping protein
## P09055 Integrin beta-1
## P51807 Dynein light chain Tctex-type 1
## Q7TSJ2 Microtubule-associated protein 6
## Q7TMB8 Cytoplasmic FMR1-interacting protein 1
## P07356 Annexin A2
## P13020 Gelsolin
## Q9DC16 Endoplasmic reticulum-Golgi intermediate compartment protein 1
## P08207 Protein S100-A10
## P21550 Beta-enolase
## Q9CQ62 2,4-dienoyl-CoA reductase, mitochondrial
## Q7TPR4 Alpha-actinin-1
## Q00417 Transcription factor 7
## Q05186 Reticulocalbin-1
## P24472 Glutathione S-transferase A4
## Q8BGW0 Protein THEMIS
## O09131 Glutathione S-transferase omega-1
## Q61107 Guanylate-binding protein 4
## Q80SU7 Interferon-induced very large GTPase 1
## P54071 Isocitrate dehydrogenase [NADP], mitochondrial
## Q8K157 Aldose 1-epimerase
## Q60611 DNA-binding protein SATB1
## Q61205 Platelet-activating factor acetylhydrolase IB subunit gamma
## Q04447 Creatine kinase B-type
## Q8BFP9 [Pyruvate dehydrogenase (acetyl-transferring)] kinase isozyme 1, mitochondrial
## P27782 Lymphoid enhancer-binding factor 1
## Q8C0P5 Coronin-2A
## Q80TM9 Nischarin
## Q99JT9 1,2-dihydroxy-3-keto-5-methylthiopentene dioxygenase
## O08709 Peroxiredoxin-6
## Q8BMS9 Ras association domain-containing protein 2
## Q3UW53 Protein Niban
## Gene.names estimate se df Tval pval
## P24452 Capg 2.5733130 0.1805818 8.773187 14.250122 2.288090e-07
## P09055 Itgb1 1.4938717 0.1451408 8.773187 10.292572 3.424618e-06
## P51807 Dynlt1 -1.8839705 0.2041501 8.773187 -9.228358 8.288363e-06
## Q7TSJ2 Map6 -4.8486543 0.4105680 6.773187 -11.809627 9.116831e-06
## Q7TMB8 Cyfip1 3.8124960 0.2716250 5.773187 14.035882 1.110763e-05
## P07356 Anxa2 2.1200806 0.2383387 8.773187 8.895242 1.112390e-05
## P13020 Gsn -1.7559807 0.2043310 8.773187 -8.593806 1.463261e-05
## Q9DC16 Ergic1 2.2349065 0.2601791 8.773187 8.589877 1.468574e-05
## P08207 S100a10 2.1513269 0.2294401 7.773187 9.376422 1.658871e-05
## P21550 Eno3 2.1652362 0.2675637 8.773187 8.092415 2.349849e-05
## Q9CQ62 Decr1 1.5363410 0.1911350 8.773187 8.037989 2.477275e-05
## Q7TPR4 Actn1 -1.6026653 0.2098550 8.773187 -7.637013 3.687910e-05
## Q00417 Tcf7 -1.4581094 0.1926169 8.773187 -7.569996 3.947666e-05
## Q05186 Rcn1 1.8598060 0.2483611 8.773187 7.488315 4.291799e-05
## P24472 Gsta4 3.7435527 0.3490961 5.773187 10.723560 4.994952e-05
## Q8BGW0 Themis -1.9323078 0.2676994 8.773187 -7.218200 5.686465e-05
## O09131 Gsto1 2.8857200 0.4160874 8.773187 6.935369 7.699875e-05
## Q61107 Gbp4 1.6022703 0.2354065 8.773187 6.806397 8.867683e-05
## Q80SU7 Gvin1 1.2547038 0.1868043 8.773187 6.716677 9.794098e-05
## P54071 Idh2 -1.5283193 0.2279181 8.773187 -6.705563 9.916050e-05
## Q8K157 Galm 1.4390806 0.2160349 8.773187 6.661334 1.041808e-04
## Q60611 Satb1 -1.6974812 0.2726644 8.773187 -6.225532 1.716383e-04
## Q61205 Pafah1b3 1.4460925 0.2366737 8.773187 6.110067 1.966883e-04
## Q04447 Ckb -1.5665234 0.2594216 8.773187 -6.038522 2.141953e-04
## Q8BFP9 Pdk1 -1.6705728 0.2822215 8.773187 -5.919367 2.472431e-04
## P27782 Lef1 -2.7050547 0.4569871 8.773187 -5.919324 2.472559e-04
## Q8C0P5 Coro2a 3.3088892 0.4760683 6.773187 6.950450 2.569178e-04
## Q80TM9 Nisch -1.2700088 0.2165037 8.773187 -5.865992 2.638150e-04
## Q99JT9 Adi1 -0.9134530 0.1561034 8.773187 -5.851589 2.684915e-04
## O08709 Prdx6 -0.8809735 0.1532448 8.773187 -5.748797 3.046074e-04
## Q8BMS9 Rassf2 -1.1524409 0.2023824 8.773187 -5.694374 3.258429e-04
## Q3UW53 Fam129a 2.5595788 0.4551092 8.773187 5.624098 3.556796e-04
## qval signif
## P24452 0.0008674147 TRUE
## P09055 0.0064913640 TRUE
## P51807 0.0069592040 TRUE
## Q7TSJ2 0.0069592040 TRUE
## Q7TMB8 0.0069592040 TRUE
## P07356 0.0069592040 TRUE
## P13020 0.0069592040 TRUE
## Q9DC16 0.0069592040 TRUE
## P08207 0.0069875349 TRUE
## P21550 0.0085375921 TRUE
## Q9CQ62 0.0085375921 TRUE
## Q7TPR4 0.0115120025 TRUE
## Q00417 0.0115120025 TRUE
## Q05186 0.0116215794 TRUE
## P24472 0.0126239096 TRUE
## Q8BGW0 0.0134733669 TRUE
## O09131 0.0171707204 TRUE
## Q61107 0.0186763263 TRUE
## Q80SU7 0.0187958721 TRUE
## P54071 0.0187958721 TRUE
## Q8K157 0.0188071218 TRUE
## Q60611 0.0295763957 TRUE
## Q61205 0.0324193646 TRUE
## Q04447 0.0338339309 TRUE
## Q8BFP9 0.0350983219 TRUE
## P27782 0.0350983219 TRUE
## Q8C0P5 0.0350983219 TRUE
## Q80TM9 0.0350983219 TRUE
## Q99JT9 0.0350983219 TRUE
## O08709 0.0384922168 TRUE
## Q8BMS9 0.0398474371 TRUE
## Q3UW53 0.0421369205 TRUE
There are 32 proteins with a significant effect at the 5% FDR level.
volcano<- ggplot(tests,aes(x=estimate,y=-log10(pval),color=signif)) + geom_point() + scale_color_manual(values=c("black","red"))
volcano
An interactive volcano plot can be obtained by using the plotly library. We alter the object tests to add the protein name to the data frame in ggplot, this will allow us to see the protein name in the interactive plot.
library(plotly)
volcano<- ggplot(tests %>% rownames_to_column("protein"), aes(x=estimate,y=-log10(pval),color=signif,protein=protein,annotation=Protein.names,gene=Gene.names)) + geom_point() + scale_color_manual(values=c("black","red"))
ggplotly(volcano)
There are no significant proteins so we will plot a heatmap of the top 10 proteins, instead. Note, that these results are not statistically significant.
sigNames<- tests %>% rownames_to_column("protein") %>% filter(signif) %>% pull(protein)
heatmap(exprs(protData)[sigNames,])
We first extract the normalized protein expression values for a particular protein and we create a dataframe that contains the protein expression values and the relevant variables of the experimental design.
for (protName in rownames(tests)[1:5])
{
plotDat<-pepData[fData(pepData)$Proteins==protName] %>% exprs
plotDataStack<-data.frame(quant_value=c(plotDat),sequence=rep(rownames(plotDat),rep=ncol(plotDat)),sample=rep(colnames(plotDat),each=nrow(plotDat)),condition=rep(pData(pepData)$cellType,each=nrow(plotDat)))
plot1 <- ggplot(plotDataStack, aes(x=sample, y=quant_value,fill=condition))
print(plot1 +geom_boxplot(outlier.shape=NA) + geom_point(position=position_jitter(width=.1),aes(shape=sequence)) + scale_shape_manual(values=1:nrow(plotDat)) +labs(title = protName, x="sample", y="Peptide intensity (log2)"))
}