Preparing `MDSvis` input objects from a distance matrix and sample properties

Abstract

This vignette shows an example of building input objects for using MDSvis. Here the starting point is: (1) a pairwise distance matrix, where each row and each column correspond to one sample, and (2) additional sample specific properties. This vignette is distributed under a CC BY-SA 4.0 license.

Introduction

This short vignette is complementary to the main MDSvis package vignette, which provides an overview of MDSvis Shiny application (app) features.

Here we focus on ‘how to’ create some input objects, which need to be provided to the Shiny app. In this vignette, we show that we can take advantage of the Multi Dimensional Scaling (MDS) interactive visualisations, starting from any distance matrix, and some sample specific properties.

Installation and loading dependencies

To install this package, start R and enter (un-commented):

# if (!require("devtools", quietly = TRUE))
#     install.packages("devtools")
# 
# devtools::install_github(repo = 'https://github.com/UCLouvain-CBIO/MDSvis')

We now load the packages needed in the current vignette:

suppressPackageStartupMessages(library(CytoMDS))
library(MDSvis)

Dataset

For this vignette, we use the frog skin active peptide (fsap) dataset. It consists of 226 samples, where each sample represents an amino acid sequence corresponding to a reported amphibian host defense peptide.

Distances between these peptides can be obtained using pairwise alignments and computing an appropriate dissimilarity metric between their amino acid sequences. For details, see the fsap dataset methods section below.

On top of the peptide distances, peptide properties have also been collected:
- Uniprot accession number
- taxonomic family, genus, species
- some physicochemical parameters

In the following code, we import the two text files of the dataset:
- the distance matrix file (imported as a R matrix)
- the sample properties file (imported as a R data.frame)

# inputs
inputDir <- system.file("extdata", package = "MDSvis")
distPath <- file.path(inputDir, "fsap_mature_226_distances.txt")
propPath <- file.path(inputDir, "fsap_mature_226_properties.txt")

# reading distance matrix from disk
distData <- read.delim(distPath)
rownames(distData) <- distData$X
distData$X <- NULL
distMat <- as.matrix(distData)

# reading properties
propDF <- read.delim(propPath)

Generation of input files for the Shiny application

We still need to pre-process the data properties and split them into two parts: - the variables that will be used to annotate the plots, or associate colour, shape, faceting etc. These variable are mostly factor variables. We extract them into the phenoData object - which needs to be a data.frame. - the continuous variables that can be exploited as sample statistics in the bi-plots. These need to be included in a list of statistics object.

# split properties between factors and continuous variables (stats)

statMat <- as.matrix(propDF[, -1])
statList <- list(quant_properties = statMat)

propDF <- propDF[, 1, drop = FALSE]
propDF$family <- factor(substr(propDF$Sequence_Name, start = 1, stop = 3))
propDF$genus <- factor(substr(propDF$Sequence_Name, start = 1, stop = 6))
propDF$species <- factor(substr(propDF$Sequence_Name, start = 1, stop = 8))

Below are frequency tables for peptides belonging to the different families, genera and species.

table(propDF$family)

## 
## HYL MIC RAN 
##  43   7 176

table(propDF$genus)

## 
## HYL_AG HYL_CR HYL_HP HYL_HY HYL_LD HYL_LI HYL_PH HYL_PL HYL_SP MIC_KA RAN_AM 
##      9      2      2      1      2      7     17      2      1      7     10 
## RAN_BA RAN_FE RAN_GL RAN_HD RAN_LB RAN_NA RAN_OD RAN_PE RAN_RA RAN_SY RAN_TH 
##     15      1      2      1     15      3     63     19     16     30      1

table(propDF$species)

## 
## HYL_AGca HYL_AGda HYL_CRca HYL_HPci HYL_HPra HYL_HYsi HYL_LDpu HYL_LIfa 
##        6        3        2        1        1        1        2        4 
## HYL_LIin HYL_PHbi HYL_PHhy HYL_PLth HYL_SPla MIC_KAma MIC_KAse RAN_AMha 
##        3       10        7        2        1        2        5        4 
## RAN_AMlo RAN_BAad RAN_BApl RAN_FEca RAN_GLru RAN_HDma RAN_LBca RAN_LBpi 
##        6        5       10        1        2        1        7        8 
## RAN_NApa RAN_NApl RAN_ODan RAN_ODli RAN_PEle RAN_PEni RAN_RAbo RAN_RAch 
##        2        1       47       16        4       15        5       11 
## RAN_SYni RAN_SYsp RAN_THkw 
##       13       17        1

From the distance matrix, we now need to calculate the MDS projection object, which will be used as main input in the Shiny app.
This is done using the CytoMDS package. Here we set the number of projection dimensions to 10. See the documentation of CytoMDS::computeMetricMDS() for details on other possible options.

MDSObj <- CytoMDS::computeMetricMDS(pwDist = distMat, 
                                    nDim = 10)

show(MDSObj)

## MDS object containing MDS projection (using Smacof algorithm)  data:
## Nb of dimensions:  10 
## Nb of points:  226 
## Stress:  0.156626 
## Pseudo RSquare:  0.829141 
## Goodness of fit:  0.975468

Finally, we can save the newly created objects as .rds files on disk.
(Uncomment the corresponding code to do it).
These files can now be selected within the shiny app for visualization.

# outputs (for shiny app)
MDSRDSPath <- "fsap_mature_MDS.rds"
statRDSPath <- "fsap_mature_stats.rds"
propRDSPath <- "fsap_mature_props.rds"

# saving objects 
#saveRDS(MDSObj, file = MDSRDSPath)
#saveRDS(statList, file = statRDSPath)
#saveRDS(propDF, file = propRDSPath)

Visualization of the MDS projection

The MDSvis function run_app launches the interactive Shiny app. From here, you can take a look at the main MDSvis vignette for an overview of the Shiny app features.

#MDSvis::run_app()

Here below is a screen shot from the Shiny app, displaying a view of the MDS projection of the fsap dataset.

fsap dataset methods

Data mining and assembly of a reference dataset

For the construction of an initial dataset of known amphibian host defense peptides, sequences belonging to the “frog skin active peptide” superfamily (FSAP; InterPro: IPR016322) were collected from the Uniprot database and relevant literature. FSAP from the higher-level taxonomic classifications of Ranidae, Microhylidae and Hyloidea are differentiated using the prefixes RAN_, MIC_ and HYL_, respectively. These designations are broadly referred to as “family” throughout this vignette. Each family prefix is followed by a letter code to identify the genus and species from which the FSAP was identified, followed by its unique Uniprot accession. Each FSAP precursor protein is composed of a tripartite structure in which the N-terminal signal peptide and spacer regions are enzymatically cleaved to release the bioactive, or “mature” peptide. A total of 226 mature FSAP were thus identified and used for subsequent analysis.

Pairwise alignments using primary structure and hydrophobicity

Pairwise alignments of mature FSAP sequences were generated with AlignMe 1.2 (Stamm et al. 2013) using a combined scale that incorporates the Hessa, White and von Heinje hydropathy scale (HWvH; weight: 2.6 (Hessa et al. 2007) and the VTML substitution matrix (weight: 0.5), according to the developer’s recommendations. Gap penalties were similarly based on (Stamm et al. 2014) and were set as follows: below threshold gap opening penalty: 18.0, below threshold gap extension penalty: 1.3, above threshold gap opening penalty: 11.0, above threshold gap extension penalty 4.4: thresholds for penalties: 0; termini gap opening penalty 13.1; termini gap extension penalty: 0.9 (Stamm et al. 2014). Given the short lengths of mature FSAP sequences, window size was set to 0 to prevent a smoothing effect. Absolute differences between the hydrophobicity measurements generated for each peptide pair were calculated and used to construct a distance, or dissimilarity matrix.

Physicochemical parameters distinguishing FSAP

The following parameters were calculated for each mature FSAP sequence:
1. peptide length (number of amino acids)
2. grand average of hydropathy, or GRAVY index (obtained by dividing the sum of hydropathy values of all amino acids by total sequence length), to determine the hydrophobicity/hydrophilicity of the protein; a positive index indicates a more hydrophobic character, whereas a negative index indicates a more hydrophilic character
3. hydrophobic moment (measure of the amphipathicity of a structure, or asymmetry of hydrophobicity, by quantifying the distribution of hydrophobic and hydrophilic residues along its length; particularly relevant for α-helices)
4. isoelectric point, or pI (pH at which the peptide carries a net electrical charge of zero)
5. molar mass
6. net charge at pH 7
7. cysteine content (%)
8. ratio of lysines to arginines (e.g., changes in this ratio in α-helical peptides can affect the hydrophobicity required for membrane perturbation)

Acknowledgement

We would like to thank Kim Roelants for devising the method to calculate distances between FSAP, as described in the fsap dataset methods section in the current vignette.

Session information

## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MDSvis_0.1.6        CytoMDS_1.5.0       Biobase_2.69.0     
## [4] BiocGenerics_0.55.0 generics_0.1.4      BiocStyle_2.37.0   
## 
## loaded via a namespace (and not attached):
##   [1] Rdpack_2.6.4         polynom_1.4-1        gridExtra_2.3       
##   [4] rlang_1.1.6          magrittr_2.0.3       e1071_1.7-16        
##   [7] matrixStats_1.5.0    compiler_4.5.0       gdata_3.0.1         
##  [10] systemfonts_1.2.3    vctrs_0.6.5          stringr_1.5.1       
##  [13] pkgconfig_2.0.3      shape_1.4.6.1        fastmap_1.2.0       
##  [16] backports_1.5.0      promises_1.3.3       ncdfFlow_2.55.0     
##  [19] rmarkdown_2.29       nloptr_2.2.1         graph_1.87.0        
##  [22] ragg_1.4.0           purrr_1.0.4          xfun_0.52           
##  [25] glmnet_4.1-9         jomo_2.7-6           cachem_1.1.0        
##  [28] jsonlite_2.0.0       flowWorkspace_4.21.0 later_1.4.2         
##  [31] pan_1.9              parallel_4.5.0       broom_1.0.8         
##  [34] cluster_2.1.8.1      R6_2.6.1             bslib_0.9.0         
##  [37] stringi_1.8.7        RColorBrewer_1.1-3   smacof_2.1-7        
##  [40] boot_1.3-31          rpart_4.1.24         jquerylib_0.1.4     
##  [43] Rcpp_1.0.14          bookdown_0.43        iterators_1.0.14    
##  [46] knitr_1.50           base64enc_0.1-3      weights_1.0.4       
##  [49] flowCore_2.21.0      httpuv_1.6.16        Matrix_1.7-3        
##  [52] nnls_1.6             splines_4.5.0        nnet_7.3-20         
##  [55] tidyselect_1.2.1     rstudioapi_0.17.1    yaml_2.3.10         
##  [58] doParallel_1.0.17    codetools_0.2-20     lattice_0.22-7      
##  [61] tibble_3.2.1         plyr_1.8.9           shiny_1.10.0        
##  [64] evaluate_1.0.3       foreign_0.8-90       desc_1.4.3          
##  [67] survival_3.8-3       proxy_0.4-27         pillar_1.10.2       
##  [70] BiocManager_1.30.26  mice_3.18.0          checkmate_2.3.2     
##  [73] foreach_1.5.2        stats4_4.5.0         ellipse_0.5.0       
##  [76] reformulas_0.4.1     shinyjs_2.1.0        plotly_4.10.4       
##  [79] S4Vectors_0.47.0     ggplot2_3.5.2        ggcyto_1.37.0       
##  [82] scales_1.4.0         minqa_1.2.8          gtools_3.9.5        
##  [85] xtable_1.8-4         class_7.3-23         glue_1.8.0          
##  [88] Hmisc_5.2-3          lazyeval_0.2.2       tools_4.5.0         
##  [91] hexbin_1.28.5        data.table_1.17.4    lme4_1.1-37         
##  [94] fs_1.6.6             XML_3.99-0.18        grid_4.5.0          
##  [97] plotrix_3.8-4        tidyr_1.3.1          RProtoBufLib_2.21.0 
## [100] rbibutils_2.3        CytoPipeline_1.9.0   colorspace_2.1-1    
## [103] nlme_3.1-168         htmlTable_2.4.3      Formula_1.2-5       
## [106] cli_3.6.5            textshaping_1.0.1    cytolib_2.21.0      
## [109] viridisLite_0.4.2    dplyr_1.1.4          Rgraphviz_2.53.0    
## [112] gtable_0.3.6         sass_0.4.10          digest_0.6.37       
## [115] wordcloud_2.6        htmlwidgets_1.6.4    farver_2.1.2        
## [118] htmltools_0.5.8.1    pkgdown_2.1.3.9000   lifecycle_1.0.4     
## [121] httr_1.4.7           mitml_0.4-5          mime_0.13           
## [124] MASS_7.3-65

References

Hessa, Tara, Nadja M Meindl-Beinker, Andreas Bernsel, Hyun Kim, Yoko Sato, Mirjam Lerch-Bader, Ingmarie Nilsson, Stephen H White, and Gunnar von Heijne. 2007. “Molecular Code for Transmembrane-Helix Recognition by the Sec61 Translocon.” Nature 450 (7172): 1026–30.

Stamm, Marcus, René Staritzbichler, Kamil Khafizov, and Lucy R Forrest. 2014. “AlignMe - a Membrane Protein Sequence Alignment Web Server.” Nucleic Acids Res. 42 (Web Server issue): W246–51.

Stamm, Marcus, René Staritzbichler, Kamil Khafizov, and Lucy R. Forrest. 2013. “Alignment of Helical Membrane Protein Sequences Using AlignMe.” PLoS One 8 (3): e57731.

Philippe Hauchamps

Shabnam Zaman