
The goal of protti is to provide flexible functions and workflows for proteomics quality control and data analysis, within a single, user-friendly package. It can be used for label-free DDA, DIA and SRM data generated with search tools and software such as Spectronaut, MaxQuant, Proteome Discoverer and Skyline. Both limited proteolysis mass spectrometry (LiP-MS) and regular bottom-up proteomics experiments can be analysed.
protti is developed and maintained by members of the lab of Paola Picotti at ETH Zurich. Our lab is focused on protein structural changes that occur in response to perturbations such as metabolite, drug and protein binding-events, as well as protein aggregation and enzyme activation (Piazza 2018, Piazza 2020, Cappelletti, Hauser & Piazza 2021). We have developed mass spectrometry-based structural and chemical proteomic methods aimed at monitoring protein conformational changes in the complex cellular milieu (Feng 2014).
protti provides a wide range of functions to the user. The main areas of application are:
The protti package has been peer-reviewed and was published in Bioinformatics Advances:
Jan-Philipp Quast, Dina Schuster, Paola Picotti. protti: an R package for comprehensive data analysis of peptide- and protein-centric bottom-up proteomics data. Bioinformatics Advances, Volume 2, Issue 1, 2022, vbab041, https://doi.org/10.1093/bioadv/vbab041
Please make sure to cite this publication if you used protti for your data analysis.
protti is implemented as an R package.
You can install the release version from CRAN using the install.packages() function.

    install.packages("protti", dependencies = TRUE)

You can install the development version from GitHub using the devtools package by copying the following commands into R:
Note: If you do not have devtools installed, install it first by removing the comment sign (#).

    # install.packages("devtools")
    devtools::install_github("jpquast/protti", dependencies = TRUE)

The dependencies = TRUE argument in both install.packages() and devtools::install_github() also installs suggested packages that are required for some functions to work. If this argument is not included, functions that use a package that is not installed by default will throw an error and prompt the user to install the missing package. If you happen to run into problems during the installation of protti, we recommend removing this argument and installing packages manually if they are needed for a certain function.
Since protti is designed to be a flexible tool for the analysis of your data, there are many ways in which it can be used. In this section we will give a general overview of a very simple pipeline that takes a result from the search tool of your choice and in a few steps returns a list of significantly changing proteins or peptides. To ensure that you have your data in the right format, please check out the input preparation vignette.
A complete list of functions and their documentation is available here. Within R you can access the same documentation by calling ? followed by the function name without parentheses.
In general, functions with the prefix qc_* are used for quality control of your data. Functions starting with fetch_* allow you to retrieve data from a database directly into your R session. When a function starts with filter_* it is meant to be used to filter your data prior to analysis.
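The prefix convention makes it easy to discover related functions from within R. The following is a minimal sketch using base R only; the helper name list_by_prefix() is hypothetical and not part of protti. It lists the exported functions of any installed package that start with a given prefix, and only queries protti if the package is installed.

```r
# Hypothetical helper (not part of protti): list the exported objects of a
# package whose names start with a given prefix.
list_by_prefix <- function(pkg, prefix) {
  exports <- getNamespaceExports(pkg)
  sort(exports[startsWith(exports, prefix)])
}

# Only query protti if it is installed
if (requireNamespace("protti", quietly = TRUE)) {
  print(list_by_prefix("protti", "qc_"))
  print(list_by_prefix("protti", "fetch_"))
}
```

The same helper works for any installed package, for example list_by_prefix("utils", "read.").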
For more detailed workflow suggestions and demonstrations of various functions, you can have a look at the package vignettes. These include:
In this example we are going to analyse synthetic data of which we know the ground truth. The same principles would apply to any real data. Before you start analysing your data you should load all required packages. protti is designed to work well with the tidyverse package family and we will use them for this example. Therefore, you should also load them before you get started. Note: If you do not have the tidyverse installed you can do so by removing the comment sign (#) in front of the install.packages() function. This will install them directly from CRAN.
    # Load protti
    library(protti)

    # Install the tidyverse if necessary
    # install.packages("tidyverse")

    # Load tidyverse packages. Can also be done by calling library(tidyverse)
    library(dplyr)
    library(magrittr)

Usually the search tool of your choice generates a report for you that has either a .txt or .csv format. You can easily load reports into R by using the read_protti() function. This function is a wrapper around the fast fread() function from the data.table package and the clean_names() function from the janitor package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case. This will make it easier to remember them and to use them in your data analysis.
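To illustrate what such a wrapper does, here is a rough sketch that combines fread() and clean_names() directly. It is not the actual implementation of read_protti(), which may do more than this. The temporary file, its column names and its values are made up for the example, and the data.table and janitor packages are assumed to be installed.

```r
library(data.table)
library(janitor)

# Write a tiny made-up report to a temporary file (hypothetical columns/values)
report_file <- tempfile(fileext = ".csv")
writeLines(
  c(
    "R.FileName,PG.ProteinAccessions,FG.Quantity",
    "run_1,P12345,15200.5",
    "run_2,P12345,14890.1"
  ),
  report_file
)

# Fast import followed by conversion of column names to lower snake case
report <- clean_names(fread(report_file))

# Inspect the cleaned column names
names(report)
```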
    # Load data
    data <- read_protti("filename.csv")

Since we will use synthetic data for this example we are going to call the create_synthetic_data() function from protti. Of course you do not need to do this step in your analysis pipeline.
The data this function creates is similar to data obtained from a LiP-MS experiment. Please note that any of the steps in this workflow can also be applied to protein abundance data that contains protein IDs and protein intensities.
    set.seed(42) # Makes example reproducible

    # Create synthetic data
    data <- create_synthetic_data(
      n_proteins = 100,
      frac_change = 0.05,
      n_replicates = 4,
      n_conditions = 2,
      method = "effect_random",
      additional_metadata = FALSE
    )

    # The method "effect_random" as opposed to "dose-response" just randomly samples
    # the extent of the change of significantly changing peptides for each condition.
    # They do not follow any trend and can go in any direction.

Before you start analysing your data it is recommended that you filter out any observations not necessary for your analysis. These include for example:
On your own data you can easily achieve this with dplyr’s filter() function. Our synthetic data does not require any filtering at this step.
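As a minimal sketch of such a filtering step, the example below uses a small made-up data frame. The column names is_decoy and intensity, as well as the two filter criteria, are assumptions chosen for illustration; the columns and criteria that apply to your data depend on your search tool.

```r
library(dplyr)

# Toy report; all column names and values are hypothetical
raw_data <- data.frame(
  peptide = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"),
  is_decoy = c(FALSE, FALSE, TRUE, FALSE),
  intensity = c(15200, 0, 8300, 96000)
)

# Keep only non-decoy observations with an intensity above zero
filtered_data <- raw_data %>%
  filter(!is_decoy, intensity > 0)

filtered_data
```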
Because variances increase with increasing raw intensities, statistical tests would be biased towards lower-intensity peptides or proteins. Therefore you should log2 transform your data to correct for this mean-variance relationship. We do not need to do this for the synthetic data as it is already log2 transformed. For your own data just use dplyr’s mutate() together with log2().
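A short sketch of the transformation on made-up raw intensities follows. The column names intensity and intensity_log2 are assumptions for this example; use whatever your report calls its intensity column.

```r
library(dplyr)

# Toy raw intensities; column names and values are hypothetical
raw_data <- data.frame(
  peptide = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC"),
  intensity = c(1024, 4096, 65536)
)

# Add a log2-transformed intensity column next to the raw values
log2_data <- raw_data %>%
  mutate(intensity_log2 = log2(intensity))

log2_data
```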
In addition to filtering and log2 transformation it is also advised to normalise your data to equal out small differences in overall sample intensities that result from unequal sample concentrations. protti provides the normalise() function for this purpose. For this example we will use median normalisation (method = "median"). This function generates an additional column called normalised_intensity_log2 that contains the normalised intensities.
Note: If your search tool already normalised your data you should not normalise it another time.
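To make the idea of median normalisation concrete, here is a base R sketch of one common form of it: each sample's median is subtracted from its log2 intensities and the median of all sample medians is added back. This is an illustration under that assumption and not necessarily identical to what normalise() does internally. The toy data are made up.

```r
# Toy log2 intensities for two samples with a systematic offset (hypothetical)
toy <- data.frame(
  sample = rep(c("sample_1", "sample_2"), each = 3),
  intensity_log2 = c(10, 12, 14, 11, 13, 15)
)

median_na <- function(x) median(x, na.rm = TRUE)

# Median of each sample, repeated for every row of that sample
sample_median <- ave(toy$intensity_log2, toy$sample, FUN = median_na)

# Median of the per-sample medians
global_median <- median(tapply(toy$intensity_log2, toy$sample, median_na))

# Shift every sample so that all sample medians coincide
toy$normalised_intensity_log2 <- toy$intensity_log2 - sample_median + global_median

# Per-sample medians after normalisation
tapply(toy$normalised_intensity_log2, toy$sample, median_na)
```

Because the correction is a single shift per sample, differences between peptides within a sample are preserved.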
    normalised_data <- data %>%
      normalise(
        sample = sample,
        intensity_log2 = peptide_intensity_missing,
        method = "median"
      )

The next step is to deal with missing data points. You could choose to impute missing data in a later step, but this is only recommended if only a small proportion of your data is missing. In order to calculate statistical significance of differentially abundant peptides or proteins we would like to have at least a minimum number of observations per condition. The protti function assign_missingness() checks for each treatment-to-reference condition if the defined minimum number of observations is satisfied and assigns a missingness type to each comparison as follows.
If a certain condition has all replicates while the other one has fewer than 20% (adjusted downward) of the total possible replicates, the case is considered to be “missing not at random” (MNAR). In order to be labeled “missing at random” (MAR), 70% (adjusted downward) of the total replicates need to be present in both conditions. If you performed an experiment with 4 replicates, that means that both conditions need to contain at least 2 observations. Comparisons that have too few observations are labeled NA. These will not be imputed if imputation is performed later on using the impute() function. You can read the exact details in the documentation of this function and also adjust the thresholds if you want to be more or less conservative with how many data points to retain.
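The threshold logic described above can be sketched in base R as follows. This is a simplified illustration and not the implementation of assign_missingness(): the function name classify_missingness() and its argument names are hypothetical, "adjusted downward" is read as rounding down with floor(), and the MNAR rule is read as "at most floor(20% of replicates) observations". Consult the function documentation for the exact rules.

```r
# Hypothetical helper: classify one treatment-to-reference comparison from the
# number of observations in each condition.
classify_missingness <- function(n_ref, n_treat, n_replicates,
                                 completeness_mar = 0.7,
                                 completeness_mnar = 0.2) {
  # Thresholds are rounded down ("adjusted downward")
  mar_min <- max(floor(n_replicates * completeness_mar), 1)
  mnar_max <- floor(n_replicates * completeness_mnar)

  if ((n_ref == n_replicates && n_treat <= mnar_max) ||
    (n_treat == n_replicates && n_ref <= mnar_max)) {
    "MNAR"
  } else if (n_ref >= mar_min && n_treat >= mar_min) {
    "MAR"
  } else {
    NA_character_
  }
}

# Examples for an experiment with 4 replicates per condition
classify_missingness(n_ref = 4, n_treat = 0, n_replicates = 4)
classify_missingness(n_ref = 2, n_treat = 3, n_replicates = 4)
classify_missingness(n_ref = 1, n_treat = 3, n_replicates = 4)
```

With 4 replicates the MAR threshold becomes floor(4 * 0.7) = 2 observations per condition, matching the example in the text.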
    data_missing <- normalised_data %>%
      assign_missingness(
        sample = sample,
        condition = condition,
        grouping = peptide,
        intensity = normalised_intensity_log2,
        ref_condition = "condition_1",
        retain_columns = c(protein, change_peptide)
      )

    # Next to the columns it generates, assign_missingness only contains the columns
    # you provide as input in its output. If you want to retain additional columns you
    # can provide them in the retain_columns argument.

Note: Instead of “peptide” in the grouping argument you can provide protein IDs in case you are working with protein abundance data. However, then intensities should be protein intensities and not peptide intensities.
For the calculation of abundance changes and the associated significances protti provides the function calculate_diff_abundance(). You can choose between different statistical methods. For this example we will choose a moderated t-test.
The type of missingness assigned to a comparison does not have any influence on the statistical test. However, by default (can be changed) comparisons with missingness NA are filtered out prior to p-value adjustment. This means that in addition to imputation, the user can use missingness cutoffs also in order to define which comparisons are too incomplete to be trustworthy even if significant.
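Why it matters whether comparisons are removed before or after p-value adjustment can be shown with base R's p.adjust(). The sketch below uses made-up p-values and assumes Benjamini-Hochberg correction purely for illustration; the adjustment method used in your analysis may differ.

```r
# Hypothetical p-values for five comparisons; the last two stand in for
# comparisons whose missingness was assigned NA
pvals <- c(0.01, 0.02, 0.03, 0.5, 0.8)
missingness <- c("MAR", "MAR", "MNAR", NA, NA)

# Option 1: adjust all p-values together
adj_all <- p.adjust(pvals, method = "BH")

# Option 2: drop NA-missingness comparisons first, then adjust the rest
keep <- !is.na(missingness)
adj_filtered <- p.adjust(pvals[keep], method = "BH")

# Compare the adjusted p-values of the retained comparisons
data.frame(
  pval = pvals[keep],
  adj_all = adj_all[keep],
  adj_filtered = adj_filtered
)
```

Fewer tests enter the correction after filtering, so the retained comparisons are penalised less in this example.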
    result <- data_missing %>%
      calculate_diff_abundance(
        sample = sample,
        condition = condition,
        grouping = peptide,
        intensity_log2 = normalised_intensity_log2,
        missingness = missingness,
        comparison = comparison,
        filter_NA_missingness = TRUE,
        method = "moderated_t-test",
        retain_columns = c(protein, change_peptide)
      )

Next we can use a Volcano plot to visualize significantly changing peptides with the function volcano_plot(). You can choose to create an interactive plot with the interactive argument. Please note that this is not recommended for large datasets.
    result %>%
      volcano_plot(
        grouping = peptide,
        log2FC = diff,
        significance = pval,
        method = "target",
        target_column = change_peptide,
        target = TRUE,
        legend_label = "Ground Truth",
        significance_cutoff = c(0.05, "adj_pval")
      )
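Besides plotting, you may want the significantly changing peptides as a table. The base R sketch below uses a small made-up data frame standing in for the result object, with the diff, pval and adj_pval columns used above. The fold-change cutoff of 1 (on the log2 scale) is an assumption added for illustration; only the 0.05 cutoff on adj_pval is taken from the plot call.

```r
# Toy stand-in for the result table; all values are hypothetical
result_toy <- data.frame(
  peptide = c("pep_1", "pep_2", "pep_3", "pep_4"),
  diff = c(2.1, -0.3, -1.8, 0.9),
  pval = c(0.0004, 0.6, 0.002, 0.04),
  adj_pval = c(0.0016, 0.6, 0.004, 0.053)
)

# Keep peptides that pass the adjusted p-value cutoff and change at least
# two-fold in either direction (|log2 fold change| > 1, an assumed cutoff)
significant <- result_toy[
  result_toy$adj_pval < 0.05 & abs(result_toy$diff) > 1,
]

# Order by significance
significant <- significant[order(significant$adj_pval), ]
significant
```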