Input Preparation Workflow

Dina Schuster

2024-10-22

Source: vignettes/input_preparation_workflow.Rmd

Introduction

This vignette gives an overview of how to prepare the quantitative protein/peptide matrix output from common search engines and software such as Spectronaut, MaxQuant, Proteome Discoverer and Skyline for analysis with protti. Due to its modular and flexible structure, protti can be used on the output of common bottom-up proteomics search engines irrespective of the measurement mode (DDA, DIA, targeted-MS).

Furthermore, you are not restricted to reports from the above-mentioned search engines. As long as your data has a tabular format (data frame) and a minimal set of columns, you can analyse it with protti. The minimally required columns contain information on sample, condition, intensity, protein ID and, if different from protein intensity, the level the intensity is based on (fragment, precursor, peptide). Depending on the analysis many more columns can be useful, but they are not required. Ultimately, your data should have a structure similar to this:

Sample	Protein ID	Peptide Sequence	Condition	Intensity
sample1	P62942	PEPTIDER	treated	14000
sample2	P62942	PEPTI	treated	15000
sample3	P62942	PEPTIDE	treated	14500
sample4	P62942	PEPTIDER	control	18000
sample5	P62942	PEPTI	control	21000
sample6	P62942	PEPTIDE	control	19000

It is very important that each unit of the level you perform your analysis on (e.g. peptide) has a single unique intensity associated with it in each sample. If, for example, a peptide has two different intensities in the same sample, protti would not know how to deal with this and many functions will likely fail.
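A quick way to check this requirement before starting an analysis is to count the rows per sample and unit. The following is a minimal sketch on toy data; the column names `sample`, `peptide` and `intensity` are illustrative assumptions and should be replaced with the names in your own report.

```r
library(dplyr)

# Toy data: PEPTIDER has two intensities in sample1 (column names are assumptions)
peptide_data <- data.frame(
  sample = c("sample1", "sample1", "sample2", "sample2"),
  peptide = c("PEPTIDER", "PEPTIDER", "PEPTIDER", "PEPTI"),
  intensity = c(14000, 13000, 15000, 21000)
)

# Count rows per sample and peptide; more than one row means the
# intensity is not unique for that unit
duplicated_units <- peptide_data %>%
  count(sample, peptide, name = "n_intensities") %>%
  filter(n_intensities > 1)

duplicated_units
```

If the resulting data frame has no rows, every unit has exactly one intensity per sample.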

Data should always be organised in a format called tidy data. That means data should be contained in a long format (e.g. all sample names in one column) rather than a wide format (e.g. each sample name in its own column with intensity as the content of the columns). You can easily achieve this by using the pivot_longer() function from the tidyr package. The output of many search engines already contains tidy data, and working with it is very easy because you can refer to information with only one variable. protti is designed to work well together with the tidyverse package family that is built around the concept of tidy data.
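To illustrate the difference between the two formats, here is a small sketch that converts a wide toy table into long format with pivot_longer(). The protein IDs, sample names and values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Wide format: one intensity column per sample (toy values)
wide_data <- data.frame(
  protein_id = c("protein_a", "protein_b"),
  sample1 = c(14000, 8000),
  sample2 = c(15000, 8200)
)

# Long format: all sample names in one column, all intensities in another
long_data <- wide_data %>%
  pivot_longer(
    cols = c(sample1, sample2),
    names_to = "sample",
    values_to = "intensity"
  )

long_data
```

The long table has one row per protein and sample, so each intensity can be addressed through the `sample` and `protein_id` columns alone.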

Protein-centric analysis

Many search engines provide the user with protein intensities. However, it is also possible to calculate protein intensities directly from precursor intensities with the protti function calculate_protein_abundance(). protti implements the "iq" method, previously implemented in the R package iq, which performs protein quantification based on the maximal peptide ratio extraction algorithm adapted from the MaxLFQ algorithm (Cox, J. 2013).

One advantage of calculating the protein abundance with protti is the possibility to median normalise run intensities on the precursor level. This is closer to the actually acquired intensities, and thus sample concentrations, than if normalisation is performed on the protein level. Some search engines provide the option for automatic median normalisation but not all. Furthermore, some search engines calculate protein intensities by summation of precursor intensities irrespective of missingness of peptides in certain samples. In these cases the maximal peptide ratio extraction algorithm provides a more robust calculation of protein intensities.
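The idea behind median normalisation on the precursor level can be sketched with plain dplyr. This is a hand-written illustration on toy data, not the implementation used by protti; all column names and values are assumptions.

```r
library(dplyr)

# Toy precursor data: sample2 was measured at twice the intensity of sample1
precursor_data <- data.frame(
  sample = rep(c("sample1", "sample2"), each = 3),
  precursor = rep(c("PEPTIDER_2", "PEPTI_2", "PEPTIDE_3"), times = 2),
  intensity = c(1000, 2000, 4000, 2000, 4000, 8000)
)

normalised <- precursor_data %>%
  mutate(intensity_log2 = log2(intensity)) %>%
  # median of each run
  group_by(sample) %>%
  mutate(run_median = median(intensity_log2)) %>%
  ungroup() %>%
  # shift each run so that all run medians end up at the same value
  mutate(normalised_intensity_log2 = intensity_log2 - run_median + median(run_median))

normalised
```

After this step both samples share the same median log2 intensity, so systematic differences in loading between runs no longer propagate into the protein abundance calculation.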

If you prefer to use protein intensities provided by the search engine of your choice, this is not a problem, and we will show how some of this information can be converted into the right format.

Loading packages

We will demonstrate how most outputs can be converted with functions from the R packages magrittr, dplyr, tidyr and stringr. You can load packages with the library() function after you have installed them.

library(magrittr)
library(dplyr)
library(tidyr)
library(stringr)

Note that we are using the R package magrittr because of its pipe operator %>%. It takes the output of the preceding function and supplies it as the first argument of the following function. Using %>% makes code easier to read and follow.
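As a small illustration of the pipe, the two expressions below compute the same value, once as a nested call and once as a pipeline. The intensities are toy values.

```r
library(magrittr)

intensities <- c(14000, 15000, 14500)

# Nested call: has to be read from the inside out
nested <- round(mean(log2(intensities)), digits = 2)

# Piped call: reads from top to bottom in the order the steps are applied
piped <- intensities %>%
  log2() %>%
  mean() %>%
  round(digits = 2)

identical(nested, piped)
```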

Spectronaut

Spectronaut reports already contain data in the tidy data format. Therefore nothing needs to be changed in order to use them with protti. However, the columns we would recommend exporting from Spectronaut (not all columns are required) are:

R.Condition (condition names)
R.FileName (file names)
PG.ProteinAccessions (protein identifiers)
PEP.IsProteotypic (logical indicating if peptide is proteotypic)
PEP.StrippedSequence (peptide sequence)
PEP.NrOfMissedCleavages (number of missed cleavages, relevant for quality control)
EG.IsDecoy (logical indicating if peptide is a decoy match)
EG.PrecursorId (peptide precursor ID)
FG.Quantity (precursor quantity, required for peptide-centric analyses)
FG.Charge (precursor charge state, relevant for quality control)
PG.Quantity (protein group quantity, required for protein-centric analyses if you prefer not to calculate them with calculate_protein_abundance())

Please make sure that the report is a .csv file. You can use the read_protti() function in order to load the Spectronaut report into R. This function is a wrapper around the fast fread() function from the data.table package and the clean_names() function from the janitor package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case. This will make it easier to remember them and to use them in your data analysis. For example, the Spectronaut column R.FileName will change into r_file_name.

# To read in your own data you can use read_protti()
spectronaut_data <- read_protti(filename = "mydata/spectronaut.csv")

MaxQuant

Depending on which analysis you are performing you will have to use different outputs. For peptide-centric analyses we would recommend using the evidence.txt file. If you want to perform a protein-centric analysis and you want to use protein quantities calculated by MaxQuant, you need the proteinGroups.txt file. However, you can also apply the maximal peptide ratio extraction algorithm from the iq R package implemented in the calculate_protein_abundance() function of protti. This allows you to only use the evidence.txt file. The resulting protein intensities are identical since they were calculated with the same algorithm.

Peptide-centric analysis/LiP-MS analysis

In case you are interested in performing a peptide-centric analysis (necessary for LiP-MS), you should use the evidence.txt file provided in the search output of MaxQuant.

The evidence.txt file basically contains all the information we need to run protti. It is also in a long format, which makes it easy to read in and use directly. One thing to take into consideration is the lack of a column for information on proteotypicity of peptides. However, this information can be inferred from the Proteins column, which contains more than one protein ID for peptides that are not proteotypic. You can extract this information and create a new column called is_proteotypic containing logicals that will be TRUE if the Proteins column does not contain a semicolon and FALSE if it does (this indicates that the peptide belongs to more than one protein). As mentioned in the data analysis vignettes, this information is necessary for the analysis of LiP-MS data, but it could also be considered for the correct calculation of protein abundances.

Another column that is required for the analysis of your data is a column indicating the conditions to which certain samples belong. This can be easily added to the evidence file by joining a data frame containing the specific annotations. For a large number of samples you can create such a data frame in Excel and import it into R, or you can just create it directly in R.

MaxQuant output provides information on decoy hits in the column reverse and on whether your hit is a contaminant in the column potential_contaminant. You should filter these out before the analysis. However, the contaminant column can be used for quality control.

One important thing for MaxQuant data is to make sure that you only have one intensity assigned to each peptide or precursor. You can do this by summing up all intensities that MaxQuant exports (these can be MULTI-MSMS, MSMS, ISO-MSMS, MULTI-MATCH, ISO-SECPEP), or you can filter, for example, for precursors with MULTI-MSMS quantification and only use these.
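Both options can be sketched on a toy table as follows. The column names (`file_name`, `modified_sequence`, `charge`, `type`, `intensity`) are assumptions made for this illustration; please check them against the cleaned column names of your own evidence file.

```r
library(dplyr)

# Toy stand-in for an evidence table (column names and values are assumptions)
evidence_toy <- data.frame(
  file_name = c("sample1", "sample1", "sample1", "sample2"),
  modified_sequence = c("_PEPTIDER_", "_PEPTIDER_", "_PEPTI_", "_PEPTIDER_"),
  charge = c(2, 2, 2, 2),
  type = c("MULTI-MSMS", "MSMS", "MULTI-MSMS", "MULTI-MSMS"),
  intensity = c(14000, 1000, 21000, 15000)
)

# Option 1: sum all intensities of one precursor within one file.
# Note that sum() returns NA if any value in the group is NA, so decide
# beforehand how you want to treat missing values.
evidence_summed <- evidence_toy %>%
  group_by(file_name, modified_sequence, charge) %>%
  summarise(intensity = sum(intensity), .groups = "drop")

# Option 2: keep only precursors quantified as MULTI-MSMS
evidence_multi <- evidence_toy %>%
  filter(type == "MULTI-MSMS")
```

Whichever option you choose, it is worth confirming afterwards that each precursor has only one row per file.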

In this section we will show you how to read in the file with read_protti() and how to create the is_proteotypic column and the condition column (minimally required) with the help of the stringr and dplyr packages. How to filter your data best is described in the data analysis vignettes.

# To read in your own data you can use read_protti()
evidence <- read_protti(filename = "yourpath/evidence.txt")

evidence_proteotypic <- evidence %>%
  # adds new column with logicals that are TRUE if the peptide can be assigned
  # to only one protein and FALSE if it can be assigned to multiple
  mutate(is_proteotypic = str_detect(
    string = proteins,
    pattern = ";",
    negate = TRUE
  )) %>%
  # adds new column with logicals indicating if peptide is coming from a potential contaminant
  mutate(is_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE))

# Make an annotation data frame and merge it with your data frame to obtain conditions
# We are annotating samples 1-3 as controls and samples 4-6 as treated conditions
file_name <- c( # make sure that the names are the same as in your report
  "sample1",
  "sample2",
  "sample3",
  "sample4",
  "sample5",
  "sample6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
evidence_annotated <- evidence_proteotypic %>%
  left_join(y = annotation, by = "file_name")
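Because left_join() silently fills in NA for samples that have no match in the annotation, it can be useful to check for mismatching names before joining. The following sketch uses toy data frames (`data_toy` and `annotation_toy` are hypothetical names) with one deliberately misspelled sample name.

```r
library(dplyr)

# Toy data and annotation; "Sample3" in the annotation does not match "sample3"
data_toy <- data.frame(
  file_name = c("sample1", "sample2", "sample3"),
  intensity = c(14000, 15000, 14500)
)

annotation_toy <- data.frame(
  file_name = c("sample1", "sample2", "Sample3"),
  condition = c("control", "control", "treated")
)

# Samples in the data that have no matching row in the annotation
unmatched <- data_toy %>%
  distinct(file_name) %>%
  anti_join(annotation_toy, by = "file_name")

unmatched
```

Any sample listed in `unmatched` would end up with a missing condition after the join.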

Protein-centric analysis

For protein-centric analyses you can use the proteinGroups.txt file provided by MaxQuant. This file contains information in a wide format where each sample has its own column containing intensity values. Therefore, we need to transform this data into a long format to meet the conditions of tidy data.

We will flag potential contaminants and use tidyr's pivot_longer() to change the format to long format. Furthermore, we produce an annotation data frame to create a condition column. Any filtering at this stage only serves to remove proteins of potentially low quality. Further filtering for decoys and potential contaminants should be performed based on the data analysis vignettes.

# To read in your own data you can use read_protti()
protein_groups <- read_protti(filename = "yourpath/proteinGroups.txt") %>%
  # adds new column with logicals indicating if protein is a potential contaminant,
  # you can filter these out later on. You should also consider filtering out proteins
  # that were "only identified by site" and reverse hits, as well as proteins with only
  # one identified peptide
  mutate(is_potential_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE))

# Change wide format to long format and create new columns called `file_name` and `intensity`
protein_groups_long <- protein_groups %>%
  pivot_longer(
    cols = starts_with("intensity_"),
    names_to = "file_name",
    values_to = "intensity"
  )

# Make an annotation data frame and merge it with your data frame to obtain conditions
# We are annotating samples 1-3 as controls and samples 4-6 as treated conditions
file_name <- c( # make sure that the names are the same as in your report
  "intensity_sample1",
  "intensity_sample2",
  "intensity_sample3",
  "intensity_sample4",
  "intensity_sample5",
  "intensity_sample6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
protein_groups_annotated <- protein_groups_long %>%
  left_join(y = annotation, by = "file_name")
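If you would rather have plain sample names in the `file_name` column, the `names_prefix` argument of pivot_longer() can strip the leading "intensity_" while pivoting. This is an optional variation shown on a toy table with made-up protein IDs and values.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a wide protein groups table
protein_groups_toy <- data.frame(
  protein_ids = c("protein_a", "protein_b"),
  intensity_sample1 = c(18000, 9000),
  intensity_sample2 = c(21000, 9500)
)

protein_groups_toy_long <- protein_groups_toy %>%
  pivot_longer(
    cols = starts_with("intensity_"),
    names_to = "file_name",
    names_prefix = "intensity_", # removes the prefix from the new file_name values
    values_to = "intensity"
  )

protein_groups_toy_long
```

With this variation the annotation data frame would use "sample1", "sample2" and so on instead of the prefixed column names.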

Skyline

The Skyline output is already in long format. However, to process it you need to sum up the transition intensities to obtain the intensity of one precursor. If you prefer to analyse your data on the fragment level, you should create a column that uniquely identifies each fragment of each precursor. You could do that by pasting together the peptide sequence with the charge and the product m/z.
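A fragment identifier of this kind could be built as in the following sketch. The toy column names follow the lower snake case names that the Skyline columns would be expected to have after reading them in, and the m/z and area values are made up.

```r
library(dplyr)

# Toy transitions of one precursor (values are illustrative)
skyline_toy <- data.frame(
  peptide_sequence = c("PEPTIDER", "PEPTIDER"),
  precursor_charge = c(2, 2),
  product_mz = c(175.119, 288.203),
  area = c(5000, 9000)
)

# Paste sequence, charge and product m/z together to identify each fragment
skyline_fragments <- skyline_toy %>%
  mutate(fragment = paste(peptide_sequence, precursor_charge, product_mz, sep = "_"))

skyline_fragments
```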

The required Skyline output columns include:

Peptide Sequence
Protein Name
Replicate Name
Precursor Charge
Product Mz (we are exporting these to distinguish transitions)
Area (or Intensity, depending on what you prefer to use)

You can add replicate and condition annotations in Skyline directly. However, we will explain in this section how you can also do it in R. If you want to analyse your data on the protein abundance level, you will have to combine the precursor intensities to obtain one value for protein abundance. This could be done using the calculate_protein_abundance() function from protti.

# Load data
skyline_data <- read_protti(filename = "yourpath/skyline.csv")

skyline_data_int <- skyline_data %>%
  # create a column with precursor information
  # ("Precursor Charge" becomes precursor_charge after reading with read_protti())
  mutate(precursor = paste0(peptide_sequence, "_", precursor_charge)) %>%
  group_by(replicate_name, precursor) %>%
  # make a new column containing the summed up intensities of all transitions of one precursor
  mutate(sum_intensity = sum(area)) %>%
  ungroup() %>%
  select(-c(product_mz, area)) %>% # remove the columns we don't need
  distinct() # remove duplicated rows from the data frame

# Add annotation
# make sure that the names are the same as in your report
# each replicate name may appear only once, otherwise the join duplicates rows
replicate_name <- c(
  "sample_1",
  "sample_2",
  "sample_3",
  "sample_4",
  "sample_5",
  "sample_6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(replicate_name, condition)

# Combine your long data frame with the annotation
skyline_annotated <- skyline_data_int %>%
  left_join(y = annotation, by = "replicate_name")

Proteome Discoverer

The Proteome Discoverer output contains data in wide format (one column for each sample). Similar to MaxQuant there is also the option for a peptide-centric or a protein-centric export. We will discuss both cases in this segment.

Peptide-centric analysis/LiP-MS analysis

For a peptide-centric or a LiP-MS analysis please export the “Peptide Groups” report. Before preparing your export you can add the column “sequence” to your table. Otherwise Proteome Discoverer will only export the “annotated sequence” column, which includes the preceding and following amino acids in the protein sequence.
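If you only have the annotated sequence available, the plain sequence can in principle be recovered in R. The sketch below assumes that the annotated sequence has the form `[K].PEPTIDER.[A]`, with the flanking amino acids in square brackets separated by dots; this format is an assumption, so please check it against your own export before relying on it.

```r
library(stringr)

# Toy annotated sequences in the assumed format
annotated_sequence <- c("[K].PEPTIDER.[A]", "[R].PEPTIDE.[-]")

# Extract the letters between the two dots and convert them to upper case
# (in case modified residues are written in lower case)
sequence <- toupper(
  str_extract(annotated_sequence, "(?<=\\.)[A-Za-z]+(?=\\.)")
)

sequence
```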

The required columns include:

Sequence
Modifications
Number Proteins
Contaminant
Master Protein Accessions
Abundance or normalized abundance columns
Quan Info

After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.

We will read in the file using read_protti() and then select the columns we are interested in. You can use the contaminant column for quality control. The number_proteins column contains information on the proteotypicity. If this is 1 then the peptide is proteotypic. If you only want to analyse your data qualitatively with the quality control functions of protti, you can keep peptides without quantifications. Before you start your quantitative analysis, remove observations that are labeled "No Quan Values" in the quan_info column. In the example below they are filtered out at this step, but you can keep them and only filter them out later.

# Load data
pd_pep_data <- read_protti("yourpath/PDpeptides.csv")

# Select relevant columns
pd_pep_selected <- pd_pep_data %>%
  select(
    sequence,
    modifications,
    number_proteins,
    contaminant,
    master_protein_accessions,
    starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped"
    quan_info
  )

# Filter data frame
pd_pep_filtered <- pd_pep_selected %>%
  filter(contaminant == FALSE) %>% # remove annotated contaminants
  filter(number_proteins == 1) %>% # select proteotypic peptides
  filter(quan_info != "No Quan Values") # remove peptides that have no quantification values

# Convert into long format
pd_pep_long <- pd_pep_filtered %>%
  pivot_longer(
    cols = starts_with("abundances"),
    names_to = "file_name",
    values_to = "intensity"
  ) %>%
  # combine peptide sequence and modifications to make a precursor column
  mutate(precursor = paste(sequence, modifications))

# Make annotation data frame
file_name <- c( # make sure that the names are the same as in your report
  "abundances_grouped_f1",
  "abundances_grouped_f2",
  "abundances_grouped_f3",
  "abundances_grouped_f4",
  "abundances_grouped_f5",
  "abundances_grouped_f6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
pd_pep_long_annotated <- pd_pep_long %>%
  left_join(y = annotation, by = "file_name")

Protein-centric analysis

For a protein-centric analysis please export the “Proteins” report.

The required columns include:

Accession
Description
Contaminant
Number Peptides
Abundance or normalized abundance columns

After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.

We will read in the file using read_protti() and then select the columns we are interested in. Similar to above, you can either filter the contaminant and number_peptides columns now or later.

# Load data
pd_prot_data <- read_protti("yourpath/PDproteins.csv")

# Select relevant columns
pd_prot_selected <- pd_prot_data %>%
  select(
    accession,
    description,
    contaminant,
    number_peptides,
    starts_with("abundances_grouped") # select all columns that start with "abundances_grouped"
  )

# Filter data frame
pd_prot_data_filtered <- pd_prot_selected %>%
  filter(contaminant == FALSE) %>% # remove annotated contaminants
  filter(number_peptides > 1) # select proteins with more than one identified peptide

# Convert into long format
pd_prot_long <- pd_prot_data_filtered %>%
  pivot_longer(
    cols = starts_with("abundances"),
    names_to = "file_name",
    values_to = "intensity"
  )

# Make annotation data frame
file_name <- c( # make sure that the names are the same as in your report
  "abundances_grouped_f1",
  "abundances_grouped_f2",
  "abundances_grouped_f3",
  "abundances_grouped_f4",
  "abundances_grouped_f5",
  "abundances_grouped_f6"
)

condition <- c(
  "control",
  "control",
  "control",
  "treated",
  "treated",
  "treated"
)

annotation <- data.frame(file_name, condition)

# Combine your long data frame with the annotation
pd_prot_long_annotated <- pd_prot_long %>%
  left_join(y = annotation, by = "file_name")

Other search engines and software

As mentioned in the beginning of this vignette, you can use the output of any search engine as long as it contains the minimally required columns. If it is not in the right format, you can see if some of the above transformations can be applied to your data. It is also always useful to check if you can find additional columns that help you in your analysis and that you can export from your search engine. Always make sure that all of your observations are ones you are interested in. Check if there are decoys, contaminants or non-proteotypic peptides in your data. For protein-centric analysis, potentially remove quantifications that rely on only a few peptides.
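A simple first check for data from any other software is whether the minimally required columns are present. The sketch below uses base R only; the column names in `required_columns` are placeholders for whatever your own report calls the sample, condition, protein ID, peptide and intensity columns.

```r
# Placeholder names for the minimally required columns (adapt to your report)
required_columns <- c("sample", "condition", "protein_id", "peptide", "intensity")

# Toy data frame that lacks a peptide column
my_data <- data.frame(
  sample = c("sample1", "sample2"),
  condition = c("treated", "control"),
  protein_id = c("P62942", "P62942"),
  intensity = c(14000, 18000)
)

# Required columns that are not present in the data
missing_columns <- setdiff(required_columns, colnames(my_data))

missing_columns
```

An empty result means that all required columns were found under the expected names.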
