Movatterモバイル変換


[0]ホーム

URL:


Relative protein quantification forDIA-MS-based proteomics

Thang V Pham

2024-12-04

Introduction

Theiq package, short for ion-based proteinquantification, implements the MaxLFQ algorithm for data-independentacquisition (DIA) mass spectrometry (MS) based proteomics data. Thealgorithm was originally designed for data dependent acquisition (DDA)data (Cox et al., MCP 2014). The package also offers options for otherquantitation methods including topN (using N most intense fragmentions), MeanInt (using all fragment ions), and median polish (Tukey,1977). Finally, output from a MaxQuant experiment for DDA data can alsobe quantified usingiq.

To installiq from CRAN (this needs to be doneonce)

install.packages("iq")

and the package can be loaded for usage in the usual manner

library("iq")

This vignette presents three examples for raw data processingpipelines: Spectronaut (Bruderer et al., MCP 2015), OpenSWATH (Rost etal., Nat. Biotechnol. 2014), and MaxQuant (Cox & Mann, Nat.Biotechnol. 2008). The example data are available in the project GitHubrelease. To keep the file size manageable, we exclude data columns notnecessary for the analysis, followed by gzip compression. The user maydownload the data to a local working folder to run the examples.

Spectronaut output

An example dataset

We present an analysis of a publicly available dataset which was usedin a benchmark experiment for label-free DDA and DIA proteomics(Bruderer et al., MCP 2015). The result of the MaxQuant (version1.6.4.0) DDA search is used as a spectral library in Spectronaut version13.0 to process DIA data. Each sample is assigned to a unique conditionin Spectronaut. Subsequently, we use the conditions from C01 to C24 assample names. We use the default Spectronaut long format export with theaddition of columns:PG.Genes,PG.ProteinNames,F.ExcludedFromQuantification,F.FrgLossType,F.FrgIon,F.Charge, andF.PeakArea.

First we perform preprocessing to filter out ion fragments not usedfor quantification and to keep only relevant columns to keep the filesize small. Note that the following code section cannot be executed herebecause the starting input fileBruderer15-DIA-longformat.txtis not available. To process your own data, replaceBruderer15-DIA-longformat.txt by the name of the Spectronautexport.

raw<-read.delim("Bruderer15-DIA-longformat.txt")selected<- raw$F.ExcludedFromQuantification=="False"&            raw$F.FrgLossType=="noloss"&            (is.na(raw$PG.Qvalue)| raw$PG.Qvalue<=0.01)&            (is.na(raw$EG.Qvalue)| raw$EG.Qvalue<=0.01)raw<- raw[selected,c("R.Condition","PG.ProteinGroups","EG.ModifiedSequence","FG.Charge","F.FrgIon","F.Charge","F.PeakArea","PG.Genes","PG.ProteinNames")]write.table(raw,"Bruderer15-DIA-longformat-compact.txt",sep ="\t",row.names =FALSE)

Note that if the quantification filtering option is selected whenexporting the report from Spectronaut, the software already filters outentries with high protein q-values and peptide q-values. We can checkthat by

all(raw$PG.Qvalue<=0.01)all(raw$EG.Qvalue<=0.01)

We write the content ofraw to a text file and gzip itto keep the file size manageable. In the following, we will continuewith the compressed fileBruderer15-DIA-longformat-compact.txt.gz. Note that the userdoes not need to write the data to disk and read it back like in thisexample when processing their own data.

raw<-read.delim(gzfile("Bruderer15-DIA-longformat-compact.txt.gz"))

The MaxLFQ quantification is performed in three steps

norm_data<- iq::preprocess(raw)protein_list<- iq::create_protein_list(norm_data)result<- iq::create_protein_table(protein_list)

The first stepiq::preprocess prepares a long-formatinput including removing low-intensity ions and performing mediannormalization. To select a threshold for removing low-intensity ions,plot a histogram of the data as follows

hist(log2(raw[,"F.PeakArea"]),100,main ="Histogram of log2 intensities",col ="steelblue",border ="steelblue",freq =FALSE)

Here the default threshold value of zero forlog2_intensity_cutoff is appropriate as it removes thestrange distribution of low-intensity data points on the left of thefigure.

The default value forsample_id is"R.Condition". However, in practice we find it convenientto use"R.FileName" because one does not need to preparethe condition column in Spectronaut. Thesecondary_idparameter enforces the uniqueness of the peptide species for the MaxLFQalgorithm. For example, when multiple spectral libraries are used, it ispossible that a fragment comes from two or more libraries. In that case,we can usesecondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", "F.Charge").Note that one needs to export the necessary columns fromSpectronaut.

The second stepiq::create_protein_list converts thelong-format table to a list where each element contains the underlyingdata of each protein. The column names are sample names and row namesfragment ions. The result can be used to visualized proteinquantification for manual validation as in the next section.

The last statementiq::create_protein_table quantifiesthe protein list one by one.

The following statement produces a quantified protein table in atab-separated text fileoutput-maxLFQ.txt containing theprotein identifiers in the first column, the quantified proteins, andthe annotation text in the last column. This file can be opened by aspreadsheet application such as Excel.

write.table(cbind(Protein =rownames(result$estimate),                  result$estimate,annotation = result$annotation),"output-maxLFQ.txt",sep ="\t",row.names =FALSE)

Protein visualization

The functioniq::plot_protein() plots the underlying datafor individual proteins. By default, the names of fragment ions aredisplayed on the right panel, which can be disabled by setting parametersplit to NULL.

iq::plot_protein(protein_list$P00366,main ="Protein P00366",split =NULL)

By attaching the result of the quantitation algorithm to the datatable, we can demonstrate both protein quantitative values and theunderlying data. Here we set the colors of fragment ions to gray.

iq::plot_protein(rbind(protein_list$P00366,MaxLFQ = iq::maxLFQ(protein_list$P00366)$estimate),main ="MaxLFQ quantification of P00366",col =c(rep("gray",nrow(protein_list$P00366)),"green"),split =NULL)

The names of constituent ions are displayed on the right whensplit is notNULL. We can pass additionalgraphical parameters to fine-tune the plot. For example, in thefollowing we make the font size smaller by settingcex to0.4 due to the restricted plotting space.

iq::plot_protein(protein_list$P00366,main ="Protein P00366",cex =0.4)

We can also use the protein plotting function to display differentquantitative methods for evaluation. For example, here we show thequantitative values of a spike-in protein using MaxFLQ and the spike-inground truth values (rescaled so that the means of the two are thesame).

MaxLFQ_estimate<- iq::maxLFQ(protein_list$P12799)$estimateground_truth<-log2(rep(c(200,125.99,79.37,50,4,2.52,1.59,1),each =3))ground_truth<- ground_truth-mean(ground_truth)+mean(MaxLFQ_estimate)iq::plot_protein(rbind(MaxLFQ = MaxLFQ_estimate,Groundtruth = ground_truth),main ="P12799 - MaxLFQ versus groundtruth",split =0.75,cex =0.9,col =c("green","gold"))

Extracting additional protein annotation

The input data might contain extra annotations for the proteins. Weprovide a convenient function to extract additional annotation. Thefollowing script produces the same quantitative values as before, butwith additional columns for gene names and protein names.

extra_names<- iq::extract_annotation(rownames(result$estimate),                                      raw,annotation_columns =c("PG.Genes","PG.ProteinNames"))write.table(cbind(Protein =rownames(result$estimate),                  extra_names[,c("PG.Genes","PG.ProteinNames")],                  result$estimate,annotation = result$annotation),"output-maxLFQ-annotation.txt",sep ="\t",row.names =FALSE)

OpenSWATH output

We download OpenSWATH data from the publication of Schubert etal. (Cell Host & Microbe 2015). The data is in a long format withfragment ions and their corresponding intensities are concatenated intwo entries for each peptide. We separate these entries into an extendedlong format. Again, we will only keep relevant columns and compress theresult. Thus, the user can skip this code section and continue to thenext section using the data available in the project GitHub release.

tab<-read.delim("./Mtb_feature_alignment_requant_filtered_max10_fixed_noUPS.tsv",stringsAsFactors =FALSE)tab$Condition[tab$Condition=="d20_6h"]<-"d20_06h"tab_list<-vector("list",nrow(tab))for (iin1:nrow(tab)) {    a<-unlist(strsplit(tab[i,"aggr_Fragment_Annotation"],";"))    b<-unlist(strsplit(tab[i,"aggr_Peak_Area"],";"))    tab_list[[i]]<-NULLfor (jin1:length(a)) {        tab[i,"aggr_Fragment_Annotation"]<- a[j]        tab[i,"aggr_Peak_Area"]<- b[j]        tab_list[[i]]<-rbind(tab_list[[i]], tab[i,])    }}tab_extended<-do.call(rbind.data.frame, tab_list)quant<-as.double(tab_extended$aggr_Peak_Area)short_name<-paste(tab_extended$Condition, tab_extended$BioReplicate,                    tab_extended$Run,sep ="_")tab_small<-cbind(tab_extended[,c("ProteinName","FullPeptideName","Charge","aggr_Fragment_Annotation")], quant, short_name)write.table(tab_small,"Schubert15-OpenSWATH.txt",sep ="\t",row.names =FALSE)

Protein quantification

First we load the compressed data prepared by the previous step

tab_small<-read.delim(gzfile("Schubert15-OpenSWATH.txt.gz"))

and check the head of the data file

head(tab_small)#>       ProteinName  FullPeptideName Charge aggr_Fragment_Annotation     quant short_name#> 1 1/Rv0244c_fadE5 ALGHGEFSDVDVDTAR      2                     y9_1  1200.270    d00_1_1#> 2 1/Rv0244c_fadE5 ALGHGEFSDVDVDTAR      2                     y7_1  1635.659    d00_1_1#> 3 1/Rv0244c_fadE5 ALGHGEFSDVDVDTAR      2                     y5_1  2130.497    d00_1_1#> 4 1/Rv0244c_fadE5 ALGHGEFSDVDVDTAR      2                    y12_1  2324.244    d00_1_1#> 5 1/Rv0244c_fadE5 ALGHGEFSDVDVDTAR      2                    y10_1  2853.498    d00_1_1#> 6 1/Rv0244c_fadE5 ALGHGEFSDVDVDTAR      2                     y9_1 27360.707    d05_1_2

Then we plot a histogram of the data to examine the datadistribution

hist(log2(tab_small[,"quant"]),100,main ="Histogram of log2 intensities",col ="steelblue",border ="steelblue",freq =FALSE)

Here we do not observe abnormal low-intensity values, and the defaultvalue intensity cutoff of zero is good.

We follow the same procedure as for Spectronaut output for proteinquantification. Nevertheless, we have to adapt the column namesappropriately. Unique values in theprimary_id column formthe list of proteins to be quantified. A concatenation of the columns insecondary_id determines the fragment ions used forquantification. Unique values in thesample_id column formthe list of samples. Finally, columnintensity_colspecifies the quantitative values.

norm_data<- iq::preprocess(tab_small,primary_id ="ProteinName",secondary_id =c("FullPeptideName","Charge","aggr_Fragment_Annotation"),sample_id ="short_name",intensity_col ="quant")protein_list<- iq::create_protein_list(norm_data)result<- iq::create_protein_table(protein_list)write.table(cbind(Protein =rownames(result$estimate),                  result$estimate,annotation = result$annotation),"Schubert-output-maxLFQ.txt",sep ="\t",row.names =FALSE)

The resulting output text fileSchubert-output-maxLFQ.txtcan be loaded into a spreadsheet application.

MaxQuant output

The MaxLFQ algorithm is implemented in the software package MaxQuant.Nevertheless, there are cases in which MaxQuant fails in the label-freequantification step. One possible solution is to run MaxQuant withoutnormalization and derive the MaxLFQ values usingiq.

Read the protein group data

First, we read the protein quantification from MaxQuant output,ignoring entries detected in the reversed fasta database. The proteingroup table provides links to the underlying data in the MaxQuantevidence.txt file. When LFQ values are available, we can storethem in a table to compare the result of theiqimplementation to that of MaxQuant.

dda<-read.delim(gzfile("proteinGroups.txt.gz"))dda<-subset(dda, Reverse=="")# remove reversed entriesrownames(dda)<- dda[,"Protein.IDs"]# use protein group ids as rownameslfq<-grep("^LFQ",colnames(dda))dda_log2<-log2(dda[, lfq])dda_log2[dda_log2==-Inf]<-NAcolnames(dda_log2)<-sprintf("C%02d",1:24)

Building up a protein list from the MaxQuant evidence file

We first perform median normalization on intensities in theevidence.txt file. Subsequently, we produce a list of proteinas in the case of Spectronaut output in a variablep_list.

evidence<-read.delim(gzfile("evidence.txt.gz"),stringsAsFactors =FALSE)rownames(evidence)<- evidence[,"id"]# median normalizationex<-paste0(paste0("sample",rep(1:8,each=3),"_R0"),rep(1:3,8))ex_median<-rep(NA,length(ex))names(ex_median)<- exfor (iin ex) {    tmp<-subset(evidence, Experiment== i)    ex_median[i]<-median(tmp[,"Intensity"],na.rm =TRUE)}f<-mean(ex_median)/ex_medianevidence[,"Intensity"]<- evidence[,"Intensity"]* f[evidence[,"Experiment"]]# create a protein listp_list<-list()for (iin1:nrow(dda)) {    tmp<-unlist(strsplit(as.character(dda[i,"Evidence.IDs"]),";"))    a<- evidence[tmp,]    b<-data.frame(cn = a[,"Experiment"],rn =paste(a[,"Modified.sequence"], a[,"Charge"],sep="_"),quant = a[,"Intensity"])    b<- b[!is.na(b$quant),]    m<-matrix(NA,nrow =length(unique(b$rn)),ncol =length(ex),dimnames =list(unique(b$rn), ex))if (nrow(b)>0) {for (jin1:nrow(b)) {            rn<-as.character(b$rn[j])            cn<-as.character(b$cn[j])if (is.na(m[rn, cn])) {                m[rn, cn]<- b$quant[j]            }else {                m[rn, cn]<- m[rn, cn]+ b$quant[j]            }        }        p_list[[rownames(dda)[i]]]<-log2(m)    }else {        p_list[[rownames(dda)[i]]]<-NA    }}

Once the protein list is created, we can perform the MaxLFQ algorithmfor DDA as for DIA data. An example for MaxLFQ by MaxQuant andiq is shown here (rescaled so that the means of the twoquantification values are equal).

w1<- iq::maxLFQ(p_list$A1L0T0)$estimatew2<-as.numeric(dda_log2["A1L0T0", ])w2<- w2-mean(w2,na.rm =TRUE)+mean(w1,na.rm =TRUE)tmp<-rbind(p_list$A1L0T0,`MaxLFQ by iq`= w1,`MaxLFQ by MaxQuant`= w2)colnames(tmp)<-sprintf("C%02d",1:24)iq::plot_protein(tmp,main ="A1L0T0",cex =0.5,split =0.65,col =c(rep("gray",nrow(p_list$A1L0T0)),"green","blue"))

Finally, we write out the result of MaxLFQ quantification to a textfile.

output_mq<- iq::create_protein_table(p_list)write.table(cbind(Protein =rownames(output_mq$estimate),                  output_mq$estimate,annotation = output_mq$annotation),"output_IQ_LFQ.txt",sep ="\t",row.names =FALSE)

The tab-separated text fileoutput_IQ_LFQ.txt now containsthe MaxLFQ values by the packageiq implementation.

Other quantitation methods: topN, MeanInt, and median polish

We also implement the topN method, the MeanInt method, and the medianpolish method by introducing themethod parameter to theiq::create_protein_table() function. We create outputs for alldifferent methods in the following. Here, the variableprotein_list is from the Spectronaut or OpenSWATH example.

# default MaxLFQoutput<- iq::create_protein_table(protein_list)# median polishoutput<- iq::create_protein_table(protein_list,method ="median_polish")# top 3output<- iq::create_protein_table(protein_list,method ="topN",N =3)# top 5output<- iq::create_protein_table(protein_list,method ="topN",N =5)# MeanInt in the original intensity spaceoutput<- iq::create_protein_table(protein_list,method ="meanInt",aggregation_in_log_space =FALSE)

For the topN and MeanInt method, the aggregating operation can becarried out in the log space by default or in the original intensityspace by setting the parameteraggregation_in_log_space toFALSE.

References

  1. Bruderer R, Bernhardt OM, Gandhi T, et al. (2015) Extending thelimits of quantitative proteome profiling with data-independentacquisition and application to acetaminophen-treated three-dimensionalliver microtissues.Mol Cell Proteomics 14(5):1400–1410.
  2. Cox J, Hein MY, Luber CA, et al. (2014) Accurate proteome-widelabel-free quantification by delayed normalization and maximal peptideratio extraction, termed MaxLFQ.Mol Cell Proteomics13(9):2513–2526.
  3. Cox J and Mann M (2008) MaxQuant enables high peptide identificationrates, individualized p.p.b.-range mass accuracies and proteome-wideprotein quantification.Nat Biotechnol 26, pp 1367-72.
  4. Pham TV, Henneman AA, Jimenez CR (2020) iq: an R package to estimaterelative protein abundances from ion quantification in DIA-MS-basedproteomics.Bioinformatics 36(8):2611-2613.
  5. Rost HL, Rosenberger G, Navarro P, et al. (2014). OpenSWATH enablesautomated, targeted analysis of data-independent acquisition MS data.Nat. Biotechnol. 32, 219–223.
  6. Schubert OT, Ludwig C, Kogadeeva M, et al. (2015) Absolute proteomecomposition and dynamics during dormancy and resuscitation ofmycobacterium tuberculosis.Cell Host Microbe18(1):96-108.
  7. Tukey JW. (1977)Exploratory Data Analysis, ReadingMassachusetts: Addison-Wesley.

[8]ページ先頭

©2009-2025 Movatter.jp