Nature Methods
  • Brief Communication
  • Published:

Salmon provides fast and bias-aware quantification of transcript expression

Nature Methods volume 14, pages 417–419 (2017)

Abstract

We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.

Figure 1: Performance of Salmon.

References

  1. Hoadley, K.A. et al. Cell 158, 929–944 (2014).

  2. Li, J.J., Huang, H., Bickel, P.J. & Brenner, S.E. Genome Res. 24, 1086–1101 (2014).

  3. Weinstein, J.N. et al. Nat. Genet. 45, 1113–1120 (2013).

  4. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Genome Biol. 12, R22 (2011).

  5. Love, M.I., Hogenesch, J.B. & Irizarry, R.A. Nat. Biotechnol. 34, 1287–1291 (2016).

  6. Morán, I. et al. Cell Metab. 16, 435–448 (2012).

  7. Teng, M. et al. Genome Biol. 17, 74 (2016).

  8. Kodama, Y., Shumway, M. & Leinonen, R. Nucleic Acids Res. 40, D54–D56 (2012).

  9. Patro, R., Mount, S.M. & Kingsford, C. Nat. Biotechnol. 32, 462–464 (2014).

  10. Bray, N.L., Pimentel, H., Melsted, P. & Pachter, L. Nat. Biotechnol. 34, 525–527 (2016).

  11. Lappalainen, T. et al. Nature 501, 506–511 (2013).

  12. SEQC/MAQC-III Consortium. Nat. Biotechnol. 32, 903–914 (2014).

  13. Frazee, A.C., Jaffe, A.E., Langmead, B. & Leek, J.T. Bioinformatics 31, 2778–2784 (2015).

  14. Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A. & Dewey, C.N. Bioinformatics 26, 493–500 (2010).

  15. Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2013).

  16. Langmead, B. & Salzberg, S.L. Nat. Methods 9, 357–359 (2012).

  17. Srivastava, A., Sarkar, H., Gupta, N. & Patro, R. Bioinformatics 32, i192–i200 (2016).

  18. 't Hoen, P.A. et al. Nat. Biotechnol. 31, 1015–1022 (2013).

  19. Foulds, J., Boyles, L., DuBois, C., Smyth, P. & Welling, M. in Proc. 19th ACM SIGKDD Int. Conf. Knowledge Discov. & Data Mining 446–454 (ACM, 2013).

  20. Bishop, C.M. et al. Pattern Recognition and Machine Learning (Springer, 2006).

  21. Hensman, J., Papastamoulis, P., Glaus, P., Honkela, A. & Rattray, M. Bioinformatics 31, 3881–3889 (2015).

  22. Nariai, N. et al. BMC Genomics 15 (Suppl. 10), S5 (2014).

  23. Cappé, O. in Mixtures: Estimation and Applications (eds. Mengersen, K.L., Robert, C.P. & Titterington, D.M.) Ch. 2 (John Wiley & Sons, 2011).

  24. Hsieh, C.-J., Yu, H.-F. & Dhillon, I.S. ICML 15, 2370–2379 (2015).

  25. Salzman, J., Jiang, H. & Wong, W.H. Stat. Sci. 26, 1 (2011).

  26. Nicolae, M., Mangul, S., Măndoiu, I.I. & Zelikovsky, A. Algorithms Mol. Biol. 6, 9 (2011).

  27. Turro, E. et al. Genome Biol. 12, R13 (2011).

  28. Li, X., David, G., Andersen, M.K. & Freedman, M.J. in Proc. Ninth Eur. Conf. Computer Syst. 27 (ACM, 2014).

  29. Jackman, S. & Birol, I. F1000Research 5, 1795 (2016).

  30. Merkel, D. Linux J. 2014 (2014).

  31. Di Tommaso, P., Chatzou, M., Baraja, P.P. & Notredame, C. figshare https://dx.doi.org/10.6084/m9.figshare.1254958.v2 (2014).

  32. Brett, K.B.-J. & Greene, C.S. Preprint at https://doi.org/10.1101/056473 (2016).

Acknowledgements

We wish to thank those who have been using and providing feedback on Salmon since early in its (open) development cycle. The software has been greatly improved in many ways based on their feedback. This research is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4554 to C.K. It is partially funded by the US National Science Foundation (CCF-1256087, CCF-1319998, BBSRC-NSF/BIO-1564917) and the US National Institutes of Health (R21HG006913, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow. This work was partially completed while G.D. was a postdoctoral fellow in the Computational Biology Department at Carnegie Mellon University. M.I.L. was supported by NIH grant 5T32CA009337-35. R.A.I. was supported by NIH R01 grant HG005220.

Author information

Authors and Affiliations

  1. Department of Computer Science, Stony Brook University, Stony Brook, New York, USA

    Rob Patro

  2. DNAnexus, Mountain View, California, USA

    Geet Duggal

  3. Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Cambridge, Massachusetts, USA

    Michael I Love & Rafael A Irizarry

  4. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, Massachusetts, USA

    Michael I Love & Rafael A Irizarry

  5. Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

    Carl Kingsford

Authors
  1. Rob Patro

  2. Geet Duggal

  3. Michael I Love

  4. Rafael A Irizarry

  5. Carl Kingsford

Contributions

R.P. and C.K. designed the method, which was implemented by R.P. R.P., G.D., M.I.L., R.A.I., and C.K. designed the experiments, and R.P., G.D., and M.I.L. conducted the experiments. R.P., G.D., M.I.L., R.A.I., and C.K. wrote the manuscript.

Corresponding authors

Correspondence to Rob Patro or Carl Kingsford.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Overview of Salmon’s method and components and execution timeline.

Salmon accepts either raw (green arrows) or aligned (gray arrow) reads as input. When processing quasi-mappings or aligned reads, Salmon executes an online inference algorithm. This ensures that transcript abundance estimates are available to estimate weights for the rich equivalence classes, and to consider the appropriate conditional probabilities when learning the experimental parameters and foreground bias models. After a fragment’s contributions to the online abundance estimates and bias models have been computed, the fragment is placed into an appropriate equivalence class (or one is created if it does not yet exist). Once all of the fragments have been observed, the initial abundances and fragment equivalence classes are passed to the offline inference module. The offline module learns the background bias models (based on initial abundance estimates) and then corrects the effective transcript lengths to account for the appropriate biases. Finally, the offline inference algorithm (EM or VBEM) is run over the reduced representation of the data until convergence. Once estimation is complete, posterior samples are generated via Gibbs sampling or a bootstrap procedure if the user has requested this.
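The offline phase described above (EM over fragment equivalence classes) can be illustrated with a minimal sketch. This is not Salmon's implementation: it assumes uniform per-class weights instead of the rich equivalence-class weights, takes the effective lengths as given rather than bias-corrected, and omits the online phase and VBEM entirely. The function names `build_equivalence_classes`, `em_abundances`, and `to_tpm` are hypothetical.

```python
from collections import Counter


def build_equivalence_classes(fragment_mappings):
    """Collapse fragments that map to the same set of transcripts.

    fragment_mappings: iterable of transcript-name collections, one per fragment.
    Returns a Counter mapping frozenset(transcripts) -> number of fragments.
    """
    return Counter(frozenset(m) for m in fragment_mappings)


def em_abundances(eq_classes, eff_lengths, n_iter=1000, tol=1e-10):
    """Plain EM for expected fragment counts per transcript.

    Assumption: every transcript in a class is weighted only by its current
    abundance divided by its effective length (no per-class bias weights).
    """
    transcripts = sorted(eff_lengths)
    total = sum(eq_classes.values())
    # Start from a uniform allocation of all fragments.
    counts = {t: total / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        new_counts = {t: 0.0 for t in transcripts}
        for members, n_frags in eq_classes.items():
            # E-step: split the class count in proportion to count / length.
            weights = {t: counts[t] / eff_lengths[t] for t in members}
            denom = sum(weights.values())
            if denom == 0.0:
                continue
            for t, w in weights.items():
                new_counts[t] += n_frags * w / denom
        # M-step is implicit: the reallocated counts are the new estimates.
        delta = max(abs(new_counts[t] - counts[t]) for t in transcripts)
        counts = new_counts
        if delta < tol:
            break
    return counts


def to_tpm(counts, eff_lengths):
    """Convert expected counts to transcripts per million."""
    rates = {t: counts[t] / eff_lengths[t] for t in counts}
    norm = sum(rates.values())
    return {t: 1e6 * r / norm for t, r in rates.items()}


if __name__ == "__main__":
    mappings = [["A"]] * 30 + [["B"]] * 10 + [["A", "B"]] * 20
    lengths = {"A": 1000.0, "B": 1000.0}
    classes = build_equivalence_classes(mappings)
    est = em_abundances(classes, lengths)
    print(est, to_tpm(est, lengths))
```

Because the loop runs over equivalence classes rather than individual fragments, the cost of each iteration depends on the number of distinct classes, which is the point of the reduced representation mentioned in the caption.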

Supplementary Figure 2 The false discovery rate (FDR) vs. sensitivity of detecting differentially expressed transcripts on Polyester simulated data

The false discovery rate (FDR) vs. the sensitivity of Salmon, Salmon (align), kallisto and eXpress on Polyester simulated RNA-seq data using empirically-derived fragment GC bias profiles. All methods were run with bias-correction enabled, but only Salmon’s model incorporates corrections for fragment GC bias. This leads to a large improvement in sensitivity at almost every FDR value.
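For readers who want to reproduce this kind of curve on simulated data with known ground truth, the following is a minimal sketch of how FDR and sensitivity can be traced as the calling threshold is relaxed. The function name `fdr_sensitivity_curve` and its inputs are hypothetical; the source does not specify how the differential expression calls were ranked, so the use of per-transcript scores such as adjusted p-values is an assumption.

```python
def fdr_sensitivity_curve(scores, is_true_de):
    """Return a list of (fdr, sensitivity) pairs, one per ranked transcript.

    scores: per-transcript significance, smaller means more significant
            (assumed here to be something like an adjusted p-value).
    is_true_de: ground-truth booleans from the simulation.
    Ties in scores are broken by input order, which is a simplification.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_true = sum(1 for flag in is_true_de if flag)
    curve = []
    tp = fp = 0
    for i in order:
        if is_true_de[i]:
            tp += 1
        else:
            fp += 1
        fdr = fp / (tp + fp)  # fraction of calls so far that are false
        sensitivity = tp / n_true if n_true else 0.0
        curve.append((fdr, sensitivity))
    return curve


if __name__ == "__main__":
    scores = [0.001, 0.01, 0.02, 0.5]
    truth = [True, False, True, False]
    for fdr, sens in fdr_sensitivity_curve(scores, truth):
        print(f"FDR={fdr:.3f} sensitivity={sens:.3f}")
```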

Supplementary Figure 3 Abundance vs. fold change accuracy on Polyester simulated data

The log2 fold change between the estimated and true abundances as a function of the true abundance (measured in TPM), for all methods and for all replicates of both simulated “conditions” (each row displays points from all samples within a given condition). The top row corresponds to the 8 samples simulated from the data showing the weak fragment GC content bias, while the bottom row corresponds to the 8 samples simulated from the data showing the stronger fragment GC content bias. Points with an estimated log2 fold change of > 0.5 or < -0.5 are colored red. The fraction of red points appears in the upper right-hand corner of each plot. Salmon consistently demonstrates log fold changes closer to 0 than either kallisto or eXpress, across most of the range of expression.
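The summary statistic in each panel (the fraction of points with an absolute log2 fold change above 0.5) can be sketched as follows. The pseudocount is an assumption added here to avoid dividing by zero or taking the log of zero; the source does not say how zero abundances were handled, and the function names are hypothetical.

```python
import math


def log2_fold_changes(estimated_tpm, true_tpm, pseudocount=1.0):
    """log2 ratio of estimated to true abundance for each transcript.

    pseudocount is an illustrative choice, not taken from the paper.
    """
    return [
        math.log2((est + pseudocount) / (true + pseudocount))
        for est, true in zip(estimated_tpm, true_tpm)
    ]


def fraction_outside(lfcs, threshold=0.5):
    """Fraction of transcripts whose |log2 fold change| exceeds the threshold."""
    if not lfcs:
        return 0.0
    return sum(1 for x in lfcs if abs(x) > threshold) / len(lfcs)


if __name__ == "__main__":
    estimated = [3.0, 1.0, 7.0]
    truth = [1.0, 1.0, 3.0]
    lfcs = log2_fold_changes(estimated, truth)
    print(lfcs, fraction_outside(lfcs))
```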

Supplementary Figure 4 Consistency of estimates on SEQC data within and between centers

The distribution of the mean absolute error of (inverse hyperbolic sine-transformed) TPMs between different replicates of data from the SEQC [12] study. The A sample corresponds to universal human reference tissue (UHRR) and the B sample corresponds to human brain tissue (HBRR). When comparing the replicates that were sequenced at different centers, the inter-replicate distances are larger. However, we observe that Salmon’s bias correction methodology results in improved consistency (i.e., reduced distances) compared to the estimates produced by other methods, especially when comparing replicates sequenced at different centers, where we expect the effects of bias to be more pronounced.
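The distance used here, the mean absolute error between inverse hyperbolic sine-transformed TPM vectors, is simple to compute. The sketch below is illustrative only: the function names and the dictionary layout for replicates are assumptions, and both vectors are assumed to list the same transcripts in the same order.

```python
import math
from itertools import combinations


def asinh_mae(tpm_a, tpm_b):
    """Mean absolute difference of asinh-transformed TPMs for two replicates."""
    if len(tpm_a) != len(tpm_b):
        raise ValueError("replicates must cover the same transcripts")
    if not tpm_a:
        raise ValueError("replicates must not be empty")
    total = sum(abs(math.asinh(a) - math.asinh(b)) for a, b in zip(tpm_a, tpm_b))
    return total / len(tpm_a)


def pairwise_distances(replicates):
    """Distance for every unordered pair of replicates.

    replicates: dict mapping a replicate name to its list of TPMs.
    """
    return {
        (x, y): asinh_mae(replicates[x], replicates[y])
        for x, y in combinations(sorted(replicates), 2)
    }


if __name__ == "__main__":
    reps = {
        "center1_rep1": [0.0, 10.0, 250.0],
        "center1_rep2": [0.5, 12.0, 240.0],
        "center2_rep1": [2.0, 20.0, 180.0],
    }
    for pair, dist in pairwise_distances(reps).items():
        print(pair, round(dist, 4))
```

The asinh transform behaves like a logarithm for large values but is defined at zero, so transcripts with zero TPM need no pseudocount.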

Supplementary Figure 5 Salmon reduces false isoform switching

Transcripts demonstrating dominant isoform switching that results from technical bias. In the quantification estimates computed using kallisto and eXpress, these two-isoform genes show a change in the dominant isoform between conditions (an asterisk denotes a t-test on log2(TPM+1) with p < 1 × 10⁻⁶). However, Salmon directly corrects for technical biases that appear to underlie differences across sequencing centers, revealing that the dominant isoform has not, in fact, switched across centers.

Supplementary Figure 6 Quantification accuracy for Salmon, Salmon (align), kallisto and eXpress using RSEM-sim data.

The distribution of Spearman correlations over all 20 replicates of the RSEM-sim data for Salmon, kallisto and eXpress. Salmon and kallisto yield very similar distributions of correlations (no statistically significant difference), while both methods yield correlations greater than that of eXpress (Mann-Whitney U test, p = 3.39780 × 10⁻⁸).
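The per-replicate accuracy measure in this figure is the Spearman correlation between estimated and true abundances. A dependency-free sketch is given below for illustration; it computes Spearman's rho as the Pearson correlation of average ranks. The function names are hypothetical, and the subsequent Mann-Whitney U comparison between methods is not shown.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(estimated, truth):
    """Spearman's rho between estimated and true abundances."""
    return pearson(average_ranks(estimated), average_ranks(truth))


if __name__ == "__main__":
    true_tpm = [0.0, 5.0, 5.0, 120.0, 900.0]
    est_tpm = [0.2, 4.0, 6.5, 100.0, 950.0]
    print(round(spearman(est_tpm, true_tpm), 4))
```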

Supplementary Figure 7 Effect of number of GC models

The effect of the number of conditional GC models used to account for correlation between fragment GC and sequence-specific bias. We choose 3 bins as the default, since this is the simplest model that demonstrates the majority of the benefit. Panels a, b and c show the result of varying the number of conditional GC models on an analysis of the GEUVADIS data for all genes, all transcripts, and genes with only two transcripts, respectively.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–7, Supplementary Tables 1–4, Supplementary Notes 1 and 2, and Supplementary Algorithms 1 (PDF 1950 kb)

Cite this article

Patro, R., Duggal, G., Love, M. et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14, 417–419 (2017). https://doi.org/10.1038/nmeth.4197
