Nature Genetics
  • Technical Report
  • Published:

A framework for variation discovery and genotyping using next-generation DNA sequencing data

Nature Genetics volume 43, pages 491–498 (2011)

Abstract

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (4×) 1000 Genomes Project datasets.
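
Step (iii), base quality score recalibration, can be illustrated with a minimal sketch. This is not the Genome Analysis Toolkit implementation: it assumes the only covariate is the reported quality score, it treats every mismatch to the reference as a sequencing error, and the add-one smoothing and the function names (`phred`, `empirical_qualities`) are illustrative choices that do not come from the article.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_qualities(observations):
    """Estimate an empirical quality for each reported quality bin.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per
    sequenced base at sites assumed not to be true variants.
    Returns a dict mapping reported_quality -> empirical Phred quality.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1

    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) avoids log10(0)
        # in bins where no mismatches were observed.
        rate = (mismatches + 1) / (total + 2)
        table[reported_q] = phred(rate)
    return table


# Toy data: 1,000 bases reported as Q30, of which 10 mismatch the reference.
toy = [(30, i < 10) for i in range(1000)]
recal = empirical_qualities(toy)
```

In this toy input the bases are reported as Q30 (an error rate of 1 in 1,000) but mismatch the reference about 1% of the time, so the empirical quality in `recal[30]` comes out near Q20. That gap between reported and observed error rates is what recalibration is meant to correct.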

Figure 1: Framework for variation discovery and genotyping from next-generation DNA sequencing.
Figure 2: Integrative genomics viewer (IGV) visualization of alignments in region chr.1: 1,510,530–1,510,589 from the Trio NA12878 Illumina reads from the 1000 Genomes Project (a) and NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment (b).
Figure 3: Raw (pink) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 of Illumina/GA (a), Roche/454 (b) and Life/SOLiD (c) lanes from the 1000 Genomes Project and Illumina/HiSeq (d).
Figure 4: Results of variant quality recalibration on HiSeq, exome and low-pass data sets.
Figure 5: Variation discovered among 60 individuals from the CEPH population from the 1000 Genomes Project pilot phase plus low-pass NA12878.
Figure 6: Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples.
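
As a reading aid for the sensitivity and specificity reported in Figure 6, the sketch below applies the standard textbook definitions to a call set compared against a truth set. It is a hedged illustration only: the article's own evaluation metrics may differ, and the function name, the site positions and the idea of a fixed set of evaluated sites are assumptions made for this example.

```python
def sensitivity_specificity(called, truth, evaluated):
    """Compare called variant sites with a truth set over evaluated sites.

    called, truth, evaluated: iterables of hashable site identifiers.
    Sites outside `evaluated` are ignored.
    Returns (sensitivity, specificity) using the standard definitions
    TP / (TP + FN) and TN / (TN + FP).
    """
    evaluated = set(evaluated)
    called = set(called) & evaluated
    truth = set(truth) & evaluated

    tp = len(called & truth)   # variant sites that were called
    fn = len(truth - called)   # variant sites that were missed
    fp = len(called - truth)   # calls at non-variant sites
    tn = len(evaluated) - tp - fn - fp

    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy example: 20 evaluated positions, 4 true variant sites.
evaluated = range(100, 120)
truth = {101, 105, 110, 115}
called = {101, 105, 110, 118}
sens, spec = sensitivity_specificity(called, truth, evaluated)
```

Here three of the four true sites are recovered and one call is a false positive, so `sens` is 0.75 and `spec` is 15/16.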

Acknowledgements

Many thanks to our colleagues in Medical and Population Genetics and Cancer Informatics and the 1000 Genomes Project who encouraged and supported us during the development of the Genome Analysis Toolkit and associated tools. This work was supported by grants from the National Human Genome Research Institute, including the Large Scale Sequencing and Analysis of Genomes grant (54 HG003067) and the Joint SNP and CNV calling in 1000 Genomes sequence data grant (U01 HG005208). We would also like to thank our excellent anonymous reviewers for their thoughtful comments.

Author information

Authors and Affiliations

  1. Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA

    Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler & Mark J Daly

  2. Brigham and Women's Hospital, Boston, Massachusetts, USA

    Anthony A Philippakis

  3. Harvard Medical School, Boston, Massachusetts, USA

    Anthony A Philippakis, David Altshuler & Mark J Daly

  4. Center for Human Genetic Research, Massachusetts General Hospital, Richard B. Simches Research Center, Boston, Massachusetts, USA

    Manuel A Rivas, David Altshuler & Mark J Daly

Contributions

M.A.D., E.B., R.P., K.V.G., J.R.M., C.H., A.A.P., G.d.A., M.A.R., T.J.F., A.Y.S. and K.C. conceived of, implemented and performed analytic approaches. M.A.D., E.B., R.P., K.V.G., G.d.A., A.M.K. and M.J.D. wrote the manuscript. M.A.D., M.H. and A.M. developed Picard and GATK infrastructure underlying the tools implemented here. M.A.D., S.B.G., D.A. and M.J.D. led the team.

Corresponding author

Correspondence to Mark A DePristo.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figure 1, Supplementary Tables 1–7 and Supplementary Note (PDF 806 kb)

About this article

Cite this article

DePristo, M., Banks, E., Poplin, R. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011). https://doi.org/10.1038/ng.806
