- Technical Report
A framework for variation discovery and genotyping using next-generation DNA sequencing data
- Mark A DePristo1,
- Eric Banks1,
- Ryan Poplin1,
- Kiran V Garimella1,
- Jared R Maguire1,
- Christopher Hartl1,
- Anthony A Philippakis1,2,3,
- Guillermo del Angel1,
- Manuel A Rivas1,4,
- Matt Hanna1,
- Aaron McKenna1,
- Tim J Fennell1,
- Andrew M Kernytsky1,
- Andrey Y Sivachenko1,
- Kristian Cibulskis1,
- Stacey B Gabriel1,
- David Altshuler1,3,4 &
- Mark J Daly1,3,4
Nature Genetics volume 43, pages 491–498 (2011)
Abstract
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
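Step (iii) rests on the Phred convention, in which a base quality score Q encodes an error probability p as Q = -10 log10(p). The following is a minimal sketch of that idea, not the Genome Analysis Toolkit implementation: it bins bases by reported quality alone, counts mismatches against a reference, and derives an empirical quality per bin. The function names, the single-covariate binning, and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(p)


def empirical_quality(observations):
    """Estimate an empirical quality for each reported quality bin.

    observations: iterable of (reported_quality, is_mismatch) pairs,
    one per sequenced base. This input layout is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1

    # Add-one smoothing (an assumption) keeps the estimate finite
    # when a bin contains no mismatches.
    return {
        q: error_prob_to_phred((mismatches + 1) / (total + 2))
        for q, (mismatches, total) in counts.items()
    }
```

Under these assumptions, a bin reported at Q30 that shows 9 mismatches in 998 bases has a smoothed error rate of 10/1000 and therefore an empirical quality close to Q20, which suggests the reported scores in that bin are overconfident.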
Acknowledgements
Many thanks to our colleagues in Medical and Population Genetics and Cancer Informatics and the 1000 Genomes Project who encouraged and supported us during the development of the Genome Analysis Toolkit and associated tools. This work was supported by grants from the National Human Genome Research Institute, including the Large Scale Sequencing and Analysis of Genomes grant (54 HG003067) and the Joint SNP and CNV calling in 1000 Genomes sequence data grant (U01 HG005208). We would also like to thank our excellent anonymous reviewers for their thoughtful comments.
Author information
Authors and Affiliations
Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler & Mark J Daly
Brigham and Women's Hospital, Boston, Massachusetts, USA
Anthony A Philippakis
Harvard Medical School, Boston, Massachusetts, USA
Anthony A Philippakis, David Altshuler & Mark J Daly
Center for Human Genetic Research, Massachusetts General Hospital, Richard B. Simches Research Center, Boston, Massachusetts, USA
Manuel A Rivas, David Altshuler & Mark J Daly
Contributions
M.A.D., E.B., R.P., K.V.G., J.R.M., C.H., A.A.P., G.d.A., M.A.R., T.J.F., A.Y.S. and K.C. conceived of, implemented and performed analytic approaches. M.A.D., E.B., R.P., K.V.G., G.d.A., A.M.K. and M.J.D. wrote the manuscript. M.A.D., M.H. and A.M. developed Picard and GATK infrastructure underlying the tools implemented here. M.A.D., S.B.G., D.A. and M.J.D. led the team.
Corresponding author
Correspondence to Mark A DePristo.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figure 1, Supplementary Tables 1–7 and Supplementary Note (PDF 806 kb)
About this article
Cite this article
DePristo, M., Banks, E., Poplin, R. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011). https://doi.org/10.1038/ng.806