- Technical Report
A general framework for estimating the relative pathogenicity of human genetic variants
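- Martin Kircher¹ (ORCID: orcid.org/0000-0001-9278-5471; contributed equally),
- Daniela M Witten² (contributed equally),
- Preti Jain³ (present address given under Author information),
- Brian J O'Roak¹ (present address given under Author information),
- Gregory M Cooper³ (ORCID: orcid.org/0000-0001-5509-9923) &
- Jay Shendure¹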
Nature Genetics volume 46, pages 310–315 (2014)
Abstract
Current methods for annotating and interpreting human genetic variation tend to exploit a single information type (for example, conservation) and/or are restricted in scope (for example, to missense changes). Here we describe Combined Annotation–Dependent Depletion (CADD), a method for objectively integrating many diverse annotations into a single measure (C score) for each variant. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants. We precompute C scores for all 8.6 billion possible human single-nucleotide variants and enable scoring of short insertions-deletions. C scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects and complex trait associations, and they highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.
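As a rough illustration of the training setup described in the abstract, the sketch below fits a linear support vector machine that separates "observed" from "simulated" variants and then scores new variants with it. Everything in it is an assumption made for illustration: the two annotation features, the toy values, the hyperparameters and the function names (`train_linear_svm`, `raw_score`) are hypothetical, and the plain hinge-loss subgradient descent stands in for whatever solver and feature set the authors actually used.

```python
"""Toy sketch: a linear SVM separating observed from simulated variants.

All data, feature meanings and hyperparameters are hypothetical.
"""

# Hypothetical two-feature annotation vectors, e.g. (conservation, coding-impact).
TOY_OBSERVED = [(0.1, 0.0), (0.2, 0.1), (0.3, 0.0), (0.2, 0.2)]   # label -1
TOY_SIMULATED = [(0.8, 1.0), (0.9, 0.7), (0.7, 0.9), (0.6, 0.8)]  # label +1


def raw_score(w, b, x):
    """Linear decision value; larger means more 'simulated-like'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(observed, simulated, epochs=200, lr=0.1, lam=0.01):
    """Minimise L2-regularised hinge loss with plain subgradient descent."""
    data = [(x, -1.0) for x in observed] + [(x, 1.0) for x in simulated]
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            # The hinge loss is active only when the margin is below 1.
            violated = y * raw_score(w, b, x) < 1.0
            for i, xi in enumerate(x):
                grad = lam * w[i] - (y * xi if violated else 0.0)
                w[i] -= lr * grad
            if violated:
                b += lr * y
    return w, b


if __name__ == "__main__":
    w, b = train_linear_svm(TOY_OBSERVED, TOY_SIMULATED)
    for variant in [(0.1, 0.1), (0.9, 0.9)]:
        print(variant, round(raw_score(w, b, variant), 3))
```

On this toy data, a variant whose annotations resemble the simulated class receives a higher raw score than one resembling the observed class. This mirrors, in a very simplified way, the idea of ranking variants by how different their annotation profile is from that of alleles that have survived in the human lineage.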
Acknowledgements
We thank P. Green and members of the Shendure laboratory for helpful discussions and suggestions. Our work was supported by US NIH grants U54HG006493 (to J.S. and G.M.C.), DP5OD009145 (to D.M.W.) and DP1HG007811 (to J.S.).
Author information
Preti Jain & Brian J O'Roak
Present address: Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon, USA.
Martin Kircher and Daniela M Witten: These authors contributed equally to this work.
Authors and Affiliations
Department of Genome Sciences, University of Washington, Seattle, Washington, USA
Martin Kircher, Brian J O'Roak & Jay Shendure
Department of Biostatistics, University of Washington, Seattle, Washington, USA
Daniela M Witten
HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, USA
Preti Jain & Gregory M Cooper
Contributions
G.M.C. and J.S. designed the study. M.K. processed the annotation data and scores and developed and implemented the simulator and scripts required for scoring. P.J. and B.J.O. prepared and provided data sets and annotations. D.M.W. and M.K. developed the model and performed model training. D.M.W. performed the analysis of individual features and interactions. M.K., D.M.W., G.M.C. and J.S. analyzed the model's performance on different data sets. G.M.C. analyzed the GWAS data. J.S., G.M.C., M.K. and D.M.W. wrote the manuscript with input from all authors.
Corresponding authors
Correspondence to Gregory M Cooper or Jay Shendure.
Ethics declarations
Competing interests
The authors (M.K., D.M.W., G.M.C. and J.S.) have filed a provisional patent application with the US Patent and Trademark Office on the basis of CADD.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–18, Supplementary Tables 1–12 and Supplementary Note (PDF 4022 kb)
About this article
Cite this article
Kircher, M., Witten, D., Jain, P. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310–315 (2014). https://doi.org/10.1038/ng.2892