- Letter
Genome assembly comparison identifies structural variants in the human genome
- Razi Khaja1,
- Junjun Zhang1,
- Jeffrey R MacDonald1,
- Yongshu He1,
- Ann M Joseph-George1,
- John Wei1,
- Muhammad A Rafiq1,2,
- Cheng Qian1,
- Mary Shago1,
- Lorena Pantano3,
- Hiroyuki Aburatani4,
- Keith Jones5,
- Richard Redon6,
- Matthew Hurles6,
- Lluis Armengol3,
- Xavier Estivill3,7,
- Richard J Mural8,
- Charles Lee9,
- Stephen W Scherer1 &
- Lars Feuk1
Nature Genetics volume 38, pages 1413–1418 (2006)
Abstract
Numerous types of DNA variation exist, ranging from SNPs to larger structural alterations such as copy number variants (CNVs) and inversions. Alignment of DNA sequence from different sources has been used to identify SNPs1,2 and intermediate-sized variants (ISVs)3. However, only a small proportion of total heterogeneity is characterized, and little is known of the characteristics of most smaller-sized (<50 kb) variants. Here we show that genome assembly comparison is a robust approach for identification of all classes of genetic variation. Through comparison of two human assemblies (Celera's R27c compilation and the Build 35 reference sequence), we identified megabases of sequence (in the form of 13,534 putative non-SNP events) that were absent, inverted or polymorphic in one assembly. Database comparison and laboratory experimentation further demonstrated overlap or validation for 240 variable regions and confirmed >1.5 million SNPs. Some differences were simple insertions and deletions, but in regions containing CNVs, segmental duplication and repetitive DNA, they were more complex. Our results uncover substantial undescribed variation in humans, highlighting the need for comprehensive annotation strategies to fully interpret genome scanning and personalized sequencing projects.
References
1. Marth, G.T. et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23, 452–456 (1999).
2. Tsui, C. et al. Single nucleotide polymorphisms (SNPs) that map to gaps in the human SNP map. Nucleic Acids Res. 31, 4910–4916 (2003).
3. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
4. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
5. Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
6. Myers, E.W., Sutton, G.G., Smith, H.O., Adams, M.D. & Venter, J.C. On the sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA 99, 4145–4146 (2002).
7. Adams, M.D., Sutton, G.G., Smith, H.O., Myers, E.W. & Venter, J.C. The independence of our genome assemblies. Proc. Natl. Acad. Sci. USA 100, 3025–3026 (2003).
8. Istrail, S. et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. USA 101, 1916–1921 (2004).
9. Waterston, R.H., Lander, E.S. & Sulston, J.E. On the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 99, 3712–3716 (2002).
10. Waterston, R.H., Lander, E.S. & Sulston, J.E. More on the sequencing of the human genome. Proc. Natl. Acad. Sci. USA 100, 3022–3024 (2003).
11. Feuk, L., Carson, A.R. & Scherer, S.W. Structural variation in the human genome. Nat. Rev. Genet. 7, 85–97 (2006).
12. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).
13. Mobarry, C. & Sutton, G. An assembly-to-assembly comparison tool. in Proceedings of the Third Annual RECOMB Satellite Meeting on DNA Sequencing Technologies and Computation (2003).
14. Kent, W.J. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
15. Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501–D504 (2005).
16. Bailey, J.A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
17. Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 (2004).
18. Redon, R. et al. Global variation in copy number in the human genome. Nature (in the press).
19. Wang, J. et al. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329 (2006).
20. Hillier, L.W. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003).
21. Scherer, S.W. et al. Human chromosome 7: DNA sequence and biology. Science 300, 767–772 (2003).
22. Schmutz, J. et al. The DNA sequence and comparative analysis of human chromosome 5. Nature 431, 268–274 (2004).
23. Shendure, J., Mitra, R.D., Varma, C. & Church, G.M. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 5, 335–344 (2004).
24. Bennett, S.T., Barnes, C., Cox, A., Davies, L. & Brown, C. Toward the 1,000 dollars human genome. Pharmacogenomics 6, 373–382 (2005).
25. Service, R.F. Gene sequencing. The race for the $1000 genome. Science 311, 1544–1546 (2006).
26. Cheung, J. et al. Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 4, R25 (2003).
27. Kent, W.J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
28. Feuk, L. et al. Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet. 1, e56 (2005).
29. Pfaffl, M.W. A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res. 29, e45 (2001).
30. Osborne, L.R. et al. A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 29, 321–325 (2001).
Acknowledgements
We thank T. Tang, L. Wong, J. Wittnam, C.-F. Chu and W. Hwang of The Centre for Applied Genomics for technical assistance. Computational analyses were supported by the Shared Hierarchical Academic Research Computing Network (SHARCNET) and the Centre for Computational Biology at the Hospital for Sick Children. The work was supported by Genome Canada/Ontario Genomics Institute, the Canadian Institutes of Health Research (CIHR), the Canada Foundation for Innovation and the McLaughlin Centre for Molecular Medicine (all to S.W.S.). L.A. and X.E. are supported by Genoma España and Genome Canada joint R+D+I projects and by the Generalitat de Catalunya (Departament d'Universitats, 2005SGR00008, and Departament de Salut). L.F. is supported by CIHR. S.W.S. is an Investigator of CIHR and International Scholar of Howard Hughes Medical Institute.
Author information
Authors and Affiliations
The Hospital for Sick Children and Department of Molecular and Medical Genetics, Program in Genetics and Genomic Biology, University of Toronto and The Centre for Applied Genomics, MaRS Centre, Toronto, M5G 1L7, Ontario, Canada
Razi Khaja, Junjun Zhang, Jeffrey R MacDonald, Yongshu He, Ann M Joseph-George, John Wei, Muhammad A Rafiq, Cheng Qian, Mary Shago, Stephen W Scherer & Lars Feuk
Department of Biosciences, Commission on Science and Technology for Sustainable Development in the South Institute of Information Technology (CIIT), Islamabad, 44000, Pakistan
Muhammad A Rafiq
Genes and Disease Program, Center for Genomic Regulation, Charles Darwin s/n, Barcelona Biomedical Research Park, Barcelona, 08003, Catalonia, Spain
Lorena Pantano, Lluis Armengol & Xavier Estivill
Genome Science Laboratory, Research Center for Advanced Science and Technology, University of Tokyo, 4-6-1 Komaba, Meguro, 153-8904, Tokyo, Japan
Hiroyuki Aburatani
Affymetrix, Inc., Santa Clara, 95051, California, USA
Keith Jones
The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, CB10 1SA, Cambridge, UK
Richard Redon & Matthew Hurles
Pompeu Fabra University, Charles Darwin s/n, and National Genotyping Centre, Passeig Marítim 37-49, Barcelona Biomedical Research Park, Barcelona, Catalonia, Spain
Xavier Estivill
Windber Research Institute, 620 7th Street, Windber, 15963-1331, Pennsylvania, USA
Richard J Mural
Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, 20 Shattuck St., Boston, 02115, Massachusetts, USA
Charles Lee
Contributions
The study was designed by R.K., S.W.S. and L.F. The GCA algorithm was created by R.K. Sequence alignment and computational analysis were performed by R.K., J.Z., J.R.M., J.W., C.Q., L.A. and R.J.M. FISH analysis was performed by Y.H., A.M.J.G., M.S. and C.L. PCR analysis was performed by M.A.R., L.P., L.A. and L.F. J.Z., J.R.M., J.W., C.Q., H.A., K.J., R.R., M.H., L.A., X.E., C.L., S.W.S. and L.F. contributed to the analysis of overlap with genomic features, creation of data sets for such analysis and interpretation of the data. S.W.S. and L.F. conceptualized, designed and coordinated the experiments. The paper was written by S.W.S. and L.F.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Table 1
Results for MegaBLAST and A2Amapper comparing R27c versus Build 35 and comparing Build 35 versus R27c. (PDF 18 kb)
Supplementary Table 2
List of copy-unmatched sequences identified by GCA; table also shows information on repeat content and re-BLAT versus Build 35, Build 36 and chimpanzee Build 1. (XLS 176 kb)
Supplementary Table 3
Intra- and interscaffold inversions identified by GCA between R27c and Build 35. (PDF 10 kb)
Supplementary Table 4
List of refined set of unmatched sequences used for analysis of overlap with genomic features; all entries in this list with an insertion point were used for genomic overlap analysis. (XLS 5127 kb)
Supplementary Table 5
Analysis of RefSeq genes and mRNAs. (XLS 377 kb)
Supplementary Table 6
Results and details for PCR-based assays. (XLS 34 kb)
Supplementary Table 7
Results and details for fluorescence in situ hybridization experiments. (PDF 37 kb)
Supplementary Table 8
Results of comparisons of single-base mismatches detected by GCA with dbSNP_125 and with HapMap QC+/QC− SNPs. (PDF 59 kb)
Supplementary Table 9
Comparison between assembly differences and other genomic features. (XLS 82 kb)
About this article
Cite this article
Khaja, R., Zhang, J., MacDonald, J. et al. Genome assembly comparison identifies structural variants in the human genome. Nat Genet 38, 1413–1418 (2006). https://doi.org/10.1038/ng1921