Bioinformatics Journals

International Society for Computational Biology

Volume 21

Issue 24

December 2005

Journal Article

Beware of mis-assembled genomes

Steven L. Salzberg¹,*

¹Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA

*To whom correspondence should be addressed. E-mail: [email protected]

James A. Yorke²

²Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742, USA

Bioinformatics, Volume 21, Issue 24, December 2005, Pages 4320–4321, https://doi.org/10.1093/bioinformatics/bti769

Published: 25 October 2005

With hundreds of genomes now in GenBank, researchers might be forgiven for assuming that genome sequence data are correct, at least at a large scale. Certainly there might be errors at some small rate, perhaps 1 in 50 000 or 100 000 bases (Schmutz et al., 2004; Read et al., 2002), but at a large scale these genomes are put together correctly, aren't they? Well, not always.

We have been looking at the assemblies of large genomes for several years now, and for every ‘draft’ genome we look at, we find hundreds—and sometimes thousands—of mis-assemblies. These include regions where a genome is incorrectly re-arranged as well as places where large chunks of DNA sequence are simply deleted and the surrounding sequences just crunched together.

The source of most mis-assemblies is, as it has always been, repeats. Genomes vary in their repeat content, but we have learned that large genomes are filled with repeats of all shapes and sizes. To illustrate how these repeats result in sequences being ‘lost’ by an assembler, consider the situation in Figure 1.

Fig. 1

Assemblies can collapse around repetitive sequences. R1 and R2, in yellow, represent near-identical copies of the same DNA sequence.

In the figure, we see that the genome has two copies, R1 and R2, of a sequence that lie near one another, separated by a unique region shown in red. If R1 and R2 are long enough, then the assembler will not have any individual sequences (‘reads’) containing the entire repeat and its unique flanking sequences (the green and blue regions). The result will be that the genome assembly looks like the lower half of the figure, with a contiguous stretch of DNA (a contig) that has just one copy of the repeat, incorrectly jamming together the blue and green regions, and the red region will have no place to go.
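The collapse described above can be illustrated with a toy check. The sketch below is not taken from any of the assemblers discussed here; it assumes error-free reads, exactly identical repeat copies, and hypothetical names (`substrings`, `collapse_is_read_consistent`). It asks whether every read-length window of the collapsed contig also occurs in the true genome. When reads are too short to span the repeat plus at least one flanking base on each side, the answer is yes, so the reads alone give no local evidence against the collapsed layout.

```python
def substrings(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def collapse_is_read_consistent(left, repeat, unique, right, read_len):
    """Check whether the collapsed contig is locally consistent with the genome.

    True genome (top of Fig. 1):      left + repeat + unique + repeat + right
    Collapsed contig (bottom half):   left + repeat + right
    Returns True when every read-length window of the collapsed contig
    also appears somewhere in the true genome.
    """
    genome = left + repeat + unique + repeat + right
    collapsed = left + repeat + right
    return substrings(collapsed, read_len) <= substrings(genome, read_len)


if __name__ == "__main__":
    # Toy sequences chosen for illustration only (hypothetical data).
    left, repeat, unique, right = "CCCCC", "ACGTTGCA", "TTTTT", "GGGGG"

    # Reads of 9 bases cannot cover the 8-base repeat plus both flanks,
    # so the collapsed contig looks consistent with the reads.
    print(collapse_is_read_consistent(left, repeat, unique, right, 9))   # True

    # Reads of 10 bases can span the repeat and one base on each side,
    # which exposes the false junction between the flanking regions.
    print(collapse_is_read_consistent(left, repeat, unique, right, 10))  # False
```

In this toy case the collapsed contig drops one repeat copy and the whole unique region (13 of 31 bases), which is the ‘orphan’ sequence with no place to go.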

If this seems like a made-up example, it is not: we have observed that even the best assemblers today make exactly this mistake when assembling the Drosophila species currently being sequenced. Compressions such as this can easily total 1% or more of the genome, and the ‘orphan’ regions can be quite long, 5000–10 000 bp or more. And we would note that Drosophila is not a particularly difficult genome compared with many others currently under way. To those who might think (or argue) that the assembler they are using is not prone to such errors, we can only reply that we have seen these types of errors in all the major assemblers in use today (e.g. Arachne (Batzoglou et al., 2002; Jaffe et al., 2003), Celera Assembler (Myers et al., 2000), Jazz (Aparicio et al., 2002), Phusion (Mullikin and Ning, 2003), PCAP (Huang et al., 2003) and Atlas (Havlak et al., 2004)), in some cases after running the assemblers ourselves and in other cases after carefully examining the results of assemblies created by others.

We have developed software for improving assemblies that can detect at least some situations like the one shown above, although there is still no automated way of fixing these problems. However, the problem is often made much more difficult by the diploid nature of most large genomes, particularly the many mammalian genomes currently being sequenced by the NIH. The problem is this: the two copies of a chromosome are always slightly divergent, and this has led assembly groups (including ours) to develop methods for separating the two haplotypes from one another. But wherever there are tandem repeats in two or more copies, it can become extremely difficult to distinguish an incorrectly collapsed repeat (including situations such as that shown in Fig. 1) from true polymorphisms between the haplotypes.

A tremendous amount of genome analysis is built upon the framework of the DNA sequence itself: not only are genes and regulatory sites anchored in the sequence, but analyses of synteny, duplications and evolutionary relationships among species all depend on having the correct structure of the genome. We need to devote more effort to making sure the basis for all these analyses does not turn out to be a house of cards. Our group has created a website (Author Webpage) for depositing reference assemblies: genomes for which the sequence is finished, and for which we can demonstrate how all the original data map to that finished sequence. The site also distinguishes the original whole-genome shotgun reads from any additional finishing reads. This small set of genomes, which thus far only includes bacteria, should be just the beginning: all assemblies need to be available so that others can check them and, if necessary, correct them. Fortunately, NCBI has created a much larger resource to capture both draft and finished assemblies, the Assembly Archive (Salzberg et al., 2004). This archive captures the complete information about how a set of raw sequences maps to a genome assembly, whether that assembly is ‘draft’ or ‘finished’. After spending fifteen years and hundreds of millions of dollars on the human genome, the community has a near-complete draft sequence, but the evidence for that sequence (the underlying raw data and the assembly itself) is, amazingly, not available. Indeed, many of the original assemblies of parts of the human genome were done in the mid- and late-1990s, and are now lost. We can only hope that future genomes will not be needlessly lost now that there is a place to deposit them.

Are we arguing that all genomes should be finished? Actually, finishing does not necessarily address this problem at all. Finishing efforts are usually directed at closing gaps, not at fixing mis-assemblies, and therefore ‘finished’ genomes are very likely to contain errors of the type we are discussing. A better term for such genomes is ‘closed’: gaps are closed but sequence is not confirmed. We strongly suspect that many of the already-published finished genomes in GenBank today contain assembly errors.

Clearly we also need new, well-defined methods for comparing assemblies. The most popular metrics right now all seem to emphasize size: size of contigs, size of scaffolds, and especially N50 sizes. (The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire genome. The N50 size is the smallest contig in that set.) The standard of judging assembly quality by size of contigs is questionable. Large contigs may simply reflect overly aggressive joining of contigs, thereby creating larger contigs with mis-assemblies. As a consequence, genome scientists who are not experts at assembly can be completely misled by statistics about contig sizes, and as a result might prefer the ‘larger’ but incorrect assembly when given a choice.
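The N50 definition in parentheses above can be written down directly. The following is a minimal sketch, not code from any assembler or from the Assembly Archive; the function name `n50` and the optional `genome_size` argument are assumptions made for illustration. When no genome size is supplied, the sketch falls back to the total length of the contigs, which is a common convention but differs slightly from the definition in the text.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 size of a set of contigs, or None if it is undefined.

    Contigs are sorted from largest to smallest and accumulated until their
    combined length reaches 50% of the genome; the N50 size is the smallest
    contig in that minimal set. If genome_size is None, the sum of the
    contig lengths is used instead (an assumption of this sketch).
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:  # reached at least half of the total
            return length
    return None  # no contigs, or contigs cover less than half the genome


if __name__ == "__main__":
    # Hypothetical contig sizes (arbitrary units).
    careful = [80, 70, 50, 40, 30, 20]
    # Same total length, but the two largest contigs have been joined,
    # correctly or not.
    aggressive = [150, 50, 40, 30, 20]
    print(n50(careful))     # 70
    print(n50(aggressive))  # 150
```

The second call shows why the metric rewards aggressive joining: merging two contigs more than doubles the N50 size here, whether or not the join is correct.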

We need to start capturing assemblies and looking at them with a more skeptical eye. This need has become even greater in the face of a growing number of ‘draft’ assemblies, many of which will never be finished. Before launching lengthy projects based on these genomes, we need to be confident that they are assembled correctly. The bioinformatics community should take the lead in this effort, by developing standards for quality control and by devoting more time and energy to careful evaluations of genome assemblies.

REFERENCES

Aparicio et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes, Science, 2002, vol. 297 (pg. 1301–1310)

Batzoglou et al. ARACHNE: a whole-genome shotgun assembler, Genome Res., 2002 (pg. 177–189)

Havlak et al. The Atlas genome assembly system, Genome Res., 2004 (pg. 721–732)

Huang et al. PCAP: a whole-genome assembly program, Genome Res., 2003 (pg. 2164–2170)

Jaffe D.B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2, Genome Res., 2003

Mullikin J.C., Ning. The Phusion assembler, Genome Res., 2003

Myers E.W. et al. A whole-genome assembly of Drosophila, Science, 2000, vol. 287 (pg. 2196–2204)

Read T.D. et al. Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis, Science, 2002, vol. 296 (pg. 2028–2033)

Salzberg S.L. et al. The genome assembly archive: a new public resource, PLoS Biol., 2004 (pg. E285)

Schmutz et al. Quality assessment of the human genome sequence, Nature, 2004, vol. 429 (pg. 365–368)
