Beware of mis-assembled genomes

Steven L. Salzberg*,1 and James A. Yorke2
1 Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
2 Institute for Physical Sciences and Technology, University of Maryland, College Park, MD 20742, USA
*To whom correspondence should be addressed. E-mail: [email protected]
Bioinformatics, Volume 21, Issue 24, December 2005, Pages 4320–4321, https://doi.org/10.1093/bioinformatics/bti769
Published: 25 October 2005

With hundreds of genomes now in GenBank, researchers might be forgiven for assuming that genome sequence data are correct, at least at a large scale. Certainly there might be errors at some small rate, perhaps 1 in 50 000 or 100 000 bases (Schmutz et al., 2004; Read et al., 2002), but at a large scale these genomes are put together correctly, aren't they? Well, not always.

We have been looking at the assemblies of large genomes for several years now, and for every ‘draft’ genome we look at, we find hundreds—and sometimes thousands—of mis-assemblies. These include regions where a genome is incorrectly re-arranged as well as places where large chunks of DNA sequence are simply deleted and the surrounding sequences just crunched together.

The source of most mis-assemblies is, as it has always been, repeats. Genomes vary in their repeat content, but we have learned that large genomes are filled with repeats of all shapes and sizes. To illustrate how these repeats result in sequences being ‘lost’ by an assembler, consider the situation in Figure 1.

Fig. 1. Assemblies can collapse around repetitive sequences. R1 and R2, in yellow, represent near-identical copies of the same DNA sequence.

In the figure, we see that the genome has two copies, R1 and R2, of a sequence that lie near one another, separated by a unique region shown in red. If R1 and R2 are long enough, then the assembler will not have any individual sequences (‘reads’) containing the entire repeat and its unique flanking sequences (the green and blue regions). The result will be that the genome assembly looks like the lower half of the figure, with a contiguous stretch of DNA (a contig) that has just one copy of the repeat, incorrectly jamming together the blue and green regions, and the red region will have no place to go.
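
The collapse described above can be reproduced on a toy example. The following is a minimal sketch, not any assembler's actual algorithm: it assumes error-free reads of a single fixed length that cover every position, and the sequences and the helper names `reads_of` and `unsupported_windows` are invented for illustration. It checks whether every window of the collapsed contig is matched by some read drawn from the true genome.

```python
def reads_of(sequence, read_len):
    """All error-free substrings of length read_len (an idealised read set)."""
    return {sequence[i:i + read_len] for i in range(len(sequence) - read_len + 1)}


def unsupported_windows(assembly, genome, read_len):
    """Windows of the assembly that no read sampled from the genome matches."""
    return reads_of(assembly, read_len) - reads_of(genome, read_len)


# Toy sequences, invented for illustration only (hypothetical data).
GREEN, REPEAT, RED, BLUE = "ACGTAC", "GGGTTTCCCAAA", "TCTAGA", "CATGCA"
genome = GREEN + REPEAT + RED + REPEAT + BLUE   # true layout: two repeat copies
collapsed = GREEN + REPEAT + BLUE               # mis-assembly: one copy, red lost

for read_len in (8, len(REPEAT) + 1, len(REPEAT) + 2):
    # Windows of the collapsed contig that the reads contradict.
    contradicted = unsupported_windows(collapsed, genome, read_len)
    # Reads from the true genome that have no place in the collapsed contig.
    orphans = unsupported_windows(genome, collapsed, read_len)
    print(read_len, sorted(contradicted), len(orphans))
```

In this toy setting, reads no longer than the repeat plus one base never contradict the collapsed contig, because every window of it lies entirely within green-plus-repeat or repeat-plus-blue, and both of those exist in the true genome. Only a read long enough to include the whole repeat and a base of unique sequence on each side exposes the false green-to-blue join. The reads containing red sequence are the ones left with no place to go.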

If this seems like a made-up example, it is not: we have observed that even the best assemblers today make exactly this mistake when assembling the Drosophila species currently being sequenced. Compressions such as this can easily total 1% or more of the genome, and the ‘orphan’ regions can be quite long, 5000–10 000 bp or more. And we would note that Drosophila is not a particularly difficult genome as compared with many others currently under way. To those who might think (or argue) that the assembler they are using is not prone to such errors, we can only reply that we have seen these types of errors in all the major assemblers in use today (e.g. Arachne (Batzoglou et al., 2002; Jaffe et al., 2003), Celera Assembler (Myers et al., 2000), Jazz (Aparicio et al., 2002), Phusion (Mullikin and Ning, 2003), PCAP (Huang et al., 2003) and Atlas (Havlak et al., 2004)), in some cases after running the assemblers ourselves and in other cases after carefully examining the results of assemblies created by others.

We have developed software for improving assemblies that can detect at least some situations like the one shown above, although there is still no automated way of fixing these problems. However, the problem is often made much more difficult by the diploid nature of most large genomes, particularly the many mammalian genomes currently being sequenced by the NIH. The problem is this: the two copies of a chromosome are always slightly divergent, and this has led assembly groups (including ours) to develop methods for separating the two haplotypes from one another. But wherever there are tandem repeats in two or more copies, it can become extremely difficult to distinguish an incorrectly collapsed repeat (including situations such as that shown in Fig. 1) from true polymorphisms between the haplotypes.

A tremendous amount of genome analysis is built upon the framework of the DNA sequence itself: not only are genes and regulatory sites anchored in the sequence, but analyses of synteny, duplications and evolutionary relationships among species all depend on having the correct structure of the genome. We need to devote more effort to making sure the basis for all these analyses does not turn out to be a house of cards. Our group has created a website (Author Webpage) for depositing reference assemblies: genomes for which the sequence is finished, and for which we can demonstrate how all the original data map to that finished sequence. The site also distinguishes the original whole-genome shotgun reads from any additional finishing reads. This small set of genomes, which thus far only includes bacteria, should be just the beginning: all assemblies need to be available so that others can check them and, if necessary, correct them. Fortunately, NCBI has created a much larger resource to capture both draft and finished assemblies, the Assembly Archive (Salzberg et al., 2004). This archive captures the complete information about how a set of raw sequences maps to a genome assembly, whether that assembly is ‘draft’ or ‘finished’. After spending fifteen years and hundreds of millions of dollars on the human genome, the community has a near-complete draft sequence, but the evidence for that sequence (the underlying raw data and the assembly itself) is, amazingly, not available. Indeed, many of the original assemblies of parts of the human genome were done in the mid- and late-1990s, and are now lost. We can only hope that future genomes will not be needlessly lost now that there is a place to deposit them.

Are we arguing that all genomes should be finished? Actually, finishing does not necessarily address this problem at all. Finishing efforts are usually directed at closing gaps, not at fixing mis-assemblies, and therefore ‘finished’ genomes are very likely to contain errors of the type we are discussing. A better term for such genomes is ‘closed’: gaps are closed but sequence is not confirmed. We strongly suspect that many of the already-published finished genomes in GenBank today contain assembly errors.

Clearly we also need new, well-defined methods for comparing assemblies. The most popular metrics right now all seem to emphasize size: size of contigs, size of scaffolds, and especially N50 sizes. (The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire genome. The N50 size is the smallest contig in that set.) The standard of judging assembly quality by size of contigs is questionable. Large contigs may simply reflect overly aggressive joining of contigs, thereby creating larger contigs with mis-assemblies. As a consequence, genome scientists who are not experts at assembly can be completely misled by statistics about contig sizes, and as a result might prefer the ‘larger’ but incorrect assembly when given a choice.
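
The N50 definition above, and the way joining contigs inflates it, can be sketched as follows. This is an illustrative sketch rather than a standard tool: the function name `n50` and the contig lengths are hypothetical, and when no genome size is supplied the code assumes the total assembled length stands in for "the entire genome".

```python
def n50(contig_lengths, genome_size=None):
    """Smallest contig in the minimal set of largest contigs covering half the genome.

    Assumption: if genome_size is not given, the summed contig length is used
    in place of the true genome size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached 50% of the genome
            return length
    return None  # the contigs cover less than half of the stated genome size


# Hypothetical contig lengths (arbitrary units).
conservative = [100, 80, 60, 40, 20]
# The same assembly after an aggressive (possibly wrong) join of the two largest contigs.
aggressive = [180, 60, 40, 20]

print(n50(conservative))  # 80
print(n50(aggressive))    # 180
```

Both assemblies contain exactly the same 300 units of sequence, yet the N50 more than doubles after a single join. The statistic cannot tell whether that join was correct.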

We need to start capturing assemblies and looking at them with a more skeptical eye. This need has become even greater in the face of a growing number of ‘draft’ assemblies, many of which will never be finished. Before launching lengthy projects based on these genomes, we need to be confident that they are assembled correctly. The bioinformatics community should take the lead in this effort, by developing standards for quality control and by devoting more time and energy to careful evaluations of genome assemblies.

REFERENCES

Aparicio S., et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes, Science, 2002, vol. 297 (pg. 1301-1310)
Batzoglou S., et al. ARACHNE: a whole-genome shotgun assembler, Genome Res., 2002, vol. 12 (pg. 177-189)
Havlak P., et al. The Atlas genome assembly system, Genome Res., 2004, vol. 14 (pg. 721-732)
Huang X., et al. PCAP: a whole-genome assembly program, Genome Res., 2003, vol. 13 (pg. 2164-2170)
Jaffe D.B., et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2, Genome Res., 2003, vol. 13 (pg. 91-96)
Mullikin J.C., Ning Z. The Phusion assembler, Genome Res., 2003, vol. 13 (pg. 81-90)
Myers E.W., et al. A whole-genome assembly of Drosophila, Science, 2000, vol. 287 (pg. 2196-2204)
Read T.D., et al. Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis, Science, 2002, vol. 296 (pg. 2028-2033)
Salzberg S.L., et al. The genome assembly archive: a new public resource, PLoS Biol., 2004, vol. 2 (pg. E285)
Schmutz J., et al. Quality assessment of the human genome sequence, Nature, 2004, vol. 429 (pg. 365-368)
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Issue Section: Letter to the Editor