With hundreds of genomes now in GenBank, researchers might be forgiven for assuming that genome sequence data are correct, at least at a large scale. Certainly there might be errors at some small rate, perhaps 1 in 50 000 or 100 000 bases (Schmutz et al., 2004; Read et al., 2002), but at a large scale these genomes are put together correctly, are they not? Well, not always.
We have been looking at the assemblies of large genomes for several years now, and for every ‘draft’ genome we look at, we find hundreds—and sometimes thousands—of mis-assemblies. These include regions where a genome is incorrectly re-arranged as well as places where large chunks of DNA sequence are simply deleted and the surrounding sequences just crunched together.
The source of most mis-assemblies is, as it has always been, repeats. Genomes vary in their repeat content, but we have learned that large genomes are filled with repeats of all shapes and sizes. To illustrate how these repeats result in sequences being ‘lost’ by an assembler, consider the situation in Figure 1.
Figure 1. Assemblies can collapse around repetitive sequences. R1 and R2, in yellow, represent near-identical copies of the same DNA sequence.
In the figure, we see that the genome has two copies, R1 and R2, of a sequence that lie near one another, separated by a unique region shown in red. If R1 and R2 are long enough, then the assembler will not have any individual sequences (‘reads’) containing the entire repeat and its unique flanking sequences (the green and blue regions). The result will be that the genome assembly looks like the lower half of the figure, with a contiguous stretch of DNA (a contig) that has just one copy of the repeat, incorrectly jamming together the blue and green regions, and the red region will have no place to go.
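The collapse described above can be mimicked with a toy model. The following is only a sketch under strong simplifying assumptions: the two repeat copies are treated as exactly identical, all reads have one fixed length, and a read "resolves" a repeat copy only if it covers the whole copy plus at least `min_anchor` unique bases on each side. The function name and its parameters are hypothetical and do not correspond to any real assembler.

```python
def collapse_tandem_repeat(green, repeat, red, blue, read_len, min_anchor=1):
    """Toy model of a repeat collapse (not a real assembler).

    True genome layout: green + repeat + red + repeat + blue.
    Returns (contig, orphan): the contig the toy "assembler" produces and
    any unique sequence that is left without a place in it.
    """
    genome = green + repeat + red + repeat + blue

    # A read can tie a repeat copy to its unique flanks only if it is long
    # enough to contain the whole copy plus some unique sequence on each side.
    spanning_read_exists = read_len >= len(repeat) + 2 * min_anchor

    if spanning_read_exists:
        # Both copies can be placed, so the true layout is recovered.
        return genome, ""

    # Otherwise the two copies are merged into one: green and blue are
    # joined through a single repeat, and the red region is orphaned.
    return green + repeat + blue, red


if __name__ == "__main__":
    contig, orphan = collapse_tandem_repeat(
        green="GGGG", repeat="ACGTACGT", red="TTT", blue="CCCC", read_len=6
    )
    print(contig)  # GGGGACGTACGTCCCC
    print(orphan)  # TTT
```

With reads shorter than the repeat, the sketch returns a contig holding a single repeat copy and reports the red region as an orphan, which is the lower half of the figure.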
If this seems like a made-up example, it is not: we have observed that even the best assemblers today make exactly this mistake when assembling the Drosophila species currently being sequenced. Compressions such as this can easily total 1% or more of the genome, and the ‘orphan’ regions can be quite long, 5000–10 000 bp or more. And we would note that Drosophila is not a particularly difficult genome as compared with many others currently under way. To those who might think (or argue) that the assembler they are using is not prone to such errors, we can only reply that we have seen these types of errors in all the major assemblers in use today (e.g. Arachne (Batzoglou et al., 2002; Jaffe et al., 2003), Celera Assembler (Myers et al., 2000), Jazz (Aparicio et al., 2002), Phusion (Mullikin and Ning, 2003), PCAP (Huang et al., 2003) and Atlas (Havlak et al., 2004)), in some cases after running the assemblers ourselves and in other cases after carefully examining the results of assemblies created by others.
We have developed software for improving assemblies that can detect at least some situations like the one shown above, although there is still no automated way of fixing these problems. However, the problem is often made much more difficult by the diploid nature of most large genomes, particularly the many mammalian genomes currently being sequenced by the NIH. The problem is this: the two copies of a chromosome are always slightly divergent, and this has led assembly groups (including ours) to develop methods for separating the two haplotypes from one another. But wherever there are tandem repeats in two or more copies, it can become extremely difficult to distinguish an incorrectly collapsed repeat (including situations such as that shown in Fig. 1) from true polymorphisms between the haplotypes.
A tremendous amount of genome analysis is built upon the framework of the DNA sequence itself: not only are genes and regulatory sites anchored in the sequence, but analyses of synteny, duplications and evolutionary relationships among species all depend on having the correct structure of the genome. We need to devote more effort to making sure the basis for all these analyses does not turn out to be a house of cards. Our group has created a website (Author Webpage) for depositing reference assemblies: genomes for which the sequence is finished, and for which we can demonstrate how all the original data map to that finished sequence. The site also distinguishes the original whole-genome shotgun reads from any additional finishing reads. This small set of genomes, which thus far includes only bacteria, should be just the beginning: all assemblies need to be available so that others can check them and, if necessary, correct them. Fortunately, NCBI has created a much larger resource to capture both draft and finished assemblies, the Assembly Archive (Salzberg et al., 2004). This archive captures the complete information about how a set of raw sequences maps to a genome assembly, whether that assembly is ‘draft’ or ‘finished’. After spending fifteen years and hundreds of millions of dollars on the human genome, the community has a near-complete draft sequence, but the evidence for that sequence (the underlying raw data and the assembly itself) is, amazingly, not available. Indeed, many of the original assemblies of parts of the human genome were done in the mid- and late 1990s, and are now lost. We can only hope that future genomes will not be needlessly lost now that there is a place to deposit them.
Are we arguing that all genomes should be finished? Actually, finishing does not necessarily address this problem at all. Finishing efforts are usually directed at closing gaps, not at fixing mis-assemblies, and therefore ‘finished’ genomes are very likely to contain errors of the type we are discussing. A better term for such genomes is ‘closed’: gaps are closed but sequence is not confirmed. We strongly suspect that many of the already-published finished genomes in GenBank today contain assembly errors.
Clearly we also need new, well-defined methods for comparing assemblies. The most popular metrics right now all seem to emphasize size: size of contigs, size of scaffolds, and especially N50 sizes. (The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire genome. The N50 size is the smallest contig in that set.) The standard of judging assembly quality by size of contigs is questionable. Large contigs may simply reflect overly aggressive joining of contigs, thereby creating larger contigs with mis-assemblies. As a consequence, genome scientists who are not experts at assembly can be completely misled by statistics about contig sizes, and as a result might prefer the ‘larger’ but incorrect assembly when given a choice.
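The N50 definition in the parenthesis above can be written out in a few lines. This is a minimal sketch: the function name `n50` and its `genome_size` parameter are illustrative choices, and falling back to the summed contig length when no genome size is supplied is an assumption made here for convenience, since the definition above refers to the size of the entire genome.

```python
def n50(contig_lengths, genome_size=None):
    """Return the N50 contig size.

    Contigs are sorted from largest to smallest; the N50 is the length of
    the smallest contig in the minimal set whose lengths total at least 50%
    of the genome. If `genome_size` is not given, the sum of the contig
    lengths is used instead (an assumption, see the text above).
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")

    total = genome_size if genome_size is not None else sum(contig_lengths)
    half = total / 2

    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length

    # The contigs together cover less than half of the stated genome size.
    return None


if __name__ == "__main__":
    careful = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    aggressive = [150, 90, 50]          # same bases, contigs joined more freely
    print(n50(careful))     # 70
    print(n50(aggressive))  # 150
```

The two hypothetical assemblies contain the same number of bases, yet the one with more joins reports an N50 more than twice as large. Nothing in the calculation reflects whether those joins are correct.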
We need to start capturing assemblies and looking at them with a more skeptical eye. This need has become even greater in the face of a growing number of ‘draft’ assemblies, many of which will never be finished. Before launching lengthy projects based on these genomes, we need to be confident that they are assembled correctly. The bioinformatics community should take the lead in this effort, by developing standards for quality control and by devoting more time and energy to careful evaluations of genome assemblies.
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide