PLoS One. 2013 Apr 29;8(4):e62856.

doi: 10.1371/journal.pone.0062856. Print 2013.

Effects of GC bias in next-generation-sequencing data on de novo genome assembly

Yen-Chun Chen¹, Tsunglin Liu, Chun-Hui Yu, Tzen-Yuh Chiang, Chi-Chuan Hwang

PMID: 23638157
PMCID: PMC3639258
DOI: 10.1371/journal.pone.0062856

Yen-Chun Chen et al. PLoS One. 2013.

Affiliation

¹ Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan.

Abstract

Next-generation sequencing (NGS) has revolutionized genome assembly by offering much higher data throughput at much lower cost than traditional Sanger sequencing. However, NGS poses new computational challenges for de novo genome assembly. Among these, GC bias in NGS data is known to hamper genome assembly, but it is not clear to what extent it does so in general. In this work, we conduct a systematic analysis of the effects of GC bias on genome assembly. Our analyses reveal that GC bias lowers assembly completeness only when the degree of GC bias exceeds a threshold. At a strong GC bias, the resulting assembly fragmentation can be explained by the low read coverage in the GC-poor or GC-rich regions of a genome. This effect is observed for all the assemblers under study. Increasing the total amount of NGS data therefore rescues the fragmentation caused by GC bias; however, the amount of data needed for a full rescue depends on the distribution of GC contents. Both the low and the high coverage depths that result from GC bias lower assembly accuracy. These findings provide guidance toward better de novo genome assembly in the presence of GC bias.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Scatter plots of GC content and read coverage of real Illumina data.**
The data sets are from the *S. aureus* USA300 (A) and *S. aureus* MRSA252 (B) genomes. Read coverage is normalized to the mean value, which is represented by a horizontal dashed line. A vertical dashed line denotes the mean GC content. The data points are fitted by a straight line, and the slope is defined as the degree of GC bias. The two cases represent a negative and a positive GC bias, respectively.
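The slope definition in the Figure 1 caption can be illustrated with a short sketch. It is not the authors' code: it assumes GC content is expressed as a fraction per fixed-size, non-overlapping window and that per-base read depth is already available. The function names and the `window` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def window_stats(genome: str, depth: Sequence[float],
                 window: int) -> Tuple[List[float], List[float]]:
    """Per-window GC fraction and read coverage normalized to the mean depth.

    Assumption: non-overlapping windows; a trailing partial window is skipped.
    """
    mean_depth = sum(depth) / len(depth)
    gc, cov = [], []
    for start in range(0, len(genome) - window + 1, window):
        gc.append(gc_fraction(genome[start:start + window]))
        cov.append(sum(depth[start:start + window]) / window / mean_depth)
    return gc, cov


def fit_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least-squares slope of y on x (the 'degree of GC bias')."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


if __name__ == "__main__":
    # Toy genome: coverage rises with GC content, so the slope is positive.
    toy_genome = "AAAAGGGGAAGG"
    toy_depth = [1] * 4 + [3] * 4 + [2] * 4
    gc, cov = window_stats(toy_genome, toy_depth, window=4)
    print(fit_slope(gc, cov))
```

A positive slope means GC-rich windows are over-covered and a negative slope means they are under-covered. The magnitude depends on the units chosen for GC content (fraction versus percent), which this sketch assumes rather than takes from the paper.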

**Figure 2. Correlation between the degree of GC bias and two statistics of GC contents.**
No correlation can be observed between the degree of GC bias (y-axis) and either the mean GC content (A) or the standard deviation of GC contents (B).

**Figure 3. Scatter plots of GC content and read coverage of the simulated data.**
From the *S. aureus* USA300 genome, we simulated reads of 50X coverage at three degrees of GC bias: negative slope −3.83 (A), slope zero (B), and positive slope 3.72 (C).
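To make the link between slope and local coverage concrete, the following sketch evaluates a simple linear model: expected depth in a region equals the mean depth scaled by `1 + slope * (gc - mean_gc)`, clipped at zero. This model is an assumption chosen to match the straight-line fit described for Figure 1, not the authors' read simulator, and the GC values in the example are hypothetical. Only the slope of 3.72 comes from the caption.

```python
def expected_depth(gc: float, mean_gc: float, slope: float,
                   mean_depth: float) -> float:
    """Expected read depth of a region under an assumed linear GC-bias model.

    relative coverage = 1 + slope * (gc - mean_gc), clipped at zero so that
    extreme regions get no reads rather than a negative depth.
    """
    relative = 1.0 + slope * (gc - mean_gc)
    return max(0.0, relative) * mean_depth


if __name__ == "__main__":
    # Hypothetical region with GC fraction 0.20 in a genome with mean GC 0.33.
    for total in (50, 100):
        local = expected_depth(gc=0.20, mean_gc=0.33, slope=3.72,
                               mean_depth=total)
        print(f"{total}X overall gives about {local:.1f}X in the GC-poor region")
```

Under this model a positive bias roughly halves the depth in the hypothetical GC-poor region, and doubling the total data restores it to about the unbiased 50X level. That is consistent with the abstract's statement that more data rescues fragmentation, although the exact amount needed depends on the genome's GC distribution.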

**Figure 4. Completeness of assemblies of three bacterial genomes by eight assemblers, each treating nine data sets.**
The nine data sets at various degrees of GC bias (shown in different colors) are simulated from the genomes of three bacteria: *E. coli* (A), *S. aureus* (B), and *M. tuberculosis* (C). Assembly completeness is measured by the N50 length of the contigs after error corrections. The left and right columns show the results of assembly using simulated data of 50X and 100X coverage, respectively. Note that the Velvet-SC assembly of the *M. tuberculosis* genome failed for no apparent reason. At 50X coverage, strong GC bias leads to a more fragmented assembly in all cases. Such performance drops can be rescued by increasing the amount of data to 100X coverage.
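For readers unfamiliar with the N50 metric used in Figure 4, here is a minimal sketch of the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled length. It computes a plain N50 from a list of contig lengths. The error-correction step that produces the paper's "corrected" contigs is not shown, and the function name is hypothetical.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # A fragmented assembly has a lower N50 than a contiguous one
    # of the same total length.
    print(n50([2, 3, 4, 5, 6]), n50([20]))
```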

**Figure 5. Completeness of the E. coli assemblies using data of various coverage.**
Assembly completeness is measured by N50 length of the corrected contigs, which are output by eight assemblers when treating simulated reads of various coverage (50X, 100X, 250X, 500X, 1000X, and 2000X) at a zero (blue line) and a strong positive GC bias (slope 3.6, pink line).

**Figure 6. Percentage of unaligned reference sequences.**
The results are from the assemblies of three bacterial genomes: *E. coli* (A), *S. aureus* (B), and *M. tuberculosis* (C). Each of the eight assemblers treats data at a strong negative, zero, and strong positive GC bias.

**Figure 7. Read coverage and mis-assemblies on the S. aureus genome.**
Read coverage (blue curves) and mis-assemblies (colored regions in the bottom bar) in a region of the *S. aureus* genome at a strong negative (A), zero (B), and strong positive (C) GC bias. Different colors represent different types of mis-assemblies: tandem repeat error (green), translocation error (purple), and unaligned reference sequence (red). The colors are projected onto the curve of read coverage. The down-triangles in the coverage curves denote single-base insertions.

**Figure 8. Distributions of coverage depths at all bases and at error bases.**
Distributions of coverage depths at error bases (red curves) are compared with those at all bases (blue curves) in the Velvet assemblies of three bacterial genomes: *E. coli* (A), *S. aureus* (B), and *M. tuberculosis* (C), using data simulated at a strong negative (left column), zero (middle column), and strong positive (right column) GC bias.

**Figure 9. Ratio of corrected N50 length at a strong GC bias to that at no GC bias.**
Ratio of the corrected N50 length at a strong negative GC bias (A) and a strong positive GC bias (B) to that at no GC bias when assembling the data of five species (in different colors) using eight assemblers.

**Figure 10. Distribution of GC contents and read coverage of the five species under study.**
The red curves stand for GC contents (scale in top axis). The blue and yellow curves represent read coverage at a strong positive and strong negative GC bias, respectively (scale in bottom axis). We used the data at 100X coverage for the five species.

**Figure 11. Scatter plots of GC content and read coverage of data simulated with various degrees of background fluctuations.**
The data are simulated from the *E. coli* genome at three degrees of background fluctuations: zero (top row), 10 (middle row), and 20 (bottom row). At each degree of background fluctuation, we simulated PE reads at a strong negative (A), zero (B), and strong positive (C) GC bias, respectively.

**Figure 12. Corrected N50 length of assemblies at three background fluctuations.**
We show the corrected N50 length in eight assemblies of three bacterial genomes: *E. coli* (A), *S. aureus* (B), and *M. tuberculosis* (C), using simulated data at three degrees of background fluctuations (x-axis), each at three degrees of GC bias: negative (yellow), zero (dark blue), and positive (pink).

**Figure 13. Estimation of the degree of GC bias using reference sequences and assembled contigs.**
We show the scatter plots of GC content and read coverage for the *P. fluorescens* Pf0-1 Illumina library (DRR001171) based on the known reference genome (A) and on the contigs assembled by Edena (B), which contain 6,610,650 bases and have an N50 length of 8,257.

**Figure 14. Correlation between the degree of GC bias obtained using reference sequences and assembled contigs.**
The correlation is calculated for thirteen Illumina data sets: eight assembled by Edena, four by Velvet, and one by ABySS. The high R² value (0.88) indicates that estimating the degree of GC bias using the assembled contigs is appropriate.

Cited by

An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies.
Rádai Z, Váradi A, Takács P, Nagy NA, Schmitt N, Prépost E, Kardos G, Laczkó L. BMC Genomics. 2024 Jan 9;25(1):45. doi: 10.1186/s12864-023-09910-4. PMID: 38195441. Free PMC article.
Population structure of mitochondrial genomes in Saccharomyces cerevisiae.
Wolters JF, Chiu K, Fiumera HL. BMC Genomics. 2015 Jun 11;16(1):451. doi: 10.1186/s12864-015-1664-4. PMID: 26062918. Free PMC article.
Modeling of shotgun sequencing of DNA plasmids using experimental and theoretical approaches.
Shityakov S, Bencurova E, Förster C, Dandekar T. BMC Bioinformatics. 2020 Apr 3;21(1):132. doi: 10.1186/s12859-020-3461-6. PMID: 32245400. Free PMC article.
Unveiling lignocellulolytic trait of a goat omasum inhabitant Klebsiella variicola strain HSTU-AAM51 in light of biochemical and genome analyses.
Md Abdullah-Al-Mamun, Hossain MS, Debnath GC, Sultana S, Rahman A, Hasan Z, Das SR, Ashik MA, Prodhan MY, Aktar S, Cho KM, Haque MA. Braz J Microbiol. 2022 Mar;53(1):99-130. doi: 10.1007/s42770-021-00660-7. Epub 2022 Jan 28. PMID: 35088248. Free PMC article.
The large genome size variation in the Hesperis clade was shaped by the prevalent proliferation of DNA repeats and rarer genome downsizing.
Hloušková P, Mandáková T, Pouch M, Trávníček P, Lysak MA. Ann Bot. 2019 Aug 2;124(1):103-120. doi: 10.1093/aob/mcz036. PMID: 31220201. Free PMC article.

Grants and funding

This study was funded by International Research-Intensive Centers of Excellence (I-RICE; https://irice.stpi.narl.org.tw/index_en.jsp) of NSC in Taiwan (NSC 103-2911-I-006-301). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
