Accurate assembly of full-length consensus for viral quasispecies
BMC Bioinformatics volume 26, Article number: 36 (2025)
Abstract
Background
Viruses can inhabit their hosts as an ensemble of mutant strains. Reconstructing a robust consensus representation for these diverse strains is essential for recognizing the genetic variation among them and for studying virulence, pathogenesis, and therapy selection. Virus genomes are typically small, often composed of only a few thousand to several hundred thousand nucleotides. While constructing a high-quality consensus of virus strains might therefore seem feasible, most current assemblers generate only fragmented contigs. Assembling a single full-length consensus contig matters because it is vital for identifying genetic diversity and estimating strain abundance accurately.
Results
In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmark FC-Virus against state-of-the-art genome assemblers.
Conclusion
Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers only manage to produce fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git.
Introduction
Viruses replicate their genetic material rapidly within host cells, resulting in a high mutation rate. As they undergo multiple replication cycles, different mutant strains emerge, each with genomes that are closely related but slightly distinct [1]. This group of genetically related mutant strains is known as a viral quasispecies. The genetic variations among mutant strains primarily manifest as Single Nucleotide Polymorphisms (SNPs) and Indels (insertions or deletions) [2]. These genetic differences significantly influence the varying degrees of virulence, transmissibility, and drug resistance observed among different viral strains. Research has shown that viruses often exist within their hosts as viral quasispecies. Reconstructing a full-length consensus for a viral quasispecies can yield a high-quality reference genome. This helps in characterizing viral strains, enhancing our understanding of the virus’s pathogenesis, and supporting vaccine development.
Advancements in high-throughput sequencing (HTS) have made it possible to reconstruct full-length genomes [3]. HTS technology can generate numerous short fragments (known as reads) from various mutant virus strains. In principle, these short reads should allow for the reconstruction of a full-length consensus genome for virus populations, as most viral genomes are typically just a few thousand to several hundred thousand nucleotides long. However, current assemblers struggle with this task due to sequencing biases and errors, varying levels of strain abundance, repetitive segments, and substantial discrepancies in length between the reads and the actual genome.
Numerous well-established generic assembly algorithms are recognized for their efficacy in genome assembly across diverse sequencing datasets and applications. These algorithms commonly use graph models, such as overlap graphs and de Bruijn graphs, and reconstruct the genome by identifying a path within the graph. Assemblers such as Celera [4], Newbler [5], Arachne [6], CAP3 [7], TIGR [8], PCAP [9], AMOS [10], Phrap [11], and Phusion [12] were developed using the overlap graph as their foundational model [13]. Each vertex in an overlap graph represents a read sequence, and an edge connects two vertices if the suffix of one read matches the prefix of the other. Assemblers like SPAdes [14], ALLPATHS [15], ABySS [16], Velvet [17], SOAPdenovo2 [18], IDBA [19], EULER-USR [20], SKESA [21], and Ray [22] model the genome assembly problem by seeking an Eulerian path within a de Bruijn graph. The de Bruijn graph treats each k-mer (a substring of length k from a read) as a vertex and connects two vertices when the suffix of length k−1 of one k-mer matches the prefix of length k−1 of the other. Compared to the overlap graph, the de Bruijn graph generally requires fewer computational resources for construction and storage. However, it involves a larger number of vertices and fails to capture longer overlaps between reads, which increases the graph’s complexity and makes path extraction more challenging.
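To make the de Bruijn model concrete, the following is a minimal Python sketch that builds a graph with k-mers as vertices from two toy reads. The function names (`kmers`, `build_de_bruijn`) and the reads are illustrative assumptions and are not taken from any of the assemblers cited above.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return adjacency sets: u -> {v} when the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    nodes = {km for read in reads for km in kmers(read, k)}
    # Index vertices by their (k-1)-prefix so each successor lookup is a dict access.
    by_prefix = defaultdict(set)
    for km in nodes:
        by_prefix[km[:k - 1]].add(km)
    return {km: set(by_prefix.get(km[1:], ())) for km in nodes}


# Toy input (hypothetical reads, not from the paper's datasets).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Even in this tiny example the two reads collapse into four shared vertices, which illustrates why the de Bruijn representation is compact but loses the information about which read each k-mer came from.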
However, these assemblers often struggle with viral genome assembly because they are not specifically designed for reconstructing viral genomes [23]. Due to the complexity and large size of genomes, many assemblers focus on constructing relatively unambiguous subsequences, or contigs. Consequently, even with smaller viral genomes, these assemblers often struggle to produce a full-length consensus sequence, leading to fragmented contigs. These fragmented contigs represent partial sequences from different strains but do not together form a full-length genome of any single strain or match the combined length of all strain genomes. This limitation indicates that fragmented contigs are insufficient as a comprehensive reference for all strains and fail to accurately represent the genome of each individual strain.
The assembly of viral quasispecies involves two main tasks: consensus assembly and strain-level assembly. Consensus assembly focuses on creating a reference genome that closely represents all the strains, capturing a high level of similarity across them. In contrast, strain-level assembly aims to individually reconstruct the genome of each specific strain. VICUNA is designed for reconstructing a robust consensus from ultra-deep sequencing data [24]. It employs an overlap-layout-consensus based assembly algorithm to generate a consensus sequence through the following steps: read trimming, contig construction and clustering, contig validation, and contig extension and merging. VICUNA is among the few tools designed specifically for assembling highly diverse viral populations. Algorithms such as SAVAGE [25], viaDBG [26], PEHaplo [27], Virus-VG [28], VG-Flow [29], and VStrains [23] are designed for assembling strain-level genomes. Due to the high mutation rates and varying expression levels among strains, these algorithms often produce fragmented contigs rather than complete viral strains, or they combine contigs and fail to capture the differences between strains.
In this paper, we introduce FC-Virus, a novel de novo assembly method designed to reconstruct a full-length consensus sequence from viral quasispecies. The key contributions of this study are as follows:
- (i) Proposal of a concept and methodology for identifying homologous k-mers, which are k-mers shared across multiple strains.
- (ii) Development of a de novo assembly strategy that uses homologous k-mers as a backbone, enabling the reconstruction of a full-length consensus sequence.
- (iii) Evaluation of the FC-Virus algorithm against state-of-the-art assemblers using both simulated and real datasets. Experimental results show that FC-Virus produces a single consensus sequence that matches or even surpasses the results of multiple fragmented contigs from other assemblers.
Method
FC-Virus takes high-throughput reads as input and aims to reconstruct a full-length consensus sequence that maximally incorporates the characteristics of all strains [30]. FC-Virus first extracts k-mers from the reads and identifies homologous k-mers according to k-mer abundance. It regards any read containing at least two homologous k-mers as a homologous read, and then glues homologous reads together to form a consensus sequence. Finally, FC-Virus extends and refines the consensus using a greedy strategy. The general workflow of the FC-Virus algorithm is illustrated in Fig. 1. Details of each step are explained in the following sections.
Identification of homologous k-mers
The genomes of different strains of the same virus are highly similar, so many k-mers are shared among most strains. We refer to these k-mers as homologous k-mers. FC-Virus counts the occurrences of k-mers (by default, \(k=25\)) at each abundance level and plots a frequency distribution graph. According to the definition of homologous k-mers, they are expected to cluster around the final peak in the abundance graph when longer repeated sequences are absent from the strain genome. In cases where peaks are poorly defined or very few homologous k-mers are identified, we select the top \(x\%\) of k-mers by frequency, with x typically set to 30 as an empirical value. Using this strategy, homologous k-mers may not be present in every strain but are likely to be common across most strains. With these k-mers as anchors, a consensus sequence can be assembled.
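The counting step and the top-x% fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, "top x% by frequency" is interpreted as the top x% of distinct k-mers ranked by abundance, and peak detection is left out because the text does not specify how peaks are located.

```python
from collections import Counter


def count_kmers(reads, k=25):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers having it."""
    return Counter(counts.values())


def top_fraction(counts, x=30.0):
    """Fallback: keep the top x% of distinct k-mers ranked by abundance (assumed reading)."""
    n_keep = max(1, int(len(counts) * x / 100))
    # Sort by descending abundance, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {kmer for kmer, _ in ranked[:n_keep]}


# Toy reads (hypothetical) that differ only in their last base; k is small for readability.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGC"]
counts = count_kmers(reads, k=4)
histogram = abundance_histogram(counts)
anchors = top_fraction(counts, x=30.0)
```

In the toy data the k-mers shared by all three reads end up at the high-abundance end of the histogram, while the strain-specific ones (`ACGA`, `ACGC`) sit at abundance 1, which mirrors the intuition behind choosing the final peak.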
Fig. 1 The workflow of the FC-Virus algorithm. a Homologous k-mer extraction. FC-Virus extracts k-mers from reads and plots their frequency distribution. It then builds a de Bruijn graph from the k-mers in the last peak interval. If this graph lacks a large connected component, the k-mers are classified as homologous. Otherwise, it evaluates the preceding peaks. b Consensus assembly. FC-Virus stitches together reads using homologous k-mers and determines the base composition of divergent regions through a voting process, resulting in a preliminary consensus. c Consensus refinement. FC-Virus extends the consensus using previously unused k-mers through a greedy strategy
To mitigate the influence of repeated sequences, FC-Virus builds a de Bruijn graph using the k-mers from the final peak range. If this graph contains a large connected component (i.e., the majority of k-mers within the peak interval can be linked together), the last peak results from repeats rather than from true genomic similarities. In this case, FC-Virus iteratively checks the adjacent peaks to the left until it encounters one caused by genomic similarities. This strategy stems from the observation that subsequences shared among strains are scattered throughout the genome and rarely aggregate in a single location. Consequently, the de Bruijn graph constructed from homologous k-mers is often highly fragmented and unlikely to contain a large connected component.
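The repeat check can be illustrated with a small sketch that measures how much of a candidate k-mer set falls into one connected component of its de Bruijn graph. The function names and the 0.5 cut-off for "majority" are assumptions made for this example; the text does not state the exact threshold used by FC-Virus.

```python
from collections import defaultdict


def largest_component_fraction(kmer_set):
    """Fraction of k-mers that lie in the largest connected component of their de Bruijn graph."""
    kmer_list = list(kmer_set)
    if not kmer_list:
        return 0.0
    k = len(kmer_list[0])
    by_prefix = defaultdict(list)
    for kmer in kmer_list:
        by_prefix[kmer[:k - 1]].append(kmer)
    # Undirected adjacency: u and v are linked when suffix(u, k-1) == prefix(v, k-1).
    adjacency = defaultdict(set)
    for u in kmer_list:
        for v in by_prefix.get(u[1:], ()):
            if u != v:
                adjacency[u].add(v)
                adjacency[v].add(u)
    seen = set()
    best = 0
    for start in kmer_list:
        if start in seen:
            continue
        # Iterative depth-first search over one component.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best / len(kmer_list)


def looks_like_repeat(kmer_set, threshold=0.5):
    """Assumed rule: a peak is repeat-driven if most of its k-mers chain together."""
    return largest_component_fraction(kmer_set) > threshold


chained = {"ACGT", "CGTA", "GTAC", "TACG"}    # consecutive k-mers from one locus
scattered = {"AAAA", "CCGG", "GTGT", "TCTC"}  # k-mers with no (k-1)-overlaps
```

Here `chained` forms one component and would be treated as a repeat-driven peak, whereas `scattered` stays fragmented, which is the pattern the text expects from genuinely homologous k-mers.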
Suppose there are a total of m reads, each of length L, which together generate n k-mers. Then we have \(n \le \min \{m(L-k+1), 4^k\}\). The space required to store these reads and k-mers is \(O(m+n)\). Since L and k are constants and \(n \le m(L-k+1)\), and since the number of homologous k-mers is less than the total number of k-mers, the space complexity simplifies to O(m). Extracting the k-mers takes \(O(m(L-k+1))\) time, and identifying homologous k-mers requires scanning all k-mers, which takes O(n) time. As a result, the overall time complexity is O(m).
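As a worked instance of this bound, take the default \(k=25\) and, as an assumed example value, a read length of \(L=250\) (the read length of the HIV-LABMIX dataset); any other constant L behaves the same way:

\[
L - k + 1 = 250 - 25 + 1 = 226, \qquad 4^{k} = 4^{25} = 2^{50} \approx 1.13 \times 10^{15},
\]
\[
n \le \min\{226\,m,\ 4^{25}\} = 226\,m \quad \text{whenever } m < 4^{25}/226 \approx 5 \times 10^{12}.
\]

For any realistic number of reads the first term is therefore the binding one, so n grows at most linearly in m with a constant factor of 226, which is what the simplification to O(m) relies on.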
Consensus assembly
FC-Virus regards reads containing at least two homologous k-mers as homologous reads, and it reconstructs a consensus sequence through the following steps.
Fig. 2 An example of determining the sequence composition between neighboring homologous k-mers. FC-Virus distinguishes between two scenarios based on whether the homologous k-mers overlap
Step 1: Let R denote the set of all homologous reads. FC-Virus first selects from R the homologous read that contains the highest number of homologous k-mers as the seed read r \((r \in R)\). It then uses Steps 2 and 3 to refine the sequence composition between the first and the last homologous k-mer in the seed read r. This refined sequence is regarded as the initial consensus sequence. The consensus sequence is subsequently extended using Step 4 and refined iteratively with Steps 2 and 3 until no unused homologous reads remain.
Step 2: FC-Virus scans the homologous k-mers in the seed read r from left to right, looking for homologous reads that contain two adjacent homologous k-mers from r simultaneously, and aggregates these reads together. For two adjacent k-mers \(k_{i}\) and \(k_{i+1}\) in the seed read r, FC-Virus extracts a subset \(R_{i}\) from R. Each read \(r_{j}^{i}\) \((r_{j}^{i} \in R_{i})\) in this subset contains both homologous k-mers \(k_{i}\) and \(k_{i+1}\).
Step 3: FC-Virus determines the sequence composition between adjacent homologous k-mers \(k_{i}\) and \(k_{i+1}\) based on the read set \(R_{i}\). It distinguishes two scenarios (as shown in Fig. 2). In the first case, there is an overlap between homologous k-mers \(k_{i}\) and \(k_{i+1}\), while in the second case, there is no overlap. If an overlap exists between \(k_{i}\) and \(k_{i+1}\), FC-Virus integrates them seamlessly into the consensus sequence. For example, consider the seed read CACTTGCCGTACGGT, with CTTGC and TGCCG as two neighboring homologous k-mers. In this case, the consensus sequence between these k-mers would be CTTGCCG. When there is no overlap between \(k_{i}\) and \(k_{i+1}\), FC-Virus selects the most frequently occurring sequence composition from \(R_{i}\). For instance, suppose the seed read is CACTTGCCGTACGGT, ACTTG and TACGG are two adjacent k-mers \(k_{i}\) and \(k_{i+1}\), and the set \(R_{i}\) consists of 3 reads: CACTTGCCGTACGGT, ACTTGCCATACGGTC, and GACTTGCCGTACGGA. In this case, there are two types of sequences between \(k_{i}\) and \(k_{i+1}\): ACTTGCCGTACGG and ACTTGCCATACGG, supported by 2 reads and 1 read, respectively. FC-Virus would choose the sequence ACTTGCCGTACGG, which is supported by the most reads, as part of the consensus in the region between \(k_{i}\) and \(k_{i+1}\).
Step 4: FC-Virus extends the consensus sequence by selecting an unused homologous read that contains the last two homologous k-mers \(k_{n-1}\) and \(k_{n}\) of the seed r. If no such read is available, FC-Virus relaxes the criteria, requiring only that the new seed contain \(k_{n}\). Additionally, FC-Virus mandates that the new seed must contain at least one unused homologous k-mer to the right of \(k_{n}\). The seed read r is updated to this newly selected read, and the sequence of the new seed r is then refined using Steps 2 and 3.
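The two scenarios of the third step can be reproduced with the example sequences given above. The sketch below is a simplified illustration: the helper names `span` and `vote_between` are hypothetical, each k-mer is assumed to occur once per read (the first occurrence is used), and ties are resolved by first appearance, none of which is specified in the text.

```python
from collections import Counter


def span(read, left, right):
    """Return the read segment from the start of `left` to the end of `right`, or None."""
    start = read.find(left)
    if start == -1:
        return None
    pos = read.find(right, start + 1)
    if pos == -1:
        return None
    return read[start:pos + len(right)]


def vote_between(reads, left, right):
    """Pick the segment spanning left..right that is supported by the most reads."""
    votes = Counter()
    for read in reads:
        segment = span(read, left, right)
        if segment is not None:
            votes[segment] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


seed = "CACTTGCCGTACGGT"

# Case 1: the two k-mers overlap in the seed, so the seed itself fixes the sequence.
overlapping = span(seed, "CTTGC", "TGCCG")

# Case 2: no overlap, so the reads in R_i vote on the region in between.
r_i = ["CACTTGCCGTACGGT", "ACTTGCCATACGGTC", "GACTTGCCGTACGGA"]
voted = vote_between(r_i, "ACTTG", "TACGG")

print(overlapping)  # CTTGCCG
print(voted)        # ACTTGCCGTACGG
```

The voting case returns the variant supported by two of the three reads, matching the outcome described in the text.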
Following the steps described above, FC-Virus can generate a consensus sequence in O(|R|L) time and memory, where R represents the set of homologous reads and L is the length of the reads. However, this consensus lacks information to the left of the first k-mer and to the right of the last k-mer. The non-uniformity of current sequencing technologies might result in homologous k-mers that do not cover the entire genome. Additionally, long repetitive sequences, which exceed the read length, may be represented only once by this strategy. To address these limitations, FC-Virus refines the consensus using the strategies outlined in the following section.
Consensus refinement
FC-Virus aligns reads to the consensus using homologous k-mers as anchor points and calculates the average depth X of the consensus based on these reads. Following the alignment positions of reads on the consensus, FC-Virus condenses k-mers located at the same position on the consensus into a block. Algorithm 1 is then used to classify all k-mers in this block as either used or unused, and their abundances are updated accordingly. As described in Algorithm 1, FC-Virus adjusts the abundance of k-mers within block H based on the consensus depth X. Note that block H consists of four k-mers: one that matches the consensus sequence and three others that differ from it only by the last base. The abundance of block H is defined as the sum of the abundances of its four k-mers. If this total abundance significantly exceeds X, some k-mers in H will be labeled as unused, even if they are part of the consensus. This process aims to identify and label k-mers that are likely part of repetitive segments as unused.
FC-Virus extends the consensus at both ends using unused k-mers. It starts with a k-mer from the endpoint of the current consensus as the seed k-mer and groups all unused k-mers that meet the connectivity condition into a block H. The connectivity condition requires that the prefix of length k−1 of one k-mer matches exactly the suffix of length k−1 of another. FC-Virus then selects the unused k-mer with the highest abundance from block H to extend the consensus. This k-mer becomes the new seed, and the abundance and usage status of k-mers in block H are updated using Algorithm 1. This process is repeated iteratively until no more extendable k-mers are found (see Figure S1 for more details). The time complexity of this process is O(n), where n represents the number of k-mers.
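The rightward half of this greedy extension can be sketched as below. This is a deliberately simplified illustration, not the FC-Virus implementation: the function name and toy abundances are hypothetical, and the abundance update of Algorithm 1 is replaced by simply marking the chosen k-mer as used, because Algorithm 1 is not reproduced in this text.

```python
def extend_right(consensus, abundance, used, k):
    """Greedily append bases while some unused k-mer continues the consensus.

    abundance: dict mapping k-mer -> count
    used: set of k-mers already consumed (updated in place)
    Assumes k >= 2 and len(consensus) >= k - 1.
    """
    while True:
        suffix = consensus[-(k - 1):]
        # Block H: unused k-mers whose (k-1)-prefix equals the current (k-1)-suffix.
        block = [suffix + base for base in "ACGT"
                 if suffix + base in abundance and suffix + base not in used]
        if not block:
            return consensus
        best = max(block, key=lambda kmer: abundance[kmer])
        used.add(best)          # simplification standing in for Algorithm 1
        consensus += best[-1]   # the chosen k-mer becomes the new seed


# Toy example (hypothetical abundances).
abundance = {"CGTA": 5, "CGTC": 2, "GTAC": 4, "TACG": 3}
used = set()
extended = extend_right("ACGT", abundance, used, k=4)
print(extended)  # ACGTACG
```

Because every accepted k-mer is marked as used, the loop terminates after at most n iterations, which is consistent with the O(n) bound stated above. Leftward extension is symmetric, matching the (k−1)-prefix of the consensus against the (k−1)-suffix of candidate k-mers.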
Experimental setup
We conducted benchmarking experiments to compare FC-Virus with traditional genome assembly methods, including SOAPdenovo2, SPAdes, and IDBA, as well as algorithms designed for viral quasispecies assembly, such as viaDBG, VG-Flow, and VStrains. In addition to benchmarking against these widely used algorithms, we also conducted comparisons against the actual genome sequence of each strain. We used the genome of one strain as a consensus and compared it with the genomes of the other strains. This baseline is referred to as ‘Reference’. We also tried to benchmark the VICUNA algorithm, but its download link was unavailable.
Dataset information
We benchmarked FC-Virus against other algorithms on the following datasets:
Widely used simulated datasets. We collected four simulated datasets from [25] consisting of 5 HIV, 6 Polio, 10 HCV, and 15 ZIKV mixed strains, respectively. The genome length ranges for the HIV, Polio, HCV, and ZIKV datasets are \(9478 \sim 9719\), \(7428 \sim 7460\), \(9273 \sim 9311\), and \(10,251 \sim 10,269\) bp, respectively. These datasets can be downloaded at https://bitbucket.org/jbaaijens/savage-benchmarks.
Newly simulated datasets for SARS-CoV-2 (COVID-19). To investigate how sequencing depth affects assembler performance, we used SimSeq [31] to generate 11 simulated COVID-19 datasets with sequencing depths ranging from 50X to 20,000X. Note that each simulated COVID-19 dataset comprises three strains downloaded from the NCBI database with accession numbers ON944270.1, ON286803.1, and OL519143.1. The genome lengths of these strains range from 29,675 to 29,853 bp. These simulated datasets are available at https://bitbucket.org/fc-virus-benchmark.
Real datasets. We collected a real Illumina MiSeq dataset from a lab mix of five HIV strains, named HIV-LABMIX. The reads in this dataset are 250 bp in length, and the dataset is available at https://github.com/cbg-ethz/5-virus-mix.
Baselines and evaluation metrics
To evaluate the degree of sequence fragmentation, we analyzed the length distribution of all contigs produced by each algorithm. We also assessed the assembly results using universal metrics such as genome fraction, N50, NGA50, duplication ratio, largest alignment, number of contigs, N’s per 100 kbp, mismatches per 100 kbp, and indels per 100 kbp, as calculated by QUAST [32]. Note that we use the term errors per 100 kbp to represent the sum of N’s, mismatches, and indels per 100 kbp. Additionally, we aligned reads back to the contigs and evaluated the assembly quality in terms of read mapping rate.
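Two of these metrics are simple enough to illustrate directly. The sketch below computes N50 from a list of contig lengths using its standard definition and forms the combined error figure as the sum defined above. It is an independent illustration with hypothetical contig lengths, not QUAST's code, and it does not cover alignment-based metrics such as NGA50.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def errors_per_100kbp(ns, mismatches, indels):
    """Sum of N's, mismatches, and indels, each already expressed per 100 kbp."""
    return ns + mismatches + indels


# Hypothetical assemblies with the same total length (9900 bp).
single_long = [9000, 500, 400]
fragmented = [3000, 2500, 2000, 1500, 900]

print(n50(single_long))  # 9000
print(n50(fragmented))   # 2500
```

The two toy assemblies have identical total length, yet the fragmented one has a much smaller N50, which is why N50 is a useful companion to the contig length distribution when judging fragmentation.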
Results
Evaluation of fragmentation degree
We compared the length distribution of contigs (longer than 350 bp) produced by the compared assemblers across four widely used simulated datasets. As shown in Fig. 3, FC-Virus consistently produces exactly one long contig whose length closely matches that of the viral genome. In contrast, other assemblers generate numerous short contigs, with IDBA and SOAPdenovo2 typically producing contigs shorter than the viral genome. Although SPAdes, VG-Flow, viaDBG, and VStrains do generate some longer contigs, they also produce a substantial number of short ones. We attribute this result to their algorithmic strategies. SPAdes and other existing assembly algorithms usually construct a graph based on the overlap between reads or k-mers. They then extract paths from the graph to form contigs. In this study, the reads are sequenced from multiple strains of the same virus, and the differences between strain genomes make the graph complex, thereby increasing the difficulty of subsequent path extraction. Therefore, the compared algorithms generate many short contigs. These contigs typically represent different strains but fail to cover the full range of strains (see the next section for more details). FC-Virus, on the other hand, integrates strain variations into its consensus assembly and refinement processes. In regions shared by multiple strains, FC-Virus constructs a consensus sequence that closely matches most strains. In contrast, other assemblers might generate multiple contigs that correspond to strains with higher abundance.
Fig. 3 The length distribution of contigs (longer than 350 bp) produced by assemblers on simulated datasets of HIV, POLIO, HCV, and ZIKV
Benchmark using universal criteria
We used QUAST, a tool for assessing the quality of genome assemblies, to evaluate how well the contigs assembled by each algorithm align with the reference viral strains. For each dataset, we used each strain’s genome in turn as the ground-truth reference genome and then averaged the results across all evaluations. Table 1 presents the average results for each algorithm across the HIV, POLIO, ZIKV, COVID-19, and HIV-LABMIX datasets. The sequencing depth of all these datasets is 20,000X. Table 1 shows an interesting result: FC-Virus achieves significantly fewer errors per 100 kbp than the Reference, which uses a strain’s genome as the consensus. This highlights the importance of consensus assembly, suggesting that the consensus generated by FC-Virus serves as a more accurate reference genome than an individual strain’s genome.
As presented in Table 1, traditional assembly algorithms, with the exception of SPAdes, tend to perform poorly in virus genome assembly. The contigs assembled by IDBA and SOAPdenovo2 are generally quite short, with numerous contigs less than 350 bp in length. Since we only focus on contigs with a length greater than 350 bp, the advantages of these two algorithms aren’t particularly noticeable. In some datasets, it is even challenging to compute their NGA50 values, where NGA50 is the contig length at which the aligned contigs cumulatively cover half of the reference genome. If the contigs are too short or fragmented, they may not cover enough of the reference genome, making NGA50 calculation impossible. Compared to FC-Virus, SPAdes generally performs slightly better in genome coverage. However, FC-Virus outperforms SPAdes on evaluation criteria like duplication ratio, errors per 100 kbp, NGA50, N50, largest alignment, and the number of contigs. Strain-level assembly algorithms such as viaDBG, VG-Flow, and VStrains generally excel in genome coverage, largest alignment length, N50, and NGA50. However, they struggle with high duplication ratios and errors per 100 kbp. These algorithms often produce a greater number of contigs than the number of strains, leading to excessive redundancy. In some datasets, their duplication ratio far exceeds the number of strains present.
Fig. 4 The errors per 100 kbp in the longest contig assembled by SPAdes and FC-Virus on simulated datasets of HIV, POLIO, HCV, and ZIKV
We observed that the total length of contigs assembled by SPAdes is significantly greater than the length of the consensus generated by FC-Virus. Although SPAdes and FC-Virus achieve a similar genome fraction, SPAdes exhibits a higher duplication ratio. This raised the question of whether the high duplication ratio indicates that SPAdes successfully assembled all viral strains. According to the experimental results from [23,29] and our Appendix Table S17, SPAdes did not assemble all viral strains; it only managed to assemble a portion of them. While this is a positive outcome, it doesn’t entirely align with our goal of assembling a single reference sequence. We then extracted the longest contig assembled by SPAdes and compared it with FC-Virus. We found that the genome fraction, largest alignment, N50, NGA50, and duplication ratio achieved by SPAdes (with only the longest contig) are similar to those of FC-Virus. However, SPAdes exhibited a higher error rate than FC-Virus (Fig. 4).
Assessment of reads re-mapping rate
We aligned reads to contigs assembled by each algorithm and calculated the percentage of reads with both ends matching to the assembled contigs. Figure 5 presents the percentage of reads that align to contigs at both ends across the simulated datasets of HIV, POLIO, HCV, and ZIKV.
Fig. 5 Performance of assemblers in terms of the percentage of reads with both ends aligned to assembled contigs
As shown in Fig. 5, FC-Virus, VG-Flow, and VStrains all demonstrate strong performance in read re-mapping rates, with average values of 99.15%, 98.54%, and 97.73%, respectively. However, it’s worth noting that VG-Flow and VStrains generate a greater number of contigs with a larger total length, which naturally results in higher read re-mapping rates. In contrast, FC-Virus achieves similar or even better performance with just a single contig. Even in its worst case, the HCV dataset, FC-Virus reaches a re-mapping rate of 98.08%. This suggests that FC-Virus’s consensus covers nearly all the reads and can effectively serve as a reference genome.
Investigation of the impact of sequencing depth
The sequencing depth of datasets used in existing studies related to the assembly of viral strains is usually very high, at 20,000X. We’re interested in exploring the impact of sequencing depth on the performance of assemblers. We evaluated FC-Virus along with other algorithms on 11 COVID-19 datasets with sequencing depths ranging from 50X to 20,000X. To minimize interference from other factors, we maintained a consistent error rate (less than \(1\%\)) across all simulated datasets. The results show that all four algorithms perform well in terms of genome coverage and read re-mapping rate. However, they differ in contig count, duplication ratio, errors per 100 kbp, and N50 (see Fig. 6).
Fig. 6 Impact of sequencing depth on assembler performance. a Impact of sequencing depth on the number of assembled contigs. b Impact of sequencing depth on the errors per 100 kbp. c Impact of sequencing depth on the duplication ratio. d Impact of sequencing depth on the N50
As shown in Fig. 6, sequencing depth has relatively little impact on IDBA, SPAdes, VStrains, and FC-Virus, whereas SOAPdenovo2, VG-Flow, and viaDBG are noticeably affected. Note that viaDBG, VStrains, and VG-Flow fail to produce results for some datasets. FC-Virus remains competitive across the evaluation criteria, which is consistent with the previous results. SPAdes, viaDBG, VG-Flow, and VStrains perform similarly to FC-Virus across most evaluation criteria. However, viaDBG, VG-Flow, and VStrains suffer from a high duplication ratio. The errors per 100 kbp obtained by SPAdes are sometimes higher than those of FC-Virus.
These results were somewhat surprising, as we initially expected that sequencing depth would significantly impact the performance of most assemblers. We also anticipated that assemblers would perform less effectively on the COVID-19 dataset compared to datasets like HCV, POLIO, and ZIKV due to the larger genome size of COVID-19. We attribute the observed performance to the low sequencing error rate we used and the relatively small number of strains in the COVID-19 dataset, which has only 3 strains, while the other datasets have between 5 and 15 strains. Our findings suggest that with a low sequencing error rate, the effect of sequencing depth on assembler performance is minimal. Moving forward, we plan to investigate how the number of strains in a dataset influences assembler performance.
Assessment of CPU time and memory requirements
In theory, both the total time complexity and space complexity of FC-Virus are O(m), where m is the number of reads. To better understand its performance in practice, we evaluated the computational demands of FC-Virus alongside the other compared assemblers in terms of CPU time and peak memory usage (Fig. 7). FC-Virus requires the shortest CPU time for the POLIO and HIV-LABMIX datasets. For the other datasets, its CPU time is just behind that of SOAPdenovo2. We attribute FC-Virus’s strong performance in CPU time to its use of homologous reads for constructing the consensus, which significantly reduces the required processing time. Additionally, the time spent on its greedy consensus refinement step is also relatively low. In terms of memory usage, FC-Virus either performs best or is just behind IDBA. Both SOAPdenovo2 and IDBA are traditional genome assembly algorithms. Although they perform well in terms of computational demands, their assembly results lag behind those of other algorithms.
Fig. 7 Computational requirements. a CPU time used by assemblers across six datasets. b Peak memory usage for each assembler on the six datasets
Conclusion
In this paper, we presented FC-Virus, an efficient genome assembly algorithm designed to accurately reconstruct full-length consensus sequences for viral quasispecies. We were the first to introduce the concept of homologous k-mers and outline a strategy for identifying them. By using homologous k-mers as anchors, FC-Virus merges reads to produce a consensus sequence that serves as a reference genome. This reference enables more detailed analysis of strain composition and distribution within the dataset. Our experimental results demonstrate that FC-Virus outperforms other assemblers across most evaluation metrics, thanks to its specialized consensus assembly approach. FC-Virus has the advantage of generating a single consensus sequence that delivers the same assembly effect as multiple contigs produced by other assemblers. Future work will focus on using the consensus sequences generated by FC-Virus as reference genomes for assembling individual strain genomes.
Availability of data and materials
The simulated datasets are available at https://bitbucket.org/fc-virus-benchmark and https://bitbucket.org/jbaaijens/savage-benchmarks/src/master/. Real datasets can be downloaded at https://github.com/cbg-ethz/5-virus-mix. The materials are available at https://github.com/qdu-bioinfo/FC-Virus-Supplementary.
References
1. Alves BM, Siqueira JD, Garrido MM, Botelho OM, Prellwitz IM, Ribeiro SR, Soares EA, Soares MA. Characterization of HIV-1 near full-length proviral genome quasispecies from patients with undetectable viral load undergoing first-line HAART therapy. Viruses. 2017;9(12):392.
2. Kim S, Misra A. SNP genotyping: technologies and biomedical applications. Annu Rev Biomed Eng. 2007;9:289–320.
3. Rishton GM. Reactive compounds and in vitro false positives in HTS. Drug Discov Today. 1997;2(9):382–4.
4. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. The sequence of the human genome. Science. 2001;291(5507):1304–51.
5. Nederbragt AJ. On the middle ground between open source and commercial software - the case of the Newbler program. Genome Biol. 2014;15(4):1–2.
6. Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002;12(1):177–89.
7. Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res. 1999;9(9):868–77.
8. Sutton GG, White O, Adams MD, Kerlavage AR. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol. 1995;1(1):9–19.
9. Huang X, Wang J, Aluru S, Yang S-P, Hillier L. PCAP: a whole-genome assembly program. Genome Res. 2003;13(9):2164–70.
10. Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M. Next generation sequence assembly with AMOS. Curr Protoc Bioinform. 2011;33(1):11–8.
11. De La Bastide M, McCombie WR. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinform. 2007;17(1):11–4.
12. Mullikin JC, Ning Z. The Phusion assembler. Genome Res. 2003;13(1):81–90.
13. Li Z, Chen Y, Desheng M, Yuan J, Shi Y, Zhang H, Gan J, Li N, Xuesong H, Liu B, et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-Bruijn-graph. Brief Funct Genom. 2012;11(1):25–37.
14. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.
15. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18(5):810–20.
16. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
17. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
18. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Guangxi W, Hao ZY, Shi YL, Chang Y, Wang B, Yao L, Han C, Cheung DW, Yiu S-M, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam T-W, Wang J. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
19. Peng Y, Leung HC, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
20. Chaisson MJ, Brinza D, Pevzner PA. De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res. 2009;19(2):336–46.
21. Souvorov A, Agarwala R, Lipman DJ. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biol. 2018;19(1):153.
22. Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17(11):1519–33.
Luo R, Lin Y. Vstrains: de novo reconstruction of viral strains via iterative path extraction from assembly graphs. In International Conference on Research in Computational Molecular Biology, pp. 3–20. Springer; 2023.
Yang X, Charlebois P, Gnerre S, Coole MG, Lennon NJ, Levin JZ, James Q, Ryan EM, Zody MC, Henn MR. De novo assembly of highly diverse viral populations. BMC Genom. 2012;13:1–13.
Baaijens JA, El Aabidine AZ, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017;27(5):835–48.
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de bruijn graph. Bioinformatics. 2021;37(4):473–81.
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics. 2018;34(17):2927–35.
Baaijens JA, Van der Roest B, Köster J, Stougie L, Schönhuth A. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics. 2019;35(24):5086–94.
Baaijens JA, Stougie L, Schönhuth A. Strain-aware assembly of genomes from mixed samples using flow variation graphs. In Research in Computational Molecular Biology: 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10–13, 2020, Proceedings 24, pp. 221–222. Springer; 2020.
Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic dna k-mer spectra: models and modalities. Genome Biol. 2009;10:1–10.
Benidt S, Nettleton D. Simseq: a nonparametric approach to simulation of rna-sequence datasets. Bioinformatics. 2015;31(13):2131–40.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. Quast: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.
Acknowledgements
The authors would like to thank the editors and reviewers for their constructive comments and suggestions on this manuscript.
Funding
This work is supported by the National Natural Science Foundation of China under Nos. 62202251, 6227226 and 62172028, and by the Natural Science Foundation of Shandong Province under No. ZR2022QF133.
Author information
Authors and Affiliations
College of Computer Science and Technology, Qingdao University, Qingdao, China
Jia Tian, Ziyu Gao, Minghao Li & Jin Zhao
School of Software Engineering, Beijing Jiaotong University, Beijing, China
Ergude Bao
Contributions
J.Z. contributed to the design of the study. J.T. and Z.G. implemented FC-strains. J.T., Z.G. and M.L. performed experiments. J.Z., J.T., Z.G., B.E. and M.L. wrote and reviewed the manuscript.
Corresponding author
Correspondence to Jin Zhao.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tian, J., Gao, Z., Li, M. et al. Accurate assembly of full-length consensus for viral quasispecies. BMC Bioinformatics 26, 36 (2025). https://doi.org/10.1186/s12859-025-06045-z