- Research Article
- Open access
- Published:
CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminativek-mers
BMC Genomicsvolume 16, Article number: 236 (2015)Cite this article
22kAccesses
87Altmetric
Abstract
Background
The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to size of the datasets that modern sequencing technologies can produce.
Results
We introduceClark a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy ofClark is better or comparable to the best state-of-the-art tools and it is significantly faster than any of its competitors. In its fastest single-threaded modeClark classifies, with high accuracy, about 32 million metagenomic short reads per minute.Clark can also classify BAC clones or transcripts to chromosome arms and centromeric regions.
Conclusions
Clark is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available athttp://clark.cs.ucr.edu/.
Background
The classification problem of determining the origin of a given DNA sequence (e.g., a read or a transcript) in a given set of target sequences (e.g., a set of known genomes) is common to several fields of computational molecular biology. Here, we focus our attention on two applications related to metagenomics and genomics.
In metagenomics, the objective is to study the composition of microbial community in an environmental sample. For example, sequencing of seawater samples has enabled discoveries in microbial diversity in the marine environment [1]. Similarly, the study of samples from the human body has elucidated the symbiotic relationships between the human microbiome and human health [2,3]. Once a metagenomic sample is sequenced, the first task is to determine the identities of the microbial species present in the sample. Several tools are available to classify metagenomic reads against known bacterial genomes via alignment (e.g., [4-7]) or sequence composition (e.g., [8-11]). A recent comparative evaluation of these tools [12] demonstrated thatNBC [8] exhibits the highest accuracy and sensitivity at the genus level among [4-6,9]. This study also showed thatNBC and other probabilistic methods (e.g.,PHYMMBL [5]) as well BLAST-based methods (e.g.,MEGAN [4],METAPHYLER [6]) are computationally expensive. Recently, new faster methods have been introduced (e.g.,KRAKEN [11]) but their performance still does not meetNBC’s sensitivity. To the best of our knowledge, there is no tool yet that has both a sensitivity comparable toNBC and a speed comparable toKRAKEN. A related group of metagenomic tools, such asMETAPHLAN [7] andWGSQUIKR [13] addresses the abundance estimation problem, that is, they estimate from the reads the proportion of each organism present in the sample.
The second application is associated withde novo clone-by-clone sequencing and assembly. Given a BAC clone (or a transcript), an objective of a classification problem sometimes is to determine which chromosome (or arm) is the most likely origin of that clone/transcript. The problem assumes that reads for each BAC/transcript as well as reads for each chromosome arm are available, but that the fully-assembled reference genome is not. This is the situation in barley, which we have used for this work, and for many other organisms. In the past, the BAC/transcript assignment problem had been addressed using general-purpose alignment tools (e.g.,BLAST [14] orBLAT [15]), as in [16].
In both of these applications the computational problem is the same: given a set of DNA sequences to be classified (henceforth called “objects”) and a set of reference sequences (e.g., genus-level sequences, chromosome arms, etc., henceforth called “targets”), identify which target is the most likely origin of each object based on sequence similarity. Although this problem has been extensively studied, it is still computationally challenging due to the rapid advances in sequencing technologies: cheaper, faster, sequencing instruments can now generate billion of reads in a few days. As the number of objects grows, so does the number of targets, as demonstrated by the exponential growth of GenBank [17]. Given these demands, it is critical for software tools to minimize computational resources (time, memory, I/O, etc) required for analysis.
Here we presentCLARK (CLAssifier based on Reduced K-mers), a new tool that can accurately and efficiently classify objects to targets, based on reduced sets ofk-mers (i.e., DNA words of lengthk).CLARK is the first method able to perform classification of short metagenomics reads at the genus/species level with a sensitivity comparable to that ofNBC, while achieving a comparable speed toKRAKEN. In some situations,CLARK can be faster and more precise thanKRAKEN at the genus/species level. Unlike tools likeLMAT [10],METAPHYLAN,PHYLOPYTHIAS [9],METAPHYLER [6], orNBC,CLARK produces assignments with confidence scores, which are critical to post-process assignments in downstream analyses. Additionally,CLARK is designed to be user-friendly, self-contained (i.e., does not depend on any other tool or library), and multi-core-friendly.CLARK does not need as much disk space asKRAKEN orPHYMMBL. Finally, a “RAM-light” version ofCLARK can be run on a memory-limited architecture (such as a 4 GB RAM laptop).
Results and discussion
We briefly reviewCLARK’s algorithm before reporting experimental results.
Target-specifick-mers and Classification
During preprocessing,CLARK builds a large index containing thek-spectrums of all targets sequences. We recall that ak-mer is a DNA word of fixed lengthk, and that thek-spectrum of a stringx is the vector of dimension 4k that collects the number of occurences of all possiblek-mers inx. Thek-spectrum is a succinct (lossy) representation ofx, which allows sequence comparison (see e.g., [18]). Once allk-spectrums of target sequences have been collected in the index,CLARK removes any commonk-mers between targets (seeMethods section).
Henceforth, we call the remainingk-mers eithertarget-specific ordiscriminative, because they represent genomic regions that uniquely characterize each target. Finally, an object is assigned to the target with which it shares the highest number ofk-mers.
CLARK offers two modes of execution. The first mode (henceforth named “full”) outputs for each object the number of hits against all the targets and the confidence score of the assignment (which is a number 0.5–1.0). The second mode (“default”) employs sampling to reduce the number the target-specifick-mers for classification, and outputs assignments without any detailed statistics so that the output size is significantly reduced (seeMethods section for more details). The default mode is slightly less accurate, but it is faster.
Metagenomics classification
Inputs to this classification task are (1) NCBI/RefSeq databases of known bacterial genomes (targets) and, either (2A) the set of metagenomic reads used in [11] and the set of simulated long reads from “simHC” [19], or (2B) the set of real metagenomic reads from the Human Microbiome Project (objects). The Human Microbiome Project data are freely accessible [2,3].
At the time we carried out the experiments the NCBI/RefSeq database was composed of 2,752 complete bacterial genomes, distributed into 695 distinct genera, or 1,473 species. The total length of all these bacterial genomes was about 9.5 Gbp. The average size of a genome was about 3.5 Mbp.
In the first experiment, we used three microbial metagenomics datasets called “HiSeq”, “MiSeq” and “simBA-5” that were introduced in [11]. According to [11], “the HiSeq and MiSeq metagenomes were built using twenty sets of bacterial whole-genome shotgun reads. These reads were found either as part of the GAGE-B project [20] or in the NCBI Sequence Read Archive. Each metagenome contains sequences from ten genomes (see Additional file1: Table S1 in [11] for the list of genomes). For these metagenomes, 10% of their sequences were selected from each of the ten component genome data sets (i.e., each genome had equal sequence abundance)”. The set “simBA-5” included “simulated bacterial and archaeal reads, and was created with an error rate five times higher than” the default (see [11]). We also analyzed the set “simHC” of synthetic reads [19], which simulates high complexity communities lacking dominant populations. SimHC contains 113 sets of reads from various microbial genomes. From simHC, we selected arbitrarily twenty distinct genomes, and extracted the first 500 reads for each genome to build a total of 10,000 reads (see Additional file1: Table S4). We called this latter dataset “simHC.20.500”.
For the experiments below we used the “HiSeq”, “MiSeq” (which can be considered set of read of low/medium complexity), “simBA-5” from [11] and “simHC.20.500” (which can be considered set of reads of high complexity). Each of these sets contains 10,000 reads. The average read length in HiSeq was 92 bp, 156 bp in MiSeq, and 951 bp in simHC.20.500. In simBA-5, all reads are 100 bp long.
In the second experiment, we have arbitrarily chosen three metagenomic samples selected from the Human Microbiome Project [2,3]. The three samples we used were SRS015072 (mid-vagina) containing 572 thousand paired-end reads, SRS019120 (saliva) containing 4.3 million paired-end reads, and SRS023847 (nose) containing 5.2 million paired-end reads.
HiSeq, MiSeq, simBA-5 and simHC.20.500
We usedCLARK to classify the reads in the four datasets described above and compared its classification results against the state-of-the-art methods, namelyNBC [8], which we chose for its high accuracy (currently the most sensitive metagenomics classifier, according to [12]), andKRAKEN, which we chose due to its high speed (currently the fastest metagenomics classifier, according to [11]) and its high precision at the genus level.
We classified the reads (i) against 695 genus-level targets (Table1) and (ii) against 1473 species-level targets (Table2).
For a given level in the taxonomy tree (e.g., genus), we defineprecision as the fraction of correct assignments over the total number of assignments, andsensitivity as the ratio between the number of correct assignments and the number of objects to be classified. In order to have a fair comparison againstKRAKEN’s assignments, whenKRAKEN produces an assignment that is not available at or below the genus or species level, it is then considered as not assigned.
Table1 reports precision, sensitivity and processing speeds (in 103 reads per minute) obtained byNBC,KRAKEN andCLARK on the HiSeq, MiSeq, simBA-5 and simHC.20.500 datasets, for several values of thek-mer length. The table illustrates how the performance of these tools is affected by the choice ofk. By increasingk one generally increases precision, but can lower sensitivity (also see Figure1). To carry out a fair comparison between tools, we decided to first determineNBC’s andKRAKEN’s optimalk-mer length, and then runCLARK with a value ofk that would match either sensitivity or precision.
Classification performance ofCLARK for severalk-mer length and for various datasets.CLARK’s precision, sensitivity, assignment rate, average confidence scores and precision of high confidence assignments (HC) for several choices of thek-mer length on the “HiSeq” metagenomic dataset(a), the “MiSeq” metagenomic dataset(b), the “simBA-5” metagenomic dataset(c), the “simHC.20.500” metagenomic dataset(d), and barley unigenes(e).(a) –(d) are results of the classification against the 695 genus-level targets.
NBC was tested withk=11,13,15. We observed thatk=15 produced the highest sensitivity on all datasets. The valuek=15 is the highest possible value, which is recommended by the authors of [8] for datasets composed of short reads. SinceNBC produces detailed statistics on the assignments, we executedCLARK in “full” mode for a fair comparison. Usingk=20 forCLARK (full mode) we obtained a similar sensitivity toNBC (CLARK is actually more sensitive thanNBC on HiSeq and simHC.20.500). At the same level of sensitivity ofNBC,CLARK achieves a higher precision and it is thousands of times faster.
In the case ofKRAKEN,k=31 was the value used in [11] for HiSeq, MiSeq and simBA-5 and it is supposed to achieve the highest precision. Nonetheless, we tried to runKRAKEN for other values ofk. As expected, Table1 shows thatk=31 produces the best precision for all the datasets. For this comparison, we also ranCLARK withk=31. Observe thatCLARK (default mode) is slightly less sensitive thanKRAKEN but is more precise and faster. The difference in speed is significant for all datasets of short reads (300−800 thousand additional reads/min). On simHC.20.500,KRAKEN andCLARK achieve the same speed due to the fact that these datasets contain longer reads. Finally,CLARK has better sensitivity thanKRAKEN on simHC.20.500.
The same comparisons were carried out between the two variants ofKRAKEN andCLARK optimized for speed, calledKRAKEN-Q andCLARK-E (E for “Express”, seeMethods section). As indicated in Table1,KRAKEN-Q achieves the best precision for all the datasets whenk=31, which is consistent with [11]. However, whenk=31CLARK-E runs four–five times faster thanKRAKEN-Q and is also more precise. In addition, observe that as we decreasek, both variants gets faster butCLARK-E maintains a precision above 90% whileKRAKEN-Q produces progressively lower precisions.
In the last row of Table1, we report the performance ofCLARK-l, another variant ofCLARK designed for low RAM architectures that runs only fork=27 (seeMethods section).CLARK-l performs assignments with a lower precision thanCLARK (the difference is at most 3.5% in these experiments) but can process more than 1.5 million of reads per minute on HiSeq or simBA-5, and only uses about 4% of the memory used byCLARK (see Additional file1: Table S1).
All experimental results reported so far were obtained in single-threaded mode. If a multi-core architecture is available,CLARK andKRAKEN can take advantage of it. In Additional file1: Table S2, we summarize the classification speed of the two tools using 1, 2, 4 or 8 threads fork=31. Observe that using eight threads,CLARK achieves a speed-up of 5.2x compared to one thread, whileKRAKEN only achieves a speed-up of 1.2x. When comparingCLARK-E toKRAKEN-Q, we can make similar observations. In general, note thatCLARK-E is at least five times faster thanKRAKEN-Q, independently of the number of threads used.
For the analysis at the species level, we repeated the classification of the objects in the four datasets described above against species-level targets. This time we used values ofk that allowed best sensitivity forNBC (k=15) and best precision forKRAKEN (k=31). Observe in Table2 thatNBC achieves the best sensitivity on all datasets. However, whenCLARK is ran in full mode usingk=20, it achieves a higher precision thanNBC on HiSeq, MiSeq and simHC.20.500, and is several orders of magnitude faster. In addition,CLARK in default mode usingk=31 achieves higher precision thanKRAKEN on all datasets (as much as 10% higher on HiSeq and MiSeq) whenk=31.CLARK also outperforms the speed ofKRAKEN on HiSeq, MiSeq and simBA-5. On simHC.20.500, since the reads are much longer, the speed ofKRAKEN andCLARK are comparable. But,CLARK has higher sensitivity thanKRAKEN on HiSeq, MiSeq and simHC.20.500. Finally, the fast variantCLARK-E, as previously observed for the experiments at the genus level, outperformsKRAKEN-Q in both speed and precision.
Human microbiome samples
In the second experiment, we usedCLARK to classify Human Microbiome Project reads against 695 genus-level targets described above. This time, however, the “ground truth” was not available.
Usingk=31,CLARK was able to assign 42.1% of the reads in SRS015072 (mid-vagina), 30.8% of the reads in SRS019120 (saliva) and 49.8% of the reads in SRS023847 (nose).KRAKEN achieved similar rates of assigned reads usingk=31. Reducingk would increase the number of assignments, at the cost of increasing the probability of misclassification. We investigated whether we could take advantage ofCLARK’s confidence scores to compensate for a smaller value ofk, and improve the fraction of assigned reads.
Figure1a to Figure1d show thatCLARK’s sensitivity on the four datasets is the highest fork=20 ork=21. However, the precision fork=20 andk=21 is about 15% lower than fork=31, which implies that a large proportion of assignments may be incorrect. We have strong experimental evidence that shows that the higher isCLARK’s confidence score for an assignment, the more likely that assignment is correct (see Additional file1: Supplementary Note 2). In addition, we observe in Figure1a to Figure1d that the precision of high confidence assignments is higher than the average precision of all assignments, and is relatively constant for allk-mer length. The idea is to usek=20 to maximize the number of assigned reads, but only consider high confidence assignments to increase the precision. We call an assignmenthigh confidence if the confidence score is higher than 0.75,low confidence otherwise.
Observe in Table3 that the number of high confidence assignments fork=20 is significantly higher than fork=31. The relative increase in assignments is about 40% (from 42.1% to 62.3% in SRS015072, 30.8% to 55.1% on SRS019120, and 49.8% to 68.3% on SRS023847). Table3 also reports the most frequent five genera in high confidence assignments. For the saliva sample, the dominance ofStreptococcus,Haemophilus andPrevotella is consistent with findings in [2] and [11]. Study [21], which focused on salivary microbiota of 35 inflammatory bowel disease patients, also reportsStreptococcus,Prevotella,Neisseria,Haemophilus andVeillonella as dominant genera. Concerning the mid-vagina sample, we have found thatLactobacillus is the dominant genus, in agreement with findings reported in [2,22,23]. The proportion ofLactobacillus we have identified (64.7%) is very close to the reported proportion (69%–71%) in [22,23]. The presence ofPseudomonas andGardnerella is expected because some individuals who lackLactobacillus have insteadGardnerella orPseudomonas as the predominant bacteria [22,23]. In the nose sample, the high presence ofPropionibacterium andStaphylococcus is consistent with the results in [2].
Classification of barley BACs and unigenes to chromosome arms and centromeres
Inputs to this classification task were (1) barley chromosome arms (targets) and (2) barley BACs or unigenes (objects). Samples of each barley chromosome arm were obtained using flow-sorting [24]. The procedure to obtain gene-rich barley BACs was described in [25]. Sequences for chromosome arms and BACs were generated on an Illumina HiSeq 2000 instrument by J. Weger at UC Riverside.
For the targets, we processed thirteen datasets of shotgun sequenced reads: one for barley chromosome 1H and twelve for barley chromosome arms (namely, 2HL, 2HS, 3HL, 3HS, 4HL, 4HS, 5HL, 5HS, 6HL, 6HS, 7HL, and 7HS). After quality-trimming the reads, we had a total of about 181 Gbp of sequence data. The cumulative size of the assembled barley chromosome arms obtained viaSOAPDENOVO [26] resulted in about 2 Gbp (about 40% of the barley genome).
The objects were 50,938 barley unigenes (transcript assembly from ESTs) obtained from [27] for a total of about 222.4 Mbp. Additionally, we trimmed short reads for 15,721 BACs obtained from [25], for a total of about 1.73 Gbp. We also had access to 15,697 BAC assemblies (not all BACs had a sufficient number of reads for an assembly) for a total of about 1.80 Gbp. While the genomic location for the majority of these “objects” was unknown, we had 1,652 unigenes for which a location was derived from the Golden Gate oligonucleotide pool assay (OPA) [28], which allowed us to determine a presumed location of 2,252 BACs [25]. We should point out that although we have used these locations as the “ground truth” to establish the accuracy of the classification, our observations indicate about 5% errors in these OPA assignments [25].
As stated above, the most critical parameter inCLARK is the length of thek-mer used for classification. By assuming that the subset of the unigenes that have a location via OPA are correct, we were able to estimateCLARK’s precision and sensitivity for various choices ofk. Figure1e shows these statistics, along with the assignment rate (fraction of unigenes assigned) and the average confidence score for all assignments. Observe that ask increases, the number of assignments decreases but the precision/sensitivity increases. Based on this analysis we determined thatk=19 represents a good tradeoff for this dataset.
Table4 summarizesCLARK’s assignment of barley unigenes (assemblies) to barley chromosomes arms (assemblies) usingk=19. When both targets and objects are assemblies, we call this an “A2A” assignment. Observe that most of the assignments have high confidence and they are relatively evenly distributed among barley chromosome arms (the seven barley chromosomes are believed to be relatively similar in length). Observe in Figure1e thatCLARK’s precision and sensitivity for this classification task is very high (both at 98.49%) while the average confidence score is above 0.96, and 99.44% of unigenes are assigned.
Additional file1: Table S3 presents a summary ofCLARK’s assignment of barley BACs (assemblies) to arms (assemblies), while Table5 refers to the same assignments based on the reads instead of the assemblies (“R2R” assignment). The consistency between these results (same distribution of BACs assignments over chromosome arms, and similar proportion of high and low confidence assignments) demonstrates the robustness of our approach. The agreement with OPA-based locations is 92.9% for R2R assignments, and 93.2% for A2A assignments. Observe that the agreement for BAC/arm assignments is lower than unigene/arm assignments (98.49%).
Finally, we comparedCLARK against (1) theBLAST-based method used in [25] for BAC-arm assignment (A2A); and (2) the assignments provided in [16,29]. For (1),CLARK assigned 13,706 BACs (of which 2,252 have a prior OPA-based location) while theBLAST-based method assigned 13,583 BACs (of which 2,238 have a prior OPA-based location).CLARK’s precision and sensitivity were 93.2% and 93.2%, respectively, whileBLAST-based’s precision and sensitivity were 92.4% and 91.9%, respectively.BLAST-based andCLARK disagreed on 19 assignments; within these 19 disagreements,CLARK agreed with the GoldenGate assays on seven cases, andBLAST-based agreed on four cases. In (2), we examined the assignment for the 1,037 BACs that were sequenced by our group and by Leibniz-Institut fur Pflanzengenetik und Kulturpflanzenforschung, Gatersleben, Germany (IPK) [16] and we identified only 42 disagreements (4% of the total); among these disagreements, 19 had an independent assignment via POP-seq [29]. In 15 cases out of 19, our assignment agreed with the POP-seq assignment. For the 23 disagreements for which there was no POP-seq assignment, we compared the assembled BACs and we discovered 6 cases in which the sequences were less than 30% similar, suggesting a naming error. In summary, there were only a handful of cases where the disagreement could not be readily explained.
Performance dependency on thek-mer length
To determine the optimal value ofk for a particular dataset one can take advantage of prior knowledge, as we did in the case of unigene/BAC assignment to chromosomes. In that case, we had 1,657 unigenes for which the correct BAC assignment (approximately 95% accuracy) was experimentally determined via Illumina GoldenGate assay (BOPA1 and BOPA2). Given these known assignments, we estimated precision and sensitivity, as well as the average confidence score for all assignments and the assignment rate (see Figure1e). Observe thatk=19 maximizes all four measurements. Higher precision and average confidence score can be achieved by using higherk but at the cost of decreasing sensitivity and assignment rate.
Similar evaluation were carried out on the metagenomic datasets. Figure1a to Figure1d show precision, sensitivity, as well as assignment rate and average confidence score as a function ofk. In both cases we observe that as we increasek, precision and the average confidence score are increasing, while the sensitivity is decreasing. We observe that the maximum sensitivity is achieved fork in the range 19–22 for all metagenomic datasets, independently of the reads length or complexity.
As a consequence, for high sensitivity (or high number of assignments) one must choosek between 19 and 22. For high precision (or high confidence score) one must choosek higher than 26. The current implementation supportsk up to 32.
Conclusions
We have presentedCLARK, a new method for metagenomic sequence classification and chromosome/arm assignments of DNA sequences.
Experimental results demonstrate thatCLARK has several advantages over alternative methods. (i)CLARK is able to classify short metagenomic reads with high accuracy at multiple taxonomic ranks (i.e., species and genus level) and its assignments on real metagenomic samples are consistent with findings published in the literature. (ii)CLARK can achieve the same or better accuracy than the state-of-the-art metagenomic classifiers. (iii) The classification speed ofCLARK, in the context of metagenomics, is unmatched.CLARK can classify 32 million metagenomic short reads per minute, which is five times faster thanKRAKEN. In addition,CLARK “scales” better on a multi-core architectures: the speed-up one can obtain by adding more threads is higher thanKRAKEN. (iv)CLARK is able to output confidence scores, is user-friendly and self-contained (unlike most of other classifiers, it does not require external tool such asBLAST orMEGABLAST, etc). (v)CLARK can be executed with relatively small amounts of RAM (unlikeLMAT) or disk space (unlikePHYMMBL orKRAKEN). Indeed,LMAT can use about 500 GB of RAM, while the maximum amount of RAM needed byCLARK is less than 165 GB (see Additional file1: Table S1).PHYMMBL orKRAKEN require respectively about 120 GB and 140 GB of disk space to run, whileCLARK requires 40–42 GB for classification. (vi) In the context of genomics,CLARK can classify BACs and transcripts with better accuracy than previously used BLAST-based method [25], and can infer centromeric regions.
Even though in this manuscript we focus the attention on genus and species level classification,CLARK is expected to work also at higher taxonomic levels such as phylum, family or class. As it is now, however,CLARK cannot take advantage of taxonomic tree structures. We believe thatCLARK will be useful in a variety of applications in metagenomics and genomics. For instance, we have usedCLARK to identify chimerism and vector contamination in sequenced BACs.
Methods
Building target-specifick-mer sets
CLARK accepts inputs infasta/fastq format; alternatively the input can be given as a text file containing thek-mer distribution (i.e., each line contains ak-mer and its number of occurrences).CLARK first builds an index from the target sequences, unless one already exists for the specified input files. If a user wants to classify objects at the genus level (or another taxonomic rank), he/she is expected to generate targets by grouping genomes of the same genus (or with the same taxonomic label). This strategy represents a major difference with other tools (such asLMAT, orKRAKEN). The index is a hash-table storing, for each distinctk-merw (1) the ID for the target containingw, (2) the number of distinct targets containingw, and (3) the number of occurrences ofw in all the targets. This hash-table uses separate chaining to resolve collisions (at each bucket).CLARK then removes anyk-mer that appears in more than one target, except in the case of chromosome arm assignment. In the latter case,k-mers shared by the two arms of the same chromosome are used to define centromeric regions of overlap. Also,k-mers in the index may be removed based on their number of occurrences if the user has specified a minimum number of occurrences. These rarek-mers tend to be spurious from sequencing errors. Other metagenomic classifiers likeKRAKEN andLMAT do not offer this protection against noise, which is very useful when target sequences are reads (or low-quality assemblies). Then, the resulting sets of target-specifick-mers are stored in disk for the next phase. The time and memory needed to create the index (fork=31) are given in Additional file1: Table S1. This table also contains the time and memory required byNBC andKRAKEN. Observe thatCLARK is faster thanNBC andKRAKEN to create the index, and it uses less RAM and disk space thanKRAKEN for classifying objects.
The concept of “target-specifick-mers” is similar to the notion of “clade-specific marker genes” proposed in [7] or “genome-specific markers” recently proposed in [30]. WhileCLARK uses exact matching to identify the target-specifick-mers derived from any region in the genome, the authors in [7] disregard intergenic regions. The authors of [30] focus on strain-specific markers identified by approximate string matching, whileCLARK uses exact matching. Another important difference is that the method presented in [30] relies onMEGABLAST [31] to perform the classification, which is several orders of magnitude slower thanKRAKEN [11].
For users that want to runCLARK on workstations with limited amounts of RAM, we have designedCLARK-l (“light”).CLARK-l is a variant ofCLARK that has a much smaller RAM footprint but can classify objects with similar speed and accuracy. The reduction in RAM can be achieved by constructing a hash-table of smaller size and by constructing smaller sets of discriminativek-mers. Instead of considering allk-mers in a target,CLARK-l samples a fraction of them.CLARK-l uses 27-mers (27-mers appeared to be a good tradeoff between speed, low memory usage and precision) and skips four consecutive/non-overlapping 27-mers. As a result,CLARK-l’s peak RAM usage is about 3.8 GB during the index creation, and 2.8 GB when computing the classification (see Additional file1: Table S1).CLARK-l has also the advantage to be very fast in building the hash table. Table1 includes the performance ofCLARK-l. While the precision and sensitivity are lower compared toCLARK,CLARK-l still achieves high precision and high speed.
Sequence classification
In the full mode, once the index containing target-specifick-mers has been created,CLARK creates a “dictionary” that associatesk-mers to targets. Then,CLARK iteratively processes each object: for each object sequenceoCLARK queries the index to fetch the set ofk-mers ino. A “hit” is obtained when ak-mer (either forward or reverse complement) matches a target-specifick-mer set. Objecto is assigned to the target that has the highest number of hits (see algorithmic details in Additional file1: Supplementary Note 1 and Additional file1: Table S5). The confidence score is computed ash1/(h1+h2), whereh1 is the number of hits for the highest target, andh2 is the number of hits for the second-highest target.
The rationale to remove commonk-mers between targets (at any taxonomy level defined by the user) is that they increase the “noise” in the classification process. If they were present, more targets could obtain the same number of hits which would complicate the assignment. If such conflicts can be avoided, then there is no need to query the taxonomy tree, and find, for example, the lowest common ancestor taxons for “conflicting nodes” to resolve them as it is done in other tools (e.g.,KRAKEN orLMAT). Observe in Additional file1: Figure S1, that most ofCLARK’s assignments have high confidence scores. Observe that at least 95% of all assignments in HiSeq, MiSeq, simBA-5 and simHC.20.500 made byCLARK in the full mode, have confidence scores equal to 1 (i.e., exactly one target gets hits), and the average confidence scores in all these assignments is 0.997. This implies that, on average, the number of hits for the top target (which will receive the assignment) is about 336 times higher than the second. Thus,CLARK, unlikeLMAT orKRAKEN, does not need the taxonomy tree to classify objects, instead one “flat” level is clearly sufficient.
If users are not interested in collecting confidence scores and all hit counts, then it is recommended to use the default mode ofCLARK. In this mode,CLARK stops queryingk-mers for an object as soon as there is at least one target that collects at least half of the total possible hits. Also, this mode loads in main memory about half of the target-specifick-mers. This is done by alternatively loading or skipping target-specifick-mers based on their index positions.CLARK runs significantly faster in default mode (2–5 times faster in our experiments) with negligible degradation of sensitivity and assignment rate. Also, the RAM usage is significantly lower than the full mode (up to 50% lower in our experiments). If speed is the primary concern, we have designed an “express” variant ofCLARK calledCLARK-E.CLARK-E is based upon Theorem 1 (see Additional file1: Supplementary Note 1), which states that if an object originates from one of the targets then either one or no target will be hit from thek-mers in the object. Since we use target-specifick-mer sets, at most one target can be associated to thek-mers of an object. In addition, we reduce the number of queries to the database by considering a sample of thek-mers in the object. SoCLARK-E only queries non-overlappingk-mers, and the object is assigned to the first target that obtains a hit. This optimization allowsCLARK-E to be extremely fast compared toCLARK/KRAKEN (see Table1), while maintaining high precision and sensitivity.
Running time analysis
All experiments presented in this study were run on a Dell PowerEdge T710 server (dual Intel Xeon X5660 2.8 Ghz, 12 cores, 192 GB of RAM).CLARK-l was also run on a Mac OS X, Version 10.9.5 (2.53 GHz Intel Core 2 Duo, 4 GB of RAM). When comparingKRAKEN toCLARK in their default mode, andKRAKEN-Q toCLARK-E, we always setKRAKEN to “preload” its database in main memory and print results to a file (instead of the standard output) to achieve the highest speed. For consistency,CLARK was also run under the same conditions. For the results in Table1 and Table2,CLARK (v1.0),NBC (v1.1), andKRAKEN (v0.10.4-beta) were run in single-threaded mode, three times on the same inputs in order to smooth fluctuations due to I/O and cache issues (the reported numbers are best values). We have also run the latest version of Kraken (v0.10.5-beta) and we did not observe a significant variation of accuracy and usage of RAM. However, we observed a 15% decrease in the classification speed compared to version v0.10.4-beta. The software toolCLARK is available for download athttp://clark.cs.ucr.edu/.
References
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al.Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004; 304(5667):66–74.
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, et al.Structure, function and diversity of the healthy human microbiome. Nature. 2012; 486(7402):207–14.
The Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012; 486(7402):215–21.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007; 17(3):377–86.
Brady A, Salzberg S. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods. 2011; 8(5):367.
Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011; 12(Suppl 2):4.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012; 9(8):811–4.
Rosen GL, Reichenberger ER, Rosenfeld AM. NBC: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2011; 27(1):127–9.
Patil KR, Haider P, Pope PB, Turnbaugh PJ, Morrison M, Scheffer T, et al.Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011; 8(3):191–2.
Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics. 2013; 29(18):2253–60.
Wood D, Salzberg S. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014; 15(3):46.
Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinf. 2012; 13(1):92.
Koslicki D, Foucart S, Rosen G. WGSQuikr: Fast whole-genome shotgun metagenomic classification. PloS one. 2014; 9(3):91784.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.
Kent WJ. BLAT: the BLAST-like alignment tool. Genome Res. 2002; 12(4):656–64.
International Barley Genome Sequencing Consortium. A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012; 491(7426):711–6.
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al.Genbank. Nucleic Acids Res. 2012:1195.
Vinga S, Almeida J. Alignment-free sequence comparison: a review. Bioinformatics. 2003; 19(4):513–23.
Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, et al.Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007; 4(6):495–500.
Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, et al.GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013; 29(14):1718–25.
Said HS, Suda W, Nakagome S, Chinen H, Oshima K, Kim S, et al.Dysbiosis of salivary microbiota in inflammatory bowel disease and its association with oral immunological biomarkers. DNA Res. 2013:037.
Antonio MA, Hawes SE, Hillier SL. The identification of vaginal lactobacillus species and the demographic and microbiologic characteristics of women colonized by these species. J Infectious Diseases. 1999; 180(6):1950–6.
Hyman RW, Fukushima M, Diamond L, Kumm J, Giudice LC, Davis RW. Microbes on the human vaginal epithelium. Proc Nat Acad Sci. 2005; 102(22):7952–7.
Doležel J, Vrána J, Šafář J, Bartoš J, Kubaláková M, Šimková H. Chromosomes in the flow to simplify genome analysis. Funct Integr Genomics. 2012; 12(3):397–416.
Lonardi S, Duma D, Alpert M, Cordero F, Beccuti M, Bhat PR, et al.Combinatorial pooling enables selective sequencing of the barley gene space. PLoS Comput Biol. 2013; 9(4):1003010.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al.SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012; 1(1):18.
Close TJ, Wanamaker S, Roose ML, Lyon M. HarvEST. Methods Mol Biol. 2006; 406:161– 77.
Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, et al.Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics. 2009; 10(1):582.
Mascher M, Muehlbauer GJ, Rokhsar DS, Chapman J, Schmutz J, Barry K, et al.Anchoring and ordering NGS contig assemblies by population sequencing (Popseq). Plant J. 2013; 76(4):718–27. doi:10.1111/tpj.12319.
Tu Q, He Z, Zhou J. Strain/species identification in metagenomes using genome-specific markers. Nucleic Acids Res. 2014; 42(8):67.
Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000; 7(1-2):203–14.
Acknowledgements
This project was funded in part by USDA, “Advancing the Barley Genome” (2009-65300-05645) and by NSF, “ABI Innovation: Barcoding-Free Multiplexing: Leveraging Combinatorial Pooling for High-Throughput Sequencing” (DBI-1062301), and “III: Algorithms and Software Tools for Epigenetics Research” (IIS-1302134). We are thankful to the authors ofNBC andKraken for their useful advice on running their tools. We thank Dr. Gail Rosen and the anonymous reviewers for constructive comments on the manuscript.
Author information
Authors and Affiliations
Department of Computer Science & Engineering, University of California, 900 University Avenue, CA, 92521, Riverside, USA
Rachid Ounit & Stefano Lonardi
Department of Plant & Botanic Sciences, University of California, 900 University Avenue, CA, 92521, Riverside, USA
Steve Wanamaker & Timothy J Close
- Rachid Ounit
You can also search for this author inPubMed Google Scholar
- Steve Wanamaker
You can also search for this author inPubMed Google Scholar
- Timothy J Close
You can also search for this author inPubMed Google Scholar
- Stefano Lonardi
You can also search for this author inPubMed Google Scholar
Corresponding author
Correspondence toStefano Lonardi.
Additional information
Competing interests
The authors declare that they have no competing financial interests.
Authors’ contributions
RO designed, implemented, tested, and optimizedClark; RO also collected experimental results and wrote the draft of the manuscript; SW helped to carry out the comparison betweenClark and theBlast-based method that he wrote for [25]; SL and TJC proposed and supervised the project; RO, SL and TJC edited the manuscript. All authors read and approved the final manuscript.
Additional file
Additional file 1
Supplementary Material. Detail about the mathematical modeling, the impact of thek-mer length on results, the analysis of the confidence scores, and the software implementation.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visithttps://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ounit, R., Wanamaker, S., Close, T.J.et al. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminativek-mers.BMC Genomics16, 236 (2015). https://doi.org/10.1186/s12864-015-1419-2
Received:
Accepted:
Published:
Share this article
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative