. 2012 Oct 11;28(23):3150–3152. doi:10.1093/bioinformatics/bts565

CD-HIT: accelerated for clustering the next-generation sequencing data

Limin Fu^1,^✉,Beifang Niu^1,^✉,Zhengwei Zhu^1,^†,^✉,Sitao Wu¹,Weizhong Li^1,^*

¹Center for Research in Biological Systems, University of California San Diego, La Jolla, CA 92093, USA

^✉

*To whom correspondence should be addressed.

^†Present address: Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania.

Associate Editor: Inanc Birol

^✉

Corresponding author.

Received 2012 Jun 27; Revised 2012 Aug 31; Accepted 2012 Sep 12; Issue date 2012 Dec 1.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

PMC Copyright notice

PMCID: PMC3516142 PMID:23060610

Abstract

Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.

Availability:http://cd-hit.org.

Contact:liwz@sdsc.edu

Supplementary information:Supplementary data are available atBioinformatics online.

1 INTRODUCTION

Sequence analysis has played a crucial role in computational biology. With the advancement of the next-generation sequencing technologies, the amount of available sequencing data is growing exponentially. Removing redundancy from such data by clustering could be crucial for reducing storage space, computational time and noise interference in some analysis methods, etc.

CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Liet al., 2001) and was then extended to support clustering nucleotide sequences (Li and Godzik, 2006). Since its release, CD-HIT has become very widely used for a large variety of applications ranging from non-redundant dataset creation (Suzeket al., 2007), protein family classifications (Yoosephet al., 2008), artifact identification (Niuet al., 2010), metagenomics annotation (Sunet al., 2011), RNA analysis (Loong and Mishra, 2007), to various prediction studies (Rubinstein and Fiser, 2008).

With sequencing data rapidly growing in public data repositories as well as in individual laboratories, there has been strong demand for an enhanced CD-HIT with greater efficiency. In response to such demand, we have developed this enhanced and parallelized version of CD-HIT, to exploit the fact that multi-core machines have become very common in ordinary laboratories.

A computer cluster-based parallelization procedure for CD-HIT has been proposed inSuzeket al. (2007), though not fully parallelized, this procedure provides good speedup using computer cluster. Since computer clusters are not as easily available as multi-core machines, here we propose an alternative parallelization technique, which assumes shared memory model and works well on multi-core machines.

2 METHODS

Basically, CD-HIT is a greedy incremental algorithm that starts with the longest input sequence as the first cluster representative, and then process the remaining sequences from long to short to classify each sequence as a redundant or representative sequence based on its similarities to the existing representatives. The similarities are estimated by common word counting using word indexing and counting tables to filter out unnecessary sequence alignments, which are used to compute exact similarities. In the following sections, we will describe the techniques that are used to accelerate CD-HIT.

2.1 Simplification

In order to support full parallelization, the core steps of CD-HIT have been simplified into two key procedures: a checking procedure and a clustering procedure. Using these two procedures, the algorithm requires at most two word tables without the need to swap them to disk, which was necessary in the original CD-HIT due to the use of multiple tables for large datasets.

Given a word table, the checking procedure checks each of the remaining sequences against the table and determines whether it is a redundant sequence. The clustering procedure will make a final determination of the status of a sequence, and if it is classified as a representative sequence, it is used to update the word table. A more detailed description with illustration is available inSupplementary Material Section 1.2 andFigure S1.

2.2 Parallelization

GivenT threads or cores, the basic idea of our novel parallelization technique is to use two word tables and useT−1 threads to run multiple checking procedures using one word table (a immutable checking table), and the remaining thread to run a single clustering procedure using the other table (a mutable clustering table) in parallel. Due to the sequential characteristics of CD-HIT, it will require properly grouping the input sequences and switch the word tables to guarantee the correctness of the parallelization.

This is achieved by dividing each round of computation into two stages. The first stage is to runT checking procedures on the sequence group defined for this round of computation using the word table (checking table) from the previous round. Then, the second stage will use an additional and empty word table (clustering table) to run a clustering procedure in one thread on the current sequence group, and at the same time the remainingT−1 threads will run multiple checking procedures on the remaining sequences.

Since the clustering procedure may finish before or after the checking procedures, proper scheduling is used to guarantee all threads are active most of the time. At the end of each round, the clustering table will become the checking table of the next round, and the checking table of this round will be emptied to become the clustering table of the next round. More information including detailed description, illustration and pseudo codes, etc. are available in theSupplementary Material (Sections 1.3–1.5).

2.3 Other enhancements

Besides the parallelization, the new CD-HIT includes other enhancements such as faster file reading, better filtering threshold estimation, more efficient word counting and better alignment band estimation, etc. The new filtering threshold estimation is slightly more precise and can filter out more unnecessary alignments. Now word counting is handled more efficiently for input datasets with high redundancy, by maintaining a smaller counting array for hit representatives instead of a full counting array for every representatives. The improved alignment band estimation can find a narrower band for banded alignment.

2.4 Implementation

This enhanced CD-HIT is implemented in the C++ programming language and uses OpenMP (http://www.openmp.org) for parallelization. The parallel for construct of OpenMP is used for running the checking and clustering procedures in multiple threads with dynamic scheduling. Different computation data buffers are allocated for different threads. The checking word table is immutable and shared by multiple threads.

3 RESULTS

To see how much performance improvement has been achieved, we tested the new CD-HIT (V4.6) and the old CD-HIT (V3.1.2) on a set of datasets including two protein sequence datasets: SWISS-PROT (∼0.4 M sequences), NR (∼12 M sequences) and two DNA sequence datasets: HumanGut (MH0002, ∼23 M reads;Qinet al., 2010) and TwinStudy (SRP000319, ∼8 M reads;Turnbaughet al., 2009). Both the SWISS-PROT and the NR datasets were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) on October 20, 2010. We also compared CD-HIT with a similar program UCLUST (V5.1.221) fromEdgar (2010). These tests were done on a Debian Linux server with four 12-core AMD Opteron 6172 processors. Equivalent parameters were used for different programs whenever possible. Details and additional tests are available inSupplementary Material 2.

Table 1 compares the efficiency of the enhanced CD-HIT to the previous version of CD-HIT and the latest UCLUST. The results demonstrate that the new CD-HIT without using multi-core is significantly more efficient than the old one and is comparable to or more efficient than UCLUST as well. When multi-cores are used, the new CD-HIT is much more efficient than either of them. To test the effectiveness of the parallelization in the new CD-HIT, the full datasets were clustered using different number of cores. The time speedups are shown inFigure 1, which indicates the parallelization has good speedup for up to ∼24 cores with a quasi-linear speedup for up to ∼8 cores. Besides speed improvements, the new CD-HIT also has better clustering quality than the old CD-HIT and UCLUST (Supplementary Material and Table S2).

Table 1.

Comparison to the previous CD-HIT and UCLUST

Dataset	CD-HIT3 (min)	CD-HIT4 (min)	CD-HIT4 (8 cores) (min)	UCLUST5 (min)
Swissprot	80	58	12	15
NR	44	22	6	46
Twinstudy	47	19	4	56
HumanGut	494	42	8	214

Open in a new tab

UCLUST5 free version cannot run on the full NR, TwinStudy and HumanGut datasets, so subsets with ∼1 M sequences of NR, 1 M reads of TwinStudy and 4 M reads of HumanGut are used in this comparison.

Fig. 1. — Evaluation of CD-HIT parallelization: computational time speedup with respect to the number of used CPU cores

4 CONCLUSIONS

In this application note, we presented an enhanced CD-HIT that has been accelerated by a novel parallelization technique and a few other improvements. We tested and demonstrated that this new CD-HIT can achieve significant speedup over the previous CD-HIT using a single core, and its acceleration by multi-core computer can scale up well to a reasonable number of cores. Clustering on large datasets that normally runs for days can now finish in hours on multicore machines. We believe this enhanced CD-HIT will find more applications in handling the next-generation sequencing data.

Funding: This study was supported by National Institute of Health award R01RR025030 from the National Center for Research Resources and awardR01HG005978 from the National Human Genome Research Institute.

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

supp_28_23_3150__index.html^{(765B, html)}

References

Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461. [DOI] [PubMed] [Google Scholar]
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158. [DOI] [PubMed] [Google Scholar]
Li W, et al. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17:282–283. doi: 10.1093/bioinformatics/17.3.282. [DOI] [PubMed] [Google Scholar]
Loong SNK, Mishra SK. Unique folding of precursor microRNAs: quantitative evidence and implications for de novo identification. RNA. 2007;13:170–187. doi: 10.1261/rna.223807. [DOI] [PMC free article] [PubMed] [Google Scholar]
Niu B, et al. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics. 2010;11:187. doi: 10.1186/1471-2105-11-187. [DOI] [PMC free article] [PubMed] [Google Scholar]
Qin J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65. doi: 10.1038/nature08821. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rubinstein R, Fiser A. Predicting disulfide bond connectivity in proteins by correlated mutations analysis. Bioinformatics. 2008;24:498–504. doi: 10.1093/bioinformatics/btm637. [DOI] [PubMed] [Google Scholar]
Sun S, et al. Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res. 2011;39:D546–D551. doi: 10.1093/nar/gkq1102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Suzek BE, et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288. doi: 10.1093/bioinformatics/btm098. [DOI] [PubMed] [Google Scholar]
Turnbaugh PJ, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–484. doi: 10.1038/nature07540. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yooseph S, et al. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering. BMC Bioinformatics. 2008;9:182. doi: 10.1186/1471-2105-9-182. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

supp_28_23_3150__index.html^{(765B, html)}

supp_bts565_bts565.doc^{(1.2MB, doc)}

Movatterモバイル変換

PERMALINK

CD-HIT: accelerated for clustering the next-generation sequencing data

Limin Fu

Beifang Niu

Zhengwei Zhu

Sitao Wu

Weizhong Li

Abstract

1 INTRODUCTION

2 METHODS

2.1 Simplification

2.2 Parallelization

2.3 Other enhancements

2.4 Implementation

3 RESULTS

Table 1.

Fig. 1.

4 CONCLUSIONS

Supplementary Material

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

Movatterモバイル変換

PERMALINK

CD-HIT: accelerated for clustering the next-generation sequencing data

Limin Fu

Beifang Niu

Zhengwei Zhu

Sitao Wu

Weizhong Li

Abstract

1 INTRODUCTION

2 METHODS

2.1 Simplification

2.2 Parallelization

2.3 Other enhancements

2.4 Implementation

3 RESULTS

Table 1.

Fig. 1.

4 CONCLUSIONS

Supplementary Material

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases