Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomes
refresh-bio/vclust
Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenomically-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including the International Committee on Taxonomy of Viruses (ICTV) and Minimum Information about an Uncultivated Virus Genome (MIUViG).
Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, making LZ-ANI orders of magnitude faster than other alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.
Vclust offers multiple similarity measures between two genome sequences:
- ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
- Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
- Total ANI (tANI): The number of identical nucleotides between query-reference and reference-query genomes divided by the summed length of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
- Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
- Number of local alignments: The number of local alignments between the two genome sequences.
- Ratio between genome lengths: The length of the shorter genome divided by the longer one.
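The definitions above can be sketched in a few lines of Python. This is a minimal illustration and not LZ-ANI's implementation: the `LocalAlignment` record and all function names are hypothetical, and each local alignment is reduced to two numbers (its length and its count of identical nucleotides), which ignores details such as overlapping alignments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalAlignment:
    """Hypothetical record: one local alignment between two genomes."""
    aligned_length: int  # length of the alignment
    identical: int       # identical nucleotides within the alignment


def ani(alns: List[LocalAlignment]) -> float:
    """Identical nucleotides divided by the total length of the alignments."""
    total = sum(a.aligned_length for a in alns)
    return sum(a.identical for a in alns) / total if total else 0.0


def global_ani(alns: List[LocalAlignment], genome_length: int) -> float:
    """Identical nucleotides divided by the length of the genome."""
    return sum(a.identical for a in alns) / genome_length


def coverage(alns: List[LocalAlignment], genome_length: int) -> float:
    """Fraction of the genome covered by local alignments."""
    return sum(a.aligned_length for a in alns) / genome_length


def total_ani(qr: List[LocalAlignment], rq: List[LocalAlignment],
              len_q: int, len_r: int) -> float:
    """Identical nucleotides in both directions over the summed genome length."""
    identical = sum(a.identical for a in qr) + sum(a.identical for a in rq)
    return identical / (len_q + len_r)


def length_ratio(len_q: int, len_r: int) -> float:
    """Length of the shorter genome divided by the longer one."""
    return min(len_q, len_r) / max(len_q, len_r)


# Made-up numbers: a 2000 nt query and a 3000 nt reference.
qr = [LocalAlignment(1000, 950), LocalAlignment(500, 450)]  # query -> reference
rq = [LocalAlignment(1000, 950), LocalAlignment(500, 440)]  # reference -> query
print(ani(qr), global_ani(qr, 2000), coverage(qr, 2000))
print(total_ani(qr, rq, 2000, 3000), length_ratio(2000, 3000))
```

Note that ANI and gANI share a numerator and differ only in the denominator, so gANI equals ANI multiplied by coverage under this simplified model.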
Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.
- Single-linkage
- Complete-linkage
- UCLUST
- CD-HIT (Greedy incremental)
- Greedy set cover (adopted from MMseqs2)
- Leiden algorithm [optional]
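To make the first option concrete, here is a minimal sketch of single-linkage clustering over pairwise ANI values, written with a union-find structure. It illustrates the general technique only and is not Clusty's implementation; the function name, the input shape (a list of `(genome_a, genome_b, ani)` tuples), and the example values are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage(ids: Iterable[str],
                   pairs: Iterable[Tuple[str, str, float]],
                   min_ani: float) -> List[List[str]]:
    """Group genomes connected by any chain of pairs with ANI >= min_ani."""
    parent: Dict[str, str] = {g: g for g in ids}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, value in pairs:
        if value >= min_ani:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for g in parent:
        clusters.setdefault(find(g), []).append(g)
    # Largest clusters first, ties broken by member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Made-up example: A-B and B-C pass the threshold, A-C does not.
genomes = ["A", "B", "C", "D"]
ani_pairs = [("A", "B", 0.97), ("B", "C", 0.96), ("A", "C", 0.90), ("C", "D", 0.80)]
print(single_linkage(genomes, ani_pairs, min_ani=0.95))
```

In this example A and C end up in the same cluster through B even though their direct ANI is 0.90. Complete-linkage, by contrast, requires every pair within a cluster to meet the threshold, so it would not place A, B, and C together.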
Vclust uses three efficient C++ tools (Kmer-db, LZ-ANI, and Clusty) for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.
For datasets containing up to 1000 viral genomes, Vclust is available at http://www.vclust.org.

    # Install Vclust (requires Python >= 3.7)
    pip install vclust

    # Prefilter similar genome sequence pairs before conducting pairwise alignments.
    vclust prefilter -i example/multifasta.fna -o fltr.txt

    # Align similar genome sequence pairs and calculate pairwise ANI measures.
    vclust align -i example/multifasta.fna -o ani.tsv --filter fltr.txt

    # Cluster genome sequences based on given ANI measure and minimum threshold.
    vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95

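The three quick-start steps can also be chained from a script. The sketch below is one possible wrapper, not part of Vclust: the function names are hypothetical, it uses only the subcommands and options shown in the quick start, and it assumes that `vclust align` writes the `ani.ids.tsv` file next to `ani.tsv`, as the quick start implies.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def quickstart_commands(fasta: str, outdir: str = ".") -> List[List[str]]:
    """Build the prefilter, align, and cluster commands from the quick start."""
    out = Path(outdir)
    fltr = str(out / "fltr.txt")
    ani = str(out / "ani.tsv")
    ids = str(out / "ani.ids.tsv")  # assumed to be produced by `vclust align`
    clusters = str(out / "clusters.tsv")
    return [
        ["vclust", "prefilter", "-i", fasta, "-o", fltr],
        ["vclust", "align", "-i", fasta, "-o", ani, "--filter", fltr],
        ["vclust", "cluster", "-i", ani, "-o", clusters,
         "--ids", ids, "--metric", "ani", "--ani", "0.95"],
    ]


def run_quickstart(fasta: str, outdir: str = ".") -> None:
    """Run the three steps in order, stopping at the first failure."""
    if shutil.which("vclust") is None:
        raise RuntimeError("vclust not found on PATH; try: pip install vclust")
    for cmd in quickstart_commands(fasta, outdir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without Vclust.
    for cmd in quickstart_commands("example/multifasta.fna"):
        print(" ".join(cmd))
```

Passing each command as a list avoids shell quoting issues with file paths, and `check=True` keeps a failed prefilter or align step from feeding stale files into clustering.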
The Vclust documentation is available on the GitHub Wiki and includes the following sections:
- Features
- Installation
- Quick Start
- Usage
  - Input data
  - Prefilter
  - Align
  - Cluster
  - Deduplicate
- Optimizing sensitivity and resource usage
- Use cases
  - Classify viruses into species and genera following ICTV standards
  - Assign viral contigs into vOTUs following MIUViG standards
  - Dereplicate viral contigs into representative genomes
  - Process large dataset of diverse virus genomes (IMG/VR)
  - Deduplicate (remove duplicate sequences) between and within multiple datasets
  - Process large dataset of highly redundant virus genomes
  - Cluster plasmid genomes into pOTUs
  - Calculate pairwise similarities between all-versus-all genomes
- FAQ: Frequently Asked Questions
Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].