refresh-bio/vclust


Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomes


Vclust

Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenomically-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including the International Committee on Taxonomy of Viruses (ICTV) and the Minimum Information about an Uncultivated Virus Genome (MIUViG).

Features

💎 Accurate ANI calculations

Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, which makes LZ-ANI orders of magnitude faster than alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.

📐 Multiple similarity measures

Vclust offers multiple similarity measures between two genome sequences:

ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
Total ANI (tANI): The number of identical nucleotides between query-reference and reference-query genomes divided by the sum of the lengths of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
Number of local alignments: The number of local alignments between the two genome sequences.
Ratio between genome lengths: The length of the shorter genome divided by the longer one.
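The definitions above can be sketched in a few lines of Python. This is only an illustration, not Vclust's or LZ-ANI's implementation: the `LocalAlignment` record and all function names are hypothetical, and each alignment is assumed to be summarized by just its length and its count of identical nucleotides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalAlignment:
    """Hypothetical summary of one local alignment (not a Vclust type)."""
    aligned_len: int  # length of the alignment in nucleotides
    identical: int    # identical nucleotides within the alignment


def ani(alns: List[LocalAlignment]) -> float:
    """Identical nucleotides divided by the total alignment length."""
    total = sum(a.aligned_len for a in alns)
    return sum(a.identical for a in alns) / total if total else 0.0


def gani(alns: List[LocalAlignment], genome_len: int) -> float:
    """Identical nucleotides divided by the genome length."""
    return sum(a.identical for a in alns) / genome_len


def coverage(alns: List[LocalAlignment], genome_len: int) -> float:
    """Aligned fraction of the genome (assumes alignments do not overlap)."""
    return sum(a.aligned_len for a in alns) / genome_len


def tani(q2r: List[LocalAlignment], r2q: List[LocalAlignment],
         query_len: int, ref_len: int) -> float:
    """Identical nucleotides in both directions over the summed lengths."""
    identical = sum(a.identical for a in q2r) + sum(a.identical for a in r2q)
    return identical / (query_len + ref_len)


def length_ratio(len_a: int, len_b: int) -> float:
    """Length of the shorter genome divided by the longer one."""
    return min(len_a, len_b) / max(len_a, len_b)


# Made-up numbers: a 1000-nt query against a 2000-nt reference.
query_to_ref = [LocalAlignment(400, 380), LocalAlignment(100, 95)]
ref_to_query = [LocalAlignment(500, 470)]

print(ani(query_to_ref))                              # 475 / 500
print(gani(query_to_ref, 1000))                       # 475 / 1000
print(coverage(query_to_ref, 1000))                   # 500 / 1000
print(tani(query_to_ref, ref_to_query, 1000, 2000))   # 945 / 3000
print(length_ratio(1000, 2000))                       # 1000 / 2000
```

Under these simplified definitions, gANI equals ANI multiplied by coverage, which is why a high ANI over a short aligned region still yields a low gANI.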

🌟 Multiple clustering algorithms

Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.

Single-linkage
Complete-linkage
UCLUST
CD-HIT (Greedy incremental)
Greedy set cover (adopted from MMseqs2)
Leiden algorithm [optional]
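To give a feel for what threshold-based clustering does, here is a minimal sketch of single-linkage clustering over pairwise similarity values, written with a union-find structure. It is not Clusty's code: the function name, the input layout (tuples of two genome IDs and a similarity), and the example values are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage(ids: List[str],
                   pairs: Iterable[Tuple[str, str, float]],
                   threshold: float) -> List[List[str]]:
    """Group genomes connected by any chain of pairs with sim >= threshold."""
    parent: Dict[str, str] = {g: g for g in ids}

    def find(x: str) -> str:
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for g in ids:
        clusters.setdefault(find(g), []).append(g)
    # Largest clusters first; ties broken by member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Made-up similarities: A-C is below the threshold, but both link to B.
example_pairs = [("A", "B", 0.97), ("B", "C", 0.96),
                 ("A", "C", 0.90), ("C", "D", 0.80)]
print(single_linkage(["A", "B", "C", "D"], example_pairs, 0.95))
```

In this example A and C end up in the same cluster although their direct similarity is below the threshold, because both are linked through B. Complete-linkage differs precisely here: it requires every pair within a cluster to meet the threshold.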

🔥 Speed and efficiency

Vclust uses three efficient C++ tools (Kmer-db, LZ-ANI, and Clusty) for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.
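The idea behind prefiltering is to discard genome pairs that are too dissimilar to be worth aligning. The sketch below illustrates that idea with a naive shared k-mer count; it is not Kmer-db's algorithm, and the function names, the tiny `k`, and the `min_shared` cutoff are assumptions chosen only to keep the example readable.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter_pairs(genomes: Dict[str, str], k: int = 4,
                    min_shared: float = 0.5) -> List[Tuple[str, str]]:
    """Keep pairs sharing at least `min_shared` of the smaller k-mer set."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    kept = []
    for a, b in combinations(sorted(sets), 2):
        smaller = min(len(sets[a]), len(sets[b]))
        if smaller == 0:
            continue  # sequence shorter than k, nothing to compare
        shared = len(sets[a] & sets[b]) / smaller
        if shared >= min_shared:
            kept.append((a, b))
    return kept


# Toy sequences: g1 and g2 differ by one nucleotide, g3 is unrelated.
toy_genomes = {
    "g1": "ACGTACGTTGCA",
    "g2": "ACGTACGTTGCC",
    "g3": "GGGGGGCCCCCC",
}
print(prefilter_pairs(toy_genomes))
```

This naive version compares every pair directly, which would not scale to millions of genomes; it only shows why a cheap k-mer comparison can rule out most pairs before the more expensive alignment step.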

🌎 Web service

For datasets containing up to 1000 viral genomes, Vclust is available at http://www.vclust.org.

Quick start

# Install Vclust (requires Python >= 3.7)
pip install vclust

# Prefilter similar genome sequence pairs before conducting pairwise alignments.
vclust prefilter -i example/multifasta.fna -o fltr.txt

# Align similar genome sequence pairs and calculate pairwise ANI measures.
vclust align -i example/multifasta.fna -o ani.tsv --filter fltr.txt

# Cluster genome sequences based on given ANI measure and minimum threshold.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
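The three steps can also be driven from a Python script. The wrapper below is a sketch, not part of Vclust: `build_commands` and `run_pipeline` are hypothetical helper names, and the only options used are the ones shown in the quick start above. It assumes `vclust` is installed and available on `PATH`.

```python
import shlex
import subprocess
from typing import List


def build_commands(fasta: str, metric: str = "ani",
                   threshold: float = 0.95) -> List[List[str]]:
    """Build the prefilter, align, and cluster commands from the quick start."""
    prefilter = ["vclust", "prefilter", "-i", fasta, "-o", "fltr.txt"]
    align = ["vclust", "align", "-i", fasta, "-o", "ani.tsv",
             "--filter", "fltr.txt"]
    cluster = ["vclust", "cluster", "-i", "ani.tsv", "-o", "clusters.tsv",
               "--ids", "ani.ids.tsv", "--metric", metric,
               "--ani", str(threshold)]
    return [prefilter, align, cluster]


def run_pipeline(fasta: str) -> None:
    """Run the three steps in order, stopping at the first failure."""
    for cmd in build_commands(fasta):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for cmd in build_commands("example/multifasta.fna"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing each command as a list of arguments avoids shell quoting problems with file paths, and `check=True` keeps a failed prefilter or align step from silently feeding stale files into the next step.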

Documentation

The Vclust documentation is available on the GitHub Wiki and includes the following sections:

Features
Installation
Quick Start
Usage
1. Input data
2. Prefilter
3. Align
4. Cluster
5. Deduplicate
Optimizing sensitivity and resource usage
Use cases
1. Classify viruses into species and genera following ICTV standards
2. Assign viral contigs into vOTUs following MIUViG standards
3. Dereplicate viral contigs into representative genomes
4. Process large dataset of diverse virus genomes (IMG/VR)
5. Deduplicate (remove duplicate sequences) between and within multiple datasets
6. Process large dataset of highly redundant virus genomes
7. Cluster plasmid genomes into pOTUs
8. Calculate pairwise similarities between all-versus-all genomes
FAQ: Frequently Asked Questions

Citation

Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].

License

GNU General Public License, version 3


Releases

12 releases. Latest: v1.3.0 (Dec 20, 2024): Add the deduplicate command; improve verbosity; add more tests
