- Notifications
You must be signed in to change notification settings - Fork14
faylward/viralrecall
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
ViralRecall is a flexible command-line tool for detecting signatures ofgiant viruses (NCLDV) in genomic data. Version 2 has been updated tofocus more on NCLDV compared to version 1, but the original options arestill available.
ViralRecall is written in Python 3.5.6 and requires biopython,matplotlib, numpy, and pandas. ViralRecall uses Prodigal and HMMER3 forprotein prediction and HMM searches, respectively. Please ensure thesetools are installed in your PATH before using.
A requirements.yml file is provided in this repo to solve some of theissues related to outdated pandas version. The requirements.yml filespecifies conda environment dependencies so you don't have to installeach separately.After cloning repository, please follow these steps:
cd viralrecall
and
conda env create -f requirements.yml
If you wish to proceed without the requirements.yml file, simply create a condaenvironment by typingconda create -n viralrecall
. In that case you might haveto install some dependencies yourself. On a Unix system you should beable to install these tools with:
sudo apt install prodigal
and
sudo apt install hmmer
or if you don't have sudo privileges, you can try with conda:
conda install prodigal -c bioconda
and
conda install hmmer -c bioconda
Please ensure you are using > Python 3.5.2 and have the appropriatepython modules installed. If this is an issue please create a Pythonenvironment using conda (see herehttps://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
To start: > git clonehttps://github.com/faylward/viralrecall
cd viralrecall
ViralRecall was tested on Ubuntu 16.04 and should work on mostUnix-based systems. To see the help menu use: > python viralrecall.py-h
Viralrecall can be run using either of two viral HMM databases: 1)GVOGs, a custom set of Giant Virus Orthologous Groups that are fairlyspecific to Nucleo-Cytoplasmic Large DNA Viruses (NCLDV), or 2) VOGDB,which contains a wide collection of viral orthologous groups and isuseful for broad characterization of viral signatures.
In addition to either the GVOG or VOGDB searches, ViralRecall matchesproteins against the Pfam database (Pfam v. 32), which is abroad-specificity database that detects many protein families that arecommon in the genomes of cellular organisms.
The database files are available for download from Zenodo. To downloadand unpack, navigate to the folder that contains the viralrecall.pyscript and type:
wget -O hmm.tar.gzhttps://zenodo.org/records/12666277/files/hmm.tar.gz?download=1
and then
tar -xvzf hmm.tar.gz
This should create a hmm/ directory with the appropriate HMM files,including the gvog.hmm database and the vogdb.hmm database (downloadedfrom the vogdb.org website on 12/14/2020). This directory should belocated in the same directory as the acc/ directory and theviralrecall.py script.
Note: The GVOG database was updated in May 2021, and it is recommendedto use this latest version. The new database is substantially smaller,among other things, and this helps with runtime. But if you'd like toaccess the original GVOG database you can do so here:https://zenodo.org/record/4710691/files/hmm.tar.gz?download=1
After you have downloaded and unpacked the database files you should beable to run viralrecall.py. To see the help menu you can run: > pythonviralrecall.py -h
To test if ViralRecall will run properly type: > python viralrecall.py-i examples/arm29B.fna -p test_outdir -t 2 -f
Results should be located in the test_outdir folder. The output folderwill contain:
*.faa: The proteins predicted from the input file using Prodigal
*.fasta: The DNA coding sequence of the predicted proteins fromProdigal
*.full_annot.tsv: A full annotation table of the predicted ORFs. Thisincludes descriptions of the GVOG and Pfam annotations, so it can beuseful if you want to look at certain annoatations in more depth.
*.vregion.annot.tsv: An annotation of only the regions with some viralsignatures
*summary.tsv: Summary statistics for the predicted viral regions (orcontig-level stats if the -c flag was used). This also includes theNCLDV marker output (marker hit: bit score)
*.pfamout: Raw output of the Pfam HMMER3 search
*.vogout: Raw output of the GVOG or VOGDB HMMER3 search
*.markerout: Raw output of the NCLDV marker gene HMMER3 search
Additionally, for each viral region viralrecall will print out .faa and.fna files for the proteins and nucleotide sequences for the regionsfound. Please be sure to use only .fna files as input.
There are several parameters you can change in viralrecall depending onyour preferences and the data you're analyzing. The important parametersthat will influence the results are:
-db, --database This is the database usef for viral detection. Thedefault is the GVOG database (also can be specifified with "GVOG")."VOG" can be specified for the vogdb database, and "marker" to onlysearch against 10 NCLDV marker genes. GVOGs are more useful forNCLDV-specific searches. The "marker" option is much faster and may beuseful for quickly screening large datasets.
-s, --minscore This is the mean score that a genomic regions needsto have in order to pass the filter and get reported as a viral region.The score is calculated from the HMMER3 scores, with higher scoresindicating more and better matches to the GVOG database, and lowerscores indicating more and higher matches to the Pfam database. Thedefault is 10.
-w, --window Size of the sliding window to use for calculatingmoving averages. A smaller window may help predict short viral regions,but may split large viral regions into several pieces.
-m, --minsize Minimum size, in kilobases, of the viral regions toreport.
-g, --minvog Minimum number of hits against the GVOG database thatmust be recorded in a region in order for it to be reported (largervalues == higher confidence).
-c, --contiglevel If this option is used, mean ViralRecall scoreswill be provided for the input contigs rather than viral regions. Thisis useful for screening contigs for viral signatures.
-r, --redo If you have already run ViralRecall and you want tore-run it with different parameters, you can use the -r flag to avoidre-running Prodigal and HMMER, which are the most time-consuming steps.
-b, --batch Use this flag if the input is a folder of .fna files tosearch, rather than a single .fna file.
For example, if we wanted to recover regions of a eukaryotic contig withsignatures of NCLDV, we could use the following command:
python viralrecall.py -i examples/arm29B.fna -p test_outdir -s 15 -m30 -g 10
Here we are asking for only regions that have a mean score >= 15, areat least 30 kilobases long, and have at least 10 GVOG hits.
If we want to quickly re-do the above analysis with differentparameters, but without re-doing gene predictions and HMMER3 searches,we can use the -r flag:
python viralrecall.py -i examples/arm29B.fna -p test_outdir -s 15 -m15 -g 15 -w 20 -r
Maybe we want to re-do the analysis using a different e-value. Thedefault is 1e-10, which is fairly stringent, so we can relax it a bit:
python viralrecall.py -i examples/arm29B.fna -p test_outdir -s 15 -m 15-g 15 -w 20 -r -e 1e-5
So once you finished the hmmer searches you can easily re-calculatethings with the -r flag.
If you have many sequences you wish to test you can put them all in afolder and use batch mode. Here the input (-i) should point to a folderwith .fna files in it. Basic usage is:
python viralrecall.py -i examples/testfolder -p folderout -b
All of the output files should have their own folder in the folderoutdirectory. You can also use the -b flag with the -r flag for quickre-calculations.
Prodigal is intended to predict genes on prokaryotic genomes, and ittherefore will draw an error if used on very long eukaryotic contigs (>32 Mbp in length). I've re-compiled a binary of Prodigal that will runon longer contigs, and this is available in the bin/ folder of thisGitHub repo. This is NOT the default version of prodigal that is used -if you wish to use this binary you will need to make sure it is locatedin your PATH (and not any other version of prodigal you may haveinstalled).
Some users have noticed errors or warnings involving Pandas, which usesslightly different syntax in different Python versions. These can canusually be resolved by changing the Python version used to 3.5.4.
ViralRecall: A Flexible Command-Line Tool for the Detection of GiantVirus Signatures in Omic Data, FO Aylward and M Moniruzzaman, Viruses,2021; 13(2):150.https://doi.org/10.3390/v13020150.
For questions or comments feel free to email Frank Aylward at faylwardat vt dot edu
This tool requires Prodigal and HMMER3. Their citations are:
Hyatt et al. “Prodigal: prokaryotic gene recognition and translationinitiation site identification”. BMC bioinformatics, 2010.
Eddy, "A new generation of homology search tools based on probabilisticinference". Genome Informatics, 2009.