VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files
The VCF2Dis article has been published in GigaScience; please cite this article if possible.
Acceptance Date: 2025 Mar 5
Publication: GigaScience
Title: VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files
DOI: https://doi.org/10.1093/gigascience/giaf032
- Install
The new version will be updated and maintained in hewm2008/VCF2Dis; please click the link below to download the latest version.
Just run `sh make.sh` to compile. The executable `VCF2Dis` can be found in `bin/VCF2Dis`.
For Linux/Unix and macOS:

    tar -zxvf VCF2DisXXX.tar.gz     # if linking fails, try re-installing the
    cd VCF2DisXXX                   # zlib library and copying it to the library
    sh make.sh                      # directory VCF2Dis-xx/src/include/zlib
    ./bin/VCF2Dis

Note: If linking fails, try re-installing the zlib library.
Note: R with the packages ape, dplyr and ggtree is recommended.
You can use Docker to install and run VCF2Dis. Follow the steps below:
- Install Docker: Ensure Docker is installed on your system. If not, you can install it by following the Docker Official Documentation.
- Pull the Docker Image: Use the following commands to pull the VCF2Dis Docker image from the Alibaba Cloud Container Registry and run it:

      # Pull the Docker image from the Alibaba Cloud Container Registry
      docker pull registry.cn-shenzhen.aliyuncs.com/knight134/vcf2dis:v1.53e
      # After pulling the image, you can run the container
      docker run -it --rm registry.cn-shenzhen.aliyuncs.com/knight134/vcf2dis:v1.53e VCF2Dis
- Install Singularity: Ensure Singularity is installed on your system. If not, you can install it by following the Singularity Official Documentation.
- Build the SIF File: Use the following commands to build a Singularity image file (SIF) from the Docker image and run it:

      singularity build vcf2dis_1.53e.sif docker://registry.cn-shenzhen.aliyuncs.com/knight134/vcf2dis:v1.53e
      # Alternatively, download the prebuilt SIF file as described in the next step
      singularity exec vcf2dis_1.53e.sif VCF2Dis

- Download the SIF File: Alternatively, you can download the prebuilt SIF file, vcf2dis_1.53e.sif, directly. Once downloaded, you can run it using Singularity.
- Main parameter description:

    Usage: VCF2Dis -InPut <in.vcf> -OutPut <p_dis.mat>

        -InPut     <str>     Input one or muti GATK VCF genotype File
        -OutPut    <str>     OutPut Sample p-Distance matrix

        -InList    <str>     Input GATK muti-chr VCF Path List
        -SubPop    <str>     SubGroup SampleList of VCF File [ALLsample]
        -Rand      <float>   Probability (0-1] for each site to join Calculation [1]

        -help                Show more help [hewm2008 v1.53s]

For more details, please use `-help` and see the example.

        -InFormat       <str>   Input File is [VCF/FA/PHY] Format,defaut: [VCF]
        -InSampleGroup  <str>   InFile of sample Group info,format(sample groupA)
        -TreeMethod     <int>   Construct Tree Method,1:NJ-tree 2:UPGMA-tree [1]
        -KeepMF                 Keep the Middle File diff & Use matrix

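The `-InSampleGroup` option expects a file whose lines have the form `sample groupA`. Before passing such a file to VCF2Dis, it can help to check how many samples fall into each group. The sketch below assumes whitespace-separated columns; the helper name `count_groups` and the demo sample and group names are hypothetical and not part of VCF2Dis.

```shell
#!/bin/bash
# Count the samples per group in a two-column "sample group" file.
# Hypothetical helper for illustration; columns are assumed to be
# separated by spaces or tabs.
count_groups() {
    # $1: path to the sample group file
    awk 'NF >= 2 { c[$2]++ }
         END { for (g in c) printf "%s\t%d\n", g, c[g] }' "$1" | sort
}

# Demo with a small made-up group file
demo_groups=$(mktemp)
printf 'S1 popA\nS2 popA\nS3 popB\n' > "$demo_groups"
count_groups "$demo_groups"    # prints popA 2 and popB 1, tab-separated
rm -f "$demo_groups"
```

A group with an unexpectedly small count is often a sign of a typo in the group column.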
Three examples are provided in the directory `example/Example*`.
- To create the p_distance matrix and construct the NJ tree in Newick format

      # 1.1) To build the p_distance matrix and Newick tree for all samples in the VCF, run VCF2Dis directly
      ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis.mat
      # ./bin/VCF2Dis -InPut in.fa.gz -OutPut p_dis.mat -InFormat FA

      # 1.2) To build the p_distance matrix and Newick tree for a subgroup of samples, put their sample names into the file sample.list
      ./bin/VCF2Dis -InPut chr1.vcf.gz chr2.vcf.gz -OutPut p_dis.mat -SubPop sample.list

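To prepare a file for `-SubPop`, the sample names can be taken from the VCF header, where the `#CHROM` line lists them from the tenth column onward. The sketch below prints those names one per line. The helper name `list_vcf_samples` is hypothetical, and the one-name-per-line layout of `sample.list` is an assumption; check the files under `example/` for the exact format.

```shell
#!/bin/bash
# Print the sample names of a VCF file, one per line.
# Hypothetical helper for illustration; it is not part of VCF2Dis.
list_vcf_samples() {
    # $1: path to a VCF file (plain text or gzip-compressed)
    local vcf="$1"
    local reader="cat"
    case "$vcf" in
        *.gz) reader="gzip -dc" ;;
    esac
    # In the #CHROM header line, columns 10 and later are sample names
    $reader "$vcf" | awk -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Demo with a tiny made-up VCF
demo_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> "$demo_vcf"
printf '1\t100\t.\tA\tC\t.\t.\t.\tGT\t0/0\t0/1\t1/1\n' >> "$demo_vcf"
list_vcf_samples "$demo_vcf"    # prints S1, S2, S3 on separate lines
rm -f "$demo_vcf"
```

The output can be filtered (for example with `grep`) and redirected into `sample.list`.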
- Simple tree visualization (for advanced tree display and annotation, please refer to iTOL, Evolview, or MEGA)

After running VCF2Dis, you will obtain the tree file `p_dis.nwk` and a neighbor-joining tree in PDF format, `p_dis.pdf`.
Note: if you cannot get the `p_dis.nwk` tree file but have `p_dis.mat`, there are 3 methods to get the tree file.
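When only the matrix is available, a quick structural check can be useful before handing it to an external tree builder. The sketch below assumes `p_dis.mat` is a PHYLIP-style square matrix (first line: number of samples; then one row per sample with the name followed by all distances, without line wrapping). That layout is an assumption based on the `fneighbor -matrixtype s` call shown in this README, and the helper name `check_square_matrix` is hypothetical.

```shell
#!/bin/bash
# Check that a distance matrix file is square and complete.
# Assumed layout: line 1 holds N, followed by N rows of "name d1 ... dN".
# Hypothetical helper for illustration; it is not part of VCF2Dis.
check_square_matrix() {
    # $1: path to the matrix file; prints "ok N" or a diagnostic
    awk 'NR == 1 { n = $1 + 0; next }
         NF == 0 { next }                      # ignore blank lines
         {
             rows++
             if (NF != n + 1) {
                 printf "row %d (%s): expected %d distances, found %d\n", rows, $1, n, NF - 1
                 bad = 1
             }
         }
         END {
             if (!bad && rows != n) {
                 printf "expected %d rows, found %d\n", n, rows
                 bad = 1
             }
             if (bad) exit 1
             printf "ok %d\n", n
         }' "$1"
}

# Demo with a made-up 3-sample matrix
demo_mat=$(mktemp)
printf '3\nS1 0.0 0.1 0.2\nS2 0.1 0.0 0.3\nS3 0.2 0.3 0.0\n' > "$demo_mat"
check_square_matrix "$demo_mat"    # prints: ok 3
rm -f "$demo_mat"
```

The function returns a non-zero status when a row is short or missing, so it can guard a later tree-building step in a script.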
- Run VCF2Dis multiple times using a method of sampling with replacement. Users can randomly select a subset of the sites with `-Rand` and construct a new NJ tree as above, repeating this NN times (NN=100 is recommended), with X = 1, 2, ..., NN:

      #!/bin/bash
      NN=100
      if [ "$#" -eq 1 ]; then
          NN=$1
      fi
      for X in $(seq 1 $NN)
      do
          ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis_${X}.mat -Rand 0.25
          # PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis_${X}.mat -outfile tree.out1_${X}.txt -matrixtype s -treetype n -outtreefile tree.out2_${X}.tre
      done

- Merge all the NJ trees, then construct and display a bootstrap NJ tree (for advanced tree display and annotation, please refer to iTOL, Evolview, and MEGA):

      #!/bin/bash
      NN=100
      if [ "$#" -eq 1 ]; then
          NN=$1
      fi
      cat p_*.nwk > alltree_merge.tre
      # cat tree*.tre > alltree_merge.tre
      PHYLIPNEW-3.69.650/bin/fconsense -intreefile alltree_merge.tre -outfile out -treeprint Y
      perl ./bin/percentageboostrapTree.pl alltree_merge.treefile $NN Final_boostrap.tre   # NN is the input number

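Because the support values are expressed relative to NN, it is worth confirming that the merged file really contains NN trees before running the consensus step. The sketch below counts trees by counting the `;` terminators of the Newick strings. The helper name `count_newick_trees` is hypothetical, and the count assumes that sample names do not themselves contain a semicolon.

```shell
#!/bin/bash
# Count the Newick trees in a file by counting ";" terminators.
# Hypothetical helper for illustration; it is not part of VCF2Dis.
count_newick_trees() {
    # $1: path to a file holding one or more Newick trees
    tr -cd ';' < "$1" | wc -c | tr -d ' '
}

# Demo with three made-up replicate trees
demo_trees=$(mktemp)
printf '((A,B),C);\n((A,C),B);\n((A,B),C);\n' > "$demo_trees"
count_newick_trees "$demo_trees"    # prints: 3
rm -f "$demo_trees"
```

In the merge script, the result for `alltree_merge.tre` can be compared with `$NN`; a smaller number suggests that some replicates failed to produce a tree.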
For instructions on how to install PHYLIPNEW, please click here or here (Chinese).
The formula for calculating the p-distance between individuals from VCF SNP datasets is:

$$D_{ij} = \frac{1}{L} \sum_{l=1}^{L} d^{(l)}_{ij}$$

where $L$ is the length of the regions where SNPs can be identified. Given that the alleles at position $l$ are A/C between sample $i$ and sample $j$:

- $d^{(l)}_{ij} = 0.0$ if the genotypes of the two individuals are AA and AA;
- $d^{(l)}_{ij} = 0.5$ if the genotypes of the two individuals are AA and AC;
- $d^{(l)}_{ij} = 0.0$ if the genotypes of the two individuals are AC and AC;
- $d^{(l)}_{ij} = 1.0$ if the genotypes of the two individuals are AA and CC;
- $d^{(l)}_{ij} = 0.0$ if the genotypes of the two individuals are CC and CC.
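As a toy illustration of the rules above, the sketch below computes the distance for two short genotype lists written as biallelic VCF-style GT strings (`0/0`, `0/1`, `1/1`). It is not the VCF2Dis implementation: the helper name `p_distance` is hypothetical, missing genotypes are not handled, and the per-site score is taken as half the difference in ALT allele counts, which reproduces the five cases listed.

```shell
#!/bin/bash
# Toy p-distance between two samples, following the per-site rules above.
# Hypothetical helper for illustration; it is not part of VCF2Dis and
# does not handle missing genotypes such as "./.".
p_distance() {
    # $1 and $2: space-separated genotype lists of equal length
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, ga, " ")
        m = split(b, gb, " ")
        if (n != m || n == 0) { print "NA"; exit }
        total = 0
        for (l = 1; l <= n; l++) {
            # Count ALT alleles per genotype: 0/0 -> 0, 0/1 -> 1, 1/1 -> 2
            da = gsub(/1/, "1", ga[l])
            db = gsub(/1/, "1", gb[l])
            diff = da - db
            if (diff < 0) diff = -diff
            total += diff / 2          # 0, 0.5 or 1 for this site
        }
        printf "%.4f\n", total / n
    }'
}

# Four sites: AA/AA, AA/AC, AC/AC, AA/CC -> (0 + 0.5 + 0 + 1) / 4
p_distance "0/0 0/0 0/1 0/0" "0/0 0/1 0/1 1/1"    # prints: 0.3750
```

Note that two heterozygous genotypes (AC and AC) contribute 0 under this scheme, exactly as in the list above.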
To learn more about the p_distance matrix based on VCF files, please refer to this website.
VCF2Dis has been cited more than 170 times, according to a Google Scholar search.
Below are some NJ-tree images that I drew for earlier papers.
- 50 rice accessions (NBT)
- 31 soybeans NG
Tree displayed with MEGA after running VCF2Dis on the test data: `VCF2Dis -i ALL.chr*.genotypes.vcf.gz -SubPop subsample203.list -InSampleGroup pop.info`
- 📧 hewm2008@gmail.com / hewm2008@qq.com
- Join the QQ Group: 125293663
- Other contributors: Lian Xu (xulian@ntu.edu.cn); Xun Liao (1911751806@qq.com)
######################swimming in the sky and flying in the sea ########################### ##