



VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files



The VCF2Dis article has been published in GigaScience; please cite this article if possible.
Acceptance Date: 2025 Mar 5
Publication: GigaScience
Title: VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files
Doi :

1) Installation and Parameters

    1. Install

The new version will be updated and maintained in hewm2008/VCF2Dis; please click the link below to download the latest version.



Option 1 : Local compilation

Just sh to compile. The executable VCF2Dis can then be found at bin/VCF2Dis.
For Linux/Unix and macOS:

        tar -zxvf VCF2DisXXX.tar.gz             # if linking does not work, try re-installing the [zlib] library
        cd VCF2DisXXX;                          # [zlib] and copy it to the library dir
        sh;                                     # VCF2Dis-xx/src/include/zlib
        ./bin/VCF2Dis

Note: If linking fails, try re-installing the zlib library.
Note: R with ape, dplyr and ggtree is recommended.

Option 2: Docker container

You can use Docker to install and run VCF2Dis. Follow the steps below:

  1. Install Docker: Ensure Docker is installed on your system. If not, you can install it by following the Docker Official Documentation.
  2. Pull the Docker Image: Use the following commands to pull the VCF2Dis Docker image from the Alibaba Cloud Container Registry and run it:

        docker pull                                     # Docker image from the Alibaba Cloud Container Registry
        docker run -it --rm vcf2dis:v1.53e VCF2Dis      # after pulling the image, you can run the container

Option 3: Singularity container

  1. Install Singularity: Ensure Singularity is installed on your system. If not, you can install it by following the Singularity Official Documentation.
  2. Build the SIF File: Use the following commands to build a Singularity image file (SIF) from the Docker image and run it:

        singularity build vcf2dis_1.53e.sif docker://   # or you can download it as described in step 3
        singularity exec vcf2dis_1.53e.sif VCF2Dis

  3. Download the SIF File: Alternatively, you can download the prebuilt SIF file vcf2dis_1.53e.sif directly. Once downloaded, you can run it using Singularity.
    2. Main parameter description:

        Usage: VCF2Dis -InPut <in.vcf> -OutPut <p_dis.mat>

                -InPut     <str>     Input one or muti GATK VCF genotype File
                -OutPut    <str>     OutPut Sample p-Distance matrix

                -InList    <str>     Input GATK muti-chr VCF Path List
                -SubPop    <str>     SubGroup SampleList of VCF File [ALLsample]
                -Rand      <float>   Probability (0-1] for each site to join Calculation [1]

                -help                Show more help [hewm2008 v1.53s]

For more details, please use -help and see the example:

                -InFormat       <str>   Input File is [VCF/FA/PHY] Format,defaut: [VCF]
                -InSampleGroup  <str>   InFile of sample Group info,format(sample groupA)
                -TreeMethod     <int>   Construct Tree Method,1:NJ-tree 2:UPGMA-tree [1]
                -KeepMF                 Keep the Middle File diff & Use matrix

2) Example

Three examples are provided in the directory example/Example*

1) An example of an NJ-tree without bootstrap

    1. Create the p_distance matrix and construct the NJ-tree in Newick format

        # 1.1) To create the p_distance matrix and Newick tree for all samples in the VCF, run VCF2Dis directly
              ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis.mat
              #  ./bin/VCF2Dis -InPut in.fa.gz -OutPut p_dis.mat -InFormat FA
        # 1.2) To create the p_distance matrix and Newick tree for a subgroup of samples, put their sample names into the file sample.list
              ./bin/VCF2Dis -InPut chr1.vcf.gz chr2.vcf.gz -OutPut p_dis.mat -SubPop sample.list

    2. Simple tree visualization (for advanced tree display and annotation please refer to iTOL, Evolview, MEGA)
      After running VCF2Dis you will obtain the tree file p_dis.nwk and the neighbor-joining tree in PDF format, p_dis.pdf.

Note: if you can't get the p_dis.nwk tree file but have p_dis.mat, here are the 3 methods to get the tree file.

2) An example of an NJ-tree with bootstrap

    1. Run VCF2Dis multiple times using a method of sampling with replacement. Users can randomly select a part of the sites [-Rand], construct a new NJ-tree as above, and repeat NN times [recommended NN=100]. X=(1,2,...,NN);

        #!/bin/bash
        # Number of bootstrap replicates; can be overridden by the first argument
        NN=100
        if [ "$#" -eq 1 ]; then
            NN=$1
        fi
        for X in $(seq 1 $NN)
        do
            ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis_${X}.mat -Rand 0.25
            # PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis_${X}.mat -outfile tree.out1_${X}.txt -matrixtype s -treetype n -outtreefile tree.out2_${X}.tre
        done

    2. Merge all the NJ-trees, then construct and display a bootstrap NJ-tree. (For advanced tree display and annotation please refer to iTOL, Evolview and MEGA)

        #!/bin/bash
        # NN is the number of replicates; can be overridden by the first argument
        NN=100
        if [ "$#" -eq 1 ]; then
            NN=$1
        fi
        cat p_*.nwk > alltree_merge.tre
        # cat tree*.tre > alltree_merge.tre
        PHYLIPNEW-3.69.650/bin/fconsense -intreefile alltree_merge.tre -outfile out -treeprint Y
        perl ./bin/ alltree_merge.treefile $NN Final_boostrap.tre    # NN is the input number
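
The site selection behind -Rand in step 1 can be pictured with a small Python sketch. It assumes that each site is kept independently with the given probability, which is how the option description reads; the actual sampling inside VCF2Dis may differ. The function name sample_sites and the seed parameter are hypothetical and used only for illustration.

```python
import random


def sample_sites(n_sites, prob, seed=None):
    """Return the indices of sites kept for one replicate.

    Assumption: every site is included independently with probability
    `prob`, where `prob` lies in (0, 1] as in the -Rand description.
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so prob == 1.0 keeps every site
    return [i for i in range(n_sites) if rng.random() < prob]


if __name__ == "__main__":
    # One replicate over 1000 hypothetical sites, keeping about a quarter
    kept = sample_sites(1000, 0.25, seed=1)
    print(len(kept), "of 1000 sites kept")
```

Each replicate uses a different random subset of sites, so the resulting trees differ slightly; the fraction of replicates supporting a branch is what the merged bootstrap tree reports.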

For how to install PHYLIPNEW, please click here or here (Chinese).

4) Introduction

The formula for calculating the p-distance between individuals from VCF SNP datasets is listed below:

            D_ij=(1/L) * [(sum(d(l)_ij))]

where L is the length of the regions where SNPs can be identified and, given that the alleles at position l are A/C between sample i and sample j:

            d(l)_ij=0.0     if the genotypes of the two individuals were AA and AA;
            d(l)_ij=0.5     if the genotypes of the two individuals were AA and AC;
            d(l)_ij=0.0     if the genotypes of the two individuals were AC and AC;
            d(l)_ij=1.0     if the genotypes of the two individuals were AA and CC;
            d(l)_ij=0.0     if the genotypes of the two individuals were CC and CC;
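
As a worked illustration of the formula above, here is a minimal Python sketch, not the VCF2Dis implementation. It makes several assumptions that are not stated in this README: genotypes are given as biallelic VCF-style GT strings such as 0/0, 0/1 or 1|1; sites where either sample is missing are skipped; L is taken as the number of sites called in both samples; and AC versus CC scores 0.5 by symmetry with AA versus AC. The names alt_count and p_distance are hypothetical.

```python
def alt_count(gt):
    """Count alternate alleles in a biallelic GT string; None if not callable."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None  # missing ('./.') or non-biallelic genotype
    return sum(int(a) for a in alleles)


def p_distance(gts_i, gts_j):
    """D_ij = (1/L) * sum of d(l)_ij over sites called in both samples.

    d(l)_ij is |alt_i - alt_j| / 2, which reproduces the table:
    AA/AA=0, AA/AC=0.5, AC/AC=0, AA/CC=1, CC/CC=0.
    """
    total = 0.0
    n_sites = 0
    for gi, gj in zip(gts_i, gts_j):
        ai, aj = alt_count(gi), alt_count(gj)
        if ai is None or aj is None:
            continue  # assumption: skip sites with a missing genotype
        total += abs(ai - aj) / 2.0
        n_sites += 1
    return total / n_sites if n_sites else float("nan")


if __name__ == "__main__":
    sample_i = ["0/0", "0/0", "0/1", "0/0", "1/1"]
    sample_j = ["0/0", "0/1", "0/1", "1/1", "1/1"]
    # Per-site distances: 0, 0.5, 0, 1, 0 -> 1.5 / 5 sites
    print(p_distance(sample_i, sample_j))  # prints 0.3
```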

To learn more about the p_distance matrix based on the VCF file, please refer to this website.

5) Results

VCF2Dis has been cited more than 170 times according to a Google Scholar search.
Below are some NJ-tree images that I drew for earlier papers.

  • 50 Rices NBT
  • 31 soybeans NG
    Tree displayed with MEGA after running the test data: VCF2Dis -i ALL.chr*.genotypes.vcf.gz -SubPop subsample203.list -InSampleGroup


6) Discussion

######################swimming in the sky and flying in the sea ########################### ##



