- Notifications
You must be signed in to change notification settings - Fork5
a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.
License
Zilong-Li/vcfpp
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basichtslib, which can be easily included in a C++ program for scripting high-performance genomic analyses. Check outvcfppR as an example of how vcfpp.h can facilitate rapidly writing high-performance R package.
Features:
- single file to be easily included and compiled
- easy and safeAPI to use.
- objects are RAII. no worry about allocate and free memory.
- the full functionalities of
htslib, e.g. supports of compressed VCF/BCF and URL link as filename. - compatible with C++11 and later
- installhtslib on your system
- download the releasedvcfpp.h
- put
vcfpp.hin the same folder as your cpp source file or a folder say/my/writable/path/or the system path
The documentation of API ishere.
In this example, we count the number of heterozygous genotypes for each sample in all records. You can paste the example code into aexample.cpp file and compile it byg++ example.cpp -std=c++11 -O3 -Wall -I. -lhts. You can replace-I. with-I/my/writable/path/ if you putvcfpp.h there.
#include"vcfpp.h"usingnamespacestd;usingnamespacevcfpp;intmain(intargc,char*argv[]){// read data from 1000 genomes serverBcfReadervcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");BcfRecordvar(vcf.header);// construct a variant recordvector<char>gt;// genotype can be bool, char or int typevector<int>hetsum(vcf.nsamples,0);while (vcf.getNextVariant(var)) {var.getGenotypes(gt);if (!var.isSNP()|| !var.isNoneMissing())continue;assert(var.ploidy()==2);// make sure it is diploidyfor(inti=0;i<gt.size() /2;i++)hetsum[i]+=abs(gt[2*i+0]-gt[2*i+1]); }for (autoi :hetsum) {cout <<i <<endl; }return0;}
There are 3 types used for genotypes, ie.vector<bool>,vector<char> andvector<int>. One can usevector<bool> andvector<char> for memory-effcient goal. The downside is that it only stores 0 and 1. Andvector<int> can store missing values and multialleles.
If you usevector<bool> andvector<char> to store the genotypes, then there is no way to represent missing values. Hence the returned genotypes always have 0s and 1s. And genotypes with missing allele (eg.0/.,./0,1/.,./1,./.) are codes as1/0. It’s recommended to usevar.isNoneMissing() to check if there is missing value.
If this default behavior forvector<bool> andvector<char> is not what you want, you should usevector<int> to store the genotypes, then any missing allele will be coded as-9.Note you should take the missing value-9 into account for downstream analysis.
There are many ways for writing the VCF/BCF file.
Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.
BcfWriterbw("out.bcf.gz","VCF4.3");bw.header.addFORMAT("GT","1","String","Genotype");bw.header.addINFO("AF","A","Float","Estimated allele frequency in the range (0,1)");bw.header.addContig("chr20");// add chromosomefor (auto&s : {"id01","id02","id03"})bw.header.addSample(s);// add 3 samplesbw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");
In this example, we first read VCF filetest/test-vcf-read.vcf.gz. Second, we construct a variant record and update the record with the input VCF. Third, we construct a BcfWriter object using the meta data in the header of the input VCF, writing out the header and the modified variant record.
BcfReaderbr("test/test.vcf.gz");BcfWriterbw("out.vcf.gz",br.header);BcfRecordvar(br.header);br.getNextVariant(var);bw.writeHeader();var.setPOS(100001);// update the POS of the variantbw.writeRecord(var);
In some cases where we want to modify the samples of a template VCF, we can associate theBcfWriter andBcfRecord with another modifiable header instead of the non-modifiable header as the above.
BcfReaderbr("test/test.vcf.gz");BcfWriterbw("out.vcf.gz");// copy header of template vcf and restrict to target samplebw.copyHeader("test/test.vcf.gz","HG00097");BcfRecordvar(bw.header);br.getNextVariant(var);var.setPOS(100001);bw.writeRecord(var);// output only sample HG00097
All variants related API can be foundBcfRecord. The commonly used are listed below.
BcfReadervcf("bcf.gz");// construct a vcf readerBcfRecordvar(vcf.header);// construct an empty variant record associated with vcf headervcf.getNextVariant(var)// get next variantvector<char>gt;// genotype can be bool, char or int typevar.getGenotypes(gt),var.setGenotypes(gt);// get or set genotypes for current variantvar.isNoneMissing();// check if there is missing value after getting genotypesvector<int>gq;// genotype quality usually is of int typevar.getFORMAT("GQ",gq),var.setFORMAT("GQ",gq);// get or set a vector of genotypes qualityvector<int>pl;// Phred-scaled genotype likelihoods usually is of int typevar.getFORMAT("PL",pl);// get a vector of Phred-scaled genotype likelihoodsfloataf;var.getINFO("AF",af),var.setINFO("AF",af);// get or set AF (allele frequency) value in INFOintmq;var.getINFO("MQ",mq)// get MQ (Average mapping quality) value from INFOvector<int>dp4;// Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse basesvar.getINFO("DP4",dp4),var.setINFO("DP4",dp4);// get or set a vector of dp4 value from INFOvar.isSNP();// check if variant is SNPvar.isSV();// check if variant is SVvar.isIndel();// check if variant is indelvar.isMultiAllelic();// check if variant is MultiAllelicvar.POS(),var.setPOS();// get POS or modify POS
All variants related API can be found inBcfHeader.
Examples of vcfpp working with R are in folderRcpp andhttps://github.com/Zilong-Li/vcfppR.
Examples of vcfpp working with Python are in folderPybind11.
Find more useful command line tools in foldertools.
About
a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.