Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.

License

NotificationsYou must be signed in to change notification settings

Zilong-Li/vcfpp

Repository files navigation

https://github.com/Zilong-Li/vcfpp/actions/workflows/linux.yml/badge.svghttps://github.com/Zilong-Li/vcfpp/actions/workflows/mac.yml/badge.svghttps://img.shields.io/badge/Documentation-latest-blue.svghttps://img.shields.io/github/v/release/Zilong-Li/vcfpp.svghttps://img.shields.io/github/license/Zilong-Li/vcfpp?style=plastic.svg

This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basichtslib, which can be easily included in a C++ program for scripting high-performance genomic analyses. Check outvcfppR as an example of how vcfpp.h can facilitate rapidly writing high-performance R package.

Features:

  • single file to be easily included and compiled
  • easy and safeAPI to use.
  • objects are RAII. no worry about allocate and free memory.
  • the full functionalities ofhtslib, e.g. supports of compressed VCF/BCF and URL link as filename.
  • compatible with C++11 and later

Table of Contents

Installation

  1. installhtslib on your system
  2. download the releasedvcfpp.h
  3. putvcfpp.h in the same folder as your cpp source file or a folder say/my/writable/path/ or the system path

Usage

The documentation of API ishere.

Reading VCF

In this example, we count the number of heterozygous genotypes for each sample in all records. You can paste the example code into aexample.cpp file and compile it byg++ example.cpp -std=c++11 -O3 -Wall -I. -lhts. You can replace-I. with-I/my/writable/path/ if you putvcfpp.h there.

#include"vcfpp.h"usingnamespacestd;usingnamespacevcfpp;intmain(intargc,char*argv[]){// read data from 1000 genomes serverBcfReadervcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");BcfRecordvar(vcf.header);// construct a variant recordvector<char>gt;// genotype can be bool, char or int typevector<int>hetsum(vcf.nsamples,0);while (vcf.getNextVariant(var)) {var.getGenotypes(gt);if (!var.isSNP()|| !var.isNoneMissing())continue;assert(var.ploidy()==2);// make sure it is diploidyfor(inti=0;i<gt.size() /2;i++)hetsum[i]+=abs(gt[2*i+0]-gt[2*i+1]);    }for (autoi :hetsum) {cout <<i <<endl; }return0;}

Genotypes Coding

There are 3 types used for genotypes, ie.vector<bool>,vector<char> andvector<int>. One can usevector<bool> andvector<char> for memory-effcient goal. The downside is that it only stores 0 and 1. Andvector<int> can store missing values and multialleles.

Code genotypes with missing allele as heterozygous.

If you usevector<bool> andvector<char> to store the genotypes, then there is no way to represent missing values. Hence the returned genotypes always have 0s and 1s. And genotypes with missing allele (eg.0/.,./0,1/.,./1,./.) are codes as1/0. It’s recommended to usevar.isNoneMissing() to check if there is missing value.

Code missing allele as -9

If this default behavior forvector<bool> andvector<char> is not what you want, you should usevector<int> to store the genotypes, then any missing allele will be coded as-9.Note you should take the missing value-9 into account for downstream analysis.

Writing VCF

There are many ways for writing the VCF/BCF file.

Use an empty template

Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.

BcfWriterbw("out.bcf.gz","VCF4.3");bw.header.addFORMAT("GT","1","String","Genotype");bw.header.addINFO("AF","A","Float","Estimated allele frequency in the range (0,1)");bw.header.addContig("chr20");// add chromosomefor (auto&s : {"id01","id02","id03"})bw.header.addSample(s);// add 3 samplesbw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");

Use an existing VCF as template

In this example, we first read VCF filetest/test-vcf-read.vcf.gz. Second, we construct a variant record and update the record with the input VCF. Third, we construct a BcfWriter object using the meta data in the header of the input VCF, writing out the header and the modified variant record.

BcfReaderbr("test/test.vcf.gz");BcfWriterbw("out.vcf.gz",br.header);BcfRecordvar(br.header);br.getNextVariant(var);bw.writeHeader();var.setPOS(100001);// update the POS of the variantbw.writeRecord(var);

In some cases where we want to modify the samples of a template VCF, we can associate theBcfWriter andBcfRecord with another modifiable header instead of the non-modifiable header as the above.

BcfReaderbr("test/test.vcf.gz");BcfWriterbw("out.vcf.gz");// copy header of template vcf and restrict to target samplebw.copyHeader("test/test.vcf.gz","HG00097");BcfRecordvar(bw.header);br.getNextVariant(var);var.setPOS(100001);bw.writeRecord(var);// output only sample HG00097

Variants Operation

All variants related API can be foundBcfRecord. The commonly used are listed below.

BcfReadervcf("bcf.gz");// construct a vcf readerBcfRecordvar(vcf.header);// construct an empty variant record associated with vcf headervcf.getNextVariant(var)// get next variantvector<char>gt;// genotype can be bool, char or int typevar.getGenotypes(gt),var.setGenotypes(gt);// get or set genotypes for current variantvar.isNoneMissing();// check if there is missing value after getting genotypesvector<int>gq;// genotype quality usually is of int typevar.getFORMAT("GQ",gq),var.setFORMAT("GQ",gq);// get or set a vector of genotypes qualityvector<int>pl;// Phred-scaled genotype likelihoods usually is of int typevar.getFORMAT("PL",pl);// get a vector of Phred-scaled genotype likelihoodsfloataf;var.getINFO("AF",af),var.setINFO("AF",af);// get or set AF (allele frequency) value in INFOintmq;var.getINFO("MQ",mq)// get MQ (Average mapping quality) value from INFOvector<int>dp4;// Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse basesvar.getINFO("DP4",dp4),var.setINFO("DP4",dp4);// get or set a vector of dp4 value from INFOvar.isSNP();// check if variant is SNPvar.isSV();// check if variant is SVvar.isIndel();// check if variant is indelvar.isMultiAllelic();// check if variant is MultiAllelicvar.POS(),var.setPOS();// get POS or modify POS

Header Operation

All variants related API can be found inBcfHeader.

Working with R

Examples of vcfpp working with R are in folderRcpp andhttps://github.com/Zilong-Li/vcfppR.

Working with Python

Examples of vcfpp working with Python are in folderPybind11.

Command Line Tools

Find more useful command line tools in foldertools.

About

a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages


[8]ページ先頭

©2009-2025 Movatter.jp