TheBeviMed package comes with a script containingfunctions to simplify reading allele count matrices from VCF files. Thefunctions depend ontabix, but have the advantage ofallowing variants in local regions to be read in, reducing the amount ofmemory consumed at any one time. However, if you want to analyse manyregions, it may be more efficient to read in larger parts of the file -in which case, a package such asvcfR might be moreappropriate.
In order to make the functions available, we must source thescript:
The script creates the functionvcf2matrix, whichdepends on the external programtabix (available fromhttp://www.htslib.org/download/) for reading allelecount matrices from VCF files. It uses arguments:
vcf_file_name - path to vcf file.chr -character value givingchromosome.from/to -integer valuesgiving from/to coordinates for chromosome.samples -character vector of sample namesas used in the VCF.include_variant_info -boolean valuedetermining whether to return just a matrix of allele counts(TRUE, default) or a list of allele count matrixG anddata.frame of variant informationinfo (FALSE). The variant informationinfo could be useful for filtering the variants, forexample if the VCF has not been pre-filtered for rare variants.description_columns -integer value givingnumber of columns of description fields in the VCF file (i.e. before thegenotype columns begin), defaults to9.warn_if_AF_greater_than -numeric valuegiving threshold allele frequency for generating a warning.You can invoke the function simply to obtain the allele count matrixand pass straight tobevimed, along with phenotypelabel:
ac_matrix<-vcf2matrix("my-vcf.vcf.gz",chr="2",from=1,to=1e4)pheno<-read.table(file="my-phenotype-data.txt",header=TRUE)bevimed(y=pheno$disease_status,G=ac_matrix)