Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

R Bindings for htslib/bcf

License

NotificationsYou must be signed in to change notification settings

lindenb/rbcf

Repository files navigation

R Bindings for htslib/bcf+vcf

Build Status

Last commit

INSTALL

Requirements: R, git, curl-dev,

git clone "https://github.com/lindenb/rbcf"cd rbcfmake

Author

Pierre Lindenbaum PhD. Institut du Thorax. Nantes, France.@yokofakun

Examples

Show Htslib and Rbcf versions

Code:

# load the librarylibrary(rbcf)#print the version of the associated htslib paste("HTSLIB:",htslib.version())#print the version of rbcfpaste("RBCF:",rcbf.version())

Output:

[1] "HTSLIB: 1.10.2"[1] "RBCF: 0.0-1"

Open and close a VCF file

Code:

# load rbcflibrary(rbcf)# we don't need the index for this filefp <- bcf.open("./data/rotavirus_rf.01.vcf",FALSE)# error (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# dispose the vcf readerbcf.close(fp)print("Done.")

Output:

[1] TRUE[1] "Done."

Print the INFOs in the VCF header

Code:

# load rbcflibrary(rbcf)# we don't need the index for this filefp <- bcf.open("./data/rotavirus_rf.01.vcf",FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# print INFObcf.infos(fp)# dispose the vcf readerbcf.close(fp)# print the table

Output:

         ID Number    TypeINDEL INDEL      0    FlagIDV     IDV      1 IntegerIMF     IMF      1   FloatDP       DP      1 IntegerVDB     VDB      1   FloatRPB     RPB      1   FloatMQB     MQB      1   FloatBQB     BQB      1   FloatMQSB   MQSB      1   FloatSGB     SGB      1   FloatMQ0F   MQ0F      1   FloatICB     ICB      1   FloatHOB     HOB      1   FloatAC       AC      A IntegerAN       AN      1 IntegerDP4     DP4      4 IntegerMQ       MQ      1 Integer                                                                                         Descri(...)INDEL                                                      "Indicates that the variant is an IN(...)IDV                                                    "Maximum number of reads supporting an i(...)IMF                                                  "Maximum fraction of reads supporting an i(...)DP                                                                                  "Raw read d(...)VDB   "Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is bet(...)RPB                                   "Mann-Whitney U test of Read Position Bias (bigger is bet(...)MQB                                 "Mann-Whitney U test of Mapping Quality Bias (bigger is bet(...)BQB                                    "Mann-Whitney U test of Base Quality Bias (bigger is bet(...)MQSB                      "Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is bet(...)SGB                                                                      "Segregation based met(...)MQ0F                                                     "Fraction of MQ0 reads (smaller is bet(...)ICB                                        "Inbreeding Coefficient Binomial test (bigger is bet(...)HOB                                          "Bias in the number of HOMs number (smaller is bet(...)AC                      "Allele count in genotypes for each ALT allele, in the same order as li(...)AN                                                     "Total number of alleles in called genot(...)DP4            "Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse b(...)MQ                                                                         "Average mapping qua(...)[1] TRUE

Print the FORMATs in the VCF header

Code:

# load rbcflibrary(rbcf)# we don't need the index for this filefp <- bcf.open("./data/rotavirus_rf.01.vcf",FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# print FORMATbcf.formats(fp)# dispose the vcf readerbcf.close(fp)

Output:

   ID Number    Type                                 DescriptionPL PL      G Integer "List of Phred-scaled genotype likelihoods"GT GT      1  String                                  "Genotype"[1] TRUE

Print the FILTERs in the VCF header

Code:

# load rbcflibrary(rbcf)# we don't need the index for this filefp <- bcf.open("./data/gnomad.exomes.r2.0.1.sites.bcf",FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# print FILTERsbcf.filters(fp)# dispose the vcf readerbcf.close(fp)

Output:

                             IDPASS                       PASSAC0                         AC0InbreedingCoeff InbreedingCoeffLCR                         LCRRF                           RFSEGDUP                   SEGDUP                                                                                               (...)PASS                                                                                           (...)AC0             "Allele Count is zero (i.e. no high-confidence genotype (GQ >= 20, DP >= 10, AB(...)InbreedingCoeff                                                                                (...)LCR                                                                                        "In (...)RF                                                  "Failed random forests filters (SNV cutoff (...)SEGDUP                                                                              "In a segme(...)[1] TRUE

Print the Samples in the VCF header

The samples are defined in the '#CHROM' line of the VCF

Code:

# load rbcflibrary(rbcf)# we don't need the index for this filefp <- bcf.open("./data/rotavirus_rf.01.vcf",FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# print the number of samplespaste("Number of samples:",bcf.nsamples(fp))# get the name for the 1st samplepaste("First sample:",bcf.sample.at(fp,1))# get the 1-based index for the samplesbcf.sample2index(fp,c("S1","S2","S3","missing"))# get all the samplesbcf.samples(fp)# dispose the vcf readerbcf.close(fp)

Output:

[1] "Number of samples: 5"[1] "First sample: S1"     S1      S2      S3 missing       1       2       3       0 [1] "S1" "S2" "S3" "S4" "S5"[1] TRUE

Print the Dictionary in the VCF header

Code:

# load rbcflibrary(rbcf)# we don't need the index for this filefp <- bcf.open("./data/rotavirus_rf.01.vcf",FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# print the dictionarybcf.dictionary(fp)# dispose the vcf readerbcf.close(fp)

Output:

     chrom sizeRF01  RF01 3302RF02  RF02 2687RF03  RF03 2592RF04  RF04 2362RF05  RF05 1579RF06  RF06 1356RF07  RF07 1074RF08  RF08 1059RF09  RF09 1062RF10  RF10  751RF11  RF11  666[1] TRUE

Print the Indexed Chromosomes

Code:

# load rbcflibrary(rbcf)# Open the indexed VCFfp <- bcf.open("./data/rotavirus_rf.02.vcf.gz")# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# get the indexed contigsbcf.contigs(fp)# dispose the vcf readerbcf.close(fp)

Output:

 [1] "RF01" "RF02" "RF03" "RF04" "RF05" "RF06" "RF07" "RF08" "RF09" "RF10"[11] "RF11"[1] TRUE

Scanning the variants

Code:

# load rbcflibrary(rbcf)# create a function counting variants in a VCFcount.variants<-function(filename) {# we don't need the index for this filefp <- bcf.open(filename,FALSE)# error on openingif(is.null(fp)) return(-1)# number of variantsn<-0# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {# increment the countn<-n+1}# dispose the vcf readerbcf.close(fp)# return the number of variantn}# filenamesvcfs<-c("./data/gnomad.exomes.r2.0.1.sites.bcf","./data/rotavirus_rf.01.vcf","./data/rotavirus_rf.02.vcf.gz","./data/rotavirus_rf.03.vcf.gz","./data/rotavirus_rf.04.bcf")# print the number of variants for each vcffor(f in vcfs) {cat(paste(f," ",count.variants(f),"\n"))}

Output:

./data/gnomad.exomes.r2.0.1.sites.bcf   50 ./data/rotavirus_rf.01.vcf   45 ./data/rotavirus_rf.02.vcf.gz   45 ./data/rotavirus_rf.03.vcf.gz   45 ./data/rotavirus_rf.04.bcf   45

Scanning the variants

Code:

# load rbcflibrary(rbcf)# create a function counting variants in a VCFcount.variants<-function(filename,predicate) {# we don't need the index for this filefp <- bcf.open(filename,FALSE)# error on openingif(is.null(fp)) return(-1)# number of variantsn<-0# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {# test the variantif(predicate(vc)) {# increment the countn<-n+1}}# dispose the vcf readerbcf.close(fp)# return the number of variantn}# A vcffilename <- "./data/gnomad.exomes.r2.0.1.sites.bcf"# filtersfilters<-list(list("desc"="accept all","predicate"=function(ctx) {TRUE} ),list("desc"="accept none","predicate"=function(ctx) {FALSE} ),list("desc"="CHROM is '1'","predicate"=function(ctx) { variant.contig(ctx)=="1"} ),list("desc"="POS is even","predicate"=function(ctx) { (variant.pos(ctx)%%2)==1} ),list("desc"="PASS filter","predicate"=function(ctx) {!variant.is.filtered(ctx)} ),list("desc"="count(FILTER)>1","predicate"=function(ctx) {length(variant.filters(ctx))>1} ),list("desc"="FILTER contains SEGDUP","predicate"=function(ctx) {variant.has.filter(ctx,"SEGDUP")} ),list("desc"="SNP","predicate"=function(ctx) {variant.is.snp(ctx)} ),list("desc"="POS!=END","predicate"=function(ctx) { variant.pos(ctx)!=variant.end(ctx)} ),list("desc"="not diallelic","predicate"=function(ctx) {variant.nalleles(ctx)!=2} ),list("desc"="REF is 'A'","predicate"=function(ctx) {variant.reference(ctx)=="A"} ),list("desc"="any allele is 'A'","predicate"=function(ctx) {"A"  %in% variant.alleles(ctx)} ),list("desc"="any ALT allele is 'A'","predicate"=function(ctx) {"A"  %in% variant.alt.alleles(ctx)} ),list("desc"="No QUAL","predicate"=function(ctx) {!variant.has.qual(ctx)} ),list("desc"="variant has ID","predicate"=function(ctx) {variant.has.id(ctx)}),list("desc"="variant ID match 'rs1*' ","predicate"=function(ctx) {grepl("^rs1",variant.id(ctx))}),list("desc"="variant has INFO/AF_NFE","predicate"=function(ctx) {variant.has.attribute(ctx,"AF_NFE")}),list("desc"="variant has INFO/AF_NFE > 1E-5","predicate"=function(ctx) {variant.has.attribute(ctx,"AF_NFE") && length(which(variant.float.attribute(ctx,"AF_NFE") > 1E-5))>0}),list("desc"="Missense in PLEKHN1 (VEP)","predicate"=function(ctx) {# NO VEP annotation ?if(!variant.has.attribute(ctx,"CSQ")) return(FALSE);# get VEP annotationpredictions <- variant.vep(ctx)# In SCN5Apredictions <- predictions[which(predictions$SYMBOL=="PLEKHN1"),]# Consequence must contain missensepredictions <- predictions[grep("missense_variant",predictions$Consequence),]nrow(predictions)>0}))# count the variant for each filterfor(flt in filters) {print(paste(basename(filename)," filter:",flt[["desc"]]," count:",count.variants(filename,flt[["predicate"]]),"\n"))}

Output:

[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: accept all  count: 50 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: accept none  count: 0 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: CHROM is '1'  count: 50 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: POS is even  count: 24 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: PASS filter  count: 48 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: count(FILTER)>1  count: 2 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: FILTER contains SEGDUP  count: 1 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: SNP  count: 47 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: POS!=END  count: 3 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: not diallelic  count: 8 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: REF is 'A'  count: 6 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: any allele is 'A'  count: 27 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: any ALT allele is 'A'  count: 21 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: No QUAL  count: 1 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: variant has ID  count: 34 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: variant ID match 'rs1*'   count: 2 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: variant has INFO/AF_NFE  count: 50 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: variant has INFO/AF_NFE > 1E-5  count: 14 \n"[1] "gnomad.exomes.r2.0.1.sites.bcf  filter: Missense in PLEKHN1 (VEP)  count: 13 \n"

Print a VEP table for a Variant

Code:

# load rbcflibrary(rbcf)# A vcffilename <- "./data/gnomad.exomes.r2.0.1.sites.bcf"# we don't need the index for this filefp <- bcf.open(filename,FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# current variantvc <- NULLwhile(!is.null(vc<-bcf.next(fp))) {#find the first variant having an INFO/CSQ attributeif(variant.has.attribute(vc,"CSQ")) break;}if(!is.null(vc)) {# get the VEP table for the variantpredictions<-variant.vep(vc)}# dispose the vcf readerbcf.close(fp)# showpredictions

Output:

[1] TRUE   Allele             Consequence   IMPACT   SYMBOL            Gene1       C downstream_gene_variant MODIFIER   KLHL17 ENSG000001879612       A downstream_gene_variant MODIFIER   KLHL17 ENSG000001879613       C downstream_gene_variant MODIFIER C1orf170 ENSG000001876424       A downstream_gene_variant MODIFIER C1orf170 ENSG000001876425       C          intron_variant MODIFIER  PLEKHN1 ENSG000001875836       A          intron_variant MODIFIER  PLEKHN1 ENSG000001875837       C          intron_variant MODIFIER  PLEKHN1 ENSG000001875838       A          intron_variant MODIFIER  PLEKHN1 ENSG000001875839       C          intron_variant MODIFIER  PLEKHN1 ENSG0000018758310      A          intron_variant MODIFIER  PLEKHN1 ENSG0000018758311      C downstream_gene_variant MODIFIER C1orf170 ENSG0000018764212      A downstream_gene_variant MODIFIER C1orf170 ENSG0000018764213      C downstream_gene_variant MODIFIER C1orf170 ENSG0000018764214      A downstream_gene_variant MODIFIER C1orf170 ENSG0000018764215      C   upstream_gene_variant MODIFIER  PLEKHN1 ENSG0000018758316      A   upstream_gene_variant MODIFIER  PLEKHN1 ENSG0000018758317      C   upstream_gene_variant MODIFIER  PLEKHN1 ENSG0000018758318      A   upstream_gene_variant MODIFIER  PLEKHN1 ENSG00000187583   Feature_type         Feature         BIOTYPE EXON INTRON1    Transcript ENST00000338591  protein_coding            2    Transcript ENST00000338591  protein_coding            3    Transcript ENST00000341290  protein_coding            4    Transcript ENST00000341290  protein_coding            5    Transcript ENST00000379407  protein_coding        2/146    Transcript ENST00000379407  protein_coding        2/147    Transcript ENST00000379409  protein_coding        2/148    Transcript ENST00000379409  protein_coding        2/149    Transcript ENST00000379410  protein_coding        2/1510   Transcript ENST00000379410  protein_coding        2/1511   Transcript ENST00000433179  protein_coding            12   Transcript ENST00000433179  protein_coding            13   Transcript ENST00000479361 retained_intron            14   Transcript ENST00000479361 retained_intron            15   Transcript ENST00000480267 retained_intron            16   Transcript ENST00000480267 retained_intron            17   Transcript ENST00000491024  protein_coding            18   Transcript ENST00000491024  protein_coding                                       HGVSc HGVSp cDNA_position CDS_position1                                                                2                                                                3                                                                4                                                                5  ENST00000379407.3:c.184-51G>C                                 6  ENST00000379407.3:c.184-51G>A                                 7  ENST00000379409.2:c.184-51G>C                                 8  ENST00000379409.2:c.184-51G>A                                 9  ENST00000379410.3:c.184-51G>C                                 10 ENST00000379410.3:c.184-51G>A                                 11                                                               12                                                               13                                                               14                                                               15                                                               16                                                               17                                                               18                                                                  Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE1                                             rs540662886          1     45112                                             rs540662886          2     45113                                             rs540662886          1     49784                                             rs540662886          2     49785                                             rs540662886          1         6                                             rs540662886          2         7                                             rs540662886          1         8                                             rs540662886          2         9                                             rs540662886          1         10                                            rs540662886          2         11                                            rs540662886          1     497312                                            rs540662886          2     497313                                            rs540662886          1     497914                                            rs540662886          2     497915                                            rs540662886          1      64916                                            rs540662886          2      64917                                            rs540662886          1     328618                                            rs540662886          2     3286   STRAND        FLAGS VARIANT_CLASS MINIMISED SYMBOL_SOURCE HGNC_ID CANONICAL1       1                        SNV                    HGNC   24023       YES2       1                        SNV                    HGNC   24023       YES3      -1                        SNV                    HGNC   28208          4      -1                        SNV                    HGNC   28208          5       1                        SNV                    HGNC   25284          6       1                        SNV                    HGNC   25284          7       1                        SNV                    HGNC   25284          8       1                        SNV                    HGNC   25284          9       1                        SNV                    HGNC   25284       YES10      1                        SNV                    HGNC   25284       YES11     -1                        SNV                    HGNC   28208       YES12     -1                        SNV                    HGNC   28208       YES13     -1                        SNV                    HGNC   28208          14     -1                        SNV                    HGNC   28208          15      1                        SNV                    HGNC   25284          16      1                        SNV                    HGNC   25284          17      1 cds_start_NF           SNV                    HGNC   25284          18      1 cds_start_NF           SNV                    HGNC   25284             TSL APPRIS        CCDS            ENSP SWISSPROT        TREMBL       UNIPARC1             CCDS30550.1 ENSP00000343930    Q6TDP4 Q0VGE6&B3KXL7 UPI00001DFBF02             CCDS30550.1 ENSP00000343930    Q6TDP4 Q0VGE6&B3KXL7 UPI00001DFBF03                         ENSP00000343864                         UPI000022DAF44                         ENSP00000343864                         UPI000022DAF45             CCDS53256.1 ENSP00000368717    Q494U1        J3KSM5 UPI00005764FF6             CCDS53256.1 ENSP00000368717    Q494U1        J3KSM5 UPI00005764FF7                         ENSP00000368719    Q494U1        J3KSM5 UPI0000D61E068                         ENSP00000368719    Q494U1        J3KSM5 UPI0000D61E069                 CCDS4.1 ENSP00000368720    Q494U1        J3KSM5 UPI00001416D810                CCDS4.1 ENSP00000368720    Q494U1        J3KSM5 UPI00001416D811                        ENSP00000414022    Q5SV97               UPI0000418FB012                        ENSP00000414022    Q5SV97               UPI0000418FB013                                                                             14                                                                             15                                                                             16                                                                             17                        ENSP00000462558                  J3KSM5 UPI000268AE1F18                        ENSP00000462558                  J3KSM5 UPI000268AE1F   GENE_PHENO SIFT PolyPhen DOMAINS HGVS_OFFSET     GMAF AFR_MAF AMR_MAF1                                               C:0.0008     C:0     C:02                                               C:0.0008     C:0     C:03                                               C:0.0008     C:0     C:04                                               C:0.0008     C:0     C:05                                               C:0.0008     C:0     C:06                                               C:0.0008     C:0     C:07                                               C:0.0008     C:0     C:08                                               C:0.0008     C:0     C:09                                               C:0.0008     C:0     C:010                                              C:0.0008     C:0     C:011                                              C:0.0008     C:0     C:012                                              C:0.0008     C:0     C:013                                              C:0.0008     C:0     C:014                                              C:0.0008     C:0     C:015                                              C:0.0008     C:0     C:016                                              C:0.0008     C:0     C:017                                              C:0.0008     C:0     C:018                                              C:0.0008     C:0     C:0   EAS_MAF EUR_MAF SAS_MAF AA_MAF EA_MAF ExAC_MAF ExAC_Adj_MAF ExAC_AFR_MAF1      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-042      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-043      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-044      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-045      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-046      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-047      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-048      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-049      C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0410     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0411     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0412     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0413     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0414     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0415     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0416     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0417     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-0418     C:0 C:0.004     C:0    C:0                          C:0  C:2.146e-04   ExAC_AMR_MAF ExAC_EAS_MAF ExAC_FIN_MAF ExAC_NFE_MAF ExAC_OTH_MAF1           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-052           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-053           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-054           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-055           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-056           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-057           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-058           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-059           C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0510          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0511          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0512          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0513          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0514          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0515          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0516          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0517          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-0518          C:0  C:0.0002281   C:0.002986          C:0  C:1.606e-05   ExAC_SAS_MAF CLIN_SIG SOMATIC PHENO PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS1           C:0                                                                2           C:0                                                                3           C:0                                                                4           C:0                                                                5           C:0                                                                6           C:0                                                                7           C:0                                                                8           C:0                                                                9           C:0                                                                10          C:0                                                                11          C:0                                                                12          C:0                                                                13          C:0                                                                14          C:0                                                                15          C:0                                                                16          C:0                                                                17          C:0                                                                18          C:0                                                                   MOTIF_SCORE_CHANGE LoF LoF_filter LoF_flags LoF_info"1                                                       2                                                       3                                                       4                                                       5                                                       6                                                       7                                                       8                                                       9                                                       10                                                      11                                                      12                                                      13                                                      14                                                      15                                                      16                                                      17                                                      18

Print a SNPEFF table for a Variant

Code:

# load rbcflibrary(rbcf)# A vcffilename <- "./data/rotavirus_rf.ann.vcf.gz"# we don't need the index for this filefp <- bcf.open(filename,FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# current variantvc <- NULLwhile(!is.null(vc<-bcf.next(fp))) {#find the first variant having an INFO/ANN attributeif(variant.has.attribute(vc,"ANN")) break;}if(!is.null(vc)) {# get SNPEFF tablepredictions<-variant.snpeff(vc)}# dispose the vcf readerbcf.close(fp)# showpredictions

Output:

[1] TRUE  Allele       Annotation Annotation_Impact    Gene_Name      Gene_ID1      C missense_variant          MODERATE Gene_18_3284 Gene_18_3284  Feature_Type Feature_ID Transcript_BioType Rank   HGVS.c      HGVS.p1   transcript AAA47319.1     protein_coding  1/1 c.952A>C p.Lys318Gln  cDNA.pos / cDNA.length CDS.pos / CDS.length AA.pos / AA.length Distance1               952/3267             952/3267           318/1088           ERRORS / WARNINGS / INFO'"1

Query the indexed vcf using intervals

Code:

# load rbcflibrary(rbcf)# create a function counting variants in a VCF, in some intervalscount.variants<-function(filename,intervals) {# open the indexed VCFfp <- bcf.open(filename)# error on openingif(is.null(fp)) return(-1)# loop over the intervalsfor(interval in intervals) {# try query the intervalif(bcf.query(fp,interval)) {# number of variantsn<-0# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {# increment the countn<-n+1}print(paste("Number of variants in ",basename(filename),"/'",interval,"' :",n,sep=""))}# query failedelse {print(paste("Cannot query ",basename(filename),"/'",interval,"'",sep=""))}}# dispose the vcf readerbcf.close(fp)}some_intervals <-c("","RF03","RF03:2000-3000","1:1-10000000","chr1")count.variants("./data/rotavirus_rf.02.vcf.gz",some_intervals)count.variants("./data/1000G.ALL.2of4intersection.20100804.genotypes.bcf",some_intervals)# another way to query is set collect=TRUE to return a vector of variantfp <- bcf.open("./data/rotavirus_rf.02.vcf.gz")if(!is.null(fp)) {print(paste("Number of variants using collect:",length(bcf.query(fp,"RF03",collect=TRUE))))bcf.close(fp)}

Output:

[1] "Cannot query rotavirus_rf.02.vcf.gz/''"[1] "Number of variants in rotavirus_rf.02.vcf.gz/'RF03' :8"[1] "Number of variants in rotavirus_rf.02.vcf.gz/'RF03:2000-3000' :4"[1] "Cannot query rotavirus_rf.02.vcf.gz/'1:1-10000000'"[1] "Cannot query rotavirus_rf.02.vcf.gz/'chr1'"[1] TRUE[1] "Cannot query 1000G.ALL.2of4intersection.20100804.genotypes.bcf/''"[1] "Cannot query 1000G.ALL.2of4intersection.20100804.genotypes.bcf/'RF03'"[1] "Cannot query 1000G.ALL.2of4intersection.20100804.genotypes.bcf/'RF03:2000-3000'"[1] "Number of variants in 1000G.ALL.2of4intersection.20100804.genotypes.bcf/'1:1-10000000' :11"[1] "Cannot query 1000G.ALL.2of4intersection.20100804.genotypes.bcf/'chr1'"[1] TRUE[1] "Number of variants using collect: 8"[1] TRUE

Attribute in INFO

Code:

# load rbcflibrary(rbcf)# find given variantfind.variant<-function(fp,contig,pos) {if(!bcf.query(fp,paste(contig,":",pos,"-",pos,sep=""))) return(NULL)# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {return(vc)}return(NULL)}filename<-"./data/gnomad.exomes.r2.0.1.sites.bcf"# open the VCF with indexfp <- bcf.open(filename)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)ctx <-find.variant(fp,"1",905608)stopifnot(variant.has.attribute(ctx,"CSQ"))print(paste("CSQ(no split) ",variant.string.attribute(ctx,"CSQ",split=FALSE)))print(paste("CSQ(split) ",variant.string.attribute(ctx,"CSQ")))stopifnot(variant.has.attribute(ctx,"AN_POPMAX"))print(paste("AN_POPMAX:",variant.int.attribute(ctx,"AN_POPMAX")))stopifnot(variant.has.attribute(ctx,"AF_POPMAX"))print(paste("AF_POPMAX:",variant.float.attribute(ctx,"AF_POPMAX")))print(paste("flag:VQSR_NEGATIVE_TRAIN_SITE:",variant.flag.attribute(ctx,"VQSR_NEGATIVE_TRAIN_SITE")))# dispose the vcf readerbcf.close(fp)

Output:

[1] "CSQ(no split)  T|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961|Transcript|ENST00(...) [1] "CSQ(split)  T|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961|Transcript|ENST0000(...) [2] "CSQ(split)  A|downstream_gene_variant|MODIFIER|KLHL17|ENSG00000187961|Transcript|ENST0000(...) [3] "CSQ(split)  T|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642|Transcript|ENST00(...) [4] "CSQ(split)  A|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642|Transcript|ENST00(...) [5] "CSQ(split)  T|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379407|(...) [6] "CSQ(split)  A|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379407|(...) [7] "CSQ(split)  T|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379409|(...) [8] "CSQ(split)  A|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379409|(...) [9] "CSQ(split)  T|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379410|(...)[10] "CSQ(split)  A|intron_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000379410|(...)[11] "CSQ(split)  T|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642|Transcript|ENST00(...)[12] "CSQ(split)  A|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642|Transcript|ENST00(...)[13] "CSQ(split)  T|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642|Transcript|ENST00(...)[14] "CSQ(split)  A|downstream_gene_variant|MODIFIER|C1orf170|ENSG00000187642|Transcript|ENST00(...)[15] "CSQ(split)  T|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000(...)[16] "CSQ(split)  A|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000(...)[17] "CSQ(split)  T|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000(...)[18] "CSQ(split)  A|upstream_gene_variant|MODIFIER|PLEKHN1|ENSG00000187583|Transcript|ENST00000(...)[1] "AN_POPMAX: 106408" "AN_POPMAX: 106408"[1] "AF_POPMAX: 1.87955993169453e-05" "AF_POPMAX: 9.39778965403093e-06"[1] "flag:VQSR_NEGATIVE_TRAIN_SITE: FALSE"[1] TRUE

Working with Genotypes

Code:

# load rbcflibrary(rbcf)# find given variantfind.variant<-function(fp,contig,pos) {if(!bcf.query(fp,paste(contig,":",pos,"-",pos,sep=""))) return(NULL)# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {return(vc)}return(NULL)}filename<-"./data/1000G.ALL.2of4intersection.20100804.genotypes.bcf"# open the VCF with indexfp <- bcf.open(filename)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# find a variantctx <-find.variant(fp,"1",10583)print(paste("Number of genotypes ",variant.nsamples(ctx)))# get 10-th genotypegt<-variant.genotype(ctx,10)print(paste("sample ",genotype.sample(gt)))# get genotype by namegt<-variant.genotype(ctx,"NA18997")print(paste("sample ",genotype.sample(gt)))print(paste("alleles ",genotype.alleles.idx0(gt)))print(paste("genotype ploidy ? ",genotype.ploidy(gt)))print(paste("genotype is hom ref ? ",genotype.homref(gt)))print(paste("genotype is het ? ",genotype.het(gt)))print(paste("genotype is het-non-ref ? ",genotype.hetnonref(gt)))print(paste("genotype is phased ? ",genotype.phased(gt)))print(paste("genotype is no call ? ",genotype.nocall(gt)))print(paste("genotype FORMAT/OG ? ",genotype.string.attribute(gt,"OG")))print(paste("genotype FORMAT/GQ ? ",genotype.int.attribute(gt,"GQ")))# hum spec says gt should be integerprint(paste("genotype has GQ ? ",genotype.has.gq(gt)))print(paste("genotype GQ ",genotype.gq(gt)))print(paste("genotype has DP ? ",genotype.has.dp(gt)))print(paste("genotype DP ",genotype.int.attribute(gt,"DP")))print(paste("genotype DP ",genotype.dp(gt)))print(paste("genotype has PL ? ",genotype.has.pl(gt)))print(paste("genotype PL ",genotype.pl(gt)))print(paste("genotype has AD ? ",genotype.has.ad(gt)))print(paste("genotype AD ",genotype.ad(gt)))# dispose the vcf readerbcf.close(fp)

Output:

[1] "Number of genotypes  629"[1] "sample  HG00120"[1] "sample  NA18997"[1] "alleles  0" "alleles  1"[1] "genotype ploidy ?  2"[1] "genotype is hom ref ?  FALSE"[1] "genotype is het ?  TRUE"[1] "genotype is het-non-ref ?  FALSE"[1] "genotype is phased ?  TRUE"[1] "genotype is no call ?  FALSE"[1] "genotype FORMAT/OG ?  1/1"[1] "genotype FORMAT/GQ ?  "[1] "genotype has GQ ?  FALSE"[1] "genotype GQ  -1"[1] "genotype has DP ?  TRUE"[1] "genotype DP  1"[1] "genotype DP  1"[1] "genotype has PL ?  FALSE"[1] "genotype PL  "[1] "genotype has AD ?  TRUE"[1] "genotype AD  4" "genotype AD  1"[1] TRUE

Working with vectorized Genotypes

Code:

# load rbcflibrary(rbcf)# find given variantfind.variant<-function(fp,contig,pos) {if(!bcf.query(fp,paste(contig,":",pos,"-",pos,sep=""))) return(NULL)# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {return(vc)}return(NULL)}filename<-"./data/1000G.ALL.2of4intersection.20100804.genotypes.bcf"# open the VCF with indexfp <- bcf.open(filename)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# find a variantctx <-find.variant(fp,"1",10583)print(paste("Number of genotypes ",variant.nsamples(ctx)))# Retrieve the DP for all genotypes (length of vector equals to number of samples)dp <- variant.genotypes.int.attribute(ctx, "DP")stopifnot(length(dp) == variant.nsamples(ctx))stopifnot(is.integer(dp))cat("DP: ", paste(dp, collapse = ", "), "\n")# Retrieve the AD for all genotypes (length of vector equals to 2x number of samples)#  The first two number contain the AD for REF and ALT for the first sample respectively.ad <- variant.genotypes.int.attribute(ctx, "AD")stopifnot(length(ad) == 2*variant.nsamples(ctx))stopifnot(is.integer(ad))cat("AD: ", paste(ad, collapse = ", "), "\n")# Helper function to get all the GT indizesgt <- variant.genotypes.allele.idx0(ctx)stopifnot(length(gt) == 2*variant.nsamples(ctx))stopifnot(is.integer(gt))cat("GT - Integers: ", paste(gt, collapse = ", "), "\n")# Helper function to get all the GT allele-countsgt <- variant.genotypes.allele.counts(ctx, 0)stopifnot(length(gt) == variant.nsamples(ctx))stopifnot(is.integer(gt))cat("GT - Allele Counts (Reference): ", paste(gt, collapse = ", "), "\n")gt <- variant.genotypes.allele.counts(ctx, 1)stopifnot(length(gt) == variant.nsamples(ctx))stopifnot(is.integer(gt))cat("GT - Allele Counts (Alternative): ", paste(gt, collapse = ", "), "\n")# Helper function to get all the GT stringsgt <- variant.genotypes.allele.strings(ctx)stopifnot(length(gt) == variant.nsamples(ctx))stopifnot(is.character(gt))cat("GT - Strings: ", paste(gt, collapse = ", "), "\n")# dispose the vcf readerbcf.close(fp)

Output:

[1] "Number of genotypes  629"DP:  NA, NA, NA, NA, NA, NA, 2, NA, 1, 5, NA, NA, 1, NA, NA, 3, NA, NA, NA, 4, 1, 1, 1, 1, NA, (...)AD:  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 0, NA, NA, 5, 1, 4, 1, NA, NA, NA, NA, (...)GT - Integers:  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0(...)GT - Allele Counts (Reference):  2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,(...)GT - Allele Counts (Alternative):  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, (...)GT - Strings:  0|0, 0|0, 0|0, 0|0, 0|0, 0|0, 0|0, 0|0, 0|0, 0|1, 0|0, 0|0, 0|0, 0|0, 0|0, 0|0, (...)[1] TRUE

Writing variants to a new VCF/BCF file

Code:

# load rbcflibrary(rbcf)# vcf input filenamefilenamein = "./data/rotavirus_rf.01.vcf"# output vcf filename. "-" is standard outputfilenameout =  "-"fp <- bcf.open(filenamein,FALSE)# error on opening (exit 0 for tests)if(is.null(fp)) quit(save="no",status=0,runLast=FALSE)# create a new VCF writer using the header from 'fp'out <- bcf.new.writer(fp,filenameout)# error on opening (exit 0 for tests)if(is.null(out)) quit(save="no",status=0,runLast=FALSE)# loop while we can read a variantwhile(!is.null(vc<-bcf.next(fp))) {# only write POS%10==0if(variant.pos(vc)%%10==0) { # write variantbcf.write.variant(out,vc);}}# dispose the vcf readerbcf.close(fp)# dispose the vcf rwriterbcf.close(out);

Output:

[1] TRUE##fileformat=VCFv4.2##FILTER=<ID=PASS,Description="All filters passed">##samtoolsVersion=1.3.1+htslib-1.3.1##samtoolsCommand=samtools mpileup -Ou -f rotavirus_rf.fa S1.bam S2.bam S3.bam S4.bam S5.bam##reference=file://rotavirus_rf.fa##contig=<ID=RF01,length=3302>##contig=<ID=RF02,length=2687>##contig=<ID=RF03,length=2592>##contig=<ID=RF04,length=2362>##contig=<ID=RF05,length=1579>##contig=<ID=RF06,length=1356>##contig=<ID=RF07,length=1074>##contig=<ID=RF08,length=1059>##contig=<ID=RF09,length=1062>##contig=<ID=RF10,length=751>##contig=<ID=RF11,length=666>##ALT=<ID=*,Description="Represents allele(s) other than observed.">##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site(...)##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigg(...)##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bi(...)##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigge(...)##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Stra(...)##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is(...)##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is b(...)##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele,(...)##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reve(...)##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">##bcftools_callVersion=1.3-10-g820e1d6+htslib-1.2.1-267-g87141ea##bcftools_callCommand=call -vm -Oz -o rotavirus_rf.vcf.gz -##bcftools_viewVersion=1.10-6-g2782d9f+htslib-1.2.1-1336-g7c16b56-dirty##bcftools_viewCommand=view /home/lindenb/src/jvarkit/src/test/resources/rotavirus_rf.vcf.gz; D(...)#CHROMPOSIDREFALTQUALFILTERINFOFORMATS1S2S3S4S5RF01970.AC48.6696.DP=36;VDB=0.693968;SGB=10.3229;RPB=0.658863;MQB=1;MQSB=1;BQB=0.572843;(...)RF032150.TA6.90687.DP=37;VDB=0.557348;SGB=-1.00336;RPB=0.579851;MQB=1;MQSB=1;BQB=1;MQ0F=(...)RF041900.AC36.8224.DP=39;VDB=0.706942;SGB=7.10992;RPB=0.81363;MQB=1;MQSB=1;BQB=1;MQ0F=0;(...)RF041920.AT42.014.DP=39;VDB=0.966939;SGB=0.816387;RPB=0.864752;MQB=1;MQSB=1;BQB=0.864752(...)

[8]ページ先頭

©2009-2025 Movatter.jp