Movatterモバイル変換

Type:

Package

Title:

Detect Copy Number Variants from SNPs Data

Version:

1.3.0

Date:

2024-09-20

Language:

en-US

Maintainer:

Piyal Karunarathne <piyalkarumail@yahoo.com>

Description:

Functions in this package will import filtered variant call format (VCF) files of SNPs data and generate data sets to detect copy number variants, visualize them and do downstream analyses with copy number variants(e.g. Environmental association analyses).

License:

AGPL (≥ 3)

Imports:

data.table, graphics, colorspace, R.utils, qgraph, stringr

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.2

Depends:

R (≥ 3.6.0)

Suggests:

rmarkdown, knitr, testthat (≥ 3.0.0), covr

Config/testthat/edition:

URL:

https://piyalkarum.github.io/rCNV/,https://cran.r-project.org/package=rCNV

BugReports:

https://github.com/piyalkarum/rCNV/issues

Packaged:

2024-09-20 12:14:33 UTC; piyalkaru

NeedsCompilation:

Author:

Piyal Karunarathne

[aut, cre], Qiujie Zhou

[aut], Klaus Schliep

[aut], Pascal Milesi

[aut]

Repository:

CRAN

Date/Publication:

2024-09-20 12:40:02 UTC

Normalized allele depth example data

Description

Normalized example SNPs data of Chinook Salmon from Larson et al. 2014The data has been normalized with TMM

Usage

data(ADnorm)

Format

An object of classlist of length 2.

References

Larson, W. A., Seeb, L. W., Everett, M. V., Waples, R. K.,Templin, W. D., & Seeb, J. E. (2014). Genotyping by sequencing resolvesshallow population structure to inform conservation of Chinook salmon(Oncorhynchus tshawytscha). Evolutionary Applications, 7(3)

Allele Depth (AD) example data

Description

Example SNPs data of Chinook Salmon from Larson et al. et al. 2014.The data contains only a partial snps data set of RadSeq data afterfiltering.

Usage

data(ADtable)

Format

An object of classdata.frame with 3000 rows and 109 columns.

References

Larson, W. A., Seeb, L. W., Everett, M. V., Waples, R. K., Templin,W. D., & Seeb, J. E. (2014). Genotyping by sequencing resolves shallowpopulation structure to inform conservation of Chinook salmon(Oncorhynchus tshawytscha). Evolutionary Applications, 7(3), 355-369.
McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017).Paralogs are revealed by proportion of heterozygotes and deviations inread ratios in genotyping by sequencing data from natural populations.Molecular Ecology Resources, 17(4)

Correct allele depth values

Description

A function to correct depth values with odd number of coverage values due tosequencing anomalies or miss classification where genotype is homozygous anddepth values indicate heterozygosity.The function adds a value of one to the allele with the lowest depth valuefor when odd number anomalies or make the depth value zero for whenmiss-classified. The genotype table must be provided for the latter.

Usage

ad.correct(  het.table,  gt.table = NULL,  odd.correct = TRUE,  verbose = TRUE,  parallel = FALSE)

Arguments

het.table

allele depth table generated from the functionhetTgen

gt.table

genotype table generated from the function hetTgen

odd.correct

logical, to correct for odd number anomalies in AD values.defaultTRUE

verbose

logical. show progress. DefaultTRUE

parallel

logical. whether to parallelize the process

Value

Returns the coverage corrected allele depth table similar to theoutput ofhetTgen

Author(s)

Piyal Karunarathne

Examples

## Not run: adc<-ad.correct(ADtable)

Generate allele frequency table for individuals or populations

Description

Get alternative allele frequency across all individuals per SNP from thegenotype or allele depth tables

Usage

allele.freq(gtt, f.typ = c("pop", "ind"), verbose = TRUE)

Arguments

gtt

a list or data frame of genotype and/or allele depth table produced fromhetTgen (or similar)

f.typ

character. type of allele frequency to be calculated (individual"ind" or population"pop")

verbose

logical. whether to show the progress of the analysis

Details

If the allele frequencies to be calculated for populations from both genotype table and the allele depth table, they must be provided in a list with element namesAD for allele depth table andGT for the genotype table. See the examples.

Value

Returns a data frame or a list (if both genotype and allele depth used)of allele frequencies

Author(s)

Piyal Karunarathne

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)het.table<-hetTgen(vcf,"GT")ad.table<-hetTgen(vcf,"AD")# for individual based AFfrQ<-allele.freq(het.table,f.typ="ind")#for population-wise and both allele depth and genotype tablesfrQ<-allele.freq(list(AD=ad.table,GT=het.table),f.typ="pop")## End(Not run)

Get allele information for duplicate detection

Description

The function to calculate allele median ratios, proportion of heterozygotesand allele probability values under different assumptions (see details),and their chi-square significance values for duplicate detection

Usage

allele.info(  X,  x.norm = NULL,  Fis,  method = c("MedR", "QN", "pca", "TMM", "TMMex"),  logratioTrim = 0.3,  sumTrim = 0.05,  Weighting = TRUE,  Acutoff = -1e+10,  plot.allele.cov = TRUE,  verbose = TRUE,  parallel = FALSE,  ...)

Arguments

X

allele depth table generated from the functionhetTgen (non-normalized)

x.norm

a data frame of normalized allele coverage, output ofcpm.normal. If not provided, calculated usingX.

Fis

numeric. Inbreeding coefficient calculated usingh.zygosity() function

method

character. method to be used for normalization(seecpm.normal details). DefaultTMM

logratioTrim

numeric. percentage value (0 - 1) of variation to betrimmed in log transformation

sumTrim

numeric. amount of trim to use on the combined absolutelevels (“A” values) for methodTMM

Weighting

logical, whether to compute (asymptotic binomial precision)weights

Acutoff

numeric, cutoff on “A” values to use before trimming

plot.allele.cov

logical, plot comparative plots of allele depthcoverage in homozygotes and heterozygotes

verbose

logical, whether to print progress

parallel

logical. whether to parallelize the process

...

further arguments to be passed toplot

Details

Allele information generated here are individual SNP based and presents theproportion of heterozygotes, number of samples, and deviation of alleledetection from a 1:1 ratio of reference and alternative alleles.The significance of the deviation is tested with Z-score testZ = \frac{ \frac{N}{2}-N_A}{ \sigma_{x}},and chi-square test (see references for more details on the method).

Value

Returns a data frame of median allele ratio, proportion ofheterozygotes, number of heterozygotes, and allele probability at differentassumptions with their chi-square significance

Author(s)

Piyal Karunarathne, Pascal Milesi, Klaus Schliep

References

McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017).Paralogs are revealed by proportion of heterozygotes and deviations in readratios in genotyping by sequencing data from natural populations.Molecular Ecology Resources, 17(4)
Karunarathne et al. 2022 (to be added)

Examples

## Not run: data(ADtable)hz<-h.zygosity(vcf,verbose=FALSE)Fis<-mean(hz$Fis,na.rm = TRUE)AI<-allele.info(ADtable,x.norm=ADnorm,Fis=Fis)## End(Not run)

Allele info example data

Description

Semi-randomly generated data from the function dup.snp.info.Data contains depth and proportion values of 2857 snps

Usage

data(alleleINF)

Format

An object of classdata.frame with 2857 rows and 28 columns.

Source

Chinook Salmon sequence reads McKinney et al. 2017

References

Larson, W. A., Seeb, L. W., Everett, M. V., Waples, R. K., Templin, W. D., & Seeb, J. E. (2014). Genotyping by sequencing resolves #' shallow population structure to inform conservation of Chinook salmon (Oncorhynchus tshawytscha). Evolutionary Applications, 7(3)
McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017). Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping by sequencing data from natural populations. Molecular Ecology Resources, 17(4)

Examples

data(alleleINF)with(alleleINF,plot(medRatio~propHet))

Find CNVs from deviants

Description

Categorize deviant and non-deviant into "singlets" and "duplicates" based on the statistical approaches specified by the user.The intersection of all the stats provided will be used in the categorization. If one would like to use the intersection of at least two stats, this can be specified in then.ints

Usage

cnv(  data,  test = c("z.het", "z.05", "z.all", "chi.het", "chi.05", "chi.all"),  filter = c("intersection", "kmeans"),  WGS = TRUE,  ft.threshold = 0.05,  plot = TRUE,  verbose = TRUE,  ...)

Arguments

data

A data frame of allele information generated with the functionallele.info

test

vector of characters. Type of test to be used for significance.See details

filter

character. Type of filter to be used for filtering CNVs.defaultkmeans. See details.

WGS

logical. test parameter. See details

ft.threshold

confidence interval for filteringdefault = 0.05

plot

logical. Plot the detection of duplicates. defaultTRUE

verbose

logical. show progress

...

other arguments to be passed toplot

Details

SNP deviants are detected with both excess of heterozygosityaccording to HWE and deviant SNPs where depth values fall outside of thenormal distribution are detected using thefollowing methods:

Z-score testZ_{x} = \sum_{i=1}^{n} Z_{i};Z_{i} = \frac{\left ( (N_{i}\times p)- N_{Ai} \right )}{\sqrt{N_{i}\times p(1-p)}}
chi-square testX_{x}^{2} = \sum_{i-1}^{n} X_{i}^{2};X_{i}^{2} = (\frac{(N_{i}\times p - N_{Ai})^2}{N_{i}\times p} + \frac{(N_{i}\times (1 - p)- (N_{i} - N_{Ai}))^2}{N_{i}\times (1-p)})

See references for more details on the methods

Users can pick among Z-score for heterozygotes (z.het, chi.het),all allele combinations (z.all, chi.all) and the assumption of noprobe bias p=0.5 (z.05, chi.05)

filter will determine whether theintersection orkmeansclustering of the providedtests should be used in filtering CNVs.The intersection uses threshold values for filtering and kmeans useunsupervised clustering. Kmeans clustering is recommended if one is uncertainabout the threshold values.

WGS is a test parameter to include or exclude coefficient of variance(cv) in kmeans. For data sets with more homogeneous depth distribution,excluding cv improves CNV detection. If you're not certain about this, useTRUE which is the default.

Value

Returns a data frame of SNPs with their detected duplication status

Author(s)

Piyal Karunarathne Qiujie Zhou

Examples

## Not run: data(alleleINF)DD<-cnv(alleleINF)## End(Not run)

Calculate normalized depth for alleles

Description

This function outputs the normalized depth values separately for each allele,calculated using normalization factor with trimmed mean of M-values ofsample libraries, median ratios normalization or quantile normalization,See details.

Usage

cpm.normal(  het.table,  method = c("MedR", "QN", "pca", "TMM", "TMMex"),  logratioTrim = 0.3,  sumTrim = 0.05,  Weighting = TRUE,  Acutoff = -1e+10,  verbose = TRUE,  plot = TRUE)

Arguments

het.table

allele depth table generated from the functionhetTgen

method

character. method to be used (see details). DefaultTMM

logratioTrim

numeric. percentage value (0 - 1) of variation to betrimmed in log transformation

sumTrim

numeric. amount of trim to use on the combined absolutelevels (“A” values) for methodTMM

Weighting

logical, whether to compute (asymptotic binomial precision)weights

Acutoff

numeric, cutoff on “A” values to use before trimming(only for TMM(ex))

verbose

logical. show progress

plot

logical. Plot the boxplot of sample library sizes showing outliers

Details

This function converts an observed depth value table to aneffective depth value table using several normalization methods;

TMM normalization (See the original publication for more information).It is different from the functionnormz only in calculation of thecounts per million is for separate alleles instead of the total depth.TheTMMex method is an extension of theTMM method forlarge data sets containing SNPs exceeding 10000
The methodMedR is median ratio normalization;
QN - quantile normalization (see Maza, Elie, et al. 2013 for acomparison of methods).
PCA - a modified Kaiser's Rule applied to depth values: Sample variationof eigen values smaller than 0.7 are removed (i.e., the first eigen value < 0.7)to eliminate the effect of the library size of samples

Value

Returns a list with (AD), a data frame of normalized depth valuessimilar to the output ofhetTgen function and(outliers) a list of outlier sample names

Author(s)

Piyal Karunarathne, Qiujie Zhou

References

Robinson MD, Oshlack A (2010). A scaling normalization method fordifferential expression analysis of RNA-seq data. Genome Biology 11, R25
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductorpackage for differential expression analysis of digital gene expressiondata. Bioinformatics 26
Maza, Elie, et al. "Comparison of normalization methods fordifferential gene expression analysis in RNA-Seq experiments: a matter ofrelative size of studied transcriptomes." Communicative & integrativebiology 6.6 (2013): e25849

Examples

## Not run: data(ADtable)ADnormalized<-cpm.normal(ADtable)## End(Not run)

Simulate median allele ratios for varying number of samples and depth values

Description

This function will simulate the expected median allele ratios under HWEfor given ranges of no. of samples and depth coverage values.This is useful if you need to find the cutoff values of allele ratios fordifferent no. of samples and depth of coverage values in your data set.

Usage

depthVsSample(  cov.len = 100,  sam.len = 100,  nsims = 1000,  plot = TRUE,  col = c("#1C86EE", "#00BFFF", "#DAA520", "#FF0000"))

Arguments

cov.len

max value of depth of coverage to be simulated

sam.len

maximum no. of samples to be simulated

nsims

numerical. no. of simulations to be done for each combination of samples and depthdepth and no. samples ranges

plot

logical. Whether to plot the output (a plot of no. samplesvs median depth of coverage colored by median allele ratios)

col

character. Two colors to add to the gradient

Value

A matrix of median allele ratios where rows are the number ofsamples and columns are depth of coverage values

Author(s)

Pascal Milesi, Piyal Karunarathne

Examples

## Not run: depthVsSample(cov.len=100,sam.len=100)

Plot classified SNPs into deviants/CNVs and non-deviants/non-CNVs

Description

The function plots detected deviants/CNVs from functionssig.snps,cnv anddupGet on a median ratio (MedRatio) Vs. proportion ofheterozygote (PropHet) plot.

Usage

dup.plot(ds, ...)

Arguments

ds

a data frame of detected deviants/cnvs (outputs of functions above)

...

other graphical parameters to be passed to the functionplot

Value

Returns no value, only plots proportion of heterozygotes vs allelemedian ratio separated by duplication status

Author(s)

Piyal Karunarathne

Examples

## Not run: data(alleleINF)DD<-dupGet(alleleINF,plot=FALSE)dup.plot(DD)## End(Not run)

Validate detected deviants/cnvs

Description

This function will validate the detected duplicated-SNPs (deviants/cnvs) using a movingwindow approach (see details)

Usage

dup.validate(d.detect, window.size = 100, scaf.size = 10000)

Arguments

d.detect

a data frame of detected duplicates or deviants from the outputs ofdupGet orcnv

window.size

numerical. a single value of the desired moving windowsize (default100 bp)

scaf.size

numerical. scaffold size to be checked. i.e. the chromosome/scaffolds will be split into equal pieces of this sizedefault=10000

Details

Loci/SNP positions correctly ordered according to a referencesequence is necessary for this function to work properly. The list of deviants/cnvs provided in thed.detect will be split into pices ofscaf.size and the number of deviants/cnvs will be counted along each split with a moving window ofwindow.size. The resulting percentages of deviants/cnvs will be averaged for each scaf.size split; this is thecnv.ratio column in the output. Thus, ideally, thecnv.ratio is a measure of how confident the detected deviants/cnvs are in an actual putative duplicated region withing the givenscaf.size. This ratio is sensitive to the picked window size and the scaf.size; as a rule of thumb, it is always good to use a known gene length as the scaf.size, if you need to check a specific gene for the validity of the detected duplicates.Please also note that this function is still in itsbeta-testing phase and also under development for non-mapped reference sequences. Therefore, your feedback and suggestions will be highly appreciated.

Value

A data frame of deviant/cnv ratios (columncnv.ratio) for a split of the chromosome/scaffold given by thescaf.size; this ratio is an average value of the percentage of deviants/cnvs present within the givenwindow.size for each split (chromosome/scaffold length/sacf.size); the start and the end positions of each split is given in thestart andend columns

Author(s)

Piyal Karunarathne

Examples

## Not run: # suggestion to visualize dup.validate outputlibrary(ggplot2)library(dplyr)dvs<-dupGet(alleleINF,test=c("z.05","chi.05"))dvd<-dup.validate(dvs,window.size = 1000)# Example data framedf <- data.frame(dvd[,3:5])df$cnv.ratio<-as.numeric(df$cnv.ratio)# Calculate midpointsdf <- df %>%  mutate(midpoint = (start + end) / 2)ggplot() +  # Horizontal segments for each start-end range  geom_segment(data = df, aes(x = start, xend = end,  y = cnv.ratio, yend = cnv.ratio), color = "blue") +  # Midpoints line connecting midpoints of each range  geom_path(data = df, aes(x = midpoint, y = cnv.ratio), color = "red") +  geom_point(data = df, aes(x = midpoint, y = cnv.ratio), color = "red") +  # Aesthetic adjustments  theme_minimal() +  labs(title = "CNV Ratio along a Continuous Axis with Midpoint Fluctuation",      x = "Genomic Position",       y = "CNV Ratio")## End(Not run)

Detect deviants from SNPs; classify SNPs

Description

Detect deviant SNPs using excess of heterozygotes(alleles that do not follow HWE) and allelic-ratio deviations(alleles with ratios that do not follow a normal Z-score or chi-squaredistribution).See details.

Usage

dupGet(  data,  Fis,  test = c("z.het", "z.05", "z.all", "chi.het", "chi.05", "chi.all"),  intersection = FALSE,  method = c("fisher", "chi.sq"),  plot = TRUE,  verbose = TRUE,  ...)

Arguments

data

data frame of the output ofallele.info

Fis

numeric. Inbreeding coefficient calculated usingh.zygosity() function

test

character. type of test to be used for significance. See details

intersection

logical, whether to use the intersection of the methodsspecified intest (if more than one)

method

character. method for testing excess of heterozygotes.Fisher exact test (fisher) or Chi-square test (chi.sq)

plot

logical. whether to plot the detected singlets and duplicateson allele ratio vs. proportion of heterozygotes plot.

verbose

logical. show progress

...

additional parameters passed on toplot

Details

SNP deviants are detected with both excess of heterozygosityaccording to HWE and deviant SNPs where depth values fall outside of thenormal distribution are detected using thefollowing methods:

Z-score testZ_{x} = \sum_{i=1}^{n} Z_{i};Z_{i} = \frac{\left ( (N_{i}\times p)- N_{Ai} \right )}{\sqrt{N_{i}\times p(1-p)}}
chi-square testX_{x}^{2} = \sum_{i-1}^{n} X_{i}^{2};X_{i}^{2} = (\frac{(N_{i}\times p - N_{Ai})^2}{N_{i}\times p} + \frac{(N_{i}\times (1 - p)- (N_{i} - N_{Ai}))^2}{N_{i}\times (1-p)})

See references for more details on the methods

Users can pick among Z-score for heterozygotes (z.het, chi.het),all allele combinations (z.all, chi.all) and the assumption of noprobe bias p=0.5 (z.05, chi.05)

Value

Returns a data frame of snps/alleles with their duplication status

Author(s)

Piyal Karunarathne Qiujie Zhou

Examples

## Not run: data(alleleINF)DD<-dupGet(alleleINF,Fis=0.1,test=c("z.05","chi.05"))## End(Not run)

Export VCF files

Description

A function to export tables/matrices in VCF format to VCF files

Usage

exportVCF(out.vcf, out.path, compress = TRUE)

Arguments

out.vcf

a matrix or data frame in vcf file format to be exported

out.path

a character string of output path for the vcf file;should end in the name as the vcf file and .vcf. See examples

compress

logical. whether to compress the output file. IfTRUE,the file will be.gz compressed

Value

Exports a vcf file to a given destination

Author(s)

Piyal Karunarathne

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path)exportVCF(vcf,"../exVcf.vcf")## End(Not run)

Get missingness of individuals in raw vcf

Description

A function to get the percentage of missing data of snps per SNP and persample

Usage

get.miss(  data,  type = c("samples", "snps"),  plot = TRUE,  verbose = TRUE,  parallel = FALSE)

Arguments

data

a list containing imported vcf file usingreadVCF orgenotype table generated usinghetTgen

type

character. Missing percentages per sample“samples” or per SNP “snps”, default both

plot

logical. Whether to plot the missingness density with ninetyfive percent quantile

verbose

logical. Whether to show progress

parallel

logical. whether to parallelize the process

Value

Returns a data frame of allele depth or genotypes

Author(s)

Piyal Karunarathne

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)missing<-get.miss(vcf,plot=TRUE)## End(Not run)

Format genotype for BayEnv and BayPass

Description

This function generates necessary genotype count formats for BayEnv andBayPass with a subset of SNPs

Usage

gt.format(  gt,  info,  format = c("benv", "bpass"),  snp.subset = NULL,  parallel = FALSE)

Arguments

gt

multi-vector. an imported data.frame of genotypes or genotypedata frame generated byhetTgen or path to GT.FORMATfile generated from VCFTools

info

a data frame containing sample and population information.It must have “sample” and “population” columns

format

character. output format i.e., for BayPass or BayEnv

snp.subset

numerical. number of randomly selected subsets of SNPs.default = NULL

parallel

logical. whether to parallelize the process

Value

Returns a list with formatted genotype data:$bayenv - snpsin horizontal format - for BayEnv (two lines per snp);$baypass - vertical format - for BayPass(two column per snp);$sub.bp - subsets snps for BayPass$sub.be - subsets of snps for BayEnv

Author(s)

Piyal Karunarathne

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)het.table<-hetTgen(vcf,"GT")info<-unique(substr(colnames(het.table)[-c(1:3)],1,8))GT<-gt.format(het.table,info)## End(Not run)

Determine per sample heterozygosity and inbreeding coefficient

Description

This function will calculate the heterozygosity on a per-sample basis fromvcf files (snps), and most importantly inbreeding coefficient which is usedto filter out the samples with bad mapping quality.

Usage

h.zygosity(vcf, plot = FALSE, pops = NA, verbose = TRUE, parallel = FALSE)

Arguments

vcf

an imported vcf file in in a list usingreadVCF or a data frame of genotypes generated usinghetTgen

plot

logical. Whether to plot a boxplot of inbreeding coefficientsfor populations. A list of populations must be provided

pops

character. A list of population names with the same length andorder as the number of samples in the vcf

verbose

logical. Show progress

parallel

logical. Parallelize the process

Value

Returns a data frame of expected “E(Hom)” and observed“O(Hom)” homozygotes with their inbreeding coefficients.

Author(s)

Piyal Karunarathne, Pascal Milesi, Klaus Schliep

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)pp<-substr(colnames(vcf$vcf)[-c(1:9)],1,8)hzygots<-h.zygosity(vcf,plot=TRUE,pops=pp)## End(Not run)

Generate allele depth or genotype table

Description

hetTgen extracts the read depth and coverage values for each snp for allthe individuals from a vcf file generated from readVCF (or GatKVariantsToTable: see details)

Usage

hetTgen(  vcf,  info.type = c("AD", "AD-tot", "GT", "GT-012", "GT-AB", "DP"),  verbose = TRUE,  parallel = FALSE)

Arguments

vcf

an imported vcf file in a list usingreadVCF

info.type

character.AD: allele depth value,AD-tot:totalallele depth,DP=unfiltered depth (sum),GT: genotype,GT-012:genotype in 012 format,GT-AB:genotype in AB format.DefaultAD, See details.

verbose

logical. whether to show the progress of the analysis

parallel

logical. whether to parallelize the process

Details

If you generate the depth values for allele by sample using GatKVariantsToTable option, use only -F CHROM -F POS -GF AD flags to generatethe table. Or keep only the CHROM, POS, ID, ALT, and individual AD columns.For info.typeGT option is provided to extract the genotypes ofindividuals by snp.

Value

Returns a data frame of allele depth, genotype of SNPs for all theindividuals extracted from a VCF file

Author(s)

Piyal Karunarathne, Klaus Schliep

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)het.table<-hetTgen(vcf)## End(Not run)

Remove MAF allele

Description

A function to remove the alleles with minimum allele frequency and keep onlya bi-allelic matrix when loci are multi-allelic

Usage

maf(h.table, AD = TRUE, verbose = TRUE, parallel = FALSE)

Arguments

h.table

allele depth table generated from the functionhetTgen

AD

logical. If TRUE a allele depth table similar tohetTgenoutput will be returns; IfFALSE, individual AD values per SNP will bereturned in a list.

verbose

logical. Show progress

parallel

logical. whether to parallelize the process

Value

A data frame or a list of minimum allele frequency removed allele depth

Author(s)

Piyal Karunarathne

Examples

## Not run: mf<-maf(ADtable)

Calculate normalization factor for each sample

Description

This function calculates the normalization factor for each sample usingdifferent methods. See details.

Usage

norm.fact(  df,  method = c("TMM", "TMMex", "MedR", "QN"),  logratioTrim = 0.3,  sumTrim = 0.05,  Weighting = TRUE,  Acutoff = -1e+10)

Arguments

df

a data frame or matrix of allele depth values(total depth per snp per sample)

method

character. method to be used (see details). DefaultTMM

logratioTrim

numeric. percentage value (0 - 1) of variation to betrimmed in log transformation

sumTrim

numeric. amount of trim to use on the combined absolutelevels (“A” values) for methodTMM

Weighting

logical, whether to compute (asymptotic binomial precision)weights

Acutoff

numeric, cutoff on “A” values to use before trimming

Details

Originally described for normalization of RNA sequences(Robinson & Oshlack 2010), this function computes normalization (scaling)factors to convert observed library sizes into effective library sizes.It uses the method trimmed means of M-values proposed by Robinson &Oshlack (2010). See the original publication andedgeR packagefor more information.The methodMedR is median ratio normalization;QN - quantile normalization (see Maza, Elie, et al. 2013 for acomparison of methods).

Value

Returns a numerical vector of normalization factors for each sample

Author(s)

Piyal Karunarathne

References

Robinson MD, and Oshlack A (2010). A scaling normalization method fordifferential expression analysis of RNA-seq data. Genome Biology 11, R25
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductorpackage for differential expression analysis of digital gene expressiondata. Bioinformatics 26

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path)df<-hetTgen(vcf,"AD-tot",verbose=FALSE)norm.fact(df)## End(Not run)

Simulate and plot detection power of bias in allele ratios

Description

This function simulates 95% confidence level Z-score based detection powerof allele biases for a given number of samples and a range of depths

Usage

power.bias(  Dlist = c(2, 4, 8, 16),  sam = 100,  intensity = 0.005,  nsims = 1000,  p = 0.5,  plot = TRUE)

Arguments

Dlist

numerical. vector of depths values to be tested

sam

numerical. number of samples

intensity

numerical. frequency of bias

nsims

numerical. number of simulations to be done for each sample

p

numerical. expected allele ratio (0.5 for data with knownsequencing biases)

plot

logical. plot the output

Value

Returns a list of detection probability values for the given range ofsamples and depth

Author(s)

Pascal Milesi, Piyal Karunarathne

Import VCF file

Description

Function to import raw single and multi-sample VCF files.The function required the R-packagedata.table for faster importing.

Usage

readVCF(vcf.file.path, verbose = FALSE)

Arguments

vcf.file.path

path to the vcf file

verbose

logical. show progress

Value

Returns a list with vcf table in a data frame, excluding meta data.

Author(s)

Piyal Karunarathne

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path)## End(Not run)

Determine pairwise relatedness

Description

Relatedness is determined according to genome-wide relationship assessmentof Yang et al. 2010 equation 6, on a per sample basis (with itself andothers), using SNPs.

Usage

relatedness(  vcf,  plot = TRUE,  threshold = 0.5,  verbose = TRUE,  parallel = FALSE)

Arguments

vcf

an imported vcf file in a list usingreadVCFor a data frame of genotypes generated usinghetTgen

plot

logical. Whether to plot relatedness of samples againstthemselves, among themselves and outliers

threshold

numerical. A value indicating to filter the individuals ofrelatedness among themselves. Default0.5 (siblings)

verbose

logical. Show progress.

parallel

logical. Parallelize the process

Details

According to Yang et al. (2010), out breeding non-related pairs should have arelatedness value of zero while the individual with itself will have arelatedness value of one. Relatedness value of ~0.5 indicates siblings.

Value

A data frame of individuals and relatedness scoreA_{jk}

Author(s)

Piyal Karunarathne, Klaus Schliep

References

Yang, J., Benyamin, B., McEvoy, B. et al. Common SNPs explain alarge proportion of the heritability for human height. Nat Genet 42, 565569(2010).

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)relate<-relatedness(vcf)## End(Not run)

Identify significantly different heterozygotes from SNPs data

Description

This function will recognize the SNPs with a proportion of heterozygotessignificantly higher than expected under HWE and plot deviant snps basedonly on the excess of heterozygotes.

Usage

sig.hets(  a.info,  Fis,  method = c("chi.sq", "fisher"),  plot = TRUE,  verbose = TRUE,  ...)

Arguments

a.info

allele info table generated from filtered vcfs using thefunctionallele.info or allele depth table generated fromhetTgen

Fis

numeric. Inbreeding coefficient calculated usingh.zygosity() function

method

character. Method for testing significance. Fisher exact test(fisher) or Chi-square test (chi.sq)

plot

logical. Whether to plot the identified duplicated snps withthe expected values

verbose

logical, if TRUE, the progress is shown

...

other arguments passed toplot

Value

A matrix of expected heterozygote proportions from the observeddata with p-value indicating significance of deviation.

Author(s)

Piyal Karunarathne, Pascal Milesi, Klaus Schliep, Qiujie Zhou

Examples

## Not run: data(alleleINF)AI <- alleleINFduplicates<-sig.hets(AI,plot=TRUE,Fis=0.1)## End(Not run)

Simulate Allele Frequencies

Description

This function simulates allele frequencies of a desired population sizeunder HWE

Usage

sim.als(n = 500, nrun = 10000, res = 0.001, plot = TRUE)

Arguments

n

desired populations size (set this value same as your actualpopulation size for an accurate simulation)

nrun

number of simulations to run on each allele frequency.The higher this number, the closer the simulations will be to thetheoretical values (at the cost of computer power); 10000 is an optimal value.

res

desired resolution of the theoretical allele frequency

plot

logical. whether to plot the simulation

Value

A list of two matrices:

allele_freqs: theoretical allele frequency
simulated_freqs: simulated frequencies at different confidence intervals

Author(s)

Piyal Karunarathne, Pascal Milesi

Examples

## Not run: alleles <- sim.als(n=200,nrun=1000,res=0.001,plot=TRUE)

Get sequencing quality statistics of raw VCF files(with GatK generated vcf files only)

Description

This function will generate a table similar to VariantsToTable option inGatK from raw vcf files for filtering purposes. The function will alsoplot all the parameters (see details & values).

Usage

vcf.stat(vcf, plot = TRUE, ...)

Arguments

vcf

an imported vcf file in data.frame or matrix format usingreadVCF

plot

logical. Whether to plot the (12) parameters

...

other arguments passed on toplot(e.g. col,border)

Details

For more details see instructions of GatK

Value

Returns a data frame with quality parameters from the INFO. field ofthe vcf

QUAL: The Phred-scaled probability that a REF/ALT polymorphism existsat this site given sequencing data
AC: Allele count
AF: Allele frequency
DP: unfiltered depth
QD: QualByDepth - This is the variant confidence (from the QUALfield) divided by the unfiltered depth of non-hom-ref samples
FS: FisherStrand - This is the Phred scaled probability that there isstrand bias at the site
SOR: StrandOddsRatio - This is another way to estimate strand biasusing a test similar to the symmetric odds ratio test
MQ: RMSMappingQuality - This is the root mean square mapping qualityover all the reads at the site
MQRankSum: MappingQualityRankSumTest - This is the u-basedz-approximation from the Rank Sum Test for mapping qualities
ReadPosRankSum: ReadPosRankSumTest: This is the u-basedz-approximation from the Rank Sum Test for site position within reads

Author(s)

Piyal Karunarathne

Examples

## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)statistics<-vcf.stat(vcf,plot=TRUE)## End(Not run)

Calculate population-wise Vst

Description

This function calculates Vst (variant fixation index) for populations givena list of duplicated loci

Usage

vst(AD, pops, id.list = NULL, qGraph = TRUE, verbose = TRUE, ...)

Arguments

AD

data frame of total allele depth values of (duplicated, ifid.list is not provided) SNPs

pops

character. A vector of population names for each individual.Must be the same length as the number of samples in AD

id.list

character. A vector of duplicated SNP IDs. Must match the IDsin the AD data frame

qGraph

logical. Plot the network plot based on Vst values(see details)

verbose

logical. show progress

...

additional arguments passed toqgraph

Details

Vst is calculated with the following equation

V_{T} = \frac{ V_{S} }{V_{T}}

where VT is the variance of normalizedread depths among all individuals from the two populations and VS is theaverage of the variance within each population, weighed for population size(see reference for more details)Seeqgraph help for details on qgraph output

Value

Returns a matrix of pairwise Vst values for populations

Author(s)

Piyal Karunarathne

References

Redon, Richard, et al. Global variation in copy number in the human genome.nature 444.7118 (2006)

Examples

## Not run: data(alleleINF)data(ADtable)DD<-dupGet(alleleINF)ds<-DD[DD$dup.stat=="deviant",]ad<-ADtable[match(paste0(ds$CHROM,".",ds$POS),paste0(ADtable$CHROM,".",ADtable$POS)),]vst(ad,pops=substr(colnames(ad)[-c(1:4)],1,11))## End(Not run)

Run permutation on Vst

Description

This function runs a permutation test on Vst calculation

Usage

vstPermutation(  AD,  pops,  nperm = 100,  histogram = TRUE,  stat = 2,  qGraph = TRUE)

Arguments

AD

data frame of total allele depth values of SNPs

pops

character. A vector of population names for each individual.Must be the same length as the number of samples in AD

nperm

numeric. Number of permutations to perform

histogram

logical. plots the distribution histogram of permuted vst values vs. observed values

stat

numeric. The stat to be plotted in histogram. 1 for Mean Absolute Distance or 2 (default) for Root Mean Square Distance

qGraph

logical. Plot the network plot based on observed Vst values(seevst() help page for more details)

Value

Returns a list with observed vst values, an array of permuted vst values and the p-values for the permutation test

Author(s)

Jorge Cortés-Miranda (email:jorge.cortes.m@ug.uchile.cl), Piyal Karunarathne

Examples

## Not run: data(alleleINF)data(ADtable)DD<-dupGet(alleleINF)ds<-DD[DD$dup.stat=="deviant",]ad<-ADtable[match(paste0(ds$CHROM,".",ds$POS),paste0(ADtable$CHROM,".",ADtable$POS)),]vstPermutation(ad,pops=substr(colnames(ad)[-c(1:4)],1,11))## End(Not run)