| Type: | Package |
| Title: | Detect Copy Number Variants from SNPs Data |
| Version: | 1.3.0 |
| Date: | 2024-09-20 |
| Language: | en-US |
| Maintainer: | Piyal Karunarathne <piyalkarumail@yahoo.com> |
| Description: | Functions in this package will import filtered variant call format (VCF) files of SNPs data and generate data sets to detect copy number variants, visualize them and do downstream analyses with copy number variants(e.g. Environmental association analyses). |
| License: | AGPL (≥ 3) |
| Imports: | data.table, graphics, colorspace, R.utils, qgraph, stringr |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| Depends: | R (≥ 3.6.0) |
| Suggests: | rmarkdown, knitr, testthat (≥ 3.0.0), covr |
| Config/testthat/edition: | 3 |
| URL: | https://piyalkarum.github.io/rCNV/,https://cran.r-project.org/package=rCNV |
| BugReports: | https://github.com/piyalkarum/rCNV/issues |
| Packaged: | 2024-09-20 12:14:33 UTC; piyalkaru |
| NeedsCompilation: | no |
| Author: | Piyal Karunarathne |
| Repository: | CRAN |
| Date/Publication: | 2024-09-20 12:40:02 UTC |
Normalized allele depth example data
Description
Normalized example SNPs data of Chinook Salmon from Larson et al. 2014The data has been normalized with TMM
Usage
data(ADnorm)Format
An object of classlist of length 2.
References
Larson, W. A., Seeb, L. W., Everett, M. V., Waples, R. K.,Templin, W. D., & Seeb, J. E. (2014). Genotyping by sequencing resolvesshallow population structure to inform conservation of Chinook salmon(Oncorhynchus tshawytscha). Evolutionary Applications, 7(3)
Allele Depth (AD) example data
Description
Example SNPs data of Chinook Salmon from Larson et al. et al. 2014.The data contains only a partial snps data set of RadSeq data afterfiltering.
Usage
data(ADtable)Format
An object of classdata.frame with 3000 rows and 109 columns.
References
Larson, W. A., Seeb, L. W., Everett, M. V., Waples, R. K., Templin,W. D., & Seeb, J. E. (2014). Genotyping by sequencing resolves shallowpopulation structure to inform conservation of Chinook salmon(Oncorhynchus tshawytscha). Evolutionary Applications, 7(3), 355-369.
McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017).Paralogs are revealed by proportion of heterozygotes and deviations inread ratios in genotyping by sequencing data from natural populations.Molecular Ecology Resources, 17(4)
Correct allele depth values
Description
A function to correct depth values with odd number of coverage values due tosequencing anomalies or miss classification where genotype is homozygous anddepth values indicate heterozygosity.The function adds a value of one to the allele with the lowest depth valuefor when odd number anomalies or make the depth value zero for whenmiss-classified. The genotype table must be provided for the latter.
Usage
ad.correct( het.table, gt.table = NULL, odd.correct = TRUE, verbose = TRUE, parallel = FALSE)Arguments
het.table | allele depth table generated from the function |
gt.table | genotype table generated from the function hetTgen |
odd.correct | logical, to correct for odd number anomalies in AD values.default |
verbose | logical. show progress. Default |
parallel | logical. whether to parallelize the process |
Value
Returns the coverage corrected allele depth table similar to theoutput ofhetTgen
Author(s)
Piyal Karunarathne
Examples
## Not run: adc<-ad.correct(ADtable)Generate allele frequency table for individuals or populations
Description
Get alternative allele frequency across all individuals per SNP from thegenotype or allele depth tables
Usage
allele.freq(gtt, f.typ = c("pop", "ind"), verbose = TRUE)Arguments
gtt | a list or data frame of genotype and/or allele depth table produced from |
f.typ | character. type of allele frequency to be calculated (individual |
verbose | logical. whether to show the progress of the analysis |
Details
If the allele frequencies to be calculated for populations from both genotype table and the allele depth table, they must be provided in a list with element namesAD for allele depth table andGT for the genotype table. See the examples.
Value
Returns a data frame or a list (if both genotype and allele depth used)of allele frequencies
Author(s)
Piyal Karunarathne
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)het.table<-hetTgen(vcf,"GT")ad.table<-hetTgen(vcf,"AD")# for individual based AFfrQ<-allele.freq(het.table,f.typ="ind")#for population-wise and both allele depth and genotype tablesfrQ<-allele.freq(list(AD=ad.table,GT=het.table),f.typ="pop")## End(Not run)Get allele information for duplicate detection
Description
The function to calculate allele median ratios, proportion of heterozygotesand allele probability values under different assumptions (see details),and their chi-square significance values for duplicate detection
Usage
allele.info( X, x.norm = NULL, Fis, method = c("MedR", "QN", "pca", "TMM", "TMMex"), logratioTrim = 0.3, sumTrim = 0.05, Weighting = TRUE, Acutoff = -1e+10, plot.allele.cov = TRUE, verbose = TRUE, parallel = FALSE, ...)Arguments
X | allele depth table generated from the function |
x.norm | a data frame of normalized allele coverage, output of |
Fis | numeric. Inbreeding coefficient calculated using |
method | character. method to be used for normalization(see |
logratioTrim | numeric. percentage value (0 - 1) of variation to betrimmed in log transformation |
sumTrim | numeric. amount of trim to use on the combined absolutelevels (“A” values) for method |
Weighting | logical, whether to compute (asymptotic binomial precision)weights |
Acutoff | numeric, cutoff on “A” values to use before trimming |
plot.allele.cov | logical, plot comparative plots of allele depthcoverage in homozygotes and heterozygotes |
verbose | logical, whether to print progress |
parallel | logical. whether to parallelize the process |
... | further arguments to be passed to |
Details
Allele information generated here are individual SNP based and presents theproportion of heterozygotes, number of samples, and deviation of alleledetection from a 1:1 ratio of reference and alternative alleles.The significance of the deviation is tested with Z-score testZ = \frac{ \frac{N}{2}-N_A}{ \sigma_{x}},and chi-square test (see references for more details on the method).
Value
Returns a data frame of median allele ratio, proportion ofheterozygotes, number of heterozygotes, and allele probability at differentassumptions with their chi-square significance
Author(s)
Piyal Karunarathne, Pascal Milesi, Klaus Schliep
References
McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017).Paralogs are revealed by proportion of heterozygotes and deviations in readratios in genotyping by sequencing data from natural populations.Molecular Ecology Resources, 17(4)
Karunarathne et al. 2022 (to be added)
Examples
## Not run: data(ADtable)hz<-h.zygosity(vcf,verbose=FALSE)Fis<-mean(hz$Fis,na.rm = TRUE)AI<-allele.info(ADtable,x.norm=ADnorm,Fis=Fis)## End(Not run)Allele info example data
Description
Semi-randomly generated data from the function dup.snp.info.Data contains depth and proportion values of 2857 snps
Usage
data(alleleINF)Format
An object of classdata.frame with 2857 rows and 28 columns.
Source
Chinook Salmon sequence reads McKinney et al. 2017
References
Larson, W. A., Seeb, L. W., Everett, M. V., Waples, R. K., Templin, W. D., & Seeb, J. E. (2014). Genotyping by sequencing resolves #' shallow population structure to inform conservation of Chinook salmon (Oncorhynchus tshawytscha). Evolutionary Applications, 7(3)
McKinney, G. J., Waples, R. K., Seeb, L. W., & Seeb, J. E. (2017). Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping by sequencing data from natural populations. Molecular Ecology Resources, 17(4)
Examples
data(alleleINF)with(alleleINF,plot(medRatio~propHet))Find CNVs from deviants
Description
Categorize deviant and non-deviant into "singlets" and "duplicates" based on the statistical approaches specified by the user.The intersection of all the stats provided will be used in the categorization. If one would like to use the intersection of at least two stats, this can be specified in then.ints
Usage
cnv( data, test = c("z.het", "z.05", "z.all", "chi.het", "chi.05", "chi.all"), filter = c("intersection", "kmeans"), WGS = TRUE, ft.threshold = 0.05, plot = TRUE, verbose = TRUE, ...)Arguments
data | A data frame of allele information generated with the function |
test | vector of characters. Type of test to be used for significance.See details |
filter | character. Type of filter to be used for filtering CNVs.default |
WGS | logical. test parameter. See details |
ft.threshold | confidence interval for filtering |
plot | logical. Plot the detection of duplicates. default |
verbose | logical. show progress |
... | other arguments to be passed to |
Details
SNP deviants are detected with both excess of heterozygosityaccording to HWE and deviant SNPs where depth values fall outside of thenormal distribution are detected using thefollowing methods:
Z-score test
Z_{x} = \sum_{i=1}^{n} Z_{i};Z_{i} = \frac{\left ( (N_{i}\times p)- N_{Ai} \right )}{\sqrt{N_{i}\times p(1-p)}}chi-square test
X_{x}^{2} = \sum_{i-1}^{n} X_{i}^{2};X_{i}^{2} = (\frac{(N_{i}\times p - N_{Ai})^2}{N_{i}\times p} + \frac{(N_{i}\times (1 - p)- (N_{i} - N_{Ai}))^2}{N_{i}\times (1-p)})
See references for more details on the methods
Users can pick among Z-score for heterozygotes (z.het, chi.het),all allele combinations (z.all, chi.all) and the assumption of noprobe bias p=0.5 (z.05, chi.05)
filter will determine whether theintersection orkmeansclustering of the providedtests should be used in filtering CNVs.The intersection uses threshold values for filtering and kmeans useunsupervised clustering. Kmeans clustering is recommended if one is uncertainabout the threshold values.
WGS is a test parameter to include or exclude coefficient of variance(cv) in kmeans. For data sets with more homogeneous depth distribution,excluding cv improves CNV detection. If you're not certain about this, useTRUE which is the default.
Value
Returns a data frame of SNPs with their detected duplication status
Author(s)
Piyal Karunarathne Qiujie Zhou
Examples
## Not run: data(alleleINF)DD<-cnv(alleleINF)## End(Not run)Calculate normalized depth for alleles
Description
This function outputs the normalized depth values separately for each allele,calculated using normalization factor with trimmed mean of M-values ofsample libraries, median ratios normalization or quantile normalization,See details.
Usage
cpm.normal( het.table, method = c("MedR", "QN", "pca", "TMM", "TMMex"), logratioTrim = 0.3, sumTrim = 0.05, Weighting = TRUE, Acutoff = -1e+10, verbose = TRUE, plot = TRUE)Arguments
het.table | allele depth table generated from the function |
method | character. method to be used (see details). Default |
logratioTrim | numeric. percentage value (0 - 1) of variation to betrimmed in log transformation |
sumTrim | numeric. amount of trim to use on the combined absolutelevels (“A” values) for method |
Weighting | logical, whether to compute (asymptotic binomial precision)weights |
Acutoff | numeric, cutoff on “A” values to use before trimming(only for TMM(ex)) |
verbose | logical. show progress |
plot | logical. Plot the boxplot of sample library sizes showing outliers |
Details
This function converts an observed depth value table to aneffective depth value table using several normalization methods;
TMM normalization (See the original publication for more information).It is different from the function
normzonly in calculation of thecounts per million is for separate alleles instead of the total depth.TheTMMexmethod is an extension of theTMMmethod forlarge data sets containing SNPs exceeding 10000The method
MedRis median ratio normalization;QN - quantile normalization (see Maza, Elie, et al. 2013 for acomparison of methods).
PCA - a modified Kaiser's Rule applied to depth values: Sample variationof eigen values smaller than 0.7 are removed (i.e., the first eigen value < 0.7)to eliminate the effect of the library size of samples
Value
Returns a list with (AD), a data frame of normalized depth valuessimilar to the output ofhetTgen function and(outliers) a list of outlier sample names
Author(s)
Piyal Karunarathne, Qiujie Zhou
References
Robinson MD, Oshlack A (2010). A scaling normalization method fordifferential expression analysis of RNA-seq data. Genome Biology 11, R25
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductorpackage for differential expression analysis of digital gene expressiondata. Bioinformatics 26
Maza, Elie, et al. "Comparison of normalization methods fordifferential gene expression analysis in RNA-Seq experiments: a matter ofrelative size of studied transcriptomes." Communicative & integrativebiology 6.6 (2013): e25849
Examples
## Not run: data(ADtable)ADnormalized<-cpm.normal(ADtable)## End(Not run)Simulate median allele ratios for varying number of samples and depth values
Description
This function will simulate the expected median allele ratios under HWEfor given ranges of no. of samples and depth coverage values.This is useful if you need to find the cutoff values of allele ratios fordifferent no. of samples and depth of coverage values in your data set.
Usage
depthVsSample( cov.len = 100, sam.len = 100, nsims = 1000, plot = TRUE, col = c("#1C86EE", "#00BFFF", "#DAA520", "#FF0000"))Arguments
cov.len | max value of depth of coverage to be simulated |
sam.len | maximum no. of samples to be simulated |
nsims | numerical. no. of simulations to be done for each combination of samples and depthdepth and no. samples ranges |
plot | logical. Whether to plot the output (a plot of no. samplesvs median depth of coverage colored by median allele ratios) |
col | character. Two colors to add to the gradient |
Value
A matrix of median allele ratios where rows are the number ofsamples and columns are depth of coverage values
Author(s)
Pascal Milesi, Piyal Karunarathne
Examples
## Not run: depthVsSample(cov.len=100,sam.len=100)Plot classified SNPs into deviants/CNVs and non-deviants/non-CNVs
Description
The function plots detected deviants/CNVs from functionssig.snps,cnv anddupGet on a median ratio (MedRatio) Vs. proportion ofheterozygote (PropHet) plot.
Usage
dup.plot(ds, ...)Arguments
ds | a data frame of detected deviants/cnvs (outputs of functions above) |
... | other graphical parameters to be passed to the function |
Value
Returns no value, only plots proportion of heterozygotes vs allelemedian ratio separated by duplication status
Author(s)
Piyal Karunarathne
Examples
## Not run: data(alleleINF)DD<-dupGet(alleleINF,plot=FALSE)dup.plot(DD)## End(Not run)Validate detected deviants/cnvs
Description
This function will validate the detected duplicated-SNPs (deviants/cnvs) using a movingwindow approach (see details)
Usage
dup.validate(d.detect, window.size = 100, scaf.size = 10000)Arguments
d.detect | a data frame of detected duplicates or deviants from the outputs of |
window.size | numerical. a single value of the desired moving windowsize (default |
scaf.size | numerical. scaffold size to be checked. i.e. the chromosome/scaffolds will be split into equal pieces of this sizedefault=10000 |
Details
Loci/SNP positions correctly ordered according to a referencesequence is necessary for this function to work properly. The list of deviants/cnvs provided in thed.detect will be split into pices ofscaf.size and the number of deviants/cnvs will be counted along each split with a moving window ofwindow.size. The resulting percentages of deviants/cnvs will be averaged for each scaf.size split; this is thecnv.ratio column in the output. Thus, ideally, thecnv.ratio is a measure of how confident the detected deviants/cnvs are in an actual putative duplicated region withing the givenscaf.size. This ratio is sensitive to the picked window size and the scaf.size; as a rule of thumb, it is always good to use a known gene length as the scaf.size, if you need to check a specific gene for the validity of the detected duplicates.Please also note that this function is still in itsbeta-testing phase and also under development for non-mapped reference sequences. Therefore, your feedback and suggestions will be highly appreciated.
Value
A data frame of deviant/cnv ratios (columncnv.ratio) for a split of the chromosome/scaffold given by thescaf.size; this ratio is an average value of the percentage of deviants/cnvs present within the givenwindow.size for each split (chromosome/scaffold length/sacf.size); the start and the end positions of each split is given in thestart andend columns
Author(s)
Piyal Karunarathne
Examples
## Not run: # suggestion to visualize dup.validate outputlibrary(ggplot2)library(dplyr)dvs<-dupGet(alleleINF,test=c("z.05","chi.05"))dvd<-dup.validate(dvs,window.size = 1000)# Example data framedf <- data.frame(dvd[,3:5])df$cnv.ratio<-as.numeric(df$cnv.ratio)# Calculate midpointsdf <- df %>% mutate(midpoint = (start + end) / 2)ggplot() + # Horizontal segments for each start-end range geom_segment(data = df, aes(x = start, xend = end, y = cnv.ratio, yend = cnv.ratio), color = "blue") + # Midpoints line connecting midpoints of each range geom_path(data = df, aes(x = midpoint, y = cnv.ratio), color = "red") + geom_point(data = df, aes(x = midpoint, y = cnv.ratio), color = "red") + # Aesthetic adjustments theme_minimal() + labs(title = "CNV Ratio along a Continuous Axis with Midpoint Fluctuation", x = "Genomic Position", y = "CNV Ratio")## End(Not run)Detect deviants from SNPs; classify SNPs
Description
Detect deviant SNPs using excess of heterozygotes(alleles that do not follow HWE) and allelic-ratio deviations(alleles with ratios that do not follow a normal Z-score or chi-squaredistribution).See details.
Usage
dupGet( data, Fis, test = c("z.het", "z.05", "z.all", "chi.het", "chi.05", "chi.all"), intersection = FALSE, method = c("fisher", "chi.sq"), plot = TRUE, verbose = TRUE, ...)Arguments
data | data frame of the output of |
Fis | numeric. Inbreeding coefficient calculated using |
test | character. type of test to be used for significance. See details |
intersection | logical, whether to use the intersection of the methodsspecified in |
method | character. method for testing excess of heterozygotes.Fisher exact test ( |
plot | logical. whether to plot the detected singlets and duplicateson allele ratio vs. proportion of heterozygotes plot. |
verbose | logical. show progress |
... | additional parameters passed on to |
Details
SNP deviants are detected with both excess of heterozygosityaccording to HWE and deviant SNPs where depth values fall outside of thenormal distribution are detected using thefollowing methods:
Z-score test
Z_{x} = \sum_{i=1}^{n} Z_{i};Z_{i} = \frac{\left ( (N_{i}\times p)- N_{Ai} \right )}{\sqrt{N_{i}\times p(1-p)}}chi-square test
X_{x}^{2} = \sum_{i-1}^{n} X_{i}^{2};X_{i}^{2} = (\frac{(N_{i}\times p - N_{Ai})^2}{N_{i}\times p} + \frac{(N_{i}\times (1 - p)- (N_{i} - N_{Ai}))^2}{N_{i}\times (1-p)})
See references for more details on the methods
Users can pick among Z-score for heterozygotes (z.het, chi.het),all allele combinations (z.all, chi.all) and the assumption of noprobe bias p=0.5 (z.05, chi.05)
Value
Returns a data frame of snps/alleles with their duplication status
Author(s)
Piyal Karunarathne Qiujie Zhou
Examples
## Not run: data(alleleINF)DD<-dupGet(alleleINF,Fis=0.1,test=c("z.05","chi.05"))## End(Not run)Export VCF files
Description
A function to export tables/matrices in VCF format to VCF files
Usage
exportVCF(out.vcf, out.path, compress = TRUE)Arguments
out.vcf | a matrix or data frame in vcf file format to be exported |
out.path | a character string of output path for the vcf file;should end in the name as the vcf file and .vcf. See examples |
compress | logical. whether to compress the output file. If |
Value
Exports a vcf file to a given destination
Author(s)
Piyal Karunarathne
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path)exportVCF(vcf,"../exVcf.vcf")## End(Not run)Get missingness of individuals in raw vcf
Description
A function to get the percentage of missing data of snps per SNP and persample
Usage
get.miss( data, type = c("samples", "snps"), plot = TRUE, verbose = TRUE, parallel = FALSE)Arguments
data | a list containing imported vcf file using |
type | character. Missing percentages per sample“samples” or per SNP “snps”, default both |
plot | logical. Whether to plot the missingness density with ninetyfive percent quantile |
verbose | logical. Whether to show progress |
parallel | logical. whether to parallelize the process |
Value
Returns a data frame of allele depth or genotypes
Author(s)
Piyal Karunarathne
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)missing<-get.miss(vcf,plot=TRUE)## End(Not run)Format genotype for BayEnv and BayPass
Description
This function generates necessary genotype count formats for BayEnv andBayPass with a subset of SNPs
Usage
gt.format( gt, info, format = c("benv", "bpass"), snp.subset = NULL, parallel = FALSE)Arguments
gt | multi-vector. an imported data.frame of genotypes or genotypedata frame generated by |
info | a data frame containing sample and population information.It must have “sample” and “population” columns |
format | character. output format i.e., for BayPass or BayEnv |
snp.subset | numerical. number of randomly selected subsets of SNPs. |
parallel | logical. whether to parallelize the process |
Value
Returns a list with formatted genotype data:$bayenv - snpsin horizontal format - for BayEnv (two lines per snp);$baypass - vertical format - for BayPass(two column per snp);$sub.bp - subsets snps for BayPass$sub.be - subsets of snps for BayEnv
Author(s)
Piyal Karunarathne
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)het.table<-hetTgen(vcf,"GT")info<-unique(substr(colnames(het.table)[-c(1:3)],1,8))GT<-gt.format(het.table,info)## End(Not run)Determine per sample heterozygosity and inbreeding coefficient
Description
This function will calculate the heterozygosity on a per-sample basis fromvcf files (snps), and most importantly inbreeding coefficient which is usedto filter out the samples with bad mapping quality.
Usage
h.zygosity(vcf, plot = FALSE, pops = NA, verbose = TRUE, parallel = FALSE)Arguments
vcf | an imported vcf file in in a list using |
plot | logical. Whether to plot a boxplot of inbreeding coefficientsfor populations. A list of populations must be provided |
pops | character. A list of population names with the same length andorder as the number of samples in the vcf |
verbose | logical. Show progress |
parallel | logical. Parallelize the process |
Value
Returns a data frame of expected “E(Hom)” and observed“O(Hom)” homozygotes with their inbreeding coefficients.
Author(s)
Piyal Karunarathne, Pascal Milesi, Klaus Schliep
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)pp<-substr(colnames(vcf$vcf)[-c(1:9)],1,8)hzygots<-h.zygosity(vcf,plot=TRUE,pops=pp)## End(Not run)Generate allele depth or genotype table
Description
hetTgen extracts the read depth and coverage values for each snp for allthe individuals from a vcf file generated from readVCF (or GatKVariantsToTable: see details)
Usage
hetTgen( vcf, info.type = c("AD", "AD-tot", "GT", "GT-012", "GT-AB", "DP"), verbose = TRUE, parallel = FALSE)Arguments
vcf | an imported vcf file in a list using |
info.type | character. |
verbose | logical. whether to show the progress of the analysis |
parallel | logical. whether to parallelize the process |
Details
If you generate the depth values for allele by sample using GatKVariantsToTable option, use only -F CHROM -F POS -GF AD flags to generatethe table. Or keep only the CHROM, POS, ID, ALT, and individual AD columns.For info.typeGT option is provided to extract the genotypes ofindividuals by snp.
Value
Returns a data frame of allele depth, genotype of SNPs for all theindividuals extracted from a VCF file
Author(s)
Piyal Karunarathne, Klaus Schliep
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)het.table<-hetTgen(vcf)## End(Not run)Remove MAF allele
Description
A function to remove the alleles with minimum allele frequency and keep onlya bi-allelic matrix when loci are multi-allelic
Usage
maf(h.table, AD = TRUE, verbose = TRUE, parallel = FALSE)Arguments
h.table | allele depth table generated from the function |
AD | logical. If TRUE a allele depth table similar to |
verbose | logical. Show progress |
parallel | logical. whether to parallelize the process |
Value
A data frame or a list of minimum allele frequency removed allele depth
Author(s)
Piyal Karunarathne
Examples
## Not run: mf<-maf(ADtable)Calculate normalization factor for each sample
Description
This function calculates the normalization factor for each sample usingdifferent methods. See details.
Usage
norm.fact( df, method = c("TMM", "TMMex", "MedR", "QN"), logratioTrim = 0.3, sumTrim = 0.05, Weighting = TRUE, Acutoff = -1e+10)Arguments
df | a data frame or matrix of allele depth values(total depth per snp per sample) |
method | character. method to be used (see details). Default |
logratioTrim | numeric. percentage value (0 - 1) of variation to betrimmed in log transformation |
sumTrim | numeric. amount of trim to use on the combined absolutelevels (“A” values) for method |
Weighting | logical, whether to compute (asymptotic binomial precision)weights |
Acutoff | numeric, cutoff on “A” values to use before trimming |
Details
Originally described for normalization of RNA sequences(Robinson & Oshlack 2010), this function computes normalization (scaling)factors to convert observed library sizes into effective library sizes.It uses the method trimmed means of M-values proposed by Robinson &Oshlack (2010). See the original publication andedgeR packagefor more information.The methodMedR is median ratio normalization;QN - quantile normalization (see Maza, Elie, et al. 2013 for acomparison of methods).
Value
Returns a numerical vector of normalization factors for each sample
Author(s)
Piyal Karunarathne
References
Robinson MD, and Oshlack A (2010). A scaling normalization method fordifferential expression analysis of RNA-seq data. Genome Biology 11, R25
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductorpackage for differential expression analysis of digital gene expressiondata. Bioinformatics 26
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path)df<-hetTgen(vcf,"AD-tot",verbose=FALSE)norm.fact(df)## End(Not run)Simulate and plot detection power of bias in allele ratios
Description
This function simulates 95% confidence level Z-score based detection powerof allele biases for a given number of samples and a range of depths
Usage
power.bias( Dlist = c(2, 4, 8, 16), sam = 100, intensity = 0.005, nsims = 1000, p = 0.5, plot = TRUE)Arguments
Dlist | numerical. vector of depths values to be tested |
sam | numerical. number of samples |
intensity | numerical. frequency of bias |
nsims | numerical. number of simulations to be done for each sample |
p | numerical. expected allele ratio (0.5 for data with knownsequencing biases) |
plot | logical. plot the output |
Value
Returns a list of detection probability values for the given range ofsamples and depth
Author(s)
Pascal Milesi, Piyal Karunarathne
Import VCF file
Description
Function to import raw single and multi-sample VCF files.The function required the R-packagedata.table for faster importing.
Usage
readVCF(vcf.file.path, verbose = FALSE)Arguments
vcf.file.path | path to the vcf file |
verbose | logical. show progress |
Value
Returns a list with vcf table in a data frame, excluding meta data.
Author(s)
Piyal Karunarathne
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path)## End(Not run)Determine pairwise relatedness
Description
Relatedness is determined according to genome-wide relationship assessmentof Yang et al. 2010 equation 6, on a per sample basis (with itself andothers), using SNPs.
Usage
relatedness( vcf, plot = TRUE, threshold = 0.5, verbose = TRUE, parallel = FALSE)Arguments
vcf | an imported vcf file in a list using |
plot | logical. Whether to plot relatedness of samples againstthemselves, among themselves and outliers |
threshold | numerical. A value indicating to filter the individuals ofrelatedness among themselves. Default |
verbose | logical. Show progress. |
parallel | logical. Parallelize the process |
Details
According to Yang et al. (2010), out breeding non-related pairs should have arelatedness value of zero while the individual with itself will have arelatedness value of one. Relatedness value of ~0.5 indicates siblings.
Value
A data frame of individuals and relatedness scoreA_{jk}
Author(s)
Piyal Karunarathne, Klaus Schliep
References
Yang, J., Benyamin, B., McEvoy, B. et al. Common SNPs explain alarge proportion of the heritability for human height. Nat Genet 42, 565569(2010).
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)relate<-relatedness(vcf)## End(Not run)Identify significantly different heterozygotes from SNPs data
Description
This function will recognize the SNPs with a proportion of heterozygotessignificantly higher than expected under HWE and plot deviant snps basedonly on the excess of heterozygotes.
Usage
sig.hets( a.info, Fis, method = c("chi.sq", "fisher"), plot = TRUE, verbose = TRUE, ...)Arguments
a.info | allele info table generated from filtered vcfs using thefunction |
Fis | numeric. Inbreeding coefficient calculated using |
method | character. Method for testing significance. Fisher exact test( |
plot | logical. Whether to plot the identified duplicated snps withthe expected values |
verbose | logical, if TRUE, the progress is shown |
... | other arguments passed to |
Value
A matrix of expected heterozygote proportions from the observeddata with p-value indicating significance of deviation.
Author(s)
Piyal Karunarathne, Pascal Milesi, Klaus Schliep, Qiujie Zhou
Examples
## Not run: data(alleleINF)AI <- alleleINFduplicates<-sig.hets(AI,plot=TRUE,Fis=0.1)## End(Not run)Simulate Allele Frequencies
Description
This function simulates allele frequencies of a desired population sizeunder HWE
Usage
sim.als(n = 500, nrun = 10000, res = 0.001, plot = TRUE)Arguments
n | desired populations size (set this value same as your actualpopulation size for an accurate simulation) |
nrun | number of simulations to run on each allele frequency.The higher this number, the closer the simulations will be to thetheoretical values (at the cost of computer power); 10000 is an optimal value. |
res | desired resolution of the theoretical allele frequency |
plot | logical. whether to plot the simulation |
Value
A list of two matrices:
allele_freqs: theoretical allele frequency
simulated_freqs: simulated frequencies at different confidence intervals
Author(s)
Piyal Karunarathne, Pascal Milesi
Examples
## Not run: alleles <- sim.als(n=200,nrun=1000,res=0.001,plot=TRUE)Get sequencing quality statistics of raw VCF files(with GatK generated vcf files only)
Description
This function will generate a table similar to VariantsToTable option inGatK from raw vcf files for filtering purposes. The function will alsoplot all the parameters (see details & values).
Usage
vcf.stat(vcf, plot = TRUE, ...)Arguments
vcf | an imported vcf file in data.frame or matrix format using |
plot | logical. Whether to plot the (12) parameters |
... | other arguments passed on to |
Details
For more details see instructions of GatK
Value
Returns a data frame with quality parameters from the INFO. field ofthe vcf
QUAL: The Phred-scaled probability that a REF/ALT polymorphism existsat this site given sequencing data
AC: Allele count
AF: Allele frequency
DP: unfiltered depth
QD: QualByDepth - This is the variant confidence (from the QUALfield) divided by the unfiltered depth of non-hom-ref samples
FS: FisherStrand - This is the Phred scaled probability that there isstrand bias at the site
SOR: StrandOddsRatio - This is another way to estimate strand biasusing a test similar to the symmetric odds ratio test
MQ: RMSMappingQuality - This is the root mean square mapping qualityover all the reads at the site
MQRankSum: MappingQualityRankSumTest - This is the u-basedz-approximation from the Rank Sum Test for mapping qualities
ReadPosRankSum: ReadPosRankSumTest: This is the u-basedz-approximation from the Rank Sum Test for site position within reads
Author(s)
Piyal Karunarathne
Examples
## Not run: vcf.file.path <- paste0(path.package("rCNV"), "/example.raw.vcf.gz")vcf <- readVCF(vcf.file.path=vcf.file.path)statistics<-vcf.stat(vcf,plot=TRUE)## End(Not run)Calculate population-wise Vst
Description
This function calculates Vst (variant fixation index) for populations givena list of duplicated loci
Usage
vst(AD, pops, id.list = NULL, qGraph = TRUE, verbose = TRUE, ...)Arguments
AD | data frame of total allele depth values of (duplicated, if |
pops | character. A vector of population names for each individual.Must be the same length as the number of samples in AD |
id.list | character. A vector of duplicated SNP IDs. Must match the IDsin the AD data frame |
qGraph | logical. Plot the network plot based on Vst values(see details) |
verbose | logical. show progress |
... | additional arguments passed to |
Details
Vst is calculated with the following equation
V_{T} = \frac{ V_{S} }{V_{T}}
where VT is the variance of normalizedread depths among all individuals from the two populations and VS is theaverage of the variance within each population, weighed for population size(see reference for more details)Seeqgraph help for details on qgraph output
Value
Returns a matrix of pairwise Vst values for populations
Author(s)
Piyal Karunarathne
References
Redon, Richard, et al. Global variation in copy number in the human genome.nature 444.7118 (2006)
Examples
## Not run: data(alleleINF)data(ADtable)DD<-dupGet(alleleINF)ds<-DD[DD$dup.stat=="deviant",]ad<-ADtable[match(paste0(ds$CHROM,".",ds$POS),paste0(ADtable$CHROM,".",ADtable$POS)),]vst(ad,pops=substr(colnames(ad)[-c(1:4)],1,11))## End(Not run)Run permutation on Vst
Description
This function runs a permutation test on Vst calculation
Usage
vstPermutation( AD, pops, nperm = 100, histogram = TRUE, stat = 2, qGraph = TRUE)Arguments
AD | data frame of total allele depth values of SNPs |
pops | character. A vector of population names for each individual.Must be the same length as the number of samples in AD |
nperm | numeric. Number of permutations to perform |
histogram | logical. plots the distribution histogram of permuted vst values vs. observed values |
stat | numeric. The stat to be plotted in histogram. 1 for Mean Absolute Distance or 2 ( |
qGraph | logical. Plot the network plot based on observed Vst values(see |
Value
Returns a list with observed vst values, an array of permuted vst values and the p-values for the permutation test
Author(s)
Jorge Cortés-Miranda (email:jorge.cortes.m@ug.uchile.cl), Piyal Karunarathne
Examples
## Not run: data(alleleINF)data(ADtable)DD<-dupGet(alleleINF)ds<-DD[DD$dup.stat=="deviant",]ad<-ADtable[match(paste0(ds$CHROM,".",ds$POS),paste0(ADtable$CHROM,".",ADtable$POS)),]vstPermutation(ad,pops=substr(colnames(ad)[-c(1:4)],1,11))## End(Not run)