
Title:

Significant Interval Discovery with Categorical Covariates

Version:

0.2.7

Author:

Felipe Llinares Lopez, Dean Bodenham

Maintainer:

Dean Bodenham <deanbodenhambsse@gmail.com>

Description:

A method which uses the Cochran-Mantel-Haenszel test with significant pattern mining to detect intervals in binary genotype data which are significantly associated with a particular phenotype, while accounting for categorical covariates.

Depends:

R (≥ 3.3.0), bindata

License:

GPL-2 | GPL-3

LinkingTo:

Rcpp

Imports:

Rcpp

Suggests:

testthat

RoxygenNote:

5.0.1

SystemRequirements:

C++11

NeedsCompilation:

yes

Packaged:

2016-09-13 16:11:45 UTC; dean

Repository:

CRAN

Date/Publication:

2016-09-13 21:19:17

Demo of fastcmh

Description

This function runs a demo for fastcmh, by first creating a sample data set and then running fastcmh on this data set.

Usage

demofastcmh(saveToFolder = FALSE, folder = NULL)

Arguments

saveToFolder

A flag indicating whether or not the data files created for the demo should be saved to file. The default is FALSE, i.e. no files are saved to the folder. The only reason to save demo data to a folder is for the user to be able to have a look at the files after the demo.

folder

The folder in which the data for the demo will be saved. Default is the current directory, "./". The demo data will be created in folder/data and the results will be saved in folder/results as an RData file.

Details

This function will first create a sample data set in folder/data, and will then run runfastcmh on this data set, before saving the results in folder/results. The demo proceeds in steps, with each step showing the R code that can be used to do the step, then running that R code, and then waiting for the user to press enter before moving on to the next step. If saveToFolder=FALSE (default), then no files are saved and all the results are kept in memory.

Examples

demofastcmh()

Create sample data for fastcmh

Description

This function creates sample data for use with the runfastcmh method.

Usage

makefastcmhdata(folder = "./", xfilename = "data.txt",  yfilename = "label.txt", covfilename = "cov.txt", K = 2, L = 1000,  n = 200, noiseP = 0.3, corruptP = 0.05, rho = 0.8, tau1 = 100,  taulength1 = 4, tau2 = 200, taulength2 = 4, seednum = 2,  truetaufilename = "truetau.txt", showOutput = FALSE, saveToList = FALSE)

Arguments

folder

The folder in which the data will be saved. Default is the current directory, "./".

xfilename

The name of the data file. Default is "data.txt".

yfilename

The name of the label file. Default is "label.txt".

covfilename

The name of the file containing the covariate categories. This file actually just contains K numbers, where K is the number of covariates. Default is "cov.txt".

K

The number of covariates (a positive integer). Default is K=2.

L

The number of features (length of each sequence). Default is L=1000.

n

The number of samples (cases and controls combined). Default is n=200, i.e. 100 cases and 100 controls.

noiseP

The background noise in the data (as a probability of 0/1 being flipped). Default is noiseP=0.3.

corruptP

The probability of data corruption: each bit has probability corruptP of being flipped. Default is corruptP=0.05.

rho

The strength of the confounding in the confounded interval (as a probability). Default is rho=0.8 (i.e. a very strong signal).

tau1

The location of the significant interval (starting point). Default value is tau1=100.

taulength1

The length of the significant interval. Default value is taulength1=4, so the default significant interval is [100, 103].

tau2

The location of the confounded significant interval (starting point). Default value is tau2=200.

taulength2

The length of the confounded significant interval. Default value is taulength2=4, so the default confounded interval is [200, 203].

seednum

The seed used for generating the data. Default value is seednum=2.

truetaufilename

The file where the locations of the true significant intervals are saved (as opposed to the detected significant intervals). Default is "truetau.txt".

showOutput

Flag to decide whether or not to show output, such as where files are created, their names, etc. Default is FALSE.

saveToList

Flag to decide whether to save the data to the folder or to return (output) the data as a list. By default, saveToList=FALSE, so the data is saved to the folder. However, all of the examples use saveToList=TRUE in order to avoid writing to file. When saveToList=TRUE, the list consists of the data, label and cov data frames.
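The noiseP and corruptP arguments are both described as per-bit flip probabilities. As a rough illustration of what that means, the base R sketch below flips each entry of a binary matrix independently with a given probability. The helper flip_bits is hypothetical and is not part of fastcmh; it does not claim to reproduce how makefastcmhdata generates its data.

```r
# Sketch only: flip each entry of a 0/1 matrix independently with probability p.
# flip_bits is a hypothetical helper, not a function exported by fastcmh.
flip_bits <- function(x, p) {
  stopifnot(is.matrix(x), all(x %in% c(0L, 1L)), p >= 0, p <= 1)
  # Bernoulli(p) mask: 1 means "flip this bit", 0 means "leave it alone"
  flips <- matrix(rbinom(length(x), size = 1, prob = p), nrow = nrow(x))
  # addition modulo 2 acts as XOR on 0/1 values
  (x + flips) %% 2L
}

set.seed(2)
clean <- matrix(0L, nrow = 10, ncol = 20)  # 10 features (rows), 20 samples (columns)
noisy <- flip_bits(clean, p = 0.05)
mean(noisy != clean)                       # empirical proportion of flipped bits
```

With p = 0 the matrix is returned unchanged, and with p = 1 every bit is inverted.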

Examples

#make a small sample data set, using the default parameters
mylist <- makefastcmhdata(showOutput=TRUE, saveToList=TRUE)

#make a very small sample data set
mylist <- makefastcmhdata(n=20, L=10, tau1=2, taulength1=2,
       tau2=6, taulength2=2, saveToList=TRUE)

Run the fastcmh algorithm

Description

This function runs the FastCMH algorithm on a particular data set.

Usage

runfastcmh(folder = NULL, data = NULL, label = NULL, cov = NULL,  alpha = 0.05, Lmax = 0, showProcessing = FALSE, saveAllPvals = FALSE,  doFDR = FALSE, useDependenceFDR = FALSE, saveToFile = FALSE,  saveFilename = "fastcmhresults.RData", saveFolder = NULL)

Arguments

folder

The folder in which the data is saved. If any of the data, label and cov arguments are not specified, then the files must follow a naming convention inside the folder: the data file is "data.txt" (i.e. the full path is "folder/data.txt"), the phenotype label file is label.txt, and the covariate label file is cov.txt. More details on the structure of these files are given below, or the user can use the makefastcmhdata function to see an example of the correct data formats. If folder="/data/", the data in fastcmh/inst/extdata is used.

data

The filename for the data file. Default is NULL. The data file must be an L x n txt file containing only 0s and 1s, which are space-separated in each row, while each row is on a separate line.

label

The filename for the phenotype label file. Default is NULL. The label file should consist of a single column (i.e. each row is on a separate line) of 0s and 1s.

cov

The filename for the covariate label file. Default is NULL. The cov file contains a single column of positive integers. The first row, containing value n_1, specifies that the first n_1 columns of the data file have covariate value 1; the second row, containing n_2, specifies that the next n_2 columns have covariate value 2, etc.

alpha

The value of the FWER; must be a number between 0 and 1. Default is alpha=0.05.

Lmax

The maximum length of significant intervals which is considered. Must be a non-negative integer. For example, Lmax=10 searches for significant intervals up to length 10. Setting Lmax=0 will search for significant intervals up to any length (with the algorithm pruning appropriately). Default is Lmax=0.

showProcessing

A flag which will turn printing to screen on/off. Default is FALSE (which is “off”).

saveAllPvals

A flag which controls whether or not all the intervals (less than minimum attainable pvalue) will be returned. Default is FALSE (which is “no, do not return all intervals”).

doFDR

A flag which controls whether or not Gilbert's Tarone FDR procedure (while accounting for positive regression dependence) is performed. Default is FALSE (which is “no, do not do FDR”).

useDependenceFDR

A flag which controls whether or not Gilbert's Tarone FDR procedure uses the dependent formulation by Benjamini and Yekutieli (2001), which further adjusts alpha by dividing by the harmonic sum 1 + 1/2 + ... + 1/m over the m hypotheses. This flag is only used if doFDR==TRUE. Default is FALSE.

saveToFile

A flag which controls whether or not the results are saved to file. By default, saveToFile=FALSE, and the list of results is returned in R. See the examples below.

saveFilename

A string which gives the filename to which the output is saved (needs to have saveToFile=TRUE) as an RData file. Default is "fastcmhresults.RData".

saveFolder

A string which gives the path to which the output will be saved (needs to have saveToFile=TRUE). Default is "./".
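To make the useDependenceFDR adjustment concrete: the Benjamini and Yekutieli (2001) correction for arbitrary dependence divides the target level by the sum 1 + 1/2 + ... + 1/m, where m is the number of hypotheses. The sketch below shows only that arithmetic in base R. The helper name by_adjusted_alpha is hypothetical, and how fastcmh chooses m internally is not stated here, so m is simply an input.

```r
# Sketch: Benjamini-Yekutieli (2001) adjustment of a target level alpha
# for m hypotheses. by_adjusted_alpha is a hypothetical helper name and
# is not part of the fastcmh package.
by_adjusted_alpha <- function(alpha, m) {
  stopifnot(alpha > 0, alpha < 1, m >= 1)
  harmonic_sum <- sum(1 / seq_len(m))  # 1 + 1/2 + ... + 1/m
  alpha / harmonic_sum
}

by_adjusted_alpha(0.05, m = 1)    # unchanged when there is a single hypothesis
by_adjusted_alpha(0.05, m = 100)  # stricter level for 100 hypotheses
```

The adjusted level shrinks slowly as m grows, because the harmonic sum grows roughly like log(m).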

Details

This function runs the FastCMH algorithm on a particular data set in order to discover intervals that are statistically significantly associated with a particular label, while accounting for categorical covariates.
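For intuition about what testing a single interval involves, the sketch below applies the Cochran-Mantel-Haenszel test from base R (stats::mantelhaen.test) to one candidate interval of simulated data. It assumes an interval is summarised per sample by whether any feature inside it equals 1; that encoding, the helper interval_cmh_pvalue and the simulated data are illustrative assumptions. This is not the package's C++ implementation, and it applies no multiple-testing correction.

```r
# Sketch: CMH test for one candidate interval, using base R only.
# Assumption: an interval is "present" in a sample if any feature in
# [start, end] equals 1. interval_cmh_pvalue is a hypothetical helper.
set.seed(1)
n <- 200                                     # samples
L <- 50                                      # features
label <- rep(c(0L, 1L), each = n / 2)        # 100 controls, then 100 cases
covariate <- rep(c(1L, 2L), times = n / 2)   # two covariate classes, alternating
X <- matrix(rbinom(L * n, size = 1, prob = 0.1), nrow = L)  # L x n binary data

# Plant a signal at feature 10 among the cases (toy values)
X[10, label == 1L] <- rbinom(sum(label == 1L), size = 1, prob = 0.9)

interval_cmh_pvalue <- function(X, label, covariate, start, end) {
  present <- as.integer(colSums(X[start:end, , drop = FALSE]) > 0)
  test <- mantelhaen.test(
    x = factor(present, levels = 0:1),
    y = factor(label, levels = 0:1),
    z = factor(covariate)                    # strata: one 2x2 table per class
  )
  test$p.value
}

p_signal <- interval_cmh_pvalue(X, label, covariate, start = 10, end = 13)
p_noise  <- interval_cmh_pvalue(X, label, covariate, start = 30, end = 33)
```

Repeating such a test naively over every interval would be slow and would need a correction for the number of tests, which is the problem the package's pattern-mining approach addresses.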

The user must either supply the folder, which contains files named "data.txt", "label.txt" and "cov.txt", or the non-default filenames must be specified individually. See the descriptions of arguments data, label and cov to see the format of the input files, or make a small sample data file using the makefastcmhdata function. By default, filtered results are provided. The user also has the option of using an FDR procedure rather than the standard FWER-preserving procedure.
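The file layout described above can be produced with base R alone. The following sketch writes a tiny data set into a temporary folder using the default file names; the values are arbitrary toy values chosen only to show the shapes, and out_dir is a hypothetical location.

```r
# Sketch: write toy input files in the layout described for runfastcmh.
# All values are arbitrary; out_dir is a hypothetical temporary folder.
out_dir <- file.path(tempdir(), "fastcmh_toy")
dir.create(out_dir, showWarnings = FALSE)

L <- 5                                   # number of features (rows of data.txt)
n <- 6                                   # number of samples (columns of data.txt)
X <- matrix(rep(c(0L, 1L), length.out = L * n), nrow = L)
label <- c(0L, 1L, 0L, 1L, 0L, 1L)       # one 0/1 phenotype label per sample
cov_counts <- c(3L, 3L)                  # first 3 columns: class 1, next 3: class 2

# data.txt: L rows, each with n space-separated 0/1 values
write.table(X, file.path(out_dir, "data.txt"),
            sep = " ", row.names = FALSE, col.names = FALSE)
# label.txt: a single column of 0s and 1s, one per line
writeLines(as.character(label), file.path(out_dir, "label.txt"))
# cov.txt: a single column of class sizes, which must sum to n
writeLines(as.character(cov_counts), file.path(out_dir, "cov.txt"))

stopifnot(sum(cov_counts) == n, length(label) == n)
```

Because cov.txt stores class sizes rather than one value per sample, the columns of data.txt and the rows of label.txt must already be ordered by covariate class.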

Value

runfastcmh will return a list if saveToFile=FALSE (default setting), otherwise it will save the list in an .RData file. The fields of the list are:

sig: a dataframe listing the significant intervals, after filtering. Columns start, end and pvalue indicate the start and end points of the interval (inclusive), and the p-value for that interval.
unfiltered: a dataframe listing all the significant intervals before filtering. The filtering compares the overlapping intervals and returns the interval with the smallest p-value in each cluster of overlapping intervals. Dataframe has the same structure as sig.
fdr: (if doFDR==TRUE) significant intervals using Gilbert's FDR-Tarone procedure, after filtering. Dataframe has the same structure as sig.
unfilteredFdr: (if doFDR==TRUE) a dataframe listing all the significant intervals before filtering. See description of unfiltered.
allTestablle: (if saveAllPvals==TRUE) a dataframe listing all the testable intervals, many of which will not be significant. Dataframe has the same structure as sig.
histObs: Together with histFreq, gives a histogram of maximum attainable CMH statistics.
histFreq: Histogram of maximum attainable CMH statistics (only reliable in the testable range).
summary: a character string summarising the results. Use cat(...$summary) to print the results with the correct indentation/new lines.
timing: a list containing (i) details, a character string summarising the runtime values for the experiment - use cat(...$timing$details) for correct indentation, etc. (ii) exec, the total execution time. (iii) init, the time to initialise the objects. (iv) fileIO, the time to read the input files. (v) compSigThresh, the time to compute the significance threshold. (vi) compSigInt, the time to compute the significant intervals.
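The filtering that turns unfiltered into sig is described as keeping the smallest p-value within each cluster of overlapping intervals. A minimal base R sketch of that idea follows. The helper filter_intervals and the toy intervals are hypothetical, and the package's own filtering may differ in details such as tie-breaking.

```r
# Sketch: keep the interval with the smallest p-value in each cluster of
# overlapping intervals. filter_intervals is a hypothetical helper and the
# toy data frame below is made up purely for illustration.
filter_intervals <- function(df) {
  if (nrow(df) == 0) return(df)
  df <- df[order(df$start, df$end), , drop = FALSE]
  # furthest end point reached by any interval seen so far
  reach <- cummax(df$end)
  # a new cluster starts when an interval begins after all earlier ones ended
  new_cluster <- c(TRUE, df$start[-1] > reach[-nrow(df)])
  cluster <- cumsum(new_cluster)
  # index of the smallest p-value within each cluster
  keep <- unlist(lapply(split(seq_len(nrow(df)), cluster),
                        function(idx) idx[which.min(df$pvalue[idx])]))
  out <- df[keep, , drop = FALSE]
  rownames(out) <- NULL
  out
}

toy <- data.frame(
  start  = c(100, 101, 102, 200, 300),
  end    = c(103, 102, 105, 203, 300),
  pvalue = c(1e-8, 1e-6, 1e-5, 1e-7, 1e-4)
)
filtered <- filter_intervals(toy)
filtered   # one row per cluster of overlapping intervals
```

End points are treated as inclusive, so two intervals belong to the same cluster only if they share at least one position, directly or through a chain of overlaps.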

Author(s)

Felipe Llinares Lopez, Dean Bodenham

References

Gilbert, P. B. (2005). A modified false discovery rate multiple-comparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1), 143-158.

Benjamini, Y., Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165-1188.

Examples

#Example with default naming convention used for data, label and cov files
# Note: using "/data/" as the argument for folder
#       accesses the data/ directory in the fastcmh package folder
mylist <- runfastcmh("/data/")

#Example where the progress will be shown
mylist <- runfastcmh(folder="/data/", showProcessing=TRUE)

#Example where many parameters are specified
mylist <- runfastcmh(folder="/data/", data="data2.txt", alpha=0.01, Lmax=7)

#Example where Gilbert's Tarone-FDR procedure is used
mylist <- runfastcmh("/data/", doFDR=TRUE)

#Example where FDR procedure takes some dependence structures into account
mylist <- runfastcmh("/data/", doFDR=TRUE, useDependenceFDR=TRUE)
