flipr: Flexible Inference via Permutations in R


The goal of the flipr package is to provide a flexible framework for making inference via permutation. The idea is to promote the permutation framework as an incredibly well-suited tool for inference on complex data. You supply your data, as complex as it might be, in the form of lists in which each entry stores one data point in a representation that suits you, and flipr takes care of the permutation magic and provides you with either point estimates, confidence regions, or $p$-values of hypothesis tests. Permutation tests are especially appealing because they are exact no matter how small or big your sample sizes are. You can also use the so-called non-parametric combination approach in this setting to combine several statistics to better target the alternative hypothesis you are testing against. Asymptotic consistency is also guaranteed under mild conditions on the statistic you use. The flipr package provides a flexible permutation framework for making inference such as point estimation, confidence intervals or hypothesis testing, on any kind of data, be it univariate, multivariate, or more complex such as network-valued data, topological data, functional data or density-valued data.

Installation

You can install the package from CRAN with:

install.packages("flipr")

Alternatively, you can install the development version of flipr from GitHub with:

# install.packages("pak")pak::pak("permaverse/flipr")

Example

library(flipr)

We hereby use the very simple t-test for comparing the means of two univariate samples to show how easy it is to carry out a permutation test with flipr.

Data generation

Let us first generate two samples of size $15$ governed by Gaussian distributions with equal variance but different means:

set.seed(123)
n <- 15
x <- rnorm(n = n, mean = 0, sd = 1)
y <- rnorm(n = n, mean = 1, sd = 1)

Given the data we simulated, the parameter of interest here is the difference between the means of the distributions, say $\delta = \mu_y - \mu_x$.

Make the two samples exchangeable under $H_0$

In the context of null hypothesis testing, we consider the null hypothesis $H_0: \mu_y - \mu_x = \delta$. We can use a permutation scheme to approach the $p$-value if the two samples are exchangeable under $H_0$. This means that we need to transform, for example, the second sample to make it exchangeable with the first sample under $H_0$. In this simple example, this can be achieved as follows. Let $X_1, \dots, X_{n_x} \sim \mathcal{N}(\mu_x, 1)$ and $Y_1, \dots, Y_{n_y} \sim \mathcal{N}(\mu_y, 1)$. We can then transform the second sample as $Y_i \longleftarrow Y_i - \delta$.

We can define a proper function to do this, termed the null specification function, which takes two input arguments:

  • y which is a list storing the data points in the second sample;
  • parameters which is a numeric vector of values for the parameters under investigation (here only $\delta$ and thus parameters is of length $1$ with parameters[1] = delta).

In our simple example, it boils down to:

null_spec <- function(y, parameters) {
  purrr::map(y, ~ .x - parameters[1])
}
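
As a quick sanity check, we can apply the null specification function to the simulated second sample with $\delta = 1$ (the true mean difference used in the simulation), which should approximately remove the mean shift:

# Shift the second sample by delta = 1 so that, under H0, it becomes
# exchangeable with the first sample
y_shifted <- null_spec(as.list(y), parameters = 1)
mean(unlist(y_shifted)) # should be close to mean(x)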

Choose suitable test statistics

Next, we need to decide which test statistic(s) we are going to use for performing the test. Here, we are only interested in one parameter, namely the mean difference $\delta$. Since the two samples share the same variance, we can use for example the $t$-statistic with a pooled estimate of the common variance.
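
For reference, with sample means $\bar{x}$ and $\bar{y}$ and sample variances $s_x^2$ and $s_y^2$, this statistic takes the usual form:

$$ t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}, \quad \mbox{where} \quad s_p^2 = \frac{(n_x - 1) s_x^2 + (n_y - 1) s_y^2}{n_x + n_y - 2}. $$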

This statistic can be easily computed using stats::t.test(x, y, var.equal = TRUE)$statistic. However, we want to extend its evaluation to any permuted version of the data. Test statistic functions compatible with flipr should have at least two mandatory input arguments:

  • data which is either a concatenated list of size $n_x + n_y$ regrouping the data points of both samples or a distance matrix of size $(n_x + n_y) \times (n_x + n_y)$ stored as an object of class dist;
  • indices1 which is an integer vector of size $n_x$ storing the indices of the data points belonging to the first sample in the current permuted version of the data.

Some test statistics are already implemented in flipr and ready to use. User-defined test statistics can be used as well, with the use of the helper function use_stat(nsamples = 2, stat_name = ). This function creates and saves an .R file in the R/ folder of the current working directory and populates it with the following template:

#' Test Statistic for the Two-Sample Problem
#'
#' This function computes the test statistic...
#'
#' @param data A list storing the concatenation of the two samples from which
#'   the user wants to make inference. Alternatively, a distance matrix stored
#'   in an object of class \code{\link[stats]{dist}} of pairwise distances
#'   between data points.
#' @param indices1 An integer vector that contains the indices of the data
#'   points belonging to the first sample in the current permuted version of
#'   the data.
#'
#' @return A numeric value evaluating the desired test statistic.
#' @export
#'
#' @examples
#' # TO BE DONE BY THE DEVELOPER OF THE PACKAGE
stat_{{{name}}} <- function(data, indices1) {
  n <- if (inherits(data, "dist"))
    attr(data, "Size")
  else if (inherits(data, "list"))
    length(data)
  else
    stop("The `data` input should be of class either list or dist.")
  indices2 <- seq_len(n)[-indices1]
  x <- data[indices1]
  y <- data[indices2]
  # Here comes the code that computes the desired test
  # statistic from input samples stored in lists x and y
}
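
Such a file can be generated with a call like the following, where the statistic name is only an illustration (any valid R name works):

# Creates R/ file pre-filled with the template above; "my_custom" is a
# hypothetical name chosen for this illustration
flipr::use_stat(nsamples = 2, stat_name = "my_custom")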

For instance, a flipr-compatible version of the $t$-statistic with pooled variance will look like:

my_t_stat <- function(data, indices1) {
  n <- if (inherits(data, "dist"))
    attr(data, "Size")
  else if (inherits(data, "list"))
    length(data)
  else
    stop("The `data` input should be of class either list or dist.")
  indices2 <- seq_len(n)[-indices1]
  x <- data[indices1]
  y <- data[indices2]
  # Here comes the code that computes the desired test
  # statistic from input samples stored in lists x and y
  x <- unlist(x)
  y <- unlist(y)
  stats::t.test(x, y, var.equal = TRUE)$statistic
}
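
To check that this function behaves as expected, we can evaluate it on the identity permutation (the first $n_x$ indices pointing to the first sample), which should reproduce the value returned by stats::t.test() on the original data:

# Concatenate the two samples into a single list and use the identity
# permutation: the first n indices correspond to the first sample
my_t_stat(c(as.list(x), as.list(y)), indices1 = seq_len(n))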

Here, we are only going to use the $t$-statistic for this example, but we might be willing to use more than one statistic for a parameter or we might have several parameters under investigation, each one of them requiring a different test statistic. We therefore group all the test statistics that we need into a single list:

stat_functions <- list(my_t_stat)

Assign test statistics to parameters

Finally, we need to define a named list that tells flipr which test statistics among the ones declared in the stat_functions list should be used for each parameter under investigation. This is used to determine bounds on each parameter for the plausibility function. This list, often termed stat_assignments, should therefore have as many elements as there are parameters under investigation. Each element should be named after a parameter under investigation and should list the indices corresponding to the test statistics that should be used for that parameter in stat_functions. In our example, it boils down to:

stat_assignments <- list(delta = 1)
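
For illustration only (this is not part of the running example), if a second parameter, say a hypothetical variance ratio rho, were also under investigation with its own statistic stored in position 2 of stat_functions, the assignment could look like:

# Hypothetical two-parameter setup: delta uses the first statistic in
# `stat_functions`, rho the second one (not used in the rest of this example)
list(delta = 1, rho = 2)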

Use the plausibility function

Now we can instantiate a plausibility function as follows:

pf <- PlausibilityFunction$new(
  null_spec = null_spec,
  stat_functions = stat_functions,
  stat_assignments = stat_assignments,
  x, y
)
#> ! Setting the seed for sampling permutations is mandatory for obtaining a continuous p-value function. Using `seed = 1234`.

Now, assume we want to test the following hypotheses:

$$ H_0: \delta = 0 \quad \mbox{vs.} \quad H_1: \delta \ne 0. $$

We use the $get_value() method for this purpose, which essentially evaluates the permutation $p$-value of a two-sided test by default:

pf$get_value(0)
#> [1] 0.1078921

We can compare the resulting $p$-value with the one obtained using the more classic parametric test:

t.test(x, y, var.equal = TRUE)$p.value
#> [1] 0.1030946

The permutation $p$-value does not quite match the parametric one. This is for two reasons:

  1. The resolution of a permutation $p$-value is of the order of $1/(B+1)$, where $B$ is the number of sampled permutations (see the sketch after this list). By default, the plausibility function is instantiated with $B = 1000$:

pf$nperms
#> [1] 1000

  2. We randomly sample $B$ permutations out of the $\binom{n_x+n_y}{n_x}$ possible permutations and therefore introduce extra variability in the $p$-value.
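
To make the first point concrete, here is a minimal sketch (not flipr's internal code) of how a two-sided permutation $p$-value could be computed from $B$ randomly sampled permutations; by construction it is a multiple of $1/(B+1)$:

B <- 1000
t_obs <- my_t_stat(c(as.list(x), as.list(y)), indices1 = seq_len(n))
t_perm <- replicate(B, {
  # Randomly assign n of the 2n pooled observations to the first sample
  idx1 <- sample(2 * n, n)
  my_t_stat(c(as.list(x), as.list(y)), indices1 = idx1)
})
# Two-sided p-value: proportion of permuted statistics at least as extreme
# as the observed one, counting the observed statistic itself
p_value <- (1 + sum(abs(t_perm) >= abs(t_obs))) / (B + 1)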

If we were to ask for more permutations, say $B = 1,000,000$, we would be much closer to the parametric $p$-value:

pf$set_nperms(1000000)
pf$get_value(0)
#> [1] 0.1029879
