This package brings together a number of routines for the two sample problem for univariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.
The highlights of this package are:
We generate two data sets of size 100 and 120, respectively, from standard normal distributions and run the test:
```r
x1 = rnorm(100)
y1 = rnorm(120)
twosample_test(x1, y1, B=500)
#> $statistics
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.1383   0.1666   0.3363   1.6834   0.6655  -3.2530   2.4087  -3.1940 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.2224  16.0900   5.1700  34.8500   8.8000 
#> 
#> $p.values
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.2020   0.3560   0.0920   0.1280   0.1040   0.3540   0.4000   0.3420 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.1540   0.8846   0.5228   0.2482   0.4559
```
In this case the null hypothesis is true: both data sets were generated by a standard normal distribution. And indeed, all 13 tests have p values much larger than (say) 0.05 and therefore correctly fail to reject the null hypothesis.
Alternatively one might wish to carry out a select number of tests, but then find a correct p value for the combined test. This can be done as follows:
```r
twosample_test_adjusted_pvalue(x1, y1)
#> p values of individual tests:
#> ES small :  0.5228
#> ZA :  0.6384
#> ZK :  0.5766
#> Wassp1 :  0.8618
#> Kuiper :  0.6124
#> adjusted p value of combined tests: 0.969
```
By default for continuous data this runs five tests (a chi square test with a small number of bins, ZA, ZK, Kuiper and Wassp1). It then finds the smallest individual p value (here 0.5228) and adjusts it for the multiple testing problem, resulting in an overall p value of 0.969.
Say we wish to use two new tests for continuous data. One is based on the difference in standard deviations and the other on the difference in standardized means:
\[\begin{aligned}
&TS_1 = sd(x) - sd(y) \\
&TS_2 = \bar{x}/sd(x) - \bar{y}/sd(y)
\end{aligned}\]
```r
myTS1 = function(x, y) {
  out = c(0, 0)
  out[1] = abs(sd(x) - sd(y))
  out[2] = abs(mean(x)/sd(x) - mean(y)/sd(y))
  names(out) = c("std", "std t test")
  out
}
```
Then we can simply run
```r
twosample_test(x1, y1, TS=myTS1, B=500)
#> $statistics
#>        std std t test 
#>    0.06396    0.18060 
#> 
#> $p.values
#>        std std t test 
#>      0.512      0.192
```
The arguments of the routine have to be x and y for the two data sets and (optionally) a list called TSextra for any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
Here we generate two data sets of size 1000 and 1200 from binomial distributions with 5 trials and success probabilities of 0.5 and 0.55, respectively.
```r
x2 = table(c(0:5, rbinom(1000, 5, 0.5))) - 1
y2 = table(c(0:5, rbinom(1200, 5, 0.55))) - 1
rbind(x2, y2)
#>     0   1   2   3   4  5
#> x2 28 164 297 345 132 34
#> y2 15 120 335 423 251 56
twosample_test(x2, y2, vals=0:5, B=500)$p.values
#>     KS Kuiper    CvM     AD     LR     ZA Wassp1  large  small 
#>      0      0      0      0      0      0      0      0      0
```
In this case the null hypothesis is false, and all nine tests for discrete data have p values much smaller than (say) 0.05 and therefore correctly reject the null hypothesis.
Notice, it is the presence of the vals argument that tells the twosample_test command that the data is discrete. The vectors x and y are the counts. Note that the lengths of the three vectors have to be the same, and no value of vals is allowed for which both x and y are 0.
Again we might want to run a different test, say one based on the difference in standardized means. This time we will implement the new test using Rcpp:
(For reasons to do with submission to CRAN this routine is already part of R2sample.)
```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector myTS2(IntegerVector x, IntegerVector y, NumericVector vals) {
  Rcpp::CharacterVector methods = CharacterVector::create("std t test");
  int const nummethods = methods.size();
  int k = x.size(), n = sum(x), i;
  double meanx = 0.0, meany = 0.0, sdx = 0.0, sdy = 0.0;
  NumericVector TS(nummethods);
  TS.names() = methods;
  // means of the two samples, computed from the counts
  for(i = 0; i < k; ++i) {
    meanx = meanx + x[i]*vals[i]/n;
    meany = meany + y[i]*vals[i]/n;
  }
  // standard deviations of the two samples
  for(i = 0; i < k; ++i) {
    sdx = sdx + x[i]*(vals[i] - meanx)*(vals[i] - meanx);
    sdy = sdy + y[i]*(vals[i] - meany)*(vals[i] - meany);
  }
  sdx = sqrt(sdx/(n-1));
  sdy = sqrt(sdy/(n-1));
  TS(0) = std::abs(meanx/sdx - meany/sdy);
  return TS;
}
```
Now
```r
twosample_test(x2, y2, vals=0:5, TS=R2sample::myTS2, B=500)
#> $statistics
#> std t test 
#>     0.2714 
#> 
#> $p.values
#> std t test 
#>          0
```
The arguments have to be x and y for the two data sets (the counts) and vals for the set of values where \(P(X=vals)>0\). Also (optionally) a list called TSextra can hold any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
For all the tests except the chi square variants the p values are found via the permutation method. The idea of a permutation test is simple. Say we have data sets \(x_1,..,x_n\) and \(y_1,..,y_m\). They are combined into one large data set, permuted and split again into sets of sizes n and m. Under the null hypothesis these new data sets come from the same distribution as the actual data. Therefore, by calculating the test statistics for them and repeating this many times, one can build up the distributions of the test statistics and find p values from them.
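To make this concrete, here is a minimal sketch of a permutation test in plain R for a single statistic (the two-sample Kolmogorov-Smirnov distance from ks.test); the number of permutations is illustrative and this is not the package's implementation:

```r
# minimal permutation test sketch (illustrative only)
perm_pvalue = function(x, y, nperm = 1000) {
  n = length(x)
  z = c(x, y)                                      # combined sample
  TS_obs = ks.test(x, y)$statistic                 # observed test statistic
  TS_perm = replicate(nperm, {
    zstar = sample(z)                              # permute the combined sample
    ks.test(zstar[1:n], zstar[-(1:n)])$statistic   # statistic for the permuted split
  })
  mean(TS_perm >= TS_obs)                          # permutation p value
}
perm_pvalue(rnorm(100), rnorm(120))
```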
In the discrete case the situation is somewhat more complicated. Say we have data sets with values \(v_1,..,v_k\) and counts \(x_1,..,x_k\), \(y_1,..,y_k\). One can then mimic the description above by expanding these to yield a large data set with \(x_1+y_1\) \(v_1\)'s, \(x_2+y_2\) \(v_2\)'s and so on. Then this vector is permuted and split as described above. This method is used with the argument samplingmethod="independence". The drawback of this sampling method is that its calculation time increases with the sample size.
Alternatively R2sample provides the following option. One can show that the permuted data sets have the following distribution: let \(n=\sum x_i\) and \(m=\sum y_i\), then
\[P(X=a|x,y)=\frac{\prod_{j=1}^k{{x_j+y_j}\choose{a_j}}}{{n+m}\choose{n}}\]
for any \(a\) such that \(0\le a_i \le x_i+y_i; i=1,..,k\) and \(\sum a=n\). Sampling from this distribution can be done as follows: set \(p=\sum x_i/\sum(x_i+y_i)\), the overall proportion of the data in the x data set. Next generate a new count \(x^{*}_i\) for the \(i^{th}\) bin with \(X^*_i\sim Binom(x_i+y_i,p)\), \(i=1,..,k\). If \(\sum x_i \ne \sum x^{*}_i\), randomly select a bin according to the probabilities \((x_1+y_1,..,x_k+y_k)\) and add or subtract 1 from \(x^{*}_i\) as needed. Repeat this until \(\sum x_i = \sum x^{*}_i\).
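A sketch of this sampling scheme in plain R (illustrative only; the package uses its own, faster implementation):

```r
# binomial sampling of new counts for the x data set (sketch of the scheme above)
sample_counts = function(x, y) {
  k = length(x)
  n = sum(x)
  p = n / sum(x + y)                        # overall proportion of data in x
  xstar = rbinom(k, x + y, p)               # new counts, bin by bin
  while(sum(xstar) != n) {                  # adjust until the total matches
    i = sample(k, 1, prob = x + y)          # pick a bin with probability ~ x_i + y_i
    if(sum(xstar) > n && xstar[i] > 0) xstar[i] = xstar[i] - 1
    if(sum(xstar) < n && xstar[i] < x[i] + y[i]) xstar[i] = xstar[i] + 1
  }
  list(x = xstar, y = x + y - xstar)
}
sample_counts(x2, y2)   # x2, y2 are the count vectors from the discrete example above
```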
The default is binomial sampling, with a simulation size of 5000 for testing and 1000 for power estimation.
The user can also provide a routine rnull, which generates new data under the null hypothesis. This is needed in the following situation: we wish to perform a goodness-of-fit test \(H_0:X\sim F(p)\), where \(F\) is some distribution function that depends on unknown parameters \(p\). These parameters can be estimated from the data by \(\hat{p}\). Unfortunately \(F\) is complicated and cannot be evaluated directly. We can, however, sample from \(F(\hat{p})\). So we generate a second data set \(y\) from \(F(\hat{p})\). In this setup the permutation method fails, and the p values have to be found via the parametric bootstrap.
Say we have a data set \(x\) and wish to test whether it comes from a normal distribution with unknown mean and standard deviation. Let's assume we do not know how to evaluate the normal cdf, so instead we sample from \(N(\bar{x}, sd(x))\):
```r
rnull = function(dta) {
  list(x=dta$x, y=rnorm(length(dta$y), mean(dta$x), sd(dta$x)))
}
twosample_test(x1, y1, rnull=rnull, B=500)
#> $statistics
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.1383   0.1666   0.3363   1.6834   0.6655  -3.2530   2.4087  -3.1940 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.2224  16.0900   5.1700  34.8500   8.8000 
#> 
#> $p.values
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.0540   0.1700   0.0180   0.0160   0.0240   0.0820   0.1320   0.0780 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.0160   0.8846   0.5228   0.2482   0.4559
```
The routine rnull has to have as its argument a list with the data sets and return an object of the same type.
In the continuous case we have a sample x of size n and a sample y of size m. In the discussion below both are assumed to be ordered. We denote by z the combined sample. In the discrete case we have a random variable with values v, and x and y are the counts. We denote by k the number of observed values.
We denote the edf's of the two data sets by \(\widehat{F}\) and \(\widehat{G}\), respectively. Moreover we denote the empirical distribution function of the combined data set by \(\widehat{H}\).
In the case of continuous data, it is binned into nbins[1] equal probability bins, with the default nbins[1]=100. In the case of discrete data the number of classes comes from vals, the values observed.
Then a standard chi square test is run and the p value is found using the usual chi square approximation.
Many previous studies have shown that a chi square test with a large number of classes often has very poor power. So in the case of continuous data this test uses nbins[2] equal probability bins, with the default nbins[2]=10. In the case of discrete data, if the number of classes exceeds nbins[2]=10, classes are combined until there are nbins[2]=10 classes.
In either case the p values are found using the usual chi-square approximation. The degrees of freedom are the number of bins. For a literature review of chi square tests see (W. Rolke 2021).
For all other tests the p values are found using a permutation test. They are:
This test is based on the quantity \(\max\{|\widehat{F}(x)-\widehat{G}(x)|:x\in\mathbf{R}\}\). In the continuous case (without ties) the function \(\widehat{F}(x)-\widehat{G}(x)\) either takes a jump of size \(1/n\) up at a point of the x data set, a jump of size \(1/m\) down at a point of the y data set, or is flat. Therefore the test statistic is
\[\max\left\{\left|\sum_{i=1}^{j} \left(\frac1n I\{z_i \in x\}-\frac1m I\{z_i \in y\}\right)\right| :j=1,..,n+m\right\}\]
In the discrete case the jumps have sizes \(x_i/n\) and \(y_i/m\), respectively, and the test statistic is
\[\max\left\{\left|\sum_{i=1}^{j} \left(\frac{x_i}{n}-\frac{y_i}{m}\right)\right| :j=1,..,k\right\}\]
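As an illustration, the continuous-case statistic can be computed directly from the running difference of the edf's (a sketch assuming no ties; this is not the package code):

```r
# two-sample KS statistic from the running edf difference (sketch, no ties assumed)
ks_stat = function(x, y) {
  n = length(x); m = length(y)
  z = sort(c(x, y))                          # combined ordered sample
  d = cumsum(ifelse(z %in% x, 1/n, -1/m))    # F^(z_j) - G^(z_j) at the ordered z's
  max(abs(d))
}
ks_stat(x1, y1)   # x1, y1 from above
```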
This test was first proposed in (Kolmogorov 1933) and (Smirnov 1939) and is one of the most widely used tests today.
There is a close connection between the edf's and the ranks of the x and y in the combined data set. Using this one can also calculate the KS statistic, and many of the statistics to follow, based on these ranks. This is used in many software routines, for example the ks.test routine in R. However, when applied to discrete data these formulas can fail badly because of the many ties, and so we will not use them in our routine.
This test is closely related to Kolmogorov-Smirnov, but it uses the sum of the largest positive and negative differences between the edf's as a test statistic:
\[T_x=\max\left\{\sum_{i=1}^{j} \left(\frac1n I\{z_i\in x\}-\frac1m I\{z_i \in y\}\right) :j=1,..,n+m\right\}\]
\[T_y=\min\left\{\sum_{i=1}^{j} \left(\frac1n I\{z_i\in x\}-\frac1m I\{z_i \in y\}\right) :j=1,..,n+m\right\}\]
and the test statistic is \(T_x-T_y\). The discrete case follows directly. This test was first proposed in (Kuiper 1960).
This test is based on the integrated squared difference between the edf's:
\[\frac{nm}{(n+m)^2}\sum_{i=1}^{n+m}\left( \hat{F}(z_i)-\hat{G}(z_i) \right)^2 \]
The extension of the Cramer-von Mises criterion to the two-sample problem is discussed in (T. W. Anderson 1962).
This test is based on
\[nm\sum_{i=1}^{n+m-1} \frac{\left(\hat{F}(z_i)-\hat{G}(z_i) \right)^2}{i(n+m-i)} \]
It was first proposed in (Theodore W. Anderson, Darling, et al. 1952).
Let \(r_i\) and \(s_i\) be the ranks of x and y in the combined sample; then the test statistic is given by
\[\frac1{nm(n+m)}\left[n\sum_{i=1}^n(r_i-1)^2+m\sum_{i=1}^m(s_i-1)^2\right]\]
See (Lehmann 1951) and (Rosenblatt 1952).
These tests are based on the paper (Zhang 2006). Note that the calculation of ZC is the one exception to the rule: even in the discrete data case the calculation of the test statistic requires loops of lengths n and m. The calculation time will therefore increase with the sample sizes, and for very large data sets this test might have to be excluded.
A test based on the Wasserstein p=1 metric. It compares the quantiles of the x's and y's in the combined data set. If n=m the test statistic in the continuous case is very simple:
\[ \frac1n\sum_{i=1}^n |x_i-y_i|\]
If \(n\ne m\) it is first necessary to find the respective quantiles of the x's and y's in the combined data set. In the discrete case the test statistic is
\[\sum_{i=1}^{k-1} \left\vert\sum_{j=1}^i\frac{x_j}{n}-\sum_{j=1}^i\frac{y_j}{m} \right\vert(v_{i+1}-v_i)\]
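A direct implementation of the discrete-case formula might look like this (a sketch, with x and y the count vectors and vals the values; not the package code):

```r
# discrete Wasserstein-1 statistic, computed directly from the formula above (sketch)
wassp1_disc = function(x, y, vals) {
  k  = length(vals)
  Fx = cumsum(x)/sum(x)                     # edf of x at vals
  Fy = cumsum(y)/sum(y)                     # edf of y at vals
  sum(abs(Fx[-k] - Fy[-k]) * diff(vals))    # sum over i = 1,..,k-1
}
wassp1_disc(x2, y2, 0:5)   # x2, y2 from the discrete example above
```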
For a discussion of the Wasserstein distance see (Vaserstein 1969).
The routine has the following arguments:
x and y are numeric vectors. In the continuous data case they are simply the data, in the discrete case the counts.
vals a numeric vector of values of the discrete random variable. If missing, continuous data is assumed. In the discrete case x, y and vals have to be vectors of equal length.
TS a routine to calculate a test statistic. If missing, the built-in tests are used.
TSextra a list to be passed to TS.
wx, wy weights for the data sets (optional).
B=5000 number of simulation runs for the permutation test. If B=0, only the test statistics are found.
nbins=c(50, 10) number of bins to use for the chi-square tests. In the continuous data case equal-probability bins are found via the quantiles of the combined data set. In the discrete case nbins[1] is changed to the number of values, so the chi-square test is done on the original data set. Then neighboring values are combined until there are nbins[2] classes.
minexpcount=5 minimum expected counts required for the chi square tests.
maxProcessor if >1, parallel computing is used.
UseLargeSample for large data sets (>10000), use methods that have large-sample theory to find p values.
samplingmethod independence or MCMC for discretedata
rnull routine to generate data for parametricbootstrap.
SuppressMessages=FALSE suppress informative messages.
doMethods a character vector giving the names of the methods to include.
Say for example we want to use only the KS and AD tests:
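A minimal sketch of such a call, using x1 and y1 from above (the method names are those shown in the earlier output; the result is not shown here):

```r
# restrict twosample_test to the Kolmogorov-Smirnov and Anderson-Darling tests
twosample_test(x1, y1, B=500, doMethods=c("KS", "AD"))
```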
A list with vectors statistics (the test statistics) and p.values.
The routine twosample_power allows the estimation of the power of the various tests, and plot_power draws the corresponding power graph with the methods sorted by their mean power.
Note that due to the use of permutation tests the power calculations are fairly slow. Because of the time constraints imposed by CRAN the following routines are not run. A full power study with (say) 25 alternatives and fairly large data sets might take several hours to run.
New tests can be used in the same way as above.
Say we wish to find the powers of the tests when one data set comes from a standard normal distribution with sample size 100 and the other from a t distribution with 1-10 degrees of freedom and sample size 200.
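A sketch of such a power study (not run here because of the CRAN time constraints mentioned above; B and the grid of degrees of freedom are illustrative):

```r
# power study: N(0,1) sample of size 100 vs t(df) sample of size 200
pwr = twosample_power(
  f  = function(df) list(x = rnorm(100), y = rt(200, df)),
  df = 1:10,      # passed on to f
  B  = 1000
)
plot_power(pwr, "degrees of freedom")
```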
The arguments TS, TSextra, nbins, minexpcount, UseLargeSample, samplingmethod, rnull, SuppressMessages, maxProcessor and B are the same as in twosample_test. In addition we have
f A function that generates data sets x and y, either continuous data or the counts of discrete variables. In the discrete case care needs to be taken that the resulting vectors have the same length as vals. The output has to be a list with elements x and y. In the case of discrete data the list also has to include a vector vals, the values of the discrete variable.
… arguments passed to f. The most common case would be a single vector, but two vectors or none are also possible.
alpha=0.05 type I error probability to use inpower calculation
The arguments of plot_power are a matrix of powers created by twosample_power, (optionally) Smooth=FALSE if no smoothing is desired, and the name of the variable shown on the x axis.
We wish to find the powers of the tests when the data set x consists of 100 observations from a binomial distribution with n=10 and p=0.5, and y of 120 observations from a binomial distribution with n=10 and a number of different success probabilities p. Note that in either case not all of the possible values 0-10 will actually appear in every data set, so we need to take some care to ensure that x, y and vals match every time:
```r
plot_power(twosample_power(
  function(p) {
    vals = 0:10
    x = table(c(vals, rbinom(100, 10, 0.5))) - 1   # make sure each vector has length 11
    y = table(c(vals, rbinom(120, 10, p))) - 1     # and names 0-10
    vals = vals[x + y > 0]                         # only vals with at least one count
    list(x = x[as.character(vals)], y = y[as.character(vals)], vals = vals)
  },
  p = seq(0.5, 0.6, length = 5)
), "p")
```
We want to assess the effect of the sample size when one data set comes from a standard normal and the other from a normal with mean 1 and standard deviation 1:
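A sketch of this sample-size study (again not run; the grid of sample sizes and B are illustrative):

```r
# power as a function of the (common) sample size, N(0,1) vs N(1,1)
pwr = twosample_power(
  f = function(n) list(x = rnorm(n), y = rnorm(n, 1, 1)),
  n = seq(20, 100, by = 20),   # passed on to f
  B = 1000
)
plot_power(pwr, "sample size")
```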
As no single test can be relied upon to consistently have good power, it is reasonable to employ several of them. We would then reject the null hypothesis if any of the tests does so, that is, if the smallest p value is less than the desired type I error probability \(\alpha\).
This procedure clearly suffers from the problem of simultaneous inference, and the true type I error probability will be much larger than \(\alpha\). It is however possible to adjust the p value so it does achieve the desired \(\alpha\). This can be done as follows:
We generate a number of data sets under the null hypothesis. Generally about 1000 will be sufficient. Then for each simulated data set we apply the tests we wish to include and record the smallest p value. Here is an example.
```r
pvals = matrix(0, 1000, 13)
for(i in 1:1000)
  pvals[i, ] = R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values
```
Next we find the smallest p value in each run for two selections of methods. One is the selection found to be best above, namely Zhang's ZC and ZK methods, the methods by Kuiper and Wasserstein as well as a chi square test with a small number of bins. As a second selection we use the methods by Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Cramer-von Mises and Lehmann-Rosenblatt, which for this null data turn out to be highly correlated.
```r
colnames(pvals) = names(R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values)
p1 = apply(pvals[, c("ZK", "ZC", "Wassp1", "Kuiper", "ES small")], 1, min)
p2 = apply(pvals[, c("KS", "Kuiper", "AD", "CvM", "LR")], 1, min)
```
Next we find the empirical distribution function for the two sets of p values and draw their graphs. We also add the curves for the case of four identical tests and the case of four independent tests, which of course is the Bonferroni correction. The data for the cdfs is in the inst/extdata directory of the package.
```r
tmp = R2sample::pvaluecdf
Tests = factor(c(rep("Identical Tests", nrow(tmp)),
                 rep("Correlated Selection", nrow(tmp)),
                 rep("Best Selection", nrow(tmp)),
                 rep("Independent Tests", nrow(tmp))),
               levels = c("Identical Tests", "Correlated Selection",
                          "Best Selection", "Independent Tests"),
               ordered = TRUE)
dta = data.frame(x = c(tmp[,1], tmp[,1], tmp[,1], tmp[,1]),
                 y = c(tmp[,1], tmp[,3], tmp[,2], 1 - (1 - tmp[,1])^4),
                 Tests = Tests)
ggplot2::ggplot(data = dta, ggplot2::aes(x = x, y = y, col = Tests)) +
  ggplot2::geom_line(linewidth = 1.2) +
  ggplot2::labs(x = "p value", y = "CDF") +
  ggplot2::scale_color_manual(values = c("blue", "red", "Orange", "green"))
```
The package includes the routine run.studies, which provides 20 case studies each for the continuous and the discrete case. This allows the user to easily compare the power of the methods included in the package to a different one of their choice.
Say a user wishes to study the performance of the chi-square test, implemented in chitest. Of course this test is already implemented in R2sample, so this is for illustration only. Say we wish to see its performance when compared to the tests implemented in R2sample for the case where the null hypothesis specifies a uniform distribution but the data actually comes from a linear distribution with slope s:
```r
chitest = function(x, y, TSextra) {
  nbins = TSextra$nbins
  nx = length(x); ny = length(y); n = nx + ny
  xy = c(x, y)
  bins = quantile(xy, (0:nbins)/nbins)
  Ox = hist(x, bins, plot=FALSE)$counts
  Oy = hist(y, bins, plot=FALSE)$counts
  tmp = sqrt(sum(Ox)/sum(Oy))
  chi = sum((Ox/tmp - Oy*tmp)^2/(Ox + Oy))
  pval = 1 - pchisq(chi, nbins - 1)
  out = ifelse(TSextra$statistic, chi, pval)
  names(out) = "ChiSquare"
  out
}
TSextra = list(nbins=5, statistic=FALSE)
pwr = R2sample::run.studies(chitest, "uniform.linear", TSextra=TSextra, With.p.value=TRUE)
#> Running case study uniform.linear.cont ...
R2sample::plot_power(pwr, "slope")
```
Arguments of run.studies:
Arguments of routine to calculate new test:
Continuous data:
x,y the two data sets
TSextra (optional): a list passed to the routine with any object needed to do the calculation.
Discrete data:
x,y the two data sets
vals: the possible values of the discrete random variable
TSextra (optional): a list passed to the routine with any object needed to do the calculation.
The output of the routine has to be a named vector with either the test statistic(s) or the p value(s).
The case studies are as follows. In each the first term specifies the model under the null hypothesis and the second the model under the alternative hypothesis.
uniform.linear \(\hspace{3cm}\) U[0,1] vs a linear model on [0,1] with slope s.
uniform.quadratic \(\hspace{2.5cm}\) U[0,1] vs a quadratic model with vertex at 0.5 and some curvature a.
uniform.bump \(\hspace{3.1cm}\) U[0,1] vs U[0,1]+N(0.5,0.05).
uniform.sine \(\hspace{3.3cm}\) U[0,1] vs U[0,1]+Sine wave.
beta22.betaaa \(\hspace{3cm}\) Beta(2,2) vs Beta(a,a).
beta22.beta2a \(\hspace{3cm}\) Beta(2,2) vs Beta(2,a).
normal.shift \(\hspace{3.5cm}\) N(0,1) vs N(\(\mu\),1).
normal.stretch \(\hspace{3.1cm}\) N(0,1) vs N(0,\(\sigma\)).
normal.t \(\hspace{4.2cm}\) N(0,1) vs t(df).
normal.outlier1 \(\hspace{3.1cm}\) N(0,1) vs N(0,1)+U[2,3].
normal.outlier2 \(\hspace{3.1cm}\) N(0,1) vs N(0,1)+U[-3,-2]+U[2,3].
exponential.gamma \(\hspace{2.3cm}\) Exp(1) vs Gamma(1,b).
exponential.weibull \(\hspace{2.5cm}\) Exp(1) vs Weibull(1,b).
exponential.bump \(\hspace{2.7cm}\) Exp(1) vs Exp(1)+N(0.5,0.05).
gamma.normal \(\hspace{3.1cm}\) Gamma(\(\mu\)) vs N(\(\bar{x}\), sd(x)); here the mean and standard deviation of the normal distribution are the sample mean and sample standard deviation of the x data set.
normal.normalmixture \(\hspace{2.1cm}\) N(0,1) vs N(\(-\mu\),1)+N(\(\mu\),1).
uniform.uniformmixture \(\hspace{1.9cm}\) U[0,1] vs \(\alpha\)U[0,1/2]+(1-\(\alpha\))U[1/2,1].
uniform.betamixture \(\hspace{2.5cm}\) U[0,1] vs \(\alpha\)U[0,1/2]+(1-\(\alpha\))Beta(2,2).
chisquare.noncentral \(\hspace{2.3cm}\) \(\chi^2(5)\) vs \(\chi^2(5, \tau)\).
uniform.triangular \(\hspace{2.9cm}\) U[0,1] vs triangular.
Generally a user will wish to apply their test to all the case studies. This can be done easily, as in the sketch below:
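A possible call (a sketch only: it assumes that leaving out the case-study name runs all studies; check the run.studies help page for the exact interface):

```r
# run chitest against all continuous case studies
# (assumption: omitting the study name runs every case study; see ?R2sample::run.studies)
pwr.all = R2sample::run.studies(chitest, TSextra=TSextra, With.p.value=TRUE)
```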
run.studies can also be used to run the studies for the included methods but for different values of param_alt, nsample and/or alpha. For example, say we wish to find the powers of the included continuous methods for the cases uniform.linear and uniform.quadratic, with samples of size 2000 and a true type I error probability of 0.1. Moreover, for the case uniform.linear we want the slopes to be 0.1 and 0.2, and for the case uniform.quadratic we want param_alt to be 1.3:
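A sketch of such a call (the argument names param_alt, nsample and alpha come from the description above; the name of the argument selecting the studies and the exact form in which param_alt is passed are assumptions, so consult the run.studies help page):

```r
# sketch: included continuous methods, two case studies, larger samples, alpha = 0.1
# (argument name "study" and the list form of param_alt are assumptions)
R2sample::run.studies(
  study     = c("uniform.linear", "uniform.quadratic"),
  param_alt = list(c(0.1, 0.2), 1.3),
  nsample   = 2000,
  alpha     = 0.1
)
```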