Note that all examples are run without parallel processing and with a small number of simulation runs B to satisfy CRAN submission rules.
This package brings together a number of routines for the two-sample problem for multivariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.
The highlights of this package are:
We generate two-dimensional data sets of size 100 and 120, respectively, from multivariate normal distributions and run the test:

``` r
x1 = mvtnorm::rmvnorm(100, c(0,0))
y1 = mvtnorm::rmvnorm(120, c(0,0))
twosample_test(x1, y1, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#>        KS         K       CvM        AD       NN1       NN5        AZ        BF
#>  0.108300  0.176600  0.075930  0.562000  1.108300  2.036700  0.006952  0.250400
#>        BG        FR       NN0       CF1       CF2       CF3       CF4      Ball
#>  0.763600 -1.102000  0.227200 -1.102000  1.441500  1.127300  1.285200  0.006210
#>        ES        EP
#> 15.452000 32.531000
#>
#> $p.values
#>     KS      K    CvM     AD    NN1    NN5     AZ     BF     BG     FR    NN0
#> 0.7650 0.7600 0.7500 0.8700 0.1200 0.3450 0.4500 0.5500 0.3450 0.1353 0.0000
#>    CF1    CF2    CF3    CF4   Ball     ES     EP
#> 0.1353 0.4864 0.1298 0.3027 0.3761 0.3479 0.1144
```

Arguments of twosample_test
The routine chiTS.cont (included in the package) finds either the test statistic or the p value of a chi-square test in two dimensions. First we need to set up a list called TSextra; this list is passed to chiTS.cont and tells the routine whether to find the test statistic or the p value, and which binnings to use.
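The definition of this list is not shown above; a minimal sketch, with the binnings assumed to match the output of the next call (a 3x3 and a 3x4 grid):

``` r
# which tells chiTS.cont what to return; nbins gives the binnings to use
# (the exact values here are assumptions based on the output below)
TSextra = list(which = "statistic", nbins = cbind(c(3, 3), c(3, 4)))
```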
``` r
twosample_test(x1, y1, TS=chiTS.cont, TSextra=TSextra, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4
#>         4.2376        16.3280
#>
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4
#>          0.840          0.145
```

Arguments and output of new test routine for continuous data
The arguments have to be x and y for the two data sets and (optionally) a list called TSextra for any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
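To illustrate the required signature, here is a minimal hypothetical test statistic routine; the name meandiff and the statistic itself are ours, purely for illustration, not part of the package:

``` r
meandiff = function(x, y) {
  # Euclidean distance between the two sample mean vectors
  TS = sqrt(sum((apply(x, 2, mean) - apply(y, 2, mean))^2))
  names(TS) = "MeanDiff"  # output has to be a named vector
  TS
}
```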
If several tests are run one can use the routine twosample_test_adjusted_pvalue to find a p value that is adjusted for simultaneous testing:
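A sketch of such a call, assuming the routine accepts the same data and B arguments as twosample_test:

``` r
# p values adjusted for simultaneous testing
twosample_test_adjusted_pvalue(x1, y1, B = B)
```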
The routine twosample_power allows us to estimate the power of the tests.
Let's say we want to estimate the power when one data set comes from a multivariate standard normal distribution and the other from a normal distribution with a different covariance. We first need a function that generates these data sets:
``` r
f = function(a=0) {
  S = diag(2)
  x = mvtnorm::rmvnorm(100, sigma=S)
  S[1,2] = a
  S[2,1] = a
  y = mvtnorm::rmvnorm(120, sigma=S)
  list(x=x, y=y)
}
```

Now we can run

``` r
twosample_power(f, c(0, 0.5), B=B, maxProcessor=1)
#>        KS    K   CvM    AD   NN1  NN5    AZ    BF   BG   FR NN0  CF1  CF2  CF3
#> 0   0.080 0.11 0.050 0.035 0.055 0.04 0.045 0.055 0.04 0.05   1 0.05 0.05 0.04
#> 0.5 0.135 0.21 0.255 0.280 0.255 0.29 0.240 0.245 0.04 0.26   1 0.26 0.11 0.24
#>       CF4  Ball    ES    EP
#> 0   0.055 0.055 0.045 0.040
#> 0.5 0.185 0.115 0.435 0.455
```

Arguments of twosample_power
Again the user can provide their own routine:
``` r
twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=B, maxProcessor=1)
#>     Chisq Stat 3,3 Chisq Stat 3,4
#> 0            0.045           0.07
#> 0.5          0.645           0.58
```

The routine that generates data can also have two arguments:

``` r
f1 = function(a=0, n=100) {
  S = diag(2)
  x = mvtnorm::rmvnorm(100, sigma=S)
  S[1,2] = a
  S[2,1] = a
  y = mvtnorm::rmvnorm(n, sigma=S)
  list(x=x, y=y)
}
```

If the user routine returns p values, one can run:
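The call that followed is not shown above; a sketch, under the assumption that the two parameter vectors are passed as separate arguments and that p values are requested via With.p.value, as elsewhere in this vignette:

``` r
# chiTS.cont set up to return p values instead of test statistics
TSextra = list(which = "pvalue", nbins = cbind(c(3, 3), c(3, 4)))
twosample_power(f1, c(0, 0.5), c(100, 120),  # values for a and n (assumed)
                TS = chiTS.cont, TSextra = TSextra,
                With.p.value = TRUE, B = B, maxProcessor = 1)
```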
First note that tests for discrete data are implemented only in two dimensions.
We consider the case of data from binomial distributions:
``` r
vals_x = 0:5  # possible values of first random variable
vals_y = 0:6  # possible values of second random variable
a1x = rbinom(1000, 5, 0.5)
a2x = rbinom(1000, 6, 0.5)
a1y = rbinom(1200, 5, 0.5)
a2y = rbinom(1200, 6, 0.5)
x2 = matrix(0, 6*7, 4)
colnames(x2) = c("vals_x", "vals_y", "x", "y")
x2[,1] = rep(vals_x, length(vals_y))
x2[,2] = rep(vals_y, each=length(vals_x))
for(i in 0:5) {
  for(j in 0:6) {
    x2[x2[,1]==i & x2[,2]==j, 3] = sum(a1x==i & a2x==j)
    x2[x2[,1]==i & x2[,2]==j, 4] = sum(a1y==i & a2y==j)
  }
}
twosample_test(x2, B=B, maxProcessor=1)
#> Data is assumed to be discrete
#> $statistics
#>        KS         K       CvM        AD        NN        AZ        BF ChiSquare
#>   0.03333   0.05716   0.17180   1.11780   3.53210   0.04232   0.06514  28.20800
#>
#> $p.values
#>        KS         K       CvM        AD        NN        AZ        BF ChiSquare
#>    0.4700    0.3400    0.3650    0.3350    0.5650    0.3350    0.3050    0.7469
```

Arguments of twosample_test
As in the case of continuous data the arguments include TS, TSextra, minexpcount, rnull, maxProcessor and doMethods. In addition, the data now has to be provided as a single matrix with columns named vals_x, vals_y, x and y, as in the example above.
Again one can find a p value adjusted for simultaneous testing:
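A sketch, assuming the discrete data matrix is passed just as in twosample_test:

``` r
# adjusted p values for the discrete data example
twosample_test_adjusted_pvalue(x2, B = B)
```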
The routine chiTS.disc (included in the package) does a chi-square test for discrete data:
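A sketch of how it might be called, reusing the TSextra convention from the continuous case (the which entry is assumed to work as before):

``` r
TSextra = list(which = "statistic")
twosample_test(x2, TS = chiTS.disc, TSextra = TSextra, B = B, maxProcessor = 1)
```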
Again we need a routine that generates data sets. In the discrete case this has to be a routine whose output is a matrix with columns named vals_x, vals_y, x and y.
``` r
g = function(a=0) {
  x = cbind(rbinom(1000, 5, 0.5), rpois(1000, 1))
  x[,2][x[,2]>5] = 5
  lambda = 1 + a*x[,1]/5
  y = cbind(rbinom(1200, 5, 0.5), rpois(1200, lambda))
  y[,2][y[,2]>5] = 5
  vx = 0:5
  vy = 0:5
  A = matrix(0, length(vx)*length(vy), 4)
  k = 0
  for(i in seq_along(vx))
    for(j in seq_along(vy)) {
      k = k + 1
      A[k,1] = vx[i]
      A[k,2] = vy[j]
      A[k,3] = sum(x[,1]==vx[i] & x[,2]==vy[j])
      A[k,4] = sum(y[,1]==vx[i] & y[,2]==vy[j])
    }
  colnames(A) = c("vals_x", "vals_y", "x", "y")
  A
}
twosample_power(g, c(0, 0.25, 0.5), B=200, maxProcessor=1)
#>         KS     K   CvM    AD    NN    AZ    BF Chisquare
#> 0    0.085 0.040 0.075 0.080 0.035 0.045 0.045     0.035
#> 0.25 0.415 0.315 0.495 0.495 0.090 0.115 0.115     0.240
#> 0.5  0.990 0.965 0.990 0.990 0.670 0.140 0.200     0.910
```

or using a user supplied test:
``` r
TSextra = list(which="statistic")
twosample_power(g, c(0, 0.25, 0.5), B=200, TS=chiTS.disc, TSextra=TSextra, maxProcessor=1)
#>      Chisq Stat
#> 0         0.015
#> 0.25      0.100
#> 0.5       0.845
TSextra = list(which="pvalue")
twosample_power(g, c(0, 0.25, 0.5), B=200, TS=chiTS.disc, TSextra=TSextra,
                With.p.value=TRUE, maxProcessor=1)
#>      Chisq P
#> 0      0.030
#> 0.25   0.195
#> 0.5    0.830
```

We have a data set x and we want to test whether it comes from a bivariate normal distribution. Instead of a goodness-of-fit test, however, we generate a Monte Carlo data set y from a bivariate normal random vector with mean and covariance estimated from the real data set x, and then run a two-sample test.
In this scenario the two data sets are not independent, and the permutation approach to finding p values is extremely conservative, that is, the true type I error probability is much smaller than the nominal one. This in turn makes the power of the test much lower as well. Instead one can supply a routine that generates new data, just as one would in a goodness-of-fit test. Moreover, all the methods that find their own p values will now fail completely and so should not be run.
``` r
# generate real and MC data sets:
f = function(mu) {
  x = mvtnorm::rmvnorm(100, c(mu, mu))
  y = mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# true data is a mixture of normal and uniform
g = function(alpha=0) {
  x = rbind(mvtnorm::rmvnorm((1-alpha)*100, c(0,0)),
            matrix(runif(200*alpha), ncol=2))
  y = mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# generate two-sample data set under the null hypothesis
rnull = function(dta) {
  x = mvtnorm::rmvnorm(nrow(dta$x), apply(dta$x, 2, mean), cor(dta$x))
  y = mvtnorm::rmvnorm(nrow(x), apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# only run these methods for the hybrid problem
mt = c("KS", "K", "CvM", "AD", "NN1", "NN5", "AZ", "BF", "BG")
# Null hypothesis is true:
twosample_power(f, c(0, 1), doMethods=mt, B=200, maxProcessor=1)
#>      KS     K  CvM    AD   NN1  NN5    AZ   BF   BG
#> 0 0.025 0.010 0.02 0.010 0.015 0.04 0.015 0.00 0.05
#> 1 0.025 0.045 0.01 0.015 0.045 0.02 0.010 0.01 0.03
twosample_power(f, c(0, 1), rnull=rnull, B=200, maxProcessor=1)
#>      KS     K  CvM   AD   NN1  NN5    AZ    BF    BG
#> 0 0.025 0.020 0.08 0.07 0.050 0.06 0.050 0.050 0.050
#> 1 0.070 0.075 0.05 0.04 0.035 0.06 0.035 0.055 0.035
# Null hypothesis is false:
twosample_power(g, c(0, 0.5), doMethods=mt, B=200, maxProcessor=1)
#>       KS     K   CvM    AD  NN1  NN5    AZ    BF    BG
#> 0   0.00 0.035 0.000 0.005 0.02 0.02 0.005 0.000 0.060
#> 0.5 0.87 1.000 0.505 0.500 0.72 0.87 1.000 0.995 0.995
twosample_power(g, c(0, 0.5), rnull=rnull, B=200, maxProcessor=1)
#>        KS     K   CvM   AD  NN1  NN5    AZ   BF    BG
#> 0   0.045 0.035 0.045 0.05 0.05 0.08 0.045 0.06 0.035
#> 0.5 0.940 1.000 0.765 0.79 0.81 0.93 0.995 1.00 1.000
```

As we can see, the true type I error using the permutation method is much smaller than the nominal \(\alpha=0.05\), and so the power is lower as well.
The routine run.studies allows the user to quickly compare the power of a new test with the power of the included tests for 25 different combinations of null hypothesis vs. alternative for continuous data and 20 for discrete data. It also allows the user to find the power for those case studies for different sample sizes and type I error probabilities.
As an example, let's compare the power of the test based on differences in means to the included ones for two of the studies.
Note that these examples are not run so that the package can be submitted to CRAN:
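A sketch of what such a call might look like, using the illustrative meandiff routine from earlier (argument names follow the run.studies calls shown below):

``` r
# compare the power of the mean-difference test to the included tests
run.studies(Continuous = TRUE, study = c("NormalD2", "tD2"),
            TS = meandiff, B = 500)
```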
Arguments of run.studies
The data set power_studies_results included in MD2sample has the results for the included tests using a sample size of 200, a true type I error probability of 0.05 and two values of the parameter under the alternative, one which makes the null hypothesis true and one which makes it false. If the user wants different numbers they can run:

``` r
run.studies(Continuous=TRUE,
            study=c("NormalD2", "tD2"),
            param_alt=cbind(c(0.4, 0.4), c(0.7, 0.7)),
            alpha=0.1, B=100)
```

Say the new method can find p values without simulation:
Note that the routine should return a named vector with the p values.
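For instance, a hypothetical routine returning the p value of a two-sample t test on the first coordinate (purely illustrative, not part of the package):

``` r
ttest.pval = function(x, y) {
  pval = t.test(x[, 1], y[, 1])$p.value  # p value found without simulation
  names(pval) = "t Pval"                 # output has to be a named vector
  pval
}
```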
``` r
TSextra = list(which="pvalue", nbins=cbind(c(3,3), c(4,4)))
run.studies(Continuous=TRUE,
            study=c("NormalD2", "tD2"),
            TS=chiTS.cont,
            TSextra=TSextra,
            With.p.value=TRUE,
            B=500,
            SuppressMessages=TRUE,
            maxProcessor=1)
#> Average number of times a test is close to best:
#>             BG           Ball            CF2            NN0             KS
#>            1.0            2.0            3.0            4.0            6.0
#>              K            NN1            CF4             FR            CF1
#>            6.0            6.0            8.0           10.5           10.5
#>            CF3            CvM            NN5             AD             BF
#>           10.5           11.0           13.0           13.5           15.0
#>             AZ             ES             EP Chisq Pval 3,4 Chisq Pval 3,4
#>           16.0           17.0           18.0           19.5           19.5
#>          Chisq Pval 3,4 Chisq Pval 3,4    KS     K   CvM    AD   NN1   NN5
#> NormalD2          0.986          0.986 0.418 0.430 0.552 0.642 0.484 0.664
#> tD2               0.960          0.960 0.396 0.393 0.539 0.635 0.357 0.502
#>             AZ    BF    BG    FR   NN0   CF1   CF2   CF3   CF4  Ball    ES
#> NormalD2 0.884 0.831 0.050 0.601 0.395 0.601 0.377 0.601 0.502 0.302 0.973
#> tD2      0.817 0.781 0.067 0.491 0.312 0.491 0.307 0.491 0.401 0.200 0.863
#>             EP
#> NormalD2 0.978
#> tD2      0.933
```

Consider the following situation. We have a data set \(x\) and want to test whether it comes from some probability distribution \(F\), that is, we have a goodness-of-fit problem. However, it is not possible to calculate probabilities from \(F\), probably because this would require integration in high dimensions. We are, though, able to generate a second data set \(y\) under \(F\), and so can run a two-sample test.
It can be shown that if the model \(F\) is not fully specified but includes parameters that have to be estimated from \(x\), finding p values using the permutation method leads to an extremely conservative test, that is, the true type I error probability is much smaller than the desired one. Instead one can use a parametric bootstrap approach, that is, a new data set \(y\) can be generated in each simulation run.
Let's say we wish to test whether the data set \(x\) comes from a multivariate normal distribution with unknown mean and covariance:
``` r
f = function(a) {
  x = mvtnorm::rmvnorm(500, c(0.5, 0.5))
  y = rbind(matrix(runif(a*1000), ncol=2),
            mvtnorm::rmvnorm((1-a)*500, c(0.5, 0.5)))
  list(x=x, y=y)
}
rnull = function(dta) {
  muhat = apply(dta$x, 2, mean)
  sigmahat = cor(dta$x)
  list(x=mvtnorm::rmvnorm(nrow(dta$x), muhat, sigmahat),
       y=mvtnorm::rmvnorm(nrow(dta$y), muhat, sigmahat))
}
dta = f(0)  # Null hypothesis is true
twosample_test(dta, rnull=rnull, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#>       KS        K      CvM       AD      NN1      NN5       AZ       BF
#> 0.076000 0.116000 0.176700 1.321800 1.064000 2.038000 0.001345 0.293900
#>       BG
#> 0.888500
#>
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG
#> 0.245 0.255 0.205 0.170 0.055 0.250 0.170 0.200 0.035
dta = f(0.2)  # Null hypothesis is false
twosample_test(dta, rnull=rnull, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#>      KS       K     CvM      AD     NN1     NN5      AZ      BF      BG
#> 0.12400 0.20000 0.49640 3.25860 1.10000 2.22200 0.01947 0.97180 4.03100
#>
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG
#> 0.000 0.000 0.000 0.000 0.005 0.000 0.020 0.000 0.000
```

Again we can find adjusted p values:
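The call is not shown above; a sketch, assuming the routine accepts the same arguments as twosample_test:

``` r
twosample_test_adjusted_pvalue(dta, rnull = rnull, B = B)
```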
and find the power of the tests:
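Again the code is not shown; a sketch using the data generator f and the rnull routine defined above:

``` r
# power of the tests with the parametric bootstrap approach
twosample_power(f, c(0, 0.2), rnull = rnull, B = 200, maxProcessor = 1)
```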
Denote by \(\mathbf{z}\) the combined data set \(x_1,..,x_n,y_1,..,y_m\). Let \(\hat{F}\) and \(\hat{G}\) be the empirical distribution functions of the two data sets, and let \(\hat{H}\) be the empirical distribution function of \(\mathbf{z}\).
Kolmogorov-Smirnov test
This classic test uses the test statistic
\[\max\{\vert\hat{F}(z_i)-\hat{G}(z_i)\vert;\ i=1,..,n+m\}\]
It was first proposed in (Kolmogorov 1933) and (Smirnov 1939).
Kuiper's test
A variant of the Kolmogorov-Smirnov test:
\[\max\{\hat{F}(z_i)-\hat{G}(z_i);\ i=1,..,n+m\}-\min\{\hat{F}(z_i)-\hat{G}(z_i);\ i=1,..,n+m\}\] This test was first discussed in (Kuiper 1960).
Cramer-von Mises test
\[\sum_{i=1}^{n+m}\left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2\] The extension of the Cramer-von Mises criterion to the two-sample problem is discussed in (T. W. Anderson 1962).
Anderson-Darling test
\[\sum_{i=1}^{n+m}\frac{\left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2}{\hat{H}(z_i)(1-\hat{H}(z_i))}\] It was first proposed in (T. W. Anderson and Darling 1952).
Nearest Neighbor tests
The test statistics are the average number of nearest neighbors of the \(\mathbf{x}\) data set that are also from \(\mathbf{x}\) plus the average number of nearest neighbors of the \(\mathbf{y}\) data set that are also from \(\mathbf{y}\). NN1 uses one nearest neighbor and NN5 uses five.
We denote by \(||\cdot||\) the Euclidean distance.
Aslan-Zech test
This test, discussed in (Aslan and Zech 2005), uses the test statistic
\[\begin{aligned}
&\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \log(||x_i-y_j||) -\\
&\frac{1}{n^2}\sum_{i<j} \log(||x_i-x_j||) - \\
&\frac{1}{m^2}\sum_{i<j} \log(||y_i-y_j||)
\end{aligned}\]
Baringhaus-Franz test
Similar to the Aslan-Zech test, it uses the test statistic
\[\begin{aligned}
&\frac{nm}{n+m}\left[\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \sqrt{||x_i-y_j||} + \right.\\
&\frac{1}{n^2}\sum_{i<j} \sqrt{||x_i-x_j||} -\\
&\left. \frac{1}{m^2}\sum_{i<j} \sqrt{||y_i-y_j||}\right]
\end{aligned}\] and was first proposed in (Baringhaus and Franz 2004).
Biswas-Ghosh test
Another variation of a test based on Euclidean distance was discussed in (Biswas and Ghosh 2014).
\[\begin{aligned}
&B_{xy} = \frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \sqrt{||x_i-y_j||}\\
&B_{xx} = \frac{2}{n(n-1)}\sum_{i<j}\sqrt{||x_i-x_j||} \\
&B_{yy} = \frac{2}{m(m-1)}\sum_{i<j}\sqrt{||y_i-y_j||}\\
&\left(B_{xx}-B_{xy}\right)^2+\left(B_{yy}-B_{xy}\right)^2
\end{aligned}\]
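A minimal sketch that transcribes the displayed formula directly (the package's actual implementation may differ, for example in how distances are scaled):

``` r
bg_stat = function(x, y) {
  n = nrow(x); m = nrow(y)
  D = as.matrix(dist(rbind(x, y)))  # Euclidean distances ||.||
  Bxy = mean(sqrt(D[1:n, n + (1:m)]))
  Bxx = 2/(n*(n-1)) * sum(sqrt(D[1:n, 1:n][upper.tri(diag(n))]))
  Byy = 2/(m*(m-1)) * sum(sqrt(D[n + (1:m), n + (1:m)][upper.tri(diag(m))]))
  (Bxx - Bxy)^2 + (Byy - Bxy)^2
}
```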
Friedman-Rafsky test
This test is a multi-dimensional extension of the classic Wald-Wolfowitz test based on minimum spanning trees. It was discussed in (Friedman and Rafsky 1979).
Simple Nearest Neighbor test
Similar to the nearest neighbor tests described earlier, it uses only the number of nearest neighbors of the first data set that are also from the first data set. This number has a binomial distribution, and this can be used to find p values.
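A sketch of this idea (our own illustration, not the package's code; under the null each of the n points from the first sample has a nearest neighbor from the same sample with probability (n-1)/(n+m-1)):

``` r
nn_pvalue = function(x, y) {
  n = nrow(x); m = nrow(y)
  D = as.matrix(dist(rbind(x, y)))
  diag(D) = Inf                 # a point is not its own neighbor
  nn = apply(D, 1, which.min)   # index of each point's nearest neighbor
  k = sum(nn[1:n] <= n)         # x points whose nearest neighbor is from x
  # one-sided binomial p value: P(X >= k)
  pbinom(k - 1, n, (n - 1)/(n + m - 1), lower.tail = FALSE)
}
```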
Chen-Friedman tests
These tests, discussed in (Chen and Friedman, n.d.), are implemented in the gTests package (Chen and Zhang 2017).
Ball Divergence test
A test described in (Pan et al. 2018) and implemented in the R package Ball (Zhu et al. 2021).
Implemented for discrete data are versions of the Kolmogorov-Smirnov, Kuiper, Cramer-von Mises, Anderson-Darling, nearest neighbor, Aslan-Zech and Baringhaus-Franz tests as well as a chi-square test.