Note that all examples are run without parallel processing and with a small number of simulation runs B to satisfy CRAN submission rules.
This package brings together a number of routines for the two-sample problem for multivariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.
The highlights of this package are:
We generate two-dimensional data sets of size 100 and 120, respectively, from multivariate normal distributions and run the test:

``` r
x1 = mvtnorm::rmvnorm(100, c(0,0))
y1 = mvtnorm::rmvnorm(120, c(0,0))
twosample_test(x1, y1, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#>        KS         K       CvM        AD       NN1       NN5        AZ        BF
#>  0.108300  0.176600  0.075930  0.562000  1.108300  2.036700  0.006952  0.250400
#>        BG        FR       NN0       CF1       CF2       CF3       CF4      Ball
#>  0.763600 -1.102000  0.227200 -1.102000  1.441500  1.127300  1.285200  0.006210
#>        ES        EP
#> 15.452000 32.531000
#>
#> $p.values
#>     KS      K    CvM     AD    NN1    NN5     AZ     BF     BG     FR    NN0
#> 0.7650 0.7600 0.7500 0.8700 0.1200 0.3450 0.4500 0.5500 0.3450 0.1353 0.0000
#>    CF1    CF2    CF3    CF4   Ball     ES     EP
#> 0.1353 0.4864 0.1298 0.3027 0.3761 0.3479 0.1144
```

Arguments of twosample_test
The routine chiTS.cont (included in the package) finds either the test statistic or the p value of a chi-square test in two dimensions. First we need to set up a list called TSextra; this list is passed to chiTS.cont and tells the routine whether to find the test statistic or the p value, and which binnings to use.
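The definition of this list is not shown above; a minimal sketch, with the binnings assumed to match the output of the next call (a 3x3 and a 3x4 grid):

``` r
# which tells chiTS.cont what to return; nbins gives the binnings to use
# (the exact values here are assumptions based on the output below)
TSextra = list(which = "statistic", nbins = cbind(c(3, 3), c(3, 4)))
```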
``` r
twosample_test(x1, y1, TS=chiTS.cont, TSextra=TSextra, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#> Chisq Stat 3,3 Chisq Stat 3,4
#>         4.2376        16.3280
#>
#> $p.values
#> Chisq Stat 3,3 Chisq Stat 3,4
#>          0.840          0.145
```

Arguments and output of new test routine for continuous data
The arguments have to be x and y for the two data sets and (optionally) a list called TSextra for any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
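To illustrate the required signature, here is a minimal hypothetical test statistic routine; the name meandiff and the statistic itself are ours, purely for illustration, not part of the package:

``` r
meandiff = function(x, y) {
  # Euclidean distance between the two sample mean vectors
  TS = sqrt(sum((apply(x, 2, mean) - apply(y, 2, mean))^2))
  names(TS) = "MeanDiff"  # output has to be a named vector
  TS
}
```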
If several tests are run one can use the routine twosample_test_adjusted_pvalue to find a p value that is adjusted for simultaneous testing:
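A sketch of such a call, assuming the routine accepts the same data and B arguments as twosample_test:

``` r
# p values adjusted for simultaneous testing
twosample_test_adjusted_pvalue(x1, y1, B = B)
```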
The routine twosample_power allows us to estimate the power of the tests.
Let's say we want to estimate the power when one data set comes from a multivariate standard normal distribution and the other from a normal distribution with a different covariance. We first need a function that generates these data sets:
``` r
f = function(a=0) {
  S = diag(2)
  x = mvtnorm::rmvnorm(100, sigma=S)
  S[1,2] = a
  S[2,1] = a
  y = mvtnorm::rmvnorm(120, sigma=S)
  list(x=x, y=y)
}
```

Now we can run

``` r
twosample_power(f, c(0, 0.5), B=B, maxProcessor=1)
#>        KS    K   CvM    AD   NN1  NN5    AZ    BF   BG   FR NN0  CF1  CF2  CF3
#> 0   0.080 0.11 0.050 0.035 0.055 0.04 0.045 0.055 0.04 0.05   1 0.05 0.05 0.04
#> 0.5 0.135 0.21 0.255 0.280 0.255 0.29 0.240 0.245 0.04 0.26   1 0.26 0.11 0.24
#>       CF4  Ball    ES    EP
#> 0   0.055 0.055 0.045 0.040
#> 0.5 0.185 0.115 0.435 0.455
```

Arguments of twosample_power
Again the user can provide their own routine:
``` r
twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=B, maxProcessor=1)
#>     Chisq Stat 3,3 Chisq Stat 3,4
#> 0            0.045           0.07
#> 0.5          0.645           0.58
```

The routine that generates data can also have two arguments:

``` r
f1 = function(a=0, n=100) {
  S = diag(2)
  x = mvtnorm::rmvnorm(100, sigma=S)
  S[1,2] = a
  S[2,1] = a
  y = mvtnorm::rmvnorm(n, sigma=S)
  list(x=x, y=y)
}
```

If the user routine returns p values, one can run:
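The call that followed is not shown above; a sketch, under the assumption that the two parameter vectors are passed as separate arguments and that p values are requested via With.p.value, as elsewhere in this vignette:

``` r
# chiTS.cont set up to return p values instead of test statistics
TSextra = list(which = "pvalue", nbins = cbind(c(3, 3), c(3, 4)))
twosample_power(f1, c(0, 0.5), c(100, 120),  # values for a and n (assumed)
                TS = chiTS.cont, TSextra = TSextra,
                With.p.value = TRUE, B = B, maxProcessor = 1)
```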
First note that tests for discrete data are implemented only in two dimensions.
We consider the case of data from binomial distributions:
``` r
vals_x = 0:5  # possible values of first random variable
vals_y = 0:6  # possible values of second random variable
a1x = rbinom(1000, 5, 0.5)
a2x = rbinom(1000, 6, 0.5)
a1y = rbinom(1200, 5, 0.5)
a2y = rbinom(1200, 6, 0.5)
x2 = matrix(0, 6*7, 4)
colnames(x2) = c("vals_x", "vals_y", "x", "y")
x2[,1] = rep(vals_x, length(vals_y))
x2[,2] = rep(vals_y, each=length(vals_x))
for(i in 0:5) {
  for(j in 0:6) {
    x2[x2[,1]==i & x2[,2]==j, 3] = sum(a1x==i & a2x==j)
    x2[x2[,1]==i & x2[,2]==j, 4] = sum(a1y==i & a2y==j)
  }
}
twosample_test(x2, B=B, maxProcessor=1)
#> Data is assumed to be discrete
#> $statistics
#>        KS         K       CvM        AD        NN        AZ        BF ChiSquare
#>   0.03333   0.05716   0.17180   1.11780   3.53210   0.04232   0.06514  28.20800
#>
#> $p.values
#>        KS         K       CvM        AD        NN        AZ        BF ChiSquare
#>    0.4700    0.3400    0.3650    0.3350    0.5650    0.3350    0.3050    0.7469
```

Arguments of twosample_test
As in the case of continuous data the arguments include TS, TSextra, minexpcount, rnull, maxProcessor and doMethods. In addition, the data now has to be provided as a single matrix with columns named vals_x, vals_y, x and y, as in the example above.
Again one can find a p value adjusted for simultaneous testing:
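A sketch, assuming the discrete data matrix is passed just as in twosample_test:

``` r
# adjusted p values for the discrete data example
twosample_test_adjusted_pvalue(x2, B = B)
```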
The routine chiTS.disc (included in the package) does a chi-square test for discrete data:
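A sketch of how it might be called, reusing the TSextra convention from the continuous case (the which entry is assumed to work as before):

``` r
TSextra = list(which = "statistic")
twosample_test(x2, TS = chiTS.disc, TSextra = TSextra, B = B, maxProcessor = 1)
```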
Again we need a routine that generates data sets. In the discrete case this has to be a routine whose output is a matrix with columns named vals_x, vals_y, x and y.
``` r
g = function(a=0) {
  x = cbind(rbinom(1000, 5, 0.5), rpois(1000, 1))
  x[,2][x[,2]>5] = 5
  lambda = 1 + a*x[,1]/5
  y = cbind(rbinom(1200, 5, 0.5), rpois(1200, lambda))
  y[,2][y[,2]>5] = 5
  vx = 0:5
  vy = 0:5
  A = matrix(0, length(vx)*length(vy), 4)
  k = 0
  for(i in seq_along(vx))
    for(j in seq_along(vy)) {
      k = k + 1
      A[k,1] = vx[i]
      A[k,2] = vy[j]
      A[k,3] = sum(x[,1]==vx[i] & x[,2]==vy[j])
      A[k,4] = sum(y[,1]==vx[i] & y[,2]==vy[j])
    }
  colnames(A) = c("vals_x", "vals_y", "x", "y")
  A
}
twosample_power(g, c(0, 0.25, 0.5), B=200, maxProcessor=1)
#>         KS     K   CvM    AD    NN    AZ    BF Chisquare
#> 0    0.085 0.040 0.075 0.080 0.035 0.045 0.045     0.035
#> 0.25 0.415 0.315 0.495 0.495 0.090 0.115 0.115     0.240
#> 0.5  0.990 0.965 0.990 0.990 0.670 0.140 0.200     0.910
```

or using a user supplied test:
``` r
TSextra = list(which="statistic")
twosample_power(g, c(0, 0.25, 0.5), B=200, TS=chiTS.disc, TSextra=TSextra, maxProcessor=1)
#>      Chisq Stat
#> 0         0.015
#> 0.25      0.100
#> 0.5       0.845
TSextra = list(which="pvalue")
twosample_power(g, c(0, 0.25, 0.5), B=200, TS=chiTS.disc, TSextra=TSextra,
                With.p.value=TRUE, maxProcessor=1)
#>      Chisq P
#> 0      0.030
#> 0.25   0.195
#> 0.5    0.830
```

We have a data set x and we want to test whether it comes from a bivariate normal distribution. Instead of a goodness-of-fit test, however, we generate a Monte Carlo data set y from a bivariate normal random vector with mean and covariance estimated from the real data set x, and then run a two-sample test.
In this scenario the two data sets are not independent, and the permutation approach to finding p values is extremely conservative, that is, the true type I error probability is much smaller than the nominal one. This in turn makes the power of the test much lower as well. Instead one can supply a routine that generates new data, just as one would in a goodness-of-fit test. Moreover, all the methods that find their own p values will now fail completely and so should not be run.
``` r
# generate real and MC data sets:
f = function(mu) {
  x = mvtnorm::rmvnorm(100, c(mu, mu))
  y = mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# true data is a mixture of normal and uniform
g = function(alpha=0) {
  x = rbind(mvtnorm::rmvnorm((1-alpha)*100, c(0,0)),
            matrix(runif(200*alpha), ncol=2))
  y = mvtnorm::rmvnorm(100, apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# generate two-sample data set under the null hypothesis
rnull = function(dta) {
  x = mvtnorm::rmvnorm(nrow(dta$x), apply(dta$x, 2, mean), cor(dta$x))
  y = mvtnorm::rmvnorm(nrow(x), apply(x, 2, mean), cor(x))
  list(x=x, y=y)
}
# only run these methods for the hybrid problem
mt = c("KS", "K", "CvM", "AD", "NN1", "NN5", "AZ", "BF", "BG")
# Null hypothesis is true:
twosample_power(f, c(0, 1), doMethods=mt, B=200, maxProcessor=1)
#>      KS     K  CvM    AD   NN1  NN5    AZ   BF   BG
#> 0 0.025 0.010 0.02 0.010 0.015 0.04 0.015 0.00 0.05
#> 1 0.025 0.045 0.01 0.015 0.045 0.02 0.010 0.01 0.03
twosample_power(f, c(0, 1), rnull=rnull, B=200, maxProcessor=1)
#>      KS     K  CvM   AD   NN1  NN5    AZ    BF    BG
#> 0 0.025 0.020 0.08 0.07 0.050 0.06 0.050 0.050 0.050
#> 1 0.070 0.075 0.05 0.04 0.035 0.06 0.035 0.055 0.035
# Null hypothesis is false:
twosample_power(g, c(0, 0.5), doMethods=mt, B=200, maxProcessor=1)
#>       KS     K   CvM    AD  NN1  NN5    AZ    BF    BG
#> 0   0.00 0.035 0.000 0.005 0.02 0.02 0.005 0.000 0.060
#> 0.5 0.87 1.000 0.505 0.500 0.72 0.87 1.000 0.995 0.995
twosample_power(g, c(0, 0.5), rnull=rnull, B=200, maxProcessor=1)
#>        KS     K   CvM   AD  NN1  NN5    AZ   BF    BG
#> 0   0.045 0.035 0.045 0.05 0.05 0.08 0.045 0.06 0.035
#> 0.5 0.940 1.000 0.765 0.79 0.81 0.93 0.995 1.00 1.000
```

As we can see, the true type I error using the permutation method is much smaller than the nominal \(\alpha=0.05\), and so the power is lower as well.
The routine run.studies allows the user to quickly compare the power of a new test with the power of the included tests for 25 different combinations of null hypothesis vs. alternative for continuous data and 20 for discrete data. It also allows the user to find the power for those case studies for different sample sizes and type I error probabilities.
As an example, let's compare the power of the test based on differences in means to the included ones for two of the studies.
Note that these examples are not run so that the package can be submitted to CRAN:
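A sketch of what such a call might look like, using the illustrative meandiff routine from earlier (argument names follow the run.studies calls shown below):

``` r
# compare the power of the mean-difference test to the included tests
run.studies(Continuous = TRUE, study = c("NormalD2", "tD2"),
            TS = meandiff, B = 500)
```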
Arguments of run.studies
The data set power_studies_results included in MD2sample has the results for the included tests using a sample size of 200, a true type I error probability of 0.05 and two values of the parameter under the alternative, one which makes the null hypothesis true and one which makes it false. If the user wants different numbers they can run:

``` r
run.studies(Continuous=TRUE,
            study=c("NormalD2", "tD2"),
            param_alt=cbind(c(0.4, 0.4), c(0.7, 0.7)),
            alpha=0.1, B=100)
```

Say the new method can find p values without simulation:
Note that the routine should return a named vector with the p values.
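For instance, a hypothetical routine returning the p value of a two-sample t test on the first coordinate (purely illustrative, not part of the package):

``` r
ttest.pval = function(x, y) {
  pval = t.test(x[, 1], y[, 1])$p.value  # p value found without simulation
  names(pval) = "t Pval"                 # output has to be a named vector
  pval
}
```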
``` r
TSextra = list(which="pvalue", nbins=cbind(c(3,3), c(4,4)))
run.studies(Continuous=TRUE,
            study=c("NormalD2", "tD2"),
            TS=chiTS.cont,
            TSextra=TSextra,
            With.p.value=TRUE,
            B=500,
            SuppressMessages=TRUE,
            maxProcessor=1)
#> Average number of times a test is close to best:
#>             BG           Ball            CF2            NN0             KS
#>            1.0            2.0            3.0            4.0            6.0
#>              K            NN1            CF4             FR            CF1
#>            6.0            6.0            8.0           10.5           10.5
#>            CF3            CvM            NN5             AD             BF
#>           10.5           11.0           13.0           13.5           15.0
#>             AZ             ES             EP Chisq Pval 3,4 Chisq Pval 3,4
#>           16.0           17.0           18.0           19.5           19.5
#>          Chisq Pval 3,4 Chisq Pval 3,4    KS     K   CvM    AD   NN1   NN5
#> NormalD2          0.986          0.986 0.418 0.430 0.552 0.642 0.484 0.664
#> tD2               0.960          0.960 0.396 0.393 0.539 0.635 0.357 0.502
#>             AZ    BF    BG    FR   NN0   CF1   CF2   CF3   CF4  Ball    ES
#> NormalD2 0.884 0.831 0.050 0.601 0.395 0.601 0.377 0.601 0.502 0.302 0.973
#> tD2      0.817 0.781 0.067 0.491 0.312 0.491 0.307 0.491 0.401 0.200 0.863
#>             EP
#> NormalD2 0.978
#> tD2      0.933
```

Consider the following situation. We have a data set \(x\) and want to test whether it comes from some probability distribution \(F\), that is, we have a goodness-of-fit problem. However, it is not possible to calculate probabilities from \(F\), probably because this would require integration in high dimensions. We are, though, able to generate a second data set \(y\) under \(F\), and so can run a two-sample test.
It can be shown that if the model \(F\) is not fully specified but includes parameters that have to be estimated from \(x\), finding p values using the permutation method leads to an extremely conservative test, that is, the true type I error probability is much smaller than the desired one. Instead one can use a parametric bootstrap approach, that is, a new data set \(y\) can be generated in each simulation run.
Let's say we wish to test whether the data set \(x\) comes from a multivariate normal distribution with unknown mean and covariance:
``` r
f = function(a) {
  x = mvtnorm::rmvnorm(500, c(0.5, 0.5))
  y = rbind(matrix(runif(a*1000), ncol=2),
            mvtnorm::rmvnorm((1-a)*500, c(0.5, 0.5)))
  list(x=x, y=y)
}
rnull = function(dta) {
  muhat = apply(dta$x, 2, mean)
  sigmahat = cor(dta$x)
  list(x=mvtnorm::rmvnorm(nrow(dta$x), muhat, sigmahat),
       y=mvtnorm::rmvnorm(nrow(dta$y), muhat, sigmahat))
}
dta = f(0)  # Null hypothesis is true
twosample_test(dta, rnull=rnull, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#>       KS        K      CvM       AD      NN1      NN5       AZ       BF
#> 0.076000 0.116000 0.176700 1.321800 1.064000 2.038000 0.001345 0.293900
#>       BG
#> 0.888500
#>
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG
#> 0.245 0.255 0.205 0.170 0.055 0.250 0.170 0.200 0.035
dta = f(0.2)  # Null hypothesis is false
twosample_test(dta, rnull=rnull, B=B, maxProcessor=1)
#> Data is assumed to be continuous
#> $statistics
#>      KS       K     CvM      AD     NN1     NN5      AZ      BF      BG
#> 0.12400 0.20000 0.49640 3.25860 1.10000 2.22200 0.01947 0.97180 4.03100
#>
#> $p.values
#>    KS     K   CvM    AD   NN1   NN5    AZ    BF    BG
#> 0.000 0.000 0.000 0.000 0.005 0.000 0.020 0.000 0.000
```

Again we can find adjusted p values:
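The call is not shown above; a sketch, assuming the routine accepts the same arguments as twosample_test:

``` r
twosample_test_adjusted_pvalue(dta, rnull = rnull, B = B)
```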
and find the power of the tests:
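Again the code is not shown; a sketch using the data generator f and the rnull routine defined above:

``` r
# power of the tests with the parametric bootstrap approach
twosample_power(f, c(0, 0.2), rnull = rnull, B = 200, maxProcessor = 1)
```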
Denote by \(\mathbf{z}\) the combined data set \(x_1,..,x_n,y_1,..,y_m\). Let \(\hat{F}\) and \(\hat{G}\) be the empirical distribution functions of the two data sets, and let \(\hat{H}\) be the empirical distribution function of \(\mathbf{z}\).
Kolmogorov-Smirnov test
This classic test uses the test statistic
\[\max\{\vert\hat{F}(z_i)-\hat{G}(z_i)\vert;\ i=1,..,n+m\}\]
It was first proposed in (Kolmogorov 1933) and (Smirnov 1939).
Kuiper's test
A variant of the Kolmogorov-Smirnov test:
\[\max\{\hat{F}(z_i)-\hat{G}(z_i);\ i=1,..,n+m\}-\min\{\hat{F}(z_i)-\hat{G}(z_i);\ i=1,..,n+m\}\] This test was first discussed in (Kuiper 1960).
Cramer-von Mises test
\[\sum_{i=1}^{n+m}\left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2\] The extension of the Cramer-von Mises criterion to the two-sample problem is discussed in (T. W. Anderson 1962).
Anderson-Darling test
\[\sum_{i=1}^{n+m}\frac{\left(\hat{F}(z_i)-\hat{G}(z_i)\right)^2}{\hat{H}(z_i)(1-\hat{H}(z_i))}\] It was first proposed in (T. W. Anderson and Darling 1952).
Nearest Neighbor tests
The test statistics are the average number of nearest neighbors of the \(\mathbf{x}\) data set that are also from \(\mathbf{x}\) plus the average number of nearest neighbors of the \(\mathbf{y}\) data set that are also from \(\mathbf{y}\). NN1 uses one nearest neighbor and NN5 uses five.
We denote by \(||\cdot||\) the Euclidean distance.
Aslan-Zech test
This test, discussed in (Aslan and Zech 2005), uses the test statistic
\[\begin{aligned}
&\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \log(||x_i-y_j||) -\\
&\frac{1}{n^2}\sum_{i<j} \log(||x_i-x_j||) - \\
&\frac{1}{m^2}\sum_{i<j} \log(||y_i-y_j||)
\end{aligned}\]
Baringhaus-Franz test
Similar to the Aslan-Zech test, it uses the test statistic
\[\begin{aligned}
&\frac{nm}{n+m}\left[\frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \sqrt{||x_i-y_j||} + \right.\\
&\frac{1}{n^2}\sum_{i<j} \sqrt{||x_i-x_j||} -\\
&\left. \frac{1}{m^2}\sum_{i<j} \sqrt{||y_i-y_j||}\right]
\end{aligned}\] and was first proposed in (Baringhaus and Franz 2004).
Biswas-Ghosh test
Another variation of a test based on Euclidean distance was discussed in (Biswas and Ghosh 2014).
\[\begin{aligned}
&B_{xy} = \frac{1}{nm}\sum_{i=1}^n \sum_{j=1}^m \sqrt{||x_i-y_j||}\\
&B_{xx} = \frac{2}{n(n-1)}\sum_{i<j}\sqrt{||x_i-x_j||} \\
&B_{yy} = \frac{2}{m(m-1)}\sum_{i<j}\sqrt{||y_i-y_j||}\\
&\left(B_{xx}-B_{xy}\right)^2+\left(B_{yy}-B_{xy}\right)^2
\end{aligned}\]
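A minimal sketch that transcribes the displayed formula directly (the package's actual implementation may differ, for example in how distances are scaled):

``` r
bg_stat = function(x, y) {
  n = nrow(x); m = nrow(y)
  D = as.matrix(dist(rbind(x, y)))  # Euclidean distances ||.||
  Bxy = mean(sqrt(D[1:n, n + (1:m)]))
  Bxx = 2/(n*(n-1)) * sum(sqrt(D[1:n, 1:n][upper.tri(diag(n))]))
  Byy = 2/(m*(m-1)) * sum(sqrt(D[n + (1:m), n + (1:m)][upper.tri(diag(m))]))
  (Bxx - Bxy)^2 + (Byy - Bxy)^2
}
```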
Friedman-Rafsky test
This test is a multi-dimensional extension of the classic Wald-Wolfowitz test based on minimum spanning trees. It was discussed in (Friedman and Rafsky 1979).
Simple Nearest Neighbor test
Similar to the nearest neighbor tests described earlier, it uses only the number of nearest neighbors of the first data set that are also from the first data set. This number has a binomial distribution, and this can be used to find p values.
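A sketch of this idea (our own illustration, not the package's code; under the null each of the n points from the first sample has a nearest neighbor from the same sample with probability (n-1)/(n+m-1)):

``` r
nn_pvalue = function(x, y) {
  n = nrow(x); m = nrow(y)
  D = as.matrix(dist(rbind(x, y)))
  diag(D) = Inf                 # a point is not its own neighbor
  nn = apply(D, 1, which.min)   # index of each point's nearest neighbor
  k = sum(nn[1:n] <= n)         # x points whose nearest neighbor is from x
  # one-sided binomial p value: P(X >= k)
  pbinom(k - 1, n, (n - 1)/(n + m - 1), lower.tail = FALSE)
}
```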
Chen-Friedman tests
These tests, discussed in (Chen and Friedman, n.d.), are implemented in the gTests package (Chen and Zhang 2017).
Ball Divergence test
A test described in (Pan et al. 2018) and implemented in the R package Ball (Zhu et al. 2021).
Implemented for discrete data are versions of the Kolmogorov-Smirnov, Kuiper, Cramer-von Mises, Anderson-Darling, nearest neighbor, Aslan-Zech and Baringhaus-Franz tests as well as a chi-square test.