This package brings together a number of routines for the two sample problem for univariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.
The highlights of this package are:
We generate two data sets of size 100 and 120, respectively, from standard normal distributions and run the test:
```r
x1 = rnorm(100)
y1 = rnorm(120)
twosample_test(x1, y1, B=500)
#> $statistics
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.1383   0.1666   0.3363   1.6834   0.6655  -3.2530   2.4087  -3.1940 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.2224  16.0900   5.1700  34.8500   8.8000 
#> 
#> $p.values
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.2020   0.3560   0.0920   0.1280   0.1040   0.3540   0.4000   0.3420 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.1540   0.8846   0.5228   0.2482   0.4559
```
In this case the null hypothesis is true: both data sets were generated by a standard normal distribution. And indeed, all 13 tests have p values much larger than (say) 0.05 and therefore correctly fail to reject the null hypothesis.
Alternatively one might wish to carry out a select number of tests, but then find a correct p value for the combined test. This can be done as follows:
```r
twosample_test_adjusted_pvalue(x1, y1)
#> p values of individual tests:
#> ES small :  0.5228
#> ZA :  0.6384
#> ZK :  0.5766
#> Wassp1 :  0.8618
#> Kuiper :  0.6124
#> adjusted p value of combined tests: 0.969
```
By default for continuous data this runs five tests (a chi square test with a small number of bins, ZA, ZK, Kuiper and Wassp1). It then finds the smallest individual p value (here 0.5228) and adjusts it for the multiple testing problem, resulting in an overall p value of 0.969.
Say we wish to use two new tests for continuous data. One is based on the difference in standard deviations and the other on the difference in standardized means:
\[\begin{aligned}
&TS_1 = sd(x) - sd(y) \\
&TS_2 = \bar{x}/sd(x) - \bar{y}/sd(y)
\end{aligned}\]
```r
myTS1 = function(x, y) {
  out = c(0, 0)
  out[1] = abs(sd(x) - sd(y))
  out[2] = abs(mean(x)/sd(x) - mean(y)/sd(y))
  names(out) = c("std", "std t test")
  out
}
```
Then we can simply run
```r
twosample_test(x1, y1, TS=myTS1, B=500)
#> $statistics
#>        std std t test 
#>    0.06396    0.18060 
#> 
#> $p.values
#>        std std t test 
#>      0.512      0.192
```
The arguments of the routine have to be x and y for the two data sets and (optionally) a list called TSextra for any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
Here we generate two data sets of size 1000 and 1200 from binomial distributions with 5 trials and success probabilities of 0.5 and 0.55, respectively.
```r
x2 = table(c(0:5, rbinom(1000, 5, 0.5))) - 1
y2 = table(c(0:5, rbinom(1200, 5, 0.55))) - 1
rbind(x2, y2)
#>     0   1   2   3   4  5
#> x2 28 164 297 345 132 34
#> y2 15 120 335 423 251 56
twosample_test(x2, y2, vals=0:5, B=500)$p.values
#>     KS Kuiper    CvM     AD     LR     ZA Wassp1  large  small 
#>      0      0      0      0      0      0      0      0      0
```
In this case the null hypothesis is false, and all nine tests for discrete data have p values much smaller than (say) 0.05 and therefore correctly reject the null hypothesis.
Notice, it is the presence of the vals argument that tells the twosample_test command that the data is discrete. The vectors x and y are the counts. Note that the lengths of the three vectors have to be the same, and no value of vals is allowed for which both x and y are 0.
Again we might want to run a different test, say one based on the difference in standardized means. This time we will implement the new test using Rcpp:
(For reasons to do with submission to CRAN this routine is already part of R2sample.)
```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector myTS2(IntegerVector x, IntegerVector y, NumericVector vals) {
  Rcpp::CharacterVector methods = CharacterVector::create("std t test");
  int const nummethods = methods.size();
  int k = x.size(), n = sum(x), i;
  double meanx = 0.0, meany = 0.0, sdx = 0.0, sdy = 0.0;
  NumericVector TS(nummethods);
  TS.names() = methods;
  // means of the two samples, computed from the counts
  for(i = 0; i < k; ++i) {
    meanx = meanx + x[i]*vals[i]/n;
    meany = meany + y[i]*vals[i]/n;
  }
  // standard deviations of the two samples
  for(i = 0; i < k; ++i) {
    sdx = sdx + x[i]*(vals[i] - meanx)*(vals[i] - meanx);
    sdy = sdy + y[i]*(vals[i] - meany)*(vals[i] - meany);
  }
  sdx = sqrt(sdx/(n-1));
  sdy = sqrt(sdy/(n-1));
  TS(0) = std::abs(meanx/sdx - meany/sdy);
  return TS;
}
```
Now
```r
twosample_test(x2, y2, vals=0:5, TS=R2sample::myTS2, B=500)
#> $statistics
#> std t test 
#>     0.2714 
#> 
#> $p.values
#> std t test 
#>          0
```
The arguments have to be x and y for the two data sets (the counts) and vals for the set of values where \(P(X=vals)>0\). Also (optionally) a list called TSextra can hold any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
For all the tests except the chi square variants the p values are found via the permutation method. The idea of a permutation test is simple. Say we have data sets \(x_1,..,x_n\) and \(y_1,..,y_m\). They are combined into one large data set, permuted and split again into sets of sizes n and m. Under the null hypothesis these new data sets come from the same distribution as the actual data. Therefore, by calculating the test statistics for them and repeating this many times, one can build up the distributions of the test statistics and find p values from them.
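To make this concrete, here is a minimal sketch of a permutation test in plain R for a single statistic (the two-sample Kolmogorov-Smirnov distance from ks.test); the number of permutations is illustrative and this is not the package's implementation:

```r
# minimal permutation test sketch (illustrative only)
perm_pvalue = function(x, y, nperm = 1000) {
  n = length(x)
  z = c(x, y)                                      # combined sample
  TS_obs = ks.test(x, y)$statistic                 # observed test statistic
  TS_perm = replicate(nperm, {
    zstar = sample(z)                              # permute the combined sample
    ks.test(zstar[1:n], zstar[-(1:n)])$statistic   # statistic for the permuted split
  })
  mean(TS_perm >= TS_obs)                          # permutation p value
}
perm_pvalue(rnorm(100), rnorm(120))
```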
In the discrete case the situation is somewhat more complicated. Say we have data sets with values \(v_1,..,v_k\) and counts \(x_1,..,x_k\), \(y_1,..,y_k\). One can then mimic the description above by expanding these to yield a large data set with \(x_1+y_1\) \(v_1\)'s, \(x_2+y_2\) \(v_2\)'s and so on. Then this vector is permuted and split as described above. This method is used with the argument samplingmethod="independence". The drawback of this sampling method is that its calculation time increases with the sample size.
Alternatively R2sample provides the following option. One can show that the permuted data sets have the following distribution: let \(n=\sum x_i\) and \(m=\sum y_i\), then
\[P(X=a|x,y)=\frac{\prod_{j=1}^k{{x_j+y_j}\choose{a_j}}}{{n+m}\choose{n}}\]
for any \(a\) such that \(0\le a_i \le x_i+y_i; i=1,..,k\) and \(\sum a=n\). Sampling from this distribution can be done as follows: set \(p=\sum x_i/\sum(x_i+y_i)\), the overall proportion of the data in the x data set. Next generate a new count \(x^{*}_i\) for the \(i^{th}\) bin with \(X^*_i\sim Binom(x_i+y_i,p)\), \(i=1,..,k\). If \(\sum x_i \ne \sum x^{*}_i\), randomly select a bin according to the probabilities \((x_1+y_1,..,x_k+y_k)\) and add or subtract 1 from \(x^{*}_i\) as needed. Repeat this until \(\sum x_i = \sum x^{*}_i\).
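A sketch of this sampling scheme in plain R (illustrative only; the package uses its own, faster implementation):

```r
# binomial sampling of new counts for the x data set (sketch of the scheme above)
sample_counts = function(x, y) {
  k = length(x)
  n = sum(x)
  p = n / sum(x + y)                        # overall proportion of data in x
  xstar = rbinom(k, x + y, p)               # new counts, bin by bin
  while(sum(xstar) != n) {                  # adjust until the total matches
    i = sample(k, 1, prob = x + y)          # pick a bin with probability ~ x_i + y_i
    if(sum(xstar) > n && xstar[i] > 0) xstar[i] = xstar[i] - 1
    if(sum(xstar) < n && xstar[i] < x[i] + y[i]) xstar[i] = xstar[i] + 1
  }
  list(x = xstar, y = x + y - xstar)
}
sample_counts(x2, y2)   # x2, y2 are the count vectors from the discrete example above
```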
The default is binomial sampling, with a simulation size of 5000 for testing and 1000 for power estimation.
The user can also provide a routine rnull, which generates new data under the null hypothesis. This is needed in the following situation: we wish to perform a goodness-of-fit test \(H_0:X\sim F(p)\), where \(F\) is some distribution function that depends on unknown parameters \(p\). These parameters can be estimated from the data by \(\hat{p}\). Unfortunately \(F\) is complicated and cannot be evaluated directly. We can, however, sample from \(F(\hat{p})\). So we generate a second data set \(y\) from \(F(\hat{p})\). In this setup the permutation method fails, and the p values have to be found via the parametric bootstrap.
Say we have a data set \(x\) and wish to test whether it comes from a normal distribution with unknown mean and standard deviation. Let's assume we do not know how to evaluate the normal cdf, so instead we sample from \(N(\bar{x}, sd(x))\):
```r
rnull = function(dta) {
  list(x=dta$x, y=rnorm(length(dta$y), mean(dta$x), sd(dta$x)))
}
twosample_test(x1, y1, rnull=rnull, B=500)
#> $statistics
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.1383   0.1666   0.3363   1.6834   0.6655  -3.2530   2.4087  -3.1940 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.2224  16.0900   5.1700  34.8500   8.8000 
#> 
#> $p.values
#>       KS   Kuiper      CvM       AD       LR       ZA       ZK       ZC 
#>   0.0540   0.1700   0.0180   0.0160   0.0240   0.0820   0.1320   0.0780 
#>   Wassp1 ES large ES small EP large EP small 
#>   0.0160   0.8846   0.5228   0.2482   0.4559
```
The routine rnull has to have as its argument a list with the data sets and return an object of the same type.
In the continuous case we have a sample x of size n and a sample y of size m. In the discussion below both are assumed to be ordered. We denote by z the combined sample. In the discrete case we have a random variable with values v, and x and y are the counts. We denote by k the number of observed values.
We denote the edf's of the two data sets by \(\widehat{F}\) and \(\widehat{G}\), respectively. Moreover we denote the empirical distribution function of the combined data set by \(\widehat{H}\).
In the case of continuous data, it is binned into nbins[1] equal probability bins, with the default nbins[1]=100. In the case of discrete data the number of classes comes from vals, the values observed.
Then a standard chi square test is run and the p value is found using the usual chi square approximation.
Many previous studies have shown that a chi square test with a large number of classes often has very poor power. So in the case of continuous data this test uses nbins[2] equal probability bins, with the default nbins[2]=10. In the case of discrete data, if the number of classes exceeds nbins[2]=10, classes are combined until there are nbins[2]=10 classes.
In either case the p values are found using the usual chi-square approximation. The degrees of freedom are the number of bins. For a literature review of chi square tests see (W. Rolke 2021).
For all other tests the p values are found using a permutation test. They are:
This test is based on the quantity \(\max\{|\widehat{F}(x)-\widehat{G}(x)|:x\in\mathbf{R}\}\). In the continuous case (without ties) the function \(\widehat{F}(x)-\widehat{G}(x)\) either takes a jump of size \(1/n\) up at a point of the x data set, a jump of size \(1/m\) down at a point of the y data set, or is flat. Therefore the test statistic is
\[\max\left\{\left|\sum_{i=1}^{j} \left(\frac1n I\{z_i \in x\}-\frac1m I\{z_i \in y\}\right)\right| :j=1,..,n+m\right\}\]
In the discrete case the jumps have sizes \(x_i/n\) and \(y_i/m\), respectively, and the test statistic is
\[\max\left\{\left|\sum_{i=1}^{j} \left(\frac{x_i}{n}-\frac{y_i}{m}\right)\right| :j=1,..,k\right\}\]
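As an illustration, the continuous-case statistic can be computed directly from the running difference of the edf's (a sketch assuming no ties; this is not the package code):

```r
# two-sample KS statistic from the running edf difference (sketch, no ties assumed)
ks_stat = function(x, y) {
  n = length(x); m = length(y)
  z = sort(c(x, y))                          # combined ordered sample
  d = cumsum(ifelse(z %in% x, 1/n, -1/m))    # F^(z_j) - G^(z_j) at the ordered z's
  max(abs(d))
}
ks_stat(x1, y1)   # x1, y1 from above
```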
This test was first proposed in (Kolmogorov 1933) and (Smirnov 1939) and is one of the most widely used tests today.
There is a close connection between the edf's and the ranks of the x and y in the combined data set. Using this one can also calculate the KS statistic, and many of the statistics to follow, based on these ranks. This is used in many software routines, for example the ks.test routine in R. However, when applied to discrete data these formulas can fail badly because of the many ties, and so we will not use them in our routine.
This test is closely related to Kolmogorov-Smirnov, but it uses the sum of the largest positive and negative differences between the edf's as a test statistic:
\[T_x=\max\left\{\sum_{i=1}^{j} \left(\frac1n I\{z_i\in x\}-\frac1m I\{z_i \in y\}\right) :j=1,..,n+m\right\}\]
\[T_y=\min\left\{\sum_{i=1}^{j} \left(\frac1n I\{z_i\in x\}-\frac1m I\{z_i \in y\}\right) :j=1,..,n+m\right\}\]
and the test statistic is \(T_x-T_y\). The discrete case follows directly. This test was first proposed in (Kuiper 1960).
This test is based on the integrated squared difference between the edf's:
\[\frac{nm}{(n+m)^2}\sum_{i=1}^{n+m}\left( \hat{F}(z_i)-\hat{G}(z_i) \right)^2 \]
The extension of the Cramer-von Mises criterion to the two-sample problem is discussed in (T. W. Anderson 1962).
This test is based on
\[nm\sum_{i=1}^{n+m-1} \frac{\left(\hat{F}(z_i)-\hat{G}(z_i) \right)^2}{i(n+m-i)} \]
It was first proposed in (Theodore W. Anderson, Darling, et al. 1952).
Let \(r_i\) and \(s_i\) be the ranks of x and y in the combined sample; then the test statistic is given by
\[\frac1{nm(n+m)}\left[n\sum_{i=1}^n(r_i-1)^2+m\sum_{i=1}^m(s_i-1)^2\right]\]
See (Lehmann 1951) and (Rosenblatt 1952).
These tests are based on the paper (Zhang 2006). Note that the calculation of ZC is the one exception to the rule: even in the discrete data case the calculation of the test statistic requires loops of lengths n and m. The calculation time will therefore increase with the sample sizes, and for very large data sets this test might have to be excluded.
A test based on the Wasserstein p=1 metric. It compares the quantiles of the x's and y's in the combined data set. If n=m the test statistic in the continuous case is very simple:
\[ \frac1n\sum_{i=1}^n |x_i-y_i|\]
If \(n\ne m\) it is first necessary to find the respective quantiles of the x's and y's in the combined data set. In the discrete case the test statistic is
\[\sum_{i=1}^{k-1} \left\vert\sum_{j=1}^i\frac{x_j}{n}-\sum_{j=1}^i\frac{y_j}{m} \right\vert(v_{i+1}-v_i)\]
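A direct implementation of the discrete-case formula might look like this (a sketch, with x and y the count vectors and vals the values; not the package code):

```r
# discrete Wasserstein-1 statistic, computed directly from the formula above (sketch)
wassp1_disc = function(x, y, vals) {
  k  = length(vals)
  Fx = cumsum(x)/sum(x)                     # edf of x at vals
  Fy = cumsum(y)/sum(y)                     # edf of y at vals
  sum(abs(Fx[-k] - Fy[-k]) * diff(vals))    # sum over i = 1,..,k-1
}
wassp1_disc(x2, y2, 0:5)   # x2, y2 from the discrete example above
```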
For a discussion of the Wasserstein distance see (Vaserstein 1969).
The routine has the following arguments:
x and y are numeric vectors. In the continuous data case they are simply the data, in the discrete case the counts.
vals a numeric vector of values of the discrete random variable. If missing, continuous data is assumed. In the discrete case x, y and vals have to be vectors of equal length.
TS a routine to calculate a test statistic. If missing, the built-in tests are used.
TSextra a list to be passed to TS.
wx, wy weights for the data sets (optional).
B=5000 number of simulation runs for the permutation test. If B=0, only the test statistics are found.
nbins=c(50, 10) number of bins to use for the chi-square tests. In the continuous data case equal-probability bins are found via the quantiles of the combined data set. In the discrete case nbins[1] is changed to the number of values, so the chi-square test is done on the original data set. Then neighboring values are combined until there are nbins[2] classes.
minexpcount=5 minimum expected counts required for the chi square tests.
maxProcessor if >1, parallel computing is used.
UseLargeSample for large data sets (>10000), use methods that have large-sample theory to find p values.
samplingmethod independence or MCMC for discretedata
rnull routine to generate data for parametricbootstrap.
SuppressMessages=FALSE suppress informative messages.
doMethods a character vector giving the names of the methods to include.
Say for example we want to use only the KS and AD tests:
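A minimal sketch of such a call, using x1 and y1 from above (the method names are those shown in the earlier output; the result is not shown here):

```r
# restrict twosample_test to the Kolmogorov-Smirnov and Anderson-Darling tests
twosample_test(x1, y1, B=500, doMethods=c("KS", "AD"))
```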
A list with vectors statistics (the test statistics) and p.values.
The routine twosample_power allows the estimation of the power of the various tests, and plot_power draws the corresponding power graph with the methods sorted by their mean power.
Note that due to the use of permutation tests the power calculations are fairly slow. Because of the time constraints imposed by CRAN the following routines are not run. A full power study with (say) 25 alternatives and fairly large data sets might take several hours to run.
New tests can be used in the same way as above.
Say we wish to find the powers of the tests when one data set comes from a standard normal distribution with sample size 100 and the other from a t distribution with 1-10 degrees of freedom and sample size 200.
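A sketch of such a power study (not run here because of the CRAN time constraints mentioned above; B and the grid of degrees of freedom are illustrative):

```r
# power study: N(0,1) sample of size 100 vs t(df) sample of size 200
pwr = twosample_power(
  f  = function(df) list(x = rnorm(100), y = rt(200, df)),
  df = 1:10,      # passed on to f
  B  = 1000
)
plot_power(pwr, "degrees of freedom")
```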
The arguments TS, TSextra, nbins, minexpcount, UseLargeSample, samplingmethod, rnull, SuppressMessages, maxProcessor and B are the same as in twosample_test. In addition we have
f A function that generates data sets x and y, either continuous data or the counts of discrete variables. In the discrete case care needs to be taken that the resulting vectors have the same length as vals. The output has to be a list with elements x and y. In the case of discrete data the list also has to include a vector vals, the values of the discrete variable.
… arguments passed to f. The most common case would be a single vector, but two vectors or none are also possible.
alpha=0.05 type I error probability to use inpower calculation
The arguments of plot_power are a matrix of powers created by twosample_power, (optionally) Smooth=FALSE if no smoothing is desired, and the name of the variable shown on the x axis.
We wish to find the powers of the tests when the data set x consists of 100 observations from a binomial distribution with n=10 and p=0.5, and y of 120 observations from a binomial distribution with n=10 and a number of different success probabilities p. Note that in either case not all of the possible values 0-10 will actually appear in every data set, so we need to take some care to ensure that x, y and vals match every time:
```r
plot_power(twosample_power(
  function(p) {
    vals = 0:10
    x = table(c(vals, rbinom(100, 10, 0.5))) - 1   # make sure each vector has length 11
    y = table(c(vals, rbinom(120, 10, p))) - 1     # and names 0-10
    vals = vals[x + y > 0]                         # only vals with at least one count
    list(x = x[as.character(vals)], y = y[as.character(vals)], vals = vals)
  },
  p = seq(0.5, 0.6, length = 5)
), "p")
```
We want to assess the effect of the sample size when one data set comes from a standard normal and the other from a normal with mean 1 and standard deviation 1:
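A sketch of this sample-size study (again not run; the grid of sample sizes and B are illustrative):

```r
# power as a function of the (common) sample size, N(0,1) vs N(1,1)
pwr = twosample_power(
  f = function(n) list(x = rnorm(n), y = rnorm(n, 1, 1)),
  n = seq(20, 100, by = 20),   # passed on to f
  B = 1000
)
plot_power(pwr, "sample size")
```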
As no single test can be relied upon to consistently have good power, it is reasonable to employ several of them. We would then reject the null hypothesis if any of the tests does so, that is, if the smallest p value is less than the desired type I error probability \(\alpha\).
This procedure clearly suffers from the problem of simultaneous inference, and the true type I error probability will be much larger than \(\alpha\). It is however possible to adjust the p value so it does achieve the desired \(\alpha\). This can be done as follows:
We generate a number of data sets under the null hypothesis. Generally about 1000 will be sufficient. Then for each simulated data set we apply the tests we wish to include and record the smallest p value. Here is an example.
```r
pvals = matrix(0, 1000, 13)
for(i in 1:1000)
  pvals[i, ] = R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values
```
Next we find the smallest p value in each run for two selections of methods. One is the selection found to be best above, namely Zhang's ZC and ZK methods, the methods by Kuiper and Wasserstein as well as a chi square test with a small number of bins. As a second selection we use the methods by Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Cramer-von Mises and Lehmann-Rosenblatt, which for this null data turn out to be highly correlated.
```r
colnames(pvals) = names(R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values)
p1 = apply(pvals[, c("ZK", "ZC", "Wassp1", "Kuiper", "ES small")], 1, min)
p2 = apply(pvals[, c("KS", "Kuiper", "AD", "CvM", "LR")], 1, min)
```
Next we find the empirical distribution function for the two sets of p values and draw their graphs. We also add the curves for the case of four identical tests and the case of four independent tests, which of course is the Bonferroni correction. The data for the cdfs is in the inst/extdata directory of the package.
```r
tmp = R2sample::pvaluecdf
Tests = factor(c(rep("Identical Tests", nrow(tmp)),
                 rep("Correlated Selection", nrow(tmp)),
                 rep("Best Selection", nrow(tmp)),
                 rep("Independent Tests", nrow(tmp))),
               levels = c("Identical Tests", "Correlated Selection",
                          "Best Selection", "Independent Tests"),
               ordered = TRUE)
dta = data.frame(x = c(tmp[,1], tmp[,1], tmp[,1], tmp[,1]),
                 y = c(tmp[,1], tmp[,3], tmp[,2], 1 - (1 - tmp[,1])^4),
                 Tests = Tests)
ggplot2::ggplot(data = dta, ggplot2::aes(x = x, y = y, col = Tests)) +
  ggplot2::geom_line(linewidth = 1.2) +
  ggplot2::labs(x = "p value", y = "CDF") +
  ggplot2::scale_color_manual(values = c("blue", "red", "Orange", "green"))
```
The package includes the routine run.studies, which provides 20 case studies each for the continuous and the discrete case. This allows the user to easily compare the power of the methods included in the package to a different one of their choice.
Say a user wishes to study the performance of the chi-square test, implemented in chitest. Of course this test is already implemented in R2sample, so this is for illustration only. Say we wish to see its performance when compared to the tests implemented in R2sample for the case where the null hypothesis specifies a uniform distribution but the data actually comes from a linear distribution with slope s:
```r
chitest = function(x, y, TSextra) {
  nbins = TSextra$nbins
  nx = length(x); ny = length(y); n = nx + ny
  xy = c(x, y)
  bins = quantile(xy, (0:nbins)/nbins)
  Ox = hist(x, bins, plot=FALSE)$counts
  Oy = hist(y, bins, plot=FALSE)$counts
  tmp = sqrt(sum(Ox)/sum(Oy))
  chi = sum((Ox/tmp - Oy*tmp)^2/(Ox + Oy))
  pval = 1 - pchisq(chi, nbins - 1)
  out = ifelse(TSextra$statistic, chi, pval)
  names(out) = "ChiSquare"
  out
}
TSextra = list(nbins=5, statistic=FALSE)
pwr = R2sample::run.studies(chitest, "uniform.linear", TSextra=TSextra, With.p.value=TRUE)
#> Running case study uniform.linear.cont ...
R2sample::plot_power(pwr, "slope")
```
Arguments of run.studies:
Arguments of routine to calculate new test:
Continuous data:
x,y the two data sets
TSextra (optional): a list passed to the routine with any object needed to do the calculation.
Discrete data:
x,y the two data sets
vals: the possible values of the discrete random variable
TSextra (optional): a list passed to the routine with any object needed to do the calculation.
The output of the routine has to be a named vector with either the test statistic(s) or the p value(s).
The case studies are as follows. In each the first term specifies the model under the null hypothesis and the second the model under the alternative hypothesis.
uniform.linear \(\hspace{3cm}\) U[0,1] vs a linear model on [0,1] with slope s.
uniform.quadratic \(\hspace{2.5cm}\) U[0,1] vs a quadratic model with vertex at 0.5 and some curvature a.
uniform.bump \(\hspace{3.1cm}\) U[0,1] vs U[0,1]+N(0.5,0.05).
uniform.sine \(\hspace{3.3cm}\) U[0,1] vs U[0,1]+Sine wave.
beta22.betaaa \(\hspace{3cm}\) Beta(2,2) vs Beta(a,a).
beta22.beta2a \(\hspace{3cm}\) Beta(2,2) vs Beta(2,a).
normal.shift \(\hspace{3.5cm}\) N(0,1) vs N(\(\mu\),1).
normal.stretch \(\hspace{3.1cm}\) N(0,1) vs N(0,\(\sigma\)).
normal.t \(\hspace{4.2cm}\) N(0,1) vs t(df).
normal.outlier1 \(\hspace{3.1cm}\) N(0,1) vs N(0,1)+U[2,3].
normal.outlier2 \(\hspace{3.1cm}\) N(0,1) vs N(0,1)+U[-3,-2]+U[2,3].
exponential.gamma \(\hspace{2.3cm}\) Exp(1) vs Gamma(1,b).
exponential.weibull \(\hspace{2.5cm}\) Exp(1) vs Weibull(1,b).
exponential.bump \(\hspace{2.7cm}\) Exp(1) vs Exp(1)+N(0.5,0.05).
gamma.normal \(\hspace{3.1cm}\) Gamma(\(\mu\)) vs N(\(\bar{x}\), sd(x)); here the mean and standard deviation of the normal distribution are the sample mean and sample standard deviation of the x data set.
normal.normalmixture \(\hspace{2.1cm}\) N(0,1) vs N(\(-\mu\),1)+N(\(\mu\),1).
uniform.uniformmixture \(\hspace{1.9cm}\) U[0,1] vs \(\alpha\)U[0,1/2]+(1-\(\alpha\))U[1/2,1].
uniform.betamixture \(\hspace{2.5cm}\) U[0,1] vs \(\alpha\)U[0,1/2]+(1-\(\alpha\))Beta(2,2).
chisquare.noncentral \(\hspace{2.3cm}\) \(\chi^2(5)\) vs \(\chi^2(5, \tau)\).
uniform.triangular \(\hspace{2.9cm}\) U[0,1] vs triangular.
Generally a user will wish to apply their test to all the case studies. This can be done easily, as in the sketch below:
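A possible call (a sketch only: it assumes that leaving out the case-study name runs all studies; check the run.studies help page for the exact interface):

```r
# run chitest against all continuous case studies
# (assumption: omitting the study name runs every case study; see ?R2sample::run.studies)
pwr.all = R2sample::run.studies(chitest, TSextra=TSextra, With.p.value=TRUE)
```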
run.studies can also be used to run the studies for the included methods but for different values of param_alt, nsample and/or alpha. For example, say we wish to find the powers of the included continuous methods for the cases uniform.linear and uniform.quadratic, with samples of size 2000 and a true type I error probability of 0.1. Moreover, for the case uniform.linear we want the slopes to be 0.1 and 0.2, and for the case uniform.quadratic we want param_alt to be 1.3:
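A sketch of such a call (the argument names param_alt, nsample and alpha come from the description above; the name of the argument selecting the studies and the exact form in which param_alt is passed are assumptions, so consult the run.studies help page):

```r
# sketch: included continuous methods, two case studies, larger samples, alpha = 0.1
# (argument name "study" and the list form of param_alt are assumptions)
R2sample::run.studies(
  study     = c("uniform.linear", "uniform.quadratic"),
  param_alt = list(c(0.1, 0.2), 1.3),
  nsample   = 2000,
  alpha     = 0.1
)
```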