R Package Rgof

```r
library(Rgof)
Bsim = 500 # number of simulation runs
```

Note that Bsim is very small so that this vignette does not take too long to run. As a consequence, none of the p values or powers shown below are precise.

The package Rgof brings together a number of routines for the goodness-of-fit problem for univariate data. We have a data set \(\pmb{x}\), and we want to test whether it was generated by the probability distribution \(F\).

The highlights of this package are described in the sections below.

```r
set.seed(123)
```

Note all runs of the test routines are done with B=1000 and all runs of the power routines with argument B=500 in order to pass devtools::check().

The Methods

  1. Kolmogorov-Smirnov (KS) (Massey 1951), (Kolmogorov 1933), (Smirnov 1948)
  2. Kuiper (K) (Kuiper 1960)
  3. Anderson-Darling (AD) (Anderson and Darling 1952), (Anderson and Darling 1954)
  4. Cramér-von Mises (CvM) (Anderson 1962)
  5. Wilson (W)
  6. Zhang’s methods (ZA, ZK, ZC) (Zhang 2002)
  7. Wasserstein p=1 (Wassp1) (del Barrio et al. 1999)

For all of these tests the distribution of the test statistic under the null hypothesis is found via simulation.
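This can be illustrated in a few lines of R (a minimal sketch, not the package's internal code; sim_pvalue and TS_stat are hypothetical names, with TS_stat standing in for any of the test statistics above):

```r
# Monte Carlo p value: simulate the null distribution of a statistic
# and compare it to the observed value.
sim_pvalue = function(x, rnull, TS_stat, B=1000) {
  t0 = TS_stat(x)                      # statistic on the observed data
  tb = replicate(B, TS_stat(rnull()))  # statistics on B null data sets
  mean(tb >= t0)                       # proportion at least as extreme
}
```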

  8. Eight variations of chi square tests, using the formulas by Pearson or based on the likelihood ratio test, bins of equal size or equal probability, and both a large and a small number of bins, 50 and 10 by default. The p values are found using the usual chi square approximation to the null distribution. If parameters are estimated, this is done either via the method of minimum chi square (Berkson 1980) or via a user-provided estimator. In all cases bins are combined until all of them have an expected count of at least 5, as sketched below.
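The bin-combining step can be illustrated as follows (a sketch of the idea only, not the package's actual algorithm; combine_bins is a hypothetical helper):

```r
# Merge the bin with the smallest expected count into an adjacent bin
# until every expected count is at least minexpcount.
combine_bins = function(O, E, minexpcount=5) {
  while(length(E) > 1 && min(E) < minexpcount) {
    i = which.min(E)
    j = if(i == length(E)) i - 1 else i + 1  # neighbor to merge into
    O[j] = O[j] + O[i]; E[j] = E[j] + E[i]
    O = O[-i]; E = E[-i]
  }
  list(counts=O, expected=E)
}
```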

There is a very large literature on chi square tests, the oldest of the goodness-of-fit tests. For a survey see (Rolke and Gutierrez-Gongora 2020).

All the methods above are also implemented for discrete data, except for Zhang’s tests, which have no discrete analog.

It is worth noting that these discrete versions are based on the theoretical ideas of the tests and not on the actual formula of calculation for the continuous case. The test statistics can therefore be different even when applied to the same data. For example, the Anderson-Darling test is based on the distance measure

\[A^2=n\int_{-\infty}^{\infty}\frac{(\hat{F}(x)-F(x))^2}{F(x)(1-F(x))}dF(x)\] where \(F\) is the theoretical distribution function under the null hypothesis and \(\hat{F}\) is the empirical distribution function. In the case of continuous data it can be shown that

\[A^2=-n-\frac{1}{n}\sum_{i=1}^n(2i-1)\left(\log F(x_i) +\log\left[1-F(x_{n+1-i})\right]\right)\] However, for discrete data we have

\[A^2=n\sum_{i=1}^k\frac{(\hat{F}(x_i)-F(x_i))^2}{F(x_i)(1-F(x_i))}\left(F(x_i)-F(x_{i-1})\right)\]

with \(F(x_0)=0\).
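Minimal R sketches of the two computational formulas above (illustrations only, not the package's internal code; ad_cont and ad_disc are hypothetical names):

```r
# Continuous case: x is the raw data, pnull the null cdf.
ad_cont = function(x, pnull) {
  n = length(x)
  Fx = sort(pnull(x))
  -n - sum((2*(1:n)-1) * (log(Fx) + log(1 - rev(Fx))))/n
}
# Discrete case: x is the vector of counts, Fx = pnull() the cdf at vals.
ad_disc = function(x, Fx) {
  n = sum(x)
  Fhat = cumsum(x)/n    # empirical cdf at vals
  dF = diff(c(0, Fx))   # F(x_i) - F(x_{i-1}), with F(x_0) = 0
  ok = Fx > 0 & Fx < 1  # skip terms where the denominator vanishes
  n * sum(((Fhat - Fx)^2/(Fx*(1 - Fx)) * dF)[ok])
}
```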

In the continuous case \(\hat{F}\) is a step function but \(F\) is continuous, and therefore \(A^2>0\). In the discrete case, however, \(A^2=0\) is possible. This shows that the two cases are fundamentally different and therefore require different formulas for the test statistic.

As for continuous data, null distributions are found using simulation. In fact, in the case of discrete data none of the tests has a known distribution for the test statistic under the null hypothesis.

For discrete data there are four variations of chi square tests, using the formulas by Pearson and the likelihood ratio as well as a large and a small number of bins. Again the routine combines bins until all have expected counts greater than 5, and the chi square approximation is used to find p values. The combination of bins is done in such a way that the bins remain of equal size as much as possible.

These methods can be used for both discrete and histogram data. The main difference between these two is that discrete data has a (countable) number of possible values whereas histogram data has possible ranges of values (the bins). The only method directly affected by this difference is Wassp1, which requires actual values. All other methods ignore the vals argument.

Testing

Discrete (Histogram) Data/Model

Simple Null Hypothesis

We generate a data set of size 1000 from a binomial distribution with n=20 and success probability 0.5, and then test \(H_0:F=Bin(20, 0.5)\).

```r
vals = 0:20 # possible values of random variable
pnull = function() pbinom(0:20, 20, 0.5) # cumulative distribution function (cdf)
rnull = function() table(c(0:20, rbinom(1000, 20, 0.5))) - 1
# generate data under the null hypothesis; make sure that the vector of counts
# has the same length as vals, possibly 0.
```
  • Null Hypothesis is true
```r
x = rnull()
# Basic test
gof_test(x, vals, pnull, rnull, B=1000)
#> maxProcessor set to 1 for faster computation
#> $statistics
#>     KS      K     AD    CvM      W    l-P    s-P    l-L    s-L
#> 0.1700 0.1730 0.1320 0.0249 2.0450 2.6010 1.9530 2.6750 2.0380
#>
#> $p.values
#>     KS      K     AD    CvM      W    l-P    s-P    l-L    s-L
#> 0.6640 0.7570 0.9960 0.9680 0.6670 0.9978 0.9967 0.9974 0.9960

# Test with adjusted overall p value
gof_test_adjusted_pvalue(x, vals, pnull, rnull, B=c(1000, 500), maxProcessor=1)
#> p values of individual tests:
#> W :  0.407
#> AD :  0.416
#> s-P :  0.9971
#> adjusted p value of combined tests: 0.7674
```
  • Null Hypothesis is false
```r
x = table(c(0:20, rbinom(1000, 20, 0.55))) - 1 # true p is 0.55, not 0.5
# Basic test
gof_test(x, vals, pnull, rnull, B=1000, doMethod="all")$p.value
#> maxProcessor set to 1 for faster computation
#>  KS   K  AD CvM   W l-P s-P l-L s-L
#>   0   0   0   0   0   0   0   0   0

# Test with adjusted overall p value
gof_test_adjusted_pvalue(x, vals, pnull, rnull, B=c(1000, 500), maxProcessor=1)
#> p values of individual tests:
#> W :  0.32
#> AD :  0.973
#> s-P :  0
#> adjusted p value of combined tests: 0.002
```

Arguments of gof_test for discrete data/model:

  • x: vector with counts (histogram heights). Should have a number for each value of vals, possibly 0.

  • vals: all possible values of the discrete random variable, that is all x with \(P(X=x)>0\).

  • pnull: function to find the values of the cumulative distribution function for each value of vals. Function has no arguments.

  • rnull: function to generate data from the true density. Function has no arguments. Function needs to ensure that the output is a vector with the same length as vals.

  • B=5000: number of simulation runs.

  • w: function to find importance sampling weights, if needed.

  • phat: function to estimate parameters.

  • TS: function to find the values of user-supplied test statistics.

  • TSextra: a list that is passed to TS if any additional info is required.

  • nbins=c(50, 10): number of bins for the chi square tests. The first one is already given by the data in the discrete case; for the second, bins are joined.

  • rate=0: if not 0, the sample size is assumed to have come from a Poisson random variable with rate "rate".

  • minexpcount=5: minimal expected counts for the chi square tests.

  • ChiUsePhat=TRUE: if TRUE, uses the user-supplied function phat for parameter estimation. If FALSE, uses the method of minimum chi square.

  • maxProcessor=1: if greater than 1, the number of cores for parallel processing. Parallel processing is usually not useful for discrete data, as the single-thread version generally runs faster.

  • doMethods="all": names of methods to include.

The arguments of gof_test_adjusted_pvalue for discrete data/model are the same, except that the number of simulation runs B is two numbers. The first is used for estimating the individual p values, the second for the adjustment.

Random Sample Size

In some fields like high energy physics it is common that the sample size is not fixed but a random variable drawn from a Poisson distribution with a known rate. Our package runs this as follows:

```r
rnull = function() table(c(0:20, rbinom(rpois(1, 650), 20, 0.5))) - 1
x = rnull()
gof_test(x, vals, pnull, rnull, rate=650, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W    l-P    s-P    l-L    s-L
#> 0.0120 0.0190 0.1750 0.1390 0.0280 0.5198 0.5198 0.5107 0.5107
```

Composite Null Hypothesis

We generate a data set of size 1000 from a binomial distribution with n=20 and success probability p, and then test \(H_0:F=Bin(20, \cdot)\), with p estimated from the data.

```r
vals = 0:20
pnull = function(p=0.5) pbinom(0:20, 20, ifelse(p>0 && p<1, p, 0.5))
rnull = function(p=0.5) table(c(0:20, rbinom(1000, 20, p))) - 1
phat = function(x) sum(0:20*x)/sum(x)/20
```
  • Null Hypothesis is true
```r
x = table(c(0:20, rbinom(1000, 20, 0.5))) - 1
gof_test(x, vals, pnull, rnull, phat=phat, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W    l-P    s-P    l-L    s-L
#> 0.1700 0.2680 0.4900 0.6400 0.2470 0.3843 0.8239 0.3190 0.8074
```
  • Null Hypothesis is true
```r
x = table(c(0:20, rbinom(1000, 20, 0.55))) - 1
# p is not 0.5, but data is still from a binomial distribution with n=20
gof_test(x, vals, pnull, rnull, phat=phat, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W    l-P    s-P    l-L    s-L
#> 0.0770 0.1280 0.2960 0.4820 0.0810 0.4317 0.3359 0.4815 0.3812
```
  • Null Hypothesis is false
```r
x = table(c(rep(0:20, 5), rbinom(1000-21*5, 20, 0.53)))
# data has too many small and large values to be from a binomial
gof_test(x, vals, pnull, rnull, phat=phat, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>    KS     K    AD   CvM     W   l-P   s-P   l-L   s-L
#> 0.214 0.004 0.000 0.000 0.011 0.000 0.000 0.000 0.000
```

The arguments are the same as for the simple hypothesis case, except that the user has to supply a routine phat to estimate the parameters, and the functions pnull and rnull now have one argument, namely the vector with parameter estimates.

The estimation of the parameter(s) in the case of the chi square tests is done either by using the function phat or via the minimum chi square method. The routine uses a general function minimizer. If there are values of the parameter that are not possible, this can lead to warnings. It is best to put a check into the pnull function to avoid this issue. As an example, the function pnull above checks that the success probability p is in the interval \((0,1)\).
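For the one-parameter binomial example above, the minimum chi square idea can be sketched as follows (a sketch using Brent minimization via optim(); the package's internal minimizer may differ, and min_chisq is a hypothetical name):

```r
# Find the p that minimizes Pearson's chi square statistic.
min_chisq = function(x, vals) {
  n = sum(x)
  pearson = function(p) {
    E = n * diff(c(0, pbinom(vals, 20, p))) # expected counts per value
    sum((x - E)^2/pmax(E, 1e-10))           # guard against zero expected counts
  }
  optim(0.5, pearson, method="Brent", lower=0.01, upper=0.99)$par
}
# min_chisq(x, 0:20) should be close to phat(x)
```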

Histogram Data

A variant of discrete data sometimes encountered is data given in the form of a histogram, that is, as a set of bins and their counts. The main distinction is that discrete data has specific values, for example the non-negative integers for a Poisson distribution, whereas histogram data has ranges of numbers, the bins. It turns out, though, that the only method that requires actual values is Wassp1, and for that method one can use the midpoints of the intervals.

As an example consider the following case: we have histogram data and we want to test whether it comes from an exponential distribution with rate 1, truncated to the interval [0,2]:

```r
rnull = function() {
  y = rexp(2500, 1)    # Exp(1) data
  y = y[y<2][1:1500]   # 1500 events on [0,2]
  bins = 0:40/20       # binning
  hist(y, bins, plot=FALSE)$counts # find bin counts
}
x = rnull()
bins = 0:40/20
vals = (bins[-1] + bins[-41])/2 # use bin midpoints as values
pnull = function() {
  bins = 1:40/20
  pexp(bins, 1)/pexp(2, 1)
}
gof_test(x, vals, pnull, rnull)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W    l-P    s-P    l-L    s-L
#> 0.1250 0.1644 0.5416 0.4640 0.6668 0.7620 0.7702 0.7351 0.7789
```

Continuous Data

Simple Hypothesis

```r
pnull = function(x) pnorm(x)
rnull = function() rnorm(1000)
TSextra = list(qnull=function(x) qnorm(x))
# optional quantile function, used by the chi square tests and the Wassp1 test
```
  • Null Hypothesis is true
```r
x = rnorm(1000)
# Basic tests
gof_test(x, NA, pnull, rnull, B=1000, TSextra=TSextra)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#> 0.2900 0.2900 0.2970 0.2700 0.0440 0.8550 0.4600 0.8970 0.3330 0.2113 0.1315
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0.1050 0.2925 0.1702 0.1236 0.1327 0.2954

# Adjusted p value
gof_test_adjusted_pvalue(x, NA, pnull, rnull, B=c(1000, 500),
                         TSextra=TSextra, maxProcessor=1)
#> p values of individual tests:
#> W :  0.084
#> ZC :  0.23
#> AD :  0.053
#> ES-s-P :  0.1315
#> adjusted p value of combined tests: 0.1428
```
  • Null Hypothesis is false
```r
x = rnorm(1000, 0.5)
gof_test(x, NA, pnull, rnull, B=1000, TSextra=TSextra)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#>      0      0      0      0      0      0      0      0      0      0      0
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#>      0      0      0      0      0      0
```

Composite Hypothesis - One Parameter

```r
pnull = function(x, p=0) pnorm(x, p)
TSextra = list(qnull = function(x, p=0) qnorm(x, p))
rnull = function(p) rnorm(1000, p)
phat = function(x) mean(x)
```
  • Null Hypothesis is true
```r
x = rnorm(1000)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#> 0.5790 0.5790 0.8290 0.6750 0.6590 0.8400 0.8920 0.8780 0.8800 0.1819 0.5819
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0.1445 0.8424 0.0904 0.5836 0.1194 0.8291
```
  • Null Hypothesis is true
```r
x = rnorm(1000, 0.5)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#> 0.9752 0.9752 0.8086 0.9284 0.9148 0.2738 0.2352 0.2422 0.7756 0.1701 0.1045
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0.1571 0.8034 0.1721 0.0618 0.1669 0.7995
```
  • Null Hypothesis is false
```r
x = rnorm(1000, 0.5, 2)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#>      0      0      0      0      0      0      0      0      0      0      0
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#>      0      0      0      0      0      0
```

The arguments of gof_test are the same as in the discrete case, except that vals=NA. The functions pnull and rnull are functions of one variable in the case of a simple hypothesis and of two variables in the case of parameter estimation.

Composite Hypothesis - Multiple Parameters

```r
pnull = function(x, p=c(0,1)) pnorm(x, p[1], ifelse(p[2]>0, p[2], 0.001))
TSextra = list(qnull = function(x, p=c(0,1)) qnorm(x, p[1], ifelse(p[2]>0, p[2], 0.001)))
rnull = function(p=c(0,1)) rnorm(1000, p[1], ifelse(p[2]>0, p[2], 0.001))
phat = function(x) c(mean(x), sd(x))
```
  • Null Hypothesis is true
```r
x = rnorm(1000)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#> 0.8030 0.8030 0.7320 0.7580 0.7350 0.8690 0.7090 0.8540 0.7170 0.3538 0.0778
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0.7731 0.4104 0.3177 0.0808 0.7928 0.4082
```
  • Null Hypothesis is true
```r
x = rnorm(1000, 0.5)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#> 0.7780 0.7780 0.6400 0.5390 0.4930 0.6740 0.8650 0.6300 0.6200 0.6458 0.6988
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0.2448 0.7135 0.6520 0.7076 0.2315 0.7126
```
  • Null Hypothesis is true
```r
x = rnorm(1000, 0.5, 2)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#> 0.9770 0.9770 0.9600 0.9720 0.9670 0.7800 0.9110 0.6460 0.9630 0.9952 0.8894
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0.9482 0.8352 0.9930 0.8893 0.9284 0.8368
```
  • Null Hypothesis is false
```r
x = rt(1000, 2)
gof_test(x, NA, pnull, rnull, phat=phat, TSextra=TSextra, B=1000)$p.value
#> maxProcessor set to 1 for faster computation
#>     KS      K     AD    CvM      W     ZA     ZK     ZC Wassp1 ES-l-P ES-s-P
#>      0      0      0      0      0      0      0      0      0      0      0
#> EP-l-P EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#>      0      0      0      0      0      0
```

Power Estimation

For estimating the power of the various tests one also has to provide the routine ralt, which generates data under the alternative hypothesis:

Discrete Data/Model

Simple Null Hypothesis

```r
vals = 0:10
pnull = function() pbinom(0:10, 10, 0.5)
rnull = function() table(c(0:10, rbinom(100, 10, 0.5))) - 1
ralt = function(p=0.5) table(c(0:10, rbinom(100, 10, p))) - 1
P = gof_power(pnull, vals, rnull, ralt, param_alt=seq(0.5, 0.6, 0.02),
              B=Bsim, nbins=c(11, 5), maxProcessor=1)
plot_power(P, "p", Smooth=FALSE)
```

In all cases the arguments are the same as for gof_test. In addition we now have

  • ralt: a routine with one parameter that generates data under some alternative hypothesis.

  • param_alt: values to be passed to ralt. This allows the calculation of the power for many different values.

  • alpha=0.05: type I error probability of the tests.

  • B=1000: the number of simulation runs.

Composite Null Hypothesis

```r
vals = 0:10
pnull = function(p=0.5) pbinom(0:10, 10, ifelse(0<p & p<1, p, 0.001))
rnull = function(p=0.5) table(c(0:10, rbinom(100, 10, ifelse(0<p & p<1, p, 0.001)))) - 1
phat = function(x) sum(0:10*x)/1000
```
  • Null Hypothesis is true
```r
ralt = function(p=0.5) table(c(0:10, rbinom(100, 10, p))) - 1
gof_power(pnull, vals, rnull, ralt, c(0.5, 0.6), phat=phat,
          B=Bsim, nbins=c(11, 5), maxProcessor=1)
#>        KS     K    AD   CvM     W Wassp1   l-P   s-P
#> 0.5 0.036 0.034 0.054 0.056 0.042  0.048 0.038 0.038
#> 0.6 0.070 0.050 0.068 0.060 0.060  0.046 0.032 0.032
```

Note that power estimation in the case of a composite hypothesis (aka with parameters estimated) is much slower than in the simple hypothesis case.

  • Null Hypothesis is false
```r
ralt = function(p=0.5) table(c(rep(0:10, 2), rbinom(100, 10, p)))
gof_power(pnull, vals, rnull, ralt, 0.5, phat=phat,
          B=Bsim, nbins=c(11, 5), maxProcessor=1)
#>     KS      K     AD    CvM      W Wassp1    l-P    s-P
#>  0.108  0.290  1.000  1.000  0.654  1.000  1.000  1.000
```

Continuous Data/Model

Simple Null Hypothesis

```r
pnull = function(x) pnorm(x)
TSextra = list(qnull = function(x) qnorm(x))
rnull = function() rnorm(100)
ralt = function(mu=0) rnorm(100, mu)
gof_power(pnull, NA, rnull, ralt, c(0, 1), TSextra=TSextra, B=Bsim, maxProcessor=1)
#>      KS     K    AD   CvM     W    ZA    ZK    ZC Wassp1 ES-l-P ES-s-P EP-l-P
#> 0 0.026 0.026 0.048 0.046 0.034 0.026 0.046 0.052  0.052  0.058  0.054  0.056
#> 1 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000  1.000  1.000  1.000  1.000
#>   EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0  0.046  0.066  0.062   0.05  0.056
#> 1  1.000  1.000  1.000   1.00  1.000
```

Composite Null Hypothesis

```r
pnull = function(x, p=c(0,1)) pnorm(x, p[1], ifelse(p[2]>0, p[2], 0.01))
TSextra = list(qnull = function(x, p=c(0,1)) qnorm(x, p[1], ifelse(p[2]>0, p[2], 0.01)))
rnull = function(p=c(0,1)) rnorm(500, p[1], p[2])
ralt = function(mu=0) rnorm(100, mu)
phat = function(x) c(mean(x), sd(x))
gof_power(pnull, NA, rnull, ralt, c(0, 1), phat=phat,
          TSextra=TSextra, B=Bsim, maxProcessor=1)
#>      KS     K    AD   CvM     W    ZA    ZK    ZC Wassp1 ES-l-P ES-s-P EP-l-P
#> 0 0.958 0.898 0.030 0.040 0.034 0.544 0.004 0.010  0.996  0.050  0.034  0.056
#> 1 0.956 0.878 0.028 0.034 0.032 0.520 0.004 0.008  0.998  0.052  0.036  0.046
#>   EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 0  0.070  0.058  0.044  0.060  0.082
#> 1  0.038  0.052  0.036  0.056  0.038
```
```r
ralt = function(df=1) {
  # t distribution truncated at +-5
  x = rt(1000, df)
  x = x[abs(x)<5]
  x[1:100]
}
gof_power(pnull, NA, rnull, ralt, c(2, 50), phat=phat, Range=c(-5, 5),
          TSextra=TSextra, B=Bsim, maxProcessor=1)
#>       KS     K    AD   CvM     W    ZA    ZK    ZC Wassp1 ES-l-P ES-s-P EP-l-P
#> 2  0.992 0.978 0.562 0.514 0.542 0.946 0.272 0.184  1.000   0.27  0.346  0.236
#> 50 0.948 0.880 0.030 0.024 0.028 0.636 0.010 0.032  0.942   0.04  0.070  0.046
#>    EP-s-P ES-l-L ES-s-L EP-l-L EP-s-L
#> 2   0.268  0.260  0.364  0.242  0.284
#> 50  0.054  0.044  0.074  0.056  0.068
```

Running other tests

It is very easy for a user to add other goodness-of-fit tests to the package.

Example

Say we wish to use tests that are variants of the Cramér-von Mises test, using the integrated absolute difference of the empirical and the theoretical distribution function:

\[\int_{-\infty}^{\infty} \vert F(x) -\hat{F}(x) \vert^p dF(x)\] For continuous data we have the routine

```r
newTScont = function(x, pnull, param) {
  Fx = sort(pnull(x))
  n = length(x)
  out = c(sum(abs((2*1:n-1)/2/n - Fx)),
          sum(sqrt(abs((2*1:n-1)/2/n - Fx))))
  names(out) = c("CvM alt 1", "CvM alt 2")
  out
}
```

This routine has to have three or four arguments: x, pnull, param and (optionally) TSextra. x is the data and pnull a function that finds the cdf at x. param has to be the estimated parameters in the case of a composite null hypothesis, and is ignored in the case of a simple null hypothesis. TSextra has to be a list of additional items needed for the calculation of the test statistic, if any.

Note that the return object has to be a named vector.

Then we can run this test with

```r
pnull = function(x) punif(x)
rnull = function() runif(500)
x = rnull()
Rgof::gof_test(x, NA, pnull, rnull, TS=newTScont)
#> maxProcessor set to 1 for faster computation
#> $statistics
#> CvM alt 1 CvM alt 2
#>     5.469    47.960
#>
#> $p.values
#> CvM alt 1 CvM alt 2
#>    0.6264    0.6374
```

Say we want to find the power of this test when the true distribution is linear:

```r
ralt = function(slope=0) {
  if(slope==0) return(runif(500))
  (slope - 1 + sqrt((1-slope)^2 + 4*slope*runif(500)))/2/slope
}
```
```r
gof_power(pnull, NA, rnull, ralt, TS=newTScont,
          param_alt=round(seq(0, 0.5, length=3), 3),
          Range=c(0, 1), B=Bsim, maxProcessor=1)
#>      CvM alt 1 CvM alt 2
#> 0        0.054     0.058
#> 0.25     0.898     0.906
#> 0.5      1.000     1.000
```
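If the test statistic needs additional quantities, the optional fourth argument TSextra can carry them. A minimal sketch (newTScont4 and the list element q are hypothetical names for illustration):

```r
# Variant of newTScont whose exponent q is passed in via TSextra.
newTScont4 = function(x, pnull, param, TSextra) {
  Fx = sort(pnull(x))
  n = length(x)
  out = sum(abs((2*1:n-1)/2/n - Fx)^TSextra$q)
  names(out) = paste0("CvM alt q=", TSextra$q)
  out
}
# e.g. Rgof::gof_test(x, NA, pnull, rnull, TS=newTScont4, TSextra=list(q=1.5))
```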

For discrete data we will write the routine using Rcpp (to avoid issues with CRAN submission, this routine is already part of Rgof):

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector newTSdisc(IntegerVector x,
                        Function pnull,
                        NumericVector param,
                        NumericVector vals) {
  Rcpp::CharacterVector methods = CharacterVector::create("CvM alt");
  int const nummethods = methods.size();
  int k = x.size(), n, i;
  NumericVector TS(nummethods), ecdf(k), Fx(k);
  double tmp;
  TS.names() = methods;
  Fx = pnull(param);
  n = 0;
  for(i=0; i<k; ++i) n = n + x[i];
  ecdf(0) = double(x(0))/double(n);
  for(i=1; i<k; ++i) {
    ecdf(i) = ecdf(i-1) + x(i)/double(n);
  }
  tmp = std::abs(ecdf[0] - Fx(0))*Fx(0);
  for(i=1; i<k; ++i)
    tmp = tmp + std::abs(ecdf(i) - Fx(i))*(Fx(i) - Fx(i-1));
  TS(0) = tmp;
  return TS;
}
```

The routine has to have four or five arguments: x, pnull, param, vals and (optionally) TSextra. The output vector has to have names.

Note that one drawback of writing the routine in Rcpp is that it is then not possible to use multiple processors.

As an example we will test whether some data comes from a binomial distribution with n=10 trials. The success parameter p will be estimated from the data:

```r
vals = 0:10
pnull = function(p) pbinom(0:10, 10, p)
rnull = function(p) table(c(0:10, rbinom(10000, 10, p))) - 1
phat = function(x) sum(0:10*x)/100000
x = rnull(0.5)
gof_test(x, vals, pnull, rnull, phat=phat, TS=Rgof::newTSdisc)
#> maxProcessor set to 1 for faster computation
#> $statistics
#> CvM alt     l-P     s-P     l-L     s-L
#> 0.00129 5.84800 3.92900 5.79700 3.87700
#>
#> $p.values
#> CvM alt     l-P     s-P     l-L     s-L
#>  0.8342  0.7550  0.8635  0.7601  0.8680
```

For the power calculations we will consider data that is actually a mixture of two binomials:

```r
ralt = function(tau=0) {
  x = rbinom(5000, 10, 0.5-tau)
  y = rbinom(5000, 10, 0.5+tau)
  table(c(0:10, x, y)) - 1
}
gof_power(pnull, vals, rnull, ralt, TS=Rgof::newTSdisc, phat=phat,
          param_alt=round(seq(0, 0.05, length=3), 3), B=Bsim, maxProcessor=1)
#>       CvM alt
#> 0       0.034
#> 0.025   0.194
#> 0.05    1.000
```

If the new routine can also calculate p values, use the argument With.p.value=TRUE to indicate that. In that case the return object should be a (vector of) p values.
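A minimal sketch of such a routine (newTSpval is a hypothetical name; it assumes a simple null hypothesis, under which pnull(x) is uniform on [0,1], so the built-in ks.test can supply a p value):

```r
# A test routine that returns a p value rather than a test statistic.
newTSpval = function(x, pnull, param) {
  out = stats::ks.test(pnull(x), "punif")$p.value # probability integral transform
  names(out) = "KS uniform"
  out
}
# per the text: gof_test(x, NA, pnull, rnull, TS=newTSpval, With.p.value=TRUE)
```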

Adjusted p values for Several Tests

As no single test can be relied upon to consistently have good power, it is reasonable to employ several of them. We could then reject the null hypothesis if any of the tests does so, that is, if the smallest p value is less than the desired type I error probability \(\alpha\).

This procedure clearly suffers from the problem of simultaneous inference, and the true type I error probability will be much larger than \(\alpha\). It is however possible to adjust the p value so it does achieve the desired \(\alpha\). This can be done as follows:

We generate a number of data sets under the null hypothesis. Generally about 1000 will be sufficient. Then for each simulated data set we apply the tests we wish to include and record the smallest p value. Here is an example. Say the null hypothesis specifies a uniform \([0,1]\) and a sample size of 250.

```r
pnull = function(x) punif(x)
rnull = function() runif(250)
pvals = matrix(0, 1000, 16)
for(i in 1:1000)
  pvals[i, ] = Rgof::gof_test(rnull(), NA, pnull, rnull, B=1000)$p.values
```

Note this is not run because of CRAN time constraints. The resulting matrix is in Rgof::pvaluecdf.

Next we find the smallest p value in each run for two selections of four methods. One is the selection found to be best above, namely the methods by Wilson, Anderson-Darling, Zhang’s ZC and a chi square test with a small number of bins and using Pearson’s formula. As a second selection we use the methods by Kolmogorov-Smirnov, Kuiper, Anderson-Darling and Cramér-von Mises. It can be checked that for this null hypothesis these methods are highly correlated.

```r
colnames(pvals) = names(Rgof::gof_test(rnull(), NA, pnull, rnull, B=10)$p.values)
p1 = apply(pvals[, c("W", "ZC", "AD", "ES-s-P")], 1, min)
p2 = apply(pvals[, c("KS", "K", "AD", "CvM")], 1, min)
```
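Given such a vector of simulated minimum p values, the adjustment itself is just an empirical tail probability (a minimal sketch, not the package's internal code; adjusted_pvalue is a hypothetical name):

```r
# The adjusted p value is the proportion of null-simulated minima that
# are at most the observed minimum p value.
adjusted_pvalue = function(pmin_observed, pmin_null) mean(pmin_null <= pmin_observed)
# e.g. adjusted_pvalue(0.02, p1)
```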

Next we find the empirical distribution function for the two sets of p values and draw their graphs. We also add the curve for the case of four identical tests and the case of four independent tests, which of course is the Bonferroni correction. The data for the cdf is in the inst/extdata directory of the package.

```r
tmp = Rgof::pvaluecdf
Tests = factor(c(rep("Identical Tests", nrow(tmp)),
                 rep("Correlated Selection", nrow(tmp)),
                 rep("Best Selection", nrow(tmp)),
                 rep("Independent Tests", nrow(tmp))),
               levels = c("Identical Tests", "Correlated Selection",
                          "Best Selection", "Independent Tests"),
               ordered = TRUE)
dta = data.frame(x = c(tmp[,1], tmp[,1], tmp[,1], tmp[,1]),
                 y = c(tmp[,1], tmp[,3], tmp[,2], 1-(1-tmp[,1])^4),
                 Tests = Tests)
ggplot2::ggplot(data=dta, ggplot2::aes(x=x, y=y, col=Tests)) +
  ggplot2::geom_line(linewidth=1.2) +
  ggplot2::labs(x="p value", y="CDF") +
  ggplot2::scale_color_manual(values=c("blue", "red", "Orange", "green"))
```

Here is how to find these adjusted p values with Rgof:

```r
x = rnull()
Rgof::gof_test_adjusted_pvalue(x, NA, pnull, rnull, B=c(1000, 500), maxProcessor=1)
#> p values of individual tests:
#> W :  0.304
#> ZC :  0.78
#> AD :  0.526
#> ES-s-P :  0.813
#> adjusted p value of combined tests: 0.6988
```

Weighted Data

Sometimes the data/model uses importance sampling weights. This can be done as follows. Say we want to test whether the data comes from a standard normal distribution, truncated to [-3,3] and with weights from a t distribution with 3 degrees of freedom:

\(H_0: F=N(0,1)\), \(X\sim t(3)\)

```r
df = 3
pnull = function(x) pnorm(x)/(2*pnorm(3)-1)
rnull = function() {x = rt(2000, df); x = x[abs(x)<3]; sort(x[1:1000])}
w = function(x) (dnorm(x)/(2*pnorm(3)-1))/(dt(x, df)/(2*pt(3, df)-1))
x = sort(rnull())
plot(x, w(x), type="l", ylim=c(0, 2*max(w(x))))
```

```r
ralt = function(m=0) {x = rt(2000, df) + m; x = x[abs(x)<3]; sort(x[1:1000])}
set.seed(111)
Rgof::gof_power(pnull, NA, rnull, ralt, w=w, param_alt=c(0, 0.2),
                Range=c(-3, 3), B=Bsim, maxProcessor=1)
#>       KS    K   CvM   AD
#> 0   0.06 0.06 0.056 0.05
#> 0.2 1.00 1.00 1.000 1.00
```

It should be noted that these tests are quite sensitive to the size of the weights and to the sample size, so one should always do a simulation study to verify that they work in the case under consideration.

Case Studies

The package includes the routine run.studies, which provides 20 case studies each for the continuous and the discrete case. This allows the user to easily compare the power of the methods included in the package to a different one of their choice.

If the user-supplied method only yields the test statistic, simulation is used to find the p value. If the new test can find p values, these are used and then the routine will of course run much faster.

Say a user wishes to study the performance of the chi square test for the case where the null hypothesis specifies a uniform distribution but the data actually comes from a linear distribution with slope s. (This test is of course already included in the package, so this is for illustration purposes only.) So

```r
chitest = function(x, pnull, param, TSextra) {
  nbins = TSextra$nbins # number of bins
  bins = seq(min(x)-1e-10, max(x)+1e-10, length=nbins+1)
  O = hist(x, bins, plot=FALSE)$counts # bin counts
  if(param[1] != -99) { # with parameter estimation
    E = length(x)*diff(pnull(bins, param)) # expected counts
    chi = sum((O-E)^2/E) # Pearson's chi square
    pval = 1 - pchisq(chi, nbins-1-length(param)) # p value
  } else {
    E = length(x)*diff(pnull(bins))
    chi = sum((O-E)^2/E)
    pval = 1 - pchisq(chi, nbins-1)
  }
  out = ifelse(TSextra$statistic, chi, pval)
  names(out) = "ChiSquare"
  out
}
```

In this example we find the power of chitest when the null hypothesis specifies a uniform distribution but the true distribution is linear with slope s:

```r
TSextra = list(nbins=5, statistic=FALSE)
pwr = Rgof::run.studies(chitest, "uniform.linear",
                        TSextra=TSextra, With.p.value=TRUE)
#> Running case uniform.linear.cont ...
Rgof::plot_power(pwr, "Slope")
```

Arguments of run.studies:

The arguments of the user-supplied test routine have to be

Continuous Data:

Discrete Data:

In either case the output of the routine has to be a named vector with either the test statistic(s) or the p value(s).

The case studies are as follows. In each, the first term specifies the model under the null hypothesis and the second the model under the alternative hypothesis. The case studies with names ending in .est include parameter estimation.

Without parameter estimation

  1. uniform.linear: U[0,1] vs a linear model on [0,1] with slope s.

  2. uniform.quadratic: U[0,1] vs a quadratic model with vertex at 0.5 and some curvature a.

  3. uniform.bump: U[0,1] vs U[0,1]+N(0.5, 0.05).

  4. uniform.sine: U[0,1] vs U[0,1]+sine wave.

  5. beta22.betaaa: Beta(2,2) vs Beta(a,a).

  6. beta22.beta2a: Beta(2,2) vs Beta(2,a).

  7. normal.shift: N(0,1) vs N(\(\mu\),1).

  8. normal.stretch: N(0,1) vs N(0,\(\sigma\)).

  9. normal.t: N(0,1) vs t(df).

  10. normal.outlier1: N(0,1) vs N(0,1)+U[2,3].

  11. normal.outlier2: N(0,1) vs N(0,1)+U[-3,-2]+U[2,3].

  12. exponential.gamma: Exp(1) vs Gamma(1,b).

  13. exponential.weibull: Exp(1) vs Weibull(1,b).

  14. exponential.bump: Exp(1) vs Exp(1)+N(0.5, 0.05).

  15. trunc.exponential.linear: Exp(1) vs linear, on [0,1].

With parameter estimation

  1. normal.t.est: N(\(\mu\),\(\sigma\)) vs t(df).

  2. exponential.weibull.est: Exp(\(\lambda\)) vs Weibull(1,b).

  3. trunc.exponential.linear.est: Exp(\(\lambda\)) vs linear, on [0,1].

  4. exponential.gamma.est: Exp(\(\lambda\)) vs Gamma(1,b).

  5. normal.cauchy.est: N(\(\mu\),\(\sigma\)) vs Cauchy (Breit-Wigner).

Generally a user will likely wish to run all the included case studies. This is very easily done with

```r
Rgof::run.studies(chitest, TSextra=TSextra, With.p.value=TRUE)
```

If only the test statistic is supplied by the new test and therefore simulation has to be used to find p values, this will take several hours to run. run.studies can also be used to run the studies for the included methods but for different values of param_alt, nsample and/or alpha. For example, say we wish to find the powers of the included continuous methods for the case uniform.linear and samples of size 2000, slopes of 0.1 and 0.2, and a true type I error of 0.01:

```r
Rgof::run.studies(TRUE, # continuous data/model
                  study="uniform.linear",
                  param_alt=c(0.1, 0.2),
                  nsample=2000,
                  alpha=0.01)
```

References

Anderson, T W. 1962. “On the Distribution of the Two-Sample Cramer-von Mises Criterion.” Annals of Mathematical Statistics 33 (3): 1148–59.
Anderson, T W, and D A Darling. 1952. “Asymptotic Theory of Certain Goodness-of-Fit Criteria Based on Stochastic Processes.” Annals of Mathematical Statistics 23: 193–212.
———. 1954. “A Test of Goodness-of-Fit.” JASA 49: 765–69.
Berkson, J. 1980. “Minimum Chi-Square, Not Maximum Likelihood.” Annals of Statistics 8 (3): 457–87.
del Barrio, E, J A Cuesta-Albertos, and C Matrán. 1999. “Tests of Goodness of Fit Based on the L2-Wasserstein Distance.” Annals of Statistics 27: 1230–39.
Kolmogorov, A. 1933. “Sulla Determinazione Empirica Di Una Legge Di Distribuzione.” G. Ist. Ital. Attuari. 4: 83–91.
Kuiper, N H. 1960. “Tests Concerning Random Points on a Circle.” Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen 63: 38–47.
Massey, F J. 1951. “The Kolmogorov-Smirnov Test for Goodness-of-Fit.” JASA 46: 68–78.
Rolke, Wolfgang, and Cristian Gutierrez-Gongora. 2020. “A Chi-Square Goodness-of-Fit Test for Continuous Distributions Against a Known Alternative.” Computational Statistics. https://doi.org/10.1007/s00180-020-00997-x.
Smirnov, N. 1948. “Table for Estimating the Goodness of Fit of Empirical Distributions.” Annals of Mathematical Statistics 19: 279–81.
Zhang, J. 2002. “Powerful Goodness-of-Fit Tests Based on Likelihood Ratio.” Journal of the RSS (Series B) 64: 281–94.
