library(PowerTOST) # attach the library

Defaults

Parameter	Argument	Purpose	Default
$\small{\alpha}$	`alpha`	Nominal level of the test	`0.05`
$\small{\pi}$	`targetpower`	Minimum desired power	`0.80`
logscale	`logscale`	Analysis on log-transformed or original scale?	`TRUE`
$\small{\theta_0}$	`theta0`	‘True’ or assumed deviation of T from R	see below
$\small{\theta_1}$	`theta1`	Lower BE limit	see below
$\small{\theta_2}$	`theta2`	Upper BE limit	see below
CV	`CV`	CV	none
design	`design`	Planned design	`"2x2"`
method	`method`	Algorithm	`"exact"`
robust	`robust`	‘Robust’ evaluation (Senn’s basic estimator)	`FALSE`
print	`print`	Show information in the console?	`TRUE`
details	`details`	Show details of the sample size search?	`FALSE`
imax	`imax`	Maximum number of iterations	`100`

Defaults depending on the argument `logscale`:

Parameter	Argument	`logscale = TRUE`	`logscale = FALSE`
$\small{\theta_0}$	`theta0`	`0.95`	`+0.05`
$\small{\theta_1}$	`theta1`	`0.80`	`−0.20`
$\small{\theta_2}$	`theta2`	`1.25`	`+0.20`

Arguments `targetpower`, `theta0`, `theta1`, `theta2`, and `CV` have to be given as fractions, not in percent.
The CV is generally the within- (intra-) subject coefficient of variation. Only for `design = "parallel"` it is the total (a.k.a. pooled) CV.¹

The terminology of the argument `design` follows this pattern: treatments x sequences x periods. The conventional TR|RT (a.k.a. AB|BA) design can be abbreviated as `"2x2"`. Some call the `"parallel"` design a ‘one-sequence’ design. The design `"paired"` has two periods but no sequences and is the standard design for studying linear pharmacokinetics (where a single dose is followed by multiple doses). A profile in steady state (T) is compared to the one after the single dose (R). Note that the underlying model assumes no period effects.

Implemented exact algorithms are `"exact"` / `"owenq"` (Owen’s Q function, default)² and `"mvt"` (direct integration of the bivariate non-central t-distribution). Approximations are `"noncentral"` / `"nct"` (non-central t-distribution) and `"shifted"` / `"central"` (‘shifted’ central t-distribution).

`robust = TRUE` forces the degrees of freedom to `n-seq` and is used only in higher-order crossover designs. It could be used if the evaluation was done with a mixed-effects model.

With `sampleN.TOST(..., print = FALSE)` results are provided as a data frame³ with nine columns `Design`, `alpha`, `CV`, `theta0`, `theta1`, `theta2`, `Sample size`, `Achieved power`, and `Target power`.
To access, e.g., the sample size use either `sampleN.TOST(...)[7]` or `sampleN.TOST(...)[["Sample size"]]`. We suggest using the latter in scripts for clarity.⁴

The estimated sample size always gives the total number of subjects (not subjects/sequence in crossovers or subjects/group in a parallel design, as in some other software packages).

Sample size

Designs with one (parallel) to four periods (replicates) are supported.

#     design                        name   df
# "parallel"           2 parallel groups  n-2
#      "2x2"               2x2 crossover  n-2
#    "2x2x2"             2x2x2 crossover  n-2
#    "2x2x3"   2x2x3 replicate crossover 2n-3
#    "2x2x4"   2x2x4 replicate crossover 3n-4
#    "2x4x4"   2x4x4 replicate crossover 3n-4
#    "2x3x3"   partial replicate (2x3x3) 2n-3
#    "2x4x2"            Balaam’s (2x4x2)  n-2
#   "2x2x2r" Liu’s 2x2x2 repeated x-over 3n-2
#   "paired"                paired means  n-1

Example 1

Estimate the sample size for an assumed intra-subject CV of 0.30.

sampleN.TOST(CV = 0.30)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# True ratio = 0.95,  CV = 0.3
# 
# Sample size (total)
#  n     power
# 40   0.815845

To get only the sample size:

sampleN.TOST(CV = 0.30, details = FALSE, print = FALSE)[["Sample size"]]
# [1] 40

Note that the sample size is always rounded up to give balanced sequences (here a multiple of two). Since power is higher than our target, likely this was the case here. Let us check that.
Which power will we get with a sample size of 39?

power.TOST(CV = 0.30, n = 39)
# Unbalanced design. n(i)=20/19 assumed.
# [1] 0.8056171

Confirmed: with 39 subjects we already reach the target power. That also means that one dropout will not compromise power. We could explore that further in a Power Analysis.

Note that `sampleN.TOST()` is not vectorized. If we are interested in combinations of assumed values:

sampleN.TOST.vectorized <- function(CVs, theta0s, ...) {
  n <- power <- matrix(ncol = length(CVs), nrow = length(theta0s))
  for (i in seq_along(theta0s)) {
    for (j in seq_along(CVs)) {
      tmp         <- sampleN.TOST(CV = CVs[j], theta0 = theta0s[i], ...)
      n[i, j]     <- tmp[["Sample size"]]
      power[i, j] <- tmp[["Achieved power"]]
    }
  }
  # number of decimal places needed to print a value exactly
  DecPlaces <- function(x) match(TRUE, round(x, 1:15) == x)
  fmt.col <- paste0("CV %.", max(sapply(CVs, FUN = DecPlaces), na.rm = TRUE), "f")
  fmt.row <- paste0("theta %.", max(sapply(theta0s, FUN = DecPlaces), na.rm = TRUE), "f")
  colnames(power) <- colnames(n) <- sprintf(fmt.col, CVs)
  rownames(power) <- rownames(n) <- sprintf(fmt.row, theta0s)
  res <- list(n = n, power = power)
  return(res)
}
CVs     <- seq(0.20, 0.40, 0.05)
theta0s <- seq(0.90, 0.95, 0.01)
x       <- sampleN.TOST.vectorized(CV = CVs, theta0 = theta0s,
                                   details = FALSE, print = FALSE)
cat("Sample size\n"); print(x$n)
cat("Achieved power\n"); print(signif(x$power, digits = 5))
# Sample size
#            CV 0.20 CV 0.25 CV 0.30 CV 0.35 CV 0.40
# theta 0.90      38      56      80     106     134
# theta 0.91      32      48      66      88     112
# theta 0.92      28      40      56      76      96
# theta 0.93      24      36      50      66      84
# theta 0.94      22      32      44      58      74
# theta 0.95      20      28      40      52      66
# Achieved power
#            CV 0.20 CV 0.25 CV 0.30 CV 0.35 CV 0.40
# theta 0.90 0.81549 0.80358 0.80801 0.80541 0.80088
# theta 0.91 0.81537 0.81070 0.80217 0.80212 0.80016
# theta 0.92 0.82274 0.80173 0.80021 0.80678 0.80238
# theta 0.93 0.81729 0.81486 0.81102 0.80807 0.80655
# theta 0.94 0.83063 0.81796 0.81096 0.80781 0.80740
# theta 0.95 0.83468 0.80744 0.81585 0.80747 0.80525

Perhaps the capacity of the clinical site is limited. Any study can also be performed in a replicate design and assessed for ABE. As a rule of thumb the total sample size in a 3-period replicate is ~¾ of that of the 2×2×2 crossover and the one of a 2-sequence 4-period replicate is ~½ of the 2×2×2. The number of treatments and hence, of biosamples – which mainly drives the study’s cost – will be roughly the same.
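The rule of thumb can be checked with simple arithmetic. The following base-R sketch starts from the 40 subjects of Example 1 and is only an illustration: the column names and the rounding to an even number (two sequences) are assumptions made here, and it does not replace a proper sample size estimation.

```r
# Rule-of-thumb sample sizes relative to a 2x2x2 crossover (illustration only)
n.2x2x2 <- 40                                  # Example 1: CV 0.30, theta0 0.95
plan <- data.frame(design   = c("2x2x2", "2x2x3", "2x2x4"),
                   periods  = c(2, 3, 4),
                   fraction = c(1, 3/4, 1/2),  # rule-of-thumb fractions
                   stringsAsFactors = FALSE)
# round up to the next even number (two sequences in all three designs)
plan$n          <- 2 * ceiling(n.2x2x2 * plan$fraction / 2)
# number of administered treatments = subjects x periods
plan$treatments <- plan$n * plan$periods
print(plan, row.names = FALSE)
```

With 80, 90, and 80 administrations the three designs are indeed in the same range, which is why the cost argument holds only approximately.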

designs <- c("2x2x2", "2x2x3", "2x3x3", "2x2x4")
# data.frame of results
res <- data.frame(design = designs, n = NA_integer_, power = NA_real_,
                  n.do = NA_integer_, power.do = NA_real_,
                  stringsAsFactors = FALSE) # this line for R <4.0.0
for (i in 1:4) {
  # print = FALSE suppresses output to the console
  # we are only interested in columns 7-8
  # let's also calculate power for one dropout
  res[i, 2:3] <- sampleN.TOST(CV = 0.30, design = res$design[i],
                              print = FALSE)[7:8]
  res[i, 4]   <- res[i, 2] - 1
  res[i, 5]   <- suppressMessages(
                   power.TOST(CV = 0.30, design = res$design[i],
                              n = res[i, 4]))
}
print(res, row.names = FALSE)
#  design  n     power n.do  power.do
#   2x2x2 40 0.8158453   39 0.8056171
#   2x2x3 30 0.8204004   29 0.8068731
#   2x3x3 30 0.8204004   29 0.8063834
#   2x2x4 20 0.8202398   19 0.7991508

As expected – and as a bonus – we obtain a small gain in power, though in the 4-period design with one dropout power will be slightly compromised.
But why is power in the replicate designs higher than in the 2×2×2 crossover? If residual variances are equal, the width of the confidence interval depends only on the t-value and in particular on the degrees of freedom – which themselves depend on the design and the sample size.

#  design                      name  n formula df t.value
#   2x2x2           2x2x2 crossover 40     n-2 38   1.686
#   2x2x3 2x2x3 replicate crossover 30   2*n-3 57   1.672
#   2x3x3 partial replicate (2x3x3) 30   2*n-3 57   1.672
#   2x2x4 2x2x4 replicate crossover 20   3*n-4 56   1.673
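The table above can be reproduced in base R from the degrees-of-freedom formulas listed for the supported designs. This is a minimal sketch; the lookup list `df.fun` and the data frame `tab` are names introduced here for illustration only.

```r
alpha <- 0.05
tab <- data.frame(design = c("2x2x2", "2x2x3", "2x3x3", "2x2x4"),
                  n      = c(40, 30, 30, 20),
                  stringsAsFactors = FALSE)
# degrees of freedom as given in the table of supported designs
df.fun <- list("2x2x2" = function(n) n - 2,
               "2x2x3" = function(n) 2 * n - 3,
               "2x3x3" = function(n) 2 * n - 3,
               "2x2x4" = function(n) 3 * n - 4)
tab$df      <- mapply(function(d, n) df.fun[[d]](n), tab$design, tab$n,
                      USE.NAMES = FALSE)
# one-sided critical t-value used for the 100(1 - 2 alpha)% confidence interval
tab$t.value <- round(qt(1 - alpha, df = tab$df), 3)
print(tab, row.names = FALSE)
```

The replicate designs have more degrees of freedom despite fewer subjects, hence the smaller t-values.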

If the capacity is 24 beds, we would opt for a 4-period full replicate.

As another option (e.g., if the blood volume is limited and/or there are concerns about a higher dropout-rate in a multiple-period study) we could stay with the 2×2×2 crossover but split the sample size into groups. In Europe (and for the FDA if certain conditions⁵ are fulfilled), there are no problems with pooling the data and using the conventional model.

sequence + subject(sequence) + period + treatment

However, some regulators prefer to incorporate group-terms in the model.

group + sequence + subject(group × sequence) + period(group) + group × sequence + treatment

Since we have more terms in the model, we will lose some degrees of freedom. Let us explore in simulations how that would impact power. By default the function `power.TOST.sds()` performs 100,000 simulations.

grouping <- function(capacity, n) {
  # split sample size into >=2 groups based on capacity
  if (n <= capacity) { # make equal groups
    ngrp <- rep(ceiling(n / 2), 2)
  } else {             # at least one = capacity
    ngrp    <- rep(0, ceiling(n / capacity))
    grps    <- length(ngrp)
    ngrp[1] <- capacity
    for (j in 2:grps) {
      n.tot <- sum(ngrp) # what we have so far
      if (n.tot + capacity <= n) {
        ngrp[j] <- capacity
      } else {
        ngrp[j] <- n - n.tot
      }
    }
  }
  return(ngrp = list(grps = length(ngrp), ngrp = ngrp))
}
CV        <- 0.30
capacity  <- 24 # clinical capacity
res       <- data.frame(n = NA_integer_, grps = NA_integer_,
                        n.grp = NA_integer_, m.1 = NA_real_, m.2 = NA_real_)
x         <- sampleN.TOST(CV = CV, print = FALSE, details = FALSE)
res$n     <- x[["Sample size"]]
res$m.1   <- x[["Achieved power"]]
x         <- grouping(capacity = capacity, n = res$n)
res$grps  <- x[["grps"]]
ngrp      <- x[["ngrp"]]
res$n.grp <- paste(ngrp, collapse = "|")
res$m.2   <- power.TOST.sds(CV = CV, n = res$n, grps = res$grps,
                            ngrp = ngrp, gmodel = 2, progress = FALSE)
res$loss  <- 100 * (res$m.2 - res$m.1) / res$m.1
names(res)[2:6] <- c("groups", "n/group", "pooled model",
                     "group model", "loss (%)")
res[1, 4:6]     <- sprintf("%6.4f", res[1, 4:6])
cat("Achieved power and relative loss\n"); print(res, row.names = FALSE)
# Achieved power and relative loss
#   n groups n/group pooled model group model loss (%)
#  40      2   24|16       0.8158      0.8120  -0.4664

With ~0.5% the relative loss in power is practically negligible.
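A rough plausibility check of why the loss is so small: counting the terms of the group model suggests that period nested in group consumes one degree of freedom per group instead of one in total, so the residual degrees of freedom would drop from n − 2 to n − grps − 1. This count is an assumption made here for illustration and is not taken from the package documentation. A base-R sketch:

```r
n     <- 40
grps  <- 2
alpha <- 0.05
df.pooled <- n - 2          # conventional (pooled) model
df.group  <- n - grps - 1   # assumption: period(group) uses 'grps' df instead of 1
t.pooled  <- qt(1 - alpha, df.pooled)
t.group   <- qt(1 - alpha, df.group)
# relative widening of the confidence interval caused by the larger t-value
widening  <- 100 * (t.group / t.pooled - 1)
c(df.pooled = df.pooled, df.group = df.group, widening.pct = widening)
```

Under this assumption the confidence interval widens by well below 1%, which agrees with the negligible loss in power seen in the simulations.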

Example 2

Estimate the sample size for equivalence of the ratio of two means with normality on the original scale based on Fieller’s (‘fiducial’) confidence interval.⁶ Crossover design, within-subject CV_w 0.20, between-subject CV_b 0.40.

sampleN.RatioF(CV = 0.20, CVb = 0.40)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#     based on Fieller's confidence interval
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# Ratio of means with normality on original scale
# alpha = 0.025, target power = 0.8
# BE margins = 0.8 ... 1.25
# True ratio = 0.95,  CVw = 0.2,  CVb = 0.4
# 
# Sample size
#  n     power
# 28   0.807774

In this function the default $\small{\alpha}$ is 0.025, since it is intended for studies with clinical endpoints, where the 95% confidence interval is usually used for equivalence testing.⁷

Example 3

Estimate the sample size based on the results of a 2×2×2 pilot study in 16 subjects where we observed an intra-subject CV of 0.20 and a $\small{\theta_0}$ of 0.92.

Basic

If we believe [sic] that in the pivotal study both the $\small{\theta_0}$ and CV will be exactly like in the pilot, this is a straightforward exercise. We simply provide the required arguments.

sampleN.TOST(CV = 0.20, theta0 = 0.92)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# True ratio = 0.92,  CV = 0.2
# 
# Sample size (total)
#  n     power
# 28   0.822742

This approach is called by some ‘carved in stone’ because it relies on – very strong – assumptions which likely are not justified. Power curves are relatively flat close to unity (i.e., the impact on power is small when moving from say, $\small{\theta_0}$ 1 to 0.95), but they get increasingly steep when moving further away from unity.
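The shape of the power curve can be sketched in base R with the approximation by the non-central t-distribution (listed above as method `"noncentral"`). The helper `power.nct()` is a hypothetical name introduced here; it assumes a balanced 2×2×2 crossover (design constant 2, n − 2 degrees of freedom) and is not the exact method based on Owen’s Q, so small differences from `power.TOST()` are to be expected.

```r
# Approximate power of TOST in a balanced 2x2x2 crossover (illustrative sketch)
power.nct <- function(CV, theta0, n, alpha = 0.05,
                      theta1 = 0.80, theta2 = 1.25) {
  se    <- sqrt(log(CV^2 + 1))      # within-subject SD on the log scale
  sem   <- se * sqrt(2 / n)         # standard error of the difference
  df    <- n - 2
  tcrit <- qt(1 - alpha, df)
  ncp1  <- (log(theta0) - log(theta1)) / sem  # non-centrality, lower test
  ncp2  <- (log(theta0) - log(theta2)) / sem  # non-centrality, upper test
  pow   <- pt(-tcrit, df, ncp = ncp2) - pt(tcrit, df, ncp = ncp1)
  max(pow, 0)
}
theta0s <- c(1, 0.95, 0.90)
pwr     <- sapply(theta0s, function(th) power.nct(CV = 0.20, theta0 = th, n = 28))
data.frame(theta0 = theta0s, power = signif(pwr, 4))
```

Moving from 1 to 0.95 costs only a few percentage points of power, whereas the next step to 0.90 costs several times as much.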

Fig. 1 Power curve ($\small{\alpha}$ 0.05, 2×2×2 design, CV 0.20; blue n = 28, black n = 22, red n = 16)

Both $\small{\theta_0}$ and CV (as every estimate) are uncertain to some extent, which depends on the degrees of freedom (sample size and design). Hence, it might not be a good idea to perform very small pilot studies (e.g., in only six subjects). Although it might be possible that in the pivotal study the CV is indeed lower than the one we observed in the pilot, it would be even more risky than the ‘carved in stone’ approach to assume a lower one in planning the pivotal study.

With the function `CVCL()` we can calculate confidence limits of the CV. It is advisable to use the upper confidence limit as a conservative approach. As a side effect – if the CV will be lower than assumed – we get a ‘safety margin’ for the T/R ratio.

df <- 16 - 2 # degrees of freedom of the 2x2x2 crossover pilot
CVCL(CV = 0.20, df = df, side = "upper", alpha = 0.20)[["upper CL"]]
# [1] 0.2443631
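The upper limit can be reproduced by hand, assuming it is based on the χ²-distribution of the variance on the log scale. The following base-R sketch uses that assumption; the variable names are illustrative.

```r
CV    <- 0.20
df    <- 16 - 2
alpha <- 0.20
s2       <- log(CV^2 + 1)               # within-subject variance (log scale)
s2.upper <- df * s2 / qchisq(alpha, df) # one-sided upper limit of the variance
CV.upper <- sqrt(exp(s2.upper) - 1)     # back-transformed to a CV
CV.upper
```

The result agrees with the value returned by `CVCL()` above to the shown precision.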

I prefer $\small{\alpha=0.20}$ in analogy to the producer’s risk $\small{\beta=0.20}$ when planning for power $\small{\pi=1-\beta=0.80}$. Gould proposed the more liberal $\small{\alpha=0.25}$.⁸ Let us repeat the sample size estimation based on the upper CL of the CV.

CL.upper <- CVCL(CV = 0.20, df = 16 - 2, side = "upper",
                 alpha = 0.20)[["upper CL"]]
res      <- sampleN.TOST(CV = CL.upper, theta0 = 0.92, print = FALSE)
print(res[7:8], row.names = FALSE)
#  Sample size Achieved power
#           40       0.816919

Of course, this has a massive impact on the sample size, which increases from 28 to 40. It might be difficult to convince the management to invest ~40% more than with the ‘carved in stone’ approach.

However, we can also explore how power would be affected if our assumption is true and the study will nevertheless be performed with only 28 subjects.

CL.upper <- CVCL(CV = 0.20, df = 16 - 2, side = "upper",
                 alpha = 0.20)[["upper CL"]]
power.TOST(CV = CL.upper, theta0 = 0.92, n = 28)
# [1] 0.679253

There will be a drop in power from the ~0.82 the management expects to only ~0.68. That’s just slightly higher than betting on two dozens in Roulette…

Fig. 2 Power for CV 0.244

As mentioned above, if the CV turns out to be lower than assumed, we gain a ‘safety margin’ for the T/R ratio. Let us explore that. We perform the study with 40 subjects, the CV will be 0.22 (less than the ~0.24 we assumed), and the T/R ratio with 0.90 will be worse than the 0.92 we assumed.

power.TOST(CV = 0.22, theta0 = 0.90, n = 40)
# [1] 0.7686761

Below our target but still acceptable.

Advanced

In the basic approach we concentrated mainly on the uncertainty of the CV. But this is not the end of the story. Clearly $\small{\theta_0}$ is uncertain as well. With the function `expsampleN.TOST()` we can dive deeper into this matter. Let us start with the CV only.

expsampleN.TOST(CV = 0.20, theta0 = 0.92, prior.type = "CV",
                prior.parm = list(m = 16, design = "2x2x2"))
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#        Sample size est. with uncertain CV
# -------------------------------------------------
# Study design:  2x2 crossover
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# Ratio = 0.92
# CV = 0.2 with 14 df
# 
# Sample size (ntotal)
#  n   exp. power
# 30   0.806069

Not that bad. The sample size increases only modestly from the 28 of the ‘carved in stone’ approach to 30 and is substantially lower than the 40 we estimated based on the upper confidence limit of the CV.

Let us keep the CV ‘fixed’ and take only the uncertainty of $\small{\theta_0}$ into account.

expsampleN.TOST(CV = 0.20, theta0 = 0.92, prior.type = "theta0",
                prior.parm = list(m = 16, design = "2x2x2"))
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#      Sample size est. with uncertain theta0
# -------------------------------------------------
# Study design:  2x2 crossover
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# Ratio = 0.92
# CV = 0.2
# 
# Sample size (ntotal)
#  n   exp. power
# 46   0.805236

It starts to hurt. We saw already that power curves get steep if the T/R ratio is not close to unity. Our $\small{\theta_0}$ of 0.92 was not very nice, but in the pivotal study it might be even lower – which has a larger impact on power than the CV.

Now for the ‘worst case scenario’, where we take both uncertainties into account.

expsampleN.TOST(CV = 0.20, theta0 = 0.92, prior.type = "both",
                prior.parm = list(m = 16, design = "2x2x2"),
                details = FALSE)
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#   Sample size est. with uncertain CV and theta0
# -------------------------------------------------
# Study design:  2x2 crossover
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25
# Ratio = 0.92 with 14 df
# CV = 0.2 with 14 df
# 
# Sample size (ntotal)
#  n   exp. power
# 54   0.802440

This sample size is almost twice the 28 your boss got from a popular Excel-Sheet.⁹ If you are not fired right away when suggesting such a study, take it as a warning of what might happen.
At least, if the pivotal study is performed with a smaller sample size and fails, you know why.

Fig. 3 Expected power for uncertain estimates (from top: CV, $\theta_0$, both)

If you are adventurous, consider an Adaptive Two-Stage Sequential Design with sample size re-estimation. Various methods are provided in the package Power2Stage.¹⁰

Example 4

An alternative to an assumed $\small{\theta_0}$ is ‘Statistical Assurance’.¹¹ This concept uses the distribution of T/R-ratios and assumes an uncertainty parameter $\small{\sigma_\textrm{u}}$. A natural assumption is $\small{\sigma_\textrm{u}=1-\theta_0}$, i.e., for the commonly applied $\small{\theta_0=0.95}$ one can use the argument `sem = 0.05` of the function `expsampleN.TOST()` where the argument `theta0` must be kept at 1. The following example reproduces Table 1 of the paper.

CV  <- 0.214
res <- data.frame(target = c(rep(0.8, 5), rep(0.9, 5)),
                  theta0 = rep(c(1, seq(0.95, 0.92, -0.01)), 2),
                  n.1 = NA_integer_, power = NA_real_,
                  sigma.u = rep(c(0.0005, seq(0.05, 0.08, 0.01)), 2),
                  n.2 = NA_integer_, assurance = NA_real_)
for (i in 1:nrow(res)) {
  res[i, 3:4] <- sampleN.TOST(CV = CV, targetpower = res$target[i],
                              theta0 = res$theta0[i], print = FALSE)[7:8]
  res[i, 6:7] <- expsampleN.TOST(CV = CV, targetpower = res$target[i],
                                 theta0 = 1, # mandatory!
                                 prior.type = "theta0",
                                 prior.parm = list(sem = res$sigma.u[i]),
                                 print = FALSE)[9:10]
}
res                 <- signif(res, 3)
res[, 5]            <- sprintf("%.2f", res[, 5])
names(res)[c(3, 6)] <- "n"
print(res, row.names = FALSE)
#  target theta0  n power sigma.u  n assurance
#     0.8   1.00 18 0.833    0.00 18     0.833
#     0.8   0.95 22 0.824    0.05 22     0.833
#     0.8   0.94 24 0.817    0.06 22     0.800
#     0.8   0.93 26 0.801    0.07 26     0.819
#     0.8   0.92 30 0.802    0.08 28     0.803
#     0.9   1.00 22 0.916    0.00 22     0.916
#     0.9   0.95 28 0.904    0.05 28     0.904
#     0.9   0.94 32 0.909    0.06 32     0.903
#     0.9   0.93 36 0.905    0.07 38     0.902
#     0.9   0.92 42 0.908    0.08 48     0.902

One caveat: Assuming no uncertainty ($\small{\sigma_\textrm{u}=0}$) would fail because the level of technical success is zero. Here a small value of 0.0005 was used instead.

Example 5

Estimate the sample size for a study of two blood pressure lowering drugs in a 2×2×2 design assessing the difference in means of untransformed data (raw, linear scale; i.e., specifying `logscale = FALSE`). In this setup everything has to be given in the same units (i.e., here $\small{\theta_0}$ –5 mm Hg, $\small{\theta_1}$ –15 mm Hg, $\small{\theta_2}$ +15 mm Hg systolic BP).

Assuming a residual standard deviation of 20 mm Hg.

planned  <- "2x2x2"
logscale <- FALSE
theta0   <- -5
theta1   <- -15
theta2   <- +15 # if not given, -theta1 is used
SD.resid <- 20  # residual standard deviation
sampleN.TOST(CV = SD.resid, theta0 = theta0, theta1 = theta1,
             theta2 = theta2, logscale = logscale, design = planned)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# untransformed data (additive model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = -15 ... 15
# True diff. = -5,  CV = 20
# 
# Sample size (total)
#  n     power
# 52   0.807468

Assuming a standard deviation of the difference T – R of 28 mm Hg.

known    <- known.designs()[, c(2, 6)]           # extract relevant information
bk       <- known[known$design == planned, "bk"] # retrieve design constant
txt      <- paste0("The design constant for design \"",
                   planned, "\" is ", bk)
SD.delta <- 28                                   # standard deviation of the difference
SD.resid <- SD.delta / sqrt(bk)                  # convert to residual SD
cat(txt)
sampleN.TOST(CV = SD.resid, theta0 = theta0, theta1 = theta1,
             theta2 = theta2, logscale = logscale, design = planned)
# The design constant for design "2x2x2" is 2
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# untransformed data (additive model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = -15 ... 15
# True diff. = -5,  CV = 19.79899
# 
# Sample size (total)
#  n     power
# 50   0.800423

Note that other software packages (e.g., PASS, nQuery, StudySize, …) require the standard deviation of the difference as input.

Higher-Order Designs

Designs with three and four treatments/periods are supported.

#  design            name   df
#   "3x3"   3x3 crossover 2n-4
# "3x6x3" 3x6x3 crossover 2n-4
#   "4x4"   4x4 crossover 3n-6

Which `design` argument you should use in studies with more than two periods depends on the planned evaluation.

Suppose we have a bioequivalence study with three treatments – A, B, and C – and the objective of the study is to make pairwise comparisons among the treatments. Suppose further that treatment C is different in kind from A and B, so that the assumption of homogeneous variance among the three treatments is questionable. One way to do the analyses, under normality assumptions, is Two at a Time – e.g., to test hypotheses about A and B, use only the data from A and B. Another way is All at Once – include the data from all three treatments in a single analysis, making pairwise comparisons within this analysis. If the assumption of homogeneous variance is correct, the All at Once approach will provide more d.f. for estimating the common variance, resulting in increased power. If the variance of C differs from that of A and B, the All at Once approach may have reduced power or an inflated type I error rate, depending on the direction of the difference in variances.
— Donald J. Schuirmann (2004)¹²

All at Once
In this approach you assume homogeneity in the ANOVA of pooled data and get one residual variance.
To plan for this approach specify one of the `design` arguments given above.
Two at a Time
In this approach – which is preferred by some agencies, e.g., the EMA¹³ ¹⁴ – you exclude one treatment and perform the analysis on the remaining two. That means you obtain two separate Incomplete Block Designs (IBD).
One example is comparing a test (T) to references from two different regions (R1, R2). You get not only two point estimates (like in the ANOVA) but also two within-subject variances (in the comparisons T vs R1 and T vs R2).
Another example is a pilot study with two candidate treatments (C1, C2) and one reference (R). You would select the candidate with $\small{\min \left\{\left|\log_{e}\theta_\textrm{C1}\right|,\left|\log_{e}\theta_\textrm{C2}\right| \right\}}$ for the pivotal study. If $\small{\left|\log_{e}\theta_\textrm{C1}\right|\sim \left|\log_{e}\theta_\textrm{C2}\right|}$, select the candidate with the lower within-subject variance.
To plan for this approach specify `design = "2x2x2"`.

Apart from the regulatory recommendation we suggest the ‘Two at a Time’ approach, since the ‘All at Once’ approach may lead to biased estimates and an inflated type I error.¹⁵

Three treatments intended for evaluation ‘All at Once’ or ‘Two at a Time’.

CV  <- 0.20
res <- data.frame(design = c("3x6x3", "2x2x2"), n = NA_integer_,
                  power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:2) {
  res[i, 2:3] <- sampleN.TOST(CV = CV, design = res$design[i],
                              print = FALSE)[7:8]
}
print(res, row.names = FALSE)
#  design  n     power
#   3x6x3 18 0.8089486
#   2x2x2 20 0.8346802

Four treatments (e.g., Test fasting and fed, Reference fasting and fed).

CV  <- 0.20
res <- data.frame(design = c("4x4", "2x2x2"), n = NA_integer_,
                  power = NA_real_, stringsAsFactors = FALSE)
for (i in 1:2) {
  res[i, 2:3] <- sampleN.TOST(CV = CV, design = res$design[i],
                              print = FALSE)[7:8]
}
print(res, row.names = FALSE)
#  design  n     power
#     4x4 20 0.8527970
#   2x2x2 20 0.8346802

Power

Let us first recap the hypotheses in bioequivalence.

The ‘Two One-Sided Tests Procedure’ (TOST)¹⁶
\[\small{H_\textrm{0L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\leq\theta_1\:\textrm{vs}\:H_\textrm{1L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}>\theta_1}\]
\[\small{H_\textrm{0U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\geq\theta_2\:\textrm{vs}\:H_\textrm{1U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2}\]
The confidence interval inclusion approach
\[\small{H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\notin\left\{\theta_1,\theta_2\right\}\:\textrm{vs}\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2}\]

Note that the null hypotheses imply bioinequivalence where $\small{\left\{\theta_1,\theta_2\right\}}$ are the lower and upper limits of the bioequivalence range.
TOST provides a pair of $\small{p}$ values (where $\small{H_0}$ is not rejected if $\small{\max(p_\textrm{L},p_\textrm{U})>\alpha}$) and is of historical interest only because the CI inclusion approach is preferred in regulatory guidelines.

From a regulatory perspective the outcome of a comparative BA study is dichotomous. Either the study demonstrated bioequivalence (confidence interval entirely within $\small{\left\{\theta_1,\theta_2\right\}}$) or not.¹⁷ Only if the CI lies entirely outside $\small{\{\theta_1,\theta_2\}}$, the null hypothesis is not rejected and further studies are not warranted.
In any case, calculation of post hoc (a.k.a. a posteriori, retrospective) power is futile.¹⁸

There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.
— Russell V. Lenth (2000)¹⁹

If a study passes – despite lower than desired power – there is no reason to reject the study. It only means that assumptions (‼) in sample size estimation were not realized. The CV might have been higher, and/or the T/R-ratio worse, and/or the dropout-rate higher than anticipated. On the other hand, if post hoc power is higher than desired, it does not further support a study which already demonstrated BE.

Nevertheless, exploring power is useful when trying to understand why a study failed and to plan another study. Let us continue with the example from above. Ignoring our concerns, the management decided to perform the pivotal study with 28 subjects. The T/R-ratio was slightly worse (0.90), the CV higher (0.25), and we had one dropout in the first sequence and two in the second. The function `CI.BE()` comes in handy.

n <- c(14 - 1, 14 - 2) # 14 dosed in each sequence
round(100 * CI.BE(pe = 0.90, CV = 0.25, n = n), 2)
#  lower  upper
#  79.87 101.42
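The interval above, and its link to the two one-sided tests, can be retraced in base R. This is a sketch under stated assumptions: a 2×2×2 crossover with the standard error of the difference taken as sqrt(mse/2 · (1/n₁ + 1/n₂)) and n₁ + n₂ − 2 degrees of freedom. All variable names are illustrative.

```r
pe     <- 0.90            # observed T/R-ratio (point estimate)
CV     <- 0.25            # observed CV
n      <- c(13, 12)       # eligible subjects per sequence
alpha  <- 0.05
theta1 <- 0.80
theta2 <- 1.25
mse     <- log(CV^2 + 1)                 # residual variance on the log scale
se.diff <- sqrt(mse / 2 * sum(1 / n))    # standard error of the difference
df      <- sum(n) - 2
# 100(1 - 2 alpha)% confidence interval, back-transformed
ci  <- exp(log(pe) + c(-1, +1) * qt(1 - alpha, df) * se.diff)
# p-values of the two one-sided tests
p.L <- pt((log(pe) - log(theta1)) / se.diff, df, lower.tail = FALSE)
p.U <- pt((log(pe) - log(theta2)) / se.diff, df, lower.tail = TRUE)
round(100 * ci, 2)
c(p.L = p.L, p.U = p.U, BE = max(p.L, p.U) <= alpha)
```

Both routes lead to the same decision: the lower confidence limit is below 80% and, equivalently, the p-value of the lower one-sided test exceeds α.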

The study failed although by a small margin. One might be tempted to repeat the study with an – only slightly – higher sample size. But what was the post hoc power of the failed study?

power.TOST(CV = 0.25, theta0 = 0.90, n = c(13, 12)) # observed values
# [1] 0.4963175

Actually the chance of passing was worse than tossing a coin.

NB, in calculating post hoc power the observed $\small{\theta_0}$ has to be used. In some statistical reports high ‘power’ is given even for a failed study, which is not even wrong. Alas, $\small{\theta_0=1}$ is still the default in some software packages.

power.TOST(CV = 0.20, theta0 = 0.92, n = 28) # assumed in planning
# [1] 0.822742
power.TOST(CV = 0.25, theta0 = 1, n = c(13, 12)) # observed CV but wrong T/R-ratio
# [1] 0.8558252

Since all estimates were worse than assumed, how could one get a ‘power’ even higher than desired, despite the fact that the study failed to demonstrate bioequivalence? That’s nonsense, of course. $\small{\theta_0=1}$ gives the ‘power to detect a significant difference of 20%’ – a flawed concept which was abandoned after Schuirmann’s paper of 1987.

Pooling

When planning the next study one can use the entire arsenal from above. Since we have more accurate estimates (from 25 subjects instead of the 16 of the pilot) the situation is clearer now.
As a further step we can take the information of both studies into account with the function `CVpooled()`.

CVs <- ("  CV |  n | design | study
         0.20 | 16 |  2x2x2 | pilot
         0.25 | 25 |  2x2x2 | pivotal")
txtcon <- textConnection(CVs)
data   <- read.table(txtcon, header = TRUE, sep = "|",
                     strip.white = TRUE, as.is = TRUE)
close(txtcon)
print(CVpooled(data, alpha = 0.20), digits = 4, verbose = TRUE)
# Pooled CV = 0.2322 with 37 degrees of freedom
# Upper 80% confidence limit of CV = 0.2603

Before pooling, variances are weighted by the degrees of freedom. Hence, the new estimate of ~0.23 is closer to the 0.25 of the larger study. Note also that the upper confidence limit of ~0.26 is higher than the ~0.24 of the pilot study.
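The weighting can be retraced in base R. The sketch below assumes that pooling is done on the variances of the log scale (weighted by their degrees of freedom) and that the upper limit is again based on the χ²-distribution; variable names are illustrative.

```r
CV <- c(pilot = 0.20, pivotal = 0.25)
n  <- c(pilot = 16, pivotal = 25)
df <- n - 2                               # 2x2x2 crossover in both studies
s2 <- log(CV^2 + 1)                       # variances on the log scale
s2.pooled <- sum(df * s2) / sum(df)       # weighted by degrees of freedom
CV.pooled <- sqrt(exp(s2.pooled) - 1)
alpha     <- 0.20
s2.upper  <- sum(df) * s2.pooled / qchisq(alpha, sum(df))
CV.upper  <- sqrt(exp(s2.upper) - 1)
round(c(df = sum(df), CV.pooled = CV.pooled, CV.upper = CV.upper), 4)
```

Since the pivotal study contributes 23 of the 37 degrees of freedom, the pooled estimate lies closer to its CV.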

Caveats

Don’t pool data blindly. In the ideal situation you know the entire background of all studies (clinical performance, bioanalytics). Even if all studies were performed at the same CRO, more things are important. One purpose of a pilot study is to find a suitable sampling schedule. If the sampling schedule of the pilot was not ideal (e.g., C_max was not sufficiently well described), pooling is not a good idea. It might well be that in the pivotal study – with a ‘better’ sampling schedule – the CV is more reliable. On the other hand, AUC is less sensitive to different sampling schedules.

Pooling data from the literature should be done with great caution (if at all). Possibly critical information is missing. Consider using a CV from the upper end of values instead. Common sense helps.

An example where pooling could be misleading: C_max data of pilot and pivotal studies in five different designs with 11 to 39 subjects, fasting/fed, three different bioanalytical methods (GC/ECD, LC-MS/MS, GC/MS), chiral and achiral (which is not relevant for this drug since the active enantiomer is ~95% of the total drug and there is no in vivo interconversion). Note that most pivotal studies were ‘overpowered’.

Fig. 4 Blue deltoid pooled CV 0.163; dotted line its upper CL 0.167.

This is an apples-and-oranges comparison. Red squares show CVs which were above the upper CL of the pooled CV. Granted, only in two studies (#1, #6) did the lower CL not overlap the upper one of the pooled CV.

Which side of the great divide are you on? Do you believe that meta is better or do you hold instead that pooling is fooling? Well, to nail my colours to the mast, I belong to the former school. It seems to me that there is no other topic in medical statistics, with the possible exceptions of cross-over trials, bioequivalence and n-of-1 studies, which has the same capacity as this one to rot the brains.
— Stephen Senn (2020)²⁰

Hints

Power Analysis

Although we suggest exploring the various options shown in Example 3, it is worthwhile to have a first look with the function `pa.ABE()`.

pa.ABE(CV = 0.20, theta0 = 0.92)
# Sample size plan ABE
#  Design alpha  CV theta0 theta1 theta2 Sample size Achieved power
#     2x2  0.05 0.2   0.92    0.8   1.25          28       0.822742
# 
# Power analysis
# CV, theta0 and number of subjects leading to min. acceptable power of ~0.7:
#  CV= 0.2377, theta0= 0.9001
#  n = 21 (power= 0.7104)

Fig. 5 Power Analysis (in each panel one argument is varied and others kept constant)

This exercise confirms what we already know. The most critical parameter is $\small{\theta_0}$, whereas dropouts are the least important.
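The three limiting values reported by pa.ABE() can be checked roughly in base R. The sketch below is not the package's exact method (Owen’s Q) but the noncentral t approximation for a 2×2 crossover. The function name `power_nct` is mine, and balanced sequences are assumed even for n = 21, so the last value is only approximate.

```r
# Approximate power of TOST in a 2x2 crossover, analysis on the log scale.
# Sketch of the noncentral t approximation; not the exact default method.
power_nct <- function(CV, theta0, n, alpha = 0.05,
                      theta1 = 0.80, theta2 = 1.25) {
  s_w  <- sqrt(log(CV^2 + 1))   # within-subject SD on the log scale
  se   <- s_w * sqrt(2 / n)     # SE of the difference (balance assumed)
  df   <- n - 2
  tval <- qt(1 - alpha, df)
  d1   <- (log(theta0) - log(theta1)) / se  # noncentrality, lower limit
  d2   <- (log(theta0) - log(theta2)) / se  # noncentrality, upper limit
  max(0, pt(-tval, df, ncp = d2) - pt(tval, df, ncp = d1))
}
base  <- power_nct(CV = 0.20, theta0 = 0.92, n = 28)
worse <- c(CV     = power_nct(CV = 0.2377, theta0 = 0.92,   n = 28),
           theta0 = power_nct(CV = 0.20,   theta0 = 0.9001, n = 28),
           n      = power_nct(CV = 0.20,   theta0 = 0.92,   n = 21))
round(c(base = base, worse), 3)
```

All three scenarios end up close to a power of 0.70, but the relative changes needed to get there differ widely: the CV has to increase by about 19%, the sample size has to drop by 25%, whereas $\small{\theta_0}$ only has to move by about 2%.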

More details are given in the vignette PowerAnalysis.

Dropouts

As we have seen, the impact of dropouts on power is rather limited. Regularly CROs suggest additional subjects to ‘compensate for the potential loss in power’. IMHO, milking sponsors to make wealthy CROs richer. Note that the dropout-rate is based on dosed subjects. Hence, the correct formula for the adjusted sample size $\small{n{}'}$ based on the estimated one $\small{n}$ is $\small{n{}'=n / (1-{dropout\:rate})}$, and not $\small{n{}'=n \times(1+{dropout\:rate})}$. An example for studies with three periods:

# round x up to the next multiple of y
balance <- function(x, y) {
  return(y * (x %/% y + as.logical(x %% y)))
}
do   <- 0.15 # anticipated dropout-rate 15%
seqs <- 3
n    <- seq(12L, 96L, 12L)
res  <- data.frame(n = n,
                   adj1 = balance(n / (1 - do), seqs), # correct
                   elig1 = NA_integer_, diff1 = NA_integer_,
                   adj2 = balance(n * (1 + do), seqs), # wrong
                   elig2 = NA_integer_, diff2 = NA_integer_)
res$elig1 <- floor(res$adj1 * (1 - do))
res$diff1 <- sprintf("%+i", res$elig1 - n)
res$elig2 <- floor(res$adj2 * (1 - do))
res$diff2 <- sprintf("%+i", res$elig2 - n)
invisible(
  ifelse(res$elig1 - n >= 0,
         res$optim <- res$elig1,
         res$optim <- res$elig2))
res$diff  <- sprintf("%+i", res$optim - n)
names(res)[c(2, 5)] <- c("n'1", "n'2")
res$diff1[which(res$diff1 == "+0")] <- "\u00B10"
res$diff2[which(res$diff2 == "+0")] <- "\u00B10"
res$diff[which(res$diff == "+0")]   <- "\u00B10"
print(res, row.names = FALSE)
#   n n'1 elig1 diff1 n'2 elig2 diff2 optim diff
#  12  15    12    ±0  15    12    ±0    12   ±0
#  24  30    25    +1  30    25    +1    25   +1
#  36  45    38    +2  42    35    -1    38   +2
#  48  57    48    ±0  57    48    ±0    48   ±0
#  60  72    61    +1  69    58    -2    61   +1
#  72  87    73    +1  84    71    -1    73   +1
#  84  99    84    ±0  99    84    ±0    84   ±0
#  96 114    96    ±0 111    94    -2    96   ±0

With the wrong formula – especially for high dropout rates – you might end up with fewer eligible subjects (elig2) than planned, thus compromising power. On the other hand, with the correct one (due to rounding up to get balanced sequences) you might end up with slightly too many (elig1). Of course, if you want to be on the safe side, you can select the ‘best’ (column optim).
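Why the shortfall grows with the dropout-rate $\small{d}$: dosing $\small{n\times(1+d)}$ subjects and losing the fraction $\small{d}$ leaves $\small{n\times(1+d)(1-d)=n\times(1-d^2)}$ eligible subjects, which is below $\small{n}$ before any rounding. A minimal base-R sketch follows. The helper name, n = 48, and the dropout-rates are arbitrary choices of mine; percentages are handled in integer arithmetic to avoid floating-point rounding surprises.

```r
# Smallest multiple of m that is >= x / y (x, y, m non-negative integers)
ceil_to_multiple <- function(x, y, m) {
  q <- (x + y - 1) %/% y         # integer ceiling of x / y
  m * ((q + m - 1) %/% m)        # round up to a multiple of m
}
n    <- 48                       # estimated sample size (arbitrary)
seqs <- 3                        # number of sequences
pct  <- c(5, 10, 15, 20, 30, 40) # anticipated dropout-rates in percent
dosed_ok    <- ceil_to_multiple(100 * n, 100 - pct, seqs)   # n / (1 - d)
dosed_wrong <- ceil_to_multiple(n * (100 + pct), 100, seqs) # n * (1 + d)
res <- data.frame(dropout_pct = pct,
                  dosed_ok    = dosed_ok,
                  elig_ok     = (dosed_ok * (100 - pct)) %/% 100,
                  dosed_wrong = dosed_wrong,
                  elig_wrong  = (dosed_wrong * (100 - pct)) %/% 100)
print(res, row.names = FALSE)
```

With the correct formula the number of eligible subjects never falls below 48. With the wrong one it drops to 41 at a dropout-rate of 40%.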

Literature Data

Sometimes the CV is not given in the literature. By means of the function CVfromCI() we can calculate it from the confidence interval, the design, and the sample size.²¹

CVfromCI(lower = 0.8323, upper = 1.0392, design = "2x2x4", n = 26)
# [1] 0.3498608
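The idea behind the back-calculation can be sketched in base R: the half-width of the CI on the log scale equals the t-quantile times the standard error, which is solved for the residual standard deviation. The function name `cv_from_ci` is mine, and the degrees of freedom (3n − 4) and the factor 1/4 are what I assume for the 2x2x4 design; this is a sketch, not the package's code.

```r
# Back-calculate the CV from a 90% CI (sketch for a 2x2x4 design)
cv_from_ci <- function(lower, upper, n_seq, alpha = 0.05) {
  n      <- sum(n_seq)                   # total sample size
  df     <- 3 * n - 4                    # assumed df of the 2x2x4 design
  se_fac <- sqrt(0.25 * sum(1 / n_seq))  # assumed design constant 1/4
  half   <- (log(upper) - log(lower)) / 2  # CI half-width on the log scale
  s_w    <- half / (qt(1 - alpha, df) * se_fac)
  sqrt(exp(s_w^2) - 1)                   # SD on the log scale -> CV
}
cv_bal <- cv_from_ci(lower = 0.8323, upper = 1.0392, n_seq = c(13, 13))
round(cv_bal, 4)
```

Under these assumptions the sketch agrees with the value returned by CVfromCI() above to at least three decimals.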

The method is exact if the subjects/sequence are known. In the literature quite often only the total sample size is given and the function tries to keep sequences as balanced as possible. What if the study was imbalanced?
A total sample size of 26 was reported. The study was either balanced or imbalanced to an unknown degree:

# balance() as defined in the section Dropouts
n      <- 26
CV.est <- CVfromCI(lower = 0.8323, upper = 1.0392,
                   design = "2x2x4", n = 26)
n.est  <- sampleN.TOST(CV = CV.est, design = "2x2x4",
                       print = FALSE)[["Sample size"]]
n1     <- balance(seq(n, 18, -1), 2) / 2
n2     <- n - n1
nseqs  <- unique(data.frame(n1 = n1, n2 = n2, n = n))
res    <- data.frame(n1 = nseqs$n1, n2 = nseqs$n2,
                     CV.true = NA_real_, CV.est = CV.est,
                     n.true = NA_integer_, n.est = n.est)
for (i in 1:nrow(res)) {
  res$CV.true[i] <- CVfromCI(lower = 0.8323, upper = 1.0392,
                             design = "2x2x4",
                             n = c(res$n1[i], res$n2[i]))
  res$n.true[i]  <- sampleN.TOST(CV = res$CV.true[i], design = "2x2x4",
                                 print = FALSE)[["Sample size"]]
  res$n.est[i]   <- sampleN.TOST(CV = CV.est, design = "2x2x4",
                                 print = FALSE)[["Sample size"]]
}
print(signif(res, 5), row.names = FALSE)
#  n1 n2 CV.true  CV.est n.true n.est
#  13 13 0.34986 0.34986     26    26
#  12 14 0.34876 0.34986     26    26
#  11 15 0.34546 0.34986     26    26
#  10 16 0.33988 0.34986     24    26
#   9 17 0.33196 0.34986     24    26

The true CV of any imbalanced study might have been lower than what we assumed. That means, if we use the estimated CV – falsely assuming balanced sequences – our sample size estimation will always be conservative.
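The reason is that, for a fixed total sample size, the standard error of the treatment difference is smallest with balanced sequences. A CI of a given width therefore implies a smaller residual variability the more imbalanced the study was. A small base-R sketch, assuming the standard error is proportional to $\small{\sqrt{1/n_1+1/n_2}}$:

```r
n1 <- 13:9                          # subjects in sequence 1
n2 <- 26 - n1                       # subjects in sequence 2
se_factor <- sqrt(1 / n1 + 1 / n2)  # proportional to the SE of the difference
rel <- se_factor / se_factor[1]     # relative to the balanced case (13 | 13)
data.frame(n1, n2, rel = round(rel, 4))
```

The factor increases steadily with the degree of imbalance, which is why the CV computed under the assumption of balance is the largest one compatible with the reported CI.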

Direction of Deviation

If you are unsure about the direction of the deviation of T from R (lower or higher) always assume $\small{\theta_0<1}$.

CV   <- 0.21
d    <- 0.05 # delta 5%, direction unknown
n    <- sampleN.TOST(CV = CV, theta0 = 1 - d, print = FALSE,
                     details = FALSE)[["Sample size"]]
res1 <- data.frame(CV = CV, theta0 = c(1 - d, 1 / (1 - d)),
                   n = n, power = NA_real_)
for (i in 1:nrow(res1)) {
  res1$power[i] <- power.TOST(CV = CV, theta0 = res1$theta0[i], n = n)
}
n    <- sampleN.TOST(CV = CV, theta0 = 1 + d,
                     print = FALSE)[["Sample size"]]
res2 <- data.frame(CV = CV, theta0 = c(1 + d, 1 / (1 + d)),
                   n = n, power = NA_real_)
for (i in 1:nrow(res2)) { # loop over the rows of res2
  res2$power[i] <- power.TOST(CV = CV, theta0 = res2$theta0[i], n = n)
}
res  <- rbind(res1, res2)
print(signif(res[order(res$n, res$theta0), ], 4), row.names = FALSE)
#    CV theta0  n  power
#  0.21 0.9524 20 0.8081
#  0.21 1.0500 20 0.8081
#  0.21 0.9500 22 0.8374
#  0.21 1.0530 22 0.8374

If you use 1.05 (sample size 20) power will be maintained all the way down to its reciprocal (0.9524) but not to 0.95 (where you would already need a sample size of 22). On the other hand, 0.95 preserves power up to 1.053. The statement ‘sample size based on a deviation of ±5%’ seen in many protocols does not make sense.
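The asymmetry follows from the analysis on the log scale, where the limits 0.80 and 1.25 are symmetric around zero and a ratio and its reciprocal are equally far from it. A short base-R check:

```r
limits <- c(theta1 = 0.80, theta2 = 1.25)
log(limits)                      # symmetric around zero
theta0 <- c(0.95, 1 / 1.05, 1.05, 1 / 0.95)
data.frame(theta0  = round(theta0, 4),
           log_dev = round(abs(log(theta0)), 5)) # distance from log(1) = 0
```

On the log scale 0.95 is farther from zero than 1.05, so it is the more demanding assumption; its counterpart with the same distance is 1/0.95 ≈ 1.053, not 1.05.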

Authors

function	author(s)
`sampleN.TOST`, `power.TOST`, `power.TOST.sds`, `sampleN.RatioF`, `CVCL`, `CI.BE`, `CVpooled`	Detlew Labes
`expsampleN.TOST`	Benjamin Lang, Detlew Labes
`CVfromCI`	Detlew Labes, Helmut Schütz, Benjamin Lang
`pa.ABE`	Helmut Schütz, Detlew Labes

License

GPL-3 2025-09-17 Helmut Schütz

1. It is a common misconception that in a parallel design one obtains the between- (inter-) subject variability. The fact that products are administered only once does not mean that the within-subject variability does not exist (we observed only one occasion). Hence, the variability consists of both between- and within-components. We can estimate the between-subject variability only in a crossover design. Referring to between-subject variability in a parallel design is sloppy terminology at least.
2. Owen DB. A special case of a bivariate non-central t-distribution. Biometrika. 1965; 52(3/4): 437–46. doi:10.2307/2333696.
3. R Documentation. R-manual. Data Frames. 2022-02-08. Online.
4. In other functions of PowerTOST the sample size is not given in the seventh column but always named "Sample size".
5. Schütz H. Multi-Group Studies in Bioequivalence. To pool or not to pool? Review of Guidelines. BioBridges. Prague, 26–27 September, 2018. Online.
6. Fieller EC. Some Problems In Interval Estimation. J Royal Stat Soc B. 1954; 16(2): 175–85. JSTOR:2984043.
7. EMEA, CPMP. Points to Consider on Switching between Superiority and Non-Inferiority. London. 27 July 2000. CPMP/EWP/482/99. Online.
8. Gould AL. Group Sequential Extension of a Standard Bioequivalence Testing Procedure. J Pharmacokin Biopharm. 1995; 23(1): 57–86. doi:10.1007/BF02353786.
9. Dubins D. My Tiny Contribution to Clinical Research: FARTSSIE. 2021-03-21. GitHub.
10. Labes D, Lang B, Schütz H. Power2Stage: Power and Sample-Size Distribution of 2-Stage Bioequivalence Studies. 2021-11-20. CRAN.
11. Ring A, Lang B, Kazaroho C, Labes D, Schall R, Schütz H. Sample size determination in bioequivalence studies using statistical assurance. Br J Clin Pharmacol. 2019; 85(10): 2369–77. doi:10.1111/bcp.14055. Open access.
12. Schuirmann DJ. Two at a Time? Or All at Once? International Biometric Society, Eastern North American Region, Spring Meeting. Pittsburgh, PA. March 28 – 31, 2004. Online abstract.
13. EMA, CHMP. Guideline on the Investigation of Bioequivalence. London. 20 January 2010. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr **. Online.
14. EGA. Revised EMA Bioequivalence Guideline. Questions & Answers. Online.
15. D’Angelo P. Testing for Bioequivalence in Higher‐Order Crossover Designs: Two‐at‐a‐Time Principle Versus Pooled ANOVA. 2nd Conference of the Global Bioequivalence Harmonisation Initiative. Rockville, MD. 15–16 September, 2016.
16. Schuirmann DJ. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J Pharmacokin Biopharm. 1987; 15(6): 657–80. doi:10.1007/BF01068419.
17. If the CI overlaps $\small{\{\theta_1,\theta_2\}}$, the outcome is indecisive (called ‘the gray zone’ by some). As long as $\small{\theta_0}$ lies within $\small{\{\theta_1,\theta_2\}}$ you can hope to show BE in a larger sample size. However, once $\small{\theta_0}$ approaches one of the limits, it might be futile – even with a very small CV. Try sampleN.TOST(CV = 0.075, theta0 = 0.81) and then sampleN.TOST(CV = 0.075, theta0 = 0.80) to get an idea.
18. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24. Open access.
19. Lenth RA. Two Sample-Size Practices that I Don’t Recommend. Online.
20. Senn S. A Dip in the Pool. 2020-05-03. Online.
21. Schütz H. Sample Size Estimation for BE Studies: Algebra… Bucharest. 19 March 2013. Online.


Average Bioequivalence

Details and examples of other methods are accessible via the menu bar on top of the page and in the online manual of all functions.