The mlpwr package is a powerful tool for comprehensive power analysis and design optimization in research. It addresses challenges in optimizing study designs for power across multiple dimensions while considering cost constraints. By combining Monte Carlo simulations, surrogate modeling techniques, and cost functions, mlpwr enables researchers to model the relationship between design parameters and statistical power, allowing for efficient exploration of the parameter space.
Using Monte Carlo simulation, mlpwr estimates statistical power across different design configurations by generating simulated datasets and performing hypothesis tests on these. A surrogate model, such as linear regression, logistic regression, support vector regression (SVR), or Gaussian process regression, is then fitted to approximate the power function. This facilitates the identification of optimal design parameter values.
The mlpwr package offers two primary types of outputs based on specified goals and constraints. Researchers can obtain study design parameters that yield the desired power level at the lowest possible cost, taking budget limitations and resource availability into account. Alternatively, researchers can identify design parameters that maximize power within a given cost threshold, enabling informed resource allocation.
In conclusion, the mlpwr package provides a comprehensive and flexible tool for power analysis and design optimization. It guides users through the process of optimizing study designs, enhancing statistical power, and making informed decisions within their research context.
For more details, refer to Zimmer & Debelak (2023).
In this vignette we will apply the mlpwr package to an item response theory (IRT) problem. We will tackle two types of problems: 1) testing whether a Rasch or a 2PL model is more suitable for our data and 2) conducting a DIF analysis in a 2PL model. Both of these problems require an a priori power analysis to ensure that enough participants are recruited to appropriately test our hypothesis. To facilitate understanding, we will apply our analysis to a concrete research setting: evaluating a math test in school.
The Rasch model, also known as the Rasch measurement model or the Rasch model for item response theory (IRT), is a widely used psychometric model for analyzing data in educational and psychological measurement. It provides a framework for assessing the relationship between respondents' abilities and item difficulties.
The Rasch model assumes that \(Pr(U_{pi} = 1 | \theta_p, \beta_i)\), the probability of an individual \(p\) correctly solving a test item \(i\), is influenced by both the characteristics of the item (item parameters) and the person (person parameters). Specifically, the probability of a person with a math ability \(\theta_p\) answering an item correctly increases as their math ability becomes more pronounced. Conversely, as the difficulty of an item \(\beta_i\) increases, the probability of answering that item correctly decreases. This relationship can be represented by the following formula:
\[Pr(U_{pi} = 1 | \theta_p, \beta_i) = \frac{e^{\theta_p - \beta_i}}{1 +e^{\theta_p - \beta_i}}\]
The Rasch model assumes a strict equality of item discrimination across all items. In general, the item discrimination parameter measures the extent to which an item is capable of distinguishing between individuals with different abilities. In analytical terms, the item discrimination parameter represents the slope of the item response function. A steeper slope indicates a stronger relationship between the ability (\(\theta\)) and a correct response, indicating how effectively the item discriminates among examinees along the ability scale continuum. A higher value of the item discrimination parameter suggests that the item is more effective in differentiating examinees. Practically, a higher discrimination parameter value means that the probability of a correct response increases more rapidly as the latent trait (\(\theta\)) increases. The Rasch model assumes this item discrimination to be equal among all items, which might not be the optimal fit for the data. Thus, a second model, the 2PL model, was introduced.
The two-parameter logistic (2PL) model is a more flexible alternative and can be seen as an extension of the Rasch model. When the assumption of equal item discrimination does not fit the data well, the 2PL model is preferred over the Rasch model. The 2PL model introduces a second item parameter, called the discrimination or slope parameter (denoted as \(\alpha_i\)), in addition to the difficulty parameter (denoted as \(\beta_i\)) for each item \(i\). Mathematically, the 2PL model can be described as follows:
\[Pr(U_{pi} = 1 | \theta_p, \alpha_i, \beta_i) =\frac{e^{\alpha_i(\theta_p - \beta_i)}}{1 + e^{\alpha_i(\theta_p -\beta_i)}}\]
A high value of \(\alpha_i\) indicates that the item has a strong ability to distinguish among test takers. This means that the probability of giving a correct response increases more rapidly as the ability \(\theta\) increases.
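To make the item response function concrete, here is a small R sketch of the 2PL response probability; the helper name p_2pl and the example parameter values are introduced here purely for illustration (the Rasch model is the special case with all slopes equal to 1):

```r
# 2PL item response function: probability of a correct response
# plogis(x) is the standard logistic function 1 / (1 + exp(-x))
p_2pl <- function(theta, alpha, beta) plogis(alpha * (theta - beta))

# Example: an item with discrimination alpha = 1.5 and difficulty beta = 0,
# evaluated at three ability levels
p_2pl(theta = c(-1, 0, 1), alpha = 1.5, beta = 0)
## [1] 0.1824255 0.5000000 0.8175745
```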
When evaluating psychological and educational tests, researchers often face the question of which IRT model better describes the data. In this section, our specific hypothesis test aims to address whether the inclusion of the additional discrimination parameter in the 2PL model is essential for describing the data, or if the discrimination parameters are sufficiently similar for the Rasch model to adequately describe the data. Therefore, it becomes relevant to test the null hypothesis of equal slope parameters, as this may allow us to reject an incorrectly assumed Rasch model. Conducting a power analysis can help determine the minimum sample size needed to reject the simpler Rasch model. In the following sections, we will first perform an a priori power analysis and then an a posteriori power analysis to test the Rasch model against the 2PL model.
A school is interested in accurately assessing the math skills of their students to inform instructional strategies and curriculum development. They want to administer a math test to a group of students and collect their item response data. In the end, the goal is to find an adequate model for the data to better understand how the math test assesses the math ability of students. We propose two modeling approaches: the Rasch model or the Two-Parameter Logistic (2PL) model.
The Rasch model assumes that an item only influences the response probability through item difficulty, disregarding variations in item discrimination. We are primarily interested in estimating the difficulty level of each math item to identify areas where students might struggle the most. The Rasch model can provide precise estimates of item difficulty, allowing the school to focus on improving those specific concepts or skills.
However, we also recognize that the 2PL model can offer additional insights by considering item discrimination. By incorporating discrimination parameters, the 2PL model captures the extent to which an item differentiates between students with high and low math abilities. This information can help identify items that are particularly effective at discriminating between students with varying skill levels.
To make an informed decision, we plan to compare the Rasch and 2PL models by testing them against each other using a likelihood ratio test. Once we have data, we could simply test whether there is a statistically significant difference between the models. But this test is only reliable if there is enough data to begin with. To ensure that our planned test will be able to detect differences up to a margin of error, we decide to do an a priori power analysis to determine the required sample size before we start collecting data. For this we use the mlpwr package.
To simulate data for an IRT task like the one described above, we will use the mirt package. For the subsequent power simulation we will use the mlpwr package. If it is your first time using them, you have to install them like so:
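The installation chunk did not survive in this copy of the vignette; a minimal sketch using the standard CRAN installation call would be:

```r
# Install both packages from CRAN (only needed once)
install.packages("mirt")
install.packages("mlpwr")
```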
Now the packages are permanently installed on your computer. To use them in R you need to (re-)load them every time you start a session.
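The loading chunk is likewise missing here; loading the packages is done with the usual library() calls:

```r
# Load the packages for the current session
library(mirt)
library(mlpwr)
```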
To simulate the data for our scenario, we need the following specifications:

- the item parameters: the item discrimination a, and the item difficulty d. This corresponds to slopes and intercepts in the 2PL model.
- the number of observations N
- the item type (here: 2PL)

```r
# Defining intercepts and slopes
a <- c(1.04, 1.2, 1.19, 0.61, 1.31, 0.83, 1.46, 1.27, 0.51, 0.81)
d <- c(0.06, -1.79, -1.15, 0.88, -0.2, -1.87, 1.23, -0.08, -0.71, 0.6)
# Setting number of observations
N <- 100
# Item type
itemtype <- "2PL"
```

Afterwards we can simulate a dataset using the function simdata in the package mirt.
```r
# Simulate data
sim_data <- simdata(a = a, d = d, N = N, itemtype = itemtype)
# First 5 rows of the simulated data
sim_data[1:5, ]
##      Item_1 Item_2 Item_3 Item_4 Item_5 Item_6 Item_7 Item_8 Item_9 Item_10
## [1,]      1      0      0      0      0      0      1      0      1       1
## [2,]      1      0      1      0      0      0      0      0      0       1
## [3,]      1      0      0      1      1      0      0      1      0       1
## [4,]      0      0      0      0      0      0      1      0      0       0
## [5,]      0      0      1      1      0      0      0      0      1       1
```

After discussions with the school's experts, we deem items 1-4 the most important for their test. Thus we want to ensure a good model fit for these items. Both a Rasch model and a 2PL model are plausible for these items, so we fit them both and then conduct a likelihood ratio test to check which model fits better. The rest of the items are fit with a 2PL model.
```r
# Fit 2PL model
mod <- mirt(sim_data)
# Rasch constraint for items 1-4: equal slopes (item discrimination)
constrained <- "F = 1-4
CONSTRAIN = (1-4, a1)"
# Fit constrained model
mod_constrained <- mirt(sim_data, constrained)
# Compare model fit
res <- anova(mod_constrained, mod)
```

We rely on the p-value of the likelihood ratio test to compare the models. We deem the 2PL model significantly better if the p-value is below 0.01 and extract this in the form of a logical TRUE/FALSE value.
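The chunk performing this extraction is not shown in this copy; mirroring the `res$p[2] < 0.01` check used in the simulation function below, it would look roughly like this:

```r
# TRUE if the unconstrained 2PL model fits significantly better (alpha = 0.01)
res$p[2] < 0.01
```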
```
## [1] TRUE
```

The previous section showed how we can perform a data simulation with a subsequent hypothesis test. Now we know the test's items approximately follow the specifications above, and we want to make sure that the school recruits enough participants to take their test in order to later select the correct model. Thus we perform a power simulation with mlpwr.
A power simulation can be done using the find.design function. For our purpose we submit four parameters to it: the data-generating simulation function (simfun), the boundaries of the search space for the sample size (boundaries), the desired power level (power), and the number of simulation function evaluations (evaluations).
Note: additional specifications are possible (see the documentation) but not necessary. For most parameters, like the choice of surrogate function, the default values should already be good choices. As we are working with only one design parameter here (N = sample size) we don't need to submit a cost function that weighs the design parameters.
The find.design function needs to repeat the data simulation multiple times. For this purpose it expects a data-generating function (DGF) as its main input. The DGF takes an observation number N as input and must return a logical value indicating whether the hypothesis test was significant (TRUE) or not (FALSE). We can formulate the above simulation process in a function like so:
```r
simfun_irt1 <- function(N) {
  # generate data
  dat <- simdata(a = c(1.04, 1.2, 1.19, 0.61, 1.31, 0.83, 1.46, 1.27, 0.51, 0.81),
                 d = c(0.06, -1.79, -1.15, 0.88, -0.2, -1.87, 1.23, -0.08, -0.71, 0.6),
                 N = N, itemtype = "2PL")

  # test hypothesis
  mod <- mirt(dat)                           # Fit 2PL model
  constrained <- "F = 1-4
CONSTRAIN = (1-4, a1)"
  mod_constrained <- mirt(dat, constrained)  # Fit 2PL with equal slopes
  res <- anova(mod_constrained, mod)         # Perform model comparison
  res$p[2] < 0.01                            # Extract significance
}
```

This function is also implemented in the same way in the mlpwr package and can be accessed as follows:
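The access chunk is not shown here. Assuming the package exposes this example under the name "irt1" via its example.simfun() helper (an assumption to verify against the package documentation), it could be retrieved like so:

```r
# Retrieve the predefined IRT simulation function shipped with mlpwr
simfun_irt1 <- example.simfun("irt1")
```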
With the specified parameters we can perform the power analysis. The boundaries were set based on an educated guess from experience, but if the analysis fails one can always readjust them and do the analysis again. We set a random seed for reproducibility reasons.
```r
set.seed(123)
res <- find.design(simfun = simfun_irt1, boundaries = c(40, 100),
                   power = .95, evaluations = 2000)
```

Now we can summarize the results using the summary command.
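The corresponding call, whose output is reproduced below, is simply:

```r
summary(res)
```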
```
## 
## Call:
## find.design(simfun = simfun_irt1, boundaries = c(40, 100), power = 0.95, 
##     evaluations = 2000)
## 
## Design: N = 56
## 
## Power: 0.95271, SE: 0.00601
## Evaluations: 2000, Time: 655.23, Updates: 16
## Surrogate: Logistic regression
```

As we can see, the calculated sample size for the desired power of 0.95 is 56. The estimated power for this sample size is 0.95271 with a standard error of 0.00601. The summary additionally reports the number of simulation function evaluations, the time until termination in seconds, and the number of surrogate model updates. See Zimmer & Debelak (2023) for more details. We can also plot our power simulation and look at the calculated function using the plot function.
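The plotting chunk is not included in this copy; given the object returned above, the figure described next would presumably be produced by:

```r
plot(res)
```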
Confidence intervals (gray) are printed in addition to the estimated power curve (black), so we can get a feel for how the design parameter (here the sample size N) influences the power level and also where the prediction is more or less uncertain. The black dots show the simulated data.
We finished our power analysis and report back to the school that a sample size of 56 should be sufficient to determine an appropriate model for their IRT problem.
Differential Item Functioning (DIF) is the phenomenon that an item behaves differently for different groups of people.
DIF tests are essential in educational testing, particularly when examining common sociodemographic characteristics such as gender, age, language, and academic background. Detecting DIF helps ensure fairness in testing. If a test is unfair, it can disadvantage certain groups of individuals and potentially lead to incorrect decisions. For instance, in the context of aptitude tests for college admissions, an unfair test could wrongly determine whether a student is admitted to a specific course of study or not. Therefore, it is crucial to reliably detect and address DIF in the data.
A school is developing a math test. They want to ensure that the test items are fair and do not favor or disadvantage any specific group of students based on their first language. The school's experts suspect that especially the first test item could be problematic. Thus, they plan to investigate the presence of DIF in the first item of the test in a study. The school will need to recruit native English speakers and non-native English speakers for this study, but they don't know how many. Additionally, their time resources for recruitment are limited. To determine an appropriate sample size under the given cost constraint (limited recruitment time), we use the mlpwr package for the power analysis.
To simulate data for an IRT task like the one described above, we will use the mirt package. For the subsequent power simulation we will use the mlpwr package. If you did not run the first part of the script, the mlpwr and mirt packages have to be installed and loaded.
The data simulation follows a similar approach to that in the Rasch vs. 2PL section. Thus we will directly set up the simulation function.
```r
simfun_irt2 <- function(N1, N2) {
  # generate data
  a1 <- a2 <- c(1.04, 1.2, 1.19, 0.61, 1.31, 0.83, 1.46, 1.27, 0.51, 0.81)
  d1 <- d2 <- c(0.06, -1.79, -1.15, 0.88, -0.2, -1.87, 1.23, -0.08, -0.71, 0.6)
  a2[1] <- a2[1] + 0.3
  d2[1] <- d2[1] + 0.5
  dat1 <- simdata(a = a1, d = d1, N = N1, itemtype = "2PL")
  dat2 <- simdata(a = a2, d = d2, N = N2, itemtype = "2PL")
  dat <- as.data.frame(rbind(dat1, dat2))
  group <- c(rep("1", N1), rep("2", N2))

  # Fit model
  model <- multipleGroup(dat, 1, group = group)
  # Perform DIF for item 1
  dif <- DIF(model, which.par = c("a1", "d"), items2test = c(1))
  # Extract significance
  dif$p[1] < 0.05
}
```

The simulation consists of the following steps:
Data Generation: The function generates simulated data using the simdata function. It defines two sets of item parameters, a1 and d1 for the first group, and a2 and d2 for the second group. These parameters represent slope and intercept (related to item discrimination and item difficulty), respectively. The N1 and N2 parameters are the design parameters and specify the sample sizes for the first and second groups, respectively. We assume DIF for item 1, meaning the item functions differently for native English speakers (group == 1) than for non-native English speakers (group == 2). Thus, the slope and intercept of the first item are shifted for the second group, while all other items share the same parameters across groups.
Data Preparation: The generated data for both groups are combined into a single data frame dat, and a vector group is created to indicate the group membership (native vs. non-native) of each observation.
Fit model: The multipleGroup function is used to fit a multiple-group IRT model to the combined data (dat) with one latent trait dimension. The group information is provided using the group argument.
Perform DIF for item 1: The DIF function is applied to the fitted model (model) to perform the DIF analysis. The which.par argument specifies the parameters to test for DIF, here the slope (a1) and the intercept (d). The items2test argument specifies the specific item(s) to test for DIF, with item 1 selected in this case.
Extract significance: The significance of the DIF analysis for item 1 is extracted by comparing the p-value (dif$p[1]) with a significance threshold of 0.05. If the p-value is less than 0.05, the function returns TRUE, indicating the presence of DIF for item 1. Otherwise, it returns FALSE, suggesting no significant DIF for item 1. In other words, if the function returns TRUE, there is evidence that the difficulty of item 1 varies for native vs. non-native speakers of the same ability.
The following is one example of a data frame simulated by the function. The group column corresponds to native vs. non-native speaker, the rows to individual students, and the columns to the 10 items in the math test.
```
##    group Item_1 Item_2 Item_3 Item_4 Item_5 Item_6 Item_7 Item_8 Item_9 Item_10
## 1      1      1      1      0      1      0      0      1      1      1       1
## 2      1      0      1      1      0      1      1      1      1      0       1
## 3      1      1      0      0      1      0      0      1      0      0       0
## 4      1      0      1      0      0      1      0      0      0      0       0
## 5      1      1      0      0      0      1      1      0      0      0       1
## 6      2      0      0      0      1      1      0      1      1      0       1
## 7      2      1      0      1      1      1      0      0      1      1       0
## 8      2      1      1      1      1      1      0      0      1      1       1
## 9      2      0      0      1      1      0      0      1      0      0       0
## 10     2      1      0      1      1      0      0      1      1      1       1
```

Now that the data generation and hypothesis test are in place, we also need to specify a cost function because we have more than one design parameter. For this artificial scenario we assume that recruiting non-native speakers is harder than recruiting native English speakers, because the school has to spend more time recruiting them. Put differently, the cost for recruiting non-native participants is higher. We include this in the power analysis by defining a cost function like so:
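The chunk defining the cost function is missing from this copy. A minimal sketch that is consistent with the 5- and 7-minute recruitment times mentioned below, with the name costfun_irt2 used in the later find.design call, and with the reported cost of 3999 for N1 = 363 and N2 = 312 (5 * 363 + 7 * 312 = 3999) would be:

```r
# Cost function: total recruitment time in minutes
# (5 minutes per native speaker, 7 minutes per non-native speaker)
costfun_irt2 <- function(N1, N2) 5 * N1 + 7 * N2
```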
This assumes that the school spends on average 5 minutes recruiting a native speaker vs. 7 minutes for a non-native speaker. In contrast to the previous example, the school now has a set budget of time to recruit participants. They want to know how they should allocate their resources in order to gain maximum power.
Now that all components are defined, a power analysis using find.design from mlpwr is possible. We need the following inputs: the simulation function (simfun), the boundaries of the search space for N1 and N2 (boundaries), the cost function (costfun), the maximum total cost (cost), and the number of simulation function evaluations (evaluations).
Note 1: additional specifications are possible (see the documentation) but not necessary. For most parameters, like the choice of surrogate function, the default values should already be good choices.
With the specified parameters we can perform the power analysis. We set a random seed for reproducibility reasons.
```r
set.seed(111)
res <- find.design(simfun = simfun_irt2,
                   boundaries = list(N1 = c(100, 700), N2 = c(100, 700)),
                   costfun = costfun_irt2, cost = 4000, evaluations = 1000)
```

Now we can summarize the results using the summary command.
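As before, the call producing the output below is:

```r
summary(res)
```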
```
## 
## Call:
## find.design(simfun = simfun_irt2, boundaries = list(N1 = c(100, 
##     700), N2 = c(100, 700)), evaluations = 1000, costfun = costfun_irt2, 
##     cost = 4000)
## 
## Design: N1 = 363, N2 = 312
## 
## Power: 0.63819, SE: 0.01743, Cost: 3999
## Evaluations: 1000, Time: 232.59, Updates: 32
## Surrogate: Gaussian process regression
```

As we can see, the calculated sample sizes for a maximum cost of 4000 are 363 for native speakers and 312 for non-native speakers. The estimated power for these sample sizes is 0.63819 with a standard error of 0.01743. The summary additionally reports the number of simulation function evaluations, the time until termination in seconds, and the number of surrogate model updates. The details of the surrogate modeling algorithm are described in Zimmer & Debelak (2023).
We can also plot our power simulation and look at the calculated function using the plot function.
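Again, the plotting chunk is not included here; presumably the figure described next is produced by:

```r
plot(res)
```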
The black dots show the simulated data. The red line corresponds to the cost constraint. The violet cross corresponds to the optimal design. The power level is indicated by the blue color.
We conclude our power analysis with the results stated above: 363 native speakers and 312 non-native speakers, and let the school take over the recruitment for their study.