| Title: | Methods for Conducting Nonresponse Bias Analysis (NRBA) |
| Version: | 0.3.1 |
| Description: | Facilitates nonresponse bias analysis (NRBA) for survey data. Such data may arise from a complex sampling design with features such as stratification, clustering, or unequal probabilities of selection. Multiple types of analyses may be conducted: comparisons of response rates across subgroups; comparisons of estimates before and after weighting adjustments; comparisons of sample-based estimates to external population totals; tests of systematic differences in covariate means between respondents and full samples; tests of independence between response status and covariates; and modeling of outcomes and response status as a function of covariates. Extensive documentation and references are provided for each type of analysis. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the methods implemented in this package. |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.2.3 |
| Imports: | broom, dplyr, magrittr, rlang, srvyr, stats, survey (≥ 4.1-1), svrep, tidyr |
| Suggests: | knitr, rmarkdown, stringr, testthat (≥ 3.0.0), tibble |
| Config/testthat/edition: | 3 |
| Depends: | R (≥ 4.1.0) |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2023-11-21 01:54:42 UTC; schneider_b |
| Author: | Ben Schneider |
| Maintainer: | Ben Schneider <BenjaminSchneider@westat.com> |
| Repository: | CRAN |
| Date/Publication: | 2023-11-21 05:10:02 UTC |
nrba: Methods for Conducting Nonresponse Bias Analysis (NRBA)
Description

Facilitates nonresponse bias analysis (NRBA) for survey data. Such data may arise from a complex sampling design with features such as stratification, clustering, or unequal probabilities of selection. Multiple types of analyses may be conducted: comparisons of response rates across subgroups; comparisons of estimates before and after weighting adjustments; comparisons of sample-based estimates to external population totals; tests of systematic differences in covariate means between respondents and full samples; tests of independence between response status and covariates; and modeling of outcomes and response status as a function of covariates. Extensive documentation and references are provided for each type of analysis. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the methods implemented in this package.
Author(s)
Maintainer: Ben Schneider <BenjaminSchneider@westat.com> (ORCID)
Authors:
Jim Green <JimGreen@westat.com>
Shelley Brock <ShelleyBrock@westat.com> (Author of original SAS macro, WesNRBA)
Tom Krenzke <TomKrenzke@westat.com> (Author of original SAS macro, WesNRBA)
Michael Jones <MichaelJones@westat.com> (Author of original SAS macro, WesNRBA)
Wendy Van de Kerckhove (Author of original SAS macro, WesNRBA)
David Ferraro (Author of original SAS macro, WesNRBA)
Laura Alvarez-Rojas (Author of original SAS macro, WesNRBA)
Katie Hubbell <KatieHubbell@westat.com> (Author of original SAS macro, WesNRBA)
Other contributors:
Westat [copyright holder]
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs | A value or the magrittr placeholder. |
rhs | A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
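A minimal illustration of the operator (the values below are arbitrary and chosen only for demonstration):

```r
library(magrittr)

# lhs %>% rhs is evaluated as rhs(lhs)
x <- c(1, 4, 9)
sum(x)       # nested style
x %>% sum()  # pipe style; same result
```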
Assess the range of possible bias based on specified assumptions about how nonrespondents differ from respondents
Description
This range-of-bias analysis assesses the range of possible nonresponse bias under varying assumptions about how nonrespondents differ from respondents. The range of potential bias is calculated for both unadjusted estimates (i.e., from using base weights) and nonresponse-adjusted estimates (i.e., based on nonresponse-adjusted weights).
Usage
assess_range_of_bias(
  survey_design,
  y_var,
  comparison_cell,
  status,
  status_codes,
  assumed_multiple = c(0.5, 0.75, 0.9, 1.1, 1.25, 1.5),
  assumed_percentile = NULL
)
Arguments
survey_design | A survey design object created with the 'survey' package |
y_var | Name of a variable whose mean or proportion is to be estimated |
comparison_cell | (Optional) The name of a variable in the data dividing the sample into cells. If supplied, then the analysis is based on assumptions about differences between respondents and nonrespondents within the same cell. Typically, the variable used is a nonresponse adjustment cell or post-stratification variable. |
status | A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
assumed_multiple | One or more numeric values. Within each nonresponse adjustment cell, the mean for nonrespondents is assumed to be a specified multiple of the mean for respondents. |
assumed_percentile | One or more numeric values, ranging from 0 to 1. Within each nonresponse adjustment cell, the mean of a continuous variable among nonrespondents is assumed to equal a specified percentile of the variable among respondents. |
Value
A data frame summarizing the range of bias under each assumption. For a numeric outcome variable, there is one row per value of assumed_multiple or assumed_percentile. For a categorical outcome variable, there is one row per combination of category and assumed_multiple or assumed_percentile.
The column bias_of_unadj_estimate is the nonresponse bias of the estimate from respondents produced using the unadjusted weights. The column bias_of_adj_estimate is the nonresponse bias of the estimate from respondents produced using nonresponse-adjusted weights, based on a weighting-class adjustment with comparison_cell as the weighting class variable. If no comparison_cell is specified, the two bias estimates will be the same.
References
See Petraglia et al. (2016) for an example of a range-of-bias analysis using these methods.
Petraglia, E., Van de Kerckhove, W., and Krenzke, T. (2016).Review of the Potential for Nonresponse Bias in FoodAPS 2012.Prepared for the Economic Research Service,U.S. Department of Agriculture. Washington, D.C.
Examples
# Load example data
suppressPackageStartupMessages(library(survey))
data(api)

base_weights_design <- svydesign(
  data = apiclus1,
  id = ~dnum,
  weights = ~pw,
  fpc = ~fpc
) |> as.svrepdesign(type = "JK1")

base_weights_design$variables$response_status <- sample(
  x = c("Respondent", "Nonrespondent"),
  prob = c(0.75, 0.25),
  size = nrow(base_weights_design),
  replace = TRUE
)

# Assess range of bias for mean of `api00`,
# based on assuming nonrespondent means are equal to the
# 25th percentile or 75th percentile among respondents,
# within nonresponse adjustment cells
assess_range_of_bias(
  survey_design = base_weights_design,
  y_var = "api00",
  comparison_cell = "stype",
  status = "response_status",
  status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent",
                   "IE" = "Ineligible", "UE" = "Unknown"),
  assumed_percentile = c(0.25, 0.75)
)

# Assess range of bias for proportions of `sch.wide`,
# based on assuming nonrespondent proportions are equal to
# some multiple of respondent proportions,
# within nonresponse adjustment cells
assess_range_of_bias(
  survey_design = base_weights_design,
  y_var = "sch.wide",
  comparison_cell = "stype",
  status = "response_status",
  status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent",
                   "IE" = "Ineligible", "UE" = "Unknown"),
  assumed_multiple = c(0.25, 0.75)
)

Calculate Response Rates
Description
Calculates response rates using one of the response rate formulas defined by AAPOR (American Association for Public Opinion Research).
Usage
calculate_response_rates(
  data,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  weights,
  rr_formula = "RR3",
  elig_method = "CASRO-subgroup",
  e = NULL
)
Arguments
data | A data frame containing the selected sample, one row per case. |
status | A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
weights | (Optional) A character string giving the name of a variable representing weights in the datato use for calculating weighted response rates |
rr_formula | A character vector including any of the following: 'RR1', 'RR3', and 'RR5'. |
elig_method | If 'CASRO-overall' or 'CASRO-subgroup', the eligibility rate e used for RR3 is estimated using the CASRO formula described in the Formulas section below; if 'specified', the value supplied via e is used. |
e | (Required if elig_method is 'specified'.) An estimate of the eligibility rate among cases with unknown eligibility. |
Value
Output consists of a data frame giving weighted and unweighted response rates. The following columns may be included, depending on the arguments supplied:
RR1_Unweighted
RR1_Weighted
RR3_Unweighted
RR3_Weighted
RR5_Unweighted
RR5_Weighted
n: Total sample size
Nhat: Sum of weights for the total sample
n_ER: Number of eligible respondents
Nhat_ER: Sum of weights for eligible respondents
n_EN: Number of eligible nonrespondents
Nhat_EN: Sum of weights for eligible nonrespondents
n_IE: Number of ineligible cases
Nhat_IE: Sum of weights for ineligible cases
n_UE: Number of cases whose eligibility is unknown
Nhat_UE: Sum of weights for cases whose eligibility is unknown
e_unwtd: If RR3 is calculated, the eligibility rate estimate e used for the unweighted response rate
e_wtd: If RR3 is calculated, the eligibility rate estimate e used for the weighted response rate
If the data frame is grouped (i.e., by using df %>% group_by(Region)), then the output contains one row per subgroup.
Formulas
Denote the sample totals as follows:
ER: Total number of eligible respondents
EN: Total number of eligible non-respondents
IE: Total number of ineligible cases
UE: Total number of cases whose eligibility is unknown
For weighted response rates, these totals are calculated using weights.
The response rate formulas are then as follows:
RR1 = ER / ( ER + EN + UE )
RR1 essentially assumes that all cases with unknown eligibility are in fact eligible.
RR3 = ER / ( ER + EN + (e * UE) )
RR3 uses an estimate, e, of the eligibility rate among cases with unknown eligibility.
RR5 = ER / ( ER + EN )
RR5 essentially assumes that all cases with unknown eligibility are in fact ineligible.
For RR3, an estimate, e, of the eligibility rate among cases with unknown eligibility must be used. AAPOR strongly recommends that the basis for the estimate be explicitly stated and detailed.
The CASRO methods, which may be appropriate depending on the design, use the formula e = 1 - ( IE / (ER + EN + IE) ).
For elig_method='CASRO-overall', an estimate is calculated for the overall sample, and this single estimate is used when calculating response rates for subgroups.
For elig_method='CASRO-subgroup', estimates are calculated separately for each subgroup.
Please consult AAPOR's current Standard Definitions for in-depth explanations.
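As a sketch of the formulas above, with hypothetical disposition counts (these numbers are illustrative only and come from no real survey):

```r
# Hypothetical disposition counts (illustrative only)
ER <- 800  # eligible respondents
EN <- 150  # eligible nonrespondents
IE <- 30   # known ineligible cases
UE <- 20   # cases with unknown eligibility

# CASRO estimate of the eligibility rate among unknown-eligibility cases
e <- 1 - (IE / (ER + EN + IE))

RR1 <- ER / (ER + EN + UE)      # all unknowns treated as eligible
RR3 <- ER / (ER + EN + e * UE)  # unknowns discounted by estimated eligibility
RR5 <- ER / (ER + EN)           # all unknowns treated as ineligible
```

Because the three formulas differ only in how much of UE enters the denominator, RR1 <= RR3 <= RR5 holds for any nonnegative counts.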
References
The American Association for Public Opinion Research. 2016. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. 9th edition. AAPOR.
Examples
# Load example data
data(involvement_survey_srs, package = "nrba")

involvement_survey_srs[["RESPONSE_STATUS"]] <- sample(1:4, size = 5000, replace = TRUE)

# Calculate overall response rates
involvement_survey_srs %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    weights = "BASE_WEIGHT",
    rr_formula = "RR3",
    elig_method = "CASRO-overall"
  )

# Calculate response rates by subgroup
library(dplyr)
involvement_survey_srs %>%
  group_by(STUDENT_RACE, STUDENT_SEX) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    weights = "BASE_WEIGHT",
    rr_formula = "RR3",
    elig_method = "CASRO-overall"
  )

# Compare alternative approaches for handling of cases with unknown eligibility
involvement_survey_srs %>%
  group_by(STUDENT_RACE) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "CASRO-overall"
  )

involvement_survey_srs %>%
  group_by(STUDENT_RACE) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "CASRO-subgroup"
  )

involvement_survey_srs %>%
  group_by(STUDENT_RACE) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "specified",
    e = 0.5
  )

involvement_survey_srs %>%
  transform(e_by_email = ifelse(PARENT_HAS_EMAIL == "Has Email", 0.75, 0.25)) %>%
  group_by(PARENT_HAS_EMAIL) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "specified",
    e = "e_by_email"
  )

Test the independence of survey response and auxiliary variables
Description
Tests whether response status among eligible sample cases is independent of categorical auxiliary variables, using a Chi-Square test with Rao-Scott's second-order adjustment. If the data include cases known to be ineligible or cases with unknown eligibility status, the data are subset to include only respondents and nonrespondents known to be eligible.
Usage
chisq_test_ind_response(
  survey_design,
  status,
  status_codes = c("ER", "EN", "UE", "IE"),
  aux_vars
)
Arguments
survey_design | A survey design object created with the 'survey' package |
status | A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
aux_vars | A list of names of auxiliary variables. |
Details
Please see svychisq for details of how the Rao-Scott second-order adjusted test is conducted.
Value
A data frame containing the results of the Chi-Square test(s) of independence between response status and each auxiliary variable. If multiple auxiliary variables are specified, the output data contains one row per auxiliary variable.
The columns of the output dataset include:
auxiliary_variable: The name of the auxiliary variable tested
statistic: The value of the test statistic
ndf: Numerator degrees of freedom for the reference distribution
ddf: Denominator degrees of freedom for the reference distribution
p_value: The p-value of the test of independence
test_method: Text giving the name of the statistical test
variance_method: Text describing the method of variance estimation
References
Rao, J.N.K., and Scott, A.J. (1984). "On Chi-Squared Tests for Multiway Contingency Tables with Proportions Estimated from Survey Data." Annals of Statistics, 12, 46-60.
Examples
# Create a survey design object ----
library(survey)
data(involvement_survey_srs, package = "nrba")

involvement_survey <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  data = involvement_survey_srs
)

# Test whether response status varies by race or by sex ----
test_results <- chisq_test_ind_response(
  survey_design = involvement_survey,
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "UE" = "Unknown",
    "IE" = "Ineligible"
  ),
  aux_vars = c("STUDENT_RACE", "STUDENT_SEX")
)

print(test_results)

Test of differences in survey percentages relative to external estimates
Description
Compare estimated percentages from the present survey to external estimates from a benchmark source. A Chi-Square test with Rao-Scott's second-order adjustment is used to evaluate whether the survey's estimates differ from the external estimates.
Usage
chisq_test_vs_external_estimate(survey_design, y_var, ext_ests, na.rm = TRUE)
Arguments
survey_design | A survey design object created with the 'survey' package |
y_var | Name of dependent categorical variable. |
ext_ests | A numeric vector containing the external estimate of the percentages for each category.The vector must have names, each name corresponding to a given category. |
na.rm | Whether to drop cases with missing values |
Details
Please see svygofchisq for details of how the Rao-Scott second-order adjusted test is conducted. The test statistic, statistic, is obtained by calculating the Pearson Chi-squared statistic for the estimated table of population totals. The reference distribution is a Satterthwaite approximation. The p-value is obtained by comparing statistic/scale to a Chi-squared distribution with df degrees of freedom.
Value
A data frame containing the results of the Chi-Square test(s) of whether survey-based estimates systematically differ from external estimates.
The columns of the output dataset include:
statistic: The value of the test statistic
df: Degrees of freedom for the reference Chi-Squared distribution
scale: Estimated scale parameter
p_value: The p-value of the test
test_method: Text giving the name of the statistical test
variance_method: Text describing the method of variance estimation
References
Rao, J.N.K., and Scott, A.J. (1984). "On Chi-Squared Tests for Multiway Contingency Tables with Proportions Estimated from Survey Data." Annals of Statistics, 12, 46-60.
Examples
library(survey)

# Create a survey design ----
data("involvement_survey_pop", package = "nrba")
data("involvement_survey_str2s", package = "nrba")

involvement_survey_sample <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

# Subset to only include survey respondents ----
involvement_survey_respondents <- subset(
  involvement_survey_sample,
  RESPONSE_STATUS == "Respondent"
)

# Test whether percentages of categorical variable differ from benchmark ----
parent_email_benchmark <- c(
  "Has Email" = 0.85,
  "No Email" = 0.15
)

chisq_test_vs_external_estimate(
  survey_design = involvement_survey_respondents,
  y_var = "PARENT_HAS_EMAIL",
  ext_ests = parent_email_benchmark
)

Calculate cumulative estimates of a mean/proportion
Description
Calculates estimates of a mean/proportion which are cumulative with respect to a predictor variable, such as week of data collection or number of contact attempts. This can be useful for examining whether estimates are affected by decisions such as whether to extend the data collection period or make additional contact attempts.
Usage
get_cumulative_estimates(
  survey_design,
  y_var,
  y_var_type = NULL,
  predictor_variable
)
Arguments
survey_design | A survey design object created with the 'survey' package |
y_var | Name of a variable whose mean or proportion is to be estimated. |
y_var_type | Either 'numeric' or 'categorical'. |
predictor_variable | Name of a variable for which cumulative estimates of y_var will be calculated (e.g., week of data collection or number of contact attempts). |
Value
A dataframe of cumulative estimates. The first column, whose name matches predictor_variable, gives the values of predictor_variable for which a given estimate was computed. The other columns of the result include the following:
outcome | The name of the variable for which estimates are computed |
outcome_category | For a categorical variable, the category of that variable |
estimate | The estimated mean or proportion. |
std_error | The estimated standard error |
respondent_sample_size | The number of cases used to produce the estimate (excluding missing values) |
References
See Maitland et al. (2017) for an example of a level-of-effort analysis based on this method.
Maitland, A., et al. (2017). A Nonresponse Bias Analysis of the Health Information National Trends Survey (HINTS). Journal of Health Communication, 22, 545-553. doi:10.1080/10810730.2017.1324539
Examples
# Create an example survey design
# with a variable representing number of contact attempts
library(survey)
data(involvement_survey_srs, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Cumulative estimates from respondents for average student age ----
get_cumulative_estimates(
  survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"),
  y_var = "STUDENT_AGE",
  y_var_type = "numeric",
  predictor_variable = "CONTACT_ATTEMPTS"
)

# Cumulative estimates from respondents for proportions of categorical variable ----
get_cumulative_estimates(
  survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"),
  y_var = "WHETHER_PARENT_AGREES",
  y_var_type = "categorical",
  predictor_variable = "CONTACT_ATTEMPTS"
)

Summarize the variance estimation method for the survey design
Description
Summarize the variance estimation method for the survey design
Usage
get_variance_method(survey_design)
Arguments
survey_design | A survey design object created with the 'survey' package |
Details
For replicate designs, the type of replicates will be determined based on the 'type' element of the survey design object. If type = 'bootstrap', this can correspond to any of various types of bootstrap replication (Canty-Davison bootstrap, Rao-Wu's (n-1) bootstrap, etc.).
For designs which use linearization-based variance estimation, the summary only indicates that linearization is used for variance estimation and, if a special method is used for PPS variance estimation (e.g. Overton's approximation), that PPS variance estimation method will be described.
Value
A text string describing the method used for variance estimation
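This page has no Examples section; a minimal sketch of how the function might be called (assuming the 'survey' and 'nrba' packages are installed, and using the package's example data) could look like the following. The exact text strings returned are not stated in this manual, so no expected output is shown.

```r
library(survey)
library(nrba)
data(involvement_survey_srs, package = "nrba")

# A design that uses linearization-based variance estimation
lin_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  data = involvement_survey_srs
)
get_variance_method(lin_design)

# A replicate design: the 'type' element determines the description
rep_design <- as.svrepdesign(lin_design, type = "JK1")
get_variance_method(rep_design)
```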
Parent involvement survey: population data
Description
An example dataset describing a population of 20,000 students with disabilities in 20 school districts. This population is the basis for selecting a sample of students for a parent involvement survey.
Usage
involvement_survey_pop
Format
A data frame with 20,000 rows and 9 variables
Fields
- UNIQUE_ID
A unique identifier for students
- SCHOOL_DISTRICT
A unique identifier for school districts
- SCHOOL_ID
A unique identifier for schools, nested within districts
- STUDENT_GRADE
Student's grade level: 'PK', 'K', 1-12
- STUDENT_AGE
Student's age, measured in years
- STUDENT_DISABILITY_CODE
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
- STUDENT_DISABILITY_CATEGORY
Student's disability category (e.g. 'Visual Impairments')
- STUDENT_SEX
'Female' or 'Male'
- STUDENT_RACE
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
Examples
involvement_survey_pop

Parent involvement survey: simple random sample
Description
An example dataset describing a simple random sample of 5,000 parents of students with disabilities, from a population of 20,000. The parent involvement survey measures a single key outcome: whether "parents perceive that schools facilitate parent involvement as a means of improving services and results for children with disabilities."
The variable BASE_WEIGHT provides the base sampling weight. The variable N_STUDENTS can be used to provide a finite population correction for variance estimation.
Usage
involvement_survey_srs
Format
A data frame with 5,000 rows and 17 variables
Fields
- UNIQUE_ID
A unique identifier for students
- RESPONSE_STATUS
Survey response/eligibility status: 'Respondent', 'Nonrespondent', 'Ineligible', 'Unknown'
- WHETHER_PARENT_AGREES
Parent agreement ('AGREE' or 'DISAGREE') for whether they perceive that schools facilitate parent involvement
- SCHOOL_DISTRICT
A unique identifier for school districts
- SCHOOL_ID
A unique identifier for schools, nested within districts
- STUDENT_GRADE
Student's grade level: 'PK', 'K', 1-12
- STUDENT_AGE
Student's age, measured in years
- STUDENT_DISABILITY_CODE
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
- STUDENT_DISABILITY_CATEGORY
Student's disability category (e.g. 'Visual Impairments')
- STUDENT_SEX
'Female' or 'Male'
- STUDENT_RACE
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
- PARENT_HAS_EMAIL
Whether parent has an e-mail address ('Has Email' vs 'No Email')
- PARENT_HAS_EMAIL_BENCHMARK
Population benchmark for category of PARENT_HAS_EMAIL
- STUDENT_RACE_BENCHMARK
Population benchmark for category of STUDENT_RACE
- BASE_WEIGHT
Sampling weight to use for weighted estimates
- N_STUDENTS
Total number of students in the population
- CONTACT_ATTEMPTS
The number of contact attempts made for each case (ranges between 1 and 6)
Examples
involvement_survey_srs

Parent involvement survey: stratified, two-stage sample
Description
An example dataset describing a stratified, multistage sample of 1,000 parents of students with disabilities, from a population of 20,000. The parent involvement survey measures a single key outcome: whether "parents perceive that schools facilitate parent involvement as a means of improving services and results for children with disabilities."
The sample was selected by sampling 5 schools from each of 20 districts, and then sampling parents of 10 children in each sampled school. The variable BASE_WEIGHT provides the base sampling weight. The variable SCHOOL_DISTRICT was used for stratification, and the variables SCHOOL_ID and UNIQUE_ID uniquely identify the first and second stage sampling units (schools and parents). The variables N_SCHOOLS_IN_DISTRICT and N_STUDENTS_IN_SCHOOL can be used to provide finite population corrections.
Usage
involvement_survey_str2s
Format
A data frame with 1,000 rows and 18 variables
Fields
- UNIQUE_ID
A unique identifier for students
- RESPONSE_STATUS
Survey response/eligibility status: 'Respondent', 'Nonrespondent', 'Ineligible', 'Unknown'
- WHETHER_PARENT_AGREES
Parent agreement ('AGREE' or 'DISAGREE') for whether they perceive that schools facilitate parent involvement
- SCHOOL_DISTRICT
A unique identifier for school districts
- SCHOOL_ID
A unique identifier for schools, nested within districts
- STUDENT_GRADE
Student's grade level: 'PK', 'K', 1-12
- STUDENT_AGE
Student's age, measured in years
- STUDENT_DISABILITY_CODE
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
- STUDENT_DISABILITY_CATEGORY
Student's disability category (e.g. 'Visual Impairments')
- STUDENT_SEX
'Female' or 'Male'
- STUDENT_RACE
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
- PARENT_HAS_EMAIL
Whether parent has an e-mail address ('Has Email' vs 'No Email')
- PARENT_HAS_EMAIL_BENCHMARK
Population benchmark for category of PARENT_HAS_EMAIL
- STUDENT_RACE_BENCHMARK
Population benchmark for category of STUDENT_RACE
- N_SCHOOLS_IN_DISTRICT
Total number of schools in each district
- N_STUDENTS_IN_SCHOOL
Total number of students in each school
- BASE_WEIGHT
Sampling weight to use for weighted estimates
- CONTACT_ATTEMPTS
The number of contact attempts made for each case (ranges between 1 and 6)
Examples
# Load the data
involvement_survey_str2s

# Prepare the data for analysis with the 'survey' package
library(survey)

involvement_survey <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

Fit a regression model to predict survey outcomes
Description
A regression model is fit to the sample data to predict outcomes measured by a survey. This model can be used to identify auxiliary variables that are predictive of survey outcomes and hence are potentially useful for nonresponse bias analysis or weighting adjustments.
Only data from survey respondents will be used to fit the model, since survey outcomes are only measured among respondents.
The function returns a summary of the model, including overall tests for each variable of whether that variable improves the model's ability to predict the outcome in the population of interest (not just in the random sample at hand).
Usage
predict_outcome_via_glm(
  survey_design,
  outcome_variable,
  outcome_type = "continuous",
  outcome_to_predict = NULL,
  numeric_predictors = NULL,
  categorical_predictors = NULL,
  model_selection = "main-effects",
  selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L)
)
Arguments
survey_design | A survey design object created with the 'survey' package |
outcome_variable | Name of an outcome variable to use as the dependent variable in the model. |
outcome_type | Either 'continuous' or 'binary'. |
outcome_to_predict | Only required if outcome_type is 'binary'. The category of outcome_variable to predict. |
numeric_predictors | A list of names of numeric auxiliary variables to use for predicting response status. |
categorical_predictors | A list of names of categorical auxiliary variables to use for predicting response status. |
model_selection | A character string specifying how to select a model.The default and recommended method is 'main-effects', which simply includes main effectsfor each of the predictor variables. |
selection_controls | Only required if model_selection is 'stepwise'. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
If the user specifies model_selection = "stepwise", a regression model is selected by adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached or until either all variables have been added to the model or there are no unadded variables for which the likelihood ratio test has a p-value below alpha_enter.
Value
A data frame summarizing the fitted regression model.
Each row in the data frame represents a coefficient in the model. The column variable describes the underlying variable for the coefficient. For categorical variables, the column variable_category indicates the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient, se_coefficient, conf_intrvl_lower, conf_intrvl_upper, and p_value_coefficient are summary statistics for the estimated coefficient. Note that p_value_coefficient is based on the Wald t-test for the coefficient.
The column variable_level_p_value gives the p-value of the Rao-Scott Likelihood Ratio Test for including the variable in the model. This likelihood ratio test has its test statistic given by the column LRT_chisq_statistic, and the reference distribution for this test is a linear combination of p F-distributions with numerator degrees of freedom given by LRT_df_numerator and denominator degrees of freedom given by LRT_df_denominator, where p is the number of coefficients in the model corresponding to the variable being tested.
References
Lumley, T., and Scott, A. (2017). Fitting Regression Models to Survey Data. Statistical Science, 32(2), 265-278. https://doi.org/10.1214/16-STS605
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL")
)

Fit a logistic regression model to predict response to the survey
Description
A logistic regression model is fit to the sample data to predict whether an individual responds to the survey (i.e., is an eligible respondent) rather than a nonrespondent. Ineligible cases and cases with unknown eligibility status are not included in this model.
The function returns a summary of the model, including overall tests for each variable of whether that variable improves the model's ability to predict response status in the population of interest (not just in the random sample at hand).
This model can be used to identify auxiliary variables associated with response status and to compare multiple auxiliary variables in terms of their ability to predict response status.
Usage
predict_response_status_via_glm(
  survey_design,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  numeric_predictors = NULL,
  categorical_predictors = NULL,
  model_selection = "main-effects",
  selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L)
)

Arguments
survey_design | A survey design object created with the survey package. |
status | A character string giving the name of the variable representing response/eligibility status. The categories of this variable are specified via status_codes. |
status_codes | A named vector, with two entries named 'ER' and 'EN' indicating which values of the status variable represent an eligible respondent ('ER') and an eligible nonrespondent ('EN'). |
numeric_predictors | A list of names of numeric auxiliary variables to use for predicting response status. |
categorical_predictors | A list of names of categorical auxiliary variables to use for predicting response status. |
model_selection | A character string specifying how to select a model. The default and recommended method is 'main-effects', which simply includes main effects for each of the predictor variables. The option 'stepwise' uses stepwise selection, described in the Details section. |
selection_controls | Only required if model_selection = 'stepwise'. A named list of control parameters for stepwise selection: alpha_enter, alpha_remain, and max_iterations. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
If the user specifies model_selection = "stepwise", a regression model is selected by adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached, until all variables have been added to the model, or until none of the variables not yet in the model has a likelihood ratio test p-value below alpha_enter.
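The add/drop loop described above can be illustrated with an ordinary (unweighted) logistic regression and a standard likelihood ratio test. This is only a sketch of the generic stepwise idea on made-up data; the package's actual implementation uses survey-weighted models and Rao-Scott tests via regTermTest:

```r
set.seed(1)
d <- data.frame(
  y = rbinom(200, 1, 0.5),
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200)
)
candidates <- c("x1", "x2", "x3")
in_model <- character(0)
alpha_enter <- 0.5
alpha_remain <- 0.5

# p-value of the likelihood ratio test comparing two nested models
lrt_p <- function(small, big) anova(small, big, test = "LRT")[2, "Pr(>Chi)"]
# Fit a logistic regression with the given predictors (intercept-only if none)
fit <- function(vars) {
  glm(reformulate(if (length(vars)) vars else "1", "y"),
      family = binomial, data = d)
}

for (iter in 1:100) {
  changed <- FALSE
  # Add the excluded variable with the smallest p-value, if below alpha_enter
  out <- setdiff(candidates, in_model)
  if (length(out) > 0) {
    p_add <- sapply(out, function(v) lrt_p(fit(in_model), fit(c(in_model, v))))
    if (min(p_add) < alpha_enter) {
      in_model <- c(in_model, out[which.min(p_add)])
      changed <- TRUE
    }
  }
  # Drop the included variable with the largest p-value, if above alpha_remain
  if (length(in_model) > 0) {
    p_drop <- sapply(in_model, function(v) {
      lrt_p(fit(setdiff(in_model, v)), fit(in_model))
    })
    if (max(p_drop) > alpha_remain) {
      in_model <- setdiff(in_model, in_model[which.max(p_drop)])
      changed <- TRUE
    }
  }
  if (!changed) break
}
```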
Value
A data frame summarizing the fitted logistic regression model.
Each row in the data frame represents a coefficient in the model. The column variable describes the underlying variable for the coefficient. For categorical variables, the column variable_category indicates the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient, se_coefficient, conf_intrvl_lower, conf_intrvl_upper, and p_value_coefficient are summary statistics for the estimated coefficient. Note that p_value_coefficient is based on the Wald t-test for the coefficient.
The column variable_level_p_value gives the p-value of the Rao-Scott Likelihood Ratio Test for including the variable in the model. This likelihood ratio test has its test statistic given by the column LRT_chisq_statistic, and the reference distribution for this test is a linear combination of p F-distributions with numerator degrees of freedom given by LRT_df_numerator and denominator degrees of freedom given by LRT_df_denominator, where p is the number of coefficients in the model corresponding to the variable being tested.
References
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

predict_response_status_via_glm(
  survey_design = survey_design,
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("PARENT_HAS_EMAIL", "STUDENT_GRADE")
)

Re-weight data to match population benchmarks, using raking or post-stratification
Description
Adjusts weights in the data to ensure that estimated population totals for grouping variables match known population benchmarks. If there is only one grouping variable, simple post-stratification is used. If there are multiple grouping variables, raking (also known as iterative post-stratification) is used.
Usage
rake_to_benchmarks(
  survey_design,
  group_vars,
  group_benchmark_vars,
  max_iterations = 100,
  epsilon = 5e-06
)

Arguments
survey_design | A survey design object created with the survey package. |
group_vars | Names of grouping variables in the data dividing the sample into groups for which benchmark data are available. These variables cannot have any missing values. |
group_benchmark_vars | Names of group benchmark variables in the data corresponding to group_vars. |
max_iterations | If there are multiple grouping variables, then raking is used rather than post-stratification. The parameter max_iterations controls the maximum number of raking iterations. |
epsilon | If raking is used, convergence for a given margin is declared if the maximum change in a re-weighted total is less than epsilon. |
Details
Raking adjusts the weight assigned to each sample member so that, after reweighting, the weighted sample percentages for population subgroups match their known population percentages. In a sense, raking causes the sample to more closely resemble the population in terms of variables for which population sizes are known.
Raking can be useful for reducing nonresponse bias caused by groups being overrepresented in the responding sample relative to their population size. If the population subgroups systematically differ in terms of outcome variables of interest, then raking can also help reduce sampling variances. However, when population subgroups do not differ in terms of outcome variables of interest, raking may increase sampling variances.
There are two basic requirements for raking.
Basic Requirement 1 - Values of the grouping variable(s) must be known for all respondents.
Basic Requirement 2 - The population size of each group must be known (or precisely estimated).
When there is effectively only one grouping variable (though this variable can be defined as a combination of other variables), raking amounts to simple post-stratification. For example, simple post-stratification would be used if the grouping variable is "Age x Sex x Race", and the population size of each combination of age, sex, and race is known. The method of "iterative post-stratification" (also known as "iterative proportional fitting") is used when there are multiple grouping variables, and population sizes are known for each grouping variable but not for combinations of grouping variables. For example, iterative proportional fitting would be necessary if population sizes are known for age groups and for gender categories, but not for combinations of age groups and gender categories.
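Iterative proportional fitting itself is simple to sketch in base R. The toy weights and benchmark totals below are hypothetical; the package's rake_to_benchmarks() function handles this (and the survey design bookkeeping) for you:

```r
# Toy sample: two grouping variables and initial weights
grp_a <- c("x", "x", "x", "y", "y", "y")
grp_b <- c("u", "v", "u", "v", "u", "v")
weights <- rep(10, 6)

# Known (hypothetical) population totals for each margin
totals_a <- c(x = 40, y = 80)
totals_b <- c(u = 70, v = 50)

# Alternately rescale weights so each margin's weighted totals
# match its benchmarks, until both margins match at once
for (iter in 1:100) {
  weights <- weights * totals_a[grp_a] / tapply(weights, grp_a, sum)[grp_a]
  weights <- weights * totals_b[grp_b] / tapply(weights, grp_b, sum)[grp_b]
  if (max(abs(tapply(weights, grp_a, sum) - totals_a)) < 1e-8) break
}
```

After convergence, the weighted totals for both grouping variables match their benchmarks simultaneously, which is exactly the raking condition described above.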
Value
A survey design object with raked or post-stratified weights
Examples
# Load the survey data
data(involvement_survey_srs, package = "nrba")

# Calculate population benchmarks
population_benchmarks <- list(
  "PARENT_HAS_EMAIL" = data.frame(
    PARENT_HAS_EMAIL = c("Has Email", "No Email"),
    PARENT_HAS_EMAIL_POP_BENCHMARK = c(17036, 2964)
  ),
  "STUDENT_RACE" = data.frame(
    STUDENT_RACE = c(
      "AM7 (American Indian or Alaska Native)",
      "AS7 (Asian)",
      "BL7 (Black or African American)",
      "HI7 (Hispanic or Latino Ethnicity)",
      "MU7 (Two or More Races)",
      "PI7 (Native Hawaiian or Other Pacific Islander)",
      "WH7 (White)"
    ),
    STUDENT_RACE_POP_BENCHMARK = c(206, 258, 3227, 1097, 595, 153, 14464)
  )
)

# Add the population benchmarks as variables in the data
involvement_survey_srs <- merge(
  x = involvement_survey_srs,
  y = population_benchmarks$PARENT_HAS_EMAIL,
  by = "PARENT_HAS_EMAIL"
)
involvement_survey_srs <- merge(
  x = involvement_survey_srs,
  y = population_benchmarks$STUDENT_RACE,
  by = "STUDENT_RACE"
)

# Create a survey design object
library(survey)

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Subset data to only include respondents
survey_respondents <- subset(
  survey_design,
  RESPONSE_STATUS == "Respondent"
)

# Rake to the benchmarks
raked_survey_design <- rake_to_benchmarks(
  survey_design = survey_respondents,
  group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"),
  group_benchmark_vars = c(
    "PARENT_HAS_EMAIL_POP_BENCHMARK",
    "STUDENT_RACE_POP_BENCHMARK"
  )
)

# Inspect estimates from respondents, before and after raking
svymean(x = ~PARENT_HAS_EMAIL, design = survey_respondents)
svymean(x = ~PARENT_HAS_EMAIL, design = raked_survey_design)
svymean(x = ~WHETHER_PARENT_AGREES, design = survey_respondents)
svymean(x = ~WHETHER_PARENT_AGREES, design = raked_survey_design)

Select and fit a model using stepwise regression
Description
A regression model is selected by iteratively adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached, until all variables have been added to the model, or until none of the variables not yet in the model has a likelihood ratio test p-value below alpha_enter.
Stepwise model selection generally invalidates inferential statistics such as p-values, standard errors, and confidence intervals, and it leads to overestimation of the size of coefficients for variables included in the selected model. This bias increases as the value of alpha_enter or alpha_remain decreases. The use of stepwise model selection should be limited to reducing a large list of candidate variables for nonresponse adjustment.
Usage
stepwise_model_selection(
  survey_design,
  outcome_variable,
  predictor_variables,
  model_type = "binary-logistic",
  max_iterations = 100L,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)

Arguments
survey_design | A survey design object created with the survey package. |
outcome_variable | The name of an outcome variable to use as the dependent variable. |
predictor_variables | A list of names of variables to consider as predictors for the model. |
model_type | A character string describing the type of model to fit. |
max_iterations | Maximum number of iterations to try adding new variables to the model. |
alpha_enter | The maximum p-value allowed for a variable to be added to the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
alpha_remain | The maximum p-value allowed for a variable to remain in the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
See Sauerbrei et al. (2020) for a discussion of statistical issues with using stepwise model selection.
Value
An object of class svyglm representing a regression model fit using the 'survey' package.
References
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
Sauerbrei, W., Perperoglou, A., Schmid, M. et al. (2020). State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagnostic and Prognostic Research 4, 3. https://doi.org/10.1186/s41512-020-00074-3
Examples
library(survey)

# Load example data and prepare it for analysis
data(involvement_survey_str2s, package = "nrba")

involvement_survey <- svydesign(
  data = involvement_survey_str2s,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  strata = ~SCHOOL_DISTRICT,
  weights = ~BASE_WEIGHT
)

involvement_survey <- involvement_survey |>
  transform(WHETHER_PARENT_AGREES = factor(WHETHER_PARENT_AGREES))

# Fit a regression model using stepwise selection
selected_model <- stepwise_model_selection(
  survey_design = involvement_survey,
  outcome_variable = "WHETHER_PARENT_AGREES",
  predictor_variables = c("STUDENT_RACE", "STUDENT_DISABILITY_CATEGORY"),
  model_type = "binary-logistic",
  max_iterations = 100,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)

t-test of differences in means/percentages between responding sample and full sample, or between responding sample and eligible sample
Description
The function t_test_resp_vs_full tests whether means of auxiliary variables differ between respondents and the full selected sample, where the full sample consists of all cases regardless of response status or eligibility status.
The function t_test_resp_vs_elig tests whether means differ between the responding sample and the eligible sample, where the eligible sample consists of all cases known to be eligible, regardless of response status.
See Lohr and Riddles (2016) for the statistical theory of this test.
Usage
t_test_resp_vs_full(
  survey_design,
  y_vars,
  na.rm = TRUE,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

t_test_resp_vs_elig(
  survey_design,
  y_vars,
  na.rm = TRUE,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

Arguments
survey_design | A survey design object created with the survey package. |
y_vars | Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm | Whether to drop cases with missing values for a given dependent variable. |
status | The name of the variable representing response/eligibility status. |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
null_difference | The difference between the two means under the null hypothesis. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference in means is equal to null_difference;
'less': a one-sided test of whether the difference in means is less than null_difference;
'greater': a one-sided test of whether the difference in means is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf method. |
Value
A data frame describing the results of the t-tests, one row per dependent variable.
Statistical Details
The t-statistic used for the test has as its numerator the difference in means between the two samples, minus the null_difference. The denominator for the t-statistic is the estimated standard error of the difference in means. Because the two means are based on overlapping groups and thus have correlated sampling errors, special care is taken to estimate the covariance of the two estimates. For designs which use sets of replicate weights for variance estimation, the two means and their difference are estimated using each set of replicate weights; the estimated differences from the sets of replicate weights are then used to estimate sampling error with a formula appropriate to the replication method (JKn, BRR, etc.). For designs which use linearization methods for variance estimation, the covariance between the two means is estimated using the method of linearization based on influence functions implemented in the survey package. See Osier (2009) for an overview of the method of linearization based on influence functions. Eckman et al. (2023) showed in a simulation study that linearization and replication performed similarly in estimating the variance of a difference in means for overlapping samples.
Unless specified otherwise using the degrees_of_freedom parameter, the degrees of freedom for the test are set to the design degrees of freedom minus one. Design degrees of freedom are estimated using the survey package's degf method.
See Lohr and Riddles (2016) for the statistical details of this test. See Van de Kerckhove et al. (2009) and Amaya and Presser (2017) for examples of nonresponse bias analyses which use t-tests to compare responding samples to eligible samples.
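Given the estimated difference, its standard error, and the design degrees of freedom, the t-statistic and p-values take the usual form. A base-R sketch with hypothetical numbers (in practice all of these come from the survey design):

```r
# Hypothetical estimates (in practice computed from the survey design)
y_bar_resp <- 0.62   # respondent mean
y_bar_full <- 0.58   # full-sample mean
se_diff <- 0.015     # SE of the difference, accounting for their covariance
degf_design <- 30    # design degrees of freedom
null_difference <- 0

t_stat <- (y_bar_resp - y_bar_full - null_difference) / se_diff
df <- degf_design - 1

# p-values for the two-sided ('unequal') and one-sided alternatives
p_unequal <- 2 * pt(-abs(t_stat), df)
p_greater <- pt(t_stat, df, lower.tail = FALSE)
p_less <- pt(t_stat, df, lower.tail = TRUE)
```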
References
Amaya, A., Presser, S. (2017). Nonresponse Bias for Univariate and Multivariate Estimates of Social Activities and Roles. Public Opinion Quarterly, Volume 81, Issue 1, 1 March 2017, Pages 1–36. https://doi.org/10.1093/poq/nfw037
Eckman, S., Unangst, J., Dever, J., Antoun, A. (2023). The Precision of Estimates of Nonresponse Bias in Means. Journal of Survey Statistics and Methodology, 11(4), 758-783. https://doi.org/10.1093/jssam/smac019
Lohr, S., Riddles, M. (2016). Tests for Evaluating Nonresponse Bias in Surveys. Survey Methodology 42(2): 195-218. https://www150.statcan.gc.ca/n1/pub/12-001-x/2016002/article/14677-eng.pdf
Osier, G. (2009). Variance estimation for complex indicators of poverty and inequality using linearization techniques. Survey Research Methods, 3(3), 167-195. https://doi.org/10.18148/srm/2009.v3i3.369
Van de Kerckhove, W., Krenzke, T., and Mohadjer, L. (2009). Adult Literacy and Lifeskills Survey (ALL) 2003: U.S. Nonresponse Bias Analysis (NCES 2009-063). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_srs, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Compare respondents' mean to the full sample mean ----
t_test_resp_vs_full(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  )
)

# Compare respondents' mean to the mean of all eligible cases ----
t_test_resp_vs_elig(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  )
)

# One-sided tests ----
## Null Hypothesis: Y_bar_resp - Y_bar_full <= 0.1
## Alt. Hypothesis: Y_bar_resp - Y_bar_full > 0.1
t_test_resp_vs_full(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  null_difference = 0.1,
  alternative = "greater"
)

## Null Hypothesis: Y_bar_resp - Y_bar_full >= 0.1
## Alt. Hypothesis: Y_bar_resp - Y_bar_full < 0.1
t_test_resp_vs_full(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  null_difference = 0.1,
  alternative = "less"
)

t-test of differences in estimated means/percentages from two different sets of replicate weights.
Description
Tests whether estimates of means/percentages differ systematically between two sets of replicate weights: an original set of weights, and the weights after adjustment (e.g. post-stratification or nonresponse adjustments) and possibly subsetting (e.g. subsetting to only include respondents).
Usage
t_test_of_weight_adjustment(
  orig_design,
  updated_design,
  y_vars,
  na.rm = TRUE,
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = NULL
)

Arguments
orig_design | A replicate design object created with the survey package. |
updated_design | A potentially updated version of orig_design, for example with weights adjusted through post-stratification or subset to only include respondents. |
y_vars | Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm | Whether to drop cases with missing values for a given dependent variable. |
null_difference | The difference between the two means/percentages under the null hypothesis. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference in means/percentages is equal to null_difference;
'less': a one-sided test of whether the difference is less than null_difference;
'greater': a one-sided test of whether the difference is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf method. |
Value
A data frame describing the results of the t-tests, one row per dependent variable.
Statistical Details
The t-statistic used for the test has as its numerator the difference in means/percentages between the two samples, minus the null_difference. The denominator for the t-statistic is the estimated standard error of the difference in means. Because the two means are based on overlapping groups and thus have correlated sampling errors, special care is taken to estimate the covariance of the two estimates. For designs which use sets of replicate weights for variance estimation, the two means and their difference are estimated using each set of replicate weights; the estimated differences from the sets of replicate weights are then used to estimate sampling error with a formula appropriate to the replication method (JKn, BRR, etc.).
This analysis is not implemented for designs which use linearization methods for variance estimation.
Unless specified otherwise using the degrees_of_freedom parameter, the degrees of freedom for the test are set to the design degrees of freedom minus one. Design degrees of freedom are estimated using the survey package's degf method.
See Van de Kerckhove et al. (2009) for an example of this type of nonresponse bias analysis (among others). See Lohr and Riddles (2016) for the statistical details of this test.
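The replication step described above can be sketched in base R. The replicate estimates of the difference below are made up, and a JK1-style scale factor is assumed purely for illustration; the appropriate factor depends on the replication method:

```r
# Hypothetical full-sample estimate of the difference, plus the same
# difference re-estimated with each of 8 sets of replicate weights
diff_full <- 0.040
diff_reps <- c(0.037, 0.045, 0.041, 0.036, 0.043, 0.039, 0.044, 0.035)

# JK1-style scale factor (assumed here; depends on the replication method)
scale_factor <- (length(diff_reps) - 1) / length(diff_reps)

# Replication variance: scaled squared deviations of the replicate
# estimates around the full-sample estimate
var_diff <- scale_factor * sum((diff_reps - diff_full)^2)
se_diff <- sqrt(var_diff)
t_stat <- (diff_full - 0) / se_diff  # null_difference = 0
```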
References
Lohr, S., Riddles, M. (2016). Tests for Evaluating Nonresponse Bias in Surveys. Survey Methodology 42(2): 195-218. https://www150.statcan.gc.ca/n1/pub/12-001-x/2016002/article/14677-eng.pdf
Van de Kerckhove, W., Krenzke, T., and Mohadjer, L. (2009).Adult Literacy and Lifeskills Survey (ALL) 2003: U.S. Nonresponse Bias Analysis (NCES 2009-063). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_srs, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Create replicate weights for the design ----
rep_svy_design <- as.svrepdesign(survey_design,
  type = "subbootstrap",
  replicates = 500
)

# Subset to only respondents (always subset *after* creating replicate weights)
rep_svy_respondents <- subset(rep_svy_design, RESPONSE_STATUS == "Respondent")

# Apply raking adjustment ----
raked_rep_svy_respondents <- rake_to_benchmarks(
  survey_design = rep_svy_respondents,
  group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"),
  group_benchmark_vars = c("PARENT_HAS_EMAIL_BENCHMARK", "STUDENT_RACE_BENCHMARK")
)

# Compare estimates from respondents in original vs. adjusted design ----
t_test_of_weight_adjustment(
  orig_design = rep_svy_respondents,
  updated_design = raked_rep_svy_respondents,
  y_vars = c("STUDENT_AGE", "STUDENT_SEX")
)
t_test_of_weight_adjustment(
  orig_design = rep_svy_respondents,
  updated_design = raked_rep_svy_respondents,
  y_vars = c("WHETHER_PARENT_AGREES")
)

# Compare estimates to true population values ----
data("involvement_survey_pop", package = "nrba")

mean(involvement_survey_pop$STUDENT_AGE)
prop.table(table(involvement_survey_pop$STUDENT_SEX))

Test for differences in means/percentages between two potentially overlapping groups
Description
Test for differences in means/percentages between two potentially overlapping groups
Usage
t_test_overlap(
  survey_design,
  y_vars,
  na.rm = TRUE,
  status,
  group_1,
  group_2,
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

Arguments
survey_design | A survey design object created with the survey package. |
y_vars | Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm | Whether to drop cases with missing values for a given dependent variable. |
status | The name of the variable representing response/eligibility status. |
group_1 | Vector of values of the status variable which define the first group. |
group_2 | Vector of values of the status variable which define the second group. |
null_difference | The hypothesized difference between the groups' means. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference in group means is equal to null_difference;
'less': a one-sided test of whether the difference is less than null_difference;
'greater': a one-sided test of whether the difference is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf method. |
Value
A data frame describing the difference in group means/percentages and the statistics from the t-test
t-test of differences in means/percentages relative to external estimates
Description
Compare estimated means/percentages from the present survey to external estimates from a benchmark source. A t-test is used to evaluate whether the survey's estimates differ from the external estimates.
Usage
t_test_vs_external_estimate(
  survey_design,
  y_var,
  ext_ests,
  ext_std_errors = NULL,
  na.rm = TRUE,
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

Arguments
survey_design | A survey design object created with the survey package. |
y_var | Name of dependent variable. For categorical variables, percentages of each category are tested. |
ext_ests | A numeric vector containing the external estimate of the mean for the dependent variable. If y_var is a categorical variable, a named vector of external estimates must be supplied, one per category. |
ext_std_errors | (Optional) The standard errors of the external estimates. This is useful if the external data are estimated with an appreciable level of uncertainty, for instance if the external data come from a survey with a small-to-moderate sample size. If supplied, the variance of the difference between the survey and external estimates is estimated by adding the variance of the external estimates to the estimated variance of the survey's estimates. |
na.rm | Whether to drop cases with missing values for |
null_difference | The hypothesized difference between the estimate and the external mean. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference between the survey estimate and the external estimate is equal to null_difference;
'less': a one-sided test of whether the difference is less than null_difference;
'greater': a one-sided test of whether the difference is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf function. |
Value
A data frame describing the results of the t-tests, one row per mean being compared.
References
See Brick and Bose (2001) for an example of this analysis method and a discussion of its limitations.
Brick, M., and Bose, J. (2001). Analysis of Potential Nonresponse Bias. In Proceedings of the Section on Survey Research Methods. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/y2001/Proceed/00021.pdf
Examples
library(survey)

# Create a survey design ----
data("involvement_survey_str2s", package = "nrba")

involvement_survey_sample <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

# Subset to only include survey respondents ----
involvement_survey_respondents <- subset(
  involvement_survey_sample,
  RESPONSE_STATUS == "Respondent"
)

# Test whether percentages of categorical variable differ from benchmark ----
parent_email_benchmark <- c(
  "Has Email" = 0.85,
  "No Email" = 0.15
)
t_test_vs_external_estimate(
  survey_design = involvement_survey_respondents,
  y_var = "PARENT_HAS_EMAIL",
  ext_ests = parent_email_benchmark
)

# Test whether the sample mean differs from the population benchmark ----
average_age_benchmark <- 11
t_test_vs_external_estimate(
  survey_design = involvement_survey_respondents,
  y_var = "STUDENT_AGE",
  ext_ests = average_age_benchmark,
  null_difference = 0
)

Adjust weights in a replicate design for nonresponse or unknown eligibility status, using weighting classes
Description
Updates weights in a survey design object to adjust for nonresponse and/or unknown eligibility using the method of weighting class adjustment. For unknown eligibility adjustments, the weight in each class is set to zero for cases with unknown eligibility, and the weight of all other cases in the class is increased so that the total weight is unchanged. For nonresponse adjustments, the weight in each class is set to zero for cases classified as eligible nonrespondents, and the weight of eligible respondent cases in the class is increased so that the total weight is unchanged.
This function currently only works for survey designs with replicate weights, since the linearization-based estimators included in the survey package (or in Stata or SAS, for that matter) are unable to fully reflect the impact of nonresponse adjustment. Adjustments are made to both the full-sample weights and all of the sets of replicate weights.
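The weight redistribution described above can be sketched in base R for a nonresponse ('NR') adjustment. The data are made up, and this ignores the replicate weights that wt_class_adjust() also adjusts:

```r
# Toy data: response status and a weighting class for each sampled case
status <- c("ER", "ER", "EN", "ER", "EN", "IE")
wt_class <- c("A", "A", "A", "B", "B", "B")
weights <- c(10, 10, 10, 20, 20, 20)

# Within each class, zero out eligible nonrespondents ('EN') and scale up
# eligible respondents ('ER') so the class's total weight is unchanged
adjusted <- weights
for (cl in unique(wt_class)) {
  in_class <- wt_class == cl
  lost <- sum(weights[in_class & status == "EN"])
  kept <- sum(weights[in_class & status == "ER"])
  adjusted[in_class & status == "EN"] <- 0
  adjusted[in_class & status == "ER"] <-
    weights[in_class & status == "ER"] * (kept + lost) / kept
}
```

Ineligible cases are left untouched, so the total weight in each class (and overall) is preserved while nonrespondents' weight is redistributed to respondents.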
Usage
wt_class_adjust(
  survey_design,
  status,
  status_codes,
  wt_class = NULL,
  type = c("UE", "NR")
)

Arguments
survey_design | A replicate survey design object created with the survey package. |
status | A character string giving the name of the variable representing response/eligibility status. |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
wt_class | (Optional) A character string giving the name of the variable which divides sample cases into weighting classes. |
type | A character vector including one or more of the following options:
'UE': adjust for unknown eligibility;
'NR': adjust for nonresponse.
|
Details
See the vignette "Nonresponse Adjustments" from the svrep package for a step-by-step walkthrough of nonresponse weighting adjustments in R: vignette(topic = "nonresponse-adjustments", package = "svrep")
Value
A replicate survey design object, with adjusted full-sample and replicate weights
References
See Chapter 2 of Heeringa, West, and Berglund (2017) or Chapter 13 of Valliant, Dever, and Kreuter (2018) for an overview of nonresponse adjustment methods based on redistributing weights.
Heeringa, S., West, B., Berglund, P. (2017). "Applied Survey Data Analysis, 2nd edition." Boca Raton, FL: CRC Press.
Valliant, R., Dever, J., Kreuter, F. (2018). "Practical Tools for Designing and Weighting Survey Samples, 2nd edition." New York: Springer.
See Also
svrep::redistribute_weights(), vignette(topic = "nonresponse-adjustments", package = "svrep")
Examples
library(survey)

# Load an example dataset
data("involvement_survey_str2s", package = "nrba")

# Create a survey design object
involvement_survey_sample <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

rep_design <- as.svrepdesign(involvement_survey_sample, type = "mrbbootstrap")

# Adjust weights for nonresponse within weighting classes
nr_adjusted_design <- wt_class_adjust(
  survey_design = rep_design,
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  wt_class = "PARENT_HAS_EMAIL",
  type = "NR"
)