| Title: | Methods for Conducting Nonresponse Bias Analysis (NRBA) |
| Version: | 0.3.1 |
| Description: | Facilitates nonresponse bias analysis (NRBA) for survey data. Such data may arise from a complex sampling design with features such as stratification, clustering, or unequal probabilities of selection. Multiple types of analyses may be conducted: comparisons of response rates across subgroups; comparisons of estimates before and after weighting adjustments; comparisons of sample-based estimates to external population totals; tests of systematic differences in covariate means between respondents and full samples; tests of independence between response status and covariates; and modeling of outcomes and response status as a function of covariates. Extensive documentation and references are provided for each type of analysis. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the methods implemented in this package. |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.2.3 |
| Imports: | broom, dplyr, magrittr, rlang, srvyr, stats, survey (≥ 4.1-1), svrep, tidyr |
| Suggests: | knitr, rmarkdown, stringr, testthat (≥ 3.0.0), tibble |
| Config/testthat/edition: | 3 |
| Depends: | R (≥ 4.1.0) |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2023-11-21 01:54:42 UTC; schneider_b |
| Author: | Ben Schneider |
| Maintainer: | Ben Schneider <BenjaminSchneider@westat.com> |
| Repository: | CRAN |
| Date/Publication: | 2023-11-21 05:10:02 UTC |
nrba: Methods for Conducting Nonresponse Bias Analysis (NRBA)
Description

Facilitates nonresponse bias analysis (NRBA) for survey data. Such data may arise from a complex sampling design with features such as stratification, clustering, or unequal probabilities of selection. Multiple types of analyses may be conducted: comparisons of response rates across subgroups; comparisons of estimates before and after weighting adjustments; comparisons of sample-based estimates to external population totals; tests of systematic differences in covariate means between respondents and full samples; tests of independence between response status and covariates; and modeling of outcomes and response status as a function of covariates. Extensive documentation and references are provided for each type of analysis. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the methods implemented in this package.
Author(s)
Maintainer: Ben Schneider <BenjaminSchneider@westat.com> (ORCID)
Authors:
Jim Green <JimGreen@westat.com>
Shelley Brock <ShelleyBrock@westat.com> (Author of original SAS macro, WesNRBA)
Tom Krenzke <TomKrenzke@westat.com> (Author of original SAS macro, WesNRBA)
Michael Jones <MichaelJones@westat.com> (Author of original SAS macro, WesNRBA)
Wendy Van de Kerckhove (Author of original SAS macro, WesNRBA)
David Ferraro (Author of original SAS macro, WesNRBA)
Laura Alvarez-Rojas (Author of original SAS macro, WesNRBA)
Katie Hubbell <KatieHubbell@westat.com> (Author of original SAS macro, WesNRBA)
Other contributors:
Westat [copyright holder]
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs | A value or the magrittr placeholder. |
rhs | A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
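A minimal illustration of the operator (the values below are arbitrary and chosen only for demonstration):

```r
library(magrittr)

# lhs %>% rhs is evaluated as rhs(lhs)
x <- c(1, 4, 9)
sum(x)       # nested style
x %>% sum()  # pipe style; same result
```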
Assess the range of possible bias based on specified assumptions about how nonrespondents differ from respondents
Description
This range-of-bias analysis assesses the range of possible nonresponse bias under varying assumptions about how nonrespondents differ from respondents. The range of potential bias is calculated for both unadjusted estimates (i.e., from using base weights) and nonresponse-adjusted estimates (i.e., based on nonresponse-adjusted weights).
Usage
assess_range_of_bias(
  survey_design,
  y_var,
  comparison_cell,
  status,
  status_codes,
  assumed_multiple = c(0.5, 0.75, 0.9, 1.1, 1.25, 1.5),
  assumed_percentile = NULL
)
Arguments
survey_design | A survey design object created with the 'survey' package |
y_var | Name of a variable whose mean or proportion is to be estimated |
comparison_cell | (Optional) The name of a variable in the data dividing the sample into cells. If supplied, then the analysis is based on assumptions about differences between respondents and nonrespondents within the same cell. Typically, the variable used is a nonresponse adjustment cell or post-stratification variable. |
status | A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
assumed_multiple | One or more numeric values. Within each nonresponse adjustment cell, the mean for nonrespondents is assumed to be a specified multiple of the mean for respondents. |
assumed_percentile | One or more numeric values, ranging from 0 to 1. Within each nonresponse adjustment cell, the mean of a continuous variable among nonrespondents is assumed to equal a specified percentile of the variable among respondents. |
Value
A data frame summarizing the range of bias under each assumption. For a numeric outcome variable, there is one row per value of assumed_multiple or assumed_percentile. For a categorical outcome variable, there is one row per combination of category and assumed_multiple or assumed_percentile.
The column bias_of_unadj_estimate is the nonresponse bias of the estimate from respondents produced using the unadjusted weights. The column bias_of_adj_estimate is the nonresponse bias of the estimate from respondents produced using nonresponse-adjusted weights, based on a weighting-class adjustment with comparison_cell as the weighting class variable. If no comparison_cell is specified, the two bias estimates will be the same.
References
See Petraglia et al. (2016) for an example of a range-of-bias analysis using these methods.
Petraglia, E., Van de Kerckhove, W., and Krenzke, T. (2016).Review of the Potential for Nonresponse Bias in FoodAPS 2012.Prepared for the Economic Research Service,U.S. Department of Agriculture. Washington, D.C.
Examples
# Load example data
suppressPackageStartupMessages(library(survey))
data(api)

base_weights_design <- svydesign(
  data = apiclus1,
  id = ~dnum,
  weights = ~pw,
  fpc = ~fpc
) |> as.svrepdesign(type = "JK1")

base_weights_design$variables$response_status <- sample(
  x = c("Respondent", "Nonrespondent"),
  prob = c(0.75, 0.25),
  size = nrow(base_weights_design),
  replace = TRUE
)

# Assess range of bias for mean of `api00`,
# based on assuming nonrespondent means are equal to the
# 25th percentile or 75th percentile among respondents,
# within nonresponse adjustment cells
assess_range_of_bias(
  survey_design = base_weights_design,
  y_var = "api00",
  comparison_cell = "stype",
  status = "response_status",
  status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent",
                   "IE" = "Ineligible", "UE" = "Unknown"),
  assumed_percentile = c(0.25, 0.75)
)

# Assess range of bias for proportions of `sch.wide`,
# based on assuming nonrespondent proportions are equal to
# some multiple of respondent proportions,
# within nonresponse adjustment cells
assess_range_of_bias(
  survey_design = base_weights_design,
  y_var = "sch.wide",
  comparison_cell = "stype",
  status = "response_status",
  status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent",
                   "IE" = "Ineligible", "UE" = "Unknown"),
  assumed_multiple = c(0.25, 0.75)
)

Calculate Response Rates
Description
Calculates response rates using one of the response rate formulas defined by AAPOR (American Association for Public Opinion Research).
Usage
calculate_response_rates(
  data,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  weights,
  rr_formula = "RR3",
  elig_method = "CASRO-subgroup",
  e = NULL
)
Arguments
data | A data frame containing the selected sample, one row per case. |
status | A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
weights | (Optional) A character string giving the name of a variable representing weights in the datato use for calculating weighted response rates |
rr_formula | A character vector including any of the following: 'RR1', 'RR3', and 'RR5'. |
elig_method | If 'CASRO-overall' or 'CASRO-subgroup', the eligibility rate e used for RR3 is estimated using the CASRO formula described in the Formulas section below; if 'specified', the value supplied via e is used. |
e | (Required if elig_method is 'specified'.) An estimate of the eligibility rate among cases with unknown eligibility. |
Value
Output consists of a data frame giving weighted and unweighted response rates. The following columns may be included, depending on the arguments supplied:
RR1_Unweighted
RR1_Weighted
RR3_Unweighted
RR3_Weighted
RR5_Unweighted
RR5_Weighted
n: Total sample size
Nhat: Sum of weights for the total sample
n_ER: Number of eligible respondents
Nhat_ER: Sum of weights for eligible respondents
n_EN: Number of eligible nonrespondents
Nhat_EN: Sum of weights for eligible nonrespondents
n_IE: Number of ineligible cases
Nhat_IE: Sum of weights for ineligible cases
n_UE: Number of cases whose eligibility is unknown
Nhat_UE: Sum of weights for cases whose eligibility is unknown
e_unwtd: If RR3 is calculated, the eligibility rate estimate e used for the unweighted response rate
e_wtd: If RR3 is calculated, the eligibility rate estimate e used for the weighted response rate
If the data frame is grouped (i.e., by using df %>% group_by(Region)), then the output contains one row per subgroup.
Formulas
Denote the sample totals as follows:
ER: Total number of eligible respondents
EN: Total number of eligible non-respondents
IE: Total number of ineligible cases
UE: Total number of cases whose eligibility is unknown
For weighted response rates, these totals are calculated using weights.
The response rate formulas are then as follows:
RR1 = ER / ( ER + EN + UE )
RR1 essentially assumes that all cases with unknown eligibility are in fact eligible.
RR3 = ER / ( ER + EN + (e * UE) )
RR3 uses an estimate, e, of the eligibility rate among cases with unknown eligibility.
RR5 = ER / ( ER + EN )
RR5 essentially assumes that all cases with unknown eligibility are in fact ineligible.
For RR3, an estimate, e, of the eligibility rate among cases with unknown eligibility must be used. AAPOR strongly recommends that the basis for the estimate be explicitly stated and detailed.
The CASRO methods, which may be appropriate depending on the design, use the formula e = 1 - ( IE / (ER + EN + IE) ).
For elig_method='CASRO-overall', an estimate is calculated for the overall sample, and this single estimate is used when calculating response rates for subgroups.
For elig_method='CASRO-subgroup', estimates are calculated separately for each subgroup.
Please consult AAPOR's current Standard Definitions for in-depth explanations.
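As a sketch of the formulas above, with hypothetical disposition counts (these numbers are illustrative only and come from no real survey):

```r
# Hypothetical disposition counts (illustrative only)
ER <- 800  # eligible respondents
EN <- 150  # eligible nonrespondents
IE <- 30   # known ineligible cases
UE <- 20   # cases with unknown eligibility

# CASRO estimate of the eligibility rate among unknown-eligibility cases
e <- 1 - (IE / (ER + EN + IE))

RR1 <- ER / (ER + EN + UE)      # all unknowns treated as eligible
RR3 <- ER / (ER + EN + e * UE)  # unknowns discounted by estimated eligibility
RR5 <- ER / (ER + EN)           # all unknowns treated as ineligible
```

Because the three formulas differ only in how much of UE enters the denominator, RR1 <= RR3 <= RR5 holds for any nonnegative counts.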
References
The American Association for Public Opinion Research. 2016. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. 9th edition. AAPOR.
Examples
# Load example data
data(involvement_survey_srs, package = "nrba")

involvement_survey_srs[["RESPONSE_STATUS"]] <- sample(1:4, size = 5000, replace = TRUE)

# Calculate overall response rates
involvement_survey_srs %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    weights = "BASE_WEIGHT",
    rr_formula = "RR3",
    elig_method = "CASRO-overall"
  )

# Calculate response rates by subgroup
library(dplyr)
involvement_survey_srs %>%
  group_by(STUDENT_RACE, STUDENT_SEX) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    weights = "BASE_WEIGHT",
    rr_formula = "RR3",
    elig_method = "CASRO-overall"
  )

# Compare alternative approaches for handling of cases with unknown eligibility
involvement_survey_srs %>%
  group_by(STUDENT_RACE) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "CASRO-overall"
  )

involvement_survey_srs %>%
  group_by(STUDENT_RACE) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "CASRO-subgroup"
  )

involvement_survey_srs %>%
  group_by(STUDENT_RACE) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "specified",
    e = 0.5
  )

involvement_survey_srs %>%
  transform(e_by_email = ifelse(PARENT_HAS_EMAIL == "Has Email", 0.75, 0.25)) %>%
  group_by(PARENT_HAS_EMAIL) %>%
  calculate_response_rates(
    status = "RESPONSE_STATUS",
    status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4),
    rr_formula = "RR3",
    elig_method = "specified",
    e = "e_by_email"
  )

Test the independence of survey response and auxiliary variables
Description
Tests whether response status among eligible sample cases is independent of categorical auxiliary variables, using a Chi-Square test with Rao-Scott's second-order adjustment. If the data include cases known to be ineligible or cases with unknown eligibility status, the data are subset to include only respondents and nonrespondents known to be eligible.
Usage
chisq_test_ind_response(
  survey_design,
  status,
  status_codes = c("ER", "EN", "UE", "IE"),
  aux_vars
)
Arguments
survey_design | A survey design object created with the 'survey' package |
status | A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
aux_vars | A list of names of auxiliary variables. |
Details
Please see svychisq for details of how the Rao-Scott second-order adjusted test is conducted.
Value
A data frame containing the results of the Chi-Square test(s) of independence between response status and each auxiliary variable. If multiple auxiliary variables are specified, the output data contains one row per auxiliary variable.
The columns of the output dataset include:
auxiliary_variable: The name of the auxiliary variable tested
statistic: The value of the test statistic
ndf: Numerator degrees of freedom for the reference distribution
ddf: Denominator degrees of freedom for the reference distribution
p_value: The p-value of the test of independence
test_method: Text giving the name of the statistical test
variance_method: Text describing the method of variance estimation
References
Rao, J.N.K., and Scott, A.J. (1984). "On Chi-Squared Tests for Multiway Contingency Tables with Proportions Estimated from Survey Data." Annals of Statistics, 12, 46-60.
Examples
# Create a survey design object ----
library(survey)
data(involvement_survey_srs, package = "nrba")

involvement_survey <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  data = involvement_survey_srs
)

# Test whether response status varies by race or by sex ----
test_results <- chisq_test_ind_response(
  survey_design = involvement_survey,
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "UE" = "Unknown",
    "IE" = "Ineligible"
  ),
  aux_vars = c("STUDENT_RACE", "STUDENT_SEX")
)

print(test_results)

Test of differences in survey percentages relative to external estimates
Description
Compare estimated percentages from the present survey to external estimates from a benchmark source. A Chi-Square test with Rao-Scott's second-order adjustment is used to evaluate whether the survey's estimates differ from the external estimates.
Usage
chisq_test_vs_external_estimate(survey_design, y_var, ext_ests, na.rm = TRUE)
Arguments
survey_design | A survey design object created with the 'survey' package |
y_var | Name of dependent categorical variable. |
ext_ests | A numeric vector containing the external estimate of the percentages for each category.The vector must have names, each name corresponding to a given category. |
na.rm | Whether to drop cases with missing values |
Details
Please see svygofchisq for details of how the Rao-Scott second-order adjusted test is conducted. The test statistic, statistic, is obtained by calculating the Pearson Chi-squared statistic for the estimated table of population totals. The reference distribution is a Satterthwaite approximation. The p-value is obtained by comparing statistic/scale to a Chi-squared distribution with df degrees of freedom.
Value
A data frame containing the results of the Chi-Square test(s) of whether survey-based estimates systematically differ from external estimates.
The columns of the output dataset include:
statistic: The value of the test statistic
df: Degrees of freedom for the reference Chi-Squared distribution
scale: Estimated scale parameter
p_value: The p-value of the test
test_method: Text giving the name of the statistical test
variance_method: Text describing the method of variance estimation
References
Rao, J.N.K., and Scott, A.J. (1984). "On Chi-Squared Tests for Multiway Contingency Tables with Proportions Estimated from Survey Data." Annals of Statistics, 12, 46-60.
Examples
library(survey)

# Create a survey design ----
data("involvement_survey_pop", package = "nrba")
data("involvement_survey_str2s", package = "nrba")

involvement_survey_sample <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

# Subset to only include survey respondents ----
involvement_survey_respondents <- subset(
  involvement_survey_sample,
  RESPONSE_STATUS == "Respondent"
)

# Test whether percentages of categorical variable differ from benchmark ----
parent_email_benchmark <- c(
  "Has Email" = 0.85,
  "No Email" = 0.15
)

chisq_test_vs_external_estimate(
  survey_design = involvement_survey_respondents,
  y_var = "PARENT_HAS_EMAIL",
  ext_ests = parent_email_benchmark
)

Calculate cumulative estimates of a mean/proportion
Description
Calculates estimates of a mean/proportion which are cumulative with respect to a predictor variable, such as week of data collection or number of contact attempts. This can be useful for examining whether estimates are affected by decisions such as whether to extend the data collection period or make additional contact attempts.
Usage
get_cumulative_estimates(
  survey_design,
  y_var,
  y_var_type = NULL,
  predictor_variable
)
Arguments
survey_design | A survey design object created with the 'survey' package |
y_var | Name of a variable whose mean or proportion is to be estimated. |
y_var_type | Either 'numeric' or 'categorical'. |
predictor_variable | Name of a variable for which cumulative estimates of y_var will be calculated (e.g., week of data collection or number of contact attempts). |
Value
A dataframe of cumulative estimates. The first column, whose name matches predictor_variable, gives the values of predictor_variable for which a given estimate was computed. The other columns of the result include the following:
outcome | The name of the variable for which estimates are computed |
outcome_category | For a categorical variable, the category of that variable |
estimate | The estimated mean or proportion. |
std_error | The estimated standard error |
respondent_sample_size | The number of cases used to produce the estimate (excluding missing values) |
References
See Maitland et al. (2017) for an example of a level-of-effort analysis based on this method.
Maitland, A., et al. (2017). A Nonresponse Bias Analysis of the Health Information National Trends Survey (HINTS). Journal of Health Communication, 22, 545-553. doi:10.1080/10810730.2017.1324539
Examples
# Create an example survey design
# with a variable representing number of contact attempts
library(survey)
data(involvement_survey_srs, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Cumulative estimates from respondents for average student age ----
get_cumulative_estimates(
  survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"),
  y_var = "STUDENT_AGE",
  y_var_type = "numeric",
  predictor_variable = "CONTACT_ATTEMPTS"
)

# Cumulative estimates from respondents for proportions of categorical variable ----
get_cumulative_estimates(
  survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"),
  y_var = "WHETHER_PARENT_AGREES",
  y_var_type = "categorical",
  predictor_variable = "CONTACT_ATTEMPTS"
)

Summarize the variance estimation method for the survey design
Description
Summarize the variance estimation method for the survey design
Usage
get_variance_method(survey_design)
Arguments
survey_design | A survey design object created with the 'survey' package |
Details
For replicate designs, the type of replicates will be determined based on the 'type' element of the survey design object. If type = 'bootstrap', this can correspond to any of various types of bootstrap replication (Canty-Davison bootstrap, Rao-Wu's (n-1) bootstrap, etc.).
For designs which use linearization-based variance estimation, the summary only indicates that linearization is used for variance estimation and, if a special method is used for PPS variance estimation (e.g. Overton's approximation), that PPS variance estimation method will be described.
Value
A text string describing the method used for variance estimation
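This page has no Examples section; a minimal sketch of how the function might be called (assuming the 'survey' and 'nrba' packages are installed, and using the package's example data) could look like the following. The exact text strings returned are not stated in this manual, so no expected output is shown.

```r
library(survey)
library(nrba)
data(involvement_survey_srs, package = "nrba")

# A design that uses linearization-based variance estimation
lin_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  data = involvement_survey_srs
)
get_variance_method(lin_design)

# A replicate design: the 'type' element determines the description
rep_design <- as.svrepdesign(lin_design, type = "JK1")
get_variance_method(rep_design)
```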
Parent involvement survey: population data
Description
An example dataset describing a population of 20,000 students with disabilities in 20 school districts. This population is the basis for selecting a sample of students for a parent involvement survey.
Usage
involvement_survey_pop
Format
A data frame with 20,000 rows and 9 variables
Fields
- UNIQUE_ID
A unique identifier for students
- SCHOOL_DISTRICT
A unique identifier for school districts
- SCHOOL_ID
A unique identifier for schools, nested within districts
- STUDENT_GRADE
Student's grade level: 'PK', 'K', 1-12
- STUDENT_AGE
Student's age, measured in years
- STUDENT_DISABILITY_CODE
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
- STUDENT_DISABILITY_CATEGORY
Student's disability category (e.g. 'Visual Impairments')
- STUDENT_SEX
'Female' or 'Male'
- STUDENT_RACE
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
Examples
involvement_survey_pop

Parent involvement survey: simple random sample
Description
An example dataset describing a simple random sample of 5,000 parents of students with disabilities, from a population of 20,000. The parent involvement survey measures a single key outcome: whether "parents perceive that schools facilitate parent involvement as a means of improving services and results for children with disabilities."
The variable BASE_WEIGHT provides the base sampling weight. The variable N_STUDENTS can be used to provide a finite population correction for variance estimation.
Usage
involvement_survey_srs
Format
A data frame with 5,000 rows and 17 variables
Fields
- UNIQUE_ID
A unique identifier for students
- RESPONSE_STATUS
Survey response/eligibility status: 'Respondent', 'Nonrespondent', 'Ineligible', 'Unknown'
- WHETHER_PARENT_AGREES
Parent agreement ('AGREE' or 'DISAGREE') for whether they perceive that schools facilitate parent involvement
- SCHOOL_DISTRICT
A unique identifier for school districts
- SCHOOL_ID
A unique identifier for schools, nested within districts
- STUDENT_GRADE
Student's grade level: 'PK', 'K', 1-12
- STUDENT_AGE
Student's age, measured in years
- STUDENT_DISABILITY_CODE
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
- STUDENT_DISABILITY_CATEGORY
Student's disability category (e.g. 'Visual Impairments')
- STUDENT_SEX
'Female' or 'Male'
- STUDENT_RACE
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
- PARENT_HAS_EMAIL
Whether parent has an e-mail address ('Has Email' vs 'No Email')
- PARENT_HAS_EMAIL_BENCHMARK
Population benchmark for category of PARENT_HAS_EMAIL
- STUDENT_RACE_BENCHMARK
Population benchmark for category of STUDENT_RACE
- BASE_WEIGHT
Sampling weight to use for weighted estimates
- N_STUDENTS
Total number of students in the population
- CONTACT_ATTEMPTS
The number of contact attempts made for each case (ranges between 1 and 6)
Examples
involvement_survey_srs

Parent involvement survey: stratified, two-stage sample
Description
An example dataset describing a stratified, multistage sample of 1,000 parents of students with disabilities, from a population of 20,000. The parent involvement survey measures a single key outcome: whether "parents perceive that schools facilitate parent involvement as a means of improving services and results for children with disabilities."
The sample was selected by sampling 5 schools from each of 20 districts, and then sampling parents of 10 children in each sampled school. The variable BASE_WEIGHT provides the base sampling weight. The variable SCHOOL_DISTRICT was used for stratification, and the variables SCHOOL_ID and UNIQUE_ID uniquely identify the first and second stage sampling units (schools and parents). The variables N_SCHOOLS_IN_DISTRICT and N_STUDENTS_IN_SCHOOL can be used to provide finite population corrections.
Usage
involvement_survey_str2s
Format
A data frame with 1,000 rows and 18 variables
Fields
- UNIQUE_ID
A unique identifier for students
- RESPONSE_STATUS
Survey response/eligibility status: 'Respondent', 'Nonrespondent', 'Ineligible', 'Unknown'
- WHETHER_PARENT_AGREES
Parent agreement ('AGREE' or 'DISAGREE') for whether they perceive that schools facilitate parent involvement
- SCHOOL_DISTRICT
A unique identifier for school districts
- SCHOOL_ID
A unique identifier for schools, nested within districts
- STUDENT_GRADE
Student's grade level: 'PK', 'K', 1-12
- STUDENT_AGE
Student's age, measured in years
- STUDENT_DISABILITY_CODE
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
- STUDENT_DISABILITY_CATEGORY
Student's disability category (e.g. 'Visual Impairments')
- STUDENT_SEX
'Female' or 'Male'
- STUDENT_RACE
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
- PARENT_HAS_EMAIL
Whether parent has an e-mail address ('Has Email' vs 'No Email')
- PARENT_HAS_EMAIL_BENCHMARK
Population benchmark for category of PARENT_HAS_EMAIL
- STUDENT_RACE_BENCHMARK
Population benchmark for category of STUDENT_RACE
- N_SCHOOLS_IN_DISTRICT
Total number of schools in each district
- N_STUDENTS_IN_SCHOOL
Total number of students in each school
- BASE_WEIGHT
Sampling weight to use for weighted estimates
- CONTACT_ATTEMPTS
The number of contact attempts made for each case (ranges between 1 and 6)
Examples
# Load the data
involvement_survey_str2s

# Prepare the data for analysis with the 'survey' package
library(survey)

involvement_survey <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

Fit a regression model to predict survey outcomes
Description
A regression model is fit to the sample data to predict outcomes measured by a survey. This model can be used to identify auxiliary variables that are predictive of survey outcomes and hence are potentially useful for nonresponse bias analysis or weighting adjustments.
Only data from survey respondents will be used to fit the model, since survey outcomes are only measured among respondents.
The function returns a summary of the model, including overall tests for each variable of whether that variable improves the model's ability to predict the outcome in the population of interest (not just in the random sample at hand).
Usage
predict_outcome_via_glm(
  survey_design,
  outcome_variable,
  outcome_type = "continuous",
  outcome_to_predict = NULL,
  numeric_predictors = NULL,
  categorical_predictors = NULL,
  model_selection = "main-effects",
  selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L)
)
Arguments
survey_design | A survey design object created with the 'survey' package |
outcome_variable | Name of an outcome variable to use as the dependent variable in the model. |
outcome_type | Either 'continuous' or 'binary'. |
outcome_to_predict | Only required if outcome_type is 'binary'. The category of outcome_variable to predict. |
numeric_predictors | A list of names of numeric auxiliary variables to use for predicting response status. |
categorical_predictors | A list of names of categorical auxiliary variables to use for predicting response status. |
model_selection | A character string specifying how to select a model.The default and recommended method is 'main-effects', which simply includes main effectsfor each of the predictor variables. |
selection_controls | Only required if model_selection is 'stepwise'. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
If the user specifies model_selection = "stepwise", a regression model is selected by adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached or until either all variables have been added to the model or there are no unadded variables for which the likelihood ratio test has a p-value below alpha_enter.
Value
A data frame summarizing the fitted regression model.
Each row in the data frame represents a coefficient in the model. The column variable describes the underlying variable for the coefficient. For categorical variables, the column variable_category indicates the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient, se_coefficient, conf_intrvl_lower, conf_intrvl_upper, and p_value_coefficient are summary statistics for the estimated coefficient. Note that p_value_coefficient is based on the Wald t-test for the coefficient.
The column variable_level_p_value gives the p-value of the Rao-Scott Likelihood Ratio Test for including the variable in the model. This likelihood ratio test has its test statistic given by the column LRT_chisq_statistic, and the reference distribution for this test is a linear combination of p F-distributions with numerator degrees of freedom given by LRT_df_numerator and denominator degrees of freedom given by LRT_df_denominator, where p is the number of coefficients in the model corresponding to the variable being tested.
References
Lumley, T., and Scott, A. (2017). Fitting Regression Models to Survey Data. Statistical Science, 32(2), 265-278. https://doi.org/10.1214/16-STS605
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  id = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  data = involvement_survey_str2s
)

predict_outcome_via_glm(
  survey_design = survey_design,
  outcome_variable = "WHETHER_PARENT_AGREES",
  outcome_type = "binary",
  outcome_to_predict = "AGREE",
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL")
)

Fit a logistic regression model to predict response to the survey
Description
A logistic regression model is fit to the sample data to predict whether an individual responds to the survey (i.e., is an eligible respondent) rather than a nonrespondent. Ineligible cases and cases with unknown eligibility status are not included in this model.
The function returns a summary of the model, including overall tests for each variable of whether that variable improves the model's ability to predict response status in the population of interest (not just in the random sample at hand).
This model can be used to identify auxiliary variables associated with response status and to compare multiple auxiliary variables in terms of their ability to predict response status.
Usage
predict_response_status_via_glm(
  survey_design,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  numeric_predictors = NULL,
  categorical_predictors = NULL,
  model_selection = "main-effects",
  selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L)
)

Arguments
survey_design | A survey design object created with the survey package. |
status | A character string giving the name of the variable representing response/eligibility status. The categories of this variable are specified via status_codes. |
status_codes | A named vector, with two entries named 'ER' and 'EN' indicating which values of the status variable represent an eligible respondent ('ER') and an eligible nonrespondent ('EN'). |
numeric_predictors | A list of names of numeric auxiliary variables to use for predicting response status. |
categorical_predictors | A list of names of categorical auxiliary variables to use for predicting response status. |
model_selection | A character string specifying how to select a model. The default and recommended method is 'main-effects', which simply includes main effects for each of the predictor variables. The option 'stepwise' uses stepwise selection, described in the Details section. |
selection_controls | Only required if model_selection = 'stepwise'. A named list of control parameters for stepwise selection: alpha_enter, alpha_remain, and max_iterations. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
If the user specifies model_selection = "stepwise", a regression model is selected by adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached, until all variables have been added to the model, or until none of the variables not yet in the model has a likelihood ratio test p-value below alpha_enter.
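The add/drop loop described above can be illustrated with an ordinary (unweighted) logistic regression and a standard likelihood ratio test. This is only a sketch of the generic stepwise idea on made-up data; the package's actual implementation uses survey-weighted models and Rao-Scott tests via regTermTest:

```r
set.seed(1)
d <- data.frame(
  y = rbinom(200, 1, 0.5),
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200)
)
candidates <- c("x1", "x2", "x3")
in_model <- character(0)
alpha_enter <- 0.5
alpha_remain <- 0.5

# p-value of the likelihood ratio test comparing two nested models
lrt_p <- function(small, big) anova(small, big, test = "LRT")[2, "Pr(>Chi)"]
# Fit a logistic regression with the given predictors (intercept-only if none)
fit <- function(vars) {
  glm(reformulate(if (length(vars)) vars else "1", "y"),
      family = binomial, data = d)
}

for (iter in 1:100) {
  changed <- FALSE
  # Add the excluded variable with the smallest p-value, if below alpha_enter
  out <- setdiff(candidates, in_model)
  if (length(out) > 0) {
    p_add <- sapply(out, function(v) lrt_p(fit(in_model), fit(c(in_model, v))))
    if (min(p_add) < alpha_enter) {
      in_model <- c(in_model, out[which.min(p_add)])
      changed <- TRUE
    }
  }
  # Drop the included variable with the largest p-value, if above alpha_remain
  if (length(in_model) > 0) {
    p_drop <- sapply(in_model, function(v) {
      lrt_p(fit(setdiff(in_model, v)), fit(in_model))
    })
    if (max(p_drop) > alpha_remain) {
      in_model <- setdiff(in_model, in_model[which.max(p_drop)])
      changed <- TRUE
    }
  }
  if (!changed) break
}
```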
Value
A data frame summarizing the fitted logistic regression model.
Each row in the data frame represents a coefficient in the model. The column variable describes the underlying variable for the coefficient. For categorical variables, the column variable_category indicates the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient, se_coefficient, conf_intrvl_lower, conf_intrvl_upper, and p_value_coefficient are summary statistics for the estimated coefficient. Note that p_value_coefficient is based on the Wald t-test for the coefficient.
The column variable_level_p_value gives the p-value of the Rao-Scott Likelihood Ratio Test for including the variable in the model. This likelihood ratio test has its test statistic given by the column LRT_chisq_statistic, and the reference distribution for this test is a linear combination of p F-distributions with numerator degrees of freedom given by LRT_df_numerator and denominator degrees of freedom given by LRT_df_denominator, where p is the number of coefficients in the model corresponding to the variable being tested.
References
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_str2s, package = "nrba")

survey_design <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

predict_response_status_via_glm(
  survey_design = survey_design,
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  model_selection = "main-effects",
  numeric_predictors = c("STUDENT_AGE"),
  categorical_predictors = c("PARENT_HAS_EMAIL", "STUDENT_GRADE")
)

Re-weight data to match population benchmarks, using raking or post-stratification
Description
Adjusts weights in the data to ensure that estimated population totals for grouping variables match known population benchmarks. If there is only one grouping variable, simple post-stratification is used. If there are multiple grouping variables, raking (also known as iterative post-stratification) is used.
Usage
rake_to_benchmarks(
  survey_design,
  group_vars,
  group_benchmark_vars,
  max_iterations = 100,
  epsilon = 5e-06
)

Arguments
survey_design | A survey design object created with the survey package. |
group_vars | Names of grouping variables in the data dividing the sample into groups for which benchmark data are available. These variables cannot have any missing values. |
group_benchmark_vars | Names of group benchmark variables in the data corresponding to group_vars. |
max_iterations | If there are multiple grouping variables, then raking is used rather than post-stratification. The parameter max_iterations controls the maximum number of raking iterations. |
epsilon | If raking is used, convergence for a given margin is declared if the maximum change in a re-weighted total is less than epsilon. |
Details
Raking adjusts the weight assigned to each sample member so that, after reweighting, the weighted sample percentages for population subgroups match their known population percentages. In a sense, raking causes the sample to more closely resemble the population in terms of variables for which population sizes are known.
Raking can be useful for reducing nonresponse bias caused by groups being overrepresented in the responding sample relative to their population size. If the population subgroups systematically differ in terms of outcome variables of interest, then raking can also help reduce sampling variances. However, when population subgroups do not differ in terms of outcome variables of interest, raking may increase sampling variances.
There are two basic requirements for raking.
Basic Requirement 1 - Values of the grouping variable(s) must be known for all respondents.
Basic Requirement 2 - The population size of each group must be known (or precisely estimated).
When there is effectively only one grouping variable (though this variable can be defined as a combination of other variables), raking amounts to simple post-stratification. For example, simple post-stratification would be used if the grouping variable is "Age x Sex x Race", and the population size of each combination of age, sex, and race is known. The method of "iterative post-stratification" (also known as "iterative proportional fitting") is used when there are multiple grouping variables, and population sizes are known for each grouping variable but not for combinations of grouping variables. For example, iterative proportional fitting would be necessary if population sizes are known for age groups and for gender categories, but not for combinations of age groups and gender categories.
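Iterative proportional fitting itself is simple to sketch in base R. The toy weights and benchmark totals below are hypothetical; the package's rake_to_benchmarks() function handles this (and the survey design bookkeeping) for you:

```r
# Toy sample: two grouping variables and initial weights
grp_a <- c("x", "x", "x", "y", "y", "y")
grp_b <- c("u", "v", "u", "v", "u", "v")
weights <- rep(10, 6)

# Known (hypothetical) population totals for each margin
totals_a <- c(x = 40, y = 80)
totals_b <- c(u = 70, v = 50)

# Alternately rescale weights so each margin's weighted totals
# match its benchmarks, until both margins match at once
for (iter in 1:100) {
  weights <- weights * totals_a[grp_a] / tapply(weights, grp_a, sum)[grp_a]
  weights <- weights * totals_b[grp_b] / tapply(weights, grp_b, sum)[grp_b]
  if (max(abs(tapply(weights, grp_a, sum) - totals_a)) < 1e-8) break
}
```

After convergence, the weighted totals for both grouping variables match their benchmarks simultaneously, which is exactly the raking condition described above.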
Value
A survey design object with raked or post-stratified weights
Examples
# Load the survey data
data(involvement_survey_srs, package = "nrba")

# Calculate population benchmarks
population_benchmarks <- list(
  "PARENT_HAS_EMAIL" = data.frame(
    PARENT_HAS_EMAIL = c("Has Email", "No Email"),
    PARENT_HAS_EMAIL_POP_BENCHMARK = c(17036, 2964)
  ),
  "STUDENT_RACE" = data.frame(
    STUDENT_RACE = c(
      "AM7 (American Indian or Alaska Native)",
      "AS7 (Asian)",
      "BL7 (Black or African American)",
      "HI7 (Hispanic or Latino Ethnicity)",
      "MU7 (Two or More Races)",
      "PI7 (Native Hawaiian or Other Pacific Islander)",
      "WH7 (White)"
    ),
    STUDENT_RACE_POP_BENCHMARK = c(206, 258, 3227, 1097, 595, 153, 14464)
  )
)

# Add the population benchmarks as variables in the data
involvement_survey_srs <- merge(
  x = involvement_survey_srs,
  y = population_benchmarks$PARENT_HAS_EMAIL,
  by = "PARENT_HAS_EMAIL"
)
involvement_survey_srs <- merge(
  x = involvement_survey_srs,
  y = population_benchmarks$STUDENT_RACE,
  by = "STUDENT_RACE"
)

# Create a survey design object
library(survey)

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Subset data to only include respondents
survey_respondents <- subset(
  survey_design,
  RESPONSE_STATUS == "Respondent"
)

# Rake to the benchmarks
raked_survey_design <- rake_to_benchmarks(
  survey_design = survey_respondents,
  group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"),
  group_benchmark_vars = c(
    "PARENT_HAS_EMAIL_POP_BENCHMARK",
    "STUDENT_RACE_POP_BENCHMARK"
  )
)

# Inspect estimates from respondents, before and after raking
svymean(x = ~PARENT_HAS_EMAIL, design = survey_respondents)
svymean(x = ~PARENT_HAS_EMAIL, design = raked_survey_design)
svymean(x = ~WHETHER_PARENT_AGREES, design = survey_respondents)
svymean(x = ~WHETHER_PARENT_AGREES, design = raked_survey_design)

Select and fit a model using stepwise regression
Description
A regression model is selected by iteratively adding and removing variables based on the p-value from a likelihood ratio test. At each stage, a single variable is added to the model if the p-value of the likelihood ratio test from adding the variable is below alpha_enter and its p-value is less than that of all other variables not already in the model. Next, of the variables already in the model, the variable with the largest p-value is dropped if its p-value is greater than alpha_remain. This iterative process continues until a maximum number of iterations is reached, until all variables have been added to the model, or until none of the variables not yet in the model has a likelihood ratio test p-value below alpha_enter.
Stepwise model selection generally invalidates inferential statistics such as p-values, standard errors, and confidence intervals, and it leads to overestimation of the size of coefficients for variables included in the selected model. This bias increases as the value of alpha_enter or alpha_remain decreases. The use of stepwise model selection should be limited to reducing a large list of candidate variables for nonresponse adjustment.
Usage
stepwise_model_selection(
  survey_design,
  outcome_variable,
  predictor_variables,
  model_type = "binary-logistic",
  max_iterations = 100L,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)

Arguments
survey_design | A survey design object created with the survey package. |
outcome_variable | The name of an outcome variable to use as the dependent variable. |
predictor_variables | A list of names of variables to consider as predictors for the model. |
model_type | A character string describing the type of model to fit. |
max_iterations | Maximum number of iterations to try adding new variables to the model. |
alpha_enter | The maximum p-value allowed for a variable to be added to the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
alpha_remain | The maximum p-value allowed for a variable to remain in the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
Details
See Lumley and Scott (2017) for details of how regression models are fit to survey data. For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted (see section 4 of Lumley and Scott (2017) for statistical details) using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint") from the 'survey' package.
See Sauerbrei et al. (2020) for a discussion of statistical issues with using stepwise model selection.
Value
An object of class svyglm representing a regression model fit using the 'survey' package.
References
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
Sauerbrei, W., Perperoglou, A., Schmid, M. et al. (2020). State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagnostic and Prognostic Research 4, 3. https://doi.org/10.1186/s41512-020-00074-3
Examples
library(survey)

# Load example data and prepare it for analysis
data(involvement_survey_str2s, package = "nrba")

involvement_survey <- svydesign(
  data = involvement_survey_str2s,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL,
  strata = ~SCHOOL_DISTRICT,
  weights = ~BASE_WEIGHT
)

involvement_survey <- involvement_survey |>
  transform(WHETHER_PARENT_AGREES = factor(WHETHER_PARENT_AGREES))

# Fit a regression model using stepwise selection
selected_model <- stepwise_model_selection(
  survey_design = involvement_survey,
  outcome_variable = "WHETHER_PARENT_AGREES",
  predictor_variables = c("STUDENT_RACE", "STUDENT_DISABILITY_CATEGORY"),
  model_type = "binary-logistic",
  max_iterations = 100,
  alpha_enter = 0.5,
  alpha_remain = 0.5
)

t-test of differences in means/percentages between responding sample and full sample, or between responding sample and eligible sample
Description
The function t_test_resp_vs_full tests whether means of auxiliary variables differ between respondents and the full selected sample, where the full sample consists of all cases regardless of response status or eligibility status.
The function t_test_resp_vs_elig tests whether means differ between the responding sample and the eligible sample, where the eligible sample consists of all cases known to be eligible, regardless of response status.
See Lohr and Riddles (2016) for the statistical theory of this test.
Usage
t_test_resp_vs_full(
  survey_design,
  y_vars,
  na.rm = TRUE,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

t_test_resp_vs_elig(
  survey_design,
  y_vars,
  na.rm = TRUE,
  status,
  status_codes = c("ER", "EN", "IE", "UE"),
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

Arguments
survey_design | A survey design object created with the survey package. |
y_vars | Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm | Whether to drop cases with missing values for a given dependent variable. |
status | The name of the variable representing response/eligibility status. |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
null_difference | The difference between the two means under the null hypothesis. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference in means is equal to null_difference;
'less': a one-sided test of whether the difference in means is less than null_difference;
'greater': a one-sided test of whether the difference in means is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf method. |
Value
A data frame describing the results of the t-tests, one row per dependent variable.
Statistical Details
The t-statistic used for the test has as its numerator the difference in means between the two samples, minus the null_difference. The denominator for the t-statistic is the estimated standard error of the difference in means. Because the two means are based on overlapping groups and thus have correlated sampling errors, special care is taken to estimate the covariance of the two estimates. For designs which use sets of replicate weights for variance estimation, the two means and their difference are estimated using each set of replicate weights; the estimated differences from the sets of replicate weights are then used to estimate sampling error with a formula appropriate to the replication method (JKn, BRR, etc.). For designs which use linearization methods for variance estimation, the covariance between the two means is estimated using the method of linearization based on influence functions implemented in the survey package. See Osier (2009) for an overview of the method of linearization based on influence functions. Eckman et al. (2023) showed in a simulation study that linearization and replication performed similarly in estimating the variance of a difference in means for overlapping samples.
Unless specified otherwise using the degrees_of_freedom parameter, the degrees of freedom for the test are set to the design degrees of freedom minus one. Design degrees of freedom are estimated using the survey package's degf method.
See Lohr and Riddles (2016) for the statistical details of this test. See Van de Kerckhove et al. (2009) and Amaya and Presser (2017) for examples of nonresponse bias analyses which use t-tests to compare responding samples to eligible samples.
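Given the estimated difference, its standard error, and the design degrees of freedom, the t-statistic and p-values take the usual form. A base-R sketch with hypothetical numbers (in practice all of these come from the survey design):

```r
# Hypothetical estimates (in practice computed from the survey design)
y_bar_resp <- 0.62   # respondent mean
y_bar_full <- 0.58   # full-sample mean
se_diff <- 0.015     # SE of the difference, accounting for their covariance
degf_design <- 30    # design degrees of freedom
null_difference <- 0

t_stat <- (y_bar_resp - y_bar_full - null_difference) / se_diff
df <- degf_design - 1

# p-values for the two-sided ('unequal') and one-sided alternatives
p_unequal <- 2 * pt(-abs(t_stat), df)
p_greater <- pt(t_stat, df, lower.tail = FALSE)
p_less <- pt(t_stat, df, lower.tail = TRUE)
```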
References
Amaya, A., Presser, S. (2017). Nonresponse Bias for Univariate and Multivariate Estimates of Social Activities and Roles. Public Opinion Quarterly, Volume 81, Issue 1, 1 March 2017, Pages 1–36. https://doi.org/10.1093/poq/nfw037
Eckman, S., Unangst, J., Dever, J., Antoun, A. (2023). The Precision of Estimates of Nonresponse Bias in Means. Journal of Survey Statistics and Methodology, 11(4), 758-783. https://doi.org/10.1093/jssam/smac019
Lohr, S., Riddles, M. (2016). Tests for Evaluating Nonresponse Bias in Surveys. Survey Methodology 42(2): 195-218. https://www150.statcan.gc.ca/n1/pub/12-001-x/2016002/article/14677-eng.pdf
Osier, G. (2009). Variance estimation for complex indicators of poverty and inequality using linearization techniques. Survey Research Methods, 3(3), 167-195. https://doi.org/10.18148/srm/2009.v3i3.369
Van de Kerckhove, W., Krenzke, T., and Mohadjer, L. (2009). Adult Literacy and Lifeskills Survey (ALL) 2003: U.S. Nonresponse Bias Analysis (NCES 2009-063). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_srs, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Compare respondents' mean to the full sample mean ----
t_test_resp_vs_full(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  )
)

# Compare respondents' mean to the mean of all eligible cases ----
t_test_resp_vs_elig(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  )
)

# One-sided tests ----
## Null Hypothesis: Y_bar_resp - Y_bar_full <= 0.1
## Alt. Hypothesis: Y_bar_resp - Y_bar_full > 0.1
t_test_resp_vs_full(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  null_difference = 0.1,
  alternative = "greater"
)

## Null Hypothesis: Y_bar_resp - Y_bar_full >= 0.1
## Alt. Hypothesis: Y_bar_resp - Y_bar_full < 0.1
t_test_resp_vs_full(
  survey_design = survey_design,
  y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"),
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  null_difference = 0.1,
  alternative = "less"
)

t-test of differences in estimated means/percentages from two different sets of replicate weights.
Description
Tests whether estimates of means/percentages differ systematically between two sets of replicate weights: an original set of weights, and the weights after adjustment (e.g. post-stratification or nonresponse adjustments) and possibly subsetting (e.g. subsetting to only include respondents).
Usage
t_test_of_weight_adjustment(
  orig_design,
  updated_design,
  y_vars,
  na.rm = TRUE,
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = NULL
)

Arguments
orig_design | A replicate design object created with the survey package. |
updated_design | A potentially updated version of orig_design, for example with weights adjusted through post-stratification or subset to only include respondents. |
y_vars | Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm | Whether to drop cases with missing values for a given dependent variable. |
null_difference | The difference between the two means/percentages under the null hypothesis. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference in means/percentages is equal to null_difference;
'less': a one-sided test of whether the difference is less than null_difference;
'greater': a one-sided test of whether the difference is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf method. |
Value
A data frame describing the results of the t-tests, one row per dependent variable.
Statistical Details
The t-statistic used for the test has as its numerator the difference in means/percentages between the two samples, minus the null_difference. The denominator for the t-statistic is the estimated standard error of the difference in means. Because the two means are based on overlapping groups and thus have correlated sampling errors, special care is taken to estimate the covariance of the two estimates. For designs which use sets of replicate weights for variance estimation, the two means and their difference are estimated using each set of replicate weights; the estimated differences from the sets of replicate weights are then used to estimate sampling error with a formula appropriate to the replication method (JKn, BRR, etc.).
This analysis is not implemented for designs which use linearization methods for variance estimation.
Unless specified otherwise using the degrees_of_freedom parameter, the degrees of freedom for the test are set to the design degrees of freedom minus one. Design degrees of freedom are estimated using the survey package's degf method.
See Van de Kerckhove et al. (2009) for an example of this type of nonresponse bias analysis (among others). See Lohr and Riddles (2016) for the statistical details of this test.
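The replication step described above can be sketched in base R. The replicate estimates of the difference below are made up, and a JK1-style scale factor is assumed purely for illustration; the appropriate factor depends on the replication method:

```r
# Hypothetical full-sample estimate of the difference, plus the same
# difference re-estimated with each of 8 sets of replicate weights
diff_full <- 0.040
diff_reps <- c(0.037, 0.045, 0.041, 0.036, 0.043, 0.039, 0.044, 0.035)

# JK1-style scale factor (assumed here; depends on the replication method)
scale_factor <- (length(diff_reps) - 1) / length(diff_reps)

# Replication variance: scaled squared deviations of the replicate
# estimates around the full-sample estimate
var_diff <- scale_factor * sum((diff_reps - diff_full)^2)
se_diff <- sqrt(var_diff)
t_stat <- (diff_full - 0) / se_diff  # null_difference = 0
```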
References
Lohr, S., Riddles, M. (2016). Tests for Evaluating Nonresponse Bias in Surveys. Survey Methodology 42(2): 195-218. https://www150.statcan.gc.ca/n1/pub/12-001-x/2016002/article/14677-eng.pdf
Van de Kerckhove, W., Krenzke, T., and Mohadjer, L. (2009).Adult Literacy and Lifeskills Survey (ALL) 2003: U.S. Nonresponse Bias Analysis (NCES 2009-063). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.
Examples
library(survey)

# Create a survey design ----
data(involvement_survey_srs, package = "nrba")

survey_design <- svydesign(
  weights = ~BASE_WEIGHT,
  id = ~UNIQUE_ID,
  fpc = ~N_STUDENTS,
  data = involvement_survey_srs
)

# Create replicate weights for the design ----
rep_svy_design <- as.svrepdesign(survey_design,
  type = "subbootstrap",
  replicates = 500
)

# Subset to only respondents (always subset *after* creating replicate weights)
rep_svy_respondents <- subset(rep_svy_design, RESPONSE_STATUS == "Respondent")

# Apply raking adjustment ----
raked_rep_svy_respondents <- rake_to_benchmarks(
  survey_design = rep_svy_respondents,
  group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"),
  group_benchmark_vars = c("PARENT_HAS_EMAIL_BENCHMARK", "STUDENT_RACE_BENCHMARK")
)

# Compare estimates from respondents in original vs. adjusted design ----
t_test_of_weight_adjustment(
  orig_design = rep_svy_respondents,
  updated_design = raked_rep_svy_respondents,
  y_vars = c("STUDENT_AGE", "STUDENT_SEX")
)
t_test_of_weight_adjustment(
  orig_design = rep_svy_respondents,
  updated_design = raked_rep_svy_respondents,
  y_vars = c("WHETHER_PARENT_AGREES")
)

# Compare estimates to true population values ----
data("involvement_survey_pop", package = "nrba")

mean(involvement_survey_pop$STUDENT_AGE)
prop.table(table(involvement_survey_pop$STUDENT_SEX))

Test for differences in means/percentages between two potentially overlapping groups
Description
Test for differences in means/percentages between two potentially overlapping groups
Usage
t_test_overlap(
  survey_design,
  y_vars,
  na.rm = TRUE,
  status,
  group_1,
  group_2,
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

Arguments
survey_design | A survey design object created with the survey package. |
y_vars | Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm | Whether to drop cases with missing values for a given dependent variable. |
status | The name of the variable representing response/eligibility status. |
group_1 | Vector of values of the status variable which define the first group. |
group_2 | Vector of values of the status variable which define the second group. |
null_difference | The hypothesized difference between the groups' means. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference in group means is equal to null_difference;
'less': a one-sided test of whether the difference is less than null_difference;
'greater': a one-sided test of whether the difference is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf method. |
Value
A data frame describing the difference in group means/percentages and the statistics from the t-test
t-test of differences in means/percentages relative to external estimates
Description
Compare estimated means/percentages from the present survey to external estimates from a benchmark source. A t-test is used to evaluate whether the survey's estimates differ from the external estimates.
Usage
t_test_vs_external_estimate(
  survey_design,
  y_var,
  ext_ests,
  ext_std_errors = NULL,
  na.rm = TRUE,
  null_difference = 0,
  alternative = "unequal",
  degrees_of_freedom = survey::degf(survey_design) - 1
)

Arguments
survey_design | A survey design object created with the survey package. |
y_var | Name of dependent variable. For categorical variables, percentages of each category are tested. |
ext_ests | A numeric vector containing the external estimate of the mean for the dependent variable. If y_var is a categorical variable, a named vector of external estimates must be supplied, one per category. |
ext_std_errors | (Optional) The standard errors of the external estimates. This is useful if the external data are estimated with an appreciable level of uncertainty, for instance if the external data come from a survey with a small-to-moderate sample size. If supplied, the variance of the difference between the survey and external estimates is estimated by adding the variance of the external estimates to the estimated variance of the survey's estimates. |
na.rm | Whether to drop cases with missing values for |
null_difference | The hypothesized difference between the estimate and the external mean. Default is 0. |
alternative | Can be one of the following:
'unequal' (the default): a two-sided test of whether the difference between the survey estimate and the external estimate is equal to null_difference;
'less': a one-sided test of whether the difference is less than null_difference;
'greater': a one-sided test of whether the difference is greater than null_difference.
|
degrees_of_freedom | The degrees of freedom to use for the test's reference distribution. Unless specified otherwise, the default is the design degrees of freedom minus one, where the design degrees of freedom are estimated using the survey package's degf function. |
Value
A data frame describing the results of the t-tests, one row per mean being compared.
References
See Brick and Bose (2001) for an example of this analysis method and a discussion of its limitations.
Brick, M., and Bose, J. (2001). Analysis of Potential Nonresponse Bias. In Proceedings of the Section on Survey Research Methods. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/y2001/Proceed/00021.pdf
Examples
library(survey)

# Create a survey design ----
data("involvement_survey_str2s", package = "nrba")

involvement_survey_sample <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

# Subset to only include survey respondents ----
involvement_survey_respondents <- subset(
  involvement_survey_sample,
  RESPONSE_STATUS == "Respondent"
)

# Test whether percentages of categorical variable differ from benchmark ----
parent_email_benchmark <- c(
  "Has Email" = 0.85,
  "No Email" = 0.15
)
t_test_vs_external_estimate(
  survey_design = involvement_survey_respondents,
  y_var = "PARENT_HAS_EMAIL",
  ext_ests = parent_email_benchmark
)

# Test whether the sample mean differs from the population benchmark ----
average_age_benchmark <- 11
t_test_vs_external_estimate(
  survey_design = involvement_survey_respondents,
  y_var = "STUDENT_AGE",
  ext_ests = average_age_benchmark,
  null_difference = 0
)

Adjust weights in a replicate design for nonresponse or unknown eligibility status, using weighting classes
Description
Updates weights in a survey design object to adjust for nonresponse and/or unknown eligibility using the method of weighting class adjustment. For unknown eligibility adjustments, the weight in each class is set to zero for cases with unknown eligibility, and the weight of all other cases in the class is increased so that the total weight is unchanged. For nonresponse adjustments, the weight in each class is set to zero for cases classified as eligible nonrespondents, and the weight of eligible respondent cases in the class is increased so that the total weight is unchanged.
This function currently only works for survey designs with replicate weights, since the linearization-based estimators included in the survey package (or in Stata or SAS, for that matter) are unable to fully reflect the impact of nonresponse adjustment. Adjustments are made to both the full-sample weights and all of the sets of replicate weights.
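The weight redistribution described above can be sketched in base R for a nonresponse ('NR') adjustment. The data are made up, and this ignores the replicate weights that wt_class_adjust() also adjusts:

```r
# Toy data: response status and a weighting class for each sampled case
status <- c("ER", "ER", "EN", "ER", "EN", "IE")
wt_class <- c("A", "A", "A", "B", "B", "B")
weights <- c(10, 10, 10, 20, 20, 20)

# Within each class, zero out eligible nonrespondents ('EN') and scale up
# eligible respondents ('ER') so the class's total weight is unchanged
adjusted <- weights
for (cl in unique(wt_class)) {
  in_class <- wt_class == cl
  lost <- sum(weights[in_class & status == "EN"])
  kept <- sum(weights[in_class & status == "ER"])
  adjusted[in_class & status == "EN"] <- 0
  adjusted[in_class & status == "ER"] <-
    weights[in_class & status == "ER"] * (kept + lost) / kept
}
```

Ineligible cases are left untouched, so the total weight in each class (and overall) is preserved while nonrespondents' weight is redistributed to respondents.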
Usage
wt_class_adjust(
  survey_design,
  status,
  status_codes,
  wt_class = NULL,
  type = c("UE", "NR")
)

Arguments
survey_design | A replicate survey design object created with the survey package. |
status | A character string giving the name of the variable representing response/eligibility status. |
status_codes | A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
wt_class | (Optional) A character string giving the name of the variable which divides sample cases into weighting classes. |
type | A character vector including one or more of the following options:
'UE': adjust for unknown eligibility;
'NR': adjust for nonresponse.
|
Details
See the vignette "Nonresponse Adjustments" from the svrep package for a step-by-step walkthrough of nonresponse weighting adjustments in R: vignette(topic = "nonresponse-adjustments", package = "svrep")
Value
A replicate survey design object, with adjusted full-sample and replicate weights
References
See Chapter 2 of Heeringa, West, and Berglund (2017) or Chapter 13 of Valliant, Dever, and Kreuter (2018) for an overview of nonresponse adjustment methods based on redistributing weights.
Heeringa, S., West, B., Berglund, P. (2017). "Applied Survey Data Analysis, 2nd edition." Boca Raton, FL: CRC Press.
Valliant, R., Dever, J., Kreuter, F. (2018). "Practical Tools for Designing and Weighting Survey Samples, 2nd edition." New York: Springer.
See Also
svrep::redistribute_weights(), vignette(topic = "nonresponse-adjustments", package = "svrep")
Examples
library(survey)

# Load an example dataset
data("involvement_survey_str2s", package = "nrba")

# Create a survey design object
involvement_survey_sample <- svydesign(
  data = involvement_survey_str2s,
  weights = ~BASE_WEIGHT,
  strata = ~SCHOOL_DISTRICT,
  ids = ~ SCHOOL_ID + UNIQUE_ID,
  fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL
)

rep_design <- as.svrepdesign(involvement_survey_sample, type = "mrbbootstrap")

# Adjust weights for nonresponse within weighting classes
nr_adjusted_design <- wt_class_adjust(
  survey_design = rep_design,
  status = "RESPONSE_STATUS",
  status_codes = c(
    "ER" = "Respondent",
    "EN" = "Nonrespondent",
    "IE" = "Ineligible",
    "UE" = "Unknown"
  ),
  wt_class = "PARENT_HAS_EMAIL",
  type = "NR"
)