R package to perform statistically valid and efficient inference with predicted data (IPD) using state-of-the-art methods.
`ipd` is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recently proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, `ipd`. The package also provides custom `print`, `summary`, `tidy`, `glance`, and `augment` methods to facilitate easy model inspection.
Using predictions from pre-trained algorithms as outcomes in downstream statistical analyses can lead to biased estimates and misleading conclusions. The statistical challenges encountered when drawing inference on predicted data (IPD) include:
- Understanding the relationship between predicted outcomes and their true, unobserved counterparts.
- Quantifying the robustness of the AI/ML models to resampling or uncertainty about the training data.
- Appropriately propagating both bias and uncertainty from predictions into downstream inferential tasks.
Several works have proposed methods for IPD, including post-prediction inference (PostPI) by Wang et al., 2020; prediction-powered inference (PPI) and PPI++ by Angelopoulos et al., 2023a and Angelopoulos et al., 2023b; post-prediction adaptive inference (PSPA) by Miao et al., 2023; and a correction based on the Chen and Chen method and an alternate PPI “All” by Gronsbell et al., 2025. Each method was developed to perform inference on a quantity such as the outcome mean or quantile, or a regression coefficient, when we have:
- A dataset consisting of our outcome and features of interest, where the outcome is only observed for a small ‘labeled’ subset and missing for a, typically larger, ‘unlabeled’ subset.
- Access to an algorithm to predict the missing outcome in the entire dataset using the fully observed features.
We can use these methods for IPD to obtain corrected estimates and standard errors by using the predicted outcomes and unlabeled features to augment the labeled subset of the data.
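To make the correction idea concrete, here is a minimal sketch, outside the package, of the rectified mean estimator that underlies PPI-style methods: take the mean of the predictions on the unlabeled set and debias it by the average prediction error measured on the labeled set. The toy data and variable names here are illustrative assumptions, not part of the `ipd` API.

```r
#-- Toy data (assumed for illustration only)
set.seed(1)
y_lab <- rnorm(100, mean = 1)          # observed outcomes, labeled subset
f_lab <- y_lab + rnorm(100, sd = 0.5)  # predictions on the labeled subset
f_unl <- rnorm(1000, mean = 1.2)       # predictions on the unlabeled subset

#-- Naive estimate: treat predictions as if they were observed outcomes
naive <- mean(f_unl)

#-- Rectified estimate: debias using the labeled prediction errors
rectified <- mean(f_unl) + mean(y_lab - f_lab)

#-- Its standard error combines uncertainty from both subsets
se <- sqrt(var(f_unl) / length(f_unl) + var(y_lab - f_lab) / length(y_lab))
```

The methods implemented in `ipd` generalize this rectification to quantiles and regression coefficients, with method-specific weighting and variance estimation.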
To enable researchers and practitioners interested in these state-of-the-art methods, we have developed the `ipd` package in R to implement these methods under the umbrella of IPD. This README provides an overview of the package, including installation instructions, basic usage examples, and links to further documentation. The examples show how to generate data, fit models, and use the custom methods provided by the package.
To install the `ipd` package from Bioconductor, you can use the `BiocManager` package:
```r
#-- Install BiocManager if it is not already installed
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(version = "3.21")

#-- Install the ipd package from Bioconductor
BiocManager::install("ipd")
```
Or, to install the development version of `ipd` from GitHub, you can use the `devtools` package:
```r
#-- Install devtools if it is not already installed
install.packages("devtools")

#-- Install the ipd package from GitHub
devtools::install_github("ipd-tools/ipd")
```
We provide a simple example to demonstrate the basic use of the functions included in the `ipd` package.
You can generate synthetic datasets for different types of regression models using the provided `simdat` function by specifying the sizes of the datasets, the effect size, residual variance, and the type of model. The function currently supports “mean”, “quantile”, “ols”, “logistic”, and “poisson” models. The `simdat` function generates a data.frame with three subsets: (1) an independent “training” set with additional observations used to fit a prediction model, and (2) “labeled” and (3) “unlabeled” sets which contain the observed and predicted outcomes and the simulated features of interest.
```r
#-- Load necessary libraries
library(ipd)
library(tidyverse)
library(patchwork)
```
```r
#-- Generate Example Data for Linear Regression
set.seed(123)
n <- c(10000, 500, 1000)
dat <- simdat(n = n, effect = 1, sigma_Y = 4, model = "ols", shift = 1, scale = 2)

#-- Print First 6 Rows of Training, Labeled, and Unlabeled Subsets
options(digits = 2)

head(dat[dat$set_label == "training", ])
#>       X1    X2    X3     X4     Y  f set_label
#> 1 -0.560 -0.56  0.82 -0.356 -0.15 NA  training
#> 2 -0.230  0.13 -1.54  0.040 -4.49 NA  training
#> 3  1.559  1.82 -0.59  1.152 -1.08 NA  training
#> 4  0.071  0.16 -0.18  1.485 -3.67 NA  training
#> 5  0.129 -0.72 -0.71  0.634  2.19 NA  training
#> 6  1.715  0.58 -0.54 -0.037 -1.42 NA  training

head(dat[dat$set_label == "labeled", ])
#>          X1      X2    X3    X4     Y      f set_label
#> 10001  2.37 -1.8984  0.20 -0.17  1.40  1.120   labeled
#> 10002 -0.17  1.7428  0.26 -2.05  3.56  0.017   labeled
#> 10003  0.93 -1.0947  0.76  1.25 -3.66  0.686   labeled
#> 10004 -0.57  0.1757  0.32  0.65 -0.56 -0.212   labeled
#> 10005  0.23  2.0620 -1.35  1.46 -0.82 -0.573   labeled
#> 10006  1.13 -0.0028  0.23 -0.24  7.30  0.579   labeled

head(dat[dat$set_label == "unlabeled", ])
#>          X1     X2    X3    X4    Y      f set_label
#> 10501  0.99 -3.280 -0.39  0.97  8.4  0.124 unlabeled
#> 10502 -0.66  0.142 -1.36 -0.22 -7.2 -1.040 unlabeled
#> 10503  0.58 -1.368 -1.73  0.15  5.6 -0.653 unlabeled
#> 10504 -0.14 -0.728  0.26 -0.23 -4.2 -0.047 unlabeled
#> 10505 -0.17 -0.068 -1.10  0.58  2.2 -0.693 unlabeled
#> 10506  0.58  0.514 -0.69  0.97 -1.2 -0.122 unlabeled
```
The `simdat` function provides observed and unobserved outcomes for both the labeled and unlabeled datasets, though in practice the observed outcomes are not in the unlabeled set. We can visualize the relationships between these variables:
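The figure referenced here was produced with `ggplot2` and `patchwork` (loaded earlier). A minimal sketch along these lines, with panel titles assumed to match the observations below, would be:

```r
#-- Sketch of the three-panel figure (assumes dat from simdat above)
dat_labeled <- dat[dat$set_label == "labeled", ]

p1 <- ggplot(dat_labeled, aes(x = X1, y = Y)) + geom_point() +
    ggtitle("A: True Outcome vs. Covariate")
p2 <- ggplot(dat_labeled, aes(x = X1, y = f)) + geom_point() +
    ggtitle("B: Predicted Outcome vs. Covariate")
p3 <- ggplot(dat_labeled, aes(x = f, y = Y)) + geom_point() +
    ggtitle("C: True vs. Predicted Outcome")

p1 + p2 + p3
```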
We can see that:
- The predicted outcomes are more correlated with the covariate than the true outcomes (panels A and B).
- The predicted outcomes are not perfect substitutes for the true outcomes (panel C).
We compare two non-IPD approaches to analyzing the data to the methods included in the `ipd` package. The two non-IPD benchmarks are the ‘naive’ method and the ‘classic’ method. The ‘naive’ method treats the predicted outcomes as if they were observed and regresses the predictions on the covariates of interest without calibration. The ‘classic’ method uses only the subset of labeled observations where we observe the true outcome. The IPD methods are listed in alphabetical order by method name. A summary comparison is provided in the table below, followed by the specific calls for each method:
```r
#>                    Estimate Std. Error
#> Naive                  0.49      0.015
#> Classic                1.10      0.192
#> Chen and Chen          1.11      0.186
#> PostPI (Bootstrap)     1.16      0.183
#> PostPI (Analytic)      1.13      0.191
#> PPI                    1.11      0.195
#> PPI All                1.11      0.195
#> PPI++                  1.10      0.190
#> PSPA                   1.09      0.190
```

We can see that the IPD methods have similar estimates and standard errors, while the ‘naive’ method has a different estimate and standard errors that are too small. We compare the two non-IPD approaches to the methods included in the `ipd` package in more detail below.
```r
#--- Fit the Naive Regression
lm(f ~ X1, data = dat[dat$set_label == "unlabeled", ]) |> summary()
#>
#> Call:
#> lm(formula = f ~ X1, data = dat[dat$set_label == "unlabeled",
#>     ])
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -1.2713 -0.3069 -0.0076  0.3173  1.4453
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -0.0805     0.0148   -5.42  7.4e-08 ***
#> X1            0.4924     0.0148   33.32  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.47 on 998 degrees of freedom
#> Multiple R-squared:  0.527,  Adjusted R-squared:  0.526
#> F-statistic: 1.11e+03 on 1 and 998 DF,  p-value: <2e-16
```
```r
#--- Fit the Classic Regression
lm(Y ~ X1, data = dat[dat$set_label == "labeled", ]) |> summary()
#>
#> Call:
#> lm(formula = Y ~ X1, data = dat[dat$set_label == "labeled", ])
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -15.262  -2.828  -0.094   2.821  11.685
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)    0.908      0.187    4.86  1.6e-06 ***
#> X1             1.097      0.192    5.71  1.9e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.2 on 498 degrees of freedom
#> Multiple R-squared:  0.0614, Adjusted R-squared:  0.0596
#> F-statistic: 32.6 on 1 and 498 DF,  p-value: 1.95e-08
```
You can fit the various IPD methods to your data and obtain summaries using the provided wrapper function, `ipd()`:
```r
#-- Specify the Formula
formula <- Y - f ~ X1

#-- Fit the Chen and Chen Correction
ipd::ipd(formula, method = "chen", model = "ols", data = dat,
    label = "set_label") |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: chen
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.880      0.182    4.83  1.3e-06 ***
#> X1             1.114      0.186    5.98  2.2e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
#-- Fit the PostPI Bootstrap Correction
nboot <- 200
ipd::ipd(formula, method = "postpi_boot", model = "ols", data = dat,
    label = "set_label", nboot = nboot) |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: postpi_boot
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.866      0.183    4.73  2.2e-06 ***
#> X1             1.164      0.183    6.38  1.8e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
#-- Fit the PostPI Analytic Correction
ipd::ipd(formula, method = "postpi_analytic", model = "ols", data = dat,
    label = "set_label") |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: postpi_analytic
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   -0.187      0.187   -1.00     0.32
#> X1             1.128      0.191    5.92  3.3e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
#-- Fit the PPI Correction
ipd::ipd(formula, method = "ppi", model = "ols", data = dat,
    label = "set_label") |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: ppi
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.890      0.183    4.87  1.1e-06 ***
#> X1             1.110      0.195    5.69  1.3e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
#-- Fit the PPI All Correction
ipd::ipd(formula, method = "ppi_a", model = "ols", data = dat,
    label = "set_label") |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: ppi_a
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.896      0.183    4.91  9.2e-07 ***
#> X1             1.106      0.195    5.67  1.4e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
#-- Fit the PPI++ Correction
ipd::ipd(formula, method = "ppi_plusplus", model = "ols", data = dat,
    label = "set_label") |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: ppi_plusplus
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.904      0.185    4.87  1.1e-06 ***
#> X1             1.100      0.190    5.78  7.5e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
#-- Fit the PSPA Correction
ipd::ipd(formula, method = "pspa", model = "ols", data = dat,
    label = "set_label") |> summary()
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: pspa
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.900      0.185    4.87  1.1e-06 ***
#> X1             1.095      0.190    5.76  8.5e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The package also provides custom `print`, `summary`, `tidy`, `glance`, and `augment` methods to facilitate easy model inspection:
```r
#-- Fit the PostPI Bootstrap Correction
nboot <- 200
fit_postpi <- ipd::ipd(formula, method = "postpi_boot", model = "ols",
    data = dat, label = "set_label", nboot = nboot)

#-- Print the Model
print(fit_postpi)
#> IPD inference summary
#>  Method: postpi_boot
#>  Model: ols
#>  Formula: Y - f ~ X1
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.860      0.183    4.71  2.5e-06 ***
#> X1             1.148      0.182    6.30  3.1e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#-- Summarize the Model
summ_fit_postpi <- summary(fit_postpi)

#-- Print the Model Summary
print(summ_fit_postpi)
#>
#> Call:
#> Y - f ~ X1
#>
#> Method: postpi_boot
#> Model: ols
#> Intercept: Yes
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    0.860      0.183    4.71  2.5e-06 ***
#> X1             1.148      0.182    6.30  3.1e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#-- Tidy the Model Output
tidy(fit_postpi)
#> # A tibble: 2 × 5
#>   term        estimate std.error conf.low conf.high
#>   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
#> 1 (Intercept)    0.860     0.183    0.502      1.22
#> 2 X1             1.15      0.182    0.790      1.50

#-- Get a One-Row Summary of the Model
glance(fit_postpi)
#> # A tibble: 1 × 6
#>   method      model intercept nobs_labeled nobs_unlabeled call
#>   <chr>       <chr> <lgl>            <int>          <int> <chr>
#> 1 postpi_boot ols   TRUE               500           1000 Y - f ~ X1

#-- Augment the Original Data with Fitted Values and Residuals
augmented_df <- augment(fit_postpi)
head(augmented_df)
#>          X1     X2    X3    X4    Y      f set_label .fitted .resid
#> 10501  0.99 -3.280 -0.39  0.97  8.4  0.124 unlabeled   1.992    6.5
#> 10502 -0.66  0.142 -1.36 -0.22 -7.2 -1.040 unlabeled   0.099   -7.3
#> 10503  0.58 -1.368 -1.73  0.15  5.6 -0.653 unlabeled   1.522    4.1
#> 10504 -0.14 -0.728  0.26 -0.23 -4.2 -0.047 unlabeled   0.702   -4.9
#> 10505 -0.17 -0.068 -1.10  0.58  2.2 -0.693 unlabeled   0.667    1.5
#> 10506  0.58  0.514 -0.69  0.97 -1.2 -0.122 unlabeled   1.521   -2.7
```
For additional details, we provide more use cases and examples in the package vignette:
```r
vignette("ipd")
```

For questions, comments, or any other feedback, please contact the developers (ssalerno@fredhutch.org).
Contributions are welcome! Please open an issue or submit a pull request on GitHub. The following method/model combinations are currently implemented:
| Method | Mean Estimation | Quantile Estimation | Linear Regression | Logistic Regression | Poisson Regression |
|---|---|---|---|---|---|
| Chen and Chen | ❌ | ❌ | ✅ | ❌ | ❌ |
| PostPI | ❌ | ❌ | ✅ | ✅ | ❌ |
| PPI | ✅ | ✅ | ✅ | ✅ | ❌ |
| PPI++ | ✅ | ✅ | ✅ | ✅ | ❌ |
| PPI All | ❌ | ❌ | ✅ | ❌ | ❌ |
| PSPA | ✅ | ✅ | ✅ | ✅ | ✅ |
This package is licensed under the MIT License.
```r
sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#>  [1] patchwork_1.3.0 lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1
#>  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1
#>  [9] tibble_3.2.1    ggplot2_3.5.2   tidyverse_2.0.0 ipd_0.99.0
#>
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6         xfun_0.52            recipes_1.2.1
#>  [4] lattice_0.22-6       tzdb_0.5.0           vctrs_0.6.5
#>  [7] tools_4.4.1          generics_0.1.3       stats4_4.4.1
#> [10] parallel_4.4.1       pkgconfig_2.0.3      ModelMetrics_1.2.2.2
#> [13] Matrix_1.7-0         data.table_1.17.0    lifecycle_1.0.4
#> [16] farver_2.1.2         compiler_4.4.1       munsell_0.5.1
#> [19] codetools_0.2-20     htmltools_0.5.8.1    class_7.3-22
#> [22] yaml_2.3.10          prodlim_2024.06.25   pillar_1.10.2
#> [25] MASS_7.3-60.2        gower_1.0.2          iterators_1.0.14
#> [28] rpart_4.1.23         foreach_1.5.2        nlme_3.1-164
#> [31] parallelly_1.43.0    lava_1.8.1           tidyselect_1.2.1
#> [34] digest_0.6.37        stringi_1.8.7        future_1.40.0
#> [37] reshape2_1.4.4       listenv_0.9.1        labeling_0.4.3
#> [40] splines_4.4.1        fastmap_1.2.0        grid_4.4.1
#> [43] colorspace_2.1-1     cli_3.6.3            magrittr_2.0.3
#> [46] utf8_1.2.4           randomForest_4.7-1.2 survival_3.8-3
#> [49] future.apply_1.11.3  withr_3.0.2          scales_1.3.0
#> [52] timechange_0.3.0     rmarkdown_2.29       globals_0.16.3
#> [55] nnet_7.3-19          timeDate_4041.110    ranger_0.17.0
#> [58] hms_1.1.3            gam_1.22-5           evaluate_1.0.3
#> [61] knitr_1.50           hardhat_1.4.1        caret_7.0-1
#> [64] mgcv_1.9-1           rlang_1.1.4          Rcpp_1.0.13-1
#> [67] glue_1.8.0           BiocGenerics_0.50.0  pROC_1.18.5
#> [70] ipred_0.9-15         rstudioapi_0.17.1    R6_2.6.1
#> [73] plyr_1.8.9
```