The OTrecod package gives access to a set of original functions dedicated to data fusion.
Consider two separate data sources with no overlapping units that share only a set of common variables X and the same target information, which is never jointly observed because it is encoded differently in each source (Y in A and Z in B). From such data, the functions OT_outcome and OT_joint aim to provide users with a complete synthetic database in which the missing information is available for every unit.
This recoding problem is solved using optimal transportation theory, which provides a map that transfers the joint distribution of the first target variable and X to the joint distribution of the second one and X, or conversely. The algorithms used in these two functions come from references (1) and (2).
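To make the idea of a map between two encodings more concrete, the base R sketch below builds a coupling, that is, a joint table whose row sums and column sums match two given marginal distributions. The marginals `p_Y` and `p_Z` and the helper `nw_corner` are hypothetical, and the north-west corner rule used here only yields a feasible coupling: it is not the optimization solved by OT_outcome or OT_joint, which also account for the shared variables X.

```r
# Hypothetical marginal distributions of the two encodings (toy values)
p_Y <- c(low = 0.5, high = 0.5)          # encoding observed in source A
p_Z <- c(z1 = 0.2, z2 = 0.3, z3 = 0.5)   # encoding observed in source B

# North-west corner rule: fills a table with row sums p and column sums q.
# It returns a feasible coupling, not a cost-minimizing one.
nw_corner <- function(p, q) {
  plan <- matrix(0, length(p), length(q),
                 dimnames = list(names(p), names(q)))
  i <- 1
  j <- 1
  while (i <= length(p) && j <= length(q)) {
    m <- min(p[i], q[j])        # mass that can still be moved from i to j
    plan[i, j] <- m
    p[i] <- p[i] - m
    q[j] <- q[j] - m
    if (p[i] <= 1e-12) i <- i + 1 else j <- j + 1
  }
  plan
}

plan <- nw_corner(p_Y, p_Z)

# Conditional distribution of Z given Y implied by the coupling:
# each row is divided by its own total
cond_Z_given_Y <- plan / rowSums(plan)
print(plan)
print(cond_Z_given_Y)
```

Reading the rows of `cond_Z_given_Y` gives, for each level of Y, a distribution over the levels of Z, which is the kind of object a recoding rule can be derived from.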
If the OTrecod package is not installed in their current R version, users can install it with the standard instruction:
    install.packages("OTrecod")
Each time an R session is opened, the OTrecod library must be loaded with:
    library(OTrecod)
The development version of OTrecod can currently be installed from GitHub with:
    # Install development version from GitHub
    devtools::install_github("otrecoding/OTrecod")
The available databases called tab_test and simu_data correspond to overlaid databases used as examples in the documentation of all the functions. Their structures can help users understand the database structure expected as input by the functions OT_outcome and OT_joint. The first rows of the two overlaid data sources of simu_data are displayed below to show this expected structure:
    data(simu_data)
    dim(simu_data)
    [1] 700   8
    simu_data[c(1:5, 301:305), ]
        DB     Yb1 Yb2 Gender Treatment Dosage Smoking      Age
    1    A [40-60[  NA Female     Trt A   Dos3     YES 65.44273
    2    A [20-40]  NA   Male      <NA>   Dos2      NO 51.78596
    3    A [40-60[  NA Female   Placebo   Dos2     YES 49.10844
    4    A [40-60[  NA Female     Trt B   Dos4    <NA> 56.43524
    5    A [40-60[  NA Female     Trt A   Dos4     YES 44.77365
    301  B    <NA>   5 Female   Placebo   Dos2     YES 44.58233
    302  B    <NA>   1 Female     Trt B   Dos4    <NA> 65.23921
    303  B    <NA>   2 Female   Placebo   <NA>      NO 51.64228
    304  B    <NA>   2 Female     Trt A   <NA>      NO 50.15125
    305  B    <NA>   1 Female     Trt B   Dos4     YES 61.53242
The first column, DB, is the database identifier (two data sources, labeled A and B in this example, with data source A placed above data source B). The second column, Yb1, is the target variable of data source A. The values of Yb1 in data source B are missing and will be predicted using an optimal transportation algorithm implemented in one of the two functions OT_outcome and OT_joint. In the same way, the variable Yb2 (third column) is the target variable of data source B, whose values in A are unknown. These missing values can also be predicted using OT_outcome and OT_joint.
The presence of these three variables is essential in any database dedicated to data fusion in the OTrecod package, whatever their names and whatever their order in the database. The subsequent columns correspond to shared variables of any type, complete or not. Note that continuous variables (like age in years) are not allowed with the OT_joint function.
Support functions are available in the package (merge_dbs, imput_cov) to assist users in this preparation.
Finally, the supplementary datasets api29 and api35 are simple datasets extracted from the API program (https://www.cde.ca.gov/re/pr/api.asp) that give users convenient databases to practice with.
Among the available functions, the OTrecod package provides a set of support functions to assist users in each step of their data fusion projects.
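As an illustration of this layout, the base R sketch below builds a small overlaid database by hand. All column names and values are hypothetical toy data, chosen only to mirror the structure described above: an identifier column, one target per source with missing values in the other source, then shared variables that may themselves be incomplete.

```r
# Hypothetical sizes of the two data sources
n_A <- 4
n_B <- 3

toy_db <- data.frame(
  # database identifier: source A stacked above source B
  DB = factor(rep(c("A", "B"), times = c(n_A, n_B))),
  # target observed only in A (missing for every unit of B)
  Y = factor(c("low", "high", "low", "high", NA, NA, NA),
             levels = c("low", "high")),
  # target observed only in B (missing for every unit of A)
  Z = c(NA, NA, NA, NA, 1, 3, 2),
  # shared variables, possibly incomplete
  Gender  = factor(c("F", "M", "F", "F", "M", "F", "M")),
  Smoking = factor(c("YES", "NO", NA, "YES", "NO", "NO", "YES"))
)

# Missingness pattern of the two targets by data source
print(table(toy_db$DB, is.na(toy_db$Y)))
print(table(toy_db$DB, is.na(toy_db$Z)))
```

With this layout, the three key columns sit in positions 1 to 3, which matches the default `index_DB_Y_Z = 1:3` visible in the signatures of OT_outcome and OT_joint.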
The merge_dbs function is a pre-processing data fusion function dedicated to the harmonization of two data sources. By default, variables (other than target variables) with the same labels are considered as shared between the two databases. The merge_dbs function detects potential discrepancies between the variables before merging. The call of the merge_dbs function is currently:
    merge_dbs = function(DB1, DB2, row_ID1 = NULL, row_ID2 = NULL, NAME_Y, NAME_Z,
                         order_levels_Y = levels(DB1[, NAME_Y]),
                         order_levels_Z = levels(DB2[, NAME_Z]),
                         ordinal_DB1 = NULL, ordinal_DB2 = NULL,
                         impute = "NO", R_MICE = 5, NCP_FAMD = 3,
                         seed_choice = sample(1:1000000, 1))
The merge_dbs function notably outputs a single database, the result of overlaying the two initial data sources, in the structure expected by the OT_outcome and OT_joint functions.
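The kind of discrepancy that can arise between two sources can be sketched by hand in base R. The example below does not call merge_dbs and is not a description of its internals; `DB1`, `DB2` and their contents are hypothetical. It only shows how variables with the same label can still differ in type or in factor levels.

```r
# Two hypothetical data sources (toy values)
DB1 <- data.frame(
  Y = factor(c("low", "high", "low")),
  Gender = factor(c("F", "M", "F")),
  Age = c(34, 51, 47),
  Weight = c(60, 82, 75)                         # present in DB1 only
)
DB2 <- data.frame(
  Z = factor(c(1, 3, 2)),
  Gender = factor(c("Female", "Male", "Male")),  # same label, other levels
  Age = c("29", "63", "40"),                     # same label, other type
  stringsAsFactors = FALSE
)

# Candidate shared variables: same label in both sources, targets excluded
shared <- setdiff(intersect(names(DB1), names(DB2)), c("Y", "Z"))

# Do the two sources store each shared variable with the same class?
same_class <- sapply(shared, function(v) {
  identical(class(DB1[[v]]), class(DB2[[v]]))
})

# For factors, do the levels agree? (NA when the check does not apply)
same_levels <- sapply(shared, function(v) {
  if (is.factor(DB1[[v]]) && is.factor(DB2[[v]])) {
    identical(levels(DB1[[v]]), levels(DB2[[v]]))
  } else {
    NA
  }
})

print(data.frame(variable = shared, same_class, same_levels))
```

Here `Weight` is never considered shared because it exists in one source only, while `Gender` and `Age` share a label but would need harmonizing before the two sources are overlaid.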
The select_pred function is a pre-processing data fusion function dedicated to the selection of matching variables. This selection is essential when the initial set of shared variables is large, but also because the choice of predictors greatly influences the quality of the data fusion, whichever optimal transportation algorithm is chosen afterwards.
The call of the select_pred function is currently:
    select_pred = function(databa, Y = NULL, Z = NULL, ID = 1, OUT = "Y",
                           quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,
                           convert_num = NULL, convert_class = NULL,
                           thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
                           RF = TRUE, RF_ntree = 500, RF_condi = FALSE,
                           RF_condi_thr = 0.20, RF_SEED = sample(1:1000000, 1))
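To give an intuition of what screening candidate predictors can look like, the base R sketch below ranks two categorical variables by their association with a target, using Cramér's V. This is a generic measure chosen for illustration; the text above does not state which criteria select_pred applies, so the helper `cramers_v` and the toy variables should be read as assumptions rather than as the package's method.

```r
# Cramér's V between two categorical vectors (0 = no association, 1 = perfect)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Small toy tables trigger an approximation warning that is irrelevant here
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Hypothetical target and candidate predictors (toy values)
Y      <- rep(c("low", "high"), each = 10)
strong <- rep(c("a", "b"), each = 10)    # perfectly aligned with Y
weak   <- rep(c("u", "v"), times = 10)   # balanced within each level of Y

assoc <- c(strong = cramers_v(strong, Y), weak = cramers_v(weak, Y))
print(sort(assoc, decreasing = TRUE))
```

A predictor such as `weak`, whose distribution is identical across the levels of the target, carries no information for the fusion and is a natural candidate for removal.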
The verif_OT function is a post-processing data fusion function dedicated to the validation of the fusion. The function provides a set of tools to assess the quality of the optimal transportation recoding proposed by the algorithms to predict the missing information of the target variables in one or both data sources.
The call of the verif_OT function is currently:
    verif_OT = function(ot_out, group.class = FALSE, ordinal = TRUE, stab.prob = FALSE,
                        min.neigb = 1, R = 10, seed.stab = sample(1:1000000, 1))
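One simple check that can be run by hand after a fusion is to compare the distribution of a target observed in one source with the distribution of its predictions in the other source. The base R sketch below uses the Hellinger distance for that purpose. It is a generic illustration with hypothetical vectors, not a description of what verif_OT computes.

```r
# Hellinger distance between the empirical distributions of two
# categorical vectors: 0 for identical distributions, 1 for disjoint supports
hellinger <- function(x, y) {
  lev <- union(levels(factor(x)), levels(factor(y)))
  p <- prop.table(table(factor(x, levels = lev)))
  q <- prop.table(table(factor(y, levels = lev)))
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Hypothetical values: Z observed in source B, Z predicted in source A
observed_Z_in_B  <- c(1, 1, 2, 2, 3, 3, 3, 3)
predicted_Z_in_A <- c(1, 1, 1, 2, 3, 3, 3, 3)

h <- hellinger(observed_Z_in_B, predicted_Z_in_A)
print(h)
```

A value close to 0 indicates that the predicted distribution resembles the observed one, which is a necessary but not sufficient sign of a sensible recoding.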
The OTrecod package provides two algorithms that use optimal transportation theory to solve recoding problems in data fusion contexts (see (1) and (2) for more details). Each algorithm is implemented in one function, and each function outputs a single synthetic database in which the two initial data sources are overlaid and the missing information of one or both target variables is fully completed.
Each of the two algorithms also offers enrichments that relax the initial distributional constraints and add regularization terms, as described in (2).
The OT_outcome function can provide individual predictions of the incomplete target variables by considering the recoding problem involving only optimal transportation of outcomes (see (1) and (2) for more details).
The call of the OT_outcome function is:
    OT_outcome = function(datab, index_DB_Y_Z = 1:3,
                          quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,
                          convert.num = NULL, convert.class = NULL,
                          FAMD.coord = "NO", FAMD.perc = 0.8,
                          dist.choice = "E", percent.knn = 1, maxrelax = 0,
                          indiv.method = "sequential", prox.dist = 0.30,
                          solvR = "glpk", which.DB = "BOTH")
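A possible call on the simu_data example is sketched below. The argument names come from the signature above, but the assignment of columns to `quanti`, `nominal` and `ordinal` is an assumption read off the preview of simu_data (DB, Gender, Treatment and Smoking as nominal, Yb1 and Dosage as ordinal, Yb2 and Age as quantitative), and the call is guarded so that nothing runs when the package is not installed. The structure of the returned object is inspected rather than assumed.

```r
# Runs only if OTrecod (and therefore its solver dependencies) is installed
if (requireNamespace("OTrecod", quietly = TRUE)) {
  data("simu_data", package = "OTrecod")

  out <- OTrecod::OT_outcome(
    simu_data,
    index_DB_Y_Z = 1:3,          # positions of DB, Yb1 and Yb2
    quanti  = c(3, 8),           # assumed: Yb2 and Age
    nominal = c(1, 4:5, 7),      # assumed: DB, Gender, Treatment, Smoking
    ordinal = c(2, 6),           # assumed: Yb1 and Dosage
    dist.choice = "E",
    indiv.method = "sequential",
    which.DB = "BOTH"
  )

  # Inspect the top-level components of the result
  str(out, max.level = 1)
}
```

Depending on the machine and the solver, this call may take a noticeable amount of time on the full 700 rows; working on a subset of rows from each source is a reasonable way to experiment first.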
The OT_joint function can provide individual predictions of the incomplete target variables by considering the recoding problem involving optimal transportation of shared variables and outcomes (see (2) for more details).
The call of the OT_joint function is:
    OT_joint = function(datab, index_DB_Y_Z = 1:3,
                        nominal = NULL, ordinal = NULL, logic = NULL,
                        convert.num = NULL, convert.class = NULL,
                        dist.choice = "E", percent.knn = 1, maxrelax = 0,
                        lambda.reg = 0.0, prox.X = 0.10,
                        solvR = "glpk", which.DB = "BOTH")
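Because OT_joint does not accept continuous shared variables, such variables have to be turned into categories at some point. The signature above exposes `convert.num` and `convert.class`, whose exact behavior is not described here; the base R sketch below shows one way to do the discretization by hand beforehand, using tertiles. The choice of three classes, the labels and the `age` values (taken from the preview of simu_data, plus one missing value) are illustrative assumptions.

```r
# Ages from the simu_data preview, plus one missing value for illustration
age <- c(65.44273, 51.78596, 49.10844, 56.43524, 44.77365,
         44.58233, 65.23921, 51.64228, 50.15125, 61.53242, NA)

# Tertile cut points computed on the observed values
breaks <- quantile(age, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)

# Ordered factor with three classes; missing ages stay missing
age_class <- cut(age, breaks = breaks, include.lowest = TRUE,
                 labels = c("low", "mid", "high"), ordered_result = TRUE)

print(table(age_class, useNA = "ifany"))
```

The resulting ordered factor can then be declared among the ordinal shared variables instead of the original continuous column.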
(1) Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679.
(2) Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.