
OTrecod package

A package dedicated to data fusion

Introduction

The OTrecod package gives access to a set of original functions dedicated to data fusion.

Consider two separate data sources with no overlapping units that share only a set of common variables X and the same target information, which is never jointly observed and is encoded differently in each source (Y in A and Z in B). From such data, the functions OT_outcome and OT_joint aim to provide users with a complete synthetic database in which the missing information is available for every unit.

This recoding problem is solved using optimal transportation theory, which provides a map that transfers the joint distribution of the first target variable and X to the joint distribution of the second one and X, or the reverse. The algorithms used in these two functions come from references (1) and (2).
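As a point of reference, the generic discrete optimal transportation (Kantorovich) problem can be written as below. This is the textbook formulation, not necessarily the exact program solved in (1) and (2); the symbols \mu, \nu, c and \gamma are notation introduced here for illustration only.

```latex
\min_{\gamma \ge 0} \; \sum_{i}\sum_{j} c_{ij}\,\gamma_{ij}
\quad \text{subject to} \quad
\sum_{j} \gamma_{ij} = \mu_i \;\; \forall i,
\qquad
\sum_{i} \gamma_{ij} = \nu_j \;\; \forall j
```

Here \mu and \nu are the two discrete distributions to be matched, c_{ij} is the cost of moving mass from state i to state j, and \gamma is the transport plan whose margins must reproduce \mu and \nu.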

Package installation

If the OTrecod package is not installed in their current R version, users can install it with the standard instruction:

install.packages("OTrecod")

Each time an R session is opened, the OTrecod library must be loaded with:

library(OTrecod)

The development version of OTrecod can currently be installed from GitHub with:

# Install development version from GitHub
devtools::install_github("otrecoding/OTrecod")

Database examples and expected structure before data fusion

The available databases called tab_test and simu_data correspond to overlaid databases used as examples in the documentation of all the functions. Their structures can help users understand the database structure expected as input argument of the functions OT_outcome and OT_joint. The first rows of the two overlaid data sources of simu_data are displayed below to illustrate the expected database structure:

data(simu_data)
dim(simu_data)
[1] 700   8
simu_data[c(1:5, 301:305), ]
    DB     Yb1 Yb2 Gender Treatment Dosage Smoking      Age
1    A [40-60[  NA Female     Trt A   Dos3     YES 65.44273
2    A [20-40]  NA   Male      <NA>   Dos2      NO 51.78596
3    A [40-60[  NA Female   Placebo   Dos2     YES 49.10844
4    A [40-60[  NA Female     Trt B   Dos4    <NA> 56.43524
5    A [40-60[  NA Female     Trt A   Dos4     YES 44.77365
301  B    <NA>   5 Female   Placebo   Dos2     YES 44.58233
302  B    <NA>   1 Female     Trt B   Dos4    <NA> 65.23921
303  B    <NA>   2 Female   Placebo   <NA>      NO 51.64228
304  B    <NA>   2 Female     Trt A   <NA>      NO 50.15125
305  B    <NA>   1 Female     Trt B   Dos4     YES 61.53242

The first column, called DB, is the database identifier (two data sources, labeled A and B here, with data source A placed above data source B). The second column, called Yb1, is the target variable of data source A. The values of Yb1 in data source B are missing and will be predicted using an optimal transportation algorithm integrated in one of the two functions OT_outcome and OT_joint. In the same way, the variable Yb2 (third column) is the target variable of data source B, whose values in A are unknown. These missing values can also be predicted using OT_outcome and OT_joint.

The presence of these three variables is essential in any database dedicated to data fusion in the OTrecod package, whatever their names and whatever their order in the database. The following columns correspond to shared variables of any type, complete or not. Note that continuous variables (like age in years) are not allowed with the OT_joint function.
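The layout described above can be sketched with a small hand-built data frame. This is only an illustration in base R: the data, the column names (DB, Y, Z, Gender, Smoking) and the object names toy_A, toy_B and toy are invented here and are not part of the package.

```r
# Hypothetical toy data, invented for illustration only.
# Source A observes Y but not Z; source B observes Z but not Y.
toy_A <- data.frame(
  DB      = rep("A", 3),
  Y       = c("low", "high", "low"),
  Z       = NA_integer_,                 # target of B, unknown in A
  Gender  = c("Female", "Male", "Female"),
  Smoking = c("YES", NA, "NO"),          # shared variables may be incomplete
  stringsAsFactors = FALSE
)
toy_B <- data.frame(
  DB      = rep("B", 2),
  Y       = NA_character_,               # target of A, unknown in B
  Z       = c(1L, 3L),
  Gender  = c("Male", "Female"),
  Smoking = c("NO", "YES"),
  stringsAsFactors = FALSE
)

# Overlay the two sources: A on top, B below
toy <- rbind(toy_A, toy_B)

# Give each column an explicit type
toy$DB      <- factor(toy$DB)
toy$Y       <- factor(toy$Y, levels = c("low", "high"), ordered = TRUE)
toy$Gender  <- factor(toy$Gender)
toy$Smoking <- factor(toy$Smoking)

str(toy)
```

The first three columns play the roles of identifier and target variables, and the remaining columns are the shared variables.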

Support functions are available in the package (merge_dbs, imput_cov) to assist users in this preparation.

Finally, the supplementary datasets api29 and api35 are simple datasets extracted from the API program (https://www.cde.ca.gov/re/pr/api.asp) to allow users to practice with convenient databases.

Support functions

Among the available functions, the OTrecod package provides a set of support functions to assist users in each step of their data fusion projects.

merge_dbs

The merge_dbs function is a pre-processing data fusion function dedicated to the harmonization of two data sources. By default, variables (other than target variables) with the same labels are considered as shared between the two databases. The merge_dbs function detects potential discrepancies between the variables before merging by:

first, excluding variables whose labels differ between the first database and the second one, in either direction.
excluding a priori shared variables with different types.
excluding a priori shared factors with different levels.
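The three exclusion rules listed above can be sketched in base R as follows. The helper screen_shared() and the toy data frames db1 and db2 are hypothetical and written only for illustration; this is not the code used inside merge_dbs.

```r
# Hypothetical helper, written for illustration; not the package's code.
screen_shared <- function(db1, db2, targets) {
  # Rule 1: keep only labels present in both databases (targets excluded)
  common <- setdiff(intersect(names(db1), names(db2)), targets)

  # Rule 2: flag variables whose classes differ between the databases
  same_type <- vapply(common, function(v) {
    identical(class(db1[[v]]), class(db2[[v]]))
  }, logical(1))

  # Rule 3: flag factors whose sets of levels differ
  same_levels <- vapply(common, function(v) {
    if (!is.factor(db1[[v]]) || !is.factor(db2[[v]])) return(TRUE)
    setequal(levels(db1[[v]]), levels(db2[[v]]))
  }, logical(1))

  list(
    kept           = common[same_type & same_levels],
    type_mismatch  = common[!same_type],
    level_mismatch = common[same_type & !same_levels]
  )
}

# Toy inputs: Weight exists only in db1, Age has different types,
# Dose has different levels, Sex is compatible.
db1 <- data.frame(Y = factor(c("a", "b")), Sex = factor(c("F", "M")),
                  Age = c(40, 52), Dose = factor(c("D1", "D2")),
                  Weight = c(70, 81))
db2 <- data.frame(Z = factor(c("1", "2")), Sex = factor(c("M", "F")),
                  Age = c("40-50", "50-60"), Dose = factor(c("D1", "D3")))

res <- screen_shared(db1, db2, targets = c("Y", "Z"))
print(res)
```

Returning the rejected variables alongside the kept ones mirrors the idea that the user, not the function, decides how to reconcile them.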

The current version of the function does not offer automatic reconciliation actions to reintroduce the problematic variables, but its output gives users enough information to do so themselves if necessary. The call of the merge_dbs function is currently:

merge_dbs = function(DB1, DB2, row_ID1 = NULL, row_ID2 = NULL,
                     NAME_Y, NAME_Z,
                     order_levels_Y = levels(DB1[, NAME_Y]),
                     order_levels_Z = levels(DB2[, NAME_Z]),
                     ordinal_DB1 = NULL, ordinal_DB2 = NULL,
                     impute = "NO", R_MICE = 5, NCP_FAMD = 3,
                     seed_choice = sample(1:1000000, 1))

In particular, the merge_dbs function returns a single database, the result of overlaying the two initial data sources, in the structure expected by the OT_outcome and OT_joint functions.
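A possible call is sketched below, under stated assumptions: simu_data is split back into two separate sources, Yb2 is converted to a factor beforehand so that the default order_levels_Z is defined, and the column indexes passed to ordinal_DB1 and ordinal_DB2 designate the target and Dosage. The object names data_A, data_B and merged are illustrative. The guard simply skips the example when the package is not installed.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # Rebuild two separate sources: one target each, plus the shared variables
  data_A <- simu_data[simu_data$DB == "A", c(2, 4:8)]  # Yb1 + shared variables
  data_B <- simu_data[simu_data$DB == "B", c(3, 4:8)]  # Yb2 + shared variables

  # Assumption: the target is supplied as a factor
  data_B$Yb2 <- as.factor(data_B$Yb2)

  merged <- merge_dbs(data_A, data_B,
                      NAME_Y = "Yb1", NAME_Z = "Yb2",
                      ordinal_DB1 = c(1, 4),   # Yb1 and Dosage in data_A
                      ordinal_DB2 = c(1, 4))   # Yb2 and Dosage in data_B

  # Inspect the components of the result, including the overlaid database
  str(merged, max.level = 1)
}
```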

select_pred

The select_pred function is a pre-processing data fusion function dedicated to the selection of matching variables. This selection is essential when the initial set of shared variables is large, but also because the choice of predictors greatly influences the quality of the data fusion, whichever optimal transportation algorithm is chosen afterwards.

The call of the select_pred function is currently:

select_pred = function(databa, Y = NULL, Z = NULL, ID = 1, OUT = "Y",
                       quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,
                       convert_num = NULL, convert_class = NULL,
                       thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,
                       RF = TRUE, RF_ntree = 500, RF_condi = FALSE,
                       RF_condi_thr = 0.20, RF_SEED = sample(1:1000000, 1))
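As a hedged illustration, the call below searches for predictors of Yb1 in simu_data without the random forest step. The mapping of column indexes to variable types (Age as quantitative, Yb1, Yb2 and Dosage as ordinal, the rest as nominal) is an assumption made for this sketch, and sel is an illustrative object name.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  sel <- select_pred(simu_data,
                     Y = "Yb1", Z = "Yb2", ID = 1, OUT = "Y",
                     quanti  = 8,              # Age
                     nominal = c(1, 4:5, 7),   # DB, Gender, Treatment, Smoking
                     ordinal = c(2:3, 6),      # Yb1, Yb2, Dosage
                     RF = FALSE)               # skip the random forest step

  # List the components returned by the function
  print(names(sel))
}
```

Setting OUT = "Z" instead would run the same selection for the second target variable.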

verif_OT

The verif_OT function is a post-processing data fusion function dedicated to the validation of the fusion. The function provides a set of tools to assess the quality of the optimal transportation recoding proposed by the algorithms to predict the missing information of the target variables in one or both data sources.

The call of the verif_OT function is currently:

verif_OT = function(ot_out, group.class = FALSE, ordinal = TRUE,
                    stab.prob = FALSE, min.neigb = 1, R = 10,
                    seed.stab = sample(1:1000000, 1))
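A minimal sketch of the validation step follows. It assumes that verif_OT accepts the object returned by OT_outcome as its first argument, uses a subsample of simu_data to keep the run short, and keeps every other argument at its default value. The object names simu_dat, ot_res and check are illustrative.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # Subsample: 200 rows from source A and 200 rows from source B
  simu_dat <- simu_data[c(1:200, 301:500), ]

  # Fusion step, with default distance and prediction method
  ot_res <- OT_outcome(simu_dat,
                       quanti  = c(3, 8),       # Yb2 (stored as numeric), Age
                       nominal = c(1, 4:5, 7),  # DB, Gender, Treatment, Smoking
                       ordinal = c(2, 6))       # Yb1, Dosage

  # Validation step on the fusion result
  check <- verif_OT(ot_res)
  print(names(check))
}
```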

Optimal transportation functions

The OTrecod package provides two algorithms that use optimal transportation theory to solve recoding problems in data fusion contexts (see (1) and (2) for more details). Each algorithm is implemented in one function, and each function returns a single synthetic database in which the two initial data sources are overlaid and the missing information of one or both target variables is fully completed.

Each of the two algorithms also offers enrichments that relax the initial distributional constraints and add regularization terms, as described in (2).

OT_outcome

The OT_outcome function can provide individual predictions of the incomplete target variables by considering the recoding problem involving only optimal transportation of outcomes (see (1) and (2) for more details).

The call of the OT_outcome function is:

OT_outcome = function(datab, index_DB_Y_Z = 1:3,
                      quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,
                      convert.num = NULL, convert.class = NULL,
                      FAMD.coord = "NO", FAMD.perc = 0.8,
                      dist.choice = "E", percent.knn = 1, maxrelax = 0,
                      indiv.method = "sequential", prox.dist = 0.30,
                      solvR = "glpk", which.DB = "BOTH")
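The following sketch shows one way the function might be called on a subsample of simu_data. The choice of the Manhattan distance ("M") and the output element names DATA1_OT and DATA2_OT are assumptions based on the package documentation rather than on this text; outc and simu_dat are illustrative object names.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # Subsample to keep the run short: 200 rows from A, 200 rows from B
  simu_dat <- simu_data[c(1:200, 301:500), ]

  outc <- OT_outcome(simu_dat,
                     quanti  = c(3, 8),        # Yb2 (stored as numeric), Age
                     nominal = c(1, 4:5, 7),   # DB, Gender, Treatment, Smoking
                     ordinal = c(2, 6),        # Yb1, Dosage
                     dist.choice  = "M",       # Manhattan distance (assumed option)
                     indiv.method = "sequential",
                     which.DB     = "BOTH")    # predict Z in A and Y in B

  # Assumed element names: completed versions of sources A and B
  print(head(outc$DATA1_OT))
  print(head(outc$DATA2_OT))
}
```

With which.DB = "BOTH", predictions are requested for both data sources, so each completed source should carry one extra column of predicted values.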

OT_joint

The OT_joint function can provide individual predictions of the incomplete target variables by considering the recoding problem involving optimal transportation of shared variables and outcomes (see (2) for more details).

The call of the OT_joint function is:

OT_joint = function(datab, index_DB_Y_Z = 1:3,
                    nominal = NULL, ordinal = NULL, logic = NULL,
                    convert.num = NULL, convert.class = NULL,
                    dist.choice = "E", percent.knn = 1, maxrelax = 0,
                    lambda.reg = 0.0, prox.X = 0.10,
                    solvR = "glpk", which.DB = "BOTH")
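Because continuous variables are not accepted, a call on simu_data needs Age to be discretized first. The sketch below assumes that convert.num = 8 with convert.class = 3 asks the function to cut column 8 (Age) into 3 classes, and that "H" selects the Hamming distance. The subsample and the object names simu_dat and joint_res are illustrative.

```r
if (requireNamespace("OTrecod", quietly = TRUE)) {
  library(OTrecod)
  data(simu_data)

  # Small subsample to limit computation time
  simu_dat <- simu_data[c(1:100, 301:400), ]

  joint_res <- OT_joint(simu_dat,
                        nominal = c(1, 4:5, 7),  # DB, Gender, Treatment, Smoking
                        ordinal = c(2:3, 6),     # Yb1, Yb2, Dosage
                        convert.num   = 8,       # Age is continuous ...
                        convert.class = 3,       # ... so it is cut into 3 classes
                        dist.choice = "H",       # Hamming distance (assumed option)
                        which.DB    = "B")       # predict only in source B

  print(names(joint_res))
}
```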

References

(1) Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679.
(2) Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.
