OTrecod: An R Package for Data Fusion using Optimal Transportation Theory

The advances of information technologies often confront users with large amounts of data which must be easily integrated. In this context, creating a single database from multiple separate data sources can appear as an attractive but complex task when the same information of interest is stored in at least two distinct encodings. In this situation, merging the data sources consists in finding a common recoding scale to fill the incomplete information in a synthetic database. The OTrecod package provides R users with two functions dedicated to solving this recoding problem using optimal transportation theory. Specific arguments of these functions enrich the algorithms by relaxing distributional constraints or adding a regularization term to make the data fusion more flexible. The OTrecod package also provides a set of support functions dedicated to the harmonization of separate data sources, the handling of incomplete information, and the selection of matching variables. This paper gives all the keys to quickly understand and master the original algorithms implemented in the OTrecod package, assisting the user step by step in their data fusion project.


Introduction
The large amount of data produced by information technology requires flexible tools to facilitate its handling. Among them, the field of data fusion (Hall and Llinas 1997; Klein 2004; Castanedo 2013), also known as statistical matching (Adamek 1994; D'Orazio, Di Zio, and Scanu 2006; Vantaggi 2008), aims to integrate the overall information from multiple data sources for a better understanding of the phenomena that interact in the population.
Assume that two heterogeneous databases A and B share a set of common variables X, while an information of interest is encoded in two distinct scales: Y in A and Z in B. If Y and Z are never jointly observed, a basic data fusion objective consists in recoding Y in the same scale as Z (or conversely), to allow the fusion of the databases as illustrated in Table 1. Providing a solution to this recoding problem is often very attractive because it gives access to more accurate and consistent information in a unique and bigger database, with no additional cost. Despite this, if we exclude the R data integration packages applied in the context of genetics, like the MultiDataSet package (Hernandez-Ferrer et al. 2017), the OMICsPCA package (Das and Tripathy 2022), or the mixOmics package (F et al. 2017), which are often only effective for quantitative data integration, the StatMatch package (D'Orazio 2022) is, to our knowledge, actually the only one that provides a concrete solution to the problem, using hot deck imputation procedures. The main reason for this relative deficiency is that this problem is, in fact, often assimilated to and solved as a missing data imputation problem. Following this idea, a very large number of works and reference books now exist about the handling of missing data (Zhu, Wang, and Samworth 2019; Little and Rubin 2019). Moreover, we can enumerate several R packages, sorted by type of imputation method (Mayer et al. 2019): mice (van Buuren and Groothuis-Oudshoorn 2011) and missForest (Stekhoven and Bühlmann 2012; Stekhoven 2022), which use conditional models, and softImpute (Hastie and Mazumder 2021) and missMDA (Josse and Husson 2016), which apply low-rank based models. For all these packages, imputation performance can fluctuate a lot according to the structure and the proportion of non-response encountered.
Regression and non-parametric imputation approaches (like hot-deck methods from the donor-based family) use only partially, or not at all, the available information of the two databases to provide the individual predictions. Contrary to our approach, all these methods only use the set of shared variables X for the prediction of Z in A (or conversely of Y in B), without really taking into account Y and its interrelations with X in their prediction process.
The purpose of this paper is to present a new package to the R community called OTrecod, which provides simple and intuitive access to two original algorithms (Garès et al. 2020; Garès and Omer 2022) dedicated to solving these recoding problems in the data fusion context, by considering them as applications of Optimal Transportation theory (OT). This theory has already been applied in many areas: for example, the transport package (Schuhmacher et al. 2022) solves optimal transport problems in the field of image processing. A specific package called POT: Python Optimal Transport (Flamary et al. 2021) also exists in the Python software (Van Rossum and Drake Jr 1995) to solve optimization problems using optimal transportation theory in the fields of signal theory, image processing and domain adaptation. Nevertheless, none of these available tools is really adapted to our recoding problem, and the performance established by OT-based approaches for predicting missing information, compared to more standard processes (Garès et al. 2020; Garès and Omer 2022; Muzellec et al. 2020), confirmed our decision.
In OTrecod, the first provided algorithm, called OUTCOME and integrated in the OT_outcome function, consists in finding a map that pushes the distribution of Y forward to the distribution of Z (Garès et al. 2020), while the second one, called JOINT and integrated in the OT_joint function, pushes the distribution of (Y,X) forward to the distribution of (Z,X). Consequently, by construction, these two algorithms take advantage of all the potential relationships between Y, Z, and X for the prediction of the incomplete information on Y and/or Z in A and B. Enrichments of these algorithms, described in Garès and Omer (2022) and Cuturi (2013), are also available via the optional arguments of these two functions. In the current version, these algorithms are accompanied by original preparation (merge_dbs, select_pred) and validation (verif_OT) functions which are key steps of any standard statistical matching project and can be used independently of the OUTCOME and JOINT algorithms.
Solving recoding problems using optimal transportation theory

The optimal transportation problem

The optimal transportation (OT) problem was originally stated by Monge (1781) and consists in finding the cheapest way to transport a pile of sand to fill a hole. Formally, the problem writes as follows.
Consider two (Radon) spaces X and Y, µ_X a probability measure on X, µ_Y a probability measure on Y, and c a Borel-measurable function from X × Y to [0, ∞]. The Kantorovich formulation of the optimal transportation problem (Kantorovich 1942) consists in finding a measure γ ∈ Γ(µ_X, µ_Y) that realizes the infimum:

inf { ∫_{X×Y} c(x, y) dγ(x, y) : γ ∈ Γ(µ_X, µ_Y) },    (1)

where Γ(µ_X, µ_Y) is the set of measures on X × Y with marginals µ_X on X and µ_Y on Y. This theory is applied here to solve a recoding problem of missing distributions in a data fusion area. To do so, we make use of Kantorovich's formulation adapted to the discrete case, known as Hitchcock's problem (Hitchcock 1941). Therefore, by construction, the proposed algorithms are usable for specific target variables only: categorical variables, ordinal or nominal, and discrete variables with a finite number of values.
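In the discrete setting, Hitchcock's problem is a small linear program that off-the-shelf solvers handle directly. The sketch below, assuming the lpSolve package (not used by OTrecod itself, which relies on the ROI infrastructure), computes an optimal coupling between two toy discrete distributions, expressed as counts, for an arbitrary cost matrix:

```r
# Minimal discrete OT (Hitchcock) sketch with lpSolve -- illustrative only,
# the costs and marginals below are toy values, not taken from the package.
library(lpSolve)

mu_Y <- c(5, 3, 2)                # marginal of Y (3 levels), as counts
mu_Z <- c(4, 6)                   # marginal of Z (2 levels), same total
cost <- matrix(c(1, 3,
                 2, 1,
                 4, 2), nrow = 3, byrow = TRUE)   # |Y| x |Z| cost matrix

# gamma must have row sums mu_Y and column sums mu_Z
res <- lp.transport(cost, direction = "min",
                    row.signs = rep("=", 3), row.rhs = mu_Y,
                    col.signs = rep("=", 2), col.rhs = mu_Z)
res$solution                      # optimal coupling gamma
res$objval                        # optimal transport cost
```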

Optimal transportation of outcomes applied to data recoding
Let A and B be two databases corresponding to two independent sets of subjects. We assume, without loss of generality, that the two databases have equal sizes, so that they can be written as A = {i_1, . . . , i_n} and B = {j_1, . . . , j_n}. Let (X_i, Y_i, Z_i)_{i∈A} and (X_j, Y_j, Z_j)_{j∈B} be two sequences of i.i.d. discrete random variables with values in X × Y × Z, where X is a finite subset of R^P, and Y and Z are finite subsets of R. The variables (X_i, Y_i, Z_i), i ∈ A, are i.i.d. copies of (X_A, Y_A, Z_A), and the (X_j, Y_j, Z_j), j ∈ B, are i.i.d. copies of (X_B, Y_B, Z_B). Moreover, assume that the (X_i, Y_i, Z_i), i ∈ A, are independent of the (X_j, Y_j, Z_j), j ∈ B. The first version of the optimal transportation approach, described in Garès et al. (2020), relies on the following assumptions: Assumption 1. Y_A and Z_A respectively follow the same distributions as Y_B and Z_B.

Assumption 2.
For all x ∈ X, the probability distributions of Y_A and Z_A given X_A = x are respectively equal to those of Y_B and Z_B given X_B = x.
In this setting, the aim is to solve the recoding problem given by equation (1) that pushes µ^{Y_A} forward to µ^{Z_A}. The variable γ of (1) is a discrete measure with marginals µ^{Y_A} and µ^{Z_A}, represented by a |Y| × |Z| matrix. The cost function, denoted c, is a |Y| × |Z| matrix (c_{y,z})_{y∈Y,z∈Z}. The goal is the identification of:

γ* ∈ argmin { ⟨γ, c⟩ : γ 1_{|Z|} = µ^{Y_A}, γ^T 1_{|Y|} = µ^{Z_A}, γ ≥ 0 },    (2)

where ⟨· , ·⟩ is the dot product, 1_d is a vector of ones of appropriate dimension and M^T is the transpose of matrix M. The cost function considered by Garès et al. (2020), c_{y,z}, measures the average distance between the profiles of shared variables of subjects of A satisfying Y = y and subjects of B satisfying Z = z, that is:

c_{y,z} = E[ d(X_A, X_B) | Y_A = y, Z_B = z ],    (3)

where d is a given distance function to choose on X × X. In fact, the above problem cannot be solved directly in practice, since the distributions of X_A, X_B, Y_A and Z_A are never jointly observed. As a consequence, the following unbiased empirical estimators are used: μ̂^{X_A}_n of µ^{X_A} and μ̂^{X_B}_n of µ^{X_B}. Because Y and Z are only available in A and B respectively, two distinct empirical estimators have to be defined:

μ̂^{Y_A}_{n,y} = (1/n) ∑_{i∈A} 1_{Y_i=y}, ∀y ∈ Y,   and   μ̂^{Z_B}_{n,z} = (1/n) ∑_{j∈B} 1_{Z_j=z}, ∀z ∈ Z,    (4)

where 1_{Y=y} = 1 if Y = y and 0 otherwise. Assumption 1 gives µ^{Z_A} = µ^{Z_B}, from which we can conclude that μ̂^{Z_B}_{n,z} = μ̂^{Z_A}_{n,z}. Finally, denoting κ_{n,y,z} := ∑_{i∈A} ∑_{j∈B} 1_{Y_i=y, Z_j=z} the number of pairs (i, j) ∈ A × B such that Y_i = y and Z_j = z, the cost matrix c is estimated by:

ĉ_{n,y,z} = (1/κ_{n,y,z}) ∑_{i∈A} ∑_{j∈B} 1_{Y_i=y, Z_j=z} × d(X_i, X_j),   ∀y ∈ Y, z ∈ Z such that κ_{n,y,z} ≠ 0,
ĉ_{n,y,z} = 0,   ∀y ∈ Y, z ∈ Z such that κ_{n,y,z} = 0.    (5)
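The estimator (5) is straightforward to compute directly. The hedged sketch below, written independently of the package's internal functions, builds ĉ from two toy data frames using a Euclidean distance on the shared variables:

```r
# Empirical cost matrix c_hat[y, z]: average distance d(X_i, X_j) over pairs
# (i in A with Y_i = y, j in B with Z_j = z). Toy data, Euclidean distance.
set.seed(1)
A <- data.frame(Y = sample(1:3, 10, replace = TRUE),
                X1 = rnorm(10), X2 = rnorm(10))
B <- data.frame(Z = sample(1:2, 10, replace = TRUE),
                X1 = rnorm(10), X2 = rnorm(10))

c_hat <- matrix(0, nrow = 3, ncol = 2)   # |Y| x |Z|
for (y in 1:3) for (z in 1:2) {
  XA <- as.matrix(A[A$Y == y, c("X1", "X2")])
  XB <- as.matrix(B[B$Z == z, c("X1", "X2")])
  if (nrow(XA) > 0 && nrow(XB) > 0) {
    d <- outer(seq_len(nrow(XA)), seq_len(nrow(XB)),
               Vectorize(function(i, j) sqrt(sum((XA[i, ] - XB[j, ])^2))))
    c_hat[y, z] <- mean(d)               # zero is kept when kappa_{n,y,z} = 0
  }
}
c_hat
```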
Plugging the observed values of these estimators into (2) yields a linear programming model denoted P⁰_n. The solution γ̂_n can then be interpreted as an estimator of the joint distribution of (Y_A, Z_A). While this estimate is necessary, it is nevertheless insufficient to provide the individual predictions of Z in A. These predictions are obtained in a second step using a nearest neighbor algorithm, from which we deduce an estimation of µ^{Z_A | X_A=x, Y_A=y} (see Garès and Omer (2022) for details). In the remainder, the overall algorithm described in this section is referred to as OUTCOME. To address the few drawbacks of this algorithm described in Garès et al. (2020), algorithms derived from OUTCOME have been developed (Garès and Omer 2022) and are described in the following part.

Optimal transportation of outcomes and covariates
Using the same notations, Garès and Omer (2022) propose to search for an optimal transportation map between the two joint distributions of (X_A, Y_A) and (X_A, Z_A), with marginals µ^{(X_A,Y_A)} and µ^{(X_A,Z_A)} respectively. Under Kantorovich's formulation in a discrete setting, they search for:

γ* ∈ argmin { ⟨γ, c⟩ : γ ∈ D },    (6)

where c is a given cost matrix and D is the set of joint distributions with marginals µ^{(X_A,Y_A)} and µ^{(X_A,Z_A)}. It is natural to see any element γ ∈ D as the vector of joint probabilities P((X_A = x, Y_A = y), (X_A = x′, Z_A = z)) for any x, x′ ∈ X², y ∈ Y and z ∈ Z. Since this probability vanishes for all x ≠ x′, γ ∈ D is defined as a vector of R^{|X|×|Y|×|Z|}, where γ_{x,y,z} stands for an estimation of the joint probability P(X_A = x, Y_A = y, Z_A = z). These notations lead to the more detailed model:

min ∑_{x∈X} ∑_{y∈Y} ∑_{z∈Z} c_{x,y,z} γ_{x,y,z}
s.t. ∑_{z∈Z} γ_{x,y,z} = µ^{(X_A,Y_A)}_{x,y}, ∀x ∈ X, y ∈ Y,
     ∑_{y∈Y} γ_{x,y,z} = µ^{(X_A,Z_A)}_{x,z}, ∀x ∈ X, z ∈ Z,
     γ_{x,y,z} ≥ 0.    (7)

The above model can be solved only if the marginals µ^{(X_A,Y_A)} and µ^{(X_A,Z_A)} are known but, based on Assumption 2, unbiased estimators μ̂^{X_A,Y_A}_n and μ̂^{X_A,Z_A}_n can be built as in the previous subsection. For the first one, it gives:

μ̂^{X_A,Y_A}_{n,x,y} = (1/n) ∑_{i∈A} 1_{X_i=x, Y_i=y}, ∀x ∈ X, y ∈ Y.

The cost matrix introduced in the OUTCOME algorithm (3) is reused and estimated by (5). Formally, we can write c_{x,y,z} = c_{y,z}, ∀x ∈ X, ∀y ∈ Y, ∀z ∈ Z, so that the cost does not depend on the value of x.
Plugging the observed values of these estimators into (7) yields a linear programming model denoted P_n. By contrast with OUTCOME, the algorithm that consists in solving P_n to solve the recoding problem is referred to as JOINT in what follows.
An estimation of the distribution of Z_A given the values of X_A and Y_A is then given by:

μ̂^{Z_A | X_A=x, Y_A=y}_{n,z} = γ̂_{x,y,z} / ∑_{z′∈Z} γ̂_{x,y,z′},

and an individual prediction of Z_A is then deduced using the maximum a posteriori rule:

ẑ(x, y) ∈ argmax_{z∈Z} μ̂^{Z_A | X_A=x, Y_A=y}_{n,z}.

Due to potential estimation errors, the constraints of P_n may deviate from the true values of the marginals of µ^{(X_A,Y_A,Z_A)}. To deal with this situation, small violations of the constraints of P_n are allowed by enriching the initial algorithm as described in Garès and Omer (2022).
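Once a coupling γ̂ is available as a three-dimensional array, the conditional distribution and the maximum a posteriori prediction derived from it are one-liners. A hedged sketch, with a toy γ̂ standing in for the solver output:

```r
# gamma_hat[x, y, z]: toy joint probabilities (2 profiles x, 2 levels y,
# 3 levels z), standing in for the solution of the linear program.
set.seed(4)
gamma_hat <- array(runif(2 * 2 * 3), dim = c(2, 2, 3))
gamma_hat <- gamma_hat / sum(gamma_hat)

# Conditional distribution of Z given X = x and Y = y, then MAP prediction.
predict_z <- function(gamma_hat, x, y) {
  cond <- gamma_hat[x, y, ] / sum(gamma_hat[x, y, ])
  which.max(cond)          # index of the predicted level of Z
}
predict_z(gamma_hat, x = 1, y = 2)
```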
The equality constraints of P_n are then relaxed by introducing extra variables e^{X,Y,+} and e^{X,Z,+} that bound the allowed violations of the marginal constraints, the total violation being itself bounded by a parameter α_n. Garès and Omer (2022) suggest considering α_n := α√n, with a parameter α to calibrate numerically, but also propose a default value fixed to 0.4.
A regularization term, weighted by a constant λ, can also be added to improve the regularity of the variations of the conditional distribution µ^{Y_A,Z_A | X_A=x} with respect to x. The constant λ ∈ R⁺ is a regularization parameter to be calibrated numerically (0.1 can be considered as a default value) and E_X ⊂ X² contains the pairs of elements of X defined as neighbors: (x_i, x_j) ∈ E_X if x_j is among the k nearest neighbors of x_i, for some parameter k ≥ 1. The method that computes a solution to the recoding problem with regularization and relaxation is called R-JOINT.
In the same way, a relaxation of Assumption 1 is also proposed and added to the OUTCOME algorithm: the resulting method is denoted R-OUTCOME. Note that the algorithms JOINT and R-JOINT do not require Assumption 1, and the relaxation in R-OUTCOME alleviates its dependence on the satisfaction of this assumption. However, JOINT and R-JOINT require that X is a set of discrete variables (factors, ordered or not, are obviously allowed), while the absence of X in the linear programs of OUTCOME and R-OUTCOME allows X to be a set of discrete and/or continuous variables. In this case, the nature of the variables in X needs to be considered when choosing the distance d.

Package installation and description

Installation
The OTrecod package can be installed from the Comprehensive R Archive Network (CRAN) with:

install.packages("OTrecod")

The development version of OTrecod is also available and can be installed directly from GitHub after loading the devtools package (Wickham et al. 2022):

install_github("otrecoding/OTrecod")

Main functions
The two types of optimal transportation algorithms previously introduced (OUTCOME and JOINT) and their respective enrichments (R-OUTCOME and R-JOINT) are available in the OTrecod package via two core functions named OT_outcome and OT_joint. Details about their implementations in R are described in the following section. In this package, these recoding algorithms are seen as fundamental steps of data fusion projects, which often also require adapted preparation and validation functions. In this way, Table 2 introduces the main functions proposed by OTrecod to handle a data fusion project.

Table 2: Main functions of the OTrecod package.

R Function      Description
Pre-process functions
  merge_dbs     Harmonization of the data sources
  select_pred   Selection of matching variables
Data fusion functions
  OT_outcome    Data fusion with OT theory using the OUTCOME or R-OUTCOME algorithms
  OT_joint      Data fusion with OT theory using the JOINT or R-JOINT algorithms
Post-process function
  verif_OT      Quality assessment of the data fusion

The intermediate functions called by OT_outcome and OT_joint (proxim_dist, avg_dist_closest, indiv_grp_closest, indiv_grp_optimal), and their related documentation, are all included and usable separately in the package. They have been kept available to ensure great flexibility for users, as have other functions of interest like power_set, which returns the power set of a set. This function did not exist in R until now and could be of interest to specialists of algebra. These functions are not described here but are detailed in the pdf manual of the package.

Functionalities overview Expected structure of the input databases as arguments
The recoding functions OT_outcome and OT_joint require a specific data.frame structure as input argument. Described in Table 3, it must be the result of two overlayed databases made of at least four variables:

• A first variable, discrete or categorical, corresponding to the database identifier, stored as a factor or not, but with only two classes or levels (for example: A and B, 1 and 2, or otherwise).
• The target variable of the first database (or top database), denoted Y for example, whose values related to the second database are missing. This variable can be discrete, or categorical stored as a factor, ordered or not.
• In the same way, the target variable of the second database (or bottom database), denoted Z for example, whose values related to the first database are missing.
• At least one shared variable (defined as a variable with the same label and the same encoding in the two distinct data sources). The type of a shared variable can be continuous or categorical (stored as a factor or not), complete or not.

Nevertheless, a few constraints must be noted. First, in the critical situation where only one shared variable exists, this variable cannot be incomplete. Second, continuous shared variables are currently not allowed in the function OT_joint, and must therefore be transformed beforehand.

Table 3: Example of expected structure for two databases 1 and 2 with three shared variables X1, X2, X3.

As additional examples, users can also refer to the databases simu_data and tab_test provided in the package with the expected structures. Note that class objects are not expected here as input arguments of these functions, to allow users to work freely with or without the pre-process functions provided in the package.
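As an illustration of this layout, the hedged sketch below builds a minimal overlayed data.frame by hand (toy values; in practice this structure is typically produced by merge_dbs):

```r
# Toy overlayed database: identifier DB, targets Y (missing in database 2)
# and Z (missing in database 1), and two shared covariates X1, X2.
toy <- data.frame(
  DB = factor(c(1, 1, 1, 2, 2, 2)),
  Y  = factor(c("[200-600]", "(600-800]", "(800-1000]", NA, NA, NA),
              levels = c("[200-600]", "(600-800]", "(800-1000]"),
              ordered = TRUE),
  Z  = factor(c(NA, NA, NA, "G1", "G3", "G4"), ordered = TRUE),
  X1 = factor(c("a", "b", "a", "b", "a", "b")),
  X2 = c(10, 25, 17, 21, 12, 8)
)
str(toy)
```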

Choice of solver
The OTrecod package uses the ROI optimization infrastructure (Theußl, Schwendinger, and Hornik 2017) to solve the optimization problems related to the OUTCOME and JOINT algorithms. The solver GLPK, the GNU Linear Programming Kit (Makhorin 2011), is the default solver currently integrated in the OT_outcome and OT_joint functions for handling linear problems with linear constraints. The ROI infrastructure makes it easy for users to switch solvers for comparison and, in many situations, some of them can noticeably reduce the running time of the functions.
For example, the solver Clp (Forrest, Nuez, and Lougee-Heimer 2004), for COIN-OR Linear Programming, known to be particularly convenient in linear and quadratic situations, can easily be installed via the related plug-in available in ROI (searchable with the instruction ROI_available_solvers()), following the few instructions detailed in Theußl, Schwendinger, and Hornik (2020) or on the dedicated website.
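A hedged sketch of this solver switch (the package name is the standard ROI plug-in name; how a given OTrecod function selects its solver should be checked in the package manual):

```r
# List the solvers that ROI can use, then install and register the Clp plug-in.
library(ROI)
ROI_available_solvers()             # overview of installable solver plug-ins
install.packages("ROI.plugin.clp")  # COIN-OR Clp plug-in
library(ROI.plugin.clp)
ROI_registered_solvers()            # Clp should now appear as registered
```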

An illustrative example
In California (United States), from 1999 to 2018, the Public Schools Accountability Act (PSAA) required the California Department of Education (CDE) to provide annually the results of an Academic Performance Index (API), which established a ranking of the best public schools of the state.
This numeric score, an indicator of schools' performance levels, could vary from 200 to 1000, and the performance objective to reach for each school was 800. Information related to the 418 schools (identified by cds) of Nevada County (County 29) and to the 362 schools of San Benito County (County 35) was respectively collected in two databases, api29 and api35, available in the package. The distributions of all the variables in the two databases can be inspected with standard R commands. The two databases seem to share a majority of variables (same labels, same encodings) of different types; inferential statistics could therefore ideally be considered by combining all the information of the two databases to study the effects of social factors on the results of the API score in 2000.
Nevertheless, while this target variable, called apicl_2000, is correctly stored in the database api29 and encoded as a three-level ordered factor with clearly defined levels, [200-600], (600-800] and (800-1000], the only information related to the API score in the database api35 is the variable apicl_1999, the API score collected in 1999, encoded in four unknown balanced classes (G1, G2, G3 and G4). As no school is common to the two counties, we easily deduce that these two variables have never been jointly observed.
By choosing these two variables as outcomes (called also target variables), the objective of the following examples consists in creating a synthetic database where the missing information related to the API score in 2000 is fully completed in api35 by illustrating the use of the main functions of the package.

Harmonization of two datasources using merge_dbs
The function merge_dbs is an optional pre-process function dedicated to data fusion projects that merges two raw databases by detecting possible discrepancies between the respective sets of variables of the databases. The discrepancy situations currently detected by the function follow the specific rules described below:

• any variable (other than a target one) whose label (or name) is not common to the two data sources is automatically excluded. By default, the remaining variables are denoted shared variables;
• among the subset of shared variables, any variable stored in a different format (or type) from one data source to the other is automatically discarded from the subset previously generated, and its label is saved in an output object called REMOVE1;
• among the remaining subset of shared variables, a factor variable (ordered or not) stored with different levels or numbers of levels from one data source to the other is automatically discarded from the subset, and its label is saved in an output object called REMOVE2.
These situations sometimes require reconciliation actions which are not supported by the current version of the merge_dbs function. Therefore, when reconciliation actions are required, they will have to be treated by the user a posteriori and outside the function.
Applied to our example and according to the introduced rules, the first step of our data fusion project consists in studying the coherence between the variables of the databases api29 and api35 via the following R code:

step1 = merge_dbs(DB1 = api29, DB2 = api35,
                  NAME_Y = "apicl_2000", NAME_Z = "apicl_1999",
                  row_ID1 = 1, row_ID2 = 1,
                  ordinal_DB1 = c(2:3, 8:12), ordinal_DB2 = c(2:3, 8:12))

As input, the raw databases must be declared separately in the DB1 and DB2 arguments, and the names of the related target variables must be specified via NAME_Y and NAME_Z for DB1 and DB2 respectively. In the presence of row identifiers, the respective column indexes of each database must be set in the arguments row_ID1 and row_ID2. The arguments ordinal_DB1 and ordinal_DB2 list the column indexes of all the variables defined as ordinal in the two databases (including the indexes of the target variables if necessary). Here, apicl_2000 is clearly an ordinal variable and, by default, we suppose that the unknown encoding related to apicl_1999 is also ordinal: the corresponding indexes (2 and 3) are thus added to these two arguments.
After running, the function informs users that no rows were dropped from the databases during the merging, because each target variable is fully completed in its own database. Nine potential predictors are kept, while only one variable is discarded because of discrepancies between the databases; its identity is stored in the output, which informs the user about the nature of the problem: the mobility factor has different levels from one database to the other.

#> [1] "1" "2"

step1$REMAINING_VAR
#> [1] "acs.core"  "acs.k3.20" "api.stu"   "awards"    "ell"       "full"
#> [7] "grad.sch"  "meals"     "stype"

The function returns a list object, notably DB_READY, a data.frame whose structure corresponds to the expected structure introduced in the previous subsection: a unique database, here the result of the superimposition of api29 on api35, where the first column (DB) corresponds to the database identifier (1 for api29 and 2 for api35), and the second and third columns (Y and Z respectively) correspond to the target variables of the two databases. Missing values are automatically assigned by the function to the unknown part of each target variable: in Y when the identifier equals 2, in Z when the identifier equals 1. The subset of shared variables whose information is homogeneous between the databases is stored from the fourth column onwards. Their identities are available in the output object called REMAINING_VAR.
The merge_dbs function can handle incomplete shared variables via the impute argument. This option allows users to keep the missing information unchanged (the default choice, and also the case in this example), to work with complete cases only, to perform fast multivariate imputation by chained equations (the function integrates the main functionalities of the mice function of the mice package), or to impute data using the dimensionality reduction method introduced in the missMDA package (see the pdf manual for details).

Selection of matching variables using select_pred
In data fusion projects, the selection of shared variables (also called matching variables) appears as an essential step, for two main reasons. First, the number of shared variables X between the two databases can be large (much higher than three variables), and keeping a pointless part of the variability between variables could strongly reduce the quality of the fusion. Second, this selection greatly influences the quality of the predictions, regardless of the matching technique chosen a posteriori (Adamek 1994). The specific context of data fusion is subject to the following constraints:

1. Y, Z and X are never jointly observed in database A or B, so the relationships between Y and X, and between Z and X, must be investigated separately.
2. Matching variables need to be good predictors of both target variables (outcomes) Y and Z (Cohen 1991), in the sense that they explain relevant parts of the variability of the targets.
These particularities imply that the investigations have to be done separately, but in the same manner, in both databases. In this way, a predictor which appears highly correlated to both Y and Z will be automatically selected for the fusion. On the contrary, predictors whose effects on Y and Z are not obvious will be discarded. Additional recommended rules also emerge from the literature:

• concerning the subset of variables that predict only one of the two targets Y or Z, D'Orazio, Di Zio, and Scanu (2006) suggest a moderately parsimonious selection, as retaining too many predictors could complicate the fusion procedure;
• Cibella (2010) and Scanu (2010) suggest selecting quality predictors with no error and only a small amount of missing data.
The function of the package dedicated to this task is select_pred. From the DB_READY database generated in the previous subsection, studying the outputs related to each database assists users in selecting the best predictors. The quanti, nominal, and ordinal arguments require vectors of column indexes related to the type of each variable of the input database. The ID argument specifies the column index of the database identifier. Y and Z expect the respective names of the target variables, while OUT specifies the target variable to predict (Y or Z).
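A hedged sketch of such a call on DB_READY (the column indexes below are assumptions that must be adapted to the actual positions of the variables; only the argument names come from the description above):

```r
# Association study for each target, varying only OUT (the column indexes
# are illustrative assumptions for the DB_READY layout).
library(OTrecod)

sel_Y <- select_pred(step1$DB_READY,
                     Y = "Y", Z = "Z", ID = 1, OUT = "Y",
                     quanti = c(4, 6), nominal = c(1, 5),
                     ordinal = c(2, 3, 7:12))

sel_Z <- select_pred(step1$DB_READY,
                     Y = "Y", Z = "Z", ID = 1, OUT = "Z",
                     quanti = c(4, 6), nominal = c(1, 5),
                     ordinal = c(2, 3, 7:12))
```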
To detect the subset of best predictors, select_pred studies the standard pairwise associations between each shared variable and the outcomes Y and Z, taken separately (by only varying the argument OUT), according to its type: for numeric and/or ordered factor variables, select_pred calculates the associations using rank (Spearman) correlation coefficients, while the Cramér's V criterion (Bergsma 2013) is used for categorical variables and unordered factors. The related ranking tables of top scoring predictors are available in two distinct output objects: cor_OUTC_num and cor_OUTC_cat. The first two tables related to api29 highlight the strong associations between Y and the variables meals, ell, full and grad.sch, in this order of importance, while the summary tables related to api35 highlight the strong associations between Z and the variables meals, grad.sch, ell and full.
It is not unusual to observe that one or more predictors are in fact linear combinations of others. In supervised learning, these risks of collinearity increase with the number of predictors, and must be detected beforehand so as to keep only the most parsimonious subset of predictors for the fusion. To detect collinearity situations, the result of a Farrar and Glauber (FG) test is provided (Farrar and Glauber 1967). This test is based on the determinant of the rank (Spearman) correlation matrix D of the shared variables, and the corresponding test statistic is given by:

S_FG = −(n − 1 − (2k + 5)/6) × ln(det D),

where n is the number of rows and k is the number of covariates. Under the null hypothesis, S_FG follows a chi-square distribution with k(k − 1)/2 degrees of freedom, and failing to reject this hypothesis indicates an absence of collinearity. In the presence of a large number of numeric covariates and/or ordered factors, the approximate Farrar-Glauber test, based on the normal approximation of the null distribution (Kotz, Balakrishnan, and Johnson 2000), can be more appropriate. Users will note that these two tests can often appear highly sensitive, in the sense that they tend to easily conclude in favor of multicollinearity. It is therefore suggested to consider their results as indicators of collinearity between predictors rather than as an essential condition of acceptability. The results related to this test are stored in the FG_test object of the select_pred output.
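The chi-square form of the FG test is short to compute by hand. A minimal sketch, independent of select_pred's internal implementation:

```r
# Farrar-Glauber chi-square test from a Spearman correlation matrix:
# S_FG = -(n - 1 - (2k + 5)/6) * ln(det(D)), with df = k(k - 1)/2.
fg_test <- function(X) {
  D <- cor(X, method = "spearman")
  n <- nrow(X); k <- ncol(X)
  stat <- -(n - 1 - (2 * k + 5) / 6) * log(det(D))
  df   <- k * (k - 1) / 2
  p    <- pchisq(stat, df = df, lower.tail = FALSE)
  c(statistic = stat, df = df, p.value = p)   # small p suggests collinearity
}

set.seed(2)
X <- data.frame(a = rnorm(50), b = rnorm(50))
X$c <- X$a + rnorm(50, sd = 0.1)   # nearly collinear with a
fg_test(X)
```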
In the presence of collinearity, select_pred tests the pairwise associations between all the potential predictors according to their types (Spearman or Cramér's V). The thres_num argument fixes the threshold beyond which two ranked predictors are considered redundant, while thres_cat fixes the threshold of acceptability for the Cramér's V criterion in the subgroup of factors. In output, the results stored in the collinear_PB object allow users to identify the corresponding variables. In our example, we observe:

### DETECTION OF REDUNDANT PREDICTORS
## Results for the api29 dataset -----

The FG test warns the user against the risks of collinearity between predictors, and the function notably detects strong collinearities between the variables meals, ell and full in api29 (less strong trends in api35): information that has to be taken into account during the selection. The subset of predictors finally kept for the data fusion should remain small to improve its quality. When the initial number of shared variables is moderate, as in this example, choosing the best candidates between groups of redundant predictors can be done manually by selecting the highest ranked predictors in the summary tables previously described. In this way, the variable meals could be preferred to ell and full, while the variable stype could be dropped. Consequently, a possible final list of predictors could be composed of the variables meals and grad.sch only.
When the number of predictors is important, or when users prefer an automatic variable selection process, a random forest procedure can also be used via the select_pred function. Random forest approaches (Leo Breiman 2001) are particularly convenient here (Grajski et al. 1986), for multiple reasons: they work well when the number of variables exceeds the number of rows, whatever the types of covariates; they can deal with non-linearity, correlated predictors, and ordered or unordered outcomes; and they rank good candidate predictors through an inbuilt criterion, the variable importance measure.
In a few words, a random forest process aggregates many CART models (L. Breiman et al. 1984) fitted on RF_ntree bootstrap samples from the raw data source; averaging the accuracies of these models reduces the related variances and also the errors of prediction. A standard random forest process provides two distinct measures of the importance of each variable for the prediction of the outcome, the Gini importance criterion and the permutation importance criterion, which depend on the appearance frequency of the predictor but also on its place taken up in each tree. For more details about random forest theory, users can consult Leo Breiman (2001) and/or the pdf manual of the randomForest package (Liaw and Wiener 2002). Strobl, Hothorn, and Zeileis (2009) suggest that the permutation importance criterion, which works with permuted samples (subsampling without replacement) instead of bootstrap ones, is particularly convenient with uncorrelated predictors, but must be replaced by a conditional permutation measurement in presence of strong correlations. select_pred provides these assessments by integrating the main functionalities of the cforest and varimp functions of the party package (Hothorn, Hornik, and Zeileis 2006; Zeileis and Hothorn 2008; Hothorn et al. 2005; Strobl et al. 2007, 2008). However, these measurements must be used with caution, accounting for the following constraints:

• the Gini importance criterion can produce bias in favor of continuous variables and variables with many categories. This criterion is thus not available in the function.
• the permutation importance criterion can overestimate the importance of highly correlated predictors, and therefore redundant predictors will be discarded beforehand using the first steps of the process integrated in the function.

The select_pred function allows to proceed with different scenarios according to the type of predictors (Table 4 can help users to choose).
The first scenario consists in reducing the problem to a set of categorical variables (ordered or not) by categorizing all the continuous predictors using the dedicated arguments (convert_num and convert_clss), and working with the conditional importance assessments that directly provide unbiased estimations (by setting the RF_condi argument to TRUE). Users can consult Hothorn, Hornik, and Zeileis (2006) for more details about the approach, and the pdf manual of the package for details about the related arguments of the select_pred function. This approach does not take incomplete information into account, so the method will be applied to complete data only (incomplete rows will be temporarily removed from the study). It is nevertheless possible to impute missing data beforehand by using dedicated pre-existing R packages like mice (van Buuren and Groothuis-Oudshoorn 2011) or by using the imput_cov function provided in the OTrecod package.
The second possible scenario (always usable in presence of mixed type predictors) consists in the execution of a standard random forest procedure after taking care to rule out collinearity issues by first selecting unique candidates among potential redundant predictors (in this case, the discarded predictors are stored in the drop_var output object). This is the default approach used by select_pred as soon as the RF_condi argument is set to FALSE while RF is set to TRUE. This scenario can work in presence of incomplete predictors. By construction, note that the results of random forest procedures stay dependent on the successive random draws carried out for the constitution of the trees, and it is thus suggested to check this stability by testing different random seeds (RF_SEED argument) before concluding.
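As an illustration of the standard (non-conditional) permutation importance used in this second scenario, the following sketch relies on the randomForest package applied to a classical example dataset; it is not OTrecod code, and the fixed seed echoes the RF_SEED advice above:

```r
# Illustrative sketch (not OTrecod code): ranking predictors with the
# permutation importance criterion of the randomForest package.
library(randomForest)

aq <- na.omit(airquality)   # a standard random forest does not accept NAs
set.seed(123)               # results depend on random draws (cf. RF_SEED)
fit <- randomForest(Ozone ~ ., data = aq, ntree = 500, importance = TRUE)

imp <- importance(fit, type = 1)      # type = 1: permutation importance
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Rerunning with a different seed and comparing the resulting rankings gives a simple stability check of the selection.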
The R command previously written automatically provides the results related to the second approach as additional results. The results from the two datasets show here the permutation importance estimates of each variable, ranked in order of importance and expressed as percentages, after resolving collinearity problems. The results confirm that the variable meals appears as the best predictor of the target variables in api29 and api35 respectively. The variable ell is not present in the first list (see RF_PRED from step2a) because the variables meals and ell have been detected as strongly collinear (according to the initial chosen threshold), and so select_pred keeps the best choice between the two predictors: meals (the reasoning is the same for full, which disappeared from the list).
The ell variable has been kept in the second list (not found collinear enough to be removed here) and moreover appears as a good predictor of apicl_1999 in api35. Nevertheless, its potential collinearity problem with meals encourages us not to keep it for the rest of the study. Following this discussion, we finally keep meals, stype and grad.sch, which combine the advantages of being good predictors for the two target variables while not presenting major problems of collinearity between them.
The following synthetic database (called here bdd_ex) is now ready for the prediction of the missing API scores in api29, api35 or both, using the function OT_outcome or OT_joint:
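The construction of such a database can be sketched as follows, assuming that DB_READY is the overlayed database returned by merge_dbs and that its first three columns are the database identifier and the two target variables (the column names used here are illustrative assumptions):

```r
# Hedged sketch: keep the identifier, the two target variables and the
# three selected predictors. DB_READY and the column names are assumed.
bdd_ex <- DB_READY[, c("DB", "Y", "Z", "meals", "stype", "grad.sch")]
head(bdd_ex)
```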

Data fusion using OT_outcome
The OT_outcome function provides individual predictions of Z in A (and/or Y in B) by considering the recoding problem as an optimal transportation problem of outcomes. In a first step, the aim of the function is to determine γ from (2), and this estimate is then used in a second step to provide the predictions. The main input arguments of the function and the output values are respectively described in Tables 5 and 6.

The arguments datab, index_DB_Y_Z, quanti, nominal, ordinal, and logic are not optional and must be carefully completed before each execution. A unique synthetic data.frame made of two overlayed databases (called A and B for example, see Table 3) is expected as the datab argument. If this data.frame corresponds to the output objects DB_USED or DB_READY of the select_pred or merge_dbs functions respectively, then it has the required structure. Otherwise, users must make sure that their data.frames are made up of two overlayed databases with at least 4 variables, as described in the subsection Optimal transportation of outcomes applied to data recoding. The order of the variables has no importance in the raw database but will have to be specified a posteriori in the index_DB_Y_Z argument if necessary.
The subset of remaining predictors used for the fusion may require prior transformations according to the distance function chosen in input by the user. This process is fixed by setting the dist.choice argument. The distances actually implemented in the function are the standard Euclidean ("E") and Manhattan ("M") distances, the Hamming distance ("H", for binary covariates), and the Gower distance ("G", sometimes preferred with mixed variables). The automatic transformations applied prior to the use of each distance function are summarized in Table 7.
The first version of the OT algorithm described in Garès et al. (2020) was tested on numeric coordinates extracted from a factor analysis of mixed data (FAMD) fitted on mixed raw covariates (Pagès 2002). This transformation is available here by setting the FAMD.coord argument to "YES". The minimal percentage of information explained by the FAMD is also fixable using the FAMD.perc argument.

The OT_outcome function proposes four types of models for the prediction of Y in B (and/or Z in A), according to the values of the indiv.method and maxrelax arguments:

• When maxrelax = 0 and indiv.method = "sequential" (default options), the fitted model corresponds to the OUTCOME algorithm described by P^0_n in equation (6). Assuming that Y and Z in A follow the same distributions as Y and Z in B respectively (assumption 1), this algorithm derives the joint distribution of Y and Z in A (respectively in B) in a first step, and uses, in a second step, a nearest neighbor procedure to predict the missing values of Z in A (resp. Y in B). This algorithm computes averaged distances between each subject from A (resp. B) and the subgroups of subjects from B (resp. A) having the same levels of Z in B (resp. Y in A). These calculations can be done using all the subjects of each subgroup (by default, percent.knn = 1) or only restricted parts of them (percent.knn < 1).
• When maxrelax > 0 and indiv.method = "sequential", assumption 1 is alleviated by relaxing the constraints on the marginal distributions, and an R-OUTCOME algorithm (with relaxation) is fitted. In this case, the maxrelax argument corresponds to the α_n parameter in (18).
• When maxrelax = 0 and indiv.method = "optimal", the second step of the original algorithm (the nearest neighbor procedure) is replaced by a linear optimization problem: the individual predictions of Z in A (resp. Y in B) are searched for by minimizing the total of the distances between each individual of A and the individuals of each level of Z in B.
• When maxrelax > 0 and indiv.method = "optimal", the constraints on the marginal distributions of the previous model are also relaxed, and the function fits an R-OUTCOME algorithm with relaxation.

For the three last situations, the corresponding R-OUTCOME algorithm is described by P^(0-R)_n in (18).

The main output values, common to OT_outcome and OT_joint (Table 6), are the following (values marked (*) depend on the databases required by the which.DB argument):

• time_exe: running time of the algorithm
• gamma_A: estimation of the joint distribution of (Y, Z) for the prediction of Z in A (*)
• gamma_B: estimation of the joint distribution of (Y, Z) for the prediction of Y in B (*)
• profile: the list of detected profiles of covariates
• res_prox: a list that provides all the information related to the estimated proximities between profiles and groups of profiles
• estimatorZA: estimates of the probability distribution of Z conditional to X and Y in database A (*)
• estimatorYB: estimates of the probability distribution of Y conditional to X and Z in database B (*)
• DATA1_OT: the database A fully completed (if required by the which.DB argument)
• DATA2_OT: the database B fully completed (if required by the which.DB argument)
When maxrelax > 0, it is recommended to calibrate the maxrelax parameter by testing different values according to the stability of the individual predictions. In output, the gamma_A and gamma_B matrices correspond to the estimates of the joint distributions γ of (Y_A, Z_A) and (Y_B, Z_B) respectively (2). Moreover, a posteriori estimates of the conditional distribution probabilities (Z_A|Y_A, X_A) and (Y_B|Z_B, X_B) (10) are also provided in two lists called estimatorZA and estimatorYB respectively, and the completed databases A and B are stored in the DATA1_OT and DATA2_OT objects. In particular, the individual predictions are stored in the OTpred column of these data.frames. Moreover, the profile object is a data.frame that stores the profiles of covariates encountered in the two data sources, while the res_prox object stores all the distance computations that will be used in the validation step (users can consult the details of the proxim_dist function in the pdf manual). The computation of the conditional probabilities implies to define the part of individuals considered as neighbors of each encountered profile of covariates. The prox.dist argument fixes this threshold for each profile, following the decision rule for A (and respectively for B): a subject i of A (or a statistical unit corresponding to a row of A) is considered as a neighbor of a given profile of shared variables C as soon as its distance to C falls below the fixed threshold. When the number of shared variables is small and all of them are categorical (optimal situation), it is suggested to set the prox.dist argument to 0. Finally, using the which.DB argument, users can choose to estimate the individual predictions of Y and Z in the two databases (which.DB = "BOTH") or in only one of them (which.DB = "A" for the predictions of Z in A, or which.DB = "B" for the predictions of Y in B).
From the data.frame bdd_ex built previously, we use the OT_outcome function to illustrate the prediction of the target variable apicl_2000 in the api35 dataset via an OUTCOME algorithm and an Euclidean distance function:

outc1 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
                   dist.choice = "E", indiv.method = "sequential",
                   which.DB = "B")

In bdd_ex, all variables except the first one (DB, the database identifier) are or can be considered as ordinal factors, and the quanti and ordinal arguments are filled accordingly. For the prediction of apicl_2000 in the api35 dataset (the steps would be the same for the prediction of apicl_1999 in api29 by setting which.DB = "A" or "BOTH"), the optimal transportation theory determines a map γ that pushes the distribution of apicl_1999 forward to the distribution of apicl_2000 in the database api35. In this case, γ_B is an estimator of the joint distribution of apicl_2000 and apicl_1999 in api35. The estimation of γ_B is available in the gamma_B output object, while all the profiles of predictors met in the two databases are listed in the profile object. The output object estimatorYB is a list that corresponds to the estimations of the conditional probabilities of apicl_2000 in the api35 dataset for a given profile of predictors. For example, the conditional probabilities related to the first profile of predictors are displayed below.

The R Journal Vol. 14/4, December 2022 ISSN 2073-4859

According to these results, we can conclude that: for a subject from api35 with the related profile of predictors P_1 and whose level of apicl_1999 is 'G1', the probability for apicl_2000 to be predicted '[200,600]' is about 0.63. Finally, the individual predictions of apicl_2000 in api35 are available in the OTpred column of the DATA2_OT object. The OUTCOME algorithm uses here a nearest neighbor procedure to assign the individual predictions from the estimation of γ, which can be a drawback, as described in Garès et al. (2020).
The R-OUTCOME algorithm proposes an enrichment that directly assigns the individual predictions of apicl_2000 from the estimation of γ, without using the nearest neighbor approach. In this case, a linear optimization problem is solved to determine the individual predictions that minimize the sum of the individual distances in api35 having the modalities of the target apicl_2000 in api29. The corresponding R command is:

### R-OUTCOME algorithm: optimal assignments + relaxation parameter = 0
R_outc3 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
                     dist.choice = "E", indiv.method = "optimal",
                     which.DB = "B")

Moreover, the OUTCOME algorithm assumes that apicl_2000 is identically distributed in api29 and api35 (assumption 1 from the subsection Optimal transportation of outcomes applied to data recoding), which can appear as a strong hypothesis to hold in many situations. To overcome this constraint, the R-OUTCOME algorithm also allows to relax assumption 1 by adding a relaxation parameter (18) in the optimization system. This relaxation can be done by varying the maxrelax argument of the function:

### R-OUTCOME algorithm: optimal assignments + relaxation parameter = 0.4
R_outc4 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
                     dist.choice = "E", indiv.method = "optimal",
                     maxrelax = 0.4, which.DB = "B")

The running times of these two models can take a few minutes. The quality of these models will be compared in Tables 8, 9 and 10 (subsection Validation of the data fusion using verif_OT).

Data fusion using OT_joint
The OT_joint function provides individual predictions of Z in A (and/or Y in B) by considering the recoding problem as an optimal transportation problem of covariates and outcomes, which pushes the conditional distribution of (Y_A|X_A) forward to the conditional distribution of (Z_A|X_A) (and conversely, (Z_B|X_B) forward to the conditional distribution of (Y_B|X_B)). The aim is to determine γ from (7). Because the joint distributions of outcomes and covariates are now mapped together (which was not the case with the OUTCOME family of algorithms), it is not required for the target variables to be equally distributed (assumption 1).
The call of OT_joint was thought to propose globally the same structure as that of the OT_outcome function, as described in Tables 5 and 6, with the few differences described below. The JOINT and R-JOINT algorithms are usable via OT_joint for solving the recoding problem, depending on the values of the maxrelax and lambda.reg arguments:

• When maxrelax = 0 and lambda.reg = 0 (default values), the fitted model corresponds to the JOINT algorithm described by P_n in equation (7).
• When at least one of these two arguments differs from 0, the R-JOINT algorithm is called.
Using R-JOINT, it is thus possible to relax the constraints on marginal distributions (maxrelax > 0) and/or add a more or less strong L1 regularization term among the constraints of the algorithm by filling the lambda.reg argument. The maxrelax parameter corresponds to the parameter α_n in (16), and the lambda.reg parameter corresponds to the parameter λ in (17).
When maxrelax > 0 and/or lambda.reg > 0, it is recommended to calibrate these two parameters by testing many different values and studying the stability of the related individual predictions. Nevertheless, by default or as a starting point, Garès et al. (2020) suggest the use of default values determined from simulated databases: 0.1 for the regularization parameter (lambda.reg) and 0.4 for the relaxation one (maxrelax). Finally, the output objects are similar to those previously described for the OT_outcome function.
The implementation of the function is very similar to that of OT_outcome. For relaxing the constraints stated on the marginal distributions, we use the maxrelax argument, which corresponds to varying the α parameter of an R-JOINT algorithm (see (17)), and a regularization parameter λ can simply be added to the model through the lambda.reg argument.
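The corresponding calls can be sketched as follows (a hedged sketch: the model names and argument values mirror those discussed in the text, bdd_ex is the database built previously, and the exact argument spellings should be checked against the pdf manual):

```r
### JOINT algorithm (default values: maxrelax = 0, lambda.reg = 0)
outj1 <- OT_joint(bdd_ex, nominal = 1, ordinal = 2:6,
                  dist.choice = "E", which.DB = "B")

### R-JOINT algorithm: relaxation parameter = 0.4
R_outj1 <- OT_joint(bdd_ex, nominal = 1, ordinal = 2:6,
                    dist.choice = "E", maxrelax = 0.4, which.DB = "B")

### R-JOINT algorithm: relaxation = 0.4 + L1 regularization = 0.1
R_outj4 <- OT_joint(bdd_ex, nominal = 1, ordinal = 2:6,
                    dist.choice = "E", maxrelax = 0.4, lambda.reg = 0.1,
                    which.DB = "B")
```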

Validation of the data fusion using verif_OT
Assessing the quality of individual predictions obtained through statistical matching techniques can be a complex task, notably because it is legitimate to wonder about the true reliability of a joint distribution estimated from two variables which are never jointly observed (Kadane 2001; Rodgers 1984). When the study aims to identify estimators of interest that improve the understanding of the relationships between variables in the different databases (called the macro approach in D'Orazio, Di Zio, and Scanu (2006)) without going as far as the creation of a complete and unique dataset, uncertainty analyses are usually proposed to estimate the sensitivity of the assessments (Rässler 2002). On the contrary, if the objective is to create a synthetic dataset where the information is available for every unit, the use of auxiliary information or proxy variables must be privileged to assess the quality of the results (Paass 1986). In the absence of complementary information, Rubin (1986) suggests studying the preservation of the relationships between the distributions of the target variables. In this spirit, the verif_OT function proposes tools to assess the quality of the individual predictions provided by the algorithms previously introduced.
The first expected input argument of verif_OT is an 'otres' object from the OT_outcome or OT_joint functions. In our example, this function is first applied to the outc1 model. In output, the nb.profil value gives the number of profiles of predictors globally detected in the two databases (nb = 27). In the example, a profile is characterized by a combination of values of the three predictors kept: grad.sch, meals and stype.
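A hedged sketch of the corresponding call (the option names follow the arguments described in this section, and min.neigb is discussed further below):

```r
# Illustrative call of verif_OT applied to the outc1 model fitted
# previously; group.clss and stab.prob trigger the optional quality
# tools described in this section.
verif_outc1 <- verif_OT(outc1, group.clss = TRUE,
                        stab.prob = TRUE, min.neigb = 5)
verif_outc1$nb.profil
#> [1] 27
```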
The first step of the function is dedicated to the study of the proximity between the two target variables. Standard criteria (Cramér's V and Spearman's rank correlation coefficient) are therefore used to evaluate the pairwise association between Y and Z, globally or in only one of the two databases, according to the predictions provided by the output of OT_outcome or OT_joint. Only the database api35 was completed in the example (because which.DB = "B" in the call of OT_outcome) and stored in the DATA2_OT object; therefore, the criteria compare here the proximity between the distributions of the predicted values of apicl_2000 (Y) and the observed values of apicl_1999 (Z) in the database api35 (B). Observing independence or a small correlation between Y_A and Z_A (or Y_B and Z_B) must raise questions about the reliability of the predictions, especially when Y and Z are supposed to summarize the same information. In the example, whatever the criterion used, we notice via the res.prox object a strong association between the two target variables in api35, which is reassuring about the consistency of the predictions. The related confusion matrix between the predicted values of apicl_2000 (Y) and the observed values of apicl_1999 (Z) is stored in the conf.mat object.
Second, the function proposes an optional tool (by setting group.clss = TRUE) which evaluates the impact of grouping levels of one factor on the association between Y and Z. When users initially have no information about one of the two encodings of interest, this approach can be particularly useful to distinguish ordinal encodings from nominal ones, and its principle is as follows. Assume that Y ∈ Y and Z ∈ Z (Y and Z being the respective sets of levels) and that |Y| ≥ |Z|. From Y, successive new variables Y′ ∈ Y′ are built, such that Y′ is a partition of Y with |Y′| = |Z| (and conversely for Z if the number of levels of Z is the greatest). The related associations between Z and Y′ (which now have equal numbers of levels) are then studied using the Cramér's V, the rank correlation, the Kappa coefficient and the confusion matrix, and the results are stored in a dedicated output object. It appears from these results that grouping the levels G2 and G3 of apicl_1999 (Z) strongly improves the association of this variable with apicl_2000 (Y) (the error rate of the confusion matrix decreases from 43.4 to 13.3). Moreover, the structure of conf.mat confirms that the encoding of apicl_1999 seems to be ordinal.
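The principle of this grouping diagnostic can be illustrated with a self-contained base-R sketch on toy data (not package code): merge two levels of Z and compare the resulting confusion matrix with Y.

```r
# Toy illustration: grouping levels "G2" and "G3" of z before comparing
# it with a two-level variable y via a confusion matrix.
z <- factor(c("G1","G1","G2","G3","G2","G3","G1","G2","G3","G3"))
y <- factor(c("A" ,"A" ,"B" ,"B" ,"B" ,"B" ,"A" ,"A" ,"B" ,"B" ))

# partition of the levels of z such that |z'| = |y| (2 levels here)
zp <- factor(ifelse(z == "G1", "A", "B"))

conf <- table(y, zp)                        # confusion matrix Y vs Z'
err  <- 1 - sum(diag(conf)) / length(y)     # error rate of the matrix
err                                         # 0.1 here
```

A low error rate after grouping, as in this toy case, is the kind of signal that suggests the merged levels belong to an ordered scale.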
The third step of the quality process integrated in the function corresponds to the study of the Hellinger distance (Liese and Miescke 2008). This distance function is used as a measure of the discrepancies between the observed and predicted distributions of Y (L(Y_A) versus L(Ŷ_B)) and/or Z (L(Ẑ_A) versus L(Z_B)); for Y and Z, the distance is computed from the empirical estimators µ^{Y_B}_{n,y} and µ^{Z_A}_{n,z} of µ^{Y_B} and µ^{Z_A} respectively. The Hellinger distance varies between 0 (identical distributions) and 1 (strong dissimilarities), while 0.05 can be used as an acceptable threshold below which two distributions can be considered as similar. When OUTCOME models are applied, this criterion verifies that the predictions respect assumption 1. It can also be used to determine the best relaxation parameter of an R-OUTCOME model. On the contrary, there is no need to interpret this criterion when an R-JOINT model is applied because, in this case, assumption 1 is not required. The results related to this criterion are stored in the hell object:

verif_outc1$hell
#>                  YA_YB ZA_ZB
#> Hellinger dist.  0.008    NA

With a distance equal to 0.008 (far below the 0.05 threshold), assumption 1 holds here, but it remains interesting to test other algorithms to eventually improve the current one, notably by adding relaxation and/or regularization parameters.
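For completeness, the Hellinger distance between two empirical categorical distributions can be computed in a few lines of base R (a self-contained sketch on simulated levels, not package code):

```r
# Hellinger distance between two discrete distributions p and q
hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))

yA <- factor(c("G1","G1","G2","G3","G2","G1"))   # observed levels
yB <- factor(c("G1","G2","G2","G3","G3","G2"))   # predicted levels
pA <- prop.table(table(yA))                      # empirical estimator
pB <- prop.table(table(yB))

hellinger(pA, pB)    # about 0.26: above the 0.05 similarity threshold
hellinger(pA, pA)    # 0: identical distributions
```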
Finally, the verif_OT function uses the mean and standard deviation of the conditional probabilities P(Z = ẑ_i | Y = y_i, X = x_i) estimated by each model as indicators of the stability of the individual predictions (provided that stab.prob = TRUE). It is nevertheless possible that conditional probabilities are computed from too few individuals (according to the frequency of each profile of shared variables met) to be considered as reliable estimates of reality. To avoid this problem, trimmed means and standard deviations are suggested, by removing these specific probabilities from the computation using the min.neigb parameter. In output, the results related to this last study are stored in the res.stab object. The first result shows that 14 individual predictions among 362 (the total number of rows in api35) have been assigned to subjects that exhibit a unique combination of predictors and outcome. From these subjects, it would obviously be overkill to draw conclusions about the reliability of the predictions provided by the model. We therefore decide to fix here a threshold of 5, below which it is difficult to extract any information related to the prediction. From the remaining probabilities, we simply compute the average (0.968), which can be interpreted as follows: when the fitted model is confronted with the same profile of predictors and the same level of apicl_1999, more than 96 times out of a hundred it will return the same individual prediction of apicl_2000.
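The trimming principle behind min.neigb can be sketched in base R (illustrative numbers, not the values of the example):

```r
# Conditional prediction probabilities and the number of subjects each
# one was estimated from (illustrative values)
probs     <- c(0.99, 0.97, 0.95, 1.00, 0.60)
n_neigb   <- c(40,   25,   18,   3,    2)
min.neigb <- 5            # estimates based on fewer subjects are dropped

mean(probs[n_neigb >= min.neigb])   # trimmed stability indicator: 0.97
```

The last two probabilities, estimated from only 3 and 2 subjects, are excluded from the average, exactly as the rare profiles are excluded in the res.stab computation.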
We ran a total of 11 models to determine the optimal one for the prediction of apicl_2000 in api35. All the arguments of the OT_outcome and OT_joint functions stay unchanged, with the exception of indiv.method, maxrelax, and lambda.reg, which vary. The values linked to each criterion of the verif_OT function are summarized in Tables 8, 9 and 10. The syntax related to verif_OT stays unchanged for each model (notably the same min.neigb argument). Note that comparisons of prediction stability between OUTCOME and JOINT models impose that the prox.dist argument of the OT_outcome function is fixed to 0.

Table 8: Cramér's V criterion (V_cram) and rank correlation (rank_cor) between apicl_1999 (the observed variable) and apicl_2000 (the predicted variable) in the api35 database (N = 362). The predictions come from 11 models with various algorithms, relaxation and regularization parameters. They are very stable and reproducible here when the relaxation parameter differs from 0.

From these results, we can conclude that:

• Whatever the algorithm used (OUTCOME or JOINT), adding a relaxation parameter improves here the association between the target variables (see Table 8).
• According to the results of Table 9, the R_outc2 and R_outc5 models do not seem optimal, because they are those for which the Hellinger criterion reflects the clearest violation of assumption 1. It is therefore suggested to keep a relaxation parameter less than 0.6 in the final model.
• Table 8 confirms this trend, because adding a too high relaxation parameter seems to affect the quality of the association (the Cramér's V and the rank correlation decrease when the relaxation parameter increases from 0.4 to 0.6). Consequently, in this example, fixing the relaxation parameter to 0.4 seems to be an acceptable compromise whatever the R-OUTCOME algorithm used (0.3 could also have been tested here).
• Among the remaining models (R_outc1, R_outc4, R_outj1, and R_outj4), R_outj1 and R_outj4 seem to be those with the most stable predictive potential (Table 10). Moreover, R_outc4 appears here as the best model among the tested R-OUTCOME algorithms.
• Adding a regularization parameter to the R_outj1 model caused a decrease in the stability of the predictions (see the value of R_outj4 compared to R_outj1 in Table 10), and we thus conclude in favor of R_outj1 as the best model among those tested in the JOINT family of algorithms.
• Table 11 shows that the three remaining models (R_outc4, R_outj1 and R_outj4) counted between 92 and 97% of common predictions, and this last result also reassures the user about the quality of the provided predictions.
The confusion matrices related to R_outc4 and R_outj1 are described in Table 12 and seem to confirm that the encoding of apicl_2000 in three groups could simply correspond to the grouping of the levels G2 and G3 of the apicl_1999 variable.
Finally, note that the running time of each model took less than 15 seconds with R version 4.0.3 for Windows (10 Pro 64-bits / Intel 2.66 GHz processor).

Table 9: Hellinger distances related to the 6 models that used OUTCOME and R-OUTCOME algorithms.

Model    Type       Method      Relax  Hell(YA_YB)
outc1    OUTCOME    SEQUENTIAL  0.0    0.008
R_outc1  R-OUTCOME  SEQUENTIAL  0.4    0.085
R_outc2  R-OUTCOME  SEQUENTIAL  0.6    0.107
R_outc3  R-OUTCOME  OPTIMAL     0.0    0.002
R_outc4  R-OUTCOME  OPTIMAL     0.4    0.080
R_outc5  R-OUTCOME  OPTIMAL     0.6    0.102

The values are relatively homogeneous from one model to another and do not indicate strong distributional divergences (all values are very far from 1). Nevertheless, choosing relaxation parameters higher than 0.4 can increase the risk of distributional dissimilarities (the criterion moves away from 0.05 when the relaxation parameter increases).

Concerning the stability of the predictions (Table 10): when the R_outj1 model is confronted with the same profile of predictors and the same level of apicl_1999, more than 94 times out of a hundred (mean = 0.942) it is able to return the same individual prediction of apicl_2000. Among the OUTCOME family of algorithms, R_outc4 provides here the best stability of prediction, while R_outj2 is the optimal one in the JOINT family of algorithms.

Conclusion and perspectives
To our knowledge, OTrecod is the first R package that takes advantage of optimal transportation theory (Monge 1781) in the context of data fusion, and the comparative study of methods described in Garès and Omer (2022) underlines the promising performances of these approaches.
For more details and examples about the functions, users are invited to consult the ARTICLES section of the dedicated website.

Drawbacks
The functions of OTrecod only apply to a specific data fusion context where there is no overlapping part between the two raw data sources. The package does not deal with record linkage, which is already the focus of extensive works (Sariyar and Borg 2010). It is not adapted when the target variables (outcomes) of the two databases are continuous with an infinite number of values (like weights in grams with decimals, for example). Moreover, while the data fusion requires only one shared variable to work, the quality of the fusion depends on the presence of a subset of good shared predictors in the raw databases. Finally, while the OT_outcome function allows all types of predictors, the current version of the OT_joint function imposes categorical matching variables only (scale variables are allowed): this restriction should be relaxed in a next version.