OTrecod: An R Package for Data Fusion using Optimal Transportation Theory

The advances of information technologies often confront users with large amounts of data which they need to integrate easily. In this context, creating a single database from multiple separate data sources can appear as an attractive but complex issue when the same information of interest is stored in at least two distinct encodings. In this situation, merging the data sources consists in finding a common recoding scale to fill the incomplete information in a synthetic database. The OTrecod package provides R users with two functions dedicated to solving this recoding problem using optimal transportation theory. Specific arguments of these functions enrich the algorithms by relaxing distributional constraints or adding a regularization term to make the data fusion more flexible. The OTrecod package also provides a set of support functions dedicated to the harmonization of separate data sources, the handling of incomplete information and the selection of matching variables. This paper gives all the keys to quickly understand and master the original algorithms implemented in the OTrecod package, assisting the user step by step in their data fusion project.

Gregory Guernec (Université de Toulouse, INSERM, UPS) , Valerie Gares http://vgares.perso.math.cnrs.fr/contact.html (INSA, Université de Rennes) , Jeremy Omer https://jeremyomer.wixsite.com/recherche (INSA, Université de Rennes) , Philippe Saint-Pierre https://perso.math.univ-toulouse.fr/psaintpi/ (IMT, Université Paul Sabatier, Toulouse) , Nicolas Savy https://perso.math.univ-toulouse.fr/savy/ (IMT, Université Paul Sabatier, Toulouse)
2023-02-10

1 Introduction

The large amount of data produced by information technology requires flexible tools to facilitate its handling. Among them, the field of data fusion (Hall and Llinas 1997; Klein 2004; Castanedo 2013) also known as statistical matching (Adamek 1994; D’Orazio et al. 2006; Vantaggi 2008) aims to integrate the overall information from multiple data sources for a better understanding of the phenomena that interact in the population.

Assume that two heterogeneous databases A and B share a set of common variables X, while an information of interest is encoded in two distinct scales: Y in A and Z in B. If Y and Z are never jointly observed, a basic data fusion objective consists in recoding Y on the scale of Z (or conversely), to allow the fusion of the databases as illustrated in Table 1.

Table 1: This package provides algorithms for merging two databases A and B where two variables of interest, Y and Z, are never jointly observed: the final result is a unique and synthetic database where Y and Z are fully completed.

Providing a solution to this recoding problem is often very attractive because it gives access to more accurate and consistent information in a unique, larger database at no additional cost. Despite this, if we exclude the R data integration packages dedicated to the genetic area, like the MultiDataSet package (Hernandez-Ferrer et al. 2017), the OMICsPCA package (Das and Tripathy 2022), or the mixOmics package (F et al. 2017), which are often only effective for quantitative data integration, the StatMatch package (D’Orazio 2022) is, to our knowledge, currently the only one that provides a concrete solution to the problem, using hot deck imputation procedures. The main reason for this relative deficiency is that this problem is, in fact, often assimilated to and solved as a missing data imputation problem. Following this idea, a very large body of work and reference books now exists about the handling of missing data (Little and Rubin 2019; Zhu et al. 2019). Moreover, we can enumerate several R packages that can be sorted by type of imputation method (Mayer et al. 2019): mice (van Buuren and Groothuis-Oudshoorn 2011) and missForest (Stekhoven and Bühlmann 2012; Stekhoven 2022), which use conditional models, or softImpute (Hastie and Mazumder 2021) and missMDA (Josse and Husson 2016), which apply low-rank based models. For all these packages, imputation performance can fluctuate a lot according to the structure and the proportion of non-response encountered. Regression and non-parametric imputation approaches (like hot deck methods from the donor-based family) seem to use only partially, or not at all, the available information of the two databases to provide the individual predictions. Contrary to our approach, all these methods only use the set of shared variables X for the prediction of Z in A (or conversely Y in B), without really taking into account Y and its interrelations with X in their prediction process.

The purpose of this paper is to present a new package to the R community called OTrecod, which provides simple and intuitive access to two original algorithms (Garès et al. 2020; Garès and Omer 2022) dedicated to solving these recoding problems in the data fusion context by considering them as applications of optimal transportation theory (OT). This theory has already been applied in many areas: for example, the transport package (Schuhmacher et al. 2022) solves optimal transport problems in the field of image processing. A specific package called POT: Python Optimal Transport (Flamary et al. 2021) also exists in the Python software (Van Rossum and Drake Jr 1995) to solve optimization problems using optimal transportation theory in the fields of signal theory, image processing and domain adaptation. Nevertheless, all these available tools were still not really adapted to our recoding problem, and the good performance of OT-based approaches for predicting missing information compared to more standard processes (Garès et al. 2020; Muzellec et al. 2020; Garès and Omer 2022) confirmed our decision.

In OTrecod, the first provided algorithm, called OUTCOME and integrated in the OT_outcome function, consists in finding a map that pushes the distribution of Y forward to the distribution of Z (Garès et al. 2020), while the second one, called JOINT and integrated in the OT_joint function, pushes the distribution of (Y,X) forward to the distribution of (Z,X). Consequently, by construction, these two algorithms take advantage of all the potential relationships between Y, Z, and X for the prediction of the incomplete information of Y and/or Z in A and B. Enrichments of these algorithms, described in (Cuturi 2013; Garès and Omer 2022), are also available via the optional arguments of these two functions. In its current version, these algorithms are accompanied by original preparation (merge_dbs, select_pred) and validation (verif_OT) functions which are key steps of any standard statistical matching project and can be used independently of the OUTCOME and JOINT algorithms.

2 Solving recoding problems using optimal transportation theory

The optimal transportation problem

The optimal transportation (OT) problem was originally stated by Monge (1781) and consists in finding the cheapest way to transport a pile of sand to fill a hole. Formally, the problem is stated as follows.

Consider two (Radon) spaces \(\mathbb{X}\) and \(\mathbb{Y}\), a probability measure \(\mu^X\) on \(\mathbb{X}\), a probability measure \(\mu^Y\) on \(\mathbb{Y}\), and a Borel-measurable function \(c\) from \(\mathbb{X} \times \mathbb{Y}\) to \(\left[0,\infty\right]\). Kantorovich’s formulation of the optimal transportation problem (Kantorovich 1942) consists in finding a measure \(\gamma \in \Gamma(\mu^X,\mu^Y)\) that realizes the infimum:

\[\begin{equation} \inf\left\{\left. \int_{{\mathbb{X}} \times {\mathbb{Y}}} c(x, y) \, \mathrm{d} \gamma (x,y) \right| \gamma \in \Gamma(\mu^X,\mu^Y) \right\}, \tag{1} \end{equation}\]

where \(\Gamma(\mu^X,\mu^Y)\) is the set of measures on \(\mathbb{X} \times \mathbb{Y}\) with marginals \(\mu^X\) on \(\mathbb{X}\) and \(\mu^Y\) on \(\mathbb{Y}\).

This theory is applied here to solve a recoding problem of missing distributions in a data fusion area. To do so, we make use of Kantorovich’s formulation adapted to the discrete case, known as Hitchcock’s problem (Hitchcock 1941). Therefore, by construction, the proposed algorithms are usable for specific target variables only: categorical variables, ordinal or nominal, and discrete variables with a finite number of values.

Optimal transportation of outcomes applied to data recoding

Let \(A\) and \(B\) be two databases corresponding to two independent sets of subjects. We assume without loss of generality that the two databases have equal sizes, so that they can be written as \(A=\{i_1,\dots,i_{n}\}\) and \(B=\{j_1,\dots,j_{n}\}\). Let \(\big((X_i,Y_i,Z_i)\big)_{i\in A}\) and \(\big((X_j,Y_j,Z_j)\big)_{j\in B}\) be two sequences of i.i.d. discrete random variables with values in \(\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}\), where \(\mathcal{X}\) is a finite subset of \(\mathbb{R}^P\), and \(\mathcal{Y}\) and \(\mathcal{Z}\) are finite subsets of \(\mathbb{R}\). Variables \((X_i,Y_i,Z_i), i\in A\), are i.i.d copies of \((X^A,Y^A,Z^A)\) and \((X_j,Y_j,Z_j), j\in B\), are i.i.d copies of \((X^B,Y^B,Z^B)\). Moreover assume that \(\big\{(X_i,Y_i,Z_i),i\in A\big\}\) are independent of \(\big\{(X_j,Y_j,Z_j),j\in B\big\}\). The first version of the optimal transportation approach, described in Garès et al. (2020), relies on the following assumptions:

Assumption 1. \(Y^A\) and \(Z^A\) respectively follow the same distribution as \(Y^B\) and \(Z^B\).

Assumption 2. For all \(x \in \mathcal{X}\) the probability distributions of \(Y^A\) and \(Z^A\) given that \(X^A= x\) are respectively equal to those of \(Y^B\) and \(Z^B\) given that \(X^B= x\).

In this setting, the aim is to solve the recoding problem given by equation (1) that pushes \(\mu^{Y^A}\) forward to \(\mu^{Z^A}\). The variable \(\gamma\) of (1) is a discrete measure with marginals \(\mu^{Y^A}\) and \(\mu^{Z^A}\), represented by a \(|\mathcal{Y}| \times |\mathcal{Z}|\) matrix. The cost function \(c\) is represented by a \(|\mathcal{Y}| \times |\mathcal{Z}|\) matrix \((c_{y,z})_{y\in \mathcal{Y},z\in \mathcal{Z}}\). The goal is the identification of:

\[\begin{equation} \gamma^*\in argmin_{\gamma \in \mathbb{R}_+^{|\mathcal{Y}| \times |\mathcal{Z}|}}\left\{\langle \gamma \; ,\; c \rangle : \gamma \mathbf{1}_{|\mathcal{Z}|} = \mu^{Y^A},\gamma^T \mathbf{1}_{|\mathcal{Y}|} = \mu^{Z^A}\right\}, \tag{2} \end{equation}\]

where \(\langle \cdot \;,\; \cdot \rangle\) is the dot product, \(\mathbf{1}\) is a vector of ones with appropriate dimension and \(M^T\) is the transpose of matrix \(M\). The cost function considered by Garès et al. (2020), \(c_{y,z}\), measures the average distance between the profiles of shared variables of \(A\) satisfying \(Y=y\) and subjects of \(B\) satisfying \(Z=z\), that is:

\[\begin{equation} c_{y,z} = \mathbb{E} \left[d(X^A,X^B) \mid Y^A= y, Z^B= z\right], \tag{3} \end{equation}\]

where \(d\) is a given distance function to choose on \(\mathcal{X} \times \mathcal{X}\).

In fact, the above problem cannot be solved as such in practice, since the distributions of \(X^A\), \(X^B\), \(Y^A\) and \(Z^A\) are unknown and never jointly observed. As a consequence, the following unbiased empirical estimators are used: \(\hat{\mu}^{X^A}_n\) of \({\mu}^{X^A}\) and \(\hat{\mu}^{X^B}_n\) of \({\mu}^{X^B}\). Because \(Y\) and \(Z\) are only available in \(A\) and \(B\) respectively, two distinct empirical estimators have to be defined:

\[\begin{equation} \begin{aligned} \hat{\mu}^{Y^A}_{n,y} & = & \frac{1}{n}\sum_{i\in A} ~\mathbf{1}_{\{Y_i = y\}},\: \forall y\in\mathcal{Y},\\ \hat{\mu}^{Z^A}_{n,z} & = & \frac{1}{n}\sum_{j\in B} ~\mathbf{1}_{\{Z_j = z\}},\: \forall z\in \mathcal{Z}, \end{aligned} \tag{4} \end{equation}\]

where \(\mathbf{1}_{\{Y = y\}}=1\) if \(Y=y\) and \(0\) otherwise. Assumption 1 gives \({\mu}^{Z^A}={\mu}^{Z^B}\), from which we can conclude that \(\hat{\mu}^{Z^B}_{n,z}=\hat{\mu}^{Z^A}_{n,z}\). Finally, denoting by \[\kappa_{n,y,z}\equiv \sum_{i\in A} \sum_{j\in B}~ \mathbf{1}_{\left\{Y_i=y,Z_j=z\right\}},\]

the number of pairs \((i,j)\in A\times B\) such that \(Y_i=y\) and \(Z_j=z\), the cost matrix \(c\) is estimated by:

\[\begin{equation} \hat{c}_{n,y,z}=\left\{ \begin{array}{ll} \frac{1}{\kappa_{n,y,z}}\sum_{i\in A} \sum_{j\in B}~ \mathbf{1}_{\left\{Y_i=y,Z_j=z\right\}} \times d(X_i,X_j), & \: \forall y\in \mathcal{Y}, z\in\mathcal{Z}:\kappa_{n,y,z}\neq 0,\\ 0, & \:\forall y \in \mathcal{Y}, z \in \mathcal{Z}:\kappa_{n,y,z} = 0. \end{array}\right. \tag{5} \end{equation}\]
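
To fix ideas, the following minimal sketch (independent of OTrecod) computes the empirical estimators (4) and the cost matrix (5) on small simulated data; all objects below are purely illustrative assumptions, and empty \((y,z)\) pairs, set to \(0\) in (5), are ignored for simplicity.

set.seed(1)
A  <- data.frame(Y = factor(sample(c("y1", "y2"), 20, replace = TRUE)))
B  <- data.frame(Z = factor(sample(c("z1", "z2", "z3"), 20, replace = TRUE)))
XA <- matrix(rnorm(40), nrow = 20)   # shared covariates observed in A
XB <- matrix(rnorm(40), nrow = 20)   # shared covariates observed in B

muY_hat <- prop.table(table(A$Y))    # empirical distribution of Y in A
muZ_hat <- prop.table(table(B$Z))    # empirical distribution of Z in B (= Z in A under Assumption 1)

# average Euclidean distance between subjects of A with Y = y and subjects of B with Z = z
D <- as.matrix(dist(rbind(XA, XB)))[1:20, -(1:20)]
cost_hat <- sapply(levels(B$Z), function(z)
  sapply(levels(A$Y), function(y) mean(D[A$Y == y, B$Z == z])))
cost_hat                             # rows: levels of Y, columns: levels of Z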

Plugging the values observed for these estimators into (2) yields a linear programming model denoted:

\[\begin{equation} \hat{\mathcal{P}}^0_n: \left\{ \begin{aligned} \min\: & <\widehat{c}_n,\gamma>\\ \text{s.t.}\:& \sum_{z\in \mathcal{Z}} \gamma_{y,z} = \hat{\mu}^{Y^A}_{n,y}, \:\forall y\in \mathcal{Y}, \\ & \sum_{y\in \mathcal{Y}} \gamma_{y,z} = \hat{\mu}^{Z^A}_{n,z}, \:\forall z\in \mathcal{Z}, \\ & \gamma_{y,z} \geq 0, \: \forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{6} \end{equation}\]
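
As an illustration of what solving \(\hat{\mathcal{P}}^0_n\) amounts to, the following minimal sketch solves the linear program (6) for toy marginals with the lpSolve package (OTrecod itself relies on the ROI infrastructure and its solvers, presented later in this paper); the marginals and the cost matrix below are illustrative assumptions.

library(lpSolve)
muY  <- c(0.3, 0.5, 0.2)                  # estimated distribution of Y (3 levels)
muZ  <- c(0.25, 0.25, 0.5)                # estimated distribution of Z (3 levels)
cost <- matrix(c(0, 1, 2,
                 1, 0, 1,
                 2, 1, 0), nrow = 3, byrow = TRUE)
sol <- lp.transport(cost, "min",
                    row.signs = rep("=", 3), row.rhs = muY,
                    col.signs = rep("=", 3), col.rhs = muZ,
                    integers = NULL)      # continuous (non-integer) transport plan
gamma_hat <- sol$solution                 # estimate of the joint distribution of (Y, Z)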

The solution \(\hat{\gamma}_n\) can then be interpreted as an estimator \(\hat{\mu}^{(Y^A,Z^A)}_n\) of the joint distribution of \(Y^A\) and \(Z^A\), \(\mu^{(Y^A,Z^A)}\). While this estimate is necessary, it is nevertheless insufficient to provide the individual predictions of \(Z\) in \(A\). These predictions are obtained in a second step using a nearest neighbor algorithm, from which an estimation of \(\mu^{Z^A\mid X^A=x,Y^A=y}\) is deduced (see Garès and Omer (2022) for details). In the remainder, the overall algorithm described in this section is referred to as OUTCOME. To address the few drawbacks of this algorithm described in Garès et al. (2020), algorithms derived from OUTCOME have been developed (Garès and Omer 2022) and are described in the following part.

Optimal transportation of outcomes and covariates

Using the same notations, Garès and Omer (2022) propose to search for an optimal transportation map between the two joint distributions \(\mu^{(X^A,Y^A)}\) and \(\mu^{(X^A,Z^A)}\). Under Kantorovich’s formulation in a discrete setting, they search for: \[\gamma^*\in \operatorname{argmin}_{\gamma\in \mathcal{D}} <c,\gamma>,\] where \(c\) is a given cost matrix and \(\mathcal{D}\) is the set of joint distributions with marginals \(\mu^{(X^A,Y^A)}\) and \(\mu^{(X^A,Z^A)}\). It is natural to see any element \(\gamma\in \mathcal{D}\) as the vector of joint probabilities \(\mathbb{P}((X^A=x,Y^A=y),(X^A=x',Z^A=z))\) for any \(x,x'\in\mathcal{X}^2\), \(y\in\mathcal{Y}\) and \(z\in\mathcal{Z}\). Since this probability is zero for all \(x\neq x'\), \(\gamma\in \mathcal{D}\) is defined as a vector of \(\mathbb{R}^{\left\lvert \mathcal{X} \right\rvert\times \left\lvert \mathcal{Y} \right\rvert \times\left\lvert\mathcal{Z} \right\rvert }\), where \(\gamma_{x,y,z}\) stands for an estimation of the joint probability \(\mathbb{P}(X^A=x,Y^A=y,Z^A=z)\). These notations lead to the more detailed model:

\[\begin{equation} \mathcal{P}:\left\{ \begin{aligned} \min\: & <c,\gamma> \\ \text{s.t.}\:& \sum_{z\in Z} \gamma_{x,y,z} = \mu^{(X^A,Y^A)}_{x,y}, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y},\\ & \sum_{y\in Y} \gamma_{x,y,z} = \mu^{(X^A,Z^A)}_{x,z}, \:\forall x\in\mathcal{X}, \forall z\in \mathcal{Z},\\ & \gamma_{x,y,z} \geq 0, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{7} \end{equation}\]

The above problem can be solved only if the marginals \(\mu^{(X^A,Y^A)}\) and \(\mu^{(X^A,Z^A)}\) are known, but, based on Assumption 2, unbiased estimators \(\hat{\mu}^{(X^A,Y^A)}_n\) and \(\hat{\mu}^{(X^A,Z^A)}_n\) can be built as in the previous subsection. For the first one, this gives:

\[\begin{equation} \begin{aligned} \hat{\mu}^{(X^A,Y^A)}_{n,x,y} & = & \frac{1}{n}\sum_{i\in A} ~\mathbf{1}_{\left\{Y_i = y,X_i = x\right\}}, \:\forall x\in\mathcal{X}, \: \forall y\in\mathcal{Y}. \end{aligned} \tag{8} \end{equation}\]

The cost matrix introduced for the OUTCOME algorithm in (3) and estimated by (5) is used here. Formally, we can write:

\[\begin{equation} c_{x,y,z} = c_{y,z}, \:\forall x\in\mathcal{X},\forall y\in\mathcal{Y},\forall z\in\mathcal{Z}, \tag{9} \end{equation}\]

which does not depend on the value of \(x\).

Plugging the values observed for these estimators into (7) yields a linear programming model denoted as \(\widehat{\mathcal{P}}_n\). In contrast to OUTCOME, the algorithm that consists in solving \(\widehat{\mathcal{P}}_n\) to solve the recoding problem is referred to as JOINT in what follows.

An estimation of the distribution of \(Z^A\) given the values of \(X^A\) and \(Y^A\) is then given by:

\[\begin{equation} \tilde{\mu}^{Z^A\mid X^A=x,Y^A=y}_{n,z}= \left\{\begin{aligned} &\frac{\hat{\gamma}_{n,x,y,z}}{\hat{\mu}^{(X^A,Y^A)}_{n,x,y}}, & \:\forall x\in\mathcal{X},y\in\mathcal{Y},z\in\mathcal{Z}: \hat{\mu}^{(X^A,Y^A)}_{n,x,y}\neq 0,\\ &0, & \:\forall x\in\mathcal{X},y\in\mathcal{Y},z\in\mathcal{Z}: \hat{\mu}^{(X^A,Y^A)}_{n,x,y}= 0. \end{aligned}\right. \tag{10} \end{equation}\]

and an individual prediction of \(Z^A\) is then deduced using the maximum a posteriori rule: \[ \widehat{z}_i^A= \operatorname{argmax}_{z\in\mathcal{Z}} \tilde{\mu}^{Z^A\mid X^A=x_i,Y^A=y_i}_{n,z}. \]
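
For illustration, the maximum a posteriori rule takes a few lines of R; cond_prob below is an assumed (illustrative) estimated conditional distribution of \(Z^A\) for one profile \((x, y)\).

cond_prob <- c(G1 = 0.10, G2 = 0.25, G3 = 0.55, G4 = 0.10)   # illustrative probabilities
z_hat <- names(cond_prob)[which.max(cond_prob)]              # predicted level: "G3"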

Due to potential estimation errors, the constraints of \(\widehat{\mathcal{P}}_n\) may deviate from the true values of the marginals of \(\mu^{(X^A,Y^A,Z^A)}\). To deal with this situation, small violations of the constraints of \(\widehat{\mathcal{P}}_n\) are allowed by enriching the initial algorithm as described in Garès and Omer (2022).

The equality constraints of \(\widehat{\mathcal{P}}_n\) are then relaxed as follows:

\[\begin{align} & \sum_{z\in Z} \gamma_{x,y,z} = \hat{\mu}^{(X^A,Y^A)}_{n,x,y} + e^{X,Y}_{x,y}, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y} \tag{11}\\ & \sum_{y\in Y} \gamma_{x,y,z} = \tilde{\mu}^{(X^A,Z^A)}_{n,x,z} + e^{X,Z}_{x,z}, \:\forall x\in\mathcal{X}, \forall z\in \mathcal{Z} \tag{12}\\ & \sum_{x\in \mathcal{X},y\in \mathcal{Y}} e^{X,Y}_{x,y} = 0,\: \sum_{x\in \mathcal{X},z\in \mathcal{Z}} e^{X,Z}_{x,z} = 0 \tag{13}\\ & -e^{X,Y,+}_{x,y}\leq e^{X,Y}_{x,y} \leq e^{X,Y,+}_{x,y}, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y} \tag{14}\\ & -e^{X,Z,+}_{x,z}\leq e^{X,Z}_{x,z} \leq e^{X,Z,+}_{x,z}, \:\forall x\in\mathcal{X}, \forall z\in \mathcal{Z} \tag{15}\\ & \sum_{x\in \mathcal{X},y\in \mathcal{Y}} e^{X,Y,+}_{x,y} \leq \alpha_n,\: \sum_{x\in \mathcal{X},z\in \mathcal{Z}} e^{X,Z,+}_{x,z} \leq \alpha_n. \tag{16} \end{align}\]

This relaxation is made possible by introducing extra variables \(e^{X,Y,+}\) and \(e^{X,Z,+}\) in the additional constraints (14) and (15). Garès and Omer (2022) suggest setting \(\alpha_n:=\frac{\alpha}{\sqrt{n}}\) in (16), where \(\alpha\) is a parameter to calibrate numerically, with a default value fixed to \(0.4\).

A regularization term, weighted by a parameter \(\lambda\) and based on the ratios \((\frac{\gamma_{x,y,z}}{\hat{\mu}^{X^A}_{n,x}})_{x\in\mathcal{X},y\in\mathcal{Y},z\in\mathcal{Z}}\), can also be added to improve the regularity of the variations of the conditional distribution \(\mu^{(Y^A,Z^A)\mid X^A=x}\) with respect to \(x\). The corresponding regularized algorithm is:

\[\begin{equation} \widehat{\mathcal{P}}^R_n: \left\{ \begin{aligned} \min\: & <\widehat{c}_n,\gamma> + \lambda \sum_{(x_i,x_j)\in E_\mathcal{X}} w_{i,j}\sum_{y\in\mathcal{Y},z\in\mathcal{Z}} r^+_{i,j,y,z}\\ &\text{s.t.}\: \text{constraints } \text{(11)--(16)} \\ & \frac{\gamma_{x_i,y,z}}{\hat{\mu}^{X^A}_{n,x_i}}-\frac{\gamma_{x_j,y,z}}{\hat{\mu}^{X^A}_{n,x_j}} \leq r^+_{i,j,y,z}, \:\forall \{x_i,x_j\}\in E_\mathcal{X}, y\in\mathcal{Y}, z\in \mathcal{Z}\\ & \frac{\gamma_{x_i,y,z}}{\hat{\mu}^{X^A}_{n,x_i}}-\frac{\gamma_{x_j,y,z}}{\hat{\mu}^{X^A}_{n,x_j}} \geq -r^+_{i,j,y,z}, \:\forall \{x_i,x_j\}\in E_\mathcal{X}, y\in\mathcal{Y}, z\in \mathcal{Z}\\ & \gamma_{x,y,z} \geq 0, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{17} \end{equation}\]

The constant \(\lambda\in\mathbb{R}^+\) is a regularization parameter to be calibrated numerically (\(0.1\) can be considered as a default value) and \(E_\mathcal{X}\subset \mathcal{X}^2\) includes the pairs of elements of \(\mathcal{X}\) defined as neighbors: \(\left\{x_i, x_j\right\}\in E_\mathcal{X}\) if \(x_j\) is among the \(k\) nearest neighbors of \(x_i\) for some parameter \(k \geq 1\). The method that computes a solution to the recoding problem with regularization and relaxation is called R-JOINT.
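
The neighborhood set \(E_\mathcal{X}\) can be sketched as follows (independently of OTrecod): each covariate profile is paired with its \(k\) nearest profiles; the profile matrix and the value \(k = 2\) are illustrative assumptions.

Xprof <- matrix(c(0, 0,  1, 0,  0, 1,  3, 3,  4, 3), ncol = 2, byrow = TRUE)
k  <- 2
Dx <- as.matrix(dist(Xprof))
E_X <- do.call(rbind, lapply(seq_len(nrow(Dx)), function(i) {
  nn <- order(Dx[i, ])[2:(k + 1)]     # k closest profiles, skipping the profile itself
  cbind(i, nn)
}))
E_X                                   # pairs (i, j) such that x_j is a neighbor of x_i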

In the same way, a relaxation of Assumption 1 is also proposed and added to the OUTCOME algorithm: the resulting method is denoted R-OUTCOME and corresponds to the following program:

\[\begin{equation} \widehat{\mathcal{P}}^{0-R}_n: \left\{ \begin{aligned} \min\: & <\widehat{c}_n,\gamma>\\ & \sum_{z\in \mathcal{Z}} \gamma_{y,z} = \hat{\mu}^{Y^A}_{n,y} + e^{Y}_{y}, \:\forall y\in \mathcal{Y}\\ & \sum_{y\in \mathcal{Y}} \gamma_{y,z} = \tilde{\mu}^{Z^A}_{n,z} + e^{Z}_{z}, \:\forall z\in \mathcal{Z}\\ & \sum_{y\in \mathcal{Y}} e^{Y}_{y} = 0,\: \sum_{z\in \mathcal{Z}} e^{Z}_z = 0\\ & -e^{Y,+}_{y}\leq e^{Y}_{y} \leq e^{Y,+}_{y}, \:\forall y\in \mathcal{Y}\\ & -e^{Z,+}_{z}\leq e^{Z}_{z} \leq e^{Z,+}_{z}, \:\forall z\in \mathcal{Z}\\ & \sum_{y\in \mathcal{Y}} e^{Y,+}_{y} \leq \alpha_n,\: \sum_{z\in \mathcal{Z}} e^{Z,+}_{z} \leq \alpha_n\\ & \gamma_{y,z} \geq 0, \:\forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{18} \end{equation}\]

Note that algorithms JOINT and R-JOINT do not require Assumption 1. The relaxation in R-OUTCOME alleviates its dependence on the satisfaction of this assumption. However, algorithms JOINT and R-JOINT require that \(X\) is a set of discrete variables (factors, ordered or not, are obviously allowed), while the absence of \(X\) in the linear programs of OUTCOME and R-OUTCOME allows \(X\) to be a set of discrete and/or continuous variables. In this case, the nature of the variables \(X\) needs to be considered when choosing the distance \(d\).

3 Package installation and description

Installation

The OTrecod package can be installed from the Comprehensive R Archive Network (CRAN) using the following command:

install.packages("OTrecod")

The development version of OTrecod is also available and can be installed directly from GitHub by loading the devtools (Wickham et al. 2022) package and running install_github("otrecoding/OTrecod").
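
For example (assuming devtools is already installed):

# install.packages("devtools")   # if needed
library(devtools)
install_github("otrecoding/OTrecod")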

Main functions

The two types of optimal transportation algorithms previously introduced (OUTCOME and JOINT) and their respective enrichments (R-OUTCOME and R-JOINT) are available in the OTrecod package via two core functions called OT_outcome and OT_joint. Details about their implementations in R are described in the following section. In this package, these recoding algorithms are seen as fundamental steps of data fusion projects, which often also require adapted preparation and validation functions. Table 2 introduces the main functions proposed by OTrecod to handle a data fusion project.

Table 2: A brief description of the main functions of OTrecod
R Function Description
Pre-process functions
merge_dbs Harmonization of the data sources
select_pred Selection of matching variables
Functions of data fusion
OT_outcome Data fusion with OT theory using the OUTCOME or R-OUTCOME algorithms.
OT_joint Data fusion with OT theory using the JOINT or R-JOINT algorithms.
Post-process function
verif_OT Quality assessment of the data fusion

All the intermediate functions integrated in the OT_outcome and OT_joint functions (proxim_dist, avg_dist_closest, indiv_grp_closest, indiv_grp_optimal), and their related documentation, are included and usable separately in the package. They have been kept available to users to ensure great flexibility, as have other functions of interest like power_set, which returns the power set of a set: this function was not previously available in R and could be of interest for specialists of algebra. These functions are not described here but are detailed in the pdf manual of the package.

4 Functionalities overview

Expected structure of the input databases as arguments

The recoding functions OT_outcome and OT_joint require a specific data.frame structure as input argument. Described in Table 3, it must be the result of two overlayed databases made of at least four variables: the database identifier, the two target variables Y and Z, and at least one shared variable.

Table 3: Example of expected structure for two databases 1 and 2 with three shared variables X1, X2, X3
DB Y Z X_1 X_2 X_3
1 (600-800] NA M Yes 50
1 (600-800] NA M No 32
1 [200-600] NA W No 31
2 NA G1 M No 47
2 NA G3 W Yes 43
2 NA G2 W No 23
2 NA G4 M Yes 22
2 NA G2 W Yes 47

As additional examples, users can also refer to the databases simu_data and tab_test provided in the package, which follow the expected structure. Note that no specific class object is expected here as input argument of these functions, to allow users to work freely with or without the pre-process functions provided in the package.

Choice of solver

The OTrecod package uses the ROI optimization infrastructure (Theußl et al. 2017) to solve the optimization problems related to the OUTCOME and JOINT algorithms. The solver GLPK (the GNU Linear Programming Kit (Makhorin 2011)) is the default solver currently integrated in the OT_outcome and OT_joint functions for handling linear problems with linear constraints. The ROI infrastructure makes it easy for users to switch solvers for comparisons. In many situations, some of them can noticeably reduce the running time of the functions.

For example, the solver Clp (Forrest et al. 2004), for COIN-OR Linear Programming, known to be particularly convenient in linear and quadratic situations, can be easily installed via the related plug-in available in ROI (searchable with the instruction ROI_available_solvers()) by following the few instructions detailed in (Theußl et al. 2020) or via the dedicated website.
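
The following short sketch lists the solvers available through ROI and installs the Clp plug-in; the plug-in name ROI.plugin.clp is an assumption to be checked against the ROI documentation.

library(ROI)
ROI_available_solvers()               # lists solvers and the plug-ins that provide them
# install.packages("ROI.plugin.clp")  # installs the Clp plug-in (assumed name)
# library(ROI.plugin.clp)             # registers Clp so it can be selected via the solvR argument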

An illustrative example

In California (United States), from 1999 to 2018, the Public Schools Accountability Act (PSAA) required the California Department of Education (CDE) to provide annually the results of an Academic Performance Index (API), which established a ranking of the best public schools of the state.

This numeric score, an indicator of a school’s performance level, could vary from 200 to 1000, and the performance objective to reach for each school was 800. Information related to the 418 schools (identified by cds) of Nevada County (County 29) and to the 362 schools of San Benito County (County 35) was collected in two databases, api29 and api35, available in the package. The distributions of all the variables in the two databases are provided by the following R commands:

library(OTrecod)
data(api29); data(api35)
summary(api29) #--------------------------------------------------
     cds                 apicl_2000  stype    awards   
 Length:418         [200-600] : 93   E:300   No  : 98  
 Class :character   (600-800] :180   M: 68   Yes :303  
 Mode  :character   (800-1000]:145   H: 50   NA's: 17  
                                                       
                                                       
                                                       
    acs.core        api.stu         acs.k3.20   grad.sch  
 Min.   :16.00   Min.   : 108.0   <=20   :190   0   : 86  
 1st Qu.:26.00   1st Qu.: 336.2   >20    :108   1-10:180  
 Median :30.00   Median : 447.5   unknown:120   >10 :152  
 Mean   :31.97   Mean   : 577.8                           
 3rd Qu.:39.00   3rd Qu.: 641.5                           
 Max.   :50.00   Max.   :2352.0                           
       ell          mobility        meals     full   
 [0-10]  :153   [0-20]  :362   [0-25]  :200   1: 85  
 (10-30] : 83   (20-100]: 56   (25-50] : 56   2:333  
 (30-50] : 60                  (50-75] :100          
 (50-100]: 93                  (75-100]: 62          
 NA's    : 29                                        
                                                     
summary(api35) #--------------------------------------------------
     cds            apicl_1999 stype    awards       acs.core    
 Length:362         G1:91      E:257   No  :111   Min.   :16.00  
 Class :character   G2:90      M: 67   Yes :237   1st Qu.:25.00  
 Mode  :character   G3:90      H: 38   NA's: 14   Median :30.00  
                    G4:91                         Mean   :31.81  
                                                  3rd Qu.:39.00  
                                                  Max.   :50.00  
    api.stu         acs.k3.20   grad.sch         ell      mobility
 Min.   : 102.0   <=20   :227   0   : 50   [0-10]  :164   1:213   
 1st Qu.: 363.2   >20    : 30   1-10:241   (10-30] : 99   2:149   
 Median : 460.0   unknown:105   >10 : 71   (30-50] : 64           
 Mean   : 577.2                            (50-100]: 10           
 3rd Qu.: 624.0                            NA's    : 25           
 Max.   :2460.0                                                   
      meals     full   
 [0-25]  : 77   1:244  
 (25-50] : 92   2:118  
 (50-75] :122          
 (75-100]: 71          
                       
                       

The two databases seem to share a majority of variables (same labels, same encodings) of different types; therefore, inferential analyses could ideally be considered by combining all the information of the two databases to study the effects of social factors on the API score in 2000.

Nevertheless, while this target variable, called apicl_2000, is correctly stored in the database api29 and encoded as a three-level ordered factor with clearly defined levels, [200-600], (600-800] and (800-1000], the only information related to the API score in the database api35 is the variable apicl_1999, the API score collected in 1999, encoded in four unknown balanced classes (G1, G2, G3 and G4). As no school is common to the two counties, we easily deduce that these two variables have never been jointly observed.

By choosing these two variables as outcomes (also called target variables), the objective of the following examples consists in creating a synthetic database where the missing information related to the API score in 2000 is fully completed in api35, illustrating the use of the main functions of the package.

Harmonization of two datasources using merge_dbs

The function merge_dbs is an optional pre-process function dedicated to data fusion projects that merges two raw databases by detecting possible discrepancies between the respective sets of variables from one database to another. The discrepancy situations currently detected by the function follow specific rules described below:

These situations sometimes require reconciliation actions which are not supported by the current version of the merge_dbs function. Therefore, when reconciliation actions are required, the user will have to handle them a posteriori and outside the function.

Applied to our example and according to the introduced rules, the first step of our data fusion project consists in studying the coherence between the variables of the databases api29 and api35 via the following R code:

step1 = merge_dbs(DB1 = api29, DB2 = api35,
        NAME_Y = "apicl_2000", NAME_Z = "apicl_1999",
        row_ID1 = 1, row_ID2 = 1,  
        ordinal_DB1 = c(2:3, 8:12), 
        ordinal_DB2 = c(2:3, 8:12))

As input, the raw databases must be declared separately in the DB1 and DB2 arguments, and the names of the related target variables must be specified via NAME_Y and NAME_Z for DB1 and DB2 respectively. In presence of row identifiers, the respective column indexes of each database must be set in the arguments row_ID1 and row_ID2. The arguments ordinal_DB1 and ordinal_DB2 list the column indexes of all the variables defined as ordinal in the two databases (also including the indexes of the target variables if necessary). Here, apicl_2000 is clearly an ordinal variable and, by default, we suppose that the unknown encoding related to apicl_1999 is also ordinal: the corresponding indexes (2 and 3) are thus added to these two arguments.

After running, the function informs users that no row was dropped from the databases during the merging because each target variable is fully completed in its own database. Nine potential predictors are kept, while only one variable is discarded because of discrepancies between the databases: its identity is stored in the output and informs the user about the nature of the problem: the mobility factor has different levels from one database to the other.

summary(step1)
              Length Class      Mode     
DB_READY      12     data.frame list     
ID1_drop       0     -none-     character
ID2_drop       0     -none-     character
Y_LEVELS       3     -none-     character
Z_LEVELS       4     -none-     character
REMOVE1        0     -none-     NULL     
REMOVE2        1     -none-     character
REMAINING_VAR  9     -none-     character
IMPUTE_TYPE    1     -none-     character
DB1_raw       12     data.frame list     
DB2_raw       12     data.frame list     
SEED           1     -none-     numeric  
step1$REMOVE1   # List of removed variables because of type's problem
NULL
step1$REMOVE2   # List of removed factors because of levels' problem
[1] "mobility"
levels(api29$mobility); levels(api35$mobility)  # Verification
[1] "[0-20]"   "(20-100]"
[1] "1" "2"
step1$REMAINING_VAR
[1] "acs.core"  "acs.k3.20" "api.stu"   "awards"    "ell"      
[6] "full"      "grad.sch"  "meals"     "stype"    

The function returns a list object, notably DB_READY, a data.frame whose structure corresponds to the expected structure introduced in the previous subsection: a unique database, here the result of the superimposition of api29 on api35, where the first column (DB) corresponds to the database identifier (1 for api29 and 2 for api35), and the second and third columns (Y and Z respectively) correspond to the target variables of the two databases. Missing values are automatically assigned by the function to the unknown part of each target variable: in Y when the identifier equals 2, in Z when the identifier equals 1. The subset of shared variables whose information is homogeneous between the databases is now stored from the fourth column onwards. Their identities are available in the output object called REMAINING_VAR.

The merge_dbs function can handle incomplete shared variables via the impute argument. This option allows keeping the missing information unchanged (the default choice, and also the case in this example), working with complete cases only, performing fast multivariate imputations by the chained equations approach (the function integrates the main functionalities of the mice function of the mice package), or imputing data using the dimensionality reduction method introduced in the missMDA package (see the pdf manual for details).
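
As an illustration, the following hedged call sketches the use of the impute argument; the value "MICE" below is an assumption to be checked against the pdf manual, the other arguments being reused from the call above.

step1_imp = merge_dbs(DB1 = api29, DB2 = api35,
                      NAME_Y = "apicl_2000", NAME_Z = "apicl_1999",
                      row_ID1 = 1, row_ID2 = 1,
                      ordinal_DB1 = c(2:3, 8:12), ordinal_DB2 = c(2:3, 8:12),
                      impute = "MICE")   # assumed value: multivariate imputation by chained equations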

Selection of matching variables using select_pred

In data fusion projects, the selection of the shared variables (also called matching variables) appears as an essential step for two main reasons. First, the number of shared variables X between the two databases can be large (much higher than three variables) and keeping variables that carry pointless variability could strongly reduce the quality of the fusion. Second, this selection greatly influences the quality of the predictions regardless of the matching technique chosen a posteriori (Adamek 1994). The specific context of data fusion is subject to the following constraints:

  1. Y, Z and X are never jointly observed in database A or B, so the relationships between Y and X, and between Z and X must be investigated separately.
  2. matching variables need to be good predictors of both target variables (outcomes) Y and Z (Cohen 1991), in the sense that they explain relevant parts of variability of the targets.

These particularities imply that the investigations have to be done separately but in the same manner in both databases. In this way, a predictor that appears highly correlated to both Y and Z will be automatically selected for the fusion. On the contrary, predictors whose effects on Y and Z seem not obvious will be discarded. Additional recommended rules also emerge from the literature:

The function of the package dedicated to this task is select_pred. From the DB_READY database generated in the previous subsection, studying the outputs related to each database, produced by the following R commands, assists users in selecting the best predictors:

# For the dataset api29 --------------
step2a = select_pred(step1$DB_READY,
                     Y = "Y", Z = "Z", ID = 1, OUT = "Y",
                     quanti  = c(4,6), nominal = c(1,5,7), 
                     ordinal = c(2,3,8:12), thresh_cat = 0.50, 
                     thresh_num = 0.70, RF_SEED = 3011)

# For the dataset api35 --------------
step2b = select_pred(step1$DB_READY,
                     Y = "Y", Z = "Z", ID = 1, OUT = "Z",
                     quanti  = c(4,6), nominal = c(1,5,7), 
                     ordinal = c(2,3,8:12), thresh_cat = 0.50, 
                     thresh_num = 0.70, RF_SEED = 3011)

The quanti, nominal, and ordinal arguments require vectors of the column indexes related to the type of each variable of the input database. The ID argument specifies the column index of the database identifier. Y and Z expect the respective names of the target variables, while OUT specifies the target variable to predict (Y or Z).

To detect the subset of best predictors, select_pred studies the standard pairwise associations between each shared variable and the outcomes Y and Z, taken separately (by only varying the argument OUT), according to its type: for numeric and/or ordered factor variables, select_pred calculates the associations using rank correlation coefficients (Spearman), while the Cramér’s V criterion (Bergsma 2013) is used for categorical variables and non-ordered factors. The related ranking tables of top scoring predictors are available in two distinct output objects: cor_OUTC_num and vcrm_OUTC_cat. In our example, the corresponding results for each database are:

### ASSOCIATIONS BETWEEN TARGET VARIABLES AND SHARED VARIABLES

## Results for the api29 dataset -----

step2a$cor_OUTC_num   # Y versus numeric or ordinal predictors
  name1    name2 RANK_COR pv_COR_test   N
7     Y    meals  -0.8030      0.0000 418
4     Y      ell  -0.7514      0.0000 389
5     Y     full   0.3919      0.0000 418
6     Y grad.sch   0.3346      0.0000 418
3     Y  api.stu  -0.1520      0.0018 418
8     Y    stype  -0.1520      0.0018 418
2     Y acs.core  -0.0556      0.2566 418
step2a$vcrm_OUTC_cat  # Y versus nominal or ordinal predictors
  name1     name2 V_Cramer CorrV_Cramer   N
7     Y     meals   0.6871       0.6835 418
4     Y       ell   0.6015       0.5966 389
5     Y      full   0.4508       0.4459 418
6     Y  grad.sch   0.3869       0.3816 418
3     Y    awards   0.2188       0.2073 401
2     Y acs.k3.20   0.1514       0.1349 418
8     Y     stype   0.1370       0.1185 418
## Results for the api35 dataset -----

step2b$cor_OUTC_num   # Z versus numeric or ordinal predictors
  name1    name2 RANK_COR pv_COR_test   N
4     Z      ell  -0.7511      0.0000 337
7     Z    meals  -0.7291      0.0000 362
6     Z grad.sch   0.6229      0.0000 362
5     Z     full   0.3997      0.0000 362
3     Z  api.stu  -0.0563      0.2851 362
8     Z    stype   0.0359      0.4959 362
2     Z acs.core   0.0119      0.8219 362
step2b$vcrm_OUTC_cat  # Z versus nominal or ordinal predictors
  name1     name2 V_Cramer CorrV_Cramer   N
7     Z     meals   0.5181       0.5121 362
6     Z  grad.sch   0.5131       0.5063 362
4     Z       ell   0.4775       0.4701 337
5     Z      full   0.4033       0.3934 362
3     Z    awards   0.2040       0.1818 348
8     Z     stype   0.1320       0.0957 362
2     Z acs.k3.20   0.1223       0.0817 362

The first two tables, related to api29, highlight the strong associations between Y and the variables meals, ell, full and grad.sch, in this order of importance, while the summary tables related to api35 highlight the strong associations between Z and the variables meals, grad.sch, ell and full.

It is not unusual to observe that one or more predictors are in fact linear combinations of others. In supervised learning areas, these risks of collinearity increase with the number of predictors and must be detected beforehand, to keep only the most parsimonious subset of predictors for the fusion. To help detect collinearity situations, the result of a Farrar and Glauber (FG) test is provided (Farrar and Glauber 1967). This test is based on the determinant of the rank correlation matrix (Spearman) of the shared variables, D, and the corresponding test statistic is given by:

\[ S_{FG} = - \left(n-1-\frac{2k+5}{6}\right) \ln(\det(D)), \]

where \(n\) is the number of rows and \(k\) is the number of covariates. Under the null hypothesis, \(S_{FG}\) follows a chi-square distribution with \(k(k-1)/2\) degrees of freedom, and not rejecting this hypothesis indicates an absence of collinearity. In presence of a large number of numeric covariates and/or ordered factors, the approximate Farrar-Glauber test, based on the normal approximation of the null distribution (Kotz et al. 2000), can be more adapted and the test statistic becomes:

\[ \sqrt{2S_{FG}} - \sqrt{2\nu-1}, \]

where \(\nu = k(k-1)/2\) is the number of degrees of freedom given above. Users will note that these two tests can often appear highly sensitive, in the sense that they tend to easily conclude in favor of multicollinearity. Thus, it is suggested to consider their results as indicators of collinearity between predictors rather than as an essential condition of acceptability. The results related to these tests are stored in the FG_test object of the select_pred output.
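
For illustration, both statistics can be computed in a few lines from the Spearman correlation matrix of k numeric or ordered predictors stored in a matrix Xmat; this sketch is independent of select_pred and uses simulated data.

set.seed(2)
Xmat <- matrix(rnorm(200), ncol = 4)          # n = 50 rows, k = 4 illustrative predictors
D  <- cor(Xmat, method = "spearman")
n  <- nrow(Xmat); k <- ncol(Xmat); df <- k * (k - 1) / 2
S_FG <- -(n - 1 - (2 * k + 5) / 6) * log(det(D))
pchisq(S_FG, df = df, lower.tail = FALSE)     # p-value of the chi-square version
sqrt(2 * S_FG) - sqrt(2 * df - 1)             # statistic of the normal approximation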

In presence of collinearity, select_pred tests the pairwise associations between all the potential predictors according to their types (Spearman or Cramér’s V). The thresh_num argument fixes the threshold beyond which two ranked predictors are considered as redundant, while thresh_cat fixes the threshold of acceptability for the Cramér’s V criterion in the subgroup of factors. In output, the results stored in the collinear_PB object permit the identification of the corresponding variables. In our example, we observe:

### DETECTION OF REDUNDANT PREDICTORS

## Results for the api29 dataset -----

## Results of the Farrar-Glauber test 
step2a$FG_test       
          DET_X      pv_FG_test pv_FG_test_appr 
     0.04913067      0.00000000      0.00000000 
## Identity of the redundant predictors

step2a$collinear_PB  
$VCRAM
       name1 name2 V_Cramer CorrV_Cramer   N
71 acs.k3.20 stype   0.6988       0.6971 418
43       ell meals   0.6696       0.6664 389
34      full meals   0.5215       0.5152 418

$SPEARM
     name1 name2 RANK_COR pv_COR_test   N
43     ell meals   0.9047           0 389
62 api.stu stype   0.7069           0 418
## Results for the api35 dataset -----

step2b$FG_test    # Significant result
          DET_X      pv_FG_test pv_FG_test_appr 
      0.1479339       0.0000000       0.0000000 
step2b$collinear_PB
$VCRAM
       name1 name2 V_Cramer CorrV_Cramer   N
71 acs.k3.20 stype   0.6977       0.6957 362

$SPEARM
[1] name1       name2       RANK_COR    pv_COR_test N          
<0 rows> (or 0-length row.names)

The FG test warns the user against the risks of collinearity between predictors, and the function notably detects strong collinearities between the variables meals, ell and full in api29 (weaker trends in api35): information that has to be taken into account during the selection. The subset of predictors finally kept for the data fusion must stay small to improve its quality. When the initial number of shared variables is not too large, as in this example, choosing the best candidates among groups of redundant predictors can be done manually by selecting the highest ranked predictors in the summary tables previously described. In this way, the variable meals could be preferred to ell and full, while the variable stype could be dropped. Consequently, a possible final list of predictors could be composed of the variables meals and grad.sch only.

When the number of predictors is large, or when users prefer an automatic variable selection process, a random forest procedure can also be used via the select_pred function. Random forest approaches (Breiman 2001) are particularly convenient here (Grajski et al. 1986) for multiple reasons: they work well when the number of variables exceeds the number of rows, whatever the types of covariates; they deal with non-linearity; they can handle correlated predictors and ordered or non-ordered outcomes; and they rank good candidate predictors through a built-in criterion: the variable importance measure.

In a few words, a random forest procedure aggregates many CART models (Breiman et al. 1984) built on RF_ntree bootstrap samples from the raw data source; averaging the accuracies of these models reduces the related variance and also the prediction error. A standard random forest procedure provides two distinct measures of the importance of each variable for the prediction of the outcome, the Gini importance criterion and the permutation importance criterion, which depend on the frequency of appearance of the predictor but also on its position in each tree. For more details about random forest theory, users can consult Breiman (2001) and/or the pdf manual of the randomForest (Liaw and Wiener 2002) package.

Strobl et al. (2009) suggest that the permutation importance criterion, which works with permuted samples (subsampling without replacement) instead of bootstrap ones, is particularly convenient with uncorrelated predictors, but must be replaced by a conditional permutation measurement in presence of strong correlations. select_pred provides these assessments by integrating the main functionalities of the cforest and varimp functions of the party package (Hothorn et al. 2005, 2006; Strobl et al. 2007, 2008; Zeileis and Hothorn 2008). However, these measurements must be used with caution, taking into account the following constraints:

Table 4: Completing the RF_condi argument according to predictors
Possible scenarios Correlation between predictors State of the RF_condi argument Incomplete information
Same type predictors NO FALSE Allowed
Same type predictors YES TRUE Not allowed
Different type predictors NO FALSE Allowed
Different type predictors YES TRUE Not allowed

The select_pred function allows proceeding with different scenarios according to the type of predictors (Table 4 can help users to choose). The first one consists in boiling everything down to a set of categorical variables (ordered or not) by categorizing all the continuous predictors using the dedicated arguments (convert_num and convert_clss), and working with the conditional importance assessments that directly provide unbiased estimations (by setting the RF_condi argument to TRUE). Users can consult (Hothorn et al. 2006) for more details about the approach and the pdf manual of the package for details about the related arguments of the select_pred function. This approach does not take incomplete information into account, so the method will be applied to complete data only (incomplete rows will be temporarily removed from the study). It is nevertheless possible to impute missing data beforehand by using dedicated pre-existing R packages like mice (van Buuren and Groothuis-Oudshoorn 2011) or by using the imput_cov function provided in the OTrecod package.
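
A hedged sketch of this first scenario is given below: the continuous predictors are categorized via convert_num and convert_clss, and the conditional permutation importance is requested by setting RF_condi to TRUE. The number of classes (3) is an illustrative assumption; the other arguments are reused from the calls above.

step2c = select_pred(step1$DB_READY,
                     Y = "Y", Z = "Z", ID = 1, OUT = "Y",
                     quanti  = c(4,6), nominal = c(1,5,7),
                     ordinal = c(2,3,8:12),
                     convert_num = c(4,6), convert_clss = c(3,3),
                     RF = TRUE, RF_condi = TRUE, RF_SEED = 3011)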

The second possible scenario (always usable in presence of mixed-type predictors) consists in the execution of a standard random forest procedure after taking care to rule out collinearity issues by first selecting unique candidates among potentially redundant predictors (in this case, the discarded predictors are stored in the drop_var output object). This is the default approach used by select_pred as soon as the RF_condi argument is set to FALSE while RF is set to TRUE. This scenario can work in presence of incomplete predictors. By construction, note that the results of random forest procedures remain dependent on the successive random draws carried out for the constitution of the trees, and it is thus suggested to check their stability by testing different random seeds (RF_SEED argument) before concluding.

The R commands previously written automatically provide the results related to this second approach as additional output. The results from the two datasets show the permutation importance estimates of each variable, ranked in order of importance and expressed as percentages, after resolving collinearity problems:

step2a$RF_PRED   # For the api29 dataset
   meals    stype grad.sch   awards acs.core 
 71.5529  11.6282  11.1181   5.4674   0.2334 
step2b$RF_PRED   # For the api35 dataset
   meals      ell grad.sch     full    stype  api.stu acs.core 
 35.9974  28.3051  22.2094   5.2143   4.0192   1.8965   1.5643 
  awards 
  0.7937 

The results confirm that the variable meals appears as the best predictor of the target variables in api29 and api35 respectively. The variable ell is not present in the first list (see RF_PRED from step2a) because the variables meals and ell have been detected as strongly collinear (according to the initially chosen threshold), and so select_pred keeps the best choice between the two predictors: meals (the reasoning is the same for full, which disappeared from the list).

The ell variable has been kept in the second list (not found collinear enough to be removed here) and moreover appears as a good predictor of apicl_1999 in api35. Nevertheless, its potential collinearity problem with meals encourages us not to keep it for the rest of the study. According to this discussion, we finally keep meals, stype and grad.sch, which combine the advantages of being good predictors for the two target variables while not presenting major problems of collinearity between them.

The following synthetic database (called here bdd_ex) is now ready for the prediction of the missing API scores in api29, api35 or both, using the function OT_outcome or OT_joint:

bdd_ex = step1$DB_READY[, c(1:3,10:12)]; head(bdd_ex,3)
     DB         Y    Z grad.sch    meals stype
2850  1 (600-800] <NA>      >10   [0-25]     H
2851  1 (600-800] <NA>      >10   [0-25]     H
2852  1 [200-600] <NA>     1-10 (75-100]     H

Data fusion using OT_outcome

The OT_outcome function provides individual predictions of Z in A (and/or Y in B) by considering the recoding problem as the optimal transportation of outcomes. In a first step, the aim of the function is thus to determine \(\gamma\) from (2), and this estimate is then used in a second step to provide the predictions. The main input arguments of the function and the output values are respectively described in Tables 5 and 6.

Table 5: Main arguments of the OT_outcome and OT_joint functions. (*: NULL as default value)
Argument OT_outcome OT_joint Description (default value)
datab X X Data.frame in the expected structure
index_DB_Y_Z X X Indexes of the ID, Y, and Z columns (1,2,3)
nominal X X Column indexes of nominal variables (*)
ordinal X X Column indexes of ordinal variables (*)
logic X X Column indexes of boolean variables (*)
quanti X X Column indexes of quantitative variables (*)
convert.num X X Column indexes of the quantitative variables to convert (* or =quanti in OT_joint)
convert.clss X X Corresponding numbers of desired classes for conversion (*)
which.DB X X Specify the target variables to complete: both or only one (BOTH)
solvR X X Choice of the solver to solve the optimization problem (glpk)
dist.choice X X Distance function (Euclidean). See Table 7
percent.knn X X Ratio of closest neighbors involved in the computations (1)
indiv.method X Type of individual predictions (sequential) for OUTCOME and R-OUTCOME algorithms
maxrelax X X Adding of a relaxation parameter (0)
lambda.reg X Adding of regularization parameter (0)

The arguments datab, index_DB_Y_Z, quanti, nominal, ordinal, and logic are not optional and must be carefully completed before each execution. A unique synthetic data.frame made of two overlayed databases (called A and B for example, see Table 3) is expected as the datab argument. If this data.frame corresponds to the output objects DB_USED or DB_READY of the select_pred or merge_dbs functions respectively, then it already has the required structure. Otherwise, users must make sure that their data.frames are made up of two overlayed databases with at least four variables, as described in the subsection Optimal transportation of outcomes applied to data recoding. The order of the variables has no importance in the raw database but will have to be specified a posteriori in the index_DB_Y_Z argument if necessary.

Table 6: Values of the OT_outcome and OT_joint functions. (*: NULL if not required)
Value OT_outcome OT_joint Description
time_exe X X Running time of the algorithm
gamma_A X X Estimation of the joint distribution of (Y, Z) for the prediction of Z in A (*)
gamma_B X X Estimation of the joint distribution of (Y, Z) for the prediction of Y in B (*)
profile X X The list of detected profiles of covariates
res.prox X X A list that provides all the information related to the estimated proximities between profiles and groups of profiles
estimator_ZA X X Estimates of the probability distribution of Z conditional to X and Y in database A (*)
estimator_YB X X Estimates of the probability distribution of Y conditional to X and Z in database B (*)
DATA1_OT X X The database A fully completed (if required in input by the which.DB argument)
DATA2_OT X X The database B fully completed (if required in input by the which.DB argument)

The subset of remaining predictors used for the fusion may require prior transformations according to the distance function chosen in input by the user. This process is fixed by setting the dist.choice argument. The distances currently implemented in the function are: the standard Euclidean ("E") and Manhattan ("M") distances, the Hamming distance ("H", for binary covariates only), and the Gower distance ("G", sometimes preferred with mixed variables). The automatic transformations applied prior to the use of each distance function are summarized in Table 7.

Table 7: Internal variable transformations related to the choice of each distance function in OT_outcome and OT_joint. (*): if the number of covariates exceeds 1; T stands for table.
Distance function
Variable transformations Euclidean Manhattan Gower Hamming
Continuous Standardized Standardized No Not allowed
Boolean Binary Binary No Binary
Nominal Disjunctive T Disjunctive T No Disjunctive T
Ordinal Discrete Discrete No Disjunctive T
Incomplete information Allowed* Allowed* Allowed* Allowed*
dist.choice argument E M G H

The first version of the OT algorithm described in Garès et al. (2020) was tested on numeric coordinates extracted from a factor analysis of mixed data (FAMD) fitted on the mixed raw covariates (Pagès 2002). This transformation is available here by setting the FAMD.coord argument to "YES". The minimal percentage of information explained by the FAMD can also be fixed using the FAMD.perc argument. The OT_outcome function proposes four types of models for the prediction of Y in B and/or Z in A, according to the values of the indiv.method and maxrelax arguments:

When maxrelax > 0, it is recommended to calibrate the maxrelax parameter by testing different values and checking the stability of the individual predictions. In output, the gamma_A and gamma_B matrices correspond to the estimates of the joint distributions \(\gamma\) of \((Y^A,Z^A)\) and \((Y^B,Z^B)\) respectively (2). Moreover, a posteriori estimates of the conditional distributions \((Z^A|Y^A, X^A)\) and \((Y^B|Z^B,X^B)\) (10) are also provided in two lists called estimatorZA and estimatorYB respectively, and the completed databases \(A\) and \(B\) are stored in the DATA1_OT and DATA2_OT objects. In particular, the individual predictions are stored in the OTpred column of these data.frames. Moreover, the profile object is a data.frame that stores the profiles of covariates encountered in the two data sources, while the res_prox object stores all the distance computations that will be used in the validation step (users can consult the details of the proxim_dist function in the pdf manual).

The computation of the conditional probabilities implies defining which individuals are considered as neighbors of each encountered profile of covariates. The argument prox.dist fixes this threshold for each profile, following the decision rule for \(A\) (and respectively for \(B\)): a subject \(i\) of \(A\) (or a statistical unit corresponding to a row of \(A\)) is considered as a neighbor of a given profile of shared variables \(C\) as soon as: \[ \text{dist}(subject_i,C) < \text{prox.dist} \times \text{max}_{k=1,\dots , n_A} \text{dist}(subject_k,C). \] When the number of shared variables is small and all of them are categorical (optimal situation), it is suggested to set the prox.dist argument to \(0\). Finally, using the which.DB argument, users can choose to estimate the individual predictions of \(Y\) and \(Z\) in both databases (which.DB = "BOTH") or in only one of the two (which.DB = "A" for the predictions of \(Z\) in \(A\), or which.DB = "B" for the predictions of \(Y\) in \(B\)).

From the data.frame bdd_ex built previously, we use the OT_outcome function to illustrate the prediction of the target variable apicl_2000 in the api35 dataset via an OUTCOME algorithm and using a Euclidean distance function:

outc1 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6, 
                   dist.choice = "E", indiv.method = "sequential", 
                   which.DB = "B")

#---------------------------------------
# OT PROCEDURE in progress ...
#---------------------------------------
# Type                     = OUTCOME
# Distance                 = Euclidean
# Percent closest knn      = 100%
# Relaxation parameter     = NO
# Relaxation value         = 0
# Individual pred process  = Sequential
# DB imputed               = B
#---------------------------------------

In bdd_ex, all variables except the first one (DB: the database identifier) are or can be considered as ordinal factors, and the quanti and ordinal arguments are filled in accordingly. For the prediction of apicl_2000 in the api35 dataset (the steps would be the same for the prediction of apicl_1999 in api29 by setting which.DB = "A" or "BOTH"), the optimal transportation theory determines a map \(\gamma\) that pushes the distribution of apicl_1999 forward to the distribution of apicl_2000 in the database api35. In this case, \(\gamma^B\) is an estimator of the joint distribution of apicl_2000 and apicl_1999 in api35. The estimation of \(\gamma^B\) is available in the gamma_B output object, while all the profiles of predictors met in the two databases are listed in the profile object:

outc1$gamma_B
                   G1        G2         G3        G4
[200-600]  0.22248804 0.0000000 0.00000000 0.0000000
(600-800]  0.02889318 0.2486188 0.15311005 0.0000000
(800-1000] 0.00000000 0.0000000 0.09550874 0.2513812
outc1$profile[1,]         # the first profile
      ID grad.sch meals stype
2850 P_1        3     1     3

The output object estimatorYB is a list that corresponds to the estimations of the conditional probabilities of apicl_2000 in the api35 dataset for a given profile of predictors. For example, the conditional probabilities related to the first profile of predictors are:

outc1$estimatorYB[1,,]    # conditional probabilities (1st profile)
        [,1]      [,2]      [,3]
G1 0.3333333 0.3333333 0.3333333
G2 0.3333333 0.3333333 0.3333333
G3 0.0000000 0.0000000 1.0000000
G4 0.0000000 0.0000000 1.0000000

According to these results, we can conclude that: for a subject from api35 with the profile of predictors P_1 and whose level of apicl_1999 is 'G1', the probability for apicl_2000 to be predicted '[200-600]' is about \(0.33\). Finally, the individual predictions of apicl_2000 in api35 are available in the OTpred column of the DATA2_OT object:

head(outc1$DATA2_OT,3)  # The 1st 3 rows only
     DB    Y  Z grad.sch meals stype    OTpred
3895  2 <NA> G1        2     4     1 [200-600]
3896  2 <NA> G3        2     3     1 (600-800]
3897  2 <NA> G2        2     3     2 (600-800]
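
Before any formal validation, a quick consistency check can be done directly from this completed data.frame by crossing the observed levels of apicl_1999 (column Z) with the predicted levels of apicl_2000 (column OTpred). This is only a sketch; the verif_OT function presented later returns this confusion matrix (conf.mat) among other criteria:

### Sketch only: observed apicl_1999 versus predicted apicl_2000 in api35
table(outc1$DATA2_OT$Z, outc1$DATA2_OT$OTpred)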

The OUTCOME algorithm uses here a nearest neighbor procedure to assign the individual predictions from the estimation of \(\gamma\), which can be a drawback as described in Garès et al. (2020). The R-OUTCOME algorithm proposes an enrichment that directly assigns the individual predictions of apicl_2000 from the estimation of \(\gamma\), without using the nearest neighbor approach. In this case, a linear optimization problem is solved to determine the individual predictions that minimize the sum of the individual distances in api35 with the levels of the target apicl_2000 observed in api29. The corresponding R command is:

### R-OUTCOME algorithm: optimal assignments + relaxation parameter = 0
R_outc3 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6, 
                     dist.choice = "E" , indiv.method = "optimal",
                     which.DB = "B")

Moreover, the OUTCOME algorithm assumes that the distribution of apicl_2000 is the same in api29 and api35 (assumption 1 from the subsection Optimal transportation of outcomes applied to data recoding), which can appear as a strong hypothesis to hold in many situations. To overcome this constraint, the R-OUTCOME algorithm also allows assumption 1 to be relaxed by adding a relaxation parameter (18) in the optimization system. This relaxation is done by varying the maxrelax argument of the function:

### R-OUTCOME algorithm: optimal assignments + relaxation parameter = 0.4
R_outc4 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6, 
                     dist.choice = "E",
                     indiv.method = "optimal", 
                     maxrelax = 0.4, which.DB = "B")

Running these two models can take a few minutes. The quality of these models will be compared in Tables 8, 9 and 10 (subsection Validation of the data fusion using verif_OT).

Data fusion using OT_joint

The OT_joint function provides individual predictions of \(Z\) in \(A\) (and/or \(Y\) in \(B\)) by considering the recoding problem as an optimal transportation problem of covariates and outcomes, which pushes the conditional distribution of \((Y^A|X^A)\) forward to the conditional distribution of \((Z^A|X^A)\) (and conversely \((Z^B|X^B)\) forward to \((Y^B|X^B)\)). The aim is to determine \(\gamma\) from (7). Because joint distributions of outcomes and covariates are now mapped together (which was not the case with the OUTCOME family of algorithms), it is not required for the target variables to be equally distributed (assumption 1).

The call of OT_joint is designed to have globally the same structure as that of OT_outcome, as described in Tables 5 and 6, with a few differences described below. The JOINT and R-JOINT algorithms are usable via OT_joint for solving the recoding problem, depending on the values of the maxrelax and lambda.reg arguments: the JOINT algorithm corresponds to maxrelax = 0 and lambda.reg = 0, while the R-JOINT algorithm adds a relaxation term (maxrelax > 0) and/or a regularization term (lambda.reg > 0).

When maxrelax > 0 and/or lambda.reg > 0, it is recommended to calibrate these two parameters by testing many different values and studying the stability of the related individual predictions. Nevertheless, by default or as a starting point, Garès et al. (2020) suggest the use of default values determined from simulated databases: \(0.1\) for the regularization parameter (lambda.reg) and \(0.4\) for the relaxation one (maxrelax). Finally, the output objects are similar to those previously described for the OT_outcome function.

The implementation of the function is very similar to that of OT_outcome, and applied to our example, the R code related to a JOINT algorithm is:

outj1 = OT_joint(bdd_ex, nominal = 1, ordinal = c(2:6), 
                 dist.choice = "E"  , which.DB = "B")

#---------------------------------------
# OT JOINT PROCEDURE in progress ...
#---------------------------------------
# Type                  = JOINT
# Distance              = Euclidean
# Percent closest       = 100%
# Relaxation term       = 0
# Regularization term   = 0
# Aggregation tol cov   = 0.3
# DB imputed            = B
#---------------------------------------


### Extract individual predictions from the OTpred column
head(outj1$DATA2_OT,3)  # The 1st 3 rows only
     DB    Y  Z grad.sch meals stype    OTpred
3895  2 <NA> G1        2     4     1 [200-600]
3896  2 <NA> G3        2     3     1 (600-800]
3897  2 <NA> G2        2     3     2 (600-800]

To relax the constraints on the marginal distributions, we use the maxrelax argument, which corresponds to the \(\alpha\) parameter of an R-JOINT algorithm (see (17)). A regularization parameter \(\lambda\) can then simply be added to the previous model as follows:

### R-JOINT algorithm (relaxation parameter = 0.4)
R_outj1 = OT_joint(bdd_ex, nominal = 1, ordinal = c(2:6), 
                   dist.choice = "E",  maxrelax = 0.4,
                   which.DB = "B")

### R-JOINT algorithm (relaxation parameter = 0.4,
###                                   & regularization parameter = 0.1)
R_outj4 = OT_joint(bdd_ex, nominal = 1, ordinal = c(2:6), 
                   dist.choice = "E",  maxrelax = 0.4,
                   lambda.reg = 0.1,
                   which.DB = "B")
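
Before comparing these particular models, note that the calibration recommended above can be organized as a small grid search over maxrelax (and possibly lambda.reg), studying how the stability of the individual predictions evolves. The sketch below assumes the bdd_ex data.frame and that the remaining verif_OT arguments can be left at their defaults; it simply collects the mean stability criterion returned by verif_OT for a few relaxation values:

### Sketch only: stability of the predictions for several relaxation values
relax.grid = c(0.2, 0.4, 0.6)
stab.mean  = sapply(relax.grid, function(r) {
  mod = OT_joint(bdd_ex, nominal = 1, ordinal = 2:6,
                 dist.choice = "E", maxrelax = r, which.DB = "B")
  verif_OT(mod, stab.prob = TRUE, min.neigb = 5)$res.stab$mean
})
names(stab.mean) = paste0("maxrelax = ", relax.grid)
stab.mean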

These models are compared in the next subsection.

Validation of the data fusion using verif_OT

Assessing the quality of individual predictions obtained through statistical matching techniques can be a complex task, notably because it is legitimate to wonder about the true reliability of a joint distribution estimated from two variables which are never jointly observed (Rodgers 1984; Kadane 2001). When the study aims to identify estimators of interest that improve the understanding of the relationships between variables in the different databases (called the macro approach in D’Orazio et al. (2006)) without going as far as creating a complete and unique dataset, uncertainty analyses are usually proposed to estimate the sensitivity of the assessments (Rässler 2002). On the contrary, if the objective is to create a synthetic dataset where the information is available for every unit, the use of auxiliary information or proxy variables should be privileged to assess the quality of the results (Paass 1986). In the absence of complementary information, Rubin (1986) suggests studying how well the relationships between the distributions of the target variables are preserved. In this way, the verif_OT function proposes tools to assess the quality of the individual predictions provided by the algorithms previously introduced.

The first expected input argument of verif_OT is an 'otres' object returned by the OT_outcome or OT_joint functions. In our example, this function is first applied to the outc1 model with the following R command:

### Quality criteria for outc1 (OUTCOME model)
verif_outc1   = verif_OT(outc1, group.clss = TRUE, ordinal = FALSE, 
                         stab.prob = TRUE, min.neigb = 5)

### First results related to outc1:
verif_outc1$nb.profil
[1] 27
verif_outc1$res.prox
       N   V_cram rank_cor 
 362.000    0.860    0.892 

In output, the nb.profil value gives the number of profiles of predictors globally detected in the two databases \((nb = 27)\). In the example, a profile will be characterized by a combination of values related to the three predictors kept: grad.sch, meals and stype.

The first step of the function is dedicated to the study of the proximity between the two target variables. Standard criteria (Cramer’s V and Spearman’s rank correlation coefficient) are therefore used to evaluate the pairwise association between \(Y\) and \(Z\), globally or in only one of the two databases, according to the predictions provided by the output of OT_outcome or OT_joint. Only the database api35 was completed in the example (because which.DB = "B" in the call to OT_outcome) and stored in the DATA2_OT object; the criteria therefore compare the proximity between the distribution of the predicted values of apicl_2000 (Y) and that of the observed values of apicl_1999 (Z) in the database api35 (B). Observing independence or a weak correlation between \(Y^A\) and \(Z^A\) (or \(Y^B\) and \(Z^B\)) should raise questions about the reliability of the predictions, especially when \(Y\) and \(Z\) are supposed to summarize the same information. In the example, whatever the criterion used, we notice via the res.prox object a strong association between the two target variables in api35, which is reassuring about the consistency of the predictions. The related confusion matrix between the predicted values of apicl_2000 (Y) and the observed values of apicl_1999 (Z) is stored in the conf.mat object.

Second, the function proposes an optional tool (by setting group.clss = TRUE) which evaluates the impact of grouping the levels of one factor on the association between \(Y\) and \(Z\). When users initially have no information about one of the two encodings of interest, this approach can be particularly useful to distinguish ordinal encodings from nominal ones, and its principle is as follows. Assume that \(Y \in \mathcal{Y}\) and \(Z \in \mathcal{Z}\) (\(\mathcal{Y}\) and \(\mathcal{Z}\) are the respective sets of levels) and that \(|\mathcal{Y}| \geq |\mathcal{Z}|\). From \(Y\), successive new variables \(Y’ \in \mathcal{Y}'\) are built, as soon as \(\mathcal{Y}'\) is a partition of \(\mathcal{Y}\) such that \(|\mathcal{Y}'| = |\mathcal{Z}|\) (and conversely for \(Z\) if \(Z\) has the greater number of levels). The related associations between \(Z\) and \(Y’\) (which now have the same number of levels) are then studied using Cramer’s V, the rank correlation, the Kappa coefficient and the confusion matrix, and the results are stored in a table called res.grp. The corresponding outputs of the example are:

verif_outc1$conf.mat
            Z
predY         G1  G2  G3  G4 Sum
  [200-600]   81   1   1   1  84
  (600-800]   10  89  55   1 155
  (800-1000]   0   0  34  89 123
  Sum         91  90  90  91 362
verif_outc1$res.grp
  grp levels Z to Y error_rate Kappa Vcramer RankCor
4       G1/G2 G3/G4       13.3 0.794    0.83   0.877
6       G1/G2/G3 G4       19.1 0.714    0.78   0.839
2       G1 G3/G2/G4       28.2 0.593    0.68   0.624
1       G1 G2/G3/G4       37.6 0.457    0.64   0.813
3       G1 G4/G2/G3       43.4 0.374    0.59   0.115
5       G1/G2 G4/G3       43.4 0.326    0.64   0.574

It appears from these results that grouping the levels G2 and G3 of apicl_1999 (Z) strongly improves the association of this variable with apicl_2000 (Y): the error rate of the confusion matrix drops from \(43.4\) to \(13.3\). Moreover, the structure of conf.mat confirms that the encoding of apicl_1999 seems to be ordinal.
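
This grouping can also be checked directly on the completed database. The minimal sketch below (the recoded variable Z.grp is hypothetical, built only for illustration) merges the levels G2 and G3 of apicl_1999 and crosses the result with the predictions of apicl_2000:

### Sketch only: merging the levels G2 and G3 of apicl_1999 (Z)
Z.grp = outc1$DATA2_OT$Z
levels(Z.grp) = c("G1", "G2-G3", "G2-G3", "G4")   # duplicated names merge the levels
table(outc1$DATA2_OT$OTpred, Z.grp)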

The third step of the quality process integrated in the function corresponds to the study of the Hellinger distance (Liese and Miescke 2008). This distance function is used as a measure of the discrepancies between the observed and predicted distributions of \(Y\) (\(\mathcal{L}(Y^A)\) versus \(\mathcal{L}(\widehat{Y}^B)\)) and/or of \(Z\) (\(\mathcal{L}(\widehat{Z}^A)\) versus \(\mathcal{L}(Z^B)\)). For \(Y\) and \(Z\), the definition of the distance is respectively:

\[ \text{dist}_{\text{hell}} (Y^A,\widehat{Y}^B)= \sqrt{\frac{1}{2} \sum_{y\in \mathcal{Y}}\left(\sqrt{\mu_{n,y}^{Y^A}}- \sqrt{\mu_{n,y}^{\widehat{Y}^B}}\right)^2 } \]

and

\[ \text{dist}_{\text{hell}} (Z^B,\widehat{Z}^A)= \sqrt{\frac{1}{2} \sum_{z\in \mathcal{Z}}\left(\sqrt{\mu_{n,z}^{Z^B}}- \sqrt{\mu_{n,z}^{\widehat{Z}^A}}\right)^2 }, \]

where \(\mu_{n,y}^{\widehat{Y}^B}\) and \(\mu_{n,z}^{\widehat{Z}^A}\) correspond to the empirical estimators of \(\mu^{\widehat{Y}^B}\) and \(\mu^{\widehat{Z}^A}\) respectively.
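
To make the criterion concrete, the following sketch (a hypothetical helper, not part of the package) computes this empirical Hellinger distance from two vectors of categorical values, for example the observed levels of \(Y\) in \(A\) and its predicted levels in \(B\):

### Sketch only: empirical Hellinger distance between two categorical samples
hellinger = function(y_obs, y_pred) {
  lev = union(levels(factor(y_obs)), levels(factor(y_pred)))
  p   = prop.table(table(factor(y_obs,  levels = lev)))   # empirical distribution of y_obs
  q   = prop.table(table(factor(y_pred, levels = lev)))   # empirical distribution of y_pred
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}
## e.g. hellinger(observed Y in A, predicted Y in B) for the YA_YB comparison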

The Hellinger distance varies between 0 (identical distributions) and 1 (strong dissimilarities), and 0.05 can be used as an acceptable threshold below which two distributions can be considered as similar. When OUTCOME models are applied, this criterion makes it possible to check that the predictions respect assumption 1. It can also be used to determine the best relaxation parameter of an R-OUTCOME model. On the contrary, there is no need to interpret this criterion when an R-JOINT model is applied because, in this case, assumption 1 is not required. The results related to this criterion are stored in the hell object:

verif_outc1$hell
                YA_YB ZA_ZB
Hellinger dist. 0.008    NA

With a Hellinger distance equal to \(0.008\), well below the \(0.05\) threshold, assumption 1 holds here, but it remains interesting to test other algorithms to eventually improve the current one, notably by adding relaxation and/or regularization parameters.

Finally, the verif_OT function uses the mean and standard deviation of the conditional probabilities \(\mathbb{P}(Z=\hat{z}_i|Y=y_i,X=x_i)\) estimated by each model as indicators of the stability of the individual predictions (provided that stab.prob = TRUE). It is nevertheless possible that some conditional probabilities are computed from too few individuals (according to the frequency of each profile of shared variables met) to be considered as reliable estimates. To avoid this problem, trimmed means and standard deviations are suggested by removing these specific probabilities from the computation, using the min.neigb parameter. In output, the results related to this last study are stored in the res.stab object:

verif_outc1$eff.neig
   Nb.neighbor Nb.Prob
1            1      14
2            2      18
3            3      18
4            4      28
5            5      20
6            6       6
7            7      14
8            8      16
9            9      27
10          10      20
verif_outc1$res.stab
         N min.N  mean    sd
2nd DB 284     5 0.968 0.122

The first result shows that \(14\) individual predictions among \(362\) (the total number of rows in api35) have been assigned to subjects that exhibit a unique combination of predictors and outcome. It would obviously be unwise to draw conclusions about the reliability of the predictions provided by the model from these subjects alone. We therefore decide to fix here a threshold of \(5\) below which it is difficult to extract any information related to the prediction. From the remaining probabilities, we simply compute the average (\(0.968\)), which can be interpreted as follows: when the fitted model is confronted with a same profile of predictors and a same level of apicl_1999, more than \(96\) times out of a hundred it will return the same individual prediction for apicl_2000.

We ran a total of \(11\) models to determine the optimal one for the prediction of apicl_2000 in api35. All arguments of the OT_outcome and OT_joint calls stay unchanged, with the exception of indiv.method, maxrelax, and lambda.reg, which vary. The values linked to each criterion of the verif_OT function are summarized in Tables 8, 9 and 10. The syntax related to verif_OT stays unchanged for each model (notably the same min.neigb argument). Note that comparing the stability of the predictions between OUTCOME and JOINT models requires fixing the prox.dist argument of the OT_outcome function to \(0\).

Table 8: Cramer's V criterion (V_cram) and rank correlation (rank_cor) between apicl_1999 (the observed variable) and apicl_2000 (the predicted variable) in the api35 database (N=362). Predictions come from 11 models with various algorithms, relaxation and regularization parameters. They are very stable and reproducible here when the relaxation parameter differs from 0.
Model     Type       Method      Relax  Regul    N  V_cram  rank_cor
outc1     OUTCOME    SEQUENTIAL  0.0    -      362  0.86    0.892
R_outc1   R-OUTCOME  SEQUENTIAL  0.4    -      362  0.94    0.923
R_outc2   R-OUTCOME  SEQUENTIAL  0.6    -      362  0.91    0.917
R_outc3   R-OUTCOME  OPTIMAL     0.0    -      362  0.87    0.911
R_outc4   R-OUTCOME  OPTIMAL     0.4    -      362  0.95    0.939
R_outc5   R-OUTCOME  OPTIMAL     0.6    -      362  0.92    0.932
outj1     JOINT      -           0.0    0.0    362  0.74    0.834
R_outj1   R-JOINT    -           0.4    0.0    362  0.95    0.935
R_outj2   R-JOINT    -           0.6    0.0    362  0.91    0.927
R_outj3   R-JOINT    -           0.8    0.0    362  0.91    0.927
R_outj4   R-JOINT    -           0.4    0.1    362  0.95    0.931

From these results, we can conclude that:

Table 9: Hellinger distances related to the 6 models that used the OUTCOME and R-OUTCOME algorithms. The values are relatively homogeneous from one model to another and do not indicate strong distributional divergences (all values are very far from 1). Nevertheless, choosing relaxation parameters higher than 0.4 can increase the risk of distributional dissimilarities (indeed, the criterion moves away from 0.05 when the relaxation parameter increases).
Model     Type       Method      Relax  Hell(YA_YB)
outc1     OUTCOME    SEQUENTIAL  0.0    0.008
R_outc1   R-OUTCOME  SEQUENTIAL  0.4    0.085
R_outc2   R-OUTCOME  SEQUENTIAL  0.6    0.107
R_outc3   R-OUTCOME  OPTIMAL     0.0    0.002
R_outc4   R-OUTCOME  OPTIMAL     0.4    0.080
R_outc5   R-OUTCOME  OPTIMAL     0.6    0.102
Table 10: Stability of the predictions (min.neigb = 5). When the R_outj1 model is confronted with a same profile of predictors and a same level of apicl_1999, more than 94 times out of a hundred (mean = 0.942) it returns the same individual prediction for apicl_2000. Among the OUTCOME family of algorithms, R_outc4 provides here the best stability of prediction, while R_outj2 is the optimal one in the JOINT family of algorithms.
Model     Type       Method      Relax  Regul    N  mean   sd
outc1     OUTCOME    SEQUENTIAL  0.0    -      284  0.968  0.122
R_outc1   R-OUTCOME  SEQUENTIAL  0.4    -      284  0.950  0.151
R_outc2   R-OUTCOME  SEQUENTIAL  0.6    -      284  0.954  0.145
R_outc3   R-OUTCOME  OPTIMAL     0.0    -      284  0.979  0.100
R_outc4   R-OUTCOME  OPTIMAL     0.4    -      284  0.987  0.080
R_outc5   R-OUTCOME  OPTIMAL     0.6    -      284  0.983  0.091
outj1     JOINT      -           0.0    0.0    284  0.911  0.116
R_outj1   R-JOINT    -           0.4    0.0    284  0.942  0.128
R_outj2   R-JOINT    -           0.6    0.0    284  0.953  0.171
R_outj3   R-JOINT    -           0.8    0.0    284  0.934  0.199
R_outj4   R-JOINT    -           0.4    0.1    284  0.926  0.097
Table 11: Ratio of common predictions between two models.

Model     outc1  R_outc1  R_outc4  outj1  R_outj1
R_outc1   0.84   -        -        -      -
R_outc4   0.83   0.95     -        -      -
outj1     0.72   0.81     0.83     -      -
R_outj1   0.83   0.91     0.94     0.84   -
R_outj4   0.84   0.96     0.97     0.84   0.92
Table 12: Confusion matrices in the api35 dataset for the models (a) R_outc4 and (b) R_outj1.

(a) R_outc4
              apicl_1999
apicl_2000    G1  G2  G3  G4
[200-600]     91  15   0   0
(600-800]      0  75  90   0
(800-1000]     0   0   0  91

(b) R_outj1
              apicl_1999
apicl_2000    G1  G2  G3  G4
[200-600]     91  14   1   0
(600-800]      0  76  89   0
(800-1000]     0   0   0  91

The confusion matrices related to R_outc4 and R_outj1 are described in Table 12 and seem to confirm that the encoding of apicl_2000 in three groups could simply correspond to the grouping of the levels \(G2\) and \(G3\) of the apicl_1999 variable.

Finally, note that the running time of each model was less than 15 seconds with R version \(4.0.3\) for Windows (10 Pro 64-bit, Intel \(2.66\) GHz processor).

5 Conclusion and perspectives

To our knowledge, OTrecod is the first R package that takes advantage of optimal transportation theory (Monge 1781) in the context of data fusion, and the comparative study of methods described in Garès and Omer (2022) underlines the promising performances of these approaches.

For more details and examples about the functions, users are invited to consult the ARTICLES section of the dedicated website.

6 Drawbacks

The functions of OTrecod only apply to a specific data fusion context where there is no overlapping part between the two raw data sources. This package does not deal with record linkage, which is already the focus of extensive work (Sariyar and Borg 2010). This package is not adapted when the target variables (outcomes) of the two databases are continuous with an infinite number of values (like weights in grams with decimals, for example). Moreover, if the data fusion requires only one shared variable to work, the quality of the fusion depends on the presence of a subset of good shared predictors in the raw databases. Finally, while the function OT_outcome allows all types of predictors, the current version of the function OT_joint imposes categorical matching variables only (scale variables are allowed): this restriction should be relaxed in a future version.
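
When a continuous shared variable is essential to the fusion, a possible workaround before calling OT_joint is to discretize it beforehand. The sketch below uses base R only and a hypothetical continuous variable x.cont; it is not a feature of the package:

### Sketch only: discretizing a hypothetical continuous shared variable
### x.cont into quartiles before using it as a categorical predictor
bdd_ex$x.cat = cut(bdd_ex$x.cont,
                   breaks = quantile(bdd_ex$x.cont, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE)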

7 Perspectives

A number of more advanced research questions still need further investigation:

  1. the possibility of extending the OT algorithm for recoding variables to multidimensional frameworks.
  2. the stability of the algorithm when the matching variables are incomplete and the non-response processes are missing at random or not.
  3. the contribution of calibration techniques to the quality assessment process (Deming and Stephan 1940).
  4. the creation of a tuning function defining a grid search to find the optimal combination of parameters related to the OT algorithms, which could be added in future versions of the package.

8 Acknowledgments

The authors would especially like to thank Pierre Navaro (CNRS UMR 6625) for his advice during the implementation of the OTrecod package. This research was supported by the Région Occitanie Grant RBIO-2015-14054319 and the Mastodons-CNRS Grant.

9 Supplementary R code

### BASIC R CODE FOR SUMMARY TABLES  8, 9, 10, and 11 ---------

## Validation of each model: Repeat the following R command for each model by
## changing outc1:
verif_outc1   = verif_OT(outc1, group.clss = TRUE, ordinal = FALSE, 
                         stab.prob = TRUE, min.neigb = 5)

## Association between Y and Z: Summary Table 8
res.prx = rbind(
           outc1   = verif_outc1$res.prox    , R_outc1 = verif_R_outc1$res.prox, 
           R_outc2 = verif_R_outc2$res.prox  , R_outc3 = verif_R_outc3$res.prox,
           R_outc4 = verif_R_outc4$res.prox  , R_outc5 = verif_R_outc5$res.prox,
           outj1   = verif_outj1$res.prox    , R_outj1 = verif_R_outj1$res.prox,
           R_outj2 = verif_R_outj2$res.prox  , R_outj3 = verif_R_outj3$res.prox,
           R_outj4 = verif_R_outj4$res.prox )
                                
## the Model names follow the rbind order above
res.prx = data.frame(Model = c("outc1"  ,"R_outc1","R_outc2","R_outc3",
                               "R_outc4","R_outc5","outj1"  ,"R_outj1",
                               "R_outj2","R_outj3","R_outj4"), 
                     Type  = c("OUTCOME",rep("R-OUTCOME",5),"JOINT",
                               rep("R-JOINT",4)),
                     Relax = c(0,0.4,0.6,0,0.4,0.6,0,0.4,0.6,0.8,0.4), 
                     Regul = c(rep(0,10),0.1), res.prx)

row.names(res.prx) = NULL; head(res.prx,3)

#   Model      Type Relax Regul   N V_cram rank_cor
#   outc1   OUTCOME   0.0   0.0 362   0.86    0.892
# R_outc1 R-OUTCOME   0.4   0.0 362   0.94    0.923
# R_outc2 R-OUTCOME   0.6   0.0 362   0.91    0.917
#-----

## Hellinger distance: Summary Table 9
res.helld = rbind(
             outc1   = verif_outc1$hell  , R_outc1 = verif_R_outc1$hell,
             R_outc2 = verif_R_outc2$hell, R_outc3 = verif_R_outc3$hell, 
             R_outc4 = verif_R_outc4$hell, R_outc5 = verif_R_outc5$hell, 
             outj1   = verif_outj1$hell  , R_outj1 = verif_R_outj1$hell,
             R_outj2 = verif_R_outj2$hell, R_outj3 = verif_R_outj3$hell,
             R_outj4 = verif_R_outj4$hell )

res.helld = data.frame(res.prx[,1:4], res.helld)
row.names(res.helld) = NULL; res.helld
#-----

## Stability of the prediction: Summary Table 10 
# same R code as for Summary Table 8, changing res.prox by res.stab
#----

## Ratio of common predictions: Table 11

stoc = list(outc1$DATA2_OT$OTpred  , R_outc1$DATA2_OT$OTpred,
            R_outc4$DATA2_OT$OTpred, outj1$DATA2_OT$OTpred  ,
            R_outj1$DATA2_OT$OTpred, R_outj4$DATA2_OT$OTpred)

corpred = matrix(ncol = 6, nrow = 6)
for (i in 1:6){
      for (j in 1:6){
        corpred[i,j] = round(sum(diag(table(stoc[[i]],stoc[[j]])))/362,2)
      }
}; corpred
#-----

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-006.zip

CRAN packages used

OTrecod, StatMatch, mice, missForest, softImpute, missMDA, transport, devtools, randomForest, party

CRAN Task Views implied by cited packages

Environmetrics, MachineLearning, MissingData, MixedModels, OfficialStatistics, Psychometrics, Survival

Bioconductor packages used

MultiDataSet, OMICsPCA, mixOmics

References

J. C. Adamek. Fusion: Combining data from separate sources. Marketing Research, 6(3): 48, 1994.
W. Bergsma. A bias-correction for cramér’s v and tschuprow’s t. Journal of the Korean Statistical Society, 42(3): 323–328, 2013. URL https://www.sciencedirect.com/science/article/pii/S1226319212001032.
L. Breiman. Random forests. Machine Learning, 45(1): 5–32, 2001. URL https://doi.org/10.1023/A:1010933404324.
L. Breiman, J. Friedman, C. J. Stone and R. A. Olshen. Classification and regression trees. Taylor & Francis, 1984. URL https://books.google.fr/books?id=JwQx-WOmSyQC.
F. Castanedo. A review of data fusion techniques. TheScientificWorldJournal, 2013: 704504, 2013. DOI 10.1155/2013/704504.
N. Cibella. How to choose the matching variables, report WP2, ESS-net. Statistical Methodology Project on Integration of Surveys and Administrative Data, EUROSTAT, 2010.
M. Cohen. Statistical matching and microsimulation models, improving information for social policy decisions, the use of microsimulation modeling, technical papers, II. Washington, DC: National Academy, 1991.
M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, Eds C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Q. Weinberger 2013. Curran Associates, Inc. URL https://proceedings.neurips.cc/paper/2013/file/af21d0c97db2e27e13572cbf59eb343d-Paper.pdf.
M. D’Orazio. StatMatch: Statistical matching or data fusion. 2022. URL https://CRAN.R-project.org/package=StatMatch. R package version 1.4.1.
M. D’Orazio, M. Di Zio and M. Scanu. Statistical matching: Theory and practice. John Wiley & Sons, 2006.
S. Das and Dr. S. Tripathy. OMICsPCA: An r package for quantitative integration and analysis of multiple omics assays from heterogeneous samples. 2022. R package version 1.16.0.
W. E. Deming and F. F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. The Annals of Mathematical Statistics, 11(4): 427–444, 1940.
F. Rohart, B. Gautier, A. Singh and K.-A. Lê Cao. mixOmics: An R package for ’omics feature selection and multiple data integration. PLoS Computational Biology, 13(11): e1005752, 2017. URL http://www.mixOmics.org.
D. E. Farrar and R. R. Glauber. Multicollinearity in regression analysis: The problem revisited. The Review of Economics and Statistics, 49(1): 92–107, 1967. URL http://www.jstor.org/stable/1937887 [online; last accessed November 28, 2022].
R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, et al. POT : Python Optimal Transport. Journal of Machine Learning Research, 2021. URL https://hal.archives-ouvertes.fr/hal-03264013.
J. Forrest, D. de la Nuez and R. Lougee-Heimer. CLP user guide. IBM Research, 2004.
V. Garès, C. Dimeglio, G. Guernec, R. Fantin, B. Lepage, M. R. Kosorok and N. Savy. On the use of optimal transportation theory to recode variables and application to database merging. International Journal of Biostatistics, 16(1): article number : 20180106, 2020. URL https://hal.archives-ouvertes.fr/hal-01905857.
V. Garès and J. Omer. Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association, 117(537): 320–333, 2022. URL https://doi.org/10.1080/01621459.2020.1775615 .
K. A. Grajski, L. Breiman, G. V. Di Prisco and W. J. Freeman. Classification of EEG spatial patterns with a tree-structured methodology: CART. IEEE Transactions on Biomedical Engineering, BME-33(12): 1076–1086, 1986. DOI 10.1109/TBME.1986.325684.
D. Hall and J. Llinas. An introduction to multisensor data fusion. Proceedings of the IEEE, 85: 6–23, 1997. DOI 10.1109/5.554205.
T. Hastie and R. Mazumder. softImpute: Matrix completion via iterative soft-thresholded SVD. 2021. URL https://CRAN.R-project.org/package=softImpute. R package version 1.4-1.
C. Hernandez-Ferrer, C. Ruiz-Arenas, A. Beltran-Gomila and J. R. González. MultiDataSet: An r package for encapsulating multiple data sets with application to omic data integration. BMC Bioinformatics, 18(1): 36, 2017. URL https://doi.org/10.1186/s12859-016-1455-1.
F. L. Hitchcock. The distribution of a product from several sources to numerous localities. Journal of mathematics and physics / Massachusetts Institute of Technology., 20: 224–230, 1941. URL https://doi.org/10.1002/sapm1941201224.
T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro and M. J. Van Der Laan. Survival ensembles. Biostatistics, 7(3): 355–373, 2005. URL https://doi.org/10.1093/biostatistics/kxj011.
T. Hothorn, K. Hornik and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3): 651–674, 2006. URL https://doi.org/10.1198/106186006X133933 .
J. Josse and F. Husson. missMDA: A package for handling missing values in multivariate data analysis. Journal of Statistical Software, 70(1): 1–31, 2016. DOI 10.18637/jss.v070.i01.
J. B. Kadane. Some statistical problems in merging data files. Journal of Official Statistics, 17: 423–433, 2001.
L. Kantorovich. On the transfer of masses. Doklady Akademii Nauk SSSR, 37: 7–8, 1942.
L. A. Klein. Sensor and data fusion: A tool for information assessment and decision making. SPIE press, 2004.
S. Kotz, N. Balakrishnan and N. Johnson. Continuous multivariate distributions: Models and applications, volume 1, second edition. John Wiley & Sons, 2000. DOI 10.1002/0471722065.
A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3): 18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.
F. Liese and K.-J. Miescke. Statistical decision theory. In Statistical decision theory: Estimation, testing, and selection, pages. 1–52 2008. New York, NY: Springer New York. ISBN 978-0-387-73194-0. URL https://doi.org/10.1007/978-0-387-73194-0_3.
R. J. A. Little and D. B. Rubin. Statistical analysis with missing data, third edition. Wiley, 2019.
A. Makhorin. GNU linear programming kit, reference manual. Free software foundation, 4: 2011.
I. Mayer, A. Sportisse, J. Josse, N. Tierney and N. Vialaneix. R-miss-tastic: A unified platform for missing values methods and workflows. 2019. URL https://arxiv.org/abs/1908.04822.
G. Monge. Mémoire sur la Théorie des Déblais et des Remblais. Histoire de l’Académie royale des sciences de Paris, 666–704, 1781.
B. Muzellec, J. Josse, C. Boyer and M. Cuturi. Missing data imputation using optimal transport. In ICML, 2020.
G. Paass. Statistical match: Evaluation of existing procedures and improvements by using additional information. Microanalytic Simulation Models to Support Social and Financial Policy, 401–420, 1986.
J. Pagès. Analyse factorielle multiple appliquée aux variables qualitatives et aux données mixtes. Revue de Statistique Appliquée, 50(4): 5–37, 2002. URL http://eudml.org/doc/106525.
S. Rässler. Frequentist theory of statistical matching. In Statistical matching: A frequentist theory, practical applications, and alternative bayesian approaches, pages. 15–43 2002. New York, NY: Springer New York. ISBN 978-1-4613-0053-3. URL https://doi.org/10.1007/978-1-4613-0053-3_2.
W. L. Rodgers. An evaluation of statistical matching. Journal of Business & Economic Statistics, 2(1): 91–102, 1984. URL http://www.jstor.org/stable/1391358 [online; last accessed November 29, 2022].
D. B. Rubin. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business & Economic Statistics, 4(1): 87–94, 1986. URL http://www.jstor.org/stable/1391390 [online; last accessed November 29, 2022].
M. Sariyar and A. Borg. The RecordLinkage package: Detecting errors in data. The R Journal, 2(2): 61–67, 2010.
M. Scanu. Recommendations on statistical matching, report WP2, ESS-net. Statistical Methodology Project on Integration of Surveys and Administrative Data, 2010.
D. Schuhmacher, B. Bähre, C. Gottschlich, V. Hartmann, F. Heinemann and B. Schmitzer. transport: Computation of optimal transport plans and wasserstein distances. 2022. URL https://cran.r-project.org/package=transport. R package version 0.13-0.
D. J. Stekhoven. missForest: Nonparametric missing value imputation using random forest. 2022. R package version 1.5.
D. J. Stekhoven and P. Bühlmann. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics (Oxford, England), 28: 112–8, 2012.
C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin and A. Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1): 307, 2008. URL https://doi.org/10.1186/1471-2105-9-307.
C. Strobl, A.-L. Boulesteix, A. Zeileis and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1): 25, 2007. URL https://doi.org/10.1186/1471-2105-8-25.
C. Strobl, T. Hothorn and A. Zeileis. Party on! The R Journal, 1(2): 14–17, 2009. URL https://doi.org/10.32614/RJ-2009-013.
S. Theußl, F. Schwendinger and K. Hornik. ROI: An extensible R optimization infrastructure. Journal of Statistical Software, 94(15): 1–64, 2020. DOI 10.18637/jss.v094.i15.
S. Theußl, F. Schwendinger and K. Hornik. ROI: The R optimization infrastructure package. 133. Vienna: WU Vienna University of Economics; Business. 2017. URL http://epub.wu.ac.at/5858/.
S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3): 1–67, 2011. DOI 10.18637/jss.v045.i03.
G. Van Rossum and F. L. Drake Jr. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.
B. Vantaggi. Statistical matching of multiple sources: A look through coherence. Int. J. Approx. Reasoning, 49: 701–711, 2008. DOI 10.1016/j.ijar.2008.07.005.
H. Wickham, J. Hester, W. Chang and J. Bryan. Devtools: Tools to make developing R packages easier. 2022. URL https://CRAN.R-project.org/package=devtools. R package version 2.4.5.
A. Zeileis and T. Hothorn. Model-based recursive partitioning. Journal of Computational and Graphical Statistics - J COMPUT GRAPH STAT, 17: 492–514, 2008. DOI 10.1198/106186008X319331.
Z. Zhu, T. Wang and R. J. Samworth. High-dimensional principal component analysis with heterogeneous missingness. 2019. URL https://arxiv.org/abs/1906.12125.


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Guernec, et al., "OTrecod: An R Package for Data Fusion using Optimal Transportation Theory", The R Journal, 2023

BibTeX citation

@article{RJ-2023-006,
  author = {Guernec, Gregory and Gares, Valerie and Omer, Jeremy and Saint-Pierre, Philippe and Savy, Nicolas},
  title = {OTrecod: An R Package for Data Fusion using Optimal Transportation Theory},
  journal = {The R Journal},
  year = {2023},
  note = {https://doi.org/10.32614/RJ-2023-006},
  doi = {10.32614/RJ-2023-006},
  volume = {14},
  issue = {4},
  issn = {2073-4859},
  pages = {195-222}
}