The advances of information technologies often confront users with large amounts of data that need to be integrated easily. In this context, creating a single database from multiple separate data sources can be an attractive but complex task, especially when the same information of interest is stored in at least two distinct encodings. In this situation, merging the data sources consists in finding a common recoding scale to complete the incomplete information in a synthetic database. The OTrecod package provides R users with two functions dedicated to solving this recoding problem using optimal transportation theory. Specific arguments of these functions enrich the algorithms by relaxing distributional constraints or adding a regularization term to make the data fusion more flexible. The OTrecod package also provides a set of support functions dedicated to the harmonization of separate data sources, the handling of incomplete information, and the selection of matching variables. This paper gives all the keys to quickly understand and master the original algorithms implemented in the OTrecod package, assisting the user step by step in their data fusion project.
The large amount of data produced by information technology requires flexible tools to facilitate its handling. Among them, the field of data fusion (Hall and Llinas 1997; Klein 2004; Castanedo 2013) also known as statistical matching (Adamek 1994; D’Orazio et al. 2006; Vantaggi 2008) aims to integrate the overall information from multiple data sources for a better understanding of the phenomena that interact in the population.
Assume that two heterogeneous databases A and B share a set of common variables X, while a variable of interest is encoded in two distinct scales: Y in A and Z in B. If Y and Z are never jointly observed, a basic data fusion objective consists in recoding Y in the same scale as Z (or conversely), to allow the fusion of the databases as illustrated in Table 1.
Providing a solution to this recoding problem is often very attractive because it gives access to more accurate and consistent information in a unique and bigger database, at no additional cost. Despite this, few tools are available: if we exclude the R data integration packages dedicated to genetics, like the MultiDataSet package (Hernandez-Ferrer et al. 2017), the OMICsPCA package (Das and Tripathy 2022), or the mixOmics package (Rohart et al. 2017), which are often only effective for quantitative data integration, the StatMatch package (D’Orazio 2022) is, to our knowledge, currently the only one that provides a concrete solution to the problem, using hot deck imputation procedures. The main reason for this relative deficiency is that this problem is, in fact, often assimilated to, and solved as, a missing data imputation problem. Following this idea, a very large number of works and reference books now exist about the handling of missing data (Little and Rubin 2019; Zhu et al. 2019). Moreover, several R packages can be enumerated and sorted by type of imputation method (Mayer et al. 2019): mice (van Buuren and Groothuis-Oudshoorn 2011) and missForest (Stekhoven and Bühlmann 2012; Stekhoven 2022), which use conditional models, and softImpute (Hastie and Mazumder 2021) and missMDA (Josse and Husson 2016), which apply low-rank based models. For all these packages, imputation performance can sometimes fluctuate a lot according to the structure and the proportion of non-response encountered. Regression and non-parametric imputation approaches (like the hot deck methods of the donor-based family) seem to use only partially, or not at all, the available information of the two databases to provide the individual predictions. Contrary to our approach, all these methods only use the set of shared variables X for the prediction of Z in A (or conversely of Y in B) without really taking into account Y and its interrelations with X in their prediction process.
The purpose of this paper is to present a new package for the R community called OTrecod, which provides simple and intuitive access to two original algorithms (Garès et al. 2020; Garès and Omer 2022) dedicated to solving these recoding problems in the data fusion context by considering them as applications of optimal transportation (OT) theory. This theory has already been applied in many areas: for example, the transport package (Schuhmacher et al. 2022) solves optimal transport problems in the field of image processing. A specific package called POT: Python Optimal Transport (Flamary et al. 2021) also exists in the Python software (Van Rossum and Drake Jr 1995) to solve optimization problems using optimal transportation theory in the fields of signal theory, image processing and domain adaptation. Nevertheless, none of these available tools was really adapted to our recoding problem, and the performances established by OT-based approaches for predicting missing information compared to more standard processes (Garès et al. 2020; Muzellec et al. 2020; Garès and Omer 2022) confirmed our decision.
In OTrecod, the first provided algorithm, called OUTCOME and integrated in the OT_outcome function, consists in finding a map that pushes the distribution of Y forward to the distribution of Z (Garès et al. 2020), while the second one, called JOINT and integrated in the OT_joint function, pushes the distribution of (Y,X) forward to the distribution of (Z,X). Consequently, by construction, these two algorithms take advantage of all the potential relationships between Y, Z, and X for the prediction of the incomplete information of Y and/or Z in A and B. Enrichments of these algorithms, described in (Cuturi 2013; Garès and Omer 2022), are also available via the optional arguments of these two functions. In its current version, the package accompanies these algorithms with original preparation (merge_dbs, select_pred) and validation (verif_OT) functions which are key steps of any standard statistical matching project and can be used independently of the OUTCOME and JOINT algorithms.
The optimal transportation (OT) problem was originally stated by Monge (1781) and consists in finding the cheapest way to transport a pile of sand to fill a hole. Formally, the problem can be written as follows.
Consider two (Radon) spaces \(\mathbb{X}\) and \(\mathbb{Y}\), \(\mu^X\) a probability measure on \(\mathbb{X}\), and \(\mu^Y\) a probability measure on \(\mathbb{Y}\) and \(c\) a Borel-measurable function from \(\mathbb{X} \times \mathbb{Y}\) to \(\left[0,\infty\right]\). The Kantorovich’s formulation of the optimal transportation problem (Kantorovich 1942) consists in finding a measure \(\gamma \in \Gamma(\mu^X,\mu^Y)\) that realizes the infimum:
\[\begin{equation} \inf\left\{\left. \int_{{\mathbb{X}} \times {\mathbb{Y}}} c(x, y) \, \mathrm{d} \gamma (x,y) \right| \gamma \in \Gamma(\mu^X,\mu^Y) \right\}, \tag{1} \end{equation}\]
where \(\Gamma(\mu^X,\mu^Y)\) is the set of measures on \(\mathbb{X} \times \mathbb{Y}\) with marginals \(\mu^X\) on \(\mathbb{X}\) and \(\mu^Y\) on \(\mathbb{Y}\).
This theory is applied here to solve a recoding problem of missing distributions in a data fusion context. To do so, we make use of Kantorovich’s formulation adapted to the discrete case, known as Hitchcock’s problem (Hitchcock 1941). Therefore, by construction, the proposed algorithms are usable for specific target variables only: categorical variables, ordinal or nominal, and discrete variables with a finite number of values.
Let \(A\) and \(B\) be two databases corresponding to two independent sets of subjects. We assume without loss of generality that the two databases have equal sizes, so that they can be written as \(A=\{i_1,\dots,i_{n}\}\) and \(B=\{j_1,\dots,j_{n}\}\). Let \(\big((X_i,Y_i,Z_i)\big)_{i\in A}\) and \(\big((X_j,Y_j,Z_j)\big)_{j\in B}\) be two sequences of i.i.d. discrete random variables with values in \(\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}\), where \(\mathcal{X}\) is a finite subset of \(\mathbb{R}^P\), and \(\mathcal{Y}\) and \(\mathcal{Z}\) are finite subsets of \(\mathbb{R}\). Variables \((X_i,Y_i,Z_i), i\in A\), are i.i.d copies of \((X^A,Y^A,Z^A)\) and \((X_j,Y_j,Z_j), j\in B\), are i.i.d copies of \((X^B,Y^B,Z^B)\). Moreover assume that \(\big\{(X_i,Y_i,Z_i),i\in A\big\}\) are independent of \(\big\{(X_j,Y_j,Z_j),j\in B\big\}\). The first version using the optimal transportation algorithm approach, described in (Garès et al. 2020), assumes that:
Assumption 1. \(Y^A\) and \(Z^A\) respectively follow the same distribution as \(Y^B\) and \(Z^B\).
Assumption 2. For all \(x \in \mathcal{X}\) the probability distributions of \(Y^A\) and \(Z^A\) given that \(X^A= x\) are respectively equal to those of \(Y^B\) and \(Z^B\) given that \(X^B= x\).
In this setting, the aim is to solve the recoding problem given by equation (1) that pushes \(\mu^{Y^A}\) forward to \(\mu^{Z^A}\). The variable \(\gamma\) of (1) is a discrete measure with marginals \(\mu^{Y^A}\) and \(\mu^{Z^A}\), represented by a \(|\mathcal{Y}| \times |\mathcal{Z}|\) matrix. The cost function denoted as \(c\) is a \(|\mathcal{Y}| \times |\mathcal{Z}|\) matrix, \((c_{y,z})_{y\in \mathcal{Y},z\in \mathcal{Z}}\). The goal is to identify:
\[\begin{equation} \gamma^*\in argmin_{\gamma \in \mathbb{R}_+^{|\mathcal{Y}| \times |\mathcal{Z}|}}\left\{\langle \gamma \; ,\; c \rangle : \gamma \mathbf{1}_{|\mathcal{Z}|} = \mu^{Y^A},\gamma^T \mathbf{1}_{|\mathcal{Y}|} = \mu^{Z^A}\right\}, \tag{2} \end{equation}\]
where \(\langle \cdot \;,\; \cdot \rangle\) is the dot product, \(\mathbf{1}\) is a vector of ones with appropriate dimension and \(M^T\) is the transpose of matrix \(M\). The cost function considered by Garès et al. (2020), \(c_{y,z}\), measures the average distance between the profiles of shared variables of \(A\) satisfying \(Y=y\) and subjects of \(B\) satisfying \(Z=z\), that is:
\[\begin{equation} c_{y,z} = \mathbb{E} \left[d(X^A,X^B) \mid Y^A= y, Z^B= z\right], \tag{3} \end{equation}\]
where \(d\) is a given distance function to choose on \(\mathcal{X} \times \mathcal{X}\).
In practice, the above problem cannot be solved directly, since the distributions of \(X^A\), \(X^B\), \(Y^A\) and \(Z^A\) are unknown and \(Y\) and \(Z\) are never jointly observed. As a consequence, the following unbiased empirical estimators are used: \(\hat{\mu}^{X^A}_n\) of \({\mu}^{X^A}\) and \(\hat{\mu}^{X^B}_n\) of \({\mu}^{X^B}\). Because \(Y\) and \(Z\) are only available in \(A\) and \(B\) respectively, two distinct empirical estimators have to be defined:
\[\begin{equation} \begin{aligned} \hat{\mu}^{Y^A}_{n,y} & = & \frac{1}{n}\sum_{i\in A} ~\mathbf{1}_{\{Y_i = y\}},\: \forall y\in\mathcal{Y},\\ \hat{\mu}^{Z^A}_{n,z} & = & \frac{1}{n}\sum_{j\in B} ~\mathbf{1}_{\{Z_j = z\}},\: \forall z\in \mathcal{Z}, \end{aligned} \tag{4} \end{equation}\]
where \(\mathbf{1}_{\{Y = y\}}=1\) if \(Y=y\) and \(0\) otherwise. Assumption 1 gives \({\mu}^{Z^A}={\mu}^{Z^B}\), so that \(\hat{\mu}^{Z^A}_{n,z}\), although computed from the units of \(B\), is also an unbiased estimator of \({\mu}^{Z^A}_z\). Finally, denoting by \[\kappa_{n,y,z}\equiv \sum_{i\in A} \sum_{j\in B}~ \mathbf{1}_{\left\{Y_i=y,Z_j=z\right\}},\]
the number of pairs \((i,j)\in A\times B\) such that \(Y_i=y\) and \(Z_j=z\), the cost matrix \(c\) is estimated by:
\[\begin{equation} \hat{c}_{n,y,z}=\left\{ \begin{array}{ll} \frac{1}{\kappa_{n,y,z}}\sum_{i\in A} \sum_{j\in B}~ \mathbf{1}_{\left\{Y_i=y,Z_j=z\right\}} \times d(X_i,X_j), & \: \forall y\in \mathcal{Y}, z\in\mathcal{Z}:\kappa_{n,y,z}\neq 0,\\ 0, & \:\forall y \in \mathcal{Y}, z \in \mathcal{Z}:\kappa_{n,y,z} = 0. \end{array}\right. \tag{5} \end{equation}\]
Plugging the values observed for these estimators into (2) yields a linear programming model denoted:
\[\begin{equation} \hat{\mathcal{P}}^0_n: \left\{ \begin{aligned} \min\: & <\widehat{c}_n,\gamma>\\ \text{s.t.}\:& \sum_{z\in \mathcal{Z}} \gamma_{y,z} = \hat{\mu}^{Y^A}_{n,y}, \:\forall y\in \mathcal{Y}, \\ & \sum_{y\in \mathcal{Y}} \gamma_{y,z} = \hat{\mu}^{Z^A}_{n,z}, \:\forall z\in \mathcal{Z}, \\ & \gamma_{y,z} \geq 0, \: \forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{6} \end{equation}\]
The solution \(\hat{\gamma}_n\) can then be interpreted as an estimator \(\hat{\mu}^{(Y^A,Z^A)}_n\) of the joint distribution of \(Y^A\) and \(Z^A\), \(\mu^{(Y^A,Z^A)}\). While this estimate is necessary, it is nevertheless insufficient here to provide the individual predictions of \(Z\) in \(A\). These predictions are obtained in a second step using a nearest neighbor algorithm, from which we deduce an estimation of \(\mu^{Z^A\mid X^A=x,Y^A=y}\) (see Garès and Omer (2022) for details). In the remainder, the overall algorithm described in this section is referred to as OUTCOME. To address the drawbacks of this algorithm described in Garès et al. (2020), algorithms derived from OUTCOME have been developed (Garès and Omer 2022); they are described in the following part.
Using the same notations, Garès and Omer (2022) propose to search for an optimal transportation map between the two joint distributions of \((X^A,Y^A)\) and \((X^A,Z^A)\), with marginals \(\mu^{(X^A,Y^A)}\) and \(\mu^{(X^A,Z^A)}\) respectively. Under Kantorovich’s formulation in a discrete setting, they search for: \[\gamma^*\in \operatorname{argmin}_{\gamma\in \mathcal{D}} <c,\gamma>,\] where \(c\) is a given cost matrix and \(\mathcal{D}\) is the set of joint distributions with marginals \(\mu^{(X^A,Y^A)}\) and \(\mu^{(X^A,Z^A)}\). It is natural to see any element \(\gamma\in \mathcal{D}\) as the vector of joint probabilities \(\mathbb{P}((X^A=x,Y^A=y),(X^A=x',Z^A=z))\) for any \((x,x')\in\mathcal{X}^2\), \(y\in\mathcal{Y}\) and \(z\in\mathcal{Z}\). Since this probability is null for all \(x\neq x'\), \(\gamma\in \mathcal{D}\) is defined as a vector of \(\mathbb{R}^{\left\lvert \mathcal{X} \right\rvert\times \left\lvert \mathcal{Y} \right\rvert \times\left\lvert\mathcal{Z} \right\rvert }\), where \(\gamma_{x,y,z}\) stands for an estimation of the joint probability \(\mathbb{P}(X^A=x,Y^A=y,Z^A=z)\). These notations lead to the more detailed model:
\[\begin{equation} \mathcal{P}:\left\{ \begin{aligned} \min\: & <c,\gamma> \\ \text{s.t.}\:& \sum_{z\in Z} \gamma_{x,y,z} = \mu^{(X^A,Y^A)}_{x,y}, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y},\\ & \sum_{y\in Y} \gamma_{x,y,z} = \mu^{(X^A,Z^A)}_{x,z}, \:\forall x\in\mathcal{X}, \forall z\in \mathcal{Z},\\ & \gamma_{x,y,z} \geq 0, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{7} \end{equation}\]
The above problem can be solved only if the marginals \(\mu^{(X^A,Y^A)}\) and \(\mu^{(X^A,Z^A)}\) are known but, based on assumption 2, unbiased estimators \(\hat{\mu}^{(X^A,Y^A)}_n\) and \(\hat{\mu}^{(X^A,Z^A)}_n\) can be built as in the previous subsection. For the first one, this gives:
\[\begin{equation} \begin{aligned} \hat{\mu}^{(X^A,Y^A)}_{n,x,y} & = & \frac{1}{n}\sum_{i\in A} ~\mathbf{1}_{\left\{X_i = x,Y_i = y\right\}}, \:\forall x\in\mathcal{X}, \: \forall y\in\mathcal{Y}. \end{aligned} \tag{8} \end{equation}\]
The cost matrix used is the one introduced for the OUTCOME algorithm in (3) and estimated by (5). Formally, we can write:
\[\begin{equation} c_{x,y,z} = c_{y,z}, \:\forall x\in\mathcal{X},\forall y\in\mathcal{Y},\forall z\in\mathcal{Z}, \tag{9} \end{equation}\]
which does not depend on the value of \(x\).
Plugging the values observed for these estimators into (7) yields a linear programming model denoted as \(\widehat{\mathcal{P}}_n\). In contrast to OUTCOME, the algorithm that consists in solving \(\widehat{\mathcal{P}}_n\) to solve the recoding problem is referred to as JOINT in what follows.
An estimation of the distribution of \(Z^A\) given the values of \(X^A\) and \(Y^A\) is then given by:
\[\begin{equation} \tilde{\mu}^{Z^A\mid X^A=x,Y^A=y}_{n,z}= \left\{\begin{aligned} &\frac{\hat{\gamma}_{n,x,y,z}}{\hat{\mu}^{(X^A,Y^A)}_{n,x,y}}, & \:\forall x\in\mathcal{X},y\in\mathcal{Y},z\in\mathcal{Z}: \hat{\mu}^{(X^A,Y^A)}_{n,x,y}\neq 0,\\ &0, & \:\forall x\in\mathcal{X},y\in\mathcal{Y},z\in\mathcal{Z}: \hat{\mu}^{(X^A,Y^A)}_{n,x,y}= 0. \end{aligned}\right. \tag{10} \end{equation}\]
and an individual prediction of \(Z^A\) is then deduced using the maximum a posteriori rule: \[ \widehat{z}_i^A= \operatorname{argmax}_{z\in\mathcal{Z}} \tilde{\mu}^{Z^A\mid X^A=x_i,Y^A=y_i}_{n,z}. \]
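As an illustration of (10) and of this rule, here is a minimal R sketch (the objects gamma_hat, mu_XY and z_levels are hypothetical and not part of the package):

# gamma_hat: |X| x |Y| x |Z| array estimating the joint distribution of (X, Y, Z) in A
# mu_XY:     |X| x |Y| matrix of empirical frequencies of (X, Y) in A
# z_levels:  labels of the levels of Z
map_predict_Z <- function(gamma_hat, mu_XY, x, y, z_levels) {
  if (mu_XY[x, y] == 0) return(NA)            # case mu_hat_{n,x,y} = 0 in (10)
  cond <- gamma_hat[x, y, ] / mu_XY[x, y]     # estimate of P(Z = z | X = x, Y = y)
  z_levels[which.max(cond)]                   # maximum a posteriori prediction
}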
Due to potential errors in the estimations of \(\mathcal{P}\), the constraints of \(\widehat{\mathcal{P}}_n\) may deviate from the true values of the marginals of \(\mu^{(X^A,Y^A,Z^A)}\). To deal with this situation, small violations of the constraints of \(\widehat{\mathcal{P}}_n\) are allowed by enriching the initial algorithm as described in Garès and Omer (2022).
The equality constraints of \(\widehat{\mathcal{P}}_n\) are then relaxed as follows:
\[\begin{align} & \sum_{z\in Z} \gamma_{x,y,z} = \hat{\mu}^{(X^A,Y^A)}_{n,x,y} + e^{X,Y}_{x,y}, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y} \tag{11}\\ & \sum_{y\in Y} \gamma_{x,y,z} = \tilde{\mu}^{(X^A,Z^A)}_{n,x,z} + e^{X,Z}_{x,z}, \:\forall x\in\mathcal{X}, \forall z\in \mathcal{Z} \tag{12}\\ & \sum_{x\in \mathcal{X},y\in \mathcal{Y}} e^{X,Y}_{x,y} = 0,\: \sum_{x\in \mathcal{X},z\in \mathcal{Z}} e^{X,Z}_{x,z} = 0 \tag{13}\\ & -e^{X,Y,+}_{x,y}\leq e^{X,Y}_{x,y} \leq e^{X,Y,+}_{x,y}, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y} \tag{14}\\ & -e^{X,Z,+}_{x,z}\leq e^{X,Z}_{x,z} \leq e^{X,Z,+}_{x,z}, \:\forall x\in\mathcal{X}, \forall z\in \mathcal{Z} \tag{15}\\ & \sum_{x\in \mathcal{X},y\in \mathcal{Y}} e^{X,Y,+}_{x,y} \leq \alpha_n,\: \sum_{x\in \mathcal{X},z\in \mathcal{Z}} e^{X,Z,+}_{x,z} \leq \alpha_n. \tag{16} \end{align}\]
This relaxation is achieved by introducing the extra variables \(e^{X,Y,+}\) and \(e^{X,Z,+}\) through the additional constraints (14)–(15). Garès and Omer (2022) suggest setting \(\alpha_n:=\frac{\alpha}{\sqrt{n}}\) in (16), where \(\alpha\) is a parameter to calibrate numerically; a default value of \(0.4\) is also proposed.
A regularization term, weighted by a parameter \(\lambda\) and based on the conditional distributions \(\left(\frac{\gamma_{x,y,z}}{\hat{\mu}^{X^A}_{n,x}}\right)_{x\in\mathcal{X},y\in\mathcal{Y},z\in\mathcal{Z}}\), can also be added to improve the regularity of the variations of the conditional distribution \(\mu^{(Y^A,Z^A)\mid X^A=x}\) with respect to \(x\). The corresponding regularized algorithm is:
\[\begin{equation} \widehat{\mathcal{P}}^R_n: \left\{ \begin{aligned} \min\: & <\widehat{c}_n,\gamma> + \lambda \sum_{(x_i,x_j)\in E_\mathcal{X}} w_{i,j}\sum_{y\in\mathcal{Y},z\in\mathcal{Z}} r^+_{i,j,y,z}\\ &\text{s.t.}\: \text{constraints } \text{(11)--(16)} \\ & \frac{\gamma_{x_i,y,z}}{\hat{\mu}^{X^A}_{n,x_i}}-\frac{\gamma_{x_j,y,z}}{\hat{\mu}^{X^A}_{n,x_j}} \leq r^+_{i,j,y,z}, \:\forall \{x_i,x_j\}\in E_\mathcal{X}, y\in\mathcal{Y}, z\in \mathcal{Z}\\ & \frac{\gamma_{x_i,y,z}}{\hat{\mu}^{X^A}_{n,x_i}}-\frac{\gamma_{x_j,y,z}}{\hat{\mu}^{X^A}_{n,x_j}} \geq -r^+_{i,j,y,z}, \:\forall \{x_i,x_j\}\in E_\mathcal{X}, y\in\mathcal{Y}, z\in \mathcal{Z}\\ & \gamma_{x,y,z} \geq 0, \:\forall x\in\mathcal{X}, \forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{17} \end{equation}\]
The constant \(\lambda\in\mathbb{R}^+\) is a regularization parameter to be calibrated numerically (\(0.1\) can be considered as the default value) and \(E_\mathcal{X}\subset \mathcal{X}^2\) includes the pairs of elements of \(\mathcal{X}\) defined as neighbors: \(\left\{x_i, x_j\right\}\in E_\mathcal{X}\) if \(x_j\) is among the k nearest neighbors of \(x_i\) for some parameter \(k \geq 1\). The method that computes a solution to the recoding problem with regularization and relaxation is called R-JOINT.
In the same way, a relaxation of assumption 1 is also proposed and added to the OUTCOME algorithm: the resulting method is denoted R-OUTCOME and corresponds to the following program:
\[\begin{equation} \widehat{\mathcal{P}}^{0-R}_n: \left\{ \begin{aligned} \min\: & <\widehat{c}_n,\gamma>\\ & \sum_{z\in \mathcal{Z}} \gamma_{y,z} = \hat{\mu}^{Y^A}_{n,y} + e^{Y}_{y}, \:\forall y\in \mathcal{Y}\\ & \sum_{y\in \mathcal{Y}} \gamma_{y,z} = \tilde{\mu}^{Z^A}_{n,z} + e^{Z}_{z}, \:\forall z\in \mathcal{Z}\\ & \sum_{y\in \mathcal{Y}} e^{Y}_{y} = 0,\: \sum_{z\in \mathcal{Z}} e^{Z}_z = 0\\ & -e^{Y,+}_{y}\leq e^{Y}_{y} \leq e^{Y,+}_{y}, \:\forall y\in \mathcal{Y}\\ & -e^{Z,+}_{z}\leq e^{Z}_{z} \leq e^{Z,+}_{z}, \:\forall z\in \mathcal{Z}\\ & \sum_{y\in \mathcal{Y}} e^{Y,+}_{y} \leq \alpha_n,\: \sum_{z\in \mathcal{Z}} e^{Z,+}_{z} \leq \alpha_n\\ & \gamma_{y,z} \geq 0, \:\forall y\in \mathcal{Y}, \forall z\in \mathcal{Z}. \end{aligned}\right. \tag{18} \end{equation}\]
Note that the algorithms JOINT and R-JOINT do not require assumption 1. The relaxation in R-OUTCOME alleviates its dependence on the satisfaction of this assumption. However, the algorithms JOINT and R-JOINT require that \(X\) is a set of discrete variables (factors, ordered or not, are obviously allowed), while the absence of \(X\) in the linear programs of OUTCOME and R-OUTCOME allows \(X\) to be a set of discrete and/or continuous variables. In this case, the nature of the variables \(X\) needs to be considered when choosing the distance \(d\).
The OTrecod package can be installed from the Comprehensive R Archive Network (CRAN) with the following command:
install.packages("OTrecod")
The development version of OTrecod is also available and can be installed directly from GitHub by loading the devtools package (Wickham et al. 2022) and running install_github("otrecoding/OTrecod").
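For example:

# install.packages("devtools")    # if devtools is not already installed
library(devtools)
install_github("otrecoding/OTrecod")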
The two types of optimal transportation algorithms previously introduced (OUTCOME and JOINT) and their respective enrichments (R-OUTCOME and R-JOINT) are available in the OTrecod package via two core functions called OT_outcome and OT_joint. Details about their implementation in R are described in the following section. In this package, these recoding algorithms are seen as fundamental steps of data fusion projects, which also often require adapted preparation and validation functions. Table 2 introduces the main functions proposed by OTrecod to handle a data fusion project.
R Function | Description |
---|---|
Pre-process functions | |
merge_dbs | Harmonization of the data sources |
select_pred | Selection of matching variables |
Functions of data fusion | |
OT_outcome | Data fusion with OT theory using the OUTCOME or R-OUTCOME algorithms |
OT_joint | Data fusion with OT theory using the JOINT or R-JOINT algorithms |
Post-process function | |
verif_OT | Quality assessment of the data fusion |
All the intermediate functions integrated in the OT_outcome and OT_joint functions (proxim_dist, avg_dist_closest, indiv_grp_closest, indiv_grp_optimal), and their related documentation, are included and can be used separately in the package. They have been kept available to give users great flexibility, as have other potentially useful functions such as power_set, which returns the power set of a set; this function was not previously available in R and could be of interest to specialists of algebra. These functions are not described here but are detailed in the pdf manual of the package.
The recoding functions OT_outcome and OT_joint require a data.frame with a specific structure as input argument. As described in Table 3, it must be the result of two overlayed databases made of at least four variables:

- a column identifying the database (DB);
- the target variable of the first database (Y), missing in the second one;
- the target variable of the second database (Z), missing in the first one;
- at least one shared variable (X).

Continuous shared variables are not allowed in OT_joint; therefore, these variables must be transformed beforehand.

DB | Y | Z | X_1 | X_2 | X_3 |
---|---|---|---|---|---|
1 | (600-800] | NA | M | Yes | 50 |
1 | (600-800] | NA | M | No | 32 |
1 | [200-600] | NA | W | No | 31 |
2 | NA | G1 | M | No | 47 |
2 | NA | G3 | W | Yes | 43 |
2 | NA | G2 | W | No | 23 |
2 | NA | G4 | M | Yes | 22 |
2 | NA | G2 | W | Yes | 47 |
As additional examples, users can also refer to the databases simu_data and tab_test provided in the package with the expected structure. Note that no class object is expected here as input argument of these functions, to allow users to work freely with or without the pre-process functions provided in the package.
The OTrecod package uses the ROI optimization infrastructure (Theußl et al. 2017) to solve the optimization problems related to the OUTCOME and JOINT algorithms. The solver GLPK (the GNU Linear Programming Kit (Makhorin 2011)) is the default solver currently integrated in the OT_outcome and OT_joint functions for handling linear problems with linear constraints. The ROI infrastructure makes it easy for users to switch solvers for comparison purposes. In many situations, some of them can noticeably reduce the running time of the functions.
For example, the solver Clp (Forrest et al. 2004), for COIN-OR Linear Programming, known to be particularly convenient in linear and quadratic situations, can easily be installed by users via the related plug-in available in ROI (searchable with the instruction ROI_available_solvers()), by following the few instructions detailed in (Theußl et al. 2020) or via the dedicated website.
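To make the link with the linear program (6) concrete, the small sketch below solves a toy instance of it directly with ROI and its GLPK plug-in (the cost matrix and the marginal distributions are arbitrary illustrative values, not data handled by the package); switching solvers only amounts to changing the solver argument of ROI_solve() once the corresponding plug-in is installed.

library(ROI)
library(ROI.plugin.glpk)

nY <- 3; nZ <- 4                                   # |Y| = 3 levels, |Z| = 4 levels
cost <- matrix(c(0.2, 0.7, 0.9,
                 0.4, 0.3, 0.8,
                 0.6, 0.2, 0.5,
                 0.9, 0.6, 0.1), nrow = nY)        # toy estimated cost matrix c
muY <- c(0.25, 0.45, 0.30)                         # empirical distribution of Y in A
muZ <- c(0.20, 0.30, 0.25, 0.25)                   # empirical distribution of Z in B

# gamma is vectorized column by column: constraint matrices for its row and column sums
A_row <- t(sapply(1:nY, function(y) as.numeric(rep(1:nY, times = nZ) == y)))
A_col <- t(sapply(1:nZ, function(z) as.numeric(rep(1:nZ, each  = nY) == z)))

lp <- OP(objective   = L_objective(as.vector(cost)),
         constraints = L_constraint(rbind(A_row, A_col),
                                    dir = rep("==", nY + nZ),
                                    rhs = c(muY, muZ)),
         maximum = FALSE)                          # gamma >= 0 is the default bound

sol <- ROI_solve(lp, solver = "glpk")
gamma_hat <- matrix(solution(sol), nrow = nY)      # estimated joint distribution of (Y, Z)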
In California (United States), from 1999 to 2018, the Public Schools Accountability Act (PSAA) required the California Department of Education (CDE) to provide annually the results of an Academic Performance Index (API), which established a ranking of the best public schools of the state.
This numeric score, an indicator of school performance levels, could vary from 200 to 1000, and the performance objective to reach for each school was 800. Information related to the 418 schools (identified by cds) of Nevada County (county 29) and to the 362 schools of San Benito County (county 35) was collected in two databases, api29 and api35 respectively, both available in the package. The distributions of all the variables in the two databases are obtained with the following R commands:
library(OTrecod)
data(api29); data(api35)
summary(api29) #--------------------------------------------------
cds apicl_2000 stype awards
Length:418 [200-600] : 93 E:300 No : 98
Class :character (600-800] :180 M: 68 Yes :303
Mode :character (800-1000]:145 H: 50 NA's: 17
acs.core api.stu acs.k3.20 grad.sch
Min. :16.00 Min. : 108.0 <=20 :190 0 : 86
1st Qu.:26.00 1st Qu.: 336.2 >20 :108 1-10:180
Median :30.00 Median : 447.5 unknown:120 >10 :152
Mean :31.97 Mean : 577.8
3rd Qu.:39.00 3rd Qu.: 641.5
Max. :50.00 Max. :2352.0
ell mobility meals full
[0-10] :153 [0-20] :362 [0-25] :200 1: 85
(10-30] : 83 (20-100]: 56 (25-50] : 56 2:333
(30-50] : 60 (50-75] :100
(50-100]: 93 (75-100]: 62
NA's : 29
summary(api35) #--------------------------------------------------
cds apicl_1999 stype awards acs.core
Length:362 G1:91 E:257 No :111 Min. :16.00
Class :character G2:90 M: 67 Yes :237 1st Qu.:25.00
Mode :character G3:90 H: 38 NA's: 14 Median :30.00
G4:91 Mean :31.81
3rd Qu.:39.00
Max. :50.00
api.stu acs.k3.20 grad.sch ell mobility
Min. : 102.0 <=20 :227 0 : 50 [0-10] :164 1:213
1st Qu.: 363.2 >20 : 30 1-10:241 (10-30] : 99 2:149
Median : 460.0 unknown:105 >10 : 71 (30-50] : 64
Mean : 577.2 (50-100]: 10
3rd Qu.: 624.0 NA's : 25
Max. :2460.0
meals full
[0-25] : 77 1:244
(25-50] : 92 2:118
(50-75] :122
(75-100]: 71
The two databases seem to share a majority of variables (same labels, same encodings) of different types; ideally, inferential analyses could therefore be considered by combining all the information from the two databases to study the effects of social factors on the API score in 2000.
Nevertheless, while this target variable, called apicl_2000, is correctly stored in the database api29 and encoded as an ordered factor with three clearly defined levels ([200-600], (600-800] and (800-1000]), the only information related to the API score in the database api35 is the variable apicl_1999, the API score collected in 1999, encoded in four unknown balanced classes (G1, G2, G3 and G4). As no school is common to the two counties, we easily deduce that these two variables have never been jointly observed.
By choosing these two variables as outcomes (also called target variables), the objective of the following examples is to create a synthetic database where the missing information related to the API score in 2000 is fully completed in api35, illustrating the use of the main functions of the package.
merge_dbs
The function merge_dbs
is an optional pre-process function dedicated to data fusion projects that merges two
raw databases by detecting possible discrepancies among the respective sets of variables from one database to
another. The current discrepancy situations detected by the function follow specific rules described below:
- shared variables having different types from one database to the other are discarded, and their labels are stored in the output object REMOVE1;
- shared factors having different levels from one database to the other are discarded, and their labels are stored in the output object REMOVE2.

These situations sometimes require reconciliation actions which are not supported by the current version of the merge_dbs function. Therefore, when reconciliation actions are required, they will have to be treated by the user a posteriori and outside the function.
Applied to our example and according to the introduced rules, the first step of our data fusion project consists in studying the coherence between the variables of the databases api29 and api35 using the merge_dbs function.
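A call of the following form produces the step1 object used in the remainder of this section (a sketch reconstructed from the argument descriptions given below; the column indexes passed to ordinal_DB1 and ordinal_DB2 are only illustrative):

step1 = merge_dbs(DB1 = api29, DB2 = api35,
                  NAME_Y = "apicl_2000", NAME_Z = "apicl_1999",
                  row_ID1 = 1, row_ID2 = 1,
                  ordinal_DB1 = c(2, 3),   # indexes of ordinal variables in api29
                  ordinal_DB2 = c(2, 3))   # indexes of ordinal variables in api35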
As input, the raw databases must be declared separately in the DB1 and DB2 arguments, and the names of the related target variables of each database must be specified via the NAME_Y and NAME_Z arguments for DB1 and DB2 respectively. In presence of row identifiers, the respective column indexes of each database must be set in the arguments row_ID1 and row_ID2. The arguments ordinal_DB1 and ordinal_DB2 list the column indexes of all the variables defined as ordinal in the two databases (including the indexes of the target variables if necessary). Here, apicl_2000 is clearly an ordinal variable and, by default, we suppose that the unknown encoding related to apicl_1999 is also ordinal: the corresponding indexes (2 and 3) are therefore added in these two arguments.
After running, the function informs users that no row was dropped from the databases during the merging because each target variable is fully completed in its respective database. Nine potential predictors are kept, while only one variable is discarded because of discrepancies between the databases; its identity is stored in output, informing the user about the nature of the problem: the mobility factor has different levels from one database to the other.
summary(step1)
Length Class Mode
DB_READY 12 data.frame list
ID1_drop 0 -none- character
ID2_drop 0 -none- character
Y_LEVELS 3 -none- character
Z_LEVELS 4 -none- character
REMOVE1 0 -none- NULL
REMOVE2 1 -none- character
REMAINING_VAR 9 -none- character
IMPUTE_TYPE 1 -none- character
DB1_raw 12 data.frame list
DB2_raw 12 data.frame list
SEED 1 -none- numeric
step1$REMOVE1 # List of removed variables because of type's problem
NULL
step1$REMOVE2 # List of removed factors because of levels' problem
[1] "mobility"
levels(api29$mobility) # Verification
[1] "[0-20]"   "(20-100]"
levels(api35$mobility)
[1] "1" "2"
step1$REMAINING_VAR
[1] "acs.core" "acs.k3.20" "api.stu" "awards" "ell"
[6] "full" "grad.sch" "meals" "stype"
The function returns a list object, notably DB_READY, a data.frame whose structure corresponds to the expected structure introduced in the previous subsection: a unique database, here the result of the superimposition of api29 on api35, where the first column (DB) corresponds to the database identifier (1 for api29 and 2 for api35), and the second and third columns (Y and Z respectively) correspond to the target variables of the two databases. Missing values are automatically assigned by the function to the unknown part of each target variable: in Y when the identifier equals 2, in Z when the identifier equals 1. The subset of shared variables whose information is homogeneous between the databases is now stored from the fourth column to the last one. Their identities are available in the output object called REMAINING_VAR.
The merge_dbs function can handle incomplete shared variables through the impute argument. This option allows keeping the missing information unchanged (the default choice, and also the case in this example), working with complete cases only, performing fast multivariate imputation by chained equations (the function integrates the main functionalities of the mice function of the mice package), or imputing data using the dimensionality reduction method introduced in the missMDA package (see the pdf manual for details).
select_pred
In data fusion projects, the selection of shared variables (also called matching variables) appears as an essential step for two main reasons. First, the number of shared variables X between the two databases can be large (much more than three variables), and keeping variables that only add pointless variability could strongly reduce the quality of the fusion. Second, this selection greatly influences the quality of the predictions regardless of the matching technique chosen a posteriori (Adamek 1994). The specific context of data fusion is subject to the following constraints:
These particularities imply that the investigations have to be done separately but in the same manner in both databases. In this way, a predictor which appears highly correlated to both Y and Z will be automatically selected for the fusion. On the contrary, predictors whose effects on Y and Z are not obvious will be discarded. Additional recommended rules also emerge from the literature:
The function of the package dedicated to this task is select_pred. From the DB_READY database generated in the previous subsection, studying the outputs related to each database, produced by the following R commands, assists users in selecting the best predictors:
# For the dataset api29 --------------
step2a = select_pred(step1$DB_READY,
Y = "Y", Z = "Z", ID = 1, OUT = "Y",
quanti = c(4,6), nominal = c(1,5,7),
ordinal = c(2,3,8:12), thresh_cat = 0.50,
thresh_num = 0.70, RF_SEED = 3011)
# For the dataset api35 --------------
step2b = select_pred(step1$DB_READY,
Y = "Y", Z = "Z", ID = 1, OUT = "Z",
quanti = c(4,6), nominal = c(1,5,7),
ordinal = c(2,3,8:12), thresh_cat = 0.50,
thresh_num = 0.70, RF_SEED = 3011)
The quanti, nominal, and ordinal arguments require vectors of column indexes related to the type of each variable of the input database. The ID argument specifies the column index of the database identifier. Y and Z expect the respective names of the target variables, while OUT specifies the target variable to predict (Y or Z).
To detect the subset of best predictors, select_pred studies the standard pairwise associations between each shared variable and the outcomes Y and Z, taken separately (by only varying the argument OUT), according to its type: for numeric and/or ordered factor variables, select_pred calculates the associations using rank correlation coefficients (Spearman), while the Cramér’s V criterion (Bergsma 2013) is used for categorical variables and unordered factors. The related ranking tables of top-scoring predictors are available in two distinct output objects: cor_OUTC_num and vcrm_OUTC_cat. In our example, the corresponding results, for each database, are:
### ASSOCIATIONS BETWEEN TARGET VARIABLES AND SHARED VARIABLES
## Results for the api29 dataset -----
step2a$cor_OUTC_num # Y versus numeric or ordinal predictors
name1 name2 RANK_COR pv_COR_test N
7 Y meals -0.8030 0.0000 418
4 Y ell -0.7514 0.0000 389
5 Y full 0.3919 0.0000 418
6 Y grad.sch 0.3346 0.0000 418
3 Y api.stu -0.1520 0.0018 418
8 Y stype -0.1520 0.0018 418
2 Y acs.core -0.0556 0.2566 418
step2a$vcrm_OUTC_cat # Y versus nominal or ordinal predictors
name1 name2 V_Cramer CorrV_Cramer N
7 Y meals 0.6871 0.6835 418
4 Y ell 0.6015 0.5966 389
5 Y full 0.4508 0.4459 418
6 Y grad.sch 0.3869 0.3816 418
3 Y awards 0.2188 0.2073 401
2 Y acs.k3.20 0.1514 0.1349 418
8 Y stype 0.1370 0.1185 418
## Results for the api35 dataset -----
step2b$cor_OUTC_num # Z versus numeric or ordinal predictors
name1 name2 RANK_COR pv_COR_test N
4 Z ell -0.7511 0.0000 337
7 Z meals -0.7291 0.0000 362
6 Z grad.sch 0.6229 0.0000 362
5 Z full 0.3997 0.0000 362
3 Z api.stu -0.0563 0.2851 362
8 Z stype 0.0359 0.4959 362
2 Z acs.core 0.0119 0.8219 362
step2b$vcrm_OUTC_cat # Z versus nominal or ordinal predictors
name1 name2 V_Cramer CorrV_Cramer N
7 Z meals 0.5181 0.5121 362
6 Z grad.sch 0.5131 0.5063 362
4 Z ell 0.4775 0.4701 337
5 Z full 0.4033 0.3934 362
3 Z awards 0.2040 0.1818 348
8 Z stype 0.1320 0.0957 362
2 Z acs.k3.20 0.1223 0.0817 362
The first two tables, related to api29, highlight the strong associations between Y and the variables meals, ell, full and grad.sch, in this order of importance, while the summary tables related to api35 highlight the strong associations between Z and the variables meals, grad.sch, ell and full.
It is not unusual to observe that one or more predictors are in fact linear combinations of others. In supervised learning areas, these risks of collinearity increase with the number of predictors and must be detected beforehand in order to keep only the most parsimonious subset of predictors for the fusion. To detect collinearity situations, the result of a Farrar and Glauber (FG) test is provided (Farrar and Glauber 1967). This test is based on the determinant of the rank correlation matrix of the shared variables D (Spearman), and the corresponding test statistic is given by:
\[ S_{FG} = - \left(n-1-\frac{1}{6} (2k+5)\right) \ln(\det(D)) \]
where \(n\) is the number of rows and \(k\) is the number of covariates. Under the null hypothesis, \(S_{FG}\) follows a chi-square distribution with \(k(k-1)/2\) degrees of freedom, and its acceptance indicates an absence of collinearity. In presence of a large number of numeric covariates and/or ordered factors, the approximate Farrar-Glauber test, based on the normal approximation of the null distribution (Kotz et al. 2000), can be more suitable, and the test statistic becomes:
\[
\sqrt{2S_{FG}} - \sqrt{2k-1}
\]
Users will note that these two tests can often appear highly sensitive, in the sense that they tend to easily conclude in favor of multicollinearity. It is therefore suggested to consider these results as indicators of collinearity between predictors rather than as a strict acceptability condition. The results related to this test are stored in the FG_test object of the select_pred output.
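As an illustration of the two statistics above, here is a minimal sketch (not part of the package) that computes them from a Spearman correlation matrix D and the corresponding sample size n:

fg_test <- function(D, n) {
  k      <- ncol(D)                               # number of covariates
  S_FG   <- -(n - 1 - (2 * k + 5) / 6) * log(det(D))
  df     <- k * (k - 1) / 2                       # degrees of freedom of the chi-square
  p_chi2 <- pchisq(S_FG, df = df, lower.tail = FALSE)
  S_appr <- sqrt(2 * S_FG) - sqrt(2 * k - 1)      # normal approximation given above
  c(S_FG = S_FG, p_value = p_chi2, S_approx = S_appr)
}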
In presence of collinearity, select_pred tests the pairwise associations between all the potential predictors according to their types (Spearman or Cramér’s V). The thresh_num argument fixes the threshold beyond which two ranked predictors are considered as redundant, while thresh_cat fixes the threshold of acceptability for the Cramér’s V criterion in the subgroup of factors. In output, the results stored in the collinear_PB object permit the identification of the corresponding variables. In our example, we observe:
### DETECTION OF REDUNDANT PREDICTORS
## Results for the api29 dataset -----
## Results of the Farrar-Glauber test
step2a$FG_test
DET_X pv_FG_test pv_FG_test_appr
0.04913067 0.00000000 0.00000000
## Identity of the redundant predictors
step2a$collinear_PB
$VCRAM
name1 name2 V_Cramer CorrV_Cramer N
71 acs.k3.20 stype 0.6988 0.6971 418
43 ell meals 0.6696 0.6664 389
34 full meals 0.5215 0.5152 418
$SPEARM
name1 name2 RANK_COR pv_COR_test N
43 ell meals 0.9047 0 389
62 api.stu stype 0.7069 0 418
#### Results for the api35 dataset -----
step2b$FG_test # Significant result
DET_X pv_FG_test pv_FG_test_appr
0.1479339 0.0000000 0.0000000
step2b$collinear_PB
$VCRAM
name1 name2 V_Cramer CorrV_Cramer N
71 acs.k3.20 stype 0.6977 0.6957 362
$SPEARM
[1] name1 name2 RANK_COR pv_COR_test N
<0 rows> (or 0-length row.names)
The FG test warns the user against the risks of collinearity between predictors, and the function notably detects strong collinearities between the variables meals, ell and full in api29 (less strong trends in api35): information that has to be taken into account during the selection. The subset of predictors finally kept for the data fusion must be small to improve its quality. When the initial number of shared variables is not too large, as in this example, choosing the best candidates among groups of redundant predictors can be done manually by selecting the highest-ranked predictors in the summary tables previously described. In this way, the variable meals could be preferred to ell and full, while the variable stype could be dropped. Consequently, a possible final list of predictors could be composed of the variables meals and grad.sch only.
When the number of predictors is large, or when users prefer that an automatic process performs the variable selection, a random forest procedure can also be used via the select_pred function. Random forest approaches (Breiman 2001) are particularly convenient here (Grajski et al. 1986) for multiple reasons: they work well when the number of variables exceeds the number of rows, whatever the types of covariates; they can deal with non-linearity and correlated predictors; they handle ordered or unordered outcomes; and they rank good candidate predictors through an inbuilt criterion, the variable importance measure.
In a few words, a random forest procedure aggregates many CART models (Breiman et al. 1984) fitted on RF_ntree bootstrap samples from the raw data source; averaging the accuracies of these models reduces the related variances and also the errors of prediction. A standard random forest procedure provides two distinct measures of the importance of each variable for the prediction of the outcome, the Gini importance criterion and the permutation importance criterion, which depend on the appearance frequency of the predictor but also on the place it takes up in each tree. For more details about random forest theory, users can consult Breiman (2001) and/or the pdf manual of the randomForest package (Liaw and Wiener 2002).
Strobl et al. (2009) suggest that the permutation importance criterion, which works with permuted samples (subsampling without replacement) instead of bootstrap ones, is particularly convenient with uncorrelated predictors, but must be replaced by a conditional permutation measurement in presence of strong correlations. select_pred provides these assessments by integrating the main functionalities of the cforest and varimp functions of the party package (Hothorn et al. 2005, 2006; Strobl et al. 2007, 2008; Zeileis and Hothorn 2008). However, these measurements must be used with caution, accounting for the following constraints:
Possible scenarios | Correlation between predictors | State of the RF_condi argument | Incomplete information |
---|---|---|---|
Same type predictors | NO | FALSE | Allowed |
Same type predictors | YES | TRUE | Not allowed |
Different type predictors | NO | FALSE | Allowed |
Different type predictors | YES | TRUE | Not allowed |
The select_pred function allows proceeding with different scenarios according to the type of predictors (Table 4 can help users to choose). The first one consists in boiling down to a set of categorical variables (ordered or not) by categorizing all the continuous predictors with the dedicated arguments (convert_num and convert_clss), and working with the conditional importance assessments that directly provide unbiased estimations (by setting the RF_condi argument to TRUE). Users can consult (Hothorn et al. 2006) for more details about the approach, and the pdf manual of the package for details about the related arguments of the select_pred function.
This approach does not take incomplete information into account, so the method will be applied to complete cases only (incomplete rows are temporarily removed from the study). It is nevertheless possible to impute missing data beforehand by using dedicated pre-existing R packages like mice (van Buuren and Groothuis-Oudshoorn 2011) or by using the imput_cov function provided in the OTrecod package.
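As an illustration of this first scenario, a hedged sketch of the corresponding call could look as follows (the number of classes passed to convert_clss and the explicit RF = TRUE are assumptions for illustration; see the pdf manual for the exact behavior of these arguments):

step2c = select_pred(step1$DB_READY,
                     Y = "Y", Z = "Z", ID = 1, OUT = "Y",
                     quanti = c(4,6), nominal = c(1,5,7),
                     ordinal = c(2,3,8:12),
                     convert_num = c(4,6), convert_clss = 4,  # categorize continuous predictors
                     RF = TRUE, RF_condi = TRUE, RF_SEED = 3011)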
The second possible scenario (always usable in presence of mixed-type predictors) consists in executing a standard random forest procedure after taking care to rule out collinearity issues by first selecting unique candidates among the potentially redundant predictors (in this case, the discarded predictors are stored in the drop_var output object). This is the default approach used by select_pred as soon as the RF_condi argument is set to FALSE while RF is set to TRUE. This scenario can work in presence of incomplete predictors. By construction, the results of random forest procedures remain dependent on the successive random draws carried out for the constitution of the trees, and it is therefore suggested to check their stability by testing different random seeds (RF_SEED argument) before concluding.
The R commands previously written automatically provide the results related to this second approach as additional results. The results from the two datasets show here the permutation importance estimates of each variable, ranked in order of importance and expressed as percentages, after resolving the collinearity problems:
step2a$RF_PRED # For the api29 dataset
meals stype grad.sch awards acs.core
71.5529 11.6282 11.1181 5.4674 0.2334
step2b$RF_PRED # For the api35 dataset
meals ell grad.sch full stype api.stu acs.core
35.9974 28.3051 22.2094 5.2143 4.0192 1.8965 1.5643
awards
0.7937
The results confirm that the variable meals appears as the best predictor of the target variables in api29 and api35 respectively. The variable ell is not present in the first list (see RF_PRED from step2a) because the variables meals and ell have been detected as strongly collinear (according to the chosen threshold), so select_pred keeps the best choice between the two predictors: meals (the reasoning is the same for full, which disappeared from the list).
The ell variable has been kept in the second list (not found collinear enough to be removed here) and moreover appears as a good predictor of apicl_1999 in api35. Nevertheless, its potential collinearity problem with meals encourages us not to keep it for the rest of the study. According to this discussion, we finally keep meals, stype and grad.sch, which combine the advantages of being good predictors for the two target variables while not presenting major problems of collinearity between them.
The following synthetic database (called here bdd_ex) is now ready for the prediction of the missing API scores in api29, api35 or both, using the OT_outcome or OT_joint functions:
DB Y Z grad.sch meals stype
2850 1 (600-800] <NA> >10 [0-25] H
2851 1 (600-800] <NA> >10 [0-25] H
2852 1 [200-600] <NA> 1-10 (75-100] H
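A minimal sketch of how such a database can be assembled from the harmonized database (assuming the column names of step1$DB_READY match those displayed above):

# keep the database identifier, the two target variables and the three selected predictors
bdd_ex = step1$DB_READY[, c("DB", "Y", "Z", "grad.sch", "meals", "stype")]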
OT_outcome
The OT_outcome function provides individual predictions of Z in A (and/or Y in B) by considering the recoding problem as an optimal transportation problem between outcomes. In a first step, the aim of the function is to determine \(\gamma\) from (2); this estimate is then used in a second step to provide the predictions. The main input arguments of the function and the output values are respectively described in Tables 5 and 6.
Argument | OT_outcome | OT_joint | Description (default value) |
---|---|---|---|
datab | X | X | Data.frame in the expected structure |
index_DB_Y_Z | X | X | Indexes of the ID, Y, and Z columns (1,2,3) |
nominal | X | X | Column indexes of nominal variables (*) |
ordinal | X | X | Column indexes of ordinal variables (*) |
logic | X | X | Column indexes of boolean variables (*) |
quanti | X | X | Column indexes of quantitative variables (*) |
convert.num | X | X | Column indexes of the quantitative variables to convert (* or =quanti in OT_joint) |
convert.clss | X | X | Corresponding numbers of desired classes for conversion (*) |
which.DB | X | X | Specify the target variables to complete: both or only one (BOTH) |
solvR | X | X | Choice of the solver to solve the optimization problem (glpk) |
dist.choice | X | X | Distance function (Euclidean). See Table 7 |
percent.knn | X | X | Ratio of closest neighbors involved in the computations (1) |
indiv.method | X | | Type of individual predictions (sequential) for OUTCOME and R-OUTCOME algorithms |
maxrelax | X | X | Adding of a relaxation parameter (0) |
lambda.reg | | X | Adding of a regularization parameter (0) |
The arguments datab, index_DB_Y_Z, quanti, nominal, ordinal, and logic are not optional and must be carefully completed before each execution. A unique synthetic data.frame made of two overlayed databases (called A and B for example, see Table 3) is expected as the datab argument. If this data.frame corresponds to the output object DB_USED or DB_READY of the select_pred or merge_dbs function respectively, then it has the required structure. Otherwise, users must make sure that their data.frame is made up of two overlayed databases with at least 4 variables, as described in the subsection Optimal transportation of outcomes applied to data recoding. The order of the variables has no importance in the raw database but will have to be specified a posteriori in the index_DB_Y_Z argument if necessary.
Value | OT_outcome | OT_joint | Description |
---|---|---|---|
time_exe | X | X | Running time of the algorithm |
gamma_A | X | X | Estimation of the joint distribution of (Y, Z) for the prediction of Z in A (*) |
gamma_B | X | X | Estimation of the joint distribution of (Y, Z) for the prediction of Y in B (*) |
profile | X | X | The list of detected profiles of covariates |
res.prox | X | X | A list that provides all the information related to the estimated proximities between profiles and groups of profiles |
estimator_ZA | X | X | Estimates of the probability distribution of Z conditional to X and Y in database A (*) |
estimator_YB | X | X | Estimates of the probability distribution of Y conditional to X and Z in database B (*) |
DATA1_OT | X | X | The database A fully completed (if required in input by the which.DB argument) |
DATA2_OT | X | X | The database B fully completed (if required in input by the which.DB argument) |
The subset of remaining predictors used for the fusion may require prior transformations according to the distance function chosen in input by the user. This process is fixed by setting the dist.choice argument. The distances currently implemented in the function are: the standard Euclidean ("E") and Manhattan ("M") distances, the Hamming distance ("H", for binary covariates), and the Gower distance ("G", sometimes preferred with mixed variables). The automatic transformations applied prior to the use of each distance function are summarized in Table 7.
Variable transformations | Euclidean | Manhattan | Gower | Hamming |
---|---|---|---|---|
Continuous | Standardized | Standardized | No | Not allowed |
Boolean | Binary | Binary | No | Binary |
Nominal | Disjunctive T | Disjunctive T | No | Disjunctive T |
Ordinal | Discrete | Discrete | No | Disjunctive T |
Incomplete information | Allowed* | Allowed* | Allowed* | Allowed* |
dist.choice argument | E | M | G | H |
The first version of the OT algorithm described in Garès et al. (2020) was tested on numeric coordinates extracted from a factor analysis of mixed data (FAMD) fitted on the mixed raw covariates (Pagès 2002). This transformation is available here by setting the FAMD.coord argument to "YES". The minimal percentage of information explained by the FAMD can also be fixed using the FAMD.perc argument. The OT_outcome function proposes four types of models for the prediction of Y in B and/or Z in A, according to the values of the indiv.method and maxrelax arguments:
- When maxrelax = 0 and indiv.method = "sequential" (default options), the fitted model corresponds to the OUTCOME algorithm described by \(\hat{\mathcal{P}}^0_n\) in equation (6). Assuming that \(Y\) and \(Z\) in \(A\) follow the same distribution as \(Y\) and \(Z\) in \(B\) respectively (assumption 1), this algorithm derives the joint distribution of \(Y\) and \(Z\) in \(A\) (respectively in \(B\)) in a first step and uses, in a second step, a nearest neighbor procedure to predict the missing values of \(Z\) in \(A\) (resp. \(Y\) in \(B\)). This algorithm calculates averaged distances between each subject from \(A\) (resp. \(B\)) and the subgroups of subjects from \(B\) (resp. \(A\)) having the same levels of \(Z\) in \(B\) (resp. \(Y\) in \(A\)). These calculations can be done using all the subjects of each subgroup (by default, percent.knn = 1) or only restricted parts of them (percent.knn < 1).
- When maxrelax > 0 and indiv.method = "sequential", assumption 1 is alleviated by relaxing the constraints on the marginal distributions, and an R-OUTCOME algorithm (with relaxation) is fitted. In this case, the maxrelax argument corresponds to the \(\alpha_n\) parameter in (18).
- When maxrelax = 0 and indiv.method = "optimal", the second step of the original algorithm (nearest neighbor procedure) is replaced by a linear optimization problem: the individual predictions of \(Z\) in \(A\) (resp. \(Y\) in \(B\)) are searched for by minimizing the total of the distances between each individual of \(A\) and the individuals of each level of \(Z\) in \(B\).
- When maxrelax > 0 and indiv.method = "optimal", the constraints on the marginal distributions of the previous model are also relaxed and the function fits an R-OUTCOME algorithm with relaxation. For these three last situations, the corresponding R-OUTCOME algorithm is described by \(\hat{\mathcal{P}}^{0-R}_n\) in (18).
When maxrelax > 0, it is recommended to calibrate the maxrelax parameter by testing different values according to the stability of the individual predictions. In output, the gamma_A and gamma_B matrices correspond to the estimates of the joint distributions \(\gamma\) of \((Y^A,Z^A)\) and \((Y^B,Z^B)\) respectively (2). Moreover, a posteriori estimates of the conditional distributions \((Z^A|Y^A, X^A)\) and \((Y^B|Z^B,X^B)\) (10) are also provided in two lists called estimatorZA and estimatorYB respectively, and the completed databases \(A\) and \(B\) are stored in the DATA1_OT and DATA2_OT objects. In particular, the individual predictions are stored in the OTpred column of these data.frames. Moreover, the profile object is a data.frame that stores the profiles of covariates encountered in the two data sources, while the res_prox object stores all the distance computations that will be used in the validation step (users can consult the details of the proxim_dist function in the pdf manual).
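As an example of such a calibration, a sketch (using the bdd_ex database built above; the grid of values is arbitrary) could compare the individual predictions obtained for several relaxation values:

relax_grid <- c(0, 0.2, 0.4, 0.6)                  # arbitrary candidate values
preds <- lapply(relax_grid, function(a)
  OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
             dist.choice = "E", indiv.method = "optimal",
             maxrelax = a, which.DB = "B")$DATA2_OT$OTpred)
# agreement rate between predictions obtained with consecutive relaxation values
mapply(function(p1, p2) mean(p1 == p2), preds[-length(preds)], preds[-1])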
The computation of the conditional probabilities implies defining which individuals are considered as neighbors of each encountered profile of covariates. The argument prox.dist fixes this threshold for each profile, following the decision rule for \(A\) (and respectively for \(B\)): a subject \(i\) of \(A\) (or a statistical unit corresponding to a row of \(A\)) will be considered as a neighbor of a given profile of shared variables \(C\) as soon as:
\[
\text{dist}(subject_i,C) < \text{prox.dist} \times \text{max}_{k=1,\dots , n_A} \text{dist}(subject_k,C)
\]
When the number of shared variables is small and all of them are categorical (optimal situation), it is suggested to set the prox.dist argument to \(0\). Finally, using the which.DB argument, users can choose to estimate the individual predictions of \(Y\) and \(Z\) in the two databases (which.DB = "BOTH") or in only one of them (which.DB = "A" for the predictions of \(Z\) in \(A\), or which.DB = "B" for the predictions of \(Y\) in \(B\)).
From the data.frame bdd_ex built previously, we use the OT_outcome function to illustrate the prediction of the target variable apicl_2000 in the api35 dataset via an OUTCOME algorithm, using the Euclidean distance function:
outc1 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
dist.choice = "E", indiv.method = "sequential",
which.DB = "B")
#---------------------------------------
# OT PROCEDURE in progress ...
#---------------------------------------
# Type = OUTCOME
# Distance = Euclidean
# Percent closest knn = 100%
# Relaxation parameter = NO
# Relaxation value = 0
# Individual pred process = Sequential
# DB imputed = B
#---------------------------------------
In bdd_ex, all the variables except the first one (DB, the database identifier) are or can be considered as ordinal factors, and the quanti and ordinal arguments are filled in accordingly. For the prediction of apicl_2000 in the api35 dataset (the steps would be the same for the prediction of apicl_1999 in api29 by setting which.DB = "A" or "BOTH"), the optimal transportation theory determines a map \(\gamma\) that pushes the distribution of apicl_1999 forward to the distribution of apicl_2000 in the database api35. In this case, \(\gamma^B\) is an estimator of the joint distribution of apicl_2000 and apicl_1999 in api35. The estimation of \(\gamma^B\) is available in the gamma_B output object, while all the profiles of predictors met in the two databases are listed in the profile object:
outc1$gamma_B
G1 G2 G3 G4
[200-600] 0.22248804 0.0000000 0.00000000 0.0000000
(600-800] 0.02889318 0.2486188 0.15311005 0.0000000
(800-1000] 0.00000000 0.0000000 0.09550874 0.2513812
outc1$profile[1,] # the first profile
ID grad.sch meals stype
2850 P_1 3 1 3
The output object estimatorYB is a list that gives the estimates of the conditional probabilities of apicl_2000 in the api35 dataset for a given profile of predictors and a given level of apicl_1999. For example, the conditional probabilities related to the first profile of predictors are:
outc1$estimatorYB[1,,] # conditional probabilities (1st profile)
[,1] [,2] [,3]
G1 0.3333333 0.3333333 0.3333333
G2 0.3333333 0.3333333 0.3333333
G3 0.0000000 0.0000000 1.0000000
G4 0.0000000 0.0000000 1.0000000
According to these results, we can conclude that, for a subject from api35 with the profile of predictors P_1 and whose level of apicl_1999 is 'G1', the probability for apicl_2000 to be predicted '[200-600]' is about \(0.33\). Finally, the individual predictions of apicl_2000
in api35
are available in the DATA2_OT
object corresponding to the OTpred
column:
head(outc1$DATA2_OT,3) # The 1st 3 rows only
DB Y Z grad.sch meals stype OTpred
3895 2 <NA> G1 2 4 1 [200-600]
3896 2 <NA> G3 2 3 1 (600-800]
3897 2 <NA> G2 2 3 2 (600-800]
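For users who want a completed version of the raw dataset, a minimal sketch (assuming, as the row names above suggest, that DATA2_OT keeps the original row names of api35) could be:

## Hedged sketch: append the predicted coding to api35, matching on row names
api35$apicl_2000_pred <- outc1$DATA2_OT[rownames(api35), "OTpred"]
head(api35$apicl_2000_pred, 3)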
The OUTCOME algorithm uses here a nearest neighbor procedure to assign the individual predictions from the estimation of \(\gamma\), which can be a drawback as described in Garès et al. (2020). The R-OUTCOME algorithm proposes an enrichment that directly assigns the individual predictions of apicl_2000 from the estimation of \(\gamma\), without using the nearest neighbor approach. In this case, a linear optimization problem is solved to determine the individual predictions that minimize the sum of the individual distances between the subjects of api35 and the modalities of the target apicl_2000 observed in api29. The corresponding R command is:
### R-OUTCOME algorithm: optimal assignments + relaxation parameter = 0
R_outc3 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
dist.choice = "E" , indiv.method = "optimal",
which.DB = "B")
Moreover, the OUTCOME algorithm assumes that apicl_2000 is identically distributed in api29 and api35 (assumption 1 from the subsection Optimal transportation of outcomes applied to data recoding), which can be a strong hypothesis to hold in many situations. To overcome this constraint, the R-OUTCOME algorithm also allows relaxing assumption 1 by adding a relaxation parameter (18) to the optimization system. This relaxation is done by varying the maxrelax argument of the function:
### R-OUTCOME algorithm: optimal assignments + relaxation parameter = 0.4
R_outc4 = OT_outcome(bdd_ex, quanti = 1, ordinal = 2:6,
dist.choice = "E",
indiv.method = "optimal",
maxrelax = 0.4, which.DB = "B")
The running times of these two models can take a few minutes. The quality of these models will be compared in Tables 8, 9 and 10 (subsection Validation of the data fusion using verif_OT).
OT_joint
The OT_joint function provides individual predictions of \(Z\) in \(A\) (and/or \(Y\) in \(B\)) by considering the recoding problem as an optimal transportation problem of covariates and outcomes, which pushes the conditional distribution of \((Y^A|X^A)\) forward to the conditional distribution of \((Z^A|X^A)\) (and conversely \((Z^B|X^B)\) forward to the conditional distribution of \((Y^B|X^B)\)). The aim is to determine \(\gamma\) from (7). Because the joint distributions of outcomes and covariates are now mapped together (which was not the case with the OUTCOME family of algorithms), the target variables are no longer required to be identically distributed (assumption 1).
The call to OT_joint was designed to have globally the same structure as that of the OT_outcome function, as described in Tables 5 and 6, with a few differences described below. The JOINT and R-JOINT algorithms are usable via OT_joint for solving the recoding problem, depending on the values of the maxrelax and lambda.reg arguments:

- When maxrelax = 0 and lambda.reg = 0 (default values), the fitted model corresponds to the JOINT algorithm described by \(\hat{\mathcal{P}}_n\) in equation (7).
- As soon as maxrelax > 0 and/or lambda.reg > 0, the R-JOINT algorithm is called. Using R-JOINT, it is possible to relax the constraints on the marginal distributions (maxrelax > 0) and/or to add a more or less strong L1 regularization term among the constraints of the algorithm by filling the lambda.reg argument. The maxrelax parameter corresponds to the parameter \(\alpha_n\) in (16) and the lambda.reg parameter corresponds to the parameter \(\lambda\) in (17). When maxrelax > 0 and/or lambda.reg > 0, it is recommended to calibrate these two parameters by testing many different values and studying the stability of the related individual predictions (a minimal calibration sketch is given after this paragraph). Nevertheless, by default or as a starting point, Garès et al. (2020) suggest the use of default values determined from simulated databases: \(0.1\) for the regularization parameter (lambda.reg) and \(0.4\) for the relaxation one (maxrelax).

Finally, the output objects are similar to those previously described for the OT_outcome function.
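As an illustration of this calibration step, here is a minimal sketch; the grid of values and the agreement criterion are arbitrary illustrative choices, not package recommendations, and refitting the models may take some time:

## Hedged calibration sketch: agreement of R-JOINT predictions with the
## unrelaxed JOINT model for a small grid of maxrelax values
ref <- as.character(OT_joint(bdd_ex, nominal = 1, ordinal = 2:6,
                             dist.choice = "E", which.DB = "B")$DATA2_OT$OTpred)
relax_grid <- c(0.2, 0.4, 0.6, 0.8)
agree <- sapply(relax_grid, function(r) {
  fit <- OT_joint(bdd_ex, nominal = 1, ordinal = 2:6,
                  dist.choice = "E", maxrelax = r, which.DB = "B")
  mean(as.character(fit$DATA2_OT$OTpred) == ref)
})
data.frame(maxrelax = relax_grid, agreement = round(agree, 3))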
The implementation of the function is very similar to that of OT_outcome
, and applied to our example, the R
code related to a JOINT
algorithm is:
outj1 = OT_joint(bdd_ex, nominal = 1, ordinal = c(2:6),
dist.choice = "E" , which.DB = "B")
#---------------------------------------
# OT JOINT PROCEDURE in progress ...
#---------------------------------------
# Type = JOINT
# Distance = Euclidean
# Percent closest = 100%
# Relaxation term = 0
# Regularization term = 0
# Aggregation tol cov = 0.3
# DB imputed = B
#---------------------------------------
### Extract individual predictions from the OTpred column
head(outj1$DATA2_OT,3) # The 1st 3 rows only
DB Y Z grad.sch meals stype OTpred
3895 2 <NA> G1 2 4 1 [200-600]
3896 2 <NA> G3 2 3 1 (600-800]
3897 2 <NA> G2 2 3 2 (600-800]
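A quick, hedged way to compare these JOINT predictions with those of the OUTCOME model obtained earlier (both objects are assumed to be in the workspace) is to compute their proportion of identical assignments, as done more systematically in Table 11:

## Proportion of identical individual predictions between outc1 and outj1
mean(as.character(outc1$DATA2_OT$OTpred) == as.character(outj1$DATA2_OT$OTpred))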
To relax the constraints on the marginal distributions, we use the maxrelax argument, which corresponds to varying the \(\alpha\) parameter of an R-JOINT algorithm (see (17)), and a regularization parameter \(\lambda\) can simply be added to the previous model as follows:
### R-JOINT algorithm (relaxation parameter = 0.4)
R_outj1 = OT_joint(bdd_ex, nominal = 1, ordinal = c(2:6),
dist.choice = "E", maxrelax = 0.4,
which.DB = "B")
### R-JOINT algorithm (relaxation parameter = 0.4,
### & regularization parameter = 0.1)
R_outj4 = OT_joint(bdd_ex, nominal = 1, ordinal = c(2:6),
dist.choice = "E", maxrelax = 0.4,
lambda.reg = 0.1,
which.DB = "B")
These models are compared in the next subsection.
verif_OT
Assessing the quality of individual predictions obtained through statistical matching techniques can be a complex task, notably because it is legitimate to wonder about the true reliability of a joint distribution estimated from two variables which are never jointly observed (Rodgers 1984; Kadane 2001). When the study aims to identify estimators of interest that improve the understanding of the relationships between variables in the different databases (called the macro approach in D’Orazio et al. (2006)) without going as far as the creation of a complete and unique dataset, uncertainty analyses are usually proposed to estimate the sensitivity of the assessments (Rässler 2002). On the contrary, if the objective is to create a synthetic dataset where the information is available for every unit, the use of auxiliary information or proxy variables must be privileged to assess the quality of the results (Paass 1986). In the absence of complementary information, Rubin (1986) suggests studying how well the relationships between the distributions of the target variables are preserved. In this way, the verif_OT function proposes tools to assess the quality of the individual predictions provided by the algorithms previously introduced.
The first expected input argument of verif_OT is an 'otres' object from the OT_outcome or OT_joint functions. In our example, this function is first applied to the outc1 model with the following R command:
### Quality criteria for outc1 (OUTCOME model)
verif_outc1 = verif_OT(outc1, group.class = TRUE, ordinal = FALSE,
stab.prob = TRUE, min.neigb = 5)
### First results related to outc1:
verif_outc1$nb.profil
[1] 27
verif_outc1$res.prox
N V_cram rank_cor
362.000 0.860 0.892
In output, the nb.profil value gives the number of profiles of predictors globally detected in the two databases \((nb = 27)\). In the example, a profile is characterized by a combination of values of the three predictors kept: grad.sch, meals and stype.
The first step of the function is dedicated to the study of the proximity between the two target variables. Standard criteria (Cramer’s V and Spearman’s rank correlation coefficient) are therefore used to evaluate the pairwise association between \(Y\) and \(Z\), globally or in only one of the two databases, according to the predictions provided by the output of OT_outcome or OT_joint. Only the database api35 was completed in the example (because which.DB = "B" in the call to OT_outcome) and stored in the DATA2_OT object; the criteria therefore compare here the proximity between the predicted values of apicl_2000 (Y) and the observed values of apicl_1999 (Z) in the database api35 (B). Observing independence or only a weak correlation between \(Y^A\) and \(Z^A\) (or \(Y^B\) and \(Z^B\)) must raise questions about the reliability of the predictions, especially when \(Y\) and \(Z\) are supposed to summarize the same information. In the example, whatever the criterion used, we notice via the res.prox object a strong association between the two target variables in api35, which is reassuring about the consistency of the predictions.
The related confusion matrix between the predicted values of apicl_2000 (Y) and the observed values of apicl_1999 (Z) is stored in the conf.mat object.
Second, the function proposes an optional tool (by setting group.class = TRUE) which evaluates the impact of grouping levels of one factor on the association between \(Y\) and \(Z\). When users initially have no information about one of the two encodings of interest, this approach can be particularly useful to distinguish ordinal encodings from nominal ones, and its principle is as follows. Assume that \(Y \in \mathcal{Y}\) and \(Z \in \mathcal{Z}\) (where \(\mathcal{Y}\) and \(\mathcal{Z}\) are the respective sets of levels) and that \(|\mathcal{Y}| \geq |\mathcal{Z}|\). From \(Y\), successive new variables \(Y' \in \mathcal{Y}'\) are built, as soon as \(\mathcal{Y}'\) is a partition of \(\mathcal{Y}\) such that \(|\mathcal{Y}'| = |\mathcal{Z}|\) (and conversely for \(Z\) if \(Z\) has the greater number of levels). The related associations between \(Z\) and \(Y'\) (which now have equal numbers of levels) are then studied using the Cramer’s V, the rank correlation, the Kappa coefficient and the confusion matrix, and the results are stored in a table called res.grp. The corresponding outputs of the example are:
verif_outc1$conf.mat
Z
predY G1 G2 G3 G4 Sum
[200-600] 81 1 1 1 84
(600-800] 10 89 55 1 155
(800-1000] 0 0 34 89 123
Sum 91 90 90 91 362
verif_outc1$res.grp
grp levels Z to Y error_rate Kappa Vcramer RankCor
4 G1/G2 G3/G4 13.3 0.794 0.83 0.877
6 G1/G2/G3 G4 19.1 0.714 0.78 0.839
2 G1 G3/G2/G4 28.2 0.593 0.68 0.624
1 G1 G2/G3/G4 37.6 0.457 0.64 0.813
3 G1 G4/G2/G3 43.4 0.374 0.59 0.115
5 G1/G2 G4/G3 43.4 0.326 0.64 0.574
It appears from these results that grouping the levels G2
and G3
of apicl_1999
(Z
) strongly improves the association of this variable with apicl_2000
(Y
) (the error rate of the confusion matrix varies from \(43.4\) to \(13.3\)). Moreover the structure of conf.mat
confirms that the encoding of apicl_1999
seems to be ordinal.
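To see mechanically what res.grp does, here is a hedged sketch that redoes one of the groupings by hand, using the completed database from outc1 and a standard Cramer's V computation (the grouping of G2 and G3 is the one highlighted above; the package may use a slightly different estimator):

## Hedged sketch: merge levels G2 and G3 of apicl_1999 (Z) and recompute a
## Cramer's V with the predictions of apicl_2000 stored in OTpred
pred <- outc1$DATA2_OT$OTpred
Zgrp <- as.character(outc1$DATA2_OT$Z)
Zgrp[Zgrp %in% c("G2", "G3")] <- "G2/G3"
tab  <- table(pred, factor(Zgrp))
chi2 <- suppressWarnings(chisq.test(tab)$statistic)
as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))   # Cramer's V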
The third step of the quality process integrated in the function corresponds to the study of the Hellinger distance (Liese and Miescke 2008). This distance function is used as a measure of the discrepancies between the observed and predicted distributions of Y (\(\mathcal{L}(Y^A)\) versus \(\mathcal{L}(\widehat{Y}^B)\)) and/or (\(\mathcal{L}(\widehat{Z}^A)\) versus \(\mathcal{L}(Z^B)\)). For \(Y\) and \(Z\), the definition of the distance is respectively:
\[ \text{dist}_{\text{hell}} (Y^A,\widehat{Y}^B)= \sqrt{\frac{1}{2} \sum_{y\in \mathcal{Y}}\left(\sqrt{\mu_{n,y}^{Y^A}}- \sqrt{\mu_{n,y}^{\widehat{Y}^B}}\right)^2 } \]
and
\[ \text{dist}_{\text{hell}} (Z^B,\widehat{Z}^A)= \sqrt{\frac{1}{2} \sum_{z\in \mathcal{Z}}\left(\sqrt{\mu_{n,z}^{Z^B}}- \sqrt{\mu_{n,z}^{\widehat{Z}^A}}\right)^2 }, \]
where \(\mu_{n,y}^{\widehat{Y}^B}\) and \(\mu_{n,z}^{\widehat{Z}^A}\) correspond to the empirical estimators of \(\mu^{\widehat{Y}^B}\) and \(\mu^{\widehat{Z}^A}\) respectively.
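These quantities are easy to reproduce by hand; a minimal sketch with toy frequency vectors (illustrative values only, not taken from the example) is:

## Hedged sketch: Hellinger distance between two discrete distributions
hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
p <- c(0.25, 0.50, 0.25)   # toy distribution of Y in A
q <- c(0.20, 0.55, 0.25)   # toy distribution of the predictions of Y in B
hellinger(p, q)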
The Hellinger distance varies between 0 (identical distributions) and 1 (strong dissimilarities), and 0.05 can be used as an acceptable threshold below which two distributions can be considered as similar. When OUTCOME models are applied, this criterion makes it possible to check that the predictions respect assumption 1. It can also be used to determine the best relaxation parameter of an R-OUTCOME model. On the contrary, there is no need to interpret this criterion when an R-JOINT model is applied because, in this case, assumption 1 is not required. The results related to this criterion are stored in the hell object:
verif_outc1$hell
YA_YB ZA_ZB
Hellinger dist. 0.008 NA
With a Hellinger distance equal to \(0.008\), assumption 1 holds here, but it remains interesting to test other algorithms to try to improve the current one, notably by adding relaxation and/or regularization parameters.
Finally, the verif_OT function uses the mean and standard deviation of the conditional probabilities \(\mathbb{P}(Z=\hat{z}_i|Y=y_i,X=x_i)\) estimated by each model as indicators of the stability of the individual predictions (provided that stab.prob = TRUE). It is nevertheless possible that some conditional probabilities are computed from too few individuals (according to the frequency of each profile of shared variables met) to be considered as reliable estimates. To avoid this problem, trimmed means and standard deviations are suggested, removing these specific probabilities from the computation via the min.neigb parameter. In output, the results related to this last study are stored in the res.stab object:
verif_outc1$eff.neig
Nb.neighbor Nb.Prob
1 1 14
2 2 18
3 3 18
4 4 28
5 5 20
6 6 6
7 7 14
8 8 16
9 9 27
10 10 20
verif_outc1$res.stab
N min.N mean sd
2nd DB 284 5 0.968 0.122
The first result shows that \(14\) individual predictions among \(362\) (the total number of rows in api35) have been assigned to subjects that exhibit a unique combination of predictors and outcome. It would obviously be unreasonable to draw conclusions about the reliability of the predictions from these subjects alone. We therefore decide to fix here a threshold of \(5\) below which it is difficult to extract any information related to the prediction. From the remaining ones, we simply compute the average (\(0.968\)), which can be interpreted as follows: when the fitted model is confronted with the same profile of predictors and the same level of apicl_1999, it will return the same individual prediction for apicl_2000 more than \(96\) times out of \(100\).
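The trimming rule behind min.neigb can be mimicked by hand; the sketch below is a rough approximation, assuming that the number of neighbors of a prediction is the number of subjects of api35 sharing the same profile of predictors and the same observed level of apicl_1999:

## Hedged sketch of the min.neigb trimming: count predictions supported by at
## least 5 subjects sharing the same predictors and the same observed level of Z
db   <- outc1$DATA2_OT
comb <- interaction(db$grad.sch, db$meals, db$stype, db$Z, drop = TRUE)
sum(table(comb)[table(comb) >= 5])   # should be close to the N of res.stab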
We ran a total of \(11\) models to determine the optimal one for the prediction of apicl_2000 in api35. All arguments of the OT_outcome and OT_joint functions stay unchanged, with the exception of indiv.method, maxrelax, and lambda.reg, which vary. The values linked to each criterion of the verif_OT function are summarized in Tables 8, 9 and 10. The syntax related to verif_OT stays unchanged for each model (notably the same min.neigb argument). Note that comparing the stability of the predictions between OUTCOME and JOINT models requires the prox.dist argument of the OT_outcome function to be fixed at \(0\).
Table 8: Association between the target variables (res.prox criteria) for each fitted model.

| Model | Type | Method | Relax | Regul | N | V_cram | rank_cor |
|---|---|---|---|---|---|---|---|
| outc1 | OUTCOME | SEQUENTIAL | 0.0 | - | 362 | 0.86 | 0.892 |
| R_outc1 | R-OUTCOME | SEQUENTIAL | 0.4 | - | 362 | 0.94 | 0.923 |
| R_outc2 | R-OUTCOME | SEQUENTIAL | 0.6 | - | 362 | 0.91 | 0.917 |
| R_outc3 | R-OUTCOME | OPTIMAL | 0.0 | - | 362 | 0.87 | 0.911 |
| R_outc4 | R-OUTCOME | OPTIMAL | 0.4 | - | 362 | 0.95 | 0.939 |
| R_outc5 | R-OUTCOME | OPTIMAL | 0.6 | - | 362 | 0.92 | 0.932 |
| outj1 | JOINT | - | 0.0 | 0.0 | 362 | 0.74 | 0.834 |
| R_outj1 | R-JOINT | - | 0.4 | 0.0 | 362 | 0.95 | 0.935 |
| R_outj2 | R-JOINT | - | 0.6 | 0.0 | 362 | 0.91 | 0.927 |
| R_outj3 | R-JOINT | - | 0.8 | 0.0 | 362 | 0.91 | 0.927 |
| R_outj4 | R-JOINT | - | 0.4 | 0.1 | 362 | 0.95 | 0.931 |
From these results, we can conclude that:

- Whatever the family of algorithms (OUTCOME or JOINT), adding a relaxation parameter improves here the association between the target variables (see Table 8).
- The R_outc2 and R_outc5 models seem not optimal because they are those for which the Hellinger criterion reflects the clearest violation of assumption 1 (see Table 9). It is therefore suggested to keep a relaxation parameter smaller than \(0.6\) in the final model, whatever the R-OUTCOME algorithm used (\(0.3\) could also have been tested here).
- Among the models with the best associations between the target variables (R_outc1, R_outc4, R_outj1, and R_outj4), R_outj1 and R_outj4 seem to be those with the most stable predictive potential (Table 10). Moreover, R_outc4 appears here as the best model among the tested R-OUTCOME algorithms.
- Adding a regularization parameter to the R_outj1 model caused a decrease in the stability of the predictions (compare the value of R_outj4 with that of R_outj1 in Table 10), and we thus conclude in favor of R_outj1 as the best model among those tested in the JOINT family of algorithms.
- The best models according to the previous criteria (R_outc4, R_outj1 and R_outj4) counted between \(92\) and \(97\%\) of common predictions (Table 11), and this last result also reassures the user about the quality of the provided predictions.
Table 9: Hellinger distances between the observed and predicted distributions of the target variable (hell criterion) for the OUTCOME family of models.

| Model | Type | Method | Relax | Hell(YA_YB) |
|---|---|---|---|---|
| outc1 | OUTCOME | SEQUENTIAL | 0.0 | 0.008 |
| R_outc1 | R-OUTCOME | SEQUENTIAL | 0.4 | 0.085 |
| R_outc2 | R-OUTCOME | SEQUENTIAL | 0.6 | 0.107 |
| R_outc3 | R-OUTCOME | OPTIMAL | 0.0 | 0.002 |
| R_outc4 | R-OUTCOME | OPTIMAL | 0.4 | 0.080 |
| R_outc5 | R-OUTCOME | OPTIMAL | 0.6 | 0.102 |
Table 10: Stability of the individual predictions (res.stab criterion, min.neigb = 5) for each fitted model.

| Model | Type | Method | Relax | Regul | N | mean | sd |
|---|---|---|---|---|---|---|---|
| outc1 | OUTCOME | SEQUENTIAL | 0.0 | - | 284 | 0.968 | 0.122 |
| R_outc1 | R-OUTCOME | SEQUENTIAL | 0.4 | - | 284 | 0.950 | 0.151 |
| R_outc2 | R-OUTCOME | SEQUENTIAL | 0.6 | - | 284 | 0.954 | 0.145 |
| R_outc3 | R-OUTCOME | OPTIMAL | 0.0 | - | 284 | 0.979 | 0.100 |
| R_outc4 | R-OUTCOME | OPTIMAL | 0.4 | - | 284 | 0.987 | 0.080 |
| R_outc5 | R-OUTCOME | OPTIMAL | 0.6 | - | 284 | 0.983 | 0.091 |
| outj1 | JOINT | - | 0.0 | 0.0 | 284 | 0.911 | 0.116 |
| R_outj1 | R-JOINT | - | 0.4 | 0.0 | 284 | 0.942 | 0.128 |
| R_outj2 | R-JOINT | - | 0.6 | 0.0 | 284 | 0.953 | 0.171 |
| R_outj3 | R-JOINT | - | 0.8 | 0.0 | 284 | 0.934 | 0.199 |
| R_outj4 | R-JOINT | - | 0.4 | 0.1 | 284 | 0.926 | 0.097 |
Table 11: Ratios of common predictions between pairs of models.

| Model | outc1 | R_outc1 | R_outc4 | outj1 | R_outj1 |
|---|---|---|---|---|---|
| R_outc1 | 0.84 | - | - | - | - |
| R_outc4 | 0.83 | 0.95 | - | - | - |
| outj1 | 0.72 | 0.81 | 0.83 | - | - |
| R_outj1 | 0.83 | 0.91 | 0.94 | 0.84 | - |
| R_outj4 | 0.84 | 0.96 | 0.97 | 0.84 | 0.92 |
Table 12: Confusion matrices between the predicted apicl_2000 and the observed apicl_1999 in api35 for the R_outc4 and R_outj1 models.

| apicl_2000 | G1 | G2 | G3 | G4 | apicl_2000 | G1 | G2 | G3 | G4 |
|---|---|---|---|---|---|---|---|---|---|
| [200-600] | 91 | 15 | 0 | 0 | [200-600] | 91 | 14 | 1 | 0 |
| [600-800] | 0 | 75 | 90 | 0 | [600-800] | 0 | 76 | 89 | 0 |
| [800-1000] | 0 | 0 | 0 | 91 | [800-1000] | 0 | 0 | 0 | 91 |
The confusion matrices related to R_outc4 and R_outj1 are described in Table 12 and seem to confirm that the encoding of apicl_2000 in three groups could simply correspond to the grouping of levels \(G2\) and \(G3\) of the apicl_1999 variable.
Finally, note that the running time of each model took less than 15 seconds with R version \(4.0.3\) for Windows (10 Pro 64-bit, Intel \(2.66\) GHz processor).
To our knowledge, OTrecod is the first R package that takes advantage of optimal transportation theory (Monge 1781) in the context of data fusion, and the comparative study of methods described in Garès and Omer (2022) underlines the promising performances of these approaches.
For more details and examples about the functions, users are invited to consult the ARTICLES section of the dedicated website.
The functions of OTrecod only apply to a specific data fusion context where there is no overlapping part between the two raw data sources; this package does not deal with record linkage, which is already the focus of extensive works (Sariyar and Borg 2010). The package is not adapted when the target variables (outcomes) of the two databases are continuous with an infinite number of values (like weights in grams with decimals, for example). Moreover, even if the data fusion requires only one shared variable to work, the quality of the fusion depends on the presence of a subset of good shared predictors in the raw databases. Finally, while the OT_outcome function allows all types of predictors, the current version of the OT_joint function imposes categorical matching variables only (scale variables are allowed): this restriction should be relaxed in a future version.
A number of more advanced research topics still need further investigation.
The authors would especially like to thank Pierre Navaro (CNRS UMR 6625) for his advice during the implementation of the OTrecod package. This research received support from the Région Occitanie grant RBIO-2015-14054319 and a Mastodons-CNRS grant.
### BASIC R CODE FOR SUMMARY TABLES 8, 9, 10, and 11 ---------
## Validation of each model: Repeat the following R command for each model by
## changing outc1:
verif_outc1 = verif_OT(outc1, group.class = TRUE, ordinal = FALSE,
stab.prob = TRUE, min.neigb = 5)
## Association between Y and Z: Summary Table 8
res.prx = rbind(
outc1 = verif_outc1$res.prox , R_outc1 = verif_R_outc1$res.prox,
R_outc2 = verif_R_outc2$res.prox , R_outc3 = verif_R_outc3$res.prox,
R_outc4 = verif_R_outc4$res.prox , R_outc5 = verif_R_outc5$res.prox,
outj1 = verif_outj1$res.prox , R_outj1 = verif_R_outj1$res.prox,
R_outj2 = verif_R_outj2$res.prox , R_outj3 = verif_R_outj3$res.prox,
R_outj4 = verif_R_outj4$res.prox )
res.prx = data.frame(Model = c("outc1","R_outc1","R_outc2","R_outc3",
                               "R_outc4","R_outc5","outj1","R_outj1",
                               "R_outj2","R_outj3","R_outj4"),
                     Type  = c("OUTCOME",rep("R-OUTCOME",5),"JOINT",
                               rep("R-JOINT",4)),
                     Relax = c(0,0.4,0.6,0,0.4,0.6,0,0.4,0.6,0.8,0.4),
                     Regul = c(rep(0,10),0.1), res.prx)
row.names(res.prx) = NULL; head(res.prx,3)
# Model   Type      Relax Regul   N V_cram rank_cor
# outc1   OUTCOME     0.0   0.0 362   0.86    0.892
# R_outc1 R-OUTCOME   0.4   0.0 362   0.94    0.923
# R_outc2 R-OUTCOME   0.6   0.0 362   0.91    0.917
#-----
## Hellinger distance: Summary Table 9
res.helld = rbind(
outc1 = verif_outc1$hell , R_outc1 = verif_R_outc1$hell,
R_outc2 = verif_R_outc2$hell, R_outc3 = verif_R_outc3$hell,
R_outc4 = verif_R_outc4$hell, R_outc5 = verif_R_outc5$hell,
outj1 = verif_outj1$hell , R_outj1 = verif_R_outj1$hell,
R_outj2 = verif_R_outj2$hell, R_outj3 = verif_R_outj3$hell,
R_outj4 = verif_R_outj4$hell )
res.helld = data.frame(res.prx[,1:4], res.helld)
row.names(res.helld) = NULL; res.helld
#-----
## Stability of the prediction: Summary Table 10
# same R code as for Summary Table 8, changing res.prox by res.stab
#----
## Ratio of common predictions: Table 11
stoc = list(outc1$DATA2_OT$OTpred , R_outc1$DATA2_OT$OTpred,
R_outc4$DATA2_OT$OTpred, outj1$DATA2_OT$OTpred ,
R_outj1$DATA2_OT$OTpred, R_outj4$DATA2_OT$OTpred)
corpred = matrix(ncol = 6, nrow = 6)
for (i in 1:6){
for (j in 1:6){
corpred[i,j] = round(sum(diag(table(stoc[[i]],stoc[[j]])))/362,2)
}
}; corpred
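## Optional addition (not in the original script): label the matrix of common
## predictions with the model names compared in Table 11
mod_names <- c("outc1","R_outc1","R_outc4","outj1","R_outj1","R_outj4")
dimnames(corpred) <- list(mod_names, mod_names)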
#-----
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-006.zip
CRAN packages used: OTrecod, StatMatch, mice, missForest, softImpute, missMDA, transport, devtools, randomForest, party
CRAN Task Views implied by cited packages: Environmetrics, MachineLearning, MissingData, MixedModels, OfficialStatistics, Psychometrics, Survival
Bioconductor packages used: MultiDataSet, OMICsPCA, mixOmics
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Guernec, et al., "OTrecod: An R Package for Data Fusion using Optimal Transportation Theory", The R Journal, 2023
BibTeX citation
@article{RJ-2023-006,
  author  = {Guernec, Gregory and Gares, Valerie and Omer, Jeremy and Saint-Pierre, Philippe and Savy, Nicolas},
  title   = {OTrecod: An R Package for Data Fusion using Optimal Transportation Theory},
  journal = {The R Journal},
  year    = {2023},
  note    = {https://doi.org/10.32614/RJ-2023-006},
  doi     = {10.32614/RJ-2023-006},
  volume  = {14},
  issue   = {4},
  issn    = {2073-4859},
  pages   = {195-222}
}