Multivariate reference regions (MVR) represent the extension of the reference interval concept to the multivariate setting. A reference interval is defined by two threshold points between which a high percentage of healthy subjects’ results, usually 95%, are contained. Analogously, an MVR characterizes the values of several diagnostic tests most frequently found among non-diseased subjects by defining a convex hull containing 95% of the results. MVRs have great applicability when working with diseases that are diagnosed via more than one continuous test, e.g., diabetes or hypothyroidism. The present work introduces *refreg*, an R package for estimating conditional MVRs. The reference region is non-parametrically estimated using a multivariate kernel density estimator, and its shape allowed to change under the influence of covariates. The effects of covariates on the multivariate variable means, and on their variance-covariance matrix, are estimated by flexible additive predictors. Continuous covariate non-linear effects can be estimated by penalized spline smoothers. The package allows the user to propose, for instance, an age-specific diagnostic rule based on the joint distribution of two non-Gaussian, continuous test results. The usefulness of the *refreg* package in clinical practice is illustrated with a real case in diabetes research, with an age-specific reference region proposed for the joint interpretation of two glycemia markers (fasting plasma glucose and glycated hemoglobin). To show that the *refreg* package can also be used in other, and indeed very different fields, an example is provided for the joint prediction of two atmospheric pollutants (SO\(_2\), and NO\(_x\)). Additionally, the text discusses how, conceptually, this method could be extended to more than two dimensions.

In clinical practice, many medical decisions are based on continuous
diagnostic tests (Hallworth 2011) – i.e., tests that provide results
along a continuous, quantitative scale. The interpretation of such
continuous tests requires the comparison of the obtained value with a
pre-defined reference interval, so that a result could be classified as
positive or negative (ie, disease present or absent) based on this
comparator value. A reference interval is an interval containing most
healthy subjects’ results. For a single test they are usually estimated
from the 2.5 and 97.5 empirical percentiles of the distribution for the
healthy population; thus, 95% of healthy patients are located within the
interval limits (Wright and Royston 1999). Those patients falling outside
the reference interval, are likely to have an undiagnosed disease. If
the test results are influenced by some patient characteristics
independent of the disease (e.g., age and gender), reference intervals
for specific patient groups must be obtained. These covariate-dependent
reference intervals, usually termed reference curves, are estimated
using quantile regression (Koenker and Bassett Jr 1978) or location-scale
models (Cole and Green 1992; Stasinopoulos et al. 2017). Several R
packages for estimating reference intervals and reference curves already
exist, including the R package
*referenceIntervals*,
which comprises a collection of tools, the R package
*gamlss*
(Stasinopoulos et al. 2007), which provides a general tool for
deriving reference curves in clinical practice (WHO 2006), and
software *RefCurv* (Winkler et al. 2019), recently proposed to
facilitate clinicians’ use of
*gamlss*. However, all
these packages were produced to provide reference intervals for single
tests; they cannot address diseases for which diagnosis and control are
based on multiple tests.

When the results of several tests are available for the same patient, obtaining separate reference intervals for each one provides an incomplete picture of disease status, particularly when these results are strongly correlated (Boyd 2004). Although each reference interval would leave only 5% of healthy patients out, their combined use can result in a higher percentage of false positives. Moreover, a patient falling within each univariate reference interval might, in fact, show an abnormal multivariate result. Thus, a multivariate reference region (MVRs) would provide a better means of interpreting the results of multiple tests (Winkel and Lyngbye 1972). MVRs are a straightforward extension of the univariate reference interval to the multidimensional setting, i.e., a region that contains 95% of the healthy patients’ results. However, despite being proposed more than 40 years ago, MVRs are rarely used in clinical practice. This might be explained by the multivariate Gaussian assumption of MVRs, which is quite restrictive when interpreting diagnostic test results. Further, the multivariate distribution of test results is usually affected by patient characteristics, independent of their health status. For example, (Espasandı́n-Domı́nguez et al. 2019) showed that the correlation between two diagnostic tests for diabetes was influenced by patient age and red blood cell turnover, independent of glycemia status. A conditional MVR is therefore desirable, but the statistical literature is not rich in such proposals (see, e.g., (Wei 2008)).

Very few statistical software routines have been proposed for estimating
the region containing a specific percentage of multivariate data points.
The function `mvtol.region`

in the R package
*tolerance*
(Young 2010) obtains a region containing a high percentage of a
bivariate Gaussian distribution in the context of quality control
studies, and non-parametric probabilistic regions can be obtained using
the R packages *r2d2*
(Magnusson and Burgos 2014), *hdrcde* (Hyndman 2018)
and *distfree.cr*
(Hu and Yang 2018). However, these all suffer the major limitation of not being
able to estimate the effects of covariates on the region’s shape. The R
package *modQR* can estimate
conditional multivariate quantiles, but the quantile level \(\tau\) is not
linked to the probability content of the sample (Šiman and Boček 2019). Thus, it is
not clear how to derive a reference region from these bivariate
quantiles.

The present paper introduces *refreg*, an implementation in R of a new
statistical methodology for estimating bivariate reference regions able
to classify subjects as having normal or abnormal values based on the
results of two continuous diagnostic tests. The main advantages of the
presented method are; i) the absence of parametric restrictions for
describing bivariate distributions for continuous tests, and ii) the
possibility of estimating the effects of covariates on the shape of the
reference region via flexible additive predictors. To illustrate this
statistical methodology, and how to use the package, an age-specific
reference region for two diabetes diagnostic tests was estimated. The
estimated reference region offers new insights into the diagnosis and
prognosis of diabetes, enabling physicians to identify different
patients’ profiles. The proposed method can, however, be applied to any
disease in which two continuous diagnostic tests are available, and can
even be used in non-medical fields. Indeed, an application is discussed
in which the conditional region is used in the joint forecasting of the
concentrations of two air pollutants. Moreover, the current
implementation can be easily extended to three dimensions.

The statistical model that enables conditional reference regions to be
determined is presented in the next section. The main functions
contained in the *refreg* package are then described, with a brief
introduction to the main functions. The use of the package is shown in
analyses of real medical and environmental data problems. The paper
closes with some comments, and some notes on future research directions.

In this section the main features of our statistical method is presented. In a nutshell, our proposal is based on a bivariate location scale model, where the response means, and their variance-covariance matrix, are related to covariates using flexible additive predictors. The probabilistic region covering a specific percentage of the data is firstly estimated using the model standardized residuals. Then, it is generalized for each covariate value based on the aforementioned bivariate location-scale model fit. This statistical model was already presented and evaluated in (Roca-Pardiñas et al. 2020).

Given a bivariate continuous random variable of interest \({\bf Y} = (Y_1, Y_2)\), and a vector of covariates \(\bf{X}=(X_1, \ldots, X_p)\) we consider the following structure: \[\left( { \begin{array} {c} Y_1 \\ Y_2\ \end{array} }\right) = \left( { \begin{array} {c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X})\ \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \left( { \begin{array} {c} \varepsilon_1 \\ \varepsilon_2\\ \end{array} }\right) \label{modelo} \tag{1}\]

where \(\mu_1(\textbf{X})\) and \(\mu_2(\textbf{X})\) represents the conditional means of both responses and \(\boldsymbol{\Sigma}^{1/2} (\textbf{X})\) the Cholesky decomposition of the variance-covariance matrix \[\boldsymbol{\Sigma} (\textbf{X})=\left({ \begin{array} {cc} \sigma_1^2 (\textbf{X}) & \sigma_{12} (\textbf{X})\\ \sigma_{12} (\textbf{X}) & \sigma_2^2 (\textbf{X}) \\ \end{array}}\right) \label{mod1} \tag{2}\] so that \(\text{Var} (\bf Y|\textbf{X})=\boldsymbol{\Sigma} (\textbf{X})= \boldsymbol{\Sigma}^{1/2} (\textbf{X})\left({\boldsymbol{\Sigma}^{1/2} (\textbf{X})}\right)^T\). In order to guarantee the model identification ((1)), the bivariate residuals \((\varepsilon_1,\varepsilon_2)\) are assumed to be independent of the covariates, with zero mean, unit variance, and zero correlation.

We consider additive structures for the mean functions \(\mu_j({\bf X})\), variance functions \(\sigma_j^2({\bf X})\) (\(j=1,2\)) and the correlation function \(\rho({\bf X})\) – note that \(\sigma_{12} (\textbf{X})= \hat \sigma_1(\textbf{X}) \hat \sigma_{2} (\textbf{X}) \hat \rho (\textbf{X})\). These structures are given, respectively, by \[\mu_r({\bf X})=\alpha_r + \sum_{j=1}^p{f_{jr}(X_j)}, \quad \sigma_r^2({\bf X})=H_\sigma\left({\beta_r+ \sum_{j=1}^p{g_{jr}(X_j)}}\right) \quad \text{for } r=1,2\] and \[\rho({\bf X}) =H_\rho \left({\gamma+ \sum_{j=1}^p{m_{j}(X_j)}}\right)\] where \(\alpha\), \(\beta\) and \(\gamma\) are parametric coefficients and \(f_{jr}\), \(g_{jr}\) and \(m_j\) for \(j = 1,\cdots,p\) and \(r=1,2\) are smooth and unknown functions. \(H_\sigma (\cdot)=\exp(\cdot)\) and \(H_\rho(\cdot)=\tanh(\cdot)\) are link functions used in the variance and correlation structures, respectively, to ensure that the restrictions on the parameter spaces (\(\sigma_r^2({\bf X}) \geq 0\) and \(0\leq \rho({\bf X}) \leq 1)\) are maintained.

Based on the model presented in equation ((1)), for a given \(\textbf{X}\) the conditional \(\tau^{th}\)- reference region for \((Y_1, Y_2)\) is given by: \[R_{\tau}(\textbf{X}) = \left( { \begin{array} {c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X})\ \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \varepsilon_{\tau} \label{rexion_x} \tag{3}\] where \(\varepsilon_{\tau}\) is the unconditionally probabilistic region for the errors \((\varepsilon_1,\varepsilon_2)\) as \[\varepsilon_{\tau} (k) = \{(\varepsilon_1,\varepsilon_2) \in{R}^2 | f(\varepsilon_1,\varepsilon_2) \leq k\} \label{rexion1} \tag{4}\] \(f\) being the density function of the bivariate residuals \((\varepsilon_1,\varepsilon_2)\) and \(k\) is the \(\tau-\)quantile of \(f(\varepsilon_1,\varepsilon_2)\).

In this section, we present the estimation procedure of the conditioned bivariate uncertainty region given in equation ((3)). Our approach is based on the estimation of the covariate effects on the response means using an additive model, and then on the variance-covariance matrix using the squared residuals of the former models. Finally, the bivariate region \(R_{\tau}(\textbf{X})\) is obtained with a bivariate kernel estimation of the standardized bivariate residuals.

The steps of the proposed estimation algorithm are the following:

**Step 1:** For \(r=1,2\) fit an additive model to the sample
\(\{{\bf X}_i, {\bf Y}_{ir}\}_{i=1}^n\) and obtain the estimates
\[\hat \mu_{r}({ \bf X}_i)=\hat \alpha+ \sum_{j=1}^p \hat f_{jr}(X_{ij})
\label{am1} \tag{5}\]
Then, estimate \(\sigma_{r}^2(\textbf{X})\) from the sample
\(\{{\bf X_i}, ({\bf Y}_{ir} - \hat \mu_r({\bf X}_i))^2\}_{i=1}^n\) as
\[\hat \sigma_{r}^2(\textbf{X}_i)= \hat \beta_r + \sum_{j=1}^p \hat g_{jr}(X_{ij})
\label{am2} \tag{6}\]
**Step 2:** Compute the correlation \(\rho (\textbf{X})\), using the
sample \(\{{\bf X}_i, \hat{\delta}_i\}_{i=1}^n\), as
\[\hat \rho (\textbf{X}_i)= \tanh \left( {\hat \gamma + \sum_{j=1}^p \hat m_{jr}(X_{ij})}\right)\]
where
\[\hat{\delta}_i =
\frac{\left(
Y_{i1}-\hat{\mu}_1(\textbf{X}_i)\right)\left(Y_{i2}-\hat{\mu}_2(\textbf{X}_i)
\right)}{\hat{\sigma}_1(\textbf{X}_i)\hat{\sigma}_2(\textbf{X}_i)}\]
**Step 3:** Compute the standardized residuals
\[\left({
\begin{array} {c}
\hat{\varepsilon}_{i1} \\
\hat{\varepsilon}_{i2}\\
\end{array}
}\right)
=
\hat{\Sigma}^{-1/2}(\textbf{X}_i)
\left({
\begin{array} {c}
Y_{i1} - \hat{\mu}_1(\textbf{X}_i)\\
Y_{i2} - \hat{\mu}_2(\textbf{X}_i) \\
\end{array}}\right)
\quad i=1,\ldots,n\]
and obtain the kernel estimation of the bivariate density
\(\hat{f}(\varepsilon_1,\varepsilon_2)\) given by
\[\hat{f}((\varepsilon_{1},\varepsilon_{2}),\textbf{H}) = \frac{1}{n}\sum_{i=1}^{n}K_{\textbf{H}}
\left({
\begin{array} {c}
\varepsilon_1 - \hat{\varepsilon}_{i1} \\
\varepsilon_2 - \hat{\varepsilon}_{i2}\\
\end{array}
}\right)\]
where \(K(\cdot)\) is the kernel which is a symmetric probability density
function and \(\textbf{H}\) is a \(2 \times 2\) positive definite matrix.
Then, obtain the \(\tau^{th}\) unconditional bivariate uncertainty region
on the residual scale as
\[\hat{\varepsilon}_{\tau} = \{(\varepsilon_{1},\varepsilon_{2})) \in \mathbb{R}^2 | \hat{f}(\varepsilon_{1},\varepsilon_{2}))\leq\hat{k}\}
\label{step4} \tag{7}\]
\(\hat k\) being the empirical \(\tau\) quantile of the values
\(\hat f(\varepsilon_{11},\varepsilon_{12}), \ldots, \hat f(\varepsilon_{n1},\varepsilon_{n2})\).

Finally, for a given \(\bf X\), the conditional bivariate uncertainty region \({R}_{\tau}(\textbf{X})\) is estimated by \[\hat{R}_{\tau}(\textbf{X}) = \left({ \begin{array} {c} \hat{\mu}_1(\textbf{X}) \\ \hat{\mu}_2(\textbf{X})\ \end{array} }\right) + \boldsymbol{ \hat \Sigma}^{1/2}(\textbf{X}) \hat{\varepsilon}_{\tau} \label{step5} \tag{8}\]

The continuous covariates smooth effects (\(f_{jr}\), \(g_{jr}\) and \(m_j\) for \(j = 1,\cdots,p\)) may be estimated using several non-parametric regression techniques. In previous works we applied polynomial kernel smoothers (Roca-Pardiñas et al. 2020). However, for sake of usage simplicity and computational cost, in the final package implementation we used a penalized spline basis representation following (Wood 2017). Thus, given an unknown smooth effect (e.g. \(f(x)\)) is estimated as: \[f(x) = \sum_{k = 1}^{K} \beta_k b_k (x)\] Confidence intervals for the estimated effects may be obtained using a bootstrap procedure. Given a specific vector of covariates \(\textbf{X}_0\), for the components (means, variances and correlation). The steps for construction of the bootstrap confidence intervals are:

**Step 1**. From the sample data \(\{(Y_{i1},Y_{i2}),{\bf X}_i\}_{i=1}^n\)
obtain the estimates \(\hat{\mu}_r({\bf X}_0)\),
\(\hat{\sigma}_{r}({\bf X}_0)\) \((r=1,2)\) and \(\hat{\rho}({\bf X}_0)\).

**Step 2**. For \(b=1,\ldots,B\) generate bootstrap samples
\(\{(Y_{i1}^\bullet,Y_{i2}^\bullet),{\bf X}_i\}_{i=1}^n\) with
\[\left( {
\begin{array} {c}
Y_{i1}^\bullet \\
Y_{i2}^\bullet \\
\end{array}}\right) = \left( {
\begin{array} {c}
\hat \mu_1(\textbf{X}_i) \\
\hat \mu_2(\textbf{X}_i)\
\end{array}
}\right)
+ \boldsymbol{\hat \Sigma}^{1/2} (\textbf{X}_i)
\left( {
\begin{array} {c}
\hat \varepsilon_{i1}^\bullet \\
\hat \varepsilon_{i2}^\bullet\\
\end{array}
}\right)
\label{boot} \tag{9}\]
where
\(\left \{ {(\hat{\varepsilon}_{i1}^\bullet,\hat{\varepsilon}_{i2}^\bullet)}\right\}_{i=1}^n\)
is a sample of size \(n\) from the residuals
\(\left \{ {(\hat{\varepsilon}_{i1},\hat{\varepsilon}_{i2})}\right\}_{i=1}^n\)
with replacement, and compute \(\hat{\mu}_r^{\bullet b}({\bf X}_0)\),
\(\hat{\sigma}_{r}^{\bullet b}({\bf X}_0)\) and
\(\hat{\rho}^{\bullet b}({\bf X}_0)\) as in Step 1.

The limits for the \(100(1-\alpha)\%\) confidence intervals of the true components \({\mu}_r(\bf X_0)\), \({\sigma}_r(\bf X_0)\) and \(\rho(\bf X_0)\) are given respectively by \(\left({\hat {\mu}^{\alpha/2}_r( \textbf{X}_0), \hat {\mu}^{1-\alpha/2}_r(\textbf{X}_0)}\right)\), \(\left({\hat {\sigma}^{\alpha/2}_r(\textbf{X}_0), \hat {\sigma}^{1-\alpha/2}_r(\textbf{X}_0)}\right)\) and \(\left({\hat {\rho}^{\alpha/2}(\textbf{X}_0), \hat {\rho}^{1-\alpha/2}(\textbf{X}_0)}\right)\), where \(\hat {\mu}_r^p(\textbf{X}_0)\) represents the \(p\)-percentile of \(\hat{\mu}_r^{\bullet 1}(\textbf{X}_0), \ldots, \hat{\mu}_r^{\bullet B}(\textbf{X}_0)\), \(\hat {\sigma}_r^p(\textbf{X}_0)\) represents the \(p\)-percentile of \(\hat{\sigma}_r^{\bullet 1}(\textbf{X}_0), \ldots, \hat{\sigma}_r^{\bullet B}(\textbf{X}_0)\), and \(\hat {\rho}^p(\textbf{X}_0)\) is the \(p\)-percentile of \(\hat{\rho}^{\bullet 1}(\textbf{X}_0), \ldots, \hat{\rho}^{\bullet B}(\textbf{X}_0)\).

The estimation of the unconditionally probabilistic region for the
bivariate errors \((\varepsilon_1,\varepsilon_2)\) is based on a kernel
density estimator. This estimator is given by:
\[\hat{f}((\varepsilon_{1},\varepsilon_{2}),\textbf{H}) = \frac{1}{n}\sum_{i=1}^{n}K_{\textbf{H}}
\left({
\begin{array} {c}
\varepsilon_1 - \hat{\varepsilon}_{i1} \\
\varepsilon_2 - \hat{\varepsilon}_{i2}\\
\end{array}
}\right)\]
where **H** is a matrix defining the kernel bandwidth
\[\textbf{H} = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix}\]

The selection of \(\bf H\) is crucial to obtain a good estimation of \(\varepsilon_{\tau}\). A natural option is to use a plug-in or cross-validation bandwidth estimator, as in a density estimation problem \[\ \underset{\textbf{H} \in \mathcal{H}}{\operatorname{arg\min} } \textsf{E}(\iint \left(\hat{f}_\textbf{H}(\varepsilon_{1},\varepsilon_{2}) - f_\textbf{H}(\varepsilon_{1},\varepsilon_{2})^{(-i)} \right)^2 d\varepsilon_{1}d\varepsilon_{2} \label{CV} \tag{10}\] where \(\hat{f}_\textbf{H}(\varepsilon_{1},\varepsilon_{2})^{-i}\) is the estimated bivariate density function without the i-th observation. However, as we seek for a probabilistic region (i.e. a region which contains a given percentage of the multivariate data), the following selection criteria based on the region coverage is proposed \[\hat{\lambda} = \underset{\lambda}{\operatorname{arg\min} } \left|\left(n^{-1}\sum_{i=1}^n I \{(Y_{i1},Y_{i2})\in R^{(-i)}(\textbf{X}_i)\}\right) - \tau\right| \label{CV1} \tag{11}\] where \(\tau\) is the desired coverage and \(\hat{R}_{\tau}^{(-i)}(X_i)\) is the estimated bivariate region without the i-th observation. Using this criteria the estimated region show a smoother contour and a coverage of the data points closer to the desired one \(\tau\) (see Figure 1). Given the high computational cost of the regular proposed method in ((10)), a k-fold cross-validation scheme could be used instead. Moreover we simplify the minimization problem by considering \(h_{11}=h_{22}\) and \(h_{12}=0\).

The *refreg* package
contains a set of functions for estimating a conditional reference, or
uncertainty, region. Its working framework was designed so that people
without a strong statistical background can use it. Indeed, only two
functions need to be taken into account by the user: 1) the effects of
the predictor variables on responses need to be estimated using the
bivRegr function, a step that requires the user choose which variables
may influence the region; 2) bivRegion needs to be applied to a bivRegr
object so that the reference region can be estimated.

The `bivRegr()`

function has the following structure:

`bivRegr(f = formulas, data = data)`

The `f`

argument contains a list of five R formulae corresponding to the
additive predictors for the means, variances and correlation models
shown in equation (1). Since `bivRegr()`

uses `mgcv::gam()`

internally,
the user can estimate covariate linear and non-linear effects using
`s()`

operator. For instance:

```
<- y1 ~ s(x1)
mu1 <- y2 ~ s(x1)
mu2 <- ~ x2
var1 <- ~ x2
var2 <- ~ s(x3)
rho
= list(mu1,mu2,var1,var2,rho) formula
```

assumes a smooth effect of `x1`

on the response means, a parametric
effect of `x2`

on their variances, and a smooth effect of `x3`

on the
response correlation.

The `bivRegion()`

function is designed for non-parametrically estimating
a bivariate reference region:

`bivRegion(object, tau = 0.95, bandwidth = "plug-in")`

The object may be a set of bivariate data points, or a `bivRegr`

object,
while tau defines the desired coverage(s) for the reference region,
which might be a single value or a vector. Finally, “bandwidth”
specifies the kernel bandwidth selection method. The user can chose
between the plug-in, cross-validation, or the best coverage method (see
equation (13)).

Additionally, we defined S3 methods for these two main functions.
Specifically, associated with `bivRegr`

we have

`predict.bivRegr`

and`plot.bivRegr`

: to predict and depict additive models results for responses’ means, variances, and their correlation.`summary_boot.bivRegr`

: a function implementing the bootstrap inference for flexible additive models (see (9)). This function results can be depicted by applying`plot.summary_boot`

.

On the other hand, we defined the following S3 methods associated to
`bivRegion`

:

`summary.bivRegion`

: this function evaluates the region performance on the healthy patients’ sample.`predict.bivRegion`

and`plot.bivRegion`

: these functions offer a prediction or a plot of conditional regions for a new dataset. If the argument`cond=FALSE`

it evaluates the response values in the standardized scale.

In addition, we define the functions `trivRegr`

, `trivRegion`

and
`plot.trivRegion`

as an extension of the aforementioned method for a
trivariate response variable. Finally, our package also contains some
inner functions as `ace`

(for estimating variance, and correlation
models), `Hcv`

(it implements equation (11) method), and
`refcurve`

(it implements an univariate location-scale model).

This section outlines the implemented functions of the proposed package
in detail, and illustrates their use with real datasets. The first
illustration is related to diabetes research, in which a reference
region for the joint interpretation of two glycemia tests is calculated.
In the second illustration,
*refreg* methodology is
used to predict the concentrations of two air pollutants during a
pollution episode. Finally, the extension of the method to higher
dimensions is shown using real data.

Diabetes is a chronic disease, the diagnosis of which is based on two glycemia tests: the fasting plasma glucose (FPG) and glycated hemoglobin (HbA1c)(American Diabetes Association 2019) tests. The multivariate interpretation of FPG and HbA1c results is desirable for two reasons: i) the results of both tests are correlated in healthy patients (Aleyassine et al. 1980), ii) a miss-match between them may be indicative of a poorer prognosis (Kim et al. 2018). Finally, it is well known that both test results are influenced by patient age (Davidson 1979; Pani et al. 2008).

The age-dependent reference region for the FPG and HbA1c tests was estimated using a sample of healthy subjects derived from the A-Estrada Glycation and Inflammation Study (AEGIS) (see (Gude et al. 2017)). A subset of this dataset is available in the package under the name “AEGIS”.

This dataset comprised 1516 subjects and 7 variables:

- id: an anonymous identifier for each subject.
- gender: a factor variable that indicates the subject’s gender with levels “male”, and “female”.
- age: the subject’s age.
- dm: a factor variable indicating a previous diabetes mellitus diagnosis with levels “no”, and “yes”.
- fpg: fasting plasma glucose concentration in mg/dL.
- hba1c: the percentage of glycated hemoglobin.
- fru: fructosamine plasma concentration.

Applying the `summary()`

routine to the `aegis`

dataset indicated 55% of
the subjects to be female, the mean age of all 1516 subjects to be 52
years (range 18-91), and that 187 subjects (12%) had been previously
diagnosed with diabetes.

```
> summary(aegis)
R
id gender age dm fpg : 1.0 female:838 Min. :18.00 no :1329 Min. : 63.00
Min. 1st Qu.: 379.8 male :678 1st Qu.:39.00 yes: 187 1st Qu.: 82.00
: 758.5 Median :52.00 Median : 89.00
Median : 758.5 Mean :52.58 Mean : 94.51
Mean 3rd Qu.:1137.2 3rd Qu.:67.00 3rd Qu.:100.00
:1516.0 Max. :91.00 Max. :274.00
Max.
hba1c fru : 3.900 Min. :119.0
Min. 1st Qu.: 5.200 1st Qu.:225.0
: 5.400 Median :254.0
Median : 5.608 Mean :262.2
Mean 3rd Qu.: 5.700 3rd Qu.:284.0
:10.200 Max. :700.0 Max.
```

To estimate the reference region, a subset of the patients not
previously diagnosed with diabetes was define as `dm_no`

. This subset
sample was deemed the healthy patient sample.

```
> dm_no = subset(aegis,aegis$dm == "no")
R> dm_yes = subset(aegis,aegis$dm == "yes") R
```

To estimate the effect of age on the final region shape, the `bivRegr()`

function was used. This function implements the estimation process of
the bivariate location-scale:

```
> mu1 = fpg ~ s(age)
R> mu2 = hba1c ~ s(age)
R> var1 = ~ s(age)
R> var2 = ~ s(age)
R> rho = ~ s(age)
R> formula = list(mu1,mu2,var1,var2,rho) R
```

The first and second formulae define the additive models for the mean
values of both glycemia tests. The third and fourth define the additive
models for test result variability. The last formula represents the
additive model that comprises the effect of age on the correlation
between the results of both glycemia tests. In addition to the model
formulae list, a dataset including both the test results and subject’s
age must be supplied to `bivRegr()`

as:

`> fit = bivRegr(formula,data=dm_no) R`

By applying the S3 method `plot()`

to a `bivRegr`

object, the estimated
effects of covariates can be shown for each submodel. The argument eq=
controls the model component to be represented (1 = FPG mean, 2 = HbA1c
mean, 3 = FPG variance, 4 = HbA1c variance, and 5 = [FPG – HbA1c]
correlation). Moreover, the function `summary_boot()`

may be applied to
a `bivRegr`

object to obtain the 95% pointwise confidence interval for
the estimated effects via bootstrapping:

```
> fit_boot = summary_boot(fit, B=250, parallel = TRUE )
R> plot(fit_boot,eq=1)
R> plot(fit_boot,eq=2)
R> plot(fit_boot,eq=3)
R> plot(fit_boot,eq=4)
R> plot(fit_boot,eq=5) R
```

Since bootstrap resampling (introduced in equation ((9))) is
time consuming, the user can fix `parallel = TRUE`

and run a
parallelized computation. The parallel backend is registered using
*doParallel*
(Microsoft and Weston 2020a), and the parallel computation is performed by
*foreach* (Microsoft and Weston 2020b).

Figure 2 shows that the mean values of both glycemia markers increase almost linearly with age. FPG variance increases from 20 to 40 years, while the HbA1c variance increases linearly with age. Finally, the correlation between the FPG and HbA1c concentration seems to be stronger for older patients.

Applying the function `bivRegion()`

to a `bivRegr`

object provides a
bivariate region containing 100\(\tau\)% of the model standardized
residuals. This region is based on a bivariate kernel density estimator.
The kernel bandwidth selection method may be chosen with the `H_choice`

argument. Here, the 90%, 95% and 97.5% regions are obtained with the
best coverage bandwidth selector (see equation (10)):

`> region = bivRegion(fit,tau=c(0.90,0.95,0.975),H_choice = "Hcov") R`

This region facilitates a multivariate interpretation of the glycemia
test results. A patient whose results are “normal”, for his/her age
would see them fall inside this reference region, while a subject with
“abnormal” results for his/her age would not. This interpretation is
possible because the model residuals are centered around zero, show unit
variance, no correlation, and they are independent of age. The user can
check test results located outside the reference region using the
`bivRegion`

S3 method `plot`

:

```
> par(mfrow = c(1, 2))
R> plot(region, xlab = "FPG, mg/dL", ylab = "HbA1c, \%",cond=T, newdata =
Rdata.frame (age = c(20,30,40,50,60,70)),tau=0.95,reg.lwd=2, pch="*",col="grey")
> plot(region,xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata =
Rdata.frame(age = c(20,60)),tau=c(0.50,0.95), reg.lwd=2, pch="*", col="grey")
```

Figure 3 shows the unconditional reference region for
\(\tau = 0.90, 0.95\) and \(0.975\) for healthy patients, and those
previously diagnosed with diabetes. Note that the `plot()`

function
argument ‘newdata’ allows the glycemia test values to be observed in the
standardized residuals scale of the dataset for the patients with
diabetes. As is clearly seen, most of healthy patients’ results are
located inside the reference region, while those recorded for diabetic
patients are located outside.

A major advantage of this representation is that it allows clinicians
new insights into the subject’s glycemia status. Indeed, those patients
located outside the reference region may be classified into four groups:
(I) individuals with high values for both tests (first quadrant); (II)
those with discordant results, with high HbA1c concentrations and
low/medium FPG (second quadrant); (III) individuals with low values for
both tests (third quadrant); and (IV) individuals with low/medium HbA1c
concentrations and high FPG values (fourth quadrant). This distinction
might be useful for physicians. For instance, discordant results are
probably due to an altered bloodstream protein glycation rate, a
condition associated with a poorer prognosis. Patients located outside
the standardized region may be also checked applying `summary()`

to a
`bivRegion`

object:

`> summary(region,tau = 0.95) R`

This R output presents patients located outside the standardized bivariate region for \(\tau = 0.95\). Note that, in the full sample, patients with different ages were located outside the reference region. The glycemia tests results located outside the reference region are interesting from a clinical point of view. For example, the following were seen: a possible case of undiagnosed diabetes in a 20 year old patient (FPG = 99, HbA1c = 6.3); a 47 year old patient showing a high HbA1c value for his corresponding FPG result (FPG = 86, HbA1c =6); and a patient of 85 years in the opposite situation (FPG = 120, HbA1c = 5.7).

The use of this region in combination with the results of the bivariate
location-scale model allow the conditional reference regions to be
obtained. The user can visualize these regions using the `bivRegion`

S3
method `plot()`

, setting `cond = TRUE`

as follows:

```
> plot(region, xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = data.frame(age =
Rc(20,30,40,50,60,70)),tau=0.95,reg.lwd=2, pch="*",col="grey")
> plot(region,xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = data.frame(age =
Rc(20,60)),tau=c(0.50,0.95),reg.lwd=2, pch="*",col="grey")
```

In addition, the region may be represented in a pre-existing plot if the
`plot`

function argument `add`

is equal to `TRUE`

as in the following
code:

```
> plot(dm_no[dm_no$age==40,"fpg"],dm_no[dm_no$age==40,"hba1c"],main="40 years",
Rxlim=c(50,140), ylim=c(4.2,7.5),xlab = "FPG, mg/dL", ylab = "HbA1c, %", pch=20,cex=2)
> plot(region,cond=T,newdata = data.frame(age = 40),add=T,legend=F, tau=c(0.50,0.95),
Rreg.lty=c(1,2))
> plot(dm_no[dm_no$age==60,"fpg"],dm_no[dm_no$age==60,"hba1c"],main="60 years",
Rxlim=c(50,140), ylim=c(4.2,7.5),xlab = "FPG, mg/dL", ylab = "HbA1c, %", pch=20,cex=2)
> plot(region,cond=T,newdata = data.frame(age = 60),add=T,legend=F, tau=c(0.50,0.95),
Rreg.lty=c(1,2))
```

In Figure 4, the bivariate reference region is shown for several ages. Note that the regions shift towards the upper right corner and expand as age increase. This agrees with the non-linear effect of age on the expected means and variability of both markers. The conditional region coverage and the performance of the methodology have already been assessed (Lado-Baleato et al. 2021).

This section provides an example of how the method proposed in equation
(3) might be extended to more than two dimensions. This section is
intended to provide a proof of concept rather than a formal statistical
proposal. For a trivariate variable (\(Y_1, Y_2, Y_3\)) the following
model can be assumed:
\[\left( {
\begin{array} {c}
Y_1 \\
Y_2\\
Y_3\
\end{array}
}\right) = \left( {
\begin{array} {c}
\mu_1(\textbf{X}) \\
\mu_2(\textbf{X})\\
\mu_3(\textbf{X})\
\end{array}
}\right)
+ \boldsymbol{\Sigma}^{1/2} (\textbf{X})
\left( {
\begin{array} {c}
\varepsilon_1 \\
\varepsilon_2\\
\varepsilon_3\\
\end{array}
}\right)
\label{mod3} \tag{12}\]
where \(\{\mu_r(\textbf{X})\}_{r=1}^3\) represents the conditional means
of each response, and \(\boldsymbol{\Sigma}^{1/2} (\textbf{X})\) the
Cholesky decomposition of the variance-covariance matrix
\[\boldsymbol{\Sigma} (\textbf{X})=\left({
\begin{array} {ccc}
\sigma_1^2 (\textbf{X}) & \sigma_{21} (\textbf{X})& \sigma_{31} (\textbf{X})\\
\sigma_{12} (\textbf{X}) & \sigma_2^2 (\textbf{X})& \sigma_{23} (\textbf{X}) \\
\sigma_{13} (\textbf{X}) & \sigma_{32} (\textbf{X}) & \sigma_3^2 (\textbf{X}) \\
\end{array}}\right)\]
In the trivariate case, the estimated variance-covariance matrix can be
non-positive-definite. Thus, \(\hat{\boldsymbol{\Sigma}}\) is modified by
applying the unweighted bending method of (Schaeffer 2014) as
implemented in the *mbend* R
package (Nilforooshan 2020).

Following equation ((12)), a trivariate reference region may be estimated as: \[R_{\tau}(\textbf{X}) = \left( { \begin{array}{c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X}) \\ \mu_3(\textbf{X}) \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \varepsilon_{\tau}\] where \(\varepsilon_{\tau}\) is the unconditionally probabilistic region for the errors \((\varepsilon_1,\varepsilon_2,\varepsilon_{3})\) as \[\varepsilon_{\tau} (k) = \{(\varepsilon_1,\varepsilon_2,\varepsilon_{3}) \in{\mathbb{R}}^3 | f(\varepsilon_1,\varepsilon_2,\varepsilon_{3}) \leq k\} \label{rexion_3} \tag{13}\] \(f\) being the density function of the trivariate residuals \((\varepsilon_1,\varepsilon_2,\varepsilon_3)\) and \(k\) is the \(\tau-\)quantile of \(f(\varepsilon_1,\varepsilon_2,\varepsilon_3)\).

Using this model, the application of the methodology for diabetes diagnosis can be extended by incorporating an additional glycemia test. This extension is justified since other glycated proteins are routinely monitored in diabetes control besides FPG and HbA1c. For instance, in conditions that determine alterations in hemoglobin metabolism (e.g., anemia or kidney disease), fructosamine (Fr) is frequently used as an additional marker. Nevertheless, the translation of Fr into average glucose levels is not as clear as for HbA1c, and discordances are often encountered between the Fr and HbA1c results. In addition, agreement among these glycemia markers can be affected by factors such as patient age.