Refreg: An R Package for Estimating Conditional Reference Regions

Multivariate reference regions (MVR) represent the extension of the reference interval concept to the multivariate setting. A reference interval is defined by two threshold points between which a high percentage of healthy subjects’ results, usually 95%, are contained. Analogously, an MVR characterizes the values of several diagnostic tests most frequently found among non-diseased subjects by defining a convex hull containing 95% of the results. MVRs have great applicability when working with diseases that are diagnosed via more than one continuous test, e.g., diabetes or hypothyroidism. The present work introduces refreg, an R package for estimating conditional MVRs. The reference region is non-parametrically estimated using a multivariate kernel density estimator, and its shape allowed to change under the influence of covariates. The effects of covariates on the multivariate variable means, and on their variance-covariance matrix, are estimated by flexible additive predictors. Continuous covariate non-linear effects can be estimated by penalized spline smoothers. The package allows the user to propose, for instance, an age-specific diagnostic rule based on the joint distribution of two non-Gaussian, continuous test results. The usefulness of the refreg package in clinical practice is illustrated with a real case in diabetes research, with an age-specific reference region proposed for the joint interpretation of two glycemia markers (fasting plasma glucose and glycated hemoglobin). To show that the refreg package can also be used in other, and indeed very different fields, an example is provided for the joint prediction of two atmospheric pollutants (SO\(_2\), and NO\(_x\)). Additionally, the text discusses how, conceptually, this method could be extended to more than two dimensions.

Óscar Lado-Baleato (Department of Statistics, Mathematical Analysis, and Optimization, Universidade de Santiago de Compostela, Galicia, Spain.) , Javier Roca-Pardiñas (Galician Center for Mathematical Research and Technology (CITMAga) & Statistical Inference, Decision and Operations Research, Universidade de Vigo.) , Carmen Cadarso-Suárez (Galician Center for Mathematical Research and Technology (CITMAga) & Department of Statistics, Mathematical Analysis, and Optimization, Universidade de Santiago de Compostela, Galicia, Spain.) , Francisco Gude (Clinical Epidemiology Unit, Complexo Hospitalario de Santiago de Compostela.)

1 Introduction

In clinical practice, many medical decisions are based on continuous diagnostic tests (Hallworth 2011) – i.e., tests that provide results along a continuous, quantitative scale. The interpretation of such continuous tests requires the comparison of the obtained value with a pre-defined reference interval, so that a result could be classified as positive or negative (ie, disease present or absent) based on this comparator value. A reference interval is an interval containing most healthy subjects’ results. For a single test they are usually estimated from the 2.5 and 97.5 empirical percentiles of the distribution for the healthy population; thus, 95% of healthy patients are located within the interval limits (Wright and Royston 1999). Those patients falling outside the reference interval, are likely to have an undiagnosed disease. If the test results are influenced by some patient characteristics independent of the disease (e.g., age and gender), reference intervals for specific patient groups must be obtained. These covariate-dependent reference intervals, usually termed reference curves, are estimated using quantile regression (Koenker and Bassett Jr 1978) or location-scale models (Cole and Green 1992; Stasinopoulos et al. 2017). Several R packages for estimating reference intervals and reference curves already exist, including the R package referenceIntervals, which comprises a collection of tools, the R package gamlss (Stasinopoulos et al. 2007), which provides a general tool for deriving reference curves in clinical practice (WHO 2006), and software RefCurv (Winkler et al. 2019), recently proposed to facilitate clinicians’ use of gamlss. However, all these packages were produced to provide reference intervals for single tests; they cannot address diseases for which diagnosis and control are based on multiple tests.

When the results of several tests are available for the same patient, obtaining separate reference intervals for each one provides an incomplete picture of disease status, particularly when these results are strongly correlated (Boyd 2004). Although each reference interval would leave only 5% of healthy patients out, their combined use can result in a higher percentage of false positives. Moreover, a patient falling within each univariate reference interval might, in fact, show an abnormal multivariate result. Thus, a multivariate reference region (MVRs) would provide a better means of interpreting the results of multiple tests (Winkel and Lyngbye 1972). MVRs are a straightforward extension of the univariate reference interval to the multidimensional setting, i.e., a region that contains 95% of the healthy patients’ results. However, despite being proposed more than 40 years ago, MVRs are rarely used in clinical practice. This might be explained by the multivariate Gaussian assumption of MVRs, which is quite restrictive when interpreting diagnostic test results. Further, the multivariate distribution of test results is usually affected by patient characteristics, independent of their health status. For example, (Espasandı́n-Domı́nguez et al. 2019) showed that the correlation between two diagnostic tests for diabetes was influenced by patient age and red blood cell turnover, independent of glycemia status. A conditional MVR is therefore desirable, but the statistical literature is not rich in such proposals (see, e.g., (Wei 2008)).

Very few statistical software routines have been proposed for estimating the region containing a specific percentage of multivariate data points. The function mvtol.region in the R package tolerance (Young 2010) obtains a region containing a high percentage of a bivariate Gaussian distribution in the context of quality control studies, and non-parametric probabilistic regions can be obtained using the R packages r2d2 (Magnusson and Burgos 2014), hdrcde (Hyndman 2018) and (Hu and Yang 2018). However, these all suffer the major limitation of not being able to estimate the effects of covariates on the region’s shape. The R package modQR can estimate conditional multivariate quantiles, but the quantile level \(\tau\) is not linked to the probability content of the sample (Šiman and Boček 2019). Thus, it is not clear how to derive a reference region from these bivariate quantiles.

The present paper introduces refreg, an implementation in R of a new statistical methodology for estimating bivariate reference regions able to classify subjects as having normal or abnormal values based on the results of two continuous diagnostic tests. The main advantages of the presented method are; i) the absence of parametric restrictions for describing bivariate distributions for continuous tests, and ii) the possibility of estimating the effects of covariates on the shape of the reference region via flexible additive predictors. To illustrate this statistical methodology, and how to use the package, an age-specific reference region for two diabetes diagnostic tests was estimated. The estimated reference region offers new insights into the diagnosis and prognosis of diabetes, enabling physicians to identify different patients’ profiles. The proposed method can, however, be applied to any disease in which two continuous diagnostic tests are available, and can even be used in non-medical fields. Indeed, an application is discussed in which the conditional region is used in the joint forecasting of the concentrations of two air pollutants. Moreover, the current implementation can be easily extended to three dimensions.

The statistical model that enables conditional reference regions to be determined is presented in the next section. The main functions contained in the refreg package are then described, with a brief introduction to the main functions. The use of the package is shown in analyses of real medical and environmental data problems. The paper closes with some comments, and some notes on future research directions.

2 Statistical methodology

In this section the main features of our statistical method is presented. In a nutshell, our proposal is based on a bivariate location scale model, where the response means, and their variance-covariance matrix, are related to covariates using flexible additive predictors. The probabilistic region covering a specific percentage of the data is firstly estimated using the model standardized residuals. Then, it is generalized for each covariate value based on the aforementioned bivariate location-scale model fit. This statistical model was already presented and evaluated in (Roca-Pardiñas et al. 2020).

Conditional reference region

Given a bivariate continuous random variable of interest \({\bf Y} = (Y_1, Y_2)\), and a vector of covariates \(\bf{X}=(X_1, \ldots, X_p)\) we consider the following structure: \[\left( { \begin{array} {c} Y_1 \\ Y_2\ \end{array} }\right) = \left( { \begin{array} {c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X})\ \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \left( { \begin{array} {c} \varepsilon_1 \\ \varepsilon_2\\ \end{array} }\right) \label{modelo} \tag{1}\]

where \(\mu_1(\textbf{X})\) and \(\mu_2(\textbf{X})\) represents the conditional means of both responses and \(\boldsymbol{\Sigma}^{1/2} (\textbf{X})\) the Cholesky decomposition of the variance-covariance matrix \[\boldsymbol{\Sigma} (\textbf{X})=\left({ \begin{array} {cc} \sigma_1^2 (\textbf{X}) & \sigma_{12} (\textbf{X})\\ \sigma_{12} (\textbf{X}) & \sigma_2^2 (\textbf{X}) \\ \end{array}}\right) \label{mod1} \tag{2}\] so that \(\text{Var} (\bf Y|\textbf{X})=\boldsymbol{\Sigma} (\textbf{X})= \boldsymbol{\Sigma}^{1/2} (\textbf{X})\left({\boldsymbol{\Sigma}^{1/2} (\textbf{X})}\right)^T\). In order to guarantee the model identification ((1)), the bivariate residuals \((\varepsilon_1,\varepsilon_2)\) are assumed to be independent of the covariates, with zero mean, unit variance, and zero correlation.

We consider additive structures for the mean functions \(\mu_j({\bf X})\), variance functions \(\sigma_j^2({\bf X})\) (\(j=1,2\)) and the correlation function \(\rho({\bf X})\) – note that \(\sigma_{12} (\textbf{X})= \hat \sigma_1(\textbf{X}) \hat \sigma_{2} (\textbf{X}) \hat \rho (\textbf{X})\). These structures are given, respectively, by \[\mu_r({\bf X})=\alpha_r + \sum_{j=1}^p{f_{jr}(X_j)}, \quad \sigma_r^2({\bf X})=H_\sigma\left({\beta_r+ \sum_{j=1}^p{g_{jr}(X_j)}}\right) \quad \text{for } r=1,2\] and \[\rho({\bf X}) =H_\rho \left({\gamma+ \sum_{j=1}^p{m_{j}(X_j)}}\right)\] where \(\alpha\), \(\beta\) and \(\gamma\) are parametric coefficients and \(f_{jr}\), \(g_{jr}\) and \(m_j\) for \(j = 1,\cdots,p\) and \(r=1,2\) are smooth and unknown functions. \(H_\sigma (\cdot)=\exp(\cdot)\) and \(H_\rho(\cdot)=\tanh(\cdot)\) are link functions used in the variance and correlation structures, respectively, to ensure that the restrictions on the parameter spaces (\(\sigma_r^2({\bf X}) \geq 0\) and \(0\leq \rho({\bf X}) \leq 1)\) are maintained.

Based on the model presented in equation ((1)), for a given \(\textbf{X}\) the conditional \(\tau^{th}\)- reference region for \((Y_1, Y_2)\) is given by: \[R_{\tau}(\textbf{X}) = \left( { \begin{array} {c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X})\ \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \varepsilon_{\tau} \label{rexion_x} \tag{3}\] where \(\varepsilon_{\tau}\) is the unconditionally probabilistic region for the errors \((\varepsilon_1,\varepsilon_2)\) as \[\varepsilon_{\tau} (k) = \{(\varepsilon_1,\varepsilon_2) \in{R}^2 | f(\varepsilon_1,\varepsilon_2) \leq k\} \label{rexion1} \tag{4}\] \(f\) being the density function of the bivariate residuals \((\varepsilon_1,\varepsilon_2)\) and \(k\) is the \(\tau-\)quantile of \(f(\varepsilon_1,\varepsilon_2)\).

Estimation algorithm

In this section, we present the estimation procedure of the conditioned bivariate uncertainty region given in equation ((3)). Our approach is based on the estimation of the covariate effects on the response means using an additive model, and then on the variance-covariance matrix using the squared residuals of the former models. Finally, the bivariate region \(R_{\tau}(\textbf{X})\) is obtained with a bivariate kernel estimation of the standardized bivariate residuals.

The steps of the proposed estimation algorithm are the following:

Step 1: For \(r=1,2\) fit an additive model to the sample \(\{{\bf X}_i, {\bf Y}_{ir}\}_{i=1}^n\) and obtain the estimates \[\hat \mu_{r}({ \bf X}_i)=\hat \alpha+ \sum_{j=1}^p \hat f_{jr}(X_{ij}) \label{am1} \tag{5}\] Then, estimate \(\sigma_{r}^2(\textbf{X})\) from the sample \(\{{\bf X_i}, ({\bf Y}_{ir} - \hat \mu_r({\bf X}_i))^2\}_{i=1}^n\) as \[\hat \sigma_{r}^2(\textbf{X}_i)= \hat \beta_r + \sum_{j=1}^p \hat g_{jr}(X_{ij}) \label{am2} \tag{6}\] Step 2: Compute the correlation \(\rho (\textbf{X})\), using the sample \(\{{\bf X}_i, \hat{\delta}_i\}_{i=1}^n\), as \[\hat \rho (\textbf{X}_i)= \tanh \left( {\hat \gamma + \sum_{j=1}^p \hat m_{jr}(X_{ij})}\right)\] where \[\hat{\delta}_i = \frac{\left( Y_{i1}-\hat{\mu}_1(\textbf{X}_i)\right)\left(Y_{i2}-\hat{\mu}_2(\textbf{X}_i) \right)}{\hat{\sigma}_1(\textbf{X}_i)\hat{\sigma}_2(\textbf{X}_i)}\] Step 3: Compute the standardized residuals \[\left({ \begin{array} {c} \hat{\varepsilon}_{i1} \\ \hat{\varepsilon}_{i2}\\ \end{array} }\right) = \hat{\Sigma}^{-1/2}(\textbf{X}_i) \left({ \begin{array} {c} Y_{i1} - \hat{\mu}_1(\textbf{X}_i)\\ Y_{i2} - \hat{\mu}_2(\textbf{X}_i) \\ \end{array}}\right) \quad i=1,\ldots,n\] and obtain the kernel estimation of the bivariate density \(\hat{f}(\varepsilon_1,\varepsilon_2)\) given by \[\hat{f}((\varepsilon_{1},\varepsilon_{2}),\textbf{H}) = \frac{1}{n}\sum_{i=1}^{n}K_{\textbf{H}} \left({ \begin{array} {c} \varepsilon_1 - \hat{\varepsilon}_{i1} \\ \varepsilon_2 - \hat{\varepsilon}_{i2}\\ \end{array} }\right)\] where \(K(\cdot)\) is the kernel which is a symmetric probability density function and \(\textbf{H}\) is a \(2 \times 2\) positive definite matrix. Then, obtain the \(\tau^{th}\) unconditional bivariate uncertainty region on the residual scale as \[\hat{\varepsilon}_{\tau} = \{(\varepsilon_{1},\varepsilon_{2})) \in \mathbb{R}^2 | \hat{f}(\varepsilon_{1},\varepsilon_{2}))\leq\hat{k}\} \label{step4} \tag{7}\] \(\hat k\) being the empirical \(\tau\) quantile of the values \(\hat f(\varepsilon_{11},\varepsilon_{12}), \ldots, \hat f(\varepsilon_{n1},\varepsilon_{n2})\).

Finally, for a given \(\bf X\), the conditional bivariate uncertainty region \({R}_{\tau}(\textbf{X})\) is estimated by \[\hat{R}_{\tau}(\textbf{X}) = \left({ \begin{array} {c} \hat{\mu}_1(\textbf{X}) \\ \hat{\mu}_2(\textbf{X})\ \end{array} }\right) + \boldsymbol{ \hat \Sigma}^{1/2}(\textbf{X}) \hat{\varepsilon}_{\tau} \label{step5} \tag{8}\]

Flexible additive models estimation and inference

The continuous covariates smooth effects (\(f_{jr}\), \(g_{jr}\) and \(m_j\) for \(j = 1,\cdots,p\)) may be estimated using several non-parametric regression techniques. In previous works we applied polynomial kernel smoothers (Roca-Pardiñas et al. 2020). However, for sake of usage simplicity and computational cost, in the final package implementation we used a penalized spline basis representation following (Wood 2017). Thus, given an unknown smooth effect (e.g. \(f(x)\)) is estimated as: \[f(x) = \sum_{k = 1}^{K} \beta_k b_k (x)\] Confidence intervals for the estimated effects may be obtained using a bootstrap procedure. Given a specific vector of covariates \(\textbf{X}_0\), for the components (means, variances and correlation). The steps for construction of the bootstrap confidence intervals are:

Step 1. From the sample data \(\{(Y_{i1},Y_{i2}),{\bf X}_i\}_{i=1}^n\) obtain the estimates \(\hat{\mu}_r({\bf X}_0)\), \(\hat{\sigma}_{r}({\bf X}_0)\) \((r=1,2)\) and \(\hat{\rho}({\bf X}_0)\).

Step 2. For \(b=1,\ldots,B\) generate bootstrap samples \(\{(Y_{i1}^\bullet,Y_{i2}^\bullet),{\bf X}_i\}_{i=1}^n\) with \[\left( { \begin{array} {c} Y_{i1}^\bullet \\ Y_{i2}^\bullet \\ \end{array}}\right) = \left( { \begin{array} {c} \hat \mu_1(\textbf{X}_i) \\ \hat \mu_2(\textbf{X}_i)\ \end{array} }\right) + \boldsymbol{\hat \Sigma}^{1/2} (\textbf{X}_i) \left( { \begin{array} {c} \hat \varepsilon_{i1}^\bullet \\ \hat \varepsilon_{i2}^\bullet\\ \end{array} }\right) \label{boot} \tag{9}\] where \(\left \{ {(\hat{\varepsilon}_{i1}^\bullet,\hat{\varepsilon}_{i2}^\bullet)}\right\}_{i=1}^n\) is a sample of size \(n\) from the residuals \(\left \{ {(\hat{\varepsilon}_{i1},\hat{\varepsilon}_{i2})}\right\}_{i=1}^n\) with replacement, and compute \(\hat{\mu}_r^{\bullet b}({\bf X}_0)\), \(\hat{\sigma}_{r}^{\bullet b}({\bf X}_0)\) and \(\hat{\rho}^{\bullet b}({\bf X}_0)\) as in Step 1.

The limits for the \(100(1-\alpha)\%\) confidence intervals of the true components \({\mu}_r(\bf X_0)\), \({\sigma}_r(\bf X_0)\) and \(\rho(\bf X_0)\) are given respectively by \(\left({\hat {\mu}^{\alpha/2}_r( \textbf{X}_0), \hat {\mu}^{1-\alpha/2}_r(\textbf{X}_0)}\right)\), \(\left({\hat {\sigma}^{\alpha/2}_r(\textbf{X}_0), \hat {\sigma}^{1-\alpha/2}_r(\textbf{X}_0)}\right)\) and \(\left({\hat {\rho}^{\alpha/2}(\textbf{X}_0), \hat {\rho}^{1-\alpha/2}(\textbf{X}_0)}\right)\), where \(\hat {\mu}_r^p(\textbf{X}_0)\) represents the \(p\)-percentile of \(\hat{\mu}_r^{\bullet 1}(\textbf{X}_0), \ldots, \hat{\mu}_r^{\bullet B}(\textbf{X}_0)\), \(\hat {\sigma}_r^p(\textbf{X}_0)\) represents the \(p\)-percentile of \(\hat{\sigma}_r^{\bullet 1}(\textbf{X}_0), \ldots, \hat{\sigma}_r^{\bullet B}(\textbf{X}_0)\), and \(\hat {\rho}^p(\textbf{X}_0)\) is the \(p\)-percentile of \(\hat{\rho}^{\bullet 1}(\textbf{X}_0), \ldots, \hat{\rho}^{\bullet B}(\textbf{X}_0)\).

Bivariate residuals density estimation

The estimation of the unconditionally probabilistic region for the bivariate errors \((\varepsilon_1,\varepsilon_2)\) is based on a kernel density estimator. This estimator is given by: \[\hat{f}((\varepsilon_{1},\varepsilon_{2}),\textbf{H}) = \frac{1}{n}\sum_{i=1}^{n}K_{\textbf{H}} \left({ \begin{array} {c} \varepsilon_1 - \hat{\varepsilon}_{i1} \\ \varepsilon_2 - \hat{\varepsilon}_{i2}\\ \end{array} }\right)\] where H is a matrix defining the kernel bandwidth \[\textbf{H} = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix}\]

graphic without alt text
Figure 1: Estimated reference regions for plug-in and best coverage bandwidths for non-gaussian data. In black we represent true regions and in red the estimated ones for \(\tau = 0.50, 0.95\). Best coverage bandwidth offers a smoother region than plug-in estimator.

The selection of \(\bf H\) is crucial to obtain a good estimation of \(\varepsilon_{\tau}\). A natural option is to use a plug-in or cross-validation bandwidth estimator, as in a density estimation problem \[\ \underset{\textbf{H} \in \mathcal{H}}{\operatorname{arg\min} } \textsf{E}(\iint \left(\hat{f}_\textbf{H}(\varepsilon_{1},\varepsilon_{2}) - f_\textbf{H}(\varepsilon_{1},\varepsilon_{2})^{(-i)} \right)^2 d\varepsilon_{1}d\varepsilon_{2} \label{CV} \tag{10}\] where \(\hat{f}_\textbf{H}(\varepsilon_{1},\varepsilon_{2})^{-i}\) is the estimated bivariate density function without the i-th observation. However, as we seek for a probabilistic region (i.e. a region which contains a given percentage of the multivariate data), the following selection criteria based on the region coverage is proposed \[\hat{\lambda} = \underset{\lambda}{\operatorname{arg\min} } \left|\left(n^{-1}\sum_{i=1}^n I \{(Y_{i1},Y_{i2})\in R^{(-i)}(\textbf{X}_i)\}\right) - \tau\right| \label{CV1} \tag{11}\] where \(\tau\) is the desired coverage and \(\hat{R}_{\tau}^{(-i)}(X_i)\) is the estimated bivariate region without the i-th observation. Using this criteria the estimated region show a smoother contour and a coverage of the data points closer to the desired one \(\tau\) (see Figure 1). Given the high computational cost of the regular proposed method in ((10)), a k-fold cross-validation scheme could be used instead. Moreover we simplify the minimization problem by considering \(h_{11}=h_{22}\) and \(h_{12}=0\).

3 Overview of the package

The refreg package contains a set of functions for estimating a conditional reference, or uncertainty, region. Its working framework was designed so that people without a strong statistical background can use it. Indeed, only two functions need to be taken into account by the user: 1) the effects of the predictor variables on responses need to be estimated using the bivRegr function, a step that requires the user choose which variables may influence the region; 2) bivRegion needs to be applied to a bivRegr object so that the reference region can be estimated.

The bivRegr() function has the following structure:

bivRegr(f = formulas, data = data)

The f argument contains a list of five R formulae corresponding to the additive predictors for the means, variances and correlation models shown in equation (1). Since bivRegr() uses mgcv::gam() internally, the user can estimate covariate linear and non-linear effects using s() operator. For instance:

mu1 <- y1 ~ s(x1)
mu2 <- y2 ~ s(x1)
var1 <- ~ x2
var2 <- ~ x2
rho <- ~ s(x3)
formula = list(mu1,mu2,var1,var2,rho)

assumes a smooth effect of x1 on the response means, a parametric effect of x2 on their variances, and a smooth effect of x3 on the response correlation.

The bivRegion() function is designed for non-parametrically estimating a bivariate reference region:

bivRegion(object, tau = 0.95, bandwidth = "plug-in")

The object may be a set of bivariate data points, or a bivRegr object, while tau defines the desired coverage(s) for the reference region, which might be a single value or a vector. Finally, “bandwidth” specifies the kernel bandwidth selection method. The user can chose between the plug-in, cross-validation, or the best coverage method (see equation (13)).

Additionally, we defined S3 methods for these two main functions. Specifically, associated with bivRegr we have

On the other hand, we defined the following S3 methods associated to bivRegion:

In addition, we define the functions trivRegr, trivRegion and plot.trivRegion as an extension of the aforementioned method for a trivariate response variable. Finally, our package also contains some inner functions as ace (for estimating variance, and correlation models), Hcv (it implements equation (11) method), and refcurve (it implements an univariate location-scale model).

4 Refreg in practice

This section outlines the implemented functions of the proposed package in detail, and illustrates their use with real datasets. The first illustration is related to diabetes research, in which a reference region for the joint interpretation of two glycemia tests is calculated. In the second illustration, refreg methodology is used to predict the concentrations of two air pollutants during a pollution episode. Finally, the extension of the method to higher dimensions is shown using real data.

Case 1: Glycemic tests for diabetes diagnosis

Diabetes is a chronic disease, the diagnosis of which is based on two glycemia tests: the fasting plasma glucose (FPG) and glycated hemoglobin (HbA1c)(American Diabetes Association 2019) tests. The multivariate interpretation of FPG and HbA1c results is desirable for two reasons: i) the results of both tests are correlated in healthy patients (Aleyassine et al. 1980), ii) a miss-match between them may be indicative of a poorer prognosis (Kim et al. 2018). Finally, it is well known that both test results are influenced by patient age (Davidson 1979; Pani et al. 2008).

The age-dependent reference region for the FPG and HbA1c tests was estimated using a sample of healthy subjects derived from the A-Estrada Glycation and Inflammation Study (AEGIS) (see (Gude et al. 2017)). A subset of this dataset is available in the package under the name “AEGIS”.

This dataset comprised 1516 subjects and 7 variables:

Applying the summary() routine to the aegis dataset indicated 55% of the subjects to be female, the mean age of all 1516 subjects to be 52 years (range 18-91), and that 187 subjects (12%) had been previously diagnosed with diabetes.

R> summary(aegis)
       id            gender         age          dm            fpg        
Min.   :   1.0   female:838   Min.   :18.00   no :1329   Min.   : 63.00  
1st Qu.: 379.8   male  :678   1st Qu.:39.00   yes: 187   1st Qu.: 82.00  
Median : 758.5                Median :52.00              Median : 89.00  
Mean   : 758.5                Mean   :52.58              Mean   : 94.51  
3rd Qu.:1137.2                3rd Qu.:67.00              3rd Qu.:100.00  
Max.   :1516.0                Max.   :91.00              Max.   :274.00  

hba1c             fru       
Min.   : 3.900   Min.   :119.0  
1st Qu.: 5.200   1st Qu.:225.0  
Median : 5.400   Median :254.0  
Mean   : 5.608   Mean   :262.2  
3rd Qu.: 5.700   3rd Qu.:284.0  
Max.   :10.200   Max.   :700.0  

To estimate the reference region, a subset of the patients not previously diagnosed with diabetes was define as dm_no. This subset sample was deemed the healthy patient sample.

R> dm_no = subset(aegis,aegis$dm == "no")
R> dm_yes = subset(aegis,aegis$dm == "yes")

To estimate the effect of age on the final region shape, the bivRegr() function was used. This function implements the estimation process of the bivariate location-scale:

R> mu1 = fpg ~ s(age)
R> mu2 = hba1c ~ s(age)
R> var1 = ~ s(age)
R> var2 = ~ s(age)
R> rho = ~ s(age)
R> formula = list(mu1,mu2,var1,var2,rho)

The first and second formulae define the additive models for the mean values of both glycemia tests. The third and fourth define the additive models for test result variability. The last formula represents the additive model that comprises the effect of age on the correlation between the results of both glycemia tests. In addition to the model formulae list, a dataset including both the test results and subject’s age must be supplied to bivRegr() as:

R> fit = bivRegr(formula,data=dm_no)
graphic without alt text
Figure 2: Estimated effects of age on the FPG and HbA1c mean, variance models, and on their correlation. Output from summary_boot, the shaded area is the 95% pointwise confidence interval obtained by bootstrap resampling. The parameters of our bivariate response change with age.

By applying the S3 method plot() to a bivRegr object, the estimated effects of covariates can be shown for each submodel. The argument eq= controls the model component to be represented (1 = FPG mean, 2 = HbA1c mean, 3 = FPG variance, 4 = HbA1c variance, and 5 = [FPG – HbA1c] correlation). Moreover, the function summary_boot() may be applied to a bivRegr object to obtain the 95% pointwise confidence interval for the estimated effects via bootstrapping:

R> fit_boot = summary_boot(fit, B=250, parallel = TRUE )
R> plot(fit_boot,eq=1)
R> plot(fit_boot,eq=2)
R> plot(fit_boot,eq=3)
R> plot(fit_boot,eq=4)
R> plot(fit_boot,eq=5) 

Since bootstrap resampling (introduced in equation ((9))) is time consuming, the user can fix parallel = TRUE and run a parallelized computation. The parallel backend is registered using doParallel (Microsoft and Weston 2020a), and the parallel computation is performed by foreach (Microsoft and Weston 2020b).

Figure 2 shows that the mean values of both glycemia markers increase almost linearly with age. FPG variance increases from 20 to 40 years, while the HbA1c variance increases linearly with age. Finally, the correlation between the FPG and HbA1c concentration seems to be stronger for older patients.

graphic without alt text
Figure 3: Estimated region in the bivariate residuals scale for healthy patients (left), and patients with diabetes (right). Green contour represents the region for \(\tau = 0.50\), while red dashed contour for \(\tau = 0.95\).

Applying the function bivRegion() to a bivRegr object provides a bivariate region containing 100\(\tau\)% of the model standardized residuals. This region is based on a bivariate kernel density estimator. The kernel bandwidth selection method may be chosen with the H_choice argument. Here, the 90%, 95% and 97.5% regions are obtained with the best coverage bandwidth selector (see equation (10)):

R> region = bivRegion(fit,tau=c(0.90,0.95,0.975),H_choice = "Hcov")

This region facilitates a multivariate interpretation of the glycemia test results. A patient whose results are “normal”, for his/her age would see them fall inside this reference region, while a subject with “abnormal” results for his/her age would not. This interpretation is possible because the model residuals are centered around zero, show unit variance, no correlation, and they are independent of age. The user can check test results located outside the reference region using the bivRegion S3 method plot:

R> par(mfrow = c(1, 2))
R> plot(region, xlab = "FPG, mg/dL", ylab = "HbA1c, \%",cond=T, newdata = 
 data.frame (age = c(20,30,40,50,60,70)),tau=0.95,reg.lwd=2, pch="*",col="grey")

R> plot(region,xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = 
   data.frame(age = c(20,60)),tau=c(0.50,0.95), reg.lwd=2, pch="*", col="grey")

Figure 3 shows the unconditional reference region for \(\tau = 0.90, 0.95\) and \(0.975\) for healthy patients, and those previously diagnosed with diabetes. Note that the plot() function argument ‘newdata’ allows the glycemia test values to be observed in the standardized residuals scale of the dataset for the patients with diabetes. As is clearly seen, most of healthy patients’ results are located inside the reference region, while those recorded for diabetic patients are located outside.

A major advantage of this representation is that it allows clinicians new insights into the subject’s glycemia status. Indeed, those patients located outside the reference region may be classified into four groups: (I) individuals with high values for both tests (first quadrant); (II) those with discordant results, with high HbA1c concentrations and low/medium FPG (second quadrant); (III) individuals with low values for both tests (third quadrant); and (IV) individuals with low/medium HbA1c concentrations and high FPG values (fourth quadrant). This distinction might be useful for physicians. For instance, discordant results are probably due to an altered bloodstream protein glycation rate, a condition associated with a poorer prognosis. Patients located outside the standardized region may be also checked applying summary() to a bivRegion object:

R> summary(region,tau = 0.95)

This R output presents patients located outside the standardized bivariate region for \(\tau = 0.95\). Note that, in the full sample, patients with different ages were located outside the reference region. The glycemia tests results located outside the reference region are interesting from a clinical point of view. For example, the following were seen: a possible case of undiagnosed diabetes in a 20 year old patient (FPG = 99, HbA1c = 6.3); a 47 year old patient showing a high HbA1c value for his corresponding FPG result (FPG = 86, HbA1c =6); and a patient of 85 years in the opposite situation (FPG = 120, HbA1c = 5.7).

graphic without alt text
Figure 4: Predicted reference regions for different ages. Solid line contour represents the reference region for \(\tau = 0.50\), and the dashed line contour for \(\tau = 0.95\). Toprow plots are depicted by plot.bivRegion function setting the arguments cond=TRUE and add=FALSE for several ages, and bottomrow ones with cond=TRUE and add=TRUE in a pre-existing plot. The estimated regions change with age and it describes the shape of the observed data points.

The use of this region in combination with the results of the bivariate location-scale model allow the conditional reference regions to be obtained. The user can visualize these regions using the bivRegion S3 method plot(), setting cond = TRUE as follows:

R> plot(region, xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = data.frame(age =
   c(20,30,40,50,60,70)),tau=0.95,reg.lwd=2, pch="*",col="grey")

R> plot(region,xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = data.frame(age =
   c(20,60)),tau=c(0.50,0.95),reg.lwd=2, pch="*",col="grey")    

In addition, the region may be represented in a pre-existing plot if the plot function argument add is equal to TRUE as in the following code:

R> plot(dm_no[dm_no$age==40,"fpg"],dm_no[dm_no$age==40,"hba1c"],main="40 years",
   xlim=c(50,140), ylim=c(4.2,7.5),xlab = "FPG, mg/dL", ylab = "HbA1c, %", pch=20,cex=2)
R> plot(region,cond=T,newdata = data.frame(age = 40),add=T,legend=F, tau=c(0.50,0.95),

R> plot(dm_no[dm_no$age==60,"fpg"],dm_no[dm_no$age==60,"hba1c"],main="60 years", 
   xlim=c(50,140), ylim=c(4.2,7.5),xlab = "FPG, mg/dL", ylab = "HbA1c, %", pch=20,cex=2)
R> plot(region,cond=T,newdata = data.frame(age = 60),add=T,legend=F, tau=c(0.50,0.95),

In Figure 4, the bivariate reference region is shown for several ages. Note that the regions shift towards the upper right corner and expand as age increase. This agrees with the non-linear effect of age on the expected means and variability of both markers. The conditional region coverage and the performance of the methodology have already been assessed (Lado-Baleato et al. 2021).

Extension of case 1, conditional reference region for a trivariate response

This section provides an example of how the method proposed in equation (3) might be extended to more than two dimensions. This section is intended to provide a proof of concept rather than a formal statistical proposal. For a trivariate variable (\(Y_1, Y_2, Y_3\)) the following model can be assumed: \[\left( { \begin{array} {c} Y_1 \\ Y_2\\ Y_3\ \end{array} }\right) = \left( { \begin{array} {c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X})\\ \mu_3(\textbf{X})\ \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \left( { \begin{array} {c} \varepsilon_1 \\ \varepsilon_2\\ \varepsilon_3\\ \end{array} }\right) \label{mod3} \tag{12}\] where \(\{\mu_r(\textbf{X})\}_{r=1}^3\) represents the conditional means of each response, and \(\boldsymbol{\Sigma}^{1/2} (\textbf{X})\) the Cholesky decomposition of the variance-covariance matrix \[\boldsymbol{\Sigma} (\textbf{X})=\left({ \begin{array} {ccc} \sigma_1^2 (\textbf{X}) & \sigma_{21} (\textbf{X})& \sigma_{31} (\textbf{X})\\ \sigma_{12} (\textbf{X}) & \sigma_2^2 (\textbf{X})& \sigma_{23} (\textbf{X}) \\ \sigma_{13} (\textbf{X}) & \sigma_{32} (\textbf{X}) & \sigma_3^2 (\textbf{X}) \\ \end{array}}\right)\] In the trivariate case, the estimated variance-covariance matrix can be non-positive-definite. Thus, \(\hat{\boldsymbol{\Sigma}}\) is modified by applying the unweighted bending method of (Schaeffer 2014) as implemented in the mbend R package (Nilforooshan 2020).

Following equation ((12)), a trivariate reference region may be estimated as: \[R_{\tau}(\textbf{X}) = \left( { \begin{array}{c} \mu_1(\textbf{X}) \\ \mu_2(\textbf{X}) \\ \mu_3(\textbf{X}) \end{array} }\right) + \boldsymbol{\Sigma}^{1/2} (\textbf{X}) \varepsilon_{\tau}\] where \(\varepsilon_{\tau}\) is the unconditionally probabilistic region for the errors \((\varepsilon_1,\varepsilon_2,\varepsilon_{3})\) as \[\varepsilon_{\tau} (k) = \{(\varepsilon_1,\varepsilon_2,\varepsilon_{3}) \in{\mathbb{R}}^3 | f(\varepsilon_1,\varepsilon_2,\varepsilon_{3}) \leq k\} \label{rexion_3} \tag{13}\] \(f\) being the density function of the trivariate residuals \((\varepsilon_1,\varepsilon_2,\varepsilon_3)\) and \(k\) is the \(\tau-\)quantile of \(f(\varepsilon_1,\varepsilon_2,\varepsilon_3)\).

Using this model, the application of the methodology for diabetes diagnosis can be extended by incorporating an additional glycemia test. This extension is justified since other glycated proteins are routinely monitored in diabetes control besides FPG and HbA1c. For instance, in conditions that determine alterations in hemoglobin metabolism (e.g., anemia or kidney disease), fructosamine (Fr) is frequently used as an additional marker. Nevertheless, the translation of Fr into average glucose levels is not as clear as for HbA1c, and discordances are often encountered between the Fr and HbA1c results. In addition, agreement among these glycemia markers can be affected by factors such as patient age.