The paper presents a new R package qape for prediction, the estimation of the accuracy of various predictors, and Monte Carlo simulation studies of the properties of both predictors and estimators of accuracy measures. It makes it possible to predict any population and subpopulation characteristic of the response variable based on the Linear Mixed Model (LMM). The response variable can be transformed, e.g. logarithmically, and the data can be cross-sectional or longitudinal. Three bootstrap algorithms are developed for estimating the prediction accuracy: parametric, residual and double. Analyses can also include Monte Carlo simulation studies of the properties of the methods used. Unlike other packages, in the prediction process the user can flexibly define the predictor, the model, the transformation function of the response variable, the predicted characteristics and the method of accuracy estimation.
One of the tasks in applying mixed models to real-life problems is the prediction of random effects. The predicted values then make further prediction possible, e.g. of characteristics of interest such as the sum, mean or quantiles, or of the future value of the response variable for cross-sectional or longitudinal data.
Three main predictors of these characteristics are proposed in the literature: Empirical Best Linear Unbiased Predictors - EBLUPs (see e.g. (Henderson 1950) and (Royall 1976)), PLUG-IN predictors (see e.g. (Boubeta et al. 2016), (Chwila and Żądło 2019), (Hobza and Morales 2016)) and Empirical Best Predictors - EBPs (see e.g. (Molina and Rao 2010)). All of them assume an LMM for the response variable.
Numerous successful applications of these three predictors for cross-sectional and longitudinal data can be found in the model approach in survey sampling, including small area estimation. In (Fay III and Herriot 1979) the authors introduce the prediction of the mean income for small places based on a special case of the LMM called the Fay-Herriot model and the EBLUP. The analysis of poverty is extended in many works, e.g. in (Molina and Rao 2010) and (Christiaensen et al. 2012). In turn, in (Battese et al. 1988) the authors analyse total crop areas based on survey and satellite data using EBLUPs. The proposed LMM is known as the Battese-Harter-Fuller model. The predictors are also exploited in experience rating in non-life insurance, see (Frees et al. 1999) and (Bühlmann and Gisler 2005), where longitudinal data are under consideration and the insurance premium for the next period is predicted for every policy in the insurance portfolio.
A major challenge in this type of prediction is the estimation of the prediction accuracy measure. Most often it is the Root Mean Squared Error (RMSE), which can be given in analytical form or estimated, e.g. using the bootstrap. The distribution of squared prediction errors, however, is usually very strongly positively asymmetric. Because the mean is not recommended as a measure of central tendency in such distributions, an alternative prediction accuracy measure called the Quantile of Absolute Prediction Errors (QAPE), proposed by (Żądło 2013) and (Wolny-Dominiak and Żądło 2020), can be applied.
There is a variety of R packages that compute the considered predictors together with an estimate of prediction accuracy, usually the RMSE. The package sae, see (Molina and Marhuenda 2015), provides EBLUPs based on Fay-Herriot and Battese-Harter-Fuller models. In turn, the multivariate EBLUP for Fay-Herriot models is implemented in msae, see (Permatasari and Ubaidillah 2021). Several EBLUPs introduced in (Rao and Yu 1994) are implemented in the package saery introduced by (Lefler et al. 2014), likewise in JoSAE, see (Breidenbach 2018), but with additional heteroscedasticity analysis. The EBP is provided in the package emdi described in (Kreutzmann et al. 2019).
A new package in this area is our proposed package qape. It allows the prediction of flexibly defined characteristics of the response variable using the above three predictors, assuming an appropriate LMM. A novel feature of the package qape, compared to existing packages, is the bootstrap estimation of prediction accuracy measures, both the RMSE and the QAPE. Three types of bootstrap procedures are provided: parametric, residual and double.
There are three groups of functions in this package: calculation of predictor values, bootstrap estimation of the RMSE and QAPE measures, and Monte Carlo (MC) analysis of properties of predictors and prediction accuracy estimators. The prediction is based on an LMM defined by the user and allows the population characteristics of the response variable to be predicted; these can be defined by a linear combination (in the case of the EBLUP), by any R function (e.g. sum) or by any function defined by the user (in the case of the EBP and PLUG-IN predictors). The package allows for full flexibility in defining: the model, the predicted characteristic, and the transformation of the response variable.
This paper is organized as follows. Firstly, the background of the LMM is presented together with the theoretical foundations of the prediction, including prediction accuracy measures. Then, the package functionality in the area of prediction is presented and illustrated. A short application based on the radon data, a cross-sectional dataset available in the HLMdiag package, to predict three subpopulation characteristics is shown. Subsequently, the theoretical background of the estimation of prediction accuracy measures based on the bootstrap is presented. Implementations of bootstrap algorithms in qape are briefly introduced. Finally, the procedure of the model-based Monte Carlo simulation study is discussed. The paper ends with a conclusion.
We consider the problem of prediction of any given function of the population vector \(\mathbf{Y}\) of the response variable: \[\label{theta} \theta = f_{\theta}(\mathbf{Y}) \tag{1}\] under the LMM. It covers linear combinations of \(\mathbf{Y}\) (such as one future realization of the response variable or population and subpopulation means and totals) but also other population and subpopulation characteristics such as quantiles and variability measures.
To assess the accuracy of a particular predictor \(\hat \theta\), firstly, the prediction error is defined as \(U=\hat{\theta}-\theta\). Therefore, the well-known RMSE has the following formula: \[\label{eq0} RMSE(\hat{\theta})=\sqrt{E(\hat{\theta}-\theta)^{2}}=\sqrt{E({{U}^{2}})}. \tag{2}\] The alternative to the RMSE based on the mean could be the QAPE based on quantiles. It represents the \(p\)th quantile of the absolute prediction error \(|U|\), see (Żądło 2013) and (Wolny-Dominiak and Żądło 2020), and it is given by: \[\label{eq1} QAPE_p(\hat{\theta}) = \inf \left\{ {x:P\left( {\left| {{\hat{\theta}-\theta}} \right| \le x} \right) \ge p} \right\} =\inf \left\{ {x:P\left( {\left| {{U}} \right| \le x} \right) \ge p} \right\} \tag{3}\] This measure informs that at least \(100p\%\) of observed absolute prediction errors are smaller than or equal to \(QAPE_p(\hat{\theta})\), while at least \(100(1-p)\%\) of them are higher than or equal to \(QAPE_p(\hat{\theta})\). Quantiles reflect the relation between the magnitude of the error and the probability of its realization. Hence, by computing QAPEs of several orders, it is possible to describe the whole distribution of prediction errors instead of only its average (reflected by the RMSE). Furthermore, the MSE is the mean of positively (usually very strongly) skewed squared prediction errors, and the mean should not be used as a measure of the central tendency of positively skewed distributions.
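For illustration, both measures can be computed in R from a vector of realized prediction errors. In the following minimal sketch, the vector u is hypothetical and stands in, e.g., for bootstrap realizations of \(U\):
> # hypothetical realized prediction errors (e.g. bootstrap realizations of U):
> u <- c(-0.31, 0.05, 0.12, -0.02, 0.47, 0.09)
> sqrt(mean(u^2)) # empirical counterpart of the RMSE in ((2))
> quantile(abs(u), probs = c(0.75, 0.9)) # empirical QAPEs of orders 0.75 and 0.9, cf. ((3))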
The prediction accuracy measures described above, the RMSE and the QAPE, can be estimated using bootstrap techniques. Their estimators, as well as the bootstrap distributions of the prediction errors under any (assumed or misspecified) model, are provided in the qape package, including algorithms where parallel computing is used.
In the qape package, the whole prediction process has its own specific procedure, which can be presented in the following steps.
Procedure 1. The process of prediction, accuracy measures estimation and Monte Carlo simulation analyses in qape
define the characteristics of the response variable to predict,
provide the information on sample and population values,
define the LMM,
estimate parameters of the LMM,
predict the random variable \(\theta\) using the chosen class of predictors,
estimate the prediction accuracy measures RMSE and QAPE using one of the developed bootstrap algorithms,
conduct simulation analyses of properties of predictors and accuracy measure estimators under any (possibly misspecified) LMM.
The main functions of the qape package provide the bootstrap estimation of prediction accuracy measures. However, it must be preceded by the prediction process, including the choice of the LMM and the predictor.
Let \(\mathbf{Y}\) denote the vector of response variables \(Y_1, Y_2,..., Y_N\). Assuming, without loss of generality, that only the first \(n\) realizations of \(Y_i\) are observed, \(\mathbf{Y}\) can be decomposed as \(\mathbf{Y}= \begin{bmatrix} \mathbf{Y}_s^T & \mathbf{Y}_r^T \end{bmatrix}^T\), where \(\mathbf{Y}_s\) and \(\mathbf{Y}_r\) are of dimension \(n \times 1\) and \((N - n) \times 1\), respectively. In all notations, the subscript "s" is used for observed realizations of the variable of interest and "r" for the unobserved ones. Two known matrices of auxiliary variables are also considered, denoted by \(\mathbf{X}\) and \(\mathbf{Z}\), which are associated with fixed and random effects, respectively. The \(\mathbf{X}\) matrix is of dimension \(N \times p\), and it consists of \(p\) regression variables. It can be decomposed like \(\mathbf{Y}\) as follows: \(\mathbf{X}= \begin{bmatrix} \mathbf{X}_s^T & \mathbf{X}_r^T \end{bmatrix}^T\), where matrices \(\mathbf{X}_s\) and \(\mathbf{X}_r\), both known, are of dimension \(n \times p\) and \((N-n) \times p\), respectively. Similarly, the \(\mathbf{Z}\) matrix of dimension \(N \times h\) can be written as follows: \(\mathbf{Z}= \begin{bmatrix} \mathbf{Z}_s^T & \mathbf{Z}_r^T \end{bmatrix}^T\), where matrices \(\mathbf{Z}_s\) and \(\mathbf{Z}_r\), both known, are of dimension \(n \times h\) and \((N-n) \times h\), respectively.
Then, let \(LMM(\mathbf{X}, \mathbf{Z}, \boldsymbol{\psi})\) denote the LMM of the following form (e.g. (Rao and Molina 2015), p. 98): \[\label{LMM} \left\{ \begin{array}{c} \mathbf{Y}=\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{v}+\mathbf{e} \\ E(\mathbf{e})=\mathbf{0}, E(\mathbf{v})=\mathbf{0} \\ Var(\mathbf{e})=\mathbf{R}(\pmb{\delta}), Var(\mathbf{v})=\mathbf{G}(\pmb{\delta}) \end{array} \right. \tag{4}\] The vector of parameters in model ((4)) is then \(\boldsymbol{\psi}=\begin{bmatrix} \boldsymbol{\beta}^T & \pmb{\delta}^T \end{bmatrix}^T\), where \(\boldsymbol{\beta}\) is a vector of fixed effects of dimension \(p \times 1\) and \(\pmb{\delta}\) is a vector of variance components. The random part of the model is described by the known matrix \(\mathbf{Z}\), a vector \(\mathbf{v}\) of random effects of dimension \(h \times 1\) and a vector \(\mathbf{e}\) of random components of dimension \(N\times 1\), where \(\mathbf{e}\) and \(\mathbf{v}\) are assumed to be independent. The vector of random components \(\mathbf{e}\) is decomposed similarly to the vector \(\mathbf{Y}\), i.e. \(\mathbf{e}=\begin{bmatrix} \mathbf{e}_s^T & \mathbf{e}_r^T \end{bmatrix}^T\).
In the residual bootstrap implemented in qape, the LMM needs to be rewritten to take account of the specific structure of the data, i.e. the grouping variables included in the random part of the model. In this case, without loss of generality, the LMM can be written as follows: \[\label{LMMa} \mathbf{Y}=\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{v}_1+...+\mathbf{Z}_l\mathbf{v}_l+...+\mathbf{Z}_L\mathbf{v}_L+\mathbf{e}, \tag{5}\] where \(\mathbf{v}_1,\dots,\mathbf{v}_l,\dots,\mathbf{v}_L\) are independent vectors of random effects assumed for different divisions of the \(\mathbf{Y}\) vector (under different groupings of the data) and \(\mathbf{Z}_1, \dots, \mathbf{Z}_l, \dots, \mathbf{Z}_L\) are known matrices of auxiliary variables associated with the random effects. Writing \(\mathbf{Z}= \begin{bmatrix} \mathbf{Z}_1 & \dots & \mathbf{0} & \dots & \mathbf{0} \\ \vdots & \ddots & & & \vdots \\ \mathbf{0} & \dots & \mathbf{Z}_l & \dots & \mathbf{0} \\ \vdots & & & \ddots & \vdots \\ \mathbf{0} & \dots & \mathbf{0} & \dots & \mathbf{Z}_L \\ \end{bmatrix}\) and \(\mathbf{v}= \begin{bmatrix} \mathbf{v}_1^T & \dots & \mathbf{v}_l^T & \dots & \mathbf{v}_L^T \\ \end{bmatrix}^T\) in ((5)), the LMM ((4)) is obtained. Let
\[\label{vl} \mathbf{v}_l=\left[ \mathbf{v}_{l1}^T \dots \mathbf{v}_{lk}^T \dots \mathbf{v}_{lK_l}^T \right]^T \tag{6}\] be of dimension \(K_l J_l \times 1\), where \(\mathbf{v}_{lk}\) is of dimension \(J_l \times 1\) for all \(k=1,...,K_l\) and \(K_l\) is the number of random effects at the \(l\)th level of grouping. Hence, \(\mathbf{Z}_l\) is of dimension \(N \times K_l J_l\). For example, if a random regression coefficient model with two subpopulation-specific random coefficients is considered and \(D\) is the number of subpopulations, then \(L=1\), \(K_1=2\) and \(J_1=D\).
In the qape package, in the general case the predicted characteristic is given by any function of response variables: \[\label{ftheta} \theta = f_{\theta}(\mathbf{Y}). \tag{7}\] Under the \(LMM(\mathbf{X}, \mathbf{Z}, \boldsymbol{\psi})\) model it could be predicted using one of three predictors:
Empirical Best Linear Unbiased Predictor (EBLUP),
Empirical Best Predictor (EBP) under nested error LMM,
PLUG-IN predictor under the LMM.
The first predictor (EBLUP) allows the prediction of a linear combination of the response variables: \[\label{l.theta} \theta = f_{\theta}(\mathbf{Y}) = \boldsymbol{\gamma}^T \mathbf{Y}= \boldsymbol{\gamma}_s^T \mathbf{Y}_s + \boldsymbol{\gamma}_r^T \mathbf{Y}_r, \tag{8}\] where \(\boldsymbol{\gamma}\) is a vector of weights. For example, if one of the elements of \(\boldsymbol{\gamma}\) equals 1 and the rest of the elements equal 0, then one realization of the response variable is predicted. If all elements of the \(\boldsymbol{\gamma}\) vector equal 1, then \(\theta\) becomes the sum of all \(Y_i\)'s in the whole considered population dataset. The two-stage EBLUP corresponds to the Best Linear Unbiased Predictor (BLUP) introduced in (Henderson 1950) and (Royall 1976) as: \[\label{BLUP} \hat{\theta}^{BLUP} (\pmb{\delta}) = {\boldsymbol{\gamma}}_s^T \mathbf{Y}_s + \hat{\theta}_r(\pmb{\delta}), \tag{9}\] where the predictor of the linear combination \(\boldsymbol{\gamma}_r^T \mathbf{Y}_r\) of unobserved random variables is given by \(\hat{\theta}_r(\pmb{\delta})={\boldsymbol{\gamma }}_r^T {{\mathbf{X}}_r}{\tilde{\boldsymbol{\beta}} }(\pmb{\delta}) +\boldsymbol{\gamma }_r^T{\mathbf{Z}}_r{\mathbf{\tilde{v}}}(\pmb{\delta})\), where \(\tilde{\boldsymbol{\beta}}(\pmb{\delta})\) is the Best Linear Unbiased Estimator of \(\boldsymbol{\beta}\) and \(\tilde{\mathbf{v}}(\pmb{\delta})\) is the Best Linear Unbiased Predictor of \(\mathbf{v}\), both presented in ((4)). As shown by (Żądło 2017) p. 8094, if \(Cov(\mathbf{e}_r, \mathbf{e}_s)=\mathbf{0}\), then the predictor ((9)) is the BLUP of \(\theta\) defined as the linear combination ((8)). Even if \(Cov(\mathbf{e}_r, \mathbf{e}_s) \neq \mathbf{0}\), the predictor \(\hat{\theta}_r(\pmb{\delta})\) is the Best Linear Unbiased Predictor of the following linear combination of \(\boldsymbol{\beta}\) and \(\mathbf{v}\): \({\boldsymbol{\gamma }}_r^T{{\mathbf{X}}_r}{ {\boldsymbol{\beta}} } +\boldsymbol{\gamma }_r^T{\mathbf{Z}}_r{\mathbf{{v}}}\). The EBLUP \(\hat\theta^{EBLUP}\) is obtained by replacing the vector of variance components \(\pmb{\delta}\) in the BLUP ((9)) with its estimator \(\hat{\pmb{\delta}}\). If (a) the expectation of the predictor is finite, (b) \(\hat{\pmb{\delta}}\) is any even, translation-invariant estimator of \(\pmb{\delta}\), and (c) the distributions of both random effects and random components are symmetric around \(\mathbf{0}\) (not necessarily normal), the EBLUP remains unbiased, as proved by (Kackar and Harville 1981).
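For example, the weight vector \(\boldsymbol{\gamma}\) defining a subpopulation mean can be constructed in R as follows; the population and the 0-1 subpopulation indicator below are hypothetical:
> # hypothetical population of N = 10 units, the first 4 forming a subpopulation:
> inDomain <- c(rep(1, 4), rep(0, 6))
> gamma <- inDomain / sum(inDomain) # weights defining the subpopulation mean
> # then sum(gamma * Y) gives the subpopulation mean of the response vector Y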
To introduce the second predictor, called the EBP and considered e.g. by (Molina and Rao 2010), firstly, the Best Predictor (BP) \(\hat{\theta}^{BP}\) of the characteristic \(\theta(\mathbf{Y})\) has to be defined. It minimizes the Mean Squared Error \(MSE(\hat\theta )=E(\hat\theta - \theta)^2\) and can be written as \(\hat\theta^{BP} = E(\theta|\mathbf{Y}_s)\). This means that the conditional distribution of \(\mathbf{Y}_r|\mathbf{Y}_s\) must be known to compute its value, whereas at least the parameters of this distribution, denoted by \(\boldsymbol{\psi}\) in ((4)), are unknown. The EBP \(\hat\theta^{EBP}\) is obtained by replacing these parameters with their estimators \(\hat{\boldsymbol{\psi}}\). Its value can be computed according to the Monte Carlo procedure presented in the supplementary document for this paper.
The last predictor is the PLUG-IN predictor defined as (e.g. (Chwila and Żądło 2019)): \[\hat{\theta}^{PLUG-IN}=\theta(\begin{bmatrix} \mathbf{Y}_s^T & \mathbf{\hat{Y}}_r^T \end{bmatrix}^T),\] where \(\mathbf{\hat{Y}}_r\) is the vector of fitted values of the unobserved random variables under the assumed model (any model specified by the statistician). Under the LMM, if a linear combination of \(\mathbf{Y}\) is predicted, the PLUG-IN predictor is the EBLUP, but in general it is not optimal. However, simulation studies have shown that it can have similar or even higher accuracy compared to empirical (estimated) best predictors, where the best predictors minimize the prediction mean squared errors (cf. e.g. (Boubeta et al. 2016), (Chwila and Żądło 2019), (Hobza and Morales 2016)). Moreover, the PLUG-IN predictor is less computationally demanding than the EBP.
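The idea of the PLUG-IN predictor can be sketched in a few lines of R; the objects below are hypothetical placeholders for the observed vector \(\mathbf{Y}_s\) and for the fitted values \(\mathbf{\hat{Y}}_r\) that would be obtained under the assumed model:
> # hypothetical observed values and fitted values for the unobserved units:
> Ys <- c(1.2, 0.8, 1.5, 1.1)
> YrHat <- c(1.3, 0.9)
> thetaFun <- function(x) median(x) # the predicted characteristic
> thetaFun(c(Ys, YrHat)) # the PLUG-IN predictor of the population median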
To deal with the LMM, the qape package uses the lmer() function from the lme4 package, see (Bates et al. 2015). Assuming ((4)) and based on \(\mathbf{Y}_s\), the vector of model parameters \(\boldsymbol{\psi} = [\boldsymbol{\beta}^T, \pmb{\delta}^T]^T\) is estimated using the Restricted Maximum Likelihood method (REML), known to be robust to non-normality, see e.g. (Jiang 1996), and \(\hat{\boldsymbol{\psi}}\) is obtained.
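For instance, a generic lme4 call of this kind, shown here on the sleepstudy data shipped with lme4 rather than on data from this paper, estimates \(\boldsymbol{\beta}\) and \(\pmb{\delta}\) by REML:
> library(lme4)
> fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = TRUE)
> fixef(fit) # estimates of the fixed effects beta
> VarCorr(fit) # estimates of the variance components delta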
In order to obtain the predictor of \(\theta\), one of the three qape functions can be applied: EBLUP(), ebpLMMne() or plugInLMM(). Firstly, the characteristic of the response variables of interest has to be defined. This is straightforward for the EBLUP, which can be used only to predict a population/subpopulation linear combination (e.g. the sum) by using the argument gamma, equivalent to the population vector of weights \(\boldsymbol{\gamma}\) in ((8)). For the other two predictors, the EBP and the PLUG-IN, the input argument called thetaFun has to be given (see \(f_{\theta}(.)\) in ((7))). The function thetaFun can define one characteristic or a vector of characteristics, for example:
> thetaFun1 <- function(x) median(x)
> thetaFun2 <- function(x) c(sum(x), mean(x), sd(x))
Secondly, two groups of input arguments, common to all three predictors, have to be provided:
group 1 - arguments defining the sample and the population
YS - values of the dependent variable in the sample (\(\mathbf{Y}_s\)),
reg - the population matrix of auxiliary variables named in fixed.part, random.part and division,
con - the population \(0-1\) vector with \(1\)s for elements in the sample and \(0\)s for elements which are not in the sample,
group 2 - arguments defining the model
fixed.part - fixed-effects terms declared as in the lme4::lmer function,
random.part - random-effects terms declared as in the lme4::lmer function,
weights - the population vector of weights.
The weights make it possible to include heteroscedasticity of random components in the LMM.
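For example, under lme4's inverse-variance convention for weights (an assumption about how these weights enter the model; see the lme4 documentation), declaring weights proportional to the inverse of a known size measure makes the variance of the random component of element \(i\) proportional to that size measure:
> # hypothetical known size measure for the population elements:
> size <- c(2, 5, 1, 8, 3)
> weights <- 1 / size # Var of the random component of element i proportional to size[i]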
In EBLUP() and plugInLMM() the random-effects terms of the LMM have to be declared as the input argument random.part. The ebpLMMne() function, in turn, requires the argument division instead of random.part. This input represents the variable dividing the population dataset into subsets, which are taken into account in the nested error linear mixed model with 'division'-specific random components (presented in the supplementary document for this paper).
In the process of prediction, it is often necessary to transform the data before estimating the model parameters. An example is the logarithmic scaling of the variable of interest. The qape package offers the possibility of declaring the argument backTrans to conduct the back-transformation of the data. Hence, a very flexible solution is provided, which allows any transformation of the response variable for which the back-transformation can be defined. This argument (a function available in R or defined by the user) should be the back-transformation of the transformed dependent variable used to define the model, e.g. for log-transformed YS used as the response variable:
> backTrans <- function(x) exp(x)
The main output is the value of the predictor, thetaP. For each class of predictors, there are two S3 methods registered for the existing generic functions print and summary. The full list of output values is presented in detail in the qape-manual file, cf. (Wolny-Dominiak and Żądło 2023).
In order to demonstrate the functionality of the package's main functions, in the following examples the radon dataset available in the HLMdiag package ((Loy and Hofmann 2014)) is analyzed. It contains the results of a survey measuring radon concentrations in 919 owner-occupied homes in 85 counties of Minnesota (see Figure 1). The study was conducted in 1987-1988 by the Minnesota Department of Health, showing that indoor radon levels are higher in Minnesota compared to typical levels in the U.S. In the data, the response variable log.radon (denoted in ((10)) by \(log(Y_{ic})\)) is the radon measurement in logarithms of picoCurie per liter. The independent variables, in turn, are: uranium (\(x_{1ic}\)), the average county-level soil uranium content, and basement (\(x_{2ic}\)), the 0-1 variable indicating the level of the home at which the radon measurement was taken (0 for basement, 1 for the first floor), while county (denoted by subscript \(c\) in ((10))) is the county ID.
In all considered examples, the prediction for county no. 26 (county == 26) is conducted and it is assumed that the observations in this county from the first floor (basement == 1) are not available (see Figure 2).
The radon dataset is widely discussed in the literature. In (Nero et al. 1994), the authors used an ordinary regression model to predict county geometric means of radon concentration using surficial soil radium data from the National Uranium Resource Evaluation. In turn, the paper (Price et al. 1996) focuses on the prediction of the geometric mean of radon for each county, but using a Bayesian approach.
For the radon data we use the following model \[\label{radon.model} log(Y_{ic}) = \beta_1 x_{1ic} + (\beta_2 + v_{1c}) x_{2ic} + \beta_0 + v_{2c} + e_{ic}, \tag{10}\] where \(i=1,2,\dots,N\), \(c=1,2,\dots, C\), \(N = 919\) is the number of observations, \(C = 85\) is the number of counties, \(\beta_1\), \(\beta_2\) and \(\beta_0\) are unknown fixed effects, \(v_{1c}\) and \(v_{2c}\) are random effects, \(e_{ic}\) are random components, \(v_{1c}\) and \(e_{ic}\) are mutually independent, \(v_{2c}\) and \(e_{ic}\) are mutually independent too, \(Cor(v_{1c}, v_{2c}) = \rho\), \(v_{1c} \sim (0, \sigma^2_{v_1})\), \(v_{2c} \sim (0, \sigma^2_{v_2})\) and \(e_{ic} \sim (0, \sigma^2_e)\). As can easily be seen, the considered model is a random coefficient model with two correlated county-specific random effects. Its syntax written in the lme4 package notation is as follows:
radon.model <- lmer(log.radon ~ basement + uranium + (basement | county), data = radon)
This and similar LMMs are considered, analyzed, and used for this dataset in many publications, with a good overview presented in (Gelman and Hill 2006). In (Gelman and Pardoe 2006), based on the preceding research (Price et al. 1996), (Lin et al. 1999), (Price and Gelman 2005), a very similar model, but with additional multivariate normality assumptions, is studied, verified and chosen as fitting the data well within a Bayesian framework. The same model as in (Gelman and Pardoe 2006), together with its special cases, is considered in (Cantoni et al. 2021) but within the frequentist approach. Based on 25 measures of explained variation and model selection, the authors conclude that the same model as considered in our paper (with an additional normality assumption, however, which is not used in all cases considered in that paper) "seems the best" (Cantoni et al. 2021, p. 10) for the radon data. Further tests of the model are presented by (Loy 2013), (Loy and Hofmann 2015) and (Loy et al. 2017) (see also (Cook et al. 2007) for the introduction of the methodology), showing among others: the normality and homoscedasticity of the random components, the normality of the distribution of the random slope but – what is important for our further considerations – the lack of normality of the random intercept. Since the problem of choosing and verifying a model for the considered dataset is widely discussed in the literature, we will focus on the issues that are new in this case, namely the problem of prediction and estimation of the prediction accuracy as well as the Monte Carlo analysis of predictors' properties.
This example shows the prediction procedure in the package qape. In the first step, all the input arguments that will then be passed to the prediction functions need to be defined.
> Ypop <- radon$log.radon # the population vector of the dependent variable
> # It is assumed that observations from the first floor
> # in county no. 26 are not available:
> con <- rep(1, nrow(radon))
> con[radon$county == 26 & radon$basement == 1] <- 0
> YS <- Ypop[con == 1] # sample vector of the dependent variable
> reg <- dplyr::select(radon, -log.radon) # the population matrix of auxiliary variables
> fixed.part <- 'basement + uranium' # the fixed part of the considered model
> random.part <- '(basement|county)' # the random part of the considered model
> # The vector of weights to define
> # the predicted linear combination - the mean for county == 26:
> gamma <-
+ (1 / sum((radon$county == 26))) * ifelse((radon$county == 26), 1, 0)
> estMSE <- TRUE # to include the naive MSE estimator of the EBLUP in the output
Then the functions corresponding to each predictor can be used. First, the EBLUP prediction in the package qape is presented. As the EBLUP is limited to linear combinations of random variables, the predicted characteristic is simply the arithmetic mean. To be precise, it is the mean of logarithms of measurements (instead of the mean of measurements), because the EBLUP can be used only under linear (linearized) models. As homoscedasticity of the random components is assumed in the LMM, the input argument weights = NULL is set.
> myeblup <- EBLUP(YS, fixed.part, random.part, reg, con, gamma, weights = NULL, estMSE)
> # the value of the predictor of the arithmetic mean
> # of logarithms of radon measurements:
> myeblup$thetaP
[1] 1.306916
> myeblup$neMSE # the value of the naive MSE estimator
[1] 0.002292732
Hence, the predicted value of the arithmetic mean of logarithms of radon measurements equals \(1.306916\) log picoCurie per liter. The estimated root of the prediction MSE equals \(\sqrt{0.002292732} \approx 0.048\) log picoCurie per liter, but – what is important – it is the value of the naive RMSE estimator (as defined by Rao and Molina 2015, p. 106), which means that it ignores the decrease in accuracy due to the estimation of model parameters.
The second part of this example shows the prediction of the arithmetic mean, geometric mean and median of radon measurements (not of their logarithms) in county no. 26 with the use of the PLUG-IN predictor. It requires setting two input arguments: thetaFun and backTrans.
> thetaFun <- function(x) {
+ c(mean(x[radon$county == 26]), psych::geometric.mean(x[radon$county == 26]),
+ median(x[radon$county == 26]))
+ }
> backTransExp <- function(x) exp(x) # back-transformation
> myplugin <- plugInLMM(YS, fixed.part, random.part, reg, con, weights = NULL,
+ backTrans = backTransExp, thetaFun)
> # values of the predictor of arithmetic mean, geometric mean
> # and median of radon measurements:
> myplugin$thetaP
[1] 3.694761 4.553745 3.900000
In this case we can conclude that the predicted values of the arithmetic mean, geometric mean and median in county no. 26 equal: \(3.694761\), \(4.553745\) and \(3.9\) picoCurie per liter, respectively. The problem of prediction accuracy estimation will be discussed in the next sections of the paper.
The qape package also allows the use of the Empirical Best Predictor (EBP) (see the supplementary document for this paper). Like the PLUG-IN predictor, it provides predicted values of any function of the variable of interest. However, it requires stronger assumptions to be met. The EBP procedure available in the qape package is prepared under the assumption of normality of the variable of interest after any transformation. However, in the case of the considered model for logarithms of radon measurements, the assumption is not met, as we mentioned before based on the results presented in the literature. It can also be verified using the normCholTest function (available in the qape package) as follows:
> normCholTest(radon.model, shapiro.test)$p.value
[1] 2.589407e-08
Moreover, because the iterative procedure used to compute the EBP in the general case is very time-consuming, the qape function ebpLMMne uses a very fast procedure working only for nested error linear mixed models (see (Molina and Rao 2010)).
The prediction of any function of the random variables based on cross-sectional data has been considered. Its special case, not presented above but widely discussed in the econometric literature, is the prediction of one random variable, in this case a radon measurement for one non-observed owner-occupied home. Furthermore, the qape package is also designed for prediction based on longitudinal data for current or future periods, as shown in examples for the EBLUP, plugInLMM and ebpLMMne functions in the qape-manual file, cf. (Wolny-Dominiak and Żądło 2023).
The qape package provides three main types of bootstrap algorithms: the parametric bootstrap, the residual bootstrap and the double bootstrap.
The parametric bootstrap procedure is implemented according to (González-Manteiga et al. 2007) and (González-Manteiga et al. 2008) and could be described in the following steps:
based on \(n\) observations of the dependent and independent variables (\(\mathbf{Y}_s\), \(\mathbf{X}_s\) and \(\mathbf{Z}_s\)) estimate \(\boldsymbol{\psi}\) to obtain the vector of estimates \(\boldsymbol{\hat{\psi}}\),
generate \(B\) realizations \(y_{i}^{*(b)}\) of \(Y_{i}\) under the \(LMM(\mathbf{X}, \mathbf{Z}, \hat{\boldsymbol{\psi}})\) and multivariate normality of random effects and random components, obtaining \(\mathbf{y}^{*(b)}=\begin{bmatrix} y_{1}^{*(b)} & ... & y_{i}^{*(b)} &... & y_{N}^{*(b)} \end{bmatrix}^T\), where \(i=1, 2, ... ,N\) and \(b=1, 2, ... ,B\),
decompose the vector \(\mathbf{y}^{*(b)}\) as follows \(\begin{bmatrix} \mathbf{y}_s^{*(b)T} & \mathbf{y}_r^{*(b)T} \end{bmatrix}^T\),
in the \(b\)th iteration (\(b=1,2,...,B\))
compute the bootstrap realization \(\theta^{*(b)}=\theta^{*(b)}(\mathbf{y}^{*(b)},\boldsymbol{\hat{\psi}})\) of random variable \(\theta\),
obtain the vector of estimates \(\boldsymbol{\hat{\psi}}^{*(b)}\) using \(\mathbf{y}_s^{*(b)}\) and compute the bootstrap realization of predictor \(\hat{\theta}\) denoted by \(\hat{\theta}^{*(b)}(\mathbf{y}_s^{*(b)},\boldsymbol{\hat{\psi}}^{*(b)})\) based on \(LMM(\mathbf{X}, \mathbf{Z}, \boldsymbol{\hat{\psi}}^{*(b)})\),
compute bootstrap realizations of the prediction error \(U^*\), denoted by \(u^*\) and for the \(b\)th iteration given by: \[\label{u*b} u^{*(b)}=\hat{\theta}^{*(b)}(\mathbf{y}_s^{*(b)},\boldsymbol{\hat{\psi}}^{*(b)})-\theta^{*(b)} (\mathbf{y}^{*(b)},\boldsymbol{\hat{\psi}}) =\hat{\theta}^{*(b)}-\theta^{*(b)}, \tag{11}\]
compute the parametric bootstrap estimators of the prediction accuracy measures, the RMSE and the QAPE, by replacing the prediction errors \(U\) in ((2)) and ((3)) with their bootstrap realizations, as sketched below.
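A minimal R sketch of this final step, assuming that thetaHatStar and thetaStar hold the \(B\) bootstrap realizations of the predictor and of \(\theta\); the placeholder values are generated here only to make the chunk self-contained:
> B <- 500 # hypothetical number of bootstrap iterations
> thetaHatStar <- rnorm(B, 1.3, 0.05) # placeholder bootstrap realizations of the predictor
> thetaStar <- rnorm(B, 1.3, 0.04) # placeholder bootstrap realizations of theta
> uStar <- thetaHatStar - thetaStar # bootstrap realizations of the prediction error
> sqrt(mean(uStar^2)) # the parametric bootstrap RMSE estimator, cf. ((2))
> quantile(abs(uStar), probs = c(0.75, 0.9)) # the bootstrap QAPE estimators, cf. ((3))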
Another possible method to estimate the prediction accuracy measures is the residual bootstrap. In what follows, we use the notation \(srswr(\mathbf{A}, m)\) to indicate the outcome of taking a simple random sample with replacement of size \(m\) of rows of matrix \(\mathbf{A}\). If \(\mathbf{A}\) is a vector, it simplifies to a simple random sample with replacement of size \(m\) of elements of \(\mathbf{A}\).
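In R, the \(srswr\) operation can be sketched as the following helper function; this is only a hypothetical implementation used to fix the notation, not a function of the qape package:
> srswr <- function(A, m) {
+   if (is.null(dim(A))) A[sample(length(A), m, replace = TRUE)]
+   else A[sample(nrow(A), m, replace = TRUE), , drop = FALSE]
+ }
> srswr(1:5, 3) # a simple random sample with replacement of 3 elements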
To obtain the algorithm of the residual bootstrap, it is enough to replace step 2 of the parametric bootstrap procedure presented above with the following procedure of population data generation based on ((5)): the bootstrap population vector is generated from ((5)) with \(\boldsymbol{\beta}\) replaced by \(\hat{\boldsymbol{\beta}}\), with the vectors of random effects replaced by samples drawn by \(srswr\) from the predicted random effects, and with the vector of random components replaced by a sample drawn by \(srswr\) from the estimated residuals.
Steps 3–5 of this procedure are analogous to the corresponding steps of the parametric bootstrap procedure. In the generation step described above, it can be seen that if more than one vector of random effects is assumed at the \(l\)th level of grouping, then their elements are not sampled with replacement independently. In this case, rows of the matrix formed by these vectors are sampled with replacement.
The residual bootstrap algorithm can also be performed with a so-called "correction procedure". This procedure, which can improve the properties of the residual bootstrap estimators affected by the underdispersion of the uncorrected residual bootstrap distributions, is presented in the supplementary document for this paper.
Two bootstrap procedures are implemented in separate functions: bootPar() (the parametric bootstrap) and bootRes() (the residual bootstrap). According to the general Procedure 1, the step preceding the bootstrap procedure in both functions is the definition of the predictor object. It must be of one of the following classes: EBLUP, ebpLMMne or plugInLMM. This object has to be passed to bootPar() or bootRes() as the input parameter predictor. The other input parameters are intuitive: B - the number of bootstrap iterations, and p - the orders of quantiles in the estimated QAPEs.
The additional input parameter in bootRes() is a logical condition called correction, which makes it possible to include an additional correction term for both random effects and random components, presented in the supplementary document for this paper, to avoid the problem of underdispersion of residual bootstrap distributions.
The main output values of both functions are the estimated accuracy measures: estRMSE and estQAPE, computed based on ((2)) and ((3)), respectively, where the prediction errors are replaced by their bootstrap realizations. There is also the output error, the vector of bootstrap realizations of prediction errors, which is useful e.g. for an in-depth analysis of the prediction accuracy and for a graphical presentation of results. To estimate these accuracy measures, we use below the residual bootstrap with the correction procedure.
As previously stated, our package utilizes the lmer() function from the lme4 package for estimating model parameters. However, this function is known to generate convergence warnings in certain situations, listed for example by (Bates et al. 2015) p. 25, when the estimated variances of random effects are close to zero. Such scenarios may occur when models are estimated for smaller or medium-sized datasets, when complex variance-covariance structures are assumed, or when the grouping variable considered for random effects has only a few levels. Although we have not observed such issues when estimating model parameters based on the original dataset, as required to compute the values of the predictors in the previous sections, bootstrapping and Monte Carlo simulations are more complex cases. This is because, based on the estimates of model parameters, the values of the dependent variable are generated \(B\) times, and then model parameters are estimated in each of the \(B\) iterations. Therefore, in at least some iterations, the randomly generated values of the dependent variable may give realizations for which the variance of the random effect is relatively close to zero. As a result, estimates of model parameters can be obtained; however, convergence issues implying warnings may occur. In such cases, there are at least two possible solutions. The first option is to discard iterations with warnings, which would imply that the dependent variable would not follow the assumed model as required, but instead only its conditional version with relatively high values of variances of random effects. This would imply an overdispersed bootstrap distribution of random effects, which would affect the bias of the bootstrap estimators of accuracy measures. The second option is to consider all generated realizations, despite convergence warnings, as long as the parameters can be estimated in all iterations. We opted for the latter solution, as argued in (Bates et al. 2015) p. 25, who noted that "being able to fit a singular model is an advantage: when the best fitting model lies on the boundary of a constrained space".
The analyses presented in Example 1 are continued. We extend the previous results to include the issue of estimating the prediction accuracy of the considered predictors. The use of the functions for this estimation primarily requires an object of a predictor class, here myplugin.
> class(myplugin)
[1] "plugInLMM"
The short chunk of R code below presents the residual bootstrap estimators of the RMSE (estRMSE) and the QAPE (estQAPE) of the PLUG-IN predictors (myplugin) of the previously analyzed three characteristics of radon measurements in county no. 26: the arithmetic mean, geometric mean and median. In this and subsequent examples we make the computations for a relatively high number of iterations, which in our opinion allows reliable results to be obtained. These results are also used to prepare Figure 3. However, the computations are time-consuming. The supplementary R file contains the same chunks of code, but the number of iterations applied is smaller so that the code executes swiftly.
> # accuracy measures estimates based on
> # the residual bootstrap with the correction:
> B <- 500 # number of bootstrap iterations
> p <- c(0.75, 0.9) # orders of Quantiles of Absolute Prediction Error
> set.seed(1056)
> residBoot <- bootRes(myplugin, B, p, correction = TRUE)
> # values of estimated RMSEs of the predictor of three characteristics:
> # the arithmetic mean, geometric mean and median of radon measurements, respectively:
> residBoot$estRMSE
[1] 0.1848028 0.2003681 0.2824359
> # values of estimated QAPEs
> # (of order 0.75 in the first row, and of order 0.9 in the second row)
> # of the predictor of three characteristics:
> # the arithmetic mean, geometric mean and median of radon measurements,
> # in the 1st, 2nd and 3rd column, respectively:
> residBoot$estQAPE
         [,1]      [,2]      [,3]
75% 0.1533405 0.2135476 0.2908988
90% 0.2813886 0.3397411 0.4374534
Let us concentrate on the interpretation of the estimates of accuracy measures for the predictor of the geometric mean, i.e. the second value of residBoot$estRMSE and the values in the second column of residBoot$estQAPE. It is estimated that the average difference between the predicted values of the geometric mean and their unknown realizations equals \(0.2003681\) picoCurie per liter. Furthermore, it is estimated that at least \(75\%\) of absolute prediction errors of the predictor of the geometric mean are smaller than or equal to \(0.2135476\) picoCurie per liter and at least \(25\%\) of absolute prediction errors of the predictor are higher than or equal to \(0.2135476\) picoCurie per liter. Finally, it is estimated that at least \(90\%\) of absolute prediction errors of the predictor of the geometric mean are smaller than or equal to \(0.3397411\) picoCurie per liter and at least \(10\%\) of absolute prediction errors of the predictor are higher than or equal to \(0.3397411\) picoCurie per liter.
The distributions of bootstrap absolute prediction errors with values of
estimated RMSEs and QAPEs for the considered three prediction problems
are presented in Figure 3.
Since the assumption of normality is not met, the parametric bootstrap should not be used in this case. For this reason, we do not present the results for this method here; for illustrative purposes only, they are included in the supplementary R file. Moreover, these analyses can also be conducted using the bootParFuture() and bootResFuture() functions, where parallel computing algorithms are applied. The input arguments and the output of these functions are the same as in bootPar() and bootRes(). Examples based on these functions are also included in the supplementary R file.
The qape package also allows the use of predictors defined under a model different from the assumed one (e.g. a simpler or more robust model), while estimating their accuracy under the assumed model. In this case, the parametric and residual bootstrap procedures are implemented in the bootParMis() and bootResMis() functions. These functions make it possible to estimate the accuracy of two predictors under the model correctly specified for the first of them. Of course, it is expected that the estimated accuracy of the first predictor will be better than that of the second one, but the key issue can be the difference between the estimates of the accuracy measures. A small difference, even to the second predictor's disadvantage, may be treated by the user as an argument for using the second predictor due to its properties, such as robustness or simplicity.
The considered functions allow the estimation of the accuracy of two predictors, which belong to the class plugInLMM, under the model used to define the first of them. The remaining arguments are the same as in the bootPar() and bootRes() functions: B - the number of bootstrap iterations, and p - the orders of the QAPE estimates to be taken into account. The output results of bootParMis() and bootResMis() include – similarly to the bootPar() and bootRes() functions – estimates of the RMSEs and QAPEs of both predictors (denoted here by: estRMSElmm, estRMSElmmMis, estQAPElmm and estQAPElmmMis), and bootstrap realizations of their prediction errors (errorLMM and errorLMMmis).
In this example, we study the same accuracy measures as in Example 2, but the aim is to compare the predictor myplugin with another predictor defined under a misspecified LMM. First, the misspecified model has to be defined, and the relevant predictor has to be computed.
> fixed.part.mis <- '1'
> random.part.mis <- '(1|county)'
> myplugin.mis <- plugInLMM(YS, fixed.part.mis, random.part.mis, reg, con,
+ weights = NULL, backTrans = backTransExp, thetaFun)
Having the two objects myplugin and myplugin.mis, one can proceed to a comparison based on bootstrap estimation of the prediction accuracy, performed here using the residual bootstrap with the correction procedure. In this case, we estimate the prediction accuracy of these two predictors under the model used to define the first of them.
> set.seed(1056)
> residBootMis <- bootResMis(myplugin, myplugin.mis, B, p, correction = TRUE)
> # residual bootstrap with the correction RMSE estimators
> # of 'plugin' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estRMSElmm
[1] 0.1848028 0.2003681 0.2824359
> # residual bootstrap with the correction RMSE estimators
> # of 'plugin.mis' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estRMSElmmMis
[1] 0.1919184 0.3192304 0.2762137
> # residual bootstrap with the correction QAPE estimators of order 0.75 and 0.9
> # of 'plugin' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estQAPElmm
         [,1]      [,2]      [,3]
75% 0.1533405 0.2135476 0.2908988
90% 0.2813886 0.3397411 0.4374534
> # residual bootstrap with the correction QAPE estimators of order 0.75 and 0.9
> # of 'plugin.mis' of: arithmetic mean, geometric mean and median
> # of radon measurements in county 26:
> residBootMis$estQAPElmmMis
         [,1]      [,2]      [,3]
75% 0.2267062 0.3802836 0.3255197
90% 0.2813787 0.4970726 0.4489399
The results presented above were obtained for the same number of bootstrap iterations as in Example 2 (\(B = 500\)). If we compare, under the model defined in myplugin, the estimated RMSEs of the myplugin and myplugin.mis predictors of the geometric mean, given by \(0.2003681\) and \(0.3192304\) picoCurie per liter, respectively, we can state that the estimated accuracy (measured by the RMSE estimators) of the first predictor is better than that of the second one. If we are interested not in the average accuracy but in the right tail of the distribution of prediction errors, we can use the estimates of the QAPE of order 0.9 to compare the accuracy. The result for myplugin.mis of the geometric mean equals \(0.4970726\) picoCurie per liter, and it is higher than the \(0.3397411\) picoCurie per liter obtained for myplugin for the same prediction problem. Hence, in this case, the accuracy comparison based both on the RMSE and the QAPE leads to the same finding.
In the previous paragraph, we have focused on the results for the case of prediction of the geometric mean. If the comparison is made for the case of prediction of the arithmetic mean (the first column of the output results) or the median (the third column of the output results), we will come to the same conclusion regarding the estimated accuracy of myplugin and myplugin.mis as in the case of prediction of the geometric mean.
Similarly to the residual bootstrap, the parametric bootstrap procedure bootParMis() available in the qape package can be performed. However, in the considered case the normality assumption is not met (as discussed above) and this procedure is not recommended. The appropriate chunk of R code is presented in the supplementary R file, but it is intended solely for illustrative purposes.
In the previous section, our aim was to estimate the prediction accuracy under a correctly specified or misspecified model. In this section, we do not estimate the accuracy, but we approximate the true prediction accuracy under the specified model in a Monte Carlo simulation study. The crucial difference is that in this case, the model parameters used are obtained based on the whole population dataset, not the sample. If the number of iterations is large enough, we can treat the computed values of the measures as their true values, which are unknown in practice.
The last step of the analysis in the qape package presented in Procedure 1 is the Monte Carlo (MC) simulation analysis of:
properties of predictors
and properties of parametric, residual and double bootstrap estimators of accuracy measures.
The whole Monte Carlo procedure is as follows.
Procedure 2. Model-based Monte Carlo simulation analyses in qape
define the population vector of the dependent variable and the population matrix of auxiliary variables,
provide the information on the division of the population into the sampled and non-sampled part,
define \(\theta\) - the characteristics of the response variable to be predicted,
define the predictors \(\hat{\theta}\) and the accuracy measure estimators whose properties are to be assessed,
define the model to be used to generate realizations of the values of the dependent variable and estimate its parameters based on population data,
For k=1, 2, ..., K
generate the population vector of the response variable based on the assumed model,
based on population data, compute the characteristics \(\theta\), denoted by \(\theta_k\),
based on sample data, estimate the parameters of the LMM,
based on sample data, compute values of predictors \(\hat{\theta}\), denoted by \(\hat{\theta}_k\),
based on sample data, estimate the accuracy of \(\hat{\theta}\) using bootstrap methods,
End For
compute accuracy measures of predictors using \(\hat{\theta}_k\) and \(\theta_k\) (for \(k=1,2, ..., K\)),
compute accuracy measures of the estimators of prediction accuracy measures (a schematic sketch of the whole procedure is given below).
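In R, such a study could be sketched schematically as follows; the generator and the predictor are hypothetical placeholders rather than the model-based steps implemented in qape:
> K <- 200 # number of MC iterations
> thetaK <- thetaHatK <- numeric(K)
> for (k in 1:K) {
+   Yk <- rnorm(100, mean = 1.3, sd = 0.5) # placeholder for the model-based generator
+   thetaK[k] <- mean(Yk) # the predicted characteristic computed on population data
+   thetaHatK[k] <- mean(Yk[1:60]) # placeholder predictor computed on sample data
+ }
> 100 * mean(thetaHatK - thetaK) / mean(thetaK) # relative bias of the predictor (in %)
> 100 * sqrt(mean((thetaHatK - thetaK)^2)) / mean(thetaK) # relative RMSE (in %)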
In order to perform a Monte Carlo (MC) analysis of the properties of predictors, it is necessary to have access to the entire population data for both dependent and independent variables. The function mcLMMmis() can be used with the following arguments. Firstly, the population values of the dependent variable (after a necessary transformation) should be declared as Ypop. Using the Ypop values, the model parameters can be estimated based on the entire population data (assumed to be known). This allows us to generate values of the dependent variable in the simulation study that mimic its distribution in the entire population, not just in the sample. This approach ensures that our simulation study can be an accurate representation of the random process in the entire population, resembling the real-world scenario. Secondly, three predictors: predictorLMMmis, predictorLMM and predictorLMM2, which belong to the class plugInLMM, are to be defined. The first one is used only to define the (possibly misspecified) model used to generate population values of the response variable. The accuracy of predictorLMM and predictorLMM2 is assessed in the simulation study.
The next two arguments are the number of MC iterations K and the orders p of the QAPEs used to assess the prediction accuracy. Finally, it should be noted that it is possible to modify the covariance matrices of random components and random effects based on the model defined in predictorLMMmis, which are used to generate values of the dependent variable. This is done by declaring values of the ratioR and ratioG arguments, by which the diagonal elements of the covariance matrices of random components and random effects, respectively, are divided. The output of this function covers the following statistics of both predictors computed in the simulation study: relative biases (rBlmm and rBlmm2), relative RMSEs (rRMSElmm and rRMSElmm2) and QAPEs (QAPElmm and QAPElmm2). Simulation-based prediction errors of both predictors (errorLMM and errorLMM2) are also provided.
In this example, an MC simulation is carried out assuming the model of the myplugin predictor. The goal is to approximate the true accuracy of the prediction assuming model ((10)). Hence, in the package qape, all input predictor objects of the function mcLMMmis have to be defined as myplugin.
> # input arguments:
> predictorLMMmis <- myplugin # to define the model
> predictorLMM <- myplugin # whose properties are assessed in the simulation study
> predictorLMM2 <- myplugin # whose properties are assessed in the simulation study
Moreover, no modification of the covariance matrices is used.
> # diag. elements of the covariance matrix of random components are divided by:
> ratioR <- 1
> # diag. elements of the covariance matrix of random effects are divided by:
> ratioG <- 1
We specify the number of Monte Carlo iterations.
> K <- 500 # the number of MC iterations
The analysis is conducted and its results are stored in the object MC.
> set.seed(1086)
> MC <- mcLMMmis(Ypop, predictorLMMmis, predictorLMM, predictorLMM2,
+ K, p, ratioR, ratioG)
> # relative bias of 'predictorLMM'
> # of the arithmetic mean, geometric mean and median in county 26 (in %):
> MC$rBlmm
[1] -1.73208393 -0.04053178 -5.22355236
Results for the relative biases are obtained. It can be seen that under the assumed model the values of the considered predictor of the geometric mean (the second value of MC$rBlmm) are smaller than possible realizations of the geometric mean on average by \(0.04053178\%\). In turn, the relative RMSEs are as follows.
> # relative RMSE of 'predictorLMM'
> # of the arithmetic mean, geometric mean and median in county 26 (in %):
> MC$rRMSElmm
[1] 3.429465 4.665810 7.146678
In the considered case, the average difference between predicted values
of the geometric mean and its possible realizations (the second value of
MC$rRMSElmm
) equals \(4.665810\%\). It should be noted that this value
can be treated as the true value of the relative RMSE (if the number of
iterations is large enough), not the estimated value obtained in
Examples 2 and 3.
Finally, QAPEs of orders 0.75 and 0.9 are considered.
> # QAPE of order 0.75 and 0.9 of 'predictorLMM'
> # of the arithmetic mean, geometric mean and median in county 26:
> MC$QAPElmm
         [,1]      [,2]      [,3]
75% 0.1491262 0.1989504 0.2919221
90% 0.2895684 0.2959457 0.4728064
Let us interpret the results presented in the second column of MC$QAPElmm. At least \(75\%\) (\(90\%\)) of absolute prediction errors of the predictor of the geometric mean are smaller than or equal to \(0.1989504\) (\(0.2959457\)) picoCurie per liter and at least \(25\%\) (\(10\%\)) of absolute prediction errors of the predictor are higher than or equal to \(0.1989504\) (\(0.2959457\)) picoCurie per liter. Similarly to the values of the rRMSEs in the previous code chunk, these values can be considered the true QAPE values, not estimates like those presented in Examples 2 and 3.
In Example 4, the accuracy of one predictor under the model used to define this predictor was presented. A more complex version of the simulation study, where the properties of two predictors are studied under the model defined by a third predictor, is presented in the supplementary R file. What is more, the qape package also allows the use of the mcBootMis() function to conduct MC analyses of the properties of accuracy measure estimators (estimators of MSEs and QAPEs) of two predictors (which belong to the class plugInLMM) declared as arguments. The model used in the simulation study is declared in the first predictor, but the properties of the accuracy measure estimators of both predictors are studied. The output results of mcBootMis() cover simulation results on the properties of different accuracy measure estimators, including the relative biases and relative RMSEs of the parametric bootstrap MSE estimators of both predictors. The same simulation-based statistics for the parametric bootstrap QAPE estimators are also included. Other bootstrap methods, including the residual bootstrap with and without the correction procedure, are also taken into account. The full list of output values of the mcBootMis() function is presented in the qape-manual file, cf. (Wolny-Dominiak and Żądło 2023).
The package enables R users to make predictions and assess their accuracy under linear mixed models based on different methods in a fast and intuitive manner – not only based on the RMSE but also based on Quantiles of Absolute Prediction Errors. It also covers functions which allow Monte Carlo simulation analyses of the properties of the methods of the user's interest. Its main advantage, compared to other packages, is the considerable flexibility in defining the model (as in the lme4 package) and the predicted characteristic, but also the transformation of the response variable.
In our opinion, the package is useful for scientists, practitioners and decision-makers in all areas of research where accurate estimates and forecasts for different types of data (including cross-sectional and longitudinal data) and for different characteristics play a crucial role. We believe that it will be of special interest to survey statisticians interested in prediction for subpopulations with small or even zero sample sizes, called small areas.
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2024-004.zip
qape, sae, msae, saery, JoSAE, emdi, HLMdiag, lme4
Econometrics, Environmetrics, MixedModels, OfficialStatistics, Psychometrics, SpatioTemporal
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wolny-Dominiak & Żądło, "Prediction, Bootstrapping and Monte Carlo Analyses Based on Linear Mixed Models with QAPE 2.0 Package", The R Journal, 2025
BibTeX citation
@article{RJ-2024-004, author = {Wolny-Dominiak, Alicja and Żądło, Tomasz}, title = {Prediction, Bootstrapping and Monte Carlo Analyses Based on Linear Mixed Models with QAPE 2.0 Package}, journal = {The R Journal}, year = {2025}, note = {https://doi.org/10.32614/RJ-2024-004}, doi = {10.32614/RJ-2024-004}, volume = {16}, issue = {1}, issn = {2073-4859}, pages = {67-82} }