swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error

Though longitudinal data often contain missing responses and error-prone covariates, relatively little work has been available to simultaneously correct for the effects of response missingness and covariate measurement error on analysis of longitudinal data. Yi (2008) proposed a simulation based marginal method to adjust for the bias induced by measurement error in covariates as well as by missingness in response. The proposed method focuses on modeling the marginal mean and variance structures, and the missing at random mechanism is assumed. Furthermore, the distribution of covariates are left unspecified. These features make the proposed method applicable to a broad settings. In this paper, we develop an R package, called swgee, which implements the method proposed by Yi (2008). Moreover, our package includes additional implementation steps which extend the setting considered by Yi (2008). To describe the use of the package and its main features, we report simulation studies and analyses of a data set arising from the Framingham Heart Study.

Juan Xiong (Department of Preventive Medicine, School of Medicine, Shenzhen University) , Grace Y. Yi (Department of Statistics and Actuarial Science, University of Waterloo)

1 Introduction

Longitudinal studies are commonly conducted in the health sciences, biochemical, and epidemiology fields; these studies typically collect repeated measurements on the same subject over time. Missing observations and covariate measurement error frequently arise in longitudinal studies and they present considerable challenges in statistical inference about such data (Carroll et al. 2006; Yi 2008). It has been well documented that ignoring missing responses and covariate measurement error may lead to severely biased results, thus leading to invalid inferences (Fuller 1987; Carroll et al. 2006).

Regarding longitudinal data with missing responses, there has been extensive methods such as maximum likelihood, multiple imputation, and weighted generalized estimating equations (GEE) method (Little and Rubin 2002). In terms of methods of handling measurement error in covariate, many methods have been developed for various settings. Comprehensive discussions can be found in Fuller (1987), Gustafson (2003), Carroll et al. (2006), Buonaccorsi (2010) and Yi (2017). However, there has been relatively little work on simultaneously addressing the effects of response missingness and covariate measurement error in longitudinal data analysis, although some work such as Wang et al. (2008), Liu and Wu (2007) and Yi et al. (2012), are available. In particular, Yi (2008) proposed an estimation method based on the marginal model for the response process, which does not require the full specification of the distribution of the response variable but models only the mean and variance structures. Furthermore, a functional method is applied to relax the need of modeling the covariate process. These features make the method of Yi (2008) flexible for many applications.

Relevant to our R package, a set of R packages and statistical software have been available for performing the GEE and weighted GEE analyses for longitudinal data with missing observations. In particular, package gee (Carey 2015) and yags (Carey 2011) perform the GEE analyses under the strong assumption of missing completely at random (MCAR) (Kenward 1998). Package wgeesel (Xu et al. 2018) can perform the multiple model selection based on weighted GEE/GEE. Package geepack (Hojsgaard et al. 2016) implements the weighted GEE analyses under the missing at random (MAR) assumption, in which an optional vector of weights can be used in the fitting process but the weight vector has to be externally calculated. In addition, the statistical software SAS/STAT version 13.2 (SAS Institute Inc. 2014) includes an experimental version of the function PROC GEE (Lin and Rodriguez 2015), which fits weighted GEE models.

Our swgee package has several features distinguishing from existing packages. First, swgee is designed to analyze longitudinal data with both missing responses and error-prone covariates. To the best of our knowledge, this is the first R package that can simultaneously account for response missingness and covariate measurement error. Secondly, this simulation based marginal method can be applied to a broad range of problems because the associated model assumptions are minimal. swgee can be directly applied to handle continuous and binary responses as well as count data with dropouts under the MAR and MCAR mechanisms. Thirdly, observations are weighted inversely proportional to their probability of being observed, with weights calculated internally. Lastly, the swgee package employs the simulation extrapolation (SIMEX) algorithm to account for the effect of measurement error in covariates.

The remainder is organized as follows. Section 2 introduces the notation and model setup. In Section 3, we describe the method proposed by (Yi 2008) and its implementation in R in Section 4. The developed R package is illustrated with simulation studies and analyses of a data set arising from the Framingham Heart Study in Section 5. General discussion is included in Section 6.

2 Notation and framework

For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(Y_{ij}\) be the response variable for subject \({i}\) at time point \({j}\), let \(\mathbf{X}_{ij}\) be the vector of covariates subject to error, and \(\mathbf{Z}_{ij}\) be the vector of covariates which are error-free. Write \(\mathbf{Y}_{i} =(Y_{i1}, Y_{i2}, \dots, Y_{im})'\), \(\mathbf{X}_{i} =(\mathbf{X}'_{i1}, \mathbf{X}'_{i2}, \dots, \mathbf{X}'_{im})'\), and \(\mathbf{Z}_{i} =(\mathbf{Z}'_{i1}, \mathbf{Z}'_{i2}, \dots, \mathbf{Z}'_{im})'\).

Response model

For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(\mu_{ij}=E(Y_{ij}|\mathbf{X}_{i}, \mathbf{Z}_{i})\) and \({v}_{ij}=var(Y_{ij}|\mathbf{X}_{i}, \mathbf{Z}_{i})\) be the conditional expectation and variance of \(Y_{ij}\), given the covariates \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\), respectively. We model the influence of the covariates on the marginal response mean by means of a regression model: \[\label{responseR} g(\mu_{ij})=\mathbf{X}'_{ij}\boldsymbol{\beta}_x + \mathbf{Z}'_{ij}\boldsymbol{\beta}_z, \tag{1}\] where \(\boldsymbol{\beta}=(\boldsymbol{\beta}'_x, \boldsymbol{\beta}'_z)'\) is the vector of regression parameters and \(g(\cdot)\) is a specified monotone function. The intercept term, if any, of the model may be included as the first element of \(\boldsymbol{\beta}_z\) by including the unit vector as the first column of \(\mathbf{Z}_{i}\).

To model the variance of \(Y_{ij}\), we consider \[\label{varianceY} v_{ij}=h(\mu_{ij}; \phi), \tag{2}\] where \(h(\cdot;\cdot)\) is a given function and \(\phi\) is the dispersion parameter that is known or to be estimated. We treat \(\phi\) as known with emphasis setting on estimation of the \(\boldsymbol{\beta}\) parameter. Here we assume that \(E(Y_{ij}^k|\mathbf{X}_{i}, \mathbf{Z}_{i}) = E(Y_{ij}^k|\mathbf{X}_{ij}, \mathbf{Z}_{ij})\) for \(k=1\) and \(2\), that is, the dependence of the mean \(\mu_{ij}\) and the variance \(v_{ij}\) on the subject-level covariates \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\) is completely reflected by the dependence on the time-specific covariates \(\mathbf{X}_{ij}\) and \(\mathbf{Z}_{ij}\). This assumption has been widely used in marginal analysis of longitudinal analysis (e. g. , Diggle and Kenward 1994; Lai and Small 2007). The necessity of these assumptions was discussed by Yi (2017 5.1.1).

Missing data model

For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(O_{ij}\) be 1 if \(Y_{ij}\) is observed and 0 otherwise, and let \(\mathbf{O}_{i}=(O_{i1}, O_{i2}, \dots, O_{im})'\) be the vector of missing data indicators. Dropouts or monotone missing data patterns are considered here. That is, \(O_{ij}=0\) implies \(O_{ij'}=0\) for all \(j' > j\). We assume that \(O_{i1}=1\) for every subject \({i}\). To reflect the dynamic nature of the observation process over time, we assume an MAR mechanism for the missing process. That is, given the covariates, the missingness probability depends on the observed responses but not unobserved response components (Little and Rubin 2002). Let \(\lambda_{ij} = P(O_{ij}=1|O_{i,j-1}=1, \mathbf{X}_{i}, \mathbf{Z}_{i}, \mathbf{Y}_{i})\) and \(\pi_{ij} = P(O_{ij}=1|\mathbf{X}_{i}, \mathbf{Z}_{i}, \mathbf{Y}_{i})\), then \[\label{missingv} \pi_{ij} = \prod_{t=2}^{j}\lambda_{it}. \tag{3}\] Logistic regression models are used to model the dropout process: \[\label{logisticR} logit(\lambda_{ij})=\mathbf{u}'_{ij}\boldsymbol{\alpha}, \tag{4}\] for \(j=2, \dots, m\), where \(\mathbf{u}_{ij}\) is the vector consisting of the information of the covariates \(\mathbf{X}_{i}\), \(\mathbf{Z}_{i}\) and the observed responses, and \(\boldsymbol{\alpha}\) is the vector of regression parameters. Write \(\boldsymbol{\theta} = (\boldsymbol{\alpha}', \boldsymbol{\beta}')'\) and let \(q=dim(\boldsymbol{\theta})\).

Measurement error model

For \({i}=1, \dots, n\) and \({j}=1, \dots, m\), let \(\mathbf{W}_{ij}\) be the observed measurements of the covariates \(\mathbf{X}_{ij}\). Covariates \(\mathbf{X}_{ij}\) and their observed measurements \(\mathbf{W}_{ij}\) are assumed to follow a classical additive measurement error model: \[\label{MEmodel} \mathbf{W}_{ij} = \mathbf{X}_{ij} + \mathbf{e}_{ij}, \tag{5}\] where the \(\mathbf{e}_{ij}\) are independent of \(\mathbf{X}_{i}\) , \(\mathbf{Z}_{i}\) and \(\mathbf{Y}_{i}\). And \(\mathbf{e}_{ij}\) follows \(N(\mathbf{0}, \boldsymbol{\Sigma}_e)\) with the covariance matrix \(\boldsymbol{\Sigma}_e\). This model has been widely used in the context of handling measurement error problems. (Yi 2008) assumed that \(\boldsymbol{\Sigma}_e\) is known or can be estimated from replication experiments (e. g. , Carroll et al. 2006; Yi 2017).

3 Methodology

Weighted estimation function

The inverse probability weighted generalized estimating equations method is often employed to accommodate the missing data effects (e. g. , Robins et al. 1995; Preisser et al. 2002; Qu et al. 2011) when primary interest lies in the estimation of the marginal mean parameters \(\boldsymbol{\beta}\) in the model (1). For \({i}=1, \dots, n\), let \(M_i\) be the random dropout time for subject \({i}\) and \(m_i\) be a realization. Define \(L_{i}(\boldsymbol{\alpha}) = (1-\lambda_{im_i})\prod_{t=2}^{m_i-1}\lambda_{it}\), where \(\lambda_{it}\) is determined by model (4). Let \(\mathbf{S}_i(\boldsymbol{\alpha}) = \partial{log L_i(\boldsymbol{\alpha})}/\partial{\boldsymbol{\alpha}}\) be the vector of score functions contributed from subject \({i}\). Let \(\mathbf{D}_i = \partial{\boldsymbol{\mu}'_i}/\partial{\boldsymbol{\beta}}\) be the matrix of the derivatives of the mean vector \(\boldsymbol{\mu}_i=(\mu_{i1}, \dots, \mu_{im})'\) with respect to \(\boldsymbol{\beta}\) and let \(\boldsymbol{\Delta}_i = diag(I(O_{ij}=1)/\pi_{ij}, j=1, 2, \dots, m)\) be the weighted matrix accommodating missingness, where \(I(\cdot)\) is the indicator function. Let \(\mathbf{V}_{i} = \mathbf{A}_{i}^{1/2}\mathbf{C}_{i}\mathbf{A}_{i}^{1/2}\) be the conditional covariance matrix of \(\mathbf{Y}_{i}\), given \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\), where \(\mathbf{A}_{i} = diag(v_{ij}, j=1, 2, \dots, m)\) and \(\mathbf{C}_{i} = [\rho_{i;jk}]\) is the correlation matrix with diagonal elements equal 1 and \(\rho_{i;jk}\) being the conditional correlation coefficient of response components \(Y_{ij}\) and \(Y_{ik}\) for \(j \neq k\), given \(\mathbf{X}_{i}\) and \(\mathbf{Z}_{i}\). Define \[\mathbf{U}_i(\boldsymbol{\theta}) = \mathbf{D}_{i}\mathbf{V}_{i}^{-1}\boldsymbol{\Delta}_i(\mathbf{Y}_i - \boldsymbol{\mu}_i)\] and \[\label{wee} \mathbf{H}_{i}(\boldsymbol{\theta}) = (\mathbf{U}'_i(\boldsymbol{\theta}), \mathbf{S}'_i(\boldsymbol{\alpha}))'. \tag{6}\]

In the absence of measurement error, that is, covariates \(\mathbf{X}_{ij}\) are precisely observed, we have \(E[\mathbf{H}_{i}(\boldsymbol{\theta})] = \mathbf{0}\). Hence, \(\mathbf{H}(\boldsymbol{\theta}) = \sum_{i=1}^n\mathbf{H}_{i}(\boldsymbol{\theta})\) are unbiased estimation functions for \(\boldsymbol{\theta}\) (e. g. , Yi 2017 1). Under regularity conditions, the consistent estimator \(\widehat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) can be obtained by solving \[\label{IPWGEE} \mathbf{H}(\boldsymbol{\theta}) = \mathbf{0}, \tag{7}\] where the weight matrix \(\boldsymbol{\Delta}_i\) is used to adjust for the contributions of subject \(i\) with his/her missingness probabilities incorporated. Specifically, the probability \(\pi_{ij}\) is determined by (3) in conjunction with (4). Correlation matrix \(\mathbf{C}_{i}\) can be replaced by the moment estimate, or alternatively, a working independence matrix \(\mathbf{A}_{i}\) may be used to replace \(\mathbf{V}_{i}\) (Liang and Zeger 1986). A detail discussion can be found in Yi (2017 4).

SIMEX approach

When measurement error is present in covariates \(\mathbf{X}_{ij}\), \(\mathbf{H}(\boldsymbol{\theta})\) is no longer unbiased if naively replacing \(\mathbf{X}_{ij}\) with its observed measurement \(\mathbf{W}_{ij}\). Yi (2008) developed a simulation-extrapolation (SIMEX) method to adjust for the bias induced by using \(\mathbf{W}_{ij}\), as well as the missingness effects in the response variables. This method originates from the SIMEX method by Cook and Stefanski (1994) who considered cross-sectional data with measurement error alone. The basic idea of the SIMEX method is to first add additional variability to the observed measurement \(\mathbf{W}_{ij}\), then establish the trend how different degrees of measurement error may induce bias in estimation of the model parameters, and finally extrapolate this trend to the case of no measurement error.

Now, we describe the SIMEX method developed by Yi (2008). Let \(B\) be a given positive integer and \(\boldsymbol{\Lambda} = \{\lambda_1, \lambda_2, \dots, \lambda_M\}\) be a sequence of nonnegative numbers taken from \([0, \lambda_M]\) with \(\lambda_1 = 0\).

The SIMEX approach is very appealing because of its simplicity of implementation and no requirement of modeling the true covariates \(\mathbf{X}_i\). However, to use this method, several aspects need to be considered. As suggested by Carroll et al. (2006), the specification of \(\Lambda\) is not unique; a typical choice of grid \(\Lambda\) is the equal cut points of interval \([0, 2]\) with \(M=5\) or 9. Choosing \(B=100\) or 200 is often sufficient for many applications. The quadratic regression function is commonly used for Step 3 to yield reasonable results. (e. g. , He et al. 2012).

Finally, we extend the method by (Yi 2008) to accommodating the case where the covariance matrix \(\boldsymbol{\Sigma}_e\) for model (5) is unknown but repeated surrogate measurements of \(\mathbf{X}_{ij}\) are available. Let \(\mathbf{W}_{ijk}\) denote the repeated surrogate measurements of \(\mathbf{X}_{ij}\) for \(i=1, \dots, n; j=1,\dots,m\); and \(k=1,\dots, K\). The surrogate measurements \(\mathbf{W}_{ijk}\) and the true covariate \(\mathbf{X}_{ij}\) are linked by the model \[\label{MEmodelSurr} \mathbf{W}_{ijk} = \mathbf{X}_{ij} + \mathbf{e}_{ijk}, \tag{8}\] where the \(\mathbf{e}_{ijk}\) are independent of \(\mathbf{X}_{i}\) , \(\mathbf{Z}_{i}\) and \(\mathbf{Y}_{i}\), and \(\mathbf{e}_{ijk}\) follows \(N(\mathbf{0}, \boldsymbol{\Sigma}_e)\) with the covariance matrix \(\boldsymbol{\Sigma}_e\). We now adapt the arguments of (Devanarayan and Stefanski 2002) to modify the simulation step of the preceding SIMEX method. For a given \(b\) and \(\lambda \in \boldsymbol{\Lambda}\), set \[\label{measurementSurro} \mathbf{W}_{ij}(b, \lambda) = \overline{\mathbf{W}}_{ij} + \sqrt{\lambda/K}\sum_{k=1}^K{c}_{ijk}(b)\mathbf{W}_{ijk}, \tag{9}\] where \(\overline{\mathbf{W}}_{ij}={K}^{-1}\sum_{k=1}^K\mathbf{W}_{ijk}\) and \(\mathbf{c}_{ij}(b)=(c_{ij1}(b), \dots, c_{ijk}(b))'\) is a normalized contrast satisfying \(\sum_{k=1}^K{c}_{ijk}=0\) and \(\sum_{k=1}^K{c}_{ijk}^2=1\).

A simple way to generate a contrast \(\mathbf{c}_{ij}(b)\) can be done by independently generating \(K\) variates, \(d_{ijk}(b)\), from \(N(0, 1)\) for \(k=1,\dots, K\) and a given \(b\). Let \(\overline{d}_{ij}(b) = {K}^{-1}\sum_{k=1}^K{d}_{ijk}(b)\). Then \({c}_{ijk}(b)\) is set as \[{c}_{ijk}(b, \lambda) = \frac{d_{ijk}(b) - \overline{d}_{ij}(b)}{\sqrt{\sum_{k=1}^K\{d_{ijk}(b) - \overline{d}_{ij}(b)\}^2}}.\] Once \(\mathbf{W}_{ij}(b, \lambda)\) of (9) is available, we repeat Steps 2 and 3 to obtain the SIMEX estimator and the associated standard error.

4 Implementation in R

We implement the SIMEX procedure described in Section 3 in R and develop the package, called swgee. Our package swgee takes the advantage of existing R packages geepack (Hojsgaard et al. 2016) and mvtnorm (Genz and Bretz 2009; Genz et al. 2018). Specifically, the function swgee produces the estimates for elements of the parameter vector \(\boldsymbol{\beta}\), which are of primary interest, the associated standard errors, and P-values.

Our R function swgee requires the input data set to be sorted by subject \(i\) and visit time \(j\) for \({i}=1, \dots, n\) and \({j}=1, \dots, m\). If a subject is missing at a certain time, the corresponding measurements should be recorded as NAs. As long as the user provides the missing data model (4), the function swgee can internally generate the missing data indicators \(O_{ij}\) for \({i}=1, \dots, n\) and \({j}=1, \dots, m\), and then apply the user specified model (4) to fit the data. The missingness probabilities \(\pi_{ij}\) are calculated by (3) and then used to construct the weight matrix \(\boldsymbol{\Delta}_i\) for the estimating equation (6). The estimate of the missing data model (4) parameter \(\boldsymbol{\alpha}\) can also be retrieved from the function swgee output.

The form of calling function swgee is given by

    swgee(formula, data, id, family, corstr, missingmodel, SIMEXvariable,
        SIMEX.err, repeated = FALSE, repind = NULL, B, lambda)

where the arguments are described as follows:

5 Examples

An example data set

To illustrate the usage of the developed R package swgee, we apply the package to a subset of GWA13 (Genetic Analysis Workshops) data arising from the Framingham Heart Study. The data set consists of measurements of 100 patients from a series of exams with 5 assessments for each individual. Measurements such as height, weight, age, systolic blood pressure (SBP) and cholesterol level (CHOL) are collected at each assessment, and \(14\%\) patients dropped out of the study. The original data were analyzed by (Yi 2008). It is of interest to study how an individual’s obesity may change with age (\(Z_{ij}\)) and how it is associated with SBP (\(X_{ij1}\)) and CHOL (\(X_{ij2}\)), where \(i=1, \dots, 100\), and \(j=1, \dots, 5\). The response \(Y_i\) is the indicator of obesity status of subject \(i\) as in (Yi 2008); SBP is rescaled as \(log(SBP-50)\) as in (Carroll et al. 2006); and CHOL is standardized. The response and the covariates are postulated by the logistic regression model: \[logit \mu_{ij} = \beta_0 + \beta_{x1}X_{ij1} + \beta_{x2}X_{ij2} + \beta_{z}Z_{ij},\]

where \(\beta_0\), \(\beta_{x1}\), \(\beta_{x2}\) and \(\beta_{z}\) are regression coefficients of interest. We assume that errors in both risk factors \(X_{ij1}\) and \(X_{ij2}\) can be represented by model (5). The missing data process is characterized by the logistic regression model: \[logit \lambda_{ij} = \alpha_1 + \alpha_{2}Y_{i,j-1} + \alpha_{3}X_{i,j-1,1} +\alpha_{4}X_{i,j-1,2} + \alpha_{5c}z_{i,j-1},\] for \(j=2, \dots, 5\).

We now apply the developed R package swgee, which can be downloaded from CRAN and then loaded in R:

    R> library("swgee")

Next, load the data that are properly organized with the variable names specified. In the example here, the data set, named as bmidata, is included by issuing

    R> data("BMI")
    R> bmidata <- BMI

We are concerned how measurement error in SBP and CHOL impacts estimation of parameter \(\boldsymbol{\beta} = (\beta_0, \beta_{x1}, \beta_{x2}, \beta_z)'\). For illustrative purposes, we use setting with \(B=100\), \(\lambda_M=2\) and \(M=5\). In this example, we assume that parameters in \(\boldsymbol{\Sigma}_e=\left(\begin{array}{cc} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{array} \right)\) with \(\sigma_{12}=\sigma_{21}\) are known. This is a typical case when conducting sensitivity analysis. Here we set \(\sigma_{1} = \sigma_2=0.5\) and \(\sigma_{12}=\sigma_{21}=0\) as an example.

The naive GEE approach without considering missingness and measurement error effects in covariates gives the output:

    R> output1 <- gee(bbmi~sbp+chol+age, id=id, data=bmidata,
    +      family=binomial(link="logit"), corstr="independence")

    R> summary(output1)

     gee S-function, version 4.13 modified 98/01/27 (1998)

     Link:                      Logit
     Variance to Mean Relation: Binomial
     Correlation Structure:     Independent

    gee(formula = bbmi ~ sbp + chol + age, id = id, data = bmidata,
        family = binomial(link = "logit"), corstr = "independence")

    Summary of Residuals:
            Min          1Q      Median          3Q         Max
    -0.26533967 -0.11385369 -0.08572483 -0.06279540  0.95475735

                   Estimate Naive S.E.    Naive z Robust S.E.   Robust z
    (Intercept) -5.43746374 1.42090827 -3.8267521  1.64320527 -3.3090593
    sbp          0.59071183 0.30643396  1.9276970  0.24338420  2.4270755
    chol         0.11109496 0.13654324  0.8136247  0.23086218  0.4812177
    age          0.01297337 0.01339946  0.9682008  0.01814546  0.7149652

    Estimated Scale Parameter:  1.017131
    Number of Iterations:  1
    Working Correlation
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    0    0    0    0
    [2,]    0    1    0    0    0
    [3,]    0    0    1    0    0
    [4,]    0    0    0    1    0
    [5,]    0    0    0    0    1

To adjust for possible effects of missingness as well as measurement error in variables SBP and CHOL, we call the developed function swgee for the analysis:

    R> set.seed(1000)
    R> sigma <- diag(rep(0.25, 2))
    R> output2 <- swgee(bbmi~sbp+chol+age, data=bmidata, id=id,
    +     family=binomial(link="logit"), corstr="independence",
    +     missingmodel=O~bbmi+sbp+chol+age, SIMEXvariable=c("sbp","chol"),
    +     SIMEX.err=sigma, repeated=FALSE, B=100, lambda=seq(0, 2, 0.5))

    > summary(output2)
    Call: beta
                 Estimate    StdErr t.value   p.value
    (Intercept) -8.004577  2.060967 -3.8839 0.0001028 ***
    sbp          1.196363  0.356868  3.3524 0.0008011 ***
    chol         0.099984  0.264180  0.3785 0.7050810
    age          0.012718  0.017201  0.7394 0.4596520
    Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1  1

    Call: alpha
            Estimate    StdErr t.value  p.value
    alpha1  9.019084  3.086533  2.9221 0.003477 **
    alpha2 -0.786135  0.656843 -1.1968 0.231370
    alpha3 -0.568740  0.732885 -0.7760 0.437732
    alpha4 -0.128941  0.247757 -0.5204 0.602761
    alpha5 -0.064257  0.025982 -2.4731 0.013395 *
    Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1  1

The function swgee can store individual estimated coefficients in the simulation step, and this enables us to show the extrapolation curve through the developed R function plot.swgee. The plot.swgee function plots the extrapolation of the estimate of each covariate effect with the quadratic extrapolants. Figure 1 displays the graph for the variable SBP in the example for which the quadratic extrapolation function is applied from the following command:

    R> plot(output2,"sbp")
graphic without alt text
Figure 1: Display of the SIMEX estimate for the example: the dot is the SIMEX estimate obtained from the quadratic extrapolation.

Simulation studies

In this section, we conduct simulation studies to investigate the impact of ignoring covariate measurement error and response missingness on estimation, where the implementation is carried out using the usual GEE method. Furthermore, we assess the performance of the swgee method which accommodates the effects induces from error-prone covariates and missing responses. We set \(n=200\) and \(m=3\), and generate 500 simulations for each parameter configuration. Consider the logistic regression model \[logit(\mu_{ij})=\beta_0 + \beta_{x1}x_{ij1} + \beta_{x2}x_{ij2} + \beta_{z}z_{ij},\] where \(\beta_0 = 0\), \(\beta_{x1} = log(1.5)\), \(\beta_{x2} = log(1.5)\), \(\beta_{z} = log(0.75)\) and \(z_{ij}\) is generated independently from \(Bin(1, 0.5)\) to represent a balanced design. The true covariate \(\mathbf{X}_{ij} = (x_{ij1}, x_{ij2})'\) is generated from the normal distribution \(N(\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)\), where \(\boldsymbol{\mu}_{x} = (0.5, 0.5)'\) and \(\boldsymbol{\Sigma}_x = \left(\begin{array}{cc} \sigma_{x1}^2 & \rho_{x}\sigma_{x1}\sigma_{x2} \\ \rho_{x}\sigma_{x1}\sigma_{x2} & \sigma_{x2}^2 \end{array} \right)\) with \(\sigma_{x1} = \sigma_{x2} = 1\). The surrogate value \(\mathbf{W}_{ij} = (W_{ij1}, W_{ij2})'\) is generated from \(N(\mathbf{X}_{ij}, \boldsymbol{\Sigma}_e)\) with \(\boldsymbol{\Sigma}_e = \left(\begin{array}{cc} \sigma_{1}^2 & \rho\sigma_{1}\sigma_{2} \\ \rho\sigma_{1}\sigma_{2} & \sigma_{2}^2 \end{array} \right)\). \(\rho\) and \(\rho_{x}\) are set to 0.50 to represent moderate correlations. To feature minor, moderate and severe degrees of measurement error, we consider \(\sigma_1, \sigma_2 =\) 0.25, 0.50 or 0.75. The missing data indicator is generated from model (4), where \(\alpha_{0} = \alpha_{1} = 0.5\), \(\alpha_{2} = \alpha_{3} = 0.1\), and \(\alpha_{z} = 0.2\). In implementing the swgee method, we choose \(B = 100\), \(\lambda_M=2\), \(M = 5\), and a quadratic regression for each extrapolation step.

In Table 1, we report on the results of the biases of the estimates (Bias), the empirical standard error (SE), and the coverage rate (CR in percent) for \(95\%\) confidence intervals. When measurement error is minor, (i.e. \(\sigma_1 = \sigma_2 = 0.25\)), both gee and swgee provide reasonable results with fairly small finite sample biases and coverage rates close to the nominal level \(95\%\). When there is moderate or substantial measurement error in covariates \(\mathbf{X}_{ij}\), the performance of the gee method deteriorates remarkably in estimation of error-prone covariate effects, leading to considerably biased estimates for \(\beta_{x1}\) and \(\beta_{x2}\). The corresponding coverage rates for \(95\%\) confidence intervals can be quite low. In contrast, the swgee method remarkably improve the performance, providing a lot smaller biases and much higher coverage rates. The estimates for \(\beta_z\) are not subject to much impact of measurement error, which is partially attributed by that the precisely observed covariates \(z_{ij}\) are generated independently of error-prone covairates \(\mathbf{X}_{ij}\) under the current simulation study.

In summary, ignoring measurement error may lead to substantially biased results. Properly addressing covariate measurement error in estimation procedures is necessary. The proposed swgee method performs reasonably well under various configurations. As expected, its performance may become less satisfactory when measurement error becomes substantial. However, the swgee method does significantly improve the performance of the gee analysis.

Table 1: Simulation Results
\(\sigma_1\) \(\sigma_2\) Method \(\beta_{x1}\) \(\beta_{x2}\) \(\beta_{z}\)
Bias SE CR Bias SE CR Bias SE CR
0.25 0.25 gee -0.0310 0.1228 92.6 -0.0158 0.1246 92.6 0.0063 0.2121 94.6
0.25 0.25 swgee -0.0062 0.1420 95.0 0.0104 0.1425 95.2 0.0036 0.2354 95.6
0.25 0.50 gee -0.0019 0.1212 95.4 -0.0997 0.1156 83.4 0.0082 0.2110 94.2
0.25 0.50 swgee -0.0003 0.1415 95.0 -0.0087 0.1543 93.0 0.0035 0.2361 95.6
0.25 0.75 gee 0.0328 0.1189 95.4 -0.1841 0.1022 51.0 0.0101 0.2100 94.0
0.25 0.75 swgee 0.0205 0.1407 95.8 -0.0660 0.1562 86.4 0.0046 0.2359 95.6
0.50 0.25 gee -0.1156 0.1114 78.2 0.0139 0.1236 94.2 0.0078 0.2113 94.6
0.50 0.25 swgee -0.0282 0.1520 93.2 0.0177 0.1431 95.4 0.0031 0.2362 95.2
0.50 0.50 gee -0.0948 0.1114 81.8 -0.0780 0.1161 85.6 0.0102 0.2099 94.2
0.50 0.50 swgee -0.0228 0.1510 93.8 -0.0022 0.1542 93.6 0.0030 0.2370 95.4
0.50 0.75 gee -0.0629 0.1103 87.8 -0.1727 0.1036 55.6 0.0125 0.2088 94.2
0.50 0.75 swgee -0.0052 0.1499 94.8 -0.0608 0.1570 87.2 0.0042 0.2369 95.2
0.75 0.25 gee -0.1991 0.0966 45.6 0.0484 0.1216 94.2 0.0092 0.2107 94.6
0.75 0.25 swgee -0.0870 0.1508 86.4 0.0395 0.1430 93.6 0.0034 0.2366 95.2
0.75 0.50 gee -0.1889 0.0976 50.0 -0.0458 0.1154 89.8 0.0121 0.2091 94.0
0.75 0.50 swgee -0.0831 0.1509 87.8 0.0165 0.1539 94.0 0.0034 0.2375 95.4
0.75 0.75 gee -0.1636 0.0974 58.8 -0.1468 0.1039 66.4 0.0147 0.2077 94.2
0.75 0.75 swgee -0.0678 0.1505 90.0 -0.0442 0.1574 88.8 0.0046 0.2374 95.2

6 Summary and discussion

Missing observations and covariate measurement error commonly arise in longitudinal data. However, there has been relatively little work on simultaneously accounting for the effects of response missingness and covariate measurement error on estimation of response model parameters for longitudinal data. Yi (2008) described a simulation based marginal method to adjust for the biases induced by both missingness and covariate measurement error. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and covariance structure. In addition, the distribution of covariates is left unspecified, which is desirable for many practical problems. These features make the proposed method flexible.

Here we not only develop the R package swgee to implement the method by Yi (2008), but also include an extended setting in the package. Our aim is to provide analysts an accessible tool for the analysis of longitudinal data with missing responses and error-prone covariates. Our illustrations show that the developed package has the advantages of simplicity and versatility.

7 Acknowledgments

Juan Xiong was supported by the Natural Science Foundation of SZU (grant no.2017094). Grace Y. Yi was supported by the Natural Sciences and Engineering Research Council of Canada. The authors thanks Boston University and the National Heart, Lung, and Blood Institute (NHLBI) for providing the data set from the Framingham Heart Study (No. N01-HC-25195) in the illustration. The Framingham Heart Study is conducted and supported by the NHLBI in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI.

Conflict of Interest: None declared.

CRAN packages used

gee, yags, wgeesel, geepack, mvtnorm

CRAN Task Views implied by cited packages

Distributions, Econometrics, Finance, MixedModels


This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.

J. P. Buonaccorsi. Measurement error: Models, methods, and applications. Boca Raton, Florida: Chapman & Hall/CRC, 2010.
V. J. Carey. : Generalized estimation equation solver. 2015. R package version 4.13-19.
V. J. Carey. : Yet another GEE solve. 2011. R package version 6.1-13.
R. J. Carroll, D. Ruppert, L. A. Stefanski and C. M. Crainiceanu. Measurement error in nonlinear models: A modern perspective. 2nd ed Boca Raton, Florida: Chapman & Hall/CRC, 2006.
J. R. Cook and L. A. Stefanski. Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association, 89(428): 1314–1328, 1994. URL https://doi.org/10.1080/01621459.1994.10476871.
V. Devanarayan and L. A. Stefanski. Empirical simulation extrapolation for measurement error models with replicate measurements. Statistics and Probability Letters, 59(3): 219–225, 2002. URL https://doi.org/10.1016/S0167-7152(02)00098-6.
P. J. Diggle and M. G. Kenward. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics, 43(1): 49–93, 1994. URL https://doi.org/10.2307/2986113.
W. A. Fuller. Measurement error models. New York: John Wiley & Sons, 1987.
A. Genz and F. Bretz. Computation of multivariate normal and t probabilities. New York: Springer-Verlag, 2009.
A. Genz, F. Bretz, T. Miwa, X. Mi and T. Hothorn. : Multivariate normal and t distributions. 2018. R package version 1.0-7.
P. Gustafson. Measurement error and misclassification in statistics and epidemiology. Boca Raton, Florida: Chapman & Hall/CRC, 2003.
W. He, J. Xiong and G. Y. Yi. SIMEX R package for accelerated failure time models with covariate measurement error. Journal of Statistical Software, 46(1): 1–14, 2012. URL https://doi.org/10.18637/jss.v046.c01.
S. Hojsgaard, U. Halekoh and J. Yan. : Generalized estimating equation package. 2016. R package version 1.2-1.
M. G. Kenward. Selection models for repeated measurements with non-random dropout: An illustration of sensitivity. Statistics in Medicine, 17(23): 2723–2732, 1998. URL https://doi.org/10.1002/(SICI)1097-0258(19981215)17:23<2723::AID-SIM38>3.0.CO;2-5.
T. L. Lai and D. S. Small. Marginal regression analysis of longitudinal data with time-dependent covariates: A generalized method-of-moments approach. Journal of The Royal Statistical Society Series B-statistical Methodology, 69(1): 79–99, 2007. URL https://doi.org/10.1111/j.1467-9868.2007.00578.x.
K. Y. Liang and S. L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73(1): 13–22, 1986. URL https://doi.org/10.2307/2336267.
G. Lin and R. N. Rodriguez. Weighted methods for analyzing missing data with the GEE procedure. Paper SAS166-2015, 1–8, 2015.
R. J. A. Little and D. B. Rubin. Statistical analysis with missing data. 2nd ed New Jersey: John Wiley & Sons, 2002.
W. Liu and L. Wu. Simultaneous inference for semiparametric nonlinear mixed-effects models with covariate measurement errors and missing responses. Biometrics, 63(2): 342–350, 2007. URL https://doi.org/10.1111/j.1541-0420.2006.00687.x.
J. S. Preisser, K. K. Lohman and P. J. Rathouz. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Statistics in Medicine, 21(20): 3035–3054, 2002. URL https://doi.org/10.1002/sim.1241.
A. Qu, G. Y. Yi, P. X. K. Song and P. Wang. Assessing the validity of weighted generalized estimating equations. Biometrika, 98(1): 215–224, 2011. URL https://doi.org/10.1093/biomet/asq078.
J. M. Robins, A. Rotnitzky and L. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429): 106–121, 1995. URL https://doi.org/10.1080/01621459.1995.10476493.
SAS Institute Inc. SAS/STAT software, version 13.2. Cary, NC, 2014. URL http://www.sas.com/.
C. Y. Wang, Y. Huang, E. C. Chao and M. K. Jeffcoat. Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics, 64(1): 85–95, 2008. URL https://doi.org/10.1111/j.1541-0420.2007.00839.x.
C. Xu, Z. Li and M. Wang. : Weighted generalized estimating equations and model selection. 2018. R package version 1.5.
G. Y. Yi. A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates. Biostatistics, 9(3): 501–512, 2008. URL https://doi.org/10.1093/biostatistics/kxm054.
G. Y. Yi. Statistical analysis with measurement error or misclassification. New York: Springer-Verlag, 2017.
G. Y. Yi, Y. Ma and R. J. Carroll. A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error. Biometrika, 99(1): 151–165, 2012. URL https://doi.org/10.1093/biomet/asr076.



Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Xiong & Yi, "swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error", The R Journal, 2019

BibTeX citation

  author = {Xiong, Juan and Yi, Grace Y.},
  title = {swgee: An R Package for Analyzing Longitudinal Data with Response Missingness and Covariate Measurement Error},
  journal = {The R Journal},
  year = {2019},
  note = {https://rjournal.github.io/},
  volume = {11},
  issue = {1},
  issn = {2073-4859},
  pages = {416-426}