The *crch* package provides functions for maximum likelihood estimation of censored or truncated regression models with conditional heteroscedasticity along with suitable standard methods to summarize the fitted models and compute predictions, residuals, etc. The supported distributions include left- or right-censored or truncated Gaussian, logistic, or student-t distributions with potentially different sets of regressors for modeling the conditional location and scale. The models and their R implementation are introduced and illustrated by numerical weather prediction tasks using precipitation data for Innsbruck (Austria).

Censored or truncated response variables occur in a variety of applications. Censored data arise if exact values are only reported in a restricted range; data may fall outside this range but are then reported at the range limits. In contrast, if data outside this range are omitted completely, the dataset is called truncated. Consider, e.g., wind measurements with an instrument that needs a certain minimum wind speed to start working. If wind speeds below this minimum are recorded as \(\le \mathit{minimum}\), the data are censored. If only wind speeds exceeding this limit are reported and those below are omitted, the data are truncated. Even if the generating process is not as clear-cut, censoring or truncation can be useful for modeling limited data such as precipitation observations.

The tobit (Tobin 1958) and truncated regression (Cragg 1971) models are common linear regression models for censored and truncated conditionally normally distributed responses, respectively. Besides truncated data, truncated regression is also used in two-part models (Cragg 1971) for censored-type data: a binary (e.g., probit) regression model fits the exceedance probability of the lower limit and a truncated regression model fits the value given that the lower limit is exceeded.

Usually linear models like the tobit or truncated regression models
assume homoscedasticity which means that the variance of an underlying
normal distribution does not depend on covariates. However, sometimes
this assumption does not hold and models that can consider conditional
heteroscedasticity should be used. Such models have been proposed, e.g.,
for generalized linear models (Nelder and Pregibon 1987; Smyth 1989), generalized
additive models (Rigby and Stasinopoulos 1996; Rigby and Stasinopoulos 2005), or beta regression
(Cribari-Neto and Zeileis 2010). There also exist several R packages with functions
implementing the above models, e.g.,
*dglm* (Dunn and Smyth 2014),
*glmx* (Zeileis et al. 2013),
*gamlss* (Rigby and Stasinopoulos 2005),
*betareg* (Grün et al. 2012)
among others.

The *crch* package provides
functions to fit censored and truncated regression models that consider
conditional heteroscedasticity. It has a convenient interface to
estimate these models with maximum likelihood and provides several
methods for analysis and prediction. In addition to the typical
conditional Gaussian distribution assumptions it also allows for
logistic and student-t distributions with heavier tails.

In the following, this paper presents the heteroscedastic censored and truncated regression models and their R implementation. Furthermore, these models and their implementation are illustrated with numerical weather prediction data of precipitation in Innsbruck (Austria).

For both censored and truncated regression, a normalized latent response \((y^*-\mu)/\sigma\) is assumed to follow a certain distribution \(D\): \[\frac{y^*-\mu}{\sigma} \sim D\] The location parameter \(\mu\) and a link function of the scale parameter \(g(\sigma)\) are assumed to relate linearly to covariates \(\mathbf{x} = (1, x_1, x_2, \ldots)^\top\) and \(\mathbf{z} = (1, z_1, z_2, \ldots)^\top\):

\[\begin{aligned} \label{eq:mu} \mu &= &\mathbf{x}^{\top}\beta\\ \end{aligned} \tag{1} \]

\[\begin{aligned} \label{eq:sigma} g(\sigma) & = & \mathbf{z}^{\top}\gamma \end{aligned} \tag{2} \]

where \(\beta=(\beta_0, \beta_1, \beta_2, \ldots)^\top\) and \(\gamma=(\gamma_0, \gamma_1, \gamma_2, \ldots)^\top\) are coefficient vectors. The link function \(g(\cdot):\mathbb{R}^+ \mapsto \mathbb{R}\) is a strictly increasing and twice differentiable function; e.g., the logarithm (i.e., \(g(\sigma)=\log(\sigma)\)) is well suited. Although they map onto \(\mathbb{R}^+\) only rather than onto all of \(\mathbb{R}\), the identity \(g(\sigma) = \sigma\) and the quadratic function \(g(\sigma)=\sigma^2\) can be useful as well; however, problems in the numerical optimization can then occur.

Commonly \(D\) is the standard normal distribution so that \(y^*\) is assumed to be normally distributed with mean \(\mu\) and variance \(\sigma^2\). \(D\) might also be assumed to be a standard logistic or a student-t distribution if heavier tails are required. The tail weight of the student-t distribution can be controlled by the degrees of freedom \(\nu\) which can either be set to a certain value or estimated as an additional parameter. To assure positive values, \(\log(\nu)\) is modeled in the latter case.

\[\label{eq:dfest} \log(\nu) = \delta \tag{3} \]

The exact values of censored responses are only known in an interval defined by \(\mathit{left}\) and \(\mathit{right}\). Observations outside this interval are mapped to the interval limits \[y = \begin{cases} \mathit{left} & y^* \le \mathit{left} \\ y^* & \mathit{left} < y^* < \mathit{right}\\ \mathit{right} & y^* \ge \mathit{right} \end{cases}\] The coefficients \(\beta\), \(\gamma\), and \(\delta\) (Equations (1)–(3)) can be estimated by maximizing the sum over the data set of the log-likelihood function \(\log(f_{\mathit{cens}}(y, \mu, \sigma))\), where \[f_{\mathit{cens}}(y, \mu, \sigma) = \begin{cases} F\left(\frac{\mathit{left} - \mu}{\sigma}\right) & y \le \mathit{left} \\ f\left(\frac{y - \mu}{\sigma}\right) & \mathit{left} < y < \mathit{right} \\ \left(1 - F\left(\frac{\mathit{right} - \mu}{\sigma}\right)\right) & y \ge \mathit{right} \end{cases}\]
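The censored log-likelihood above can be written out directly. The following is a minimal sketch for the Gaussian case (the function name `llcensnorm` is ours, not package code); it includes the \(-\log(\sigma)\) Jacobian term that turns the standardized density into the density of the observed \(y\):

```
## Minimal sketch (not crch package code): censored Gaussian
## log-likelihood following the case distinction above.
llcensnorm <- function(y, mu, sigma, left = 0, right = Inf) {
  ll <- ifelse(y <= left,
    pnorm((left - mu) / sigma, log.p = TRUE),
    ifelse(y >= right,
      pnorm((right - mu) / sigma, lower.tail = FALSE, log.p = TRUE),
      dnorm((y - mu) / sigma, log = TRUE) - log(sigma)))
  sum(ll)
}
```

Maximizing this function over \(\beta\) and \(\gamma\) (with `mu` and `sigma` built from the covariates) is essentially what `crch()` automates.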

\(F()\) and \(f()\) are the cumulative distribution function and the probability density function of \(D\), respectively. If \(D\) is the normal distribution this model is a heteroscedastic variant of the tobit model (Tobin 1958).

Truncated responses occur when latent responses below or above some thresholds are omitted. \[y = y^*|\mathit{left} < y^* < \mathit{right}\] Then \(y\) follows a truncated distribution with probability density function \[f_{\mathit{tr}}(y, \mu, \sigma) = \frac{f\left(\frac{y - \mu}{\sigma}\right)} {F\left(\frac{\mathit{right} - \mu}{\sigma}\right) - F\left(\frac{\mathit{left} - \mu}{\sigma}\right)}\] In that case the coefficients \(\beta\), \(\gamma\), and \(\delta\) can be estimated by maximizing the sum over the data set of the log-likelihood function \[\log(f_{\mathit{tr}}(y, \mu, \sigma))\]
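Analogously, a minimal sketch of the truncated Gaussian log-likelihood (again with a hypothetical function name, not package code):

```
## Minimal sketch: truncated Gaussian log-likelihood following the
## density above; the denominator renormalizes the density to the
## interval (left, right).
lltruncnorm <- function(y, mu, sigma, left = 0, right = Inf) {
  num   <- dnorm((y - mu) / sigma, log = TRUE) - log(sigma)
  denom <- log(pnorm((right - mu) / sigma) - pnorm((left - mu) / sigma))
  sum(num - denom)
}
```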

The models from the previous section can both be fitted with the `crch()` function provided by the *crch* package. This function takes a formula and data, sets up the likelihood function, gradients, and Hessian matrix, and uses `optim()` to maximize the likelihood. It returns an S3 object for which various standard methods are available. The interface is built to be as similar to `glm()` as possible to facilitate usage.

```
crch(formula, data, subset, na.action, weights, offset, link.scale = "log",
dist = "gaussian", df = NULL, left = -Inf, right = Inf, truncated = FALSE,
control = crch.control(...), model = TRUE, x = FALSE, y = FALSE, ...)
```

Here `formula`, `data`, `na.action`, `weights`, and `offset` have their standard model frame meanings (e.g., Chambers and Hastie 1992). However, as provided in the *Formula* package (Zeileis and Croissant 2010), `formula` can have two parts separated by `|`, where the first part defines the location model and the second part the scale model. E.g., with `y ~ x1 + x2 | z1 + z2` the location model is specified by `y ~ x1 + x2` and the scale model by `~ z1 + z2`. Known offsets can be specified for the location model by `offset` or for both the location and scale models inside `formula`, i.e., `y ~ x1 + x2 + offset(x3) | z1 + z2 + offset(z3)`.

The link function \(g(\cdot)\) for the scale model can be specified by `link.scale`. The default is `"log"`; also supported are `"identity"` and `"quadratic"`. Furthermore, an arbitrary link function can be specified by supplying an object of class `"link-glm"` containing `linkfun`, `linkinv`, `mu.eta`, and `name`. It must also contain the second derivative `dmu.deta` if analytical Hessians are employed.
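For illustration, such a `"link-glm"` object can be assembled from `stats::make.link()`, which already supplies `linkfun`, `linkinv`, `mu.eta`, and `name`; only the `dmu.deta` element has to be added by hand (a sketch; the object name `loglink` is our choice):

```
## Sketch: a "link-glm" object for the log link built by hand.
## dmu.deta is the second derivative of the inverse link:
## d^2 exp(eta) / d eta^2 = exp(eta).
loglink <- stats::make.link("log")
loglink$dmu.deta <- function(eta) exp(eta)
## loglink could then be supplied as link.scale = loglink
```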

`dist` specifies the used distribution. Currently supported are `"gaussian"` (the default), `"logistic"`, and `"student"`. If `dist = "student"` the degrees of freedom can be set by the `df` argument. If set to `NULL` (the default), the degrees of freedom are estimated by maximum likelihood (Equation (3)).

`left` and `right` define the lower and upper censoring or truncation points, respectively. The logical argument `truncated` defines whether a censored or a truncated model is estimated. Note that a wrapper function `trch()` also exists that is equivalent to `crch()` but with default `truncated = TRUE`.

The maximum likelihood estimation is carried out with the R function `optim()` using control options specified in `crch.control()`. By default the `"BFGS"` method is applied. If no starting values are supplied, coefficients from `lm()` are used as starting values for the location part. For the scale model, the intercept is initialized with the link function of the residual standard deviation from `lm()`, and the remaining scale coefficients are initialized with 0. If the degrees of freedom of a student-t distribution are estimated, they are initialized by 10. For the student-t distribution with estimated degrees of freedom the covariance matrix estimate is derived from the numerical Hessian returned by `optim()`. For fixed degrees of freedom and Gaussian and logistic distributions the covariance matrix is derived analytically. However, by setting `hessian = TRUE` the numerical Hessian can be employed for those models as well.

Finally, `model`, `y`, and `x` specify whether the model frame, response, or model matrix are returned.

The returned model fit of class `"crch"` is a list similar to `"glm"` objects. Some components like `coefficients` are lists with elements for location, scale, and degrees of freedom. The package also provides a set of extractor methods for `"crch"` objects that are listed in Table 1.

| Function | Description |
|---|---|
| `print()` | Print function call and estimated coefficients. |
| `summary()` | Standard regression output (coefficient estimates, standard errors, partial Wald tests). Returns an object of class `"summary.crch"` containing summary statistics which has a `print()` method. |
| `coef()` | Extract model coefficients where `model` specifies whether a single vector containing all coefficients (`"full"`) or the coefficients for the location (`"location"`), scale (`"scale"`), or degrees of freedom (`"df"`) are returned. |
| `vcov()` | Variance-covariance matrix of the estimated coefficients. |
| `predict()` | Predictions for new data where `type` controls whether location (`"response"`/`"location"`), scale (`"scale"`), or quantiles (`"quantile"`) are predicted. Quantile probabilities are specified by `at`. |
| `fitted()` | Fitted values for observed data where `type` controls whether location (`"location"`) or scale (`"scale"`) values are returned. |
| `residuals()` | Extract various types of residuals where `type` can be `"standardized"` (default), `"pearson"`, `"response"`, or `"quantile"`. |
| `terms()` | Extract terms of model components. |
| `logLik()` | Extract fitted log-likelihood. |

Table 1: Functions and methods for objects of class `"crch"`.
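To give an impression of these methods, a sketch of typical calls on a fitted `"crch"` object `fit` follows (`fit` and `newdata` are placeholder names for a fitted model and a data frame with the model covariates):

```
## Sketch: typical use of the extractor methods from Table 1.
coef(fit, model = "location")     # location coefficients only
vcov(fit)                         # covariance matrix of the estimates
predict(fit, newdata, type = "quantile", at = c(0.25, 0.5, 0.75))
residuals(fit, type = "quantile") # quantile residuals
logLik(fit)                       # fitted log-likelihood
```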

In addition to the `crch()` function and corresponding methods, the *crch* package also provides probability density, cumulative distribution, random number, and quantile functions for censored and truncated normal, logistic, and student-t distributions. Furthermore, it provides a function `hxlr()` (heteroscedastic extended logistic regression) to fit heteroscedastic interval-censored regression models (Messner et al. 2014c).

Note that as alternatives to `crch()`, heteroscedastic censored and truncated models could also be fitted with the R package *gamlss* (Rigby and Stasinopoulos 2005) with the add-on packages *gamlss.cens* and *gamlss.tr*. However, for the special case of linear censored or truncated regression models with the Gaussian, logistic, or student-t distribution, *crch* provides a fast and convenient interface and various useful methods for analysis and prediction.

This section shows a weather forecast example application of censored and truncated regression models fitted with `crch()`. Weather forecasts
are usually based on numerical weather prediction (NWP) models that take
the current state of the atmosphere and compute future weather by
numerically simulating the most important atmospheric processes.
However, because of uncertain initial conditions and unknown or
unresolved processes these numerical predictions are always subject to
errors. To estimate these errors, many weather centers provide so called
ensemble forecasts: several NWP runs that use different initial
conditions and model formulations. Unfortunately these ensemble
forecasts cannot consider all error sources so they are often still
biased and uncalibrated. Thus they are often calibrated and corrected
for systematic errors by statistical post-processing.

One popular post-processing method is heteroscedastic linear regression where the ensemble mean is used as regressor for the location and the ensemble standard deviation or variance is used as regressor for the scale (e.g., Gneiting et al. 2005). Because not all meteorological variables can be assumed to be normally distributed this idea has also been extended to other distributions including truncated regression for wind (Thorarinsdottir and Gneiting 2010) and censored regression for wind power (Messner et al. 2014b) or precipitation (Messner et al. 2014a).

The following example applies heteroscedastic censored regression with a logistic distribution assumption to precipitation data in Innsbruck (Austria). Furthermore, a two-part model tests whether the occurrence of precipitation and the precipitation amount are driven by the same process.

First, the *crch* package is loaded together with an included
precipitation data set with forecasts and observations for Innsbruck
(Austria):

```
R> library("crch")
R> data("RainIbk", package = "crch")
```

The `data.frame` `RainIbk` contains observed 3 day-accumulated precipitation amounts (`rain`) and the corresponding 11 member ensemble forecasts of total accumulated precipitation amount between 5 and 8 days in advance (`rainfc.1`, `rainfc.2`, \(\ldots\), `rainfc.11`). The rownames are the end dates of the 3 days over which the precipitation amounts are accumulated, respectively; i.e., the respective forecasts are issued 8 days before these dates.

In previous studies it has been shown to be advantageous to model the square root of precipitation rather than precipitation itself. Thus all precipitation amounts are square rooted before the ensemble mean and standard deviation are derived. Furthermore, events with no variation in the ensemble are omitted:

```
R> RainIbk <- sqrt(RainIbk)
R> RainIbk$ensmean <- apply(RainIbk[,grep('^rainfc',names(RainIbk))], 1, mean)
R> RainIbk$enssd <- apply(RainIbk[,grep('^rainfc',names(RainIbk))], 1, sd)
R> RainIbk <- subset(RainIbk, enssd > 0)
```

A scatterplot of `rain` against `ensmean`

```
R> plot(rain ~ ensmean, data = RainIbk, pch = 19, col = gray(0, alpha = 0.2))
R> abline(0, 1, col = "red")
```

indicates a linear relationship that differs from a 1-to-1 relationship (Figure 1). Precipitation is clearly non-negative with many zero observations. Thus censored regression or a two-part model are suitable to estimate this relationship.

First we fit a logistic censored model for `rain` with `ensmean` as regressor for the location and `log(enssd)` as regressor for the scale.

```
R> CRCH <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0,
+    dist = "logistic")
R> summary(CRCH)
```

```
Call:
crch(formula = rain ~ ensmean | log(enssd), data = RainIbk,
    dist = "logistic", left = 0)

Standardized residuals:
    Min      1Q  Median      3Q     Max 
-3.5780 -0.6554  0.1673  1.1189  7.4990 

Coefficients (location model):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.85266    0.06903  -12.35   <2e-16 ***
ensmean      0.78686    0.01921   40.97   <2e-16 ***

Coefficients (scale model with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.11744    0.01460   8.046 8.58e-16 ***
log(enssd)   0.27055    0.03503   7.723 1.14e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Distribution: logistic
Log-likelihood: -8921 on 4 Df
Number of iterations in BFGS optimization: 15
```

Both `ensmean` and `log(enssd)` are highly significant according to the Wald test performed by the `summary()` method. The location model is also shown in Figure 1:

```
R> abline(coef(CRCH)[1:2], col = "blue")
```

If we compare this model to a constant scale model (tobit model with logistic distribution)

```
R> CR <- crch(rain ~ ensmean, data = RainIbk, left = 0, dist = "logistic")
R> cbind(AIC(CR, CRCH), BIC = BIC(CR, CRCH)[,2])
```

```
     df      AIC      BIC
CR    3 17905.69 17925.22
CRCH  4 17850.30 17876.33
```

we see that the scale model clearly improves the fit regarding AIC and BIC.

A comparison of the logistic model with a Gaussian and a student-t model

```
R> CRCHgau <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0,
+    dist = "gaussian")
R> CRCHstud <- crch(rain ~ ensmean | log(enssd), data = RainIbk, left = 0,
+    dist = "student")
R> AIC(CRCH, CRCHgau, CRCHstud)
```

```
         df      AIC
CRCH      4 17850.30
CRCHgau   4 17897.23
CRCHstud  5 17850.65
```

confirms the logistic distribution assumption. Note that with the estimated degrees of freedom of 9.56 the student-t distribution resembles the (scaled) logistic distribution quite well (see Figure 2).
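This resemblance can be visualized with a few lines of base R. The following is an illustrative sketch (not the code behind Figure 2), rescaling the t density so that both densities have the standard deviation \(\pi/\sqrt{3}\) of the standard logistic distribution:

```
## Sketch: standard logistic density vs. a rescaled student-t density
## with the estimated 9.56 degrees of freedom. A t distribution with
## df degrees of freedom has standard deviation sqrt(df / (df - 2)).
s <- (pi / sqrt(3)) / sqrt(9.56 / (9.56 - 2))
x <- seq(-10, 10, by = 0.1)
plot(x, dlogis(x), type = "l", ylab = "density")
lines(x, dt(x / s, df = 9.56) / s, lty = 2)
```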

In the censored model the occurrence of precipitation and the precipitation amount are assumed to be driven by the same process. To test this assumption we compare the censored model with a two-part model consisting of a heteroscedastic logit model and a truncated regression model with logistic distribution assumption. For the heteroscedastic logit model we use `hetglm()` from the *glmx* package, and for the truncated model we employ the `crch()` function with the argument `truncated = TRUE`.

```
R> library("glmx")
R> BIN <- hetglm(I(rain > 0) ~ ensmean | log(enssd), data = RainIbk,
+    family = binomial(link = "logit"))
R> TRCH <- crch(rain ~ ensmean | log(enssd), data = RainIbk, subset = rain > 0,
+    left = 0, dist = "logistic", truncated = TRUE)
```

In the heteroscedastic logit model, the intercept of the scale model is not identified. Thus, the location coefficients of the censored and truncated regression models have to be scaled to compare them with the logit model.

```
R> cbind("CRCH" = c(coef(CRCH, "location")/exp(coef(CRCH, "scale"))[1],
+    coef(CRCH, "scale")[2]),
+    "BIN" = coef(BIN),
+    "TRCH" = c(coef(TRCH, "location")/exp(coef(TRCH, "scale"))[1],
+    coef(TRCH, "scale")[2]))
```

```
                  CRCH        BIN      TRCH
(Intercept) -0.7581811 -1.0181715 0.2635421
ensmean      0.6996699  0.7789091 0.5455966
log(enssd)   0.2705476  0.4539908 0.2326229
```

The different (scaled) coefficients indicate that different processes drive the occurrence of precipitation and precipitation amount. This is also confirmed by AIC and BIC that are clearly better for the two-part model than for the censored model:

```
R> loglik <- c("Censored" = logLik(CRCH), "Two-Part" = logLik(BIN) + logLik(TRCH))
R> df <- c(4, 7)
R> aic <- -2 * loglik + 2 * df
R> bic <- -2 * loglik + log(nrow(RainIbk)) * df
R> cbind(df, AIC = aic, BIC = bic)
```

```
         df      AIC      BIC
Censored  4 17850.30 17876.33
Two-Part  7 17744.82 17790.39
```

Finally, we can use the fitted models to predict future precipitation. Assume, for example, that the current NWP forecast of square rooted precipitation has an ensemble mean of 1.8 and an ensemble standard deviation of 0.9. A media
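A prediction along these lines can be sketched with `predict()` and the censored model `CRCH` fitted above (a sketch under the stated forecast values, not necessarily the continuation of the original example):

```
## Sketch: median forecast of square rooted precipitation from the
## censored model for ensemble mean 1.8 and standard deviation 0.9.
newdata <- data.frame(ensmean = 1.8, enssd = 0.9)
predict(CRCH, newdata = newdata, type = "quantile", at = 0.5)
```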