A package for constructing and using an adaptive mixture of Student-t distributions as a flexible candidate distribution for efficient simulation.
This note presents the package AdMit (Ardia et al. 2008, 2009), an R implementation of the adaptive mixture of Student-t distributions (AdMit) procedure developed by Hoogerheide (2006); see also Hoogerheide et al. (2007; Hoogerheide and van Dijk 2008). The AdMit strategy consists of the construction of a mixture of Student-t distributions which approximates a target distribution of interest. The fitting procedure relies only on a kernel of the target density, so that the normalizing constant is not required. In a second step, this approximation is used as an importance function in importance sampling or as a candidate density in the independence chain Metropolis-Hastings (M-H) algorithm to estimate characteristics of the target density. The estimation procedure is fully automatic and thus avoids the difficult task, especially for non-experts, of tuning a sampling algorithm. Typically, the target is a posterior distribution in a Bayesian analysis, where we indeed often only know a kernel of the posterior density.
In a standard case of importance sampling or the independence chain M-H algorithm, the candidate density is unimodal. If the target distribution is multimodal then some draws may have huge weights in the importance sampling approach and a second mode may be completely missed in the M-H strategy. As a consequence, the convergence behavior of these Monte Carlo integration methods is rather uncertain. Thus, an important problem is the choice of the importance or candidate density, especially when little is known a priori about the shape of the target density. For both importance sampling and the independence chain M-H, it holds that the candidate density should be close to the target density, and it is especially important that the tails of the candidate should not be thinner than those of the target.
Hoogerheide (2006) and Hoogerheide et al. (2007) mention several reasons why mixtures of Student-t distributions are natural candidate densities. First, they can provide an accurate approximation to a wide variety of target densities, with substantial skewness and high kurtosis. Furthermore, they can deal with multi-modality and with non-elliptical shapes due to asymptotes. Second, this approximation can be constructed in a quick, iterative procedure and a mixture of Student-t distributions is easy to sample from. Third, the Student-t distribution has fatter tails than the Normal distribution; especially if one specifies Student-t distributions with few degrees of freedom, the risk is small that the tails of the candidate are thinner than those of the target distribution. Finally, Zeevi and Meir (1997) showed that under certain conditions any density function may be approximated to arbitrary accuracy by a convex combination of basis densities; the mixture of Student-t distributions falls within their framework.
The package AdMit consists of three main functions: AdMit
, AdMitIS
and AdMitMH
. The first one allows the user to fit a mixture of
Student-t distributions to a given density through its kernel function.
The next two functions perform importance sampling and independence
chain M-H sampling using the fitted mixture estimated by AdMit
as the
importance or candidate density, respectively.
To illustrate the use of the package, we apply the AdMit methodology to a bivariate bimodal distribution. We describe the use of the functions provided by the package and document the ability and relevance of the methodology to reproduce the shape of non-elliptical distributions.
This section presents the functions provided by the package AdMit with an illustration of a bivariate bimodal distribution. This distribution belongs to the class of conditionally Normal distributions proposed by Gelman and Meng (1991) with the property that the joint density is not Normal. It is not a posterior distribution, but it is chosen because it is a simple distribution with non-elliptical shapes so that it allows for an easy illustration of the AdMit approach.
Let \(X_1\) and \(X_2\) be two random variables, for which \(X_1\) is Normally distributed given \(X_2\) and vice versa. Then, the joint distribution, after location and scale transformations in each variable, can be written as (see Gelman and Meng 1991): \[\begin{aligned} \begin{split} p(\mathbf x) \propto k(\mathbf x) &\doteq \exp \Big[ -\tfrac{1}{2} \big(A x_1^2 x_2^2 + x_1^2 + x_2^2\\ &\quad\quad - 2 B x_1 x_2 - 2 C_1 x_1 - 2 C_2 x_2\big) \Big]\, \end{split} \end{aligned} \] where \(p(\mathbf x)\) denotes a density, \(k(\mathbf x)\) a kernel and \(\mathbf x\doteq (x_1,x_2)'\) for notational purposes. \(A\), \(B\), \(C_1\) and \(C_2\) are constants; we consider an asymmetric case in which \(A=5\), \(B=5\), \(C_1=3\), \(C_2=3.5\) in what follows.
The adaptive mixture approach determines the number of mixture components \(H\), the mixing probabilities, the modes and scale matrices of the components in such a way that the mixture density \(q(\mathbf x)\) approximates the target density \(p(\mathbf x)\) of which we only know a kernel function \(k(\mathbf x)\) with \(\mathbf x\in \mathbb R^d\). Typically, \(k(\mathbf x)\) will be a posterior density kernel for a vector of model parameters \(\mathbf x\).
The AdMit strategy consists of the following steps:
Initialization: computation of the mode and scale matrix of the first component, and drawing a sample from this Student-t distribution;
Iterate on the number of components: add a new component that covers a part of the space of \(\mathbf x\) where the previous mixture density was relatively small, as compared to \(k(\mathbf x)\);
Optimization of the mixing probabilities;
Drawing a sample \(\{\mathbf x^{[i]}\}\) from the new mixture \(q(\mathbf x)\);
Evaluation of importance sampling weights \(\{ k(\mathbf x^{[i]})/q(\mathbf x^{[i]}) \}\). If the coefficient of variation, i.e. the standard deviation divided by the mean, of the weights has converged, then stop. Otherwise, go to step (1). By default, the adaptive procedure stops if the relative change in the coefficient of variation of the importance sampling weights caused by adding one new Student-t component to the candidate mixture is less than 10%.
There are two main reasons for using the coefficient of variation of the importance sampling weights as a measure of the fitting quality. First, it is a natural, intuitive measure of quality of the candidate as an approximation to the target. If the candidate and the target distributions coincide, all importance sampling weights are equal, so that the coefficient of variation is zero. For a poor candidate that not even roughly approximates the target, some importance sampling weights are huge while most are (almost) zero, so that the coefficient of variation is high. The better the candidate approximates the target, the more evenly the weight is divided among the candidate draws, and the smaller the coefficient of variation of the importance sampling weights. Second, Geweke (1989) argues that a reasonable objective in the choice of an importance density is the minimization of the variance, or equivalently the coefficient of variation, of the importance weights. We prefer to quote the coefficient of variation because it does not depend on the number of draws or the integration constant of the posterior density kernel, unlike the standard deviation of the scaled or unscaled importance weights, respectively. The coefficient of variation merely reflects the quality of the candidate as an approximation to the target.
Note also that all Student-t components in the mixture approximation \(q(\mathbf x)\) have the same degrees of freedom parameter. Moreover, this value is not determined by the procedure but set by the user; by default, the algorithm uses the value one (i.e. the approximation density is a mixture of Cauchy) since i) it enables the method to deal with fat-tailed target distributions and ii) it makes it easier for the iterative procedure to detect modes that are far apart. There may be a trade-off between efficiency and robustness at this point: setting the degrees of freedom to one reflects that we find a higher robustness more valuable than a slight possible gain in efficiency.
The core function provided by the package is the function AdMit
. The
main arguments of the function are: KERNEL
, a kernel function of the
target density on which the approximation is constructed. This function
must contain the logical argument log
. When log = TRUE
, the function
KERNEL
returns the logarithm value of the kernel function; this is
used for numerical stability. mu0
is the starting value of the first
stage optimization; it is a vector whose length corresponds to the
length of the first argument in KERNEL
. Sigma0
is the (symmetric
positive definite) scale matrix of the first component. If a matrix is
provided by the user, it is used as the scale matrix of the first
component and then mu0
is used as the mode of the first component
(instead of a starting value for optimization). control
is a list of
tuning parameters, containing in particular: df
(default: 1), the
degrees of freedom of the mixture components and CVtol
(default: 0.1),
the tolerance of the relative change of the coefficient of variation.
For further details, the reader is referred to the documentation manual
(by typing ?AdMit
) or to Ardia et al. (2009). Finally, the
last argument of AdMit
is …
which allows the user to pass additional
arguments to the function KERNEL
.
Let us come back to our bivariate conditionally Normal distribution. First, we need to code the kernel:
> GelmanMeng <- function(x, log = TRUE)
+ \{
+ if (is.vector(x))
+ x <- matrix(x, nrow = 1)
+ r <- -0.5 * ( 5 * x[,1]^2 * x[,2]^2
+ + x[,1]^2 + x[,2]^2
+ - 10 * x[,1] * x[,2]
+ - 6 * x[,1] - 7 * x[,2] )
+ if (!log)
+ r <- exp(r)
+ as.vector(r)
+ \}
Note that the argument log
is set to TRUE
by default so that the
function outputs the logarithm of the kernel. Moreover, the function is
vectorized to speed up the computations. The contour plot of
GelmanMeng
is displayed in the left part of Figure 1.
We now use the function AdMit
to find a suitable approximation for the
density whose kernel is GelmanMeng
. We use the starting value
mu0 = c(0, 0.1)
for the first stage optimization and assign the result
of the function to outAdMit
.
> set.seed(1234)
> outAdMit <- AdMit(KERNEL = GelmanMeng,
+ mu0 = c(0, 0.1))
$CV
\1] 8.0737 1.7480 0.9557 0.8807
[
$mit
\$mit\$p
\
cmp1 cmp2 cmp3 cmp4 0.54707 0.14268 0.21053 0.09972
$mit\$mu
\
k1 k20.3640 3.2002
cmp1 3.7088 0.3175
cmp2 1.4304 1.1240
cmp3 0.8253 2.0261
cmp4
$mit\$Sigma
\
k1k1 k1k2 k2k1 k2k20.03903 -0.15606 -0.15606 1.22561
cmp1 0.86080 -0.08398 -0.08398 0.02252
cmp2 0.13525 -0.06743 -0.06743 0.09520
cmp3 0.03532 -0.04072 -0.04072 0.21735
cmp4
$mit\$df
\1] 1
[
$summary
\
H METHOD.mu TIME.mu METHOD.p TIME.p CV1 1 BFGS 0.21 NONE 0.00 8.0737
2 2 BFGS 0.08 NLMINB 0.08 1.7480
3 3 BFGS 0.12 NLMINB 0.08 0.9557
4 4 BFGS 0.20 NLMINB 0.19 0.8807
The output of the function AdMit
is a list. The first component is
CV
, a vector of length \(H\) which gives the value of the coefficient of
variation of the importance sampling weights at each step of the
adaptive fitting procedure. The second component is mit
, a list which
consists of four components giving information on the fitted mixture of
Student-t distributions: p
is a vector of length \(H\) of mixing
probabilities, mu
is a \(H\times d\) matrix whose rows give the modes of
the mixture components, Sigma
is a \(H \times d^2\) matrix whose rows
give the scale matrices (in vector form) of the mixture components and
df
is the degrees of freedom of the Student-t components. The third
component of the list returned by AdMit
is summary
. This is a data
frame containing information on the adaptive fitting procedure. We refer
the reader to the documentation manual for further details.
For the kernel GelmanMeng
, the approximation constructs a mixture of
four components. The value of the coefficient of variation decreases
from 8.0737 to 0.8807. A contour plot of the four-component
approximation is displayed in the right-hand side of
Figure 1. This graph is produced using the function dMit
which returns the density of the mixture given by the output
outAdMit$mit
. The contour plot illustrates that the four-component
mixture provides a good approximation of the density: the areas with
substantial target density mass are covered by sufficient density mass
of the mixture approximation.
Once the adaptive mixture of Student-t distributions is fitted to the
density using a kernel, the approximation provided by AdMit
is used as
the importance sampling density in importance sampling or as the
candidate density in the independence chain M-H algorithm.
The AdMit package contains the AdMitIS
function which performs
importance sampling using the mixture approximation in outAdMit$mit
as
the importance density. The arguments of the function AdMitIS
are: (i)
N
, the number of draws used in importance sampling; (ii) KERNEL
, a
kernel function of the target density; (iii) G
, the function of which
the expectation \(E_p\big[g(\mathbf x)\big]\) is estimated; By default,
the function G
is the identity so that the function outputs a vector
containing the mean estimates for the components of \(\mathbf x\).
Alternative functions may be provided by the user to obtain other
quantities of interest for \(p(\mathbf x)\). The only requirement is that
the function outputs a matrix; (iv) mit
, a list providing information
on the mixture approximation; (v) …
allows additional parameters to be
passed to the function KERNEL
and/or G
.
Let us apply the function AdMitIS
to the kernel GelmanMeng
using the
approximation outAdMit$mit
:
> outAdMitIS <- AdMitIS(N = 1e5,
+ KERNEL = GelmanMeng,
+ mit = outAdMit\$mit)
$ghat
\1] 0.9556 2.2465
[
$NSE
\1] 0.003833 0.005639
[
$RNE
\1] 0.6038 0.5536 [
The output of the function AdMitIS
is a list. The first component is
ghat
, the importance sampling estimator of
\(E_p\big[g(\mathbf x)\big]\). The second component is NSE
, a vector
containing the numerical standard errors (the variation of the estimates
that can be expected if the simulations were to be repeated) of the
components of ghat
. The third component is RNE
, a vector containing
the relative numerical efficiencies of the components of ghat
(the
ratio between an estimate of the variance of an estimator based on
direct sampling and the importance sampling estimator’s estimated
variance with the same number of draws). RNE
is an indicator of the
efficiency of the chosen importance function; if target and importance
densities coincide, RNE
equals one, whereas a very poor importance
density will have a RNE
close to zero. Both NSE
and RNE
are
estimated by the method given in Geweke (1989).
Further, the AdMit package contains the AdMitMH
function which uses
the mixture approximation in outAdMit$mit
as the candidate density in
the independence chain M-H algorithm. The arguments of the function
AdMitMH
are: (i) N
, the length of the MCMC sequence of draws; (ii)
KERNEL
, a kernel function of the target density; (iii) mit
, a list
providing information on the mixture approximation; (iv) …
allows
additional parameters to be passed to the function KERNEL
.
Let us apply the function AdMitMH
to the kernel GelmanMeng
using the
approximation outAdMit$mit
:
> outAdMitMH <- AdMitMH(N = 101000,
+ KERNEL = GelmanMeng,
+ mit = outAdMit\$mit)
$draws
\
k1 k21 7.429e-01 3.021e-01
2 7.429e-01 3.021e-01
3 7.429e-01 3.021e-01
4 1.352e+00 1.316e+00
5 1.011e+00 1.709e+00
6 1.011e+00 1.709e+00
7 1.005e+00 1.386e+00
...
$accept
\1] 0.5119 [
The output of the function AdMitMH
is a list of two components. The
first component is draws
, a \(N \times d\) matrix containing draws from
the target density \(p(\mathbf x)\) in its rows. The second component is
accept
, the acceptance rate of the independence chain M-H algorithm.
The package coda (Plummer et al. 2008) can be used to check the convergence of the
MCMC chain and obtain quantities of interest for \(p(\mathbf x)\). Here,
for simplicity, we discard the first 1’000 draws as a burn-in sample and
transform the output outAdMitMH$draws
in a mcmc
object using the
function as.mcmc
provided by coda. A summary of the MCMC chain can
be obtained using summary
:
> library("coda")
> draws <- as.mcmc(outAdMitMH\$draws)
> draws <- window(draws, start = 1001)
> colnames(draws) <- c("X1", "X2")
> summary(draws)\$stat
-series SE
Mean SD Naive SE Time0.9643 0.9476 0.002996 0.004327
X1 2.2388 1.3284 0.004201 0.006232 X2
We note that the mean estimates are close to the values obtained with
the function AdMitIS
. The relative numerical efficiency (RNE) can be
computed using the functions effectiveSize
and niter
:
> effectiveSize(draws) / niter(draws)
X1 X2 0.3082 0.3055
These relative numerical efficiencies reflect the good quality of the candidate density in the independence chain M-H algorithm.
The approach is compared to the standard Gibbs sampler, which is
extremely easy to implement here since the full conditional densities
are Normals. We generated \(N=101000\) draws from a Gibbs sampler using
the first 1’000 draws as a burn-in period. In this case, the RNE were
0.06411 and 0.07132, respectively. In Figure 2 we display the
autocorrelogram for the MCMC output of the AdMitMH
function (upper
graphs) together with the autocorrelogram of the Gibbs (lower part). We
clearly notice that the Gibbs sequence is mixing much more slowly,
explaining the lower RNE values. This example illustrates the phenomenon
that the autocorrelation in the Gibbs sequence may be high when
assessing a non-elliptical distribution, and that the AdMit approach may
overcome this.
This note presented the package AdMit which provides functions to approximate and sample from a certain target distribution given only a kernel of the target density function. The estimation procedure is fully automatic and thus avoids the time-consuming and difficult task of tuning a sampling algorithm. The use of the package has been illustrated in a bivariate bimodal distribution.
Interested users are referred to Ardia et al. (2009) for a more complete discussion on the AdMit package. The article provides a more extensive introduction of the package usage as well as an illustration of the relevance of the AdMit procedure with the Bayesian estimation of a mixture of ARCH model fitted to foreign exchange log-returns data. The methodology is compared to standard cases of importance sampling and the Metropolis-Hastings algorithm using a naive candidate and with the Griddy-Gibbs approach. It is shown that for investigating means and particularly tails of the joint posterior distribution the AdMit approach is preferable.
The authors acknowledge an anonymous reviewer for helpful comments that have led to improvements of this note. The first author is grateful to the Swiss National Science Foundation (under grant #FN PB FR1-121441) for financial support. The third author gratefully acknowledges the financial assistance from the Netherlands Organization of Research (under grant #400-07-703).
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ardia, et al., "AdMit", The R Journal, 2009
BibTeX citation
@article{RJ-2009-003, author = {Ardia, David and Hoogerheide, Lennart F. and Dijk, Herman K. van}, title = {AdMit}, journal = {The R Journal}, year = {2009}, note = {https://rjournal.github.io/}, volume = {1}, issue = {1}, issn = {2073-4859}, pages = {25-30} }