The g-and-k and (generalised) g-and-h distributions are flexible univariate distributions which can model highly skewed or heavy tailed data through only four parameters: location and scale, and two shape parameters influencing the skewness and kurtosis. These distributions have the unusual property that they are defined through their quantile function (inverse cumulative distribution function) and their density is unavailable in closed form, which makes parameter inference complicated. This paper presents the gk R package to work with these distributions. It provides the usual distribution functions and several algorithms for inference of independent identically distributed data, including the finite difference stochastic approximation method, which has not been used before for this problem.
Statisticians have long sought a simple extension of the normal distribution which can model data subject to skew, heavy tails or both. One approach is to transform a standard normal random variable \(Z \sim \mathcal{N}(0,1)\) to \[\label{eq:Qgen} X = A + B G(Z) H(Z), \tag{1}\] where \(A\) and \(B\) are location and scale parameters, \(G(\cdot)\) introduces asymmetry, and \(H(\cdot)\) elongates the tails of the distribution while having little effect near the mode. This paper considers two such distributions, the \(g\)-and-\(k\) and generalised \(g\)-and-\(h\) distributions. These distributions can model many types of behaviour through just a small number of parameters.
Defining random variables as transformations of \(Z\) is equivalent to specifying the distribution’s quantile function (defined in the next section), and distributions of this type are known as quantile distributions. Work on quantile distributions goes back at least to Hastings et al. (1947). See Gilchrist (2000) for a book length treatment of their history and use in statistics. Tukey (1977) proposed the form (1) and a distribution using it: the original \(g\)-and-\(h\) distribution. Haynes et al. (1997) were the first to use the two distributions considered in this paper: the \(g\)-and-\(k\) distribution and a generalised form of the \(g\)-and-\(h\) distribution. For brevity henceforth “\(g\)-and-\(h\) distribution” will refer to their generalised form. See Peters et al. (2016) for a thorough review of these and other distributions based on (1).
Applications of the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions have included environmental data (Rayner and MacGillivray 2002), financial returns (Drovandi and Pettitt 2011) and insurance risk (Peters et al. 2016). There has also been considerable methodological work on inference for these distributions (e.g. Rayner and MacGillivray 2002; Haynes and Mengersen 2005; Allingham et al. 2009; Drovandi and Pettitt 2011; Fearnhead and Prangle 2012). This is because it is not possible to express the densities of quantile distributions in closed form beyond some special cases, which makes it difficult to apply standard likelihood-based inference methods.
This paper presents the gk R package to work with the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions. The remaining sections cover the definition and properties of these distributions, the package’s distribution functions, a check of parameter validity, the inference methods provided, an illustrative data analysis, and a concluding summary.
The cumulative distribution function (cdf) of a univariate random variable \(X\), \(F_X:\mathbb{R} \to [0,1]\), is defined as \(\Pr\left(X \leq x \right)\). (Later we will often drop subscripts where they are clear from the context.) The cdf suffices to completely specify the probability distribution of \(X\). It is often the case that the cdf is not available in closed form but is implicitly defined through its derivative, the probability density function (pdf). An example of this is the normal distribution.
For quantile distributions, the cdf is implicitly defined through its inverse, the quantile function \(F_X^{-1}(u)\) where \(F_X^{-1}:[0,1] \to \mathbb{R}\). The \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions use a quantile function of the form \(F^{-1}(u;\theta) = Q\left(z(u);\theta \right)\) where \(z(\cdot)\) is the \(\mathcal{N}(0,1)\) quantile function and \(\theta\) is a vector of parameters. The \(Q\) functions are: \[\begin{aligned} Q_{gk}(z;A,B,g,k,c) &= A + B\left(1+c \tanh[gz/2] \right)z \left(1+z^2 \right)^k \label{eq:Qgk} \end{aligned} \tag{2}\]
\[\begin{aligned} Q_{gh}(z;A,B,g,h,c) &= A + B\left(1+c \tanh[gz/2] \right)z \exp \left(hz^2/2 \right). \label{eq:Qgh} \end{aligned} \tag{3}\] It is possible to sample from the distributions using the inversion method, that is, by simulating \(U \sim \mathcal{U}(0,1)\) and substituting it into the quantile function. Equivalently one can sample \(Z \sim \mathcal{N}(0,1)\) and substitute it into \(Q_{gk}\) or \(Q_{gh}\) i.e. the process described in the introduction based on Equation (1). In terms of (1), \(G(z) = 1+c \tanh(gz/2)\) produces asymmetry and \(H(z) = z(1+z^2)^k\) or \(z \exp\left(hz^2/2 \right)\) elongates tails.
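To make the sampling recipe concrete, here is a minimal sketch which implements (2) directly; the parameter values are arbitrary, and in practice the package’s rgk function (described later) does this work.

# Illustrative sketch: sample from the g-and-k distribution by pushing
# N(0,1) draws through the quantile transformation Q_gk in equation (2).
Qgk = function(z, A, B, g, k, c = 0.8) {
  A + B * (1 + c * tanh(g * z / 2)) * z * (1 + z^2)^k
}
z = rnorm(1000)                            # Z ~ N(0,1)
x = Qgk(z, A = 0, B = 1, g = 2, k = 0.5)   # g-and-k draws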
Each distribution has four main parameters: \(A\) (location), \(B\) (scale), \(g\) (shape parameter mainly affecting skewness), and \(k\) or \(h\) (shape parameter mainly affecting kurtosis). The remaining parameter \(c\) is discussed below. When both shape parameters are zero the distribution is simply \(\mathcal{N}(A,B^2)\). An illustration of the flexible shapes that the \(g\)-and-\(k\) density can take is given in Figure 1. The \(g\)-and-\(h\) distribution can produce similar shapes, with the following exception: the \(g\)-and-\(k\) distribution allows negative values of \(k\), which can produce lighter tails than a normal distribution, but also bimodal distributions of potentially limited usefulness.
Well-defined continuous distributions result from parameter values producing strictly increasing quantile functions. Determining when this is true is complicated so discussion is postponed to a later section. For now note that it is standard to take \(B>0\) and fix \(c=0.8\) (which will be assumed throughout unless mentioned otherwise), and in this case \(k \geq 0\) or \(h \geq 0\) guarantees a valid distribution.
The gk package provides the standard suite of R functions for the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions i.e. random sampling and calculation of the pdf, cdf and quantile functions. This section describes how these functions are implemented. It is assumed that parameters have been chosen such that the quantile function is strictly increasing. No warning is given when this is not the case as checking validity is time consuming (see next section).
The qgk and qgh functions calculate the quantile function \(F^{-1}(u)\). Their implementation is straightforward. First \(z(u)\) is calculated using qnorm, then this is passed to an internal function, z2gk or z2gh, which computes \(Q_{gk}\) or \(Q_{gh}\).
The rgk and rgh functions perform random sampling. This is done by the method described earlier of sampling \(\mathcal{N}(0,1)\) draws and substituting them into \(Q_{gk}\) or \(Q_{gh}\), via the function z2gk or z2gh.
The pgk and pgh functions calculate the cdf \(F(x)\) given input \(x\). They numerically solve \(Q(z) - x = 0\), which is guaranteed to have a unique root for \(z\). The required final output is then \(u = \Phi(z)\), where \(\Phi\) is the \(\mathcal{N}(0,1)\) cdf. An alternative approach would be to directly solve \(Q\left(z(u) \right) - x = 0\) for \(u\). However we found this was less numerically stable for \(u\) close to 0 or 1.
Our code finds the root for \(z\) using R’s uniroot command, with z2gk or z2gh used for \(Q\left(z \right)\) evaluations. The need to run a root finding algorithm means this function is slow relative to cdf calculations for standard distributions; see Table 1.
The functions include an argument zscale. Setting this to TRUE outputs the \(z\) value which is found rather than \(u\). This is used in the density functions below, and more generally is also useful to retain numerical precision when \(z\) has large magnitude.
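As an illustration of this approach, the following sketch solves \(Q_{gk}(z) = x\) with uniroot and returns \(\Phi(z)\). It is a simplified stand-in for pgk, not the package’s implementation, and assumes valid parameters with the default \(c = 0.8\).

# Illustrative sketch of the cdf calculation by root-finding.
pgk_sketch = function(x, A, B, g, k, c = 0.8) {
  # Q_gk from equation (2); strictly increasing for valid parameters
  Qgk = function(z) A + B * (1 + c * tanh(g * z / 2)) * z * (1 + z^2)^k
  # Solve Q_gk(z) = x for z, then convert to u = pnorm(z)
  z = uniroot(function(z) Qgk(z) - x, lower = -10, upper = 10,
              extendInt = "upX")$root
  pnorm(z)
}
pgk_sketch(1, A = 0, B = 1, g = 2, k = 0.5)   # compare with pgk(1, 0, 1, 2, 0.5)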
The dgk and dgh functions calculate the pdf \(f(x)\), or the log pdf if the argument log=TRUE is supplied. The method is based on the standard probability result that if \(A\) has density \(f_A(a)\) and \(t(a)\) is a differentiable 1-1 transformation then the density of \(B=t(A)\) is
\[f_B(b) = f_A(a) / t'(a) \qquad \text{where } a = t^{-1}(b),\]
and \(t'\) denotes the first derivative of \(t\).
For quantile distributions we have \(Z \sim \mathcal{N}(0,1)\) and \(X = Q(Z)\) for some \(Q\) function. So the pdf of \(X\) is \[f(x) = \phi(z)/Q'(z) \qquad \text{where } z = Q^{-1}(x),\] where \(\phi(z)\) is the \(\mathcal{N}(0,1)\) pdf.
Our code to calculate the pdf first finds \(z = Q^{-1}(x)\) using pgk or pgh with zscale=TRUE. Then the pdf or its log is calculated using formulae (4) and (5) (see Appendix A) for \(Q'(z)\).
The reliance on performing root finding within pgk and pgh means that dgk and dgh are slow relative to pdf calculations for standard distributions; see Table 1.
Note that an alternative representation of \(f(x)\) is \(1/q'(u)\) where \(q(u)\) represents \(F^{-1}(u)\) and \(u = F(x)\). Density calculations based on this approach are described in Rayner and MacGillivray (2002). However we found that calculating the \(u\) values required for this approach was occasionally numerically unstable, as mentioned above.
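The following sketch reproduces the density calculation for the \(g\)-and-\(k\) case, combining pgk with zscale = TRUE and the derivative formulae (4) and (6). It assumes pgk’s default \(c = 0.8\) and the argument order used elsewhere in this paper, and is meant as a check rather than a replacement for dgk.

# Illustrative sketch of f(x) = dnorm(z) / Q'_gk(z), with z = Q^{-1}(x).
library(gk)
dgk_sketch = function(x, A, B, g, k, c = 0.8) {
  # z = Q^{-1}(x), obtained from the cdf function's root-finding step
  z = pgk(x, A, B, g, k, zscale = TRUE)
  # R_gk from equation (6) and Q_gk' from equation (4)
  Rgk = (1 + c * tanh(g * z / 2)) * (1 + (2 * k + 1) * z^2) / (1 + z^2) +
    c * g * z / (2 * cosh(g * z / 2)^2)
  dQdz = B * (1 + z^2)^k * Rgk
  dnorm(z) / dQdz
}
dgk_sketch(1, A = 0, B = 1, g = 2, k = 0.5)   # compare with dgk(1, 0, 1, 2, 0.5)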
Table 1 compares the time to execute gk’s distributional functions to those for the normal distribution. It illustrates that random sampling and quantile function calculation are reasonably efficient, but calculating the cdf and pdf are expensive.
|                   | Time (microseconds) |                 |                 | Ratio vs normal |                 |
|                   | Normal | \(g\)-and-\(k\) | \(g\)-and-\(h\) | \(g\)-and-\(k\) | \(g\)-and-\(h\) |
| Quantile function | 175    | 972             | 445             | 5.56            | 2.55            |
| Random sampling   | 150    | 921             | 436             | 6.15            | 2.91            |
| cdf               | 313    | 143151          | 116928          | 457             | 374             |
| pdf               | 369    | 138381          | 111279          | 375             | 302             |
Recall that a valid continuous distribution requires the quantile function to be strictly increasing. Clearly this property is unaffected by the choice of \(A\) and \(B>0\). This section discusses the effects of \(g,h,k\) and \(c\).
Several theoretical results on valid parameters can be derived. It’s convenient to concentrate on \(c \geq 0\). In this case \(h < 0\) or \(k < -1/2\) is invalid. Taking \(k \geq 0\) or \(h \geq 0\) produces valid distributions when \(0 \leq c < c^* \approx 0.83\). This is the reason for taking \(c=0.8\) as standard: it maintains this property while allowing the skewness factor in (2) and (3) to have a large effect. For justification of all these results, see Appendix B.
When \(c=0.8\), the above results completely characterise the range of valid parameters for the \(g\)-and-\(h\) distribution. For the \(g\)-and-\(k\) distribution, there is still some uncertainty for \(-0.5 \leq k<0\), which, as mentioned earlier, corresponds to light tails. For both distributions, the case where \(c>c^*\) is less clear: even positive values of \(k\) or \(h\) do not guarantee validity. Therefore we provide the function isValid to test parameter validity numerically.
Validity can be checked by testing whether the minimum derivative of (2) or (3) is positive. Appendix A shows that it is equivalent to test whether the functions (6) or (7) are positive. isValid uses numerical optimisation to minimise these and returns whether the minimum value is positive. To reduce the possibility of finding local minima, multiple optimisation starting points can be supplied as a vector to the argument initial_z. However it is still not guaranteed that the global minimum is found, so there remains a possibility that the function may produce false positives.
The function can be used as follows to illustrate the region of valid \(g\)-and-\(k\) parameter values for \(c=0.8\). The results are plotted as Figure 2.
gk_grid = expand.grid(g = seq(-10, 10, 0.1), k = seq(-0.6, 0.1, 0.01))
v = isValid(gk_grid$g, gk_grid$k)
We do not test validity automatically within the package’s other functions. This is because isValid is relatively computationally expensive and not guaranteed to be correct. Therefore particular care should be taken for \(k<0\) or \(c > c^*\), as the distribution functions will not provide warnings when invalid parameters are used. A reasonable region of \(g\) and \(k\) values to use in practice with \(c=0.8\) can be derived from Figure 2. It shows that for \(|g|<7\) some \(-0.5 \leq k < 0\) values are invalid. Apart from a narrow strip near \(g=0\), this invalid region’s boundary is roughly quadratic, as illustrated by the curve \(\tilde{k}(g) = -0.045-0.01g^2\) on the figure. Based on this analysis, \(k \geq \max\left(-0.5, \tilde{k}(g) \right)\) seems a reasonable sufficient condition for parameter validity to use in practice.
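A small helper encoding this rule of thumb (the function name is ours, not part of the package):

# Illustrative sketch: lower bound on k suggested by Figure 2 for c = 0.8
k_min = function(g) pmax(-0.5, -0.045 - 0.01 * g^2)
k_min(c(0, 2, 5))   # suggested minimum k values at g = 0, 2, 5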
The package provides three inference methods for data \(x_1,x_2,\ldots,x_n\) which are assumed to be independent and identically distributed (IID) draws from a \(g\)-and-\(k\) or \(g\)-and-\(h\) distribution with unknown parameters. This section describes these methods. An illustration of their use is provided in the next section. See the gk help files for a full description of all the arguments available.
The mcmc function implements inference using Markov chain Monte Carlo (MCMC). This samples from a Markov chain whose stationary distribution is the Bayesian posterior of interest for the parameters \(\theta\). We use a Metropolis-Hastings algorithm, in which a proposed new state of the chain \(\theta'\) is sampled by adding a \(\mathcal{N}(0,\Sigma)\) increment to the current state \(\theta_t\). A decision to accept or reject \(\theta'\) is made based on the prior and likelihood values at \(\theta_t\) and \(\theta'\) and a random variable (see steps 3-4 of Algorithm 1).
Tuning \(\Sigma\) can be difficult. Haynes and Mengersen (2005), who first used MCMC for the \(g\)-and-\(k\) distribution, did this manually. Instead we use the adaptive Metropolis (AM) algorithm of Haario et al. (2001) which tunes \(\Sigma\) automatically during its operation. The resulting \(\theta_t\)s no longer form a Markov chain, but it has been proved (Saksman and Vihola 2010) that, under suitable conditions, calculations using them still converge to posterior quantities as the length of the chain increases. The AM algorithm is presented as Algorithm 1. Step 1 gives the proposal variance matrix in terms of the empirical variance of the past MCMC states. To calculate this empirical variance efficiently, the code updates it each time a new state is observed. As a default we specify tuning choices \(\epsilon = 10^{-6}\) and \(t_0=100\).
Like other Bayesian methods, MCMC requires a prior density for the parameters, \(\pi(\theta)\), to be specified. This must be supplied by the user. For computational convenience this should be supplied in the form of a function get_log_prior which takes a vector of parameters as input and returns the log prior density. We allow the user to reparameterise \(\theta\), using \(\log B\) rather than \(B\), via the logB argument. This can improve MCMC efficiency when the posterior for \(B\) is concentrated on values close to zero.
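For illustration, a function of the required form might look like the following. This is an assumed example prior (flat improper priors on \(A\) and \(B\), a vague normal prior on \(g\), and the constraint \(k \geq 0\)), not a package default, and it works on the original parameterisation \(\theta = (A, B, g, k)\).

# Example get_log_prior: returns the log prior density of theta = (A, B, g, k)
get_log_prior = function(theta) {
  if (theta[2] <= 0 || theta[4] < 0) return(-Inf)   # require B > 0 and k >= 0
  dnorm(theta[3], mean = 0, sd = 5, log = TRUE)     # vague prior on g, flat elsewhere
}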
For IID data the likelihood is \(L(\theta) = \prod_{i=1}^n f\left(x_i;\theta \right)\), the product of each observation’s pdf. Evaluating this for the \(g\)-and-\(k\) or \(g\)-and-\(h\) distributions using the dgk or dgh command requires \(n\) calls to numerical root-finding. Therefore MCMC becomes computationally expensive for even moderately large datasets.
Loop over \(1 \leq t \leq N\):
If \(t \leq t_0\) let \(\Sigma_t = \Sigma_0\). Otherwise let \(\Sigma_t = \tfrac{1}{4}(2.4)^2 \left(\hat{\Sigma}_{t-1}+\epsilon I \right)\), where \(\hat{\Sigma}_{t-1}\) is the variance of \(\theta_1,\theta_2,\ldots,\theta_{t-1}\).
Sample \(\theta' \sim \mathcal{N}\left(\theta_{t-1}, \Sigma_{t} \right)\)
Sample \(u \sim \mathcal{U}(0,1)\) and let \(r = \frac{\pi(\theta') L(\theta')}{\pi(\theta_{t-1}) L(\theta_{t-1})}\).
If \(u<r\) let \(\theta_t = \theta'\). Otherwise let \(\theta_t = \theta_{t-1}\).
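A compact sketch of steps 2-4 of Algorithm 1, working on the log scale for numerical stability; log_post is an assumed function returning the log prior plus log likelihood, and the function below is an illustration, not the package’s mcmc implementation.

# Illustrative sketch: one Metropolis accept/reject step with proposal variance Sigma_t
am_step = function(theta, Sigma_t, log_post) {
  prop = MASS::mvrnorm(1, mu = theta, Sigma = Sigma_t)   # step 2
  log_r = log_post(prop) - log_post(theta)               # step 3, on the log scale
  if (log(runif(1)) < log_r) prop else theta             # step 4
}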
The abc function implements inference by approximate Bayesian computation (ABC). This is a method for approximate Bayesian inference which avoids evaluating the likelihood function. It is especially useful when the likelihood function is unavailable or, as for quantile distributions, is expensive to compute. ABC is based instead on finding parameter values which produce simulated data similar to the observations. The abc function implements a simple version of ABC, Algorithm 2. Here a simulation is accepted if it has one of the \(M\) smallest distances to the observations. Distance refers to a weighted version of Euclidean distance between vectors of simulated and observed summary statistics. Details of the weighting are given in the algorithm’s description. (For \(n\) large, abc avoids high memory requirements by running several batches of Algorithm 2. Each batch uses \(N=10^4\) and returns the \(M\) best simulations. The overall best \(M\) simulations are then found and returned. The \(v_j\) weights calculated in the first batch are reused in the others.)
Like mcmc, abc is a Bayesian method and requires a prior distribution for \(\theta\) to be provided. It is convenient for this to be provided in a different form to the mcmc case. A function rprior should be supplied which takes a single numeric input and returns that many samples from the prior distribution as rows of a matrix.
Calculate observed summaries \(s_0=s(x)\).
For \(1 \leq i \leq N\) sample parameters \(\theta_i\) from the prior.
For \(1 \leq i \leq N\) simulate summary statistics \(s(x_i)\) given parameters \(\theta_i\). Let \(s_{ij}\) denote the \(j\)th component of \(s(x_i)\).
For \(1 \leq j \leq q\) (where \(q=\dim(s_0)\)) calculate the empirical variance \(v_j\) of the \((s_{ij})_{1 \leq i \leq N}\) values.
For \(1 \leq i \leq N\) let \(d_i = \sum_{j=1}^q \left(s_{ij}-s_{0j} \right)^2/v_j\).
Find the \(M\) smallest \(d_i\) values and return the corresponding \(\theta_i\)s.
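As a concrete illustration, a bare-bones R version of Algorithm 2 for the \(g\)-and-\(k\) distribution might look as follows. It uses interpolated sample octiles as summaries and assumes rprior returns one parameter vector per row; it is a sketch of the algorithm, not the package’s abc implementation (which adds batching and the summary choices described next).

# Illustrative sketch of rejection ABC for the g-and-k distribution
library(gk)
abc_sketch = function(x, N, M, rprior) {
  octiles = function(y) quantile(y, probs = (1:7) / 8)
  s0 = octiles(x)                                    # observed summaries
  theta = rprior(N)                                  # draws from the prior
  s = t(apply(theta, 1, function(th)                 # simulated summaries
    octiles(rgk(length(x), th[1], th[2], th[3], th[4]))))
  v = apply(s, 2, var)                               # empirical variances
  d = colSums((t(s) - s0)^2 / v)                     # weighted distances
  theta[order(d)[1:M], , drop = FALSE]               # keep the M closest
}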
ABC produces samples from an approximation to the Bayesian posterior distribution. The quality of the approximation depends in a complex way on the choice of summary statistics and the tuning parameters \(N\) and \(M\). For more background on ABC see the review paper by Marin et al. (2012) and the handbook of Sisson et al. (2017). Two general R packages for ABC which implement more advanced methods are abc (Csilléry et al. 2012) and EasyABC (Jabot et al. 2013).
Using ABC for the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions was proposed by Allingham et al. (2009) and has been investigated in many subsequent papers. Following Drovandi and Pettitt (2011) we offer three choices of summary statistics which can be selected through the sumstats argument: (1) the full order statistics; (2) octiles of the observations, \(E_1,E_2,\ldots,E_7\); (3) robust estimates of the moments based on the octiles:
\[S_A=E_4,\quad S_B=E_6-E_2,\quad S_g=\left(E_6+E_2-2E_4 \right)/S_B,\quad S_k=\left(E_7-E_5+E_3-E_1 \right)/S_B.\]
Many more sophisticated approaches to choosing ABC summary statistics have been proposed (Blum et al. 2013), but these are a simple starting point.
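For reference, these summaries can be computed from a sample as follows; this is a direct transcription of the formulas above using interpolated sample octiles, and the function name is ours.

# Illustrative sketch: robust moment estimates from the octiles
moment_summaries = function(x) {
  E = unname(quantile(x, probs = (1:7) / 8))   # octiles E_1, ..., E_7
  SB = E[6] - E[2]
  c(SA = E[4],
    SB = SB,
    Sg = (E[6] + E[2] - 2 * E[4]) / SB,
    Sk = (E[7] - E[5] + E[3] - E[1]) / SB)
}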
For summaries (2) or (3) we follow Fearnhead and Prangle (2012) and speed up step 3 of Algorithm 2 by using the fact that the octiles (or close approximations) can be simulated quickly without the need to simulate a full dataset. Suppose \(X_1,X_2,\ldots,X_N\) are \(g\)-and-\(k\) or \(g\)-and-\(h\) variables, and let \(X_{(1)} < X_{(2)} < \ldots < X_{(N)}\) denote the order statistics. We replace \(E_i\) with \(E'_i = X_{\left(r\left(iN/8 \right) \right)}\) where \(r(\cdot)\) rounds to the nearest integer. Now we need to simulate only 7 order statistics from the \(g\)-and-\(k\) or \(g\)-and-\(h\) distribution. To do so we simulate the corresponding order statistics of the \(\mathcal{U}(0,1)\) distribution using the exponential spacings method (Ripley 1987). This is implemented by the orderstats function. The uniform order statistics are then substituted into \(F^{-1}(u)\).
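The idea can be sketched as follows: the gaps between the required uniform order statistics (and the endpoints) are gamma distributed, so only eight gamma draws are needed per simulated dataset. This is a hedged sketch using gamma-distributed gaps, which generalise the exponential spacings; it is not necessarily identical to the orderstats implementation.

# Illustrative sketch: simulate the 7 octile order statistics directly
sim_octiles = function(N, A, B, g, k) {
  idx = round((1:7) * N / 8)                        # indices r(iN/8)
  gaps = rgamma(8, shape = diff(c(0, idx, N + 1)))  # gaps between order statistics
  u = cumsum(gaps)[1:7] / sum(gaps)                 # required U(0,1) order statistics
  qgk(u, A, B, g, k)                                # substitute into F^{-1}(u)
}
sim_octiles(1967, A = 0, B = 1, g = 2, k = 0.5)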
The fdsa function performs inference using finite difference stochastic approximation (FDSA). FDSA, originally due to Kiefer and Wolfowitz (1952), attempts to find \(\theta^*\) minimising a loss function \(\mathcal{L}(\theta)\) by iteratively calculating estimates \(\theta_1, \theta_2, \ldots\) Each iteration moves the estimate in the opposite direction to an estimate of the loss gradient, based on finite difference calculations.
We use FDSA for maximum likelihood estimation of IID observations. In this setting \(\mathcal{L}(\theta)\) can be taken to be the negative log likelihood, \[\mathcal{L}(\theta) = -\log L(\theta) = -\sum_{i=1}^n \log f\left(x_i;\theta \right).\] The gradient of \(\mathcal{L}(\theta)\) can be estimated using only a small subset of the data, so FDSA has the potential to scale up to large datasets better than MCMC, while avoiding the approximation error of ABC. Unlike ABC and MCMC, we are not aware of FDSA having previously been used for the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions.
The \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions have some parameter constraints (e.g. \(B>0\), \(h \geq 0\)). Also we found setting further constraints from preliminary analyses sometimes helps FDSA behave well. Therefore we use a version of FDSA for bounded minimisation from L’Ecuyer and Glynn (1994), presented as Algorithm 3.
Loop over \(0 \leq t \leq N-1\).
Calculate \(\hat{g}_t\) by performing the following steps for \(1 \leq i \leq 4\).
Let \(\Delta_i\) be a \(4\)-dimensional vector whose \(i\)th component is 1 and others are zero.
Let \(\phi^+ = P\left(\theta_t + c_t \Delta_i \right)\) and \(\phi^- = P\left(\theta_t - c_t \Delta_i \right)\). (Here \(P(\phi)\) is a projection operator. Its output is \(\phi'\) such that \(\phi'_i\) is the closest value to \(\phi_i\) in \(\left[\theta^-_i, \theta^+_i \right]\). The \(i\) subscripts represent \(i\)th components.)
Let \(\hat{g}_{it} = \frac{1}{|\phi^+_i-\phi^-_i|}\left[\hat{\mathcal{L}}(\phi^+) - \hat{\mathcal{L}}(\phi^-) \right]\).
Let \(\theta_{t+1} = P\left(\theta_t - a_t \hat{g}_t \right)\).
Output: Final estimate \(\theta_N\).
The unbiased estimate of \(\mathcal{L}(\theta)\) required by Algorithm 3, \(\hat{\mathcal{L}}(\theta)\), can be taken to be the sum of a random sample of \(m\) negative log likelihood terms multiplied by \(n/m\). Hence for a vector y containing a random subsample of \(m\) observations (sometimes referred to as a batch), \(\hat{\mathcal{L}}(\theta)\) can be calculated using -sum(dgk(y,A,B,g,k,log=TRUE))*n/m (or similar for the \(g\)-and-\(h\) distribution). Variance reduction in step 1c of Algorithm 3 is possible by coupling the two estimates (Kushner and Yin 2003). Hence we use the same random subsample of data for all \(\hat{\mathcal{L}}\) calculations in an iteration of step 1.
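A compact sketch of one iteration of Algorithm 3, assuming Lhat(theta, y) returns the subsampled loss estimate just described and that lower and upper hold the box constraints defining \(P\); this is an illustration of the update, not the fdsa implementation.

# Illustrative sketch: one FDSA iteration (steps 1-2 of Algorithm 3)
fdsa_step = function(theta, a_t, c_t, Lhat, y, lower, upper) {
  P = function(phi) pmin(pmax(phi, lower), upper)   # projection onto the box
  ghat = numeric(4)
  for (i in 1:4) {
    delta = replace(numeric(4), i, 1)               # Delta_i
    phi_plus = P(theta + c_t * delta)
    phi_minus = P(theta - c_t * delta)
    ghat[i] = (Lhat(phi_plus, y) - Lhat(phi_minus, y)) /
      abs(phi_plus[i] - phi_minus[i])               # finite difference gradient estimate
  }
  P(theta - a_t * ghat)                             # projected gradient step
}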
FDSA convergence requires the gain sequences \(a_t\) and \(c_t\) to satisfy certain conditions. Following Spall (1998) we take \(a_t = a_0 \left(A+t+1 \right)^{-\alpha}\) and \(c_t = c_0 \left(t+1 \right)^{-\gamma}\). This leaves several tuning choices, which can be selected by the user, or left at default values which we provide. Following Kleinman et al. (1999) we use default values \(\alpha=1\) and \(\gamma=0.49\). Following Spall (1998) our default for \(c_0\) is an estimate of the standard deviation of \(\hat{\mathcal{L}}(\theta_0)\) using some preliminary simulations. We provide defaults \(a_0=1\) and \(A=100\), but it is recommended to manually tune these to produce rapid convergence. This may require several short pilot runs of the algorithm. The fdsa function allows \(a_0\) and \(c_0\) to be vectors, in which case operations in Algorithm 3 are interpreted as elementwise where necessary. This allows the user to tune gain sequences differently for each parameter. As for mcmc, we allow the user to reparameterise \(\theta\), using \(\log B\) rather than \(B\), via the logB argument, which can improve FDSA efficiency when the MLE value of \(B\) is close to zero.
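In code, the gain sequences with the default tuning constants described above are simply the following (the function names are ours; here A is the stability constant, not the location parameter):

# Illustrative sketch of the gain sequences a_t and c_t
gain_a = function(t, a0 = 1, A = 100, alpha = 1) a0 * (A + t + 1)^(-alpha)
gain_c = function(t, c0, gamma = 0.49) c0 * (t + 1)^(-gamma)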
Under weak assumptions, FDSA converges to a local minimum of \(\mathcal{L}(\theta)\) (Kushner and Yin 2003). In our experience the likelihood for the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions is usually unimodal, so there is little danger of converging to an incorrect mode. Nonetheless it may be a useful check on the results to rerun the algorithm from various starting points or compare with the output of another algorithm.
An alternative to FDSA is simultaneous perturbation stochastic approximation (SPSA) (Spall 1998). Here each iteration makes a finite difference estimate of the derivative of the loss function when moving in a random direction from \(\theta_t\). An update moves \(\theta_t\) a distance (negatively) proportional to this estimated derivative in the selected direction. Each SPSA iteration requires fewer likelihood estimates than FDSA, and it is asymptotically more efficient (Kushner and Yin 2003). However we found in exploratory work that for our application the SPSA updates were dominated by improving \(A\) and \(B\) estimates, and the remaining parameters were learned very slowly.
We illustrate gk’s inference methods on the Garch exchange rate dataset from the Ecdat package (Croissant 2016). This consists of 1967 daily US dollar exchange rates against other currencies from 1980 to 1987. We concentrate on the exchange rate with Canadian Dollars. Let \(x_t\) denote the exchange rate on day \(t\). The log return is defined as \(\log\left(x_{t+1}/x_t \right)\). Figure 3 is a time series plot of the log returns. Figure 7 shows a histogram and a quantile-quantile plot indicating that the tails are heavier than those of a normal density.
We focus on using the \(g\)-and-\(k\) distribution to model the log returns under an IID assumption. For models also including time series structure see for example Drovandi and Pettitt (2011). The full code for the analysis below can be run via the fx function.
The ABC and MCMC analyses which follow are Bayesian and require specification of a prior. We use a uniform prior for ease of comparison to the maximum likelihood results from FDSA. For MCMC we are able to use an improper uniform prior. For ABC a proper prior is required, so we bound the parameters as follows: \(-1<A<1\), \(0<B<1\), \(-5<g<5\), \(0<k<10\). We restrict \(A\) and \(B\) to have magnitude at most 1, as we believe log returns of this magnitude are highly unlikely. The \(g\) and \(k\) parameters are given wider support which can capture a broad range of distributional shapes.
We ran ABC as follows:
rprior = function(i) {
  cbind(runif(i, -1, 1), runif(i, 0, 1), runif(i, -5, 5), runif(i, 0, 10))}
abc_out = abc(log_return, N = 1E7, rprior = rprior, M=200,
  sumstats = 'moment estimates')
This simulated \(10^7\) parameter vectors, accepting the best \(200\). We used moment estimator summary statistics, described earlier, which can be simulated quickly without the need to simulate an entire dataset. As a result this analysis took only 6 minutes.
The resulting approximate posterior samples are shown in Figure 6. Figure 7 shows density and quantile-quantile plots under the mean parameter values. These reveal a very poor fit to the data. However this short ABC analysis does provide reasonable tuning choices for the other methods.
We ran FDSA as follows:
abc_out_tf = abc_out[, 1:4]
abc_out_tf[, 2] = log(abc_out_tf[, 2])
abc_est_tf = colMeans(abc_out_tf)
fdsa_out_pilot = fdsa(log_return, N = 1E4, logB = TRUE, theta0 = abc_est_tf,
  batch_size = 100, a0 = 2E-4)
a0 = c(1E-6, 1E-2, 1E-2, 1E-2)
fdsa_out = fdsa(log_return, N = 1E4, logB = TRUE, theta0 = abc_est_tf,
  batch_size = 100, a0 = a0)
We found that using the original parameterisation caused high variance in our gradient estimates. This is because the log-likelihood surface becomes extremely steep for \(B\) close to 0. Therefore we reparameterised \(B\) to \(\log B\). The initial FDSA state was set to equal the ABC means. The FDSA step sizes \(a_0\) were tuned by trial-and-error.
Figure 4 shows a trace plot of the FDSA algorithm output. A pilot run with \(a_0=2 \times 10^{-4}\) is shown in black. Parameters \(\log B, g\) and \(k\) do not converge over \(10,000\) iterations. However they have smooth curves, indicating that there is relatively little noise in their gradient estimates and so larger steps could be taken. In contrast \(A\) converges quickly and then oscillates noisily. This indicates that a smaller step size could be used to average out this noise more effectively without endangering convergence. Therefore for the final run we used \(a_0=\left(10^{-6}, 10^{-2}, 10^{-2}, 10^{-2} \right)\).
The final FDSA analysis took 17 minutes. The final states were \(A=9.1 \times 10^{-5}\), \(B=1.7 \times 10^{-3}\), \(g=2.0 \times 10^{-2}\) and \(k=0.35\). Figure 7 shows density and quantile-quantile plots under these parameter values. These are a much better fit to the data than the ABC results.
Next we use the FDSA results to help tune an MCMC algorithm, which quantifies the uncertainty in the parameter values.
We ran MCMC as follows:
fdsa_est_tf = fdsa_out[1E5, 1:4]
Sigma0 = var(fdsa_out[1E5 + (-1000:0), 1:4])
log_prior = function(theta) {
  if (theta[4] < 0) return(-Inf)
  return(theta[2])
}
mcmcout_tf = mcmc(log_return, N = 1E4, logB = TRUE, get_log_prior = log_prior,
  theta0 = fdsa_est_tf, Sigma0 = Sigma0)
Again we used a log reparameterisation for \(B\). To achieve an improper uniform prior on the original parameterisation, we used a prior density proportional to \(B \mathbb{1}\left(k>0 \right)\) on \(\left(A,\log B, g, k \right)\) (where \(\mathbb{1}\) represents an indicator function). Our initial parameter vector was the final FDSA state. We used the variance matrix of the last 1000 FDSA states to select the initial MCMC proposal variance.
Figure 5 shows a trace plot of the MCMC algorithm output. For the first few hundred iterations small proposals are made, at least for \(\log B\), \(g\) and \(k\), but the proposal variance quickly adapts and the remainder of the output appears to have converged. Exploratory work showed that taking a poor initial state meant MCMC was very slow to converge, because the variance matrix adapted to the transient state of the algorithm. Hence tuning based on FDSA output is very useful.
The MCMC analysis took 39 minutes. Figure 6 shows parameter histograms and Figure 7 shows density and quantile-quantile plots. These are similar to the FDSA fit. Note that the density plot is based on mean parameter values from the MCMC output (after discarding the first half of the output as burn-in and transforming \(\log B\) values back to the original parameterisation).
The ABC analysis is quick but produces a poor fit. However it helps tune the FDSA method which finds a good estimate of the MLE in a reasonable time. Further computational effort using MCMC provides a Bayesian fit. Figure 7 shows that the \(g\)-and-\(k\) distribution fits the data better than a normal distribution, but still does not fit the most extreme observations. Further improvements might be possible by using more flexible distributions, for example allowing different \(k\) parameters for the upper and lower tails (Peters et al. 2016).
This paper has reviewed the \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions, and introduced the gk package to work with them. The package includes the usual distributional functions, although the pdf and cdf functions are slow due to relying on numerical root-finding. Another function tests the validity of different parameter combinations, and this was used to produce a novel result on which parameters are valid for the \(g\)-and-\(k\) distribution (i.e. it appears to be sufficient that \(k \geq \max\left(-0.5, -0.045-0.01g^2 \right)\)). The package also provides several methods for inference of IID data under these distributions, and their use has been illustrated above. The methods include an FDSA algorithm which can find MLEs for large datasets in a reasonable time and has not been applied to this problem before.
The derivatives of the \(Q\) functions are as follows: \[\begin{aligned} Q_{gk}'\left(z;A,B,g,k,c \right) &= B \left(1+z^2 \right)^k R_{gk}\left(z;g,k,c \right), \label{eq:Qgk1} \end{aligned} \tag{4}\]
\[\begin{aligned} Q_{gh}'\left(z;A,B,g,h,c \right) &= B \exp\left(hz^2/2 \right) R_{gh}\left(z;g,h,c \right) \label{eq:Qgh1}, \end{aligned} \tag{5}\] where \[\begin{aligned} R_{gk}\left(z;g,k,c \right) &= \left[ 1 + c \tanh\left(gz/2 \right) \right] \frac{1+\left(2k+1 \right)z^2}{1+z^2} + \frac{cgz}{2 \cosh^2\left(gz/2 \right)}, \label{eq:Rgk} \end{aligned} \tag{6}\]
\[\begin{aligned} R_{gh}\left(z;g,h,c \right) &= \left[ 1 + c \tanh\left(gz/2 \right) \right] \left(1+hz^2 \right) + \frac{cgz}{2 \cosh^2\left(gz/2 \right)}. \label{eq:Rgh} \end{aligned} \tag{7}\] Observe that each \(Q'\) function has the same sign and roots as the corresponding \(R\) function.
This appendix proves theoretical results quoted earlier about which parameter values produce valid \(g\)-and-\(k\) and \(g\)-and-\(h\) distributions.
First note that the defining functions in (2) and (3) both have the property that \(Q\left(z;A,B,-g,k,c \right) = Q\left(z;A,B,g,k,-c \right)\). Therefore any behaviour produced by \(c<0\) can be replicated with \(c>0\) and a different choice of \(g\). So for simplicity it suffices to concentrate on \(c \geq 0\).
For the remainder of this appendix, distributional validity will correspond to a strictly increasing quantile function. This property is generally violated if \(c>1\), as there are then two solutions to \(Q(z)=A\): \(z=0\) and the solution to \(1+c\tanh\left(gz/2 \right)=0\). (The only exception is the special case of \(g = 0\).) Also taking \(h<0\) or \(k<-1/2\) is invalid: in either case \(Q\), which is continuous, has a positive gradient at \(z=0\) but \(Q(z) \to A\) as \(z \to \pm\infty\), so it cannot be strictly increasing.
Finally it is shown that non-negative values of \(k\) or \(h\) produce valid distributions provided that \(0 \leq c<c^* \approx 0.83\) (Rayner and MacGillivray 2002). From Appendix A it suffices to derive the values of \(c\) such that \(R(z)\), representing either \(R_{gk}\left(z;g,k,c \right)\) or \(R_{gh}\left(z;g,h,c \right)\), is guaranteed to be positive for \(k \geq 0\) or \(h \geq 0\). Note that \(R(z)\) is a continuous function of \(z\), and \(R(0)>0\). So a sufficient condition for validity is that no solution to \(R(z)=0\) exists. Rearranging \(R(z)=0\) using (6) and (7) gives \[\begin{aligned} 1/c &= u v \mathop{\mathrm{sech}}^2 u + \tanh u, \label{eq:c equality} \\ \text{where} \qquad u&=-gz/2, \nonumber \\ \text{and} \qquad v&= \begin{cases} \frac{1+z^2}{1+\left(2k+1 \right)z^2} & \text{(g-and-k)}\\ \frac{1}{1+hz^2} & \text{(g-and-h)} \end{cases} \nonumber \end{aligned} \tag{8}\] For \(k \geq 0\) or \(h \geq 0\), \(v\) can only take values in \((0,1]\), with \(1\) attained by \(z=0\). Hence (8) gives \(c>0\) if and only if \(u>0\), and we concentrate on this case from now on. We wish to find the minimum positive solution for \(c\). Since \(1/c\) is increasing in \(v\) it suffices to concentrate on its largest value, \(v=1\). The problem reduces to minimising \(\left(u \mathop{\mathrm{sech}}^2 u + \tanh u \right)^{-1}\) for \(u>0\). Numerically this gives \(c^* \approx 0.83\), as shown in Figure 8.
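This minimisation is easy to reproduce numerically; the following check uses the \(v=1\) case derived above.

# Maximise u * sech(u)^2 + tanh(u) over u > 0 and invert to obtain c*
f = function(u) u / cosh(u)^2 + tanh(u)
1 / optimize(f, interval = c(0, 10), maximum = TRUE)$objective   # approximately 0.83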
Thanks to Kieran Peel who wrote a helpful undergraduate dissertation on this topic.