bayesanova: An R package for Bayesian Inference in the Analysis of Variance via Markov Chain Monte Carlo in Gaussian Mixture Models

This paper introduces the R package bayesanova, which performs Bayesian inference in the analysis of variance (ANOVA). Traditional ANOVA based on null hypothesis significance testing (NHST) is prone to overestimating effects and to declaring effects when none are present. Bayesian ANOVAs developed so far are based on Bayes factors (BF), which also enforce a hypothesis testing stance. Instead, the Bayesian ANOVA implemented in bayesanova focusses on effect size estimation and is based on a Gaussian mixture with known allocations, for which full posterior inference for the component parameters is implemented via Markov chain Monte Carlo (MCMC). Inference for the differences in means, standard deviations and effect sizes between each of the groups is obtained automatically. Estimation of the parameters instead of hypothesis testing is embraced via the region of practical equivalence (ROPE), and helper functions provide checks of the model assumptions and visualizations of the results.

Riko Kelter (University of Siegen, Department of Mathematics)

1 Introduction

This article introduces bayesanova, an R package for conducting a Bayesian analysis of variance (ANOVA) via Markov Chain Monte Carlo (MCMC) in a Gaussian mixture model. Classic frequentist analysis of variance is based on null hypothesis significance testing (NHST), which recently has been shown to produce serious problems regarding the reproducibility and reliability of scientific results (Wasserstein and Lazar 2016; Colquhoun 2017, 2019; Benjamin et al. 2018; Wasserstein et al. 2019). NHST is based on test statistics, \(p\)-values and significance levels \(\alpha\), which are designed to control the long-term false-positive rate. Still, in a multitude of settings these approaches do in fact lead to an inflated rate of false-positive results, undermining the validity and progress of science. Examples include optional stopping of participant recruiting in studies (Carlin and Louis 2009) or the necessary testing for violations of distributional assumptions which some frequentist hypothesis tests make (Rochon et al. 2012).

As a solution to these problems, Bayesian methods have been proposed and have since gained popularity in a wide range of scientific domains (Kruschke 2013, 2015; McElreath and Smaldino 2015). The Bayesian philosophy proceeds by combining the model likelihood \(f(x|\theta)\) with the available prior information \(p(\theta)\) to obtain the posterior distribution \(f(\theta|x)\) through the use of Bayes’ theorem: \[\label{eq:mean1} f(\theta|x) \propto f(x|\theta) p(\theta) \tag{1}\] While the Bayesian philosophy thus allows for flexible modeling, inference for the posterior distribution \(f(\theta|x)\) can be complicated in practice. Therefore, Markov chain Monte Carlo techniques have been developed, which obtain the posterior numerically by (1) constructing a Markov chain which has the posterior distribution \(f(\theta|x)\) as its stationary distribution, and (2) drawing samples from this Markov chain to approximate the posterior \(f(\theta|x)\).
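As a toy illustration of steps (1) and (2), the following base R sketch draws from the posterior of a normal mean via a random-walk Metropolis algorithm; the data, proposal standard deviation and burn-in length are arbitrary illustrative choices, not part of bayesanova.

```r
# Minimal random-walk Metropolis sketch (illustrative only): draw from the
# posterior of a normal mean mu with known sigma = 1 and a flat prior, so the
# posterior is N(mean(x), 1/n) and can be checked against the samples.
set.seed(1)
x <- rnorm(50, mean = 2, sd = 1)          # observed data
log_post <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

n_iter <- 20000
draws  <- numeric(n_iter)
mu_cur <- 0
for (t in seq_len(n_iter)) {
  mu_prop <- mu_cur + rnorm(1, sd = 0.5)  # symmetric proposal
  # accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(mu_prop) - log_post(mu_cur)) mu_cur <- mu_prop
  draws[t] <- mu_cur
}
draws <- draws[-(1:5000)]                 # discard burn-in
c(posterior_mean = mean(draws), exact = mean(x))
```

After the burn-in, the sample mean of the retained draws closely approximates the analytically known posterior mean.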

The bayesanova package is designed as a Bayesian alternative to the frequentist analysis of variance. By using a Gaussian mixture model and implementing a Markov Chain Monte Carlo algorithm for this model, full posterior inference can be obtained. This allows for explicit hypothesis testing between groups as in the frequentist ANOVA, or for estimation of parameters under uncertainty. The focus in bayesanova is on the latter perspective and avoids explicit hypothesis testing. While Bayesian versions of the analysis of variance have been proposed recently by (Rouder et al. 2012) and (Bergh et al. 2019), these implementations focus on the Bayes factor as a measure of evidence (Doorn et al. 2019; JASP Team 2019). As the Bayes factor suffers from multiple problems, one of which is its strong dependence on the chosen priors – see (Kamary et al. 2014) and (Robert 2016) – the implementation in bayesanova avoids the Bayes factor and uses a different posterior index, the region of practical equivalence (ROPE) (Kruschke 2018), which has lately been shown to have desirable properties, in particular in contrast to the Bayes factor (Makowski et al. 2019a).

The plan of the paper is as follows: The next section introduces the analysis of variance in a frequentist and Bayesian fashion and gives an overview of packages implementing these methods. The following section then introduces the novel approach implemented in bayesanova. The details on the mixture representation of the Bayesian analysis of variance are discussed and scenarios where bayesanova is designed to be used are detailed. The section thereafter outlines the structure of the package and details the included functions. The following section presents a variety of examples and illustrations using real datasets from biomedical and psychological research as well as synthetic datasets. The last section then provides a summary of the benefits and drawbacks of the used implementation, as well as future plans for the package.

2 Frequentist and Bayesian analysis of variance

Traditional ANOVA models using NHST via the F-statistic

In applied statistics, the one-way analysis of variance is a method for comparing the means of two or more samples (typically three or more). The one-way ANOVA assumes numerical (response) data in each group and (usually) categorical input data like a group indicator in a randomized clinical trial (RCT). Interpreting the ANOVA as a linear model, one obtains for data \(y_{i,j}\), where \(i=1,...,n\) is an index over the experimental units (patients, participants) and \(j=1,...,k\) an index over treatment groups, \[\label{eq:mean} y_{i,j}=\mu_j+\varepsilon_{i,j} \tag{2}\] if the experiment is completely randomized. Here, \(\varepsilon_{i,j} \sim \mathcal{N}(0,\sigma^2)\), that is, the \(\varepsilon_{i,j}\) are normally distributed zero-mean residuals. \(\mu_j\) is the mean of treatment group \(j\) and \(y_{i,j}\) the response variable which is measured in the experiment.

The one-way ANOVA then tests the null hypothesis \(H_0\) that all samples are drawn from populations with identical means. To do this, (1) two estimates of the population variance are obtained, each relying on different assumptions, and (2) an F-statistic is computed, which is the ratio of the variance calculated among the group means to the variance within the samples. The intuition here is that if all samples are drawn from populations with identical means, both variance estimates should be of similar size and the ratio close to one; a large ratio thereby indicates differing means. Mathematical details on computing the F-statistic can be found in the Appendix.
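The ratio described above can be computed by hand in a few lines of base R and checked against R's built-in aov function; the synthetic three-group data below are purely illustrative.

```r
# Hand-computed one-way F statistic on synthetic three-group data,
# checked against R's built-in aov() (the group labels are made up here).
set.seed(123)
y <- c(rnorm(20, 10, 2), rnorm(20, 12, 2), rnorm(20, 11, 2))
g <- factor(rep(c("A", "B", "C"), each = 20))

grand <- mean(y)
means <- tapply(y, g, mean)             # group means, in level order
n_j   <- tapply(y, g, length)
k     <- nlevels(g); n <- length(y)

ss_between <- sum(n_j * (means - grand)^2)        # variance among group means
ss_within  <- sum((y - means[as.integer(g)])^2)   # variance within groups
F_manual   <- (ss_between / (k - 1)) / (ss_within / (n - k))

F_aov <- summary(aov(y ~ g))[[1]][1, "F value"]   # first row: predictor g
c(manual = F_manual, aov = F_aov)
```

The manually computed statistic agrees with the value reported by aov up to floating-point error.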

The one-way ANOVA as detailed above makes several assumptions, the most important of which are: (1) variances of populations are equal; (2) responses for a given group are independent and identically distributed random variables; (3) response variable residuals are normally distributed, that is \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\).

While Monte Carlo studies have shown that the ANOVA is quite robust to small to medium violations of these assumptions (Donaldson 1966), severe violations of assumptions (1)-(3) will result in inflated rates of false positives and thereby unreliable results (Tiku 1971).

Bayesian ANOVA models

Bayesian models for the ANOVA have been developed recently to solve some of the problems of NHST. The developed models can be categorized broadly into two approaches: The first approach relies on the Bayes factor as a measure of relative evidence and was developed by (Rouder et al. 2012). The second approach is based on MCMC algorithms like Gibbs sampling in JAGS (Plummer 2003) or Hamiltonian Monte Carlo (HMC) in Stan (Carpenter et al. 2017; Stan Development Team 2020). This approach was popularized by (Kruschke 2015). Here the region of practical equivalence (ROPE) as introduced by (Kruschke 2015) is used for measuring the evidence given the data. Also, an explicit hypothesis testing stance is avoided.

The approach of (Rouder et al. 2012) can be summarized as follows: An independent Cauchy prior \[\begin{aligned} \label{eq:indepCauchyPrior} p(\theta)=\prod_{i=1}^p \frac{1}{(1+\theta_i^2)\pi} \end{aligned} \tag{3}\] is considered for the vector \(\boldsymbol{\theta}=(\theta_1,...,\theta_p)'\) of the \(p\) effects between different groups. For example, in a three-group setting there would be three effects \(\theta_1\), \(\theta_2\) and \(\theta_3\) corresponding to the effects between the first and second, first and third, and second and third group. In the case of \(k=4\) groups, there are \(p=6\) effects and so on. The ANOVA is then rewritten as a linear model \[\begin{aligned} \label{eq:rouderAnova} \boldsymbol{y}=\boldsymbol{\mu} \boldsymbol{1}+\sigma \boldsymbol{X} \boldsymbol{\theta} +\boldsymbol{\varepsilon} \end{aligned} \tag{4}\] where \(\boldsymbol{\mu}\) is the grand mean parameter, \(\boldsymbol{1}\) a column vector of length \(n\) with entries equal to \(1\), \(\boldsymbol{\theta}\) a column vector of the standardized effect size parameters of length \(p\), and \(\boldsymbol{X}\) is the \(n\times p\) design matrix. The factor \(\sigma\) in \(\sigma\boldsymbol{X}\boldsymbol{\theta}\) stems from the reparameterization of Jeffreys: Following (Jeffreys 1961) by reparameterizing \(\delta=\mu/\sigma\), where \(\delta\) is the effect size of (Cohen 1988), (Rouder et al. 2012) rewrote the observed data sampling distribution as \[\begin{aligned} y_i \sim \mathcal{N}(\sigma \delta,\sigma^2) \end{aligned}\] The residuals \(\boldsymbol{\varepsilon}\) in Equation ((4)) are defined to be \[\begin{aligned} \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0},\sigma^2 \boldsymbol{I}) \end{aligned}\] with \(\boldsymbol{I}\) being the identity matrix of size \(n\) and \(\boldsymbol{0}\) a column vector of zeros of size \(n\).

Putting a Jeffreys prior \(p(\mu,\sigma^2)=1/\sigma^2\) on the mean and variance, and assuming the following g-prior structure \[\begin{aligned} \boldsymbol{\theta}|\boldsymbol{G}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{G}) \end{aligned}\] which is based on (Zellner 1980), where \(\boldsymbol{G}\) is a \(p\times p\) diagonal matrix, the only remaining choice is a prior on the diagonal elements \(g_l\) of \(\boldsymbol{G}\) for \(l=1,...,p\). (Rouder et al. 2012) chose \[\begin{aligned} g_l \sim \text{Inverse-}\chi_1^2 \end{aligned}\] so that the marginal prior on the effect size parameter vector \(\boldsymbol{\theta}\) results in the independent Cauchy distribution given in Equation ((3)). (Rouder et al. 2012) then showed that the resulting \(BF_{10}\) can be written as \[\begin{aligned} BF_{10}=\int_g K(\boldsymbol{n},g)\left ( \frac{\sum_{i} \sum_{j}(y_{ij}-\bar{y})^2+\frac{1}{g}(\sum_{i}c_i \bar{y}_i^2-(\sum_{i}c_i \bar{y}_i)^2 / (\sum_{i}c_i))}{\sum_{i} \sum_{j}(y_{ij}-\bar{y})^2}\right )^{-(N-1)/2}p(g)dg \end{aligned}\] if a balanced one-way design is used (equal sample sizes in each group). Here, \(\boldsymbol{n}=(n_1,...,n_p)'\) is the vector of sample sizes for each effect \(1,...,p\), \(N=\sum_i n_i\) is the full sample size, \(c_i=n_i/(n_i+1/g)\) and \[\begin{aligned} K(\boldsymbol{n},g)=\sqrt{N}\left ( \frac{\prod_i 1/(1+gn_i)}{\sum_i n_i/(1+gn_i)} \right )^{1/2} \end{aligned}\] In summary, this Bayes factor of (Rouder et al. 2012) can be computed via Gaussian quadrature, as it constitutes a one-dimensional integral after inserting the necessary quantities.
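As a sketch of this one-dimensional quadrature, the following base R code transcribes the displayed integral directly, with the inverse-\(\chi_1^2\) density written out explicitly; this is an illustrative transcription under the stated balanced design, not the implementation used in the BayesFactor package, and the synthetic data are invented.

```r
# Sketch: Rouder et al.'s one-way BF_10 via one-dimensional quadrature,
# transcribing the displayed integrand (balanced design, 3 groups).
# Illustrative only -- not the BayesFactor package implementation.
set.seed(7)
y <- list(rnorm(15, 0.0), rnorm(15, 0.8), rnorm(15, 0.4))
n_i    <- sapply(y, length); N <- sum(n_i)
ybar_i <- sapply(y, mean);   ybar <- mean(unlist(y))
ss_tot <- sum((unlist(y) - ybar)^2)

# density of the Inverse-chi^2 distribution with 1 degree of freedom
p_g <- function(g) g^(-3 / 2) * exp(-1 / (2 * g)) / sqrt(2 * pi)

integrand <- function(g) sapply(g, function(gg) {
  c_i <- n_i / (n_i + 1 / gg)
  K   <- sqrt(N) * sqrt(prod(1 / (1 + gg * n_i)) / sum(n_i / (1 + gg * n_i)))
  rat <- (ss_tot + (1 / gg) * (sum(c_i * ybar_i^2) -
            sum(c_i * ybar_i)^2 / sum(c_i))) / ss_tot
  K * rat^(-(N - 1) / 2) * p_g(gg)
})

BF10 <- integrate(integrand, lower = 0, upper = Inf)$value
BF10
```

The integrand is smooth and decays quickly for both small and large \(g\), so R's built-in integrate function handles the infinite upper limit directly.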

The second approach to a Bayesian ANOVA model can be credited to (Kruschke 2015), who uses the MCMC sampler JAGS (Plummer 2003) to obtain full posterior inference in his model instead of relying on the Bayes factor. The reasons for avoiding the Bayes factor as a measure of evidence are that (1) it depends strongly on the selected prior modeling (Kamary et al. 2014); (2) the Bayes factor states only relative evidence for the alternative over the null hypothesis (or vice versa), so that even a large Bayes factor does not indicate that either one of the two hypotheses is a good fit for the actual data (Kelter 2020b,a); (3) it remains within the same formalism of hypothesis testing that the pioneers of frequentist testing advocated at the time of its invention (Robert 2016; Tendeiro and Kiers 2019). In addition, the calculation of the Bayes factor for increasingly complex models can be difficult, as the above derivations of (Rouder et al. 2012) exemplify; see also (Kamary et al. 2014). Importantly, the Bayes factor assigns positive probability mass to a Lebesgue null set, which is puzzling from a measure-theoretic perspective; compare (Kelter 2021d), (Rao and Lovric 2016), and (Berger 1985).

(Kruschke 2015) modeled the Bayesian ANOVA for \(k\) groups and \(n\) observations \(y_1,...,y_n\) as a hierarchical Bayesian model, in which \[\begin{aligned} \label{eq:dataKruschke} y_i \sim \mathcal{N}(\mu,\sigma_y^2) \end{aligned} \tag{5}\] where the standard deviation \(\sigma_y\) is modelled as \[\begin{aligned} \sigma_y \sim \mathcal{U}(L,H) \end{aligned}\] the mean \(\mu_i\) is the linear combination \[\begin{aligned} \label{eq:meansKruschke} \mu_i=\beta_0+\sum_{j=1}^k \beta_j x_j(i) \end{aligned} \tag{6}\] and the coefficients of this linear combination are given as \[\begin{aligned} \beta_0 \sim \mathcal{N}(M_0,S_0)\\ \beta_j \sim \mathcal{N}(0,\sigma_{\beta}) \end{aligned}\] where \(x_j(i)\) indicates the group the observation \(y_i\) belongs to. If, for example, \(y_i\) is in the first group, \(x_1(i)=1\) and \(x_j(i)=0\) for all \(j\neq 1\) with \(j\in \{1,...,k\}\), yielding the group mean \(\mu_i=\beta_0+\beta_1\) of the first group. Thus, although Equation ((5)) seems to indicate that there is a single mean \(\mu\) for all observations \(y_i\), \(i=1,...,n\), the mean \(\mu\) takes \(k\) different values depending on which group the observation \(y_i\) is located in. These \(k\) different values for \(\mu\) correspond to the different means in the \(k\) groups as shown in Equation ((6)). The variables \(L,H,M_0,S_0\) are hyperparameters, and the parameter \(\beta_j\) can be interpreted as the deflection of group \(j\) from the grand mean \(\beta_0\), which is why the prior on \(\beta_j\) is normal with mean zero, so that these deflections from the grand mean are expected to sum to zero. The hyperparameter \(\sigma_\beta\) can either be set constant or given another prior, extending the multilevel model; (Kruschke 2015) followed the recommendations of (Gelman and Hill 2006) to use a folded t-distribution or a gamma-distribution with non-zero mode.

Inference for the full posterior, that is, for the parameters \(\mu,\sigma_y,\beta_0,\beta_j\), \(j=1,...,k\) (and \(\sigma_\beta\), if a hyperprior like a folded t-distribution or gamma-distribution is used on this parameter) given the data is provided via the MCMC sampler JAGS (Plummer 2003), which uses Gibbs sampling to draw samples from the posterior. The posterior distributions obtained through Gibbs sampling are then used to estimate all parameters via \(95\%\) highest density intervals (HDIs). Explicit testing is avoided.
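In practice, Kruschke's model is specified as a JAGS model script. A sketch of such a script, stored as an R string, might look as follows; the variable names mirror the notation above, JAGS's dnorm is parameterized by precision rather than variance, and the gamma hyperparameter values are illustrative only.

```r
# Sketch of a JAGS model script for a Kruschke-style ANOVA. Variable names
# mirror the notation in the text; the dgamma values below are illustrative
# choices for a gamma prior with non-zero mode, not Kruschke's exact values.
kruschke_model <- "
model {
  for (i in 1:n) {
    # JAGS dnorm takes a precision (1 / variance) as second argument
    y[i] ~ dnorm(beta0 + beta[group[i]], 1 / sigma_y^2)
  }
  beta0 ~ dnorm(M0, 1 / S0^2)              # grand mean
  for (j in 1:k) {
    beta[j] ~ dnorm(0, 1 / sigma_beta^2)   # group deflections from grand mean
  }
  sigma_y    ~ dunif(L, H)                 # residual standard deviation
  sigma_beta ~ dgamma(1.64, 0.32)          # gamma prior with non-zero mode
}"
# The string would be passed to rjags::jags.model() together with a data
# list containing y, group, n, k and the hyperparameters M0, S0, L, H.
```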

3 Available software

Available software for the traditional ANOVA

Conducting a traditional analysis of variance is possible with an abundance of software, for example via the aov function in the stats package, which is shipped with the R programming language (R Core Team 2020).

Available software for the Bayesian ANOVA

The BayesFactor package by (Morey and Rouder 2018) provides the Bayesian ANOVA Bayes factor of (Rouder et al. 2012), and various helper functions for analysis of the results.

A simple illustration of the main workflow in the BayesFactor package is given here, using the ToothGrowth dataset in the datasets package (Cannon et al. 2019). The ToothGrowth dataset contains three columns: len, the dependent variable, which is the length of a guinea pig’s tooth after treatment with vitamin C; supp, the supplement type (either orange juice or ascorbic acid); and dose, the amount of vitamin C administered.

The BayesFactor package’s core function allows the comparison of models \(\mathcal{M}_0,...,\mathcal{M}_n\) with factors as predictors. The null model without any predictors is most often compared to models including predictors or even interaction terms, using the Bayes factor as detailed above. The function anovaBF computes several model comparisons at once, so that the model with the largest Bayes factor can be selected. The data are first loaded and the categorical predictors converted to factors:

R> set.seed(42)
R> library(datasets)
R> data(ToothGrowth)
R> head(ToothGrowth,n=3)

   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5

R> ToothGrowth$dose = factor(ToothGrowth$dose) 
R> levels(ToothGrowth$dose) = c('Low', 'Medium', 'High')

Then, a Bayesian ANOVA is conducted using both predictors dose, supp and the interaction term dose * supp:

R> library(BayesFactor)
R> bf = anovaBF(len ~ supp * dose, data = ToothGrowth)

Bayes factor analysis
[1] supp                    : 1.198757      +- 0.01%
[2] dose                    : 4.983636e+12  +- 0%
[3] supp + dose             : 2.963312e+14  +- 1.59%
[4] supp + dose + supp:dose : 8.067205e+14  +- 1.94%

Against denominator:
  Intercept only 
Bayes factor type: BFlinearModel, JZS   

The results are shown in the form of the Jeffreys-Zellner-Siow (JZS) Bayes factor \(BF_{10}\) detailed previously. As the \(BF_{10}\) is largest for the model including both predictors supp and dose together with their interaction supp:dose, the Bayesian ANOVA favours this model over the null model which includes only the intercept. Thus, as there are the low, medium and high dose groups and the two supplement groups, in total one obtains \(3\times 2=6\) different groups. The results show that there is strong evidence that the model positing these six differing groups is favoured over the null model (and over every other model as given in output lines [1], [2] and [3]).

Note that this solution is also implemented in the open-source software JASP; for an introduction see (Bergh et al. 2019).

The Bayesian ANOVA model of (Kruschke 2015) has not been implemented in a dedicated software package so far. Instead, users have to write their own model scripts for JAGS (Plummer 2003) to run the analysis. Recently, however, the package BANOVA was published by (Dong and Wedel 2019), which uses JAGS (Plummer 2003) and the Hamiltonian Monte Carlo (HMC) sampler Stan (Carpenter et al. 2017) via the package RStan (Stan Development Team 2020) to provide similar inferences without the need to code the JAGS or Stan models oneself.

Note that in the above example, a traditional ANOVA can easily be fit via

R> summary(aov(len ~ supp * dose, data = ToothGrowth))

            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

which yields similar results, favouring the full model with both predictors and interaction term, as both predictors and the interaction term are significant.

4 The Bayesian ANOVA model based on Gaussian mixtures

The method used in the bayesanova package is based on estimation of the parameters of a Gaussian mixture distribution. A Gibbs sampling algorithm is applied to this mixture to produce, given the data, posterior distributions of all unknown parameters in the Gaussian components, that is, of \(\mu_j,\sigma_j, j=1,...,k\), of the differences in means \(\mu_l-\mu_r,l\neq r\), and of the effect sizes \(\delta_{lr},l\neq r\), where \(k\) is the number of groups in the study or experiment. This way, a relatively complete picture of the situation at hand can be drawn. While the technical aspects are omitted here, the validity of the procedure stems from standard MCMC theory, see for example (Robert and Casella 2004). The principal idea of mixture models is expressed by (Frühwirth-Schnatter 2006):

Consider a population made up of \(K\) subgroups, mixed at random in proportion to the relative group sizes \(\eta_1,...,\eta_K\). Assume interest lies in some random feature \(Y\) which is heterogeneous across and homogeneous within the subgroups. Due to heterogeneity, \(Y\) has a different probability distribution in each group, usually assumed to arise from the same parametric family \(p(y|\theta)\), however, with the parameter \(\theta\) differing across the groups. The groups may be labeled through a discrete indicator variable \(S\) taking values in the set \(\{1,...,K\}\). When sampling randomly from such a population, we may record not only \(Y\), but also the group indicator \(S\). The probability of sampling from the group labeled \(S\) is equal to \(\eta_S\), whereas conditional on knowing \(S\), \(Y\) is a random variable following the distribution \(p(y|\theta_S)\) with \(\theta_S\) being the parameter in group \(S\). (...) The marginal density \(p(y)\) is obviously given by the following mixture density \[\begin{aligned} p(y)=\sum_{S=1}^K p(y,S)=\eta_1 p(y|\theta_1)+...+\eta_K p(y|\theta_K) \end{aligned}\]

Clearly, this resembles the situation of the analysis of variance, in which the allocations \(S\) are known. Traditionally, mixtures are treated with missing allocations, but in the setting of the ANOVA these are known, leading to a much simpler scenario. This interpretation also makes sense from a semantic point of view: the inherent assumption of a researcher in a k-group ANOVA is that the population is indeed made up of \(k\) subgroups, which differ in a random feature \(Y\) that is heterogeneous across groups and homogeneous within each group. When conducting, for example, a randomized clinical trial (RCT), the group indicator \(S\) is of course recorded. The clinician will choose the patients according to a sampling plan, which could be designed to achieve equally sized groups, that is, \(\eta_1=\eta_2=...=\eta_k\) for \(k\) study groups. Thus, when sampling the population with the target of equally sized groups, the researcher will sample the objects with equal probability from the population. Consider a treatment one, treatment two and a control group. In this typical setting, the researcher could use a randomization device for each patient in the RCT to assign him or her to one of the two treatment groups or to the control group, so that the patient is assigned to any group with probability \(\eta_1=\eta_2=\eta_3=1/3\). Repeating this process then leads to the mixture model given above. After the RCT is conducted, the resulting histogram of observed \(Y\) values will finally take the form of the mixture density \(p(y)\) above. If the treatment has an effect, this density \(p(y)\) will exhibit three modes, which in turn result from the underlying mixture model of the data-generating process.

If unbalanced groups are the goal, weights can be adjusted accordingly, for example \(\eta_1=0.3\), \(\eta_2=0.2\) and \(\eta_3=0.5\). After fixing the mixture weights \(\eta_1,\eta_2,\eta_3\), the family of distributions for the mixture components needs to be selected. The above considerations lead to finite mixtures of normal distributions, which ‘occur frequently in many areas of applied statistics such as [...] medicine’ (Frühwirth-Schnatter 2006, 169). The components \(p(y|\theta_i)\) therefore become \(f_N(y;\mu_j,\sigma_j^2)\) for \(j=1,...,k\) in this case, where \(f_N(y;\mu_j,\sigma_j^2)\) is the density of the univariate normal distribution. Parameter estimation in finite mixtures of normal distributions consists of estimation of the component parameters \((\mu_j,\sigma_j^2)\), the allocations \(S_i,i=1,...,n\) and the weight distribution \((\eta_1,...,\eta_k)\) based on the available complete data \((y_i,S_i),i=1,...,n\). In the case of the Bayesian ANOVA, the allocations \(S_i\) (where \(S_i=1\) if \(y_i\) belongs to the first component, \(S_i=2\) if \(y_i\) belongs to the second component, until \(S_i=k\) if \(y_i\) belongs to the \(k\)-th group) are known for all observations \(y_i\), \(i=1,...,n\). Therefore, inference reduces to inference for the density parameters \((\mu_j,\sigma_j^2)\) of the normal components of the mixture for the \(j=1,...,k\) groups.
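The data-generating process described above can be simulated in a few lines of base R; the weights, component means and standard deviations below are arbitrary illustrative values.

```r
# Simulating the data-generating process described above: allocations S are
# drawn with weights eta, then y is drawn from the allocated normal component.
# All parameter values are arbitrary and for illustration only.
set.seed(2020)
eta   <- c(1 / 3, 1 / 3, 1 / 3)           # balanced three-group design
mu    <- c(10, 12, 15)
sigma <- c(2, 2, 2)
n     <- 600

S <- sample(1:3, size = n, replace = TRUE, prob = eta)  # known allocations
y <- rnorm(n, mean = mu[S], sd = sigma[S])              # mixture draws

table(S)          # realized group sizes scatter around n * eta
# hist(y)         # with a real effect, the histogram exhibits several modes
```

Note that the realized group sizes are random: they are expected to be close to \(n\eta_j\) but generally differ from it, exactly as discussed for the mixture interpretation of the ANOVA.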

The Bayesian ANOVA model based on Gaussian mixtures is summarized in Figure 1 using the three-group case as an example:

Figure 1: Three-component Gaussian mixture with known allocations for the Bayesian analysis of variance

The measured variables \(y_i\) follow a three-component Gaussian mixture with known allocations. The first group is normally distributed as \(\mathcal{N}(\mu_1,\sigma_1)\), the second group as \(\mathcal{N}(\mu_2,\sigma_2)\) and the third group as \(\mathcal{N}(\mu_3,\sigma_3)\). The means \(\mu_1, \mu_2\) and \(\mu_3\) are each distributed as \(\mu_j \sim \mathcal{N}(b_0,B_0), j=1,2,3\) with noninformative hyperparameters \(b_0\) and \(B_0\), and the standard deviations \(\sigma_1,\sigma_2\) and \(\sigma_3\) are distributed as \(\sigma_j \sim \mathcal{G}^{-1}(c_0,C_0), j=1,2,3\) with noninformative hyperparameters \(c_0\) and \(C_0\). For details, see (Kelter 2020c, 2021a). As the allocations are known, the weights \(\eta_1,\eta_2\) and \(\eta_3\) are known too and need not be estimated, which is why the parameters \(\eta_1,\eta_2,\eta_3\) are not included in the diagram. The model visualized in Figure 1 can be generalized to an arbitrary number of mixture components, which covers nearly arbitrary ANOVA settings for the comparison of multiple groups. A distinct advantage of this model is that inference is obtained for both means and standard deviations, yielding richer information compared to the testing perspectives which are stressed in traditional or Bayesian ANOVA models focussing on the Bayes factor. Also, posterior distributions of effect sizes can be obtained via MCMC, providing an additional layer of information to draw inferences.

Instead of relying on the Bayes factor, the bayesanova package follows the approach of (Kruschke 2018) to use a region of practical equivalence (ROPE). The effect size \(\delta\) is routinely categorized in medical research as small, medium or large when \(\delta \in [0.2,0.5)\), \(\delta \in [0.5,0.8)\) or \(\delta \in [0.8,\infty)\), see (Cohen 1988). The approach using the ROPE proceeds by taking these categories as regions of practical equivalence, that is, both \(\delta=0.25\) and \(\delta=0.26\) are identified as a small effect because both lie inside the region of practical equivalence \([0.2,0.5)\) of a small effect \(\delta\). The underlying idea is that measuring effect sizes only makes sense up to a specific precision, which is given by the above categorization of effect sizes. By studying how much probability mass of the posterior distribution of \(\delta\) lies inside each of the above ROPEs \([0.2,0.5)\), \([0.5,0.8)\) and \([0.8,\infty)\) of a small, medium and large positive effect for \(\delta\) (negative effects are handled analogously), a continuous statement about the most probable effect size \(\delta\) given the data can be made. Kruschke originally advocated using the location of the 95% highest-posterior-density (HPD) interval in relation to the ROPE to test whether the null value in the middle of the ROPE should be accepted or rejected for practical purposes. Here, this approach is generalized by switching to the amount of posterior probability mass inside the ROPE. Detailed examples are provided later in this paper.
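Given MCMC draws of the effect size \(\delta\), the posterior mass inside each ROPE reduces to simple proportions; in the sketch below, the "posterior draws" are stand-in normal samples rather than output of bayesanova.

```r
# Posterior mass of an effect size delta inside the ROPEs of a small, medium
# and large positive effect, computed from MCMC draws of delta. The draws
# here are stand-in normal samples, not output of an actual bayesanova fit.
set.seed(42)
delta_draws <- rnorm(10000, mean = 0.45, sd = 0.1)

rope_mass <- c(
  small  = mean(delta_draws >= 0.2 & delta_draws < 0.5),
  medium = mean(delta_draws >= 0.5 & delta_draws < 0.8),
  large  = mean(delta_draws >= 0.8)
)
rope_mass   # shares of posterior probability per effect-size category
```

The resulting proportions allow the continuous statement described above, e.g. that most of the posterior mass of \(\delta\) falls into the ROPE of a small effect.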

Table 1 provides an overview of the four ANOVA models and their purpose.

Table 1: Overview of the four ANOVA models

Model | Purpose | Evidence measure | Computational aspects
Frequentist ANOVA | Testing the global hypothesis that all samples are drawn from populations with identical means against the alternative | F-statistic and p-value | Analytic solution
Bayesian ANOVA of (Rouder et al. 2012) | Testing the global hypothesis that the effect size vector is zero versus the alternative | Bayes factor | Numerical integration required
Bayesian ANOVA of (Kruschke 2015) | Estimation of effect sizes between groups via the ROPE and 95% HPD interval | ROPE | Gibbs sampling via the MCMC sampler JAGS (or Stan) required
Bayesian ANOVA based on Gaussian mixtures | Estimation of effect sizes between groups via the ROPE and posterior probability mass | ROPE | Custom Gibbs sampler; no external MCMC sampler (JAGS or Stan) required

Although it appears that the model of (Kruschke 2015) and the Gaussian mixture modeling approach proposed in this paper have the same purpose, they differ in how the data \(y_i\) are assumed to be generated. In the mixture approach we assume that the sample of \(n_j\) participants in group \(j\) results from a mixture process, e.g. by flipping a coin, rolling a die or using any other randomization device (as is the case in clinical trials when assigning patients to groups according to a double-blinded protocol). Thus, the process of data generation is not “one has collected \(n_j\) participants for group \(j\)” but “the given sample of \(n_j\) participants in group \(j\) is assumed to be a realization of a mixture process where with probability \(\eta_j\) participants are assigned to group \(j\)”. Importantly, note that the realization of \(n_j\) participants in group \(j\) for \(j=1,...,k\) is expected under the mixture component weight \(\eta_j=n_j/n\), but entirely different group sizes \(n_j\) can also result under such a mixture. In fact, the weights \(\eta_j=n_j/n\) which are assumed to be known are the corresponding maximum likelihood estimators of the weight parameters \(\eta_j\) given the sample sizes \(n_j\) for \(j=1,...,k\), but the conceptual focus of the mixture approach is to closely mimic the randomization process researchers follow when conducting a randomized controlled trial. Note further that the model of Kruschke assumes homogeneity of variances, in contrast to the Gaussian mixture model; however, Kruschke’s model can easily be extended to account for heterogeneity of variance, rendering this difference less important. Note that both the frequentist ANOVA and the Bayesian version of (Rouder et al. 2012) assume homogeneity of variance across groups.

5 Package structure and implementation

The bayesanova package has four functions. These provide (1) the MCMC algorithm for conducting the Bayesian ANOVA in the Gaussian mixture model with known allocations detailed above, (2) checks of the model assumptions, (3) visualizations of the posterior results for easy interpretation and communication of research results, and (4) visualizations of the posterior mixture components in comparison with the original data. An overview is provided in Table 2.

Table 2: Outline of the four main functions implemented in bayesanova

Function | Description
bayes.anova | Main function of the package; conducts the MCMC algorithm to provide full posterior inference in the Gaussian mixture model with known allocations
assumption.check | Helper function for checking the assumption of normality in each group prior to running a Bayesian ANOVA
anovaplot | Provides multiple visualizations of the results, including posterior distributions, differences in means and standard deviations and effect sizes, as well as a full ROPE analysis
post.pred.check | Provides a posterior predictive check for a fitted Bayesian ANOVA model

The core function is bayes.anova, which provides the MCMC algorithm to obtain full posterior inference in the \(k\)-component Gaussian mixture model, shown in Figure 1 for the special case of \(k=3\) components. The function implements a Gibbs sampling algorithm, which iteratively updates

  1. the means \(\mu_j|\mu_{-j}, \sigma_1, ..., \sigma_k, S, y\) given the other means \(\mu_{-j}\) and standard deviations \(\sigma_1, ..., \sigma_k\) as well as the full data \(S,y\), where \(S\) is the indicator vector for the groups the observations \(y\) belong to

  2. the standard deviations \(\sigma_j|\sigma_{-j}, \mu_1, ..., \mu_k, S, y\) given the other standard deviations \(\sigma_{-j}\) and means \(\mu_1, ..., \mu_k\) as well as the full data \(S,y\), where \(S\) is again the indicator vector for the groups the observations \(y\) belong to

The details of the Gibbs sampler can be found in (Kelter 2020c, 2021a), and the validity of the method follows from standard MCMC theory, see for example (Robert and Casella 2004).
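The structure of such a Gibbs sampler can be sketched in a few lines of R. The sketch below assumes conjugate normal priors on the means and inverse-gamma priors on the variances, with illustrative hyperparameter values; it is not the exact sampler implemented in bayesanova (see Kelter 2020c for the actual prior setup).

```r
# Minimal Gibbs sampler for a k-component Gaussian mixture with known
# allocations S, assuming mu_j ~ N(b0, B0) and sigma_j^2 ~ IG(c0, C0).
# Hyperparameter values are illustrative only.
gibbs_mixture <- function(y, S, k, n_iter = 2000,
                          b0 = 0, B0 = 100, c0 = 0.01, C0 = 0.01) {
  mu     <- tapply(y, S, mean)  # initialize at group means
  sigma2 <- tapply(y, S, var)   # initialize at group variances
  draws  <- matrix(NA, n_iter, 2 * k)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(k)) {
      yj <- y[S == j]; nj <- length(yj)
      # full conditional of mu_j given sigma_j^2 and the data: normal
      Bn <- 1 / (1 / B0 + nj / sigma2[j])
      bn <- Bn * (b0 / B0 + sum(yj) / sigma2[j])
      mu[j] <- rnorm(1, bn, sqrt(Bn))
      # full conditional of sigma_j^2 given mu_j and the data: inverse-gamma
      cn <- c0 + nj / 2
      Cn <- C0 + 0.5 * sum((yj - mu[j])^2)
      sigma2[j] <- 1 / rgamma(1, cn, rate = Cn)
    }
    draws[it, ] <- c(mu, sigma2)
  }
  colnames(draws) <- c(paste0("mu", 1:k), paste0("sigma2_", 1:k))
  draws
}

# Example with three simulated groups of 50 observations each
set.seed(1)
y <- c(rnorm(50, 10, 2), rnorm(50, 20, 2), rnorm(50, 26, 2))
S <- rep(1:3, each = 50)
d <- gibbs_mixture(y, S, k = 3)
colMeans(d[-(1:1000), 1:3])  # close to the true means 10, 20, 26
```

Discarding the first half of the iterations as burn-in, as in the example, mirrors the default behavior of bayes.anova described below.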

The bayes.anova function takes as input three numerical vectors first, second and third, which correspond to the observed responses in each of the three groups and provides multiple optional parameters:

bayes.anova(n=10000, first, second, third, 
fourth = NULL, fifth = NULL, sixth = NULL, 
hyperpars="custom", burnin=n/2, sd="sd", q=0.1, ci=0.95)

The three vectors first, second and third are the only mandatory input values; currently, bayesanova supports a maximum of six groups. More than three groups can be handed to the function by providing numerical vectors for the parameters fourth, fifth and sixth.

If no other parameters are provided, the function uses a default of n=10000 Gibbs sampling iterations, with the burn-in of the Markov chains set to burnin=n/2, so that the first 5000 iterations are discarded. The default setting performs inference for the means \(\mu_j\) and standard deviations \(\sigma_j\), indicated by the parameter sd="sd"; inference for the variances \(\sigma_j^2\) instead of the standard deviations \(\sigma_j\) can easily be obtained by setting sd="var". The credible level for all computed credible intervals defaults to \(0.95\), indicated by ci=0.95. The two remaining parameters hyperpars and q define preselected values for the hyperparameters in the prior, ensuring that weakly informative priors are used which influence the analysis as little as possible. For details, see (Kelter 2020c, 2021a); in general, these values apply to a broad range of contexts, so changing them is not recommended. Note that an alternative set of hyperparameters based on (Raftery 1996) can be selected via hyperpars="rafterys", if desired.

After execution, the function returns a dataframe containing four Markov chains for each parameter, each of the specified size n-burnin, making subsequent convergence assessment or post-processing of the MCMC results possible.

The second function is assumption.check. This function runs a preliminary assumption check on the data, which is recommended before running a Bayesian ANOVA. The model assumptions are normality in each mixture component, so that the assumption.check function runs Shapiro-Wilk tests to check for normality (Shapiro and Wilk 1965). The input parameters are the three numerical vectors x1, x2 and x3 including the observed responses in the first, second and third group, and the desired confidence level conf.level for the Shapiro-Wilk tests:

assumption.check(x1, x2, x3, x4 = NULL, x5 = NULL, x6 = NULL, conf.level=0.95)

The default confidence level is 0.95. More than three groups can easily be added by providing values for x4, x5 and x6.
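Conceptually, assumption.check amounts to one Shapiro-Wilk test per group (the package additionally produces histograms and Q-Q plots). A minimal base-R sketch on simulated data:

```r
# One Shapiro-Wilk test of normality per group, which is the core of what
# assumption.check performs (plots omitted); the data here are simulated.
set.seed(42)
groups <- list(x1 = rnorm(30, 10, 2),
               x2 = rnorm(30, 20, 2),
               x3 = rnorm(30, 26, 2))
pvals <- sapply(groups, function(g) shapiro.test(g)$p.value)
round(pvals, 3)  # p-values above 0.05 indicate no detected deviation from normality
```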

The third function is anovaplot, which provides a variety of visualizations of the results. The function takes as input a dataframe dataframe, which should be the result of the bayes.anova function detailed above; a parameter type, which indicates which visualization is desired; a parameter sd, which indicates whether the provided dataframe includes posterior draws of the \(\sigma_j\) or the \(\sigma_j^2\); and finally a parameter ci, which again defines the credible level used in the computations.

anovaplot(dataframe, type="rope", sd="sd", ci=0.95)

The default value for sd is "sd", and the default credible level is 0.95. The type parameter takes one of four possible values: (1) type="pars", (2) type="diff", (3) type="effect" and (4) type="rope". In the first case, posterior distributions of all model parameters are produced, complemented by convergence diagnostics in the form of trace plots, autocorrelation plots and the Gelman-Brooks-Rubin shrink factor (Gelman and Brooks 1998), which should be close to one to indicate convergence to the posterior. In the second case, the posterior distributions of the differences \(\mu_i-\mu_j, j\neq i\) of the group means and of the differences \(\sigma_l-\sigma_r, l\neq r\) of the group standard deviations (or variances, if sd="var" and the dataframe includes posterior draws of the \(\sigma_j^2\)’s instead of the \(\sigma_j\)’s) are produced, complemented by the same convergence diagnostics. In the third case, the posterior distributions of the effect sizes \(\delta_{lr}, l\neq r\) are produced, which are most often of interest in applied research; these posteriors are also complemented by the same convergence diagnostics. The fourth and last case produces a full ROPE analysis, which provides the posteriors of the effect sizes \(\delta_{lr}, l\neq r\) and additionally computes a partitioning of the posterior probability mass into the standardized ROPEs of no, small, medium and large effect sizes according to (Cohen 1988), which are the reference standard in medical and psychological research.
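A ROPE analysis boils down to measuring how much posterior mass of an effect size falls into each of Cohen’s bands. The standalone sketch below assumes the conventional cutoffs 0.2, 0.5 and 0.8 for small, medium and large effects; the exact partitioning reported by anovaplot may differ in detail, and the posterior draws here are simulated stand-ins.

```r
# Partition the posterior mass of an effect size delta into the ROPEs of
# no, small, medium and large effects, using Cohen's (1988) cutoffs.
rope_partition <- function(delta) {
  a <- abs(delta)
  c(none   = mean(a < 0.2),
    small  = mean(a >= 0.2 & a < 0.5),
    medium = mean(a >= 0.5 & a < 0.8),
    large  = mean(a >= 0.8))
}

# Simulated posterior draws of an effect size concentrated around -0.6
set.seed(1)
part <- rope_partition(rnorm(10000, mean = -0.6, sd = 0.15))
round(part, 3)  # most mass falls into the "medium" band
```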

The last function post.pred.check provides a posterior predictive check for a fitted Bayesian ANOVA model against the original data, which is routine in a Bayesian workflow (Gabry et al. 2019).
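While post.pred.check operates directly on a fitted Bayesian ANOVA model, the underlying idea can be sketched by hand: simulate replicated datasets from posterior draws and compare a test statistic of the observed data to its replicated distribution. The posterior draws below are stand-ins, not output of bayesanova.

```r
# By-hand posterior predictive check for a single group: compare the
# observed sample mean to its distribution under replicated datasets.
set.seed(1)
y_obs <- rnorm(40, 10, 2)                        # observed data for one group
mu_draws <- rnorm(1000, mean(y_obs), 0.3)        # stand-in posterior draws for mu
sigma_draws <- abs(rnorm(1000, sd(y_obs), 0.2))  # stand-in draws for sigma
t_rep <- sapply(seq_along(mu_draws), function(i)
  mean(rnorm(length(y_obs), mu_draws[i], sigma_draws[i])))
ppp <- mean(t_rep >= mean(y_obs))  # a value near 0.5 indicates an adequate fit
ppp
```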

6 Illustrations and examples

This section provides illustrations and a variety of examples in which the bayesanova package can be used and yields richer information than existing solutions.

Tooth growth of guinea pigs treated with vitamin C

The guinea pig dataset from above is used as a first example. The data are included in the dataset ToothGrowth in the datasets package, which is part of R. First, the data are loaded and split into three groups, corresponding to low, medium and high administered vitamin C doses:

R> library(datasets)
R> data(ToothGrowth)
R> head(ToothGrowth,n=3)

   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5

R> library(dplyr)
R> library(tidyr)
R> library(bayesanova)
R> grp1 = (ToothGrowth %>% filter(dose==0.5) %>% select(len))$len 
R> grp2 = (ToothGrowth %>% filter(dose==1.0) %>% select(len))$len 
R> grp3 = (ToothGrowth %>% filter(dose==2.0) %>% select(len))$len

Next, we run the assumption checks on the data:

R> assumption.check(grp1, grp2, grp3, conf.level=0.95)

Model assumptions checked. No significant deviations from normality detected. 
Bayesian ANOVA can be run safely.   
Figure 2: Assumption checks for the ToothGrowth dataset using the assumption.check() function in bayesanova, showing that data in the three groups can be assumed as normally distributed so that running the Bayesian ANOVA based on the Gaussian mixture model is justified

Figure 2 shows the histograms and quantile-quantile plots for all three groups produced by assumption.check(). Clearly, there are no large deviations, and no Shapiro-Wilk test was significant at the \(0.05\) level.

Next, the Bayesian ANOVA can be run via the bayes.anova function. The default parameter values are used, so that n=10000 Gibbs sampling iterations are run and, after discarding the burn-in, 5000 posterior draws remain:

R> set.seed(42)
R> res = bayes.anova(first = grp1, second = grp2, third = grp3)

|Parameter     |LQ    |Mean  |UQ    |Std.Err |
|--------------|------|------|------|--------|
|mu1           |8.69  |10.61 |12.5  |0.91    |
|mu2           |18.05 |19.75 |21.46 |0.84    |
|mu3           |24.94 |26.1  |27.25 |0.57    |
|sigma1        |3.02  |4.07  |5.67  |0.67    |
|sigma2        |2.95  |3.96  |5.43  |0.64    |
|sigma3        |2.43  |3.25  |4.42  |0.52    |
|mu2-mu1       |6.7   |9.15  |11.7  |1.25    |
|mu3-mu1       |13.42 |15.49 |17.67 |1.06    |
|mu3-mu2       |4.36  |6.34  |8.38  |1.01    |
|sigma2-sigma1 |-2.02 |-0.11 |1.68  |0.93    |
|sigma3-sigma1 |-2.62 |-0.81 |0.74  |0.85    |
|sigma3-sigma2 |-2.46 |-0.71 |0.85  |0.82    |
|delta12       |-5.77 |-4.59 |-3.21 |0.65    |
|delta13       |-9.37 |-8.14 |-6.63 |0.71    |
|delta23       |-4.36 |-3.36 |-2.19 |0.56    |

The results table shows the lower quantile (LQ) and upper quantile (UQ), corresponding to the \((100-100\cdot \text{ci})/2\)% and \(100\cdot \text{ci}+(100-100\cdot \text{ci})/2\)% quantiles, where ci is the credible level chosen above; for ci\(=0.95\) these are the 2.5% and 97.5% quantiles. Also, the posterior mean and standard error are given for each parameter, difference of parameters and effect size. The results clearly show that there are large differences between the groups: one can immediately spot that the more vitamin C is given, the more tooth growth is observed via the tooth lengths. While the first group (low dose) has a posterior mean of \(10.61\) with credible interval \([8.69,12.5]\), the second group achieves a mean of \(19.75\) with credible interval \([18.05,21.46]\). The third group even has a posterior mean of \(26.1\) with credible interval \([24.94,27.25]\). The posterior estimates for the differences \(\mu_2-\mu_1\), \(\mu_3-\mu_1\) and \(\mu_3-\mu_2\) show that all groups differ from each other with very high probability, given the data.
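The relation between the credible level ci and the reported quantiles can be reproduced directly from posterior draws; the draws below are simulated stand-ins for \(\mu_1\), not actual MCMC output.

```r
# LQ and UQ are the (1-ci)/2 and (1+ci)/2 quantiles of the posterior draws
ci <- 0.95
probs <- c((1 - ci) / 2, (1 + ci) / 2)  # 0.025 and 0.975
set.seed(1)
draws_mu1 <- rnorm(10000, mean = 10.61, sd = 0.91)  # stand-in draws for mu1
q_mu1 <- quantile(draws_mu1, probs)
round(q_mu1, 2)
```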

Note that the information provided is much more fine-grained than in the solutions via the traditional ANOVA and the Jeffreys-Zellner-Siow based Bayes-factor ANOVA above. While in these two solutions one could only infer that the model using both predictors and the interaction term is the best, now we are given precise estimates of the effect sizes between each group defined by the dose of vitamin C administered. Note also that including the second predictor supp is no problem, leading to a setting which then incorporates six groups in the mixture.
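For illustration, the six groups arising from crossing dose and supp can be obtained in base R; the resulting vectors could then be passed as the parameters first through sixth:

```r
# Crossing supp (OJ, VC) with dose (0.5, 1, 2) yields six groups
library(datasets)
data(ToothGrowth)
grps <- split(ToothGrowth$len, interaction(ToothGrowth$supp, ToothGrowth$dose))
length(grps)  # 6 groups of 10 observations each
```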

Heart rate data for runners

The second example is from the biomedical sciences and uses the heart rate data from (Moore et al. 2012). In the study, the heart rates of female and male runners and of generally sedentary participants (not running regularly) were recorded following six minutes of exercise. For each participant, Gender and Heart.Rate are given, as well as the group he or she belongs to (Group=="Runners" or Group=="Control"). In the study, 800 participants were recruited, so that 200 subjects participated in each of the four groups given by the combinations of Gender and Group.

Therefore, the situation requires a \(2\times 2\) between subjects ANOVA. Specifically, interest lies in the hypothesis that heart rate differs between gender and groups. The Bayesian ANOVA of bayesanova can easily be applied in such an often encountered setting. We first load the data and split them into the four groups:

R> library(dplyr)
R> hr=read.csv("heartrate.csv",sep=",")
R> head(hr)

  Gender   Group Heart.Rate
1 Female Runners        119
2 Female Runners         84
3 Female Runners         89
4 Female Runners        119
5 Female Runners        127
6 Female Runners        111

R> femaleRunners = (hr %>% filter(Gender=="Female") 
+  %>% filter(Group=="Runners")  
+  %>% select(Heart.Rate))$Heart.Rate
R> maleRunners = (hr %>% filter(Gender=="Male") %>% filter(Group=="Runners") 
+  %>% select(Heart.Rate))$Heart.Rate
R> femaleControl = (hr %>% filter(Gender=="Female") 
+  %>% filter(Group=="Control")  
+  %>% select(Heart.Rate))$Heart.Rate
R> maleControl = (hr %>% filter(Gender=="Male") %>% filter(Group=="Control")
+  %>% select(Heart.Rate))$Heart.Rate

Then, we check the model assumptions:

R> assumption.check(femaleRunners, maleRunners, femaleControl, maleControl)

We can thus safely proceed running the Bayesian ANOVA:

R> set.seed(42)
R> resRunners = bayes.anova(first = femaleRunners, second = maleRunners, 
+  third = femaleControl, fourth = maleControl)

|Parameter     |LQ     |Mean   |UQ     |Std.Err |
|--------------|-------|-------|-------|--------|
|mu1           |113.48 |116    |118.5  |1.27    |
|mu2           |102.51 |103.98 |105.55 |0.76    |
|mu3           |145.44 |148.04 |150.52 |1.3     |
|mu4           |127.12 |130.01 |132.82 |1.47    |
|sigma1        |14.38  |15.87  |17.51  |0.8     |
|sigma2        |11.21  |12.35  |13.67  |0.63    |
|sigma3        |14.71  |16.19  |17.85  |0.82    |
|sigma4        |15.46  |17.02  |18.79  |0.85    |
|mu2-mu1       |-14.9  |-12.01 |-9.06  |1.48    |
|mu3-mu1       |28.47  |32.04  |35.6   |1.83    |
|mu4-mu1       |10.19  |14.01  |17.9   |1.96    |
|mu3-mu2       |41.12  |44.05  |46.95  |1.51    |
|mu4-mu2       |22.83  |26.02  |29.21  |1.66    |
|mu4-mu3       |-21.8  |-18.03 |-14.4  |1.92    |
|sigma2-sigma1 |-5.6   |-3.52  |-1.57  |1.02    |
|sigma3-sigma1 |-1.94  |0.32   |2.53   |1.15    |
|sigma4-sigma1 |-1.14  |1.15   |3.51   |1.18    |
|sigma3-sigma2 |1.83   |3.84   |5.85   |1.03    |
|sigma4-sigma2 |2.7    |4.67   |6.8    |1.05    |
|sigma4-sigma3 |-1.48  |0.83   |3.13   |1.17    |
|delta12       |2.4    |3.2    |3.96   |0.4     |
|delta13       |-8.92  |-8.01  |-7.05  |0.48    |
|delta14       |-4.42  |-3.46  |-2.5   |0.49    |
|delta23       |-12.55 |-11.67 |-10.77 |0.45    |
|delta24       |-7.65  |-6.79  |-5.91  |0.45    |
|delta34       |3.52   |4.43   |5.37   |0.48    |

The results now reveal multiple insights. To support the interpretation, we first produce visualizations of the results via the anovaplot() function:

R> anovaplot(resRunners)

Figure 3 shows the plots produced by the above call to anovaplot().