This paper introduces the R package *bayesanova*, which performs Bayesian inference in the analysis of variance (ANOVA). Traditional ANOVA based on null hypothesis significance testing (NHST) is prone to overestimating effects and to reporting effects when none are present. The Bayesian ANOVAs developed so far are based on Bayes factors (BF), which also enforce a hypothesis-testing stance. Instead, the Bayesian ANOVA implemented in *bayesanova* focuses on effect size estimation and is based on a Gaussian mixture with known allocations, for which full posterior inference for the component parameters is implemented via Markov chain Monte Carlo (MCMC). Inference for the differences in means, standard deviations and effect sizes between each of the groups is obtained automatically. Estimation of the parameters instead of hypothesis testing is embraced via the region of practical equivalence (ROPE), and helper functions provide checks of the model assumptions and visualization of the results.

This article introduces
*bayesanova*, an R
package for conducting a Bayesian analysis of variance (ANOVA) via
Markov Chain Monte Carlo (MCMC) in a Gaussian mixture model. Classic
frequentist analysis of variance is based on null hypothesis
significance testing (NHST), which recently has been shown to produce
serious problems regarding the reproducibility and reliability of
scientific results
(Wasserstein and Lazar 2016; Colquhoun 2017, 2019; Benjamin et al. 2018; Wasserstein et al. 2019).
NHST is based on test statistics, \(p\)-values and significance levels
\(\alpha\), which are designed to control the long-term false-positive
rate. Still, in a multitude of settings these approaches do in fact lead
to an inflated rate of false-positive results, undermining the validity
and progress of science. Examples include optional stopping of
participant recruiting in studies (Carlin and Louis 2009) or the necessary testing
for violations of distributional assumptions which some frequentist
hypothesis tests make (Rochon et al. 2012).

As a solution to these problems, Bayesian methods have been proposed recently and have since been gaining popularity in a wide range of scientific domains (Kruschke 2013, 2015; McElreath and Smaldino 2015). The Bayesian approach proceeds by combining the model likelihood \(f(x|\theta)\) with the available prior information \(p(\theta)\) to obtain the posterior distribution \(p(\theta|x)\) through the use of Bayes’ theorem: \[\label{eq:mean1} p(\theta|x) \propto f(x|\theta) p(\theta) \tag{1}\] While the Bayesian philosophy thus allows for flexible modeling, inference for the posterior distribution \(p(\theta|x)\) can be complicated in practice. Therefore, Markov chain Monte Carlo (MCMC) techniques have been developed, which (1) construct a Markov chain that has the posterior distribution \(p(\theta|x)\) as its stationary distribution, and (2) draw samples from this Markov chain to approximate the posterior \(p(\theta|x)\) numerically.
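Equation (1) can be made concrete with a small numerical sketch. The following Python snippet (purely illustrative, not part of *bayesanova*) approximates a posterior on a grid by multiplying likelihood and prior and normalizing, for a binomial likelihood with a flat prior:

```python
import numpy as np

# Grid approximation of Bayes' theorem: posterior is proportional to
# likelihood times prior. Toy example: binomial likelihood with
# 7 successes in 10 trials, flat prior on theta.
theta = np.linspace(0.001, 0.999, 999)      # parameter grid
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                 # flat prior p(theta)
likelihood = theta**7 * (1 - theta)**3      # f(x | theta) up to a constant
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)  # normalize numerically

# With a flat prior the exact posterior is Beta(8, 4), whose mean is 8/12.
posterior_mean = (theta * posterior * dtheta).sum()
print(round(posterior_mean, 3))
```

For more than a handful of parameters such grids become infeasible, which is exactly where the MCMC techniques described above come in.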

The *bayesanova* package is designed as a Bayesian alternative to the
frequentist analysis of variance. By using a Gaussian mixture model and
implementing a Markov Chain Monte Carlo algorithm for this model, full
posterior inference can be obtained. This allows for explicit hypothesis
testing between groups as in the frequentist ANOVA, or for estimation of
parameters under uncertainty. *bayesanova* focuses on the latter
perspective, avoiding explicit hypothesis testing. While Bayesian
versions of the analysis of variance have been proposed recently by
(Rouder et al. 2012) and (Bergh et al. 2019), these implementations focus on
the Bayes factor as a measure of evidence (Doorn et al. 2019; JASP Team 2019). As
the Bayes factor suffers from multiple problems, one of which is its
strong dependence on the used priors – see (Kamary et al. 2014) and
(Robert 2016) – the implementation in *bayesanova* avoids the Bayes
factor and uses a different posterior index, the region of practical
equivalence (ROPE) (Kruschke 2018), which has lately been shown to have
some desirable properties, in particular in contrast to the Bayes factor
(Makowski et al. 2019a).

The plan of the paper is as follows: The next section introduces the
analysis of variance in a frequentist and Bayesian fashion and gives an
overview of packages implementing these methods. The following
section then introduces the novel approach implemented in *bayesanova*.
The details on the mixture representation of the Bayesian analysis of
variance are discussed and scenarios where *bayesanova* is designed to
be used are detailed. The section thereafter outlines the structure of
the package and details the included functions. The following section
presents a variety of examples and illustrations using real datasets
from biomedical and psychological research as well as synthetic
datasets. The last section then provides a summary of the benefits and
drawbacks of the used implementation, as well as future plans for the
package.

In applied statistics, the one-way analysis of variance is a method which can be used to compare the means of two or more (typically three) samples. The one-way ANOVA assumes numerical (response) data in each group and (usually) categorical input data such as a group indicator in a randomized clinical trial (RCT). Interpreting the ANOVA as a linear model, one obtains for data \(y_{i,j}\), where \(i=1,...,n\) is an index over the experimental units (patients, participants) and \(j=1,...,k\) an index over treatment groups, \[\label{eq:mean} y_{i,j}=\mu_j+\varepsilon_{i,j} \tag{2}\] if the experiment is completely randomized. Here, \(\varepsilon_{i,j} \sim \mathcal{N}(0,\sigma^2)\), so that the residuals \(\varepsilon_{i,j}\) are normally distributed with mean zero. \(\mu_j\) is the mean of treatment group \(j\) and \(y_{i,j}\) the response variable which is measured in the experiment.

The one-way ANOVA then tests the null hypothesis \(H_0\) that all samples are drawn from populations with identical means. To do this, (1) two estimates of the population variance are obtained, which rely on various assumptions, and (2) an F-statistic is produced, which is the ratio of the variance calculated among the group means to the variance within the samples. The intuition is that if all samples are drawn from populations with identical means, the variance among the group means should be small relative to the variance within the samples, so that a high ratio indicates differing means. Mathematical details on computing the F-statistic can be found in the Appendix.
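As a numerical illustration of this ratio (not taken from the package, and using synthetic data), the F-statistic can be computed directly from the between-group and within-group sums of squares and checked against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three synthetic groups with differing means and equal variances,
# matching the ANOVA assumptions (illustrative data only).
groups = [rng.normal(loc, 1.0, size=20) for loc in (0.0, 0.5, 1.0)]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total sample size
grand_mean = np.concatenate(groups).mean()

# Sums of squares among the group means (between) and within the samples.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two resulting variance estimates.
F = (ss_between / (k - 1)) / (ss_within / (n - k))

# Check against the reference implementation.
F_ref, p_value = stats.f_oneway(*groups)
print(round(F, 4), round(F_ref, 4), round(p_value, 4))
```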

The one-way ANOVA as detailed above makes several assumptions, the most important of which are: (1) variances of populations are equal; (2) responses for a given group are independent and identically distributed random variables; (3) response variable residuals are normally distributed, that is \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\).

While Monte Carlo studies have shown that the ANOVA is quite robust to small to medium violations of these assumptions (Donaldson 1966), severe violations of assumptions (1)-(3) will result in inflated rates of false positives and thereby unreliable results (Tiku 1971).

Bayesian models for the ANOVA have been developed recently to solve some of the problems of NHST. The developed models can be categorized broadly into two approaches: The first approach relies on the Bayes factor as a measure of relative evidence and was developed by (Rouder et al. 2012). The second approach is based on MCMC algorithms like Gibbs sampling in JAGS (Plummer 2003) or Hamiltonian Monte Carlo (HMC) in Stan (Carpenter et al. 2017; Stan Development Team 2020). This approach was popularized by (Kruschke 2015). Here the region of practical equivalence (ROPE) as introduced by (Kruschke 2015) is used for measuring the evidence given the data. Also, an explicit hypothesis testing stance is avoided.

The approach of (Rouder et al. 2012) can be summarized as follows: An independent Cauchy prior is considered \[\begin{aligned} \label{eq:indepCauchyPrior} p(\theta)=\prod_{i=1}^p \frac{1}{(1+\theta_i^2)\pi} \end{aligned} \tag{3}\] for the vector \(\boldsymbol{\theta}=(\theta_1,...,\theta_p)'\) of the \(p\) effects between different groups. For example, in a three-group setting there would be three effects \(\theta_1\), \(\theta_2\) and \(\theta_3\) corresponding to the effects between the first and second, first and third, and second and third group. In the case of \(k=4\) groups, there are \(p=6\) effects and so on. The ANOVA is then rewritten as a linear model \[\begin{aligned} \label{eq:rouderAnova} \boldsymbol{y}=\boldsymbol{\mu} \boldsymbol{1}+\sigma \boldsymbol{X} \boldsymbol{\theta} +\boldsymbol{\varepsilon} \end{aligned} \tag{4}\] where \(\boldsymbol{\mu}\) is the grand mean parameter, \(\boldsymbol{1}\) a column vector of length \(n\) with entries equal to \(1\), \(\boldsymbol{\theta}\) a column vector of the standardized effect size parameters of length \(p\), and \(\boldsymbol{X}\) is the \(n\times p\) design matrix. The factor \(\sigma\) in \(\sigma\boldsymbol{X}\boldsymbol{\theta}\) is attributed to the reparameterization according to Jeffreys: Following (Jeffreys 1961) by reparameterizing \(\delta=\mu/\sigma\), where \(\delta\) is the effect size of (Cohen 1988), (Rouder et al. 2012) rewrote the observed data sampling distribution as \[\begin{aligned} y_i \sim \mathcal{N}(\sigma \delta,\sigma^2) \end{aligned}\] The residuals \(\boldsymbol{\varepsilon}\) in Equation ((4)) are defined to be \[\begin{aligned} \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0},\sigma^2 \boldsymbol{I}) \end{aligned}\] with \(\boldsymbol{I}\) being the identity matrix of size \(n\) and \(\boldsymbol{0}\) a column vector of zeros of size \(n\).

Putting a Jeffreys prior \(p(\mu,\sigma^2)=1/\sigma^2\) on the mean and variance, and assuming the following g-prior structure \[\begin{aligned} \boldsymbol{\theta}|\boldsymbol{G}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{G}) \end{aligned}\] which is based on (Zellner 1980), where \(\boldsymbol{G}\) is a \(p\times p\) diagonal matrix, the only open aspect remaining is putting a prior on the diagonal elements \(g_l\) of \(\boldsymbol{G}\) for \(l=1,...,p\). (Rouder et al. 2012) chose \[\begin{aligned} g_l \sim \text{Inverse-}\chi_1^2 \end{aligned}\] so that the marginal prior on the effect size parameter vector \(\boldsymbol{\theta}\) results in the independent Cauchy distribution given in Equation ((3)). (Rouder et al. 2012) then showed that the resulting \(BF_{10}\) can be written as \[\begin{aligned} BF_{10}=\int_g K(\boldsymbol{n},g)\left ( \frac{\sum_{i} \sum_{j}(y_{ij}-\bar{y})^2+\frac{1}{g}(\sum_{i}c_i \bar{y}_i^2-(\sum_{i}c_i \bar{y}_i)^2 / (\sum_{i}c_i))}{\sum_{i} \sum_{j}(y_{ij}-\bar{y})^2}\right )^{-(N-1)/2}p(g)dg \end{aligned}\] if a balanced one-way design is used (equal sample sizes in each group). Here, \(\boldsymbol{n}=(n_1,...,n_k)'\) is the vector of sample sizes of the groups \(1,...,k\), \(N=\sum_i n_i\) is the full sample size, \(c_i=n_i/(n_i+1/g)\) and \[\begin{aligned} K(\boldsymbol{n},g)=\sqrt{N}\left ( \frac{\prod_i 1/(1+gn_i)}{\sum_i n_i/(1+gn_i)} \right )^{1/2} \end{aligned}\] In summary, this Bayes factor of (Rouder et al. 2012) can be computed via Gaussian quadrature, as it constitutes a one-dimensional integral after inserting the necessary quantities.
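The claim that the inverse-\(\chi_1^2\) prior on \(g\) together with \(\theta|g \sim \mathcal{N}(0,g)\) yields the marginal Cauchy prior of Equation (3) can be verified numerically. The following Python sketch (illustrative; it uses generic one-dimensional quadrature, not the package or the full Bayes factor formula) integrates the normal density against the inverse-\(\chi_1^2\) density for a single effect:

```python
import numpy as np
from math import pi, sqrt, exp, log
from scipy.integrate import quad

def inv_chisq1_pdf(g):
    """Density of the inverse-chi-square distribution with 1 degree of freedom."""
    if g <= 0:
        return 0.0
    # Computed on the log scale to avoid overflow for very small g.
    return exp(-1.5 * log(g) - 1.0 / (2.0 * g)) / sqrt(2.0 * pi)

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

def marginal(theta):
    """Mix N(theta; 0, g) over g ~ Inverse-chi-square(1) by quadrature."""
    val, _ = quad(lambda g: normal_pdf(theta, g) * inv_chisq1_pdf(g), 0.0, np.inf)
    return val

# The marginal should match the standard Cauchy density, i.e. Equation (3) with p = 1.
for theta in (0.0, 0.5, 2.0):
    cauchy = 1.0 / ((1.0 + theta**2) * pi)
    print(round(marginal(theta), 6), round(cauchy, 6))
```

The same quadrature machinery is what makes the one-dimensional integral representation of \(BF_{10}\) above computationally convenient.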

The second approach of a Bayesian ANOVA model can be credited to (Kruschke 2015), who uses the MCMC sampler JAGS (Plummer 2003) to obtain full posterior inference in his model instead of relying on the Bayes factor. The reasons for avoiding the Bayes factor as a measure of evidence are that (1) it depends strongly on the selected prior modeling (Kamary et al. 2014); (2) the Bayes factor states only relative evidence for the alternative over the null hypothesis (or vice versa), so that even a large Bayes factor does not indicate that either one of both hypotheses is a good fit for the actual data (Kelter 2020b,a); (3) it remains within the same hypothesis-testing formalism that the pioneers of frequentist testing advocated at the time of its invention (Robert 2016; Tendeiro and Kiers 2019). In addition, the calculation of the Bayes factor for increasingly complex models can be difficult, as the above derivations of (Rouder et al. 2012) exemplify, see also (Kamary et al. 2014). Importantly, the Bayes factor assigns positive measure to a Lebesgue null set, which is puzzling from a measure-theoretic perspective, compare (Kelter 2021d), (Rao and Lovric 2016), and (Berger 1985).

(Kruschke 2015) modeled the Bayesian ANOVA for \(k\) groups and \(n\) observations \(y_1,...,y_n\) as a hierarchical Bayesian model, where \[\begin{aligned} \label{eq:dataKruschke} y_i \sim \mathcal{N}(\mu,\sigma_y^2) \end{aligned} \tag{5}\] where the standard deviation \(\sigma_y\) is modeled as \[\begin{aligned} \sigma_y \sim \mathcal{U}(L,H) \end{aligned}\] the mean \(\mu\) is the linear combination \[\begin{aligned} \label{eq:meansKruschke} \mu=\beta_0+\sum_{j=1}^k \beta_j x_j(i) \end{aligned} \tag{6}\] and the coefficients of this linear combination are given as \[\begin{aligned} \beta_0 \sim \mathcal{N}(M_0,S_0)\\ \beta_j \sim \mathcal{N}(0,\sigma_{\beta}) \end{aligned}\] where \(x_j(i)\) is the index for the group the observation \(y_i\) belongs to. If, for example, \(y_i\) is in the first group, \(x_1(i)=1\) and \(x_j(i)=0\) for all \(j\neq 1\) with \(j\in \{1,...,k\}\), yielding the group mean \(\mu=\beta_0+\beta_1\) of the first group. Thus, although Equation ((5)) seems to indicate that there is a single mean \(\mu\) for all observations \(y_i\), \(i=1,...,n\), the mean \(\mu\) takes \(k\) different values depending on which group the observation \(y_i\) is located in. These \(k\) different values for \(\mu\) correspond to the different means in the \(k\) groups as shown in Equation ((6)). The variables \(L,H,M_0,S_0\) are hyperparameters, and the parameter \(\beta_j\) can be interpreted as the effect size differing from the grand mean \(\beta_0\), which is why the prior on \(\beta_j\) is normal with mean zero, so that the effect size differences from the grand mean are expected to sum to zero again. The hyperparameter \(\sigma_\beta\) can either be set constant or given another prior, extending the multilevel model, where (Kruschke 2015) followed the recommendations of (Gelman and Hill 2006) to use a folded t-distribution or a gamma-distribution with non-zero mode.

Inference for the full posterior, that is, for the parameters \(\sigma_y,\beta_0,\beta_1,...,\beta_k\) (and \(\sigma_\beta\), if a hyperprior like a folded t-distribution or gamma-distribution is used on this parameter) given the data is provided via the MCMC sampler JAGS (Plummer 2003), which uses Gibbs sampling to draw samples from the posterior. The posterior distributions obtained through Gibbs sampling are finally used to estimate all parameters via \(95\%\) highest density intervals (HDIs). Explicit testing is avoided.

Conducting a traditional analysis of variance is possible with an
abundance of software, for example via the
*stats* package
(R Core Team 2020) which is part of the R programming language
(R Core Team 2020).

The *BayesFactor*
package by (Morey and Rouder 2018) provides the Bayesian ANOVA Bayes
factor of (Rouder et al. 2012), and various helper functions for analysis of
the results.

A simple illustration of the main workflow in the *BayesFactor* package
is given here, using the `ToothGrowth` dataset in the
*datasets* package (Cannon et al. 2019). The `ToothGrowth` dataset
contains three columns: `len`, the dependent variable, which is the
length of a guinea pig’s tooth after treatment with vitamin C; `supp`,
a predictor corresponding to the supplement type (either orange juice
or ascorbic acid); and `dose`, a predictor giving the amount of
vitamin C administered.

The *BayesFactor* package’s core function allows the comparison of
models \(\mathcal{M}_0,...,\mathcal{M}_n\) with factors as predictors. The
null model without any predictors is most often compared to models
including predictors or even interaction terms, using the Bayes factor as
detailed above. The function `anovaBF` computes several model estimates
at once, so that the model with the largest Bayes factor can be
selected. The data are first loaded and the categorical predictors
converted to factors:

```
R> set.seed(42)
R> library(datasets)
R> data(ToothGrowth)
R> head(ToothGrowth, n = 3)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
R> ToothGrowth$dose = factor(ToothGrowth$dose)
R> levels(ToothGrowth$dose) = c('Low', 'Medium', 'High')
```

Then, a Bayesian ANOVA is conducted using both predictors `dose` and
`supp` as well as the interaction term `dose * supp`:

```
R> library(BayesFactor)
R> bf = anovaBF(len ~ supp * dose, data = ToothGrowth)
R> bf
Bayes factor analysis
--------------
[1] supp                    : 1.198757     +- 0.01%
[2] dose                    : 4.983636e+12 +- 0%
[3] supp + dose             : 2.963312e+14 +- 1.59%
[4] supp + dose + supp:dose : 8.067205e+14 +- 1.94%

Against denominator:
  Intercept only
---
Bayes factor type: BFlinearModel, JZS
```

The results are shown in form of the Jeffreys-Zellner-Siow (JZS) Bayes
factor \(BF_{10}\) detailed previously. As the \(BF_{10}\) for the model
including both predictors `supp` and `dose` together with their
interaction is largest, the Bayesian ANOVA favours this model over the
null model which includes only the intercept. Thus, as there are the
low, medium and high dose groups and the two supplement groups, in
total one obtains \(3\times 2=6\) different groups. The results show
that there is strong evidence that the model attesting these six
differing groups is favourable over the null model (and every other
model as given in output lines `[1]`, `[2]` and `[3]`).

Note that this solution is also implemented in the open-source software JASP; for an introduction see (Bergh et al. 2019).

The Bayesian ANOVA model of (Kruschke 2015) has not been implemented in
a software package so far. Instead, users have to write their own model
scripts for JAGS (Plummer 2003) to run the analysis. Still, recently the
package *BANOVA* was published by (Dong and Wedel 2019), which uses
JAGS (Plummer 2003) and the Hamiltonian Monte Carlo (HMC) sampler Stan
(Carpenter et al. 2017) via the package *RStan* (Stan Development Team 2020)
to provide similar inferences without the need to code the JAGS or Stan
models oneself.

Note that in the above example, a traditional ANOVA can easily be fit via

```
R> summary(aov(len ~ supp * dose, data = ToothGrowth))
            Df Sum Sq Mean Sq F value   Pr(>F)
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *
Residuals   54  712.1    13.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

which yields similar results, favouring the full model with both predictors and interaction term, as both predictors and the interaction term are significant.

The method used in the *bayesanova* package is based on estimation of
parameters in a Gaussian mixture distribution. A Gibbs sampling
algorithm is applied to this mixture to produce posterior
distributions, given the data, of all unknown parameters in the
Gaussian components, that is, for \(\mu_j,\sigma_j, j=1,...,k\), and for
the differences in means \(\mu_l-\mu_r,l\neq r\) and the effect sizes
\(\delta_{lr},l\neq r\), where \(k\) is the number of groups in the study
or experiment. This way, a relatively complete picture of the situation
at hand can be drawn. While the technical aspects are omitted here, the
validity of the procedure stems from standard MCMC theory, see for example
(Robert and Casella 2004). The principal idea of mixture models is expressed by
(Frühwirth-Schnatter 2006):

Consider a population made up of \(K\) subgroups, mixed at random in proportion to the relative group sizes \(\eta_1,...,\eta_K\). Assume interest lies in some random feature \(Y\) which is heterogeneous across and homogeneous within the subgroups. Due to heterogeneity, \(Y\) has a different probability distribution in each group, usually assumed to arise from the same parametric family \(p(y|\theta)\), however, with the parameter \(\theta\) differing across the groups. The groups may be labeled through a discrete indicator variable \(S\) taking values in the set \(\{1,...,K\}\). When sampling randomly from such a population, we may record not only \(Y\), but also the group indicator \(S\). The probability of sampling from the group labeled \(S\) is equal to \(\eta_S\), whereas conditional on knowing \(S\), \(Y\) is a random variable following the distribution \(p(y|\theta_S)\) with \(\theta_S\) being the parameter in group \(S\). (...) The marginal density \(p(y)\) is obviously given by the following mixture density \[\begin{aligned} p(y)=\sum_{S=1}^K p(y,S)=\eta_1 p(y|\theta_1)+...+\eta_K p(y|\theta_K) \end{aligned}\]

Clearly, this resembles the situation of the analysis of variance, in which the allocations \(S\) are known. Traditionally, mixtures are treated with missing allocations, but in the setting of the ANOVA these are known, leading to a much simpler scenario. This interpretation also makes sense from a semantic point of view: the inherent assumption of a researcher is that the population is indeed made up of \(k\) subgroups in the case of a k-group ANOVA, which differ in a random feature \(Y\) which is heterogeneous across groups and homogeneous within each group. When conducting for example a randomized clinical trial (RCT), the group indicator \(S\) is of course recorded. The clinician will choose the patients according to a sampling plan, which could be designed to achieve equally sized groups, that is, \(\eta_1=\eta_2=...=\eta_k\) for \(k\) study groups. Thus, when sampling the population with the target of equally sized groups, the researcher will sample the objects with equal probability from the population. Consider a treatment one, treatment two and a control group. In this typical setting, the researcher could flip a coin for each patient in the RCT to assign him or her to one of the two treatment groups or to the control group, so that with probability \(\eta_1=\eta_2=\eta_3=1/3\) for any group, the patient is assigned to it. Repeating this process then leads to the mixture model given above. After the RCT is conducted, the resulting histogram of observed \(Y\) values will finally take the form of the mixture density \(p(y)\) above. If there is an effect in the treatment, this density \(p(y)\) will exhibit three modes which in turn result from the underlying mixture model of the data-generating process.
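This data-generating process can be simulated directly. The following Python sketch (with arbitrary illustrative group means and standard deviations, not data from any study) mimics the random assignment of patients to three equally weighted groups:

```python
import numpy as np

rng = np.random.default_rng(2020)
# Mixture data-generating process for a three-arm RCT with equal weights:
# each patient is assigned to treatment 1, treatment 2 or control with
# probability eta = 1/3 each, then responds according to the group-specific
# normal distribution. Parameter values are purely illustrative.
means = np.array([0.0, 2.5, 5.0])
sds = np.array([0.7, 0.7, 0.7])
n = 30000

S = rng.integers(0, 3, size=n)        # recorded group indicator
y = rng.normal(means[S], sds[S])      # observed response Y

# The realized group sizes are close to, but not exactly, n/3 each,
# as expected under the mixture with equal weights.
counts = np.bincount(S, minlength=3)
print(counts, round(float(y.mean()), 2))
```

With well-separated group means as chosen here, a histogram of `y` shows the three modes described above.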

If unbalanced groups are the goal, weights can be adjusted accordingly,
for example \(\eta_1=0.3\), \(\eta_2=0.2\) and \(\eta_3=0.5\). After fixing
the mixture weights \(\eta_1,\eta_2,\eta_3\), the family of distributions
for the mixture components needs to be selected. The above
considerations lead to finite mixtures of normal distributions which
*‘occur frequently in many areas of applied statistics such as [...]
medicine’* (Frühwirth-Schnatter 2006, 169). The components
\(p(y|\theta_i)\) therefore become \(f_N(y;\mu_j,\sigma_j^2)\) for
\(j=1,...,k\) in this case, where \(f_N(y;\mu_j,\sigma_j^2)\) is the density
of the univariate normal distribution. Parameter estimation in finite
mixtures of normal distributions consists of estimation of the component
parameters \((\mu_j,\sigma_j^2)\), the allocations \(S_i,i=1,...,n\) and the
weight distribution \((\eta_1,...,\eta_k)\) based on the available
complete data \((y_i,S_i),i=1,...,n\). In the case of the Bayesian ANOVA,
the allocations \(S_i\) (where \(S_i=1\) if \(y_i\) belongs to the first
component, \(S_i=2\) if \(y_i\) belongs to the second component, until
\(S_i=k\) if \(y_i\) belongs to the \(k\)-th group) are known for all
observations \(y_i\), \(i=1,...,n\). Therefore, inference reduces to
inference for the density parameters \((\mu_j,\sigma_j^2)\) of the normal
components of the mixture for the \(j=1,...,k\) groups.

The Bayesian ANOVA model based on Gaussian mixtures is summarized in Figure 1 using the three-group case as an example:

The measured variables \(y_i\) follow a three-component Gaussian mixture with known allocations. The first group is normally distributed as \(\mathcal{N}(\mu_1,\sigma_1)\), the second group as \(\mathcal{N}(\mu_2,\sigma_2)\) and the third group as \(\mathcal{N}(\mu_3,\sigma_3)\). The means \(\mu_1, \mu_2\) and \(\mu_3\) are each distributed as \(\mu_j \sim \mathcal{N}(b_0,B_0), j=1,2,3\) with noninformative hyperparameters \(b_0\) and \(B_0\), and the standard deviations \(\sigma_1,\sigma_2\) and \(\sigma_3\) are distributed as \(\sigma_j \sim \mathcal{G}^{-1}(c_0,C_0), j=1,2,3\) with noninformative hyperparameters \(c_0\) and \(C_0\). For details, see (Kelter 2020c, 2021a). As the allocations are known, the weights \(\eta_1,\eta_2\) and \(\eta_3\) are known too, and need not be estimated, which is why the parameters \(\eta_1,\eta_2,\eta_3\) are not included in the diagram. The model visualized in Figure 1 can be generalized for an arbitrary number of mixture components, which then includes nearly arbitrary ANOVA settings for comparison of multiple groups. A definitive advantage of this model is that inference is obtained for both means and standard deviations, yielding richer information compared to the testing perspectives which are stressed in traditional or Bayesian ANOVA models focusing on the Bayes factor. Also, posterior distributions of effect sizes can be obtained via MCMC, providing an additional layer of information to draw inferences.

Instead of relying on the Bayes factor, the *bayesanova* package follows
the approach of (Kruschke 2018) to use a region of practical
equivalence (ROPE). The effect size \(\delta\) is routinely categorized as
small, medium or large in medical research when
\(\delta \in [0.2,0.5),\delta \in [0.5,0.8)\) or
\(\delta \in [0.8,\infty)\), see (Cohen 1988). The approach
using the ROPE proceeds by taking these categories as regions of
practical equivalence, that is both \(\delta=0.25\) and \(\delta=0.26\) are
identified as a small effect because both are inside the region of
practical equivalence \([0.2,0.5)\) of a small effect \(\delta\). The
underlying idea is that measuring effect sizes only makes sense up to a
specific precision, which is given by the above categorization of effect
sizes. By studying how much probability mass of the posterior
distribution of \(\delta\) lies inside some of the above ROPEs
\([0.2,0.5)\), \([0.5,0.8)\) and \([0.8,\infty)\) of a small, medium and large
positive effect for \(\delta\) (negative effects analogously), a continuous
statement about the most probable effect size \(\delta\) given the data
can be made. Kruschke originally advocated to use the location of the
95% highest-posterior-density (HPD) interval in relation to the ROPE to
test whether the null value in the middle of the ROPE should be accepted
or rejected for practical purposes. Here, this approach is generalized
by switching to the amount of posterior probability mass inside the
ROPE. Detailed examples are provided later in this paper.
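Computing the posterior probability mass inside each ROPE from MCMC draws is straightforward. The following Python sketch uses synthetic normal draws as a stand-in for a posterior of \(\delta\) (it does not reproduce the package's output):

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic posterior draws for the effect size delta, standing in for MCMC output.
delta_draws = rng.normal(0.45, 0.10, size=20000)

# Posterior probability mass inside each ROPE for a positive effect,
# using Cohen's categories for small, medium and large effects.
ropes = {"small": (0.2, 0.5), "medium": (0.5, 0.8), "large": (0.8, np.inf)}
masses = {label: float(np.mean((delta_draws >= lo) & (delta_draws < hi)))
          for label, (lo, hi) in ropes.items()}
for label, mass in masses.items():
    print(label, round(mass, 3))
```

Here most of the posterior mass lies in the ROPE of a small effect, so the continuous statement would be that a small positive effect is most probable given the data.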

Table 1 provides an overview of the four ANOVA models and their purpose.

Model | Purpose | Evidence measure | Computational aspects |
---|---|---|---|
Frequentist ANOVA | Testing the global hypothesis that all samples are drawn from populations with identical means against the alternative | F-statistic and p-value | Analytic solution |
Bayesian ANOVA of (Rouder et al. 2012) | Test the global hypothesis that the effect size vector is zero versus the alternative | Bayes factor | Numerical integration required |
Bayesian ANOVA of (Kruschke 2015) | Estimation of effect sizes between groups via ROPE and 95% HPD | ROPE | Gibbs sampling in the MCMC sampler JAGS (or Stan) required |
Bayesian ANOVA based on Gaussian mixtures | Estimation of effect sizes between groups via the ROPE and posterior probability mass | ROPE | Gibbs sampling without requiring JAGS (or Stan) |

Although it appears that the model of (Kruschke 2015) and the Gaussian mixture modeling approach proposed in this paper have the same purpose, they differ in how data \(y_i\) are assumed to be generated. In the mixture approach we assume that the sample of \(n_j\) participants in group \(j\) results from a mixture process, e.g. by flipping a coin, rolling a die or using any other randomization device (as is the case in clinical trials when assigning patients to groups according to a double-blinded protocol). Thus, the process of data generation is not “one has collected \(n_j\) participants for group j” but “the given sample of \(n_j\) participants in group j is assumed to be a realization of a mixture process where with probability \(\eta_j\) participants are assigned to group \(j\)”. Importantly, note that the realization of \(n_j\) participants in group \(j\) for \(j=1,...,k\) is expected under the mixture component weight \(\eta_j=n_j/n\), but also entirely different group sizes \(n_j\) can result under such a mixture. In fact, the weights \(\eta_j=n_j/n\), which are assumed to be known, are the corresponding maximum likelihood estimates of the weight parameters \(\eta_j\) given the sample sizes \(n_j\) for \(j=1,...,k\), but the conceptual focus of the mixture approach is to closely mimic the randomization process researchers follow when conducting a randomized controlled trial. Note further that the model of Kruschke assumes homogeneity of variances in contrast to the Gaussian mixture model, but Kruschke’s model can easily be extended to account for heterogeneity of variance, rendering this difference less important. Note that both the frequentist ANOVA and the Bayesian version of (Rouder et al. 2012) assume homogeneity of variance across groups.

The *bayesanova* package has four functions. These provide (1) the MCMC
algorithm for conducting the Bayesian ANOVA in the Gaussian mixture
model with known allocations, detailed above, (2) checks of the model
assumptions and (3) visualizations of the posterior results for easy
interpretation and communication of research results. Visualizations of
the posterior mixture components in comparison with the original data
are provided by the fourth function. An overview is provided in Table
2.

Function | Description |
---|---|
`bayes.anova` | Main function of the package, conducts the MCMC algorithm to provide full posterior inference in the three-component Gaussian mixture model |
`assumption.check` | Helper function for checking the assumption of normality in each group previous to running a Bayesian ANOVA |
`anovaplot` | Provides multiple visualizations of the results, including posterior distributions, differences in means, standard deviations and effect sizes as well as a full ROPE-analysis |
`post.pred.check` | Provides a posterior predictive check for a fitted Bayesian ANOVA model |

The core function is `bayes.anova`, which provides the MCMC algorithm to
obtain full posterior inference in the \(k\)-component Gaussian mixture
model shown in Figure 1 for the special case of \(k=3\) components. The
function implements a Gibbs sampling algorithm, which iteratively updates

- the means \(\mu_j|\mu_{-j}, \sigma_1, ..., \sigma_k, S, y\) given the other means \(\mu_{-j}\) and standard deviations \(\sigma_1, ..., \sigma_k\) as well as the full data \(S,y\), where \(S\) is the indicator vector for the groups the observations \(y\) belong to;

- the standard deviations \(\sigma_j|\sigma_{-j}, \mu_1, ..., \mu_k, S, y\) given the other standard deviations \(\sigma_{-j}\) and means \(\mu_1, ..., \mu_k\) as well as the full data \(S,y\).

The details of the Gibbs sampler can be found in (Kelter 2020c, 2021a), and the validity of the method follows from standard MCMC theory, see for example (Robert and Casella 2004).
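For intuition, a minimal Python sketch of such a Gibbs sampler is given below. It uses the standard conjugate normal and inverse-gamma full conditionals for a \(k\)-group Gaussian mixture with known allocations; the hyperparameter values are illustrative and are not *bayesanova*'s defaults (for those, see the references above):

```python
import numpy as np

rng = np.random.default_rng(42)

def gibbs_anova(groups, n_iter=4000, burnin=2000,
                b0=0.0, B0=100.0, c0=2.0, C0=1.0):
    """Gibbs sampler sketch for a k-group Gaussian mixture with known
    allocations, with priors mu_j ~ N(b0, B0) and sigma_j^2 ~ Inv-Gamma(c0, C0).
    Hyperparameter values here are illustrative, not the package's defaults."""
    k = len(groups)
    mu = np.array([g.mean() for g in groups])    # initialize at sample means
    var = np.array([g.var() for g in groups])    # initialize at sample variances
    draws_mu = np.empty((n_iter - burnin, k))
    draws_sd = np.empty((n_iter - burnin, k))
    for t in range(n_iter):
        for j, y in enumerate(groups):
            nj = len(y)
            # Full conditional of mu_j given sigma_j^2 and the data: normal.
            V = 1.0 / (1.0 / B0 + nj / var[j])
            m = V * (b0 / B0 + nj * y.mean() / var[j])
            mu[j] = rng.normal(m, np.sqrt(V))
            # Full conditional of sigma_j^2 given mu_j and the data: inverse gamma.
            shape = c0 + nj / 2.0
            scale = C0 + 0.5 * ((y - mu[j]) ** 2).sum()
            var[j] = 1.0 / rng.gamma(shape, 1.0 / scale)
        if t >= burnin:
            draws_mu[t - burnin] = mu
            draws_sd[t - burnin] = np.sqrt(var)
    return draws_mu, draws_sd

# Simulated three-group data; posterior means should sit near the true means.
groups = [rng.normal(m, 1.0, size=50) for m in (0.0, 1.0, 2.0)]
draws_mu, draws_sd = gibbs_anova(groups)
print(np.round(draws_mu.mean(axis=0), 2))
```

Draw-by-draw transformations of the returned chains, e.g. differences of means or standardized differences, then yield posterior samples of the derived quantities reported by the package.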

The `bayes.anova` function takes as input three numerical vectors
`first`, `second` and `third`, which correspond to the observed
responses in each of the three groups, and provides multiple optional
parameters:

```
bayes.anova(n = 10000, first, second, third,
            fourth = NULL, fifth = NULL, sixth = NULL,
            hyperpars = "custom", burnin = n/2, sd = "sd", q = 0.1, ci = 0.95)
```

The three vectors `first`, `second` and `third` are the only mandatory
input values, and currently six groups are the limit *bayesanova*
supports. More than three groups can be handed to the function by
providing numerical vectors for the parameters `fourth`, `fifth` and
`sixth`.

If no other parameters are provided, the function chooses a default of
`n=10000` Gibbs sampling iterations, where the burn-in of the Markov
chains is set to `burnin=n/2`, so that the first 5000 iterations are
discarded. The default setting uses inference for means \(\mu_j\) and
standard deviations \(\sigma_j\), which is indicated by the parameter
`sd="sd"`, but inference for variances \(\sigma_j^2\) instead of standard
deviations \(\sigma_j\) can easily be obtained by setting `sd="var"`. The
credible level for all computed credible intervals defaults to \(0.95\),
indicated by `ci=0.95`. The two remaining parameters `hyperpars` and `q`
define preselected values for the hyperparameters in the prior, to
ensure weakly informative priors are used which influence the analysis
as little as possible. For details, see
(Kelter 2020c, 2021a); in general these values apply to a broad range of
contexts, so changing them is not recommended. Note that another set of
hyperparameters based on (Raftery 1996) can be selected via
`hyperpars="rafterys"`, if desired.
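As an illustration of these options, the following sketch (using simulated data vectors; the *bayesanova* package must be installed for the call to run) requests a longer chain, inference for variances and 99% credible intervals, while leaving the recommended hyperparameter defaults untouched:

```r
library(bayesanova)
set.seed(1)
g1 <- rnorm(30, 10, 2); g2 <- rnorm(30, 12, 2); g3 <- rnorm(30, 15, 2)

# 20000 iterations, first half discarded as burn-in, inference for the
# variances sigma_j^2 instead of sigma_j, 99% credible intervals
res <- bayes.anova(n = 20000, first = g1, second = g2, third = g3,
                   sd = "var", ci = 0.99)
```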

After execution, the function returns a dataframe including four Markov
chains of size `n-burnin` for each parameter, to make
subsequent convergence assessment or post-processing of the MCMC results
possible.
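Because the returned object is a plain dataframe of posterior draws, standard MCMC diagnostics can be applied to it directly, for example with the *coda* package. The column name `mu1` below is an assumption about the returned dataframe; inspect `colnames(res)` for the actual names:

```r
library(coda)

# res is the dataframe returned by bayes.anova(); we assume its draws for
# the first group mean are stored in a column named "mu1"
chain <- mcmc(res$mu1)
effectiveSize(chain)   # effective sample size of the draws
autocorr.diag(chain)   # autocorrelation at selected lags
```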

The second function is `assumption.check`. This function runs a
preliminary assumption check on the data, which is recommended before
running a Bayesian ANOVA. The model assumption is normality in each
mixture component, so the `assumption.check` function runs
Shapiro-Wilk tests to check for normality (Shapiro and Wilk 1965). The
input parameters are the three numerical vectors `x1`, `x2` and `x3`
containing the observed responses in the first, second and third group,
and the desired confidence level `conf.level` for the Shapiro-Wilk
tests:

`assumption.check(x1, x2, x3, x4 = NULL, x5 = NULL, x6 = NULL, conf.level=0.95)`

The default confidence level is `0.95`. More than three groups can
easily be added by providing values for `x4`, `x5` and `x6`.

The third function is `anovaplot`, which provides a variety of
visualizations of the results. The function takes as input a dataframe
`dataframe`, which should be the result of the `bayes.anova` function
detailed above, a parameter `type`, which indicates which visualization
is desired, a parameter `sd`, which indicates whether the provided
dataframe includes posterior draws of \(\sigma_j\) or \(\sigma_j^2\), and
last a parameter `ci`, which again defines the credible level used in
the computations.

`anovaplot(dataframe, type="rope", sd="sd", ci=0.95)`

The default value for `sd` is `"sd"`, and the default credible level is
`0.95`. The `type` parameter takes one of four possible values: (1)
`type="pars"`, (2) `type="diff"`, (3) `type="effect"` and (4)
`type="rope"`. In the first case, posterior distributions of all model
parameters are produced, complemented by convergence diagnostics in the
form of trace plots, autocorrelation plots and the Gelman-Brooks-Rubin
shrink factor (Gelman and Brooks 1998), which should be close to one to
indicate convergence to the posterior. In the second case, the posterior
distributions of the differences \(\mu_i-\mu_j, j\neq i\) of the group
means and the differences \(\sigma_l-\sigma_r, l\neq r\) of the group
standard deviations (or variances, if `sd="var"` and the dataframe
includes posterior draws of the \(\sigma_j^2\)’s instead of the
\(\sigma_j\)’s) are produced, complemented by the same convergence
diagnostics. In the third case, the posterior distributions of the
effect sizes \(\delta_{lr}, l\neq r\) are produced, which are most often
of interest in applied research; these posteriors are complemented by
the same convergence diagnostics, too. The fourth and last case produces
a full ROPE analysis, which provides the posteriors of the effect sizes
\(\delta_{lr}, l\neq r\), but additionally computes a partitioning of the
posterior probability mass into the standardized ROPEs of no, small,
medium and large effect sizes according to (Cohen 1988),
which are the reference standard in medical and psychological research.
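Conceptually, the ROPE partitioning amounts to counting which share of the posterior draws of an effect size falls into each of Cohen's intervals. The following base-R sketch with simulated draws illustrates the idea (it is not the package's internal code):

```r
set.seed(1)
delta <- rnorm(10000, mean = -0.4, sd = 0.15)  # simulated posterior draws

# Cohen's conventions: |d| < 0.2 no effect, 0.2-0.5 small,
# 0.5-0.8 medium, > 0.8 large
rope <- cut(abs(delta), breaks = c(0, 0.2, 0.5, 0.8, Inf),
            labels = c("none", "small", "medium", "large"))
prop.table(table(rope))  # posterior probability mass per standardized ROPE
```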

The last function, `post.pred.check`, provides a posterior predictive
check for a fitted Bayesian ANOVA model against the original data, which
is routine in a Bayesian workflow (Gabry et al. 2019).
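For the exact arguments of `post.pred.check`, the package documentation should be consulted; the idea behind any posterior predictive check can, however, be sketched in a few lines of base R: draw replicated datasets from the posterior and compare a summary statistic with the observed one. The draw vectors below are placeholders for one group's MCMC output, not the package's actual output:

```r
set.seed(1)
y <- rnorm(40, 10, 2)                  # observed data for one group
mu_draws    <- rnorm(200, 10, 0.3)     # placeholder posterior draws
sigma_draws <- abs(rnorm(200, 2, 0.2))

# One replicated dataset per posterior draw; compare the means
y_rep_means <- vapply(seq_along(mu_draws), function(i)
  mean(rnorm(length(y), mu_draws[i], sigma_draws[i])), numeric(1))
mean(y_rep_means >= mean(y))  # posterior predictive p-value for the mean
```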

This section provides illustrations and a variety of examples in which
the *bayesanova* package can be used and provides richer information
than existing solutions.

The guinea pig dataset from above is used as a first example. The data
are included in the dataset `ToothGrowth` in the *datasets* package,
which is part of R. First, the data are loaded and split into three
groups, corresponding to a low, medium and high administered vitamin C
dose:

```
R> library(datasets)
R> data(ToothGrowth)
R> head(ToothGrowth, n=3)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
R> library(dplyr)
R> library(tidyr)
R> library(bayesanova)
R> grp1 = (ToothGrowth %>% filter(dose==0.5) %>% select(len))$len
R> grp2 = (ToothGrowth %>% filter(dose==1.0) %>% select(len))$len
R> grp3 = (ToothGrowth %>% filter(dose==2.0) %>% select(len))$len
```

Next, we run the assumption checks on the data:

```
R> assumption.check(grp1, grp2, grp3, conf.level=0.95)
Model assumptions checked. No significant deviations from normality detected. Bayesian ANOVA can be run safely.
```

Figure 2 shows the histograms and
quantile-quantile plots for all three groups produced by
`assumption.check()`

. Clearly, there are no large deviations, and no
Shapiro-Wilk test was significant at the \(0.05\) level.

Next, the Bayesian ANOVA can be run via the `bayes.anova` function.
Here, the default parameter values are used, yielding `n-burnin=5000`
posterior draws:

```
R> set.seed(42)
R> res = bayes.anova(first = grp1, second = grp2, third = grp3)
|Parameter |LQ |Mean |UQ |Std.Err |
|:-------------|:-----|:-----|:-----|:-------|
|mu1 |8.69 |10.61 |12.5 |0.91 |
|mu2 |18.05 |19.75 |21.46 |0.84 |
|mu3 |24.94 |26.1 |27.25 |0.57 |
|sigma1 |3.02 |4.07 |5.67 |0.67 |
|sigma2 |2.95 |3.96 |5.43 |0.64 |
|sigma3 |2.43 |3.25 |4.42 |0.52 |
|mu2-mu1 |6.7 |9.15 |11.7 |1.25 |
|mu3-mu1 |13.42 |15.49 |17.67 |1.06 |
|mu3-mu2 |4.36 |6.34 |8.38 |1.01 |
|sigma2-sigma1 |-2.02 |-0.11 |1.68 |0.93 |
|sigma3-sigma1 |-2.62 |-0.81 |0.74 |0.85 |
|sigma3-sigma2 |-2.46 |-0.71 |0.85 |0.82 |
|delta12 |-5.77 |-4.59 |-3.21 |0.65 |
|delta13 |-9.37 |-8.14 |-6.63 |0.71 |
|delta23 |-4.36 |-3.36 |-2.19 |0.56 |
```

The results table shows the lower quantile (LQ) and upper quantile (UQ),
corresponding to the \((1-\)`ci`\()/2\) and \((1+\)`ci`\()/2\) quantiles, where
`ci` is the credible level chosen above (for `ci=0.95`, the 2.5% and
97.5% quantiles). Also, the posterior mean and
standard error are given for each parameter, difference of parameters
and effect size. The results clearly show that there are huge
differences between the groups: for example, one can immediately spot
that the more vitamin C is given, the more tooth growth can be observed
via tooth lengths. While the first group (low dose) has a posterior mean
of \(10.61\) with credible interval \([8.69,12.5]\), the second group
achieves a mean of \(19.75\) with credible interval \([18.05,21.46]\). The
third group has a posterior mean of even \(26.1\) with credible interval
\([24.94,27.25]\). The posterior estimates for the differences
\(\mu_2-\mu_1\), \(\mu_3-\mu_1\) and \(\mu_3-\mu_2\) show that all groups
differ from each other with a very high probability, given the data.

Note that the information provided is much more fine-grained than in the
solutions via the traditional ANOVA and the Jeffreys-Zellner-Siow based
Bayes-factor ANOVA above. While in these two solutions, one could only
infer that the model using both predictors and the interaction term is
the best, now we are given precise estimates of the effect sizes between
each group defined by the dose of vitamin C administered. Note also
that including the second predictor `supp` is no problem, leading to a
setting which incorporates six groups in the mixture.
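Such a six-group analysis could be sketched as follows (reusing the function signature shown above; the grouping code is an illustration, not part of the original analysis):

```r
library(dplyr)
library(bayesanova)
data(ToothGrowth, package = "datasets")

# Six groups from the 2x3 design of supplement type (supp) and dose
grps <- ToothGrowth %>% group_by(supp, dose) %>% group_split()
lens <- lapply(grps, function(g) g$len)

set.seed(42)
res6 <- bayes.anova(first = lens[[1]], second = lens[[2]], third = lens[[3]],
                    fourth = lens[[4]], fifth = lens[[5]], sixth = lens[[6]])
```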

The second example is from the biomedical sciences and uses the heart
rate data from (Moore et al. 2012). In the study, heart rates of female
and male runners and of generally sedentary participants (not regularly
running) were recorded following six minutes of exercise. For each
participant, `Gender` and `Heart.Rate` are given, as well as the group
he or she belongs to (`Group=="Runners"` or `Group=="Control"`). In the
study, 800 participants were recruited, so that 200 subjects
participated in each of the four groups given by the combinations of
`Gender` and `Group`.

Therefore, the situation requires a \(2\times 2\) between subjects ANOVA.
Specifically, interest lies in the hypothesis that heart rate differs
between gender and groups. The Bayesian ANOVA of *bayesanova* can easily
be applied in such an often encountered setting. We first load the data
and split them into the four groups:

```
R> library(dplyr)
R> hr = read.csv("heartrate.csv", sep=",")
R> head(hr)
  Gender   Group Heart.Rate
1 Female Runners        119
2 Female Runners         84
3 Female Runners         89
4 Female Runners        119
5 Female Runners        127
6 Female Runners        111
R> femaleRunners = (hr %>% filter(Gender=="Female")
+    %>% filter(Group=="Runners")
+    %>% select(Heart.Rate))$Heart.Rate
R> maleRunners = (hr %>% filter(Gender=="Male") %>% filter(Group=="Runners")
+    %>% select(Heart.Rate))$Heart.Rate
R> femaleControl = (hr %>% filter(Gender=="Female")
+    %>% filter(Group=="Control")
+    %>% select(Heart.Rate))$Heart.Rate
R> maleControl = (hr %>% filter(Gender=="Male") %>% filter(Group=="Control")
+    %>% select(Heart.Rate))$Heart.Rate
```

Then, we check the model assumptions:

`R> assumption.check(femaleRunners, maleRunners, femaleControl, maleControl)`

We can thus safely proceed running the Bayesian ANOVA:

```
R> set.seed(42)
R> resRunners = bayes.anova(first = femaleRunners, second = maleRunners,
+    third = femaleControl, fourth = maleControl)
|Parameter |LQ |Mean |UQ |Std.Err |
|:-------------|:------|:------|:------|:-------|
|mu1 |113.48 |116 |118.5 |1.27 |
|mu2 |102.51 |103.98 |105.55 |0.76 |
|mu3 |145.44 |148.04 |150.52 |1.3 |
|mu4 |127.12 |130.01 |132.82 |1.47 |
|sigma1 |14.38 |15.87 |17.51 |0.8 |
|sigma2 |11.21 |12.35 |13.67 |0.63 |
|sigma3 |14.71 |16.19 |17.85 |0.82 |
|sigma4 |15.46 |17.02 |18.79 |0.85 |
|mu2-mu1 |-14.9 |-12.01 |-9.06 |1.48 |
|mu3-mu1 |28.47 |32.04 |35.6 |1.83 |
|mu4-mu1 |10.19 |14.01 |17.9 |1.96 |
|mu3-mu2 |41.12 |44.05 |46.95 |1.51 |
|mu4-mu2 |22.83 |26.02 |29.21 |1.66 |
|mu4-mu3 |-21.8 |-18.03 |-14.4 |1.92 |
|sigma2-sigma1 |-5.6 |-3.52 |-1.57 |1.02 |
|sigma3-sigma1 |-1.94 |0.32 |2.53 |1.15 |
|sigma4-sigma1 |-1.14 |1.15 |3.51 |1.18 |
|sigma3-sigma2 |1.83 |3.84 |5.85 |1.03 |
|sigma4-sigma2 |2.7 |4.67 |6.8 |1.05 |
|sigma4-sigma3 |-1.48 |0.83 |3.13 |1.17 |
|delta12 |2.4 |3.2 |3.96 |0.4 |
|delta13 |-8.92 |-8.01 |-7.05 |0.48 |
|delta14 |-4.42 |-3.46 |-2.5 |0.49 |
|delta23 |-12.55 |-11.67 |-10.77 |0.45 |
|delta24 |-7.65 |-6.79 |-5.91 |0.45 |
|delta34 |3.52 |4.43 |5.37 |0.48 |
```

The results reveal multiple insights. To support the interpretation, we
first produce visualizations of the results via the `anovaplot()`
function:

`R> anovaplot(resRunners)`

Figure 3 shows the plots produced by the above call to `anovaplot()`.