Finite mixture models are increasingly used to model a wide variety of random phenomena for clustering, classification and density estimation. mclust is a powerful and popular package that models data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a range of analytic purposes. Recently, version 5 of the package was made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
mclust (Fraley et al. 2016) is a popular R package for model-based clustering, classification, and density estimation based on finite Gaussian mixture modelling. An integrated approach to finite mixture models is provided, with functions that combine model-based hierarchical clustering, EM for mixture estimation and several tools for model selection. Thus mclust provides a comprehensive strategy for clustering, density estimation and discriminant analysis. A variety of covariance structures obtained through eigenvalue decomposition are available. Functions for performing single E and M steps and for simulating data for each available model are also included. Additional ways of displaying and visualising fitted models along with clustering, classification, and density estimation results are also provided. It has been used in a broad range of contexts including geochemistry (Templ et al. 2008; Ellefsen et al. 2014), chemometrics (Fraley and Raftery 2006b, 2007b), DNA sequence analysis (Verbist et al. 2015), gene expression data (Yeung et al. 2001; Li et al. 2005; Fraley and Raftery 2006a), hydrology (Kim et al. 2014), wind energy (Kazor and Hering 2015), industrial engineering (Campbell et al. 1999), epidemiology (Flynt and Daepp 2015), food science (Kozak and Scaman 2008), clinical psychology (Suveg et al. 2014), political science (Ahlquist and Breunig 2012; Jang and Hitchcock 2012), and anthropology (Konigsberg et al. 2009).
One measure of the popularity of mclust is provided by the download logs of the RStudio (http://www.rstudio.com) CRAN mirror (available at http://cran-logs.rstudio.com). The cranlogs package (Csardi 2015) makes it easy to download such logs and graph the number of downloads over time. We used cranlogs to query the RStudio download database over the past three years. In addition to mclust, other R packages which handle Gaussian finite mixture modelling as part of their capabilities have been included in the comparison: Rmixmod (Lebret et al. 2015), mixture (Browne et al. 2015), EMCluster (Chen and Maitra 2015), mixtools (Benaglia et al. 2009), and bgmm (Biecek et al. 2012). We also included flexmix (Leisch 2004; Grün and Leisch 2007, 2008) which provides a general framework for finite mixtures of regression models using the EM algorithm, since it can be adapted to perform Gaussian model-based clustering using a limited set of models (only the diagonal and unconstrained covariance matrix models). Table 1 summarises the functionalities of the selected packages.
Package | Version | Clustering | Classification | Density estimation | Non-Gaussian components |
---|---|---|---|---|---|
mclust | 5.2 | ✓ | ✓ | ✓ | ✗ |
Rmixmod | 2.0.3 | ✓ | ✓ | ✗ | ✓ |
mixture | 1.4 | ✓ | ✓ | ✗ | ✗ |
EMCluster | 0.2-5 | ✓ | ✓ | ✗ | ✗ |
mixtools | 1.0.4 | ✓ | ✗ | ✓ | ✓ |
bgmm | 1.7 | ✓ | ✓ | ✗ | ✗ |
flexmix | 2.3-13 | ✓ | ✗ | ✗ | ✓ |
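As an indication of how such download logs can be retrieved, a query along the following lines can be used. This is a sketch based on the cranlogs function cran_downloads(); the date range is illustrative, not the exact one used in our analysis.

> library(cranlogs)
> ## daily downloads from the RStudio CRAN mirror for the packages in Table 1
> pkgs <- c("mclust", "Rmixmod", "mixture", "EMCluster",
            "mixtools", "bgmm", "flexmix")
> dl <- cran_downloads(packages = pkgs, from = "2013-06-01", to = "2016-06-30")
> head(dl)   # one row per package per day: date, count, package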
Figure 1 shows the trend in weekly downloads from the RStudio CRAN mirror for the selected packages. The popularity of mclust has been increasing steadily over time, with a first high peak around mid-April 2015, probably due to the release of R version 3.2 and, shortly after, the release of version 5 of mclust. Successive peaks occurred in conjunction with releases of package updates. Based on these logs, mclust is the most downloaded package dealing with Gaussian mixture models, followed by flexmix which, as mentioned, is a more general package for fitting mixture models but with limited clustering capabilities.
Another aspect that can be considered a proxy for the popularity of a package is the structure of mutual dependencies among R packages. This can be represented as a graph with packages at the vertices and dependencies (either "Depends", "Imports", "LinkingTo", "Suggests" or "Enhances") as directed edges, and analysed through the PageRank algorithm used by the Google search engine (Brin and Page 1998). For the packages considered previously, we used the page.rank function available in the igraph package (Csardi and Nepusz 2006) and obtained the ranking reported in Table 2, which approximately reproduces the results discussed above. Note that mclust is among the top 100 packages on CRAN by this ranking. Finally, its popularity is also indicated by the 55 other CRAN packages listed as reverse dependencies, either "Depends", "Imports" or "Suggests".
Package | mclust | Rmixmod | mixture | EMCluster | mixtools | bgmm | flexmix |
---|---|---|---|---|---|---|---|
Rank | 75 | 2300 | 2319 | 2143 | 1698 | 3736 | 270 |
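A rough sketch of how such a ranking can be computed is given below; it assumes a CRAN mirror is configured, and since CRAN metadata changes over time the exact figures will differ from those in Table 2.

> library(igraph)
> ## build the CRAN dependency graph: an edge points from a package to each of
> ## its dependencies, so widely used packages accumulate incoming links
> db <- available.packages()
> deps <- tools::package_dependencies(rownames(db), db = db,
          which = c("Depends", "Imports", "LinkingTo", "Suggests", "Enhances"))
> edges <- do.call(rbind, lapply(names(deps), function(pkg)
           if (length(deps[[pkg]]) > 0) cbind(from = pkg, to = deps[[pkg]])))
> g <- graph_from_edgelist(edges)
> ## rank all packages by PageRank score (rank 1 = highest score)
> pr <- page.rank(g)$vector
> rank(-pr)["mclust"]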
Earlier versions of the package have been described in Fraley and Raftery (1999), Fraley and Raftery (2003), and Fraley et al. (2012). In this paper we discuss some of the new functionalities available in mclust version \(\ge 5\). In particular we describe the newly available models, dimension reduction for visualisation, bootstrap-based inference, implementation of different model selection criteria and initialisation strategies for the EM algorithm.
The reader should first install the latest version of the package from CRAN with
> install.packages("mclust")
Then the package is loaded into an R session using the command
> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.2
Type 'citation("mclust")' for citing this R package in publications.
All the datasets used in the examples are available in mclust or in other R packages, such as gclus (Hurley 2012), rrcov (Todorov and Filzmoser 2009) and tourr (Wickham et al. 2011), and can be installed from CRAN using the above procedure, except where noted otherwise.
Let \(\mathbf{x}= \lbrace \mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_i,\ldots,\mathbf{x}_n \rbrace\) be a sample of \(n\) independent identically distributed observations. The distribution of every observation is specified by a probability density function through a finite mixture model of \(G\) components, which takes the following form \[f(\mathbf{x}_i; \mathbf{\Psi}) = \sum_{k=1}^G \pi_k f_k(\mathbf{x}_i; \mathbf{\theta}_k), \label{eq:finmixdens} \tag{1}\] where \(\mathbf{\Psi}= \{\pi_1, \ldots, \pi_{G-1}, \mathbf{\theta}_1, \ldots, \mathbf{\theta}_G\}\) are the parameters of the mixture model, \(f_k(\mathbf{x}_i; \mathbf{\theta}_k)\) is the \(k\)th component density for observation \(\mathbf{x}_i\) with parameter vector \(\mathbf{\theta}_k\), \((\pi_1,\ldots,\pi_{G-1})\) are the mixing weights or probabilities (such that \(\pi_k > 0\), \(\sum_{k=1}^G\pi_k = 1\)), and \(G\) is the number of mixture components.
Assuming that \(G\) is fixed, the mixture model parameters \(\mathbf{\Psi}\) are usually unknown and must be estimated. The log-likelihood function corresponding to equation (1) is given by \(\ell(\mathbf{\Psi}; \mathbf{x}_1, \ldots, \mathbf{x}_n) = \sum_{i=1}^{n} \log(f(\mathbf{x}_i; \mathbf{\Psi}))\). Direct maximisation of the log-likelihood function is complicated, so the maximum likelihood estimator (MLE) of a finite mixture model is usually obtained via the EM algorithm (Dempster et al. 1977; McLachlan and Peel 2000).
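For concreteness, the following minimal sketch implements EM for a univariate two-component Gaussian mixture in base R. It is illustrative code under simple assumptions, not the mclust implementation, and all object names are made up.

> ## simulate a two-component Gaussian mixture, then alternate E and M steps
> set.seed(1)
> x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1.5))
> p <- 0.5; mu <- c(-1, 1); s <- c(1, 1)   # starting values
> for (iter in 1:200)
  {
    ## E-step: posterior probability that each point arises from component 1
    d1 <- p * dnorm(x, mu[1], s[1])
    d2 <- (1 - p) * dnorm(x, mu[2], s[2])
    z <- d1 / (d1 + d2)
    ## M-step: weighted updates of mixing weight, means, standard deviations
    p  <- mean(z)
    mu <- c(weighted.mean(x, z), weighted.mean(x, 1 - z))
    s  <- sqrt(c(weighted.mean((x - mu[1])^2, z),
                 weighted.mean((x - mu[2])^2, 1 - z)))
  }
> round(c(pi = p, mu = mu, sigma = s), 3)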
In the model-based approach to clustering, each component of a finite
mixture density is usually associated with a group or cluster. Most
applications assume that all component densities arise from the same
parametric distribution family, although this need not be the case in
general. A popular model is the Gaussian mixture model (GMM), which
assumes a (multivariate) Gaussian distribution for each component,
i.e. \(f_k(\mathbf{x}; \mathbf{\theta}_k) \sim N(\mathbf{\mu}_k, \mathbf{\Sigma}_k)\).
Thus, clusters are ellipsoidal, centered at the mean vector
\(\mathbf{\mu}_k\), and with other geometric features, such as volume,
shape and orientation, determined by the covariance matrix
\(\mathbf{\Sigma}_k\). Parsimonious parameterisations of the covariances
matrices can be obtained by means of an eigen-decomposition of the form
\(\mathbf{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}{}^{\!\top}_k\),
where \(\lambda_k\) is a scalar controlling the volume of the ellipsoid,
\(\mathbf{A}_k\) is a diagonal matrix specifying the shape of the density
contours with \(\det(\mathbf{A}_k)=1\), and \(\mathbf{D}_k\) is an
orthogonal matrix which determines the orientation of the corresponding
ellipsoid (Banfield and Raftery 1993; Celeux and Govaert 1995). In one
dimension, there are just two models: E
for equal variance and V
for
varying variance. In the multivariate setting, the volume, shape, and
orientation of the covariances can be constrained to be equal or
variable across groups. Thus, 14 possible models with different
geometric characteristics can be specified. Table 3
reports all such models with the corresponding distribution structure
type, volume, shape, orientation, and associated model names. In
Figure 2 the geometric characteristics are shown
graphically.
Model | \(\mathbf{\Sigma}_k\) | Distribution | Volume | Shape | Orientation |
---|---|---|---|---|---|
EII | \(\lambda \mathbf{I}\) | Spherical | Equal | Equal | — |
VII | \(\lambda_k \mathbf{I}\) | Spherical | Variable | Equal | — |
EEI | \(\lambda \mathbf{A}\) | Diagonal | Equal | Equal | Coordinate axes |
VEI | \(\lambda_k \mathbf{A}\) | Diagonal | Variable | Equal | Coordinate axes |
EVI | \(\lambda \mathbf{A}_k\) | Diagonal | Equal | Variable | Coordinate axes |
VVI | \(\lambda_k \mathbf{A}_k\) | Diagonal | Variable | Variable | Coordinate axes |
EEE | \(\lambda \mathbf{D}\mathbf{A}\mathbf{D}{}^{\!\top}\) | Ellipsoidal | Equal | Equal | Equal |
EVE | \(\lambda \mathbf{D}\mathbf{A}_k \mathbf{D}{}^{\!\top}\) | Ellipsoidal | Equal | Variable | Equal |
VEE | \(\lambda_k \mathbf{D}\mathbf{A}\mathbf{D}{}^{\!\top}\) | Ellipsoidal | Variable | Equal | Equal |
VVE | \(\lambda_k \mathbf{D}\mathbf{A}_k \mathbf{D}{}^{\!\top}\) | Ellipsoidal | Variable | Variable | Equal |
EEV | \(\lambda \mathbf{D}_k \mathbf{A}\mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Equal | Equal | Variable |
VEV | \(\lambda_k \mathbf{D}_k \mathbf{A}\mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Variable | Equal | Variable |
EVV | \(\lambda \mathbf{D}_k \mathbf{A}_k \mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Equal | Variable | Variable |
VVV | \(\lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Variable | Variable | Variable |
Starting with version \(5.0\) of mclust, four additional models have been included: EVV, VEE, EVE, VVE. Models EVV and VEE are estimated using the methods described in Celeux and Govaert (1995), and the estimation of models EVE and VVE is carried out using the approach discussed by Browne and McNicholas (2014). In the models VEE, EVE and VVE it is assumed that the mixture components share the same orientation matrix. This assumption allows for a parsimonious characterisation of the clusters, while still retaining flexibility in defining volume and shape.
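To make the eigen-decomposition concrete, the following sketch takes an arbitrary \(2 \times 2\) covariance matrix, recovers the volume, shape, and orientation components, and verifies that they reproduce \(\mathbf{\Sigma}\):

> Sigma <- matrix(c(4, 2, 2, 3), nrow = 2)   # an arbitrary covariance matrix
> e <- eigen(Sigma)
> D <- e$vectors                     # orientation: orthogonal matrix
> lambda <- prod(e$values)^(1/2)     # volume: det(Sigma)^(1/d), here d = 2
> A <- diag(e$values / lambda)       # shape: diagonal matrix with det(A) = 1
> all.equal(Sigma, lambda * D %*% A %*% t(D))
[1] TRUE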
To illustrate the new modelling capabilities of mclust for model-based
clustering consider the wine
dataset contained in the gclus
R package. This dataset provides 13 measurements obtained from a
chemical analysis of 178 wines grown in the same region in Italy but
derived from three different cultivars (Barolo, Grignolino, Barbera).
> data(wine, package = "gclus")
> Class <- factor(wine$Class, levels = 1:3,
                  labels = c("Barolo", "Grignolino", "Barbera"))
> X <- data.matrix(wine[,-1])
> mod <- Mclust(X)
> summary(mod$BIC)
Best BIC values:
             EVE,3       VVE,3       VVE,6
BIC      -6873.257 -6896.83693 -6906.37460
BIC diff     0.000   -23.57947   -33.11714

> plot(mod, what = "BIC", ylim = range(mod$BIC[,-(1:2)], na.rm = TRUE),
       legendArgs = list(x = "bottomleft"))
In the above Mclust() function call, only the data matrix is provided, and the number of mixture components and the covariance parameterisation are selected using the Bayesian Information Criterion (BIC). A summary showing the top three models and a plot of the BIC traces (see Figure 3) for all the models considered is then obtained. In the last plot we adjusted the range of the y-axis so as to remove those models with lower BIC values. There is a clear indication of a three-component mixture with covariances having different shapes but the same volume and orientation (EVE). Note that all the top three models are among the models added to the latest major release of mclust.
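Note that the same BIC table can be obtained directly, without retaining a single fitted model, using the function mclustBIC(), which Mclust() relies on internally:

> BIC <- mclustBIC(X)
> summary(BIC)    # same top-three ranking as summary(mod$BIC) above
> plot(BIC)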
Figure 3: BIC traces for the wine data.

A summary of the selected model is obtained as:
> summary(mod)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:

 log.likelihood   n  df       BIC       ICL
       -3032.45 178 156 -6873.257 -6873.549

Clustering table:
 1  2  3
63 51 64
The fitted model provides an accurate recovery of the true classes:
> table(Class, mod$classification)

Class         1  2  3
  Barolo     59  0  0
  Grignolino  4  3 64
  Barbera     0 48  0

> adjustedRandIndex(Class, mod$classification)
[1] 0.8803998
The latter index is the adjusted Rand index (ARI; Hubert and Arabie 1985), which can be used for evaluating a clustering solution. The ARI is a measure of agreement between two partitions, one estimated by a statistical procedure independent of the labelling of the groups, and one being the true classification. It has zero expected value in the case of a random partition, and it is bounded above by 1, with higher values representing better partition accuracy.
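As a quick illustration of the zero-expectation property, the ARI obtained above can be compared with that of a randomly generated partition, which should be close to zero (output omitted):

> set.seed(123)
> adjustedRandIndex(Class, sample(1:3, length(Class), replace = TRUE))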
To visualise the clustering structure and the geometric characteristics induced by an estimated Gaussian finite mixture model we may project the data onto a suitable dimension reduction subspace. The function MclustDR() implements the methodology introduced in Scrucca (2010). The estimated directions which span the reduced subspace are defined as a set of linear combinations of the original features, ordered by importance as quantified by the associated eigenvalues. By default, information on the dimension reduction subspace is provided by the variation in cluster means and, depending on the estimated mixture model, by the variation in cluster covariances. This methodology has been extended to supervised classification by Scrucca (2014). Furthermore, a tuning parameter has been included which enables the recovery of most of the separating directions, i.e. those that show maximal separation among groups. Other dimension reduction techniques for finding the directions of optimum separation have been discussed in detail by Hennig (2004) and implemented in the package fpc (Hennig 2015).
Applying MclustDR
to the wine data example, such directions are
obtained as follows:
> drmod <- MclustDR(mod, lambda = 1)
> summary(drmod)
-----------------------------------------------------------------
Dimension reduction for model-based clustering and classification
-----------------------------------------------------------------

Mixture model type: Mclust (EVE, 3)

Clusters  n
       1 63
       2 51
       3 64

Estimated basis vectors:
                       Dir1       Dir2
Alcohol          0.11701058  0.2637302
Malic           -0.02814821  0.0489447
Ash             -0.18258917  0.5390056
Alcalinity      -0.02969793 -0.0309028
Magnesium        0.00575692  0.0122642
Phenols         -0.18497201 -0.0016806
Flavanoids       0.45479873 -0.2948947
Nonflavanoid     0.59278569 -0.5777586
Proanthocyanins  0.05347167  0.0508966
Intensity       -0.08328239  0.0332611
Hue              0.42950365 -0.4588969
OD280            0.40563746 -0.0369229
Proline          0.00075867  0.0010457

               Dir1    Dir2
Eigenvalues  1.5794   1.332
Cum. %      54.2499 100.000
By setting the optional tuning parameter lambda = 1, instead of the default value 0.5, only the information on cluster means is used for estimating the directions. In this case, the dimension of the subspace is \(d = \min(p, G-1)\), where \(p\) is the number of variables and \(G\) the number of mixture components or clusters. In the data example, there are \(p=13\) features and \(G=3\) clusters, so the dimension of the reduced subspace is \(d=2\). As a result, the projected data show the maximal separation among clusters, as shown in Figure 4a, which is obtained with
> plot(drmod, what = "contour")
On the same subspace we can also plot the uncertainty boundaries corresponding to the MAP classification:
> plot(drmod, what = "boundaries", ngrid = 200)
and then add a circle around the misclassified observations:
> miscl <- classError(Class, mod$classification)$misclassified
> points(drmod$dir[miscl,], pch = 1, cex = 2)