Finite mixture models are increasingly used to model a wide variety of random phenomena for clustering, classification and density estimation. *mclust* is a powerful and popular package which allows modelling of data as a Gaussian finite mixture with different covariance structures and different numbers of mixture components, for a variety of analysis purposes. Recently, version 5 of the package has been made available on CRAN. This updated version adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.

*mclust* (Fraley et al. 2016) is a
popular R package for model-based clustering, classification, and
density estimation based on finite Gaussian mixture modelling. An
integrated approach to finite mixture models is provided, with functions
that combine model-based hierarchical clustering, EM for mixture
estimation and several tools for model selection. Thus *mclust* provides
a comprehensive strategy for clustering, density estimation and
discriminant analysis. A variety of covariance structures obtained
through eigenvalue decomposition are available. Functions for performing
single E and M steps and for simulating data for each available model
are also included. Additional ways of displaying and visualising fitted
models along with clustering, classification, and density estimation
results are also provided. It has been used in a broad range of contexts
including geochemistry
(Templ et al. 2008; Ellefsen et al. 2014),
chemometrics (Fraley and Raftery 2006b, 2007b), DNA
sequence analysis (Verbist et al. 2015), gene expression data
(Yeung et al. 2001; Li et al. 2005; Fraley and Raftery 2006a), hydrology
(Kim et al. 2014), wind energy (Kazor and Hering 2015), industrial
engineering (Campbell et al. 1999), epidemiology (Flynt and Daepp 2015),
food science (Kozak and Scaman 2008), clinical psychology
(Suveg et al. 2014), political science
(Ahlquist and Breunig 2012; Jang and Hitchcock 2012), and anthropology
(Konigsberg et al. 2009).

One measure of the popularity of *mclust* is provided by the download
logs of the RStudio (http://www.rstudio.com) CRAN mirror (available at
http://cran-logs.rstudio.com). The
*cranlogs* package
(Csardi 2015) makes it easy to download such logs and graph the number of
downloads over time. We used *cranlogs* to query the RStudio download
database over the past three years. In addition to *mclust*, other
R packages which handle Gaussian finite mixture modelling as part of
their capabilities have been included in the comparison:
*Rmixmod* (Lebret et al. 2015),
*mixture* (Browne et al. 2015),
*EMCluster*
(Chen and Maitra 2015), *mixtools*
(Benaglia et al. 2009), and *bgmm*
(Biecek et al. 2012). We also included
*flexmix*
(Leisch 2004; Grün and Leisch 2007, 2008) which provides a general framework for
finite mixtures of regression models using the EM algorithm, since it
can be adapted to perform Gaussian model-based clustering using a
limited set of models (only the diagonal and unconstrained covariance
matrix models). Table 1 summarises the
functionalities of the selected packages.

Package | Version | Clustering | Classification | Density estimation | Non-Gaussian components |
---|---|---|---|---|---|
mclust | 5.2 | \(✓\) | \(✓\) | \(✓\) | \(✗\) |
Rmixmod | 2.0.3 | \(✓\) | \(✓\) | \(✗\) | \(✓\) |
mixture | 1.4 | \(✓\) | \(✓\) | \(✗\) | \(✗\) |
EMCluster | 0.2-5 | \(✓\) | \(✓\) | \(✗\) | \(✗\) |
mixtools | 1.0.4 | \(✓\) | \(✗\) | \(✓\) | \(✓\) |
bgmm | 1.7 | \(✓\) | \(✓\) | \(✗\) | \(✗\) |
flexmix | 2.3-13 | \(✓\) | \(✗\) | \(✗\) | \(✓\) |

Figure 1 shows the trend in weekly downloads from the
RStudio CRAN mirror for the selected packages. The popularity of
*mclust* has been increasing steadily over time with a first high peak
around mid April 2015, probably due to the release of R version 3.2 and,
shortly after, the release of version 5 of *mclust*. Successive
peaks then occurred in conjunction with the release of package updates.
Based on these logs, *mclust* is the most downloaded package dealing
with Gaussian mixture models, followed by *flexmix* which, as mentioned,
is a more general package for fitting mixture models but with limited
clustering capabilities.

Another aspect that can be considered as a proxy for the popularity of a
package is the mutual dependency structure between R packages^{1}.
This can be represented as a graph with packages at the vertices and
dependencies (either “Depends”, “Imports”, “LinkingTo”, “Suggests” or
“Enhances”) as directed edges, and analysed through the *PageRank*
algorithm used by the Google search engine (Brin and Page 1998). For the
packages considered previously, we used the `page.rank` function
available in the *igraph* package (Csardi and Nepusz 2006) and we
obtained the ranking reported in
Table 2, which approximately reproduces the results
discussed above. Note that *mclust* is among the top 100 packages on
CRAN by this ranking. Finally, its popularity is also indicated by the
55 other CRAN packages listed as reverse dependencies, either “Depends”,
“Imports” or “Suggests”.

mclust | Rmixmod | mixture | EMCluster | mixtools | bgmm | flexmix |
---|---|---|---|---|---|---|
75 | 2300 | 2319 | 2143 | 1698 | 3736 | 270 |
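The PageRank computation itself is simple enough to sketch in base R. The following toy example (our own illustration, not the *igraph* implementation) ranks three hypothetical packages by power iteration, where a package's score is the damped stationary weight it receives from the packages that link to it:

```r
## Illustrative PageRank by power iteration on a made-up dependency
## graph; A[i, j] = 1 if package j depends on (links to) package i.
pagerank <- function(A, d = 0.85, tol = 1e-10) {
  n <- ncol(A)
  ## Column-normalise so each package splits its "vote" among its links
  M <- sweep(A, 2, pmax(colSums(A), 1), "/")
  r <- rep(1 / n, n)
  repeat {
    r_new <- (1 - d) / n + d * (M %*% r)
    if (sum(abs(r_new - r)) < tol) break
    r <- r_new
  }
  drop(r_new / sum(r_new))
}

## Toy graph: packages 2 and 3 both depend on package 1
A <- matrix(c(0, 1, 1,
              0, 0, 0,
              0, 0, 0), nrow = 3, byrow = TRUE)
pr <- pagerank(A)
```

Package 1, depended upon by the other two, receives the highest score; in the same way, heavily imported packages such as *mclust* rise in the CRAN dependency graph.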

Earlier versions of the package have been described in
Fraley and Raftery (1999), Fraley and Raftery (2003), and
Fraley et al. (2012). In this paper we discuss some of
the new functionalities available in *mclust* version \(\ge 5\). In
particular we describe the newly available models, dimension reduction
for visualisation, bootstrap-based inference, implementation of
different model selection criteria and initialisation strategies for the
EM algorithm.

The reader should first install the latest version of the package from CRAN with

`> install.packages("mclust")`

Then the package is loaded into an R session using the command

```
> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.2
Type 'citation("mclust")' for citing this R package in publications.
```

All the datasets used in the examples are available in *mclust* or in
other R packages, such as
*gclus* (Hurley 2012),
*rrcov* (Todorov and Filzmoser 2009) and
*tourr* (Wickham et al. 2011), and can be
installed from CRAN using the above procedure, except where noted
differently.

Let \(\mathbf{x}= \lbrace \mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_i,\ldots,\mathbf{x}_n \rbrace\) be a sample of \(n\) independent identically distributed observations. The distribution of every observation is specified by a probability density function through a finite mixture model of \(G\) components, which takes the following form \[f(\mathbf{x}_i; \mathbf{\Psi}) = \sum_{k=1}^G \pi_k f_k(\mathbf{x}_i; \mathbf{\theta}_k), \label{eq:finmixdens} \tag{1}\] where \(\mathbf{\Psi}= \{\pi_1, \ldots, \pi_{G-1}, \mathbf{\theta}_1, \ldots, \mathbf{\theta}_G\}\) are the parameters of the mixture model, \(f_k(\mathbf{x}_i; \mathbf{\theta}_k)\) is the \(k\)th component density for observation \(\mathbf{x}_i\) with parameter vector \(\mathbf{\theta}_k\), \((\pi_1,\ldots,\pi_{G-1})\) are the mixing weights or probabilities (such that \(\pi_k > 0\), \(\sum_{k=1}^G\pi_k = 1\)), and \(G\) is the number of mixture components.

Assuming that \(G\) is fixed, the mixture model parameters \(\mathbf{\Psi}\) are usually unknown and must be estimated. The log-likelihood function corresponding to equation (1) is given by \(\ell(\mathbf{\Psi}; \mathbf{x}_1, \ldots, \mathbf{x}_n) = \sum_{i=1}^{n} \log(f(\mathbf{x}_i; \mathbf{\Psi}))\). Direct maximisation of the log-likelihood function is complicated, so the maximum likelihood estimator (MLE) of a finite mixture model is usually obtained via the EM algorithm (Dempster et al. 1977; McLachlan and Peel 2000).
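To make the notation concrete, the mixture density, its log-likelihood and the E-step posterior probabilities can be written directly in base R. The following univariate two-component sketch uses made-up parameter values and is purely illustrative:

```r
## Univariate two-component Gaussian mixture: density, log-likelihood,
## and the E-step of EM.  Parameter values below are made up.
dmix <- function(x, prop, mu, sigma) {
  ## f(x_i) = sum_k prop_k * N(x_i; mu_k, sigma_k), for a vector x
  rowSums(sapply(seq_along(prop),
                 function(k) prop[k] * dnorm(x, mu[k], sigma[k])))
}
loglik <- function(x, prop, mu, sigma)
  sum(log(dmix(x, prop, mu, sigma)))

## E-step: posterior probability that observation i belongs to component k
estep <- function(x, prop, mu, sigma) {
  num <- sapply(seq_along(prop),
                function(k) prop[k] * dnorm(x, mu[k], sigma[k]))
  num / rowSums(num)
}

x <- c(-2.1, -1.8, 2.0, 2.2)
z <- estep(x, prop = c(0.5, 0.5), mu = c(-2, 2), sigma = c(1, 1))
```

In the M-step these posterior probabilities would be used as weights to update the component means, variances and mixing proportions; iterating the two steps yields the MLE that *mclust* computes for the full multivariate parameterisations.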

In the model-based approach to clustering, each component of a finite
mixture density is usually associated with a group or cluster. Most
applications assume that all component densities arise from the same
parametric distribution family, although this need not be the case in
general. A popular model is the Gaussian mixture model (GMM), which
assumes a (multivariate) Gaussian distribution for each component,
i.e. \(f_k(\mathbf{x}; \mathbf{\theta}_k) \sim N(\mathbf{\mu}_k, \mathbf{\Sigma}_k)\).
Thus, clusters are ellipsoidal, centered at the mean vector
\(\mathbf{\mu}_k\), and with other geometric features, such as volume,
shape and orientation, determined by the covariance matrix
\(\mathbf{\Sigma}_k\). Parsimonious parameterisations of the covariances
matrices can be obtained by means of an eigen-decomposition of the form
\(\mathbf{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}{}^{\!\top}_k\),
where \(\lambda_k\) is a scalar controlling the volume of the ellipsoid,
\(\mathbf{A}_k\) is a diagonal matrix specifying the shape of the density
contours with \(\det(\mathbf{A}_k)=1\), and \(\mathbf{D}_k\) is an
orthogonal matrix which determines the orientation of the corresponding
ellipsoid (Banfield and Raftery 1993; Celeux and Govaert 1995). In one
dimension, there are just two models: `E` for equal variance and `V` for
varying variance. In the multivariate setting, the volume, shape, and
orientation of the covariances can be constrained to be equal or
variable across groups. Thus, 14 possible models with different
geometric characteristics can be specified. Table 3
reports all such models with the corresponding distribution structure
type, volume, shape, orientation, and associated model names. In
Figure 2 the geometric characteristics are shown
graphically.

Model | \(\mathbf{\Sigma}_k\) | Distribution | Volume | Shape | Orientation |
---|---|---|---|---|---|
`EII` | \(\lambda \mathbf{I}\) | Spherical | Equal | Equal | — |
`VII` | \(\lambda_k \mathbf{I}\) | Spherical | Variable | Equal | — |
`EEI` | \(\lambda \mathbf{A}\) | Diagonal | Equal | Equal | Coordinate axes |
`VEI` | \(\lambda_k \mathbf{A}\) | Diagonal | Variable | Equal | Coordinate axes |
`EVI` | \(\lambda \mathbf{A}_k\) | Diagonal | Equal | Variable | Coordinate axes |
`VVI` | \(\lambda_k \mathbf{A}_k\) | Diagonal | Variable | Variable | Coordinate axes |
`EEE` | \(\lambda \mathbf{D}\mathbf{A}\mathbf{D}{}^{\!\top}\) | Ellipsoidal | Equal | Equal | Equal |
`EVE` | \(\lambda \mathbf{D}\mathbf{A}_k \mathbf{D}{}^{\!\top}\) | Ellipsoidal | Equal | Variable | Equal |
`VEE` | \(\lambda_k \mathbf{D}\mathbf{A}\mathbf{D}{}^{\!\top}\) | Ellipsoidal | Variable | Equal | Equal |
`VVE` | \(\lambda_k \mathbf{D}\mathbf{A}_k \mathbf{D}{}^{\!\top}\) | Ellipsoidal | Variable | Variable | Equal |
`EEV` | \(\lambda \mathbf{D}_k \mathbf{A}\mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Equal | Equal | Variable |
`VEV` | \(\lambda_k \mathbf{D}_k \mathbf{A}\mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Variable | Equal | Variable |
`EVV` | \(\lambda \mathbf{D}_k \mathbf{A}_k \mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Equal | Variable | Variable |
`VVV` | \(\lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}{}^{\!\top}_k\) | Ellipsoidal | Variable | Variable | Variable |
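The eigen-decomposition can be verified numerically in a few lines of base R. The two-dimensional sketch below uses arbitrary values for the volume \(\lambda\), the shape matrix \(\mathbf{A}\) and the rotation angle:

```r
## Sigma = lambda * D A D', with det(A) = 1 so lambda alone sets the volume.
lambda <- 2                                 # volume parameter
A      <- diag(c(4, 1/4))                   # shape matrix, det(A) = 1
theta  <- pi / 6                            # arbitrary orientation angle
D      <- cbind(c(cos(theta),  sin(theta)), # orthogonal rotation matrix
                c(-sin(theta), cos(theta)))
Sigma  <- lambda * D %*% A %*% t(D)
```

Since \(\det(\mathbf{A}) = 1\) and \(\mathbf{D}\) is orthogonal, \(\det(\mathbf{\Sigma}) = \lambda^p\) (here \(\lambda^2 = 4\)), confirming that \(\lambda\) alone controls the volume of the ellipsoid while \(\mathbf{A}\) and \(\mathbf{D}\) fix its shape and orientation.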

Starting with version \(5.0\) of *mclust*, four additional models have
been included: EVV, VEE, EVE, VVE. Models EVV and VEE are estimated
using the methods described in Celeux and Govaert (1995), and the estimation
of models EVE and VVE is carried out using the approach discussed by
Browne and McNicholas (2014). In the models VEE, EVE and VVE it is assumed
that the mixture components share the same orientation matrix. This
assumption allows for a parsimonious characterisation of the clusters,
while still retaining flexibility in defining volume and shape.
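One way to see the parsimony these models buy is to count free parameters. For the EVE model, for instance, the covariance structure contributes one shared volume \(\lambda\), \(G(p-1)\) shape parameters (each \(\mathbf{A}_k\) is diagonal with unit determinant) and \(p(p-1)/2\) parameters for the shared orientation matrix \(\mathbf{D}\); adding the \(Gp\) means and \(G-1\) mixing weights gives the model's degrees of freedom. A small base-R helper (our own, not part of *mclust*) makes the count explicit:

```r
## Degrees of freedom for the EVE model (shared volume and orientation,
## component-specific shapes); helper written for illustration only.
df_eve <- function(p, G) {
  means       <- G * p              # component mean vectors
  weights     <- G - 1              # free mixing proportions
  volume      <- 1                  # shared lambda
  shape       <- G * (p - 1)        # A_k diagonal with det(A_k) = 1
  orientation <- p * (p - 1) / 2    # shared orthogonal matrix D
  means + weights + volume + shape + orientation
}
```

For \(p = 13\) and \(G = 3\), as in the wine data analysed below, this gives 156 degrees of freedom, whereas the unconstrained VVV model would need \(G\,p(p+1)/2 = 273\) covariance parameters alone.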

To illustrate the new modelling capabilities of *mclust* for model-based
clustering consider the `wine` dataset contained in the *gclus*
R package. This dataset provides 13 measurements obtained from a
chemical analysis of 178 wines grown in the same region in Italy but
derived from three different cultivars (Barolo, Grignolino, Barbera).

```
> data(wine, package = "gclus")
> Class <- factor(wine$Class, levels = 1:3,
                  labels = c("Barolo", "Grignolino", "Barbera"))
> X <- data.matrix(wine[,-1])
> mod <- Mclust(X)
> summary(mod$BIC)
Best BIC values:
             EVE,3       VVE,3       VVE,6
BIC      -6873.257 -6896.83693 -6906.37460
BIC diff     0.000   -23.57947   -33.11714
> plot(mod, what = "BIC", ylim = range(mod$BIC[,-(1:2)], na.rm = TRUE),
       legendArgs = list(x = "bottomleft"))
```

In the above `Mclust()` function call, only the data matrix is provided,
and the number of mixing components and the covariance parameterisation
are selected using the Bayesian Information Criterion (BIC). A summary
showing the top three models and a plot of the BIC traces (see
Figure 3) for all the models considered is then obtained.
In the latter plot we adjusted the range of the y-axis so as to remove
models with lower BIC values. There is a clear indication of a
three-component mixture with covariances having different shapes but the
same volume and orientation (EVE). Note that all three top models
are among the models added to the latest major release of *mclust*.

A summary of the selected model is obtained as:

```
> summary(mod)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:

 log.likelihood   n  df       BIC       ICL
       -3032.45 178 156 -6873.257 -6873.549

Clustering table:
 1  2  3
63 51 64
```

The fitted model provides an accurate recovery of the true classes:

```
> table(Class, mod$classification)
Class         1  2  3
  Barolo     59  0  0
  Grignolino  4  3 64
  Barbera     0 48  0
> adjustedRandIndex(Class, mod$classification)
[1] 0.8803998
```

The latter index is the adjusted Rand index [ARI; Hubert and Arabie (1985)], which can be used for evaluating a clustering solution. The ARI is a measure of agreement between two partitions, one estimated by a statistical procedure independent of the labelling of the groups, and one being the true classification. It has zero expected value in the case of a random partition, and it is bounded above by 1, with higher values representing better partition accuracy.
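The ARI is straightforward to compute from the contingency table of the two partitions. The following base-R function (a from-scratch sketch, equivalent in intent to `adjustedRandIndex()`) implements the Hubert-Arabie pair-counting formula:

```r
## Adjusted Rand index from scratch: compares two partitions via the
## pair-counting formula of Hubert and Arabie (1985).
ari <- function(x, y) {
  tab    <- table(x, y)
  comb2  <- function(n) n * (n - 1) / 2        # "n choose 2"
  sum_ij <- sum(comb2(tab))                    # agreeing pairs within cells
  sum_i  <- sum(comb2(rowSums(tab)))
  sum_j  <- sum(comb2(colSums(tab)))
  expected <- sum_i * sum_j / comb2(sum(tab))  # value under random labelling
  (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)
}

## Identical partitions score 1 even when the labels are permuted
ari(c(1, 1, 2, 2), c("b", "b", "a", "a"))   # returns 1
```

Because the index only depends on the contingency table, it is invariant to relabelling of the clusters, which is essential when comparing an estimated clustering to known classes.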

To visualise the clustering structure and the geometric characteristics
induced by an estimated Gaussian finite mixture model we may project the
data onto a suitable dimension reduction subspace. The function
`MclustDR()` implements the methodology introduced in Scrucca (2010). The
estimated directions which span the reduced subspace are defined as a
set of linear combinations of the original features, ordered by
importance as quantified by the associated eigenvalues. By default,
information on the dimension reduction subspace is provided by both the
variation on cluster means and, depending on the estimated mixture
model, on the variation on cluster covariances. This methodology has
been extended to supervised classification by Scrucca (2014).
Furthermore, a tuning parameter has been included which enables the
recovery of most of the separating directions, i.e. those that show
maximal separation among groups. Other dimension reduction techniques
for finding the directions of optimum separation have been discussed in
detail by Hennig (2004) and implemented in the package
*fpc* (Hennig 2015).

Applying `MclustDR` to the wine data example, such directions are
obtained as follows:

```
> drmod <- MclustDR(mod, lambda = 1)
> summary(drmod)
-----------------------------------------------------------------
Dimension reduction for model-based clustering and classification
-----------------------------------------------------------------

Mixture model type: Mclust (EVE, 3)

Clusters   n
       1  63
       2  51
       3  64

Estimated basis vectors:
                        Dir1       Dir2
Alcohol           0.11701058  0.2637302
Malic            -0.02814821  0.0489447
Ash              -0.18258917  0.5390056
Alcalinity       -0.02969793 -0.0309028
Magnesium         0.00575692  0.0122642
Phenols          -0.18497201 -0.0016806
Flavanoids        0.45479873 -0.2948947
Nonflavanoid      0.59278569 -0.5777586
Proanthocyanins   0.05347167  0.0508966
Intensity        -0.08328239  0.0332611
Hue               0.42950365 -0.4588969
OD280             0.40563746 -0.0369229
Proline           0.00075867  0.0010457

               Dir1    Dir2
Eigenvalues  1.5794   1.332
Cum. %      54.2499 100.000
```

By setting the optional tuning parameter `lambda = 1`, instead of the
default value `0.5`, only the information on cluster means is used for
is \(d = \min(p, G-1)\), where \(p\) is the number of variables and \(G\) the
number of mixture components or clusters. In the data example, there are
\(p=13\) features and \(G=3\) clusters, so the dimension of the reduced
subspace is \(d=2\). As a result, the projected data show the maximal
separation among clusters, as shown in Figure 4a, which is
obtained with

`> plot(drmod, what = "contour")`

On the same subspace we can also plot the uncertainty boundaries corresponding to the MAP classification:

`> plot(drmod, what = "boundaries", ngrid = 200)`

and then add a circle around the misclassified observations:

```
> miscl <- classError(Class, mod$classification)$misclassified
> points(drmod$dir[miscl,], pch = 1, cex = 2)
```