Gaussian Mixture Models in R

Gaussian mixture models (GMMs) are widely used for modelling stochastic problems, and a wide diversity of packages implementing them has been developed in R. However, no recent review describing the main features offered by these packages and comparing their performances has been conducted. In this article, we first introduce GMMs and the EM algorithm used to retrieve the parameters of the model, and analyse the main features implemented in seven of the most widely used R packages. We then empirically compare their statistical and computational performances in relation to the choice of the initialisation algorithm and the complexity of the mixture. We demonstrate that the best estimation with well-separated components, or with a small number of components with distinguishable modes, is obtained with the REBMIX initialisation implemented in the rebmix package, while the best estimation with highly overlapping components is obtained with k-means or random initialisation. Importantly, we show that implementation details of the EM algorithm yield differences in the parameters’ estimation. In particular, the packages mixtools (Young et al. 2020) and Rmixmod (Langrognet et al. 2021) estimate the parameters of the mixture with smaller bias, while the RMSE and variability of the estimates are smaller with the packages bgmm (Szczurek 2021), EMCluster (W.-C. Chen and Maitra 2022), GMKMcharlie (Liu 2021), flexmix (Gruen and Leisch 2022) and mclust (Fraley, Raftery, and Scrucca 2022). The comparison of these packages provides R users with useful recommendations for improving the computational and statistical performance of their clustering and for identifying common deficiencies. Additionally, we propose several improvements for the development of a future, unified mixture model package.

Bastien Chassagnol (Laboratoire de Probabilités, Statistiques et Modélisation (LPSM), UMR CNRS 8001), Antoine Bichat (Les Laboratoires Servier), Cheïma Boudjeniba (Systems Biology Group, Dept. of Computational Biology, Institut Pasteur), Pierre-Henri Wuillemin (Laboratoire d’Informatique de Paris 6 (LIP6), UMR 7606), Mickaël Guedj (Les Laboratoires Servier), David Gohel (ArData), Gregory Nuel (Laboratoire de Probabilités, Statistiques et Modélisation (LPSM), UMR CNRS 8001), Etienne Becht (Les Laboratoires Servier)
2023-11-01

1 Introduction to Mixture modelling

Formally, let’s consider a pair of random variables \((X,S)\), with \(S \in \{1, \ldots, k\}\) a discrete variable designating the component identity of each observation. When observed, the values of \(S\) are referred to as the labels of the individual observations. \(k\) is the number of mixture components. Then, the density distribution of \(X\) is given in Equation (1):

\[\begin{equation} \begin{split} f_\theta(X) &= \sum_S f_\theta (X, S) \\ &= \sum_{j=1}^k p_j f_{\zeta j}(X), \quad X\in\mathbb{R} \end{split} \tag{1} \end{equation}\]

where \(\theta = (p, \zeta) = (p_1, \ldots, p_k, \zeta_1, \ldots, \zeta_k)\) denotes the parameters of the model: \(p_j\) is the proportion of component \(j\) and \(\zeta_j\) represents the parameters of the density distribution followed by component \(j\). In addition, since \(S\) is a categorical variable parametrized by \(p\), the prior weights must enforce the unit simplex constraint (Equation (2)):

\[\begin{equation} \begin{cases} p_j \ge 0 \quad \forall j \in \{1, \ldots, k \}\\ \sum_{j=1}^k p_j =1 \end{cases} \tag{2} \end{equation}\]
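To make the generative interpretation of Equations (1) and (2) concrete, the following minimal R sketch (our own illustration, with arbitrary parameter values, anticipating the Gaussian components detailed in the next section) first draws the latent component and then the observation; the simulated vectors x and s are reused in later sketches:

```r
## Illustrative simulation of the hierarchical model behind Equation (1):
## draw S from a categorical distribution with weights p (Equation (2)),
## then draw X from the selected Gaussian component.
set.seed(42)
n     <- 1000
p     <- c(0.3, 0.7)    # prior weights, summing to one
mu    <- c(0, 4)        # component means
sigma <- c(1, 0.5)      # component standard deviations
s <- sample(seq_along(p), size = n, replace = TRUE, prob = p)  # latent labels S
x <- rnorm(n, mean = mu[s], sd = sigma[s])                     # observed sample X
```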

In terms of applications, mixture models are used, among other goals, for model-based clustering, in which each observation \(i\) is assigned to a component based on its posterior probability \(\eta_i(j)\) of belonging to component \(j\), defined in Equation (3):

\[\begin{equation} \eta_{i} (j) := \mathbb{P}_{\theta} (S_i=j |X_i=x_i) \tag{3} \end{equation}\]
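As a brief illustration (our own snippet, reusing the Gaussian sample simulated above and evaluating Equation (3) at the true parameters), these posterior probabilities follow directly from Bayes’ rule:

```r
## Posterior probabilities of Equation (3), evaluated here at the true parameters;
## in practice, estimated parameters are plugged in instead.
dens <- sapply(seq_along(p), function(j) p[j] * dnorm(x, mu[j], sigma[j]))  # n x k matrix
eta  <- dens / rowSums(dens)   # eta[i, j] = P(S_i = j | X_i = x_i)
head(round(eta, 3))
```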

In section Univariate and multivariate Gaussian distributions in the context of mixture models, we describe the most commonly used family, the Gaussian Mixture Model (GMM). We then present the MLE estimation of the parameters of a GMM, introducing the classic EM algorithm, in section Parameter estimation in finite mixtures models. Finally, we introduce the bootstrap methods used to evaluate the quality of the estimation and the metrics used to select the best model in appendices Derivation of confidence intervals in GMMs and Model selection, respectively.

1.1 Univariate and multivariate Gaussian distributions in the context of mixture models

We focus our study on finite Gaussian mixture models (GMMs), in which each of the \(k\) components is assumed to follow a Gaussian distribution.

We recall below the definition of the Gaussian distribution in both univariate and multivariate context. In the finite univariate Gaussian mixture model, the distribution of each component \(f_{\zeta j}(X)\) is given by the following univariate Gaussian p.d.f. (probability density function) (Equation (4)):

\[\begin{equation} f_{\zeta j}(X=x)=\varphi_{\zeta_j}(x | \mu_j, \sigma_j):=\frac{1}{\sqrt{2\pi} \sigma_j} \exp\left(- \frac{(x - \mu_j)^2}{2 \sigma_j^2}\right) \tag{4} \end{equation}\] which we note: \(X \sim \mathcal{N}(\mu_j, \sigma_j)\).

In the univariate case, the parameters to be inferred for each component \(j\), gathered in \(\zeta_j\), are: \(\mu_j\), the location parameter (equal to the mean of the distribution), and \(\sigma_j\), the scale parameter (equal to the standard deviation for a Gaussian distribution).
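As a quick sanity check (our own snippet), the p.d.f. of Equation (4) coincides with base R’s dnorm(), which uses the same mean/standard-deviation parametrisation:

```r
## Hand-coded p.d.f. of Equation (4) versus the built-in dnorm()
phi <- function(x, mu, sigma) exp(-(x - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma)
all.equal(phi(1.3, mu = 2, sigma = 0.7), dnorm(1.3, mean = 2, sd = 0.7))
#> [1] TRUE
```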

Parsimonious parametrisations of univariate GMMs are often considered, typically constraining the standard deviations to be equal across components (homoscedastic rather than heteroscedastic model) and/or the mixing proportions to be equal (balanced rather than unbalanced classes).

In the finite multivariate Gaussian mixture model, the distribution \(f_{\zeta j}(\boldsymbol{X})\) of each component \(j\), where \(\boldsymbol{X} = (X_1, \ldots, X_D)^\top \in \mathbb{R}^D\) is a multivariate random variable of dimension \(D\), is given by the following multivariate Gaussian p.d.f. (Equation (5)):

\[\begin{equation} f_{\zeta j}(\boldsymbol{X}=\boldsymbol{x})=\operatorname{det}(2\pi\boldsymbol{\Sigma}_j)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_j)^\top \boldsymbol{\Sigma}_j^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_j)\right) \tag{5} \end{equation}\]

which we note \(\boldsymbol{X} \sim \mathcal{N}_D(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\). The parameters to be estimated for each component can be decomposed into the mean vector \(\boldsymbol{\mu}_j \in \mathbb{R}^D\), which locates the centre of the component, and the covariance matrix \(\boldsymbol{\Sigma}_j\), a \(D \times D\) symmetric positive-definite matrix describing its dispersion and the orientation of its principal axes.

Three families of multivariate GMMs are often considered: the spherical family (covariance matrices proportional to the identity matrix), the diagonal family (diagonal covariance matrices, yielding axis-aligned components) and the general family (fully unconstrained covariance matrices).

In the multivariate setting, the volume, shape, and orientation of the covariance matrices can be constrained to be equal or variable across clusters, generating 14 possible parametrisations with different geometric characteristics (Celeux and Govaert 1992; Banfield and Raftery 1993). We review them in Appendix Parameters estimation in a high-dimensional context and Table 5. Of note, the correlation matrix can easily be derived from the covariance matrix with the following normalisation: \(\operatorname{cor}(\boldsymbol{X})=\left(\frac{\operatorname{cov}(x_l, x_m)}{\sqrt{\operatorname{var}(x_l)} \times \sqrt{\operatorname{var}(x_m)}}\right)_{(l,m) \in \{1, \ldots, D\}^2}\). A correlation coefficient lies between -1 and 1: its absolute value measures the strength of the association, while its sign indicates the direction of the relationship. A correlation of 1 or -1 between two features indicates a strictly linear relationship.
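This normalisation is available in base R as cov2cor(); a small sketch on an arbitrary \(2 \times 2\) covariance matrix:

```r
## Deriving the correlation matrix from a covariance matrix with stats::cov2cor()
Sigma <- matrix(c(2.0, 0.6,
                  0.6, 1.0), nrow = 2, byrow = TRUE)
cov2cor(Sigma)   # off-diagonal term: 0.6 / sqrt(2 * 1) ~ 0.42
```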

For the sake of simplicity and tractability, we will only consider the fully unconstrained model in both the univariate setting (heteroscedastic components and unbalanced classes) and the multivariate setting (unbalanced classes and a complete covariance matrix for each cluster) in the remainder of our paper.

1.2 Parameter estimation in finite mixtures models

A common way of estimating the parameters of a parametric distribution is the maximum likelihood estimation (MLE) method. It consists in estimating the parameters by maximising the likelihood, or equivalently the log-likelihood, of a sample. In what follows, \(\ell(\theta|x_{1:n})=\log (f(x_{1:n}|\theta))\) is the log-likelihood of an n-sample. When all observations are independent, it simplifies to \(\ell(\theta|x_{1:n}) = \sum_{i=1}^n \log (f(x_i|\theta))\). The MLE consists in finding the parameter estimate \(\hat{\theta}\) which maximises the log-likelihood: \(\hat{\theta} = \arg \max_\theta \ell (\theta | x_{1:n})\).

Recovering the maximum of a function is generally performed by finding the values at which its derivative vanishes. The MLE in GMMs has interesting properties compared with the method of moments: it is a consistent, asymptotically efficient and asymptotically unbiased estimator (McLachlan and Peel 2000; Chen 2016).

When \(S\) is completely observed, for pairs of observations \((x_{1:n}, s_{1:n})\), the log-likelihood of a finite mixture model is simply given by Equation (6):

\[\begin{equation} \ell(\theta|X_{1:n}=x_{1:n}, S_{1:n}=s_{1:n})=\sum_{i=1}^n \sum_{j=1}^k \mathbf{1}_{\{s_i=j\}}\left[\log(p_j) + \log\left(f_{\zeta_j} (x_i)\right) \right] \tag{6} \end{equation}\]

where an analytical solution can be computed, provided that a closed-form estimate exists for the parameters \(\zeta_j\) of each component’s parametric distribution. In this context, the MLE maximisation involves estimating the parameters \(\zeta_j\) of each cluster, while the corresponding proportions \(p_j\) are straightforwardly computed as the ratio of the number of observations assigned to cluster \(j\) to the total number of observations \(n\).
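For instance (our own sketch, reusing the simulated pair (x, s) from the introduction), the closed-form MLE with fully observed labels reduces to per-cluster empirical moments:

```r
## Closed-form MLE in the complete-data case (Equation (6))
p_hat     <- as.numeric(table(s)) / length(s)   # cluster proportions
mu_hat    <- tapply(x, s, mean)                 # cluster means
sigma_hat <- tapply(x, s, function(xj) sqrt(mean((xj - mean(xj))^2)))  # MLE uses 1/n_j, not 1/(n_j - 1)
```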

However, when \(S\) is unobserved, the log-likelihood, qualified as incomplete with respect to the previous case, is given by Equation (7):

\[\begin{equation} \ell (\theta \vert x_{1:n}) = \sum_{i=1}^n \log \left( \underbrace{\sum_{j=1}^k p_j f_{\zeta_j}(x_i)}_{\text{sum inside the log}} \right) \tag{7} \end{equation}\]

The sum of terms embedded in the log function (see the underbraced term in Equation (7)) makes it intractable in practice to find the roots of the corresponding derivative. Thus, no closed form of the MLE is available, even for the basic univariate GMM. This is why most parameter estimation methods derive instead from the EM algorithm, first described in Dempster et al. (1977). We describe its main theoretical properties, the reasons for its popularity, and its main limitations in the next section.

1.3 The EM algorithm

In cases where both \(S\) and the parameters associated to each cluster are unknown, no closed-form solution is available to jointly maximise the log-likelihood defined in Equation (7) with respect to \((\theta, S)\). However, when either \(S\) or \(\theta\) is known, the estimation of the other is straightforward. Hence, the general principle of EM-like algorithms is to split this complex joint MLE estimation of \((S, \theta)\) into the iterative estimation of \(S_q\) from \(\hat{\theta}_{q-1}\) and \(X\) (expectation phase, or E-step of the algorithm) and the estimation of \(\hat{\theta}_{q}\) from \(S_q\) and \(X\) (maximisation phase, or M-step), with \(\hat{\theta}_{q-1}\) the parameters estimated at the previous step \(q-1\), until convergence is reached.

The EM algorithm sets itself apart from other commonly used methods by taking into account all possible values of the latent variable \(S\). To do so, it computes the expected value of the log-likelihood of \(\theta\), conditioned on the posterior distribution \(\mathbb{P}_{\hat{\theta}_{q-1}} (S|X)\); this expectation is also known as the auxiliary function. Using the assumption of independence among observations in a mixture model, the general formula of this proxy of the incomplete log-likelihood is given for finite mixture models in Equation (8).

\[\begin{equation} \begin{split} Q(\theta|\hat{\theta}_{q-1}) & := \mathbb{E}_{S_{1:n}| X_{1:n}, \hat{\theta}_{q-1}} \left[\ell(\theta | X_{1:n}, S_{1:n})\right] \\ &=\sum_{i=1}^n \sum_{j=1}^k \eta_{i}(j) \left( \log (p_j) + \log (\mathbb{P}(X_i|S_i=j, \theta)) \right) \end{split} \tag{8} \end{equation}\]

with \(\hat{\theta}_{q-1}\) the current estimate of the parameters.

In practice, the EM algorithm alternates between the E-step and the M-step until convergence, as described below (Box 1); a minimal R sketch of the full procedure is given after the two steps:

  • E-step: determine the posterior probability \(\eta_i(j)\) for each observation of \(X\) and each possible discrete latent class, using the initial estimates \(\hat{\theta}_0\) at step \(q=0\), or the previously computed estimates \(\hat{\theta}_{q-1}\) otherwise. The general formula is given by Equation (9):
\[\begin{equation} \eta_i(j) = \frac{p_j f_{\zeta_j} (x_i)}{\sum_{j=1}^k p_j f_{\zeta_j} (x_i)} \tag{9} \end{equation}\]
  • M-step: compute the mapping function \(\hat{\theta}_q:=M(\theta | \hat{\theta}_{q-1})=\arg \max_\theta Q(\theta| \hat{\theta}_{q-1})\) which maximises the auxiliary function. One way of retrieving the MLE associated to the auxiliary function is to determine the roots of its derivative, namely by solving Equation (10):
\[\begin{equation} \frac{\partial Q(\theta| \hat{\theta}_{q-1})}{\partial \theta}=0 \tag{10} \end{equation}\]
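To make the two steps concrete, the following self-contained R sketch implements the EM algorithm for a fully unconstrained univariate GMM (our own illustration; the function and argument names are not taken from any reviewed package):

```r
## Minimal EM for a univariate, heteroscedastic GMM.
## theta0 is a list with elements p, mu and sigma, one value per component.
em_gmm_univ <- function(x, theta0, max_iter = 500, tol = 1e-8) {
  theta <- theta0
  loglik_trace <- numeric(0)
  for (q in seq_len(max_iter)) {
    ## E-step: posterior probabilities eta[i, j] (Equation (9))
    dens <- sapply(seq_along(theta$p),
                   function(j) theta$p[j] * dnorm(x, theta$mu[j], theta$sigma[j]))
    eta <- dens / rowSums(dens)
    ## Incomplete log-likelihood (Equation (7)) at the current parameters
    loglik_trace <- c(loglik_trace, sum(log(rowSums(dens))))
    ## M-step: closed-form maximisers of the auxiliary function (Equation (10))
    n_j         <- colSums(eta)
    theta$p     <- n_j / length(x)
    theta$mu    <- colSums(eta * x) / n_j
    theta$sigma <- sqrt(colSums(eta * outer(x, theta$mu, "-")^2) / n_j)
    ## Stopping rule: negligible increase of the log-likelihood
    if (q > 1 && abs(diff(tail(loglik_trace, 2))) < tol) break
  }
  list(theta = theta, loglik = loglik_trace, posterior = eta)
}

## Example run on the sample simulated in the introduction
fit <- em_gmm_univ(x, theta0 = list(p = c(0.5, 0.5), mu = c(-1, 5), sigma = c(1, 1)))
fit$theta
```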

Interestingly, the decomposition of the incomplete log-likelihood associated to a mixture model, \(\ell(\theta|X)\), reveals an entropy term and the so-called auxiliary function (Dempster et al. 1977). This decomposition can be used to prove that maximising the auxiliary function at each step induces a bounded increase of the incomplete log-likelihood. Hence, the convergence of the EM algorithm, assessed by comparing consecutive log-likelihood values, is guaranteed, provided the mapping function returns the maximum of the auxiliary function. The convergence of the sequence of estimated parameters, \((\theta_q)_{q \ge 0} \underset{q\to +\infty}{\longrightarrow} \hat{\theta}\), is harder to prove but has been formally demonstrated for the exponential family (a superset of the Gaussian family), as stated in Dempster et al. (1977).

Additionally, the EM algorithm is deterministic, meaning that for a given initial estimate \(\theta_0\) the parameters returned by the algorithm at a given step \(q\) are fixed. However, this method requires the user to provide an initial estimate, denoted as \(\theta_0\), of the model parameters and to specify the number of components in the mixture. We review some classic initialisation methods in Initialisation of the EM algorithm and some algorithms used to overcome the main limitations of the EM algorithm in the Appendix Extensions of the EM algorithm to overcome its limitations.
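With the sketch above (our own code), the determinism of the algorithm and the monotone increase of the log-likelihood mentioned earlier are easy to check empirically:

```r
theta0 <- list(p = c(0.5, 0.5), mu = c(-1, 5), sigma = c(1, 1))
run1 <- em_gmm_univ(x, theta0); run2 <- em_gmm_univ(x, theta0)
all(diff(run1$loglik) > -1e-12)    # monotone increase of the incomplete log-likelihood
identical(run1$theta, run2$theta)  # deterministic: same theta0 gives same estimates
```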

Finally, the prevalent choice of Gaussian distributions to characterise the distribution of random observations is guided by a set of interesting properties. In particular, Geary (1936) has shown that the Normal distribution is the only distribution for which Cochran’s theorem (Cochran 1934) holds, namely for which the mean and variance of the sample are independent of each other. Additionally, as for any distribution belonging to the exponential family, the MLE is a sufficient statistic.

1.4 Initialisation of the EM algorithm

EM-like algorithms require an initial estimate of the parameters, \(\theta_0\), to maximise the likelihood. Initialisation is a crucial step, as a bad initialisation can lead to a sub-optimal local solution or trap the algorithm on the boundary of the parameter space. The most straightforward initialisation methods, such as random initialisation, are standalone and do not require any additional initialisation algorithm, whereas meta-methods, such as short-EM, still need to be initialised by alternative methods. The commonly-used initialisation methods encompass random initialisation, quantile-based partitioning, partitions obtained by k-means or hierarchical clustering, short runs of the EM algorithm or of its stochastic variants (short-EM, SEM, CEM), and the REBMIX algorithm implemented in the rebmix package.

The rebmix algorithm can thus be seen as a natural extension of the quantile method, with more rigorous statistical support. Two drawbacks of the algorithm are the need for intensive calibration of its hyperparameters and its inadequacy for the estimation of highly overlapping or high-dimensional mixture distributions.
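As an example of one of these strategies (our own sketch, not the interface of any reviewed package), an initial estimate \(\theta_0\) can be derived from a k-means partition obtained with base R’s kmeans() and passed to the EM sketch introduced earlier:

```r
## k-means initialisation: derive theta0 from a preliminary hard partition
km <- kmeans(x, centers = 2, nstart = 10)
theta0_km <- list(
  p     = as.numeric(table(km$cluster)) / length(x),
  mu    = as.numeric(tapply(x, km$cluster, mean)),
  sigma = as.numeric(tapply(x, km$cluster, sd))
)
fit_km <- em_gmm_univ(x, theta0_km)
```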

Following this theoretical introduction, we empirically evaluate the performance of the aforementioned R packages, considering various initialisation algorithms and the complexity of the GMM distributions. Precisely, we outline the simulation framework used to compare the seven packages in Methods and report the results in Results. We conclude by providing a general, simplified framework helping users select the combination of package and initialisation method best suited to their objectives and to the nature of the distribution of their dataset.

2 A comprehensive benchmark comparing estimation performance of GMMs

We searched the CRAN and Bioconductor mirrors for packages that can retrieve the parameters of GMMs. Briefly, out of 54 packages dealing with GMM estimation, we focused on seven packages that estimate the MLE of GMMs using the EM algorithm, were recently updated and allow users to specify their own initial estimates: bgmm, EMCluster, flexmix, GMKMcharlie, mclust, mixtools and Rmixmod. The complete inclusion process is detailed in Appendix C, which describes the meta-analysis workflow applied to the CRAN and Bioconductor platforms for the final selection. The flowchart summarising our choices is represented in Figure 1.

Figure 1: A minimal roadmap used for the selection of the packages reviewed in our benchmark. From root to top, the flowchart schematically describes the filtering process leading to the final selection of the reviewed packages.

We also include two additional packages dedicated specifically to high-dimensional settings, namely EMMIXmfa (Rathnayake et al. 2019) and HDclassif (Berge et al. 2019), to compare their performance with standard multivariate approaches in complex but non-degenerate cases. We summarise the main features and use cases of the seven + two reviewed packages in Table 1. The three most commonly used packages are mixtools, mclust and flexmix. However, the mclust package is by far the most complete, with many features provided to visualise and evaluate the quality of the GMM estimate. bgmm has the greatest number of dependencies, while mclust only depends on base R packages. Finally, beyond clustering tasks, the flexmix and mixtools packages also perform mixture regression and implement mixture models based on other parametric distributions or on non-parametric kernel density estimation.
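For orientation, the sketch below fits the same two-component univariate GMM with two of the reviewed packages on the data simulated earlier in this section; the argument names reflect the package documentation for the versions listed in Table 1 and may evolve:

```r
## mclust: model "V" is the unconstrained (heteroscedastic) univariate family
library(mclust)
fit_mclust <- Mclust(x, G = 2, modelNames = "V")
summary(fit_mclust, parameters = TRUE)

## mixtools: normalmixEM() returns lambda (mixing weights), mu and sigma
library(mixtools)
fit_mixtools <- normalmixEM(x, k = 2)
fit_mixtools[c("lambda", "mu", "sigma")]
```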

Table 1: Main features of the reviewed packages, sorted by decreasing number of daily downloads. Downloads per day is the average daily number of downloads of each package over the last two years. The Recursive dependencies column counts the complete set of non-base packages required, since first-order dependencies themselves depend on other packages.

| Package | Version | Regression | Implemented models | Downloads per day | Last update | Imports | Recursive dependencies | Language |
|---|---|---|---|---|---|---|---|---|
| mclust | 5.4.7 |  |  | 5,223 | 31/10/2022 | R (≥ 3.0) | 0 | Fortran |
| flexmix | 2.3-17 | ✓ | Poisson, binary, non-parametric, semi-parametric | 3,852 | 07/06/2022 | R (≥ 2.15.0), modeltools, nnet, stats4 | 3 | R |
| mixtools | 1.2.0 | ✓ | multinomial, gamma, Weibull, non-parametric, semi-parametric | 178 | 05/02/2022 | R (≥ 3.5.0), kernlab, segmented, survival | 6 | C |
| Rmixmod | 2.1.5 |  |  | 39 | 18/10/2022 | R (≥ 2.12.0), Rcpp, RcppEigen | 4 | C++ |
| EMCluster | 0.2-13 |  |  | 33 | 12/08/2022 | R (≥ 3.0.1), Matrix | 3 | C |
| bgmm | 1.8.4 |  |  | 27 | 10/10/2021 | R (≥ 2.0), mvtnorm, combinat | 77 | R |
| GMKMcharlie | 1.1.1 |  |  | 12 | 29/05/2021 | Rcpp, RcppParallel, RcppArmadillo | 3 | C++ |
| EMMIXmfa | 2.0.11 |  |  | 12 | 16/12/2019 |  | 0 | C |
| HDclassif | 2.2.0 |  |  | 35 | 12/10/2022 | rARPACK | 13 | R |

We further detail the features specifically related to GMMs in Table 2 and describe its content row by row below:

The high-dimensional packages provide specific representations adapted to high-dimensional settings, notably allowing the user to visualise the projected factorial representation of their dataset in a two- or three-dimensional subspace. They also provide specialised performance plots, notably scree plots or BIC scatter plots, to represent numerous projections and parametrisations in a compact way.

Table 2: Features associated with GMM estimation for each of the benchmarked packages.

| Package features | mclust | flexmix | mixtools | Rmixmod | EMCluster | bgmm | GMKMcharlie | EMMIXmfa | HDclassif |
|---|---|---|---|---|---|---|---|---|---|
| Models Available (univariate) | canonical | unconstrained | canonical | canonical | unconstrained | canonical | unconstrained |  |  |
| Models Available (multivariate) | canonical | unconstrained diagonal or general | unconstrained | canonical | unconstrained | 4 models (diagonal and general, either component-specific or global) | unconstrained | 4 models (either component-wise or common, on the intrinsic and diagonal residual error covariance matrices) | canonical on projected dimension |
| Variants of the EM algorithm | VBEM | SEM, CEM | ECM | SEM, CEM |  |  | CW-EM, MML | AECM | SEM, CEM |
| Initialisation | hierarchical clustering, quantile | short-EM, random | random | random, short-EM, CEM, SEM | random, short-EM | k-means, quantile | k-means | k-means, random, heuristic | short-EM, random, k-means |
| Model or Cluster Selection | BIC, ICL, LRTS | AIC, BIC, ICL | AIC, BIC, ICL, CAIC, LRTS | BIC, ICL, NEC | AIC, BIC, ICL, CLC | GIC |  |  | BIC, ICL, CV |
| Bootstrap Confidence Intervals |  |  |  |  |  |  |  |  |  |