Gaussian mixture models (GMMs) are widely used for modelling stochastic problems, and a wide diversity of packages has been developed in R. However, no recent review describing the main features offered by these packages and comparing their performances has been conducted. In this article, we first introduce GMMs and the EM algorithm used to retrieve the parameters of the model, and analyse the main features implemented in seven of the most widely used R packages. We then empirically compare their statistical and computational performances in relation to the choice of the initialisation algorithm and the complexity of the mixture. We demonstrate that the best estimation with well-separated components, or with a small number of components with distinguishable modes, is obtained with REBMIX initialisation, implemented in the rebmix package, while the best estimation with highly overlapping components is obtained with *k*-means or random initialisation. Importantly, we show that implementation details in the EM algorithm yield differences in the parameters’ estimation. In particular, the packages mixtools (Young et al. 2020) and Rmixmod (Langrognet et al. 2021) estimate the parameters of the mixture with smaller bias, while the RMSE and variability of the estimates are smaller with the packages bgmm (Ewa Szczurek 2021), EMCluster (W.-C. Chen and Maitra 2022), GMKMcharlie (Liu 2021), flexmix (Gruen and Leisch 2022) and mclust (Fraley, Raftery, and Scrucca 2022). The comparison of these packages provides R users with useful recommendations for improving the computational and statistical performance of their clustering and for identifying common deficiencies. Additionally, we propose several improvements for the development of a future, unified mixture model package.

Formally, let’s consider a pair of random variables \((X,S)\), with \(S \in \{1, \ldots, k\}\) a discrete variable designating the component identity of each observation. When observed, \(S\) is generally referred to as the labels of the individual observations, and \(k\) is the number of mixture *components*. The density distribution of \(X\) is then given in Equation (1):

\[\begin{equation} f(x|\theta)=\sum_{j=1}^k p_j f_{\zeta_j}(x) \tag{1} \end{equation}\]

where \(\theta = (p, \zeta) = (p_1, \ldots, p_k, \zeta_1, \ldots, \zeta_k)\) denotes the parameters of the model: \(p_j\) is the proportion of component \(j\) and \(\zeta_j\) represents the parameters of the density distribution followed by component \(j\). In addition, since \(S\) is a categorical variable parametrized by \(p\), the prior weights must enforce the unit simplex constraint (Equation (2)):

\[\begin{equation} \begin{cases} p_j \ge 0 \quad \forall j \in \{1, \ldots, k \}\\ \sum_{j=1}^k p_j =1 \end{cases} \tag{2} \end{equation}\]

In terms of applications, mixture models can be used to achieve the following goals:

*Clustering*: hard clustering consists in determining a complete partition of the \(n\) observations \(x_{1:n}\) into \(k\) disjoint non-empty subsets. In the context of *mixture model-based clustering*, this is done by assigning each observation \(i\) to the cluster \(\hat{s_i}=\arg \max_j \eta_{i} (j)\) that maximises the posterior distribution (MAP) (see Equation (3)):

\[\begin{equation} \eta_i(j) := \mathbb{P}_{\theta}(S_i=j | X_i=x_i) = \frac{p_j f_{\zeta_j}(x_i)}{\sum_{l=1}^k p_l f_{\zeta_l}(x_i)} \tag{3} \end{equation}\]

*Prediction*: the purpose is to predict a response variable \(Y\) from an explanatory variable \(X\). The dependent variable \(Y\) can either be discrete, taking values in classes \(\{1, \ldots, G\}\) (*classification* task), or continuous (*regression* task). In this paper, we do not extensively discuss the application of mixture models to regression but refer the reader to Bouveyron and Girard (2009) for mixture classification and to Shimizu and Kaneko (2020) for mixtures of regression models.

In section Univariate and multivariate Gaussian distributions in the context of mixture models, we describe the most
commonly used family, the Gaussian Mixture Model (GMM). We then present
the MLE estimation of the parameters of a GMM, introducing the classic
EM algorithm in section Parameter estimation in finite mixtures models. Finally, we introduce bootstrap methods used to evaluate the quality of the estimation and metrics used for the selection of the best model in Appendices *Derivation of confidence intervals in GMMs* and *Model selection*, respectively.

We focus our study on finite Gaussian mixture models (GMMs), in which we suppose that each of the \(k\) components follows a Gaussian distribution.

We recall below the definition of the Gaussian distribution in both the univariate and multivariate contexts. In the finite univariate Gaussian mixture model, the distribution of each component, \(f_{\zeta_j}(X)\), is given by the following univariate Gaussian p.d.f. (probability density function) (Equation (4)):

\[\begin{equation} f_{\zeta_j}(X=x)=\varphi_{\zeta_j}(x | \mu_j, \sigma_j):=\frac{1}{\sqrt{2\pi} \sigma_j} \exp\left(- \frac{(x - \mu_j)^2}{2 \sigma_j^2}\right) \tag{4} \end{equation}\] which we note: \(X \sim \mathcal{N}(\mu_j, \sigma_j)\).

In the univariate case, the parameters to be inferred for each component, \(\zeta_j\), are: \(\mu_j\), the *location* parameter (equal to the mean of the distribution), and \(\sigma_j\), the *scale* parameter (equal to the standard deviation for a Gaussian distribution).
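
As a concrete illustration, the mixture density of Equation (1) with univariate Gaussian components (Equation (4)) can be evaluated directly with base R's `dnorm()`; the parameter values below are arbitrary, purely illustrative choices.

```r
# Two-component univariate GMM density (Equations (1) and (4));
# all parameter values are arbitrary, illustrative choices.
p     <- c(0.3, 0.7)    # mixing proportions, satisfying Equation (2)
mu    <- c(-1, 2)       # component means (location parameters)
sigma <- c(0.5, 1.5)    # component standard deviations (scale parameters)

dgmm <- function(x) {
  p[1] * dnorm(x, mu[1], sigma[1]) + p[2] * dnorm(x, mu[2], sigma[2])
}
curve(dgmm, from = -4, to = 8, ylab = "mixture density")
```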

The following parsimonious parametrisations of univariate GMMs are often considered:

- *homoscedasticity*: the variance is assumed equal for all components, \(\sigma_j = \sigma, \forall j \in \{1, \ldots, k \}\), as opposed to heteroscedasticity, where each sub-population has its own variability;
- *equi-proportion* among all components: \(p_j = \frac{1}{k} \quad j \in \{ 1, \ldots, k\}\)^{1}

In the finite multivariate Gaussian mixture model, the distribution \(f_{\zeta_j}(\boldsymbol{X})\) of each component \(j\), where \(\boldsymbol{X} =(X_1, \ldots, X_D)^\top \in \mathbb{R}^D\) is a multivariate random variable of dimension \(D\), is given by the following multivariate Gaussian p.d.f. (Equation (5)):

\[\begin{equation} f_{\zeta_j}(\boldsymbol{X}=\boldsymbol{x})=\operatorname{det}(2\pi\boldsymbol{\Sigma}_j)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_j)^\top \boldsymbol{\Sigma}_j^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_j)\right) \tag{5} \end{equation}\] which we note \(\boldsymbol{X} \sim \mathcal{N}_D(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\). The parameters to be estimated for each component can be decomposed into:

- \(\boldsymbol{\mu}_j=\begin{pmatrix} \mu_{1j} \\ \vdots \\ \mu_{Dj} \end{pmatrix} \in \mathbb{R}^D\), the \(D\)-dimensional mean vector;

- \(\boldsymbol{\Sigma}_j\), the \(\mathcal{M}_D(\mathbb{R})\) positive-definite covariance matrix^{2}, whose diagonal terms are the individual variances of each feature and whose off-diagonal terms are the pairwise covariance terms.

Three families of multivariate GMMs are often considered:

- the *spherical* family, \(\boldsymbol{\Sigma}_j=\sigma_j^2 \boldsymbol{I}_D\), with \(\sigma_j \in \mathbb{R}_{+}^*\), refers to GMMs whose covariance matrix is diagonal with a unique standard deviation term. The corresponding volume representation is a \(D\)-*hypersphere* of radius \(\sigma_j\).
- the *diagonal* family, \(\boldsymbol{\Sigma}_j=\operatorname{diag} \left(\sigma_{1j}^2, \ldots, \sigma_{Dj}^2\right)\), with \(\sigma_j \in \mathbb{R}_{+}^D\), refers to GMMs whose covariance matrix is diagonal. Its associated volume representation is an ellipsoid whose main axes are aligned with the canonical basis of \(\mathbb{R}^D\). Of note, the null constraint imposed on the off-diagonal terms in the spherical and diagonal families implies that the multivariate distribution can be further decomposed and analysed as the product of univariate independent Gaussian realisations.
- the *ellipsoidal* family, also named the *general* family, refers to GMMs whose covariance matrix, \(\boldsymbol{\Sigma}_j\), can be any arbitrary positive-definite \(D \times D\) matrix. The corresponding clusters for each component \(j\) are thus ellipsoidal, centred at the mean vector \(\boldsymbol{\mu}_j\), with volume and orientation determined respectively by the eigenvalues and eigenvectors of the covariance matrix \(\boldsymbol{\Sigma}_j\).
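
The three families can be made concrete by evaluating the density of Equation (5) under each covariance structure. The sketch below uses the mvtnorm package (not one of the reviewed packages, but listed among bgmm's dependencies in Table 1); all numeric values are arbitrary.

```r
library(mvtnorm)  # dmvnorm() evaluates the multivariate Gaussian p.d.f. of Equation (5)

D  <- 2
mu <- c(0, 0)

Sigma_spherical <- 1.5^2 * diag(D)               # Sigma_j = sigma_j^2 * I_D
Sigma_diagonal  <- diag(c(1.5^2, 0.5^2))         # axis-aligned ellipsoid
A <- matrix(c(1.5, 0.4, 0.4, 0.8), nrow = 2)     # arbitrary full-rank matrix
Sigma_general   <- crossprod(A)                  # any positive-definite matrix

x0 <- c(1, -0.5)
dmvnorm(x0, mean = mu, sigma = Sigma_spherical)
dmvnorm(x0, mean = mu, sigma = Sigma_diagonal)
dmvnorm(x0, mean = mu, sigma = Sigma_general)
```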

In the multivariate setting, the volume, shape, and orientation of the covariances can be constrained to be equal or variable across clusters, generating 14 possible parametrisations with different geometric characteristics (Celeux and Govaert 1992; Banfield and Raftery 1993). We review them in Appendix *Parameters estimation in a high-dimensional context* and Table 5. Of note, the
correlation matrix can be easily derived from the covariance
matrix with the following normalisation:
\(\operatorname{cor}(\boldsymbol{X})=\left(\frac{\operatorname{cov}(x_l, x_m)}{\sqrt{\operatorname{var}(x_l)} \times \sqrt{\operatorname{var}(x_m)}}\right)_{(l,m) \in \{1, \ldots, D\}^2}\). The correlation lies between -1 and 1: the strength of the correlation is given by its absolute value, while the direction of the association is given by its sign. A correlation of 1 or -1 between two features indicates a strictly linear relationship.

For the sake of simplicity and tractability, we will only consider the fully unconstrained model in both the univariate setting (heteroscedastic and unbalanced classes) and the multivariate setting (unbalanced classes and complete covariance matrices for each cluster) in the remainder of our paper.

A common way for estimating the parameters of a parametric distribution is
the *maximum likelihood estimation* (MLE) method. It consists in
estimating the parameters by maximising the likelihood, or
equivalently the log-likelihood of a sample. In what follows,
\(\ell(\theta|x_{1:n})=\log (f(x_{1:n}|\theta))\) is the log-likelihood of
a *n*-sample. When all observations are independent, it simplifies to
\(\ell(\theta|x_{1:n}) = \sum_{i=1}^n \log (f(x_i|\theta))\). The MLE
consists in finding the parameter estimate \(\hat{\theta}\) which maximises the
log-likelihood \(\hat{\theta} = \arg \max \ell (\theta | x_{1:n})\).

Recovering the maximum of a function is generally performed by
finding the values at which its derivative vanishes. The MLE in GMMs
has interesting properties, as opposed to the *moment estimation*
method: it is a consistent, asymptotically efficient and unbiased
estimator (McLachlan and Peel 2000; Chen 2016).

When \(S\) is completely observed, for pairs of observations \((x_{1:n}, s_{1:n})\), the log-likelihood of a finite mixture model is simply given by Equation (6):

\[\begin{equation} \ell(\theta|X_{1:n}=x_{1:n}, S_{1:n}=s_{1:n})=\sum_{i=1}^n \sum_{j=1}^k \mathbf{1}_{\{s_i=j\}}\left[\log\left(f_{\zeta_j} (x_i)\right) + \log(p_j) \right] \tag{6} \end{equation}\] where an analytical solution can be computed provided that a closed-form estimate exists to retrieve the parameters \(\zeta_j\) of each component’s parametric distribution. The MLE maximisation, in this context, involves the estimation of the parameters for each cluster, denoted as \(\zeta_j\). The corresponding proportions, \(p_j\), can be straightforwardly computed as the ratios of observations assigned to cluster \(j\) relative to the total number of observations, \(n\).
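
When the labels are observed, these closed-form estimates can be computed in a few lines; below is a minimal sketch for the univariate case, assuming the observations `x` and labels `s` are given.

```r
# Closed-form MLE of a univariate GMM when the labels s are observed (Equation (6)).
# Note: sd() uses the unbiased 1/(n-1) denominator rather than the strict 1/n MLE.
mle_complete <- function(x, s) {
  list(p     = as.numeric(table(s)) / length(x),   # proportions = cluster counts / n
       mu    = as.numeric(tapply(x, s, mean)),     # per-cluster sample means
       sigma = as.numeric(tapply(x, s, sd)))       # per-cluster standard deviations
}
```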

However, when \(S\) is unobserved, the log-likelihood, qualified as incomplete with respect to the previous case, is given by Equation (7):

\[\begin{equation} \ell (\theta \vert x_{1:n}) = \sum_{i=1}^n \log \left( \underbrace{\sum_{j=1}^k p_j f_{\zeta_j}(x_i)}_{\text{sum within the log}} \right) \tag{7} \end{equation}\] The sum of terms embedded in the log function (see the underbrace in Equation (7)) makes it intractable in practice to derive the roots of its derivative. Thus, no closed form of the MLE is available,
including for the basic univariate GMM model. This is why most
parameter estimation methods derive instead from the *EM algorithm*,
first described in Dempster et al. (1977). We describe its main theoretical properties, the reasons for its popularity, and its main limitations in the next section.
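
For reference, Equation (7) translates directly into a few lines of R; `theta` follows the `list(p, mu, sigma)` convention used in the previous sketches.

```r
# Incomplete log-likelihood of a univariate GMM (Equation (7))
loglik_gmm <- function(x, theta) {
  k <- length(theta$p)
  weighted_dens <- sapply(seq_len(k), function(j)
    theta$p[j] * dnorm(x, theta$mu[j], theta$sigma[j]))
  sum(log(rowSums(weighted_dens)))   # log of a sum: no closed-form maximiser
}
```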

In cases where both \(S\) and the parameters associated to each cluster are unknown, there is no available closed-form solution that would jointly maximise the log-likelihood, as defined in Equation (7), with respect to the set of parameters \(({\theta}, S)\).
However, when either \(S\) or \(\theta\) are known, the estimation of the
other parameters is straightforward. Hence, the general principle of
EM-like algorithms is splitting this complex non-closed joint MLE
estimation of \((S, \theta)\) into the iterative estimation of \(S_q\) from
\(\hat{\theta}_{q-1}\) and \(X\) (expectation phase, or *E-step* of the algorithm)
and the estimation of \(\hat{\theta}_{q}\) from \(S_q\) and \(X\) (maximisation phase, or *M-step*), with \(\hat{\theta}_{q-1}\) being the estimated parameters at the previous step \(q-1\), until convergence is reached.

The EM algorithm sets itself apart from other commonly used methods by taking
into account all possible values taken by the latent variable \(S\). To do
so, it computes the expected value of the log likelihood of \(\theta\),
conditioned by the posterior distribution
\(\mathbb{P}_{\hat{\theta}_{q-1}} (S|X)\); this expected value is also known as the *auxiliary function*. Using the assumption of independence among observations in a mixture model, the general formula of this proxy function of the incomplete log-likelihood is given in finite mixture models by Equation (8):

\[\begin{equation} Q(\theta | \hat{\theta}_{q-1}) := \mathbb{E}_{S \sim \mathbb{P}_{\hat{\theta}_{q-1}}(\cdot|X)}\left[\ell(\theta | x_{1:n}, S)\right] = \sum_{i=1}^n \sum_{j=1}^k \eta_{i}(j) \left[\log(p_j) + \log\left(f_{\zeta_j}(x_i)\right)\right] \tag{8} \end{equation}\]

with \(\hat{\theta}_{q-1}=\hat{\theta}\) the current estimated parameter value.

In practice, the EM algorithm consists in alternately performing the E-step and the M-step until convergence, as described in the pseudocode below (Box 1):

*step E*: determine the posterior probability function \(\eta_i(j)\) for each observation of \(X\) and each possible discrete latent class, using the initial estimates \(\hat{\theta}_0\) at step \(q=0\), or the previously computed estimates \(\hat{\theta}_{q-1}\). The general formula is given by Equation (9):

\[\begin{equation} \eta_{i}(j) = \frac{\hat{p}_j \, f_{\hat{\zeta}_j}(x_i)}{\sum_{l=1}^k \hat{p}_l \, f_{\hat{\zeta}_l}(x_i)}, \quad \text{with } \hat{\theta}_{q-1} = (\hat{p}, \hat{\zeta}) \tag{9} \end{equation}\]

*step M*: compute the mapping function \(\hat{\theta}_q:=M(\theta | \hat{\theta}_{q-1})=\arg \max_{\theta} Q(\theta| \hat{\theta}_{q-1})\) which maximises the auxiliary function. One way of retrieving the MLE associated to the auxiliary function is to determine the roots of its derivative, namely solving Equation (10)^{3}:

\[\begin{equation} \frac{\partial Q(\theta | \hat{\theta}_{q-1})}{\partial \theta} = 0 \tag{10} \end{equation}\]
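
To make the two steps concrete, below is a minimal, illustrative EM implementation for an unconstrained univariate GMM. It is a sketch only: the reviewed packages add numerical safeguards (e.g. against degenerate variances), parsimonious parametrisations and more elaborate stopping rules. `theta0` follows the `list(p, mu, sigma)` convention of the earlier sketches.

```r
em_gmm_univariate <- function(x, k, theta0, max_iter = 500, tol = 1e-8) {
  theta      <- theta0
  loglik_old <- -Inf
  for (q in seq_len(max_iter)) {
    # E-step (Equation (9)): posterior probabilities eta[i, j]
    weighted_dens <- sapply(seq_len(k), function(j)
      theta$p[j] * dnorm(x, theta$mu[j], theta$sigma[j]))
    eta <- weighted_dens / rowSums(weighted_dens)

    # M-step (Equation (10)): closed-form weighted MLE updates
    n_j         <- colSums(eta)
    theta$p     <- n_j / length(x)
    theta$mu    <- colSums(eta * x) / n_j
    theta$sigma <- sqrt(colSums(eta * outer(x, theta$mu, "-")^2) / n_j)

    # Convergence check on the incomplete log-likelihood (Equation (7))
    loglik <- sum(log(rowSums(weighted_dens)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(theta = theta, loglik = loglik, iterations = q)
}
```

For instance, `em_gmm_univariate(x, k = 2, theta0 = list(p = c(0.5, 0.5), mu = c(0, 3), sigma = c(1, 1)))` would run the algorithm on a sample `x` from a fixed starting point.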

Interestingly, the decomposition of the incomplete log-likelihood
associated to a mixture model \(\ell(\theta|X)\) reveals an entropy term
and the so-called auxiliary function (Dempster et al. 1977). It can be used to prove that
maximising the auxiliary function at each step induces a bounded
increase of the incomplete log-likelihood. Namely, the convergence of
the EM algorithm, defined by comparisons of consecutive log-likelihood values,
is guaranteed, provided the mapping function returns the maximum of the
auxiliary function. Yet, the convergence of the series of estimated
parameters,
\((\theta_q)_{q \ge 0} \underset{q\to +\infty}{\longrightarrow} \hat{\theta}\),
is harder to prove but has been formally demonstrated for the *exponential family* (a superset of the Gaussian family), as stated in Dempster et al. (1977).

Additionally, the EM algorithm is *deterministic*, meaning that for a
given initial estimate \(\theta_0\) the parameters returned by the
algorithm at a given step \(q\) are fixed. However, this method requires the user to provide an initial estimate, denoted as \(\theta_0\), of the model parameters and to specify the number of components in the mixture. We review some classic
initialisation methods in Initialisation of the EM algorithm and some
algorithms used to overcome the main limitations of the EM
algorithm in the Appendix *Extensions of the EM algorithm to overcome its limitations*.

Finally, the prevalent choice of Gaussian distributions to characterise the distribution of random observations is guided by a set of interesting properties. In particular, Geary (1936) has shown that the Normal distribution is the only distribution for which Cochran’s theorem (Cochran 1934) is guaranteed, namely for which the mean and variance of the sample are independent of each other. Additionally, as for any distribution belonging to the exponential family, the MLE statistic is *sufficient*^{4}.

EM-like algorithms require an initial estimate of the parameters, \(\theta_0\), to
optimise the maximum likelihood. *Initialisation* is a crucial step, as
a bad initialisation can possibly lead to a local sub-optimal solution
or trap the algorithm in the boundary of the parameter space. The most straightforward initialisation methods, such as random initialisation, are standalone and do not require any additional initialisation algorithms, whereas *meta-methods*, such as short-EM, still need to be initialised by alternative methods. The commonly-used initialisation methods encompass:

- The *Model-based Hierarchical Agglomerative Clustering* (MBHC) is an agglomerative hierarchical clustering based on MLE criteria applied to GMMs (Scrucca and Raftery 2015). First, the MBHC is initialised by assigning each observation to its own cluster. Next, the pair of clusters that maximises the likelihood of the underlying statistical model among all possible pairs is merged. This procedure is repeated until all clusters are merged. The final resulting clusters are then simply the last \(k\) cuts of the resulting dendrogram. When the data is univariate and homoscedastic, or when the underlying distribution has a diagonal covariance matrix, the merging criterion performs similarly to *Ward’s criterion*, in that merging the two clusters also simultaneously minimises the sum of squares. As opposed to the other initialisation methods described hereafter, MBHC is a deterministic method which does not require careful calibration of hyperparameters. However, as acknowledged by the author of the method (Fraley 1998), the resulting partitions are generally suboptimal compared to other initialisation methods.
- The conventional *random* initialisation method, frequently employed for the initialisation step of the *k*-means algorithm, involves the random selection of \(k\) distinct observations, which are referred to as *centroids*. Subsequently, each observation is assigned to the nearest centroid, a process reminiscent of the C-step in the CEM algorithm (Biernacki et al. 2003). This is the method used in this paper, unless otherwise stated. Alternative versions of this method have been developed: for instance, the package mixtools draws the proportions of the components from a Dirichlet distribution, whose main advantage lies in respecting the unit simplex constraint (Equation (2))^{5}, but uses binning methods to guess the means and standard deviations of the components. Similarly, Kwedlo (2013) proposes a method in which the means of the components are randomly chosen, but with an additional constraint of maximising the Mahalanobis distance between the selected centroids. This enables the algorithm to cover a larger portion of the parameter space.
- *k*-means is a CEM algorithm in which the additional assumption of balanced classes and homoscedasticity implies that each observation in the E-step is assigned to the cluster with the nearest mean (the one with the shortest Euclidean distance). *k*-means is initialised by randomly selecting \(k\) points, known as the *centroids*. It is often chosen for its fast convergence and low memory consumption (a minimal sketch of deriving \(\theta_0\) from a *k*-means partition is given at the end of this list).

- The *quantile* method sorts each observation \(x_i\) in increasing order and splits them into equi-balanced quantiles of size \(1/k\). Then, all observations for a given quantile are assumed to belong to the same component.

- The *Rough-Enhanced-Bayes mixture* (REBMIX) algorithm is implemented in the rebmix (Nagode 2022) package, and the complete pseudo-code is described thoroughly in (Nagode 2015; Panic et al. 2020). The key stages implemented by the rebmix algorithm for initialising the parameters of GMMs encompass:
  - First, the observations are processed using one of these three methods: *k*-nearest neighbours (KNN), Parzen kernel density estimation, or binned intervals. With the binned interval method, the observations are initially divided into \(\sqrt{n}^D\) intervals of equal lengths. The mode of one of the components’ distributions is subsequently determined by the midpoint of the interval with the highest frequency. The observations lying within the interval are used as preliminary estimates, referred to as “rough” parameters in Nagode (2015).
  - All other observations and intervals are then iteratively assigned to the currently estimated component or to residual components, the ones that have not yet been characterised. The decision to assign an interval to either the currently estimated component or one of the residual components depends on the magnitude of the discrepancy between the observed and the expected frequency within the interval.
  - Finally, all intervals assigned to the currently estimated component (and not only the interval including the mode of the distribution) are used to determine the parameters of the associated Gaussian distribution. Since this step relies on a more comprehensive number of observations for parameter estimation, guaranteeing in principle more robust estimates, this stage is referred to as “enhanced” estimation in Nagode (2015). The algorithm terminates when all intervals have been assigned to a cluster and the parameters of the various distribution components have been estimated.

  The rebmix algorithm can thus be seen as a natural extension of the quantile method, with more rigorous statistical support. Two drawbacks of the algorithm include the need for intensive calibration of hyperparameters and its inadequacy for the estimation of highly overlapping or high-dimensional mixture distributions^{6}.

- The *meta-methods* consist generally in short runs of EM-like algorithms, namely CEM, SEM and EM (see Appendix B: *Extensions of the EM algorithm to overcome its limitations*), with an alleviated convergence criterion. The main idea is to use several random initial estimates with shorter runs of the algorithm to explore larger regions of the parameter space and avoid being trapped in a local maximum. Yet, these methods are highly dependent on the choice of the initialisation algorithm (Biernacki et al. 2003).

In the high-dimensional setting, if the number of dimensions \(D\) exceeds the number of observations \(n\), all previous methods must be adjusted, usually by first projecting the dataset into a smaller, suitable subspace and then inferring prior parameters in it. In particular, EMMIXmfa, in the mixture of common factor analysers (MCFA) approach, initialises the shared projection matrix \(\boldsymbol{Q}\) by either keeping the first \(d\) eigenvectors generated from standard principal component analysis or using custom random initialisations (Baek et al. 2010).
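
As announced in the *k*-means item above, here is a minimal sketch of how an initial estimate \(\theta_0\) can be derived from a *k*-means partition in the univariate case; it mirrors what several of the reviewed packages do internally, without reproducing any particular implementation.

```r
# Deriving theta_0 from a k-means partition (univariate, unconstrained GMM).
kmeans_init <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 10)   # base R k-means, 10 random restarts
  list(p     = as.numeric(table(km$cluster)) / length(x),
       mu    = as.numeric(tapply(x, km$cluster, mean)),
       sigma = as.numeric(tapply(x, km$cluster, sd)))
}

# theta0 <- kmeans_init(x, k = 2)   # can then be passed to em_gmm_univariate() above
```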

Following this theoretical introduction, we empirically evaluate the performance of the aforementioned R packages, considering various initialisation algorithms and the complexity of the GMM distributions. Specifically, we outline the simulation framework used to compare the seven packages in Methods and report the results in Results. We conclude by providing a general, simplified framework to select the combination of package and initialisation method best suited to the user’s objectives and the nature of the distribution of the dataset.

We searched CRAN and Bioconductor mirrors for packages that can retrieve
the parameters of GMMs. Briefly, out of 54 packages dealing with GMM
estimation, we focused on seven packages that all estimate the
MLE in GMMs using the EM algorithm, were recently
updated and allow the users to specify their own initial estimates:
bgmm, EMCluster, flexmix,
GMKMcharlie, mclust, mixtools and Rmixmod. The
complete inclusion process is detailed in Appendix C, *the meta-analysis workflow for the final selection of CRAN and Bioconductor platforms*. The flowchart summarising our choices is represented in Figure 1.

We also include two additional packages dedicated specifically to high-dimensional settings, namely EMMIXmfa (Rathnayake et al. 2019) and HDclassif (Berge et al. 2019), to compare their performance with standard multivariate approaches in complex but non-degenerate cases. We summarise the main features and use cases of the seven + two reviewed packages in Table 1. The three most commonly used packages are mixtools, mclust and flexmix. However, the mclust package is by far the most complete, with many features provided to visualise and evaluate the quality of the GMM estimate. bgmm has the greatest number of dependencies, while mclust depends only on base R packages. Finally, beyond clustering tasks, the flexmix and mixtools packages perform regression of mixtures and implement mixture models using other parametric distributions or non-parametric methods via kernel-density estimation.
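
To illustrate how the package interfaces differ, the sketch below fits the same simulated univariate two-component sample with two of the reviewed packages; the simulated data and chosen settings are arbitrary and do not reproduce the benchmark protocol described in Methods.

```r
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(50, mean = 4, sd = 0.5))

library(mclust)
fit_mclust <- Mclust(x, G = 2, modelNames = "V")   # "V": unequal (heteroscedastic) variances
fit_mclust$parameters                              # estimated proportions, means, variances

library(mixtools)
fit_mixtools <- normalmixEM(x, k = 2)              # EM for univariate Gaussian mixtures
c(fit_mixtools$lambda, fit_mixtools$mu, fit_mixtools$sigma)
```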

Package | Version | Regression | Implemented models | Downloads per day | Last update | Imports | Recursive dependencies | Language |
---|---|---|---|---|---|---|---|---|
mclust | 5.4.7 | | | 5,223 | 31/10/2022 | R (≥ 3.0) | 0 | Fortran |
flexmix | 2.3-17 | | Poisson, binary, non-parametric, semi-parametric | 3,852 | 07/06/2022 | R (≥ 2.15.0), modeltools, nnet, stats4 | 3 | R |
mixtools | 1.2.0 | | multinomial, gamma, Weibull, non-parametric, semi-parametric | 178 | 05/02/2022 | R (≥ 3.5.0), kernlab, segmented, survival | 6 | C |
Rmixmod | 2.1.5 | | | 39 | 18/10/2022 | R (≥ 2.12.0), Rcpp, RcppEigen | 4 | C++ |
EMCluster | 0.2-13 | | | 33 | 12/08/2022 | R (≥ 3.0.1), Matrix | 3 | C |
bgmm | 1.8.4 | | | 27 | 10/10/2021 | R (≥ 2.0), mvtnorm, combinat | 77 | R |
GMKMcharlie | 1.1.1 | | | 12 | 29/05/2021 | Rcpp, RcppParallel, RcppArmadillo | 3 | C++ |
EMMIXmfa | 2.0.11 | | | 12 | 16/12/2019 | | 0 | C |
HDclassif | 2.2.0 | | | 35 | 12/10/2022 | rARPACK | 13 | R |

We further detail features specifically related to GMMs in Table 2, and describe its content row by row below:

- The parametrisations used to provide parsimonious estimation of the GMMs are reviewed in Parameter estimation in finite mixtures models and summarised in rows 1 and 2 of Table 2 for the univariate and multivariate settings. We refer to a package as “canonical” when it implements both homoscedastic and heteroscedastic parametrisations in the univariate setting, and the 14 parametrisations listed in Supplementary Table 3 in the multivariate setting. Adding the additional constraint of equi-balanced clusters results in a total of \(2 \times 2=4\) parametrisations in the univariate setting and \(14 \times 2=28\) distinct models in the multivariate setting. Since EMMIXmfa and HDclassif are dedicated to the analysis of high-dimensional datasets, they project the observations into a smaller subspace and are not available in the univariate setting. Given a user-defined or previously computed intrinsic dimension, we can imagine using any of the standard parametrisations available, for instance, in the mclust package and listed in Appendix *Parsimonious parametrisation of multivariate GMMs*. In addition, HDclassif allows each cluster \(j\) to be represented with its own subspace intrinsic dimension \(d_j\), as we describe in further detail in Appendix *Parameters estimation in a high-dimensional context*.
- The EM algorithm is the most commonly employed method for estimating the parameters of GMMs; alternative algorithms based on the EM framework are reviewed in Appendix B, *Extensions of the EM algorithm to overcome its limitations*, and in row 3 of Table 2. In particular, GMM estimation is strongly impacted by the presence of outliers, justifying a specific benchmark (see Appendix *A small simulation to evaluate the impact of outliers*). We briefly review the most common initialisation algorithms, a necessary and tedious task for both the EM algorithm and its alternatives, in section Initialisation of the EM algorithm and row 4 of Table 2.
- To select the best parametrisation and number of components fitting the mixture, several metrics are provided by the reviewed packages (*Model selection* and row 5). Due to the complexity of computing the true distribution of the estimated parameters, bootstrap methods are commonly used to derive confidence intervals (see Appendix *Derivation of confidence intervals in GMMs* and row 6 in Table 2).
- Six packages supply several functions for visualisation, summarised in the last row of Table 2, to display either the distributions corresponding to the estimated parameters or to compare quickly the performance across packages. However, mclust is by far the most complete one, with density plots (in the univariate setting) and isodensity plots (bi-dimensional in the bivariate setting, or in higher dimensions after appropriate dimensionality reduction), with the option to plot custom confidence intervals and critical regions, and finally bootstrap boxplot representations for displaying the distribution of the benchmarked estimated parameters.

High-dimensional packages provide specific representations adjusted to the high-dimensional setting, notably allowing the user to visualise the projected factorial representation of their dataset in a two- or three-dimensional subspace. They also provide specialised performance plots, notably scree plots or BIC scatter plots, to represent numerous projections and parametrisations in a compact way.
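
As an illustration of the model-selection criteria in row 5, the sketch below lets mclust compare several numbers of components and parametrisations by BIC on an arbitrary simulated sample; the other packages expose similar criteria through their own interfaces.

```r
set.seed(2)
x <- c(rnorm(200, 0, 1), rnorm(100, 5, 2))

library(mclust)
fit <- Mclust(x, G = 1:4, modelNames = c("E", "V"))  # grid over components and variance models
summary(fit)              # reports the parametrisation and number of components with best BIC
plot(fit, what = "BIC")   # BIC profile across the candidate models
```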

Package features | mclust | flexmix | mixtools | Rmixmod | EMCluster | bgmm | GMKMcharlie | EMMIXmfa | HDclassif |
---|---|---|---|---|---|---|---|---|---|
Models Available (univariate) | canonical | unconstrained | canonical | canonical | unconstrained | canonical | unconstrained ||
Models Available (multivariate) | canonical | unconstrained diagonal or general | unconstrained | canonical | unconstrained | 4 models (diagonal and general, either component specific or global) | unconstrained | 4 models (either component-wise or common, on the intrinsic and diagonal residual error covariance matrices) | canonical on projected dimension |
Variants of the EM algorithm | VBEM | SEM, CEM | ECM | SEM, CEM | CW-EM, MML | AECM | SEM, CEM ||
Initialisation | hierarchical clustering, quantile | short-EM, random | random | random, short-EM, CEM, SEM | random, short-EM | k-means, quantile | k-means | k-means, random, heuristic | short-EM, random, k-means |
Model or Cluster Selection | BIC, ICL, LRTS | AIC, BIC, ICL | AIC, BIC, ICL, CAIC, LRTS | BIC, ICL, NEC | AIC, BIC, ICL, CLC | GIC | BIC, ICL, CV ||
Bootstrap Confidence Intervals |