Model-based clustering is a popular technique for grouping objects based on a finite mixture model. It has countless applications in different fields of study. The R package ManlyMix implements the Manly mixture model that allows modeling skewness within data groups and performs cluster analysis. ManlyMix is a powerful diagnostics tool that is capable of conducting investigation concerning the normality of variables upon fitting of a Manly forward or backward model. Theoretical foundations as well as description of functions are provided. All features of the package are illustrated with examples in great detail. The analysis of real-life datasets demonstrates the flexibility and usefulness of the package.
Finite mixture models provide a powerful tool to model heterogeneous data. Their flexibility, close connection to cluster analysis, and interpretability make them increasingly appealing to researchers and practitioners these days. The applications of finite mixture modeling can be found in all fields, including medicine (Schlattmann 2009), transportation (Park and Lord 2009), dendrochronology (Michael and Melnykov 2016), and environment science (Gillespie and Neale 2006), just to name a few.
The Bayes decision rule, applied to posterior probabilities obtained in the course of fitting a mixture model, yields a clustering result. Such a procedure is called model-based clustering. It assumes the existence of a one-to-one correspondence between each distribution in the mixture model and underlying data group.
If all components in the model are Gaussian distributions, the mixture is called a Gaussian mixture model. Gaussian mixtures are very popular among practitioners due to their interpretability and simplicity. However, when there is severe skewness in data, Gaussian mixtures models do not provide a good fit to the data. As a result, model-based clustering might produce unsatisfactory results. In such cases, more flexible mixtures should be adopted. Some existing software packages that provide such functionality are listed in Table 1.
Package | Mixture components |
---|---|
flowClust (Lo et al. 2009) | |
mixsmsn (Prates et al. 2013) | scale skew-normal and skew- |
EMMIXskew (Wang et al. 2013) | restricted skew-normal and skew- |
EMMIXuskew (Lee and McLachlan 2014) | unrestricted skew- |
Here, mixsmsn,
EMMIXskew, and
EMMIXuskew packages
are based on skew-normal and skew-
The ManlyMix package
implements several functions associated with Manly mixture models
including the core function for running the EM algorithm, the forward
and backward model selection procedure for eliminating unnecessary
transformation parameters, and the Manly
providing an alternative approach to modeling heterogeneous skewed data;
calling core functions from C for speed;
providing excellent model interpretability through output of skewness parameters;
preventing overfitting of the data by implementing model selection algorithms;
offering effective assessment of mixture characteristics through the overlap calculation and variability assessment.
This paper is organized in the following way. A brief introduction to
the Expectation-Maximization (EM) algorithm for Manly mixture models as
well as the classification Expectation-Maximization (CEM) algorithm for
Manly
Consider a dataset
To find the MLE of the parameter vector
The EM algorithm could be started with an initial partition of the data
passed into the M-step. Or the E-step is run first with initial
parameters
In ManlyMix, the function Manly.EM()
runs the EM algorithm for a
Manly mixture model and returns estimated model parameters, posterior
probabilities, as well as a classification vector. This function is
constructed in C for computational efficiency.
Pairwise overlap, introduced by Maitra and Melnykov (2010), is a measure of
the interaction between two mixture components. If we denote
Manly.overlap()
estimates
The variability assessment of parameter estimates from Manly mixture
model can be made by taking the inverse of the empirical observed
information matrix
Manly.var()
in the package calculates
the covariance matrix based on an estimated model provided by function
Manly.EM()
.
In the Manly mixture model, there are
The selection is based on the Bayesian information criterion
(BIC) (Schwarz 1978), which is the most commonly used criterion in finite
mixture modeling (McLachlan and Peel 2000). The smaller BIC is, the better
fit provided by a mixture is. The forward selection procedure starts
from the Gaussian mixture model and adds one method = "forward"
or method = "backward"
in the Manly.select()
function.
Manly
It can be noticed that the traditional
In this way, each step of the Manly Manly.Kmeans()
. It can be used when
the number of data points in each cluster is about the same and the
transformed clusters are close to being spherical. It shows faster
performance as the inversion of potentially large covariance matrices is
not needed.
All functions available in the package ManlyMix are listed with brief descriptions in Table 2. In this section, we demonstrate the utility of each function through a synthetic dataset and the analysis of two real-life datasets: Iris (Anderson 1935; Fisher 1936) and AIS (Cook and Weisberg 1994).
Function | Description |
---|---|
Manly.EM() |
Runs the EM algorithm for a Manly mixture model |
Manly.select() |
Runs forward and backward selection methods for a Manly mixture |
model | |
Manly.Kmeans() |
Runs the Manly |
Manly.overlap() |
Estimates the overlap values for a Manly mixture |
Manly.sim() |
Simulates datasets from Manly mixture models |
Manly.var() |
Performs variability assessment of Manly mixture model parameter |
estimates and returns confidence intervals | |
Manly.plot() |
Constructs a plot to display model-fitting and clustering |
ClassAgree() |
Calculates the confusion matrix and number of misclassifications |
Manly.model() |
Serves as a wrapper function for Manly mixture modeling |
In this subsection, a Manly mixture is constructed with user-specified
parameters. The overlap values of this mixture is estimated through
function Manly.overlap()
. Then function Manly.sim()
simulates a
dataset from the mixture along with a true membership vector.
Step a: Mixture specification
Now we demonstrate the procedure to construct a Manly mixture step by
step. First, the user need to specify the number of components (assigned
to K
) and variables (assigned to p
). In this case, we have a
three-component bivariate mixture.
library(ManlyMix)
K <- 3
p <- 2
set.seed(123)
If the mixture probability density function of interest is written as
la
(matrix input of size tau
(vector input of length Mu
(matrix input of size S
(array input of
dimensionality
tau <- c(0.25, 0.3, 0.45)
Mu <- matrix(c(4.5, 4, 5, 7, 8, 5.5),3)
la <- matrix(c(0.2, 0.5, 0.3, 0.25, 0.35, 0.4),3)
S <- array(NA, dim = c(p, p, K))
S[,,1] <- matrix(c(0.4, 0, 0, 0.4), 2)
S[,,2] <- matrix(c(1, -0.2, -0.2, 0.6), 2)
S[,,3] <- matrix(c(2, -1, -1, 2), 2)
Step b: Overlap assessment
It is desirable to be capable of understanding the degree of interaction
among mixing components to assess clustering complexity. Function
Manly.overlap()
, employing the measure of pairwise overlap, is
implemented for this purpose. It has the following syntax:
Manly.overlap(tau, Mu, S, la, N = 1000)
with arguments la
, tau
, Mu
, S
and N
. Here, N
represents the
number of samples simulated from the given mixture for pairwise overlap
estimation. The larger N
is, the more precise the calculation is. By
default, 1000 samples are employed. Four objects are returned by the
function, including the misclassification probability matrix
$OmegaMap
, pairwise overlap $OverlapMap
, average mixture overlap
$BarOmega
, and maximum mixture overlap $MaxOmega
. Here, element
$OmegaMap[k2, k1]
corresponds to $OmegaMap
sums up to 1. Then, pairwise
overlaps $OverlapMap
. Among all pairwise overlaps
($MaxOmega
. The average
of these three values, on the other hand, results in $BarOmega
being
0.08066667
.
A <- Manly.overlap(tau, Mu, S, la)
print(A)
## $OmegaMap
## [,1] [,2] [,3]
## [1,] 0.909 0.058 0.033
## [2,] 0.038 0.916 0.046
## [3,] 0.016 0.051 0.933
##
## $OverlapMap
## Components Overlap
## 1 (1, 2) 0.096
## 2 (1, 3) 0.049
## 3 (2, 3) 0.097
##
## $BarOmega
## [1] 0.08066667
##
## $MaxOmega
## [1] 0.097
It can be seen that in the considered case, function Manly.overlap()
calculates all characteristics based on the input of true model
parameters. If parameters la
, tau
, Mu
and S
are estimated,
Manly.overlap()
provides estimates of misclassification probabilities
and overlap values. As for high-dimensional data, we can not readily
visually assess the interaction between data groups, such output helps
approximate the proximity of clusters and discover properties associated
with them.
Step c: Data generation
Function Manly.sim()
simulates Manly mixture datasets based on
user-specified model parameters. It employs the built-in R function
rmultinom()
for assigning data points to rnorm()
. The
covariance structures
The Manly.sim()
command has the following syntax:
Manly.sim(n, la, tau, Mu, S)
The user can input n
as the desired sample size. Here, a dataset of 30
observations is simulated from Equation (13) and data
matrix $X
as well as its true membership vector $id
are returned.
n <- 30
B <- Manly.sim(n, la, tau, Mu, S)
print(B)
## $X
## [,1] [,2]
## [1,] 3.259485 3.882271
## [2,] 3.247362 4.269247
Part of the output is intentionally omitted.
## [30,] 3.310186 2.974554
##
## $id
## [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
The Iris dataset (Anderson 1935; Fisher 1936) has 150 observations and 4
variables that represent sepal length, sepal width, petal length, and
petal width. Three species, Iris setosa, Iris versicolor, and Iris
virginica, have equal representation, consisting of 50 observations
each. The function Manly.EM()
fits a Manly mixture to the Iris
dataset and 95% confidence intervals of the model MLE are provided by
Manly.var()
. The Manly F and Manly B models are obtained by
Manly.select()
. The Manly Manly.Kmeans()
.
Step a: Data preparation
Manly.EM()
requires input of a matrix object X
, where rows of X
represent X
is univariate data with
vector input, it will be automatically transformed into a matrix of just
one column. Thus, X
has the dimensionality X
.
library(ManlyMix)
K <- 3
p <- 4
X <- as.matrix(iris[,-5])
Step b: Initialization of the EM algorithm
Good initialization strategy of the EM algorithm is important to improve
chances of finding a correct result. There are two ways for the user to
initialize the Manly.EM()
function. One is by means of providing the
initial partition of the data id
(vector input of length la
. Here, it needs to be noticed that the
specification of la
matrix serves as an indicator of whether the
transformation is applied to a specific variable and component or not.
For example, for the Iris dataset, assumes that all variables in all
components enjoy normality except for the first variable in the first
component, la
needs to be set as
la <- matrix(c(0.1, rep(0, 11)), 3, 4)
, with 0.1
(that can be any
non-zero value) serving as the starting point in Nelder-Mead
optimization. If no la
is provided, the skewness parameters are all
set equal to la
, tau
, Mu
, and S
. The algorithm employs these
parameters to compute the posterior probabilities in the first E-step.
Here, we adopt the first strategy. The initial partition of the Iris
data is obtained by running the traditional la
is a matrix of size 0.1
.
set.seed(123)
id.km <- kmeans(X, K)$cluster
la <- matrix(0.1, K, p)
Step c: EM algorithm for Manly mixture modeling
Manly.EM()
runs the EM algorithm for modeling based on Manly mixtures
given in Equation (1). The command has the following
syntax:
Manly.EM(X, id = NULL, la = NULL, tau = NULL, Mu = NULL, S = NULL,
tol = 1e-5, max.iter = 1000).
The parameters tol
and iter
correspond to the stopping rule for the
EM algorithm. tol
specifies the tolerance level of the EM algorithm.
If the relative difference of the tol
, the EM algorithm is terminated.
By default, tol
is set equal to max.iter
stands for the
maximum number of iterations allowed for the EM algorithm. The default
value of max.iter
is 1000
. We fit the Iris dataset by both
Gaussian mixture (assigned to object G
) and Manly mixture (assigned to
object M
).
G <- Manly.EM(X, id = id.km)
colnames(G$la) <- colnames(X)
print(G$la)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
M <- Manly.EM(X, id.km, la)
colnames(M$la) <- colnames(X)
print(M$la)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.1158602 0.05907443 -0.2382086 -4.033529
## [2,] -0.1254022 -0.65079974 -0.3848938 0.479587
## [3,] -0.1282339 0.64271380 0.3343054 -1.134275
print(M$id)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3
## [112] 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 3 2 3 2 2 2 3 3 3
## [149] 2 2
The estimated model parameters returned by the function Manly.EM()
include $la
(matrix output of size $tau
(vector
output of length $Mu
(matrix output of size $S
(array output of dimensionality $la
for the Gaussian
mixture have all elements fixed at zero, while the Manly mixture has
estimated the skewness parameters for each component and variable. For
example, the skewness parameter associated with the sepal length
variable of the first cluster is estimated to be
Some other parameters returned by Manly.EM()
are the $gamma
calculated from
Equation (2) in the last E-step and the membership vector
$id
assigned by the Bayes decision rule in Equation (5).
In this case, the output of $id
demonstrates the model-based
clustering solution of the Iris dataset.
The characteristics of the fitted model are demonstrated in terms of the
model log-likelihood $ll
and BIC $bic
. The number of iterations run
by the EM algorithm until convergence is recorded through $iter
. In
this example, the EM algorithm reaches convergence after 13 iterations
and the model BIC is $flag
reports the validity of the fitted model, where $flag
is equal
to
Step d: Variability assessment of Manly mixture model
Variability assessment of the model parameters allows practitioners to
study the specific nature of the fitted model as well as detected
clustering solutions. We provide the user with function Manly.var()
,
which calculates the inverse of the empirical observed information
matrix given in Equation (8) and returns the
variance-covariance matrix of the estimated MLE from Manly.EM()
function. It also outputs the confidence intervals of each parameter.
The command has the following syntax:
Manly.var(X, model = NULL, conf.CI = NULL)
X
represents the data matrix and model
is the object of class
"ManlyMix"
. conf.CI
is user-specified confidence level, which needs
to take a value between 0
and 1
. Here by setting model = M
, we
take the MLE of the fitted Manly mixture obtained from step c and
evaluate its variability. The number of unique model parameters is
conf.CI = 0.95
calculates 95% confidence intervals for these
56 parameters. Thus Manly.var()
function returns a V
) and 56 confidence intervals
(assigned to CI
).
result <- Manly.var(X, model = M, conf.CI = 0.95)
In the code output of 95% confidence intervals, the first column represents the point estimates of the 56 model parameters, while the second and third columns stand for the lower and upper bounds of confidence intervals, respectively.
print(result$CI)
## Estimates Lower Upper
## [1,] 0.333333333 0.257887628 0.408779039
## [2,] 0.264119102 0.175676489 0.352561716
## [3,] 3.794594378 -8.303528084 15.892716840
Part of the output is intentionally omitted.
## [54,] 0.642713799 -0.407079526 1.692507125
## [55,] 0.334305435 -0.128841410 0.797452280
## [56,] -1.134275340 -2.079185136 -0.189365544
Step e: Forward and backward selection algorithms
Step e targets detecting the normally distributed variables in Iris.
Manly.select()
provides the selection algorithm for eliminating
unnecessary skewness parameters in M$la
. These skewness parameters are
fixed to be equal to zero and the log-likelihood is maximized based on
the rest of parameters. The use of the function is shown below:
Manly.select(X, model, method, tol = 1e-5, max.iter = 1000,
silent = FALSE)
The argument model
is the initial model to start the selection
procedure with. method
is set to either "forward"
or "backward"
for the implementation of Algorithm 1 or Algorithm 2, respectively. The
selection criterion for each step is based on $bic
values obtained
from all candidate models that are of class "ManlyMix"
. silent
is an
argument that controls the code output. By default, silent
provides
the steps of selection and BIC values for all candidate models. Thus,
the user can monitor the selection procedures. The output can be turned
off by setting silent = TRUE
. We first discuss the implementation of
the forward selection on the Iris dataset. The algorithm is
initialized by the Gaussian mixture model G
obtained in step c.
MF <- Manly.select(X, model = G, method = "forward")
## step 1 :
## current BIC = 580.8389
## alternative BICs = 585.6791 585.0607 582.1893 585.7369 585.2193 583.7374
## 585.7081 583.963 579.8978 573.4626 585.8161 585.8407
## step 2 :
## current BIC = 573.4626
## alternative BICs = 578.3191 577.6844 574.813 578.3719 577.843 576.3611
## 578.3282 576.5866 572.5215 578.4397 578.4643
## step 3 :
## current BIC = 572.5215
## alternative BICs = 577.378 576.7713 575.8067 577.4308 576.8833 575.3526
## 577.3871 575.6213 577.4799 577.3221
The forward selection takes three steps for the algorithm to find the
best model (assigned to MF
). In step 3, there is no alternative BIC
value that is smaller than the current model BIC, so the forward
selection algorithm stops searching over non-zero
On the contrary, the backward selection starts with the full Manly
mixture M
and drops one skewness parameter at a time.
MB <- Manly.select(X, model = M, method = "backward")
## step 1 :
## current BIC = 618.4553
## alternative BICs = 613.5161 612.7184 613.8658 613.448 614.3828 616.6445
## 613.5442 616.3626 617.0157 625.7431 613.1927 610.7879
## step 2 :
## current BIC = 610.7879
## alternative BICs = 605.9157 605.9075 607.3512 605.8475 606.3851 607.8112
## 605.9437 607.0513 609.9457 618.1426 605.7915
Part of the output is intentionally omitted.
## step 10 :
## current BIC = 575.3526
## alternative BICs = 572.5215 576.3611 582.729
## step 11 :
## current BIC = 572.5215
## alternative BICs = 573.4626 579.8978
After 11 steps, the backward selection produces the Manly B model, which enjoys the same BIC value as the Manly F model.
Step f: Diagnostics
The skewness parameters of the Manly F and Manly B models are
investigated in the following example. It is observed that the forward
selection adopts only two
colnames(MF$la) <- colnames(X)
print(MF$la)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 0 0 0.0000000 -4.04
## [2,] 0 0 0.0000000 0.00
## [3,] 0 0 0.5615625 0.00
colnames(MB$la) <- colnames(X)
print(MB$la)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 0 0 0.0000000 -4.034815
## [2,] 0 0 0.0000000 0.000000
## [3,] 0 0 0.5619671 0.000000
Step g: Manly
The Manly Manly.Kmeans()
, which has the following syntax:
Manly.Kmeans(X, id = NULL, la = NULL, Mu = NULL, S = NULL,
initial = "k-means", K = NULL, nstart = 100,
method = "ward.D", tol = 1e-5, max.iter = 1000).
Manly.Kmeans()
has most of the arguments and returned values the same
as those of the function Manly.EM()
. As the Manly tau
are not needed in this function. S
is a vector of length $la
, $Mu
, and $S
correspond to Manly.Kmeans()
has several initialization choices: (1) by providing
id
and la
; (2) by providing la
, Mu
, and S
; (3) by specifying
the number of clusters K
and letting initial = "k-means"
; It takes
the default traditional K-means clustering result and passes it into the
CEM algorithm; (4) by specifying the number of clusters K
and letting
initial = "hierarchical"
; It adopts the hierarchical clustering
solution as the initial dataset partition. nstart
is responsible for
controlling the number of random starts tried in initialization choice
(3) with a default value equal to 100
. method
sets the linkage
method in initialization choice (4) with a default of
method = "ward.D"
, which represents the Ward’s linkage (Ward 1963).
Here, the initialization choice of Manly.Kmeans()
is (1), which is the
same as that of Manly.EM()
.
MK <- Manly.Kmeans(X, id.km, la)
colnames(MK$la) <- colnames(X)
print(MK$la)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.37975529 -0.5815382 -0.81530022 -2.5572830
## [2,] -0.27067058 -0.4103692 -0.31001602 -0.5367999
## [3,] -0.02896526 0.1138177 -0.05694487 0.2617650
print(MK$S)
## [1] 0.002717844 0.006156015 0.160435910
In this subsection, dataset AIS (Cook and Weisberg 1994) is studied for
illustrative purposes. The Australian Institute of Sports (AIS) dataset
was first introduced by Cook and Weisberg (1994). It contains information
collected from 202 athletes, among which 100 are females and 102 are
males. There are 13 variables, including the gender, sport kind and 11
numeric measurements of the athletes. We adopt the same variables and
analysis as Lee and McLachlan (2013). The goal of the analysis is to cluster
the athletes into two groups: males and females by constructing models
based on three measurements: the body mass index (“BMI”), lean body mass
(“LBM”), and the percentage of body fat (“Bfat”). Function
ClassAgree()
compares the estimated and true partitions. Function
Manly.plot()
is introduced for visual analysis of Manly mixture fitted
results.
Step a: model fit
The AIS dataset is analyzed by six mixture models: the traditional
kmeans()
), Manly Manly.Kmeans()
), Gaussian
mixture model (Manly.EM()
), Manly mixture model (Manly.EM()
), Manly
forward model and Manly backward model (both available through
Manly.select()
).
library(ManlyMix)
data("ais"); set.seed(123)
X <- as.matrix(ais[,c(8, 10, 11)])
id <- as.numeric(ais[,1])
n <- dim(X)[1]
p <- dim(X)[2]
K <- max(id)
Kmeans <- kmeans(X, K)
id.km <- Kmeans$cluster
By running the following code, we not only obtain the fitted models, but
also test the package from different aspects. The number of parameters
in the models are 7 (
MK <- Manly.Kmeans(X, id = id.km, la = matrix(0.1, K, p))
G <- Manly.EM(X, id = id.km, la = matrix(0, K, p))
M <- Manly.EM(X, id = id.km, la = matrix(0.1, K, p))
MF <- Manly.select(X, G, method = "forward", silent = TRUE)
MB <- Manly.select(X, M, method = "backward", silent = TRUE)
Now we consider the fitted model parameters to perform a comprehensive analysis and diagnostics of the AIS dataset. From the following output, it is observed that the Manly F model drops two skewness parameters from the full Manly mixture model while Manly B drops three. This yields the conclusion that the “Bfat” variable in the first group and “LBM” variable in the second one are close to be normal. Through the one-to-one correspondence between skewness parameters and dataset variables, ManlyMix is proved to be particularly useful for model variable diagnostics.
colnames(MF$la) <- colnames(X)
MF$la
## BMI Bfat LBM
## [1,] -0.08671894 0.0000000 0.01002851
## [2,] -0.12882354 -0.1902031 0.00000000
colnames(MB$la) <- colnames(X)
MB$la
## BMI Bfat LBM
## [1,] -0.09362427 0.0000000 0
## [2,] -0.12720459 -0.1933216 0
BIC values for the four models are
Step b: classification table
Classification results from the six models are compared using function
ClassAgree()
in step b. Function ClassAgree()
adopts input of both
the estimated and true id vectors with the following syntax:
ClassAgree(est.id, trueid)
ClassAgree()
permutes the partition labels to achieve the lowest
number of misclassifications. Then, based on the switched labels, it
returns the confusion matrix and number of misclassifications. In the
analysis of the AIS dataset, the following output is produced by
ClassAgree()
.
ClassAgree(id.km, id)
## $ClassificationTable
## est.id
## trueid 1 2
## 1 98 2
## 2 12 90
##
## $MisclassificationNum
## [1] 14
ClassAgree(MK$id, id)
## $ClassificationTable
## est.id
## trueid 1 2
## 1 95 5
## 2 7 95
##
## $MisclassificationNum
## [1] 12
ClassAgree(G$id, id)
## $ClassificationTable
## est.id
## trueid 1 2
## 1 100 0
## 2 8 94
##
## $MisclassificationNum
## [1] 8
ClassAgree(M$id, id)
## $ClassificationTable
## est.id
## trueid 1 2
## 1 98 2
## 2 2 100
##
## $MisclassificationNum
## [1] 4
ClassAgree(MF$id, id)
## $ClassificationTable
## est.id
## trueid 1 2
## 1 99 1
## 2 3 99
##
## $MisclassificationNum
## [1] 4
ClassAgree(MB$id, id)
## $ClassificationTable
## est.id
## trueid 1 2
## 1 99 1
## 2 4 98
##
## $MisclassificationNum
## [1] 5
Rows and columns represent the true and estimated partitions,
respectively. The diagonal and off-diagonal elements in the table
correspond to correct and incorrect classifications, respectively. The
lowest number of misclassifications (4 misclassifications) is obtained
by the Manly mixture and Manly forward models. One worth-mentioning fact
is that these two models enjoy the clustering solution as good as the
unrestricted skew-
Step c: visualization tool
In order to investigate the behavior of each model, contour plots with
classified data points need to be analyzed. Manly.plot()
allows
conducting the visual analysis of a dataset fitted by Manly mixture
model. The command has the following syntax:
Manly.plot(X, var1 = NULL, var2 = NULL, model = NULL, x.slice = 100,
y.slice = 100, x.mar = 1, y.mar = 1, col = "lightgrey", ...).
If both var1
and var2
are provided, they represent variables on the
model
is the object of class "ManlyMix"
. The parameters of model
object
are used to calculate the density and draw contour lines. The estimated
membership vector model$id
is reflected through different colors.
x.slice
and y.slice
options control the number of grid points for
which a density is calculated. The larger these two values are, the more
grid values are considered. Thus, the contour lines look smoother.
x.mar
and y.mar
specify plot margins. The parameter col
specifies
the color of contour lines with the default color being light grey.
Other variables in the built-in R function contour()
can also be used
as specified. On the other hand, if only var1
is provided, a density
plot of this variable is constructed. x.slice
and x.mar
have the
same functionality as those in the contour plot. The parameter col
stands for density line color with the default being light grey. ...
allows other arguments from the built-in R function hist()
to be
passed.
In this case, we conduct the same analysis as that in
Lee and McLachlan (2013) and adopt the two variables “LBM” and “Bfat” for
constructing contour plots. The margins of the plots are set to be 3
on the 13
on the 3.2
. Labels and axes are suppressed. The function
is first applied to the four fitted models Gaussian mixture, Manly
mixture, Manly F and Manly B in step a.
Manly.plot(X, var1 = 3, var2 = 2, model = G, x.mar = 3, y.mar = 13,
xaxs="i", yaxs="i", xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 10, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Manly.plot(X, var1 = 3, var2 = 2, model = M, x.mar = 3, y.mar = 13,
xaxs="i", yaxs="i", xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 10, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Manly.plot(X, var1 = 3, var2 = 2, model = MF, x.mar = 3, y.mar = 13,
xaxs="i", yaxs="i", xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 10, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Manly.plot(X, var1 = 3, var2 = 2, model = MB, x.mar = 3, y.mar = 13,
xaxs="i", yaxs="i", xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 10, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Function Manly.plot()
enjoys sufficient flexibility to adopt other
parsimonious models. Parameters obtained by traditional "ManlyMix"
so that $id
, $tau
, $Mu
, $la
, $S
are extracted in
their correct forms.
Kmeans$id <- id.km
Kmeans$tau <- MK$tau <- rep(1 / K, K)
Kmeans$Mu <- Kmeans$centers
Kmeans$la <- matrix(0, K, p)
Kmeans$S <- array(0, dim = c(p, p, K))
for(k in 1:K)
diag(Kmeans$S[,,k]) <- Kmeans$tot.withinss / n / p
s2 <- MK$S
MK$S <- array(0, dim = c(p, p, K))
for(k in 1:K)
diag(MK$S[,,k]) <- s2[k]
Manly.plot(X, var1 = 3, var2 = 2, model = Kmeans, x.mar = 3, y.mar = 13,
xaxs="i", yaxs="i", xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 4, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Manly.plot(X, var1 = 3, var2 = 2, model = MK, x.mar = 3, y.mar = 13,
xaxs="i", yaxs="i", xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 10, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Figure 1 combines all output plots from Manly.plot()
.
Manly.plot()
based on the two variables
“LBM” (X-axis) and “Bfat” (Y-axis). The model locations are: K-means (first row left), Manly
K-means (first row right),
Gaussian mixture (second row left), Manly mixture (second row right),
Manly forward (third row left) and Manly backward (third row
right).
It can be observed that the two components have a slight overlap, so the
clustering problem is not over-complicated. However, the red cluster is
highly skewed and has a heavy tail. This imposes difficulties for the
traditional Manly.plot()
on real-life datasets.
Alternative coding d: wrapper function
Wrapper function Manly.model()
enables practitioners to run analysis
in a simple and convenient way. The function has the following syntax:
Manly.model(X, K = 1:5, Gaussian = FALSE, initial = "k-means",
nstart = 100, method = "ward.D", short.iter = 5,
select = "none", silent = TRUE, plot = FALSE,
var1 = NULL, var2 = NULL, VarAssess = FALSE,
conf.CI = NULL, overlap = FALSE, N = 1000, tol = 1e-5,
max.iter = 1000, ...).
Argument K
is an integer vector providing the numbers of clusters to
be tested for the data. The default setting tests 1, 2, 3, 4, or 5
clusters. It calls the Manly.EM()
function to fit all five models. The
one with the lowest BIC value is chosen to be the best model. Gaussian
option specifies whether skewness parameters are adopted or not. If
TRUE
, Gaussian mixtures are fitted. With the default value being
FALSE
, it runs full Manly mixture models. initial
specifies the
initialization strategy used. It has three input options: (1)
initial = "k-means"
is the default initialization strategy, which
passes the traditional nstart
is passed into the built-in R function
kmeans
for specifying the number of random starts (the default
nstart = 100
); (2) if initial = "hierarchical"
, the hierarchical
clustering initialization is used. The linkage method is passed by
method
argument into R function hclust
. The default is Ward’s
linkage; (3) if initial = "emEM"
, the emEM (Biernacki et al. 2003)
initialization is run. Short runs of EM are conducted based on random
starts and the one that corresponds to the highest log-likelihood is
picked for running until convergence. nstart
controls the number of
random starts. The number of iterations for the short EM is specified by
short.iter
with a default value set to 5
iterations.
select
argument has three input values: "none"
, "forward"
and
"backward"
. If select = "none"
, then the object returned by function
Manly.EM()
is adopted directly. If select = "forward"
, the
Gaussian
option is automatically adjusted to Gaussian = TRUE
. It
calls function Manly.select(..., method = "forward")
to improve the
original Gaussian fit. On the other hand, if select = "backward"
,
Gaussian
option is automatically set to Gaussian = FALSE
. The full
Manly mixture is followed by the backward selection
Manly.select(..., method = "backward")
. silent
argument controls the
output in function Manly.select()
. The default setting suppresses the
output. plot
determines whether Manly.plot()
function is called or
not. If plot = TRUE
, then Manly.plot()
runs and arguments var1
and
var2
allow user to specify which variable(s) to plot. Argument
VarAssess
provides the option of using Manly.var()
for variability
assessment. Notice here that it only provides assessment for a full
Manly mixture model. conf.CI
specifies the confidence level of the
confidence intervals returned. The overlap
option, if specified to be
TRUE
, adopts the Manly.overlap()
function and estimates pairwise
overlap values for the returned model. N
is the number of Monte Carlo
simulations run in Manly.overlap()
.
Three objects are returned by function Manly.model()
: $model
,
$VarAssess
, and $Overlap
. $model
is the final model of class
"ManlyMix"
by Manly.EM()
or Manly.select()
. $VarAssess
returns
the variance-covariance matrix and confidence intervals by Manly.var()
function. $Overlap
returns the object by Manly.overlap()
.
For AIS dataset, suppose the user wants to obtain the Manly F or Manly B model and take a look at their contour plots. A compact version of the code is given by:
MF <- Manly.model(X, K = 2, initial = "k-means", nstart = 100,
select = "forward", plot = TRUE, var1 = 3, var2 = 2,
x.mar = 3, y.mar = 13, xaxs="i", yaxs="i",
xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 4, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
MB <- Manly.model(X, K = 2, initial = "k-means", nstart = 100,
select = "backward", plot = TRUE, var1 = 3, var2 = 2,
x.mar = 3, y.mar = 13, xaxs="i", yaxs="i",
xaxt="n", yaxt="n", xlab="", ylab = "",
nlevels = 4, drawlabels = FALSE, lwd = 3.2,
col = "lightgrey", pch = 19)
Here, MF$model
and MB$model
obtained are the same as those in step
a. The contour plots are generated automatically and can be found in
Figure 1.
Alternative coding e: initialization with model parameters
Functions Manly.EM()
can take initial model parameters as
initialization of the algorithm. It is especially useful for the emEM
initialization. The practitioner can construct a large number of short
EM runs, select the one with the highest log-likelihood and obtain its
estimated parameters. Then, the EM algorithm initialized by these
parameters is run until convergence. Here is a small example on the
AIS dataset. 100 short EM algorithms run for 5 iterations each. As we
can see, the obtained object M is the same as that from step a.
ll <- -Inf
init <- NULL
nstart <- 100
iter <- 0
repeat {
id.km <- kmeans(X, centers = K, iter.max = 1)$cluster
temp <- Manly.EM(X, id = id.km, la = matrix(0.1, K, p), max.iter = 5)
if(temp$ll > ll) {
ll <- temp$ll
init <- temp
}
iter <- iter + 1
if(iter == nstart)
break
}
M <- Manly.EM(X, tau = init$tau, Mu = init$Mu, S = init$S, la = init$la)
Since one reviewer is interested in seeing a showcase of a univariate Manly mixture, we illustrate its utility on the acidity dataset (Crawford 1994). It provides the acidity measure of 155 lakes in the Northeastern United States. There are two clusters, but the true partition is unknown.
Step a: model fit
We run the following models: the traditional kmeans()
),
Manly Manly.Kmeans()
), Gaussian mixture model
(Manly.EM()
), Manly mixture model (Manly.EM()
), Manly forward model,
and Manly backward model (both available through Manly.select()
).
library(ManlyMix)
data("acidity"); set.seed(123)
K <- 2
p <- 1
X <- acidity
Kmeans <- kmeans(X, K)
id.km <- Kmeans$cluster
MK <- Manly.Kmeans(X, id = id.km, la = matrix(0.1, K, p))
G <- Manly.EM(X, id = id.km, la = matrix(0, K, p))
M <- Manly.EM(X, id = id.km, la = matrix(0.1, K, p))
MF <- Manly.select(X, G, method = "forward", silent = TRUE)
MB <- Manly.select(X, M, method = "backward", silent = TRUE)
The model BIC values for Gaussian, Manly, Manly F and Manly B are
Step b: visualization tool
To visually assess the fit provided by all models, we use the command
Manly.plot()
with univariate input. The fitted density plots
associated with histogram of the data are provided in
Figure 2.
Kmeans$id <- id.km
Kmeans$tau <- MK$tau <- rep(1 / K, K)
Kmeans$Mu <- Kmeans$centers
Kmeans$la <- matrix(0, K, p)
Kmeans$S <- array(0, dim = c(p, p, K))
for(k in 1:K)
Kmeans$S[,,k] <- Kmeans$tot.withinss / n / p
s2 <- MK$S
MK$S <- array(0, dim = c(p, p, K))
for(k in 1:K)
MK$S[,,k] <- s2[k]
Manly.plot(X = acidity, model = Kmeans, var1 = 1, main = "",
ylim = c(0, 0.75), xlab = "", xaxt = "n", ylab = "",
yaxt = "n", x.slice = 200, col = "red")
Manly.plot(X = acidity, model = MK, var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
Manly.plot(X = acidity, model = G, var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
Manly.plot(X = acidity, model = M, var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
Manly.plot(X = acidity, model = MF, var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
Manly.plot(X = acidity, model = MB, var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
Manly models provide the most reasonable fit of the data. The first
component is slightly skewed to the right and only the Manly models pick
up the high density at its peak. The second component is slightly skewed
to the left. The density fits provided by
Manly.plot()
: K-means (first row left), Manly
K-means (first row right),
Gaussian mixture (second row left), Manly mixture (second row right),
Manly forward (third row left) and Manly backward (third row
right).
Alternative coding c: wrapper function
The wrapper function Manly.model()
is capable of combining steps a and
b in one command. The following code directly yields the Manly F or
Manly B model:
MF <- Manly.model(X, K = 2, Gaussian = TRUE, initial = "k-means",
nstart = 100, select = "forward", plot = TRUE,
var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
MB <- Manly.model(X, K = 2, Gaussian = FALSE, initial = "k-means",
nstart = 100, select = "backward", plot = TRUE,
var1 = 1, main = "", ylim = c(0, 0.75),
xlab = "", xaxt = "n", ylab = "", yaxt = "n",
x.slice = 200, col = "red")
The Manly F and Manly B density plots given in Figure 2 are generated automatically.
For users who need further information about the package, we have constructed 16 demo examples listed in Table 3 that provide a comprehensive demonstration of ManlyMix capabilities. Among the examples, 11 of them are designed to demonstrate the capability and utility of each function and 5 of them run comprehensive analysis of classification datasets. Each demo can be accessed by its name and the users can reproduce themselves.
As an illustration of how these demos can be employed, the code of the first example can be approached through running the following code in R.
library(ManlyMix)
demo(EMalgorithm1)
Function | Demo example(s) |
---|---|
Manly.EM() |
demo(EMalgorithm1) , demo(EMalgorithm2) |
Manly.select() |
demo(ForwardSelection) , demo(BackwardSelection) |
Manly.Kmeans() |
demo(ManlyKmeans1) , demo(ManlyKmeans2) |
Manly.overlap() |
demo(Overlap) |
Manly.sim() |
demo(DataSimulation) |
Manly.var() |
demo(VarAssess) |
Manly.plot() |
demo(DensityPlot) , demo(ContourPlot) |
Comprehensive analysis | demo(utility) , demo(ais) , demo(seeds) , demo(bankruptcy) |
demo(acidity) |
The R package ManlyMix is discussed and illustrated in detail. The provided functions enable practitioners to analyze heterogeneous data and conduct cluster analysis with Manly mixture models. The algorithms behind functions are introduced and explained carefully. Illustrative examples based on challenging real-life datasets are studied to demonstrate the usefulness and efficiency of the package. Promising results suggest that ManlyMix is not only a powerful package for clustering and classification, but also a diagnostic tool to investigate skewness and deviation from normality in data. Demo examples are provided for each function in ManlyMix for the users to study.
The six competitors for mixture modeling of skewed data given in
Table 1 are applied to the AIS dataset in
Section 3.3, including
Table 4 provides model-based clustering results. The number of parameters of the models are 20 (flowClust), 25 (SSN), 25 (SST), 25 (rMSN), 27 (rMST), and 27 (uMST). The computing times are 0.012, 1.553, 4.737, 0.024, 0.136, 3547.527, respectively. The BIC values of the models are 3551.125, 3576.886, 3558.227, 3566.007, 3562.151, 3591.241. flowClust enjoys the lowest BIC value, which is still higher than that of Manly B. uMST yields the lowest number of misclassifications, which is as good as the full Manly mixture model and Manly forward model.
flowClust | SSN | SST | ||||
Group | 1 | 2 | 1 | 2 | 1 | 2 |
1 | 99 | 1 | 99 | 1 | 99 | 1 |
2 | 8 | 94 | 7 | 95 | 5 | 97 |
rMSN | rMST | uMST | ||||
Group | 1 | 2 | 1 | 2 | 1 | 2 |
1 | 100 | 0 | 99 | 1 | 98 | 2 |
2 | 8 | 94 | 6 | 96 | 2 | 100 |
The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.
flowClust, mixsmsn, EMMIXskew, EMMIXuskew, ManlyMix
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhu & Melnykov, "ManlyMix: An R Package for Manly Mixture Modeling", The R Journal, 2017
BibTeX citation
@article{RJ-2017-060, author = {Zhu, Xuwen and Melnykov, Volodymyr}, title = {ManlyMix: An R Package for Manly Mixture Modeling}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {2}, issn = {2073-4859}, pages = {176-197} }