Clustering algorithms are designed to identify groups in data where the traditional emphasis has been on numeric data. In consequence, many existing algorithms are devoted to this kind of data even though a combination of numeric and categorical data is more common in most business applications. Recently, new algorithms for clustering mixed-type data have been proposed based on Huang’s k-prototypes algorithm. This paper describes the R package clustMixType which provides an implementation of k-prototypes in R.
Clustering algorithms are designed to identify groups in data where the
traditional emphasis has been on numeric data. In consequence, many
existing algorithms are devoted to this kind of data even though a
combination of numeric and categorical data is more common in most
business applications. For an example in the context of credit scoring,
see, e.g. Szepannek (2017). The standard way to tackle mixed-type data
clustering problems in R is to use either (1) Gower distance
(Gower 1971) via the
gower package (van der Loo 2017) or
daisy(method = "gower") in the
cluster package
(Maechler et al. 2018); or (2) hierarchical clustering through hclust()
or the
agnes()
function in cluster. Recent innovations include the package
CluMix (Hummel et al. 2017), which
combines both Gower distance and hierarchical clustering with some
functions for visualization. As this approach requires computation of
distances between any two observations, it is not feasible for large
data sets. The package
flexclust (Leisch 2006)
offers a flexible framework for k-centroids clustering through the
function kcca()
which allows for arbitrary distance functions. Among
the currently pre-implemented kccaFamilies, there is no distance
measure for mixed-type data. Alternative approaches based on
expectation-maximization are given by the function flexmixedruns()
in
the fpc package (Hennig 2018) and
the package clustMD
(McParland 2017). Both require the variables in the data set to be ordered
according to their data type, and categorical variables to be
preprocessed into integers. The clustMD algorithm (McParland and Gormley 2016) also
allows ordinal variables but is quite computationally intensive. The
kamila package (Foss and Markatou 2018)
implements the KAMILA clustering algorithm which uses a kernel density
estimation technique in the continuous domain, a multinomial model in
the categorical domain, and the Modha-Spangler weighting of variables in
which categorical variables have to be transformed into indicator
variables in advance (Modha and Spangler 2003).
Recently, more algorithms for clustering mixed-type data have been proposed in the literature (He et al. 2005; Amir and Dey 2007; Pham et al. 2011; Dutta et al. 2012; Ji et al. 2012, 2013, 2015; Lim et al. 2012; Foss et al. 2016; HajKacem et al. 2016; Liu et al. 2017). Many of these are based on the idea of Huang’s k-prototypes algorithm (Huang 1998). This paper describes the R package clustMixType (Szepannek 2018), which provides, to the author’s knowledge, the first implementation of this algorithm in R. The k-modes algorithm (Huang 1997a) has been implemented in the package klaR (Weihs et al. 2005; Roever et al. 2018) for purely categorical data, but not for the mixed-data case. The rest of the paper is organized as follows: a brief description of the algorithm is followed by an overview of the functions in the clustMixType package. Some extensions of the original algorithm are discussed, as well as a worked example application.
The k-prototypes algorithm belongs to the family of partitional cluster
algorithms. Its objective function is given by:

$$E = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\, d(x_i, \mu_j),$$

where $u_{ij} \in \{0, 1\}$ indicates whether observation $x_i$ is assigned to cluster $j$, $\mu_j$ denotes the prototype of cluster $j$, and the distance

$$d(x_i, \mu_j) = \sum_{m=1}^{q} \left(x_i^m - \mu_j^m\right)^2 + \lambda \sum_{m=q+1}^{p} \mathbb{I}\left(x_i^m \neq \mu_j^m\right)$$

combines the squared Euclidean distance over the $q$ numeric variables with the simple matching distance over the remaining $p - q$ categorical variables, weighted by a parameter $\lambda > 0$.
The algorithm iterates in a manner similar to the k-means algorithm (MacQueen 1967) where, for the numeric variables, the cluster mean and, for the categorical variables, the cluster mode minimize the total within-cluster distance. The steps of the algorithm are:
1. Initialization with random cluster prototypes.
2. For each observation do:
   (a) Assign the observation to its closest prototype according to $d(\cdot, \cdot)$.
   (b) Update cluster prototypes by cluster-specific means/modes for all variables.
3. As long as any observations have swapped their cluster assignment in 2 or the maximum number of iterations has not been reached: repeat from 2.
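To make the assignment step concrete, the following minimal R sketch computes the distance between one observation and one prototype, both given as one-row data frames; dist_kproto() is a hypothetical helper for illustration, not part of the package:

# Hypothetical helper (not part of clustMixType): mixed-type distance between
# one observation and one prototype, both one-row data frames with identical columns.
dist_kproto <- function(obs, proto, lambda) {
  is_num <- sapply(obs, is.numeric)
  # squared Euclidean distance over the numeric variables
  d_num <- sum((unlist(obs[is_num]) - unlist(proto[is_num]))^2)
  # simple matching distance over the categorical variables
  d_cat <- sum(mapply(function(a, b) as.character(a) != as.character(b),
                      obs[!is_num], proto[!is_num]))
  d_num + lambda * d_cat
}
# e.g. distance of the first observation to the second prototype:
# dist_kproto(x[1, ], kpres$centers[2, ], kpres$lambda)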
An implementation of the k-prototypes algorithm is given by the function
kproto(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE)
where
x
is a data frame with both numeric and factor variables. As
opposed to other existing R packages, the factor variables do not
need to be preprocessed in advance and the order of the variables
does not matter.
k
is the number of clusters which has to be pre-specified.
Alternatively, it can also be a vector of observation indices or a
data frame of prototypes with the same columns as x
. If identical prototypes occur, either at initialization or during the iteration process, the number of clusters will be reduced accordingly.
lambda is a positive parameter that controls the trade-off between the Euclidean distance for the numeric variables and the simple matching distance for the factor variables. If it is not specified, it is estimated from the data using the function lambdaest(). Alternatively, a vector of length ncol(x) can be passed to lambda (cf. the section on extensions of the original algorithm below).
iter.max
sets the maximum number of iterations, just as in
kmeans()
. The algorithm may stop prior to iter.max
if no
observations swap clusters.
nstart may be set to a value > 1 to run k-prototypes multiple times with different random initializations; as for k-means, the result depends on the initialization, and the best solution in terms of the objective function is returned.
Generally, the algorithm can deal with missing data, but by default NAs are removed by na.rm = TRUE.
Two additional arguments, verbose
and keep.data
, can control whether
information on missing values should be printed and whether the original
data should be stored in the output object. The keep.data = TRUE
option
is required for the default call to the summary()
function, but in
case of large x
, it can be set to FALSE
to save memory.
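Taken together, a call using these arguments might look as follows (x as in the worked example below; the argument values are arbitrary illustrations):

library(clustMixType)
# run k-prototypes five times and keep the best solution
res <- kproto(x, k = 4, iter.max = 100, nstart = 5,
              na.rm = TRUE, verbose = FALSE, keep.data = TRUE)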
The output is an object of class "kproto"
. For convenience, the
elements are designed to be compatible with those of class "kmeans"
:
cluster
is an integer vector of cluster assignments
centers
stores the prototypes in a data frame
size
is a vector of corresponding cluster sizes
withinss
returns the sum over all within cluster distances to the
prototype for each cluster
tot.withinss
is their sum over all clusters which corresponds to
the objective function
dists
returns a matrix of all observations’ distances to all
prototypes in order to investigate the crispness of the clustering
lambda
and iter
store the specified arguments of the function
call
trace lists the values of the objective function over the iterations of the algorithm
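For instance, with a fitted object kpres (as created in the worked example below), these elements are accessed as for kmeans output:

kpres$centers        # cluster prototypes as a data frame
kpres$size           # cluster sizes
kpres$tot.withinss   # value of the objective function E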
Unlike "kmeans"
, the "kproto"
class is accompanied by corresponding
predict()
and summary()
methods. The predict.kproto()
method can
be used to assign clusters to new data. Like many of its cousins, it is
called by
predict(object, newdata)
The output again consists of two elements: a vector cluster
of cluster
assignments and a matrix dists
of all observations’ distances to all
prototypes.
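For illustration, a minimal sketch of scoring new data, where x_new is a hypothetical resample of the example data below standing in for genuinely new observations:

x_new <- x[sample(nrow(x), 10), ]          # stand-in for genuinely new data
pred  <- predict(kpres, newdata = x_new)
pred$cluster   # assigned clusters
pred$dists     # distances to all prototypes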
The investigation resulting from a cluster analysis typically consists
of identifying the differences between the clusters, or in this specific
case, those of the k prototypes. For practical applications besides the
cluster sizes, it is of further interest to take into account the
homogeneity of the clusters. For numeric variables, this can be done by
calling the R function summary()
. For categorical variables, the representativity of the prototypes is given by their frequency distributions obtained by prop.table()
. The summary.kproto()
method applies these
methods to the variables conditional to the resulting clusters and
returns a comparative results table of the clusters for each variable.
The summary is not restricted to the training data but it can further be
applied to new data by calling summary(object, data)
where data
are
new data that will be internally passed to the predict()
method on the
object
of class "kproto"
. If no new data is specified (default:
data = NULL
), the function requires object
to contain the original
data (argument keep.data = TRUE
). In addition, a function
clprofiles(object, x, vars = NULL)
supports the analysis of the clusters by visualization of cluster
profiles based on an object
of class "kproto"
and data x
. Note
that the latter may also have different variables compared to the
original data, such as for profiling variables that were not used for
clustering. As opposed to summary.kproto()
, no new prediction is done
but the cluster assignments of the "kproto"
object given by
object$cluster
are used. For this reason, the observations in x
must
be the same as in the original data. Via the vars
argument, a subset
of variables can be specified either by a vector of indices or variable
names. Note that clprofiles()
is not restricted to objects of class
"kproto"
but can also be applied to other cluster objects as long as
they are of a "kmeans"
-like structure with elements cluster
and
size
.
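For example, restricting the profiles to the factor variables of the example data below might look as follows:

# a minimal sketch: profile only a subset of the variables
clprofiles(kpres, x, vars = c("x1", "x2"))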
For unsupervised clustering problems, the user typically has to specify
the impact of the specific variables on the desired cluster solution,
which is controlled by the parameter lambda. The function

lambdaest(x, num.method = 1, fac.method = 1, outtype = "numeric")

provides different data-based heuristics for the choice of lambda: the
average variance (num.method = 1) or standard deviation
(num.method = 2) over all numeric variables is related to the
average concentration of all factor variables, measured either by
$1 - \sum_i p_i^2$ (fac.method = 1) or by $1 - \max_i p_i$
(fac.method = 2), where the $p_i$ denote the relative frequencies of the
levels. If kproto() is called without specifying
lambda, the parameter is automatically set using
num.method = fac.method = 1. Note that this should be considered a
starting point for further analysis; the explicit choice of lambda
should be made with regard to the aim of the analysis.

Originally, lambda is a single scalar that weights all categorical
variables alike. In clustMixType, kproto() is extended to accept
vectors as input where each element corresponds to a variable-specific
weight. By changing the outtype argument into "vector", the
function lambdaest() returns a vector of variable-specific weights,
and outtype = "variation" returns a vector of the
original values of variability for all variables in terms of the
quantities described above.
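A minimal sketch of these options, applied to the example data x introduced below:

lambda_scalar <- lambdaest(x)                      # scalar lambda, default heuristics
lambda_vec    <- lambdaest(x, outtype = "vector")  # one weight per variable
kpres_w <- kproto(x, k = 4, lambda = lambda_vec)   # variable-weighted k-prototypes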
An issue of great practical relevance is the ability of an algorithm to
deal with missing values which can be solved by an intuitive extension
of k-prototypes. During the iterations, cluster prototypes can be
updated by ignoring NA
s: Both means for numeric variables as well as
modes for factors can be computed based on the available data.
Similarly, distances for cluster assignment can be computed for each
observation based on the available variables only. This not only allows
cluster assignment for observations with missing values, but already
takes these observations into account when the clusters are formed. In
kproto(), this behaviour is obtained by setting the argument
na.rm = FALSE. Note that in practice, this option should be handled
with care unless the number of missing values is very small. The
representativeness of a prototype might become questionable if its means
and modes do not represent the major part of the cluster’s observations.
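A minimal sketch, assuming a copy of the example data below with a few values set to missing:

x_na <- x
x_na[sample(nrow(x_na), 10), "x1"] <- NA        # inject a few missing factor values
kpres_na <- kproto(x_na, k = 4, na.rm = FALSE)  # observations with NAs are kept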
Finally, a modification of the original algorithm presented in Section 2 allows for vector-wise computation of the iteration steps, reducing computation time. The update of the prototypes is not done after each new cluster assignment, but once each time the whole data set has been reassigned. The modified k-prototypes algorithm consists of the following steps:
1. Initialization with random cluster prototypes.
2. Assign all observations to their closest prototypes according to $d(\cdot, \cdot)$.
3. Update cluster prototypes.
4. As long as any observations have swapped their cluster assignment in 2 or the maximum number of iterations has not been reached: repeat from 2.
As an example, data x with two numeric and two categorical variables
are simulated according to the documentation in ?kproto: Using
set.seed(42), four clusters are generated such that each is
characterized by a combination of location in the numeric variables and
level concentration in the categorical variables:

| cluster | numeric | categorical |
|---------|---------|-------------|
| 1       | +       | +           |
| 2       | +       | -           |
| 3       | -       | +           |
| 4       | -       | -           |
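The data generation might be sketched as follows (the exact values of n, prb, and muk are assumptions rather than the documented ones; clusid stores the true memberships and is used for the Rand index comparison below):

library(clustMixType)
set.seed(42)
n   <- 100   # observations per cluster (assumed)
prb <- 0.9   # level concentration of the factor variables (assumed)
muk <- 1.5   # location shift of the numeric variables (assumed)
clusid <- rep(1:4, each = n)

# factors: clusters 1 & 3 concentrated on level "A", clusters 2 & 4 on "B"
gen_fac <- function() as.factor(c(
  sample(c("A", "B"), n, replace = TRUE, prob = c(prb, 1 - prb)),
  sample(c("A", "B"), n, replace = TRUE, prob = c(1 - prb, prb)),
  sample(c("A", "B"), n, replace = TRUE, prob = c(prb, 1 - prb)),
  sample(c("A", "B"), n, replace = TRUE, prob = c(1 - prb, prb))))
x1 <- gen_fac()
x2 <- gen_fac()

# numerics: clusters 1 & 2 shifted upwards, clusters 3 & 4 downwards
gen_num <- function() c(rnorm(n, muk), rnorm(n, muk), rnorm(n, -muk), rnorm(n, -muk))
x3 <- gen_num()
x4 <- gen_num()

x <- data.frame(x1, x2, x3, x4)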
Given the knowledge that there are four clusters in the data, a straightforward call for k-prototypes clustering of the data will be:
kpres <- kproto(x = x, k = 4)
kpres # output 1
summary(kpres) # output 2
library(wesanderson)
par(mfrow=c(2,2))
clprofiles(kpres, x, col = wes_palette("Royal1", 4, type = "continuous")) # figure 1
The resulting output is of the form:
# Output 1:
Numeric predictors: 2
Categorical predictors: 2
Lambda: 5.52477
Number of Clusters: 4
Cluster sizes: 100 95 101 104
Within cluster error: 332.909 267.1121 279.2863 312.7261
Cluster prototypes:
x1 x2 x3 x4
92 A A 1.4283725 1.585553
54 A A -1.3067973 -1.091794
205 B B -1.4912422 -1.654389
272 B B 0.9112826 1.133724
# Output 2 (only for variable x1 and x3):
x1
cluster A B
1 0.890 0.110
2 0.905 0.095
3 0.069 0.931
4 0.144 0.856
-----------------------------------------------------------------
x3
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 -0.7263 0.9314 1.4080 1.4280 2.1280 4.5110
2 -3.4200 -1.9480 -1.3170 -1.3070 -0.6157 2.2450
3 -4.2990 -2.0820 -1.4600 -1.4910 -0.7178 0.2825
4 -1.5300 0.2788 0.9296 0.9113 1.5000 3.1480
-----------------------------------------------------------------
The first two as well as the last two cluster prototypes share the same
mode in the factor variables but they can be distinguished by their
location with respect to the numeric variables. For clusters 1 & 4 (as
opposed to 2 & 3), it is vice versa. Note that the order of the
identified clusters is not necessarily the same as in cluster
generation. Calling summary()
and clprofiles()
provides further
information on the homogeneity of the clusters. A color palette can be
passed to represent the clusters across the different variables for the
plot; here it is taken from the package
wesanderson
(Ram and Wickham 2018).
Figure 1: Cluster profiles for the variables x1 and x3.

By construction, taking into account either only numeric (k-means) or
only factor variables (k-modes) will not be able to identify the
underlying cluster structure without further preprocessing of the data
in this example. A performance comparison using the Rand index
(Rand 1971) as computed by the package
clusteval (Ramey 2012)
can be carried out as follows:
library(klaR)
library(clusteval)
kmres  <- kmeans(x[, 3:4], 4)   # k-means using the numeric variables only
kmores <- kmodes(x[, 1:2], 4)   # k-modes using the factor variables only
# clusid contains the true cluster memberships from the data generation
cluster_similarity(kpres$cluster,  clusid, similarity = "rand")
cluster_similarity(kmres$cluster,  clusid, similarity = "rand")
cluster_similarity(kmores$cluster, clusid, similarity = "rand")
The runtime of kproto()
is linear in the number of observations
(Huang 1997b) and thus it is also applicable to large data sets.
Figure 2 (left) shows the behaviour of the runtime
for 50 repeated simulations of the example data with an increased number
of variables (half of them numeric and half of them categorical). It is
possible to run kproto()
for more than a hundred variables, which is far
beyond most practical applications where human interpretation of the
resulting clusters is desired. Note that clustMixType is written in R
and currently no C++ code is used to speed up computations which could
be a subject of future work.
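Such a timing experiment is not shown in the paper, but a minimal sketch on resampled example data (the sizes are arbitrary) could look as follows:

# a minimal sketch: timing kproto() for growing (resampled) data sizes
ns <- c(1000, 5000, 25000)  # arbitrary sizes
elapsed <- sapply(ns, function(n) {
  xx <- x[sample(nrow(x), n, replace = TRUE), ]   # resample example data to size n
  system.time(kproto(xx, k = 4, verbose = FALSE))["elapsed"]
})
plot(ns, elapsed, type = "b", xlab = "# Observations", ylab = "Seconds")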
In order to determine the appropriate number of clusters for a data set,
we can use the standard scree test. In this case, the objective function
E is given by the tot.withinss element of the output. The kproto()
function is run multiple times for varying numbers of clusters (but
fixed lambda), and the number of clusters is chosen as the smallest
value beyond which an additional cluster no longer leads to a relevant
reduction of the objective function. The homogeneity of the resulting
solution can then be investigated using summary(), clprofiles(), or the
withinss element of the "kproto" output.
Es <- numeric(10)
for(i in 1:10){
kpres <- kproto(x, k = i, nstart = 5)
Es[i] <- kpres$tot.withinss
}
plot(1:10, Es, type = "b", ylab = "Objective Function", xlab = "# Clusters",
main = "Scree Plot") # figure 2
The clustMixType package provides a user-friendly way for clustering
mixed-type data in R given by the k-prototypes algorithm. As opposed to
other packages, no preprocessing of the data is necessary, and in
contrast to standard hierarchical approaches, it is not restricted to
moderate data sizes. Its application requires the specification of two
hyperparameters: the number of clusters k and the parameter lambda,
which balances the influence of the numeric and the categorical
variables on the distance computation. The package provides support for
both choices, namely the scree test for k and lambdaest() for lambda.

This paper is based on clustMixType version 0.1-36. Future work may
focus on the development of further guidance regarding the choice of the
parameter lambda.