The heuristic
Cluster analysis offers a useful way to organize and represent complex
data sets. It is used routinely for data analysis in fields such as
bioinformatics. The
However, the disadvantages of various heuristic
In response to this challenge, our objective is to develop a practical
and convenient clustering algorithm in R to guarantee optimality in a
one-dimensional (1-D) space. The motivation originated from discrete
system modeling such as Boolean networks
(Kauffman 1969, 1993) and generalized logical networks
(Song et al. 2009),
where continuous values must be converted to discrete ones before
modeling. Our algorithm follows a comparable dynamic programming
strategy used in a 1-D quantization problem to preserve probability
distributions (Song et al. 2010), the segmented
least squares problem and the knapsack problem
(Kleinberg and Tardos 2006). The objective function in the
We implemented the algorithm in the R package
Ckmeans.1d.dp
(Song and Wang 2011) and evaluated its performance by simulation studies.
We demonstrate how the standard
We first define the
We introduce a dynamic programming algorithm to guarantee optimality of
clustering in 1-D. We define a sub-problem as finding the minimum
withinss of clustering
Using the above recurrence, we can obtain
To find a clustering of data with the minimum withinss of
The space complexity of the dynamic programming is
We implemented this dynamic programming algorithm and created an R
package
Ckmeans.1d.dp. To
take advantage of the runtime speed up by a compiled language over R, an
interpreted language, we developed the dynamic programming solution in
C++. We compiled the C++ code to a binary dynamic library, which can be
called within R through function Ckmeans.1d.dp()
provided in the
package.
We simulated data sets containing clusters using 1-D Gaussian mixture
models. In a Gaussian mixture model, the number of components is the
true number of clusters. The mean of each Gaussian component is randomly
sampled from the uniform distribution from
We first compared the optimality of Ckmeans.1d.dp()
and the standard
kmeans()
function in
R. The core of the kmeans()
function is also implemented in C/C++. As
Ckmeans.1d.dp()
guarantees optimality of clustering, we define the
relative difference from the kmeans()
clustering result to the optimal
value produced by Ckmeans.1d.dp()
as
The data sets were randomly generated from Gaussian mixture models with
Ckmeans.1d.dp()
and kmeans()
on input
data given the correct number of clusters kmeans()
to the optimal value produced by
Ckmeans.1d.dp()
is shown as a function of
Although one hopes to get better clusters by restarting Ckmeans.1d.dp()
in optimality.
We then compared the empirical runtime of Ckmeans.1d.dp()
with the
standard
We ran both programs on each input array ten times to obtain average empirical runtimes in three different settings. All simulations were run on a desktop computer with an Intel Core i7 860 processor and 8GB memory, running OpenSUSE 11.3 and R 2.11.1.
In the first setting (Figure 2), runtime is
obtained as a function of input data size from Ckmeans.1d.dp()
and kmeans()
without constraints on optimality.
The number of components in the Gaussian mixture models is fixed to
According to Figure 2, as the input size
increases, the runtime of Ckmeans.1d.dp()
increases faster than the
runtime of kmeans()
. However, the result returned by kmeans()
may
not be optimal (the restart of kmeans()
was set to 1). Without any
constraints on optimality, kmeans()
demonstrates an advantage in
runtime over Ckmeans.1d.dp()
of runtime quadratic in
In the second setting (Figure 3),
runtime is obtained as a function of number of clusters for
Ckmeans.1d.dp()
and kmeans()
without constraints on optimality. All
input data sets are of the same size
In Figure 3, the runtime of
Ckmeans.1d.dp()
increases linearly with the number of clusters kmeans()
. kmeans()
, without any constraints on
optimality, still shows an advantage in absolute runtime over
Ckmeans.1d.dp()
.
In the third setting (Figure 4), we
plot the runtime of Ckmeans.1d.dp()
and kmeans()
as a function of
the number of clusters, obtained from exactly the same input data in the
second setting, but with a constraint on the optimality of
Since Ckmeans.1d.dp()
returns an optimal result in only one run, its
runtime is the same as setting 2. On the other hand, kmeans()
was
restarted a number of times in order to approach an optimal solution
within the given tolerance. The number of restarts increases
substantially to produce a nearly optimal solution as the number of
clusters Ckmeans.1d.dp()
increases linearly with kmeans()
appears to increase exponentially with kmeans()
almost always far exceeds Ckmeans.1d.dp()
when Ckmeans.1d.dp()
over kmeans()
in
both runtime and optimality when
The R source code for the simulation studies is available from http://www.cs.nmsu.edu/~joemsong/R-Journal/Ckmeans-Simulation.zip.
The Ckmeans.1d.dp
package implements the dynamic programming algorithm for 1-D clustering
that we described above. In the package, the Ckmeans.1d.dp()
function
performs the clustering on a 1-D input vector using a given number of
clusters.
The following example illustrates how to use the package.
Figure 5 visualizes the input data and the cluster
result obtained by Ckmeans.1d.dp()
in this example.
# a one-dimensional example
# with a two-component Gaussian mixture model
x <- rnorm(50, mean = 1, sd = 0.3)
x <- append(x, rnorm(50, sd = 0.3) )
result <- Ckmeans.1d.dp(x, 2)
plot(x, col = result$cluster)
abline(h = result$centers, col = 1:2, pch = 8,
cex = 2)
The main purpose of our work is to develop an exact solution to 1-D
clustering in a practical amount of time, as an alternative to heuristic
There are two limitations of our Ckmeans.1d.dp method. It only applies to 1-D data and its quadratic runtime in the sample size may not be acceptable in all applications.
However, where applicable, the impact of our method is its optimality and repeatability. As our simulation studies show, when the number of clusters is big, our method not only guarantees optimality but also has a fast runtime. In this situation, our method is preferred. We applied our Ckmeans.1d.dp to quantize biological data in the DREAM Challenges (Prill 2010) for qualitative dynamic modeling using generalized logical networks (Song et al. 2009).
It is of great interest if the 1-D dynamic programming strategy can be
extended to multiple dimensional spaces so that clustering can be done
in polynomial time to
We gratefully acknowledge the financial support to the reported work by the grant number HRD-0420407 from the U.S. National Science Foundation.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wang, " Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming", The R Journal, 2011
BibTeX citation
@article{RJ-2011-015, author = {Wang, Mingzhou Song and Haizhou}, title = { Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming}, journal = {The R Journal}, year = {2011}, note = {https://rjournal.github.io/}, volume = {3}, issue = {2}, issn = {2073-4859}, pages = {29-33} }