General clustering algorithms do not guarantee optimality because of the hardness of the problem. Polynomial-time methods can find the exact optimum only in special cases. For example, dynamic programming can solve the one-dimensional clustering problem, i.e., when the items to be clustered can be characterised by a single scalar value. Optimal one-dimensional clustering is provided by package Ckmeans.1d.dp in R. This paper shows a possible generalisation of the method implemented in this package to multidimensional data: the dynamic programming method can be applied to find the optimal clustering of vectors when only subsequent items may form a cluster. Sequential data are common in various fields including telecommunication, bioinformatics, marketing and transportation. The proposed algorithm can determine the optima for a range of cluster numbers in order to support the case when the number of clusters is not known in advance.
Clustering plays a key role in various areas including data mining, character recognition, information retrieval and machine learning, applied in diverse fields such as marketing, medicine, engineering and computer science. A clustering algorithm forms groups of similar items in a data set, which is a crucial step in analysing complex data. Clustering can be formulated as an optimisation problem assigning items to clusters while minimising the distances among the cluster members. Commonly used clustering algorithms usually do not find the optimal solution because the clustering problem is NP-complete in the general case. This paper introduces a package implementing an optimisation method for clustering in a special case when a sequential constraint must be met, i.e., when the items to be clustered are ordered and only subsequent items may form a cluster. This constraint is common when clustering data streams, e.g., audio and video streams, trajectories, motion tracks, click-streams etc. The good news is that the exact optimum can be found in polynomial time in this case.
The algorithm recommended for clustering sequential data is based on the dynamic programming approach developed by Wang and Song (2011). They gave a polynomial-time algorithm for one-dimensional clustering, i.e., when the items can be characterised by only one scalar number. Similarly to the heuristic k-means method, the algorithm minimises the within-cluster sum of squared distances. Although the original algorithm was developed to find the optimal solution with exactly k clusters, it can also determine the optima for all cluster numbers up to k, which supports the case when the number of clusters is not known in advance.
The remainder of this paper is organised as follows. The next section gives a brief overview of related work. Then the optimisation problem is formally described and the developed optimisation algorithm is introduced in detail. Some evaluation results are presented and the usage of the implemented package is demonstrated. A brief summary concludes the paper.
Several clustering models and a broad variety of clustering methods are available in the literature (Tan et al. 2006; Jain 2010). Minimising the within-cluster sum of squared distances is one of the most common optimisation criteria.
As mentioned, one-dimensional clustering can be solved in polynomial time (Wang and Song 2011). The problem represents a special kind of clustering with a sequential constraint because a necessary condition for optimality in one dimension is that only subsequent items may form a cluster if the items are considered in their scalar order. The package presented in this paper generalises the one-dimensional clustering method to the multidimensional case. The dynamic programming method used for optimal clustering in one dimension is essentially the same as the one first applied by Bellman (1961) for linear curve approximation. For this reason, our package can also be considered an implementation of the optimal dynamic programming clustering method proposed by Bellman.
Several papers (e.g., Himberg et al. (2001); Terzi (2006); Tierney et al. (2014)) deal with clustering with sequential constraints because processing data sequences has a broad application area. Leiva and Vidal (2013) gave a clustering algorithm called Warped K-Means for clustering sequentially distributed data.
Clustering methods divide a dataset X = {x_1, x_2, ..., x_n} of d-dimensional vectors into k disjoint clusters C_1, C_2, ..., C_k, where each item is assigned to exactly one cluster.

Now, we can formulate our problem as follows:

Input:
items to be clustered: x_1, x_2, ..., x_n, where x_i ∈ R^d;
number of clusters: k.

Output:
optimal clustering: an assignment of the items to the clusters C_1, C_2, ..., C_k.

Minimise
the within-cluster sum of squared distances (withinss): D = Σ_{m=1..k} Σ_{x_i ∈ C_m} ||x_i − μ_m||², where μ_m denotes the centroid (mean) of cluster C_m.

Subject to
sequential constraint: only subsequent items may form a cluster, i.e., each cluster consists of a contiguous segment x_j, ..., x_i of the input sequence;
general clustering conditions: each item is clustered and exactly one cluster is assigned to each item.
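To make the objective concrete, the following sketch computes the within-cluster sum of squared distances of a given sequential clustering. This is a minimal Python illustration with hypothetical helper names; it is not part of the R package.

```python
# Compute the withinss objective of a sequential (contiguous) clustering.
# Hypothetical helper for illustration only.

def withinss(points, boundaries):
    """points: list of d-dimensional tuples; boundaries: list of
    (first, last) 0-based inclusive index pairs of contiguous clusters
    covering the whole sequence. Returns the total within-cluster sum
    of squared distances from the cluster centroids."""
    total = 0.0
    for first, last in boundaries:
        seg = points[first:last + 1]
        d = len(seg[0])
        # centroid (mean) of the segment, coordinate by coordinate
        centroid = [sum(p[c] for p in seg) / len(seg) for c in range(d)]
        # squared Euclidean distances of the segment items to the centroid
        total += sum((p[c] - centroid[c]) ** 2 for p in seg for c in range(d))
    return total
```

For example, splitting the sequence (1), (1), (10), (10) into the two contiguous clusters {items 1–2} and {items 3–4} gives a withinss of 0, while keeping it in a single cluster gives 81.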
The recursive formula used in the dynamic programming formulation is based on the fact that if a clustering of the first i items into m clusters is optimal, then, after removing its last cluster, the remaining clustering of the first j − 1 items into m − 1 clusters must also be optimal. The recursive formula applied in the dynamic programming approach is defined as follows (Wang and Song 2011):

D[i, m] = min_{m ≤ j ≤ i} { D[j − 1, m − 1] + d(x_j, ..., x_i) },

where D[i, m] denotes the minimal within-cluster sum of squared distances of clustering the first i items into m clusters, and d(x_j, ..., x_i) denotes the sum of squared distances of items x_j, ..., x_i from their centroid.
The optimal solution can be determined by dynamic programming in two steps. First, the recursive formula is used to find the minimal within-cluster sum of squared distances D[n, m] for each cluster number m; matrix B stores the first index of the last cluster for each subproblem so that the clusters can be recovered later by backtracking:

D[0, 0] := 0 // initialisation: only an empty prefix forms zero clusters
for i := 1 to n
    D[i, 0] := MAX_DOUBLE // i > 0 items cannot form zero clusters
for i := 1 to n
    for m := 1 to min(i, k)
        D[i, m] := MAX_DOUBLE
        for j := i downto m // calculating the recursive formula
            if D[i, m] > D[j - 1, m - 1] + d(xj, ..., xi)
                D[i, m] := D[j - 1, m - 1] + d(xj, ..., xi)
                B[i, m] := j
Return D[n, m] for m = 1, ..., k
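The pseudocode above can be sketched in Python as follows. This is an illustration only, not the package's C++ implementation; the segment cost d(x_j, ..., x_i) is evaluated in O(d) time from prefix sums, as assumed in the runtime analysis.

```python
# Sketch of the dynamic programming fill (illustration only; the package
# implements this step in C++).

def dp_fill(x, k):
    """x: list of d-dimensional points, k: maximal cluster number.
    Returns (D, B) with D[i][m] = minimal withinss of clustering the first
    i items into m clusters, B[i][m] = first index of the last cluster."""
    n, d = len(x), len(x[0])
    INF = float("inf")
    # prefix sums of coordinates and of squared norms (1-based indices)
    S = [[0.0] * d for _ in range(n + 1)]
    Q = [0.0] * (n + 1)
    for i in range(1, n + 1):
        for c in range(d):
            S[i][c] = S[i - 1][c] + x[i - 1][c]
        Q[i] = Q[i - 1] + sum(v * v for v in x[i - 1])

    def seg_cost(j, i):
        """d(x_j, ..., x_i) for 1-based, inclusive indices:
        sum ||x||^2 minus ||sum x||^2 / length."""
        length = i - j + 1
        sq_norm_of_sum = sum((S[i][c] - S[j - 1][c]) ** 2 for c in range(d))
        return (Q[i] - Q[j - 1]) - sq_norm_of_sum / length

    D = [[INF] * (k + 1) for _ in range(n + 1)]
    B = [[0] * (k + 1) for _ in range(n + 1)]
    D[0][0] = 0.0  # only an empty prefix forms zero clusters
    for i in range(1, n + 1):
        for m in range(1, min(i, k) + 1):
            for j in range(i, m - 1, -1):  # last cluster is x_j, ..., x_i
                cand = D[j - 1][m - 1] + seg_cost(j, i)
                if cand < D[i][m]:
                    D[i][m] = cand
                    B[i][m] = j
    return D, B
```

For instance, for the one-dimensional sequence 1, 1, 1, 10, 10, 10 and k = 2, the fill yields D[6][1] = 121.5 (a single cluster) and D[6][2] = 0 with the last cluster starting at item 4.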
The algorithm finds the exact optimum in polynomial time. It runs in O(k · n² · d) time, where n denotes the number of items, k the number of clusters and d the dimension of the vectors, provided that the segment costs d(x_j, ..., x_i) are updated incrementally in O(d) time.
The optimal solution with cluster number m can be obtained from matrix B by backtracking:

the mth cluster consists of items B[n, m], ..., n
j := B[n, m]
for l := m downto 2
    the (l - 1)th cluster consists of items B[j - 1, l - 1], ..., j - 1
    j := B[j - 1, l - 1]
Backtracking can be performed in linear time in the number of clusters (O(k)).
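The backtracking step can be sketched in Python as follows. This is an illustration only; `backtrack` is a hypothetical helper, and B is assumed to be the matrix produced by the dynamic programming fill, with B[i][m] holding the first index of the last cluster of subproblem (i, m).

```python
# Sketch of backtracking (illustration only, not the package's C++ code).
# B[i][m] holds the first index of the last cluster when the first i items
# are optimally split into m clusters; item indices are 1-based.

def backtrack(B, n, m):
    """Recover the clusters as (first, last) item index pairs."""
    bounds = []
    end, l = n, m
    while l >= 1:
        start = B[end][l]          # the lth cluster is items start, ..., end
        bounds.append((start, end))
        end, l = start - 1, l - 1  # continue with the remaining prefix
    bounds.reverse()               # clusters were collected from the back
    return bounds

# tiny hand-built example: 6 items optimally split into items 1..3 and 4..6
B = [[0] * 3 for _ in range(7)]
B[6][2] = 4  # last cluster of the full sequence starts at item 4
B[3][1] = 1  # the remaining prefix of 3 items forms a single cluster
```

For the matrix above, backtrack(B, 6, 2) yields [(1, 3), (4, 6)]; each step reads a single entry of B, so the running time is linear in the number of clusters.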
We would like to mention how the dynamic programming method can be combined with methods finding the proper number of clusters. The cluster number is typically determined by using a measure of validity (e.g., the within-cluster sum of squared distances). Backtracking needs to be performed only once: either the number of clusters is known in advance, or it can be selected by analysing the minimal withinss values D[n, m] obtained for the different cluster numbers.
We implemented this dynamic programming algorithm in C++. The implementation was built on source code from the R package Ckmeans.1d.dp and we created a new R package, clustering.sc.dp. In the name of the package, sc and dp refer to the sequential constraint and dynamic programming, respectively. The open-source approach made it possible to reuse the code from package Ckmeans.1d.dp, and it was beneficial for the original package as well: we suggested a minor change that speeds up the code, which was incorporated into Ckmeans.1d.dp.
If the sequential constraint is considered in clustering, then the optimal within-cluster sum of squared distances is at least as large as in the unconstrained case, because every clustering satisfying the constraint is also a feasible unconstrained clustering.

We compared the dynamic programming algorithm with the kmeans() function in R, which provides the Hartigan and Wong (1979) implementation of the k-means heuristic. We used the relative difference of the kmeans() result to the optimal value returned by clustering.sc.dp() to measure the deviation from the optimum. The input data set of size 10,000 was generated as a random walk having a step size that varies according to an exponential distribution with rate 1.0 in each coordinate.

We tested the dynamic programming algorithm on inputs with different sizes, dimensions and cluster numbers in order to find its performance bounds and examine experimentally how the running time depends on the input size. The simulations were run on a desktop computer with a Pentium Dual-Core 2.93 GHz processor and 4 GB memory, running the Windows 10 operating system. We generated multidimensional Gaussian random walks as data sets for the performance tests. The steps between subsequent items were generated independently for each coordinate by using a Gaussian distribution with zero mean and standard deviation of 0.1.
In the first setting (Figure 2), the runtime of clustering.sc.dp() is measured as a function of the input data size. The size of the input varies from 1,000 to 30,000 with a step size of 1,000. The input data consist of two-dimensional vectors. The number of clusters is set to 2. The runtime increases quadratically with the number of items to be clustered.
Table 1 presents runtime data for different magnitudes of the number of items in order to find the performance bounds of the method. One can see that the optimal clustering was found very quickly for 1,000 items, while it took more than a second, almost two and a half minutes, and about four hours for 10,000, 100,000 and one million items, respectively.
| Size | 1,000 | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|---|
| Runtime (sec) | 0.03 | 1.19 | 144.00 | 14,479.88 |
In the second performance test, we examine the dependency of the runtime on the dimension. All input data sets are of the same size, 10,000, and the dimension of the vectors varies from 1 to 512, doubling in each run. Figure 3 shows the results. The algorithm runs in less than a second if the input contains one-dimensional vectors (i.e., scalars). The runtime increases linearly with the dimension, and the problem can be solved within two and a half minutes if the dimension of the processed vectors is 512.
In the third performance test, the runtime is examined as a function of the number of clusters, which varies between 1 and 25. The input data set consists of 10,000 two-dimensional vectors. The black line with circles in Figure 4 shows the result. The runtime increases linearly with the number of clusters.
Finally, we compared the runtime of the different functions within the package. The input data were the same as in the previous setting. One can see in Figure 4 that the runtime of findwithinss.sc.dp() is slightly larger than that of clustering.sc.dp(). The runtime of backtracking.sc.dp() remains small even for large cluster numbers. The package has three typical ways of usage:

- clustering.sc.dp() can be called when the number of clusters is known in advance.
- findwithinss.sc.dp() can be called first and then backtracking.sc.dp() for a selected number of clusters.
- findwithinss.sc.dp() can be called first and then backtracking.sc.dp() for all possible numbers of clusters.

The subsequent call of findwithinss.sc.dp() and backtracking.sc.dp() is only slightly slower than calling clustering.sc.dp(). Furthermore, the total runtime of calling findwithinss.sc.dp() for a given cluster number and then backtracking.sc.dp() for all cluster numbers less than or equal to it is close to the runtime of clustering.sc.dp() called once for the same cluster number. This is much faster than calling clustering.sc.dp() repeatedly for each cluster number, which is possible because the process is split into the phases of finding the optimal within-cluster sums of squared distances (findwithinss.sc.dp()) and backtracking (backtracking.sc.dp()).
Figure 4: Runtime of the functions clustering.sc.dp(), findwithinss.sc.dp() and backtracking.sc.dp() as a function of the cluster number, together with the total running time of backtracking.sc.dp() when it is called for all cluster numbers less than or equal to the given cluster number.

R package clustering.sc.dp offers functions to perform optimal clustering on multidimensional data with a sequential constraint. Method clustering.sc.dp() can find the optimal clustering if the number of clusters is known. Otherwise, methods findwithinss.sc.dp() and backtracking.sc.dp() can be used.
The following examples illustrate how to use the package. Function clustering.sc.dp() outputs the same fields describing the clustering as Ckmeans.1d.dp() does in the original package. Figure 5 visualises the input data and the clusters created by clustering.sc.dp(). See below for the R source code of the example.
# Example1: clustering data generated from a random walk
x <- rbind(0, matrix(rnorm(99 * 2, 0, 0.1), nrow = 99, ncol = 2))
x <- apply(x, 2, cumsum)
k <- 2
result <- clustering.sc.dp(x, k)
plot(x, type = "b", col = result$cluster)
points(result$centers, pch = 24, bg = 1:k)
The next example demonstrates the usage of functions findwithinss.sc.dp() and backtracking.sc.dp(). Similarly to the previous example, it also processes data of a random walk. Function findwithinss.sc.dp() finds the optimal within-cluster sums of squared distances for all cluster numbers from 1 to k and also returns the data needed for backtracking. In our example, the first cluster number where the total withinss drops below a threshold is selected. Then backtracking.sc.dp() outputs the optimal clustering for the selected cluster number in the same format as clustering.sc.dp() does, without running the whole clustering process again (Figure 6).
# Example2: clustering data generated from a random walk with small withinss
x <- rbind(0, matrix(rnorm(99 * 2, 0, 0.1), nrow = 99, ncol = 2))
x <- apply(x, 2, cumsum)
k <- 10
r <- findwithinss.sc.dp(x, k)
# select the first cluster number where withinss drops below a threshold
k_th <- which(r$twithinss <= 5.0)[1]
# backtrack
result <- backtracking.sc.dp(x, k_th, r$backtrack)
plot(x, type = "b", col = result$cluster)
points(result$centers, pch = 24, bg = 1:k_th)
Clustering data with a sequential constraint is a polynomial-time solvable variant of the clustering problem. This paper introduced a package implementing a dynamic programming approach that finds the exact optimum of the problem. The algorithm represents an extension of the one-dimensional dynamic programming strategy of Ckmeans.1d.dp to multidimensional spaces, which was posed as an open problem in the paper of Wang and Song (2011). The package supports both the case when the exact number of clusters is given and when the number of clusters is not known in advance. Due to its optimality, it can also be used to evaluate approximation algorithms for clustering with sequential constraints. The runtime evaluations indicate how fast the algorithm can solve problems of different sizes and parameters. Our future plan is to use the dynamic programming method in video summarisation.
Research is supported by the Hungarian National Development Agency under grant HUMAN_MB08-1-2011-0010.
Ckmeans.1d.dp, clustering.sc.dp
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Szkaliczki, "clustering.sc.dp: Optimal Clustering with Sequential Constraint by Using Dynamic Programming", The R Journal, 2016
BibTeX citation
@article{RJ-2016-022,
  author = {Szkaliczki, Tibor},
  title = {clustering.sc.dp: Optimal Clustering with Sequential Constraint by Using Dynamic Programming},
  journal = {The R Journal},
  year = {2016},
  note = {https://rjournal.github.io/},
  volume = {8},
  issue = {1},
  issn = {2073-4859},
  pages = {318-327}
}