Protein structure data consist of several dihedral angles, lying on a multidimensional torus. Analyzing such data has been and continues to be key in understanding functional properties of proteins. However, most of the existing statistical methods assume that data are on Euclidean spaces, and thus they are ill-suited to angular data. In this paper, we introduce the package ClusTorus, specialized for analyzing multivariate angular data. The package collects some tools and routines to perform algorithmic clustering and model-based clustering for data on the torus. In particular, the package enables the construction of conformal prediction sets and predictive clustering, based on kernel density estimates and mixture model estimates. A novel hyperparameter selection strategy for predictive clustering is also implemented, with improved stability and computational efficiency. We demonstrate the use of the package in clustering protein dihedral angles from two real data sets.
Multivariate angular or circular data have found applications in some
research domains including geology (e.g., paleomagnetic directions) and
bioinformatics (e.g., protein dihedral angles). Due to the cyclic nature
of angles, usual vector-based statistical methods are not directly
applicable to such data.
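As a small illustration of why linear statistics can mislead on angles, the following base R sketch compares the arithmetic mean with the circular mean of two nearby angles. The helper `ang_dist` is a hypothetical name introduced here for illustration; it is not a ClusTorus function.

```r
# Two angles (in radians) close together on the circle, on either side of
# the 0 / 2*pi cut point.
theta <- c(0.1, 2 * pi - 0.1)

# The arithmetic mean ignores the wrap-around and lands near pi,
# on the opposite side of the circle.
linear_mean <- mean(theta)

# The circular mean averages the unit vectors (cos, sin) and maps the
# result back to an angle in [0, 2*pi).
circular_mean <- atan2(mean(sin(theta)), mean(cos(theta))) %% (2 * pi)

# Angular distance: the shorter of the two arcs between a and b.
# (ang_dist is a hypothetical helper, not part of ClusTorus.)
ang_dist <- function(a, b) {
  d <- abs(a - b) %% (2 * pi)
  pmin(d, 2 * pi - d)
}

print(c(linear = linear_mean, circular = circular_mean))
print(ang_dist(theta[1], theta[2]))
```

The circular mean sits at, or numerically right next to, the angle 0 (the same point as 2*pi on the circle), whereas the arithmetic mean is as far from both observations as it can be.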
A prominent example in which multivariate angular data appear is the
analysis of protein structures. As described in Branden and Tooze (1999), the functional
properties of proteins are determined by the ordered sequences of amino
acids and their spatial structures. These structures are determined by
several dihedral angles, and thus, protein structures are commonly
described on multidimensional tori.
Figure 1: Clustering results for the SARS_CoV_2 data obtained by clus.torus (top left) and kmeans.torus (top right), both implemented in ClusTorus, mixtools::mvnormalmixEM (bottom left), in which the number of components 3 is prespecified, and mclust::Mclust (bottom right), in which the number of components is chosen by BIC. Gray points in the top-left panel are "outliers", automatically assigned by clus.torus.
We introduce the R package
ClusTorus (Jung and Hong 2021)
which provides various tools for handling and clustering multivariate
angular data on the torus. The package provides angular adaptations of
usual clustering methods such as the k-means algorithm and distance-based clustering with pairwise angular distances.
For data on the torus, there are a few previous works for mixture
modeling and clustering. Mardia et al. (2007) proposed a mixture of bivariate
von Mises distributions for data on the 2-dimensional torus.
Algorithmic clustering for data on the torus has also been proposed. For
example, Gao et al. (2018) used an extrinsic k-means algorithm.
The main contribution of ClusTorus is an implementation of the
predictive clustering approaches of Jung et al. (2021) and Shin et al. (2019). For this,
the conformal prediction framework of Vovk et al. (2005) is extended for multivariate
angular data. The conformal prediction is a distribution-free method of
constructing prediction sets, and our implementation uses kernel density
estimates and mixture models, both based on the multivariate von Mises
distribution (Mardia et al. 2012). Furthermore, by using Gaussian-like
approximations of the von Mises distributions and a graph-theoretic
approach, flexible clusters, composed of unions of ellipsoids on the torus, can be identified.
The predictive clustering can be obtained by calling clus.torus
as follows.
library(ClusTorus)
set.seed(2021)
ex <- clus.torus(SARS_CoV_2)
plot(ex)
The result of the predictive clustering is visualized in the top left
panel of Figure 1, which is generated by plot(ex). The
dataset SARS_CoV_2, included in ClusTorus, collects the dihedral
angles (phi, psi) of a SARS-CoV-2 protein structure. The function clus.torus performs three core
procedures (conformal prediction, hyperparameter selection and cluster
assignment) for predictive clustering.
The rest of this article focuses on introducing the three core
procedures: (i) the conformal prediction framework, including our
choices of the conformity scores, (ii) hyperparameter selection and
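(iii) cluster assignment. After demonstrating how the package
ClusTorus can be used for clustering of the ILE data, we describe the all-in-one function clus.torus
and other clustering algorithms such as the extrinsic k-means and hierarchical clustering.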
The conformal prediction framework (Vovk et al. 2005) is one of the main
ingredients of our development. Based on the work of Vovk et al. (2005) and Lei et al. (2013, 2015), we briefly introduce the basic concepts and properties of
conformal prediction. Suppose that we observe a sample of size
For a given
Consider the following quantity:
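For orientation, the standard conformal construction in the style of Lei et al. (2013) can be written as below. The symbols are notational assumptions made here: $\sigma_i(x)$ denotes the conformity score of the $i$-th point in the sample augmented with $X_{n+1} = x$, $\pi(x)$ is the resulting p-value-like quantity, and $\alpha$ is the level.

```latex
\pi(x) \;=\; \frac{1}{n+1} \sum_{i=1}^{n+1} I\{\sigma_i(x) \le \sigma_{n+1}(x)\},
\qquad
C_n^{\alpha} \;=\; \{\, x : \pi(x) > \alpha \,\},
\qquad
P\left(X_{n+1} \in C_n^{\alpha}\right) \;\ge\; 1 - \alpha .
```

The coverage bound on the right holds whenever $X_1, \dots, X_{n+1}$ are exchangeable, which is why the method is called distribution-free.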
We demonstrate a construction of the conformal prediction set with a
kernel density estimate-based conformity score, defined later in
(2), for the data shown in Figure 1. With the
conformity score given by (2), cp.torus.kde computes the
conformal prediction set over a grid of evaluation points. Whether each grid point is included in the set is reported in the column Cn
of the output below. The columns Lminus and Lplus provide approximated prediction sets,
defined in Jung et al. (2021).
cp.kde <- cp.torus.kde(SARS_CoV_2)
cp.kde
Conformal prediction sets (Lminus, Cn, Lplus) based on kde with concentration 25
Testing inclusion to the conformal prediction set with level = 0.1 :
-------------
phi psi Lminus Cn Lplus level
1 0.00000000 0 FALSE FALSE FALSE 0.1
2 0.06346652 0 FALSE FALSE FALSE 0.1
3 0.12693304 0 FALSE FALSE FALSE 0.1
4 0.19039955 0 FALSE FALSE FALSE 0.1
5 0.25386607 0 FALSE FALSE FALSE 0.1
6 0.31733259 0 FALSE FALSE FALSE 0.1
7 0.38079911 0 FALSE FALSE FALSE 0.1
8 0.44426563 0 FALSE FALSE FALSE 0.1
9 0.50773215 0 FALSE FALSE FALSE 0.1
10 0.57119866 0 FALSE FALSE FALSE 0.1
9990 rows are omitted.
The concentration parameter of the von Mises kernel and the level of the prediction set can be specified through the arguments concentration and level. By default, these values are set
as concentration = 25 and level = 0.1.
The output cp.kde is an S3 object with class cp.torus.kde, for which
a generic method plot is available. The conformal prediction set for
SARS_CoV_2 data can be displayed on the Ramachandran plot, as follows.
The result is shown in Figure 2.
plot(cp.kde)
If the sample size
procedure inductive conformal
prediction(
Split the data randomly into
Construct
Put
Construct
end procedure
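A minimal sketch of the split-and-threshold idea behind Algorithm 1 is given below in base R, assuming a product von Mises kernel density estimate as the conformity score and an equal-size random split. The function names `vm_kde`, `icp_kde` and `in_set`, the toy data, and the threshold index `floor((n2 + 1) * level)` are assumptions made for illustration; they are not the internals of icp.torus.

```r
# Product von Mises kernel density estimate at the rows of `x`, built from
# `data` (both matrices of angles in radians, one row per observation).
vm_kde <- function(x, data, kappa = 25) {
  x <- as.matrix(x)
  data <- as.matrix(data)
  p <- ncol(data)
  # Log normalizing constant of a product of p von Mises kernels.
  log_norm <- p * log(2 * pi * besselI(kappa, nu = 0))
  apply(x, 1, function(pt) {
    # kappa * sum_j cos(data_ij - pt_j), one value per data point i.
    expo <- kappa * rowSums(cos(sweep(data, 2, pt)))
    mean(exp(expo - log_norm))
  })
}

# Inductive conformal prediction: fit the score on one half of the data,
# calibrate the threshold on the other half.
icp_kde <- function(data, level = 0.1, kappa = 25) {
  n <- nrow(data)
  idx1 <- sample(n, floor(n / 2))              # random split
  X1 <- data[idx1, , drop = FALSE]             # used to build the score
  X2 <- data[-idx1, , drop = FALSE]            # used for calibration
  scores <- sort(vm_kde(X2, X1, kappa))        # ordered conformity scores
  k <- floor((nrow(X2) + 1) * level)
  threshold <- if (k >= 1) scores[k] else -Inf
  list(X1 = X1, kappa = kappa, threshold = threshold)
}

# A point belongs to the prediction set if its score reaches the threshold.
in_set <- function(fit, x) vm_kde(x, fit$X1, fit$kappa) >= fit$threshold

# Toy data: two clusters on the 2-dimensional torus (hypothetical).
set.seed(1)
n <- 400
centers <- rbind(c(1, 1), c(4, 5))
lab <- sample(1:2, n, replace = TRUE)
toy <- (centers[lab, ] + matrix(rnorm(2 * n, sd = 0.3), n, 2)) %% (2 * pi)

fit <- icp_kde(toy, level = 0.1)
grid <- as.matrix(expand.grid(phi = seq(0, 2 * pi, length.out = 50),
                              psi = seq(0, 2 * pi, length.out = 50)))
inside <- in_set(fit, grid)
print(mean(inside))  # fraction of the grid covered by the prediction set
```

Because the score is built once from the first half of the data, testing inclusion of a new point needs a single density evaluation, which is what makes the inductive variant cheaper than recomputing scores for every candidate point.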
The function icp.torus implements Algorithm
1 for several prespecified conformity
scores. As already mentioned, we need to choose the conformity score
Before we discuss our choices of the conformity scores, we first
illustrate how the functions in ClusTorus are used to produce
inductive conformal prediction sets. The following code shows a
calculation of the inductive conformal prediction set for the data
SARS_CoV_2. The conformal prediction set with the conformity score
given by kernel density estimates (2) can be constructed by
icp.torus and icp.torus.eval. The function icp.torus computes the conformity scores, and
icp.torus.eval tests whether pre-specified evaluation points are
included in the prediction set. If no evaluation points are supplied, icp.torus.eval creates a grid of evaluation points (10,000 points in the 2-dimensional example below).
set.seed(2021)
icp.torus.kde <- icp.torus(SARS_CoV_2, model = "kde", concentration = 25)
icp.kde <- icp.torus.eval(icp.torus.kde, level = 0.1)
icp.kde
Conformal prediction set (Chat_kde)
Testing inclusion to the conformal prediction set with level = 0.1:
-------------
X1 X2 inclusion
1 0.00000000 0 FALSE
2 0.06346652 0 FALSE
3 0.12693304 0 FALSE
4 0.19039955 0 FALSE
5 0.25386607 0 FALSE
6 0.31733259 0 FALSE
7 0.38079911 0 FALSE
8 0.44426563 0 FALSE
9 0.50773215 0 FALSE
10 0.57119866 0 FALSE
9990 rows are omitted.
In the code above, the data splitting for icp.torus is done
internally, and can be inspected by icp.torus.kde$split.id.
We now introduce our choices for the conformity score
For the 2-dimensional case, Jung et al. (2021) proposed to use the kernel
density estimate based on the von Mises kernel (Marzio et al. 2011) for the
conformity score. This extends naturally to higher-dimensional tori, and the function kde.torus
provides the multivariate von Mises kernel density
estimation. For conformal prediction, we take
Our next choices of conformity scores are based on mixture models. Since
the multivariate normal distributions are not defined on
For any positive integer
On the other hand, Mardia et al. (2012) introduced an approximated density
function
We have implemented four conformity scores, described in the previous section. These are based on
kernel density estimate (2),
mixture model (5),
max-mixture model (6), and
ellipsoids obtained by approximating the max-mixture (9).
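To make the relation between the mixture, max-mixture and ellipsoid-type scores concrete, here is a base R sketch that replaces each von Mises component by a Gaussian-like density in the wrapped difference. The component list `comps`, the helper names, and the exact form of the ellipsoid score (twice the largest weighted log-density, up to an additive constant) are assumptions for illustration; equations (5), (6) and (9) remain the reference.

```r
# Wrap angular differences into [-pi, pi).
wrap <- function(d) ((d + pi) %% (2 * pi)) - pi

# Log of weight * Gaussian-like density of component (mu, Sigma) at angle x.
comp_logdens <- function(x, mu, Sigma, weight) {
  d <- wrap(x - mu)
  p <- length(mu)
  quad <- drop(t(d) %*% solve(Sigma) %*% d)
  log(weight) - 0.5 * quad - 0.5 * log(det(Sigma)) - 0.5 * p * log(2 * pi)
}

# Three conformity scores evaluated at a single point x.
scores <- function(x, comps) {
  ld <- vapply(comps,
               function(cj) comp_logdens(x, cj$mu, cj$Sigma, cj$weight),
               numeric(1))
  p <- length(x)
  c(mixture     = sum(exp(ld)),                 # sum over components
    max_mixture = max(exp(ld)),                 # largest weighted component
    ellipsoid   = max(2 * ld) + p * log(2 * pi)) # max_j of a quadratic form
}

# Hypothetical two-component model on the 2-dimensional torus.
comps <- list(
  list(mu = c(1, 1), Sigma = diag(0.1, 2), weight = 0.6),
  list(mu = c(4, 5), Sigma = matrix(c(0.2, 0.05, 0.05, 0.1), 2), weight = 0.4)
)
s <- scores(c(1.1, 0.9), comps)
print(s)
```

Under this assumed form, the level sets of the ellipsoid score are unions of ellipsoids, one per component, since each term in the maximum is a quadratic function of the wrapped difference.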
The function icp.torus in ClusTorus computes these conformity scores
using the inductive conformal prediction framework, and returns
icp.torus object(s). Table 1 illustrates
several important arguments of the function icp.torus.
Arguments | Descriptions |
---|---|
data | |
model | A string. One of "kde", "mixture", and "kmeans" which determines the model or estimation methods. If "kde", the model is based on the kernel density estimates. It supports the kde-based conformity score only. If "mixture", the model is based on the von Mises mixture, fitted with an EM algorithm. It supports the von Mises mixture and its variants based conformity scores. If "kmeans", the model is also based on the von Mises mixture, but the parameter estimation is implemented with the elliptical k-means algorithm illustrated in Appendix. It supports the log-max-mixture based conformity score only. If the dimension of data space is greater than 2, only "kmeans" is supported. Default is model = "kmeans". |
J | A scalar or numeric vector for the number(s) of components for model = c("mixture", "kmeans"). Default is J = 4. |
concentration | A scalar or numeric vector for the concentration parameter(s) for model = "kde". Default is concentration = 25. |
The argument model of the function icp.torus indicates which
conformity score is used. By setting model = "kde", the kde-based
conformity score (2) is used. By setting model = "mixture", the
mixture model (5) is estimated by an EM algorithm, and
conformity scores based on (5), (6), (9) are
all provided. Setting model = "kmeans" provides a mixture model fit by
the elliptical k-means algorithm.
The arguments J and concentration specify the model fitting
hyperparameters. To compute conformity scores based on kernel density
estimate (2), one needs to specify the concentration parameter (argument concentration); for the mixture-based scores, one needs to specify the number of components (argument J). The function
icp.torus takes either a
single value (e.g., J = 4 is the default), or a vector (e.g.,
J = 4:30 or concentration = c(25,50)) for arguments J and
concentration. If J (or concentration) is a scalar, then
icp.torus returns an icp.torus object.
On the other hand, if J (or concentration) is a numeric vector
containing at least two values, then icp.torus returns a list of
icp.torus objects, one for each value in J (or concentration,
respectively). Typically, the hyperparameter is not predetermined, and a list of icp.torus objects, evaluated for each candidate in vector-valued
J (or concentration), is required for our hyperparameter selection
procedure, discussed in a later section.
Let us present an R code example for creating an icp.torus object,
fitted with model = "kmeans" (the default value for argument model)
and J = 12.
set.seed(2021)
icp.torus.12 <- icp.torus(SARS_CoV_2, J = 12)
plot(icp.torus.12, level = 0.1111)
The icp.torus object has an S3 method plot, and the R code
plot(icp.torus.12, level = 0.1111) plots the ellipses in (10)
with level = 0.1111. The union of these
ellipses is in fact the inductive conformal prediction set at this level.
The boundary of the prediction set, rather than the individual ellipses, can be displayed by setting
ellipse = FALSE, as follows.
plot(icp.torus.12, ellipse = FALSE)
The resulting graphic is omitted.
Conformity scores based on mixture model and its variants need
appropriate estimators of the parameters. The EM-based option model = "mixture" of icp.torus
works only for the 2-dimensional case. On the other hand, the elliptical
k-means algorithm (model = "kmeans") is applicable in any dimension.
Table 2 summarizes the four choices of conformity scores in terms of model-fitting methods, dimensionality of the data space, and whether clustering is available. Our predictive clustering is implemented only based on the "ellipsoids" conformity score (9). The rationale for this choice is the relatively simple form of the prediction sets (a union of ellipsoids (10)).
Conformity Scores | EM | k-means | dim = 2 | dim > 2 | Clustering |
---|---|---|---|---|---|
Kernel density (2) | | | ✓ | | |
Mixture (5) | ✓ | | ✓ | | |
Max-mixture (6) | ✓ | | ✓ | | |
Ellipsoids (9) | ✓ | ✓ | ✓ | ✓ | ✓ |
We now describe our clustering strategies using the conformal prediction
sets. Suppose for now that the level
We now describe how the cluster labels are assigned to data points. Each
data point included in the prediction set is automatically assigned to
the cluster which contains the point. For the data points which are not
included in the conformal prediction set, we have implemented two
different types for cluster assignment, as defined in Jung et al. (2021). The
first is to assign the closest cluster label. The notion of closest
cluster can be defined either by the Mahalanobis distance
The function cluster.assign.torus, which takes as input an icp.torus
object and level, returns an S3 object of class
cluster.obj, which includes the cluster memberships of all data points,
for each and every cluster assignment method we discussed above. The
output of cluster.assign.torus includes the number of clusters
detected, the cluster assignment results for the first 10 observations,
and cluster sizes, as shown in the code example below.
c <- cluster.assign.torus(icp.torus.12, level = 0.1111)
c
Number of clusters: 5
-------------
Clustering results by log density:
[1] 1 1 1 1 2 4 2 1 3 1
cluster sizes: 538 372 39 4 19
Clustering results by posterior:
[1] 5 5 5 5 2 4 3 5 3 5
cluster sizes: 6 310 104 4 548
Clustering results by representing outliers:
[1] 1 1 1 1 2 6 2 1 6 1
cluster sizes: 508 343 15 3 0 103
Note: cluster id = 6 represents outliers.
Clustering results by Mahalanobis distance:
[1] 1 1 1 1 2 4 2 1 3 1
cluster sizes: 533 372 39 4 24
962 clustering results are omitted.
The clustering results contained in the object c can be visualized as
follows.
plot(c, assignment = "log.density")
plot(c, assignment = "outlier")
The results are displayed in Figure 4. When the argument
assignment is not specified, the outlier disposing assignment is
chosen by default.
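The two assignment styles for points outside the prediction set (nearest cluster versus outlier) can be sketched in base R as follows. This is a simplified illustration: the list `ellipsoids`, the helper names, the common squared radius `radius2`, and the use of NA for outliers are all assumptions made here, whereas the package output above labels outliers with an extra cluster id.

```r
# Wrap angular differences into [-pi, pi).
wrap <- function(d) ((d + pi) %% (2 * pi)) - pi

# Squared Mahalanobis distance on the torus, using wrapped differences.
maha2 <- function(x, mu, Sigma) {
  d <- wrap(x - mu)
  drop(t(d) %*% solve(Sigma) %*% d)
}

# Assign x to the cluster of the nearest ellipsoid; optionally flag it as an
# outlier (NA) when it lies outside every ellipsoid of squared radius radius2.
assign_point <- function(x, ellipsoids, radius2, outlier = FALSE) {
  d2 <- vapply(ellipsoids, function(e) maha2(x, e$mu, e$Sigma), numeric(1))
  nearest <- which.min(d2)
  if (outlier && d2[nearest] > radius2) return(NA_integer_)
  ellipsoids[[nearest]]$cluster
}

# Hypothetical ellipsoids: the first two overlap and share cluster label 1.
ellipsoids <- list(
  list(mu = c(1.0, 1.0), Sigma = diag(0.1, 2), cluster = 1L),
  list(mu = c(1.5, 1.2), Sigma = diag(0.1, 2), cluster = 1L),
  list(mu = c(4.0, 5.0), Sigma = diag(0.1, 2), cluster = 2L)
)

x_far <- c(6.2, 0.9)  # close to the first ellipsoid only via wrap-around
print(assign_point(x_far, ellipsoids, radius2 = 4))
print(assign_point(x_far, ellipsoids, radius2 = 4, outlier = TRUE))
```

Treating far-away points as outliers leaves the remaining clusters tighter, at the cost of leaving some observations unlabeled.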
Poor choices of conformity score result in overly large prediction sets.
Thus, we need to choose the hyperparameters carefully for a better
conformal prediction set and for a better clustering performance. The
hyperparameters are the concentration parameter
We briefly review the criterion used in Jung et al. (2021). Assume for now that
mixture models are used; that is,
To evaluate (11), one needs to have a set of candidates for the hyperparameters. The function icp.torus
is designed to take as input a set of
hyperparameter candidates. As an example, the following code evaluates
the inductive conformal prediction sets for data SARS_CoV_2, fitted by
mixture models with the number of components given by each J = 3, ..., 35.
set.seed(2021)
icp.torus.objects <- icp.torus(SARS_CoV_2, J = 3:35)
The result, icp.torus.objects, is a list of 33 icp.torus objects.
Evaluating the criterion (11) of Jung et al. (2021) is implemented in the
function hyperparam.torus. There, the criterion (11) is termed
"elbow", since the minimizer
hyperparam.out <- hyperparam.torus(icp.torus.objects)
hyperparam.out
Type of conformity score: kmeans general
Optimizing method: elbow
-------------
Optimally chosen parameters. Number of components = 12 , alpha = 0.1111111
Results based on criterion elbow :
J alpha mu criterion
2241 12 0.1111111 0.1215 0.2326111
2244 12 0.1172840 0.1169 0.2341840
2242 12 0.1131687 0.1211 0.2342687
2243 12 0.1152263 0.1198 0.2350263
2001 11 0.1172840 0.1179 0.2351840
2240 12 0.1090535 0.1265 0.2355535
2245 12 0.1193416 0.1169 0.2362416
2002 11 0.1193416 0.1175 0.2368416
2494 13 0.1316872 0.1053 0.2369872
2004 11 0.1234568 0.1136 0.2370568
2246 12 0.1213992 0.1161 0.2374992
2003 11 0.1213992 0.1163 0.2376992
2005 11 0.1255144 0.1123 0.2378144
1999 11 0.1131687 0.1248 0.2379687
2239 12 0.1069959 0.1310 0.2379959
8004 rows are omitted.
Available components:
[1] "model" "option" "results" "icp.torus" "Jhat" "alphahat"
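In the printed results, each value in the criterion column equals the sum of the alpha and mu columns. The sketch below reproduces the first five rows on that basis and picks the minimizer. The data frame is typed in by hand from the output above, and reading the criterion as alpha + mu is an inference from these numbers rather than a statement about the package internals.

```r
# First five rows of the "elbow" results, copied from the printed output.
res <- data.frame(
  J     = c(12, 12, 12, 12, 11),
  alpha = c(0.1111111, 0.1172840, 0.1131687, 0.1152263, 0.1172840),
  mu    = c(0.1215, 0.1169, 0.1211, 0.1198, 0.1179)
)

# Recompute the criterion as the sum of the level and the mu column.
res$criterion <- res$alpha + res$mu

# The selected pair (J, alpha) is the row with the smallest criterion.
best <- res[which.min(res$criterion), ]
print(best)
```

A small alpha gives a large prediction set and a large alpha gives a small one, so minimizing the sum trades the two off against each other.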
It can be checked that the choice of
In computing the criterion (11), the volume
However, for high dimensional cases, for example
To this end, we have developed and implemented a computationally more
efficient procedure for hyperparameter selection, which also provides
more stable clustering results. This procedure is a two-step procedure,
first choosing the model parameter
Our approach is in contrast to the approaches in Lei et al. (2013) and
Shin et al. (2019) in which they only choose the model parameter for a
prespecified level
The first step of the procedure is to choose the number of components J. The function hyperparam.J
computes the minimizer of the chosen criterion over the candidate values of J.
procedure hyperparam.J(
Evaluate
Evaluate
Output
end procedure
The fitted models (a list of icp.torus objects) for various
J are required as input, and the criterion is specified by the argument option
of hyperparam.J. The argument option = "risk", "AIC", or "BIC"
is for the risk, AIC, or BIC, respectively. By
choosing
The second step is to choose the level. This step is implemented in the function hyperparam.alpha, and the algorithm is
described in Algorithm 3.
procedure hyperparam.alpha(fitted model
Evaluate the number of clusters
Set
For
Output
end procedure
Note that we could alternatively input an array of levels, for the
argument alphavec of hyperparam.alpha, if there is a prespecified
searching area. In our experience, setting alpha.lim
of hyperparam.alpha
, which is
In summary, we first choose the number of model components J, and then the level. The function hyperparam.torus
combines and implements Algorithms 2
and 3 sequentially and thus chooses both hyperparameters when the argument option
of hyperparam.torus is set
as option = "risk", "AIC", or "BIC". If option = "elbow" (the
default value, if the dimension of data is 2), the criterion (11) is used instead. The function hyperparam.torus
returns the chosen
hyperparameters as well as the corresponding icp.torus
object.
As an example, the following code applies the two-step procedure with
option = "risk" to icp.torus.objects we evaluated earlier.
hyperparam.risk.out <- hyperparam.torus(icp.torus.objects, option = "risk")
hyperparam.risk.out
Type of conformity score: kmeans general
Optimizing method: risk
-------------
Optimally chosen parameters. Number of components = 12 , alpha = 0.132716
Results based on criterion risk :
J criterion
1 3 2016.575
2 4 1990.566
3 5 1907.887
4 6 1922.430
5 7 1924.768
... (omitted)
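With the option "risk", the chosen hyperparameters are J = 12 and alpha = 0.132716, close to the values J = 12 and alpha = 0.1111111 chosen by "elbow". The results of hyperparameter selection can be visualized by plot(hyperparam.out)
and
plot(hyperparam.risk.out). (The resulting graphic is omitted.)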
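For the information-criterion options, the first step amounts to minimizing a penalized log-likelihood over the candidate values of J. The base R sketch below uses the textbook definitions of AIC and BIC. The log-likelihood values, the sample size, and the parameter count `n_par` (a mean vector and an unconstrained covariance matrix per component, plus the mixing weights) are hypothetical and are not taken from hyperparam.J.

```r
# Number of free parameters of a J-component mixture in dimension p,
# assuming unconstrained covariance matrices (an illustrative count).
n_par <- function(J, p) J * (p + p * (p + 1) / 2) + (J - 1)

# Choose J by minimizing AIC or BIC.
select_J <- function(J, loglik, n, p, option = c("AIC", "BIC")) {
  option <- match.arg(option)
  k <- n_par(J, p)
  penalty <- if (option == "AIC") 2 * k else log(n) * k
  crit <- -2 * loglik + penalty
  list(Jhat = J[which.min(crit)], criterion = crit)
}

# Hypothetical log-likelihoods for J = 3, ..., 7 fitted to n = 500 points.
J <- 3:7
loglik <- c(-1000, -960, -930, -922, -920)

aic <- select_J(J, loglik, n = 500, p = 2, option = "AIC")
bic <- select_J(J, loglik, n = 500, p = 2, option = "BIC")
print(c(AIC = aic$Jhat, BIC = bic$Jhat))
```

With these made-up numbers the heavier BIC penalty selects a smaller model than AIC, which is the usual pattern when the likelihood gain from extra components is modest.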
In the next section, the two-step procedures for hyperparameter
selection are used in a cluster analysis of data on
In this section, we give an example of clustering ILE data on the 4-dimensional torus.
ILE is a dataset included in ClusTorus, which
represents the structure of the isoleucine. This dataset is obtained by
collecting several different .pdb files in the Protein Data Bank
(Berman et al. 2003). We used PISCES (Wang and Dunbrack 2003) to select high-quality protein data, by
using several benchmarks (resolution is 1.6Å, among others). The ILE
data consist of observations on
four variables (angles). For predictive clustering of ILE
data, the conformal prediction sets
and scores are built from mixture models, fitted with the elliptical
k-means algorithm (model = "kmeans"). Other choices of models
such as "kde" and "mixture" are not applicable for this data set
with dimension greater than 2. The candidate models are fitted by icp.torus, with J = 10:40
indicating
the candidates of J.
set.seed(2021)
icp.torus.objects <- icp.torus(ILE, J = 10:40)
The next step is to select the hyperparameters using hyperparam.torus. As discussed
in the previous section, for this data set with dimension greater than 2, hyperparam.torus
uses
the two-step procedure with
option = "risk"
as the default choice for the criterion. In the code
example below, we use the two-step procedure, but apply all three
available criteria (option = "risk", "AIC", and "BIC") in choosing J.
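# Apply the two-step procedure once per criterion; names are kept so that
# results can be accessed as output_list$risk, output_list$AIC, output_list$BIC.
output_list <- sapply(c("risk", "AIC", "BIC"), function(opt) {
  hyperparam.torus(icp.torus.objects, option = opt)
}, simplify = FALSE, USE.NAMES = TRUE)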
The result output_list is a list of length 3, consisting of outputs of
the function hyperparam.torus. The details of hyperparameter selection
can be visualized, and are shown in Figure 6. The
first row of the figure is created by plot(output_list$risk), and
shows that the evaluated prediction risk is the smallest at
Figure 6: Hyperparameter selection for the ILE data, generated from the
outputs of hyperparam.torus. Rows correspond to different choices of
criteria "risk", "AIC" and "BIC". In each row, the left panel
shows the values of the criterion over the candidate values of J; the number of clusters given by the conformal prediction set is shown alongside.
hyperparam.risk.out <- output_list$risk
Finally, the function cluster.assign.torus is used for cluster
membership assignment for each data point in ILE. In the code below,
the function cluster.assign.torus takes as input
hyperparam.risk.out, an output of hyperparam.torus, and we have not
specified any level. Since the object hyperparam.risk.out contains the
chosen level (alphahat), the level of the
conformal prediction set is, by default, set as
hyperparam.risk.out$alphahat.
cluster.out <- cluster.assign.torus(hyperparam.risk.out)
The output cluster.out contains the membership assignment results as
well as the number of clusters, which can be retrieved by
cluster.out$ncluster or by simply printing the output cluster.out.
The assigned cluster memberships can be displayed on the pairwise
scatter plots of the four angles. We demonstrate the outlier-disposing
membership assignment (the default behavior for S3 method plot), as
well as the membership assignment based on the maximum of log-densities.
Figure 7 displays the scatter plots generated by the following
code:
plot(cluster.out, assignment = "outlier") # Top panel of Figure 7
plot(cluster.out, assignment = "log.density") # Bottom panel of Figure 7
Figure 7: Pairwise scatter plots of ILE data with cluster
assignments. (Top) assignment = "outlier". (Bottom)
assignment = "log.density".
Note that these cluster assignments are based on the conformal
prediction set, which is contained in hyperparam.risk.out as value icp.torus. Since the conformal
prediction set is a union of 4-dimensional toroidal ellipsoids,
projections of such ellipsoids onto coordinate planes are plotted by the
following code, and are shown in Figure 8.
set.seed(2021)
plot(hyperparam.risk.out$icp.torus,
     data = ILE[sample(1:nrow(ILE), 500), ],
     level = hyperparam.risk.out$alphahat)
Figure 8: Scatter plots of ILE data, overlaid with the
(projected) ellipsoids that constitute the conformal prediction set.
To keep the scatter plots legible, the argument data = ILE[sample(1:nrow(ILE),500),]
is used to plot 500 randomly selected
observations.
clus.torus
The predictive clustering for data on the torus is obtained by
sequentially applying functions icp.torus, hyperparam.torus and
cluster.assign.torus, as demonstrated for ILE data in the previous
section. The function clus.torus is a user-friendly all-in-one
function, which performs the predictive clustering by sequentially
calling the three core functions.
Using clus.torus can be as simple as clus.torus(data), as shown in
the first code example, resulting in Figure 1, in
Introduction. In this case, the three functions are called sequentially
with default choices for their arguments. On the other hand, users can
specify which models and fitting methods are used, whether
hyperparameter tuning is required, and, if so, which criterion is used
for hyperparam.torus, and so on. Key arguments of clus.torus are
summarized in Table 3. The argument model only
takes "kmeans" and "mixture" as input, which is passed to
icp.torus inside the function. Since the function concerns clustering,
conformal prediction sets consisting of ellipsoids (9) are
needed, and such prediction sets are given by both model = "kmeans"
and "mixture". Next, the values of the arguments J and level
determine whether tuning is needed for the hyperparameters. If J = NULL
and level = NULL, then
hyperparam.torus is used to select both parameters, with argument
option (see Table 3). If either J or level is
specified as a scalar, then the function simply uses the given value for
constructing the conformal prediction sets and for clustering. Other
arguments available for icp.torus and hyperparam.torus can be
specified, and the function passes those arguments to corresponding
functions, if applicable.
Arguments | Descriptions |
---|---|
data | |
model | A string. One of "kmeans" and "mixture" which determines the model or estimation methods. If "mixture", the model is the von Mises mixture, fitted with an EM algorithm. If "kmeans", the model is also the von Mises mixture, fitted by the elliptical k-means algorithm. If the dimension of data space is greater than 2, only "kmeans" is supported. Default is model = "kmeans". |
J | A scalar or numeric vector. If J is a scalar, it is used as the number of components. If J is a vector, then hyperparam.torus or hyperparam.J is used to select the number of components. Default is J = NULL, in which case J = 4:30 is used. |
level | A scalar in [0, 1]. Default is level = NULL, in which case hyperparam.alpha is used to choose the optimal level. |
option | A string. One of "elbow", "risk", "AIC", or "BIC", determining the criterion used for hyperparam.torus and hyperparam.J. Default is option = "elbow" if the data dimension is 2, and option = "risk" if the data dimension is greater than 2. |
The output of the function is a list of three objects, with S3 class
clus.torus. The three objects in the output are
a cluster.obj object, containing the results of cluster membership
assignments,
an icp.torus object, corresponding to the model with the chosen or specified J, and
if applicable, a hyperparam.torus, hyperparam.J or
hyperparam.alpha object.
Each of these objects can be plotted via plot, defined for S3 class
clus.torus. For example, recall that ex is a clus.torus object we
created in Introduction. By setting the argument panel of the method
plot as panel = 1, the cluster.obj object is plotted.
plot(ex, panel = 1) # equivalent to plot(ex)
The result is shown in Figure 1 (top left). If the data
dimension is greater than 2, pairwise scatter plots are produced, as in Figure 7. If panel = 2, the icp.torus
object is plotted, similar to Figures 3 and
8. Finally, if panel = 3, the graphics relevant to
hyperparameter selection are created, similar to Figure
6.
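Gao et al. (2018) and Jung et al. (2021) used the extrinsic k-means algorithm for data on the torus. The function kmeans.torus
implements
the extrinsic k-means algorithm, as in the following example.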
set.seed(2021)
exkmeans <- kmeans.torus(SARS_CoV_2, centers = 3, nstart = 30)
head(exkmeans$membership)
27.B.ALA 28.B.TYR 29.B.THR 30.B.ASN 31.B.SER 32.B.PHE
1 1 1 1 2 3
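The idea of an extrinsic k-means can be sketched in base R by embedding each angle as a point (cos, sin) on the unit circle and running the usual k-means on the embedded coordinates. The function `extrinsic_kmeans` and the toy data are hypothetical and only illustrate the embedding; kmeans.torus may differ in its details.

```r
# Extrinsic k-means: embed angles in Euclidean space, then run stats::kmeans.
extrinsic_kmeans <- function(angles, centers, nstart = 10) {
  angles <- as.matrix(angles)
  # Each angle theta becomes the pair (cos(theta), sin(theta)).
  embedded <- cbind(cos(angles), sin(angles))
  stats::kmeans(embedded, centers = centers, nstart = nstart)$cluster
}

# Toy data: one cluster straddles the 0 / 2*pi boundary, the other sits at pi.
set.seed(1)
n <- 100
truth <- rep(1:2, each = n)
mu <- rbind(c(0, 0), c(pi, pi))
toy <- (mu[truth, ] + matrix(rnorm(4 * n, sd = 0.2), 2 * n, 2)) %% (2 * pi)

membership <- extrinsic_kmeans(toy, centers = 2)
print(table(truth, membership))
```

Points just above 0 and just below 2*pi are neighbors after the embedding, so the cluster that straddles the boundary is kept in one piece.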
Distance-based clustering methods, such as hierarchical clustering, only
require the pairwise distances of the data points. The function
ang.pdist generates the distance matrix for the input data, in which
the angular distance between two points on the torus is used. Combined with hclust, the pairwise angular distances are
used to provide a hierarchical clustering using, e.g., the complete
linkage, as done in the following example.
distmat <- ang.pdist(SARS_CoV_2)
hc <- hclust(distmat, method = "complete")
hc.result <- cutree(hc, k = 3)
head(hc.result)
[1] 1 1 1 1 2 3
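A pairwise angular distance matrix of this kind can be sketched in base R as below. The sketch assumes one common choice of distance, namely the Euclidean norm of the coordinate-wise shorter arc lengths; the function name `ang_pdist` and the three test points are hypothetical, and ang.pdist itself may use a different definition.

```r
# Pairwise angular distances between the rows of a matrix of angles.
ang_pdist <- function(angles) {
  angles <- as.matrix(angles)
  n <- nrow(angles)
  D2 <- matrix(0, n, n)
  for (j in seq_len(ncol(angles))) {
    # Coordinate-wise difference, reduced to the shorter arc in [0, pi].
    d <- abs(outer(angles[, j], angles[, j], "-")) %% (2 * pi)
    d <- pmin(d, 2 * pi - d)
    D2 <- D2 + d^2
  }
  as.dist(sqrt(D2))
}

# Two points on either side of the 0 / 2*pi boundary and one far away.
pts <- rbind(c(0.1, 0.1), c(6.2, 6.2), c(3.0, 3.0))
D <- ang_pdist(pts)

# The resulting "dist" object can be passed directly to hclust.
hc <- hclust(D, method = "complete")
groups <- cutree(hc, k = 2)
print(groups)
```

The first two points are close on the torus although their raw coordinates differ by more than 6, so they end up in the same group.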
Figure 9 shows the results for the two clustering
algorithms discussed above. The left panel shows that the Euclidean
embedding reflects the rotational nature of angular data. The right
panel shows that distance-based clustering methods work well
with ang.pdist. Note that both the extrinsic
In this paper, we introduced the package ClusTorus which contains
various tools and routines for multivariate angular data, including
kernel density estimates and mixture model estimates. ClusTorus
performs clustering based on conformal prediction sets. We demonstrated
our implementation with data on
There are some possible future developments for ClusTorus. First, EM
algorithms for von Mises mixture models on high dimensional tori (e.g.,
In this appendix, we outline the elliptical k-means algorithm, implemented in the function ellip.kmeans.torus. The
algorithm is used to estimate the parameters of the mixture model
(4), approximated as in (7). Note that the EM algorithm
can be used for parameter estimation for mixture models in low
dimensions. The EM algorithm of Jung et al. (2021) is implemented in the
function EMsinvMmix, but works only for the 2-dimensional case.
Suppose
With these approximated maximum likelihood estimators, the elliptical
k-means algorithm, implemented in ellip.kmeans.torus, proceeds as follows.
procedure Elliptical k-means(
Initialize
set
Update
Update
Update
Repeat steps 3-6 until convergence
end procedure
Note that the initial values require an initial clustering. For this, we
use other clustering algorithms such as the extrinsic k-means or hierarchical clustering, chosen by the argument init
of ellip.kmeans.torus and icp.torus. One may specify
arguments for either hclust or kmeans in icp.torus. For example,
one may specify the choice of initial values as follows.
icp.torus(data = SARS_CoV_2, J = 4, init = "kmeans", nstart = 30)
icp.torus(data = SARS_CoV_2, J = 4, init = "hierarchical", method = "complete")
By default, the hierarchical clustering with complete linkage is used.
Data analysis in this article using icp.torus or clus.torus was
performed with the default initialization.
The protein structure data we aim to analyze typically consist of
hundreds of angles (observations). Fitting the mixture with a large
number of components may give inefficient estimators. Thus, we have
implemented options for reducing the number of model parameters, by
constraining the shape of the ellipsoids, or the covariance matrices.
Applying the constraints leads to much faster convergence for estimating
parameters (Grim 2017). We list three types of constraints
for covariance matrices, which are specified by the arguments mixturefitmethod and kmeansfitmethod (for
icp.torus) and type (for EMsinvMmix and ellip.kmeans.torus). We
explain in terms of the arguments for the function icp.torus.
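Circular: each covariance matrix is a multiple of the identity. mixturefitmethod = "circular" and kmeansfitmethod = "heterogeneous-circular" represent this constraint. Furthermore, if the multiple is common to all components, use kmeansfitmethod = "homogeneous-circular".
Axis-aligned: each covariance matrix is diagonal. mixturefitmethod = "axis-aligned" represents this constraint.
General: no constraint on the covariance matrices. This corresponds to mixturefitmethod = "general" and kmeansfitmethod = "general".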
The default values for icp.torus are kmeansfitmethod = "general" and
mixturefitmethod = "axis-aligned".
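The effect of such shape constraints can be illustrated in base R by mapping an unconstrained covariance estimate to its diagonal or to a multiple of the identity. The function `constrain_cov` is a hypothetical helper, and averaging the diagonal for the circular case is an assumption made for illustration, not necessarily the estimator used in the package.

```r
# Restrict a covariance matrix S to one of three shapes.
constrain_cov <- function(S, type = c("general", "axis-aligned", "circular")) {
  type <- match.arg(type)
  p <- nrow(S)
  switch(type,
         "general"      = S,                            # no constraint
         "axis-aligned" = diag(diag(S), nrow = p),      # keep variances only
         "circular"     = diag(mean(diag(S)), nrow = p)) # one common variance
}

# Number of free covariance parameters per component under each shape.
n_cov_par <- function(p) {
  c(general = p * (p + 1) / 2, "axis-aligned" = p, circular = 1)
}

S <- matrix(c(2, 0.5, 0.5, 1), 2)
print(constrain_cov(S, "axis-aligned"))
print(constrain_cov(S, "circular"))
print(n_cov_par(4))
```

In four dimensions the per-component covariance parameters drop from 10 to 4 to 1 across the three shapes, which is where the savings for mixtures with many components come from.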
Several S3 classes are defined in the package ClusTorus. A list of the S3 classes is given in Table 4.
S3 class | functions | methods |
---|---|---|
cp.torus.kde | cp.torus.kde | print, plot |
icp.torus | icp.torus | print, plot, LogLik, predict |
icp.torus.eval | icp.torus.eval | print |
cluster.obj | cluster.assign.torus | print, plot |
kmeans.torus | kmeans.torus | print, predict |
hyperparam.torus | hyperparam.torus | print, plot |
hyperparam.J | hyperparam.J | print, plot |
hyperparam.alpha | hyperparam.alpha | print, plot |
clus.torus | clus.torus | print, plot |
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C2002256).
CRAN packages used: bio3d, ClusTorus, BAMBI, mixtools, mclust, MoEClust
CRAN Task Views implied by cited packages: Cluster, Distributions, Environmetrics, Omics
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hong & Jung, "ClusTorus: An R Package for Prediction and Clustering on the Torus by Conformal Prediction", The R Journal, 2022
BibTeX citation
@article{RJ-2022-032, author = {Hong, Seungki and Jung, Sungkyu}, title = {ClusTorus: An R Package for Prediction and Clustering on the Torus by Conformal Prediction}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {2}, issn = {2073-4859}, pages = {186-207} }