In many situations, it could be interesting to ascertain whether groups of curves can be performed, especially when confronted with a considerable number of curves. This paper introduces an R package, known as clustcurv, for determining clusters of curves with an automatic selection of their number. The package can be used for determining groups in multiple survival curves as well as for multiple regression curves. Moreover, it can be used with large numbers of curves. An illustration of the use of clustcurv is provided, using both real data examples and artificial data.
Keywords: multiple curves, number of groups, nonparametric, survival analysis, regression models, cluster
A problem often encountered in many fields is the comparison of several populations through specific curves. Typical examples, considered by a number of authors, are given by the comparison of survival curves in survival analysis, children growth curves in pediatrics, or the comparison of regression curves in regression analysis. In many of these studies, it is very common to compare a large number of curves between groups, and methods of summarizing and extracting relevant information are necessary. A common approach is to look for a partition of the sample into a number of groups in such a way that curves in the same group are as alike as possible but as distinct as possible from those in other groups. This process is also known as curve clustering. A hypothesis test can be used to ascertain that the curves in the same group are equal. A fundamental and difficult problem in clustering curves is the estimation of the number of clusters in a dataset.
Traditionally, the comparison of these functions is performed using
parametric models through the comparison of the resulting model
parameters. This approach, however, requires the specification of the
parametric model, which is often difficult and may be considered a
disadvantage. Several nonparametric methods have been proposed in the
literature to compare multiple curves. In the area of survival analysis,
for example, several nonparametric methods have been proposed to test
for the equality of survival curves for censored data. The most
well-known and widely used to test the null hypothesis of no difference
in survival between two or more independent groups was proposed by
Mantel (1966). An alternative test that is often used is the Peto & Peto
(Peto and Peto 1972) modification of the Gehan-Wilcoxon test (Gehan 1965).
Several other variations of the log-rank test statistic using weights on
each event time have been proposed in the literature
(Tarone and Ware 1977; Harrington and Fleming 1982; Fleming et al. 1987) as well as
other procedures to compare these survival curves based on different
measures, as can be the medians (Chen and Zhang 2016). There exists an extensive
literature on curve comparison in the framework of regression analysis.
In this context, several nonparametric tests have been proposed to test
the equality of the mean functions,
When the null hypothesis of equality of curves is rejected, leading to
the clear conclusion that at least one curve is different, it can be
interesting to ascertain whether curves can be grouped or if all these
curves are different from each other. In this setting, one naïve
approach would be to perform pairwise comparisons. In this line are the
papers by Rosenblatt (1975), González-Manteiga and Cao (1993), Härdle and Mammen (1993), Dette and Neumeyer (2001) who proposed
alternative tests of the null hypothesis of equality of curves obtained
from pairwise comparisons of the estimators of the regression functions.
A similar statistic was also considered by King et al. (1991). Pairwise comparisons
between group levels with corrections for multiple testing are also
possible in the framework of survival analysis. Among others, this can
be achieved with the pairwise_survdiff
of the package
survminer
(Kassambara et al. 2019). However, in any case, this approach would lead to a large
number of comparisons without the possibility of determining groups with
similar curves. Results for such a test can tell us that all
combinations are different, or just one pair. When the number of curves
to be compared increase, so does the difficulty of interpretation.
For partitioning a given set of curves into a set of K groups of
curves (i.e., K clusters), Villanueva et al. (2019a) propose an adaptation of
the
The development of clustcurv R package has been motivated by recent
contributions that account for these problems, in particular, the
methods proposed by Villanueva et al. (2019a) to determine groups in multiple
survival curves and those introduced by Villanueva et al. (2019b) in the framework
of regression curves. The clustcurv R package attempts to answer the
following two questions: (i) given a potential large sample of curves,
what is the best value for the number of clusters? (ii) What is the best
subdivision of the sample curves into a given number of
Three data sets were chosen for illustration of the software usage with real data. The first two datasets show the applicability of the proposed methods for obtaining clusters of survival curves. These applications were chosen to solve two real problems in the study of recurrence of breast cancer patients and survival of myeloma cancer. To illustrate the package usage in the regression context we used real data from a Barnacle’s Growth study conducted in Galicia, Spain. Simulated data were also used to illustrate the package capabilities in a more complicated scenario.
The remainder of the paper is structured as follows: Section 2 briefly reviews methods for selecting the number of clusters and the nonparametric test used; Section 3 explains the use of the main functions and methods of clustcurv; Section 4 gives an illustration of the practical application of the package using real and simulated data; and finally, the last section concludes with a discussion and possible future extensions of the package.
In this section, we briefly review the methodological background of the
clustcurv package. As it solves problems addressed in the field of
survival analysis and regression analysis, firstly, the notation and the
nonparametric estimation procedures for both contexts are exposed. Then,
the procedure for determining groups of curves is explained in detail,
considering a general framework which includes the aforementioned
contexts. Briefly, our procedure is described as follows. First, the
We will assume the
Since the censoring time is assumed to be independent of the process,
the survival functions,
Let
The regression models in ((1)) can be estimated using several approaches, such as methods based on regression splines (Boor 2001), Bayesian approaches (Lang and Brezger 2004), or local polynomial kernel smoothers (Wand and Jones 1995; Fan and Gijbels 1996). In this package, local linear kernel smoothers, as implemented in the npregfast package, are used.
As noted earlier, several authors have proposed different methods that
can be used to compare estimates of nonparametric functions of multiple
samples. The null hypothesis is that all the curves have identical
functions,
We propose a procedure to test, for a given number
The cited testing procedure is based on the
The following test statistics were considered in order to test
In order to approximate the minimizers involved in the test statistics,
we propose the use of clustering algorithms. Particularly, in the case
of
Finally, the decision rule based on
The testing procedure used here involves the following steps:
Using the original sample, for
Obtain the
Draw bootstrap samples using a bootstrap procedure. In the survival context, follow step 3.(a), and in the regression context, follow step 3.(b):
Let
Since in step 3 the bootstrap resamples are constructed under the null
hypothesis of
It is important to highlight that repeating this procedure, testing
With the original sample, for
Initialize with
2.1. Obtain the “best” partition
2.2. For
2.3. Generate
2.4. if
reject
go back to 2.1
else
accept
end
The number
Finally, note that, under survival and regression scenarios, the
proposed procedure for the determination of groups in multiple curves
may be translated as a test of multiple hypotheses where a set of
The clustcurv package is a shortcut for “clustering curves” for being this its major functionality: to provide a procedure that allows users to determine groups of multiple curves with an automatic selection of their number. The package enables both survival and regression curves to be grouped, and it is designed along lines similarly into both contexts. In addition, in view of the high computational cost entailed in these methods, parallelization techniques are included to become feasible and efficient in real situations.
The functions within the clustcurv package are described in Table
1. Briefly, there are two main types of functionalities:
(i) to determine groups of multiple curves with the automatic selection
of their number with regclustcurves
or survclustcurves
functions and
(ii) to determine groups of curves, given a number kregcurves
or ksurvcurves
functions. The S3
object obtained from
whatever previous functions is the argument required as input for
autoplot
, which returns a graphical output based on
ggplot2 package.
Numerical summaries of the fitted objects can be obtained by using
print
or summary
.
Since the two most important functions in this package are
survclustcurv
and regclustcurv
, the arguments of these functions are
shown in Table 2. Note that the ksurvcurves
kregcurves
functions are just a simplified version of the previous
two. Users can automatically obtain the optimal number of groups of
multiple curves by means of survclustcurves
and regclustcurves
.
Nevertheless, in those situations where the user knows in advance the
number of groups, it is possible to obtain the assignment of the curves
into the corresponding group, by means of the function ksurvcurves
or
kregcurves
. In both functions, a common argument is the algorithm,
which returns the best assignments of the curves into the groups to
which they belong. At the moment, the algorithms to solve this
optimization problem can be algorithm = ’kmeans’
or algorithm = ’kmedians’
.
Furthermore, in order to address the high computational burden, the
functions survclustcurves
, regclustcurves
, ksurvcurves
and
kregcurves
have been programmed in parallel to compute the
bootstrap-based testing procedure. The input command required for the
use of parallelization is cluster = TRUE
. The number of cores for
parallel execution is fixed using the number of CPU cores on the current
host minus one unless it is specified by the user (ncores = NULL
).
Then, registerDoParallel
of the
doParallel package is
used to register the parallel backend. The parallel computation is
performed by the foreach
function of
foreach package.
Function | Description |
survclustcurves |
Main function for determining groups of multiple survival curves and selecting automatically the optimal number of them. |
regclustcurves |
Main function for determining groups of multiple regression curves ecting automatically the optimal number of them. |
ksurvcurves |
Main function for determining groups of survival curves, given a number of groups |
kregcurves |
Main function for determining groups of regression curves, given a number of groups |
summary |
Method of the generic summary function for kcurves and clustcurves objects (both survival and regression context), which returns a short summary. |
print |
Method of the generic print function for kcurves and clustcurves objects, which prints out some key components. |
autoplot |
Visualisation of clustcurves and kcurves objects with ggplot2 (Wickham et al. 2019) graphics. Provides the plots for the estimated nonparametric curves grouped by color (optional) and their centroids (mean curve of the curves pertaining to the same group). |
survclustcurves() arguments |
|
time |
A vector with variable of interest, i.e. survival time. |
status |
A vector with censoring indicator of the survival time of the process; 0 if the total time is censored and 1 otherwise. |
x |
A vector with categorical variable indicating the population to which the observations belongs. |
kvector |
A vector specifying the number of groups of curves to be checking. By default it is NULL. |
kbin |
Size of the grid over which the survival functions are to be estimated. |
nboot |
Number of bootstrap repeats. |
algorithm |
A character string specifying which clustering algorithm is used, i.e., ’kmeans’ ) or ’kmedians’ ). |
alpha |
A numeric value, particularly, the signification level of the hypothesis test. |
cluster |
A logical value. If TRUE (default) the code is parallelized. Note that there are cases without enough repetitions (e.g., a low number of initial variables) that R will gain in performance through serial computation. R takes time to distribute tasks across the processors also it will need time for binding them all together later on. Therefore, if the time for distributing and gathering pieces together is greater than the time needed for single-thread computing, it could be better not to parallelize. |
ncores |
An integer value specifying the number of cores to be used in the parallelized procedure. If NULL , the number of cores to be used is equal to the number of cores of the machine |
seed |
Seed to be used in the procedure. |
multiple |
A logical value. If TRUE (not default), the resulted pvalues are adjusted by using one of several methods for multiple comparisons. |
multiple.method |
Correction method: ’bonferroni’ , ’holm’ , ’hochberg’ , ’hommel’ , ’BH’ , ’BY’ |
regclustcurves() arguments |
|
y |
A vector with variable of interest, i.e. response variable. |
x |
A vector with independent variable. |
z |
A vector with categorical variable indicating the population to which the observations belongs. |
kvector |
A vector specifying the number of groups of curves to be checking. By default it is NULL. |
kbin |
Size of the grid over which the survival functions are to be estimated. |
h |
The kernel bandwidth smoothing parameter. |
nboot |
Number of bootstrap repeats. |
algorithm |
A character string specifying which clustering algorithm is used, i.e., ’kmeans’ ) or ’kmedians’ ). |
alpha |
A numeric value, particularly, the signification level of the hypothesis test. |
cluster |
A logical value. If TRUE (default) the code is parallelized. Note that there are cases without enough repetitions (e.g., a low number of initial variables) that R will gain in performance through serial computation. R takes time to distribute tasks across the processors also it will need time for binding them all together later on. Therefore, if the time for distributing and gathering pieces together is greater than the time needed for single-thread computing, it could be better not to parallelize. |
ncores |
An integer value specifying the number of cores to be used in the parallelized procedure. If NULL , the number of cores to be used is equal to the number of cores of the machine |
seed |
Seed to be used in the procedure. |
multiple |
A logical value. If TRUE (not default), the resulted pvalues are adjusted by using one of several methods for multiple comparisons. |
multiple.method |
Correction method: ’bonferroni’ , ’holm’ , ’hochberg’ , ’hommel’ , ’BH’ , ’BY’ |
In this section, we illustrate the use of clustcurv package using some
real and simulated data. In the case of the survival context, the
proposed methods were applied to the German breast cancer data included
in the condSURV package
and to the multiple myeloma data freely available as part of the
survminer package. For the regression analysis, the clustcurv
package includes a data set called barnacle5
with measurements of
rostro-carinal length and dry weight of barnacles collected from five
sites of Galicia (northwest of Spain). Additionally, in order to show
the behaviour of the method in a more complicated scenario, an example
with simulated data is also provided.
In this study, a total of 686 patients with primary node-positive breast
cancer were recruited between July 1984 and December 1989, and 16
variables were measured such as the age of the patient (age
),
menopausal status (menopause
), hormonal therapy (hormone
), tumour
size (size
, in mm), tumor grade (grade
), and the number of positive
nodes (nodes
). In addition to these and other variables, the
recurrence-free survival time (rectime
, in days) and the corresponding
censoring indicator (0 – censored, 1 – event) were also recorded.
We will use these data to illustrate the package capabilities to build
clusters of survival curves based on the covariate nodes (grouped from 1
to > 13). An excerpt of the data.frame
with one row per patient is
shown below:
> library(condSURV)
> library(clustcurv)
> data(gbcsCS)
> head(gbcsCS[, c(5:10, 13, 14)])
age menopause hormone size grade nodes rectime censrec
1 38 1 1 18 3 5 1337 1
2 52 1 1 20 1 1 1420 1
3 47 1 1 30 2 1 1279 1
4 40 1 1 24 1 3 148 0
5 64 2 2 19 2 1 1863 0
6 49 2 2 56 1 3 1933 0
The first three patients have developed a recurrence shown by censrec
variable equals to 1, unlike the following three, which take the value
of 0. This variable, along with the other two, rectime
and nodes
,
will be taken into account for applying the methods described in Section
2. The number of positive nodes has been grouped from 1 to >
13 because of its low numbers onwards. Below, the steps for this
preprocessed are shown:
> table(gbcsCS$nodes)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
187 110 79 57 41 33 36 20 20 19 15 13 11 3 5 8 5 5 5 3 1
23 24 26 30 33 35 36 38 51
1 2 1 1 1 1 1 1 1
> gbcsCS[gbcsCS$nodes > 13,'nodes'] <- 14
> gbcsCS$nodes <- factor(gbcsCS$nodes)
> levels(gbcsCS$nodes)[14]<- '>13'
> table(gbcsCS$nodes)
1 2 3 4 5 6 7 8 9 10 11 12 13 >13
187 110 79 57 41 33 36 20 20 19 15 13 11 45
Estimates of the survival curves are obtained using the
survclustcurves
function. This function allows determining groups
using the optimization algorithm ’clustcurves’
object. The first three arguments must be introduced,
where time
is a vector with event-times, status
for their
corresponding indicator statuses, and x
is the categorical covariate.
As we mentioned, note that the proposed procedure may deal with the
problem of testing multiple hypotheses, particularly relevant when the
categorical variable has many levels. Thus, if the user wants to apply
some correction, it is possible to specify multiple = TRUE
and select
some of the well-known techniques such as Bonferroni, Holm, etc., by
means of the argument multiple.method
.
The output of this function is the assignment of the survival curves to
the group to which they belong and an automatic selection of their
number. The following input commands provide an example of this output
using the
> fit.gbcs <- survclustcurves(time = gbcsCS$rectime, status = gbcsCS$censrec,
x = gbcsCS$nodes, nboot = 500, seed = 300716, algorithm = 'kmedians',
cluster = TRUE)
Checking 1 cluster...
Checking 2 clusters...
Checking 3 clusters...
Finally, there are 3 clusters.
> summary(fit.gbcs)
Call:
survclustcurves(time = gbcsCS$rectime, status = gbcsCS$censrec,
x = gbcsCS$nodes, nboot = 500, algorithm = "kmedians", cluster = TRUE,
seed = 300716)
Clustering curves in 3 groups
Number of observations: 640
Cluster method: kmedians
Factor's levels:
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
[14] ">13"
Clustering factor's levels:
[1] 1 1 1 3 3 3 3 2 3 2 2 2 2 2
Testing procedure:
H0 Tvalue pvalue
1 1 95.68626 0.000
2 2 56.03966 0.018
3 3 33.63386 0.830
Available components:
[1] "num_groups" "table" "levels" "cluster" "centers" "curves"
[7] "method" "data" "algorithm" "call"
The graphical representation of the fit can be easily obtained with the
function autoplot
. Specifying the argument groups_by_color = FALSE
,
the estimated survival curves for each level of the factor nodes by
means of the Kaplan-Meier estimator can be drawn. The assignment of the
curves to the three groups can be observed in Figure
1 simply typing groups_by_color = TRUE
. As
expected, the survival of patients can be influenced by the number of
lymph nodes. The patients’ recurrence time rises with the decrease of
lymph nodes. Note that having 3 or fewer positive nodes seems to be
related to higher recurrence-free probabilities. Patients with 9 or more
positive nodes are more likely to develop a recurrence. The group of
patients with 8 positive nodes was assigned to the group with highest
recurrence probabilities. Though this was unexpected, further analysis
confirm the poor and unexpected behavior.
nodes
. A specific color is assigned for each curve according
to the group to which it belongs using the Equivalently, the following piece of code shows the input commands and
the results obtained with the algorithm = ’kmeans’
. However, the
number of groups and the assignments are different from those obtained
with the ’kmedians’
. Although this situation is not so common, in some
real applications, it can happen.
> fit.gbcs2 <- survclustcurves(time = gbcsCS$rectime, status = gbcsCS$censrec,
x = gbcsCS$nodes, nboot = 500, seed = 300716, algorithm = 'kmeans',
cluster = TRUE)
Checking 1 cluster...
Checking 2 clusters...
Finally, there are 2 clusters.
> fit.gbcs2
Call:
survclustcurves(time = gbcsCS$rectime, status = gbcsCS$censrec,
x = gbcsCS$nodes, nboot = 500, algorithm = "kmeans", cluster = TRUE,
seed = 300716)
Clustering curves in 2 groups
Number of observations: 607
Cluster method: kmeans
The corresponding plot is shown in Figure 2. Note that having 9 or more positive nodes seems to be related to a lower recurrence-free survival than having 9 or less, with the exception of the survival curve for those patients with 8 positive nodes, which was assigned to the group with highest recurrence probabilities.
nodes
. A specific color is assigned for each curve according
to the group to which it belongs using the It is important to highlight that given a fixed value of ksurvcurves
function by considering, for example, the argument k = 3
.
> ksurvcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes,
seed = 300716, algorithm = 'kmedians', k = 3)
Call:
ksurvcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes,
k = 3, algorithm = "kmedians", seed = 300716)
Clustering curves in 3 groups
Number of observations: 640
Cluster method: kmedians
More information related to the output above can be obtained running the
summary
function.
In this case, a study of the survival in patients with multiple myeloma
(MM) cancer was conducted. 256 individuals were included from the start
of the follow-up to whom were analyzed and collected 16 variables. This
data set is freely available in the survminer package. Below, it is
shown the first rows of the data.frame
with columns such as treatment
(treatment
), life state indicator (event
; censored – 0; 1 – dead),
survival time (time
, in months), among others.
> library(survminer)
> data(myeloma)
> head(myeloma[,1:5])
molecular_group chr1q21_status treatment event time
GSM50986 Cyclin D-1 3 copies TT2 0 69.24
GSM50988 Cyclin D-2 2 copies TT2 0 66.43
GSM50989 MMSET 2 copies TT2 0 66.50
GSM50990 MMSET 3 copies TT2 1 42.67
GSM50991 MAF <NA> TT2 0 65.00
GSM50992 Hyperdiploid 2 copies TT2 0 65.20
In this example, it is interesting to analyze if the survival in patients with MM disease is the same for the different molecular subgroups. If there is an effect of the molecular subgroups on the survival, future therapies that might exploit molecular insights should lead to an improvement in outcome for patients with these types of disease (Zhan et al. 2006).
Below, a summary of the results of the survclustcurves
function
obtained with time
, event
, and molecular_group
as input variables
and for both kmedians
and kmeans
algorithms are shown.
> fit.mye <- survclustcurves(time = myeloma$time, status = myeloma$event,
x = myeloma$molecular_group, nboot = 500, seed = 300716,
algorithm = 'kmedians', cluster = TRUE)
Checking 1 cluster...
Checking 2 clusters...
Finally, there are 2 clusters.
> summary(fit.mye)
Call:
survclustcurves(time = myeloma$time, status = myeloma$event,
x = myeloma$molecular_group, nboot = 500, algorithm = "kmedians",
cluster = TRUE, seed = 300716)
Clustering curves in 2 groups
Number of observations: 248
Cluster method: kmedians
Factor's levels:
[1] "Cyclin D-1" "Cyclin D-2" "Hyperdiploid" "Low bone disease"
[5] "MAF" "MMSET" "Proliferation"
Clustering factor's levels:
[1] 1 1 1 1 1 2 2
Testing procedure:
H0 Tvalue pvalue
1 1 31.31603 0.026
2 2 14.94269 0.682
Available components:
[1] "num_groups" "table" "levels" "cluster" "centers" "curves"
[7] "method" "data" "algorithm" "call"
> fit.mye2 <- survclustcurves(time = myeloma$time, status = myeloma$event,
x = myeloma$molecular_group, nboot = 500, seed = 300716,
algorithm = 'kmeans', cluster = TRUE)
Checking 1 cluster...
Checking 2 clusters...
Finally, there are 2 clusters.
> summary(fit.mye2)
Call:
survclustcurves(time = myeloma$time, status = myeloma$event,
x = myeloma$molecular_group, nboot = 500, algorithm = "kmeans",
cluster = TRUE, seed = 300716)
Clustering curves in 2 groups
Number of observations: 248
Cluster method: kmeans
Factor's levels:
[1] "Cyclin D-1" "Cyclin D-2" "Hyperdiploid" "Low bone disease"
[5] "MAF" "MMSET" "Proliferation"
Clustering factor's levels:
[1] 1 1 1 1 1 2 2
Testing procedure:
H0 Tvalue pvalue
1 1 4.500272 0.032
2 2 1.108812 0.730
Available components:
[1] "num_groups" "table" "levels" "cluster" "centers" "curves"
[7] "method" "data" "algorithm" "call"
When comparing the results obtained through the two methods (kmeans
,
kmedians
), it is seen that the obtained number of clusters is the same
(2 groups), even the assignment of the curves to the groups.
In particular, results obtained reveal that MMSET level and Proliferation level are associated with a high-risk or damage on the lifetime, while MAF, Low bone disease, Hyperdiploid, Cycline D-1, and Cycline D-2 have higher survival probabilities. This is observed in the plot shown in Figure 3, which can be obtained using the following input command:
> autoplot(fit.mye, groups_by_color = TRUE)
molecular group
. A specific color is assigned for each curve
according to the group to which it belongs using the This study was conducted on the Atlantic coast of Galicia (Northwest
Spain), which consists of an approximately 1000km long shoreline with
extensive rocky stretches exposed to tidal surges and wave action that
are settled by the Pollicipes pollicipes (Gmelin, 1789) populations
targeted for study. A total of 5000 specimens were collected from five
sites of the region’s Atlantic coastline and corresponded to the
stretches of coast where this species is harvested: Punta do Mouro,
Punta Lens, Punta de la Barca, Punta del Boy and Punta del Alba. Two
biometric variables of each specimen were measured: RC (Rostro-carinal
length, maximum distance across the capitulum between the ends of the
rostral and carinal plates) and DW (Dry Weight). This data set
(barnacle5
) is available in the clustcurv package. The idea of this
study is to know the relation between RC and DW variables along the
coast, i.e., to analyze if the barnacle’s growth is similar in all
locations or by contrast, if it is possible to detect geographical
differentiation in growth. A sample of the dataset is shown as follow:
> data("barnacle5")
> head(barnacle5)
DW RC F
1 0.52 12.0 laxe
2 1.46 18.9 laxe
3 0.05 6.4 laxe
4 0.17 9.4 laxe
5 0.05 6.2 laxe
6 0.41 12.2 laxe
For each location (F
), nonparametric regression curves were estimated
to modeling the dependence between RC and DW. In order to determine
groups, we used the proposed methodology in Subsection
2.2. Through executing the next piece of
code, the following results can be obtained: one estimated curve was
attributed to the first group (Punta Lens), two estimated curves were
assigned to group 2 (Punta de la Barca and Punta del Boy), and the other
two belong to group 3 (Laxe do Mouro and Punta del Alba) (Figure
4). In this example, the regclustcurves
function was used with algorithm = ’kmeans’
and the input variables
> fit.bar <- regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
nboot = 500, seed = 300716, algorithm = 'kmeans', cluster = TRUE)
Checking 1 cluster...
Checking 2 clusters...
Checking 3 clusters...
Finally, there are 3 clusters.
> summary(fit.bar)
Call:
regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
nboot = 500, algorithm = "kmeans", cluster = TRUE, seed = 300716)
Clustering curves in 3 groups
Number of observations: 5000
Cluster method: kmeans
Factor's levels:
[1] "laxe" "lens" "barca" "boy" "alba"
Clustering factor's levels:
[1] 2 1 3 3 2
Testing procedure:
H0 Tvalue pvalue
1 1 0.94353014 0.000
2 2 0.15463483 0.034
3 3 0.02348982 0.422
Available components:
[1] "num_groups" "table" "levels" "cluster" "centers" "curves"
[7] "method" "data" "algorithm" "call"
As can be seen, Figure 4 obtained using the following input command. The specimens from Punta de la Barca and Punta del Boy have similar morphology, wide and short. This is due to these zones present similar oceanographic characteristic, such as exposed rocky shore to waves and highly articulated. Unlike, the barnacles collected from Laxe do Moure and Punta del Alba are narrow and long because they are less exposed locations. Finally, Punta Lens is an intermediate coast, alternating sections of steep coast with large sand.
> autoplot(fit.bar, groups_by_color = TRUE)
Finally, this subsection reports the capabilities of the clustcurv
package in a more complicated simulated scenario. We consider the
regression models given in ((1)) for
where
We explore the methodology considering the null hypothesis
The code for the generation of this dataset with
> m <- function(x, j) {
y <- numeric(length(x))
y[j <= 5] <- x[j <= 5] + 1
y[j > 5 & j <= 10] <- x[j > 5 & j <= 10] ^ 2 + 1
y[j > 10 & j <= 15] <- 2 * sin(2 * x[j > 10 & j <= 15]) #- 4
y[j > 15 & j <= 20] <- 2 * sin(x[j > 15 & j <= 20])
y[j > 20 & j <= 25] <- 2 * sin(x[j > 20 & j <= 25]) + a * exp(x[j > 20 & j <= 25])
y[j > 25] <- 1
return(y)
}
> seed <- 300716
> set.seed(seed)
> n <- 5000
> a <- 0.0
> x <- runif(n, -2, 2)
> prob <- sample(c(1, 1.5, 2, 2.5, 3), 30, replace = TRUE)
> prob <- prob/sum(prob)
> f <- sample(1:30, n, replace = TRUE, prob = prob)
> N <- length(unique(f))
> error <- rnorm(n,0,1.5)
> y <- m(x, f) + (0.5 + 0.05 * m(x, f)) * error
> data <- data.frame(x, y, f)
In order to determine groups of the generated curves, the user has to
execute the next piece of code. As expected, when
> fit.sim <- regclustcurves(x = data$x, y = data$y, z = data$f, nboot = 500,
algorithm = 'kmedians', cluster = TRUE, seed = 300716)
Checking 1 cluster...
Checking 2 clusters...
Checking 3 clusters...
Checking 4 clusters...
Checking 5 clusters...
Finally, there are 5 clusters.
> fit.sim
Call:
regclustcurves(y = data$y, x = data$x, z = data$f, nboot = 500,
algorithm = "kmedians", cluster = TRUE, seed = 300716)
Clustering curves in 5 groups
Number of observations: 5000
Cluster method: kmedians
> autoplot(fit.sim, groups_by_colour = TRUE, centers = TRUE)
Additionally, for different values of
> seed <- 300716
> set.seed(seed)
> n <- 5000
> a <- 0.4
> x <- runif(n, -2, 2)
> prob <- sample(c(1, 1.5, 2, 2.5, 3), 30, replace = TRUE)
> prob <- prob/sum(prob)
> f <- sample(1:30, n, replace = TRUE, prob = prob)
> N <- length(unique(f))
> error <- rnorm(n,0,1.5)
> y <- m(x, f) + (0.5 + 0.05 * m(x, f)) * error
> data2 <- data.frame(x, y, f)
> fit.sim2 <- regclustcurves(x = data2$x, y = data2$y, nboot = 500, seed = 300716,
z = data$f, algorithm = 'kmedians', cluster = TRUE)
Checking 1 cluster...
Checking 2 clusters...
Checking 3 clusters...
Checking 4 clusters...
Checking 5 clusters...
Checking 6 clusters...
Finally, there are 6 clusters.
> fit.sim2
Call:
regclustcurves(y = data2$y, x = data2$x, z = data$f, nboot = 500,
algorithm = "kmedians", cluster = TRUE, seed = 300716)
Clustering curves in 6 groups
Number of observations: 5000
Cluster method: kmedians
> autoplot(fit.sim2, groups_by_colour = TRUE, centers = TRUE)
Figures 5 and 6 show the
results with the simulated data with autoplot
function, the argument
centers = TRUE
.
f
with f
with This paper discussed the implementation of some methods developed for
determining groups of multiple nonparametric curves in the R package
clustcurv. In particular, the methods proposed are focused on the
framework of regression analysis and in the framework of survival
analysis. In the context of survival analysis, we restrict ourselves to
survival curves. Hopefully, future versions of the package will extend
the methodology to determine groups in risk functions, cumulative hazard
curves, or density functions. The current version of the package
implements two optimization algorithms, the well-known
The authors acknowledge financial support by the Spanish Ministry of Economy and Competitiveness (MINECO) through project MTM2017-89422-P and MTM2017-82379-R (funded by (AEI/FEDER, UE). Thanks to the Associate Editor and the referee for comments and suggestions that have improved this paper.
clustcurv, survminer, npregfast, ggplot2, doParallel, foreach, condSURV
HighPerformanceComputing, Phylogenetics, Spatial, Survival, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Villanueva, et al., "clustcurv: An R Package for Determining Groups in Multiple Curves ", The R Journal, 2021
BibTeX citation
@article{RJ-2021-032, author = {Villanueva, Nora M. and Sestelo, Marta and Meira-Machado, Luís and Roca-Pardiñas, Javier}, title = {clustcurv: An R Package for Determining Groups in Multiple Curves }, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {164-183} }