clustcurv: An R Package for Determining Groups in Multiple Curves

In many situations it could be interesting to ascertain whether groups of curves can be performed, especially when confronted with a considerable number of curves. This paper introduces an R package, known as clustcurv, for determining clusters of curves with an automatic selection of their number. The package can be used for determining groups in multiple survival curves as well as for multiple regression curves. Moreover, it can be used with large numbers of curves. An illustration of the use of clustcurv is provided, using both real data examples and artificial data.


Introduction
A problem, often encountered in many fields is the comparison of several populations through specific curves. Typical examples, considered by a number of authors, are given by the comparison of survival curves in survival analysis, children growth curves in pediatrics or the comparison of regression curves in regression analysis. In many of these studies, it is very common to compare large number of curves between groups, and methods of summarizing and extracting relevant information are necessary. A common approach is to look for a partition of the sample into a number of groups, in such a way that curves in the same group are as alike as possible but as distinct as possible from those in other groups. This process is also known as curve clustering. Hypothesis test can be used to ascertain that the curves in the same group are equal. A fundamental and difficult problem in clustering curves is the estimation of the number of clusters in a dataset.
Traditionally, the comparison of these functions is performed using parametric models through the comparison of the resulting model parameters. This approach, however, requires the specification of the parametric model which is often difficult and may be considered a disadvantage. Several nonparametric methods have been proposed in the literature to compare multiple curves. In the area of survival analysis, for example, several nonparametric methods have been proposed to test for the equality of survival curves for censored data. The most well-known and widely used to test the null hypothesis of no difference in survival between two or more independent groups was proposed by Mantel (1966). An alternative test that is often used is the Peto & Peto (Peto and Peto, 1972) modification of the Gehan-Wilcoxon test (Gehan, 1965). Several other variations of the log-rank test statistic using weights on each event time have been proposed in the literature (Tarone and Ware, 1977;Harrington and Fleming, 1982;Fleming et al., 1987) as well as other procedures to compare these survival curves based on different measures, as can be the medians (Chen and Zhang, 2016). There exists an extensive literature on curve comparison in the framework of regression analysis. In this context, several nonparametric tests have been proposed to test the equality of the mean functions, H 0 : m 1 = · · · = m j . Hall and Hart (1990) proposed a bootstrap test, while Härdle and Mammen (1993) Hardle and Marron (1990) suggested a semiparametric approach based on kernel smoothing. Other relevant papers on this topic are King et al. (1991) , Delgado (1993) , Kulasekera (1995) , Young and Bowman (1995), Dette and Neumeyer (2001), Pardo-Fernández et al. (2007) , Srihera and Stute (2010) , among others. A good review on this topic can be seen in the paper by Neumeyer and Dette (2003).
When the null hypothesis of equality of curves is rejected, leading to the clear conclusion that at least one curve is different, it can be interesting to ascertain whether curves can be grouped or if all these curves are different from each other. In this setting, one naïve approach would be to perform pairwise comparisons. In this line are the papers by Rosenblatt (1975), González-Manteiga and Cao (1993), Härdle and Mammen (1993), Dette and Neumeyer (2001) who proposed alternative tests of the null hypothesis of equality of curves obtained from pairwise comparisons of the estimators of the regression functions. A similar statistic was also considered by King et al. (1991). Pairwise comparisons between group levels with corrections for multiple testing are also possible in the framework of survival analysis. Among others, this can be achieved with the pairwise_survdiff of the package survminer (Kassambara et al., 2019). However, in any case, this approach would lead to a large number of comparisons without the possibility of determining groups with similar curves. Results for such a test can tell us that all combinations are different, or just one pair. When the number of curves to be compared increase so does the difficult of interpretation.
For partitioning a given set of curves into a set of K groups of curves (i.e., K clusters),  propose an adaptation of the K-means methodology. K-means is probably the most commonly used clustering method for splitting a dataset into a set of K groups with a very simple and fast algorithm. Furthermore, it can efficiently deal with very large data sets. One potential disadvantage of K-means clustering is that it requires the number of clusters to be pre-specified. A method is proposed by  to determine the number of clusters.
The development of clustcurv R package has been motivated by recent contributions that account for these problems; in particular the methods proposed by  to determine groups in multiple survival curves and those introduced by Villanueva et al. (Manuscript submitted for publication, 2019) in the framework of regression curves. The clustcurv R package attempts to answer the following two questions: (i) given a potential large sample of curves what is the best value for the number of clusters? (ii) What is the best subdivision of the sample curves into a given number of K clusters? To facilitate the task of selecting the optimal number of clusters as well as the composition of the clusters, it is essential to have software for implementing the proposed methods in an environment which researchers will find user-friendly and easily understandable. We believe that our package can answer to this aim providing several user-friendly functions. The package clustcurv is freely available from the Comprehensive R Archive Network (CRAN) at https://cran.rproject.org/web/packages/clustcurv Three data sets were chosen for illustration of the software usage with real data. The first two datasets, to show the applicability of the proposed methods for obtain clusters of survival curves. These applications were chosen to solve two real problems in the study of recurrence of breast cancer patient and survival of myeloma cancer. To illustrate the package usage in the regression context we used real data from a Barnacle's Growth study conducted in Galicia, Spain. Simulated data were also used to illustrate the package capabilities in a more complicated scenario.
The remainder of the paper is structured as follows: Section 2 briefly reviews methods for selecting the number of clusters and the nonparametric test used; Section 3 explains the use of the main functions and methods of clustcurv; Section 4 gives an illustration of the practical application of the package using real and simulated data; and finally, the last section concludes with a discussion and possible future extensions of the package.

An overview of the methodology
In this section we briefly review the methodological background of the clustcurv package. As it solves problems addressed in the field of survival analysis and regression analysis, firstly, the notation and the nonparametric estimation procedures for both contexts are exposed. Then, the procedure for determining groups of curves is explained in detail considering a general framework which includes the aforementioned contexts. Briefly, our procedure is described as follows. First, the J curves are estimated by nonparametric estimators. Second, given a number of K groups, the optimal possible assignment of J curves into K groups is chosen by means of an heuristic algorithm. Third, the optimal number of groups is determined using an automatic based-bootstrap testing procedure.

Notation and estimation procedure in the survival context
We will assume the J-sample general random censorship model where observations are made on n j individuals from population j(j = 1, . . . , J). Denote n = ∑ J j=1 n j and suppose that the observations from the n individuals are mutually independent. Let T ij be an event time corresponding to an event measured from the start of the follow-up of the i-th subject (i = 1, . . . , n j ) in the sample j and assume that T ij is observed subject to a (univariate) random right-censoring variable C ij assumed to be independent of T ij . Due to the censoring, rather than T ij we observe Since the censoring time is assumed to be independent of the process, the survival functions, S j (t) = P(T j > t), may be consistently estimated by the Kaplan-Meier estimator (Kaplan and Meier, 1958) based on the ( T j , ∆ j ). The Kaplan-Meier estimator or the Product-Limit estimator is a nonparametric method frequently used to estimate survival for censored data. Let t 1 < t 2 < . . . < t m j , m j ≤ n j denote the distinct ordered failure times from population j (j = 1, . . . , J) and let d u be the number of events from population j at time t u . Then, the Kaplan-Meier estimator of survival (for population j) is where R j (t) = ∑ n j i=1 I( T ij ≥ t) denote the number of individuals at risk just before time t, among individuals from population j. The Kaplan-Meier estimate is a step function with jumps at event times.
The size of the steps depend on the number of events and the number of individuals at risk at the corresponding time. Under this setup we will be interested to determine clusters in multiple survival curves.

Notation and estimation procedure in the regression context
Let (X j , Y j ) be J independent random vectors, and assume that they satisfy the following nonparametric regression models, for j = 1, . . . , J, where the error variable ε j has mean zero and m j (X j ) = E(Y j |X j ) is the unknown regression function. We do not make any assumptions about the error distribution.
The regression models in (1) can be estimated using several approaches, such as, methods based on regression splines (de Boor, 2001), Bayesian approaches (Lang and Brezger, 2004) or local polynomial kernel smoothers (Wand and Jones, 1995;Fan and Gijbels, 1996). In this package local linear kernel smoothers, as implemented in the npregfast package, are used.

Determining groups of nonparametric curves
As noted earlier, several authors have proposed different methods that can be used to compare estimates of nonparametric functions of multiple samples. The null hypothesis is that all the curves have identical functions, H 0 : F 1 = . . . = F J . However, if this hypothesis is rejected there are no available procedures that let determine groups among these curves, that is, to asses if the levels {1, . . . , J} can be grouped in K groups (G 1 , . . . , G K ) with K < J, so that F i = F j for all i, j ∈ G k , for each k = 1, . . . , K. Note that (G 1 , . . . , G K ) must be a partition of {1, . . . , J}, and therefore must satisfy the following conditions, We propose a procedure to test, for a given number K, the null hypothesis H 0 (K) that at least one partition exists (G 1 , . . . , G K ) so that all the conditions above are verified. The alternative hypothesis H 1 (K) is that for any (G 1 , . . . , G K ), exists at least a group G k in which F i = G j for some i, j ∈ G k .
The cited testing procedure is based on the J-dimensional process where, for j = 1, . . . , J, and C k is the pooled nonparametric estimate based on the combined G k -partition sample.
The following test statistics were considered in order to test H 0 (K): a Cramér-von Mises type statistic and a modification of it based on the L 1 norm proposed in the Kolmogorov-Smirnov test statistic where R is the support of the lifetime distribution or the support of the independent variable in case of survival or regression, respectively.
In order to approximate the minimizers involved in the test statistics we propose the use of clustering algorithms. Particularly, in the case of D CM , defined in terms of the L 2 -distance, we propose the use of the K-means (Macqueen, 1967). However, for obtaining the values of D KS , defined in this case in terms of the L 1 -norm, a variation of the K-means where instead of calculating the mean for each group to determine its centroid, it calculates the median, the k-medians -suggested by Macqueen (1967) and developed by Kaufman and Rousseeuw (1990)-would be more appropriate. In both cases, the carried out procedure is equivalent: the functions F j (j = 1, . . . , J) have to be estimated in a common grid of size Q leading to a matrix of (J x Q) dimension, where each row corresponds with the estimates of the j curve in the Q positions of the grid. Then, this matrix will be the input of both heuristic methods, K-means and K-medians, and from these the "best" partition (G 1 , . . . , G K ) is obtained.
Finally, the decision rule based on D consists of rejecting the null hypothesis if D is larger than the (1 − α)-percentile obtained under the null hypothesis. To approximate the distributions of the test statistic under the null hypothesis, resampling methods such as the bootstrap introduced by Efron (1979) can be applied.
The testing procedure used here involves the following steps: 1. Using the original sample, for j = 1, . . . , J and i = 1, . . . , n j , estimate the functions F j in a nonparametric way and in a common grid, using each sample separately. Then, using the proposed algorithms, obtain the "best" partition (G 1 , . . . , G K ) and with it obtain the estimated curves C k using a pooled nonparametric estimator based on the combined partition samples (i.e., the estimator obtained by applying the nonparametric estimator to the combined partition samples).
2. Obtain the D value as explained before.
3. Draw bootstrap samples using a bootstrap procedure. In the survival context follow step 3.(a) and 3.(b) for the regression context: Note that this procedure is a pooled bootstrap, i.e., bootstrap from the pooled combined partition sample given by the null hypothesis H 0 (K).
being ε ij the null errors under the H 0 (K) obtained as and the variables W * b 1 , . . . , W * b n are independent for the observed sample and i.i.d. with E(W * b i ) = 0, Var(W * b i ) = 1 and third moment equals to 1. A common choice is considering a binary variable with probabilities P{W * b i = (1 − √ 5)/2} = (5 + √ 5)/10 and P{W * b i = (1 + √ 5)/2} = (5 − √ 5)/10, which corresponds to the golden section. Note that we have used here the wild bootstrap (Wu, 1986;Liu, 1988;Mammen, 1993) because this method is valid both for homoscedastic and for heteroscedastic models where the variance of the error is a function of the covariate. It is important to highlight that repeating this procedure -testing H 0 (K)from K = 1 onwards until a certain null hypothesis is not rejected allows us to determine automatically the number of K groups. Note, however, that unlike the previous test decision, this latter one is not statistically significant (strong evidences for rejecting the null hypothesis are not given). The whole procedure are briefly described step by step in Algorithm 1.
Finally, note that, under survival and regression scenarios, the proposed procedure for the determination of groups in multiple curves may be translated as a test of multiple hypotheses where a set of K p-values corresponding to the K null hypotheses, H 0 (1), H 0 (2), . . . , H 0 (K) are given. Even though several methods have been proposed to deal with this problem (see e.g. Dudoit and van der Laan (2008) for an introduction to this area), there are still open challenges because there is no information about the minimum number of tests needed to apply these techniques. In any case, we have decided Algorithm 1: k-nonparametric curves algorithm 1. With the original sample, for j = 1, . . . , J and i = 1, . . . , n j , and using the nonparametric estimator obtain F j .
2. Initialize with K = 1 and test H 0 (K): 2.1 Obtain the "best" partition G 1 , . . . , G K by means of the k-means or k-medians algorithm.
2.2 For k = 1, . . . , K, estimate C k and retrieve the test statistic D.

Generate B bootstrap samples and calculate
3. The number K of groups of equal nonparametric curves is determined.
to propose a possible approach to apply some of these well-known techniques as Bonferroni, Holm (Holm, 1979), etc. As the problem is still open, we feel that the final user is who must be able to decide to apply them by means of an argument included in the functions of the package. The challenge in the present context is that the number of hypothesis that are going to be tested is unknown in advance. In order to solve this we propose that, after having increased K in the algorithm, the null hypothesis for "smaller K's" has to be re-tested simultaneously with H 0 (K).

Package structure and functionality
The clustcurv package is a shortcut for "clustering curves" for being this its major functionality: to provide a procedure that allows users determining groups of multiple curves with an automatic selection of their number. The package enables both survival and regression curves to be grouped, and it is designed along lines similarly into both contexts. In addition, in view of the high computational cost entailed in these methods, parallelization techniques are included to become feasible and efficient onto real situations.
The functions within the clustcurv package are described in Table 1. Briefly, there are two main types of functionalities: (i) to determine groups of multiple curves with the automatic selection of their number with regclustcurves or survclustcurves functions and (ii) to determine groups of curves, given a number K, with kregcurves or ksurvcurves functions. The S3 object obtained from whatever previous functions is the argument required as input for autoplot which returns a graphical output based on ggplot2 package. Numerical summaries of the fitted objects can be obtained by using print or summary.
Hence the two most important functions in this package are survclustcurv and regclustcurv, the arguments of these functions are shown in Table 2. Note that the ksurvcurves kregcurves functions are just a simplified version of the previous two. Users can obtain automatically the optimal number of groups of multiple curves by means of survclustcurves and regclustcurves. Nevertheless, in those situations where the user knows in advance the number of groups, it is possible to obtain the assignment of the curves into the corresponding group, by means of the function ksurvcurves or kregcurves. In both functions a common argument is the algorithm which returns the best assignments of the curves into the groups which they belong. At the moment, the algorithms to solve this optimization problem can be K-means or K-medians, through the argument algorithm = kmeans or algorithm = kmedians . Furthermore, in order to address the high computational burden, the functions survclustcurves, regclustcurves, ksurvcurves and kregcurves have been programmed in parallel to compute the bootstrap-based testing procedure. The input command required for the use of parallelization is cluster = TRUE. The number of cores for parallel execution is fixed using the number of CPU cores on the current host minus one unless it is specified by the user (ncores = NULL). Then, registerDoParallel of the doParallel package is used to register the parallel backend. The parallel computation is performed by the foreach function of foreach package.

Function Description survclustcurves
Main function for determining groups of multiple survival curves and selecting automatically the optimal number of them. regclustcurves Main function for determining groups of multiple regression curves ecting automatically the optimal number of them. ksurvcurves Main function for determining groups of survival curves, given a number of groups K. kregcurves Main function for determining groups of regression curves, given a number of groups K. summary Method of the generic summary function for kcurves and clustcurves objects (both survival and regression context), which returns a short summary. print Method of the generic print function for kcurves and clustcurves objects, which prints out some key components. autoplot Visualisation of clustcurves and kcurves objects with ggplot2 (Wickham et al., 2019) graphics. Provides the plots for the estimated nonparametric curves grouped by color (optional) and their centroids (mean curve of the curves pertaining to the same group).

Illustrative examples
In this section we illustrate the use of clustcurv package using some real and simulated data. In the case of the survival context, the proposed methods were applied to the german breast cancer data included in the condSURV package and to the multiple myeloma data freely available as part of the survminer package. For the regression analysis, the clustcurv package includes a data set called barnacle5 with measurements of rostro-carinal length and dry weight of barnacles collected from five sites of Galicia (northwest of Spain). Additionally, in order to show the behaviour of the method in a more complicated scenario, an example with simulated data is also provided.

Application to German Breast Cancer Study Data
In this study, a total of 686 patients with primary node positive breast cancer were recruited between July 1984 and December 1989 and 16 variables were measured such as age of the patient (age), menopausal status (menopause), hormonal therapy (hormone), tumour size (size, in mm), tumor grade (grade) and number of positive nodes (nodes). In addition to these and other variables, the recurrence free survival time (rectime, in days) and the corresponding censoring indicator (0 -censored, 1event) were also recorded.
We will use these data to illustrate the package capabilities to build clusters of survival curves based on the covariate nodes (grouped from 1 to > 13). An excerpt of the data.frame with one row per patient is shown below.
> library(condSURV) > library(clustcurv) > data(gbcsCS) > head(gbcsCS [, c(5:10, 13, 14) A character string specifying which clustering algorithm is used, i.e., K-means ( kmeans ) or K-medians ( kmedians ). alpha A numeric value, particularly, the signification level of the hypothesis test. cluster A logical value. If TRUE (default) the code is parallelized. Note that there are cases without enough repetitions (e.g., a low number of initial variables) that R will gain in performance through serial computation. R takes time to distribute tasks across the processors also it will need time for binding them all together later on. Therefore, if the time for distributing and gathering pieces together is greater than the time needed for single-thread computing, it could be better not to parallelize. ncores An integer value specifying the number of cores to be used in the parallelized procedure. If NULL, the number of cores to be used is equal to the number of cores of the machine − 1. seed Seed to be used in the procedure. multiple A logical value. If TRUE (not default), the resulted pvalues are adjusted by using one of several methods for multiple comparisons. multiple.method Correction method: bonferroni , holm , hochberg , hommel , BH , BY Table 2: Arguments of survclustcurves and regclustcurves will be taken into account for applying the methods described in Section 2.2. The number of positive nodes have been grouped from 1 to > 13 because of its low numbers onwards. Below, the steps for this preprocessed are shown > table(gbcsCS$nodes) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  187 110 79 57 41 33 36 20 20 19 15 13 11 3 5 8 5 5 5  Estimates of the survival curves are obtained using the survclustcurves function. This function allows determining groups using the optimization algorithm K-means or K-medians. The function will verify if data has been introduced correctly and will create a clustcurves object. The first three arguments must be introduced, where time is a vector with event-times, status for their corresponding indicator statuses, and x is the categorical covariate.
As we mentioned, note that the proposed procedure may deal with the problem of testing multiple hypothesis, particularly relevant when the categorical variable has many levels. Thus, if the user wants to apply some correction, it is possible to specify multiple = TRUE and select some of the well-known techniques such as Bonferroni, Holm, etc. by means of the argument multiple.method.

Number of observations: 640
Cluster method: kmedians Factor s levels: [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" [14] ">13" Clustering factor s levels: [1] 1 1 1 3 3 3 3 2 3 2 2 2 2 2 Testing procedure: H0 Tvalue pvalue 1 1 95.68626 0.000 2 2 56.03966 0.018 3 3 33.63386 0.830 Available components: [1] "num_groups" "table" "levels" "cluster" "centers" "curves" [7] "method" "data" "algorithm" "call" The graphical representation of the fit can be easily obtained with the function autoplot. Specifying the argument groups_by_color = FALSE the estimated survival curves for each level of the factor nodes by means of the Kaplan-Meier estimator can be drawn. The assignment of the curves to the three groups can be observed in Figure 1 simply typing groups_by_color = TRUE. As expected, the survival of patients can be influenced by the number of lymph nodes, patient's recurrence time rises with the decrease of lymph nodes. Note that having 3 or less positive nodes seems to be related with higher recurrence-free probabilities. Patients with 9 or more positive nodes are more likely to develop a recurrence. The group of patients with 8 positive nodes were assigned to the group with highest recurrence probabilities. Though this were unexpected further analysis confirm the poor and unexpected behavior.  Figure 1: Estimated survival curves for each of the levels of the variable nodes. A specific color is assigned for each curve according to the group to which it belongs using the K-medians algorithm (in this case, three groups, K = 3).

Number of observations: 607 Cluster method: kmeans
The corresponding plot is shown in Figure 2. Note that having 9 or more positive nodes seems to be related with a lower recurrence free survival than having 9 or less, with the exception of the survival curve for those patients with 8 positive nodes which was assigned to the group with highest recurrence probabilities. It is important to highlight that given a fixed value of K, one may also be interested in determining the group for which each survival function belongs. This is possible by means of the ksurvcurves function by considering for example the argument k = 3. > ksurvcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes, seed = 300716, algorithm = kmedians , k = 3) Call: ksurvcurves(time = gbcsCS$rectime, status = gbcsCS$censrec, x = gbcsCS$nodes, k = 3, algorithm = "kmedians", seed = 300716) Clustering curves in 3 groups

Number of observations: 640 Cluster method: kmedians
More information related to the output above can be obtained running the summary function.

Application to Multiple Myeloma Study Data
In this case, a study of the survival in patients with multiple myeloma (MM) cancer was conducted and 256 individuals were included from the start of the follow up to whom were analyzed and collected 16 variables. This data set is freely available in the survminer package. Below, it is shown the first rows of the data.frame with columns such as treatment (treatment), life state indicator (event; censored -0; 1 -dead), survival time (time, in months), among others. In this example it is interesting to analyze if the survival in patients with MM disease is the same for the different molecular subgroups. If there is a effect of the molecular subgroups on the survival, future therapies that might exploit molecular insights should lead to an improvement in outcome for patients with these type of disease (Zhan et al., 2006).
In particular, results obtained reveal that MMSET level and Proliferation level are associated with a high-risk or damage on the lifetime, while MAF, Low bone disease, Hyperdiploid, Cycline D-1 and Cycline D-2 have higher survival probabilities. This is observed in the plot shown in Figure 3 which can be obtained using the following input command: > autoplot(fit.mye, groups_by_color = TRUE)

Application to Barnacle's Growth Study Data
This study was conducted on the Atlantic coast of Galicia (Northwest Spain), which consists of an approximately 1000km long shoreline with extensive rocky stretches exposed to tidal surge and wave action that are settled by the Pollicipes pollicipes (Gmelin, 1789) populations targeted for study. A total of 5000 specimens were collected from five sites of the region's Atlantic coastline and corresponds to the stretches of coast where this species is harvested: Punta do Mouro, Punta Lens, Punta de la Barca, Punta del Boy and Punta del Alba. Two biometric variables of each specimen were measured: RC (Rostro-carinal length, maximum distance across the capitulum between the ends of the rostral and carinal plates) and DW (Dry Weight). This data set (barnacle5) is available in the clustcurv package. The idea of this study is to know the relation between RC and DW variables along the coast, i.e., to analyze if the barnacle's growth is similar in all locations or by contrast, if it is possible to detect geographical differentiation in growth. A sample of the dataset is shown as follow For each location (F), nonparametric regression curves were estimated to modeling the dependence between RC and DW. In order to determine groups, we used the proposed methodology in Subsection 2.2.2. Through executing the next piece of code the following results can be obtained: one estimated curve was attributed to the first group (Punta Lens), two estimated curves were assigned to group 2 (Punta de la Barca and Punta del Boy) and the other two belong to group 3 (Laxe do Mouro and Punta del Alba) (Figure 4). In this example, the regclustcurves function was used with algorithm = kmeans and the input variables y, x, z.
> fit.bar <-regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F, nboot = 500, seed = 300716, algorithm = kmeans , cluster = TRUE) Factor s levels: [1] "laxe" "lens" "barca" "boy" "alba" Clustering factor s levels: [1] 2 1 3 3 2 Testing procedure: H0 Tvalue pvalue 1 1 0.94353014 0.000 2 2 0.15463483 0.034 3 3 0.02348982 0.422 Available components: [1] "num_groups" "table" "levels" "cluster" "centers" "curves" [7] "method" "data" "algorithm" "call" As can be seen in Figure 4 obtained using the following input command, the specimens from Punta de la Barca and Punta del Boy have similar morphology, wide and short. This is due to these zones present similar oceanographic characteristic, such as exposed rocky shore to waves and highly articulated. Unlike, the barnacles collected from Laxe do Moure and Punta del Alba are narrow and long because they are less exposed locations.  Figure 4: Estimated regression curves for each of the levels of the factor. A specific color is assigned for each curve according to the group to which it belongs using the K-means algorithm (in this case, three groups, K = 3).

Application to simulated data
Finally, this subsection reports the capabilities of the clustcurv package in a more complicated simulated scenario. We consider the regression models given in (1) for j = 1, . . . , 30, with where a is a real constant, X j is the explanatory covariate drawn from a uniform distribution [−2, 2], and ε j is the error distributed in accordance to a N(0, σ j (x)). We have considered the heteroscedastic scenario where σ j (x) = 0.5 + 0.05m j (x) · N(0, 1.5).
We explore the methodology considering the null hypothesis H 0 (5) of assignment of the m j curves into five groups (K = 5). To show the performance of our procedure, two values were considered for a, 0 and 0.4. It should be noted that the value a = 0 corresponds to the null hypothesis, while if a = 0.4 the number of groups is six. Particularly, we have defined an unbalanced scenario, with unequal sample sizes for each j curve, particularly, (n 1 , n 2 , . . . , n J ) ∼ Multinomial(n; p 1 , p 2 , . . . , p J ) being p j = p * j / ∑ J j=1 p * j , with p * j randomly drawn from {1, 1.5, 2, 2.5, 3} and n = 5000. Note that we propose this procedure for generating the n j in order to obtain a completely unbalanced study.
Finally, there are 6 clusters. > fit.sim2 Call: regclustcurves(y = data2$y, x = data2$x, z = data$f, nboot = 500, algorithm = "kmedians", cluster = TRUE, seed = 300716) Clustering curves in 6 groups Number of observations: 5000 Cluster method: kmedians > autoplot(fit.sim2, groups_by_colour = TRUE, centers = TRUE) Figures 5 and 6 show the results with the simulated data with a = 0 and a = 0.4, respectively. In this situation the true number of groups is equal to 5 and 6. As can be appreciated, our method seems to perform reasonably well for both values of a. For a = 0 the null hypothesis H 0 (5) is accepted, curves assigned to each groups are plotted with the same color. In the case of a = 0.4 the null hypothesis H 0 (6) is accepted, therefore, there are 6 groups of regression curves. Note that in both plots, the centroids are coloured in black because in the autoplot function the argument centers = TRUE.  Figure 6: Estimated regression curves for each of the levels of the variable f with a = 0.4. A specific color is assigned for each curve according to the group to which it belongs using the K-means algorithm (in this case, six groups, K = 6). Black curves are the centroids of each group.

Conclusion and further extensions of the R package
This paper discussed the implementation of some methods developed for determining groups of multiple nonparametric curves in the R package clustcurv. In particular, the methods proposed are focused in the framework of regression analysis and in the framework of survival analysis. In the context of survival analysis, we restrict ourselves to survival curves. Hopefully, future versions of the package will extend the methodology to determine groups in risk functions, cumulative hazard curves, or density functions. The current version of the package implements two optimization algorithms, the well-known K-means and K-medians. It can be interesting let the user choose far from those, such as Means-Shift or K-medoids algorithms.