Recently, survival models have found vast applications in biostatistics, bioinformatics, reliability engineering, finance and related fields. But there are few R packages focusing on evaluating the predictive power of survival models. This lack of handy software on evaluating survival predictions hinders further applications of survival analysis for practitioners. In this research, we want to fill this gap by providing an "all-in-one" R package which implements most predictive evaluation metrics in survival analysis. In the proposed SurvMetrics R package, we implement concordance index for both untied and tied survival data; we give a new calculation process of Brier score and integrated Brier score; we also extend the applicability of integrated absolute error and integrated square error for real data. For models that can output survival time predictions, a simplified metric called mean absolute error is also implemented. In addition, we test the effectiveness of all these metrics on simulated and real survival data sets. The newly developed SurvMetrics R package is available on CRAN at https://CRAN.R-project.org/package=SurvMetrics and GitHub at https://github.com/skyee1/SurvMetrics.
Survival analysis is an important part of biostatistics. It is frequently used to define prognostic indices for mortality or recurrence of a disease, and to study the outcome of treatment (Fan et al. 2016; Wiens et al. 2016; Wang et al. 2019; Wijethilake et al. 2021). Recent decades have witnessed many other applications of survival analysis in various disciplines such as finance, engineering and social sciences (Steyerberg et al. 2010; Wang et al. 2019; Bender et al. 2021). Evaluating the ability to predict future data is one of the most important considerations in the development of these survival models (Wang et al. 2019).
When evaluating a prediction model, the predictive performance of the survival model is commonly addressed by appropriate measures which quantify the ‘distance’ or the agreement between the observed and predicted outcomes (Uno et al. 2011; Bylinskii et al. 2018). By investigating the predictive measures frequently used in the statistical literature (top or specialized statistical journals) in the past decade (from 2010 to 2021) in Figure 1 and Figure 2, we have found that among these 136 research papers, the popular predictive metrics for survival models mainly include: concordance index (C-index), Brier score (BS), integrated Brier score (IBS), integrated absolute error (IAE), integrated square error (ISE) and mean absolute error (MAE) (Brier 1950; Harrell et al. 1982; Schemper 1992; Graf et al. 1999; HooraMoradian et al. 2017).
From Figure 1, we can see that the two most frequently used metrics are C-index and IBS, while the other metrics are less used. This may be partly due to the fact that the available predictive metrics are scattered across several R packages, and differences in input data of these tools make it difficult for non-specialists to use or compare the survival models (Schröder et al. 2011; Peters and Hothorn 2021). Hence, software which provides easy input data formats and includes most popular metrics is desirable (Schröder et al. 2011; Wang et al. 2019).
Figure 2 indicates that the use of C-index is steadily increasing and it has become a dominant predictive measure recently (Jing et al. 2019; Amico et al. 2021). On the one hand, this implies that the lack of handy evaluation metrics tools since other measures are also very important in survival analysis but do not have an adequate seat in survival prediction practices (Ensor et al. 2021; Li et al. 2021; Nemati et al. 2021). On the other hand, it reminds us that providing an accurate C-index measure which can take various situations into account is extremely important (H Ea Gerty and Zheng 2005; Kang et al. 2015; Zadeh and Schmid 2020; Zhang et al. 2021).
Different from common classification or regression problems where tools
to evaluate predictive performance are abundant, there are few R
packages focusing on evaluating the predictive power of survival models
(Harrell Jr 2021; Peters and Hothorn 2021). To make things worse, most of these
available packages have implemented only one or two evaluation metrics
and/or some of them often throw errors in real survival problems
(Peters and Hothorn 2021). Take the C-index for an example, if the user wants to
compare two samples with the same survival time, this scenario returns
NA in Hmisc, survivals, and survcomp packages. However, as
described by (Ishwaran et al. 2008), "for sample pairs, where observed time
We argue that an appealing survival evaluation metric tool should satisfy the following two properties. First, the evaluation metrics should not depend on any specific model and they can be applied to evaluating survival models without specifying a specific model form (Schröder et al. 2011). Second, the metrics should be computationally tractable and user has the choice to input only the fitted model without any additional processing, especially for the non-specialist (Harrell et al. 1982; Graf et al. 1999; Ishwaran et al. 2008; HooraMoradian et al. 2017).
In this research, we are trying to develop a comprehensive and effective R package with the above two properties. In the proposed SurvMetrics package, the calculation process of most metrics does not depend on the specific model form and only predicted survival probability vector (or matrix) is needed. This feature is particularly desirable for non-statistical researchers such as clinical and financial practitioners (Ali et al. 2021). To illustrate the effectiveness of our tool, we choose two popular survival models, namely, a semi-parametric Cox model and a non-parametric random survival forest model (RSF) to check and test the provided functionalities (Cox 1972; Ishwaran et al. 2008).
In this section, we will present the survival model evaluation metrics and some implementing details. The following Table 1 presents some popular R packages and the metrics they provide.
Packages | C-index | BS | IBS | IAE | ISE | MAE |
---|---|---|---|---|---|---|
survival | ||||||
Hmisc | ||||||
survcomp | ||||||
ipred | ||||||
mlr | ||||||
SurvMetrics |
As can be seen in Table 1, the proposed SurvMetrics package implements all popular evaluation metrics, while other packages mostly implement just one or two. With our SurvMetrics package, one can compare and evaluate different survival models in a convenient and comprehensive way.
The most popular evaluation metric in survival predictions is C-index (Harrell et al. 1982; Li et al. 2016; Wang and Zhou 2017; Lee et al. 2018; Wang and Li 2019; Hsu and Lin 2020; Subramanian et al. 2020; Zadeh and Schmid 2020; Devlin and Heller 2021; Zhang et al. 2021), which is a generalization of the ROC curve (H Ea Gerty and Zheng 2005; Obuchowski 2006; Kang et al. 2015; Li et al. 2016). C-index intends to measure the proportion of predictions of the binary survival status that agree with the true survival status. If the study is concerned with the comparison of survival probability between different samples, for example, which component in the device shows damage earlier, the C-index can be used.
To calculate the values of C-index from different models, Hmisc, survcomp and survival R packages are frequently used in practice (Harrell et al. 1996; Schröder et al. 2011; Harrell Jr 2021; Therneau 2021). All the above-mentioned packages in their current versions do not consider the cases of tied survival data, i.e., samples with the same survival time. As shown in Table 3, they return 0 or NA for all cases of survival data tied. However, in practice, cancer patients are usually surveyed annually or monthly after surgery for survival status and with the advent of big data, tied samples are becoming commonplace. Simply ignoring the presence of tie data in calculating C-index will result in evaluation bias among survival models.
Meanwhile, in the calculation of the C-index, not all sample pairs are
comparable, e.g., for those pairs whose shorter survival time is
censored. So it is necessary to filter the sample pairs
In the proposed R package, we will take a similar strategy adopted in
(Ishwaran et al. 2008).
The comparable sample pair is defined as:
The complete concordance (CC) is defined as:
We derive the partial concordance (PC):
The C-index value can be given by:
From the above, we know C-index lies between 0 and 1 and usually the larger the C-index, the more accurate prediction the model can get.
To further illustrate the proposed Cindex()
function in SurvMetrics
with other packages, we first give a toy example in Table
2.
sample | ||||||||
---|---|---|---|---|---|---|---|---|
time | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
status | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
predicted | 0.2 | 0.3 | 0.3 | 0.3 | 0.4 | 0.2 | 0.4 | 0.3 |
Based on the above information, different R packages will take different ways to deal with these predictions. We summarize these calculation differences behind these R packages in Table 3.
sample pair | ||||||||
---|---|---|---|---|---|---|---|---|
Hmisc | 0 | NA | NA | NA | 0 | NA | NA | NA |
survival | 0 | NA | NA | NA | 0 | NA | NA | NA |
survcomp | NA | NA | NA | NA | NA | NA | NA | NA |
SurvMetrics | 0.5 | 0.5 | 0.5 | 1 | 0.5 | 0.5 | 0.5 | 0.5 |
From the above, we can find that when ignoring part or all of the
information from tied sample pairs, the C-index values obtained from
Hmisc and other two packages cannot reflect the predictive powers of
survival models in a true manner. In this case, one can turn to the
Cindex()
function in SurvMetrics package for help.
BS is another commonly used metric in survival analysis
(Steyerberg et al. 2010; Pakbin et al. 2018; Imani et al. 2019; Chicco et al. 2021), which
measures the mean squared error between the observed progression status
and the predicted survival probability at time
In the SurvMetrics R package we will take the most classical method to calculate the BS by weighting the prediction residuals with inverse probability censoring weights (IPCW) proposed by (Graf et al. 1999):
According to the definition of BS, its value depends on the selection of
Brier()
function of our SurvMetrics R package,
As we can see above, BS depends partly on the choices of a user-specified time point. It makes the comparison between models somewhat difficult when we want to know an average prediction performance over all prediction times. In practice, the integral of BS or IBS, which does not depend on the selection of one time point, is more widely used when we are interested in the entire time interval (Zhu and Kosorok 2012; Periáñez et al. 2016; Bertens et al. 2017). IBS is more concerned with the residuals at all observed time points. When the time of interest is no longer a specific time point, IBS can provide more information than BS. For example, the probability of survival of a cancer patient for each year after the disease.
The definition of IBS is straightforward:
But the calculation of IBS using sbrier()
function from the ipred
package does not always go smoothly (Peters and Hothorn 2021) . If a list of
survfit
objects is incorrectly specified, then the package currently
returns a particular error:
Error in switch(ptype, survfit = { : EXPR must be a length 1 vector}
In our experience, this kind of error is common, especially for the non-specialists (Peters and Hothorn 2021).
In the SurvMetrics package, we will make life easier for
non-specialists. Users only need to input survival time, survival
status, the predicted survival probability matrix and the range of
integration to the IBS()
function, and our program will take care of
all the rest work and give a correct output. Similar to BS, a smaller
IBS value usually implies a more accurate survival model.
Another two evaluation metrics, namely, IAE and ISE are also
occasionally used to compare the difference between the estimated
survival function
Since the true survival functions are usually unknown beforehand,
traditional IAE and ISE methods are only applicable to simulation
scenarios. Here, we propose to approximate IAEISE()
function and this makes
IAE and ISE also suitable for real data study. Similar to IBS, a smaller
IAE and ISE values lead to a more accurate model.
The last evaluation metric presented here is MAE (Schemper 1992), which can be used to measure the residuals of observed survival time versus predicted survival time, for example, to predict the time to damage of the device. MAE is a better choice than BS if predicted survival time is a model’s output. The definition of MAE is shown below:
where
MAE only estimates the average absolute error between the predicted
survival time and the true survival time in the uncensored samples and
is rarely used in practice. For consistence, we also provide this metric
in the MAE()
function. Similar to MSE, the lower the MAE, the higher
the accuracy of prediction.
In this section, we will use some simulated and real survival data sets to illustrate the effectiveness of the metrics provided in the SurvMetrics package.
First, Cox proportional hazard models will be fitted on three simulated scenarios which are very similar to Settings 1–4 by (Steingrimsson et al. 2019). Scenario1 satisfies the proportional hazards assumption and others violate it.
Scenario1: This data set is created using
Scenario2: This data set is created using
Scenario3: This data set is created using
From Figure 3, one may see that the prediction performance of the Cox model keeps decreasing in three scenarios as expected. However, the difference in terms of C-index and BS are not significant for Scenario2 and Scenario3 while in terms of IBS, IAE and ISE, a sharp difference and a clear trend can be observed.
As we know, C-index evaluates the model from a ranking perspective in a rough matter and BS only considers a fixed time point. Both metrics may not distinguish models described above or in other complex settings. In this case, evaluation metrics which consider model performance in a global way, such as IBS, IAE, and ISE, may be useful alternatives. This also justifies why a package such as SurvMetrics that can evaluate survival models from multiple perspectives is strongly needed in practice.
In this section, a kidney dataset from the survival R package is used to illustrate the usage of the SurvMetrics package using popular survival models widely used in biostatics and biomedical study.
In the following, we first cut the kidney dataset into a training set and a testing set. Then, two popular models, namely, Cox model and random survival forest model are constructed based on the training set to show how different functions are used in SurvMetrics. Finally, we will show how to evaluate the predictive performance of these two models on the testing set using the proposed SurvMetrics R package. In the latest version of SurvMetrics, user has the choice to input only the standard survival models and testing sets. Meanwhile, it is still useful to deal with predicted survival probabilities from non-standard models. The following example will contain these two input forms.
The corresponding R code is provided here:
#1. data preparation
library(survival) # to fit a Cox model
library(randomForestSRC) # to fit an RSF model
library(SurvMetrics) # to get all the metrics
library(pec) # to make predictions based on Cox model
set.seed(1)
mydata <- kidney[, -1]
train_index <- sample(1:nrow(mydata), 0.7 * nrow(mydata))
train_data <- mydata[train_index, ]
test_data <- mydata[-train_index, ]
#2. fit the RSF model and Cox model to predict the testing set
#2.1 RSF model
fit_rsf <- rfsrc(Surv(time,status)~., data = train_data) #fit the RSF model
distime <- fit_rsf$time.interest #get the survival time of events
med_index <- median(1:length(distime)) #the index of median survival time of events
mat_rsf <- predict(fit_rsf, test_data)$survival #get the survival probability matrix
vec_rsf <- mat_rsf[ ,med_index] #median survival probability of all samples
#2.2 Cox model
fit_cox <- coxph(Surv(time,status)~., data = train_data, x = TRUE) #fit the Cox model
mat_cox <- predictSurvProb(fit_cox, test_data, distime) #get the survival probability matrix
vec_cox <- mat_cox[ ,med_index]
#3. get all the metrics by SurvMetrics
#3.1 CI BS IBS IAE ISE based on RSF model: standard model input methods
Cindex_rsf <- Cindex(fit_rsf, test_data)
BS_rsf <- Brier(fit_rsf, test_data, distime[med_index])
IBS_rsf <- IBS(fit_rsf, test_data)
IAE_rsf <- IAEISE(fit_rsf, test_data)[1]
ISE_rsf <- IAEISE(fit_rsf, test_data)[2]
#CI BS IBS IAE ISE based on Cox model: standard model input methods
Cindex_cox <- Cindex(fit_cox, test_data)
BS_cox <- Brier(fit_cox, test_data, distime[med_index])
IBS_cox <- IBS(fit_cox, test_data)
IAE_cox <- IAEISE(fit_cox, test_data)[1]
ISE_cox <- IAEISE(fit_cox, test_data)[2]
#3.2 CI BS IBS IAE ISE based on RSF model: Non-standard model input methods
times <- test_data$time
status <- test_data$status
Cindex_rsf <- Cindex(Surv(times, status), vec_rsf)
BS_rsf <- Brier(Surv(times, status), vec_rsf, distime[med_index])
IBS_rsf <- IBS(Surv(times, status), mat_rsf, distime) # distime can be replaced by range(distime)
IAE_rsf <- IAEISE(Surv(times, status), mat_rsf, distime)[1]
ISE_rsf <- IAEISE(Surv(times, status), mat_rsf, distime)[2]
#CI BS IBS IAE ISE based on Cox model: Non-standard model input methods
Cindex_cox <- Cindex(Surv(times, status), vec_cox)
BS_cox <- Brier(Surv(times, status), vec_cox, distime[med_index])
IBS_cox <- IBS(Surv(times, status), mat_cox, distime)
IAE_cox <- IAEISE(Surv(times, status), mat_cox, distime)[1]
ISE_cox <- IAEISE(Surv(times, status), mat_cox, distime)[2]
models | C-index | BS | IBS | IAE | ISE |
---|---|---|---|---|---|
Cox | 0.751185 | 0.18133 | 0.08842 | 77.90947 | 19.65131 |
RSF | 0.729858 | 0.221523 | 0.10920 | 105.6936 | 28.80941 |
The results presented in Table 4 and Figure 4 show that using C-index alone is hard to differentiate the predictive power of the two models. In this case, alternative measures provided in the proposed package may help if further evaluation is needed. This result is also consistent with the simulated scenario mentioned above.
Assessing the predictive performance of survival models is complex due to the lack of standards regarding the best criterion to use in survival analysis. The available metrics are scatter across different R packages that use heterogeneous interfaces, which makes it difficult for the non-specialist to use or compare the performance of various survival models.
In this paper, we try to fill the gap by providing an "all-in-one" R package called SurvMetrics which provides a uniform interface to an extensive set of performance assessment and statistical comparison methods. Practitioners can easily implement comparative studies and identify the best model(s) using this package. In the current version of the SurvMetrics package, six evaluation metrics are present. Evaluation metrics used in more complicated settings such as time-dependent AUC for joint modelling of longitudinal and survival data, and concordance index for competitive risks are still being developed and will be added to the SurvMetrics package in the future. Meanwhile, we are also working to provide a more user-friendly interface to facilitate both statistical and non-statistical clinical research workers in evaluating survival models.
This research is support in part by National Statistical Scientific Research Project of China (No.2022LZ28), Changsha Municipal Natural Science Foundation (No.kq2202080), The Open Research Fund from the Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen(No. B10120210117-OF04) and the Postgraduate Scientific Research Innovation Project of Hunan Province, China (CX20210155).
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhou, et al., "SurvMetrics: An R package for Predictive Evaluation Metrics in Survival Analysis", The R Journal, 2023
BibTeX citation
@article{RJ-2023-009, author = {Zhou, Hanpu and Wang*, Hong and Wang, Sizheng and Zou, Yi}, title = {SurvMetrics: An R package for Predictive Evaluation Metrics in Survival Analysis}, journal = {The R Journal}, year = {2023}, note = {https://rjournal.github.io/}, volume = {14}, issue = {4}, issn = {2073-4859}, pages = {252-263} }