SurvMetrics: An R package for Predictive Evaluation Metrics in Survival Analysis

Recently, survival models have found vast applications in biostatistics, bioinformatics, reliability engineering, finance and related fields. But there are few R packages focusing on evaluating the predictive power of survival models. This lack of handy software on evaluating survival predictions hinders further applications of survival analysis for practitioners. In this research, we want to fill this gap by providing an "all-in-one" R package which implements most predictive evaluation metrics in survival analysis. In the proposed SurvMetrics R package, we implement concordance index for both untied and tied survival data; we give a new calculation process of Brier score and integrated Brier score; we also extend the applicability of integrated absolute error and integrated square error for real data. For models that can output survival time predictions, a simplified metric called mean absolute error is also implemented. In addition, we test the effectiveness of all these metrics on simulated and real survival data sets. The newly developed SurvMetrics R package is available on CRAN at https://CRAN.R-project.org/package=SurvMetrics and GitHub at https://github.com/skyee1/SurvMetrics .


Introduction
Survival analysis is an important part of biostatistics.It is frequently used to define prognostic indices for mortality or recurrence of a disease, and to study the outcome of treatment (Wang et al., 2019;Wijethilake et al., 2021;Fan et al., 2016;Wiens et al., 2016).Recent decades have witnessed many other applications of survival analysis in various disciplines such as finance, engineering and social sciences (Steyerberg et al., 2010;Bender et al., 2021;Wang et al., 2019).Evaluating the ability to predict future data is one of the most important considerations in the development of these survival models (Wang et al., 2019).When evaluating a prediction model, the predictive performance of the survival model is commonly addressed by appropriate measures which quantify the 'distance' or the agreement between the observed and predicted outcomes (Uno et al., 2011;Bylinskii et al., 2018).By investigating the predictive measures frequently used in the statistical literature (top or specialized statistical journals) The R Journal Vol.14/4, December 2022 ISSN  in the past decade (from 2010 to 2021) in Figure 1 and Figure 2, we have found that among these 136 research papers, the popular predictive metrics for survival models mainly include: concordance index (C-index), Brier score (BS), integrated Brier score (IBS), integrated absolute error (IAE), integrated square error (ISE) and mean absolute error (MAE) (Harrell et al., 1982;Brier, 1950;Graf et al., 1999;HooraMoradian et al., 2017;Schemper, 1992).From Figure 1, we can see that the two most frequently used metrics are C-index and IBS, while the other metrics are less used.This may be partly due to the fact that the available predictive metrics are scattered across several R packages, and differences in input data of these tools make it difficult for non-specialists to use or compare the survival models (Schröder et al., 2011;Peters and Hothorn, 2021).Hence, software which provides easy input data formats and includes most popular metrics is desirable (Schröder et al., 2011;Wang et al., 2019).
Figure 2 indicates that the use of C-index is steadily increasing and it has become a dominant predictive measure recently (Jing et al., 2019;Amico et al., 2021).On the one hand, this implies that the lack of handy evaluation metrics tools since other measures are also very important in survival analysis but do not have an adequate seat in survival prediction practices (Nemati et al., 2021;Li et al., 2021;Ensor et al., 2021).On the other hand, it reminds us that providing an accurate C-index measure which can take various situations into account is extremely important (H Ea Gerty and Zheng, 2005;Kang et al., 2015;Zadeh and Schmid, 2020;Zhang et al., 2021).
Different from common classification or regression problems where tools to evaluate predictive performance are abundant, there are few R packages focusing on evaluating the predictive power of survival models (Harrell Jr, 2021;Peters and Hothorn, 2021).To make things worse, most of these available packages have implemented only one or two evaluation metrics and/or some of them often throw errors in real survival problems (Peters and Hothorn, 2021).Take the C-index for an example, if the user wants to compare two samples with the same survival time, this scenario returns NA in Hmisc, survivals, and survcomp packages.However, as described by Ishwaran et al. (2008), "for sample pairs, where observed time X i = X j and both are deaths, count 1 if predicted outcomes are tied."This error is mainly due to the fact that the above packages ignore this possible scenario in practice.Omitting such scenarios can result in the loss of information from survival data, especially when the sample size is small with high data collection costs.In SurvMetrics, we have implemented the method described by Ishwaran et al. (2008) to cover all tied scenarios.There are also functions available in the ipred package to calculate BS and IBS values, but a list of survfit objects is required in the calculation process (Peters and Hothorn, 2021).However, for most survival models, only a survival probability vector or matrix is provided, making calculations of such values a challenging problem (Ishwaran et al., 2008).It is not easy for non-specialists to get IBS using the ipred package directly on the survival model results.
We argue that an appealing survival evaluation metric tool should satisfy the following two properties.First, the evaluation metrics should not depend on any specific model and they can be applied to evaluating survival models without specifying a specific model form (Schröder et al., 2011).Second, the metrics should be computationally tractable and user has the choice to input only the The R Journal Vol.14/4, December 2022 ISSN 2073-4859 fitted model without any additional processing, especially for the non-specialist (Harrell et al., 1982;Graf et al., 1999;HooraMoradian et al., 2017;Ishwaran et al., 2008).
In this research, we are trying to develop a comprehensive and effective R package with the above two properties.In the proposed SurvMetrics package, the calculation process of most metrics does not depend on the specific model form and only predicted survival probability vector (or matrix) is needed.This feature is particularly desirable for non-statistical researchers such as clinical and financial practitioners (Ali et al., 2021).To illustrate the effectiveness of our tool, we choose two popular survival models, namely, a semi-parametric Cox model and a non-parametric random survival forest model (RSF) to check and test the provided functionalities (Cox, 1972;Ishwaran et al., 2008).

Predictive evaluation metrics in survival analysis
In this section, we will present the survival model evaluation metrics and some implementing details.
The following Table 1 presents some popular R packages and the metrics they provide.As can be seen in Table 1, the proposed SurvMetrics package implements all popular evaluation metrics, while other packages mostly implement just one or two.With our SurvMetrics package, one can compare and evaluate different survival models in a convenient and comprehensive way.

C-index
The most popular evaluation metric in survival predictions is C-index (Harrell et al., 1982;Li et al., 2016;Lee et al., 2018;Devlin and Heller, 2021;Subramanian et al., 2020;Hsu and Lin, 2020;Zadeh and Schmid, 2020;Zhang et al., 2021;Wang and Zhou, 2017;Wang and Li, 2019), which is a generalization of the ROC curve (H Ea Gerty and Zheng, 2005;Obuchowski, 2006;Kang et al., 2015;Li et al., 2016).C-index intends to measure the proportion of predictions of the binary survival status that agree with the true survival status.If the study is concerned with the comparison of survival probability between different samples, for example, which component in the device shows damage earlier, the C-index can be used.
To calculate the values of C-index from different models, Hmisc, survcomp and survival R packages are frequently used in practice (Harrell et al., 1996;Schröder et al., 2011;Therneau, 2021;Harrell Jr, 2021).All the above-mentioned packages in their current versions do not consider the cases of tied survival data, i.e., samples with the same survival time.As shown in Table 3, they return 0 or NA for all cases of survival data tied.However, in practice, cancer patients are usually surveyed annually or monthly after surgery for survival status and with the advent of big data, tied samples are becoming commonplace.Simply ignoring the presence of tie data in calculating C-index will result in evaluation bias among survival models.
Meanwhile, in the calculation of the C-index, not all sample pairs are comparable, e.g., for those pairs whose shorter survival time is censored.So it is necessary to filter the sample pairs (i, j) that can be compared in the calculation, i and j denote any two samples in the data set, respectively.By defining the comparable sample pair in the following way, when np ij = 1, all comparable sample pairs described by Ishwaran et al. (2008) can be covered.
In the proposed R package, we will take a similar strategy adopted in Ishwaran et al. (2008).δ i denotes the survival status of sample i (0 means censoring, 1 means event), Y i and X i represent the predicted survival time and the observed survival time, respectively.By defining csign and sign, we can easily distinguish which sample pairs meet the concordance and which are not.
(a) The comparable sample pair is defined as: ( The R Journal Vol.14/4, December 2022 ISSN 2073-4859 (b) The complete concordance (CC) is defined as: where sign (c) We derive the partial concordance (PC): (5) (d) The C-index value can be given by: From the above, we know C-index lies between 0 and 1 and usually the larger the C-index, the more accurate prediction the model can get.
To further illustrate the proposed Cindex() function in SurvMetrics with other packages, we first give a toy example in Table 2.
A simulation data set which can include all the possible sample pairs, where time and status are the observed survival time and survival status of samples S 1 ...S 8 , respectively, and predicted is the survival probability predicted by survival models.
Based on the above information, different R packages will take different ways to deal with these predictions.We summarize these calculation differences behind these R packages in Table 3. From the above, we can find that when ignoring part or all of the information from tied sample pairs, the C-index values obtained from Hmisc and other two packages cannot reflect the predictive powers of survival models in a true manner.In this case, one can turn to the Cindex() function in SurvMetrics package for help.

BS
BS is another commonly used metric in survival analysis (Steyerberg et al., 2010;Chicco et al., 2021;Imani et al., 2019;Pakbin et al., 2018), which measures the mean squared error between the observed progression status and the predicted survival probability at time t * .Thus, a lower BS usually indicates a better prediction model.BS focuses on the residuals between the predicted survival probability and the survival status at a fixed time point.If the study is concerned with the prediction probability of a model at a fixed time point, BS can be chosen as the metric.For example, the 10-year average survival probability of cancer patients.
The R Journal Vol.14/4, December 2022 ISSN 2073-4859 In the SurvMetrics R package we will take the most classical method to calculate the BS by weighting the prediction residuals with inverse probability censoring weights (IPCW) proposed by Graf et al. (1999): where t * is the time point at which BS is to be calculated, N is the sample size, z i is the covariates of instance i, Ŝ(•) is the survival function predicted by the model, Ĝ(•) denotes the weight for the instance which is estimated by the KM(Kaplan-Meier) estimator of the censoring distribution.
According to the definition of BS, its value depends on the selection of t * provided by the user and different choices of t * will always lead to different BS values (Chew et al., 2001;Li et al., 2021;Zhu and Kosorok, 2012;Ji et al., 2020).In the Brier() function of our SurvMetrics R package, t * is set to median survival time as default.

IBS
As we can see above, BS depends partly on the choices of a user-specified time point.It makes the comparison between models somewhat difficult when we want to know an average prediction performance over all prediction times.In practice, the integral of BS or IBS, which does not depend on the selection of one time point, is more widely used when we are interested in the entire time interval (Zhu and Kosorok, 2012;Periáñez et al., 2016;Bertens et al., 2017).IBS is more concerned with the residuals at all observed time points.When the time of interest is no longer a specific time point, IBS can provide more information than BS.For example, the probability of survival of a cancer patient for each year after the disease.
The definition of IBS is straightforward: But the calculation of IBS using sbrier() function from the ipred package does not always go smoothly (Peters and Hothorn, 2021) .If a list of survfit objects is incorrectly specified, then the package currently returns a particular error: 'Error in switch(ptype,survfit = { : EXPR must be a length 1 vector} ' In our experience, this kind of error is common, especially for the non-specialists (Peters and Hothorn, 2021).
In the SurvMetrics package, we will make life easier for non-specialists.Users only need to input survival time, survival status, the predicted survival probability matrix and the range of integration to the IBS() function, and our program will take care of all the rest work and give a correct output.Similar to BS, a smaller IBS value usually implies a more accurate survival model.

IAE and ISE
Another two evaluation metrics, namely, IAE and ISE are also occasionally used to compare the difference between the estimated survival function Ŝ(•) and the true survival function S(•) in terms of L 1 and L 2 paradigms, respectively.When the purpose of the model is to approximate the theoretical distribution function, IAE and ISE can be selected as the metrics.HooraMoradian et al. ( 2017); Zou et al. (2021).The IAE and ISE can be defined as: where X is the covariate of the training set.
Since the true survival functions are usually unknown beforehand, traditional IAE and ISE methods are only applicable to simulation scenarios.Here, we propose to approximate S(t) using the nonparametric KM estimator in the IAEISE() function and this makes IAE and ISE also suitable for real data study.Similar to IBS, a smaller IAE and ISE values lead to a more accurate model.

MAE
The last evaluation metric presented here is MAE (Schemper, 1992), which can be used to measure the residuals of observed survival time versus predicted survival time, for example, to predict the time to damage of the device.MAE is a better choice than BS if predicted survival time is a model's output.
The definition of MAE is shown below: where n is the event sample size.
MAE only estimates the average absolute error between the predicted survival time and the true survival time in the uncensored samples and is rarely used in practice.For consistence, we also provide this metric in the MAE() function.Similar to MSE, the lower the MAE, the higher the accuracy of prediction.

Simulations and examples
In this section, we will use some simulated and real survival data sets to illustrate the effectiveness of the metrics provided in the SurvMetrics package.

The performance of SurvMetrics on simulation data sets
First, Cox proportional hazard models will be fitted on three simulated scenarios which are very similar to Settings 1-4 by Steingrimsson et al. (2019).Scenario1 satisfies the proportional hazards assumption and others violate it.Scenario1: This data set is created using N independent observations, where the covariate vector (W 1 , ..., W p ) is multivariate normal with mean zero and a covariance matrix having elements (i, j) equal to 0.9 |i−j| .Survival times are simulated from an exponential distribution with mean µ = e 0.1 ∑ p i=[p/2]+1 W i (i.e., a proportional hazards model) and the censoring distribution is exponential with mean c mean which is chosen to control the censoring rate.Here, [x] denotes the largest integer no more than x.Scenario2: This data set is created using N independent observations where the covariates (W 1 , ..., W p ) are multivariate normal with mean zero and a covariance matrix having elements (i, j) equal to 0.75 |i−j| .Survival times are gamma distributed with shape parameter µ = 0.5 + 0.3|∑ and scale parameter 2. Censoring times are uniform on [0, u max ], and u max is chosen to control the censoring rate.Here, the proportional hazards assumption is violated.Scenario3: This data set is created using N independent observations where the covariates (W 1 , ..., W p ) are multivariate normal with mean zero and a covariance matrix having elements (i, j) equal to 0.75 |i−j| .Survival times are simulated according to a log-normal distribution with mean µ = 0.1|∑ Censoring times are log-normal with mean µ + c step and scale parameter one, where c step is chosen to control the censoring rate.Here, the underlying censoring distribution depends on covariates and the proportional hazards assumption is also violated.
From Figure 3, one may see that the prediction performance of the Cox model keeps decreasing in three scenarios as expected.However, the difference in terms of C-index and BS are not significant for Scenario2 and Scenario3 while in terms of IBS, IAE and ISE, a sharp difference and a clear trend can be observed.
As we know, C-index evaluates the model from a ranking perspective in a rough matter and BS only considers a fixed time point.Both metrics may not distinguish models described above or in other complex settings.In this case, evaluation metrics which consider model performance in a global way, such as IBS, IAE, and ISE, may be useful alternatives.This also justifies why a package such as SurvMetrics that can evaluate survival models from multiple perspectives is strongly needed in practice.

An example of SurvMetrics
In this section, a kidney dataset from the survival R package is used to illustrate the usage of the SurvMetrics package using popular survival models widely used in biostatics and biomedical study.The sample size is 300, the dimension is 25, and control the censoring rate about 30%.The prediction performance of the Cox model keeps decreasing in three scenarios, which satisfies the data generated methods.However, in some complex situations where C-index and BS cannot distinguish models, IBS, IAE, and ISE, may be useful alternatives.
In the following, we first cut the kidney dataset into a training set and a testing set.Then, two popular models, namely, Cox model and random survival forest model are constructed based on the training set to show how different functions are used in SurvMetrics.Finally, we will show how to evaluate the predictive performance of these two models on the testing set using the proposed SurvMetrics R package.In the latest version of SurvMetrics, user has the choice to input only the standard survival models and testing sets.Meanwhile, it is still useful to deal with predicted survival probabilities from non-standard models.The following example will contain these two input forms.The results presented in Table 4 and Figure 4 show that using C-index alone is hard to differentiate the predictive power of the two models.In this case, alternative measures provided in the proposed package may help if further evaluation is needed.This result is also consistent with the simulated scenario mentioned above.

Summary
Assessing the predictive performance of survival models is complex due to the lack of standards regarding the best criterion to use in survival analysis.The available metrics are scatter across different R packages that use heterogeneous interfaces, which makes it difficult for the non-specialist to use or compare the performance of various survival models.
In this paper, we try to fill the gap by providing an "all-in-one" R package called SurvMetrics which provides a uniform interface to an extensive set of performance assessment and statistical comparison methods.Practitioners can easily implement comparative studies and identify the best model(s) using this package.In the current version of the SurvMetrics package, six evaluation metrics The R Journal Vol.14/4, December 2022 ISSN 2073-4859 are present.Evaluation metrics used in more complicated settings such as time-dependent AUC for joint modelling of longitudinal and survival data, and concordance index for competitive risks are still being developed and will be added to the SurvMetrics package in the future.Meanwhile, we are also working to provide a more user-friendly interface to facilitate both statistical and non-statistical clinical research workers in evaluating survival models.

Figure 1 :
Figure1: Frequency of survival model evaluation metrics appearing in journals such as "Annals of Statistics", "Biometrika", "Journal of the American Statistical Association", "Journal of the Royal Statistical Society, Series B", "Statistics in Medicine", "Artificial Intelligence in Medicine", "Lifetime Data Analysis" from 2010 to 2021.

TheFigure 3 :
Figure 3: This graphic shows the Cox model prediction accuracy based on three different scenarios via 5 evaluation metrics by the SurvMetrics package.The sample size is 300, the dimension is 25, and control the censoring rate about 30%.The prediction performance of the Cox model keeps decreasing in three scenarios, which satisfies the data generated methods.However, in some complex situations where C-index and BS cannot distinguish models, IBS, IAE, and ISE, may be useful alternatives.

Figure 4 :
Figure 4: This graphic shows the prediction accuracy by using SurvMetrics package to compare the Cox and RSF models based on the kidney data set from survival R package.In this case, using IBS, IAE and ISE can give a more clear conclusion that Cox model performs better.

Table 1 :
R packages that are commonly used to evaluate survival predictions.

Table 4 :
This table shows the prediction accuracy by using SurvMetrics package to compare the Cox and RSF models based on the kidney data set from survival R package.In this case, the values of C-index are close, and the values of IAE and ISE are far different.When using C-index alone is hard to differentiate the predictive power of the two models, alternative measures provided in SurvMetrics may help give a more accurate result.