Support Vector Machines for Survival Analysis with R by

This article introduces the R package survivalsvm, implementing support vector machines for survival analysis. Three approaches are available in the package: The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem. In the ranking approach, the inequality constraints set the objective to maximize the concordance index for comparable pairs of observations. The hybrid approach combines the regression and ranking constraints in a single model. We describe survival support vector machines and their implementation, provide examples and compare the prediction performance with the Cox proportional hazards model, random survival forests and gradient boosting using several real datasets. On these datasets, survival support vector machines perform on par with the reference methods.


Introduction
Survival analysis considers time to an event as the dependent variable.For example, in the veteran's administration study (Kalbfleisch and Prentice, 2002), a clinical trial of lung cancer treatments, the dependent variable is time to death.The particularity of such a survival outcome is censoring, indicating that no event occurred during the study.For example, in the veteran's lung cancer study, some patients stayed alive until the end of the study such that time to death is censored.Because the time to event is unknown for censored observations, standard regression techniques cannot be used.The censoring indicator δ is set to be 0 or 1 for censored and not censored individuals, respectively.The main interest is to analyze the time T ∈ [0, ∞] until an event occurs, given covariates X ∈ R d .
The most popular statistical approach for survival analysis is the Cox proportional hazards (PH) model, which is described in detail in standard textbooks (Kleinbaum and Klein, 2012;Lee and Wang, 2003).The most important advantage of the PH model is that it does not assume a particular statistical distribution of the survival time.However, the crucial assumption is that the effect of covariates on the survival variable is time independent.The hazards of two individuals are thus assumed to be proportional over time (proportional hazards assumption).The general form of the PH model is given by h(t|X) = h 0 (t) exp(β X) , where β = (β 1 , β 2 , . . ., β d ) ∈ R d are parameters to be estimated and h 0 (t) is the baseline hazard function, which is independent of the covariates.It does not need to be pre-specified or estimated in the model.
The proportional hazards assumption can easily be checked in one dimension (X ∈ R), but is difficult to verify when working in higher dimensions.Another difficulty when fitting a PH model occurs when the number of covariates d exceeds the number of individuals n, since it uses the partial likelihood for parameter estimation.
One alternative approach to the PH model is to use support vector machines (SVMs).SVMs were first proposed by Vapnik (1995) as a learning method for dichotomous outcomes.They are known for their good theoretic foundations and high classification accuracy on high dimensional classification problems (Cervantes et al., 2008;Hsu et al., 2003).The formulation of an SVM supposes a target variable Y ∈ {−1, 1} and covariates X ∈ R d .Assuming that the two target classes are linearly separable, there exists a linear function f (x) = ψx + b such that y f (x) > 0. The SVM task is to find the separating hyperplane H(ψ, b) = {x | x, ψ + b = 0} for the two classes with the maximal margin.The margin is defined as the smallest distance between any data point and the separating hyperplane.Data points lying at this distance from the separating hyperplane are called support vectors.They determine the margin, and must verify either f (x) = 1 or f (x) = −1.If the two classes are not linearly separable, misclassifications can be allowed.This is done by introducing slack variables ξ i ≥ 0, allowing but penalizing misclassifications.The slack variable for an individual i is defined as is inside the margin and on the correct side of the hyperplane, ξ i > 1 if the data point lies on the wrong side of the hyperplane, and ξ i = 1 if observation i lies exactly on the hyperplane.Figure 1 presents an illustration of the SVM approach for a two dimensional data set.Data points are grouped into the classes y = 1 (circles) and y = −1 (triangles).The filled and darkened data points are the support vectors defining the margin.The two red data points lying inside the margin are misclassified.They are penalized depending on their locations on the wrong or the correct side of the hyperplane, represented by the dashed line.All the other data points are correctly classified and therefore not penalized.

Suppor vector Support vector Misclassified Maximal margin optimal hyperplane
Figure 1: Illustration of the SVM approach with a two dimensional data set, split into two classes y = 1 (circles) and y = −1 (triangles).The filled data points are the support vectors defining the margin.Data points lying outside of the margin are correctly classified and not penalized (ξ = 0), contrary to the two misclassified and penalized red points lying inside the margin.These data points are penalized with the slack variables ξ < 1 or ξ > 1, depending on whether they lie on the correct or on the wrong side of the separating hyperplane (dashed line).This illustration is inspired by Rhode (2012) and the SVM problem was solved with the kernlab package (Karatzoglou et al., 2004).
To achieve the goal of the SVM approach described above, the following optimization problem is posed in primal space: where ψ, b and the slack variables ξ i are unknown, n is the number of individuals, and γ > 0 is a regularization parameter controlling the maximal margin and misclassification penalties.Instead of solving the optimization problem (2) in the primal space, the optimization problem is transformed to the dual problem, and the Lagrange function is optimized in the dual space.Details can be found in Bishop (2007).Eitrich and Lang (2006) recommends tuning the model to find the best parameter for regularization.In the SVM problem posed in (2), it is assumed that the two classes are linearly separable, but this is usually not the case.Russell and Norvig (2010) demonstrates that a set of n data points is always separable in an n − 1 dimensional space, which guarantees the existence of a higher dimensional space in which the two classes are linearly separable.This concept is described in more detail in the next section.
SVMs were extended to support vector regression, a variant for continuous outcomes by Vapnik (1998).More recently, several extensions to survival analysis were proposed.Van Belle et al. (2007) and Evers and Messow (2008) extended the formulation of the SVM problem with the aim to maximize the concordance index for comparable pairs of observations.This approach, also known as the ranking approach, was modified in Van Belle et al. (2008) to improve computational performance.Shivaswamy The R Journal Vol.10/1, July 2018 ISSN 2073ISSN -4859 et al. (2007) ) introduced a regression approach to survival analysis, based on the idea of support vector regression (Vapnik, 1998).Van Belle et al. (2011) combined the ranking and regression approaches in a single model to build the hybrid approach of SVMs for survival outcomes.The ranking approach, as proposed by Evers and Messow (2008), is implemented in the R package survpack available on the authors' website (Evers, 2009).The approaches of Van Belle et al. (2008) and Van Belle et al. (2011) are available in a Matlab toolbox (Yang and Pelckmans, 2014).
In the next section, we describe the three approaches for survival SVMs in detail.After that, we present the implementation of these methods in the R package survivalsvm.Finally, an application of survival SVMs on real data sets compares their prediction performance and runtime with established reference methods and other available implementations.

Survival support vector machines
Three approaches have been proposed to solve survival problems using SVMs: the regression (Shivaswamy et al., 2007), the ranking (Van Belle et al., 2007;Evers and Messow, 2008;Van Belle et al., 2008) and the hybrid approach (Van Belle et al., 2011).The regression approach is based on the support vector regression (SVR) (Vapnik, 1998) idea and aims at finding a function that estimates observed survival times as continuous outcome values y i using covariates x i .A naïve variant of this approach is to ignore all censored observations and just solve the resulting SVR problem.Unfortunately, such a formulation implies a loss of information, since it does not take the particularity of survival data into account.Shivaswamy et al. (2007) improved this formulation by including the censoring in the SVM problem.For censored observations, the time to event after censoring is unknown and thus predictions greater than the censoring time do not have to be penalized.However, all survival predictions lower than the censoring time are penalized as usual.For not censored data, the exact survival times are known and, as in standard SVR, all survival predictions lower or greater than the observed survival time are penalized.Alternatively, survival predictions from censored and not censored data can be penalized differently, as proposed by Khan and Zubek (2008).However, this implies more parameters of regularization when formulating the survival SVR problem, increasing computation times (Van Belle et al., 2011).In the present work, we implement the survival SVR as proposed by Shivaswamy et al. (2007) and formulate the problem as follows: where i = 1, . . ., n. δ is the censoring indicator and F is a function that maps observed covariates to the feature space.The feature space, compared with the original space in which data is observed, is a higher dimensional space in which the training data points are projected to be separated more easily by a hyperplane.More details about this concept can be found in Vapnik (1995).Since the feature space implies a higher dimensional space, the inner product ψ, F(x i ) is calculated using a kernel function to reduce runtime (Vapnik, 1995).In the case of no censoring, i.e., δ i = 1, the inequality constraints in (3) are the same as in the classical SVR problem.However, if censoring occurs, the second constraint is reduced to ξ * i ≥ 0. The ranking approach considers survival analysis based on SVMs as a classification problem with an ordinal target variable (Herbrich et al., 1999).It aims at predicting risk ranks between individuals instead of estimating survival times (Van Belle et al., 2007;Evers and Messow, 2008).Suppose two individuals i and j at time t, where an event occurs for j but not necessarily for i.Their predictions based on the classical SVM idea are ψ, F(x i ) + b and ψ, F(x j ) + b, respectively.The ranking approach aims at predicting the correct ranking of y i and y j , meaning (y i − y j )( ψ, F(x i ) − ψ, F(x j ) ) > 0. If we further suppose y j < y i , predictions for i and j have to verify ψ, F(x i ) − ψ, F(x j ) > 0. Therefore, the ranking approach, as proposed by Van Belle et al. (2007), is equivalent to solving the optimization problem min The R Journal Vol.10/1, July 2018 ISSN 2073-4859 where is the comparison function for observations i and j.Assuming that individuals are sorted increasingly by their survival times, two observations i and j (i < j) are comparable if their survival times t i and t j verify t i < t j and δ i = 1, i.e., the shorter observed time is not censored.If we further suppose that the observation with the shortest survival time is not censored, each observation i = 2, . . ., n will be comparable to at least one other individual.The problem posed in ( 4) is equivalent to maximizing the concordance index (C-index) defined by Van Belle et al. (2007) over comparable pairs for a given prediction function u as where I(a) = 1 if a > 0, and I(a) = 0, otherwise.This definition is inspired from those proposed by Harrell et al. (1984).Given an observation i, we only consider one comparable observation j(i) = i − 1.
We denote the first variant of the ranking approach presented in (4) as vanbelle1 .In an alternative formulation, Van Belle et al. ( 2011) restrict the comparability effect to the inequality constraints as follows: where i = 1, . . ., n. Thereby, the value 1 on the right-hand side of the first constraint in vanbelle1 (4) is replaced by the difference y i − y j(i) between the survival times y i and y j(i) of data points i and its nearest comparable neighbor j(i).We denote this second variant of the ranking approach as vanbelle2 .Another variant of the ranking approach was developed by Evers and Messow (2008) and implemented in the R package survpack, offering the possibility to fit survival SVM models using either linear or radial basis kernels.In this variant, for a data point i, all comparable pairs are considered, instead of only the nearest neighbor.The denotation evers is used to refer to this variant.
The hybrid approach (Van Belle et al., 2011) combines the regression and ranking approaches in the survival SVMs problem.Thus, the constraints of (3) and ( 5) are included in the optimization problem min where i = 1, . . ., n.To solve the optimization problems (3), ( 4), ( 5) and ( 6) we consider the corresponding Lagrange function in dual space and solve the quadratic optimization problem.The next section presents our implementation of these models in R.

Implementation in R
We implemented the four models presented in (3), (4), ( 5) and ( 6) in the survivalsvm package.The function survivalsvm fits a new survival model, and risk ranks are predicted using the generic R function predict.Common to the four models is that a quadratic optimization problem is solved when moving to the dual space.In the function survivalsvm, we solve this optimization problem using two quadratic programming solvers: ipop from the package kernlab (Karatzoglou et al., 2004) and pracma from the package pracma (Borchers, 2016).The function pracma wraps quadprog from the package quadprog (Berwin A., 2013), which is implemented in Fortran to solve quadratic problems as described by Goldfarb and Idnani (1982).Thereby, the kernel matrix is assumed to be positive semi-definite.If this is not the case, the function nearPD from the Matrix package (Bates and Maechler, 2016) is used to adjust the kernel matrix to the nearest positive semi-definite matrix.In contrast to The R Journal Vol.10/1, July 2018 ISSN 2073-4859 quadprog, ipop is written in pure R. Hence, the runtime of ipop is expected to be greater than that of quadprog for solving the same optimization problem.However, an advantage of ipop is that the kernel matrix does not need to be modified when solving the optimization problem.The user of survivalsvm can choose which solver is used for solving the quadratic optimization problem.
As in the ranking formulations (4) and ( 5), the hybrid formulation (6) calculates differences between comparable data points.Three options to define comparable pairs are available in survivalsvm: makediff1 removes the assumption that the first data point is not censored, makediff2 computes differences over not censored data points only and makediff3 uses the definition described above.
The R package survivalsvm allows the user to choose one of the four kernels linear, additive (Daemen and De Moor, 2009), radial basis, and polynomial, labeled lin_kernel, add_kernel, rbf_kernel and poly_kernel, respectively.They can be passed to the survivalsvm function using the kernel parameter.

Example of usage
To exemplify the usage, the implementation is applied to the data set veteran available in the package survival (Therneau, 2015).The function Surv from the package survival serves to construct a survival target object.

Real data application and comparison to other survival models
To evaluate the survival SVM models and our implementation, four publicly available survival data sets were used.The first is the data set veteran from the Veteran's lung cancer trial study (Kalbfleisch and Prentice, 2002), available in the package survival.It includes 137 individuals and 5 covariates.Second, we utilized data from the Interpretation of a Trial Stopped Early study (Ermerson and Banks, 1994) in which 130 individuals participated.Two survival outcomes were of interest in this study, complete remission and death.The corresponding data sets are labeled leuk_cr and leuk_death .For both data sets, 10 covariates are provided.Third, we used the Germany Breast Cancer Study Group 2 (GBSG2) data (Schumacher et al., 1994) which consists in 686 samples and 8 covariates.Finally, we considered the Mayo Clinic Lung Cancer (MCLC ) study (Loprinzi et al., 1994)  For each data set, we fitted the four survival SVM models (3), ( 4), ( 5) and ( 6) using linear, additive and radial basis function (RBF ) kernels.The ranking approach of survival SVMs implemented in the R package survpack is applied using linear and RBF kernels, the two kernels offered by the package.The Cox PH model, random survival forest (RSF ) (Ishwaran et al., 2008) and gradient boosting (Gboost ) for survival analysis Ridgeway (1999) served as reference models.
For RSF, the package randomForestSRC (Ishwaran and Kogalur, 2016) was used.The number of random variables for splitting and the minimal number of events in the terminal nodes were tuned when building survival trees.randomForestSRC refers to these parameters as mtry and nodesize, respectively.For gradient boosting models implemented in mboost (Hothorn et al., 2016), we fitted a PH model as base learner.The number of boosting iterations and the regression coefficient were tuned.In mboost, these parameters are named mstop and nu, respectively.Tuning was conducted using 5 × 10-fold nested cross validation: Data sets were randomly divided into 5 almost equally sized subsamples, and in each iteration, one of the 5 groups was used as test data set.On the remaining groups, the models were trained after tuning the model parameters with a 10-fold cross validation.The best models were chosen based on the C-index.The mlr package (Bischl et al., 2016a,b) was employed for parameter tuning.
Experiments were run on a high performance computing platform using 6 cores of Intel Xeon 3.33 GHz CPUs and an available memory of 183 GB.Table 2 summarizes the empirical mean runtime required by each survival SVM model to run a single resampling operation.As depicted, the runtimes are influenced by the kernel function used.Compared with linear and additive kernel functions, which required approximatively equal runtimes, the runtimes of the radial basis function (RBF) are higher.The reason is that the RBF requires one additional parameter, which has to be optimized via tuning.The effect of the number of parameters is also observed on the hybrid survival SVM approach, which required more runtime than the other approaches.Finally, the runtime of the ranking approaches vanbelle1 and vanbelle2, implemented in survivalsvm, was considerably lower than that of the evers approach, implemented in the survpack package.
Table 3 and Figure 2 present the performance estimates of the compared models, based on the C-index.Of the 17 models, the best model received the rank 1, the worst model the rank 17.In case of ties, all tied models were assigned the mean rank value.On the veteran data set, the hybrid approach with the additive kernel performed best of all survival SVM models, with only slight differences to the PH model and SSVR approach.The models based on the ranking approach (vanbelle1 and vanbelle2) The R Journal Vol.This operation includes tuning of the parameters of regularization for the ranking and regression based models.Since the RBF, in comparison to linear and additive kernel functions, requires one additional parameter to be computed, its runtimes were higher.Furthermore, because the hybrid approach uses two parameters of regularization, it also needs more time to find the best parameters of regularization.
The implementation of the evers approach in survpack does not include the additive kernel.† Interrupted after 10 days of computation.
performed worse than the other models.The differences between the reference models were small.For the leuk_cr data set, the evers approach with the RBF kernel was the best SVM approach, followed by the hybrid approach with the additive kernel.The best reference models were the PH model and GBoost.For leuk_death, the hybrid approach with the additive kernel performed on par with RSF.The RSF model also performed better than the other reference models on the GBSG2 data set, while evers performed worse.The hybrid model was still the best survival SVM model, with almost the same results for the linear and additive kernel.No results could be obtained for the RBF kernel in 10 days of computation.For the MCLC data set, the survival SVM with the hybrid approach and RBF kernel performed best, while the PH model was the best reference model.The differences were small, except for evers, which performed worse.In summary, there are only slight differences between the best survival SVM models and the best reference models.However, the differences between the SVM approaches and kernels were substantial.

Conclusion
We presented the R package survivalsvm for fitting survival SVMs.Three approaches are available in the package.First, in the regression approach, the classical SVR was extended to censored survival outcomes.Second, in the ranking approach, the ranks between model predictions and the observed survival times are maximized based on the C-index.The third approach, called the hybrid approach, combines the two first approaches into a single model.We implemented these three approaches in the survivalsvm package and used 5 data sets to compare the prediction performance and runtime of our implementation with the Cox PH model, RSF and gradient boosting.Furthermore, we included an implementation of a variant of the ranking approach (Evers and Messow, 2008) in the comparison.
Of the survival SVM models, the hybrid approach generally performed best in terms of prediction performance but was slowest to compute.We observed only small differences between the best SVM models and the best reference models and could not determine a clear winner.
Comparing the ranking and regression based models, the evers approach always required more runtime than the approaches implemented in survivalsvm.Although the hybrid approach performed better than the others survival SVM approaches, its runtime was considerably increased.This was due to the fact that the formulation of the hybrid approach needs two parameters of regularization, while the ranking and regression approaches require only one parameter.Figure 2: Prediction performance for the five survival support vector models (vanbelle1, vanbelle2, SSVR, hybrid and evers) and three reference methods (PH, RSF and GBoost) on 5 data sets.The C-index was computed for each method using nested 5 × 10 cross validation.The quadprog optimizer was used in the package survivalsvm.Plots were generated using the ggplot2 (Wickham, 2009) and tikzDevice (Sharpsteen et al., 2016) packages.
The best performing kernel functions depended on the data set and the chosen SVM model.For the ranking approaches, the differences were larger than the regression and hybrid approaches.For the hybrid approach, the additive and RBF kernels achieved the best results.However, the runtimes for the RBF kernel were substantially larger.Again, this was due to the tuning of an additional parameter.
Our implementation utilizes quadratic programming and an interior point optimizer to solve the quadratic optimization problem derived from the primal support vector optimization problem.When the quadratic programming is used for a non positive semi-definite kernel matrix, this matrix is slightly modified to the nearest positive semi-definite matrix.Calling the interior point optimizer does not make any modification on the original matrix, but is computationally slower since the software is fully implemented in R. Pölsterl et al. (2015) proposed a fast algorithm to train survival SVMs in primal space.This algorithm is fast in low dimensions, but for high dimensional problems the authors recommended reducing the dimensions before applying an SVM algorithm.However, some special and fast algorithms, such as the sequential minimal optimization (SMO ) (Platt, 1998), which are available for classical SVM optimization problems, were shown to be more accurate (Horn et al., 2016).The implementation for survival SVMs could possibly be improved by an extension of the SMO optimization procedure.
Having restrictions on only the nearest neighbor in the ranking approach, as formulated in vanbelle1 and vanbelle2, can considerably improve the computational performance, but can also reduce prediction performance.In principle, the number of nearest neighbors is not limited to the choices of Evers and Messow (2008) and Van Belle et al. (2008).Since the optimal number of neighbors depends on the dataset and the availability of computational resources, it may be included as a further tuning parameter.
In conclusion, we have shown that SVMs are a useful alternative for survival prediction.The package survivalsvm provides a fast and easy-to-use implementation of the available approaches of survival SVMs.Our results show that the choice of the SVM model and the kernel function is crucial.In addition to prediction performance, runtime is an important aspect for large data sets.We The R Journal Vol.10/1, July 2018 ISSN 2073-4859 recommend to conduct benchmark experiments using several approaches and available kernels before analyzing a data set.

Table 1 :
comprising 168 individuals and 8 covariates.Table1provides a brief summary of the data sets used.Data sets used to compare prediction performance and runtime.The data sets leuk_cr and leuk_death differ only in the event considered.In leuk_cr, the event is a complete remission while in leuk_death the event of interest is death.The two last columns give the names of the censoring status and survival time variables.

Table 2 :
Mean runtime for each survival SVM model to run a single resampling operation.

Table 3 :
Prediction performance of the five survival SVMs (vanbelle1, vanbelle2, SSVR, hybrid and evers) and three reference methods (PH, RSF and GBoost) on 5 data sets.The C-index was computed for each method using 5 × 10-fold nested cross validation.Mean values and standard deviations (SD ) are shown.The C-index and a rank is presented for each model.Rank 1 was assigned to the best performing model, rank 17 to the worst performing one.In case of ties, all tied models were assigned the mean rank value.The quadprog The best SVM models and the best reference models are marked in light and dark gray colors, respectively.The implementation of the evers optimizer was used in the package survivalsvm.