This article introduces the R package survivalsvm, implementing support vector machines for survival analysis. Three approaches are available in the package: the regression approach takes censoring into account when formulating the inequality constraints of the support vector problem; in the ranking approach, the inequality constraints set the objective to maximize the concordance index for comparable pairs of observations; the hybrid approach combines the regression and ranking constraints in a single model. We describe survival support vector machines and their implementation, provide examples, and compare the prediction performance with the Cox proportional hazards model, random survival forests and gradient boosting using several real datasets. On these datasets, survival support vector machines perform on par with the reference methods.
Survival analysis considers time to an event as the dependent variable.
For example, in the veteran’s administration study
(Kalbfleisch and Prentice 2002), a clinical trial of lung cancer
treatments, the dependent variable is time to death. The particularity
of such a survival outcome is censoring, indicating that no event
occurred during the study. For example, in the veteran’s lung cancer
study, some patients stayed alive until the end of the study such that
time to death is censored. Because the time to event is unknown for
censored observations, standard regression techniques cannot be used.
The censoring indicator equals 1 if the event was observed and 0 if the observation is censored.
The most popular statistical approach for survival analysis is the Cox
proportional hazards (PH) model, which is described in detail in
standard textbooks (Lee and Wang 2003; Kleinbaum and Klein 2012). The
most important advantage of the PH model is that it does not assume a
particular statistical distribution of the survival time. However, the
crucial assumption is that the effect of covariates on the survival
variable is time independent. The hazards of two individuals are thus
assumed to be proportional over time (proportional hazards assumption).
The general form of the PH model is given by

h(t | x) = h0(t) exp(β1 x1 + … + βp xp),

where h0(t) denotes the baseline hazard function, x1, …, xp are the covariates and β1, …, βp the corresponding regression coefficients. The proportional hazards assumption can easily be checked in one dimension.
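As an illustration, the PH model and a check of the proportional hazards assumption can be carried out with the survival package; the sketch below uses the veteran data introduced above, and the choice of covariates is only for demonstration:

```r
# Fit a Cox PH model on the veteran data and test the proportional hazards
# assumption via scaled Schoenfeld residuals (function cox.zph).
library(survival)
data(veteran, package = "survival")

fit <- coxph(Surv(time, status) ~ karno + celltype, data = veteran)
cox.zph(fit)  # small p-values indicate a violation of the PH assumption
```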
One alternative approach to the PH model is to use support vector
machines (SVMs). SVMs were first proposed by Vapnik (1995) as a
learning method for dichotomous outcomes. They are known for their good
theoretic foundations and high classification accuracy on high
dimensional classification problems
(Hsu et al. 2003; Cervantes et al. 2008). The formulation of an SVM supposes a dichotomous target variable and aims at finding the hyperplane that separates the two classes with maximal margin. To achieve this goal, the following optimization problem is posed in primal space:

minimize (1/2)||w||^2 + γ Σi ξi over w, b and ξ,
subject to yi (w'φ(xi) + b) ≥ 1 − ξi and ξi ≥ 0 for i = 1, …, n,

where w and b parametrize the separating hyperplane, φ is a feature map, γ > 0 is a regularization parameter and the slack variables ξi allow for misclassified observations.
SVMs were extended to support vector regression, a variant for continuous outcomes, by Vapnik (1998). More recently, several extensions to survival analysis were proposed. Van Belle et al. (2007) and Evers and Messow (2008) extended the formulation of the SVM problem with the aim to maximize the concordance index for comparable pairs of observations. This approach, also known as the ranking approach, was modified by Van Belle et al. (2008) to improve computational performance. Shivaswamy et al. (2007) introduced a regression approach to survival analysis, based on the idea of support vector regression (Vapnik 1998). Van Belle et al. (2011) combined the ranking and regression approaches in a single model to build the hybrid approach of SVMs for survival outcomes. The ranking approach, as proposed by Evers and Messow (2008), is implemented in the R package survpack, available on the authors’ website (Evers 2009). The approaches of Van Belle et al. (2008) and Van Belle et al. (2011) are available in a Matlab toolbox (Yang and Pelckmans 2014).
In the next section, we describe the three approaches for survival SVMs in detail. After that, we present the implementation of these methods in the R package survivalsvm. Finally, an application of survival SVMs on real data sets compares their prediction performance and runtime with established reference methods and other available implementations.
Three approaches have been proposed to solve survival problems using
SVMs: the regression (Shivaswamy et al. 2007), the ranking
(Van Belle et al. 2007, 2008; Evers and Messow 2008) and the
hybrid approach (Van Belle et al. 2011). The regression approach is
based on the support vector regression (SVR) (Vapnik 1998)
idea and aims at finding a function that estimates observed survival times as continuous outcome values, while taking censoring into account in the constraints.
The ranking approach considers survival analysis based on SVMs as a
classification problem with an ordinal target variable
(Herbrich et al. 1999). It aims at predicting risk ranks between
individuals instead of estimating survival times
(Van Belle et al. 2007; Evers and Messow 2008). Suppose two comparable individuals, where one has the longer observed survival time; the ranking constraints then require the model to predict a higher risk for the individual with the shorter survival time.
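To make comparable pairs and the concordance index concrete, the following base R sketch computes Harrell's concordance index; the function name cindex and the toy data are illustrative and not part of survivalsvm, and ties in survival time are simply skipped:

```r
# Harrell's concordance index over comparable pairs, a minimal base-R sketch.
# A pair is comparable if the member with the shorter observed time had an
# event; the model should then assign that member the higher risk.
cindex <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      first  <- if (time[i] < time[j]) i else j   # shorter observed time
      second <- if (time[i] < time[j]) j else i
      if (time[i] == time[j] || status[first] == 0) next  # not comparable
      comparable <- comparable + 1
      if (risk[first] > risk[second]) concordant <- concordant + 1
      else if (risk[first] == risk[second]) concordant <- concordant + 0.5
    }
  }
  concordant / comparable
}

# Toy data: risk decreases with survival time, so all comparable pairs
# are concordant and the C-index is 1.
time   <- c(2, 5, 8, 12)
status <- c(1, 1, 0, 1)   # 0 = censored
risk   <- c(4, 3, 2, 1)
cindex(time, status, risk)  # 1
```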
The hybrid approach (Van Belle et al. 2011) combines the regression and ranking approaches in the survival SVMs problem. Thus, the constraints of (3) and (5) are included in a single optimization problem.
We implemented the four models presented in (3), (4), (5) and (6) in the survivalsvm package. The function survivalsvm fits a new survival model, and risk ranks are predicted using the generic R function predict. Common to the four models is that a quadratic optimization problem is solved when moving to the dual space. In the function survivalsvm, we solve this optimization problem using one of two quadratic programming solvers: ipop from the package kernlab (Karatzoglou et al. 2004) and quadprog from the package pracma (Borchers 2016). The function quadprog wraps solve.QP from the package quadprog (Turlach and Weingessel 2013), which is implemented in Fortran and solves quadratic problems as described by Goldfarb and Idnani (1982). Here, the kernel matrix is assumed to be positive semi-definite. If this is not the case, the function nearPD from the Matrix package (Bates and Maechler 2016) is used to replace the kernel matrix by the nearest positive semi-definite matrix. In contrast to quadprog, ipop is written in pure R. Hence, the runtime of ipop is expected to be greater than that of quadprog for solving the same optimization problem. However, an advantage of ipop is that the kernel matrix does not need to be modified when solving the optimization problem. The user of survivalsvm can choose which solver is used for solving the quadratic optimization problem.
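The kernel-matrix adjustment can be illustrated with a small sketch using nearPD from the Matrix package; the matrix values below are made up for demonstration:

```r
# Adjusting a kernel matrix that is not positive semi-definite to the nearest
# positive semi-definite matrix, as done before handing it to quadprog.
library(Matrix)

K <- matrix(c(1.0, 0.9,  0.7,
              0.9, 1.0,  0.9,
              0.7, 0.9, -0.2), nrow = 3, byrow = TRUE)
min(eigen(K, symmetric = TRUE)$values)   # negative: K is not PSD

K.psd <- as.matrix(nearPD(K)$mat)        # nearest PSD matrix
min(eigen(K.psd, symmetric = TRUE)$values) >= -1e-8  # TRUE up to tolerance
```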
As in the ranking formulations (4) and (5), the hybrid formulation (6) calculates differences between comparable data points. Three options to define comparable pairs are available in survivalsvm: makediff1 removes the assumption that the first data point of a pair is not censored, makediff2 computes differences over uncensored data points only, and makediff3 uses the definition described above.
The R package survivalsvm allows the user to choose one of four kernels: linear, additive (Daemen and De Moor 2009), radial basis function and polynomial, labeled lin_kernel, add_kernel, rbf_kernel and poly_kernel, respectively. They can be passed to the survivalsvm function using the kernel parameter.
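To illustrate what a kernel choice computes, the following base R sketch builds a radial basis function kernel matrix; this is a generic illustration, not the internals of survivalsvm, and the function name and bandwidth are made up:

```r
# Radial basis function kernel matrix: K[i, j] = exp(-||xi - xj||^2 / (2 sigma^2)).
rbf_kernel_matrix <- function(X, sigma = 1) {
  d2 <- as.matrix(dist(X))^2        # squared Euclidean distances between rows
  exp(-d2 / (2 * sigma^2))
}

X <- matrix(c(0, 0,
              1, 1,
              3, 3), ncol = 2, byrow = TRUE)
K <- rbf_kernel_matrix(X, sigma = 1)
diag(K)            # all 1: each point is maximally similar to itself
K[1, 2] > K[1, 3]  # TRUE: closer points receive larger kernel values
```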
To exemplify the usage, the implementation is applied to the data set
veteran available in the package survival (Therneau 2015). The function Surv from the package survival serves to construct a survival target object.
R> library(survivalsvm)
R> library(survival)
R> data(veteran, package = "survival")
First, we split the data into a training and a test data set
R> set.seed(123)
R> n <- nrow(veteran)
R> train.index <- sample(1:n, 0.7 * n, replace = FALSE)
R> test.index <- setdiff(1:n, train.index)
and next fit a survival support vector regression model
R> survsvm.reg <- survivalsvm(Surv(diagtime, status) ~ .,
+ subset = train.index, data = veteran,
+ type = "regression", gamma.mu = 1,
+ opt.meth = "quadprog", kernel = "add_kernel")
The regularization parameter is passed using the argument gamma.mu.
For each of the models (3), (4) and (5), only one value is required, while two values are needed when fitting a hybrid model. Calling the print function on the output gives
R> print(survsvm.reg)
survivalsvm result
Call:
survivalsvm(Surv(diagtime, status) ~ ., subset = train.index, data = veteran,
type = "regression", gamma.mu = 1, opt.meth = "quadprog", kernel = "add_kernel")
Survival svm approach : regression
Type of Kernel : add_kernel
Optimization solver used : quadprog
Number of support vectors retained : 39
survivalsvm version : 0.0.4
We can now make the predictions for the observations given by test.index:
R> pred.survsvm.reg <- predict(object = survsvm.reg,
+ newdata = veteran, subset = test.index)
and print the prediction object:
R> print(pred.survsvm.reg)
survivalsvm prediction
Type of survivalsvm : regression
Type of kernel : add_kernel
Optimization solver used in model : quadprog
Predicted risk ranks : 13.89 14.95 11.12 15.6 10.7 ...
survivalsvm version : 0.0.4
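The other model types are fitted analogously; for instance, a hybrid model requires two regularization values in gamma.mu and a comparable-pair definition via diff.meth. A sketch under the same training split (following the package documentation):

```r
# Fit a hybrid survival SVM: two regularization values (one for the regression
# and one for the ranking constraints) and the makediff3 pair definition.
survsvm.hyb <- survivalsvm(Surv(diagtime, status) ~ .,
                           subset = train.index, data = veteran,
                           type = "hybrid", gamma.mu = c(1, 1),
                           diff.meth = "makediff3",
                           opt.meth = "quadprog", kernel = "add_kernel")
pred.hyb <- predict(object = survsvm.hyb,
                    newdata = veteran, subset = test.index)
```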
To evaluate the survival SVM models and our implementation, four publicly available survival data sets were used. The first is the data set veteran from the Veteran’s lung cancer trial study (Kalbfleisch and Prentice 2002), available in the package survival. It includes 137 patients from a randomized trial of two treatment regimens for lung cancer.
Table 1: Overview of the data sets, with the variables used as status and survival time.

| Data set | Sample size | #Covariates | Status | Survival time |
|---|---|---|---|---|
| veteran | 137 | 6 | status | time |
| leuk_cr | | | complete_rem | data_cr |
| leuk_death | | | status_last_fol_up | data_last_fol_up |
| GBSG2 | 686 | 8 | cens | time |
| MCLC | | | status | time |
For each data set, we fitted the four survival SVM models (3), (4), (5) and (6) using linear, additive and radial basis function (RBF) kernels. The ranking approach of survival SVMs implemented in the R package survpack was applied using linear and RBF kernels, the two kernels offered by the package. The Cox PH model, random survival forests (RSF) (Ishwaran et al. 2008) and gradient boosting (GBoost) for survival analysis (Ridgeway 1999) served as reference models.
For RSF, the package randomForestSRC (Ishwaran and Kogalur 2016) was used. The number of random variables for splitting and the minimal number of events in the terminal nodes were tuned when building survival trees. randomForestSRC refers to these parameters as mtry and nodesize, respectively. For gradient boosting models implemented in mboost (Hothorn et al. 2016), we fitted a PH model as base learner. The number of boosting iterations and the regression coefficient were tuned. In mboost, these parameters are named mstop and nu, respectively. Tuning was conducted using the package mlr.
Experiments were run on a high performance computing platform.
Table 2: Mean runtime in minutes of the survival SVM approaches.

| Data set | Kernel | vanbelle1 | vanbelle2 | SSVR | hybrid | evers |
|---|---|---|---|---|---|---|
| veteran | linear | | | | | |
| veteran | additive | | | | | |
| veteran | RBF | | | | | |
| leuk_cr | linear | | | | | |
| leuk_cr | additive | | | | | |
| leuk_cr | RBF | | | | | |
| leuk_death | linear | | | | | |
| leuk_death | additive | | | | | |
| leuk_death | RBF | | | | | |
| GBSG2 | linear | | | | | |
| GBSG2 | additive | | | | | |
| GBSG2 | RBF | | | | NA | NA |
| MCLC | linear | | | | | |
| MCLC | additive | | | | | |
| MCLC | RBF | | | | | |
Table 3 and Figure 2 present the performance estimates of the compared models, based on the concordance index.
Table 3: Prediction performance and ranks of the compared methods on the five data sets.

| Method | Kernel | veteran | leuk_cr | leuk_death | GBSG2 | MCLC |
|---|---|---|---|---|---|---|
| vanbelle1 | linear | | | | | |
| vanbelle1 | additive | | | | | |
| vanbelle1 | RBF | | | | | |
| vanbelle2 | linear | | | | | |
| vanbelle2 | additive | | | | | |
| vanbelle2 | RBF | | | | | |
| SSVR | linear | | | | | |
| SSVR | additive | | | | | |
| SSVR | RBF | | | | | |
| hybrid | linear | | | | | |
| hybrid | additive | | | | | |
| hybrid | RBF | | | | NA | |
| evers | linear | | | | | |
| evers | RBF | | | | NA | |
| PH Model | | | | | | |
| RSF | | | | | | |
| GBoost | | | | | | |
The quadprog optimizer was used in the package survivalsvm. Plots were generated using the ggplot2 (Wickham 2009) and tikzDevice (Sharpsteen et al. 2016) packages.

We presented the R package survivalsvm for fitting survival SVMs.
Three approaches are available in the package. First, in the regression approach, the classical SVR was extended to censored survival outcomes. Second, in the ranking approach, the concordance between predicted risk ranks and observed survival times is maximized for comparable pairs of observations. Third, the hybrid approach combines the regression and ranking constraints in a single model.
Comparing the ranking and regression based models, the evers approach always required more runtime than the approaches implemented in survivalsvm. Although the hybrid approach performed better than the other survival SVM approaches, its runtime was considerably larger. This is because the formulation of the hybrid approach requires two regularization parameters, while the ranking and regression approaches require only one.
The best performing kernel functions depended on the data set and the chosen SVM model. For the ranking approaches, the differences between kernels were larger than for the regression and hybrid approaches. For the hybrid approach, the additive and RBF kernels achieved the best results. However, the runtimes for the RBF kernel were substantially larger. Again, this was due to the tuning of an additional parameter.
Our implementation utilizes a quadratic programming solver and an interior point optimizer to solve the quadratic optimization problem derived from the primal support vector optimization problem. When quadratic programming is used with a kernel matrix that is not positive semi-definite, the matrix is slightly modified to the nearest positive semi-definite matrix. The interior point optimizer does not modify the original matrix, but is computationally slower since it is fully implemented in R. Pölsterl et al. (2015) proposed a fast algorithm to train survival SVMs in primal space. This algorithm is fast in low dimensions, but for high dimensional problems the authors recommend reducing the dimensionality before applying an SVM algorithm. However, special and fast algorithms such as sequential minimal optimization (SMO) (Platt 1998), which are available for classical SVM optimization problems, were shown to be more accurate (Horn et al. 2016). The implementation of survival SVMs could possibly be improved by an extension of the SMO optimization procedure.
Restricting the constraints to only the nearest neighbor in the ranking approach, as formulated in vanbelle1 and vanbelle2, can considerably improve computational performance, but can also reduce prediction performance. In principle, the number of nearest neighbors is not limited to the choices of Evers and Messow (2008) and Van Belle et al. (2008). Since the optimal number of neighbors depends on the data set and the available computational resources, it may be included as a further tuning parameter.
In conclusion, we have shown that SVMs are a useful alternative for survival prediction. The package survivalsvm provides a fast and easy-to-use implementation of the available approaches of survival SVMs. Our results show that the choice of the SVM model and the kernel function is crucial. In addition to prediction performance, runtime is an important aspect for large data sets. We recommend conducting benchmark experiments using several approaches and available kernels before analyzing a data set.
The package survivalsvm is available on CRAN and a development version at https://github.com/imbs-hl/survivalsvm. The presented examples and R code to reproduce all results in this paper are available at https://github.com/imbs-hl/survivalsvm-paper.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Fouodo, et al., "Support Vector Machines for Survival Analysis with R", The R Journal, 2018
BibTeX citation
@article{RJ-2018-005, author = {Fouodo, Césaire J. K. and König, Inke R. and Weihs, Claus and Ziegler, Andreas and Wright, Marvin N.}, title = {Support Vector Machines for Survival Analysis with R}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {412-423} }