SurvBoost: An R Package for High-Dimensional Variable Selection in the Stratified Proportional Hazards Model via Gradient Boosting

High-dimensional variable selection in the proportional hazards (PH) model has many successful applications in different areas. In practice, data may involve confounding variables that do not satisfy the PH assumption, in which case the stratified proportional hazards (SPH) model can be adopted to control the confounding effects by stratification of the confounding variable, without directly modeling the confounding effects. However, there is lack of computationally efficient statistical software for high-dimensional variable selection in the SPH model. In this work, an R package, SurvBoost, is developed to implement the gradient boosting algorithm for fitting the SPH model with high-dimensional covariate variables and other confounders. Extensive simulation studies demonstrate that in many scenarios SurvBoost can achieve a better selection accuracy and reduce computational time substantially compared to the existing R package that implements boosting algorithms without stratification. The proposed R package is also illustrated by an analysis of the gene expression data with survival outcome in The Cancer Genome Atlas (TCGA) study. In addition, a detailed hands-on tutorial for SurvBoost is provided.


Introduction
Variable selection for high-dimensional survival data has become increasingly important in a variety of research areas. One of the most popular methods is based on the proportional hazards (PH) model. Many penalized regression methods including adaptive lasso and elastic net have been proposed for the PH model (Tibshirani, 1997;Simon et al., 2011;Goeman, 2010). Alternatively, boosting described by Bühlmann and Yu (2010) has been adopted for variable selection in regression models and the PH model via gradient descent techniques. It can have a better variable selection accuracy compared with other methods in many scenarios. The R package mboost has been developed and become a powerful tool for variable selection and parameter estimation in complex parametric and nonparametric models via the boosting methods (Hothorn et al., 2017). It has been widely used in many applications.
However, in many biomedical studies, the collected data may involve confounding variables that do not satisfy the PH assumption. For example, in cancer research you may argue that gender effects are not proportional, but we are more interested in selecting genes as the important risk factors for cancer survival. The PH assumption can reasonably be imposed on modeling the gene effects but not for gender effects. In this case the stratified proportional hazards (SPH) models are needed. In particular, the data are often grouped into multiple strata according to confounding variables. The SPH model adjusts those confounding effects by fitting the Cox regression with different baseline hazards for different strata, while still assuming that the covariate effects of interest are the same across different strata and satisfy the proportional hazard assumption.
The SPH model has a wide range of applications for survival analysis, but no computationally efficient statistical software are available for high-dimensional variable selection in the SPH model. To fill this gap, we develop an R package, SurvBoost, to implement the gradient boosting algorithm for fitting the SPH model with high-dimensional covariates with adjusting confounding variables. SurvBoost implements the gradient descent algorithm for fitting both PH and SPH model. The algorithm for the PH model has been used for the additive Cox model in the mboost package, which cannot fit the SPH model to perform variable selection. The survival package is capable of performing model fitting for the SPH model, but does not implement variable selection desired in the high-dimensional setting. In our SurvBoost package, we optimize the implementations which can reduce 30%-50% computational time. Additional options are available in the SurvBoost package to determine an appropriate stopping criteria for the algorithm. Another useful function assists in selecting stratification variables, which may improve model fitting results.
The rest of the paper is organized as follows: In Section 2, we will provide a brief overview of the gradient boosting method for the SPH model along with the algorithm stopping criteria. In Section 3, we show that SurvBoost can achieve a better selection accuracy and reduce computational time substantially compared with mboost. In Section 4, we provide a detailed hands-on tutorial for SurvBoost. In Section 5, we illustrate the proposed R package on an analysis of the gene expression data with survival outcome in The Cancer Genome Atlas (TCGA) study.

Stratified proportional hazards model
The Cox proportional hazards model is effective for modeling survival outcomes in many applications. An important assumption underlying this model is a constant hazard ratio, meaning that the hazard for one individual is proportional to that of any other individual. This is a strong assumption for many applications. Thus, one useful adaptation to this model is relaxing the strict proportional hazards assumption; one approach is to allow the baseline hazard to differ by group across the observations. This is known as the stratified proportional hazards (SPH) model.
Suppose the dataset consists of n subjects. For i = 1, . . . , n, denote by T i the observed time of event or censoring for subject i, and δ i indicates whether or not an event occurred for subject i. Denote by G the total number of strata and by n g the number of subjects in stratum g. Let g i be the strata indicator for subject i. Suppose there are p potential covariate variables of our interest to select. For j = 1, . . . , p, let x ij be the covariate j for subject i. For stratum g = 1, . . . , G, the hazard of subject i at time t in stratum g becomes where h 0,g (t) is the baseline hazard function, X i,g is a vector of covariates and β is the regression coefficients of interest.
Allowing the baseline hazard to differ across strata allows flexibility often desired when proportional hazards is too strong. The SPH model can control effects of confounding variables through this stratification. The estimates of the effect of covariates remain constant across strata, so the model is still interpretable across all subjects.

Gradient boosting for SPH
The log partial likelihood of the SPH model is where β = (β 1 , . . . , β p ) , X i = (X i1 , . . . , X ip ) and R ig = { : T ≥ T i , g l = g} for all i with g i = g representing the set of at risk subjects in group g. We adopt the following gradient boosting algorithm to find the maximum partial likelihood estimate (MPLE). Let S kg (i, j) = ∑ ∈R ig X k j exp{X g β} for k = 0, 1, 2. Compute the first partial derivative with respect to j: Find j * = argmax j |L 1 (j)|.

8
Calculate the second partial derivative with respect to j * : Update β j * = β j * + υL 2 (j * ) −1 L 1 (j * ) 10 end 11 end Algorithm 1: Boosting gradient descent algorithm This algorithm updates variables one at a time, by selecting the variable which maximizes the first partial derivative. The number of iterations is important for ensuring a sufficient number of updates to the β estimates, in addition to selecting the true signals (He et al., 2016). Note that this algorithm is slightly different than the one implemented in the mboost package even in the unstratified case, which we will use for comparison in the simulation setting. Buhlmann and Hothorn's algorithm uses only the first derivative to update the estimated β values (Bühlmann and Hothorn, 2007).

Stopping criteria
Selection of the number of boosting iterations is important. Over-fitting can occur if the number of iterations is too large (Jiang, 2004). Additionally the stopping criteria is important for accuracy of the coefficient estimates, since each iteration contributes to updating the estimate of one coefficient. The algorithm is less sensitive to the step size (Bühlmann and Hothorn, 2007).
SurvBoost provides several options for optimizing the number of iterations including: k-fold cross validation, Bayesian information criteria, change in likelihood, or specifying the number of variables to select.
The Bayesian Information Criteria (BIC) is one approach for selecting the optimal number of boosting iterations.
where l j (θ j ) is the maximized likelihood for a model with p j selected variables and l 0 (θ 0 ) is the maximized likelihood for the reference model with p 0 selected variables. The number of uncensored events is d. Volinsky and Raftery (2000) argue that replacing the sample size, n, with d in the BIC calculation has better properties when dealing with censored survival models.
The extended BIC is also useful in high dimensional cases; this approach penalizes for greater complexity where ( p p j ) is the size of the class of models that model j belongs to, p is the total number of variables. The value of γ is fixed between 0 and 1, selected to penalize at the appropriate rate. Selecting 0 will reduce this to the standard BIC; EBIC and BIC are implemented jointly in the package using this connection to reduce from EBIC to BIC. AIC is available as well as a stopping criteria, although this information might not be as effective in the high dimensional setting.
Cross validation is another approach which may be used to determine the stopping point. The goodness of fit function is calculated as suggested by Simon et al. (2011). It is the log-partial likelihood of all the data using the optimal β determined with data excluding fold k (β −k ) minus the log-partial likelihood excluding fold k ( −k ) of the data with the same β.
Where m is the current number of iterations and k indicates the subset of data being excluded.
Change in likelihood is another approach incorporated in the package. This method stops iterating once a small change in likelihood, specified in the function, is reached.
Where α is a small constant. Default change in likelihood, used in simulations, is a change of 0.001.

Simulation studies
This section compares the variable selection performance to a competing R package, mboost (Hofner et al., 2014).

Stratified Data
Stratified data was simulated such that censoring rates were relatively constant across groups and the expected survival time differed by group. These assumptions mimic realistic settings such as those encountered with data grouped by hospital or facility.
For this simulation 1,500 observations were generated into ten strata; each strata had a different baseline hazard following a Weibull distribution. The Weibull distribution shape parameter was 3 for all strata, and the scale parameter varied across strata from e −1 to e −15 with ten evenly spaced intervals. There were 100 true signals among 4,000 variables with true magnitude of 2 or -2. There was uniform censoring from time 0 to 200. Fifty of these data sets were generated.
The following example demonstrates the importance of the stopping criteria. SurvBoost has five options for specifying the number of iterations as described in the methods section. Selecting an appropriate number of iterations depends on the goals of the analysis. For example, if the goal is to achieve high sensitivity cross validation or extended BIC may be the best approach.
The performance of different stopping criteria are compared based on several selection and estimation measures: sensitivity (Se), specificity (Sp), false discovery rate (FDR), and mean squared error (MSE). FDR is calculated as the ratio of false positives over the total number of selected variables. Sensitivity, specificity, and FDR aim to address the performance of the variable selection, while MSE aims to address the accuracy of the coefficient estimates.
This simulation presents the performance of SurvBoost compared to the R package mboost. The boosting algorithm implemented in mboost is very similar to that of SurvBoost but does not allow stratification. We will compare results between the two packages using only a fixed number of iterations as the stopping rule; mboost has K-fold cross validation available for some settings but no longer provides it for Cox PH models. All of the stopping methods implemented in SurvBoost are not available in mboost. The performance can be compared by measures such as sensitivity and mean squared error. Table 1 presents the results of 50 simulated data sets, comparing the boosting algorithm using several different stopping procedures and to results from mboost. In this simulation, mboost selects fewer variables on average resulting in fewer false positives and more false negatives. Additionally the mean squared error is slightly higher than that of all the SurvBoost options.
Runtime is also an important factor with this algorithm. Stratification speeds up the algorithm as seen in the first simulation. All runtimes were generated on a MacBook with 2.9GHz Intel Core i5 and 16GB memory.   To further demonstrate the importance of using the SPH model, we compared results of modeling with and without stratification for data simulated with ten strata. In this setting the true signal for all 100 variables is 0.75 to illustrate the performance with a smaller effect size.  From this example we can evaluate the importance of using the stratified model. This case demonstrates that when not stratifying by group the model is not as sensitive to the true signals, resulting in lower sensitivity. We also observe a larger or similar number of variables selected, meaning that there are a larger number of false positives when ignoring the stratification. Depending on the context, a larger number of false positives may be very undesirable. The algorithm implemented for the Cox PH model is slightly different than the one used in SurvBoost, which can be seen here by the difference in the two unstratified rows.

Unstratified Data
Another simulation was used to compare performance of our method to mboost when stratification is not necessary for appropriate modeling. Similarly to the stratified case, four thousand variables were generated for 1,500 observations but without stratification. The baseline hazard followed a Weibull distribution, with shape parameter equal to 5 and scale equal to exp −5 . The true β contained 100 true signals of magnitude 2 or -2 out of 4,000 variables.
We can observe in Table 2 that SurvBoost performs similarly to mboost under these conditions. mboost tends to select fewer variables than SurvBoost, so in this simulation mboost has fewer false positives and more false negatives compared to SurvBoost.

Illustration of package
This section provides a brief tutorial on how to use this package based on simulated data. In order to install the package, several other R packages must be installed. The code relies on Rcpp, RcppArmadillo, and RcppParallel in order to improve computational speed (Eddelbuettel et al., 2018a,b;Allaire et al., 2018). Additionally the survival package is used for simulation and post selection refitting for inference and will be required for installation of SurvBoost (Therneau, 2017). If working on a Windows machine, installing Rtools is also necessary. The following line of R code installs the package from CRAN.

Model fitting
The boosting_core() function requires similar inputs to the familiar coxph() function from the package survival.
boosting_core(formula, data = matrix(), rate = 0.01, num_iter = 500, ...) The input formula has the form Surv(time,death) ∼ variable1 + variable2. The input data is in matrix form or a data frame. Two additional parameters must be specified for the boosting algorithm: rate and num_iter. Rate is the step size in the algorithm, although choice of this may not impact the performance too significantly (Bühlmann and Hothorn, 2007), default value is set to 0.01. Selecting an appropriate number of iterations to run the algorithm will, however, have a greater impact on the results. The last input num_iter is used to determine the number of iterations to run the algorithm, default value is 500.

Simple example
We present a simple example demonstrating the convenience of using the package for stratified data. We simulate survival data for five strata with different constant baseline hazards.
Another feature of the package assists with determining variables to stratify on if this information is unknown. The function strata.boosting will print box plots and a summary table of the survival time grouped by splits in the specified variable. The variable can be categorical or continuous; if continuous, the function will split on the median value to demonstrate whether there appears to be a difference in the survival time distribution for the two groups. This information alone does not suggest that stratification on this variable is necessary. It is intended to be a tool to confirm if there are differences seen across groups, when stratification is anticipated to be necessary.

TCGA data example
Data from three breast cancer cohorts was used to demonstrate this method on data outside of the simulation framework. There were 578 patients included in the combined data, with 8,864 variables measured for each patient: 8,859 genes and 5 phenotypic variables. The phenotype variables included age at diagnosis, tumor size, cancer stage, progesterone-receptor status, and estrogen-receptor status. The data can be downloaded from The Cancer Genome Atlas (TCGA) (Naderi et al., 2006;Chin et al., 2006;Miller et al., 2005) or from GitHub using the following R code.
R > library(piggyback) R > pb_download("data.tcga.tsv.gz", repo = "EmilyLMorris/survBoost") R > data <-read_tsv("data.tcga.tsv") The patients were split into two cohorts depending on their cancer stage and tumor size. One cohort contained patients with a less severe prognosis, cancer stage of one and tumor size less than the median; the other cohort contained those with cancer stage greater than one and/or with a tumor larger than the median size.
Using stability selection (Meinshausen and Bühlmann, 2010), 14 variables were identified with selection frequencies greater than 50% from 50 iterations of subsampling. Age and progesterone-receptor status were selected in addition to 12 genes. The boosting algorithm was performed with the number of iterations fixed at the sample size of 578. Several of the genes selected in this analysis have been previously identified as having an association with breast cancer. Psoriasin (S100A7) has been associated with breast cancer (Al-Haddad et al., 1999). Several studies have found COL2A1 to be part of gene signatures for predicting tumor recurrence (Yu et al., 2007;Wang et al., 2005). Other genes selected that have been identified as part of a gene signature or association with breast cancer tumor progression risk include: ZIC1 (Boersma et al., 2008), CYP2B6 (Tozlu et al., 2006), ELF5 (Chakrabarti et al., 2012), IGJ (Boersma et al., 2008), DHRS2 (Krijgsman et al., 2012), and CEACAM5 (Blumenthal et al., 2007). Mboost using the same criteria but without a stratified model only identifies one gene of importance, MC2R, demonstrating the utility of the SPH model in this context.

Conclusion
In this article, we introduce a new R package SurvBoost which implements the gradient boosting algorithm for high-dimensional variable selection in the stratified proportional hazards (SPH) model, while most existing R packages, such as mboost only focus on the proportional hazards model. In the simulation studies, we show that SurvBoost can improve the model fitting and achieve better variable selection accuracy for the data with stratified structures. In addition, we optimize the implementations of the gradient boosting in both the SPH and the PH models. For the PH model fitting, SurvBoost can reduce about 30%-50% computational time compared to mboost. In the future, we plan to extend the package to handle more complex survival data such as left-truncation data and interval censoring data.