penPHcure: Variable Selection in Proportional Hazards Cure Model with Time-Varying Covariates

We describe the penPHcure R package, which implements the semiparametric proportionalhazards (PH) cure model of Sy and Taylor (2000) extended to time-varying covariates and the variable selection technique based on its SCAD-penalized likelihood proposed by Beretta and Heuchenne (2019a). In survival analysis, cure models are a useful tool when a fraction of the population is likely to be immune from the event of interest. They can separate the effects of certain factors on the probability of being susceptible and on the time until the occurrence of the event. Moreover, the penPHcure package allows the user to simulate data from a PH cure model, where the event-times are generated on a continuous scale from a piecewise exponential distribution conditional on time-varying covariates, with a method similar to Hendry (2014). We present the results of a simulation study to assess the finite sample performance of the methodology and illustrate the functionalities of the penPHcure package using criminal recidivism data.


Introduction
In contrast to other statistical methods, survival analysis models are designed to model the time to an event of interest (e.g. death or occurrence of a disease in medical studies). A typical feature of time-to-event data is the presence of right censoring, an incomplete information problem that arises when a subject is lost to follow-up or does not experience the event before the end of the study. In these cases, it is unknown whether the subject will eventually experience the event and when it will occur, given that it can occur. The most common assumption of standard survival analysis models is that the whole population will sooner or later experience the event of interest. However, in practice, this may not be the case because a fraction of the population may be immune (i.e. not susceptible) to this event. Cure models, also known as split population duration models or limited-failure population models, were developed to handle this kind of situations. They allow us to investigate the effects of some covariates (e.g. type of treatment, stage of the tumor, sex or age) on the probability to be susceptible to the event of interest (i.e. incidence) and on the survival time conditional on being susceptible (i.e. latency).
Originally, cure models were introduced in the medical literature by Boag (1949) and Berkson and Gage (1952), but they have been used in several other disciplines during the years. In reliability engineering, Meeker (1987) investigates the failure of solid-state electronic components (e.g. integrated circuits). In social science, Schmidt and Witte (1989) investigate the timing of return to prison for a sample of prison releases and they use it to make predictions of whether or not individuals return to prison. In finance, Cole and Gunther (1995) analyze the determinants of commercial bank failures in the United States; in credit scoring, Tong et al. (2012) predict defaults on a portfolio of UK personal loans. In political science, Svolik (2008) studies the likelihood that a democracy consolidates and the timing of authoritarian reversals in democracies that are not consolidated. In marketing, Polo et al. (2011) investigate the drivers of customer retention in a liberalizing market, using data for a sample of 650 consumers in the Spanish mobile phone industry. In the literature, several variants of cure models have been proposed (see Amico and Van Keilegom (2018) for a comprehensive survey), which belong to two main families: mixture cure models and promotion time cure models.
In this article we present the penPHcure package (Beretta and Heuchenne, 2019b), which implements the semi-parametric proportional-hazards (PH) mixture cure model of Sy and Taylor (2000) extended to time-varying covariates, where the incidence and latency distributions are modeled by a logistic regression and a Cox's PH model (Cox, 1972), respectively. The penPHcure package contains two main functions: penpHcure, to estimate the regression coefficients, their confidence intervals using the basic/percentile bootstrap method and to perform variable selection using the SCAD-penalized likelihood technique proposed by Beretta and Heuchenne (2019a); and penpHcure.simulate to simulate data from a PH cure model, where the event-times are generated on a continuous scale from a piecewise exponential distribution conditional on time-varying covariates, using a method similar to the one described in Hendry (2014).
At the time of writing this article, we are unaware of other R packages for estimation of semiparametric PH mixture cure models with time-varying covariates and, above all, that enable the user to perform variable selection. In the context of cure models for right censored data, available R packages include: the flexsurvcure package (Amdahl, 2019) for estimation of parametric mixture and non-mixture cure models with time-invariant covariates using time-to-event distributions from the flexsurv package (Jackson, 2016); the nltm package (Garibotti et al., 2019) for estimation of the semi-parametric PH cure model with time-invariant covariates of , as well as other nonlinear transformation models for analyzing survival data, using the method of Tsodikov (2003); and the smcure package (Cai et al., 2012) for estimation of the semi-parametric PH cure model and the accelerated failure time cure model with time-invariant covariates and the spduration package (Beger et al., 2018) that implements a parametric cure model with time-varying covariates using Weibull and Log-Logistic latency distributions. Compared to spduration, the penPHcure package has some advantages: the latency distribution is modeled by a more flexible semi-parametric Cox's PH model; the response variable, the time to the event of interest, is continuous; and, above all, it allows the user to simultaneously select variables and estimate their parameters using a variable selection technique based on SCAD penalties.
The remainder of this article is structured as follows: • Methodology. We present the PH cure model with time-varying covariates implemented in the penPHcure function when the argument pen.type is set equal to "none" (default); -Variable selection. We present the variable selection technique based on SCAD penalties implemented in the penPHcure function when pen.type=="SCAD"; -Data generation. We describe the algorithm implemented in the penPHcure.simulate function, which generates data from a PH cure model with time-varying covariates.
• Simulation study. We analyze the finite sample performance of the PH cure model estimates and its variable selection technique implemented in the penPHcure function.
• An application to Criminal Recidivism data. We provide an example of practical use of the penPHcure function to analyze a real data set.

Methodology
Let Y be a Bernoulli random variable indicating whether an individual is susceptible (Y = 1) or immune (Y = 0) to the event of interest, with probability p = P(Y = 1). Let T be the time to event, defined only when Y = 1. Assuming that a fraction of the population is immune to the event of interest, the marginal survival function of T is defined as where p is the incidence (i.e. probability of being susceptible) and S(t|Y = 1) is the latency (i.e. survival function conditional on being susceptible).
The incidence is modeled by a logistic regression model: , where x is a vector of time-fixed covariates (including the intercept) and b a vector of unknown coefficients. Whereas, the latency is modeled by a Cox's PH model: where z(t) is a vector of time-varying covariates (we denote byz(t) the full history of the covariates up to time t), β is a vector of unknown coefficients and h 0 (t) is an arbitrary baseline conditional hazard function.
.., n} denote the observed data, where t i is the event/censoring time and δ i is the censoring indicator, which takes value 1 if t i is uncensored and 0 otherwise. Since we know that y i = 1 when δ i = 1, but y i is unobserved when δ i = 0, we can estimate the unknown parameters θ = (b, β, h 0 ) using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The complete-data likelihood can be written as i.e. the product between the incidence component L 1 , depending on a set of time-fixed covariates x i , and the latency component L 2 , depending on a set of time-varying covariates z i (t).
Given some starting values θ (0) , the m-th iteration of the EM algorithm consists of two steps: E step. Compute the expectation of the complete-data likelihood with respect to the conditional distribution of the y i 's given the current parameter estimatesθ (m−1) and the observed data O.
This expectation is obtained replacing the y i 's in (1) by their expectation .
Note that we removed the dependence of the theoretical functions on the estimated parameters to simplify the notation.

M step.
Maximize the expected complete-data likelihood with respect to b, β and the function h 0 .
Given (1) is maximized using the Newton-Raphson method as in the classical logistic regression model. Whereas, the latency component L 2 in (1) is maximized using a profile likelihood approach. The latter involves two steps: (i) the baseline conditional hazard function is estimated nonparametrically bŷ where t (1) ≤ ... ≤ t (k) are the k ordered event times and R j is the risk set at t − (j) (i.e. the set of all individuals who did not experience the event of interest and have not been censored just prior to time t (j) ); and then (ii) the function h 0 in L 2 is replaced by its estimator given in (2) to obtain the following partial likelihood, which does not depend on the function h 0 anymore, ( 3) Finally, the latency component is estimated by maximizing (3) with respect to β. In case of tied event-times, (2) and (3) can be rewritten using the Breslow (1974) or Efron (1977) approximation as in the standard Cox's PH model.

Variable selection
When the number of available covariates is large, fitting all possible subsets to find the most relevant covariates would be too time consuming. Beretta and Heuchenne (2019a) proposed a regularization method based on the maximization of a penalized version of the complete-data log-likelihood where denotes a log-likelihood and p λ (.) a SCAD penalty function, which role is to shrink the small coefficients toward zero. We assume that the q 1 and q 2 covariates in the incidence and latency component, respectively, have been standardized, such that the coefficients in b and β are on the same scale. The Smoothly Clipped Absolute Deviation (SCAD) penalty function (Fan and Li, 2001) is given by for some a > 2 and λ > 0. As explained by Li (2001, 2002) in the context of linear regression, generalized linear models, Cox's PH model and frailty models, the SCAD estimator has three desirable properties: unbiasedness (do not penalize to much large coefficients), sparsity and continuity. Moreover, with a proper choice of the tuning parameters (λ, a), it also possesses what is known as the "oracle property", meaning that the SCAD estimator is asymptotically equivalent to the oracle estimator (i.e. an estimator with only the relevant variables with nonzero coefficients).
For fixed values of the tuning parameters (λ 1 , λ 2 , a 1 , a 2 ), estimates of θ can be obtained using an EM algorithm close to the one described in the previous section. The only difference lies in the M-step. Given that the penalized log-likelihoods P 1 (b; λ 1 ) and˜ P 2 (β; λ 2 ), where˜ P 2 (β; λ 2 ) is the logarithm of (3), are non-concave and non-differentiable at the origin, b and β are now estimated using the MM algorithm of Hunter and Li (2005) based on a perturbed Local Quadratic Approximation (LQA) of the penalty function. By default, when the argument pen.type of the function penPHcure is set equal to "SCAD", the initial values for b and β are vectors with all elements equal to zero. Otherwise, the user can specify other values (e.g. the estimated coefficients of the model with all covariates) using the argument SV.
Regarding the choice of the tuning parameters, following the suggestion of Fan and Li (2001) we keep a 1 = a 2 = 3.7 and, given a set of possible values for (λ 1 , λ 2 ), we select the ones that minimize the following Akaike (AIC) or Bayesian (BIC) Information Criteria: λ 1 ,λ 2 is the observed data log-likelihood evaluated at the penalized MLEθ λ 1 ,λ 2 and ν is the number of nonzero coefficients, identified as the number of coefficients with an absolute value greater than a given threshold (by default 10 −6 ).

Data generation
Let S = {s 1 , s 2 , ..., s J } be a partition of the time scale forming J + 1 intervals (0, s 1 ], (s 1 , s 2 ], ..., (s J−1 , s J ], (s J , ∞]. Define a vector of time-varying covariates piecewise constant in each interval: z(t) = z j , for t ∈ (s j−1 , s j ]. Consider a transformation g, such that g(0) = 0, g(t) is strictly increasing for t > 0 and g −1 (t) is differentiable. In the implementation of the penPHcure.simulate function we use g(t) = t 1/γ , where the parameter γ can be specified by the user via the argument gamma, which is equal to 1, by default. According to Hendry (2014), if we generate a random variable V as a piecewise exponential distribution with density function given by where λ j = exp(z j β) is the constant hazard in the interval (g −1 (s j−1 ), g −1 (s j )], then g(V) follows a Cox's PH model with time-varying covariates with a baseline hazard function given by h 0 (t) = d dt [g −1 (t)]. This method is part of the algorithm implemented in the penPHcure.simulate function to simulate data from a PH cure model with time-varying covariates (see Algorithm 1 for a detailed description).

Algorithm 1 Data generation: PH cure model with time-varying covariates
Require: N, sample size; S, partition of the time scale; g(t), variable transformation; b, incidence coefficients; β, latency coefficients. for i = 1, ..., N do 1. Generate a vector x i from an arbitrary distribution; 2. Generate y i from a Bernoulli distribution with probability p(
We consider 6 simulation settings, with different levels of censoring and proportions of nonsusceptible individuals (expressed as a fraction of the sample size), depending on different values of b 0 and λ C (see Table 1). For each of these settings, we generated 500 replications using the penPHcure.simulate function for different sample sizes N = {250, 500, 1000}. Then, for all simulated datasets we use the penPHcure function to (i) fit a standard PH cure model with all covariates (FULL), (ii) fit a standard PH cure model with the covariates associated to the non-zero coefficients only (ORA-CLE) and (iii) to perform variable selection using the regularization method with SCAD penalties and tuning parameters chosen according to the BIC criterion. The possible values of the tuning parameters (λ 1 , λ 2 ) are obtained with the function exp(seq(-6,-1,length.out = 10)), whereas (a 1 , a 2 ) are kept equal to 3.7. Furthermore, we use the coxph function in the survival package (Therneau, 2015) to fit the classical Cox's PH model with the covariates associated to the non-zero coefficients only (COX).

Censoring
Cure The performance is measured in terms of Mean Estimation Error (MEE) and average number of correct and incorrect zeros identified by the variable selection technique (SCAD). In particular, the estimation error for the incidence component is computed as E (p(x) − p 0 (x)) 2 , wherep(x) and p 0 (x) are the estimated and true probabilities of being susceptible. Whereas, the estimation error for the latency component is computed as E Ŝ (T|Y = 1) − S 0 (T|Y = 1) 2 , whereŜ(T|Y = 1) and S 0 (T|Y = 1) are the estimated and true survival functions conditional on being susceptible.
In Figures 1 and 2, we provide the MEEs for the incidence and latency components, respectively, while in Figure 3, we provide the average number of correct and incorrect zeros. From those figures, we can see that the PH cure model estimation and its variable selection technique implemented in the penPHcure function perform reasonably well. For an increase of the sample size or a decrease of the level of censoring, the MEE decreases and the number of correct (resp. incorrect) zeros converges to 4 (resp. 0). The MEEs of the ORACLE model are always the lowest ones, but we notice that the ones of the SCAD method tend towards them as the sample size increases. It is important to note that, for a fixed level of censoring, we observe higher MEEs in case of a lower fraction of cured individuals. The worst results are obtained in situations of high censoring and low cure rate, but it is enough to increase the sample size to obtain better results. This is an evidence of the fact that a cure model should always be applied on data with a sufficient number of non-susceptible individuals. Last, but not the least important, we note that the use of the classical Cox's PH model (COX) leads to very high errors. This was expected since the model is wrongly specified as it ignores the existence of cured subjects.
Finally, in Table 2, we also present the coverage probabilities of the estimated 95% confidence intervals for the ORACLE model using the basic and percentile bootstrap methods with 500 resamples. In most cases, the basic bootstrap method outperforms the percentile bootstrap method, especially for the lowest sample sizes, with coverage probabilities closer to the 95% nominal level.
The R code used to obtain the results in Figures 1 to 3 (resp. Table 2) are provided in Section 2 (resp. Section 3) of the supplementary material ('beretta-heuchenne.R'). Moreover, in the file 'berettaheuchenne-suppl.pdf', we provide a table with all the results contained in Figures 1 to 3.

An application to Criminal Recidivism data
In this section, we illustrate the use of the penPHcure R package using a Criminal Recidivism dataset, which contains a sample of 432 inmates released from Maryland state prisons and followed for one year after release (Rossi et al., 1980). The aim of this study was to investigate the relationship between the time to first arrest after release and some covariates observed during the follow-up period. In particular, to study the effect of providing financial aid at the moment of release. The original source of the data is the Rossi dataset in the RcmdrPlugin.survival package (Fox and Carvalho, 2012), which has been converted into counting process format and included in the penPHcure package.
• fin. Financial aid received after release: yes or no; • age. Age in years at the time of release; • race. Race of the individual: black or other; • wexp. Full-time work experience before incarceration: yes or no; • mar. Married at the time of release: yes or no; • paro. Released on parole: yes or no; • prio. Number of convictions prior to incarceration; • educ. Level of education: <=9th degree ("3"), 10th or 11th degree ("4") or >=12 degree ("5"); • emp. Working full time during the observed time interval: yes or no. This is the only variable which is varying over time (e.g. the individual with id = 2 did not work full time during the first 9 weeks after release, then he did for 5 weeks and, finally, he has been arrested after 3 weeks without working full time).
Using the penPHcure function, by default, we can fit the standard PH cure model. First, we use a formula object with the response on the left of the tilde operator and the explanatory variables to be included in the latency component on the right. The response is a survival object returned by the Surv(tstart,tstop,arrest) function. Then, using the argument cureform, we specify the explanatory variables to be included in the incidence component. By default, these covariates are set equal to the last observation, but in this case, we set the argument which.X = "mean" to compute the time-weighted average over the full-history. Finally, setting the argument inference = TRUE, we conduct inference about the parameter estimates via bootstrapping (by default, 100 bootstrap resamples). The user can increase/decrease the number of bootstrap resamples with the argument nboot.
> set.seed(123) # for reproducibility > fit <-penPHcure(Surv(tstart,tstop,arrest)~fin+age+race+wexp+mar+paro+prio+educ+emp, + cureform =~fin+age+race+wexp+mar+paro+prio+educ+emp, + data = cpRossi,which.X = "mean",inference = TRUE) Initializing PH cure model with time-varying covariates... This call to the penPHcure function returned an object of class "PHcure" and we can print a summary of the results using the summary method. By default, confidence intervals are computed using the basic bootstrap method (the alternative is percentile bootstrap) and a confidence level of 95%. In order to control these features, the user can provide the arguments conf.int and conf.int.level, respectively. We now perform variable selection with the proposed SCAD-penalized likelihood method to check whether other covariates may be relevant to explain incidence and latency. First, we specify the possible values of the tuning parameters (using the argument pen.tuneGrid) and we set the starting values equal to the coefficient estimates from the unpenalized model (using the argument SV). Then, we still use the penPHcure, but we now include the argument pen.type = "SCAD". > pen.tuneGrid <-list(CURE = list(lambda = seq(0.01,0.12,by=0.01), + a = 3.7), + SURV = list(lambda = seq(0.01,0.12,by=0.01), + a = 3.7)) > SV <-list(b=fit$b,beta=fit$beta) > tuneSCAD <-penPHcure(Surv(tstart,tstop,arrest)~fin+age+race+wexp+mar+paro+prio+educ+emp, + cureform =~fin+age+race+wexp+mar+paro+prio+educ+emp, + data = cpRossi,which.X = "mean",pen.type = "SCAD", + pen.tuneGrid = pen.tuneGrid,SV = SV) Initializing PH cure model with time-varying covariates... This time the call to the penPHcure function returned an object of class "penPHcure". We can print a summary of the results using the summary method and, by default, the fitted model with the lowest BIC criterion is returned. The Bayesian Information Criterion is minimized for λ 1 = 0.09 and λ 1 = 0.05. In this case, the covariate age is selected in the incidence component. The negative sign of the estimated coefficient implies that younger individuals are more susceptible to be rearrested. The covariates prio and emp are selected in the latency component. The positive sign of the estimated coefficient (prio) implies that a higher number of convictions prior to incarceration increases the risk of being rearrested (among the individuals susceptible to be rearrested). The Akaike Information Criterion is minimized for λ 1 = 0.06 and λ 1 = 0.03. As expected the AIC criterion selected a less penalized and more complex model. In the incidence component, also the covariates fin and educ have been selected. The negative signs imply that individuals who received financial aid or with a high level of education (>=12 degree) are less susceptible to be rearrested. In the latency component, also the covariate race has been selected. The negative coefficient implies that individuals of race other than black have a lower risk of being rearrested (among the individuals susceptible to be rearrested).

Conclusion
In survival analysis studies, it may be the case that a fraction of the population is likely to be not susceptible to the event of interest. In this article, we presented the penPHcure R package, which implements the semi-parametric proportional-hazards (PH) cure model of Sy and Taylor (2000) extended to time-varying covariates. This model can measure the effects of some covariates on the probability to be susceptible and on the time until the occurrence of the event. The penPHcure package is composed of two main functions: penpHcure, to estimate the regression coefficients, their confidence intervals using the basic/percentile bootstrap method and to perform variable selection using the SCAD-penalized likelihood technique proposed by Beretta and Heuchenne (2019a); and penpHcure.simulate to simulate data from a PH cure model with time-dependent covariates. We first explained the methodology behind these functions and presented the results of a simulation study to assess its finite-sample performance. Then, we illustrated the use of the penPHcure function through an example based on the Criminal Recidivism dataset.

Availability
The latest release and a development version of the penPHcure package are respectively avilable on CRAN and at https://github.com/a-beretta/penPHcure.