We describe the penPHcure R package, which implements the semiparametric proportional-hazards (PH) cure model of Sy and Taylor (2000) extended to time-varying covariates and the variable selection technique based on its SCAD-penalized likelihood proposed by Beretta and Heuchenne (2019b). In survival analysis, cure models are a useful tool when a fraction of the population is likely to be immune from the event of interest. They can separate the effects of certain factors on the probability of being susceptible and on the time until the occurrence of the event. Moreover, the penPHcure package allows the user to simulate data from a PH cure model, where the event-times are generated on a continuous scale from a piecewise exponential distribution conditional on time-varying covariates, with a method similar to Hendry (2014). We present the results of a simulation study to assess the finite sample performance of the methodology and illustrate the functionalities of the penPHcure package using criminal recidivism data.
In contrast to other statistical methods, survival analysis models are designed to model the time to an event of interest (e.g., death or occurrence of a disease in medical studies). A typical feature of time-to-event data is the presence of right censoring, an incomplete information problem that arises when a subject is lost to follow-up or does not experience the event before the end of the study. In these cases, it is unknown whether the subject will eventually experience the event and when it will occur, given that it can occur. The most common assumption of standard survival analysis models is that the whole population will sooner or later experience the event of interest. However, in practice, this may not be the case because a fraction of the population may be immune (i.e., not susceptible) to this event. Cure models, also known as split population duration models or limited-failure population models, were developed to handle this kind of situation. They allow us to investigate the effects of some covariates (e.g., type of treatment, stage of the tumor, sex, or age) on the probability to be susceptible to the event of interest (i.e., incidence), and on the survival time conditional on being susceptible (i.e., latency).
Originally, cure models were introduced in the medical literature by Boag (1949) and Berkson and Gage (1952), but they have been used in several other disciplines during the years. In reliability engineering, Meeker (1987) investigates the failure of solid-state electronic components (e.g., integrated circuits). In social science, Schmidt and Witte (1989) investigate the timing of return to prison for a sample of prison releases, and they use it to make predictions of whether or not individuals return to prison. In finance, Cole and Gunther (1995) analyze the determinants of commercial bank failures in the United States; in credit scoring, Tong et al. (2012) predict defaults on a portfolio of UK personal loans. In political science, Svolik (2008) studies the likelihood that a democracy consolidates and the timing of authoritarian reversals in democracies that are not consolidated. In marketing, Polo et al. (2011) investigate the drivers of customer retention in a liberalizing market, using data for a sample of 650 consumers in the Spanish mobile phone industry. In the literature, several variants of cure models have been proposed (see Amico and Van Keilegom (2018) for a comprehensive survey), which belong to two main families: mixture cure models and promotion time cure models.
In this article, we present the penPHcure package (Beretta and Heuchenne 2019a), which
implements the semiparametric proportional-hazards (PH) mixture cure
model of Sy and Taylor (2000) extended to time-varying covariates, where the
incidence and latency distributions are modeled by a logistic regression
and a Cox’s PH model (Cox 1972), respectively. The penPHcure package
contains two main functions: penpHcure
, to estimate the regression
coefficients, their confidence intervals using the basic/percentile
bootstrap method, and to perform variable selection using the
SCAD-penalized likelihood technique proposed by Beretta and Heuchenne (2019b);
and penpHcure.simulate
to simulate data from a PH cure model, where
the event-times are generated on a continuous scale from a piecewise
exponential distribution conditional on time-varying covariates, using a
method similar to the one described in Hendry (2014).
At the time of writing this article, we are unaware of other R packages
for estimation of semiparametric PH mixture cure models with
time-varying covariates and, above all, that enable the user to perform
variable selection. In the context of cure models for right-censored
data, available R packages include: the
flexsurvcure
package (Amdahl 2019) for estimation of parametric mixture and
non-mixture cure models with time-invariant covariates using
time-to-event distributions from the
flexsurv package
(Jackson 2016); the nltm
package (Garibotti et al. 2019) for estimation of the semiparametric PH cure model with
time-invariant covariates of Tsodikov et al. (2003), as well as other
nonlinear transformation models for analyzing survival data using the
method of Tsodikov (2003); and the
smcure package (Cai et al. 2012)
for estimation of the semiparametric PH cure model and the accelerated
failure time cure model with time-invariant covariates and the
spduration package
(Beger et al. 2018) that implements a parametric cure model with time-varying
covariates using Weibull and Log-Logistic latency distributions.
Compared to spduration
, the penPHcure package has some advantages:
the latency distribution is modeled by a more flexible semiparametric
Cox’s PH model; the response variable and the time to the event of
interest are continuous; and, above all, it allows the user to
simultaneously select variables and estimate their parameters using a
variable selection technique based on SCAD penalties.
The remainder of this article is structured as follows:
penPHcure
function when the argument
pen.type
is set equal to "none"
(default);
penPHcure
function when pen.type=="SCAD"
;penPHcure.simulate
function, which generates data from a PH
cure model with time-varying covariates.penPHcure
function.penPHcure
function to analyze a real data
set.Let
The incidence is modeled by a logistic regression model:
Let
Given some starting values
Compute the expectation of the complete-data likelihood with respect
to the conditional distribution of the
Maximize the expected complete-data likelihood with respect to
The EM algorithm terminates whenever
When the number of available covariates is large, fitting all possible subsets to find the most relevant covariates would be too time consuming. Beretta and Heuchenne (2019b) proposed a regularization method based on the maximization of a penalized version of the complete-data log-likelihood
For fixed values of the tuning parameters
pen.type
of the function penPHcure
is
set equal to "SCAD"
, the initial values for SV
.
Regarding the choice of the tuning parameters, following the suggestion
of Fan and Li (2001), we keep
Let penPHcure.simulate
function, we use gamma
, which by default is
equal to 1. According to Hendry (2014), if we generate a random variable
penPHcure.simulate
function to simulate data from a PH cure model with time-varying
covariates (see Table 1 for a detailed description).
Require: |
for |
|
if |
else |
end if |
end for |
Return:
|
In this section, we present the results of a simulation study conducted
to assess the finite sample performance of the PH cure model estimation
and its variable selection technique implemented in the penPHcure
function. The event-times follow a Cox’s PH model with baseline hazard
function
We consider 6 simulation settings, with different levels of censoring
and proportions of non-susceptible individuals (expressed as a fraction
of the sample size), depending on different values of penPHcure.simulate
function for different sample sizes penPHcure
function to (i) fit a
standard PH cure model with all covariates (FULL), (ii) fit a standard
PH cure model with the covariates associated to the non-zero
coefficients only (ORACLE), and (iii) to perform variable selection
using the regularization method with SCAD penalties and tuning
parameters chosen according to the BIC criterion. The possible values of
the tuning parameters exp(seq(-6, -1, length.out = 10))
, whereas coxph
function in the
survival package
(Therneau 2015) to fit the classical Cox’s PH model with the covariates
associated to the non-zero coefficients only (COX).
Censoring | Cure | ||
---|---|---|---|
Low (40%) | High (30%) | 0.02 | 1.45 |
Low (40%) | Medium (20%) | 0.3 | 2.35 |
Medium (60%) | High (45%) | 0.35 | 0.35 |
Medium (60%) | Medium (30%) | 0.75 | 1.45 |
High (80%) | High (60%) | 0.95 | -0.7 |
High (80%) | Medium (40%) | 1.55 | 0.7 |
The performance is measured in terms of Mean Estimation Error (MEE) and
average number of correct and incorrect zeros identified by the variable
selection technique (SCAD). In particular, the estimation error for the
incidence component is computed as
In 1,2, we provide the
MEEs for the incidence and latency components, respectively, while in
3, we provide the average number of correct and
incorrect zeros. From those figures, we can see that the PH cure model
estimation and its variable selection technique implemented in the
penPHcure
function perform reasonably well. For an increase of the
sample size or a decrease of the level of censoring, the MEE decreases,
and the number of correct (resp. incorrect) zeros converges to 4 (resp.
0). The MEEs of the ORACLE model are always the lowest ones, but we
notice that the ones of the SCAD method tend towards them as the sample
size increases. It is important to note that, for a fixed level of
censoring, we observe higher MEEs in the case of a lower fraction of
cured individuals. The worst results are obtained in situations of high
censoring and low cure rates, but it is enough to increase the sample
size to obtain better results. This is evidence of the fact that a cure
model should always be applied to data with a sufficient number of
non-susceptible individuals. Last but not the least important, we note
that the use of the classical Cox’s PH model (COX) leads to very high
errors. This was expected since the model is wrongly specified as it
ignores the existence of cured subjects.
Finally, in Table 3, we also present the coverage probabilities of the estimated 95% confidence intervals for the ORACLE model using the basic and percentile bootstrap methods with 500 resamples. In most cases, the basic bootstrap method outperforms the percentile bootstrap method, especially for the smallest sample sizes, with coverage probabilities closer to the 95% nominal level.
The R code used to obtain the results in
1,2,3
(resp. Table 3) are provided in Section 2
(resp. Section 3) of the supplementary material (beretta-heuchenne.R
).
Moreover, in the file beretta-heuchenne-suppl.pdf
, we provide a table
with all the results contained in
1,2,3.
cens | cure | N | Method | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.40 | 0.30 | 250 | basic | 0.97 | 0.97 | 0.98 | 0.97 | 0.97 | 0.96 | 0.97 | 0.96 | 0.94 |
perc | 0.9 | 0.91 | 0.94 | 0.91 | 0.94 | 0.93 | 0.94 | 0.95 | 0.94 | |||
500 | basic | 0.97 | 0.95 | 0.95 | 0.94 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | ||
perc | 0.91 | 0.91 | 0.94 | 0.92 | 0.94 | 0.94 | 0.91 | 0.95 | 0.95 | |||
1000 | basic | 0.95 | 0.96 | 0.95 | 0.97 | 0.97 | 0.93 | 0.95 | 0.95 | 0.96 | ||
perc | 0.94 | 0.94 | 0.93 | 0.95 | 0.97 | 0.93 | 0.95 | 0.95 | 0.95 | |||
0.40 | 0.20 | 250 | basic | 0.95 | 0.96 | 0.99 | 0.96 | 0.97 | 0.96 | 0.97 | 0.97 | 0.96 |
perc | 0.89 | 0.9 | 0.94 | 0.91 | 0.93 | 0.94 | 0.96 | 0.96 | 0.96 | |||
500 | basic | 0.98 | 0.97 | 0.97 | 0.97 | 0.98 | 0.95 | 0.96 | 0.96 | 0.96 | ||
perc | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.93 | 0.93 | |||
1000 | basic | 0.97 | 0.96 | 0.96 | 0.95 | 0.95 | 0.94 | 0.93 | 0.96 | 0.95 | ||
perc | 0.95 | 0.94 | 0.94 | 0.93 | 0.94 | 0.94 | 0.94 | 0.96 | 0.95 | |||
0.60 | 0.45 | 250 | basic | 0.98 | 0.95 | 0.96 | 0.96 | 0.97 | 0.98 | 0.98 | 0.97 | 0.97 |
perc | 0.95 | 0.91 | 0.93 | 0.93 | 0.94 | 0.93 | 0.94 | 0.94 | 0.94 | |||
500 | basic | 0.98 | 0.95 | 0.94 | 0.97 | 0.96 | 0.94 | 0.94 | 0.96 | 0.95 | ||
perc | 0.96 | 0.93 | 0.94 | 0.92 | 0.93 | 0.94 | 0.92 | 0.95 | 0.95 | |||
1000 | basic | 0.95 | 0.96 | 0.96 | 0.95 | 0.95 | 0.95 | 0.94 | 0.93 | 0.95 | ||
perc | 0.95 | 0.94 | 0.94 | 0.93 | 0.94 | 0.94 | 0.92 | 0.94 | 0.93 | |||
0.60 | 0.30 | 250 | basic | 0.96 | 0.94 | 0.98 | 0.94 | 0.99 | 0.97 | 0.96 | 0.96 | 0.96 |
perc | 0.89 | 0.9 | 0.91 | 0.89 | 0.93 | 0.94 | 0.94 | 0.95 | 0.95 | |||
500 | basic | 0.96 | 0.97 | 0.98 | 0.95 | 0.97 | 0.96 | 0.96 | 0.95 | 0.95 | ||
perc | 0.94 | 0.91 | 0.94 | 0.93 | 0.93 | 0.95 | 0.96 | 0.95 | 0.95 | |||
1000 | basic | 0.96 | 0.96 | 0.96 | 0.95 | 0.96 | 0.96 | 0.95 | 0.95 | 0.96 | ||
perc | 0.95 | 0.94 | 0.94 | 0.92 | 0.94 | 0.94 | 0.94 | 0.94 | 0.95 | |||
0.80 | 0.60 | 250 | basic | 0.99 | 0.97 | 0.98 | 0.97 | 0.97 | 0.97 | 0.96 | 0.97 | 0.97 |
perc | 0.94 | 0.91 | 0.92 | 0.91 | 0.92 | 0.94 | 0.92 | 0.96 | 0.94 | |||
500 | basic | 0.98 | 0.95 | 0.97 | 0.95 | 0.97 | 0.97 | 0.96 | 0.98 | 0.98 | ||
perc | 0.94 | 0.93 | 0.95 | 0.93 | 0.94 | 0.95 | 0.94 | 0.95 | 0.96 | |||
1000 | basic | 0.96 | 0.95 | 0.96 | 0.94 | 0.96 | 0.96 | 0.95 | 0.94 | 0.96 | ||
perc | 0.93 | 0.95 | 0.94 | 0.93 | 0.95 | 0.95 | 0.94 | 0.93 | 0.93 | |||
0.80 | 0.40 | 250 | basic | 0.98 | 0.95 | 0.99 | 0.98 | 1 | 0.97 | 0.96 | 0.99 | 0.99 |
perc | 0.92 | 0.88 | 0.91 | 0.91 | 0.93 | 0.95 | 0.93 | 0.97 | 0.95 | |||
500 | basic | 0.97 | 0.95 | 0.97 | 0.96 | 0.98 | 0.96 | 0.96 | 0.97 | 0.96 | ||
perc | 0.94 | 0.9 | 0.93 | 0.91 | 0.93 | 0.94 | 0.94 | 0.96 | 0.94 | |||
1000 | basic | 0.96 | 0.96 | 0.97 | 0.95 | 0.95 | 0.98 | 0.95 | 0.93 | 0.96 | ||
perc | 0.96 | 0.95 | 0.95 | 0.94 | 0.93 | 0.94 | 0.94 | 0.93 | 0.93 |
In this section, we illustrate the use of the penPHcure R package
using a Criminal Recidivism dataset, which contains a sample of 432
inmates released from Maryland state prisons and followed for one year
after release (Rossi et al. 1980). The aim of this study was to investigate
the relationship between the time to first arrest after release and some
covariates observed during the follow-up period. In particular, to study
the effect of providing financial aid at the moment of release. The
original source of the data is the Rossi
dataset in the
RcmdrPlugin.survival
package (Fox and Carvalho 2012), which has been converted into a counting
process format and included in the penPHcure package.
Let us load and illustrate the dataset:
> library(penPHcure)
> data("cpRossi", package = "penPHcure")
> head(cpRossi)
id tstart tstop arrest fin age race wexp mar paro prio educ emp
1 1 0 20 yes no 27 black no no yes 3 3 no
2 2 0 9 no no 18 black no no yes 8 4 no
3 2 9 14 no no 18 black no no yes 8 4 yes
4 2 14 17 yes no 18 black no no yes 8 4 no
5 3 0 16 no no 19 other yes no yes 13 3 no
6 3 16 17 no no 19 other yes no yes 13 3 yes
> str(cpRossi)
'data.frame': 1405 obs. of 13 variables:
$ id : int 1 2 2 2 3 3 3 4 4 4 ...
$ tstart: int 0 0 9 14 0 16 17 0 4 21 ...
$ tstop : int 20 9 14 17 16 17 25 4 21 31 ...
$ arrest: Factor w/ 2 levels "no","yes": 2 1 1 2 1 1 2 1 1 1 ...
$ fin : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 2 ...
$ age : int 27 18 18 18 19 19 19 23 23 23 ...
$ race : Factor w/ 2 levels "black","other": 1 1 1 1 2 2 2 1 1 1 ...
$ wexp : Factor w/ 2 levels "no","yes": 1 1 1 1 2 2 2 2 2 2 ...
$ mar : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 1 1 1 ...
$ paro : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ prio : int 3 8 8 8 13 13 13 1 1 1 ...
$ educ : Factor w/ 3 levels "3","4","5": 1 2 2 2 1 1 1 3 3 3 ...
$ emp : Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 2 1 ...
The object cpRossi
is a data.frame
in counting process format with
1405 observations for 432 individuals on 13 variables. The id
variable
provides the unique identification number for every individual in the
study. The variables tstart
and tstop
denote the time interval of
the observation (measured in weeks). The variable arrest
denotes
whether the individual has been arrested during the 1-year follow-up
period. The remaining explanatory variables are described hereafter.
fin
. Financial aid received after release: yes
or no
;age
. Age in years at the time of release;race
. Race of the individual: black
or other
;wexp
. Full-time work experience before incarceration: yes
or
no
;mar
. Married at the time of release: yes
or no
;paro
. Released on parole: yes
or no
;prio
. Number of convictions prior to incarceration;educ
. Level of education: <=9th degree (“3”), 10th or 11th degree
(“4”), or >=12th degree (“5”);emp
. Working full time during the observed time interval: yes
or
no
. This is the only variable which is varying over time (e.g.,
the individual with id = 2
did not work full time during the first
9 weeks after release, then he did for 5 weeks, and, finally, he has
been arrested after 3 weeks without working full time).Using the penPHcure
function, by default, we can fit the standard PH
cure model. First, we use a formula object with the response on the left
of the tilde operator and the explanatory variables to be included in
the latency component on the right. The response is a survival object
returned by the Surv(tstart,tstop,arrest)
function. Then, using the
argument cureform
, we specify the explanatory variables to be included
in the incidence component. By default, these covariates are set equal
to the last observation, but in this case, we set the argument
which.X = "mean"
to compute the time-weighted average over the full
history. Finally, setting the argument inference = TRUE
, we conduct
inference about the parameter estimates via bootstrapping (by default,
100 bootstrap resamples). The user can increase/decrease the number of
bootstrap resamples with the argument nboot
.
> set.seed(123) # for reproducibility
> fit <- penPHcure(Surv(tstart,tstop,arrest)~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ cureform = ~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ data = cpRossi,which.X = "mean",inference = TRUE)
Initializing PH cure model with time-varying covariates...
Number of individuals: 432
Censoring proportion: 0.736
Tied failure times: TRUE
Number of unique failure times: 49
Number of covariates in the survival component: 10
Number of covariates in the cure component: 10
Checking starting values...
Fitting standard PH cure model with time-varying covariates... Please wait...
Performing inference via bootstrapping... Please wait ...
|==============================================================================| 100%
DONE!
This call to the penPHcure
function returned an object of class
"PHcure"
, and we can print a summary of the results using the
summary
method. By default, confidence intervals are computed using
the basic bootstrap method (the alternative is percentile bootstrap) and
a confidence level of 95%. In order to control these features, the user
can provide the arguments conf.int
and conf.int.level
, respectively.
> summary(fit)
------------------------------------------------------
+++ PH cure model with time-varying covariates +++
------------------------------------------------------
Sample size: 432
Censoring proportion: 0.7361111
Number of unique event times: 49
Tied failure times: TRUE
log-likelihood: -643.65
------------------------------------------------------
+++ Cure (incidence) coefficient estimates +++
+++ and 95% confidence intervals * +++
------------------------------------------------------
Estimate 2.5% 97.5%
(Intercept) 1.136709 -34.041743 9.769052
finyes -0.455199 -2.299870 12.188817
age -0.067715 -0.382429 0.413001
raceother -0.100950 -2.988104 35.336024
wexpyes 0.251663 -3.257339 2.193371
marno 0.261947 -15.041406 35.102574
paroyes -0.041289 -3.156395 1.659637
prio 0.068443 -0.285553 0.237089
educ4 -0.570782 -2.614311 2.532353
educ5 -1.163257 -39.473732 34.360612
empyes -0.860659 -3.299354 1.216299
------------------------------------------------------
+++ Survival (latency) coefficient estimates +++
+++ and 95% confidence intervals * +++
------------------------------------------------------
Estimate 2.5% 97.5%
finyes 0.062630 -1.427436 1.446067
age 0.046192 -0.067209 0.176043
raceother -0.759985 -2.770654 0.720247
wexpyes -0.552549 -1.672866 0.576657
marno 0.123655 -2.327914 1.600195
paroyes 0.040388 -0.816177 1.110058
prio 0.048407 -0.107942 0.195763
educ4 0.588156 -0.545885 1.881803
educ5 0.838098 -2.527512 5.118107
empyes -1.431782 -1.980471 -0.781978
------------------------------------------------------
* Confidence intervals computed by the basic
bootstrap method, with 100 replications.
------------------------------------------------------
As you can see, only one covariate (emp
) in the latency component is
statistically significant (the 95% confidence interval does not include
zero). The negative sign of the estimated coefficient implies that the
individuals working full time after release have a lower risk of being
rearrested (among the individuals susceptible to be rearrested). The
lack of significance of the other covariates might be explained by the
small sample size, the high level of censoring (only 114 out of 432
individuals have been rearrested), or by potential confounding factors.
We now perform variable selection with the proposed SCAD-penalized
likelihood method to check whether other covariates may be relevant to
explain incidence and latency. First, we specify the possible values of
the tuning parameters (using the argument pen.tuneGrid
) and set the
starting values equal to the coefficient estimates from the unpenalized
model (using the argument SV
). Then, we still use the penPHcure
function, but we now include the argument pen.type = "SCAD"
.
> pen.tuneGrid <- list(CURE = list(lambda = seq(0.01,0.12,by=0.01),
+ a = 3.7),
+ SURV = list(lambda = seq(0.01,0.12,by=0.01),
+ a = 3.7))
> SV <- list(b=fit$b,beta=fit$beta)
> tuneSCAD <- penPHcure(Surv(tstart,tstop,arrest)~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ cureform = ~fin+age+race+wexp+mar+paro+prio+educ+emp,
+ data = cpRossi,which.X = "mean",pen.type = "SCAD",
+ pen.tuneGrid = pen.tuneGrid,SV = SV)
Initializing PH cure model with time-varying covariates...
Number of individuals: 432
Censoring proportion: 0.736
Tied failure times: TRUE
Number of unique failure times: 49
Number of covariates in the survival component: 10
Number of covariates in the cure component: 10
Checking starting values...
Tuning SCAD-penalized PH cure model with time-varying covariates... Please wait...
iter aCURE aSURV lambdaCURE lambdaSURV AIC BIC df
1 3.70 3.70 0.01 0.01 1319.1625 1384.2573 16
2 3.70 3.70 0.01 0.02 1319.1625 1384.2573 16
3 3.70 3.70 0.01 0.03 1316.0665 1360.8192 11
4 3.70 3.70 0.01 0.04 1318.0458 1358.7300 10
5 3.70 3.70 0.01 0.05 1318.0457 1358.7300 10
... ... (omitted rows) ... ... (omitted rows) ... ...
140 3.70 3.70 0.12 0.08 1325.5349 1333.6718 2
141 3.70 3.70 0.12 0.09 1325.5349 1333.6718 2
142 3.70 3.70 0.12 0.10 1325.5349 1333.6718 2
143 3.70 3.70 0.12 0.11 1325.5349 1333.6718 2
144 3.70 3.70 0.12 0.12 1325.5349 1333.6718 2
DONE!
This time, the call to the penPHcure
function returned an object of
class "penPHcure"
. We can print a summary of the results using the
summary
method, and, by default, the fitted model with the lowest BIC
criterion is returned.
> summary(tuneSCAD)
------------------------------------------------------
+++ PH cure model with time-varying covariates +++
+++ [ Variable selection ] +++
------------------------------------------------------
Sample size: 432
Censoring proportion: 0.7361111
Number of unique event times: 49
Tied failure times: TRUE
Penalty type: SCAD
Selection criterion: BIC
------------------------------------------------------
+++ Tuning parameters +++
------------------------------------------------------
Cure (incidence) --- lambda: 0.09
a: 3.7
Survival (latency) - lambda: 0.05
a: 3.7
BIC = 1329.481
------------------------------------------------------
+++ Cure (incidence) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate
(Intercept) 1.776907
age -0.076498
------------------------------------------------------
+++ Survival (latency) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate
prio 0.101202
empyes -1.537286
The Bayesian Information Criterion is minimized for age
is selected in the
incidence component. The negative sign of the estimated coefficient
implies that younger individuals are more susceptible to be rearrested.
The covariates prio
and emp
are selected in the latency component.
The positive sign of the estimated coefficient (prio
) implies that a
higher number of convictions prior to incarceration increases the risk
of being rearrested (among the individuals susceptible to be
rearrested).
Let us now have a look at the fitted model with the lowest AIC criterion:
> summary(tuneSCAD,crit.type = "AIC")
------------------------------------------------------
+++ PH cure model with time-varying covariates +++
+++ [ Variable selection ] +++
------------------------------------------------------
Sample size: 432
Censoring proportion: 0.7361111
Number of unique event times: 49
Tied failure times: TRUE
Penalty type: SCAD
Selection criterion: AIC
------------------------------------------------------
+++ Tuning parameters +++
------------------------------------------------------
Cure (incidence) --- lambda: 0.06
a: 3.7
Survival (latency) - lambda: 0.03
a: 3.7
AIC = 1310.79
------------------------------------------------------
+++ Cure (incidence) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate
(Intercept) 1.829260
finyes -0.585638
age -0.067130
educ5 -0.887636
------------------------------------------------------
+++ Survival (latency) +++
+++ [ Coefficients of selected covariates ] +++
------------------------------------------------------
Estimate
raceother -0.586626
prio 0.103746
empyes -1.552737
The Akaike Information Criterion is minimized for fin
and educ
have been selected. The negative signs imply
that individuals who received financial aid or with a high level of
education (>=12th degree) are less susceptible to be rearrested. In the
latency component, also the covariate race
has been selected. The
negative coefficient implies that individuals of a race other than black
have a lower risk of being rearrested (among the individuals susceptible
to be rearrested).
In survival analysis studies, it may be the case that a fraction of the
population is likely to be not susceptible to the event of interest. In
this article, we presented the penPHcure R package, which implements
the semiparametric proportional-hazards (PH) cure model of
Sy and Taylor (2000) extended to time-varying covariates. This model can
measure the effects of some covariates on the probability of being
susceptible and on the time until the occurrence of the event. The
penPHcure package is composed of two main functions: penpHcure
, to
estimate the regression coefficients, their confidence intervals using
the basic/percentile bootstrap method and to perform variable selection
using the SCAD-penalized likelihood technique proposed by
Beretta and Heuchenne (2019b); and penpHcure.simulate
to simulate data from
a PH cure model with time-dependent covariates. We first explained the
methodology behind these functions and presented the results of a
simulation study to assess its finite-sample performance. Then, we
illustrated the use of the penPHcure
function through an example based
on the Criminal Recidivism dataset.
The latest release and a development version of the penPHcure package are respectively avilable on CRAN and at https://github.com/a-beretta/penPHcure.
penPHcure, flexsurvcure, flexsurv, nltm, smcure, spduration, survival, RcmdrPlugin.survival
ClinicalTrials, Distributions, Econometrics, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Beretta & Heuchenne, "penPHcure: Variable Selection in Proportional Hazards Cure Model with Time-Varying Covariates", The R Journal, 2021
BibTeX citation
@article{RJ-2021-061, author = {Beretta, Alessandro and Heuchenne, Cédric}, title = {penPHcure: Variable Selection in Proportional Hazards Cure Model with Time-Varying Covariates}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {53-66} }