Mixture cure models have been widely used to analyze survival data with a cure fraction. They assume that a subgroup of the individuals under study will never experience the event (cured subjects). The goal is therefore twofold: to study the cure probability and the failure time of the uncured individuals, the latter described by a proper survival function (the latency). The R package npcure implements a completely nonparametric approach for estimating these functions in mixture cure models with right-censored survival times. Nonparametric estimators of the cure probability and the latency as functions of a covariate are provided, together with bootstrap bandwidth selectors for the estimators. The package also implements a nonparametric covariate significance test for the cure probability, which can be applied with a continuous, discrete, or qualitative covariate.
In classical survival analysis, it is assumed that all the individuals will eventually experience the event of interest. However, there are many contexts in which this assumption does not hold. Noticeable examples are the lifetime of cancer patients after treatment, the time to infection in a population at risk, or the time to default in credit scoring, among many others. Cure models are a family of methods, recently developed in survival analysis, that take into account the possibility that subjects may never experience the event of interest. See Maller and Zhou (1996) for early references and Amico and Van Keilegom (2018) for an updated review.
Let
The cure probability,
An attractive feature of parametric and semiparametric models is that they provide closed-form expressions for the relevant parameters and functions. On the other hand, sound inference is guaranteed only if the chosen model fits the data suitably. A problem with these methods is that the parametric assumptions may be incorrect; for example, the functional form assumed for the cure rate as a function of the covariates may not hold in practice.
As a result of the increasing demand for cure models, the number of R packages accounting for the possibility of cure in survival analysis has grown significantly over the last decade: see the CRAN task view on survival analysis (https://CRAN.R-project.org/view=Survival). The smcure package (Cai et al. 2012) fits the semiparametric PH and AFT mixture cure models (see Kalbfleisch and Prentice 2002). In addition, the NPHMC package (Cai et al. 2013) can be used to calculate the sample size of a survival trial with or without a cure fraction. More recently, the flexsurvcure package (Amdahl 2017) provides flexible parametric mixture and non-mixture cure models for time-to-event data, and the rcure package (Han et al. 2017) incorporates methods related to robust cure models for survival data, which include a weakly informative prior in the logistic part. The geecure package (Niu and Peng 2018) features the marginal parametric and semiparametric PH mixture cure models for analyzing clustered survival data with a possible cure fraction. Furthermore, the miCoPTCM package (Bertrand et al. 2020) fits semiparametric promotion time cure models with possibly mis-measured covariates, while the mixcure package (Peng 2020) implements parametric and semiparametric mixture cure models based on existing R packages. For interval-censored data with a cure fraction, the GORcure package (Zhou et al. 2017) implements the generalized odds rate mixture cure model, including the PH mixture cure model and the proportional odds mixture cure model as special cases. The intercure package (Brettas 2016) provides an implementation of semiparametric cure rate estimators for interval-censored data using bounded cumulative hazard and frailty models.
In contrast with (semi)parametric procedures, nonparametric methods do not rely on the data belonging to any particular parametric family or fulfilling any parametric assumption. They estimate the target functions without making any assumption about their shape, so they enjoy much wider applicability than their parametric counterparts. A completely nonparametric mixture cure model must therefore consider purely nonparametric estimators for both the cure rate and the latency functions.
Although some of the aforementioned packages have a nonparametric flavor, their approach to mixture cure modeling is not completely nonparametric. Our R package npcure (López-de-Ullibarri et al. 2020) fills the gap by providing implementations of the nonparametric estimator of the cure rate function proposed by Xu and Peng (2014; further studied by López-Cheda et al. 2017a) and of the nonparametric estimator of the latency function proposed by López-Cheda et al. (2017b).
Furthermore, the generalized PL estimator of the conditional survival function is implemented in other packages: Beran() in package condSURV (Meira-Machado and Sestelo 2016), prodlim() in package prodlim (Gerds 2018), and Beran() in package survidm (Meira-Machado et al. 2019). The function in our package compares advantageously with the aforementioned functions with respect to bandwidth selection. The smoothing parameter plays an essential role in the bias-variance tradeoff of every nonparametric smoothing method. Dabrowska (1992) obtained an expression for the bandwidth minimizing the asymptotic mean squared error (MSE) of this estimator and proposed a plug-in bandwidth selector based on suitable estimators of the unknown functions in that expression. However, the performance of this plug-in selector is unsatisfactory for small sample sizes, and a cross-validation (CV) procedure is usually preferred (see, among others, Gannoun et al. 2007; Iglesias-Pérez 2009). Recently, Geerdens et al. (2017) proposed an improved CV bandwidth selector that performs well especially under high censoring rates. To the best of our knowledge, no other R package computes the Beran estimator with a suitable bandwidth selector: the condSURV and survidm packages do not provide any bandwidth selector, while the prodlim package uses nearest neighborhoods as the smoothing parameter. The npcure package, available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=npcure, fills this need by implementing the CV bandwidth selector of Geerdens et al. (2017) for the Beran estimator.
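The released version of npcure can be installed from CRAN and loaded in the usual way:

install.packages("npcure")
library("npcure")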
In this paper, we explain how the npcure package can be used in the context of nonparametric mixture cure models with right-censored data. The main objective is to estimate the cure probability and latency functions, as well as to perform covariate significance tests for the cure rate. In the next section, we describe our approach to nonparametric estimation in mixture cure models. The methodology applied in the covariate significance tests is presented in another section. Two further sections are devoted, respectively, to describing the package functions and to illustrating their use with an application to a medical dataset.
One of the specificities of time-to-event data is the presence of individuals that have not experienced the event by the end of the study. The observed survival times of these individuals are said to be right-censored, and they underestimate the true, unknown time to the occurrence of the event. This situation is usually modeled by considering a censoring variable, so that only the minimum of the survival and censoring times is observed, together with an indicator of whether the observation is uncensored.
The nonparametric latency estimator proposed by López-Cheda et al. (2017a), and further studied in López-Cheda et al. (2017b), is:
The proposed nonparametric estimators of both the cure rate and latency
are consistent under the general condition (see Laska and Meisner 1992; Maller and Zhou 1992; López-Cheda et al. 2017a,b)
The condition in (4) ensures
Under condition (4),
These comments emphasize that care must be exercised in choosing the length of follow-up when cures might be present, since too much censoring or an insufficient follow-up time could lead to erroneous conclusions. For example, if the last observation is uncensored then the estimated cure rate is 0, even if there is considerable late censoring. To avoid these difficulties, particularly with heavy censoring, reasonably long follow-up times and large sample sizes may be required. In this way,
Thus, when estimating
The nonparametric estimators in (1) and (3) depend on two smoothing parameters,
For a fixed value
With respect to the latency estimator in (3), López-Cheda et al. (2017b) propose to choose the bandwidth
For a fixed covariate value
Generate
Consider a search grid of bandwidths
Compute the nonparametric estimator
For each bandwidth
The bootstrap bandwidth
Following López-Cheda et al. (2017a,b), the bootstrap resamples in Step 1 are generated with the following procedure, which is equivalent to the simple weighted bootstrap proposed by Li and Datta (2001) without resampling the covariate
Generate
For each
For each
López-Cheda et al. (2017a,b) show that the effect of the pilot bandwidth on the bootstrap bandwidth selectors of
When selecting locally adaptive bandwidths, the results might look somewhat spiky due to their local nature (see, e.g., Brockmann et al. 1993 on local bandwidth selection for kernel regression estimators). This could be the case for the bootstrap bandwidths of both the cure rate and latency functions. To remove this fluctuation, the selected local bandwidths can be smoothed with a moving average, as illustrated in the sketch below (see also the hsmooth argument of the controlpars() function, described later).
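As a simple illustration of this kind of smoothing (this is not the code used internally by npcure, whose exact edge handling may differ), a centered moving average of a vector of local bandwidths can be computed with stats::filter():

# Illustration only: smooth a vector of local bandwidths with a
# centered moving average of order k (values at the boundaries are NA)
smooth_bandwidths <- function(h, k = 5) {
  as.numeric(stats::filter(h, rep(1/k, k), sides = 2))
}
smooth_bandwidths(c(0.62, 0.91, 0.58, 0.88, 0.60, 0.95, 0.63), k = 3)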
In medical studies, it is usually important to assess whether the cure
probability depends on a specific covariate,
The main challenge when testing (8) is that the cure indicator,
Finally, building on Delgado and González-Manteiga (2001) and using the estimated values of
The distribution of the CM and KS statistics under the null hypothesis is approximated by bootstrap, according to the following steps:
Obtain
Estimate the probability of cure under
Compute
Generate
Compute
With the bootstrap resample
With
Repeat Steps A-D above
The
Note that nonparametric estimators of the conditional functions
The npcure package provides several functions to nonparametrically model survival data with a possibility of cure. Table 1 contains a compact summary of the available functions. The estimators of the cure rate and latency functions, discussed in the section "Nonparametric estimation in mixture cure models", are implemented by probcure() and latency(), respectively. The functions probcurehboot() and latencyhboot() compute bootstrap bandwidths for these estimators. Another function deserving mention in this context is beran(), which computes the generalized PL estimator of the conditional survival function; the CV bandwidth for beran() is returned by berancv(). Given the computational burden of the procedures implemented by the aforementioned functions, all of them make extensive use of compiled C code. The significance test introduced in the previous section is carried out by testcov(), and testmz() performs the nonparametric test of Maller and Zhou (1992). Next, a detailed account of the usage of all these functions is provided.
Function | Description |
---|---|
beran | Computes Beran's estimator of the conditional survival function. |
berancv | Computes the CV bandwidth for Beran's estimator of the conditional survival function. |
controlpars | Sets the control parameters of the latencyhboot() and probcurehboot() functions. |
hpilot | Computes pilot bandwidths for the nonparametric estimators of the cure rate and the latency. |
latency | Computes the nonparametric estimator of the latency. |
latencyhboot | Computes the bootstrap bandwidth for the nonparametric estimator of the latency. |
print.npcure | Method of the generic function print for ‘npcure’ objects. |
probcure | Computes the nonparametric estimator of the cure rate. |
probcurehboot | Computes the bootstrap bandwidth for the nonparametric estimator of the cure rate. |
summary.npcure | Method of the generic function summary for ‘npcure’ objects. |
testcov | Performs covariate significance tests for the cure rate. |
testmz | Performs the nonparametric test of Maller and Zhou (1992). |
The estimation functions in npcure are restricted to one-dimensional continuous covariates. The Epanechnikov kernel is used in the smoothing procedures. Nonparametric estimation with discrete or categorical variables could be dealt with as in other kernel smoothing procedures. A simple approach is to split the sample into a number of subsets according to the covariate values. When the size of the subsamples is not too small, valid unconditional estimates of the cure probability and latency can be computed. Another alternative is the use of special kernels that can handle any covariate types (see Racine and Li 2004).
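For instance, with a binary covariate, the unconditional cure probability within each subsample can be estimated by the value of the Kaplan-Meier curve at its plateau, the same approach used below for the bmt data. A minimal sketch with simulated data (the survival package is assumed to be available):

library("survival")
set.seed(1)
n <- 200
g <- rep(c("A", "B"), each = n/2)          # categorical covariate defining the subsamples
p <- ifelse(g == "B", 0.7, 0.4)            # probability of being uncured in each group
y <- rweibull(n, shape = 2, scale = 1)     # failure times of the uncured
cens <- rexp(n, rate = 0.3)                # censoring times
cured <- runif(n) > p
time <- ifelse(cured, cens, pmin(y, cens))
status <- ifelse(cured, 0, as.numeric(y <= cens))
# Unconditional cure estimate in each group: KM curve value at its plateau
sapply(split(seq_len(n), g), function(i)
  min(survfit(Surv(time[i], status[i]) ~ 1)$surv))
# True cure probabilities are 1 - p: 0.6 in group A and 0.3 in group B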
Several features are shared by the functions in the package. All functions return an object of S3 class ‘npcure’, formally a list of components. Among these components are the primary outputs of the functions: the computed estimates for probcure() and latency(), the selected bandwidths for probcurehboot() and latencyhboot(), or the test statistics and p-values for testcov() and testmz(). The covariate values, observed times, and uncensoring indicators are passed to the functions via the x, t, and d arguments, respectively. Typically, a set of names is passed, which are interpreted as column names of the data frame specified by the dataset argument. However, dataset may also be left as NULL, the default, in which case the objects named in x, t, and d must exist in the workspace. More details on these and other arguments are given in the following.
The estimation of the cure rate using the nonparametric estimator in (1) is implemented in the probcure() function:
probcure(x, t, d, dataset = NULL, x0, h, local = TRUE, conflevel = 0L,
bootpars = if (conflevel == 0 && !missing(h)) NULL else controlpars())
The x0
argument specifies the covariate values where conditional
estimates of the cure rate are to be computed. The bandwidths required
by the estimator are passed to the h
argument. The local
argument is
a logical value determining whether the bandwidths are interpreted as
local (local = TRUE
) or global (local = FALSE
) bandwidths. Notice
that if local = TRUE
, then h
and x0
must have the same length.
Actually, the h
argument may be missing, in which case the local
bootstrap bandwidth computed by the probcurehboot()
function is used.
This last function implements the procedure for selecting the bandwidth
probcurehboot(x, t, d, dataset, x0, bootpars = controlpars())
The bootpars
argument controls the details of the computation of the
bootstrap bandwidth (see section "Bandwidth selection"). In typical
use, it is intended to receive the list returned by the controlpars()
function. The components of this list are described in
Table 2.
Argument | Description |
---|---|
B | Number of bootstrap resamples (by default, 999). |
hbound | A vector giving the minimum and maximum, respectively, of the initial grid of bandwidths as multiples of the standardized interquartile range (IQR) of the covariate values (by default, c(0.1, 3)). |
hl | Length of the initial grid of bandwidths (by default, 100). |
hsave | A logical specifying whether the grid of bandwidths is saved (by default, FALSE). |
nnfrac | Fraction of the sample size determining the order of the nearest neighbor used to compute the pilot bandwidth (by default, 0.25). |
fpilot | Either NULL, the default, or a function name. If NULL, the pilot bandwidth is computed by the package function hpilot(). If not NULL, it is the name of an alternative, user-defined function for computing the pilot. |
qt | In bandwidth selection with latencyhboot(), order of the quantile of the observed times specifying the upper bound of the integral in the computation of the MISE (by default, 0.75). |
hsmooth | Order of a moving average computed to optionally smooth the selected bandwidths (by default, 1, meaning that no smoothing is done). |
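For instance, the following call (with illustrative values only, and with the package loaded) bundles several of these options into the control list expected by the bootpars argument:

# 1999 bootstrap resamples, a grid of 150 bandwidths between 0.2 and 2
# times the standardized IQR of the covariate (saved in the output), and
# selected bandwidths smoothed with a moving average of order 5
ctrl <- controlpars(B = 1999, hbound = c(0.2, 2), hl = 150, hsave = TRUE,
                    hsmooth = 5)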
The function probcure() also allows constructing pointwise confidence intervals (CI) for the cure rate. These CIs exploit the asymptotic normality of the estimator (Xu and Peng 2014), using the bootstrap to obtain an estimate of the standard error of the estimated cure rate. The bootstrap resamples are generated by the same procedure described in the section "Bandwidth selection". The confidence level is specified through the conflevel argument as a number between 0 and 1. With the special value 0, the default, no CI is computed. Other parameters related to the bootstrap CIs can be passed to the bootpars argument, typically via the output of the controlpars() function. These parameters relate to the number of bootstrap resamples and the computation of the pilot bandwidth, and are specified, respectively, by the B and nnfrac arguments described in Table 2.
The usage of these functions is illustrated with a simulated dataset generated from a model where the cure probability is a logistic function of the covariate:
library("npcure")
n <- 50
x <- runif(n, -2, 2)
y <- rweibull(n, shape = 0.5 * (x + 4), scale = 1)
c <- rexp(n, rate = 1)
p <- exp(2 * x)/(1 + exp(2 * x))
u <- runif(n)
t <- ifelse(u < p, pmin(y, c), c)
d <- ifelse(u < p, ifelse(y < c, 1, 0), 0)
data <- data.frame(x = x, t = t, d = d)
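Since the covariate, the observed times, and the uncensoring indicators are available both as columns of data and as objects in the workspace, the two calling conventions described above are equivalent; for instance, with an arbitrary fixed bandwidth:

# Reading x, t and d from the columns of the data frame...
probcure(x, t, d, dataset = data, x0 = 0, h = 0.5)
# ...or directly from the objects in the workspace
probcure(x, t, d, x0 = 0, h = 0.5)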
In the next code example, point and 95% CI estimates of the cure probability are obtained with probcure() at a grid of covariate values ranging from -1.5 to 1.5. The local bootstrap bandwidths selected by probcurehboot() are passed to the h argument. The bandwidths, which have been further smoothed with a moving average of 15 bandwidths, are contained in the hsmooth component of the output of probcurehboot(). For the bootstrap, 2000 resamples are generated.
x0 <- seq(-1.5, 1.5, by = 0.1)
hb <- probcurehboot(x, t, d, data, x0 = x0,
bootpars = controlpars(B = 2000, hsmooth = 15))
q1 <- probcure(x, t, d, data, x0 = x0, h = hb$hsmooth, conflevel = 0.95,
bootpars = controlpars(B = 2000))
q1
#> Bandwidth type: local
#>
#> Conditional cure estimate:
#> h x0 cure lower 95% CI upper 95% CI
#> 0.6212329 -1.5 1.000000000 0.98450759 1.00000000
#> 0.6523881 -1.4 1.000000000 0.87087244 1.00000000
#> 0.6533320 -1.3 1.000000000 0.86080078 1.00000000
#> 0.6606362 -1.2 1.000000000 0.83135572 1.00000000
#> 0.6710717 -1.1 1.000000000 0.82267310 1.00000000
#> 0.6912311 -1.0 0.972213147 0.78259082 1.00000000
#> ...
More compactly, the same bootstrap bandwidths would be selected and the
same estimates obtained if h
were left unset when calling
probcure()
:
q2 <- probcure(x, t, d, data, x0 = x0, conflevel = 0.95,
bootpars = controlpars(B = 2000, hsmooth = 15))
Figure 1 shows a plot of the true cure rate
function and its point and 95% CI estimates at the covariate values
saved in x0
. The plot can be reproduced by executing the next code.
The components of the q1
object accessed by the code are x0
, keeping
the vector of covariate values, q
, containing the point estimates of
the cure rate, and conf
, a list with the lower (component lower
) and
upper (component upper
) limits of the CIs for the cure rate.
plot(q1$x0, q1$q, type = "l", ylim = c(0, 1), xlab = "Covariate X",
ylab = "Cure probability")
lines(q1$x0, q1$conf$lower, lty = 2)
lines(q1$x0, q1$conf$upper, lty = 2)
lines(q1$x0, 1 - exp(2 * q1$x0)/(1 + exp(2 * q1$x0)), col = 2)
legend("topright", c("Estimate", "95% CI limits", "True"),
lty = c(1, 2, 1), col = c(1, 1, 2))
The latency estimator in (3) is implemented in the latency() function:
latency(x, t, d, dataset = NULL, x0, h, local = TRUE, testimate = NULL,
conflevel = 0L, bootpars = if (conflevel == 0) NULL else controlpars(),
save = TRUE)
The function’s interface is similar to that of probcure(), with all the arguments, except for testimate, having exactly the same interpretation. The testimate argument determines the times at which the latency is estimated. It defaults to NULL, which results in the latency being estimated at the times given by the t argument.
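For example, assuming that testimate accepts a numeric vector of time points (as its role suggests), the latency at x0 = 0 could be evaluated on a regular time grid as follows; this is only a sketch:

# Sketch: latency estimate at x0 = 0 on a regular grid of times
# (testimate assumed to take a numeric vector of time points)
S0grid <- latency(x, t, d, data, x0 = 0, h = 0.8,
                  testimate = seq(0, 2, by = 0.25))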
Also, as was the case for probcure(), latency() allows obtaining bootstrap CIs for the latency function by specifying their level with the conflevel argument. These CIs also rely on the asymptotic normality of the latency estimator.
Likewise, the user can specify a local or a global bandwidth through the combined use of the h and local arguments. When h is left unspecified, a local bootstrap bandwidth is indirectly computed by the latencyhboot() function:
latencyhboot(x, t, d, dataset = NULL, x0, bootpars = controlpars())
This function implements the bandwidth selector for the latency estimator described in the section "Bandwidth selection". It is analogous to probcurehboot(), with which it shares a common interface. The only noticeable difference is that now the qt argument of controlpars() (see Table 2) can be used to set the order of the quantile of the observed times that gives the upper bound of the integral in the computation of the bootstrap MISE.
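For example, to select the bandwidths with 500 bootstrap resamples while integrating the bootstrap MISE up to the 90th percentile of the observed times, instead of the default 75th, one could call (illustrative values only):

b90 <- latencyhboot(x, t, d, data, x0 = c(0, 0.5),
                    bootpars = controlpars(B = 500, qt = 0.9))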
Using the same simulated data as before, the next code illustrates the computation of point and 95% CI estimates (based on 500 bootstrap resamples) of the latency for covariate values 0 and 0.5, with local bandwidths equal to 0.8 and 0.5, respectively. Notice that, since the testimate argument is unset, the estimates are computed at the times t:
S0 <- latency(x, t, d, data, x0 = c(0, 0.5), h = c(0.8, 0.5),
conflevel = 0.95, bootpars = controlpars(B = 500))
To estimate the latency using the bootstrap bandwidth selector, latencyhboot() can be called before calling latency(). In the following code, the component h of the output of latencyhboot(), which contains the selected local bandwidths, is passed to the h argument of latency():
b <- latencyhboot(x, t, d, data, x0 = c(0, 0.5))
S0 <- latency(x, t, d, data, x0 = c(0, 0.5), h = b$h, conflevel = 0.95)
S0
#> Bandwidth type: local
#>
#> Covariate (x0): 0.0 0.5
#> Bandwidth (h): 4.531978 2.527206
#>
#> Conditional latency estimate:
#>
#> x0 = 0
#> time latency lower 95% CI upper 95% CI
#> 0.004599127 1.0000000 1.00000000 1.0000000
#> 0.042088293 1.0000000 1.00000000 1.0000000
#> 0.042271452 1.0000000 1.00000000 1.0000000
#> 0.059671372 1.0000000 1.00000000 1.0000000
#> 0.067375891 1.0000000 1.00000000 1.0000000
#> 0.098569312 1.0000000 1.00000000 1.0000000
#> ...
#>
#> x0 = 0.5
#> time latency lower 95% CI upper 95% CI
#> 0.004599127 1.0000000 1.00000000 1.0000000
#> 0.042088293 1.0000000 1.00000000 1.0000000
#> 0.042271452 1.0000000 1.00000000 1.0000000
#> 0.059671372 1.0000000 1.00000000 1.0000000
#> 0.067375891 1.0000000 1.00000000 1.0000000
#> 0.098569312 1.0000000 1.00000000 1.0000000
#> ...
An alternative, more succinct way to proceed is to leave h
unset,
since in that case, latencyhboot()
is indirectly called:
S0 <- latency(x, t, d, data, x0 = c(0, 0.5), conflevel = 0.95)
Figure 1 also shows the estimated and true latency for covariate value 0, obtained with latency(). The testim component of the output contains the times at which the estimates are computed. The S component is a list having a named item for each covariate value. Each element contains the latency estimates for a covariate value, and its name is constructed by prefixing the covariate value with an x. The conf component is also a named list, with names constructed as those of the S component. Each of its items contains, structured as a list, the lower (lower component) and upper (upper component) limits of the CIs. Finally, x0 keeps the covariate values as a separate element.
plot(S0$testim, S0$S$x0, type = "s", xlab = "Time", ylab = "Latency",
ylim = c(0, 1))
lines(S0$testim, S0$conf$x0$lower, type = "s", lty = 2)
lines(S0$testim, S0$conf$x0$upper, type = "s", lty = 2)
lines(S0$testim, pweibull(S0$testim, shape = 0.5 * (S0$x0[1] + 4),
scale = 1, lower.tail = FALSE), col = 2)
legend("topright", c("Estimate", "95% CI limits", "True"),
lty = c(1, 2, 1), col = c(1, 1, 2))
The npcure package also provides an implementation of the nonparametric covariate significance tests for the cure rate discussed in the section "Covariate significance tests":
testcov(x, t, d, dataset = NULL, bootpars = controlpars(), save = FALSE)
The x argument is the covariate whose effect on the cure rate is to be tested. The function’s output is a list whose main components are CM and KS. Each of them, in turn, is a list containing the test statistic (stat) and the p-value (pvalue) of the CM and KS tests, respectively.
The result of the test carried out with our simulated data and 2500 bootstrap resamples is:
testcov(x, t, d, data, bootpars = controlpars(B = 2500))
#> Covariate test
#>
#> Covariate: x
#> test statistic p.value
#> Cramer-von Mises 0.4537077 0.0592
#> Kolmogorov-Smirnov 1.2456568 0.0708
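The components described above can be extracted from the returned object for further use; for example (component names as given above):

tst <- testcov(x, t, d, data, bootpars = controlpars(B = 2500))
c(CM = tst$CM$pvalue, KS = tst$KS$pvalue)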
Non-numeric covariates can also be tested. For example, for z
, a
nominal covariate added to the simulated data, the result is:
data$z <- rep(factor(letters[1:5]), each = 10)
testcov(z, t, d, data, bootpars = controlpars(B = 2500))
#> Covariate test
#>
#> Covariate: z
#> test statistic p.value
#> Cramer-von Mises 0.2513218 0.6356
#> Kolmogorov-Smirnov 0.7626470 0.5340
The npcure package also includes the beran() function, which computes the generalized PL estimator of the conditional survival function. The beran() function in our package may be used together with the berancv() function:
berancv(x, t, d, dataset, x0, cvpars = controlpars())
This function computes the local CV bandwidth selector of Geerdens et al. (2017). It can be called directly by the user, although in practice it will more usually be called indirectly through the beran() function, which, as said before, computes the generalized PL estimator of the conditional survival function:
beran(x, t, d, dataset, x0, h, local = TRUE, testimate = NULL, conflevel = 0L,
cvbootpars = if (conflevel == 0 && !missing(h)) NULL else controlpars())
The arguments of these two functions have the same meaning as their homonyms in the latency() and latencyhboot() functions, with cvpars and cvbootpars playing the role of bootpars in those functions. As in latency(), if no bandwidth is provided by the user via h, then the local CV bandwidth of Geerdens et al. (2017) is computed by berancv().
For example, the code below computes the Beran estimator for the covariate values 0 and 0.5 using local CV bandwidths. The default behavior of berancv() is modified with the auxiliary function controlpars(). In detail, the local CV bandwidth search is performed on a grid of 200 bandwidths (hl = 200) ranging from 0.2 to 2 times the standardized IQR of the covariate (hbound = c(0.2, 2)), and the grid is saved (hsave = TRUE). Point and 95% CI estimates of the conditional survival function are then computed by beran() with the selected bandwidths:
x0 <- c(0, 0.5)
hcv <- berancv(x, t, d, data, x0 = x0,
cvpars = controlpars(hbound = c(0.2, 2), hl = 200, hsave = TRUE))
S <- beran(x, t, d, data, x0 = x0, h = hcv$h, conflevel = 0.95)
S
#> Bandwidth type: local
#>
#> Covariate (x0): 0.0 0.5
#> Bandwidth (h): 1.598875 1.104106
#>
#> Beran's conditional survival estimate:
#>
#> x0 = 0
#> time survival lower 95% CI upper 95% CI
#> 0.004599127 1.0000000 1.0000000 1.0000000
#> 0.042088293 1.0000000 1.0000000 1.0000000
#> 0.042271452 1.0000000 1.0000000 1.0000000
#> 0.059671372 1.0000000 1.0000000 1.0000000
#> 0.067375891 1.0000000 1.0000000 1.0000000
#> 0.098569312 1.0000000 1.0000000 1.0000000
#> ...
#>
#> x0 = 0.5
#> time survival lower 95% CI upper 95% CI
#> 0.004599127 1.0000000 1.0000000 1.0000000
#> 0.042088293 1.0000000 1.0000000 1.0000000
#> 0.042271452 1.0000000 1.0000000 1.0000000
#> 0.059671372 1.0000000 1.0000000 1.0000000
#> 0.067375891 1.0000000 1.0000000 1.0000000
#> 0.098569312 1.0000000 1.0000000 1.0000000
#> ...
The next code shows an equivalent way of obtaining the same estimates:
S <- beran(x, t, d, data, x0 = x0, conflevel = 0.95,
cvbootpars = controlpars(hbound = c(0.2, 2), hl = 200, hsave = TRUE))
Figure 2 displays point and 95% CI estimates of the survival curve for covariate value 0.5. It has been obtained by executing:
plot(S$testim, S$S$x0.5, type = "s", xlab = "Time", ylab = "Survival",
ylim = c(0, 1))
lines(S$testim, S$conf$x0.5$lower, type = "s", lty = 2)
lines(S$testim, S$conf$x0.5$upper, type = "s", lty = 2)
p0 <- exp(2 * x0[2])/(1 + exp(2 * x0[2]))
lines(S$testim, 1 - p0 + p0 * pweibull(S$testim,
shape = 0.5 * (x0[2] + 4), scale = 1, lower.tail = FALSE), col = 2)
legend("topright", c("Estimate", "95% CI limits", "True"),
lty = c(1, 2, 1), col = c(1, 1, 2))
The nonparametric estimators of the cure rate and latency functions given in (1) and (3), respectively, require assumption (4) for their consistency. In other words, the follow-up must be long enough for cures to happen, so that the censored times after the largest uncensored observation can be assumed to correspond to cured subjects.
The procedure proposed by Maller and Zhou (1992) to test the hypothesis (4) is performed by the testmz() function:
testmz(t, d, dataset)
The function returns a list (with class attribute ‘npcure’) whose main component, pvalue, contains the p-value of the test. A further component, aux, is in turn a list of components: statistic, which contains the test statistic, n, the sample size, delta, giving the difference between the largest observed time and the largest uncensored time, and interval, which has the range between the largest uncensored time minus delta and the largest uncensored time.
With our simulated data, the result of the test is:
testmz(t, d, data)
#> Maller-Zhou test
#>
#> statistic n p.value
#> 43 50 2.024892e-43
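The p-value and the auxiliary components can also be extracted from the returned object; a brief sketch, using the component names described above:

mz <- testmz(t, d, data)
mz$pvalue          # p-value of the Maller-Zhou test
mz$aux$statistic   # value of the test statistic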
To illustrate the nonparametric modeling of the mixture cure model with the npcure package, we consider the bone marrow transplantation data in Klein and Moeschberger (2005), available as the bmt dataset of the R package KMsurv (Klein et al. 2012). The
data comes from a multi-center study carried out between 1984 and 1989,
involving 137 patients with acute myelocytic leukemia (AML) or acute
lymphoblastic leukemia (ALL), aged from 7 to 52. Bone marrow transplant
(BMT) is the standard treatment for acute leukemia. Transplantation can
be considered a failure when leukemia recurs or the patient dies.
Consequently, the failure time is defined as the time (days) to relapse
or death. The variables collecting this information are:
Variable | Description |
---|---|
t2 | Disease-free survival time in days (time to relapse, death, or end of study). |
d3 | Disease-free survival indicator (1: dead or relapsed, 0: alive and disease-free). |
The probability of cure after BMT is high, especially if BMT is performed while the patient remains in the chronic phase (Devergie et al. 1987). Recovery after BMT is a complex process depending on a large set of risk factors, whose status is coded by the following variables:
Variable | Description |
---|---|
ta | Time to acute graft-versus-host disease (GVHD). |
tc | Time to chronic GVHD. |
tp | Time to return of platelets to normal levels. |
z1 | Patient age (years). |
z2 | Donor age (years). |
z7 | Waiting time to transplant (days). |
group | Disease group (1: ALL, 2: AML low risk, 3: AML high risk). |
da | Acute GVHD indicator (1: developed, 0: never developed). |
dc | Chronic GVHD indicator (1: developed, 0: never developed). |
dp | Platelet recovery indicator (1: returned to normal, 0: never returned to normal). |
z3 | Patient gender (1: male, 0: female). |
z4 | Donor gender (1: male, 0: female). |
z5 | Patient cytomegalovirus (CMV) status (1: positive, 0: negative). |
z6 | Donor CMV status (1: positive, 0: negative). |
z8 | FAB (1: FAB grade 4 or 5 and AML, 0: otherwise). |
z9 | Hospital (1: Ohio State University, 2: Alferd, 3: St. Vincent, 4: Hahnemann). |
z10 | Methotrexate (MTX) used for prophylaxis of GVHD (1: yes, 0: no). |
Before applying the estimation methods of the npcure package, it should be checked whether the follow-up time was long enough to ensure that condition (4) holds. This can be subjectively assessed by visualizing a plot of the KM estimate of the unconditional survival function and checking whether it shows a long plateau with heavy censoring at the end (a minimal sketch of such a plot is given below). More formally, the hypothesis of sufficient follow-up can be tested with the Maller-Zhou test:
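The sketch below uses the survival package (also loaded later in this section) and is an illustration only:

data("bmt", package = "KMsurv")
library("survival")
# KM estimate of disease-free survival: a plateau at the end of the curve
# suggests that the follow-up was long enough
plot(survfit(Surv(t2, d3) ~ 1, data = bmt), mark.time = FALSE,
     xlab = "Time (days)", ylab = "Disease-free survival")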
data("bmt", package = "KMsurv")
testmz(t2, d3, bmt)
#> Maller-Zhou test
#>
#> statistic n p.value
#> 11 137 1.047242e-05
We start by estimating the cure probability as a function of age (z1) and of the waiting time to transplant (z7). Cure probabilities are estimated at a grid of 100 values between the 5th and 95th percentiles of the observed values of z1 and z7. The code for z1 is shown below (for z7, it is similar):
x0 <- seq(quantile(bmt$z1, 0.05), quantile(bmt$z1, 0.95), length.out = 100)
q.age <- probcure(z1, t2, d3, bmt, x0 = x0, conflevel = 0.95,
bootpars = controlpars(hsmooth = 10))
Both estimated cure rates are displayed in Figure 4, where a kernel estimate of the covariate density has been added for reference:
par(mar = c(5, 4, 4, 5) + 0.1)
plot(q.age$x0, q.age$q, type = "l", ylim = c(0, 1),
xlab = "Patient age (years)", ylab = "Cure probability")
lines(q.age$x0, q.age$conf$lower, lty = 2)
lines(q.age$x0, q.age$conf$upper, lty = 2)
par(new = TRUE)
d.age <- density(bmt$z1)
plot(d.age, xaxt = "n", yaxt = "n", xlab = "", ylab = "", col = 2,
main = "", zero.line = FALSE)
mtext("Density", side = 4, col = 2, line = 3)
axis(4, ylim = c(0, max(d.age$y)), col = 2, col.axis = 2)
legend("topright", c("Estimate", "95% CI limits", "Covariate density"),
lty = c(1, 2, 1), col = c(1, 1, 2), cex = 0.8)
The cure probability seems to be nearly constant or, at most, to decrease slightly with patient age and as the waiting time to transplant increases.
The significance of the effect of patient age (z1
) and waiting time to
transplant (z7
) on the probability of cure can be tested with the
testcov()
function:
testcov(z1, t2, d3, bmt, bootpars = controlpars(B = 2500))
#> Covariate test
#>
#> Covariate: z1
#> test statistic p.value
#> Cramer-von Mises 0.1103200 0.8204
#> Kolmogorov-Smirnov 0.7308477 0.7900
testcov(z7, t2, d3, bmt, bootpars = controlpars(B = 2500))
#> Covariate test
#>
#> Covariate: z7
#> test statistic p.value
#> Cramer-von Mises 0.7921912 0.0968
#> Kolmogorov-Smirnov 1.6116129 0.1008
The effect of age on the cure probability is not statistically significant with either the CM or the KS test (p-values of 0.820 and 0.790, respectively). The effect of the waiting time to transplant is not significant at the 5% level either, although the p-values are considerably smaller (0.097 and 0.101 for the CM and KS tests, respectively).
The cure probability can also be compared between groups defined by a categorical covariate. We illustrate this case by considering gender (z3) and the use of MTX for prophylaxis of GVHD (z10). To improve readability, we first label the groups:
bmt$z3 <- factor(bmt$z3, labels = c("Male", "Female"))
bmt$z10 <- factor(bmt$z10, labels = c("MTX", "No MTX"))
summary(bmt[, c("z3", "z10")])
#> z3 z10
#> Male :57 MTX :97
#> Female:80 No MTX:40
The estimated survival functions are displayed in Figure
5. The code for gender (z3
) is:
library("survival")
Sgender <- survfit(Surv(t2, d3) ~ z3, data = bmt)
Sgender
#> Call: survfit(formula = Surv(t2, d3) ~ z3, data = bmt)
#>
#> n events median 0.95LCL 0.95UCL
#> z3=Male 57 36 318 172 NA
#> z3=Female 80 47 606 418 NA
plot(Sgender, col = 1:2, mark.time = FALSE, xlab = "Time (days)",
ylab = "Disease-free survival")
legend("topright", legend = c("Male", "Female"), title = "Gender",
lty = 1, col = 1:2)
The estimated probability of cure for each group defined by gender (z3) is obtained by computing, for each stratum, the unconditional cure rate estimator of Laska and Meisner (1992). This estimator of the probability of cure is the value of the KM curve at the largest uncensored time:
qgender <- c(min(Sgender[1]$surv), min(Sgender[2]$surv))
qgender
#> [1] 0.1899671 0.4065833
The estimated probability of cure is 19.0% for males and 40.7% for females. The cure probabilities according to whether or not MTX was used for GVHD prophylaxis (z10) are:
Smtx <- survfit(Surv(t2, d3) ~ z10, data = bmt)
qmtx <- c(min(Smtx[1]$surv), min(Smtx[2]$surv))
qmtx
#> [1] 0.3679977 0.3482143
The cure rate of patients treated with MTX is estimated to be 36.8%, slightly higher than 34.8%, the estimate for patients not treated with MTX.
The effect of these two binary variables on the cure probability is tested with the testcov() function, in the same way as for the continuous covariates:
testcov(z3, t2, d3, bmt, bootpars = controlpars(B = 2500))
#> Covariate test
#>
#> Covariate: z3
#> test statistic p.value
#> Cramer-von Mises 0.5947305 0.0900
#> Kolmogorov-Smirnov 1.1955919 0.0892
testcov(z10, t2, d3, bmt, bootpars = controlpars(B = 2500))
#> Covariate test
#>
#> Covariate: z10
#> test statistic p.value
#> Cramer-von Mises 1.018441 0.0692
#> Kolmogorov-Smirnov 1.199340 0.0668
The differences in the probability of cure between males and females, and between patients with and without MTX treatment, are not statistically significant at the 5% level, although a borderline effect is evidenced (p-values between 0.067 and 0.090).
The survival of the uncured patients (latency) is estimated for patient ages (z1) of 25 and 40 years as follows:
S0 <- latency(z1, t2, d3, bmt, x0 = c(25, 40), conflevel = 0.95,
bootpars = controlpars(B = 500))
Figure 6 displays the survival functions for the two ages, obtained by executing:
plot(S0$testim, S0$S$x25, type = "s", ylim = c(0, 1),
xlab = "Time (days)", ylab = "Latency")
lines(S0$testim, S0$conf$x25$lower, type = "s", lty = 2)
lines(S0$testim, S0$conf$x25$upper, type = "s", lty = 2)
lines(S0$testim, S0$S$x40, type = "s", col = 2)
lines(S0$testim, S0$conf$x40$lower, type = "s", lty = 2, col = 2)
lines(S0$testim, S0$conf$x40$upper, type = "s", lty = 2, col = 2)
legend("topright", c("Age 25: Estimate", "Age 25: 95% CI limits",
"Age 40: Estimate", "Age 40: 95% CI limits"), lty = 1:2,
col = c(1, 1, 2, 2))
An increased survival of younger patients can be observed, but the survival advantage vanishes after approximately 6 years.
This paper introduces the npcure package. It provides an R implementation of a completely nonparametric approach for estimation in mixture cure models, along with a nonparametric covariate significance test for the cure probability. Moreover, the generalized PL estimator of the conditional survival function, with a CV bandwidth selection function, is included. Furthermore, the theory underlying the implemented methods, presented in Xu and Peng (2014) and López-Cheda et al. (2017a,b), has been summarized.
The npcure package has some limitations. Firstly, it only handles right-censored survival times; left censoring, truncation, and interval censoring have not been considered in this approach and remain open problems to be dealt with in the future. Secondly, conditional estimation can be performed only when a single covariate is involved. The same restriction applies to the implemented covariate significance test for the cure rate. An important extension would be the development of estimation and testing procedures for the cure rate and latency functions when they depend on a set of covariates. A major challenge is the way the covariates are handled, since the analysis of a large number of covariates would suffer from the curse of dimensionality. Dimension reduction techniques would be required, which leads to a demanding approach that has not been addressed yet and that we leave for future research.
There is an interesting issue that remains an open problem to be dealt with in future versions of the package. Traditional cure rate models implicitly assume that there is no additional information on the cure status of the patients, so the cure indicator is modeled as a latent variable. However, examples contradicting this assumption can be found. For instance, in some clinical settings, subjects who are followed up beyond a threshold period without experiencing the event can be considered cured. In other cases, complementary diagnostic tests providing further information about a patient's cure status may be available. We aim to develop improved nonparametric methods of estimation and hypothesis testing that take this additional information into account.
The first author’s research was sponsored by the Beatriz Galindo Junior Spanish Grant from Ministerio de Ciencia, Innovación y Universidades (MICINN) with reference BGP18/00154. All the authors acknowledge partial support by the MICINN Grant MTM2017-82724-R (EU ERDF support included), and by Xunta de Galicia (Centro Singular de Investigación de Galicia accreditation ED431G/01 2016-2019 and Grupos de Referencia Competitiva CN2012/130 and ED431C2016-015) and the European Union (European Regional Development Fund - ERDF).
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
López-Cheda, et al., "npcure: An R Package for Nonparametric Inference in Mixture Cure Models", The R Journal, 2021
BibTeX citation
@article{RJ-2021-027,
  author = {López-Cheda, Ana and Jácome, M. Amalia and López-de-Ullibarri, Ignacio},
  title = {npcure: An R Package for Nonparametric Inference in Mixture Cure Models},
  journal = {The R Journal},
  year = {2021},
  note = {https://rjournal.github.io/},
  volume = {13},
  issue = {1},
  issn = {2073-4859},
  pages = {21-41}
}