Change-point hazard models have several practical applications, including modeling processes such as cancer mortality rates and disease progression. While the inverse cumulative distribution function (CDF) method is commonly used for simulating data, we demonstrate the shortcomings of this approach when simulating data from change-point hazard distributions with more than a scale parameter. We propose an alternative method of simulating this data that takes advantage of the memoryless property of survival data and introduce the R package cpsurvsim which implements both simulation methods. The functions of cpsurvsim are discussed, demonstrated, and compared.
When modeling time-to-event processes, especially over long periods of time, it is often unreasonable to assume a constant hazard rate. In these cases, change-point hazard models are applicable. The majority of research surrounding change-point hazard models focuses on the Cox proportional hazards and piecewise exponential models with one change-point (Yao 1986; Gijbels and Gurler 2003; Wu et al. 2003; Dupuy 2006; Rojas et al. 2011), likely due to the straightforward extension for including fixed and time-varying covariates (Zhou 2001; Hendry 2014; Montez-Rath et al. 2017a; Wong et al. 2018). Research on hazard models with multiple change-points is also expanding as these models have a wide range of applications in fields such as medicine, public health, and economics (Liu et al. 2008; Goodman et al. 2011; He et al. 2013; Han et al. 2014; Qian and Zhang 2014; Cai et al. 2017). In the interest of simulating time-to-event data featuring trends with multiple change-points, (Walke 2010) presents an algorithm for simulating data from the piecewise exponential distribution with fixed type I censoring using the location of the change-points, the baseline hazard, and the relative hazard for each time interval in between change-points. As the research surrounding parametric change-point hazard models with multiple change-points continues to grow, likewise does the need to simulate data from these distributions. Simulation is also a powerful and popular tool for assessing the appropriateness of a model for one’s data or conducting a power analysis.
Several R packages available from the Comprehensive R Archive Network (CRAN) provide functions for simulating time-to-event data in general, with a heavy focus on the Cox model. Some of the more popular packages are provided in Table 1, which expands on the METACRAN compilation (Allignol and Latouche 2020). Although considerably smaller in scope, a few R packages provide functions for simulating data with change-points. CPsurv has functionality for simulating both nonparametric survival data and parametric survival data from the Weibull change-point distribution but requires existing data as an argument and only allows for one change-point (Krügel et al. 2017). SimSCRPiecewise simulates data using the piecewise exponential hazard model within the Bayesian framework, however, this method requires at least one covariate as an argument (Chapple 2016).
Package Title | Brief Description |
---|---|
coxed | Simulates data for the Cox model using the flexible-hazard |
method and allows for the inclusion of time-varying covariates | |
(Kropko and Harden 2019) | |
CPsurv | Simulates one change-point for non-parametric survival analysis |
or parametric survival analysis using the Weibull distribution | |
(Krügel et al. 2017) | |
cpsurvsim | Simulates data with multiple change-points from the exponential |
and Weibull distributions (Hochheimer 2021) | |
discSurv | Simulates survival data from discrete competing risk models |
(Welchowski and Schmid 2019) | |
gems | Simulates data from multistate models and allows for |
non-Markov models that account for previous events | |
(Blaser et al. 2015) | |
genSurv | Gives users the option to generate data with a binary, |
time-dependent covariate | |
(Meira-Machado and Faria 2014; Araújo et al. 2015) | |
ipred | Provides a function for simulating survival data for tree- |
structured survival analysis (Peters and Hothorn 2019) | |
MicSim | Performs continuous time miscrosimulations to simulate life |
courses (Zinn 2018) | |
PermAlgo | Uses a permutational algorithm to generate time-to-event data |
allowing for the inclusion of several time-dependent covariates | |
(Sylvestre et al. 2010) | |
prodlim | Has functions for simulating right censored non-parametric |
survival data with two covariates and with or without competing | |
risks (Gerds 2018) | |
simMSM | Uses inversion sampling to simulate data from multi-state |
models allowing for non-linear baseline hazards, time-varying | |
covariates, and dependence on past events (Reulen 2015) | |
simPH | Simulates data from Cox proportional hazards models |
(Gandrud 2015) | |
simsurv | Simulates data from various parametric survival distributions, |
2-component mixture distributions, and user-defined hazards | |
(Brilleman 2019) | |
SimSCRPiecewise | Uses Bayesian estimation to simulate data from the piecewise |
exponential hazard model allowing for the inclusion of covariates | |
(Chapple 2016) | |
SimulateCER | While not a formal R package, this package extends the methods |
found in PermAlgo and can be downloaded from GitHub | |
(Montez-Rath et al. 2017b) | |
survsim | Allows users to simulate time-to-event, competing risks, multiple |
event, and recurrent event data (Moriña and Navarro 2014) |
Our package cpsurvsim allows users to simulate data from a both the exponential and Weibull hazard models with type I right censoring allowing for multiple change-points (Hochheimer 2021). cpsurvsim provides two methods for simulating data, which are introduced in the following section. The first method draws on (Walke 2010), using the inverse hazard function to simulate data. The second employs the memoryless simulation method, the details of which are also discussed in the next section. We then demonstrate how to simulate data using cpsurvsim and compare the performance of these methods through a simulation study with the motivation of enabling users to determine which method is best for their data.
The piecewise exponential model with multiple change-points
The Weibull distribution is another popular parametric model for
survival data due to its flexibility to fit a variety of hazard shapes
while still satisfying the proportional hazards assumption. Note that
when
(Zhou 2001) touches on the idea of the memoryless property as a means of
interpreting the piecewise exponential model, however, we take this one
step further by using this property to simulate data from change-point
hazard models. In survival analysis, the memoryless property states that
the probability of an individual experiencing an event at time
Our memoryless simulation method uses this extension of the memoryless
property in that data between change-points are simulated from
independent exponential or Weibull hazard distributions with scale
parameters
The cpsurvsim package can be installed from CRAN. Functions for simulating data are summarized in Table 2
Function | Hazard model | Simulation method |
---|---|---|
exp_cdfsim |
Piecewise constant | Inverse hazard function |
exp_memsim |
Piecewise constant | Memoryless |
weib_cdfsim |
Weibull change-point | Inverse hazard function |
weib_memsim |
Weibull change-point | Memoryless |
As an example of the functions exp_cdfsim
and weib_cdfsim
, which
simulate data using the inverse hazard method from the exponential and
Weibull distributions, respectively, consider the following:
library(cpsurvsim)
dta1 <- exp_cdfsim(n = 50, endtime = 100, theta = c(0.005, 0.01, 0.05),
+ tau = c(33, 66))
head(dta1)
time censor
1 100.00000 0
2 85.99736 1
3 78.21772 1
4 71.03138 1
5 100.00000 0
6 82.71520 1
dta2 <- weib_cdfsim(n = 50, endtime = 100, gamma = 2,
+ theta = c(0.0001, 0.0002, 0.0001), tau = c(33, 66))
head(dta2)
time censor
1 11.36844 1
2 100.00000 0
3 81.04904 1
4 100.00000 0
5 71.93590 1
6 56.40275 1
When simulating using the memoryless method, we use the following calls from cpsurvsim:
dta3 <- exp_memsim(n = 50, endtime = 100, theta = c(0.005, 0.01, 0.05),
+ tau = c(33, 66))
head(dta3)
time censor
1 93.64262 1
2 63.47413 1
3 84.54253 1
4 89.01574 1
5 73.92685 1
6 23.67631 1
dta4 <- weib_memsim(n = 50, endtime = 100, gamma = 2,
+ theta = c(0.0001, 0.0002, 0.0001), tau = c(33, 66))
head(dta4)
time censor
1 59.47848 1
2 100.00000 0
3 62.08739 1
4 100.00000 0
5 100.00000 0
6 100.00000 0
As seen in these examples, all four functions return a dataset with the survival times and a censoring indicator.
To compare the performance of the inverse hazard method with the
memoryless method under different settings, we conducted a simulation
study using cpsurvsim. We simulated data with one, two, three, and
four change-points using both the exponential and Weibull distributions.
In our simulation, time
When simulating from the exponential distribution, these two simulation
methods had comparable accuracy in terms of the location of the
change-points (see Figure 1). Sample size, however,
had a large impact on the accuracy regardless of the simulation method.
When estimating one change-point with a sample size of 50, there were a
few simulation templates where less than a third of estimates
|
|
|
|
|
|
|
|
Simulations for the Weibull distribution, however, revealed important
differences in accuracy between the two methods (see
Figure 2). Although accuracy of the two methods was
similar when simulating one change-point, there were many cases where
accuracy was very low, even with a sample size of 500
(Figure 2a). In almost one quarter (
|
|
|
|
|
|
|
|
We suspected that the inaccurate estimates using the Weibull
distribution were due to inaccuracies in estimating the shape parameter
|
|
|
|
|
|
|
|
The R package cpsurvsim provides implementation of the standard method
of simulating from a distribution, using the inverse CDF, and a new
method that exploits the memoryless property of survival analysis. When
simulating from the exponential distribution with multiple
change-points, these methods have comparable performance. Simulating
multiple change-points from the Weibull hazard, however, suggested that
the inverse hazard method produces more accurate estimates of the
change-points
The inspiration to develop the memoryless simulation method and test it
came from observing the shortcomings of the inverse hazard method in our
research. The memoryless method performs better in some simulation
scenarios, which led us to implement both methods in this R package.
This simulation study, however, suggests that in the majority of cases
the inverse hazard method simulates values of
An important limitation of the exp_cdfsim
and weib_cdfsim
functions
is that they only accommodate up to four change-points. While it’s
possible to have more than this many change-points in a dataset, it’s
also important to make sure that there is a meaningful interpretation
for multiple change-points. Also, cpsurvsim only accommodates type I
right censoring. For the Weibull distribution,
Thank you to Dr. Sarah Ratcliffe for her guidance and for assisting with running these simulations.
While the change-points can be estimated without knowing the values of
the scale parameters, the reverse is not possible. Thus, we used the
estimated values of the change-points in order to estimate values of
With a few exceptions, the estimates of
|
|
|
|
|
|
|
|
Estimates of
|
|
|
|
|
|
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
PhD & PhD, "cpsurvsim: An R Package for Simulating Data from Change-Point Hazard Distributions", The R Journal, 2022
BibTeX citation
@article{RJ-2022-005, author = {PhD, Camille J. Hochheimer, and PhD, Roy T. Sabo,}, title = {cpsurvsim: An R Package for Simulating Data from Change-Point Hazard Distributions}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {1}, issn = {2073-4859}, pages = {196-207} }