The new R package ebmstate is a package for multi-state survival analysis. It is suitable for high-dimensional data and allows point and interval estimation of relative transition hazards, cumulative transition hazards and state occupation probabilities, under clock-forward and clock-reset Cox models. Our package extends the package mstate in a threefold manner: it transforms the Cox regression model into an empirical Bayes model that can handle high-dimensional data; it introduces an analytical, Fourier transform-based estimator of state occupation probabilities for clock-reset models that is much faster than the corresponding, simulation-based estimator in mstate; and it replaces asymptotic confidence intervals meant for the low-dimensional setting by non-parametric bootstrap confidence intervals. Our package supports multi-state models of arbitrary structure, but the estimators of state occupation probabilities are valid for transition structures without cycles only. Once the input data is in the required format, estimation is handled automatically. The present paper includes a tutorial on how to use ebmstate to estimate transition hazards and state occupation probabilities, as well as a simulation study showing how it outperforms mstate in higher-dimensional settings.
Multi-state models based on transition hazard functions are often used in the statistical analysis of longitudinal data, in particular disease progression data (Hougaard 1999). The multi-state model framework is particularly suitable to accommodate the growing level of detail of modern clinical data: as long as a clinical history can be framed as a random process which, at any moment in time, occupies one of a few states, a multi-state model is applicable. Another strong point of this framework is that it can incorporate a regression model, i.e., a set of assumptions on how covariates, possibly time-dependent ones, affect the risk of transitioning between any two states of the disease. Once estimated, multi-state models with regression features allow the stratification of patients according to their transition hazards. In addition, it is possible, under some models, to generate disease outcome predictions. These come in the form of state occupation probability estimates, meaning estimates of the probability of being in each state of the disease over a given time frame.
The survival analysis ‘task view’ of the Comprehensive R Archive Network
lists seven R packages that are able to fit general multi-state models
and, at the same time, feature some kind of regression model or
algorithm: flexsurv
(Jackson 2016),
msm (Jackson 2011),
SemiMarkov
(Listwon and Saint-Pierre 2015),
survival
(Therneau 2015),
mstate (Wreede et al. 2010),
mboost
(Hothorn et al. 2020) – as extended by
gamboostMSM
(Reulen 2014) – and
penMSM
(Reulen 2015). All of them implement relative risk regression models
(as defined in Aalen et al. 2008 133). The only exceptions are
survival, which also
fits Aalen’s additive regression model (Aalen 1989), and flexsurv
,
which also implements accelerated failure time models .
Recall that a Cox regression model is a semi-parametric model in which
every transition hazard is assumed to be the product of a baseline
hazard function of unspecified form (the non-parametric component) and
an exponential relative risk function (the parametric component)
(Aalen et al. 2008 133). Generally, the relative risk regression models
implemented in these packages are Cox regression models. However, some
models in flexsurv
, as well as those in
msm and
SemiMarkov, also
restrict the baseline hazards to specific parametric families, i.e. they
are fully parametric. In
msm and
SemiMarkov, the
stronger assumptions regarding the functional form of the hazard are
leveraged to do away with other common assumptions:
SemiMarkov drops
the usual Markov property to implement homogeneous semi-Markov models;
msm is suitable for panel
data, i.e., data in which the state of each individual is known only at
a finite series of times.
Packages penMSM and gamboostMSM are the best suited to deal with higher-dimensional covariate data. The first of these packages relies on a structured fusion lasso method, while the second implements (jointly with mboost) a boosting algorithm. Both methods induce sparsity in the number of non-zero covariate effects, as well as equality among the different transition effects of each covariate, and are thus especially useful to reduce complicated multi-state models to more interpretable ones. The remaining packages assume standard, fixed effects relative risk regression models and do not include regularisation or variable selection features.
It is also illustrative to order the seven packages mentioned according
to how extensive their analysis workflow is. Packages
SemiMarkov and
penMSM are intended for
the estimation of relative transition hazards only (i.e., for estimating
the impact of covariates on each transition hazard). With the package
mboost (as extended by
gamboostMSM) it is
also possible to estimate the baseline transition hazards. Finally, a
more complete workflow including estimates of both relative and
cumulative transition hazards, as well as state occupation
probabilities, is implemented in flexsurv
,
msm and
mstate, and has been
under implementation in
survival (version 3.0
or later).
The present paper provides an introduction to ebmstate, a new R package for multi-state survival analysis available for download on the Comprehensive R Archive Network (CRAN). The main goal of ebmstate is to provide an analysis framework for the Cox model that performs better with higher-dimensional covariate data and is also complete, in the sense of being able to generate point and interval estimates of relative transition hazards, cumulative transition hazards and state occupation probabilities, both under clock-forward and clock-reset models. A fundamental characteristic of ebmstate is that it re-implements and extends the analysis framework of mstate, which is complete in the sense just mentioned. In fact, to a large extent, our package was built by importing, adapting and replacing functions from the mstate package. This not only eliminates redundancies, but also makes our package more accessible to the numerous users of mstate (the three papers associated with mstate have jointly over 2000 citations).
To improve the performance of mstate’s multi-state Cox model when dealing with higher-dimensional covariate data, a ridge-type regularisation feature was added. We allow the regression coefficients of the model to be partitioned into groups, with each group having its own Gaussian prior. A group can gather, for example, all the regression coefficients for a given transition. Or, within a given transition, coefficients can be grouped according to the covariate type they refer to (for example, demographic, clinical or genomic type). The resulting hierarchical Bayes model is empirical in that a full prior elicitation is not required (the mean and variance hyper-parameters of the Gaussian are estimated from the data). Model fitting relies on the iterative algorithm introduced by Schall (1991), which typically converges after a small number of steps. A simulation study showing that Schall’s algorithm performance compares well with that of other algorithms for ridge penalty optimisation, including one based on cross-validation, can be found in Perperoglou (2014).
The asymptotic confidence intervals generated by mstate are applicable when the number of observations is much larger than the number of parameters to be estimated (see section 3.3 below). To preserve the completeness of mstate’s framework in higher-dimensional settings, we therefore implemented non-parametric bootstrap intervals of regression coefficients, cumulative transition hazards and state occupation probabilities.
The high computational cost implied by the non-parametric bootstrap
motivated a third extension to
mstate. We developed an
estimator of state occupation probabilities under clock-reset Cox models
that is based on a convolution argument (as in Spitoni et al. 2012) and the
Fast Fourier transform (FFT). At present, the estimation of such
probabilities for clock-forward Cox models can be carried out using the
efficient, product-limit based algorithm available in
mstate. However, for
clock-reset Cox models, only a simulation-based estimator is available
in this package (see also the flexsurv
package for a similar,
simulation-based estimator). The FFT estimator in
ebmstate was
conceived as a faster alternative to this simulation-based estimator,
but its scope is currently restricted to multi-state models with
transition structures that have no cycles, i.e. in which a transition
between two states is either not possible or follows a unique sequence
of states. Figure 1 provides a short
graphical summary of
ebmstate, with the
main inputs – a genomic-clinical data set and an empirical Bayes
multi-state Cox model – and the main outputs – the estimates of
relative hazards and state occupation probabilities (cumulative
transition hazards are omitted).
As already mentioned, our empirical Bayes method improves estimator performance in models with larger numbers of covariates (see section 4 on estimator performance). Also, as a ridge-type regression method, it can be used as an alternative to the lasso method of penMSM in two particular cases: when the levels of correlation between covariates are high enough to compromise the stability of lasso-based covariate selection; or simply to improve prediction accuracy when interpretability is not essential and the number of covariates is not greater than the number of observations (Zou and Hastie 2005). In addition, and perhaps more importantly, ebmstate goes beyond the regularised estimation of transition hazards offered by penMSM and gamboostMSM: point and interval estimates of state occupation probabilities under the regularised Cox model can also be computed.
A multi-state Cox model is a continuous-time stochastic process with a
finite (and usually small) state space
The parametric component of the transition hazard from
Define
The purpose of the framework just described is to allow the clustering of covariate effects according to their prior distribution. If there is no prior knowledge about how this clustering should be done, a single Gaussian prior can be imposed on all regression coefficients at once. If prior knowledge allows the grouping of effects according to the transition they refer to, a different Gaussian prior can be assigned to the coefficients of each transition. Even within each transition, different groups of coefficients can be assigned different prior distributions. In the analysis of biomedical data, for example, there can be a split between genes which are known to affect the transition hazard, and other genes whose effect is unknown.
Our package imports from
mstate a Breslow
estimator of two types of cumulative transition hazard: one on a global
time scale, defined as
By state occupation probability, we mean the probability that a patient
in state
When conditioning on a given covariate path (time-fixed or not), state
occupation probability estimates are not valid unless the covariates are
external (Aalen et al. 2008 142; Cortese and Andersen 2010). Note that a vector of
covariates
In the current section, we present the estimation methods underlying the extensions of mstate implemented in ebmstate.
Let CoxRFX
. The results of a simulation study showing its consistency are
included in the Supporting Scripts and Data (file ESM_1.html, section
1).
The computation of cumulative hazard rates for given covariate values
and an estimated regression coefficient vector relies on the function
msfit_generic
, which is essentially a wrapper for the function
mstate::msfit
(see section 5.3).
For the mathematical details of this computation, we refer therefore the
reader to Wreede et al. (2010).
The package mstate
includes a simulation-based estimator that can take as input either
For convenience, let the sequence of states from
Recall that
In ebmstate, the
function probtrans_ebmstate
generates a set of state occupation
probability estimates at equally spaced time points:
probtrans_ebmstate
can be described as follows:
For
For
Finally, use the estimates obtained in the last iteration of step 2
to compute
Substituting
Apart from probtrans_ebmstate
, the function probtrans_fft
is also
based on the convolution argument just shown. However, this function
makes use of the convolution theorem, i.e., of the fact that the
convolution of two (vectorized) functions in the time domain is
equivalent to a pointwise product of the same functions in the frequency
domain. The estimation of state occupation probabilities is thus
simplified to
fft
function of the base package stats
.
The Supporting Scripts and Data contain a short simulation study
checking that state occupation probabilities can be accurately estimated
with probtrans_ebmstate
and probtrans_fft
(see file ESM_1.html,
sections 3 and 4).
Figure 2 consists of a grid of plots with estimated
curves of state occupation probabilities. It compares, in terms of speed
and accuracy, the estimator in probtrans_fft
with an estimator in
mstate::mssample
that has the same target, but is simulation-based.
Each plot contains a black curve and a superimposed red curve. The red
curves in any given column of the grid are all based on the same run of
a function: columns 1 to 3 are based on runs of mssample
with the
number of samples probtrans_fft
. Each column in the
grid reproduces the same 4 black curves. These are based on a single run
of mssample
with probtrans_fft
is as accurate as
mssample
with mssample
achieves a good approximation to the
true state occupation probabilities, but is still roughly 9 times
slower. The details on how figure 2 and its
underlying data were generated are given in the Supporting Scripts and
Data (file ESM_1.html, section 5).
Under any model estimated by
ebmstate – as in
general under a Bayesian model –, one can, if the sample size is large
enough, approximate the posterior by a normal distribution with mean
equal to the maximum a posteriori estimate and covariance matrix equal
to the inverse of the generalised observed Fisher information (see, for
example, Gelman et al. 2014 83–84). This approximation has first-order
accuracy and is thus outperformed by Laplace’s method, which has
second-order accuracy (Carlin and Louis 2009 110–111). However, as Carlin and Louis (2009 112) observe, “for moderate- to high-dimensional boot_ebmstate
.
It is a well-documented fact in the statistical literature that standard least-squares or maximum-likelihood estimators can often be improved by regularisation or shrinkage (see, for example, Samworth 2012). This improvement comes about when the model dimensionality is high enough that the bias introduced by regularisation is outweighed by the reduction in the estimator variance. In the current setting, one might therefore ask: what kind of dimensionality does a semi-parametric, multi-state Cox model need to have to be outperformed by its empirical Bayes counterpart? A simulation study we carried out offers a tentative answer to this question, by comparing estimators under both Cox models for an increasing number of covariates. The study also features a third method, based on a fully non-parametric model, as a null model method. This was included to give an idea of how many covariates the empirical Bayes model can deal with before it becomes no better than a simple non-regressive model.
We assessed the performance of all estimators defined by the tuple
All training data sets were simulated from clock-reset Cox models. Apart
from survival::coxph
and ebmstate::CoxRFX
respectively; the
estimation of state occupation probabilities is based on
mstate::probtrans
for the null model and on ebmstate::probtrans_fft
for both the standard Cox and the empirical Bayes Cox models.
The reason we did not consider simulation scenarios with more than 500
covariates per transition, in data sets of 1000 patients, was simply
computational cost. For example, generating the data and error
observations for the scenario with
Whenever an estimator was able to compute a valid estimate of its target
for each training data set, i.e., when it did not return any ‘NA’
estimates, its boxplots are based on 300 valid error observations. This
was always the case with non-parametric estimators: the estimates of
regression coefficients and relative hazards of this type of estimators
are trivial (fixed at zero and one respectively) and hence it is also
straightforward to compute absolute errors. It also happened that
non-parametric estimators of state occupation probabilities had no ‘NA’
estimates (see file ESM_1.html, section 6, figure 6.3, in the Supporting
Scripts and Data). The situation was similar for the empirical Bayes Cox
model estimators, which showed no more than 5
With respect to the performance of the three methods studied, the boxplots in figures 6 and 7 suggest the following conclusions:
As
The relative performance of the empirical Bayes method with respect
to the null method decreases as the number of co-occurring
transition hazards in the model grows. All other things equal, the
empirical Bayes method has the best performance under the ‘linear’
structure model, which has no competing transitions; it performs
less well under the ‘m’ structure transition model, where two
transition hazards can co-occur; and has the worse relative
performances under the ‘competing risks’ model, where three
transition hazards co-occur. This trend is clearer for
Having as target the regression coefficients or the state occupation
probabilities, instead of relative hazards, makes the empirical
Bayes method better in comparison to the null method. In fact, as
The features of mstate
were illustrated in Wreede et al. (2010) using a simple
workflow. The starting point of this workflow is a data set in ‘long
format’. Such data set can be fed into survival::coxph
to obtain
estimates of the regression coefficients of a multi-state Cox model. The
resulting model fit object can be passed on to mstate::msfit
, along
with a vector of covariates of a particular patient, to get personalised
estimates of the cumulative hazard functions. Finally, state occupation
probabilities for the same patient can be estimated if the object
created by mstate::msfit
is fed into mstate::probtrans
. In this
section, we describe how
ebmstate extends the
scope of this workflow, i.e., how it uses the packages
survival and
mstate to generate
estimates under a multi-state empirical Bayes Cox model. A diagram
summarising the extension is shown in figure 8. In
the 5.5 subsection, we give some
recommendations on how to assess and compare models, but for more
detailed tutorials on how to analyse multi-state data using models
defined by transition hazards, we refer the reader to
Putter et al. (2007) and Putter (2011).
The main steps of the ebmstate workflow are here illustrated using a data set of patients with myelodysplastic syndromes (MDS) which has been described and studied in Papaemmanuil et al. (2013). A myelodysplastic syndrome is a form of leukemia in which the bone marrow is not able to produce enough mature blood cells, and which sometimes develops into a cancer of white blood cells with a quick and aggressive progression, i.e., into acute myeloid leukemia (AML). Figure 9a illustrates an illness-death type model for MDS patients and also gives a breakdown of the number of transition events. The conversion to a model with a transition structure that has no cycles (i.e., that can be handled by our convolution-based estimators) is shown in figure 9b. The data set used for model estimation, obtained after a number of pre-processing steps, contains the disease history of 576 patients, as well as measurements on 30 covariates. Of these 30 covariates, 11 are mutation covariates and the remaining are clinical or demographic (see figure 9c). The running time for the estimation of relative transition hazards does not exceed 10 seconds in a standard laptop computer. The same holds for the estimation of cumulative transition hazards or state occupation probabilities for a given patient. The complete R code underlying the data analysis in the current section can be found in the Supporting Scripts and Data (file ESM_2.html). For running only the R snippets shown below and reproduce their results, the best option is to use the R script in file ESM_3.R of the Supporting Scripts and Data.
Table 1 shows a fragment of the MDS data
set. The data is in ‘long format’, which means that each row refers to a
period of risk for a given transition and patient. For example, row Tstart[i]
, patient id[i]
entered state
from[i]
, and thereby began to be at risk for transition trans[i]
,
i.e., at risk of going from state from[i]
to state to[i]
. If the
first transition of patient id[i]
after time Tstart[i]
occurs before
the last follow-up time for this patient, Tstop[i]
records the time of
this transition (regardless of whether the patient moved to state
to[i]
or not). Otherwise, Tstop[i]
is set to the last follow-up
time. The value of status[i]
is set to 1 if and only if the first
transition of patient id[i]
after Tstart[i]
is to state to[i]
and
occurs before the last follow-up (otherwise it is set to 0). The value
of time[i]
is defined simply as Tstop[i]
Tstart[i]
, and
strata[i]
is the stratum of the baseline hazard for transition
trans[i]
(more about this variable in the following section). For x
ASXL1
, DNMT3A
,
x[i]
denotes the level of covariate x
between Tstart[i]
and Tstop[i]
in patient id[i]
. (In the MDS data
set, we assume that the relative hazard of a patient is determined by
her covariate vector at
From table 1, we know that patient 77 entered state 1 (‘MDS’) at time 0 and remained in this state until time 2029, when she moved to state 3 (‘death before AML’). There are no rows to describe the evolution of patient 77 after entering state 3, as this state is an absorbing state. As to patient 78, she remained in state 1 until time 332, and moved from there to state 2 (‘AML’). She lived with AML for 1117 days and moved to state 4 (‘death after AML’) at time 1449.
id from to trans Tstart Tstop time status strata ASXL1 DNMT3A [...]
77 1 2 1 0 2029 2029 0 1 0 0 .
77 1 3 2 0 2029 2029 1 2 0 0 .
78 1 2 1 0 332 332 1 1 1 0 .
78 1 3 2 0 332 332 0 2 1 0 .
78 2 4 3 332 1449 1117 1 3 1 0 .
Table 1: A 5-row fragment of the MDS data set (in long format)
Once the data is in ‘long format’, the estimation of an empirical Bayes
model can be carried out using the function CoxRFX
. A simple example
of the first argument of CoxRFX
, denoted ‘Z
’, is a data frame
gathering the trans
, strata
and covariate columns of the data in
long format:
outcome_covs <- c("id","from","to","trans","Tstart","Tstop","time","status",
"strata")
Z <- mstate_data[!names(mstate_data) %in% outcome_covs]
#(`mstate_data' has the data in long format)
The strata
column determines which baseline hazard functions are
assumed to be equal. In table 1, each
transition is assumed to have a (potentially) different baseline hazard.
The model’s assumptions regarding how covariates affect the hazard are
reflected on the format of the covariate columns of Z
. When the Z
argument is the one created in the previous block of code, CoxRFX
returns a single regression coefficient estimate for each covariate. In
other words, the impact of any covariate is assumed to be the same for
every transition.
There are however ways of relaxing this assumption. One can replace the
ASXL1
column in Z (or any other covariate column) by several
‘type-specific’ ASXL1
columns: the ASXL1
column specific for type
ASXL1
in rows belonging to
transition of type CoxRFX
to estimate a (potentially) different ASXL1
coefficient
for each transition type. This process of covariate expansion by type
can be based on any partition of the set of transitions. When each type
corresponds to a single transition, we refer to it simply as ‘covariate
expansion by transition’. The output shown below illustrates the effect
of expanding the covariates in ‘mstate_data’ by transition.
# Columns `id' and `trans' from `mstate_data' together with the first
# two expanded covariates (patients 77 and 78):
id trans ASXL1.1 ASXL1.2 ASXL1.3 DNMT3A.1 DNMT3A.2 DNMT3A.3 [...]
77 1 0 0 0 0 0 0 .
77 2 0 0 0 0 0 0 .
78 1 1 0 0 0 0 0 .
78 2 0 1 0 0 0 0 .
78 3 0 0 1 0 0 0 .
The example code given below shows how to use
mstate to expand
covariates by transition and how to create a Z
argument that makes
CoxRFX
estimate a regression coefficient for each covariate for
transitions 1 and 2, and assume a fully non-parametric hazard for
transition 3.
# To expand covariates by transition using mstate::expand.covs,
# first set the class of `mstate_data' as
class(mstate_data) <- c("data.frame","msdata")
# then add the transition matrix as attribute:
attr(mstate_data,"trans") <- tmat
#(`tmat' is the output of mstate::transMat)
# Expand covariates by transition:
covariates_expanded_123 <- mstate::expand.covs(
mstate_data,
covs = names(mstate_data)[! names(mstate_data) %in% outcome_covs],
append = F
)
# remove all covariates for transition 3 from `covariates_expanded_123'
# to fit a fully non-parametric model on this transition:
covariates_expanded_12 <- covariates_expanded_123[
!grepl(".3",names(covariates_expanded_123),fixed = T)
]
#argument `Z' of coxrfx
Z_12 <- data.frame(covariates_expanded_12,strata = mstate_data$trans,
trans = mstate_data$trans)
The second argument of CoxRFX
(‘surv
’) is a survival object that can
easily be built by feeding the outcome variable columns of the data to
the function Surv
(from the package
survival). Whether
CoxRFX
fits a clock-forward model or a clock-reset model depends on
the kind of survival object:
#argument `surv' for a clock-forward model
surv <- Surv(mstate_data$Tstart,mstate_data$Tstop,mstate_data$status)
#argument `surv' for a clock-reset model
surv <- Surv(mstate_data$time,mstate_data$status)
The argument groups
of CoxRFX
is a vector whose length equals the
number of covariates in the data. In other words, the length of groups
is ncol(Z)-2
, since the argument Z
must include both the covariate
data and the strata
and trans
columns. If, for groups[i]
=groups[j]
Z
both belong
to a group named ‘foo’ of coefficients with the same prior. For the Z
object built above, the groups
argument created in the following block
of code embodies the assumption that all coefficients associated with a
given transition have the same prior distribution. The final line of
code fits the empirical Bayes model.
#argument `groups' of coxrfx
groups_12 <- paste0(rep("group",ncol(Z)-2),c("_1","_2"))
#fit random effects model
model_12 <- CoxRFX(Z_12,surv,groups_12,tmat)
Figure 10 shows regression coefficient point
estimates for a clock-reset, empirical Bayes model fitted with the code
above. Also shown are 95% non-parametric bootstrap confidence intervals
computed using ebmstate::boot_ebmstate
. The
The function msfit_generic
is the generic function in
ebmstate that
computes cumulative transition hazards for a given set of covariate
values and an estimated Cox model. It calls a different method according
to the class of its object
argument. The default method corresponds to
the original msfit
function of the
mstate package and is
appropriate for objects of class coxph
, i.e., objects that contain the
fit of a Cox model with fixed effects. The other available method for
msfit_generic
, msfit_generic.coxrfx
, is just the original msfit
function, (slightly) adapted to deal with objects generated by CoxRFX
.
Quite importantly, msfit_generic.coxrfx
does not allow the variance of
the cumulative hazards to be computed, as this computation relies on
asymptotic results which may not be valid for an empirical Bayes model.
As a result, it only has two other arguments apart from the object of
class coxrfx
: a data frame with the covariate values of the patient
whose cumulative hazards we want to compute; and a transition matrix
describing the states and transitions in the model (such as the one that
can be generated using transMat
from the package
mstate). The following
block of code exemplifies how these objects can be built and generates
the msfit
object containing the cumulative transition hazard estimates
for a sample patient. Note that the object with the patient data must
include a row for each transition, as well as a column specifying the
transition stratum of each row of covariates.
# Build `patient_data' data frame with the covariate values for which
# cumulative hazards are to be computed (covariate values of patient 78):
patient_data <- mstate.data[mstate.data$id == 78,,drop = F][rep(1,3),]
patient_data$strata <- patient_data$trans <- 1:3
patient_data <- mstate::expand.covs(
patient_data,
covs = names(patient_data)[ ! names(patient_data) %in% outcome_covs],
append = T
)
patient_data <- patient_data[ ! grepl(".3",names(patient_data),fixed = T)]
# The `patient_data' data frame has only 3 rows (one for each transition).
# The output below shows its `id' and `trans' columns
# and expanded covariates ASXL1 and DNMT3A:
id trans ASXL1.1 ASXL1.2 DNMT3A.1 DNMT3A.2 [...]
78 1 1 0 0 0 .
78 2 0 1 0 0 .
78 3 0 0 0 0 .
# compute cumulative hazards
msfit_object_12 <- msfit_generic(model_12,patient_data,tmat)
Figure 11 shows three plots of estimated
cumulative transition hazards for the sampled patient, one for each
transition in the model, along with ebmstate::boot_ebmstate
).
Throughout the plotted period, the ‘slope’ of the cumulative hazard
(i.e., the hazard rate) for the MDS to AML transition is lower than the
one for the MDS to death transition, and this in turn is lower than the
one for the AML to death transition. It should be recalled that the
cumulative hazard estimate is strictly non-parametric for this last
transition, i.e., it is the same for all patients. The central plot of
figure 11 suggests that, as time since
diagnosis goes by, the hazard of dying in MDS increases (possibly an
effect of age). On the other hand, the hazard of dying in AML seems to
decrease (slightly) with time (rightmost plot). Conclusions regarding
the evolution of the AML hazard are hard to draw, since the confidence
intervals for the corresponding cumulative hazard curve are very wide
(leftmost plot).
If an object generated by msfit_generic
is fed to plot
, and the
package mstate is
loaded, the method mstate:::plot.msfit
will be called. This is an
efficient way of automatically plotting the cumulative hazard estimates
for all transitions, but confidence interval lines (separately
estimated) cannot be added.
The functions probtrans_mstate
, probtrans_ebmstate
and
probtrans_fft
compute estimates of state occupation probabilities for
a given msfit
object. All three functions generate objects of class
probtrans
that can be fed to the plot.probtrans
method from the
package mstate. The
first of these functions should only be used for clock-forward models,
as it relies on product-limit calculations. It calls the method
probtrans_mstate.default
, if the msfit
object was generated by
msfit_generic.default
, or the method probtrans_mstate.coxrfx
, if it
was generated by msfit_generic.coxrfx
. Both methods are identical to
the function probtrans
in the
mstate package, with
the reserve that probtrans_mstate.coxrfx
does not allow the
computation of the variances or covariances of the state occupation
probability estimator.
The functions probtrans_ebmstate
and probtrans_fft
are the functions
in ebmstate for the
computation of state occupation probability estimates under clock-reset
models with a transition structure that has no cycles. When using
probtrans_fft
(the faster, but somewhat less stable, of these two
functions), three arguments must be supplied: the initial state of the
process whose state occupation probabilities one wishes to compute, the
msfit
object, and the upper time limit for the generation of estimates
(max_time
). Both functions are based on a discrete-time approximation
to a series of convolutions. The default argument nr_steps
controls
the number of (equally spaced) time steps used in this approximation.
The arguments max_time
and nr_steps
should be increased until the
estimated curves become stable.
The following line of code computes point estimates of state occupation probabilities for the sample patient.
probtrans_object_12 <- probtrans_fft("MDS",msfit_object_12, max_time = 4000)
Estimates are shown in figure 12, along
with boot_ebmstate
:
# Creating the object arguments for boot_ebmstate()
# `groups' arguments was already created, but we need to add names to it
names(groups_12) <- names(covariates_expanded_12)
# `mstate_data_expanded' argument (similar to `covariates_expanded' but
# including outcome variables)
mstate_data_expanded <- cbind(
mstate_data[names(mstate_data) %in% outcome_covs],
covariates_expanded_12
)
# create the non-parametric bootstrap confidence intervals
boot_ebmstate_object <- boot_ebmstate(
mstate_data = mstate_data_expanded,
which_group = groups_12,
min_nr_samples = 100,
patient_data = patient_data,
tmat = tmat,
initial_state = "MDS",
time_model = "clockreset",
input_file = NULL,
coxrfx_args = list(max.iter = 200),
probtrans_args = list(max_time = 4000)
)
For any model fitted with
ebmstate, two
performance metrics can be easily computed: the concordance statistic
((Harrell et al. 1982); see also the help page of
survival::concordance
for the definition of concordance) and the
Bayesian Information Criterion (BIC) score (Schwarz 1978).
As an example of how these two metrics can be obtained and used for
model comparison, suppose we wish to compare ‘model_12’ fitted above –
which consists of a Cox regression including all covariates for
transitions 1 and 2 and a fully non-parametric model for transition 3 –
with a model that combines Cox regressions of all covariates for each of
the three transitions (denoted ‘model_123’ below). The following code
snippet shows how to fit this second model.
# arguments `groups' and `Z' for fitting a Cox regression model on all transitions
Z_123 <- data.frame(
covariates_expanded_123,
strata = mstate_data$trans,
trans = mstate_data$trans
)
groups_123 <- paste0(rep("group", ncol(Z_123) - 2), c("_1", "_2", "_3"))
# Fit a Cox regression model for all transitions
model_123 <- CoxRFX(Z = Z_123, surv = surv, groups = groups_123)
Running the concordance
function in the
survival package for
each model yields the following output:
> concordance(model_12)
Call:
concordance.coxph(object = model_12)
n= 1210
Concordance= 0.8131 se= 0.01314
concordant discordant tied.x tied.y tied.xy
strata=1 18040 2783 0 1 0
strata=2 37919 9678 0 7 0
strata=3 0 0 1052 0 4
> concordance(model_123)
Call:
concordance.coxph(object = model_123)
n= 1210
Concordance= 0.8168 se= 0.01312
concordant discordant tied.x tied.y tied.xy
strata=1 18041 2782 0 1 0
strata=2 37920 9677 0 7 0
strata=3 784 268 0 4 0
The output shows that modelling transition 3 with a Cox model, instead
of a fully parametric one, has a negligible impact on the overall
concordance. However, this is due to the fact that there are far fewer
observations for this transition. The concordance for transition 3 only,
which corresponds to strata 3, is 0.5 under the fully parametric model
(i.e., all patients are assigned the same transition hazard) and
considerably higher under the Cox regression (concordance
function
(argument newdata
). However, in the present case, only 61 patients
were ever at risk of dying with AML (i.e. of undergoing transition 3),
and of these only 41 actually died, so we might prefer to keep all
patients in the training data, rather than saving a fraction of them for
testing purposes. Such an option will yield more accurate coefficient
estimates, at the expense of not allowing the computation of unbiased
estimates of model performance. If the goal is only to compare models,
we can make do without test data, by using an information score that
penalises model complexity, such as the BIC. To facilitate model
comparison, the BIC score is one of the attributes of the model fit
object:
> model_12$BIC
[1] 2508.37
> model_123$BIC
[1] 2483.49
The best model is the one with the lowest score, so the choice of ‘model_123’ is confirmed.
We have shown that
ebmstate is suitable
for higher-dimensional, multi-state survival analysis, and that it is
both efficient and easy-to-use. To a significant extent, the
user-friendliness of
ebmstate stems from
the fact that it was not built ‘from the ground up’. Instead, we
produced a package that is more easily accessible to the many users of
mstate by taking
advantage of whichever features of this package were useful to our
method and by eliminating redundancies. The connection between
ebmstate and
mstate is based on the
fact that the function CoxRFX
takes the same type of input and
produces the same type of output as coxph
from the package survival
,
and the function probtrans_fft
(or probtrans_ebmstate
) has the same
type of input and output as probtrans
from
mstate (as shown in
figure 8).
We also sought to improve our package’s user-friendliness by making it as efficient as possible. The reduction of computational cost is based on two features. First, our empirical Bayes method relies on an expectation-maximisation algorithm that estimates both the parameters and the hyper-parameters of the model, i.e., no further tuning of the model is required. Second, in ebmstate, the computation of state occupation probability estimates relies on analytical results rather than on simulation: not only for clock-forward models, where we import from mstate a product-limit estimator, but also for clock-reset models, where we implement our own estimator based on a convolution argument and the fast Fourier transform.
To our knowledge,
ebmstate is the first
R package to put together a framework for multi-state model estimation
that is complete and suitable for higher-dimensional data. It does so by
implementing point and interval estimators of regression coefficients,
cumulative transition hazards and state occupation probabilities, under
regularised multi-state Cox models. In section
4, the results of the simulation study
suggest that for data sets with 100 patients or more and a ratio of
As mentioned in previous sections, ebmstate imports a product-limit estimator from mstate that targets the state occupation probabilities of patients with time-fixed covariate vectors. However, these estimators are extendible to models with time-dependent covariates, as long as these are external and the estimates are conditional on specific covariate paths (Aalen et al. 2008 142). For piecewise constant covariates, it is likely that such an adaptation could be obtained by combining transition probability estimates obtained for each period in which the covariates are fixed. While no significant theoretical obstacles are foreseen in this matter, the computer implementation for more than a single piecewise constant covariate is likely to be a laborious task. We have left it therefore for future work.
The authors are supported by grant NNF17OC0027594 from the Novo Nordisk Foundation. We thank an anonymous reviewer for their constructive comments and helpful suggestions which led to a much-improved manuscript.
In the supporting Scripts and Data, the file ESM_1.html
contains
additional simulation results and theoretical demonstrations. Additional
details on the analysis of the MDS data set are given in the file
ESM_2.html
. The MDS data set is in files MDS.TPD.20Nov2012.csv
and
mds.paper.clin.txt
. The file ESM_3.R
contains a simplified R script
to run the code snippets in the present paper. The
ebmstate package is
available on CRAN.
The authors have declared no conflict of interest.
Figures
Figure 1: Summary of inputs and outputs of the package ebmstate. The input data set should be one that violates the assumption – commonly used in survival analysis – that the number of observations is much larger than the number of parameters to be estimated (a genomic-clinical data set is shown as a typical example). The input model is a multi-state Cox model defined by a transition structure and a prior distribution on the regression coefficients. This prior distribution is defined by partitioning the vector of regression coefficients into groups of regression coefficients, with each group having its own Gaussian prior with undetermined mean and variance. The outputs of ebmstate include estimates of the relative transition hazards associated with each covariate, as well as estimates of the probability that a specific patient (with specific covariate measurements) has of occupying each state of the model over some time period. Estimates of cumulative transition hazards are omitted from the figure.
Figure 2: Comparison of running times and estimation accuracy of mssample and probtrans_fft. Each plot in the grid shows two estimated curves of state occupation probabilities. The black curves are based on a single run of mstate::mssample with n=100.000 observations (approximately 17 minutes of running time) and are the same across columns. They serve as benchmark for precision assessment. In columns 1 to 3 of the grid, the superimposed red curves are based on a run of mssample with respectively 100, 1000, and 10.000 observations. In the rightmost column, the red curves are based on a run of probtrans_fft. All functions have as input the same set of cumulative transition hazards. These were estimated using a non-parametric multi-state model and a data set of 1000 patients generated according to a clock-reset Cox model with a ‘linear’ transition structure (leftmost diagram of figure 3). Plots in the same row refer to the same state of the model, while those in the same column refer to the same run of a function. Running times and, where appropriate, number of simulations (n) are given on top of each column.
Figure 3: Model transition structures. We studied the performance of Cox model estimators, empirical Bayes Cox model estimators and fully non-parametric estimators with respect to these 3 transition structures.
Figure 4: Proportions of valid, infinite and missing (‘NA’) estimates for the standard Cox model estimators in the simulation study of figure 6 (100 patients per simulated data set).
Figure 5: Proportions of valid, infinite and missing (‘NA’) estimates for the standard Cox model estimators in the simulation study of figure 7 (1000 patients per simulated data set).
Figure 6: Performance comparison of standard Cox, empirical Bayes Cox, and fully non-parametric (null) estimators using training data sets with 100 observations each. In the figure grid there is a boxplot corresponding to every tuple
Figure 7: Performance comparison of standard Cox, empirical Bayes Cox, and fully non-parametric (null) estimators using training data sets with 1000 observations each. In the figure grid there is a boxplot corresponding to every tuple
Figure 8: Extension of the mstate analysis framework by ebmstate. Arrows correspond to functions. Boxes correspond to inputs or outputs of functions. Functions CoxRFX and probtrans_fft from ebmstate compute point estimates only. Interval estimates can be obtained using the non-parametric bootstrap algorithm implemented in the function ebmstate::boot_ebmstate.
Figure 9: a: transition model implied by the data set of patients with myelodysplastic syndromes, together with transition event numbers; b: conversion to a transition structure without cycles; c: transformations applied to the MDS covariate data and summary statistics for the data before transformation. MDS stands for myelodysplastic syndromes; AML stands for acute myeloid leukemia.
Figure 10: Point estimates of regression coefficients for the Cox model fitted to the MDS data, along with 95% non-parametric bootstrap confidence intervals. The x-axis scale is logarithmic so that coefficient estimates can be read as relative hazard estimates. If
Figure 11: Point estimates of cumulative transition hazards for a sample patient with MDS (black curve), along with
Figure 12: Point estimates of state occupation probabilities for a sample patient with MDS (black curve), along with
Supplementary materials are available in addition to this article. It can be downloaded at RJ-2024-002.zip
msm, SemiMarkov, survival, mstate, mboost, gamboostMSM, penMSM, ebmstate
ClinicalTrials, Distributions, Econometrics, Epidemiology, MachineLearning, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Costa & Gerstung, "ebmstate: An R Package For Disease Progression Analysis Under Empirical Bayes Cox Models", The R Journal, 2025
BibTeX citation
@article{RJ-2024-002, author = {Costa, Rui J. and Gerstung, Moritz}, title = {ebmstate: An R Package For Disease Progression Analysis Under Empirical Bayes Cox Models}, journal = {The R Journal}, year = {2025}, note = {https://doi.org/10.32614/RJ-2024-002}, doi = {10.32614/RJ-2024-002}, volume = {16}, issue = {1}, issn = {2073-4859}, pages = {15-38} }