idmTPreg : Regression Model for Progressive Illness Death Data

The progressive illness-death model is frequently used in medical applications. For example, the model may be used to describe the disease process in cancer studies. We have developed a new R package called idmTPreg to estimate regression coefficients in datasets that can be described by the progressive illness-death model. The motivation for the development of the package is a recent contribution that enables the estimation of possibly time-varying covariate effects on the transition probabilities for a progressive illness-death data. The main feature of the package is that it befits both non-Markov and Markov progressive illness-death data. The package implements the introduced estimators obtained using a direct binomial regression approach. Also, variance estimates and confidence bands are implemented in the package. This article presents guidelines for the use of the package.


Introduction
In a classical survival study, patients start from an initial state "alive" and are followed up until the end of the study.They make a transition to the absorbing state "dead", unless they drop out of the study.In many medical studies, state "alive" includes two or more transient states.For example, in the illness trajectory of a cancer, a particular stage of the illness as a transient state is usually observed over time (e.g.cancer recurrence).Multi-state models are particularly useful for modeling the overall process of survival Hougaard (2000), Andersen and Keiding (2002).A multi-state model is a model for a continuous-time stochastic process allowing individuals to move among a finite number of states.A transition from one state to another one is the occurrence of one event of interest.The progressive illness-death model depicted in Figure 1 is a three-state multi-state model used in the medical literature to describe disease progression (Meira-Machado et al., 2008).It consists of three states: "Healthy" (state 1), "Diseased" (state 2) and "Dead" (state 3).All patients start in healthy state, then each patient can either have a transition directly to the dead state or they can be diseased before moving to the dead state.That means, the trajectory for a patient will be 1 −→ 3 or 1 −→ 2 −→ 3, but the entire trajectory might not be observed due to censoring.
Often, in the presence of censoring, estimating the effect of a set of prognostic factors on the course of the disease is an important target for the progressive illness-death model.The effect of prognostic factors for the Markov progressive illness-death model is generally modeled by Aalen´s additive model, Aalen et al. (2001).According to their model, the effects on transition intensities are estimated and, from them, the effects on transition probabilities (therefore, the conditional transition probabilities) can be calculated by solving the so-called forward Kolmogorov differential equation.But the violation of Markov condition does not allow such a calculation.To this end, Meira-Machado´s approach (Meira-Machado et al., 2014) based on kernel smoothing is an alternative to the Aalen model.The method considers the nonparametric estimation of conditional transition probabilities in a non-Markov illness-death Model that allows only conditioning on continuous covariates in low dimension.Azarang et al. (2017) proposed the direct binomial regression method for the transition probabilities in the right-censored progressive illness-death model.By applying this method, one can estimate the possibly time-varying regression coefficients as covariate effects on the transition probabilities.The method does not require the Markov assumption and has no restriction on the dimension of covariates.In addition, it can be applied for both continuous and categorical covariates.Based on this method, we have developed a software package in R, called idmTPreg (available from the Comprehensive R Archive Network at https://cran.r-project.org/web/packages/idmTPreg/).The progressive illness-death model illustrated in Figure 1 is the only progressive disease model supported by the idmTPreg package.Furthermore, several packages to analyse multi-state survival data like the progressive illnessdeath data are available on the Comprehensive R Archive Network (CRAN).For example, the mstate de Wreede et al. ( 2010), provides estimation of the transition probabilities possibly depending on covariates, the implemented technique assumes proportional transition intensities.The package can be applied to right censored and left truncated multi-state data.The msm package Jackson (2011), in terms of covariates, can be used to obtain conditional transition probabilities in continuous-time Markov and hidden-Markov multistate models for longitudinal data.In the p3state.msmpackage Meira-Machado and Roca-Pardiñas (2011), estimates of regression parameters can be obtained by assuming that each transition may be specified by a Cox-type model.To this end, one of the three possible options for the model can be chosen; This paper describes idmTPreg package and its capabilities through the following sections.In the next section, we outline the methodology of the direct modeling approach in the progressive illness-death model, the detailed mathematics underlying the package have been discussed in our previous paper.Then, we describe the package in full detail, and demonstrate how to apply the functions provided by the package through the analysis of a real dataset.Ultimately, the last section gives a summary of the work.

An overview of the methodology
In this section we briefly review the methodological background behind the idmTPreg package.As mentioned before, the progressive illness-death model presents two transient states (state 1 and state 2) and an absorbing state (state 3).The three possible transitions are shown by forward arrows in Figure 1.We assume here that recovery (transition from 2 to 1) is not possible.Five different transition probabilities of the model are: p 11 (s, t), p 12 (s, t), p 13 (s, t), p 22 (s, t) and p 23 (s, t); s and t (s < t) are times.Among the transition probabilities the following relations hold: p 11 (s, t) + p 12 (s, t) + p 13 (s, t) = 1 and p 22 (s, t) + p 23 (s, t) = 1.The transition probabilities p kl (s, t)´s, where k = 1, 2; l = 1, 2, 3 and k ≤ l, are defined as: p kl (s, t) = Pr(patient is at state l at time t | patient was at state k at time s.)Likewise, in the presence of covariates (X X X) the conditional transition probabilities denoted by p kl (s, t|X) are defined.Let Z be the sojourn time in state 1, T be the total survival time, and C be the censoring time of the model.The observed information is ( Z, ∆ 1 , T, ∆, X X X), where ∆ 1 and ∆ are the censoring indicator of Z and T respectively; Z = min(Z, C) and T = min(T, C); and X X X is the vector of time-independent covariates.In order to correct for estimation bias due to censoring, we update ∆ 1 and ∆ with time t and define ∆ t 1 and ∆ t as follows: Azarang et al. (2017) introduced regression modeling to estimate the possibly time-varying coefficients for the conditional transition probabilities (p kl (s, t|X X X), k = 1, 2; l = 1, 2, 3; k ≤ l) in the progressive illness-death model via defining binomial time-varying response variables (Y kl (t)'s), and time-varying random weights W s kl (t) (depending on transition from k to l the weights are either a function of ∆ t 1 or a function of ∆ t and are estimated by Ŵs kl (t), see Azarang et al. (2017) for more details).The response variables and weights for each transition probability are well defined so that:  kl (t) is linked to the transition probability p kl (s, t|X) via an allowable link function for the binomial family.Without the loss of generality, we consider logit link function:
As mentioned before, we consider s to be fixed and t ∈ [a, τ], where t is the last event time point, and p kl (s, a|x x x) > 0.Then, β β β (s) kl (t), which is a vector of possibly time-varying coefficients, can be estimated by solving the following score equations: where

Package description
The idmTPreg package provides estimates of the coefficients on transition probabilities for a progressive illness-death dataset.The package consists of 6 user-visible functions described in Table 1.In addition, there is an invisible function called mod.glm.fit that is contained inside the main function to fit generalized linear type models.This function is a modified version of glm.fit available in the R Stats Package.The modification was done to give special weights to the binary responses discussed in the previous section.The modified function mod.glm.fitgives the estimated vector of coefficients.Also, the call to mod.glm.fit has been programmed in parallel to obtain 95% pointwise bootstrap confidence bands.The number of cores for parallel execution is set to the number of CPU cores on the current host by default unless it is specified by the user.Then, registerDoParallel of the doParallel package is used to register the parallel backend.The parallel computation is performed by the foreach function of foreach package.
The main function, intended to be called by the user, is TPreg().The data frame to be passed into the main function of the package, other than covariates, must contain Zt, Tt, delta1, and delta variables (denoted by Z, T, ∆ 1 and ∆ in the previous section).Assessing these variables for a nonstatistician might be challenging, so iddata function is used to convert simple records of progressive illness-death data to the proper format.To this end, one may ask clinicians to give them the total survival time (Stime), the indicator of uncensored total survival time (Sind), the arrival time to the diseased state (Iltime), the indicator of visiting diseased state (Ilind) and a vector of covariates (cov).By convention for patients who have not been diseased (Ilind=0), their arrival time to the diseased state is recorded equal to their total survival time.

Example of application
To illustrate our method with the capability of our package, we consider the colon cancer dataset which is freely available as a part of the survival package.In this study, from 929 patients who had curative-intent resections of stage III colon cancer, 315 were randomly assigned to observation only, 310 to levamisole alone (Lev), and 304 to levamisole plus fluorouracil (Lev+5FU).See Moertel et al. (1990) for details.By the end of the study, 477 patients remained alive, 468 developed a recurrence, and 452 died and among these, 38 died without recurrence.The possible events for a patient may be described by the progressive illness-death model with states 1, 2 and 3 corresponding to "Alive and disease-free", "Alive with recurrence", and "Dead" respectively.A subset of colon cancer dataset called colonTPreg is available in the idmTPreg package, but just three risk factors were included in the dataset : Age (in years) and Nodes (number of lymph nodes with detectable cancer), and treatment.We use the colonTPreg dataset to demonstrate the functionality of the package.

Function Description
TPreg Fits the semi-parametric regression model to estimate the effects on transition probabilities for a sequence of time.
summary Gives details about the estimated effects of pre-specified transition probabilities for a sequence from a given s to a given t.
plot Makes a plot for the estimated effect of pre-specified transition probabilities along time, from time s to time t. print Provides the details about the estimated effects of pre-specified transition probabilities for given s and t.
iddata Converts a raw illness-death data to a data frame which can be passed into TPreg function (described before) .Each row in the data corresponds to a single individual.The columns Zt, and Tt are time variables, measured in days.For instance, patient 1 experienced a recurrence after 968 days (transition from state 1 to state 2), and died after 1521 days.Patient 2 was censored after 3087 days without having the recurrence.These data have a suitable format for the analysis.
Estimates for the effect of covariates on transition probabilities are obtained using function TPreg().The first argument of this function is an object of class "formula" which specifies the covariates on the right-hand side of the ∼ operator, that are separated by + operators.The left side of the ∼ operator is left empty because the time-dependent binary responses corresponding to each transition are defined in the main function.The data argument must be a data frame of "iddata" class or a data frame similar to colonTPreg format described previously.The link argument is a suitable link function for binomial family (logit, probit and cauchit).Argument s is the current time for the transition probabilities; default is zero which reports the occupation probabilities.Argument t is the Future time for the transition probabilities; default is NULL which is the largest uncensored sojourn time in the initial state.The R argument specifies the number of bootstrap replicates, default is 199.The argument trans indicates the possible transition(s) for a progressive illness-death model.The output is an object of class "TPreg".This object has its own print(), summary() and plot() methods.
We apply the method described in the previous section to estimate the effect of the risk factors on transition "Alive and disease-free"−→"Dead".To see the result for each time of the sequence from time s = 0 to 7 years with the default increment, we use the following input command: > co13 <-TPreg( ~Age + Nodes + treatment, colonTPreg, + link = "logit", s = 0, R = 99, + t = 365.24*7,trans = "13") > co13 The R Journal Vol.10/2, December 2018 ISSN 2073-4859 Call: TPreg(formula = ~Age + Nodes + treatment, data = colonTPreg, link = "logit", s = 0, t = 365.24* 7, R = 99, trans = "13") The print() method returns the results for s=0 and t= 2556.68 days; and provides 95% pointwise bootstrap confidence bands based on nonparametric resampling and normal method (normal approximation of two-sided nonparametric confidence interval).Then, using the summary() function one can obtain estimated values at each especified time between s=0 and t=2556.68.The estimates between the jump times of Y 13 in the time interval [s,t] are piecewise constant, and one can choose a vector of times from the jump times via argument by.Then, at the selected times the values are estimated.This argument gives the increment of the sequence from time s to time t.The default is max( Z) − min( Z) q 0.01 ( Z) , where q 0.01 (.) is the sample quantile corresponding to 0.01 probability and x gives the largest integer less than or equal to x.

> summary(co13)
Call: TPreg(formula = ~Age + Nodes + treatment, data = colonTPreg, link = "logit", s = 0, t = 365.24* 7, R = 99, trans = "13") (s,t): [1] 0.00 2556.68The plot() method is used to plot estimated regression coefficients with 95% confidence bands to visualize possible time-varying effects of covariates along time.Argument covar of plot.Tpreg() function indicates the covariates for which their effects are to be plotted.The argument rug ((TRUE) by default) adds a rug representation of times between time s and time t.And argument Ylim gives the list of limits for the y axes.
> plot(co13, covar = c("Age", "Nodes", "treatmentLev", "treatmentLev.5FU"),Ylim = list(c(-0.1,0.1),c(-0.5,0.5),c(-2,2), c(-2,2))) Figure 2 shows the plot corresponding the adjusted effects of Age, Nodes, Lev and Lev+5FU on transition probability p 13 .The covariate age, shows no significant effect of age on p 13 along time.The increasing number of nodes with detectable cancer significantly increases the probability of dying steadily over time.As for the effect of treatment, after 1000 days the Lev+5FU decreases the transition probability to the death state (in the long run), while no effect of Lev on p 13 is appreciated.By setting trans = all inside TPreg() function, plot() simultaneously displays the effect of prespesified covariate(s) on all transition probabilities.
We set by=1 and trans=11 to calculate all regression results on transition probability p 11 for all times between 0 and 7 years.Compared with a larger by, by=1 results in a longer time for R to run the code.

Summary
This paper describes the implementation of a flexible method in R for fitting a regression model to possibly non-Markov progressive illness-death data.The idmTPreg package offers the user the opportunity to estimate possibly time-varying effect of covariates on the transition probabilities for the progressive illness-death model.We have explained the use of the idmTPreg package by applying the method to a colon cancer dataset.The results in this paper were obtained using R 3.4.2.In a future version of the package, we plan to implement a similar method to estimate coefficients on net survivals for a progressive illness-death in a relative survival setting.
where E s for k = 1 is the expectation value conditioned on the event of being observed at the initial The R Journal Vol.10/2, December 2018 ISSN 2073-4859 state by a given time s (Z > s) and for k = 2 it is the expectation value conditioned on being observed at the intermediate state by time s (Z ≤ s < T).Then for a fixed time s, the linear predictor Xβ Xβ Xβ

Figure 2 :
Figure 2: Estimated effects of the Age (upper left corner), Nodes (upper right corner), Lev (lower left corner), and Lev+5FU (lower right corner) on the probability of transition from "Alive and disease-free" to "Dead" along time.

Figure 3 :
Figure 3: Estimated effects of the Age (upper left corner), Nodes (upper right corner), Lev (lower left -corner), and Lev+5FU (lower right corner) on disease-free survival along time.
TDCM, CMM or CSMM.TDCM (time dependent Cox model) associates a time dependent covariate with the occurrence of disease, in CMM (Cox Markov model) it is assumed that the future is independent of the past given the present state, and the CSMM model (Cox semi-Markov model) emphasizes the importance of time spent in the current state, Survival.
Note that the expected value of the above score functions equals zero.Since the estimates of the coefficients, β β β (t), are piecewise constant between the jump times of Y kl (that are the jump times of W s kl too) we can fit the standard approach for generalised linear models at each jump time of the corresponding response variable.Therefore, for each t between a and τ, β β β kl kl (t) are given.

Table 1 :
Functions and summary of their descriptions in the idmTPreg package.