JMcmprsk: An R Package for Joint Modelling of Longitudinal and Survival Data with Competing Risks

In this paper, we describe an R package named JMcmprsk, for joint modelling of longitudinal and survival data with competing risks. The package in its current version implements two joint models of longitudinal and survival data proposed to handle competing risks survival data together with continuous and ordinal longitudinal outcomes respectively (Elashoff et al., 2008; Li et al., 2010). The corresponding R implementations are further illustrated with real examples. The package also provides simulation functions to simulate datasets for joint modelling with continuous or ordinal outcomes under the competing risks scenario, which provide useful tools to validate and evaluate new joint modelling methods.


Introduction
Joint modelling of longitudinal and survival data has drawn a lot of attention over the past two decades. Much of the research has focused on data with a single event time and a single type of failure, usually under the assumption of independent censoring of event times (Tsiatis and Davidian, 2004). However, in some situations interest lies with competing risks data, where there is more than one possible cause of event or where the censoring is informative (Williamson et al., 2008). Typically, standard linear mixed model or its extensions are used for the longitudinal submodel. Cause-specific hazards model with either unspecified or spline baseline hazards are studied for the competing risk submodels. Various types of random effects are assumed to account for the association between these submodels.
Despite of various theoretical and methodological developments (Hickey et al., 2018b;Papageorgiou et al., 2019), there are still limited software packages to deal with specific problems in the analysis of follow-up data in clinical studies. To our knowledge, currently there are three related CRAN R packages, namely JM (Rizopoulos, 2012), joineR (Williamson et al., 2008) and lcmm (Proust-Lima et al., 2017), which support the modeling of longitudinal and survival data with competing risks.
The JM package provides support for competing risks via the "CompRisk" option in the jointModel() function. In JM, a linear mixed effects submodel is modelled for the longitudinal part and a relative risk submodel is assumed for each competing event. In the current version (1.4-8), only piecewise proportional hazards model, where the log baseline hazard is approximated using B-splines, is supported for the survival component. The joineR package fits the joint model (Williamson et al., 2008) for joint models of longitudinal data and competing risks using the joint() function. In their model, the time-to-event data is modelled using a cause-specific Cox proportional hazards regression model with time-varying covariates. The longitudinal outcome is modelled using a linear mixed effects model. The association is captured by a zero-mean shared latent Gaussian process. Parameters in the model are estimated using an Expectation Maximization(EM) algorithm. The lcmm package implements the support for competing risks joint modelling in the Jointlcmm() function. Radically different from the above two R packages, the lcmm package uses a less well-known framework called the joint latent class model (Proust-Lima et al., 2014), which assumes that dependency between the longitudinal markers and the survival risk can be captured by a latent class structure entirely. However, the lcmm package is mainly designed for prediction purpose and may not be suited to evaluate specific assumptions regarding the characteristics of the marker trajectory that are the most influential on the event risk (Proust-Lima et al., 2014).
In all these packages, a time independent shared random effects vector is usually assumed in modelling the longitudinal and survival data. However, they are not capable of fitting more flexible models with separate random effects in these submodels (Elashoff et al., 2008;Li et al., 2010). In many biomedical applications, sometimes, it is necessary to have a model which takes into account longitudinal ordinal outcomes for the longitudinal part. Yet, due to the complex nature of joint modelling, most of the available software does not support longitudinal ordinal variables (Armero et al., 2016;Ferrer, 2017). We fill this gap in our package by implementing a joint model which supports ordinal disease markers based on our previous work (Li et al., 2010).
Both JM and joineR packages depend heavily on the R nlme and survival packages. In JM, the linear mixed effects submodel and the survival submodel are first fitted using lme() and coxph() R function in these packages before a joint modelling process. In joineR, lme() and coxph() functions are applied to obtain initial values for parameters in the joint model, which are further estimated by an EM algorithm. The major advantage of using available packages such as survival and nlme lies that joint modelling R packages can be built quickly with adequate efficiency as most of these base R packages have been optimized for speed. However, if required functionality is not available in these packages, as is the case of Elashoff et al. (2008) and Li et al. (2010), implementing new joint modelling methods is a non-trivial task.
Compared with JM and joineR packages, the JMcmprsk package introduced here can be regarded as a "stand-alone" R package, which does not required initial estimates for the linear mixed effects model or survival submodel to compute parameters of the joint model in question. In particular, the JMcmprsk package is built within the Rcpp (Eddelbuettel et al., 2011) and GSL(The GNU Scientific Library) (Galassi et al., 2002) framework, which make R functions have access to a wide range of fast numerical routines such as Monte Carlo integration, numerical integration and differentiation routines.

Joint Models with Competing Risks
A joint model for competing risk data consists of two linked components: the longitudinal submodel, which takes care of repeatedly measured information, and the survival submodel, which deals with multiple failure times. Combination of different longitudinal and survival components leads to a variety of joint models (Hickey et al., 2018a).
In the current version of JMcmprsk, we have implemented two joint models for competing risk data, namely joint modelling with continuous longitudinal outcomes (Elashoff et al., 2008), and joint modelling with ordinal longitudinal outcomes (Li et al., 2010). Both models have adopted a causespecific Cox submodel with a frailty term for multiple survival endpoints. The difference between these two models lies in the longitudinal part. The former model applies a linear mixed submodel for the continuous longitudinal outcome while the latter model includes a partial proportional odds submodel for the ordinal longitudinal outcome.
Different from previous approaches (Rizopoulos, 2012;Williamson et al., 2008), we assume a flexible separate random effects structure for the longitudinal submodel and the survival submodel. Furthermore, the association between both submodels is modeled by the assumption that the random effects in two submodels jointly have a multivariate normal distribution.

Model 1: Joint modelling with continuous longitudinal outcomes
Let Y i (t) be the longitudinal outcome measured at time t for subject i, i = 1, 2, · · · , n, and n is the total number of subjects in study. Let C i = (T i , D i ) denote the competing risks data on subject i, where T i is the failure time or censoring time, and D i takes value in {0, 1, · · · , g}, with D i = 0 indicating a censored event and D i = k showing that subject i fails from the kth type of failure, where k = 1, · · · , g.
The joint model is specified in terms of the following two linked submodels: i (t) denote the covariates for the fixed-effects β and γ k ,X i (t) denotes the covariates for the random-effects b i , and i (t) ∼ N(0, σ 2 ) for all t ≥ 0. The parameter ν 1 is set to 1 to ensure identifiability. We assume that b i is independent of i (t) and that i (t 1 ) is independent of i (t 2 ) for any t 1 = t 2 . We further assume the random effects b i and u i jointly have a multivariate normal distribution, denoted by Denote Ψ as the unknown parameters from the joint models. We propose to obtain the maximum likelihood estimate of Ψ through an EM algorithm. The complete data likelihood is In the E-step, we need to calculate the expected value of all the functions of θ. Since, the integral over the random effects does not have a closed-form solution, an iterative numerical method has to be employed.
In JMcmprsk, the integral over time is approximated using a Gauss-Kronrod quadrature and the computation of the integral over the individual random effects is achieved using a Gauss-Hermite quadrature. The quadrature approximates the integral using a weighted sum of function values, at specified points within the domain of integration; the Gaussian quadrature is based on the use of polynomial functions. Standard option here is the Gaussian quadratic rules. In the M-step, Ψ is updated by maximizing the functions obtained from the E-step.

Model 2: Joint modelling with ordinal longitudinal outcomes
Let Y ij denote the jth response measured on subject i, where i = 1, · · · , n, j = 1, · · · , n i , and Y ij takes values in {1, · · · , K}. The competing risks failure times on subject i is (T i , D i ), and the notations have the same meaning as in Model 1.
we propose the following partial proportional odds model for Y ij where k = 1, · · · , K − 1, X ij andX ij are p × 1 and s × 1 vectors of covariates for the fixed-effect β and α k , with α 1 = 0, andX ij is a subset of X ij for which the proportional odds assumption may not be satisfied. The q × 1 vector b i represents random effects of the associated covariates W ij . The distribution of the competing risks failure times (T i , D i ) are assumed to take the form of the following cause-specific hazards frailty model: where the l × 1 vector γ d and ν d are cause-specific coefficients for the covariates Z i (t) and the random effects u i , respectively.
The parameter ν 1 is set to 1 to ensure identifiability. We assume the random effects b i and u i jointly have a multivariate normal distribution, denoted by a i ∼ N(0, Σ).
Denote Ψ as the unknown parameters from the joint models. We propose to obtain the maximum likelihood estimate of Ψ through an EM algorithm. The complete data likelihood is where π ij (k) stands for the probability that Y ij ≤ k given the covariates and the random effects. The implementation of EM algorithm in this case is similar to the procedure of Model 1.

Package structure and functionality
The R package JMcmprsk implements the above two joint models on the basis of R package Rcpp (Eddelbuettel et al., 2011) and GSL library (Galassi et al., 2002) and is hosted at CRAN. After setting the GSL environment by following the instructions in the INSTALL file from the package, we can issue the following command in R console to install the package: > install.packages("JMcmprsk") There are two major functions included in the JMcmprsk package: the function that fits continuous outcomes jmc() and the function that fits ordinal outcomes jmo().

jmc() function
As an illustrative example of jmc(), we consider Scleroderma Lung Study (Tashkin et al., 2006), a double-blinded, randomized clinical trial to evaluate effectiveness of oral cyclophosphamide (CYC) versus placebo in the treatment of lung disease due to scleroderma. This study consists of 158 patients, the primary outcome is forced vital capacity (FVC, as % predicted) determined at 3-month intervals from the baseline. The event of interest is the time-to-treatment failure or death. We consider two covariates, baseline %FVC (FVC 0 ) and baseline lung fibrosis (FIB 0 ) and two risks, informative and noninformative. The model setups are as follows, and the cause-specific hazards frailty models are We first load the package and the data: -unique(lung[, c(1, 12, 13, 6:10)]) The number of rows in "yread" is the total number of measurements for all subjects in the study. For "cread", the survival / censoring time is included in the first column, and the failure type coded as 0 (censored events), 1 (risk 1), or 2 (risk 2) is given in the second column. Two competing risks are assumed.
Then, "yread" and "cread" are used as the longitudinal and survival input data for the model specified by the function jmc() as shown below: jmcfit <-jmc(long_data = yread, surv_data = cread, out = "FVC", FE = c("time", "FVC0", "FIB0", "CYC", "FVC0.CYC", "FIB0.CYC", "time.CYC"), RE = "linear", ID = "ID",cate = NULL, intcpt = 0, quad.points = 20, quiet = TRUE, do.trace = FALSE) where out is the name of the outcome variable in the longitudinal sub-model, FE the list of covariates for the fixed effects in the longitudinal sub-model, RE the types/vector of random effects in the longitudinal sub-model, ID the column name of subject id, cate the list of categorical variables for the fixed effects in the longitudinal sub-model, intcpt the indicator of random intercept coded as 1 (yes, default) or 0(no). The option quiet is used to print the progress of function, default is TRUE (no printing).
A concise summary of the results can be obtained using jmcfit as shown below.
The supporting function coef() can be used to extract the coefficients of the longitudinal / survival process by specifying the argument coe f f , where"beta" and "gamma" denotes the longitudinal and survival sub-model fixed effects respectively: beta <-coef(jmcfit, coeff = "beta") >beta intercept time. The supporting function summary() can be used to extract the point estimate, standard error, 95%CI, and p-values for the coefficients of both sub-models with the option coe f f to specify which sub-model fixed effects one would like to extract, and digits, the number of digits to be printed out. We proceed below to extract the fixed effects for both submodels. We proceed to test the global hypothesis for the longitudinal and the survival submodels using linearTest(). >linearTest(jmcfit, coeff="beta") Chisq df Pr(>|Chi|) L*beta=Cb 1072.307 7 0.0000 >linearTest(jmcfit, coeff="gamma") Chisq df Pr(>|Chi|) L*gamma=Cg 11.06558 10 0.3524 The results suggest that the hypothesis β 1 = β 2 = · · · = β 7 = 0 be rejected, and the hypothesis γ 11 = γ 12 = · · · = γ 15 = γ 21 = γ 22 = · · · = γ 25 = 0 not be rejected at the significance level of 0.05.
linearTest() can also be used to test any linear hypothesis about the coefficients for each submodel. For example, if one wants to test H 0 : β 1 = β 2 in the longitudinal sub-model, then we start with a linear contrast Lb and pass it to linearTest().
Similarly, linear hypotheses test can also be performed for the survival sub-model using linearTest(). For example, if we want to test H 0 : γ 11 = γ 21 , then we start with another linear contrast Lg and pass it to linearTest().

jmo() function
The implementation of jmo() is very similar to that of jmc(). As an illustrative example, we use the data from (rt PA Stroke Study, 1995). In this study, 624 patients are included, and the patients treated with rt-PA were compared with those given placebo to look for an improvement from baseline in the score on the modified Rankin scale, an ordinal measure of degree of disability with categories ranging from no symptoms, no significant disability to severe disability or death, which means in this example, Y ij takes K = 4 ordinal values. The following covariates are considered: treatment group (rt-PA or placebo), modified Rankin scale prior stroke onset, time since randomization (dummy variables for 3, 6 and 12 months), and the three subtypes of acute stroke (small vessel occlusive disease, large vessel atherosclerosis or cardioembolic stroke, and unknown reasons). Similarly, we also consider the informative and noninformative risks. The model setups are as follows, +β 4 time6 + β 5 time12 + β 6 Small vessel + β 7 Large vessel or cardioembolic stroke +β 8 Small vessel*group + β 9 Large vessel or cardioembolic stroke*group) −(α k1 Small vessel + α k2 Large vessel or cardioembolic stroke) − b i )] −1 .
Likewise, jmo() function allows for categorical variables. Moreover, categorical variables are allowed for setting up non-proportional odds covariates. As an illustration, we consider the "sex" and "race" variables and use them as two of the non-proportional odds covariates. Below is another example.

Older versions of jmc() and jmo()
In the previous versions of JMcmprsk, both the previous jmc() and jmo() functions require the longitudinal input data "yfile" to be in a specific format regarding the order of the outcome variable and the random and fixed effects covariates. It also requires users to create an additional "mfile" for the longitudinal data. At the suggestions of the reviewers, in the most recent version, we focus and develop user-friendly versions of these functions.
However, for both package consistency and user's convenience, we still keep older versions of these functions in the package, and rename these functions to jmc_0() and jmo_0(), respectively. Supporting functions of jmo() and jmc(), such as coef(), summary(), linearTest(), also apply to jmc_0() and jmo_0() functions.
Here, we show the usage of jmc_0() with some simulated data and the "lung" data used in presenting jmc() functions.

Computational Complexity
To understand the computational complexity of both jmc() and jmo() models, we carried out a variety of simulations with different sample size and different proportions of events. However, there was no clear trend observed between proportions of events and running times. Hence, only one event distribution with different sample sizes are given here for illustration purpose. According to Figures 1 and 2, we can easily see that the run time grows much faster as sample size increases, which implies that the computational complexity does not follow a linear order. In this case, it will limit joint models to handling large and even moderate sample size data. To make the joint modeling more scalable, it is necessary to carry out a novel algorithm to reduce its computational complexity to a linear order. Data setup: p = 4, n q = 6, 10.4% censoring, 51.4% risk 1, and 38.2% risk 2. The run time under each sample size was based on one random sample.

Data Simulation
A simulation can generate datasets with an exact ground truth for evaluation. Hence, the simulation of longitudinal and survival data with multiple failures associated by random effects is an important measure to assess the performance of joint modelling approaches dealing with competing risks. In JMcmprsk, simulation tools are based on the data models proposed in Elashoff et al. (2008) and Li et al. (2010), which can be used for testing joint models with continuous and ordinal longitudinal outcomes, respectively.
The main function for simulation data continuous longitudinal outcomes and survival data with multiple event outcomes is called SimDataC(), which has the following usage: SimDataC (k_val, p1_val, p1a_val, p2_val, g_val, truebeta, truegamma, randeffect, yfn, cfn, mfn) We briefly explain some of the important options. k_val denotes the number of subjects in study; p1_val and p1a_val denote the dimension of fixed effects and random effects in longitudinal measurements, respectively; p2_val and g_val denotes the dimension of fixed effects and number to competing risks in survival data; truebeta and truegamma represent the true values of fixe effects in he longitudinal and the survival submodels, respectively. randeffect set the true values for random effects in longitudinal and competing risks parts,namely in the order of σ,σ b ,ν 2 and σ u .

Conclusions and Future Work
In this paper, we have illustrated the capabilities of package JMcmprsk for fitting joint models of time-toevent data with competing risks for two types of longitudinal data. We also presents simulation tools to generate joint model datasets under different settings. Several extensions of JMcmprsk package are planned to further expand on what is currently available. First, as the integral over the random effects becomes computationally burdensome in case of high dimensionality, Laplace approximations or other Gauss-Hermite quadrature rules would be applied to the E-M step to speed up the computation procedure. Second, with increasing need of predictive tools for personalized medicine, dynamic predictions for the aforementioned joint models will be added. Third, other new joint models such as joint analysis for bivariate longitudinal ordinal outcomes will be included.