CopulaCenR: Copula based Regression Models for Bivariate Censored Data in R

Bivariate time-to-event data frequently arise in research areas such as clinical trials and epidemiological studies, where the occurrence of two events are correlated. In many cases, the exact event times are unknown due to censoring. The copula model is a popular approach for modeling correlated bivariate censored data, in which the two marginal distributions and the betweenmargin dependence are modeled separately. This article presents the R package CopulaCenR, which is designed for modeling and testing bivariate data under right or (general) interval censoring in a regression setting. It provides a variety of Archimedean copula functions including a flexible two-parameter copula and different types of regression models (parametric and semiparametric) for marginal distributions. In particular, it implements a semiparametric transformation model for the margins with proportional hazards and proportional odds models being its special cases. The numerical optimization is based on a novel two-step algorithm. For the regression parameters, three likelihood-based tests (Wald, generalized score and likelihood ratio tests) are also provided. We use two real data examples to illustrate the key functions in CopulaCenR.


Introduction
Bivariate data arise frequently in many research areas such as health, epidemiology, and economics. For example, bivariate time-to-event endpoints are often used in clinical trials studying bilateral diseases (e.g., eye diseases) or complex diseases (e.g., cancer and psychiatric disorders). The two events are correlated as they come from the same subject. In many situations, the two event times cannot be precisely observed, leading to bivariate censored data. Specifically, bivariate right-censored data occur when the study ends prior to the occurrence of one or both events. An example of such data comes from a clinical study assessing the treatment effect on preventing blindness in Diabetic Retinopathy patients where each patient had one eye randomized to the treatment and the other eye received no treatment (Huster et al., 1989), and the time-to-blindness are bivariate and rightcensored. We will illustrate the analysis of this study in Section 2.4. In another situation, bivariate interval-censored data occur when the status of both events are periodically examined at intermittent assessment times. In this case, the right censoring could also happen if the event still does not occur at the last assessment time. A special case of interval-censored data is the current status data if there is only one assessment time and the event is only known to occur or not by its assessment time. An example of bivariate interval-censored data will be demonstrated in Section 2.4, which came from a clinical trial studying the progression of a bilateral eye disease, Age-related Macular Degeneration (AMD), where the progression time to late-AMD are interval or right censored (AREDS Group, 1999). More examples can be found in books Hougaard (2000) and Sun (2007).
The development of our package is motivated by researches that are interested in (1) discovering covariates that are significantly associated with the bivariate censored outcomes, and (2) characterizing the joint and conditional risks of two events. For the bivariate events, the joint and conditional risks could be clinically more important than the marginal risk (of a single event). For example, the joint 5-year progression-free probability for both eyes helps identify patients with a high risk of progressing to late-AMD. For another example, for patients having one eye already progressed, the conditional 5-year progression-free probability for the non-progressed eye (given its fellow eye already progressed) provides important information for both clinicians and the patient since patients with both eyes progressed to the late stage of the disease may lose the ability to live independently.
There are three major approaches to fit regression models for bivariate censored data. The simplest way is to fit a marginal model and estimate the variance-covariance by a robust sandwich estimator (for example, Wei et al., 1989). This approach takes a working independence assumption, and thus cannot generate joint or conditional distributions. The second approach is based on frailty models (for example, Oakes, 1982), which are essentially mixed effects models and account for the dependence between two events by a latent frailty variable. However, the covariate effects in frailty models are usually interpreted on a conditional level (by conditioning on the frailty term), which is not straightforward. The third approach is to use copula models (for example, Clayton, 1978). Unlike the marginal or frailty approaches, the copula approach models the joint survival distribution by directly connecting the two marginal distributions through a copula function. One unique advantage of the copula is that it separately models the marginal distributions and the dependence parameter(s), allowing flexibility in marginal models and straightforward interpretation for covariate effects. Moreover, the challenge from censoring can be naturally handled through the marginal distributions within the copula function. Besides, the joint and conditional distributions can be obtained based on the copula model. Along with these three major approaches, multiple endeavors have been devoted to the development of software, mostly R (R Core Team, 2019) packages, to build regression models for bivariate censored data. For bivariate right-censored data, the survival (Therneau, 2018b) package can fit parametric or semiparametric Cox (Cox, 1972) marginal and frailty models. Also, packages such as parfm (Munda et al., 2012) and frailtypack (Rondeau et al., 2012) implement proportional hazards (PH) frailty models under the parametric and semiparametric settings. Other R packages such as coxme (Therneau, 2018a) and phmm (Donohue and Xu, 2019) also fit PH frailty models for right-censored data. For bivariate interval-censored data, the survival and frailtypack packages provide marginal and frailty models under the parametric or semiparametric (Cox PH) situation, respectively. The C++ program IntCens (codes located under https://dlin.web.unc.edu/software/intcens/) implements a class of semiparametric frailty models, including both PH and proportional odds (PO) models.
To the best of our knowledge, there exists no R package for fitting copula-based regression models for both bivariate right-censored and interval-censored data. The existing copula packages for bivariate data handle either the non-censoring (i.e., complete data) or the right-censoring situation. In the non-censoring situation, the package copula (Hofert et al., 2018) by Yan (2007) and Kojadinovic and Yan (2010) implements multivariate copula models without covariates for complete data and obtains the maximum likelihood estimator for the copula dependence parameter. It gives useful codes for implementing regression models in bivariate complete data in the appendix of Yan (2007). It also provides copula goodness-of-fit tests for model selection purpose. The package VineCopula (Schepsmeier et al., 2018) can also model bivariate or multivariate complete data without covariates through the vine copula models (Aas et al., 2009). Packages such as CopulaRegression (Nicole Kraemer, 2014) and gcmr (Masarotto and Varin, 2017) can provide copula-based regression models with parametric margins for bivariate or multivariate complete data and provide maximum likelihood estimators for model parameters. The package gamCopula (Nagler and Vatter, 2020) implements a generalized additive model that can take into account the effect of the predictors on the dependence structure of bivariate and vine copula models (Vatter and Chavez-Demoulin, 2015). For the right-censoring situation, the Copula.surv package (Emura, 2018) can estimate the Clayton copula dependence parameter in bivariate right-censored data without covariates and also perform a goodness-of-fit test for a fitted Clayton model (Emura et al., 2010). The Sunclarco package (Prenen et al., 2017b) provides Clayton or Gumbel copula-based regression models with parametric (Weibull and piecewise constant) or Cox semiparametric margins for multivariate right-censored data (Prenen et al., 2017a). The package GJRM (Marra and Radice, 2020) can fit both marginal and copula regression models in complete and right-censored data Marra and Radice, 2019). By far, there is no copula-based R package for bivariate interval-censored data.
We develop the CopulaCenR package, which fits copula-based regression models for both bivariate right-censored and interval-censored data. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=CopulaCenR. The main advantage of CopulaCenR relies on the diverse choice of copula and marginal models for both bivariate right-censored and interval-censored data. Specifically, it provides a class of Archimedean copulas that correspond to a variety of dependence structures, as illustrated in Table 1. In particular, in addition to these frequently used one-parameter Archimedean copulas, a two-parameter copula function (Copula2) is also included. This Copula2 has more flexibility in modeling dependence structure, as we show in Section 2.2. Furthermore, CopulaCenR implements a list of parametric and semiparametric marginal regression models, as illustrated in Table 2. For parameter estimation, the package utilizes a novel two-step procedure that is computationally stable and efficient. For the inference of regression parameters, three likelihood-based tests such as Wald, generalized score and likelihood ratio tests are provided.
We will describe the major features of CopulaCenR in Section 2.2 and presents the model and estimation procedure in Section 2.3. We will demonstrate two real data examples in Section 2.4 using the version 1.1.2 of CopulaCenR. Finally, we will conclude and discuss in Section 2.5.

Package Features
The most popular copula family for bivariate censored data is the Archimedean copula family, which has an explicit form of where u and v are two uniformly distributed margins; φ η is the generator function, which is a continuous, strictly decreasing and convex function; φ −1 η is the inverse of φ η . One generator function uniquely determines an Archimedean copula. The copula parameter η has a one-to-one correspondence with the popular dependence measure Kendall's τ. Another property of the copula is the tail dependence (i.e., τ L and τ U for lower and upper tail dependence), which measure the dependence between two margins in the lower and upper tails. More details about Archimedean copulas can be found in Nelsen (2006). Table 1 lists six Archimedean family copula models that are implemented in CopulaCenR. Two most frequently used Archimedean copulas are Clayton (Clayton, 1978) and Gumbel (Gumbel, 1960) models, which account for the lower or upper tail dependence between two margins using a single parameter η. Other Archimedean copulas, such as Frank (Frank, 1979), Joe (Joe, 1993) and Ali-Mikhail-Haq (AMH) (Ali et al., 1978), are also one-parameter copulas. In addition to these five copulas, we also include a flexible two-parameter Archimedean copula model (Joe and Hu, 1996;Joe, 1997), namely, Copula2 (also called the "BB1" family), which is formulated as (1) The two dependence parameters (α and κ) are explicitly connected to Kendall's τ with τ = 1 − 2ακ/(2κ + 1), and they account for the correlation between u and v at upper and lower tails. In particular, when α = 1, Copula2 becomes the Clayton copula, and when κ → ∞, it becomes the Gumbel copula. Thus, the two-parameter copula model provides more flexibility in modeling the between-margin dependence than the one-parameter copulas such as Clayton or Gumbel (Joe, 2014). Figure 1 illustrates the scatter plots of bivariate event times generated from the six copula models in Table 1.
τ L and τ U are the lower and upper tail dependence measures.  To fit a copula-based regression model, one also needs to choose a regression model for the margins. Table 2 lists the available marginal models in CopulaCenR. For bivariate right-censored data, users can fit either a parametric marginal model via the function rc_par_copula or a semiparametric Cox PH model via the function rc_spCox_copula . Specifically, the parametric models incorporate both the PH (e.g., Weibull, Gompertz) and the PO (e.g., Loglogistic) models. For bivariate interval-censored data, one can choose to fit a parametric marginal model via the function ic_par_copula. Moreover, the package can also fit a semiparametric transformation model via the function ic_spTran_copula. It contains a variety of marginal models including the PH and PO models, as we explain in Section 2.3.3. A novel two-step sieve estimation procedure is implemented . For the inference of the covariate effects, three types of likelihood-based tests are implemented in CopulaCenR: the Wald test (built within rc_par_copula, rc_spCox_copula, ic_par_copula, and ic_spTran_copula), the generalized score test (score_copula) and the likelihood-ratio test (lrt_copula). After a copula model being fitted, fitted values (i.e., linear predictors, survival probabilities) can be extracted by the general S3 function fitted. For new observations, the linear predictors and survival probabilities can be obtained using the function predict. Moreover, the user can plot three types of distributions (joint, conditional and marginal) using the general functions plot and lines. In particular, an interactive 3D contour will be plotted to visualize the joint distribution.
Besides, the package provides a bivariate event time generating function data_sim_copula, which can generate random bivariate event times based on a specified copula function, a marginal distribution, and covariate values.
In summary, the key functions of CopulaCenR are • rc_par_copula: for fitting parametric regression models to bivariate right-censored data; • rc_spCox_copula: for fitting a semiparametric Cox regression model to bivariate right-censored data; • ic_par_copula: for fitting parametric regression models to bivariate interval-censored data; • ic_spTran_copula: for fitting a semiparametric transformation model to bivariate intervalcensored data; • score_copula: for performing the generalized score test on covariate effects; • lrt_copula: for performing the likelihood ratio test (LRT) on covariate effects between two nested models; • tau_copula: for calculating Kendall's τ from copula parameter estimates; • plot,lines: S3 methods for plotting joint, conditional and marginal distributions based on a fitted copula model; • fitted,predict: S3 methods for extracting fitted values and predicting new observations; • summary,print,coef,logLik,AIC,BIC: other S3 functions for a fitted object; • data_sim_copula: for generating bivariate event times through a specified copula model and marginal distributions.
We use two real data examples to illustrate the implementation of these functions in Section 2.4.

Copula model for bivariate censored data
Let (T 1 , T 2 ) be the true bivariate event times, with marginal survival functions S j (t j ) = Pr(T j > t j ), j = 1, 2, and joint survival function S(t 1 , t 2 ) = Pr(T 1 > t 1 , T 2 > t 2 ). Assume there are n independent subjects in a study. When (T 1 , T 2 ) are subject to right-censoring, for subject i = 1 · · · n, we observe where C ij is the censoring time of T ij , ∆ ij is the censoring indicator and Z ij is the covariate vector. When (T 1 , T 2 ) are under interval-censoring, we observe is the time interval that T ij lies in and Z ij is the covariate vector.
By the Sklar's theorem (Sklar, 1959), so long as the marginal survival functions S j are continuous, there exists a unique function C η that connects two marginal survival functions into the joint survival function: S(t 1 , t 2 ) = C η {S 1 (t 1 ), S 2 (t 2 )}, t 1 , t 2 ≥ 0. Here, the function C η is called a copula and its parameter η measures the dependence between the two margins. A signature feature of the copula is that it allows the dependence to be modeled separately from the marginal distributions.

Joint likelihood functions for bivariate censored data
In this section, we present the joint likelihood functions for bivariate right-censored data and bivariate interval-censored data, respectively.
Define the density function for copula denote the corresponding density function of S(t 1 , t 2 ). Denote by θ = (β = (β 1 , β 2 ), η, S 01 , S 02 ) all the unknown parameters in S(t 1 , t 2 ), where β j is the regression coefficient vector and S 0j is the baseline survival function for the jth margin. Then, the joint likelihood for the observed data D = {D i } n i=1 can be written as where (δ i1 , δ i2 ) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}. Similarly, using the notation introduced in Section 2.3.1, the joint likelihood function for bivariate interval-censored data from n subjects can be written as The right interval R ij can take values in (0, ∞]. For a given subject i, if R ij = ∞ (i.e., T ij is rightcensored), then any term involving R ij becomes 0 and the joint survival function for subject i reduces to only one (if both R i1 and R i2 are ∞) or two (if one R ij is ∞) terms. The special case of bivariate current status data (i.e., only one assessment time for each subject) can also fit into this framework, where for each T ij , either L ij = 0 (T ij is smaller than the assessment time, which is R ij in this case) or R ij = ∞ (T ij is greater than the assessment time, which is L ij in this case). Therefore, the likelihood function (3) can handle the bivariate data under general interval-censoring.

Marginal models
We implement several popular parametric marginal models in CopulaCenR, as shown in Table 2. For example, the marginal Weibull survival distribution can be written as where λ j and k j are the scale and shape parameters of the baseline Weibull distribution, and β j are the covariate effects. The model follows the PH assumption. In this case, the parameter set θ becomes (β , η, λ 1 , k 1 , λ 2 , k 2 ) . Other parametric distributions including Gompertz and Loglogistic are also implemented in the package.
More generally, we implement the semiparametric Cox PH marginal model for bivariate rightcensored data. The model does not specify the marginal distribution for the baseline hazards function. Instead, the baseline hazards are treated as piecewise constants between all uncensored event times as suggested by Breslow (1972). The model is expressed as in which the Breslow baseline cumulative hazard function Λ j (t) is given by We also consider a class of semiparametric linear transformation models for the marginal distribution of the interval-censored data. The model is expressed as: Λ j (·) is an unknown and non-decreasing function of t, which is not necessarily the baseline cumulative hazards function. In CopulaCenR, we approximate Λ j in a sieve space constructed by Bernstein polynomials. A Bernstein basis polynomial with degree m is expressed as: where l and u are the lower and upper bounds of all observed times. One big advantage of Bernstein polynomials is that they do not require the specification of interior knots, as seen from (5), making them easy to work with. More details can be found in .
In model (4), G j (·) is a pre-specified strictly increasing function, such as the Box-Cox and the logarithmic transformation functions. The package uses a G(·) function as specified in Zhou et al. (2017): Note that the model (4) contains a class of survival models. For example, when G(x) = x at r = 1, the marginal function S j (t|Z) becomes exp{−Λ j (t)e Z β j }, which is essentially a PH model. Likewise, when G j (x) = log(1 + x) at r = 3, S j (t|Z) becomes {1 + Λ j (t)e Z β j } −1 , which is a PO model. In practice, the value of r can be either selected according to model AIC or treated unknown and estimated together with other model parameters.

Two-step estimation procedure
In this section, we illustrate the estimation procedure for the unknown parameter θ. For simplicity, we use the general notation θ = (β 1 , β 2 , η, S 01 , S 02 ) throughout this section. In principle, we can maximize the joint log-likelihood function based on formula (2) or (3) directly, written as l n (θ|D) = log L n (θ|D) = ∑ n i=1 log L(θ|D i ). Due to the complex structure of the log-likelihood function, we implement a novel two-step estimation procedure, which is proven to be computationally more stable and efficient than the one-step procedure, as shown in  and . Essentially, the two-step procedure implements an extra step to obtain appropriate initial values for all the unknown parameters. The estimation procedure is described below: 1. Obtain initial estimates of θ n : l jn (β j , S 0j ), where l jn denotes the log-likelihood for the marginal model, j = 1, 2; (1) 0j are the initial estimates from (a), and l n is the joint log-likelihood.
[Remark 1.] In the case of semiparametric Cox PH margins (with the Breslow baseline cumulative hazard estimator), although the maximum likelihood estimators from step 2 are consistent and asymptotically normal, the Hessian matrix cannot be directly used for estimating the variance-covariance matrix of (β,η) . Therefore, the bootstrap procedure is implemented in the package for producing a valid variance-covariance estimator.
[Remark 2.] In the case of semiparametric transformation model margins (with the use of Bernstein polynomials), the two-step estimation procedure becomes a two-step "sieve" estimation procedure.  rigorously proved the asymptotic properties of the sieve maximum likelihood estimators.
The main model-fitting functions (rc_par_copula, rc_spCox_copula, ic_par_copula and ic_spTran_copula) provide a built-in optimization option, which is a wrapper to the optimization routines optim and nlm in R.

Likelihood-based tests for covariate effects
We now separate β into two parts: β g and β ng , where β g is the parameter set of interest for hypothesis testing and β ng denotes the rest of the regression coefficients. In certain cases, β g can be the entire regression parameter β. The package implements three likelihood-based tests including the Wald test, the generalized score test (Cox and Hinkley, 1979) and the likelihood ratio test, which are asymptotically equivalent and follow the chi-squared distribution with d f = dim(β g ). In particular, the generalized score test is usually faster than the other two tests for large-scale testings such as the genome-wide association study (GWAS) . Due to the complex structure of the joint log-likelihood, instead of analytically deriving the first and second order derivatives, we use the Richardson's extrapolation (Lindfield et al., 1989) to approximate the score function and observed Fisher information numerically.

Fitting copula models for bivariate right-censored data
The bivariate right-censored input dataset shall be a data frame including the covariates and four additional key input columns: • id: the subject/cluster id, • ind: the margin indicator (1, 2), • obs_time: the exact observed time, • status: censoring indicator (1 for event, 0 for right-censoring).
We use the DRS (Diabetic Retinopathy Study) data as an example. The DRS data contain bivariate right-censored time to blindness from 197 diabetic retinopathy patients. These patients were from a 50% random sample of the patients with "high-risk" diabetic retinopathy as defined by the DRS (Huster et al., 1989). Each patient had one eye randomized to one of the two laser treatments and the other eye received no treatment. For each eye, the event of interest was the time from initiation of treatment to the time to blindness in months. Censoring was caused by death, dropout, or end of the study. The data can be loaded by data("DRS", package = "CopulaCenR") head ( There are three covariates: treat is treatment with 0 for no treatment, 1 for xenon laser treatment and 2 for argon laser treatment; age is the age at diagnosis of diabetes; type is the type of diabetes with 1 for juvenile (age ≤ 20 at diagnosis) and 2 for adult. The primary question of the DRS study was to assess the treatment effectiveness while accommodating the dependence between two eyes.

Fitting copula models for bivariate interval-censored data
The bivariate interval-censored input dataset shall be a data frame including the covariates and five key input columns: • id: the subject/cluster id, • ind: the margin indicator (1 or 2), • Left: the left bound of the observed interval, • Right: the right bound of the observed interval (can take "Inf"), • status: the censoring indicator (1 for left-or interval-censoring, 0 for right-censoring).
We use the AREDS (Age-Related Eye Disease Study) data as an example. The event of interest is the AMD (Age-related Macular Degeneration) disease, which is a leading cause of blindness in the developed world (Swaroop et al., 2009). It is known as a polygenic, progressive and neurodegenerative disorder. The AREDS study is a multi-center randomized clinical trial studying the development and progression of AMD, sponsored by the National Eye Institute (AREDS Group, 1999). Due to intermittent assessment times (every 6 months up to the first 6 years and every 1 year since after), the exact time when each eye progressed to late-AMD was only known to lie in a certain interval. As a result, the outcome data are bivariate interval-censored. The package includes a subset data of 629 Caucasian participants from AREDS who had at least one eye in moderate AMD stage at baseline. The data can be loaded by Out of these 629 subjects, 273 subjects developed late-AMD in both eyes during the study and the times to late-AMD were interval-censored; 138 subjects developed late-AMD in one eye (intervalcensored) and did not develop late-AMD before the end of the study (right-censored); the rest 218 subjects were right-censored for late-AMD in both eyes.
There are three continuous covariates: SevScaleBL for baseline AMD severity score (a value between 1 and 8 with a higher value indicating more severe AMD), ENROLLAGE for baseline age and rs2284665 for a genetic variant (0, 1, 2 for GG, GT, TT) that might be associated with AMD progression. The two clinical covariates SevScaleBL and ENROLLAGE are well-known risk factors of AMD. Thus, our primary interest is to find out whether the genetic variant rs2284665 is significantly associated with AMD progression.
We fit a two-parameter copula semiparametric transformation model for the AREDS data through the function ic_spTran_copula. The arguments l and u are the range of event times, which need to be pre-specified by the user. In practice, l and u can be set as the minimum and maximum of observed times. The argument m corresponds to the degree of Bernstein polynomials (as shown in formula 5), with the default value m = 3. The argument r specifies the form of marginal transformation model (as shown in formula 6). In practice, the values of m and r can be chosen based on the smallest AIC for a list of fitted models with different values.
Similarly, the conditional survival probabilities (Figure 4) can be obtained for the left eyes from the same three subjects, given their right eyes (i.e., cond_margin = 2) had progressed (to late-AMD) at year 5 (i.e., cond_time = 5).
Likewise, we can also obtain the eye-level marginal survival probabilities (i.e., plot_margin = 1 for the left eyes) for the same three subjects, as illustrated in Figure 5.

Summary
This paper presents the R package CopulaCenR for implementing copula-based regression models in bivariate censored data, including both bivariate right-censored data and bivariate interval-censored data. A variety of Archimedean copulas, including a flexible two-parameter copula, are built in the package to accommodate different dependence structures. Moreover, the package can fit various parametric and semiparametric regression models for the two margins within the copula function. In particular, a general semiparametric transformation model with PH and PO models being its special cases is implemented for the margins in this package. For parameter estimation, a novel two-step procedure is adopted to guarantee stable and fast computation. For the inference of covariate effects, all three likelihood-based tests are provided. Lastly, two real data examples are given to demonstrate the key features and capabilities of this package.
One future extension of this package is to allow multivariate copula functions for handling multivariate censored events. Another important research extension is to add goodness-of-fit tests, which is critical for choosing a proper copula model. However, there are limited works in testing copula models in bivariate censored data, especially in bivariate interval-censored data under the regression setting. The current literature (e.g., Shih, 1998;Andersen et al., 2010;Emura et al., 2010;Wang, 2010) only focus on testing copulas in bivariate right-censored data without covariates. We are currently investigating these directions and plan to incorporate them in a future version of CopulaCenR.