Bayesian Regression Models for Interval-censored Data in R by

The package icenReg provides classic survival regression models for interval-censored data. We present an update to the package that extends the parametric models into the Bayesian framework. Core additions include functionality to define the regression model with the standard regression syntax while providing a custom prior function. Several other utility functions are presented that allow for simplified examination of the posterior distribution.


Introduction
Interval-censoring occurs when observations are not known exactly, but rather up to an interval.For example, suppose a component of a machine is inspected at time c 1 and c 2 .The component is observed to be operational at c 1 , but broken at c 2 .In such a case, while the exact failure time is not known, it is known that the event occurred inside the interval (c 1 , c 2 ].In some cases, these intervals are small and the interval-censored aspect of the data can be ignored with only minor biases.For example, if age is reported in years, it is likely to be interval-censored due to binning, i.e. reporting "28 years old" typically implies age is in the interval [28,29).Similarly, if the intervals are non-overlapping, such as reported income brackets, one can simply treat the data as ordinal data and use appropriate models.However, if the data set contains moderate sized overlapping intervals, then interval-censoring methods should be used for valid inference.Note that a right censored observation can be represented as (C, ∞), where C is the censoring time, left censoring can be represented as [0, C) and an uncensored observation occurring at time t can be represented as [t, t].
Although interval-censoring is not strictly a survival analysis problem (for example, the Tobit model (Tobin, 1958)), this work focusses on the survival analysis setting in which the outcome of interest is time to event.A common assumption in many interval censored models, including those provided by icenReg, is that the distribution of the inspection times is independent of the event time of interest (Gruger et al., 1991).This can be framed as each subject having an event time of interest, t i , and a set of inspections c i0 = 0 < c i1 < ... < c ik i = ∞ where the subject is inspected to determine if the event has occurred.The interval [c ij , c ij+1 ) such that t i ∈ [c ij , c ij+1 ) is then recorded as the interval for subject i.The independence assumption states that t i is independent of c ij .
The standard univariate estimator is the non-parametric maximum likelihood estimator (NPMLE) (Turnbull, 1976), which can be viewed as a generalization of the Kaplan-Meier curves (Kaplan and Meier, 1958) that allow for interval-censoring (Ng, 2002).Many of the standard survival regression models can be extended to the interval-censored such as the proportional hazards, accelerated failure time (AFT) model and proportional odds.Semi-parametric models in which the baseline distribution is fit with the NPMLE are often used to avoid the need to specify the baseline distribution (Finkelstein, 1986), (Rossini and Tsiatis, 1996).While it has been shown that the regression coefficients are asymptotically normal and bootstrap procedures can be used for inference on the regression parameters (Huang, 1995), it is also noted that the asymptotic distribution of the baseline survival curve is currently an open question.This implies that while standard errors can be produced for the regression coefficients, quantifying the uncertainty in estimated survival probabilities when using the semi-parametric models is not currently available; even the bootstrap estimator has been shown to be inconsistent (Sen and Xu, 2015).It has also been noted that while the regression coefficients are consistent, a non-trivial upward bias in the coefficient estimates has been observed (Pan, 1999).Fully parametric regression survival models can also be used and are fairly straightforward to implement (Rabinowitz et al., 1995).In contrast to semi-parametric models, fully parametric models provide more efficient inference and allow for quantification of uncertainty of survival estimates at the cost of requiring assumptions of the family of baseline distribution, although it has been shown empirically that inference is fairly robust to mis-specification of the baseline distribution (Lindsey, 1998).Fully parametric models can be easily extended to the Bayesian framework (Gómez et al., 2004).For a thorough review of the non-parametric, semi-parametric and fully-parametric models in the interval-censoring context, see (Sun, 2007).In this work, we focus on parametric regression models in the Bayesian framework.
In general, interval-censored data is less informative than uncensored data.As such, incorporating prior information into an analysis using Bayesian methods can be especially useful.Recent additions to the R package icenReg (Anderson-Bergman, 2016) allow for simplified Bayesian analysis using standard regression formulas and user written prior functions.In section 2, the regression models available in icenReg are mathematically formulated.In section 3, the general form of the posterior The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 distribution is presented and the MCMC sampler is briefly discussed.In section 4, the core Bayesian functions in icenReg are presented.In section 5, an example analysis on a classic dataset is presented.

Regression Models
To completely define a parametric survival regression model, one needs to specify the • Baseline distribution • Effect of the covariates on the baseline distribution In icenReg, several classic survival baseline distributions are included: Weibull, gamma, exponential, log-normal and log-logistic.At this time, three classic regression models are supported in icenReg: proportional hazards, AFT and proportional odds.In describing these regression models, we use several standard survival definitions.Defining f (t) and F(t) to represent the probability density function and cumulative density function for a given distribution, the survival distribution is defined as S(t) = 1 − F(t) and the hazard function h(t) = f (t) S(t) .The functions h o (t) and S o (t) represent the baseline hazard and survival function; i.e. the corresponding functions if all covariates are equal to 0. The vector X represents a subject's covariates, α represents a vector of parameters defining the baseline distribution and β represents a vector of regression coefficients.
The proportional hazards model can be defined as having the relation This definition can be used to interpret a regression coefficient β j as a one unit increase in x j is associated with an exp(β j ) fold increase in the hazard at any time.
The proportional odds model is defined as the relation .
This definition can be used interpret a regression coefficient β j as a one unit increase in x j is associated with exp(β j ) fold increase in the odds of survival at any given time.
The AFT model is defined by the relation This definition can be used to interpret a regression coefficient β j as a one unit increase in x j is associated with events occuring exp(β j ) fold faster.
To define the likelihood function, we let n 1 be the number of uncensored subjects, n 2 be the number of interval-censored subjects (note that this can include left and right censored subjects), t i be subject i's event time if subject i is uncensored, {L i , R i } be the left and right side of the interval containing subject i's event time if subject was censored and X i be a vector of subject i's covariates.Then the likelihood can be written as under the implication that if n 1 or n 2 are equal to 0, the corresponding term of the likelihood function reduces to 1.

Bayesian Inference
To perform Bayesian inference, the prior is multiplied by the likelihood function to form the posterior distribution.For the Bayesian models included in icenReg, the posterior distribution is proportional to where p is the prior distribution on the α and β parameters.Because the posterior is not in closed form for these models, Markov Chain Monte Carlo (MCMC) methods are used to draw samples from the posterior distribution.
In icenReg, MCMC sampling is carried out by an adaptive block updater (Haario et al., 2001).Default behavior is to first calculate the maximum likelihood estimator (MLE)1 , use the MLE point estimates as initial values and the inverse Fisher's information as an initial estimate for the posterior covariance.During the burn-in period, the posterior covariance is updated.A default target acceptance rate of 0.25 is used, as suggested in (Gelman et al., 1996).
Alternatively, the user can specify not to use the MLE and Fisher's information to build the starting proposal distribution.In this case, the starting proposal covariance matrix will be the identity matrix multiplied by a user-provided scalar (default = 0.1), which then has the option to adaptively learn the covariance matrix.While this is generally not recommended for efficiency purposes, it may be beneficial when the prior is strongly informative compared with the likelihood function.As an extreme example, if all the data were right censored, the MLE would be degenerate but an informative prior can still lead to valid Bayesian inference.In such cases, starting at the MLE would cause the MCMC algorithm to fail.The formula argument declares the likelihood function in the same manner as other icenReg model functions, to be demonstrated in the following section.The logPriorFxn allows the user to write a custom prior function that takes in a vector of parameters and returns the log prior density (or a value equal up to an additive constant).The order of the values should be the same order as the parameters returned when a user calls coef on a model.Default behavior is to use a flat prior.The model argument declares the regression model, with choices ph (proportional hazards), po (proportional odds) and aft (accelerated failure time).The dist defines the baseline distribution, with options exponential , weibull , gamma , lnorm (log-normal) and loglogistic .The function argument controls accepts a list of control parameters for the MCMC sampler, see ?bayesControls for details of options.The argument useMCores is a logical variable indicating whether the multiple chains should be run in parallel.If set to TRUE, a cluster must be registered in advance; this is demonstrated in section 5.

Core Functionality
The output from ic_bayes provides a list of samples from the posterior of α and β.Users are often interested in the survival probabilities for subjects with different sets of covariates, which requires a decent amount coding and double checking differing distribution parameterization.To simplify this process, the sampleSurv function allows a user to take draws of the posterior survival distribution for a given set of covariates.The arguments are defined as sampleSurv(fit, newdata = NULL, p = NULL, q = NULL, samples = 100) The argument fit is a fit returned from ic_bayes.The argument newdata is a data.framewhich includes the set of covariates from which we would like to draw the posterior probabilities from.If newdata is NULL, the baseline distribution is used.A user should either provide a numeric vector p of percentiles to sample OR a numeric vector q, a set of times to sample the cumulative probabilities at.
The function ic_sample allows a user to take posterior samples of event times for a given set of covariates.The arguments are defined as ic_sample(fit, newdata = NULL, sampleType = fullSample , samples = 5) The argument sampleType has two options: fullSample , in which event times are sampled from the full posterior and fixedParSample , in which event times are sampled conditional on the MAP estimates.
In some cases, a user may wish to impute posterior samples of the exact event times for the response variables in their dataset.This may be for the purpose of inferring the distribution of the exact event time for a specific subject, or for passing the data to an analysis tool that does not account for interval-censoring.This can be done with imputeCens.The arguments are defined as imputeCens(fit, newdata = NULL, imputeType = "fullSample", samples = 5) The arguments are the same as ic_sample, except that the newdata dataframe must include a pair of columns that contain the lower and upper bounds of the response variable.If newdata is set to NULL, imputeCens will impute all the rows from the original dataset.
The function survCIs returns credible intervals for the survival distribution, along with the posterior mean and posterior median estimates.The arguments for survCIs are survCIs(fit, newdata = NULL, p = NULL, q = NULL, ci_level = 0.95, MC_samps = 40000) Finally, the plot function accepts the following arguments plot(x, newdata = NULL, plot_legend = T, lgdLocation = topright , cis = T, ci_level = 0.9, ...) In this case, x should be a fit from ic_bayes, newdata is a dataframe with a set of covariates to determine the survival functions to plot, plot_legend is a logical argument indicating whether to include a legend with labels provided by the rownames of newdata, cis is a logical indicator for whether credible intervals should be included, ci_level is the credible levels for the credible intervals, and ... is additional arguments to be passed to the base plot function.Note that if the col argument is supplied, each color will be matched to the corresponding row of newdata.The solid lines plotted are the posterior median survival probabilities, with dashed lines representing the upper and lower limits of the credible interval.

Example Analysis
To demonstrate the use of Bayesian regression models in icenReg, we will use the miceData dataset included in icenReg (Hoel and Walburg, 1972).This dataset examined occurrances of lung cancer in RFM mice (bred for high rates of cancer) kept in two different environments; conventional environment (ce) or germ-free environment (ge).At different ages, mice are sacrificed and examined for lung tumors.If mouse i is inspected at age C i and a tumor is found, then time of onset is recorded as being in the interval [0, C i ].If no tumor is found, then the time of onset is recorded as being in the interval (C i , ∞).Note that this form of data is referred to as current status data.
We first load the icenReg library along with foreach (Analytics and teve Weston, 2014b) and doParallel (Analytics and teve Weston, 2014a), which are required to run MCMC chains in parallel.

> library(icenReg) > library(foreach) > library(doParallel)
We then load and examine the miceData dataset.The column l and u represent the lower and upper end of the intervals containing the onset time for each mouse.We note that there are 96 mice in the ce group and 48 mice in the ge group.Because current status data is fairly uninformative per subject, this dataset contains limited information about the distribution of time to onset.
For the sake of demonstration, suppose that we had expert information regarding onset of lung cancer.An expert tells us that (a) after two years in the conventional environment, the expert is 50% certain that between 10-30% of the mice will have developed lung tumors and (b) hazard rates are non-decreasing with age.To incorporate (a), we can set a Beta(α = 1.5, β = 5.5) prior onto the probability of an event occuring before t = 730 for the CE group.For (b), we note the fact that for the Weibull distribution, a shape parameter below 1 implies a decreasing hazard, while a shape parameter above 1 implies an increasing hazard.To enforce a non-decreasing hazard, we will set zero probability mass to the shape parameter below 1.We note that this is an improper prior: we have put a flat prior of the regression coefficient.
To demonstrate how to incorporate this into ic_bayes, we first look at the parameters that will be handed to our prior function.This will be vector of parameters given in the same form and order as returned by coef, for either a Bayesian model or maximum likelihood model (ic_par).
> # Fitting MLE > mle_fit <-ic_par(cbind(l, u) ~grp, + model = ph , + dist = weibull , + data = miceData) > coef(mle_fit) log_shape log_scale grpge 0.7071843 6.9481420 0.7861709 All the syntax used for defining models for ic_par is shared with ic_bayes.In the formula, we define the response by calling cbind(l,u), where l and u represent the lower and upper ends of the interval.We see that we will be given the baseline log shape parameter, baseline log scale and the coefficient for the dummy variable indicating belonging to the GE group.We then write our log prior density function as such: ans is log-density of the prior + ans <-0 + # First prior: S(730) ~beta(1.5, 5.5) + # Note that we are using a Weibull distribution + s_730 = 1 -pweibull(730, shape = shape, scale = scale) + ans = ans + dbeta(s_730, 1.5, 5.5, log = T) + # Second prior: shape >= 1 + if(shape < 1) ans = -Inf The R Journal Vol.XX/YY, AAAA 20ZZ ISSN 2073-4859 This prior is then provided to the ic_bayes function.We use doParallel's registerDoParallel to sample from the 4 chains in parallel.

> summary(bayes_fit)
Model: Bayesian Cox PH Baseline: weibull Call: ic_bayes(formula = cbind(l, u) ~grp, data = miceData, logPriorFxn = expertPrior, model = "ph", dist = "weibull", useMCores = T) We can access the raw MCMC samples from the $mcmcList field.This is a coda::mcmcList object (Plummer et al., 2006), and as such all the standard coda methods can be used directly on this object.For example, if we want traceplots and marginal density estimates of the samples, we can directly call plot.The results are plotted on figure 1.

> plot(bayes_fit$mcmcList)
We can examine a plot of the posterior survival distribution using the plot method.If we do not provide any new data, the baseline survival distribution will be plotted.This is demonstrated on figure 2. The solid line is the median posterior survival probability at any given time, with the dashed lines representing upper and lower credible intervals for the survival probabilities.Using the survCIs function, we can extract credible intervals for the survival function with a given set of covariates, along with the posterior mean and posterior medians.
> survCIs(bayes_fit, + newdata = newdata, + p = seq(from = 0.1, to = 0.9, by = .2),+ ci_level = 0.95) Model call: ic_bayes(formula = cbind(l, u) ~grp, data = miceData, logPriorFxn = expertPrior, model = "ph", dist = "weibull", useMCores = T) Now suppose we wanted to draw posterior samples of the event time distribution for each group.For example, we may wish to construct density plots for event time from each group.This can be done with ic_samples and is demonstrated in the code below.The generated plot can be found on figure 4 Finally, we can draw posterior samples of the event time, given that it occurs within some specified interval, with imputeCens.To demonstrate, suppose we were interested in the exact event time for mice in each group that were sacrificed at one year and found to have no tumors, implying the event time was right censored at t = 365.This can be expressed in the interval censoring format as t ∈ [365, ∞).Below, we use imputeCens to draw posterior samples of event times conditional on being greater than 365 and plot the estimate posterior density in figure 5.

Summary
Interval-censoring occurs when event times are not known exactly, but rather only up to an interval.Naturally, this results in less informative data than if the event time were observed exactly.The potentially weakly informative data further motivates using prior information about the data generating process to provide a more informative analysis of a given data set.Bayesian methodology provides a straightforward framework for incorporating such prior information.The addition of ic_bayes to the icenReg package allows for simple, efficient interval-censored regression models with generic user provided prior distributions and a variety of tools to simplify analyses.

Figure 3 :
Figure 3: Comparing survival curves between groups

Figure 5 :
Figure 5: Posterior densities of event time conditional on occurring after first year