riskRegression: Predicting the Risk of an Event using Cox Regression Models

In the presence of competing risks a prediction of the time-dynamic absolute risk of an event can be based on cause-speciﬁc Cox regression models for the event and the competing risks (Benichou and Gail, 1990). We present computationally fast and memory optimized C++ functions with an R interface for predicting the covariate speciﬁc absolute risks, their conﬁdence intervals, and their conﬁdence bands based on right censored time to event data. We provide explicit formulas for our implementation of the estimator of the (stratiﬁed) baseline hazard function in the presence of tied event times. As a by-product we obtain fast access to the baseline hazards (compared to survival::basehaz() ) and predictions of survival probabilities, their conﬁdence intervals and conﬁdence bands. Conﬁdence intervals and conﬁdence bands are based on point-wise asymptotic expansions of the corresponding statistical functionals. The software presented here is implemented in the riskRegression package.


Introduction
Predictions of hazards and risks based on a Cox regression analysis need to be fast and memory efficient, especially in large data, in simulation studies, and for cross-validation or bootstrap loops.The CRAN task view Survival lists many R packages implementing the Cox regression model and extensions thereof.Among the most popular routines are the function coxph() from the survival package (Therneau, 2017) and the function cph() from the rms package (Harrell Jr, 2017).We present a fast and memory efficient algorithm to extract baseline hazards and predicted risks with confidence intervals from an object obtained with either of these functions.
In the presence of competing risks one needs to combine at least two Cox regression models to predict the absolute risk of an event (cumulative incidence) conditional on covariates (Benichou and Gail, 1990).We present the CSC()-function of the R package riskRegression which fits the Cox regression models using either coxph() or cph().We also present a concomitant predict() S3 method which computes the absolute risks of the event of interest for given combinations of covariate values and time points.Optionally, the predict() method computes asymptotic confidence intervals and confidence bands for the predicted absolute risks.We review the formula behind the estimators implemented and illustrate the R interface.
It is possible to obtain the predictions of absolute risks based on cause-specific Cox regression also with the survival package or with the mstate package (Putter et al., 2016).However, both require more work from the user.Finally, it should be noted that there are alternative regression methods for absolute risks in the presence of competing risks such as Fine-Gray regression (Fine and Gray, 1999) or direct binomial regression (Gerds et al., 2012;Scheike et al., 2008).

Predicting absolute risks based on cause-specific Cox regression
We denote by T the time between a baseline date and the date of an event and by D ∈ {1, ..., K} the cause of the event.We assume that {D = 1} is the event of interest.Let X = (X 1 , . . ., X p ) be a p-dimensional vector of baseline covariates with arbitrary distribution, and Z = (Z 1 , . . ., Z q ) be the strata variables, i.e. a set of categorical baseline covariates with finitely many possible values.Without loss of generality and to ease the notation we set q = 1.We use {1, . . ., L} for the categories of Z.We consider a setting in which the event time T is right censored at a random time C. We assume that C is conditionally independent of T given (X, Z) and fix a time τ such that almost surely P(C > τ|X, Z) > 0. We denote T = min(T, C), D = ∆D, and ∆ = 1{T ≤ C}.

Cause-specific Cox regression
Given covariates (X, Z), let S 0 (t|x, z) = P(T > t|X = x, Z = z) denote the event-free survival function and F j (t|x, z) = P(T ≤ t, D = j|X = x, Z = z) the cumulative incidence function for event j.The cause-specific hazard rates are defined as λ j,z (t|x) = dF j (t|x, z)/S 0 (t|x, z) (Andersen et al., 1993).We also denote the cumulative hazard rates by Λ j,z (t|x) = t 0 λ j,z (s|x)ds.The stratified Cox regression model (Cox, 1972) for cause j is given by where ) is a p-dimensional vector of regression coefficients (the log-hazard ratios), and {λ 0j,z (t) : z = 1, . . ., L} a set of unspecified baseline hazard functions.

Predicting the absolute risk of an event
The cause-specific Cox regression models can be combined into a prediction of the absolute risk of an event of type 1 until time t conditional on the covariates x, z.For the case where K = 2 the absolute risk formula of Benichou and Gail (1990) is given by: ( where s− denotes the right sided limit, e.g.Λ 1,z (s − |x) = lim v→s,v<s Λ 1,z (v|x).The absolute risk accumulates over time the product between the event-free survival and the hazard of experiencing the event of interest, both conditional to the baseline covariates and to the strata variable.The event free survival can be estimated from the cause-specific hazards using the product integral estimator: or the exponential approximation: which is asymptotically equivalent to the product-limit estimator if the distribution of the event times is continuous.Using the product integral estimator ensures that S(t|x, z) + F 1 (t|x, z) + F 2 (t|x, z) equals exactly 1.This is a desirable property since the sum of the transition probabilities over all possible transitions should sum to one.
Formula (2) generalizes to situations with more than 2 competing risks, i.e., K > 2. However, in applications with many competing risks there will sometimes be few events of specific causes, and it may be hard to fit a Cox regression model for each cause separately.One possibility when K > 2 is to combine all causes where D > 1 into a single competing risk for the cause of interest D = 1.While the riskRegression package allows the use of more than 2 competing risks, we will illustrate its use considering only 2 competing risks .The package implements formula (2) in two steps.
The R Journal Vol.9/2, December 2017 ISSN 2073-4859 Step 1: estimation of the cause-specific hazards In the call of CSC() the argument formula is used to define the outcome variables with the help of the function prodlim::Hist().The variable "time" in the data set contains the values of the observed event time T and the variable "status" the cause of the event D. Objects generated with the function prodlim::Hist() have a print method: A nice complement to the regression models is the marginal Aalen-Johansen estimate of the absolute risk of cancer related death (Figure 2): The R Journal Vol.9/2, December 2017 ISSN 2073-4859 library(prodlim) plot(prodlim(Hist(time,status) ~1, data = Melanoma), atrisk.at= seq(0,3652.5,365.25),xlim = c(0,3652.5),axis1.at= seq(0,3652.5,365.25),axis1.lab= 0:10, xlab = "Years", ylab = "Absolute risk", cause = 1)  Hist(time,status) ~age + logthick + epicel + strata(sex) defines the covariate(s) X which enter into the linear predictor xβ in formula (1), and the strata variable(s) Z which define the baseline hazard functions λ 0j,z .Strata variables are specified by wrapping the variable names into the special function strata(), as one would do when using the coxph() function.If only one formula is provided, the CSC() function will use the same baseline covariates and strata variables for all cause-specific Cox regression models.Instead one may feed a list of formulas into the argument formula, one for each cause: cfit1 <-CSC(formula = list(Hist(time,status) ~age + logthick + epicel + strata(sex), Hist(time,status) ~age + strata(sex)), data = Melanoma) coef(cfit1) Note that the choice of the baseline covariates relative to each cause made here is not based on clinical or statistical criteria; it was done to illustrate the software possibilities.The causes are internally ordered with respect to the levels of the variable "status", if this variable is a factor, and otherwise with respect to sort(as.character(unique(status))).The order of the causes is saved as cfit1 [["causes"]].Accordingly, the first formula is used to fit a Cox regression model to the first cause and the second formula is used to fit a Cox regression model to the second cause and so on.
The R Journal Vol.9/2, December 2017 ISSN 2073-4859 Internally, CSC() constructs dummy variables, one for each cause, and then calls the function defined by the argument fitter on a suitably constructed Surv() formula.By default the cause-specific Cox models are fitted with the function survival::coxph().Alternatively, one can set the argument fitter to the name of a different routine, e.g., cph.
Step 2: computation of the absolute risk The object obtained with CSC() has class "CauseSpecificCox".The second step is to call the corresponding predict() method.In addition to the object obtained with CSC() this requires three additional arguments: newdata, times, cause.The argument newdata should be a "data.frame"which contains the covariates X and Z in the same format as the data used to fit CSC().The argument cause defines the cause of interest D and the argument times defines a vector of prediction horizon(s) whose values are used as the upper integration limit t in formula (2).The predict() method computes the absolute risks (formula ( 2)) for each row in newdata and each value of times: newdata <-data.frame(age= c(45,67), logthick = c(0.1,0.2),epicel = c("present","not present"), sex = c("Female","Male")) pfit1 <-predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500)) By default, the product integral estimator is used to estimate the event-free survival function.Setting the argument productLimit to FALSE when calling the predict function enables to use the exponential approximation.The predict function returns a structured list of class "predictCSC".The corresponding print() method calls as.data.For each row in newdata (values are repeated for each prediction horizon) and each prediction horizon (column times) the column "absRisk" contains the absolute risk of cancer specific mortality (cause 1).Standard errors and confidence intervals for the absolute risk can be obtained setting the argument se to TRUE: pfit1se <-predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500 The R Journal Vol.9/2, December 2017 ISSN 2073-4859 The elements $absRisk, $absRisk.se,$absRisk.lowerand $absRisk.upperare matrices where each row corresponds to a row in newdata and each column to a value of the times vector.All these matrices are sorted according to the original orders of the arguments newdata and times.By default 10,000 simulations will be used to estimate the appropriate quantile for the confidence bands (see explanations in section Construction of the confidence bands).This default behavior can be changed by setting the argument nsim.band to another value.The autoplot() function can then be used to compare confidence bands and the confidence intervals: Note that the resulting object is a "ggplot" graphic.This can be useful to personalize the graph, e.g.change the font size of the text.

Construction of the confidence intervals
In this section we describe the asymptotic formula behind the confidence intervals for the predicted absolute risks and the empirical counterpart which is implemented in riskRegression.We assume a sample (X i ) i∈{1,...,n} of n independent and identically distributed replications of X = ( T, D, X, Z).
The estimator F1 of F 1 is obtained by substituting the Cox partial likelihood estimate βj for β j and the baseline hazard estimate λ0j,z for λ 0j,z in equations ( 1) and ( 2).
The asymptotic confidence intervals for the covariate specific absolute risks of event 1 before time t are based on the following von Mises expansion (van der Vaart, 1998): where the exponential approximation is used for defining the event free survival in F 1 (t|x, z).Given (4), for fixed values t, x, z the central limit theorem implies that F1 (t|x, z) has an asymptotic normal distribution with asymptotic variance Based on the estimate φ F 1 of the influence function φ F 1 (both defined in subsequent subsections) our variance estimate is given by We then construct Wald confidence intervals for F 1 (t|x, z) in the usual way: where q α is the α-quantile of the normal distribution.Since the absolute risk is bounded below by 0 and above by 1, the confidence interval is automatically restricted to this interval.Alternatively the confidence interval can be computed using a log-log transformation: first the confidence interval is computed on the log-log scale: where V log−log is the variance of the influence function on the log-log scale.Then the confidence interval is back transformed using the link function: x → exp(− exp(x)).This ensures that the confidence interval is bounded below by 0 and above by 1.By default, the confidence intervals are computed using the log-log transformation.To compute them without using the log-log transformation, the argument log.transform needs to be set to FALSE when calling predict.CauseSpecificCox(): pfit2se <-predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500 Here the column "absRisk.se"contains V log−log (log-log scale) while the columns "absRisk", "absRisk.lower",and "absRisk.upper"are on the original scale.The benefit of using a log-log transformation is studied in the section Coverage of the confidence interval and the confidence bands.

Asymptotic formula
The influence function φ F 1 can be expressed as a function of the influence functions φ λ 1 and φ λ 2 of the cause-specific hazard rates: The R Journal Vol.9/2, December 2017 ISSN 2073-4859 We use the shorthand notations (s, d, v, w).
We also suppress the dependence on τ when we in the following adapt the usual notation for Cox model asymptotic theory to the cause-specific and stratified case: For fixed cause j, time t and strata z the expansions are then characterized by the influence functions In absence of strata, formula ( 7) is equal to formula 2 of (Reid, 1981) and formula (8) equals the one given in Gerds and Schumacher (2001, top of page 576; note however that there is a sign mistake in their first term).To connect these formulas with formula (6) it remains to note that under the Cox regression model the influence function of the cause-specific hazard rate can be written as:

Empirical estimates
The following formulas are obtained with the plug-in principle substituting the Cox partial likelihood estimates βj for β j and the baseline hazard estimates λ0j,z for λ 0j,z into formulas ( 6) -( 9).We denote by N z j (t) = ∑ n i=1 1{ Ti ≤ t, Di = j, Z i = z} the strata and cause specific counting process, and by Y z (t) = ∑ n i=1 1{T i ≥ t, Z i = z} the strata specific "at-risk" process (Andersen et al., 1993).The empirical estimate of the influence function of the partial likelihood estimate is given by where The R Journal Vol.9/2, December 2017 ISSN 2073-4859 The influence functions for the cumulative baseline hazard and its first differential are estimated by: These estimates lead to the following estimates of the influence functions of the covariate specific cumulative hazard and its derivative relative to the time: Finally, we obtain our estimate of the influence function of the absolute risk:

Construction of the confidence bands
A confidence band with confidence level 1 − α for the absolute risk In figure Figure 3, τ 1 = 0 and τ 2 = 3458, the time at which the last event occurred.Using the martingale central limit theorem, (Cheng et al., 1998) have shown that F1 (t|x, z) − F 1 (t|x, z) converges weakly to a zero-mean Gaussian process on T .The asymptotic variance of this process is V(t, x, z).However the 1 − α quantile achieving simultaneous coverage is larger than the 1 − α quantile of a standard normal distribution.Since the dependence between the increments of the process F1 (t|x, z) − F 1 (t|x, z) makes the derivation of an explicit expression for the quantile difficult, we used instead a resampling technique (Scheike and Zhang, 2008).Consider over t ∈ T the normalized process: Denote by c 1−α/2 the 1 − α/2 quantile of the sample: and using the symmetry of the Gaussian distribution, i.e. c α/2 = −c 1−α/2 , a 1 − α confidence band over T is constructed as follows: Like for the confidence intervals, the confidence bands will be restricted to the interval [0;1] when they are not computed using a log-log transformation.

Implementation details
The function predict.CauseSpecificCox() calls two important functions, predictCox() that computes the hazard and cumulative hazard for a fitted Cox model, and iidCox() that computes the influence function for the baseline hazard and regression coefficients.In this section we first explain The R Journal Vol.9/2, December 2017 ISSN 2073-4859 how predictCox() deals with ties in the event times.We then show that the function predictCox() can also be used to obtain confidence intervals and bands for covariate specific survival probabilities in the situation without competing risks.Implementation details about the function iidCox() are postponed to appendix B. These details may be useful for programmers who need to care about memory usage.

Handling of tied event times
We speak of ties when two or more observations have the same value of the time variable T. Ties occur for example when time is recorded on a discrete scale such as months.The survival package implements three different methods to deal with ties ("efron", "breslow", and "exact", see help(coxph)) for the partial likelihood estimator of the log hazard ratios β j (Therneau and Grambsch, 2000).We have implemented the "efron" and the "breslow" method but not the "exact" method.For a comparison of these methods and yet another method see Scheike and Sun (2007).We now state the formula for the baseline hazard function under Breslow's (Breslow, 1974) and Efron's method (Efron, 1977) for the handling of ties.The baseline hazard estimate in strata z given by the Breslow method is: With the Efron method for handling ties the formula is given by: Both estimators are implemented in the function predictCox() which provides estimates of the baseline hazard, the cumulative baseline hazard and baseline survival.The predictCox() function does not have an argument to specify whether Breslow method or Efron method should be used; instead it uses the same method that has been used to estimate the regression coefficients.In the case of the coxph() function, the default method is Efron: f1 <-coxph(Surv(time,status != 0) ~age + logthick + epicel + strata(sex), data = Melanoma, x = TRUE, y = TRUE) f1$method ## [1] "efron" Therefore the baseline hazard will be estimated using the Efron method when calling predictCox(): baseH1 <-predictCox(f1) as.data.table(baseH1[c("time","cumhazard","strata","survival")])The cumulative baseline hazard and baseline survival are displayed in the columns "cumHazard", and "survival" of the output.This corresponds, respectively, to Λ 0j,z (t) and exp(−Λ 0j,z (t)) where j is 1, z can be found in the column "strata" and t in "time".The covariate specific cumulative hazard Λ j,z (t|x) and survival exp(−Λ j,z (t|x)) can be estimated using the same function: predictCox(f1, newdata = Melanoma[c(17,101,123)

Confidence intervals and confidence bands for survival probabilities
In absence of competing risks, the influence function of a Cox model can be used to estimate confidence intervals for the survival.This can be done by setting the argument se to TRUE when calling the predictCox() function: p1 <-predictCox(f1, newdata = Melanoma[3:5,], times = c(Melanoma$time[5:7],1000), se = TRUE, type="survival") The predictCox() function output an object of class "predictCox".The print() method can be used to display the confidence intervals for the survival computed at different times: Here the confidence intervals were computed using a log-log transformation.The argument log.transform can be set to FALSE to compute them without using the log-log transformation.Confidence bands can also be obtained using predictCox() by setting the argument band to TRUE.As for the predict.CauseSpecificCox() function, 10,000 simulations will be used to compute the confidence bands; this can be changed specifying the argument nsim.band.

Coverage of the confidence interval and the confidence bands
To assess the validity of the estimation of the standard error, we performed a simulation study.The sample size was varied between 50 and 10000.For each sample size, 5000 datasets were simulated using the SimCompRisk() function from the riskRegression package.The time of first event typically ranged between 0.001 and 20 with a median around 4. For each dataset, a Cox model specific to cause 1 and a 2 cause-specific Cox model were fitted considering 2 covariates (X1 and X2).The survival, absolute risk, their confidence intervals were estimated (with or without log-log transformation) at time 1, 1.5, 2, 3, 4, 5, 6, and 7 conditional on X1 = 0 and X2 = 1.The confidence bands over those 8 times were also computed.
The true absolute risk was defined as the median of the absolute risks over the 5000 datasets.The coverage of the confidence intervals was computed as the percentage of times that the true absolute The R Journal Vol.9/2, December 2017 ISSN 2073-4859 risk was inside the confidence interval, at a given time.The coverage of the confidence bands was computed as the percentage of times that the true absolute risk was inside the confidence bands, simultaneously for all of the 8 times.
As expected, the coverage of the confidence intervals is improving with increasing sample size and reaches its nominal level of 95% at around sample size 1000 (figure Figure 4, left panel).Using a log-log transformation leads to better small sample properties and similar large sample properties (figure Figure 4, right panel).A larger sample size was necessary for the confidence bands to converge toward the nominal coverage level (n = 5000, figure Figure 5 left panel).Again log-log transformation leads to better small sample properties.We only displayed here the coverage for the absolute risk but similar coverage was obtained for the survival function.

Baseline hazard
We compare the performance of predictCox() regarding the estimate of the baseline hazard function with that of the function survival::basehaz().For this purpose we simulate data with 10 covariates including both continuous and discrete type using the sampleData().
In our performance study we vary the sample size ranging from 500 to 1,000,000 observations and consider both stratified: Surv(time,event) ~strata(X1) + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 and non-stratified Cox regression models: Surv(time,event) ~X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 The computation times were estimated using the rbenchmark package (Kusnierczyk, 2012) and averaged across 100 simulated data sets.The memory usage was estimated using the profvis package (Chang and Luraschi, 2017) and averaged across 10 simulations.When the execution time of the function was extremely fast (i.e.<0.005s), the memory usage could not be reliably assessed and was set to NA.  varying between 3 and 11 in terms of computation time.The gain in speed is especially expressed in large datasets.Memory usage is also lower for predictCox() and decreases by a factor between 1 and 1.6 as compared to basehaz().

Absolute risk
We now compare two implementations for computing the standard errors of the absolute risk.The default implementation corresponds to setting the argument store.iid to "full" when calling predict.CauseSpecificCox() or predictCox() while the second is obtained by setting store.iid to "minimal".The two implementations are described in more detail appendix B.
First we compare their computation time and memory consumption when making only one prediction.As before, we simulated datasets for K = 2 using the SimCompRisk() function with increasing sample size.Then, the two cause-specific Cox models were fitted to each of the simulated datasets using riskRegression::CSC.Then we measured the computation time and memory usage necessary to estimate the absolute risk with its standard error for the first observation at time 4, when using the argument store.iid="minimal"or store.iid="full" in predict.CauseSpecificCox().
The results are show on figure Figure 8.While both implementations lead to a similar computation time, the memory usage for store.iid="minimal"grows linearly while for store.iid="full" it grows approximately in n 1.4 .However, if instead of estimating the absolute risk with its standard error for one prediction, we estimate it for all the observations the implementation store.iid="minimal"becomes slower compared to store.iid="full"(e.g.28.5 minutes vs. 6 minutes for n=2000).
The R Journal Vol.9/2, December 2017 ISSN 2073-4859   The R Journal Vol.9/2, December 2017 ISSN 2073-4859 The R Journal Vol.9/2, December This option also applies when computing standard errors, confidence intervals, or confidence bands for the survival probabilities:

Figure 1 :
Figure 1: Box-arrow diagram showing the three states of the competing risk model and the number of observed transitions in the Melanoma data set.

Figure 2 :
Figure 2: Non-parametric estimation of the absolute risk of cancer related death over time obtained using the Aalen-Johansen estimator.The right hand side of the formula in the call of the CSC() function:

Figure 4 :
Figure 4: Coverage of the asymptotic confidence interval of the absolute risk plotted against the sample size.Each color corresponds to a prediction time.The figure is only shown for samples size below 2000 since for larger sample sizes the coverage is always approximately equal to 0.95.Left panel: confidence intervals are computed on the original scale.Right panel: confidence intervals are computed on the log-log scale and back-transformed.
Memory usage and computation time are displayed on figure Figure 6 and Figure 7.Both functions lead to a reasonable computation time (<1 min) even when considering very large datasets (e.g.>10,000 observations).Nevertheless the predictCox() function outperforms the basehaz() function by a factorThe R Journal Vol.9/2, December 2017 ISSN 2073-4859

Figure 5 :
Figure 5: Coverage of the asymptotic confidence band of the absolute risk plotted against the sample size.Left panel: confidence intervals are computed on the original scale.Right panel: confidence intervals are computed on the log-log scale and back-transformed.

Figure 6 :
Figure 6: The computation time (in seconds) of predictCox() and basehaz() plotted against the sample size for a Cox model (left panel) and a stratified Cox model (right panel).The x axis is displayed using a logarithmic scale but its labels refer to the (untransformed) sample size.The curves represent the median values over 100 simulations while the shaded areas represent the 95% confidence intervals.

Figure 7 :
Figure 7: The memory usage (in megabytes) of predictCox() and basehaz() plotted against the sample size for a Cox model (left panel) and a stratified Cox model (right panel).The x axis is displayed using a logarithmic scale but its labels refer to the (untransformed) sample size.The curves represent the median values over 100 simulations while the shaded areas represent the 95% confidence intervals.

Figure 8 :
Figure 8:The computation time (left panel) and memory usage (right panel) for computing the absolute risk with its standard error for one observation plotted against the sample size, setting the argument store.iid to "full" or "minimal" when calling predict.CauseSpecificCox().The curves represent the median values over 100 simulations while the shaded areas represent the 95% confidence intervals.
To conveniently extract a subset of the results, one should first call as.data.table.predictCSC() to combine these results into a "data.table"object.Here is an example: The function as.data.table.predictCox()makes it easy to extract subsets from "predictCox" object: