The Cox proportional hazards model is one of the most frequently used estimators in duration (survival) analysis. Because it is estimated using only the observed durations’ rank ordering, typical quantities of interest used to communicate results of the Cox model come from the hazard function (e.g., hazard ratios or percentage changes in the hazard rate). These quantities are substantively vague and difficult for many research audiences to understand. We introduce a suite of methods in the R package coxed to address these problems. The package allows researchers to calculate duration-based quantities from Cox model results, such as the expected duration (or survival time) given covariate values and marginal changes in duration for a specified change in a covariate. These duration-based quantities often match researchers’ substantive interests more closely and are easily understood by most readers. We describe the methods and illustrate use of the package.
The Cox proportional hazards model (Cox 1972) is frequently used for duration (survival) analysis in a myriad of disciplines including the health sciences, social sciences, operations research, and engineering. For many researchers who employ the Cox model, the chief concept of substantive interest is the duration of an event, such as the survival time of a patient or the duration of a civil war. However, the standard methods of reporting results from the Cox model—which are based in the hazard function—communicate no specific information about duration. As a result, standard interpretations of Cox model results are often substantively vague and difficult for many research audiences to understand.
Here we introduce an R package implementation of the Cox proportional hazards model with expected durations, or Cox ED (Kropko and Harden 2020). The Cox ED suite of methods available in the coxed package provides a more intuitive approach to communicating results from the Cox model. Specifically, it computes duration-based quantities of interest, such as the expected time until event occurrence according to the estimated model. These quantities have long been available with parametric duration models, but in some instances researchers may not wish to make the distributional assumptions required of those estimators. The Cox ED methods allow researchers to stay within the Cox model framework, but communicate results in the language of time. This affords more conceptual precision when conversing with other researchers and makes the results of the analysis more intuitive and accessible for general audiences.
The goal of Cox ED is to generate expected durations for individual observations and/or marginal changes in expected duration given a change in a covariate from the Cox model. Specifically, the methods can compute (1) the expected duration for each observation used to fit the Cox model, given the covariates, (2) the expected duration for a “new” observation with a covariate profile set by the analyst, or (3) the first difference, or change, in expected duration given two new observations.
There are two different methods of generating duration-based quantities
in the package. The first method employs a generalized additive model
(GAM) to map the model’s exponentiated linear predictor values to
duration times. The second method calculates expected durations by using
nonparametric estimates of the baseline hazard and survivor functions.
We present overviews of these methods here. See Kropko and Harden (2020) for
additional details, including simulation results comparing the two
methods. Importantly, both approaches use coefficient estimates from the
Cox model, so researchers must first estimate the model just as they
always have. Cox ED is a postestimation procedure, not a
new estimator. All of the choices required of applied researchers in
estimating the Cox model must be made first, at the estimation stage,
before proceeding to implement Cox ED.
The GAM approach to Cox ED proceeds according to five
steps. As noted above, the first step is model estimation. Then the
method computes an expected value of risk for each observation by
matrix-multiplying the covariates by the estimated Cox model coefficients and
exponentiating the result, yielding the exponentiated linear predictor (ELP).
The observations are then ranked according to their ELP values.
The next step is to connect the model’s expected risk for each
observation (ELP) to duration time (the observed durations). A GAM fits
a model to data by using a series of locally-estimated polynomial
splines set by the user (Hastie and Tibshirani 1990). It is a flexible means of
allowing for the possibility of nonlinear relationships between
variables. Cox ED uses a GAM to model the observed
durations as a function of the linear predictor ranks generated in the
previous step. More specifically, the method utilizes a cubic regression
spline to draw a smoothed line summarizing the bivariate relationship
between the observed durations and the ranks (for more details, see Wood 2006, 2011).
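To fix ideas, the following is a minimal sketch of this mapping using the lung data from the survival package and mgcv::gam. It illustrates the general approach, not the coxed implementation itself; in particular, its handling of censored cases (fitting the GAM to observed failures only) is a simplification we introduce for this sketch.

library(survival)   # Cox model and the lung example data
library(mgcv)       # gam() with cubic regression spline smooths

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
cox.fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = d)

# Exponentiated linear predictor (ELP): higher values indicate higher risk
d$elp <- exp(predict(cox.fit, type = "lp"))
d$rank.elp <- rank(d$elp)

# Map ELP ranks to observed durations with a cubic regression spline;
# this sketch fits the GAM to observed failures only
obs <- d[d$status == 2, ]
gam.fit <- gam(time ~ s(rank.elp, bs = "cr"), data = obs)

# Expected durations, given the covariates, for every observation
d$expected.dur <- predict(gam.fit, newdata = d)

In coxed this mapping is handled internally; the sketch only shows why a flexible spline is needed to translate rank information into the scale of time.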
The GAM fit can be used directly to compute expected durations, given
the covariates, for each observation in the data. However, for most
researchers it is more useful to assess how a change to a particular
covariate of interest corresponds to changes in expected duration. In
order to examine such marginal changes, it is necessary to create two or
more “new” observations corresponding to theoretically-interesting,
hypothetical covariate profiles. For example, the analyst might set an
indicator variable to 0 and 1 or a continuous variable to a “low” and a
“high” value. Cox ED allows the covariates in the model to
vary naturally over the entire data, then averages over them in the
computations.
(a) Set the covariate of interest to the first of the two chosen values to create a new observation, and compute the ELP for that observation.
(b) Repeat step (a) while setting the covariate of interest to the second chosen value.
(c) Take the values obtained in steps (a) and (b) and append them to the list of ELP values from the original Cox model, in which all covariates take their observed values, and compute the rank ordering of the combined list.
(d) Pass the list of rankings from step (c) to the GAM as new data to generate expected values. Note that a new GAM is not estimated at this step. Rather, expected durations are generated for each observation—including the two new ones created in steps (a) and (b)—using the previously estimated GAM. This produces point estimates of the expected durations for those two new observations.
(e) Compute the difference between the two estimates obtained in step (d): the expected duration for the data in which the covariate of interest takes its first value and the expected duration for the data in which it takes its second value. This difference is the estimated marginal change in expected duration.
Finally, to produce estimates of uncertainty, the GAM approach repeats
this process via bootstrapping. The method generates bootstrap samples
of the data and re-estimates the Cox model coefficients on each
bootstrap sample.
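To make steps (a) through (e) concrete, the sketch below continues the illustrative lung example from above (again, a rough illustration rather than the coxed internals), computing the marginal change in expected duration for a hypothetical change in age from 50 to 70.

# (a)/(b): two new observations that differ only in the covariate of interest
profile <- d[c(1, 1), ]
profile$age <- c(50, 70)
elp.new <- exp(predict(cox.fit, newdata = profile, type = "lp"))

# (c): append the new ELPs to the original list and re-rank
all.ranks <- rank(c(d$elp, elp.new))
rank.new <- tail(all.ranks, 2)

# (d): expected durations for the two profiles from the previously fitted GAM
dur.new <- predict(gam.fit, newdata = data.frame(rank.elp = rank.new))

# (e): the difference is the estimated marginal change in expected duration
diff(dur.new)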
One drawback to the GAM approach is that it uses two statistical models (the Cox model and a GAM), which yields two sources of estimation uncertainty. An alternative approach comes from the method proposed by Cox and Oakes (1984, 107–109) for estimating the cumulative baseline hazard function. This method is nonparametric and results in a step-function representation of the cumulative baseline hazard; we refer to it as the nonparametric step-function (NPSF) approach.
Cox and Oakes (1984, 108) show that the cumulative baseline hazard function can be estimated after fitting a Cox model by

$$\hat{H}_0(t) = \sum_{\tau_j \le t} \frac{d_j}{\sum_{l \in \mathcal{R}(\tau_j)} \exp(\mathbf{x}_l \hat{\boldsymbol{\beta}})}, \qquad (1)$$

where $d_j$ is the number of failures observed at time $\tau_j$, $\mathcal{R}(\tau_j)$ is the risk set at $\tau_j$ (the observations whose durations are at least $\tau_j$), $\mathbf{x}_l$ is the vector of covariates for observation $l$, and $\hat{\boldsymbol{\beta}}$ is the vector of estimated Cox model coefficients.
Tied durations are handled by collapsing the dataset by unique
duration. The method calculates the number of failures and the sum of the
ELP values for the observations at each unique time point.
The NPSF approach calculates a running sum, in reverse, for the collapsed ELPs. That is, at the first time point this sum includes the ELPs of observations at every time point. At the second time point, this sum includes the ELPs of every observation except those with the earliest observed duration. At the last time point, this sum is equal to the sum of only the ELPs of observations with the latest observed duration. These sums represent the denominator of equation (1).
For each time point, the method then divides the number of failures
at that time point by the corresponding risk-set sum from the previous step.
Finally, it calculates the running sum of these ratios; this running sum is the nonparametric estimate of the cumulative hazard function.
This procedure yields a stepwise function. Time points with no failures do not contribute to the cumulative hazard, so the function is flat until the next time point with observed failures.
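The following self-contained sketch traces these steps on the lung data from survival; it illustrates the calculation described above rather than reproducing the coxed code.

library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
cox.fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = d)
d$elp <- exp(predict(cox.fit, type = "lp"))
d$failed <- as.numeric(d$status == 2)   # event indicator in the lung data

# Collapse by unique duration: failures and summed ELPs at each observed time point
agg <- aggregate(cbind(failed, elp) ~ time, data = d, FUN = sum)

# Reverse running sum of ELPs: total ELP of the risk set at each time (denominator of (1))
agg$risk.sum <- rev(cumsum(rev(agg$elp)))

# Divide failures by the risk-set sums, then take a running sum: the stepwise cumulative hazard
agg$H <- cumsum(agg$failed / agg$risk.sum)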
The NPSF approach next obtains expected durations and marginal changes in expected duration by first calculating the baseline survivor function from the cumulative hazard function, using

$$\hat{S}_0(t) = \exp\{-\hat{H}_0(t)\}.$$

Each observation’s survivor function is related to the baseline survivor function by

$$\hat{S}_i(t) = \hat{S}_0(t)^{\exp(\mathbf{x}_i \hat{\boldsymbol{\beta}})},$$

where $\mathbf{x}_i$ is the vector of covariate values for observation $i$ and $\hat{\boldsymbol{\beta}}$ is the vector of estimated Cox model coefficients. Expected durations for each observation, given its covariates, are then computed from this estimated survivor function.
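Continuing the sketch above, the survivor-function step looks like the following. The final two lines show one common way to turn a survivor curve into an expected duration (the area under the curve); the exact rule applied inside coxed may differ in detail.

agg$S0 <- exp(-agg$H)                    # baseline survivor function
S <- outer(agg$S0, d$elp, FUN = "^")     # S[j, i] = S_i(t_j) = S0(t_j)^ELP_i

# Area under each observation's survivor curve, as one summary of expected duration
widths <- diff(c(0, agg$time))
exp.dur <- as.vector(t(S) %*% widths)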
To calculate a marginal effect, the NPSF approach to Cox ED follows the same strategy employed in the GAM approach. It creates two new covariate profiles, setting a variable of interest to two theoretically interesting values. It calculates expected values from each profile, then computes the difference in the two estimates. Finally, the method bootstraps to obtain a standard error and/or confidence intervals for this point estimate.
The methods described above are mostly automated in the package;
analysts generally need only a coxph
model object from the
survival package (Therneau 2015) or a cph
model object from the
rms package (Harrell 2018), and, if covariate effects are desired, the name
of the variable of interest and the two values of that variable they
wish to input.
We illustrate the main features of the package with an empirical example. Martin and Vanberg (2003) examine the determinants of negotiation time among political parties forming coalition governments in Western Europe. The outcome variable in this analysis is the number of days between the beginning and end of the bargaining period. The covariates include the range of government (the ideological distance between the most extreme members of the coalition), the number of parties in the coalition, and several others. Their main hypotheses predict negative coefficients on the range of government and number of parties variables. They expect that increases in the ideological distance between the parties and in the size of the coalition correspond with decreases in the risk of government formation, or, equivalently, longer negotiation times.
The authors demonstrate support for their hypotheses with a sample of
data on bargaining in Western European democracies between 1950 and
1995. They estimate a Cox model, then interpret the covariate effects
with quantities based in the hazard rate. As an alternative, we employ
Cox ED with these data. We use the coxed()
function to
predict bargaining duration for every case in the data. We then test the
first of their hypotheses by computing estimates of bargaining duration
at different values of the ideological range of government.
The first step with Cox ED is to estimate the model. We
estimate the Cox model from Martin and Vanberg (2003) using the Surv()
and
coxph()
functions from the survival package:
library(coxed)
data(martinvanberg)
mv.surv <- Surv(martinvanberg$formdur, event = rep(1, nrow(martinvanberg)))
mv.cox <- coxph(mv.surv ~ postel + prevdef + cont + ident + rgovm + pgovno +
tpgovno + minority, data = martinvanberg)
We report these results in Table 1.
Table 1: Cox model estimates.

| Variable | Estimate |
|---|---|
| Range of government | (0.120) |
| Number of government parties | 1.191 (0.124) |
| Number of government parties | (0.035) |
| Do negotiations commence immediately after an election? | (0.169) |
| Did the government take a parliamentary defeat? | (0.230) |
| Continuation | 1.100 (0.240) |
| Identifiability | 0.146 (0.119) |
| Minority government | (0.208) |
| Observations | 203 |
| R² | 0.745 |
| Max. Possible R² | 1.000 |
| Log Likelihood | |
| Wald Test | 218.130 |
| LR Test | 277.239 |
| Score (Logrank) Test | 279.277 |

Note: Cell entries report Cox model coefficient estimates with standard errors in parentheses.
Next we use the GAM version of coxed() to examine expected durations
and marginal changes in duration. To obtain standard errors and
confidence intervals around the estimates, we specify the bootstrap = TRUE
option. By default the bootstrapping procedure uses 200 iterations (to set
this value to a different number, use the B argument); in applied work we
recommend setting B to a higher value. There are different methods for
calculating a bootstrapped confidence interval. The default method used by
coxed() (setting the argument confidence = "studentized") adds and
subtracts qnorm(level + (1 - level)/2) times the bootstrapped standard
error to and from the point estimate, where level is the analyst’s chosen
threshold for evaluating statistical significance. The alternative approach
is to take the (1 - level)/2 and level + (1 - level)/2 quantiles of the
bootstrapped draws, which can be done by specifying confidence = "empirical".
We recommend a higher number of bootstrap iterations for empirical confidence
intervals. Additionally, nonparametric bias-corrected and accelerated
intervals are available by setting confidence = "bca", which implements the
bias correction and acceleration procedure of DiCiccio and Efron (1996)
using code modified from the mediation package (Tingley et al. 2014).
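As an illustration of these options (a sketch using only the arguments described above), empirical confidence intervals with a larger number of bootstrap iterations could be requested with

ed.emp <- coxed(mv.cox, method = "gam", bootstrap = TRUE, B = 200,
                confidence = "empirical")

Here we keep the default studentized intervals and set B = 30 to limit computation time: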
ed <- coxed(mv.cox, method = "gam", bootstrap = TRUE, B = 30)
Now every predicted duration has a standard error and a 95% confidence interval. The first several cases’ predicted durations are estimated as follows:
> head(ed$exp.dur)
exp.dur bootstrap.se lb ub
1 48.978295 5.6915889 37.8229859 60.133605
2 42.036276 4.8132767 32.6024267 51.470125
3 55.440293 6.8188818 42.0755303 68.805056
4 15.734577 1.7119205 12.3792749 19.089880
5 1.530695 0.3512462 0.8422652 2.219125
6 64.449942 7.7421823 49.2755433 79.624340
The summary()
function, when applied to coxed()
output, reports
either the mean or median estimated duration along with the bootstrapped
standard error and confidence interval for the statistic:
> summary(ed, stat = "mean")
mean bootstrap.se lb ub
28.034 1.998 24.119 31.95
> summary(ed, stat = "median")
median bootstrap.se lb ub
21.208 2.263 16.773 25.643
coxed()
can be used to provide duration predictions for observations
outside of the estimation sample. Suppose that we observe five new cases
and place them inside a data frame:
new.coalitions <- data.frame(postel = c(1, 1, 1, 0, 1),
prevdef = c(0, 0, 1, 1, 0),
cont = c(1, 0, 1, 0, 1),
ident = c(1, 2, 2, 3, 3),
rgovm = c(.3, .8, 1.1, .2, .35),
pgovno = c(2, 3, 3, 2, 4),
tpgovno = c(3.2, 0, 5, 0, 2.6),
minority = c(0, 0, 1, 0, 0))
To forecast durations for these cases along with standard errors and
confidence intervals, we use the coxed()
function and place
new.coalitions
into the newdata
argument:
forecast <- coxed(mv.cox, newdata = new.coalitions, method = "gam",
bootstrap = TRUE, B = 30)
> forecast$exp.dur
exp.dur bootstrap.se lb ub
1 4.5845636 2.7517846 -0.8088352 9.977962
2 0.9542265 0.5656203 -0.1543688 2.062822
3 5.2816962 1.1323172 3.0623953 7.500997
4 1.2358600 0.4684499 0.3177151 2.154005
5 0.5924056 0.7286877 -0.8357961 2.020607
The data used by coxed()
to map rankings to durations are stored in
the gam.data
attribute, and can be used to visualize the fit of the
GAM, as in Figure 1.
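A quick way to produce a plot along these lines is sketched below; the column names rank.xb, y, and gam_fit are assumptions about the contents of gam.data, so check names(ed$gam.data) against your version of the package.

# Sketch of a Figure 1-style plot (column names are assumptions; see names(ed$gam.data))
gd <- ed$gam.data[order(ed$gam.data$rank.xb), ]
plot(gd$rank.xb, gd$y, xlab = "Rank of the exponentiated linear predictor",
     ylab = "Duration")
lines(gd$rank.xb, gd$gam_fit, lwd = 2)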
We use coxed()
to provide an answer to the key question, “how much
longer will negotiations take for an ideologically polarized coalition
as compared to an ideologically homogeneous one?” Specifically, we call
coxed()
and specify two new datasets, one in which rgovm = 0
indicating that all political parties in the governing coalition have
the same ideological position (i.e., a coalition of one party), and one
in which rgovm = 1.24, indicating that the parties have very different
ideological positions (the value rgovm = 1.24 reflects the average
ideological range of coalition governments in the sample). We use mutate()
from the dplyr package (Wickham et al. 2018) to quickly create new data
frames in which rgovm equals 0 or 1.24 for all cases, and set these two
data frames as newdata and newdata2 inside coxed().
library(dplyr)
me <- coxed(mv.cox, method = "gam", bootstrap = TRUE, B = 30,
            newdata = mutate(martinvanberg, rgovm = 0),
            newdata2 = mutate(martinvanberg, rgovm = 1.24))
coxed()
calculates expected durations for all cases under each new
data frame and subtracts the durations for each case. To obtain point
estimates we can request the mean or median difference.
> summary(me, stat = "mean")
mean bootstrap.se lb ub
newdata2 28.927 3.285 22.489 35.365
newdata 25.321 2.632 20.163 30.480
difference 3.605 2.417 -1.133 8.343
> summary(me, stat = "median")
median bootstrap.se lb ub
newdata2 22.392 3.234 16.053 28.730
newdata 19.692 3.449 12.932 26.451
difference 2.928 1.931 -0.857 6.714
These results demonstrate that a coalition in which the parties have average ideological differences will take 3.6 more days on average (with a median of 2.9 days) to conclude negotiations than a coalition in which all parties have the same position (i.e., a single-party government).
The NPSF method can be used to compute estimates of these same
quantities simply by specifying method = "npsf"
in the coxed()
function. Additionally, the package includes a function called
sim.survdata()
designed for simple simulations of duration data that
do not assume a distributional form for the baseline hazard. This
method, which is fully described in Harden and Kropko (2019), can be useful in
several applied and computational settings that involve the Cox model.
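Returning to the NPSF option, the marginal effect of rgovm estimated above could be recomputed with a call such as the following (mirroring the earlier GAM call; shown as a sketch rather than output we report here):

me.npsf <- coxed(mv.cox, method = "npsf", bootstrap = TRUE, B = 30,
                 newdata = mutate(martinvanberg, rgovm = 0),
                 newdata2 = mutate(martinvanberg, rgovm = 1.24))
summary(me.npsf, stat = "mean")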
The Cox model is popular among applied researchers in a wide range of disciplines due to its inherent flexibility. However, this flexibility makes conveying the substantive meaning of results challenging. By using only the rank ordering of the observed duration times, the Cox model limits researchers to interpreting results in the language of hazard and changes in risk. This yields two key problems. First, it is substantively vague because hazard does not have a meaningful scale. This hinders researchers’ capacity to determine whether an estimated effect is substantively “large” or “small.” Second, hazard-based interpretations require specialized knowledge to understand. This makes the research less accessible to general audiences, who might otherwise learn from the work but are kept from doing so by the way results are communicated.
The Cox ED methods provide a solution to these problems by allowing researchers to compute duration-based quantities from the Cox model. Communicating results in the language of time allows for more substantive precision and is intuitive to a broad audience of readers. We demonstrate above that Cox ED is straightforward to implement in R. The coxed package contains functions that allow researchers to use the methods even with minimal knowledge of R. Additionally, the functions are flexible; users can adjust many of their features to suit the problem at hand. Finally, the functions’ output provides point estimates, standard errors, and confidence intervals, so researchers can report their results with appropriate measures of uncertainty.
In sum, the coxed package provides a useful alternative for researchers to communicate results from the Cox model. It gives them the benefits of the intuitive quantities available in parametric models while retaining the desirable estimation properties of the Cox model. Thus, the analysis can be guided by appropriate modeling choices, but reported in an intuitive, accessible manner.