In the presence of competing risks a prediction of the time-dynamic absolute risk of an event can be based on cause-specific Cox regression models for the event and the competing risks (Benichou and Gail 1990). We present computationally fast and memory optimized C++ functions with an R interface for predicting the covariate specific absolute risks, their confidence intervals, and their confidence bands based on right censored time to event data. We provide explicit formulas for our implementation of the estimator of the (stratified) baseline hazard function in the presence of tied event times. As a by-product we obtain fast access to the baseline hazards (compared to survival::basehaz()
) and predictions of survival probabilities, their confidence intervals and confidence bands. Confidence intervals and confidence bands are based on point-wise asymptotic expansions of the corresponding statistical functionals. The software presented here is implemented in the riskRegression package.
Predictions of hazards and risks based on a Cox regression analysis need
to be fast and memory efficient, especially in large data, in simulation
studies, and for cross-validation or bootstrap loops. The CRAN task view
Survival lists many R
packages implementing the Cox regression model and extensions thereof.
Among the most popular routines are the function coxph()
from the
survival package
(Therneau 2017) and the function cph()
from the
rms package (Harrell Jr 2017).
We present a fast and memory efficient algorithm to extract baseline
hazards and predicted risks with confidence intervals from an object
obtained with either of these functions.
In the presence of competing risks one needs to combine at least two Cox
regression models to predict the absolute risk of an event (cumulative
incidence) conditional on covariates (Benichou and Gail 1990). We
present the CSC()
-function of the R package
riskRegression
which fits the Cox regression models using either coxph()
or cph()
.
We also present a concomitant predict()
S3 method which computes the
absolute risks of the event of interest for given combinations of
covariate values and time points. Optionally, the predict()
method
computes asymptotic confidence intervals and confidence bands for the
predicted absolute risks. We review the formula behind the estimators
implemented and illustrate the R interface.
It is possible to obtain the predictions of absolute risks based on cause-specific Cox regression also with the survival package or with the mstate package (Putter et al. 2016). However, both require more work from the user. Finally, it should be noted that there are alternative regression methods for absolute risks in the presence of competing risks such as Fine-Gray regression (Fine and Gray 1999) or direct binomial regression (Scheike et al. 2008; Gerds et al. 2012).
For the sole purpose of illustration we use the Melanoma
data set
which is included in the riskRegression package. It contains data
from 205 malignant melanoma patients. Among the risk factors for cancer
specific death were patient age and sex and the histological variables
tumor thickness, invasion (levels 0,1,2), and epithelioid cells (no
present vs. present). Within the limitation of the follow-up periods, it
was observed that 57 patients had died from cancer (“status” equals 1)
and 14 had died from other causes (“status” equals 2). The remaining
patients were right censored (“status” equals 0).
library(riskRegression, verbose = FALSE, quietly = TRUE)
library(survival)
data(Melanoma)
str(Melanoma)
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : int 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : int 1 0 0 0 1 1 1 1 1 1 ...
We denote by
We consider a setting in which the event time
Given covariates
The cause-specific Cox regression models can be combined into a
prediction of the absolute risk of an event of type
or the exponential approximation:
Formula (2) generalizes to situations with more than 2
competing risks, i.e.,
The first step is to fit the Cox regression models with the CSC()
function in order to estimate
cfit0 <- CSC(formula = Hist(time,status) ~ age + logthick + epicel + strata(sex),
data = Melanoma)
coef(cfit0)
## $`Cause 1`
## age logthick epicelpresent
## 0.01548722 0.68178505 -0.73848649
##
## $`Cause 2`
## age logthick epicelpresent
## 0.07680909 0.04750975 0.31497177
In the call of CSC()
the argument formula
is used to define the
outcome variables with the help of the function prodlim::Hist()
. The
variable “time” in the data set contains the values of the observed
event time prodlim::Hist()
have a
print method:
h <- with(Melanoma, prodlim::Hist(time,status))
h
## Right-censored response of a competing.risks model
##
## No.Observations: 205
##
## Pattern:
##
## Cause event right.censored
## 1 57 0
## 2 14 0
## unknown 0 134
and a plot method:
plot(h, arrowLabelStyle = "count",
stateLabels = c("Radical\noperation", "Cancer\nrelated death", "Death\nother causes"))
A nice complement to the regression models is the marginal Aalen-Johansen estimate of the absolute risk of cancer related death (2):
library(prodlim)
plot(prodlim(Hist(time,status) ~1, data = Melanoma),
atrisk.at = seq(0,3652.5,365.25), xlim = c(0,3652.5),
axis1.at = seq(0,3652.5,365.25), axis1.lab = 0:10,
xlab = "Years", ylab = "Absolute risk", cause = 1)
The right hand side of the formula in the call of the CSC()
function:
Hist(time,status) ~ age + logthick + epicel + strata(sex)
defines the covariate(s) strata()
, as one would do when using the coxph()
function.
If only one formula is provided, the CSC()
function will use the same
baseline covariates and strata variables for all cause-specific Cox
regression models. Instead one may feed a list of formulas into the
argument formula
, one for each cause:
cfit1 <- CSC(formula = list(Hist(time,status) ~ age + logthick + epicel + strata(sex),
Hist(time,status) ~ age + strata(sex)),
data = Melanoma)
coef(cfit1)
## $`Cause 1`
## age logthick epicelpresent
## 0.01548722 0.68178505 -0.73848649
##
## $`Cause 2`
## age
## 0.07919648
Note that the choice of the baseline covariates relative to each cause
made here is not based on clinical or statistical criteria; it was done
to illustrate the software possibilities. The causes are internally
ordered with respect to the levels of the variable “status”, if this
variable is a factor, and otherwise with respect to
sort(as.character(unique(status)))
. The order of the causes is saved
as cfit1[["causes"]]
. Accordingly, the first formula is used to fit a
Cox regression model to the first cause and the second formula is used
to fit a Cox regression model to the second cause and so on. Internally,
CSC()
constructs dummy variables, one for each cause, and then calls
the function defined by the argument fitter
on a suitably constructed
Surv()
formula. By default the cause-specific Cox models are fitted
with the function survival::coxph()
. Alternatively, one can set the
argument fitter
to the name of a different routine, e.g., cph
.
The object obtained with CSC()
has class "CauseSpecificCox"
. The
second step is to call the corresponding predict()
method. In addition
to the object obtained with CSC()
this requires three additional
arguments: newdata
, times
, cause
. The argument newdata
should be
a "data.frame"
which contains the covariates CSC()
. The argument cause
defines the
cause of interest times
defines a vector of
prediction horizon(s) whose values are used as the upper integration
limit predict()
method computes
the absolute risks (formula (2)) for each row in newdata
and each value of times
:
newdata <- data.frame(age = c(45,67), logthick = c(0.1,0.2),
epicel = c("present","not present"),
sex = c("Female","Male"))
pfit1 <- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500))
By default, the product integral estimator is used to estimate the
event-free survival function. Setting the argument productLimit
to
FALSE
when calling the predict function enables to use the exponential
approximation. The predict function returns a structured list of class
"predictCSC"
. The corresponding print()
method calls
as.data.table.predictCSC()
to display the predictions as follows:
print(pfit1)
## observation age logthick epicel sex times strata absRisk
## 1: 1 45 0.1 present Female 867 sex=Female 0.021
## 2: 2 67 0.2 not present Male 867 sex=Male 0.149
## 3: 1 45 0.1 present Female 3500 sex=Female 0.117
## 4: 2 67 0.2 not present Male 3500 sex=Male 0.428
For each row in newdata
(values are repeated for each prediction
horizon) and each prediction horizon (column times
) the column
“absRisk” contains the absolute risk of cancer specific mortality (cause
1). Standard errors and confidence intervals for the absolute risk can
be obtained setting the argument se
to TRUE
:
pfit1se <- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500),
se = TRUE, keep.newdata = FALSE)
print(pfit1se)
## observation times strata absRisk absRisk.se absRisk.lower absRisk.upper
## 1: 1 867 sex=Female 0.021 0.122 0.00738 0.0478
## 2: 2 867 sex=Male 0.149 0.161 0.07356 0.2502
## 3: 1 3500 sex=Female 0.117 0.151 0.05552 0.2025
## 4: 2 3500 sex=Male 0.428 0.320 0.20416 0.6355
Here we have set the argument keep.newdata
to FALSE
to not export
the value of the covariates. The structure of the "predictCSC"
object
is as follows.
str(pfit1se)
## List of 11
## $ absRisk : num [1:2, 1:2] 0.021 0.149 0.117 0.428
## $ absRisk.se : num [1:2, 1:2] 0.122 0.161 0.151 0.32
## $ absRisk.lower : num [1:2, 1:2] 0.00738 0.07356 0.05552 0.20416
## $ absRisk.upper : num [1:2, 1:2] 0.0478 0.2502 0.2025 0.6355
## $ times : num [1:2] 867 3500
## $ strata : Factor w/ 2 levels "sex=Female","sex=Male": 1 2
## $ conf.level : num 0.95
## $ se : logi TRUE
## $ band : logi FALSE
## $ nsim.band : num 10000
## $ transformation.absRisk:function (x)
## - attr(*, "class")= chr "predictCSC"
The elements $absRisk
, $absRisk.se
, $absRisk.lower
and
$absRisk.upper
are matrices where each row corresponds to a row in
newdata
and each column to a value of the times
vector. All these
matrices are sorted according to the original orders of the arguments
newdata
and times
. To conveniently extract a subset of the results,
one should first call as.data.table.predictCSC()
to combine these
results into a "data.table"
object. Here is an example:
ptable1 <- as.data.table(pfit1se)
ptable1[times == 3500 & observation == 1,
.(times,absRisk,absRisk.lower,absRisk.upper)]
## times absRisk absRisk.lower absRisk.upper
## 1: 3500 0.1166383 0.05551508 0.2025222
In the same way confidence bands can be obtained by setting the argument
band
to TRUE
:
vec.times <- cfit1$eventTimes
pfit1band <- predict(cfit1, newdata = newdata[1], cause = 1,
times = vec.times, se = TRUE, band = TRUE)
newdata[1]
## age logthick epicel sex
## 1: 45 0.1 present Female
By default 10,000 simulations will be used to estimate the appropriate
quantile for the confidence bands (see explanations in section
Construction of the confidence bands). This default
behavior can be changed by setting the argument nsim.band
to another
value. The autoplot()
function can then be used to compare confidence
bands and the confidence intervals:
figure3 <- autoplot(pfit1band, band = TRUE, ci = TRUE)$plot
figure3 <- figure3 + xlab("Time (days)") + ylab("Absolute risk")
print(figure3)
Note that the resulting object is a "ggplot"
graphic. This can be
useful to personalize the graph, e.g. change the font size of the text.
In this section we describe the asymptotic formula behind the confidence
intervals for the predicted absolute risks and the empirical counterpart
which is implemented in riskRegression. We assume a sample
The asymptotic confidence intervals for the covariate specific absolute
risks of event 1 before time log.transform
needs to be set to FALSE
when calling predict.CauseSpecificCox()
:
pfit2se <- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500),
se = TRUE, log.transform = FALSE, keep.newdata = FALSE)
print(pfit2se)
## observation times strata absRisk absRisk.se absRisk.lower absRisk.upper
## 1: 1 867 sex=Female 0.021 0.00992 0.00157 0.0404
## 2: 2 867 sex=Male 0.149 0.04586 0.05945 0.2392
## 3: 1 3500 sex=Female 0.117 0.03795 0.04226 0.1910
## 4: 2 3500 sex=Male 0.428 0.11620 0.20021 0.6557
Here the column “absRisk.se” contains
The influence function
For fixed cause
In absence of strata, formula (7) is equal to formula 2 of
(Reid 1981) and formula (8) equals the one given in
Gerds and Schumacher (2001) [top of page 576; note however that there is a sign mistake
in their first term]. To connect these formulas with formula
(6) it remains to note that under the Cox regression
model the influence function of the cause-specific hazard rate can be
written as:
The following formulas are obtained with the plug-in principle
substituting the Cox partial likelihood estimates
A confidence band with confidence level
In figure 3,
The function predict.CauseSpecificCox()
calls two important functions,
predictCox()
that computes the hazard and cumulative hazard for a
fitted Cox model, and iidCox()
that computes the influence function
for the baseline hazard and regression coefficients. In this section we
first explain how predictCox()
deals with ties in the event times. We
then show that the function predictCox()
can also be used to obtain
confidence intervals and bands for covariate specific survival
probabilities in the situation without competing risks. Implementation
details about the function iidCox()
are postponed to appendix B. These
details may be useful for programmers who need to care about memory
usage.
We speak of ties when two or more observations have the same value of
the time variable help(coxph)
) for the partial likelihood
estimator of the log hazard ratios
Both estimators are implemented in the function predictCox()
which
provides estimates of the baseline hazard, the cumulative baseline
hazard and baseline survival. The predictCox()
function does not have
an argument to specify whether Breslow method or Efron method should be
used; instead it uses the same method that has been used to estimate the
regression coefficients. In the case of the coxph()
function, the
default method is Efron:
f1 <- coxph(Surv(time,status != 0) ~ age + logthick + epicel + strata(sex),
data = Melanoma, x = TRUE, y = TRUE)
f1$method
## [1] "efron"
Therefore the baseline hazard will be estimated using the Efron method
when calling predictCox()
:
baseH1 <- predictCox(f1)
as.data.table(baseH1[c("time","cumhazard","strata","survival")])
## time cumhazard strata survival
## 1: 99 0.00623279 Female 0.9937866
## 2: 232 0.01256406 Female 0.9875145
## 3: 279 0.01897728 Female 0.9812017
## 4: 295 0.02555481 Female 0.9747690
## 5: 355 0.03221850 Female 0.9682950
## ---
## 199: 3909 0.69673810 Male 0.4982078
## 200: 4119 0.69673810 Male 0.4982078
## 201: 4207 0.69673810 Male 0.4982078
## 202: 4310 0.69673810 Male 0.4982078
## 203: 4492 0.69673810 Male 0.4982078
The cumulative baseline hazard and baseline survival are displayed in
the columns “cumHazard”, and “survival” of the output. This corresponds,
respectively, to
predictCox(f1, newdata = Melanoma[c(17,101,123),],
times = c(7,3,5)*365.25)
## observation times strata cumhazard survival
## 1: 1 2557 Male 0.884 0.413
## 2: 2 2557 Female 0.555 0.574
## 3: 3 2557 Female 0.949 0.387
## 4: 1 1096 Male 0.453 0.635
## 5: 2 1096 Female 0.202 0.817
## 6: 3 1096 Female 0.346 0.708
## 7: 1 1826 Male 0.670 0.512
## 8: 2 1826 Female 0.366 0.693
## 9: 3 1826 Female 0.626 0.535
In absence of competing risks, the influence function of a Cox model can
be used to estimate confidence intervals for the survival. This can be
done by setting the argument se
to TRUE
when calling the
predictCox()
function:
p1 <- predictCox(f1, newdata = Melanoma[3:5,],
times = c(Melanoma$time[5:7],1000),
se = TRUE, type="survival")
The predictCox()
function output an object of class "predictCox"
.
The print()
method can be used to display the confidence intervals for
the survival computed at different times:
print(p1)
## observation times strata survival survival.se survival.lower survival.upper
## 1: 1 185 Male 0.980 0.616 0.935 0.994
## 2: 2 185 Female 0.985 1.014 0.893 0.998
## 3: 3 185 Male 0.947 0.652 0.823 0.985
## 4: 1 204 Male 0.973 0.567 0.921 0.991
## 5: 2 204 Female 0.985 1.014 0.893 0.998
## 6: 3 204 Male 0.929 0.594 0.791 0.977
## 7: 1 210 Male 0.967 0.503 0.913 0.987
## 8: 2 210 Female 0.985 1.014 0.893 0.998
## 9: 3 210 Male 0.912 0.552 0.762 0.969
## 10: 1 1000 Male 0.864 0.321 0.760 0.925
## 11: 2 1000 Female 0.769 0.286 0.631 0.860
## 12: 3 1000 Male 0.673 0.391 0.426 0.832
Here the confidence intervals were computed using a log-log
transformation. The argument log.transform
can be set to FALSE
to
compute them without using the log-log transformation. Confidence bands
can also be obtained using predictCox()
by setting the argument band
to TRUE
. As for the predict.CauseSpecificCox()
function, 10,000
simulations will be used to compute the confidence bands; this can be
changed specifying the argument nsim.band
. The function
as.data.table.predictCox()
makes it easy to extract subsets from
"predictCox"
object:
p1 <- as.data.table(p1)
p1[times == 185,]
## observation times strata survival survival.se survival.lower survival.upper
## 1: 1 185 Male 0.9801395 0.6163925 0.9350597 0.9940246
## 2: 2 185 Female 0.9845660 1.0137483 0.8927600 0.9978695
## 3: 3 185 Male 0.9470887 0.6524116 0.8226132 0.9849795
To assess the validity of the estimation of the standard error, we
performed a simulation study. The sample size was varied between 50 and
10000. For each sample size, 5000 datasets were simulated using the
SimCompRisk()
function from the riskRegression package. The time
of first event typically ranged between
The true absolute risk was defined as the median of the absolute risks over the 5000 datasets. The coverage of the confidence intervals was computed as the percentage of times that the true absolute risk was inside the confidence interval, at a given time. The coverage of the confidence bands was computed as the percentage of times that the true absolute risk was inside the confidence bands, simultaneously for all of the 8 times.
As expected, the coverage of the confidence intervals is improving with increasing sample size and reaches its nominal level of 95% at around sample size 1000 (figure 4, left panel). Using a log-log transformation leads to better small sample properties and similar large sample properties (figure 4, right panel). A larger sample size was necessary for the confidence bands to converge toward the nominal coverage level (n = 5000, figure 5 left panel). Again log-log transformation leads to better small sample properties. We only displayed here the coverage for the absolute risk but similar coverage was obtained for the survival function.
We compare the performance of predictCox()
regarding the estimate of
the baseline hazard function with that of the function
survival::basehaz()
. For this purpose we simulate data with 10
covariates including both continuous and discrete type using the
sampleData()
.
In our performance study we vary the sample size ranging from 500 to 1,000,000 observations and consider both stratified:
Surv(time,event) ~ strata(X1) + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
and non-stratified Cox regression models:
Surv(time,event) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
The computation times were estimated using the
rbenchmark
package (Kusnierczyk 2012) and averaged across 100 simulated data sets.
The memory usage was estimated using the
profvis package
(Chang and Luraschi 2017) and averaged across 10 simulations. When the execution
time of the function was extremely fast (i.e. <0.005s), the memory
usage could not be reliably assessed and was set to NA
.
Memory usage and computation time are displayed on figure
6 and 7.
Both functions lead to a reasonable computation time (<1 min) even when
considering very large datasets (e.g. >10,000 observations).
Nevertheless the predictCox()
function outperforms the basehaz()
function by a factor varying between predictCox()
and decreases by
a factor between basehaz()
.
predictCox()
and basehaz()
plotted against the sample size for a Cox
model (left panel) and a stratified Cox model (right panel). The x axis
is displayed using a logarithmic scale but its labels refer to the
(untransformed) sample size. The curves represent the median values over
100 simulations while the shaded areas represent the 95% confidence
intervals.predictCox()
and basehaz()
plotted against the sample size for a Cox
model (left panel) and a stratified Cox model (right panel). The x axis
is displayed using a logarithmic scale but its labels refer to the
(untransformed) sample size. The curves represent the median values over
100 simulations while the shaded areas represent the 95% confidence
intervals.We now compare two implementations for computing the standard errors of
the absolute risk. The default implementation corresponds to setting the
argument store.iid
to "full"
when calling
predict.CauseSpecificCox()
or predictCox()
while the second is
obtained by setting store.iid
to "minimal"
. The two implementations
are described in more detail appendix B.
First we compare their computation time and memory consumption when
making only one prediction. As before, we simulated datasets for SimCompRisk()
function with increasing sample size. Then,
the two cause-specific Cox models were fitted to each of the simulated
datasets using riskRegression::CSC
. Then we measured the computation
time and memory usage necessary to estimate the absolute risk with its
standard error for the first observation at time 4, when using the
argument store.iid="minimal"
or store.iid="full"
in
predict.CauseSpecificCox()
.
The results are show on figure 8. While both
implementations lead to a similar computation time, the memory usage for
store.iid="minimal"
grows linearly while for store.iid="full"
it
grows approximately in store.iid="minimal"
becomes slower compared to store.iid="full"
(e.g.
store.iid
to
"full"
or "minimal"
when calling predict.CauseSpecificCox()
. The
curves represent the median values over 100 simulations while the shaded
areas represent the 95% confidence
intervals.This paper introduces new features of the riskRegression package
for prediction of absolute risks from cause-specific Cox regression
models using computationally efficient functions. 1
summarizes the main functions described in this paper. Confidence
intervals and confidence bands for the absolute risks can be computed
using the predict()
function and displayed using the print()
method.
The predictCox()
function can be applied on "coxph"
and "cph"
objects to predict the survival with its confidence interval or
confidence bands. In both cases, the autoplot()
function can display
the predicted risk (or survival) over time. When dealing with small to
moderate sample sizes, we advise to compute confidence intervals or
confidence bands using a log-log transformation (argument
log.transform
).
CSC() |
Fit cause-specific Cox models |
predict() |
Predict covariate specific absolute risks for given time horizons |
iidCox() |
Compute the influence function of the baseline hazard estimator and of the partial likelihood estimator of the regression coefficients |
predictCox() |
Compute the (cumulative) baseline hazard function and predictions of hazards and survival probabilities in new data |
autoplot() |
Graphical display of the predicted risk across time |
The CSC()
function requires a routine to estimate the regression
coefficients of the Cox model. By default, CSC()
calls the coxph()
function from the survival package to do so. This has several
reasons: coxph()
has been thoroughly tested, is reasonably fast,
widely used and provides flexible modeling options. However in specific
contexts, other routines may be more appropriate, e.g. faster. Also,
developers that have implemented their own routine may be interested in
computing the baseline hazard or the influence function.
To be able to apply the predictCox()
and iidCox()
functions on new
classes, one needs to define the methods extracting the necessary
information from the new class. For instance, the
mets package (Holst and Scheike 2017)
contains an efficient routine for estimating Cox models. However
"phreg"
object that are quite different from "coxph"
objects:
library(mets, verbose = FALSE)
Melanoma$entry <- 0
f1.phreg <- phreg(Surv(entry, time, status != 0) ~ age + logthick + epicel +
strata(sex), data = Melanoma)
list(coxph=names(f1),
phreg=names(f1.phreg))
## $coxph
## [1] "coefficients" "var" "loglik" "score"
## [5] "iter" "linear.predictors" "residuals" "means"
## [9] "concordance" "method" "n" "nevent"
## [13] "terms" "assign" "wald.test" "x"
## [17] "strata" "y" "formula" "xlevels"
## [21] "contrasts" "call"
##
## $phreg
## [1] "coef" "ploglik" "gradient" "hessian" "U" "S0"
## [7] "nevent" "ord" "time" "jumps" "jumptimes" "strata"
## [13] "entry" "exit" "status" "p" "X" "id"
## [19] "opt" "call" "model.frame"
Therefore to be able to use the predictCox()
and iidCox()
functions
on "phreg"
one needs to define methods to extract:
(CoxCenter()
method). Instead of working on CoxCenter()
method returns
riskRegression:::coxCenter.coxph(f1)
## age logthick epicelpresent
## 52.4634146 0.6181706 0.4341463
(CoxDesign()
method). The first two columns describe the beginning
and the end of the interval of time when the individual was
followed. The third contains the event type (0 corresponding to
censoring and 1 to an observed event, e.g. death). The remaining
columns contain the design matrix corresponding to the coefficients
head(riskRegression:::coxDesign.coxph(f1))
## start stop status age logthick epicelpresent strata
## 1 0 10 1 76 1.9110229 1 2
## 2 0 30 1 56 -0.4307829 0 2
## 3 0 35 0 41 0.2926696 0 2
## 4 0 99 1 71 1.0647107 0 1
## 5 0 185 1 52 2.4915512 1 2
## 6 0 204 1 28 1.5769147 0 2
(CoxFormula()
method):
riskRegression:::coxFormula.coxph(f1)
Surv(time, status != 0) ~ age + logthick + epicel + strata(sex)
CoxLP()
method). This function has three arguments:
object
, data
, and center
.
head(riskRegression:::coxLP.coxph(f1, data = NULL, center = FALSE))
## [1] 2.433793 1.136828 1.168604 2.332216 2.165533 1.559629
When setting data
to NULL
, the CoxLP()
method will return the
linear predictor computed on the dataset used to fit the Cox model. The
center
argument indicates whether the covariates should be centered
before computing the linear predictor.
(CoxN()
method):
riskRegression:::coxN.coxph(f1)
## [1] 205
(CoxSpecialStrata()
method):
riskRegression:::coxSpecialStrata.coxph(f1)
## [1] "strata"
(CoxStrata()
method). This variable must be univariate,
aggregating all the strata variables.
head(riskRegression:::coxStrata.coxph(f1, data = NULL, strata.vars = "strata(sex)"))
## [1] Male Male Male Female Male Male
## Levels: Female Male
Similarly to CoxLP()
, when setting data
to NULL
, the CoxStrata()
method will return the strata variable computed on the dataset used to
fit the Cox model.
(CoxVarCov()
method):
riskRegression:::coxVarCov.coxph(f1)
## age logthick epicelpresent
## age 6.553939e-05 -0.0002102408 -0.0004363469
## logthick -2.102408e-04 0.0206977373 0.0076882093
## epicelpresent -4.363469e-04 0.0076882093 0.0692047486
Most of the above methods correspond to a very small piece of code that reformats the information contained in the object, e.g.:
riskRegression:::coxVarCov.coxph
## function (object)
## {
## Sigma <- object$var
## if (!is.null(Sigma)) {
## coefName <- names(coef(object))
## colnames(Sigma) <- coefName
## rownames(Sigma) <- coefName
## }
## return(Sigma)
## }
## <environment: namespace:riskRegression>
We refer to the help page of each method for a more precise description
of each method arguments and expected output, as well as examples. Once
all of the methods have been defined for a new object (e.g. "phreg"
),
predictCox()
and iidCox()
can be applied on the new object:
all.equal(predictCox(f1), predictCox(f1.phreg))
## [1] TRUE
all.equal(iidCox(f1), iidCox(f1.phreg))
## [1] TRUE
The function iidCox()
computes the estimates of the influence
functions
f1.iid <- iidCox(f1)
The default implementation (store.iid="full"
) stores the influence
function for the baseline cumulative hazard as a list of matrices, one
for each strata. The size of each matrix is the number of observations
predict()
method.
When dealing with very large datasets, e.g. following store.iid
to "minimal"
when calling iidCox
will return the
influence function using this alternative storage method:
f2.iid <- iidCox(f1, store.iid = "minimal")
We can compare the memory cost of the default implementation vs. the alternative one:
size.f1.iid <- object.size(f1.iid$IFcumhazard)
size.f2.iid <- object.size(f2.iid$calcIFhazard)
as.numeric(size.f2.iid/size.f1.iid)
## [1] 0.08627399
and see that the memory use for storing the influence function of the cumulative hazard has been divided by more than 10.
Computing the influence function of the absolute risk with the default implementation only requires to:
When using the alternative implementation,
The two implementations will lead to the same influence function and therefore the same confidence intervals or confidence bands.
The alternative implementation is performed iterating over the set of
covariates used to make the predictions, avoiding to store the influence
function of the baseline hazard for all event times and strata. It will
also not compute the influence function at unnecessary times and strata.
Thus it should always be more memory efficient and, when asking for a
single prediction, it should also be have a lower computation time.
However, compared to the default implementation where
Therefore when the prediction is to be made for many different sets of covariates (i.e. new patients) this may lead to a substantial increase in computation time. See the subsection Runtime and memory usage for more details.
To use the implementation for computing the standard errors, confidence
intervals, or confidence bands one must set the argument store.iid
to
"minimal"
:
pfit3se <- predict(cfit1, newdata = newdata, cause = 1, times = c(867,3500),
se = TRUE, store.iid = "minimal", keep.newdata = FALSE)
print(pfit3se)
## observation times strata absRisk absRisk.se absRisk.lower absRisk.upper
## 1: 1 867 sex=Female 0.021 0.122 0.00738 0.0478
## 2: 2 867 sex=Male 0.149 0.161 0.07356 0.2502
## 3: 1 3500 sex=Female 0.117 0.151 0.05552 0.2025
## 4: 2 3500 sex=Male 0.428 0.320 0.20416 0.6355
range(pfit3se$absRisk.se-pfit1se$absRisk.se)
range(pfit3se$absRisk.upper-pfit1se$absRisk.upper)
range(pfit3se$absRisk.lower-pfit1se$absRisk.lower)
## [1] -2.775558e-17 5.551115e-17
## [1] -5.551115e-17 0.000000e+00
## [1] -5.551115e-17 0.000000e+00
This option also applies when computing standard errors, confidence intervals, or confidence bands for the survival probabilities:
p2 <- predictCox(f1, newdata = Melanoma[3:5,],
times = c(Melanoma$time[5:7],1000),
se = TRUE, store.iid = "minimal", type="survival")
p2 <- as.data.table(p2)
p2[times == 185, survival.lower]-p1[times == 185, survival.lower]
p2[times == 185, survival.upper]-p1[times == 185, survival.upper]
## [1] 0.000000e+00 1.110223e-16 -1.110223e-16
## [1] 0 0 0
survival, rms, riskRegression, mstate, rbenchmark, profvis, mets
CausalInference, ClinicalTrials, Econometrics, Epidemiology, HighPerformanceComputing, ReproducibleResearch, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ozenne, et al., "riskRegression: Predicting the Risk of an Event using Cox Regression Models", The R Journal, 2017
BibTeX citation
@article{RJ-2017-062, author = {Ozenne, Brice and Sørensen, Anne Lyngholm and Scheike, Thomas and Torp-Pedersen, Christian and Gerds, Thomas Alexander}, title = {riskRegression: Predicting the Risk of an Event using Cox Regression Models}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {2}, issn = {2073-4859}, pages = {440-460} }