Advancements in medical informatics tools and high-throughput biological experimentation make large-scale biomedical data routinely accessible to researchers. Competing risks data are typical in biomedical studies where individuals are at risk for more than one cause (type of event), the occurrence of which precludes the others from happening. The Fine and Gray (1999) proportional subdistribution hazards model is a popular and well-appreciated model for competing risks data and is currently implemented in a number of statistical software packages. However, current implementations are not computationally scalable for large-scale competing risks data. We have developed an R package, fastcmprsk, that uses a novel forward-backward scan algorithm to significantly reduce the computational complexity of parameter estimation by exploiting the structure of the subject-specific risk sets. Numerical studies compare the speed and scalability of our implementation to current methods for unpenalized and penalized Fine-Gray regression and show impressive gains in computational efficiency.
Competing risks time-to-event data arise frequently in biomedical research when subjects are at risk for more than one type of possibly correlated events or causes and the occurrence of one event precludes the others from happening. For example, one may wish to study time until first kidney transplant for kidney dialysis patients with end-stage renal disease. Terminating events such as death, renal function recovery, or discontinuation of dialysis are considered competing risks as their occurrence will prevent subjects from receiving a transplant. When modeling competing risks data the cumulative incidence function (CIF), the probability of observing a certain cause while taking the competing risks into account, is oftentimes a quantity of interest.
The most commonly-used model to draw inference about covariate effects on the CIF and to predict the CIF conditional on a set of covariates is the Fine-Gray proportional subdistribution hazards model (Fine and Gray 1999). Various statistical packages for estimating the parameters of the Fine-Gray model are popular within the R programming language (Ihaka and Gentleman 1996). One such package, among others, is cmprsk. The riskRegression package, initially implemented for predicting absolute risks (Gerds et al. 2012), uses a wrapper that calls the cmprsk package to perform Fine-Gray regression. Scheike and Zhang (2011) provide timereg, which allows for general modeling of the cumulative incidence function and includes the Fine-Gray model as a special case. The survival package also performs Fine-Gray regression, but does so using a weighted Cox (Cox 1972) model. Over the past decade, there have been several extensions to the Fine-Gray method that also resulted in useful packages. The crrSC package allows for the modeling of both stratified (Zhou et al. 2011) and clustered (Zhou et al. 2012) competing risks data. Kuk and Varadhan (2013) propose a stepwise Fine-Gray selection procedure and develop the crrstep package for its implementation. Fu et al. (2017) then introduce penalized Fine-Gray regression with the corresponding crrp package.
A contributing factor to the computational complexity of general Fine-Gray regression implementations is parameter estimation. Generally, one needs to compute the log-pseudo likelihood and its first and second derivatives with respect to the regression parameters for optimization. Calculating these quantities is typically of order $O(n^2)$, because each term of the log-pseudo likelihood involves a sum over a subject-specific risk set whose size grows with the sample size $n$.
The contribution of this work is the development of an R package, fastcmprsk, which implements a novel forward-backward scan algorithm (Kawaguchi et al. 2020) for the Fine-Gray model. By taking advantage of the ordering of the data and the structure of the risk sets, we can calculate the log-pseudo likelihood and its derivatives, which are necessary for parameter estimation, in linear time, $O(n)$.
The paper is organized as follows. In the next section, we briefly review the basic definition of the Fine-Gray proportional subdistribution hazards model, the CIF, and penalized Fine-Gray regression. We highlight the computational challenge of linearizing estimation for the Fine-Gray model and introduce the forward-backward scan algorithm of Kawaguchi et al. (2020) in Section 3. Then in Section 4, we describe the main functionalities of the fastcmprsk package that we developed for R, which utilizes the aforementioned algorithm for unpenalized and penalized parameter estimation and CIF estimation. We perform simulation studies in Section 5 to compare the performance of our proposed method with some of its popular competitors. The fastcmprsk package is readily available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=fastcmprsk.
We first establish some notation and the formal definition of the data generating process for competing risks. For subject $i$ ($i = 1, \ldots, n$), let $T_i$ and $C_i$ denote the event and (right) censoring times, respectively, let $\epsilon_i \in \{1, \ldots, K\}$ denote the cause of failure, and let $Z_i$ be a $p$-dimensional vector of time-independent covariates. We observe the time $X_i = \min(T_i, C_i)$ and the censoring indicator $\delta_i = I(T_i \le C_i)$; without loss of generality, we take cause 1 to be the event of interest.

The CIF for the primary event conditional on the covariates is $F_1(t; z) = \Pr(T \le t, \epsilon = 1 \mid Z = z)$. The Fine-Gray model links the covariates to the CIF through the subdistribution hazard $\lambda_1(t \mid z) = \lambda_{10}(t) \exp(z^\top \beta)$, where $\lambda_{10}(t)$ is an unspecified baseline subdistribution hazard, so that $F_1(t; z) = 1 - \exp\{-\Lambda_{10}(t)\exp(z^\top\beta)\}$ with $\Lambda_{10}(t) = \int_0^t \lambda_{10}(s)\,ds$.
Parameter estimation and large-sample inference of the PSH model follow from the log-pseudo likelihood

$$ l(\beta) = \sum_{i=1}^{n} I(\delta_i \epsilon_i = 1) \left[ Z_i^\top \beta - \ln\left\{ \sum_{k \in R_i} \hat{w}_k(X_i) \exp(Z_k^\top \beta) \right\} \right], $$

where $R_i = \{k : X_k \ge X_i\} \cup \{k : X_k \le X_i, \, \delta_k = 1, \, \epsilon_k \ne 1\}$ is the risk set at time $X_i$, which retains subjects who have already failed from a competing cause, and $\hat{w}_k(t) = \hat{G}(t)/\hat{G}(\min\{X_k, t\})$ are inverse probability of censoring weights based on the Kaplan-Meier estimate $\hat{G}$ of the survival function of the censoring distribution.

Let $\hat{\beta}$ denote the maximizer of $l(\beta)$. Fine and Gray (1999) establish the consistency and asymptotic normality of $\hat{\beta}$ and provide a sandwich-type estimator of its asymptotic variance-covariance matrix that accounts for the inverse probability of censoring weighting.
An alternative interpretation of the coefficients from the Fine-Gray model is through their effect on the CIF. Using a Breslow-type estimator (Breslow 1974), we can obtain a consistent estimate of the cumulative baseline subdistribution hazard $\Lambda_{10}(t)$ and, consequently, of the predicted CIF $\hat{F}_1(t; z) = 1 - \exp\{-\hat{\Lambda}_{10}(t)\exp(z^\top\hat{\beta})\}$ for a given covariate vector $z$.
Oftentimes researchers are interested in identifying which covariates have an effect on the CIF. Penalization methods (Tibshirani 1996; Fan and Li 2001; Zou 2006; Zhang et al. 2010) offer a popular way to perform variable selection and parameter estimation simultaneously through minimizing the objective function

$$ Q(\beta) = -l(\beta) + \sum_{j=1}^{p} p_\lambda(|\beta_j|), $$

where $l(\beta)$ is the log-pseudo likelihood defined above and $p_\lambda(\cdot)$ is a penalty function (e.g., LASSO, SCAD, or MCP) indexed by one or more tuning parameters.
The sparsity of the fitted model depends heavily on the choice of the tuning parameters. Practically, finding a suitable (or optimal) tuning parameter involves applying a penalization method over a sequence of possible candidate values of $\lambda$ and selecting the value that optimizes some criterion, such as the Bayesian information criterion (BIC) used in Fu et al. (2017).
Whether interest is in fitting an unpenalized model or a series of penalized models used for variable selection, one will need to minimize the negated log-pseudo likelihood (or penalized log-pseudo likelihood). While current implementations can readily fit small to moderately-sized datasets, where the sample size can be in the hundreds to thousands, we notice that these packages grind to a halt for large-scale data, such as electronic health records (EHR) data or cancer registry data, where the number of observations easily exceeds tens of thousands, as illustrated later in Section 5.1 (Table 2) on some simulated large competing risks data.
The primary computational bottleneck for estimating the parameters of the Fine-Gray model is the calculation of the log-pseudo likelihood and its derivatives, which are required for commonly-used optimization routines. For example, the cyclic coordinate descent algorithm requires the score function and the diagonal elements of the Hessian with respect to each coordinate of $\beta$. Each of these quantities involves, for every observed cause-1 failure, a weighted sum of $\exp(Z_k^\top\beta)$ (or of its products with covariate values) over the corresponding risk set $R_i$, so a naive evaluation requires $O(n^2)$ operations.
Before moving forward, we first consider the Cox proportional hazards model for right-censored data, which can be viewed as a special case of the Fine-Gray model when competing risks are absent. In that case the risk set at an observed failure time $X_i$ reduces to $\{k : X_k \ge X_i\}$, which is monotonically decreasing as we move forward through the ordered failure times. Consequently, once the data are sorted by observed time, all risk-set sums can be computed as a single series of cumulative sums, and the log-partial likelihood and its derivatives can be evaluated in $O(n)$ time.
Unfortunately, the risk set associated with the Fine-Gray model does not retain this cumulative structure, because subjects who have already failed from a competing cause remain in the risk set with time-dependent weights. Kawaguchi et al. (2020) propose a novel forward-backward scan algorithm that exploits the fact that each risk-set sum decomposes into a term over subjects still at risk, which can be accumulated by scanning the ordered data backward in time, and a weighted term over earlier competing-cause failures, which can be accumulated by scanning forward in time. This reduces the computational complexity associated with parameter estimation from $O(n^2)$ to $O(n)$.
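To make the idea concrete, below is a minimal illustrative sketch of the forward-backward scan for the risk-set denominators. It is not the package's internal implementation: the helper name fineGrayDenominators is hypothetical, and the sketch assumes the data are sorted by observed time with no ties, that cause codes the event type (1 = event of interest, 2 = competing event, 0 = censored), that G contains the Kaplan-Meier estimate of the censoring survival function evaluated at each observed time, and that eta contains the linear predictors.

# Illustrative sketch (hypothetical helper, not fastcmprsk internals):
# linear-time Fine-Gray risk-set denominators via one backward and one
# forward cumulative sum over data sorted by observed time.
fineGrayDenominators <- function(cause, G, eta) {
  n <- length(eta)
  exp.eta <- exp(eta)
  # Backward scan: subjects still at risk (X_k >= X_i) receive weight 1.
  backward <- rev(cumsum(rev(exp.eta)))
  # Forward scan: subjects with an earlier competing event remain in the
  # risk set with weight G(X_i) / G(X_k).
  comp <- (cause == 2) * exp.eta / G
  forward <- c(0, cumsum(comp)[-n])
  backward + G * forward
}

Each scan touches every subject exactly once, so all n denominators are obtained in O(n) time rather than the O(n^2) cost of recomputing each risk-set sum from scratch.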
We utilize the forward-backward scan algorithm of Kawaguchi et al. (2020) for both penalized and unpenalized parameter estimation for the Fine-Gray model in linear time. Furthermore, we also develop scalable methods to estimate the predicted CIF and its corresponding confidence interval/band. For the convenience of researchers and readers, a function to simulate two-cause competing risks data is also included. Table 1 provides a summary of the functions currently available in fastcmprsk. We briefly detail the use of some of the key functions below.
| Function name | Basic description |
|---|---|
| Modeling functions | |
| fastCrr | Fits unpenalized Fine-Gray regression and returns an object of class "fcrr" |
| fastCrrp | Fits penalized Fine-Gray regression and returns an object of class "fcrrp" |
| Utilities | |
| Crisk | Creates an object of class "Crisk" to be used as the response variable for fastCrr and fastCrrp |
| varianceControl | Options for bootstrap variance for fastCrr |
| simulateTwoCauseFineGrayModel | Simulates two-cause competing risks data |
| S3 methods for "fcrr" | |
| AIC | Generic function for calculating AIC |
| coef | Extracts model coefficients |
| confint | Computes confidence intervals for parameters in the model |
| logLik | Extracts the model log-pseudo likelihood |
| predict | Predicts the cumulative incidence function given newdata using model coefficients |
| summary | Prints ANOVA table |
| vcov | Returns bootstrapped variance-covariance matrix if variance = TRUE |
| S3 methods for "fcrrp" | |
| AIC | Generic function for calculating AIC |
| coef | Extracts model coefficients for each tuning parameter |
| logLik | Extracts the model log-pseudo likelihood for each tuning parameter |
| plot | Plots the coefficient path as a function of the tuning parameter |
Researchers can simulate two-cause competing risks data using the simulateTwoCauseFineGrayModel function in fastcmprsk. The data generation scheme follows a similar design to that of Fine and Gray (1999) and Fu et al. (2017). Given a sample size, coefficient vectors beta1 and beta2 for the event of interest and the competing event, a design matrix Z, the endpoints u.min and u.max of the uniform censoring distribution, and a mixture parameter p that controls the baseline proportion of cause-1 events, the function returns a list containing the observed event times and event indicators (the full data-generating process is described in the Appendix).
R> #### Need the following packages to run the examples in the paper
R> install.packages("cmprsk")
R> install.packages("crrp")
R> install.packages("doParallel")
R> install.packages("fastcmprsk")
R> ###
R> library(fastcmprsk)
R> set.seed(2019)
R> N <- 500 # Set number of observations
R> # Create coefficient vector for event of interest and competing event
R> beta1 <- c(0.40, -0.40, 0, -0.50, 0, 0.60, 0.75, 0, 0, -0.80)
R> beta2 <- -beta1
R> # Simulate design matrix
R> Z <- matrix(rnorm(N * length(beta1)), nrow = N)
R> # Generate data
R> dat <- simulateTwoCauseFineGrayModel(N, beta1, beta2,
+ Z, u.min = 0, u.max = 1, p = 0.5)
R> # Event counts (0 = censored; 1 = event of interest; 2 = competing event)
R> table(dat$fstatus)
0 1 2
241 118 141
R> # First 6 observed survival times
R> head(dat$ftime)
[1] 0.098345608 0.008722629 0.208321175 0.017656904 0.495185038 0.222799124
We first illustrate coefficient estimation by maximizing the Fine-Gray log-pseudo likelihood. The fastCrr function estimates these parameters using our forward-backward scan algorithm and returns an object of class "fcrr". Its syntax is similar to that of the coxph function in survival. The formula argument requires an outcome of class "Crisk". The Crisk function produces this object by calling the Surv function in survival and modifying it to allow for more than one event type; it requires four arguments: a vector of observed event times (ftime), a vector of corresponding event/censoring indicators (fstatus), the value of fstatus that denotes a right-censored observation (cencode), and the value of fstatus that denotes the event of interest (failcode). By default, Crisk assumes that cencode = 0 and failcode = 1. The variance argument passed into fastCrr is a logical that specifies whether or not the variance should be calculated alongside parameter estimation.
# cmprsk package
R> library(cmprsk)
R> fit1 <- crr(dat$ftime, dat$fstatus, Z, failcode = 1, cencode = 0,
+ variance = FALSE)
# fastcmprsk package
R> fit2 <- fastCrr(Crisk(dat$ftime, dat$fstatus, cencode = 0, failcode = 1) ~ Z,
+ variance = FALSE)
R> max(abs(fit1$coef - fit2$coef)) # Compare the coefficient estimates for both methods
[1] 8.534242e-08
As expected, the fastCrr
function calculates nearly identical
parameter estimates to the crr
function. The slight difference in
numerical accuracy can be explained by the different methods of
optimization and convergence thresholds used for parameter estimation.
Convergence within the cyclic coordinate descent algorithm used in
fastCrr
is determined by the relative change of the coefficient
estimates. We allow users to modify the maximum relative change and
maximum number of iterations used for optimization within fastCrr
through the eps
and iter
arguments, respectively. By default, we set
eps = 1E-6
and iter = 1000
in both our unpenalized and penalized
optimization methods.
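For example, one could tighten the tolerance and raise the iteration cap when refitting the model above (fit2b is simply an illustrative name):

R> fit2b <- fastCrr(Crisk(dat$ftime, dat$fstatus) ~ Z, variance = FALSE,
+                   eps = 1e-9, iter = 5000)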
We now show how to obtain the variance-covariance matrix of the parameter estimates. The asymptotic expression of the variance-covariance matrix requires estimating several quantities whose computation becomes burdensome for large datasets, so fastCrr instead estimates the variance-covariance matrix of the parameter estimates via the bootstrap. The varianceControl function controls the parameters used for bootstrapping, which one then passes into the var.control argument of fastCrr. These arguments include B, the number of bootstrap samples to be used, and seed, a non-negative integer that sets the seed for resampling.
R> # Estimate variance via 100 bootstrap samples using seed 2019.
R> vc <- varianceControl(B = 100, seed = 2019)
R> fit3 <- fastcmprsk::fastCrr(Crisk(dat$ftime, dat$fstatus) ~ Z, variance = TRUE,
+ var.control = vc,
+ returnDataFrame = TRUE)
# returnDataFrame = TRUE is necessary for CIF estimation (next section)
R> round(sqrt(diag(fit3$var)), 3) # Standard error estimates rounded to 3rd decimal place
[1] 0.108 0.123 0.085 0.104 0.106 0.126 0.097 0.097 0.104 0.129
The accuracy of the bootstrap variance-covariance matrix relative to the asymptotic expression depends on several factors, including the sample size and the number of bootstrap samples. Although the bootstrap approach in fastCrr can be slower than evaluating the asymptotic expression for small sample sizes, we find it to be a more efficient approach for large-scale competing risks data.
We adopt several S3 methods that work seamlessly with the "fcrr" object returned by fastCrr. The coef method returns the estimated regression coefficients:
R> coef(fit3) # Coefficient estimates
[1] 0.192275755 -0.386400287 0.018161906 -0.397687129 0.105709092 0.574938015
[7] 0.778842652 -0.006105756 -0.065707434 -0.996867883
The model pseudo log-likelihood can also be extracted via the logLik
function:
R> logLik(fit3) # Model log-pseudo likelihood
[1] -590.3842
Related quantities to the log-pseudo likelihood are information criteria, measures of the quality of a statistical model that are used to compare alternative models fit to the same data. These criteria are computed as $-2 \cdot l(\hat{\beta}) + k \cdot \mathrm{df}$, where df is the number of estimated coefficients and $k$ is a per-parameter penalty factor. They can be extracted from an fcrr object using AIC, and users specify the penalty factor through the k argument. By default k = 2, which corresponds to the Akaike information criterion (Akaike 1974).
R> AIC(fit3, k = 2) # Akaike's Information Criterion
[1] 1200.768
R> # Alternative expression of the AIC
R> -2 * logLik(fit3) + 2 * length(coef(fit3))
[1] 1200.768
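Other criteria follow by changing the penalty factor; for example, setting k = log(n) yields a BIC-type criterion (shown here for illustration, with n = N = 500 in our toy example):

R> AIC(fit3, k = log(N)) # BIC-type criterion with k = log(n)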
If variance is set to TRUE for the fastCrr model fit, we can extract the bootstrap variance-covariance matrix using vcov. Additionally, confint will display confidence intervals for the regression coefficients, and the level argument can be used to specify the confidence level. By default level = 0.95, which corresponds to 95% confidence intervals.
R> vcov(fit3)[1:3, 1:3] # Variance-covariance matrix for the first three estimates
[,1] [,2] [,3]
[1,] 0.0116785745 0.0031154634 0.0007890851
[2,] 0.0031154634 0.0150597898 0.0004681825
[3,] 0.0007890851 0.0004681825 0.0072888011
R> confint(fit3, level = 0.95) # 95 % Confidence intervals
2.5% 97.5%
x1 -0.01953256 0.4040841
x2 -0.62692381 -0.1458768
x3 -0.14916899 0.1854928
x4 -0.60197206 -0.1934022
x5 -0.10199838 0.3134166
x6 0.32827237 0.8216037
x7 0.58798896 0.9696963
x8 -0.19610773 0.1838962
x9 -0.26995659 0.1385417
x10 -1.24897861 -0.7447572
Lastly, summary will return an ANOVA table for the fitted model. The table presents the log-subdistribution hazard ratios (coef), the subdistribution hazard ratios (exp(coef)), the standard errors of the log-subdistribution hazard ratios (se(coef), available if variance = TRUE in fastCrr), the corresponding test statistics (z value), and two-sided p values (Pr(>|z|)). When setting conf.int = TRUE, the summary function will also print 95% confidence intervals for the subdistribution hazard ratios (again requiring variance = TRUE when running fastCrr). Additionally, the pseudo log-likelihood of the estimated model, the null pseudo log-likelihood (with all coefficients set to zero), and a pseudo likelihood ratio test are reported.
R> # ANOVA table for fastCrr
R> summary(fit3, conf.int = TRUE) # conf.int = TRUE allows for 95% CIs to be presented
Fine-Gray Regression via fastcmprsk package.
fastCrr converged in 24 iterations.
Call:
fastcmprsk::fastCrr(Crisk(dat$ftime, dat$fstatus) ~ Z, variance = TRUE,
var.control = vc, returnDataFrame = TRUE)
coef exp(coef) se(coef) z value Pr(>|z|)
x1 0.19228 1.212 0.1081 1.779 7.5e-02
x2 -0.38640 0.679 0.1227 -3.149 1.6e-03
x3 0.01816 1.018 0.0854 0.213 8.3e-01
x4 -0.39769 0.672 0.1042 -3.816 1.4e-04
x5 0.10571 1.111 0.1060 0.997 3.2e-01
x6 0.57494 1.777 0.1259 4.568 4.9e-06
x7 0.77884 2.179 0.0974 7.998 1.3e-15
x8 -0.00611 0.994 0.0969 -0.063 9.5e-01
x9 -0.06571 0.936 0.1042 -0.631 5.3e-01
x10 -0.99687 0.369 0.1286 -7.750 9.1e-15
exp(coef) exp(-coef) 2.5% 97.5%
x1 1.212 0.825 0.981 1.498
x2 0.679 1.472 0.534 0.864
x3 1.018 0.982 0.861 1.204
x4 0.672 1.488 0.548 0.824
x5 1.111 0.900 0.903 1.368
x6 1.777 0.563 1.389 2.274
x7 2.179 0.459 1.800 2.637
x8 0.994 1.006 0.822 1.202
x9 0.936 1.068 0.763 1.149
x10 0.369 2.710 0.287 0.475
Pseudo Log-likelihood = -590
Null Pseudo Log-likelihood = -675
Pseudo likelihood ratio test = 170 on 10 df.
Since standard error estimation is performed via bootstrap and
resampling, it is easy to use multiple cores to speed up computation.
Parallelization is seamlessly implemented using the
doParallel package
(Calaway et al. 2019). Enabling usage of multiple cores is done through the
useMultipleCores
argument within the varianceControl
function. To
avoid interference with other processes, we allow users to set up the
cluster on their own. We provide an example below.
R> library(doParallel)
R> n.cores <- 2 # No. of cores
R> myClust <- makeCluster(n.cores)
R> # Set useMultipleCores = TRUE to enable parallelization
R> vc <- varianceControl(B = 1000, useMultipleCores = TRUE)
R> registerDoParallel(myClust)
R> fit3 <- fastCrr(Crisk(dat$ftime, dat$fstatus) ~ Z, variance = TRUE,
+ var.control = vc)
R> stopCluster(myClust)
The CIF is also available in linear time in the fastcmprsk package. Fine and Gray (1999) propose a Monte Carlo simulation method for interval and band estimation; we implement a slightly different approach using the bootstrap. Let $\hat{F}_1(t; z_0)$ denote the estimated CIF at a new covariate vector $z_0$, obtained by plugging the estimated regression coefficients and the Breslow-type baseline estimate into the CIF expression. Pointwise confidence intervals and a symmetric global confidence band for the estimated CIF over a time interval $[t_L, t_U]$ are constructed as follows:

1. Resample subjects with replacement from the original data.
2. For the $b$th bootstrap dataset ($b = 1, \ldots, B$), refit the Fine-Gray model.
3. Estimate the CIF at $z_0$ from the $b$th fit over $[t_L, t_U]$.
4. Finally, the pointwise bootstrap variances yield pointwise confidence intervals, and the bootstrap distribution of the standardized maximal deviation over $[t_L, t_U]$ yields the critical value for the symmetric global confidence band.

Below we predict the CIF at a new covariate vector z0 using the coefficient and baseline estimates from our toy example. Similar to estimating the variance-covariance matrix of the coefficient estimates, the bootstrap parameters are specified through the varianceControl function. One can perform CIF estimation and interval/band estimation using the predict function by supplying a vector of covariate values through the newdata argument together with the fitted model from fastCrr. To calculate the CIF, both the Breslow estimator of the cumulative subdistribution hazard and the (ordered) model data frame need to be returned values within the fitted object. This is achieved by setting both the getBreslowJumps and returnDataFrame arguments of fastCrr to TRUE. Additionally, for confidence band estimation one must specify a time interval through the tL and tU arguments of predict. Figure 1 illustrates the estimated CIF and corresponding 95% confidence intervals over [0.2, 0.9] for a covariate vector z0 simulated from a standard normal distribution.
R> set.seed(2019)
R> # Make sure getBreslowJumps and returnDataFrame are set to TRUE
R> fit4 <- fastCrr(Crisk(dat$ftime, dat$fstatus, cencode = 0, failcode = 1) ~ Z,
+ variance = FALSE,
+ getBreslowJumps = TRUE, # Default = TRUE
+ returnDataFrame = TRUE) # Default is FALSE for storage purposes
R> z0 <- rnorm(10) # New covariate entries to predict
R> cif.point <- predict(fit4, newdata = z0, getBootstrapVariance = TRUE,
+ type = "interval", tL = 0.2, tU = 0.9,
+ var.control = varianceControl(B = 100, seed = 2019))
R> plot(cif.point) # Figure 1 (Plot of CIF and 95% C.I.)
We extend our forward-backward scan approach to penalized Fine-Gray regression as described in Section 2.4. The fastCrrp function performs LASSO, SCAD, MCP, and ridge (Hoerl and Kennard 1970) penalization; users specify the penalization technique through the penalty argument. The advantage of implementing this algorithm for penalized Fine-Gray regression is twofold: the cyclic coordinate descent algorithm used in the crrp function calculates the gradient and Hessian diagonals in $O(n^2)$ time, whereas our implementation requires only $O(n)$ time, and this saving is compounded over the entire sequence of candidate tuning parameter values. Users may supply their own sequence of tuning parameter values through the lambda argument of fastCrrp; if left untouched (i.e., lambda = NULL), a log-spaced sequence of candidate values is generated automatically. The syntax of fastCrrp is nearly identical to the syntax of crrp.
R> library(crrp)
R> lam.path <- 10^seq(log10(0.1), log10(0.001), length = 25)
R> # crrp package
R> fit.crrp <- crrp(dat$ftime, dat$fstatus, Z, penalty = "LASSO",
+ lambda = lam.path, eps = 1E-6)
R> # fastcmprsk package
R> fit.fcrrp <- fastCrrp(Crisk(dat$ftime, dat$fstatus) ~ Z, penalty = "LASSO",
+ lambda = lam.path)
R> # Check to see the two methods produce the same estimates.
R> max(abs(fit.fcrrp$coef - fit.crrp$beta))
[1] 1.110223e-15
R> plot(fit.fcrrp) # Figure 2 (Solution path)
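As noted in Section 2.4, the tuning parameter is typically chosen by optimizing a criterion such as the BIC over the candidate path. A minimal sketch of such a selection using the fit above is shown below; it assumes, as in the comparison with crrp above, that fit.fcrrp$coef stores one column of coefficient estimates per value of lam.path and that logLik returns one log-pseudo likelihood value per tuning parameter.

R> # BIC-type criterion along the solution path (illustrative sketch)
R> bic <- -2 * logLik(fit.fcrrp) + log(N) * colSums(fit.fcrrp$coef != 0)
R> lam.path[which.min(bic)] # candidate tuning parameter value minimizing the criterion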
This section provides a more comprehensive illustration of the computational performance of the fastcmprsk package relative to two popular competing packages, cmprsk and crrp. We simulate datasets under various sample sizes with a fixed number of covariates, use simulateTwoCauseFineGrayModel to generate these data, and average runtime results over 100 Monte Carlo replicates. We report timings from a system with an Intel Core i5 2.9 GHz processor and 16 GB of memory.
In this section, we compare the runtime and estimation performance of the fastCrr function to that of crr. We vary the sample size and fit both fastCrr and crr with and without variance estimation. We take 100 bootstrap samples, without parallelization, to obtain the bootstrap standard errors with fastCrr. As shown later in this section (Tables 3 and 4), 100 bootstrap samples suffice to produce good standard error estimates with close-to-nominal coverage for large enough sample sizes in our scenarios. In practice, we recommend that users increase the number of bootstrap samples until the variance estimate becomes stable, when computationally feasible.
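For instance, one could informally check stability by refitting with an increasing number of bootstrap samples and comparing the resulting standard errors (an illustrative check using our toy data; fitB100 and fitB200 are hypothetical names):

R> fitB100 <- fastCrr(Crisk(dat$ftime, dat$fstatus) ~ Z, variance = TRUE,
+                     var.control = varianceControl(B = 100, seed = 2019))
R> fitB200 <- fastCrr(Crisk(dat$ftime, dat$fstatus) ~ Z, variance = TRUE,
+                     var.control = varianceControl(B = 200, seed = 2019))
R> # Maximum relative change in bootstrap standard errors; small values
R> # suggest the variance estimate has stabilized.
R> max(abs(sqrt(diag(fitB200$var)) - sqrt(diag(fitB100$var))) / sqrt(diag(fitB100$var)))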
Figure 3 depicts how the runtimes of fastCrr (dashed lines) and crr (solid lines) grow with the sample size. On the log10-log10 scale, the runtime of crr increases quadratically (solid-line slopes close to two), whereas the runtime of fastCrr increases linearly (dashed-line slopes close to one). The relative speedup of fastCrr over crr is therefore expected to keep growing as the sample size increases.
Figure 3: Runtime comparison of fastCrr and crr with and without variance estimation. Axes are on the log10 scale. Solid and dashed lines represent the cmprsk and fastcmprsk implementations, respectively. Square and circle symbols denote runs with and without variance estimation, respectively. Variance estimation for crr is performed using the asymptotic expression of the variance-covariance estimator; variance estimation for fastCrr is performed using 100 bootstrap samples. Reported runtimes are averaged over 100 Monte Carlo runs.
We further demonstrate the computational advantages of fastCrr over crr for large sample sizes in Table 2 by comparing their runtimes on single simulated datasets of increasing size. fastCrr scales well to large sample sizes, whereas crr eventually grinds to a halt as the sample size grows: for the largest dataset considered, fastCrr finished in under a minute, while crr did not finish within three days. Because the forward-backward scan allows us to efficiently compute variance estimates through bootstrapping, we have also observed massive computational gains in variance estimation for large sample sizes (7 minutes for fastCrr versus 54 hours for crr). Furthermore, since parallelization of the bootstrap procedure was not used for these timing reports, we expect multicore usage to further decrease the runtime of the variance estimation for fastCrr.
| | n = 50,000 | n = 100,000 | n = 500,000 |
|---|---|---|---|
| crr without variance | 6 hours | 24 hours | – |
| crr with variance | 54 hours | – | – |
| fastCrr without variance | 5 seconds | 12 seconds | 50 seconds |
| fastCrr with variance | 7 minutes | 14 minutes | 69 minutes |
We also performed a simulation to compare the bootstrap procedure for variance estimation with the asymptotic variance estimator used in crr. First, we compare the two standard error estimates with the empirical standard error of the estimated regression coefficients across Monte Carlo replicates (Table 3).
| Std. Err. Est. | | n = 2000 | n = 3000 | n = 4000 |
|---|---|---|---|---|
| Empirical | 0.06 | 0.05 | 0.04 | 0.03 |
| Bootstrap | 0.10 | 0.05 | 0.04 | 0.03 |
| Asymptotic | 0.07 | 0.04 | 0.03 | 0.03 |
| Empirical | 0.10 | 0.05 | 0.04 | 0.03 |
| Bootstrap | 0.11 | 0.06 | 0.04 | 0.04 |
| Asymptotic | 0.08 | 0.05 | 0.04 | 0.03 |
| Empirical | 0.09 | 0.06 | 0.04 | 0.03 |
| Bootstrap | 0.11 | 0.06 | 0.04 | 0.04 |
| Asymptotic | 0.07 | 0.05 | 0.04 | 0.03 |
Additionally, we present in Table 4 the coverage probabilities (and standard errors) of the 95% confidence intervals constructed using the bootstrap (fastCrr) and asymptotic (crr) variance estimates. The confidence intervals are wider for the bootstrap approach when compared to confidence intervals produced using the asymptotic variance estimator, especially when the sample size is small.
| | | n = 2000 | n = 3000 | n = 4000 |
|---|---|---|---|---|
| crr | 0.93 (0.03) | 0.90 (0.03) | 0.93 (0.03) | 0.95 (0.02) |
| fastCrr | 1.00 (0.00) | 0.98 (0.02) | 0.95 (0.02) | 0.95 (0.02) |
As mentioned in Section 2.4, Fu et al. (2017) provide an R package, crrp, for performing penalized Fine-Gray regression using the LASSO, SCAD, and MCP penalties. We compare the runtime of fastCrrp with the implementation in the crrp package. To level the comparison, we modify the source code of crrp so that the function only calculates the coefficient estimates and the BIC score. We vary the sample size to illustrate the computational advantage the fastCrrp function has over crrp.
Similar to the unpenalized scenario, the runtime of crrp (solid lines) increases quadratically with the sample size while that of fastCrrp (dashed lines) increases linearly, resulting in a 200- to 300-fold speedup in runtime at the largest sample sizes considered.
The fastcmprsk package provides a set of scalable tools for the analysis of large-scale competing risks data, built on an algorithm that linearizes the computational complexity required to estimate the parameters of the Fine-Gray proportional subdistribution hazards model. Multicore use is also implemented to further speed up methods that require bootstrapping and resampling. Our simulation results show that our implementation yields up to a 7200-fold decrease in runtime for large sample sizes. We also note that, in a real-world application, Kawaguchi et al. (2020) report a similarly drastic decrease in runtime using this algorithm.
Lastly, our current implementation assumes that covariates are densely observed across subjects. This is problematic in the sparse high-dimensional massive sample size (sHDMSS) domain (Mittal et al. 2014), where the number of subjects and the number of sparsely-represented covariates easily exceed tens of thousands. Such data are typical in large comparative effectiveness and drug safety studies that use massive administrative claims and EHR databases, which routinely contain millions to hundreds of millions of patient records with tens of thousands of patient attributes; these settings are particularly relevant for drug safety studies of rare events such as unexpected adverse events (Schuemie et al. 2018) to protect public health. We are currently extending our algorithm to this domain in a sequel paper.
We thank the referees and the editor for their helpful comments that improved the presentation of the article. Marc A. Suchard’s work is partially supported through the National Institutes of Health grant U19 AI 135995. Jenny I. Shen’s work is partly supported through the National Institutes of Health grant K23DK103972. The research of Gang Li was partly supported by National Institutes of Health Grants P30 CA-16042, UL1TR000124-02, and P50 CA211015.
We describe the data generation process for the simulateTwoCauseFineGrayModel function. Following the design of Fine and Gray (1999) and Fu et al. (2017), the cause indicator for subject $i$ is drawn with $\Pr(\epsilon_i = 2 \mid Z_i) = (1 - p)^{\exp(Z_i^\top\beta_1)}$. Conditional on $\epsilon_i = 1$, the event time is drawn by inverting the conditional distribution function implied by the subdistribution $F_1(t; Z_i) = 1 - [1 - p\{1 - \exp(-t)\}]^{\exp(Z_i^\top\beta_1)}$; conditional on $\epsilon_i = 2$, the event time is drawn from an exponential distribution with rate $\exp(Z_i^\top\beta_2)$. Censoring times are drawn independently from a Uniform(u.min, u.max) distribution. Below is the code snippet within simulateTwoCauseFineGrayModel used to simulate the observed event times and the cause and censoring indicators.
#START CODE
...
...
...
# nobs, Z, p = pi, u.min, u.max, beta1 and beta2 are already defined.
# Simulate cause indicators here using a Bernoulli random variable
c.ind <- 1 + rbinom(nobs, 1, prob = (1 - p)^exp(Z %*% beta1))
ftime <- numeric(nobs)
eta1 <- Z[c.ind == 1, ] %*% beta1 # linear predictor for cause of interest
eta2 <- Z[c.ind == 2, ] %*% beta2 # linear predictor for competing risk
# Conditional on cause indicators, we simulate the model.
u1 <- runif(length(eta1))
t1 <- -log(1 - (1 - (1 - u1 * (1 - (1 - p)^exp(eta1)))^(1 / exp(eta1))) / p)
t2 <- rexp(length(eta2), rate = exp(eta2))
ci <- runif(nobs, min = u.min, max = u.max) # simulate censoring times
ftime[c.ind == 1] <- t1
ftime[c.ind == 2] <- t2
ftime <- pmin(ftime, ci) # X = min(T, C)
fstatus <- ifelse(ftime == ci, 0, 1) # 0 if censored, 1 if event
fstatus <- fstatus * c.ind # 1 if cause 1, 2 if cause 2
...
...
...