Left censoring can occur with relative frequency when analyzing recurrent events in epidemiological studies, especially observational ones. Concretely, the inclusion of individuals that were already at risk before the effective initiation in a cohort study may cause the unawareness of prior episodes that have already been experienced, and this will easily lead to biased and inefficient estimates. The miRecSurv package is based on the use of models with specific baseline hazard, with multiple imputation of the number of prior episodes when unknown by means of the COMPoisson distribution, a very flexible count distribution that can handle over, sub, and equidispersion, with a stratified model depending on whether the individual had or had not previously been at risk, and the use of a frailty term. The usage of the package is illustrated by means of a real data example based on an occupational cohort study and a simulation study.
It is not unusual in cohort epidemiological studies that part (or all) of the participants have experienced the event under study at least once before the beginning of the follow-up. This situation is particularly common in the case of observational designs. Under these circumstances, the prior history and prior time at risk of these individuals can be unknown or estimated on the basis of self recall questionnaires, which could lead to modeling issues when the baseline hazard of suffering the event is time-dependent. If the event of interest can only occur once, and it occurred for an individual before the start of the follow-up, the result for this individual is fixed regardless of the duration of the follow-up. Therefore we are in the well-known and well-studied situation of left censoring of a binary variable, for which specific modeling techniques are available. On the other hand, if the event of interest can be suffered several times by the same individual (it is recurrent), and the number of events suffered by the individuals in the cohort before the beginning of the follow-up is unknown, we face a left censoring situation with a discrete censored variable that can define different baseline hazards depending on the episode an individual is at risk of.
This paper introduces the miRecSurv, useful when the prior history is unknown for all, or some, of the individuals included in a cohort and the outcome of interest is a recurrent event with event dependence. Specifically, we suppose that we know the moment from where all individuals are at risk, but the number of episodes experienced by the individuals in this time is unknown. This is a realistic situation in practice; for example, it is very probable that in a workforce cohort, we know when a worker started to work (thus, to be at risk of having a sick leave) and, however, and especially for people with ample trajectory, that we do not know whether or not in effect they have already had sick leaves (and in this case, how many). Another situation that might deal with this issue is a study of cohorts with an outcome of incidence of infection from human papillomavirus on adult women. It would be relatively simple to know how long they have been at risk (beginning of active sexual life) or to make a reasonable assumption. However, we will not be able to know the number of infections since most of the time, when they occur, they are asymptomatic (Fernández-Fontelo et al. 2016).
Our proposal starts from the assumption that even though the previous
history of all or some of the individuals is unknown, we do know which
of these were at risk prior to the beginning of the follow-up and
starting when. Further, that is based fundamentally on three
considerations: 1) impute
where
Therefore, the use of the term individual random error
The imputation of the number of previous events in individuals at risk before the beginning of the follow-up is done through the COMPoisson generalized distribution (using the log link function), that allows adjusting a regression model using the Conway-Maxwell Poisson (COMPoisson) distribution (Shmueli et al. 2005) considering the dispersion of the data (sub, equi, or overdispersion). This imputation is carried out through multiple imputation calculating its parameters directly from the observed data in two phases: Firstly, a generalized linear model (GLM) is fitted using the number of episodes observed during the follow-up as the response variable and the covariates selected by the user as explanatory, based on the COMPoisson distribution. Imputed values are randomly sampled from this distribution with the parameters obtained in the previous step, including random noise generated from a normal distribution. In order to produce a proper estimation of uncertainty, the described methodology is included in a multiple imputation framework, according to the well known Rubin’s rules (Rubin 1987) and based on the following steps in a Bayesian context:
Fit the COMPoisson count data model and find the posterior mean and
variance
Draw new parameters
Compute predicted scores
Draw imputations from the COMPoisson distribution and scores obtained in the previous step.
The COMPoisson random number generation is based on the rcom
function
included in the archived package
compoisson
(Dunn 2012). The overperformance of the COMPoisson distribution in
this context compared to alternative discrete distributions (Poisson,
negative binomial, Hermite and zero-inflated versions) is discussed in
Hernández-Herrera et al. (2020) by means of a comprehensive simulation study.
The main function of the
miRecSurv is
recEvFit
, which allows the user to fit recurrent events survival
models. A call to this function might be
recEvFit(formula, data, id, prevEp, riskBef, oldInd,
frailty=FALSE, m=5, seed=NA, ...)
The description of these arguments can be summarized as follows:
formula
: a formula object, with the response on the left of a
Surv
function.data
: a data.frame in which to interpret the variables named in
the formula.id
: subject identifier.prevEp
: known previous episodes.riskBef
: indicator for new individual in the cohort
(riskBef=FALSE
) or subject who was at risk before the start of
follow-up (riskBef=TRUE
).oldInd
: time an individual has been at risk prior to the
follow-up. This time can be positive or negative (time origin as the
start of follow-up).frailty
: should the model include a frailty term. Defaults to
FALSE
.m
: number of multiple imputations. The default is m=5
.seed
: an integer that is used as argument by the set.seed
function for offsetting the random number generator. Default is to
leave the random number generator alone....
: extra arguments to pass to coxph
.The output of this function is a list with seven elements:
fit
: a list with all the coxph
objects fitted for each imputed
dataset.coeff
: a list with the vectors of coefficients from the models
fitted to each imputed dataset.loglik
: a list with the loglikelihood for each model fitted.vcov
: a list with the variance-covariance matrices for the
parameters fitted for each of the imputed datasets.AIC
: a list with the AIC of each of the models fitted.CMP
: summary tables of the fitted COMPoisson models used for
imputing missing values.data.impute
: the original dataset with the multiple imputed
variables as final columns.In order to facilitate the interpretation, a pooled summary table is
available via summary
method, formatted in a very similar way to the
summary tables of very well-known functions as coxph
, as can be seen
in the next section. As in some cases the multiple imputation process
might be computationally expensive and take some time, the function
recEvFit
provides the user with a progress bar.
To illustrate our proposal, we use simulated data generated with the
parameters estimated in a worker cohort, where the outcome is the
occurrence of sick leave due to any cause. Table 1 shows the
characteristics of each episode in this population, estimated in a
cohort study described in Navarro et al. (2012). The maximum number of episodes
that a subject may suffer was not fixed, although the baseline hazard
was considered constant when
Episode | Distribution | Ancillary | |
---|---|---|---|
1 | Log-logistic | 7.974 | 0.836 |
2 | Weibull | 7.109 | 0.758 |
3 | Log-normal | 5.853 | 1.989 |
4 | Log-normal | 5.495 | 2.204 |
To illustrate the usage of the miRecSurv package, the results corresponding to a sample from the first scenario can be obtained by means of
library(survsim)
library(miRecSurv)
d.ev <- c('llogistic','weibull','weibull','weibull')
b0.ev <- c(5.843, 5.944, 5.782, 5.469)
a.ev <- c(0.700, 0.797, 0.822, 0.858)
d.cens <- c('weibull','weibull','weibull','weibull')
b0.cens <- c(7.398, 7.061, 6.947, 6.657)
a.cens <- c(1.178, 1.246, 1.207, 1.422)
set.seed(1234)
sample1 <- rec.ev.sim(n=1500, foltime=1095,
dist.ev=d.ev, anc.ev=a.ev, beta0.ev=b0.ev,
dist.cens=d.cens, anc.cens=a.cens, beta0.cens=b0.cens,
beta=list(c(-.25,-.25,-.25,-.25), c(-.5,-.5,-.5,-.5), c(-.75,-.75,-.75,-.75)),
x=list(c("bern", .5), c("bern", .5), c("bern", .5)),
priskb=.1, max.old=5475)
sample1$old2 <- -sample1$old
sample1$old2[is.na(sample1$old)] <- 0
### Shared frailty
ag_s1 <- coxph(Surv(start2,stop2,status)~as.factor(x)+as.factor(x.1)+as.factor(x.2)+old2+
strata(as.factor(risk.bef))+frailty(nid), data=sample1)
### Counting process
shfmi.cp_s1 <- recEvFit(Surv(start2, stop2, status)~x+x.1+x.2, data=sample1,
id="nid", prevEp = "obs.episode",
riskBef = "risk.bef", oldInd = "old", frailty=TRUE, m=5, seed=1234)
### Gap time
shfmi.gt_s1 <- recEvFit(Surv(stop2-start2, status)~x+x.1+x.2, data=sample1,
id="nid", prevEp = "obs.episode",
riskBef = "risk.bef", oldInd = "old", frailty=TRUE, m=5, seed=1234)
The generated cohort, including the estimated number of previous events (multiple imputed), can be obtained as
head(shfmi.cp_s1$data.impute)
nid real.episode obs.episode time status start stop time2 start2
1 1 1 1 119.673502 1 0.0000 119.6735 119.673502 124.5052
2 1 2 2 389.637244 1 119.6735 509.3107 389.637244 244.1787
3 1 3 3 244.130315 1 509.3107 753.4411 244.130315 633.8160
4 1 4 4 144.476145 0 753.4411 897.9172 144.476145 877.9463
5 2 1 1 107.943850 1 0.0000 107.9438 107.943850 681.4178
6 2 2 2 6.472291 1 107.9438 114.4161 6.472291 789.3617
stop2 old risk.bef long z x x.1 x.2 EprevCOMPoissDef1 EprevCOMPoissDef2
1 244.1787 0 FALSE NA 1 1 0 0 0 0
2 633.8160 0 FALSE NA 1 1 0 0 1 1
3 877.9463 0 FALSE NA 1 1 0 0 2 2
4 1022.4224 0 FALSE NA 1 1 0 0 3 3
5 789.3617 0 FALSE NA 1 1 1 0 0 0
6 795.8340 0 FALSE NA 1 1 1 0 1 1
EprevCOMPoissDef3 EprevCOMPoissDef4 EprevCOMPoissDef5
1 0 0 0
2 1 1 1
3 2 2 2
4 3 3 3
5 0 0 0
6 1 1 1
The pooled coefficients table can be obtained as
summary(shfmi.cp_s1)
Call:
recEvFit(formula = Surv(start2, stop2, status) ~ x + x.1 + x.2,
data = sample1, id = "nid", prevEp = "obs.episode", riskBef = "risk.bef",
oldInd = "old", frailty = TRUE, m = 5, seed = 1234)
Coefficients:
coef exp(coef) se(coef) Chisq Pr(>|z|)
x 2.766059e-01 1.3186465 0.3161915 76.49190798 1.401017e-17
x.1 4.337341e-01 1.5430085 0.3010033 174.86750236 2.970463e-39
x.2 7.489812e-01 2.1148444 0.5144217 436.60796033 8.875830e-95
old -2.584687e-05 0.9999742 0.4976880 0.03005029 7.830972e-01
frailty NA NA NA 77.64460457 6.481409e-02
AIC: 34867.6
COM-Poisson regression model for imputing missing values
Estimate SE z.value Pr(>|z|)
(Intercept) -6.3611 0.0348 -182.5394 0.000e+00
x 0.2356 0.0240 9.8376 7.755e-23
x.1 0.3563 0.0258 13.7869 3.054e-43
x.2 0.6446 0.0305 21.1139 5.927e-99
S:(Intercept) -0.5029 0.0315 -15.9637 2.288e-57
Simulated data include four scenarios of
To compare the results obtained, we also estimate the shared frailty model (Eq (3)). This model has a common hazard baseline (i.e., does not consider the event dependence) and incorporates an individual frailty term.
Results of fitting the described models in scenario 1 are summarized in the Table 2:
Model | ||||||
---|---|---|---|---|---|---|
Shared Frailty | 0.292 | 0.043 | 0.558 | 0.043 | 0.843 | 0.043 |
SHFMI.CP | 0.244 | 0.034 | 0.470 | 0.036 | 0.706 | 0.039 |
SHFMI.GT | 0.218 | 0.030 | 0.419 | 0.031 | 0.632 | 0.034 |
Below are presented the results for the second scenario, Table 3:
Model | ||||||
---|---|---|---|---|---|---|
Shared Frailty | 0.287 | 0.039 | 0.554 | 0.039 | 0.832 | 0.040 |
SHFMI.CP | 0.242 | 0.032 | 0.472 | 0.035 | 0.703 | 0.040 |
SHFMI.GT | 0.217 | 0.028 | 0.425 | 0.030 | 0.633 | 0.034 |
Results for scenario 3, Table 4:
Model | ||||||
---|---|---|---|---|---|---|
Shared Frailty | 0.277 | 0.033 | 0.542 | 0.034 | 0.818 | 0.035 |
SHFMI.CP | 0.243 | 0.030 | 0.474 | 0.036 | 0.710 | 0.042 |
SHFMI.GT | 0.217 | 0.025 | 0.421 | 0.029 | 0.632 | 0.033 |
Finally, Table 5 shows the estimates when 100% of the subjects are at risk prior to the beginning of the follow-up:
Model | ||||||
---|---|---|---|---|---|---|
Shared Frailty | 0.262 | 0.026 | 0.527 | 0.027 | 0.797 | 0.029 |
SHFMI.CP | 0.248 | 0.029 | 0.495 | 0.040 | 0.742 | 0.053 |
SHFMI.GT | 0.208 | 0.022 | 0.416 | 0.027 | 0.626 | 0.034 |
The average relative bias of the estimates produced by each method is shown in Figure 1. It can be seen that the estimates produced by the miRecSurv are less biased than those based on the common baseline hazard model.
The real example corresponds to a selected sample of the HC-UFMG cohort
(Reis et al. 2008), which involves workers of a public hospital in Belo
Horizonte, Brazil. Specifically, we selected all workers present on
January 1st, 2004 (
To analyze the association of sex, type of contract, and the kind of job developed by these workers over the risk of suffering sick leaves, the package miRecSurv can be used in the following way.
mod1 <- recEvFit(Surv(start, stop, status)~Male+Non.civil.servant+
Non.healthcare, data=example,
id="nid", prevEp="num", riskBef="prev2", oldInd="days_prev",
frailty=TRUE, m=5, seed=1234)
summary(mod1)
Call:
recEvFit(formula = Surv(start, stop, status) ~ Male + Non.civil.servant +
Non.healthcare, data = example, id = "nid", prevEp = "num",
riskBef = "prev2", oldInd = "days_prev", frailty = TRUE,
m = 5, seed = 1234)
Coefficients:
coef exp(coef) se(coef) Chisq Pr(>|z|)
Male -0.2966733104 0.7432868 0.2033092 24.386339 3.999350e-07
Non.civil.servant -0.1788549829 0.8362272 0.1320313 9.221709 2.391622e-03
Non.healthcare -0.1315412853 0.8767431 0.1318741 5.568327 1.703076e-02
days_prev -0.0002383671 0.9997617 0.2027526 6.457974 7.089736e-03
frailty NA NA NA 1157.231797 5.664063e-35
AIC: 29699
COM-Poisson regression model for imputing missing values
Estimate SE z.value Pr(>|z|)
(Intercept) -6.7602 0.0407 -166.0266 0.000e+00
Male -0.1665 0.0261 -6.3766 1.810e-10
Non.civil.servant -0.0132 0.0246 -0.5354 5.924e-01
Non.healthcare -0.0917 0.0246 -3.7297 1.917e-04
S:(Intercept) -1.4043 0.0709 -19.8059 2.646e-87
According to these results, it can be seen that all three considered explanatory variables are significantly associated with the risk of suffering sick leaves (see Coefficients table), although the type of contract is not significant on the model used to impute the number of episodes suffered before the start of the follow-up.
Left censoring, when analyzing recurrent events, is a situation that can occur with a certain frequency in cohort studies. For example, it will occur in every cohort that follows subjects that were already at risk of experiencing the outcome of interest before the beginning of the follow-up. If the recurrent event presents event dependence, not knowing the number of episodes that each individual has had prior to the effective initiation of the follow-up, and analyzing this through “classical” methods, is an important problem, as it will lead to biased and inefficient estimates.
The presented proposal seems to work reasonably well, outperforming the alternative to a common hazard model. It is important to carry out a comprehensive study to evaluate the performance of the presented proposal. It is also worth noticing that although in this particular example, the average relative biases produced by the shared frailty model are relatively low, this is not always the case, as can be seen in a simulation study like Navarro et al. (2017).
Unavailability of the number of previous episodes that an individual has had should not be a justification for the use of models with a common baseline hazard which on the large majority of occasions show a higher bias and less coverage than specific baseline hazard models, as shown by the results of this work.
miRecSurv, compoisson, survsim
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Moriña, et al., "miRecSurv Package: Prentice-Williams-Peterson Models with Multiple Imputation of Unknown Number of Previous Episodes", The R Journal, 2021
BibTeX citation
@article{RJ-2021-082, author = {Moriña, David and Hernández-Herrera, Gilma and Navarro, Albert}, title = {miRecSurv Package: Prentice-Williams-Peterson Models with Multiple Imputation of Unknown Number of Previous Episodes}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {2}, issn = {2073-4859}, pages = {419-426} }