Data subject to length-biased sampling are frequently encountered in various applications including prevalent cohort studies and are considered as a special case of left-truncated data under the stationarity assumption. Many semiparametric regression methods have been proposed for length-biased data to model the association between covariates and the survival outcome of interest. In this paper, we present a brief review of the statistical methodologies established for the analysis of length-biased data under the Cox model, which is the most commonly adopted semiparametric model, and introduce an R package CoxPhLb that implements these methods. Specifically, the package includes features such as fitting the Cox model to explore covariate effects on survival times and checking the proportional hazards model assumptions and the stationarity assumption. We illustrate usage of the package with a simulated data example and a real dataset, the Channing House data, which are publicly available.
In prevalent cohort studies, subjects who have experienced an initiating event (e.g., disease diagnosis) but have not yet experienced a failure event (e.g., death) are sampled from the target population and followed until a failure or censoring event occurs. Data collected from such sampling designs are subject to left truncation since subjects who experienced a failure event prior to study enrollment are selectively excluded and are not observed in the data. When the occurrence of the initiating event follows a stationary Poisson process, the data are called “length-biased data”, which is a special case of left-truncated data. These data are encountered in a variety of fields such as cancer screening trials (Zelen and Feinleib 1969), studies of unemployment (Lancaster 1979; Una-Alvarez et al. 2003), epidemiologic cohort studies (Gail and Benichou 2000; Scheike and Keiding 2006), and genome-wide linkage studies (Terwilliger et al. 1997). The failure times observed in such data tend to be longer than those in the target population since subjects with longer failure times are more likely to be included in the length-biased data. Figure 1 depicts the occurrence of length-biased sampling. The underlying length-biased sampling assumption (i.e., the stationarity assumption) can be analytically examined (Addona and Wolfson 2006).
Provided that the observed data are not random samples of the target population, statistical methods for conventional survival data cannot be directly applied to length-biased data. Extensive studies have been conducted on statistical methodologies that account for length bias. In particular, a number of semiparametric regression methods have been proposed in the literature to model the association between covariates and the survival outcome of interest. Among the semiparametric regression models, the Cox proportional hazards model (Cox 1972) has been the most commonly adopted. Under the Cox model, Wang (1996) proposed a pseudo-partial likelihood approach to assess the covariate effects. However, her estimation method is limited to length-biased data with no right censoring. Tsai (2009) generalized the method to handle potential right censoring. Qin and Shen (2010) constructed estimating equations based on risk sets that are adjusted for length-biased sampling through inverse weights. A thorough review of the existing nonparametric and semiparametric regression methods can be found in Shen et al. (2017).
Although there is a substantial amount of literature on statistical methods for length-biased data, publicly available computational tools for analyzing such data are limited. In this paper, we introduce a new package, CoxPhLb (Lee et al. 2019a), in R that provides tools to analyze length-biased data under the Cox model. The package includes functions that fit the Cox model using the estimation method proposed by Qin and Shen (2010), check the proportional hazards model assumptions based on methods developed by (Lee et al. 2019b) and check the underlying stationarity assumption. CoxPhLb is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=CoxPhLb. To the best of our knowledge, this is the first and only publicly available R package for analyzing length-biased data under the Cox model.
The remainder of this paper is organized as follows. The following section provides a brief review of the semiparametric estimation method under the Cox model to assess the covariate effects on the survival outcome. Then, we outline how the Cox proportional hazards model assumptions can be checked both graphically and analytically, and describe two approaches to test the stationarity of the underlying incidence process. We illustrate the R package CoxPhLb using a simulated data example and a real dataset, the Channing House data. Finally, we conclude this paper with summarizing remarks.
Let
The length-biased data consist of
Among many estimation methods established for length-biased data under the Cox model, we provide the estimation function based on the inversely weighted estimating equation of Qin and Shen (2010). While this estimating equation approach may not be the most efficient method, the estimation procedure is easy to implement and provides a mean zero stochastic process that forms the basis of model checking. We adopt this estimation method for model fitting and checking in the R package CoxPhLb.
For subject
This estimation procedure can be implemented using the coxphlb
function in the CoxPhLb package as follows:
coxphlb(formula, data, method = c("Bootstrap","EE"), boot.iter = 500,
seed.n = round(runif(1,1,1e09)), digits = 3L)
where formula
has the same syntax as the formula used in coxph
from
the survival package
(Therneau 2020). The response needs to be a survival
object such
as Surv(a, y, delta)
where a
, y
, and delta
are the truncation
times, the observed failure times, and the censoring indicators,
respectively. The argument data
is a data frame that includes
variables named in the formula. We can choose either the bootstrap
variance estimates ("Bootstrap"
), or the model-based variance
estimates ("EE"
) to be returned in the fitted model object through the
argument method
. When bootstrap resampling is chosen for variance
estimation, the bootstrap sample size is controlled by boot.iter
with
the default set as seed.n
. A
summary table is returned with values rounded by the integer set through
digits
.
Alternatively, one can implement the estimation procedure by using the
coxph
function with the subset of the data that consist of uncensored
failure times only and an offset
term to add
coxph
function
will return the same point estimates as the coxphlb
function. To
compute the corresponding standard errors using coxph
, we need to use
the bootstrap approach. Later, in the simulated data example, we further
evaluate the computational efficiency of the coxphlb
function with the
model-based variance estimation (i.e., method = EE
) opposed to the
bootstrap resampling method (i.e., method = Bootstrap
).
Two primary components of checking the Cox proportional hazards model assumptions are examining (i) the functional form of a covariate and (ii) the proportional hazards assumption. To detect violations of these model assumptions, the general form of the cumulative sums of multiparametric stochastic processes is considered.
Under Model ((1)), we can construct a mean zero stochastic
process,
Let
The null distribution of the general form ((3)) under
Model ((1)) has been studied in Lee et al. (2019b) to derive the
critical values for test statistics
We can carry out model checking in R using the function coxphlb.ftest
to examine the functional form of a continuous covariate as follows:
coxphlb.ftest(fit, data, spec.p = 1, n.sim = 1000, z0 = NULL,
seed.n = round(runif(1,1,1e09)),
digits = 3L)
where the argument fit
is an object of the "coxphlb"
class, which
can be obtained by using the coxphlb
function, and data
is the data
frame used in the fitted model. We specify the spec.p
, where the default is set as
n.sim
controls the number of samples with
the default set as z0
, which if NULL
, 100 equally distributed grid points will be
selected over the range of the seed.n
. The digits
. To test the
proportional hazards assumption, we use coxphlb.phtest
as follows:
coxphlb.phtest(fit, data, spec.p = NULL, n.sim = 1000,
seed.n = round(runif(1,1,1e09)),
digits = 3L)
where all arguments play the same role as in the coxphlb.ftest
function, except for spec.p
. The proportional hazards assumption can
be tested for the spec.p
equal to spec.p
is left
unspecified.
We can conduct graphical assessment by using the following functions:
coxphlb.ftest.plot(x, n.plot = 20, seed.n = round(runif(1,1,1e09)))
coxphlb.phtest.plot(x, n.plot = 20, seed.n = round(runif(1,1,1e09)))
where x
are objects of the "coxphlb.ftest"
class and the
"coxphlb.phtest"
class, respectively. These functions return a plot of
the observed process along with a randomly sampled n.plot
number of
realizations. We can fix the random seed number through seed.n
.
When the underlying incidence process follows a stationary Poisson
process, the distribution of the truncation variable is uniform and the
data are considered length-biased. In the literature, two approaches
have been proposed to check the stationarity assumption: a graphical
assessment and an analytical test procedure. Asgharian et al. (2006)
demonstrated that the stationarity assumption can be checked graphically
by comparing the Kaplan–Meier estimators based on the current and
residual survival times. A large discrepancy indicates that the
stationarity assumption is invalid. Addona and Wolfson (2006) proposed an
analytic test to check the assumption. They showed that testing the
stationarity assumption is equivalent to testing whether the
distributions of the backward and forward recurrence times are the same.
Let
In R, we can explore the stationarity assumption graphically using
function station.test.plot
as follows:
station.test.plot(a, v, delta)
where a
is the vector of backward recurrence times, v
is the vector
of forward recurrence times, and delta
is the vector of censoring
indicators. The function produces a plot of two Kaplan–Meier curves. To
test the assumption analytically, we can use
station.test(a, v, delta, digits = 3L)
where the data input arguments are the same as those in the
station.test.plot
function. The test statistic and the corresponding
digits
.
The three major components of CoxPhLb are (i) model fitting using
function coxphlb
, (ii) model checking using functions coxphlb.ftest
and coxphlb.phtest
, and (iii) stationarity assumption testing using
function station.test
. An overview of all functions in the CoxPhLb
package is presented in Table 1. In the following sections, we
provide R codes that illustrate how to use the functions with the
simulated data that are available in the CoxPhLb package and a real
dataset, the Channing House data, which is publicly available in the
KMsurv package
(Klein et al. 2012). The provided R codes can be implemented after
installing and loading the CoxPhLb package, which will automatically
load the survival package.
Function | Description | S3 methods |
---|---|---|
coxphlb |
Fits a Cox model to right-censored length-biased data | print() |
summary() |
||
coef() |
||
vcov() |
||
coxphlb.ftest |
Tests the functional form of covariates | print() |
coxphlb.phtest |
Tests the proportional hazards assumption | print() |
station.test |
Tests the stationarity assumption | print() |
coxphlb.ftest.plot |
Returns a graph for testing the functional form of covariates | |
coxphlb.phtest.plot |
Returns a graph for testing the proportional hazards assumption | |
station.test.plot |
Returns a graph for testing the stationarity assumption |
We use the simulated dataset, ExampleData1
, that is available in the
CoxPhLb package for illustration. The data have y
), the truncation variable
(a
), the censoring indicator (delta
), and two covariates, x1
) and x2
). The vector of forward recurrence times (v
) is the
difference between the failure times and the backward recurrence times
(y-a
).
We begin by checking the stationarity assumption of the simulated dataset graphically as follows.
> data("ExampleData1", package = "CoxPhLb")
> dat1 <- ExampleData1
> station.test.plot(a = dat1$a, v = dat1$y - dat1$a, delta = dat1$delta)
The resulting plot is presented in Figure 3. We observe that the two Kaplan–Meier curves are very close to each other, especially at the early time points, which provides some evidence of the stationarity of incidence. However, we note some discrepancy in the tails of the curves. Thus, we conduct an analytical test to verify the underlying assumption as follows.
> station.test(a = dat1$a, v = dat1$y - dat1$a, delta = dat1$delta)
test.statistic p.value
-0.375 0.707
To save the computed test statistic and the corresponding
> fit.ee1 <- coxphlb(Surv(a, y, delta) ~ x1 + x2, data = dat1, method = "EE")
Call:
coxphlb(formula = Surv(a, y, delta) ~ x1 + x2, method = EE)
coef variance std.err z.score p.value lower.95 upper.95
x1 1.029 0.031 0.177 5.83 <0.001 0.683 1.375
x2 0.45 0.132 0.364 1.24 0.216 -0.263 1.163
The outputs include the estimated coefficients, the corresponding
variance and the standard error estimates, the computed z scores and fit.ee1
as an object of the
"coxphlb"
class. In the resulting table, we observe that the effect of
> coxphlb(Surv(a, y, delta) ~ x1 + x2, data = dat1, method = "Bootstrap", seed.n = 1234)
Call:
coxphlb(formula = Surv(a, y, delta) ~ x1 + x2, method = Bootstrap)
coef variance std.err z.score p.value lower.95 upper.95
x1 1.029 0.032 0.178 5.78 <0.001 0.68 1.378
x2 0.45 0.133 0.365 1.23 0.218 -0.266 1.166
The outputs based on the bootstrap resampling method (i.e.,
method = Bootstrap
) are close to the results from the model-based
variance estimation (i.e., method = EE
). To measure the average
execution times of the two variance estimation methods, we ran the
coxphlb
function 100 times. The average execution times were 0.972s
and 3.581s for method = EE
and Bootstrap
, respectively, on a desktop
computer with Intel Core i5 CPU@3.40GHz and 8 GB 2400 MHz DDR4 of
memory, which demonstrates the computational efficiency of coxphlb
with the model-based variance estimation.
We note that the estimation results are only valid when the Cox proportional hazards model assumptions are correct. We thus verify the proportional hazards model assumptions. First, the linear functional form of the second covariate which has continuous values, can be checked as follows.
> ftest1 <- coxphlb.ftest(fit = fit.ee1, data = dat1, spec.p = 2, seed.n = 1234)
p.value
x2 0.433
> coxphlb.ftest.plot(ftest1, n.plot = 50, seed.n = 1234)
To test the linear functional form, the object fit.ee1
is specified as
an input argument. We set spec.p = 2
to conduct the test for seed.n = 1234
for reproducible results. The coxphlb.ftest
function returns a ftest1
which is in the
"coxphlb.ftest"
class and set n.plot = 50
to plot 50 realization
lines. Based on Figure 4 and the
> phtest11 <- coxphlb.phtest(fit = fit.ee1, data = dat1, spec.p = 1, seed.n = 1234)
p.value
x1 0.59
> phtest12 <- coxphlb.phtest(fit = fit.ee1, data = dat1, spec.p = 2, seed.n = 1234)
p.value
x2 0.833
> coxphlb.phtest.plot(phtest11, n.plot = 50, seed.n = 1234)
> coxphlb.phtest.plot(phtest12, n.plot = 50, seed.n = 1234)
The outputs consist of the
> coxphlb.phtest(fit = fit.ee1, data = dat1, spec.p = NULL, seed.n = 1234)
p.value
GLOBAL 0.762
The global test can be conducted by specifying spec.p = NULL
. The
result is consistent with the tests performed for each covariate. Note
that the graphical assessment is unavailable for the global test.
The Channing House data were collected from 462 individuals at a
retirement center located in Palo Alto, California, from 1964 to 1975
(Klein and Moeschberger 2003). We consider the elders aged death
), age at entry
(ageentry
), age at death or censoring (age
), and gender (gender
).
The observed survival times are left-truncated because only individuals
who have lived long enough to enter the retirement center will be
included in the data. We load the original dataset and select the subset
of the dataset for illustration. Note that we convert age measured in
months to years.
> install.packages("KMsurv")
> data("channing", package = "KMsurv")
> dat2 <- as.data.frame(cbind(ageentry = channing$ageentry/12, age = channing$age/12,
+ death = channing$death, gender = channing$gender))
> dat2 <- dat2[dat2$ageentry >= 65, ]
First, we check the stationarity assumption as follows.
> station.test.plot(a = (dat2$ageentry - 65), v = (dat2$age - dat2$ageentry),
+ delta = dat2$death)
The resulting plot in Figure 6 provides a strong sign that the stationarity assumption is satisfied. We further verify the assumption by conducting the analytical test.
> station.test(a = (dat2$ageentry - 65), v = (dat2$age - dat2$ageentry),
+ delta = dat2$death)
test.statistic p.value
0.261 0.794
Based on the results, we can conclude that the stationarity assumption is valid. Hence, we use the functions in CoxPhLb to assess the covariate effects on the survival outcome for the Channing House data. The model can be fitted as follows.
> fit.ee2 <- coxphlb(Surv((ageentry - 65), (age - 65), death) ~ gender, data = dat2,
+ method = "EE")
Call:
coxphlb(formula = Surv((ageentry - 65), (age - 65), death) ~ gender, method = EE)
coef variance std.err z.score p.value lower.95 upper.95
gender -0.112 0.029 0.17 -0.66 0.509 -0.445 0.22
Using the model-based variance estimation, we find that gender is not a significant factor for survival. The bootstrap resampling approach provides consistent results as follows.
> coxphlb(Surv((ageentry - 65), (age - 65), death) ~ gender, data = dat2,
+ method = "Bootstrap", seed.n = 1234)
Call:
coxphlb(formula = Surv((ageentry - 65), (age - 65), death) ~ gender, method = Bootstrap)
coef variance std.err z.score p.value lower.95 upper.95
gender -0.112 0.028 0.168 -0.67 0.504 -0.441 0.217
The estimation of the covariate effects is only valid when the Cox model
assumptions are not violated. By conducting the following proportional
hazards assumption test, we verify that the assumption is reasonable
based on the computed
> phtest2 <- coxphlb.phtest(fit = fit.ee2, data = dat2, spec.p = 1, seed.n = 1234)
p.value
gender 0.723
> coxphlb.phtest.plot(phtest2, seed.n = 1234)
Observational data subject to length-biased sampling have been widely recognized by epidemiologists, clinicians, and health service researchers. While statistical methodologies have been well established for analyzing such types of failure time data, the lack of readily available software has been a barrier to the implementation of proper methods. We introduce the R package CoxPhLb that allows practitioners to easily and properly analyze length-biased data under the Cox model, which is commonly used for conventional survival data. When the stationarity assumption is uncertain for the data, one can check the assumption graphically and analytically using tools provided in the package prior to fitting the Cox model. In addition, the fundamental assumptions of the Cox model can be examined.
The CoxPhLb package may be further expanded by including other
estimation approaches under the Cox model. For example, Qin et al. (2011)
proposed an estimation method based on the full likelihood, which
involves more intensive computations. Implementation of this method
yields more efficient estimators, which is certainly desirable. In
addition, when a violation of the proportional hazards assumption is
detected by the coxphlb.phtest
or coxphlb.phtest.plot
functions, we
may consider extending the regression method to handle covariates with
non-proportionality such as the
coxphw package
(Dunkler et al. 2018) which implements the weighted Cox regression method
for conventional survival data. We leave these possible extensions for
future work.
This work was partially supported by the U.S. National Institutes of Health, grants CA193878 and CA016672.
CoxPhLb, survival, KMsurv, coxphw
ClinicalTrials, Econometrics, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lee, et al., "CoxPhLb: An R Package for Analyzing Length Biased Data under Cox Model", The R Journal, 2020
BibTeX citation
@article{RJ-2020-024, author = {Lee, Chi Hyun and Zhou, Heng and Ning, Jing and Liu, Diane D. and Shen, Yu}, title = {CoxPhLb: An R Package for Analyzing Length Biased Data under Cox Model}, journal = {The R Journal}, year = {2020}, note = {https://rjournal.github.io/}, volume = {12}, issue = {1}, issn = {2073-4859}, pages = {118-130} }