The hazard function is a key component in the inferential process in survival analysis and is relevant for describing the pattern of failures. However, it is rarely shown in research papers due to the difficulties of nonparametric estimation. We developed the bshazard package to facilitate the computation of a nonparametric estimate of the hazard function, with data-driven smoothing. The method accounts for left truncation, right censoring and possible covariates. B-splines are used to estimate the shape of the hazard within the generalized linear mixed models framework. Smoothness is controlled by imposing an autoregressive structure on the baseline hazard coefficients. This perspective allows 'automatic' smoothing, avoiding the need to choose a smoothing parameter, which is instead estimated from the data as a dispersion parameter. A simulation study demonstrates the capability of our software, and an application estimating the hazard of non-Hodgkin's lymphoma in Swedish population data shows its potential.
The hazard function is the basis of the inferential process in survival analysis and, although relevant for describing the pattern of failures, is often neglected in favor of survival curves in clinical papers. The most widely applied model in survival analysis (the Cox model) allows valid comparisons in terms of hazard ratios without distributional assumptions concerning the baseline hazard function, whose nonparametric estimate is rarely shown. Thus the reference against which the relative hazard is estimated is usually ignored, and a crude measure of absolute risk is sometimes provided by the Kaplan-Meier estimator, which is only indirectly related to the shape of the hazard function.
In the methodological literature, some methods have been developed to obtain a nonparametric hazard estimate, including kernel-based (Müller and Wang 1994; Hess et al. 1999) and spline-based estimators (O'Sullivan 1988; Cai et al. 2002). However, specific statistical software commands accounting for the characteristics of survival data are lacking. An exception is the R package muhaz (Hess and Gentleman 2010), which estimates the hazard function from right-censored data using kernel-based methods, but this package does not accommodate left-truncated data, nor does it allow adjustment for possible covariates. Flexible parametric survival models can also be used to describe and explore the hazard function (Royston and Parmar 2002), and in R these have been implemented in the package flexsurv (Jackson 2014). In these models a transformation of the survival function is modeled as a natural cubic spline function of the logarithm of time (plus linear effects of covariates). However, this approach relies on an appropriate choice of the number of knots to be used in the spline.
We have developed the bshazard package to obtain a nonparametric smoothed estimate of the hazard function based on B-splines and generalized linear mixed models (GLMM). This perspective enables ‘automatic’ smoothing, as the smoothing parameter can be estimated from the data as a dispersion parameter (Eilers and Marx 1996; Pawitan 2001; Lee et al. 2006). Our implementation accommodates the typical characteristics of survival data (left truncation, right censoring) and also possible covariates. In the following sections we briefly review the methodology and demonstrate the performance of the package in numerical examples using simulated data. We illustrate the use of the package in an application to Swedish population data, where we estimate the incidence of Non-Hodgkin’s lymphoma (NHL) in sisters of patients.
In this section we briefly review the methodology for smoothing the hazard rate, inspired by Eilers and Marx (1996) and described in Chapter 9 of Lee et al. (2006).
Let $T$ denote the time to event and $\lambda(t)$ its hazard function, so that, for small $\Delta t$, $\lambda(t)\,\Delta t \approx P(t \le T < t + \Delta t \mid T \ge t)$.
Consider a sample of subjects that are observed from a common origin
(e.g. start of exposure, time of diagnosis) to the occurrence of the
event of interest. Observation can be right censored if the last
follow-up time is before the occurrence of the event and/or left
truncated if the observation starts after the origin (time 0). By
partitioning the time axis into very small intervals, the number of events in each interval is approximately Poisson with mean
$$\mu_i = P_i\,\lambda(t_i),$$
where $t_i$ is the midpoint of the $i$-th interval and $P_i$ the total person-time accumulated in it. In practice the splitting of the records is performed by the splitLexis function in the Epi package. Using (common) break points supplied by the user, the function divides each data record into disjoint follow-up intervals each with an entry time, exit time and censoring indicator, stacking these as separate 'observations' in the new data set. The bins should be small enough so that the hazard can be assumed approximately constant within each bin. We will use $y_i$ to denote the number of events observed in bin $i$, so that $y_i \sim \text{Poisson}(\mu_i)$ and $\log \mu_i = \log P_i + \log \lambda(t_i)$, with $\log P_i$ entering the model as an offset. Note that the coefficients of this model, one log-hazard level per bin, would simply reproduce the raw rates $y_i/P_i$, so some form of smoothing is required.
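As an illustration of this pre-binning step, the following sketch reproduces it directly with the Epi package; the toy data frame d and its columns entry, exit and status (1 = event) are our own assumptions, not objects from the article.

library(Epi)
# toy records: one row per subject, with possible late entry
d <- data.frame(entry = c(0, 0, 0.5), exit = c(2.3, 4.0, 3.1),
                status = c(1, 0, 1))
lex <- Lexis(entry = list(time = entry), exit = list(time = exit),
             exit.status = status, data = d)
spl <- splitLexis(lex, breaks = seq(0, 4, by = 0.5), time.scale = "time")
# events (y) and person-time (P) in each bin, as in the Poisson approximation
band <- timeBand(spl, "time", "left")
y <- tapply(spl$lex.Xst == 1, band, sum)
P <- tapply(spl$lex.dur, band, sum)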
Smoothers, such as regression splines, can be used to estimate the function $\log \lambda(t)$ as a linear combination of B-spline basis functions:
$$\log \lambda(t) = \sum_{j=1}^{q} \beta_j B_j(t), \qquad \text{[bspl]}$$
where $B_j(\cdot)$ denotes the $j$-th B-spline basis function, defined on a set of knots spanning the follow-up period, and $\beta_1, \ldots, \beta_q$ are unknown coefficients.
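As a concrete example, the basis matrix of equation [bspl] can be generated with the splines package; the bin midpoints tm below are illustrative.

library(splines)
tm <- seq(0.05, 3.95, by = 0.1)                      # assumed bin midpoints
Z  <- bs(tm, df = 31, degree = 1, intercept = TRUE)  # degree-1 basis, 31 columns
# for a coefficient vector beta, the fitted log-hazard at tm is drop(Z %*% beta)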
B-splines are advantageous because the smoothed estimate is determined
only by values at neighboring points and has no boundary effect.
Nevertheless the major criticism of this approach is the arbitrary
choice of the knots which determine the locations at which the smoothed
function can bend and this has been subject to various investigations
and discussions (Kooperberg and Stone 1992). With a small number of knots, a
B-spline may not be appealing because it is not smooth at the knots
(this problem may be avoided by higher-degree B-splines) and can lead to
underfitting of the data. The number of knots can be increased to
provide a smoother estimate, but a large number of knots may lead to
serious overfitting. O’Sullivan proposed using a relatively large number
of knots and preventing overfitting by a penalty term to restrict the
flexibility of the fitted curve (O’Sullivan 1986, 1988), which is
analogous to likelihood-based mixed effects modeling
(Eilers and Marx 1996; Pawitan 2001; Lee et al. 2006). In fact, the penalized least
squares equation for a quadratic penalty corresponds to a mixed model
log-likelihood, where the set of second-order differences of the
coefficients of the B-splines (denoted $\Delta^2\beta_j = \beta_j - 2\beta_{j-1} + \beta_{j-2}$) plays the role of a set of independent normal random effects with variance $\sigma_b^2$.
Intuitively, since the coefficients $\beta_j$ determine the height of the fitted curve near the corresponding knots, shrinking the second-order differences $\Delta^2\beta_j$ toward zero discourages abrupt changes in slope and thus enforces smoothness.
More formally, for model [bspl] the element $(i,j)$ of the model matrix $Z$ is $z_{ij} = B_j(t_i)$, so that the vector of log-hazard values at the bin midpoints is $Z\beta$ and the random-effect assumption is $\Delta^2\beta_j \sim N(0, \sigma_b^2)$.
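In matrix form the second-order differences are $D\beta$ for a banded difference matrix $D$, so the quadratic penalty appearing below is proportional to $\beta^\top D^\top D \beta$; a minimal construction in R:

q <- 31                              # number of B-spline coefficients
D <- diff(diag(q), differences = 2)  # (q-2) x q second-difference matrix
# D %*% beta returns the vector of second-order differences Delta^2 beta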
The Extended Quasi-Likelihood (EQL) approximation (Lee et al. 2006) facilitates an explicit likelihood formulation (also for overdispersed Poisson data):
$$\ell = -\frac{1}{2}\sum_{i}\left\{\frac{d(y_i,\mu_i)}{\phi} + \log\big(2\pi\phi\,V(y_i)\big)\right\} - \frac{1}{2\sigma_b^2}\sum_{j}(\Delta^2\beta_j)^2 - \frac{q-2}{2}\log\big(2\pi\sigma_b^2\big),$$
where $d(y_i,\mu_i) = 2\{y_i\log(y_i/\mu_i) - (y_i-\mu_i)\}$ is the Poisson deviance, $V(\cdot)$ is the variance function and $\phi$ is the dispersion parameter.
This log-likelihood can be seen as a penalized likelihood where the term $\frac{1}{2\sigma_b^2}\sum_j (\Delta^2\beta_j)^2$ penalizes fitted curves whose coefficients change too abruptly. The parameter $\lambda = \phi/\sigma_b^2$ plays the role of the smoothing parameter: the larger $\lambda$, the smoother the estimated hazard.
The advantage of the mixed model approach is that we have an established procedure for estimating $\lambda$: since $\phi$ and $\sigma_b^2$ are dispersion parameters, they can both be estimated from the data, avoiding any subjective choice of the amount of smoothing.
In our application, the following iterative algorithm is used (a toy R sketch is given after the description of the two steps):

1. Given initial/last estimated values of $\phi$ and $\sigma_b^2$ (and hence of $\lambda$), estimate the coefficients $\beta$ by iterative weighted least squares (IWLS).
2. Given $\hat\beta$, update the estimates of $\phi$ and $\sigma_b^2$.
3. Iterate between 1 and 2 until convergence.
We begin by implementing step 1 of the IWLS algorithm as follows. Given a fixed value of $\lambda$, the update of $\beta$ solves the penalized weighted least-squares equation
$$\big(Z^\top W Z + \lambda D^\top D\big)\beta = Z^\top W z,$$
where $W$ is the diagonal matrix of Poisson weights $\mu_i$ and $z$ is the usual IWLS working response (with $\log P_i$ treated as an offset).
At this point we can introduce a quantity to describe the complexity of the smoothing, that is the effective number of parameters (or degrees of freedom) associated with $\lambda$:
$$df(\lambda) = \mathrm{trace}\left\{\big(Z^\top W Z + \lambda D^\top D\big)^{-1} Z^\top W Z\right\},$$
which equals $q$ when $\lambda = 0$ and decreases toward 2 as $\lambda \to \infty$.
For step 2, given $\hat\beta$, the dispersion parameters are updated: $\phi$ from the Pearson residuals of the fitted Poisson model and $\sigma_b^2$ from the estimated second-order differences $\Delta^2\hat\beta_j$, yielding a new value of $\lambda = \phi/\sigma_b^2$. The quantity $\hat\phi$ captures possible overdispersion of the event counts relative to the Poisson assumption.
Once convergence is reached, a point-wise confidence band for the smooth estimate is computed for the linear predictor $Z\hat\beta$, using its estimated variance, and then transformed to the hazard scale by exponentiation.
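The following is a minimal, self-contained R sketch of the two-step algorithm; the function name phaz and the moment-type dispersion updates are our own simplifications for illustration, not the package's actual code. It assumes binned inputs: event counts y, person-times P (all positive) and bin midpoints tm, such as those produced in the splitting sketch above.

library(splines)
phaz <- function(tm, y, P, nk = 31, degree = 1, lambda = 10, err = 1e-6) {
  Z <- bs(tm, df = nk, degree = degree, intercept = TRUE)  # B-spline basis
  q <- ncol(Z)
  D <- diff(diag(q), differences = 2)   # second-order difference matrix
  beta <- rep(log(sum(y) / sum(P)), q)  # start from a constant hazard
  for (it in 1:50) {
    # Step 1: penalized IWLS for beta, given lambda (log P is an offset)
    for (j in 1:25) {
      mu <- P * exp(drop(Z %*% beta))
      z  <- drop(Z %*% beta) + (y - mu) / mu             # working response
      H  <- t(Z) %*% (mu * Z) + lambda * crossprod(D)    # penalized information
      beta.new <- drop(solve(H, t(Z) %*% (mu * z)))
      conv <- max(abs(beta.new - beta)) < err
      beta <- beta.new
      if (conv) break
    }
    mu <- P * exp(drop(Z %*% beta))
    H  <- t(Z) %*% (mu * Z) + lambda * crossprod(D)
    df <- sum(diag(solve(H, t(Z) %*% (mu * Z))))         # effective d.f.
    # Step 2: moment-type updates of the dispersion parameters, given beta
    phi <- sum((y - mu)^2 / mu) / (length(y) - df)       # Pearson overdispersion
    sb2 <- sum((D %*% beta)^2) / df                      # random-effect variance
    lambda.new <- phi / sb2                              # new smoothing parameter
    if (abs(lambda.new - lambda) < err * lambda) break
    lambda <- lambda.new
  }
  list(hazard = exp(drop(Z %*% beta)), lambda = lambda, phi = phi, df = df)
}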
This section describes the input and output of the bshazard package, which estimates the hazard function nonparametrically. In order to use bshazard, the following R packages need to be installed: survival, Epi and splines (Carstensen et al. 2011; Therneau 2014). After installation, the main function can be called by:
library(bshazard)
bshazard(formula, data, nbin, nk, degree, l0, lambda, phi, alpha, err, verbose)
The only mandatory argument is formula, which should contain a survival object (interpreted by the survival package) with the time of event (possibly in a counting process format), the censoring indicator and (optionally) the covariates. For example, the call

output <- bshazard(formula = Surv(time_of_entry, time_of_exit,
                   censoring_indicator) ~ covariates)
will produce a smoothed estimate of the hazard accounting for
covariates
with the following default settings:
the time axis split at each distinct observed time in the data (censoring or event);
B-splines with 31 knots;
degree of B-splines 1;
smoothing parameter estimated from the data (with starting value 10);
overdispersion parameter estimated from the data.
By providing various arguments in the function call, the user can override the default settings for the data format (e.g. specify a data frame), the number of bins (nbin), the number of knots (nk), the degree of the splines (degree), the smoothing and overdispersion parameters (lambda and phi), the confidence level (alpha) and the convergence criterion (err). Detailed explanations are provided in the function's help file.
The output of the bshazard function includes a data frame with the original data split in time intervals (raw.data), vectors containing the estimated hazard and confidence limits, and the parameter estimates (coefficients of the covariates, smoothing parameter, overdispersion parameter and degrees of freedom).
The package also includes the following functions:
summary(output) prints the values of the estimated hazard, with confidence limits, for each observed time, together with the parameter estimates;

plot(output) and lines(output) plot the hazard, with confidence limits.
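A minimal end-to-end call could look as follows; the data frame d and its columns are illustrative, and str is used simply to inspect the components of the returned object.

library(bshazard)
library(survival)
fit <- bshazard(Surv(entry, exit, status) ~ 1, data = d)
summary(fit)             # estimated hazard with confidence limits
plot(fit)                # smoothed hazard with point-wise confidence band
str(fit, max.level = 1)  # returned components, e.g. raw.data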
In order to test the flexibility of the proposed algorithm, we simulated data from a non-monotone hazard function that could represent, for example, the seasonality of flu incidence, and estimated the hazard by:
fit <- bshazard(Surv(exit_time, cens) ~ 1, data = dati, nbin = 100)
The choice to pre-bin the data into 100 time intervals was made for ease of comparing the estimates of the hazard from different simulations at the same time points. The hazard function estimate did not
change when using different numbers of bins or different numbers of
knots (data not shown). The results of this simulation are summarised in
Figure 1. The mean estimate of the hazard function is
very close to the true hazard. For comparison, we also estimated the
hazard using
muhaz(exit_time, cens, max.time = 3, bw.method = "g", n.est.grid = 100)
and plotted its mean estimate in Figure 1, where it can be seen to be very close to the true hazard. Under the same distribution we also simulated a left truncation time for each subject.
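As an illustration, event times with a non-monotone rate can be drawn by inverting the cumulative hazard; the sinusoidal shape below is an assumed example, not necessarily the hazard used in the simulations above.

set.seed(1)
n <- 1000
# assumed hazard 0.5 * (1 + sin(2 * pi * t)) and its cumulative hazard:
Lam <- function(t) 0.5 * t + (1 - cos(2 * pi * t)) / (4 * pi)
# inversion: solve Lam(T) = -log(U) with U ~ Uniform(0, 1)
tt <- sapply(runif(n), function(u)
  uniroot(function(t) Lam(t) + log(u), c(0, 100))$root)
dati <- data.frame(exit_time = pmin(tt, 3), cens = as.numeric(tt <= 3))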
In a second simulation, we included a left truncation time and a covariate x:
fit <- bshazard(Surv(entry_time, exit_time, cens) ~ x, data = dati, nbin = 30)
Results are shown in Figure 2. The mean estimate of
the hazard function is again very close to the true hazard. Note that
when adjusting for continuous covariates, the hazard is estimated under
the assumption of a linear effect of the covariates centered at their
mean values. In this case, a comparison with the muhaz
function was
not possible given that it does not accommodate covariates or late
entry.
In this section we use the bshazard package to estimate the hazard of cancer diagnosis using data from Swedish population registers. For all individuals born in Sweden since 1932, the Multi-Generation Register maintained by Statistics Sweden (Statistics Sweden 2009) identifies the biological parents, thus enabling the construction of kinships within families. Using the individually unique national registration number, individuals in this register can be linked to the National Cancer Register, which records all malignancies diagnosed since 1958. In our application we use data from Lee et al. (2013) who analyzed 3,015 sisters of 1,902 non-Hodgkin’s lymphoma (NHL) female patients and 15,697 sisters of 3,836 matched controls who were cancer free at the time of diagnosis of the corresponding case and who matched the case with respect to age, year of birth, sex and county of residence.
Sisters are at risk from birth to the age at NHL diagnosis or censoring due to death, emigration or end of study (2007). Individuals born before the start of the cancer register (1958) were considered at risk from their age at the start of the cancer register, resulting in delayed entry. In the study period, 32 NHL diagnoses were observed in the exposed group (sisters of subjects with a diagnosis of NHL) and 39 in the unexposed group (sisters of cancer-free subjects).
In order to illustrate the automatic smoothing, we first concentrated on
sisters of cancer-free subjects and computed the smoothed hazard using
the usual B-splines (i.e. setting the smoothing parameter $\lambda = 0$):
fit_notexp_l0 <- bshazard(Surv(entry_age, exit_age, cens) ~ 1, lambda = 0,
                          data = sis[nexpo, ])
where nexpo selects the rows corresponding to the sisters of control subjects.
The resulting hazard function, plotted in Figure 3, has several bumps, so we proceeded to estimate the smoothing parameter from the data, again using the same number of knots (the default of 31):
fit_notexp <- bshazard(Surv(entry_age, exit_age, cens) ~ 1, data = sis[nexpo, ])
The resulting estimate of the smoothing parameter produced the much smoother hazard curve also shown in Figure 3.
Lee et al. (2013) found that sisters of female NHL patients have a hazard ratio of NHL of 4.36 (95% confidence interval [2.74; 6.94]) compared to sisters of controls. For comparison with their results, we estimated the risk of NHL adjusting for 'exposure' (i.e. being a sister of a case rather than a control):
fit_adj_exp <- bshazard(Surv(entry_age, exit_age, cens) ~ case, data = sis)
where the variable case indicates whether the subject is a sister of a case. We obtained a very similar hazard ratio. As noted above, in the presence of covariates the hazard is estimated at their mean values; since this estimate is not meaningful for categorical variables, we obtained separate hazard estimates for unexposed and exposed subjects, as described below.
The plot function in the bshazard package is applied to an object of class 'bshazard' and allows one to easily plot separate curves for each set
of covariate values. The estimates plotted in Figure 4
were obtained using plot(fit_adj_exp, overall = FALSE)
and assume
proportionality of the hazard in exposed and unexposed sisters. As a
reference, we also performed stratified analyses obtaining estimates
separately for exposed and unexposed sisters and these are plotted in
grey in Figure 4. As expected, the hazard estimates are
similar to the separate estimates, but are constrained by the assumption
of proportionality. This is especially evident in the exposed group: the
hazard increase seems to level off after age 55, but this is not
detected by the hazard estimate obtained under the joint adjusted model.
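For reference, the overlay of Figure 4 could be reproduced with calls of the following form; the row selector expo, analogous to nexpo above, is illustrative.

# adjusted estimates under proportional hazards, one curve per exposure group
plot(fit_adj_exp, overall = FALSE)
# stratified estimates, fitted separately within each group and drawn in grey
fit_exp <- bshazard(Surv(entry_age, exit_age, cens) ~ 1, data = sis[expo, ])
lines(fit_exp, col = "grey")
lines(fit_notexp, col = "grey")  # unexposed fit obtained above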
The adjustment to the hazard estimate is particularly advantageous for continuous variables. The proposed method allows inclusion of more than one covariate, so in the NHL application the hazard of exposed and unexposed subjects could be further adjusted for calendar time by:
fit_adj_exp_caly <- bshazard(Surv(entry_age, exit_age, cens) ~ case + yob, data = sis)
This yielded hazard estimates that were essentially unchanged and are not reported here.
We have implemented a software package in R to obtain a nonparametric smoothed estimate of the hazard function based on B-splines from the perspective of generalized linear mixed models. The Poisson model leads to an elegant form for the log of the hazard (Lambert and Eilers 2005). We adopted the discrete approach to survival analysis, which allows staggered entry, intermittent risk periods and large data sets to be handled with ease. The code is easy to use and accommodates the typical characteristics of survival data (left truncation, right censoring) by using the survival package. It also accounts for possible covariates, but the user should be aware that covariates are assumed to have a constant effect on the hazard (proportional hazards). It is of note that the model is also valid for estimating the rate function over time when an individual can have repeated events (Cook and Lawless 2002). For such applications, the code can be used without change, with the data supplied in a counting process format. The package bshazard is available from http://CRAN.R-project.org/package=bshazard.
The main advantage of our function is that the user can obtain the estimated hazard by a simple line of code and that the extent of smoothing is data-driven (i.e. the user does not need to specify any smoothing parameter). The number of degrees of freedom gives an intuitive interpretation of the amount of smoothing and is also useful for making comparisons between different smoothers.
To prepare the data for analysis the package uses the splitLexis function (Epi package). The choice of time intervals does not affect the smoothed estimate as long as the bins are small enough for the assumption of constant hazard within the bin to be reasonably satisfied. For large numbers of observations the splitting can be time consuming, especially when accounting for many covariates. Nevertheless, in our relatively large data set of NHL sisters, the most complex hazard estimate, adjusting for two covariates (fit_adj_exp_caly), was obtained in less than one minute. Interval-censored data are not accommodated by the code at this time, but the package can still be used if the censored intervals are relatively short. In this situation we could choose the bins in such a way that every censored interval is completely included in one bin, avoiding the problem of the specification of the exact event time, but some assumptions on person-time at risk will be needed. With small data sets and in the presence of covariates, estimation of both smoothing and overdispersion parameters can cause convergence problems; in this case, if there is not strong evidence of overdispersion, we suggest fixing the overdispersion parameter at 1 (argument phi).
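For example, a call fixing the overdispersion parameter could be (data and variable names illustrative):

fit_fixed <- bshazard(Surv(entry, exit, status) ~ x, data = d, phi = 1)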
The ability to estimate the hazard function with a simple command provides a useful tool for a deeper understanding of the process being studied, both in terms of the magnitude of the risk and the shape of the curve (Rebora et al. 2008). For example, in a previous paper, we found that sisters of subjects diagnosed with NHL have a hazard ratio of 4.36 (95% confidence interval [2.74; 6.94]) for NHL compared to sisters of controls (Lee et al. 2013), but did not show at which age the risk was higher. Reanalyzing the data using bshazard revealed how the magnitude of the risk varied with age. This important information is often neglected in epidemiological studies, in large part due to the lack of simple and accessible software tools. An important area of application is in the presence of time-dependent variables, when an absolute measure of risk cannot be obtained from the Kaplan-Meier estimator. For example, in a comparison of the effect on disease-free survival of chemotherapy and transplantation, which occur at different time points, the Kaplan-Meier method will tend to overestimate the survival of the transplanted group, since these patients have to survive until transplant (immortal time bias). In such situations, a hazard estimate is particularly useful for presenting the instantaneous risk of an event over time, since it conditions on the subjects at risk at each time point.
In summary, the bshazard package can enhance the analysis of survival
data in a wide range of applications. The advantage of automatic
smoothing and the close relationship with the survfit
function make
the package very simple to use.
The authors would like to thank Yudi Pawitan who provided the initial code for smoothing.