It is frequently of interest in time-to-event analysis to compare multiple survival functions nonparametrically. However, when the hazard functions cross, tests in existing R packages do not perform well. To address the issue, we introduce the package survELtest, which provides tests for comparing multiple survival functions with possibly crossing hazards. Due to its powerful likelihood ratio formulation, this is the only R package to date that works when the hazard functions cross. We illustrate the use of the procedures in survELtest by applying them to data from randomized clinical trials and simulated datasets. We show that these methods lead to more significant results than those obtained by existing R packages.
The nonparametric comparison of multiple survival functions is of
interest in numerous biomedical settings, such as clinical trials
(Robert et al. 2015), preclinical studies (Liebl et al. 2015) and observational
studies (Loupy et al. 2013) with right-censored time-to-event endpoints. It
has been implemented in existing R packages using log-rank-type
statistics. However, these log-rank-type tests can fail to detect
differences among survival curves when the hazard functions cross. For
example, consider the Kaplan–Meier (KM) estimated survival functions in
Figure 1 for the treatment and control groups of
patients in a randomized clinical trial. There is a clear gap between
the survival curves, which we would expect to be detected by a
reasonable statistical test. Nevertheless, the log-rank test provided in
the survival
(Therneau et al. 2020) package returns a
The need to compare survival functions with crossing hazards has been
documented in the statistics literature (see, e.g., Pepe and Fleming 1989; Yang and Prentice 2010).
There are many practical situations in which the hazard functions cross,
indicating that the instantaneous treatment effect changes direction.
For example, some treatments are initially harmful due to toxicity or
other complications, but may be beneficial later on. Other treatments
can have short-term benefits but produce side effects in the long run.
Despite the varying instantaneous treatment effect, the cumulative
treatment effect can still be positive, as reflected by a positive
difference between the treatment and control survival functions
throughout the follow-up period. It is important to be able to detect
such a difference, as the treatment would be worth considering in this
case. To this end, an adaptive weighted log-rank test was implemented in
the R package YPmodel
(Sun and Yang 2020), but this test involves a parametric assumption on the
hazard functions, is limited to two-sample comparisons, and cannot deal
with the general
To address this issue, the package
survELtest
(Chang 2020) provides nonparametric tests that can deal with
general
Our approach computes EL ratios at each observed uncensored time point, then summarizes them into maximal-deviation-type and integral-type statistics. The statistical theory of this approach and the empirical levels and powers in various simulation scenarios have been studied in Chang and McKeague (2016) ((2016, 2019)), but these authors focused on the technical development of one-sided tests, and did not provide a software package or an accessible guide for implementing the method. In this paper we provide a general framework for both two-sided and one-sided testing, an accessible guide to the R package survELtest, and a comparison with existing R packages (reviewed briefly in Section 2.1) in applications to data from clinical trials and simulated datasets.
For survdiff
function in the package survival consider the
Fleming-Harrington SurvivalTests
in the
coin (Hothorn et al. 2019) package
implements a reformulated weighted log-rank test as a linear rank test.
The maxstat
(Hothorn 2017) package performs tests using maximally selected log-rank
statistics.
This paper is organized as follows. In the next section, we provide a
brief review of
This section briefly reviews the log-rank-type statistics in existing R
packages, for testing whether the
To illustrate the pitfalls of this formulation under crossing hazards,
we restrict our attention to
In the proposed package survELtest, we use a likelihood ratio statistic, namely a pointwise EL statistic, to replace the estimated hazard difference in (1). This pointwise EL statistic quantifies, at each time point, the difference in the multiple survival functions. It is always non-negative, as are all typical likelihood ratio statistics, which prevents the problematic cancellation described in the previous section. In the rest of Section 2, we provide a brief description of this approach. More details can be found in Chang and McKeague (2016, 2019).
The pointwise EL statistic is constructed from the following likelihood
ratio:
For the desired simultaneous inference, we summarize the pointwise
statistics in two ways. The first, provided by the routine intELtest
,
takes a weighted sum:
Another way to summarize the pointwise statistics is to take a maximum
supELtest
. Such maximal-deviation-type
statistics have been used in the classical Kolmogorov–Smirnov test, and
are more sensitive to local differences amongst survival curves (i.e.,
differences among survival curves that appear only in a short period of
time). In contrast, the integral-type statistic
So far, we have focused on two-sided testing. For one-sided testing, we
consider the alternative
The resulting EL test will be preceded by an initial test that excludes
the possibility of crossings or alternative orderings among the survival
functions. The reason is due to the fact that for functional parameters
(e.g., survival functions), testing a one-sided alternative hypothesis
usually involves certain assumptions, such as that the functions are not
crossed and that their ordering is as hypothesized. These assumptions
may be checked using the initial test, with the null hypothesis being
that the assumptions are not satisfied, versus the alternative that they
are. If the null hypothesis of the initial test is rejected, we conclude
that the assumptions are satisfied and proceed to the EL test. Rejection
of the null hypothesis
As mentioned in Section 2.2, we need to specify wt
in the routine intELtest
. The default is an objective
weight wt = "p.event"
.
Despite the default weight, we provide in intELtest
two alternative
options that have been used for integral-type statistics in the
literature. One option is wt = "dF"
.
Another weight, proposed by Pepe and Fleming (1989), is wt = "dt"
.
Having computed the statistics, we need to calibrate the tests. Possible
methods include bootstrapping or simulating the limiting distributions.
We choose the former for the following two reasons: (a) in small
samples, calibration using the bootstrap can perform better than using
the limiting distribution (Heller and Venkatraman 1996). (b) in our experience the
bootstrap can be more computationally efficient, since the limiting
distributions of
Here we adopt a Gaussian multiplier bootstrap approach, which is
commonly used instead of the nonparametric bootstrap in survival
analysis to avoid producing tied data in the bootstrap samples. To form
the bootstrap samples, the original data are perturbed using independent
standard Gaussian random variables, termed Gaussian multipliers (see,
e.g., Parzen et al. 1997). We denote the number of bootstrap samples as nboot
argument (default is 1000). In cases when
nsplit
parts, where nsplit =
/ nlimit
nlimit = 200
). Here and in the sequel, if we do not specify which R
function an option or argument applies to, then the option or argument
applies to all the functions provided by the survELtest package.
Since the bootstrap involves random sampling, the critical values will
differ based on different sets of bootstrap samples. To make the
critical values reproducible, we set a seed for random number generation
via the seed
option in our routines.
The survELtest package can be installed along with survELtest using
the following R code:
> install.packages("survELtest")
The following code loads the package:
> library(survELtest)
The main routines in survELtest are intELtest
, supELtest
,
nocrossings
, and ptwiseELtest
. The intELtest
routine conducts
testing based on the integrated EL statistics supELtest
routine conducts testing based on the maximally
selected EL statistics sided = 1
in both intELtest
and supELtest
, but should be preceded
by the initial test in nocrossings
to exclude the possibility of
crossings or alternative orderings among the survival functions. While
the first three routines provide simultaneous testing, ptwiseELtest
conducts pointwise testing to compare the survival curves at each time
point. It can be used to identify periods of local differences, after
intELtest
or supELtest
test gives a significant result. A flow chart
of the procedure for using the survELtest package is given in Figure
2. Methods defined for the objects produced by the
main routines are provided for print
and summary
. In addition to the
aforementioned routines, survELtest contains four datasets:
hepatitis
, threearm
, hazardcross
and hazardcross_Weibull
, which
will be analyzed in Sections 3.2,
3.3, and the Appendix to illustrate the use of the
routines.
A summary of the R code and the input arguments of the routines are
given as follows. Among the input arguments below, only the formula
input is compulsory. The rest of the arguments can be omitted if the
default settings are used.
> intELtest(formula, data = NULL, group_order = NULL, t1 = 0, t2 = Inf, sided = 2,
+ nboot = 1000, wt = "p.event", alpha = 0.05, seed = 1011, nlimit = 200)
> supELtest(formula, data = NULL, group_order = NULL, t1 = 0, t2 = Inf, sided = 2,
+ nboot = 1000, alpha = 0.05, seed = 1011, nlimit = 200)
> nocrossings(formula, data = NULL, group_order = NULL, t1 = 0, t2 = Inf, sided = 2,
+ nboot = 1000, alpha = 0.05, seed = 1011, nlimit = 200)
> ptwiseELtest(formula, data = NULL, group_order = NULL, t1 = 0, t2 = Inf, sided = 2,
+ nboot = 1000, alpha = 0.05, seed = 1011, nlimit = 200)
The time needed to run these functions depends on the total number intELtest
with the
default settings, it takes about 0.32 seconds on the dataset hepatitis
with threearm
with
formula
: a formula object, with a Surv object on the left of the
data
described below, the variables in the formula
should be already
defined by the user or in attached R objects.data
: an optional data frame containing the variables in the
formula
: the observed survival and censoring times, the censoring
indicator, and the grouping variable. The default is the data frame
with three columns of variables taken from the formula
: column 1
contains the observed survival and censoring times, column 2 the
censoring indicator, and column 3 the grouping variable.group_order
: a t1
: the first endpoint of a prespecified time interval, if any, to
which the comparison of the survival functions is restricted. The
default value is 0.t2
: the second endpoint of a prespecified time interval, if any,
to which the comparison of the survival functions is restricted. The
default value is sided
: 2 if two-sided test, and 1 if one-sided test. The default
value is 2.nboot
: the number of bootstrap replications in calculating
critical values for the tests. The default value is 1000.wt
: the name of the weight to be used in the integrated EL
statistics in intELtest
: "p.event
", "dF
", or "dt
". The
default is "p.event
".alpha
: the pre-specified significance level of the tests. The
default value is 0.05.seed
: the seed for the random number generator in R, for
generating bootstrap samples needed to calculate the critical values
for the tests. The default value is 1011.nlimit
: a number used to calculate
nsplit =
/ nlimit
nboot
bootstrap replications is
split. The use of this variable can make computation faster when the
number of time points nlimit
is 200.In this section we apply the routines supELtest
and ptwiseELtest
to
the dataset threearm
provided in the survELtest package, and compare
the results with the log-rank-type tests for trend. The dataset is
obtained by resampling from a perturbed dataset of patients from a
randomized clinical trial for the treatment of major depression, where
the perturbation is achieved by adding a random censboot
function in the package
boot (Canty and Ripley 2020). The
original data were analyzed by Chang and McKeague (2019), who observed a local difference
among the survival functions.
The purpose of analyzing the threearm
dataset is to assess whether the
survival functions of the three arms are ordered: that is, whether the
experimental treatment group (supELtest
.
The endpoint of the clinical trial is time (in days) to first remission. Because a shorter time to first remission is desirable, a treatment with a lower value of the survival function is better in this dataset. Based on this information, from the KM estimated survival curves in the left panel of Figure 3, it seems that the three groups are similar initially but become ordered for the rest of the follow-up period.
threearm
dataset: experimental
treatment group (solid), standard treatment group (dashed) and placebo
group (two-dashed).To see if the curves are statistically significantly ordered, we start
with conducting the commonly used log-rank-type tests. The trend test is
needed for the one-sided research question. Using the common choice
> library(survival)
> dat = Surv(threearm[, 1], threearm[, 2])
> logrank = survdiff(dat ~ threearm[, 3])
> score_vec = 3 : 1
> logrankteststat = matrix(score_vec, nrow = 1, ncol = 3)
+ %*% (logrank$obs - logrank$exp) / sqrt(matrix(score_vec, nrow = 1, ncol = 3)
+ %*% (logrank$var) %*% matrix(score_vec, nrow = 3, ncol = 1))
> if(logrankteststat < 0){
+ pval = 2 * pnorm(logrankteststat)
+ }else{
+ pval = 2 * (1 - pnorm(logrankteststat))
+ }
> round(pval, 2)
[,1]
[1,] 0.04
As the log-rank test for trend gives a survdiff(dat ``.17ex
threearm[, 3], rho = 1)
in the above code, which leads to a
Now we conduct the proposed one-sided testing for the threearm
data.
We anticipate a more significant result than the log-rank-type tests, as
there seems to be crossing among the estimated hazard functions in the
right panel of Figure 3, created using the function
muhaz
in the package
muhaz (Hess and Gentleman 2019) with
the default settings. The initial test and the maximally selected EL
test are implemented by the routines nocrossings
and supELtest
,
respectively. (Note that if two-sided testing is conducted instead, then
the initial test is not needed.) To use the routines, we need to specify
two options: sided = 1
for the one-sided test, and
group_order = c(3, 2, 1)
, since the hypothesized order among the three
arms with the survival rates ranging from the largest to the smallest is
the placebo (coded as 3 in the grouping variable), standard treatment
(coded as 2), and experiment treatment (coded as 1). The rest of the
options are kept at their default values. The R code for performing the
initial test is as follows:
> nocrossings(Surv(threearm$time, threearm$censor) ~ threearm$group,
+ group_order = c(3, 2, 1), sided = 1)
Call:
nocrossings(formula = Surv(threearm$time, threearm$censor) ~ threearm$group,
group_order = c(3, 2, 1), sided = 1)
Decision = 1
A decision value of 1 means there is no crossing or alternative
orderings among the survival functions. Thus, we can proceed to the main
(maximally selected EL) test in the second step:
> supELtest(Surv(threearm$time, threearm$censor) ~ threearm$group,
+ group_order = c(3, 2, 1), sided = 1)
Call:
supELtest(formula = Surv(threearm$time, threearm$censor) ~ threearm$group,
group_order = c(3, 2, 1), sided = 1)
One-sided maximally selected EL test statistic = 14.23, p = 0.004
As the maximally selected EL test gives a
Since our procedure leads to the conclusion that the survival functions
are ordered, it can be of interest to identify periods of local
differences for further clinical investigation. To this end, we can use
the routine ptwiseELtest
for pointwise testing at each observed
uncensored time point:
> ptwise = ptwiseELtest(Surv(threearm$time, threearm$censor) ~ threearm$group,
+ group_order = c(3, 2, 1), sided = 1)
The list of the time points at which the survival functions are ordered
(i.e., decision == 1
) is obtained by
> round(ptwise$result_dataframe$time_pts[ptwise$result_dataframe$decision == 1], 2)
[1] 13.91 13.91 13.91 13.92 13.92 13.92 13.92 13.93 13.98 13.99 13.99 14.00 14.00
[14] 14.00 14.01 14.01 20.96 20.96 27.98 27.99 28.00 28.00 28.00 28.02 28.02 28.98
[27] 28.99 29.01 30.00 32.96 36.97 40.97 40.98 40.99 41.02 41.98 41.98 41.99 42.00
[40] 42.01 42.02 42.99 43.01 43.02 43.02 44.00 44.00 51.97 51.97 55.00 56.01 56.01
[53] 56.02 59.03 59.04 64.97 68.98 69.01
From the result, we see there are local differences occurring near the time points 14, 21, 30, 40, 52, 56, 59, 65 and 69 days.
Now we turn to our motivating example in the Introduction and
demonstrate the use of intELtest
and its benefit over the
log-rank-type tests. The corresponding dataset hepatitis
is provided
in the survELtest package. The dataset was obtained by reconstructing
survival and censoring information (Guyot et al. 2012) based on
digitizing the KM curves presented in Nguyen-Khac et al. (2011). It contains
survival data (in days, rounded to one decimal place) from patients in a
randomized clinical trial for the treatment of severe alcoholic
hepatitis. The purpose of the clinical trial was to assess if the
treatment group (
From the KM estimated survival curves in Figure 1, the
survival rate of the treatment group seems to be greater than that of
the control group over the entire follow-up period. To see whether the
difference between the survival functions are statistically significant,
we start with conducting the commonly used two-sided log-rank test:
> library(survival)
> dat = Surv(hepatitis[, 1], hepatitis[, 2])
> logrank = survdiff(dat ~ hepatitis[, 3])
> round(1 - pchisq(logrank$chisq, df = 1), 2)
[1] 0.07
The log-rank test gives a muhaz
in the
package muhaz with the default settings). We also conduct another
log-rank-type test—the Peto and Peto’s modification of the
Gehan-Wilcoxon test—by setting
survdiff(dat ``.17ex
hepatitis[, 3], rho = 1)
in the above code, which leads to a
Now we apply the proposed two-sided integrated EL test to the
hepatitis
data to see if we can better detect a difference between the
survival functions. The default options are used and the R code is as
simple as
> intELtest(Surv(hepatitis$time, hepatitis$censor) ~ hepatitis$group)
Call:
intELtest(formula = Surv(hepatitis$time, hepatitis$censor) ~ hepatitis$group)
Two-sided integrated EL test statistic = 1.42, p = 0.007
As the integrated EL test gives a
Note the decision as to whether there is a significant discrepancy
between the two survival functions is totally different for the log-rank
and the integrated EL tests at
In this paper we introduce the R package survELtest for comparing two or more survival functions nonparametrically based on right-censored data. It is the only R package to date that utilizes the powerful likelihood ratio formulation instead of log-rank-type statistics, thereby performing well when the hazard functions cross. We provide both maximal-deviation-type and integral-type statistics, for detecting local and cumulative differences among the survival functions, respectively.
The use of the software is illustrated using two data sets from
randomized clinical trials, where the estimated survival functions seem
to be ordered, but the estimated hazard functions cross. In these cases,
our procedures lead to more significant results than the results
obtained from the log-rank-type tests. Specifically, in one of the
examples, the original clinical trial concludes that there is no
significant difference between the treatment and the control groups
(log-rank
The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=survELtest. The development website is available at https://github.com/news11/survELtest.
The research of Hsin-wen Chang was partially supported by Ministry of Science and Technology of Taiwan under grants 106-2118-M-001-015-MY3 and MOST 109-2118-M-001-005-. The authors thank Yu-Ju Wang for computational support and Shih-Hao Huang for helpful comments. The authors declare that they have no conflict of interest.
Here we provide two more examples for comparing our procedures with
other existing tests in the literature, namely the log-rank test, the
Peto and Peto’s modification of the Gehan-Wilcoxon test, the adaptive
weighted log-rank test implemented in the R package YPmodel, and the
RMST method implemented in the R package survRM2. Since the latter two
methods cannot deal with the general hazardcross
, the survival time is generated from the
piecewise exponential model displayed in the left panel of Figure
5. Since the difference between the true survival curves
appears only during supELtest
to
detect such local differences. For the second dataset
hazardcross_Weibull
, the survival time is generated from the Weibull
model displayed in the right panel of Figure 5. We use
intELtest
because the difference between the true survival curves is
spread over the entire follow-up period. For both datasets, the true
hazard functions cross, but there is an obvious gap between the survival
curves. The censoring distributions are specified to be the same in each
arm, and uniform with administrative censoring at t = 10 and a censoring
rate of 25% in the first group. We use the default settings in
implementing the tests given in the aforementioned two packages.
The results are given in Table 1. Our tests provide more significant results in detecting the gap between the survival curves than any of the other tests for both datasets.
hazardcross
(left)
and hazardcross_Weibull
(right) datasets: the first (solid) and second
(dashed) groups.Datasets | EL | log-rank | PP | YP | dRMST | rRMST |
hazardcross |
0.037 | 0.106 | 0.060 | 0.096 | 0.126 | 0.130 |
hazardcross_Weibull |
0.005 | 0.080 | 0.006 | 0.009 | 0.014 | 0.018 |
survival, YPmodel, survRM2, survELtest, emplik, emplik2, ELYP, FHtest, clinfun, LogrankA, coin, maxstat, boot, muhaz
ClinicalTrials, Econometrics, Optimization, Survival, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Chang, et al., "Comparing Multiple Survival Functions with Crossing Hazards in R", The R Journal, 2021
BibTeX citation
@article{RJ-2021-002, author = {Chang, Hsin-wen and Tsai, Pei-Yuan and Kao, Jen-Tse and Lan, Guo-You}, title = {Comparing Multiple Survival Functions with Crossing Hazards in R}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {12}, issue = {2}, issn = {2073-4859}, pages = {107-119} }