condSURV : An R Package for the Estimation of the Conditional Survival Function for Ordered Multivariate Failure Time Data

One major goal in clinical applications of time-to-event data is the estimation of survival with censored data. The usual nonparametric estimator of the survival function is the time-honored Kaplan-Meier product-limit estimator. Though this estimator has been implemented in several R packages, the development of the condSURV R package has been motivated by recent contributions that allow the estimation of the survival function for ordered multivariate failure time data. The condSURV package provides three different approaches all based on the Kaplan-Meier estimator. In one of these approaches these quantities are estimated conditionally on current or past covariate measures. Illustration of the software usage is included using real data.


Introduction
One major goal in survival studies is the estimation of the survival function.The most popular method for estimating this function is the well-known product-limit estimator also known as Kaplan-Meier estimator (Kaplan and Meier, 1958).The popularity of the product-limit estimator is explained by its simplicity and intuitive appeal while requiring very week assumptions.It simply takes into account the empirical probability of surviving over a certain time.The method does not take into account of covariates, so it is mainly descriptive.Discrete covariates can be included by splitting the sample for each level of the covariate and applying the product-limit estimator for each subsample.This approach is not recommended for continuous covariates.To account for this extra difficulty several generalizations to the Kaplan-Meier estimator have been proposed throughout the last decades.Beran (1981) was the first one who proposed an estimator of the conditional distribution (survival) function with censored data in a fully nonparametric way.His estimator was further studied among others by Dabrowska (1987), Akritas (1994), Gonzalez-Manteiga and Cadarso-Suárez (1994) and Van Keilegom et al. (2001).All these estimators can be used to estimate the distribution (or survival) function conditional on a continuous covariable in a regression model, when data are subject to censoring.However, none of the above methods can be used to estimate the conditional survival when the covariate is censored.
Several software packages in the form of R packages have been developed to estimate the survival function.Though this function can be estimated parametrically or using nonparametric maximum likelihood estimation, the product limit Kaplan-Meier estimator is still one of the best options for estimating the survival function.Several R packages have been developed to implement the productlimit Kaplan-Meier estimator.For instance, the survival package (Therneau, 2015) and the prodlim package (Gerds, 2015) can be used to obtain Kaplan-Meier estimates.A comprehensive list of the available packages which can be used to estimate the survival function can be seen in the CRAN Task View "Survival Analysis" (Allignol and Latouche, 2016) of the Comprehensive R Archive Network (CRAN).
In many longitudinal medical studies, patients may experience several events through a follow-up period.In these studies, the analysis of sequentially ordered events are often of interest.The events of concern can be of the same nature (e.g., recurrent disease episodes in cancer studies) or represent different states in the disease process (e.g., "alive and disease-free", "alive with recurrence" and "dead").If the events are of the same nature, this is usually referred as recurrent events (Cook and Lawless, 2007), whereas if they are based on different disease categories they are usually modeled through their intensity functions (Meira-Machado et al., 2009).Again, several R packages have been developed to deal with problems that arise in these processes (see for example, Allignol and Latouche 2016).Some of these packages can be used to estimate occupation probabilities, transition probabilities and the cumulative incidence functions.However, none can be used to estimate conditional survival probabilities such as: where T 1 , T 2 and T 3 are ordered event times of successive events.This issue was recently considered by Meira-Machado et al. (2016) who proposed nonparametric and semiparametric estimators for such quantities.
This paper describes the R package condSURV (available from the Comprehensive R Archive The R Journal Vol.8/2, December 2016 ISSN 2073-4859 Network at https://CRAN.R-project.org/package=condSURV/) and its capabilities for implementing nonparametric and semiparametric estimators for conditional survival probabilities for two multivariate ordered times.The package can also be used to estimate more general functions involving more than two successive event times.The estimation of these quantities is essential for long-term survival prognosis which arises in many medical contexts such as cancer studies, asthma, HIV/AIDS, heart disease, dementia and Alzheimer's disease, etc.The methods implemented in the package also deal with the possible effect of covariates on the conditional survival probabilities (e.g., where Z denotes a continuous covariate).To account for the covariate effect, a flexible approach is based on local smoothing by means of kernel weights based on local constant (Nadaraya-Watson) regression.In this article we explain and illustrate how numerical and graphical output for all methods can be obtained using the condSURV package.
The remainder of this paper is organized as follows.The following section provides a brief introduction to the methodological background.All estimators for the conditional survival function are presented.Then, a detailed description of the package is presented, and its usage is illustrated through the analysis of a real data set; and finally, the last section contains the main conclusions of this work.The use of the package to more than two consecutive events is illustrated in the Appendix.

Methodology background
Suppose that an individual may experience K consecutive events at times which are measured from the start of the follow-up.In this section different methods are proposed to estimate conditional survival probabilities such as where T 1 and T 2 are ordered event times of two successive events.The proposed methods are all based on the Kaplan-Meier estimator and the ideas behind the proposed estimators can also be used to estimate more general functions involving more than two successive event times.However, for ease of presentation and without loss of generality, we take K = 2 in this section.The extension to K > 2 is straightforward.
Let (T 1 , T 2 ) be a pair of successive event times corresponding to two ordered (possibly consecutive) events measured from the start of the follow-up.Let T = T 2 denote the total time and assume that both T 1 and T are observed subject to a (univariate) random right-censoring variable C assumed to be independent of (T 1 , T). Due to censoring, rather than (T 1 , T) we observe ( T 1 , ∆ 1 , T, ∆ 2 ) where T 1 = min(T 1 , C), ∆ 1 = I(T 1 ≤ C), T = min(T, C), ∆ 2 = I(T ≤ C), where I(•) is the indicator function.Let ( T 1i , ∆ 1i , T i , ∆ 2i ), 1 ≤ i ≤ n be independent and identically distributed data with the same distribution as ( T 1 , ∆ 1 , T, ∆ 2 ).
Let S 1 and S be the marginal survival functions of T 1 and T; that is, S 1 (y) = P(T 1 > y) and S(y) = P(T > y).Introduce also the conditional survival probabilities P(T > y|T 1 > x) and P(T > y|T 1 ≤ x).without loss of generality, we only consider the estimation of S(y|x) = P(T > y|T 1 > x).
The Kaplan-Meier estimator, also known as the product-limit estimator, is the most frequently used method to estimate survival for censored data.The most used representation of the Kaplan-Meier estimator of the total time is through a product of the following form where R(t) = ∑ n i=1 I( T i ≥ t) denotes the number of individuals at risk just before time t.The censoring is assumed to be independent in the sense that the additional knowledge of the right-censoring times before any time y does not carry information on the risk of failure at time y.The Kaplan-Meier estimate is a step function with jumps at event times.The size of the steps depends on the number of events and the number of individuals at risk at the corresponding time.Below we introduce a weighted average representation of the Kaplan-Meier estimator which will be used later to introduce estimators for the conditional survival function where T (1) ≤ . . .≤ T (n) denotes the ordered T-sample and is the Kaplan-Meier weight attached to T (i) .In the expression of W i notation ∆ 2[i] is used for the i-th The R Journal Vol.8/2, December 2016 ISSN 2073-4859 concomitant value of the censoring indicator (that is, ∆ 2 In this work we are interested in the estimation of the conditional survival function, S(y | x) = P(T > y | T 1 > x).Below we provide estimators for this quantity, all based on the Kaplan-Meier estimator.
The conditional survival function S(y | x) can be expressed as S(y Then, the denominator of the term at the right hand side can be estimated using the Kaplan-Meier estimator of survival of the first time; the quantity at the numerator involves transformations of the pair (T 1 , T) which cannot be estimated so simply.This quantity can be estimated using Kaplan-Meier weights pertaining to the distribution of the total time to weight the bivariate data (Meira-Machado et al., 2016).The corresponding estimator is given as follows: Another way to introduce a (monotonic) nonparametric estimator for the conditional survival is by considering the landmark approach (Van Houwelingen, 2007).Given the time point x, to estimate the analysis can be restricted to the individuals with an observed first event time greater than x.Then, an estimator for the conditional survival function is just the Kaplan-Meier estimator of the survival function of T computed from such a subset where S x (y) is the Kaplan-Meier estimator of survival computed from the T, ∆ 2 -sample in i : The standard error of the nonparametric landmark estimator (LDM) may be large when the censoring is heavy, particularly with a small sample size.Interestingly, the variance of this estimator may be reduced by presmoothing (Dikta, 1998).Here, the idea of presmoothing involves replacing the censoring indicators (in the expression of the Kaplan-Meier weights) by some smooth fit before the Kaplan-Meier formula is applied.This preliminary smoothing may be based on a certain parametric family such as the logistic (thus leading to a semiparametric estimator), or on a nonparametric estimator of the binary regression curve.The corresponding presmoothed landmark estimator is then given by where w x i is defined through , where m( T) belongs to a parametric (smooth) family of binary regression curves, e.g., logistic.In practice, we assume that m(t) = m(t; β) where β is a vector of parameters which will be computed by maximizing the conditional likelihood of the ∆ 2 's given T.
Note that S PLDM (y | x) is the presmoothed Kaplan-Meier estimator of survival computed from the T, ∆ 2 -sample in i : T 1i > x ordered with respect to T.
The practical performance of the proposed estimators for the conditional survival function has been investigated through simulations (Meira-Machado et al., 2016).It has been demonstrated that all of the studied estimators perform well, approaching their targets as the sample size increases.Besides, simulation results reveal that the landmark estimator (LDM) performs favorably when compared with the first method (KMW).Furthermore, the reported simulation results reveal relative benefits of presmoothing (PLDM) in the heavily censored scenarios or for small sample sizes.Now we will explain how to introduce covariate information in the conditional survival function.Discrete covariates can be also included by splitting the sample for each level of the covariate and repeating the described procedures for each subsample.This approach is not valid for continuous covariates.To estimate the survival probabilities conditionally on continuous covariates The R Journal Vol.8/2, December 2016 ISSN 2073-4859 we propose to use local smoothing which is introduced using regression weights.Without loss of generality the methodology will be explained in the build of the conditional survival probability P(T > y|T 1 > x, Z = z), where Z denotes a continuous covariate.To estimate S(y we need to estimate the following conditional expectations: In the absence of censoring, to estimate these quantities, we may use kernel smoothing techniques by calculating a local average of the indicator functions.For example, E[I(T 1 > x, T ≤ y)|Z = z] could be estimated as follows where NW i (z, a n ) is a weight function which corresponds to the Nadaraya-Watson (Nadaraya, 1965;Watson, 1964) estimator (NW) as follows where K is a known probability density function (the kernel function) and a n is a sequence of bandwidths.
In our case, however, we allow the data to be right-censored.To handle right-censoring, inverse probability of censoring weighting (IPCW; see for example, Satten and Datta 2001) can be used.In order to introduce our estimators, note that, assuming that the support of the conditional where G Z and H Z denote the conditional survival functions of the censoring variable of the total time and the first event time, respectively, given Z.
The estimation of the conditional survival function, given a covariate under random censoring has been considered in many papers.This topic was introduced by Beran (1981) and was further studied by several authors (Dabrowska, 1987;Akritas, 1994;Gonzalez-Manteiga and Cadarso-Suárez, 1994;Van Keilegom et al., 2001).Their proposals can also be used to estimate the conditional survival function of C | Z = z, say G z .This can be done using the estimator introduced by Beran, In order to introduce our estimators we propose to plug-in Beran's estimator G Z and use NW to compute Similarly, we propose to plug-in Beran's estimator H Z and use NW to compute ) .
Then, we may introduce the IPCW estimator as follows:

condSURV in practice
This section introduces an overview of how the package is structured.condSURV is a shortcut for "conditional survival" and this is its major functionality: to provide estimates of the survival function conditional to previous (possibly censored) events.This software enables both numerical and graphical outputs to be displayed for all methods (KMW, LDM, PLDM and IPCW) described in the previous section.This software is intended to be used with the R environment for statistical computing and graphics.
Our package is composed of 12 functions that allow users to obtain estimates for all proposed methods. The

plot.survCS
Plot for an object of class "survCS".

summary.survCS
Summary for an object of class "survCS".

print.survCS
Print for an object of class "survCS".

KM
Computes the Kaplan-Meier product-limit of survival.

PKM
Computes the presmoothed Kaplan-Meier product-limit of survival.

KMW
Returns a vector with the Kaplan-Meier weights.

PKMW
Returns a vector with the presmoothed Kaplan-Meier weights.

LLW
Returns a vector with the local linear weights.

NWW
Returns a vector with the Nadaraya-Watson weights.Details on the usage of the functions (described in Table 1) can be obtained with the corresponding help pages.
It should be noted that to implement the methods described in Section Methodology background one needs the following variables of data in a specific order (as shown): time1, event1, Stime and event.The variable time1 represents the observed time to the first event of interest, and event1 the corresponding status/censoring indicator (if the survival time is a censored observation, the value is 0 and otherwise the value is 1).The variable Stime represents the total survival time.If event1 = 0, then the total survival time is equal to the observed time to the first event.The variable event is the final status of the individual (takes the value 1 if the final event of interest is observed and 0 otherwise).The illustration of the condSURV package for more than two event times is discussed in the Appendix.

Example of application
For illustration, we apply the proposed methods to data from a large clinical trial on Duke's stage III patients, affected by colon cancer, that underwent a curative surgery for colorectal cancer (Moertel et al., 1990).This data set is freely available as part of the R survival package.The data is also available as part of the R package condSURV.From the total of 929 patients, 468 developed a recurrence and among these 414 died.For each individual, an indicator of his/her final vital status (censored or not), the survival times (time to recurrence, time to death) from the entry of the patient in the study (in days), and a vector of covariates including age (in years) and recurrence (coded as 1 = yes; 0 = no) were recorded.The covariate recurrence is a time-dependent covariate which can be expressed as an intermediate event.We will use the methods described in Section Methodology background to study survival as well as the effect of recurrence on the final outcome (death).
In the following, we will demonstrate the package capabilities using this data.Below is an excerpt of the data.framewith one row per individual.The R Journal Vol.8/2, December 2016 ISSN 2073-4859 the individual represented in line 2 is alive and without recurrence at the end of follow-up.We note that event1 = 1 and event = 0 corresponds to individuals with an observed recurrence that remain alive at the end of the follow-up.

> library("condSURV") > data(colonCS) > head(colonCS
The development of the condSURV R package has been motivated by recent contributions that allow the estimation of the (conditional) survival function for ordered multivariate failure time data.This package contains the function survCS which takes the input data as an R formula and creates a survival object among the chosen variables for analysis.This function will verify if the data has been introduced correctly and will create a "survCS" object.Arguments in this function must be introduced in the following order time1, event1, time2, event2, . . ., Stime and event, where time1, time2, . . ., Stime are ordered event times and event1, event2, . . ., event their corresponding indicator statuses.This function plays a similar role as the Surv function in the survival R package.
The effect of "recurrence" is important on the patient outcome and can be studied through the ordered multivariate event time data of time-to-event from enrolment, to recurrence and to death.Results obtained from the estimation of the conditional survival probabilities, S(y | x) = P(T > y|T 1 > x), can be used to understand which individuals without recurring cancer after surgery are most likely to survive from their disease and which would benefit from more personal attention, closer follow-up and monitoring.Below we discuss how to estimate this and other quantities using the condSURV package.
Estimates for the conditional survival probabilities are obtained using function survCOND.The first argument of this function is a formula object with the response on the left of a ~operator.The response must be a "survCS" object which is obtained using the survCS function.A single covariate (qualitative or quantitative) can be included in the right hand side of the formula allowing the estimation of survival probabilities conditionally on current or past covariate measures.The use of the main function survCOND is explained below.
In the absence of covariates, two methods can be used to estimate the conditional survival probabilities: the method based on the use of Kaplan-Meier weights (KMW) and the method based on the landmark approach (KMW).A smoothed version of the landmark approach is also implemented.Given x = 365 (one year) and y = 1825 (five years), estimates for S(y | x) = P(T > y|T 1 > x) can be obtained using function survCOND with the method based on the use of Kaplan-Meier weights (method = "KMW"): > set.seed(123)> colon.kmw.1 <-survCOND(survCS(time1, event1, Stime, event) ~1, x = 365, y = 1825, + data = colonCS, method = "KMW") > summary(colon.kmw.1)P(T>y|T1>365) y estimate lower 95% CI upper 95% CI 1825 0.7303216 0.697005 0.7562444 As can be seen, the survCOND function provides, by default, 95% pointwise confidence intervals (conf = TRUE) using 200 bootstrap replicates (n.boot = 200).The construction of the pointwise confidence intervals is obtained by means of the bootstrap percentile method by randomly sampling the n items from the original data set with replacement (Davison and Hinkley, 1997).Intervals with other levels of confidence besides 95% (the default value) can be obtained by setting the argument conf.level to the desired level.
The package also provides plots for all methods.The following input commands (shown below) provide the plots for the conditional survival function P(T > y|T 1 > x) along y ≥ x where x is a predefined fixed value.The corresponding plots for the two landmark methods (LDM and PLDM) are shown in Figure 1.The plots were obtained for fixed values x equal to 365 and 1095 days, along time y.This figure allows for an inspection along time of the survival probability (i.e., of being alive with or without recurrence) for the individuals who are disease free 1 and 3 years after surgery.All curves are monotonously decreasing.It is also evident that the conditional survival probabilities are smaller for lower x values.This feature was expected since the survival time increases with an increase in the recurrence-free survival.Results also suggest that individuals with higher recurrence times are most likely to survive from their disease.
To illustrate the usage of the graphical parameter arguments of function plot.survCS,plots shown in the first row in Figure 1 were obtained using arguments col, confcol, xlab, ylab and ylim.Plots shown on the second row were obtained using the default values.For more details about the graphical parameter arguments, see the corresponding help file.
One important goal is to obtain estimates for the above estimated quantities (conditional survival probabilities) conditionally on current or past covariate measures.The current version of the package allows the inclusion of a single covariate.Below we illustrate its usage using two qualitative covariates rx (treatment: Obs(ervation), Lev(amisole), Lev(amisole)+5FU), sex (1 -male) and a continuous covariate age (in years).The following input commands provide the estimates of the conditional survival S(y | x) = P(T > y|T 1 > x) for the three treatment groups by including the covariate (rx) in the right hand side of the formula argument.
> colon.rx.ldm <-survCOND(survCS(time1, event1, Stime, event) ~rx, x = 365, + data = colonCS, method = "LDM") > summary(colon.rx.Results obtained for the three treatment groups reveal that the combined treatment of levamisole plus fluorouracil have a benefit on overall survival.This is confirmed by the plot shown in Figure 2 which can be obtained using the following input command: > plot(colon.rx.ldm, xlab = "Time (days)", ylab = "S(y|365)", conf = FALSE) Similarly, one can obtain the corresponding survival probabilities S(y | x) = P(T > y|T 1 ≤ x) for both genders (1 -male).Since this variable in the data.framecolonCS is of class "integer" it must be included in the formula using function factor.> colon.sex.ldm<-survCOND(survCS(time1, event1, Stime, event) ~factor(sex), x = 365, + data = colonCS, method = "LDM") > summary(colon.sex.ldm,times = 365 * 1:6) The plot shown in Figure 3  ence of the covariate age together with the 95% pointwise confidence based on the percentile bootstrap which resamples each datum with probability 1/n.The methods for implementing the conditional survival function conditionally on current or past covariate measures can be computationally demanding.In particular, the use of bootstrap resampling techniques are time-consuming processes because it is necessary to estimate the model a great number of times.The CPU time needed for running the input command required to obtain the plot shown in Figure 3 can take a few minutes.
In such cases we recommend the use of parallelization (cluster = TRUE).This allows to run those repeated operations (for example, the estimation of the conditional probability in each of the bootstrap replicates) on multiple processors/cores on your computer, or on multiple nodes of a cluster.Thus, we can reduce the execution time in the construction of the bootstrap-based confidence interval.
The use of the condSURV package to more than two consecutive events is illustrated in the Appendix.

Conclusions
This paper discusses the implementation in R of some newly developed methods for the estimation of the conditional survival function.The condSURV package implements nonparametric and semiparametric estimators for these quantities.The package also introduces and implements feasible estimation methods for these quantities conditionally on current or past covariate measures.Other related estimators are also implemented in the package.One of these estimators is the Kaplan-Meier estimator typically assumed to estimate the survival function.A modification of the Kaplan-Meier estimator based on a preliminary estimation (presmoothing) of the censoring probability for the survival time, given the available information is also implemented.
Software for multi-state survival analysis has been developed recently.These models deal with problems that are similar to those implemented in package condSURV.Among other quantities these packages deal with the estimation of the transition probabilities.It can be shown that in the progressive model with three states the conditional survival function P(T 2 > y | T 1 > x) can be expressed as the sum of two transition probabilities, p 11 (x, y) + p 12 (x, y).However, for more than three states no formal relation can be established between the two quantities.To the best of our knowledge none of the available software packages can be used to estimate conditional survival probabilities such as: P(T 2 > y | T 1 > x), P(T 3 > y|T 1 < x 1 , T 2 > x 2 ) or P(T 3 > y|T 1 > x 1 , T 2 > x 2 ) where T 1 , T 2 and T 3 are ordered event times of successive events.
We mention some important topics that we shall consider in future versions of the package.One important issue is about the extension of the proposed methods for interval censoring.Another topic of much practical interest is to establish a more formal relation between our software and the survival package.
The R Journal Vol.8/2, December 2016 ISSN 2073-4859 The results in this paper were obtained using R 3.2.5.The condSURV package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=condSURV/.

Figure 1 :
Figure 1: Estimation of the conditional survival function given that the subject is alive and disease-free at x = 365 (top) and x = 1095 (bottom row) days.Landmark estimators at the left and presmoothed landmark estimator on the right hand side.Colon cancer data.

Figure 3 :
Figure 3: Estimates of the conditional survival function given that the subject is alive and disease-free at x = 365 days given the continuous covariate age is equal to 48 years old.95% pointwise confidence bands based on the percentile bootstrap.Colon cancer data.
Conditional survival probabilities based on Kaplan-Meier weights and the Landmark approaches.This function also implements estimation methods for these quantities conditionally on current or past covariate measures.survCS Create a "survCS" object, usually used as a response variable in a model formula.

Table 1 :
Summary of functions in the condSURV package.
If argument y is omitted, then the survCOND function allows the user to obtain estimates for all possible y values.Then, one can use the summary function to get the estimated values at the desired values (through argument times of the summary function).A truncated output for the following input commands is shown below:Similarly, one can obtain the results for the landmark methods (LDM and PLDM) using the same function survCOND.The unsmoothed landmark estimator is obtained using argument method = "LDM" whereas for obtaining the presmoothed landmark estimator the argument presmooth = TRUE is also required.