ppseq: An R Package for Sequential Predictive Probability Monitoring

Advances in drug discovery have produced numerous biomarker-guided therapeutic strategies for treating cancer. Yet the promise of precision medicine comes with the cost of increased complexity. Recent trials of targeted treatments have included expansion cohorts with sample sizes far exceeding those in traditional early phase trials of chemotherapeutic agents. The enlarged sample sizes raise ethical concerns for patients who enroll in clinical trials, and emphasize the need for rigorous statistical designs to ensure that trials can stop early for futility while maintaining traditional control of type I error and power. The R package ppseq provides a framework for designing early phase clinical trials of binary endpoints using sequential futility monitoring based on Bayesian predictive probability. Trial designs can be compared using interactive plots and selected based on measures of efficiency or accuracy.

Emily C. Zabor http://www.emilyzabor.com/ (Department of Quantitative Health Sciences & Taussig Cancer Institute, Cleveland Clinic) , Brian P. Hobbs (Dell Medical School, The University of Texas at Austin) , Michael J. Kane (Department of Biostatistics, Yale University)
2023-02-10

1 Introduction

Statistical methods for early phase oncology trials were developed in the context of cytotoxic treatments. Most cytotoxic treatments exhibit increases in both efficacy and toxicity with increasing dose. As a result, phase I trials were centered on identifying the maximum tolerated dose (MTD), defined as the highest dose that did not exceed a pre-specified toxicity threshold. But advances in drug discovery have produced numerous biomarker-guided non-cytotoxic therapies such as small molecule inhibitors, antibody drug conjugates, immune checkpoint inhibitors, and monoclonal antibodies. These therapies typically do not exhibit the same pattern of increases in both efficacy and toxicity with increasing dose, so the MTD as traditionally defined may not exist. Instead, lower doses may have equivalent efficacy but lower toxicity as compared to higher doses. As a result, these therapies can be difficult to study with traditional dose-escalation designs such as the rule-based 3+3 and the model-based continual reassessment method, among others (Pestana et al. 2020). To address this issue, recent phase I trials have included large dose-expansion cohorts, in which additional patients are enrolled in phase I after the dose-escalation phase is complete. In this setup, the dose-escalation phase is considered phase Ia and used to assess the initial safety of multiple doses, then the dose-expansion phase is considered phase Ib and can have a variety of aims including to further refine the safety of one or more doses, to assess preliminary efficacy, to explore the treatment in various disease-specific subtypes, or to further characterize the pharmacokinetics and/or pharmacodynamics. The use of dose-expansion cohorts increased from 12% in 2006 to 38% in 2011 (Manji et al. 2013) and trials with dose-expansion cohorts led to higher response rates and more frequent success in phase II trials (Bugano et al. 2017).

But despite these successes, recent dose-expansion cohorts have not always been planned in advance, leading to uncertain statistical properties, and have at times included sample sizes that far exceed those typically seen in early phase trials. For example, the KEYNOTE-001 trial of pembrolizumab, initially designed as a 3+3 dose-escalation trial, included multiple protocol amendments and ultimately enrolled a total of 655 patients across five melanoma expansion cohorts and 550 patients across four non-small-cell lung cancer expansion cohorts (Khoja et al. 2015). In a basket trial of atezolizumab, an anti-PD-L1 treatment in patients with a variety of cancers and both with and without PD-L1 expression, an expansion cohort in metastatic urothelial carcinoma ultimately evaluated 95 patients, despite the fact that no expansion cohort in this disease subtype was originally planned in the trial protocol. The expansion cohort in metastatic urothelial carcinoma was added through a protocol amendment, and the sample size was later increased from what was initially planned (Powles et al. 2014; Petrylak et al. 2018). These enlarged sample sizes raise ethical concerns for patients who enroll in clinical trials, and emphasize the need for rigorous statistical designs to ensure that trials can stop early for futility while maintaining traditional control of type I error and power.

Bayesian predictive probability has been proposed as an approach for sequential monitoring in early phase oncology trials (Dmitrienko and Wang 2006; Lee and Liu 2008; Saville et al. 2014; Hobbs et al. 2018). However, in order to be useful to investigators designing such trials, software must be made available to calibrate the design for the desired statistical properties. To our knowledge no such software currently exists in the R programming language. This paper introduces the ppseq package for the R software language, which provides functions to design early phase clinical trials of binary endpoints using sequential predictive probability monitoring for futility. Static plots produced using the ggplot2 package (Wickham 2016) and interactive plots produced using the plotly package (Sievert 2020) compare design options based on frequentist type I error and power. Moreover, the ppseq package implements two criteria for selecting an ideal predictive probability monitoring design from among the options formed by combinations of posterior and predictive decision thresholds. While the ppseq package was developed with early phase oncology clinical trials in mind, the methodology is general and can be applied to any application of sequential futility monitoring in clinical trial design.

2 Predictive probability monitoring

Consider the setting of an expansion cohort with a binary outcome, such as tumor response as measured by the RECIST (Response Evaluation Criteria in Solid Tumors) criteria. Here we will focus on the one-sample setting, in which all patients in the trial are enrolled onto a single arm and given the experimental treatment of interest. Functionality is also available for the two-sample setting, in which patients are enrolled onto two different treatment arms, or a treatment and a control arm, for comparative purposes. Each patient, denoted by \(i\), enrolled in the trial either has a response such that \(x_i = 1\) or does not have a response such that \(x_i = 0\). Then \(X = \sum_{i=1}^n x_i\) is the number of responses out of \(n\) currently observed patients up to a maximum of \(N\) total patients. Let \(p\) represent the true probability of response. Fix \(p_0\) as the threshold for unacceptable response rate and \(p_1\) as the threshold for acceptable response rate. Most dose-expansion studies with an efficacy aim will wish to test the null hypothesis \(H_0: p \leq p_0\) versus the alternative hypothesis \(H_1: p \geq p_1\).

The Bayesian paradigm of statistics is founded on Bayes’ theorem, which specifies how prior distributions encoding beliefs about parameters, such as the true response rate \(p\), are combined with the observed data, such as the total number of responses \(X\), to yield a posterior distribution. Here the prior distribution of the response rate \(\pi(p)\) is a beta distribution \(\mbox{Beta}(a_0, b_0)\) and the data \(X\) follow a binomial distribution \(\mbox{Bin}(n, p)\). Combining the likelihood function for the observed data \(L_x(p) \propto p^x (1-p)^{n-x}\) with the prior, we obtain the posterior distribution of the response rate, which is again a beta distribution, \(p|x \sim \mbox{Beta}(a_0 + x, b_0 + n - x)\). A posterior probability threshold \(\theta\) is pre-specified during the trial design stage. At the end of the trial, if the posterior probability exceeds the pre-specified threshold, i.e. if \(\Pr(p>p_0 | X) > \theta\), the trial is declared a success.
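As a small illustration (our own numbers, not from the package), this posterior probability has a closed form and can be computed directly with pbeta(). Suppose a \(\mbox{Beta}(0.5, 0.5)\) prior, 8 responses observed among the first 20 patients, \(p_0 = 0.1\), and \(\theta = 0.95\):

a0 <- 0.5; b0 <- 0.5   # Beta(0.5, 0.5) prior hyperparameters
x <- 8; n <- 20        # hypothetical responses and patients observed
p0 <- 0.1              # unacceptable response rate

# Posterior Pr(p > p0 | x) from the conjugate Beta(a0 + x, b0 + n - x)
post_prob <- 1 - pbeta(p0, a0 + x, b0 + n - x)
post_prob > 0.95       # success if the posterior probability exceeds theta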

The posterior predictive distribution of the number of future responses \(X^*\) in the remaining \(n^*=N-n\) future patients follows a beta-binomial distribution \(\mbox{Beta-binomial}(n^*, a_0 + x, b_0 + n - x)\). The posterior predictive probability (PPP) is then calculated as \(PPP = \sum_{x^*=0}^{n^*} \Pr(X^*=x^*|x) \times I(\Pr(p>p_0 | X, X^*=x^*) > \theta)\). The posterior predictive probability represents the probability that, at any given interim monitoring point, the treatment will be declared efficacious at the end of the trial when full enrollment is reached, conditional on the currently observed data and the specified priors. We would stop the trial early for futility if the posterior predictive probability dropped below a pre-specified threshold \(\theta^*\), i.e. \(PPP<\theta^*\). Predictive probability thresholds closer to 0 lead to less frequent stopping for futility, whereas thresholds near 1 lead to frequent stopping unless success is almost certain. Predictive probability provides an intuitive interim monitoring strategy for clinical trials that tells the investigator what the chances are of declaring the treatment efficacious at the end of the trial if we were to continue enrolling to the maximum planned sample size, based on the data observed in the trial to date.
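The sum above can be evaluated exactly using the beta-binomial predictive distribution. The helper below is a minimal sketch of that calculation (the function name and interface are ours); as described in the next section, ppseq itself estimates these quantities by posterior simulation rather than by this closed-form sum.

# Exact posterior predictive probability of end-of-trial success;
# a sketch of the formula above, not the routine used by ppseq
ppp_exact <- function(x, n, N, p0, theta, prior = c(0.5, 0.5)) {
  a <- prior[1] + x
  b <- prior[2] + n - x
  nstar <- N - n                  # number of future patients
  xstar <- 0:nstar                # possible numbers of future responses
  # beta-binomial pmf Pr(X* = x* | x), written with the beta function
  w <- choose(nstar, xstar) * beta(a + xstar, b + nstar - xstar) / beta(a, b)
  # end-of-trial posterior Pr(p > p0 | x, x*) for each possible x*
  post <- 1 - pbeta(p0, prior[1] + x + xstar, prior[2] + N - x - xstar)
  sum(w * (post > theta))         # weighted proportion exceeding theta
}

# e.g., 1 response among the first 10 of 15 planned patients
ppp_exact(x = 1, n = 10, N = 15, p0 = 0.1, theta = 0.9)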

3 Package overview

The ppseq package facilitates the design of clinical trials utilizing sequential predictive probability monitoring for futility. The goal is to establish a set of decision rules at the trial planning phase that would be used for interim monitoring during the course of the trial. The main computational challenge in designing such a trial is joint calibration of the posterior probability and posterior predictive probability thresholds to be used in the trial in order to achieve the desired levels of frequentist type I error and power. The main function for achieving this aim is the calibrate_thresholds() function, which will evaluate a grid of posterior thresholds \(\theta\) and predictive thresholds \(\theta^*\) provided by the user as vector inputs specified with the arguments pp_threshold and ppp_threshold, respectively. Other required arguments include the unacceptable response rate \(p_0\) specified by p_null, the acceptable response rate \(p_1\) specified by p_alt, a vector of sample sizes at which interim analyses are to be performed n, and the maximum total sample size N. The direction of the alternative hypothesis is specified with the argument direction and defaults to "greater", which corresponds to the alternative hypothesis \(H_1: p \geq p_1\). The hyperparameters of the prior beta distribution are specified with the argument prior and default to c(0.5, 0.5), which denotes a \(\mbox{Beta}(0.5, 0.5)\) distribution. The number of posterior samples is specified with the argument S, which defaults to 5000 samples, and the number of simulated trial datasets is specified with the argument nsim, which defaults to 1000. The additional argument delta, which defaults to NULL for the one-sample setting, can specify a clinically meaningful difference between groups \(\delta = p_1 - p_0\) in the case of a two-sample trial design.

The calibrate_thresholds() function conducts the following algorithm, given the default arguments:

  1. Generate nsim datasets, denoted by \(j\), containing the cumulative number of responses \(x\) at each interim sample size \(n\) from a binomial distribution under the unacceptable (i.e. null) response rate \(p_0\), specified by p_null, and under the acceptable (i.e. alternative) response rate \(p_1\), specified by p_alt, for each interim look, denoted by \(l\).

  2. For dataset \(j\) and posterior threshold \(\theta_k\), draw S samples, denoted by \(s\), from the posterior distribution \(p|x_{ls} \sim \mbox{Beta}(a_0 + x_l, b_0 + n_l - x_l)\) where \(x_l\) is the number of responses at interim look \(l\) and \(n_l\) is the number of enrolled patients at interim look \(l\). Use each \(p|x_{ls}\) as the response probability in a binomial distribution to generate the number of future responses \(X^*_{ls}\) in the remaining \(n^*_l=N-n_l\) future patients at interim look \(l\).

    1. Then for each \(X^*_{ls}\), draw \(S\) samples, denoted by \(s'\), from the end-of-trial posterior \(PP^*_{lss'} \sim \mbox{Beta}(a_0 + x_l + X^*_{ls}, b_0 + N - (x_l + X^*_{ls}))\) and calculate the posterior probability \(PP^*_{ls} = \Pr(p>p_0 | x_l, X^*_{ls}) = \frac{1}{S}\sum_{s'=1}^{S} I(PP^*_{lss'} > p_0)\).
    2. Estimate the predictive probability at posterior threshold \(k\) as \(PPP_{lk} = \frac{1}{S}\sum_{s=1}^{S} I(PP^*_{ls}>\theta_k)\).
    3. Stop the trial for dataset \(j\) at interim look \(l\) and predictive threshold \(m\) if \(PPP_{lk} < \theta^*_m\). Otherwise continue enrolling.
  3. Repeat (2) over all combinations of datasets \(j\), posterior thresholds \(k\), and predictive thresholds \(m\).

  4. If dataset \(j\) was stopped early for futility then we do not reject the null hypothesis. If dataset \(j\) reached full enrollment, we reject the null hypothesis \(H_0: p \leq p_0\) at posterior threshold \(k\) if the posterior probability at the final analysis exceeds the posterior threshold, i.e. if \(\Pr(p > p_0 | X) > \theta_k\).
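The essence of this algorithm can be sketched compactly for a single design. The following simulation reuses the illustrative ppp_exact() helper from the previous section (and therefore uses exact rather than sampled posterior probabilities, with a \(\mbox{Beta}(0.5, 0.5)\) prior hard-coded in the final test); it estimates the probability of rejecting \(H_0\) for one pair of thresholds. Run under the null response rate it estimates the type I error, and under the alternative it estimates the power. ppseq evaluates a full grid of threshold pairs in this manner.

# Minimal sketch of the calibration simulation for one (theta, theta*) pair
simulate_rejections <- function(p_true, n_looks, N, p0, theta, theta_star,
                                nsim = 1000) {
  rejected <- logical(nsim)
  for (j in seq_len(nsim)) {
    x_new <- rbinom(length(n_looks), diff(c(0, n_looks)), p_true)
    x_cum <- cumsum(x_new)        # cumulative responses at each look
    stopped <- FALSE
    for (l in seq_along(n_looks)) {
      if (ppp_exact(x_cum[l], n_looks[l], N, p0, theta) < theta_star) {
        stopped <- TRUE; break    # stop early for futility
      }
    }
    # reject H0 only if full enrollment was reached and the final
    # posterior probability exceeds the posterior threshold
    x_final <- x_cum[length(n_looks)]
    final_post <- 1 - pbeta(p0, 0.5 + x_final, 0.5 + N - x_final)
    rejected[j] <- !stopped && final_post > theta
  }
  mean(rejected)
}

# Estimated type I error and power for one candidate design
simulate_rejections(0.1, c(5, 10, 15), 15, 0.1, theta = 0.9, theta_star = 0.2)
simulate_rejections(0.4, c(5, 10, 15), 15, 0.1, theta = 0.9, theta_star = 0.2)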

calibrate_thresholds() returns a list, the first element of which is a tibble containing the posterior threshold \(\theta\), the predictive threshold \(\theta^*\), the mean sample size under the null and the alternative, the proportion of positive trials under the null and alternative, and the proportion of trials stopped early under the null and alternative. The proportion of trials simulated under the null hypothesis for which the null hypothesis was rejected is an estimate of the type I error, and the proportion of trials simulated under the alternative hypothesis for which the null hypothesis was rejected is an estimate of the power. The print() option will print the results summary for each combination of thresholds, filtered by an acceptable range of frequentist type I error and minimum power, if desired. Note that the results will be sensitive to the choice of nsim and S. We have set what we believe are reasonable defaults and would caution users against reducing these values without careful consideration.

Design selection

After obtaining results for all combinations of evaluated posterior and predictive thresholds, the next step is to select the ideal design from among the various options. The ppseq package introduces two criteria to assist users in making a selection. The first, called the “optimal accuracy” design, identifies the design that minimizes the Euclidean distance to a type I error of 0 and a power of 1. To accomplish this, the accuracy Euclidean distance (AED) for the design with posterior threshold \(k\) and predictive threshold \(m\) is calculated as \(AED_{km} = w_\alpha * (\alpha_{km} - 0)^2 + w_{(1-\beta)} * ((1-\beta)_{km}-1)^2\), where \(w_\alpha\) and \(w_{(1-\beta)}\) are optional weights on the type I error and power, respectively, and \(\alpha_{km}\) denotes the estimated type I error and \((1-\beta)_{km}\) denotes the estimated power. The design with the smallest value of \(AED_{km}\) is selected as optimal. The second criterion, called the “optimal efficiency” design, identifies the design that minimizes the Euclidean distance to the minimal average sample size under the null and the maximal average sample size under the alternative. To accomplish this, the efficiency Euclidean distance (EED) for the design with posterior threshold \(k\) and predictive threshold \(m\) is calculated as \(EED_{km} = w_{\bar{N}_{H_0}} * (\bar{N}_{H_0km} - min(\bar{N}_{H_0}))^2 + w_{\bar{N}_{H_1}} * (\bar{N}_{H_1km}-max(\bar{N}_{H_1}))^2\), where \(w_{\bar{N}_{H_0}}\) and \(w_{\bar{N}_{H_1}}\) are optional weights on the average sample size under the null and alternative, respectively, \(\bar{N}_{H_0km}\) and \(\bar{N}_{H_1km}\) denote the average sample sizes under the null and alternative, respectively, and \(min(\bar{N}_{H_0})\) and \(max(\bar{N}_{H_1})\) denote the minimum average sample size under the null and the maximum average sample size under the alternative, respectively, across all combinations of \(k\) and \(m\). The design with the smallest value of \(EED_{km}\) is selected as optimal. The optimize_design() function returns a list that contains the details of each of the two optimal designs.
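Because both criteria are simple functions of the summary tibble returned by calibrate_thresholds(), a selection can also be reproduced by hand. The sketch below assumes equal (unit) weights and a calibrate_thresholds() result named cal (a hypothetical object); the column names are those of the package’s summary tibble, with prop_pos_null and prop_pos_alt serving as the type I error and power estimates.

library(dplyr)

res <- cal[[1]]   # first list element: the summary tibble

res <- mutate(
  res,
  aed = (prop_pos_null - 0)^2 + (prop_pos_alt - 1)^2,   # accuracy distance
  eed = (mean_n1_null - min(mean_n1_null))^2 +          # efficiency distance
    (mean_n1_alt - max(mean_n1_alt))^2
)

slice_min(res, aed)   # optimal accuracy design
slice_min(res, eed)   # optimal efficiency design

In practice, optimize_design() applies this logic after first filtering to designs within an acceptable type I error range and above a minimum power, as illustrated in the demonstration below.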

Decision rules

To ease the implementation of clinical trials designed with sequential predictive probability monitoring, once a design has been selected, a table of decision rules can be produced using the calc_decision_rules() function. The function takes the sample sizes n at which interim analyses are to be performed as well as the maximum total sample size N, the null value to compare to in the one-sample case p0 (set to NULL in the two-sample case), the posterior threshold of the selected design theta, and the predictive threshold of the selected design ppp. The arguments direction, prior, S, and delta are as described in the Package overview section, with the same defaults. The function returns a tibble of decision rules. The trial would stop at a given look if the number of observed responses is less than or equal to \(r\), and would continue enrolling if the number of observed responses is greater than \(r\). At the end of the trial, when the maximum planned sample size is reached, the treatment would be considered promising if the number of observed responses is greater than \(r\). In the one-sample case, the resulting tibble includes a column for the sample size n at each interim look, the stopping bound \(r\) at each look, and a column for the associated posterior predictive probability ppp. In the two-sample case, the tibble includes columns for n0 and n1, the sample size at each interim analysis in the control and experimental arms, respectively. There are also columns for r0 and r1, the number of responses in the control arm and experimental arm, respectively, leading to the decision to stop or continue. Finally, there is a column for the posterior predictive probability associated with that decision ppp.
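As a small usage sketch with hypothetical numbers, an interim decision in the one-sample case reduces to a lookup in this tibble:

# Hypothetical interim look: 2 responses among the first 10 patients.
# `rules` is assumed to be a one-sample calc_decision_rules() result
# with columns n, r, and ppp as described above.
obs_n <- 10; obs_x <- 2
r_bound <- rules$r[rules$n == obs_n]
if (!is.na(r_bound) && obs_x <= r_bound) "stop for futility" else "continue"

The is.na() check reflects the convention, noted in the demonstration below, that an NA value of \(r\) means the trial would never stop at that look.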

Visualizations

Finally, to assist users in comparing the results of the various design options, a plot() option is available for the results of calibrate_thresholds that allows creation of static plots using the ggplot2 package (Wickham 2016) or interactive plots using the plotly package (Sievert 2020). Two plots are produced: one plotting type I error against power and indicating the optimal accuracy design, and one plotting the average sample size under the null against the average sample size under the alternative and indicating the optimal efficiency design. The motivation for including an interactive graphics option was the utility of the additional information available when hovering over each point. Instead of simply eyeballing where points fall along the axes, users can see the specific type I error, power, average sample size under the null, average sample size under the alternative, and the posterior and predictive thresholds associated with the design, as well as the distance to the upper left point on the plot. A plot() option is also available for the results of calc_decision_rules. In the one-sample case it produces a single plot showing the sample size at each interim analysis on the x-axis and the possible number of responses at each interim analysis on the y-axis. In the two-sample case a grid of plots is produced, with one plot for each interim analysis. The x-axis shows the number of possible responses in the control group and the y-axis shows the number of possible responses in the experimental group. In both cases, the boxes are colored green for a “proceed” decision and red for a “stop” decision for each combination, and the hover box produced by plotly provides the details.

4 Package demonstration

In this section we present a basic example to demonstrate the functionality included in the ppseq package. The following section will present a more realistic case study.

First we install and load the ppseq package.

install.packages("ppseq")

Consider the case where we are interested in designing a trial to investigate how a new treatment impacts tumor response measured as a binary outcome of response versus no response. We know the current standard of care treatment results in a tumor response rate of 10%, and we wish to improve this by 30 percentage points, to 40%. We therefore wish to test \(H_0: p \leq 0.1\) versus \(H_1: p \geq 0.4\), so we set p_null = 0.1 and p_alt = 0.4. Because this is a rare disease, our maximum sample size is 15, so we set N = 15, and we will do interim analyses after every 5 patients, so we set n = seq(5, 15, 5). We wish to examine designs based on combinations of posterior thresholds \(\theta \in \{0.85, 0.90\}\) and predictive thresholds \(\theta^* \in \{0.1, 0.2\}\), so we set pp_threshold = c(0.85, 0.9) and ppp_threshold = c(0.1, 0.2). Finally, for computational speed in this basic example, we set S = 50 and nsim = 50, but in practice we would want to use much larger values, in the thousands.

set.seed(123)

cal_thresh <-
  calibrate_thresholds(
    p_null = 0.1, 
    p_alt = 0.4,
    n = seq(5, 15, 5), 
    N = 15,
    pp_threshold = c(0.85, 0.9),
    ppp_threshold = c(0.1, 0.2),
    S = 50, 
    nsim = 50
    )

Since there are only four design options in this toy example, we print the entire results table using a call to print(). Each row represents a different combination of the considered posterior and predictive thresholds, denoted by pp_threshold and ppp_threshold, respectively.

print(cal_thresh)
# A tibble: 4 × 8
  pp_threshold ppp_threshold mean_n1_null prop_pos_null prop_stopped_null
         <dbl>         <dbl>        <dbl>         <dbl>             <dbl>
1         0.85           0.1         12.1          0.1               0.42
2         0.85           0.2          9.5          0.06              0.64
3         0.9            0.1         10.8          0.04              0.5
4         0.9            0.2          8.9          0.04              0.74
  mean_n1_alt prop_pos_alt prop_stopped_alt
        <dbl>        <dbl>            <dbl>
1        14.8         0.92             0.02
2        14.8         0.92             0.02
3        14.8         0.9              0.02
4        14.7         0.88             0.04

We use the optimize_design() function to identify the optimal accuracy and optimal efficiency designs. We constrain consideration to designs with type I error between 0.025 and 0.1, specified by type1_range = c(0.025, 0.1), and power of at least 0.75, specified by minimum_power = 0.75.

optimize_design(cal_thresh, type1_range = c(0.025, 0.1), minimum_power = 0.75)
$`Optimal accuracy design:`
# A tibble: 1 × 6
  pp_threshold ppp_threshold `Type I error` Power
         <dbl>         <dbl>          <dbl> <dbl>
1         0.85           0.2           0.06  0.92
  `Average N under the null` `Average N under the alternative`
                       <dbl>                             <dbl>
1                        9.5                              14.8

$`Optimal efficiency design:`
# A tibble: 1 × 6
  pp_threshold ppp_threshold `Type I error` Power
         <dbl>         <dbl>          <dbl> <dbl>
1          0.9           0.2           0.04  0.88
  `Average N under the null` `Average N under the alternative`
                       <dbl>                             <dbl>
1                        8.9                              14.7

To ease interim analysis during the course of the trial, we obtain the decision table using the calc_decision_rules() function, with the thresholds of our selected optimal efficiency design passed as theta = 0.9 and ppp = 0.2. For computational speed in this basic example, we set S = 50, but in practice we would want to draw many more posterior samples, and values of 5000 or more should be considered. If any NA value is returned in this table, it indicates that the trial would never stop at that interim look, regardless of the number of responses.

calc_decision_rules(
  n = seq(5, 15, 5), 
  N = 15,
  theta = 0.9,
  ppp = 0.2,
  p0 = 0.1,
  S = 50
)
# A tibble: 3 × 3
      n     r   ppp
  <dbl> <int> <dbl>
1     5     0  0.06
2    10     1  0.1 
3    15     2  0   

This table tells us that at the first interim analysis, after \(n=5\) patients have been observed for response, we would stop the trial if \(r=0\) responses (i.e. none) had been observed, and would continue enrolling otherwise. At the second interim analysis, after \(n=10\) patients have been observed for response, we would stop the trial if \(r=1\) or fewer responses had been observed, and would continue enrolling if more than \(r=1\) responses had been observed. At the end of the trial, after all \(n=N=15\) patients have been observed for response, we would declare the new treatment promising if more than \(r=2\) responses had been observed.

5 Case study

The case study focuses on a re-design of a phase II study of atezolizumab in patients with metastatic urothelial carcinoma (mUC) using sequential predictive probability monitoring. Atezolizumab is a programmed death-ligand 1 (PD-L1) blocking monoclonal antibody that was given accelerated approval by the U.S. Food and Drug Administration in May 2016 for the treatment of patients with locally advanced or metastatic urothelial carcinoma who had disease progression following platinum-containing chemotherapy. The approval was based on the results of a single-arm phase II study in 310 patients (Rosenberg et al. 2016). The phase II study used a hierarchical fixed-sequence testing procedure to test increasingly broad subgroups of patients based on PD-L1 status, and found overall response rates of 26% (95% CI: 18-36), 18% (95% CI: 13-24), and 15% (95% CI: 11-19) in patients with \(\geq5\%\) PD-L1-positive immune cells (IC2/3 subgroup), in patients with \(\geq1\%\) PD-L1-positive immune cells (IC1/2/3 subgroup), and in all patients, respectively (Rosenberg et al. 2016). All three rates exceeded the historical control rate of 10%. Then, in March 2021, the approval in this indication was voluntarily withdrawn by the sponsor following negative results from a randomized phase III study (Powles et al. 2018). In the phase III study, 931 patients were randomly assigned to receive atezolizumab or chemotherapy in a 1:1 ratio, and the same hierarchical fixed-sequence testing procedure as in the phase II study was used. The phase III study found that overall survival did not differ significantly between the atezolizumab and chemotherapy groups of the IC2/3 subgroup (median survival 11.1 months [95% CI: 8.6-15.5] versus 10.6 months [95% CI: 8.4-12.2]), so no further testing was conducted for the primary endpoint (Powles et al. 2018). Further analyses revealed that while the response rates to atezolizumab were comparable to those seen in the phase II study, the response rates to chemotherapy were much higher than the historical control rate of 10%. The overall response rates to chemotherapy were 21.6% (95% CI: 14.5-30.2), 14.7% (95% CI: 10.9-19.2), and 13.4% (95% CI: 10.5-16.9) for the IC2/3 subgroup, IC1/2/3 subgroup, and all patients, respectively. The overall response rates to atezolizumab were 23% (95% CI: 15.6-31.9), 14.1% (95% CI: 10.4-18.5), and 13.4% (95% CI: 10.5-16.9) for the IC2/3 subgroup, IC1/2/3 subgroup, and all patients, respectively. These results indicate that PD-L1 status is a prognostic biomarker for both the standard of care chemotherapies that comprised the control arm and atezolizumab in this patient population.

We wish to re-design the phase II trial of atezolizumab using a two-arm randomized design with sequential predictive probability monitoring. We will use the calibrate_thresholds() function to obtain the operating characteristics of designs based on each combination of posterior and predictive thresholds. We focus here on the main biomarker subgroup of interest, the IC2/3 subgroup. We design the study with a null response rate of 0.1 in both arms, and an alternative response rate of 0.25 in the atezolizumab arm. So we set p_null = c(0.1, 0.1) and p_alt = c(0.1, 0.25). We plan the study with 100 participants, assuming that the total sample size available is similar to the 310 used in the actual single-arm phase II trial, and that a third of that patient population falls into our desired biomarker subgroup. So we set N = c(50, 50) to indicate equal allocation of the 100 patients to each of the two arms. We will check for futility after every 10 patients are enrolled on each arm, so we set n = cbind(seq(10, 50, 10), seq(10, 50, 10)), a matrix giving the interim analysis sample sizes in each of the two arms. To design the study, we need to calibrate the design over a range of posterior probability thresholds and predictive probability thresholds. In this example we will consider posterior thresholds of 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, and 0.99, and predictive thresholds of 0.05, 0.1, 0.15, and 0.2. So we set pp_threshold = seq(0.9, 0.99, 0.01) and ppp_threshold = seq(0.05, 0.2, 0.05). We selected these posterior thresholds because we want to have a posterior probability of at least 0.9 that our response rate exceeds the null. And we selected these predictive thresholds because we would not want to stop the trial early for futility if there was more than a 20% chance of success when full enrollment was reached. We use the defaults for all remaining arguments, including the defaults of S = 5000 and nsim = 1000, which we believe are sufficient to obtain stable results. Because of the inherent computational intensity of these calculations, this function relies on the future (Bengtsson 2021) and furrr (Vaughan and Dancho 2021) packages to parallelize computations. The user is responsible for setting up a call to future::plan() that is appropriate to their operating environment and simulation setting. Because the code takes some time to run, the results of the example code below are available as a dataset called two_sample_cal_tbl included in the ppseq package.

library(future)

set.seed(123)

future::plan(future::multicore, workers = 40)

two_sample_cal_tbl <- 
  calibrate_thresholds(
    p_null = c(0.1, 0.1), 
    p_alt = c(0.1, 0.25),
    n = cbind(seq(10, 50, 10), seq(10, 50, 10)),
    N = c(50, 50), 
    pp_threshold = seq(0.9, 0.99, 0.01),
    ppp_threshold = seq(0.05, 0.2, 0.05),
    direction = "greater", 
    delta = 0, 
    prior = c(0.5, 0.5), 
    S = 5000, 
    nsim = 1000
  )

We can compare the design options that meet our desired range of type I error and minimum power by passing the results of our call to the calibrate_thresholds() function to the plot() function. Setting plotly = TRUE produces interactive visualizations, while plotly = FALSE produces static visualizations.

plot(
  two_sample_cal_tbl, 
  type1_range = c(0.05, 0.1), 
  minimum_power = 0.7,
  plotly = TRUE
)

Figure 1: Plot of efficiency design options made with the {plotly} package. The average sample size under the null is on the x-axis and the average sample size under the alternative is on the y-axis. The color represents the Euclidean distance to the top left point and the optimal design is indicated by a diamond. The optimal efficiency design is the one with posterior threshold 0.92 and predictive threshold 0.05. A similar plot is available for the accuracy design options.

We see that the optimal efficiency design is the one with posterior threshold 0.92 and predictive threshold 0.05. It has a type I error of 0.07, power of 0.7, average sample size under the null of 29 per arm, and average sample size under the alternative of 46 per arm. This design would allow us to stop early if the treatment were inefficacious, thus preserving valuable financial resources for use in studying more promising treatments and preventing our human subjects from continuing an ineffective treatment.

optimize_design(
  two_sample_cal_tbl, 
  type1_range = c(0.05, 0.1), 
  minimum_power = 0.7
)
$`Optimal accuracy design:`
# A tibble: 1 × 6
  pp_threshold ppp_threshold `Type I error` Power
         <dbl>         <dbl>          <dbl> <dbl>
1         0.86           0.1            0.1 0.751
  `Average N under the null` `Average N under the alternative`
                       <dbl>                             <dbl>
1                       28.6                              45.1

$`Optimal efficiency design:`
# A tibble: 1 × 6
  pp_threshold ppp_threshold `Type I error` Power
         <dbl>         <dbl>          <dbl> <dbl>
1         0.92          0.05           0.07 0.701
  `Average N under the null` `Average N under the alternative`
                       <dbl>                             <dbl>
1                       28.6                              45.5

Finally, we generate the decision table associated with the selected design for use in making decisions at each interim analysis during the conduct of the trial. Because of the computational time involved, the results of the example code below are available as a dataset called two_sample_decision_tbl included in the ppseq package.

set.seed(123)

two_sample_decision_tbl <- 
  calc_decision_rules(
    n = cbind(seq(10, 50, 10), seq(10, 50, 10)),
    N = c(50, 50),
    theta = 0.92, 
    ppp = 0.05, 
    p0 = NULL, 
    direction = "greater", 
    delta = 0, 
    prior = c(0.5, 0.5), 
    S = 5000
    )

The resulting table contains all possible combinations of responses in the two arms, so is quite long. We can visualize the resulting decision rules by passing the results of the call to the calc_decision_rules() function to the plot() function, which returns a faceted plot according to the interim look. The color denotes whether the trial should proceed (green) or stop (red) for a given combination of number of responses in the control arm (x-axis) and number of responses on the experimental arm (y-axis). For example, we see that at the second interim look when there are 20 patients enrolled on each arm, if there were 10 responses in the control arm, we would stop the trial if there were 8 or fewer responses on the experimental arm; otherwise we would proceed with enrollment.

library(ggplot2)
library(plotly)

ggplotly(
  plot(two_sample_decision_tbl, plotly = FALSE) + scale_fill_viridis_d()
)

Figure 2: Plot of decision rules made with {plotly}. Each facet is for the total sample size at a pre-specified interim analysis. The number of responses in the control arm is on the x-axis and the number of responses in the experimental arm is on the y-axis. The fill color indicates whether the trial should stop (purple) or proceed (yellow) for a given combination of number of responses in the two arms at each interim analysis. At a given interim analysis, one can look up the number of observed responses to make a decision about whether the trial should stop or proceed enrollment.

6 Summary

With the focus of early stage clinical trial research in oncology shifting away from the study of cytotoxic treatments and toward immunotherapies and other non-cytotoxic treatments, new approaches to clinical trial design are needed that move beyond the traditional search for the maximum tolerated dose (Hobbs et al. 2019). Bayesian sequential predictive probability monitoring provides a natural and flexible way to expand the number of patients studied in phase I or to design phase II trials that allow for efficient early stopping for futility while maintaining control of type I error and power. The ppseq package implements functionality to evaluate a range of posterior and predictive thresholds for a given study design and identify the optimal design based on accuracy (i.e. type I error and power) or efficiency (i.e. average sample sizes under the null and alternative). Interactive visualization options are provided to ease comparison of the resulting design options. Once an ideal design is selected, a table of decision rules can be obtained to make trial conduct simple and straightforward.

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-017.zip

CRAN packages used

ppseq, ggplot2, plotly, future, furrr

CRAN Task Views implied by cited packages

HighPerformanceComputing, Phylogenetics, Spatial, TeachingStatistics, WebTechnologies

References

H. Bengtsson. A unifying framework for parallel and distributed processing in R using futures. The R Journal, 13(2): 273–291, 2021. URL https://doi.org/10.32614/RJ-2021-048.
D. D. G. Bugano, K. Hess, D. L. F. Jardim, A. Zer, F. Meric-Bernstam, L. L. Siu, A. R. A. Razak and D. S. Hong. Use of expansion cohorts in phase I trials and probability of success in phase II for 381 anticancer drugs. Clin Cancer Res, 23(15): 4020–4026, 2017. DOI 10.1158/1078-0432.Ccr-16-2354.
A. Dmitrienko and M. D. Wang. Bayesian predictive approach to interim monitoring in clinical trials. Stat Med, 25(13): 2178–95, 2006. DOI 10.1002/sim.2204.
B. P. Hobbs, P. C. Barata, Y. Kanjanapan, C. J. Paller, J. Perlmutter, G. R. Pond, T. M. Prowell, E. H. Rubin, L. K. Seymour, N. A. Wages, et al. Seamless Designs: Current Practice and Considerations for Early-Phase Drug Development in Oncology. J Natl Cancer Inst, 111(2): 118–128, 2019.
B. P. Hobbs, N. Chen and J. J. Lee. Controlled multi-arm platform design using predictive probability. Stat Methods Med Res, 27(1): 65–78, 2018.
L. Khoja, M. O. Butler, S. P. Kang, S. Ebbinghaus and A. M. Joshua. Pembrolizumab. J Immunother Cancer, 3: 36, 2015.
J. J. Lee and D. D. Liu. A predictive probability design for phase II cancer clinical trials. Clin Trials, 5(2): 93–106, 2008. DOI 10.1177/1740774508089279.
A. Manji, I. Brana, E. Amir, G. Tomlinson, I. F. Tannock, P. L. Bedard, A. Oza, L. L. Siu and A. R. Razak. Evolution of clinical trial design in early drug development: Systematic review of expansion cohort use in single-agent phase I cancer trials. J Clin Oncol, 31(33): 4260–7, 2013. DOI 10.1200/jco.2012.47.4957.
R. C. Pestana, S. Sen, B. P. Hobbs and D. S. Hong. Histology-agnostic drug development - considering issues beyond the tissue. Nat Rev Clin Oncol, 17(9): 555–568, 2020.
D. P. Petrylak, T. Powles, J. Bellmunt, F. Braiteh, Y. Loriot, R. Morales-Barrera, H. A. Burris, J. W. Kim, B. Ding, C. Kaiser, et al. Atezolizumab (MPDL3280A) monotherapy for patients with metastatic urothelial cancer: Long-term outcomes from a phase 1 study. JAMA Oncol, 4(4): 537–544, 2018. DOI 10.1001/jamaoncol.2017.5440.
T. Powles, I. Durán, M. S. van der Heijden, Y. Loriot, N. J. Vogelzang, U. De Giorgi, S. Oudard, M. M. Retz, D. Castellano, A. Bamias, et al. Atezolizumab versus chemotherapy in patients with platinum-treated locally advanced or metastatic urothelial carcinoma (IMvigor211): A multicentre, open-label, phase 3 randomised controlled trial. Lancet, 391(10122): 748–757, 2018. DOI 10.1016/s0140-6736(17)33297-x.
T. Powles, J. P. Eder, G. D. Fine, F. S. Braiteh, Y. Loriot, C. Cruz, J. Bellmunt, H. A. Burris, D. P. Petrylak, S. L. Teng, et al. MPDL3280A (anti-PD-L1) treatment leads to clinical activity in metastatic bladder cancer. Nature, 515(7528): 558–62, 2014. DOI 10.1038/nature13904.
J. E. Rosenberg, J. Hoffman-Censits, T. Powles, M. S. van der Heijden, A. V. Balar, A. Necchi, N. Dawson, P. H. O’Donnell, A. Balmanoukian, Y. Loriot, et al. Atezolizumab in patients with locally advanced and metastatic urothelial carcinoma who have progressed following treatment with platinum-based chemotherapy: A single-arm, multicentre, phase 2 trial. Lancet, 387(10031): 1909–20, 2016. DOI 10.1016/s0140-6736(16)00561-4.
B. R. Saville, J. T. Connor, G. D. Ayers and J. Alvarez. The utility of Bayesian predictive probabilities for interim monitoring of clinical trials. Clin Trials, 11(4): 485–493, 2014. DOI 10.1177/1740774514531352.
C. Sievert. Interactive web-based data visualization with R, plotly, and shiny. Chapman; Hall/CRC, 2020. URL https://plotly-r.com.
D. Vaughan and M. Dancho. furrr: Apply mapping functions in parallel using futures. 2021. URL https://CRAN.R-project.org/package=furrr. R package version 0.2.2.
H. Wickham. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York, 2016. URL https://ggplot2.tidyverse.org.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Zabor, et al., "ppseq: An R Package for Sequential Predictive Probability Monitoring", The R Journal, 2023

BibTeX citation

@article{RJ-2023-017,
  author = {Zabor, Emily C. and Hobbs, Brian P. and Kane, Michael J.},
  title = {ppseq: An R Package for Sequential Predictive Probability Monitoring},
  journal = {The R Journal},
  year = {2023},
  note = {https://doi.org/10.32614/RJ-2023-017},
  doi = {10.32614/RJ-2023-017},
  volume = {14},
  issue = {4},
  issn = {2073-4859},
  pages = {280-290}
}