stratamatch: Prognostic Score Stratification using a Pilot Design

Optimal propensity score matching has emerged as one of the most ubiquitous approaches for causal inference studies on observational data. However, outstanding critiques of the statistical properties of propensity score matching have cast doubt on the statistical efficiency of this technique, and the poor scalability of optimal matching to large data sets makes this approach inconvenient, if not infeasible, for sample sizes that are increasingly commonplace in modern observational data. The stratamatch package provides implementation support and diagnostics for 'stratified matching designs,' an approach which addresses both of these issues with optimal propensity score matching for large-sample observational studies. First, stratifying the data enables more computationally efficient matching of large data sets. Second, stratamatch implements a 'pilot design' approach in order to stratify by a prognostic score, which may increase the precision of the effect estimate and increase power in sensitivity analyses of unmeasured confounding.


Introduction
To make causal inferences from observational data, researchers must address concerns that effect estimates may be biased by confounding factors: baseline characteristics of the individuals in the study that influence both their selection of treatment and their probable outcome. Matching methods seek to account for this self-selection by grouping treated and control individuals with similar baseline characteristics. One of the most common such methods, propensity score matching, pairs individuals who appear to have had similar probabilities of receiving the treatment according to their baseline characteristics (Rosenbaum and Rubin, 1983), with the goal of coercing the data set into a form that resembles a fully-randomized controlled trial (King and Nielsen, 2019; Rosenbaum et al., 2010; Hernán and Robins, 2016). However, propensity score matching can only address bias due to measured baseline covariates, necessitating sensitivity analyses to interrogate the potential for bias due to unmeasured confounding (Rosenbaum, 2005b; Rosenbaum et al., 2010).
In their provocative article "Why Propensity Scores Should Not Be Used for Matching," King and Nielsen argue that the fully randomized controlled trial, the design emulated by propensity score matching, is less statistically efficient than the block-randomized controlled experiment (King and Nielsen, 2019). In block-randomized designs, individuals are stratified by prognostically important covariates (e.g., for a clinical trial: sex, age group, smoking status) prior to randomization in order to reduce the heterogeneity between the treatment and control groups. In the experimental context, these efforts to reduce heterogeneity between compared groups help to increase the precision of the treatment effect estimate. In observational settings, reducing this type of heterogeneity not only improves precision but also increases the robustness of the study's conclusions to being explained away by the possibility of unobserved confounding (Rosenbaum, 2005a; Aikens et al., 2020). The stratified matching design, in which observations are stratified prior to matching within strata, attempts to emulate the block-randomized controlled trial in the observational context in order to secure these statistical benefits over pure propensity score matching. In addition, since the computation required for optimal matching can be quite time-consuming for studies of more than a few thousand observations, stratified matching designs can greatly improve the scalability of optimal matching. While the worst-case computational complexity of optimal matching is unfavorable, the process of matching a stratified data set with a constant stratum size scales much more effectively with sample size (for a summary of empirical run-times, see section 2.5.3).
While a variety of packages in R (R Core Team, 2019) exist for matching subjects in observational studies, limited support exists for researchers seeking to implement a stratified matching design. The popular MatchIt package (Ho et al., 2011) is a user-friendly option for common propensity score matching designs and related approaches; optmatch (Hansen and Klopfer, 2006) and DOS2 (Rosenbaum, 2019) are a powerful combination for implementing a variety of more complicated optimal matching schemes; and nearfar (Rigdon et al., 2018) implements a different form of matching for instrumental variable studies. The primary goal of stratamatch is to make stratified matching and prognostic score designs accessible to a wider variety of applied researchers, and to suggest a suite of diagnostic tools for the stratified observational study. In favorable settings, these designs can not only increase the precision and robustness of inference but also facilitate optimal matching at sample sizes for which this technique was previously computationally impractical. This paper discusses the methodological contributions of stratamatch, in particular the implementation of a novel pilot design approach suggested by Aikens et al. (2020) (section 2.2), and summarizes the package implementation (section 2.3) with illustrative examples (section 2.4). While stratamatch may substantially improve the scalability of optimal matching for some large data sets, the main objective of the package is not to implement a computationally complex task but to make sophisticated study design tools and concepts accessible to a wide variety of researchers.

Study design

A prognostic score stratification pilot design
Stratifying a data set based on baseline variation prior to matching reduces the heterogeneity between matched sets with respect to that baseline variation. But what baseline characteristics should be used? One option is to select prognostically important covariates by hand, based on expert knowledge. However, in practice, this "manual" stratification process often produces strata that vary wildly in size and composition. Some strata may be so small or so imbalanced in their composition of treated and control individuals that it is difficult to find high-quality matches or many observations are thrown away. Other strata may be so large that matching within them is still computationally infeasible.
The auto_stratify function in stratamatch divides subjects into strata using a prognostic score (see Hansen (2008)), which summarizes the baseline variation most important to the outcome. In addition to producing strata of more regular size and composition, balancing treatment and control groups on the prognostic score may confer several statistical benefits: increasing precision (Aikens et al., 2020; Leacy and Stuart, 2014), providing some protection against mis-specification of the propensity score (Leacy and Stuart, 2014; Antonelli et al., 2018), and decreasing the susceptibility of an observed effect to being explained away by unobserved confounding (Rosenbaum and Rubin, 1983; Aikens et al., 2020). However, fitting the prognostic score on the same data set that is used for the main analysis raises concerns of overfitting and may lead to biased effect estimates (Hansen, 2008; Abadie et al., 2018). For this reason, Aikens et al. (2020) suggest using a pilot design for estimating the prognostic score.
Central to the pilot design concept is maintaining separation between the design and analysis phases of a study (see Table 1, or, for more information, Goodman et al. (2017) and Rubin (2008)). Using an observational pilot design, the researchers partition their data set into an analysis set and a held-aside pilot set. Outcome information in the pilot set can be observed (e.g., to fit a prognostic score), and the information gained can be used to inform the study design. Subsequently, in order to preserve the separation of the study design from the study analysis, the individuals from the pilot set are omitted from the main analysis (i.e., they are not reused in the analysis set). The primary insight of the pilot design is that reserving all of the observations in a study for the analysis phase (i.e., in the analysis set) is not always better. Rather, clever use of data in the design phase (i.e., in the pilot set) may facilitate the design of stronger studies.
In the stratamatch approach, a random subsample of controls is extracted as a pilot set to fit a prognostic model, and that model is then used to estimate prognostic scores on the mix of control and treated individuals in the analysis set. The observations in the analysis set can then be stratified based on the quantiles of the estimated prognostic score and matched by propensity score or Mahalanobis distance within strata (see section 2.3).

When to use this approach
Aikens et al. (2020) describe the scenarios in which a prognostic score matching pilot design is most useful. Briefly, the stratamatch approach is best for large data sets (i.e., thousands to millions of observations), especially when control observations are plentiful. This technique may be particularly useful when modeling a prognostic score with the measured covariates is straightforward, and when the propensity score alone is likely to exclude certain aspects of variation that are highly associated with the outcome but unassociated with treatment assignment. While computational gains vary, stratification tends to noticeably accelerate matching for sample sizes of 5,000 or more (see section 2.5.3).
Conversely, this technique is not recommended for small data sets in which each control observation is precious, especially when prognostic scores are likely to be difficult to estimate from the measured covariates (see Aikens et al. (2020) for a lengthy discussion). Ideally, there should be ample control observations available to fit a usable prognostic model and still leave sufficient controls remaining to select high-quality matches for the treated individuals in the data set. While some stratamatch designs may be useful for the estimation of other causal estimands, the statistical properties of prognostic pilot designs for estimands other than the average treatment effect among the treated have not yet been characterized (Aikens et al., 2020).

Design phase
Phase of a study in which the researcher considers what kinds of data will provide the strongest information to address the question at hand (e.g., randomization, sampling, matching, inverse probability weighting). The goal of the design phase is to obtain data which will provide strong inference.

Analysis phase
Phase of a study in which the data that come from the design phase are summarized into statistics. Inference and sensitivity analyses are performed.

Pilot design
An observational study approach in which some data is spent in the design phase to improve the study design/preprocessing.

Pilot set
A subset of data extracted to be used in the design phase.

Analysis set
The set of data reserved for inference in the analysis phase.

Propensity score
Probability of assignment to the treatment group based on measured baseline characteristics.

Prognostic score
Expectation of the outcome in the absence of treatment, based on measured baseline characteristics.

Prognostic model
A model (e.g., logistic regression) used to estimate prognostic scores.

Stratum
A subset of observations in the analysis set to be matched together.

Software
The core stratamatch function, auto_stratify, implements the prognostic score stratification pilot design described above. The most basic procedure does the following:

1. Partition the data set into a pilot data set and an analysis data set.
2. Fit a model for the prognostic score from the observations in the pilot set.
3. Estimate prognostic scores for the analysis set using the prognostic model.
4. Stratify the analysis set based on prognostic score quantiles.
A call to auto_stratify produces an auto_strata object, which contains the analysis set, the pilot set, and other information about the strata and prognostic scores. The stratamatch package implements a set of diagnostic plots and tables that can be used to assess the quality of a stratification. Example code, output, and diagnostics are provided in section 2.4. If the strata are satisfactory, the treatment and control individuals within each stratum can then be matched. By default, the strata_match function performs 1:1 propensity score matching within each stratum. Other matching scheme possibilities are discussed in section 2.5.3.

Simulated example
This section demonstrates the basic functionality of stratamatch in a simulated example. The function make_sample_data generates a simple simulated data set so that users can explore the design options implemented by stratamatch. Below, we generate a sample of 10,000 observations and print the first few rows as an illustration.

library("stratamatch")
library("dplyr")
dat <- make_sample_data(n = 10000)
head(dat)

(The output of head(dat), not reproduced here, shows the seven columns: X1, X2, B1, B2, C1, treat, and outcome.)

The user should suppose that the rows of dat are individuals in an observational study, and the objective of the study is to estimate the effect of a binary treatment assignment (treat) on a binary outcome (outcome). Columns 1-5 represent three types of measured baseline covariates: continuous (X1 and X2), binary (B1 and B2), and categorical (C1). For this example, we assume strongly ignorable treatment assignment; that is, roughly, there are no unmeasured confounding factors (Rosenbaum and Rubin, 1983). (For sensitivity analyses of this assumption see, for example, Rosenbaum (2005b).)

Automatic stratification
The command below uses auto_stratify to (1) partition 10% of the controls in dat into the pilot set, (2) fit a prognostic score model for outcome based on X1 and X2, (3) estimate prognostic scores on the analysis set, and (4) return the analysis set, divided into strata of approximately 500 individuals based on prognostic score quantiles. All of these steps are completed automatically with this function call, and the results are returned to us as a.strat.

a.strat <- auto_stratify(dat, treat = "treat", prognosis = outcome ~ X1 + X2,
                         pilot_fraction = 0.1, size = 500)

Constructing a pilot set by subsampling 10% of controls.
Fitting prognostic model via logistic regression: outcome ~ X1 + X2

The result returned by auto_stratify is an auto_strata object. Running print on this object supplies basic information about how the stratification process was completed.

print(a.strat)

auto_strata object from package stratamatch.

Function call:
auto_stratify(data = dat, treat = "treat", prognosis = outcome ~ X1 + X2,
    size = 500, pilot_fraction = 0.1)

Analysis set dimensions: 9234 X 8
Pilot set dimensions: 766 X 7
Prognostic Score Formula: outcome ~ X1 + X2

Here, auto_stratify has partitioned away a pilot set of 766 control individuals to fit our desired prognostic model, leaving 9,234 individuals in the analysis set. Using the prognostic model, prognostic scores were estimated for the individuals in the analysis set, and these individuals were divided into strata with a target size of 500. In order to record these stratification assignments, an eighth column, stratum, has been appended to the analysis set. The number of strata and the range of strata sizes can be obtained from summary(a.strat).
The analysis set and pilot set are accessible via a.strat$analysis_set and a.strat$pilot_set, respectively. The strata_table (accessed via a.strat$strata_table) reports the strata sizes and the prognostic score quantile bins which define each stratum.

Diagnostics
A major focus of the stratamatch package is suggesting diagnostics for the quality of stratification in observational studies. The issue_table reports the total size and composition of each stratum (output not reproduced here). The 'Potential_Issues' column is meant to quickly flag strata which may be problematically large, small, or imbalanced in the ratio of treated and control samples. The "small treat:control ratio" flag for stratum 19 indicates that the proportion of treated individuals in that stratum is 0.2 or lower. This is a relatively common issue, which is often easily addressed (see section 2.6).
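The issue table is stored as a field of the stratification object, so it can be inspected or filtered directly. A minimal sketch (assuming, as in the package documentation, that the table is exposed as a.strat$issue_table and that unflagged strata are labeled "none"):

# view strata sizes, compositions, and flagged issues
a.strat$issue_table

# keep only the strata flagged with potential issues
library("dplyr")
a.strat$issue_table %>%
  filter(Potential_Issues != "none")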
The stratamatch package implements four diagnostic plotting options:

1. Size-ratio plot: (Figure 1) Displays each stratum in the analysis set based on its size and the percentage of control observations, in order to identify potentially problematic strata.
2. Propensity score histogram: (Figure 1) Displays the distribution of estimated propensity scores across the treatment and control groups, within a single stratum or the entire analysis set. These plots are used for assessing propensity score overlap.
3. Assignment-control plot: (Figure 2) Displays each individual based on estimated propensity score and estimated prognostic score, based on visualizations from Aikens et al. (2020). As above, these plots can display a single stratum or the entire analysis set. Assignment-control plots are useful for checking the overlap and correlation of prognostic and propensity scores.
4. Residual plots: (Not shown) Show the diagnostic plots for the prognostic model used to estimate the prognostic scores. This option is essentially a wrapper for plot.lm (see the documentation for plot.lm in the base R package, stats). Note that since the pilot set alone is used to fit the prognostic model, only the pilot set is used for these diagnostic plots.
The code below makes each of the plot types listed above, including two assignment-control plots: one for the entire analysis set and one for a single stratum. The results are shown in Figures 1 and 2, with interpretation in the figure captions. For propensity score histograms and assignment-control plots, the 'propensity' argument is required, specifying how the propensity scores should be estimated. Below, the propensity score is fit on the analysis set based on a regression of treatment assignment on 'X1', 'X2', 'B1', and 'B2' (for other input options, run help(plot.auto_strata) or help(plot.manual_strata)).

plot(a.strat, type = "SR")
plot(a.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 1)
plot(a.strat, type = "AC", propensity = treat ~ X2 + X1 + B1 + B2)
plot(a.strat, type = "AC", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 1)
plot(a.strat, type = "residual")

In this example, the command a.strat$prognostic_model would supply the prognostic model (an lm or glm object) for further diagnostics (e.g., with summary(a.strat$prognostic_model)). Assessment of the prognostic model can indicate whether a sufficient number of observations has been partitioned into the pilot set (see section 2.6). However, one benefit of a stratified matching design is that even an imperfect prognostic model may yield robust inference if the resulting strata are of sufficient quality to allow for a strong propensity match (see, for example, theory on stratified sampling (Lohr, 2019) or commentary on doubly robust matching (Leacy and Stuart, 2014; Antonelli et al., 2018)).

Figure 2: Assignment-control plots (Aikens et al., 2020) showing estimated propensity score versus estimated prognostic score for each subject in (a) the entire analysis set and (b) stratum 1. In an ideal scenario, there is ample overlap between treated and control individuals in terms of both prognosis and propensity (for other cases, see section 2.6). Grey lines denote the prognostic score thresholds defining the strata.

Matching

Once the data have been stratified, the user can optimally match individuals within each stratum. The strata_match function supports optimal 1:1, 1:k, or full matching (Rosenbaum, 1991; Hansen and Klopfer, 2006), based on a propensity score or Mahalanobis distance. This function makes essential use of the optmatch package (Hansen and Klopfer (2006); see also Bertsekas and Tseng (1988)) to perform the matching within strata. The sample code below performs 1:1 propensity score matching.

mymatch <- strata_match(a.strat, model = treat ~ X1 + X2 + B1 + B2)

Fitting propensity model: treat ~ X1 + X2 + B1 + B2

The result is an optimal 1:1 matching within prognostic score strata. Above, mymatch is an optmatch class object, as described by the optmatch package (Hansen and Klopfer, 2006). For the most part, mymatch can be treated as a factor giving match assignments for each row of the data set. The command summary(mymatch) would display the number of pairs, the number of unmatched individuals, and the effective sample size. For suggestions regarding other matching schemes for stratified data, see section 2.5.3.
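Since mymatch aligns with the rows of the analysis set, match assignments can be appended to the data for downstream analysis. A minimal sketch (assuming mymatch contains one entry per row of a.strat$analysis_set, as produced above):

# append match assignments to the analysis set; unmatched rows receive NA
matched <- a.strat$analysis_set
matched$match_id <- mymatch

# keep only the successfully matched individuals
matched <- matched[!is.na(matched$match_id), ]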

A brief comment on estimation
The procedure for performing inference after matching, in particular the estimation of the standard error of the effect estimate for the purposes of hypothesis tests and confidence intervals, is a topic of some debate in the literature. We will not attempt to resolve this debate here, although interested readers may find the commentary by Stuart (Stuart, 2010) to be an accessible starting place, and should note more recent work by Abadie and Spiess (Abadie and Spiess, 2021) and numerous other authors (for example, Abadie and Imbens (2006, 2011); Austin and Small (2014); Austin and Cafri (2020)). Below, we describe two contrasting approaches which are most familiar in epidemiology and statistics, with references to coding resources.
First, Rosenbaum (Rosenbaum, 2005b; Rosenbaum et al., 2010) motivates the use of permutation-based tests followed by sensitivity analyses for unobserved confounding. Both of these are implemented in sensitivitymw (Rosenbaum, 2014, 2015) for pair matching and sensitivityfull (Rosenbaum, 2007, 2017) for full matching. In keeping with a randomization inference framework, these techniques generally consider inference conditional on the matched sample and focus on uncertainty derived from the randomization process emulated by the matching. Researchers with further information on the sampling process which generated the observational data may thereafter combine this approach with a sampling variation framework to estimate parameters and standard errors for a more general target population (see the framework outlined by Tipton (2013) for the experimental setting).
A second common approach uses covariate adjustment. This framework is motivated in large part by Ho et al. (2007), who make the case for matching as a preprocessing step to reduce the dependence of parametric analyses on model selection. In keeping with the regression literature from the social sciences, these approaches often begin by supposing that the complete observational data set is an independent and identically distributed set of observations from some larger population, perhaps according to some parametric data-generating model. Within this framework, there is still considerable debate regarding correct standard error estimation. A thorough practical tutorial for the covariate adjustment approach, with code examples and some suggestions for standard error estimation, is featured in the recent MatchIt vignette, "Estimating Effects after Matching" (Greifer, 2020). Note that the pilot design implemented by stratamatch removes control individuals at random from the data set while retaining all treated individuals. Thus, while we recommend stratamatch for the estimation of the average treatment effect among the treated, the characteristics of stratamatch designs for the estimation of other causal estimands (e.g., the overall average treatment effect) have not yet been well characterized.

Real-data example: Life sustaining treatments for critical care patients
As an applied example, the stratamatch package contains a re-processed version of de-identified medical data from Chavez et al. (2018). Briefly, the authors extracted demographic information, common laboratory test results, comorbidity information, and treatment team assignments for 10,157 ICU patients from the Stanford University Hospital who met their inclusion criteria. During their stay, each patient's critical care preferences were summarized with a code status. The default, Full Code status, indicates no limitations on resuscitative measures, while other codes (e.g., 'Do not resuscitate', or 'DNR') indicate different limitations on the intensity and type of resuscitation the patient should receive if they become pulseless or apneic (i.e., their heart stops or they stop breathing). This code status is a product of complex dynamics between patient and provider. When a patient's code status does not reflect their goals of care, patients may have life-sustaining care inappropriately withheld, or they may receive aggressive treatment which does not effectively increase their quality or quantity of remaining life.
In this example, suppose a researcher wants to study whether comparable patients under the care of surgical teams vs. non-surgical teams are more likely to have their code status set to limit resuscitation (i.e., any form of 'DNR'). From this, we could infer tendencies that different treatment teams have in counseling and decision-making about life-sustaining treatments for the critically ill. However, the patient groups seen by surgical vs. non-surgical teams are necessarily different, because patients are assigned to treatment teams based on their reason for being in the hospital and their treatment history. Thus, a naive comparison of DNR order frequency between care team types would be misleading. To better account for these potential differences, we employ a stratified pilot matching design to compare "treated" individuals (assigned to a surgical care team) with "control" individuals (assigned to a non-surgical care team) who are similar in terms of their prognostic and propensity scores.

Automatic stratification
Patients must first be stratified by a prognostic score (i.e., their estimated probability of receiving a DNR order if they are not assigned to a surgical care team) before being matched on a propensity score (i.e., their estimated probability of assignment to a surgical care team). In the example below, we use auto_stratify on the ICU_data to (1) partition 10% of controls into a pilot set, (2) build a prognostic score model on that pilot set based on age ('Birth.preTimeDays'), sex, and race/ethnicity, (3) estimate prognostic scores on the analysis set, and (4) return a stratified data set with approximately 500 individuals per stratum.
ICU_astrat <- auto_stratify(data = ICU_data, treat = "surgicalTeam",
                            prognosis = DNR ~ Birth.preTimeDays + Female.pre +
                              RaceAsian.pre + RaceUnknown.pre + RaceOther.pre +
                              RaceBlack.pre + RacePacificIslander.pre +
                              RaceNativeAmerican.pre + all_latinos,
                            pilot_fraction = 0.1, size = 500)

We see here that auto_stratify partitioned the data into a pilot set of 793 "controls" (i.e., patients not assigned to a surgical treatment team) and an analysis set of the 9,364 remaining individuals. The prognostic model was fit on the pilot set according to the formula we provided, regressing DNR code assignment on age, sex, and race/ethnicity. This model was used to estimate the prognostic score (the probability of DNR code assignment based on demographics) for each of the 9,364 individuals in the analysis set. Finally, each individual in the analysis set was assigned to a stratum based on this score. Nineteen strata were created, each containing between 492 and 494 patients. This stratum assignment information was appended to the analysis set as a new 14th column, stratum.

Manual stratification
Rather than using a pilot design to build a prognostic score, researchers may wish to stratify the data set based on discrete covariates (e.g., chosen by a domain expert). The manual_stratify function supports these study designs. For example, the code below bins the 10,157 patients in the data set purely based on race/ethnicity and sex.
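A call along the following lines produces such a stratification (a sketch, assuming the sex and race/ethnicity indicator columns used for the propensity model in the diagnostics section below):

# stratify manually on discrete sex and race/ethnicity indicators
ICU_mstrat <- manual_stratify(ICU_data,
                              surgicalTeam ~ Female.pre + RaceAsian.pre +
                                RaceUnknown.pre + RaceOther.pre + RaceBlack.pre +
                                RacePacificIslander.pre + RaceNativeAmerican.pre +
                                all_latinos)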
The resulting manual_strata object has many of the same properties as an auto_strata object from auto_stratify and can be matched in the same way with strata_match. However, manual_strata objects do not have a pilot set or prognostic score information, and accordingly assignment-control and residual plots are not supported for these inputs. This more traditional manual approach may be preferred in some cases for its simplicity, and because it obviates the need to sacrifice observations to fit a prognostic model. However, selecting a binning scheme which results in favorable strata may be a time-consuming iterative process, as highlighted by the diagnostics in the following section. These issues underscore the potential usefulness of the prognostic score stratification implemented by auto_stratify.

Diagnostics
Size-ratio plots for the manual and automatic stratifications illustrate a common issue with manual stratification: it is often difficult to select discrete covariates which result in appropriately sized and balanced strata (Figure 3). This is also reflected by the manual stratification issue table (not reproduced here), which flags 9 strata with potential issues. For example, stratum 1 in the manual stratification (white males) contains 3,314 patients, while stratum 3 (Native American males) contains only 18 patients, only one of whom was assigned to a surgical team. In exceedingly large strata, matching becomes increasingly computationally intensive, while in exceedingly small and/or highly imbalanced strata, finding high-quality matches can be difficult or infeasible (see section 2.5.3). In contrast, the size-ratio plots for the automatic stratification show a much smaller range of sizes and control proportions, with fewer, and more easily addressed, potential issues.

The code below displays the assignment-control plot for one of the strata in the automatically stratified data set (Figure 4).

plot(ICU_astrat, type = "AC",
     propensity = surgicalTeam ~ Female.pre + Birth.preTimeDays + RaceAsian.pre +
       RaceUnknown.pre + RaceOther.pre + RaceBlack.pre + RacePacificIslander.pre +
       RaceNativeAmerican.pre + all_latinos,
     stratum = 2)

The striae in this assignment-control plot appear when discrete characteristics (e.g., sex and race/ethnicity) are highly weighted in the propensity or prognostic score, causing observations to cluster together. Since this is relatively common, 'jitter' arguments to the plot method can be used to add small amounts of random noise to the coordinates of each point in order to avoid stacking.
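For example (a sketch, assuming the jitter_propensity and jitter_prognosis arguments documented for plot.auto_strata):

# add small uniform noise on both score axes to reduce overplotting
plot(ICU_astrat, type = "AC",
     propensity = surgicalTeam ~ Female.pre + Birth.preTimeDays + RaceAsian.pre +
       RaceUnknown.pre + RaceOther.pre + RaceBlack.pre + RacePacificIslander.pre +
       RaceNativeAmerican.pre + all_latinos,
     stratum = 2,
     jitter_propensity = 0.01, jitter_prognosis = 0.01)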

Matching
After a suitable stratification is selected, observations can be matched within strata using strata_match. Since every stratum from the automatic stratification in this example contains at least two "control" (i.e., non-surgical team) patients for every "treated" (i.e., surgical team) patient, we can match two controls to each treated subject within each stratum. In this step, we match individuals who, based on their baseline covariates, appear equally likely to have been assigned to a surgical team.
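The matching can then be performed along the following lines (a sketch, reusing the propensity formula from the diagnostics above, with k = 2 requesting 1:2 matching):

ICU_match <- strata_match(ICU_astrat,
                          model = surgicalTeam ~ Female.pre + Birth.preTimeDays +
                            RaceAsian.pre + RaceUnknown.pre + RaceOther.pre +
                            RaceBlack.pre + RacePacificIslander.pre +
                            RaceNativeAmerican.pre + all_latinos,
                          k = 2)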

Figure 4: Assignment-control plot for stratum 2 of the automatically stratified ICU data set, showing estimated propensity score versus estimated prognostic score for treated and control individuals, with strata cut points indicated.

At this point, the researcher can compare matched treated and control individuals to infer whether patients assigned to surgical treatment teams are more or less likely to be assigned a DNR code status, following up with sensitivity analyses (see, for example, sensitivitymw (Rosenbaum, 2014)).

Key design choices and advanced functionality

The selection of the pilot set
The previous illustrations demonstrated the simplest method of extracting the pilot set: a random subsampling of all controls. Aikens et al. (2020) contains a more thorough discussion of the considerations that might inform the selection of a pilot set.
A first consideration is the pilot set size. In general, the researcher should create a pilot set large enough to build a reliable prognostic model while retaining enough controls to select high-quality matches for the treatment group. This balance depends on the quality and number of available controls and the relative difficulty of fitting a prognostic model from the measured covariates. When high-quality controls (i.e., those resembling the treatment group) are scarce, the researcher should consider a smaller pilot set or a different study design altogether.
Another consideration is composition. Ideally, the individuals in the pilot set should be similar to the individuals in the treatment group, so that a prognostic model built on this pilot set will not extrapolate heavily when estimating prognostic scores on the analysis set. This consideration can be especially important when some category of observations in the data is relatively rare, and the researcher would like to ensure that observations of this category end up in both the pilot and analysis sets. When discrete covariates are specified with the 'group_by_covariates' argument to auto_stratify, the pilot set will be sampled proportionally based on these covariates, so that the pilot set is representative of the total control sample in terms of these covariates. This option can be used directly with auto_stratify. However, the split_pilot_set function is supplied as a convenience for users who prefer to split the pilot set themselves before stratification, as demonstrated below.
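A minimal sketch of this workflow (assuming the split_pilot_set and auto_stratify arguments described in the package documentation, and an illustrative prognostic formula):

# split the pilot set by hand, sampling proportionally within sex
mysplit <- split_pilot_set(ICU_data, treat = "surgicalTeam",
                           pilot_fraction = 0.1,
                           group_by_covariates = c("Female.pre"))

# pass the pre-split pilot set to auto_stratify via pilot_sample
ICU_astrat2 <- auto_stratify(mysplit$analysis_set, treat = "surgicalTeam",
                             prognosis = DNR ~ Birth.preTimeDays + Female.pre,
                             pilot_sample = mysplit$pilot_set, size = 500)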

Fitting the prognostic model
To fit the prognostic model, auto_stratify uses either linear regression (for a continuous outcome) or logistic regression (for a binary outcome). To accommodate a wider variety of modeling choices, auto_stratify can also be run with a vector of pre-estimated analysis set prognostic scores or with a prognostic model object (any model object with a method for the predict generic function).
The example below uses the glmnet package (Friedman et al., 2010) to fit a cross-validated lasso on the pilot set which was extracted in the previous section.
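The sketch below illustrates one way to do this (assuming the mysplit object from the previous section and an illustrative subset of covariate columns):

library("glmnet")

# design matrix and outcome for the pilot set (illustrative covariate subset)
covs <- c("Birth.preTimeDays", "Female.pre", "RaceAsian.pre", "RaceBlack.pre")
x_pilot <- as.matrix(mysplit$pilot_set[, covs])
y_pilot <- mysplit$pilot_set$DNR

# cross-validated lasso for the prognostic model
cv_lasso <- cv.glmnet(x_pilot, y_pilot, family = "binomial")

# estimate prognostic scores on the analysis set
x_analysis <- as.matrix(mysplit$analysis_set[, covs])
prog_scores <- as.numeric(predict(cv_lasso, newx = x_analysis,
                                  s = "lambda.min", type = "response"))

# stratify the analysis set on the externally estimated scores
ICU_astrat3 <- auto_stratify(mysplit$analysis_set, treat = "surgicalTeam",
                             prognosis = prog_scores, outcome = "DNR",
                             size = 500)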

Matching
Section 2.4 demonstrates how the stratamatch package can be used for optimal 1:k matching on the propensity score. The strata_match function also supports full matching (Hansen and Klopfer, 2006; Rosenbaum, 1991) and the use of Mahalanobis distance instead of a propensity score. If desired, a data set stratified with stratamatch can instead be matched within strata using other matching software (e.g., optmatch (Hansen and Klopfer, 2006) or MatchIt (Ho et al., 2011)). For example, users proficient with optmatch will note that adding + strata(stratum) to the matching formula supplied to optmatch::pairmatch and other matching functions will match within stratum assignments in the analysis set.
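For instance (a sketch, assuming optmatch and the a.strat object from the simulated example in section 2.4):

library("optmatch")

# optimal pair matching within the stratamatch strata, via optmatch directly
pm <- pairmatch(treat ~ X1 + X2 + B1 + B2 + strata(stratum),
                data = a.strat$analysis_set)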
More nuanced matching schemes may also help address imbalances in the number of treated and control units within strata. For example, the researcher could perform 1:k matching within each stratum but allow k to vary between strata, matching more controls to each treated individual in strata where controls are plentiful and performing 1:1 or 1:2 matching where controls are less abundant. Another solution is to use a matching scheme within strata which naturally allows for variation in the ratio of treated and control individuals in matched sets, such as full matching (Rosenbaum, 1991; Hansen and Klopfer, 2006) or variable-k matching (Pimentel et al., 2015).
As shown in Figure 5, stratification is expected to substantially accelerate the matching process, especially for large sample sizes (several thousand or more). Hansen and Klopfer articulate a worst-case run-time for various forms of optimal matching with optmatch of O(n^3 log(nM)), where M represents the maximum matching discrepancy between treated and control observations (Hansen and Klopfer, 2006). For context, this scales slightly less favorably than matrix inversion, which quickly becomes time-consuming for large inputs. By comparison, matching within strata of a fixed size tends to scale much more favorably for large n (Figure 5). To further accelerate computation, a researcher might distribute the matching of the stratified data set over several computing nodes.

Figure 5: Empirical run-times for stratified and unstratified matching. Unstratified matching scales in a supra-linear manner with sample size (Hansen and Klopfer, 2006), while stratified matching with a set strata size tends to scale more favorably with n. At sample sizes of 30,000, optimally matching a whole data set may take over an hour, and much larger sample sizes may quickly become infeasible.

Trouble-shooting a stratification scheme
This section summarizes some common pitfalls and workarounds encountered while stratifying a data set. Importantly, in order to preserve the separation of the design and analysis phases, individuals partitioned into the pilot set must not be recombined with the analysis set. For instance, simply running auto_stratify repeatedly with different seeds to sample new pilot sets from the data and fit new prognostic score models may lead to overfitting of the prognostic model, raising concerns of bias in the study results (see Hansen (2008); Abadie et al. (2018)).
The following issues are common:

1. Some strata are too small or too large: This problem can often be solved simply by rerunning auto_stratify with a different 'size' parameter (see the sketch after this list). When this is done, the researcher should be sure to use the same pilot and analysis sets as they received when they first ran auto_stratify (i.e., do not partition a new pilot set).
2. The strata have poor balance of treated and control individuals: This situation is relatively common, but often straightforward to address with matching schemes that match more controls to each treated observation or allow for variable treat:control ratios. See section 2.5.3 for some suggestions.
3. The prognostic model is poor: In some cases, the user may encounter an error fitting the prognostic model, or they may suspect from prognostic model diagnostics that the model does a poor job of capturing variation predictive of the outcome. There are a few reasons the prognostic model may be problematic.
(a) The prognostic model was mis-specified. In this case, the user should fit a revised prognostic model on the same pilot set as was previously used. However, refitting repeatedly can lead to overfitting, so this should be done in moderation.
(b) The pilot set was too small to get a reliable fit. In this case, the user can add more samples from the analysis set to the existing pilot set. Samples that are moved into the pilot set must stay in the pilot set and should not be re-pooled with the analysis set.
(c) Pilot set size is sufficient, but prognostic model perfectly separates treated individuals from control individuals: If this occurs in either the pilot set or analysis set, it may be a sign that overlap is poor. See below.
4. The treated and control individuals within strata have poor overlap in propensity and/or prognostic scores: This problem is best diagnosed with assignment-control plots (see Aikens et al. (2020) for a deeper description). Propensity and prognostic score subclassification methods both depend on some form of overlap in the baseline characteristics of treated and control individuals in order to make a valid estimate of a causal effect (for a summary, see Leacy and Stuart (2014)). Treatment and control groups which are clearly separated in terms of either their propensity scores or prognostic scores can be an indication that these two groups should not be compared, because the resulting inference on the treatment effect would be misleading. A researcher facing this situation might consider trimming the score space (Glynn et al., 2019) in some cases, or seeking out another data set if the overlap problems are severe. While this may seem to be a disappointing result, the ability to identify these data issues before proceeding is one of the most important strengths of design-based causal inference (see, for example, Austin (2011)).
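Returning to issue 1 above, the sketch below re-stratifies with a larger target size while reusing the original pilot set (assuming the pilot_sample argument to auto_stratify and the a.strat object from section 2.4):

# reuse the analysis set, dropping the previous stratum assignments
analysis <- a.strat$analysis_set
analysis$stratum <- NULL

# re-stratify with a larger target size, reusing the held-aside pilot set
a.strat2 <- auto_stratify(analysis, treat = "treat",
                          prognosis = outcome ~ X1 + X2,
                          pilot_sample = a.strat$pilot_set, size = 1000)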

Summary and discussion
Stratifying a data set prior to matching may make optimal and full matching designs scale more practically for modern observational sample sizes (Figure 5). However, the primary objective of stratamatch is not to directly implement a computationally taxing task, but to expand access to sophisticated study design tools for researchers with varying levels of technical and statistical sophistication. Indeed, the computational steps of stratification are relatively straightforward; however, the statistical concept of the pilot design is nuanced, and the process of stratifying a data set and interrogating the quality of that stratification can be thought-intensive and is not well supported by other resources. The stratamatch package is intended to make prognostic score stratification pilot designs, and stratified matching designs in general, easily implementable, with helpful diagnostic tools and documentation. The overall goal of this effort is to push researchers toward approaches and diagnostics which emphasize stronger study design in the observational setting. In modern observational studies, designs such as the stratamatch approach, which are tailored to large-sample studies, can offer increased precision and other statistical benefits that might otherwise be left on the table by more traditional approaches.