This article describes the R package rdrobust, which provides data-driven graphical and inference procedures for RD designs. The package includes three main functions: rdrobust
, rdbwselect
and rdplot
. The first function (rdrobust
) implements conventional local-polynomial RD treatment effect point estimators and confidence intervals, as well as robust bias-corrected confidence intervals, for average treatment effects at the cutoff. This function covers sharp RD, sharp kink RD, fuzzy RD and fuzzy kink RD designs, among other possibilities. The second function (rdbwselect
) implements several bandwidth selectors proposed in the RD literature. The third function (rdplot
) provides data-driven optimal choices of evenly-spaced and quantile-spaced partition sizes, which are used to implement several data-driven RD plots.
The regression-discontinuity (RD) design is a widely employed quasi-experimental research design in social, behavioral and related sciences; for reviews see Imbens and T. Lemieux (2008) and Lee and T. Lemieux (2010). In this design, units are assigned to treatment based on whether their value of an observed covariate is above or below a known cutoff, and the probability of receiving treatment conditional on this covariate jumps discontinuously at the cutoff. This jump induces “variation” in treatment assignment that may be regarded, under appropriate assumptions, as being unrelated to potential confounders. Thus, inference in RD designs is typically conducted using only observations near the cutoff or threshold, where the discontinuous change in the probability of treatment assignment occurs. Due to its local nature, RD average treatment effects estimators are usually constructed using local-polynomial nonparametric regression, and statistical inference is based on large-sample approximations.
This article gives an introduction to the R package rdrobust (Calonico et al. 2015b), which offers an array of data-driven local-polynomial and partitioning-based inference procedures for RD designs. We introduce three main functions implementing several data-driven nonparametric point and confidence intervals estimators, bandwidth selectors, and plotting procedures useful for RD empirical applications:
rdrobust
(). This function implements the bias-corrected robust (to
“large” bandwidth choices) inference procedure proposed by
Calonico et al. (2014, CCT hereafter), as well as
many other RD inference procedures employing local-polynomial
regression. The function rdrobust
offers bias-corrected robust
confidence intervals for average treatment effects at the cutoff for
sharp RD, sharp kink RD, fuzzy RD and fuzzy kink RD designs.
rdbwselect
(). This function implements several data-driven
bandwidth selectors for RD designs based on the recent work of
Imbens and K. Kalyanaraman (2012, IK hereafter) and CCT. Although
this command may be used as a stand-alone bandwidth selector in RD
applications, its main purpose is to provide fully data-driven
bandwidth choices to be used by our main function rdrobust()
.
rdplot
(). This function implements several data-driven optimal
choices of evenly-spaced and quantile-spaced bins, which are useful
to produce RD plots that either approximate the regression function
by local sample averages or represent the overall variability of the
data in a disciplined way. These optimal choices are based on an
integrated mean squared error expansion of appropriately constructed
partitioning estimators, as discussed in
Calonico et al. ({in press}); see also
Cattaneo and Farrell (2013) for related results. These binned sample
means and partition size chosen are used to construct the popular RD
plots commonly found in RD applications in a fully automatic way.
We first provide a brief review of all the methods implemented in
rdrobust, and then discuss an empirical illustration using some of the
features of our functions rdrobust()
, rdbwselect()
and rdplot()
. A
full description of the capabilities of the package rdrobust is
available in its manual and help files. A companion Stata (StataCorp. 2013)
package is described in Calonico, M. D. Cattaneo, and R. Titiunik (2014).
We now present the basic RD framework, describe the population parameters of interest, introduce the local-polynomial based estimators, review different inference procedures, and briefly summarize the popular RD plots. Regularity conditions and other technical aspects underlying the estimands and estimators may be found in the references given throughout. Recent results and further details on RD designs are discussed in IK and CCT (see also the supplemental appendix), and references therein.
We adopt the potential outcomes framework commonly employed in the
treatment effects literature (e.g., Heckman and E. J. Vytlacil (2007) and
Imbens and J. M. Wooldridge (2009)). Let
This setup allows for imperfect compliance, which in the RD literature
is known as the fuzzy RD design. The case of perfect treatment
compliance is usually called the sharp RD design. In either case, the
observed outcome and treatment status are, respectively,
The observed data is
We introduce some additional notation. For
We focus on average treatment effects at the cutoff in the sharp RD, fuzzy RD, sharp kink RD and fuzzy kink RD designs, although the results cover other possibilities. For further details on the interpretation of these estimands and regularity conditions see, among others, (Hahn, P. Todd, and W. van der Klaauw 2001), Porter (2003), Lee (2008), IK, CCT, (Card et al. 2014), and references therein.
Sharp RD designs. Two popular parameters of interest are the
sharp RD average treatment effect at the threshold, denoted by
Fuzzy RD designs. The parameters of interest take the form
Statistical inference in the RD design reduces to nonparametric
regression inference at the induced boundary point
Using the generic notation above, the sharp RD estimators are:
Assuming the bandwidth
The function rdrobust()
implements the above RD point estimators with
options c
to set deriv
to set p
to set kernel
to set h
to set
The main obstacle in the practical implementation of RD local polynomial estimators is bandwidth selection. The package rdrobust implements the main two approaches for bandwidth selection available in the literature: (i) plug-in rules based on mean squared error (MSE) expansions, and (ii) cross validation. IK provide a comprehensive review of these approaches.
Direct plug-in rules. Direct plug-in (DPI) approaches to
bandwidth selection are based on a mean squared error (MSE)
expansion of the sharp RD estimators, leading to the MSE-optimal
choice
The package rdrobust implements two data-driven versions of this
MSE-optimal bandwidth, one proposed by IK (available for scalereg
allows users to
remove it. The IK implementation, denoted by
Cross validation. This bandwidth choice is implemented as
follows:
and, for
The function rdbwselect()
implements the above bandwidth selectors.
We briefly review the main asymptotic properties of the local polynomial RD estimators, with particular emphasis on the properties of the associated confidence interval estimators. Specifically, we discuss three types of confidence intervals (CI) based on Gaussian approximations: (i) conventional CI (assuming “small” bias), (ii) bias-corrected CI (not necessarily requiring undersmoothing), and (iii) robust bias-corrected CI (not necessarily requiring undersmoothing).
Optimal point estimators. The package rdrobust implements the
following data-driven RD treatment effect point estimators.
Sharp RD confidence intervals. Confidence intervals accompanying the point estimators discussed above rely on the following distributional approximation:
where
Conventional confidence intervals. An asymptotic
Bias-corrected confidence intervals. As an alternative
to undersmoothing, we can directly bias-correct the estimator by
constructing an estimator of
Robust bias-corrected confidence intervals. The
confidence intervals discussed so far have some unappealing
properties that may affect their performance in applications. On the
one hand, the confidence intervals
CCT propose an alternative, more robust confidence interval formula
based on the bias-corrected RD treatment effect estimators, but
employing a different variance for studentization purposes.
Intuitively, the bias-corrected RD estimator does not perform well
in finite-samples because the bias estimate introduces additional
variability in the statistic, which is not accounted for when
forming the associated confidence intervals
Sharp RD variance estimation. To construct fully feasible confidence intervals the unknown asymptotic variance is replaced by a consistent estimator thereof. The asymptotic variance formulas introduced above have a “sandwich” structure coming from the weighted least-squares structure of local polynomials. The package rdrobust offers two distinct valid variance estimators, employing either “plug-in estimated residuals” or “fixed-matches estimated residuals”.
Plug-in estimated residuals. In this approach, the
unknown residuals are estimated directly using the RD local
polynomial estimator. Usually the same bandwidth
Fixed-matches estimated residuals. CCT propose an
alternative variance estimator employing a different construction
for the residuals, motivated by the work of
(Abadie and Imbens 2006). This estimator is constructed using a
simple nearest-neighbor (or fixed-matches) estimator for the
residuals.
Extensions to Fuzzy RD confidence intervals. All the ideas and results presented above extend to the case of fuzzy RD designs, which we omit here to conserve space. See IK and CCT for further details. The package rdrobust also implements fuzzy RD estimators and confidence intervals.
The function rdrobust()
may be used to conduct inference in all RD
settings; by default, this function employs the function rdbwselect()
for bandwidth selection. Specifically, assuming
Sharp RD:
rdrobust(y = y, x = x)
Sharp Kink RD:
rdrobust(y = y, x = x, deriv = 1)
Fuzzy RD:
rdrobust(y = y, x = x, fuzzy = t)
Fuzzy Kink RD:
rdrobust(y = y, x = x, fuzzy = t, deriv = 1)
Because of the simplicity of the RD design, it is customary (and
advisable) to summarize the main features of the RD design in a
graphical way. The package rdrobust also implements the results in
Calonico et al. ({in press}), which offer several optimal
data-driven choices of tuning parameters useful to produce several
versions of the popular RD plots. These plots present global and local
estimates of the regression functions,
The first ingredient of the RD plot (two regression curves estimated for
The function rdplot()
offers eight distinct automatic implementations
for RD plots depending on (i) the choice of partitioning (ES or QS),
(ii) the goal of the plot (IMSE-optimal or Mimicking Variance), and
(iii) the estimation approach used (spacings or polynomial regression).
Specifically, the function rdplot()
covers the following.
Population quantities:
Estimators:
Further details on implementation and syntax are given in the help file
of the function rdplot()
. For other technical and methodological
details see Calonico et al. ({in press}).
We employ an extract of the dataset constructed by
Cattaneo et al. (2015) to illustrate some of the features
of our R package rdrobust. This dataset contains information on
elections for the U.S. Senate during the period 1914–2010. We focus on
the RD effect of the Democratic party winning a U.S. Senate seat on the
vote share obtained in the following election for that same seat,
mimicking the analysis conducted in (Lee 2008) for the U.S. House.
The dataset rdrobust_RDsenate
contains two variables: vote and
margin. The variable vote records the state-level vote share of the
Democratic party in a given statewide election for a Senate seat, while
the variable margin records the margin of victory of the Democratic
party in the previous election for the same Senate seat (i.e., six years
prior).
First, we load the database and present basic summary statistics. The functions included in the R package rdrobust allow for missing values, which are automatically excluded for estimation purposes.
> library(rdrobust)
> data(rdrobust_RDsenate)
> vote <- rdrobust_RDsenate$vote
> margin <- rdrobust_RDsenate$margin
> summary(vote)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 42.67 50.55 52.67 61.35 100.00 93
> summary(margin)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-100.000 -12.210 2.166 7.171 22.770 100.000
This data set has a total of
We use rdplot()
to construct an automatic plot of the RD design.
> (rdplot(y = vote, x = margin, title = "RD Plot - Senate Elections Data",
+ y.label = "Vote Share in Election at time t+1",
+ x.label = "Vote Share in Election at time t"))
Call:
rdplot(y = vote, x = margin, title = "RD Plot - Senate Elections Data",
x.label = "Vote Share in Election at time t",
y.label = "Vote Share in Election at time t+1")
Method: mimicking variance evenly-spaced method using spacings estimators
Left Right
Number of Obs. 595 702
Polynomial Order 4 4
Scale 1 1
Selected Bins 15 35
Bin Length 6.6614 2.8561
IMSE-optimal bins 8 9
Mimicking Variance bins 15 35
Relative to IMSE-optimal:
Implied scale 1.8750 3.8889
WIMSE variance weight 0.1317 0.0167
WIMSE bias weight 0.8683 0.9833
Figure 1 is constructed using the default options in the
command rdplot
, which produce an RD plot with evenly-spaced bins
selected to mimic the underlying variability of the data and is
implemented using spacings estimators. Using the notation introduced
above, the number of optimal bins for control and treatment units are
Next, we construct an alternative RD plot using evenly-spaced bins selected to trace out the underlying regression function (i.e., using the IMSE-optimal selector), also implemented using spacings estimators. The resulting plot is given in Figure 2.
> (rdplot(y = vote, x = margin, binselect = "es",
+ title = "RD Plot - Senate Elections Data",
+ y.label = "Vote Share in Election at time t+1",
+ x.label = "Vote Share in Election at time t"))
Call:
rdplot(y = vote, x = margin, binselect = "es", x
title = "RD Plot - Senate Elections Data",
x.label = "Vote Share in Election at time t",
y.label = "Vote Share in Election at time t+1")
Method: IMSE-optimal evenly-spaced method using spacings estimators
Left Right
Number of Obs. 595 702
Polynomial Order 4 4
Scale 1 1
Selected Bins 8 9
Bin Length 12.4901 11.1071
IMSE-optimal bins 8 9
Mimicking Variance bins 15 35
Relative to IMSE-optimal:
Implied scale 1.0000 1.0000
WIMSE variance weight 0.5000 0.5000
WIMSE bias weight 0.5000 0.5000
While providing a good approximation to the underlying regression
function (taking the global polynomial fit as benchmark), the
IMSE-optimal number of bins will usually be too small in applications.
This happens because the optimal formulas seek to balance squared bias
and variance in order to approximate the underlying regression function
globally. To obtain a visual “cloud of points” we need to increase the
number of bins, that is, undersmooth the estimator. In other words, in
order to increase the overall variability of the plotted points, we may
reduce the bin-length – which is done by increasing the total number of
bins used. This may be easily done using the option scale
as follows:
> (rdplot(y = vote, x = margin, binselect = "es", scale = 5,
+ title = "RD Plot - Senate Elections Data",
+ y.label = "Vote Share in Election at time t+1",
+ x.label = "Vote Share in Election at time t"))
Call:
rdplot(y = vote, x = margin, binselect = "es", scale = 5,
title = "RD Plot - Senate Elections Data",
x.label = "Vote Share in Election at time t",
y.label = "Vote Share in Election at time t+1")
Method: IMSE-optimal evenly-spaced method using spacings estimators
Left Right
Number of Obs. 595 702
Polynomial Order 4 4
Scale 5 5
Selected Bins 40 45
Bin Length 2.4980 2.2214
IMSE-optimal bins 8 9
Mimicking Variance bins 15 35
Relative to IMSE-optimal:
Implied scale 5.0000 5.0000
WIMSE variance weight 0.0079 0.0079
WIMSE bias weight 0.9921 0.9921
Figure 3 shows the resulting (undersmoothed) RD plot, where now the number of bins used is five times larger than the optimal choice in an integrated mean squared error sense. The resulting estimator is naturally more variable than before.
Next, we conduct fully data-driven RD treatment effect estimation and
inference. The function rdrobust()
using its default options leads to
the following output:
> rdrobust(y = vote, x = margin)
Call:
rdrobust(y = vote, x = margin)
Summary:
Number of Obs 1297
NN Matches 3
BW Type CCT
Kernel Type Triangular
Left Right
Number of Obs 343 310
Order Loc Poly (p) 1 1
Order Bias (q) 2 2
BW Loc Poly (h) 16.7936 16.7936
BW Bias (b) 27.4372 27.4372
rho (h/b) 0.6121 0.6121
Estimates:
Coef Std. Err. z P>|z| CI Lower CI Upper
Conventional 7.4253 1.4954 4.9656 0.0000 4.4944 10.3562
Robust 0.0000 4.0697 10.9833
These results contain a variety of information, which is organized in three panels. The first two contain a summary of the main choices selected to construct the RD treatment effect estimators, while the lower panel includes the main estimation results. Specifically, using the notation introduced above, this table shows:
The total number of observations is
The bandwidth selection procedure is the one proposed by CCT,
leading to
The RD point estimator is
The function rdrobust()
also offers a more detailed output, which
includes all the point estimators, standard errors and confidence
intervals discussed previously. These results are retrieved using the
all = TRUE
option. The corresponding output is as follows:
> rdrobust(y = vote, x = margin, all = TRUE)
Call:
rdrobust(y = vote, x = margin, all = TRUE)
Summary:
Number of Obs 1297
NN Matches 3
BW Type CCT
Kernel Type Triangular
Left Right
Number of Obs 343 310
Order Loc Poly (p) 1 1
Order Bias (q) 2 2
BW Loc Poly (h) 16.7936 16.7936
BW Bias (b) 27.4372 27.4372
rho (h/b) 0.6121 0.6121
Estimates:
Coef Std. Err. z P>|z| CI Lower CI Upper
Conventional 7.4253 1.4954 4.9656 0.0000 4.4944 10.3562
Bias-Corrected 7.5265 1.4954 5.0333 0.0000 4.5957 10.4574
Robust 7.5265 1.7637 4.2675 0.0000 4.0697 10.9833
Finally, we illustrate all the bandwidth selection procedures contained
in our package. To this end, we employ our companion function
rdbwselect()
to compare the CCT bandwidth selectors with the IK and CV
approaches. We have:
> rdbwselect(y = vote, x = margin, all = TRUE)
Call:
rdbwselect(y = vote, x = margin, all = TRUE)
BW Selector All
Number of Obs 1297
NN Matches 3
Kernel Type Triangular
Left Right
Number of Obs 595 702
Order Loc Poly (p) 1 1
Order Bias (q) 2 2
h b
CCT 16.79357 27.43722
IK 15.66761 16.48524
CV 35.42113 NA
In this case we employed the option all = TRUE
, which computes the
three bandwidth selectors briefly discussed above. Notice that the
option CV
is currently not available for derivative estimation. To
further understand the performance of the CV approach, we include a
graph of the CV objective function over the grid being considered. This
is done using the option cvplot
as shown next (in this example we also
changed the grid features to obtain a better plot, and to show this
additional functionality in action as well).
> rdbwselect(y = vote, x = margin, bwselect = "CV",
+ cvgrid_min = 10, cvgrid_max = 80, cvplot = TRUE)
Call:
rdbwselect(y = vote, x = margin, bwselect = "CV", cvgrid_min = 10,
cvgrid_max = 80, cvplot = TRUE)
BW Selector CV
Number of Obs 1297
NN Matches 3
Kernel Type Triangular
Left Right
Number of Obs 595 702
Order Loc Poly (p) 1 1
Order Bias (q) 2 2
h b
34.5 34.5
Figure 4 shows the CV objective function as a function of a
grid of bandwidth. In this example, the cross validation approach
delivers a minimum at
Our functions contained in the R package rdrobust have many other
options. For instance, for the main function rdrobust()
we have the
following additional examples (output is not provided to conserve
space).
Estimation using uniform kernel:
rdrobust(y = vote, x = margin, kernel = "uniform")
Estimation using the IK bandwidth selector:
rdrobust(y = vote, x = margin, bwselect = "IK")
Estimation using the CV bandwidth selector:
rdrobust(y = vote, x = margin, bwselect = "CV")
Estimation using
rdrobust(y = vote, x = margin, h = 15, rho = 0.8)
Estimation using
rdrobust(y = vote, x = margin, p = 2, q = 4)
Estimation using plug-in residuals estimates:
rdrobust(y = vote, x = margin, vce = "resid")
We introduced the main features of the R package rdrobust, which
includes the functions rdrobust()
, rdbwselect()
, and rdplot()
designed to conduct data-driven nonparametric robust inference in RD
designs. This implementation covers average RD treatment effects at the
cutoff in the sharp RD, sharp kink RD, fuzzy RD and fuzzy kink RD
designs. A full description of this package may be found in its manual
and help files. A companion Stata package offering the same structure
and capabilities is described in Calonico, M. D. Cattaneo, and R. Titiunik (2014).
We thank the Editor Deepayan Sarkar, two anonymous reviewers, Richard Anderson, Devin Caughey and Marko Klašnja for comments and suggestions on previous versions of our R package rdrobust. The authors gratefully acknowledge financial support from the National Science Foundation (SES 1357561).
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Calonico, et al., "rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs", The R Journal, 2015
BibTeX citation
@article{RJ-2015-004, author = {Calonico, Sebastian and Cattaneo, Matias D. and Titiunik, Rocío}, title = {rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs}, journal = {The R Journal}, year = {2015}, note = {https://rjournal.github.io/}, volume = {7}, issue = {1}, issn = {2073-4859}, pages = {38-51} }