The Counterfactual package implements the estimation and inference methods of Chernozhukov et al. (2013) for counterfactual analysis. The counterfactual distributions considered are the result of changing either the marginal distribution of covariates related to the outcome variable of interest, or the conditional distribution of the outcome given the covariates. They can be applied to estimate quantile treatment effects and wage decompositions. This paper serves as an introduction to the package and displays basic functionality of the commands contained within.
Using econometric terminology, we can often think of a counterfactual
distribution as the result of a change in either the distribution of a
set of covariates
We start by giving a simple example of counterfactual analysis. Suppose
we would like to analyze the wage differences between men and women. Let
Let
To get started using the package Counterfactual for the first time, issue the command
> install.packages("Counterfactual")
into your R browser to install the package in your computer. Once the package has been installed, you can use the package Counterfactual during any R session by simply issuing the command
> library(Counterfactual)
Now you are ready to use the function counterfactual and data sets contained in Counterfactual. For general questions about the package you may type
> help(package = "Counterfactual")
to view the package help file, or for more questions about a specific
function you can type help(function-name)
. For example, try:
> help(counterfactual)
or simply type
> ?counterfactual
The command counterfactual has the general syntax:
> counterfactual(formula, data, weights, na.action = na.exclude,
+ group, treatment = FALSE, decomposition = FALSE,
+ transformation = FALSE, counterfactual_var,
+ quantiles, method = "qr",
+ trimming = 0.005, nreg = 100, scale_variable,
+ counterfactual_scale_variable,
+ censoring = 0, right = FALSE, nsteps = 3,
+ firstc = 0.1, secondc = 0.05, noboot = FALSE,
+ weightedboot = FALSE, seed = 8, robust = FALSE,
+ reps = 100, alpha = 0.05, first = 0.1,
+ last = 0.9, cons_test = 0, printdeco = TRUE,
+ sepcore = FALSE, ncore=1)
To describe the different options of the command we need to provide some background on methods for counterfactual analysis.
Consider a general setting with two populations labeled by
The reference and counterfactual populations in the wage examples
correspond to different groups such as men and women or nonunion and
union workers. We can also generate counterfactual populations by
artificially transforming a reference population. Formally, we can think
of
The reference and counterfactual populations can be specified to counterfactual in two ways that accommodate the previous two cases:
If the option group has been specified, then
Alternatively, the option counterfactual_var can be used to
specify the covariates in the counterfactual population. In this
case, the names on the right handside of formula contain the
variables in
Counterfactual distribution and quantile functions are formed by
combining the conditional distribution in the population
To estimate the QE function we need to model and estimate the
conditional distribution
In this section we assume that we have samples
The option formula specifies the outcome
method = "qr", which is the default, implements the quantile
regresion estimator of the conditional distribution
method = "loc" implements the estimator of the conditional
distribution
method = "locsca" implements the estimator of the conditional
distribution
method = "cqr" implements the censored quantile regression
estimator of the conditional distribution, which is the same as
((6)) with
method = "cox" implements the duration regression estimator of
the conditional distribution function
method = "logit" implements the distribution regression
estimator of the conditional distribution with logistic link
function
method = "probit" implements the distribution regression
estimator of the conditional distribution with normal link function,
i.e. where
method = "lpm" implements the linear probability model estimator
of the conditional distribution
For the methods (2)–(8), the option nreg sets the number of values of
The command counterfactual reports pointwise and uniform confidence
intervals for the QEs over a prespecified set of quantile indexes. The
construction of the intervals rely on functional central limit theorems
and bootstrap functional central limit theorems for the empirical QEs
derived in (Chernozhukov et al. 2013). In particular, the
pointwise intervals are based on the normal distribution, whereas the
uniform intervals are based on two resampling schemes: empirical and
weighted bootstrap. Thus, the
In addition to the intervals, counterfactual reports the p-values for several functional tests based on two test-statistic: Kolmogorov-Smirnov and the Cramer-von-Misses-Smirnov. The null-hypotheses considered are
Correct parametric specification of the model for the conditional
distribution. This test compares the empirical distribution of the
outcome
Zero QE at all the quantile indexes of interest:
Constant QE at all quantile indexes of interest:
First-order stochastic dominance:
Negative first-order stochastic dominance:
The options of counterfactual related to inference are:
noboot = TRUE suppresses the bootstrap. The bootstrap can be very demanding in terms of computation time. Therefore, it is recommended to switch it off for the exploratory analysis of the data.
weightedboot = TRUE selects weighted bootstrap with standard exponential weights. The default weightedboot = FALSE selects empirical bootstrap with multinomial weights. We recommend weighted bootstrap when the covariates include categorical variables with small cell sizes to avoid singular designs in the bootstrap draws.
reps specifies the number of bootstrap replications. This number will matter only if the bootstrap has not be suppressed. The default is 100.
alpha specifies the significance level of the tests and confidence
intervals. Note that the confidence level of the confidence interval
is 1 - alpha. Thus, the default value of
first and last select the subset of quantile indexes of interest
for inference. The tails of the distribution should not be used
because standard asymptotic does not apply to these parts. The
needed amount of tail trimming depends on the sample size and on the
distribution of the dependent variable. first sets the lowest
quantile index used and last sets the highest quantile index used.
The default values are 0.1 and 0.9 so that
cons_test add tests of the null hypothesis that
The command counterfactual provides functionality for parallel computing, which is specially useful to speed up the execution of the bootstrap. There are two options related to parallel computing:
setcore specifies whether multiple cores should be used. The default value setcore = FALSE turns off the parallel computing.
ncore selects the number of cores to use for parallel computing. The information of this option is only used when parallel computing is switched on with setcore = TRUE.
We consider two empirical examples to illustrate the functionality of the command counterfactual. The first example is an estimation of Engel curves that includes a counterfactual analysis where the counterfactual population is an artificial transformation of a reference population. The second example is wage decomposition with respect to union status where the reference and counterfactual populations correspond to two different groups.
We use the classical Engel 1857 dataset to estimate the relationship
between food expenditure (foodexp) and annual household income
(income), and then report the estimates of the QE of a change in the
distribution of the annual household income that might be induced for
example by a variation in income taxation.
First, we generate the variable counterfactual_income with the
counterfactual values of income and plot the reference and
counterfactual income distributions. The counterfactual distribution
corresponds to a mean preserving spread of the distribution in the
reference population that reduces standard deviation by
> library(quantreg)
> data(engel)
> attach(engel)
> counter_income <- mean(income)+0.75*(income-mean(income))
> cdfx <- c(1:length(income))/length(income)
> plot(c(0,4000),range(c(0,1)), xlim =c(0, 4000), type="n", xlab = "Income",
+ ylab="Probability")
> lines(sort(income), cdfx)
> lines(sort(counter_income), cdfx, lwd = 2, col = 'grey70')
> legend(1500, .2, c("Original", "Counterfactual"), lwd=c(1,2),bty="n",
+ col=c(1,'grey70'))
To estimate the QEs of this counterfactual change we turn on the option transformation of counterfactual by setting transformation = TRUE, and specify that the counterfactual values of the covariate income are in the generated variable counter_income by setting counterfactual_var = counter_income. This yields:
> qrres <- counterfactual(foodexp~income, counterfactual_var
+ = counter_income, transformation = TRUE)
Conditional Model: linear quantile regression
Number of regressions estimated: 100
The variance has been estimated by bootstraping the results 100 times.
No. of obs. in the reference group: 235
No. of obs. in the counterfactual group: 235
Quantile Effects -- Composition
----------------------------------------------------------------------
Pointwise Pointwise Functional
Quantile Est. Std.Err 95% Conf.Interval 95% Conf.Interval
0.1 68.049 4.214 59.789 76.308 55.939 80.159
0.2 57.897 4.332 49.406 66.388 45.448 70.346
0.3 43.851 5.246 33.568 54.133 28.774 58.927
0.4 29.248 5.091 19.270 39.227 14.618 43.878
0.5 16.716 4.602 7.696 25.735 3.491 29.940
0.6 5.744 4.308 -2.698 14.187 -6.634 18.123
0.7 -8.866 7.132 -22.845 5.113 -29.361 11.630
0.8 -40.099 8.191 -56.153 -24.045 -63.637 -16.561
0.9 -88.56 13.83 -115.67 -61.44 -128.31 -48.80
Bootstrap inference on the counterfactual quantile process
----------------------------------------------------------------------
P-values
======================
NULL-Hypthoesis KS-statistic CMS-statistic
======================================================================
Correct specification of the parametric model 0 0
No effect: QE(tau)=0 for all taus 0 0
Constant effect: QE(tau)=QE(0.5) for all taus 0 0
Stochastic dominance: QE(tau)>0 for all taus 0 0
Stochastic dominance: QE(tau)<0 for all taus 0 0
We reject the simultaneous hypotheses of zero, constant, positive and negative effect of the income redistribution at all the deciles. The QR model for the conditional distribution cannot be rejected at conventional significance levels.
Finally, we reestimate the QE function on the larger set of quantiles
> taus <- c(1:99)/100
> first <- sum(as.double(taus <= .10))
> last <- sum(as.double(taus <= .90))
> rang <- c(first:last)
> rqres <- counterfactual(foodexp~income, counterfactual_var=counter_income,
+ nreg=100, quantiles=taus, transformation = TRUE,
+ printdeco = FALSE, sepcore = TRUE,ncore=2)
cores used= 2
> duqf <- (rqres$resCE)[,1]
> l.duqf <- (rqres$resCE)[,3]
> u.duqf <- (rqres$resCE)[,4]
> plot(c(0,1), range(c(min(l.duqf[rang]), max(u.duqf[rang]))), xlim = c(0,1),
+ type = "n", xlab = expression(tau), ylab = "Difference in Food Expenditure",
+ cex.lab=0.75)
> polygon(c(taus[rang], rev(taus[rang])), c(u.duqf[rang], rev(l.duqf[rang])),
+ density = -100, border = F, col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf[rang])
> abline(h = 0, lty = 2)
> legend(0, -90, "QE", cex = 0.75, lwd = 4, bty = "n", col = "grey70")
> legend(0, -90, "QE", cex = 0.75, lty = 1, bty = "n", lwd = 1)
We use an extract of the U.S. National Longitudinal Survey of Young
Women (NLSW) for employed women in 1988 to estimate a wage decomposition
with respect to union status.
We start by estimating the wage decomposition by logistic distribution regression, where the counterfactual population is specified with group=union with the options treatment=TRUE and decomposition=TRUE to estimate the composition, structure and total effects. The structure effect in this case correspond to the treatment effect of union on the treated or union premium. The tables show that the union workers earn higher wages than the nonumion workers throuoghout the distribution although the union wage gap is decreasing in the quantile index. This gap can be mostly explained by differences in tenure, education and experience between union and nonunion workers in the upper tail of the distribution and by the union premium in the rest of the distribution.
> data(nlsw88)
> attach(nlsw88)
> lwage <- log(wage)
> logitres <- counterfactual(lwage~tenure+ttl_exp+grade,
+ group = union, treatment=TRUE,
+ decomposition=TRUE, method = "logit",
+ weightedboot = TRUE, sepcore = TRUE, ncore=2)
cores used= 2
Conditional Model: logit
Number of regressions estimated: 96
The variance has been estimated by bootstraping the results 100 times.
No. of obs. in the reference group: 1407
No. of obs. in the counterfactual group: 459
Quantile Effects -- Structure
----------------------------------------------------------------------
Pointwise Pointwise Functional
Quantile Est. Std.Err 95% Conf.Interval 95% Conf.Interval
0.1 0.24047 0.06005 0.12278 0.35817 0.07757 0.40338
0.2 0.21903 0.05630 0.10869 0.32937 0.06631 0.37176
0.3 0.23437 0.04628 0.14366 0.32508 0.10881 0.35992
0.4 0.18524 0.04252 0.10190 0.26857 0.06989 0.30059
0.5 0.16041 0.04404 0.07410 0.24671 0.04095 0.27987
0.6 0.13897 0.04618 0.04845 0.22949 0.01369 0.26425
0.7 0.05701 0.04407 -0.02937 0.14339 -0.06255 0.17657
0.8 0.01945 0.04179 -0.06245 0.10135 -0.09391 0.13281
0.9 0.006434 0.078547 -0.147514 0.160382 -0.206650 0.219518
Bootstrap inference on the counterfactual quantile process
----------------------------------------------------------------------
P-values
======================
NULL-Hypthoesis KS-statistic CMS-statistic
======================================================================
Correct specification of the parametric model 1 1
No effect: QE(tau)=0 for all taus 0 0
Constant effect: QE(tau)=QE(0.5) for all taus 0 0.02
Stochastic dominance: QE(tau)>0 for all taus 0.95 0.95
Stochastic dominance: QE(tau)<0 for all taus 0 0
Quantile Effects -- Composition
----------------------------------------------------------------------
Pointwise Pointwise Functional
Quantile Est. Std.Err 95% Conf.Interval 95% Conf.Interval
0.1 0.06062 0.04131 -0.02035 0.14160 -0.05752 0.17877
0.2 0.04879 0.03717 -0.02406 0.12164 -0.05750 0.15508
0.3 0.05313 0.03721 -0.01981 0.12607 -0.05329 0.15956
0.4 0.09245 0.03772 0.01851 0.16638 -0.01543 0.20033
0.5 0.08952 0.03945 0.01220 0.16683 -0.02329 0.20233
0.6 0.12259 0.03862 0.04690 0.19828 0.01215 0.23303
0.7 0.12975 0.03781 0.05564 0.20385 0.02162 0.23787
0.8 0.090722 0.030184 0.031563 0.149881 0.004404 0.177041
0.9 0.05503 0.05744 -0.05755 0.16760 -0.10924 0.21929
Bootstrap inference on the counterfactual quantile process
----------------------------------------------------------------------
P-values
======================
NULL-Hypthoesis KS-statistic CMS-statistic
======================================================================
Correct specification of the parametric model 1 1
No effect: QE(tau)=0 for all taus 0.01 0.01
Constant effect: QE(tau)=QE(0.5) for all taus 0.77 0.58
Stochastic dominance: QE(tau)>0 for all taus 0.81 0.81
Stochastic dominance: QE(tau)<0 for all taus 0.01 0.01
Quantile Effects -- Total
----------------------------------------------------------------------
Pointwise Pointwise Functional
Quantile Est. Std.Err 95% Conf.Interval 95% Conf.Interval
0.1 0.30110 0.05703 0.18933 0.41287 0.13655 0.46565
0.2 0.26782 0.06383 0.14271 0.39293 0.08363 0.45202
0.3 0.28750 0.05459 0.18051 0.39449 0.12998 0.44502
0.4 0.27768 0.05211 0.17556 0.37981 0.12733 0.42804
0.5 0.24992 0.05111 0.14975 0.35010 0.10244 0.39741
0.6 0.26156 0.04999 0.16358 0.35953 0.11731 0.40580
0.7 0.18675 0.04529 0.09800 0.27551 0.05608 0.31743
0.8 0.11017 0.04624 0.01954 0.20081 -0.02327 0.24361
0.9 0.06146 0.06287 -0.06176 0.18468 -0.11996 0.24287
Bootstrap inference on the counterfactual quantile process
----------------------------------------------------------------------
P-values
======================
NULL-Hypthoesis KS-statistic CMS-statistic
======================================================================
Correct specification of the parametric model 1 1
No effect: QE(tau)=0 for all taus 0 0
Constant effect: QE(tau)=QE(0.5) for all taus 0.06 0.13
Stochastic dominance: QE(tau)>0 for all taus 0.93 0.93
Stochastic dominance: QE(tau)<0 for all taus 0 0
Next, we reestimate the QE function on the larger set of quantiles
> taus <- c(1:99)/100
> first <- sum(as.double(taus <= .10))
> last <- sum(as.double(taus <= .90))
> rang <- c(first:last)
> logitres <- counterfactual(lwage~tenure+ttl_exp+grade,
+ group = union, treatment=TRUE, quantiles=taus,
+ method="logit", nreg=100, weightedboot = TRUE,
+ printdeco=FALSE, decomposition = TRUE,
+ sepcore = TRUE,ncore=2)
cores used= 2
> duqf_SE <- (logitres$resSE)[,1]
> l.duqf_SE <- (logitres$resSE)[,3]
> u.duqf_SE <- (logitres$resSE)[,4]
> duqf_CE <- (logitres$resCE)[,1]
> l.duqf_CE <- (logitres$resCE)[,3]
> u.duqf_CE <- (logitres$resCE)[,4]
> duqf_TE <- (logitres$resTE)[,1]
> l.duqf_TE <- (logitres$resTE)[,3]
> u.duqf_TE <- (logitres$resTE)[,4]
> range_x <- min(c(min(l.duqf_SE[rang]), min(l.duqf_CE[rang]),
+ min(l.duqf_TE[rang])))
> range_y <- max(c(max(u.duqf_SE[rang]), max(u.duqf_CE[rang]),
+ max(u.duqf_TE[rang])))
> par(mfrow=c(1,3))
> plot(c(0,1), range(c(range_x, range_y)), xlim = c(0,1), type = "n",
+ xlab = expression(tau), ylab = "Difference in Wages", cex.lab=0.75,
+ main = "Total")
> polygon(c(taus[rang],rev(taus[rang])),
+ c(u.duqf_TE[rang], rev(l.duqf_TE[rang])), density = -100, border = F,
+ col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf_TE[rang])
> abline(h = 0, lty = 2)
> plot(c(0,1), range(c(range_x, range_y)), xlim = c(0,1), type = "n",
+ xlab = expression(tau), ylab = "", cex.lab=0.75, main = "Structure")
> polygon(c(taus[rang],rev(taus[rang])),
+ c(u.duqf_SE[rang], rev(l.duqf_SE[rang])), density = -100, border = F,
+ col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf_SE[rang])
> abline(h = 0, lty = 2)
> plot(c(0,1), range(c(range_x, range_y)), xlim = c(0,1), type = "n",
+ xlab = expression(tau), ylab = "", cex.lab=0.75, main = "Composition")
> polygon(c(taus[rang],rev(taus[rang])),
+ c(u.duqf_CE[rang], rev(l.duqf_CE[rang])), density = -100, border = F,
+ col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf_CE[rang])
> abline(h = 0, lty = 2)
Finally, we repeat the point and interval estimation using the duration regression method with the option method = "cox". Despite of relying on a more restrictive model for the conditional distribution, the duration regression estimates in Figure 3 are similar to the logit regression estimates in Figure 2.
> coxres <- counterfactual(lwage~tenure+ttl_exp+grade,
+ group = union, treatment=TRUE, quantiles=taus,
+ method="cox", nreg=100, weightedboot = TRUE,
+ printdeco = FALSE, decomposition = TRUE, sepcore = TRUE,ncore=2)
cores used= 2
> duqf_SE <- (coxres$resSE)[,1]
> l.duqf_SE <- (coxres$resSE)[,3]
> u.duqf_SE <- (coxres$resSE)[,4]
> duqf_CE <- (coxres$resCE)[,1]
> l.duqf_CE <- (coxres$resCE)[,3]
> u.duqf_CE <- (coxres$resCE)[,4]
> duqf_TE <- (coxres$resTE)[,1]
> l.duqf_TE <- (coxres$resTE)[,3]
> u.duqf_TE <- (coxres$resTE)[,4]
> range_x = min(c(min(l.duqf_SE[rang]), min(l.duqf_CE[rang]),
+ min(l.duqf_TE[rang])))
> range_y = max(c(max(u.duqf_SE[rang]), max(u.duqf_CE[rang]),
+ max(u.duqf_TE[rang])))
> par(mfrow=c(1,3))
> plot(c(0,1), range(c(range_x, range_y)), xlim = c(0,1), type = "n",
+ xlab = expression(tau), ylab = "Difference in Wages", cex.lab=0.75,
+ main = "Total")
> polygon(c(taus[rang],rev(taus[rang])),
+ c(u.duqf_TE[rang], rev(l.duqf_TE[rang])), density = -100, border = F,
+ col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf_TE[rang])
> abline(h = 0, lty = 2)
> plot(c(0,1), range(c(range_x, range_y)), xlim = c(0,1), type = "n",
+ xlab = expression(tau), ylab = " ", cex.lab=0.75, main = "Structure")
> polygon(c(taus[rang],rev(taus[rang])),
+ c(u.duqf_SE[rang], rev(l.duqf_SE[rang])), density = -100, border = F,
+ col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf_SE[rang])
> abline(h = 0, lty = 2)
> plot(c(0,1), range(c(range_x, range_y)), xlim = c(0,1), type = "n",
+ xlab = expression(tau), ylab = "", cex.lab=0.75, main = "Composition")
> polygon(c(taus[rang],rev(taus[rang])),
+ c(u.duqf_CE[rang], rev(l.duqf_CE[rang])), density = -100, border = F,
+ col = "grey70", lty = 1, lwd = 1)
> lines(taus[rang], duqf_CE[rang])
> abline(h = 0, lty = 2)
Counterfactual, quantreg, survival
CausalInference, ClinicalTrials, Econometrics, Environmetrics, Optimization, ReproducibleResearch, Robust, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Chen, et al., "Counterfactual: An R Package for Counterfactual Analysis", The R Journal, 2017
BibTeX citation
@article{RJ-2017-033, author = {Chen, Mingli and Chernozhukov, Victor and Fernández-Val, Iván and Melly, Blaise}, title = {Counterfactual: An R Package for Counterfactual Analysis}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {1}, issn = {2073-4859}, pages = {370-384} }