In this paper we introduce the ArCo package for R, which consists of a set of functions to implement the Artificial Counterfactual (ArCo) methodology to estimate causal effects of an intervention (treatment) on aggregated data when a control group is not necessarily available. The ArCo method is a two-step procedure: in the first stage, a counterfactual is estimated from a large panel of time series from a pool of untreated peers; in the second stage, the average treatment effect over the post-intervention sample is computed. Standard inferential procedures are available. The package is illustrated with both simulated and real datasets.
In this paper an R package to conduct counterfactual analysis with the Artificial Counterfactual (ArCo) method is introduced. The ArCo methodology is a flexible and easy-to-implement method to estimate the causal effects of an intervention on a single treated unit when a control group is not readily available. The procedure consists of two steps: in the first stage, a counterfactual is estimated on the basis of a large-dimensional set of variables from a pool of untreated units (the “donor pool”). In the second stage, the average intervention effect on a vector of variables from the treated unit is estimated. The method is robust to the presence of confounding effects, such as a global shock, and can be seen as an extension of the Synthetic Control (SC) approach of Abadie and Gardeazabal (2003) and Abadie et al. (2011) and of the panel factor (PF) model method put forward by Hsiao et al. (2012).
Causality and counterfactual analysis are topics of major importance in Social Sciences, Economics, Medicine, Psychology, etc. Routinely, causal statements with respect to a given intervention (or treatment) depend on the construction of counterfactuals, which are estimated from the outcomes of a similar group of individuals not affected by the intervention. Nevertheless, reaching final cause-and-effect conclusions is a very challenging task as a consequence of the difficulties in finding reliable sources of exogenous variation. In microeconometrics there have been major advances in the literature and the estimation of treatment effects is now part of the current toolbox of applied researchers. On the other hand, when there is not a natural control group and there is a single treated unit, which is usually the case when handling, for example, macro (aggregated) data, the econometric tools have evolved at a slower pace. However, in the recent past, some authors have proposed new techniques inspired partially by the developments in microeconometrics that are able, under some assumptions, to estimate counterfactuals with aggregate data. The ArCo method fits into this exciting literature. See, for example, Athey and Imbens (2016b) or Athey and Imbens (2016a) for recent reviews.
All the econometric analysis conducted with the ArCo method is on the time dimension and it consists of the following key ingredients: the definition of the intervention and the intervention time, the units of interest (treated and untreated units), and the observed variables for each unit. Units are, for example, municipalities, states, regions, countries, products, firms, stores, etc. The intervention occurs in one of these units starting at a given point in time and does not affect the other units. Interventions could be war outbreaks, terrorist attacks, policy changes, enforcement of new laws, natural disasters, new prices, among many others. Each unit is described by at least one variable. For example, if we are dealing with countries the variables may be GDP and inflation, and for companies they could be profit and revenue. For products the variables could be prices and quantities sold. All variables are observed over a period of time (days, months, years, for example).
The ArCo method is a two-step procedure. In the first step, the data before the occurrence of the intervention are used to estimate a multivariate time-series regression model relating the variables in the treated unit (dependent variables) only to the variables belonging to the untreated peers (explanatory variables). In the second step, the counterfactual is constructed by extrapolating the estimated model with data after the intervention. The estimated effect is the time-series average of the difference between the actual data and the counterfactual. It is important to stress that the model is conditioned on data from the untreated peers only and the validity of the method relies on two key assumptions:
The peers are not affected by the intervention;
The data is trend-stationary.
The first assumption is common in the literature and has also been
assumed by Abadie and Gardeazabal (2003) and Hsiao et al. (2012). On the other hand, the
trend-stationarity assumption has been dangerously ignored in the
literature. Carvalho et al. (2016b) investigate the consequences of applying
the ArCo, SC or PF methods when the data are integrated processes of
order 1. They find that without a cointegrating relationship (spurious
case) the estimator of the effects of the intervention diverges,
resulting in the rejection of the null hypothesis of no effect
regardless of its existence with probability approaching 1. For the case
when one cointegrating relationship exists, the properties of the estimator also change; we refer the reader to Carvalho et al. (2016b) for a detailed analysis.
High-dimensionality is relevant when the number of parameters to be
estimated is large compared to the actual sample size. This can happen
either when the number of peers and/or the number of variables for each
peer is large or when the sample size is too small. In Carvalho et al. (2016a) the
authors consider a linear model estimated by the Least Absolute
Shrinkage and Selection Operator (LASSO) proposed by Tibshirani (1996).
However, in the R package we leave the choice of the conditioning model
to the user. LASSO regressions (and extensions), regression trees and
random forests, boosted trees, neural networks, splines, factor models,
are some possible examples of models to be estimated in the first step.
The results in Carvalho et al. (2016a) are derived under asymptotic limits on the
time dimension (\(T \rightarrow \infty\)).
The ArCo approach nests the PF method proposed by Hsiao et al. (2012) and can be seen as a generalization of the SC approach along the same lines as discussed by Doudchenko and Imbens (2016) and Ferman and Pinto (2016). It is also better suited than Difference-in-Differences (DiD) estimators for comparative case studies when there is a single treated unit and no similar control group is available, even after the inclusion of many control variables. Furthermore, the ArCo approach relaxes the stringent parallel trend restriction of the DiD methods.
Recently, Gobillon and Magnac (2016) generalize DiD estimators by considering a correctly
specified linear panel model with strictly exogenous regressors and
interactive fixed effects represented as a number of common factors with
heterogeneous loadings. Their theoretical results rely on double
asymptotics, when both the number of units and the number of time periods grow to infinity.
Although both the ArCo and the SC methods construct a counterfactual as
a function of observed variables from a pool of peers, the two
approaches have important differences. First, the SC method relies on a
convex combination of peers to construct the counterfactual which, as
pointed out by Ferman and Pinto (2016), biases the estimator. The ArCo solution is a
general, possibly nonlinear, function. Even in the case of linearity,
the method does not impose any restriction on the parameters. For
example, the restriction that weights in the SC methods are all positive
seems a bit too strong. The SC method also requires an unrealistic
identification assumption about the (perfect-)fit of the model in the
pre-intervention period. Furthermore, the weights in the SC method are
usually estimated using time averages of the observed variables for each
peer. Therefore, all the time-series dynamics are removed and the weights
are determined in a pure cross-sectional setting. In addition, the SC
method was designed to evaluate the effects of the intervention on a
single variable. In order to evaluate the effects in a vector of
variables, the method has to be applied several times. The ArCo
methodology can be directly applied to a vector of variables of
interest. In addition, there is no formal inferential procedure for
hypothesis testing in the SC method, whereas in the ArCo methodology, a
simple, uniformly valid and standard test can be applied.
This section describes the mathematical notation and the key definitions of the ArCo methodology in a way that is compatible with the ArCo (Fonseca et al. 2017) package. For further details on statistical properties and theoretical results, see Carvalho et al. (2016a) and Carvalho et al. (2016b). Everything concerning the technique used to estimate the first-step model is kept very general, as the ArCo package was developed to accept many different classes of models. Note, however, that the theory and the inference were developed for LASSO linear regressions.
units: Indexed by \(i = 1, \dots, n\), where \(n\) is the total number of units (the treated unit plus the untreated peers).
variables: For each unit \(i\) we observe a vector of variables \(\boldsymbol{y}_{it}\) in every period \(t = 1, \dots, T\).
intervention: The intervention took place only in the treated unit at time \(t_0\), with \(1 < t_0 < T\).
Assume, without loss of generality, that the treated unit is the first one (\(i = 1\)). The goal is to test the hypothesis that the effects of the intervention are statistically significant for \(t \geq t_0\). We do not observe the counterfactual, i.e., what would have happened to the treated unit for \(t \geq t_0\) had the intervention not taken place; it has to be estimated. Consider the following model for the counterfactual, in which the variables of the treated unit are related only to the variables of the untreated peers,
\[
\boldsymbol{y}_{1t} = \mathcal{M}(\boldsymbol{z}_{0t}) + \boldsymbol{\nu}_t,
\]
where \(\boldsymbol{z}_{0t}\) collects the observed variables of the untreated units and \(\boldsymbol{\nu}_t\) is an error term. The first step of the ArCo method consists of estimating this model using the first \(t_0 - 1\) (pre-intervention) observations. Finally, the ArCo estimator is defined as
\[
\hat{\boldsymbol{\Delta}}_T = \frac{1}{T - t_0 + 1} \sum_{t = t_0}^{T} \left( \boldsymbol{y}_{1t} - \hat{\boldsymbol{y}}_{1t} \right),
\]
where \(\hat{\boldsymbol{y}}_{1t}\) is the counterfactual obtained by extrapolating the estimated first-step model with the post-intervention data of the peers.
As mentioned in the Introduction, the ArCo methodology is a two-step estimator where the first step consists of the estimation of the counterfactual model above. Carvalho et al. (2016a) show the conditions and results for the asymptotic normality of \(\hat{\boldsymbol{\Delta}}_T\). One can also establish a hypothesis test for the null of no intervention effect, given that the asymptotic covariance matrix of \(\hat{\boldsymbol{\Delta}}_T\) can be consistently estimated. The estimation of this covariance matrix can be done in four ways (a short example follows the list):
Covariance matrix assuming the errors are independent and identically distributed;
prewhitening using VAR models as in Andrews and Monahan (1992); the lag of the VAR may be chosen by the user or through an information criterion;
Newey and West (1987) (NW) covariance matrix with quadratic spectral, truncated, Bartlett, Parzen or Tukey-Hanning kernels; and
the combination of NW and prewhitening where the NW covariance matrix is calculated on the residuals of the VAR.
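In the fitArCo function, described later in the paper, these choices map to the arguments VCOV.type, VCOV.lag and prewhitening.kernel. As a short sketch (data.list, fn and p.fn are placeholders for the objects defined in the following sections, and t0 = 51 is just an example), the last option in the list corresponds to the settings used in all the examples below:

> # NW covariance matrix computed on the residuals of a prewhitening VAR;
> # the lag of the covariance estimator is fixed at 3 by the user
> fitArCo(data = data.list, fn = fn, p.fn = p.fn, treated.unit = 1, t0 = 51,
+   VCOV.type = "nw", VCOV.lag = 3, prewhitening.kernel = TRUE)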
In many cases it is reasonable to assume that the intervention time is unknown. For example, although some new policy has started at a known date, its effects may have been anticipated due to rational expectations. Regardless of the source of the uncertainty, what we need is to estimate \(t_0\) from the data. Let \(\hat{\boldsymbol{\Delta}}_T(t)\) denote the ArCo estimator computed as if the intervention had taken place at period \(t\). The estimated intervention time is the candidate period that maximizes the norm of the estimated effect,
\[
\hat{t}_0 = \underset{t}{\arg\max}\; \big\lVert \hat{\boldsymbol{\Delta}}_T(t) \big\rVert,
\]
where the maximization is carried out over a range of candidate periods.
The inferential procedures for the ArCo method are asymptotic. However, simulation experiments show that the method works well in small samples (Carvalho et al. 2016a). It is intuitive to think that the only sample size that matters is the one before the intervention, as it is the one used to estimate the model. However, for a given total sample size, the number of post-intervention observations also matters, since the average treatment effect and its test are computed over the post-intervention period.
The first step of the ArCo is a time-series estimation. Therefore, all the data must have the same periodicity (years, quarters, months, etc.). The Synthetic Control discards the time-series dimension of the control variables because it uses only their time averages (or another statistic such as the median). This is an important difference between the methods: the ArCo uses all the time-series information, but it requires more data than the Synthetic Control because all series must be complete (without missing values) and have the same periodicity.
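Because all series must be complete, a quick check for missing values can save time before estimation; the sketch below assumes the data are already organized in a list of matrices named data.list (a hypothetical name, following the data format described in the next section):

> # TRUE indicates that at least one series has missing values
> any(sapply(data.list, anyNA))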
The ArCo package has two datasets that were created using the data generating process (DGP) from the second simulation example in Carvalho et al. (2016a) and another dataset from the same article's empirical example. In this section we walk through the ArCo package using these datasets. All examples presented here are entirely reproducible with the code and datasets shown.
The exact specification of the DGP can be found in Carvalho et al. (2016a).
The two generated datasets are data.q1 (one variable and 20 units) and data.q2 (two variables and 6 units). We will start with data.q1, which can be loaded with the following code:
> library(ArCo)
> data(data.q1)
The data.q1 object is a matrix with 100 observations on 20 columns. In the ArCo context each column is a unit; therefore, we have one variable for each of the 20 units in the data. The intervention took place at the first unit at \(t_0 = 51\). The code below plots the treated unit and marks the intervention period:
> plot(data.q1[,1], type="l")
> abline(v=51, col=2, lty=2)
The main function of the ArCo package is fitArCo, which estimates the ArCo and provides the relevant statistics and outputs. The data must be supplied as a list of matrices: each matrix contains one variable, with one row per period and one column per unit. Since data.q1 has only one variable, we will supply the function with a list containing a single matrix.
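For concreteness, a minimal sketch of the expected data structure (the object names y1 and data.list are hypothetical):

> # one matrix per variable: rows are time periods, columns are units
> y1 = matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
> data.list = list(y1)  # the 'data' argument of fitArCo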
The user can choose any model to estimate the first step of the ArCo as long as it is possible to write a compatible estimation function (argument fn) and forecasting function (argument p.fn). The estimation function must receive a matrix of explanatory variables and a vector with the dependent variable and return the estimated model. The forecasting function works like the usual predict methods in R, which receive a model object and the new data (newdata) used in the prediction. Below we define an fn and a p.fn that use simple Ordinary Least Squares (OLS). The OLS estimation is the default model of the fitArCo function in case the user does not supply any functions.
> fn=function(X,y){
+ return(lm(y~X))
+ }
> p.fn=function(model,newdata){
+ b=coef(model)
+ return(cbind(1,newdata) %*% b)
+ }
The ArCo can now be estimated:
> ArCo=fitArCo(data = list(data.q1), fn = fn, p.fn = p.fn, treated.unit = 1, t0 = 51,
VCOV.type = "nw", VCOV.lag = 3, prewhitening.kernel = TRUE)
> plot(ArCo,display.fitted=TRUE)
> ArCo$delta
LB delta UB
Variable1 0.4115779 0.5174889 0.6233999
The fitArCo function generates an object of class 'fitArCo' with an S3 method for plot (call x11() first if you want the plot in a new window). The argument lag allows the inclusion of lags in the model and the argument Xreg accounts for exogenous controls. Finally, $delta returns the estimated average intervention effect with its confidence interval and $p.value returns the ArCo p-value.
In this section we illustrate the ArCo with two variables and some additional features of the fitArCo function, such as the bootstrap estimation of the counterfactual confidence bands, which has the advantage that it can be computed for any model chosen by the user.
If the argument boot.cf is set to TRUE, the function will automatically perform block bootstrap replications of the counterfactual; the number of replications and the block length are controlled through the arguments R and l, respectively. The bootstrap procedure uses the package boot (Canty and Ripley 2015) and is described as follows:
1. set \(r = 1\);
2. sample, with replacement, blocks of length l from the pre-intervention data;
3. estimate the first-step model on the bootstrap sample;
4. for \(t = t_0, \dots, T\), compute the counterfactual with the re-estimated model;
5. repeat steps 2-4 for each of the R replications.
The data.q2 dataset has two variables for 6 units. The number of observations is 100 and the treatment, which consisted of adding one standard deviation to each variable of the treated unit (0.84 for the first variable and 0.511 for the second variable), took place at \(t_0 = 51\).
The following example estimates the ArCo using the LASSO from the package glmnet (Friedman et al. 2010). The function fn is cv.glmnet and the function p.fn is predict; cv.glmnet chooses the LASSO regularization parameter by cross-validation.
> data(data.q2)
> library(glmnet)
> set.seed(123)
> ArCo2 = fitArCo(data = data.q2, fn = cv.glmnet, p.fn = predict, treated.unit = 1,
t0 = 51, VCOV.type = "nw", VCOV.lag = 3, prewhitening.kernel = TRUE,boot.cf = TRUE,
R=200, l=3)
> plot(ArCo2, display.fitted=TRUE ,confidence.bands = TRUE, alpha = 0.05)
> ArCo2$delta
LB delta UB
Variable1 0.6801909 0.7710389 0.8618870
Variable2 0.4225233 0.5071188 0.5917144
Figure 3 shows the counterfactual with the bootstrap confidence bands. The $delta output shows that the treatment was different from zero for both variables, and the point estimates are close to the true treatment values of 0.84 and 0.511.
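Since boot.cf = TRUE stores each bootstrap replication of the counterfactual in the returned object (see the list of return values further below), the bands can also be computed by hand. The code below is only a sketch and assumes that each element of ArCo2$boot.cf is a matrix with one column per variable:

> # pointwise 95% bootstrap bands for the first variable
> cf.boot = sapply(ArCo2$boot.cf, function(x) x[, 1])
> bands = apply(cf.boot, 1, quantile, probs = c(0.025, 0.975))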
A potential issue for the practitioner is the choice of the function used to estimate the model in the first step of the ArCo method (fn) and of the corresponding prediction function (p.fn). The default is to estimate the ArCo using OLS because it is a simple procedure and requires no extra packages. However, the recommended approach is to use the LASSO to estimate a linear model, due to its good properties in high-dimensional settings and, more importantly, because most of the asymptotic theory for the ArCo method was developed for the LASSO (high dimensions) and OLS (low dimensions). Nevertheless, the user is free to use other models such as Random Forests, Boosted Trees, Neural Networks, Factor Models, etc. In these cases, the user should be aware that the inference on the average treatment effect may not be valid and should only be used as an approximation. We recommend running some simulation tests under the null hypothesis (no treatment) and under the alternative hypothesis (treatment different from zero) if the user chooses an estimation approach different from the LASSO and OLS.
The fitArCo function accepts extra arguments to be passed to the fn function. This is not particularly useful for the LASSO or OLS, but if the user chooses models that require tuning (Boosted Trees, Random Forests, etc.), extra arguments may be required; a sketch is given below.
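As an illustration only (the asymptotic theory of the ArCo does not cover this estimator, and the forwarding of ntree through fitArCo is a sketch based on the description above), a random-forest first step could be set up with the randomForest package; the names fn.rf and ArCo.rf and the value ntree = 500 are our own choices:

> library(randomForest)
> fn.rf = function(X, y, ...) {
+   randomForest(x = X, y = y, ...)  # '...' receives extra arguments forwarded by fitArCo
+ }
> # predict() works directly as p.fn, as in the glmnet example
> ArCo.rf = fitArCo(data = list(data.q1), fn = fn.rf, p.fn = predict, treated.unit = 1,
+   t0 = 51, ntree = 500)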
The fitArCo function returns all the inference regarding the average treatment effect (confidence intervals and p-values). In the examples above we used the Newey-West covariance matrix (VCOV.type = "nw") with the Quadratic Spectral kernel (the default). The optimal lag for the covariance matrix is calculated automatically, but the user may force it through the argument VCOV.lag.
The complete list of what the fitArCo function returns is:
cf: The estimated counterfactual for the post-intervention period.
fitted.values: The in-sample fitted values from the first-step estimation.
model: A list with the models estimated in the first step by the fn function.
delta: The delta statistics for each variable and its confidence interval.
p.value: The ArCo p-value.
data: The data used.
t0: The intervention period used.
treated.unit: The treated unit used.
boot.cf: A list with the bootstrap result (boot.cf=TRUE) or logical FALSE (boot.cf=FALSE). In the first case, each element in the list refers to one bootstrap replication of the counterfactual, i.e. the list length is R.
call: The matched call.
We mentioned earlier that in many cases the exact intervention period might be unknown, but it can be estimated by maximizing the norm of \(\hat{\boldsymbol{\Delta}}_T(t)\) over candidate periods, which is exactly what the function estimate_t0 does. The example below uses the dataset data.q2 and the LASSO to estimate the model as in the previous example. The arguments start and end determine, as fractions of the sample, the range of candidate periods over which the search is performed.
> library(glmnet)
> set.seed(123)
> data("data.q2")
> t0=estimate_t0(data = data.q2, fn = cv.glmnet, p.fn = predict, treated.unit = 1,
start = 0.4, end = 0.9)
> t0$t0
[1] 53
> t0$delta.norm
[1] 0.9209421
The natural step that follows is to use the estimated \(\hat{t}_0\) in the fitArCo function. The results will obviously be very similar to the ones presented in the previous section because the estimated intervention period (\(\hat{t}_0 = 53\)) is very close to the true one (\(t_0 = 51\)).
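For completeness, a sketch of the follow-up call, reusing the objects created above and the same settings as in the previous section (the object name ArCo.t0 is ours):

> ArCo.t0 = fitArCo(data = data.q2, fn = cv.glmnet, p.fn = predict, treated.unit = 1,
+   t0 = t0$t0, VCOV.type = "nw", VCOV.lag = 3, prewhitening.kernel = TRUE)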
This section shows the ArCo package working on two different examples with real data. The first example is from Carvalho et al. (2016a) and it shows the effects on inflation of an anti tax evasion program in the Brazilian metropolitan area of São Paulo. The second example uses data from the package Synth (Abadie et al. 2011) to compare the ArCo with the Synthetic Control (Abadie and Gardeazabal 2003; Abadie et al. 2010).
So far we have used OLS and cross-validated LASSO to keep the examples as simple as possible. However, there are some advantages in selecting the LASSO regularization parameter using an information criterion such as the Bayesian Information Criterion (BIC), which allows us to consistently estimate the degrees of freedom of the model (Zou et al. 2007). Moreover, choosing by an information criterion is faster and avoids further complications created by cross-validation on time series. For the next examples we will define the functions fn and p.fn to select the model using the BIC as follows:
> fn=function(x, y){
+ n = length(y)
+ model = glmnet(x = x, y = y)       # full LASSO path
+ coef = coef(model)                 # coefficients for every lambda
+ df = model$df                      # number of nonzero coefficients per lambda
+ yhat = cbind(1, x) %*% coef        # fitted values for every lambda
+ residuals = (y - yhat)
+ mse = colMeans(residuals^2)
+ nvar = df + 1                      # add the intercept
+ bic = n*log(mse) + nvar*log(n)     # BIC for every lambda
+ selected = which.min(bic)          # lambda with the smallest BIC
+ return(coef[, selected])           # return only the selected coefficients
+ }
>
> p.fn=function(model, newdata){
+ return(cbind(1, newdata) %*% model)
+ }
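In formula terms, for each value of the LASSO penalty \(\lambda\) on the glmnet path the function above computes
\[
\mathrm{BIC}(\lambda) = n \log\big(\widehat{\mathrm{MSE}}_\lambda\big) + \big(\mathrm{df}_\lambda + 1\big)\log(n),
\]
where \(\mathrm{df}_\lambda\) is the number of nonzero coefficients and the \(+1\) accounts for the intercept; the coefficients associated with the \(\lambda\) that minimizes this criterion are returned.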
In October 2007 the government of the Brazilian state of São Paulo launched an anti tax evasion program called Nota Fiscal Paulista (NFP). It consists of a refund from a state tax on the circulation of products and services. The NFP acts as an incentive for consumers to ask for electronic sales receipts. The receipts give the consumer the right to participate in monthly lotteries promoted by the government and to receive the tax rebate if their tax identifier number (CPF) is included in the receipt.
The NFP program received substantial support from the population. In January 2008, 413 thousand people were already registered in the program, while in October 2013 the number of participants was more than 15 million.
The problem arises if we assume that there was some degree of tax evasion occurring before the intervention: once consumers start demanding the receipts, the sellers are forced to pay more tax and evasion becomes more difficult. If the sellers have some degree of market power, they might increase prices in order to accommodate the extra tax. If these assumptions are true, one would expect an upward movement in prices due to an increase in marginal costs. Therefore, the example presented in this section investigates whether the NFP had an impact on consumer prices in São Paulo.
The first sector to adopt the NFP was restaurants; therefore, the Food Away from Home (FAH) component of the Consumer Price Index (CPI) is a good measure to capture the entire program from the start. The sample is available in the ArCo package under the name inflationNFP. It consists of nine Brazilian metropolitan areas, including São Paulo, from January 2005 to September 2009. Each metropolitan area has the interpretation of an ArCo unit. The variables are the FAH component of the CPI inflation and the monthly GDP growth for all metropolitan areas. The treatment happened at \(t_0 = 34\) (October 2007).
> library(glmnet)
> data("inflationNFP")
> t0=34
> ArCoNFP=fitArCo(inflationNFP,fn,p.fn,1,t0,VCOV.type = "nw",VCOV.lag = 2,
prewhitening.kernel = TRUE)
> plot(ArCoNFP,plot=1,display.fitted = TRUE)
> ArCoNFP$delta
LB delta UB
inflationFAH 0.226033437 0.4226199609 0.619206484
GDP -0.004699182 -0.0006055599 0.003488062
> ArCoNFP$p.value
Joint inflationFAH GDP
1.333664e-04 2.514289e-05 7.718676e-01
The $delta output shows that there was indeed a significant impact on inflation caused by the NFP, while the impact on GDP was statistically zero. The $p.value relates to an overall effect considering all variables, and it also shows that the effects of the NFP were, in general, different from zero. Figure 4 shows that the LASSO had a good in-sample fit, but not so good as to raise concerns about over-fitting. In nearly all periods after the NFP was introduced, the estimated counterfactual was below the realized value of the inflation.
It is also interesting to see what happened with the CPI. The code below accumulates the realized inflation and the counterfactual. Figure 5 shows that the accumulated increase in the CPI (FAH component) might have been significantly smaller without the NFP.
> FAHsp=inflationNFP$inflationFAH[,1]
> real=cumprod(1+FAHsp/100)
> cf=cumprod(1+c(FAHsp[1:(t0-1)],ArCoNFP$cf[,1])/100)
> fitted=cumprod(1+fitted(ArCoNFP)[,1]/100)
>
> plot(real,type="l",ylab="Y1",xlab="time")
> lines(c(rep(NA,t0-1),tail(cf,length(real)-t0+1)),col=4)
> lines(fitted,col=2)
> abline(v=t0,col=4,lty=2)
> legend("topleft",legend=c("Observed","Fitted","Counterfactual"),col=c(1,2,4),
lty=1,lwd=1,cex=1,seg.len = 1,bty="n")
The package Synth provides a user-friendly implementation of the SC method and some datasets to test it. In this section we compare the SC and the ArCo by reproducing the exact examples from the synth function documentation in the Synth package. The first example uses artificially generated data; the exact code is reproduced below. The variable of interest is Y.
> # = Exact code reproduced from the first example in the ?synth documentation = #
> library(Synth)
> data("synth.data")
> dataprep.out<-
+ dataprep(
+ foo = synth.data,
+ predictors = c("X1", "X2", "X3"),
+ predictors.op = "mean",
+ dependent = "Y",
+ unit.variable = "unit.num",
+ time.variable = "year",
+ special.predictors = list(
+ list("Y", 1991, "mean"),
+ list("Y", 1985, "mean"),
+ list("Y", 1980, "mean")
+ ),
+ treatment.identifier = 7,
+ controls.identifier = c(29, 2, 13, 17, 32, 38),
+ time.predictors.prior = c(1984:1989),
+ time.optimize.ssr = c(1984:1990),
+ unit.names.variable = "name",
+ time.plot = 1984:1996
+ )
>
> t0 = Sys.time()
> synth.out <- synth(dataprep.out)
X1, X0, Z1, Z0 all come directly from dataprep object.
****************
searching for synthetic control unit
****************
****************
****************
MSPE (LOSS V): 4.399697
solution.v:
1e-10 0.00358062 0.2382334 0.3691776 1.218e-07 0.3890082
solution.w:
0.2349748 0.007323657 0.02010757 0.1897415 0.4884164 0.05943599
> Sys.time()-t0
Time difference of 14.36551 secs
The synth.data object is a panel with identifiers for the time periods and the units. We can transform it into a list using the function panel_to_ArCo_list. The code below shows the structure of the data, transforms it into a list, and runs the ArCo using the LASSO with the best model selected by the BIC.
> head(synth.data)
unit.num year name Y X1 X2 X3
1 2 1980 control.region.northeast 131.8 NA 21.5 NA
2 2 1981 control.region.northeast 128.7 0.2534060 24.1 NA
3 2 1982 control.region.northeast 127.4 0.2512521 23.8 NA
4 2 1983 control.region.northeast 128.0 0.2489048 21.6 NA
5 2 1984 control.region.northeast 123.1 0.2462865 23.9 17.9
6 2 1985 control.region.northeast 125.8 0.2434099 22.9 18.1
>
> # = Adjust the synth.data panel to be compatible with fitArCo = #
> ArCodata=panel_to_ArCo_list(synth.data,"year","unit.num",c("Y","X2"))
> ArCodata=lapply(ArCodata,function(x)x[as.character(1984:1996),])
>
> # = Run the ArCo = #
> t0 = Sys.time()
> ArCo3=fitArCo(ArCodata,fn,p.fn,treated.unit = 2,t0 = 9)
> Sys.time()-t0
Time difference of 0.01013398 secs
> ArCo3$delta[1,]
LB delta UB
7.816165 16.559440 25.302714
> ArCo3$p.value
Joint Y X2
0.0006722134 0.0002055471 0.0233194437
>
> # = Plot the ArCo and the SC = #
> synthcf=dataprep.out$Y0plot%*%synth.out$solution.w
> plot(dataprep.out$Y1plot,type="l",ylim=c(80,180),ylab="Y",xlab="Time")
> lines(synthcf,col=2)
> abline(v=8,col=4,lty=2)
> lines(c(fitted(ArCo3)[,1], ArCo3$cf[,1]),col=4)
> legend("topleft",legend=c("Y","Synth","ArCo"),col=c(1,2,4),cex = 1,
+ seg.len = 1,bty = "n",lty=1)
The ArCo took 0.01 seconds to run with the LASSO, while the SC took 14 seconds. Figure 6 shows that the two methods produced similar results. We used 1992 as the first treated observation (\(t_0 = 9\) in the 1984-1996 sample).
The second example in the Synth package estimates the counterfactual of the GDP of the Basque Country using other Spanish regions as controls. The main assumption is that the terrorism in the 1970s had a negative impact on GDP. The dataset has GDP, investment and several other variables, but we can only use the first two because of missing observations. Most of the other variables are related to sectoral production and education. For more details use ?basque.
> # = Exact code reproduced from the second example in the ?synth documentation = #
> # = Comments and printed outputs were omitted = #
> data("basque")
> dataprep.out <-
+ dataprep(
+ foo = basque
+ ,predictors= c("school.illit",
+ "school.prim",
+ "school.med",
+ "school.high",
+ "school.post.high"
+ ,"invest"
+ )
+ ,predictors.op = c("mean")
+ ,dependent = c("gdpcap")
+ ,unit.variable = c("regionno")
+ ,time.variable = c("year")
+ ,special.predictors = list(
+ list("gdpcap",1960:1969,c("mean")),
+ list("sec.agriculture",seq(1961,1969,2),c("mean")),
+ list("sec.energy",seq(1961,1969,2),c("mean")),
+ list("sec.industry",seq(1961,1969,2),c("mean")),
+ list("sec.construction",seq(1961,1969,2),c("mean")),
+ list("sec.services.venta",seq(1961,1969,2),c("mean")),
+ list("sec.services.nonventa",seq(1961,1969,2),c("mean")),
+ list("popdens",1969,c("mean")))
+ ,treatment.identifier = 17
+ ,controls.identifier = c(2:16,18)
+ ,time.predictors.prior = c(1964:1969)
+ ,time.optimize.ssr = c(1960:1969)
+ ,unit.names.variable = c("regionname")
+ ,time.plot = c(1955:1997)
+ )
>
> dataprep.out$X1["school.high",] <-
+ dataprep.out$X1["school.high",] +
+ dataprep.out$X1["school.post.high",]
> dataprep.out$X1 <-
+ as.matrix(dataprep.out$X1[
+ -which(rownames(dataprep.out$X1)=="school.post.high"),])
> dataprep.out$X0["school.high",] <-
+ dataprep.out$X0["school.high",] +
+ dataprep.out$X0["school.post.high",]
> dataprep.out$X0 <-
+ dataprep.out$X0[
+ -which(rownames(dataprep.out$X0)=="school.post.high"),]
>
> lowest <- which(rownames(dataprep.out$X0)=="school.illit")
> highest <- which(rownames(dataprep.out$X0)=="school.high")
>
> dataprep.out$X1[lowest:highest,] <-
+ (100 * dataprep.out$X1[lowest:highest,]) /
+ sum(dataprep.out$X1[lowest:highest,])
> dataprep.out$X0[lowest:highest,] <-
+ 100 * scale(dataprep.out$X0[lowest:highest,],
+ center=FALSE,
+ scale=colSums(dataprep.out$X0[lowest:highest,])
+ )
>
> t0 = Sys.time()
> synth.out <- synth(data.prep.obj = dataprep.out)
> Sys.time()-t0
Time difference of 5.340531 secs
Since the GDP is non-stationary, it was first differenced to estimate the ArCo. The code below prepares the data, which also comes as a panel, estimates the ArCo using the LASSO with the BIC to select the best model, and rebuilds the GDP in levels to compare with the SC. Since we do not know exactly when in the 1970s the effects on the GDP began, we also estimated the intervention period with estimate_t0.
> # = Adjust the panel to fitArCo = #
> leveldata=panel_to_ArCo_list(basque,"year","regionno",c("gdpcap","invest"))
> # = GDP should be differentiated (delta(log)) = #
> diffdata=leveldata
> diffdata$gdpcap=diff(log(diffdata$gdpcap),1)
> diffdata$gdpcap=rbind(NA,diffdata$gdpcap)
> # = adjust sample size = #
> diffdata=lapply(diffdata,function(x)x[-c(1:10,42,43),])
>
> # = Estimate intervention period = #
> t0=Sys.time()
> t0.4=estimate_t0(diffdata,fn,p.fn,treated.unit = 17)
> # = Estimate ArCo = #
> ArCo4=fitArCo(diffdata,fn,p.fn,17,t0.4$t0)
> Sys.time()-t0
Time difference of 0.1583807 secs
> ArCo4$delta
LB delta UB
gdpcap -0.01903297 -0.007195488 0.004641989
invest 0.61542806 1.789351051 2.963274038
> ArCo4$p.value
Joint gdpcap invest
0.009720640 0.233505642 0.002812938
>
> # = Rebuild lvl counterfactual GDP = #
> y0=leveldata$gdpcap[10,17] # = GDP level in 1964, the year before the adjusted sample = #
> ArCocf=y0*exp(cumsum(c(fitted(ArCo4)[,1],ArCo4$cf[,1])))
> y.lvl=leveldata$gdpcap[as.character(1965:1995),17]
>
> # = Extract synth counterfactual and plot = #
> synthcf=(dataprep.out$Y0plot%*%synth.out$solution.w)[as.character(1965:1995),]
> xax=rownames(diffdata$gdpcap)
> plot(xax, y.lvl,type="l",ylab="Y",xlab="Time",ylim=c(0,12),lwd=3)
> lines(xax,ArCocf,col=4)
> lines(xax,synthcf,col=2)
> abline(v=xax[t0.4$t0],col=4,lty=2)
> legend("topleft",legend=c("Y","Synth","ArCo"),col=c(1,2,4),lty=1,
+ lwd=c(3,1,1),cex=1,seg.len = 1,bty="n")
The $delta output shows that the intervention was not significant for the differenced GDP; however, Figure 7 shows that there was a considerable impact on its level. The difference estimated by the ArCo is even bigger than the one estimated by the SC. It also seems that the ArCo counterfactual follows the pre-intervention trend, while the SC simply shifts the observed trend after the intervention.
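As a purely descriptive complement (not a formal test), the average gap in levels between the observed GDP and the ArCo counterfactual after the estimated intervention date can be computed from the objects created above:

> # average post-intervention gap in levels (descriptive only)
> post = t0.4$t0:length(y.lvl)
> mean(y.lvl[post] - ArCocf[post])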
In this paper we have introduced the ArCo package, which consists of a set of R functions implementing the ArCo method proposed by Carvalho et al. (2016a), a generalization of the Synthetic Control method of Abadie and Gardeazabal (2003) and Abadie et al. (2010) and of the panel approach put forward by Hsiao et al. (2012).
The ArCo method is a two-step methodology: in the first step, a counterfactual is estimated based on a large, possibly high-dimensional, panel of time series from untreated peers. The package is very flexible and the user can choose any model of her preference, such as, for example, LASSO regressions, random forests, boosted trees, or neural networks. In the second stage, the average treatment effect is computed and a test for the null of no intervention effect is provided.
We believe that this R package will serve as a useful addition to the econometrics toolbox for conducting counterfactual analysis in comparative case studies.
We would like to thank an anonymous referee for very helpful and detailed comments.