In this article the package High-dimensional Metrics (hdm) is introduced. It is a collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. Efficient estimators and uniformly valid confidence intervals are provided for regression coefficients on target variables (e.g., treatment or policy variables) in a high-dimensional approximately sparse regression model, for the average treatment effect (ATE) and the average treatment effect for the treated (ATET), as well as for extensions of these parameters to the endogenous setting. Theoretically grounded, data-driven methods for selecting the penalization parameter in Lasso regressions under heteroscedastic and non-Gaussian errors are implemented. Moreover, joint/simultaneous confidence intervals for regression coefficients of a high-dimensional sparse regression are implemented. Data sets which have been used in the literature and which might be useful for classroom demonstration and for testing new estimators are included.
Analysis of high-dimensional models, models in which the number of parameters to be estimated is large relative to the sample size, is becoming increasingly important. Such models arise naturally in readily available high-dimensional data which have many measured characteristics per individual observation, as in, for example, large survey data sets, scanner data, and text data. Such models also arise naturally even in data with a small number of measured characteristics when the exact functional form with which the observed variables enter the model is unknown, and we create many technical variables, a dictionary, from the raw characteristics. Examples of this scenario include semiparametric models with nonparametric nuisance functions. More generally, models with many parameters relative to the sample size often arise when attempting to model complex phenomena.
For R, many packages for the estimation of high-dimensional models are already available. For example, glmnet (Friedman, T. Hastie, and R. Tibshirani 2010) and lars (Hastie and B. Efron 2013) are popular for Lasso estimation. The section "Regularized & Shrinkage Methods" in the task view on "Machine Learning and Statistical Learning" contains further implementations of Lasso and related methods.
The methods which are implemented in this package (hdm) are chiefly distinct from already available methods in other packages in offering the following four major features:
First, we provide a version of Lasso regression that expressly handles and allows for non-Gaussian and heteroscedastic errors.
Second, we implement a theoretically grounded, data-driven choice of the penalty level $\lambda$ in the Lasso regressions (function rlasso). The prefix r in function names should underscore this. In high-dimensional settings, cross-validation is very popular, but it lacks a theoretical justification for use in the present context, and some theoretical proposals for the choice of $\lambda$ are only valid for special cases such as homoscedastic models.
Third, we provide efficient estimators and uniformly valid
confidence intervals for various low-dimensional causal/structural
parameters appearing in high-dimensional approximately sparse
models.
Fourth, joint/simultaneous confidence intervals for estimated coefficients in high-dimensional approximately sparse models are provided, based on the methods and theory developed in Belloni, V. Chernozhukov, and K. Kato (2014). They propose uniformly valid confidence regions for regression coefficients in high-dimensional sparse Z-estimation problems, which include median, mean, and many other regression problems as special cases. In this article we apply this method to the coefficients of a Lasso regression and highlight it with an empirical example.
In the following we give a short overview of the functionality of the package with simple examples. A detailed introduction on how to use the functions is given in the accompanying vignette.
In this section we give a very short introduction to the literature with a focus on the methods which are implemented in the package.
Lasso under heteroscedastic and non-Gaussian errors was analysed in Belloni, D. Chen, V. Chernozhukov, and C. Hansen (2012). Post-Lasso, i.e. a least squares fit with the variables selected by a Lasso regression, is introduced and analysed in Belloni and V. Chernozhukov (2013). Inference for a low-dimensional component of a potentially high-dimensional vector has been conducted in Belloni, V. Chernozhukov, and C. Hansen (2014). Belloni, V. Chernozhukov, and C. Hansen (2014) also consider inference on average treatment effects in a heterogeneous treatment effect setting after selection amongst high-dimensional controls. Inference in high-dimensional settings is enabled by the so-called orthogonality condition. The systematic development of the orthogonality condition for inference on low-dimensional parameters in high-dimensional settings is given in Chernozhukov, C. Hansen, and M. Spindler (2015b).
Instrumental variables estimation is a central topic in Econometrics and also becoming more popular in fields such as Biostatistics, Epidemiology, and Sociology. Good introductions to instrumental variables and treatment effects are the books Angrist and J.-S. Pischke (2008) and Imbens and D. B. Rubin (2015). The case of selection on high-dimensional instrumental variables is given in Belloni, D. Chen, V. Chernozhukov, and C. Hansen (2012), the case of selection on the instruments and control variables in Chernozhukov, C. Hansen, and M. Spindler (2015a). For further discussion of estimation of treatment effects in a high-dimensional setting including cases with endogenous treatment assignment, we refer to Belloni, V. Chernozhukov, I. Fernández-Val, and C. Hansen (2013).
An important feature of our package is that it allows Lasso estimation under heteroscedastic and non-Gaussian errors. This distinguishes the package hdm from other already available software implementations of Lasso. As an additional benefit, the theoretically grounded choice of the penalty does not require cross-validation, which can lead to considerable savings in computation time in many applications.
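To illustrate the computational point, one can compare a single rlasso fit with cross-validated glmnet on simulated data such as the X and Y generated in the examples below; this is a rough sketch, not a formal benchmark:

library(hdm)
library(glmnet)
# X, Y: design matrix and outcome as generated in the simulated example below
system.time(rlasso(X, Y))     # one fit with the theoretically grounded penalty
system.time(cv.glmnet(X, Y))  # 10-fold cross-validation refits the model many times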
Consider high-dimensional approximately sparse linear regression models. These models have a large number of regressors $p$, possibly much larger than the sample size $n$, but only a relatively small number $s = o(n)$ of these regressors suffice to capture the main features of the regression function. The model reads
$$y_i = x_i'\beta_0 + \varepsilon_i, \qquad i = 1, \ldots, n.$$
The model can be exactly sparse, namely $\|\beta_0\|_0 \leq s = o(n)$, or approximately sparse, meaning that the regression function can be approximated by $x_i'\beta_0$ with a sparse $\beta_0$ up to an approximation error that is small relative to the estimation error.

In order to get theoretically justified performance guarantees, we consider the Lasso estimator with data-driven penalty loadings:
$$\hat{\beta} \in \arg\min_{\beta} \; \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2 + \frac{\lambda}{n}\|\hat{\Psi}\beta\|_1,$$
where $\hat{\Psi} = \mathrm{diag}(\hat{\psi}_1, \ldots, \hat{\psi}_p)$ is a diagonal matrix of penalty loadings. Regularization by the $\ell_1$-norm prevents overfitting, but it also shrinks the estimated coefficients towards zero, introducing a bias.

A crucial matter is the choice of the penalization parameter $\lambda$. In high-dimensional settings, cross-validation is very popular in practice but lacks theoretical justification and so may not provide such a performance guarantee. In sharp contrast, the choice of the penalization parameter $\lambda$ implemented in the package is theoretically grounded: in the default, $X$-independent case it is set to
$$\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p)),$$
with the constants $c = 1.1$ and $\gamma = 0.1/\log(n)$ by default. In the case of homoscedasticity, we set the penalty loadings to $\hat{\psi}_j = \hat{\sigma}\sqrt{\mathbb{E}_n[x_{ij}^2]}$, where $\mathbb{E}_n$ denotes the empirical average and $\hat{\sigma}$ an estimate of the standard deviation of the errors. In the case of heteroskedasticity, the loadings are set to $\hat{\psi}_j = \sqrt{\mathbb{E}_n[x_{ij}^2 \hat{\varepsilon}_i^2]}$. Estimation proceeds by iteration: the residuals $\hat{\varepsilon}_i$ are initialized from a preliminary fit, the loadings are computed, the Lasso estimate is updated, and these steps are repeated a small, fixed number of times.
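To make these formulas concrete, the following minimal R sketch computes the X-independent penalty level and the heteroscedastic penalty loadings; the helper names penalty.level and penalty.loadings are illustrative and not part of hdm, which performs these computations internally in rlasso:

# X-independent penalty level: lambda = 2 c sqrt(n) Phi^{-1}(1 - gamma/(2p))
penalty.level <- function(n, p, c = 1.1, gamma = 0.1 / log(n)) {
  2 * c * sqrt(n) * qnorm(1 - gamma / (2 * p))
}
# heteroscedastic penalty loadings: psi_j = sqrt(E_n[x_ij^2 * e_i^2])
penalty.loadings <- function(X, resid) {
  sqrt(colMeans(X^2 * resid^2))
}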
The core of the package consists of the functions rlasso and rlassoEffects. They allow estimation of a Lasso regression under heteroscedastic and non-Gaussian errors and inference on a set of prespecified, low-dimensional objects. As Lasso regression introduces a shrinkage bias, a post-Lasso option, which can be described as Lasso estimation followed by a final refit with OLS including only the selected variables, is available.
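Conceptually, post-Lasso amounts to the following two steps; a sketch with hypothetical x and y, assuming the fitted object stores its estimates (intercept first) in the element coefficients. rlasso with post=TRUE performs the refit internally:

fit <- rlasso(y ~ x, post = FALSE)          # step 1: Lasso selects variables
sel <- which(fit$coefficients[-1] != 0)     # selected columns of x (drop intercept)
post.fit <- lm(y ~ x[, sel, drop = FALSE])  # step 2: OLS refit on the selection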
The function rlasso implements Lasso and post-Lasso. The default option is to use post-Lasso, post=TRUE. The user can also decide if an unpenalized intercept should be included (TRUE by default). LassoShooting.fit contains the computational algorithm that underlies the estimation procedure; it implements a version of the Shooting Lasso Algorithm (Fu 1998). The option penalty of the function rlasso allows different choices for the penalization parameter and loadings. It allows for homoscedastic or heteroscedastic errors, with default homoscedastic = FALSE. Moreover, the dependence structure of the design matrix can be taken into consideration for the calculation of the penalization parameter with X.dependent.lambda = TRUE. With the option lambda.start, a starting value for the penalty level can be set; this value is then used as penalty parameter, with independent design and heteroscedastic errors, to weight the regressors. The option penalty also allows setting the constants $c$ and $\gamma$ that enter the calculation of the penalty level, and further tuning parameters for the algorithm can be passed via the option control.
rlasso returns an object of S3 class rlasso for which methods such as predict, print, and summary are provided. The methods print and summary have the option all; by setting this option to FALSE, only the coefficients estimated to be non-zero are shown.
The function rlassoEffects
does inference for prespecified target
variables. Those can be specified either by the variable names, an
integer valued vector giving their position in x
, or by a logical
vector indicating the variables for which inference should be conducted.
It returns an object of S3 class rlassoEffects
for which the methods
summary
, print
, confint
, and plot
are provided. rlassoEffects
is a wrapper function for rlassoEffect
which does inference for a
single target regressor.
The function rlassoEffects can be used in the form rlassoEffects(x, y, index), where x is a matrix, y denotes the outcome variable, and index specifies the variables of x for which inference is conducted. This can be done by an integer vector (position of the variables), a logical vector, or the names of the variables. An alternative usage is rlassoEffects(formula, data, I), where I is a one-sided formula which specifies the variables for which inference is conducted. For further details we refer to the help page of the function and to the following examples, where both usage patterns are shown.
For logistic regression the corresponding functions are implemented in an analogous manner, named rlassologit and rlassologitEffects.
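The logistic variants are not demonstrated in the examples below, so here is a small hedged sketch on simulated binary data; it assumes that rlassologit accepts the same formula interface as rlasso and stores its estimates in the element coefficients:

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n*p), ncol=p)
beta <- c(rep(1, 3), rep(0, p-3))
y <- rbinom(n, size = 1, prob = plogis(X %*% beta)) # sparse logit DGP
logit.fit <- rlassologit(y ~ X)
logit.fit$coefficients[logit.fit$coefficients != 0] # selected coefficients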
First, we generate a data set on which we demonstrate the basic functions:
> set.seed(12345)
> n <- 100 #sample size
> p <- 100 # number of variables
> s <- 3 # number of variables with non-zero coefficients
> X <- matrix(rnorm(n*p), ncol=p)
> beta <- c(rep(5,s), rep(0,p-s))
> Y <- X%*%beta + rnorm(n)
Next we use Lasso for fitting, show the results, and make predictions for the old and for new X-variables:
> lasso.reg <- rlasso(Y~X, post=FALSE, intercept=TRUE) # use Lasso, not post-Lasso
> # lasso.reg <- rlasso(X,Y, post=FALSE, intercept=TRUE) # alternative use
> summary(lasso.reg, all=FALSE)
Call:
rlasso.formula(formula = Y ~ X, post = FALSE, intercept = TRUE)
Post-Lasso Estimation: FALSE
Total number of variables: 100
Number of selected variables: 11
Residuals:
Min 1Q Median 3Q Max
-2.09008 -0.45801 -0.01237 0.50291 2.25098
Estimate
(Intercept) 0.057
1 4.771
2 4.693
3 4.766
13 -0.045
15 -0.047
16 -0.005
19 -0.092
22 -0.027
40 -0.011
61 0.114
100 -0.025
Residual standard error: 0.8039
Multiple R-squared: 0.9913
Adjusted R-squared: 0.9902
Joint significance test:
the sup score statistic for joint significance test is 64.02 with a p-value of 0
> yhat.lasso <- predict(lasso.reg) #in-sample prediction
> Xnew <- matrix(rnorm(n*p), ncol=p) # new X
> Ynew <- Xnew%*%beta + rnorm(n) #new Y
> yhat.lasso.new <- predict(lasso.reg, newdata=Xnew) #out-of-sample prediction
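A quick sketch of how the out-of-sample fit could be assessed is the mean squared prediction error on the new data:

mean((Ynew - yhat.lasso.new)^2) # out-of-sample MSE of the Lasso fit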
To reduce the shrinkage bias, we now use post-Lasso for fitting and prediction:
> post.lasso.reg <- rlasso(Y~X, post=TRUE, intercept=TRUE)
> summary(post.lasso.reg, all=FALSE)
Call:
rlasso.formula(formula = Y ~ X, post = TRUE, intercept = TRUE)
Post-Lasso Estimation: TRUE
Total number of variables: 100
Number of selected variables: 3
Residuals:
Min 1Q Median 3Q Max
-2.83981 -0.80339 0.02063 0.75573 2.30421
Estimate
(Intercept) 0.000
1 5.150
2 4.905
3 4.912
Residual standard error: 1.059
Multiple R-squared: 0.9855
Adjusted R-squared: 0.9851
Joint significance test:
the sup score statistic for joint significance test is 66.87 with a p-value of 0
> yhat.postlasso <- predict(post.lasso.reg) #in-sample prediction
> yhat.postlasso.new <- predict(post.lasso.reg, newdata=Xnew) #out-of-sample prediction
We can do inference on a set of variables of interest, e.g. the first, second, third, and the fiftieth:
> lasso.effect <- rlassoEffects(x=X, y=Y, index=c(1,2,3,50))
> print(lasso.effect)
Call:
rlassoEffects.default(x = X, y = Y, index = c(1, 2, 3, 50))
Coefficients:
V1 V2 V3 V50
5.0890 4.7781 4.8292 0.1384
> summary(lasso.effect)
[1] "Estimates and significance testing of the effect of target variables"
Estimate. Std. Error t value Pr(>|t|)
V1 5.0890 0.1112 45.781 <2e-16 ***
V2 4.7781 0.1318 36.264 <2e-16 ***
V3 4.8292 0.1314 36.752 <2e-16 ***
V50 0.1384 0.1122 1.234 0.217
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The confidence intervals for the coefficients are given by
> confint(lasso.effect)
2.5 % 97.5 %
V1 4.8710981 5.306834
V2 4.5198436 5.036332
V3 4.5716439 5.086726
V50 -0.0814167 0.358284
Moreover, simultaneous/joint confidence intervals can be calculated, which will be discussed in more detail later.
> confint(lasso.effect, joint = TRUE)
2.5 % 97.5 %
V1 4.8464078 5.3315239
V2 4.4315649 5.1246107
V3 4.4864563 5.1719131
V50 -0.1720149 0.4488822
Finally, we can also plot the estimated effects with their confidence intervals:
> plot(lasso.effect, main="Confidence Intervals")
A strength of the package is that it can also handle non-Gaussian and heteroscedastic errors, which we illustrate with a small example:
> Y <- X%*%beta + rnorm(n, sd=sin(X%*%beta)^2) # heteroscedastic errors
> Y <- X%*%beta + rt(n, df=5) # overwrites Y: non-Gaussian, t-distributed errors
> lasso.reg <- rlasso(Y~X, post=FALSE, intercept=TRUE) # use Lasso, not post-Lasso
> summary(lasso.reg, all=FALSE)
Call:
rlasso.formula(formula = Y ~ X, post = FALSE, intercept = TRUE)
Post-Lasso Estimation: FALSE
Total number of variables: 100
Number of selected variables: 8
Residuals:
Min 1Q Median 3Q Max
-2.51449 -0.76567 0.09798 0.80007 2.11780
Estimate
(Intercept) 0.032
1 4.959
2 4.686
3 4.671
8 0.083
21 0.010
36 0.047
58 -0.070
100 -0.133
Residual standard error: 1.051
Multiple R-squared: 0.9858
Adjusted R-squared: 0.9845
Joint significance test:
the sup score statistic for joint significance test is 66.87 with a p-value of 0
Belloni, V. Chernozhukov, and K. Kato (2014) provide uniformly valid joint/simultaneous confidence intervals, which are implemented in the package via the functions rlassoEffects and confint. To get joint confidence intervals, the option joint of the latter function has to be set to TRUE. In the next section we illustrate this with an empirical example, analysing the effect of gender on wage to quantify potential discrimination.
In Labour Economics an important question is how wages are related to the gender of the employee. We use US census data from the year 2012 to jointly analyse the effect of gender, and of interactions of other variables with gender, on wage. The dependent variable is the logarithm of the wage; the target variable is female (in combination with other variables). All other variables denote further socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables we refer to the help page.
First, we load and prepare the data.
> library(hdm)
> data(cps2012)
> X <- model.matrix( ~ -1 + female + female:(widowed + divorced + separated +
+ nevermarried + hsd08 + hsd911+ hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) +
+ (widowed + divorced + separated + nevermarried +
+ hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2, data=cps2012)
> dim(X)
[1] 29217 136
> X <- X[,which(apply(X, 2, var)!=0)] # exclude all constant variables
> dim(X)
[1] 29217 116
> index.gender <- grep("female", colnames(X))
> y <- cps2012$lnw
The estimates for the target parameters, i.e. all coefficients related to gender (either directly or via interaction with other variables), are calculated and summarized by the following commands:
> effects.female <- rlassoEffects(x=X, y=y, index=index.gender)
> summary(effects.female)
[1] "Estimates and significance testing of the effect of target variables"
Estimate. Std. Error t value Pr(>|t|)
female -0.154923 0.050162 -3.088 0.002012 **
female:widowed 0.136095 0.090663 1.501 0.133325
female:divorced 0.136939 0.022182 6.174 6.68e-10 ***
female:separated 0.023303 0.053212 0.438 0.661441
female:nevermarried 0.186853 0.019942 9.370 < 2e-16 ***
female:hsd08 0.027810 0.120914 0.230 0.818092
female:hsd911 -0.119335 0.051880 -2.300 0.021435 *
female:hsg -0.012890 0.019223 -0.671 0.502518
female:cg 0.010139 0.018327 0.553 0.580114
female:ad -0.030464 0.021806 -1.397 0.162405
female:mw -0.001063 0.019192 -0.055 0.955811
female:so -0.008183 0.019357 -0.423 0.672468
female:we -0.004226 0.021168 -0.200 0.841760
female:exp1 0.004935 0.007804 0.632 0.527139
female:exp2 -0.159519 0.045300 -3.521 0.000429 ***
female:exp3 0.038451 0.007861 4.891 1.00e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Finally, we estimate and plot confidence intervals, first "pointwise" and then the joint confidence intervals.
> joint.CI <- confint(effects.female, level = 0.95, joint = TRUE)
> joint.CI
2.5 % 97.5 %
female -0.244219905 -0.06562666
female:widowed -0.036833704 0.30902467
female:divorced 0.097104065 0.17677471
female:separated -0.066454223 0.11305975
female:nevermarried 0.149932792 0.22377417
female:hsd08 -0.230085134 0.28570576
female:hsd911 -0.215291089 -0.02337899
female:hsg -0.046383541 0.02060398
female:cg -0.023079946 0.04335705
female:ad -0.072370082 0.01144259
female:mw -0.035451308 0.03332443
female:so -0.043143587 0.02677690
female:we -0.043587197 0.03513494
female:exp1 -0.008802488 0.01867301
female:exp2 -0.239408202 -0.07963045
female:exp3 0.024602478 0.05229868
> plot(effects.female, joint=TRUE, level=0.95)
This analysis allows a closer look at how discrimination by gender is related to other socio-economic variables.
As a side remark, version 0.2 now also provides a formula interface for many functions, including rlassoEffects. Hence, the analysis can also be done more compactly as
> effects.female <- rlassoEffects(lnw ~ female + female:(widowed + divorced + separated +
+ nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) +
+ (widowed + divorced + separated + nevermarried +
+ hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2, data=cps2012,
+ I = ~ female + female:(widowed + divorced + separated + nevermarried +
+ hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3))
The one-sided formula I specifies the target variables for which inference is conducted.
The package contains six data sets which are made available for researchers. The growth data set contains the so-called Barro-Lee data (Barro and J.-W. Lee 1994). It contains macroeconomic information for a large set of countries over several decades to analyse the drivers of economic growth. Next, the package includes a data set on settler mortality and economic outcomes which was used in the literature to illustrate the effect of institutions on economic growth (Acemoglu, S. Johnson, and J. A. Robinson 2001). The data set pension contains information about the choice of 401(k) pension plans and savings behaviour in the United States (Chernozhukov and C. Hansen 2004). A data set for analysing the effect of judicial decisions regarding eminent domain on economic outcomes (e.g. house prices) is included; more information is given in Chen and J. Sethi (2010) and Belloni, D. Chen, V. Chernozhukov, and C. Hansen (2012). The BLP data set (Berry, J. Levinsohn, and A. Pakes 1995) stems from annual issues of the Automotive News Market Data Book and includes information on all models marketed during the period beginning in 1971 and ending in 1990, containing 2217 model/years from 997 distinct models. Finally, US census data (CPS2012) is included.
All data sets are described in their help pages where also further
references can be found.
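For instance, the Barro-Lee growth data can be loaded and inspected as follows (GrowthData is the name of the data set in the package):

library(hdm)
data(GrowthData)
dim(GrowthData)  # number of observations and variables
?GrowthData      # help page with variable descriptions and references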
Additionally, the package contains functions for estimation and
inference on structural parameters in a high-dimensional setting. An
important tool for estimation in the potential presence of endogeneity
is instrumental variable (IV) estimation. The function rlassoIV
implements methods for IV estimation in a high-dimensional setting. This
function is a wrapper for several methods. The user can specify if
selection shall be done on the instruments (z-variables), on the control
variables (x-variables), or on both. For the low-dimensional case where
no selection is done for the x- or z-variables rlassoIV
performs
classical 2SLS estimation by calling the function tsls
.
The goal of many empirical analyses is to understand and estimate the causal effect of a treatment, e.g. the effect of an endogenous binary treatment, on an outcome variable. The package provides the functions rlassoATE, rlassoATET, rlassoLATE, and rlassoLATET, which estimate the average treatment effect (ATE), the average treatment effect on the treated (ATET), the local average treatment effect (LATE), and the local average treatment effect on the treated (LATET), respectively.
These functions also allow calculation of treatment-effect standard errors with different bootstrap methods ("wild", "normal", "Bayes") in addition to the classical plug-in standard errors. Moreover, methods print, summary, and confint are available.
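As a hedged illustration of the bootstrap option, assuming the arguments bootstrap and nRep of rlassoLATE, and using the pension data (X, d, y, z) prepared in the 401(k) example below:

pension.late.boot <- rlassoLATE(X, d, y, z, bootstrap = "wild", nRep = 100)
summary(pension.late.boot) # standard errors now based on the wild bootstrap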
Both the family of rlassoIV functions and the family of functions for treatment effects, which are introduced in the next section, can be used with a formula interface or by handing over the prepared model matrices. Hence the general pattern for use with a formula is function(formula, data, ...), where the formula consists of two parts and is a member of the class Formula. These formulas are of the pattern y ~ d + x | x + z, where y is the outcome variable, x are exogenous variables, d endogenous variables (whether several are allowed depends on the concrete function), and z denotes the instrumental variables. A more primitive use of the functions is to simply hand over the required groups of variables as matrices: function(x=x, d=d, y=y, z=z). In some of the following examples both alternatives are displayed.
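Schematically, for a hypothetical data frame dat with outcome y, endogenous variable d, exogenous controls x1 and x2, and instruments z1 and z2, the two call patterns look as follows:

# formula interface: outcome ~ endogenous + exogenous | exogenous + instruments
fit1 <- rlassoIV(y ~ d + x1 + x2 | x1 + x2 + z1 + z2, data = dat)
# matrix interface
fit2 <- rlassoIV(x = as.matrix(dat[, c("x1", "x2")]), d = dat$d,
                 y = dat$y, z = as.matrix(dat[, c("z1", "z2")]))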
In this section, we briefly present how instrumental variables estimation is conducted in a high-dimensional setting. The eminent domain example, for which the data set is contained in the package, serves as an illustration. The underlying goal is to estimate the effect of pro-plaintiff decisions in cases of eminent domain (government’s takings of private property) on economic outcomes. The analysis of the effects of such decisions is complicated by the possible endogeneity between judicial decisions and potential economic outcomes. To address the potential endogeneity, we employ an instrumental variables strategy based on the random assignment of judges to the federal appellate panels that make the decisions. Because judges are randomly assigned to three-judge panels, the judges and their demographics are randomly assigned conditional on the distribution of characteristics of federal circuit court judges in a given circuit-year.
First, we load the data and construct the matrices with the controls (x), instruments (z), outcome (y), and treatment variables (d). Here we consider regional GDP as the outcome variable.
> data(EminentDomain)
> z <- EminentDomain$logGDP$z
> x <- EminentDomain$logGDP$x
> y <- EminentDomain$logGDP$y
> d <- EminentDomain$logGDP$d
As mentioned above, rlassoIV allows selection on the instruments (z-variables), on the control variables (x-variables), or on both. First, we estimate the model with selection on the instruments only.
> lasso.IV.Z <- rlassoIV(x=x, d=d, y=y, z=z, select.X=FALSE, select.Z=TRUE)
> # or lasso.IV.Z <- rlassoIVselectZ(x=x, d=d, y=y, z=z)
> summary(lasso.IV.Z)
[1] "Estimates and significance testing of the effect of target variables in the
IV regression model"
coeff. se. t-value p-value
d1 0.01543 0.01926 0.801 0.423
Finally, we do selection on both the x- and the z-variables.
> lasso.IV.XZ <- rlassoIV(x=x, d=d, y=y, z=z, select.X=TRUE, select.Z=TRUE)
> summary(lasso.IV.XZ)
Estimates and Significance Testing of the effect of target variables in the
IV regression model
coeff. se. t-value p-value
d1 -0.03488 0.15881 -0.22 0.826
> confint(lasso.IV.XZ)
2.5 % 97.5 %
d1 -0.3461475 0.2763868
Here we apply the treatment functions to 401(k) plan participation. Though it is clear that 401(k) plans are widely used as vehicles for retirement saving, their effect on assets is less clear. The key problem in determining the effect of participation in 401(k) plans on accumulated assets is saver heterogeneity coupled with nonrandom selection into participation states. In particular, it is generally recognized that some people have a higher preference for saving than others. Thus, it seems likely that those individuals with the highest unobserved preference for saving would be most likely to choose to participate in tax-advantaged retirement savings plans and would also have higher savings in other assets than individuals with lower unobserved saving propensity. This implies that conventional estimates that do not allow for saver heterogeneity and selection of the participation state will be biased upward, tending to overstate the actual savings effects of 401(k) and IRA participation.
Again, we start first with the data preparation. For a detailed
description of the data set we refer to the help page help(pension)
.
data(pension)
y <- pension$tw; d <- pension$p401; z <- pension$e401
X <- pension[,c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5",
"fsize", "hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")]
# simple model
xvar <- c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5",
"fsize", "hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")
xpart <- paste(xvar, collapse = "+")
form <- as.formula(paste("tw ~ ", paste(c("p401", xvar), collapse ="+"), "|",
paste(xvar, collapse = "+")))
formZ <- as.formula(paste("tw ~ ", paste(c("p401", xvar), collapse ="+"), "|",
paste(c("e401", xvar), collapse = "+")))
Now we can compute the estimates of the target treatment effect parameters:
> #pension.ate = rlassoATE(X,d,y)
> pension.ate = rlassoATE(form, data = pension)
> summary(pension.ate)
Estimation and significance testing of the treatment effect
Type: ATE
Bootstrap: not applicable
coeff. se. t-value p-value
TE 10490 1920 5.464 4.67e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> #pension.atet <- rlassoATET(X,d,y)
> pension.atet <- rlassoATET(form, data = pension)
> summary(pension.atet)
Estimation and significance testing of the treatment effect
Type: ATET
Bootstrap: not applicable
coeff. se. t-value p-value
TE 11810 2844 4.152 3.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> pension.late <- rlassoLATE(X,d,y,z)
> #pension.late <- rlassoLATE(formZ, data=pension)
> summary(pension.late)
Estimation and significance testing of the treatment effect
Type: LATE
Bootstrap: not applicable
coeff. se. t-value p-value
TE 12189 2734 4.458 8.27e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> pension.latet <- rlassoLATET(X,d,y,z)
> #pension.latet <- rlassoLATET(formZ, data=pension)
> summary(pension.latet)
Estimation and significance testing of the treatment effect
Type: LATET
Bootstrap: not applicable
coeff. se. t-value p-value
TE 12687 3590 3.534 0.00041 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results are summarized in Table 1. We see that in this example the estimates are quite similar, while the standard errors differ.
| | Estimate | Std. Error |
|---|---|---|
| ATE | 10490.07 | 1919.99 |
| ATET | 11810.45 | 2844.33 |
| LATE | 12188.66 | 2734.12 |
| LATET | 12686.87 | 3590.09 |
A core problem in Economics is to estimate the effect of price on the demand for products. This is usually impeded by the fact that prices are not exogenously given but determined by demand and supply in a simultaneous system of equations, introducing endogeneity. A possible way to solve this problem is to employ instrumental variables, a technique which is also becoming more popular outside of Economics. The results found here are a replication in R of the results in (Chernozhukov, C. Hansen, and M. Spindler 2015a) and (Chernozhukov, C. Hansen, and M. Spindler 2015b).
Here we are interested in estimating the coefficients in a simple logit model of demand for automobiles using market share data. Our example is based on the data and the most basic strategy from (Berry, J. Levinsohn, and A. Pakes 1995). The goal is to estimate the effect of price on demand, i.e. on the market share. Specifically,
$$\log(s_{jt}) - \log(s_{0t}) = \alpha_0 p_{jt} + x_{jt}'\beta_0 + \varepsilon_{jt},$$
where $s_{jt}$ is the market share of product $j$ in market $t$, $s_{0t}$ is the share of the outside option, $p_{jt}$ denotes price, $x_{jt}$ are observed product characteristics, and $\varepsilon_{jt}$ is an unobserved demand shock that may be correlated with price. In our example, we use the same set of product characteristics ($x$-variables) as BLP: a constant, horsepower per weight (hpwt), an air conditioning dummy (air), miles per dollar (mpd), and vehicle size (space).
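As a side note, the dependent variable is the log share difference; assuming the BLP data frame contains the market share (share) and the share of the outside option (outshr), it could be reconstructed as follows (for illustration only, since the data set already provides y):

y.check <- log(BLP$BLP$share) - log(BLP$BLP$outshr)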
We also adopt the argument from (Berry, J. Levinsohn, and A. Pakes 1995) to form our potential instruments. BLP argue that characteristics of other products will satisfy an exclusion restriction, so that sums of the characteristics of other products, those produced by the same firm and those produced by rival firms, may be used as instruments for price.
While the choice of the baseline instruments and controls is motivated
by good intuition and economic theory, it should be noted that theory
does not clearly state which product characteristics or instruments
should be used in the model. Theory also fails to indicate the
functional form with which any such variables should enter the model.
The high-dimensional methods outlined above offer one strategy to help address these concerns, complementing the economic intuition motivating the baseline controls and instruments. As an illustration, we consider an expanded set of controls and instruments. We augment the set of potential controls with all first-order interactions of the baseline variables, quadratics and cubics in all continuous baseline variables, and a time trend, which yields a total of 24 control variables and 48 potential instruments.
The data set is included in the package and can be accessed in the following way:
data(BLP)
BLPData <- BLP$BLP
A detailed description of the data is given in (Berry, J. Levinsohn, and A. Pakes 1995) and also on the help page.
In the baseline model we consider five x-variables and ten instrumental variables as described above. First we process the data; in particular, we construct the design matrices for the x- and z-variables. The matrix of instruments can be built with the (internal) function constructIV in the package hdm.
attach(BLPData)
X <- as.matrix(cbind(1, BLPData[,c("hpwt", "air", "mpd", "space")]))
Z <- BLP$Z
#Z <- hdm:::constructIV(firm.id, cdid, id, X)
With the baseline x- and z-variables, we estimate the price effect with OLS, with 2SLS (function tsls), and finally with selection on the x- and z-variables:
ols.reg <- lm(y~ price + hpwt + air + mpd + space, data =BLPData)
tsls.reg <- tsls(x=X, d=price, y=y, z=Z, intercept =FALSE, homoscedastic = FALSE)
lasso.reg <- rlassoIV(x=X[,-1], y=y, z=Z, d=price, select.X=TRUE, select.Z=TRUE,
intercept = TRUE)
The results are summarized in the table below:
| Method | Price coefficient | Standard error |
|---|---|---|
| Baseline OLS | -0.089 | 0.004 |
| Baseline TSLS | -0.136 | 0.012 |
| Baseline TSLS with Lasso selection | -0.174 | 0.013 |
Next, we estimate the augmented model, which in total has 24 controls and 48 instruments. First, we construct the data needed to estimate the augmented model. The set of augmented IVs can be accessed by
augZ <- BLP$augZ
tu <- trend/19
mpdu <- mpd/7
spaceu <- space/2
augX = cbind(1, hpwt, air, mpdu, spaceu, tu, hpwt^2, hpwt^3, mpdu^2, mpdu^3,
spaceu^2, spaceu^3, tu^2, tu^3, hpwt*air, mpdu*air, spaceu*air, tu*air,
hpwt*mpdu, hpwt*spaceu, hpwt*tu, mpdu*spaceu, mpdu*tu, spaceu*tu)
colnames(augX) <- c("constant", "hpwt", "air", "mpdu", "spaceu", "tu", "hpwt^2", "hpwt^3",
"mpdu^2", "mpdu^3", "spaceu^2", "spaceu^3", "tu^2", "tu^3", "hpwt*air", "mpdu*air",
"spaceu*air", "tu*air", "hpwt*mpdu", "hpwt*spaceu", "hpwt*tu", "mpdu*spaceu", "mpdu*tu",
"spaceu*tu")
# augZ <- hdm:::constructIV(firm.id, cdid, id, augX) # construction of augmented set of IVs
Next, we redo the analysis with the augmented set of variables:
ols.reg <- lm(y~ -1 + cbind(price, augX))
tsls.reg <- tsls(x=augX, d=price, y=y, z=augZ, intercept =FALSE, homoscedastic = FALSE)
lasso.reg <- rlassoIV(x=augX[,-1], y=y, z=augZ, d=price, select.X=TRUE, select.Z=TRUE,
intercept = TRUE)
The results for the augmented model are presented in the table:
| Method | Price coefficient | Standard error |
|---|---|---|
| Augmented OLS | -0.099 | 0.004 |
| Augmented TSLS | -0.127 | 0.008 |
| Augmented TSLS with Lasso selection | -0.286 | 0.02 |
We report estimation results from the baseline and the augmented setting in the tables above. All estimations give a negative effect of price on demand, as expected, but they differ considerably in the size of the effect. In the augmented model we get the strongest influence of price on the market share, and we see that the choice of instruments is an important issue.
The package hdm contains methods for estimation and inference in a high-dimensional setting. We have presented a short overview of the functions implemented. The package also contains data sets which might be useful for classroom presentations and as applications for newly developed estimators. More examples and a short introduction to the underlying theoretical concepts are provided in the accompanying vignette. It is planned to extend the functionality and to improve the performance of the functions in subsequent versions of the package.