In this article the package hdm (High-Dimensional Metrics) is introduced. It is a collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. Efficient estimators and uniformly valid confidence intervals are provided for regression coefficients on target variables (e.g., treatment or policy variables) in a high-dimensional approximately sparse regression model, for the average treatment effect (ATE) and the average treatment effect for the treated (ATET), as well as for extensions of these parameters to the endogenous setting. Theoretically grounded, data-driven methods for selecting the penalization parameter in Lasso regressions under heteroscedastic and non-Gaussian errors are implemented. Moreover, joint/simultaneous confidence intervals for regression coefficients of a high-dimensional sparse regression are implemented. Data sets which have been used in the literature and which may be useful for classroom demonstrations and for testing new estimators are included.
Analysis of high-dimensional models, models in which the number of parameters to be estimated is large relative to the sample size, is becoming increasingly important. Such models arise naturally in readily available high-dimensional data which have many measured characteristics available per individual observation as in, for example, large survey data sets, scanner data, and text data. Such models also arise naturally even in data with a small number of measured characteristics in situations where the exact functional form with which the observed variables enter the model is unknown, and we create many technical variables, a dictionary, from the raw characteristics. Examples of this scenario include semiparametric models with nonparametric nuisance functions. More generally, models with many parameters relative to the sample size often arise when attempting to model complex phenomena.
For R, many packages for estimating high-dimensional models are already available. For example, glmnet (Friedman, T. Hastie, and R. Tibshirani 2010) and lars (Hastie and B. Efron 2013) are popular for Lasso estimation. The section "Regularized & Shrinkage Methods" in the CRAN task view "Machine Learning and Statistical Learning" contains further implementations of Lasso and related methods.
The methods implemented in this package (hdm) are chiefly distinct from those already available in other packages in offering the following four major features:
First, we provide a version of Lasso regression that expressly handles and allows for non-Gaussian and heteroscedastic errors.
Second, we implement a theoretically grounded, data-driven choice of the penalty level \(\lambda\) in the Lasso regressions. To underscore this choice, we call the Lasso implementation in this package "rigorous" Lasso (`rlasso`); the prefix `r` in the function names reflects this. Cross-validation is very popular in high-dimensional settings, but it lacks a theoretical justification for use in the present context, and some theoretical proposals for the choice of \(\lambda\) are often not feasible. Moreover, the theoretically grounded, data-driven choice makes cross-validation, which is time-consuming particularly in large data sets, redundant.
Third, we provide efficient estimators and uniformly valid confidence intervals for various low-dimensional causal/structural parameters appearing in high-dimensional approximately sparse models. For example, we provide efficient estimators and uniformly valid confidence intervals for a regression coefficient on a target variable (e.g., a treatment or policy variable) in a high-dimensional sparse regression model. Target variable in this context means the object of interest, e.g. a prespecified regression coefficient. We also provide estimates and confidence intervals for the average treatment effect (ATE) and the average treatment effect for the treated (ATET), as well as extensions of these parameters to the endogenous setting.
Fourth, joint/simultaneous confidence intervals for estimated coefficients in high-dimensional approximately sparse models are provided, based on the methods and theory developed in Belloni, V. Chernozhukov, and K. Kato (2014). They propose uniformly valid confidence regions for regression coefficients in high-dimensional sparse Z-estimation problems, which include median, mean, and many other regression problems as special cases. In this article we apply this method to the coefficients of a Lasso regression and highlight it with an empirical example.
In the following we give a short overview of the functionality of the package with simple examples. A detailed introduction on how to use the functions is given in the accompanying vignette.
In this section we give a very short introduction to the literature with a focus on the methods which are implemented in the package.
Lasso under heteroscedastic and non-Gaussian errors was analysed in Belloni, D. Chen, V. Chernozhukov, and C. Hansen (2012). Post-Lasso, i.e. a least squares fit with the variables selected by a Lasso regression, is introduced and analysed in Belloni and V. Chernozhukov (2013). Inference for a low-dimensional component of a potentially high-dimensional vector has been conducted in Belloni, V. Chernozhukov, and C. Hansen (2014). Belloni, V. Chernozhukov, and C. Hansen (2014) also consider inference on average treatment effects in a heterogeneous treatment effect setting after selection amongst high-dimensional controls. Inference in high-dimensional settings is enabled by the so-called orthogonality condition. The systematic development of the orthogonality condition for inference on low-dimensional parameters in high-dimensional settings is given in Chernozhukov, C. Hansen, and M. Spindler (2015b).
Instrumental variables estimation is a central topic in Econometrics and also becoming more popular in fields such as Biostatistics, Epidemiology, and Sociology. Good introductions to instrumental variables and treatment effects are the books Angrist and J.-S. Pischke (2008) and Imbens and D. B. Rubin (2015). The case of selection on high-dimensional instrumental variables is given in Belloni, D. Chen, V. Chernozhukov, and C. Hansen (2012), the case of selection on the instruments and control variables in Chernozhukov, C. Hansen, and M. Spindler (2015a). For further discussion of estimation of treatment effects in a high-dimensional setting including cases with endogenous treatment assignment, we refer to Belloni, V. Chernozhukov, I. Fernández-Val, and C. Hansen (2013).
An important feature of our package is that it allows Lasso estimation under heteroscedastic and non-Gaussian errors. This distinguishes hdm from other available software implementations of Lasso. As an additional benefit, the theoretically grounded choice of the penalty does not require cross-validation, which can lead to considerable savings in computation time in many applications.
Consider high dimensional approximately sparse linear regression models. These models have a large number of regressors \(p\), possibly much larger than the sample size \(n\), but only a relatively small number \(s =o(n)\) of these regressors are important for capturing accurately the main features of the regression function. The latter assumption makes it possible to estimate these models effectively by searching for approximately the right set of regressors.
The model reads \[y_i = x_i' \beta_0 + \varepsilon_i, \quad \mathbb{E}[\varepsilon_i x_i]=0, \quad \beta_0 \in \mathbb{R}^p, \quad i=1,\ldots,n,\] where \(y_i\) are observations of the response variable, \(x_i=(x_{i,1}, \ldots, x_{i,p})'\) are observations of \(p\)-dimensional regressors, and \(\varepsilon_i\) are centered disturbances, where possibly \(p \gg n\). Assume that the data sequence is i.i.d. for the sake of exposition, although the framework covered is considerably more general. An important point is that the errors \(\varepsilon_i\) may be non-Gaussian or heteroskedastic (Belloni, D. Chen, V. Chernozhukov, and C. Hansen 2012).
The model can be exactly sparse, namely \[\| \beta_0\|_0 \leq s = o(n),\] or approximately sparse, namely that the values of coefficients, sorted in decreasing order, \((| \beta_0|_{(j)})_{j=1}^p\) obey, \[| \beta_0|_{(j)} \leq \mathsf{A} j^{-\mathsf{a}(\beta_0)}, \quad \mathsf{a}(\beta_0)>1/2, \quad j=1,...,p.\] An approximately sparse model can be well-approximated by an exactly sparse model with sparsity index \[s \propto n^{1/(2 \mathsf{a}(\beta_0))}.\]
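To make the decay condition concrete, the following small R snippet generates a coefficient vector obeying the bound with equality and computes the implied effective sparsity index; the values \(\mathsf{A}=5\) and \(\mathsf{a}=0.75\) are illustrative choices of ours, not taken from the package.

p <- 100; A <- 5; a <- 0.75
beta <- A * (1:p)^(-a)  # coefficients, sorted in decreasing order, obey the decay bound
n <- 100
s <- n^(1 / (2 * a))    # effective sparsity index s ~ n^(1/(2a)), here about 21.5
s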
In order to get theoretically justified performance guarantees, we consider the Lasso estimator with data-driven penalty loadings: \[\hat \beta = \arg \min_{\beta \in \mathbb{R}^p} \mathbb{E}_n [(y_i - x_i' \beta)^2] + \frac{\lambda}{n} ||\hat{\Psi} \beta||_1,\] where \(||\beta||_1=\sum_{j=1}^p |\beta_j|\), \(\hat{\Psi}=\mathrm{diag}(\hat{\psi}_1,\ldots,\hat{\psi}_p)\) is a diagonal matrix of penalty loadings, and \(\mathbb{E}_n\) abbreviates the empirical average. The penalty loadings are chosen to ensure basic equivariance of coefficient estimates to rescaling of \(x_{i,j}\) and can also be chosen to address heteroskedasticity in the model errors. We discuss the choice of \(\lambda\) and \(\hat \Psi\) below.
Regularization by the \(\ell_1\)-norm naturally helps the Lasso estimator to avoid overfitting, but it also shrinks the fitted coefficients towards zero, causing a potentially significant bias. In order to remove some of this bias, consider the Post-Lasso estimator that applies ordinary least squares to the model \(\hat{T}\) selected by Lasso, formally, \[\hat{T} = \text{support}(\hat{\beta}) = \{ j \in \{ 1, \ldots,p\}: \lvert \hat{\beta}_j \rvert >0 \}.\] The Post-Lasso estimate is then defined as \[\tilde{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \ \mathbb{E}_n \left[ \left( y_i - \sum_{j=1}^p x_{i,j} \beta_j \right)^2 \right] \quad \text{such that} \quad \beta_j=0 \ \text{ if } \hat \beta_j = 0, \quad \forall j.\] In words, the estimator is ordinary least squares applied to the data after removing the regressors that were not selected by Lasso. The Post-Lasso estimator was introduced and analysed in Belloni and V. Chernozhukov (2013).
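The following sketch illustrates the two-step logic on simulated data. Note that `rlasso` with `post=TRUE` (introduced below) performs this refit internally; the manual version here only serves to make the definition concrete, and assumes the fitted object stores its coefficient vector in `$coefficients` with the intercept first.

library(hdm)
set.seed(42)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), ncol = p)
y <- x[, 1:3] %*% rep(5, 3) + rnorm(n)
lasso.fit <- rlasso(x, y, post = FALSE)          # plain Lasso fit
T.hat <- which(lasso.fit$coefficients[-1] != 0)  # selected support (drop intercept)
post.fit <- lm(y ~ x[, T.hat])                   # OLS refit on the selected regressors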
A crucial matter is the choice of the penalization parameter \(\lambda\). With the right choice of the penalty level, Lasso and Post-Lasso estimators possess excellent performance guarantees: They both achieve the near-oracle rate for estimating the regression function, namely with probability \(1- \gamma - o(1)\), \[\sqrt{\mathbb{E}_n [ (x_{i}'(\hat \beta - \beta_0))^2 ] } \lesssim \sqrt{(s/n) \log p}.\]
In high-dimensional settings, cross-validation is very popular in practice but lacks theoretical justification and so may not provide such a performance guarantee. In sharp contrast, the choice of the penalization parameter \(\lambda\) in the Lasso and Post-Lasso methods in this package is theoretically grounded and feasible. We therefore call the resulting method the "rigorous" Lasso method and hence add the prefix r to the function names.
In the case of homoscedasticity, we set the penalty loadings \(\hat{\psi}_j = \sqrt{\mathbb{E}_n [x_{i,j}^2]}\), which ensures basic equivariance properties. There are two choices for the penalty level \(\lambda\): the \(X\)-independent choice and the \(X\)-dependent choice. In the \(X\)-independent choice we set the penalty level to \[\lambda = 2c \sqrt{n} \hat{\sigma} \Phi^{-1}(1-\gamma/(2p)),\] where \(\Phi\) denotes the standard normal cumulative distribution function, \(\hat \sigma\) is a preliminary estimate of \(\sigma = \sqrt{\mathbb{E} [\varepsilon^2]}\), \(c\) is a theoretical constant, which is set to \(c=1.1\) by default for the Post-Lasso method and \(c=.5\) for the Lasso method, and \(\gamma\) is the probability level, which is set to \(\gamma =.1\) by default. The parameter \(\gamma\) can be interpreted as the probability of mistakenly not removing \(X\)'s when all of them have zero coefficients. In the \(X\)-dependent case the penalty level is calculated as \[\lambda = 2c \hat{\sigma} \Lambda(1-\gamma|X),\] where \(\Lambda(1-\gamma|X)\) denotes the \((1-\gamma)\)-quantile of \(n||\mathbb{E}_n[x_i e_i] ||_{\infty}\) conditional on \(X=[x_1, \ldots, x_n]'\), with \(e_i\) iid \(N(0,1)\) generated independently from \(X\); this quantity is approximated by simulation. The \(X\)-independent penalty is more conservative than the \(X\)-dependent penalty. In particular, the \(X\)-dependent penalty automatically adapts to highly correlated designs, using less aggressive penalization in this case (Belloni, V. Chernozhukov, and C. Hansen 2010).
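As a sketch, the \(X\)-independent penalty level under homoscedasticity can be computed directly from the formula above; the default constants are taken from the text, while the value of \(\hat\sigma\) below is a placeholder for a preliminary estimate.

n <- 100; p <- 100
c0 <- 1.1         # theoretical constant c (default for Post-Lasso)
gamma <- 0.1      # probability level gamma
sigma.hat <- 1    # placeholder for a preliminary estimate of sigma
lambda <- 2 * c0 * sqrt(n) * sigma.hat * qnorm(1 - gamma / (2 * p))
lambda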
In the case of heteroskedasticity, the loadings are set to \(\hat{\psi}_j=\sqrt{\mathbb{E}_n[x_{ij}^2 \hat \varepsilon_i^2]}\), where \(\hat \varepsilon_i\) are preliminary estimates of the errors. The penalty level can be \(X\)-independent (Belloni, D. Chen, V. Chernozhukov, and C. Hansen 2012): \[\lambda = 2c \sqrt{n} \Phi^{-1} (1-\gamma/(2p)),\] or it can be \(X\)-dependent and estimated by a multiplier bootstrap procedure (Chernozhukov, D. Chetverikov, and K. Kato 2013): \[\lambda = c \times c_W(1-\gamma),\] where \(c_W(1-\gamma)\) is the \((1-\gamma)\)-quantile of the random variable \(W\), conditional on the data, where \[W:= n \max_{1 \leq j \leq p} |2\mathbb{E}_n [x_{ij} \hat{\varepsilon}_i e_i]|,\] with \(e_i\) iid standard normal variables distributed independently from the data, and \(\hat{\varepsilon}_i\) denoting an estimate of the residuals.
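A minimal sketch of the multiplier bootstrap for the \(X\)-dependent penalty follows directly from the definition of \(W\); the function below is our own illustration, not the package's internal implementation.

multiplier_lambda <- function(x, eps.hat, c0 = 1.1, gamma = 0.1, R = 1000) {
  n <- nrow(x)
  W <- replicate(R, {
    e <- rnorm(n)  # iid standard normal multipliers, independent of the data
    n * max(abs(2 * colMeans(x * eps.hat * e)))
  })
  c0 * quantile(W, 1 - gamma)  # lambda = c * c_W(1 - gamma)
}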
Estimation proceeds by iteration. The estimates of residuals \(\hat \varepsilon_i\) are initialized by running least squares of \(y_i\) on five regressors that are most correlated to \(y_i\). This implies conservative starting values for \(\lambda\) and the penalty loadings, and leads to the initial Lasso and Post-Lasso estimates, which are then further updated by iteration. The resulting iterative procedure is fully justified in the theoretical literature.
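A sketch of the initialization step described above (our own illustration, not the package code):

init_residuals <- function(x, y, k = 5) {
  top <- order(abs(cor(x, y)), decreasing = TRUE)[1:k]  # k regressors most correlated with y
  resid(lm(y ~ x[, top]))                               # residuals serve as initial eps.hat
}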
The core of the package consists of the functions `rlasso` and `rlassoEffects`. They allow estimation of a Lasso regression under heteroscedastic and non-Gaussian errors and inference on a set of prespecified, low-dimensional objects. As Lasso regression introduces shrinkage bias, a post-Lasso option is available, which can be described as Lasso estimation followed by a final refit with OLS including only the selected variables.
The function `rlasso` implements Lasso and post-Lasso. The default option is to use post-Lasso, `post=TRUE`. The user can also decide whether an unpenalized intercept should be included (`TRUE` by default). `LassoShooting.fit` contains the computational algorithm that underlies the estimation procedure; this function implements a version of the Shooting Lasso Algorithm (Fu 1998). The option `penalty` of the function `rlasso` allows different choices for the penalization parameter and loadings. It allows for homoscedastic or heteroscedastic errors, with default `homoscedastic = FALSE`. Moreover, the dependence structure of the design matrix can be taken into consideration for the calculation of the penalization parameter with `X.dependent.lambda = TRUE`. With the option `lambda.start`, initial values for the algorithm can be set. The option `penalty` also allows setting the constants \(c\) and \(\gamma\) which are necessary for the calculation of the penalty. For a detailed description of how the penalty and variable loadings are calculated, we refer to the accompanying vignette. The maximum number of iterations and the tolerance at which the algorithm should stop can be set with `control`.
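For concreteness, a call setting these options explicitly might look as follows; the option names follow the description above, but the values shown inside `penalty` and `control` are assumptions on our part, so consult `?rlasso` for the authoritative list.

library(hdm)
set.seed(1)
x <- matrix(rnorm(100 * 20), ncol = 20)
y <- x[, 1:3] %*% rep(5, 3) + rnorm(100)
fit <- rlasso(x, y, post = TRUE, intercept = TRUE,
              penalty = list(homoscedastic = FALSE,      # heteroscedastic loadings
                             X.dependent.lambda = FALSE, # X-independent penalty level
                             c = 1.1, gamma = 0.1),      # constants in the penalty
              control = list(numIter = 15, tol = 1e-5))  # iteration limit and tolerance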
`rlasso` returns an object of S3 class `rlasso` for which methods like `predict`, `print`, and `summary` are provided. The methods `print` and `summary` have the option `all`. By setting this option to `FALSE`, only the coefficients estimated to be non-zero are shown.
The function `rlassoEffects` does inference for prespecified target variables. Those can be specified either by the variable names, by an integer-valued vector giving their positions in `x`, or by a logical vector indicating the variables for which inference should be conducted. It returns an object of S3 class `rlassoEffects` for which the methods `summary`, `print`, `confint`, and `plot` are provided. `rlassoEffects` is a wrapper function for `rlassoEffect`, which does inference for a single target regressor.
The function `rlassoEffects` can be used in the form `rlassoEffects(x, y, index)`, where `x` is a matrix, `y` denotes the outcome variable, and `index` specifies the variables of `x` for which inference is conducted; this can be done by an integer vector (positions of the variables), a logical vector, or the names of the variables. An alternative usage is `rlassoEffects(formula, data, I)`, where `I` is a one-sided formula which specifies the variables for which inference is conducted. For further details we refer to the help page of the function and to the following examples, where both forms of usage are shown.
For logistic regression the corresponding functions are implemented in an analogous manner, named `rlassologit` and `rlassologitEffects`.
First, we generate a data set on which we demonstrate the basic functions:
> set.seed(12345)
> n <- 100  # sample size
> p <- 100  # number of variables
> s <- 3    # number of variables with non-zero coefficients
> X <- matrix(rnorm(n*p), ncol=p)
> beta <- c(rep(5,s), rep(0,p-s))
> Y <- X%*%beta + rnorm(n)
Next, we use Lasso for fitting, show the results, and make predictions for the old and for new X-variables:
> lasso.reg <- rlasso(Y~X, post=FALSE, intercept=TRUE) # use Lasso, not post-Lasso
> # lasso.reg <- rlasso(X, Y, post=FALSE, intercept=TRUE) # alternative use
> summary(lasso.reg, all=FALSE)

Call:
rlasso.formula(formula = Y ~ X, post = FALSE, intercept = TRUE)

Post-Lasso Estimation: FALSE

Total number of variables: 100
Number of selected variables: 11

Residuals:
     Min       1Q   Median       3Q      Max
-2.09008 -0.45801 -0.01237  0.50291  2.25098

            Estimate
(Intercept)    0.057
1              4.771
2              4.693
3              4.766
13            -0.045
15            -0.047
16            -0.005
19            -0.092
22            -0.027
40            -0.011
61             0.114
100           -0.025

Residual standard error: 0.8039
Multiple R-squared: 0.9913
Adjusted R-squared: 0.9902
Joint significance test:
the sup score statistic for joint significance test is 64.02 with a p-value of 0
> yhat.lasso <- predict(lasso.reg) #in-sample prediction
> Xnew <- matrix(rnorm(n*p), ncol=p) # new X
> Ynew <- Xnew%*%beta + rnorm(n) #new Y
> yhat.lasso.new <- predict(lasso.reg, newdata=Xnew) #out-of-sample prediction
To reduce the shrinkage bias, we now use post-Lasso for fitting and prediction:
> post.lasso.reg <- rlasso(Y~X, post=TRUE, intercept=TRUE)
> summary(post.lasso.reg, all=FALSE)

Call:
rlasso.formula(formula = Y ~ X, post = TRUE, intercept = TRUE)

Post-Lasso Estimation: TRUE

Total number of variables: 100
Number of selected variables: 3

Residuals:
     Min       1Q   Median       3Q      Max
-2.83981 -0.80339  0.02063  0.75573  2.30421

            Estimate
(Intercept)    0.000
1              5.150
2              4.905
3              4.912

Residual standard error: 1.059
Multiple R-squared: 0.9855
Adjusted R-squared: 0.9851
Joint significance test:
the sup score statistic for joint significance test is 66.87 with a p-value of 0
> yhat.postlasso <- predict(post.lasso.reg) #in-sample prediction
> yhat.postlasso.new <- predict(post.lasso.reg, newdata=Xnew) #out-of-sample prediction
We can do inference on a set of variables of interest, e.g. the first, second, third, and the fiftieth:
> lasso.effect <- rlassoEffects(x=X, y=Y, index=c(1,2,3,50))
> print(lasso.effect)

Call:
rlassoEffects.default(x = X, y = Y, index = c(1, 2, 3, 50))

Coefficients:
    V1      V2      V3     V50
5.0890  4.7781  4.8292  0.1384

> summary(lasso.effect)
[1] "Estimates and significance testing of the effect of target variables"
    Estimate. Std. Error t value Pr(>|t|)
V1     5.0890     0.1112  45.781   <2e-16 ***
V2     4.7781     0.1318  36.264   <2e-16 ***
V3     4.8292     0.1314  36.752   <2e-16 ***
V50    0.1384     0.1122   1.234    0.217
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The confidence intervals for the coefficients are given by
> confint(lasso.effect)
          2.5 %   97.5 %
V1   4.8710981 5.306834
V2   4.5198436 5.036332
V3   4.5716439 5.086726
V50 -0.0814167 0.358284
Moreover, simultaneous/joint confidence intervals can be calculated, which will be discussed in more detail later.
> confint(lasso.effect, joint = TRUE)
          2.5 %    97.5 %
V1   4.8464078 5.3315239
V2   4.4315649 5.1246107
V3   4.4864563 5.1719131
V50 -0.1720149 0.4488822
Finally, we can also plot the estimated effects with their confidence intervals:
> plot(lasso.effect, main="Confidence Intervals")
A strength of the package is that it can also handle non-Gaussian and heteroscedastic errors, which we illustrate with a small example:
> Y <- X%*%beta + rnorm(n, sd=sin(X%*%beta)^2) # heteroscedastic errors
> Y <- X%*%beta + rt(n, df=5) # alternatively: non-Gaussian (t-distributed) errors
> lasso.reg <- rlasso(Y~X, post=FALSE, intercept=TRUE) # use Lasso, not post-Lasso
> summary(lasso.reg, all=FALSE)

Call:
rlasso.formula(formula = Y ~ X, post = FALSE, intercept = TRUE)

Post-Lasso Estimation: FALSE

Total number of variables: 100
Number of selected variables: 8

Residuals:
     Min       1Q   Median       3Q      Max
-2.51449 -0.76567  0.09798  0.80007  2.11780

            Estimate
(Intercept)    0.032
1              4.959
2              4.686
3              4.671
8              0.083
21             0.010
36             0.047
58            -0.070
100           -0.133

Residual standard error: 1.051
Multiple R-squared: 0.9858
Adjusted R-squared: 0.9845
Joint significance test:
the sup score statistic for joint significance test is 66.87 with a p-value of 0
Belloni, V. Chernozhukov, and K. Kato (2014) provide uniformly valid confidence intervals for \(p_1\) target parameters which are defined via Huber's Z-problems. The Z-framework encompasses mean regression, median regression, and generalized linear models, as well as many other methods. The dimension \(p_1\) of the target parameter may be much larger than the sample size. Here we apply their results to Lasso regression, which can be embedded into this framework. This enables the provision of uniformly valid joint/simultaneous confidence intervals for the target parameters. This functionality is implemented via the functions `rlassoEffects` and `confint`. To get joint confidence intervals, the option `joint` of the latter function has to be set to `TRUE`. In the next section we illustrate this with an empirical application, analysing the effect of gender on wage to quantify potential discrimination.
In labour economics an important question is how wage is related to the gender of the employee. We use US census data from the year 2012 to analyse jointly the effect of gender, and of interactions of other variables with gender, on wage. The dependent variable is the logarithm of the wage, and the target variable is `female` (in combination with other variables). All other variables denote other socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables we refer to the help page.
First, we load and prepare the data.
> library(hdm)
> data(cps2012)
> X <- model.matrix( ~ -1 + female + female:(widowed + divorced + separated +
+ nevermarried + hsd08 + hsd911+ hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) +
+ (widowed + divorced + separated + nevermarried +
+ hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2, data=cps2012)
> dim(X)
[1] 29217   136
> X <- X[, which(apply(X, 2, var)!=0)] # exclude all constant variables
> dim(X)
[1] 29217   116
> index.gender <- grep("female", colnames(X))
> y <- cps2012$lnw
The parameter estimates for the target parameters, i.e. all coefficients related to gender (including interactions with other variables), are calculated and summarized by the following commands:
> effects.female <- rlassoEffects(x=X, y=y, index=index.gender)
> summary(effects.female)
[1] "Estimates and significance testing of the effect of target variables"
                     Estimate. Std. Error t value Pr(>|t|)
female               -0.154923   0.050162  -3.088 0.002012 **
female:widowed        0.136095   0.090663   1.501 0.133325
female:divorced       0.136939   0.022182   6.174 6.68e-10 ***
female:separated      0.023303   0.053212   0.438 0.661441
female:nevermarried   0.186853   0.019942   9.370  < 2e-16 ***
female:hsd08          0.027810   0.120914   0.230 0.818092
female:hsd911        -0.119335   0.051880  -2.300 0.021435 *
female:hsg           -0.012890   0.019223  -0.671 0.502518
female:cg             0.010139   0.018327   0.553 0.580114
female:ad            -0.030464   0.021806  -1.397 0.162405
female:mw            -0.001063   0.019192  -0.055 0.955811
female:so            -0.008183   0.019357  -0.423 0.672468
female:we            -0.004226   0.021168  -0.200 0.841760
female:exp1           0.004935   0.007804   0.632 0.527139
female:exp2          -0.159519   0.045300  -3.521 0.000429 ***
female:exp3           0.038451   0.007861   4.891 1.00e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Finally, we estimate and plot confidence intervals, here the joint confidence intervals.
> joint.CI <- confint(effects.female, level = 0.95, joint = TRUE)
> joint.CI
                           2.5 %      97.5 %
female              -0.244219905 -0.06562666
female:widowed      -0.036833704  0.30902467
female:divorced      0.097104065  0.17677471
female:separated    -0.066454223  0.11305975
female:nevermarried  0.149932792  0.22377417
female:hsd08        -0.230085134  0.28570576
female:hsd911       -0.215291089 -0.02337899
female:hsg          -0.046383541  0.02060398
female:cg           -0.023079946  0.04335705
female:ad           -0.072370082  0.01144259
female:mw           -0.035451308  0.03332443
female:so           -0.043143587  0.02677690
female:we           -0.043587197  0.03513494
female:exp1         -0.008802488  0.01867301
female:exp2         -0.239408202 -0.07963045
female:exp3          0.024602478  0.05229868
> plot(effects.female, joint=TRUE, level=0.95)
This analysis allows a closer look at how discrimination by gender is related to other socio-economic variables.
As a side remark, version 0.2 now also provides a formula interface for many functions including `rlassoEffects`. Hence, the analysis can also be done more compactly as
> effects.female <- rlassoEffects(lnw ~ female + female:(widowed + divorced + separated +
+ nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) +
+ (widowed + divorced + separated + nevermarried +
+ hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2, data=cps2012,
+ I = ~ female + female:(widowed + divorced + separated + nevermarried +
+ hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3))
The one-sided formula `I` gives the target variables for which inference is conducted.
The package contains six data sets which are made available for researchers. The growth data set contains the so-called Barro-Lee data (Barro and J.-W. Lee 1994), with macroeconomic information for a large set of countries over several decades, used to analyse the drivers of economic growth. Next, the package includes a data set on settler mortality and economic outcomes which was used in the literature to illustrate the effect of institutions on economic growth (Acemoglu, S. Johnson, and J. A. Robinson 2001). The data set `pension` contains information about the choice of 401(k) pension plans and savings behaviour in the United States (Chernozhukov and C. Hansen 2004). A data set for analysing the effect of judicial decisions regarding eminent domain on economic outcomes (e.g. house prices) is included; more information is given in Chen and J. Sethi (2010) and Belloni, D. Chen, V. Chernozhukov, and C. Hansen (2012). The BLP data set was analysed in Berry, J. Levinsohn, and A. Pakes (1995). The data stem from annual issues of the Automotive News Market Data Book and include information on all models marketed during the period from 1971 to 1990, containing 2217 model/years from 997 distinct models. Finally, US census data (CPS2012) is included. All data sets are described in their help pages, where further references can be found.
Additionally, the package contains functions for estimation and inference on structural parameters in a high-dimensional setting. An important tool for estimation in the potential presence of endogeneity is instrumental variable (IV) estimation. The function `rlassoIV` implements methods for IV estimation in a high-dimensional setting. This function is a wrapper for several methods: the user can specify whether selection shall be done on the instruments (z-variables), on the control variables (x-variables), or on both. For the low-dimensional case where no selection is done on the x- or z-variables, `rlassoIV` performs classical 2SLS estimation by calling the function `tsls`.
The goal of many empirical analyses is to understand and estimate the causal effect of a treatment, e.g. the effect of an endogenous binary treatment \(D\), on an outcome, \(Y\), in the presence of a binary instrumental variable, \(Z\), in settings with very many potential control variables. We provide functions to estimate the local average treatment effect (LATE), the local average treatment effect of the treated (LATET), and as special cases the average treatment effect (ATE) and the average treatment effect of the treated (ATET), in the functions `rlassoLATE`, `rlassoLATET`, `rlassoATE`, and `rlassoATET`, respectively. These functions also allow calculation of treatment effect standard errors with different bootstrap methods ("wild", "normal", "Bayes") in addition to the classical plug-in standard errors. Moreover, the methods `print`, `summary`, and `confint` are available.
Both the family of `rlassoIV` functions and the family of functions for treatment effects, which are introduced in the next section, can be used with a formula interface or by handing over the prepared model matrices. The general pattern for use with a formula is `function(formula, data, ...)`, where the formula consists of two parts and is a member of the class `Formula`. These formulas are of the pattern `y ~ d + x | x + z`, where `y` is the outcome variable, `x` are exogenous variables, `d` are endogenous variables (whether several are allowed depends on the concrete function), and `z` denotes the instrumental variables. A more primitive use of the functions is to simply hand over the required groups of variables as matrices: `function(x=x, d=d, y=y, z=z)`. In some of the following examples both alternatives are displayed.
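A hedged sketch of the two usage patterns on simulated data; the variable names and the data-generating process are ours, for illustration only.

library(hdm)
set.seed(3)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n)
u <- rnorm(n)                 # unobservable that makes d endogenous
d <- z1 + 0.5 * z2 + x1 + u   # first-stage relation
y <- d + x1 + x2 + u + rnorm(n)
df <- data.frame(y, d, x1, x2, z1, z2)
# formula interface: outcome ~ endogenous + exogenous | exogenous + instruments
fit.f <- rlassoIV(y ~ d + x1 + x2 | x1 + x2 + z1 + z2, data = df,
                  select.X = FALSE, select.Z = TRUE)
# equivalent matrix interface
fit.m <- rlassoIV(x = cbind(x1, x2), d = d, y = y, z = cbind(z1, z2),
                  select.X = FALSE, select.Z = TRUE)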
In this section, we briefly present how instrumental variables estimation is conducted in a high-dimensional setting. The eminent domain example, for which the data set is contained in the package, serves as an illustration. The underlying goal is to estimate the effect of pro-plaintiff decisions in cases of eminent domain (government’s takings of private property) on economic outcomes. The analysis of the effects of such decisions is complicated by the possible endogeneity between judicial decisions and potential economic outcomes. To address the potential endogeneity, we employ an instrumental variables strategy based on the random assignment of judges to the federal appellate panels that make the decisions. Because judges are randomly assigned to three-judge panels, the judges and their demographics are randomly assigned conditional on the distribution of characteristics of federal circuit court judges in a given circuit-year.
First, we load the data and construct the matrices with the controls (x), instruments (z), outcome (y), and treatment variables (d). Here we consider regional GDP as the outcome variable.
> data(EminentDomain)
> z <- EminentDomain$logGDP$z
> x <- EminentDomain$logGDP$x
> y <- EminentDomain$logGDP$y
> d <- EminentDomain$logGDP$d
As mentioned above, \(y\) is the economic outcome, the logarithm of GDP; \(d\) is the number of pro-plaintiff appellate takings decisions in federal circuit court \(c\) and year \(t\); \(x\) is a matrix with control variables; and \(z\) is the matrix with instruments. Here we consider socio-economic and demographic characteristics of the judges as instruments.
Next, we estimate the model with selection on the instruments.
> lasso.IV.Z <- rlassoIV(x=x, d=d, y=y, z=z, select.X=FALSE, select.Z=TRUE)
> # or lasso.IV.Z <- rlassoIVselectZ(x=x, d=d, y=y, z=z)
> summary(lasso.IV.Z)
[1] "Estimates and significance testing of the effect of target variables in the IV regression model"
    coeff.     se. t-value p-value
d1 0.01543 0.01926   0.801   0.423
Finally, we do selection on both the \(x\) and \(z\) variables.
> lasso.IV.XZ <- rlassoIV(x=x, d=d, y=y, z=z, select.X=TRUE, select.Z=TRUE)
> summary(lasso.IV.XZ)
Estimates and Significance Testing of the effect of target variables in the IV regression model
     coeff.     se. t-value p-value
d1 -0.03488 0.15881   -0.22   0.826

> confint(lasso.IV.XZ)
        2.5 %    97.5 %
d1 -0.3461475 0.2763868
Here we apply the treatment functions to 401(k) plan participation. Though it is clear that 401(k) plans are widely used as vehicles for retirement saving, their effect on assets is less clear. The key problem in determining the effect of participation in 401(k) plans on accumulated assets is saver heterogeneity coupled with nonrandom selection into participation states. In particular, it is generally recognized that some people have a higher preference for saving than others. Thus, it seems likely that those individuals with the highest unobserved preference for saving would be most likely to choose to participate in tax-advantaged retirement savings plans and would also have higher savings in other assets than individuals with lower unobserved saving propensity. This implies that conventional estimates that do not allow for saver heterogeneity and selection of the participation state will be biased upward, tending to overstate the actual savings effects of 401(k) and IRA participation.
Again, we start with the data preparation. For a detailed description of the data set we refer to the help page, `help(pension)`.
data(pension)
y <- pension$tw; d <- pension$p401; z <- pension$e401
X <- pension[, c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5",
  "fsize", "hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")] # simple model
xvar <- c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5",
  "fsize", "hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")
xpart <- paste(xvar, collapse = "+")
form <- as.formula(paste("tw ~ ", paste(c("p401", xvar), collapse = "+"), "|",
  paste(xvar, collapse = "+")))
formZ <- as.formula(paste("tw ~ ", paste(c("p401", xvar), collapse = "+"), "|",
  paste(c("e401", xvar), collapse = "+")))
Now we can compute the estimates of the target treatment effect parameters:
> # pension.ate <- rlassoATE(X, d, y)
> pension.ate <- rlassoATE(form, data = pension)
> summary(pension.ate)
Estimation and significance testing of the treatment effect
Type: ATE
Bootstrap: not applicable
   coeff.  se. t-value  p-value
TE  10490 1920   5.464 4.67e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> # pension.atet <- rlassoATET(X, d, y)
> pension.atet <- rlassoATET(form, data = pension)
> summary(pension.atet)
Estimation and significance testing of the treatment effect
Type: ATET
Bootstrap: not applicable
   coeff.  se. t-value  p-value
TE  11810 2844   4.152 3.29e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> pension.late <- rlassoLATE(X, d, y, z)
> # pension.late <- rlassoLATE(formZ, data = pension)
> summary(pension.late)
Estimation and significance testing of the treatment effect
Type: LATE
Bootstrap: not applicable
   coeff.  se. t-value  p-value
TE  12189 2734   4.458 8.27e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> pension.latet <- rlassoLATET(X, d, y, z)
> # pension.latet <- rlassoLATET(formZ, data = pension)
> summary(pension.latet)
Estimation and significance testing of the treatment effect
Type: LATET
Bootstrap: not applicable
   coeff.  se. t-value p-value
TE  12687 3590   3.534 0.00041 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results are summarized in Table 1. We see that in this example the estimates are quite similar, while the standard errors differ.
Table 1: Estimated treatment effects.

|       | Estimate | Std. Error |
|-------|----------|------------|
| ATE   | 10490.07 | 1919.99    |
| ATET  | 11810.45 | 2844.33    |
| LATE  | 12188.66 | 2734.12    |
| LATET | 12686.87 | 3590.09    |
A core problem in economics is to estimate the effect of price on the demand for products. This is usually impeded by the fact that prices are not exogenously given, but determined by demand and supply in a simultaneous system of equations, introducing endogeneity. A possible way to solve this problem is to employ instrumental variables, a technique which is also becoming more popular outside of economics. The results found here are a replication in R of the results in (Chernozhukov, C. Hansen, and M. Spindler 2015a) and (Chernozhukov, C. Hansen, and M. Spindler 2015b).
Here we are interested in estimating the coefficients in a simple logit model of demand for automobiles using market share data. Our example is based on the data and the most basic strategy from (Berry, J. Levinsohn, and A. Pakes 1995). The goal is to estimate the effect of price on demand (market share). Specifically,
\[\log(s_{it}) - \log(s_{0t}) = \alpha_0 p_{it} + x_{it}' \beta_0 + \varepsilon_{it},\]
\[p_{it} = z_{it}' \delta_0 + x_{it}' \gamma_0 + u_{it},\]
where \(s_{it}\) is the market share of product \(i\) in market \(t\), with product zero denoting the outside option, \(p_{it}\) is the price and is treated as endogenous, \(x_{it}\) are observed included product characteristics, and \(z_{it}\) are instruments. In the data set the variable \(y\) is defined as \(\log(s_{it}) - \log(s_{0t})\).
In our example, we use the same set of product characteristics (\(x\)-variables) as used in obtaining the basic results in (Berry, J. Levinsohn, and A. Pakes 1995). Specifically, we use five variables in \(x_{it}\): a constant, an air conditioning dummy, horsepower divided by weight, miles per dollar, and vehicle size. We refer to these five variables as the baseline set of controls.
We also adopt the argument from (Berry, J. Levinsohn, and A. Pakes 1995) to form our potential instruments. BLP argue that characteristics of other products will satisfy an exclusion restriction, \(\textrm{E}[\varepsilon_{it}|x_{j\tau}] = 0\) for any \(\tau\) and \(j \ne i\), and thus that any function of characteristics of other products may be used as an instrument for price. This condition leaves a very high-dimensional set of potential instruments, as any combination of functions of \(\{x_{j\tau}\}_{j \ne i, \tau \ge 1}\) may be used to instrument for \(p_{it}\). To reduce the dimensionality, BLP use intuition and an exchangeability argument to motivate consideration of a small number of these potential instruments, formed by summing product characteristics over products other than \(i\). Specifically, we form baseline instruments by taking \[z_{k,it} = \left(\sum_{r \ne i, r \in \mathcal{I}_f} x_{k,rt} , \sum_{r \ne i, r \notin \mathcal{I}_f} x_{k,rt} \right)\] where \(x_{k,it}\) is the \(k^{th}\) element of the vector \(x_{it}\) and \(\mathcal{I}_f\) denotes the set of products produced by firm \(f\). This choice yields a vector \(z_{it}\) consisting of 10 instruments. We refer to this set of instruments as the baseline instruments; a sketch of their construction in R is given below.
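The following sketch shows how such sum-of-characteristics instruments can be formed. It is our own illustration of the construction above, not the package's internal `constructIV` routine (used later), whose implementation may differ.

construct_sum_iv <- function(firm.id, market.id, X) {
  # for each product i: sum each characteristic over the other products of the
  # same firm in the same market, and over the products of rival firms
  n <- nrow(X)
  Z <- matrix(NA_real_, n, 2 * ncol(X))
  for (i in seq_len(n)) {
    same.market <- market.id == market.id[i]
    own   <- same.market & firm.id == firm.id[i] & seq_len(n) != i
    rival <- same.market & firm.id != firm.id[i]
    Z[i, ] <- c(colSums(X[own, , drop = FALSE]),
                colSums(X[rival, , drop = FALSE]))
  }
  Z
}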
While the choice of the baseline instruments and controls is motivated by good intuition and economic theory, it should be noted that theory does not clearly state which product characteristics or instruments should be used in the model. Theory also fails to indicate the functional form with which any such variables should enter the model. The high-dimensional methods outlined in this article offer one strategy to help address these concerns, complementing the economic intuition motivating the baseline controls and instruments. As an illustration, we consider an expanded set of controls and instruments. We augment the set of potential controls with all first-order interactions of the baseline variables, quadratics and cubics in all continuous baseline variables, and a time trend, which yields a total of 24 \(x\)-variables. We refer to these as the augmented controls. We then take sums of these characteristics as potential instruments following the original strategy, which yields 48 potential instruments.
The data set is included in the package and can be accessed in the following way:
data(BLP)
BLPData <- BLP$BLP
A detailed description of the data is given in (Berry, J. Levinsohn, and A. Pakes 1995) and also on the help page.
In the baseline model we consider five x-variables and ten instrumental variables as described above. First we process the data; in particular, we construct the design matrices for the x- and z-variables. The matrix of instruments \(Z\) is shipped with the data set but could also be constructed with the internal function `constructIV` in the package hdm.
attach(BLPData)
X <- as.matrix(cbind(1, BLPData[, c("hpwt", "air", "mpd", "space")]))
Z <- BLP$Z
# Z <- hdm:::constructIV(firm.id, cdid, id, X)
With the baseline x- and z-variables, we estimate the price effect with OLS, TSLS, and finally with selection on the x- and z-variables.
ols.reg <- lm(y ~ price + hpwt + air + mpd + space, data = BLPData)
tsls.reg <- tsls(x = X, d = price, y = y, z = Z, intercept = FALSE, homoscedastic = FALSE)
lasso.reg <- rlassoIV(x = X[, -1], y = y, z = Z, d = price, select.X = TRUE, select.Z = TRUE,
  intercept = TRUE)
The results are summarized in the table below:
| Method                             | Price coefficient | Standard error |
|------------------------------------|-------------------|----------------|
| Baseline OLS                       | -0.089            | 0.004          |
| Baseline TSLS                      | -0.136            | 0.012          |
| Baseline TSLS with Lasso selection | -0.174            | 0.013          |
Next, we estimate the augmented model, which in total has 24 controls and 48 instruments. First, we construct the data needed to estimate the augmented model. The set of augmented IVs can be accessed by
augZ <- BLP$augZ
tu <- trend/19
mpdu <- mpd/7
spaceu <- space/2
augX <- cbind(1, hpwt, air, mpdu, spaceu, tu, hpwt^2, hpwt^3, mpdu^2, mpdu^3,
  spaceu^2, spaceu^3, tu^2, tu^3, hpwt*air, mpdu*air, spaceu*air, tu*air,
  hpwt*mpdu, hpwt*spaceu, hpwt*tu, mpdu*spaceu, mpdu*tu, spaceu*tu)
colnames(augX) <- c("constant", "hpwt", "air", "mpdu", "spaceu", "tu", "hpwt^2", "hpwt^3",
  "mpdu^2", "mpdu^3", "spaceu^2", "spaceu^3", "tu^2", "tu^3", "hpwt*air", "mpdu*air",
  "spaceu*air", "tu*air", "hpwt*mpdu", "hpwt*spaceu", "hpwt*tu", "mpdu*spaceu", "mpdu*tu",
  "spaceu*tu")
# augZ <- hdm:::constructIV(firm.id, cdid, id, augX) # construction of augmented set of IVs
Next, we redo the analysis with the augmented set of variables:
ols.reg <- lm(y ~ -1 + cbind(price, augX))
tsls.reg <- tsls(x = augX, d = price, y = y, z = augZ, intercept = FALSE, homoscedastic = FALSE)
lasso.reg <- rlassoIV(x = augX[, -1], y = y, z = augZ, d = price, select.X = TRUE, select.Z = TRUE,
  intercept = TRUE)
The results for the augmented model are presented in the table:
| Method                              | Price coefficient | Standard error |
|-------------------------------------|-------------------|----------------|
| Augmented OLS                       | -0.099            | 0.004          |
| Augmented TSLS                      | -0.127            | 0.008          |
| Augmented TSLS with Lasso selection | -0.286            | 0.02           |
We report estimation results from the baseline and augmented settings in the tables above. All estimations give a negative effect of price on demand, as expected, but they differ considerably in the size of the effect. The augmented model yields the strongest influence of price on market share, and we see that the choice of instruments is an important issue.
The package hdm contains methods for estimation and inference in a high-dimensional setting, and we have presented a short overview of the implemented functions. The package also contains data sets which might be useful for classroom presentations and as applications for newly developed estimators. More examples and a short introduction to the underlying theoretical concepts are provided in the accompanying vignette. It is planned to extend the functionality and to improve the performance of the functions in subsequent versions of the package.