Chernozhukov et al. (2018) proposed the sorted effect method for nonlinear regression models. This method consists of reporting percentiles of the partial effects, the sorted effects, in addition to the average effect commonly used to summarize the heterogeneity in the partial effects. They also propose to use the sorted effects to carry out classification analysis where the observational units are classified as most and least affected if their partial effect are above or below some tail sorted effects. The R package SortedEffects implements the estimation and inference methods therein and provides tools to visualize the results. This vignette serves as an introduction to the package and displays basic functionality of the functions within.
Many empirical questions in econometrics, machine learning and
statistics boil down to studying how changes in a key variable
In nonlinear models the PEs vary with the underlying control variables.
Consider the probit model as an example. The conditional probability of
More generally, suppose we have a regression function
Chernozhukov et al. (2018) proposed the sorted partial effect (SPE) method to provide
a more complete summary of
To apply the methods in practice, we need to replace
The rest of the vignette proceeds as follows. We first introduce the main functions within the SortedEffect package and use an application to racial-based discrimination in mortgage lending to illustrate the command options. Then we provide another application to gender wage gap using the Current Population Survey (CPS) data. The empirical results show that the SPE method is effective in uncovering heterogeneous effects and demographic differences between the most and least affected groups in both applications.
The package contains three main commands: spe
, ca
, and subpop
. The
package adds three methods to the generic plot()
(plot.spe
,
plot.ca
and plot.subpop
) and summary()
(summary.spe
,
summary.ca
and summary.subpop
). In practice, users only need to type
plot
and summary
since the generic commands will automatically
dispatch the appropriate method. In this section we explain the options
in each function respectively. Lastly we briefly explain the bootstrap
procedures for inference and bias correction. We provide the
mathematical expressions and an application to racial discrimation in
mortgage lending to facilitate the understanding of the options of the
commands.
The command spe
provides estimation and inference methods for the SPE
and APE. The general syntax is:
spe(fm, data, method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"), var, compare, subgroup = NULL,
samp_weight = NULL, us = c(1:9)/10, alpha = 0.1, taus = c(5:95)/100,
b = 500, parallel = FALSE, ncores = detectCores(), seed = 1, bc = TRUE,
boot_type = c("nonpar", "weighted"))
The option fm
stores the user-specified regression formula to estimate
the PEs, which assigns the outcome +
operators. The option data
specifies the data set that contains
the variables for the analysis. The user needs to specify the variable
of interest var
as a string. The package accommodates four
regression methods to estimate the PE, "probit"
), logit ("logit"
) and quantile
regression ("QR"
)."QR"
is called, then the user needs to
further specify the quantile indexes with taus
. The default is
taus = c(5:95)/100
.
The option samp_weight
allows the user to input sampling weights. When
samp_weight = NULL
, the default, the package automatically uses a
vector of ones, i.e. no sampling weights are used.
The package provides three options for the variable of interest var
:
var_type = "binary"
if var
is a binary variable such as a black
or female indicator.
var_type = "categorical"
if var
is a categorical (factor)
variable with more than 2 levels. One example is means of
transportation with labels "bus"
, "train"
or "bike"
. In this
case, the user needs to further specify the two labels to be
compared with the option compare
. If the user is interested in the
effect of changing from bus to bike, then
compare = c("bus", "bike")
. If the data only have levels, say
"1", "2"
and "3"
, the users need to input levels instead.
var_type = "continuous"
if var
is a continuous variable. The
package obtains the PEs using a central difference numerical
derivative of the form:
1e-7
. We avoid the symbolic derivative
because it cannot correctly interpret terms involving I()
for
transformations of the variables such as powers, and can therefore
cause erroneous estimation. The code of this part is inspired by
Thomas Leeper’s vignette on the
margins package
(Leeper et al. 2018).
The option subgroup
allows the user to specify the population of
interest. The default is NULL
, which corresponds to the entire
population. If the user is interested in subpopulations, say households
whose income
is lower than Data
, then the user can set
subgroup = Data[, "income"] < 10000
. When subgroup = Data[, var] == 1
. Note that users
cannot input subgroup directly to the data
option or using the
subset
option because the SPE methods use the whole sample to estimate
the PEs. The option subgroup
only specifies the population for the
estimation of
The option us
specifies the set of quantile indexes corresponding to
the estimated SPE to be reported. The mathematical definition of each
empirical SPE is
The empirical SPE function
us
specifies us = c(0.25, 0.5, 0.75)
specifies to report the SPEs corresponding to
the three quartiles.
The option alpha
specifies the significance level of the confidence
bands. The default is alpha = 0.1
, i.e. b
specifies the number of bootstrap repetitions. The default is
b = 500
.
The package supports two types of bootstrap: nonparametric
(boot_type = "nonpar"
) and weighted with standard exponential weights
(boot_type = "weighted"
). The user can fix the random seed for
bootstrap simulation with the option seed
and decide whether or not to
bias-correct the estimates with the option bc
. The default is
bc = TRUE
. Bootstrap and bias-correction will be discussed in section
2.5. The package features parallel computing, which is convenient to
speed up the bootstrap. The user can use the option parallel
to turn
on or off parallel computing. If parallel = TRUE
, the option ncores
allows the user to specify the number of cores in the parallel
computing. The default is ncores = detectCores()
, where
detectCores()
is a command of the package
parallel.
The output of spe
is a list containing four components: spe
, ape
,
us
and alpha
. As the names indicate, spe
stores the results for
the SPE, and ape
stores the results for the APE. Each component is a
list with four elements: the estimates, lower and upper bounds of the
uniform confidence bands, and bootstrapped standard errors obtained as
rescaled interquartile ranges; see Chernozhukov et al. (2018). The other two
components respectively store the percentile indices of the SPE and
significance level of the confidence bands, which are used in the
functions plot.spe
and summary.spe
.
The plot.spe
function plots the result of spe
in one graph that
includes the SPEs, APE and their corresponding confidence bands. Its
general syntax is
plot.spe(object, ylim = NULL, main = NULL, sub = NULL, xlab = "Percentile Index",
ylab = "Sorted Effects", ...)
where object
is the output of spe
. The range of the x-axis is fixed
to be the range of user-specified quantile index us
. The options
ylim
, xlab
and ylab
respectively denote range of the y-axis and
labels of the two axes. The options main
and sub
allow users to
specify the main and sub titles of the plot. The option ...
is an
argument of the generic plot
command that allows for further graphic
parameters.
The syntax of summary.spe
is as follows
summary.spe(object, result = c("sorted", "average"), ...)
If result = "sorted"
, the method provides a table that contain the SPE
percentile indexes, estimates, bootstrap standard errors, pointwise and
uniform confidence bands. If result = "average"
, the methods tabulates
APE estimate, bootstrap standard error and confidence interval.
We illustrate the usage of the command with an empirical application to racial discrimination in mortgage lending. We use data on mortgage applications in Boston from 1990 (Munnell et al. 1996). The Federal Reserve Bank of Boston collected these data in relation to the Home Mortgage Disclosure Act (HMDA), which was passed to monitor minority access to the mortgage market. To retrieve the data from the package, issue the command
data("mortgage")
The outcome variable, deny
, a binary indicator for mortgage
denial. The key variable of interest, black
, a binary
indicator for the applicant being black, while the control variables,
p_irat
), expenses-to-income ratio (hse_inc
),
bad consumer credit (ccred
), bad mortgage credit (mcred
), credit
problems (pubrec
), denied mortgage insurance (denpmi
), medium
loan-to-house value (ltv_med
), high loan-to-house value (ltv_high
),
self employed (selfemp
), single (single
), and high school graduate
(hischl
). The regression formula for the estimation of the PEs is
specified as:
fm <- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec + ltv_med + ltv_high +
denpmi + selfemp + single + hischl
We invoke the spe
command to calculate the bias-corrected estimates of
the SPE at the quantile indexes
test <- spe(fm = fm, data = mortgage, var = "black", method = "logit", us = c(2:98)/100,
b = 500, bc = TRUE)
The output test
includes the estimates and confidence bands for the
APE and SPE in the entire population. We use plot
to visualize the
results.
plot(x = test, ylim = c(0, 0.25), ylab = "Change in Probability",
main = "APE and SPE of Being Black on the Prob of Mortgage Denial",
sub = "Logit Model")
plot.spe
function. Estimates and 90% confidence
bands for the APE and SPE of black indicator on probability of mortgage
denial. Confidence bands obtained by bootstrap with 500
replications.The result in Figure (1) shows significant heterogeneity in the SPEs, with the PEs ranging from 0 to 14%. The APE misses this heterogeneity, and therefore provides an incomplete picture of the effects.
We can also tabulate the result using summary
. First, we apply the
command to the APE and display the results in Table (1).
summary(test, result = "average")
Est | SE | 90% LB | 90% UB | |
---|---|---|---|---|
APE | 0.051 | 0.019 | 0.021 | 0.081 |
Next, we obtain the results for the SPE. To save space we only show the
first 15 rows of summary(test)
in Table (2). The columns
labelled as PLB and PUB correspond to the lower and upper pointwise
confidence bands, whereas the columns labelled as ULB and UUB correspond
to their uniform counterparts.
summary(test)
Est | SE | 90% PLB | 90% PUB | 90% ULB | 90% UUB | |
---|---|---|---|---|---|---|
0.02 | 0.011 | 0.005 | 0.003 | 0.018 | 0.001 | 0.020 |
0.03 | 0.012 | 0.005 | 0.004 | 0.020 | 0.002 | 0.022 |
0.04 | 0.013 | 0.006 | 0.004 | 0.022 | 0.002 | 0.024 |
0.05 | 0.014 | 0.006 | 0.004 | 0.023 | 0.002 | 0.025 |
0.06 | 0.014 | 0.006 | 0.004 | 0.024 | 0.002 | 0.026 |
0.07 | 0.015 | 0.007 | 0.004 | 0.026 | 0.002 | 0.028 |
0.08 | 0.016 | 0.007 | 0.005 | 0.027 | 0.003 | 0.029 |
0.09 | 0.017 | 0.007 | 0.005 | 0.029 | 0.003 | 0.031 |
0.1 | 0.018 | 0.007 | 0.006 | 0.030 | 0.003 | 0.032 |
0.11 | 0.018 | 0.008 | 0.005 | 0.031 | 0.003 | 0.033 |
0.12 | 0.019 | 0.008 | 0.006 | 0.031 | 0.004 | 0.034 |
0.13 | 0.019 | 0.008 | 0.006 | 0.032 | 0.004 | 0.035 |
0.14 | 0.020 | 0.008 | 0.007 | 0.034 | 0.004 | 0.036 |
0.15 | 0.021 | 0.008 | 0.007 | 0.034 | 0.004 | 0.037 |
0.16 | 0.021 | 0.009 | 0.007 | 0.035 | 0.004 | 0.038 |
The command ca
provides estimation and inference methods for the CA.
The general syntax is:
ca(fm, data, method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"), var,
compare, subgroup = NULL, samp_weight = NULL, taus = c(5:95)/100,
u = 0.1, interest = c("moment", "dist"),
t = c(1, 1, rep(0, dim(data)[2] - 2)), cl = c("both", "diff"),
cat = NULL, alpha = 0.1, b = 500, parallel = FALSE,
ncores = detectCores(), seed = 1, bc = TRUE,
range_cb = c(1:99)/100, boot_type = c("nonpar", "weighted"))
The first step in the CA is to classify the observational units in most
and least affected groups based on some tail SPEs. The option u
specifies the quantile index of the tail SPEs. Thus, the u = 0.1
to obtain the 10% least and most affected
groups. The option subgroup
specifies the population of interest and
has the same syntax as in the spe
command.
Let
t
is a vector that specifies "a", "b", "c", "d", "e"
) and we are interested in "a"
and "c"
,
then we can either set t = c("a", "c")
directly or
t = c(1, 0, 1, 0, 0)
. The second approach requires the user to know
the order of the variables in the data set, which can be found with the
command View
.
Let interest = "moment"
, then
interest = "dist"
, then
If interest = "moment"
, ca
estimates and makes inference on features
of the chosen variables of interest in the least and most affected
groups. These features are specified with the option cl
. For example,
if cl = "both"
, the command estimates the mean of the variables in the
two affected groups. On the other hand, cl = "diff"
estimates the
differences of the means of the variables between the two groups.
If interest = "dist"
, the option range_cb
specifies the region of
interest for the domain of the distributioncl
doesn’t have any bearing when interest = "dist"
.x
is discrete, we can specify the region of
interest as the support of x
with range_cb = NULL
. If x
is
continuous, we can specify range_cb = c(1:99)/100
if we are interested
in the percentiles with indexes
range_cb = c(1:99)/100
. The choice range_cb = NULL
shuts down this
feature by settimg the region of interest as all the distint values of
the variable.
The output of ca
depends on the choice of interest
. For
interest = "moment"
, the output is a list containing the estimates of
"ms"
with labels
("nevermarried", "married", "divorced", "separated", "widowed")
, then
users could consider adjusting the pointwise p-values within this
factor. To illustrate how to define the option cat
, suppose we have
selected specified 3 variables of interest: t = c("a", "b", "c")
.
Without loss of generality, assume "a"
is not a factor, while "b"
and "c"
are two factors. Then we need to specify cat
as
cat = c("b", "c")
. If cat = NULL
, we report the unadjusted pointwise
p-values. If interest = "dist"
, the output is a list containing the
rearranged estimates, upper confidence bands and lower confidence bands
for the variables of interest in both groups.
When interest = "moment"
, the user can use method summary.ca
to
tabulate the output. The general syntax is
summary.ca(object, ...)
If cl = "both"
, the p-values are omitted from the table. When
interest = "dist"
, users can plot the output for better visualization.
The general syntax is
plot.ca(object, var, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ...)
The user needs to input the variable for plotting with the option var
.
Note that the variable must be one of the variables specified in t
.
Returning to the mortgage denial example, we classify the 10% least and
most affected applicants and compare their characteristics. The
variables of interest include deny
, black
and all the controls. We
first specify t
to reflect the choice of variables of interestView(mortgage)
to locate the variables
and set t <- c(rep(1, 4), 0, rep(1, 7), 0, 0, 1, 1)
.
t <- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec", "denpmi",
"selfemp", "single", "hischl", "ltv_med", "ltv_high")
Then we invoke the ca
command and summarize the result.
CA <- ca(fm = fm, data = mortgage, var = "black", method = "logit", cl = "both",
t = t, b = 500, bc = TRUE)
summary(CA)
Most | SE | Least | SE | |
---|---|---|---|---|
deny | 0.45 | 0.03 | 0.09 | 0.04 |
p_irat | 0.39 | 0.01 | 0.25 | 0.02 |
black | 0.38 | 0.03 | 0.06 | 0.02 |
hse_inc | 0.28 | 0.01 | 0.21 | 0.02 |
ccred | 4.80 | 0.26 | 1.28 | 0.09 |
mcred | 2.01 | 0.06 | 1.36 | 0.10 |
pubrec | 0.46 | 0.05 | 0.05 | 0.02 |
denpmi | 0.01 | 0.01 | 0.04 | 0.03 |
selfemp | 0.17 | 0.04 | 0.04 | 0.03 |
single | 0.61 | 0.06 | 0.09 | 0.07 |
hischl | 0.93 | 0.03 | 1.00 | 0.01 |
ltv_med | 0.59 | 0.06 | 0.05 | 0.04 |
ltv_high | 0.12 | 0.04 | 0.01 | 0.01 |
Table (3) shows that the 10% of the applicants most affected by the racial mortgage denial gap are more likely to have either of the following characteristics relative to the 10% of the least affected applicants: mortgage denied, high debt-to-income ratio, black, high expense-to-income ratio, bad consumer or credit scores, credit problems, self employed, single, no high school diploma, and medium or high loan-to-income ratio.
Next we test if the differences in the characteristics between the two
groups are statistically significant. To do so we set cl = "diff"
,
which means taking difference between the two groups. The full command
is as follows
CAdiff <- ca(fm = fm, data = mortgage, var = "black", t = t, method = "logit",
cl = "diff", b = 500, bc = TRUE)
summary(CAdiff)
Estimate | SE | JP-vals | P-vals | |
---|---|---|---|---|
deny | 0.36 | 0.05 | 0.00 | 0.00 |
p_irat | 0.14 | 0.02 | 0.00 | 0.00 |
black | 0.32 | 0.04 | 0.00 | 0.00 |
hse_inc | 0.07 | 0.02 | 0.13 | 0.00 |
ccred | 3.52 | 0.28 | 0.00 | 0.00 |
mcred | 0.65 | 0.15 | 0.01 | 0.00 |
pubrec | 0.41 | 0.05 | 0.00 | 0.00 |
denpmi | -0.03 | 0.04 | 1.00 | 0.23 |
selfemp | 0.13 | 0.06 | 0.48 | 0.02 |
single | 0.53 | 0.10 | 0.00 | 0.00 |
hischl | -0.06 | 0.03 | 0.48 | 0.02 |
ltv_med | 0.54 | 0.07 | 0.00 | 0.00 |
ltv_high | 0.10 | 0.03 | 0.10 | 0.00 |
Table (4) shows the results. The joint p-values account for the fact that we conduct simultaneous inference on 13 differences of variables. We employ the so-called “single-step” methods for controlling the family-wise error rate and obtain the p-values by bootstrap. We find that 8 differences are jointly statistically different from zero at the 5% level and 9 at the 10% level.
We also plot the distributions of monthly debt-to-income ratio
(p_irat
) and monthly housing expenses-to-income ratio (hse_inc
) for
both groups. Such plots are useful if the user wants to visualize if
there is stochastic dominance between the two groups. To do so we use
the ca
command and change the interest
to "dist"
.
t2 <- c("p_irat", "hse_inc")
CAdist <- ca(fm = fm, data = mortgage, var = "black", method = "logit", t = t2,
b = 500, interest = "dist")
plot(CAdist, var = "p_irat", ylab = "Prob", xlab = "Monthly Debt-to-Income Ratio",
sub = "logit model")
plot(CAdist, var = "hse_inc", ylab = "Prob",
xlab = "Monthly Housing Expenses-to-Income Ratio", sub = "logit model")
plot.ca
. Bias-corrected
estimates and 90% confidence bands for the distribution of
p_irat
and hse_inc
in the 10% most and least
affected groups by the racial mortgage denial gap. Confidence bands
obtained by bootstrap with 500 replications.
Figure (2) shows that for both variables the distribution in the most affected group first-order stochastically dominates the distribution in least affected group.
In addition to means and distributions, we can conduct inference on the
sets of most and least affected units. Let subpop
provides
estimation and inference methods for these sets. The general syntax is
subpop(fm, data, method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"), var, compare,
subgroup = NULL, samp_weight = NULL, taus = c(5:95)/100, u = 0.1, alpha = 0.1,
b = 500, seed = 1, parallel = FALSE, ncores = detectCores(),
boot_type = c("nonpar", "weighted"))
No new option is introduced in the command. For theoretical details, we refer the reader to Chernozhukov et al. (2015).
The output of subpop
is a list containing six components: cs_most
,
cs_least
, u
, subgroup
, most
and least
. As the names indicate,
cs_most
and cs_least
denote the confidence sets for the most and
least affected groups. u
stores the percentile index that defines the
most and least affected groups. subgroup
stores the indicators for the
population of interest specified with the option subgroup
. most
and
least
store the estimates of the most and least affected units
respectively. The first four components are used in plot.subpop
and
the last two components can be visualized with summary.subpop
.
The general syntax of summary.subpop
is
summary.subpop(object, affected = c("most", "least"), vars = NULL, ...)
The option object
is the output of subpop
. The option affected
allows users to tabulate either the most or the least affected units,
and the option vars
provides summary statistics for user-specified
variables of interest. The summary statistics include the minimum, 1st
quartile, median, mean, 3rd quaritle and maximum. The default is NULL
,
which produces summary statistics of all the variables.
The plot.subpop
function plots 2-dimensional projections of the
confidence sets for the most and least affected units with respect to
two variables. The general syntax is
plot.subpop(object, varx, vary, xlim = NULL, ylim = NULL, main = NULL, sub = NULL,
xlab = NULL, ylab = NULL, overlap = FALSE, ...)
The user needs to specify the two variables for the projection with
varx
and vary
, and object
should be specified as the output of
subpop
. The option overlap
allows users to either keep or drop
common observations in both confidence sets. The default is
overlap = FALSE
, which drops the observations.
We estimate the 10% most and least affected applicants in the mortgage application.
set_b <- subpop(fm, data = mortgage, method = "logit", var = "black", u = 0.1,
alpha = 0.1, b = 500)
Using summary
, we can estimate the most/least affected applicants and
report summary statistics of the variables of interest in the most/least
affected groups. Table 5 lists the estimated most affected
applicants. For the purpose of illustration we only show the first ten
rows and columns.
groups <- summary(set_b, vars = c("p_irat", "hse_inc"))
most_affected <- groups$most_affected
deny | p_irat | black | hse_inc | loan_val | ccred | mcred | pubrec | denpmi | selfemp | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1.00 | 0.46 | 0.00 | 0.27 | 0.84 | 5.00 | 2.00 | 0.00 | 0.00 | 0.00 |
2 | 1.00 | 0.38 | 0.00 | 0.26 | 0.88 | 6.00 | 1.00 | 1.00 | 0.00 | 0.00 |
3 | 0.00 | 0.40 | 0.00 | 0.34 | 0.80 | 2.00 | 2.00 | 0.00 | 0.00 | 1.00 |
4 | 1.00 | 0.24 | 0.00 | 0.23 | 0.90 | 5.00 | 2.00 | 0.00 | 0.00 | 0.00 |
5 | 0.00 | 0.38 | 0.00 | 0.25 | 0.80 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
6 | 0.00 | 0.36 | 0.00 | 0.13 | 0.95 | 5.00 | 2.00 | 0.00 | 0.00 | 0.00 |
7 | 0.00 | 0.35 | 0.00 | 0.27 | 0.90 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 |
8 | 1.00 | 0.30 | 0.00 | 0.30 | 0.50 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
9 | 1.00 | 0.37 | 0.00 | 0.23 | 0.80 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
10 | 1.00 | 0.39 | 0.00 | 0.27 | 0.90 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
We can also report summary statistics of the variables of interest in
the most and least affected groups using the output of summary
. Table
(6) reports summary statistics for p_irat
and hse_inc
for
the applicants in the most affected group.
sum_stats_most <- groups$stats_most
p_irat | hse_inc | |
---|---|---|
Min | 0.16 | 0.01 |
1st Quartile | 0.34 | 0.23 |
Median | 0.37 | 0.28 |
Mean | 0.39 | 0.28 |
3rd Quartile | 0.42 | 0.32 |
Max | 1.16 | 0.74 |
We finally plot the projection of the confidence sets for the most and
least affected applicants with respect to p_irat
and hse_inc
. Figure
(3) keeps the overlapped observations and shows that the most
affected applicants tend to have higher levels of debt to income and
expenses to income ratios.
plot(set_b, varx = mortgage$p_irat, vary = mortgage$hse_inc, xlim = c(0, 1.5),
ylim = c(0, 1.5), xlab = "Debt/Income", ylab = "Housing expenses/Income",
overlap = TRUE)
p_irat
and hse_inc
. Confidence regions obtained by
bootstrap with 500 replications.Chernozhukov et al. (2018) derived the asymptotic distributions and bootstrap validity for the estimators of the SPE and classification analysis. The package uses bootstrap to compute standard errors and critical values for tests and confidence bands.
The package features nonparametric and weighted bootstrap. When
boot_type = "nonpar"
, the package draws samples with replacement of
the variables and samp_weight
and run all estimation commands weighted
by samp_weight
. When boot_type = "weighted"
, the package draws
weights from the standard exponential distribution and runs all
estimation commands weighted by the product of these weights and
samp_weight
. We use the
boot package (Canty and Ripley 2017), which
is flexible enough to accommodate both types.
Inference on SPE The
Inference on CA The joint
Inference on sets of least/most affected units The outer
Bias-correction Nonlinear estimators are prone to finite-sample
bias, and bootstrap methods can estimate the bias up to some asymptotic
order. To bias correct the SPE, replace
bc = TRUE
, the default.
We analyze the gender wage gap using data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015. The gender wage gap measures the difference in wages between female and male workers with the same observable characteristics. The SPE method allows us to look for heterogeneity in the gender wage gap and to identify the characteristics of the most and least affected workers. To retrieve the data, issue the command
data(wage2015)
The data contain the following variables: log hourly wages (lnw
); a
marital status factor ms
with 5 categories
widowed, divorced, separated, nevermarried, married
; CPS sampling
weights (weight
); a indicator for female worker (female
); an
education attainment factor educ
with 5 categories lhs
(less than
high school graduate), hsg
(high school graduate), sc
(some
college), cg
(college) and ad
(advanced degree); a region factor
region
with 4 categories mw
(midwest), so
(south), we
(west) and
ne
(northeast); potential work experience exp1
computed as
exp2, exp3, exp4
); an occupation factor occ
with 5
categories manager
, service
, sales
, construction
and
production
, and an industry factor ind
with 12 categories minery
,
construction
, manufacture
, retail
, transport
, information
,
finance
, professional
, education
, leisure
, services
and
public
.
The CPS data contains sampling weights in the variable weight
, so we
will set samp_weight = wage2015$weight
. Because women in general earn
less than men, the PEs are predominately negative if we use the female
indicator. To facilitate the interpretation of most and least affected
groups we create an indicator called male
, which assigns 0 to female
workers instead.
wage2015$male <- 1 - wage2015$female
We apply OLS regression to estimate the PEs using the following specification
fmla1 <- lnw ~ male*(ms + (exp1 + exp2 + exp3 + exp4)*educ + occ + ind + region)
We first look at the SPE of the gender wage gap at the quantile indexes
spe
and plot the result. We specify that the population of
interest is female workers with subgroup = wage2015[,"female"] == 1
.
gap <- spe(fm = fmla1, data = wage2015, samp_weight = wage2015$weight,
var = "male", subgroup = wage2015[,"female"] == 1, boot_type = "weighted",
us = c(2:98)/100, b = 500, bc = FALSE)
plot(x = gap, main = "APE and SPE of Gender Wage Gap for Women", sub = "OLS Model",
xlab = "Percentile Index", ylab = "Gender Wage Gap", ylim = c(-0.1, 0.45))
Figure (4) shows large heterogeneity in the gender wage gap that is missed if we only report the APE.
We also compare the SPE across subsets of women defined by marital
status. We implement this by changing the subgroup
options as follows
fem_mar <- wage2015[, "female"] == 1 & wage2015[, "ms"] == "married"
fem_nev <- wage2015[, "female"] == 1 & wage2015[, "ms"] == "nevermarried"
gap_mar <- spe(fm = fmla1, data = wage2015, samp_weight = wage2015$weight,
var = "male", subgroup = fem_mar, us = c(2:98)/100, b = 500,
bc = FALSE, boot_type = "weighted")
gap_nev <- spe(fm = fmla1, data = wage2015, samp_weight = wage2015$weight,
var = "male", subgroup = fem_nev, us = c(2:98)/100, b = 500,
bc = FALSE, boot_type = "weighted")
plot(x = gap_mar, main = "Married Women", sub = "OLS Model", xlab = "Percentile Index",
ylab = "Gender Wage Gap", ylim = c(-0.2, 0.45))
plot(x = gap_nev, main = "Never Married Women", sub = "OLS Model",
xlab = "Percentile Index", ylab = "Gender Wage Gap", ylim = c(-0.2, 0.45))
Figure (5) shows the results for the two subpopulations. Here we find large heterogeneity not only between married and never married women, but also within these more narrowly defined subpopulations.
Now we compare the differences in characteristics of the 5% most and
least affected women using weighted bootstrap with
tw <- c("lnw", "female", "ms", "educ", "region", "exp1", "occ", "ind")
Since many variables arecategory indicators of the same factor, we can
specify cat
as follows to get categorical p-values.
cat <- c("ms", "educ", "region", "occ", "ind")
Then we issue the ca
command and tabulate the mean characteristics of
the two groups
Char <- ca(fm = fmla1, data = wage2015, samp_weight = wage2015$weight, var = "male",
t = tw, cl = "both", b = 500, subgroup = wage2015[,"female"] == 1,
boot_type = "weighted", bc = FALSE, u = 0.05)
Table (7) reports the estimates and standard errors obtained by weighted bootstrap with 500 replications. We find that, compared to the 5% least affected women, the 5% most affected women are much more likely to be married, much less likely to be never married, less likely to have an advanced degree, live in the South, don’t live in Northeast and West, possess more potential experience, are more likely to have sales and non-managerial occupations, and work more often in manufacture, retail, transport and finance, and less often in education and leisure industries.
summary(Char)
Most | SE | Least | SE | |
---|---|---|---|---|
lnw | 2.97 | 0.10 | 3.02 | 0.06 |
female | 1.00 | 0.00 | 1.00 | 0.00 |
exp1 | 26.58 | 2.01 | 8.47 | 2.54 |
occ_manager | 0.21 | 0.13 | 0.77 | 0.08 |
occ_service | 0.04 | 0.03 | 0.12 | 0.06 |
occ_sales | 0.56 | 0.14 | 0.10 | 0.05 |
occ_construction | 0.00 | 0.01 | 0.01 | 0.01 |
occ_production | 0.19 | 0.08 | 0.01 | 0.01 |
ind_minery | 0.00 | 0.02 | 0.00 | 0.00 |
ind_construction | 0.00 | 0.01 | 0.01 | 0.01 |
ind_manufacture | 0.19 | 0.09 | 0.02 | 0.01 |
ind_retail | 0.17 | 0.14 | 0.03 | 0.02 |
ind_transport | 0.14 | 0.07 | 0.00 | 0.00 |
ind_information | 0.00 | 0.02 | 0.01 | 0.02 |
ind_finance | 0.43 | 0.17 | 0.02 | 0.01 |
ind_professional | 0.04 | 0.06 | 0.05 | 0.03 |
ind_education | 0.00 | 0.03 | 0.56 | 0.09 |
ind_leisure | 0.00 | 0.00 | 0.21 | 0.08 |
ind_services | 0.00 | 0.00 | 0.09 | 0.05 |
ind_public | 0.02 | 0.05 | 0.02 | 0.01 |
educ_lhs | 0.04 | 0.04 | 0.04 | 0.02 |
educ_hsg | 0.26 | 0.13 | 0.04 | 0.05 |
educ_sc | 0.50 | 0.15 | 0.12 | 0.06 |
educ_cg | 0.14 | 0.11 | 0.34 | 0.10 |
educ_ad | 0.07 | 0.07 | 0.46 | 0.08 |
ms_married | 1.00 | 0.02 | 0.02 | 0.04 |
ms_widowed | 0.00 | 0.00 | 0.03 | 0.07 |
ms_separated | 0.00 | 0.00 | 0.03 | 0.03 |
ms_divorced | 0.00 | 0.01 | 0.07 | 0.04 |
ms_nevermarried | 0.00 | 0.00 | 0.85 | 0.12 |
region_mw | 0.29 | 0.09 | 0.28 | 0.06 |
region_so | 0.47 | 0.11 | 0.25 | 0.06 |
region_we | 0.12 | 0.06 | 0.22 | 0.05 |
region_ne | 0.13 | 0.07 | 0.25 | 0.06 |
We also test the statistical significance of the mean differences. Table 8 shows that the differences mentioned above are significant after controlling for simultaneous inference within categories, but only the differences in marital status, potential experience and education industry remain jontly significant at the 5% level.
Chardiff <- ca(fm = fmla1, data = wage2015, samp_weight = wage2015$weight, var = "male",
t = tw, cl = "diff", b = 500, cat = cat, bc = FALSE, u = 0.05
subgroup = wage2015[, "female"] == 1, boot_type = "weighted",)
summary(Chardiff)
Estimate | SE | JP-vals | Cat P-vals | |
---|---|---|---|---|
lnw | -0.06 | 0.12 | 1.00 | 0.32 |
female | 0.00 | 0.00 | ||
exp1 | 18.12 | 3.49 | 0.09 | 0.00 |
occ_manager | -0.55 | 0.18 | 0.49 | 0.03 |
occ_service | -0.08 | 0.08 | 1.00 | 0.88 |
occ_sales | 0.46 | 0.17 | 0.67 | 0.09 |
occ_construction | -0.01 | 0.02 | 1.00 | 1.00 |
occ_production | 0.18 | 0.09 | 0.89 | 0.24 |
ind_minery | 0.00 | 0.02 | 1.00 | 1.00 |
ind_construction | -0.01 | 0.02 | 1.00 | 1.00 |
ind_manufacture | 0.18 | 0.10 | 0.97 | 0.79 |
ind_retail | 0.14 | 0.14 | 1.00 | 0.99 |
ind_transport | 0.14 | 0.07 | 0.90 | 0.63 |
ind_information | -0.00 | 0.04 | 1.00 | 1.00 |
ind_finance | 0.41 | 0.18 | 0.86 | 0.58 |
ind_professional | -0.00 | 0.08 | 1.00 | 1.00 |
ind_education | -0.56 | 0.10 | 0.07 | 0.04 |
ind_leisure | -0.21 | 0.09 | 0.79 | 0.50 |
ind_services | -0.09 | 0.05 | 0.98 | 0.82 |
ind_public | -0.00 | 0.06 | 1.00 | 1.00 |
educ_lhs | 0.00 | 0.05 | 1.00 | 1.00 |
educ_hsg | 0.22 | 0.16 | 1.00 | 0.64 |
educ_sc | 0.38 | 0.16 | 0.77 | 0.13 |
educ_cg | -0.21 | 0.16 | 1.00 | 0.69 |
educ_ad | -0.39 | 0.12 | 0.37 | 0.01 |
ms_married | 0.98 | 0.05 | 0.00 | 0.00 |
ms_widowed | -0.03 | 0.07 | 1.00 | 0.97 |
ms_separated | -0.03 | 0.03 | 1.00 | 0.83 |
ms_divorced | -0.07 | 0.05 | 0.99 | 0.56 |
ms_nevermarried | -0.85 | 0.12 | 0.02 | 0.01 |
region_mw | 0.01 | 0.15 | 1.00 | 1.00 |
region_so | 0.22 | 0.17 | 1.00 | 0.50 |
region_we | -0.10 | 0.11 | 1.00 | 0.75 |
region_ne | -0.13 | 0.12 | 1.00 | 0.71 |
Lastly we use show the functionality of the command subpop
. We plot
projections of 90% confidence sets for the 5% most and least affected
group with respect to two pairs of variables: log wages and potential
experience, and marital status and potential experience. The estimated
sets are obtained by weighted bootstrap with 500 repetitions and we drop
the overlapped observations.
set <- subpop(fm = fmla1, data = wage2015, var = "male", samp_weight = wage2015$weight,
boot_type = "weighted", b = 500, subgroup = wage2015[, "male"] == 0,
u = 0.05)
plot(set, varx = wage2015$exp1, vary = wage2015$lnw, main = "Projections of Exp-lnw",
sub = "OLS", xlab = "Exp", ylab = "Log Wages")
plot(set, varx = wage2015$exp1, vary =wage2015$ms, main = "Projections of Exp-MS",
sub = "OLS", xlab = "Exp", ylab = "Marital Status")
exp
and lnw
(left panel), and
exp
and ms
(right panel). Confidence regions
obtained by weighted bootstrap with 500 replications.
Figure (6) shows that there are relatively more least affected women with low experience at all wage levels, more high affected women with high wages with between 15 and 45 years of experience, and more least affected women which are not married at all experience levels.
We thank the editor Norman Matloff, Thomas Leeper and an anonymous referee for valuable suggestions on the paper and package. We gratefully acknowledge research support from the NSF.
SortedEffects, SortedEffect, quantreg, margins, parallel, boot
CausalInference, Econometrics, Environmetrics, MixedModels, Optimization, ReproducibleResearch, Robust, Survival, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
cl
doesn’t have any bearing when interest = "dist"
.[↩]View(mortgage)
to locate the variables
and set t <- c(rep(1, 4), 0, rep(1, 7), 0, 0, 1, 1)
.[↩]Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Chen, et al., "SortedEffects: Sorted Causal Effects in R", The R Journal, 2020
BibTeX citation
@article{RJ-2020-006, author = {Chen, Shuowen and Chernozhukov, Victor and Fernández-Val, Iván and Luo, Ye}, title = {SortedEffects: Sorted Causal Effects in R}, journal = {The R Journal}, year = {2020}, note = {https://rjournal.github.io/}, volume = {12}, issue = {1}, issn = {2073-4859}, pages = {131-146} }