Chernozhukov et al. (2018) proposed the sorted effect method for nonlinear regression models. This method consists of reporting percentiles of the partial effects, the sorted effects, in addition to the average effect commonly used to summarize the heterogeneity in the partial effects. They also propose to use the sorted effects to carry out classification analysis where the observational units are classified as most and least affected if their partial effect are above or below some tail sorted effects. The R package SortedEffects implements the estimation and inference methods therein and provides tools to visualize the results. This vignette serves as an introduction to the package and displays basic functionality of the functions within.
Many empirical questions in econometrics, machine learning and statistics boil down to studying how changes in a key variable \(T\) affect an outcome variable of interest \(Y\), holding fixed some control variables \(W\). Such effects are called predictive partial effects (PEs), or treatment or structural partial effects when they have a causal or structural interpretation. Depending on the context, researchers often work with models that feature nonlinearity in the key variable of interest or nonlinearity in parameters. The first kind of models includes mean and quantile regressions in which \(T\) is interacted with \(W\), while the second type includes generalized linear models such as logit and probit. The methods implemented in SortedEffects are designed to estimate and make inference on PEs in nonlinear models.
In nonlinear models the PEs vary with the underlying control variables. Consider the probit model as an example. The conditional probability of \(\left\{Y=1\right\}\) is \[\Pr\left(Y = 1 \mid T=t,W=w\right) =\Phi\left(t\beta+w'\gamma\right),\] where \(\Phi\) denotes CDF of standard normal distribution. Then, the PE of changing \(T\) on \(Y\), holding \(W\) fixed at \(w\), is \[\Phi\left(\beta+w'\gamma\right)-\Phi\left(w'\gamma\right)\] when \(T\) is binary with values \(0\) and \(1\), or \[\phi\left(t\beta+w'\gamma\right)\beta, \quad \phi\left(u\right) = \partial \Phi\left(u\right)/ \partial u,\] when \(T\) is continuous. Since \(W\) typically differ among observational units, the PEs are heterogeneous.
More generally, suppose we have a regression function \(g\left(X\right)\) corresponding to some characteristic of \(Y\) conditional on the covariates \(X:=\left[T, W\right]\). Let \(\Delta\left(x\right)\) denote the PE of \(T\) on \(Y\), holding \(W\) fixed. Then, \[\Delta\left(x\right)=g\left(t_{1}, w\right)-g\left(t_{0}, w\right)\] if \(T\) is discrete and takes values \(t_0\) and \(t_1\), or \[\Delta\left(x\right)=\partial g\left(t, w\right)/ \partial t\] if \(T\) is continuous and \(t \mapsto g\left(t,w\right)\) is differentiable. As \(\Delta\left(x\right)\) is a function of \(x\), the PE of an observational unit, \(\Delta\left(X\right)\), is a random variable with a distribution induced by the distribution of \(X\). A popular statistic to summarize the heterogeneity of \(\Delta\left(X\right)\) is the average partial effect (APE): \[E\left[\Delta(X)\right]=\int\Delta\left(x\right)d\mu\left(x\right),\] where \(\mu\) denotes the distribution of \(X\) in the population of interest. However, the APE might provide an incomplete summary of \(\Delta\left(X\right)\) as it neglects all the heterogeneity by design.
Chernozhukov et al. (2018) proposed the sorted partial effect (SPE) method to provide a more complete summary of \(\Delta\left(X\right)\). This method consists on reporting the entire set of PEs, sorted in increasing order and indexed by a user-specified ranking in the population of interest. More specifically, the SPEs are defined as percentiles of the PE in the population of interest, that is \[\Delta^{*}_{\mu}\left(u\right)=u^{th}-\text{quantile of } \Delta\left(X\right), \quad X \sim \mu.\] The SPEs are also useful to conduct classification analysis (CA). This analysis consists of two steps. First, classify the observational units with PEs above or below some thresholds defined by tail SPEs in the most or least affected groups. Second, report and compare the moments or distributions of the outcome and covariates in the two groups.
To apply the methods in practice, we need to replace \(\Delta\) and \(\mu\) by sample analogs \(\hat \Delta\) and \(\hat \mu\), and quantify the sampling uncertainty. Chernozhukov et al. (2018) derived the theoretical underpinnings of the resulting empirical SPEs. We refer the interested reader to that paper for details. We expect the method to be helpful to a large audience. For example, medical researchers are often interested in estimating the treatment effect of a drug. Treated units can experience different effects due to individual characteristics such as age and health status. The SortedEffects package thus provides a way to visualize the heterogeneity in the effect. Researchers can also define the most and least affected groups and compare their characteristics.
The rest of the vignette proceeds as follows. We first introduce the main functions within the SortedEffect package and use an application to racial-based discrimination in mortgage lending to illustrate the command options. Then we provide another application to gender wage gap using the Current Population Survey (CPS) data. The empirical results show that the SPE method is effective in uncovering heterogeneous effects and demographic differences between the most and least affected groups in both applications.
The package contains three main commands: spe
, ca
, and subpop
. The
package adds three methods to the generic plot()
(plot.spe
,
plot.ca
and plot.subpop
) and summary()
(summary.spe
,
summary.ca
and summary.subpop
). In practice, users only need to type
plot
and summary
since the generic commands will automatically
dispatch the appropriate method. In this section we explain the options
in each function respectively. Lastly we briefly explain the bootstrap
procedures for inference and bias correction. We provide the
mathematical expressions and an application to racial discrimation in
mortgage lending to facilitate the understanding of the options of the
commands.
The command spe
provides estimation and inference methods for the SPE
and APE. The general syntax is:
spe(fm, data, method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"), var, compare, subgroup = NULL,
samp_weight = NULL, us = c(1:9)/10, alpha = 0.1, taus = c(5:95)/100,
b = 500, parallel = FALSE, ncores = detectCores(), seed = 1, bc = TRUE,
boot_type = c("nonpar", "weighted"))
The option fm
stores the user-specified regression formula to estimate
the PEs, which assigns the outcome \(Y\) on the left-hand-side of a \(\sim\)
operator, and the covariates \(X\) on the right-hand-side, separated by
+
operators. The option data
specifies the data set that contains
the variables for the analysis. The user needs to specify the variable
of interest \(T\) in var
as a string. The package accommodates four
regression methods to estimate the PE, \(\Delta\left(x\right)\): OLS
(default), probit ("probit"
), logit ("logit"
) and quantile
regression ("QR"
).1 If "QR"
is called, then the user needs to
further specify the quantile indexes with taus
. The default is
taus = c(5:95)/100
.
The option samp_weight
allows the user to input sampling weights. When
samp_weight = NULL
, the default, the package automatically uses a
vector of ones, i.e. no sampling weights are used.
The package provides three options for the variable of interest var
:
var_type = "binary"
if var
is a binary variable such as a black
or female indicator.
var_type = "categorical"
if var
is a categorical (factor)
variable with more than 2 levels. One example is means of
transportation with labels "bus"
, "train"
or "bike"
. In this
case, the user needs to further specify the two labels to be
compared with the option compare
. If the user is interested in the
effect of changing from bus to bike, then
compare = c("bus", "bike")
. If the data only have levels, say
"1", "2"
and "3"
, the users need to input levels instead.
var_type = "continuous"
if var
is a continuous variable. The
package obtains the PEs using a central difference numerical
derivative of the form:
\[f'\left(x\right)=\lim_{h\rightarrow 0}\frac{f\left(x+h\right)-f\left(x-h\right)}{2h},\]
where \(h\) is set to be 1e-7
. We avoid the symbolic derivative
because it cannot correctly interpret terms involving I()
for
transformations of the variables such as powers, and can therefore
cause erroneous estimation. The code of this part is inspired by
Thomas Leeper’s vignette on the
margins package
(Leeper et al. 2018).
The option subgroup
allows the user to specify the population of
interest. The default is NULL
, which corresponds to the entire
population. If the user is interested in subpopulations, say households
whose income
is lower than \(10,000\) dollars, and the data in use is
called Data
, then the user can set
subgroup = Data[, "income"] < 10000
. When \(T\) is a binary treatment
indicator, the user can specify that the population of interest is the
treated population with subgroup = Data[, var] == 1
. Note that users
cannot input subgroup directly to the data
option or using the
subset
option because the SPE methods use the whole sample to estimate
the PEs. The option subgroup
only specifies the population for the
estimation of \(\mu\), the distribution of \(X\).
The option us
specifies the set of quantile indexes corresponding to
the estimated SPE to be reported. The mathematical definition of each
empirical SPE is
\[\widehat{\Delta}^{*}_{\widehat{\mu}}\left(u\right) = u^{th}-\text{quantile of } \widehat{\Delta}\left(X\right), \quad X \sim \widehat{\mu}.\]
The empirical SPE function
\[\left\{u\mapsto \widehat{\Delta}^{*}_{\widehat{\mu}}\left(u\right): u\in\mathcal{U}\right\}, \quad \mathcal{U}\subset\left[0,1\right]\]
then outputs the SPEs with quantile indexes in the set \(\mathcal{U}\).
The option us
specifies \(\mathcal{U}\). For example,
us = c(0.25, 0.5, 0.75)
specifies to report the SPEs corresponding to
the three quartiles.
The option alpha
specifies the significance level of the confidence
bands. The default is alpha = 0.1
, i.e. \(90\%\) confidence level. The
option b
specifies the number of bootstrap repetitions. The default is
b = 500
.
The package supports two types of bootstrap: nonparametric
(boot_type = "nonpar"
) and weighted with standard exponential weights
(boot_type = "weighted"
). The user can fix the random seed for
bootstrap simulation with the option seed
and decide whether or not to
bias-correct the estimates with the option bc
. The default is
bc = TRUE
. Bootstrap and bias-correction will be discussed in section
2.5. The package features parallel computing, which is convenient to
speed up the bootstrap. The user can use the option parallel
to turn
on or off parallel computing. If parallel = TRUE
, the option ncores
allows the user to specify the number of cores in the parallel
computing. The default is ncores = detectCores()
, where
detectCores()
is a command of the package
parallel.
The output of spe
is a list containing four components: spe
, ape
,
us
and alpha
. As the names indicate, spe
stores the results for
the SPE, and ape
stores the results for the APE. Each component is a
list with four elements: the estimates, lower and upper bounds of the
uniform confidence bands, and bootstrapped standard errors obtained as
rescaled interquartile ranges; see Chernozhukov et al. (2018). The other two
components respectively store the percentile indices of the SPE and
significance level of the confidence bands, which are used in the
functions plot.spe
and summary.spe
.
The plot.spe
function plots the result of spe
in one graph that
includes the SPEs, APE and their corresponding confidence bands. Its
general syntax is
plot.spe(object, ylim = NULL, main = NULL, sub = NULL, xlab = "Percentile Index",
ylab = "Sorted Effects", ...)
where object
is the output of spe
. The range of the x-axis is fixed
to be the range of user-specified quantile index us
. The options
ylim
, xlab
and ylab
respectively denote range of the y-axis and
labels of the two axes. The options main
and sub
allow users to
specify the main and sub titles of the plot. The option ...
is an
argument of the generic plot
command that allows for further graphic
parameters.
The syntax of summary.spe
is as follows
summary.spe(object, result = c("sorted", "average"), ...)
If result = "sorted"
, the method provides a table that contain the SPE
percentile indexes, estimates, bootstrap standard errors, pointwise and
uniform confidence bands. If result = "average"
, the methods tabulates
APE estimate, bootstrap standard error and confidence interval.
We illustrate the usage of the command with an empirical application to racial discrimination in mortgage lending. We use data on mortgage applications in Boston from 1990 (Munnell et al. 1996). The Federal Reserve Bank of Boston collected these data in relation to the Home Mortgage Disclosure Act (HMDA), which was passed to monitor minority access to the mortgage market. To retrieve the data from the package, issue the command
data("mortgage")
The outcome variable, \(Y\), is deny
, a binary indicator for mortgage
denial. The key variable of interest, \(T\), is black
, a binary
indicator for the applicant being black, while the control variables,
\(W\), are financial and demographical characteristics that might affect
the mortgage decision of the bank. These characteristics include the
debt-to-income ratio (p_irat
), expenses-to-income ratio (hse_inc
),
bad consumer credit (ccred
), bad mortgage credit (mcred
), credit
problems (pubrec
), denied mortgage insurance (denpmi
), medium
loan-to-house value (ltv_med
), high loan-to-house value (ltv_high
),
self employed (selfemp
), single (single
), and high school graduate
(hischl
). The regression formula for the estimation of the PEs is
specified as:
<- deny ~ black + p_irat + hse_inc + ccred + mcred + pubrec + ltv_med + ltv_high +
fm + selfemp + single + hischl denpmi
We invoke the spe
command to calculate the bias-corrected estimates of
the SPE at the quantile indexes
\(\left\{0.02, 0.03, \ldots, 0.98 \right\}\) for the entire population
using a logit model.
<- spe(fm = fm, data = mortgage, var = "black", method = "logit", us = c(2:98)/100,
test b = 500, bc = TRUE)
The output test
includes the estimates and confidence bands for the
APE and SPE in the entire population. We use plot
to visualize the
results.
plot(x = test, ylim = c(0, 0.25), ylab = "Change in Probability",
main = "APE and SPE of Being Black on the Prob of Mortgage Denial",
sub = "Logit Model")
The result in Figure (1) shows significant heterogeneity in the SPEs, with the PEs ranging from 0 to 14%. The APE misses this heterogeneity, and therefore provides an incomplete picture of the effects.
We can also tabulate the result using summary
. First, we apply the
command to the APE and display the results in Table (1).
summary(test, result = "average")
Est | SE | 90% LB | 90% UB | |
---|---|---|---|---|
APE | 0.051 | 0.019 | 0.021 | 0.081 |
Next, we obtain the results for the SPE. To save space we only show the
first 15 rows of summary(test)
in Table (2). The columns
labelled as PLB and PUB correspond to the lower and upper pointwise
confidence bands, whereas the columns labelled as ULB and UUB correspond
to their uniform counterparts.
summary(test)
Est | SE | 90% PLB | 90% PUB | 90% ULB | 90% UUB | |
---|---|---|---|---|---|---|
0.02 | 0.011 | 0.005 | 0.003 | 0.018 | 0.001 | 0.020 |
0.03 | 0.012 | 0.005 | 0.004 | 0.020 | 0.002 | 0.022 |
0.04 | 0.013 | 0.006 | 0.004 | 0.022 | 0.002 | 0.024 |
0.05 | 0.014 | 0.006 | 0.004 | 0.023 | 0.002 | 0.025 |
0.06 | 0.014 | 0.006 | 0.004 | 0.024 | 0.002 | 0.026 |
0.07 | 0.015 | 0.007 | 0.004 | 0.026 | 0.002 | 0.028 |
0.08 | 0.016 | 0.007 | 0.005 | 0.027 | 0.003 | 0.029 |
0.09 | 0.017 | 0.007 | 0.005 | 0.029 | 0.003 | 0.031 |
0.1 | 0.018 | 0.007 | 0.006 | 0.030 | 0.003 | 0.032 |
0.11 | 0.018 | 0.008 | 0.005 | 0.031 | 0.003 | 0.033 |
0.12 | 0.019 | 0.008 | 0.006 | 0.031 | 0.004 | 0.034 |
0.13 | 0.019 | 0.008 | 0.006 | 0.032 | 0.004 | 0.035 |
0.14 | 0.020 | 0.008 | 0.007 | 0.034 | 0.004 | 0.036 |
0.15 | 0.021 | 0.008 | 0.007 | 0.034 | 0.004 | 0.037 |
0.16 | 0.021 | 0.009 | 0.007 | 0.035 | 0.004 | 0.038 |
The command ca
provides estimation and inference methods for the CA.
The general syntax is:
ca(fm, data, method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"), var,
subgroup = NULL, samp_weight = NULL, taus = c(5:95)/100,
compare, u = 0.1, interest = c("moment", "dist"),
t = c(1, 1, rep(0, dim(data)[2] - 2)), cl = c("both", "diff"),
cat = NULL, alpha = 0.1, b = 500, parallel = FALSE,
ncores = detectCores(), seed = 1, bc = TRUE,
range_cb = c(1:99)/100, boot_type = c("nonpar", "weighted"))
The first step in the CA is to classify the observational units in most
and least affected groups based on some tail SPEs. The option u
specifies the quantile index of the tail SPEs. Thus, the \(u\)-least
affected group includes the observational units with
\(\widehat{\Delta}(X)<\widehat{\Delta}^{*}_{\widehat{\mu}}\left(u\right)\)
and the \(u\)-most affected group the units with
\(\widehat{\Delta}(X)>\widehat{\Delta}^{*}_{\widehat{\mu}}\left(1-u\right)\).
The default is u = 0.1
to obtain the 10% least and most affected
groups. The option subgroup
specifies the population of interest and
has the same syntax as in the spe
command.
Let
\(\widehat{\Lambda}^{-u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\)
and
\(\widehat{\Lambda}^{+u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\)
denote the objects of interest in the least and most affected groups for
the CA. Define
\[\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right) := \left[\widehat{\Lambda}^{-u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right), \widehat{\Lambda}^{+u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\right].\]
These objects are indexed by the vector \(t\), which specifies the
variables of interest among the outcome and covariates. The option t
is a vector that specifies \(t\). Suppose the data has 5 variables
("a", "b", "c", "d", "e"
) and we are interested in "a"
and "c"
,
then we can either set t = c("a", "c")
directly or
t = c(1, 0, 1, 0, 0)
. The second approach requires the user to know
the order of the variables in the data set, which can be found with the
command View
.
Let \(Z\) denote the set of variables of interest. The package provides
two types of objects of interest. If interest = "moment"
, then
\(\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\)
include the means of the variables in \(Z\) for the least and most
affected groups. If interest = "dist"
, then
\(\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\)
includes the distributions of the variables in \(Z\) for the least and
most affected groups.
If interest = "moment"
, ca
estimates and makes inference on features
of the chosen variables of interest in the least and most affected
groups. These features are specified with the option cl
. For example,
if cl = "both"
, the command estimates the mean of the variables in the
two affected groups. On the other hand, cl = "diff"
estimates the
differences of the means of the variables between the two groups.
If interest = "dist"
, the option range_cb
specifies the region of
interest for the domain of the distribution2. For example, if the
variable of interest x
is discrete, we can specify the region of
interest as the support of x
with range_cb = NULL
. If x
is
continuous, we can specify range_cb = c(1:99)/100
if we are interested
in the percentiles with indexes
\(\left\{0.01, 0.03, \ldots, 0.99\right\}\). The default is
range_cb = c(1:99)/100
. The choice range_cb = NULL
shuts down this
feature by settimg the region of interest as all the distint values of
the variable.
The output of ca
depends on the choice of interest
. For
interest = "moment"
, the output is a list containing the estimates of
\(\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\),
bootstrapped standard errors, pointwise and adjusted p-values. The null
hypothesis for the p-values is that the estimated coefficient is zero.
The p-values are adjusted for multiplicity to account for joint testing
for all variables. In addition, users can adjust the pointwise p-values
to account for joint testing for all simulataneous tests of categories
within a factor. For example, if the variables of interest include a
marital status factor "ms"
with labels
("nevermarried", "married", "divorced", "separated", "widowed")
, then
users could consider adjusting the pointwise p-values within this
factor. To illustrate how to define the option cat
, suppose we have
selected specified 3 variables of interest: t = c("a", "b", "c")
.
Without loss of generality, assume "a"
is not a factor, while "b"
and "c"
are two factors. Then we need to specify cat
as
cat = c("b", "c")
. If cat = NULL
, we report the unadjusted pointwise
p-values. If interest = "dist"
, the output is a list containing the
rearranged estimates, upper confidence bands and lower confidence bands
for the variables of interest in both groups.
When interest = "moment"
, the user can use method summary.ca
to
tabulate the output. The general syntax is
summary.ca(object, ...)
If cl = "both"
, the p-values are omitted from the table. When
interest = "dist"
, users can plot the output for better visualization.
The general syntax is
plot.ca(object, var, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ...)
The user needs to input the variable for plotting with the option var
.
Note that the variable must be one of the variables specified in t
.
Returning to the mortgage denial example, we classify the 10% least and
most affected applicants and compare their characteristics. The
variables of interest include deny
, black
and all the controls. We
first specify t
to reflect the choice of variables of interest3
<- c("deny", "p_irat", "black", "hse_inc", "ccred", "mcred", "pubrec", "denpmi",
t "selfemp", "single", "hischl", "ltv_med", "ltv_high")
Then we invoke the ca
command and summarize the result.
<- ca(fm = fm, data = mortgage, var = "black", method = "logit", cl = "both",
CA t = t, b = 500, bc = TRUE)
summary(CA)
Most | SE | Least | SE | |
---|---|---|---|---|
deny | 0.45 | 0.03 | 0.09 | 0.04 |
p_irat | 0.39 | 0.01 | 0.25 | 0.02 |
black | 0.38 | 0.03 | 0.06 | 0.02 |
hse_inc | 0.28 | 0.01 | 0.21 | 0.02 |
ccred | 4.80 | 0.26 | 1.28 | 0.09 |
mcred | 2.01 | 0.06 | 1.36 | 0.10 |
pubrec | 0.46 | 0.05 | 0.05 | 0.02 |
denpmi | 0.01 | 0.01 | 0.04 | 0.03 |
selfemp | 0.17 | 0.04 | 0.04 | 0.03 |
single | 0.61 | 0.06 | 0.09 | 0.07 |
hischl | 0.93 | 0.03 | 1.00 | 0.01 |
ltv_med | 0.59 | 0.06 | 0.05 | 0.04 |
ltv_high | 0.12 | 0.04 | 0.01 | 0.01 |
Table (3) shows that the 10% of the applicants most affected by the racial mortgage denial gap are more likely to have either of the following characteristics relative to the 10% of the least affected applicants: mortgage denied, high debt-to-income ratio, black, high expense-to-income ratio, bad consumer or credit scores, credit problems, self employed, single, no high school diploma, and medium or high loan-to-income ratio.
Next we test if the differences in the characteristics between the two
groups are statistically significant. To do so we set cl = "diff"
,
which means taking difference between the two groups. The full command
is as follows
<- ca(fm = fm, data = mortgage, var = "black", t = t, method = "logit",
CAdiff cl = "diff", b = 500, bc = TRUE)
summary(CAdiff)
Estimate | SE | JP-vals | P-vals | |
---|---|---|---|---|
deny | 0.36 | 0.05 | 0.00 | 0.00 |
p_irat | 0.14 | 0.02 | 0.00 | 0.00 |
black | 0.32 | 0.04 | 0.00 | 0.00 |
hse_inc | 0.07 | 0.02 | 0.13 | 0.00 |
ccred | 3.52 | 0.28 | 0.00 | 0.00 |
mcred | 0.65 | 0.15 | 0.01 | 0.00 |
pubrec | 0.41 | 0.05 | 0.00 | 0.00 |
denpmi | -0.03 | 0.04 | 1.00 | 0.23 |
selfemp | 0.13 | 0.06 | 0.48 | 0.02 |
single | 0.53 | 0.10 | 0.00 | 0.00 |
hischl | -0.06 | 0.03 | 0.48 | 0.02 |
ltv_med | 0.54 | 0.07 | 0.00 | 0.00 |
ltv_high | 0.10 | 0.03 | 0.10 | 0.00 |
Table (4) shows the results. The joint p-values account for the fact that we conduct simultaneous inference on 13 differences of variables. We employ the so-called “single-step” methods for controlling the family-wise error rate and obtain the p-values by bootstrap. We find that 8 differences are jointly statistically different from zero at the 5% level and 9 at the 10% level.
We also plot the distributions of monthly debt-to-income ratio
(p_irat
) and monthly housing expenses-to-income ratio (hse_inc
) for
both groups. Such plots are useful if the user wants to visualize if
there is stochastic dominance between the two groups. To do so we use
the ca
command and change the interest
to "dist"
.
<- c("p_irat", "hse_inc")
t2 <- ca(fm = fm, data = mortgage, var = "black", method = "logit", t = t2,
CAdist b = 500, interest = "dist")
plot(CAdist, var = "p_irat", ylab = "Prob", xlab = "Monthly Debt-to-Income Ratio",
sub = "logit model")
plot(CAdist, var = "hse_inc", ylab = "Prob",
xlab = "Monthly Housing Expenses-to-Income Ratio", sub = "logit model")
Figure (2) shows that for both variables the distribution in the most affected group first-order stochastically dominates the distribution in least affected group.
In addition to means and distributions, we can conduct inference on the
sets of most and least affected units. Let \(\mathcal{Z}\) be a compact
subset of the support of the outcome and covariates. Define
\[\mathcal{M}^{-u}\equiv \left\{\left(x,y\right)\in\mathcal{Z}: \Delta\left(x\right)\leq\Delta^{*}_{\mu}\left(u\right)\right\}\]
as the set of the least affected units and
\[\mathcal{M}^{+u}\equiv \left\{\left(x,y\right)\in\mathcal{Z}: \Delta\left(x\right)\geq\Delta^{*}_{\mu}\left(1-u\right)\right\}\]
as the set of the most affected units. The command subpop
provides
estimation and inference methods for these sets. The general syntax is
subpop(fm, data, method = c("ols", "logit", "probit", "QR"),
var_type = c("binary", "continuous", "categorical"), var, compare,
subgroup = NULL, samp_weight = NULL, taus = c(5:95)/100, u = 0.1, alpha = 0.1,
b = 500, seed = 1, parallel = FALSE, ncores = detectCores(),
boot_type = c("nonpar", "weighted"))
No new option is introduced in the command. For theoretical details, we refer the reader to Chernozhukov et al. (2015).
The output of subpop
is a list containing six components: cs_most
,
cs_least
, u
, subgroup
, most
and least
. As the names indicate,
cs_most
and cs_least
denote the confidence sets for the most and
least affected groups. u
stores the percentile index that defines the
most and least affected groups. subgroup
stores the indicators for the
population of interest specified with the option subgroup
. most
and
least
store the estimates of the most and least affected units
respectively. The first four components are used in plot.subpop
and
the last two components can be visualized with summary.subpop
.
The general syntax of summary.subpop
is
summary.subpop(object, affected = c("most", "least"), vars = NULL, ...)
The option object
is the output of subpop
. The option affected
allows users to tabulate either the most or the least affected units,
and the option vars
provides summary statistics for user-specified
variables of interest. The summary statistics include the minimum, 1st
quartile, median, mean, 3rd quaritle and maximum. The default is NULL
,
which produces summary statistics of all the variables.
The plot.subpop
function plots 2-dimensional projections of the
confidence sets for the most and least affected units with respect to
two variables. The general syntax is
plot.subpop(object, varx, vary, xlim = NULL, ylim = NULL, main = NULL, sub = NULL,
xlab = NULL, ylab = NULL, overlap = FALSE, ...)
The user needs to specify the two variables for the projection with
varx
and vary
, and object
should be specified as the output of
subpop
. The option overlap
allows users to either keep or drop
common observations in both confidence sets. The default is
overlap = FALSE
, which drops the observations.
We estimate the 10% most and least affected applicants in the mortgage application.
<- subpop(fm, data = mortgage, method = "logit", var = "black", u = 0.1,
set_b alpha = 0.1, b = 500)
Using summary
, we can estimate the most/least affected applicants and
report summary statistics of the variables of interest in the most/least
affected groups. Table 5 lists the estimated most affected
applicants. For the purpose of illustration we only show the first ten
rows and columns.
<- summary(set_b, vars = c("p_irat", "hse_inc"))
groups <- groups$most_affected most_affected
deny | p_irat | black | hse_inc | loan_val | ccred | mcred | pubrec | denpmi | selfemp | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1.00 | 0.46 | 0.00 | 0.27 | 0.84 | 5.00 | 2.00 | 0.00 | 0.00 | 0.00 |
2 | 1.00 | 0.38 | 0.00 | 0.26 | 0.88 | 6.00 | 1.00 | 1.00 | 0.00 | 0.00 |
3 | 0.00 | 0.40 | 0.00 | 0.34 | 0.80 | 2.00 | 2.00 | 0.00 | 0.00 | 1.00 |
4 | 1.00 | 0.24 | 0.00 | 0.23 | 0.90 | 5.00 | 2.00 | 0.00 | 0.00 | 0.00 |
5 | 0.00 | 0.38 | 0.00 | 0.25 | 0.80 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
6 | 0.00 | 0.36 | 0.00 | 0.13 | 0.95 | 5.00 | 2.00 | 0.00 | 0.00 | 0.00 |
7 | 0.00 | 0.35 | 0.00 | 0.27 | 0.90 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 |
8 | 1.00 | 0.30 | 0.00 | 0.30 | 0.50 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
9 | 1.00 | 0.37 | 0.00 | 0.23 | 0.80 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
10 | 1.00 | 0.39 | 0.00 | 0.27 | 0.90 | 6.00 | 2.00 | 1.00 | 0.00 | 0.00 |
We can also report summary statistics of the variables of interest in
the most and least affected groups using the output of summary
. Table
(6) reports summary statistics for p_irat
and hse_inc
for
the applicants in the most affected group.
<- groups$stats_most sum_stats_most
p_irat | hse_inc | |
---|---|---|
Min | 0.16 | 0.01 |
1st Quartile | 0.34 | 0.23 |
Median | 0.37 | 0.28 |
Mean | 0.39 | 0.28 |
3rd Quartile | 0.42 | 0.32 |
Max | 1.16 | 0.74 |
We finally plot the projection of the confidence sets for the most and
least affected applicants with respect to p_irat
and hse_inc
. Figure
(3) keeps the overlapped observations and shows that the most
affected applicants tend to have higher levels of debt to income and
expenses to income ratios.
plot(set_b, varx = mortgage$p_irat, vary = mortgage$hse_inc, xlim = c(0, 1.5),
ylim = c(0, 1.5), xlab = "Debt/Income", ylab = "Housing expenses/Income",
overlap = TRUE)
Chernozhukov et al. (2018) derived the asymptotic distributions and bootstrap validity for the estimators of the SPE and classification analysis. The package uses bootstrap to compute standard errors and critical values for tests and confidence bands.
The package features nonparametric and weighted bootstrap. When
boot_type = "nonpar"
, the package draws samples with replacement of
the variables and samp_weight
and run all estimation commands weighted
by samp_weight
. When boot_type = "weighted"
, the package draws
weights from the standard exponential distribution and runs all
estimation commands weighted by the product of these weights and
samp_weight
. We use the
boot package (Canty and Ripley 2017), which
is flexible enough to accommodate both types.
Inference on SPE The \(\left(1-\alpha\right)\)-uniform confidence band of \(\Delta^{*}_{\mu}\left(u\right)\) in \(\mathcal{U}\) is \[\left[\widehat{\Delta}^{*}_{\widehat{\mu}}-\widehat{t}_{1-\alpha}\left(\mathcal{U}\right)\widehat{\Sigma}\left(u\right)^{1/2}/\sqrt{n}, \quad \widehat{\Delta}^{*}_{\widehat{\mu}}+\widehat{t}_{1-\alpha}\left(\mathcal{U}\right)\widehat{\Sigma}\left(u\right)^{1/2}/\sqrt{n}\right],\] where \(\widehat{t}_{1-\alpha}\left(\mathcal{U}\right)\) is a bootstrapped uniform critical value and \(\widehat{\Sigma}\left(u\right)^{1/2}\) is a boostrapped standard error of \(\widehat{\Delta}^{*}_{\widehat{\mu}}\left(u\right)\).4 To deal with the possibility that the end-point functions of the confidence band \(u \mapsto \widehat{\Delta}^{*}_{\widehat{\mu}}\pm\widehat{t}_{1-\alpha}\left(\mathcal{U}\right)\widehat{\Sigma}\left(u\right)^{1/2}/\sqrt{n}\) be nonincreasing, we monotonize these functions via rearrangement (Chernozhukov et al. 2009). The \(\left(1-\alpha\right)\)-pointwise confidence band of \(\Delta^{*}_{\mu}\left(u\right)\) is obtained replacing \(\widehat{t}_{1-\alpha}\left(\mathcal{U}\right)\) by the \(\left(1-\alpha/2\right)\)-quantile of the standard normal distribution.
Inference on CA The joint \(p\)-value for the hypothesis \(\Lambda^{+u}_{\Delta, \mu}\left(t\right) = \Lambda^{-u}_{\Delta, \mu}\left(t\right)\) for all \(t\in\mathcal{T}\) is \[\Pr\left(\hat{t}^{u}\left(\mathcal{T}\right) > \sup_{t\in\mathcal{T}} \frac{|c'\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}(t)c|}{\sqrt{c'\hat{\Sigma}^{u}\left(t\right)c}}\right),\] where \(c=\left(-1,1\right)'\), \(\hat{t}^{u}\left(\mathcal{T}\right)\) is a bootstrap estimator of \[t^{u}\left(\mathcal{T}\right) = \sup_{t\in\mathcal{T}} \frac{|c'\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)c - c'\Lambda^{u}_{\Delta, \mu}\left(t\right)c|}{\sqrt{c'\Sigma^{u}\left(t\right) c}},\] and \(\hat{\Sigma}^{u}\left(t\right)\) is a bootstrap estimator of \(\Sigma^{u}\left(t\right)\), the asymptotic variance of \(\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\).5 The poitwise \(p\)-value for the hypothesis \(\Lambda^{+u}_{\Delta, \mu}\left(t\right) = \Lambda^{-u}_{\Delta, \mu}\left(t\right)\) is obtained by setting \(\mathcal{T} = \left\{t\right\}\).
Inference on sets of least/most affected units The outer \(\left(1-\alpha\right)\)-confidence set for \(\mathcal{M}^{-u}\) is \[\mathcal{C}\mathcal{M}^{-u}\left(1-\alpha\right)=\left\{\left(x, y\right)\in\mathcal{Z}: \widehat{\Sigma}^{-1/2}\left(x, u\right)\sqrt{n}\left[\widehat{\Delta}\left(x\right)-\widehat{\Delta}^{*}_{\mu}\left(u\right)\right]\leq\hat{c}\left(1-\alpha\right)\right\},\] where \(\widehat{\Sigma}\left(x, u\right)\) is an estimator of the asymptotic variance of \(\widehat{\Delta}\left(x\right)-\widehat{\Delta}^{*}_{\mu}\left(u\right)\). Similarly the outer \(\left(1-\alpha\right)\)-confidence set for \(\mathcal{M}^{+u}\) is \[\mathcal{C}\mathcal{M}^{+u}(1-\alpha)=\left\{\left(x, y\right)\in\mathcal{Z}: \widehat{\Sigma}^{-1/2}\left(x, 1-u\right)\sqrt{n}\left[\widehat{\Delta}^{*}_{\mu}\left(1-u\right)-\widehat{\Delta}\left(x\right)\right]\leq\tilde{c}\left(1-\alpha\right)\right\}.\] The critical value \(\hat{c}\left(1-\alpha\right)\) is the \(\left(1-\alpha\right)\)-quantile of the statistic: \[\tilde{V}^{*}_{\infty}=\sup_{\left\{x\in\mathcal{X}:\widehat{\Delta}\left(x\right)=\widehat{\Delta}^{*}_{\mu}\left(u\right)\right\}}\widehat{\Sigma}^{-1/2}\left(x, u\right)\sqrt{n}\left(\left[\tilde{\Delta}\left(x\right)-\tilde{\Delta^{*}_{\mu}}\left(u\right)\right]-\left[\widehat{\Delta}\left(x\right)-\widehat{\Delta}^{*}_{\mu}\left(u\right)\right]\right).\] while the critical value \(\tilde{c}\left(1-\alpha\right)\) is the \(\left(1-\alpha\right)\)-quantile of the statistic: \[\tilde{V}^{**}_{\infty}=\sup_{\left\{x\in\mathcal{X}:\widehat{\Delta}\left(x\right)=\widehat{\Delta}^{*}_{\mu}\left(1-u\right)\right\}}\widehat{\Sigma}^{-1/2}\left(x, 1-u\right)\sqrt{n}\left(\left[\tilde{\Delta}\left(x\right)-\tilde{\Delta^{*}_{\mu}}\left(1-u\right)\right]-\left[\widehat{\Delta}\left(x\right)-\widehat{\Delta}^{*}_{\mu}\left(1-u\right)\right]\right).\] To implement \(\sup\left\{x\in\mathcal{X}: \widehat{\Delta}\left(x\right)=\widehat{\Delta}^{*}_{\mu}\left(u\right)\right\}\) in the code, we find the minimum of \(|\widehat{\Delta}\left(x\right)-\widehat{\Delta}^{*}_{\widehat{\mu}}\left(u\right)|\) among all \(x\)’s.
Bias-correction Nonlinear estimators are prone to finite-sample
bias, and bootstrap methods can estimate the bias up to some asymptotic
order. To bias correct the SPE, replace
\(\widehat{\Delta}^{*}_{\widehat{\mu}}\) with
\(2\widehat{\Delta}^{*}_{\widehat{\mu}}-\overline{\Delta^{*}_{\mu}}\),
where \(\overline{\Delta^{*}_{\mu}}\) is the mean of bootstrap draws.
Similarly, for CA we replace
\(\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right)\)
with
\(2\widehat{\Lambda}^{u}_{\widehat{\Delta}, \widehat{\mu}}\left(t\right) - \overline{\Lambda^{u}_{\Delta, \mu}}\),
where \(\overline{\Lambda^{u}_{\Delta, \mu}}\) is the mean of the
bootstrap draws. Bias-corrected estimates and corresponding inference
will be reported if bc = TRUE
, the default.
We analyze the gender wage gap using data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015. The gender wage gap measures the difference in wages between female and male workers with the same observable characteristics. The SPE method allows us to look for heterogeneity in the gender wage gap and to identify the characteristics of the most and least affected workers. To retrieve the data, issue the command
data(wage2015)
The data contain the following variables: log hourly wages (lnw
); a
marital status factor ms
with 5 categories
widowed, divorced, separated, nevermarried, married
; CPS sampling
weights (weight
); a indicator for female worker (female
); an
education attainment factor educ
with 5 categories lhs
(less than
high school graduate), hsg
(high school graduate), sc
(some
college), cg
(college) and ad
(advanced degree); a region factor
region
with 4 categories mw
(midwest), so
(south), we
(west) and
ne
(northeast); potential work experience exp1
computed as
\(\max\left\{0, \text{age - years of educ} -7\right\}\); 4 powers of
experience (exp2, exp3, exp4
); an occupation factor occ
with 5
categories manager
, service
, sales
, construction
and
production
, and an industry factor ind
with 12 categories minery
,
construction
, manufacture
, retail
, transport
, information
,
finance
, professional
, education
, leisure
, services
and
public
.
The CPS data contains sampling weights in the variable weight
, so we
will set samp_weight = wage2015$weight
. Because women in general earn
less than men, the PEs are predominately negative if we use the female
indicator. To facilitate the interpretation of most and least affected
groups we create an indicator called male
, which assigns 0 to female
workers instead.
$male <- 1 - wage2015$female wage2015
We apply OLS regression to estimate the PEs using the following specification
<- lnw ~ male*(ms + (exp1 + exp2 + exp3 + exp4)*educ + occ + ind + region) fmla1
We first look at the SPE of the gender wage gap at the quantile indexes
\(\left\{0.02, 0.03, \ldots, 0.98\right\}\) in the population of women via
the command spe
and plot the result. We specify that the population of
interest is female workers with subgroup = wage2015[,"female"] == 1
.
<- spe(fm = fmla1, data = wage2015, samp_weight = wage2015$weight,
gap var = "male", subgroup = wage2015[,"female"] == 1, boot_type = "weighted",
us = c(2:98)/100, b = 500, bc = FALSE)
plot(x = gap, main = "APE and SPE of Gender Wage Gap for Women", sub = "OLS Model",
xlab = "Percentile Index", ylab = "Gender Wage Gap", ylim = c(-0.1, 0.45))
Figure (4) shows large heterogeneity in the gender wage gap that is missed if we only report the APE.
We also compare the SPE across subsets of women defined by marital
status. We implement this by changing the subgroup
options as follows
<- wage2015[, "female"] == 1 & wage2015[, "ms"] == "married"
fem_mar <- wage2015[, "female"] == 1 & wage2015[, "ms"] == "nevermarried"
fem_nev
<- spe(fm = fmla1, data = wage2015, samp_weight = wage2015$weight,
gap_mar var = "male", subgroup = fem_mar, us = c(2:98)/100, b = 500,
bc = FALSE, boot_type = "weighted")
<- spe(fm = fmla1, data = wage2015, samp_weight = wage2015$weight,
gap_nev var = "male", subgroup = fem_nev, us = c(2:98)/100, b = 500,
bc = FALSE, boot_type = "weighted")
plot(x = gap_mar, main = "Married Women", sub = "OLS Model", xlab = "Percentile Index",
ylab = "Gender Wage Gap", ylim = c(-0.2, 0.45))
plot(x = gap_nev, main = "Never Married Women", sub = "OLS Model",
xlab = "Percentile Index", ylab = "Gender Wage Gap", ylim = c(-0.2, 0.45))
Figure (5) shows the results for the two subpopulations. Here we find large heterogeneity not only between married and never married women, but also within these more narrowly defined subpopulations.
Now we compare the differences in characteristics of the 5% most and least affected women using weighted bootstrap with \(500\) repetitions. We pick the following variables
<- c("lnw", "female", "ms", "educ", "region", "exp1", "occ", "ind") tw
Since many variables arecategory indicators of the same factor, we can
specify cat
as follows to get categorical p-values.
<- c("ms", "educ", "region", "occ", "ind") cat
Then we issue the ca
command and tabulate the mean characteristics of
the two groups
<- ca(fm = fmla1, data = wage2015, samp_weight = wage2015$weight, var = "male",
Char t = tw, cl = "both", b = 500, subgroup = wage2015[,"female"] == 1,
boot_type = "weighted", bc = FALSE, u = 0.05)
Table (7) reports the estimates and standard errors obtained by weighted bootstrap with 500 replications. We find that, compared to the 5% least affected women, the 5% most affected women are much more likely to be married, much less likely to be never married, less likely to have an advanced degree, live in the South, don’t live in Northeast and West, possess more potential experience, are more likely to have sales and non-managerial occupations, and work more often in manufacture, retail, transport and finance, and less often in education and leisure industries.
summary(Char)
Most | SE | Least | SE | |
---|---|---|---|---|
lnw | 2.97 | 0.10 | 3.02 | 0.06 |
female | 1.00 | 0.00 | 1.00 | 0.00 |
exp1 | 26.58 | 2.01 | 8.47 | 2.54 |
occ_manager | 0.21 | 0.13 | 0.77 | 0.08 |
occ_service | 0.04 | 0.03 | 0.12 | 0.06 |
occ_sales | 0.56 | 0.14 | 0.10 | 0.05 |
occ_construction | 0.00 | 0.01 | 0.01 | 0.01 |
occ_production | 0.19 | 0.08 | 0.01 | 0.01 |
ind_minery | 0.00 | 0.02 | 0.00 | 0.00 |
ind_construction | 0.00 | 0.01 | 0.01 | 0.01 |
ind_manufacture | 0.19 | 0.09 | 0.02 | 0.01 |
ind_retail | 0.17 | 0.14 | 0.03 | 0.02 |
ind_transport | 0.14 | 0.07 | 0.00 | 0.00 |
ind_information | 0.00 | 0.02 | 0.01 | 0.02 |
ind_finance | 0.43 | 0.17 | 0.02 | 0.01 |
ind_professional | 0.04 | 0.06 | 0.05 | 0.03 |
ind_education | 0.00 | 0.03 | 0.56 | 0.09 |
ind_leisure | 0.00 | 0.00 | 0.21 | 0.08 |
ind_services | 0.00 | 0.00 | 0.09 | 0.05 |
ind_public | 0.02 | 0.05 | 0.02 | 0.01 |
educ_lhs | 0.04 | 0.04 | 0.04 | 0.02 |
educ_hsg | 0.26 | 0.13 | 0.04 | 0.05 |
educ_sc | 0.50 | 0.15 | 0.12 | 0.06 |
educ_cg | 0.14 | 0.11 | 0.34 | 0.10 |
educ_ad | 0.07 | 0.07 | 0.46 | 0.08 |
ms_married | 1.00 | 0.02 | 0.02 | 0.04 |
ms_widowed | 0.00 | 0.00 | 0.03 | 0.07 |
ms_separated | 0.00 | 0.00 | 0.03 | 0.03 |
ms_divorced | 0.00 | 0.01 | 0.07 | 0.04 |
ms_nevermarried | 0.00 | 0.00 | 0.85 | 0.12 |
region_mw | 0.29 | 0.09 | 0.28 | 0.06 |
region_so | 0.47 | 0.11 | 0.25 | 0.06 |
region_we | 0.12 | 0.06 | 0.22 | 0.05 |
region_ne | 0.13 | 0.07 | 0.25 | 0.06 |
We also test the statistical significance of the mean differences. Table 8 shows that the differences mentioned above are significant after controlling for simultaneous inference within categories, but only the differences in marital status, potential experience and education industry remain jontly significant at the 5% level.
<- ca(fm = fmla1, data = wage2015, samp_weight = wage2015$weight, var = "male",
Chardiff t = tw, cl = "diff", b = 500, cat = cat, bc = FALSE, u = 0.05
subgroup = wage2015[, "female"] == 1, boot_type = "weighted",)
summary(Chardiff)
Estimate | SE | JP-vals | Cat P-vals | |
---|---|---|---|---|
lnw | -0.06 | 0.12 | 1.00 | 0.32 |
female | 0.00 | 0.00 | ||
exp1 | 18.12 | 3.49 | 0.09 | 0.00 |
occ_manager | -0.55 | 0.18 | 0.49 | 0.03 |
occ_service | -0.08 | 0.08 | 1.00 | 0.88 |
occ_sales | 0.46 | 0.17 | 0.67 | 0.09 |
occ_construction | -0.01 | 0.02 | 1.00 | 1.00 |
occ_production | 0.18 | 0.09 | 0.89 | 0.24 |
ind_minery | 0.00 | 0.02 | 1.00 | 1.00 |
ind_construction | -0.01 | 0.02 | 1.00 | 1.00 |
ind_manufacture | 0.18 | 0.10 | 0.97 | 0.79 |
ind_retail | 0.14 | 0.14 | 1.00 | 0.99 |
ind_transport | 0.14 | 0.07 | 0.90 | 0.63 |
ind_information | -0.00 | 0.04 | 1.00 | 1.00 |
ind_finance | 0.41 | 0.18 | 0.86 | 0.58 |
ind_professional | -0.00 | 0.08 | 1.00 | 1.00 |
ind_education | -0.56 | 0.10 | 0.07 | 0.04 |
ind_leisure | -0.21 | 0.09 | 0.79 | 0.50 |
ind_services | -0.09 | 0.05 | 0.98 | 0.82 |
ind_public | -0.00 | 0.06 | 1.00 | 1.00 |
educ_lhs | 0.00 | 0.05 | 1.00 | 1.00 |
educ_hsg | 0.22 | 0.16 | 1.00 | 0.64 |
educ_sc | 0.38 | 0.16 | 0.77 | 0.13 |
educ_cg | -0.21 | 0.16 | 1.00 | 0.69 |
educ_ad | -0.39 | 0.12 | 0.37 | 0.01 |
ms_married | 0.98 | 0.05 | 0.00 | 0.00 |
ms_widowed | -0.03 | 0.07 | 1.00 | 0.97 |
ms_separated | -0.03 | 0.03 | 1.00 | 0.83 |
ms_divorced | -0.07 | 0.05 | 0.99 | 0.56 |
ms_nevermarried | -0.85 | 0.12 | 0.02 | 0.01 |
region_mw | 0.01 | 0.15 | 1.00 | 1.00 |
region_so | 0.22 | 0.17 | 1.00 | 0.50 |
region_we | -0.10 | 0.11 | 1.00 | 0.75 |
region_ne | -0.13 | 0.12 | 1.00 | 0.71 |
Lastly we use show the functionality of the command subpop
. We plot
projections of 90% confidence sets for the 5% most and least affected
group with respect to two pairs of variables: log wages and potential
experience, and marital status and potential experience. The estimated
sets are obtained by weighted bootstrap with 500 repetitions and we drop
the overlapped observations.
<- subpop(fm = fmla1, data = wage2015, var = "male", samp_weight = wage2015$weight,
set boot_type = "weighted", b = 500, subgroup = wage2015[, "male"] == 0,
u = 0.05)
plot(set, varx = wage2015$exp1, vary = wage2015$lnw, main = "Projections of Exp-lnw",
sub = "OLS", xlab = "Exp", ylab = "Log Wages")
plot(set, varx = wage2015$exp1, vary =wage2015$ms, main = "Projections of Exp-MS",
sub = "OLS", xlab = "Exp", ylab = "Marital Status")
Figure (6) shows that there are relatively more least affected women with low experience at all wage levels, more high affected women with high wages with between 15 and 45 years of experience, and more least affected women which are not married at all experience levels.
We thank the editor Norman Matloff, Thomas Leeper and an anonymous referee for valuable suggestions on the paper and package. We gratefully acknowledge research support from the NSF.
SortedEffects, SortedEffect, quantreg, margins, parallel, boot
CausalInference, Econometrics, Environmetrics, MixedModels, Optimization, ReproducibleResearch, Robust, Survival, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Chen, et al., "SortedEffects: Sorted Causal Effects in R", The R Journal, 2020
BibTeX citation
@article{RJ-2020-006, author = {Chen, Shuowen and Chernozhukov, Victor and Fernández-Val, Iván and Luo, Ye}, title = {SortedEffects: Sorted Causal Effects in R}, journal = {The R Journal}, year = {2020}, note = {https://rjournal.github.io/}, volume = {12}, issue = {1}, issn = {2073-4859}, pages = {131-146} }