Martin Feldstein, Joel Slemrod, and Shlomo Yitzhaki
This document seeks to replicate a 1980 paper by Martin Feldstein, Joel Slemrod, and Shlomo Yitzhaki titled “The Effects of Taxation on the Selling of Corporate Stock and The Realization of Capital Gains” with and without formal privacy.
library(tidyverse)
library(stargazer)
library(gt)
library(broom)
library(yardstick)
library(DT)
theme_set(theme_minimal())
theme_update(plot.title.position = "plot")
source(here::here("tax-code", "puf.preprocess.R"))
source(here::here("tax-code", "utility.taxcalc.R"))
source(here::here("tax-code", "utility.taxcalc.prep_for_tax_calc.R"))
source(here::here("tax-code", "utility.taxcalc.run_calculator_distribution.R"))
source(here::here("tax-code", "utility.taxcalc.run_calculator_reform.R"))
Age is missing for about half of observations in the data. This will create issues for replicating Feldstein, Slemrod, and Yitzhaki (1980).
We impute age by assigning observations with missing age to age 65+ if the record has non-zero total pensions and annuities received, non-zero pensions and annuities in AGI, or Social Security benefits. All other records with missing age are assigned to ages 0-64. This is crude but should be somewhat precise for records with more than $3,000 in dividends. Furthermore, the estimates with and without noise will be flawed in the same way.
df <- synthpuf::puf_2012 %>%
# drop the aggregate observations
filter(!RECID %in% 999996:999999)
# impute age 65
df <- df %>%
mutate(
age65 = case_when(
AGERANGE == 6 ~ "Age 65+",
AGERANGE == 3 & DSI == 1 ~ "Age 65+",
# total pensions and annuities received
E01500 > 0 ~ "Age 65+",
# pensions and annuities in AGI
E01700 > 0 ~ "Age 65+",
# social security benefits
E02400 > 0 ~ "Age 65+",
TRUE ~ "Age 0-64"
)
)
df <- df %>%
mutate(AGERANGE = if_else(is.na(AGERANGE) & age65 == "Age 65+", 6, 0))
# standard preprocessing
puf <- puf.preprocess(df)
“Last dollar” marginal tax rates on long-term capital gains are endogenous to long-term capital gains because filers may adjust their capital gains to hit lower tax rates through timing sales or manipulating reporting. We calculate the last dollar and first dollar marginal tax rates on long-term capital gains. The last dollar rate is calculated by running a tax calculator with observed long-term capital gains and then running a tax calculator with observed long-term capital gains plus one penny and then comparing the amount of taxes paid. The first dollar rate is calculated by running a tax calculator with zero long-term capital gains and then running a tax calculator with one penny of long-term capital gains and then comparing the amount of taxes paid.
Run the Open Source Policy Center’s tax calculator to calculate both marginal capital gains rates.
# calculate first dollar marginal tax rate on capital gains
results_first_dollar <-
puf %>%
mutate(P23250 = 0) %>%
utility.taxcalc(data = .,
year = 2012,
enforce_constraints = TRUE,
condaenv = "taxcalc-dev",
conda = "auto")
# calculate last dollar marginal tax rate on capital gains
results_last_dollar <-
utility.taxcalc(data = puf,
year = 2012,
enforce_constraints = TRUE,
condaenv = "taxcalc-dev",
conda = "auto")
# create one output data set
puf <- bind_cols(
puf,
mcg_first_dollar = results_first_dollar[[3]],
mcg_last_dollar = results_last_dollar[[3]]
)
Visualize the results:
puf %>%
filter(mcg_first_dollar < 1) %>%
ggplot(aes(mcg_first_dollar)) +
geom_density() +
labs(title = "Density of Marginal Tax Rate on the First Dollar of Long-Term Capital Gains",
x = "Marginal Tax Rate on the First Dollar of Long-Term Capital Gains")
puf %>%
filter(mcg_last_dollar < 1) %>%
ggplot(aes(mcg_last_dollar)) +
geom_density() +
labs(title = "Density of Marginal Tax Rate on the Last Dollar of Long-Term Capital Gains",
x = "Marginal Tax Rate on the Last Dollar of Long-Term Capital Gains")
Prepare the necessary variables for the regression analysis.
# prepare variables
final_vars <- puf %>%
mutate(
net_agi = pmax(E00100 - P23250, 0.00001)
) %>%
select(
ltcg = P23250,
dividends = E00600,
agi = E00100,
net_agi,
mcg_first_dollar,
mcg_last_dollar,
age65
) %>%
mutate(ltcg_dividends = ltcg / dividends)
Feldstein et al. filtered observations to those with $3,000 or more in dividends in 1973. We adjust the $3,000 threshold into 2012 dollars using the CPI-U.
# create inflation adjusted threshold
analysis_adjusted <- final_vars %>%
filter(dividends > 12513.11)
# estimate the model
model_adjusted <- lm(ltcg_dividends ~
mcg_last_dollar +
factor(age65) +
log(dividends) +
log(net_agi),
data = analysis_adjusted)
stargazer(model_adjusted, type = "html")
Dependent variable: | |
ltcg_dividends | |
mcg_last_dollar | 72.714*** |
(5.067) | |
factor(age65)Age 65+ | -2.762*** |
(0.732) | |
log(dividends) | -3.929*** |
(0.267) | |
log(net_agi) | -0.671*** |
(0.052) | |
Constant | 52.159*** |
(3.115) | |
Observations | 30,016 |
R2 | 0.018 |
Adjusted R2 | 0.018 |
Residual Std. Error | 63.227 (df = 30011) |
F Statistic | 135.166*** (df = 4; 30011) |
Note: | p<0.1; p<0.05; p<0.01 |
Because of endogeneity, the estimated coefficient for the marginal tax rate on the last dollar of capital gains is positive.
Here, we use the first marginal tax rate. The estimated coefficient is now negative.
# create inflation adjusted threshold
analysis_adjusted <- final_vars %>%
filter(dividends > 12513.11)
# estimate the model
model_adjusted <- lm(ltcg_dividends ~ mcg_first_dollar + factor(age65) + log(dividends) + log(net_agi),
data = analysis_adjusted)
stargazer(model_adjusted, type = "html")
Dependent variable: | |
ltcg_dividends | |
mcg_first_dollar | -28.381*** |
(8.204) | |
factor(age65)Age 65+ | -3.163*** |
(0.735) | |
log(dividends) | -3.533*** |
(0.269) | |
log(net_agi) | -0.433*** |
(0.062) | |
Constant | 61.881*** |
(3.148) | |
Observations | 30,016 |
R2 | 0.011 |
Adjusted R2 | 0.011 |
Residual Std. Error | 63.431 (df = 30011) |
F Statistic | 86.129*** (df = 4; 30011) |
Note: | p<0.1; p<0.05; p<0.01 |
These results are pretty similar to equation (1) from Table 1 in Feldstein et al. (1980).
Variable | Coefficient | Standard Error |
---|---|---|
Tax | -62.4 | (5.98) |
Age 65 | -1.13 | (0.557) |
Log Dividends | -2.80 | (0.188) |
Log Adjusted Net AGI | -0.483 | (0.184) |
Intercept | 38.8 | (2.11) |
boom <- analysis_adjusted %>%
mutate(
log_dividends = log(dividends),
log_net_agi = log(net_agi),
age65 = factor(age65)
)
boom %>%
select(ltcg_dividends, mcg_first_dollar, age65, log_dividends, log_net_agi) %>%
GGally::ggpairs()
model_adjusted %>%
ggplot(aes(ltcg_dividends)) +
geom_density()
model_adjusted %>%
ggplot(aes(ltcg_dividends)) +
geom_density() +
scale_x_continuous(limits = c(0, 100))
model_adjusted %>%
ggplot(aes(mcg_first_dollar)) +
geom_density() +
labs(title = "First Dollar MCG")
boom %>%
ggplot(aes(log_dividends)) +
geom_density() +
labs(title = "log(dividends)")
boom %>%
ggplot(aes(log_net_agi)) +
geom_density() +
labs(title = "log(net agi)")
Our tests generated thousands of results and the problem is highly multi-dimensional with multiple methods, multiple tuning parameters, multiple coefficients, and multiple metrics. Our strategy is
We focus on six metrics
coefficients <- tidy(model_adjusted, conf.int = TRUE) %>%
mutate(
ci_length = conf.high - conf.low,
ci_contains_zero = conf.high > 0 & conf.low < 0
) %>%
mutate(term =
recode(
term,
`factor(age65)Age 65+` = "age65Age 65+",
`log(dividends)` = "log_dividends",
`log(net_agi)` = "log_net_agi"
)
)
coefficients_bootstrap <- read_csv(here::here("results", "02_soi_regression_coefficient_asymptotic.csv"),
col_types = cols(
method = col_character(),
epsilon = col_double(),
replicate = col_double(),
term = col_character(),
estimate_results = col_double(),
estimate_noisy_results = col_double(),
bias = col_double(),
ci_overlap = col_double(),
ci_length = col_double(),
sign_match = col_logical(),
ci_contains_zero = col_logical(),
delta = col_double(),
n = col_double()
)) %>%
mutate(method = if_else(method == "Brawner and Honaker Method", "BHM", method)) %>%
mutate(method = if_else(!is.na(n), paste0(method, " (n = ", n, ")"), method))
coefficients_bootstrap <- read_csv(here::here("results", "02_soi_regression_coefficient_bootstrap.csv"),
col_types = cols(
method = col_character(),
epsilon = col_double(),
replicate = col_double(),
term = col_character(),
estimate_results = col_double(),
estimate_noisy_results = col_double(),
bias = col_double(),
ci_overlap = col_double(),
ci_length = col_double(),
sign_match = col_logical(),
ci_contains_zero = col_logical(),
delta = col_double(),
n = col_double()
)) %>%
mutate(method = if_else(method == "Brawner and Honaker Method", "BHM", method)) %>%
mutate(method = if_else(!is.na(n), paste0(method, " (n = ", n, ")"), method))
true_rmse <- analysis_adjusted %>%
modelr::add_predictions(model_adjusted) %>%
rmse(truth = ltcg_dividends, estimate = pred) %>%
pull(.estimate)
rmse_asymptotic <- read_csv(here::here("results", "02_soi_regression_rmse_asymptotic.csv"),
col_types = cols(
method = col_character(),
epsilon = col_double(),
replicate = col_double(),
rmse = col_double(),
delta = col_double(),
n = col_double()
)) %>%
mutate(method = if_else(method == "Brawner and Honaker Method", "BHM", method)) %>%
mutate(method = if_else(!is.na(n), paste0(method, " (n = ", n, ")"), method))
rmse_bootstrap <- read_csv(here::here("results", "02_soi_regression_rmse_bootstrap.csv"),
col_types = cols(
method = col_character(),
epsilon = col_double(),
replicate = col_double(),
rmse = col_double(),
delta = col_double(),
n = col_double()
)) %>%
mutate(method = if_else(method == "Brawner and Honaker Method", "BHM", method)) %>%
mutate(method = if_else(!is.na(n), paste0(method, " (n = ", n, ")"), method))
coefficients_bootstrap_summary <- coefficients_bootstrap %>%
left_join(
coefficients,
by = "term",
suffix = c("_noisy", "_original")
) %>%
group_by(method, epsilon, term, delta, n) %>%
summarize(
`Median bias` = median(bias),
`Median Relative Bias` = median(bias / estimate_results),
`Median CI Overlap` = median(ci_overlap),
`Median Relative CI Length` = median(ci_length_noisy / ci_length_original),
`Prop Sign Match` = mean(sign_match),
`Prop. Sig. Match` = mean(ci_contains_zero_original == ci_contains_zero_noisy)
) %>%
ungroup()
This table summarizes the results across all methods for regression. Detailed data visualizations are in subsequent sections.
coefficients_bootstrap_summary %>%
mutate(
`Median Relative Bias` = round(`Median Relative Bias`, digits = 3),
`Median CI Overlap` = round(`Median CI Overlap`, digits = 3),
`Median Relative CI Length` = round(`Median Relative CI Length`, digits = 3)
) %>%
datatable()
eps5 <- coefficients_bootstrap %>%
left_join(
coefficients,
by = "term",
suffix = c("_noisy", "_original")
) %>%
filter(epsilon == 5, delta %in% c(NA, 0.000001))
eps5 <- eps5 %>%
mutate(
relative_bias = bias / estimate_results,
relative_ci_length = ci_length_noisy / ci_length_original
)
probs <- seq(0.1, 0.9, 0.1)
eps5_summary <- eps5 %>%
group_by(method, term) %>%
summarize(
n(),
q = probs,
bias = quantile(bias, probs = probs),
relative_bias = quantile(relative_bias, probs = probs),
ci_overlap = quantile(ci_overlap, probs = probs),
relative_ci_length = quantile(relative_ci_length, probs = probs)
)
rmse_bootstrap_summary <- rmse_bootstrap %>%
group_by(method, epsilon, delta, n) %>%
summarize(
q = probs,
relative_rmse = quantile(rmse / true_rmse, probs = probs)
) %>%
ungroup()
eps5_summary %>%
filter(q == 0.5) %>%
ggplot(aes(bias, method)) +
geom_point() +
geom_vline(xintercept = 0, color = "red") +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Median Bias with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
eps5_summary %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(bias, method)) +
geom_vline(xintercept = 0, color = "red") +
geom_point(alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Bias with Epsilon == 5 and Delta == 0.000001",
subtitle = "Deciles",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5 %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(bias, method)) +
geom_vline(xintercept = 0, color = "red") +
geom_violin(draw_quantiles = 0.5, alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Bias with Epsilon == 5 and Delta == 0.000001",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5_summary %>%
filter(q == 0.5) %>%
ggplot(aes(relative_bias, method)) +
geom_point() +
geom_vline(xintercept = 0, color = "red") +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Median Relative Bias with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
eps5_summary %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(relative_bias, method)) +
geom_vline(xintercept = 0, color = "red") +
geom_point(alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Relative Bias with Epsilon == 5 and Delta == 0.000001",
subtitle = "Deciles",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5 %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(relative_bias, method)) +
geom_vline(xintercept = 0, color = "red") +
geom_violin(draw_quantiles = 0.5, alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Relative Bias with Epsilon == 5 and Delta == 0.000001",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5_summary %>%
filter(q == 0.5) %>%
ggplot(aes(ci_overlap, method)) +
geom_point() +
geom_vline(xintercept = 1, color = "red") +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Median CI Overlap with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
eps5_summary %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(ci_overlap, method)) +
geom_vline(xintercept = 1, color = "red") +
geom_point(alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "CI Overlap with Epsilon == 5 and Delta == 0.000001",
subtitle = "Deciles",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5 %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(ci_overlap, method)) +
geom_vline(xintercept = 1, color = "red") +
geom_violin(draw_quantiles = 0.5, alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "CI Overlap with Epsilon == 5 and Delta == 0.000001",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5_summary %>%
filter(q == 0.5) %>%
ggplot(aes(relative_ci_length, method)) +
geom_point() +
geom_vline(xintercept = 1, color = "red") +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Median Relative CI Length with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
eps5_summary %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(relative_ci_length, method)) +
geom_vline(xintercept = 1, color = "red") +
geom_point(alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Relative CI Length with Epsilon == 5 and Delta == 0.000001",
subtitle = "Deciles",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5 %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
ggplot(aes(relative_ci_length, method)) +
geom_vline(xintercept = 1, color = "red") +
geom_violin(draw_quantiles = 0.5, alpha = 0.2) +
facet_wrap(~term, scales = "free_x") +
labs(
title = "Relative CI Length with Epsilon == 5 and Delta == 0.000001",
y = NULL,
caption = "BHM dropped for clarity"
)
eps5 %>%
group_by(term, method, n) %>%
summarize(prop_sign_match = mean(sign_match)) %>%
ggplot(aes(prop_sign_match, method)) +
geom_col(position = "dodge") +
geom_vline(xintercept = 1, color = "red") +
facet_wrap(~term) +
labs(
title = "Sign Match with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
eps5 %>%
group_by(term, method, n) %>%
summarize(prop_sig_match = mean(ci_contains_zero_original == ci_contains_zero_noisy)) %>%
ggplot(aes(prop_sig_match, method)) +
geom_col(position = "dodge") +
geom_vline(xintercept = 1, color = "red") +
facet_wrap(~term) +
labs(
title = "Significance Match with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
rmse_bootstrap_summary %>%
filter(q == 0.5) %>%
filter(epsilon == 5, delta %in% c(NA, 0.000001)) %>%
ggplot(aes(relative_rmse, method)) +
geom_point() +
geom_vline(xintercept = 1, color = "red") +
labs(
title = "Median Relative RMSE with Epsilon == 5 and Delta == 0.000001",
y = NULL
)
rmse_bootstrap_summary %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
filter(epsilon == 5, delta %in% c(NA, 0.000001)) %>%
ggplot(aes(relative_rmse, method)) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = 1, color = "red") +
labs(
title = "Relative RMSE with Epsilon == 5 and Delta == 0.000001",
subtitle = "Deciles",
y = NULL,
caption = "BHM dropped for clarity"
)
rmse_bootstrap %>%
filter(!method %in% c("BHM (n = 10)", "BHM (n = 25)")) %>%
mutate(relative_rmse = rmse / true_rmse) %>%
ggplot(aes(relative_rmse, method)) +
geom_vline(xintercept = 1, color = "red") +
geom_violin(draw_quantiles = 0.5) +
labs(
title = "Relative RMSE with Epsilon == 5 and Delta == 0.000001",
y = NULL,
caption = "BHM dropped for clarity"
)
coefficients_bootstrap_summary %>%
filter(
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(epsilon, `Median bias`)) +
geom_hline(yintercept = 0, color = "red") +
geom_line(aes(group = method), alpha = 0.2) +
geom_point(aes(color = method), alpha = 0.4) +
facet_wrap(~term, scales = "free_y") +
labs(
title = "Median Bias with Delta == 0.000001",
caption = "BHM dropped for clarity"
)
coefficients_bootstrap_summary %>%
filter(
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(epsilon, `Median Relative Bias`)) +
geom_hline(yintercept = 0, color = "red") +
geom_line(aes(group = method), alpha = 0.2) +
geom_point(aes(color = method), alpha = 0.4) +
facet_wrap(~term, scales = "free_y") +
labs(
title = "Median Relative Bias with Delta == 0.000001",
caption = "BHM dropped for clarity"
)
coefficients_bootstrap_summary %>%
filter(
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(x = epsilon, y = `Median CI Overlap`)) +
geom_hline(yintercept = 1, color = "red") +
geom_line(aes(group = method), alpha = 0.2) +
geom_point(aes(color = method), alpha = 0.4) +
facet_wrap(~term, scales = "free_y") +
labs(
title = "Median CI Overlap with Delta == 0.000001",
caption = "BHM dropped for clarity"
)
coefficients_bootstrap_summary %>%
filter(
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(x = epsilon, y = `Median Relative CI Length`)) +
geom_hline(yintercept = 1, color = "red") +
geom_line(aes(group = method), alpha = 0.2) +
geom_point(aes(color = method), alpha = 0.4) +
facet_wrap(~term, scales = "free_y") +
labs(
title = "Median Relative CI Length with Delta == 0.000001",
caption = "BHM dropped for clarity"
)
coefficients_bootstrap_summary %>%
filter(
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(`Prop Sign Match`, method, fill = factor(epsilon))) +
geom_col(position = "dodge") +
geom_vline(xintercept = 1, color = "red") +
facet_wrap(~term) +
labs(
title = "Proportion Sign Match with Delta == 0.000001",
caption = "BHM dropped for clarity"
)
coefficients_bootstrap_summary %>%
filter(
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(`Prop. Sig. Match`, method, fill = factor(epsilon))) +
geom_col(position = "dodge") +
geom_vline(xintercept = 1, color = "red") +
facet_wrap(~term) +
labs(
title = "Proportion Significance Match with Delta == 0.000001",
caption = "BHM dropped for clarity"
)
# RMSE
rmse_bootstrap_summary %>%
filter(
q == 0.5,
epsilon < 100,
delta %in% c(NA, 0.000001),
!method %in% c("BHM (n = 10)", "BHM (n = 25)")
) %>%
ggplot(aes(epsilon, relative_rmse)) +
geom_hline(yintercept = 1, color = "red") +
geom_line(aes(group = method), alpha = 0.2) +
geom_point(aes(color = method), alpha = 0.4) +
facet_wrap(~method, scales = "free_y") +
labs(
title = "Relative RMSE with Delta == 0.000001",
caption = "BHM dropped for clarity"
)