Doubly Robust Semiparametric Difference-in-Differences with High-Dimensional Data for Stata
hddid implements the doubly robust semiparametric difference-in-differences estimator proposed by Ning, Peng, and Tao (2024) for settings with high-dimensional covariates. The estimator targets the conditional average treatment effect on the treated (CATT) under a partially linear model:

$$\tau(x, z) = x'\beta + f(z)$$

where $\beta$ is the parametric coefficient vector on the high-dimensional covariates $X$ and $f(\cdot)$ is an unrestricted function of the low-dimensional covariate $Z$.
Features:
- Doubly robust estimation via AIPW score (consistent if either the propensity or the outcome model is correctly specified)
- High-dimensional covariates ($p > n$) with Lasso penalization
- Flexible nonparametric component via polynomial (`Pol`) or trigonometric (`Tri`) sieve basis
- Debiased $\sqrt{n}$-inference for the parametric component $\beta$
- Pointwise and uniform confidence intervals for the nonparametric component $f(z)$
- Cross-fitting to avoid overfitting bias
- Postestimation: `predict` for the fitted ATT surface (`tau`, `xb`, `fz`)
- Built-in DGP generators (`hddid_dgp1`, `hddid_dgp2`) for Monte Carlo simulation
- Stata 16.0 or later
- `lassopack` (provides `lasso2`, `cvlasso`, `cvlassologit`)
- Python 3.7+ with `numpy >= 1.20` and `scipy >= 1.7`, required only when `x()` has more than one covariate (the CLIME precision matrix step calls Python); with a single covariate, `hddid` uses an analytic scalar inverse and does not call Python
```stata
* Install lassopack from SSC
ssc install lassopack, replace
```

Python note: When `x()` has more than one covariate, `hddid` calls Python for the CLIME precision matrix step. Stata 16+ auto-detects Python on most systems. To verify, run `python query`; if Python is not found, run `python search` to list available installations, then `set python_exec <path>, permanently` to bind one. See `help python` for details.
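To make the Python dependency concrete, the sketch below shows the kind of optimization a CLIME precision-matrix step solves: each column of the precision matrix minimizes an $\ell_1$ norm subject to an $\ell_\infty$ constraint, which is a linear program. This is an illustrative sketch only; `hddid`'s actual solver, scaling, and cross-validated choice of `lam` may differ.

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(sigma, j, lam):
    """Solve one CLIME column:  min ||w||_1  s.t.  ||sigma @ w - e_j||_inf <= lam.
    Split w = u - v with u, v >= 0 so the problem becomes a linear program.
    (Sketch of the optimization behind hddid's Python step; not its exact code.)"""
    p = sigma.shape[0]
    e = np.zeros(p)
    e[j] = 1.0
    c = np.ones(2 * p)                            # objective: sum(u) + sum(v) = ||w||_1
    A = np.vstack([np.hstack([sigma, -sigma]),    #  sigma @ w - e <= lam
                   np.hstack([-sigma, sigma])])   # -(sigma @ w - e) <= lam
    b = np.concatenate([lam + e, lam - e])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

# Sanity check: with a well-conditioned sigma and small lam,
# the stacked CLIME columns approximate the true inverse.
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
omega = np.column_stack([clime_column(sigma, j, 0.01) for j in range(2)])
print(np.round(sigma @ omega, 1))
```

With a single covariate this whole step collapses to a scalar reciprocal, which is why `hddid` skips Python in that case.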
```stata
net install hddid, from("https://raw.githubusercontent.com/gorgeousfish/hddid/main") replace
```

The bundled empirical datasets are ancillary files. Download them to your working directory once:

```stata
net get hddid, from("https://raw.githubusercontent.com/gorgeousfish/hddid/main") replace
```

This downloads nsw_dw.dta and minwage_state.dta to your current directory.
```stata
which hddid
which hddid_dgp1
which hddid_dgp2
help hddid
```

After running `net get hddid` (Step 3 above), load the datasets with:
```stata
use nsw_dw, clear         // LaLonde-Dehejia-Wahba NSW job training
use minwage_state, clear  // Fair Minimum Wage Act state panel
```

This example uses the NSW (National Supported Work) job training experiment. The outcome change is the difference in real earnings from 1975 (pre-treatment) to 1978 (post-treatment). The key nonparametric variable re74 captures heterogeneity by prior earnings, whose effect on the treatment-earnings relationship is left unrestricted. This is the canonical benchmark dataset for doubly robust estimators.

Data: nsw_dw.dta — 445 individuals (185 treated, 260 control)
```stata
use nsw_dw, clear
* deltay = re78 - re75 (change in real earnings, USD)
* x() = age educ black hisp married nodegree (demographic covariates)
* z() = re74 (1974 baseline earnings: nonparametric heterogeneity by prior income)
* Estimate: treatment effect of NSW training, heterogeneous by baseline earnings
* No Python needed (p=6 uses scalar CLIME path)
set seed 42
hddid deltay, treat(treat) x(age educ black hisp married nodegree) z(re74) ///
    method(Pol) q(4) k(3) z0(0 2000 5000 10000) nboot(500) alpha(0.1)
* Debiased estimates: beta for each demographic covariate
matrix list e(xdebias), format(%9.2f)
matrix list e(stdx), format(%9.2f)
* Nonparametric ATT component at z0 evaluation points
* f(re74): how treatment effect varies with 1974 baseline earnings
matrix list e(gdebias), format(%9.2f)
matrix list e(z0), format(%9.2f)
matrix list e(CIpoint), format(%9.2f)
* Predict full ATT surface: tau(X,Z) = X'beta + f(Z)
predict double tau_hat
summarize tau_hat
```

This example replicates the empirical application in the original paper (Ning, Peng & Tao 2024). The Fair Minimum Wage Act of 2007 raised the federal minimum wage in three steps (2007–2009). Treated states are those where the federal increase was binding (state minimum wage ≤ $5.15 before the Act). The outcome is the change in state unemployment rate. The nonparametric variable prior_unem allows the unemployment response to differ flexibly by a state's pre-crisis labor market slack — tighter labor markets may respond differently to wage floors.
Data: minwage_state.dta — 50 U.S. states (17 treated, 33 control)

```stata
use minwage_state, clear
* deltay = change in unemployment rate (post 2007-09 avg minus pre 2005-06 avg)
* x() = gdp_pc_2006 mfg_share union_rate educ_ba poverty_rate teen_share south
* z() = prior_unem (2004 unemployment rate: nonparametric labor-market heterogeneity)
* treat = 1 if federal minimum wage increase was binding in 2007
set seed 42
hddid deltay, treat(treat) ///
    x(gdp_pc_2006 mfg_share union_rate educ_ba poverty_rate teen_share south) ///
    z(prior_unem) ///
    method(Pol) q(4) k(2) ///
    z0(3.5 4.5 5.5 6.5) nboot(500) alpha(0.1)
* Debiased beta: effect of each state covariate on treatment heterogeneity
matrix list e(xdebias), format(%9.4f)
matrix list e(stdx), format(%9.4f)
* Nonparametric component: how wage-floor effect varies with baseline unemployment
* f(prior_unem): unemployment response at z0 = 3.5, 4.5, 5.5, 6.5
matrix list e(gdebias), format(%9.4f)
matrix list e(CIpoint), format(%9.4f)
```

After estimation, `hddid` supports `predict` for fitted values:
```stata
* Default: tau = X'beta + f(Z) (full ATT prediction)
predict tau_hat
* Linear component only: xb = X'beta_debias
predict xb_hat, xb
* Nonparametric component only: fz = a0 + psi(z)'gamma
predict fz_hat, fz
* Verify decomposition: tau = xb + fz
summarize tau_hat xb_hat fz_hat
```

Replay the stored estimation table:

```stata
* Bare hddid re-displays the last estimation results
hddid
```

The built-in DGP generators reproduce the simulation designs from the paper (Section 5). Use these to verify estimation accuracy against known true values before applying `hddid` to your own data.
```stata
* Homoscedastic, independent X; true beta_1 = 1.0, true f(z) = exp(z)
hddid_dgp1, n(300) p(1) seed(12345) clear
hddid deltay, treat(treat) x(x1) z(z) ///
    method(Pol) q(4) k(2) seed(42) z0(-1 0 1) nboot(500) alpha(0.1)
* Check: e(xdebias) should be close to 1.0
matrix list e(xdebias), format(%9.4f)

* AR(1)-correlated covariates, heteroscedastic baseline; true beta_1 = 1.0
hddid_dgp2, n(500) p(1) seed(54321) rho(0.5) clear
hddid deltay, treat(treat) x(x1) z(z) ///
    method(Pol) q(8) k(3) seed(42) z0(-1 0 1) nboot(500) alpha(0.05)
matrix list e(xdebias), format(%9.4f)

* Section 5 design: true beta_j = 1/j for j=1..15, 0 for j>15
hddid_dgp1, n(500) p(50) seed(12345) clear
unab x_vars : x*
hddid deltay, treat(treat) x(`x_vars') z(z) ///
    method(Tri) q(16) k(3) seed(42) z0(-1 -0.5 0 0.5 1) ///
    nboot(1000) alpha(0.1)
matrix list e(xdebias), format(%9.4f)
matrix list e(CIpoint), format(%9.4f)
matrix list e(CIuniform), format(%9.4f)
```

| Command | Description |
|---|---|
| `hddid` | Main estimation command |
| `hddid_dgp1` | Generate DGP1 simulation data (homoscedastic, independent X) |
| `hddid_dgp2` | Generate DGP2 simulation data (heteroscedastic, correlated X) |
```stata
hddid depvar [if] [in], treat(varname) x(varlist) z(varname) [options]
```

The dependent variable *depvar* should be the outcome change $\Delta Y$ (post-period minus pre-period outcome).

| Option | Description | Default |
|---|---|---|
| `treat(varname)` | Binary treatment indicator (0/1) | Required |
| `x(varlist)` | High-dimensional covariates | Required |
| `z(varname)` | Low-dimensional covariate for nonparametric component | Required |
| `method(string)` | Sieve basis: `Pol` (polynomial) or `Tri` (trigonometric) | `Pol` |
| `q(#)` | Sieve basis order; under `Tri`, the harmonic degree is q/2, so `q(8)` yields a 4th-degree trigonometric basis | 8 |
| `k(#)` | Number of cross-fitting folds (minimum 2) | 3 |
| `alpha(#)` | Significance level for confidence intervals | 0.1 |
| `nboot(#)` | Number of Gaussian bootstrap replications (minimum 2) | 1000 |
| `seed(#)` | Random seed for fold assignment, bootstrap, and CLIME CV | -1 (no seed) |
| `z0(numlist)` | Evaluation points for the nonparametric component | Unique retained z values |
| `nofirst` | Skip first-stage estimation; requires `pihat()`, `phi1hat()`, `phi0hat()` | — |
| `verbose` | Print fold-level diagnostics | — |
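To clarify how `q(#)` maps to basis terms under the two `method()` choices, here is a small illustrative sketch. The exact basis `hddid` builds (scaling, normalization of `z`) may differ; this only shows the term count and the `q/2` harmonic-degree convention from the option table.

```python
import numpy as np

def sieve_basis(z, q, method="Pol"):
    """Build a q-term sieve basis for a 1-D array z (illustrative sketch).

    Pol: raw polynomial terms z, z^2, ..., z^q.
    Tri: trigonometric pairs sin(j*pi*z), cos(j*pi*z) for j = 1..q/2,
         so q(8) corresponds to a 4th-degree trigonometric basis.
    """
    z = np.asarray(z, dtype=float)
    if method == "Pol":
        return np.column_stack([z**j for j in range(1, q + 1)])
    if method == "Tri":
        if q % 2:
            raise ValueError("Tri needs an even q (harmonic degree q/2)")
        cols = []
        for j in range(1, q // 2 + 1):
            cols.append(np.sin(j * np.pi * z))
            cols.append(np.cos(j * np.pi * z))
        return np.column_stack(cols)
    raise ValueError("method must be 'Pol' or 'Tri'")

z = np.linspace(-1, 1, 5)
print(sieve_basis(z, 4, "Pol").shape)  # (5, 4)
print(sieve_basis(z, 8, "Tri").shape)  # (5, 8)
```

Either way, `q` is the number of sieve columns entering the second stage unpenalized.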
After `hddid`, the following `predict` options are available:

| Command | Description |
|---|---|
| `predict newvar` | Default: full ATT prediction |
| `predict newvar, xb` | Linear component |
| `predict newvar, fz` | Nonparametric component |
```stata
hddid_dgp1, n(#) p(#) [seed(#) clear]
hddid_dgp2, n(#) p(#) [seed(#) rho(#) clear]
```

| Option | Description | Default |
|---|---|---|
| `n(#)` | Number of observations | Required |
| `p(#)` | Number of high-dimensional covariates | Required |
| `seed(#)` | Random seed for data generation | -1 (no seed) |
| `rho(#)` | AR(1) correlation parameter (DGP2 only; requires \|rho\| < 1) | 0.5 |
| `clear` | Replace existing dataset in memory | — |
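The `rho(#)` option controls the AR(1) correlation among DGP2's covariates. A minimal sketch of that covariance structure (Toeplitz entries $\rho^{|j-k|}$) is below; the package's full DGP also generates treatment, outcomes, and heteroscedastic noise per the paper's Section 5, which this sketch does not reproduce.

```python
import numpy as np

def ar1_covariates(n, p, rho=0.5, seed=54321):
    """Draw an n x p matrix of Gaussian covariates with AR(1) correlation
    rho^|j-k| between columns j and k (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # Toeplitz AR(1) matrix
    chol = np.linalg.cholesky(sigma)                    # valid for |rho| < 1
    return rng.standard_normal((n, p)) @ chol.T

X = ar1_covariates(500, 5, rho=0.5)
print(X.shape)  # (500, 5)
```

Adjacent columns of `X` have correlation near 0.5; correlation decays geometrically with column distance.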
`hddid` stores the following in `e()`:

| Result | Description |
|---|---|
| `e(N)` | Final post-trim retained sample size |
| `e(N_pretrim)` | Pretrim common-score sample count |
| `e(N_trimmed)` | Number of observations trimmed |
| `e(k)` | Number of cross-fitting folds |
| `e(p)` | Dimension of high-dimensional covariates |
| `e(q)` | Sieve basis order |
| `e(qq)` | Number of evaluation points |
| `e(alpha)` | Significance level |
| `e(nboot)` | Number of bootstrap replications |
| `e(seed)` | Random seed (when specified) |
| Result | Description |
|---|---|
| `e(b)` | 1 × p debiased parametric coefficient vector (same as `e(xdebias)`) |
| `e(V)` | p × p parametric variance-covariance matrix |
| `e(xdebias)` | 1 × p debiased parametric estimates |
| `e(stdx)` | 1 × p parametric standard errors |
| `e(gdebias)` | 1 × qq debiased nonparametric estimates (omitted-intercept z-varying block) |
| `e(stdg)` | 1 × qq nonparametric standard errors |
| `e(tc)` | 1 × 2 bootstrap critical-value pair (lower, upper) |
| `e(CIpoint)` | 2 × (p+qq) pointwise confidence interval bounds |
| `e(CIuniform)` | 2 × qq uniform confidence interval bounds for nonparametric component |
| `e(z0)` | 1 × qq evaluation points |
| `e(N_per_fold)` | 1 × k post-trim sample sizes by fold |
| Result | Description |
|---|---|
| `e(cmd)` | `hddid` |
| `e(method)` | Sieve basis method (`Pol` or `Tri`) |
| `e(depvar_role)` | Original dependent variable name |
| `e(treat)` | Treatment variable name |
| `e(xvars)` | Covariate names in published beta-coordinate order |
| `e(zvar)` | Z variable name |

| Result | Description |
|---|---|
| `e(sample)` | Final post-trim estimation sample indicator |
For a complete list, see `help hddid`.
The estimation procedure consists of six steps:

1. **K-fold cross-fitting:** Split the data into K folds; for each held-out fold, estimate the nuisance functions on the remaining folds.
2. **First-stage nuisance estimation:** Estimate the propensity score $\hat\pi(W)$ via `cvlassologit` and the outcome regressions $\hat\Phi_1(W)$, $\hat\Phi_0(W)$ via `cvlasso`.
3. **DR score construction:** Form the doubly robust score with propensity trimming at $[0.01, 0.99]$:
   $$\hat{S}_i = \hat\rho_i \left(\Delta Y_i - (1-\hat\pi_i)\hat\Phi_1(W_i) - \hat\pi_i \hat\Phi_0(W_i)\right), \quad \hat\rho_i = \frac{D_i - \hat\pi_i}{\hat\pi_i(1-\hat\pi_i)}$$
4. **Second-stage Lasso:** Regress $\hat{S}_i$ on $(X, \psi(Z))$ with the Lasso penalty on $X$ only (sieve terms unpenalized).
5. **Debiased inference for $\beta$:** Correct the Lasso bias via the CLIME precision matrix estimate $\hat\Omega$:
   $$\hat{\beta}^d = \hat{\beta} + \frac{1}{n}\sum_{i=1}^n \hat{\eta}_i \tilde{X}_i' \hat{\Omega}$$
   where $\tilde{X} = X - \Pi_{X|Z}$ is the sieve projection residual.
6. **Debiased inference for $f(z)$:** One-step update via the M-matrix, with a Gaussian bootstrap for uniform confidence bands.
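The DR score in step 3 translates directly into code. The sketch below computes $\hat{S}_i$ with the $[0.01, 0.99]$ propensity trimming; the fitted nuisance values passed in are hypothetical placeholders for the cross-fitted first-stage output.

```python
import numpy as np

def dr_score(dy, d, pihat, phi1hat, phi0hat, lo=0.01, hi=0.99):
    """Doubly robust score S_i from step 3, with propensity trimming:
    observations with pihat outside [lo, hi] are dropped, mirroring
    hddid's post-trim retained sample. All inputs are 1-D arrays of
    hypothetical fitted values from the cross-fitted first stage."""
    keep = (pihat >= lo) & (pihat <= hi)
    dy, d, pi = dy[keep], d[keep], pihat[keep]
    p1, p0 = phi1hat[keep], phi0hat[keep]
    rho = (d - pi) / (pi * (1.0 - pi))                  # weight rho_i
    s = rho * (dy - (1.0 - pi) * p1 - pi * p0)          # score S_i
    return s, keep

# Toy inputs (no trimming triggered since pihat is well inside [0.01, 0.99])
rng = np.random.default_rng(0)
n = 8
pihat = rng.uniform(0.2, 0.8, n)
d = (rng.uniform(size=n) < pihat).astype(float)
dy = rng.standard_normal(n)
s, keep = dr_score(dy, d, pihat, dy * 0.5, dy * 0.2)
print(s.shape, int(keep.sum()))
```

Step 4 then regresses these scores on $(X, \psi(Z))$, penalizing only the $X$ block.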
Under the conditional parallel trends assumption and full support, the CATT is identified via the doubly robust estimand:

$$\tau(x, z) = E\left[\frac{D - \pi(W)}{\pi(W)(1-\pi(W))}\Big(\Delta Y - (1-\pi(W))\Phi_1(W) - \pi(W)\Phi_0(W)\Big) \,\Big|\, X = x, Z = z\right]$$

The resulting estimator remains consistent when either the propensity score or the outcome regressions are correctly specified.
The CATT is modeled as:

$$\tau(x, z) = x'\beta + f(z)$$

where $\beta$ is the high-dimensional parametric component and $f(\cdot)$ is an unknown function of $Z$, approximated in estimation by the sieve basis $\psi(z)$.
- **Parametric component:** The debiased estimator $\hat\beta^d$ achieves $\sqrt{n}$-normality. Pointwise confidence intervals use the normal approximation.
- **Nonparametric component:** Pointwise normality holds at each evaluation point $z_0$; uniform confidence bands are constructed via Gaussian bootstrap critical values.
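A Gaussian multiplier bootstrap of the kind used for the uniform bands can be sketched as follows. The `influence` matrix stands in for hypothetical influence-function values of $\hat f(z_0)$ at the `qq` evaluation points; `hddid`'s exact construction goes through the M-matrix one-step update, which this sketch does not reproduce.

```python
import numpy as np

def uniform_critval(influence, alpha=0.1, nboot=1000, seed=42):
    """Gaussian multiplier bootstrap critical value for a uniform band:
    draw N(0,1) multipliers, perturb the empirical mean of the influence
    values, and take the (1-alpha) quantile of the max absolute
    studentized perturbation across evaluation points."""
    n, qq = influence.shape
    se = influence.std(axis=0, ddof=1) / np.sqrt(n)   # pointwise std errors
    rng = np.random.default_rng(seed)
    sup_stats = np.empty(nboot)
    for b in range(nboot):
        g = rng.standard_normal(n)                     # Gaussian multipliers
        pert = (influence * g[:, None]).mean(axis=0)   # bootstrap perturbation
        sup_stats[b] = np.max(np.abs(pert) / se)       # sup-t statistic
    return np.quantile(sup_stats, 1 - alpha)

inf = np.random.default_rng(1).standard_normal((300, 4))
c = uniform_critval(inf, alpha=0.1)
print(c > 1.645)  # uniform critical value exceeds the pointwise normal one
```

Because the critical value is a quantile of a maximum over all evaluation points, the resulting band is wider than the pointwise intervals, giving simultaneous coverage for $f(\cdot)$ on the `z0` grid.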
Ning, Y., Peng, S., & Tao, J. (2024). Doubly robust semiparametric difference-in-differences estimators with high-dimensional data. Review of Economics and Statistics, 106(4), 1063–1080.
Stata Implementation:
- Xuanyu Cai, City University of Macau. Email: xuanyuCAI@outlook.com
- Wenli Xu, City University of Macau. Email: wlxu@cityu.edu.mo
Methodology:
- Yang Ning, Cornell University
- Sida Peng, Microsoft Research
- Jing Tao, University of Washington
AGPL-3.0. See LICENSE for details.
If you use this package in your research, please cite both the methodology paper and the Stata implementation:
APA Format:
Cai, X., & Xu, W. (2025). hddid: Stata module for doubly robust semiparametric difference-in-differences estimation with high-dimensional data (Version 1.0.0) [Computer software]. GitHub. https://github.com/gorgeousfish/hddid
Ning, Y., Peng, S., & Tao, J. (2024). Doubly robust semiparametric difference-in-differences estimators with high-dimensional data. Review of Economics and Statistics, 106(4), 1063–1080.
BibTeX:
```bibtex
@software{hddid2025stata,
  title   = {hddid: Stata module for doubly robust semiparametric difference-in-differences estimation with high-dimensional data},
  author  = {Xuanyu Cai and Wenli Xu},
  year    = {2025},
  version = {1.0.0},
  url     = {https://github.com/gorgeousfish/hddid}
}

@article{ning2024doubly,
  title   = {Doubly robust semiparametric difference-in-differences estimators with high-dimensional data},
  author  = {Ning, Yang and Peng, Sida and Tao, Jing},
  journal = {Review of Economics and Statistics},
  volume  = {106},
  number  = {4},
  pages   = {1063--1080},
  year    = {2024}
}
```

- Original R package by Ning, Peng, and Tao: https://github.com/psdsam/HDdiffindiff
- Paper: Ning, Y., Peng, S., & Tao, J. (2024). Review of Economics and Statistics, 106(4), 1063–1080. https://arxiv.org/abs/2009.03151
