Peer reviewed
ERIC Number: ED657265
Record Type: Non-Journal
Publication Date: 2021-Sep-28
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Statistical Power for Estimating Treatment Effects Using Difference-in-Differences and Comparative Interrupted Time Series Designs with Variation in Treatment Timing, with an Application to Estimating the Effects of COVID-19
Peter Schochet
Society for Research on Educational Effectiveness
Background: When RCTs are not feasible and time series data are available, panel data methods can be used to estimate treatment effects on outcomes by exploiting variation in policies and conditions over time and across locations. A complication with these methods, however, is that treatment timing often varies across the sample, for example, due to differences across locations in treatment implementation, adoption of laws, or natural phenomena (such as the onset of COVID-19). Only recently has this issue been addressed in the literature, and only for estimating treatment effects in difference-in-differences (DID) designs, not for other panel designs. Another complication is that outcomes are often autocorrelated over time, which, if not addressed, can lead to estimated standard errors that are seriously biased downwards.

Purpose: This article discusses new power analyses to assess required sample sizes that account for both of these real-world complications for several commonly used panel designs: (1) DID designs and (2) comparative interrupted time series (CITS) and associated ITS designs. We adopt common impact estimators found in the literature and build on the much smaller associated literature on statistical power, which has focused only on settings without variation in treatment timing. An additional contribution of this work is that we incorporate recent approaches for adjusting for the clustering of individuals within groups--such as geographic or educational units--an area where there has been considerable confusion for quasi-experimental designs (QEDs). We consider models with separate cross-sections of individuals as well as longitudinal designs where the same individuals are followed. Further, we allow for time periods to be unevenly spaced and for the inclusion of model covariates to improve precision. After presenting the theory, the paper presents results from an illustrative power analysis to provide guidance on appropriate sample sizes for various model specifications.
The results show that accounting for variation in treatment timing and autocorrelated errors can reduce power substantially. The session will also preview an available Shiny R dashboard that performs the sample size calculations. The paper contributes to the conference theme, "The Fierce Urgency of Knowledge: Education Evidence for Reimagining and Reckoning," because the impetus for this work was to help support design efforts for the plethora of research being conducted on the effects of COVID-19, which hit different areas at different times.

Statistical Model: The analysis relies on developing formulas for minimum detectable impacts in effect size units and associated closed-form variance expressions. The paper relies on common regression estimators found in the literature and focuses on the average treatment on the treated (ATT) parameter (the typical parameter of interest). The key innovation is to accommodate variation in treatment timing. As an illustration, for the DID cross-sectional analysis, we use the following regression model using stacked data on separate cross-sections of individuals nested within study units (clusters) over time:

$y_{ijt} = \alpha_j + \delta_t + \sum_{k=1}^{K} \sum_{p=2}^{P} \beta_{kp} I(G_j = k) F_{ijt,p} T_j + \theta_{jt} + \epsilon_{ijt}$, (1)

$\theta_{jt} = \rho \theta_{j(t-1)} + \eta_{jt}$.

In this model, the subscript i indexes individuals, j indexes clusters, and t indexes time, with P time periods (whose intervals do not need to be evenly spaced). The parameters $\alpha_j$ and $\delta_t$ are cluster and time fixed effects, $I(G_j = k)$ is an indicator that equals 1 for treatment group clusters in timing group k, $F_{ijt,p}$ is a period-p indicator that equals 1 if t = p, and $T_j$ is a treatment indicator.
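As a minimal sketch (hypothetical parameter values, not the paper's code), the AR(1) error process $\theta_{jt} = \rho \theta_{j(t-1)} + \eta_{jt}$ from model (1) can be simulated to confirm that cluster-by-period errors carry a lag-1 autocorrelation of about $\rho$, which is exactly the feature that inflates standard errors when ignored:

```python
import numpy as np

# Hypothetical illustration (not the paper's code): simulate the AR(1)
# cluster-by-period error theta_{jt} = rho * theta_{j,t-1} + eta_{jt}
# and check that its empirical lag-1 autocorrelation is close to rho.
rng = np.random.default_rng(0)
J, T, rho = 2000, 10, 0.5          # clusters, periods, AR(1) coefficient
sigma_eta = 1.0                    # sd of the innovation eta_{jt}

theta = np.zeros((J, T))
# draw the first period from the stationary distribution of the process
theta[:, 0] = rng.normal(0.0, sigma_eta / np.sqrt(1 - rho**2), J)
for t in range(1, T):
    theta[:, t] = rho * theta[:, t - 1] + rng.normal(0.0, sigma_eta, J)

# empirical lag-1 autocorrelation, pooling adjacent-period pairs within clusters
lag1 = np.corrcoef(theta[:, :-1].ravel(), theta[:, 1:].ravel())[0, 1]
print(round(lag1, 2))  # close to rho = 0.5
```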
The random error $\theta_{jt}$ captures the correlations of individuals within the same cluster and time period, and the $\epsilon_{ijt}$ are iid individual-level errors. We allow $\theta_{jt}$ to be correlated over time using an autoregressive process of order 1. The regression model in (1) includes three-way interactions between indicators of timing group, time period, and treatment status; the model, however, excludes interactions for the comparison group (k = 0) and the first pre-period (p = 1). Thus, the resulting OLS estimators, $\hat{\beta}_{DID,kp,1}$, provide DID estimates relative to the first pre-period. Next, we can (1) aggregate the pre-period DID estimators to obtain estimators for each post-period relative to the average pre-period, $\hat{\beta}_{DID,kq}$; (2) aggregate these estimators across post-periods; and (3) aggregate across the K timing groups to obtain an unbiased estimator for the ATT parameter:

$\hat{\beta}_{DID} = \sum_{k=1}^{K} \sum_{q=S_k}^{P} w_{kq} \hat{\beta}_{DID,kq}$, (2)

where the $w_{kq}$ are weights that sum to 1 and $S_k$ are the treatment start periods for the timing groups. This event history approach allows for heterogeneity of treatment effects across time and timing groups. The paper discusses potential choices for the weights and develops closed-form variance formulas for the ATT estimator in (2). It also derives variance formulas with model covariates (and associated assumptions).

For the CITS analysis, the paper builds on the DID model in (1) and (2) by modeling pre-period trends (linearly). The design examines whether, once the treatment begins, the treatment group deviates from its pre-intervention trend by a greater amount than does the comparison group. Under the more common ITS design, the treatment group is compared to its own pre-period trend only.
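The weighted aggregation in equation (2) is a simple sum over timing groups and post-periods. A toy sketch (all numbers hypothetical, not from the paper; the weighting scheme here, treated-observation shares, is one of several choices the paper discusses):

```python
import numpy as np

# Hypothetical numbers: aggregate timing-group-by-post-period DID
# estimates into the ATT estimator of equation (2),
#   beta_DID = sum_k sum_q w_kq * beta_kq,  with weights summing to 1.
beta_kq = np.array([[0.20, 0.25, 0.30],    # timing group k=1, post-periods q
                    [0.10, 0.15, 0.20]])   # timing group k=2

# one possible weighting: share of treated observations in each (k, q) cell
n_kq = np.array([[ 50,  50,  50],
                 [100, 100, 100]])
w_kq = n_kq / n_kq.sum()                   # weights sum to 1 by construction

att = np.sum(w_kq * beta_kq)
print(round(att, 4))  # 0.1833
```

With these weights, the later-treated (larger) timing group pulls the ATT toward its smaller period-specific effects.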
The paper develops variance formulas for these designs, which become more complex than for the DID design due to extra terms pertaining to the fitted pre-period trend lines.

Usefulness of Method: The paper presents an illustrative power analysis showing that ignoring variation in treatment timing and autocorrelated errors can seriously understate required sample sizes (overstate power) for the considered panel designs, by 70 percent on average across the examined specifications. Power erodes as the timing groups become farther apart. Nonetheless, the paper finds that the DID design, even after adjusting for the extra variance factors, can detect treatment effects that are likely to meet industry standards with the amount of data often available in practice. However, the CITS and ITS designs require considerably larger samples that may not be viable for some studies.

Conclusions: The key conclusion is that adjusting for the variation in treatment timing (which is realistic in many panel settings) and autocorrelated errors matters considerably in assessing appropriate sample sizes for panel studies. Thus, it is important that these adjustments be applied when assessing appropriate sample sizes in designing panel studies, such as those examining the effects of the pandemic on student outcomes.
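The link between understated variance and understated sample sizes can be seen with the standard minimum-detectable-effect multiplier (a generic textbook formula, not the paper's closed-form variance expressions; the standard error and inflation factor below are hypothetical):

```python
from statistics import NormalDist

# Generic two-sided MDE formula: MDE = (z_{1-alpha/2} + z_{power}) * SE.
# Illustrative only -- the paper derives design-specific SEs; here the SE
# and the 1.7 variance inflation factor are assumed numbers.
def mde(se, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * se

se_naive = 0.05                       # hypothetical naive standard error
print(round(mde(se_naive), 3))        # MDE under the naive variance

# If the naive calculation understates the required sample size by 70%,
# the true variance is about 1.7x larger, so the SE (and hence the MDE)
# grows by sqrt(1.7).
print(round(mde(se_naive * 1.7 ** 0.5), 3))
```

Because required sample size scales with variance, the same sqrt(1.7) factor translates directly into the larger samples the paper recommends planning for.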
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A