ERIC Number: ED659733
Record Type: Non-Journal
Publication Date: 2023-Sep-27
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Using Machine Learning Methods to Detect Heterogeneous Treatment Effects for Multilevel Randomized Controlled Trials: A Review and Empirical Comparison
Wei Li; Walter Leite; Jia Quan
Society for Research on Educational Effectiveness
Background: Multilevel randomized controlled trials (MRCTs) have been widely used to evaluate the causal effects of educational interventions. Traditionally, educational researchers and policymakers have focused on the average treatment effect (ATE) of an intervention. Recently, there has been increasing interest in evaluating the heterogeneity of treatment effects (HTEs) among intervention participants, for several reasons. First, evaluating HTEs allows researchers to understand how treatment effects vary among subgroups and thus address questions about "for whom and under what conditions" an intervention works. The individual and cluster characteristics used to identify subgroups are called moderators and can augment or reduce treatment effects. Second, it helps researchers examine whether an intervention increases or decreases gaps in the outcomes of interest and thus identify interventions that can improve fairness and equity in education. Educational researchers commonly incorporate a treatment-by-moderator interaction within OLS regression or multilevel models (MLMs) to explore moderator effects (e.g., Dong et al., 2022). Recent developments in statistics and econometrics (e.g., Athey & Wager, 2019; Chernozhukov et al., 2020) propose using machine learning (ML) methods to explore HTEs by estimating the conditional average treatment effect (CATE). Compared to traditional interaction analysis, these methods have several advantages. For example, interaction analysis cannot identify causal relationships because of potential correlations between the moderators and omitted variables in the error term (Dong et al., 2022), whereas some ML methods (e.g., causal forest) can facilitate causal inference under standard assumptions. In addition, traditional moderator analysis usually requires specifying the moderators in the design phase and thus may miss important sources of HTEs, whereas ML methods can select moderators from a potentially large number of covariates. To deal with the potential dependency among students within the same schools, MLMs are widely used for moderation analysis. Similarly, when applying ML methods to estimate the CATE, applied researchers still need to account for the nested data structure. However, most prior literature assumes that participants are independent, and there is little guidance for educational researchers on appropriately applying ML methods to clustered data when evaluating HTEs.

Purpose and Significance: This study contributes to the literature on the design and analysis of MRCTs by reviewing the currently available ML methods and tools that account for the nested data structure when estimating the CATE and by providing recommendations to applied researchers on how to choose among alternative ML methods and statistical packages. Specifically, this study focuses on two ML methods -- Causal Forest (CF) and GenericML -- and demonstrates their application using the dataset from a large multisite experimental study (Leite et al., 2023).

Research Design and Methods: Many ML methods have been proposed to estimate the CATE in the past decade (see Caron et al., 2020, and Jacob, 2021, for recent reviews). In general, these methods include three main steps: (1) splitting the data into training and test sets, (2) using the training set and ML algorithms to build a prediction model, and (3) using the test set to estimate HTEs and their standard errors (SEs).
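To make these three steps concrete, the following is a minimal illustrative sketch in R using simulated data and a simple two-model (T-learner) approach; it is not one of the specific algorithms reviewed in this study, the variable names are hypothetical, and standard-error estimation is omitted.

library(randomForest)

set.seed(123)
n <- 1000
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.5))   # covariates
W <- rbinom(n, 1, 0.5)                                                   # randomized treatment indicator
Y <- 1 + X$x2 + W * (0.5 * X$x1) + rnorm(n)                              # outcome with a heterogeneous effect

# Step 1: split the data into training and test sets
train <- sample(n, n / 2)
test  <- setdiff(seq_len(n), train)

# Step 2: build prediction models on the training set, separately by treatment arm
fit1 <- randomForest(x = X[train[W[train] == 1], ], y = Y[train[W[train] == 1]])
fit0 <- randomForest(x = X[train[W[train] == 0], ], y = Y[train[W[train] == 0]])

# Step 3: estimate HTEs on the test set as the difference in predicted outcomes
cate_hat <- predict(fit1, newdata = X[test, ]) - predict(fit0, newdata = X[test, ])
summary(cate_hat)

The methods reviewed below differ mainly in how they carry out steps (2) and (3) and in whether they respect the clustering of students within teachers or schools.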
Based on our review of the currently available methods and packages, only two algorithms -- CF and GenericML -- consider the nested data structure in at least one step. Specifically, the CF algorithm estimates the CATE through honest causal trees (Wager & Athey, 2018). When analyzing data from MRCTs, the cluster-robust CF algorithm estimates the CATE by making predictions as an average over B trees (Athey & Wager, 2019). It considers the nested data structure in all three steps: (1) for each b = 1, ..., B, draw a subsample of clusters and then draw a random sample from each cluster as the training data; (2) grow a tree via recursive partitioning on each such subsample of the data; and (3) make out-of-bag predictions. It should be noted that, to account for potential within-cluster dependency, an observation i is considered out-of-bag if its cluster was not drawn in step (1). Similarly, the GenericML algorithm (Chernozhukov et al., 2020) estimates the best linear predictor (BLP) of the CATE through the following steps: (1) randomly split the data into training and test sets; (2) estimate the CATE with any number of selected ML methods (e.g., LASSO, SVM) using the training data; and (3) use OLS regression to obtain the BLP of the CATE using the test data. Note that, for multisite designs, OLS with site dummy variables is used in the third step, but the algorithm does not consider the nested data structure in the first and second steps. The cluster-robust CF algorithm can be applied through the R package "grf", and the GenericML algorithm can be implemented through the "GenericML" R package. Both packages report cluster-robust SEs. In addition, the "GenericML" package can estimate sorted group average treatment effects (GATEs), which are obtained by creating five groups of participants using quintiles of the CATE distribution, and perform classification analysis (CLAN) to explore the relationships between covariates and the CATE. In contrast, the "grf" package does not automatically report GATEs or CLAN.

Preliminary Results: We applied the cluster-robust CF and GenericML algorithms to data from a large-scale three-level multisite experimental study. This study included 52 math teachers and 2,936 students from three school districts and randomly assigned students of participating teachers to see video recommendations. Our analysis includes 516 predictors, of which 484 are dummy-coded indicators. Appendix Table 1 summarizes the GATEs estimated with the "GenericML" package, showing the difference between the group that benefitted the most (Group 5) and the group that benefitted the least (Group 1) from the intervention. The 516 predictors were sorted based on their mean differences between Groups 1 and 5. A thematic analysis of the most important predictors showed that teachers of students who benefitted most reported spending more time using the videos in the virtual learning environment (VLE) and following student progress on the dashboard. We will manually estimate GATEs and CLAN based on the CATE estimates from the "grf" package and then compare the results from the two packages with respect to the CATE, GATEs, and CLAN.
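As an illustration of the cluster-robust CF workflow and of computing GATEs and CLAN by hand from its output, the sketch below uses the "grf" package; the object names (X, Y, W, cluster_id) and the covariate chosen for CLAN are placeholders rather than the study's variables, and grouping the out-of-bag estimates by quintile is only one simple way to summarize them, not necessarily the procedure the authors will use.

library(grf)

# X: numeric covariate matrix; Y: outcome; W: 0/1 treatment; cluster_id: teacher or school ID
cf <- causal_forest(X, Y, W, clusters = cluster_id)

# Out-of-bag CATE predictions: an observation is out-of-bag for a tree only if
# its entire cluster was excluded from that tree's subsample
cate_hat <- predict(cf)$predictions

# Cluster-robust ATE estimate and standard error
average_treatment_effect(cf)

# Manual GATEs: five groups defined by quintiles of the estimated CATE distribution
grp <- cut(cate_hat,
           breaks = quantile(cate_hat, probs = seq(0, 1, 0.2)),
           include.lowest = TRUE, labels = paste0("Group ", 1:5))
sapply(levels(grp), function(g) average_treatment_effect(cf, subset = grp == g))

# Manual CLAN for a single covariate (here the first column, as an example):
# compare its mean between the most- and least-benefitting groups
t.test(X[grp == "Group 5", 1], X[grp == "Group 1", 1])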
Descriptors: Artificial Intelligence, Identification, Hierarchical Linear Modeling, Randomized Controlled Trials, Comparative Analysis, Educational Research, Algorithms
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A