Peer reviewed
ERIC Number: ED659422
Record Type: Non-Journal
Publication Date: 2023-Sep-27
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Fully-Latent Principal Stratification for Modeling Complex Program Implementation Data
Adam Sales; Sooyong Lee; Tiffany Whittaker; Hyeon-Ah Kang
Society for Research on Educational Effectiveness
Background: The data revolution in education has led to more data collection, more randomized controlled trials (RCTs), and more data collection within RCTs. Often following IES recommendations, researchers studying program effectiveness gather data on how the intervention was implemented. Educational implementation data can be complex, including longitudinal measurements of several indicators of different types. When implementation fidelity is measured along a single variable, instrumental variables techniques [Angrist et al., 1996] may apply, estimating average treatment effects for subjects who implement the program to a certain extent. When implementation can take several forms, principal stratification (PS) [e.g., Frangakis and Rubin, 2002; Page, 2012; Feller et al., 2016] can play that role. However, when measures of implementation are longitudinal, highly multivariate, and/or complex, traditional PS methods can fail [Sales et al., 2019]. This talk will describe and illustrate an extension of the PS framework, called "Fully-Latent Principal Stratification" (FLPS), that incorporates latent-variable measurement modeling into PS.
Purpose: It is widely acknowledged that variation in program implementation can lead to variation in treatment effects, but it is unclear which statistical models are capable of capturing that variation. We will describe a framework for modeling treatment effects that vary with implementation and illustrate it in a secondary analysis of an RCT comparing two online mathematics tutors: one that offered students hints and immediate error feedback, and one that offered feedback only after a delay. The goal is to learn whether the effect of offering immediate feedback tends to be higher for students who receive more feedback.
Data: 1,141 7th-grade students were blocked within classrooms, given an online pretest consisting of ten math questions, and individually randomized between the two conditions. We excluded students who were missing either their pretest or their outcome measurements, leaving n = 804 students; all other variables were imputed using a random-forest imputation algorithm. Students in both treatment groups worked on the same set of problems within the tutor, but since our goal was to measure hints and feedback, we modeled log data only from the Immediate group, and only from problems for which hints were available and responses were marked either correct or incorrect. There were 212 different problems organized into nine "problem sets." Many problems had several parts, each of which had its own hints and was marked correct or incorrect separately; hence, we modeled each problem part--298 in total--on its own. The outcome of interest was student performance on a state standardized mathematics test.
Methodology: In a randomized trial, let τ_i represent the treatment effect for student i, i = 1, ..., n. In the classic PS framework, suppose implementation is measured with a variable M_i, available only in the treatment group. Let M_Ti represent the value of M that student i would exhibit if assigned to the treatment condition. The goal of PS is to estimate E[τ | M_T = m], the average treatment effect for subjects who would implement as M = m, for some value of m. In this context, M_T is observed as M in the treatment group but unobserved in the control group; it must be imputed based on modeling assumptions and/or baseline covariates X.
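To make the missing-data structure of classic PS concrete, the estimand can be written in potential-outcomes notation (the outcome symbols Y_i(1) and Y_i(0) and the assignment indicator Z_i are standard notation introduced here for illustration; they do not appear in the abstract itself):
\[
\tau_i = Y_i(1) - Y_i(0), \qquad \text{PS estimand: } \mathrm{E}\left[\tau \mid M_T = m\right],
\]
\[
M_{Ti} \text{ observed when } Z_i = 1, \qquad M_{Ti} \sim p(M_T \mid X_i) \text{ must be imputed when } Z_i = 0.
\]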
If implementation measurements are multivariate, longitudinal, or complex, they may not be representable as a univariate measurement M without measurement error. Instead, represent them as a vector M_i for each subject and assume a measurement model p(m | η_T) relating the implementation measurements to a latent variable η_T, representing students' latent propensity to implement the program in a certain way if assigned to treatment. In our example, we consider one- and two-parameter logistic models (1PL and 2PL) as well as a graded response model (GRM) for student feedback. Given a model for η_T as a function of X and a model for the outcome Y as a function of treatment assignment, η_T, and X, we may use Bayesian techniques to estimate a fully latent principal effect E[τ | η_T], the effect of randomization to the treatment condition for subjects with latent implementation propensity η_T. In our illustration, we let m_ij be either an indicator of whether student i answered question j correctly without receiving feedback or, for the GRM, an ordered categorical variable recording whether student i answered question j correctly on the first try, received a hint or an error message but was not told the answer, or received the answer as part of their feedback.
Results: Figure 1 shows the correlations among the estimated student latent variables η_T from all three measurement models (1PL, 2PL, and GRM) and the percentage of problems on which students received feedback. Estimates from all methods were highly correlated, with the highest correlations among the latent variables from the measurement models and slightly lower correlations with the proportion correct on the first try (i.e., no feedback). Table 1 shows the coefficients on η_T, treatment assignment, and their interaction in the outcome model, for models fit with each of the three measurement models we considered. When feedback was modeled via the proportion of questions students answered correctly on the first try ("Classical"), we estimated the treatment effect to decrease with feedback receipt. When we modeled feedback receipt with measurement models, the trend was in the opposite direction. However, in all four cases trends of the opposite sign were also consistent with the data--i.e., the estimates were not statistically significant.
Conclusions: There is a serious need for statistical models relating program implementation to effects, to help answer questions about when programs are effective and which styles of implementation lead to the greatest effects. However, the rich, informative implementation data gathered automatically by computerized educational applications present both an opportunity and a challenge: they can provide deep and nuanced pictures of varying implementation, but they also require more complicated models in an already difficult-to-model context. FLPS can serve as a framework in which to develop and interpret such models. The wide range of models from psychometrics can be marshaled to summarize complex, multidimensional implementation data into low-dimensional, interpretable latent variables; then the framework of principal stratification, along with Bayesian modeling, can show us the relationship between implementation and effectiveness.
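As a concrete illustration of the kind of model described in the Methodology, the following is a minimal sketch of an FLPS fit in PyMC, assuming a 2PL measurement model for feedback receipt and a normal outcome model with a treatment-by-η_T interaction. The function name flps_2pl, the variable names (eta, disc, diff, b_z_eta, and so on), and the prior choices are ours for illustration only, not the authors' implementation; classroom blocking, the GRM variant, and covariate imputation are omitted.

# Illustrative FLPS-style model: 2PL measurement model + normal outcome model.
# All names and priors are assumptions for this sketch, not the authors' code.
import numpy as np
import pymc as pm

def flps_2pl(y, Z, X, m_obs, student_idx, item_idx, n_students, n_items):
    """y, Z, X: outcome, treatment indicator, and pretest for every student;
    m_obs: binary item responses in long format (treatment group only),
    indexed by student_idx and item_idx."""
    with pm.Model():
        # Structural model: latent implementation propensity eta_T for every
        # student, treated or not, as a function of the baseline covariate X.
        gamma = pm.Normal("gamma", 0.0, 1.0)
        eta = pm.Normal("eta", mu=gamma * X, sigma=1.0, shape=n_students)

        # 2PL measurement model, informed only by treatment-group log data.
        disc = pm.LogNormal("disc", mu=0.0, sigma=0.5, shape=n_items)  # discrimination
        diff = pm.Normal("diff", mu=0.0, sigma=1.0, shape=n_items)     # difficulty
        p_item = pm.math.sigmoid(disc[item_idx] * (eta[student_idx] - diff[item_idx]))
        pm.Bernoulli("m", p=p_item, observed=m_obs)

        # Outcome model: the treatment effect varies with eta_T through b_z_eta.
        b0 = pm.Normal("b0", 0.0, 1.0)
        b_x = pm.Normal("b_x", 0.0, 1.0)
        b_eta = pm.Normal("b_eta", 0.0, 1.0)
        b_z = pm.Normal("b_z", 0.0, 1.0)          # effect at eta_T = 0
        b_z_eta = pm.Normal("b_z_eta", 0.0, 1.0)  # change in effect per unit eta_T
        sigma_y = pm.HalfNormal("sigma_y", 1.0)
        mu_y = b0 + b_x * X + b_eta * eta + Z * (b_z + b_z_eta * eta)
        pm.Normal("y", mu=mu_y, sigma=sigma_y, observed=y)

        # Joint posterior sampling of both sub-models.
        idata = pm.sample()
    return idata

Because the measurement model and the outcome model are sampled jointly, uncertainty about each student's η_T propagates into the posterior for the principal-effect curve b_z + b_z_eta * η_T, which is the core idea of the fully latent approach.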
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A