ERIC Number: ED659624
Record Type: Non-Journal
Publication Date: 2023-Sep-27
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Using Simulation to Interrogate Inequities before Scaling up a Promising Teacher Education Initiative
Emanuele Bardelli; Matthew Ronfeldt; Matthew Truwit
Society for Research on Educational Effectiveness
Background: Recent field experiments confirm that learning to teach under a more instructionally effective mentor causes teacher candidates to feel more prepared (Ronfeldt et al., 2020; Ronfeldt, Goldhaber, et al., 2018) and demonstrate more effective teaching (Goldhaber et al., 2022). One of these experiments--designed under a research-practice partnership (RPP) with the STATE Department of Education (SDoE)--leveraged statewide administrative data to help one teacher preparation program (TPP) recruit more instructionally effective teachers to serve as mentors (Ronfeldt et al., 2020); specifically, the authors developed a novel algorithm that combined observation ratings (ORs), value-added measures of student achievement (VAMs), and years of experience to generate recommendation lists identifying the most promising teachers who could potentially serve as mentors in each district and endorsement area. Providing these lists to a randomly assigned group of districts resulted in the recruitment of substantially more instructionally effective and experienced mentors; moreover, the candidates placed with these mentors felt more prepared and were rated as more instructionally effective than peers placed in districts employing business-as-usual recruitment. Purpose: Given these results, partners at the SDoE were interested in implementing this algorithm across the state. However, because the recommendation lists relied heavily on observation ratings--which, in prior literature, have shown possible bias against teachers of color and classrooms with more students of color (Campbell, 2020; Grissom & Bartanen, 2022; Steinberg & Sartain, 2020)--there was a collective concern that doing so could exacerbate the existing overrepresentation of White and female teachers working with more privileged students among the mentor pool (Ronfeldt, Brockman, et al., 2018).
Therefore, prior to encouraging the adoption of this algorithm at scale, we sought to simulate its statewide implementation, hypothetically identifying the mentors who would have been recruited if all programs had received recommendation lists and comparing their characteristics to those of mentors who previously served in STATE. Setting: This simulation draws on administrative data from STATE's statewide longitudinal data system. STATE has a history of emphasizing the recruitment of instructionally effective teachers as mentors more strongly than most states; moreover, the SDoE has committed to supporting the continued refinement of TPPs' practices--especially with regard to clinical experience design--largely through the support of RPPs. Sample: The administrative data leveraged in these simulations include most of STATE's program completers, teachers, and schools between the 2009-2010 and 2018-2019 academic years. For our comparison sample, we rely on historical clinical mentor match data that identify all 16,256 mentors with whom the state's 10,917 teacher candidates served. Intervention: The algorithm that serves as the linchpin of this intervention calculates a weighted composite of up to three prior years of ORs, VAMs, and years of experience for each teacher; it then generates lists--ranked from most to least promising--of all teachers within each block (i.e., combination of district, grade level, and subject area). Design: We simulate statewide implementation of this algorithm in three distinct but interconnected steps that mimic the mentor recruitment process traditionally followed by TPPs. First, we use historical data to estimate the plausible number of requests for mentors made in each block as an analogue to TPPs collecting candidates' field placement preferences.
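The weighted-composite ranking described under Intervention can be sketched as follows. This is a minimal illustration only: the weights (W_OR, W_VAM, W_EXP) and the averaging of up to three prior years of scores are assumptions, since the abstract does not specify the algorithm's actual weighting scheme.

```python
from dataclasses import dataclass

@dataclass
class Teacher:
    name: str
    obs_ratings: list       # up to three prior years of observation ratings (ORs)
    vams: list              # up to three prior years of value-added scores (VAMs)
    years_experience: int

# Hypothetical weights; the paper's actual weighting is not given in the abstract.
W_OR, W_VAM, W_EXP = 0.5, 0.3, 0.2

def composite(t: Teacher) -> float:
    """Weighted composite of mean ORs, mean VAMs, and years of experience."""
    mean_or = sum(t.obs_ratings) / len(t.obs_ratings)
    mean_vam = sum(t.vams) / len(t.vams)
    return W_OR * mean_or + W_VAM * mean_vam + W_EXP * t.years_experience

def rank_block(teachers: list) -> list:
    """Rank all teachers in one block (district x grade level x subject area)
    from most to least promising."""
    return sorted(teachers, key=composite, reverse=True)
```

In this sketch, a ranked list would be generated separately for each block, mirroring the per-district, per-endorsement-area recommendation lists described above.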
Second, we train LASSO-penalized linear probability models that predict candidate endorsement areas from mentor teaching assignments to identify the teachers in all blocks and years who could have served as mentors, mirroring TPPs' requests to districts to identify possible teachers; we then rank these teachers using the algorithm described above. Third, we run 10,000 Monte Carlo simulations to model teachers' responses to the invitation to serve as a mentor, accounting for the possibility that they decline, as a parallel to the interactions between TPPs and potential mentors. Analysis: We use regression to compare the average mentor (e.g., identifying as Black) and school (e.g., percentage of students qualifying for free or reduced-price lunch) characteristics of the teachers hypothetically recruited under statewide implementation of recommendation lists to those of the teachers who historically served. We include block-by-year fixed effects, restricting estimates to leverage variation within each block and year, and cluster standard errors at the district-by-year level to account for possible correlation in error terms. Findings: In Table 1, we find that the teachers who would have been recruited using the recommendation lists are consistently more instructionally effective and experienced than the teachers who historically served, providing further evidence of the capacity of this algorithm to improve the average instructional quality of the mentor pool. At the same time, we find that these teachers were also significantly more likely to be White and to work in wealthier and higher-achieving schools with significantly more White students. Together, these findings suggest the possibility of small but statistically significant biases in the mentors and schools identified by this algorithm. Consequently, in Table 2 we explore possible ways to address these biases by adjusting the components of the algorithm for mentor and/or school characteristics.
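The third design step, modeling whether invited teachers accept or decline, can be sketched as a simple Monte Carlo loop. The acceptance probability p_accept and the per-simulation summary (mean composite score of recruited mentors) are illustrative assumptions; the abstract does not state how decline behavior was parameterized.

```python
import random

def simulate_recruitment(ranked, n_slots, p_accept, rng):
    """Walk down a ranked list, inviting teachers until n_slots are filled;
    each invitee accepts with (assumed) probability p_accept."""
    recruited = []
    for teacher in ranked:
        if len(recruited) == n_slots:
            break
        if rng.random() < p_accept:
            recruited.append(teacher)
    return recruited

def monte_carlo(ranked, n_slots, p_accept, n_sims=10_000, seed=0):
    """Average the mean score of recruited mentors across n_sims draws,
    mirroring the 10,000 simulations described in the Design section."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        recruited = simulate_recruitment(ranked, n_slots, p_accept, rng)
        mean_score = sum(t["score"] for t in recruited) / max(len(recruited), 1)
        totals.append(mean_score)
    return sum(totals) / n_sims
```

With p_accept below 1, lower-ranked teachers are sometimes recruited in place of decliners, which is what lets the simulation capture the back-and-forth between TPPs and potential mentors.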
We find that adjusting mentors' ORs for race and gender reduces and, in some cases, eliminates or even reverses differences in the characteristics of the teachers hypothetically recruited in our simulations and of those who historically served as mentors, as well as in the characteristics of their schools. Importantly, even with these adjustments, the algorithm continues to produce a substantially more instructionally effective and experienced mentor pool than that which was historically recruited in STATE. Conclusions: Simulations can help answer the what if question of the potential effects of an intervention adopted at scale, providing observational estimates of its plausible impacts by combining historical data with models of hypothetical implementations. This simulation, which approximates real-world mentor recruitment, finds that statewide scale-up of a data-driven algorithm would have increased the instructional effectiveness and experience of the average mentor but also reinforced existing disparities in who serves as a mentor. More importantly, it also helps identify a solution (i.e., adjusting ORs for mentor characteristics) that has the potential to preemptively avert exacerbating racial and gender biases without eroding the benefits of algorithmically incorporating administrative data into the mentor recruitment process.
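One way to adjust ORs for mentor race and gender, sketched below, is to remove race-by-gender group means and re-center on the grand mean. This group-mean residualization is an assumption for illustration; the abstract does not specify the adjustment model the authors used.

```python
from collections import defaultdict

def adjust_ratings(records):
    """Adjust observation ratings for mentor race and gender by removing
    race-by-gender group means, then re-centering on the grand mean.

    records: list of dicts with keys 'or' (rating), 'race', 'gender'.
    Returns adjusted ratings in the same order as the input.
    """
    grand_mean = sum(r["or"] for r in records) / len(records)
    groups = defaultdict(list)
    for r in records:
        groups[(r["race"], r["gender"])].append(r["or"])
    group_mean = {g: sum(v) / len(v) for g, v in groups.items()}
    return [r["or"] - group_mean[(r["race"], r["gender"])] + grand_mean
            for r in records]
```

Under this scheme, ranking within a block reflects how a teacher's rating compares to peers in the same demographic group, so systematic group-level differences in ORs no longer drive who appears atop the recommendation lists.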
Descriptors: Simulation, Equal Education, Teacher Education, Intervention, Outcomes of Education, Algorithms, Instructional Effectiveness, Mentors, Bias
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A