Peer reviewed
ERIC Number: ED659614
Record Type: Non-Journal
Publication Date: 2023-Sep-30
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Using Machine Learning to Decrease the Human Coding Burden in Experimental Assessments of Text
Reagan Mozer; Luke Miratrix
Society for Research on Educational Effectiveness
Background: For randomized trials that use text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by trained human raters. These hand-coded scores are then used as a measured outcome for an impact analysis, comparing the average scores of the treatment group to those of the control group (possibly adjusting for other observed covariates). This process, the current standard, is time-consuming and limiting; even the largest human coding efforts are typically constrained to measure only a small set of dimensions across a subsample of the available texts. We propose using modern machine learning tools to assist with such human coding efforts, allowing more researchers to use text as an outcome. Achieving this end is important: while difficult to use, text is a critical outcome to consider. In K-12 settings, for example, students' academic success depends in part on their writing proficiency--a cognitive, linguistic, and social task that requires writers to communicate with a non-present audience (Gee, 2015). Despite its importance for students' learning, however, writing research continues to lag behind reading research (Juzwik et al., 2006). One key impediment to text-based research may be the difficulty associated with analyzing text.

Purpose: We aim to increase the power of large-scale randomized evaluations with text outcomes in a manner that preserves the validity of a human-coded construct. In particular, we propose augmenting standard human coding efforts by taking advantage of "untapped" observations -- those documents not manually scored due to time or resource constraints -- as a supplementary resource. This can increase the power of an impact assessment, given a fixed human-coding budget, by leveraging the full collection of documents.

Methods: Our core idea is to first hand-code a random sample of the original documents and then train a machine learning model on this subset that can be used to predict scores for the entire corpus. We then use methods from survey sampling to adjust our impact estimate. As we prove in our paper, this ensures that any biases from the machine learner are adjusted for, so we maintain focus on the human-coded construct. Through a simulation study based on data from a recent field trial in education, we show that our approach indeed reduces the scope of the human-coding effort needed to achieve a desired level of power. In particular, consider a collection of N documents where Y_i is the score for document i with respect to a human-coded construct and Z_i is an indicator for treatment. Our approach, a methodological combination of causal inference, survey sampling methods, and machine learning, has four steps: (1) select and code a sample of n < N documents; (2) build a machine learning model to predict the human-coded outcomes from a set of automatically extracted text features, X; (3) generate machine-predicted scores for all documents; and (4) adjust final impact estimates using the residual differences between human-coded and machine-predicted outcomes. This final step ensures that any biases in the modeling procedure do not propagate to biases in the final estimated effects. Our estimator for the average treatment effect is shown in Display 1.
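As a rough illustration of step (4), one model-assisted form consistent with the description above combines a difference in mean predicted scores over the full corpus with a residual correction computed on the hand-coded sample. The Python sketch below reflects that assumption only; the exact form of the Display 1 estimator may differ in detail, and the function name and argument layout are hypothetical.

import numpy as np

def adjusted_ate(y_hat, z, y_coded, y_hat_coded, z_coded):
    """Residual-adjusted (model-assisted) treatment effect estimate.

    y_hat       : machine-predicted scores for all N documents
    z           : treatment indicators (0/1) for all N documents
    y_coded     : human-coded scores for the n sampled documents
    y_hat_coded : machine-predicted scores for those same n documents
    z_coded     : treatment indicators for the coded sample
    """
    # Difference in mean predicted scores across the full corpus
    tau_pred = y_hat[z == 1].mean() - y_hat[z == 0].mean()
    # Bias correction: difference in mean residuals on the coded sample
    resid = y_coded - y_hat_coded
    tau_resid = resid[z_coded == 1].mean() - resid[z_coded == 0].mean()
    return tau_pred + tau_resid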
Simulation Study and Data: To demonstrate the utility of our proposed framework, we conducted a simulation study using a sample of N=1361 student "essays" collected in the cluster-randomized trial recently conducted by Kim et al. (2021). Within this sample, a total of 722 students received the Model of Reading Engagement (MORE) classroom intervention (i.e., treatment) and the remaining 639 received typical instruction (i.e., control). After three weeks, both groups were given an open-ended writing assessment, and their hand-written responses were digitally transcribed and scored by human raters. In the actual study, raters coded all 1361 documents. Here, we report on a simulation study in which we repeatedly simulate a researcher scoring a subset of between 10% and 90% of the corpus and then applying the approach described above. For each 10% increase, we examine to what extent exploiting the additional uncoded sample improves our power to detect a significant treatment effect beyond what can be achieved using only the hand-coded sample. For a given human coding budget n, our simulation proceeds as follows (see the sketch below). We first select a stratified random sample of n documents from our full population of N essays, with equal sampling within each treatment group. These essays are then "coded" (by revealing the original hand-coded values) and used to train a model for predicting the human-coded outcomes as a function of the essay text. We then apply this model to all N documents, human-coded or otherwise, to generate predicted scores. We finally estimate the treatment impact using Display 1. We repeat this procedure across 5000 bootstrap samples of the original data and calculate the power (as a function of the standard error calculated across iterations) to detect an effect of 0.30 standard deviations.

Results: Display 2 shows the estimated power of our proposed approach using five different classes of machine learners. For comparison, we also compute a "baseline" power estimate for the simple difference in means on just the hand-coded documents. Overall, we find that exploiting additional, uncoded data indeed improves efficiency. In particular, with machine learning, we are able to achieve a nominal power of 80% by hand-coding only 40% of the corpus. Without our machine learning augmentation, we would need to code roughly 65% of the corpus, a 63% increase in effort, to achieve the same power.

Conclusions: Overall, our simulation study provides compelling evidence that machine learning models trained on automatically extracted features of text can be used to reduce the scope of a human-coding effort while maintaining nominal power to detect a significant average treatment impact. Our simulation also identifies tree-based methods as particularly effective for scoring text in this context. We believe that by easing some of the burdens associated with using coded qualitative constructs, our proposed approach can help enhance and expand the use of text data as an outcome in education evaluations.
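The following is a minimal Python sketch of one simulation iteration at a given coding budget, reusing the hypothetical adjusted_ate helper sketched under Methods. The TF-IDF features, the random-forest learner, and the simulate_once name are illustrative assumptions only; the study compares five classes of machine learners, and its actual feature extraction and models may differ.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

def simulate_once(texts, y_full, z, frac_coded, rng):
    """One simulation iteration: sample, 'code', train, predict, estimate."""
    N = len(texts)
    # (1) Stratified random sample of documents to "hand-code", equal fraction per arm
    coded = np.zeros(N, dtype=bool)
    for arm in (0, 1):
        idx = np.flatnonzero(z == arm)
        n_arm = int(round(frac_coded * len(idx)))
        coded[rng.choice(idx, size=n_arm, replace=False)] = True
    # (2) Extract automatic text features and fit a learner on the coded subset
    X = TfidfVectorizer(max_features=2000).fit_transform(texts)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[coded], y_full[coded])
    # (3) Generate machine-predicted scores for the entire corpus
    y_hat = model.predict(X)
    # (4) Residual-adjusted impact estimate (see the adjusted_ate sketch above)
    return adjusted_ate(y_hat, z, y_full[coded], y_hat[coded], z[coded])

# Power at a given budget is then approximated by repeating simulate_once over many
# resamples of the data and using the standard error of the resulting estimates to
# compute the power to detect a 0.30 SD effect.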
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A