Effect of Gamification on Gamers: Evaluating Effect Heterogeneity of Gamified Programs for Students Who Game the System Using Fully Latent Principal Stratification.

Kirk Vanacore; Ashish Gurung; Adam Sales; Neil Heffernan

Background: Gaming the system -- attempting to progress through a learning activity without learning (R. Baker et al., 2008) -- is an enduring problem that reduces the efficacy of Computer Based Learning Platforms (CBLPs). Researchers have made substantial progress in identifying instances when students are gaming the system (Baker et al., 2006; Dang & Koedinger, 2019). However, solutions remain scarce. This study addresses whether students who game the system (gamers) in a traditional CBLP that includes problem sets with immediate hints and feedback would respond differently to alternative CBLP environments: two gamified CBLPs and a traditional CBLP in which access to hints and feedback was delayed until the end of each activity. We find that gamification does not consistently mitigate the negative effects of gaming the system on learning. Still, gamers may benefit from delayed hints and feedback. As a secondary objective, we present an example of integrating predictions from detection models into causal models. We utilize a method of causal moderation -- Fully Latent Principal Stratification (FLPS) -- that can leverage detection model outputs to understand heterogeneity in treatment effects. The combination of detection and causal models provides opportunities to leverage AI and ML model outputs to improve our understanding of student learning and to be more responsive to students' learning needs.
Study Design: This study uses open-source data from a randomized controlled trial conducted in ten US middle schools. Students were randomly assigned to one of four conditions: one of the two gamified conditions -- From Here to There (FH2T) or DragonBox -- or one of two versions of ASSISTments, one with immediate feedback and on-demand hints (Immediate Condition) and another in which access to feedback and hints was delayed (Delayed Condition). Figure 1 shows an example problem from each condition.
Method: Our aim, to assess whether the students who gamed the system in a traditional CBLP would have benefited from a different CBLP, poses a methodological difficulty. Estimating the effect of the conditions on gamers requires us to contrast their post-test scores with scores from comparable students in the other conditions. However, gaming behavior is confounded with the condition. To address this methodological problem, we propose that students have a baseline propensity to game the system in the Immediate Condition before randomization. Because this propensity is treated as a baseline covariate, similar to students' pretest knowledge, it is independent of the random treatment assignment and can serve as a moderating variable. However, only students in the Immediate Condition can display the behavior; therefore, its value in the other conditions is unknown. Nevertheless, once this latent propensity is estimated, we can evaluate whether it moderates the effects of the various interventions. Our analysis requires two key steps. First, we use a Knowledge Engineering Gaming Detector (Paquette et al., 2015) to identify instances where students are gaming the system within the Immediate Condition. Next, we use FLPS, which allows us to estimate the effect heterogeneity of each program based on students' latent propensity to game the system (Sales & Pane, 2019). FLPS requires two submodels, which are delineated briefly below.
Measurement Submodel: First, we estimate the propensity to game the system in the Immediate Condition ([alpha][subscript c]) by fitting a multilevel logistic submodel that predicts, for each twenty-second time clip, whether the gaming detector flagged a student in the Immediate Condition as gaming the system (Equation 1). G[subscript cji] is a binary indicator of whether student i gamed the system during time-clip c when working on problem j. P[subscript ki] is the k-th student-level covariate for student i, measured at baseline for all students.
Let the random intercepts be [gamma][subscript j] for problems, [gamma][subscript i] for students, [gamma][subscript t] for teachers, and [gamma][subscript s] for schools. [equation omitted] Thus, students' propensity to game the system is defined as [equation omitted]. [alpha][subscript c] is imputed for students who are not in the Immediate Condition.
Outcomes Submodel: To estimate the treatment effect for students with differing propensities to game the system, we fit a multilevel linear regression predicting students' post-test algebraic knowledge (Y[subscript i]). The submodel includes an interaction between [alpha][subscript ci] and each Z[subscript i], an indicator of assignment to a treatment condition; in practice, there is one Z[subscript i] indicator for each treatment condition. [nu][subscript t] and [nu][subscript s] are random effects for teachers and schools, respectively. [equation omitted] Using the parameters from this submodel, the treatment effect for students with a particular propensity to game the system is modeled as [equation omitted]. Together, the submodels form an FLPS model, which we fit using the Stan Markov Chain Monte Carlo software.
Data: We use data at two levels: student and time clip. The student-level variables include the pretest data, demographics, roster, and learning outcomes. In contrast, the data used by the gaming-the-system detector and the detector's output are aggregated into twenty-second clips of each student's program usage. Only time-clip data from the Immediate Condition is used to estimate students' propensity to game the system in that condition. The sample consists of 1976 students: 402 in the Immediate Condition, 385 in the Delayed Condition, 372 in DragonBox, and 817 in FH2T.
Results: Table 1 presents both submodel parameters, and Figure 2 visualizes the interactions. Propensity to game the system ([alpha]) was negatively associated with the outcome. However, the interactions between [alpha] and the conditions varied widely.
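To make the main-effect and interaction language in these results concrete, the outcomes submodel and the conditional effect it implies can be sketched as below. This is an assumed generic form pieced together from the symbols defined in the Method section, not a reproduction of the omitted equations. The coefficient names \beta_0, \omega, \tau_{0z}, and \tau_{1z}, the condition index z, and the residual \epsilon_i are labels introduced here for illustration, and any baseline covariates in the actual submodel are left out.

```latex
% Assumed sketch of the outcomes submodel; coefficient names are illustrative
Y_i = \beta_0 + \omega\,\alpha_i
      + \sum_{z} \left( \tau_{0z} + \tau_{1z}\,\alpha_i \right) Z_{zi}
      + \nu_t + \nu_s + \epsilon_i

% Implied effect of condition z for a student with propensity \alpha
\tau_z(\alpha) = \tau_{0z} + \tau_{1z}\,\alpha
```

Under this form, \tau_{0z} is the effect of condition z at \alpha = 0 (an average propensity to game the system) and \tau_{1z} is the interaction with that propensity.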
First, for students with an average propensity to game the system ([alpha] = 0), the effect of the delayed feedback condition was likely negative. The interaction between students' propensity to game the system and the delayed condition was likely positive. The estimated effect of FH2T on students with an average propensity to game the system ([alpha] = 0) was small, and we have low confidence it is greater than zero. Yet, the interaction between FH2T and propensity to game the system was likely negative. The pattern differed for DragonBox, which had a likely positive main effect but showed little evidence of an interaction with students' propensity to game the system.
Conclusion: Overall, this study provides evidence that program effects may vary based on students' engagement tendencies and that this heterogeneity can be estimated using FLPS. Furthermore, we present a potential method for incorporating outputs from detection models into causal analyses, which will become increasingly important with the proliferation of predictive AI in education.
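The Delayed Condition pattern reported in the Results (a likely negative effect at [alpha] = 0 together with a likely positive interaction) implies that the estimated effect changes sign at some higher propensity to game the system. A minimal sketch of that reasoning follows, assuming the effect is linear in [alpha], that is, effect = tau0 + tau1 * alpha. The function names and the two example coefficients are hypothetical and chosen only to show the arithmetic; they are not estimates from Table 1.

```python
from typing import Optional


def conditional_effect(tau0: float, tau1: float, alpha: float) -> float:
    """Effect of a condition at propensity `alpha`, assuming linear moderation.

    tau0 is the effect at alpha = 0 (average propensity) and tau1 is the
    interaction between the condition and the propensity to game the system.
    """
    return tau0 + tau1 * alpha


def crossover_alpha(tau0: float, tau1: float) -> Optional[float]:
    """Propensity at which the effect changes sign, or None without an interaction."""
    if tau1 == 0:
        return None
    return -tau0 / tau1


# Hypothetical coefficients (NOT from the study): negative main effect,
# positive interaction, mirroring only the signs described in the Results.
TAU0_EXAMPLE = -0.2
TAU1_EXAMPLE = 0.1

if __name__ == "__main__":
    for alpha in (-2.0, 0.0, 2.0, 4.0):
        effect = conditional_effect(TAU0_EXAMPLE, TAU1_EXAMPLE, alpha)
        print(f"alpha={alpha:+.1f}  effect={effect:+.2f}")
    print("sign change at alpha =", crossover_alpha(TAU0_EXAMPLE, TAU1_EXAMPLE))
```

With signs like these, students below the crossover propensity are estimated to do worse under the condition and students above it to do better. That is one way to read the statement that gamers may benefit from delayed hints and feedback.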