ERIC Number: ED658656
Record Type: Non-Journal
Publication Date: 2022-Sep-22
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Redefining Populations of Inference for Generalizations from Small Studies
Wendy Chan; Jimin Oh; Katherine Wilson
Society for Research on Educational Effectiveness
Background: Over the past decade, research on the development and assessment of tools to improve the generalizability of experimental findings has grown extensively (Tipton & Olsen, 2018). However, many experimental studies in education are based on small samples, which may include 30-70 schools, while the inference populations to which generalizations are made can be at least ten times larger (Tipton et al., 2017). This is particularly the case when generalizability is not planned for at the design stage. For instance, principal investigators are likely to choose a broadly defined population (e.g., all schools in a state) in a post hoc generalization analysis (Tipton et al., 2021). The small sample-to-population size ratio affects both the bias and precision of estimates of treatment effects. Bias is affected because the small sample may include schools that differ (on observed covariates) from schools in the population. Precision is affected because the limited sample size is associated with larger standard errors (Chan, 2018). These concerns raise the question of whether existing statistical methods can still produce useful inferences when generalizations are made from small studies. Focus of Study: To address the small sample limitation, one suggestion is to redefine the inference population and identify a subset of schools that would facilitate improved estimation (bias reduction and improved precision). In this study, we describe two of the many possible approaches to redefining an inference population. The first, the quantitatively optimal approach, relies primarily on propensity scores. Propensity scores, in the generalizability context, estimate the probability of selection into the sample using a set of observable covariates (Stuart et al., 2011; Tipton, 2013; O'Muircheartaigh & Hedges, 2014). These methods are quantitatively optimal in the sense that the resulting subpopulations minimize the variance and bias of the parameter estimates.
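As an illustrative sketch (not part of the original study), selection propensity scores of this kind can be estimated by modeling sample membership as a function of school-level covariates via logistic regression. The covariates and sample sizes below are simulated stand-ins for the real Indiana school data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated covariates (e.g., enrollment, % FRPL) for 56 trial schools
# and a much larger inference population; the location shift mimics
# volunteer schools differing systematically from the population.
X_sample = rng.normal(loc=0.5, scale=1.0, size=(56, 2))
X_pop = rng.normal(loc=0.0, scale=1.0, size=(1800, 2))

X = np.vstack([X_sample, X_pop])
z = np.concatenate([np.ones(56), np.zeros(1800)])  # 1 = in the trial sample

# Selection propensity score: P(in sample | covariates)
model = LogisticRegression().fit(X, z)
ps_pop = model.predict_proba(X_pop)[:, 1]  # scores for population schools
```

These estimated scores are the raw material for both the trimming rules and the compositional-similarity diagnostics discussed later in the abstract.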
Alternatively, the policy-relevant approach identifies subgroups of schools that may be affected by policies stemming from evaluation studies. An example of a policy-relevant subpopulation would be high-poverty schools, where a significant number of students qualify for free and reduced-price lunch (FRPL). Importantly, unlike quantitatively optimal methods, policy-relevant subpopulations are often identified in an ad hoc manner; namely, the subpopulations are defined by the general concerns and interests of stakeholders, with (typically) little consideration of propensity scores or formal statistical approaches to constructing the subgroups. Given the differences between the quantitatively optimal and policy-relevant approaches, we address the following questions: (1) What are the implications for statistical inference when various approaches to redefining the population are used? and (2) When generalizing from small studies, how should researchers and practitioners decide on the appropriate redefinition method? Research Design: In 2006, the Indiana Department of Education and the Indiana State Board of Education supported a new assessment system to measure annual student growth and provide feedback to teachers; its effectiveness was evaluated through a cluster-randomized trial (CRT; Konstantopoulos et al., 2013). In 2009-2010, 56 K-8 schools volunteered to implement the system: 34 were randomly assigned to use the assessment system and 22 served as control schools. The effectiveness of the assessment system was measured using Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) scores in English Language Arts and mathematics. The Indiana CRT has been used in prior generalization research, where the inference population consisted of all K-8 schools in Indiana (Tipton et al., 2017; Chan, 2018). Here, we examined eight redefinitions of the Indiana CRT population.
Five were considered policy-relevant: (i) urban schools, (ii) suburban schools, (iii) rural schools, (iv) high-poverty schools, in which over 75% of students qualify for FRPL, and (v) low-achievement schools, whose average ISTEP+ scores in math fell in the lowest quartile. These five redefined populations are not exhaustive, but they represent the types of subpopulations that may implement the intervention or be affected by policies stemming from the CRT results. We also considered three quantitatively optimal subpopulations (Crump, PS Min/Max, and Covariates). The first two correspond to propensity score cutoff methods, in which all population schools whose estimated propensity scores lie outside a given range are excluded (Crump et al., 2009). The third is a trimming approach based on the covariate distributions, in which all population schools that did not share common covariate overlap with the sample were excluded (Chan, 2022). Results and Conclusion: We applied and compared five estimators of the population average treatment effect (PATE) for the original and redefined populations. Table 1 provides a description of each estimator. Tables 2 and 3 provide the values of each estimator for the original and redefined populations, respectively. Comparing Tables 2 and 3, the standard errors of all estimators decreased when the population was redefined using the propensity score cutoff of Crump et al. (2009). Precision also improved in the two other quantitatively optimal subpopulations, PS Min/Max and Covariates, which were based on a propensity score cutoff and on overlap in the covariate distributions, respectively. Among the policy-relevant subpopulations, the standard errors of the estimates decreased when inferences were made to Rural schools, but the improvement was inconsistent across estimators.
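The propensity score cutoff idea can be sketched with a fixed-threshold trimming rule. This is a hedged simplification: Crump et al. (2009) also derive a data-driven optimal cutoff, omitted here, and the threshold value below is illustrative rather than taken from the study:

```python
import numpy as np

def crump_trim(ps, alpha=0.1):
    """Keep units whose estimated propensity scores fall in
    [alpha, 1 - alpha]; units in the tails, where the sample
    and population barely overlap, are excluded."""
    return (ps >= alpha) & (ps <= 1 - alpha)

# Toy scores: the extreme units (0.02 and 0.97) are trimmed,
# leaving a subpopulation with better sample overlap.
ps = np.array([0.02, 0.15, 0.50, 0.85, 0.97])
keep = crump_trim(ps)
trimmed = ps[keep]
```

The Covariates approach works analogously but trims on overlap in each covariate's observed range rather than on the estimated propensity score.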
This suggests that when the goal of redefinition is improved precision, both quantitatively optimal and policy-relevant approaches may be useful. In addition to precision, we considered the impact of redefinition on bias reduction. Table 4 provides values of the B-index, which quantifies the similarity of the propensity score distributions of the sample and the population (and subpopulation) schools (Tipton, 2014). Interestingly, the subpopulations based on the quantitatively optimal approaches were more compositionally similar to the sample than the original population was. Among the policy-relevant subgroups, compositional similarity was highest for the Rural and Suburban schools, but not as strong as for the quantitatively optimal subpopulations. Thus, if the goal of redefinition is bias reduction, the quantitatively optimal approaches may be more appropriate. In sum, we argue that it is important to identify the goal(s) of redefinition, since the strengths of one approach may outweigh those of another. This is crucial when the small sample size of a study requires researchers to prioritize among factors such as bias reduction, precision, and interpretability of the subpopulation.
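As a rough numerical sketch, an index of this kind can be computed as a Bhattacharyya-style overlap coefficient between kernel density estimates of the two propensity score distributions, with values near 1 indicating high compositional similarity. This formulation and the simulated scores below are illustrative assumptions, not the exact computation or data from Table 4:

```python
import numpy as np
from scipy.stats import gaussian_kde

def b_index(ps_sample, ps_pop, grid_size=1000):
    """Overlap of two propensity score distributions on [0, 1]:
    the integral of sqrt(f * g), where f and g are kernel density
    estimates; 1 = identical distributions, near 0 = no overlap."""
    grid = np.linspace(0.0, 1.0, grid_size)
    f = gaussian_kde(ps_sample)(grid)
    g = gaussian_kde(ps_pop)(grid)
    dx = grid[1] - grid[0]
    return float(np.sum(np.sqrt(f * g)) * dx)

rng = np.random.default_rng(1)
# Same underlying distribution -> high similarity;
# well-separated distributions -> low similarity.
similar = b_index(rng.beta(2, 5, 200), rng.beta(2, 5, 2000))
different = b_index(rng.beta(8, 2, 200), rng.beta(2, 8, 2000))
```

Under this framing, trimming the population toward the region of propensity score overlap mechanically raises the index, which is consistent with the quantitatively optimal subpopulations appearing most similar to the sample.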
Descriptors: Educational Research, Research Problems, Sample Size, Research Methodology, Statistical Bias, Statistical Inference, Generalization, Evaluation Methods, Evaluation Research
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A