ERIC Number: ED656860
Record Type: Non-Journal
Publication Date: 2021-Sep-27
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Average Differences in Effect Sizes by Outcome Measure Type
Betsy Wolf
Society for Research on Educational Effectiveness
The What Works Clearinghouse (WWC) seeks to provide practitioners with information about "what works in education." One challenge for practitioners in understanding "what works" is that effect sizes--the degree to which an intervention produces positive (or negative) outcomes--are not comparable across different interventions, in large part due to differences in study characteristics (Wilson & Lipsey, 2001). One study characteristic that has been shown to relate significantly to effect sizes is the type of outcome measure used in the study. Researchers have consistently identified larger average effect sizes when outcome measures were created either by study authors or by researchers involved in developing the intervention than when outcome measures were standardized or created by third parties (Cheung & Slavin, 2016; de Boer et al., 2014; Li & Ma, 2010; Lipsey et al., 2012; Lynch et al., 2019; Pellegrini et al., 2019; SWAT Measurement Small Group, 2020; Wilson & Lipsey, 2001). The WWC study database contains several different types of outcome measures, and one question that has not yet been thoroughly addressed is to what extent effect sizes in the WWC systematically differ by outcome measure type. The purpose of this paper is to examine to what extent effect sizes in WWC study data systematically vary by outcome measure type, with a particular focus on researcher and developer measures. This paper addresses the primary research question: To what extent do effect sizes systematically vary by outcome measure type, controlling for other factors?

For the purpose of this paper, the outcome measure types are defined by the following mutually exclusive categories: (1) Broad: measures intended to capture student achievement in a content area, schoolwide climate, or general educational outcomes, including state and district assessments, national surveys and assessments, grade point average, graduation rates, and school disciplinary data; (2) Narrow: measures intended to capture student achievement at a more granular level than a content area, or specific student behaviors, including commercial assessments, measures developed by researchers not involved in the study, and outcomes associated with a specific class (credits, grades, etc.); (3) Developer: measures that were developed for a particular intervention and are typically used only when the intervention is also being implemented; and (4) Researcher: measures developed by study authors, including measures created by selecting specific items from preexisting scales.

This paper uses WWC study data, analyzing 1,553 findings from 373 studies that meet WWC standards across the literacy, STEM, and behavior topic areas, to explore differences in the magnitude and statistical significance of effect sizes by outcome measure type. Multivariate meta-analysis with robust variance estimation is used to account for the dependency of multiple findings within the same study and to identify statistically significant differences in effect sizes by outcome measure type, controlling for the following covariates: (1) outcome domain; (2) grade-level band; (3) program type; (4) program delivery method; (5) study design; (6) WWC study rating; (7) WWC Handbook version (2.1 or higher); and (8) purpose of the study review. A sketch of this kind of model appears below.
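The following is a minimal sketch of the kind of multivariate meta-regression described above, assuming a hypothetical tabular extract of the WWC study data (column names such as effect_size, se, study_id, and measure_type are illustrative, not the actual WWC schema). Cluster-robust standard errors by study stand in for the paper's robust variance estimation; the paper's exact estimator and software may differ.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical extract of WWC findings: one row per finding, with its
# standardized effect size, standard error, parent study, and covariates.
findings = pd.read_csv("wwc_findings.csv")

# Standard inverse-variance weights for meta-regression.
findings["weight"] = 1.0 / findings["se"] ** 2

# Outcome measure type enters as a categorical predictor with broad measures
# as the reference category; the remaining terms mirror the covariates listed
# above. For the within-study comparison, C(study_id) fixed effects would
# replace the study-level covariates.
model = smf.wls(
    "effect_size ~ C(measure_type, Treatment(reference='broad'))"
    " + C(outcome_domain) + C(grade_band) + C(program_type)"
    " + C(delivery_method) + C(study_design) + C(wwc_rating)"
    " + C(handbook_version) + C(review_purpose)",
    data=findings,
    weights=findings["weight"],
)

# Clustering on study approximates robust variance estimation for dependent
# findings (a CR0/CR1-type sandwich estimator; the paper may use a
# small-sample-corrected variant such as CR2).
result = model.fit(cov_type="cluster", cov_kwds={"groups": findings["study_id"]})
print(result.summary())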
When examining differences in effect sizes by outcome measure type "within" studies (by including study fixed effects), effect sizes from researcher measures were larger by an average of +0.24 (in standardized units) relative to broad measures and by an average of +0.15 relative to narrow measures. Effect sizes from developer measures were larger by an average of +0.32 relative to broad measures and by an average of +0.23 relative to narrow measures. Put another way, researcher and developer measures showed average effect sizes about 1.75 to 2 times larger than effect sizes from broad measures, and about 1.4 to 1.6 times larger than effect sizes from narrow measures, within the same study and outcome domain. There was no statistically significant difference in average effect sizes for researcher versus developer measures, nor for broad versus narrow measures. This finding implies that effect sizes may not systematically vary across narrow versus broad measures once study quality, implementation fidelity, and other study characteristics are held constant.

Researcher and developer measures may be useful for validating the effectiveness of an intervention in a pilot study or efficacy trial. Yet practitioners and policymakers, who are held accountable for student progress on independent measures, may not find this evidence sufficient to inform their decisions. Perhaps there is a mismatch between the evidence needed by researchers or developers to validate an intervention and the evidence needed by practitioners and policymakers to select interventions to implement at scale in their settings. One open question is whether positive and statistically significant findings on researcher or developer measures translate into something meaningful for practitioners. Descriptive findings suggest that in 32% of studies, positive and statistically significant effects were identified on a researcher or developer measure, yet null effects were identified on all independent (broad or narrow) measures in the same study and outcome domain. The best-case scenario is that statistical significance on a researcher or developer measure signals that students have learned concepts and skills along the way toward mastering required academic content. Another scenario is that statistical significance on a researcher or developer measure has no bearing on how well students will perform on a formative or summative assessment in the same content area. The results of this paper thus raise the question of whether researcher and developer measures lead to inaccurate and misleading conclusions about the effectiveness of educational interventions.

There are several existing statistical approaches that could be used to account for differences in outcome measure types. Approaches such as meta-regression or Bayesian modeling could be used to adjust both the statistical significance and the magnitude of effect sizes, accounting for the larger average effect sizes found when researcher or developer measures are used (one such adjustment is sketched below). Given that outcome measure type is by far the most predictive variable explaining the magnitude of effect sizes in studies reviewed by the WWC in some topic areas, researchers should use the tools available to them to help practitioners and policymakers make sense of the evidence and understand which educational interventions might work best in their contexts.
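As one illustration of the adjustment idea, the sketch below recenters reported effect sizes on the broad-measure scale using this paper's within-study estimates (+0.24 for researcher measures, +0.32 for developer measures). Treating those averages as fixed offsets is an assumption made for exposition only; a real adjustment would propagate the uncertainty in the estimates, for example via meta-regression or a Bayesian model.

# Within-study average differences relative to broad measures, as reported
# above. Using them as fixed point-estimate offsets is illustrative only.
MEASURE_TYPE_OFFSET = {
    "broad": 0.0,
    "narrow": 0.0,       # no significant broad vs. narrow difference was found
    "researcher": 0.24,
    "developer": 0.32,
}

def adjusted_effect_size(effect_size: float, measure_type: str) -> float:
    """Recenter a reported effect size on the broad-measure scale."""
    return effect_size - MEASURE_TYPE_OFFSET[measure_type]

# Example: a +0.40 effect observed on a developer measure corresponds to
# roughly +0.08 on the scale of broad, independent measures.
print(round(adjusted_effect_size(0.40, "developer"), 2))  # 0.08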
Descriptors: Effect Size, Outcome Measures, Intervention, Educational Research, Statistical Significance, Bayesian Statistics, Regression (Statistics)
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A