ERIC Number: ED649730
Record Type: Non-Journal
Publication Date: 2016
Pages: 269
Abstractor: As Provided
ISBN: 979-8-3575-1067-9
ISSN: N/A
EISSN: N/A
Available Date: N/A
Multiple Views on Multiple Metrics: How Teachers Perceive the Validity and Utility of Metrics in a Multi-Measure Evaluation System
Lindsay Brown
ProQuest LLC, Ph.D. Dissertation, Stanford University
Over the past five years, policymakers have shown a renewed interest in using teacher evaluation as a lever for increasing the quality of the teacher labor market. States have been spurred by federal policies such as Race to the Top and by influential foundations such as the Bill & Melinda Gates Foundation to adopt a more reliable, more nuanced method of evaluation: multiple metrics. Rather than relying solely on test scores or observation checklists, these new composite metrics were touted as more stable estimates for both human resource decisions and for targeting professional development. This study focuses mainly on the latter of the two objectives. The "Measures of Effective Teaching" study claimed that "multiple measures can provide teachers with rich, contextualized information on their practice for use in professional development" (Kane & Staiger, 2012). Others reported that the new metrics could "create opportunities for teachers to learn from their colleagues" (Goe, 2012). Yet we know little about teachers' perspectives on these new systems or whether they find such metrics to be useful. These multiple measures vary across districts and states, but they are largely a combination of growth-based student test metrics, standards-based observation metrics, and surveys of stakeholders such as students or teachers. This study focuses on a state that uses the first two metrics, student growth and teacher observation, to ask whether such metrics provide information that teachers can use to inform their instruction. This mixed-methods study takes place in Delaware, which was the first winner of President Obama's federal education initiative, Race to the Top. As such, it won $100 million to invest in reforming aspects of its educational system, teacher evaluation among them. Delaware overhauled its existing evaluation system, the Delaware Performance Appraisal System (DPAS), and unveiled DPAS-II in 2012. The system uses a multi-step, structured observation process based on Charlotte Danielson's "Framework for Teaching" that includes conferences before and after teacher observations. Student growth is measured via two tests, one mandated by the district and another chosen at the school level. The data utilized for this study include a survey of all teachers in the state of Delaware, administrative data from the Delaware Department of Education, and interviews with 32 Math and English/Language Arts teachers. The survey had a 60 percent response rate across grades and subjects. The interviewed district was purposively sampled for maximum heterogeneity of school performance on state standardized examinations; participants were sampled by content area (Math/ELA) and by school performance level (high/low). Interviews were coded inductively and deductively, using the pre-existing framework of "The Practicality Ethic" (Doyle & Ponder, 1977) as well as themes that emerged from interviews during iterative development of the codebook. Qualitative and quantitative data were analyzed separately. All interviews were transcribed and validated, and descriptive and thematic codes were applied using Dedoose. Coded excerpts were retrieved in different combinations based on sampling patterns and a priori hypotheses; analytic memos detailed both supportive and disconfirming evidence. Quantitative analyses were conducted using Stata v.13. Primary analyses included descriptions of survey responses, an exploratory factor analysis, and regression analyses.
Factors resulting from the exploratory factor analysis were used as dependent variables in subsequent regression analyses that explored predictors of teacher responses to the constructs underlying each factor. The findings of this study are organized into three chapters. The first chapter investigates teachers' perceptions of the validity and utility of the information they receive as part of their test-based student growth metrics. In general, teachers do not find test-based metrics to be accurate or valid methods of capturing teacher performance. Major themes that emerged regarding the lack of validity included teachers' ability to manipulate scores on the exams, the subjectivity of individualized growth goals for certain metrics, and error in test scores that correlated with student demographics. Major themes that emerged regarding teachers' ability to use test-based data included the specificity of the data (more detail was generally regarded as better), the frequency of test administration (more frequent was preferred), and the timeliness with which teachers received the data (sooner was better). The second findings chapter investigates teachers' perceptions of the validity and utility of the information they receive as part of their observation-based metrics. On average, teachers find observation metrics to be more valid than test-based metrics; most teachers also reported that the "Framework for Teaching" aligned with their vision of good instruction. Furthermore, teachers liked that the observations were grounded in a rubric that demanded low-inference data from classrooms. However, teachers reported validity concerns about the representativeness of announced observations, as well as about how student composition affected teacher ratings on the observation rubric. Despite their generally positive view of observations, teachers reported that the feedback they received from principals was largely unhelpful; rather, teachers seemed to derive utility from the observation process itself, including time for detailed planning and reflection, as well as from the time set aside to engage in meaningful dialogue with administrators. The third findings chapter uses the survey and administrative data to investigate systematic variation in teacher perceptions by teacher and school characteristics. Regression analyses show that novice teachers were more likely to report that the conferences in the DPAS process were useful and that the evaluation was related to their practice, both between and within schools. In addition to novice status, the number of observations conducted by an administrator was positively predictive of teachers' perceptions of evaluators, of the utility of the conferences, and of the utility of the evaluation process for practice. Teachers in high-needs schools were also more likely to rate the evaluation process and its conferences as useful; this may be the result of such schools participating in a special program, the "Delaware Talent Co-operative," meant to support struggling schools. Last, the average years of administrator experience in a school was positively predictive of teachers' ratings of their evaluator as well as of their trust in observation scores. The final chapter offers a discussion of the implications of this work for policy, practitioners, and research.
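To make the quantitative pipeline concrete, the sketch below illustrates, in Python rather than the Stata v.13 actually used in the study, the general pattern described above: extracting factor scores from an exploratory factor analysis of survey items and then regressing a factor score on teacher characteristics. The variable names (item1-item8, novice, obs_count) and the synthetic data are illustrative assumptions only, and the dissertation's actual models (including any between- and within-school specifications) may differ.

# A minimal, hypothetical sketch (not the author's Stata code) of the analytic
# pattern: run an exploratory factor analysis on survey items, then use the
# resulting factor scores as dependent variables in a regression on
# illustrative teacher-level predictors.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500  # synthetic "teachers"

# Synthetic Likert-style survey items (stand-ins for the real survey data)
items = pd.DataFrame(
    rng.integers(1, 6, size=(n, 8)),
    columns=[f"item{i}" for i in range(1, 9)],
)

# Step 1: exploratory factor analysis; assume two latent constructs
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(items)  # per-teacher factor scores
loadings = pd.DataFrame(fa.components_.T, index=items.columns,
                        columns=["factor1", "factor2"])
print(loadings.round(2))

# Step 2: regress one factor score on illustrative predictors
# (novice status, number of observations by an administrator).
predictors = pd.DataFrame({
    "novice": rng.integers(0, 2, size=n),
    "obs_count": rng.integers(1, 5, size=n),
})
X = sm.add_constant(predictors)
model = sm.OLS(scores[:, 0], X).fit()
print(model.summary())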
Taken as a whole, this study demonstrates that teachers may be receiving limited amounts of additional evidence from multiple-metric evaluation systems, and serious validity concerns often impede the use of even this information. If policymakers indeed want teachers to use evaluation information formatively, they must invest in systems and human resources that afford teachers the ability to do so. If the system is instead meant to monitor teachers, policymakers should temper their insistence that such metrics are useful for the improvement of practice. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
Descriptors: Teacher Evaluation, Teacher Effectiveness, Evaluation Methods, Faculty Development, Classroom Observation Techniques, Teacher Attitudes, Validity, Educational Policy, Teacher Improvement
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Identifiers - Location: Delaware
Grant or Contract Numbers: N/A
Author Affiliations: N/A