Publication Date
  In 2025: 1
  Since 2024: 3
  Since 2021 (last 5 years): 13
  Since 2016 (last 10 years): 25
  Since 2006 (last 20 years): 290
Descriptor
  Interrater Reliability: 515
  Evaluation Methods: 133
  Test Reliability: 81
  Foreign Countries: 77
  Scoring: 72
  Test Validity: 72
  Correlation: 61
  Rating Scales: 60
  Measures (Individuals): 59
  Evaluators: 56
  Psychometrics: 56
Author
  Lunz, Mary E.: 6
  Baer, John: 3
  Baker, Eva L.: 3
  Coniam, David: 3
  Engelhard, George, Jr.: 3
  Epstein, Michael H.: 3
  Greatorex, Jackie: 3
  Jaeger, Richard M.: 3
  Kaufman, James C.: 3
  Knoch, Ute: 3
  Linn, Robert L.: 3
Audience
  Researchers: 12
  Practitioners: 10
  Teachers: 6
  Administrators: 5
Location
  United Kingdom: 12
  Australia: 11
  California: 6
  Taiwan: 6
  United Kingdom (England): 6
  Canada: 5
  Florida: 5
  Netherlands: 5
  Sweden: 5
  Pennsylvania: 4
  United States: 4
Laws, Policies, & Programs
  No Child Left Behind Act 2001: 3
  Individuals with Disabilities…: 2
  Race to the Top: 2
  Americans with Disabilities…: 1
  Improving America's Schools…: 1
  Rehabilitation Act 1973…: 1
What Works Clearinghouse Rating
  Meets WWC Standards without Reservations: 1
  Meets WWC Standards with or without Reservations: 1
Pearson, Terry – FORUM: for promoting 3-19 comprehensive education, 2023
Ofsted has frequently defended the judgements made during inspections by claiming that inspection ratings are reliable, as shown by the results from the collection of studies the inspectorate has conducted. I outline the inspectorate's view of reliability and problematise the studies that it has carried out, noting that these provide insufficient…
Descriptors: Inspection, Interrater Reliability, Decision Making, Value Judgment
Tavares, Walter; Kinnear, Benjamin; Schumacher, Daniel J.; Forte, Milena – Advances in Health Sciences Education, 2023
In this perspective, the authors critically examine "rater training" as it has been conceptualized and used in medical education. By "rater training," they mean the educational events intended to "improve" rater performance and contributions during assessment events. Historically, rater training programs have focused…
Descriptors: Medical Education, Interrater Reliability, Evaluation Methods, Training
Bonett, Douglas G. – Journal of Educational and Behavioral Statistics, 2022
The limitations of Cohen's κ are reviewed and an alternative G-index is recommended for assessing nominal-scale agreement. Maximum likelihood estimates, standard errors, and confidence intervals for a two-rater G-index are derived for one-group and two-group designs. A new G-index of agreement for multirater designs is proposed. Statistical…
Descriptors: Statistical Inference, Statistical Data, Interrater Reliability, Design
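As a rough illustration of the indices this abstract contrasts (a sketch over invented ratings, not code or data from the article): Cohen's κ estimates chance agreement from each rater's marginal proportions, while the G-index assumes uniform chance agreement of 1/k, giving G = (p_o − 1/k)/(1 − 1/k) for observed agreement p_o over k nominal categories.

```python
from collections import Counter

def agreement_indices(r1, r2, k):
    """Observed agreement, Cohen's kappa, and the G-index for two
    raters assigning each of n subjects to one of k nominal categories."""
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Kappa's chance agreement comes from the raters' marginal proportions
    p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    kappa = (p_o - p_e) / (1 - p_e)
    # The G-index instead assumes uniform chance agreement of 1/k
    g = (p_o - 1 / k) / (1 - 1 / k)
    return p_o, kappa, g

# Invented example: 8 subjects, 3 categories, two raters
p_o, kappa, g = agreement_indices([0, 0, 1, 1, 2, 2, 0, 1],
                                  [0, 0, 1, 2, 2, 2, 0, 0], k=3)
# p_o = 0.75, kappa ~ 0.628, g = 0.625
```

The two indices diverge when the raters' marginals are uneven, which is one driver of the critiques of κ the article reviews.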
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring constructed response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
Burkhardt, Amy; Lottridge, Susan; Woolf, Sherri – Educational Measurement: Issues and Practice, 2021
For some students, standardized tests serve as a conduit to disclose sensitive issues of harm or distress that may otherwise go unreported. By detecting this writing, known as "crisis papers," testing programs have a unique opportunity to assist in mitigating the risk of harm to these students. The use of machine learning to…
Descriptors: Scoring Rubrics, Identification, At Risk Students, Standardized Tests
Constructing a Roadmap to Measure the Quality of Business Assessments Aimed at Curriculum Management
Silva, Thanuci; Santos, Regiane dos; Mallet, Débora – Journal of Education for Business, 2023
Assuring the quality of education is a concern of learning institutions. To do so, it is necessary to have assertive learning management, with consistent data on students' outcomes. This research provides associate deans and researchers, a roadmap with which to gather evidence to improve the quality of open-ended assessments. Based on statistical…
Descriptors: Student Evaluation, Evaluation Methods, Business Education, Higher Education
Ole J. Kemi – Advances in Physiology Education, 2025
Students are assessed by coursework and/or exams, all of which are marked by assessors (markers). Student and marker performances are then subject to end-of-session board of examiner handling and analysis. This occurs annually and is the basis for evaluating students but also the wider learning and teaching efficiency of an academic institution.…
Descriptors: Undergraduate Students, Evaluation Methods, Evaluation Criteria, Academic Standards
Thorne, Casey Lee – Journal of Dance Education, 2022
The research outlined in this article offers a systematic training methodology for students and licensed Traditional Chinese Medicine (TCM) practitioners to learn the clinical art and science of pulsology through dance. One of the greatest hurdles in learning pulse palpation is a TCM practitioner's inability to feel the pulse with a degree of…
Descriptors: Dance Education, Metabolism, Medicine, Asian Culture
Doewes, Afrizal; Kurdhi, Nughthoh Arfawi; Saxena, Akrati – International Educational Data Mining Society, 2023
Automated Essay Scoring (AES) tools aim to improve the efficiency and consistency of essay scoring by using machine learning algorithms. In the existing research work on this topic, most researchers agree that human-automated score agreement remains the benchmark for assessing the accuracy of machine-generated scores. To measure the performance of…
Descriptors: Essays, Writing Evaluation, Evaluators, Accuracy
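Human–machine agreement in AES work is commonly benchmarked with quadratic weighted kappa; a minimal sketch follows (invented scores, not the paper's data or implementation).

```python
def quadratic_weighted_kappa(human, machine, k):
    """Quadratic weighted kappa between human and machine scores
    on an ordinal 0..k-1 scale, a common AES agreement benchmark."""
    n = len(human)
    # Observed counts of (human, machine) score pairs
    obs = [[0] * k for _ in range(k)]
    for h, m in zip(human, machine):
        obs[h][m] += 1
    hist_h = [human.count(i) for i in range(k)]
    hist_m = [machine.count(i) for i in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * hist_h[i] * hist_m[j] / n  # expected under independence
    return 1 - num / den

# Invented scores on a 0-3 scale; identical vectors give perfect agreement
qwk = quadratic_weighted_kappa([0, 1, 2, 3, 2, 1], [0, 1, 2, 3, 2, 1], k=4)
# qwk == 1.0
```

Unlike exact-match rates, the quadratic weights penalize large human–machine score gaps more heavily than near misses.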
McCarthy, Kathryn S.; Magliano, Joseph P.; Snyder, Jacob O.; Kenney, Elizabeth A.; Newton, Natalie N.; Perret, Cecile A.; Knezevic, Melanie; Allen, Laura K.; McNamara, Danielle S. – Grantee Submission, 2021
The objective in the current paper is to examine the processes of how our research team negotiated meaning using an iterative design approach as we established, developed, and refined a rubric to capture comprehension processes and strategies evident in students' verbal protocols. The overarching project comprises multiple data sets, multiple…
Descriptors: Scoring Rubrics, Interrater Reliability, Design, Learning Processes
Bimpeh, Yaw; Pointer, William; Smith, Ben Alexander; Harrison, Liz – Applied Measurement in Education, 2020
Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we…
Descriptors: Scoring, Generalizability Theory, Interrater Reliability, Foreign Countries
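A single-facet persons-by-raters G-study of the kind the generalizability-theory literature describes can be sketched as follows: estimate variance components from two-way ANOVA mean squares, then form a relative G coefficient. This is an illustrative sketch with invented data, not the authors' method.

```python
def g_study(scores):
    """Single-facet (persons x raters) G-study for a fully crossed design.
    scores: one row per person, one column per rater."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum((scores[p][r] - p_means[p] - r_means[r] + grand) ** 2
                 for p in range(n_p) for r in range(n_r))
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    # Variance components (negative estimates truncated to zero)
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    # Relative G coefficient for the observed number of raters
    g_rel = var_p / (var_p + var_res / n_r) if var_p + var_res > 0 else 0.0
    return var_p, var_r, var_res, g_rel

# Invented data: 3 examinees scored by 2 raters, purely additive effects
var_p, var_r, var_res, g_rel = g_study([[0, 1], [2, 3], [4, 5]])
# no person-by-rater residual, so the relative G coefficient is 1.0
```

Rater severity (var_r) inflates absolute error but not the relative coefficient, which is why relative and absolute G coefficients are reported separately in this literature.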
Marion Heron; Helen Donaghue; Kieran Balloo – Teaching in Higher Education, 2024
The aim of teaching observations and post observation feedback in higher education is to support teachers to reflect on and improve their teaching. Yet, our understanding of tutors' (observers') and teachers' (observees') capacities for capitalising on these feedback opportunities is limited and there is little empirically derived advice for…
Descriptors: Feedback (Response), Classroom Observation Techniques, Teacher Evaluation, Multiple Literacies
Jin, Kuan-Yu; Wang, Wen-Chung – Journal of Educational Measurement, 2018
The Rasch facets model was developed to account for facet data, such as student essays graded by raters, but it accounts for only one kind of rater effect (severity). In practice, raters may exhibit various tendencies such as using middle or extreme scores in their ratings, which is referred to as the rater centrality/extremity response style. To…
Descriptors: Scoring, Models, Interrater Reliability, Computation
Gitomer, Drew H.; Martínez, José Felipe; Battey, Dan; Hyland, Nora E. – American Educational Research Journal, 2021
The Educative Teacher Performance Assessment (edTPA) is a system of standardized portfolio assessments of teaching performance mandated for use by educator preparation programs in 18 states, and approved in 21 others, as part of initial certification for preservice teachers. Because of the high stakes involved for examinees, it is critical that…
Descriptors: Evaluation, Performance Based Assessment, Test Reliability, Test Validity
Knoch, Ute; Chapelle, Carol A. – Language Testing, 2018
Argument-based validation requires test developers and researchers to specify what is entailed in test interpretation and use. Doing so has been shown to yield advantages (Chapelle, Enright, & Jamieson, 2010), but it also requires an analysis of how the concerns of language testers can be conceptualized in the terms used to construct a…
Descriptors: Test Validity, Language Tests, Evaluation Research, Rating Scales