Publication Date
| In 2026 | 0 |
| Since 2025 | 56 |
| Since 2022 (last 5 years) | 282 |
| Since 2017 (last 10 years) | 778 |
| Since 2007 (last 20 years) | 2040 |
Descriptor
| Interrater Reliability | 3122 |
| Foreign Countries | 654 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 24 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Monica L. Coleman; Moira Ragan; Tahani Dari – Measurement and Evaluation in Counseling and Development, 2024
Intercoder reliability can increase trustworthiness, accuracy, rigor, collaboration, and power sharing in qualitative research. Though not every qualitative design can utilize intercoder reliability, this article highlights how positivist qualitative research, community-based participatory research, and participatory evaluation all strengthen when…
Descriptors: Interrater Reliability, Qualitative Research, Counseling, Research
Elizabeth J. Preas; Mary E. Halbur; Regina A. Carroll – Analysis of Verbal Behavior, 2024
Procedural fidelity refers to the degree to which procedures for an assessment or intervention (i.e., independent variables) are implemented consistent with the prescribed protocols. Procedural fidelity is an important factor in demonstrating the internal validity of an experiment and clinical treatments. Previous reviews evaluating the inclusion…
Descriptors: Verbal Communication, Behavioral Science Research, Periodicals, Fidelity
McCluskey, Sydne – ProQuest LLC, 2023
Rater comparison analysis is commonly necessary in the social sciences. Conventional approaches to the problem generally focus on calculation of agreement statistics, which provide useful but incomplete information about rater agreement. Importantly, one-number agreement statistics give no indication regarding the nature of disagreements, nor do…
Descriptors: Bayesian Statistics, Structural Equation Models, Interrater Reliability, Beliefs
Feldberg, Zachary R. – ProQuest LLC, 2023
Cognitive diagnostic models (CDMs) provide pedagogically relevant information in the form of a student profile of multiple binary categorizations of students into mastery or nonmastery statuses on latent traits called attributes. Federal educational accountability requires accountability measures to designate students into one of at least three…
Descriptors: Accountability, Standards, Cutting Scores, Models
Tavares, Walter; Kinnear, Benjamin; Schumacher, Daniel J.; Forte, Milena – Advances in Health Sciences Education, 2023
In this perspective, the authors critically examine "rater training" as it has been conceptualized and used in medical education. By "rater training," they mean the educational events intended to "improve" rater performance and contributions during assessment events. Historically, rater training programs have focused…
Descriptors: Medical Education, Interrater Reliability, Evaluation Methods, Training
Erik Voss – Language Testing, 2025
An increasing number of language testing companies are developing and deploying deep learning-based automated essay scoring systems (AES) to replace traditional approaches that rely on handcrafted feature extraction. However, there is hesitation to accept neural network approaches to automated essay scoring because the features are automatically…
Descriptors: Artificial Intelligence, Automation, Scoring, English (Second Language)
Yoonseo Kim – TESOL Quarterly: A Journal for Teachers of English to Speakers of Other Languages and of Standard English as a Second Dialect, 2025
This study explores the potential of OpenAI's ChatGPT-4 (gpt-4-0613) as an automated essay scoring (AES) tool in a trial involving 300 essays from an American university's academic English program placement test. Three prompting strategies (minimal/detailed rubric, require/not require rationale, and with/without scoring examples) were tested for…
Descriptors: Automation, Scoring, Artificial Intelligence, Placement Tests
Heather Hirst; Jennifer Campbell; Samantha Chamberlin; Ibukun Olagunju; Frank Bird; James K. Luiselli – Journal of Intellectual Disabilities, 2024
Frailty is a health concern for many adults with intellectual disability and should be measured to detect at-risk conditions, monitor disease, plan treatment, and gauge mortality. This descriptive pilot study evaluated measurement consistency (inter-rater agreement) of the Intellectual Disability-Frailty Index Short Form among multiple assessors…
Descriptors: Adults, Intellectual Disability, Physical Health, Aging (Individuals)
Angus Kittelman; Sara Izzard; Kent McIntosh; Kelsey R. Morris; Timothy J. Lewis – Assessment for Effective Intervention, 2024
The purpose of this study was to evaluate the psychometric properties of the Self-Assessment Survey (SAS) 4.0, an updated measure assessing implementation fidelity of positive behavioral interventions and supports (PBIS). A total of 627 school personnel from 33 schools in six U.S. states completed the SAS 4.0 during the 2021-2022 school year. We…
Descriptors: Positive Behavior Supports, Teachers, Self Evaluation (Individuals), Test Reliability
Abbas, Mohsin; van Rosmalen, Peter; Kalz, Marco – IEEE Transactions on Learning Technologies, 2023
For predicting and improving the quality of essays, text analytic metrics (surface, syntactic, morphological, and semantic features) can be used to provide formative feedback to the students in higher education. In this study, the goal was to identify a sufficient number of features that exhibit a fair proxy of the scores given by the human raters…
Descriptors: Feedback (Response), Automation, Essays, Scoring
Mazin T. Alqhazo; Tha’er Al-Kadi; Firas S. Alfwaress – Language, Speech, and Hearing Services in Schools, 2025
Purpose: The Stuttering Severity Instrument--Fourth Edition (SSI-4) is unavailable in the Arabic language. The purpose of the current research is to translate the SSI-4 (Riley, 2009) into Arabic and to discuss its validity, as well as its intrajudge and interjudge reliability. Method: Archived videos of 28 school-aged children who stutter ranged in…
Descriptors: Arabic, Translation, Test Validity, Test Reliability
Dawn Holford; Janet McLean; Alex O. Holcombe; Iratxe Puebla; Vera Kempe – Active Learning in Higher Education, 2025
Authentic assessment allows students to demonstrate knowledge and skills in real-world tasks. In research, peer review is one such task that researchers learn by doing, as they evaluate other researchers' work. This means peer review could serve as an authentic assessment that engages students' critical thinking skills in a process of active…
Descriptors: Undergraduate Students, Evaluation Methods, Peer Evaluation, Interrater Reliability
Mark White; Matt Ronfeldt – Educational Assessment, 2024
Standardized observation systems seek to reliably measure a specific conceptualization of teaching quality, managing rater error through mechanisms such as certification, calibration, validation, and double-scoring. These mechanisms both support high quality scoring and generate the empirical evidence used to support the scoring inference (i.e.,…
Descriptors: Interrater Reliability, Quality Control, Teacher Effectiveness, Error Patterns
Elayne P. Colón; Lori M. Dassa; Thomas M. Dana; Nathan P. Hanson – Action in Teacher Education, 2024
To meet accreditation expectations, teacher preparation programs must demonstrate their candidates are evaluated using summative assessment tools that yield sound, reliable, and valid data. These tools are primarily used by the clinical experience team -- university supervisors and mentor teachers. Institutional beliefs regarding best practices…
Descriptors: Student Teachers, Teacher Interns, Evaluation Methods, Interrater Reliability
Weingarden, Merav; Heyd-Metzuyanim, Einat – Journal of Mathematics Teacher Education, 2023
In this study, we examine "what went wrong" in our professional development program for encouraging cognitively demanding instruction, focusing on the difficulties we encountered in using an observational tool for evaluating this type of instruction and reaching inter-rater reliability. We do so through the lens of a discursive theory of…
Descriptors: Mathematics Instruction, Interrater Reliability, Cognitive Processes, Difficulty Level

