Hunn, Lorie L. – ProQuest LLC, 2009
This study explored and compared the ways in which school-based cooperating teachers and college supervisors evaluate student teachers. The scores allocated to student teachers by school-based cooperating teachers and college supervisors in the final field experience evaluations of student teachers were analyzed. A mixed methods research design…
Descriptors: Cooperating Teachers, Leadership, Research Design, Student Teachers
Peer reviewed: Newman, Jody L.; Fuqua, Dale R. – Counselor Education and Supervision, 1986
Examined the effects of order of stimulus presentation on observer ratings of counseling performance. Results revealed a statistically significant interaction between quality of performance and the order in which the performances were rated. (Author/ABB)
Descriptors: Counselor Evaluation, Counselor Performance, Interrater Reliability, Observation
Peer reviewed: Ansorge, Charles J.; Scheer, John K. – Research Quarterly for Exercise and Sport, 1988
Analysis of gymnastics judges' scores of their own and other countries' gymnasts' performance during the 1984 Olympic Games indicated that the judges were biased in favor of their own country's gymnasts. (Author/CB)
Descriptors: Bias, Competition, Gymnastics, International Relations
Peer reviewed: Kane, Robert L.; And Others – Journal of Consulting and Clinical Psychology, 1987
Three experienced neuropsychologists rated brain-damaged and control subjects for brain damage using the Halstead-Reitan Battery and the Luria-Nebraska Neuropsychological Battery. Using either battery, raters were accurate in judging the presence of brain damage. There was a high degree of consistency between raters and test batteries when both…
Descriptors: Interrater Reliability, Neurological Impairments, Psychological Testing, Psychometrics
Peer reviewed: Cicchetti, Domenic V.; And Others – Educational and Psychological Measurement, 1984
This program computes multiple judge reliability levels under the following conditions: (1) different sets of judges perform the ratings; (2) the number of judges is a constant; and (3) the scale of measurement is nominal. (Author)
Descriptors: Computer Software, Interrater Reliability, Judgment Analysis Technique, Test Reliability
Peer reviewed: Vance, B.; And Others – Psychology in the Schools, 1983
Investigated the interscorer reliability between a novice and a professional psychologist for the Minnesota Percepto-Diagnostic Test-Revised (MPDT-R), using a sample of 30 individuals. Results indicated that for three of the four MPDT-R scores there was a significant positive correlation between expert and novice scoring criteria. (JAC)
Descriptors: Experimenter Characteristics, Interrater Reliability, Psychological Evaluation, Psychologists
Randolph, Justus J. – Online Submission, 2005
Fleiss' popular multirater kappa is known to be influenced by prevalence and bias, which can lead to the paradox of high agreement but low kappa. It also assumes that raters are restricted in how they can distribute cases across categories, which is not a typical feature of many agreement studies. In this article, a free-marginal, multirater…
Descriptors: Multivariate Analysis, Statistical Distributions, Statistical Bias, Interrater Reliability
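The contrast described in the abstract above can be illustrated with a small calculation. The following is a minimal sketch, not the article's own code: it assumes the commonly cited forms of the two statistics, where both share the same observed agreement and differ only in the chance-agreement term (pooled category proportions for Fleiss' fixed-marginal kappa, 1/k for the free-marginal variant). The function name `agreement_stats` and the example data are hypothetical.

```python
from collections import Counter

def agreement_stats(ratings, categories):
    """Return (observed agreement, Fleiss' kappa, free-marginal kappa).

    ratings: one list of category labels per case; every case must be
    rated by the same number of raters (at least two).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    k = len(categories)
    counts = [Counter(case) for case in ratings]

    # Observed agreement: mean share of agreeing rater pairs per case.
    p_o = sum(
        sum(c[cat] * (c[cat] - 1) for cat in categories)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_cases

    # Fixed-marginal chance agreement uses the pooled category proportions.
    props = [sum(c[cat] for c in counts) / (n_cases * n_raters)
             for cat in categories]
    p_e_fixed = sum(p * p for p in props)

    # Free-marginal chance agreement assumes each category is equally likely.
    p_e_free = 1.0 / k

    # Note: Fleiss' kappa is undefined when every rating falls in one category
    # (p_e_fixed == 1); this sketch does not guard against that case.
    kappa_fixed = (p_o - p_e_fixed) / (1 - p_e_fixed)
    kappa_free = (p_o - p_e_free) / (1 - p_e_free)
    return p_o, kappa_fixed, kappa_free

# Hypothetical data: 3 raters, 10 cases, one very prevalent category.
example = [["yes", "yes", "yes"]] * 9 + [["yes", "yes", "no"]]
p_o, kappa_fixed, kappa_free = agreement_stats(example, ["yes", "no"])
print(round(p_o, 3), round(kappa_fixed, 3), round(kappa_free, 3))
```

With this skewed example, observed agreement is about 0.93, yet the fixed-marginal value falls slightly below zero while the free-marginal value stays high, which is the high-agreement, low-kappa pattern the abstract refers to.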
Peer reviewed: Bartfay, Emma – International Journal of Testing, 2003
Used Monte Carlo simulation to compare the properties of a goodness-of-fit (GOF) procedure and a test statistic developed by E. Bartfay and A. Donner (2001) to the likelihood ratio test in assessing the existence of extra variation. Results show the GOF procedure possesses satisfactory Type I error rate and power. (SLD)
Descriptors: Goodness of Fit, Interrater Reliability, Monte Carlo Methods, Simulation
Peer reviewed: VanLeeuwen, Dawn M. – Journal of Agricultural Education, 1997
Generalizability Theory can be used to assess reliability in the presence of multiple sources and different types of error. It provides a flexible alternative to Classical Theory and can handle estimation of interrater reliability with any number of raters. (SK)
Descriptors: Error of Measurement, Generalizability Theory, Interrater Reliability, Measurement Techniques
Peer reviewed: Horowitz, Leonard M.; And Others – Journal of Consulting and Clinical Psychology, 1989
Developed method for aggregating psychodynamic formulations of independent clinicians. Panels of clinicians observed videotaped interviews of patients and wrote individual formulations which were combined into consensual formulation. Other clinical raters read each consensual formulation and judged whether each problem was apt to be distressing…
Descriptors: Clinical Diagnosis, Interpersonal Relationship, Interrater Reliability, Psychological Evaluation
Peer reviewed: Tsui, Anne S.; Ohlott, Patricia – Personnel Psychology, 1988
To test model of general managerial effectiveness, superiors (N=271), subordinates (N=605), and peers (N=469) rated 344 managers. Study designed to test three specific hypotheses on criterion type and criterion weights found consensus in effectiveness models of superiors, subordinates, and peers. Consensus among different raters was high on both…
Descriptors: Administrator Effectiveness, Congruence (Psychology), Evaluation Problems, Interrater Reliability
Peer reviewed: Fabbris, Luigi; Gallo, Francesca – Educational and Psychological Measurement, 1993
New coefficients of agreement are suggested for the measure of intraclass consistency between observations on two variables. The coefficients are derived from a general coefficient for measuring intraclass dependence in a bivariate analysis context. Various coefficients for the univariate agreement analysis are shown to be cases of the suggested…
Descriptors: Correlation, Equations (Mathematics), Interrater Reliability, Judges
Peer reviewed: Corty, Eric; And Others – Journal of Consulting and Clinical Psychology, 1993
Examined interrater reliability of diagnoses made on basis of structured interview for psychiatric patients with and without psychoactive substance use disorders (PSUDs). Results from 47 pairs of ratings by 9 clinical interviewers revealed that interrater reliability for non-PSUD psychiatric diagnoses was quite high when patient had no diagnosable…
Descriptors: Clinical Diagnosis, Interrater Reliability, Patients, Psychiatric Hospitals
Peer reviewed: Kember, David; Jones, Alice; Loke, Alice; McKay, Jan; Sinclair, Kit; Tse, Harrison; Webb, Celia; Wong, Frances; Wong, Marian; Yeung, Ella – International Journal of Lifelong Education, 1999
A coding method for measuring reflective thinking in student journals was tested twice, demonstrating acceptable reliability among evaluators and supporting the precision of the guidelines for coding. Coding categories were as follows: habitual action, introspection, thoughtful action, content reflection, process reflection, content and process…
Descriptors: Adult Education, Coding, Evaluation Methods, Interrater Reliability
Peer reviewed: Berning, Lisa C.; Weed, Nathan C.; Aloia, Mark S. – Assessment, 1998
To examine the interrater reliability of the Ruff Figural Fluency Test (RFFT) (R. Ruff, 1988), 124 college students completed the measure and scored RFFT test protocols. Results indicated substantial interscorer reliability on the RFFT, particularly for number of unique designs. Reliability was lower for scoring perseverative errors and error…
Descriptors: College Students, Higher Education, Interrater Reliability, Scoring