Publication Date
| Date range | Results |
|---|---|
| In 2026 | 0 |
| Since 2025 | 58 |
| Since 2022 (last 5 years) | 284 |
| Since 2017 (last 10 years) | 780 |
| Since 2007 (last 20 years) | 2042 |
Descriptor
| Descriptor | Results |
|---|---|
| Interrater Reliability | 3124 |
| Foreign Countries | 655 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Audience | Results |
|---|---|
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Location | Results |
|---|---|
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 25 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Rating | Results |
|---|---|
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Peer reviewed: Weider-Hatfield, Deborah; Hatfield, John D. – Communication Quarterly, 1984
Evaluates approaches to measuring reliability in interaction analysis by (1) presenting criteria for a sound reliability estimate, (2) evaluating currently used tests against these criteria, and (3) discussing application of appropriate tests to interaction data. (PD)
Descriptors: Communication Research, Evaluation Criteria, Interaction Process Analysis, Interrater Reliability
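The abstract above refers to criteria for a sound reliability estimate and to "currently used tests" without listing them in this excerpt. Purely as an illustrative aside, not drawn from the article, the sketch below computes Cohen's kappa, one widely used chance-corrected agreement index, from hypothetical codes assigned by two observers; the category labels and data are invented.

```python
# Illustrative only: Cohen's kappa for two observers coding the same utterances.
# The data and category labels below are hypothetical, not from the article.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over nominal codes."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, based on each rater's marginal code frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical interaction codes assigned by two observers to ten utterances.
coder_1 = ["question", "answer", "answer", "question", "other",
           "question", "answer", "other", "question", "answer"]
coder_2 = ["question", "answer", "question", "question", "other",
           "question", "answer", "answer", "question", "answer"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # about 0.68 for this made-up data
```

Simple percent agreement is 0.80 for these invented codes; kappa comes out lower because it discounts the agreement the two observers would reach by chance given their marginal code frequencies.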
Peer reviewed: Orwin, Robert G.; Cordray, David S. – Psychological Bulletin, 1985
Identifies three sources of reporting deficiency for meta-analytic results: quality (adequacy) of publicizing, quality of macrolevel reporting, and quality of microlevel reporting. Reanalysis of 25 reports from the Smith, Glass, and Miller (1980) psychotherapy meta-analysis established two sources of misinformation: interrater reliabilities and…
Descriptors: Confidence Testing, Interrater Reliability, Meta Analysis, Psychotherapy
Miller-Whitehead, Marie – 2001
A hypothetical case study provides examples of the inter-rater reliability issues involved in complex performance assessment, focusing on the Baldrige model. A hypothetical team of five evaluators was asked to rate a Baldrige model performance assessment along the seven defined criteria or performance dimensions that comprise the Baldrige model…
Descriptors: Case Studies, Criteria, Evaluators, Interrater Reliability
Fan, Xitao; Chen, Michael – 1999
It is erroneous to extend or generalize the inter-rater reliability coefficient estimated from only a (small) proportion of the sample to the rest of the sample data, where only one rater is used for scoring, although such generalization is often made implicitly in practice. It is shown that if the inter-rater reliability estimate from part of a sample…
Descriptors: Estimation (Mathematics), Generalizability Theory, Interrater Reliability, Sample Size
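Fan and Chen's caution concerns attaching a reliability coefficient estimated from a small double-scored subsample to a full sample in which everyone else is scored by a single rater. The simulation below is a hedged illustration of one aspect of that caution, not a reconstruction of the authors' analysis: estimates from small double-scored subsamples scatter widely around the large-sample value, so reporting one of them for the whole sample conveys unwarranted precision. All distributions and sample sizes are invented.

```python
# Hedged illustration (not the authors' analysis): instability of an inter-rater
# reliability coefficient estimated from a small double-scored subsample.
import random
import statistics

random.seed(1)

def simulate_ratings(n, true_sd=10.0, error_sd=6.0):
    """Two raters score the same n essays; each adds independent random error."""
    truth = [random.gauss(50, true_sd) for _ in range(n)]
    rater_1 = [t + random.gauss(0, error_sd) for t in truth]
    rater_2 = [t + random.gauss(0, error_sd) for t in truth]
    return rater_1, rater_2

def pearson(x, y):
    """Pearson correlation, used here as the inter-rater reliability coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Large-sample value the small estimates should be compared against (about 0.74 here).
big_1, big_2 = simulate_ratings(50_000)
print("large-sample reliability:", round(pearson(big_1, big_2), 2))

# Coefficients from double-scored subsamples of only 30 essays scatter widely,
# which is why attaching one of them to the full, singly scored sample is risky.
print("subsample estimates (n=30):",
      [round(pearson(*simulate_ratings(30)), 2) for _ in range(5)])
```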
Michaelides, Michalis P.; Haertel, Edward H. – Center for Research on Evaluation, Standards, and Student Testing (CRESST), 2004
There is variability in the estimation of an equating transformation because common-item parameters are obtained from responses of samples of examinees. The most commonly used standard error of equating quantifies this source of sampling error, which decreases as the sample size of examinees used to derive the transformation increases. In a…
Descriptors: Test Items, Testing, Error Patterns, Interrater Reliability
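The Michaelides and Haertel summary notes that the usual standard error of equating shrinks as the examinee sample used to derive the transformation grows. As a hedged and greatly simplified sketch, plain mean equating on invented score distributions rather than the report's common-item design, the simulation below shows the familiar pattern: quadrupling the number of examinees roughly halves the empirical standard error of the equating constant.

```python
# Hedged sketch: mean equating on invented data, to show the usual 1/sqrt(n)
# behaviour of the standard error of equating. Not the method used in the report.
import random
import statistics

random.seed(2)

def equating_constant(n):
    """Mean-equating constant (difference of form means) from n examinees per form."""
    form_x = [random.gauss(50, 10) for _ in range(n)]
    form_y = [random.gauss(52, 10) for _ in range(n)]
    return statistics.fmean(form_y) - statistics.fmean(form_x)

for n in (100, 400, 1600):
    # Empirical standard error over repeated samples of examinees.
    replications = [equating_constant(n) for _ in range(1000)]
    print(n, "examinees per form -> empirical SE", round(statistics.stdev(replications), 3))
```

In this simplified case the equating constant is just a difference of two sample means, so its standard error is close to sigma * sqrt(2 / n), which is why the printed values roughly halve each time the number of examinees quadruples.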
Peer reviewed: Kennison, Monica Metrick; Misselwitz, Shirley – Nursing Education Perspectives, 2002
Samples from 17 reflective journals of nursing students were evaluated by 6 faculty. Results indicate a lack of consistency in grading reflective writing, lack of consensus regarding evaluation, and differences among faculty regarding their view of such exercises. (Contains 26 references.) (JOW)
Descriptors: Grading, Higher Education, Interrater Reliability, Nursing Education
Peer reviewed: Maurer, Steven D.; Fay, Charles – Personnel Psychology, 1988
Examined degree to which agreement in interviewer ratings may be influenced by training, use of structured conventional interviews, or situational interviews. Results from 42 managers experienced as interviewers revealed no training effect on rating agreement; impact of situational format on consistency in assessments of applicant suitability was…
Descriptors: Administrators, Employment Interviews, Examiners, Experimenter Characteristics
Peer reviewed: Cordes, Anne K. – Journal of Speech and Hearing Research, 1994
This paper contends that behavior observation data relating to speech-language pathology are reliable if they are not affected by differences among observers or other variations in the recording context. The theoretical bases of methods used to estimate reliability for observational data are reviewed, and suggestions are provided for improving the…
Descriptors: Data Collection, Interrater Reliability, Observation, Reliability
Readers' Responses to the Rating of Non-Uniform Portfolios: Are There Limits on Portfolios' Utility?
Peer reviewed: Despain, LaRene; Hilgers, Thomas L. – WPA: Writing Program Administration, 1992
Describes readers' responses to the task of assigning scores to nonuniform portfolios of student writing. Suggests that reaching the goal of reliability in reading practices will not be easy. Concludes that writing program administrators should greet suggestions for the use of nonuniform portfolios with questioning restraint. (RS)
Descriptors: Higher Education, Interrater Reliability, Portfolios (Background Materials), Student Evaluation
Peer reviewed: Kreiman, Jody; And Others – Journal of Speech and Hearing Research, 1992
Sixteen listeners (10 expert, 6 naive) judged the dissimilarity of pairs of voices drawn from pathological and normal populations. Only parameters that showed substantial variability were perceptually salient across listeners. Results suggest that traditional means of assessing listener reliability in voice perception tasks may not be appropriate.…
Descriptors: Evaluation Methods, Individual Differences, Interrater Reliability, Perception
Peer reviewed: Tinsley, Howard E. A.; And Others – Career Development Quarterly, 1994
Describes investigation employing within-counselor design. Investigators analyzed audio recordings of career counseling interviews with clients who held either relatively negative expectations or relatively positive expectations regarding counseling. Clients who held relatively positive expectations were rated significantly higher on global…
Descriptors: Career Counseling, Expectation, Higher Education, Interrater Reliability
Peer reviewed: Ingham, Roger J.; And Others – Journal of Speech and Hearing Research, 1993
Two experiments investigating interval-by-interval interjudge and intrajudge agreement for stuttered and nonstuttered speech intervals found that training of judges could improve reliability levels; judges with relatively high intrajudge agreement also showed relatively higher interjudge agreement; and interval-by-interval interjudge agreement was…
Descriptors: Evaluation Methods, Interrater Reliability, Performance Factors, Speech Evaluation
Peer reviewed: Cox, Maureen V.; Perara, Julian – Educational Psychology: An International Journal of Experimental Educational Psychology, 1998
Devises a nine-point scale for scoring drawings of a cube. Provides detailed criteria and examples for each category. Shows that interrater reliability of the scale is high and that scores trace a linear trend across the sampled age range. Suggests that the scale is suitable for use as a diagnostic or assessment tool. (DSK)
Descriptors: Art Education, Evaluation Methods, Foreign Countries, Geometric Constructions
Peer reviewed: Dyson, Maree; Allen, Felicity; Duckett, Stephen – Evaluation and Program Planning, 2000
Reports on the interrater reliability of the Educational Needs Questionnaire (Victoria Department of Education, Australia), which was applied to 70 school-age children by their parents and 2 therapists. Results indicate that six of the subscales are reliable when evaluated by therapists and parents, but three subscales did not achieve the…
Descriptors: Children, Disabilities, Foreign Countries, Interrater Reliability
Peer reviewed: MacMillan, Peter D. – Journal of Experimental Education, 2000
Compared classical test theory (CTT), generalizability theory (GT), and multifaceted Rasch model (MFRM) approaches to detecting and correcting for rater variability using responses of 4,930 high school students graded by 3 raters on 9 scales. The MFRM approach identified far more raters as different than did the CTT analysis. GT and Rasch…
Descriptors: Generalizability Theory, High School Students, High Schools, Interrater Reliability


