| Publication Date | Count |
| --- | --- |
| In 2026 | 0 |
| Since 2025 | 58 |
| Since 2022 (last 5 years) | 284 |
| Since 2017 (last 10 years) | 780 |
| Since 2007 (last 20 years) | 2042 |
| Descriptor | Count |
| --- | --- |
| Interrater Reliability | 3124 |
| Foreign Countries | 655 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
| Audience | Count |
| --- | --- |
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
| Location | Count |
| --- | --- |
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 25 |
| Taiwan | 23 |
| Germany | 22 |
| What Works Clearinghouse Rating | Count |
| --- | --- |
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Froman, Richard L., Jr. – 1988
The reliability of a taxonomy of humor was tested in two studies. The first study involved rater identification of nine categories for humorous incidents excerpted from television comedy programs (wordplay, exaggeration/understatement, contrast, audience knowledge, aggression, emotion, taboo, pratfall/slapstick, and repetition). The second study,…
Descriptors: Classification, Humor, Interrater Reliability, Psychometrics
Brown, R. L. – 1987
This paper explores the use of K. G. Joreskog's (1970) congeneric modeling approach to reliability using censored quantitative variables. Two Monte Carlo studies were conducted. The first explored the robustness of Normal Theory Generalized Least-Squares (NTGLS) estimates for a single-factor congeneric model across several sample sizes…
Descriptors: Interrater Reliability, Monte Carlo Methods, Sample Size
Peer reviewed: Whitehurst, Grover J. – American Psychologist, 1984
Holds that interrater agreement for journal manuscript reviews has seemed unacceptably low because it has been assessed using techniques such as the intraclass correlation, which compares error variance with the variance due to manuscripts. Describes and recommends an alternative approach for computing interrater agreement. (GC)
Descriptors: Interrater Reliability, Periodicals, Psychological Studies, Statistical Analysis
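The intraclass-correlation approach that the abstract above takes issue with can be sketched as a one-way random-effects ICC. The following is a minimal illustration of that standard index, not Whitehurst's recommended alternative; the function name and data layout are illustrative.

```python
# A minimal sketch of a one-way random-effects intraclass correlation,
# ICC(1). `scores` is a list of subjects (e.g. manuscripts), each a list
# of the same number of raters' scores.

def icc1(scores):
    n = len(scores)                    # number of subjects
    k = len(scores[0])                 # raters per subject
    grand = sum(sum(row) for row in scores) / (n * k)
    means = [sum(row) / k for row in scores]

    # Between-subjects and within-subject mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(scores, means)
              for x in row) / (n * (k - 1))

    return (msb - msw) / (msb + (k - 1) * msw)
```

When raters agree perfectly within subjects and subjects differ, the within-subject mean square is zero and the index is 1; when all the variance is disagreement between raters, the index turns negative, which is part of why agreement-focused alternatives have been proposed.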
Peer reviewed: Collis, Glyn M. – Educational and Psychological Measurement, 1985
Some suggestions for measuring marginal symmetry in agreement matrices for categorical data are discussed, together with measures of item-by-item agreement conditional on marginal asymmetry. Connections with intraclass correlations for dichotomous data are noted. (Author)
Descriptors: Correlation, Interrater Reliability, Item Analysis, Matrices
Peer reviewed: Li, Mao-Neng Fred; Lautenschlager, Gary – Educational and Psychological Measurement, 1997
Illustrates a link between the multiple-rater kappa of J. Fleiss (1971) or other analogues and the generalizability (G) coefficient for a single-facet design, and discusses the use and interpretation of G theory in the study of interrater agreement when data are measured on a nominal scale. (SLD)
Descriptors: Classification, Generalizability Theory, Interrater Reliability, Research Design
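The multiple-rater kappa of Fleiss (1971) referenced above can be sketched briefly. This is a generic illustration of that statistic, not the authors' SAS implementation; the data layout is assumed for the example.

```python
# A minimal sketch of Fleiss' (1971) multiple-rater kappa for nominal
# data. `counts` is an N x k matrix: counts[i][j] is the number of raters
# who assigned subject i to category j; every row sums to the same
# number of raters n.

def fleiss_kappa(counts):
    N = len(counts)            # number of subjects
    k = len(counts[0])         # number of categories
    n = sum(counts[0])         # raters per subject (assumed constant)

    # Proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]

    # Observed pairwise agreement per subject, averaged over subjects.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N

    # Agreement expected by chance from the marginal proportions.
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)
```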
Peer reviewed: Li, Mao-Neng Fred; Lautenschlager, Gary J. – Educational and Psychological Measurement, 1999
Describes a Statistical Analysis System (SAS) MACRO for computing various indices of interrater agreement, including a new generalizability coefficient, for categorical data in a single-facet, crossed design. (Author/SLD)
Descriptors: Classification, Generalizability Theory, Interrater Reliability, Qualitative Research
Peer reviewed: Lindell, Michael K.; Brandt, Christina J.; Whitney, David J. – Applied Psychological Measurement, 1999
Proposes a revised index of interrater agreement for multi-item ratings of a single target. This index is an inverse linear function of the ratio of the average obtained variance to the variance of the uniformly distributed random error. Discusses the importance of sample size for the index. (SLD)
Descriptors: Error of Measurement, Interrater Reliability, Sample Size
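The index described in the abstract above — an inverse linear function of the ratio of the average obtained variance to the uniform-error variance — can be sketched as follows. This is an illustration of that description, not the authors' own code; the variance convention (population vs. sample) varies in the agreement literature and is an assumption here.

```python
# A minimal sketch of a revised within-group agreement index of the form
# 1 - (mean observed item variance / variance of a uniform random-response
# distribution). `ratings` is a list of items, each a list of the raters'
# scores on a discrete scale with `scale_points` response options.

def r_star_wg(ratings, scale_points):
    # Variance of a discrete uniform distribution over A response options.
    sigma2_eu = (scale_points ** 2 - 1) / 12

    def var(xs):  # population variance (convention assumed for this sketch)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    mean_obs_var = sum(var(item) for item in ratings) / len(ratings)
    return 1 - mean_obs_var / sigma2_eu
```

Perfect agreement on an item gives zero observed variance and an index of 1; ratings spread evenly across the scale give an index near 0.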
Schuster, Christof; Smith, David A. – Psychometrika, 2005
The rater agreement literature is complicated by the fact that it must accommodate at least two different properties of rating data: the number of raters (two versus more than two) and the rating scale level (nominal versus metric). While kappa statistics are most widely used for nominal scales, intraclass correlation coefficients have been…
Descriptors: Psychometrics, Statistics, Rating Scales, Correlation
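For the two-rater nominal case the abstract contrasts with the metric case, the most widely used kappa statistic is Cohen's kappa; a minimal sketch (not Schuster and Smith's own method) is:

```python
# A minimal sketch of Cohen's kappa for two raters' nominal
# classifications of the same n subjects.

def cohen_kappa(r1, r2):
    n = len(r1)
    cats = set(r1) | set(r2)

    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n

    # Chance agreement from each rater's marginal category proportions.
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)

    return (p_o - p_e) / (1 - p_e)
```

Identical classifications yield kappa of 1, while agreement no better than the raters' marginals predict yields 0.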
Baird, Jo-Anne; Greatorex, Jackie; Bell, John F. – Assessment in Education: Principles, Policy & Practice, 2004
Marking reliability is purported to be produced by having an effective community of practice. No experimental research has been identified which attempts to verify empirically the aspects of a community of practice that have been observed to produce marking reliability. This research outlines what that community of practice might entail and…
Descriptors: Foreign Countries, Grades (Scholastic), Grading, Interrater Reliability
Munson, Benjamin; Brinkman, Kayla N. – American Journal of Speech-Language Pathology, 2004
Two experiments examined whether listening to multiple presentations of recorded speech stimuli influences the reliability and accuracy of judgments of children's speech production accuracy. In Experiment 1, 10 listeners phonetically transcribed words produced by children with phonological impairments after a single presentation and after the word…
Descriptors: Speech, Children, Phonetics, Speech Impairments
Roberts, Felicia; Robinson, Jeffrey D. – Human Communication Research, 2004
This investigation assesses interobserver agreement on conversation analytic (CA) transcription. Four professional CA transcribers spent a maximum of 3 hours transcribing 2.5 minutes of a previously unknown, naturally occurring, mundane telephone call. Researchers unitized transcripts into words, sounds, silences, inbreaths, outbreaths, and laugh…
Descriptors: Interrater Reliability, Discourse Analysis, Semantics, Pragmatics
Fleming, Judith A.; Taylor, Janeen McCracken; Carran, Deborah – Assessment for Effective Intervention, 2004
This article offers an alternative methodology for practitioners and researchers to use in establishing interrater reliability for testing purposes. The majority of studies on interrater reliability use a traditional methodology whereby two raters are compared using a Pearson product-moment correlation. This traditional method of estimating…
Descriptors: Interrater Reliability, Methods, Correlation, Evaluation Methods
Schuster, Christof; Smith, David A. – Educational and Psychological Measurement, 2006
Because nominal-scale judgments cannot directly be aggregated into meaningful composites, the addition of a second rater is usually motivated by a desire to estimate the quality of a single rater's classifications rather than to improve reliability. When raters agree, the aggregation problem does not arise. Nevertheless, a proportion of this…
Descriptors: Models, Interrater Reliability, Measures (Individuals), Evaluation Criteria
Millar, Dorothy Squatrito – Education and Training in Developmental Disabilities, 2009
IEP transition-related content was compared between young adults with developmental disabilities who had or did not have legal guardians. It was found that students with guardians were more likely to earn a certificate of completion and to want to remain living with their families, in comparison to students without guardians, who were more likely…
Descriptors: Developmental Disabilities, Young Adults, Individualized Education Programs, Self Determination
Coniam, David – ReCALL, 2009
This paper describes a study of the computer essay-scoring program BETSY. While the use of computers in rating written scripts has been criticised in some quarters for lacking transparency or for its lack of fit with how human raters rate written scripts, a number of essay rating programs are available commercially, many of which claim to offer comparable…
Descriptors: Writing Tests, Scoring, Foreign Countries, Interrater Reliability