Publication Date
| In 2026 | 0 |
| Since 2025 | 56 |
| Since 2022 (last 5 years) | 282 |
| Since 2017 (last 10 years) | 778 |
| Since 2007 (last 20 years) | 2040 |
Descriptor
| Interrater Reliability | 3122 |
| Foreign Countries | 654 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 24 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Reading, Suzanne; Richie, Carolyn – Child Language Teaching and Therapy, 2007
The Structured Observation System (SOS) is a data collection method developed to document changes in the communication behaviours of children identified with speech and language delays. The system employs a rating scale which reflects the occurrence of communication behaviours as well as the amount of assistance needed for behaviours to occur.…
Descriptors: Observation, Rating Scales, Delayed Speech, Evaluation Methods
Wang, Wen-chung – 1997
Traditional approaches to the investigation of the objectivity of ratings for constructed-response items are based on classical test theory, which is item-dependent and sample-dependent. Item response theory overcomes this drawback by decomposing item difficulties into genuine difficulties and rater severity. In so doing, objectivity of ability…
Descriptors: College Entrance Examinations, Constructed Response, Foreign Countries, Interrater Reliability
Cole, Donna J.; And Others – 1991
This study explores an endeavor by the Ohio Consortium for Portfolio Development to assess preservice teachers' reflectivity as demonstrated through the development of professional portfolios. The first section of this paper presents the demographic information of the study, explaining the consortium derivation, purpose, and interrater format. The…
Descriptors: Educational Research, Higher Education, Interrater Reliability, Portfolios (Background Materials)
Dudczak, Craig A.; Day, Donald L. – 1989
To develop a taxonomy of Cross Examination Debate Association (CEDA) critics, a study associated professed judging philosophy and responses to survey questions with ballot behavior and elaborated judging profiles. Subjects were debate critics who judged rounds at CEDA tournaments in the Northeast during the Spring 1989 season. In all, 13 critics…
Descriptors: Classification, Communication Research, Correlation, Criteria
Uebersax, John; Grove, Will – 1989
Methods of probability modeling to analyze rater agreement are described, emphasizing their basic similarities and viewing them as variants of a common methodology. Statistical techniques for analyzing agreement data are described to address questions such as how many opinions are required to make a medical diagnosis with necessary accuracy. Kappa…
Descriptors: Clinical Diagnosis, Correlation, Estimation (Mathematics), Evaluation Methods
Richards, Ruth L.; And Others – 1985
This paper presents a new research instrument, The Lifetime Creativity Scales (LCS), along with validation evidence based on two large and independent samples. Views on creativity are discussed, the background of the LCS is reviewed, and the LCS are briefly described. The seven scales--three measuring peak creativity, three measuring extent of…
Descriptors: Adults, Construct Validity, Content Validity, Creativity
Fowler, Floyd J., Jr.; Mangione, Thomas W. – 1986
This large-scale field experiment examined the potential of various training and supervision programs to affect the performance of health survey interviewers and the quality of data they collect. It was found that interviewers who received less than one day of basic training generally displayed inadequate interviewing skills. A program of tape…
Descriptors: Data Collection, Health Services, Information Seeking, Inquiry
Halpin, Glennelle; And Others – 1986
This study was designed as a reconsideration of the weights used in evaluative decisions made with regard to research proposals submitted for funding at a major state university. The specific objective of the study was to determine whether the actual weights for components used in the evaluation of the proposals differed from a priori weights…
Descriptors: College Faculty, Decision Making, Evaluation Methods, Grants
Ferguson, Harold L.; Enger, John M. – 1985
The purpose of this study was to: (1) assess the anticipated ratings of teacher performance by principals using the Missouri Performance Based Teacher Evaluation (PBTE) prior to the first cycle of its implementation; (2) determine whether or not elementary and secondary principals, using the same instrument, would be consistent in perceived…
Descriptors: Competence, Elementary Secondary Education, Interrater Reliability, Job Performance
Atkinson, Dianne; Murray, Mary – 1987
Noting that improvement in rater reliability means eliminating differences among raters, this paper discusses ways to assess writing evaluator reliability and methods for achieving higher levels of interrater reliability. After showing that reliability can be improved two ways--by increasing the number of raters or measurements made, and by…
Descriptors: Evaluation Methods, Holistic Evaluation, Interrater Reliability, Measurement Techniques
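The abstract above notes that reliability can be improved by increasing the number of raters. One common way to quantify that effect is the Spearman-Brown prophecy formula; the abstract does not say whether the paper uses it, so the sketch below is only an illustration under that assumption. The function name and the example values are hypothetical.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Projected reliability of the mean of n_raters parallel ratings.

    Assumes raters are interchangeable (parallel measurements), which is
    the standard assumption behind the Spearman-Brown formula.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    if not 0.0 <= single_rater_reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    r = single_rater_reliability
    return (n_raters * r) / (1.0 + (n_raters - 1) * r)


# Illustrative values only: a single-rater reliability of 0.50,
# projected for panels of one to four raters.
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(0.50, k), 3))
```

Under this model the gain from each additional rater shrinks as the panel grows, so the formula is most useful for deciding how many raters are enough for a target reliability.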
Lees, Elaine O. – 1981
Given the concern for reliability in essay evaluation and the prospect of "error" variance in its absence, methods to promote interrater reliability in the evaluation of written compositions have been developed. These methods reduce variation in the value systems being applied by readers to texts, either by limiting the group of readers…
Descriptors: Elementary Secondary Education, Evaluation Criteria, Evaluation Methods, Evaluative Thinking
Peer reviewed: Matheny, Adam P., Jr.; And Others – Developmental Psychology, 1987
Mothers of about 100 toddlers completed the Toddler Temperament Scale when their children were 12, 18, and 24 months old. Other data sets were available for: (a) factors representing laboratory observations; (b) measures of mothers' temperament by mothers and by social workers; and (c) measures of the home and family environment by social workers.…
Descriptors: Behavior Rating Scales, Experimenter Characteristics, Family Environment, Interrater Reliability
Peer reviewed: Stewart, Krista J. – Psychology in the Schools, 1987
Evaluated the technical aspects of three Wechsler Intelligence Scale for Children-Revised (WISC-R) administrations of five psychology graduate students using the WISC-R Administration Observational Checklist (WAOC) to evaluate interrater agreement. Students performed significantly better on the second than on the first observation, with…
Descriptors: Educational Diagnosis, Error Patterns, Examiners, Graduate Students
Peer reviewed: Abrahams, Ruby; And Others – Evaluation Review, 1988
A methodology for developing clinical/research assessment tools, training interviewers, and continuously assessing interrater reliability is discussed. Data from a multisite national evaluation of long-term health care programs (i.e., the Social/Health Maintenance Organization (HMO) for elderly clients) are used. Focus is on providing research…
Descriptors: Clinical Diagnosis, Data Collection, Health Facilities, Health Programs
Peer reviewed: Jafarpur, Abdoljavad – System, 1988
Investigation of non-native English speakers' ratings of other non-native English learners' oral proficiency. Results indicate that the judges' ratings significantly differed, and the average of three judges' ratings was a better appraisal of the testee's true ability than that of any single rating or pair of ratings. (Author/CB)
Descriptors: English (Second Language), Evaluation Methods, Foreign Countries, Interrater Reliability

