Publication Date
| In 2026 | 0 |
| Since 2025 | 56 |
| Since 2022 (last 5 years) | 282 |
| Since 2017 (last 10 years) | 778 |
| Since 2007 (last 20 years) | 2040 |
Descriptor
| Interrater Reliability | 3122 |
| Foreign Countries | 654 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 24 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Lecavalier, Luc – Journal of Autism and Developmental Disorders, 2005
The Gilliam Autism Rating Scale was developed to identify individuals with autism in research and clinical settings. It has benefited from wide use and acceptance but has received little empirical attention. The purpose of this study was to evaluate the construct and diagnostic validity, interrater reliability, and effects of participant…
Descriptors: Behavior Rating Scales, Factor Analysis, Construct Validity, Pervasive Developmental Disorders
Bradley, Thomas P.; Allen, Jeff M.; Hamilton, Scott; Filgo, Scott K. – Performance Improvement Quarterly, 2006
Multirater feedback, often called 360-degree feedback, is a popular development and assessment tool, especially for organizational leaders. Raters from different organizational levels, including subordinates, boss, peers, and self, rate the leader's performance. However, there seldom is strong agreement across rater groups. This study used the…
Descriptors: Leadership Effectiveness, Peer Evaluation, Job Performance, Personnel Evaluation
Schwalbe, Craig S.; Fraser, Mark W.; Day, Steven H.; Arnold, Elizabeth Mayfield – Journal of Offender Rehabilitation, 2004
Actuarial risk assessment instruments are used increasingly in juvenile justice to classify youths according to their risk of recidivism. The purpose of this article is to describe the results of two studies of one instrument: the North Carolina Assessment of Risk (NCAR). In the first study, the inter-rater reliability of the risk assessment…
Descriptors: Recidivism, Predictive Validity, Interrater Reliability, Program Effectiveness
Rockwell, Pam; Dunham, Mardis – Art Therapy: Journal of the American Art Therapy Association, 2006
This study explored the use of the Formal Elements Art Therapy Scale (FEATS) with a population of persons with a DSM-IV diagnosis of Substance Use Disorder who were court ordered for treatment. Two groups of adults (N = 40) were closely matched on age, gender, race, socioeconomic status and education level, and were administered the Person Picking…
Descriptors: Measures (Individuals), Interrater Reliability, Group Membership, Art Therapy
Cutler, Lois J.; Kane, Rosalie A.; Degenholtz, Howard B.; Miller, Michael J.; Grant, Leslie – Gerontologist, 2006
Purpose: We developed and tested theoretically derived procedures to observe physical environments experienced by nursing home residents at three nested levels: their rooms, the nursing unit, and the overall facility. Illustrating with selected descriptive results, in this article we discuss the development of the approach. Design and Methods: On…
Descriptors: Physical Environment, Nursing Homes, Research Tools, Evaluation Methods
Collins, Kathleen M. T. – Evaluation and Research in Education, 2006
The purpose of this mixed-methods study was to document the prevalence of sampling designs utilised in mixed-methods research and to examine the interpretive consistency between interpretations made in mixed-methods studies and the sampling design used. Classification of studies was based on a two-dimensional mixed-methods sampling model. This…
Descriptors: Social Science Research, Incidence, Social Sciences, School Psychology
Bennett, Randy Elliot; Rock, Donald A. – 1993
Formulating-Hypotheses (F-H) items present a situation and ask the examinee to generate as many explanations for it as possible. This study examined the generalizability, validity, and examinee perceptions of a computer-delivered version of the task. Eight F-H questions were administered to 192 graduate students. Half of the items restricted…
Descriptors: Computer Assisted Testing, Difficulty Level, Generalizability Theory, Graduate Students
Bridgeman, Brent; And Others – 1996
The various methods for computing the reliability of scores on Advanced Placement (AP) examinations are summarized. For the free response portion of the examinations, raters can contribute to score unreliability through both systematic severity errors (in which some raters consistently rate more severely than other raters) and through…
Descriptors: Advanced Placement, College Entrance Examinations, Error of Measurement, High School Students
O'Neill, Thomas R.; Lunz, Mary E. – 1996
To generalize test results beyond the particular test administration, an examinee's ability estimate must be independent of the particular items attempted, and the item difficulty calibrations must be independent of the particular sample of people attempting the items. This stability is a key concept of the Rasch model, a latent trait model of…
Descriptors: Ability, Benchmarking, Comparative Analysis, Difficulty Level
Giota, Joanna – 1995
This study examined the concept of quality in child day care and how this can be measured by the Early Childhood Environment Rating Scale (ECERS). Swedish day care centers in three communities were administered a version of the ECERS, which was translated from the original scale to accommodate conceptual differences between Sweden and the United…
Descriptors: Day Care, Day Care Centers, Foreign Countries, Interrater Reliability
Reckase, Mark D. – 1997
This paper argues that special procedures for constructing assessment tools containing performance assessment tasks are unnecessary and that current test methodology can easily be generalized to complex performance assessment tasks without destroying the desirable characteristics of those tasks. Reasonable statistical requirements for sound…
Descriptors: Educational Assessment, Generalizability Theory, High Stakes Tests, Interrater Reliability
Jaeger, Richard M.; Usher, Claire H. – 1991
This paper reports on a study of the foundation and application of two procedures used to specify appropriate weights to be applied to components in determining the overall quality of a school. These procedures are multiattribute utility technology (MAUT) and policy capturing, and the paper presents the results of applying them, using key…
Descriptors: Achievement Tests, Comparative Analysis, Curriculum Evaluation, Educational Assessment
Hambleton, Ronald K.; Plake, Barbara S. – 1994
The number of performance-based assessments is increasing rapidly, but to date there is no established procedure for setting standards on these assessments. This paper describes several extensions to the Angoff procedure to accommodate the characteristics of a performance-based assessment and presents the results of research in applying this…
Descriptors: Educational Assessment, Evaluation Methods, Interrater Reliability, Performance Based Assessment
Chang, Lei; And Others – 1994
The present study examines the influence of judges' item-related knowledge on setting standards for competency tests. Seventeen judges from different professions took a 122-item teacher-certification test in economics while setting competency standards for the test using the Angoff procedure. Judges tended to set higher standards for items they…
Descriptors: Economics, Evaluators, Experience, Interrater Reliability
Crews, William E., Jr. – 1991
As part of a study of teacher evaluation of student replies to open-ended questions, a second question--the best method of determining interrater reliability--was examined. The standard method, the Pearson Product-Moment correlation, overestimated the degree of match between researchers' and teachers' scoring of tests. The simpler percent…
Descriptors: Comparative Analysis, Elementary School Teachers, Evaluation Methods, Evaluators

