Publication Date
| In 2026 | 0 |
| Since 2025 | 58 |
| Since 2022 (last 5 years) | 284 |
| Since 2017 (last 10 years) | 780 |
| Since 2007 (last 20 years) | 2042 |
Descriptor
| Interrater Reliability | 3124 |
| Foreign Countries | 655 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 25 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
McGinty, Dixie; Neel, John H. – 1996
A new standard setting approach is introduced, called the cognitive components approach. Like the Angoff method, the cognitive components method generates minimum pass levels (MPLs) for each item. In both approaches, the item MPLs are summed for each judge, then averaged across judges to yield the standard. In the cognitive components approach,…
Descriptors: Cognitive Processes, Criterion Referenced Tests, Evaluation Methods, Grade 3
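The aggregation step this abstract describes (item MPLs summed for each judge, then averaged across judges) can be sketched as follows. This is a minimal illustration only: the function name `standard_from_mpls` and the sample ratings are hypothetical, and the sketch covers just the summing-and-averaging step shared by both approaches, not how the cognitive components method derives its item MPLs.

```python
from statistics import mean


def standard_from_mpls(ratings):
    """Return the standard (cut score) from per-judge item MPLs.

    ratings: list of lists, one inner list per judge, holding that
    judge's minimum pass level (MPL) for each item as a value in [0, 1].
    """
    if not ratings:
        raise ValueError("at least one judge is required")
    # Step 1: sum the item MPLs for each judge.
    judge_totals = [sum(judge_mpls) for judge_mpls in ratings]
    # Step 2: average the per-judge totals across judges.
    return mean(judge_totals)


# Hypothetical data: two judges rating a three-item test.
example_ratings = [
    [0.6, 0.7, 0.5],  # judge 1
    [0.5, 0.8, 0.6],  # judge 2
]
cut_score = standard_from_mpls(example_ratings)
print(round(cut_score, 2))
```

The result is on the raw-score scale of the test, so with three items it falls between 0 and 3.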
Takala, Sauli – 1998
This paper discusses recent developments in language testing. It begins with a review of the traditional criteria that are applied to all measurement and outlines recent emphases that derive from the expanding range of stakeholders. Drawing on Alderson's seminal work, criteria are presented for evaluating communicative language tests. Developments…
Descriptors: Alternative Assessment, Communicative Competence (Languages), Comparative Analysis, Evaluation Criteria
Peer reviewed: Angoff, William H. – Applied Measurement in Education, 1988
Suggestions are provided for future research in item bias detection, reduction of essay-reader variation in setting cut-score levels, and limitations of equating theory. (TJH)
Descriptors: College Entrance Examinations, Cutting Scores, Equated Scores, Essay Tests
Peer reviewed: Tyson, LeaAnn; Silverman, Stephen – Journal of Personnel Evaluation in Education, 1994
Differences in the Texas Teacher Appraisal System scores of teacher subgroups over 2 years were examined for 2,366 teachers for scores on individual domains, sums of scores of the first four domains, and overall summary performance scores, as well as appraiser differences. Implications for teacher evaluation are discussed. (SLD)
Descriptors: Educational Assessment, Elementary Secondary Education, Evaluation Methods, Evaluators
Peer reviewed: Gross, Leon J. – Evaluation and the Health Professions, 1994
Whether adequate levels of interrater reliability could be obtained on a national, standardized examination using one examiner per observation was studied with 101 paired candidate observations on an examination for optometry. Results indicate that psychometrically sound judgments can be obtained with one examiner. (SLD)
Descriptors: Educational Assessment, Error of Measurement, Evaluation Methods, Evaluators
Peer reviewed: Wigglesworth, Gillian – Australian Review of Applied Linguistics, 1994
Multifaceted Rasch analysis was used to determine whether bias was evident in the way a group of raters graded two different versions of an oral interaction test, undertaken by the same candidates. Results indicate that certain raters consistently rated the tape version of the test more harshly while others rated the live one more harshly. (10…
Descriptors: Data Collection, Foreign Countries, Graphs, Interaction Process Analysis
Peer reviewed: Jaeger, Richard M. – Educational Measurement: Issues and Practice, 1991
Issues concerning the selection of judges for standard setting are discussed. Determining the consistency of judges' recommendations, or their congruity with other expert recommendations, would help in selection. Enough judges must be chosen to allow estimation of recommendations by an entire population of judges. (SLD)
Descriptors: Cutting Scores, Evaluation Methods, Evaluators, Examiners
Peer reviewed: Reid, Jerry B. – Educational Measurement: Issues and Practice, 1991
Training judges to generate item ratings in standard setting once the reference group has been defined is discussed. It is proposed that sensitivity to the factors that determine difficulty can be improved through training. Three criteria for determining when training is sufficient are offered. (SLD)
Descriptors: Computer Assisted Instruction, Difficulty Level, Evaluators, Interrater Reliability
Peer reviewed: Elam, Carol L.; Andrykowski, Michael A. – Academic Medicine, 1991
Medical school admission interview ratings for four entering classes (n=356 students) were compared with preadmission academic variables (admission test scores, undergraduate grades), student characteristics (age, gender, residence), and interviewer characteristics (gender, professional background, admission committee membership). Recommendations…
Descriptors: Academic Achievement, Admission Criteria, College Admission, Higher Education
Peer reviewed: Hughes, I. E.; Large, B. J. – Studies in Higher Education, 1993
A study investigated the consistency of faculty and peer evaluations of the oral communication skills of 44 fourth-year pharmacology students. Substantial agreement between faculty and students was found. Peer evaluations were independent of their own communication skills. In addition, a significant correlation between oral and written…
Descriptors: Communication Skills, Comparative Analysis, Evaluation Methods, Higher Education
Peer reviewed: Gierl, Mark J. – Alberta Journal of Educational Research, 1998
Examined the generalizability of written-response scores on the English 30 diploma examination administered to Alberta 12th-grade students. Student scores differed as a function of rater, but this variance component was small across two tasks and two administrations; score generalizability was high using a two-rater system; and scale variability…
Descriptors: Error of Measurement, Foreign Countries, Generalizability Theory, High School Seniors
Peer reviewed: Klein, Stephen P.; Stecher, Brian M.; Shavelson, Richard J.; McCaffrey, Daniel; Ormseth, Tor; Bell, Robert M.; Comfort, Kathy; Othman, Abdul R. – Applied Measurement in Education, 1998
Two studies involving 368 elementary and high school students and 29 readers were conducted to investigate reader consistency, score reliability, and reader time requirements of three hands-on science performance tasks. Holistic scores were as reliable as analytic scores, and there was a high correlation between them after they were disattenuated…
Descriptors: Elementary School Students, Elementary Secondary Education, Hands on Science, High School Students
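For readers unfamiliar with the term, a disattenuated correlation is usually obtained with the classical correction for attenuation: the observed correlation is divided by the square root of the product of the two score reliabilities. The sketch below shows that standard formula only. The function name `disattenuate` and the sample values are hypothetical, and the abstract does not state which reliability estimates the authors used.

```python
from math import sqrt


def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation.

    r_xy:  observed correlation between the two scores
    rel_x: reliability of the first score (0 < rel_x <= 1)
    rel_y: reliability of the second score (0 < rel_y <= 1)
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    # Divide the observed correlation by the geometric mean
    # of the two reliabilities.
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical values, not taken from the study.
corrected = disattenuate(0.60, 0.80, 0.90)
print(round(corrected, 3))
```

Because the divisor is at most 1, the corrected value is never smaller in magnitude than the observed correlation, and it can exceed 1 when the reliabilities are underestimated.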
Peer reviewed: Magin, D. J. – Assessment & Evaluation in Higher Education, 2001
Presents a novel application of analysis of variance (ANOVA) techniques to compare the reliability of multiple peer ratings with single teacher ratings. Uses rating data from two different courses, both involving multiple peer and individual teacher ratings that were used to assess student contributions to group process work. Discusses…
Descriptors: Analysis of Variance, Comparative Analysis, Cooperative Learning, Evaluation Methods
Costrell, Robert – Education Next, 2005
Each January since 1997, "Education Week," the K-12 industry's newspaper of record, has issued its "Quality Counts" report, ranking states by, among other things, the "equity" of their school finances. On the other hand, every fall since 2001, the "Education Trust," a national organization devoted to closing the achievement gap in public schools,…
Descriptors: Trust (Psychology), National Organizations, Elementary Secondary Education, Educational Finance
Anderson, Rachel L.; Lyons, John S.; Giles, Debra M.; Price, Judith A.; Estle, George – Journal of Child and Family Studies, 2003
We examined the interrater reliability of the "Child and Adolescent Needs and Strengths-Mental Health" (CANS-MH) scale among researchers and between researchers and clinicians. All children presenting to a treatment facility for either protective or mental health needs were eligible to be included in the study. As part of standard assessment…
Descriptors: Health Needs, Mental Health, Interrater Reliability, Quality Assurance

