Publication Date
| In 2026 | 0 |
| Since 2025 | 58 |
| Since 2022 (last 5 years) | 284 |
| Since 2017 (last 10 years) | 780 |
| Since 2007 (last 20 years) | 2042 |
Descriptor
| Interrater Reliability | 3124 |
| Foreign Countries | 655 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 25 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Florida State Dept. of Education, Tallahassee. Div. of Vocational, Adult, and Community Education. – 1991
This packet contains a manual and a workbook for developing performance tests in vocational education. The manual gives an in-depth description of how to develop, score, and use performance tests. It includes the following sections: definitions of performance testing, steps in developing a performance test, selecting a performance development…
Descriptors: Interrater Reliability, Performance Tests, Postsecondary Education, Scoring
Pino, Barbara Gonzalez – Texas Papers in Foreign Language Education, 1998
Previous literature on classroom testing of second language speech skills provides several models of both task types and rubrics for rating, and suggestions regarding procedures for testing speaking with large numbers of learners. However, there is no clear, widely disseminated consensus in the profession on the appropriate paradigm to guide the…
Descriptors: College Instruction, Evaluation Criteria, Higher Education, Interrater Reliability
Shavelson, Richard J.; And Others – 1993
In this paper, performance assessments are cast within a sampling framework. A performance assessment score is viewed as a sample of student performance drawn from a complex universe defined by a combination of all possible tasks, occasions, raters, and measurement methods. Using generalizability theory, the authors present evidence bearing on the…
Descriptors: Academic Achievement, Educational Assessment, Error of Measurement, Evaluators
Myford, Carol M. – 1991
The aesthetic judgments of experts (casting directors and high school drama teachers), theater buffs, and novices were compared as they rated high school students' videotaped performances of Shakespearean monologues. It was hypothesized that theater buffs would represent an intermediate stage on the path to developing expertise in judging acting…
Descriptors: Ability, Acting, Aesthetic Values, Art Criticism
Silvestro, John R.; And Others – 1989
The job analysis procedures used in the development of the Illinois Certification Testing System are described. The degree of congruence between job analysis ratings provided by public school educators (PSEs) and teacher educators (TEs) who completed the job analysis surveys is examined. National Evaluation Systems, Inc., and the Illinois State…
Descriptors: Comparative Analysis, Content Analysis, Elementary Secondary Education, Interrater Reliability
Performance-based Assessment of At-risk Students in Mathematics: The Effects of Context and Setting.
Telese, James A.; Kulm, Gerald – 1995
A team of university and public school mathematics educators developed performance-based mathematics assessment tasks aligned with the Texas Assessment of Academic Skills for 93 students who had been identified as at-risk in mathematics. Scenarios were developed based on four contexts: (1) familiar activity; (2) social issue; (3)…
Descriptors: Analysis of Variance, Context Effect, Educational Assessment, Educational Environment
North Carolina State Dept. of Public Instruction, Raleigh. Div. of Accountability/Testing. – 2001
During the 1999-2000 school year, the North Carolina Alternate Assessment Portfolio was administered to eligible students with serious cognitive deficits statewide as a pilot program. This report provides state, regional, and local education agency results of that pilot program. The purpose of the pilot was to review the feasibility, validity, and…
Descriptors: Academic Achievement, American Indians, Cultural Differences, Elementary Secondary Education
1999
This document contains four symposium papers on assessing employee performance. In "Influence of Liking and Similarity on Multi-rater Proficiency Ratings of Managerial Competencies" (Reid A. Bates), the pattern of correlations identified between raters, independent variables, and different competencies suggests that raters may react…
Descriptors: Adult Education, Case Studies, Competence, Educational Needs
Tinsley, Barbara J.; And Others – Educational and Psychological Measurement, 1997 (peer reviewed)
The convergent validity of peer, self, and teacher methods of assessing youths' risk propensity and the relation of these measures to health risk behavior were studied with 436 elementary and junior high school students. Findings demonstrate low congruence between rater sources. Prediction depended on behavior assessed and grade level. (SLD)
Descriptors: Age Differences, Behavior Patterns, Children, Elementary Education
Yang, Yongwei; Buckendahl, Chad W.; Juszkiewicz, Piotr J.; Bhola, Dennison S. – Journal of Applied Testing Technology, 2005
With the continual progress of computer technologies, computer automated scoring (CAS) has become a popular tool for evaluating writing assessments. Research on applications of these methodologies to new types of performance assessments is still emerging. While research has generally shown a high agreement of CAS system generated scores with those…
Descriptors: Scoring, Validity, Interrater Reliability, Comparative Analysis
Strong, Gregory – Thought Currents in English Literature, 1995
This paper traces developments in educational psychology and measurement that led to the Test of English as a Foreign Language (TOEFL) and the Test of English for International Communication (TOEIC) and the application of educational measurement terms such as validity and reliability to testing. Use of a table of specifications for planning…
Descriptors: Cloze Procedure, Difficulty Level, English (Second Language), Foreign Countries
Carlson, Sybil B.; And Others – 1985
Four writing samples were obtained from 638 foreign college applicants who represented three major foreign language groups (Arabic, Chinese, and Spanish), and from 60 native English speakers. All four were scored holistically, two were also scored for sentence-level and discourse-level skills, and some were scored by the Writer's Workbench…
Descriptors: Arabic, Chinese, College Entrance Examinations, Computer Software
Shiflett, Samuel; And Others – 1985
A study was undertaken to improve the measurement of small team performance within the Army. A provisional taxonomy of team-level performance functions was field-validated; criteria and measures of the functions were developed; and their reliability was examined. The provisional taxonomy, used for observing Army field training exercises, was used…
Descriptors: Behavior Rating Scales, Classification, Evaluation Criteria, Evaluators
Jaeger, Richard M.; Busch, John Christian – 1986
This study explores the use of the modified caution index (MCI) for identifying judges whose patterns of recommendations suggest that their judgments might be based on incomplete information, flawed reasoning, or inattention to their standard-setting tasks. It also examines the effect on test standards and passing rates when the test standards of…
Descriptors: Criterion Referenced Tests, Error of Measurement, Evaluation Methods, High Schools
Rose, Andrew M.; And Others – 1985
This third of three volumes reports on analytic procedures conducted to address various aspects of the scalar properties of the Device Effectiveness Forecasting Technique (DEFT). DEFT, a series of microcomputer programs applied to data gathered from rating scales, is used to evaluate simulator devices used in U.S. Army weapons training. The…
Descriptors: Adults, Computer Oriented Programs, Computer Simulation, Data Interpretation

