Showing 1 to 15 of 24 results
Peer reviewed
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
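The embedding-plus-scoring-layer design the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes essay embeddings have already been produced by a multilingual sentence encoder such as mBERT or LaBSE, uses random vectors as stand-ins for those embeddings, and substitutes closed-form ridge regression for whatever scoring layer the AES system actually trains.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_predict(X, w):
    return X @ w

# Toy stand-in for sentence embeddings of human-scored essays
# (n essays, d embedding dimensions).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))   # "embeddings" of training essays
true_w = rng.normal(size=16)
y_train = X_train @ true_w + rng.normal(scale=0.1, size=200)  # "human scores"

w = ridge_fit(X_train, y_train)
X_new = rng.normal(size=(5, 16))       # "embeddings" of unscored essays
scores = ridge_predict(X_new, w)
```

Because a language-agnostic encoder maps all languages into one shared embedding space, a single set of regression weights like `w` could in principle score German, Italian, and Czech essays alike, which is the premise such multilingual AES systems rest on.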
Peer reviewed
Wesolowski, Brian C.; Wind, Stefanie A. – Journal of Educational Measurement, 2019
Rater-mediated assessments are a common methodology for measuring persons, investigating rater behavior, and/or defining latent constructs. The purpose of this article is to provide a pedagogical framework for examining rater variability in the context of rater-mediated assessments using three distinct models. The first model is the observation…
Descriptors: Interrater Reliability, Models, Observation, Measurement
Peer reviewed
Lane, Suzanne – Journal of Educational Measurement, 2019
Rater-mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the…
Descriptors: Responses, Accuracy, Validity, Interrater Reliability
Peer reviewed
Jin, Kuan-Yu; Wang, Wen-Chung – Journal of Educational Measurement, 2018
The Rasch facets model was developed to account for facet data, such as student essays graded by raters, but it accounts for only one kind of rater effect (severity). In practice, raters may exhibit various tendencies such as using middle or extreme scores in their ratings, which is referred to as the rater centrality/extremity response style. To…
Descriptors: Scoring, Models, Interrater Reliability, Computation
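The severity effect the abstract says the Rasch facets model captures can be illustrated with a minimal sketch. The dichotomous case is used here for simplicity, and all parameter values (ability, difficulty, severities) are invented for illustration.

```python
import math

def p_success(theta, delta, severity):
    """Many-facet Rasch model, dichotomous case:
    P(x = 1) = exp(theta - delta - severity) / (1 + exp(theta - delta - severity)),
    where theta is examinee ability, delta is item difficulty, and
    severity is the rater facet."""
    logit = theta - delta - severity
    return 1.0 / (1.0 + math.exp(-logit))

theta, delta = 1.0, 0.0                      # invented example values
lenient = p_success(theta, delta, severity=-0.5)
severe = p_success(theta, delta, severity=0.5)
# The severe rater lowers the probability of a high rating uniformly.
```

A uniform shift like this is all the severity facet can express; a centrality/extremity response style (overusing middle or extreme categories) leaves the rater's average unchanged, which is why the abstract's extended model is needed to capture it.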
Peer reviewed
Nieto, Ricardo; Casabianca, Jodi M. – Journal of Educational Measurement, 2019
Many large-scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically, simple structured tests such as these, rely on multiple sections of multiple-choice and/or constructed-response items to generate multiple…
Descriptors: Tests, Scoring, Responses, Test Items
Peer reviewed
Palermo, Corey; Bunch, Michael B.; Ridge, Kirk – Journal of Educational Measurement, 2019
Although much attention has been given to rater effects in rater-mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large-scale, multi-state summative assessment program.…
Descriptors: Scoring, Interrater Reliability, Measurement, Summative Evaluation
Peer reviewed
Humphry, Stephen Mark; Heldsinger, Sandy – Journal of Educational Measurement, 2019
To capitalize on professional expertise in educational assessment, it is desirable to develop and test methods of rater-mediated assessment that enable classroom teachers to make reliable and informative judgments. Accordingly, this article investigates the reliability of a two-stage method used by classroom teachers to assess primary school…
Descriptors: Essays, Elementary School Students, Writing (Composition), Writing Evaluation
Peer reviewed
Wesolowski, Brian C. – Journal of Educational Measurement, 2019
The purpose of this study was to build a Random Forest supervised machine learning model in order to predict musical rater-type classifications based upon a Rasch analysis of raters' differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9-12) were…
Descriptors: Item Response Theory, Prediction, Classification, Artificial Intelligence
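The pipeline the abstract outlines (Rasch-derived severity/leniency estimates as features, rater type as label) can be sketched as below. The features and labels are synthetic stand-ins invented for illustration, and scikit-learn's `RandomForestClassifier` substitutes for whatever implementation the study actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_raters = 142
# Synthetic stand-in for per-rater Rasch estimates: an overall severity
# estimate plus differential severity on a handful of rubric items.
severity = rng.normal(size=(n_raters, 1))
item_bias = rng.normal(scale=0.3, size=(n_raters, 4))
X = np.hstack([severity, item_bias])
# Hypothetical binary rater-type labels: severe (1) vs. lenient (0).
y = (severity.ravel() > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

The supervised step is only as good as the upstream Rasch calibration: the forest learns a mapping from estimated severity patterns to rater types, so misfit in the measurement model propagates directly into the classifications.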
Peer reviewed
Ke, Xiaohua; Zeng, Yongqiang; Luo, Haijiao – Journal of Educational Measurement, 2016
This article presents a novel method, the Complex Dynamics Essay Scorer (CDES), for automated essay scoring using complex network features. Texts produced by college students in China were represented as scale-free networks (e.g., a word adjacency model) from which typical network features, such as the in-/out-degrees, clustering coefficient (CC),…
Descriptors: Scoring, Automation, Essays, Networks
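The word-adjacency representation the abstract describes can be sketched in plain Python. This minimal version computes only distinct-neighbor in-/out-degrees; the clustering coefficient and the other network features the CDES uses are omitted, and the sample text is invented.

```python
from collections import defaultdict

def adjacency_network(text):
    """Build a directed word-adjacency network: an edge w1 -> w2 for
    each pair of consecutive words in the text."""
    words = text.lower().split()
    edges = defaultdict(set)
    for w1, w2 in zip(words, words[1:]):
        edges[w1].add(w2)
    return edges

def degrees(edges):
    """Out-degree = number of distinct successors of a word;
    in-degree = number of distinct predecessors."""
    out_deg = {w: len(succ) for w, succ in edges.items()}
    in_deg = defaultdict(int)
    for succ in edges.values():
        for w in succ:
            in_deg[w] += 1
    return out_deg, dict(in_deg)

net = adjacency_network("the cat sat on the mat the cat ran")
out_deg, in_deg = degrees(net)   # e.g. "the" links out to {cat, mat}
```

In longer texts such networks tend to show the heavy-tailed degree distribution characteristic of scale-free networks, and summary statistics of that distribution are the kind of feature a scorer like the CDES can feed into its model.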
Peer reviewed
Wang, Wen-Chung; Su, Chi-Ming; Qiu, Xue-Lan – Journal of Educational Measurement, 2014
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning…
Descriptors: Item Response Theory, Interrater Reliability, Models, Correlation
Peer reviewed
Harrison, George M. – Journal of Educational Measurement, 2015
The credibility of standard-setting cut scores depends in part on two sources of consistency evidence: intrajudge and interjudge consistency. Although intrajudge consistency feedback has often been provided to Angoff judges in practice, more evidence is needed to determine whether it achieves its intended effect. In this randomized experiment with…
Descriptors: Interrater Reliability, Standard Setting (Scoring), Cutting Scores, Feedback (Response)
Peer reviewed
Raczynski, Kevin R.; Cohen, Allan S.; Engelhard, George, Jr.; Lu, Zhenqiu – Journal of Educational Measurement, 2015
There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large-scale writing assessments. This study compared the effectiveness of two widely used rater training methods--self-paced and collaborative…
Descriptors: Interrater Reliability, Writing Evaluation, Training Methods, Pacing
Peer reviewed
Tate, Richard L. – Journal of Educational Measurement, 1999
Suggests that a modification of traditional linking is necessary when tests consist of constructed response items judged by raters and a possibility of year-to-year variation in rating discrimination and severity exists. Illustrates this situation with an artificial example. (SLD)
Descriptors: Equated Scores, Interrater Reliability, Item Response Theory, Multiple Choice Tests
Peer reviewed
Blok, H. – Journal of Educational Measurement, 1985
Raters judged essays on two occasions making it possible to address the question of whether multiple ratings, however obtained, represent the same true scores. Multiple ratings of a given rater did represent the same true scores, but ratings of different raters did not. Reliability, validity, and invalidity coefficients were computed. (Author/DWH)
Descriptors: Analysis of Variance, Elementary Education, Essay Tests, Evaluators
Peer reviewed
Nevo, Baruch – Journal of Educational Measurement, 1985
A literature review and a proposed means of measuring face validity, a test's appearance of being valid, are presented. Empirical evidence from examinees' perceptions of a college entrance examination support the reliability of measuring face validity. (GDC)
Descriptors: College Entrance Examinations, Evaluation Methods, Evaluators, Foreign Countries