ERIC - Search Results

Publication Date

In 2026	0
Since 2025	4
Since 2022 (last 5 years)	5
Since 2017 (last 10 years)	6
Since 2007 (last 20 years)	7

Descriptor

Evaluation Methods	10
Test Reliability	10
Test Validity	5
Scores	4
Accuracy	3
Student Evaluation	3
Test Format	3
Computer Assisted Testing	2
Criterion Referenced Tests	2
Cutting Scores	2
Error of Measurement	2
Test Construction	2
Tests	2
Academic Standards	1
Alternative Assessment	1
Assessment Literacy	1
Attribution Theory	1
Automation	1
Classification	1
College Students	1
Comparative Testing	1
Data	1
Decision Making	1
Demography	1
Diagnostic Tests	1
More ▼

Source

Journal of Educational…

Publication Type

Journal Articles	9
Reports - Research	7
Guides - Non-Classroom	1
Information Analyses	1
Reports - Evaluative	1

Education Level

Higher Education	2
Postsecondary Education	2

Audience

Location

Laws, Policies, & Programs

Assessments and Surveys

What Works Clearinghouse Rating

Showing all 10 results Save | Export

Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems

Peer reviewed

Direct link

Thompson, W. Jake; Nash, Brooke; Clark, Amy K.; Hoover, Jeffrey C. – Journal of Educational Measurement, 2023

As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment…

Descriptors: Diagnostic Tests, Simulation, Test Reliability, Accuracy

Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System

Peer reviewed

Direct link

Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…

Descriptors: Automation, Grading, Computer Assisted Testing, Scoring

Studying Score Stability with a Harmonic Regression Family: A Comparison of Three Approaches to Adjustment of Examinee-Specific Demographic Data

Peer reviewed

Direct link

Lee, Yi-Hsuan; Haberman, Shelby J. – Journal of Educational Measurement, 2021

For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some are expected while others may be…

Descriptors: Scores, Regression (Statistics), Demography, Data

Using Automated Procedures to Score Educational Essays Written in Three Languages

Peer reviewed

Direct link

Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025

The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…

Descriptors: College Students, Slavic Languages, German, Italian

A Note on the Use of Categorical Subscores

Peer reviewed

Direct link

Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025

Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…

Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment

Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment

Peer reviewed

Direct link

Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…

Descriptors: Tests, Testing, Scores, Test Construction

Maintaining Equivalent Cut Scores for Small Sample Test Forms

Peer reviewed

Direct link

Dwyer, Andrew C. – Journal of Educational Measurement, 2016

This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common-item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common-item equating methodology to standard setting ratings to account for…

Descriptors: Cutting Scores, Equivalency Tests, Test Format, Academic Standards

A Consumers' Guide to Criterion-Referenced Test Reliability. Reliability.

Peer reviewed

Berk, Ronald A. – Journal of Educational Measurement, 1980

A dozen different approaches that yield 13 reliability indices for criterion-referenced tests were identified and grouped into three categories: threshold loss function, squared-error loss function, and domain score estimation. Indices were evaluated within each category. (Author/RL)

Descriptors: Classification, Criterion Referenced Tests, Cutting Scores, Evaluation Methods

An Empirical Investigation of the Applicability of Multiple Matrix Sampling to the Method of Rank Order.

Peer reviewed

Askegaard, Lewis D.; Umila, Benwardo V. – Journal of Educational Measurement, 1982

Multiple matrix sampling of items and examinees was applied to an 18-item rank order instrument administered to a randomly assigned group and compared to the ordering and ranking of all items by control subjects. High correlations between ranks suggest the methodology may viably reduce respondent effort on long rank ordering tasks. (Author/CM)

Descriptors: Evaluation Methods, Item Sampling, Junior High Schools, Student Reaction

An Evaluation Model for Mastery Testing

Peer reviewed

Emrick, John A. – Journal of Educational Measurement, 1971

Descriptors: Criterion Referenced Tests, Error of Measurement, Evaluation Methods, Item Analysis

Amery D. Wu	1
Askegaard, Lewis D.	1
Berk, Ronald A.	1
Clark, Amy K.	1
Dwyer, Andrew C.	1
Emrick, John A.	1
Haberman, Shelby J.	1
Hamid Mohammadi	1
Hoover, Jeffrey C.	1
Jake Stone	1
Jinnie Shin	1
Kylie Gorney	1
Lee, Yi-Hsuan	1
Mark J. Gierl	1
Nash, Brooke	1
Sandip Sinharay	1
Shun-Fu Hu	1
Tahereh Firoozi	1
Thompson, W. Jake	1
Umila, Benwardo V.	1
Wallace N. Pinto Jr.	1
More ▼