Publication Date
| In 2026 | 0 |
| Since 2025 | 4 |
| Since 2022 (last 5 years) | 5 |
| Since 2017 (last 10 years) | 6 |
| Since 2007 (last 20 years) | 7 |
Descriptor
| Evaluation Methods | 10 |
| Test Reliability | 10 |
| Test Validity | 5 |
| Scores | 4 |
| Accuracy | 3 |
| Student Evaluation | 3 |
| Test Format | 3 |
| Computer Assisted Testing | 2 |
| Criterion Referenced Tests | 2 |
| Cutting Scores | 2 |
| Error of Measurement | 2 |
| More ▼ | |
Source
| Journal of Educational… | 10 |
Author
| Amery D. Wu | 1 |
| Askegaard, Lewis D. | 1 |
| Berk, Ronald A. | 1 |
| Clark, Amy K. | 1 |
| Dwyer, Andrew C. | 1 |
| Emrick, John A. | 1 |
| Haberman, Shelby J. | 1 |
| Hamid Mohammadi | 1 |
| Hoover, Jeffrey C. | 1 |
| Jake Stone | 1 |
| Jinnie Shin | 1 |
| More ▼ | |
Publication Type
| Journal Articles | 9 |
| Reports - Research | 7 |
| Guides - Non-Classroom | 1 |
| Information Analyses | 1 |
| Reports - Evaluative | 1 |
Education Level
| Higher Education | 2 |
| Postsecondary Education | 2 |
Audience
Location
Laws, Policies, & Programs
Assessments and Surveys
What Works Clearinghouse Rating
Thompson, W. Jake; Nash, Brooke; Clark, Amy K.; Hoover, Jeffrey C. – Journal of Educational Measurement, 2023
As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment…
Descriptors: Diagnostic Tests, Simulation, Test Reliability, Accuracy
Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025
In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…
Descriptors: Automation, Grading, Computer Assisted Testing, Scoring
Lee, Yi-Hsuan; Haberman, Shelby J. – Journal of Educational Measurement, 2021
For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some are expected while others may be…
Descriptors: Scores, Regression (Statistics), Demography, Data
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025
Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…
Descriptors: Tests, Testing, Scores, Test Construction
Dwyer, Andrew C. – Journal of Educational Measurement, 2016
This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common-item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common-item equating methodology to standard setting ratings to account for…
Descriptors: Cutting Scores, Equivalency Tests, Test Format, Academic Standards
Peer reviewedBerk, Ronald A. – Journal of Educational Measurement, 1980
A dozen different approaches that yield 13 reliability indices for criterion-referenced tests were identified and grouped into three categories: threshold loss function, squared-error loss function, and domain score estimation. Indices were evaluated within each category. (Author/RL)
Descriptors: Classification, Criterion Referenced Tests, Cutting Scores, Evaluation Methods
Peer reviewedAskegaard, Lewis D.; Umila, Benwardo V. – Journal of Educational Measurement, 1982
Multiple matrix sampling of items and examinees was applied to an 18-item rank order instrument administered to a randomly assigned group and compared to the ordering and ranking of all items by control subjects. High correlations between ranks suggest the methodology may viably reduce respondent effort on long rank ordering tasks. (Author/CM)
Descriptors: Evaluation Methods, Item Sampling, Junior High Schools, Student Reaction
Peer reviewedEmrick, John A. – Journal of Educational Measurement, 1971
Descriptors: Criterion Referenced Tests, Error of Measurement, Evaluation Methods, Item Analysis

Direct link
