Daria Gerasimova – Journal of Educational Measurement, 2024
I propose two practical advances to the argument-based approach to validity: developing a living document and incorporating preregistration. First, I present a potential structure for the living document that includes an up-to-date summary of the validity argument. As the validation process may span multiple studies, the living document…
Descriptors: Validity, Documentation, Methods, Research Reports
Guo, Jinxin; Xu, Xin; Xin, Tao – Journal of Educational Measurement, 2023
Missingness due to not-reached items and omitted items has received much attention in the recent psychometric literature. Such missingness, if not handled properly, would lead to biased parameter estimation, as well as inaccurate inference of examinees, and further erode the validity of the test. This paper reviews some commonly used IRT based…
Descriptors: Psychometrics, Bias, Error of Measurement, Test Validity
Sooyong Lee; Suhwa Han; Seung W. Choi – Journal of Educational Measurement, 2024
Research has shown that multiple-indicator multiple-cause (MIMIC) models can result in inflated Type I error rates in detecting differential item functioning (DIF) when the assumption of equal latent variance is violated. This study explains how the violation of the equal variance assumption adversely impacts the detection of nonuniform DIF and…
Descriptors: Factor Analysis, Bayesian Statistics, Test Bias, Item Response Theory
Hwanggyu Lim; Danqi Zhu; Edison M. Choe; Kyung T. Han – Journal of Educational Measurement, 2024
This study presents a generalized version of the residual differential item functioning (RDIF) detection framework in item response theory, named GRDIF, to analyze differential item functioning (DIF) in multiple groups. The GRDIF framework retains the advantages of the original RDIF framework, such as computational efficiency and ease of…
Descriptors: Item Response Theory, Test Bias, Test Reliability, Test Construction
Johnson, Matthew S.; Liu, Xiang; McCaffrey, Daniel F. – Journal of Educational Measurement, 2022
With the increasing use of automated scores in operational testing settings comes the need to understand the ways in which they can yield biased and unfair results. In this paper, we provide a brief survey of some of the ways in which the predictive methods used in automated scoring can lead to biased, and thus unfair automated scores. After…
Descriptors: Psychometrics, Measurement Techniques, Bias, Automation
Chalmers, R. Philip – Journal of Educational Measurement, 2023
Several marginal effect size (ES) statistics suitable for quantifying the magnitude of differential item functioning (DIF) have been proposed in the area of item response theory; for instance, the Differential Functioning of Items and Tests (DFIT) statistics, signed and unsigned item difference in the sample statistics (SIDS, UIDS, NSIDS, and…
Descriptors: Test Bias, Item Response Theory, Definitions, Monte Carlo Methods
DeCarlo, Lawrence T.; Zhou, Xiaoliang – Journal of Educational Measurement, 2021
In signal detection rater models for constructed response (CR) scoring, it is assumed that raters discriminate equally well between different latent classes defined by the scoring rubric. An extended model that relaxes this assumption is introduced; the model recognizes that a rater may not discriminate equally well between some of the scoring…
Descriptors: Scoring, Models, Bias, Perception
Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025
While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…
Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity
Corinne Huggins-Manley; Anthony W. Raborn; Peggy K. Jones; Ted Myers – Journal of Educational Measurement, 2024
The purpose of this study is to develop a nonparametric DIF method that (a) compares focal groups directly to the composite group that will be used to develop the reported test score scale, and (b) allows practitioners to explore for DIF related to focal groups stemming from multicategorical variables that constitute a small proportion of the…
Descriptors: Nonparametric Statistics, Test Bias, Scores, Statistical Significance
Bolt, Daniel M.; Liao, Xiangyi – Journal of Educational Measurement, 2021
We revisit the empirically observed positive correlation between DIF and difficulty studied by Freedle and commonly seen in tests of verbal proficiency when comparing populations of different mean latent proficiency levels. It is shown that a positive correlation between DIF and difficulty estimates is actually an expected result (absent any true…
Descriptors: Test Bias, Difficulty Level, Correlation, Verbal Tests
Sandip Sinharay; Matthew S. Johnson – Journal of Educational Measurement, 2024
Culturally responsive assessments have been proposed as potential tools to ensure equity and fairness for examinees from all backgrounds, including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, have yet to be implemented on a large scale. Consequently, there is a…
Descriptors: Culturally Relevant Education, Evaluation, Equal Education, Disadvantaged
Lanrong Li; Betsy Jane Becker – Journal of Educational Measurement, 2021
Differential bundle functioning (DBF) has been proposed to quantify the accumulated amount of differential item functioning (DIF) in an item cluster/bundle (Douglas, Roussos, and Stout). The simultaneous item bias test (SIBTEST, Shealy and Stout) has been used to test for DBF (e.g., Walker, Zhang, and Surber). Research on DBF may have the…
Descriptors: Test Bias, Test Items, Meta Analysis, Effect Size
Huelmann, Thorben; Debelak, Rudolf; Strobl, Carolin – Journal of Educational Measurement, 2020
This study addresses the topic of how anchoring methods for differential item functioning (DIF) analysis can be used in multigroup scenarios. The direct approach would be to combine anchoring methods developed for two-group scenarios with multigroup DIF-detection methods. Alternatively, multiple tests could be carried out. The results of these…
Descriptors: Test Items, Test Bias, Equated Scores, Item Analysis
Carmen Köhler; Lale Khorramdel; Artur Pokropek; Johannes Hartig – Journal of Educational Measurement, 2024
For assessment scales applied to different groups (e.g., students from different states; patients in different countries), multigroup differential item functioning (MG-DIF) needs to be evaluated in order to ensure that respondents with the same trait level but from different groups have equal response probabilities on a particular item. The…
Descriptors: Measures (Individuals), Test Bias, Models, Item Response Theory
Yang Jiang; Mo Zhang; Jiangang Hao; Paul Deane; Chen Li – Journal of Educational Measurement, 2024
The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus…
Descriptors: Integrity, Artificial Intelligence, Technology Uses in Education, Ethics