Publication Date
| In 2026 | 0 |
| Since 2025 | 190 |
| Since 2022 (last 5 years) | 1057 |
| Since 2017 (last 10 years) | 2567 |
| Since 2007 (last 20 years) | 4928 |
Descriptor
| Test Items | 9520 |
| Test Construction | 2711 |
| Foreign Countries | 2176 |
| Item Response Theory | 1867 |
| Difficulty Level | 1620 |
| Item Analysis | 1501 |
| Test Validity | 1412 |
| Test Reliability | 1186 |
| Multiple Choice Tests | 1152 |
| Scores | 1135 |
| Computer Assisted Testing | 1054 |
| More ▼ | |
Source
Author
Publication Type
Education Level
Audience
| Practitioners | 653 |
| Teachers | 563 |
| Researchers | 250 |
| Students | 201 |
| Administrators | 81 |
| Policymakers | 22 |
| Parents | 17 |
| Counselors | 8 |
| Community | 7 |
| Support Staff | 3 |
| Media Staff | 1 |
| More ▼ | |
Location
| Turkey | 225 |
| Canada | 223 |
| Australia | 155 |
| Germany | 116 |
| United States | 99 |
| China | 90 |
| Florida | 86 |
| Indonesia | 82 |
| Taiwan | 78 |
| United Kingdom | 73 |
| California | 65 |
| More ▼ | |
Laws, Policies, & Programs
Assessments and Surveys
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 4 |
| Meets WWC Standards with or without Reservations | 4 |
| Does not meet standards | 1 |
Haeju Lee; Kyung Yong Kim – Journal of Educational Measurement, 2025
When no prior information of differential item functioning (DIF) exists for items in a test, either the rank-based or iterative purification procedure might be preferred. The rank-based purification selects anchor items based on a preliminary DIF test. For a preliminary DIF test, likelihood ratio test (LRT) based approaches (e.g.,…
Descriptors: Test Items, Equated Scores, Test Bias, Accuracy
Tom Benton – Practical Assessment, Research & Evaluation, 2025
This paper proposes an extension of linear equating that may be useful in one of two fairly common assessment scenarios. One is where different students have taken different combinations of test forms. This might occur, for example, where students have some free choice over the exam papers they take within a particular qualification. In this…
Descriptors: Equated Scores, Test Format, Test Items, Computation
Kelsey Nason; Christine DeMars – Journal of Educational Measurement, 2025
This study examined the widely used threshold of 0.2 for Yen's Q3, an index for violations of local independence. Specifically, a simulation was conducted to investigate whether Q3 values were related to the magnitude of bias in estimates of reliability, item parameters, and examinee ability. Results showed that Q3 values below the typical cut-off…
Descriptors: Item Response Theory, Statistical Bias, Test Reliability, Test Items
Daniel P. Jurich; Matthew J. Madison – Educational Assessment, 2023
Diagnostic classification models (DCMs) are psychometric models that provide probabilistic classifications of examinees on a set of discrete latent attributes. When analyzing or constructing assessments scored by DCMs, understanding how each item influences attribute classifications can clarify the meaning of the measured constructs, facilitate…
Descriptors: Test Items, Models, Classification, Influences
Bin Tan; Nour Armoush; Elisabetta Mazzullo; Okan Bulut; Mark J. Gierl – International Journal of Assessment Tools in Education, 2025
This study reviews existing research on the use of large language models (LLMs) for automatic item generation (AIG). We performed a comprehensive literature search across seven research databases, selected studies based on predefined criteria, and summarized 60 relevant studies that employed LLMs in the AIG process. We identified the most commonly…
Descriptors: Artificial Intelligence, Test Items, Automation, Test Format
Joshua B. Gilbert; Zachary Himmelsbach; James Soland; Mridul Joshi; Benjamin W. Domingue – Journal of Policy Analysis and Management, 2025
Analyses of heterogeneous treatment effects (HTE) are common in applied causal inference research. However, when outcomes are latent variables assessed via psychometric instruments such as educational tests, standard methods ignore the potential HTE that may exist among the individual items of the outcome measure. Failing to account for…
Descriptors: Item Response Theory, Test Items, Error of Measurement, Scores
Hui Jin; Cynthia Lima; Limin Wang – Educational Measurement: Issues and Practice, 2025
Although AI transformer models have demonstrated notable capability in automated scoring, it is difficult to examine how and why these models fall short in scoring some responses. This study investigated how transformer models' language processing and quantification processes can be leveraged to enhance the accuracy of automated scoring. Automated…
Descriptors: Automation, Scoring, Artificial Intelligence, Accuracy
Leonidas Zotos; Hedderik van Rijn; Malvina Nissim – International Educational Data Mining Society, 2025
In an educational setting, an estimate of the difficulty of Multiple-Choice Questions (MCQs), a commonly used strategy to assess learning progress, constitutes very useful information for both teachers and students. Since human assessment is costly from multiple points of view, automatic approaches to MCQ item difficulty estimation are…
Descriptors: Multiple Choice Tests, Test Items, Difficulty Level, Artificial Intelligence
Yanyan Fu – Educational Measurement: Issues and Practice, 2024
The template-based automated item-generation (TAIG) approach that involves template creation, item generation, item selection, field-testing, and evaluation has more steps than the traditional item development method. Consequentially, there is more margin for error in this process, and any template errors can be cascaded to the generated items.…
Descriptors: Error Correction, Automation, Test Items, Test Construction
Stella Y. Kim; Sungyeun Kim – Educational Measurement: Issues and Practice, 2025
This study presents several multivariate Generalizability theory designs for analyzing automatic item-generated (AIG) based test forms. The study used real data to illustrate the analysis procedure and discuss practical considerations. We collected the data from two groups of students, each group receiving a different form generated by AIG. A…
Descriptors: Generalizability Theory, Automation, Test Items, Students
Neset Demirci – Turkish Online Journal of Educational Technology - TOJET, 2025
In this study, the performance of artificial intelligence chatbots--OpenAI's ChatGPT, Google Gemini, and Microsoft's Copilot--was evaluated and compared based on their responses to questions from the Turkish Higher Education Entrance Physics Examination over the past three years. Analysis of the chatbots' responses to TYT Physics questions showed…
Descriptors: Artificial Intelligence, College Entrance Examinations, Physics, Science Tests
Changiz Mohiyeddini – Anatomical Sciences Education, 2025
This article presents a step-by-step guide to using R and SPSS to bootstrap exam questions. Bootstrapping, a versatile nonparametric analytical technique, can help to improve the psychometric qualities of exam questions in the process of quality assurance. Bootstrapping is particularly useful in disciplines such as medical education, where student…
Descriptors: Test Items, Sampling, Statistical Inference, Nonparametric Statistics
Daibao Guo; Katherine Landau Wright; Lianne Josbacher; Eun Hye Son – Elementary School Journal, 2025
Limited research has explored the use of visual displays (ViDis) in science tests, making it challenging to know how these tests align with classroom instruction and what skills students need to be successful on these tests. Therefore, the current study aims to describe the use of ViDis in upper elementary grade standardized science tests. We…
Descriptors: Standardized Tests, Science Tests, Elementary Education, Science Education
Yumou Wei; Paulo Carvalho; John Stamper – International Educational Data Mining Society, 2025
Educators evaluate student knowledge using knowledge component (KC) models that map assessment questions to KCs. Still, designing KC models for large question banks remains an insurmountable challenge for instructors who need to analyze each question by hand. The growing use of Generative AI in education is expected only to aggravate this chronic…
Descriptors: Artificial Intelligence, Cluster Grouping, Student Evaluation, Test Items
Marcoulides, Katerina M. – Measurement: Interdisciplinary Research and Perspectives, 2023
Integrative data analyses have recently been shown to be an effective tool for researchers interested in synthesizing datasets from multiple studies in order to draw statistical or substantive conclusions. The actual process of integrating the different datasets depends on the availability of some common measures or items reflecting the same…
Descriptors: Data Analysis, Synthesis, Test Items, Simulation

Peer reviewed
Direct link
