Publication Date
| Date Range | Records |
| --- | --- |
| In 2026 | 0 |
| Since 2025 | 37 |
| Since 2022 (last 5 years) | 187 |
| Since 2017 (last 10 years) | 534 |
| Since 2007 (last 20 years) | 984 |
Descriptor
| Descriptor | Records |
| --- | --- |
| Difficulty Level | 1620 |
| Test Items | 1620 |
| Item Response Theory | 460 |
| Test Construction | 428 |
| Foreign Countries | 411 |
| Item Analysis | 377 |
| Multiple Choice Tests | 298 |
| Test Reliability | 279 |
| Test Validity | 248 |
| Comparative Analysis | 216 |
| Scores | 211 |
Author
| Author | Records |
| --- | --- |
| Tindal, Gerald | 21 |
| Alonzo, Julie | 16 |
| Plake, Barbara S. | 13 |
| Reckase, Mark D. | 12 |
| Anderson, Daniel | 9 |
| Sinharay, Sandip | 9 |
| Wise, Steven L. | 9 |
| Herrmann-Abell, Cari F. | 8 |
| Kostin, Irene | 8 |
| Park, Bitnara Jasmine | 8 |
| Bulut, Okan | 7 |
Location
| Location | Records |
| --- | --- |
| Turkey | 46 |
| Indonesia | 35 |
| Germany | 31 |
| Australia | 26 |
| Canada | 19 |
| United States | 16 |
| South Africa | 15 |
| California | 14 |
| Florida | 14 |
| China | 13 |
| Nigeria | 13 |
Laws, Policies, & Programs
| Law / Program | Records |
| --- | --- |
| No Child Left Behind Act 2001 | 3 |
| Education Consolidation… | 1 |
| Elementary and Secondary… | 1 |
| Head Start | 1 |
Leonidas Zotos; Hedderik van Rijn; Malvina Nissim – International Educational Data Mining Society, 2025
In an educational setting, an estimate of the difficulty of Multiple-Choice Questions (MCQs), a format commonly used to assess learning progress, is very useful information for both teachers and students. Since human assessment is costly from multiple points of view, automatic approaches to MCQ item difficulty estimation are…
Descriptors: Multiple Choice Tests, Test Items, Difficulty Level, Artificial Intelligence
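For context, the quantity such automatic estimators are usually trained to predict is the classical test theory difficulty of an item: the proportion of examinees who answer it correctly. A minimal sketch of that baseline quantity (illustrative only, not the authors' estimation method; all data below is invented):

```python
import numpy as np

def item_difficulty(responses: np.ndarray) -> np.ndarray:
    """Classical test theory difficulty (p-value) per item:
    the proportion of examinees answering correctly.
    `responses` is a binary matrix, rows = examinees, cols = items."""
    return responses.mean(axis=0)

# Toy data: 5 examinees x 3 items (1 = correct, 0 = incorrect)
scores = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
print(item_difficulty(scores))  # [0.8 0.2 0.4]
```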
Changiz Mohiyeddini – Anatomical Sciences Education, 2025
This article presents a step-by-step guide to using R and SPSS to bootstrap exam questions. Bootstrapping, a versatile nonparametric analytical technique, can help to improve the psychometric qualities of exam questions in the process of quality assurance. Bootstrapping is particularly useful in disciplines such as medical education, where student…
Descriptors: Test Items, Sampling, Statistical Inference, Nonparametric Statistics
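The article's worked examples are in R and SPSS; purely to illustrate the resampling idea, here is a minimal Python sketch that builds a percentile bootstrap confidence interval for one item's difficulty. Everything below (data, interval level, function name) is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_ci(item_scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for an item's difficulty (mean score):
    resample examinees with replacement, recompute the mean each time,
    and take the empirical alpha/2 and 1 - alpha/2 quantiles."""
    item_scores = np.asarray(item_scores)
    n = len(item_scores)
    means = np.array([
        item_scores[rng.integers(0, n, n)].mean()
        for _ in range(n_boot)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# 0/1 scores of 20 examinees on one exam question
scores = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]
print(bootstrap_ci(scores))  # roughly [0.50, 0.90]
```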
Rodgers, Emily; D'Agostino, Jerome V.; Berenbon, Rebecca; Johnson, Tracy; Winkler, Christa – Journal of Early Childhood Literacy, 2023
Running Records are thought to be an excellent formative assessment tool because they generate results that educators can use to make their teaching more responsive. Despite the technical nature of scoring Running Records and the kinds of important decisions that are attached to their analysis, few studies have investigated assessor accuracy. We…
Descriptors: Formative Evaluation, Scoring, Accuracy, Difficulty Level
Yun-Kyung Kim; Li Cai – National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 2025
This paper introduces an application of cross-classified item response theory (IRT) modeling to an assessment utilizing the embedded standard setting (ESS) method (Lewis & Cook). The cross-classified IRT model is used to treat both item and person effects as random, where the item effects are regressed on the target performance levels (target…
Descriptors: Standard Setting (Scoring), Item Response Theory, Test Items, Difficulty Level
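The cross-classified model itself requires specialized estimation software; as background only, the Rasch (1PL) building block underlying such IRT models expresses the probability of a correct response as a logistic function of person ability minus item difficulty:

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch (1PL) model: probability that a person with ability
    `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An average person (theta = 0) on an easy item (b = -1)
print(rasch_prob(0.0, -1.0))  # ~0.73
```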
Kseniia Marcq; Johan Braeken – Large-scale Assessments in Education, 2025
Background: Theoretical frameworks excel in conceptualising reading literacy, yet their value hinges on their applicability for real-world purposes, such as assessment. By combining diverse theoretical frameworks, the Programme for International Student Assessment (PISA) 2018 designed an assessment framework for assessing the reading literacy of…
Descriptors: International Assessment, Achievement Tests, Foreign Countries, Secondary School Students
Linh Thi Thao Le; Nam Thi Phuong Ho; Nguyen Huynh Trang; Hung Tan Ha – SAGE Open, 2025
The International English Language Testing System (IELTS) has long served as one of the most widely accepted proofs of English language proficiency. Rumors persist of a discrepancy in difficulty between its two modules, Academic (AC) and General Training (GT); however, there is little empirical evidence to confirm such a…
Descriptors: English (Second Language), Language Tests, Second Language Learning, Reading Tests
Aiman Mohammad Freihat; Omar Saleh Bani Yassin – Educational Process: International Journal, 2025
Background/purpose: This study aimed to assess the accuracy of estimating multiple-choice test item parameters under item response theory models. Materials/methods: The researchers relied on measurement accuracy indicators, which express the absolute difference between the estimated and actual values of the…
Descriptors: Accuracy, Computation, Multiple Choice Tests, Test Items
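To make that accuracy indicator concrete, here is a small parameter recovery simulation under a Rasch model: responses are simulated from known difficulties, difficulties are re-estimated with a deliberately crude plug-in estimator, and the indicator is the absolute difference between estimated and true values. Everything below is assumed for illustration, not taken from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true item difficulties; person abilities drawn from N(0, 1)
b_true = np.array([-1.0, 0.0, 1.0])
theta = rng.normal(0.0, 1.0, size=5000)

# Simulate binary responses under the Rasch model
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_true[None, :])))
responses = (rng.random(p.shape) < p).astype(float)

# Crude plug-in estimate: negative logit of each item's proportion correct
# (biased toward zero, but enough to illustrate the indicator)
p_correct = responses.mean(axis=0)
b_hat = -np.log(p_correct / (1.0 - p_correct))

# Accuracy indicator: absolute difference, estimated vs. true
print(np.abs(b_hat - b_true))
```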
Patrik Havan; Michal Kohút; Peter Halama – International Journal of Testing, 2025
Acquiescence is the tendency of participants to shift their responses toward agreement. Lechner et al. (2019) introduced the following mechanisms of acquiescence: social deference and cognitive processing. We added their interaction to a theoretical framework. The sample consists of 557 participants. We found a significant, moderately strong relationship…
Descriptors: Cognitive Processes, Attention, Difficulty Level, Reflection
Sherwin E. Balbuena – Online Submission, 2024
This study introduces a new chi-square test statistic for testing the equality of response frequencies among distracters in multiple-choice tests. The formula uses the numbers of correct and wrong answers, which become the basis for calculating the expected response frequencies per distracter. The method was…
Descriptors: Multiple Choice Tests, Statistics, Test Validity, Testing
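The article's exact statistic is not reproduced in the abstract; the sketch below shows the general shape of such a test, assuming a null hypothesis under which wrong answers are spread evenly across distracters, with expected frequencies derived from the total number of wrong answers (counts are invented):

```python
from scipy.stats import chisquare

# Observed counts of each wrong option (distracter) on one MCQ item
observed = [34, 21, 25]          # three distracters
wrong_total = sum(observed)      # 80 wrong answers overall

# Null hypothesis: all distracters are equally attractive,
# so expected counts split the wrong answers evenly
expected = [wrong_total / len(observed)] * len(observed)

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```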
Metsämuuronen, Jari – Practical Assessment, Research & Evaluation, 2023
Traditional estimators of reliability such as coefficients alpha, theta, omega, and rho (maximal reliability) are prone to give radical underestimates of reliability for tests common in educational achievement testing. Such tests are often composed of items with widely varying difficulties. This is a typical pattern where the traditional…
Descriptors: Test Reliability, Achievement Tests, Computation, Test Items
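For reference, coefficient alpha, one of the traditional estimators the article critiques, can be computed directly from the item variances and the total score variance; a minimal sketch with toy data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances
    / variance of the total score). Rows = examinees, cols = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy data: 5 examinees x 4 dichotomous items
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(cronbach_alpha(scores))
```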
Cornelia E. Neuert – Field Methods, 2025
Using masculine forms in surveys is still common practice, with researchers presumably assuming they operate in a generic way. However, the generic masculine has been found to lead to male-biased representations in various contexts. This article studies the effects of alternative gendered linguistic forms in surveys. The language forms are…
Descriptors: Language Usage, Surveys, Response Style (Tests), Gender Bias
Ali Orhan; Inan Tekin; Sedat Sen – International Journal of Assessment Tools in Education, 2025
This study aimed to translate and adapt the Computational Thinking Multidimensional Test (CTMT), developed by Kang et al. (2023), into Turkish and to investigate its psychometric qualities with Turkish university students. Following the translation procedures of the CTMT with 12 multiple-choice questions developed based on real-life…
Descriptors: Cognitive Tests, Thinking Skills, Computation, Test Validity
Tia M. Fechter; Heeyeon Yoon – Language Testing, 2024
This study evaluated the efficacy of two proposed methods in an operational standard-setting study conducted for a high-stakes language proficiency test of the U.S. government. The goal was to seek low-cost modifications to the existing Yes/No Angoff method to increase the validity and reliability of the recommended cut scores using a convergent…
Descriptors: Standard Setting, Language Proficiency, Language Tests, Evaluation Methods
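In the Yes/No Angoff method referenced here, each panelist judges, item by item, whether a minimally competent examinee would answer correctly, and a cut score is aggregated from those judgments. A minimal sketch of one common aggregation rule, with assumed panel data (the study's low-cost modifications are not reproduced):

```python
import numpy as np

# Rows = panelists, cols = items; 1 = "yes, a minimally
# competent examinee would answer this item correctly"
judgments = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0],
])

# Each panelist's recommended cut score is their count of yes items;
# the panel's cut score is the mean across panelists.
panelist_cuts = judgments.sum(axis=1)
print(panelist_cuts.mean())  # ~3.33 out of 5 items
```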
Samah AlKhuzaey; Floriana Grasso; Terry R. Payne; Valentina Tamma – International Journal of Artificial Intelligence in Education, 2024
Designing and constructing pedagogical tests that contain items (i.e. questions) which measure various types of skills for different levels of students equitably is a challenging task. Teachers and item writers alike need to ensure that the quality of assessment materials is consistent, if student evaluations are to be objective and effective.…
Descriptors: Test Items, Test Construction, Difficulty Level, Prediction
Kuan-Yu Jin; Thomas Eckes – Educational and Psychological Measurement, 2024
Insufficient effort responding (IER) refers to a lack of effort when answering survey or questionnaire items. Such items typically offer more than two ordered response categories, with Likert-type scales as the most prominent example. The underlying assumption is that the successive categories reflect increasing levels of the latent variable…
Descriptors: Item Response Theory, Test Items, Test Wiseness, Surveys
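The authors work within an IRT framework; a much simpler screening index often used to flag IER (not the authors' method) is the longstring: the longest run of identical consecutive responses in a respondent's answers. A minimal sketch with invented Likert data:

```python
from itertools import groupby

def longstring(responses):
    """Length of the longest run of identical consecutive answers;
    large values can flag insufficient effort responding (IER)."""
    return max(len(list(run)) for _, run in groupby(responses))

# 5-point Likert answers from one respondent
answers = [3, 3, 3, 3, 3, 2, 4, 4, 1, 3]
print(longstring(answers))  # 5
```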

