| Publication Date | Results |
| --- | --- |
| In 2026 | 0 |
| Since 2025 | 0 |
| Since 2022 (last 5 years) | 1 |
| Since 2017 (last 10 years) | 2 |
| Since 2007 (last 20 years) | 5 |
| Descriptor | Results |
| --- | --- |
| Difficulty Level | 14 |
| Interrater Reliability | 14 |
| Scoring | 14 |
| Test Items | 6 |
| Evaluators | 5 |
| Higher Education | 5 |
| Standard Setting (Scoring) | 4 |
| Estimation (Mathematics) | 3 |
| Judges | 3 |
| Minimum Competency Testing | 3 |
| Scores | 3 |
| Source | Results |
| --- | --- |
| Educational Measurement:… | 2 |
| Educational and Psychological… | 2 |
| Applied Measurement in… | 1 |
| Education Digest: Essential… | 1 |
| Education Sciences | 1 |
| European Journal of Science… | 1 |
| High School Journal | 1 |
| Publication Type | Results |
| --- | --- |
| Journal Articles | 9 |
| Reports - Research | 7 |
| Reports - Evaluative | 6 |
| Speeches/Meeting Papers | 4 |
| Tests/Questionnaires | 2 |
| Reports - Descriptive | 1 |
| Education Level | Results |
| --- | --- |
| High Schools | 2 |
| Higher Education | 2 |
| Postsecondary Education | 2 |
| Secondary Education | 2 |
| Early Childhood Education | 1 |
| Elementary Education | 1 |
| Elementary Secondary Education | 1 |
| Grade 10 | 1 |
| Kindergarten | 1 |
| Primary Education | 1 |
| Audience | Results |
| --- | --- |
| Researchers | 1 |
| Location | Results |
| --- | --- |
| California | 2 |
| Florida | 1 |
| Tennessee | 1 |
| Assessments and Surveys | Results |
| --- | --- |
| edTPA (Teacher Performance… | 1 |
Parker, Mark A. J.; Hedgeland, Holly; Jordan, Sally E.; Braithwaite, Nicholas St. J. – European Journal of Science and Mathematics Education, 2023
The study covers the development and testing of the alternative mechanics survey (AMS), a modified force concept inventory (FCI), which used automatically marked free-response questions. Data were collected over a period of three academic years from 611 participants who were taking physics classes at high school and university level. A total of…
Descriptors: Test Construction, Scientific Concepts, Physics, Test Reliability
Lyness, Scott A.; Peterson, Kent; Yates, Kenneth – Education Sciences, 2021
The Performance Assessment for California Teachers (PACT) is a high stakes summative assessment that was designed to measure pre-service teacher readiness. We examined the inter-rater reliability (IRR) of trained PACT evaluators who rated 19 candidates. As measured by Cohen's weighted kappa, the overall IRR estimate was 0.17 (poor strength of…
Descriptors: High Stakes Tests, Performance Based Assessment, Teacher Effectiveness, Academic Language
Early, Diane M.; Rogge, Ronald D.; Deci, Edward L. – High School Journal, 2014
This paper investigates engagement (E), alignment (A), and rigor (R) as vital signs of high-quality teacher instruction as measured by the EAR Classroom Visit Protocol, designed by the Institute for Research and Reform in Education (IRRE). Findings indicated that both school leaders and outside raters could learn to score the protocol with…
Descriptors: Educational Quality, Learner Engagement, Alignment (Education), Difficulty Level
Sawchuk, Stephen – Education Digest: Essential Readings Condensed for Quick Review, 2010
Most experts in the testing community have presumed that the $350 million promised by the U.S. Department of Education to support common assessments would promote those that made greater use of open-ended items capable of measuring higher-order critical-thinking skills. But as measurement experts consider the multitude of possibilities for an…
Descriptors: Educational Quality, Test Items, Comparative Analysis, Multiple Choice Tests
Huang, Chiungjung – Educational and Psychological Measurement, 2009
This study examined the percentage of task-sampling variability in performance assessment via a meta-analysis. In total, 50 studies containing 130 independent data sets were analyzed. Overall results indicate that the percentage of variance for (a) differential difficulty of task was roughly 12% and (b) examinee's differential performance of the…
Descriptors: Test Bias, Research Design, Performance Based Assessment, Performance Tests
Peer reviewed
Lunz, Mary E.; And Others – Applied Measurement in Education, 1990
An extension of the Rasch model is used to obtain objective measurements for examinations graded by judges. The model calibrates elements of each facet of the examination on a common log-linear scale. Real examination data illustrate how correcting for judge severity improves the fairness of examinee measures. (SLD)
Descriptors: Certification, Difficulty Level, Interrater Reliability, Judges
Bennett, Randy Elliot; Rock, Donald A. – 1993
Formulating-Hypotheses (F-H) items present a situation and ask the examinee to generate as many explanations for it as possible. This study examined the generalizability, validity, and examinee perceptions of a computer-delivered version of the task. Eight F-H questions were administered to 192 graduate students. Half of the items restricted…
Descriptors: Computer Assisted Testing, Difficulty Level, Generalizability Theory, Graduate Students
O'Neill, Thomas R.; Lunz, Mary E. – 1996
To generalize test results beyond the particular test administration, an examinee's ability estimate must be independent of the particular items attempted, and the item difficulty calibrations must be independent of the particular sample of people attempting the items. This stability is a key concept of the Rasch model, a latent trait model of…
Descriptors: Ability, Benchmarking, Comparative Analysis, Difficulty Level
Peer reviewed
Mills, Craig N.; And Others – Educational Measurement: Issues and Practice, 1991
An approach is presented to the definition of minimal competence for judges to use in standard setting. Panelists in standard setting must receive training to ensure that differences in ratings result from differences in perceptions of item difficulty, not from differences of opinion about the definition of minimal competence. (SLD)
Descriptors: Cutting Scores, Decision Making, Definitions, Difficulty Level
Peer reviewed
Reid, Jerry B. – Educational Measurement: Issues and Practice, 1991
Training judges to generate item ratings in standard setting once the reference group has been defined is discussed. It is proposed that sensitivity to the factors that determine difficulty can be improved through training. Three criteria for determining when training is sufficient are offered. (SLD)
Descriptors: Computer Assisted Instruction, Difficulty Level, Evaluators, Interrater Reliability
Lange, Dale L.; Lowe, Pardee, Jr. – 1987
A study investigated the use of reading proficiency scales developed by the American Council on the Teaching of Foreign Languages (ACTFL), Educational Testing Service (ETS), and Interagency Language Roundtable (ILR) for meaningful rank-ordering and assigning levels of second language competence to reading passages. In a proficiency test writing…
Descriptors: College Entrance Examinations, Difficulty Level, Higher Education, Interrater Reliability
Dovell, Patricia; Buhr, Dianne C. – 1986
This study examined the difficulty level of essay topics used in the large-scale assessment of writing in relation to five different scoring models, and sought to determine what effects the scoring models would have on passing rates. In model one, the examinee's score is the direct result of a score assigned by the reader or the sum of scores assigned…
Descriptors: College Students, Difficulty Level, Essay Tests, Essays
Peer reviewed
Plake, Barbara S.; Melican, Gerald J. – Educational and Psychological Measurement, 1989
The impact of overall test length and difficulty on expert judgments of item performance made with the Nedelsky method was studied. Five university-level instructors predicting the performance of minimally competent candidates on a mathematics examination were fairly consistent in their assessments regardless of the length or difficulty of the test.…
Descriptors: Difficulty Level, Estimation (Mathematics), Evaluators, Higher Education
Wheeler, Patricia – 1991
The appropriateness of the Angoff method (W. H. Angoff, 1971) for setting standards on tests was studied. Evaluators (judges) from California school districts and teacher training institutions reviewed 15 NTE (National Teacher Examinations) Program Specialty Area Tests published by the Educational Testing Service for their appropriateness in…
Descriptors: Art Education, Biology, Difficulty Level, Elementary Secondary Education

