ERIC - Search Results

Publication Date

In 2026	0
Since 2025	6
Since 2022 (last 5 years)	17
Since 2017 (last 10 years)	40
Since 2007 (last 20 years)	68

Descriptor

Error of Measurement	116
Interrater Reliability	116
Test Reliability	29
Evaluation Methods	26
Correlation	25
Generalizability Theory	24
Scores	24
Scoring	23
Foreign Countries	17
Measurement Techniques	17
Rating Scales	17
Test Validity	16
Psychometrics	15
Statistical Analysis	14
Higher Education	13
Reliability	12
Evaluators	11
Item Response Theory	11
Performance Based Assessment	11
Language Tests	10
Accuracy	9
Children	9
Scoring Rubrics	9
Student Evaluation	9
Test Items	9
More ▼

Publication Type

Journal Articles	86
Reports - Research	77
Reports - Evaluative	22
Speeches/Meeting Papers	15
Reports - Descriptive	11
Numerical/Quantitative Data	5
Opinion Papers	3
Tests/Questionnaires	3
Dissertations/Theses -…	2
ERIC Digests in Full Text	1
ERIC Publications	1
Guides - Non-Classroom	1
More ▼

Education Level

Higher Education	10
Postsecondary Education	10
Elementary Education	6
Elementary Secondary Education	5
Middle Schools	3
Secondary Education	3
Grade 4	2
Intermediate Grades	2
Junior High Schools	2
Adult Education	1
Early Childhood Education	1
Grade 1	1
Grade 3	1
Grade 5	1
Grade 6	1
Grade 7	1
High Schools	1
Primary Education	1
More ▼

Audience

Researchers	9
Administrators	2
Counselors	1

Location

Canada	2
Netherlands	2
New Mexico	2
Turkey	2
United Kingdom	2
California	1
Canada (Toronto)	1
China (Beijing)	1
Finland	1
Florida	1
Illinois	1
Japan	1
Netherlands (Amsterdam)	1
Nevada	1
Ohio	1
Oklahoma	1
Rhode Island	1
Taiwan	1
Taiwan (Taipei)	1
United Kingdom (England)	1
More ▼

Laws, Policies, & Programs

Assessments and Surveys

Advanced Placement…	1
Alabama High School…	1
Cognitive Abilities Test	1
Iowa Tests of Basic Skills	1
National Assessment of…	1
Praxis Series	1
Stanford Binet Intelligence…	1
Test of English as a Foreign…	1
Trends in International…	1
Wechsler Intelligence Scale…	1
Work Keys (ACT)	1
More ▼

What Works Clearinghouse Rating

Showing 1 to 15 of 116 results Save | Export

Technical Adequacy-Reliability

Peer reviewed

Direct link

Susan K. Johnsen – Gifted Child Today, 2025

The author provides information about reliability and areas that educators should examine in determining if an assessment is consistent and trustworthy for use, and how it should be interpreted in making decisions about students. Reliability areas that are discussed in the column include internal consistency, test-retest or stability, inter-scorer…

Descriptors: Test Reliability, Academically Gifted, Student Evaluation, Error of Measurement

Grading Exams Using Large Language Models: A Comparison between Human and AI Grading of Exams in Higher Education Using ChatGPT

Peer reviewed

Direct link

Jonas Flodén – British Educational Research Journal, 2025

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…

Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring

New Tests of Rater Drift in Trend Scoring

Peer reviewed

Direct link

John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024

Trend scoring constructed response items (i.e. rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…

Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics

The Vague Language Use Scale: Clinical Utility and Psychometrics from Adults with Traumatic Brain Injury

Peer reviewed

Direct link

Kathryn J. Greenslade; Julia K. Bushell; Emily F. Dillon; Amy E. Ramage – International Journal of Language & Communication Disorders, 2025

Background: Pragmatic communication difficulties encompass many distinct behaviours, including the use of vague and/or insufficient language, a common characteristic following traumatic brain injury (TBI) that negatively impacts psychosocial outcomes. Existing assessments evaluate pragmatic communication broadly, often with only one or two items…

Descriptors: Neurological Impairments, Head Injuries, Language Impairments, Language Tests

Evidence-Based Evaluation of Student and Marker Performances in Assessment and Examination

Peer reviewed

Direct link

Ole J. Kemi – Advances in Physiology Education, 2025

Students are assessed by coursework and/or exams, all of which are marked by assessors (markers). Student and marker performances are then subject to end-of-session board of examiner handling and analysis. This occurs annually and is the basis for evaluating students but also the wider learning and teaching efficiency of an academic institution.…

Descriptors: Undergraduate Students, Evaluation Methods, Evaluation Criteria, Academic Standards

Comparison of the Results of the Generalizability Theory with the Inter-Rater Agreement Coefficients

Peer reviewed
PDF on ERIC

Download full text

Eser, Mehmet Taha; Aksu, Gökhan – International Journal of Curriculum and Instruction, 2022

The agreement between raters is examined within the scope of the concept of "inter-rater reliability". Although there are clear definitions of the concepts of agreement between raters and reliability between raters, there is no clear information about the conditions under which agreement and reliability level methods are appropriate to…

Descriptors: Generalizability Theory, Interrater Reliability, Evaluation Methods, Test Theory

The Reliability of Simultaneous versus Individual Data Collection during Stuttering Assessment

Peer reviewed

Direct link

Davidow, Jason H.; Ye, Jun; Edge, Robin L. – International Journal of Language & Communication Disorders, 2023

Background: Speech-language pathologists often multitask in order to be efficient with their commonly large caseloads. In stuttering assessment, multitasking often involves collecting multiple measures simultaneously. Aims: The present study sought to determine reliability when collecting multiple measures simultaneously versus individually.…

Descriptors: Graduate Students, Measurement, Reliability, Group Activities

What Makes Children's Responses to Creativity Assessments Difficult to Judge Reliably?

Peer reviewed
PDF on ERIC

Download full text

Direct link

Denis Dumas; Selcuk Acar; Kelly Berthiaume; Peter Organisciak; David Eby; Katalin Grajzel; Theadora Vlaamster; Michele Newman; Melanie Carrera – Grantee Submission, 2023

Open-ended verbal creativity assessments are commonly administered in psychological research and in educational practice to elementary-aged children. Children's responses are then typically rated by teams of judges who are trained to identify original ideas, hopefully with a degree of inter-rater agreement. Even in cases where the judges are…

Descriptors: Elementary School Students, Grade 3, Grade 4, Grade 5

Monitoring Rater Quality in Observational Systems: Issues Due to Unreliable Estimates of Rater Quality

Peer reviewed

Direct link

Mark White; Matt Ronfeldt – Educational Assessment, 2024

Standardized observation systems seek to reliably measure a specific conceptualization of teaching quality, managing rater error through mechanisms such as certification, calibration, validation, and double-scoring. These mechanisms both support high quality scoring and generate the empirical evidence used to support the scoring inference (i.e.,…

Descriptors: Interrater Reliability, Quality Control, Teacher Effectiveness, Error Patterns

Inter-Rater Reliability Methods in Qualitative Case Study Research

Peer reviewed

Direct link

Rosanna Cole – Sociological Methods & Research, 2024

The use of inter-rater reliability (IRR) methods may provide an opportunity to improve the transparency and consistency of qualitative case study data analysis in terms of the rigor of how codes and constructs have been developed from the raw data. Few articles on qualitative research methods in the literature conduct IRR assessments or neglect to…

Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Research Methodology

Exploring Rating Quality in the Context of High-Stakes Rater-Mediated Educational Assessments

Direct link

Wenjing Guo – ProQuest LLC, 2021

Constructed response (CR) items are widely used in large-scale testing programs, including the National Assessment of Educational Progress (NAEP) and many district and state-level assessments in the United States. One unique feature of CR items is that they depend on human raters to assess the quality of examinees' work. The judgment of human…

Descriptors: National Competency Tests, Responses, Interrater Reliability, Error of Measurement

What You Don't Know about Measurement Error--And Why You Should Care

Direct link

Lichtenstein, Robert – Communique, 2020

Appropriate interpretation of assessment data requires an appreciation that tools are subject to measurement error. School psychologists recognize, at least on an intellectual level, that measures are imperfect--that test scores and other quantitative measures (e.g., rating scales, systematic behavioral observations) are best estimates of…

Descriptors: Error of Measurement, Test Reliability, Pretests Posttests, Standardized Tests

Statistically Guided Grading Judgements: Contextualisation or Contamination?

Peer reviewed

Direct link

Louise Badham – Oxford Review of Education, 2025

Different sources of assessment evidence are reviewed during International Baccalaureate (IB) grade awarding to convert marks into grades and ensure fair results for students. Qualitative and quantitative evidence are analysed to determine grade boundaries, with statistical evidence weighed against examiner judgement and teachers' feedback on…

Descriptors: Advanced Placement Programs, Grading, Interrater Reliability, Evaluative Thinking

All Types of Experience Are Equal, but Some Are More Equal: The Effect of Different Types of Experience on Rater Severity and Rater Consistency

Peer reviewed

Direct link

Reeta Neittaanmäki; Iasonas Lamprianou – Language Testing, 2024

This article focuses on rater severity and consistency and their relation to different types of rater experience over a long period of time. The article is based on longitudinal data collected from 2009 to 2019 from the second language Finnish speaking subtest in the National Certificates of Language Proficiency in Finland. The study investigated…

Descriptors: Foreign Countries, Interrater Reliability, Error of Measurement, Experience

Online Administration of the Test of Narrative Language--Second Edition: Psychometrics and Considerations for Remote Assessment

Peer reviewed
PDF on ERIC

Download full text

Direct link

Beula M. Magimairaj; Philip Capin; Sandra L. Gillam; Sharon Vaughn; Greg Roberts; Anna-Maria Fall; Ronald B. Gillam – Grantee Submission, 2022

Purpose: Our aim was to evaluate the psychometric properties of the online administered format of the Test of Narrative Language--Second Edition (TNL-2; Gillam & Pearson, 2017), given the importance of assessing children's narrative ability and considerable absence of psychometric studies of spoken language assessments administered online.…

Descriptors: Computer Assisted Testing, Language Tests, Story Telling, Language Impairments

Previous Page | Next Page »

Pages: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

Educational and Psychological…	7
Educational Assessment	4
International Journal of…	3
Language Testing	3
Sociological Methods &…	3
Applied Psychological…	2
Developmental Medicine &…	2
ETS Research Report Series	2
Educational Measurement:…	2
Evaluation and the Health…	2
Grantee Submission	2
IDEA Center, Inc.	2
Language Assessment Quarterly	2
Measurement in Physical…	2
New Mexico Public Education…	2
ProQuest LLC	2
Research in Developmental…	2
Advances in Health Sciences…	1
Advances in Physiology…	1
Alberta Journal of…	1
American Journal of Evaluation	1
American Journal on…	1
Applied Measurement in…	1
Assessment	1
Athletic Training Education…	1
More ▼

Anna-Maria Fall	2
Benton, Stephen L.	2
Beula M. Magimairaj	2
Greg Roberts	2
Li, Dan	2
McCaffrey, Daniel F.	2
Philip Capin	2
Ronald B. Gillam	2
Sandra L. Gillam	2
Sharon Vaughn	2
Aksu, Gökhan	1
Alkahtani, Saif F.	1
Almond, Patricia	1
Alyssa M. Merbler	1
Amy E. Ramage	1
Anderson, Michele A.	1
Applegate, E. Brooks	1
Arends-Tòth, Judit	1
Aulie, Vibeke Smith	1
Bais, Frank	1
Bannister, Brendan D.	1
Bardhoshi, Gerta	1
Bartels, Meike	1
Bartoš, František	1
More ▼