ERIC - Search Results

Showing 1 to 15 of 19 results

New Tests of Rater Drift in Trend Scoring

Peer reviewed

John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024

Trend scoring constructed response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…

Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics

A Comparison of Manual versus Automated Quantitative Production Analysis of Connected Speech

Peer reviewed

Fromm, Davida; Katta, Saketh; Paccione, Mason; Hecht, Sophia; Greenhouse, Joel; MacWhinney, Brian; Schnur, Tatiana T. – Journal of Speech, Language, and Hearing Research, 2021

Purpose: Analysis of connected speech in the field of adult neurogenic communication disorders is essential for research and clinical purposes, yet time and expertise are often cited as limiting factors. The purpose of this project was to create and evaluate an automated program to score and compute the measures from the Quantitative Production…

Descriptors: Speech, Automation, Statistical Analysis, Adults

Statistically Comparing the Performance of Multiple Automated Raters across Multiple Items

Peer reviewed

Kieftenbeld, Vincent; Boyer, Michelle – Applied Measurement in Education, 2017

Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to…

Descriptors: Automation, Scoring, Comparative Analysis, Test Items

Validating Human and Automated Scoring of Essays against "True" Scores

Peer reviewed

Cohen, Yoav; Levi, Effi; Ben-Simon, Anat – Applied Measurement in Education, 2018

In the current study, two pools of 250 essays, all written as a response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to the essay's true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By…

Descriptors: Test Validity, Automation, Scoring, Computer Assisted Testing

The Impact of Rater Variability on Relationships among Different Effect-Size Indices for Inter-Rater Agreement between Human and Automated Essay Scoring

Yun, Jiyeo – ProQuest LLC, 2017

Since researchers investigated automatic scoring systems in writing assessments, they have dealt with relationships between human and machine scoring, and then have suggested evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used…

Descriptors: Interrater Reliability, Essays, Scoring, Evaluators

Development and Validation of the Written Communication Assessment of the "HEIghten"® Outcomes Assessment Suite. Research Report. ETS RR-17-53

Peer reviewed

Rios, Joseph A.; Sparks, Jesse R.; Zhang, Mo; Liu, Ou Lydia – ETS Research Report Series, 2017

Proficiency with written communication (WC) is critical for success in college and careers. As a result, institutions face a growing challenge to accurately evaluate their students' writing skills to obtain data that can support demands of accreditation, accountability, or curricular improvement. Many current standardized measures, however, lack…

Descriptors: Test Construction, Test Validity, Writing Tests, College Outcomes Assessment

Using Student Writing and Lexical Analysis to Reveal Student Thinking about the Role of Stop Codons in the Central Dogma

Peer reviewed

Prevost, Luanna B.; Smith, Michelle K.; Knight, Jennifer K. – CBE - Life Sciences Education, 2016

Previous work has shown that students have persistent difficulties in understanding how central dogma processes can be affected by a stop codon mutation. To explore these difficulties, we modified two multiple-choice questions from the Genetics Concept Assessment into three open-ended questions that asked students to write about how a stop codon…

Descriptors: Science Instruction, Genetics, Scientific Concepts, Scoring

Planning and Revising Written Arguments: The Effects of Two Text Structure-Based Interventions on Persuasiveness of 8th-Grade Students' Essays

Peer reviewed

Midgette, Ekaterina; Haria, Priti – Reading Psychology, 2016

The purpose of the study was to investigate the effects of two comprehensive argumentative writing interventions--Text Structure Instruction (TSI) and Text Structure Revision Instruction (TSRI)--on the eighth-grade students' ability to compose convincing essays that include structural elements of argumentative discourse. Both treatment groups…

Descriptors: Persuasive Discourse, Essays, Text Structure, Writing Instruction

Assessing Writing in Elementary Schools: Moving Away from a Focus on Mechanics

Peer reviewed

Casey, Laura B.; Miller, Neal D.; Stockton, Michelle B.; Justice, William V. – Language Assessment Quarterly, 2016

Many students struggle with writing; however, curriculum-based measures (CBM) of writing often use assessment criteria that focus primarily on mechanics. When academic development is assessed in this way, more complex aspects of a student's writing, such as the expression and development of ideas, may be neglected. The current study was a…

Descriptors: Elementary School Students, Writing (Composition), Writing Evaluation, Curriculum Based Assessment

Oral Performance Scoring Using Generalizability Theory and Many-Facet Rasch Measurement: A Comparison Study

Alkahtani, Saif F. – ProQuest LLC, 2012

The principal aim of the present study was to better guide the Quranic recitation appraisal practice by presenting an application of Generalizability theory and Many-facet Rasch Measurement Model for assessing the dependability and fit of two suggested rubrics. Recitations of 93 students were rated holistically and analytically by 3 independent…

Descriptors: Generalizability Theory, Item Response Theory, Verbal Tests, Islam

What Are They Thinking? Automated Analysis of Student Writing about Acid-Base Chemistry in Introductory Biology

Peer reviewed

Haudek, Kevin C.; Prevost, Luanna B.; Moscarella, Rosa A.; Merrill, John; Urban-Lurain, Mark – CBE - Life Sciences Education, 2012

Students' writing can provide better insight into their thinking than can multiple-choice questions. However, resource constraints often prevent faculty from using writing assessments in large undergraduate science courses. We investigated the use of computer software to analyze student writing and to uncover student ideas about chemistry in an…

Descriptors: Chemistry, Biology, Introductory Courses, Science Instruction

The Inter-Rater Reliability in Scoring Composition

Peer reviewed

Wang, Ping – English Language Teaching, 2009

This paper makes a study of the rater reliability in scoring composition in the test of English as a foreign language (EFL) and focuses on the inter-rater reliability as well as several interactions between raters and the other facets involved (that is examinees, rating criteria and rating methods). Results showed that raters were fairly…

Descriptors: Interrater Reliability, Scoring, Writing (Composition), English (Second Language)

Triangulating Evidence to Investigate the Validity of Measures: Evidence from Discussion during Instruction, Cognitive Interviews, and Written Assessments

Burmester, Kristen O'Rourke – ProQuest LLC, 2011

Classrooms are a primary site of evidence about learning. Yet classroom proceedings often occur behind closed doors and hence evidence of student learning is observable only to the classroom teacher. The informal and undocumented nature of this information means that it is rarely included in statistical models or quantifiable analyses. This…

Descriptors: Evidence, Student Evaluation, Educational Research, Validity

Making Essay Test Scores Fairer with Statistics. ETS Program Statistics Research Technical Report No. 89-90.

Braun, Henry I.; Wainer, Howard – 1989

A desirable goal would be to develop a methodology for scoring essays so that the final grades are less affected by when or by whom each essay was read. It seems sensible to derive such grades by somehow adjusting the ratings originally given by each reader. This essay describes a solution that relies on statistical adjustment, using the context…

Descriptors: Essay Tests, Estimation (Mathematics), Interrater Reliability, Scoring

Estimating Rater Agreement in 2 x 2 Tables: Correction for Chance and Intraclass Correlation.

Peer reviewed

Blackman, Nicole J-M.; Koval, John J. – Applied Psychological Measurement, 1993

Four indexes of agreement between ratings of a person that correct for chance and are interpretable as intraclass correlation coefficients for different analysis of variance models are investigated. Relationships among the estimators are established for finite samples, and the equivalence of these estimators in large samples is demonstrated. (SLD)

Descriptors: Analysis of Variance, Equations (Mathematics), Estimation (Mathematics), Interrater Reliability
