Showing 1 to 15 of 24 results
Peer reviewed
Donoghue, John R.; McClellan, Catherine A.; Hess, Melinda R. – ETS Research Report Series, 2022
When constructed-response items are administered for a second time, it is necessary to evaluate whether the current Time B administration's raters have drifted from the scoring of the original administration at Time A. To study this, Time A papers are sampled and rescored by Time B scorers. Commonly the scores are compared using the proportion of…
Descriptors: Item Response Theory, Test Construction, Scoring, Testing
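As a minimal sketch of the rescore comparison this abstract describes (the statistic following "proportion of" is truncated, so treating it as the proportion of exact score matches is an assumption; the data below are invented), one common check computes exact and adjacent agreement between the Time A originals and the Time B rescores of the same papers:

```python
# Hypothetical rescore data: each pair is (Time A score, Time B rescore)
# for the same sampled paper. Values are illustrative only.
pairs = [(3, 3), (2, 2), (4, 3), (1, 1), (3, 4), (2, 2), (4, 4), (3, 3)]

exact = sum(1 for a, b in pairs if a == b) / len(pairs)
adjacent = sum(1 for a, b in pairs if abs(a - b) <= 1) / len(pairs)

print(f"exact agreement:    {exact:.2f}")     # 0.75
print(f"adjacent agreement: {adjacent:.2f}")  # 1.00
```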
Peer reviewed
Bosch, Nigel; Paquette, Luc – Journal of Learning Analytics, 2018
Metrics including Cohen's kappa, precision, recall, and F1 are common measures of performance for models of discrete student states, such as a student's affect or behaviour. This study examined discrete model metrics for previously published student model examples to identify situations where metrics provided differing perspectives on…
Descriptors: Models, Comparative Analysis, Prediction, Probability
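The metrics this abstract names are standard and easy to reproduce; a brief Python sketch using scikit-learn's implementations (toy labels only, not the study's data or detectors):

```python
# Illustrative only: toy binary labels standing in for a detector of a
# discrete student state (e.g., 1 = "off-task", 0 = "on-task").
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score, f1_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # hypothetical ground-truth codes
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # hypothetical model predictions

print("kappa    :", cohen_kappa_score(y_true, y_pred))  # ~0.58
print("precision:", precision_score(y_true, y_pred))    # 0.75
print("recall   :", recall_score(y_true, y_pred))       # 0.75
print("F1       :", f1_score(y_true, y_pred))           # 0.75
```

With skewed class distributions these metrics can diverge sharply, which is the kind of disagreement among metrics the study examines.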
Peer reviewed
West, Brady T.; Li, Dan – Sociological Methods & Research, 2019
In face-to-face surveys, interviewer observations are a cost-effective source of paradata for nonresponse adjustment of survey estimates and responsive survey designs. Unfortunately, recent studies have suggested that the accuracy of these observations can vary substantially among interviewers, even after controlling for household-, area-, and…
Descriptors: Observation, Interviews, Error of Measurement, Accuracy
Peer reviewed
Conger, Anthony J. – Educational and Psychological Measurement, 2017
Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa. Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another…
Descriptors: Interrater Reliability, Evaluators, Accuracy, Statistical Analysis
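The effect of category marginals that Conger describes can be illustrated numerically: two 2x2 agreement tables with identical observed agreement but different marginals yield very different kappa values. A minimal sketch with invented counts (not taken from the article):

```python
# kappa = (p_o - p_e) / (1 - p_e), where chance agreement p_e depends on
# each rater's category marginals.
def kappa_2x2(a, b, c, d):
    """2x2 agreement table: a = both 'yes', d = both 'no', b/c = disagreements."""
    n = a + b + c + d
    p_o = (a + d) / n                                # observed agreement
    p_yes1, p_yes2 = (a + b) / n, (a + c) / n        # rater 'yes' marginals
    p_e = p_yes1 * p_yes2 + (1 - p_yes1) * (1 - p_yes2)
    return (p_o - p_e) / (1 - p_e)

print(kappa_2x2(45, 5, 5, 45))  # balanced marginals -> kappa = 0.80
print(kappa_2x2(85, 5, 5, 5))   # skewed marginals   -> kappa ~ 0.44
```

Both tables show 90% raw agreement, yet kappa drops from 0.80 to roughly 0.44 as the marginals become skewed, because the expected chance agreement rises.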
Peer reviewed
Chang, Briana L.; Cromley, Jennifer G.; Tran, Nhi – International Journal of Science and Mathematics Education, 2016
Coordination of multiple representations (CMR) is widely recognized as a critical skill in mathematics and is frequently demanded in reform calculus textbooks. However, little is known about the prevalence of coordination tasks in such textbooks. We coded 707 instances of CMR in a widely used reform calculus textbook and analyzed the distributions…
Descriptors: Calculus, Textbooks, Teaching Methods, Mathematics Instruction
Peer reviewed
Uto, Masaki; Ueno, Maomi – IEEE Transactions on Learning Technologies, 2016
As an assessment method based on a constructivist approach, peer assessment has become popular in recent years. However, a problem remains in peer assessment: reliability depends on rater characteristics. For this reason, some item response models that incorporate rater parameters have been proposed. Those models are expected to improve…
Descriptors: Item Response Theory, Peer Evaluation, Bayesian Statistics, Simulation
Peer reviewed
Gugiu, Mihaiela R.; Gugiu, Paul C.; Baldus, Robert – Journal of MultiDisciplinary Evaluation, 2012
Background: Educational researchers have long espoused the virtues of writing with regard to student cognitive skills. However, research on the reliability of the grades assigned to written papers reveals a high degree of contradiction, with some researchers concluding that the grades assigned are very reliable and others suggesting that they…
Descriptors: Grades (Scholastic), Grading, Scoring Rubrics, Research Design
Peer reviewed
Isaacs, Talia; Thomson, Ron I. – Language Assessment Quarterly, 2013
This mixed-methods study examines the effects of rating scale length and rater experience on listeners' judgments of second-language (L2) speech. Twenty experienced and 20 novice raters, who were randomly assigned to 5-point or 9-point rating scale conditions, judged speech samples of 38 newcomers to Canada on numerical rating scales for…
Descriptors: Foreign Countries, Adults, Second Language Learning, English (Second Language)
Rui, Ning; Feldman, Jill M. – Online Submission, 2012
Notwithstanding the broad utility of COPs (classroom observation protocols), there has been limited documentation of the psychometric properties of even the most popular COPs. This study attempted to fill this void by closely examining the item- and domain-level IRR (inter-rater reliability) of a COP that was used in a federally funded Striving Readers…
Descriptors: Classroom Observation Techniques, Interrater Reliability, Correlation, Psychometrics
Peer reviewed
Gorard, Stephen – International Journal of Research & Method in Education, 2009
The author previously published a paper discussing how to conduct an analysis based on a cluster sample. In that paper, the author outlined several widely adopted alternative approaches and pointed out that such approaches are, in any case, not needed for population figures and not possible for non-probability samples. Thus, the author queried the…
Descriptors: Probability, Misconceptions, Reader Response, Research Methodology
Peer reviewed
Alt, Mary; Meyers, Christina; Figueroa, Cecilia – Journal of Speech, Language, and Hearing Research, 2013
Purpose: The purpose of this study was to determine whether children exposed to 2 languages would benefit from the phonotactic probability cues of a single language in the same way as monolingual peers and to determine whether crosslinguistic influence would be present in a fast-mapping task. Method: Two groups of typically developing children…
Descriptors: Regression (Statistics), Spanish, Cues, Task Analysis
Peer reviewed
Berry, Kenneth J.; Mielke, Paul W., Jr. – Educational and Psychological Measurement, 1997
A FORTRAN subroutine is presented to calculate a generalized measure of agreement between multiple raters and a set of correct responses at any level of measurement and among multiple responses, along with the associated probability value, under the null hypothesis. (Author)
Descriptors: Computer Software, Interrater Reliability, Measurement Techniques, Probability
Peer reviewed
Pfeiffer, Steven; Petscher, Yaacov; Kumtepe, Alper – Roeper Review, 2008
This study examined the internal consistency and validity of a new rating scale to identify gifted students, the Gifted Rating Scales-School Form (GRS-S). The study explored the effect of gender, race/ethnicity, age, and rater familiarity on GRS-S ratings. One hundred twenty-two students in first to eighth grade from elementary and middle schools…
Descriptors: Ethnicity, Middle Schools, Academically Gifted, Talent
van der Linden, Wim J.; Vos, Hans J.; Chang, Lei – 2000
In judgmental standard setting experiments, it may be difficult to specify subjective probabilities that adequately take the properties of the items into account. As a result, these probabilities are not consistent with each other in the sense that they do not refer to the same borderline level of performance. Methods to check standard setting…
Descriptors: Interrater Reliability, Judges, Probability, Standard Setting
Peer reviewed
Towstopiat, Olga – Contemporary Educational Psychology, 1984
The present article reviews the procedures that have been developed for measuring the reliability of human observers' judgments when making direct observations of behavior. These include the percentage of agreement, Cohen's Kappa, phi, and univariate and multivariate agreement measures that are based on quasi-equiprobability and quasi-independence…
Descriptors: Interrater Reliability, Mathematical Models, Multivariate Analysis, Observation