Publication Date
| Period | Count |
| --- | --- |
| In 2025 | 0 |
| Since 2024 | 0 |
| Since 2021 (last 5 years) | 1 |
| Since 2016 (last 10 years) | 6 |
| Since 2006 (last 20 years) | 20 |
Descriptor
| Descriptor | Count |
| --- | --- |
| Interrater Reliability | 61 |
| Statistical Analysis | 14 |
| Correlation | 10 |
| Evaluation Methods | 9 |
| Test Validity | 9 |
| Evaluators | 8 |
| Measurement Techniques | 8 |
| Rating Scales | 8 |
| Scoring | 8 |
| Classification | 7 |
| Error of Measurement | 7 |
Source
| Source | Count |
| --- | --- |
| Educational and Psychological… | 61 |
Author
| Author | Count |
| --- | --- |
| Berry, Kenneth J. | 3 |
| Mielke, Paul W., Jr. | 3 |
| Cousineau, Denis | 2 |
| Janson, Harald | 2 |
| Laurencelle, Louis | 2 |
| Li, Mao-Neng Fred | 2 |
| Olsson, Ulf | 2 |
| Schuster, Christof | 2 |
| Abedi, Jamal | 1 |
| Alliger, George M. | 1 |
| Baker, Eva L. | 1 |
Publication Type
| Type | Count |
| --- | --- |
| Journal Articles | 61 |
| Reports - Research | 30 |
| Reports - Evaluative | 22 |
| Reports - Descriptive | 9 |
| Book/Product Reviews | 1 |
| Opinion Papers | 1 |
| Speeches/Meeting Papers | 1 |
Education Level
| Level | Count |
| --- | --- |
| Elementary Education | 3 |
| Grade 4 | 2 |
| Intermediate Grades | 2 |
| Elementary Secondary Education | 1 |
| Grade 7 | 1 |
| Higher Education | 1 |
| Junior High Schools | 1 |
| Middle Schools | 1 |
| Postsecondary Education | 1 |
| Secondary Education | 1 |
Assessments and Surveys
| Assessment | Count |
| --- | --- |
| Conners Rating Scales | 1 |
| Coopersmith Self Esteem… | 1 |
| Teacher Performance… | 1 |
| Work Keys (ACT) | 1 |
Gwet, Kilem L. – Educational and Psychological Measurement, 2021
Cohen's kappa coefficient was originally proposed for two raters only; it was later extended to an arbitrarily large number of raters to become what is known as Fleiss' generalized kappa. Fleiss' generalized kappa and its large-sample variance are still widely used by researchers and have been implemented in several software packages, including, among…
Descriptors: Sample Size, Statistical Analysis, Interrater Reliability, Computation
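As background for this entry, here is a minimal sketch of Fleiss' generalized kappa, the point estimate whose large-sample variance the article concerns; the function and the example count matrix are illustrative, not Gwet's implementation.

```python
# Minimal sketch of Fleiss' generalized kappa, assuming a complete
# subjects-by-categories count matrix with a constant number of raters.
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = number of raters assigning subject i to category j."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                    # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    P_bar, Pe_bar = P_i.mean(), np.square(p_j).sum()
    return (P_bar - Pe_bar) / (1 - Pe_bar)

# Invented example: 4 subjects, 3 raters, 2 categories -> kappa = 1/3.
print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))
```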
Using Differential Item Functioning to Test for Interrater Reliability in Constructed Response Items
Walker, Cindy M.; Göçer Sahin, Sakine – Educational and Psychological Measurement, 2020
The purpose of this study was to investigate a new way of evaluating interrater reliability that allows one to determine whether two raters differ with respect to their ratings on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared…
Descriptors: Test Bias, Interrater Reliability, Responses, Correlation
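The abstract does not spell out the DIF procedure, but the core idea can be sketched as a logistic-regression DIF analysis with rater as the grouping variable; everything below (data, variable names, effect size) is invented for illustration.

```python
# Hypothetical logistic-regression DIF sketch: does one rater score the same
# responses systematically higher, conditional on a matching variable?
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
theta = rng.normal(size=200)            # matching score for 200 responses
rater = np.repeat([0, 1], 200)          # both raters score every response
theta2 = np.tile(theta, 2)
p = 1 / (1 + np.exp(-(theta2 + 0.5 * rater)))   # rater 1 simulated as more lenient
score = rng.binomial(1, p)              # dichotomized rating

X = sm.add_constant(np.column_stack([theta2, rater]))
fit = sm.Logit(score, X).fit(disp=0)
print(fit.params)   # a nonzero rater coefficient signals rater DIF
```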
De Raadt, Alexandra; Warrens, Matthijs J.; Bosker, Roel J.; Kiers, Henk A. L. – Educational and Psychological Measurement, 2019
Cohen's kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen's kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data…
Descriptors: Interrater Reliability, Data, Statistical Analysis, Statistical Bias
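For reference, the complete-data coefficient that these missing-data variants estimate is the standard Cohen's kappa:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \sum_{i=1}^{k} p_{ii}, \qquad p_e = \sum_{i=1}^{k} p_{i+}\,p_{+i},$$

where $p_{ij}$ is the proportion of units placed in category $i$ by the first rater and category $j$ by the second, and $p_{i+}$, $p_{+i}$ are the raters' marginal proportions.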
Wind, Stefanie A.; Patil, Yogendra J. – Educational and Psychological Measurement, 2018
Recent research has explored the use of models adapted from Mokken scale analysis as a nonparametric approach to evaluating rating quality in educational performance assessments. A potential limiting factor to the widespread use of these techniques is the requirement for complete data, as practical constraints in operational assessment systems…
Descriptors: Scaling, Data, Interrater Reliability, Writing Tests
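As background (standard Mokken scale analysis, not the article's specific rating-quality adaptation), scalability for a pair of items rests on the coefficient

$$H_{ij} = 1 - \frac{F_{ij}}{E_{ij}},$$

where $F_{ij}$ is the observed number of Guttman errors for the pair and $E_{ij}$ the number expected under marginal independence; values near 1 indicate strongly consistent ordering.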
Conger, Anthony J. – Educational and Psychological Measurement, 2017
Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa. Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another…
Descriptors: Interrater Reliability, Evaluators, Accuracy, Statistical Analysis
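A quick worked example of the marginal effect the article examines (numbers invented): two raters classify 100 units. With balanced marginals and cell counts $(40, 10; 10, 40)$, $p_o = 0.80$ and $p_e = 0.5^2 + 0.5^2 = 0.50$, so $\kappa = (0.80 - 0.50)/(1 - 0.50) = 0.60$. With skewed marginals $(85, 5; 5, 5)$, $p_o = 0.90$ but $p_e = 0.9 \times 0.9 + 0.1 \times 0.1 = 0.82$, so $\kappa = (0.90 - 0.82)/(1 - 0.82) \approx 0.44$: raw agreement rises while kappa falls.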
Cousineau, Denis; Laurencelle, Louis – Educational and Psychological Measurement, 2017
Assessing global interrater agreement is difficult as most published indices are affected by the presence of mixtures of agreements and disagreements. A previously proposed method was shown to be specifically sensitive to global agreement, excluding mixtures, but also negatively biased. Here, we propose two alternatives in an attempt to find what…
Descriptors: Interrater Reliability, Evaluation Methods, Statistical Bias, Accuracy
Cousineau, Denis; Laurencelle, Louis – Educational and Psychological Measurement, 2015
Existing tests of interrater agreement have high statistical power; however, they lack specificity. If the ratings of the two raters do not show agreement but are not random, the current tests, some of which are based on Cohen's kappa, will often reject the null hypothesis, leading to the wrong conclusion that agreement is present. A new test of…
Descriptors: Interrater Reliability, Monte Carlo Methods, Measurement Techniques, Accuracy
Raykov, Tenko; Dimitrov, Dimiter M.; von Eye, Alexander; Marcoulides, George A. – Educational and Psychological Measurement, 2013
A latent variable modeling method for evaluation of interrater agreement is outlined. The procedure is useful for point and interval estimation of the degree of agreement among a given set of judges evaluating a group of targets. In addition, the approach allows one to test for identity in underlying thresholds across raters as well as to identify…
Descriptors: Interrater Reliability, Models, Statistical Analysis, Computation
Kersting, Nicole B.; Sherin, Bruce L.; Stigler, James W. – Educational and Psychological Measurement, 2014
In this study, we explored the potential for machine scoring of short written responses to the Classroom-Video-Analysis (CVA) assessment, which is designed to measure teachers' usable mathematics teaching knowledge. We created naïve Bayes classifiers for CVA scales assessing three different topic areas and compared computer-generated scores to…
Descriptors: Scoring, Automation, Video Technology, Teacher Evaluation
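A minimal sketch of the general approach described in the abstract, bag-of-words naive Bayes classification of short written responses, is shown below; the texts, labels, and pipeline are invented stand-ins, since the CVA data and classifiers are not public.

```python
# Hypothetical naive Bayes scorer for short written responses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

responses = [
    "the teacher links the procedure to place value",
    "students repeat the steps without explanation",
    "no mathematical reasoning is visible",
    "the teacher asks why the algorithm works",
]
human_scores = [2, 1, 0, 2]   # invented rubric scores

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(responses, human_scores)
print(model.predict(["students explain why the procedure works"]))
```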
Wyse, Adam E.; Bunch, Michael B.; Deville, Craig; Viger, Steven G. – Educational and Psychological Measurement, 2014
This article describes a novel variation of the Body of Work method that uses construct maps to overcome problems of transparency, rater inconsistency, and score gaps commonly occurring with the Body of Work method. The Body of Work method with construct maps was implemented to set cut-scores for two separate K-12 assessment programs in a large…
Descriptors: Standard Setting (Scoring), Educational Assessment, Elementary Secondary Education, Measurement
Finkelman, Matthew; Darby, Mark; Nering, Michael – Educational and Psychological Measurement, 2009
Many tests classify each examinee into one of multiple performance levels on the basis of a combination of multiple-choice (MC) and constructed-response (CR) items. This study introduces a two-stage scoring method that identifies examinees whose MC scores place them near a cut point, advising scorers on which examinees will be most affected by…
Descriptors: Classification, Scoring, Multiple Choice Tests, Responses
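The flagging idea in the abstract can be illustrated with a toy computation (cut score and point values invented): only examinees whose MC score leaves the final classification undecided need especially careful CR scoring.

```python
# Hypothetical two-stage flagging: which examinees' constructed-response
# scores can still change their classification?
mc_scores = [18, 22, 25, 29, 31]
cut, cr_max = 30, 6            # invented cut point and maximum CR points
flagged = [s for s in mc_scores if cut - cr_max <= s < cut]
print(flagged)                 # [25, 29]: their CR scores decide the outcome
```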
Joseph, Dana L.; Newman, Daniel A. – Educational and Psychological Measurement, 2010
A major stumbling block for emotional intelligence (EI) research has been the lack of adequate evidence for discriminant validity. In a sample of 280 dyads, self- and peer-reports of EI and Big Five personality traits were used to confirm an a priori four-factor model for the Wong and Law Emotional Intelligence Scale (WLEIS) and a five-factor…
Descriptors: Emotional Intelligence, Measurement Techniques, Validity, Personality Traits
Marshall, Seth J.; Wodrich, David L.; Gorin, Joanna S. – Educational and Psychological Measurement, 2009
This study examined psychometric properties of the Tempe Sorting Task (TST), a new measure of executive function (EF) for children. To increase the meaningfulness of test score interpretations, an age-appropriate construct was employed to incorporate Denckla's description of EF. Multiple measures of EF, including the TST, were collected for…
Descriptors: Cognitive Tests, Cognitive Processes, Children, Attention Deficit Hyperactivity Disorder
Woolley, Michael E.; Bowen, Gary L.; Bowen, Natasha K. – Educational and Psychological Measurement, 2006
Cognitive pretesting (CP) is an interview methodology for pretesting the validity of items during the development of self-report instruments. This article reports on the development and evaluation of a systematic method to rate self-report item validity performance utilizing CP interview text data. Five raters were trained in the application of…
Descriptors: Measurement Techniques, Validity, Pretesting, Interviews
Huang, Chiungjung – Educational and Psychological Measurement, 2009
This study examined the percentage of task-sampling variability in performance assessment via a meta-analysis. In total, 50 studies containing 130 independent data sets were analyzed. Overall results indicate that the percentage of variance for (a) differential difficulty of task was roughly 12% and (b) examinee's differential performance of the…
Descriptors: Test Bias, Research Design, Performance Based Assessment, Performance Tests
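The percentages reported here correspond to the usual person-by-task generalizability decomposition (standard G-theory notation, not necessarily the article's exact model):

$$\sigma^2(X_{pt}) = \sigma^2_p + \sigma^2_t + \sigma^2_{pt,e},$$

where $\sigma^2_t$ reflects differential task difficulty (the roughly 12% figure) and $\sigma^2_{pt,e}$ the examinee-by-task interaction confounded with residual error.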