Showing 1 to 15 of 17 results
Peer reviewed
Jin, Kuan-Yu; Wang, Wen-Chung – Journal of Educational Measurement, 2018
The Rasch facets model was developed to account for facet data, such as student essays graded by raters, but it accounts for only one kind of rater effect (severity). In practice, raters may exhibit various tendencies such as using middle or extreme scores in their ratings, which is referred to as the rater centrality/extremity response style. To…
Descriptors: Scoring, Models, Interrater Reliability, Computation
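For readers unfamiliar with the facets model named above, a standard many-facet Rasch formulation with a single rater-severity parameter is sketched below; the notation is generic and is not taken from the article.

```latex
% Many-facet Rasch model (rating-scale form), generic notation:
% log-odds of examinee n receiving category k rather than k-1
% on essay prompt i from rater j.
\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k
% \theta_n  : examinee ability
% \delta_i  : prompt (item) difficulty
% \lambda_j : severity of rater j -- the single rater effect the model captures
% \tau_k    : threshold for score category k
```

Because rater j enters only through the shift \lambda_j, a tendency to overuse middle or extreme categories (the centrality/extremity style the abstract describes) cannot be expressed in this form, which is the limitation the abstract points to.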
Peer reviewed
Nieto, Ricardo; Casabianca, Jodi M. – Journal of Educational Measurement, 2019
Many large-scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests such as these, or more specifically simple-structure tests, rely on multiple multiple-choice and/or constructed-response sections of items to generate multiple…
Descriptors: Tests, Scoring, Responses, Test Items
Peer reviewed
PDF available on ERIC
Rupp, André A.; Casabianca, Jodi M.; Krüger, Maleika; Keller, Stefan; Köller, Olaf – ETS Research Report Series, 2019
In this research report, we describe the design and empirical findings for a large-scale study of essay writing ability with approximately 2,500 high school students in Germany and Switzerland on the basis of 2 tasks with 2 associated prompts, each from a standardized writing assessment whose scoring involved both human and automated components.…
Descriptors: Automation, Foreign Countries, English (Second Language), Language Tests
Peer reviewed
PDF available on ERIC
Ramineni, Chaitanya; Trapani, Catherine S.; Williamson, David M. – ETS Research Report Series, 2015
Automated scoring models were trained and evaluated for the essay task in the "Praxis I"® writing test. Prompt-specific and generic "e-rater"® scoring models were built, and evaluation statistics, such as quadratic weighted kappa, Pearson correlation, and standardized differences in mean scores, were examined to evaluate the…
Descriptors: Writing Tests, Licensing Examinations (Professions), Teacher Competency Testing, Scoring
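The evaluation statistics named in this abstract are simple to compute from paired human and automated scores; below is a minimal sketch in Python, assuming integer essay scores, with illustrative function and variable names that are not taken from the report.

```python
# Agreement statistics commonly reported when evaluating automated essay
# scoring against human raters: quadratic weighted kappa, Pearson correlation,
# and the standardized difference in mean scores. Names and data are
# illustrative only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def agreement_stats(human, machine):
    """Return (quadratic weighted kappa, Pearson r, standardized mean difference)."""
    human = np.asarray(human, dtype=float)
    machine = np.asarray(machine, dtype=float)

    qwk = cohen_kappa_score(human.astype(int), machine.astype(int),
                            weights="quadratic")
    r, _ = pearsonr(human, machine)

    # Standardized difference: gap in mean scores scaled by the pooled SD.
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2.0)
    smd = (machine.mean() - human.mean()) / pooled_sd
    return qwk, r, smd

# Made-up scores on a 1-6 essay scale, purely to show the calculation.
human_scores = [4, 3, 5, 2, 4, 6, 3, 4]
machine_scores = [4, 3, 4, 2, 5, 6, 3, 3]
print(agreement_stats(human_scores, machine_scores))
```

In operational evaluations such as the one described above, these statistics are computed on a held-out, human-scored sample rather than on toy data.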
Peer reviewed
Prevost, Luanna B.; Smith, Michelle K.; Knight, Jennifer K. – CBE - Life Sciences Education, 2016
Previous work has shown that students have persistent difficulties in understanding how central dogma processes can be affected by a stop codon mutation. To explore these difficulties, we modified two multiple-choice questions from the Genetics Concept Assessment into three open-ended questions that asked students to write about how a stop codon…
Descriptors: Science Instruction, Genetics, Scientific Concepts, Scoring
Peer reviewed
Nehm, Ross H.; Haertig, Hendrik – Journal of Science Education and Technology, 2012
Our study examines the efficacy of Computer Assisted Scoring (CAS) of open-response text relative to expert human scoring within the complex domain of evolutionary biology. Specifically, we explored whether CAS can diagnose the explanatory elements (or Key Concepts) that comprise undergraduate students' explanatory models of natural selection with…
Descriptors: Evolution, Undergraduate Students, Interrater Reliability, Computers
Peer reviewed
Black, Beth; Suto, Irenka; Bramley, Tom – Assessment in Education: Principles, Policy & Practice, 2011
In this paper we develop an evidence-based framework for considering many of the factors affecting marker agreement in GCSEs and A levels. A logical analysis of the demands of the marking task suggests a core grouping comprising: (i) question features; (ii) mark scheme features; and (iii) examinee response features. The framework synthesises…
Descriptors: Interrater Reliability, Grading, Scoring, High Stakes Tests
Peer reviewed
PDF available on ERIC
Zhang, Mo; Breyer, F. Jay; Lorenz, Florian – ETS Research Report Series, 2013
In this research, we investigated the suitability of implementing "e-rater"® automated essay scoring in a high-stakes large-scale English language testing program. We examined the effectiveness of generic scoring and 2 variants of prompt-based scoring approaches. Effectiveness was evaluated on a number of dimensions, including agreement…
Descriptors: Computer Assisted Testing, Computer Software, Scoring, Language Tests
Peer reviewed
PDF available on ERIC
Ramineni, Chaitanya; Trapani, Catherine S.; Williamson, David M.; Davey, Tim; Bridgeman, Brent – ETS Research Report Series, 2012
Automated scoring models for the "e-rater"® scoring engine were built and evaluated for the "GRE"® argument and issue-writing tasks. Prompt-specific, generic, and generic with prompt-specific intercept scoring models were built and evaluation statistics such as weighted kappas, Pearson correlations, standardized difference in…
Descriptors: Scoring, Test Scoring Machines, Automation, Models
Peer reviewed
PDF available on ERIC
Coe, Michael; Hanita, Makoto; Nishioka, Vicki; Smiley, Richard – National Center for Education Evaluation and Regional Assistance, 2011
The 6+1 Trait® Writing model (Culham 2003) emphasizes writing instruction in which teachers and students analyze writing using a set of characteristics, or "traits," of written work: ideas, organization, voice, word choice, sentence fluency, conventions, and presentation. The Ideas trait includes the main content and message, including…
Descriptors: Models, Writing Instruction, Instructional Effectiveness, Grade 5
Peer reviewed
Mariano, Louis T.; Junker, Brian W. – Journal of Educational and Behavioral Statistics, 2007
When constructed response test items are scored by more than one rater, the repeated ratings allow for the consideration of individual rater bias and variability in estimating student proficiency. Several hierarchical models based on item response theory have been introduced to model such effects. In this article, the authors demonstrate how these…
Descriptors: Test Items, Item Response Theory, Rating Scales, Scoring
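One hierarchical formulation of the kind this abstract refers to is the hierarchical rater model of Patz and colleagues, in which each observed rating is treated as a noisy, possibly biased report of an "ideal" IRT-scaled rating. The two-stage sketch below uses generic notation and is not necessarily the exact parameterization the authors analyze.

```latex
% Hierarchical rater model, generic two-stage sketch.
% Stage 1: the ideal rating \xi_i of response i follows an IRT model
%          (e.g., a partial credit model) given examinee proficiency \theta.
% Stage 2: rater r reports X_{ir} for response i, with rater bias \phi_r
%          and rater variability \psi_r:
P(X_{ir} = k \mid \xi_i) \;\propto\;
  \exp\!\left\{ -\frac{\bigl[\,k - (\xi_i + \phi_r)\,\bigr]^2}{2\,\psi_r^{2}} \right\}
```

Repeated ratings of the same responses are what make the rater parameters \phi_r and \psi_r estimable alongside examinee proficiency, which is the point the abstract makes about items scored by more than one rater.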
Peer reviewed
Schoonen, Rob; And Others – Language Testing, 1997
Reports on three studies conducted in the Netherlands about the reading reliability of lay and expert readers in rating content and language usage of students' writing performances in three kinds of writing assignments. Findings reveal that expert readers are more reliable in rating usage, whereas both lay and expert readers are reliable raters of…
Descriptors: Foreign Countries, Interrater Reliability, Language Usage, Models
McIntyre, Kenneth E. – 1986
This paper dealt with the use of classroom observation data for formative evaluation purposes, and with a research project in which scores based on observed performance of teachers in secondary school algebra and English classes were compared with efficiency scores based on an input-output model. The model, using Data Envelopment Analysis (DEA)…
Descriptors: Algebra, Classroom Observation Techniques, Classroom Research, Evaluation Methods
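As background on the input-output efficiency model mentioned in this abstract, a standard input-oriented DEA formulation (the CCR envelopment form) is sketched below with generic notation; the paper's exact variant may differ.

```latex
% Input-oriented CCR envelopment form of DEA: efficiency \theta of unit 0
% relative to peer units j = 1..n, with inputs x_{ij} and outputs y_{rj}.
\min_{\theta,\ \lambda}\ \theta
\quad \text{s.t.} \quad
\sum_{j=1}^{n} \lambda_j\, x_{ij} \le \theta\, x_{i0} \ \ \text{for each input } i,
\qquad
\sum_{j=1}^{n} \lambda_j\, y_{rj} \ge y_{r0} \ \ \text{for each output } r,
\qquad
\lambda_j \ge 0 .
% \theta = 1 marks a unit on the best-practice frontier; \theta < 1 marks
% inefficiency relative to that frontier.
```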
De Ayala, R. J.; And Others – 1989
The graded response (GR) model of Samejima (1969) and the partial credit model (PC) of Masters (1982) were fitted to identical writing samples that were holistically scored. The performance and relative benefits of each model were then evaluated. Writing samples were both expository and narrative. Data were from statewide assessments of secondary…
Descriptors: Comparative Analysis, Essay Tests, Holistic Evaluation, Interrater Reliability
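For readers weighing the two polytomous IRT models compared in this entry, their standard forms are sketched below in generic notation (not taken from the paper).

```latex
% Graded response model (Samejima): cumulative category boundaries with
% discrimination a_i and boundary locations b_{ik}.
P(X_i \ge k \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]},
\qquad
P(X_i = k \mid \theta) = P(X_i \ge k \mid \theta) - P(X_i \ge k+1 \mid \theta).

% Partial credit model (Masters): adjacent-category ("divide-by-total") form
% with step difficulties \delta_{ij}; by convention \sum_{j=0}^{0}(\theta-\delta_{ij}) \equiv 0.
P(X_i = k \mid \theta) =
  \frac{\exp \sum_{j=0}^{k} (\theta - \delta_{ij})}
       {\sum_{m=0}^{M_i} \exp \sum_{j=0}^{m} (\theta - \delta_{ij})}.
```

The GR model orders score categories through cumulative boundaries with an item discrimination parameter, while the PC model works through adjacent-category step difficulties; both can be fitted to the same holistically scored essays and their relative benefits compared, which is the comparison the abstract describes.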
Peer reviewed
Edwards, Alison L. – Modern Language Journal, 1996
Examined the validity of the pragmatic approach to test difficulty put forward by Child (1987). This study investigated whether the Child discourse-type hierarchy predicts text difficulty for second-language readers. Results suggested that this hierarchy may provide a sound basis for developing foreign-language tests when it is applied by trained…
Descriptors: Adult Students, Analysis of Variance, French, Interrater Reliability