Publication Date
| In 2026 | 0 |
| Since 2025 | 3 |
| Since 2022 (last 5 years) | 7 |
| Since 2017 (last 10 years) | 20 |
| Since 2007 (last 20 years) | 31 |
Descriptor
| Interrater Reliability | 36 |
| Scoring | 36 |
| Second Language Learning | 36 |
| English (Second Language) | 27 |
| Language Tests | 24 |
| Foreign Countries | 20 |
| Evaluators | 15 |
| Second Language Instruction | 11 |
| Computer Assisted Testing | 10 |
| Correlation | 10 |
| Comparative Analysis | 9 |
Author
| Polat, Murat | 2 |
| Ahmadi Shirazi, Masoumeh | 1 |
| Ahmadi, Alireza | 1 |
| Alt, Mary | 1 |
| Bejar, Isaac I. | 1 |
| Breyer, F. Jay | 1 |
| Buzick, Heather | 1 |
| Carey, Michael D. | 1 |
| Carlson, Sybil B. | 1 |
| Casabianca, Jodi M. | 1 |
| Chan, Stephanie W. Y. | 1 |
Publication Type
| Journal Articles | 33 |
| Reports - Research | 28 |
| Tests/Questionnaires | 7 |
| Reports - Descriptive | 3 |
| Information Analyses | 2 |
| Reports - Evaluative | 2 |
| Dissertations/Theses -… | 1 |
| Speeches/Meeting Papers | 1 |
Audience
| Researchers | 1 |
Assessments and Surveys
| Test of English as a Foreign… | 11 |
| International English… | 4 |
| Peabody Picture Vocabulary… | 2 |
| Expressive One Word Picture… | 1 |
| Graduate Record Examinations | 1 |
| Mean Length of Utterance | 1 |
| Test of English for… | 1 |
Erik Voss – Language Testing, 2025
An increasing number of language testing companies are developing and deploying deep learning-based automated essay scoring systems (AES) to replace traditional approaches that rely on handcrafted feature extraction. However, there is hesitation to accept neural network approaches to automated essay scoring because the features are automatically…
Descriptors: Artificial Intelligence, Automation, Scoring, English (Second Language)
Somayeh Fathali; Fatemeh Mohajeri – Technology in Language Teaching & Learning, 2025
The International English Language Testing System (IELTS) is a high-stakes exam where Writing Task 2 significantly influences the overall scores, requiring reliable evaluation. While trained human raters perform this task, concerns about subjectivity and inconsistency have led to growing interest in artificial intelligence (AI)-based assessment…
Descriptors: English (Second Language), Language Tests, Second Language Learning, Artificial Intelligence
Junfei Li; Jinyan Huang; Thomas Sheeran – SAGE Open, 2025
This study investigated the role of ChatGPT4o as an AI peer assessor in English-as-a-foreign-language (EFL) speaking classrooms, with a focus on its scoring reliability and the effectiveness of its feedback. The research involved 40 first-year English major students from two parallel classes at a Chinese university. Twenty from one class served as…
Descriptors: Artificial Intelligence, Technology Uses in Education, Peer Evaluation, English (Second Language)
Seedhouse, Paul; Satar, Müge – Classroom Discourse, 2023
The same L2 speaking performance may be analysed and evaluated in very different ways by different teachers or raters. We present a new, technology-assisted research design which opens up to investigation the trajectories of convergence and divergence between raters. We tracked and recorded what different raters noticed when, whilst grading a…
Descriptors: Language Tests, English (Second Language), Second Language Learning, Oral Language
Jiyeo Yun – English Teaching, 2023
Studies on automatic scoring systems in writing assessments have also evaluated the relationship between human and machine scores for the reliability of automated essay scoring systems. This study investigated the magnitudes of indices for inter-rater agreement and discrepancy, especially regarding human and machine scoring, in writing assessment.…
Descriptors: Meta Analysis, Interrater Reliability, Essays, Scoring
Polat, Murat – International Online Journal of Education and Teaching, 2020
The assessment of speaking skills in foreign language testing has always had some pros (testing learners' speaking skills doubles the validity of any language test) and cons (many test-relevant/irrelevant variables interfere) since it is a multi-dimensional process. In the meantime, exploring grader behaviours while scoring learners' speaking…
Descriptors: Item Response Theory, Interrater Reliability, Speech Skills, Second Language Learning
Polat, Murat; Turhan, Nihan Sölpük – International Journal of Curriculum and Instruction, 2021
Scoring language learners' speaking skills is open to a number of measurement errors since raters' personal judgements could involve in the process. Different grading designs in which raters score a student's whole speaking skills or a specific dimension of the speaking performance could be settled to control and minimize the amount of the error…
Descriptors: Language Tests, Scoring, Speech Communication, State Universities
Önen, Emine; Yayvak, Melike Kübra Tasdelen – Journal of Education and Training Studies, 2019
In this study, it was aimed to examine the interrater reliability of the scoring of paragraph writing skills on foreign languages with the measurement invariance tests. The study group consists of 267 students studying English at the Preparatory School at Gazi University. In the study, where students write a paragraph on the same topic, the…
Descriptors: Second Language Learning, Second Language Instruction, Factor Analysis, English (Second Language)
Thai, Thuy; Sheehan, Susan – Language Education & Assessment, 2022
In language performance tests, raters are important as their scoring decisions determine which aspects of performance the scores represent; however, raters are considered as one of the potential sources contributing to unwanted variability in scores (Davis, 2012). Although a great number of studies have been conducted to unpack how rater…
Descriptors: Rating Scales, Speech Communication, Second Language Learning, Second Language Instruction
Wang, Qiao – Education and Information Technologies, 2022
This study searched for open-source semantic similarity tools and evaluated their effectiveness in automated content scoring of fact-based essays written by English-as-a-Foreign-Language (EFL) learners. Fifty writing samples under a fact-based writing task from an academic English course in a Japanese university were collected and a gold standard…
Descriptors: English (Second Language), Second Language Learning, Second Language Instruction, Scoring
He, Tung-hsien – SAGE Open, 2019
This study employed a mixed-design approach and the Many-Facet Rasch Measurement (MFRM) framework to investigate whether rater bias occurred between the onscreen scoring (OSS) mode and the paper-based scoring (PBS) mode. Nine human raters analytically marked scanned scripts and paper scripts using a six-category (i.e., six-criterion) rating…
Descriptors: Computer Assisted Testing, Scoring, Item Response Theory, Essays
Ahmadi, Alireza – Taiwan Journal of TESOL, 2020
Rater subjectivity has long been an intriguing topic. The use of discussion as a resolution method is a practical way to reduce this subjectivity. However, the efficacy of discussion depends on whether different raters get equally engaged in it or one rater tends to dominate others. This study investigated whether and how rater dominance occurs in…
Descriptors: Evaluators, Interrater Reliability, Discussion, Discourse Analysis
Davis, Larry; Norris, John – ETS Research Report Series, 2021
The elicited imitation task (EIT), in which language learners listen to a series of spoken sentences and repeat each one verbatim, is a commonly used measure of language proficiency in second language acquisition research. The "TOEFL® Essentials"™ test includes an EIT as a holistic measure of speaking proficiency, referred to as the…
Descriptors: Task Analysis, Language Proficiency, Speech Communication, Language Tests
Rupp, André A.; Casabianca, Jodi M.; Krüger, Maleika; Keller, Stefan; Köller, Olaf – ETS Research Report Series, 2019
In this research report, we describe the design and empirical findings for a large-scale study of essay writing ability with approximately 2,500 high school students in Germany and Switzerland on the basis of 2 tasks with 2 associated prompts, each from a standardized writing assessment whose scoring involved both human and automated components.…
Descriptors: Automation, Foreign Countries, English (Second Language), Language Tests
Ahmadi Shirazi, Masoumeh – SAGE Open, 2019
Threats to construct validity should be reduced to a minimum. If true, sources of bias, namely raters, items, tests as well as gender, age, race, language background, culture, and socio-economic status need to be spotted and removed. This study investigates raters' experience, language background, and the choice of essay prompt as potential…
Descriptors: Foreign Countries, Language Tests, Test Bias, Essay Tests

