Utilizing Large Language Models for EFL Essay Grading: An Examination of Reliability and Validity in Rubric-Based Assessments.

Fatih Yavuz; Özgür Çelik; Gamze Yavas Çelik

Peer reviewed

ERIC Number: EJ1456980

Record Type: Journal

Publication Date: 2025-Jan

Pages: 17

Abstractor: As Provided

ISBN: N/A

ISSN: 0007-1013

EISSN: 1467-8535

Available Date: N/A

British Journal of Educational Technology, v56 n1 p150-166 2025

This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression and mechanics. The results revealed that the fine-tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, the default ChatGPT model exhibited an ICC score of 0.947, and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while LLMs demonstrated notable consistency and potential for grading competency, further fine-tuning and adjustment are needed for a more nuanced understanding of non-objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.
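
The abstract reports ICC scores but does not say which ICC form was computed. As an illustration only, the sketch below computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rating, following Shrout and Fleiss), from a matrix of scores with one row per graded item and one column per rating. The function name `icc2_1` and the sample scores are hypothetical and are not taken from the study.

```python
import numpy as np


def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    `scores` is an (n items) x (k ratings) array. This is one possible ICC
    form; the study's abstract does not specify which form it used.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares for items (rows), ratings (columns), and residual.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA decomposition.
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical example: five rubric domains (rows) scored in three
# repeated grading runs (columns). These numbers are invented.
example_scores = [
    [18, 17, 18],
    [15, 15, 14],
    [12, 13, 12],
    [16, 16, 17],
    [10, 11, 10],
]
print(round(icc2_1(example_scores), 3))
```

A value near 1 means the repeated ratings agree closely with one another. A different ICC form (for example, consistency rather than absolute agreement, or averaged ratings) would give different numbers on the same data.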

Descriptors: English (Second Language), Second Language Learning, Second Language Instruction, Computational Linguistics, Evaluators, Writing Evaluation, Artificial Intelligence, Scoring Rubrics, Language Teachers, Essays, Validity, Reliability, Evaluation Criteria, Computer Software, Comparative Analysis, Grades (Scholastic)

Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www-wiley-com.bibliotheek.ehb.be/en-us

Publication Type: Journal Articles; Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: N/A

Authoring Institution: N/A

Grant or Contract Numbers: N/A

Author Affiliations: N/A