Peer reviewed
ERIC Number: EJ1469507
Record Type: Journal
Publication Date: 2025
Pages: 25
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: 2469-9896
Available Date: 0000-00-00
Grading Explanations of Problem-Solving Process and Generating Feedback Using Large Language Models at Human-Level Accuracy
Physical Review Physics Education Research, v21 n1 Article 010126 2025
This study examines the feasibility and potential advantages of using large language models, in particular GPT-4o, to perform partial credit grading of large numbers of student written responses to introductory-level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on two exams. The explanations were then graded according to a three-item rubric with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o with no examples or reference answers can reliably agree with human graders in 70%-80% of all cases, which is equal to or higher than the level at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: (i) adding explanation language to each rubric item that targets the errors of initial machine grading; (ii) running the grading process five times and taking the most frequent outcome. Next, we show that the variation in outcomes across five machine grading attempts can serve as a grading confidence index. The index allows a human expert to identify approximately 40% of all potentially incorrect gradings by reviewing just the 10%-15% of all responses with the highest variation. Finally, we show that it is straightforward to use GPT-4o to write a clear and detailed explanation of the partial credit grading outcome. Those explanations can be used as feedback for students, which will allow students to understand their grades and raise different opinions when necessary. Almost all feedback messages generated were rated three or above on a five-point scale by two instructors who had taught the course multiple times. The entire grading and feedback generating process costs roughly $5 per 100 student answers, which shows immense promise for automating the labor-intensive grading process through a combination of machine grading with human input and supervision.
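The two aggregation steps described in the abstract (taking the most frequent outcome across five grading runs, and using the variation across runs as a confidence index) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `aggregate_runs` and `flag_for_review` are hypothetical, and the variation measure used here (the total number of minority votes across the three rubric items) is an assumption, since the abstract does not define the index precisely.

```python
from collections import Counter


def aggregate_runs(runs):
    """Combine repeated gradings of ONE student response.

    runs: list of tuples, one per grading attempt; each tuple holds the
    binary (1 or 0) score for every rubric item.
    Returns (final_scores, variation), where final_scores is the
    per-item majority vote and variation is the total number of votes
    that disagree with the majority (an assumed index, see lead-in).
    """
    n_items = len(runs[0])
    final_scores = []
    variation = 0
    for i in range(n_items):
        votes = Counter(run[i] for run in runs)
        # With an odd number of runs and binary scores there is no tie.
        score, count = votes.most_common(1)[0]
        final_scores.append(score)
        variation += len(runs) - count
    return tuple(final_scores), variation


def flag_for_review(variations, fraction=0.15):
    """Return indices of the responses with the highest variation.

    fraction is the share of responses a human expert will review
    (the abstract reports 10%-15%); at least one response is flagged.
    """
    n_flag = max(1, int(len(variations) * fraction))
    ranked = sorted(range(len(variations)),
                    key=lambda k: variations[k], reverse=True)
    return sorted(ranked[:n_flag])


# Five hypothetical grading attempts for one response, three rubric items.
example_runs = [(1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 0, 0), (1, 0, 1)]
final, var = aggregate_runs(example_runs)
```

In this sketch, a response on which all five runs agree for every rubric item gets a variation of 0, and responses with larger values are the ones passed to a human grader first.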
American Physical Society. One Physics Ellipse 4th Floor, College Park, MD 20740-3844. Tel: 301-209-3200; Fax: 301-209-0865; e-mail: assocpub@aps.org; Web site: https://journals.aps.org/prper/
Publication Type: Journal Articles; Reports - Research; Tests/Questionnaires
Education Level: N/A
Audience: N/A
Language: English
Authoring Institution: N/A
Grant or Contract Numbers: 1845436
Author Affiliations: N/A