AI's Ability to Interpret Unlabeled Anatomy Images and Supplement Educational Research as an AI Rater.

Peer reviewed

ERIC Number: EJ1486230

Record Type: Journal

Publication Date: 2025-Oct

Pages: 12

Abstractor: As Provided

ISBN: N/A

ISSN: ISSN-1935-9772

EISSN: EISSN-1935-9780

Available Date: 2025-07-11

Lord J. Hyeamang¹; Tejas C. Sekhar¹; Emily Rush²; Amy C. Beresheim³; Colleen M. Cheverko³; William S. Brooks⁴; Abbey C. M. Breckling⁵; M. Nazmul Karim⁶; Christopher Ferrigno³; Adam B. Wilson³

Anatomical Sciences Education, v18 n10 p1102-1113 2025

Evidence suggests custom chatbots are superior to commercial generative artificial intelligence (GenAI) systems for text-based anatomy content inquiries. This study evaluates ChatGPT-4o's and Claude 3.5 Sonnet's capabilities to interpret unlabeled anatomical images. Secondarily, ChatGPT o1-preview was evaluated as an AI rater to grade AI-generated outputs using a rubric and was compared against human raters. Anatomical images (five musculoskeletal, five thoracic) representing diverse image-based media (e.g., illustrations, photographs, MRI) were annotated with identification markers (e.g., arrows, circles) and uploaded to each GenAI system for interpretation. Forty-five prompts (i.e., 15 first-order, 15 second-order, and 15 third-order questions) with associated images were submitted to both GenAI systems across two timepoints. Responses were graded by anatomy experts for factual accuracy and superfluity (the presence of excessive wording) on a three-point Likert scale. ChatGPT o1-preview was tested for agreement against human anatomy experts to determine its usefulness as an AI rater. Statistical analyses included inter-rater agreement, hierarchical linear modeling, and test-retest reliability. ChatGPT-4o's factual accuracy score across 45 outputs was 68.0% compared to Claude 3.5 Sonnet's score of 61.5% (p = 0.319). As an AI rater, ChatGPT o1-preview showed moderate to substantial agreement with human raters (Cohen's kappa = 0.545-0.755) for evaluating factual accuracy according to a rubric of textbook answers. Further improvements and evaluations are needed before commercial GenAI systems can be used as credible student resources in anatomy education. Similarly, ChatGPT o1-preview demonstrates promise as an AI assistant for educational research, though further investigation is warranted.
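The inter-rater agreement statistic reported in the abstract, Cohen's kappa, can be illustrated with a minimal sketch. This is not the authors' analysis code: it assumes unweighted kappa (the abstract does not say whether a weighted variant was used), the function name `cohens_kappa` is hypothetical, and the scores are invented purely to show the arithmetic on a three-point scale.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so perfect agreement is returned by convention in this sketch.
        return 1.0

    return (observed - expected) / (1 - expected)


# Invented example: factual-accuracy scores (0, 1, or 2) for ten outputs.
human_scores = [2, 2, 1, 0, 2, 1, 1, 0, 2, 2]
ai_scores = [2, 1, 1, 0, 2, 1, 2, 0, 2, 2]

kappa = cohens_kappa(human_scores, ai_scores)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw percent agreement for the agreement expected by chance, which is why it is generally preferred over simple percent agreement when comparing an AI rater with human raters.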

Descriptors: Artificial Intelligence, Anatomy, Identification, Man Machine Systems, Natural Language Processing, Likert Scales, Accuracy

Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www-wiley-com.bibliotheek.ehb.be/en-us

Publication Type: Journal Articles; Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: N/A

Authoring Institution: N/A

Grant or Contract Numbers: N/A

Author Affiliations: ¹Rush Medical College, Rush University, Chicago, Illinois, USA; ²Academic Affairs, Rush University, Chicago, Illinois, USA; ³Department of Anatomy and Cell Biology, Rush Medical College, Rush University, Chicago, Illinois, USA; ⁴Department of Medical Education, Marnix E. Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA; ⁵Department of Anatomy and Cell Biology, College of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA; ⁶School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria, Australia