ERIC Number: EJ1487745
Record Type: Journal
Publication Date: 2025-Oct
Pages: 30
Abstractor: As Provided
ISBN: N/A
ISSN: ISSN-1069-4730
EISSN: EISSN-2168-9830
Available Date: 2025-08-31
Leveraging AI-Generated Synthetic Data to Train Natural Language Processing Models for Qualitative Feedback Analysis
Journal of Engineering Education, v114 n4 e70033 2025
Background: High-quality feedback is crucial for academic success, driving student motivation and engagement, and research continues to explore its effective delivery and how students interact with it. Advances in artificial intelligence (AI), particularly natural language processing (NLP), offer innovative methods for analyzing complex qualitative data such as feedback interactions. Purpose: We developed a framework to train sentence transformers using generative AI–created synthetic data to categorize student-feedback interactions in engineering studios. We compared traditional thematic analysis with modern methods to evaluate the realism of synthetic datasets and their effectiveness in training NLP models by exploring how generative AI can aid qualitative coding. Methods: We deidentified and transcribed eight audio recordings from engineering studios. Synthetic feedback transcripts were generated using three locally hosted large language models: Llama 3.1, Gemma 2.0, and Mistral NeMo, adjusting parameters to produce datasets mimicking the real transcripts. We assessed the quality of synthetic transcripts using our framework and used a sentence transformer model (trained on both real and synthetic data) to compare changes in the model's percent accuracy when qualitatively coding feedback interactions. Results: Synthetic data improved the NLP model's performance in classifying feedback interactions, boosting the average accuracy from 68.4% to 81% with Llama 3.1. Although incorporating synthetic data improved classification, all models produced transcripts that occasionally included extraneous details and failed to capture instructor-dominant discourse. Conclusions: Synthetic data offers an opportunity to expand qualitative research, particularly in contexts where real data for NLP training is limited or hard to obtain; however, transparency in its use is paramount to maintain research integrity.
Descriptors: Artificial Intelligence, Training, Data Analysis, Natural Language Processing, Feedback (Response), Student Evaluation, Engineering Education, Evaluation Methods, Models, Coding, Sentences, Accuracy, Classification
Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Publication Type: Journal Articles; Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: EF2222434
Author Affiliations: 1Meinig School of Biomedical Engineering, Cornell University, Ithaca, New York, USA; 2School of Applied and Engineering Physics, Cornell University, Ithaca, New York, USA

Peer reviewed
