
Large language models fall short in classifying learners’ open-ended responses

  • Kansai University

Research output: Article › peer-review

7 Citations (Scopus)

Abstract

Generative Artificial Intelligence (GenAI), based on large language models (LLMs), excels in various language comprehension tasks and is increasingly utilized in applied linguistics research. This study examines the accuracy and methodological implications of using LLMs to classify open-ended responses from learners. We surveyed 143 Japanese university students studying English as a foreign language (EFL) about their essay-writing process. Two human evaluators independently classified the students’ responses based on self-regulated learning processes: planning, monitoring, and evaluation. At the same time, several LLMs performed the same classification task, and their results were compared with those of the human evaluators using Cohen's kappa coefficient. We established κ ≥ 0.8 as the threshold for strong agreement based on rigorous methodological standards. Our findings revealed that even the best-performing model (DeepSeek-V3) achieved only moderate agreement (κ = 0.68), while other models demonstrated fair-to-moderate agreement (κ = 0.37–0.61). Surprisingly, open-source models outperformed several commercial counterparts. These results highlight the necessity of expert oversight when integrating GenAI as a support tool in qualitative data analysis. The paper concludes by discussing the methodological implications for using LLMs in qualitative research and proposing specific prompt engineering strategies to enhance their reliability in applied linguistics.
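The study's agreement metric, Cohen's kappa, corrects raw agreement between two raters for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of that computation is below; the rater labels are hypothetical illustrations using the study's three self-regulated learning categories, not data from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance, estimated from each
    rater's marginal label frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a, "need paired labels"
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two marginal label distributions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[lab] / n) * (counts_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: a human coder vs. an LLM classifying responses.
human = ["planning", "monitoring", "evaluation", "planning", "monitoring"]
llm   = ["planning", "monitoring", "planning",   "planning", "evaluation"]
kappa = cohens_kappa(human, llm)  # well below the study's 0.8 threshold
```

Against the paper's κ ≥ 0.8 criterion for strong agreement, even the best model's κ = 0.68 falls short, which is the basis for the authors' call for expert oversight.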

Original language: English
Article number: 100210
Journal: Research Methods in Applied Linguistics
Volume: 4
Issue number: 2
Publication status: Published - Aug 2025
