TY - JOUR
T1 - Large language models fall short in classifying learners’ open-ended responses
AU - Mizumoto, Atsushi
AU - Teng, Mark Feng
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/8
Y1 - 2025/8
N2 - Generative Artificial Intelligence (GenAI), based on large language models (LLMs), excels in various language comprehension tasks and is increasingly utilized in applied linguistics research. This study examines the accuracy and methodological implications of using LLMs to classify open-ended responses from learners. We surveyed 143 Japanese university students studying English as a foreign language (EFL) about their essay-writing process. Two human evaluators independently classified the students’ responses based on self-regulated learning processes: planning, monitoring, and evaluation. At the same time, several LLMs performed the same classification task, and their results were compared with those of the human evaluators using Cohen’s kappa coefficient. We established κ ≥ 0.8 as the threshold for strong agreement based on rigorous methodological standards. Our findings revealed that even the best-performing model (DeepSeek-V3) achieved only moderate agreement (κ = 0.68), while other models demonstrated fair-to-moderate agreement (κ = 0.37–0.61). Surprisingly, open-source models outperformed several commercial counterparts. These results highlight the necessity of expert oversight when integrating GenAI as a support tool in qualitative data analysis. The paper concludes by discussing the methodological implications for using LLMs in qualitative research and proposing specific prompt engineering strategies to enhance their reliability in applied linguistics.
AB - Generative Artificial Intelligence (GenAI), based on large language models (LLMs), excels in various language comprehension tasks and is increasingly utilized in applied linguistics research. This study examines the accuracy and methodological implications of using LLMs to classify open-ended responses from learners. We surveyed 143 Japanese university students studying English as a foreign language (EFL) about their essay-writing process. Two human evaluators independently classified the students’ responses based on self-regulated learning processes: planning, monitoring, and evaluation. At the same time, several LLMs performed the same classification task, and their results were compared with those of the human evaluators using Cohen’s kappa coefficient. We established κ ≥ 0.8 as the threshold for strong agreement based on rigorous methodological standards. Our findings revealed that even the best-performing model (DeepSeek-V3) achieved only moderate agreement (κ = 0.68), while other models demonstrated fair-to-moderate agreement (κ = 0.37–0.61). Surprisingly, open-source models outperformed several commercial counterparts. These results highlight the necessity of expert oversight when integrating GenAI as a support tool in qualitative data analysis. The paper concludes by discussing the methodological implications for using LLMs in qualitative research and proposing specific prompt engineering strategies to enhance their reliability in applied linguistics.
KW - Coding and classification
KW - Generative AI
KW - Large language models (LLM)
KW - Qualitative analysis
KW - Research methods
UR - https://www.scopus.com/pages/publications/105002832628
U2 - 10.1016/j.rmal.2025.100210
DO - 10.1016/j.rmal.2025.100210
M3 - Article
AN - SCOPUS:105002832628
SN - 2772-7661
VL - 4
JO - Research Methods in Applied Linguistics
JF - Research Methods in Applied Linguistics
IS - 2
M1 - 100210
ER -