Large language models fall short in classifying learners’ open-ended responses

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Generative Artificial Intelligence (GenAI), based on large language models (LLMs), excels in various language comprehension tasks and is increasingly utilized in applied linguistics research. This study examines the accuracy and methodological implications of using LLMs to classify open-ended responses from learners. We surveyed 143 Japanese university students studying English as a foreign language (EFL) about their essay-writing process. Two human evaluators independently classified the students’ responses based on self-regulated learning processes: planning, monitoring, and evaluation. At the same time, several LLMs performed the same classification task, and their results were compared with those of the human evaluators using Cohen's kappa coefficient. We established κ ≥ 0.8 as the threshold for strong agreement based on rigorous methodological standards. Our findings revealed that even the best-performing model (DeepSeek-V3) achieved only moderate agreement (κ = 0.68), while other models demonstrated fair-to-moderate agreement (κ = 0.37–0.61). Surprisingly, open-source models outperformed several commercial counterparts. These results highlight the necessity of expert oversight when integrating GenAI as a support tool in qualitative data analysis. The paper concludes by discussing the methodological implications for using LLMs in qualitative research and proposing specific prompt engineering strategies to enhance their reliability in applied linguistics.
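The agreement statistic used in the study, Cohen's kappa, compares observed rater agreement against the agreement expected by chance from each rater's label frequencies. The following is a minimal illustrative sketch, not the authors' code; the example labels and category names are hypothetical, chosen to match the three self-regulated learning categories mentioned in the abstract.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters coding the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labelled identically.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal frequency per label.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical toy data: human codes vs. LLM codes for six responses.
human = ["planning", "monitoring", "evaluation", "planning", "monitoring", "planning"]
llm   = ["planning", "monitoring", "planning",   "planning", "evaluation", "planning"]
print(round(cohens_kappa(human, llm), 2))  # → 0.43
```

Under the paper's threshold (κ ≥ 0.8 for strong agreement), a value like 0.43 would count only as moderate agreement, which is the range most of the evaluated models fell into.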

Original language: English
Article number: 100210
Journal: Research Methods in Applied Linguistics
Volume: 4
Issue number: 2
Publication status: Published - Aug 2025

Keywords

  • Coding and classification
  • Generative AI
  • Large language models (LLM)
  • Qualitative analysis
  • Research methods
