TY - JOUR
T1 - Reasoning or not? A comprehensive evaluation of reasoning LLMs for dialogue summarization
AU - Jin, Keyan
AU - Wang, Yapeng
AU - Santos, Leonel
AU - Fang, Tao
AU - Yang, Xu
AU - Im, Sio Kei
AU - Oliveira, Hugo Gonçalo
N1 - Publisher Copyright:
© 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
PY - 2025
Y1 - 2025
AB - Despite the rapid progress in reasoning Large Language Models, their efficacy in dialogue summarization remains a critical, underexplored area, as this task requires a delicate balance of abstraction, faithfulness, and conciseness. To address this gap, we present the first large-scale, systematic evaluation of leading reasoning LLMs against their direct non-reasoning counterparts. Our rigorous framework covers three core paradigms of generic, role-oriented, and query-oriented summarization, and is tested on four diverse benchmark datasets spanning multiple languages and contexts. Our multi-perspective evaluation consistently demonstrates that, rather than conferring an advantage, the explicit reasoning processes in current models often hinder summarization quality. We find that reasoning models systematically produce longer, less faithful summaries that exhibit higher novelty but lower source coverage, deviating significantly from human summarization styles. Moving beyond performance metrics, we provide a deep diagnostic of the root causes for these failures through a novel, human-annotated error analysis. We identify a critical trade-off where one class of models suffers from structural inefficiency, characterized by verbose and redundant reasoning, while another, though more concise, is prone to multifaceted errors involving logical and factual fallacies. These findings reveal a fundamental conflict between the verbose, step-by-step nature of current reasoning architectures and the high-level abstraction required for summarization, offering crucial insights for designing future models that can effectively bridge logical deduction with concise synthesis.
KW - Dialogue summarization
KW - LLMs evaluation
KW - Large language models
UR - https://www.scopus.com/pages/publications/105020737080
U2 - 10.1016/j.eswa.2025.129831
DO - 10.1016/j.eswa.2025.129831
M3 - Article
AN - SCOPUS:105020737080
SN - 0957-4174
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 129831
ER -