Reasoning or not? A comprehensive evaluation of reasoning LLMs for dialogue summarization

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Despite the rapid progress in reasoning Large Language Models (LLMs), their efficacy in dialogue summarization remains a critical, underexplored area, as this task requires a delicate balance of abstraction, faithfulness, and conciseness. To address this gap, we present the first large-scale, systematic evaluation of leading reasoning LLMs against their direct non-reasoning counterparts. Our rigorous framework covers three core summarization paradigms — generic, role-oriented, and query-oriented — and is tested on four diverse benchmark datasets spanning multiple languages and contexts. Our multi-perspective evaluation consistently demonstrates that, rather than conferring an advantage, the explicit reasoning processes in current models often hinder summarization quality. We find that reasoning models systematically produce longer, less faithful summaries that exhibit higher novelty but lower source coverage, deviating significantly from human summarization styles. Moving beyond performance metrics, we provide a deep diagnostic of the root causes of these failures through a novel, human-annotated error analysis. We identify a critical trade-off: one class of models suffers from structural inefficiency, characterized by verbose and redundant reasoning, while another, though more concise, is prone to multifaceted errors involving logical and factual fallacies. These findings reveal a fundamental conflict between the verbose, step-by-step nature of current reasoning architectures and the high-level abstraction required for summarization, offering crucial insights for designing future models that can effectively bridge logical deduction with concise synthesis.

Original language: English
Article number: 129831
Journal: Expert Systems with Applications
Publication status: Accepted/In press - 2025

Keywords

  • Dialogue summarization
  • LLMs evaluation
  • Large language models

