Fine-tuning medical language models for enhanced long-contextual understanding and domain expertise

Qimin Yang, Jiexin Chen, Yue Sun, Yapeng Wang, Tao Tan

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Since the emergence of large language models (LLMs), numerous applications have appeared in vertical fields such as medicine, law, and subject education. By fine-tuning a pre-trained base model, professional knowledge can be parameterized into the model's capabilities, yielding better performance in a specific field. However, we observed that although fine-tuning improves domain-specific knowledge, the long-context understanding of medical LLMs (Med-LLMs) declines significantly after large amounts of knowledge-intensive fine-tuning, especially compared with general language models of similar parameter counts. This study investigates this decline in the long-context performance of Med-LLMs.

Methods: We designed a series of open-book medical knowledge tests and applied them to models fine-tuned with different methods to evaluate their long-context understanding in the medical domain. These experiments included benchmarks of general language models, benchmarks of medical language models, and tests that varied the ratio and amount of general and professional data during fine-tuning, in order to determine the data composition that best optimizes professional models and balances long-context performance against domain-specific knowledge.

Results: Our experimental framework evaluated 5 general-purpose LLMs and 6 medical-adapted models through an open-book knowledge assessment protocol. The results revealed a striking performance hierarchy: even the lowest-performing general model (37.52% accuracy) outperformed non-retrained medical baselines (34.65% peak accuracy). However, medical models employing our optimized fine-tuning strategies demonstrated marked accuracy gains, with maximum improvements reaching 13.5 percentage points. Notably, retrained medical specialists such as IvyGPT (40.48%) and WiNGPT2 (38.94%) surpassed several general models of larger parameter scales, establishing new performance benchmarks in medical context processing. Our experiments on fine-tuning data volume revealed a critical saturation threshold near 100,000 domain-specific samples. When approaching this boundary, models exhibited instability in contextual understanding and even performance regression, while further fine-tuning beyond this point failed to induce measurable improvements in long-context comprehension. This suggests an inherent limitation in scaling domain-specific knowledge integration through continued data exposure alone.

Conclusions: The composition and quantity of fine-tuning data directly affect a model's ability to understand context in downstream tasks. The balance between domain expertise and context understanding depends on how sensibly the fine-tuning data is composed.
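The data-composition experiments described in the Methods can be pictured as building a fine-tuning mixture from two pools of instruction data. Below is a minimal, illustrative Python sketch (not code from the paper); the file names, record format, and the medical_ratio and max_samples parameters are assumptions, with max_samples set near the ~100,000-sample saturation point reported in the Results.

```python
# Illustrative sketch only: composing a fine-tuning mixture of general-purpose
# and medical instruction data at a chosen ratio, capped near the ~100k-sample
# saturation point the study reports. Paths and field layout are hypothetical.

import json
import random


def load_jsonl(path):
    """Load instruction-tuning records from a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def build_mixture(general, medical, medical_ratio=0.5, max_samples=100_000, seed=0):
    """Sample a fine-tuning set with the requested share of medical data.

    medical_ratio: fraction of the final set drawn from the medical corpus.
    max_samples:   total budget; the study observed instability and diminishing
                   returns as domain data approached ~100,000 samples.
    """
    rng = random.Random(seed)
    n_med = min(int(max_samples * medical_ratio), len(medical))
    n_gen = min(max_samples - n_med, len(general))
    mixture = rng.sample(medical, n_med) + rng.sample(general, n_gen)
    rng.shuffle(mixture)  # interleave general and domain samples
    return mixture


if __name__ == "__main__":
    general = load_jsonl("general_instructions.jsonl")   # hypothetical path
    medical = load_jsonl("medical_instructions.jsonl")   # hypothetical path
    train_set = build_mixture(general, medical, medical_ratio=0.6)
    with open("finetune_mix.jsonl", "w", encoding="utf-8") as f:
        for record in train_set:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Varying medical_ratio and max_samples in such a setup is one simple way to probe the trade-off between domain expertise and long-context performance that the abstract describes.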

Original language: English
Pages (from-to): 5450-5462
Number of pages: 13
Journal: Quantitative Imaging in Medicine and Surgery
Volume: 15
Issue number: 6
DOIs
Publication status: Published - 6 Jun 2025

Keywords

  • Large language model (LLM)
  • artificial intelligence
  • big data
  • fine-tuning
  • medical model
