TY - JOUR
T1 - Fine-tuning medical language models for enhanced long-contextual understanding and domain expertise
AU - Yang, Qimin
AU - Chen, Jiexin
AU - Sun, Yue
AU - Wang, Yapeng
AU - Tan, Tao
N1 - Publisher Copyright:
© AME Publishing Company.
PY - 2025/6/6
Y1 - 2025/6/6
AB - Background: Since the emergence of large language models (LLMs), many applications have appeared in vertical fields such as medicine, law, and education. By fine-tuning a pre-trained base model, professional knowledge can be parameterized into the model, improving its performance in a specific field. However, we observed that although fine-tuned models gain domain-specific knowledge, the long-context understanding of medical LLMs (Med-LLMs) declines significantly after large-scale knowledge-intensive fine-tuning, especially compared with general language models of similar parameter counts. This study investigates this decline in the long-context performance of Med-LLMs. Methods: We designed a series of open-book medical knowledge tests to evaluate the long-context understanding of models fine-tuned with different methods. The experiments included benchmarking general language models, benchmarking medical language models, and varying the ratio and volume of general and domain-specific data during fine-tuning to determine the data composition that best balances long-context performance against domain knowledge. Results: Our experimental framework evaluated 5 general-purpose LLMs and 6 medical-adapted models through an open-book knowledge assessment protocol. The results revealed a striking performance hierarchy: even the lowest-performing general model (37.52% accuracy) outperformed non-retrained medical baselines (34.65% peak accuracy). However, medical models employing our optimized fine-tuning strategies demonstrated marked accuracy gains, with maximum improvements reaching 13.5 percentage points. Notably, retrained medical specialists such as IvyGPT (40.48%) and WiNGPT2 (38.94%) surpassed several general models of larger parameter scales, establishing new performance benchmarks in medical context processing. Our experiments on fine-tuning data volume revealed a critical saturation threshold near 100,000 domain-specific samples. As models approached this boundary, they exhibited instability in contextual understanding and even performance regression, and further fine-tuning beyond this point failed to yield measurable improvements in long-context comprehension. This suggests an inherent limitation in scaling domain-specific knowledge integration through continued data exposure alone. Conclusions: The composition and quantity of fine-tuning data directly affect a model’s ability to understand context in downstream tasks. The balance between domain expertise and context understanding depends on a well-chosen mix of fine-tuning data.
KW - Large language model (LLM)
KW - artificial intelligence
KW - big data
KW - fine-tuning
KW - medical model
UR - http://www.scopus.com/inward/record.url?scp=105007538614&partnerID=8YFLogxK
U2 - 10.21037/qims-2024-2655
DO - 10.21037/qims-2024-2655
M3 - Article
AN - SCOPUS:105007538614
SN - 2223-4292
VL - 15
SP - 5450
EP - 5462
JO - Quantitative Imaging in Medicine and Surgery
JF - Quantitative Imaging in Medicine and Surgery
IS - 6
ER -