TY - JOUR
T1 - Pipeline-optimized machine learning for chronic fatigue syndrome diagnosis
T2 - A lightweight, interpretable model using blood biochemical and metabolomic data
AU - Li, Junrong
AU - Cao, Hanyu
AU - Zhu, Zirun
AU - Zhai, Xiaobing
AU - Xing, Abao
AU - Zeng, Shuowen
AU - Luo, Gang
AU - Sha, Yuyang
AU - Li, Peng
AU - Li, Kefeng
N1 - Publisher Copyright: © 2026 Elsevier Ltd
PY - 2026/8
Y1 - 2026/8
N2 - Introduction and Background: Chronic fatigue syndrome (CFS) is a debilitating multisystem disorder with persistent fatigue and functional impairment, yet remains underdiagnosed due to symptom heterogeneity and the lack of objective biomarkers. Developing a lightweight, interpretable diagnostic model requires systematic optimization of the entire analytical pipeline—from control group selection to biomarker identification and model construction. Method: We developed a comprehensive pipeline optimization framework using UK Biobank metabolomic and blood biochemical data (1137 CFS cases; 66,838 controls). Unlike previous studies, our control group included both healthy individuals and patients with CFS-overlapping conditions. We employed stratified bootstrap sampling (1000 iterations) instead of traditional random sampling to ensure balanced covariate distributions between cases and controls. Our systematic approach compared 7 missing value imputation methods, 9 feature selection techniques, and 11 machine learning/deep learning models. Feature selection incorporated collinearity exclusion and sequential forward selection to identify the 10 most influential biomarkers. Model evaluation extended beyond standard metrics (ROC-AUC, accuracy, sensitivity, specificity, F1-score, NPV, and PPV) to include Matthews Correlation Coefficient (MCC) for comprehensive performance assessment. We enhanced model interpretability through both Mendelian randomization (MR) for causal inference and SHAP (SHapley Additive exPlanations) analysis for feature contribution quantification. Clinical utility was evaluated using decision curve analysis (DCA), with additional validation through Spearman's correlation and restricted cubic spline (RCS) analyses examining biomarker relationships with core CFS symptoms. 
Results: The optimized pipeline yielded a lightweight model combining Bayesian Principal Component Analysis (BPCA) imputation, NearMiss undersampling, and random forest classification using only 10 biomarkers plus three covariates (BMI, age, and gender). This model achieved exceptional diagnostic performance (accuracy = 0.939, ROC-AUC = 0.979, MCC = 0.878, Balanced Performance Score = 0.859 across 11 metrics), effectively discriminating CFS from both healthy controls and overlapping conditions. DCA demonstrated substantial net clinical benefit across a wide threshold range (0.01–0.98), confirming strong clinical applicability. MR analysis established causal relationships for six biomarkers (urea, total protein, glucose, total bilirubin, leucine, vitamin D; P < 0.05). SHAP-based interpretability analysis, corroborated by Spearman's correlation and RCS analyses, revealed that elevated glucose and leucine levels exacerbated CFS symptoms, providing mechanistic insights aligned with personalized risk directionality. Conclusion: Through systematic pipeline optimization—from stratified control selection to comprehensive model comparison and multi-faceted interpretability analysis—we developed a lightweight, highly interpretable CFS diagnostic model using exclusively objective biomarkers. To ensure reproducibility, this methodology was implemented via the ClinMetML framework.
KW - Biomarker Identification
KW - Chronic fatigue syndrome
KW - Deep Learning
KW - Interpretable AI
KW - Machine learning
KW - Risk Prediction Model
UR - https://www.scopus.com/pages/publications/105031774486
U2 - 10.1016/j.compbiolchem.2026.108995
DO - 10.1016/j.compbiolchem.2026.108995
M3 - Article
AN - SCOPUS:105031774486
SN - 1476-9271
VL - 123
JO - Computational Biology and Chemistry
JF - Computational Biology and Chemistry
M1 - 108995
ER -