TY - JOUR
T1 - Robust Sentiment and Semantic Analysis of Small and Medium-Sized News Headline Datasets
T2 - A Study on Sports, Science, and Agricultural Domains
AU - Liang, Zijun
AU - Lai, Pengyu
AU - Su, Guanpeng
AU - Wu, Rouying
AU - Tang, Su Kit
AU - Wong, Dennis
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2026
Y1 - 2026
N2 - This study investigates the application of deep learning and advanced machine learning techniques to sentiment analysis and thematic clustering in small and medium-sized news datasets. Sentiment and semantic analysis and scoring were first performed using GPT-4, customized with domain-specific prompts to capture nuanced terminology; this represents a novel application to small, domain-specific news datasets. Based on these scores, data were labeled as positive, negative, or neutral. For semantic classification, a TF-IDF-SVM-OvR model with a linear kernel was developed, incorporating feature engineering tailored to low-resource, small-domain datasets and imbalance-aware OvR classification. Its performance was compared against six traditional machine learning models (Random Forest, GBM, KNN, Logistic Regression, Naive Bayes, and SVM-OvR with RBF kernel) and five deep learning models (TextCNN, LSTM-CNN, MobileNet, TinyBERT, and OfficialSimpleTransformer) across four datasets (Sports, Science, Agriculture, and Mixed). The linear SVM-OvR consistently achieved the highest test accuracies (81.8–87.1%) and F1 scores (81.2–85.5%), significantly outperforming the baselines (p < 0.05), while maintaining moderate training times (6.9–75.6 s) and small model sizes (0.47–1.44 MB). Additionally, the Qwen2-Birch combination was employed for thematic clustering, effectively capturing nuanced sentiment and topics. These results highlight the practical value of this integrated approach for small, domain-specific datasets, emphasizing robustness, efficiency, and reproducibility.
KW - Deep learning
KW - Qwen2-Birch
KW - TF-IDF-SVM-OvR
KW - agriculture news
KW - clustering analysis
KW - machine learning
KW - science news
KW - sentiment classification
KW - small and medium-sized datasets
KW - sports news
UR - https://www.scopus.com/pages/publications/105026877080
U2 - 10.1109/ACCESS.2025.3645352
DO - 10.1109/ACCESS.2025.3645352
M3 - Article
AN - SCOPUS:105026877080
SN - 2169-3536
VL - 14
SP - 3852
EP - 3896
JO - IEEE Access
JF - IEEE Access
ER -