Abstract
This study investigates the application of deep learning and advanced machine learning techniques to sentiment analysis and thematic clustering in small and medium-sized news datasets. Sentiment and semantic scoring was first performed with GPT-4, guided by domain-specific prompts designed to capture nuanced terminology; this represents a novel application to small, domain-specific news datasets. Based on these scores, the data were labeled as positive, negative, or neutral. For semantic classification, a TF-IDF-SVM-OvR (one-vs-rest) model with a linear kernel was developed, incorporating feature engineering tailored to low-resource, small-domain datasets and imbalance-aware OvR classification. Its performance was compared against six traditional machine learning models (Random Forest, GBM, KNN, Logistic Regression, Naive Bayes, and SVM-OvR with an RBF kernel) and five deep learning models (TextCNN, LSTM-CNN, MobileNet, TinyBERT, and OfficialSimpleTransformer) across four datasets (Sports, Science, Agriculture, and Mixed). The linear SVM-OvR consistently achieved the highest test accuracies (81.8–87.1%) and F1 scores (81.2–85.5%), significantly outperforming the baselines (p < 0.05), while maintaining moderate training times (6.9–75.6 s) and small model sizes (0.47–1.44 MB). Additionally, the Qwen2-Birch combination was employed for thematic clustering, effectively capturing nuanced sentiment and topics. These results highlight the practical value of this integrated approach for small, domain-specific datasets, emphasizing robustness, efficiency, and reproducibility.
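As a rough illustration of the TF-IDF-SVM-OvR idea described in the abstract, the following minimal sketch chains a TF-IDF vectorizer with one linear SVM per sentiment class using scikit-learn. It is not the authors' implementation: the toy headlines and labels are invented, the default vectorizer settings stand in for the paper's unspecified feature engineering, and `class_weight="balanced"` is only an assumed way of making the one-vs-rest classifiers imbalance-aware.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy headlines and labels, invented for illustration only;
# they are not taken from the study's datasets.
texts = [
    "team wins championship in thrilling final",
    "star striker celebrates record victory",
    "club suffers heavy defeat after injury crisis",
    "coach sacked following disastrous loss",
    "league announces fixture schedule for next season",
    "stadium to host match on saturday",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# TF-IDF features feed one linear SVM per class (one-vs-rest).
# class_weight="balanced" is an assumption standing in for the
# "imbalance-aware" classification mentioned in the abstract.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LinearSVC(class_weight="balanced")),
)
model.fit(texts, labels)

# Classify an unseen headline built from words seen during training.
predicted = model.predict(["team wins final"])
print(predicted[0])
```

A linear kernel keeps the fitted model to one weight vector per class over the TF-IDF vocabulary, which is in line with the small model sizes and moderate training times reported in the abstract.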
| Original language | English |
|---|---|
| Pages (from-to) | 3852-3896 |
| Number of pages | 45 |
| Journal | IEEE Access |
| Volume | 14 |
| DOIs | |
| Publication status | Published - 2026 |
Keywords
- Deep learning
- Qwen2-Birch
- TF-IDF-SVM-OvR
- agriculture news
- clustering analysis
- machine learning
- science news
- sentiment classification
- small and medium-sized datasets
- sports news
Title
Robust Sentiment and Semantic Analysis of Small and Medium-Sized News Headline Datasets: A Study on Sports, Science, and Agricultural Domains