Robust Sentiment and Semantic Analysis of Small and Medium-Sized News Headline Datasets: A Study on Sports, Science, and Agricultural Domains

Research output: Contribution to journal › Article › peer-review

Abstract

This study investigates the application of deep learning and advanced machine learning techniques to sentiment analysis and thematic clustering in small and medium-sized news datasets. Sentiment and semantic scoring was first performed with GPT-4, customized with domain-specific prompts to capture nuanced terminology, which represents a novel application to small, domain-specific news datasets. Based on these scores, the data were labeled as positive, negative, or neutral. For semantic classification, a TF-IDF-SVM-OvR model with a linear kernel was developed, incorporating feature engineering tailored to low-resource, small-domain datasets and imbalance-aware one-vs-rest (OvR) classification. Its performance was compared against six traditional machine learning models (Random Forest, GBM, KNN, Logistic Regression, Naive Bayes, and SVM-OvR with RBF kernel) and five deep learning models (TextCNN, LSTM-CNN, MobileNet, TinyBERT, and OfficialSimpleTransformer) across four datasets (Sports, Science, Agriculture, and Mixed). The linear SVM-OvR consistently achieved the highest test accuracies (81.8–87.1%) and F1 scores (81.2–85.5%), significantly outperforming the baselines (p < 0.05), while maintaining moderate training times (6.9–75.6 s) and small model sizes (0.47–1.44 MB). Additionally, the Qwen2-Birch combination was employed for thematic clustering, effectively capturing nuanced sentiment and topics. These results highlight the practical value of this integrated approach for small, domain-specific datasets, emphasizing robustness, efficiency, and reproducibility.
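
To make the classification step concrete, the following is a minimal sketch of a TF-IDF plus linear SVM one-vs-rest pipeline, assuming scikit-learn. It is not the authors' implementation: the toy headlines, their labels, and every hyperparameter (the n-gram range, `sublinear_tf`, `C`, and `class_weight="balanced"` as one possible reading of "imbalance-aware") are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy headlines and labels, invented for illustration only.
headlines = [
    "Local team wins championship after dramatic comeback",
    "Star striker suffers season-ending injury",
    "League publishes fixture list for next season",
    "New vaccine shows strong results in early trial",
    "Drought destroys wheat harvest across the region",
    "Agency schedules satellite launch for Tuesday",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# TF-IDF features feed one linear SVM per class (one-vs-rest).
# class_weight="balanced" reweights classes by inverse frequency,
# an assumed stand-in for the paper's imbalance handling.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    OneVsRestClassifier(
        LinearSVC(C=1.0, class_weight="balanced", random_state=0)
    ),
)
model.fit(headlines, labels)

# Predict sentiment labels for unseen headlines.
preds = model.predict([
    "Team celebrates title win",
    "Flood ruins rice harvest",
])
print(list(preds))
```

`LinearSVC` already applies a one-vs-rest scheme for multiclass input, so the explicit `OneVsRestClassifier` wrapper here mainly makes the strategy visible and exposes one fitted estimator per class.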

Original language: English
Pages (from-to): 3852-3896
Number of pages: 45
Journal: IEEE Access
Volume: 14
Publication status: Published - 2026

Keywords

  • Deep learning
  • Qwen2-Birch
  • TF-IDF-SVM-OvR
  • agriculture news
  • clustering analysis
  • machine learning
  • science news
  • sentiment classification
  • small and medium-sized datasets
  • sports news
