Robust Sentiment and Semantic Analysis of Small and Medium-Sized News Headline Datasets: A Study on Sports, Science, and Agricultural Domains

Research output: Article, peer-reviewed

Abstract

This study investigates the application of deep learning and advanced machine learning techniques to sentiment analysis and thematic clustering in small and medium-sized news datasets. Sentiment and semantic scoring were first performed with GPT-4, customized with domain-specific prompts to capture nuanced terminology; this represents a novel application to small, domain-specific news datasets. Based on these scores, the data were labeled as positive, negative, or neutral. For semantic classification, a TF-IDF-SVM-OvR model with a linear kernel was developed, incorporating feature engineering tailored to low-resource, small-domain datasets and imbalance-aware one-vs-rest (OvR) classification. Its performance was compared against six traditional machine learning models (Random Forest, GBM, KNN, Logistic Regression, Naive Bayes, and SVM-OvR with RBF kernel) and five deep learning models (TextCNN, LSTM-CNN, MobileNet, TinyBERT, OfficialSimpleTransformer) across four datasets (Sports, Science, Agriculture, and Mixed). The linear SVM-OvR consistently achieved the highest test accuracies (81.8–87.1%) and F1 scores (81.2–85.5%), significantly outperforming the baselines (p < 0.05), while maintaining moderate training times (6.9–75.6 s) and small model sizes (0.47–1.44 MB). Additionally, the Qwen2-Birch combination was employed for thematic clustering, effectively capturing nuanced sentiment and topics. These results highlight the practical value of this integrated approach for small, domain-specific datasets, emphasizing robustness, efficiency, and reproducibility.
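The TF-IDF-SVM-OvR classifier described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy headlines, their labels, the n-gram range, `sublinear_tf`, `C`, and the use of `class_weight="balanced"` as the imbalance-aware component are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy headlines and labels, invented for illustration (not the study's data).
headlines = [
    "Local team wins championship in thrilling final",
    "Star striker celebrates record-breaking season",
    "Researchers announce breakthrough in crop yields",
    "Drought destroys wheat harvest across the region",
    "Team suffers crushing defeat after injury crisis",
    "Lab accident halts promising vaccine trial",
    "Council schedules meeting on irrigation policy",
    "Observatory publishes annual survey of nearby stars",
    "League releases fixture list for next season",
]
labels = [
    "positive", "positive", "positive",
    "negative", "negative", "negative",
    "neutral", "neutral", "neutral",
]


def build_model():
    """TF-IDF features feeding a one-vs-rest linear SVM.

    All hyperparameters here are assumed values, not those reported in the paper.
    """
    return make_pipeline(
        # Unigrams and bigrams with sublinear term-frequency scaling (assumption).
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        # One binary linear SVM per class; balanced class weights are one
        # common way to make the classifier imbalance-aware (assumption).
        OneVsRestClassifier(LinearSVC(class_weight="balanced", C=1.0)),
    )


model = build_model()
model.fit(headlines, labels)

# Predict sentiment labels for two unseen, hypothetical headlines.
predictions = model.predict([
    "Farmers celebrate record harvest",
    "Committee schedules policy meeting",
])
```

A linear kernel on sparse TF-IDF features keeps the fitted model to one weight vector per class, which is consistent with the small model sizes the abstract reports.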

Original language: English
Pages (from - to): 3852-3896
Number of pages: 45
Journal: IEEE Access
Volume: 14
Publication status: Published - 2026
