TY - GEN
T1 - Clustering Algorithms based Noise Identification from Air Pollution Monitoring Data
AU - Fang, Xinyi
AU - Chong, Chak Fong
AU - Yang, Xu
AU - Wang, Yapeng
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Noise detection has been widely discussed as data science has developed, yet no single method works best in every setting. In this paper, we propose a clustering-based solution to identify and remove noise from air pollution data collected with mobile portable sensors. The test dataset consists of air pollution data collected by the portable sensors across three seasons on a campus in Macao. To find the most appropriate algorithm for this task, we apply and compare six clustering algorithms: Simple K-means, Hierarchical Clustering, Cascading K-means, X-means, Expectation Maximization, and Self-Organizing Map. Their performance is evaluated by accuracy and by the best number of clusters as determined by the Silhouette Coefficient. Additionally, a J48 decision tree classifier is used to extract the key attributes and to identify the noise cluster in future unlabeled data that may contain noise. The experimental results indicate that Expectation Maximization and Cascading K-means perform best. Moreover, temperature and carbon dioxide are key attributes for identifying the noise cluster.
KW - air pollution data
KW - data clustering
KW - noise identification
KW - noise removal
KW - portable sensor
UR - http://www.scopus.com/inward/record.url?scp=85153671738&partnerID=8YFLogxK
U2 - 10.1109/CSDE56538.2022.10089276
DO - 10.1109/CSDE56538.2022.10089276
M3 - Conference contribution
AN - SCOPUS:85153671738
T3 - Proceedings of IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2022
BT - Proceedings of IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2022
Y2 - 18 December 2022 through 20 December 2022
ER -