TY - GEN
T1 - Improved data streams classification with fast unsupervised feature selection
AU - Wang, Lulu
AU - Shen, Hong
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Data stream classification poses three major challenges: infinite length, concept drift, and feature evolution. The first two issues have been widely studied. However, most existing data stream classification techniques ignore the last one. DXMiner [17] is the first model that addresses feature evolution, using past labeled instances to select the top-ranked features based on scores computed by a formula. This semi-supervised feature selection method depends on the quality of past classification and neglects possible correlation among different features; it is thus unable to produce an optimal feature subset, which deteriorates classification accuracy. Multi-Cluster Feature Selection (MCFS) [5], proposed for static data classification and clustering, applies unsupervised feature selection to address the feature-evolution problem, but suffers from high computational cost in feature selection. In this paper, we apply MCFS in the DXMiner framework to handle each window of data in a data stream for dynamic data stream classification. With unsupervised feature selection, our method produces the optimal feature subset and hence improves on DXMiner's classification accuracy. We further improve the time complexity of the feature selection process in MCFS by using the locality-sensitive hashing forest (LSH Forest) [4]. The empirical results indicate that our approach outperforms state-of-the-art stream classification techniques in classifying real-life data streams.
AB - Data stream classification poses three major challenges: infinite length, concept drift, and feature evolution. The first two issues have been widely studied. However, most existing data stream classification techniques ignore the last one. DXMiner [17] is the first model that addresses feature evolution, using past labeled instances to select the top-ranked features based on scores computed by a formula. This semi-supervised feature selection method depends on the quality of past classification and neglects possible correlation among different features; it is thus unable to produce an optimal feature subset, which deteriorates classification accuracy. Multi-Cluster Feature Selection (MCFS) [5], proposed for static data classification and clustering, applies unsupervised feature selection to address the feature-evolution problem, but suffers from high computational cost in feature selection. In this paper, we apply MCFS in the DXMiner framework to handle each window of data in a data stream for dynamic data stream classification. With unsupervised feature selection, our method produces the optimal feature subset and hence improves on DXMiner's classification accuracy. We further improve the time complexity of the feature selection process in MCFS by using the locality-sensitive hashing forest (LSH Forest) [4]. The empirical results indicate that our approach outperforms state-of-the-art stream classification techniques in classifying real-life data streams.
UR - http://www.scopus.com/inward/record.url?scp=85022044344&partnerID=8YFLogxK
U2 - 10.1109/PDCAT.2016.056
DO - 10.1109/PDCAT.2016.056
M3 - Conference contribution
AN - SCOPUS:85022044344
T3 - Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings
SP - 221
EP - 226
BT - Proceedings - 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2016
A2 - Shen, Hong
A2 - Sang, Yingpeng
A2 - Tian, Hui
PB - IEEE Computer Society
T2 - 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2016
Y2 - 16 December 2016 through 18 December 2016
ER -