Improved data streams classification with fast unsupervised feature selection

Lulu Wang, Hong Shen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Citations (Scopus)

Abstract

Data stream classification poses three major challenges: infinite length, concept-drift, and feature-evolution. The first two issues have been widely studied, but most existing data stream classification techniques ignore the last one. DXMiner [17], the first model to address feature-evolution, uses past labeled instances to select the top-ranked features based on scores computed by a formula. This semi-supervised feature selection method depends on the quality of the past classification and neglects possible correlation among different features. It is therefore unable to produce an optimal feature subset, which degrades classification accuracy. Multi-Cluster Feature Selection (MCFS) [5], proposed for static data classification and clustering, applies unsupervised feature selection to address the feature-evolution problem, but suffers from high computational cost in feature selection. In this paper, we apply MCFS in the DXMiner framework to handle each window of data in a data stream for dynamic data stream classification. With unsupervised feature selection, our method produces the optimal feature subset and hence improves on DXMiner's classification accuracy. We further improve the time complexity of the feature selection process in MCFS by using the locality-sensitive hashing forest (LSH Forest) [4]. The empirical results indicate that our approach outperforms state-of-the-art stream classification techniques in classifying real-life data streams.
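The per-window idea in the abstract (select a feature subset for each window of the stream without using labels) can be illustrated with a minimal sketch. This is not the paper's method: the score below is plain per-feature variance, used only as a simple stand-in for MCFS, and no LSH Forest acceleration is shown. The names `windows`, `select_features`, the window size, and the toy stream are all hypothetical choices made for illustration.

```python
from statistics import pvariance
from typing import Iterable, Iterator, List, Sequence


def windows(stream: Iterable[Sequence[float]], size: int) -> Iterator[List[Sequence[float]]]:
    """Yield consecutive, non-overlapping windows of up to `size` instances."""
    chunk: List[Sequence[float]] = []
    for instance in stream:
        chunk.append(instance)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter window
        yield chunk


def select_features(window: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k highest-scoring features in one window.

    Stand-in score (an assumption, not MCFS): the population variance of
    each feature column. No class labels are used, so the selection is
    unsupervised and is recomputed for every window.
    """
    n_features = len(window[0])
    scores = [pvariance([row[j] for row in window]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy stream: feature 0 is informative in the first window, feature 2 in the second.
stream = [
    (0.0, 1.0, 5.0), (10.0, 1.0, 5.0), (0.0, 1.1, 5.0), (10.0, 0.9, 5.0),
    (3.0, 1.0, 0.0), (3.0, 1.0, 8.0), (3.0, 1.2, 0.0), (3.0, 0.8, 8.0),
]

# One selected feature subset per window.
selected = [select_features(w, k=1) for w in windows(stream, size=4)]
```

In this toy run the selected subset changes from one window to the next, which is the behaviour a feature-evolving stream calls for. A classifier trained on each window would then see only the columns listed in that window's subset.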

Original language: English
Title of host publication: Proceedings - 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2016
Editors: Hong Shen, Hong Shen, Yingpeng Sang, Hui Tian
Publisher: IEEE Computer Society
Pages: 221-226
Number of pages: 6
ISBN (Electronic): 9781509050819
Publication status: Published - 2 Jul 2016
Externally published: Yes
Event: 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2016 - Guangzhou, China
Duration: 16 Dec 2016 to 18 Dec 2016

Publication series

Name: Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings
Volume: 0

Conference

Conference: 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2016
Country/Territory: China
City: Guangzhou
Period: 16/12/16 to 18/12/16
