TY - JOUR
T1 - A selectively re-train approach based on clustering to classify concept-drifting data streams with skewed distribution
AU - Zhang, Dandan
AU - Shen, Hong
AU - Hui, Tian
AU - Li, Yidong
AU - Wu, Jun
AU - Sang, Yingpeng
PY - 2014
Y1 - 2014
N2 - Classification is an important and practical tool that uses a model built on historical data to predict class labels for newly arriving data. In the last few years, there have been many interesting studies on classification in data streams. However, most such studies assume that the data streams are relatively balanced and stable. In fact, skewed data streams (e.g., few positives but many negatives) are important and typical, and they appear in many real-world applications. Concept drift and skewed distributions, two common properties of data streams, make learning in streams particularly difficult, and traditional data mining algorithms no longer work. In this paper, we propose a method (Selectively Re-train Approach Based on Clustering) that can deal with concept drift and skewed distributions simultaneously. We evaluate our algorithm on both synthetic and real data sets simulating skewed data streams. Empirical results show that the proposed method yields better performance than previous work.
AB - Classification is an important and practical tool that uses a model built on historical data to predict class labels for newly arriving data. In the last few years, there have been many interesting studies on classification in data streams. However, most such studies assume that the data streams are relatively balanced and stable. In fact, skewed data streams (e.g., few positives but many negatives) are important and typical, and they appear in many real-world applications. Concept drift and skewed distributions, two common properties of data streams, make learning in streams particularly difficult, and traditional data mining algorithms no longer work. In this paper, we propose a method (Selectively Re-train Approach Based on Clustering) that can deal with concept drift and skewed distributions simultaneously. We evaluate our algorithm on both synthetic and real data sets simulating skewed data streams. Empirical results show that the proposed method yields better performance than previous work.
KW - concept-drifting
KW - data stream
KW - selectively re-train
KW - skewed distribution
UR - http://www.scopus.com/inward/record.url?scp=84901268510&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-06605-9_34
DO - 10.1007/978-3-319-06605-9_34
M3 - Conference article
AN - SCOPUS:84901268510
SN - 0302-9743
VL - 8444 LNAI
SP - 413
EP - 424
JO - Lecture Notes in Computer Science
JF - Lecture Notes in Computer Science
IS - PART 2
T2 - 18th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2014
Y2 - 13 May 2014 through 16 May 2014
ER -