TY - GEN
T1 - Clustering high dimensional data streams with representative points
AU - Wang, Xiujun
AU - Shen, Hong
PY - 2009
Y1 - 2009
N2 - In this paper, we propose a novel algorithm for clustering high-dimensional data streams with representative data points. The fixed-size interval partitioning adopted in traditional grid-based clustering methods cannot capture clusters in each dimension well when applied to evolving high-dimensional data streams. It may generate unnecessary dense grids that misrepresent clusters in a subspace. To overcome these drawbacks, we quantify each dimension (attribute) of the data points separately and use the generated representative data points for each dimension instead of fixed-size intervals. These representative data points are updated continuously with incoming data points so that they capture the cluster trends in each dimension more accurately than fixed-size intervals. Instead of discarding a historical data point as a whole, our algorithm confines data discarding to the attribute level, using the statistics stored in the representative data points. This enables us to keep the useful parts of data points and discard the trivial parts. Experimental results on synthetic and real data sets demonstrate the effectiveness and accuracy of the proposed method.
AB - In this paper, we propose a novel algorithm for clustering high-dimensional data streams with representative data points. The fixed-size interval partitioning adopted in traditional grid-based clustering methods cannot capture clusters in each dimension well when applied to evolving high-dimensional data streams. It may generate unnecessary dense grids that misrepresent clusters in a subspace. To overcome these drawbacks, we quantify each dimension (attribute) of the data points separately and use the generated representative data points for each dimension instead of fixed-size intervals. These representative data points are updated continuously with incoming data points so that they capture the cluster trends in each dimension more accurately than fixed-size intervals. Instead of discarding a historical data point as a whole, our algorithm confines data discarding to the attribute level, using the statistics stored in the representative data points. This enables us to keep the useful parts of data points and discard the trivial parts. Experimental results on synthetic and real data sets demonstrate the effectiveness and accuracy of the proposed method.
KW - Clustering
KW - High dimensional data streams
KW - Probability density estimation
KW - Quantification
KW - Representative data points
UR - http://www.scopus.com/inward/record.url?scp=76349091611&partnerID=8YFLogxK
U2 - 10.1109/FSKD.2009.341
DO - 10.1109/FSKD.2009.341
M3 - Conference contribution
AN - SCOPUS:76349091611
SN - 9780769537351
T3 - 6th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2009
SP - 449
EP - 453
BT - 6th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2009
T2 - 6th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2009
Y2 - 14 August 2009 through 16 August 2009
ER -