Imbalanced data classification using improved clustering algorithm and under-sampling method

Lu Cao, Hong Shen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

13 Citations (Scopus)

Abstract

Imbalanced classification is an active research topic in data mining and machine learning. Traditional classification algorithms assume some form of symmetry in the class distribution and aim to improve overall classification performance, so they struggle to produce good results on imbalanced datasets. To improve classification performance on imbalanced datasets, this paper proposes a cluster-based under-sampling algorithm (CUS) that exploits a key property of support vector machine (SVM) classification: the decision boundary depends only on the support vectors. First, the majority class is divided into clusters using an improved clustering by fast search and find of density peaks (CFSFDP) algorithm. The improved algorithm selects cluster centers automatically, which overcomes a limitation of the original algorithm. Next, the minority class is combined with each majority-class cluster in turn to form a training set, and an SVM is trained on it to obtain the support vectors of that cluster. The support vectors of each cluster are retained and the non-support vectors are deleted, yielding a new, smaller set of majority-class samples and a relatively balanced dataset. Finally, the new dataset is classified with an SVM and performance is evaluated by cross-validation. The experimental results show that the CUS algorithm is effective.
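
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, substitutes KMeans for the improved CFSFDP clustering (whose details the abstract does not give), and uses a hypothetical helper name `cluster_undersample`, a fixed cluster count, and a synthetic dataset chosen only for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def cluster_undersample(X, y, majority=0, n_clusters=3, seed=0):
    """Keep minority points plus majority points that are support vectors
    of at least one cluster-vs-minority SVM (hypothetical helper)."""
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)

    # Stand-in clustering: the paper uses an improved CFSFDP, not KMeans.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X[maj_idx])

    keep = set()
    for c in range(n_clusters):
        cluster_idx = maj_idx[labels == c]
        train_idx = np.concatenate([cluster_idx, min_idx])
        svm = SVC(kernel="rbf", gamma="scale").fit(X[train_idx], y[train_idx])
        # svm.support_ indexes the training subset; map back to original rows.
        sv_idx = train_idx[svm.support_]
        keep.update(sv_idx[y[sv_idx] == majority].tolist())

    new_idx = np.concatenate([np.array(sorted(keep), dtype=int), min_idx])
    return X[new_idx], y[new_idx]


# Synthetic imbalanced data (about 90% majority), assumed for illustration only.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_bal, y_bal = cluster_undersample(X, y)

# Final step: classify the reduced dataset with an SVM under cross-validation.
scores = cross_val_score(SVC(kernel="rbf", gamma="scale"),
                         X_bal, y_bal, cv=5, scoring="f1")
print("majority before/after:", int((y == 0).sum()), int((y_bal == 0).sum()))
print("mean F1:", scores.mean())
```

The kernel, the number of clusters, and the F1 metric are arbitrary choices here; the abstract does not state which settings or evaluation measures the paper used.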

Original language: English
Title of host publication: Proceedings - 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
Editors: Hui Tian, Hong Shen, Wee Lum Tan
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 358-363
Number of pages: 6
ISBN (Electronic): 9781728126166
Publication status: Published - Dec 2019
Externally published: Yes
Event: 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019 - Gold Coast, Australia
Duration: 5 Dec 2019 to 7 Dec 2019

Publication series

Name: Proceedings - 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019

Conference

Conference: 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
Country/Territory: Australia
City: Gold Coast
Period: 5/12/19 to 7/12/19

Keywords

  • Classification
  • Clustering by fast search and find of density peaks
  • Imbalanced dataset
  • Support vector machine
  • Under-sampling
