CSS: Handling imbalanced data by improved clustering with stratified sampling

Lu Cao, Hong Shen

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

The traditional support vector machine (SVM) technique has drawbacks in dealing with imbalanced data. To address this issue, we propose an algorithm of improved clustering with stratified sampling (CSS) to improve the classification performance of SVMs on imbalanced datasets. Instead of applying a single type of sampling method, as is done in the literature, our algorithm treats different types of classes with different sampling methods. For minority classes, the algorithm oversamples by adding normally distributed noise around every support vector to generate new samples. For majority classes, samples are first divided into clusters by applying the improved clustering by fast search and find of density peaks (CFSFDP) to obtain latent structure information in each majority class; stratified sampling is then applied to extract samples from each subcluster of the majority class. Moreover, we further extend this method into an ensemble classifier that uses multiple base SVM classifiers for prediction. Experimental results on several imbalanced classification datasets show that CSS is more effective than state-of-the-art sampling methods.
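The two sampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the noise scale `sigma`, the per-vector count, and the sampling `fraction` are all assumptions made for illustration. The support vectors and cluster labels are taken as given inputs, whereas in the paper they would come from a trained SVM and the improved CFSFDP clustering, neither of which is reproduced here.

```python
import random
from collections import defaultdict


def oversample_with_noise(support_vectors, n_new_per_vector, sigma, rng):
    """Create synthetic minority samples by adding Gaussian noise
    to each (assumed, precomputed) minority-class support vector."""
    synthetic = []
    for sv in support_vectors:
        for _ in range(n_new_per_vector):
            synthetic.append([x + rng.gauss(0.0, sigma) for x in sv])
    return synthetic


def stratified_sample(samples, cluster_labels, fraction, rng):
    """Draw roughly the same fraction of samples from every cluster,
    keeping at least one sample per cluster."""
    by_cluster = defaultdict(list)
    for sample, label in zip(samples, cluster_labels):
        by_cluster[label].append(sample)
    selected = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        k = min(len(members), max(1, round(fraction * len(members))))
        selected.extend(rng.sample(members, k))
    return selected


rng = random.Random(0)

# Hypothetical minority-class support vectors (assumed, not from the paper).
minority_svs = [[0.0, 0.0], [1.0, 1.0]]
new_minority = oversample_with_noise(minority_svs, n_new_per_vector=3, sigma=0.1, rng=rng)

# Hypothetical majority class with two clusters of sizes 8 and 4.
majority = [[float(i), float(i % 3)] for i in range(12)]
labels = [0] * 8 + [1] * 4
reduced_majority = stratified_sample(majority, labels, fraction=0.5, rng=rng)

print(len(new_minority), len(reduced_majority))
```

Sampling each cluster at the same rate keeps the reduced majority class proportional to its cluster structure, which is the motivation the abstract gives for clustering before undersampling.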

Original language: English
Article number: e6071
Journal: Concurrency Computation Practice and Experience
Volume: 34
Issue number: 2
Publication status: Published - 25 Jan 2022
Externally published: Yes

Keywords

  • classification
  • clustering by fast search and find of density peaks
  • ensemble learning
  • imbalanced data
  • stratified sampling
