Abstract
This paper addresses the optimization of machine learning (ML) training data through the construction of compact data. It introduces the concept of compact data design, which aims to create an optimized dataset that maximizes benefits without the need to manage a vast amount of complex data. Improved methods for optimizing ML training have been incorporated into the development of artificial intelligence (AI) systems. The paper treats the understanding of ML training datasets as a facet of Explainable AI (XAI), which aims to make AI comprehensible to humans. Among XAI methods, the evaluation of input feature importance stands out as a way to enhance the accuracy of complex ML models. Compact data design is proposed as a novel method for optimizing ML training through dataset reduction. The performance of an ML-based malware detection system and of its variant using compact data was assessed, and both maintained 99% accuracy. Applying a 76% reduced input dataset could maximize the speed of ML training with the compact data design, which suggests that an ML system trained in this manner could achieve statistically equivalent accuracy with only 57% of the original data sample size.
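The abstract does not state how the reduction is carried out, so the following is only a minimal sketch of the general idea of importance-based dataset reduction, not the paper's method. It assumes the reduction drops low-importance input columns, uses a simple class-mean gap as a stand-in importance score, a nearest-centroid classifier as a stand-in model, and synthetic data rather than malware samples. All function names and parameters are hypothetical.

```python
import random
from statistics import mean, pstdev

def make_data(n_samples, n_informative, n_noise, shift, rng):
    """Synthetic two-class data: only the first n_informative columns carry signal."""
    rows, labels = [], []
    for i in range(n_samples):
        label = i % 2
        informative = [rng.gauss(shift * label, 1.0) for _ in range(n_informative)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        rows.append(informative + noise)
        labels.append(label)
    return rows, labels

def feature_scores(rows, labels):
    """Stand-in importance: gap between class means relative to the column's spread."""
    scores = []
    for j in range(len(rows[0])):
        col0 = [r[j] for r, y in zip(rows, labels) if y == 0]
        col1 = [r[j] for r, y in zip(rows, labels) if y == 1]
        spread = pstdev(col0 + col1) or 1.0
        scores.append(abs(mean(col1) - mean(col0)) / spread)
    return scores

def select_columns(rows, keep):
    """Build the compact dataset by keeping only the chosen columns."""
    return [[r[j] for j in keep] for r in rows]

def fit_centroids(rows, labels):
    """Nearest-centroid 'training': one mean vector per class."""
    centroids = {}
    for y in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == y]
        centroids[y] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
X, y = make_data(600, 4, 12, 3.0, rng)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

# Rank columns on the training split only, then keep the top 4 of 16.
scores = feature_scores(X_train, y_train)
keep = sorted(sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:4])

full_model = fit_centroids(X_train, y_train)
compact_model = fit_centroids(select_columns(X_train, keep), y_train)

acc_full = accuracy(full_model, X_test, y_test)
acc_compact = accuracy(compact_model, select_columns(X_test, keep), y_test)
reduction = 1 - len(keep) / len(X[0])
print(f"kept columns {keep}, reduction {reduction:.0%}")
print(f"accuracy full={acc_full:.3f} compact={acc_compact:.3f}")
```

On this toy data the compact model sees a quarter of the original columns. Whether accuracy is preserved on real data depends on how much signal the discarded columns carry, which is the question the paper evaluates for malware detection.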
Original language | English |
---|---|
Pages (from-to) | 115296-115305 |
Number of pages | 10 |
Journal | IEEE Access |
Volume | 12 |
Publication status | Published - 2024 |
Keywords
- Compact data
- artificial intelligence
- data complexity
- data reduction
- machine learning
- malware
- robust classification
- security
- supervised learning