Selecting Representative Samples from Malware Datasets
Lukáš Děd and
Martin Jureček
Additional contact information
Lukáš Děd: Czech Technical University in Prague, Faculty of Information Technology
Martin Jureček: Czech Technical University in Prague, Faculty of Information Technology
A chapter in Machine Learning, Deep Learning and AI for Cybersecurity, 2025, pp 113-142 from Springer
Abstract:
This work focuses on selecting representative instances for the training set in malware detection. In contrast to random instance selection, instance selection algorithms aim to remove noise and redundancy while preserving the data relevant to the task. Experiments were conducted on two publicly available datasets containing metadata of Windows PE files: EMBER and SOREL-20M. The theoretical part describes the data preprocessing methods, instance selection algorithms, and classification algorithms used in the practical part. The practical part outlines the preprocessing of the datasets and the main experiments, which compare state-of-the-art instance selection algorithms. Modifications to the parallel instance selection algorithm PIF were also proposed, implemented, and experimentally evaluated against the state-of-the-art instance selection algorithms. Some of the modified versions ranked among the best in terms of both reduction level and the ratio between accuracy and the size of the reduced set. The best of the modified versions was RPIF-AllKNN, which reduced the entire SOREL-20M training set to 6.24% of its original size with an accuracy loss of 2.1%. Its ratio between accuracy and reduced-set size was 14.43, the best among the compared algorithms.
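To illustrate what an instance selection step does, here is a minimal Python sketch of a generic noise filter in the style of Edited Nearest Neighbours. It is not the chapter's PIF or RPIF-AllKNN algorithm, whose details the abstract does not give. The toy data, the function names, and the definition of the accuracy-to-size ratio (accuracy in percent divided by reduced-set size in percent) are all assumptions made for this sketch.

```python
import math
from collections import Counter


def knn_label(point, data, k, skip_index=None):
    """Return the majority label among the k nearest neighbours of `point`.

    `data` is a list of (features, label) pairs; `skip_index` excludes one
    instance (typically the query point itself) from the neighbour search.
    """
    neighbours = sorted(
        (math.dist(point, x), y)
        for i, (x, y) in enumerate(data)
        if i != skip_index
    )[:k]
    return Counter(y for _, y in neighbours).most_common(1)[0][0]


def edited_nearest_neighbours(data, k=3):
    """Keep only instances whose label agrees with their k nearest neighbours.

    This removes instances that look like label noise; it is a generic
    filter used here for illustration, not the method from the chapter.
    """
    return [
        (x, y)
        for i, (x, y) in enumerate(data)
        if knn_label(x, data, k, skip_index=i) == y
    ]


def accuracy_size_ratio(accuracy_pct, reduced_size_pct):
    """Assumed metric: accuracy (%) divided by reduced-set size (%)."""
    return accuracy_pct / reduced_size_pct


# Hypothetical toy data: two well-separated clusters plus one noisy label.
data = [
    ((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0), ((0.1, 0.1), 0),
    ((5.0, 5.0), 1), ((5.1, 5.0), 1), ((5.0, 5.1), 1), ((5.1, 5.1), 1),
    ((0.05, 0.05), 1),  # labelled 1 but sits inside the class-0 cluster
]

reduced = edited_nearest_neighbours(data, k=3)
reduced_size_pct = 100.0 * len(reduced) / len(data)
print(f"kept {len(reduced)} of {len(data)} instances ({reduced_size_pct:.1f}%)")
```

On this toy set the filter drops only the mislabelled point. The instance selection algorithms compared in the chapter aim for far stronger reduction, so the sketch shows the general idea rather than a comparable result.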
Date: 2025
Persistent link: https://EconPapers.repec.org/RePEc:spr:sprchp:978-3-031-83157-7_5
Ordering information: This item can be ordered from
http://www.springer.com/9783031831577
DOI: 10.1007/978-3-031-83157-7_5