Cyber Security Data Science: Machine Learning Methods and Their Performance on Imbalanced Datasets

Mateo Lopez-Ledezma and Gissel Velarde
Additional contact information
Mateo Lopez-Ledezma: Universidad Privada Boliviana
Gissel Velarde: IU International University of Applied Sciences

A chapter in Digital Management and Artificial Intelligence, 2025, pp 569-578 from Springer

Abstract: Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to be always alert; automation is therefore imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, and malware detection. In many cases, the positive-class samples, those that represent a problem, occur at a much lower frequency than negative samples, which poses a challenge for machine learning algorithms, since learning patterns from under-represented samples is hard. This is known in machine learning as imbalance learning. In this study, we systematically evaluate various machine learning methods using two representative financial datasets containing numerical and categorical features. The Credit Card dataset contains 283,726 samples and 31 features, and 0.2 percent of the transactions are fraudulent (imbalance ratio of 598.84:1). The PaySim dataset contains 6,362,620 samples and 11 features, and 0.13 percent of the transactions are fraudulent (imbalance ratio of 773.70:1). We present three experiments. In the first experiment, we evaluate single classifiers, including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques, including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last experiment, we evaluate Self-Paced Ensembling and its number of base classifiers. We found that imbalance learning techniques had both positive and negative effects, as reported in related studies; thus, these techniques should be applied with caution. Moreover, we found a different best performer for each dataset. We therefore recommend testing single classifiers and imbalance learning techniques for each new dataset and application involving imbalanced data, as is the case in several cybersecurity applications. We provide the code for all experiments as open source (available at https://github.com/MateoLopez00/Imbalanced-Learning-Empirical-Evaluation).
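As an illustration of the kind of evaluation described in the abstract, the sketch below is not the authors' released code (which is linked above): it trains one of the single classifiers mentioned, a Random Forest, on a synthetic imbalanced dataset and compares it with the same classifier preceded by SMOTE over-sampling, using scikit-learn and imbalanced-learn and scoring with average precision, a metric suited to heavy class imbalance. The synthetic data, split sizes, and hyperparameters are assumptions made only for this example.

# Illustrative sketch (not the authors' released code): single classifier vs.
# SMOTE + classifier on a synthetic imbalanced dataset, scored with average precision.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for a fraud dataset: roughly 0.2 percent positives,
# in the spirit of the Credit Card dataset's imbalance ratio (~599:1).
X, y = make_classification(
    n_samples=100_000, n_features=30, weights=[0.998], flip_y=0, random_state=0,
)
print(f"imbalance ratio: {(y == 0).sum() / (y == 1).sum():.2f}:1")

# Stratified split so the rare positive class appears in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "smote+random_forest": Pipeline([
        ("smote", SMOTE(random_state=0)),  # over-samples the minority class during fit only
        ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    print(f"{name}: average precision = {average_precision_score(y_test, scores):.3f}")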

Keywords: Machine Learning; Cyber Security; Classification; Imbalance Learning; Data Science
Date: 2025

There are no downloads for this item; see the EconPapers FAQ for hints about obtaining it.

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:spr:prbchp:978-3-031-88052-0_45

Ordering information: This item can be ordered from
http://www.springer.com/9783031880520

DOI: 10.1007/978-3-031-88052-0_45


More chapters in Springer Proceedings in Business and Economics from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

Handle: RePEc:spr:prbchp:978-3-031-88052-0_45