A Machine Learning-Based Framework with Enhanced Feature Selection and Resampling for Improved Intrusion Detection

Malik, Fazila; Khan, Qazi Waqas; Rizwan, Atif; Alnashwan, Rana; Atteia, Ghada

A Machine Learning-Based Framework with Enhanced Feature Selection and Resampling for Improved Intrusion Detection

Fazila Malik, Qazi Waqas Khan, Atif Rizwan, Rana Alnashwan () and Ghada Atteia
Additional contact information
Fazila Malik: Department of Computer Science, Iqra University Islamabad, Islamabad 44000, Pakistan
Qazi Waqas Khan: Department of Computer Engineering, Jeju National University, Jejusi 63243, Republic of Korea
Atif Rizwan: Department of Computer Engineering, Jeju National University, Jejusi 63243, Republic of Korea
Rana Alnashwan: Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
Ghada Atteia: Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia

Mathematics, 2024, vol. 12, issue 12, 1-25

Abstract: Intrusion Detection Systems (IDSs) play a crucial role in safeguarding network infrastructures from cyber threats and ensuring the integrity of highly sensitive data. Conventional IDS technologies, although successful in achieving high levels of accuracy, frequently encounter substantial model bias. This bias is primarily caused by imbalances in the data and the lack of relevance of certain features. This study aims to tackle these challenges by proposing an advanced machine learning (ML) based IDS that minimizes misclassification errors and corrects model bias. As a result, the predictive accuracy and generalizability of the IDS are significantly improved. The proposed system employs advanced feature selection techniques, such as Recursive Feature Elimination (RFE), sequential feature selection (SFS), and statistical feature selection, to refine the input feature set and minimize the impact of non-predictive attributes. In addition, this work incorporates data resampling methods such as Synthetic Minority Oversampling Technique and Edited Nearest Neighbor (SMOTE_ENN), Adaptive Synthetic Sampling (ADASYN), and Synthetic Minority Oversampling Technique–Tomek Links (SMOTE_Tomek) to address class imbalance and improve the accuracy of the model. The experimental results indicate that our proposed model, especially when utilizing the random forest (RF) algorithm, surpasses existing models regarding accuracy, precision, recall, and F Score across different data resampling methods. Using the ADASYN resampling method, the RF model achieves an accuracy of 99.9985% for botnet attacks and 99.9777% for Man-in-the-Middle (MITM) attacks, demonstrating the effectiveness of our approach in dealing with imbalanced data distributions. This research not only improves the abilities of IDS to identify botnet and MITM attacks but also provides a scalable and efficient solution that can be used in other areas where data imbalance is a recurring problem. This work has implications beyond IDS, offering valuable insights into using ML techniques in complex real-world scenarios.

Keywords: feature selection; data resampling; intrusion detection; applied machine learning; deep learning (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Citations: View citations in EconPapers (1)

Downloads: (external link)
https://www.mdpi.com/2227-7390/12/12/1799/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/12/1799/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:12:p:1799-:d:1411729

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().