Detecting Cyber Threats in UWF-ZeekDataFall22 Using K-Means Clustering in the Big Data Environment
Sikha S. Bagui,
Germano Correa Silva De Carvalho,
Asmi Mishra,
Dustin Mink,
Subhash C. Bagui and
Stephanie Eager
Additional contact information
Sikha S. Bagui: Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA
Germano Correa Silva De Carvalho: Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA
Asmi Mishra: Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA
Dustin Mink: Department of Cybersecurity, The University of West Florida, Pensacola, FL 32514, USA
Subhash C. Bagui: Department of Mathematics and Statistics, The University of West Florida, Pensacola, FL 32514, USA
Stephanie Eager: Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA
Future Internet, 2025, vol. 17, issue 6, 1-35
Abstract:
In an era marked by the rapid growth of the Internet of Things (IoT), network security has become increasingly critical. Traditional Intrusion Detection Systems, particularly signature-based methods, struggle to identify evolving cyber threats such as Advanced Persistent Threats (APTs) and zero-day attacks. Such threats or attacks go undetected with supervised machine-learning methods. In this paper, we apply K-means clustering, an unsupervised clustering technique, to a newly created modern network attack dataset, UWF-ZeekDataFall22. Since this dataset contains labeled Zeek logs, the dataset was de-labeled before K-means clustering was applied. The labels, however, were used in the evaluation phase to determine the attack clusters post-clustering. To identify APT as well as zero-day attack clusters, three different labeling heuristics were evaluated. To address the challenges posed by Big Data, a Big Data framework, namely Apache Spark with PySpark, was used as our development environment. The uniqueness of this work also lies in its use of connection-based features. Using these features, an in-depth study was done to determine the effect of the number of clusters, the seed, and the features for each of the labeling heuristics. If the objective is to detect every single attack, the results indicate that 325 clusters with a seed of 200, using an optimal set of features, would correctly place 99% of attacks.
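The pipeline described in the abstract (de-label, cluster, then use the held-out labels to decide which clusters count as attack clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper used Apache Spark with PySpark, whereas the sketch below is plain Python so that it runs anywhere. The toy feature values, the names `kmeans`, `attack_clusters`, and `attack_recall`, and the majority-vote rule with a 0.5 threshold are all assumptions made for illustration; the abstract does not say what the three labeling heuristics are.

```python
import random
from collections import Counter

def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)          # the seed fixes the initial centers
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign

def attack_clusters(assign, labels, threshold=0.5):
    """Hypothetical heuristic: flag a cluster when more than `threshold`
    of its records carry the held-out label "attack"."""
    totals = Counter(assign)
    attacks = Counter(a for a, y in zip(assign, labels) if y == "attack")
    return {c for c in totals if attacks[c] / totals[c] > threshold}

def attack_recall(assign, labels, flagged):
    """Fraction of attack records that landed in a flagged cluster."""
    hits = [a in flagged for a, y in zip(assign, labels) if y == "attack"]
    return sum(hits) / len(hits)

# Toy, made-up connection features (two numeric columns per record).
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
          (100.0, 100.0), (101.0, 101.0), (102.0, 102.0)]
# Labels are withheld from clustering and used only for evaluation.
labels = ["benign"] * 3 + ["attack"] * 3

assign = kmeans(points, k=2, seed=200)
flagged = attack_clusters(assign, labels)
recall = attack_recall(assign, labels, flagged)
print(sorted(flagged), recall)
```

The sketch uses k=2 on six records only to stay readable; the 325 clusters reported in the abstract apply to the full dataset. Raising the number of clusters or lowering the threshold would tend to flag more clusters, which is the kind of trade-off the abstract says was studied across cluster counts, seeds, and features.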
Keywords: K-means clustering; cyber threat detection; intrusion detection system; MITRE ATT&CK framework; network security
JEL-codes: O3
Date: 2025
Downloads:
https://www.mdpi.com/1999-5903/17/6/267/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/6/267/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:6:p:267-:d:1681866
Future Internet is currently edited by Ms. Grace You
Bibliographic data for series maintained by MDPI Indexing Manager.