Cybersecurity Intelligence Through Textual Data Analysis: A Framework Using Machine Learning and Terrorism Datasets
Mohammed Salem Atoum,
Ala Abdulsalam Alarood,
Eesa Alsolami,
Adamu Abubakar,
Ahmad K. Al Hwaitat and
Izzat Alsmadi
Additional contact information
Mohammed Salem Atoum: Department of Computer Science, The University of Jordan, Amman 11942, Jordan
Ala Abdulsalam Alarood: College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
Eesa Alsolami: College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
Adamu Abubakar: Department of Computer Science, International Islamic University Malaysia, Kuala Lumpur 53100, Malaysia
Ahmad K. Al Hwaitat: Department of Computer Science, The University of Jordan, Amman 11942, Jordan
Izzat Alsmadi: Department of Computing, Engineering and Mathematical Sciences, Texas A&M University, San Antonio, TX 78224, USA
Future Internet, 2025, vol. 17, issue 4, 1-31
Abstract:
This study examines multi-lexical data sources, using a dataset extracted from an open-source corpus together with the Global Terrorism Datasets (GTDs), to predict lexical patterns that are directly linked to terrorism. This matters because specific patterns within a textual context can facilitate the identification of terrorism-related content. The methodology first generates a corpus from various published works and extracts texts pertinent to “terrorism”, then extracts additional lexical contexts from the GTDs that relate directly to terrorism. Integrating these data sources yields lexical patterns linked to terrorism. Machine learning models were trained on the resulting datasets in two primary experiments, and the results were analyzed. On the open-source data, the Extra Trees model achieved the highest accuracy (94.31%), whereas the XGBoost model demonstrated superior overall performance after tuning, with higher recall (81.32%) and F1-score (83.06%), indicating a better balance between sensitivity and precision. Similarly, on the GTD data, XGBoost consistently outperformed the other models in recall and F1-score, making it a more suitable candidate for tasks where minimizing false negatives is critical. These results suggest that specific co-occurrences and contexts can be established from multiple lexical data sources, so that multi-lexical patterns such as “Suicide Attack/Casualty”, “Civilians/Victims”, and “Hostage Taking/Abduction” can be identified effectively across various applications or contexts. This will facilitate the development of a framework for understanding the lexical patterns associated with terrorism.
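The abstract's point that the model with the highest accuracy is not necessarily the one with the best recall or F1-score can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical and chosen only for illustration; they are not results from the paper, and the names `Confusion`, `metrics`, `model_a`, and `model_b` are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts for one classifier."""
    tp: int  # terrorism-related texts correctly flagged
    fp: int  # unrelated texts wrongly flagged
    fn: int  # terrorism-related texts missed (false negatives)
    tn: int  # unrelated texts correctly ignored


def metrics(c: Confusion) -> dict:
    """Compute accuracy, precision, recall, and F1 from raw counts."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts on an imbalanced set of 1000 texts (100 positives).
# These are illustrative assumptions, NOT figures reported in the paper.
model_a = Confusion(tp=60, fp=10, fn=40, tn=890)  # conservative classifier
model_b = Confusion(tp=82, fp=40, fn=18, tn=860)  # recall-oriented classifier

for name, conf in (("model_a", model_a), ("model_b", model_b)):
    scores = metrics(conf)
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

With these assumed counts, `model_a` has the higher accuracy while `model_b` misses fewer positives and has the higher F1-score, which is the kind of trade-off the abstract describes when false negatives are costly.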
Keywords: cyber intelligence; terrorism; machine learning
JEL-codes: O3
Date: 2025
Downloads:
https://www.mdpi.com/1999-5903/17/4/182/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/4/182/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:4:p:182-:d:1639084
Future Internet is currently edited by Ms. Grace You
Bibliographic data for series maintained by MDPI Indexing Manager.