
Malware detection using pre-trained transformer encoder with byte sequences

Eun-Jin Kim, Yun-Kyung Lee, Sang-Min Lee, Jeong-Nyeo Kim, Ah Reum Kang, Mi-seo Kim and Young-Seob Jeong

PLOS ONE, 2025, vol. 20, issue 10, 1-16

Abstract: Ordinary users encounter various documents on the network every day, such as news articles, emails, and messages, and most are vulnerable to malicious attacks. Malicious attack methods continue to evolve, making neural network-based malware detection increasingly appealing to both academia and industry. Recent studies have leveraged byte sequences within files to detect malicious activities, primarily using convolutional neural networks to capture local patterns in the byte sequences. Meanwhile, in natural language processing, Transformer-based language models have demonstrated superior performance across various tasks and have been applied to other domains, such as image analysis and speech recognition. In this paper, we introduce a novel Transformer-based language model for malware detection that processes byte sequences as input. We propose two new pre-training strategies: real-or-fake prediction and same-sequence prediction. Together with the conventional pre-training strategies of masked language modeling and next-sentence prediction, we explore all possible combinations of these approaches. By compiling existing byte sequences for malware detection, we construct a benchmark consisting of three file types (PDF, HWP, and MS Office) for pre-training and fine-tuning. Our empirical results demonstrate that our language model outperforms convolutional neural networks in the malware detection task, improving the macro F1 score by approximately 2.7 to 11.1 percentage points. We believe our language model will serve as a foundation model for malware detection services, and we will extend our research to develop a more powerful encoder-based model that can process longer byte sequences.
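
To illustrate how masked language modeling can be applied to byte sequences, the following is a minimal Python sketch, not the authors' implementation. The vocabulary layout (ids 0 to 255 for raw bytes plus four special tokens), the function names `bytes_to_ids` and `mask_for_mlm`, the sequence length, and the 15% masking rate are all assumptions made for illustration. The masking rule is also simplified: a selected byte is always replaced with the mask token.

```python
import random
from typing import List, Tuple

# Hypothetical vocabulary: ids 0-255 are raw byte values, followed by
# special tokens. The paper's actual vocabulary is not described here.
PAD_ID, CLS_ID, SEP_ID, MASK_ID = 256, 257, 258, 259
IGNORE_LABEL = -100  # assumed marker for positions excluded from the loss


def bytes_to_ids(data: bytes, max_len: int = 16) -> List[int]:
    """Map raw bytes to token ids, wrap with [CLS]/[SEP], pad to max_len."""
    body = list(data[: max_len - 2])  # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (max_len - len(ids))


def mask_for_mlm(
    ids: List[int], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Replace a random subset of byte tokens with [MASK].

    Returns (inputs, labels). labels holds the original byte at masked
    positions and IGNORE_LABEL everywhere else.
    """
    rng = random.Random(seed)
    inputs = list(ids)
    labels = [IGNORE_LABEL] * len(ids)
    for i, tok in enumerate(ids):
        if tok > 255:  # never mask special tokens
            continue
        if rng.random() < mask_prob:
            labels[i] = tok
            inputs[i] = MASK_ID
    return inputs, labels


ids = bytes_to_ids(b"%PDF-1.7\n")
inputs, labels = mask_for_mlm(ids)
```

During pre-training, an encoder would receive `inputs` and be trained to predict the original byte at each position where `labels` is not `IGNORE_LABEL`.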

Date: 2025
Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0332307 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 32307&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0332307

DOI: 10.1371/journal.pone.0332307

More articles in PLOS ONE from Public Library of Science

 
Page updated 2025-10-18
Handle: RePEc:plo:pone00:0332307