Eliminating Packing-Aware Masking via LoRA-Based Supervised Fine-Tuning of Large Language Models
Jeong Woo Seo and Ho-Young Jung
Additional contact information
Jeong Woo Seo: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
Ho-Young Jung: Department of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
Mathematics, 2025, vol. 13, issue 20, 1-24
Abstract:
Packing approaches enhance training efficiency by filling the padding space in each batch with shorter sequences, thereby reducing the total number of batches per epoch. This approach has proven effective in both pre-training and supervised fine-tuning of large language models (LLMs). However, most packing methods require a packing-aware masking (PAM) mechanism to prevent cross-contamination between different text segments in the multi-head attention (MHA) layers. This masking ensures that the scaled dot-product attention operates only within segment boundaries. Despite its functional utility, PAM introduces significant implementation complexity and computational overhead during training. In this paper, we propose a novel method that eliminates the need for PAM during supervised fine-tuning with packing. Instead of masking, we introduce a learnable tensor obtained by applying Low-Rank Adaptation (LoRA) to the query and value projections of the attention mechanism. This tensor is trained to attenuate the subspace corresponding to cross-contamination, effectively replacing the function of PAM. Through component-wise decomposition of attention head outputs, we isolate the contamination component and demonstrate that it can be attenuated using the LoRA-derived tensor. Empirical evaluations on 7B-scale LLMs show that our method reduces training time and runtime overhead by removing the PAM implementation entirely. This enables more scalable and efficient supervised fine-tuning with packing, without compromising model integrity.
Keywords: training efficiency; supervised fine-tuning; large language model; packing; training-time overhead; implementation complexity
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/20/3344/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/20/3344/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:20:p:3344-:d:1775599
Mathematics is currently edited by Ms. Emma He