Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models

Jaewoo Yang, Hayun Kim, Junyung Ji and Younghoon Kim ()
Additional contact information
Jaewoo Yang: Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea
Hayun Kim: Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea
Junyung Ji: Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea
Younghoon Kim: Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea

Future Internet, 2025, vol. 17, issue 4, 1-21

Abstract: Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but require high computational costs for inference. Post-training quantization is a widely adopted approach to reduce these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks of modern LLMs like the LLaMA family. Specifically, severe local quantization errors arise due to excessively large activation magnitudes, which we refer to as activation spikes, leading to significant degradation in model performance. Our analysis reveals a systematic pattern of these spikes: they predominantly occur in the FFN (feed-forward network) layers at the early and late layers of the model and are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods: Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrated that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.
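The abstract's central observation, that a single spiked activation inflates the per-tensor quantization scale and degrades the precision of every other value, can be illustrated with a small numerical sketch in Python. This is not the authors' QFeM/QFeP implementation; the spike magnitude, the tensor size, and the helper quantize_int8 are illustrative assumptions meant only to show why isolating a spike from the quantized range (the intuition behind the proposed methods) reduces error.

import numpy as np

def quantize_int8(x, scale):
    # Symmetric per-tensor INT8 quantization: snap each value to the nearest
    # multiple of `scale`, clipped to the signed 8-bit range, then dequantize.
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# Illustrative activations (assumed values): mostly small, plus one
# "activation spike" of the kind the paper attributes to GLU-based FFN layers.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)
acts[0] = 400.0

# With the spike included, the per-tensor scale is dictated by the spike,
# so the remaining 4095 values share only a few quantization levels.
scale_with_spike = np.abs(acts).max() / 127.0
err_with_spike = np.abs(quantize_int8(acts, scale_with_spike) - acts).mean()

# If the spike is kept in full precision (isolated from quantization),
# the scale shrinks and the error on the remaining values drops sharply.
rest = acts[1:]
scale_without_spike = np.abs(rest).max() / 127.0
err_without_spike = np.abs(quantize_int8(rest, scale_without_spike) - rest).mean()

print(f"mean abs error with spike in the tensor: {err_with_spike:.4f}")
print(f"mean abs error with spike excluded:      {err_without_spike:.4f}")

Under these assumed values the first error is dominated by ordinary activations rounding to zero, while excluding the spike shrinks the quantization step by roughly two orders of magnitude, which is the local error the paper's quantization-free module and prefix are designed to avoid.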

Keywords: quantization; LLM; post-training quantization; outliers
JEL-codes: O3
Date: 2025

Downloads: (external link)
https://www.mdpi.com/1999-5903/17/4/185/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/4/185/ (text/html)



Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:4:p:185-:d:1639317


Future Internet is currently edited by Ms. Grace You


Handle: RePEc:gam:jftint:v:17:y:2025:i:4:p:185-:d:1639317