Accelerating Inference in Retrieval-Augmented Generation Models for Long-Form Question Answering via Dynamic Token Pruning
Wooseok Kim,
Gyunyeop Kim and
Sangwoo Kang
Additional contact information
Wooseok Kim: School of Computing, Gachon University, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
Gyunyeop Kim: School of Computing, Gachon University, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
Sangwoo Kang: School of Computing, Gachon University, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
Mathematics, 2025, vol. 13, issue 14, 1-18
Abstract:
Fusion-in-Decoder (FiD), a prominent retrieval-augmented generation model, has demonstrated outstanding performance in open-domain question answering by effectively leveraging multiple passages. However, processing multiple passages significantly increases computational costs at both the encoder and the decoder. In particular, in Long-Form Question Answering (LFQA) scenarios, the decoder’s cross-attention computation scales with the length of the generated answer, severely impacting overall inference speed. In this paper, we propose a novel dynamic token pruning mechanism to alleviate the computational bottleneck of the FiD decoder. Our method selectively identifies and removes tokens predicted to contribute little to answer generation by jointly considering their contextual information and attention scores within the FiD encoder. The resulting pruned representations are then passed to the decoder, significantly reducing cross-attention computation and thereby accelerating inference. Experimental evaluations on two LFQA benchmarks, ASQA and CLAPNQ, demonstrate that the proposed method achieves up to a 1.74-fold speed-up with minimal degradation in answer quality, effectively improving computational efficiency over the original FiD model.
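The pruning step described in the abstract can be sketched as follows. This is an illustrative approximation only, not the paper's implementation: the scoring rule (a weighted mix of normalized encoder attention and the hidden-state norm as a crude proxy for "contextual information"), the `keep_ratio` and `alpha` parameters, and the array shapes are all assumptions introduced here for clarity.

```python
import numpy as np

def prune_encoder_tokens(hidden_states, attention_scores, keep_ratio=0.5, alpha=0.7):
    """Keep the top-scoring fraction of encoder token representations.

    hidden_states:    (n_tokens, d_model) encoder output for all retrieved passages.
    attention_scores: (n_tokens,) per-token attention mass from the encoder.
    keep_ratio:       fraction of tokens passed on to the decoder (assumed knob).
    alpha:            mixing weight between attention and the contextual proxy.
    Returns the pruned representations and the (sorted) indices that were kept,
    so decoder cross-attention runs over a shorter key/value sequence.
    """
    # Proxy for contextual salience: normalized hidden-state norm (an assumption,
    # standing in for whatever learned contextual signal the method actually uses).
    ctx = np.linalg.norm(hidden_states, axis=-1)
    ctx = ctx / (ctx.max() + 1e-9)
    attn = attention_scores / (attention_scores.max() + 1e-9)
    score = alpha * attn + (1.0 - alpha) * ctx

    n_keep = max(1, int(round(hidden_states.shape[0] * keep_ratio)))
    # Select the n_keep highest-scoring tokens, preserving original order.
    keep_idx = np.sort(np.argsort(score)[-n_keep:])
    return hidden_states[keep_idx], keep_idx
```

With `keep_ratio=0.5`, decoder cross-attention sees half as many key/value tokens, which is the source of the speed-up: cross-attention cost per generated token is linear in the number of encoder tokens, so halving them roughly halves that term of the decoding cost.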
Keywords: long-form question answering; retrieval-augmented generation; fusion in decoder; token pruning; deep learning
JEL-codes: C
Date: 2025
Downloads: (external link)
https://www.mdpi.com/2227-7390/13/14/2231/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/14/2231/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:14:p:2231-:d:1698089
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.