Utilizing Pre-trained and Large Language Models for 10-K Items Segmentation
Hsin-Min Lu, Yu-Tai Chien, Huan-Hsun Yen and Yen-Hsiu Chen
Papers from arXiv.org
Abstract:
Extracting specific items from 10-K reports is challenging due to variations in document formats and item presentation. To improve over traditional rule-based approaches, this study introduces and compares two advanced item segmentation methods: (1) GPT4ItemSeg, using a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers satisfactory item segmentation performance, while GPT4ItemSeg can easily adapt to regulatory changes. Together, they provide an extensible framework for 10-K item segmentation that supports reliable and reproducible results.
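The line-ID-based prompting mechanism described in the abstract can be illustrated with a minimal sketch. The prompt layout, response format, and helper names below are illustrative assumptions, not the paper's exact design: each line of the filing is prefixed with a numeric ID so the model can reference positions, and the model's reply is parsed back into item start lines.

```python
def build_prompt(lines):
    """Prefix every filing line with a [line-ID] so the model can cite positions."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(lines))
    return (
        "Below is a 10-K filing with one [line-ID] per line. "
        "Reply with 'ITEM <name>: <line-ID>' for the first line of each item.\n\n"
        + numbered
    )

def parse_reply(reply, n_lines):
    """Parse 'ITEM 1A: 7'-style reply lines into {item_name: start_line_id}."""
    starts = {}
    for row in reply.splitlines():
        if row.startswith("ITEM") and ":" in row:
            name, _, line_id = row.partition(":")
            line_id = int(line_id.strip())
            if 0 <= line_id < n_lines:  # discard hallucinated out-of-range IDs
                starts[name.replace("ITEM", "").strip()] = line_id
    return starts

# Toy filing and a hypothetical model reply (no API call is made here).
lines = ["PART I", "Item 1. Business", "We make widgets.",
         "Item 1A. Risk Factors", "Competition may reduce margins."]
prompt = build_prompt(lines)
starts = parse_reply("ITEM 1: 1\nITEM 1A: 3", len(lines))
print(starts)  # {'1': 1, '1A': 3}
```

Mapping boundaries to line IDs rather than asking the model to reproduce item text keeps the reply short and makes the output trivially verifiable against the source document.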
Date: 2025-02, Revised 2026-04
New Economics Papers: this item is included in nep-big and nep-cmp
Downloads: http://arxiv.org/pdf/2502.08875 (latest version, application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:arx:papers:2502.08875