Extraction of characteristic information from financial super-long texts and prediction of corporate violations
Hanglin Lu, Yongjie Zhang and Jinchang Xu
Research in International Business and Finance, 2025, vol. 79, issue C
Abstract:
Annual report texts contain clues about corporate misconduct. Predicting misconduct through AI-based analysis of these texts can help investors better avoid risks. However, due to the current limitations of AI language models, embedding the semantic vectors of long text paragraphs from annual reports faces a trade-off between "globality" and "accuracy." By using machine learning models (DecisionTree, RandomForest, LightGBM), our study compares the effectiveness of annual report text information at four segmentation granularities in predicting corporate misconduct. We find that, with single-granularity encoding, the Bert-Sentence-Stack semantic extraction method provides more effective annual report text encodings for predicting misconduct, achieving a best AUC of 0.7250. Furthermore, by implementing multi-granularity feature fusion, we achieve a winning combination of "globality" and "accuracy" with a maximum AUC of 0.7701. Compared to using financial features alone, multi-granularity text feature fusion increases the prediction AUC for corporate misconduct by about 12 %, indicating that multi-granularity text semantic features provide valuable incremental information. This study offers new insights and solutions for the integration and utilization of long financial texts and information mining.
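The fusion and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling stands in for the Bert-Sentence-Stack encoder, whose details are not given here; fusion is assumed to be plain concatenation of one vector per granularity; and the function names `mean_pool`, `fuse`, and `auc` are hypothetical. The toy numbers are made up for illustration and are unrelated to the reported results.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector.

    Assumption: a stand-in for a real sentence/paragraph encoder.
    """
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse(granularity_vectors):
    """Concatenate one pooled vector per granularity into a single feature row.

    Assumption: 'multi-granularity feature fusion' is modeled as concatenation.
    """
    fused = []
    for vec in granularity_vectors:
        fused.extend(vec)
    return fused


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (violation) or 0 (no violation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy document: two sentence-level vectors and one paragraph-level vector.
sentence_level = mean_pool([[1.0, 0.0], [0.0, 1.0]])
paragraph_level = mean_pool([[0.2, 0.4]])
features = fuse([sentence_level, paragraph_level])  # one row of fused features

# Toy classifier scores for four firms, two of which committed violations.
toy_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

In a fuller pipeline, the fused rows would be passed, optionally alongside financial features, to a classifier such as the tree-based models named in the abstract, and the AUC would be computed on held-out data.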
Keywords: Management discussion and analysis (MD&A); Corporate violations; Text vectorization; Large language model (LLM); Machine learning
Date: 2025
Downloads:
http://www.sciencedirect.com/science/article/pii/S0275531925003356
Full text for ScienceDirect subscribers only
Persistent link: https://EconPapers.repec.org/RePEc:eee:riibaf:v:79:y:2025:i:c:s0275531925003356
DOI: 10.1016/j.ribaf.2025.103079
Research in International Business and Finance is currently edited by T. Lagoarde Segot
Bibliographic data for series maintained by Catherine Liu.