
A Zipf's law-based text generation approach for addressing imbalance in entity extraction

Zhenhua Wang, Ming Ren, Dong Gao and Zhuang Li

Journal of Informetrics, 2023, vol. 17, issue 4

Abstract: Entity extraction is critical to the development of intelligent systems across diverse domains. Its effectiveness, however, is limited by data imbalance: certain entities are common while others are scarce. To address this issue, this study proposes a novel text generation approach that harnesses Zipf's law, a powerful tool from informetrics for studying human language. Using characteristics of Zipf's law, the words in the documents are classified as common or rare. Sentences are then likewise classified as common or rare and processed by text generation models accordingly. Rare entities in the generated sentences are labeled using human-designed rules, and the labeled sentences supplement the raw dataset, thereby mitigating the imbalance problem. The study presents a case of extracting entities from technical documents, and extensive experimental results on two datasets demonstrate the effectiveness of the proposed method. Furthermore, the significance and potential of Zipf's law in driving the progress of artificial intelligence (AI) are discussed, broadening the scope and coverage of informetrics. By incorporating the foundational principles of informetrics into text generation, this study showcases the pivotal role of informetrics in shaping the design and development of AI systems.
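
The abstract does not say how the boundary between common and rare words is drawn, so the following is only a minimal sketch of the word and sentence classification step under stated assumptions. It ranks word types by frequency (the rank-frequency view that Zipf's law describes), treats the top fraction of ranks as common, and calls a sentence rare when enough of its tokens are rare. The function names, the `common_fraction` and `min_rare_ratio` parameters, and the toy corpus are all hypothetical and are not taken from the paper.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def rank_words(sentences):
    """Return word types sorted by descending frequency; ties are broken alphabetically."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return sorted(counts, key=lambda w: (-counts[w], w))


def split_common_rare(sentences, common_fraction=0.2):
    """Label the top `common_fraction` of ranked word types as common, the rest as rare.

    The rank cut-off is an assumption made for illustration only.
    """
    ranked = rank_words(sentences)
    cutoff = max(1, int(len(ranked) * common_fraction))
    return set(ranked[:cutoff]), set(ranked[cutoff:])


def classify_sentences(sentences, rare_words, min_rare_ratio=0.5):
    """Call a sentence 'rare' when at least `min_rare_ratio` of its tokens are rare words."""
    labels = []
    for s in sentences:
        toks = tokenize(s)
        ratio = sum(t in rare_words for t in toks) / len(toks) if toks else 0.0
        labels.append("rare" if ratio >= min_rare_ratio else "common")
    return labels


# Toy corpus (hypothetical); not drawn from the paper's datasets.
corpus = [
    "the pump feeds the reactor",
    "the valve feeds the pump",
    "the flange gasket leaks",
]
common, rare = split_common_rare(corpus, common_fraction=0.4)
labels = classify_sentences(corpus, rare)
print(sorted(common))
print(labels)
```

In this sketch the rare sentences would be the ones passed on for augmentation by a text generation model; that step, and the rule-based labeling of rare entities, are not shown here.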

Keywords: Zipf's law; Data imbalance; Text generation; Entity extraction
Date: 2023

Downloads: (external link)
http://www.sciencedirect.com/science/article/pii/S1751157723000780
Full text for ScienceDirect subscribers only


Persistent link: https://EconPapers.repec.org/RePEc:eee:infome:v:17:y:2023:i:4:s1751157723000780

DOI: 10.1016/j.joi.2023.101453


Journal of Informetrics is currently edited by Leo Egghe

More articles in Journal of Informetrics from Elsevier
Bibliographic data for series maintained by Catherine Liu.

 
Page updated 2025-03-19
Handle: RePEc:eee:infome:v:17:y:2023:i:4:s1751157723000780