A Zipf's law-based text generation approach for addressing imbalance in entity extraction
Zhenhua Wang, Ming Ren, Dong Gao and Zhuang Li
Journal of Informetrics, 2023, vol. 17, issue 4
Abstract:
Entity extraction is critical to the advancement of intelligent systems across diverse domains. Its effectiveness, however, is limited by data imbalance, where certain entities are common while others are scarce. To address this issue, this study proposes a novel text generation approach that harnesses Zipf's law, a powerful tool from informetrics for studying human language. Using characteristics of Zipf's law, words within the documents are classified as common or rare. Sentences are then likewise classified as common or rare and processed by text generation models accordingly. Rare entities within the generated sentences are labeled using human-designed rules, and the labeled sentences supplement the raw dataset, thereby mitigating the imbalance problem. The study presents a case of extracting entities from technical documents, and extensive experimental results on two datasets demonstrate the effectiveness of the proposed method. Furthermore, the significance and potential of Zipf's law in driving the progress of artificial intelligence (AI) are discussed, broadening the scope and coverage of informetrics. By incorporating the foundational principles of informetrics into text generation, this study showcases the pivotal role of informetrics in shaping the design and development of AI systems.
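The abstract does not state how words and sentences are split into common and rare, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a rank-based cutoff on the word frequency list (the hypothetical parameter `head_fraction`) and treats a sentence as rare when a given share of its tokens are rare words (the hypothetical parameter `min_rare_ratio`). All function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_words_by_rank(sentences, head_fraction=0.2):
    """Rank words by corpus frequency and split them into common and rare sets.

    Assumption: the top `head_fraction` of ranks counts as "common".
    The paper may use a different criterion derived from Zipf's law.
    """
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    cutoff = max(1, int(len(ranked) * head_fraction))
    return set(ranked[:cutoff]), set(ranked[cutoff:])


def split_sentences(sentences, rare_words, min_rare_ratio=0.5):
    """Label a sentence rare when enough of its tokens are rare words.

    Assumption: the `min_rare_ratio` threshold is illustrative only.
    """
    common, rare = [], []
    for s in sentences:
        toks = tokenize(s)
        ratio = sum(t in rare_words for t in toks) / len(toks) if toks else 0.0
        (rare if ratio >= min_rare_ratio else common).append(s)
    return common, rare


if __name__ == "__main__":
    corpus = [
        "the pump failed",
        "the pump leaked",
        "the valve cracked",
        "the pump the pump",
    ]
    common_words, rare_words = split_words_by_rank(corpus, head_fraction=0.4)
    common_sents, rare_sents = split_sentences(corpus, rare_words)
    print("common words:", sorted(common_words))
    print("rare sentences:", rare_sents)
```

In a pipeline like the one described, the rare sentences would then go to a text generation model and the generated output would be labeled by rules; those steps are left out here because the abstract gives no detail on them.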
Keywords: Zipf's law; Data imbalance; Text generation; Entity extraction
Date: 2023
Full text (ScienceDirect subscribers only): http://www.sciencedirect.com/science/article/pii/S1751157723000780
Persistent link: https://EconPapers.repec.org/RePEc:eee:infome:v:17:y:2023:i:4:s1751157723000780
DOI: 10.1016/j.joi.2023.101453
Journal of Informetrics is currently edited by Leo Egghe
Bibliographic data for series maintained by Catherine Liu.