Calculating Semantic Frequency of GSL Words Using a BERT Model in Large Corpora

Liu Lei, Gong Tongxi, Shi Jianjun and Guo Yi

SAGE Open, 2025, vol. 15, issue 2, 21582440251333182

Abstract: There has always been a pressing need to provide semantic information for words in high-frequency word lists, but technical limitations have hindered this goal. This study addresses this challenge by leveraging a large language model, such as BERT, to semantically annotate large corpora and identify the high-frequency senses of headwords from the General Service List (GSL). We aim to explore three key questions: (1) Can BERT automatically annotate large corpora and accurately calculate sense frequencies? (2) What are the high-frequency senses of GSL words? (3) Can this approach be verified? Using a BERT-based framework, we annotated 1,891 GSL headwords (10,925 senses) in the 100-million-word British National Corpus (BNC), representing each sense with a 1,024-dimensional vector. From this, we identified 3,695 high-frequency senses for the GSL words. Three main conclusions are drawn from this study. First, BERT demonstrates high accuracy in sense annotation, achieving 92% precision when disambiguating the senses of GSL words. Second, a relatively small number of high-frequency senses account for a significant portion of corpus coverage. Specifically, these high-frequency senses (33.8% of the total) cover approximately 60% of all GSL word occurrences in the BNC. Third, the high-frequency senses selected via this method can be verified by their consistent coverage across different corpora. This study illustrates a pioneering method for semantic annotation in large corpora, which can be easily applied to calculate semantic frequencies for other word lists.
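The pipeline summarized in the abstract (represent each sense as a vector, label each corpus occurrence with a sense, count sense frequencies, then measure how much of the corpus the top senses cover) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how an occurrence is matched to a sense, so nearest-neighbour matching by cosine similarity is an assumption here. The 3-dimensional toy vectors stand in for the 1,024-dimensional BERT vectors, and the sense labels, data, and function names are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_sense(context_vec, sense_vecs):
    """Return the label of the sense vector closest to the context vector.

    Assumption: nearest-neighbour matching; the abstract does not specify this.
    """
    return max(sense_vecs, key=lambda s: cosine(context_vec, sense_vecs[s]))


def sense_frequencies(occurrences, sense_vecs):
    """Count how often each sense is assigned across all occurrences."""
    return Counter(nearest_sense(v, sense_vecs) for v in occurrences)


def coverage(freqs, selected):
    """Share of all occurrences accounted for by the selected senses."""
    total = sum(freqs.values())
    return sum(freqs[s] for s in selected) / total if total else 0.0


# Hypothetical sense inventory for one headword (toy 3-d vectors).
SENSES = {
    "bank/finance": [1.0, 0.0, 0.0],
    "bank/river": [0.0, 1.0, 0.0],
    "bank/tilt": [0.0, 0.0, 1.0],
}

# Hypothetical contextual vectors, one per corpus occurrence of the headword.
OCCURRENCES = [
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.1, 0.9, 0.0],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

freqs = sense_frequencies(OCCURRENCES, SENSES)
top = [sense for sense, _ in freqs.most_common(1)]  # the single most frequent sense
share = coverage(freqs, top)
print(freqs, top, share)
```

In a real setting the occurrence vectors would come from a BERT encoder run over corpus sentences, and the cut-off for "high-frequency" would follow the paper's own criterion, which the abstract does not state.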

Keywords: word lists; semantic annotation; semantic frequency; BERT; large language model
Date: 2025

Downloads: (external link)
https://journals.sagepub.com/doi/10.1177/21582440251333182 (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:sae:sagope:v:15:y:2025:i:2:p:21582440251333182

DOI: 10.1177/21582440251333182

Bibliographic data for series maintained by SAGE Publications.

 
Page updated 2025-07-04
Handle: RePEc:sae:sagope:v:15:y:2025:i:2:p:21582440251333182