Zero-inflated beta distribution applied to word frequency and lexical dispersion in corpus linguistics
Brent Burch and Jesse Egbert
Journal of Applied Statistics, 2020, vol. 47, issue 2, 337-353
Abstract:
Corpus linguistics is the study of language as expressed in a body of texts or documents. The relative frequency of a word within a text and the dispersion of the word across the collection of texts provide information about the word's prominence and diffusion, respectively. In practice, people tend to use a relatively small number of the words in a language's lexicon, so a large number of words are rarely employed. Because some texts may not contain the word under study at all, the zero-inflated beta distribution makes it possible to model the relative frequency of a word in a text. In this paper, the expectations of a word's prominence and dispersion are defined under the zero-inflated beta model. Estimates of a word's prominence and dispersion are computed for words in the British National Corpus 1994 (BNC), a 100-million-word collection of written and spoken language covering a wide range of British English. The relationship between a word's prominence and dispersion is discussed, as well as measures that are functions of both prominence and dispersion.
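To illustrate the kind of model the abstract describes, here is a minimal Python sketch of a zero-inflated beta distribution for a word's relative frequency per text. It rests on assumptions not stated in the abstract: the parameter names `pi0` (probability that a text does not contain the word), `a` and `b` (beta shape parameters for texts that do contain it) are hypothetical, and the method-of-moments fit is a simple stand-in rather than the authors' estimator. The paper's dispersion measure is not defined in the abstract, so it is not reproduced here.

```python
import random
from statistics import mean, pvariance


def simulate_zib(pi0, a, b, n, seed=0):
    """Draw n relative frequencies: 0 with probability pi0, else Beta(a, b)."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < pi0 else rng.betavariate(a, b)
            for _ in range(n)]


def fit_zib(rel_freqs):
    """Method-of-moments fit (an assumption, not the paper's estimator).

    Returns (pi0, a, b). Requires at least two distinct positive values.
    """
    positives = [y for y in rel_freqs if y > 0]
    pi0 = 1 - len(positives) / len(rel_freqs)  # share of texts without the word
    m = mean(positives)                        # mean of the beta part
    v = pvariance(positives)                   # variance of the beta part
    common = m * (1 - m) / v - 1               # equals a + b under Beta(a, b)
    return pi0, m * common, (1 - m) * common


def zib_mean(pi0, a, b):
    """Expected relative frequency under the model: (1 - pi0) * a / (a + b)."""
    return (1 - pi0) * a / (a + b)


if __name__ == "__main__":
    # Hypothetical word: absent from 60% of texts, rare where it does occur.
    sample = simulate_zib(pi0=0.6, a=2.0, b=50.0, n=20000)
    pi0_hat, a_hat, b_hat = fit_zib(sample)
    print("estimated pi0:", round(pi0_hat, 3))
    print("estimated expected relative frequency:",
          zib_mean(pi0_hat, a_hat, b_hat))
```

Under this sketch, the expected relative frequency shrinks both when the word is missing from more texts (larger `pi0`) and when it is rarer within the texts that contain it (smaller `a / (a + b)`), which is why a single mean cannot separate prominence from dispersion.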
Date: 2020
Downloads:
http://hdl.handle.net/10.1080/02664763.2019.1636941 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:taf:japsta:v:47:y:2020:i:2:p:337-353
DOI: 10.1080/02664763.2019.1636941
Journal of Applied Statistics is currently edited by Robert Aykroyd