Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts

Álvaro Corral, Gemma Boleda and Ramon Ferrer-i-Cancho

PLOS ONE, 2015, vol. 10, issue 7, 1-23

Abstract: Zipf’s law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf’s law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf’s law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf’s law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
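The two parameters named in the abstract, the exponent and the low-frequency cut-off, can be illustrated with a minimal Python sketch. It is not the authors' fitting procedure. It assumes a frequency distribution of the form p(n) ∝ n^(-gamma) for frequencies n at or above a cut-off, and uses the well-known continuous approximation to the discrete maximum-likelihood estimate, which is rough for small cut-offs. The function names `word_frequencies` and `estimate_exponent`, the tokenization rule, and the default cut-off are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count occurrences of each lowercased word form in a text.

    Tokenization here is an assumption: maximal runs of Unicode letters.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def estimate_exponent(frequencies, cutoff=1):
    """Approximate MLE of gamma for p(n) ~ n**(-gamma), n >= cutoff.

    Uses gamma ~= 1 + N / sum(ln(n_i / (cutoff - 0.5))), a continuous
    approximation to the discrete estimator (assumed, not from the paper).
    """
    tail = [n for n in frequencies if n >= cutoff]
    if not tail:
        raise ValueError("no frequencies at or above the cut-off")
    log_sum = sum(math.log(n / (cutoff - 0.5)) for n in tail)
    return 1.0 + len(tail) / log_sum


counts = word_frequencies("the cat and the dog and the bird")
gamma = estimate_exponent(counts.values(), cutoff=1)
```

Running the same estimate on counts of word forms and on counts of lemmas (obtained from any lemmatizer) would give the kind of exponent comparison the abstract describes, although a text this short is only a toy input.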

Date: 2015

Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129031 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 29031&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0129031

DOI: 10.1371/journal.pone.0129031

Page updated 2025-03-19
Handle: RePEc:plo:pone00:0129031