Zipf’s Law Arises Naturally When There Are Underlying, Unobserved Variables

Laurence Aitchison, Nicola Corradi and Peter E Latham

PLOS Computational Biology, 2016, vol. 12, issue 12, 1-32

Abstract: Zipf’s law, which states that the probability of an observation is inversely proportional to its rank, has been observed in many domains. While there are models that explain Zipf’s law in each of them, those explanations are typically domain specific. Recently, methods from statistical physics were used to show that a fairly broad class of models does provide a general explanation of Zipf’s law. This explanation rests on the observation that real-world data is often generated from underlying causes, known as latent variables. Those latent variables mix together multiple models that do not obey Zipf’s law, giving a model that does. Here we extend that work both theoretically and empirically. Theoretically, we provide a far simpler and more intuitive explanation of Zipf’s law, which at the same time considerably extends the class of models to which this explanation can apply. Furthermore, we also give methods for verifying whether this explanation applies to a particular dataset. Empirically, these advances allowed us to extend this explanation to important classes of data, including word frequencies (the first domain in which Zipf’s law was discovered), data with variable sequence length, and multi-neuron spiking activity.

Author Summary: Datasets ranging from word frequencies to neural activity all have a seemingly unusual property, known as Zipf’s law: when observations (e.g., words) are ranked from most to least frequent, the frequency of an observation is inversely proportional to its rank. Here we demonstrate that a single, general principle underlies Zipf’s law in a wide variety of domains, by showing that models in which there is a latent, or hidden, variable controlling the observations can, and sometimes must, give rise to Zipf’s law. We illustrate this mechanism in three domains: word frequency, data with variable sequence length, and neural data.
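
The mixing mechanism described in the abstract can be illustrated with a small toy model. The sketch below is an assumption made for illustration and is not taken from the paper: observations are binary vectors of length n whose bits are independent with a shared firing probability p. With p held fixed there is no latent variable. With p drawn uniformly from [0, 1] it acts as a latent variable, and each vector with k ones has marginal probability 1 / ((n + 1) * C(n, k)). The helper names (`loglog_slope`, `mixed_groups`, `fixed_groups`) are hypothetical. Zipf’s law corresponds to a slope of about -1 when log probability is plotted against log rank.

```python
import math


def loglog_slope(groups):
    """Least-squares slope of log(probability) against log(rank).

    `groups` is a list of (log_prob, count) pairs, meaning `count`
    distinct observations that each have probability exp(log_prob).
    Each group contributes one point, placed at the last rank it occupies.
    """
    xs, ys, rank = [], [], 0
    for log_p, count in sorted(groups, reverse=True):  # most probable first
        rank += count
        xs.append(math.log(rank))
        ys.append(log_p)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def mixed_groups(n):
    """Latent-variable model: p ~ Uniform(0, 1), then n independent bits.

    Integrating p**k * (1 - p)**(n - k) over p gives a marginal
    probability of 1 / ((n + 1) * C(n, k)) for each vector with k ones.
    """
    groups = []
    for k in range(n // 2 + 1):
        patterns = math.comb(n, k)
        log_p = -math.log(n + 1) - math.log(patterns)
        # Vectors with k ones and with n - k ones are equally probable.
        count = patterns if 2 * k == n else 2 * patterns
        groups.append((log_p, count))
    return groups


def fixed_groups(n, p):
    """No latent variable: every bit is 1 with the same fixed probability p."""
    return [
        (k * math.log(p) + (n - k) * math.log(1 - p), math.comb(n, k))
        for k in range(n + 1)
    ]


n = 30
print("slope with latent p:", loglog_slope(mixed_groups(n)))
print("slope with fixed p :", loglog_slope(fixed_groups(n, 0.2)))
```

In this toy setting the mixed model gives a slope close to -1, while the fixed-p model gives a clearly steeper one. A straight-line fit is only a rough diagnostic and does not replace the verification methods the abstract refers to.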

Date: 2016

Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005110 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 05110&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1005110

DOI: 10.1371/journal.pcbi.1005110

More articles in PLOS Computational Biology from Public Library of Science
Bibliographic data for series maintained by ploscompbiol.

 
Page updated 2025-03-19
Handle: RePEc:plo:pcbi00:1005110