A model for estimating the occurrence of same‐frequency words and the boundary between high‐ and low‐frequency words in texts
Qinglan Sun,
Debora Shaw and
Charles H. Davis
Journal of the American Society for Information Science, 1999, vol. 50, issue 3, 280-286
Abstract:
A simpler model is proposed for estimating the frequency of any same‐frequency words and for identifying the boundary between high‐frequency and low‐frequency words in a text. The model, based on a “maximum ranking method,” assigns ranks to the words and estimates word frequency by the formula Int[(−1 + (1 + 4D/I_{n+1})^{1/2})/2] > n* ≥ Int[(−1 + (1 + 4D/I_n)^{1/2})/2], where D is the number of different words in the text. The boundary value between high‐frequency and low‐frequency words is obtained by taking the square root of the number of different words: n* = D^{1/2}. This straightforward model was applied successfully to both English and Chinese texts, demonstrating that the frequency of words and the number of same‐frequency words depend only on the vocabulary of a text (the number of different words), not on its length. Like Zipf's Law, the model may be universally applicable.
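The two formulas above are simple enough to evaluate directly. The following Python sketch is ours, not the authors': it assumes, on the basis of the abstract's notation, that D is the number of different words in the text and that I_n is the number of words that each occur exactly n times (the same‐frequency words); the function names and the illustrative counts are hypothetical.

import math

def estimated_frequency_bounds(D, I_n, I_n1):
    """Bracket the estimated frequency n* of a group of same-frequency words.

    D    -- number of different words (vocabulary size) of the text
    I_n  -- assumed: number of words occurring exactly n times
    I_n1 -- assumed: number of words occurring exactly n + 1 times

    Evaluates Int[(-1 + (1 + 4D/I_{n+1})^(1/2))/2] > n* >= Int[(-1 + (1 + 4D/I_n)^(1/2))/2].
    """
    lower = int((-1 + math.sqrt(1 + 4 * D / I_n)) / 2)   # inclusive lower bound on n*
    upper = int((-1 + math.sqrt(1 + 4 * D / I_n1)) / 2)  # exclusive upper bound on n*
    return lower, upper

def high_low_boundary(D):
    """Boundary between high- and low-frequency words: n* = D^(1/2)."""
    return math.sqrt(D)

# Illustrative (made-up) counts: 10,000 different words, of which 5,000 occur
# once and 1,600 occur twice.
print(estimated_frequency_bounds(10_000, 5_000, 1_600))  # (1, 2), so n* = 1
print(high_low_boundary(10_000))                         # 100.0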
Date: 1999
Downloads: https://doi.org/10.1002/(SICI)1097-4571(1999)50:33.0.CO;2-H
Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:50:y:1999:i:3:p:280-286
Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571
More articles in Journal of the American Society for Information Science from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.