EconPapers    

A model for estimating the occurrence of same‐frequency words and the boundary between high‐ and low‐frequency words in texts

Qinglan Sun, Debora Shaw and Charles H. Davis

Journal of the American Society for Information Science, 1999, vol. 50, issue 3, 280-286

Abstract: A simpler model is proposed for estimating the frequency of any same‐frequency words and identifying the boundary point between high‐frequency and low‐frequency words in a text. The model, based on a “maximum ranking method,” assigns ranks to the words and estimates word frequency by the formula: Int[(−1 + (1 + 4D/I_{n+1})^{1/2})/2] > n* ≥ Int[(−1 + (1 + 4D/I_n)^{1/2})/2]. The boundary value between high‐frequency and low‐frequency words is obtained by taking the square root of the number of different words in the text: n* = D^{1/2}. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same‐frequency words depend only on the vocabulary of a text (the number of different words), not on its length. Like Zipf's Law, the model may be universally applicable.
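The boundary rule described in the abstract, n* = D^{1/2}, can be sketched in a few lines. This is an illustrative reading only: whitespace tokenization and the interpretation of n* as a frequency cutoff separating high‐ from low‐frequency words are assumptions made here, not details taken from the paper.

```python
from collections import Counter
import math

def frequency_boundary(text):
    """Split a text's vocabulary at the boundary n* = sqrt(D), where D is
    the number of different words (the vocabulary size).  Tokenizing on
    whitespace and treating n* as a frequency threshold are illustrative
    assumptions, not specifications from the paper."""
    freqs = Counter(text.lower().split())
    D = len(freqs)               # number of different words in the text
    n_star = math.sqrt(D)        # boundary value, n* = D ** 0.5
    high = {w for w, f in freqs.items() if f > n_star}   # high-frequency words
    low = {w for w, f in freqs.items() if f <= n_star}   # low-frequency words
    return D, n_star, high, low
```

For example, in the four-token text "a a b c" the vocabulary size is D = 3, so n* ≈ 1.73; only "a" (frequency 2) falls above the boundary.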

Date: 1999

Downloads: (external link)
https://doi.org/10.1002/(SICI)1097-4571(1999)50:33.0.CO;2-H



Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:50:y:1999:i:3:p:280-286

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571


More articles in Journal of the American Society for Information Science from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.

 
Page updated 2025-03-19
Handle: RePEc:bla:jamest:v:50:y:1999:i:3:p:280-286