DISTRIBUTION OF MULTI-WORDS IN CHINESE AND ENGLISH DOCUMENTS

Wen Zhang, Taketoshi Yoshida and Xijin Tang
Additional contact information
Wen Zhang: School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan
Taketoshi Yoshida: School of Knowledge Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Tatsunokuchi, Ishikawa 923-1292, Japan
Xijin Tang: Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, P. R. China

International Journal of Information Technology & Decision Making (IJITDM), 2009, vol. 08, issue 02, 249-265

Abstract: As a hybrid of the N-gram in natural language processing and the collocation in statistical linguistics, the multi-word is becoming a hot topic in text mining and information retrieval. In this paper, we study the distribution of multi-words to explore a theoretical basis for a probabilistic term-weighting scheme. Specifically, the Poisson distribution, the zero-inflated binomial distribution, and the G-distribution are compared on the task of predicting the probabilities of multi-word occurrences, for both technical and nontechnical multi-words. In addition, a rule-based algorithm is proposed to extract multi-words from texts based on words' occurrence patterns and syntactic structures. Our experimental results demonstrate that the G-distribution is best at predicting the probabilities of multi-word occurrence frequencies, and that the Poisson distribution is comparable to the zero-inflated binomial distribution in estimating multi-word distributions. The outcome of this study validates that burstiness is a universal phenomenon in linguistic count data, applying not only to individual content words but also to multi-words.
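
To make the comparison in the abstract concrete, the following is a minimal sketch, not the authors' method, of how a Poisson model and a zero-inflated binomial model can be scored on per-document counts of one multi-word. The counts, the parameter values (`n`, `p`, `pi`), and all function names are hypothetical illustrations. The G-distribution is left out because the abstract does not give its form.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k occurrences under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zib_pmf(k, n, p, pi):
    """Zero-inflated binomial: with probability pi the count is a structural
    zero; otherwise it follows Binomial(n, p)."""
    if not 0 <= k <= n:
        return 0.0
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return (pi if k == 0 else 0.0) + (1 - pi) * binom


def mean_log_likelihood(counts, pmf):
    """Average log-probability the model assigns to the observed counts."""
    return sum(math.log(pmf(k)) for k in counts) / len(counts)


# Hypothetical bursty data: the multi-word is absent from 90 documents
# and occurs several times in each of the remaining 10.
counts = [0] * 90 + [3, 4, 5, 2, 6, 4, 3, 5, 2, 6]

# Poisson mean estimated as the sample mean of the counts.
lam = sum(counts) / len(counts)

# Assumed (not fitted) zero-inflated binomial parameters with the same mean:
# (1 - pi) * n * p = 0.1 * 10 * 0.4 = 0.4.
n, p, pi = 10, 0.4, 0.9

ll_poisson = mean_log_likelihood(counts, lambda k: poisson_pmf(k, lam))
ll_zib = mean_log_likelihood(counts, lambda k: zib_pmf(k, n, p, pi))

print("mean log-likelihood, Poisson:", ll_poisson)
print("mean log-likelihood, zero-inflated binomial:", ll_zib)
```

On data like this, where most documents contain no occurrences and a few contain many, a single-parameter Poisson model underestimates both the zero count and the heavy tail. This is the burstiness effect the abstract refers to.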

Keywords: Multi-word; term distribution; Poisson distribution; zero-inflated distribution; G-distribution
Date: 2009

Downloads: (external link)
http://www.worldscientific.com/doi/abs/10.1142/S0219622009003399
Access to full text is restricted to subscribers

Persistent link: https://EconPapers.repec.org/RePEc:wsi:ijitdm:v:08:y:2009:i:02:n:s0219622009003399

DOI: 10.1142/S0219622009003399

International Journal of Information Technology & Decision Making (IJITDM) is currently edited by Yong Shi

More articles in International Journal of Information Technology & Decision Making (IJITDM) from World Scientific Publishing Co. Pte. Ltd.
Bibliographic data for series maintained by Tai Tone Lim.

 
Handle: RePEc:wsi:ijitdm:v:08:y:2009:i:02:n:s0219622009003399