Using statistical and contextual information to identify two‐ and three‐character words in Chinese text
Christopher S.G. Khoo, Yubin Dai and Teck Ee Loh
Journal of the American Society for Information Science and Technology, 2002, vol. 53, issue 5, 365-377
Abstract:
New statistical formulas were developed for identifying two‐ and three‐character words in Chinese text. The formulas were constructed by performing stepwise logistic regression using a sample of sentences that had been manually segmented. For identifying two‐character words, the relative frequency of the adjacent characters and the document frequency of the overlapping bigrams were found to be significant factors. These provide information about the immediate neighborhood or context of the character string. Contextual information was also found to be significant in predicting three‐character words. Local information (the number of times the bigram or trigram occurs in the document being segmented) and the position of the bigram/trigram in the sentence were not found to be useful in identifying words. The new formulas, called contextual information formulas, were found to be substantially better than the mutual information formula in identifying two‐ and three‐character words. Using the contextual information formulas for both two‐ and three‐character words gave significantly better results than using the formula for two‐character words alone. The method can also be used for identifying multiword terms in English text.
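The abstract names the mutual information formula as the baseline but does not reproduce it, and it does not give the contextual information formulas themselves. As a rough illustration of the baseline only, the sketch below scores adjacent character pairs with the standard pointwise mutual information, log2(P(xy) / (P(x) P(y))), estimated from raw counts. The function name `bigram_mutual_information`, the toy corpus, and the probability estimates are assumptions made for this example; they are not taken from the paper, and this is not the authors' contextual information formula.

```python
import math
from collections import Counter


def bigram_mutual_information(sentences):
    """Return {bigram: PMI score} for every adjacent character pair.

    Hypothetical helper for illustration: probabilities are simple
    relative frequencies over the supplied sentences.
    """
    char_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        char_counts.update(sentence)
        # Overlapping bigrams: characters i and i+1 for each position i.
        bigram_counts.update(
            sentence[i:i + 2] for i in range(len(sentence) - 1)
        )

    total_chars = sum(char_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scores = {}
    for bigram, count in bigram_counts.items():
        p_xy = count / total_bigrams
        p_x = char_counts[bigram[0]] / total_chars
        p_y = char_counts[bigram[1]] / total_chars
        scores[bigram] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy, made-up corpus for illustration only (not data from the paper).
corpus = ["我们学习中文", "中文信息处理", "我们处理信息"]
scores = bigram_mutual_information(corpus)

# Bigrams ranked from most to least strongly associated.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A higher score suggests the two characters co-occur more often than chance would predict, which is the signal a mutual-information segmenter thresholds on. According to the abstract, the authors' formulas go further by adding contextual factors such as the relative frequency of adjacent characters and the document frequency of overlapping bigrams, with weights fitted by stepwise logistic regression on manually segmented sentences.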
Date: 2002
Downloads: https://doi.org/10.1002/asi.10045
Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:53:y:2002:i:5:p:365-377