Chinese text segmentation for text retrieval: Achievements and problems
Zimin Wu and Gwyneth Tseng
Journal of the American Society for Information Science, 1993, vol. 44, issue 9, 532-542
Abstract:
Present text retrieval systems are generally built on the reductionist basis that words in texts (keywords) are used as indexing terms to represent the texts. A necessary precursor to these systems is word extraction, which for English texts can be achieved automatically by using spaces and punctuation as word delimiters. This approach cannot be readily applied to Chinese texts because they do not have obvious word boundaries. A Chinese text consists of a linear sequence of nonspaced or equally spaced ideographic characters, which are similar to morphemes in English. Researchers of Chinese text retrieval have therefore been seeking methods of text segmentation to divide Chinese texts automatically into words. First, a review of these methods is provided, in which the approaches to Chinese text segmentation are classified to give a general picture of research activity in this area, and some of the most important work is described. There follows a discussion of the problems of Chinese text segmentation, illustrated with examples. These problems include morphological complexities, segmentation ambiguity, and parsing problems, and they demonstrate that text segmentation remains one of the most challenging and interesting areas for Chinese text retrieval. © 1993 John Wiley & Sons, Inc.
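The contrast drawn in the abstract between delimiter-based word extraction and Chinese segmentation can be illustrated with a minimal Python sketch. It is not taken from the article: the abstract names no specific algorithm, so the sketch uses greedy maximum matching against a dictionary, a widely used heuristic chosen here only for illustration. The function names, the toy lexicon, and the sample phrase are all assumptions.

```python
import re


def english_words(text):
    """Extract words by treating spaces and punctuation as delimiters."""
    return re.findall(r"[A-Za-z0-9']+", text)


def forward_maximum_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: take the longest lexicon entry
    that starts at the current position, else fall back to one character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_maximum_match(text, lexicon, max_len=3):
    """Same heuristic, scanning right-to-left from the end of the text."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


# Hypothetical toy lexicon and phrase, chosen to expose an ambiguity.
LEXICON = {"研究", "研究生", "生命", "起源"}
PHRASE = "研究生命起源"  # "study the origin of life"

english = english_words("Text retrieval, in brief.")
forward = forward_maximum_match(PHRASE, LEXICON)
backward = backward_maximum_match(PHRASE, LEXICON)

print(english)   # ['Text', 'retrieval', 'in', 'brief']
print(forward)   # ['研究生', '命', '起源']
print(backward)  # ['研究', '生命', '起源']
```

The two scan directions disagree on the same character string, which is one simple form of the segmentation ambiguity the abstract mentions. A dictionary alone cannot decide between the two readings.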
Date: 1993
Downloads: https://doi.org/10.1002/(SICI)1097-4571(199310)44:93.0.CO;2-M
Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:44:y:1993:i:9:p:532-542