EconPapers    
Economics at your fingertips  
 

A heuristic method based on a statistical approach for Chinese text segmentation

Christopher C. Yang and K. W. Li

Journal of the American Society for Information Science and Technology, 2005, vol. 56, issue 13, 1438-1447

Abstract: The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi‐grams and significant estimation of tri‐grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required in this method. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Due to the lack of delimiters of words in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, segmentation ambiguities and occurrences of out‐of‐vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research studies dealing with the problem of word segmentation have focused on the resolution of segmentation ambiguities. The problem of unknown word identification has not drawn much attention. The experimental result shows that the proposed heuristic method is promising to segment the unknown words as well as the known words. The authors further investigated the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method and benchmarked the proposed heuristic method with a previous proposed technique, boundary detection. It is found that the heuristic method outperformed the boundary detection method.

Date: 2005
References: Add references at CitEc
Citations: View citations in EconPapers (1)

Downloads: (external link)
https://doi.org/10.1002/asi.20237

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:56:y:2005:i:13:p:1438-1447

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1532-2890

Access Statistics for this article

More articles in Journal of the American Society for Information Science and Technology from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery ().

 
Page updated 2025-03-19
Handle: RePEc:bla:jamist:v:56:y:2005:i:13:p:1438-1447