Domain-specific Chinese word segmentation using suffix tree and mutual information
Daniel Zeng (),
Donghua Wei (),
Michael Chau () and
Feiyue Wang ()
Additional contact information
Daniel Zeng: Institute of Automation
Donghua Wei: Institute of Automation
Michael Chau: The University of Hong Kong
Feiyue Wang: Institute of Automation
Information Systems Frontiers, 2011, vol. 13, issue 1, No 10, 115-125
Abstract:
Abstract As the amount of online Chinese contents grows, there is a critical need for effective Chinese word segmentation approaches to facilitate Web computing applications in a range of domains including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. The pure statistical method has lower precision, while the pure dictionary-based method cannot deal with new words beyond the dictionary. In this paper, we propose a hybrid method that is able to avoid the limitations of both types of approaches. Through the use of suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experiment results show that IASeg performs better than the benchmarks in both precision and recall for the domain-specific corpus and achieves comparable performance for the general corpus.
Keywords: Mutual information; Chinese segmentation; N-gram; Suffix tree; Ukkonen algorithm; Heuristic rules (search for similar items in EconPapers)
Date: 2011
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (5)
Downloads: (external link)
http://link.springer.com/10.1007/s10796-010-9278-5 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:spr:infosf:v:13:y:2011:i:1:d:10.1007_s10796-010-9278-5
Ordering information: This journal article can be ordered from
http://www.springer.com/journal/10796
DOI: 10.1007/s10796-010-9278-5
Access Statistics for this article
Information Systems Frontiers is currently edited by Ram Ramesh and Raghav Rao
More articles in Information Systems Frontiers from Springer
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().