A study on word‐based and integral‐bit Chinese text compression algorithms

Kwok‐Shing Cheng, Gilbert H. Young and Kam‐Fai Wong

Journal of the American Society for Information Science, 1999, vol. 50, issue 3, 218-228

Abstract: Experimental results show that a word‐based arithmetic coding scheme can achieve higher compression performance for Chinese text. However, arithmetic coding is a fractional‐bit compression algorithm, which is known to be time consuming. In this article, we instead study how to cascade the word segmentation model with a faster alternative, the integral‐bit compression algorithm. The cascaded algorithm is shown to be more suitable for practical use. Among several word‐based integral‐bit compression algorithms, WLZSSHUF achieves the best compression results: it not only achieves a compression ratio comparable to that of a PPM compressor, COMP‐2, but also compresses and decompresses faster. In the last part of this article, the relation between the accuracy of the word segmentation model (match ratio) and the performance of the compression algorithm (compression ratio) is analyzed. By varying the match ratio, we found that the growth rate of the compression ratio is content‐dependent and close to linear. The results of our study will help practitioners of information retrieval design word‐based compression algorithms for Chinese. This is particularly useful for multilingual digital libraries, in which a massive volume of data is often involved.
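
The cascade described in the abstract (word segmentation followed by an integral‐bit coder) can be illustrated with a minimal sketch. This is not the WLZSSHUF algorithm from the article: it assumes a greedy forward maximum‐matching segmenter and a plain Huffman coder as a stand‐in for the integral‐bit stage. The names `segment`, `huffman_code_lengths`, `coded_bits`, and the toy lexicon are hypothetical, and the bit counts ignore the cost of storing the code table or lexicon.

```python
import heapq
from collections import Counter
from itertools import count


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed segmenter, not the paper's).

    Characters not covered by any lexicon word become one-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    tie = count()  # tie-breaker so the heap never compares symbol lists
    heap = [(f, next(tie), [s]) for s, f in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one bit deeper.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), group1 + group2))
    return lengths


def coded_bits(symbols):
    """Total payload bits when each symbol gets a whole number of bits."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


if __name__ == "__main__":
    lexicon = {"中文", "文本", "压缩", "算法"}  # hypothetical toy lexicon
    text = "中文文本压缩算法" * 3
    print("character-level bits:", coded_bits(list(text)))
    print("word-level bits:", coded_bits(segment(text, lexicon)))
```

Each symbol receives a whole number of bits, which is what distinguishes an integral‐bit coder from the fractional‐bit arithmetic coder mentioned in the abstract. Swapping in a less accurate lexicon changes the token stream and therefore the coded size, which is the kind of dependence on match ratio that the abstract reports.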

Date: 1999

Downloads: (external link)
https://doi.org/10.1002/(SICI)1097-4571(1999)50:33.0.CO;2-1

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamest:v:50:y:1999:i:3:p:218-228

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1097-4571

More articles in Journal of the American Society for Information Science from Association for Information Science & Technology