EconPapers    
Economics at your fingertips  
 

Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy

Mamtimin Qasim () and Wushour Silamu
Additional contact information
Mamtimin Qasim: School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China
Wushour Silamu: School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China

Data, 2025, vol. 10, issue 4, 1-22

Abstract: While script identification is the first step in many natural language processing and text mining tasks, at present, there is no open-source script identification algorithm for text. For this reason, we analyze the Unicode encoding of each type of script and construct regular expressions in this study, in order to design an improved script identification algorithm. Because some scripts share common characters, it’s impossible to count and summarize them. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, if a new script identification feature is required, the regular expression for each script must be re-adjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters, allowing us to design an improved script identification algorithm. Using this approach, we can fully consider all 169 Unicode script types. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered during script identification; furthermore, these items in the text are also included in the script identification results, thus ensuring the integrity of the provided information. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while providing improvements on its basis.

Keywords: script; script identification; Unicode; language identification (search for similar items in EconPapers)
JEL-codes: C8 C80 C81 C82 C83 (search for similar items in EconPapers)
Date: 2025
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2306-5729/10/4/43/pdf (application/pdf)
https://www.mdpi.com/2306-5729/10/4/43/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:10:y:2025:i:4:p:43-:d:1619994

Access Statistics for this article

Data is currently edited by Ms. Cecilia Yang

More articles in Data from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().

 
Page updated 2025-04-05
Handle: RePEc:gam:jdataj:v:10:y:2025:i:4:p:43-:d:1619994