The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset

Mamtimin Qasim, Wushour Silamu and Minghui Qiu
Additional contact information
Mamtimin Qasim: School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China
Wushour Silamu: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Minghui Qiu: School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China

Data, 2024, vol. 9, issue 11, 1-11

Abstract: Script identification is easier to implement than language identification and achieves a very high identification rate. In addition, the fewer languages a language identification algorithm has to distinguish, the higher its identification rate. However, no systematic study has examined script identification across multiple languages or how to construct language identification datasets on that basis. In this paper, we therefore design a script identification algorithm and describe the construction of a language identification dataset based on script groups. The data sources are text corpora in 261 languages from the Leipzig Corpora Collection, grouped into 23 script groups. Because the Unicode encoding scheme assigns different scripts to different code regions, we propose a script identification algorithm based on regular expression matching; its micro F-score reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, the script identification algorithm is used to filter out other-script content in each text.
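
The idea in the abstract, matching characters against per-script Unicode code regions and then dropping other-script content, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the five scripts, the code-point ranges chosen for them, the majority-count decision rule, and the function names `identify_script` and `filter_other_scripts` are all assumptions made for illustration, since the abstract does not list the 23 script groups or the exact patterns used.

```python
import re

# Hypothetical subset of script groups with approximate Unicode block ranges.
# The paper's actual 23 script groups and their patterns are not given in the abstract.
SCRIPT_RANGES = {
    "Latin": r"A-Za-z\u00C0-\u024F",   # Basic Latin letters plus Latin-1/Extended blocks
    "Cyrillic": r"\u0400-\u04FF",
    "Arabic": r"\u0600-\u06FF",
    "Devanagari": r"\u0900-\u097F",
    "Han": r"\u4E00-\u9FFF",           # CJK Unified Ideographs
}

# One compiled character-class pattern per script.
SCRIPT_PATTERNS = {
    name: re.compile(f"[{rng}]") for name, rng in SCRIPT_RANGES.items()
}


def identify_script(text):
    """Return the script with the most matching characters, or None if none match."""
    counts = {name: len(p.findall(text)) for name, p in SCRIPT_PATTERNS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def filter_other_scripts(text, script):
    """Drop characters outside the target script, keeping digits, spaces, basic punctuation."""
    rng = SCRIPT_RANGES[script]
    other = re.compile(f"[^{rng}0-9\\s.,;:!?'\"()\\-]")
    cleaned = other.sub("", text)
    # Collapse the whitespace left behind by removed words.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    sentence = "Привет hello, мир!"
    script = identify_script(sentence)
    print(script, "|", filter_other_scripts(sentence, script))
```

A majority count is only one possible decision rule. A sentence that mixes scripts in nearly equal proportions, or that consists only of digits and punctuation, needs a policy of its own, which this sketch handles simply by returning `None` when no character matches.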

Keywords: script; script identification; language identification; language identification dataset
JEL-codes: C8 C80 C81 C82 C83
Date: 2024
Downloads: (external link)
https://www.mdpi.com/2306-5729/9/11/134/pdf (application/pdf)
https://www.mdpi.com/2306-5729/9/11/134/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:9:y:2024:i:11:p:134-:d:1518130

Data is currently edited by Ms. Cecilia Yang

Bibliographic data for series maintained by MDPI Indexing Manager.

Page updated 2025-03-19
Handle: RePEc:gam:jdataj:v:9:y:2024:i:11:p:134-:d:1518130