Enhancing tokenization accuracy with dynamic patterns: cumulative logic for segmenting user-generated content in logographic languages
Yin Zhang,
Zhihuai Lin,
Castiel Chi-chiu Tong and
Sam Wai-yeung Ho
Additional contact information
Yin Zhang: Hong Kong Baptist University
Zhihuai Lin: University of North Carolina at Chapel Hill
Castiel Chi-chiu Tong: Hong Kong University of Science and Technology
Sam Wai-yeung Ho: Fasta.ai Ltd.
Journal of Computational Social Science, 2025, vol. 8, issue 3, No 26, 24 pages
Abstract:
Despite the significant advancements of Large Language Models (LLMs) in recent years, tokenization remains a critical step in Natural Language Processing (NLP) for social scientific research. This study presents a simple but effective approach to enhance tokenization accuracy in segmenting user-generated content (UGC) in logographic languages, such as Chinese. Existing tokenization techniques often struggle to effectively handle the complexities of UGC on digital platforms, which include informal language, slang, and newly coined terms. To address this challenge, we developed a dynamic tokenization model that incorporates cumulative logic to recognize and adapt to evolving linguistic patterns in social media content. By analyzing large online discussion datasets from LIHKG, a Reddit-like forum in Hong Kong, the model’s effectiveness is demonstrated through its ability to accurately segment domain-specific terms and novel expressions over time. Our results show that the model outperforms traditional tokenizers in recognizing contextually relevant tokens. This innovative approach offers practical advantages for analyzing large-scale UGC data, and has the potential to improve the performance of downstream NLP tasks.
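Illustrative sketch (not taken from the article): the abstract does not specify how the cumulative logic is implemented, so the Python below shows only one plausible reading of the idea. It assumes that character n-gram counts are accumulated across chronological batches of posts, that any n-gram reaching a frequency threshold is promoted into the vocabulary, and that text is then segmented by greedy forward maximum matching. The function names `update_vocab` and `segment`, the parameters `max_len` and `min_count`, and the toy posts are all hypothetical.

```python
from collections import Counter

def update_vocab(counts, vocab, batch, max_len=4, min_count=2):
    """Add one batch's character n-gram counts to the running totals,
    then promote every n-gram seen at least min_count times (assumed rule)."""
    for text in batch:
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    vocab.update(gram for gram, c in counts.items() if c >= min_count)

def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens

counts, vocab = Counter(), set()

# First batch: every n-gram has been seen only once, so nothing is promoted.
update_vocab(counts, vocab, ["連登好"])
before = segment("去連登", vocab)

# Second batch: "連登" reaches the threshold and enters the vocabulary.
update_vocab(counts, vocab, ["上連登"])
after = segment("去連登", vocab)

print(before)  # ['去', '連', '登']
print(after)   # ['去', '連登']
```

Because the counts persist across batches, a term that is too rare to be recognized in any single batch can still be picked up once its cumulative frequency crosses the threshold. A raw frequency cut-off like this one would also promote incidental substrings on real data, so a practical system would likely need a stronger promotion criterion.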
Keywords: Tokenization; User-generated content (UGC); Logographic languages; Natural language processing (NLP); Dynamic patterns
Date: 2025
Downloads:
http://link.springer.com/10.1007/s42001-025-00406-7 Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:jcsosc:v:8:y:2025:i:3:d:10.1007_s42001-025-00406-7
Ordering information: This journal article can be ordered from
http://www.springer. ... iences/journal/42001
DOI: 10.1007/s42001-025-00406-7
Journal of Computational Social Science is currently edited by Takashi Kamihigashi
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.