SVTRv2X: Enhanced scene text recognition via self-distilled mixture-of-experts

Guo, Jian; Cui, Hanxin; Tang, Wengang; Zhou, Xuehai; Xu, Xing; Cheng, Qianqian

PLOS ONE, 2026, vol. 21, issue 6, 1-20

Abstract: Scene Text Recognition (STR) is a fundamental component of intelligent perception systems and plays a crucial role in a wide range of real-world applications such as autonomous driving, document understanding, and human–computer interaction. In practice, STR still faces several challenges, including high sensitivity to spatial perturbations, the limited representational capacity of lightweight Connectionist Temporal Classification (CTC)-based models, and the difficulty of handling diverse text styles within a single unified architecture. Although SVTRv2 enhances the recognition ability of CTC models through a combination of local and global mixing mechanisms, its robustness and generalization capability remain insufficient when dealing with geometric distortions, complex backgrounds, or text with large stylistic variations. To address these issues, we propose SVTRv2X, an enhanced STR framework built upon SVTRv2 that integrates three complementary improvement modules. The Jumble Module strategically rearranges input patches before the patch embedding stage, reducing the model’s reliance on fixed spatial structures and significantly improving robustness to rotated, misaligned, and irregular text. The Self-Distillation Module transfers deep-layer knowledge to shallow features, strengthening early-stage representations while keeping inference lightweight. The Mixture-of-Experts (MoE) Module expands model capacity through sparsely activated expert networks, allowing specialized processing of different text styles without introducing substantial computational overhead. Extensive experiments demonstrate that SVTRv2X achieves state-of-the-art performance on multiple STR benchmarks, substantially advancing the model’s recognition capability in real-world scene text scenarios.
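
The abstract gives no implementation details, so the following is only a minimal sketch of two of the ideas it names: rearranging input patches before embedding, and sparse top-k routing over experts. Everything here is an assumption made for illustration: the function names (`jumble_patches`, `top_k_gate`, `moe_forward`), the uniform random shuffle (the paper says the rearrangement is "strategic" but this listing does not say how), the softmax over the selected experts, and the toy experts. It is not the authors' code.

```python
import math
import random


def jumble_patches(image, patch, rng):
    """Split a 2-D grid into patch x patch tiles and shuffle the tiles.

    Assumption: a uniform random permutation stands in for the paper's
    (unspecified) rearrangement strategy.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "grid must divide into tiles"
    rows, cols = h // patch, w // patch
    # Collect tiles in row-major order; each tile is a list of row slices.
    tiles = [
        [row[c * patch:(c + 1) * patch] for row in image[r * patch:(r + 1) * patch]]
        for r in range(rows)
        for c in range(cols)
    ]
    order = list(range(len(tiles)))
    rng.shuffle(order)
    out = [[None] * w for _ in range(h)]
    # Tile order[dst] from the input is written to position dst in the output.
    for dst, src in enumerate(order):
        r0, c0 = (dst // cols) * patch, (dst % cols) * patch
        for i in range(patch):
            out[r0 + i][c0:c0 + patch] = tiles[src][i]
    return out, order


def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax-normalise over them only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, gate_logits, experts, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    weights = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for i, wgt in weights.items():
        y = experts[i](x)  # unselected experts are never evaluated
        out = [o + wgt * v for o, v in zip(out, y)]
    return out, weights


# Toy usage: a 4x4 "image" cut into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
jumbled, order = jumble_patches(image, patch=2, rng=random.Random(0))

# Four hypothetical experts that simply scale the input by 1, 2, 3 and 4.
experts = [lambda v, s=s: [s * t for t in v] for s in (1, 2, 3, 4)]
x = [1.0, -2.0]
mixed, weights = moe_forward(x, gate_logits=[0.1, 2.0, -1.0, 1.5], experts=experts, k=2)
```

In a trained model the gate logits would come from a learned router rather than being passed in by hand. The sketch only shows why compute stays bounded: with `k=2`, two of the four experts run per input regardless of how many experts exist.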

Date: 2026

Downloads:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0349085 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 49085&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0349085

DOI: 10.1371/journal.pone.0349085
