A Benchmark for Symbolic Reasoning from Pixel Sequences: Grid-Level Visual Completion and Correction

Kang, Lei; Fu, Xuanshuo; Souibgui, Mohamed Ali; Barsky, Andrey; Gomez, Lluis; Vazquez-Corral, Javier; Fornés, Alicia; Valveny, Ernest; Karatzas, Dimosthenis

Lei Kang, Xuanshuo Fu, Mohamed Ali Souibgui, Andrey Barsky, Lluis Gomez, Javier Vazquez-Corral, Alicia Fornés, Ernest Valveny and Dimosthenis Karatzas
Additional contact information
Lei Kang: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Xuanshuo Fu: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Mohamed Ali Souibgui: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Andrey Barsky: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Lluis Gomez: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Javier Vazquez-Corral: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Alicia Fornés: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Ernest Valveny: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain
Dimosthenis Karatzas: Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain

Mathematics, 2025, vol. 13, issue 17, 1-14

Abstract: Grid-structured visual data, such as forms, tables, and game boards, require models that pair pixel-level perception with symbolic consistency under global constraints. Recent Pixel Language Models (PLMs) map images to token sequences with promising flexibility, yet we find they generalize poorly when observable evidence becomes sparse or corrupted. We present GridMNIST-Sudoku, a benchmark that renders large numbers of Sudoku instances with style-diverse handwritten digits and provides parameterized stress tracks for two tasks: Completion (predict missing cells) and Correction (detect and repair incorrect cells), across difficulty levels ranging from 1 to 90 altered positions in a 9 × 9 grid. Attention diagnostics on PLMs trained with conventional one-dimensional positional encodings reveal weak structure awareness outside the natural Sudoku sparsity band. Motivated by these findings, we propose a lightweight Row-Column-Box (RCB) positional prior that injects grid-aligned coordinates, and we combine it with simple sparsity and corruption augmentations. Trained only on the natural distribution, the resulting model substantially improves out-of-distribution accuracy across wide sparsity and corruption ranges while maintaining strong in-distribution performance.
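
As an illustration of the two ideas named in the abstract, the following minimal Python sketch shows one plausible way to derive row, column, and box coordinates for each cell of a 9 × 9 grid, and to blank a chosen number of cells for a Completion-style instance. It is not the authors' implementation: the function names `rcb_coordinates` and `mask_cells`, the row-major flat layout, the use of 0 for an empty cell, and the seeded sampling are all assumptions made for this example.

```python
import random

GRID = 9  # Sudoku side length (assumed row-major flat layout of 81 cells)
BOX = 3   # side length of one 3 x 3 box


def rcb_coordinates(cell_index):
    """Map a flat cell index (0..80) to hypothetical (row, column, box) ids."""
    row, col = divmod(cell_index, GRID)
    # Boxes are numbered 0..8, left to right and then top to bottom.
    box = (row // BOX) * (GRID // BOX) + (col // BOX)
    return row, col, box


def mask_cells(grid, k, seed=0):
    """Return a copy of a flat grid with k cells blanked (0 = empty).

    Illustrative only: this sketch assumes k is at most 81 and samples
    the hidden positions uniformly with a seeded generator.
    """
    rng = random.Random(seed)
    hidden = rng.sample(range(GRID * GRID), k)
    masked = list(grid)  # copy so the input grid is left unchanged
    for i in hidden:
        masked[i] = 0
    return masked, sorted(hidden)
```

In a setup like this, the three ids returned by `rcb_coordinates` could each index a separate learned embedding table whose outputs are added to a cell's token representation; whether the paper combines them this way is not stated in the abstract.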

Keywords: Pixel Language Models; visual symbolic reasoning; GridMNIST-Sudoku benchmark; structured spatial prior; Explainable AI
JEL-codes: C
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2227-7390/13/17/2851/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/17/2851/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:17:p:2851-:d:1741786

Mathematics is currently edited by Ms. Emma He