SNMatch: An Unsupervised Method for Column Semantic-Type Detection Based on Siamese Network
Tiezheng Nie,
Hanyu Mao,
Aolin Liu,
Xuliang Wang,
Derong Shen and
Yue Kou
Additional contact information
Tiezheng Nie: School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
Hanyu Mao: School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
Aolin Liu: School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
Xuliang Wang: School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
Derong Shen: School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
Yue Kou: School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
Mathematics, 2025, vol. 13, issue 4, 1-15
Abstract:
Column semantic-type detection is a crucial task for data integration and schema matching, particularly when dealing with large volumes of unlabeled tabular data. Existing methods often rely on supervised learning models, which require extensive labeled data. In this paper, we propose SNMatch, an unsupervised approach based on a Siamese network for detecting column semantic types without labeled training data. The novelty of SNMatch lies in generating semantic embeddings of columns from both format and semantic features and then clustering those embeddings into semantic types. Unlike traditional methods, which typically rely on keyword matching or supervised classification, SNMatch leverages unsupervised learning to tackle the challenges of column semantic-type detection in massive datasets with limited labeled examples. We demonstrate that SNMatch significantly outperforms current state-of-the-art techniques in terms of clustering accuracy, especially in handling complex and nested semantic types. Extensive experiments on the MACST and VizNet-Manyeyes datasets validate its effectiveness, showing superior performance in column semantic-type detection compared to methods such as TF-IDF, FastText, and BERT. The proposed method shows great promise for practical applications in data integration, data cleaning, and automated schema mapping, particularly in scenarios where labeled data are scarce or unavailable. Furthermore, our work builds upon recent advances in neural network-based embeddings and unsupervised learning, contributing to the growing body of research in automatic schema matching and tabular data understanding.
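To make the embed-then-cluster idea in the abstract concrete, the following is a minimal sketch and not the SNMatch method itself: it replaces the learned Siamese-network embedding with hand-crafted stand-ins (character-class ratios as "format" features, normalized token counts as "semantic" features) and groups columns by a greedy cosine-similarity threshold. All function names, the feature choices, and the 0.5 threshold are assumptions made for illustration only.

```python
import math
from collections import Counter


def format_features(values):
    """Character-class ratios over all cell text: a crude 'format' signature."""
    text = "".join(values)
    n = max(len(text), 1)
    return {
        "fmt:digit": sum(c.isdigit() for c in text) / n,
        "fmt:alpha": sum(c.isalpha() for c in text) / n,
        "fmt:punct": sum(not c.isalnum() and not c.isspace() for c in text) / n,
    }


def token_features(values):
    """Normalized token counts: a crude stand-in for 'semantic' features."""
    tokens = Counter(tok.lower() for v in values for tok in v.split())
    total = max(sum(tokens.values()), 1)
    return {f"tok:{t}": c / total for t, c in tokens.items()}


def embed(values):
    """Concatenate both feature groups into one sparse vector (a dict)."""
    vec = format_features(values)
    vec.update(token_features(values))
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(k, 0.0) for k, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(columns, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative
    (its first member's vector) is at least `threshold` similar."""
    clusters = []  # list of (representative_vector, member_names)
    for name, values in columns.items():
        vec = embed(values)
        for rep, members in clusters:
            if cosine(vec, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


# Hypothetical toy columns, invented for this example.
columns = {
    "phone_a": ["555-1234", "555-9876"],
    "phone_b": ["555-0000", "555-4321"],
    "city": ["New York", "Los Angeles"],
}
groups = cluster(columns)
```

In this toy input the two phone columns share no tokens, so only their matching format signature brings them together, while the city column shares neither format nor tokens with them. A learned embedding, as the abstract describes, would replace `embed` and would be expected to capture far subtler similarities than these hand-written features.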
Keywords: data integration; tabular data; column matching; unsupervised learning; Siamese network
JEL-codes: C
Date: 2025
Downloads:
https://www.mdpi.com/2227-7390/13/4/607/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/4/607/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:4:p:607-:d:1589900
Mathematics is currently edited by Ms. Emma He
Bibliographic data for series maintained by MDPI Indexing Manager.