A Trie Based Set Similarity Query Algorithm
Lianyin Jia,
Junzhuo Tang,
Mengjuan Li (),
Runxin Li,
Jiaman Ding and
Yinong Chen
Additional contact information
Lianyin Jia: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Junzhuo Tang: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Mengjuan Li: Library, Yunnan Normal University, Kunming 650500, China
Runxin Li: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Jiaman Ding: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Yinong Chen: School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ 85287, USA
Mathematics, 2023, vol. 11, issue 1, 1-13
Abstract:
Set similarity query is a primitive for many applications, such as data integration, data cleaning, and gene sequence alignment. Most of the existing algorithms are inverted index based, they usually filter unqualified sets one by one and do not have sufficient support for duplicated sets, thus leading to low efficiency. To solve this problem, this paper designs T-starTrie, an efficient trie based index for set similarity query, which can naturally group sets with the same prefix into one node, and can filter all sets corresponding to the node at a time, thereby significantly improving the candidates generation efficiency. In this paper, we find that the set similarity query problem can be transformed into matching nodes of the first-layer (FMNodes) detecting problem on T-starTrie. Therefore, an efficient FLMNode detection algorithm is designed. Based on this, an efficient set similarity query algorithm, TT-SSQ, is implemented by developing a variety of filtering techniques. Experimental results show that TT-SSQ can be up to 3.10x faster than existing algorithms.
Keywords: set similarity query; T-starTrie; FMNodes; TT-SSQ (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2023
References: View complete reference list from CitEc
Citations:
Downloads: (external link)
https://www.mdpi.com/2227-7390/11/1/229/pdf (application/pdf)
https://www.mdpi.com/2227-7390/11/1/229/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:11:y:2023:i:1:p:229-:d:1022705
Access Statistics for this article
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().