Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings

Nikolic, Bojan; Kartelj, Aleksandar; Djukanovic, Marko; Grbic, Milana; Blum, Christian; Raidl, Günther

Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings

Bojan Nikolic, Aleksandar Kartelj, Marko Djukanovic, Milana Grbic, Christian Blum and Günther Raidl
Additional contact information
Bojan Nikolic: Faculty of Natural Science and Mathematics, University of Banja Luka, 78 000 Banja Luka, Bosnia and Herzegovina
Aleksandar Kartelj: Faculty of Mathematics, University of Belgrade, 105104 Belgrade, Serbia
Marko Djukanovic: Faculty of Natural Science and Mathematics, University of Banja Luka, 78 000 Banja Luka, Bosnia and Herzegovina
Milana Grbic: Faculty of Natural Science and Mathematics, University of Banja Luka, 78 000 Banja Luka, Bosnia and Herzegovina
Christian Blum: Artificial Intelligence Research Institute (IIIA-CSIC), Campus UAB, 08193 Bellaterra, Spain
Günther Raidl: Institute of Logic and Computation, Faculty of Informatics, TU Wien, 1040 Vienna, Austria

Mathematics, 2021, vol. 9, issue 13, 1-25

Abstract: The longest common subsequence (LCS) problem is a prominent NP –hard optimization problem where, given an arbitrary set of input strings, the aim is to find a longest subsequence, which is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that is able to effectively deal with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search, guided by a novel heuristic named Gmpsum . This heuristic combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinctive settings: (1) short-time execution with fixed beam size in order to evaluate the guidance abilities of the compared search heuristics; and (2) long-time executions with fixed target duration times in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques in the context of close-to-uniform instances and outperforms state-of-the-art approaches for non-uniform instances.

Keywords: longest common subsequence problem; multi-nomial distribution; probability-based search guidance (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2021
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2227-7390/9/13/1515/pdf (application/pdf)
https://www.mdpi.com/2227-7390/9/13/1515/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:9:y:2021:i:13:p:1515-:d:584261

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().