TY - JOUR
T1 - Minimizing the Minimizers via Alphabet Reordering
AU - Verbeek, Hilde
AU - Ayad, Lorraine A.K.
AU - Loukides, Grigorios
AU - Pissis, Solon P.
N1 - Publisher Copyright: © Hilde Verbeek, Lorraine A.K. Ayad, Grigorios Loukides, and Solon P. Pissis.
PY - 2024/4/12
Y1 - 2024/4/12
N2 - Minimizers sampling is one of the most widely-used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let S = S[1] ... S[n] be a string over a totally ordered alphabet Σ. Further let w ≥ 2 and k ≥ 1 be two integers. The minimizer of S[i .. i + w + k − 2] is the smallest position in [i, i + w − 1] where the lexicographically smallest length-k substring of S[i .. i + w + k − 2] starts. The set of minimizers over all i ∈ [1, n − w − k + 2] is the set M_{w,k}(S) of the minimizers of S. We consider the following basic problem: Given S, w, and k, can we efficiently compute a total order on Σ that minimizes |M_{w,k}(S)|? We show that this is unlikely by proving that the problem is NP-hard for any w ≥ 3 and k ≥ 1. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
AB - Minimizers sampling is one of the most widely-used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let S = S[1] ... S[n] be a string over a totally ordered alphabet Σ. Further let w ≥ 2 and k ≥ 1 be two integers. The minimizer of S[i .. i + w + k − 2] is the smallest position in [i, i + w − 1] where the lexicographically smallest length-k substring of S[i .. i + w + k − 2] starts. The set of minimizers over all i ∈ [1, n − w − k + 2] is the set M_{w,k}(S) of the minimizers of S. We consider the following basic problem: Given S, w, and k, can we efficiently compute a total order on Σ that minimizes |M_{w,k}(S)|? We show that this is unlikely by proving that the problem is NP-hard for any w ≥ 3 and k ≥ 1. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
KW - alphabet reordering
KW - feedback arc set
KW - minimizers
KW - sequence analysis
UR - http://www.scopus.com/inward/record.url?scp=85196736700&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.CPM.2024.28
DO - 10.4230/LIPIcs.CPM.2024.28
M3 - Conference paper
AN - SCOPUS:85196736700
SN - 1868-8969
JO - Leibniz International Proceedings in Informatics, LIPIcs
JF - Leibniz International Proceedings in Informatics, LIPIcs
T2 - 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024
Y2 - 25 June 2024 through 27 June 2024
ER -