TY - JOUR
T1 - Comparing Degenerate Strings
AU - Alzamel, Mai
AU - Ayad, Lorraine A.K.
AU - Bernardini, Giulia
AU - Grossi, Roberto
AU - Iliopoulos, Costas S.
AU - Pisanti, Nadia
AU - Pissis, Solon P.
AU - Rosone, Giovanna
PY - 2020
Y1 - 2020
N2 - Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote by W the sum of these lengths k0, k1,... , kn-1. Our main result is an (N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in (min{W, n2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
AB - Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote by W the sum of these lengths k0, k1,... , kn-1. Our main result is an (N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in (min{W, n2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
KW - degenerate strings
KW - elastic-degenerate strings
KW - generalized degenerate strings
KW - palindromes
KW - string comparison
UR - http://www.scopus.com/inward/record.url?scp=85092930420&partnerID=8YFLogxK
U2 - 10.3233/FI-2020-1947
DO - 10.3233/FI-2020-1947
M3 - Article
AN - SCOPUS:85092930420
VL - 175
SP - 41
EP - 58
JO - FUNDAMENTA INFORMATICAE
JF - FUNDAMENTA INFORMATICAE
SN - 0169-2968
IS - 1-4
ER -
TY - JOUR
T1 - Faster algorithms for 1-mappability of a sequence
AU - Alzamel, Mai
AU - Charalampopoulos, Panagiotis
AU - Iliopoulos, Costas S.
AU - Pissis, Solon P.
AU - Radoszewski, Jakub
AU - Sung, Wing-Kin
PY - 2019/5/23
Y1 - 2019/5/23
N2 - In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k=1. There exists an algorithm to solve this problem for k=1 requiring time O(mnlogn/loglogn) using space O(n). Here we present two new algorithms that require worst-case time O(mn) and O(nlognloglogn), respectively, and space O(n), thus greatly improving the previous result. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m=Ω(log
σn). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m=Ω(klog
σn), assuming that the letters are independent and uniformly distributed random variables. Finally, we provide an experimental evaluation of our average-case algorithm demonstrating its competitiveness to the state-of-the-art implementation.
AB - In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k=1. There exists an algorithm to solve this problem for k=1 requiring time O(mnlogn/loglogn) using space O(n). Here we present two new algorithms that require worst-case time O(mn) and O(nlognloglogn), respectively, and space O(n), thus greatly improving the previous result. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m=Ω(log
σn). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m=Ω(klog
σn), assuming that the letters are independent and uniformly distributed random variables. Finally, we provide an experimental evaluation of our average-case algorithm demonstrating its competitiveness to the state-of-the-art implementation.
KW - Algorithms on strings
KW - Hamming distance
KW - Sequence mappability
UR - http://www.scopus.com/inward/record.url?scp=85067179715&partnerID=8YFLogxK
U2 - 10.1016/j.tcs.2019.04.026
DO - 10.1016/j.tcs.2019.04.026
M3 - Article
JO - Theoretical Computer Science
JF - Theoretical Computer Science
SN - 0304-3975
ER -