TY - CHAP
T1 - Finding the Anticover of a String
AU - Alzamel, Mai
AU - Conte, Alessio
AU - Denzumi, Shuhei
AU - Grossi, Roberto
AU - Iliopoulos, Costas S.
AU - Kurita, Kazuhiro
AU - Wasa, Kunihiro
PY - 2020/6/1
Y1 - 2020/6/1
N2 - A k-anticover of a string x is a set of pairwise distinct factors of x of equal length k, such that every symbol of x is contained into an occurrence of at least one of those factors. The existence of a k-anticover can be seen as a notion of non-redundancy, which has application in computational biology, where they are associated with various non-regulatory mechanisms. In this paper we address the complexity of the problem of finding a k-anticover of a string x if it exists, showing that the decision problem is NP-complete on general strings for k-3. We also show that the problem admits a polynomial-time solution for k = 2. For unbounded k, we provide an exact exponential algorithm to find a k-anticover of a string of length n (or determine that none exists), which runs in O(min{3 n-k 3 , ( k(k+1) 2 ) n k+1 }) time using polynomial space. 2012 ACM Subject Classification Mathematics of computing ! Combinatorics on words.
AB - A k-anticover of a string x is a set of pairwise distinct factors of x of equal length k, such that every symbol of x is contained into an occurrence of at least one of those factors. The existence of a k-anticover can be seen as a notion of non-redundancy, which has application in computational biology, where they are associated with various non-regulatory mechanisms. In this paper we address the complexity of the problem of finding a k-anticover of a string x if it exists, showing that the decision problem is NP-complete on general strings for k-3. We also show that the problem admits a polynomial-time solution for k = 2. For unbounded k, we provide an exact exponential algorithm to find a k-anticover of a string of length n (or determine that none exists), which runs in O(min{3 n-k 3 , ( k(k+1) 2 ) n k+1 }) time using polynomial space. 2012 ACM Subject Classification Mathematics of computing ! Combinatorics on words.
KW - Anticover
KW - Np-complete
KW - String algorithms
KW - Stringology
UR - http://www.scopus.com/inward/record.url?scp=85088393876&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.CPM.2020.2
DO - 10.4230/LIPIcs.CPM.2020.2
M3 - Conference paper
AN - SCOPUS:85088393876
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020
A2 - Gortz, Inge Li
A2 - Weimann, Oren
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020
Y2 - 17 June 2020 through 19 June 2020
ER -
TY - JOUR
T1 - GenMap
T2 - ultra-fast computation of genome mappability
AU - Pockrandt, Christopher
AU - Alzamel, Mai
AU - Iliopoulos, Costas S.
AU - Reinert, Knut
PY - 2020/3/31
Y1 - 2020/3/31
N2 - Motivation: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. Results: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position.
AB - Motivation: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. Results: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position.
UR - http://www.scopus.com/inward/record.url?scp=85087320301&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btaa222
DO - 10.1093/bioinformatics/btaa222
M3 - Article
C2 - 32246826
AN - SCOPUS:85087320301
VL - 36
SP - 3687
EP - 3692
JO - Bioinformatics (Oxford, England)
JF - Bioinformatics (Oxford, England)
SN - 1367-4811
IS - 12
ER -
TY - JOUR
T1 - Comparing Degenerate Strings
AU - Alzamel, Mai
AU - Ayad, Lorraine A.K.
AU - Bernardini, Giulia
AU - Grossi, Roberto
AU - Iliopoulos, Costas S.
AU - Pisanti, Nadia
AU - Pissis, Solon P.
AU - Rosone, Giovanna
PY - 2020
Y1 - 2020
N2 - Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote by W the sum of these lengths k0, k1,... , kn-1. Our main result is an (N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in (min{W, n2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
AB - Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote by W the sum of these lengths k0, k1,... , kn-1. Our main result is an (N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in (min{W, n2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
KW - degenerate strings
KW - elastic-degenerate strings
KW - generalized degenerate strings
KW - palindromes
KW - string comparison
UR - http://www.scopus.com/inward/record.url?scp=85092930420&partnerID=8YFLogxK
U2 - 10.3233/FI-2020-1947
DO - 10.3233/FI-2020-1947
M3 - Article
AN - SCOPUS:85092930420
VL - 175
SP - 41
EP - 58
JO - FUNDAMENTA INFORMATICAE
JF - FUNDAMENTA INFORMATICAE
SN - 0169-2968
IS - 1-4
ER -