TY - JOUR
T1 - GenMap
T2 - ultra-fast computation of genome mappability
AU - Pockrandt, Christopher
AU - Alzamel, Mai
AU - Iliopoulos, Costas S.
AU - Reinert, Knut
PY - 2020/3/31
Y1 - 2020/3/31
N2 - Motivation: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. Results: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position.
AB - Motivation: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. Results: We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position.
UR - http://www.scopus.com/inward/record.url?scp=85087320301&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btaa222
DO - 10.1093/bioinformatics/btaa222
M3 - Article
C2 - 32246826
AN - SCOPUS:85087320301
VL - 36
SP - 3687
EP - 3692
JO - Bioinformatics (Oxford, England)
JF - Bioinformatics (Oxford, England)
SN - 1367-4811
IS - 12
ER -
TY - JOUR
T1 - Faster algorithms for 1-mappability of a sequence
AU - Alzamel, Mai
AU - Charalampopoulos, Panagiotis
AU - Iliopoulos, Costas S.
AU - Pissis, Solon P.
AU - Radoszewski, Jakub
AU - Sung, Wing-Kin
PY - 2019/5/23
Y1 - 2019/5/23
N2 - In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k=1. There exists an algorithm to solve this problem for k=1 requiring time O(mnlogn/loglogn) using space O(n). Here we present two new algorithms that require worst-case time O(mn) and O(nlognloglogn), respectively, and space O(n), thus greatly improving the previous result. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m=Ω(log
σn). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m=Ω(klog
σn), assuming that the letters are independent and uniformly distributed random variables. Finally, we provide an experimental evaluation of our average-case algorithm demonstrating its competitiveness to the state-of-the-art implementation.
AB - In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k=1. There exists an algorithm to solve this problem for k=1 requiring time O(mnlogn/loglogn) using space O(n). Here we present two new algorithms that require worst-case time O(mn) and O(nlognloglogn), respectively, and space O(n), thus greatly improving the previous result. Moreover, we present another algorithm that requires average-case time and space O(n) for integer alphabets of size σ if m=Ω(log
σn). Notably, we show that this algorithm is generalizable for arbitrary k, requiring average-case time O(kn) and space O(n) if m=Ω(klog
σn), assuming that the letters are independent and uniformly distributed random variables. Finally, we provide an experimental evaluation of our average-case algorithm demonstrating its competitiveness to the state-of-the-art implementation.
KW - Algorithms on strings
KW - Hamming distance
KW - Sequence mappability
UR - http://www.scopus.com/inward/record.url?scp=85067179715&partnerID=8YFLogxK
U2 - 10.1016/j.tcs.2019.04.026
DO - 10.1016/j.tcs.2019.04.026
M3 - Article
JO - Theoretical Computer Science
JF - Theoretical Computer Science
SN - 0304-3975
ER -
TY - JOUR
T1 - Off-line and on-line algorithms for closed string factorization
AU - Alzamel, Mai
AU - Iliopoulos, Costas S.
AU - Smyth, W.F.
AU - Sung, Wing-Kin
PY - 2018/10/30
Y1 - 2018/10/30
N2 - A string X=X[1..n], n>1, is said to be closed if it has a nonempty proper prefix that is also a suffix, but that otherwise occurs nowhere else in X; for n=1, every X is closed. Closed strings were introduced by Fici in [1] as objects of combinatorial interest. Recently Badkobeh et al.[2] described a variety of algorithms to factor a given string into closed factors. In particular, they studied the Longest Closed Factorization (LCF) problem, which greedily computes the decomposition X=X1X2⋯Xk, where X1 is the longest closed prefix of X, X2 the longest closed prefix of X with prefix X1 removed, and so on. In this paper we present an O(logn) amortized per character algorithm to compute LCF on-line, where n is the length of the string. We also introduce the Minimum Closed Factorization (MCF) problem, which identifies the minimum number of closed factors that cover X. We first describe an off-line O(nlog2n)-time algorithm to compute MCF(X), then we present an on-line algorithm for the same problem. In fact, we show that MCF(X) can be computed in O(Llogn) time from MCF(X′), where X′=X[1..n−1], and L is the largest integer such that the suffix X[n−L+1..n] is a substring of X′.
AB - A string X=X[1..n], n>1, is said to be closed if it has a nonempty proper prefix that is also a suffix, but that otherwise occurs nowhere else in X; for n=1, every X is closed. Closed strings were introduced by Fici in [1] as objects of combinatorial interest. Recently Badkobeh et al.[2] described a variety of algorithms to factor a given string into closed factors. In particular, they studied the Longest Closed Factorization (LCF) problem, which greedily computes the decomposition X=X1X2⋯Xk, where X1 is the longest closed prefix of X, X2 the longest closed prefix of X with prefix X1 removed, and so on. In this paper we present an O(logn) amortized per character algorithm to compute LCF on-line, where n is the length of the string. We also introduce the Minimum Closed Factorization (MCF) problem, which identifies the minimum number of closed factors that cover X. We first describe an off-line O(nlog2n)-time algorithm to compute MCF(X), then we present an on-line algorithm for the same problem. In fact, we show that MCF(X) can be computed in O(Llogn) time from MCF(X′), where X′=X[1..n−1], and L is the largest integer such that the suffix X[n−L+1..n] is a substring of X′.
KW - Closed string
KW - On-line
KW - Minimum factorization
U2 - 10.1016/j.tcs.2018.10.033
DO - 10.1016/j.tcs.2018.10.033
M3 - Article
JO - Theoretical Computer Science
JF - Theoretical Computer Science
SN - 0304-3975
ER -
TY - JOUR
T1 - Efficient Computation of Palindromes in Sequences with Uncertainties Extension Version
AU - Alzamel, Mai Abdulaziz M
AU - Liu, Chang
AU - Gao, Jia
AU - Iliopoulos, Costas
PY - 2018
Y1 - 2018
M3 - Article
VL - 163
SP - 253
EP - 266
JO - FUNDAMENTA INFORMATICAE
JF - FUNDAMENTA INFORMATICAE
SN - 0169-2968
IS - 3
ER -