TY - CHAP
T1 - Approximate Suffix-Prefix Dictionary Queries
AU - Zuba, Wiktor
AU - Loukides, Grigorios
AU - Pissis, Solon P.
AU - Thankachan, Sharma V.
N1 - Publisher Copyright:
© Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, and Sharma V. Thankachan.
PY - 2024/8/23
Y1 - 2024/8/23
N2 - In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S1, . . ., Sr, of total length n, and we are asked to find the length SPLi,j of the longest string that is both a suffix of Si and a prefix of Sj, for all i, j ∈ [1 . . r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r2 pairs of strings explicitly is prohibitive. This is when the data structure variant of APSP makes sense; in the same spirit as distance oracles computing shortest paths between any two vertices given online. We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPLki,j be the length of the longest suffix of Si that is at distance at most k from a prefix of Sj. In particular, for any k = O(1), we show an O(nlogk n)-sized data structure to support the following queries: One-to-Onek(i, j): output SPLki,j in O(logk nlog log n) time. Reportk(i, d): output all j ∈ [1 . . r], such that SPLki,j ≥ d, in O(logk n(log n/log log n + output)) time, where output denotes the size of the output. In fact, our algorithms work for any value of k not just for k = O(1), but the formulas bounding the complexities get much more complicated for larger values of k.
AB - In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S1, . . ., Sr, of total length n, and we are asked to find the length SPLi,j of the longest string that is both a suffix of Si and a prefix of Sj, for all i, j ∈ [1 . . r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r2 pairs of strings explicitly is prohibitive. This is when the data structure variant of APSP makes sense; in the same spirit as distance oracles computing shortest paths between any two vertices given online. We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPLki,j be the length of the longest suffix of Si that is at distance at most k from a prefix of Sj. In particular, for any k = O(1), we show an O(nlogk n)-sized data structure to support the following queries: One-to-Onek(i, j): output SPLki,j in O(logk nlog log n) time. Reportk(i, d): output all j ∈ [1 . . r], such that SPLki,j ≥ d, in O(logk n(log n/log log n + output)) time, where output denotes the size of the output. In fact, our algorithms work for any value of k not just for k = O(1), but the formulas bounding the complexities get much more complicated for larger values of k.
KW - all-pairs suffix-prefix
KW - k-errata tree
KW - suffix tree
KW - suffix-prefix queries
UR - http://www.scopus.com/inward/record.url?scp=85203331424&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.MFCS.2024.85
DO - 10.4230/LIPIcs.MFCS.2024.85
M3 - Conference paper
AN - SCOPUS:85203331424
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 49th International Symposium on Mathematical Foundations of Computer Science, MFCS 2024
A2 - Kralovic, Rastislav
A2 - Kucera, Antonin
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 49th International Symposium on Mathematical Foundations of Computer Science, MFCS 2024
Y2 - 26 August 2024 through 30 August 2024
ER -