TY - JOUR

T1 - All-pairs suffix/prefix in optimal time using Aho-Corasick space

AU - Loukidis, Grigorios

AU - Pissis, Solon P.

N1 - Funding Information:
The authors would like to thank Anne Luesink (Vrije Universiteit) for implementing the algorithm underlying Theorem 1 . The source code is freely available at https://github.com/ALuesink/APSP_algorithm . Grigorios Loukides is partially supported by the Leverhulme Trust RPG-2019-399 project. Solon P. Pissis is partially supported by the PANGAIA and ALPACA projects that have received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229 , respectively.
Publisher Copyright:
© 2022 The Authors

PY - 2022/11

Y1 - 2022/11

N2 - The all-pairs suffix/prefix (APSP) problem is a classic problem in computer science with many applications in bioinformatics. Given a set $\{S_1,\ldots,S_k\}$ of $k$ strings of total length $n$, we are asked to find, for each string $S_i$, $i\in[1,k]$, its longest suffix that is a prefix of string $S_j$, for all $j\neq i$, $j\in[1,k]$. Several algorithms running in the optimal $\cO(n+k^2)$ time for solving APSP are known. All of these algorithms are based on suffix sorting and thus require space $\Omega(n)$ in any case. We consider the parameterized version of the APSP problem, denoted by $\ell$-APSP, in which we are asked to output only the pairs whose suffix/prefix overlap is of length at least $\ell$. We give an algorithm for solving $\ell$-APSP that runs in the optimal $\cO(n+|\LOUTPUT|)$ time using $\cO(n)$ space, where $\LOUTPUT$ is the set of output pairs. Our algorithm is thus optimal for the APSP problem as well by setting $\ell=0$. Notably, our algorithm is fundamentally different from all optimal algorithms solving the APSP problem: it does not rely on sorting the suffixes of all input strings but on a novel traversal of the Aho-Corasick machine, and it thus requires space linear in the size of the machine.

AB - The all-pairs suffix/prefix (APSP) problem is a classic problem in computer science with many applications in bioinformatics. Given a set $\{S_1,\ldots,S_k\}$ of $k$ strings of total length $n$, we are asked to find, for each string $S_i$, $i\in[1,k]$, its longest suffix that is a prefix of string $S_j$, for all $j\neq i$, $j\in[1,k]$. Several algorithms running in the optimal $\cO(n+k^2)$ time for solving APSP are known. All of these algorithms are based on suffix sorting and thus require space $\Omega(n)$ in any case. We consider the parameterized version of the APSP problem, denoted by $\ell$-APSP, in which we are asked to output only the pairs whose suffix/prefix overlap is of length at least $\ell$. We give an algorithm for solving $\ell$-APSP that runs in the optimal $\cO(n+|\LOUTPUT|)$ time using $\cO(n)$ space, where $\LOUTPUT$ is the set of output pairs. Our algorithm is thus optimal for the APSP problem as well by setting $\ell=0$. Notably, our algorithm is fundamentally different from all optimal algorithms solving the APSP problem: it does not rely on sorting the suffixes of all input strings but on a novel traversal of the Aho-Corasick machine, and it thus requires space linear in the size of the machine.

UR - http://www.scopus.com/inward/record.url?scp=85129975164&partnerID=8YFLogxK

U2 - 10.1016/j.ipl.2022.106275

DO - 10.1016/j.ipl.2022.106275

M3 - Article

SN - 0020-0190

VL - 178

JO - INFORMATION PROCESSING LETTERS

JF - INFORMATION PROCESSING LETTERS

M1 - 106275

ER -