TY - CHAP
T1 - On Strings Having the Same Length-k Substrings
AU - Bernardini, Giulia
AU - Conte, Alessio
AU - Gabory, Esteban
AU - Grossi, Roberto
AU - Loukides, Grigorios
AU - Pissis, Solon P.
AU - Punzi, Giulia
AU - Sweering, Michelle
N1 - Funding Information:
The work in this paper is supported in part by: the MIUR project Algorithms for HArnessing networked Data (AHeAD); and by the PANGAIA and ALPACA projects that have received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Giulia Bernardini: Supported by the Netherlands Organisation for Scientific Research (NWO) under project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)”. Grigorios Loukides: Supported by the Leverhulme Trust RPG-2019-399 project. Michelle Sweering: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003.
Funding Information:
Funding The work in this paper is supported in part by: the MIUR project Algorithms for HArnessing networked Data (AHeAD); and by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Giulia Bernardini: Supported by the Netherlands Organisation for Scientific Research (NWO) under project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)”. Grigorios Loukides: Supported by the Leverhulme Trust RPG-2019-399 project. Michelle Sweering: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003.
Publisher Copyright:
© Giulia Bernardini, Alessio Conte, Esteban Gabory, Roberto Grossi, Grigorios Loukides, Solon P. Pissis, Giulia Punzi, and Michelle Sweering; licensed under Creative Commons License CC-BY 4.0
PY - 2022/6/1
Y1 - 2022/6/1
N2 - Let Substrk(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest Sk-Equivalent Strings: Given a set Sk of n length-k strings and an integer z > 0, list z shortest distinct strings T1,..., Tz such that Substrk(Ti) = Sk, for all i ∈ [1, z]. The z-Shortest Sk-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest Sk-Equivalent Strings, referred to as Shortest Sk-Equivalent String, asks for a shortest string X such that Substrk(X) = Sk. Our main contributions are summarized below: Given a directed graph G(V, E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in Õ(|E||V |) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest Sk-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP. We show that the length of a shortest string output by Shortest Sk-Equivalent String is in O(k + n2). We generalize this bound by showing that the total length of z shortest strings is in O(zk + zn2 + z2n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs. We present an algorithm for solving z-Shortest Sk-Equivalent Strings in O(nk + n2 log2 n + zn2 log n + |output|) time. If z = 1, the time becomes O(nk + n2 log2 n) by the fact that the size of the input is Θ(nk) and the size of the output is O(k + n2).
AB - Let Substrk(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest Sk-Equivalent Strings: Given a set Sk of n length-k strings and an integer z > 0, list z shortest distinct strings T1,..., Tz such that Substrk(Ti) = Sk, for all i ∈ [1, z]. The z-Shortest Sk-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest Sk-Equivalent Strings, referred to as Shortest Sk-Equivalent String, asks for a shortest string X such that Substrk(X) = Sk. Our main contributions are summarized below: Given a directed graph G(V, E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in Õ(|E||V |) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest Sk-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP. We show that the length of a shortest string output by Shortest Sk-Equivalent String is in O(k + n2). We generalize this bound by showing that the total length of z shortest strings is in O(zk + zn2 + z2n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs. We present an algorithm for solving z-Shortest Sk-Equivalent Strings in O(nk + n2 log2 n + zn2 log n + |output|) time. If z = 1, the time becomes O(nk + n2 log2 n) by the fact that the size of the input is Θ(nk) and the size of the output is O(k + n2).
KW - Chinese Postman
KW - combinatorics on words
KW - de Bruijn graph
KW - string algorithms
UR - http://www.scopus.com/inward/record.url?scp=85134333057&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.CPM.2022.16
DO - 10.4230/LIPIcs.CPM.2022.16
M3 - Conference paper
AN - SCOPUS:85134333057
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
A2 - Bannai, Hideo
A2 - Holub, Jan
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
Y2 - 27 June 2022 through 29 June 2022
ER -