King's College London

Research portal

On Strings Having the Same Length-k Substrings

Research output: Chapter in Book/Report/Conference proceedingConference paperpeer-review

Giulia Bernardini, Alessio Conte, Esteban Gabory, Roberto Grossi, Grigorios Loukides, Solon P. Pissis, Giulia Punzi, Michelle Sweering

Original languageEnglish
Title of host publication33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
EditorsHideo Bannai, Jan Holub
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959772341
DOIs
Published1 Jun 2022
Event33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 - Prague, Czech Republic
Duration: 27 Jun 202229 Jun 2022

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume223
ISSN (Print)1868-8969

Conference

Conference33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
Country/TerritoryCzech Republic
CityPrague
Period27/06/202229/06/2022

Bibliographical note

Funding Information: The work in this paper is supported in part by: the MIUR project Algorithms for HArnessing networked Data (AHeAD); and by the PANGAIA and ALPACA projects that have received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Giulia Bernardini: Supported by the Netherlands Organisation for Scientific Research (NWO) under project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)”. Grigorios Loukides: Supported by the Leverhulme Trust RPG-2019-399 project. Michelle Sweering: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003. Funding Information: Funding The work in this paper is supported in part by: the MIUR project Algorithms for HArnessing networked Data (AHeAD); and by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Giulia Bernardini: Supported by the Netherlands Organisation for Scientific Research (NWO) under project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)”. Grigorios Loukides: Supported by the Leverhulme Trust RPG-2019-399 project. Michelle Sweering: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003. Publisher Copyright: © Giulia Bernardini, Alessio Conte, Esteban Gabory, Roberto Grossi, Grigorios Loukides, Solon P. Pissis, Giulia Punzi, and Michelle Sweering; licensed under Creative Commons License CC-BY 4.0

King's Authors

Abstract

Let Substrk(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest Sk-Equivalent Strings: Given a set Sk of n length-k strings and an integer z > 0, list z shortest distinct strings T1,..., Tz such that Substrk(Ti) = Sk, for all i ∈ [1, z]. The z-Shortest Sk-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest Sk-Equivalent Strings, referred to as Shortest Sk-Equivalent String, asks for a shortest string X such that Substrk(X) = Sk. Our main contributions are summarized below: Given a directed graph G(V, E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in Õ(|E||V |) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest Sk-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP. We show that the length of a shortest string output by Shortest Sk-Equivalent String is in O(k + n2). We generalize this bound by showing that the total length of z shortest strings is in O(zk + zn2 + z2n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs. We present an algorithm for solving z-Shortest Sk-Equivalent Strings in O(nk + n2 log2 n + zn2 log n + |output|) time. If z = 1, the time becomes O(nk + n2 log2 n) by the fact that the size of the input is Θ(nk) and the size of the output is O(k + n2).

View graph of relations

© 2020 King's College London | Strand | London WC2R 2LS | England | United Kingdom | Tel +44 (0)20 7836 5454