Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, Michelle Sweering

Original language | English |
---|---|

Title of host publication | 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 |

Editors | Hideo Bannai, Jan Holub |

Publisher | Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing |

ISBN (Electronic) | 9783959772341 |

DOIs | |

Published | 1 Jun 2022 |

Additional links | |

Event | 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 - Prague, Czech Republic Duration: 27 Jun 2022 → 29 Jun 2022 |

Name | Leibniz International Proceedings in Informatics, LIPIcs |
---|---|

Volume | 223 |

ISSN (Print) | 1868-8969 |

Conference | 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 |
---|---|

Country/Territory | Czech Republic |

City | Prague |

Period | 27/06/2022 → 29/06/2022 |

Funding Information:
Funding The work in this paper is supported in part by: the Netherlands Organisation for Scientific Research (NWO) through project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)” and Gravitation-grant NETWORKS-024.002.003; a CSC scholarship; the Leverhulme Trust RPG-2019-399 project; and the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Funding Information:
The work in this paper is supported in part by: the Netherlands Organisation for Scientific Research (NWO) through project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)” and Gravitation-grant NETWORKS-024.002.003; a CSC scholarship; the Leverhulme Trust RPG-2019-399 project; and the PANGAIA and ALPACA projects that have received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Publisher Copyright:
© Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering; licensed under Creative Commons License CC-BY 4.0

A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler's theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph GS,k(V, E), where V is the set of length-(k − 1) substrings of the strings in S, and GS,k contains an edge (u, v) with multiplicity mu,v, if and only if the string u[0] · v is equal to the string u · v[k − 2] and this string occurs exactly mu,v times in total in strings in S. Let GΣ,k(VΣ,k, EΣ,k) be the complete dBG of Σ^{k}. The Eulerian Extension (EE) problem on GS,k asks to extend GS,k with a set B of nodes from VΣ,k and a smallest multiset A of edges from EΣ,k to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of GS,k but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on GS,k is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1. When GS,k is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying GΣ,k and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in O(|V |k log d + |E|) time, which is nearly optimal, since the size of GS,k is Θ(|V |k + |E|). 2. When GS,k is not balanced, we are asked to extend GS,k to HS,k(V ∪ B, E ∪ A) such that every node of HS,k is balanced and the total number |A| of added edges is minimized. We show that this problem can be solved in the optimal O(k|V | + |E| + |A|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.

King's College London - Homepage

© 2020 King's College London | Strand | London WC2R 2LS | England | United Kingdom | Tel +44 (0)20 7836 5454