# Algorithms on weighted sequences & applications

Student thesis: Doctoral ThesisDoctor of Philosophy

### Abstract

In this thesis, we study problems on a type of uncertain sequences called weighted sequences. In weighted sequences, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. This type of sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known as PositionWeight Matrices. Normally in weighted sequences problem a cumulative weight threshold 1/z ∈ (0,1] is given and we only consider the substrings with cumulative probability at least 1/z, which we call it valid. Two main problems on weighted sequences are addressed in this thesis: pattern matching and weighted indexing. An application of weighted indexing is also presented to solve palindromic factorization in weighted sequences.

• In pattern matching problem we are given a pattern of length m, a text of length n > m and a cumulative weight threshold 1/z. Our aim is to find all valid occurrences of the pattern in the text. We provide two types of average-case algorithms: one is to solve the problem with searching time o(n) and the other one is to solve the problem with searching time O(nzlogm/m). Both algorithms require a specific weighted ratio z/m . An average-case algorithm for approximate pattern matching on weighted sequences is also provided in the thesis.

• In weighted indexing problem we are aiming to build an index containing all the valid substrings of a given weighted sequences of length n. We introduce a type of property string of length nz called z-estimation which contains all valid substrings, and provide algorithms to construct weighted suffix tree and weighted suffix array in time and space O(nz) based on z-estimation.

• In palindromic factorization problem we are given a weighted sequence of length n and a threshold 1/z. We find all the maximum palindrome in the sequence and give a maximal-palindromic factorization of the given sequence. Our algorithm is based on weighted suffix tree and solve the problem with time and space complexity O(nz).
Experimental results are given using synthetic or real data with DNA alphabet (A, C, G, T) for all the algorithms to prove our complexity theorem or to present the time and space performance.
Date of Award 1 Jul 2019 English King's College London Solon Pissis (Supervisor) & Costas Iliopoulos (Supervisor)

'