This thesis studies four computational problems derived from molecular sequence analysis, which play a core role in many real-life applications. Its purpose is to develop efficient, fast and practical algorithms for use in sequence analysis. The approach is motivated by bio-informatics, but has a wide range of other applications. Firstly, we focus on the 1-mappability problem associated with a given sequence; in this the goal is to compute a table, wherein each entry comprises the number of repeats of specific substrings of a given length that start at this entry, with one mismatch at most. Furthermore, we present an "expected" linear-time algorithm (average case complexity) linked to the same problem, and we also generalise the algorithm to k-mappability that permits up to k mismatches. Additionally, we study a new type of uncertain string, called "degenerate strings"; here our goal is to locate maximal palindromes. We provide a linear algorithm for string comparison, and then use this to provide two algorithms to report maximal palindromes. Moreover, we illustrate a novel algorithm in the presence of k errors, as a way to determine whether an input string is a "closed string". Finally, we revisit the well-known Range Minimum Query (RMQ) problem and consider its variant when processing a small batch of RMQs in real-world applications, as well as the connection between the RMQ and the Lowest Common Ancestor (LCA) problem. In the case of all the new algorithms presented here, we implemented the necessary libraries and demonstrated the efficiency of the algorithms by testing the software across extensive data-sets.
Efficient and practical algorithms for sequence analysis: algorithm and data analysis research
Alzamel, M. (Author). 1 Sept 2021
Student thesis: Doctoral Thesis › Doctor of Philosophy