AbstractGenomics, owing to its immediate applications in medicine, forensics, evolu-tionary and molecular biology etc., has witnessed a dramatic advancement in the technology of acquiring and generating data. Consequently, the botleneck of the information-extraction pipeline has shifted from data-acquisition to the computational capacity for storing and processing prodigious amounts of data. Uncertainty and identification of regularity in data are two key aspects contributing to the complexity of the task of mining knowledge and insights from genomic data.
One form of macro-level uncertainty arises in sequential data when a single representation is used for a multitude of strings which are by and large similar. For example, in human genomics, the reference genome has been represented as a single sequence so far. Now, with the availability of a vast collection of human genomes, the so called reference cohorts seem more sensible in order to avoid the reference-bias presented by a single genomic sequence. Different representations have recently been explored in an attempt to organise human genomic sequences in reference cohorts. Each such representation has its own challenges.
Moreover, in genomic sequences, local regularity (a term encapsulating various forms of repetitions) is often flanked by regions of interest – genes, for example – which are, in comparison, not regular. In other words, the regularity of a local segment of genomic data is indicative of it being a potential biologically-important region. One of the multiple possible ways to express this notion of local regularity of strings can be in terms of unbordered factors of a string. A border of a string – one of the central properties characterising the regularity associated with repetitions – is its (possibly empty) proper factor occurring both as a prefix and as a suffix.
This dissertation presents an assortment of efficient novel algorithms – based on string algorithms and data-structures – to solve three problems that find direct or indirect applications in genomic data analysis. Specifically, the presented algorithms handle the uncertainty arising in the representation of an ensemble of sequences as well as characterise the regularity present in a sequence in terms of unbordered factors.
Firstly, we present an optimal algorithm – in terms of both time and space –improving the state-of-the-art, to identify Superbubbles (a special type of self-contained subgraphs, each with a single source and a single sink) in de Bruijn sequence graphs for genome assembly. Identifying these motifs in a reference graph is crucial for overcoming the lack of a coordinate system in the graphical representa-
|Date of Award||2019|
|Supervisor||Solon Pissis (Supervisor) & Toktam Mahmoodi (Supervisor)|