Biological sequences such as DNA, RNA and proteins can be represented as a sequence or string of characters. In this thesis I present the work I have done to implement new heuristic and exact string algorithms and show how they can be applied with existing algorithms to perform sequence analysis. I present examples of applications of these algorithms to solve problems in Bioinformatics and show how they are competitive or orders of magnitude superior to existing algorithms.
At the start of this thesis we present a flexible implementation of the MaxShiftM bitparallel algorithm (Crochemore et al., 2010) for performing Fixed-Length Approximate String Matching (FLASM) under the Hamming distance model. We also describe how we can apply Myers’ bit-parallel approximate pattern matching algorithm (Myers, 1999) for solving the FLASM problem under the Edit distance model. FLASM is the problem of searching for all factors of length ℓ of a pattern p in a text t. The two algorithms mentioned were packaged into a software library called libFLASM. We then presented multiple applications for these algorithms – we used the algorithms for implementing an algorithmic text pre-processing step known as the Chang and Marr Index (Chang, W.I. and Marr, T.G., 1994), we showed how to use the algorithms for pairwise circular sequence alignment, we incorporated it into BEAR (BEst Aligned Rotations), a state-of-the-art tool for improving Multiple Circular Sequence Alignment (MCSA), and finally we incorporated it into MoTeX-II a state-of-the-art tool to identify single and structured motifs. We also compare the algorithms to two competitors and show that whilst not being extremely fast, the FLASM algorithms are very effective and robust in practice, especially under increasing error rate.
Next, the thesis deals with the pairwise circular sequence comparison problem (CSC) which we solve using a technique involving q-grams and splitting the circular sequences into blocks. The CSC problem consists in finding an optimal linear alignment of a pair of circular strings. Circular sequences are abundant in nature and can be found in Mitochondrial DNA, bacterial and viral genomes and in some protein structures. We describe two exact algorithms and one heuristic algorithm for solving the problem and demonstrate their speed and accuracy using synthetic and real data. We compare our algorithms to each other for speed and accuracy and compare them to other accurate but comparatively slow algorithms and show one of the algorithms, saCSC, to be very fast. We then explain how we improved its accuracy by a refinement step to the point it was fit to be incorporated into BEAR for performing multiple circular sequence alignment.
Finally, the thesis deals with algorithms for searching for patterns in Elastic- Degenerate (ED) texts, a text-based format for expressing variation in a group of related sequences such as a pan-genome. We present three algorithms in total to solve the problem - the first two search a single pattern in an ED text while the third searches for multiple patterns simultaneously. We tested our algorithms for speed against each other and one other competitor using synthetic and real data to find that the bit-parallel algorithms we implemented perform very well. We applied our algorithms to search for patterns in real data taken from the 1000 Human Genomes project. We used the multiple pattern matching algorithm for validating Minimal Absent Words (MAWs) in the Human Genome and for validating the work of researchers (Silva et al, 2015) who identified three MAWs thought to exist in the Ebola virus but not in humans. We discovered these patterns do in fact exist in the greater human population and will present as a false positive for Ebola viral infection.
Original language | English |
---|
Awarding Institution | |
---|
Supervisors/Advisors | |
---|
Award date | 1 Jun 2019 |
---|