Multiple sequence alignment is a core computational task in bioinformatics and has been extensively studied over the past decades. This computation requires an implicit assumption on the input data: the left- and right-most position for each sequence is relevant. However, this is not the case for circular structures; for instance, MtDNA. Efforts have been made to address this issue but it is far from being solved. We have very recently introduced a fast algorithm for approximate circular string matching (Barton et al., Algo Mol Biol, 2014). Here, we first show how to extend this algorithm for approximate circular dictionary matching; and, then, apply this solution with agglomerative hierarchical clustering to find a sufficiently good rotation for each sequence. Furthermore, we propose an alternative method that is suitable for more divergent sequences. We implemented these methods in BEAR, a programme for improving multiple circular sequence alignment. Experimental results, using real and synthetic data, show the high accuracy and efficiency of these new methods in terms of the inferred likelihood-based phylogenies.
|Lecture Notes in Computer Science
|Springer International Publishing Switzerland