King's College London

Research portal

Bioinformatic analysis of genomic sequencing data : Read alignment and variant evaluation

Student thesis: Doctoral Thesis, Doctor of Philosophy

The invention and rise in popularity of Next Generation Sequencing (NGS) technologies
has led to a steep increase in sequencing data and to new challenges.
This thesis aims to contribute methods for the analysis of NGS data, and focuses
on two of the challenges presented by these data.
The first challenge regards the need for NGS reads to be aligned to a reference
sequence, as their short length complicates direct assembly. A great number of
tools exist that carry out this task quickly and efficiently, yet they all rely on
the mere count of mismatches in order to assess alignments, ignoring the knowledge
that genome composition and mutation frequencies are biased. Thus, the
use of a scoring matrix that incorporates the mutation and composition biases
observed among humans was tested with simulated reads. The scoring matrix
was implemented and incorporated into the in-house algorithm REAL, allowing
side-by-side comparison of the performance of the biased model and the
mismatch count. The algorithm REAL was also used to investigate the applicability
of NGS RNA-seq data to the understanding of the relationship between
genomic expression and the compartmentalisation of genomic base composition
into isochores.
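
The contrast between a plain mismatch count and a biased scoring matrix can be illustrated with a minimal sketch. The scores below are invented for illustration (transitions are penalised less than transversions) and are not the values used in the thesis or in REAL; the function names are hypothetical, and the sketch covers only mutation bias, not base composition.

```python
# Hypothetical illustration only: scores are NOT taken from the thesis or from REAL.
# Transitions (A<->G, C<->T) are assumed here to be more frequent than transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def mismatch_count(read, ref):
    """Count positions where the read and the reference window differ (lower is better)."""
    if len(read) != len(ref):
        raise ValueError("read and reference window must have equal length")
    return sum(1 for r, g in zip(read, ref) if r != g)


def substitution_score(r, g, match=1.0, transition=-1.0, transversion=-2.0):
    """Score one aligned base pair with assumed, illustrative weights."""
    if r == g:
        return match
    if (r, g) in TRANSITIONS:
        return transition
    return transversion


def matrix_score(read, ref):
    """Sum per-position scores over an ungapped alignment (higher is better)."""
    if len(read) != len(ref):
        raise ValueError("read and reference window must have equal length")
    return sum(substitution_score(r, g) for r, g in zip(read, ref))


read = "ACGT"
candidate_a = "ACAT"  # differs from the read by a transition (G vs A)
candidate_b = "ACCT"  # differs from the read by a transversion (G vs C)
```

Under these assumed weights, both candidate positions have a mismatch count of one, so the count alone cannot separate them, whereas `matrix_score` ranks the transition candidate above the transversion candidate.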
The second challenge regards the evaluation of the variants (SNPs) that are
discovered by sequencing. NGS technologies have caused a sharp rise in the
rate at which new SNPs are discovered, rendering impossible the experimental
validation of each one. Several tools exist that take into account various
properties of the genome, the transcripts and the protein products relevant to
the location of a SNP and attempt to predict the SNP's impact. These tools
are valuable in screening and prioritising SNPs likely to have a causative
association with a genetic disease of interest. Despite the number of individual
tools and the diversity of their resources, no attempt had been made to draw a
consensus among them. Two consensus approaches were considered, one based
on a simple majority vote of the tools considered, and one based on
machine learning. Both methods proved to offer highly competitive classification
both against the individual tools and against other consensus methods that
were published in the meantime.
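
The majority-vote consensus can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the labels, the tie-handling rule and the function name are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions, tie_label="unknown"):
    """Return the label predicted by most tools.

    `predictions` holds one label per tool for a single SNP. Returning
    `tie_label` when the top two labels are tied is an assumption of this
    sketch, not a rule taken from the thesis.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Hypothetical calls from three tools for one SNP.
consensus = majority_vote(["deleterious", "deleterious", "neutral"])
```

A machine-learning consensus would instead treat the individual tool outputs as input features to a trained classifier; that variant is not shown here.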
Original language: English
Awarding Institution
Award date: 2014


