Using Machine Learning and Systems-Biology Approaches to Analyse Next-Generation Sequence Data in Cancers

Student thesis: Doctoral Thesis, Doctor of Philosophy


The availability of exome sequence data for thousands of cancer samples has enabled the investigation of the sequence-level mutations that contribute to cancer. There is a need for strategies to analyse sequence data to gain new biological and clinical insights. This thesis investigates the use of machine learning and network-based methods to identify the mutated genes associated with important clinical features and cancer types, and to aid candidate gene prioritisation in colorectal cancer and rheumatoid arthritis.

Firstly, tumour/normal exome sequence data were analysed to identify the mutated genes associated with cancer grade and cancer stage across and within three adenocarcinomas. Tumour grading is an important prognostic indicator, but it is based upon subjective assessment by pathologists and is not standardised across cancer types. Despite this, the study found that protein-coding mutations within TP53 were indicative of high-grade status across the three adenocarcinomas after adjustment for age, gender, stage, and tumour type.

Secondly, Random Forest models were used to identify the mutations that discriminate each of five high-order cancer types. Based on this work, a Random Forest approach was used to investigate whether exome sequence data could be used to assign cancers to their tissue of origin without prior knowledge, for future use as a classifier for cancers of unknown primary origin.

Finally, a network-based method for candidate disease gene prioritisation, called ‘k-pseudo cliques analysis’, was developed. The method identifies sets of highly interacting proteins that are enriched for low gene-level p-values. In tests, the identified gene sets outperformed a univariate test for general cancer gene enrichment. As part of the final chapter, a network-based method called ‘Region Growing Analysis’ was used to perform candidate disease gene prioritisation of rheumatoid arthritis genome-wide association study data.

The findings and methods developed in this thesis can provide insights into the genetic correlates of cancer phenotypes and suggest new candidate disease genes.
Date of Award: 2016
Original language: English
Awarding Institution: King's College London
Supervisors: Cathryn Lewis & Richard Dobson
