Machine learning in precision oncology

Student thesis: Doctoral Thesis, Doctor of Philosophy


Cancer is a disease with genetic underpinnings, associated with changes in genes that control how cells function, grow and divide. Many efforts have been made to understand cancer formation and progression, especially by finding genes linked to specific types of cancer. Because cancer is a complex disease, identifying cancer subgroups and investigating them separately may deepen our knowledge of driver genes and oncogenic pathways, which can be used to direct treatment. The selected gene biomarkers can then be targeted by specific drugs to change their functionality, which can improve patient outcomes. This thesis focuses on using machine learning algorithms in precision diagnosis and precision treatment, two main parts of precision oncology, to illustrate how the analysis of large and complex biological data can improve cancer prognosis and support clinical decisions.

First, I reported a bioinformatics strategy in which transcriptomic profiles across multiple breast cancer datasets were integrated and a machine learning model was generated to classify tumours into relevant histological grades. The resulting Cancer Grade Model (CGM) was then used to re-classify samples into low-risk or high-risk categories. The model offers a means to improve our understanding of breast cancer subgroups and to support precision treatments, thereby potentially helping to prevent underdiagnosis of high-risk tumours and to minimise overtreatment of low-risk disease. Key genes were extracted to predict metastasis, risk of relapse and overall survival, regardless of traditional histologically defined receptor status. These markers might also provide potential therapeutic targets for disease currently lacking treatment options.

Next, to find potential drugs that can bind to the selected biomarkers, a machine learning based method named DT2Vec was proposed. It relies on the key underlying assumption that similar drugs tend to target similar proteins and vice versa. DT2Vec was implemented using XGBoost and evaluated on two different datasets. The results on the test sets were compared with three open-source machine learning models, namely DNILMF, DT-Hybrid and DDR, which have been reported as top-performing methods in drug-target interaction (DTI) prediction. Additionally, to train a realistic model for detecting novel interactions, the model was implemented on a dataset from the ChEMBL database. After demonstrating the performance of the model on the new dataset, the model was applied to unknown DTIs to detect novel interactions with high prediction scores.
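The assumption that similar drugs tend to target similar proteins can be illustrated with a toy nearest-neighbour scoring scheme. This is a minimal sketch only and is not the DT2Vec method, which the abstract describes as an XGBoost-based model; the function `score_pair`, the drug and target names, and all similarity values below are hypothetical and chosen purely for illustration.

```python
# Toy illustration only: names and similarity values are hypothetical,
# not taken from the thesis, and this is not the DT2Vec algorithm.

def score_pair(drug, target, known, drug_sim, target_sim):
    """Score a candidate (drug, target) pair as the highest product of
    drug-drug and target-target similarity to any known interaction."""
    best = 0.0
    for known_drug, known_target in known:
        s = drug_sim[drug][known_drug] * target_sim[target][known_target]
        best = max(best, s)
    return best

# Hypothetical pairwise similarities in [0, 1].
drug_sim = {
    "d1": {"d1": 1.0, "d2": 0.9, "d3": 0.1},
    "d2": {"d1": 0.9, "d2": 1.0, "d3": 0.2},
    "d3": {"d1": 0.1, "d2": 0.2, "d3": 1.0},
}
target_sim = {
    "t1": {"t1": 1.0, "t2": 0.8},
    "t2": {"t1": 0.8, "t2": 1.0},
}

# One known interaction: drug d1 binds target t1.
known = [("d1", "t1")]

# Rank unobserved pairs from most to least plausible under the assumption.
candidates = [("d2", "t1"), ("d2", "t2"), ("d3", "t2")]
ranked = sorted(
    candidates,
    key=lambda p: score_pair(p[0], p[1], known, drug_sim, target_sim),
    reverse=True,
)
print(ranked)
```

Here the pair closest to the known interaction in both drug and target similarity is ranked first. A learned model would replace this hand-written score with a classifier trained on features derived from the similarity information.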

Finally, I developed an extended version of the DT2Vec method, DT2Vec+, to determine drug modes of action by integrating associations between drugs, diseases, and proteins into a heterogeneous graph. In targeted treatment, understanding the type of interaction and finding the right drugs that can interact with a specific target and produce the desired changes is an important step in drug repurposing. The model achieved high performance on external test sets when stratifying interactions into six different interaction types. To my knowledge, this work is the first machine learning method investigating six different types of drug-target interaction.

Overall, different machine learning based pipelines have been implemented to stratify tumour samples into better-defined prognostic categories, select gene signatures, find drugs able to target the selected genes, and predict the type of drug-target interaction, all of which can advance precision oncology research. All the methods were tested on external test sets using a cross-validation scheme and showed excellent results. The main focus of the thesis was on breast cancer, but the proposed approaches are generic and can be applied to other cancer types and associated gene biomarkers. Moreover, the models were applied to external datasets and their exploratory outcomes were discussed in detail, which could provide valuable insights into breast cancer subgroups and precision diagnosis and accelerate progress in precision targeted treatment.
Date of Award: 1 Aug 2022
Original language: English
Awarding Institution:
  • King's College London
Supervisors: Sophia Tsoka & Vasilis Friderikos