Optimisation Models for Pathway Activity Inference in Complex Diseases

Student thesis: Doctoral Thesis, Doctor of Philosophy

Abstract

With advances in high-throughput technologies, there has been an enormous increase in data that profile the activity of molecules in disease. While such data provide a more comprehensive analysis of cellular actions, their large volume and complexity make accurate disease phenotype classification difficult. Novel modelling methods that not only improve accuracy but also offer interpretable means of analysis are therefore important. In this respect, biological pathways (i.e. gene sets that reflect related functional cascades) can be used to incorporate a priori knowledge of biological interactions, so as to reduce data dimensionality from gene level to pathway level and increase the biological interpretability of related methodologies. Methods that infer pathway activity values from high-throughput data have shown good potential for improving understanding of the regulation patterns of gene expression. This thesis focuses on the application and development of mathematical programming models for pathway activity inference in disease classification and gene signature identification in gene profiling data.

First, an optimisation model known as DIGS, which infers pathway activity for precise disease phenotype prediction, is applied to microarray datasets of ischemic stroke and RNA-Seq datasets of colorectal cancer. DIGS is a mixed integer linear programming (MILP) model that aims to separate the different cancer subtypes to the largest possible extent. In a supervised manner, DIGS defines pathway activity as a linear combination of the member gene expression values, each multiplied by its inferred gene weight. Within the DIGS model, gene weights are optimised to maximise the discriminative power of the inferred pathway activity, and the optimisation objective is to minimise the number of incorrect sample allocations.
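The definition of pathway activity given above can be illustrated with a minimal sketch. It assumes a single pathway with three member genes, fixed hypothetical weights and a hypothetical threshold of 0.0 that splits two classes. The function names, toy data and threshold are illustrative assumptions, not taken from the thesis. In DIGS the weights are decision variables chosen by the MILP solver; here they are simply given, so the sketch shows only how an activity value and a misallocation count would be evaluated, not how DIGS optimises them.

```python
from typing import List, Sequence


def pathway_activity(expression: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Return one activity value per sample.

    expression[s][g] is the expression of member gene g in sample s;
    the activity is the weighted sum over the pathway's member genes.
    """
    return [sum(w * x for w, x in zip(weights, sample))
            for sample in expression]


def misallocated(activity: Sequence[float],
                 labels: Sequence[int],
                 threshold: float) -> int:
    """Count samples on the wrong side of the threshold.

    Assumption for this sketch: label 1 is expected above the threshold,
    label 0 at or below it.
    """
    return sum(1 for a, y in zip(activity, labels)
               if (a > threshold) != (y == 1))


# Toy data (made up): 4 samples x 3 member genes of one pathway.
expression = [
    [2.0, 1.0, 0.5],
    [1.8, 1.2, 0.4],
    [0.3, 1.1, 2.2],
    [0.2, 0.9, 2.5],
]
labels = [1, 1, 0, 0]

# Hypothetical gene weights; in DIGS these would be optimised, not fixed.
weights = [1.0, 0.0, -1.0]

activity = pathway_activity(expression, weights)
errors = misallocated(activity, labels, threshold=0.0)
print(activity, errors)
```

With these toy weights the two classes fall on opposite sides of the threshold, so no sample is misallocated. An optimisation model of the kind described would search over the weights to reach that outcome for as many samples as possible.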

Comparative analysis shows that the DIGS model outperforms other up-to-date methods on three pathway activity evaluation metrics: classification accuracy, robustness against noisy data and accuracy of patient survival outcome prediction.

Next, the model is improved to form a more efficient MILP model (DIGS2). This model avoids the large number of binary decision variables in the original model and is thus easier to solve to global optimality. The assessment of the DIGS2 model on two RNA-Seq datasets shows improvements in solution quality and better performance on the evaluation metrics compared with other pathway activity inference methods.

These models contribute strongly to identifying disease-relevant pathways and genes, which are also supported by relevant findings in the literature.

Finally, the effectiveness of the proposed MILP models is explored on the noisier and sparser scRNA-Seq data. In addition to classification performance, and in line with current research interests for this type of data, this work emphasises the clustering ability of pathway activity values, examining whether they can cluster cells of the same label through dimension reduction methods. A comparison with methods from the literature shows that the proposed method achieves competitive results in separating the cells.

Overall, this thesis demonstrates that the flexible nature of mathematical programming lends itself well to developing solution procedures for pathway activity inference. The evaluation metrics show that the proposed methods outperform other methods from the literature and provide explainable means of modelling. The proposed methods also show potential to reveal meaningful biological interpretations for complex diseases such as cancer.
Date of Award: 1 Nov 2023
Original language: English
Awarding Institution
  • King's College London
Supervisor: Sophia Tsoka (Supervisor) & Konstantinos Theofilatos (Supervisor)
