Machine learning for translational medicine

Student thesis: Doctoral Thesis, Doctor of Philosophy


In biomedical sciences, the increasing amount of available high-throughput data brings many challenges. The collection of such data usually results in a large number of predictor genes and few samples, possibly also with high noise levels. Such problems are associated with the so-called "curse of dimensionality", i.e. the small n, large p problem. Therefore, the development and application of computational protocols in bioinformatics is necessary to tackle these problems, translate knowledge discovery from genome-scale studies and infer new knowledge by combining the different types of post-genomics data. Data mining methods, including machine learning approaches, aim to identify patterns in high-throughput data and extract information about the underlying biological interactions.

The research questions discussed in this thesis are disease stratification, biomarker discovery, network inference and data integration. The methodological contributions focus on a problem that clinicians currently encounter: patients who appear to have the same disease may not respond to the same treatment. First, using supervised and unsupervised learning techniques, a machine learning strategy based on ensembles of decision trees was used to define sub-phenotypes based on gene expression patterns and generate potential biomarkers for disease progression. Second, we developed a network inference algorithm (NetCFS) that uses feature selection to select a number of genes (n) that are highly correlated with the phenotype of interest so as to generate n different regression problems. Third, a "top-down" approach was implemented where gene sets corresponding to biochemical pathways are used to develop a disease classification framework. A multi-stage procedure was developed to uncover functional modules that are closely associated with the phenotype of interest and relevant to disease pathology. Phenotype-Responsive Genes (PRGs) are identified based on non-overlapping constraints of the classification procedure, and association rules are used to estimate the activity level of each pathway.
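
The second contribution can be illustrated with a minimal sketch. This is not the NetCFS algorithm itself, whose details are not given here; it only assumes a generic version of the idea: rank genes by absolute Pearson correlation with the phenotype, keep the top n, and then fit one least-squares regression per selected gene on the remaining selected genes. The function names (`select_top_genes`, `infer_network`), the correlation-based ranking, the use of ordinary least squares and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def select_top_genes(X, y, n):
    """Return indices of the n genes (columns of X) with the largest
    absolute Pearson correlation with the phenotype vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Guard against constant columns, whose correlation is set to 0.
    corr = (Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return np.argsort(-np.abs(corr))[:n]

def infer_network(X, y, n):
    """Solve n regression problems, one per selected gene, and store the
    coefficients as directed edge weights W[k, j] (gene k -> gene j)."""
    genes = select_top_genes(X, y, n)
    Z = X[:, genes]
    Z = Z - Z.mean(axis=0)  # centre so that no intercept is needed
    W = np.zeros((n, n))
    for j in range(n):
        others = [k for k in range(n) if k != j]
        coef, *_ = np.linalg.lstsq(Z[:, others], Z[:, j], rcond=None)
        W[others, j] = coef
    return genes, W

# Synthetic example (hypothetical data): 200 samples, 50 genes,
# phenotype driven by genes 3 and 7 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, 3] + X[:, 7] + 0.1 * rng.normal(size=200)
genes, W = infer_network(X, y, n=5)
```

Here `genes` holds the indices of the selected genes and `W` the fitted regression coefficients, with a zero diagonal because no gene is regressed on itself. In a real small n, large p setting a regularised regression would likely be preferable to plain least squares.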

Applications discussed in this thesis include skin inflammation, where an integrative approach combining clinically relevant in vivo models with molecular network analysis was developed to infer disease biomarkers and to translate the rapidly growing body of data into knowledge usable at the bedside. Other disease cases studied involve cancer analyses that illustrate contributions in systems medicine. Overall, this thesis presents methodological contributions on predictive models based on machine learning techniques and mathematical programming, together with relevant insights into disease mechanisms and potential treatment options.
Date of Award: 2013
Original language: English
Awarding Institution: King's College London
Supervisors: Sophia Tsoka, Frank Nestle & Costas Iliopoulos
