Mutators in cancer: an integrative bioinformatics analysis. APOBEC3 cytidine deaminases as a case study

Student thesis: Doctoral Thesis, Doctor of Philosophy


It is well established that mutations cause cancer; however, we have only recently begun to understand how certain endogenous proteins (“mutators”) might cause these mutations, and what the cellular consequences of these mutations are. In this thesis I use an integrative bioinformatics approach to understand how mutators cause mutations in cancer, and the impact these mutations have on tumour progression. I focus on the APOBEC3 (A3) family of cytidine deaminases, which have recently been found to be an important source of mutations in cancers, and are associated with a distinct mutational signature in the genomes of many cancer types. The extraction of mutational signatures is a prime example of deriving representative patterns from biological data. I therefore explore the concept of “signatures” in biology, and their applications in mining biological sequences, structures and networks. I discuss various computational approaches that are instrumental in analysing large volumes of biological data and extracting signatures, and conclude with guiding principles for reproducible and robust signature discovery.
The bioinformatics analyses presented in this thesis aim to map the mechanisms by which mutators like A3 are activated and select their substrates, as well as the consequences of the mutations that they generate. In the first analysis, I use transcriptomic data to annotate the functional context in which APOBEC3 genes are activated, and extract gene co-expression patterns associated with each member of the human APOBEC3 family in tumours, cancer cell lines and normal samples across tissue types. These analyses reveal that APOBEC3B (A3B) expression correlates with cell cycle and DNA repair genes, whereas the other APOBEC3 members display specificity for immune processes and immune cell populations. I have developed a pipeline called RESPECTEx (REconstituting SPecific Cell-Type Expression) to deconvolute bulk tumour gene expression data into contributions from the different cell types in the heterogeneous cell mixture obtained from bulk tumours. I have also generated a web-based interactive tool to browse the functional annotation data generated from this analysis. This work offers molecular insights into the substantial functional differences that could exist between highly similar paralogous mutators.
The second analysis focuses on missense mutations, and aims to understand the functional consequences of these mutations in cancer by mapping them to protein structural information. A robust statistical method has been applied to quantify the enrichment of variants in individual proteins, tumour samples and tumour cohorts. This analysis shows that the localisation of missense variants to protein cores, surfaces or interacting interfaces clearly segregates driver mutations from passengers, and oncogenes from tumour suppressors. I have also considered somatic variant data taken from different cellular states, from induced pluripotent stem cells (iPSCs) growing in culture to advanced, metastatic tumours, and observed a gradual expansion of pathway perturbation by missense variants as cells become more cancerous. This work demonstrates that tumours continually optimise their mutation acquisition programme, perturbing the appropriate proteins and protein structural regions as required in different cellular states. Knowledge of the strategies of somatic mutation acquisition could improve therapeutic interventions for controlling tumour growth.
The last part of the thesis zooms in to study the structural details of how mutators accommodate their nucleic acid substrates. I present a new method called Protein Interface Fingerprinting using Subgraphs (PIFS), which analyses a user-specified protein interface by encoding it as a residue network graph. These graph-based representations capture subtle features in regions related to substrate recognition and oligomerisation of the APOBEC3 cytidine deaminase (CDA) domains. I have used in silico modelling approaches to examine the interactions between CDA domains and DNA substrates of different sequences and topologies. I have developed methods for extracting and comparing features from these graphs. Network graphs are an effective way to rationalise, both visually and statistically, conformational differences across CDA domains, and could be applied more generally to different protein structures, potentially generating graph-based “signatures” of protein interfaces.
Overall, this thesis attempts to characterise the mechanisms of mutators at work from the cellular, “systems” scale to the microscopic level of protein structural regions. Throughout this work a series of bioinformatics methods and tools was also generated, which could be applied in other systems and contexts, improving the understanding of the general principles of protein-DNA interactions, and of the broader question of how mutations impact on cellular fitness.
Date of Award: 1 Mar 2020
Original language: English
Awarding Institution
  • King's College London
Supervisors: Franca Fraternali & Michael Malim
