New statistical methodologies for improved analysis of genomic and omic data

  • Najla Saad R. Elhezzani

Student thesis: Doctoral Thesis (Doctor of Philosophy)


We develop statistical tools for analyzing different types of phenotypic data in genome-wide settings. When the phenotype of interest is a binary case-control status, most genome-wide association studies (GWASs) use randomly selected samples from the population (hereafter bases) as the control set. This approach works well when the trait of interest is very rare; otherwise, a loss of statistical power to detect disease-associated variants is expected, because a non-negligible fraction of the bases are themselves cases. To address this, we propose a joint analysis of the three types of samples: cases, bases and controls. This is done by modeling the bases as a mixture of multinomial logistic functions of cases and controls, weighted according to disease prevalence.
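The mixture idea can be illustrated with a minimal sketch. It assumes only that bases are a random population sample, so that their genotype distribution is the prevalence-weighted average of the case and control distributions. The function name and the genotype frequencies below are hypothetical, and the sketch leaves out the multinomial logistic parameterization used in the thesis.

```python
def base_genotype_freqs(case_freqs, control_freqs, prevalence):
    """Genotype frequencies expected in a random population sample (bases).

    Assumes the population is a mixture of cases (weight = prevalence)
    and controls (weight = 1 - prevalence).
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    return [prevalence * p + (1.0 - prevalence) * q
            for p, q in zip(case_freqs, control_freqs)]


# Hypothetical frequencies for 0, 1 and 2 copies of a risk allele.
case_freqs = [0.40, 0.45, 0.15]
control_freqs = [0.50, 0.40, 0.10]
base_freqs = base_genotype_freqs(case_freqs, control_freqs, prevalence=0.2)
```

As the prevalence approaches zero, the base frequencies approach the control frequencies, which is why bases serve well as controls for very rare traits.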
In a typical GWAS, where thousands of single-nucleotide polymorphisms (SNPs) are available for testing, score-based test statistics are ideal. Other tests of association, such as the Wald and likelihood ratio tests, are known to be asymptotically equivalent to the score test; however, their performance under small sample sizes can vary significantly. To allow these tests to be compared under the proposed case-base-control (CBC) design, we provide an estimation procedure using the maximum likelihood (ML) method along with the expectation-maximization (EM) algorithm. Simulations show that combining the three samples can increase the power to detect disease-associated variants, though a very large base sample set can compensate for a lack of controls.
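To make the role of EM concrete, here is a simplified sketch, not the estimation procedure of the thesis. It assumes a known prevalence and a saturated multinomial model for genotype frequencies in cases and controls, and it treats the unobserved disease status of each base sample as the missing data. The function name `em_cbc` and all counts are hypothetical.

```python
def em_cbc(case_counts, control_counts, base_counts, prevalence, n_iter=200):
    """EM estimates of case and control genotype frequencies (p, q).

    Bases are modeled as a mixture: prevalence * p + (1 - prevalence) * q.
    """
    k = len(case_counts)
    p = [1.0 / k] * k  # case genotype frequencies, uniform start
    q = [1.0 / k] * k  # control genotype frequencies, uniform start

    def normalize(values):
        total = sum(values)
        return [v / total for v in values] if total > 0 else [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability that a base sample with
        # genotype g is an unobserved case.
        w = []
        for g in range(k):
            denom = prevalence * p[g] + (1.0 - prevalence) * q[g]
            w.append(prevalence * p[g] / denom if denom > 0 else 0.0)
        # M-step: add the expected base contributions to the observed counts.
        p = normalize([case_counts[g] + base_counts[g] * w[g] for g in range(k)])
        q = normalize([control_counts[g] + base_counts[g] * (1.0 - w[g])
                       for g in range(k)])
    return p, q


# Hypothetical genotype counts for 0, 1 and 2 copies of a risk allele.
p_hat, q_hat = em_cbc([40, 45, 15], [50, 40, 10], [480, 410, 110], prevalence=0.2)
```

With no base samples, the estimates reduce to the observed case and control proportions; the bases contribute information only through their expected split between the two groups.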
In the second part of the thesis, we consider a joint analysis of genome-wide SNPs and multiple phenotypes, with a focus on the challenges they present in the estimation of SNP heritability. The current standard for performing this task is fitting a variance component model, despite its tendency to produce boundary estimates when small sample sizes are used. We propose a Bayesian covariance component model (BCCM) that takes into account genetic correlation among phenotypes and genetic correlation among individuals. The use of Bayesian methods allows us to circumvent some issues related to small sample sizes, mainly overfitting and boundary estimates. Using gene expression pathways, we demonstrate a significant improvement in SNP heritability estimates over univariate and ML-based methods, thus explaining why recent progress in eQTL identification has been limited. This work was published as an article in the European Journal of Human Genetics.
In the third part of the thesis, we study the prospects of using the proposed BCCM for phenotype prediction. Results from real data show that the ML-based methods and the proposed Bayesian method achieve consistent accuracy when effect sizes are estimated using their posterior mode. We also note that an initial imputation step increases the predictive accuracy.
Date of Award: 2018
Original language: English
Awarding Institution
  • King's College London
