Gaussian process regression models for the analysis of survival data with competing risks, interval censoring and high dimensionality

Student thesis: Doctoral Thesis, Doctor of Philosophy


We develop novel statistical methods for analysing biomedical survival data based on Gaussian process (GP) regression. GP regression provides a powerful non-parametric probabilistic method of relating inputs to outputs. We apply this to survival data, which consist of time-to-event and covariate measurements. In the context of GP regression, the covariates are regarded as 'inputs' and the event times as 'outputs'. This allows for highly flexible inference of non-linear relationships between covariates and event times.
Many existing methods for analysing survival data, such as the ubiquitous Cox proportional hazards model, focus primarily on the hazard rate, which is typically assumed to take some parametric or semi-parametric form. Our proposed model belongs to the class of accelerated failure time models and as such our focus is on directly characterising the relationship between the covariates and event times without any explicit assumptions on what form the hazard rates take. This provides a more direct route to connecting the covariates to survival outcomes with minimal assumptions. An application of our model to experimental data illustrates its usefulness.
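As an illustration of treating covariates as inputs and event times as outputs, the following is a minimal sketch of standard GP regression on log event times. It is not the thesis model: it assumes fully observed (uncensored) event times, a squared-exponential kernel, fixed hyperparameters and synthetic data, and the names `rbf_kernel` and `gp_predict` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of a and b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_predict(x_train, y_train, x_test, noise=0.1, **kern):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    k = rbf_kernel(x_train, x_train, **kern) + noise**2 * np.eye(len(x_train))
    k_s = rbf_kernel(x_train, x_test, **kern)
    k_ss = rbf_kernel(x_test, x_test, **kern)
    chol = np.linalg.cholesky(k)                      # k = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_s)
    mean = k_s.T @ alpha
    var = np.diag(k_ss) - np.sum(v**2, axis=0)        # latent-function variance
    return mean, var

# Synthetic data (assumption): one covariate, non-linear effect on log time,
# and no censoring.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(40, 1))
log_t = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(40)

x_new = np.linspace(-3, 3, 5)[:, None]
offset = log_t.mean()                                 # centre the outputs
mean, var = gp_predict(x, log_t - offset, x_new, noise=0.1)
pred_log_t = mean + offset
median_time = np.exp(pred_log_t)                      # median under a log-normal model
```

Modelling the logarithm of the event time keeps predicted times positive and matches the accelerated failure time view, in which covariates act on the time scale directly. Handling censored observations would require changing the likelihood, which this sketch does not attempt.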
We then apply multiple output GP regression, which can handle multiple potentially correlated outputs for each input, to competing risks survival data where multiple event types can occur. In this case the multiple outputs correspond to the time-to-event for each risk. By tuning one of the model parameters we can control the extent to which the multiple outputs are dependent, thus allowing the specification of correlated risks. However, the identifiability problem, which states that it is not possible to infer whether risks are truly independent or otherwise on the basis of observed data, still holds. In spite of this fundamental limitation, simulation studies suggest that in some cases assuming dependence can lead to more accurate predictions.
The second part of this thesis is concerned with high dimensional survival data where there are a large number of covariates compared to relatively few individuals. This leads to the problem of overfitting, where spurious relationships are inferred from the data. One strategy to tackle this problem is dimensionality reduction. The Gaussian process latent variable model (GPLVM) is a powerful method of extracting a low dimensional representation of high dimensional data. We extend the GPLVM to incorporate survival outcomes by combining the model with a Weibull proportional hazards model (WPHM). By reducing the ratio of covariates to samples we hope to diminish the effects of overfitting.
The combined GPLVM-WPHM model can also be used to combine several datasets by simultaneously expressing them in terms of the same low dimensional latent variables. We construct the Laplace approximation of the marginal likelihood and use this to determine the optimal number of latent variables, thereby allowing detection of intrinsic low dimensional structure. Results from both simulated and real data show a reduction in overfitting and an increase in predictive accuracy after dimensionality reduction.
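For reference, the generic form of a Laplace approximation to a marginal likelihood can be written as below. The symbols are assumptions introduced here for illustration and are not taken from the thesis: $D$ denotes the data, $\theta$ the $d$ quantities being integrated out, and $\hat{\theta}$ the mode of the joint density. The abstract does not state which quantities are integrated out in the GPLVM-WPHM model.

```latex
\log p(D)
  = \log \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta
  \approx \log p(D \mid \hat{\theta}) + \log p(\hat{\theta})
    + \frac{d}{2} \log 2\pi - \frac{1}{2} \log \det H,
\qquad
H = - \nabla_{\theta} \nabla_{\theta}^{\top}
    \log \bigl[ p(D \mid \theta)\, p(\theta) \bigr] \Big|_{\theta = \hat{\theta}}
```

Under this kind of approximation, candidate numbers of latent variables can be compared by evaluating the approximate $\log p(D)$ for each candidate and preferring the largest value.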
Date of Award: 2015
Original language: English
Awarding Institution
  • King's College London
Supervisor: Reimer Kuhn (Supervisor) & Ton Coolen (Supervisor)
