King's College London

Research portal

Replica analysis of overfitting in regression models for time-to-event data

Research output: Contribution to journal (Article)

ACC Coolen, JE Barrett, P Paga, CJ Perez-Vicente

Original language: English
Number of pages: 37
Journal: Journal of Physics A and EPL (Europhysics Letters)
Early online date: 20 Jul 2017
Accepted/In press: 20 Jul 2017
E-pub ahead of print: 20 Jul 2017
Published: 17 Aug 2017

Documents

  • Replica analysis of overfitting_COOLEN_Accepted20July2017_GREEN AAM

    Replica_analysis_of_overfitting_COOLEN_Accepted20July2017_GREEN_AAM.pdf, 5.09 MB, application/pdf

    Uploaded date: 21 Jul 2017

    Version: Accepted author manuscript

    Licence: CC BY-NC-ND

    This is an author-created, un-copyedited version of an article accepted for publication in Journal of Physics A and EPL. IOP Publishing Ltd is not responsible for any errors or omissions in this version of the manuscript or any version derived from it.

Abstract

Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitting, and even for Cox's proportional hazards model (the main tool of medical statisticians), one finds in the literature only rules of thumb on the number of samples required to avoid overfitting.
In this paper we present a mathematical theory of overfitting in regression models for time-to-event data, which aims to increase our quantitative understanding of the problem and provide practical tools with which to correct regression outcomes for the impact of overfitting.
It is based on the replica method, a statistical mechanical technique for the analysis of heterogeneous many-variable systems that has been used successfully for several decades in physics, biology, and computer science, but not yet in medical statistics. We develop the theory initially for arbitrary regression models for time-to-event data, and verify its predictions in detail for the popular Cox model.
