Abstract
The message from medicine is clear. We are in possession of vast amounts of data from sources such as electron microscopes, magnetic resonance imaging and DNA microarray technology. The aim is to translate this into a world of faster drug discovery, more accurate automated diagnosis and early warnings of impending disease. Statistics, and in particular statistical inference, has a key role to play in providing a principled approach to this task.

In practice, solid theoretical underpinnings are often replaced with heuristics suited only to the dataset at hand. Traditional statistical tools are often ill-equipped to cope with the structure of this new flood of data. The need for methodological improvements presents a major opportunity to advance the medical sciences.

The scientific method relies on carefully designed experiments, typically collecting N samples each with a small number p of measurements, in order to test a hypothesis or discover a causal relationship. When p is much smaller than N, existing statistical methods are satisfactory for this purpose. Today, however, the approach is quite different. Extensive quantities of data are available owing to the decreasing cost of high-throughput measurement devices and data storage, alongside rapidly increasing computational power. Examples from biomedicine include genetic and epigenetic data, as well as the growing availability of real-time health information from wearable devices and hospital intensive care units. In these cases, the data dimension p can be comparable to the number of samples N. The task is now to look for patterns or correlations in these data without first providing a plausible hypothesis. The response to this problem has been to search for a lower-dimensional representation of the data. To achieve this, methods of variable selection, regularization or projection are routinely used.
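As a minimal illustration of the kind of regularization referred to above (not a method taken from the thesis), the sketch below fits a ridge-penalized regression with numpy in a setting where p is comparable to N; the penalty strength and data-generating choices are assumptions made purely for the example.

```python
# Illustrative sketch: ridge regularization as one standard way to stabilise
# a fit when the number of variables p is comparable to the number of samples N.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 80                      # p comparable to N (assumed values)
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                 # only a few variables truly matter
y = X @ beta_true + rng.standard_normal(N)

lam = 10.0                          # regularization strength (assumed value)
# Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The penalty shrinks the many irrelevant coefficients towards zero,
# giving an effectively lower-dimensional description of the data.
print(np.round(beta_ridge[:8], 2))
```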
Given its wide-reaching implications for biomedical data studies, we examine the phenomenon of overfitting, one of the primary challenges of high-dimensional inference, from two different perspectives: statistical physics in Part I and Bayesian inference in Part II. In the first approach, we consider all variables of the dataset and investigate the inference outcomes in the regime where p is comparable to N. Surprisingly, for a family of models, we find systematic and reproducible effects that are a function of the ratio p/N.
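The sketch below is not one of the models studied in the thesis, but it illustrates the generic kind of p/N-dependent effect in question: for pure-noise data, the in-sample fit quality of an unregularized least-squares regression grows systematically with the ratio p/N (the expected training R² is roughly p/N), a textbook instance of overfitting.

```python
# Illustrative sketch: apparent fit quality of ordinary least squares on
# pure-noise data grows systematically with the ratio p/N.
import numpy as np

rng = np.random.default_rng(1)
N = 200
for p in (10, 50, 100, 150, 190):
    X = rng.standard_normal((N, p))
    y = rng.standard_normal(N)                 # response unrelated to X
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / (y @ y)       # R^2 without intercept
    print(f"p/N = {p/N:.2f}  training R^2 = {r2:.2f}")   # roughly p/N
```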
Date of Award | 1 Jul 2020
---|---
Original language | English
Awarding Institution |
Supervisor | Ton Coolen (Supervisor)