## Abstract

Background

Evaluating gene interaction models in cancer genomics is challenging because the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions – approaches that are susceptible to misrepresentation and incompleteness, respectively. The objective of this analysis is to compare the performance of LASSO, elastic net (L_1 L_2 penalisation), best-subset selection, L_0 L_1 penalisation and L_0 L_2 penalisation in real cancer genomic data using purely data-driven validation.
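
For orientation, the penalties named above are commonly written as follows for a linear model with response y, design matrix X and coefficient vector β. This is the standard textbook form, not necessarily the exact parameterisation used in this analysis; the tuning parameters λ_0, λ_1, λ_2 and the subset size k are notation assumed here.

$$
\begin{aligned}
\text{LASSO:} \quad & \min_{\beta} \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 \\
\text{Elastic net } (L_1 L_2)\text{:} \quad & \min_{\beta} \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \\
\text{Best subset:} \quad & \min_{\beta} \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 \quad \text{subject to } \lVert \beta \rVert_0 \le k \\
L_0 L_1\text{:} \quad & \min_{\beta} \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda_0 \lVert \beta \rVert_0 + \lambda_1 \lVert \beta \rVert_1 \\
L_0 L_2\text{:} \quad & \min_{\beta} \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda_0 \lVert \beta \rVert_0 + \lambda_2 \lVert \beta \rVert_2^2
\end{aligned}
$$

Here ‖β‖_0 counts the non-zero coefficients, so the L_0 term penalises model size directly, while the L_1 and L_2 terms shrink coefficient magnitudes.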

Methods

Five large (n≈4000) genomic datasets were extracted from the Gene Expression Omnibus. “Gold-standard” regression models were trained on subspaces of these datasets (n≈4000, p=500). Penalised regression models were trained on small samples from these subspaces (n∈{25,75,150}, p=500) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed.
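
The sampling design can be sketched roughly as below. This is a minimal illustration, not the authors' code: the function name `draw_subsamples`, the number of replicates per sample size and the seed are all hypothetical, and the abstract does not say how many small samples were drawn.

```python
import random


def draw_subsamples(n_rows, sizes=(25, 75, 150), repeats=3, seed=0):
    """Yield (n, replicate, row_indices) for small training samples.

    Rows are drawn without replacement from a full dataset of n_rows rows.
    `repeats` and `seed` are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    for n in sizes:
        for replicate in range(repeats):
            yield n, replicate, sorted(rng.sample(range(n_rows), n))


# Row indices for small samples drawn from a dataset of roughly 4000 rows.
subsamples = list(draw_subsamples(n_rows=4000))
```

Each index set would select the rows used to fit a penalised model, whose coefficients are then compared with those of the gold-standard model fitted on the full subspace.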

Results

L_1 L_2 penalisation achieved the highest cosine similarity between estimated coefficients and those of gold-standard models. L_0 L_2-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal:noise conditions. L_0 L_2 also attained the highest overall median variable selection F1 score.
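
The three metrics reported above can be computed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, and treating every non-zero gold-standard coefficient as a "true" variable is an assumption, since the abstract does not say how the reference support was defined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two coefficient vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def selection_f1(beta_hat, beta_gold, tol=0.0):
    """F1 score of the selected variables against the gold-standard support.

    Assumption: a variable counts as selected when |coefficient| > tol.
    """
    selected = {j for j, b in enumerate(beta_hat) if abs(b) > tol}
    reference = {j for j, b in enumerate(beta_gold) if abs(b) > tol}
    true_positives = len(selected & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def r_squared(y_true, y_pred):
    """Proportion of variance in test responses explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Toy coefficients, invented purely for illustration.
beta_gold = [1.0, 0.0, -2.0, 0.0]
beta_hat = [0.8, 0.0, -1.5, 0.3]
similarity = cosine_similarity(beta_hat, beta_gold)
f1 = selection_f1(beta_hat, beta_gold)
```

In the toy example the estimate selects both true variables plus one spurious one, so recall is 1 and precision is 2/3.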

Conclusions

This analysis explores a novel approach to comparing model selection methods in real genomic data from five cancer types. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of L_0 L_2 penalisation for structural selection and L_1 L_2 penalisation for coefficient recovery in genomic data.

| Original language | English |
|---|---|
| Journal | Cancer Informatics |
| Early online date | 9 Oct 2021 |
| DOIs | |
| Publication status | Published - 21 Nov 2021 |