Time Machine: Generative Real-Time Model For Failure (and Lead Time) Prediction in HPC Systems

Khalid Ayed Alharthi, Arshad Jhumka, Sheng Di, Lin Gui, Franck Cappello, Simon McIntosh-Smith

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

High Performance Computing (HPC) systems generate a large amount of unstructured, alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often undergo failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale in terms of both accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, unlike existing work and inspired by current transformer-based neural networks, which have revolutionized sequential learning in natural language processing (NLP) tasks, we propose a novel, scalable, log-based, self-supervised model (i.e., with no need for manual labels), called Time Machine, that predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine is designed by combining two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses the failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The second stack addresses the lead time to the predicted failure. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related work on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, while also achieving higher prediction speed.
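The second step of the first stack, checking whether a predicted event sequence contains a failure event, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `first_predicted_failure`, the event identifiers, and the set of failure events are all hypothetical, and the sequence is assumed to have already been produced by some event predictor.

```python
from typing import Optional, Sequence, Set, Tuple


def first_predicted_failure(
    predicted_events: Sequence[str],
    failure_events: Set[str],
) -> Optional[Tuple[int, str]]:
    """Return (position, event) for the first failure event in a predicted
    sequence, or None if the sequence contains no failure event.

    Both arguments are hypothetical inputs: predicted_events stands in for
    the output of an event-sequence predictor, and failure_events for the
    set of event identifiers that are known to denote failures.
    """
    for position, event in enumerate(predicted_events):
        if event in failure_events:
            return position, event
    return None


# Hypothetical event identifiers, for illustration only.
predicted = ["E12", "E7", "E40", "E3"]
known_failures = {"E40", "E99"}
result = first_predicted_failure(predicted, known_failures)
```

Here `result` is `(2, "E40")`. The position is only an index into the predicted sequence, not a lead time; according to the abstract, lead time is estimated separately by the second stack.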
Original language: English
Title of host publication: The 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Publisher: IEEE Press
Publication status: Published - 27 Jun 2023
