AI and the Sacred Disease – The Opportunities of Electronic Patient Records and Natural Language Processing to Advance Epilepsy Care and Beyond

Student thesis: Doctoral Thesis, Doctor of Philosophy


Patients with epilepsy frequently visit healthcare providers, making routinely collected data an ideal source of information about their health status. Each interaction between a patient and a healthcare provider is recorded in the Electronic Patient Record (EPR). These data range from documentation of diagnoses and symptoms to procedures, prescriptions, and tests. This granularity makes EPRs well suited to identifying disease patterns and providing evidence for the optimisation of health management. However, beyond direct patient care, the use of EPR data for secondary purposes such as research or service development has so far been limited. This is because the majority of clinically valuable information is contained within unstructured, free-text documentation. In this form, the lack of format requirements permits homonyms, spelling mistakes, abbreviations, and legacy epilepsy classification terminology. For this information to be extracted and grouped for downstream machine-interpretable analytics, these details must be standardised into distinct areas of meaning.
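
As a minimal sketch of what standardising free-text mentions can look like, the Python snippet below maps abbreviations, legacy terms, and misspellings onto a single label. The lookup table, the labels, and the `normalise` helper are hypothetical and purely illustrative; they are not the method used in this thesis, and a production system would link to a standardised terminology rather than to ad hoc strings.

```python
import difflib

# Hypothetical lookup table: surface form -> illustrative concept label.
# Covers an abbreviation ("gtcs") and legacy terms ("grand mal", "petit mal").
SURFACE_TO_CONCEPT = {
    "generalised tonic-clonic seizure": "generalised tonic-clonic seizure",
    "gtcs": "generalised tonic-clonic seizure",
    "grand mal": "generalised tonic-clonic seizure",
    "absence seizure": "absence seizure",
    "petit mal": "absence seizure",
}


def normalise(mention, cutoff=0.8):
    """Return the concept label for a free-text mention, or None if unknown."""
    # Lower-case and collapse whitespace so trivial variants compare equal.
    key = " ".join(mention.lower().split())
    if key in SURFACE_TO_CONCEPT:
        return SURFACE_TO_CONCEPT[key]
    # Fall back to fuzzy string matching to absorb spelling mistakes.
    close = difflib.get_close_matches(
        key, list(SURFACE_TO_CONCEPT), n=1, cutoff=cutoff
    )
    return SURFACE_TO_CONCEPT[close[0]] if close else None
```

A lookup of this kind can absorb abbreviations, legacy terms, and simple misspellings, but it cannot resolve homonyms, because the same surface form may need a different label depending on its context.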

This thesis aims to provide fresh insight into correlations between patient care pathways and specific outcomes of people with epilepsy, at a scale not previously possible with traditional study designs or manually created data sources. This is done by centralising, capturing, and aggregating large amounts of structured and unstructured clinical narratives from heterogeneous textual data sources, i.e. different types of medical and administrative records from a secondary care setting – King’s College Hospital NHS Foundation Trust (KCH) – using an information retrieval and extraction platform called CogStack.

We begin by exploring patients with a first suspected seizure and how they were managed at a time when there was no formal first-fit care pathway. We found that approximately half of cases were not managed in keeping with NICE guidelines. Using rules-based natural language processing (NLP) techniques, we found that the most commonly reached diagnoses after a first suspected seizure event were epilepsy, cardiovascular disorders, and a single unknown episode.
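
To illustrate what a rules-based NLP approach can look like in general, the Python sketch below applies keyword patterns sentence by sentence and skips mentions preceded by a negation cue. The patterns, the negation cues, and the `extract_diagnoses` helper are assumptions made for illustration only; the actual rule set used in this thesis is not described here.

```python
import re

# Illustrative rules only: diagnosis label -> pattern for mentions of it.
DIAGNOSIS_RULES = {
    "epilepsy": re.compile(r"\bepilep(?:sy|tic)\b", re.IGNORECASE),
    "cardiovascular disorder": re.compile(
        r"\b(?:arrhythmia|cardiac syncope)\b", re.IGNORECASE
    ),
}

# Hypothetical negation cues, checked in the text before a mention.
NEGATION = re.compile(r"\b(?:no|not|without|unlikely)\b", re.IGNORECASE)


def extract_diagnoses(text):
    """Return labels whose first mention in a sentence is not negated."""
    found = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for label, pattern in DIAGNOSIS_RULES.items():
            match = pattern.search(sentence)
            # Keep the label only if no negation cue precedes the mention.
            if match and not NEGATION.search(sentence[: match.start()]):
                found.append(label)
    return found
```

Rules of this kind are transparent and easy to audit, but every new phrasing, abbreviation, or misspelling needs its own pattern.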

This thesis then describes the development of MedCAT, an NLP tool that uses a novel self-supervised machine learning approach to identify clinical information and link it to clinical concepts from a standardised terminology. Real-world validation demonstrates accurate SNOMED-CT concept extraction from different EPR vendor systems at 3 large London hospitals, with self-supervised training over 8.8 billion words from >17 million clinical records and a further supervised fine-tuning step with ~6,000 examples annotated by clinicians or domain experts. We discuss the transferability of models and show that performance is retained across different hospitals and datasets, allowing the re-use and site-specific fine-tuning of models.

We then demonstrate the application of the CogStack-MedCAT pipeline to gather a large retrospective epilepsy cohort of 4,011 patients. These patients were then subgrouped based on demographics, epilepsy type, and comorbidity to provide insights into healthcare service utilisation, anti-seizure medication treatment patterns, and health outcomes. At KCH, the largest group of patients was classified under an unknown epilepsy type (43.3%), followed by focal (37.3%), generalised (15.5%), and lastly combined generalised and focal epilepsy (3.9%). In terms of anti-seizure medication prescription patterns, we found that levetiracetam was the most popular monotherapy (17.3%) and a combination of levetiracetam + lamotrigine was the most popular polytherapy (6.5%). Additionally, this study determined that polytherapy was associated with a greater prevalence of idiopathic/associated symptoms, including anxiety, depression, headache, dizziness, rash, nausea, constipation, and diarrhoea. Lastly, we showcase and discuss the generalisability of these tools to other UK hospital sites and medical conditions, indicating cross-domain, EPR-agnostic utility for accelerated clinical and research use cases.

In conclusion, automated information extraction tools applied to routinely collected data can be used not only for service demand monitoring and improvement but also to enhance epilepsy research. This project unified information across different clinical document types on a scale much larger than previous studies. This had not previously been done, and it allowed the project to sub-group people with epilepsy based upon demographics and clinical manifestations, and to provide insight into treatment patterns and optimal patient health outcomes.
Date of Award: 1 Oct 2022
Original language: English
Awarding Institution
  • King's College London
Supervisors: Mark Richardson & James Teo