Estimating redundancy in clinical text

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)


The current mode of use of Electronic Health Records (EHR) elicits text redundancy. Clinicians often populate new documents by duplicating existing notes and then updating them accordingly. Data duplication can lead to propagation of errors, inconsistencies and misreporting of care. Therefore, measures to quantify information redundancy play an essential role in evaluating innovations that operate on clinical narratives. This work is a quantitative examination of information redundancy in EHR notes. We present and evaluate two methods to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model. Our first measure trains large Transformer-based language models using clinical text from a large openly available US-based ICU dataset and a large multi-site UK-based hospital. By comparing the information-theoretic efficient encoding of clinical text against open-domain corpora, we find that clinical text is ∼1.5× to ∼3× less efficient than open-domain corpora at conveying information. Our second measure applies the automated summarisation metrics ROUGE and BERTScore to successive note pairs, demonstrating lexicosyntactic and semantic redundancy, with averages from ∼43% to ∼65%.
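
To make the second measure more concrete, the following is a minimal Python sketch of scoring successive note pairs with a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the authors' implementation: whitespace tokenisation, the function names `rouge1_f1` and `successive_redundancy`, and the example notes are all assumptions made for illustration. The study itself uses the full ROUGE and BERTScore metrics.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two texts (a simplified ROUGE-1).

    Assumption: lowercased whitespace tokenisation, no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def successive_redundancy(notes):
    """Score each note against the note that immediately precedes it."""
    return [rouge1_f1(prev, curr) for prev, curr in zip(notes, notes[1:])]


# Hypothetical notes for one patient, in chronological order.
notes = [
    "patient stable overnight no acute events",
    "patient stable overnight no acute events plan discharge tomorrow",
    "discharged home",
]
scores = successive_redundancy(notes)
```

In this toy example the second note copies the first and appends three words, so the first pair scores high, while the third note shares no tokens with the second and scores zero. Averaging such pairwise scores over many patients would give a corpus-level figure comparable in kind, though not in value, to the averages reported above.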

Original language: English
Article number: 103938
Early online date: 22 Oct 2021
Publication status: Published - Dec 2021


  • Deep transfer learning for language modelling of clinical text
  • Natural language processing methods to estimate redundancy of clinical text

