TY - JOUR
T1 - Validating Transformers for Redaction of Text from Electronic Health Records in Real-World Healthcare
AU - Kraljevic, Zeljko
AU - Shek, Anthony
AU - Yeung, Joshua Au
AU - Sheldon, Ewart Jonathan
AU - Shuaib, Haris
AU - Al-Agil, Mohammad
AU - Bai, Xi
AU - Noor, Kawsar
AU - Shah, Anoop D.
AU - Dobson, Richard
AU - Teo, James
N1 - Funding Information:
R.D. is supported by the following: (1) NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, United Kingdom; (2) Health Data Research UK; (3) the NIHR University College London Hospitals Biomedical Research Centre; (4) the UK Research and Innovation London Medical Imaging and Artificial Intelligence Centre for Value Based Healthcare; and (5) the NIHR Applied Research Collaboration South London (NIHR ARC South London) at King’s College Hospital NHS Foundation Trust. ADS is supported by a postdoctoral fellowship from THIS Institute, NIHR (AI AWARD01864 and COV-LT-0009), UKRI (Horizon Europe Guarantee for DataTools4Heart) and British Heart Foundation Accelerator Award (AA/18/6/24223).
Funding Information:
This work was made possible by funding received from NHS AI Lab, National Institute for Health Research, Health Data Research UK, Guy’s & St Thomas’ NHS Foundation Trust, NIHR Applied Research Collaboration South London, Guy’s and St Thomas’ Foundation, the Radiological Research Trust, and a National Institute for Health Research Doctoral Research Fellowship (ref. 20647). Special thanks as well to the patients of the KERRI committee.
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Protecting patient privacy in healthcare records is a top priority, and redaction is a commonly used method for obscuring directly identifiable information in text. Rule-based methods have been widely used, but their precision is often low, causing over-redaction of text, and they are frequently not adaptable enough for non-standardised or unconventional structures of personal health information. Deep learning techniques have emerged as a promising solution, but implementing them in real-world environments poses challenges due to the differences in patient record structure and language across different departments, hospitals, and countries. In this study, we present AnonCAT, a transformer-based model, and a blueprint for how deidentification models can be deployed in real-world healthcare. AnonCAT was trained through a process involving manually annotated redactions of 3116 real-world documents from three UK hospitals with different electronic health record systems. The model achieved high performance in all three hospitals, with a recall of 0.99, 0.99 and 0.96. Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data and highlight the importance of building workflows which not only use these models but can also continually fine-tune and audit the performance of these algorithms to ensure continuing effectiveness in real-world settings. This approach provides a blueprint for the real-world use of deidentification algorithms through fine-tuning and localisation. The code, together with tutorials, is available on GitHub (https://github.com/CogStack/MedCAT).
AB - Protecting patient privacy in healthcare records is a top priority, and redaction is a commonly used method for obscuring directly identifiable information in text. Rule-based methods have been widely used, but their precision is often low, causing over-redaction of text, and they are frequently not adaptable enough for non-standardised or unconventional structures of personal health information. Deep learning techniques have emerged as a promising solution, but implementing them in real-world environments poses challenges due to the differences in patient record structure and language across different departments, hospitals, and countries. In this study, we present AnonCAT, a transformer-based model, and a blueprint for how deidentification models can be deployed in real-world healthcare. AnonCAT was trained through a process involving manually annotated redactions of 3116 real-world documents from three UK hospitals with different electronic health record systems. The model achieved high performance in all three hospitals, with a recall of 0.99, 0.99 and 0.96. Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data and highlight the importance of building workflows which not only use these models but can also continually fine-tune and audit the performance of these algorithms to ensure continuing effectiveness in real-world settings. This approach provides a blueprint for the real-world use of deidentification algorithms through fine-tuning and localisation. The code, together with tutorials, is available on GitHub (https://github.com/CogStack/MedCAT).
KW - electronic health records
KW - text deidentification
KW - transformers
UR - http://www.scopus.com/inward/record.url?scp=85181563493&partnerID=8YFLogxK
U2 - 10.1109/ICHI57859.2023.00098
DO - 10.1109/ICHI57859.2023.00098
M3 - Article
AN - SCOPUS:85181563493
SP - 544
EP - 549
JO - Proceedings - 2023 IEEE 11th International Conference on Healthcare Informatics, ICHI 2023
JF - Proceedings - 2023 IEEE 11th International Conference on Healthcare Informatics, ICHI 2023
T2 - 11th IEEE International Conference on Healthcare Informatics, ICHI 2023
Y2 - 26 June 2023 through 29 June 2023
ER -