TY - JOUR
T1 - Impact of hospital-specific domain adaptation on BERT-based models to classify neuroradiology reports
AU - Agarwal, Siddharth
AU - Wood, David
AU - Murray, Benjamin A. K.
AU - Wei, Yiran
AU - Al Busaidi, Ayisha
AU - Kafiabadi, Sina
AU - Guilhem, Emily
AU - Lynch, Jeremy
AU - Townend, Matthew
AU - Mazumder, Asif
AU - Barker, Gareth J.
AU - Cole, James H.
AU - Sasieni, Peter
AU - Ourselin, Sebastien
AU - Modat, Marc
AU - Booth, Thomas C.
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/3/17
Y1 - 2025/3/17
AB - Objectives: To determine the effectiveness of hospital-specific domain adaptation through masked language modelling (MLM) on the performance of BERT-based models in classifying neuroradiology reports, and to compare these models with open-source large language models (LLMs). Materials and methods: This retrospective study (2008–2019) used 126,556 and 86,032 MRI brain reports from two tertiary hospitals: King’s College Hospital (KCH) and Guy’s and St Thomas’ Trust (GSTT). Various BERT-based models, including RoBERTa, BioBERT and RadBERT, underwent MLM on unlabelled reports from these centres. The downstream tasks were binary abnormality classification and multi-label classification. Models with and without hospital-specific domain adaptation were compared against each other and against LLMs on internal (KCH) and external (GSTT) hold-out test sets. Model performances for binary classification were compared using 2-way and 1-way ANOVA. Results: All models that underwent hospital-specific domain adaptation performed better than their baseline counterparts (all p-values < 0.001). For binary classification, MLM on all available unlabelled reports (194,467 reports) yielded the highest balanced accuracies (mean ± standard deviation; KCH: 97.0 ± 0.4%, GSTT: 95.5 ± 1.0%), after which no differences between BERT-based models remained (1-way ANOVA, p-values > 0.05). There was a log-linear relationship between the number of reports and performance. Llama-3.0 70B was the best-performing LLM (KCH: 97.1%, GSTT: 94.0%). Multi-label classification demonstrated consistent performance improvements from MLM across all abnormality categories. Conclusion: Hospital-specific domain adaptation should be considered best practice when deploying BERT-based models in new clinical settings. When labelled data is scarce or unavailable, LLMs can serve as a viable alternative, provided adequate computational power is accessible. Key Points: Question: BERT-based models can classify radiology reports, but it is unclear whether there is any incremental benefit from additional hospital-specific domain adaptation. Findings: Hospital-specific domain adaptation yielded the highest BERT-based model accuracies, and performance scaled log-linearly with the number of reports. Clinical relevance: BERT-based models after hospital-specific domain adaptation achieve the best classification results, provided sufficient high-quality training labels are available. When labelled data is scarce, LLMs such as Llama-3.0 70B are a viable alternative, provided there are sufficient computational resources.
UR - http://www.scopus.com/inward/record.url?scp=105000299752&partnerID=8YFLogxK
U2 - 10.1007/s00330-025-11500-9
DO - 10.1007/s00330-025-11500-9
M3 - Article
C2 - 40097844
SN - 0938-7994
JO - European Radiology
JF - European Radiology
M1 - 102391
ER -
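
For readers who want a concrete picture of the domain-adaptation step the abstract describes, the sketch below shows masked language modelling over a file of unlabelled report text using the Hugging Face transformers and datasets libraries. This is a minimal illustration, not the authors' actual pipeline: the base model choice, the file name unlabelled_reports.txt, the 15% masking probability, and all training hyperparameters are assumptions for demonstration purposes.

    # Minimal sketch: hospital-specific domain adaptation via masked language
    # modelling (MLM). Assumptions: a RoBERTa base model stands in for the
    # BERT-based models named in the abstract; reports live one-per-line in a
    # hypothetical text file; hyperparameters are placeholders.
    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "roberta-base"  # assumed base checkpoint, not the paper's
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Hypothetical file of unlabelled reports, one report per line.
    dataset = load_dataset("text", data_files={"train": "unlabelled_reports.txt"})

    def tokenize(batch):
        # Truncate long reports to the model's 512-token context window.
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # Randomly mask 15% of tokens; the model learns to reconstruct them,
    # adapting its language representations to the hospital's report style.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="mlm-adapted", num_train_epochs=3),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()

    # The adapted checkpoint would then be fine-tuned on labelled reports for
    # the downstream binary or multi-label classification tasks.
    model.save_pretrained("mlm-adapted")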