Completeness and consistency of ethnicity categorisations in administrative data: A comparison between linked health and education datasets

Alice Wickersham*, J. Das-Munshi, T. Ford, R. Stewart, J. Downs

*Corresponding author for this work

Research output: Working paper/Preprint


Background: Ethnicity data are critical for identifying inequalities in the population, but previous studies suggest that they are not consistently recorded between different administrative datasets. With researchers increasingly leveraging cross-domain data linkages, we aimed to investigate the completeness and consistency of ethnicity data in two linked health and education datasets.

Methods: We used an existing individual-level data linkage between South London and Maudsley NHS Foundation Trust de-identified electronic health records, accessed via the Clinical Record Interactive Search (CRIS), and the National Pupil Database (NPD) (2007-2013). For a cohort of children and adolescents referred to Child and Adolescent Mental Health Services, we used descriptive statistics to summarise the availability of ethnicity data from each source and to explore how aggregate ethnicity (White, Black, Asian, Mixed and Other) compared between the sources. We also conducted unadjusted logistic regression analyses to estimate how ethnicity was associated with two key educational and clinical outcomes (educational attainment and neurodevelopmental disorder) when coding ethnicity from each source.

Results: Of the n=30,426 available linked records, ethnicity data were available for 79.3% from the NPD, 87.0% from CRIS, 97.3% from either source and 69.0% from both sources. Among those who had ethnicity data from both, the two data sources agreed on 87.0% of aggregate ethnicity categorisations overall, but with high levels of disagreement in the Mixed and Other ethnic groups. The strength of the associations between ethnicity, educational attainment and neurodevelopmental disorder varied according to which data source was used to code ethnicity. For example, compared with White pupils, a significantly higher proportion of Asian pupils achieved expected educational attainment thresholds only if ethnicity was coded from the NPD (OR=1.85, 95% CI=1.48 to 2.31, p<0.001), not if ethnicity was coded from CRIS (OR=1.18, 95% CI=0.97 to 1.44, p=0.099).

Conclusions: Data linkage has the potential to minimise missing ethnicity data, and overlap in ethnicity categorisations between CRIS and the NPD was generally high. However, the choice of which data source to primarily code ethnicity from can have implications for analyses of ethnicity, mental health, and educational outcomes. Users of linked data should exercise caution in combining and comparing ethnicity between different data sources.
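The completeness and agreement summaries described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the toy records, the function names `completeness` and `agreement`, and the choice to break agreement down by the NPD category are all assumptions made for illustration, and the values bear no relation to the study data.

```python
from collections import Counter

# Hypothetical toy records: (NPD ethnicity, CRIS ethnicity); None marks missing.
records = [
    ("White", "White"),
    ("Black", "Black"),
    ("Asian", None),
    (None, "Mixed"),
    ("Mixed", "Other"),
    ("White", "White"),
    (None, None),
    ("Other", "Other"),
]

def completeness(records):
    """Proportion of records with ethnicity from each source, either, and both."""
    n = len(records)
    return {
        "npd": sum(1 for a, _ in records if a is not None) / n,
        "cris": sum(1 for _, b in records if b is not None) / n,
        "either": sum(1 for a, b in records if a is not None or b is not None) / n,
        "both": sum(1 for a, b in records if a is not None and b is not None) / n,
    }

def agreement(records):
    """Overall and per-category agreement among records coded in both sources.

    Per-category agreement is indexed by the NPD category (an assumption).
    """
    paired = [(a, b) for a, b in records if a is not None and b is not None]
    overall = sum(1 for a, b in paired if a == b) / len(paired)
    totals = Counter(a for a, _ in paired)
    matches = Counter(a for a, b in paired if a == b)
    by_group = {group: matches[group] / totals[group] for group in totals}
    return overall, by_group

summary = completeness(records)
overall, by_group = agreement(records)
print(summary)
print(overall, by_group)
```

Restricting the agreement calculation to records coded in both sources mirrors the abstract's "among those who had ethnicity data from both" denominator. Raw percentage agreement does not correct for chance agreement.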
Original language: English
Publication status: Published - 2022


  • epidemiology

