King's College London

Accessible data curation and analytics for international-scale citizen science datasets

Research output: Contribution to journal › Article › peer-review

Original language: English
Article number: 297
Pages (from-to): 297
Journal: Scientific Data
Volume: 8
Issue number: 1
Early online date: 22 Nov 2021
E-pub ahead of print: 22 Nov 2021
Published: 22 Nov 2021

Bibliographical note

Funding Information: ZOE Limited provided in-kind support for all aspects of building, running, and supporting the ZOE app and service to all users worldwide. Support for this study was provided by the National Institute for Health Research (NIHR)-funded Biomedical Research Centre based at Guy’s and St Thomas’ (GSTT) NHS Foundation Trust. This work was supported by the UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value-Based Healthcare (104691) and the Chronic Disease Research Foundation award CDRF-22/2020. Investigators also received support from the Wellcome Trust (WT203148/Z/16/Z, WT213038/Z/18/Z, and W212904/Z/18/Z), Medical Research Council (MRC; MR/V005030/1 and MR/M004422/1), British Heart Foundation, Alzheimer’s Society, EU, NIHR, COVID-19 Driver Relief Fund, Innovate UK, the NIHR-funded BioResource, and the Clinical Research Facility and Biomedical Research Centre based at GSTT NHS Foundation Trust, in partnership with King’s College London. This work was also supported by the National Core Studies, an initiative funded by UK Research and Innovation, NIHR, and the Health and Safety Executive. The COVID-19 Longitudinal Health and Wellbeing National Core Study was funded by the MRC (MC_PC_20030). The work performed on the Swedish study was supported by grants from the Swedish Research Council, Swedish Heart-Lung Foundation and the Swedish Foundation for Strategic Research (LUDC-IRC 15-0067). SO was supported by the French government, through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR-19-P3IA-0002). ATC was supported by a Stuart and Suzanne Steele MGH Research Scholar Award and by the Massachusetts Consortium on Pathogen Readiness and M Schwartz and L Schwartz. Publisher Copyright: © 2021, The Author(s).

Abstract

The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation.

The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers.

We present ExeTera, a Python-based open-source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.
