A survey of data quality requirements that matter in ML development pipelines

Maria Priestley, Fionntán O’Donnell, Elena Simperl

Research output: Contribution to journal › Article › peer-review


Abstract

The fitness of the systems in which Machine Learning (ML) is used depends greatly on good-quality data. Specifications on what makes a good-quality dataset have traditionally been defined by the needs of the data users, typically analysts and engineers. Our article critically examines the extent to which established data quality frameworks are applicable to contemporary use cases in ML. Using a review of recent literature at the intersection of ML, data management, and human-computer interaction, we find that the classical "fitness-for-use" view of data quality can benefit from a more stage-specific approach that is sensitive to where in the ML lifecycle the data are encountered. This helps practitioners to plan their data quality tasks in a manner that meets the needs of the stakeholders who will encounter the dataset, whether it be data subjects, software developers or organisations. We therefore propose a new treatment of traditional data quality criteria by structuring them according to two dimensions: (1) the stage of the ML lifecycle where the use case occurs vs. (2) the main categories of data quality that can be pursued (intrinsic, contextual, representational and accessibility). To illustrate how this works in practice, we contribute a temporal mapping of the various data quality requirements that are important at different stages of the ML data pipeline. We also share some implications for data practitioners and organisations that wish to enhance their data management routines in preparation for ML.

Original language: English
Article number: 3592616
Journal: Journal of Data and Information Quality
Volume: 15
Issue number: 2
Early online date: 19 Apr 2023
Publication status: Published - 22 Jun 2023
