Abstract
The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes one dataset more reusable than another and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This work demonstrates the practical gap between principles and actionable insights that allow data publishers and tool designers to implement functionalities that provably facilitate reuse.
There is plenty of advice on how to make a dataset easier to reuse, including technical standards, legal frameworks, and guidelines. This paper begins to address the gap between this advice and practice. To do so, it presents a compilation of reuse features from the literature. To understand what these features look like in data projects, we carried out a case study of datasets published and shared on GitHub, a large online platform for sharing code and data.
Original language | English |
---|---|
Article number | 100136 |
Journal | Patterns |
Volume | 1 |
Issue number | 8 |
Publication status | Published - 13 Nov 2020 |
Keywords
- data portals
- dataset reuse
- DSML 2: Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem
- human-data interaction
- neural networks
- reuse prediction