TY - JOUR
T1 - Assessing the Quality of Sources in Wikidata Across Languages
T2 - A Hybrid Approach
AU - Amaral, Gabriel
AU - Piscopo, Alessandro
AU - Kaffee, Lucie-Aimée
AU - Rodrigues, Odinaldo
AU - Simperl, Elena
N1 - Funding Information:
The project leading to this publication has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 812997 (Cleopatra). Authors’ addresses: G. Amaral, O. Rodrigues, and E. Simperl, King’s College London, London WC2R 2LS, United Kingdom; emails: [email protected], [email protected], [email protected]; A. Piscopo, BBC, London W12 7TQ, United Kingdom; email: [email protected]; L.-A. Kaffee, University of Southampton, Southampton SO17 1BJ, United Kingdom; email: [email protected].
Publisher Copyright:
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2021
Y1 - 2021
N2 - Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important, as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on previous work of ours, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.
AB - Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important, as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on previous work of ours, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.
KW - crowdsourcing
KW - data quality
KW - knowledge graphs
KW - verifiability
KW - Wikidata
UR - http://www.scopus.com/inward/record.url?scp=85130162436&partnerID=8YFLogxK
U2 - 10.1145/3484828
DO - 10.1145/3484828
M3 - Article
AN - SCOPUS:85130162436
SN - 1936-1955
VL - 13
JO - Journal of Data and Information Quality
JF - Journal of Data and Information Quality
IS - 4
M1 - 23
ER -