TY - JOUR
T1 - Evaluating the generalisability of neural rumour verification models
AU - Kochkina, Elena
AU - Hossain, Tamanna
AU - Logan, Robert L.
AU - Arana-Catania, Miguel
AU - Procter, Rob
AU - Zubiaga, Arkaitz
AU - Singh, Sameer
AU - He, Yulan
AU - Liakata, Maria
N1 - Funding Information:
Dr. Elena Kochkina is a Postdoctoral Researcher at Queen Mary University of London and the Alan Turing Institute. Elena completed a PhD in Computer Science at the Warwick Institute for the Science of Cities (WISC) CDT, funded by the Leverhulme Trust via the Bridges Programme. Her background is in Applied Mathematics (BSc, MSc, Lobachevsky State University of Nizhny Novgorod) and Complexity Science (MSc, University of Warwick, Chalmers University). She has published in venues such as ACL, COLING, EACL and IP&M.
Funding Information:
This work was supported by a UKRI/EPSRC grant (EP/V048597/1) to Profs Yulan He, Rob Procter and Maria Liakata, as well as project funding from the Alan Turing Institute, UK, grant EP/N510129/1. ML and YH are supported by Turing AI Fellowships (EP/V030302/1, EP/V020579/1). This material is based upon work sponsored in part by NSF award #IIS-1817183 and in part by the DARPA MCS program under Contract No. N660011924033 with the United States Office Of Naval Research. We thank Arjuna Ugarte, Staff Researcher III at the University of California Irvine (UCI) Emergency Medicine & Informatics department for leading the stance annotation team. We also thank our stance annotators: UCI undergraduate students Ali Al-Hakeem, Sharon Li, Abhi Madduri, Juhi Patel, and Anam Zahidi; and Evergreen Valley High School, San Jose student Nitya Golla.
Funding Information:
Prof. Sameer Singh is an Associate Professor of Computer Science at the University of California, Irvine (UCI). He works primarily on robustness and interpretability of machine learning algorithms, along with models that reason with text and structure for natural language processing. He has received the NSF CAREER award, was selected as a DARPA Riser, and has received the UCI Distinguished Early Career Faculty award and the Hellman Faculty Fellowship. Sameer has published extensively at machine learning and natural language processing venues and received conference paper awards at KDD 2016, ACL 2018, EMNLP 2019, AKBC 2020, and ACL 2020.
Publisher Copyright:
© 2022 The Author(s)
PY - 2023/1
Y1 - 2023/1
N2 - Research on automated social media rumour verification, the task of identifying the veracity of questionable information circulating on social media, has yielded neural models achieving high performance, with accuracy scores that often exceed 90%. However, none of these studies focus on the real-world generalisability of the proposed approaches, that is, whether the models perform well on datasets other than those on which they were initially trained and tested. In this work we aim to fill this gap by assessing the generalisability of top-performing neural rumour verification models covering a range of different architectures, from the perspectives of both topic and temporal robustness. For a more complete evaluation of generalisability, we collect and release COVID-RV, a novel dataset of Twitter conversations revolving around COVID-19 rumours. Unlike other existing COVID-19 datasets, COVID-RV contains conversations around rumours that follow the format of prominent rumour verification benchmarks, while differing from them in topic and time scale, thus allowing better assessment of the temporal robustness of the models. We evaluate model performance on COVID-RV and three popular rumour verification datasets to understand the limitations and advantages of different model architectures, training datasets and evaluation scenarios. We find a dramatic drop in performance when testing models on a dataset different from the one used for training. Further, we evaluate the ability of models to generalise in a few-shot learning setup, as well as when word embeddings are updated with the vocabulary of a new, unseen rumour. Drawing upon our experiments, we discuss challenges and make recommendations for future research directions in addressing this important problem.
KW - Rumour verification
KW - Generalisability
KW - Rumour dataset
KW - Deep learning
UR - http://www.scopus.com/inward/record.url?scp=85140491991&partnerID=8YFLogxK
U2 - 10.1016/j.ipm.2022.103116
DO - 10.1016/j.ipm.2022.103116
M3 - Article
SN - 0306-4573
VL - 60
JO - Information Processing and Management
JF - Information Processing and Management
IS - 1
M1 - 103116
ER -