Statistically Validated Network for analysing textual data

Research output: Contribution to journal › Article › peer-review

Abstract

This paper presents a novel methodology, called the Word Co-occurrence SVN topic model (WCSVNtm), for document clustering and topic modeling in textual datasets. The method represents the corpus as a bipartite network of words and documents in order to rigorously assess the statistical significance of word co-occurrences within documents and of document overlap based on shared vocabulary. By applying the Leiden community detection algorithm to the Statistically Validated Network (SVN), distinct communities of words can be identified and interpreted as topics. Similarly, documents can be sorted into groups based on their thematic similarities. We demonstrate the effectiveness of our approach by analyzing three datasets: a set of 120 Wikipedia articles, the arXiv10 dataset, which consists of 100 000 abstracts from scientific papers, and a sampled subset of 10 000 documents from the original arXiv10. To benchmark our results, we compare our approach with several well-established models in the field of topic modeling and document clustering, including the hierarchical Stochastic Block Model (hSBM), BERTopic, and Latent Dirichlet Allocation (LDA). The results show that WCSVNtm achieves competitive performance across all datasets while automatically selecting the number of topics and document clusters, whereas state-of-the-art methods require prior knowledge or additional tuning for optimization. Finally, any advancements in community detection algorithms could further improve our method.
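The pipeline described in the abstract — statistically validating word co-occurrence edges and then running community detection to obtain topics — can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a hypergeometric null model with a Bonferroni correction (a common choice for statistically validated networks) and substitutes networkx's greedy modularity communities for the Leiden algorithm, purely so the example is self-contained.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.stats import hypergeom

def validated_word_network(docs, alpha=0.05):
    """Build a word co-occurrence network, keeping only edges whose
    co-occurrence count is statistically significant under a
    hypergeometric null model with Bonferroni correction.

    Illustrative sketch: the paper's validation procedure may differ.
    """
    doc_sets = [set(d) for d in docs]
    N = len(doc_sets)
    vocab = sorted(set().union(*doc_sets))
    # number of documents containing each word
    n_docs = {w: sum(w in d for d in doc_sets) for w in vocab}
    # observed co-occurrence counts for each word pair
    co = {}
    for d in doc_sets:
        for u, v in combinations(sorted(d), 2):
            co[(u, v)] = co.get((u, v), 0) + 1
    threshold = alpha / len(co)  # Bonferroni over the tested pairs
    G = nx.Graph()
    G.add_nodes_from(vocab)
    for (u, v), k in co.items():
        # P(X >= k) under the hypergeometric null: how surprising is it
        # that words appearing in n_docs[u] and n_docs[v] documents
        # co-occur in k of the N documents?
        p = hypergeom.sf(k - 1, N, n_docs[u], n_docs[v])
        if p < threshold:
            G.add_edge(u, v)
    return G

# Toy corpus (hypothetical data) with two clearly separated themes.
docs = (
    [["cat", "pet", "fur"]] * 5 +
    [["stock", "market", "price"]] * 5
)
G = validated_word_network(docs)
# Communities of the validated network act as topics; the paper uses the
# Leiden algorithm, greedy modularity is only a stand-in here.
topics = [set(c) for c in greedy_modularity_communities(G)]
```

On this toy corpus the validated network splits into two components, and the community detection step recovers the two themes as separate word communities, mirroring how topics emerge from the SVN in the paper.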
Original language: English
Journal: Applied Network Science
Publication status: Accepted/In press - 29 Jan 2025

