TY - JOUR
T1 - Clustering sequence graphs
AU - Zhong, Haodi
AU - Loukidis, Grigorios
AU - Pissis, Solon P
N1 - Funding Information:
H. Zhong is supported by a CSC Scholarship, UK. G. Loukides is supported by the Leverhulme Trust, UKRPG-2019-399 project. S. P. Pissis is partially supported by the ALPACA project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 956229 and by the PANGAIA project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 872539.
Funding Information:
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Grigorios Loukides reports financial support was provided by King’s College London. Haodi Zhong reports financial support was provided by King’s College London. Solon P. Pissis reports financial support was provided by National Research Institute for Mathematics and Computer Science.
Funding Information:
H. Zhong is supported by a CSC Scholarship, UK . G. Loukides is supported by the Leverhulme Trust, UK RPG-2019-399 project. S. P. Pissis is partially supported by the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 956229 and by the PANGAIA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 872539 .
Publisher Copyright:
© 2022 The Author(s)
PY - 2022/3
Y1 - 2022/3
N2 - In application domains ranging from social networks to e-commerce, it is important to cluster users with respect to both their relationships (e.g., friendship or trust) and their actions (e.g., visited locations or rated products). Motivated by these applications, we introduce here the task of clustering the nodes of a sequence graph, i.e., a graph whose nodes are labeled with strings (e.g., sequences of users' visited locations or rated products). Both string clustering algorithms and graph clustering algorithms are inappropriate to deal with this task, as they do not consider the structure of strings and graph simultaneously. Moreover, attributed graph clustering algorithms generally construct poor solutions because they need to represent a string as a vector of attributes, which inevitably loses information and may harm clustering quality. We thus introduce the problem of clustering a sequence graph. We first propose two pairwise distance measures for sequence graphs, one based on edit distance and shortest path distance and another one based on SimRank. We then formalize the problem under each measure, showing also that it is NP-hard. In addition, we design a polynomial-time 2-approximation algorithm, as well as a heuristic for the problem. Experiments using real datasets and a case study demonstrate the effectiveness and efficiency of our methods.
AB - In application domains ranging from social networks to e-commerce, it is important to cluster users with respect to both their relationships (e.g., friendship or trust) and their actions (e.g., visited locations or rated products). Motivated by these applications, we introduce here the task of clustering the nodes of a sequence graph, i.e., a graph whose nodes are labeled with strings (e.g., sequences of users' visited locations or rated products). Both string clustering algorithms and graph clustering algorithms are inappropriate to deal with this task, as they do not consider the structure of strings and graph simultaneously. Moreover, attributed graph clustering algorithms generally construct poor solutions because they need to represent a string as a vector of attributes, which inevitably loses information and may harm clustering quality. We thus introduce the problem of clustering a sequence graph. We first propose two pairwise distance measures for sequence graphs, one based on edit distance and shortest path distance and another one based on SimRank. We then formalize the problem under each measure, showing also that it is NP-hard. In addition, we design a polynomial-time 2-approximation algorithm, as well as a heuristic for the problem. Experiments using real datasets and a case study demonstrate the effectiveness and efficiency of our methods.
UR - http://www.scopus.com/inward/record.url?scp=85124049255&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2022.101981
DO - 10.1016/j.datak.2022.101981
M3 - Article
SN - 0169-023X
VL - 138
JO - DATA AND KNOWLEDGE ENGINEERING
JF - DATA AND KNOWLEDGE ENGINEERING
M1 - 101981
ER -