Clustering complex datasets
: algorithms and applications

Student thesis: Doctoral ThesisDoctor of Philosophy

Abstract

Clustering is a fundamental data mining task aiming to partition a given dataset into groups, called clusters, which should be compact and well-separated. In this thesis, we focus on clustering three types of data which are difficult to be handled appropriately by existing clustering algorithms, due to their structural properties: Relational Transaction (RT) datasets, Relational Sequential (RS) datasets, and sequence graphs.

RT-datasets are comprised of relational (single-valued) and transaction (set-valued) attributes and are commonly used to model Electronic Health Record (EHR) data; a patient’s demographics are modeled as relational attributes and the patient’s set of diagnosis codes is modeled as a transaction attribute. We propose the first approach for clustering an RT dataset comprised of patient demographics and diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity and finding compact, well-separated clusters through hierarchical clustering. Our experiments demonstrate that our approach constructs clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.

RS-datasets are comprised of relational attributes, as well as a sequential attribute. These datasets are also commonly used to model EHR data; a patient’s sequence of diagnosis codes is modeled as a sequential attribute. Clustering an RS-dataset is helpful for analyses ranging from pattern mining to classification. However, existing methods are not appropriate to perform this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. First, we formalize the task of clustering an RS-dataset as an optimization problem. Second, we propose a distance measure to quantify the pairwise similarity between records of an RS-dataset. Third, we develop an algorithm which first identifies k representative records (centers), for a given k, and then constructs k clusters, each containing one center and the records that are closer to the center compared to other centers. Experiments using two EHR datasets demonstrate that our algorithm constructs compact and well-separated clusters and that it is efficient and scalable.

A sequence graph is a graph whose nodes are labeled with sequences of letters (i.e., strings). Sequence graphs are commonly encountered in social networks, where nodes represent users and edges represent user friendships, or in e-commerce, where nodes represent consumers and edges represent consumers’ trust relationships. Clustering the nodes of a sequence graph allows detecting user communities in a social network, or identifying groups of consumers with bonds of trust among them in e-commerce. However, this problem has not been considered before, to our knowledge. We thus introduce the problem of clustering a sequence graph. We first propose two pairwise distance measures for sequence graphs, one based on edit distance and shortest path distance and another one based on SimRank. We then formalize the problem under each measure, showing also that it is NP-hard. In addition, we design a polynomial-time 2-approximation algorithm, as well as a heuristic for the problem. Experiments using real datasets and a case study demonstrate the effectiveness and efficiency of our methods.
Date of Award1 Jan 2023
Original languageEnglish
Awarding Institution
  • King's College London
SupervisorGrigorios Loukidis (Supervisor) & Sophia Tsoka (Supervisor)

Cite this

'