Abstract
Clustering is a fundamental data mining task aiming to partition a given dataset into groups, called clusters, which should be compact and wellseparated. In this thesis, we focus on clustering three types of data which are difﬁcult to be handled appropriately by existing clustering algorithms, due to their structural properties: Relational Transaction (RT) datasets, Relational Sequential (RS) datasets, and sequence graphs.RTdatasets are comprised of relational (singlevalued) and transaction (setvalued) attributes and are commonly used to model Electronic Health Record (EHR) data; a patient’s demographics are modeled as relational attributes and the patient’s set of diagnosis codes is modeled as a transaction attribute. We propose the ﬁrst approach for clustering an RT dataset comprised of patient demographics and diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity and ﬁnding compact, wellseparated clusters through hierarchical clustering. Our experiments demonstrate that our approach constructs clusters with correlated demographics and diagnosis codes, and that it is efﬁcient and scalable.
RSdatasets are comprised of relational attributes, as well as a sequential attribute. These datasets are also commonly used to model EHR data; a patient’s sequence of diagnosis codes is modeled as a sequential attribute. Clustering an RSdataset is helpful for analyses ranging from pattern mining to classiﬁcation. However, existing methods are not appropriate to perform this task. Thus, we initiate a study of how an RSdataset can be clustered effectively and efﬁciently. First, we formalize the task of clustering an RSdataset as an optimization problem. Second, we propose a distance measure to quantify the pairwise similarity between records of an RSdataset. Third, we develop an algorithm which ﬁrst identiﬁes k representative records (centers), for a given k, and then constructs k clusters, each containing one center and the records that are closer to the center compared to other centers. Experiments using two EHR datasets demonstrate that our algorithm constructs compact and wellseparated clusters and that it is efﬁcient and scalable.
A sequence graph is a graph whose nodes are labeled with sequences of letters (i.e., strings). Sequence graphs are commonly encountered in social networks, where nodes represent users and edges represent user friendships, or in ecommerce, where nodes represent consumers and edges represent consumers’ trust relationships. Clustering the nodes of a sequence graph allows detecting user communities in a social network, or identifying groups of consumers with bonds of trust among them in ecommerce. However, this problem has not been considered before, to our knowledge. We thus introduce the problem of clustering a sequence graph. We ﬁrst propose two pairwise distance measures for sequence graphs, one based on edit distance and shortest path distance and another one based on SimRank. We then formalize the problem under each measure, showing also that it is NPhard. In addition, we design a polynomialtime 2approximation algorithm, as well as a heuristic for the problem. Experiments using real datasets and a case study demonstrate the effectiveness and efﬁciency of our methods.
Date of Award  1 Jan 2023 

Original language  English 
Awarding Institution 

Supervisor  Grigorios Loukidis (Supervisor) & Sophia Tsoka (Supervisor) 