Algorithms for String and Graph Data Protection

Student thesis: Doctoral ThesisDoctor of Philosophy

Abstract

The growth in data volume has led to an increasing interest in analyzing data about individuals in domains ranging from marketing to biomedical informatics. Therefore, data about individuals are often shared beyond the parties that have collected them. However, data sharing has raised justified privacy concerns. In this thesis, we will focus on data privacy of two types of data: strings and graphs.

Strings (i.e., sequences of letters over some alphabet) are typically used to model web activity and movement of individuals, natural language, as well as genetic material of organisms. Therefore, strings are featured in many applications. We introduce three privacy threats related to string data: (I) sensitive pattern exposure, (II) k-mer inference, and (III) k-mer based string reconstruction. Sensitive pattern exposure occurs when substrings modeling confidential information occur in a string that is disseminated. To prevent sensitive pattern exposure, we introduce the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. We consider this problem in two different settings and propose time-optimal algorithms and heuristics to deal with it. To prevent k-mer inference, we introduce the problem of protecting the presence or absence of length-k substrings, referred to as k-mers, in an input string, while preserving data utility in frequency-based data mining tasks. We propose a sampling-based mechanism for enforcing differential privacy and several methods, including exact algorithms and linear-time heuristics, that apply the mechanism with minimal utility loss. To prevent k-mer based string reconstruction, we introduce a new type of data structure that makes reconstructing a string based on its k-mers difficult. We also introduce an algorithm for constructing such a data structure for answering pattern matching decision and counting queries on strings.

Graphs are typically used to model relationships between entities in domains such as social networks, communication networks, and the web. As such, many data analysis tasks are applied to graphs. We introduce the problem of Community Breaking in undirected graphs. The problem seeks to delete a smallest subset of edges from an input graph, so that the graph is community-free. We study the problem under the community notion of k-truss. We propose an exact algorithm and several heuristics for the problem, as well as a data structure that is used to speed up our methods. We also propose a lower bound on the size of optimal solution to the problem and develop an efficient algorithm to compute it. Last, we conduct extensive experiments using real-world and synthetic data to demonstrate the effectiveness and efficiency of our methods for protecting strings or graphs.
Date of Award1 Jan 2023
Original languageEnglish
Awarding Institution
  • King's College London
SupervisorGrigorios Loukidis (Supervisor)

Cite this

'