Efficient String Algorithms for Data Security and Privacy

Student thesis: Doctoral ThesisDoctor of Philosophy

Abstract

Uncertainties in data propagation are magnified across all stages in data science proliferation. It is therefore essential, that these uncertainties are processed appropriately by the system. It is also important that data is not artificially made certain; for example, by choosing the most probable outcome at each stage. Therefore, the central hypothesis of this thesis focuses on algorithmic designs on uncertain sequences which forms an integral aspect and foundation for representing and mining uncertain data arising in a wide variety of contexts.

Additionally, our study focuses on computing novel algorithmic techniques useful for other high throughput data processing, and devise appropriate uncertain and probabilistic sequence formulations for modelling large-scale complex heterogeneous data. Furthermore, highly-scalable algorithms for pattern and motif discovery for sequential pattern mining in uncertain sequence data was developed.

To achieve all these, we propose a novel pattern matching technique that caters for the rotational issue responsible for the orientation differences in fingerprints. This was done by implementing a pre-matching stage called the orientation identification stage using lexicographical ordering to determine the correct orientation. Our first step was to extract the binary circular string from the scanned image using the proposed extraction method over specified radii to produce binary strings in linear time. Practically, this can be sub linear with respect to the short length of the extracted circular strings. Thereafter, the lexicographical order was calculated for each extracted string to establish the starting point of the circular string in ascending order. This is done in linear time with respect to the length of the strings. The fingerprint string information is then matched against a database of stored images using existing approximate string matching techniques. We present approaches to implement our algorithms, using real data to show the efficiency of our algorithms.

Recognizing that indiscriminate sharing of data may lead to the disclosure of sensitive patterns that are represented as substrings and encode confidential information, we examine the novel problem of concealing a given set of sensitive patterns in a string. Our approach is based on injecting a minimal level of uncertainty to the string, while preserving the utility of the string as much as possible, by replacing selected symbols in this string with a symbol “∗”. To realize our approach, we propose an algorithm SSA that efficiently sanitizes these sensitive patterns after detecting them with Aho-Corasick algorithm. A set of experiments to demonstrate the effectiveness and efficiency of our algorithms are presented.

Additionally, the study concentrated on computing cyclic regularities of strings by proposing a few adept algorithms for computing cyclic periodicity and covers.

In conclusion, this thesis demonstrated the use of string algorithms as a tool for efficient computing in achieving and sustaining data security and privacy.






Date of Award1 Nov 2021
Original languageEnglish
Awarding Institution
  • King's College London
SupervisorCostas Iliopoulos (Supervisor) & Solon Pissis (Supervisor)

Cite this

'