TY - JOUR
T1 - Hide and Mine in Strings
T2 - Hardness, Algorithms, and Experiments
AU - Bernardini, Giulia
AU - Conte, Alessio
AU - Gourdel, Garance
AU - Grossi, Roberto
AU - Loukidis, Grigorios
AU - Pisanti, Nadia
AU - Pissis, Solon
AU - Punzi, Giulia
AU - Stougie, Leen
AU - Sweering, Michelle
PY - 2022/3/10
Y1 - 2022/3/10
N2 - Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.
AB - Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.
KW - data privacy
KW - data sanitization
KW - knowledge hiding
KW - frequent pattern mining
KW - string algorithms
KW - missing value replacement
UR - http://www.scopus.com/inward/record.url?scp=85126326575&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2022.3158063
DO - 10.1109/TKDE.2022.3158063
M3 - Article
SN - 1041-4347
JO - IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
JF - IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
ER -