TY - CONF

T1 - Frequency-Constrained Substring Complexity

AU - Pissis, Solon

AU - Shekelyan, Michael

AU - Liu, Chang

AU - Loukidis, Grigorios

N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

PY - 2023

Y1 - 2023

N2 - We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] in $$\tau $$ intervals, $$\mathcal {I}=I_1,\ldots,I_\tau $$, the frequency-constrained substring complexity of x is the function $$f:{x,\mathcal {I}}(i,j)$$ that maps i, j to the number of distinct substrings of length i of x occurring at least $$\alpha _j$$ and at most $$\beta _j$$ times in x, where $$I:j=[\alpha _j,\beta _j]$$. We extend this notion as follows. For a string x, a dictionary $$\mathcal {D}$$ of d strings (documents), and a partition of [d] in $$\tau $$ intervals $$I:1,\ldots,I_\tau $$, we define a 2D array $$S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]$$ as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least $$\alpha _j$$ and at most $$\beta _j$$ documents, where $$I:j=[\alpha _j,\beta _j]$$. Array S can thus be seen as the distribution of the substring complexity of x into $$\tau $$ document frequency classes. We show that after a linear-time preprocessing of $$\mathcal {D}$$, for any x and any partition of [d] in $$\tau $$ intervals given online, array S can be computed in near-optimal $$\mathcal {O}(|x| \tau \log \log d)$$ time.

AB - We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] in $$\tau $$ intervals, $$\mathcal {I}=I_1,\ldots,I_\tau $$, the frequency-constrained substring complexity of x is the function $$f:{x,\mathcal {I}}(i,j)$$ that maps i, j to the number of distinct substrings of length i of x occurring at least $$\alpha _j$$ and at most $$\beta _j$$ times in x, where $$I:j=[\alpha _j,\beta _j]$$. We extend this notion as follows. For a string x, a dictionary $$\mathcal {D}$$ of d strings (documents), and a partition of [d] in $$\tau $$ intervals $$I:1,\ldots,I_\tau $$, we define a 2D array $$S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]$$ as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least $$\alpha _j$$ and at most $$\beta _j$$ documents, where $$I:j=[\alpha _j,\beta _j]$$. Array S can thus be seen as the distribution of the substring complexity of x into $$\tau $$ document frequency classes. We show that after a linear-time preprocessing of $$\mathcal {D}$$, for any x and any partition of [d] in $$\tau $$ intervals given online, array S can be computed in near-optimal $$\mathcal {O}(|x| \tau \log \log d)$$ time.

UR - http://www.scopus.com/inward/record.url?scp=85174451379&partnerID=8YFLogxK

U2 - https://doi.org/10.1007/978-3-031-43980-3_28

DO - https://doi.org/10.1007/978-3-031-43980-3_28

M3 - Paper

SP - 345

EP - 352

ER -