TY - JOUR

T1 - Property Suffix Array with Applications in Indexing Weighted Sequences

AU - Charalampopoulos, Panagiotis

AU - Iliopoulos, Costas S.

AU - Liu, Chang

AU - Pissis, Solon P.

PY - 2020/11

Y1 - 2020/11

N2 - The suffix array is one of the most prevalent data structures for string indexing; it stores the lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the suffix tree is space efficiency. In Property Indexing, we are given a string x of length n and a property Π, i.e., a set of Π-valid intervals over x so that a pattern p occurs in x if and only if x has an occurrence of p that lies entirely within an interval of Π. A suffix-tree-like index over the valid prefixes of suffixes of x can be built in time and space O(n). We show here how to directly build a suffix-array-like index, the Property Suffix Array (PSA), in time and space O(n). We mainly draw our motivation from weighted (probabilistic) sequences: sequences of probability distributions over a given alphabet. Given a probability threshold 1/z, we say that a string p of length m matches a weighted sequence x of length n at starting position i if the product of probabilities of the letters of p at positions i, ⋯, i+m - 1 in x is at least 1/z. Our algorithm for building the PSA can be directly applied to build an O(nz)-sized suffix-array-like index over x in time and space O(nz). Finally, we present extensive experimental results showing that our new indexing data structure is well suited for real-world applications.

AB - The suffix array is one of the most prevalent data structures for string indexing; it stores the lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the suffix tree is space efficiency. In Property Indexing, we are given a string x of length n and a property Π, i.e., a set of Π-valid intervals over x so that a pattern p occurs in x if and only if x has an occurrence of p that lies entirely within an interval of Π. A suffix-tree-like index over the valid prefixes of suffixes of x can be built in time and space O(n). We show here how to directly build a suffix-array-like index, the Property Suffix Array (PSA), in time and space O(n). We mainly draw our motivation from weighted (probabilistic) sequences: sequences of probability distributions over a given alphabet. Given a probability threshold 1/z, we say that a string p of length m matches a weighted sequence x of length n at starting position i if the product of probabilities of the letters of p at positions i, ⋯, i+m - 1 in x is at least 1/z. Our algorithm for building the PSA can be directly applied to build an O(nz)-sized suffix-array-like index over x in time and space O(nz). Finally, we present extensive experimental results showing that our new indexing data structure is well suited for real-world applications.

KW - property suffix array

KW - property suffix tree

KW - suffix array

KW - Suffix tree

KW - weighted sequence

UR - http://www.scopus.com/inward/record.url?scp=85092095146&partnerID=8YFLogxK

U2 - 10.1145/3385898

DO - 10.1145/3385898

M3 - Article

AN - SCOPUS:85092095146

SN - 1084-6654

VL - 25

JO - ACM Journal of Experimental Algorithmics

JF - ACM Journal of Experimental Algorithmics

M1 - 3385898

ER -