Bidirectional String Anchors for Improved Text Indexing and Top-K Similarity Search

Grigorios Loukidis, Solon Pissis, Michelle Sweering

Research output: Contribution to journalArticlepeer-review

3 Citations (Scopus)
53 Downloads (Pure)

Abstract

The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given an integer ℓ, our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment. We show that, like minimizers samples, bd-anchors samples are approximately uniform, locally consistent, and computable in linear time. Furthermore, our experiments demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ; and that these sizes are competitive to or smaller than the minimizers sample sizes. We theoretically justify these results by analyzing the expected size of bd-anchors samples. We also prove that computing a total order on the input alphabet which minimizes the bd-anchors sample size is NP-hard. We next highlight the benefits of bd-anchors in two important applications: text indexing and top-K similarity search. For the first application, we develop an index for performing on-line pattern searches in near-optimal time, and show experimentally that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index; we also show that it is substantially faster than two classic text indexes. For the second application, we develop a heuristic for top-K similarity search under edit distance, and show experimentally that it is generally as accurate as the state-of-the-art tool for the same purpose but more than one order of magnitude faster.

Original languageEnglish
Pages (from-to)11093-11111
Number of pages19
JournalIEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume35
Issue number11
DOIs
Publication statusPublished - 16 Jan 2023

Keywords

  • string algorithms
  • string sampling
  • text indexing
  • top-K similarity search

Fingerprint

Dive into the research topics of 'Bidirectional String Anchors for Improved Text Indexing and Top-K Similarity Search'. Together they form a unique fingerprint.

Cite this