Compressed Self-Indices

Fuller-Text Indexing

In the last few years, a number of techniques for efficiently indexing unstructured data have arisen, which make sophisticated pattern matching possible in a small memory and CPU footprint. These methods put word-based techniques such as inverted indices to shame, because they make it possible to find any substring in a text, and allow us to build approximate matching, wildcard, or regexp matching systems over very large corpora.

Because these indexes contain enough information to reconstitute the text from which they were built, they are called self-indexes. The total size of a self-index is often close to the compressed size of the source, so we have the remarkable ability to make text both searchable and retrievable in less space than it takes to store it.

These methods essentially all rely on the same concept, which can be called context sorting: sorting the contexts (suffixes) beginning at each character position in the source text lexicographically. For instance, if we take the string "abracadabra", we get a list as follows:

Position (i)SuffixSuffix Array&PsiL
1 a 11 3 r
2 abra 8 6 d
3 abracadabra 1 7* a *
4 acadabra 4 8 r
5 adabra 6 9 c
6 bra 9 10 a
7 bracadabra 2 11 a
8 cadabra 5 5 a
9 dabra 7 2 a
10 ra 10 1 b
11 racadabra 3 4 b

To the right, we list the different representations of the list. The first, the Suffix Array, identifies each entry by its starting position in the text. The second, &Psi (Psi), links each suffix to the next, for instance, "dabra" links to "abra", and so on. This linking is circular, so the end points to the beginning; to avoid confusion, we mark the first character with "*". The last representation, called L in the BWT, records each character to the left (also circularly) of the suffix in question, for instance, "d" is put at the position of "abra".

All of these forms can be converted from one to the other. For instance, &Psi can be converted to a suffix array by starting at the first, "*" entry and following the links through the entire string, labelling entries 1,2,3...etc.

Of these representations, &Psi, also called "suffix linking", "the neighbor function", or "the F-to-L map", is probably the easiest to understand. Each element in &Psi links to the next, and its position indicates the character it represents (all the a's are in a contiguous range, followed by the b's, etc.). Decoding text from &Psi is like traversing a linked list, except with some overhead in decoding each pointer from the compressed array, and looking up the character for each position.

The useful feature of &Psi is that, for each character in the alphabet, all the suffixes beginning with that character are in a contiguous range in the array, and the suffix pointers within that range are in ascending order. For each character c in the alphabet &Sigma, there is a sorted vector of suffix positions.

Searching

Searching these representations for a substring exploits the context-sorting to find prefix matches in the list (each substring is the prefix of a suffix, as it were). Since sorting groups identical prefixes together, one approach is a binary search of the list, to find the range of suffixes matching the pattern. Manber and Myers used this technique in their original paper on the Suffix Array.

Indexes based on &Psi and L can use the right-to-left search technique, which relies on back-tracing suffix links. For instance, given the range of matches for "dabra", we find out if any links point to this range from entries which start with "a", indicating a match for "adabra". Position 5 above contains such a match (in general, we search for a range of matches at each step). The back-tracing step turns out to be very easy in &Psi.

Locating

The hitch with these representations is that, while they allow easy access to the context (characters surrounding) each match, they don't contain much information about the location of the match relative to the beginning of the text, or the document ID in multiple-document collections, etc. The problem is akin to being given a random element in a linked list and being asked for its distance from the head or tail. As with linked lists, a few strategies can be tried. One is to tag every k-th element or so with an explicit location value, and require traversal from each match to a tagged node.

Grossi and Vitter use a strategy more like skip lists, where some elements have pointers which skip ahead by more than one character. Their compressed suffix array is like a &Psi array, but it is partitioned into levels, with each element in the original array assigned to one level. On the first level, only suffixes with odd positions are included. On the second level, each suffix link traverses two positions ahead, to the next even-offset suffix, and the third level contains 4-aligned suffixes, and so forth up to a small number of levels (lg lg n). From a given suffix element, one traverses to the next 2-boundary, then from there to the next 4-boundary, etc. in the string until the top level is reached, which contains explicit locations.

This structure works by using bitmaps to specify the interleaving of the levels, so the first level suffixes can be shuffled into the second level ones, etc., recovering the original suffix-sorted order of all the elements. Looking up an element at index i is a matter of checking the first-level bitmap to see if that element is in the first level (its bit is 0) or in the second (1). If it's in the second, bit-counting is used to find the index at that level, and so forth recursively up the levels.

The bit operations rely on Succinct Dictionaries, which have well-understood bounds on access, rank (the number of 1 bits prior to bit j), and select (find the bit j with rank k). Grossi, Gupta and Vitter use the same structures in their Wavelet Tree construction.

References

High-Order Entropy-Compressed Text Indexes Roberto Grossi, Ankur Gupta, Jeffrey Scott Vitter Proc. 14th Annual ACM-SIAM Symp. on Discrete Algorithms (SODA'03)
This method uses the &Psi representation, with a hierarchical tree structure to the ordered sublists. The wavelet tree is a particularly interesting representation for &Psi-type permutations.

R. Grossi, A. Gupta, and J. S. Vitter. ``When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications,'' ACM Transactions on Algorithms, to appear. This is the latest and greatest on Grossi and Vitter's compressed suffix trees. Some of the descriptions may be clearer in this round.

Ferragina and Manzini's FM Index, which showed how BWT-compressed, L-representation data can also serve as an index. This format is closest to bzip2 or other BWT compressors, using L, MTF (Move-to-Front), and entropy coding.

U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, pages 935-948, 1993. (Not online).
The first to shrink the suffix tree to about 4x the text size.

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices (with Wing-Kai Hon and Kunihiko Sadakane). In the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2003.

A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays
How suffix sorting should really be done, inserting into the &Psi representation.

My Article in DDJ lays out the basics of searching &Psi, using the right-to-left search of Ferragina and Manzini. Unfortunately I cannot make this article available online outside of DDJ, although email requests are OK.