A Text Retrieval Package for the Unix Operating System
Liam R. E. Quin
SoftQuad Inc. (lee at sq.com)
Note: the author of this paper has moved to liamquin at interlog dot com
This section gives a brief background to some of the main approaches
to text retrieval.
Signatures
Packages based on
signatures
keep a hash of each document or block of each document.
The idea is to reduce I/O by identifying those blocks which might
contain a given word.
This method does not store enough information to match phrases precisely,
and software relying on it needs to scan documents to eliminate bad drops
[Salt88], [Falo85], [Falo87a], [Falo87b].
A widely distributed, publicly available system, zbrowser,
uses a cross between block signatures and the Document Vector method
discussed below.
This system can answer proximity-style queries (these two words in either
order, near each other) fairly well, but does not handle searching for
phrases [Zimm91].
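As an illustration only, the following small C program sketches the block
signature idea: each block of text has a fixed-width bitmap, each word is
hashed to a few bit positions in that bitmap, and a query word can occur only
in blocks where all of its bits are set.  The hash function, signature width
and number of bits per word here are arbitrary choices, not those of any
particular package.

    /* Minimal sketch of block signatures: matching blocks must still be
     * scanned, because a signature hit may be a bad drop. */
    #include <stdio.h>
    #include <string.h>

    #define SIG_BITS 256                 /* signature width per block */
    #define SIG_BYTES (SIG_BITS / 8)
    #define NHASH 3                      /* bits set per word */

    typedef struct { unsigned char bits[SIG_BYTES]; } Signature;

    static unsigned long hash(const char *s, unsigned long seed)
    {
        unsigned long h = seed;
        while (*s)
            h = h * 31 + (unsigned char) *s++;
        return h;
    }

    static void sig_add(Signature *sig, const char *word)
    {
        int i;
        for (i = 0; i < NHASH; i++) {
            unsigned long b = hash(word, i + 1) % SIG_BITS;
            sig->bits[b / 8] |= (unsigned char) (1 << (b % 8));
        }
    }

    /* Non-zero means the block *may* contain the word (possibly a bad drop). */
    static int sig_maybe_contains(const Signature *sig, const char *word)
    {
        int i;
        for (i = 0; i < NHASH; i++) {
            unsigned long b = hash(word, i + 1) % SIG_BITS;
            if (!(sig->bits[b / 8] & (1 << (b % 8))))
                return 0;
        }
        return 1;
    }

    int main(void)
    {
        Signature sig;
        memset(&sig, 0, sizeof sig);
        sig_add(&sig, "text");
        sig_add(&sig, "retrieval");

        printf("text: %d\n", sig_maybe_contains(&sig, "text"));   /* 1 */
        printf("unix: %d\n", sig_maybe_contains(&sig, "unix"));   /* 0 unless bits collide */
        return 0;
    }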
Full Text Inverted Index
A
full text inverted index
consists of
a record of every occurrence of every word, and hence is generally the
largest of the index types discussed here; it also allows the highest
accuracy (although not necessarily the highest precision; see Future Work below).
The larger index increases I/O needed for searching, but on the other
hand there is no need to scan documents for bad drops [Salt88], [Mead92].
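The following fragment of C is a minimal, illustrative sketch of a full text
inverted index, not the structure used by any particular package: every
occurrence is stored as a (document, position) posting, so a two-word phrase
can be verified from the postings alone, with no bad drop scan.  The fixed
array sizes are purely for brevity.

    /* Minimal sketch of a full text inverted index with word positions. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { int doc; int pos; } Posting;

    typedef struct {
        char word[32];
        Posting postings[64];
        int npostings;
    } Entry;

    static Entry index_[128];
    static int nentries;

    static Entry *lookup(const char *word, int create)
    {
        int i;
        for (i = 0; i < nentries; i++)
            if (strcmp(index_[i].word, word) == 0)
                return &index_[i];
        if (!create || nentries == 128 || strlen(word) >= sizeof index_[0].word)
            return NULL;
        strcpy(index_[nentries].word, word);
        return &index_[nentries++];
    }

    static void add_occurrence(const char *word, int doc, int pos)
    {
        Entry *e = lookup(word, 1);
        if (e && e->npostings < 64) {
            e->postings[e->npostings].doc = doc;
            e->postings[e->npostings].pos = pos;
            e->npostings++;
        }
    }

    /* True if word2 appears immediately after word1 in the same document. */
    static int phrase_match(const char *word1, const char *word2)
    {
        Entry *a = lookup(word1, 0), *b = lookup(word2, 0);
        int i, j;
        if (!a || !b)
            return 0;
        for (i = 0; i < a->npostings; i++)
            for (j = 0; j < b->npostings; j++)
                if (a->postings[i].doc == b->postings[j].doc &&
                    a->postings[i].pos + 1 == b->postings[j].pos)
                    return 1;
        return 0;
    }

    int main(void)
    {
        /* document 1: "unix text retrieval" */
        add_occurrence("unix", 1, 0);
        add_occurrence("text", 1, 1);
        add_occurrence("retrieval", 1, 2);

        printf("\"text retrieval\": %d\n", phrase_match("text", "retrieval"));  /* 1 */
        printf("\"retrieval text\": %d\n", phrase_match("retrieval", "text"));  /* 0 */
        return 0;
    }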
Document Vector
This strategy keeps a record of every file in which each word appears;
one could call it a
partial inverted index.
This is usually much smaller than a full text inverted index,
but cannot be used to find phrases directly without a bad drop scan.
A recent example is Glimpse; this and other examples are mentioned in
[Salt88], [Mead92], and [Orac92].
Glimpse also appears to be restricted to searching a single file
system at a time on a local machine.
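For contrast with the previous sketch, the following illustrative C fragment
keeps only the set of files containing each word, as a document vector does;
a phrase query can do no better than intersect two file sets, and the
candidate files must then be scanned to reject bad drops.  The bit-mask
representation and size limits are arbitrary.

    /* Minimal sketch of a document vector (partial inverted index). */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        char word[32];
        unsigned long files;     /* bit i set => word occurs in file i (i < 32) */
    } VectorEntry;

    static VectorEntry vec[128];
    static int nvec;

    static void note_word_in_file(const char *word, int file)
    {
        int i;
        for (i = 0; i < nvec; i++)
            if (strcmp(vec[i].word, word) == 0)
                break;
        if (i == nvec && nvec < 128)
            strcpy(vec[nvec++].word, word);
        if (i < nvec)
            vec[i].files |= 1UL << file;
    }

    /* Files that might contain the phrase "word1 word2": the intersection of
     * the two file sets.  Each candidate still needs a scan to confirm it. */
    static unsigned long phrase_candidates(const char *word1, const char *word2)
    {
        unsigned long a = 0, b = 0;
        int i;
        for (i = 0; i < nvec; i++) {
            if (strcmp(vec[i].word, word1) == 0) a = vec[i].files;
            if (strcmp(vec[i].word, word2) == 0) b = vec[i].files;
        }
        return a & b;
    }

    int main(void)
    {
        note_word_in_file("text", 3);
        note_word_in_file("retrieval", 3);
        note_word_in_file("text", 5);

        printf("candidate file mask: %#lx\n", phrase_candidates("text", "retrieval"));
        return 0;
    }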
Relational Tables
One way of implementing a full or partial inverted index is by
storing each occurrence of each word in a cell of a relational table.
This is generally by far the least efficient of the strategies discussed,
with indexes typically three times the size of the data,
but it is also the easiest to implement robustly.
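The following sketch illustrates the relational layout, with each word
occurrence held as one (word, document, position) row; because the word is
repeated in every row, the table grows much faster than a grouped posting
list.  The struct shown is only an illustration, not a real database schema.

    /* Minimal sketch of the relational approach: one row per occurrence. */
    #include <stdio.h>

    struct occurrence {
        char word[32];      /* column: word */
        int  document;      /* column: document id */
        int  position;      /* column: word offset within the document */
    };

    int main(void)
    {
        /* the phrase "text retrieval" in document 1 becomes two rows */
        struct occurrence table[] = {
            { "text",      1, 0 },
            { "retrieval", 1, 1 },
        };
        int i, n = (int) (sizeof table / sizeof table[0]);

        /* a query is a full or indexed scan over the rows */
        for (i = 0; i < n; i++)
            if (table[i].document == 1)
                printf("%s at position %d\n", table[i].word, table[i].position);
        return 0;
    }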
DFA or Patricia Tree
These systems store a data structure representing a deterministic finite
state automaton that, when executed against the query, will
reach an `accept' state representing all matches.
They are usually byte-oriented rather than word-oriented,
although they can be written either way.
It is difficult to allow updates to such an index,
and the algorithms are fairly complex.
Knuth describes Patricia trees in some detail [Knut81];
a sample in-memory implementation, Cecilia,
was described by Tsuchiya [Tsuc91];
PAT, the Oxford English Dictionary (OED) software,
also uses Patricia trees [Bray89], [Fawc89].
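As a rough illustration of the byte-oriented flavour of these indexes, the
following C fragment builds a plain character trie with an `accept' marker at
each complete word; a real Patricia tree collapses single-child chains and
stores bit offsets, as described in [Knut81], so this is a simplification
rather than a description of PAT or Cecilia.

    /* Simplified sketch only: a plain character trie for word lookup. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct TrieNode {
        struct TrieNode *child[256];   /* one branch per byte value */
        int is_word;                   /* non-zero at an `accept' state */
    } TrieNode;

    static TrieNode *node_new(void)
    {
        TrieNode *n = calloc(1, sizeof(TrieNode));
        if (!n) { perror("calloc"); exit(1); }
        return n;
    }

    static void trie_insert(TrieNode *root, const char *word)
    {
        const unsigned char *p = (const unsigned char *) word;
        while (*p) {
            if (!root->child[*p])
                root->child[*p] = node_new();
            root = root->child[*p++];
        }
        root->is_word = 1;
    }

    /* Walk the structure byte by byte; reaching a marked node is the
     * equivalent of the automaton reaching an accept state. */
    static int trie_lookup(const TrieNode *root, const char *word)
    {
        const unsigned char *p = (const unsigned char *) word;
        while (*p) {
            if (!root->child[*p])
                return 0;
            root = root->child[*p++];
        }
        return root->is_word;
    }

    int main(void)
    {
        TrieNode *root = node_new();
        trie_insert(root, "text");
        trie_insert(root, "retrieval");

        printf("text: %d\n", trie_lookup(root, "text"));   /* 1 */
        printf("tex:  %d\n", trie_lookup(root, "tex"));    /* 0 */
        return 0;
    }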
Other
Many other approaches are possible.
There are several schemes that actually replace the original data with, for
example, a multiply-threaded linked list, so that the data can be recreated
from the index.
This has an unfortunate and well-known failure mode,
in which the reconstituted text uses incorrect words.
Other schemes include sub-word indexing, either on individual bytes or on
n-grams,
although these usually fall into the `DFA' category above in terms of their
characteristics.
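As a small illustration of the n-gram variant, the following C fragment
breaks a word into overlapping character trigrams, which would then be
indexed in place of whole words; the choice of n = 3 is arbitrary.

    /* Minimal sketch of sub-word (n-gram) indexing: emit the overlapping
     * character n-grams of a word. */
    #include <stdio.h>
    #include <string.h>

    #define N 3

    static void print_ngrams(const char *word)
    {
        size_t len = strlen(word);
        size_t i;
        if (len < N) {
            printf("%s\n", word);      /* short words index as themselves */
            return;
        }
        for (i = 0; i + N <= len; i++)
            printf("%.*s\n", N, word + i);
    }

    int main(void)
    {
        print_ngrams("retrieval");     /* ret, etr, tri, rie, iev, eva, val */
        return 0;
    }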