A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com


The lq-text Design

A full-text inverted index was chosen to meet the design goals. In particular, it is the only strategy that allows accurate matching of phrases without resorting to a bad drop scan, that is, a second pass over the retrieved text to discard false matches.
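To illustrate why a full-text index can match phrases exactly, here is a minimal sketch in Python. It is not lq-text's C API: the names build_index and match_phrase, the in-memory dictionary, and the whitespace tokenizer are all assumptions made for the example. The point is only that when every word occurrence is stored with its position, a phrase can be matched from the index alone, without rescanning the documents.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (doc_id, word_position) pairs.

    Hypothetical helper: tokenizes on whitespace and folds case.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index


def match_phrase(index, phrase):
    """Return sorted (doc_id, start_position) pairs where the words
    of the phrase occur consecutively, using only the index."""
    words = phrase.lower().split()
    if not words:
        return []
    # Start from every occurrence of the first word.
    candidates = set(index.get(words[0], ()))
    # Keep a candidate only if each later word occurs at the
    # expected offset in the same document.
    for offset, word in enumerate(words[1:], start=1):
        following = set(index.get(word, ()))
        candidates = {(d, p) for (d, p) in candidates
                      if (d, p + offset) in following}
    return sorted(candidates)


docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "brown the quick fox",
}
index = build_index(docs)
hits = match_phrase(index, "quick brown")
```

Both sample documents contain the words "quick" and "brown", but only a.txt contains them as a phrase, so an index that recorded words without positions would need a second scan to reject b.txt.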

To keep the index small, however, the list of matches for each word is compressed, as described in detail in the Implementation section below.
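As a general illustration of how a sorted list of match positions can be made smaller, the following Python sketch stores the gaps between successive positions and packs each gap into as few bytes as it needs. This is a common technique chosen for the example, not necessarily the encoding lq-text uses; the function names and the 7-bits-per-byte layout are assumptions.

```python
def encode_positions(positions):
    """Delta-encode a sorted list of non-negative integers, then pack
    each gap into 7-bit groups, low-order group first. The high bit of
    a byte is set when more bytes of the same gap follow."""
    out = bytearray()
    previous = 0
    for pos in positions:
        gap = pos - previous
        previous = pos
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_positions(data):
    """Invert encode_positions, returning the original position list."""
    positions = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes of this gap follow
        else:
            current += gap      # gap complete: recover the position
            positions.append(current)
            gap = 0
            shift = 0
    return positions


positions = [3, 130, 131, 20000]
encoded = encode_positions(positions)
decoded = decode_positions(encoded)
```

In this sketch the four positions occupy six bytes, because small gaps between nearby matches fit in a single byte each.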

The package is implemented as a C API in a number of separate libraries, which are in turn used by a number of separate client programs. The programs are typically combined in a pipeline, much in the manner of the probabilistic inverted index used by refer and hunt [Lesk78].

The lq-text package includes a set of input filters for reading documents into a canonical form suitable for the indexing program lqaddfile to process; a set of search programs; and programs that take search results and deliver the corresponding text. There are also wrappers so that users don't have to remember all the individual programs. The lq-text package has been successfully integrated into a number of other systems and products [Royd93], [Hutt94].

