A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com

Design Goals

The main goals of any text retrieval package were outlined briefly in the introduction. lq-text was developed with some more specific goals in mind, described in the following sections.

Limited Memory

Developed originally on a 4MByte 386 under 386/ix, lq-text does not assume that any of its index files will fit completely into memory. As a result, indexing performance does not degrade significantly if the data does not fit into main, or even virtual, memory.

Offline Storage

Once files are indexed, lq-text does not need to consult them again during the searching process. The result of a query is a list of matching files, together with locations within those files. The original text files are needed if the user wants to view the matched documents, but it turns out that the file names and document titles are often sufficient.

Matching With Certainty

Packages that use hashing or probablistic techniques often return results that might match the user's query. A `bad drop scan' is then used to reject the false hits [Lesk78]. These techniques are incompatible with the Offline Storage requirement, since a bad drop scan may not be possible.

Accurate Phrase Matching

The package should be able to match a phrase word for word, including getting the words in the right order and coping with uneven whitespace. It should also be able to match capitalisation and punctuation at least approximately, so that the user can constrain a search on someone's name, Brown for example, to match Brown but not brown in the text. Only the first letter of each word is inspected for capitalisation, in order to minimise the data stored in the index. This is sufficient for most queries.

In the literature, the terms recall and precision are used to refer to the proportion of relevant matches retrieved and the proportion of retrieved matches that are relevant, respectively; the goal of lq-text is to have very high precision, and to give the user some control over the recall. These terms are defined more precisely in the literature [Clev66], [Salt88], and are not discussed further in this paper. The term accuracy is used loosely in this paper to refer to literal exactness of a match - for example, whether Ankle matches ankles as well as ankle. In an inaccurate system, Ankle might also match a totally dissimilar word such as yellow. The information that lq-text stores in the database enables it to achieve a high level of accuracy in searching for phrases.

Updatable Index

It should be possible to add new documents to the index at any time. It should also be possible to unindex documents, or to update documents in place, as well as to inform the package when documents are renamed or moved.

Unix toolkit approach

It should be possible to manipulate search results with standard Unix tools such as awk and sed. This must be done in a way consistent with the Offline Storage requirement. For example, the user should be able to see the first seven matches using head -7 or sed 7q without having to search the documents themselves.

Summary

This paper shows how some of the goals have been met, and indicates where work is still in progress on other goals.

All of the original design goals are still felt to be relevant. With continued improvements in price/performance ratios of disks, the offline storage goal may become less important, but the design philosophy is still very important for CD/ROM work.

Next Top