A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com


Introduction

The main reason for developing lq-text was to provide an inexpensive (or free) package that could index and search fairly large corpora, and that would integrate well with other Unix tools.

Low Cost Solution

There are already a number of commercial text retrieval packages available for the Unix operating system. However, the prices for these packages range from Cdn$30,000 to well over $150,000. In addition, the packages are not always available on any given platform. A few packages were freely available for Unix at the time the project was started, but generally had severe limitations, as mentioned below.

A tool for searching large corpora

Some of the freely available tools used grep to search the data. While lq-text is O(n) on the number of matches, irrespective of the size of the data, grep is O(n) on the size of the data, irrespective of the number of matches. This limits the maximum speed of the system, and searching for an unusual term in a large database of several gigabytes would be infeasible with grep. Other packages had limits such as 32767 files indexed or 65535 bytes per file. Tools available on, or ported from, MS/DOS were particularly likely to suffer from this malaise.

Unix-based

Unix has always been a productive text processing environment, and one would want to be able to use any new text processing tools in combination with other tools in that environment.

Results and Benefits

Timings are given in detail below, along with a perspective on how the above goals were met. As a brief example, a search over the SunOS 4.1.1 manual pages for core dump on a Sun 4/110 took 52 seconds with grep, and 1.3 seconds with lq-text. In addition, lq-text automatically finds matches of a phrase even if there is a newline or punctuation in the text between two words. It is also possible to combine lq-text searches, finding documents containing all of a number of phrases. Complex queries can be built up incrementally.

Next   Top