A Text Retrieval Package for the Unix Operating System
Liam R. E. Quin
SoftQuad Inc. (lee at sq.com)
Note: the author of this paper has moved to liamquin at interlog dot com
The main goals of any text retrieval package were outlined briefly
in the introduction.
lq-text was developed with some more specific goals in mind,
described in the following sections.
Limited Memory
Developed originally on a 4MByte 386 under 386/ix,
lq-text does not assume that any of its index files will fit
completely into memory.
As a result, indexing performance does not degrade significantly
if the data does not fit into main, or even virtual, memory.
Offline Storage
Once files are indexed, lq-text does not need to consult them again
during the searching process.
The result of a query is a list of matching files, together with locations
within those files.
The original text files are needed if the user wants to view the
matched documents, but
it turns out that the file names and document titles are often
sufficient.
Matching With Certainty
Packages that use hashing or probablistic techniques often return results
that might match the user's query.
A `bad drop scan' is then used to reject the false hits [Lesk78].
These techniques are incompatible with the Offline Storage requirement,
since a bad drop scan may not be possible.
Accurate Phrase Matching
The package should be able to match a phrase word for word,
including getting the words in the right order and coping with uneven
whitespace.
It should also be able to match capitalisation and punctuation at least
approximately, so that the user can constrain a search on someone's name,
Brown
for example, to match Brown but not brown in the text.
Only the first letter of each word is inspected for capitalisation, in order to minimise the data stored in the index.
This is sufficient for most queries.
In the literature, the terms
recall
and
precision
are used to refer to the proportion of relevant matches retrieved
and the proportion of retrieved matches that are relevant,
respectively;
the goal of lq-text is to have very high precision,
and to give the user some control over the recall.
These terms are defined more precisely in the literature
[Clev66], [Salt88],
and are not discussed further in this paper.
The term
accuracy
is used loosely in this paper to refer to literal exactness of a
match - for example, whether Ankle matches
ankles as well as ankle.
In an inaccurate system, Ankle might also match a
totally dissimilar word such as yellow.
The information that lq-text stores in the database enables
it to achieve a high level of accuracy in searching for phrases.
Updatable Index
It should be possible to add new documents to the index at
any time.
It should also be possible to
unindex
documents,
or to
update
documents in place, as well as to inform the
package when documents are renamed or moved.
Unix toolkit approach
It should be possible to manipulate search results with
standard Unix tools such as awk and sed.
This must be
done in a way consistent with the Offline Storage requirement.
For example, the user should be able to see the first seven matches
using
head -7
or
sed 7q
without having to search the documents themselves.
Summary
This paper shows how some of the goals have been met,
and indicates where work is still in progress on other goals.
All of the original design goals are still felt to be relevant.
With continued improvements in price/performance ratios of disks,
the offline storage goal may become less important, but
the design philosophy is still very important for CD/ROM work.
Next Top