A Text Retrieval Package for the Unix Operating System
Liam R. E. Quin
SoftQuad Inc. (lee at sq.com)
Note: the author of this paper has moved to liamquin at interlog dot com
This section gives a brief background to some of the main approaches
to text retrieval.
Signatures
Packages based on
signatures
keep a hash of each document or block of each document.
The idea is to reduce I/O by identifying those blocks which might
contain a given word.
This method does not store enough information to match phrases precisely,
and software relying on it needs to scan documents to eliminate bad drops
[Salt88], [Falo85], [Falo87a], [Falo87b].
A widely distributed, publicly available system, zbrowser,
uses a cross between block signatures and the Document Vector method
discussed below.
This system can answer proximity-style queries (these two words in either
order, near each other) fairly well, but does not handle searching for
phrases [Zimm91].
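As an illustration only, the following small C program sketches the block
signature idea: each block of text has a fixed-width bitmap, each word is
hashed to a few bit positions in that bitmap, and a query word can occur only
in blocks where all of its bits are set.  The hash function, signature width
and number of bits per word here are arbitrary choices, not those of any
particular package.

    /* Minimal sketch of block signatures: matching blocks must still be
     * scanned, because a signature hit may be a bad drop. */
    #include <stdio.h>
    #include <string.h>

    #define SIG_BITS 256                 /* signature width per block */
    #define SIG_BYTES (SIG_BITS / 8)
    #define NHASH 3                      /* bits set per word */

    typedef struct { unsigned char bits[SIG_BYTES]; } Signature;

    static unsigned long hash(const char *s, unsigned long seed)
    {
        unsigned long h = seed;
        while (*s)
            h = h * 31 + (unsigned char) *s++;
        return h;
    }

    static void sig_add(Signature *sig, const char *word)
    {
        int i;
        for (i = 0; i < NHASH; i++) {
            unsigned long b = hash(word, i + 1) % SIG_BITS;
            sig->bits[b / 8] |= (unsigned char) (1 << (b % 8));
        }
    }

    /* Non-zero means the block *may* contain the word (possibly a bad drop). */
    static int sig_maybe_contains(const Signature *sig, const char *word)
    {
        int i;
        for (i = 0; i < NHASH; i++) {
            unsigned long b = hash(word, i + 1) % SIG_BITS;
            if (!(sig->bits[b / 8] & (1 << (b % 8))))
                return 0;
        }
        return 1;
    }

    int main(void)
    {
        Signature sig;
        memset(&sig, 0, sizeof sig);
        sig_add(&sig, "text");
        sig_add(&sig, "retrieval");

        printf("text: %d\n", sig_maybe_contains(&sig, "text"));   /* 1 */
        printf("unix: %d\n", sig_maybe_contains(&sig, "unix"));   /* 0 unless bits collide */
        return 0;
    }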
Full Text Inverted Index
A
full text inverted index
consists of
a record of every occurrence of every word, and hence is generally the
largest of the index types discussed here; it also allows the highest
accuracy (although not necessarily the highest precision; see Future Work below).
The larger index increases I/O needed for searching, but on the other
hand there is no need to scan documents for bad drops [Salt88], [Mead92].
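The following fragment of C is a minimal, illustrative sketch of a full text
inverted index, not the structure used by any particular package: every
occurrence is stored as a (document, position) posting, so a two-word phrase
can be verified from the postings alone, with no bad drop scan.  The fixed
array sizes are purely for brevity.

    /* Minimal sketch of a full text inverted index with word positions. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { int doc; int pos; } Posting;

    typedef struct {
        char word[32];
        Posting postings[64];
        int npostings;
    } Entry;

    static Entry index_[128];
    static int nentries;

    static Entry *lookup(const char *word, int create)
    {
        int i;
        for (i = 0; i < nentries; i++)
            if (strcmp(index_[i].word, word) == 0)
                return &index_[i];
        if (!create || nentries == 128 || strlen(word) >= sizeof index_[0].word)
            return NULL;
        strcpy(index_[nentries].word, word);
        return &index_[nentries++];
    }

    static void add_occurrence(const char *word, int doc, int pos)
    {
        Entry *e = lookup(word, 1);
        if (e && e->npostings < 64) {
            e->postings[e->npostings].doc = doc;
            e->postings[e->npostings].pos = pos;
            e->npostings++;
        }
    }

    /* True if word2 appears immediately after word1 in the same document. */
    static int phrase_match(const char *word1, const char *word2)
    {
        Entry *a = lookup(word1, 0), *b = lookup(word2, 0);
        int i, j;
        if (!a || !b)
            return 0;
        for (i = 0; i < a->npostings; i++)
            for (j = 0; j < b->npostings; j++)
                if (a->postings[i].doc == b->postings[j].doc &&
                    a->postings[i].pos + 1 == b->postings[j].pos)
                    return 1;
        return 0;
    }

    int main(void)
    {
        /* document 1: "unix text retrieval" */
        add_occurrence("unix", 1, 0);
        add_occurrence("text", 1, 1);
        add_occurrence("retrieval", 1, 2);

        printf("\"text retrieval\": %d\n", phrase_match("text", "retrieval"));  /* 1 */
        printf("\"retrieval text\": %d\n", phrase_match("retrieval", "text"));  /* 0 */
        return 0;
    }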
Document Vector
This strategy keeps a record of every file in which each word appears;
one could call it a
partial inverted index.
This is usually much smaller than a full text inverted index,
but cannot be used to find phrases directly without a bad drop scan.
A recent example is Glimpse; this and other examples are mentioned in
[Salt88], [Mead92], and [Orac92].
Glimpse also appears to be restricted to searching a single file
system at a time on a local machine.
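For contrast with the previous sketch, the following illustrative C fragment
keeps only the set of files containing each word, as a document vector does;
a phrase query can do no better than intersect two file sets, and the
candidate files must then be scanned to reject bad drops.  The bit-mask
representation and size limits are arbitrary.

    /* Minimal sketch of a document vector (partial inverted index). */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        char word[32];
        unsigned long files;     /* bit i set => word occurs in file i (i < 32) */
    } VectorEntry;

    static VectorEntry vec[128];
    static int nvec;

    static void note_word_in_file(const char *word, int file)
    {
        int i;
        for (i = 0; i < nvec; i++)
            if (strcmp(vec[i].word, word) == 0)
                break;
        if (i == nvec && nvec < 128)
            strcpy(vec[nvec++].word, word);
        if (i < nvec)
            vec[i].files |= 1UL << file;
    }

    /* Files that might contain the phrase "word1 word2": the intersection of
     * the two file sets.  Each candidate still needs a scan to confirm it. */
    static unsigned long phrase_candidates(const char *word1, const char *word2)
    {
        unsigned long a = 0, b = 0;
        int i;
        for (i = 0; i < nvec; i++) {
            if (strcmp(vec[i].word, word1) == 0) a = vec[i].files;
            if (strcmp(vec[i].word, word2) == 0) b = vec[i].files;
        }
        return a & b;
    }

    int main(void)
    {
        note_word_in_file("text", 3);
        note_word_in_file("retrieval", 3);
        note_word_in_file("text", 5);

        printf("candidate file mask: %#lx\n", phrase_candidates("text", "retrieval"));
        return 0;
    }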
Relational Tables
One way of implementing a full or partial inverted index is by
storing each occurrence of each word in a cell of a relational table.
This is generally by far the least efficient of the strategies discussed,
with indexes typically three times the size of the data,
but it is also the easiest to implement robustly.
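The following sketch illustrates the relational layout, with each word
occurrence held as one (word, document, position) row; because the word is
repeated in every row, the table grows much faster than a grouped posting
list.  The struct shown is only an illustration, not a real database schema.

    /* Minimal sketch of the relational approach: one row per occurrence. */
    #include <stdio.h>

    struct occurrence {
        char word[32];      /* column: word */
        int  document;      /* column: document id */
        int  position;      /* column: word offset within the document */
    };

    int main(void)
    {
        /* the phrase "text retrieval" in document 1 becomes two rows */
        struct occurrence table[] = {
            { "text",      1, 0 },
            { "retrieval", 1, 1 },
        };
        int i, n = (int) (sizeof table / sizeof table[0]);

        /* a query is a full or indexed scan over the rows */
        for (i = 0; i < n; i++)
            if (table[i].document == 1)
                printf("%s at position %d\n", table[i].word, table[i].position);
        return 0;
    }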
DFA or Patricia Tree
These systems store a data structure representing a deterministic finite
state automaton that, when executed against the query, will
reach an `accept' state representing all matches.
They are usually byte-oriented rather than word-oriented,
although they can be written either way.
It is difficult to allow updates to such an index,
and the algorithms are fairly complex.
Knuth describes Patricia trees in some detail [Knut81];
a sample in-memory implementation, Cecilia,
was described by Tsuchiya [Tsuc91];
PAT, the Oxford English Dictionary (OED) software,
also uses Patricia trees [Bray89], [Fawc89].
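As a rough illustration of the byte-oriented flavour of these indexes, the
following C fragment builds a plain character trie with an `accept' marker at
each complete word; a real Patricia tree collapses single-child chains and
stores bit offsets, as described in [Knut81], so this is a simplification
rather than a description of PAT or Cecilia.

    /* Simplified sketch only: a plain character trie for word lookup. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct TrieNode {
        struct TrieNode *child[256];   /* one branch per byte value */
        int is_word;                   /* non-zero at an `accept' state */
    } TrieNode;

    static TrieNode *node_new(void)
    {
        TrieNode *n = calloc(1, sizeof(TrieNode));
        if (!n) { perror("calloc"); exit(1); }
        return n;
    }

    static void trie_insert(TrieNode *root, const char *word)
    {
        const unsigned char *p = (const unsigned char *) word;
        while (*p) {
            if (!root->child[*p])
                root->child[*p] = node_new();
            root = root->child[*p++];
        }
        root->is_word = 1;
    }

    /* Walk the structure byte by byte; reaching a marked node is the
     * equivalent of the automaton reaching an accept state. */
    static int trie_lookup(const TrieNode *root, const char *word)
    {
        const unsigned char *p = (const unsigned char *) word;
        while (*p) {
            if (!root->child[*p])
                return 0;
            root = root->child[*p++];
        }
        return root->is_word;
    }

    int main(void)
    {
        TrieNode *root = node_new();
        trie_insert(root, "text");
        trie_insert(root, "retrieval");

        printf("text: %d\n", trie_lookup(root, "text"));   /* 1 */
        printf("tex:  %d\n", trie_lookup(root, "tex"));    /* 0 */
        return 0;
    }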
Other
Many other approaches are possible.
There are several schemes that actually replace the original data with, for
example, a multiply-threaded linked list, so that the data can be recreated
from the index.
This has an unfortunate and well-known failure mode,
in which the reconstituted text uses incorrect words.
Other schemes include sub-word indexing, either on individual bytes or on
n-grams,
although these usually fall into the `DFA' category above in terms of their
characteristics.
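As a small illustration of the n-gram variant, the following C fragment
breaks a word into overlapping character trigrams, which would then be
indexed in place of whole words; the choice of n = 3 is arbitrary.

    /* Minimal sketch of sub-word (n-gram) indexing: emit the overlapping
     * character n-grams of a word. */
    #include <stdio.h>
    #include <string.h>

    #define N 3

    static void print_ngrams(const char *word)
    {
        size_t len = strlen(word);
        size_t i;
        if (len < N) {
            printf("%s\n", word);      /* short words index as themselves */
            return;
        }
        for (i = 0; i + N <= len; i++)
            printf("%.*s\n", N, word + i);
    }

    int main(void)
    {
        print_ngrams("retrieval");     /* ret, etr, tri, rie, iev, eva, val */
        return 0;
    }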