A Text Retrieval Package for the Unix Operating System
Liam R. E. Quin
SoftQuad Inc. (lee at sq.com)
Note: the author of this paper has moved to liamquin at interlog dot com
This section describes speculative, planned and ongoing work.
The C API
The lq-text libraries
(liblqerror,
liblqutil and liblqtext itself)
each provide a clear set of functions forming an Application Programmer's
Interface (API).
The process of tidying up the API is under way:
Documentation
The API is currently documented only by function prototypes in header files
and by examples.
Clearly this needs to change.
Completeness
The API isn't complete yet.
For example, in liblqerror there's
an Eopen() function which works like Unix open(2),
except that it provides error messages and can be made to exit on error.
However, there is no Eclose() function yet.
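As an illustration only, a close(2) counterpart in the same spirit might look like the sketch below. The name Sketch_Eclose(), its argument list and the SKETCH_E_FATAL flag are all hypothetical; they are not part of liblqerror, and the real Eopen() interface may differ.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag (not from liblqerror): exit if close(2) fails. */
#define SKETCH_E_FATAL 01

/* Close fd, printing a message on stderr if close(2) fails.
 * `name` is used only in the message and may be NULL.
 * Returns 0 on success and -1 on error; if SKETCH_E_FATAL is set
 * in `flags`, the process exits instead of returning -1.
 */
int
Sketch_Eclose(int fd, const char *name, int flags)
{
    int saved;

    if (close(fd) == 0) {
        return 0;
    }

    saved = errno; /* fprintf() may overwrite errno */
    fprintf(stderr, "cannot close \"%s\" (descriptor %d): %s\n",
            name ? name : "(unknown)", fd, strerror(saved));

    if (flags & SKETCH_E_FATAL) {
        exit(EXIT_FAILURE);
    }

    errno = saved; /* let the caller inspect the original error */
    return -1;
}
```

A caller that wants the program to stop on failure would pass SKETCH_E_FATAL; passing 0 leaves error handling to the caller.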
Consistency
The structure of the API needs to be clear enough that one would
be able to guess which library contains any given function;
this is largely but not completely true now.
Almost all functions have a prefix, such as LQT_ in
LQT_ObtainWriteAccess(), for example,
for functions from liblqtext.
A very few functions don't do this, and a few others are
actually defined in client programs rather than in the library.
Configuration and Testing
Configuration is currently a case of editing a Makefile and
a C header file, but several people have asked for something like
the GNU auto-configuration package.
An ad hoc test suite is included with the lq-text distribution,
but this needs to be made more formal, and to be run automatically when the
software is built.
A User Interface
lq-text is primarily a text retrieval engine suitable for integration
into other systems.
However, experimental user interfaces have proved popular, and
it is certainly expected that better interfaces will be provided in the
future.
X11 interface
An X11 client based on the Fresco toolkit is planned, building on the
work of Mark Chignell [Golo93],
Ed Fox et al. [Fox93] and others.
However, this work is awaiting the distribution of the
Fresco toolkit with X11R6.
Functionality
In addition to the user interface, there are some specific
features that are wanted:
Approximate matching
Currently, lq-text can perform egrep-style matches
against the vocabulary in the index; it would be interesting to extend
this to agrep-style approximate patterns, and to integrate it into
the main query language, so that
``core /^dump.*~/'' might match `core dumped',
using approximate matching only for the second word in the phrase.
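To make the idea concrete, the sketch below scans a small in-memory vocabulary for words within a given number of edits of a pattern. It is not lq-text code: the function names and the array-of-strings vocabulary are assumptions made for illustration, and it uses plain dynamic-programming edit distance rather than the bit-parallel algorithm that agrep itself uses.

```c
#include <stdlib.h>
#include <string.h>

/* Levenshtein edit distance between a and b, computed with one
 * reusable row of the dynamic-programming table.
 * Returns -1 if memory cannot be allocated.
 */
int
sketch_edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t i, j;
    int result;
    int *row = malloc((lb + 1) * sizeof *row);

    if (row == NULL) {
        return -1;
    }
    for (j = 0; j <= lb; j++) {
        row[j] = (int) j; /* distance from the empty prefix of a */
    }

    for (i = 1; i <= la; i++) {
        int diag = row[0]; /* value of D[i-1][j-1] */

        row[0] = (int) i;
        for (j = 1; j <= lb; j++) {
            int up = row[j]; /* value of D[i-1][j] */
            int best = diag + (a[i - 1] != b[j - 1]); /* match or substitute */

            if (up + 1 < best) {
                best = up + 1; /* delete a character of a */
            }
            if (row[j - 1] + 1 < best) {
                best = row[j - 1] + 1; /* insert a character of b */
            }
            row[j] = best;
            diag = up;
        }
    }

    result = row[lb];
    free(row);
    return result;
}

/* Store in `out` (at most max_out entries) pointers to the vocabulary
 * words within max_errors edits of `pattern`; return how many were stored.
 */
size_t
sketch_approx_lookup(const char *pattern,
                     const char *const *vocab, size_t n_vocab,
                     int max_errors,
                     const char **out, size_t max_out)
{
    size_t i, found = 0;

    for (i = 0; i < n_vocab && found < max_out; i++) {
        int d = sketch_edit_distance(pattern, vocab[i]);

        if (d >= 0 && d <= max_errors) {
            out[found++] = vocab[i];
        }
    }
    return found;
}
```

A linear scan like this is only reasonable because the vocabulary is far smaller than the indexed text; in a phrase query, the matching words would then be looked up in the index in the usual way.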
Complex queries
It would be useful to support
queries that are themselves complex, or that refer to the
structure of documents marked up in SGML [Stan88],
perhaps building on the work of Forbes Burkowski [Burk92].
Allowing a more complex syntax in a query has to be done carefully,
so that the language is both straightforward and general.
Handling structured documents also entails an extended query parser.
At the same time, work on Fuzzy Logic
[Zade78]
and on limited
recognition of anaphoric references is proceeding.
It may also be possible to perform experiments in clustering,
in the manner of some of the recent work at Xerox [Cutt93].
Performance
Although lq-text is already pretty fast at both retrieval
and indexing, it could certainly be made faster.
Experiments with mmap(2) and with alternate cache
algorithms are ongoing.
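For readers unfamiliar with mmap(2), the following minimal sketch maps a whole file read-only so that its contents can be read through a pointer instead of through read(2) calls. The function names are hypothetical, and nothing here reflects how lq-text actually lays out or accesses its index files.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole of `path` read-only.
 * On success, returns the base address and stores the size in *len.
 * Returns NULL on any failure, and also for an empty file, since a
 * zero-length mapping is not valid.
 */
const void *
sketch_map_file(const char *path, size_t *len)
{
    struct stat st;
    void *base;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        return NULL;
    }
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */

    if (base == MAP_FAILED) {
        return NULL;
    }
    *len = (size_t) st.st_size;
    return base;
}

/* Release a mapping obtained from sketch_map_file(). Returns 0 on success. */
int
sketch_unmap_file(const void *base, size_t len)
{
    return munmap((void *) base, len);
}
```

Whether this is faster than an explicit block cache depends on the access pattern and on the operating system's paging behaviour, which is why it needs to be measured rather than assumed.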
Run-time configuration
New parameters will include user-defined stemming
(perhaps using stemming algorithms described by W. Frakes in [Frak92]),
and allowing a partial (document-vector) index.