A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com

Ongoing and Future Work

This section describes speculative, planned and ongoing work.

The C API

The lq-text libraries (liblqerror, liblqutil and liblqtext itself) each provide a clear set of functions forming an Application Programmer's Interface (API). The process of tidying up the API is under way:

Documentation

The API is currently documented only by function prototypes in header files and by examples. Clearly this needs to change.

Completeness

The API isn't complete yet. For example, in liblqerror there's an Eopen() function which works like Unix open(2), except that it provides error messages and can be made to exit on error. However, there is no Eclose() function yet.
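As an illustration of the gap, an Eclose() counterpart might look like the sketch below. Only the name Eclose comes from the text above; the signature, the E_FATAL flag and the message format are assumptions made for illustration and are not taken from liblqerror.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag: exit instead of returning when close(2) fails. */
#define E_FATAL 01

/*
 * Hypothetical Eclose(): close fd and report any failure on stderr.
 * "name" is used only in the error message and must not be NULL.
 * Returns whatever close(2) returned; errno is preserved on failure.
 */
int
Eclose(int fd, const char *name, int flags)
{
    int result;

    assert(name != NULL);
    result = close(fd);

    if (result < 0) {
        int saved = errno;      /* fprintf() may overwrite errno */

        fprintf(stderr, "cannot close \"%s\" (descriptor %d): %s\n",
            name, fd, strerror(saved));
        if (flags & E_FATAL) {
            exit(EXIT_FAILURE);
        }
        errno = saved;
    }
    return result;
}
```

A caller that cannot continue after a failed close would pass E_FATAL; one that wants to recover passes 0 and checks the return value.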

Consistency

The structure of the API needs to be clear enough that one would be able to guess which library contains any given function; this is largely but not completely true now. Almost all functions have a prefix that identifies their library, such as the LQT_ in LQT_ObtainWriteAccess() for functions from liblqtext. A very few functions lack a prefix, and a few others are actually defined in client programs rather than in the library.

Configuration and Testing

Configuration is currently a case of editing a Makefile and a C header file, but several people have asked for something like the GNU auto-configuration package.

An ad hoc test suite is included with the lq-text distribution, but this needs to be made more formal, and to be run automatically when the software is built.

A User Interface

lq-text is primarily a text retrieval engine suitable for integration into other systems. However, experimental user interfaces have proved popular, and it is certainly expected that better interfaces will be provided in the future.

X11 interface

An X11 client based on the Fresco toolkit is planned, building on the work of Mark Chignell [Golo93], Ed Fox et al. [Fox93] and others. However, this work is awaiting the distribution of the Fresco toolkit with X11R6.

Functionality

In addition to the user interface, there are some specific features that are wanted:

Approximate matching

Currently, lq-text can perform egrep-style matches against the vocabulary in the index. It would be interesting to extend this to agrep-style approximate patterns, and to integrate it into the main query language, so that ``core /^dump.*~/'' might match `core dumped', using approximate matching only for the second word in the phrase.
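One simple way to picture approximate matching against a vocabulary is to compare the query word with each vocabulary entry by edit distance and keep the entries within a small error budget. The sketch below does this with plain dynamic-programming Levenshtein distance, which is not the bit-parallel algorithm agrep uses; the names edit_distance and approx_lookup, and the array-of-strings vocabulary, are hypothetical and are not part of lq-text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance between a and b, or -1 if memory runs out. */
int
edit_distance(const char *a, const char *b)
{
    size_t la, lb, i, j;
    int *row, result;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);
    row = malloc((lb + 1) * sizeof *row);
    if (row == NULL) {
        return -1;
    }

    /* Row 0: turning "" into the first j characters of b takes j inserts. */
    for (j = 0; j <= lb; j++) {
        row[j] = (int) j;
    }

    for (i = 1; i <= la; i++) {
        int diag = row[0];              /* previous row, column j - 1 */

        row[0] = (int) i;
        for (j = 1; j <= lb; j++) {
            int up = row[j];            /* previous row, column j */
            int best = diag + (a[i - 1] != b[j - 1]);   /* match or substitute */

            if (up + 1 < best) {
                best = up + 1;          /* delete a[i - 1] */
            }
            if (row[j - 1] + 1 < best) {
                best = row[j - 1] + 1;  /* insert b[j - 1] */
            }
            row[j] = best;
            diag = up;
        }
    }
    result = row[lb];
    free(row);
    return result;
}

/*
 * Store in matches[] every vocabulary word within max_errors edits of
 * pattern, and return how many were found.  matches[] needs room for
 * nwords pointers.
 */
size_t
approx_lookup(const char *pattern, const char *const vocab[], size_t nwords,
    int max_errors, const char *matches[])
{
    size_t i, found = 0;

    for (i = 0; i < nwords; i++) {
        int d = edit_distance(pattern, vocab[i]);

        if (d >= 0 && d <= max_errors) {
            matches[found++] = vocab[i];
        }
    }
    return found;
}
```

A linear scan like this costs time proportional to the vocabulary size for every approximate term, so a real implementation would probably want to narrow the candidates first, for example with the anchored prefix in the pattern.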

Complex queries

The aim is to support queries that are themselves complex, or that refer to the structure of documents stored marked up in SGML format [Stan88], perhaps building on the work of Forbes Burkowski [Burk92]. Allowing a more complex syntax in a query has to be done carefully, so that the language is both straightforward and general. Handling structured documents also entails an extended query parser. At the same time, work on Fuzzy Logic [Zade78] and on limited recognition of anaphoric references is proceeding. It may also be possible to perform experiments in clustering, in the manner of some of the recent work at Xerox [Cutt93].

Performance

Although lq-text is already pretty fast at both retrieval and indexing, it could certainly be made faster. Experiments with mmap(2) and with alternate cache algorithms are ongoing.
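As a rough illustration of the kind of experiment involved, the sketch below maps a file read-only with mmap(2) and scans it in place, rather than copying it through read(2) buffers. The function name count_byte_mapped and the byte-counting task are hypothetical; nothing here is taken from the lq-text sources.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count occurrences of byte c in the named file by mapping it read-only.
 * Returns the count, or -1 if the file cannot be opened or mapped.
 */
long
count_byte_mapped(const char *path, unsigned char c)
{
    struct stat st;
    unsigned char *p;
    long count = 0;
    off_t i;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* mmap() rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    if (p == MAP_FAILED) {
        return -1;
    }

    for (i = 0; i < st.st_size; i++) {
        if (p[i] == c) {
            count++;
        }
    }
    munmap(p, (size_t) st.st_size);
    return count;
}
```

Whether this beats buffered reads depends on the access pattern and the operating system's paging behaviour, so it would need to be measured case by case.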

Run-time configuration

New parameters will include user-defined stemming (perhaps using stemming algorithms described by W. Frakes in [Frak92]), and allowing a partial (document-vector) index.
