lq-text Unix text retrieval package


About lq-text

lq-text is a full-text retrieval package. It makes an inverted index of your files, and can then later find words or phrases in those files. This is also sometimes called a full-text (or fulltext) database, or an information retrieval system. I don't use those terms because lq-text is not ACID-compliant (database), and the phrase "information retrieval" seems overly broad.
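
As a rough illustration of the idea only (this is not lq-text's C code or its on-disk index format), here is a minimal in-memory sketch in Python of an inverted index that records word positions, so that phrases as well as single words can be found. The function names, the tokenizing rule and the sample file names are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (a deliberately simple word rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to {document name: [positions of that word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][name].append(position)
    return index


def find_phrase(index, phrase):
    """Return {document name: [start positions]} for an exact phrase."""
    words = tokenize(phrase)
    if not words:
        return {}
    matches = {}
    # Only documents that contain the first word can contain the phrase.
    for name, starts in index.get(words[0], {}).items():
        hits = [
            start
            for start in starts
            # Every later word must appear at the next consecutive position.
            if all(
                start + offset in index.get(word, {}).get(name, ())
                for offset, word in enumerate(words[1:], start=1)
            )
        ]
        if hits:
            matches[name] = hits
    return matches


# Hypothetical sample files, invented for this example.
documents = {
    "serial.txt": "Connect the modem to the serial port. The serial port is slow.",
    "usb.txt": "The USB port replaces the serial connector.",
}
index = build_index(documents)
print(find_phrase(index, "serial port"))  # {'serial.txt': [5, 8]}
```

Storing positions rather than just file names is what makes phrase matching possible without rereading the files; a real package also has to keep the index on disk and compact, which this sketch ignores.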

lq-text can produce extracts from the files it matches, for example highlighting every occurrence of matched words, or producing one-line keyword-in-context (KWIC) indexes.
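
To show what a keyword-in-context listing looks like, here is a small Python sketch under stated assumptions: the function name, the bracket highlighting and the fixed context width are invented for the example and are not taken from lq-text's own tools or output format.

```python
import re


def kwic_lines(text, keyword, width=30):
    """Return one line per occurrence of keyword, with the keyword column aligned."""
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        # Up to `width` characters of context on each side of the match.
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so every match starts in the same column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


sample = "The serial port is slow. Use a faster serial cable."
for line in kwic_lines(sample, "serial", width=10):
    print(line)
```

Each output line shows one match with a little text on either side, and lining the matches up in one column is what makes a long list of hits quick to scan.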

A higher-level summary might say that lq-text is a search engine used in text information retrieval systems, based on a full-text index.

You can read the FAQ here.

lq-text was first released in 1989 and posted to comp.sources.unix. It's about 30,000 lines of C.

I am no longer actively developing the software, although I will help out if you want to use it, and I do still fix bugs and add occasional features.

Papers, Documentation and Examples

There is a 1994 Usenix paper about it here. The paper also includes an example command-line session as an appendix.

There is a sample CGI script to search approx. 260 MBytes of Linux documentation here; you can also see the (somewhat messy) source of that CGI script.

License

The software is free for non-commercial use, and is distributed under the barefoot licence. For commercial use contact the author, Liam Quin, liam at holoweb.net; support is also available, whether or not you paid for the software.

As of June 2001, lq-text is also available under the LGPL on request.

Contributed Software

Currently there is a program to make a hyperlinked glossary, and also some CGI scripts. I'll be putting these up for ftp separately.

[Feb 2000] RPMs for Red Hat 6, and a simple GNOME interface, are in the works. The Red Hat 6 port now passes all the tests, and indexes all of the HOWTOs and man pages in under 5 minutes on a PII 266 laptop. A search for "serial port" produces 684 matches in 0.34 seconds.

Download

freebsd binaries

lq-text1.17.tgz (current release)

Binary RPM for Mandrake Linux 9.2 (cooker)

distribution directory with all archived releases, old and new.

I'm still working on the RPM; it installs into /usr/local/ for now, and does not include the development stuff (headers, libraries, documentation). If there's interest, I'll make a devel package. I have not yet made 1.18 into a formal release; there are very minor differences from 1.17, mostly some small bug fixes and Linux porting changes, and the RPM spec file.

Current Status

I'm looking for help:

a new name for the package, and then a logo!

autoconf or some other config script; people didn't like mine. Rich Salz has contributed some autoconf work so this will be part of release 1.19 later this year (2005)

use parseargs() or some other argument package

integrate the CGI scripts

improve/complete the HTML documentation

help with the license and a CVS repository to make lq-text into an open source project; right now it uses a modified BSD license.

the code is mature, but there are a number of missing features and places where it could be faster, even after years of careful study of profiling data.

How you can help

Send email to liam at holoweb dot net with lq-text as the subject. Mention the colour of socks you are wearing, to get past my filters.

Where has lq-text been used?

The package has been used to power intranet search engines in Fortune 500 companies, for ecommerce information management (whatever that is!), for automated retrieval over the internet in conjunction with crawler bots, to index Usenet articles (a sort of local Usenet search engine software), as a free search engine incorporated into other software, as part of online document retrieval services in a document and information management project, as the back end in document management servers, as part of a source code management and browsing system, as a personal research aid for text processing, and many other applications.

I feel like a keyword whore using all those generic terms, but I'm not able to name some of the companies involved, so I decided to stay generic.

Further reading

There are a lot of books and papers on text retrieval, and even conferences devoted to it.

Here are some books I have found useful:

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto; this is a superb book for implementors.

Advanced Unix Programming by Marc Rochkind; I have linked to the second edition, although I used the first. This book is very helpful in teaching programmers about error handling and writing robust Unix software.

GNU Autoconf, Automake, and Libtool by Gary V. Vaughan, Ben Elliston, Tom Tromey and Ian Lance Taylor; I'm not sure there's anything that can be done to make the GNU auto* tools humanly manageable, but this is the best weapon I've seen so far.