This is a brief description of lq-text, Liam Quin's text retrieval system.
If you have more questions, feel free to mail me at
	{utzoo, utai}!sq!lee, or lee@sq.com  (Liam R. E. Quin)

This file is to help people decide if they want to look at lq-text, and,
in particular, whether they want to go to the effort of compiling it...


``lq-text'' is the name of a software package which lets you make an
index of files and then, at some later date, identify files by their
contents.  For example, you might find the names of all files containing
the phrase `context turns objects into relation'.
You can also gather statistics about word frequency and distribution.

Now, you could be thinking about grep at this point...  grep "context
turns objects into relation" won't find this sentence (lq-text will!).
A recent suggestion was to build a file containing all of the words,
and to use grep on that.  But the resulting index is not so much smaller
than the original file, and you still can't find phrases.
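
To make the grep point concrete, here is a minimal sketch in C.  It is
not the algorithm lq-text uses; the function names (line_search,
phrase_search) are made up for this illustration.  The first function
looks for the phrase inside single lines, as a line-oriented tool does;
the second compares word by word and treats any run of non-alphanumeric
characters, including a newline, as a separator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Line-oriented search: the phrase must lie entirely within one line.
 * Returns 1 on a match, 0 otherwise. */
int line_search(const char *text, const char *phrase)
{
    size_t plen;

    assert(text != NULL && phrase != NULL);
    plen = strlen(phrase);
    while (*text != '\0') {
        const char *eol = strchr(text, '\n');
        size_t len = eol ? (size_t)(eol - text) : strlen(text);
        size_t i;

        for (i = 0; i + plen <= len; i++) {
            if (strncmp(text + i, phrase, plen) == 0)
                return 1;
        }
        text += len;
        if (*text == '\n')
            text++;
    }
    return 0;
}

/* Skip anything that is not part of a word. */
static const char *skip_sep(const char *s)
{
    while (*s != '\0' && !isalnum((unsigned char)*s))
        s++;
    return s;
}

/* Do the words of phrase match the words that start at text? */
static int words_match_here(const char *text, const char *phrase)
{
    for (;;) {
        phrase = skip_sep(phrase);
        if (*phrase == '\0')
            return 1;               /* every phrase word matched */
        text = skip_sep(text);
        while (isalnum((unsigned char)*phrase)) {
            if (*text != *phrase)
                return 0;
            text++;
            phrase++;
        }
        if (isalnum((unsigned char)*text))
            return 0;               /* text word is longer */
    }
}

/* Word-oriented search: line breaks between words do not matter.
 * Returns 1 on a match, 0 otherwise. */
int phrase_search(const char *text, const char *phrase)
{
    assert(text != NULL && phrase != NULL);
    while (*(text = skip_sep(text)) != '\0') {
        if (words_match_here(text, phrase))
            return 1;
        while (isalnum((unsigned char)*text))
            text++;                 /* move on to the next word */
    }
    return 0;
}
```

Given text in which "context" ends one line and "turns" starts the
next, line_search fails and phrase_search succeeds.  This sketch scans
the text itself; a real index avoids that scan altogether.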

[mail me for more details on this and other index/retrieval methods!]

When I designed lq-text, I had some specific goals in mind (these are not
in any particular order):

1. fast
2. make the index quite small -- at most 70% of the size of the text
3. allow the text to be stored externally, compressed or not
4. cope with non-ascii (e.g. word processor) data files
5. ability to add files to the index at any time, without having to have
   all of the indexed files on-line or to rebuild the entire index.
6. easy to use

I have achieved some of these more nearly than others.

* The retrieval has been quite fast, but then, I have not tried with a
  really huge database.  The largest so far has been the King James Bible
  (five megabytes, roughly) and the SunOS man pages.
  On a 16MHz 386 the package takes about three quarters of an hour to make
  an index to the King James Bible.
  I am aware of changes that I could make fairly easily that would give
  me a marked improvement on speed, although I doubt that there is much
  more than a factor of two or three to be had now.

* The index is usually about 60% of the size of the original text.  If you
  compress the text files, you can get the whole lot to be at best about
  the same size as the originals, and at worst about half again the size.
  Mail and news articles do well, because I strip out a lot of the header.
  Again, it is hard to see how to make big improvements on this without
  losing information.

* The text is stored externally.  Some text retrieval packages store a copy
  of the text inside their data structure.  Then they throw away the
  original (yikes!).  My scheme is a win if you archive files, because
  you can compress them, or write them to tape, and still search them!
  lq-text never needs access to the original files (although if you want
  to see the matches in context, you will obviously need them!)

* A general filtering scheme allows non-ascii files to be added.  Well, OK,
  it isn't finished yet.  But it is well on the way.  I can cope with
  arbitrary input, but I don't have a generic display tool yet.
  
* The curses front end is very easy to use, but does not work on BSD.  The
  next interface will clearly be X-windows...


State of Play:

* ported to Sun 4 recently
* previously ported to System V on the 386 (386/ix) and the Bull XPS100,
  and BSD on the Sun 3.

* would like simple display support for mail and news files

* documentation is sketchy.  In particular, there is no real information
  on the program interface.  This will be true for beta, but not for anything
  one could call a production release.

Long term plans:

* talking about adding a thesaurus.
* networked indexes with common word-lists...
* hypertext engine?

===
Porting:

Edit globals.h and Makefile, and type make.

A library, liblqtext.a, will be built, together with at least one other
library, and then various user programs from the src/lqtext directory
will get linked with it.  There is not yet any documentation on the
functions provided by the library, and many of the library's internal
names are visible at present.

Aside on ndbm:
lq-text can currently use either ndbm or sdbm; Xenix users be warned that
dbm is not really suitable, as you will end up slowing everything down by
a huge factor.  I have experimented with gdbm, but not productively; I'm
also looking at dbz and btrees, but I don't want to get too complicated!
If your distribution includes sdbm, you may find that you have to make
libsdbm.a by hand and put it in the lib directory; if you have libndbm.a
or use -lndbm, simply edit the Makefile.
I have included (and tested) support for versions of ndbm that
can't create their own database files.

On Byte Ordering:
lq-text databases are *not* portable to machines with different byte
ordering.  This means simply that you can't put a database on a tape on a
VAX, read it back on a Sun, and expect it to work.  Of course, you could
run lq-text on both systems without any problems, but not on the same
database.  A more practical result of this is that you can't use NFS or RFS
to share an lq-text database.  If there is any pressure I will investigate
fixing this, but I don't think it can be done without abandoning ndbm.
BSD4.4 db solves this problem, though.
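
For anyone wondering what goes wrong, here is a minimal sketch in C.
It is not how lq-text or ndbm store their data; the function names are
invented for this example.  The first function copies an integer as the
host holds it in memory, so the bytes on disk depend on the machine.
The other two always use the same byte order, which is the usual way
to make a file format portable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store the value exactly as the host keeps it in memory.
 * A VAX or 386 (little-endian) and a Sun 3 or Sun 4 (big-endian)
 * write these four bytes in opposite orders. */
void put_u32_native(unsigned char buf[4], uint32_t value)
{
    assert(buf != NULL);
    memcpy(buf, &value, sizeof value);
}

/* Store the value most significant byte first, whatever the host. */
void put_u32_be(unsigned char buf[4], uint32_t value)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)((value >> 24) & 0xff);
    buf[1] = (unsigned char)((value >> 16) & 0xff);
    buf[2] = (unsigned char)((value >> 8) & 0xff);
    buf[3] = (unsigned char)(value & 0xff);
}

/* Read back a value written by put_u32_be, on any host. */
uint32_t get_u32_be(const unsigned char buf[4])
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) |
            (uint32_t)buf[3];
}
```

A database written with something like put_u32_native can only be read
on a machine with the same byte order; one written with put_u32_be and
read with get_u32_be does not care.  The price is a few shifts per
integer.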

On compilers:
I have used both AT&T's cc and gcc, as well as the BSD pcc.  The default
flags for gcc are -Wall, and only one file produces any messages (emalloc,
which uses varargs).  Saber-C occasionally produces a warning, but it
doesn't seem to be too bad.  I haven't linted recently, and will do so
before the next release.


===
Filters:

There are several filters, not all of which are ready to be distributed.
These turn unwanted characters into spaces, so that the byte offset of each
word start is unaltered.  For example, NewsFilter deletes the Path: header
field, as it is not useful for indexing.  It would be a noticeable win to
delete Message-ID as well, and to use a separate file for that, but I've not
done so.  The MailFilter deletes Received-By:, Via:, and maybe some other
lines.  The CFilter deletes C keywords like typedef and #include, but treats
strings and comments specially.  Filters for various word packages may or
may not exist, but I can't distribute them if they do.

The filters are called again when displaying a list of matches.
(not yet!)

In addition, the filters turn words that they don't want indexed into
qxxxxx, with enough x's to make up the length of the word.
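
The two tricks described in this section can be sketched in C as
follows.  This is an illustration only, not the code of NewsFilter or
CFilter; blank_header and mask_word are hypothetical names.  The point
is that both work in place and never change the length of the text, so
every remaining word keeps its original byte offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite each header line that starts with `header` with spaces.
 * The newline is kept and nothing is inserted or removed, so the
 * offsets of all other words are unchanged.  Scanning stops at the
 * first empty line, which ends the headers of a mail or news article. */
void blank_header(char *text, const char *header)
{
    size_t hlen;
    char *line = text;

    assert(text != NULL && header != NULL);
    hlen = strlen(header);
    while (*line != '\0') {
        char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len == 0)
            break;                  /* end of the header block */
        if (hlen > 0 && strncmp(line, header, hlen) == 0)
            memset(line, ' ', len);
        line += len;
        if (*line == '\n')
            line++;
    }
}

/* Replace a word of `len` bytes by 'q' followed by x's, keeping its
 * length, so that it is still counted as a word at that position. */
void mask_word(char *word, size_t len)
{
    size_t i;

    assert(word != NULL);
    if (len == 0)
        return;
    word[0] = 'q';
    for (i = 1; i < len; i++)
        word[i] = 'x';
}
```

For example, blank_header(article, "Path:") turns the Path: line into
a run of spaces of the same length, and mask_word applied to the seven
letters of "typedef" leaves "qxxxxxx" in their place.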

===
Front ends:

* An X Windows interface using XView was distributed, but has been dropped.
* The curses-based menu front end is pretty naff.

===
Other Questions and comments:

Please ask!
I really, really appreciate comments, whether negative or positive.
And anything that helps to improve (tidy, speed up, clarify) the code
is a big win, of course.

Lee