This is a brief description of lq-text, Liam Quin's text retrieval
system.  If you have more questions, feel free to mail me at
{utzoo, utai}!sq!lee, or lee@sq.com (Liam R. E. Quin).

This file is to help people decide if they want to look at lq-text,
and, in particular, whether they want to go to the effort of compiling
it...

``lq-text'' is the name of a software package which lets you make an
index of files and then, at some later date, identify files by their
contents.  For example, you might find the names of all files
containing the phrase `context turns objects into relation'.  You can
also gather statistics about word frequency and distribution.

Now, you could be thinking about grep at this point...  grep "context
turns objects into relation" won't find this sentence (lq-text will!).
grep matches one line at a time, and here the phrase is broken across
two lines.
A recent suggestion was to build a file containing all of the words,
and to use grep on that.  But the resulting index is not so much
smaller than the original file, and you still can't find phrases.
[mail me for more details on this and other index/retrieval methods!]

When I designed lq-text, I had some specific goals in mind (these are
not in any particular order):

1. fast
2. make the index quite small -- at most 70% of the size of the text
3. allow the text to be stored externally, compressed or not
4. cope with non-ascii (e.g. word processor) data files
5. ability to add files to the index at any time, without having to
   have all of the indexed files on-line or to rebuild the entire
   index.
6. easy to use

I have achieved some of these more nearly than others.

* The retrieval has been quite fast, but then, I have not tried with a
  really huge database.  The largest so far have been the King James
  Bible (five megabytes, roughly) and the SunOS man pages.  On a 16MHz
  386 the package takes about three quarters of an hour to make an
  index to the King James Bible.
  I am aware of changes that I could make fairly easily that would
  give me a marked improvement on speed, although I doubt that there
  is much more than a factor of two or three to be had now.

* The index is usually about 60% of the size of the original text.
  If you compress the text files, you can get the whole lot (index
  plus compressed text) to be at best about the same size as the
  originals, and at worst about half again the size.  Mail and news
  articles do well, because I strip out a lot of the header.  Again,
  it is hard to see how to make big improvements on this without
  losing information.

* The text is stored externally.  Some text retrieval packages store a
  copy of the text inside their data structure.  Then they throw away
  the original (yikes!).  My scheme is a win if you archive files,
  because you can compress them, or write them to tape, and still
  search them!  lq-text never needs access to the original files
  (although if you want to see the matches in context, you will
  obviously need them!).

* A general filtering scheme allows non-ascii files to be added.
  Well, OK, it isn't finished yet.  But it is well on the way.  I can
  cope with arbitrary input, but I don't have a generic display tool
  yet.

* The curses front end is very easy to use, but does not work on BSD.
  The next interface will clearly be X-windows...

State of Play:

* ported to Sun 4 recently
* previously ported to System V on the 386 (386/ix) and the Bull
  XPS100, and BSD on the Sun 3.
* would like simple display support for mail and news files
* documentation is sketchy.  In particular, there is no real
  information on the program interface.  This will be true for beta,
  but not for anything one could call a production release.

Long term plans:

* talking about adding a thesaurus.
* networked indexes with common word-lists...
* hypertext engine?

===

Porting:

Edit globals.h and Makefile, and type make.
A library, liblqtext.a, will be built, together with at least one
other library, and then various user programs from the src/lqtext
directory will get linked with it.  There is not yet any documentation
on the functions provided by the library, and many of the library's
internal names are visible at present.

Aside on ndbm:

lq-text can currently use either ndbm or sdbm; Xenix users be warned
that dbm is not really suitable, as you will end up slowing everything
down by a huge factor.  I have experimented with gdbm, but not
productively; I'm also looking at dbz and btrees, but I don't want to
get too complicated!

If your distribution includes sdbm, you may find that you have to make
libsdbm.a by hand and put it in the lib directory; if you have
libndbm.a or use -lndbm, simply edit the Makefile.  I have included
(and tested) support for versions of ndbm that can't create their own
database files.

On Byte Ordering:

lq-text databases are *not* portable to machines with different byte
ordering.  This means simply that you can't put a database on a tape
on a VAX, read it back on a Sun, and expect it to work.  Of course,
you could run lq-text on both systems without any problems, but not on
the same database.  A more practical result of this is that you can't
use NFS or RFS to share an lq-text database between machines whose
byte ordering differs.  If there is any pressure I will investigate
fixing this, but I don't think it can be done without abandoning ndbm.
BSD4.4 db solves this problem, though.

On compilers:

I have used both AT&T's cc and gcc, as well as the BSD pcc.  The
default flags for gcc are -Wall, and only one file produces any
messages (emalloc, which uses varargs).  Saber-C occasionally produces
a warning, but it doesn't seem to be too bad.  I haven't linted
recently, but will do so before the next release.

===

Filters:

There are several filters, not all of which are ready to be
distributed.  These turn unwanted characters into spaces, so that the
byte offset of each word start is unaltered.
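
The offset-preserving idea can be sketched roughly as follows.  This
is an illustration in Python only, not lq-text's own filter code; the
function name blank_headers and the sample article are made up for the
example, and folded (continued) header lines are not handled.

```python
def blank_headers(article: bytes, unwanted=(b"Path:",)) -> bytes:
    """Overwrite unwanted header lines with spaces, byte for byte.

    Hypothetical helper: only lines before the first blank line are
    treated as header, and each blanked line keeps its length and its
    line ending, so no later byte changes position.
    """
    out = []
    in_header = True
    for line in article.splitlines(keepends=True):
        content = line.rstrip(b"\r\n")
        if in_header and content == b"":
            in_header = False  # a blank line ends the header
        elif in_header and content.startswith(unwanted):
            # same number of bytes: spaces, then the original line ending
            line = b" " * len(content) + line[len(content):]
        out.append(line)
    return b"".join(out)


sample = b"Path: utzoo!sq!lee\nSubject: lq-text\n\nfast text retrieval\n"
filtered = blank_headers(sample)

# The filtered text is the same length, and words sit at the same offsets.
assert len(filtered) == len(sample)
assert filtered.index(b"retrieval") == sample.index(b"retrieval")
```

Because every byte keeps its position, an offset recorded against the
filtered text still points at the same word in the original file.
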
For example, NewsFilter deletes the Path: header field, as it is not
useful for indexing.  It would be a noticeable win to delete
Message-ID as well, and to use a separate file for that, but I've not
done so.  The MailFilter deletes Received-By:, Via:, and maybe some
other lines.  The CFilter deletes C keywords like typedef and
#include, but treats strings and comments specially.

Filters for various word-processing packages may or may not exist, but
I can't distribute them if they do.

The filters are called again when displaying a list of matches.
(not yet!)

In addition, the filters turn words that they don't want indexed into
qxxxxx, with enough x's to make up the length of the word.

===

Front ends:

* An X Windows interface using XView was distributed, but has been
  dropped.
* The curses-based menu front end is pretty naff.

===

Other Questions and comments:

Please ask!  I really, really appreciate comments, whether negative or
positive.  And anything that helps to improve (tidy, speed up,
clarify) the code is a big win, of course.

Lee