Liam Quin's text retrieval package (lq-text) Sat Nov 27 22:50:31 EST 1993
src/h/Revision.h defines this as Revision 1.13.

NOTE:	this is not the "README" file that you put into a database directory;
	use Sample/README for that (and then edit it).

lq-text is copyright 1989, 1990-1995, 1996 Liam R. E. Quin;
see src/COPYRIGHT for details.  Parts of the source may also be copyrighted by
the University of California at Berkeley - see src/qsort.c and src/db*/....

Lqtext is a text retrieval package.

That means you can tell it about lots of files, and later you can ask
it questions about them.
The questions have to be of the form
	which files contain this word?
	which files contain this phrase?
but this information turns out to be rather useful.

Lqtext has been designed to be reasonably fast.  It uses an inverted
index, which is simply a kind of database.  The index tends to be smaller
than the data it describes, but more than half as large.  You still need to keep
the original data.
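
To illustrate what "inverted" means here, the following is a toy sketch only,
not how lq-text stores its data: it maps each word to the files containing it,
and unlike the real index it records no block or word positions.  The function
name toy_inverted_index is made up for this example.

```shell
# toy_inverted_index: hypothetical illustration, not part of lq-text.
# Usage: toy_inverted_index FILE...
# Prints one "word file" line for every distinct word in each file,
# sorted by word, so that all files containing a word are adjacent.
toy_inverted_index() {
    for f in "$@"; do
        # Split into one word per line, fold to lower case,
        # and keep only the first occurrence of each word in this file.
        tr -cs '[:alpha:]' '\n' < "$f" |
            tr '[:upper:]' '[:lower:]' |
            awk -v f="$f" 'length($0) > 0 && !seen[$0]++ { print $0, f }'
    done | LC_ALL=C sort
}

# Looking a word up is then a scan of the sorted list:
#   toy_inverted_index *.txt | awk '$1 == "wept" { print $2 }'
```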

Commands include:
	lqaddfile -- add files to the database at any time
	lqfile -- information about files that have been indexed
	lqword -- information about words
	lqphrase -- look up phrases
	lqrank -- combine phrase searches, and sort the results
	lqkwic -- creates keyword-in-context indexes (this is fun!)
	lqshow -- show the matches on the screen (uses curses)
	lqtext -- curses-based front end.
	lq -- shell-script front end

There are about 11,000 lines of C in total, of which 8,000 are the
text database and 3,000 are the curses front end (lqtext).  Well, last time
I counted, anyway.

Here are some examples, based mostly on the (King James) New Testament,
simply because that is what I have lying around.  The timings ran on a
16 MHz Sun 4/110 -- about 7 MIPS, with a disk drive giving around 1 MByte/sec.

$ time lqphrase 'wept bitterly' 
2 35 10 955 KingJames/NT/Matthew/matt26.kjv
2 26 47 995 KingJames/NT/Luke/luke22.kjv
        0.6 real         0.0 user         0.2 sys  
			//  The first number is the number of words in the
			// phrase -- 2 for "wept bitterly"
$ time lqword -l jesus > XXX
        1.0 real         0.4 user         0.4 sys  
$ wc XXX
     983    4915   68604 XXX
$ sed 12q XXX
1 0 8 930 KingJames/NT/Matthew/matt01.kjv
1 5 21 930 KingJames/NT/Matthew/matt01.kjv
1 6 24 930 KingJames/NT/Matthew/matt01.kjv
1 8 48 930 KingJames/NT/Matthew/matt01.kjv
1 10 49 930 KingJames/NT/Matthew/matt01.kjv
1 0 4 931 KingJames/NT/Matthew/matt02.kjv
1 6 4 932 KingJames/NT/Matthew/matt03.kjv 
(and so on for 983 lines)
So there are nine hundred and eighty-three matches.  The line for each match
gives the number of words in the phrase, the block in the file, the word
within the block, the file number, and the filename.
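
Because each match is a single line with the filename in the last field,
ordinary text tools can summarise the output.  Here is a minimal sketch,
assuming filenames contain no spaces; count_matches_per_file is a made-up
helper for this example, not an lq-text command.

```shell
# count_matches_per_file: hypothetical helper, not part of lq-text.
# Reads match lines (as printed by lqword -l or lqphrase) on standard
# input and prints "count filename" per file, most matches first.
# Assumes the filename is the last whitespace-separated field.
count_matches_per_file() {
    awk 'NF > 0 { n[$NF]++ } END { for (f in n) print n[f], f }' |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Usage (needs an existing lq-text database):
#   lqword -l jesus | count_matches_per_file
```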

More useful things to do include:

// see some of the matching text:

$ lqphrase 'wept bitterly' | lqkwic
==== Document 1: /home/mieza/lee/text/bible/KingJames/NT/Matthew/matt26.kjv ====
  1: thrice. And he went out, and wept bitterly.                               
==== Document 2: /home/mieza/lee/text/bible/KingJames/NT/Luke/luke22.kjv ====
  2:22:62 And Peter went out, and wept bitterly. 22:63 And the men that held Je
$

// which words contain "foot" or "feet"?
$ lqwordlist -g "f[oe][oe]t"
afoot
barefoot
brokenfooted
clovenfooted
feet
foot
footmen
footstep
footstool
fourfooted

// documents containing "shoe" and "barefoot"
$ lqrank "barefoot" "shoe" | lqkwic
==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Isaiah/isa20.kjv ====
  1:ff thy loins, and put off thy shoe from thy foot. And he did so, walking na
  2: he did so, walking naked and barefoot. 20:3 And the LORD said, Like as my 
  3: Isaiah hath walked naked and barefoot three years [for] a sign and wonder 
  4:ves, young and old, naked and barefoot, even with [their] buttocks uncovere

// save a query... docs containing any of the following:
$ lqrank -r or serpent witch snake stick rod > skinny-things    

// documents containing abraham said, or god of abraham:
$ lqrank -r or "abraham said" "God of Abraham" > abe     

// documents appearing in both sets of results (intersect), if any:
$ lqrank -r and -f skinny-things -f abe | lqkwic    
==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Exodus/exod04.kjv ====
  1:in thine hand? And he said, A rod. 4:3 And he said, Cast it on the ground. 
  2:n the ground, and it became a serpent; and Moses fled from before it. 4:4 A
  3:nd caught it, and it became a rod in his hand: 4:5 That they may believe th
  4:ORD God of their fathers, the God of Abraham, the God of Isaac, and the God
  5:4:17 And thou shalt take this rod in thine hand, wherewith thou shalt do si
  6: of Egypt: and Moses took the rod of God in his hand. 4:21 And the LORD sai
$  

// Ah, it was Moses I was thinking of...

The "lq" shell script is much more convenient for simple queries.
It's interactive -- give it a try.


How to Install lq-text
    see the file INSTALL

How to Use It
    (see doc/*)
    Make a directory $HOME/LQTEXTDIR (or set $LQTEXTDIR to point to the
    (currently empty) directory you want to contain the new database).
    Include lq-text/src/bin and lq-text/src/lib in your search path if
    you haven't done a "make install" yet.
    Put a README file in $LQTEXTDIR:
	docpath /my/login/directory:/or/somewhere/else
	common Common
    and make an empty file called Common (or include words like "uucp"
    that you don't want indexed) in the same directory.
    You can copy lq-text/Sample/README if you want, and then edit it.
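
    The setup steps above could be scripted roughly as follows.  This is
    only a sketch: init_lqtext_dir is a made-up helper, and the README it
    writes contains just the two lines shown above, so Sample/README is
    still the better starting point for a real database.

```shell
# init_lqtext_dir: hypothetical helper, not part of lq-text.
# Usage: init_lqtext_dir DIRECTORY DOCPATH
# Creates the database directory with a two-line README and an empty
# common-word file called Common.
init_lqtext_dir() {
    dir=$1
    docpath=$2
    mkdir -p "$dir" || return 1
    # Do not clobber a README that has already been edited.
    if [ ! -e "$dir/README" ]; then
        printf 'docpath %s\ncommon Common\n' "$docpath" > "$dir/README"
    fi
    # Create Common only if it is missing, so an existing list is kept.
    [ -e "$dir/Common" ] || : > "$dir/Common"
}

# Usage:
#   init_lqtext_dir "${LQTEXTDIR:-$HOME/LQTEXTDIR}" "$HOME"
```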

    The common word list is searched linearly, so it is worth keeping it
    fairly short.  Usually about a dozen words is plenty.  Don't bother
    including words of fewer than three letters unless you have edited
    src/wordrules.h, or have changed minwordlength in Sample/README,
    as short words aren't normally included in the index.

    Find some files (e.g. your mailbox) and say
	lqaddfile -t2 file [...]
    You should see some diagnostic output... (this is what -t2 does).
    lqaddfile may take several minutes to write out its data, depending
    on the system.  Try a small file first -- you can add more later!
    Another fun thing to try is setting DOCPATH to /usr/man and running
	cd /usr/man
	find man* -type f -print | lqaddfile -t2 -f -
    to make an index of the manual pages (use cat* instead of man* if you
    prefer).  If you have less than 10 meg or so of RAM, give lqaddfile the
    -w100000 option -- this is the number of words to keep in memory before
    writing to the database.  The idea is that the number should be small
    enough to prevent frantic paging activity!  I find that on my Sun 4/110,
    -w100000 makes lqaddfile grow to maybe 2 megabytes; 300000 takes it up
    to 8 or 10 megabytes, but makes it run a *lot* faster.
    It's best to add lots of files at once, as in the example above using
    find(1), rather than adding a file at a time - it can make a very large
    difference in indexing speed, although probably no difference in retrieval
    times in most cases.


    Now try
	lqword		---> an unsorted list of all known words
	lq		---> type phrases and browse through them
	lqtext		---> curses-based browser, if it compiled.
	lqrank		---> a sorted list of matches

	lqkwic `lqphrase "floppy disk"`   ---> this is the most fun.
	lqshow `lqphrase "floppy disk"`   ---> lq does this for you


    If the files you are indexing have pathnames with leading bits in
    common (e.g. indexing a directory such as  /usr/spool/news, or
    /home/zx81/lee/text/humour), make use of DOCPATH.  This is searched
    linearly, so a dozen or so entries is the practical limit at the
    moment.  For example, if your README file contained the line
	docpath /usr/spool/news:/shared-text/books:.
    and you ran the command
	lqaddfile simon/chapter3
    lqaddfile would look for
	/usr/spool/news/simon/chapter3
	/shared-text/books/simon/chapter3
	./simon/chapter3
    in that order.  But it would only need to store "simon/chapter3" in the
    index, and this can save a lot of space if you index large numbers of
    files.  Of course, it's up to you to ensure that all of the filenames
    you pass to lqaddfile are unique!
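
    The search order can be mimicked in the shell, which is handy for
    checking which copy of a file a given docpath would pick up.  This is a
    sketch of the idea, not lq-text's own code: resolve_docpath is a made-up
    helper, and treating an empty docpath entry as the current directory is
    an assumption.

```shell
# resolve_docpath: hypothetical helper, not part of lq-text.
# Usage: resolve_docpath DOCPATH NAME
# Tries each colon-separated DOCPATH entry in order and prints the first
# existing file DIR/NAME; returns non-zero if no entry has the file.
# The body runs in a subshell so IFS and globbing changes stay local.
resolve_docpath() (
    docpath=$1
    name=$2
    IFS=:
    set -f                       # do not glob-expand docpath entries
    for dir in $docpath; do
        [ -n "$dir" ] || dir=.   # assumption: empty entry means "."
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            exit 0
        fi
    done
    exit 1
)

# Usage:
#   resolve_docpath /usr/spool/news:/shared-text/books:. simon/chapter3
```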

Note:
    Every indexed pathname must fit into a dbm page, which is 4KBytes
    with sdbm but probably much less (e.g. 512) with dbm.  With BSD db
    this problem has gone away.

Lee

Liam R. E. Quin
liamquin@interlog.com
