Liam Quin's text retrieval package (lq-text) Wed May 30 17:01:46 EDT 2001
src/h/revision.h defines this as Revision 1.17.

lq-text is copyright 1989-2001 Liam R. E. Quin; see src/COPYRIGHT for
details.  Parts of the source may also be copyrighted by
the University of California at Berkeley - see src/qsort.c and src/db*/....

This package is distributed under the barefoot licence, and is open source.
It is also available under the GNU Lesser (library) Public Licence.
See COPYING-barefoot or COPYING-LGPL.


Lqtext is a text retrieval package.

That means you can tell it about lots of files, and later you can ask
it questions about them.
The questions have to be
	which files contain this word?
	which files contain this phrase?
	which words are contained in these files?
but this information turns out to be rather useful.

Lqtext has been designed to be fast.  It uses an inverted index, which is
simply a kind of database.  This tends to be smaller than the size of the
data, but more than half as large.  You still need to keep the original
data, although you can compress it.
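
To give a feel for what an inverted index is, here is a toy sketch in Python.
It is not how lq-text itself works (lq-text is written in C and keeps its
index on disk); the names build_index and find_phrase are made up for this
illustration, and the word splitting is deliberately naive.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to a list of (document name, word position) pairs."""
    index = defaultdict(list)
    for name, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].append((name, pos))
    return index

def find_phrase(index, phrase):
    """Return sorted (document, position) pairs where the phrase starts."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return []
    # Start from every occurrence of the first word, then keep only those
    # that are followed by the remaining words at consecutive positions.
    hits = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        following = set(index.get(word, []))
        hits = {(doc, pos) for doc, pos in hits if (doc, pos + offset) in following}
    return sorted(hits)

# Three tiny "documents"; only the first two contain the phrase.
docs = {
    "matt26": "And he went out, and wept bitterly.",
    "luke22": "And Peter went out, and wept bitterly.",
    "other": "He wept, and later spoke bitterly.",
}
index = build_index(docs)
print(find_phrase(index, "wept bitterly"))
```

The point of the structure is that a query only touches the entries for the
words in the phrase, never the text of the documents themselves.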

Commands include:
	lqaddfile -- add files to the database at any time
	lqfile -- information about files that have been indexed
	lqword -- information about words
	lqphrase -- look up phrases
	lqrank -- combine phrase searches, and sort the results
	lqquery -- supports wildcards in phrases, after running "sortwids"
	sortwids -- run after lqaddfile to enable lqquery
	lqkwic -- creates keyword-in-context indexes (this is fun!)
	lqshow -- show the matches on the screen (uses curses)
	lqwordlist -- search the stored vocabulary
	lqtext -- curses-based front end.
	lq -- shell-script front end
	lqcat -- fetch and print files by lq-text FID or document name

This distribution may also contain cgi-scripts; see src/http for details.

There are about 20,000 lines of C in total. Well, last time I counted, anyway.

Here are some examples, based mostly on the (King James) New Testament,
simply because that is what I have lying around.  The timings ran on a
16 MHz Sun 4/110 -- about 7 MIPS, with a disk drive giving around 1 MByte/sec.

$ time lqphrase 'wept bitterly' 
2 35 10 955 KingJames/NT/Matthew/matt26.kjv
2 26 47 995 KingJames/NT/Luke/luke22.kjv
        0.6 real         0.0 user         0.2 sys  
			//  The first number is the number of words in the
			// phrase -- 2 for "wept bitterly"
On a 200 MHz Pentium 1 under FreeBSD, the times were
real	0m0.012s
user	0m0.001s
sys	0m0.011s

$ time lqword -l jesus > XXX
        1.0 real         0.4 user         0.4 sys  

(the time on the Pentium was too small to measure; using a 30 MByte database,
with approx. 400 matches, I got a time of 0.04 seconds)

$ wc XXX
     983    4915   68604 XXX
$ head -12 XXX
1 0 8 930 KingJames/NT/Matthew/matt01.kjv
1 5 21 930 KingJames/NT/Matthew/matt01.kjv
1 6 24 930 KingJames/NT/Matthew/matt01.kjv
1 8 48 930 KingJames/NT/Matthew/matt01.kjv
1 10 49 930 KingJames/NT/Matthew/matt01.kjv
1 0 4 931 KingJames/NT/Matthew/matt02.kjv
1 6 4 932 KingJames/NT/Matthew/matt03.kjv 
(and so on for 983 lines)
So there are nine hundred and eighty-three matches.  The line for each match
gives the number of words in the phrase (1 here, for a single word), the block
in the file, the word within the block, the file number, and the filename.
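
If you want to post-process these lines in a script, something like the
following Python sketch may help.  It assumes the five whitespace-separated
fields seen in the examples above (number of words in the phrase, block,
word within the block, file number, filename); Match and parse_match are
names invented here, not part of lq-text.

```python
from collections import Counter
from typing import NamedTuple

class Match(NamedTuple):
    nwords: int         # number of words in the matched phrase
    block: int          # block within the file
    word_in_block: int  # word within that block
    file_number: int    # lq-text file number
    filename: str       # name of the file, as stored in the index

def parse_match(line):
    """Split one result line into its five fields."""
    fields = line.split(None, 4)  # the filename is everything after field 4
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d: %r" % (len(fields), line))
    nwords, block, word, fid = (int(f) for f in fields[:4])
    return Match(nwords, block, word, fid, fields[4].rstrip())

# Sample lines copied from the lqword output shown above.
sample = """\
1 0 8 930 KingJames/NT/Matthew/matt01.kjv
1 5 21 930 KingJames/NT/Matthew/matt01.kjv
1 6 24 930 KingJames/NT/Matthew/matt01.kjv
1 8 48 930 KingJames/NT/Matthew/matt01.kjv
1 10 49 930 KingJames/NT/Matthew/matt01.kjv
1 0 4 931 KingJames/NT/Matthew/matt02.kjv
1 6 4 932 KingJames/NT/Matthew/matt03.kjv 
"""
matches = [parse_match(line) for line in sample.splitlines()]
per_file = Counter(m.filename for m in matches)
for name, count in per_file.most_common():
    print(count, name)
```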

More useful things to do include:

// see some of the matching text:

$ lqphrase 'wept bitterly' | lqkwic
==== Document 1: /home/mieza/lee/text/bible/KingJames/NT/Matthew/matt26.kjv ====
  1: thrice. And he went out, and wept bitterly.                               
==== Document 2: /home/mieza/lee/text/bible/KingJames/NT/Luke/luke22.kjv ====
  2:22:62 And Peter went out, and wept bitterly. 22:63 And the men that held Je
$

// which words contain "foot" or "feet"?
$ lqwordlist -g "f[oe][oe]t"
afoot
barefoot
brokenfooted
clovenfooted
feet
foot
footmen
footstep
footstool
fourfooted

// documents containing "shoe" and "barefoot"
$ lqrank "barefoot" "shoe" | lqkwic
==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Isaiah/isa20.kjv ====
  1:ff thy loins, and put off thy shoe from thy foot. And he did so, walking na
  2: he did so, walking naked and barefoot. 20:3 And the LORD said, Like as my 
  3: Isaiah hath walked naked and barefoot three years [for] a sign and wonder 
  4:ves, young and old, naked and barefoot, even with [their] buttocks uncovere

// save a query... docs containing any of the following:
$ lqrank -r or serpent witch snake stick rod > skinny-things    

// documents containing abraham said, or god of abraham:
$ lqrank -r or "abraham said" "God of Abraham" > abe     

// documents appearing in both sets of results (intersect), if any:
$ lqrank -r and -f skinny-things -f abe | lqkwic    
==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Exodus/exod04.kjv ====
  1:in thine hand? And he said, A rod. 4:3 And he said, Cast it on the ground. 
  2:n the ground, and it became a serpent; and Moses fled from before it. 4:4 A
  3:nd caught it, and it became a rod in his hand: 4:5 That they may believe th
  4:ORD God of their fathers, the God of Abraham, the God of Isaac, and the God
  5:4:17 And thou shalt take this rod in thine hand, wherewith thou shalt do si
  6: of Egypt: and Moses took the rod of God in his hand. 4:21 And the LORD sai
$  

// Ah, it was Moses I was thinking of...

The "lq" shell script is much more convenient for simple queries.
It's interactive -- give it a try.


How to Install lq-text
    see the file INSTALL

How to Build an Index
    (see doc/*)
    Make a directory $HOME/LQTEXTDIR (or set $LQTEXTDIR to point to the
    (currently empty) directory you want to contain the new database).
    Include lq-text/src/bin and lq-text/src/lib in your search path if
    you haven't done a "make install" yet.
    Put a config.txt file in $LQTEXTDIR:
	docpath /my/login/directory:/or/somewhere/else
	common Common
    and make an empty file called Common (or include words like "the"
    that you don't want indexed; see the next section) in the same directory.

    You can copy lq-text/Sample/config.txt if you want, and then edit it.

    Find some files (e.g. your mailbox) and say
	lqaddfile -t2 file [...]
    You should see some diagnostic output... (this is what -t2 does).
    lqaddfile may take several minutes to write out its data, depending
    on the system.  Try a small file first -- you can add more later!
    Another fun thing to try is setting DOCPATH to /usr/man and running
	cd /usr/man
	find man* -type f -print | lqaddfile -t2 -f -
    to make an index of the manual pages (use cat* instead of man* if you
    prefer).  If you have less than 10 meg or so of RAM, give lqaddfile the
    -w100000 option -- this is the number of words to keep in memory before
    writing to the database.  The idea is that the number should be small
    enough to prevent frantic paging activity!  I find that on my Sun 4/110,
    -w100000 makes lqaddfile grow to maybe 2 megabytes; 300000 takes it up
    to 8 or 10 megabytes, but makes it run a *lot* faster.

    It's best to add lots of files at once, as in the example above using
    find(1), rather than adding a file at a time - it can make a very large
    difference in indexing speed, although probably no difference in retrieval
    times in most cases.  The index will be very slightly larger if you
    index individual files with multiple runs of lqaddfile.

How to make the index slightly smaller
    (skip ahead to "Searching the index" if you are not interested in this)

    This section is here because the size of the index is very important
    for some applications, and saving even a few percent is useful.

    The common word list is searched linearly, so it is worth keeping it
    fairly short.  The best way is to make it empty, then after indexing
    your files, use this command to find the most common 20 words:
	$ lqwordlist -u -n -g . | sort +1nr | sed 20q
        the     192910
        of      110798
        and     83969
        to      64403
        a       52039
        in      49124
        is      42962
        that    31834
        i       30915
        it      30071
        as      21029
        for     18499
        this    18002
        with    17705
        be      16031
        are     15174
        by      14951
        he      14652
        was     14121
        not     13498

    You can see that in my database, "the" occurs approx. ten times as often
    as "for". If the database didn't include "the", it would be smaller.
    Here is the database with those words in it:

    $ du -k LQTEXTDIR 
    12325	LQTEXTDIR

    $ ls -l LQTEXTDIR/
    total 12324
    -rw-r--r--  1 liam  liam      108 Apr 27 22:54 config.txt
    -rw-r--r--  1 liam  liam  9076736 Apr 27 23:10 data
    -rw-r--r--  1 liam  liam   270336 Apr 27 23:10 filelist
    -rw-r--r--  1 liam  liam     7149 Apr 27 23:02 files
    -rw-r--r--  1 liam  liam    35840 Apr 27 23:10 freelist
    -rw-r--r--  1 liam  liam      307 Apr 27 23:15 index.html
    -rw-r--r--  1 liam  liam   237568 Apr 27 23:10 lastblks
    -rwxr-xr-x  1 liam  liam     7311 Apr 27 23:15 nph-search.cgi
    -rw-r--r--  1 liam  liam  1875968 Apr 27 23:10 widindex
    -rw-r--r--  1 liam  liam  1286144 Apr 27 23:10 wordlist

    Here is the new commonwords file:
    the
    of
    and
    a
    is
    that
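
    If you would rather look at the counts with a script before choosing
    words, this Python sketch reads two-column "word count" lines like the
    ones listed above and reports each word's share of the listed
    occurrences.  The function names are invented for this example, and it
    only sees the lines you give it, not the whole index.

```python
def parse_counts(text):
    """Parse lines of the form 'word count' into a list of (word, count)."""
    counts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counts.append((fields[0], int(fields[1])))
    return counts

def shares(counts):
    """Return (word, count, percent of listed occurrences), largest first."""
    total = sum(n for _, n in counts)
    ranked = sorted(counts, key=lambda wc: wc[1], reverse=True)
    return [(word, n, 100.0 * n / total) for word, n in ranked]

# A few lines copied from the listing above.
sample = """\
        the     192910
        of      110798
        and     83969
        for     18499
"""
for word, n, pct in shares(parse_counts(sample)):
    print(f"{word}\t{n}\t{pct:.1f}%")
```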

    After running the index again (which took 40 seconds):
    $ lqaddfile -t3 -w5000000 -H655350 -f LQTEXTDIR/files
    I got the following:

    $ du -k LQTEXTDIR/
    11820	LQTEXTDIR/

    I checked "the" wasn't there:
    bash-2.03$ lqword the
    lqword: warning: No index information for: the (too common)

    In this case, we saved about 504 kbytes out of 12 megabytes,
    or a little over 4% of the database size.  It really wasn't worth it.

    If you are not planning to add files to the index or to unindex files,
    you can remove or truncate the "freelist" and "lastblks" files for
    another 2% or so space saving.

    If you are not planning on using lqquery with wildcards, you can also
    save space by including the line
	wordsinindex off
    in config.txt.  That took us down to 11282 kbytes for the index, a
    saving of over 9%.

    If you turn indexnumbers off, and set the shortest word indexed
    to 3 bytes and the longest to 16 (so that longer ones are truncated
    and indexed as if they were the same), the index shrinks again:

    $ ls -l LQTEXTDIR/
    total 9754
    -rw-r--r--  1 liam  liam      343 May 30 17:29 common
    -rw-r--r--  1 liam  liam      123 May 30 17:52 config.txt
    -rw-r--r--  1 liam  liam  6828032 May 30 17:52 data
    -rw-r--r--  1 liam  liam   270336 May 30 17:52 filelist
    -rw-r--r--  1 liam  liam     7149 Apr 27 23:02 files
    -rw-r--r--  1 liam  liam      307 Apr 27 23:15 index.html
    -rwxr-xr-x  1 liam  liam     7311 Apr 27 23:15 nph-search.cgi
    -rw-r--r--  1 liam  liam  1769472 May 30 17:52 widindex
    -rw-r--r--  1 liam  liam  1277952 May 30 17:52 wordlist


    We're now 22.3% smaller than when we started, but we have lost some
    precision.

    It's also possible to save space by restricting the per-occurrence flags;
    see Sample/config.txt for more information.


Searching the index

    Now try
	lqword		---> an unsorted list of all known words
	lq		---> type phrases and browse through them
	lqtext		---> curses-based browser, if it compiled.
	lqrank		---> a sorted list of matches

	lqphrase "floppy disk" | lqkwic -f -   ---> this is the most fun.
	lqphrase "floppy disk" | lqshow -f -   ---> lq does this for you


    If the files you are indexing have pathnames with leading bits in
    common (e.g. indexing a directory such as  /usr/spool/news, or
    /home/zx81/lee/text/humour), make use of DOCPATH.  This is searched
    linearly, so a dozen or so entries is the practical limit at the
    moment.  For example, if your config.txt file contained the line
	docpath /usr/spool/news:/shared-text/books:.
    and you ran the command
	lqaddfile simon/chapter3
    lqaddfile would look for
	/usr/spool/news/simon/chapter3
	/shared-text/books/simon/chapter3
	./simon/chapter3
    in that order.  But it would only need to store "simon/chapter3" in the
    index, and this can save a lot of space if you index large numbers of
    files.  Of course, it's up to you to ensure that all of the filenames
    you pass to lqaddfile are unique!
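
    The lookup order can be sketched in a few lines of Python.  This only
    illustrates the idea of trying each docpath entry in turn; it is not
    taken from the lq-text source, and the names candidates and resolve are
    hypothetical.

```python
import os

def candidates(name, docpath):
    """List the paths that would be tried, in docpath order."""
    return [os.path.join(directory, name) for directory in docpath.split(":")]

def resolve(name, docpath):
    """Return the first candidate that is an existing file, or None."""
    for path in candidates(name, docpath):
        if os.path.isfile(path):
            return path
    return None

# The docpath from the example above.
for path in candidates("simon/chapter3", "/usr/spool/news:/shared-text/books:."):
    print(path)
```

    Because each lookup walks the list from the start, a long docpath means
    more failed attempts per file, which is why a dozen or so entries is
    about the limit.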


If you create a file called "titles", lqkwic will display the document
titles alongside filenames. The file lives in $LQTEXTDIR, and contains
a file number, a tab, then a description, all on the same line.
The first line must be a comment (starting with a #).
# document titles
1	title for document 1
2	title for document 2
and so on.
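
If you have many documents you may prefer to generate this file.  Here is a
small Python sketch that formats it from a dictionary of file numbers and
titles; the function name format_titles and the sample titles are invented
for the example, and it assumes you already know the file numbers (for
instance from the fourth column of lqphrase or lqword output).

```python
def format_titles(titles):
    """Build the contents of a titles file from {file number: title}."""
    lines = ["# document titles"]  # the first line must be a comment
    for fid in sorted(titles):
        # Collapse tabs and newlines so each entry stays on one line.
        title = " ".join(titles[fid].split())
        lines.append("%d\t%s" % (fid, title))
    return "\n".join(lines) + "\n"

text = format_titles({
    2: "title for document 2",
    1: "title for\tdocument 1",
})
print(text, end="")
```

Write the result to a file called "titles" in $LQTEXTDIR.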

There is a sample CGI script in the src/http directory.

You can also use the C libraries and header files directly; the directory
api/doc contains some documentation on this subject.

Finally, if you use this package, you have to go barefoot for 24 consecutive
hours within a week of first using it.  See LICENCE for details.
And yes, "licence" is the usual UK spelling.

Lee

Liam R. E. Quin
liam@holoweb.net