Liam Quin's text retrieval package (lq-text)
Wed May 30 17:01:46 EDT 2001

src/h/revision.h defines this as Revision 1.17.

lq-text is copyright 1989-2001 Liam R. E. Quin; see src/COPYRIGHT for
details. Parts of the source may also be copyrighted by the University
of California at Berkeley - see src/qsort.c and src/db*/....

This package is distributed under the barefoot licence, and is open
source. It is also available under the GNU Lesser (library) Public
Licence. See COPYING-barefoot or COPYING-LGPL.

Lqtext is a text retrieval package. That means you can tell it about
lots of files, and later you can ask it questions about them. The
questions have to be
    which files contain this word?
    which files contain this phrase?
    which words are contained in these files?
but this information turns out to be rather useful.

Lqtext has been designed to be fast. It uses an inverted index, which is
simply a kind of database that maps each word to the places where it
occurs. The index tends to be smaller than the data, but more than half
as large. You still need to keep the original data, although you can
compress it.

Commands include:
    lqaddfile  -- add files to the database at any time
    lqfile     -- information about files that have been indexed
    lqword     -- information about words
    lqphrase   -- look up phrases
    lqrank     -- combine phrase searches, and sort the results
    lqquery    -- supports wildcards in phrases, after running "sortwids"
    sortwids   -- run after lqaddfile to enable lqquery
    lqkwic     -- creates keyword-in-context indexes (this is fun!)
    lqshow     -- show the matches on the screen (uses curses)
    lqwordlist -- search the stored vocabulary
    lqtext     -- curses-based front end.
    lq         -- shell-script front end
    lqcat      -- fetch and print files by lq-text FID or document name

This distribution may also contain cgi-scripts; see src/http for details.

There are about 20,000 lines of C in total. Well, last time I counted,
anyway.

Here are some examples, based mostly on the (King James) New Testament,
simply because that is what I have lying around.
The timings ran on a 16 MHz Sun 4/110 -- about 7 MIPS, with a disk drive
giving around 1 MByte/sec.

    $ time lqphrase 'wept bitterly'
    2 35 10 955 KingJames/NT/Matthew/matt26.kjv
    2 26 47 995 KingJames/NT/Luke/luke22.kjv
    0.6 real 0.0 user 0.2 sys

    // The first number is the number of words in the
    // phrase -- 2 for "wept bitterly"

On a 200 MHz Pentium 1 under FreeBSD, the times were
    real 0m0.012s
    user 0m0.001s
    sys 0m0.011s

    $ time lqword -l jesus > XXX
    1.0 real 0.4 user 0.4 sys

(the time on the pentium was too small to measure; using a 30 MByte
database, with approx. 400 matches, I got a time of 0.04 seconds)

    $ wc XXX
    983 4915 68604 XXX
    $ head -12 XXX
    1 0 8 930 KingJames/NT/Matthew/matt01.kjv
    1 5 21 930 KingJames/NT/Matthew/matt01.kjv
    1 6 24 930 KingJames/NT/Matthew/matt01.kjv
    1 8 48 930 KingJames/NT/Matthew/matt01.kjv
    1 10 49 930 KingJames/NT/Matthew/matt01.kjv
    1 0 4 931 KingJames/NT/Matthew/matt02.kjv
    1 6 4 932 KingJames/NT/Matthew/matt03.kjv
    (and so on for 983 lines)

So there are nine hundred and eighty-three matches. After the word
count, the line for each match gives the block in the file, the word
within the block, the file number, and the filename.

More useful things to do include:

    // see some of the matching text:
    $ lqphrase 'wept bitterly' | lqkwic
    ==== Document 1: /home/mieza/lee/text/bible/KingJames/NT/Matthew/matt26.kjv ====
    1: thrice. And he went out, and wept bitterly.
    ==== Document 2: /home/mieza/lee/text/bible/KingJames/NT/Luke/luke22.kjv ====
    2:22:62 And Peter went out, and wept bitterly. 22:63 And the men that held Je
    $

    // which words contain "foot" or "feet"?
    $ lqwordlist -g "f[oe][oe]t"
    afoot
    barefoot
    brokenfooted
    clovenfooted
    feet
    foot
    footmen
    footstep
    footstool
    fourfooted

    // documents containing "shoe" and "barefoot"
    $ lqrank "barefoot" "shoe" | lqkwic
    ==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Isaiah/isa20.kjv ====
    1:ff thy loins, and put off thy shoe from thy foot. And he did so, walking na
    2: he did so, walking naked and barefoot. 20:3 And the LORD said, Like as my
    3: Isaiah hath walked naked and barefoot three years [for] a sign and wonder
    4:ves, young and old, naked and barefoot, even with [their] buttocks uncovere

    // save a query... docs containing any of the following:
    $ lqrank -r or serpent witch snake stick rod > skinny-things
    // documents containing abraham said, or god of abraham:
    $ lqrank -r or "abraham said" "God of Abraham" > abe
    // documents appearing in both sets of results (intersect), if any:
    $ lqrank -r and -f skinny-things -f abe | lqkwic
    ==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Exodus/exod04.kjv ====
    1:in thine hand? And he said, A rod. 4:3 And he said, Cast it on the ground.
    2:n the ground, and it became a serpent; and Moses fled from before it. 4:4 A
    3:nd caught it, and it became a rod in his hand: 4:5 That they may believe th
    4:ORD God of their fathers, the God of Abraham, the God of Isaac, and the God
    5:4:17 And thou shalt take this rod in thine hand, wherewith thou shalt do si
    6: of Egypt: and Moses took the rod of God in his hand. 4:21 And the LORD sai
    $
    // Ah, it was Moses I was thinking of...

The "lq" shell script is much more convenient for simple queries.
It's interactive -- give it a try.


How to Install lq-text

    see the file INSTALL


How to Build an Index (see doc/*)

Make a directory $HOME/LQTEXTDIR (or set $LQTEXTDIR to point to the
(currently empty) directory you want to contain the new database).
Include lq-text/src/bin and lq-text/src/lib in your search path if you
haven't done a "make install" yet.

Put a config.txt file in $LQTEXTDIR:
    docpath /my/login/directory:/or/somewhere/else
    common Common
and make an empty file called Common (or include words like "the" that
you don't want indexed; see the next section) in the same directory.
You can copy lq-text/Sample/config.txt if you want, and then edit it.

Find some files (e.g. your mailbox) and say
    lqaddfile -t2 file [...]
You should see some diagnostic output... (this is what -t2 does).
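The setup steps above can be collected into a small shell function. This
is only a sketch: the name lq_init_db and its two arguments are made up
for illustration and are not part of lq-text. It writes the two
config.txt lines described above and an empty Common file, and does
nothing else.

```shell
# Create a fresh lq-text database directory (hypothetical helper).
# Usage: lq_init_db DIR DOCPATH
lq_init_db() {
    db_dir=$1
    doc_path=$2
    mkdir -p "$db_dir" || return 1
    # config.txt: the docpath and common lines described in the text.
    printf 'docpath %s\ncommon Common\n' "$doc_path" \
        > "$db_dir/config.txt" || return 1
    # Create Common empty (this also empties an existing one, so only
    # use it on a new directory).
    : > "$db_dir/Common"
}

# Example use (the paths are placeholders):
#   LQTEXTDIR=$HOME/LQTEXTDIR; export LQTEXTDIR
#   lq_init_db "$LQTEXTDIR" "$HOME:/or/somewhere/else"
#   lqaddfile -t2 file [...]
```

Copying lq-text/Sample/config.txt and editing it is probably the better
route if you want more than these two settings.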
lqaddfile may take several minutes to write out its data, depending on
the system. Try a small file first -- you can add more later!

Another fun thing to try is setting DOCPATH to /usr/man and running
    cd /usr/man
    find man* -type f -print | lqaddfile -t2 -f -
to make an index of the manual pages (use cat* instead of man* if you
prefer).

If you have less than 10 meg or so of RAM, give lqaddfile the -w100000
option -- this is the number of words to keep in memory before writing
to the database. The idea is that the number should be small enough to
prevent frantic paging activity! I find that on my Sun 4/110, -w100000
makes lqaddfile grow to maybe 2 megabytes; 300000 takes it up to 8 or 10
megabytes, but makes it run a *lot* faster.

It's best to add lots of files at once, as in the example above using
find(1), rather than adding a file at a time - it can make a very large
difference in indexing speed, although probably no difference in
retrieval times in most cases. The index will be very slightly larger
if you index individual files with multiple runs of lqaddfile.


How to make the index slightly smaller
(skip ahead to "Searching the index" if you are not interested in this)

This section is here because the size of the index is very important for
some applications, and saving even a few percent is useful.

The common word list is searched linearly, so it is worth keeping it
fairly short. The best way is to make it empty, then after indexing
your files, use this command to find the most common 20 words:

    $ lqwordlist -u -n -g . | sort +1nr | sed 20q
    the 192910
    of 110798
    and 83969
    to 64403
    a 52039
    in 49124
    is 42962
    that 31834
    i 30915
    it 30071
    as 21029
    for 18499
    this 18002
    with 17705
    be 16031
    are 15174
    by 14951
    he 14652
    was 14121
    not 13498

(Newer versions of sort no longer accept the old "+1nr" key syntax; use
"sort -k2,2nr" there instead.)

You can see that in my database, "the" occurs approx. ten times as often
as "for". If the database didn't include "the", it would be smaller.
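If you want the ranking step as a reusable filter, something like the
following might do. It is a sketch, not part of lq-text: top_words is a
made-up name, it assumes input lines of the form "word count" as in the
listing above, and it uses the current "-k" key syntax of sort.

```shell
# Print the N most frequent words, one per line.
# Reads "word count" lines on standard input.
# Usage: top_words N
top_words() {
    sort -k2,2nr |          # highest count first
        sed "${1}q" |       # keep the first N lines
        awk '{ print $1 }'  # drop the counts
}

# Example use (needs an existing index):
#   lqwordlist -u -n -g . | top_words 20 > candidates
```

Treat the result as a list of candidates to edit by hand rather than as
a finished common-word file. That the file holds one word per line is
an assumption here; check Sample/config.txt and the files next to it.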
Here is the database with those words in it:

    $ du -k LQTEXTDIR
    12325 LQTEXTDIR
    $ ls -l LQTEXTDIR/
    total 12324
    -rw-r--r-- 1 liam liam     108 Apr 27 22:54 config.txt
    -rw-r--r-- 1 liam liam 9076736 Apr 27 23:10 data
    -rw-r--r-- 1 liam liam  270336 Apr 27 23:10 filelist
    -rw-r--r-- 1 liam liam    7149 Apr 27 23:02 files
    -rw-r--r-- 1 liam liam   35840 Apr 27 23:10 freelist
    -rw-r--r-- 1 liam liam     307 Apr 27 23:15 index.html
    -rw-r--r-- 1 liam liam  237568 Apr 27 23:10 lastblks
    -rwxr-xr-x 1 liam liam    7311 Apr 27 23:15 nph-search.cgi
    -rw-r--r-- 1 liam liam 1875968 Apr 27 23:10 widindex
    -rw-r--r-- 1 liam liam 1286144 Apr 27 23:10 wordlist

Here is the new commonwords file:
    the
    of
    and
    a
    is
    that

After running the index again (which took 40 seconds):
    $ lqaddfile -t3 -w5000000 -H655350 -f LQTEXTDIR/files
I got the following:
    $ du -k LQTEXTDIR/
    11820 LQTEXTDIR/

I checked "the" wasn't there:
    bash-2.03$ lqword the
    lqword: warning: No index information for: the (too common)

In this case, we saved about 505 kbytes out of 12 megabytes, or a little
over 4% of the database size. It really wasn't worth it.

If you are not planning to add files to the index or to unindex files,
you can remove or truncate the "freelist" and "lastblks" files for
another 2% or so space saving.
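The saving from the common-word change can be checked from the two
"du -k" figures. A minimal sketch (pct_saved is a made-up helper, not
an lq-text command):

```shell
# Percentage saved going from BEFORE to AFTER (same units),
# printed to one decimal place.
# Usage: pct_saved BEFORE AFTER
pct_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

pct_saved 12325 11820    # prints 4.1
```

The same function works for any before/after pair of sizes you measure
on your own database.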
If you are not planning on using lqquery with wildcards, you can also
save space by including the line
    wordsinindex off
taking us down to 11282 kbytes for the index, a savings of over 9%

If you turn indexnumbers to off, and set the shortest word indexed to
3 bytes, and the longest to 16 (so that longer ones are truncated and
indexed as if they were the same), the directory looks like this:

    $ ls -l LQTEXTDIR/
    total 9754
    -rw-r--r-- 1 liam liam     343 May 30 17:29 common
    -rw-r--r-- 1 liam liam     123 May 30 17:52 config.txt
    -rw-r--r-- 1 liam liam 6828032 May 30 17:52 data
    -rw-r--r-- 1 liam liam  270336 May 30 17:52 filelist
    -rw-r--r-- 1 liam liam    7149 Apr 27 23:02 files
    -rw-r--r-- 1 liam liam     307 Apr 27 23:15 index.html
    -rwxr-xr-x 1 liam liam    7311 Apr 27 23:15 nph-search.cgi
    -rw-r--r-- 1 liam liam 1769472 May 30 17:52 widindex
    -rw-r--r-- 1 liam liam 1277952 May 30 17:52 wordlist

We're now 22.3% smaller than when we started, but we have lost some
precision.

It's also possible to save space by restricting the per-occurrence
flags; see Sample/config.txt for more information.


Searching the index

Now try
    lqword ---> an unsorted list of all known words
    lq     ---> type phrases and browse through them
    lqtext ---> curses-based browser, if it compiled.
    lqrank ---> a sorted list of matches
    lqphrase "floppy disk" | lqkwic -f -
           ---> this is the most fun.
    lqphrase "floppy disk" | lqshow -f -
           ---> lq does this for you

If the files you are indexing have pathnames with leading bits in
common (e.g. indexing a directory such as /usr/spool/news, or
/home/zx81/lee/text/humour), make use of DOCPATH. This is searched
linearly, so a dozen or so entries is the practical limit at the moment.

For example, if your config.txt file contained the line
    docpath /usr/spool/news:/shared-text/books:.
and you ran the command
    lqaddfile simon/chapter3
lqaddfile would look for
    /usr/spool/news/simon/chapter3
    /shared-text/books/simon/chapter3
    ./simon/chapter3
in that order.
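To see the search order for a docpath of your own, a small shell
function can print the candidates. This is only an illustration of the
order described here; docpath_candidates is a made-up name, and the
sketch does not claim to reproduce everything lqaddfile does (absolute
names and empty docpath entries, for instance, are not considered).

```shell
# Print each place a relative NAME would be looked for, in docpath order.
# Usage: docpath_candidates DOCPATH NAME
# The body runs in a subshell so that IFS and "set -f" stay local.
docpath_candidates() (
    IFS=:       # split the docpath on colons
    set -f      # no filename expansion on the pieces
    for dir in $1; do
        printf '%s/%s\n' "$dir" "$2"
    done
)

docpath_candidates "/usr/spool/news:/shared-text/books:." "simon/chapter3"
# prints:
#   /usr/spool/news/simon/chapter3
#   /shared-text/books/simon/chapter3
#   ./simon/chapter3
```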
But it would only need to store "simon/chapter3" in the index, and this
can save a lot of space if you index large numbers of files. Of course,
it's up to you to ensure that all of the filenames you pass to lqaddfile
are unique!

If you create a file called "titles", lqkwic will display the document
titles alongside filenames. The file lives in $LQTEXTDIR, and contains
a file number, a tab, then a description, all on the same line. The
first line must be a comment (starting with a #).
    # document titles
    1	title for document 1
    2	title for document 2
and so on.

There is a sample CGI script in the http/ directory.

You can also use the C libraries and header files directly; the
directory api/doc contains some documentation on this subject.

Finally, if you use this package, you have to go barefoot for 24
consecutive hours within a week of first using it. See LICENCE for
details. And yes, "licence" is the usual UK spelling.

Lee

Liam R. E. Quin
liam@holoweb.net