Some simple notes about how it all works...

@(#) $Revision: 1.2 $

(1) Overview

lq-text keeps a number of separate files.  Files ending in .dir or
.pag are ndbm files.  I have used ndbm for the index because it's so
fast at accessing things.
** If you compiled with db, the files won't end with .dir or .pag

$(LQTEXTDIR)/wordlist.{dir,pag}
    Every distinct word has an entry containing the word number.  In
    order to reduce space, sWriteNumber is used (this is described
    below, but basically it writes a variable number of bytes
    depending on the size of the number).  This number is an index
    into:

$(LQTEXTDIR)/WIDIndex
    This has -- for each distinct word -- a block of WIDBLOCKSIZE
    bytes which contains:
    *The length of the word (1 byte)
    *The word itself (no trailing \0, hence Min...MaxWordLength bytes)
    Offset into the physical database "$(LQTEXTDIR)/data", written in
        from one to five bytes as necessary
        [actually the format is more like
            { FID No_of_times { WID No_of_times { block word flags dist }* }* }*
         see the comments in liblqtext/{r,w}pblock.c for details]
    Number of occurrences all told (again 1-5 bytes)
    *At least one byte, and generally more, of data (see under "data")
    [* - these are doomed, may already be gone]

    Given a word, one can find the WID with Word2WID().  Given a WID,
    one can find the corresponding string with WID2Word().

$(LQTEXTDIR)/FileList.{dir,pag}
    There are two entries here for every file that lq-text has seen:
    (a) to allow one to get a filename from a FID
    (b) to get a (t_FileInfo *) from a filename; this contains
        FID (like a WID...)
        Pathname
        Date last indexed (seconds since Jan 1st 1970)
        Filter type (see below)
    There is room in the returned structure for a (FILE *), which can
    be used if the file is opened.
    When support for file structure (SGML) is added, a FIDIndex file
    will appear just like WIDIndex, and this file will shrink
    accordingly.

$(LQTEXTDIR)/data
    This is the biggie, folks!
    The data file is divided into blocks of BLOCKSIZE bytes --
    currently 64 bytes each... subject to change.  There's an average
    of BLOCKSIZE/2 bytes wasted for every distinct word...

    Each block is part of some linked list or other... and has a
    NextOffset field.  This is a block offset, multiplied by two to
    leave a single bit free at the bottom.  Block 0 is unused.
    Block 1 is the first data block.

    Data blocks contain:
        a pointer to the next block in this chain (0 if none) (4 bytes)
        and then the actual matches.
    The first block in each chain is pointed to by a WordInfo entry
    stored in the wordlist (see above), and the first few bytes of
    data are also there.  In some cases, all of the data for a word
    fits in the WIDIndex file, saving more space.

    The layout of the data is described in detail in comments in
    pblock.c; here is a summary:

    For each word in the database we have to keep a list of every
    occurrence of that word in every file.  Hence the data is the
    list of occurrences.  Each occurrence fits in memory into a
    t_WordPlace struct, which has
        the file number
        the block in the file
        the word within the block
        the distance since the last word (bytes)
        a byte of flags.

    In order to write these economically, we store
    + A FID * 2 (1..5 bytes), with the bottom bit set meaning that
      there is more than one match of this word in this file
    + The number of times that FID occurs (if appropriate)
    + For each of those occurrences:
      . the block number, stored as the difference from the last
        block (reset to zero for each file, of course) (1..5 bytes)
      . the word in the block, plus 1 bit if there are flags (1 byte)
      . flags (1 byte)
      . the distance from the last word, if the appropriate flag bit
        is set -- otherwise it's assumed to be 1 byte away (1 byte).

    Hence some typical sizes of entries are:
                    max...   small...   min...
    FID               5         1         5    -- shared among up to 255
    n                 -         1         1    --    entries
    block             5         1         1
    word              1         1         1
    flags             1                   0
    distance          1                   0
    ========================================
                     13         4         2  + share of 2..6

    It's important that it doesn't take too long to unpack all of
    these things, of course.  There are _literally_ millions of these
    entries in even a few megabytes' worth of database, so the
    compression turns out to be good value.

    The compressed numbers are written a byte at a time, with the top
    bit set if there is another byte to follow.  Hence, a small number
    will take up only one byte.  Most numbers are small.

Algorithms
==========

1. Adding files (addfile)
    * check that the file is not already indexed (this uses
      stat(file) and FileInfo->DateLastIndexed to check for changes)
    * determine the file type
    * run the appropriate filter, if any
    * read the output
      + for each indexable word,
        . add the word and its offset to a symbol table
    * at the end, or when the symbol table fills,
      + for each word we've seen
        . append the results to the existing entry for the word, or
          create a new entry
        . delete the word from the symbol table, to save space (this
          is not done at the very end of the run, however, to save
          time)

2. Retrieving Words (wordinfo, wordindex.c)
    * for each word
      . find the WordInfo entry from the database
        (WID2WordInfo(Word2WID()))
      . get the pblock [see below] with Getpblock() (pblock.c)
      . for each match, get the FileInfo and print the name and
        offset.

    An alternative is GetWordPlacesWhere(), which takes a function to
    call for each word place as it is read from disk.

3. Retrieving a Phrase (tryphrase.c, Phrase.c)
    * For each phrase
      . Make a new string, containing only the words we index.
      . Build up a t_Phrase structure, containing a WordInfo and a
        pblock for each indexed word.  If a word isn't in the index,
        clearly the phrase won't be either.
      . For each FID/Offset pair in the first word's pblock
        + for each word in the phrase
          . see if the word has a pair with a matching FID, and an
            Offset reasonably close;
              no  --> go on to the next pair
              yes --> maybe a partial match

In other words, we grow a list of matches until they all either reach
the end of the phrase or come to a sticky end, wither and die (like my
houseplants!)

There are oodles and oodles of comments in Phrase.c which might help.
The simplest example of using String2Phrase() and MakeMatches() is in
lqphrase.c, although that part of lqtext.c is quite simple too.