Note: the author of this paper has moved to liamquin at interlog dot com
Three distinct kinds of index are used. The first uses ndbm, a dynamic hashing package [Knut81]. The second type of index is a fixed record size random-access file, read using lseek(2) and a cache. Finally, the third index contains linked lists of variable-sized blocks; the location of the head of each chain is stored in one of the other indexes, as detailed below. This third index is used to store the variable-sized data constituting the list of matches for each word-form. The combination of the three index schemes allows fast access and helps to minimise space overhead.
Two ndbm-clone packages are distributed with lq-text. One of these, sdbm, has been widely distributed and is very portable [Yigi89]. The other, db, is part of the 4.4BSD work at Berkeley, and is described in [Selt91]. A general discussion on implementing such packages was distributed in [Tore87]. See also ndbm(1).
Two ndbm databases are used: one maps file names into File Identifiers (FIDs), and the other maps natural-language words into Word IDentifiers (WIDs).
|title||Star Eyes watched the jellycrusts peel|
|last indexed||12th Dec. 1991|
The size and date fields shown in the table are used to prevent duplicate indexes of the same file, so that one can run lqaddfile repeatedly on all the files in a directory and add only the new files to the index (no attempt is made to detect duplicated files with differing names). In addition, the file size allows lq-text to make some optimisations to reduce the size of the index, such as reserving numerically smaller file identifiers for large files. The reasoning is that larger file are likely to contain more, and more varied, words. Since the file identifier has to be stored along with each occurrence of each word in the index, and since (as will be shown) lq-text works more efficiently with smaller numbers, this can be a significant improvement.
In fact, all of the above information except the document title is stored directly in the ndbm index. Using a multi-way trie would probably save space for the file locations, but the file index is rarely larger than 10% of the total index size. The DOCPATH configuration parameter and Unix environment variable supported by lq-text allow the file names to be stored as relative paths, which can save almost as much space as a trie would, and at the same time allows the user to move entire hierarchies of files after an index has been created. The document title is kept in a separate text file, since users are likely to want to update these independently of the main text.
For each unique word (that is, for each lemma), lq-text stores the following information:
|the Word itself||(wordlength)|
|No. of Occurrences||4|
|Total||10 + wordlength|
The word length and the word are stored only if the WordList feature is not disabled in the database configuration file. The Overflow Offset is the position in the data overflow file (described below) at which the data for this word begins. The remainder of the word index record is used to store the first few matches for this word. In many cases, it turns out that all of the matches fit in the word index, and the Overflow Offset is set to zero.
The flags are intended to be a bitwise `and' of all the flags in the per-word entries described below, but this is not currently implemented, and the space is not reserved. When implemented, this will let lq-text detect whether all occurrences of a word have a given property, such as starting with a capital letter. This information can then be used to recreate the correct word form in generating reports, and when searching the vocabulary.
|Field||Size in Bytes|
|FID (file identifier)||4|
|Word In Block||1|
Since, on average, English words are approximately four characters long, and allowing one character for a space between words, one might expect the index to be approximately double the size of the data from the per-match information alone. Many commercial text retrieval systems do as badly, or worse [Litt94].
In fact, the information is stored compressed. The matches are stored sorted first by FID and then by block within file, then by word in block. As a result, for all but the first match it is only necessary to store the difference from the previous FID. Furthermore, matches for a single FID are grouped together, so that it isn't necessary to repeat the FID each time. The informtion ends up being stored as follows:
[DELTA]FID (difference from previous File IDentifier) Number of following matches using this FID Block In File Word In Block Flags (if present) Separation (if not 1) [DELTA]Block In File Word In Block Flags (if different from previous) Separation (if Flags and/or Seperation differ from previous)
Storing lower-valued numbers makes the use of a variable-byte representation particularly attractive. The representation used in lq-text is that the top bit is set on all but the last byte in a sequence representing a number. Another common representation is to mark the first byte with the number of bytes in the number, rather like the Class field of an Internet address, but this means that fewer bits are stored in the first byte, so that there are many more two-byte numbers.
The flags are stored only if they are different from the previous match's
This is indicated on the disk
by setting the least significant bit in the Word In Block value;
this bit is automatically bit-shifted away on retrieval.
Further, the separation is only stored if the flags indicate that the
value is greater than one (whether or not the flags were stored explicitly).
The combination of delta-coding, variable-byte representation and
optional fields reduces the size of the average match stored on disk
from eleven to approximately three bytes.
For large databases, the index size is about half the size of the data.
Furthermore, since lq-text has enough information stored that it can
match phrases accurately without looking at the original documents,
it is reasonable to compress and archive (in the sense of ar(1))
the original files.
When presenting results to the user, lq-text will fetch the files from
such archives automatically.