A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com

The lq-text Implementation

This section describes the implementation of lq-text.

Information Stored

Information is stored about the documents indexed, and about each distinct word-form (for example, information about occurrences of sock and socks is all stored under sock; see the discussion of stemming under Indexing below).

Three distinct kinds of index are used. The first uses ndbm, a dynamic hashing package [Knut81]. The second type of index is a fixed record size random-access file, read using lseek(2) and a cache. Finally, the third index contains linked lists of variable-sized blocks; the location of the head of each chain is stored in one of the other indexes, as detailed below. This third index is used to store the variable-sized data constituting the list of matches for each word-form. The combination of the three index schemes allows fast access and helps to minimise space overhead.

Dynamic Hashing Packages

These have a number of desirable properties, including minimal disk access, since usually only two disk accesses are needed to retrieve any item, and automatic expansion, since the hash function simply gets wider as the database grows, allowing updates at any time. In addition, the technology is widely available, since many Unix systems include ndbm, and there are also much faster ndbm-clone implementations available.

Two ndbm-clone packages are distributed with lq-text. One of these, sdbm, has been widely distributed and is very portable [Yigi89]. The other, db, is part of the 4.4BSD work at Berkeley, and is described in [Selt91]. A general discussion on implementing such packages was distributed in [Tore87]. See also ndbm(1).

Two ndbm databases are used: one maps file names into File Identifiers (FIDs), and the other maps natural-language words into Word IDentifiers (WIDs).

Fixed Record Indexes

A word identifier (WID) as obtained from the ndbm word map is taken as a record number into a fixed size record file, widindex. The information found in this record is described later; for now, it suffices that one of the fields is a pointer into the linked lists of blocks in the final file, data.

Linked Lists of Blocks

Each word can occur many times in any number of files. Hence, a variable-sized data structure is needed. A linked list of 64-byte disk blocks is used. However, where adjacent blocks in the data correspond to the same thread, they are coalesced, rather as in the Berkeley Fast File System [McKu83]. Although an lseek(2) and a read(2) may be required for each 64-byte block, the Unix block buffer cache makes this arrangement relatively efficient. A Least Recently Used (LRU) cache holds a number of 16Kbyte segments of these 64byte blocks, giving a significant speedup, especially over NFS, where write(2) is synchronous.

Perfile Information

Each document is assigned a File Identifier when it is indexed. A File Identifier, or FID, is simply a number. Conceptually, this number is then used to store and retrieve the information thus:

Field	Example
location	`/home/hermatech/cabochon/letters`
name	`tammuz.ar(june-14.Z)`
title	Star Eyes watched the jellycrusts peel
size	1352 bytes
last indexed	12th Dec. 1991

Here, /home/hermatech/cabochon/letters is an absolute path to a directory; if a relative or unrooted path is given, lq-text will search along a document path specified in the database configuration file, or by the DOCPATH environment variable. The search is performed on retrieval, so that the prefix to the path needn't be stored in the database. The document name is here given separately to emphasise that once a document has been indexed, it can be moved, or even compressed and then stored as a member of an archive, as here: lq-text will automatically extract june-14.Z from the ar-format archive tammuz.ar and run uncompress on the result to retrieve the desired document.

The size and date fields shown in the table are used to prevent duplicate indexes of the same file, so that one can run lqaddfile repeatedly on all the files in a directory and add only the new files to the index (no attempt is made to detect duplicated files with differing names). In addition, the file size allows lq-text to make some optimisations to reduce the size of the index, such as reserving numerically smaller file identifiers for large files. The reasoning is that larger file are likely to contain more, and more varied, words. Since the file identifier has to be stored along with each occurrence of each word in the index, and since (as will be shown) lq-text works more efficiently with smaller numbers, this can be a significant improvement.

In fact, all of the above information except the document title is stored directly in the ndbm index. Using a multi-way trie would probably save space for the file locations, but the file index is rarely larger than 10% of the total index size. The DOCPATH configuration parameter and Unix environment variable supported by lq-text allow the file names to be stored as relative paths, which can save almost as much space as a trie would, and at the same time allows the user to move entire hierarchies of files after an index has been created. The document title is kept in a separate text file, since users are likely to want to update these independently of the main text.

Perword Information

For each unique word (that is, for each lemma), lq-text stores the following information:

Field Bytes

Word length 1

the Word itself (wordlength)

Overflow Offset 4

No. of Occurrences 4

Flags 1

Total 10 + wordlength

Field	Bytes
Word length	1
the Word itself	(wordlength)
Overflow Offset	4
No. of Occurrences	4
Flags	1
Total	10 + wordlength

The word length and the word are stored only if the WordList feature is not disabled in the database configuration file. The Overflow Offset is the position in the data overflow file (described below) at which the data for this word begins. The remainder of the word index record is used to store the first few matches for this word. In many cases, it turns out that all of the matches fit in the word index, and the Overflow Offset is set to zero.

The flags are intended to be a bitwise `and' of all the flags in the per-word entries described below, but this is not currently implemented, and the space is not reserved. When implemented, this will let lq-text detect whether all occurrences of a word have a given property, such as starting with a capital letter. This information can then be used to recreate the correct word form in generating reports, and when searching the vocabulary.

Peroccurrence Information

For each occurrence of each lemma (that is, for each match), the following information is stored:

Field	Size in Bytes
FID (file identifier)	4
Block Number	4
Word In Block	1
Flags	1
Separation	1
Total	11

Since, on average, English words are approximately four characters long, and allowing one character for a space between words, one might expect the index to be approximately double the size of the data from the per-match information alone. Many commercial text retrieval systems do as badly, or worse [Litt94].

In fact, the information is stored compressed. The matches are stored sorted first by FID and then by block within file, then by word in block. As a result, for all but the first match it is only necessary to store the difference from the previous FID. Furthermore, matches for a single FID are grouped together, so that it isn't necessary to repeat the FID each time. The informtion ends up being stored as follows:

    [DELTA]FID (difference from previous File IDentifier)
    Number of following matches using this FID
        Block In File
        Word In Block
        Flags (if present)
        Separation (if not 1)

        [DELTA]Block In File
        Word In Block
        Flags (if different from previous)
        Separation (if Flags and/or Seperation differ from previous)

Storing lower-valued numbers makes the use of a variable-byte representation particularly attractive. The representation used in lq-text is that the top bit is set on all but the last byte in a sequence representing a number. Another common representation is to mark the first byte with the number of bytes in the number, rather like the Class field of an Internet address, but this means that fewer bits are stored in the first byte, so that there are many more two-byte numbers.

The flags are stored only if they are different from the previous match's flags. This is indicated on the disk by setting the least significant bit in the Word In Block value; this bit is automatically bit-shifted away on retrieval. Further, the separation is only stored if the flags indicate that the value is greater than one (whether or not the flags were stored explicitly). The combination of delta-coding, variable-byte representation and optional fields reduces the size of the average match stored on disk from eleven to approximately three bytes. For large databases, the index size is about half the size of the data. Furthermore, since lq-text has enough information stored that it can match phrases accurately without looking at the original documents, it is reasonable to compress and archive (in the sense of ar(1)) the original files. When presenting results to the user, lq-text will fetch the files from such archives automatically.

Next Top