A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com


Performance

This section describes some work that was done to measure and improve the performance of lq-text, and then gives some actual measurements and timing comparisons with other systems.

Profiling

lq-text originally took over 8 hours to index the King James Bible on a 25 MHz Acer PC running Interactive Unix. Extensive profiling, and careful tuning of cache algorithms, improved performance dramatically: the time to index the Bible has been reduced to under five minutes.

Function Calls

Although most C compilers have a fairly low function call overhead these days, it's still not trivial. Functions called for every character of the input were folded inline, and those called for every word were made into macros in many cases. Understanding why each function was called the number of time it was proved a big help both in speeding up the programs and in debugging them.

At one point, lqaddfile was spending over 40% of its time in stat(2). It turned out that it was opening and closing an ndbm database for every word of input, which was suboptimal.

Now, most of the routines spend more than half of the time doing I/O, and no single function accounts for more than 10% of the total execution time.

Timings

The performance of lq-text is compared with SunOS 4.1 grep, GNU grep (ggrep), and Udi Manber's agrep. The agrep timings reflect only the simplest use of that program, since the goal was to generate comparable results. For lq-text, the time to build the index is also reported. Recall that the index only needs to be built once.

The following searches were timed; since the results for the various forms of grep were always very similar for any given set of files, the grep timings are only given once for each collection of data.

0. Not There

something not in the index at all, a nonsense word; the time was always 0.0 for lq-text, as reported by time(1), irrespective of the size of the database. This timing is therefore omitted from the table.

1. Not Found

a phrase made up of words that do occur, but not in the order given (if `gleeful' and `boy' each occur, but `gleeful boy' does not, `gleeful boy' would be such a search).

2. Common

a phrase that occurs infrequently, but includes a relatively frequent word.

3. Unusual

a word or phrase that occurs infrequently

The following small corpora were used:

Man Pages

The on-line manuals from SunOS 4.1, a total of twelve megabytes.

Bibles

The King James and New International Bibles, and the Moby Complete Works of Shakespeare, a total of over 15 megabytes.

FAQ

The netnews news.answers newsgroup of approximately 1150 articles, totalling over 40 megabytes.

Timings

Indexegrepggrepagreplq-text
[1] Not Found[2] Common[3] Unusual
man pages41.537.145.91.80.00.3
Bible60.255.567.71.81.20.1
FAQs181.9179.2115.70.31.00.2

Index Statistics

Index	Size	Creation Time
	Data	Index	real	user	sys
man	12M	6.5M	267.4	124.9	64.5
Bible	15M	6.6M	351.3	136.6	51.0
FAQs	41M	20.5M	1598.8	851.9	375.8

Timing Environment

A SPARCstation 10/30 (1 processor, 30 MHz, no cache) with 64 MBytes of memory was used for the timings. The system was not equipped with Wide SCSI disks. The timings are given in real time, using time(1), as this is the most important in practice. Each timing was performed several times, and an average taken of all but the first. This favours the grep algorithms somewhat, since it reduces the impact of the I/O that they do.

The lq-text timings do not include the time to produce the text of the match, for example with lqkwic. However, running lq-kwic added less than one second of run­time for any except the very large queries, even when the data files were accessed over NFS.

The index overhead is approximately 50% of the size of the original data. This can be controlled to some extent using stop-words; the news.answers database used 79 stop-words, reducing the database by about 2 Megabytes. In addition, single-letter words were not indexed, although the presence of a single-letter word was stored in the per-word flags. The other databases used no stop words, and indexed words from 1 to 20 characters in length - the differences are because the FAQ index is accessed by a number of other users on our site.

Results

As one would expect, lq-text easily out-performs the grep family of tools. For queries producing a lot of matches, such as `and the' (1790 occurrences in the SunOS manual pages), the time taken to print the matches dominates the run-time of lqphrase.

Next   Top