A Text Retrieval Package for the Unix Operating System
Liam R. E. Quin
SoftQuad Inc. (lee at sq.com)
Note: the author of this paper has moved to liamquin at interlog dot com
This section describes some work that was done to measure
and improve the performance of lq-text, and then gives some
actual measurements and timing comparisons with other systems.
Profiling
lq-text originally took over 8 hours to index the King James Bible
on a 25 MHz Acer PC running Interactive Unix.
Extensive profiling, and careful tuning of cache algorithms,
improved performance dramatically: the time to index the Bible
has been reduced to under five minutes.
Function Calls
Although most C compilers have a fairly low function call overhead these
days, it's still not trivial.
Functions called for every character of the input were folded inline,
and those called for every word were made into macros in many cases.
Understanding why each function was called the number of time it was
proved a big help both in speeding up the programs and in debugging them.
At one point, lqaddfile was spending over 40% of its time in
stat(2).
It turned out that it was opening and closing an ndbm database for
every word of input, which was suboptimal.
Now, most of the routines spend more than half of the time doing I/O, and
no single function accounts for more than 10% of the total execution time.
Timings
The performance of lq-text is compared with
SunOS 4.1 grep,
GNU grep (ggrep),
and Udi Manber's agrep.
The agrep timings reflect only the simplest use of that program,
since the goal was to generate comparable results.
For lq-text, the time to build the index is also reported.
Recall that the index only needs to be built once.
The following searches were timed;
since the results for the various forms of grep were always very
similar for any given set of files, the grep timings are only
given once for each collection of data.
0. Not There
something not in the index at all, a nonsense word;
the time was always 0.0 for lq-text, as reported
by time(1), irrespective of the size of the database.
This timing is therefore omitted from the table.
a phrase made up of words that do occur, but not
in the order given (if `gleeful' and `boy' each occur, but `gleeful boy'
does not, `gleeful boy' would be such a search).
2. Common
a phrase that occurs infrequently, but includes a relatively frequent word.
3. Unusual
a word or phrase that occurs infrequently
The following small corpora were used:
Man Pages
The on-line manuals from SunOS 4.1, a total of twelve megabytes.
Bibles
The King James and New International Bibles, and the
Moby Complete Works of Shakespeare,
a total of over 15 megabytes.
FAQ
The netnews news.answers newsgroup of approximately 1150 articles,
totalling over 40 megabytes.
Timings
Index Statistics
Index Size Creation Time
Data Index real user sys
man 12M 6.5M 267.4 124.9 64.5
Bible 15M 6.6M 351.3 136.6 51.0
FAQs 41M 20.5M 1598.8 851.9 375.8
Timing Environment
A SPARCstation 10/30 (1 processor, 30 MHz, no cache) with 64 MBytes of
memory was used for the timings.
The system was not equipped with Wide SCSI disks.
The timings are given in real time, using time(1), as this
is the most important in practice.
Each timing was performed several times, and an average taken of all but the
first. This favours the grep algorithms somewhat, since it reduces the
impact of the I/O that they do.
The lq-text timings do
not include the time to produce the text of the match, for example with
lqkwic.
However, running lq-kwic added less than one second
of runtime for any except the very large queries,
even when the data files were accessed over NFS.
The index overhead is approximately 50% of the size of the original data.
This can be controlled to some extent using stop-words; the
news.answers database used 79 stop-words, reducing the database
by about 2 Megabytes. In addition, single-letter words were not indexed,
although the presence of a single-letter word was stored in the per-word
flags.
The other databases used no stop words, and indexed words from 1 to 20
characters in length - the differences are because the FAQ index is
accessed by a number of other users on our site.
Results
As one would expect, lq-text easily out-performs the grep family
of tools.
For queries producing a lot of matches, such as `and the' (1790 occurrences
in the SunOS manual pages), the time taken to print the matches dominates the
run-time of lqphrase.
Next Top