lq-text FAQ

What is it?

lq-text is a text retrieval package. It makes an inverted index of your files, and can then later find words or phrases in those files.

You could use it for a web search page, or to help you analyse large bodies of text, or for a search function in some other program; I use it to find text files I've saved, and to search email.

why not just use grep?

It's much faster than grep and uses less I/O and CPU; it can also find things that grep can't, such as phrases that span a line break, or that contain punctuation or plurals (a search for foot can find feet too).
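To see why a phrase that spans a line break is awkward for grep, here is a small sketch. It uses only standard tools and made-up sample text; it shows grep's line-at-a-time behaviour and is not a demonstration of lq-text itself.

```shell
# Sample text where the phrase "quick brown" is split across two lines.
sample=$(printf 'the quick\nbrown fox\n')

# grep matches one line at a time, so the phrase is not found.
line_hits=$(printf '%s\n' "$sample" | grep -c 'quick brown' || true)

# Joining the lines first lets the same pattern match.
joined_hits=$(printf '%s\n' "$sample" | tr '\n' ' ' | grep -c 'quick brown' || true)

echo "line by line: $line_hits, joined: $joined_hits"
```

Joining lines by hand works for a small file, but it still reads every byte of the text on each search, which an index avoids.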

what languages?

Currently, lq-text is 8-bit clean, but it has built-in knowledge of English plurals and possessives. The knowledge is almost all in one C file (the code to allow an apostrophe within a word is elsewhere) so if you are interested in changing it, it should be fairly easy.

Searching collections where the files are in multiple languages would be harder.

what character sets?

Any 8-bit character set will work. You can also search UTF-8 files, but then lq-text won't know how to handle upper/lower case conversion and might get plurals wrong. Of course, English words in UTF-8 will generally work fine.

What file formats?

Plain text, email (RFC 822-style; lq-text doesn't know about MIME although it'd be easy to teach it), HTML and XML.

There is no support for structured searches within HTML or XML.

Retrieving

how fast?

For 30 megabytes of text, retrieval times of a fraction of a second are usual even on an old Pentium. Generally lq-text is likely to be one to two orders of magnitude faster than grep for common words, and much faster for searches involving unusual words.

Indexing speeds of over 400 MBytes/hour have been reported.

Update: that was in 1995; in 2005, speeds of over 100 MBytes per minute are possible on even an old laptop.

how are the results sorted?

By default, matching files are listed in the order in which they were indexed, and matches within files are in document order. You can also sort the results by other criteria, including how many of the search terms were found, or even alphabetically on the word immediately to the left of the match.

what control do I have over how the results are shown?

The lqkwik program has a powerful formatting language that lets you say what information is shown, and how much. There is also an lqsed program that can put markers around the start and end of each match, for example to highlight matched text.

query languages?

Currently there's no boolean query language as standard, although one was contributed at one point and is in the src/contrib directory.

how much memory?

For indexing, lq-text is careful not to assume that any of the data structures fit entirely into memory, so everything is read and written through cache mechanisms. You can control the size of lqaddfile's word cache by saying how many word occurrences it should save before writing out the cache, and also how many slots to allocate. For performance, words that have been seen only one or two times are not normally written out until the file in which they were seen has been completely indexed, but you can change that behaviour too.

For retrieval, lqphrase and lqquery might need to store all the matches of all the words in a phrase in memory at once, but they are written to avoid that whenever possible. Each occurrence will use approximately 20 bytes in memory, depending on CPU architecture.
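As a rough worked example of the 20-bytes-per-occurrence figure, the sketch below estimates the worst case for a three-word phrase. The occurrence counts are invented for illustration, and the real figure depends on your CPU architecture.

```shell
# Hypothetical phrase: three words with made-up occurrence counts.
occurrences=$((120000 + 45000 + 3000))

# Approximate per-occurrence cost from the text; varies by architecture.
bytes_per_occurrence=20

# Worst case: every match of every word is held in memory at once.
total_bytes=$((occurrences * bytes_per_occurrence))

echo "worst case: $total_bytes bytes (about $((total_bytes / 1000000)) MB)"
```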

how much CPU?

The lq-text code has been extensively profiled and refined over the past twelve years; as far as I'm aware there's no faster text retrieval system, free or commercial, that also supports updating the index.

The Index

how do I make an index?

The easiest way is to make a file that contains filenames, one per line, and then run
lqaddfile -f thefile
Look in the distribution for the README file and also at the Sample database directory.
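One way to build the list of filenames is with find. This is only a sketch: the scratch directory, the sample files and the *.txt pattern are assumptions for illustration, and lqaddfile is run only if it is installed (it also needs a configured database, as described in the README).

```shell
# Hypothetical layout: a scratch directory with two text files to index.
docs=$(mktemp -d)
printf 'one foot\n' > "$docs/a.txt"
printf 'two feet\n' > "$docs/b.txt"

# Write one filename per line, which is what lqaddfile -f reads.
# The list itself does not end in .txt, so find does not include it.
list="$docs/thefile"
find "$docs" -type f -name '*.txt' | sort > "$list"

# Only run the indexer if lq-text is actually installed.
if command -v lqaddfile >/dev/null 2>&1; then
    lqaddfile -f "$list"
fi
```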

how large is the index?

Usually between 40% and 70% of the size of your data, depending on the options you enable in the config.txt file and the nature of your data.
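For planning disk space, the 40% to 70% range can be turned into a quick estimate. The collection size here is a made-up figure.

```shell
# Hypothetical collection size in megabytes.
data_mb=1000

# Apply the 40% and 70% bounds from the text.
low_mb=$((data_mb * 40 / 100))
high_mb=$((data_mb * 70 / 100))

echo "expect roughly $low_mb to $high_mb MB of index"
```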

how fast is the indexer?

Speeds of 400 MBytes/hour have been reported on collections of over a gigabyte. The computer in that case was a dual-CPU Pentium running Red Hat Linux; similar speeds were reported several years earlier with a DEC Alpha system.

If you get a faster result let me know.

can I add files to the index?

Yes. Simply run lqaddfile with the new files to add.

can I delete files from the index?

Yes. Use lqunindex to remove files.

what if a file changes?

If its name changes, use lqrename; if it is about to change, use lqunindex on the old file, and then lqaddfile on the new one. You could do this as part of a CVS commit script, for example.

If a file has already changed, you can still delete it with lqunindex, but if you do that a lot there will be a slight performance and index size penalty, as words that only occurred in the file before it changed won't be removed from the index.
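The unindex, change, re-add order could be wrapped in a small helper like the one below. It is a sketch under two assumptions: that lqunindex and lqaddfile each accept a filename as an argument (check the documentation for the exact usage), and that the helper name reindex and the LQ_RUN variable are my own inventions. By default it only prints the lq-text commands; set LQ_RUN=command to run them for real.

```shell
# Hypothetical helper: re-index one file around an edit.
# Assumes lqunindex and lqaddfile each take a filename argument.
reindex() {
    file=$1
    shift
    # Remove the old contents from the index while the file is unchanged.
    ${LQ_RUN:-echo} lqunindex "$file"
    # Apply the change: any command given after the filename.
    "$@"
    # Index the new contents.
    ${LQ_RUN:-echo} lqaddfile "$file"
}

# Dry run with a no-op edit; prints the two lq-text commands in order.
reindex notes.txt true
```

Unindexing before the edit matters because of the point above: once the file has changed, words that occurred only in the old version can no longer be removed cleanly.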

Can I search the vocabulary itself?

Yes. Use sortwids after running lqaddfile, and then you can use lqwordlist to see which words match a POSIX regular expression:
lqwordlist -g 'a.*e.*i.*o.*u'
will find words in the index containing all the vowels, in order;
lqwordlist -n -g '.' | sort -k2nr | head -20
will print the twenty words that occur most frequently, along with how often they occur.

Note that the vocabulary is stored in lower case, with plurals more or less removed, so you'll find foot but not feet.
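If you want to check what the vowel pattern matches before running it against a real index, you can try it on a small word list with grep. This is a stand-in only: the three words are made up for the test, and lqwordlist's own matching may differ in detail from grep's.

```shell
# Made-up vocabulary, one lower-case word per line.
words=$(printf 'facetious\nabstemious\neducation\n')

# Keep only words containing a, e, i, o, u in that order.
matches=$(printf '%s\n' "$words" | grep 'a.*e.*i.*o.*u' || true)

echo "$matches"
```

Here education is rejected because no e follows its a, even though it contains all five vowels.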

Status

Licences: what can I do with it?

The software is free for non-commercial use, and is distributed under the barefoot licence; you have to go without shoes or socks for 24 hours within a week of first using the software. Send the author pictures or an account describing this experience.

For commercial use contact the author, Liam Quin, liam at holoweb.net; support is also available, whether or not you paid for the software.

lq-text is also available under the LGPL.
