It's common practice in text retrieval to omit words from the database if they
occur very often; the list of omitted words is called a stoplist.
For example, `and', `the' and `to' don't seem to add very much information.
However, in certain circumstances, such as `The Times', or `Bitwise and',
the words are suddenly of great significance.
There are three approaches to this.
First, you can say that people looking for `The Times' are out of luck.
Second, you can index all of the words and accept a larger index. This penalty is usually between one and thirty percent of the total index size, and is usually acceptable.
Third, you could specify a list of contexts in which words in the stoplist are to be indexed anyway.
There are three problems with this last approach.
First, a query doesn't give you enough context to determine what to do about those words.
Second, you have to think of all the contexts in advance; if you didn't think of `The Times', the user would still be out of luck.
Finally, lq-text doesn't support this third approach directly, although you could modify lq-text, perhaps using the routines in this category.
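
The three approaches can be compared with a small sketch. This is only an illustration in Python, not lq-text code: the names STOPLIST, KEEP_CONTEXTS and words_to_index are hypothetical, and the word splitting is deliberately naive (lower-case, split on white space).

```python
# Illustrative only: none of these names come from lq-text.
STOPLIST = {"and", "the", "to"}

# Approach 3: phrases in which stoplist words are indexed anyway.
# The list has to be written in advance, which is one of its problems.
KEEP_CONTEXTS = [("the", "times"), ("bitwise", "and")]


def words_to_index(text, use_stoplist=True, contexts=()):
    """Return the (position, word) pairs that would be put in the index."""
    words = text.lower().split()
    keep = set()  # positions protected by one of the listed contexts
    for phrase in contexts:
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == tuple(phrase):
                keep.update(range(i, i + n))
    return [
        (i, w)
        for i, w in enumerate(words)
        if not use_stoplist or w not in STOPLIST or i in keep
    ]


text = "I read The Times and the news"
approach_1 = words_to_index(text)                          # plain stoplist
approach_2 = words_to_index(text, use_stoplist=False)      # index everything
approach_3 = words_to_index(text, contexts=KEEP_CONTEXTS)  # listed contexts
```

With this input, approach_1 keeps four words and loses the `the' of `The Times', approach_2 keeps all seven, and approach_3 keeps five: the four from approach_1 plus the protected `the'. A phrase that is not in KEEP_CONTEXTS gets no such protection, which is the second problem described above.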