Word Rules: Stemming and Morphology

Purpose

The Word Rules determine how the individual words in the files you index are interpreted by lq-text.

Currently, the Word Rules are specified in two places. The actual rules are compiled into lq-text itself, for efficiency, and you cannot change them very much without recompiling the whole package. Some of those rules, however, can be customised a little from the lq-text configuration file (config.txt).

The rules that you can change are documented here. Most of the other rules are documented in the C header file h/wordrules.h, and are implemented in liblqtext/readword.c and liblqtext/root.c. You will need to edit h/wordrules.h to change those rules.

The simplest way to customise the word rules is to change the list of flags that are stored.

Word Flags

When lq-text reads a word, either to index it or as part of a query, it associates various pieces of information, called Flags, with the word. A Flag is either set or not set; if it is not present, it is assumed that it is not set.

Most of the flags indicate that the word was altered in some way as it was read. For example, words in UPPER CASE are converted to lower case, and the UpperCase flag is set to mark that this was done.

List of flags

All (WPF_ALL)
This value sets all of the flags at once. It is useful if you want to set all but a few of the flags:
WordFlags All-HasStuffBefore|LastHadPunct|NextHasPunct

None (zero)
This sets none of the flags. Note that even if you use
WordFlags None
in the configuration file, lq-text will always include LastInBlock in the value.
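
The following Python sketch illustrates how a WordFlags value such as the ones above might be turned into a set of flags. It is not the lq-text parser: the bit values are placeholders (the real WPF_ values live in h/wordrules.h), the function name parse_word_flags is hypothetical, and the grammar is an assumption, namely that every flag named after the minus sign is removed from the flags named before it.

```python
# Placeholder bit values for illustration only; the real WPF_* constants
# are defined in h/wordrules.h and may differ.
FLAG_BITS = {
    "HasStuffBefore": 0x001,
    "LastHadLetters": 0x002,
    "LastHadPunct": 0x004,
    "LastInBlock": 0x008,
    "LastWasCommon": 0x010,
    "NextHasPunct": 0x020,
    "NextIsCommon": 0x040,
    "Plural": 0x080,
    "Possessive": 0x100,
    "UpperCase": 0x200,
}
ALL = sum(FLAG_BITS.values())


def _union(names):
    """Combine a '|'-separated list of flag names into one bitmask."""
    bits = 0
    for name in names.split("|"):
        name = name.strip()
        if name == "All":
            bits |= ALL
        elif name and name != "None":
            bits |= FLAG_BITS[name]
    return bits


def parse_word_flags(value):
    """Parse e.g. 'All-A|B', 'A|B' or 'None' (assumed grammar) into a bitmask."""
    included, _, excluded = value.partition("-")
    bits = _union(included) & ~_union(excluded)
    # The text above says LastInBlock is always included in the value.
    return bits | FLAG_BITS["LastInBlock"]


if __name__ == "__main__":
    print(hex(parse_word_flags("All-HasStuffBefore|LastHadPunct|NextHasPunct")))
    print(hex(parse_word_flags("None")))
```

Under these assumptions, a value of None still yields a mask with the LastInBlock bit set, which mirrors the behaviour described for the configuration file.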

HasStuffBefore (WPF_HASSTUFFBEFORE)
This flag is set if the distance between the end of the previous word read and the start of this word was other than one byte, or other than two bytes if this word will have the LastHadPunct flag set. In addition, if this flag is set, the actual distance is stored in the database, as a value between one and fifteen.
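
The distance rule can be sketched in Python as follows. The function name has_stuff_before is hypothetical, and clamping the stored distance into the range one to fifteen is an assumption; the text only says that the stored value lies in that range.

```python
def has_stuff_before(gap_bytes, last_had_punct):
    """Return (flag_is_set, stored_distance) for the gap before a word.

    gap_bytes is the distance from the end of the previous word to the
    start of this one; last_had_punct says whether this word will carry
    the LastHadPunct flag.
    """
    # The "normal" gap is one byte, or two when punctuation came before.
    expected = 2 if last_had_punct else 1
    if gap_bytes == expected:
        return False, None
    # Assumption: distances outside 1..15 are clamped into that range.
    stored = max(1, min(15, gap_bytes))
    return True, stored


if __name__ == "__main__":
    for gap, punct in [(1, False), (2, True), (2, False), (40, False)]:
        print(gap, punct, has_stuff_before(gap, punct))
```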

LastHadLetters (WPF_LASTHADLETTERS)
This flag is set if there were any alphabetic letters (as determined by the database Locale configuration parameter) between the end of the previous word read and the start of this word. Note that this can only happen if the MinWordLength configuration parameter is set to a value greater than one, as otherwise a single letter by itself would be treated as a normal word.

LastHadPunct (WPF_LASTHADPUNCT)
This flag is set if there were any punctuation characters (as determined by the database Locale configuration parameter) between the end of the previous word read and the start of this word.
Compare: NextHasPunct.

LastInBlock (WPF_LASTINBLOCK)
This flag is set on the last word in each block of the input. The block size is determined by the FileBlockSize configuration parameter. This flag is always stored when appropriate, so that lq-text can match a phrase that crosses a block boundary.

LastWasCommon (WPF_LASTWASCOMMON)
This flag is set if a word was skipped between the end of the previous word read and the start of the current word; this includes words listed in the StopList file, if any, and also numbers if the IndexNumbers configuration parameter is set to Off. During indexing, it may also include words that an input filter has decided not to index.
Compare: NextIsCommon.

NextHasPunct (WPF_NEXTHASPUNCT)
This flag is set if the word read after this one will have the LastHadPunct flag set. Note that in order to implement this, lq-text actually reads one word ahead.

NextIsCommon (WPF_NEXTISCOMMON)
This flag is set if the word read after this one will have the LastWasCommon flag set. Note that in order to implement this, lq-text actually reads one word ahead.
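
The relationship between the Last and Next flags can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the code in liblqtext/readword.c: the function name flag_words and the input format (a list of pairs giving each word and whether punctuation preceded it) are hypothetical, and only StopList-style skipping is modelled. Rather than buffering input, the sketch fills in the Next flags of the previous word once the following word has been read, which has the same effect as reading one word ahead.

```python
def flag_words(tokens, stop_words):
    """Assign Last*/Next* flags to the words that would be indexed.

    tokens is a list of (word, punct_before) pairs; stop_words is a set
    of words to skip. Returns a list of (word, set_of_flag_names).
    """
    indexed = []
    skipped = False  # a common word was skipped since the last indexed word
    punct = False    # punctuation was seen since the last indexed word
    for word, punct_before in tokens:
        punct = punct or punct_before
        if word in stop_words:
            skipped = True
            continue
        flags = set()
        if punct:
            flags.add("LastHadPunct")
        if skipped:
            flags.add("LastWasCommon")
        if indexed:
            # Mirror this word's Last* flags onto the previous word.
            previous_flags = indexed[-1][1]
            if punct:
                previous_flags.add("NextHasPunct")
            if skipped:
                previous_flags.add("NextIsCommon")
        indexed.append((word, flags))
        skipped = False
        punct = False
    return indexed


if __name__ == "__main__":
    words = [("cat", False), ("and", False), ("dog", False), ("fish", True)]
    for word, flags in flag_words(words, {"and"}):
        print(word, sorted(flags))
```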

Plural (WPF_PLURAL)
This flag is set if lq-text thought that the word was a plural form. Note that since there is no part of speech recognition, a third person singular verb form such as `dictates' in `she dictates' is considered to be a plural, and is read as dictate with the Plural flag set. See the discussion of Word Forms and Stemming below for more information on this topic.

Possessive (WPF_POSSESSIVE)
This flag is set if lq-text thought that the word was a possessive form. Currently, this is set if the word ends in 's (which is removed) or if the word ends in an s and is followed by an apostrophe.

UpperCase (WPF_UPPERCASE)
This flag is set if the word contained any upper case letters. Exactly which letters are upper case is determined by the database Locale configuration parameter, along with how to turn upper case letters into their lower case equivalents.

Word Forms and Stemming

When lq-text indexes a file, it tries to put all the forms of a given word together. For example, if you search for boy, you will also find boys. This collation is only done for noun forms, so that running and ran are not currently indexed under run.

The rules for determining plurals may be found in the lq-text source in liblqtext/root.c; in the future, these will probably be read externally.

The following examples may help to clarify what's going on.

Input Word    Indexed As    Flags
boy           boy           None
boys          boy           Plural
boy's         boy           Possessive
boys'         boy           Plural|Possessive
feet          foot          Plural
mice          mice          None
mouse         mice          None (not detected)
lice          lice          None (not detected)
houses        house         Plural
radii         radius        Plural
Jesus         jesus         UpperCase
Æthelrede     Æthelrede     None
NeWS          new           Plural|UpperCase
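
As a rough illustration of how case folding, possessive stripping and plural reduction combine, here is a toy Python sketch. It does not reproduce the rules in liblqtext/root.c: the function name read_word, the two-entry irregular plural table, and the rule of dropping a final s unless the word ends in ss or us are all assumptions chosen only to agree with some of the examples above. ASCII-only case handling is used as a stand-in for the Locale configuration parameter.

```python
# Tiny stand-in table; the real plural rules are in liblqtext/root.c.
IRREGULAR_PLURALS = {"feet": "foot", "radii": "radius"}


def _ascii_lower(word):
    """Lower-case ASCII letters only, leaving other characters alone."""
    return "".join(ch.lower() if ch.isascii() else ch for ch in word)


def read_word(token):
    """Return (indexed_form, set_of_flag_names) for one input token."""
    flags = set()
    word = token
    if any("A" <= ch <= "Z" for ch in word):
        flags.add("UpperCase")
        word = _ascii_lower(word)
    if word.endswith("'s"):
        flags.add("Possessive")
        word = word[:-2]          # remove the 's
    elif word.endswith("s'"):
        flags.add("Possessive")
        word = word[:-1]          # remove only the apostrophe
    if word in IRREGULAR_PLURALS:
        flags.add("Plural")
        word = IRREGULAR_PLURALS[word]
    elif word.endswith("s") and not word.endswith(("ss", "us")):
        # Assumed rule: a final s marks a plural, so "dictates" is
        # treated as a plural too, as noted for the Plural flag.
        flags.add("Plural")
        word = word[:-1]
    return word, flags


if __name__ == "__main__":
    for token in ["boy", "boys", "boy's", "boys'", "feet", "Jesus", "NeWS"]:
        word, flags = read_word(token)
        print(token, word, "|".join(sorted(flags)) or "None")
```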