Word Rules: Stemming and Morphology
Purpose
The Word Rules determine how the individual words in
the files you index are interpreted by lq-text.
Currently, the Word Rules are specified in two places. The
actual rules are compiled into lq-text itself, for
efficiency, and you cannot change them very much without
recompiling the whole package. Some of those rules, however, can
be customised a little from the lq-text configuration
file (config.txt).
The rules that you can change are documented here. Most of
the other rules are documented in the C header file
h/wordrules.h, and are implemented in
liblqtext/readword.c and liblqtext/root.c.
You will need to edit h/wordrules.h to change those
rules.
The simplest way to customise the word rules is to change the
list of flags that are stored.
Word Flags
When lq-text reads a word, either to index it or as
part of a query, it associates various pieces of information,
called Flags, with the word.
A Flag is either set or not set; if it is not present, it is
assumed that it is not set.
Most of the flags indicate that the word was altered in some
way as it was read. For example, words in UPPER CASE are converted
to lower case,
and the UpperCase flag is set to mark that this was done.
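As a rough illustration of flags that are either set or not set, and of keeping only the flags chosen for storage, here is a minimal Python sketch. The flag names are the ones described in this section; the bit values, the class name WordFlag and the helper functions are hypothetical, and the real definitions are the WPF_ constants in h/wordrules.h.

```python
from enum import IntFlag


class WordFlag(IntFlag):
    """Flag names from this document; the bit values are made up."""
    HasStuffBefore = 0x001
    LastHadLetters = 0x002
    LastHadPunct = 0x004
    LastInBlock = 0x008
    LastWasCommon = 0x010
    NextHasPunct = 0x020
    NextIsCommon = 0x040
    Plural = 0x080
    Possessive = 0x100
    UpperCase = 0x200


ALL = WordFlag(0x3FF)   # every flag at once
NONE = WordFlag(0)      # no flags


def without(base, *flags):
    """Return base with the given flags cleared."""
    value = int(base)
    for flag in flags:
        value &= ~int(flag)
    return WordFlag(value)


def stored_flags(word_flags, configured):
    """Keep only the configured flags; LastInBlock is always kept."""
    return word_flags & (configured | WordFlag.LastInBlock)


# A word that had upper case letters, odd spacing before it,
# and was the last word in its block.
seen = WordFlag.UpperCase | WordFlag.HasStuffBefore | WordFlag.LastInBlock

configured = without(ALL, WordFlag.HasStuffBefore)
stored = stored_flags(seen, configured)

# A flag that is not present is simply not set.
has_plural = bool(stored & WordFlag.Plural)
```

With these assumed values, stored keeps UpperCase and LastInBlock but drops HasStuffBefore, and passing NONE as the configured set still leaves LastInBlock in the result.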
- All (WPF_ALL)
- This value sets all of the flags at once. It is useful if
you want to set all except one or two flags:
WordFlags All-HasStuffBefore|LastHadPunct|NextHasPunct
- None (zero)
- This sets none of the flags. Note that even if you use
WordFlags None
in the configuration file, lq-text will always include
LastInBlock in the value.
- HasStuffBefore (WPF_HASSTUFFBEFORE)
- This flag is set if the distance between the end of the
previous word read and the start of this word was other than
one byte, or two bytes if this word has the
LastHadPunct flag set.
In addition, if this flag is set, the actual distance is stored
in the database, as a value between one and fifteen.
- LastHadLetters (WPF_LASTHADLETTERS)
- This flag is set if there were any alphabetic letters
(as determined by the database
Locale configuration
parameter)
between the end of the previous word read and the start of this word.
Note that this can only happen if the
MinWordLength configuration
parameter is set to a value greater than one, as otherwise a
single letter by itself would be treated as a normal word.
- LastHadPunct (WPF_LASTHADPUNCT)
- This flag is set if there were any punctuation characters
(as determined by the database
Locale configuration
parameter)
between the end of the previous word read and the start of this word.
Compare: NextHasPunct.
- LastInBlock (WPF_LASTINBLOCK)
- This flag is set on the last word in each block of the
input.
The block size is determined by the
FileBlockSize configuration
parameter.
This flag is always stored when appropriate, so that lq-text
can match a phrase that crosses a block boundary.
- LastWasCommon (WPF_LASTWASCOMMON)
- This flag is set if a word was skipped between the end of
the previous word read and the start of the current word; this
includes words listed in the
StopList file, if any, and
also numbers if the
IndexNumbers configuration
parameter is set to Off.
During indexing, it may also include words that an
input filter has decided not to index.
Compare: NextIsCommon.
- NextHasPunct (WPF_NEXTHASPUNCT)
- This flag is set if the word read after this one
will have the LastHadPunct flag set.
Note that in order to implement this, lq-text actually
reads one word ahead.
- NextIsCommon (WPF_NEXTISCOMMON)
- This flag is set if the word read after this one
will have the LastWasCommon flag set.
Note that in order to implement this, lq-text actually
reads one word ahead.
- Plural (WPF_PLURAL)
- This flag is set if lq-text thought that
the word was a plural form. Note that since there is no
part of speech recognition, a third person singular verb
such as `dictates' in `she dictates' is considered to be a
plural, and is read as dictate with the Plural flag set.
See the discussion of Word Forms and Stemming
below for more information on this topic.
- Possessive (WPF_POSSESSIVE)
- This flag is set if lq-text thought that the word
was a possessive form. Currently, this is set if the word
ends in 's (which is removed) or if the word ends in an
s that is followed by an apostrophe.
- UpperCase (WPF_UPPERCASE)
- This flag is set if the word contained any upper case letters.
Exactly which letters are upper case is determined by the
database Locale configuration
parameter, along with how to turn upper case letters into their
lower case equivalents.
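The two look-ahead flags described above, NextHasPunct and NextIsCommon, can be pictured as holding each word back until the following word has been read. The Python sketch below is only an illustration under stated assumptions: the stop list, the use of ASCII letters as word characters, the use of string.punctuation in place of the Locale setting, and the function names are all hypothetical, and this is not how liblqtext/readword.c is actually written.

```python
import re
import string

STOP_WORDS = {"the", "of", "a"}    # hypothetical stop list
PUNCT = set(string.punctuation)    # stand-in for the Locale's punctuation


def read_words(text):
    """Yield (word, flags); Last* flags describe what preceded the word."""
    prev_end = 0
    pending = set()                # flags waiting for the next kept word
    for m in re.finditer(r"[A-Za-z]+", text):
        gap = text[prev_end:m.start()]
        prev_end = m.end()
        if any(ch in PUNCT for ch in gap):
            pending.add("LastHadPunct")
        word = m.group().lower()
        if word in STOP_WORDS:
            # The word is skipped, so the next kept word is marked.
            pending.add("LastWasCommon")
            continue
        yield word, pending
        pending = set()


def add_lookahead(words):
    """Hold each word until its successor is known, then set Next* flags."""
    held = None
    for word, flags in words:
        if held is not None:
            if "LastHadPunct" in flags:
                held[1].add("NextHasPunct")
            if "LastWasCommon" in flags:
                held[1].add("NextIsCommon")
            yield held
        held = (word, set(flags))
    if held is not None:
        yield held                 # the final word has no successor


result = list(add_lookahead(read_words("Stop, thief! The end of story")))
for word, flags in result:
    print(word, sorted(flags))
```

In this sketch the skipped words "The" and "of" never appear in the output; they only show up as LastWasCommon on the word that follows them and NextIsCommon on the word that precedes them.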
Word Forms and Stemming
When lq-text indexes a file, it tries to put all the forms
of a given word together. For example, if you search for boy,
you will also find boys. This collation is only done for noun
forms, so that running and ran are not currently indexed
under run.
The rules for determining plurals may be found in the lq-text
source in liblqtext/root.c; in the future, these will probably
be read externally.
The following examples may help to clarify what's going on.
Input Word | Indexed As | Flags |
boy | boy | None |
boys | boy | Plural |
boy's | boy | Possessive |
boys' | boy | Plural|Possessive |
feet | foot | Plural |
mice | mice | None |
mouse | mice | None (not detected) |
lice | lice | None (not detected) |
houses | house | Plural |
radii | radius | Plural |
Jesus | jesus | UpperCase |
Æthelrede | Æthelrede | None |
NeWS | new | Plural|UpperCase |
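Several of the rows above can be reproduced with a very small set of rules. The following Python sketch is a simplified guess at that behaviour, not the rules in liblqtext/root.c: the function name reduce_word, the tiny irregular-plural table and the suffix exceptions are all assumptions made for illustration. It also uses Python's own case conversion instead of the database Locale, so rows involving non-ASCII letters, and the mouse row, may come out differently.

```python
# Tiny illustrative table; the real irregular forms live in root.c.
IRREGULAR = {"feet": "foot", "radii": "radius"}


def reduce_word(word):
    """Return (indexed form, set of flag names) for one input word."""
    flags = set()

    # Fold case first, remembering that the word was changed.
    if word != word.lower():
        flags.add("UpperCase")
        word = word.lower()

    # Possessive: a trailing 's is removed; for s' only the apostrophe goes.
    if word.endswith("'s"):
        flags.add("Possessive")
        word = word[:-2]
    elif word.endswith("s'"):
        flags.add("Possessive")
        word = word[:-1]

    # Plural: irregular table first, then a plain trailing s.
    # Words ending in "ss" or "us" are left alone (an assumed exception).
    if word in IRREGULAR:
        flags.add("Plural")
        word = IRREGULAR[word]
    elif len(word) > 2 and word.endswith("s") and not word.endswith(("ss", "us")):
        flags.add("Plural")
        word = word[:-1]

    return word, flags


for w in ["boy", "boys", "boy's", "boys'", "feet", "houses", "Jesus", "NeWS"]:
    stem, flags = reduce_word(w)
    print(w, stem, "|".join(sorted(flags)) or "None")
```

Because there is no part of speech recognition here either, a verb such as dictates is reduced to dictate with the Plural flag, just as described for the Plural flag above.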