KEY:
    *   -- easy change
    **  -- harder, needs more understanding
    @   -- needs understanding of internals
    @@  -- mail me if you need this!

source fixes awaiting attention:
    * common.c -- can this use fReadLine and ReadWord?
    * docpath.c -- move docpath into db
    * OpenDatabase -- should take severity, description, see liblqtext.h!

[1 - ui]
    * give lqshow the ability to page a file
      (it can call up $PAGER if you use "v")
      Or, **, rewrite the curses-based front end altogether.
    * make the gnome/gtk front end usable
    @ make a perl module?
    @ make a python module?

[2 - search,index]
    @@ special treatment of dates in the index

[3 - index]
    ** table of pagers for browsing by file/type
    ** Make the table read from a file at run-time, and include filters
       Probably the filters should call LQT_AddWord(), instead of the
       other way round.

[4 - retrieval]
    ** Better ranking of queries
       Start with lqrank, which already does some sorting.
       The difficult thing here is deciding on what basis to do the
       ranking. E.g. docs containing the target phrases the most times
       come first? Or does the length of the document make a difference
       too? Probably has to be configurable.
       See TREC conference reports, also see src/lqtext/lqsimilar

[6 - doc]
    **@@ write a manual
       I have started this in the doc directory. My goal is to have
           * a user manual
           * an administration manual
           * a programmer's manual, documenting the API.
       The API reference manual is the most advanced, but an API guide
       is needed.

[7 - index]
    **@ The entire plural code (Root.c) needs a rethink.
       I have started Plurals.c, but it's not ready yet.
       Yell if you have any ideas, I need them!
       [although "this" is OK now]
       In particular, how to handle morphological analysis into lemmas
       in an internationalised application is a difficulty.
    @ allow user-defined stemmers (compiled-in) via config (README) file

[8 - index]
    ** allow dynamic definition of word start/mid/end, in README.
       Must be at least as fast as isupper() etc.
       Perhaps per-file-type rules, though?
       That makes Phrase Matching hard.

[9 - index]
    @@ Replace the common words file with three files:
       [1] a list of words not to be indexed
       [2] a list of phrases to index completely, even if some of the
           words in them occur in [1]
       [3] a list of phrases not to index at all, possibly with the
           ability to mark specific words.
       Then you can say
           don't index /the/ except in /the times/
           always index /our/ except in /Our Company Ltd/
       Tim Bray of Open Text says stop lists are a bug, and I think I
       agree with him, except you also have to say that the price of
       disk is a bug too.

[10 - ui]
    ** lqshow could be made a routine (BrowseList() I suppose) that
       takes a list of Phrases with their matches...
       Hard to integrate into X.

[11 - implement.]
    **@ should abandon dbm for the list of filenames.
       A better approach would be to store path components as words in
       the database! This would make / a common-word, though.
       Needs some thought.
       Or maybe as blocks in data. That would be fairly easy.
       A btree might be a good compromise.
       For now, at least db-1.xx doesn't have overflow problems.

(12 - index)
    **@@ the ability to delete a file.
       (this has been done, lqunindex)

[13 - implementation]
    ** Better file locking
       (no file locking or signal handling at all at the moment -- I
       ripped it all out when I discovered that it was broken on many
       systems, and this gave a false sense of security.)
       Also, it was too slow, and gave console error messages on NFS!

(14 - retrieval)
    * Phrase Matching would be orders of magnitude faster if it did not
      involve reading the tables of matches until they are needed, as
      many of them won't be! It should extend the lists of matches for
      each word in the phrase only as necessary.
      Done. It wasn't orders of magnitude in most cases, only a factor
      of two or so. Could possibly be improved.

[16 - implementation]
    @@ use mmap for data and widindex, in segments with a cache

[18 - implementation]
    @ Add a WIDIndex cache!

[19 - implementation]
    @ Close "chainend" (lasblks) on exit (done)
    @ optionally remove chainend, with README line

[20 - implementation]
    ** Proper variable-based configuration, no global variables.
       (1.14: the global variables are all gone but no good config
       code yet)

(21 - index)
    @@ 4-bit coding for when delta-block and WIB fit in 7 bits combined?
       (1.14: done)
       Or, could always use 4/4 bits, and set the top bit on each if
       continued; need to measure what numbers occur.

[22 - index]
    @@ Variable WIDBLOCKSIZE to reduce wastage? How much is wasted?
       (1.16: not very much)

(23 - implementation)
    * cd/rom changes --
      [23a] read-only database (done)
      [23b] ms/dos-compatible filenames (done)

[24 - retrieval]
    @@ find within fielded data
       e.g. "find within title"

[25 - retrieval]
    * make lqsed handle overlapping matches, e.g. use only the longest.
      ?what to do with this?:
          aa bb cc dd ee ff gg
          [1 ]1 [2 ]2
      Probably turn it into
          aa bb cc dd ee ff gg
          [1 ]2

[25 - ui]
    @@ integrate a decent command-line option parser and combine it
       with the config file. Preferably something like X defaults,
       except with a description and type for each of them.
       Someone did this but never sent patches, and used a GPL'd lib.
       Symbol table to be added to t_Database.
       See also item 20.

[26 ?]
    *@ query expansion
       (this is done with lqquery for wildcards, but not for thesaurus)

[27 ?]
    * quorum ranking (done in lqrank I think)

[28 ?]
    @@ statistical ranking a la SMART.
       see also item 5.

[29 ?]
    @@ support structured documents
       see also item 24.

[30 ui]
    @@ i18n ?? unicode??

[31 implementation]
    @@ use an mmap() cache for LQT_ReadBlock()

[32 filters]
    integrate filters with findmatchends

[33 doc]
    user guide

[34 doc]
    update the man pages

[35 doc]
    api reference/C
    (1.14: there's a fledgeling API reference, but it only documents
    functions, not data structures right now)

[37 ui]
    ** provide a GUI, e.g. using Motif? Ugh.
       (1.14: I've started with Motif.)
       (1.16: I started again with perlgtk and got further)

[38 filters]
    C (used to have one of these, it seems to have got lost)
    ASCII man pages/troff (done! but need work)
    SGML
    RTF?
    HTML

[39 retrieval]
    proximity searching

[40 ?]
    queries by file (this work started in Phrase.c)

[41 ?]
    indexing of compressed and archived files

[42 ui]
    Udi Manber's agrep on vocabulary. Or maybe soundex.

[43 retrieval]
    complex queries

[44 ?]
    lqgrep/lqegrep using preprocessor to reduce no. of files to search

[46 index/physical]
    lqdbfsck? Program to check a db is ok.
    lqword -A > /dev/null is silly.

[48 ?]
    lq enhancements:
    save { matches/files } with(out) titles to a file { index }

[49 ?]
    ship faq script

[50 ?]
    @ thesaurus

[51 ?]
    generalise the variables used in lqkwic to a general-purpose
    facility suitable for i18n, etc.
    (1.14: this is what t_NameSpace is for, but is not yet widely used)

[52 WWW]
    * CGI script interface, see also item 36

Known Bugs
==========

* lqshow does not know about file types
* there is no troff (or sqtroff) file type
  -> fixed, but the filter is still buggy: you'll get the wrong word
  highlighted quite often, I'm afraid.
* the C filter got lost in history (sigh)

write PID into block zero, also IP address of host, for exclusive access.
Use lockf or fcntl-style locking. Need to test which ones work.
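
A rough sketch of the fcntl-style variant of the locking note above, for experimenting with which systems it works on. Everything named here is hypothetical and not part of liblqtext: lock_database_exclusive(), unlock_database() and LOCK_INFO_SIZE are made up for illustration, the host name stands in for the IP address, and it assumes the first LOCK_INFO_SIZE bytes of block zero are reserved for the owner record.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed number of bytes reserved at the start of block zero. */
#define LOCK_INFO_SIZE 128

/* Hypothetical helper: take a non-blocking exclusive advisory lock on the
 * start of the file and record the owner there.
 * Returns an open descriptor holding the lock, or -1 on failure
 * (including the case where another process already holds the lock).
 */
int lock_database_exclusive(const char *path)
{
    struct flock fl;
    char info[LOCK_INFO_SIZE];
    char host[64];
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0) {
        return -1;
    }

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;        /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = LOCK_INFO_SIZE;  /* lock only the owner record */

    if (fcntl(fd, F_SETLK, &fl) == -1) {
        close(fd);              /* lock is held elsewhere, or an error */
        return -1;
    }

    /* Host name used here in place of the IP address (an assumption). */
    if (gethostname(host, sizeof host) != 0) {
        strcpy(host, "unknown");
    }
    host[sizeof host - 1] = '\0';

    memset(info, 0, sizeof info);
    snprintf(info, sizeof info, "pid=%ld host=%s\n", (long) getpid(), host);

    if (lseek(fd, 0, SEEK_SET) != 0 ||
        write(fd, info, sizeof info) != (ssize_t) sizeof info) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: release the lock and close the descriptor. */
int unlock_database(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = LOCK_INFO_SIZE;

    (void) fcntl(fd, F_SETLK, &fl);
    return close(fd);
}
```

Note that fcntl locks belong to the process, so a second call from the same process will not be refused; contention has to be tested from two separate processes. The lock is also advisory: it only keeps out programs that ask for it.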