A Text Retrieval Package for the Unix Operating System

Liam R. E. Quin

SoftQuad Inc. (lee at sq.com)

Note: the author of this paper has moved to liamquin at interlog dot com


Bentley, Jon, Little Languages, in More Programming Pearls, Addison-Wesley, 1988. A clearly-written rationale for the use of little (or embedded) languages. This column first appeared in Comm. ACM in August 1986.

Bray, Tim, Lessons of the New Oxford English Dictionary Project, Usenix, Winter, 1989, pp. 137-199

Burkowski, Forbes J., An algebra for hierarchically organized text-dominated databases, 1992, in Information Processing & Management 28 No. 3, pp. 333

Cleverdon, C. W., Mills, J., and Keen, E.M., Factors Determining the Performance of Indexing Systems, Volume 1 - Design, Aslib Cranfield Research Project, Cranfield, 1966

Cutting, Douglas R., Karger, David R., and Pedersen, Jan O., Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections, in Proc. 16th ACM SIGIR, pp. 126-131, 1993

One of a number of papers reporting work at Xerox PARC on information retrieval.

Faloutsos, Christos, Access Methods for Text, in Computing Surveys 17, 1, pp. 49-74, March 1985

Compares text retrieval methods for office systems

Faloutsos, Christos and Christodoulakis, Stavros, ``Optimal Signature Extraction and Information Loss'', in ACM Trans. on Database Systems 12, 3, pp. 395-428, Sept. 1987

Faloutsos, Christos and Christodoulakis, Stavros, ``Description and Performance Analysis of Signature File Methods'', in ACM Trans. on Office Systems 5, 3, July 1987

A good overview of signatures.

Fawcett, Heather, PAT User's Guide, Open Text, 1989

Fox, Edward A., France, Robert K., Sahle, Eskinder, Daoud, Amjad, and Cutter, Ben, ``Development of a Modern OPAC: From REVTOLC to MARIAN'', TR 93-06, Virginia Polytechnic Institute and State University, 1993. A client-server Online Public Access Catalogue for a library, using the NeXTStep GUI.

Frakes, William B. and Baeza-Yates, Ricardo, Information Retrieval: Data Structures and Algorithms, Prentice-Hall, 1992.

An excellent introduction to the issues in implementing information retrieval systems. Examples in C for Unix, available by ftp from ftp://ftp.vt.edu/pub/reuse/ir-code.

Golovchinsky, G. and Chignell, M.H., ``Queries-R-Links: Graphical Markup for Text Navigation'', in Proceedings of INTERCHI '93, Amsterdam, pp. 454-460, April 1993, ACM Press, N.Y.
Presents a conceptually simple way for users to add and subtract terms from text retrieval queries, and raises issues about the trade-offs between pre-determined hypertext links and live text retrieval queries.

Harman, Donna, ``Overview of the First TREC Conference'', Annual ACM SIGIR Conf., 16, pp. 36, 1993.

At the SIGIR conference in 1993, some of the TREC participants reported that they had had difficulties using similarity techniques on long documents.

Hutton, Scott, Computing Information for Indiana University Users, 1994, formerly at http://scww.ucs.indiana.edu/kb/search.html

Knuth, Donald, The Art of Computer Programming, Vol III: Sorting and Searching, Addison-Wesley, 1981.

Lesk, M. E., ``Some Applications of Inverted Indexes on the Unix System'', in V7 Unix Programmers' Manual, Vol 2A, Bell Laboratories, 1978

Littman, Dan, ``AppleSearch 1.0'', Macworld, May 1994.
A review of Apple's `easy to administer, easy to use' text retrieval software. Mentions that `the indexing process required more than double the disk space of the original documents'.

Mandelbrot, Benoit, ``An informational theory of the statistical structure of language'', in Communication Theory, Ed. Willis Jackson, Butterworths, 1953, pp. 486-502

McKusick, Marshall Kirk, Joy, William N., Leffler, Samuel J., and Fabry, Robert S., ``A Fast File System for Unix'', CSRG Technical Report 83-147, 1983

Meadow, Charles T., Text Information Retrieval Systems, Academic Press, Toronto, 1992.

Gives clear descriptions of full-text retrieval data structures and algorithms, although with a bias towards indexing only abstracts of books or of library catalogue entries.

Oracle Corporation, SQL*TextRetrieval Version 2 Technical Overview, 1992

Roydhouse, Aaron, Miller, Linton, Jones, Eric K., and McGregor, James, The Design and Implementation of MetVUW Workbench Version 1.0, CS-TR-93/7, 1993

Describes a multimedia meteorological database that uses lq-text to provide text searching.

Salton, Gerald, Automatic Text Processing, Addison-Wesley, 1988

Seltzer, Margo and Yigit, Ozan, A New Hashing Package for Unix, Usenix '91, Dallas, TX, 1991

(ISO), International Organization for Standardization, Information Processing - Text and office systems - Standard Generalized Markup Language (SGML), ISO8879, 1988

Torek, Chris, Re: dbm.a and ndbm.a archives, netnews comp.unix newsgroup, 1987.

Tsuchiya, Paul F., A Search Algorithm for Table Entries with Non-contiguous Wildcarding, Bellcore, 1991.

Unpublished(?) description of Cecelia, a package using in-memory Patricia trees with efficient update and deletion.

Yigit, Ozan, How to roll your own dbm/ndbm, Unpublished Manuscript, 1989

Zadeh, L. A., ``PRUF - a meaning representation language for natural languages'', in Int. J. Man-Machine Studies, 10, pp. 395-460, 1978.

One of many of L. A. Zadeh's papers arguing for modeling the `pervasive imprecision of natural languages' (p. 396).

Zimmerman, Mark, Zbrowsr implementation, 1991

Article in para mailing list, unpublished.

Zipf, George K., Human Behaviour and the Principle of Least Effort, Addison-Wesley, Cambridge, MA., USA, 1949.
