
How the Search facility works

This page is about how the searching is implemented; it might interest programmers, Web site developers, system integrators, or standards geeks.

The technology is all open source, and it's all available in exchange for pictures of your ankles. (Just kidding. Actually it's freely available, no pictures required.)

The Metadata

There is some metadata associated with each image:

  1. about the image itself
  2. about the place or places depicted in the image

An example is probably the best way to explain the metadata. Consider a book with text written by Sir Charles Knight, such as Old England. Perhaps there is a colour plate in the book, such as that of Methley Hall opposite page 383 in Volume I, which I have scanned.

Now, Methley Hall is (or rather, was) a building in the West Riding of Yorkshire, in England. I have marked its location as Mickletown, West Riding, Yorkshire, England.

Of course, the printed book that I scanned is in Canada, and the image is on my Web server (and maybe also on your computer now, too), but the place (Methley Hall) is located in England.

The image as I scanned it was saved in PNG format. After I cleaned up the scan in either Adobe Photoshop or The GIMP, I saved JPEG copies in several resolutions: the largest is 1475 pixels wide and 1023 pixels high. So we have an image format (JPEG) and size (1475x1023). We must be careful not to suggest that Methley Hall is 1475 pixels wide! Although this is absurd, it's a surprisingly common mistake when people prepare data about images.

I associated some keywords with the image: interiors, windows, ceilings, arches, staircases, colour, furniture and manors.

The location and keywords are stored in an RDF/XML file (actually it isn't really proper RDF, but it's close enough for my purposes). The information about the physical image, the format and the pixel size, is stored in a relational database, separately from the RDF. In this way there is no possibility of confusing the metadata about the physical image and about the picture.
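
As a rough illustration of that separation, here is a minimal Python sketch. It is not how the site stores anything (the site uses an RDF/XML file and a relational database); the class names and field names are hypothetical, invented for this example, and the values are taken from the Methley Hall example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PictureDescription:
    """What the printed picture shows (the kind of data kept in the RDF/XML file)."""
    title: str
    location: tuple      # most specific place first
    keywords: frozenset


@dataclass(frozen=True)
class ImageFile:
    """One stored copy of the scan (the kind of data kept in the database)."""
    image_format: str
    width_px: int
    height_px: int


# Hypothetical records built from the example in the text.
methley = PictureDescription(
    title="Methley Hall",
    location=("Mickletown", "West Riding", "Yorkshire", "England"),
    keywords=frozenset({"interiors", "windows", "ceilings", "arches",
                        "staircases", "colour", "furniture", "manors"}),
)
largest = ImageFile(image_format="JPEG", width_px=1475, height_px=1023)
```

Because the pixel size lives only on ImageFile, there is no field through which a width in pixels could be attached to the building by accident.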

An astute reader might have observed that the keyword colour is more about the image than the place. Obviously Methley Hall is not black-and-white. The keywords, then, are about the printed picture in the book and what can be seen in it. Methley Hall might have a swimming pool and a wheelchair-accessible billiards table, but the From Old Books Web site is not about finding a country house! It is about finding cool pictures, though, so I have mentioned the arches and the staircase.

Searching

I am currently using Qizx/Open, a Java-based implementation of the XML Query language. This is a query language that lets me run queries against any mixture of XML files, XML document stores and relational databases, without needing to know which is which in the body of my query.

Under http://fromoldbooks.org/Search/ is a file named index.cgi; the Apache Web server runs this program to satisfy incoming HTTP requests for that directory or anything beneath it.

I should mention at this point that the Common Gateway Interface (CGI) is not a programming language. It's just the way in which the Web server communicates with an external program, and you can use almost any programming language you like. Java, Perl and PHP are three common choices, and in this case I used Perl.

The CGI script in this case does several things:

  1. Parses the query options;
  2. Builds an XML Query expression on the fly;
  3. Runs the XML Query with a template (usually HTML or SVG);
  4. Sends the results back to the requesting agent (usually your Web browser, but it might also be a search engine crawler).
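
The first two steps can be sketched as follows. This is an illustration in Python, not the Perl script itself. The parameter names (kw, location), the document name images.xml and the element names (image, keyword, location) are all assumptions made up for the example; the real query options and metadata layout may look quite different.

```python
from urllib.parse import parse_qs


def xquery_string(value):
    """Quote a value as an XQuery string literal.

    Ampersands become &amp; and double quotes are doubled, so a request
    cannot break out of the literal.
    """
    return '"' + value.replace("&", "&amp;").replace('"', '""') + '"'


def build_query(query_string):
    """Step 1: parse the query options. Step 2: build an XQuery expression."""
    options = parse_qs(query_string)
    keywords = options.get("kw", [])           # hypothetical parameter name
    locations = options.get("location", [])    # hypothetical parameter name

    conditions = []
    if keywords:
        seq = ", ".join(xquery_string(k) for k in keywords)
        # "=" against a sequence is true if any one item matches.
        conditions.append(f"$img/keyword = ({seq})")
    if locations:
        seq = ", ".join(xquery_string(p) for p in locations)
        conditions.append(f"$img/location = ({seq})")

    where = " and ".join(conditions) or "true()"
    return (
        'for $img in doc("images.xml")//image\n'   # hypothetical document
        f"where {where}\n"
        "return $img"
    )


print(build_query("kw=arches&kw=windows&location=England"))
```

Quoting every value before it is pasted into the query text matters for any script that builds queries on the fly from request parameters.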

The CGI script keeps a cache of recent search results. It also monitors system load (using /proc/uptime): if the system gets too loaded, it deliberately sleeps for several seconds and prints the error message "system too busy". I found this was necessary because Internet Explorer has a button that tries to make a copy of a Web site for reading offline; when people press it, their Web browser tries to download every page at full speed, including in this case every possible combination of search options!
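
A minimal sketch of that throttling idea, again in Python rather than the Perl of the real script. Several things here are assumptions made for the example: the sketch reads /proc/loadavg, which is where Linux reports load averages and may not be the file the script actually reads; the threshold of 4.0 and the five-second delay are invented numbers; and the function names are hypothetical.

```python
import time


def parse_loadavg(text):
    """Return the 1-minute load average from the text of /proc/loadavg."""
    return float(text.split()[0])


def current_load(path="/proc/loadavg"):
    """Read the 1-minute load average (Linux only; the path is an assumption)."""
    with open(path) as f:
        return parse_loadavg(f.read())


def throttle(load, max_load=4.0, delay_seconds=5, sleep=time.sleep):
    """Return None if the request may proceed.

    Otherwise pause first, so that an aggressive client is slowed down,
    then return the error message to send back.
    """
    if load <= max_load:
        return None
    sleep(delay_seconds)
    return "system too busy"
```

A caller would do something like message = throttle(current_load()) and send the error page instead of running the query whenever message is not None. Sleeping before answering is the part that slows down a client fetching every page at full speed.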

If you would like to see what the queries look like, you can append &showquery=1 to any query, and you will see the text of the query that would have been fed to the XML Query implementation.

I used Qizx/Open for several reasons: at the time I chose it, it was the fastest available implementation of XML Query, and the computer I was using for a Web server was a little on the slow side. But in addition it has support for JDBC, so it can connect directly to MySQL.

I could equally well have used Mike Kay's Saxon-SA and XSLT, but Qizx/Open was (at the time) twice as fast. That is probably not true any more, and in fact the CGI script can use any of Galax, Saxon or Qizx/Open by changing a single variable. This is a good indication of the level of interoperability between XML Query implementations! I make use of this if I'm having difficulty finding a bug, because of course the three different engines often give slightly different error messages.

Combining Queries

You can give multiple keywords and multiple locations; if you do, each result must match at least one of the keywords you give and at least one of the locations. This isn't what most people would expect: searching for Yorkshire, England gives the same number of matches as England alone, because every item matching Yorkshire also matches England. For that reason the user interface no longer supports it. If anyone asks me (see below) for a more complex interface I'll probably provide one.
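
The matching rule described above can be sketched in a few lines of Python. This is an illustration of the rule only, not the query the site runs. The function name is hypothetical, the second picture is made up for the example, and treating an empty list as "no restriction" is an assumption.

```python
def matches(picture_keywords, picture_places, wanted_keywords, wanted_places):
    """True if the picture matches at least one wanted keyword
    and at least one wanted place (an empty list means no restriction)."""
    keyword_ok = not wanted_keywords or bool(
        set(picture_keywords) & set(wanted_keywords))
    place_ok = not wanted_places or bool(
        set(picture_places) & set(wanted_places))
    return keyword_ok and place_ok


pictures = [
    ("Methley Hall", {"arches", "staircases"},
     {"Mickletown", "West Riding", "Yorkshire", "England"}),
    # A made-up second entry, only here to make the counts interesting.
    ("hypothetical Cornish harbour", {"boats"}, {"Cornwall", "England"}),
]


def count(wanted_keywords, wanted_places):
    """Number of pictures matching the given keywords and places."""
    return sum(matches(k, p, wanted_keywords, wanted_places)
               for _, k, p in pictures)


print(count([], ["England"]))               # 2
print(count([], ["Yorkshire", "England"]))  # 2, adding Yorkshire narrows nothing
print(count([], ["Yorkshire"]))             # 1
```

Adding Yorkshire to England cannot reduce the count, because "at least one location" is already satisfied by England for every picture in Yorkshire.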

Full Text Searches

I am the primary author of an open source search engine, lq-text, so it might seem odd that I do not provide full text searching yet.

Text searching is in the works, and, again, if you would like it, let me know. The delay is partly because I am working on incorporating a large number of texts, partly because I am busy, partly because I am lazy, and perhaps partly because "the cobbler's children always have bare feet".

Using this for your own Web site

If you bribe me (see below) you could try to use this set-up on your own Web site. The main reason I don't distribute it today is that the mkgallery Perl script that makes the gallery Web pages uses a totally odd non-XML format for the metadata, and I have never had the time to move to using XML!

Contacting Liam

I am Liam at holoweb dot net. To get through my spam filters, tell me what colour socks you are wearing. If you want to bribe me, send pictures of barefooted men and their ankles that I can add to a public Web gallery! (I can add the images anonymously if you wish.)
