Author Biography

Liam has worked with SGML since 1988, and first spoke at an SGML conference in Atlanta in 1989. He has written DTDs, provided SGML consulting and spoken on abstruse and irrelevant topics at SGML, Web, Unix and other conferences and workshops. He has had an ongoing involvement in a number of standards committees and workshops, including XML and the Dublin Core. His current rôle is as head of development in a small company producing XML-based collaborative software. Liam previously worked at SoftQuad Inc., a well-known vendor of SGML software and services based in Toronto. He has a degree in Computer Science from Warwick University, England. He is also known for his propensity to wander about always barefooted.
Director of Development
Suite 901 Inc.
67 Yonge Street
Toronto, Ont
Canada M5A 3C7
+1 416 955-9845
An SGML Document type definition serves many purposes, and is read both by software and by people. It must therefore be presented in a way which is clear and effective. A DTD is neither a program nor a document, but shares some characteristics of both. Techniques for presenting both textual information and structured information have been developed in other fields, and it is instructive to study these techniques and to see how they apply to SGML DTDs. In particular, graphic design and typography on the one hand and computer science and program layout on the other are very relevant.
Existing literature on document analysis and the preparation of SGML Document Type Definitions does not generally discuss the layout of SGML DTDs from a typographic or engineering point of view. This paper describes a number of techniques and principles used in typography, graphic design, information architecture and also in engineering and computer science.
The principles of design that underlie these techniques are discussed in turn, and a clear way to organise and lay out a DTD is then presented.
Further reading is given in an annotated bibliography.
An SGML Document type definition serves many purposes. It tells an SGML parser how to interpret markup, and hence is computer readable. But it also serves as a repository for documentation about the kinds of document that should use it, documentation that is read by humans rather than computers. Finally, it is itself a document that must be read and understood by those people who have to maintain it, change it or use it.
Because a DTD must be read by humans as well as computers, it must be presented in a way that is clear and effective. A DTD is neither a program nor a document, but shares some characteristics of both. Techniques for presenting both textual information and structured information have been developed in other fields, and it is instructive to study these techniques and to see how they apply to SGML DTDs. In particular, graphic design and typography on the one hand and computer science and program layout on the other are very relevant.
Existing literature on document analysis and the preparation of SGML Document Type Definitions does not generally discuss the layout of SGML DTDs from a typographic or engineering point of view. This paper describes a number of techniques and principles used in typography, graphic design, information architecture and also in engineering and computer science.
The resulting DTD layout may appear unfamiliar and uncomfortable at first to those accustomed to other layouts; with a little practice, however, most people seem to prefer it, and it is generally much clearer to people who have not studied large numbers of DTDs.
Further reading is given in an annotated bibliography.
This section briefly reviews the SGML Document Type Definition from the point of view of who uses it and how; the results of this review are central to how one should organise the DTD.
At the start of every valid SGML document is a document type declaration, characterised by the DOCTYPE keyword with which it begins. This contains a set of rules (usually through a reference to an external file), and is a promise that the document that follows is completely and accurately described by that Document Type Declaration. The Document Type Declaration, together with any comments it contains, and any application conventions not explicitly stated in the Document Type Declaration, is called the Document Type Definition. The term DTD is said (in ISO 8879:1986 Clause 4.108) to refer to the Document Type Definition, but informally the terms Definition and Declaration are frequently confused. In this paper, DTD refers to that part of the Document Type Definition which is explicitly contained in the SGML file or files; external undocumented application conventions are not included.
The most obvious use of the DTD is to control the behaviour of an SGML parser. For example, the DTD lists the elements that are permitted in a document, together with their attributes.
In practice, though, what matters more is what use an application makes of the markup. To some extent this may be directed by the DTD: an editor may prompt the user with only the elements that are valid at any given point, for example. But an editor will probably also need information that the DTD does not formally supply: a short description of each element for the Insert Element menu, and a longer description for documentation; style information on how to present the document on the screen; information about what substructures to insert; hints on which elements to spell-check and which to ignore; the list is probably unbounded.
As far as an SGML parser is concerned, the following two declarations may be the same:
    <!ELEMENT BOY(%ZZ.36;)>
    <!ATTLIST BOY%AG;%AG.36;>

and

    <!AttList Boy
        %Attributes.global;
        %Attributes.Boy;
    >

Readers who find the first of these clearer and who immediately knew the content model are invited to the next meeting of the Spiritualists Society to be held at Aintree Betting Shop. Seriously, although the parser doesn't need extra spaces, indentation or comments, the human users do, as we shall discuss shortly.
A document analyst is someone who looks at one or more existing or proposed documents and deduces enough about their structure to write an SGML DTD to represent those documents. Such people are sometimes programmers or computer analysts in the more usual sense of the word, but more often they are not. They are typically called in to help an organisation, and they are therefore often unfamiliar with the subject material. But they are likely to be very skilled at writing DTDs.
Document analysts only rarely have to live with their work. Perhaps this is one reason why so many DTDs make so few concessions to the casual reader. But it is the responsibility of the document analyst to present information in a way that is clear to the people who are going to be using and working with it.
In his book on user interface design, About Face, Alan Cooper characterises most computer users as perpetual intermediates [Cooper1995]:
“Most users remain in a perpetual state of adequacy striving for fluency, with their skills ebbing and flowing like the tides, depending on how frequently they use the program. I call this a state of perpetual intermediacy, and such users are perpetual intermediates.” (p. 484)
Most users of a DTD seem to fall into this category. A typist will generally work with a subset of the full range of elements at any given time, and in any case will usually not need to refer to the DTD itself very often.
Most people using SGML software to create or edit documents are neither programmers nor analysts, and although they can generally learn to read a DTD, they may not find this particularly enjoyable.
When all is said and done at the end of the day, we put away our books and follow other pursuits. But those who come after us will look at our documents, and will know nothing about them except for the comments in the DTD and what they can deduce from the content of the documents.
It is the responsibility of the archivist to provide any necessary additional information, and to make sure that a catalogue is kept up to date.
An SGML document that is valid today will still be valid in a thousand years' time, but you'll probably have a different archivist then. The DTD serves as a link for future generations to be able to make use of your documents, but these people may have neither much SGML knowledge nor much domain-specific knowledge. To them, the DTD is much more than a computer program: it is an archaeological artifact, and its provenance will derive from the care taken by the archivist in providing today all the metadata that will be needed in the future.
It very often happens that an SGML document, once created, is to be used by other people, and must be typeset either for paper or for the screen.
People skilled in graphic design and page layout may be called upon to create a design. Such people will almost certainly not be familiar with a DTD, and may instead ask for a list of page elements (not in the SGML sense!). The document analyst, if available, is generally the best person to start with this, and in any case will have consulted the final users of the SGML document as part of document analysis. But if the typesetters are using SGML-based software, they too will generally have to read the DTD. They may need to write style sheets or macros for every possible combination of nested elements, for example, and will need to know that <title> occurs within <bibRef> and <lifePeer> as well as at the start of a <chapter>.
The typographer is only one of many possible non-specialist users of the DTD, of course, and all of them will be grateful for a document that is well laid out and clear.
This section uses the terminology and ideas from computer software to discuss DTD layout. If you are not a programmer, do not be put off by the C examples: we are only considering the way they are laid out, not what (if anything) they actually do!
The principle of locality of reference was discovered after people tried to maintain million-line computer programs. If you have to read the whole program before you change any of it, and the program is twenty times the size of the entire Bible, you might as well give up now and go and live in Hawaii.
The idea is that a program be presented in short, easily-digested chunks. Each chunk is self-contained, so that you don't need to look elsewhere to fix it. Consider the following short example (in a fictional programming language called C--):
    one_func(char *n, char **cl)
    {
        char **p;

        for (p = cl; *p; p++) {
            acg(*p, "calls", n);
            acg(n, "called by", *p);
        }
    }
The astute reader may detect a bug in the above code; the following version of the same program fragment is written to try to be clearer:
    indexOneFunction(
        String myFunctionName;
        StringList functionsCalled;
    )
    {
        foreach (fName in functionsCalled) {
            AddToCallGraph(
                fName, "calls", myFunctionName
            );
            AddToCallGraph(
                myFunctionName, "called by", fName
            );
        }
    }
In the second version, even if you aren't familiar with the C programming language, you might guess that the names are reversed, so that myFunctionName actually calls fName, and not the other way round as written. Given the likely symptoms, someone could repair the second version very quickly indeed, but the first version would require the maintainer to read both the definition of the acg routine and the context in which the one_func routine was called. From a software engineering point of view this is obviously very costly.
In the context of an SGML DTD, the principle of locality of reference means that you should use clear, meaningful names, and that you should lay out your DTD in small, related sections. More on both of these ideas later.
This section first argues that long names are not a bad idea, and then discusses some other naming issues for SGML.
We have already seen an example of how clear names can make computer programs easier to read, understand and maintain. In the early days of SGML, there was an idea that parsers would pre-allocate memory using the QUANTITY section of the SGML declaration, and that using longer names would use more memory. Today few programmers would write code that way; memory is allocated as it is needed, with the consequence that even if your names are several times longer, you are likely to end up using far less memory than if the parser had pre-allocated space for the largest possible document.
People sometimes object that long names mean more typing. In the DTD this is probably true, but the DTD is typed once, and read and re-read and revised countless times: this is a flawed argument.
Another common objection is that people may have to type the names into their documents. If they are not using SGML-aware authoring software such as PSGML for emacs or Author/Editor, they should be! But even without it, word completion and macros in most editors make this argument specious. Furthermore, a single keystroke typed in error changes N into H, which is much less likely to be noticed than changing Name into Hame, and if there is actually an H element in the DTD, even the parser might not flag the error.
So use long, clear names with an easy conscience.
The lack of scope in SGML means that if you declare a parameter entity in one file, it will be visible for the rest of that file and in all other files that are included from that point on, rather like a C preprocessor macro definition of the sort that C++ deprecates and Java eliminates altogether.
SGML Elements and entities work the same way. One can imagine a language like SGML in which the following declaration would make an entity &Mc; available only within the <name> element:
    <!Element name [
        <!Entity Mc SDATA "M<superscript>c</superscript>">
    ]
        (#PCDATA)
    >

but this is not in fact possible in SGML. As a result, if several people or organisations are working on the same DTD, they may need a central registry so that two people don't accidentally choose the same name. In the days of FORTRAN programming, individual programmers were assigned a namespace by prefixing all of their names with their initials, or with the name of a module. This is ugly, and the XML work on namespaces may bring better solutions.
One often wants to make a name that is composed of several words in a phrase, such as running text or first name. If you are using an incomplete SGML system that does not support case sensitive names, you'll find that element names will get mapped to all UPPER CASE but that entity names are unchanged. Since spaces are not allowed in SGML names, it is necessary to choose something else. A good general scheme follows:
Where case is preserved, you can use an upper case letter at the start of each word, <JustLikeThis>. This is easily read, and has the advantage that the names are somewhat smaller on the screen than if a separator had been used.
Where only upper case is available, you can use a hyphen between words, rather like English place names: <Holme-on-Spalding-Moor>, or, more plausibly in an SGML DTD, <LEFT-MARGIN>.
Sometimes a property of an element is encoded in its name; this is a pathetic and miserable technique necessitated by the lack of scope in SGML. One might use BookTitle instead of Title, for example, within a bibliographic element, because Title has also been used elsewhere in the DTD with conflicting attributes or different content.
In this case, the use of a dot may be more natural than the use of a dash, especially for people acquainted with C, C++, Java, Pascal, JavaScript, or any of a host of other programming languages. With this convention, one would write Book.Title, which in broken case-folding systems will turn into <BOOK.TITLE>.
It is worth commenting on three aspects of the above system. First, many programmers prefer using an underscore to mixed case; this works moderately well in a fixed width typeface, but looks unbearable in anything else, as the underscore is generally a full em wide. At any rate, the underscore is not available as an SGML name character in the reference concrete syntax. Only the dot and the hyphen, together with ASCII letters and digits, may be used. If you can change the syntax to allow the underscore, you can also change it to allow mixed case.
The second comment is to explain why a word separation is needed. You can simply run words together, but when FirstOne is changed to FIRSTONE, the user may be perplexed about why a Fir Stone is needed. More on this under graphic design, below.
The third comment is that the distinction between a dot and a hyphen may be too subtle, in which case it is not worth making at all.
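As a rough sketch of the three conventions above, the following declarations use hypothetical element names and are alternatives to one another rather than parts of a single DTD. They assume an SGML declaration that raises NAMELEN above the reference value of 8 and, for the mixed-case forms, preserves case; tag-omission indicators are left out, as in the other examples in this paper.

```sgml
<!--* Mixed case, where the SGML declaration preserves case *-->
<!Element FirstName
    (#PCDATA)
>

<!--* Hyphens between words, where names are folded to upper case *-->
<!ELEMENT FIRST-NAME
    (#PCDATA)
>

<!--* A dot marking a property: the title that belongs to a book *-->
<!Element Book.Title
    (#PCDATA)
>
```

If the difference between the second and third forms is hard to see at a glance, that is a reason to choose one separator and use it throughout.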
A name should reflect something about the problem you are trying to solve, or something about the document it represents. If you can't think of a name for an element, or if the name is in terms of the SGML representation instead of the problem space, consider if the element is needed at all.
For example, an element called <CUSTID> sounds like it corresponds to a database field, or that maybe the analyst can't spell `custard'! But if it's really the customer's last name, then call it that. If the users of the DTD and documents are familiar with the database field, use that name. If, as is most likely, the DTD readers and programmers use CUSTID but the users of the documents see Customer Number on their screen, it might be best to use a parameter entity:
    <!--* The database field CUSTID is known to the users
        * as Customer Number:
        *-->
    <!Entity % CUSTID 'CustomerNumber'>
    <!Element %CUSTID; . . .>
Writing a program in terms of the problem rather than in terms of the implementation is known as abstraction, and greatly facilitates software maintenance when an underlying implementation is changed. When you are thinking in terms of the problem rather than in terms of the solution, you are much less likely to make mistakes of incorrect reasoning.
So choose names that represent concepts taken from the problem domain wherever possible.
Programs are organised into modules, each stored in one or more separate files, and modules are themselves broken down into functions and procedures. Each function can be considered alone, read and understood. A long function is further broken down into subsections, each with its own comments and with blank space above and below, so that you can see at a glance that a particular group of program statements go together, can understand them as a group, and can know when to stop reading.
In a DTD, it is very helpful to group related declarations together, and to use plenty of white space. The author has encountered people who believe that blank lines are not allowed in a DTD; this is an incorrect belief!
A suggested convention follows:
At the start of a file, have a comment header that explains the purpose of the file as a whole. Perhaps you can save someone from having to read the file at all, or help them go directly to the right place when they have to make a change. The comment header is described in more detail below.
Group related declarations together. Before each group, have at least two blank lines, a comment explaining the purpose of the group, and then a blank line.
Before each element declaration, place a comment explaining how to use the element and what it represents.
Include a short description of each element or entity: this should be short enough for use in a `tooltips' style help file, or in an Insert Element menu. This comment should be distinguished in a way that enables it to be extracted automatically.
An example will be given at the end of this paper.
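In the meantime, a brief sketch of the convention might look like the following. The recipe elements and the "Short:" label that marks the extractable description are assumptions made for illustration, not part of any published DTD. As elsewhere in this paper, tag-omission indicators are left out and long mixed-case names are assumed to be permitted by the SGML declaration.

```sgml
<!--* Recipe DTD: elements for a collection of recipes
    *
    * This file declares the recipe container and its title;
    * the declarations for ingredient lists are kept elsewhere.
    *-->


<!--* Recipe containers
    *
    * These elements hold a single recipe and its heading.
    *-->

<!--* Recipe contains everything needed to prepare one dish.
    * Short: A single recipe
    *-->
<!Element Recipe
    (Recipe.Title, Step+)
>

<!--* Recipe.Title is the name by which the dish is known.
    * Short: The name of the dish
    *-->
<!Element Recipe.Title
    (#PCDATA)
>

<!--* Step is a single instruction to the cook.
    * Short: One step
    *-->
<!Element Step
    (#PCDATA)
>
```

Because every short description sits on a line with the same label, a simple text-processing script could collect them for a menu or help file without needing to parse the DTD.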
Consider the following two short C fragments:
    if (isConnected(theSocket))
        if (needToClose)
            sendMessage(theSocket, MClose);
        close(theSocket);

and

    if (isConnected(theSocket)) {
        if (needToClose) {
            sendMessage(theSocket, MClose);
        }
    }
    close(theSocket);
These two fragments are in fact identical as far as a C compiler is concerned. As far as a human reader is concerned, in the second instance it's immediately clear that the close is outside the scope of the outer if. The close brace aligns with the start of the line containing the corresponding open brace.
In SGML, then, indentation can be used in just the same way to indicate lexical scoping. Consider the following:
    <!Element jog (run,run,run,walk,trot,walk,trudge,collapse) +(pant)>
    <!Attlist jog to cdata #Required from cdata #Required>

and the following:

    <!Element jog
        (run,run,run,walk,trot,walk,trudge,collapse)
        +(pant)
    >
    <!AttList jog
        to cdata #Required
        from cdata #Required
    >
In the second case the attribute declarations are much clearer, and the inclusion of pant is more visible.
By now most readers will have noticed the use of Element instead of ELEMENT. This is for two reasons. Firstly, as will be discussed below, lower case words are more easily recognised and read. But secondly, and more importantly, the name of the SGML element being defined is far more important than the keyword used to define it! If you have software that can do it, you could try putting the element name in bold, or the keywords in grey. (The conference paper formatting doesn't support grey keywords, so we show only the former in this example.)

    <!Element jog
        (run,run,run,walk,trot,walk,trudge,collapse)
        +(pant)
    >
    <!AttList jog
        to cdata #Required
        from cdata #Required
    >
If the editors you're using can't support this, there are DTD pretty-printers available. The author has a simple one for Unix and troff which he is willing to share. An alternative to using lower case for the keywords is to display them using small caps, as in the figure, if you have them available.
Figure: Using small caps for keywords
From the point of view of a programmer, a comment must be arranged so that it is immediately clear to what it applies, and also so that its extent is clear at a glance.
It must not be necessary to read comments that don't apply to the section of the DTD you're trying to read (see Locality of Reference above). It is therefore important that the reader can tell at a glance whether a comment applies generally to a section that follows, to something before it, or to something after it.
Two common approaches are to use whitespace, or to use decorations such as rows of minus signs or equals signs in comments. Unfortunately, since -- terminates an SGML comment, you always have to use a multiple of four minus signs. Apart from being error prone, this is just plain silly. Edward Tufte discusses the use of visual clutter in his three books on presenting information [tufte01], and refers to extraneous matter on diagrams as chartjunk. One might perhaps call the cute extra rows of minus signs and equals signs commentjunk. There may be some merit in marking the start of a section clearly, but little else.
One day, it will be possible to set the start and end of SGML comments to different strings. One good way to prepare for that Glorious Day is to use different strings from the start, but to enclose them in -- signs so that they work today. Here is an example that uses the asterisk, and that:
makes the block comment clearly distinct from any surrounding matter by the row of asterisks;
makes the start and end of the comment clear, since the eye can tell at a glance where the line of asterisks starts and ends;
can easily be extended to allow a row of asterisks at the start of a section, if you are addicted to such things;
uses start and end delimiters that differ one from the other (--* and *--) and yet work in today's SGML;
uses indenting to show the lexical nesting of the contents of the comment within the actual comment:
    <!--*******
        * Containers for Lists
        *
        * Elements in this section are used for representing
        * the various lateral angles of declension
        * of ships, which are known as lists.
        *
        * Note that a ship can list either to port or
        * to starboard.
        *-->
Every line after the first is indented by four spaces, so that if you use a fixed width font for the spaces and for the open comment delimiter and the *, the asterisks will all line up in a straight vertical line. For regularity (see under graphic design, below), you should use four spaces as an indent elsewhere. If you are using troff or TeX with a proportional typeface (such as Monotype Dante used in the small caps figure), you can adjust the default space width so that it is a quarter of the width of "<!--", and then the asterisks will again align correctly.
If possible, use a different colour or italics for the text of comments, as per the figure.
Figure: Italic used for the text of a comment
After discussing how to present comments on the screen or page, perhaps it is worth mentioning what to put inside them.
Someone receiving an SGML DTD is likely to need to know some or all of the following information:
How to refer to it: usually a public identifier;
The version of the DTD as a whole; this is usually included in the formal public identifier;
The date when the file was last modified, and the version of each individual file;
Where to get the latest file, and whom to ask if there are problems; this may be a URL and electronic mail address;
The purpose of the DTD: that is, what sort of documents it represents, and why;
The distribution status of the files: are they in the public domain, copyrighted, proprietary or what? Note that in most countries documents are always protected by copyright unless they explicitly say otherwise.
Here is a brief example:
    <!--* DTD for representing a C++ compiler parse tree
        *
        * Public Identifier:
        *     "-//Liam Quin//DTD C++ parse tree v1.1//EN"
        * System Identifier: "c++parse.dtd"
        * Release: 1.1, August 12th, 1997
        * Author: Liam Quin, Suite 901 Inc
        *         Liam Quin, Suite 901 Inc.,
        *         67 Yonge Street, Toronto
        *         liamquin at interlog dot com
        *
        * Status: Public Domain
        *
        * File Revision:
        *     $Id: c++parse.dtd,v 3.6 97/08/12 16:54:45 liam $
        *-->
You may be thinking that this is structured information and should therefore be represented in SGML. That would be a good thing to ponder, and it does indeed turn out that keeping DTDs in SGML is a very useful thing to do, but for interchange they must be in the Good Old-fashioned DTD Format.
Note: The email address would normally be given in RFC822 notation: user@host.domain; it is given in this form in case these proceedings are published on the web, as then the web crawlers for junk mail spammers would be able to discover the author's electronic mail address. The author was receiving over 20 junk messages a day at his account at SoftQuad before he left in June 1997.
This section describes some principles that have been developed over a period of approximately three thousand years, and that relate to the presentation of information.
A brief historical digression may be in order. Ever since the invention of the codex, and its later dissemination by the early Christians, books have been written to be read by people unfamiliar with the material. In this way, they have been like the Roman and Greek public inscriptions. Legibility was very important, and this is quite different from the scrolls used by Jewish Priests and Scholars and perhaps by some of the Greek poets and historians: the scrolls were used as an aide-mémoire.
By the Mediæval period, books were being used to convey very complex information, often with complex cross-references. Page numbers were invented to help serve as destinations for links in Bible commentaries, where a reference to book, chapter and verse wasn't possible.
Figure 1. Psalms in 1581. A page from the Booke of Psalmes, taken from the Geneva or Breeches Bible, printed in 1581; private collection, reproduced with permission of the owner.
One very common style of commentary that arose is exemplified by the Glossa Ordinaria, but is also seen in Jewish commentaries. In this style, the main text to be annotated is written (and later printed) in the centre of the page, and the commentary flows around it. The commentary is generally written with much smaller letters, partly because there can be many words of commentary for a single word of text, partly to elevate the importance of the text, and partly to make the page layout clearer and easier to follow at a glance.
This tradition continued well into the 16th Century, and if it died out in the use of Bible commentaries, it was for economic reasons. Figure 1 is taken from a 1581 edition of the Geneva Bible, and the footnotes can clearly be seen wrapping round the text.
In arranging a DTD, we must remember that some of the comments are part of the text, and some of them are a commentary on it, and perhaps some are comments on the comments.
The following sections first describe some general principles of graphic design and then apply these to DTD writing.
A good introductory reference is Robin Williams'
Other books can be found in the Further Reading section below.
Where things are supposed to differ, make them obviously different
If two things that are intended to contrast with one another are too similar, some people won't notice the difference, others will wonder if there is a mistake, and still others will be very irritated and get out a magnifying glass so that they can be sure of the difference each time.
In practice, this means that you should not rely on subtle typographic distinctions to convey meaning. Most readers who have not been introduced to thinking about type and design will have difficulty in distinguishing two sans serif typefaces from each other, or two serif faces. This means that you should not normally use more than one of each sort of typeface in combination.
In practice, roman and italics are enough for most uses, with a bold weight being used very sparingly to draw attention to things. A page of type normally presents an evenly spread grey texture, so that the eye is immediately drawn to bold black islands.
If you absolutely have to use more than one type style, try not to mix them inline. For example, you might use a fixed width typewriter font for everything in a DTD except comments, and you might use a face such as Hermann Zapf's calligraphic Palatino Italic for the comments, giving them a feeling of being annotations.
Treat similar items in the same way -- find a consistent set of rules and apply them everywhere. For example, if you indent text by four spaces to show lexical nesting, use the same indent everywhere and indent all nested structures.
Every exception will make people wonder whether you have made a mistake, or what is special about that one particular thing. This can be very distracting.
Align related things -- This is the principle which is easiest to follow, and which makes the biggest difference to almost any design. If two things should line up, line them up exactly! The human eye can easily detect a misalignment of one hundredth of an inch (a quarter of a millimetre), and even if this is not registered consciously, it will contribute towards an overall impression.
The converse of this rule applies only in information graphics: don't align unrelated things! We'll see an example of that when we discuss attributes.
Put related things nearer to each other than to unrelated things -- This may sound obvious, but is often not done. A heading, for example, should be nearer to the text that follows it than to that before it. Headings that float half-way between two paragraphs are unsettling and weaken a layout. A heading that's much closer to the text before it than to the text to which it actually applies is simply confusing.
This applies not just to headings; consider the layout of comments, and putting a blank line before or after a comment as appropriate, for example.
Don't crowd too much information into any one area; use whitespace liberally -- White space is free. There's no extra charge. And the SGML parser eats it awfully fast. It makes your DTD much easier to read, too. So use lots of it.
Don't litter your DTD with commentjunk -- Each graphic element, whether it's a font change or a row of stars, must have a purpose. Ask yourself whether blank space would have served the purpose as well or better.
Now that the principles have been introduced, it is time to apply them.
This is a matter that seems to cause many people difficulty, although it is really very simple. Consider the CALS declaration example, extracted (with small changes) from the SGML OPEN CALS Tables Interchange Fragment, shown in the figure.
[Figure: CALS Declaration Example. An example of an ATTLIST declaration taken from the SGML OPEN CALS Tables Interchange Fragment (exchange.txt). This is how not to do it.]
We can improve this in several ways:
the keywords are shouting at us; let's put them in their place with lower case;
for some strange reason, the keywords #REQUIRED and so forth are arranged in a vertical column, as if they were ready to be added up! In doing this, they are moved so far away from the attribute names to which they apply that the only way you can tell which attribute is in fact required is by noticing that it's the first in the list, and then looking back at the column of names on the extreme left.
The columns in the original text file were spaced out even further than they are here; the author moved them closer to try to make sure that they would fit on the page.
[Figure: CALS AttList Declaration Improved]
For the purpose of editing on-screen or of printing, we can apply the principle of contrast to make the various names stand out. Although either bold or italics could be used for the attribute names, they are less important than the element name itself, and so italic has been used. This is probably a matter of personal preference.
[Figure: The CALS Table Example Formatted For Printing or Editing]
The principle of Proximity applies not just to space within declarations but also to space between them. A comment that applies to the element declaration following it must be given a blank line before it, so that it is nearer to the declaration it describes than to whatever preceded it. In the same way, a comment that applies to a group of following elements needs extra space after it, so that it clearly does not apply only to whatever immediately follows it; it therefore needs even more space above it, so that it is still nearer to the group it introduces than to what came before.
This may be seen in the example at the end of the paper.
ROMAN INSCRIPTIONS WERE WRITTEN IN CAPITAL LETTERS; in many cases they were painted onto the stone with a brush and then carved with a chisel. Perhaps wooden signs were common: it has been suggested that stone carving was seen as a way to make brush writing permanent, set in stone as you might say. They were written in capital, or majuscule, letters because lower case letters had not been invented.
All of the research that has been done on the subject indicates clearly that we recognise words by the shape of their outlines, and if that doesn't work we peer at each letter one after the other [wheildon]. Since words in upper case generally all have rectangular outlines, we read them much more slowly than the lower case words that we can recognise at a glance.
It follows that words in all upper case should be avoided in our DTDs. In the examples in this paper, the keywords have been given an initial capital letter, and this is because many of them follow the tall punctuation <!, so that the initial capital letters help to balance the whole miserable affair.
If you are using XML, or have to use upper case keywords for some other reason, you may notice that the CAPITAL LETTERS STAND OUT ON THE PAGE much more than the mere subservient keywords deserve. For this reason, if they have to be in capital letters, they should be set with small capitals or in a smaller size, or, if that is not possible, in dark grey instead of black.
Comments, of course, should never need to be in all caps.
This has been mentioned before: by giving comments a firm and definite right edge they are made more visible, more cohesive, and more clearly distinct from their surroundings. This uses the principles of Alignment in the vertical line, of Contrast in treating them differently, and of Repetition as the idiom is enforced by multiple comments, and by using lexical indenting in the same way everywhere.
There are many other applications of all the principles described here, but if nothing were left to the imagination, we should be no better than slaves.
We have seen a number of ways to lay out a DTD. The flat structure of a DTD and the paucity of tools for assisting in the layout lead one inexorably towards the idea of using SGML itself to represent a DTD. If this approach is taken, one can use DSSSL or any other SGML formatting language to add style to one's DTD. One can have sections and subsections, with titles and overall comments and tables of comments. One can have Ambrosia, but at the risk of standing alone. None the less, the author has found that this is the most convenient way of managing a DTD. One generates the actual DTD automatically from the marked-up form, perhaps using perl or Balise.
There are a few tools for producing HTML from an SGML DTD. The author has a simple tool for producing PostScript using troff. There are not many tools in this category, sadly.
Tools such as make, RCS and CVS are all very useful. These are freely available for Unix and also for MS-DOS; commercial versions are available for Microsoft Windows and the Macintosh.
The make utility is very useful if you work in an environment where SGML DTDs must be processed or compiled before they can be used. Make reads a configuration file (Makefile) that describes the interdependencies between all the files in a directory, and can run programs as needed to make things up to date. For example, you could tell make that the rules file depends on the DTD file, and that the DTD file in turn depends on an included DTD fragment. If you edited the fragment, make would see that the DTD was older than the fragment, and hence out of date; this in turn would tell it that the rules file was out of date, so it would be recompiled automatically.
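The out-of-date check described above can be sketched in a few lines of Python. This is a simplified illustration of the idea only, not a description of how make is implemented; the file names (`fragment.dtd`, `main.dtd`, `main.rules`) and the function name `is_stale` are hypothetical.

```python
import os

# Hypothetical dependency table: each target maps to the files it depends on.
DEPENDS = {
    "main.dtd": ["fragment.dtd"],
    "main.rules": ["main.dtd"],
}


def is_stale(target, depends=DEPENDS):
    """Return True if target is missing, or older than anything it depends on."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    for source in depends.get(target, []):
        # A source that itself needs rebuilding makes the target stale too.
        if is_stale(source, depends) or os.path.getmtime(source) > target_time:
            return True
    return False
```

With this table, editing `fragment.dtd` makes `is_stale("main.dtd")` true, and because `main.rules` depends on `main.dtd`, it is reported as stale as well. A real Makefile would also say which command to run to bring each target up to date.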
RCS is useful for keeping a history of all the changes made to a file. You can compare any two revisions, and can retrieve any earlier revision. RCS will also maintain a revision number in the file for you if you want, so that someone who has a copy of one of your files can tell you what version they have.
CVS is like RCS, except that several people can share a single master copy of the DTD (or whatever else you wish) and can work on it together.
If you have ever saved a file from an editor and then moments later wished you hadn't, and wished you could `go back' to last Wednesday's version of the file, you need something like RCS or CVS.
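The revision number that RCS maintains in a file is stored as an expanded keyword such as `$Revision: 1.4 $`, typically placed inside a comment. As a small sketch of how someone holding a copy of a DTD might read that number back, assuming the keyword is present in the file (the function name `dtd_revision` and the sample text are hypothetical):

```python
import re

# RCS expands the keyword $Revision$ to, for example, "$Revision: 1.4 $".
REVISION = re.compile(r"\$Revision:\s*([0-9.]+)\s*\$")


def dtd_revision(dtd_text):
    """Return the revision number recorded in dtd_text, or None if absent."""
    match = REVISION.search(dtd_text)
    return match.group(1) if match else None
```

A file with no expanded keyword simply yields `None`, so the caller can tell an unversioned copy from a versioned one.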
If you use vi or emacs to edit your DTDs, you should investigate the ctags facility. Using awk or perl on Unix, the author wrote a short program in an hour or so that generates a ctags-like database from a DTD. The result is that you can position your cursor on any element name, press a key, and be taken straight to the place where that element is defined.
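As an illustration of the idea (this is not the author's program), the following Python sketch produces one tags-file line for each element declared in a DTD, in the tab-separated form of name, file and search pattern that vi reads. It assumes that each element declaration starts on its own line and names a single element rather than a name group or a parameter entity; the function name `dtd_tags` and the file name passed to it are hypothetical.

```python
import re

# Matches the start of an element declaration at the beginning of a line,
# in any mix of upper and lower case, and captures the element name.
ELEMENT_DECL = re.compile(
    r"^\s*<!element\s+([A-Za-z][A-Za-z0-9.\-]*)", re.IGNORECASE
)


def dtd_tags(dtd_text, filename):
    """Return sorted tags-file lines, one per element declared in dtd_text."""
    tags = []
    for line in dtd_text.splitlines():
        match = ELEMENT_DECL.match(line)
        if match is None:
            continue
        # Escape backslashes and slashes so the line can be used as a
        # search pattern between the / delimiters.
        pattern = line.replace("\\", "\\\\").replace("/", "\\/")
        tags.append("%s\t%s\t/^%s$/" % (match.group(1), filename, pattern))
    return sorted(tags)
```

Writing the returned lines, one per line, to a file called `tags` in the directory that holds the DTD should be enough for vi to find them. Note that tag lookup is case-sensitive even where the DTD's element names are not, so the name under the cursor has to be spelled as it is in the declaration.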
Very often, an hour or two thinking about tools can save a lot of time later.
The following references may prove useful or interesting.
Baecker, Ron & Marcus, Aaron, Human Factors and Typography for More Readable Programs, ACM Press, Addison-Wesley, 1990. This thesis is interesting, although it does not appear to have a grounding in classical typography, and in the present author's view suffers as a result.
Knuth, Donald E. Literate Programming, Center for the Study of Language and Information, 1992. This volume reprints many of the classic texts on the subject of literate programming, and is well worth a read.
Tufte, Edward, The Visual Display of Quantitative Information, Graphics Press 1983.
Tufte, Edward, Envisioning Information, Graphics Press 1990.
Tufte, Edward, Visual Explanations, Graphics Press 1997. Edward Tufte's three books, particularly the first listed, are essential reading for anyone in the business of presenting information, especially statistical or graphical data.
Wurman, Richard Saul, Information Architects, Graphis Press 1996. Richly illustrated with many examples of information design, any one of which could itself fill a book.
Wheildon, Colin, Type & Layout, Strathmore Press. It's hard to take a book seriously when it has such devastatingly enthusiastic reviews on its cover, but this is a rare example of published studies on legibility, and in particular the application of those studies to advertising. Would whoever has my copy please return it?
Professor Charles Bigelow has written in the Seybold Report, in Baseline magazine, in Fine Print on Type and elsewhere on legibility in type design.
Robin Williams, The Non-Designer's Design Book, Peachpit Press. Probably the best introductory book on graphic design, and it's got a wonderful yellow cover too!
Robin Williams, The Mac is not a Typewriter, Peachpit Press. This is a wonderfully gentle introduction to careful typography; there is a companion book, The PC is Not a Typewriter. My copies of these books seem to vanish mysteriously from my shelves when people borrow them.
Erik Spiekermann & E. M. Ginger, Stop Stealing Sheep & find out how type works, Adobe Press 1993. The title is a reference to a comment once made by Fred Goudy, a famous typographer. This book talks mainly about typefaces rather than layout, but is written by two people who love their subject and infect it with enthusiasm.
Jan Tschichold, The Form of the Book: Essays on the Morality of Good Design, Hartley & Marks, 1991, translated from the German by Hajo Hadeler. There's lots of good information on layout here from someone who was widely regarded as a world expert on the subject.
Now that we've dissected the DTD and taken apart all that goes into making it the shape it should be, it's time to put it back together again.
The heading refers to the fundamental tenet of SGML. Epimorphistic is to do with shapes, so that by epimorphistic conformability I mean the possibility of DTDs from disparate sources having the same or compatible outer forms. By encouraging this, we can hope that the DTD becomes gradually more and more accessible. Today, if you receive a new DTD, you have to learn its layout conventions in order to understand it, and this can be both time-consuming and error-prone. DTDs that are laid out for manageability and readability do exist, but they are unusual, a minority in a morass of mediocrity.
A slightly more serious way of summing up would be to say that the author hopes to have helped DTD writers and analysts to lay their DTDs out in a way that is easier to work with and will help other users. The author would welcome comments on this or any other aspect of this paper.