An efficient representation for editable XML

Liam Quin, Barefoot Computing, May 2000

Abstract

An XML document model is described that is suitable for use in applications that require rapid processing and a low space overhead. A full multi-connected graph is supported, with both user edits and programmatic annotations.

The integrated undo mechanism is designed to persist between editing sessions, without requiring any extra information to be stored in the XML document.

The API presents an abstract model suitable for use in a model-view-controller paradigm.

Progressive rendering is supported, with the document growing as more input becomes available.

The Abstract model

Implementation Features

Memory Efficient

mmap() with pointers to text

Internationalised

All strings are stored as start and end pointers, so that nul bytes and wide characters are acceptable. Automatic normalization to an internal 16- or 32-bit ISO 10646 representation is available, with strings copied on demand as needed.

Supporting Progressive Rendering

mmap() document_change callback

Fast

Few calls to external memory allocators such as malloc() are needed, since text is stored as offsets in an mmap'd buffer. The representation is also designed for rapid searches by path and by XPointer. The use of pointers into external text may also allow a high locality of reference, producing fewer page cache misses. This remains to be measured.

Implementation Details

The XML Parser (libxml?) provides callbacks on events such as element start: this is standard SAX[ref] behaviour.

The callbacks build a tree, but the strings in the tree are integer offsets into text.

Consider the following XML fragment:

<p>The boy went <emphasis type="surprise">barefoot</emphasis> all day!</p>

The following abstract tree is presented:

    Element {
        Name: "p"
        Attributes: [none]
        Children: {
            Text
                Content: "The boy went "
            
            Element {
                Name: "Emphasis"
                Attributes: {
                    Attribute
                        Name: type
                        Type: String
                        Content: "surprise"
                }
                Children: {
                    Text
                        Content: "barefoot"
                }
            }

            Text
                Content: "all day!"
        }
    }

Internally, the representation is shown in Figure N.

The outermost p element has a baseOffset relative to the start of the document buffer. Pointers to text are integers that are added to the containing baseOffset in order to find the text. This representation, rather than using pointers, allows the original text to be moved in physical memory without affecting the document representation. If a document is known to be static, the base pointers can all point to the topmost (document root) element, so that dereferencing a string requires only a single addition.

Most textual XML documents have fairly shallow nesting with large sequences of unbroken text: paragraphs within sections within chapters. As a result, although string access is O(log n) for a balanced XML tree, or O(treeDepth) if you will, the largest treeDepth is usually of the order of 3 or 4. There is no additional cost for sequential access, since a standard pre-order document traversal can simply keep an accumulator, and perform a single addition for each leaf (string) object.

The out-of-band representation permits elements to be inserted without affecting the original document. Inserted elements might be for user interface reasons (for example, the user highlighted some text and asked that it be underlined) or a result of document edits. In the same way, an element deletion involves editing the document tree but not the original document. A text deletion involves splitting a Text node into two. [figure?]

Undo Mechanism

Commands that affect the document tree are stored in reverse, as operations that can be performed on the current document to recreate the previous version. These can be stored in a separate file, so that undo state can be saved between sessions. Furthermore, the current document tree can be rebuilt by replaying the history, or can also be saved to a file as a snapshot save.

Both of these mechanisms are fragile with respect to changes in the original document text, so the normal Save and Save As operations create a single XML file. This file can optionally include the program-specific tree annotations.

General Graph

Any object in the tree may occur in multiple locations. When the document is saved as XML, these objects can appear multiple times in the output as duplicates, or can appear once with ID/IDREF refereneces generated.

Is this all too much?

It turns out that a page layout program needs to be able to save edited documents. Most other programs use proprietary binary formats to save object hierarchies. Using XML admits of other uses of the information, is open and extensible, and allows the program to read XML documents directly, a great benefit. An internal object representation is needed, and an annotated XML graph fills the needs efficiently.