XML Doesn't Beep

Thursday, 17 January 2008 | Rich

I learnt something new about XML today: a part of the specification that deals with one of the many edge cases that exist in every file format. To illustrate this, let's take a look at a few examples. This XML document is well-formed:

<test>X</test>

This one is also well-formed:

<test>& #9;</test>

But this document isn't:

<test>& #7;</test>

Note that I've added an extra space to these examples, as the blogging software used by kdedevelopers.org seems to quote the characters needed to make them appear directly.

To find out why it's broken, read on...

It turns out that the XML specification limits the content of text regions so that control characters, like the one in the last example, are illegal. The three whitespace characters CR, LF and tab are exceptions and are specifically allowed. Control characters are used to control a terminal and aren't generally part of text documents these days; they sit below the printable range of ASCII (which runs from 32 to 126). The second example used a character reference to say that the content was character 9, which is the tab character and so is allowed. The final example used a character reference to say that the content was character 7, which is outside the allowed range, so the document is not well-formed.

The characters allowed in XML documents are specified in the XML specification as production number 2 (thanks to Simon for finding the reference):

Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] |
               [#xE000-#xFFFD] | [#x10000-#x10FFFF]
               /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
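The production above translates almost directly into code. This is only a sketch, and the function name is_xml_char is made up for this post rather than taken from any library.

```python
def is_xml_char(code_point: int) -> bool:
    """Check a Unicode code point against the Char production shown above.

    is_xml_char is a hypothetical name chosen for illustration.
    """
    return (
        code_point in (0x9, 0xA, 0xD)          # tab, LF, CR
        or 0x20 <= code_point <= 0xD7FF        # up to the surrogate blocks
        or 0xE000 <= code_point <= 0xFFFD      # excludes FFFE and FFFF
        or 0x10000 <= code_point <= 0x10FFFF   # supplementary planes
    )


# Tab (9) is allowed, bell (7) is not.
print(is_xml_char(9), is_xml_char(7))
```

A check like this could be run over text before writing it into an XML document, so that stray control characters are caught early instead of producing a file that parsers reject.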

It definitely wasn't the behaviour I was expecting. A little googling shows that I'm not the only one to be caught out by this. Yep, it seems Google's search has the same bug: if you search for &# 1; in Google you get no results and no message, whereas normally, if there are no hits, you get a message saying so. I guess sometimes we all need to be reminded that the devil is in the details.

PS. If you're wondering about the title, look up the control code 0x07 in your ASCII reference. Cheers Mark!