A friend writes: There is a reported bug in the validator, that SGML character number 128-159 are not allowed for xml-based markup languages. We have a test case at: Our parser is opensp, and our opensp uses In the bugzilla item I mentioned above, Terje Bless, who generally knows much more about SGML than I do, thinks it may just be that our sgml declaration for xml should be updated to include this character range. http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164#c5 As I am rather confused by the issue, I'd appreciate any guidance, diagnosis, or pointer, you could provide.
This is the second time this week I've indulged my inner language-lawyer in response to some query. I reproduce my reply to this question here, as an Awful Warning to those who might otherwise be tempted to ask me questions about XML. What I wrote was (more or less):
OK, I'll try.
I'm going to ask a number of short, pointed questions, provide long, digressive answers (sorry about that), and then say what I think it all means for your problem.
Note that the character range x80-x9F is known, for historical reasons, as "the C1 range" or "the C1 characters". I can explain if you wish. But be careful what you wish for.
In XML 1.0, the grammar production for Char includes them, so yes.
The formulation of the Char production has changed from time to time, but 7F and the C1 range have always been included.
In XML 1.1, the grammar production for Char continues to allow them, but the 'document' production takes care to exclude them (in their literal form) from the document.
In 1.1, the C1 characters may be referred to using numeric character references (&#x80; etc.) but not used as literal characters.
So: XML 1.0 does not forbid the use of these characters. XML 1.1 forbids their appearance as literals but not as numeric character references.
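For readers who prefer code to prose, the rule just stated can be sketched in a few lines of Python. The function names are mine, invented for illustration; the ranges are my transcription of the Char production of XML 1.0 and of the Char and RestrictedChar productions of XML 1.1, so check them against the specs before relying on them. (As I read XML 1.1, RestrictedChar leaves out x85, so that one C1 character may still appear literally.)

```python
# A sketch of the character rules discussed above. The function names are
# illustrative only; the ranges are transcribed from the XML 1.0 and 1.1 specs.

def is_xml10_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.1."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """True if cp matches RestrictedChar in XML 1.1 (reference-only)."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def xml11_literal_ok(cp: int) -> bool:
    """True if cp may appear as a literal character in an XML 1.1 document."""
    return is_xml11_char(cp) and not is_xml11_restricted(cp)


for cp in (0x7F, 0x80, 0x9F):
    print(hex(cp), is_xml10_char(cp), is_xml11_char(cp), xml11_literal_ok(cp))
```

Each of the three code points printed is a Char in both versions, but none of them passes the literal test for 1.1.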
It might be argued (I think Chris Lilley has done so) that since the C1 characters aren't really Unicode, they aren't legal in a document whose document character set is supposed to be Unicode.
The Unicode 2.0 spec open on my desk, however, says: "Like the C0 control codes, the Unicode Standard makes no specific use of these C1 control codes, but provides for the passage of their numeric code values intact, neither adding to nor subtracting from their semantics. The semantics of the C1 controls are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the semantics specified in ISO 6429." (p. 6-5, section "Latin-1 Supplement: U+0080 - U+00FF")
I take that to mean that for all intents and purposes they are legal Unicode characters. That Unicode does not assign meanings to them does not constitute an argument that they are excluded from Unicode: there are lots of gaps in Unicode. U+0FB0, for example, is also not defined as meaning a specific character (at least in Unicode 2.0; I'm too lazy to check the current version), but it's clearly got to be accepted in a Unicode data stream.
So: Unicode includes these characters.
No. The test case you linked to illustrates that very nicely.
Yes, the declaration at
The document character set is defined by a CHARSET declaration
The "document character set" as defined by SGML is rather unlike the "document character set" concept of HTML 4, which brilliantly co-opted the SGML term and gave it a new and better meaning. (At least, that's the way I understand the history of events.)
As defined by SGML, the document character set is the actual coded character set (aka character encoding) the parser can expect to encounter, conceived as a mapping from integers to characters. The bit combinations come in, and the parser knows what characters they represent by reference to the character set declaration.
In HTML, by contrast, the "document character set" is the repertoire of abstract objects called "characters" which may occur in an HTML document and which are mapped 1:1 with a set of integers. The integer mappings are relevant for numeric character references, but for nothing else. In particular, the HTML spec explicitly clarifies that the document character set has nothing in particular to do with the encoding in which data may arrive, except that the abstract characters encoded by the encoding had better be present in the document character set. (The HTML and later XML view is concisely summarized by Gavin Nicol at
XML essentially adopted the ideas of HTML 4 on this point: the ISO 10646/Unicode character set is conceived as a large and abstract pairing of integers and characters, one step divorced from the messy business of actual encoding.
So from the point of view of an SGML processor, the character set description above documents which bit patterns will and won't occur in the input stream. (The ones that won't occur are important, because the SGML spec assumes a processor may want to use them for its own internal purposes.) From the HTML and XML point of view, the description documents which characters will and won't occur.
How does it work? The BASESET says that we'll describe the document character set by reference to the coded character set whose public identifier is "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6". The reader is assumed to be in a position to understand what references to that character spec mean.
The DESCSET bit contains a sequence of triples which assign meanings to integers, using a kind of run-length documentation.
And so on. So the lines
A character set declaration similar to this one, but which allows DEL and the C1 range, would have a DESCSET section like this:
You could of course replace the three lines for 127-55295 with the single line
So: the SGML declaration used by many people as representing the rules of XML 1.0 disagrees with the XML 1.0 spec on the characters x7F through x9F.
It may be worth noting that the rule in the XML 1.1 spec which some people find odd, that says that the characters in the range x7F-x9F may be referred to using numeric character references but must not appear as literals, is precisely the rule implied by the SGML declaration: by marking the characters UNUSED, the SGML declaration says they don't appear as literals, but not that they can't be referred to numerically.
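The run-length reading of DESCSET triples can be sketched in Python as follows. Be warned that the two tables are hypothetical: I have built them to be consistent with the description in the text (7F and the C1 range marked UNUSED in one, allowed in the other), and they stop at 55295; they are not copied from any published SGML declaration. The names UNUSED, RESTRICTIVE, PERMISSIVE and lookup are likewise my own.

```python
from typing import List, Optional, Tuple

# Each triple is (first number, count, base number or UNUSED).
# Illustrative data only, built to match the prose, not a published declaration.
UNUSED = None
Triple = Tuple[int, int, Optional[int]]

RESTRICTIVE: List[Triple] = [
    (0, 9, UNUSED),
    (9, 2, 9),
    (11, 2, UNUSED),
    (13, 1, 13),
    (14, 18, UNUSED),
    (32, 95, 32),
    (127, 1, UNUSED),    # DEL
    (128, 32, UNUSED),   # the C1 range
    (160, 55136, 160),
]

# The same table with the three lines for 127-55295 merged into one run.
PERMISSIVE: List[Triple] = RESTRICTIVE[:6] + [(127, 55169, 127)]


def lookup(descset: List[Triple], n: int) -> Optional[int]:
    """Return the base-set number that n maps to, or None if n is UNUSED."""
    for start, count, base in descset:
        if start <= n < start + count:
            return UNUSED if base is UNUSED else base + (n - start)
    raise KeyError(f"{n} is not described by this DESCSET")


print(lookup(RESTRICTIVE, 0x80), lookup(PERMISSIVE, 0x80))
```

With the first table, x80 comes back as UNUSED; with the second, it maps to itself.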
As far as I can tell, it's always been there.
I have looked at all the published drafts of XML 1.0 to see if some early draft excluded the C1 characters; no, as mentioned above they all include 7F and the C1 controls.
I have consulted Dave Peterson, who worked intensively with me in the winter of 1996-97 to categorize all of the divergences between SGML and the first draft of XML, and who on the basis of that work prepared the first draft of what became the Web SGML Annex, to ask him if he remembered the responsible ISO WG deciding that they needed to exclude the C1 controls. He has no memory of such a decision, and neither he nor I can think of a reason the SGML WG would have felt it necessary. The SGML spec goes to extreme lengths to try to make it possible to describe arbitrarily weird encodings and use them to encode SGML documents. (In fact even the huge complexity of the character set mechanisms in SGML falls short of the ingenuity of some designers of character encodings, so SGML can't describe some existing encodings well -- but even so, those encodings can be used to encode SGML documents.)
It appears likely that the SGML declaration in the Web SGML Annex was copied from the SGML declaration formulated by James Clark during the development of XML and published as part of the SGML/XML note (
Further excavation reveals that an SGML declaration was included in the first published working draft of XML -- in the printed form only, however, not the version at
Like every other SGML declaration for XML I have found today, that one excludes 7F and the C1 controls.
Those whose pain threshold for character set discussions has not already been exceeded will find more discussion in the long thread at
Initially, one might be unsure.
The prose suggests that they are legal. HTML 4.01 says that its document character set is Unicode, and nowhere in the section on HTML document representation in HTML 4.01 (
On the other hand, the HTML 4 spec has an SGML declaration that indicates quite clearly that the characters are not legal. (
The relevant part of the SGML declaration reads:
Is the SGML declaration normative? It would seem to be: it's in a numbered section, not an appendix, and it's not labeled non-normative or informative. And the section on conformance describes HTML as a conforming SGML application.
So: I conclude that the SGML declaration is normative and that 7F and the C1 controls are not legal in HTML 4.
It might appear not.
XHTML describes itself as a reformulation in XML of HTML 4.01, so I believe that the character-set restriction of HTML 4 is inherited by XHTML 1.0. It's no longer enforced by the lower-level markup system, so in XHTML it would appear to be an "application convention", i.e. a rule that goes beyond those imposed by XML. The comparison of XHTML 1.0 with HTML 4.01 (
May we conclude that XHTML 1.0, like HTML 4, excludes x7F and the C1 controls?
In the first draft of this treatise I did so conclude. But a different analysis is possible.
XHTML 1.0 was intended as an XML 1.0 application, and all XML applications have the same rule for character sets. The WG regarded itself as just adopting whatever it was that the XML spec said; they didn't believe they had an option.
On that analysis, the rule for XHTML 1.0 is whatever the rule for XML 1.0 is, which does not forbid these characters. XHTML 1.0 assumes a ‘generic’ XML parser.
The absence of the difference from the list of differences is to be understood not as a statement that there is no difference, but as an omission from the list, either because it was regarded as uninteresting or because the WG didn't notice this particular difference. The overarching goal in developing XHTML 1.0 was, to quote the chair of the HTML WG, “to be a generic XML as we could”.
So: we conclude that XHTML 1.0, unlike HTML 4, does not exclude x7F and the C1 controls by means of its SGML declaration. If they are legal in XML in general, they are legal in XHTML 1.0.
N.B. this does not mean that it's a good idea to
1) There's definitely a bug in the SGML declarations in wide circulation for XML 1.0. Either that, or I am being defeated once more by ISO 8879's character set mechanisms.
2) HTML 4 excludes 7F and the C1 range.
3) XHTML 1.0 does not.
4) The validator seems to be correct in rejecting the characters in question in HTML 4.0 documents.
5) The validator appears to be incorrect in rejecting the test document at
I hope this helps.