
A friend writes: Olivier Thereaux, help with "sgml declaration for xml"? Email to MSM and w3t-archive, 23 March 2007.

There is a reported bug in the validator, that SGML character number 128-159 are not allowed for xml-based markup languages. http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164

We have a test case at: http://validator.w3.org/check?uri=http%3A%2F%2Ftest.wikipedia.org%2Fwiki%2FUser%3AR._Koot%2FC1-2&charset=%28detect+automatically%29&doctype=Inline which indeed complains about these.

Our parser is opensp, and our opensp uses http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.soc as a catalog in xml mode, and thus http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.dcl as an sgml declaration for xml.

In the bugzilla item I mentioned above, Terje Bless, who generally knows much more about SGML than I do, thinks it may just be that our sgml declaration for xml should be updated to include this character range. http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164#c5

As I am rather confused by the issue, I'd appreciate any guidance, diagnosis, or pointer, you could provide.

This is the second time this week I've indulged my inner language-lawyer in response to some query. I reproduce my reply to this question here, as an Awful Warning to those who might otherwise be tempted to ask me questions about XML. What I wrote was (more or less):

I'm going to ask a number of short, pointed questions, provide long, digressive answers (sorry about that), and then say what I think it all means for your problem.

Note that the character range x80-x9F is known, for historical reasons, as "the C1 range" or "the C1 characters". I can explain if you wish. But be careful what you wish for.

Are characters x7F through x9F legal in XML?

In XML 1.0, the grammar production for Char includes them, so yes.

    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The formulation of the Char production has changed from time to time, but 7F and the C1 range have always been included.
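The production can be checked mechanically. The following is a minimal sketch in Python (not a language the original discussion uses); the names XML10_CHAR_RANGES and is_xml10_char are invented for illustration, and the ranges are simply transcribed from production [2] as quoted above.

```python
# Inclusive code-point ranges transcribed from production [2] Char of XML 1.0.
# The names below are illustrative, not part of any library.
XML10_CHAR_RANGES = [
    (0x9, 0xA),          # HT, LF
    (0xD, 0xD),          # CR
    (0x20, 0xD7FF),      # space up to the start of the surrogate block
    (0xE000, 0xFFFD),    # rest of the BMP, minus FFFE and FFFF
    (0x10000, 0x10FFFF), # the 16 planes outside the BMP
]


def is_xml10_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML10_CHAR_RANGES)


# DEL (x7F) and the whole C1 range (x80-x9F) fall inside [#x20-#xD7FF].
assert all(is_xml10_char(cp) for cp in range(0x7F, 0xA0))
# Most C0 controls, by contrast, are excluded.
assert not is_xml10_char(0x0B)
```

Nothing in the production singles out x7F-x9F; they are admitted simply because they lie inside the second large range.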

In XML 1.1, the grammar production for Char continues to allow them, but the 'document' production takes care to exclude them (in their literal form) from the document.

    [1] document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

In 1.1, the C1 characters may be referred to using numeric character references (&#x80; etc.) but not used as literal characters.

So: XML 1.0 does not forbid the use of these characters. XML 1.1 forbids their appearance as literals but not as numeric character references.
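The XML 1.1 distinction between literals and references can be sketched the same way. The Char and RestrictedChar ranges below are not quoted in the text above; they are transcribed from productions [2] and [2a] of the XML 1.1 Recommendation as I understand them, and are worth checking against the spec itself. The function names are hypothetical. One detail the sketch brings out is that x85 (NEL) is not listed in RestrictedChar, so the literal prohibition covers x7F-x84 and x86-x9F.

```python
# Assumption: ranges transcribed from XML 1.1 productions [2] Char and
# [2a] RestrictedChar; verify against the Recommendation before relying on them.
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F), (0x7F, 0x84), (0x86, 0x9F)]


def _in_ranges(ranges, code_point):
    """True if code_point lies in any inclusive (lo, hi) range."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


def xml11_allowed_as_reference(code_point):
    """A numeric character reference must refer to something matching Char."""
    return _in_ranges(XML11_CHAR, code_point)


def xml11_allowed_as_literal(code_point):
    """A literal must match Char and must not be a RestrictedChar."""
    return (_in_ranges(XML11_CHAR, code_point)
            and not _in_ranges(XML11_RESTRICTED, code_point))


# x80 may be written as a numeric character reference but not typed in directly.
assert xml11_allowed_as_reference(0x80)
assert not xml11_allowed_as_literal(0x80)
```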

Are these characters legal in Unicode?

It might be argued (I think Chris Lilley has done so) that since the C1 characters aren't really Unicode, they aren't legal in a document whose document character set is supposed to be Unicode.

The Unicode 2.0 spec open on my desk, however, says

Like the C0 control codes, the Unicode Standard makes no specific use of these C1 control codes, but provides for the passage of their numeric code values intact, neither adding to nor subtracting from their semantics. The semantics of the C1 controls are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the semantics specified in ISO 6429.

(p. 6-5, section "Latin-1 Supplement: U+0080 - U+00FF")

I take that to mean that for all intents and purposes they are legal Unicode characters. That Unicode does not assign meanings to them does not constitute an argument that they are excluded from Unicode: there are lots of gaps in Unicode. U+0FB0, for example, is also not defined as meaning a specific character (at least in Unicode 2.0; I'm too lazy to check the current version), but it's clearly got to be accepted in a Unicode data stream.

So: Unicode includes these characters.

Does the SGML declaration used by opensp really exclude them? How?

Yes, the declaration at http://dev.w3.org/cvsweb/validator/htdocs/sgml-lib/xml.dcl?rev=1.3&content-type=text/x-cvsweb-markup, which you say our installation of opensp is using, does exclude 7F and the C1 range.

The document character set is defined by a CHARSET declaration

    CHARSET
      BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
      DESCSET
        0       9       UNUSED
        9       2       9
        11      2       UNUSED
        13      1       13
        14      18      UNUSED
        32      95      32
        127     1       UNUSED
        128     32      UNUSED
        160     55136   160
        55296   2048    UNUSED  -- surrogates --
        57344   8190    57344
        65534   2       UNUSED  -- FFFE and FFFF --
        65536   1048576 65536   -- 16 planes outside BMP --

The "document character set" as defined by SGML is rather unlike the "document character set" concept of HTML 4, which brilliantly co-opted the SGML term and gave it a new and better meaning. (At least, that's the way I understand the history of events.)

As defined by SGML, the document character set is the actual coded character set (aka character encoding) the parser can expect to encounter, conceived as a mapping from integers to characters. The bit combinations come in, and the parser knows what characters they represent by reference to the character set declaration.

In HTML, by contrast, the "document character set" is the repertoire of abstract objects called "characters" which may occur in an HTML document and which are mapped 1:1 with a set of integers. The integer mappings are relevant for numeric character references, but for nothing else. In particular, the HTML spec explicitly clarifies that the document character set has nothing in particular to do with the encoding in which data may arrive, except that the abstract characters encoded by the encoding had better be present in the document character set. (The HTML and later XML view is concisely summarized by Gavin Nicol at http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997Jan/0287.html.)

XML essentially adopted the ideas of HTML 4 on this point: the ISO 10646/Unicode character set is conceived as a large and abstract pairing of integers and characters, one step divorced from the messy business of actual encoding.

So from the point of view of an SGML processor, the character set description above documents which bit patterns will and won't occur in the input stream. (The ones that won't occur are important, because the SGML spec assumes a processor may want to use them for its own internal purposes.) From the HTML and XML point of view, the description documents which characters will and won't occur.

How does it work? The BASESET says that we'll describe the document character set by reference to the coded character set whose public identifier is "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6". The reader is assumed to be in a position to understand what references to that character spec mean.

The DESCSET bit contains a sequence of triples which assign meanings to integers, using a kind of run-length documentation. "0 9 UNUSED" means the numbers from 0 to 8 (starting position 0, length of sequence 9) are not used. They are NONSGML characters -- they may (in SGML but not in XML 1.0) be referred to by means of numeric character references, but they will NOT appear as literals. "9 2 9" means that characters 9 and 10 (sequence of length 2, starting at 9) have the meanings of characters 9 and 10 in the base character set, which in this case are HT and LF.

And so on. So the lines

    127     1       UNUSED
    128     32      UNUSED

mean that the character whose number is 127 (conventionally DEL) is not used, and neither are the 32 characters from 128 through 159 (x80 - x9F).
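The run-length reading of the triples can be made concrete with a short sketch. This is an illustration in Python, not how opensp itself processes a declaration; parse_descset and is_used are hypothetical names, and the parser handles only the numeric and UNUSED forms that occur in this declaration (SGML also allows a quoted literal in the third position, which is ignored here).

```python
import re

# The DESCSET triples from the declaration under discussion.
DESCSET = """
    0       9       UNUSED
    9       2       9
    11      2       UNUSED
    13      1       13
    14      18      UNUSED
    32      95      32
    127     1       UNUSED
    128     32      UNUSED
    160     55136   160
    55296   2048    UNUSED  -- surrogates --
    57344   8190    57344
    65534   2       UNUSED  -- FFFE and FFFF --
    65536   1048576 65536   -- 16 planes outside BMP --
"""


def parse_descset(text):
    """Return a list of (start, count, base) triples; base is None for UNUSED."""
    text = re.sub(r"--.*?--", " ", text)  # drop SGML comments
    tokens = text.split()
    triples = []
    for i in range(0, len(tokens), 3):
        start, count, base = tokens[i:i + 3]
        triples.append((int(start), int(count),
                        None if base == "UNUSED" else int(base)))
    return triples


def is_used(triples, number):
    """True if the declaration assigns a meaning to this character number."""
    for start, count, base in triples:
        if start <= number < start + count:
            return base is not None
    return False  # numbers not described at all are treated as unused in this sketch


triples = parse_descset(DESCSET)
assert is_used(triples, 9) and is_used(triples, 10)   # HT, LF
assert not is_used(triples, 127)                      # DEL
assert not any(is_used(triples, n) for n in range(128, 160))  # C1 range
assert is_used(triples, 160)
```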

A character set declaration similar to this one, but which allows DEL and the C1 range, would have a DESCSET section like this:

    DESCSET
        0       9       UNUSED
        9       2       9       -- HT, LF --
        11      2       UNUSED
        13      1       13      -- CR --
        14      18      UNUSED
        32      95      32      -- space through tilde --
        127     1       127     -- DEL, legal in XML 1.0 --
        128     32      128     -- C1 controls, legal in XML 1.0 --
        160     55136   160
        55296   2048    UNUSED  -- surrogates --
        57344   8190    57344
        65534   2       UNUSED  -- FFFE and FFFF --
        65536   1048576 65536   -- 16 planes outside BMP --

You could of course replace the three lines for 127-55295 with the single line

    127     55169   127

but that would only confuse people, I think.
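The arithmetic behind the merged line is easy to verify. The few lines of Python below are only a sanity check; last_of is a hypothetical helper that turns a (start, count) pair into the last number the run covers.

```python
def last_of(start, count):
    """Last character number covered by a run of `count` numbers from `start`."""
    return start + count - 1


three_lines = [(127, 1), (128, 32), (160, 55136)]
merged = (127, 55169)

# The three runs are contiguous: each begins right after the previous one ends.
for (s1, c1), (s2, _) in zip(three_lines, three_lines[1:]):
    assert last_of(s1, c1) + 1 == s2

# Their lengths add up to the merged count, and both forms end at xD7FF.
assert sum(count for _, count in three_lines) == merged[1]
assert last_of(*merged) == last_of(*three_lines[-1]) == 0xD7FF
```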

So: the SGML declaration used by many people as representing the rules of XML 1.0 disagrees with the XML 1.0 spec on the characters x7F through x9F.
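That the disagreement is confined to exactly this range can be confirmed by brute force. The sketch below, again in Python and with illustrative names, hard-codes the XML 1.0 Char ranges and the (start, count) pairs of the DESCSET lines that are not marked UNUSED, then lists every code point on which the two disagree.

```python
# Inclusive ranges from production [2] Char of XML 1.0.
CHAR_XML10 = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# (start, count) for the DESCSET lines of the declaration that are not UNUSED.
DECL_USED = [(9, 2), (13, 1), (32, 95), (160, 55136),
             (57344, 8190), (65536, 1048576)]
DECL_RANGES = [(start, start + count - 1) for start, count in DECL_USED]


def in_ranges(ranges, code_point):
    """True if code_point lies in any inclusive (lo, hi) range."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


# Walk the whole code space and collect the points where spec and
# declaration give different answers.
disagreements = [cp for cp in range(0x110000)
                 if in_ranges(CHAR_XML10, cp) != in_ranges(DECL_RANGES, cp)]

assert disagreements == list(range(0x7F, 0xA0))
```

Everywhere else, including the surrogates, FFFE and FFFF, the declaration and the Char production agree.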

It may be worth noting that the rule in the XML 1.1 spec which some people find odd, that says that the characters in the range x7F-x9F may be referred to using numeric character references but must not appear as literals, is precisely the rule implied by the SGML declaration: by marking the characters UNUSED, the SGML declaration says they don't appear as literals, but not that they can't be referred to numerically.

When did this discrepancy enter into the SGML declaration?

As far as I can tell, it's always been there.

I have looked at all the published drafts of XML 1.0 to see if some early draft excluded the C1 characters; no, as mentioned above they all include 7F and the C1 controls.

I have consulted Dave Peterson, who worked intensively with me in the winter of 1996-97 to categorize all of the divergences between SGML and the first draft of XML, and who on the basis of that work prepared the first draft of what became the Web SGML Annex, to ask him if he remembered the responsible ISO WG deciding that they needed to exclude the C1 controls. He has no memory of such a decision, and neither he nor I can think of a reason the SGML WG would have felt it necessary. The SGML spec goes to extreme lengths to try to make it possible to describe arbitrarily weird encodings and use them to encode SGML documents. (In fact even the huge complexity of the character set mechanisms in SGML falls short of the ingenuity of some designers of character encodings, so SGML can't describe some existing encodings well -- but even so, those encodings can be used to encode SGML documents.)

It appears likely that the SGML declaration in the Web SGML Annex was copied from the SGML declaration formulated by James Clark during the development of XML and published as part of the SGML/XML note (http://www.w3.org/TR/NOTE-sgml-xml-971215.html). That SGML declaration excludes these characters; I do not understand why. Nor do I understand why the discrepancy between the definition of Char in the XML spec and the SGML declaration was not noticed by the XML Working Group and eliminated. Possibly I have simply forgotten some discussion of the topic.

Further excavation reveals that an SGML declaration was included in the first published working draft of XML -- in the printed form only, however, not the version at http://www.w3.org/TR/WD-xml-961114

Like every other SGML declaration for XML I have found today, that one excludes 7F and the C1 controls.

Those whose pain threshold for character set discussions has not already been exceeded will find more discussion in the long thread at http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997Jan/0162.html which seems to indicate that the first SGML declaration for XML was actually drafted not by James Clark but by Jon Bosak. If it's the one in http://www.w3.org/TR/WD-xml-lang-970331 then the error really HAS always been there.

Are these characters legal in HTML 4?

Initially, one might be unsure.

The prose suggests that they are legal. HTML 4.01 says that its document character set is Unicode, and nowhere in the section on HTML document representation in HTML 4.01 (http://www.w3.org/TR/1999/REC-html401-19991224/charset.html) have I found anything that implies that the C1 characters are excluded.

On the other hand, the HTML 4 spec has an SGML declaration that indicates quite clearly that the characters are not legal. (http://www.w3.org/TR/1999/REC-html401-19991224/sgml/sgmldecl.html)

The relevant part of the SGML declaration reads:

    CHARSET
      ...
      DESCSET
        0       9       UNUSED
        ...
        127     1       UNUSED
        128     32      UNUSED
        ...

Is the SGML declaration normative? It would seem to be: it's in a numbered section, not an appendix, and it's not labeled non-normative or informative. And the section on conformance describes HTML as a conforming SGML application.

So: I conclude that the SGML declaration is normative and that 7F and the C1 controls are not legal in HTML 4.

Are these characters legal in XHTML 1.0?

It might appear not.

XHTML describes itself as a reformulation in XML of HTML 4.01, so I believe that the character-set restriction of HTML 4 is inherited by XHTML 1.0. It's no longer enforced by the lower-level markup system, so in XHTML it would appear to be an "application convention", i.e. a rule that goes beyond those imposed by XML. The comparison of XHTML 1.0 with HTML 4.01 (http://www.w3.org/TR/2002/REC-xhtml1-20020801/#diffs) seems to suggest that the differences are all subtractions from the set of legal documents: XHTML forbids some things allowed by HTML 4. If there are any points where it says XHTML allows things disallowed by HTML 4, I didn't see them.

May we conclude that XHTML 1.0, like HTML 4, excludes x7F and the C1 controls?

In the first draft of this treatise I did so conclude. But a different analysis is possible.

XHTML 1.0 was intended as an XML 1.0 application, and all XML applications have the same rule for character sets. The WG regarded itself as just adopting whatever it was that the XML spec said; they didn't believe they had an option.

On that analysis, the rule for XHTML 1.0 is whatever the rule for XML 1.0 is, which does not forbid these characters. XHTML 1.0 assumes a ‘generic’ XML parser.

The absence of the difference from the list of differences is to be understood not as a statement that there is no difference, but as an omission from the list, either because it was regarded as uninteresting or because the WG didn't notice this particular difference. The overarching goal in developing XHTML 1.0 was, to quote the chair of the HTML WG, “to be a generic XML as we could”.

So: we conclude that XHTML 1.0, unlike HTML 4, does not exclude x7F and the C1 controls by means of its SGML declaration. If they are legal in XML in general, they are legal in XHTML 1.0.

N.B. this does not mean that it's a good idea to use 7F and the C1 controls in XHTML 1.0 documents. See the article FAQ: HTML, XHTML, XML and Control Codes at http://www.w3.org/International/questions/qa-controls for a concise statement of why they should in practice be avoided.