This document summarizes the discussion and conclusions of a meeting held to coordinate across several W3C Working Groups. While the decisions of this forum are not binding on any of the working groups, they represent substantial experience and analysis and should guide future work.
Please direct comments to www-html, a public discussion forum.
This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.
A number of issues regarding the use of XML [XML] in HTML documents were brought to the attention of the W3C Hypertext Coordination Group. In particular, MathML [MathML] and RDF [RDF] are written in XML and intended to be used in HTML documents.
In response, the coordination group held a meeting 11-12 Feb 1998 in San Jose, CA. We would like to thank the host, Sun Microsystems.
As discussed in [Dialects], evolution of the HTML specification proceeds by introduction of new idioms which interact with deployed software in one of the following ways:
For the past few years, the HTML Working Group has vetted new proposals on behalf of the web community, considering the value of each versus the cost of deployment. But with the introduction of XML into the web, markup design is decentralized. Each community or even each user can use whatever elements and attributes they choose and give them whatever meaning and significance they choose. As MathML and RDF show, at least some of this XML markup is intended for use inside HTML documents.
This meeting explored mechanisms to use XML markup in HTML documents: existing mechanisms and possible enhancements. In particular:
Participants from all W3C working groups, especially RDF, MathML, CSS&FP, XML, and DOM, were invited. A wide variety of experience and requirements was represented by the meeting participants:
The participants request that W3C make the W3C site searchable.
The recommended technique for embedding RDF statements in an HTML document is simply to insert the RDF in-line. This will make the resulting document non-conformant to HTML specifications up to and including HTML 4.0 but the RDF Working Group hopes that the HTML specification will evolve to support this.
The discussion around the RDF requirements showed that possible solutions for RDF included putting all the information into attributes; putting it in an external file; and putting it at the end of the document. In general, the participants thought that putting information into attributes was safer than putting it in an external file because of worries about security and forcing tools to be able to cope with multiple files. Since many tools already have to cope with multiple files, other participants thought this was not a drawback where security was not an issue. Some participants thought that putting the information in an external file would sometimes be a necessity, so tools would have to learn to cope.
MathML has many requirements. One of these is a system that can cope with several small chunks of XML in one document, since a document may have many small equations. It has extreme formatting requirements, only some of which are shared by other objects. There was some discussion of MathML needs in terms of the DOM and formatting properties. The MathML has to be able to be passed as a chunk to an external renderer, and the XML has to be able to be formatted in a reasonable way. The MathML does not include HTML elements within it. That was discussed within the MathML WG, but rejected. The requirement that the content of MathML should not show up in down-level browsers was not as strong for MathML as for RDF, although some of the participants thought it would be best.
The participants came to the conclusion that there was definite agreement on doing an XML block, where the contents of the block are well-formed XML, without any HTML semantics. There was much discussion about whether a reasonable method to include significant non-standard non-empty elements could be found, and whether there was a possibility of defining some sort of "good" HTML that people would use. Reasons for not allowing HTML semantics in the XML block, even on elements with the same element types as exist in HTML, included
There was also some support for doing an XML version of HTML, where all the XML rules would apply.
The discussion about whether it was possible to require that the contents of any non-standard elements be well-formed XML mostly came to the conclusion that it wasn't; or that it would be extremely expensive for those users simply wanting to add, e.g., a CHAPTER element to their pages. There was support for the notion that there is a difference between adding XML to pages (where the contents of the XML would be well-formed XML) and adding unknown elements in a standard way to HTML (where the contents of the unknown element would not follow the XML well-formedness rules). Whether the HTML in an unknown HTML element needed to be "good" HTML wasn't fully clarified at the meeting.
Another problem is that old browsers render PIs.
During the discussion the following requirements were generally agreed upon.
Agreement on terminology: XML blocks, significant non-standard HTML elements (sometimes also called sprinkles), and crud (or real-world HTML). But how do we distinguish between XML blocks and significant elements? An XML block contains XML -- not HTML. A significant element contains HTML -- not XML (unless it's empty, of course; we have to be able to distinguish between empty and non-empty).
The question of how to "sprinkle" non-standard elements in an HTML document while retaining HTML semantics of all elements with HTML element and attribute types devoured most of the meeting. We did not come to a final conclusion on this subject. One proposed solution was to use new elements called CONTAINER and LEAF, with the CLASS attribute used to show the type. The drawback is that users can't define non-standard attributes. There was also much discussion as to whether users would accept this sort of solution, or whether they would want to invent their own element types. It was felt that this solution would allow users to keep on using "real" HTML (a.k.a. crud) inside the wrapper elements.
Another proposal was to allow users to define their own wrapper elements. If all elements within the block have end tags, even if they are EMPTY elements, then this could be the way to extensible HTML (not XML). There were several points against this, including the large number of non-standard EMPTY elements that already exist. Many participants thought that defining browser behaviour for this would be almost impossible, and that migrating HTML users to XML with the HTML tagset was a better solution.
How to clean up HTML came up again and again in the discussions. The participants agreed that it is impossible in the general case to create valid HTML from an arbitrary page on the Web without human intervention. Users will not want to risk breaking documents which function. Current HTML has three components: the element type names, default rendering, and semantics (e.g. forms).
There was a strong contingent that said users should wait for XML tools to become generally available and use those, rather than trying to add XML to HTML.
The MathML group would like a mechanism to tell browsers a plain-text string to render, if the equation can't be rendered. This sort of mechanism would potentially be useful for other XML content with high rendering requirements as well.
The biggest reason to come up with a standard method for adding XML (or unknown HTML) to HTML is to allow people to use styles and the DOM with these elements. Currently they can't. Browsers do not apply CSS styles to unknown elements, and unknown container elements are not exposed as containers in the MSIE object model. (The DOM WG decided not to tackle the problem, and only talks about valid HTML 4.0 documents, and XML as a separate entity.)
A potential solution was to write HTML as XML, i.e. with MIME-type text/xml. Then all the XML rules would apply. One problem with this is that some browsers sniff the document irrespective of MIME-type and display the content if it looks like HTML according to some heuristic [InetSDK], Appendix A. This may include, for example, having a TITLE element anywhere within the first 200 bytes of the document. Thus document providers may have to add a comment long enough to get rid of the heuristics.
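The padding workaround mentioned above can be sketched in Python. This is only an illustration: the function name pad_before_title and the sample document are hypothetical, and the 200-byte window is the example figure from the text, not a documented constant of any browser.

```python
import xml.etree.ElementTree as ET

SNIFF_WINDOW = 200  # example figure from the text; real heuristics vary

def pad_before_title(document, window=SNIFF_WINDOW):
    """Insert a filler comment so <title> starts at or beyond `window` bytes."""
    data = document.encode("utf-8")
    position = data.lower().find(b"<title")
    if position == -1 or position >= window:
        return document  # nothing inside the window for a heuristic to find
    # The filler uses dots because "--" is not allowed inside an XML comment.
    comment = b"<!-- " + b"." * (window - position) + b" -->"
    return (data[:position] + comment + data[position:]).decode("utf-8")

DOC = '<?xml version="1.0"?><html><head><title>Report</title></head><body/></html>'
padded = pad_before_title(DOC)
title_offset = padded.encode("utf-8").lower().find(b"<title")
root = ET.fromstring(padded)  # the padded document is still well-formed
```

The comment is placed directly before the title element rather than at the start of the file, so that an XML declaration, if present, remains the first thing in the document.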
The first option for using XML in HTML documents is to include it by reference, using <LINK>, <A>, <OBJECT> or perhaps even <IMG>. This markup conforms to existing W3C Recommendations. This gives predictable behaviour across the whole spectrum of HTML user agents, at the cost of managing and accessing the compound document.
Another option with predictable behaviour is to use tags and attributes only, and avoid character data which will be displayed by deployed software. Strictly speaking, documents enhanced this way do not conform to the HTML 2, 3.2, or 4.0 specification, but each of those specifications included a note to implementors to ignore unknown attributes.
The XML namespace facility [XML-Names] should be used to manage the risk of name collisions for new attributes and elements. Note that unfortunately, much of the deployed base of user agents will display XML namespace declarations as text.
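A small Python sketch of how namespaces keep same-named elements apart. The two vocabularies and their URIs are hypothetical, and the sample uses the xmlns attribute syntax that Python's standard parser understands, which may differ from the declaration syntax in the draft cited as [XML-Names].

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabularies: both define an element named "title".
DOC = """<doc xmlns:book="http://example.org/book"
     xmlns:person="http://example.org/person">
  <book:title>Principia</book:title>
  <person:title>Sir</person:title>
</doc>"""

def qualified_names(xml_text):
    """Return the namespace-qualified tag of every element, in document order."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]

names = qualified_names(DOC)
# The two "title" elements get distinct qualified names,
# so software can tell them apart despite the shared local name.
```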
The linking and attributes mechanisms do not satisfy all of the requirements presented at the meeting. It was agreed that an enhancement to HTML to accommodate XML blocks is necessary.
The definition of an XML block is a chunk of well-formed XML that is inside an HTML document. Any elements within the chunk that happen to have the same element types as HTML elements are not considered to be HTML elements. The error-handling as defined in the XML specification applies, i.e. the processor must halt on well-formedness errors.
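The halt-on-error rule can be illustrated with Python's standard XML parser, which is one strict XML processor among many. The helper name parse_xml_block and the sample markup are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

def parse_xml_block(block_text):
    """Parse strictly; return (root, None) on success or (None, message) on error."""
    try:
        return ET.fromstring(block_text), None
    except ET.ParseError as err:
        # A well-formedness error stops processing; no recovery is attempted.
        return None, str(err)

# <p> and <br> look like HTML, but inside an XML block they are plain XML names.
ok_root, ok_err = parse_xml_block("<note><p>text<br/></p></note>")

# An unclosed <br>, which HTML browsers tolerate, is fatal under XML rules.
bad_root, bad_err = parse_xml_block("<note><p>text<br></p></note>")
```

The second call shows the practical consequence of dropping HTML semantics: markup habits that HTML parsers forgive make the whole block unusable.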
There were two proposals for this. (Other proposals that were discussed were discovered to be variations of these).
Using a specific element type has the advantage that the meaning is clear, and that attributes can be added to the element for such things as MIME-type and a link to an external file containing the XML content.
For the XML block case, the group decided on a vote of 10 for and 1 abstention (none against) to use an element called XML. This must be added to a future version of HTML. The attributes are TYPE for the MIME-type and SRC for the URL of the content if it is in an external file. The contents of the XML element are XML. There is an xml PI at the beginning of the XML block that contains all other information that the XML block needs.
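As a rough sketch of how a tool might pick such blocks out of a page, the following Python uses regular expressions to find XML elements, read their TYPE and SRC attributes, and hand inline content to a strict XML parser. The sample page, the file name meta.xml and the function name extract_xml_blocks are hypothetical, and a regular expression is a stand-in for a real HTML parser, not a recommendation.

```python
import re
import xml.etree.ElementTree as ET

# Simplified patterns: double-quoted attributes, no nested XML elements.
XML_BLOCK = re.compile(r"<XML\b([^>]*)>(.*?)</XML\s*>", re.IGNORECASE | re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def extract_xml_blocks(html_text):
    """Return a list of (attributes, parsed root or None) for each XML element."""
    blocks = []
    for match in XML_BLOCK.finditer(html_text):
        attrs = {name.upper(): value
                 for name, value in ATTRIBUTE.findall(match.group(1))}
        body = match.group(2).strip()
        # Inline content is parsed strictly; an empty body means SRC is used.
        root = ET.fromstring(body) if body else None
        blocks.append((attrs, root))
    return blocks

PAGE = """<HTML><HEAD><TITLE>Demo</TITLE></HEAD>
<BODY>
<P>An equation follows.</P>
<XML TYPE="text/xml">
<?xml version="1.0"?>
<math><mi>x</mi></math>
</XML>
<XML TYPE="text/xml" SRC="meta.xml"></XML>
</BODY></HTML>"""

blocks = extract_xml_blocks(PAGE)
```

Fetching the external file named by SRC is left out; the sketch only records the attribute.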
Interoperability with the 3.0 generation of browsers is required for successful deployment of RDF, among other applications. This means that the XML block is not a complete solution either.
There are a number of ways in which content can be made to not show up in browsers that don't understand the element.
Of these, putting the content in the HEAD is the most problematic because of the difficulties for HTML browsers of defining where the HEAD ends.
Any of these methods would be considered to not break HTML or XML, and the participants decided that these should be written up (with the exception of putting content in the HEAD) as the recommended methods for coping with XML where the content should not show up in older browsers.
There are, of course, times when none of these methods are suitable for some reason. The group therefore decided to also figure out which of the many unliked methods was the least undesirable. The choices were
The proposal to put the XML content inside an OBJECT element was quickly rejected, as it would not work in Netscape Navigator 3.0.
The problem with APPLET is that if the user has applet loading turned off, the content will show. The problem with SCRIPT is that it breaks the currently defined content model of SCRIPT. There were also worries about whether future XML users will use the SCRIPT element themselves, which would not be possible if it were a reserved element. This concern wasn't shared by the entire group. The problem with using comments is that comments are meant to not contain parsed data, and users couldn't put another comment inside the XML content.
The vote (1 per company) was 1 for comments, 1 for APPLET, and 8 for SCRIPT.
Details of the XML block and SCRIPT mechanisms are the subject of a Working Draft in progress.
The discussion of using XML markup in HTML documents such that it would be "significant" to stylesheet and DOM implementations did not reach a clear consensus.
We observed that XML can be modelled using the HTML 4.0 DIV, SPAN, and CLASS markup, which are significant to stylesheet and DOM implementations. Some experience with this style suggested the community would not embrace it, but the discussion was not conclusive.
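One possible reading of this modelling, sketched in Python: each XML element becomes a DIV if it has child elements and a SPAN otherwise, with the original element type carried in CLASS. The mapping rule, the function name to_html and the sample vocabulary are assumptions of this sketch, not something the meeting specified, and XML attributes are simply dropped.

```python
import xml.etree.ElementTree as ET
from html import escape

def to_html(elem):
    """Render an XML element as DIV (has children) or SPAN (text only)."""
    children = list(elem)
    tag = "DIV" if children else "SPAN"
    parts = [f'<{tag} CLASS="{escape(elem.tag)}">']
    if elem.text and elem.text.strip():
        parts.append(escape(elem.text.strip()))
    for child in children:
        parts.append(to_html(child))
        # Text following a child element belongs to the parent's content.
        if child.tail and child.tail.strip():
            parts.append(escape(child.tail.strip()))
    parts.append(f"</{tag}>")
    return "".join(parts)

SOURCE = "<chapter><heading>Intro</heading><para>Hello &amp; welcome</para></chapter>"
html_text = to_html(ET.fromstring(SOURCE))
# html_text is:
# <DIV CLASS="chapter"><SPAN CLASS="heading">Intro</SPAN><SPAN CLASS="para">Hello &amp; welcome</SPAN></DIV>
```

A stylesheet could then address the result with class selectors such as DIV.chapter or SPAN.heading, which is what makes the markup "significant" to existing implementations.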
A proposal for a "sprinkles" mechanism is the subject of a Working Draft in progress.