Note: On 7 February 2013, this specification was modified in place to replace broken links to RFC4646 and RFC4647.
Please refer to the errata for this document, which may include some normative corrections.
The previous errata for this document are also available.
See also translations.
This document is also available in these non-normative formats: XML and XHTML with color-coded revision indicators.
Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web. It is a product of the XML Core Working Group as part of the XML Activity. The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/2003/03/Translations/byTechnology?technology=xml.
This document is a W3C Recommendation. This fifth edition is not a new version of XML. As a convenience to readers, it incorporates the changes dictated by the accumulated errata (available at http://www.w3.org/XML/xml-V10-4e-errata) to the Fourth Edition of XML 1.0, dated 16 August 2006. In particular, erratum [E09] relaxes the restrictions on element and attribute names, thereby providing in XML 1.0 the major end user benefit currently achievable only by using XML 1.1. As a consequence, many possible documents which were not well-formed according to previous editions of this specification are now well-formed, and previously invalid documents using the newly-allowed name characters in, for example, ID attributes, are now valid.
This edition supersedes the previous W3C Recommendation of 16 August 2006.
Please report errors in this document to the public xml-editor@w3.org mail list; public archives are available. For the convenience of readers, an XHTML version with color-coded revision indicators is also provided; this version highlights each change due to an erratum published in the errata list for the previous edition, together with a link to the particular erratum in that list. Most of the errata in the list provide a rationale for the change. The errata list for this fifth edition is available at http://www.w3.org/XML/xml-V10-5e-errata.
An implementation report is available at http://www.w3.org/XML/2008/01/xml10-5e-implementation.html. A Test Suite is maintained to help assess conformance to this specification.
This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.
W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
1.1 Origin and Goals
1.2 Terminology
2 Documents
2.1 Well-Formed XML Documents
2.2 Characters
2.3 Common Syntactic Constructs
2.4 Character Data and Markup
2.5 Comments
2.6 Processing Instructions
2.7 CDATA Sections
2.8 Prolog and Document Type Declaration
2.9 Standalone Document Declaration
2.10 White Space Handling
2.11 End-of-Line Handling
2.12 Language Identification
3 Logical Structures
3.1 Start-Tags, End-Tags, and Empty-Element Tags
3.2 Element Type Declarations
3.2.1 Element Content
3.2.2 Mixed Content
3.3 Attribute-List Declarations
3.3.1 Attribute Types
3.3.2 Attribute Defaults
3.3.3 Attribute-Value Normalization
3.4 Conditional Sections
4 Physical Structures
4.1 Character and Entity References
4.2 Entity Declarations
4.2.1 Internal Entities
4.2.2 External Entities
4.3 Parsed Entities
4.3.1 The Text Declaration
4.3.2 Well-Formed Parsed Entities
4.3.3 Character Encoding in Entities
4.4 XML Processor Treatment of Entities and References
4.4.1 Not Recognized
4.4.2 Included
4.4.3 Included If Validating
4.4.4 Forbidden
4.4.5 Included in Literal
4.4.6 Notify
4.4.7 Bypassed
4.4.8 Included as PE
4.4.9 Error
4.5 Construction of Entity Replacement Text
4.6 Predefined Entities
4.7 Notation Declarations
4.8 Document Entity
5 Conformance
5.1 Validating and Non-Validating Processors
5.2 Using XML Processors
6 Notation
A References
A.1 Normative References
A.2 Other References
B Character Classes
C XML and SGML (Non-Normative)
D Expansion of Entity and Character References (Non-Normative)
E Deterministic Content Models (Non-Normative)
F Autodetection of Character Encodings (Non-Normative)
F.1 Detection Without External Encoding Information
F.2 Priorities in the Presence of External Encoding Information
G W3C XML Working Group (Non-Normative)
H W3C XML Core Working Group (Non-Normative)
I Production Notes (Non-Normative)
J Suggestions for XML Names (Non-Normative)
Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
[Definition: A software module called an XML processor is used to read XML documents and provide access to their content and structure.] [Definition: It is assumed that an XML processor is doing its work on behalf of another module, called the application.] This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.
XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the Working Group's contact with the W3C.
The design goals for XML are:
XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO/IEC 10646] for characters, Internet BCP 47 [IETF BCP 47] and the Language Subtag Registry [IANA-LANGCODES] for language identification tags), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.
This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.
The terminology used to describe XML documents is defined in the body of this specification. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when EMPHASIZED, are to be interpreted as described in [IETF RFC 2119]. In addition, the terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:
error
[Definition: A violation of the rules of this specification; results are undefined. Unless otherwise specified, failure to observe a prescription of this specification indicated by one of the keywords MUST, REQUIRED, MUST NOT, SHALL and SHALL NOT is an error. Conforming software MAY detect and report an error and MAY recover from it.]
fatal error
[Definition: An error which a conforming XML processor MUST detect and report to the application. After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way).]
at user option
[Definition: Conforming software MAY or MUST (depending on the modal verb in the sentence) behave as described; if it does, it MUST provide users a means to enable or disable the behavior described.]
validity constraint
[Definition: A rule which applies to all valid XML documents. Violations of validity constraints are errors; they MUST, at user option, be reported by validating XML processors.]
well-formedness constraint
[Definition: A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are fatal errors.]
match
[Definition: (Of strings or names:) Two strings or names being compared are identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production. (Of content and content models:) An element matches its declaration when it conforms in the fashion described in the constraint [VC: Element Valid].]
for compatibility
[Definition: Marks a sentence describing a feature of XML included solely to ensure that XML remains compatible with SGML.]
for interoperability
[Definition: Marks a sentence describing a non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the WebSGML Adaptations Annex to ISO 8879.]
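The string sense of "match" in the list above can be illustrated with a small Python sketch. It relies on the fact that Python compares `str` values code point by code point, with no normalization and no case folding. The helper name `xml_match` is hypothetical and is not defined by this specification.

```python
import unicodedata


def xml_match(a: str, b: str) -> bool:
    """Hypothetical helper: two strings match only if they are identical.

    Python's == on str compares code points directly, so precomposed and
    base+diacritic forms of the same character compare as different.
    """
    return a == b


precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Same visible character, different representations.
print(xml_match(precomposed, decomposed))
# No case folding is performed.
print(xml_match("Name", "name"))
# After explicit NFC normalization (done by the application, not by XML),
# the two representations become identical.
print(xml_match(unicodedata.normalize("NFC", decomposed), precomposed))
```

An application that wants normalization-insensitive comparison has to normalize before comparing; nothing in the definition above does so on its behalf.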
[Definition: A data object is an XML document if it is well-formed, as defined in this specification. In addition, the XML document is valid if it meets certain further constraints.]
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures MUST nest properly, as described in 4.3.2 Well-Formed Parsed Entities.
[Definition: A textual object is a well-formed XML document if:]
Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification.
Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.
[1] | document | ::= | prolog element Misc* |
Matching the document production implies that:
It contains one or more elements.
[Definition: There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.] For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.
[Definition: As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.]
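As an informal illustration of the nesting and single-root requirements above, the following sketch asks Python's standard-library parser (`xml.etree.ElementTree`, a non-validating processor) whether a string is accepted. The function name `is_well_formed` is hypothetical, and the result reflects that particular parser's behavior rather than a normative conformance test.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Hypothetical helper: True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser reports the problem and stops normal processing.
        return False
    return True


samples = [
    "<greeting>Hello, world!</greeting>",   # one root, properly nested
    "<a><b></a></b>",                       # start- and end-tags overlap
    "<a/><b/>",                             # more than one root element
]
for sample in samples:
    print(sample, is_well_formed(sample))
```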
[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]
[2] | Char | ::= | #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] | /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ |
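The Char production translates directly into a range check on code points. The following Python sketch is one way to write it; the function names `is_xml_char` and `all_chars_legal` are hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp is in the ranges of production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def all_chars_legal(text: str) -> bool:
    """Return True if every character of text matches Char."""
    return all(is_xml_char(ord(ch)) for ch in text)


print(all_chars_legal("Hello,\tworld!\n"))   # tab and line feed are legal
print(all_chars_legal("form\x0cfeed"))       # #xC (form feed) is not in Char
```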
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.
Note:
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
This section defines some symbols used widely in the grammar.
S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
[3] | S | ::= | (#x20 | #x9 | #xD | #xA)+ |
Note:
The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
An Nmtoken (name token) is any mixture of name characters.
[Definition: A Name is an Nmtoken with a restricted set of initial characters.] Disallowed initial characters for Names include digits, diacritics, the full stop and the hyphen.
Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.
Note:
The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.
The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.
Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.
The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.
[4] | NameStartChar | ::= | ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] |
[4a] | NameChar | ::= | NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] |
[5] | Name | ::= | NameStartChar (NameChar)* |
[6] | Names | ::= | Name (#x20 Name)* |
[7] | Nmtoken | ::= | (NameChar)+ |
[8] | Nmtokens | ::= | Nmtoken (#x20 Nmtoken)* |
Note:
The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see 3.3.1 Attribute Types).
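Productions [4] through [7] can be checked mechanically with range tables. The Python sketch below transcribes the ranges from the productions above; the function and constant names are hypothetical, and the code does not treat names beginning with "xml" specially.

```python
# Ranges from production [4] NameStartChar, as (first, last) code points.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional ranges from production [4a] NameChar: "-", ".", [0-9], #xB7, ...
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(first <= cp <= last for first, last in ranges)


def is_name_start_char(ch: str) -> bool:
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_name(s: str) -> bool:
    """Production [5]: NameStartChar (NameChar)*"""
    return bool(s) and is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])


def is_nmtoken(s: str) -> bool:
    """Production [7]: (NameChar)+"""
    return bool(s) and all(is_name_char(c) for c in s)


for candidate in ["xml:lang", "1abc", "-a", "a b", "\u037e"]:
    print(repr(candidate), is_name(candidate), is_nmtoken(candidate))
```

A string such as "1abc" is an Nmtoken but not a Name, which is the distinction the definition of Name draws.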
Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.
[9] | EntityValue | ::= | '"' ([^%&"] | PEReference | Reference)* '"' | "'" ([^%&'] | PEReference | Reference)* "'" |
[10] | AttValue | ::= | '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" |
[11] | SystemLiteral | ::= | ('"' [^"]* '"') | ("'" [^']* "'") |
[12] | PubidLiteral | ::= | '"' PubidChar* '"' | "'" (PubidChar - "'")* "'" |
[13] | PubidChar | ::= | #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%] |
Note:
Although the EntityValue production allows the definition of a general entity consisting of a single explicit < in the literal (e.g., <!ENTITY mylt "<">), it is strongly advised to avoid this practice since any reference to that entity will cause a well-formedness error.
Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]
[Definition: All text that is not markup constitutes the character data of the document.]
The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
[14] | CharData | ::= | [^<&]* - ([^<&]* ']]>' [^<&]*) |
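The escaping rules of this section can be sketched as two small Python functions. The names `escape_char_data` and `escape_attr_value` are hypothetical; the second always delimits the value with double quotes, which is one of several valid choices under the AttValue production.

```python
def escape_char_data(text: str) -> str:
    """Escape text for use as element content (production [14] CharData)."""
    # The ampersand must be replaced first so that the "&" introduced by
    # the later replacements is not escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    # ">" only has to be escaped when it would complete the string "]]>".
    return text.replace("]]>", "]]&gt;")


def escape_attr_value(text: str) -> str:
    """Escape text and wrap it in double quotes (production [10] AttValue)."""
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    # Single quotes may stay literal because the delimiter is a double quote.
    return '"' + text.replace('"', "&quot;") + '"'


print(escape_char_data("a < b && c]]>"))
print(escape_attr_value("say \"hi\" & 'bye'"))
```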
[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments.] Parameter entity references MUST NOT be recognized within comments.
[15] | Comment | ::= | '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->' |
An example of a comment:
<!-- declarations for <head> & <body> -->
Note that the grammar does not allow a comment ending in --->. The following example is not well-formed.
<!-- B+, B, or B--->
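Read closely, production [15] forbids two things in the text between the comment delimiters: the string "--" anywhere, and a hyphen as the final character. A minimal Python sketch of that check follows; the function name `is_valid_comment_body` is hypothetical, and the sketch does not verify that each character matches Char.

```python
def is_valid_comment_body(body: str) -> bool:
    """Check the text between '<!--' and '-->' against production [15].

    Every '-' must be followed by a non-hyphen character, so the body may
    neither contain '--' nor end with '-'.
    """
    return "--" not in body and not body.endswith("-")


print(is_valid_comment_body(" declarations for <head> & <body> "))
print(is_valid_comment_body(" B+, B, or B-"))   # would produce "--->"
```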
[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]
[16] | PI | ::= | '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>' |
[17] | PITarget | ::= | Name - (('X' | 'x') ('M' | 'm') ('L' | 'l')) |
PIs are not part of the document's character data, but MUST be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references MUST NOT be recognized within processing instructions.
[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]
[18] | CDSect | ::= | CDStart CData CDEnd |
[19] | CDStart | ::= | '<![CDATA[' |
[20] | CData | ::= | (Char* - (Char* ']]>' Char*)) |
[21] | CDEnd | ::= | ']]>' |
Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.
An example of a CDATA section, in which "<greeting>" and "</greeting>" are recognized as character data, not markup:
<![CDATA[<greeting>Hello, world!</greeting>]]>
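Because the CData production excludes "]]>" and CDATA sections cannot nest, text that itself contains "]]>" cannot go into a single section. A common workaround, which is not prescribed by this specification, is to split the text across two adjacent sections so that "]]" ends one and ">" begins the next. The Python sketch below assumes that approach; the function name `to_cdata` is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in one or more CDATA sections.

    Each occurrence of "]]>" is split so that "]]" closes one section and
    ">" opens the next; no single section then contains the CDEnd string.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


wrapped = to_cdata("a]]>b")
print(wrapped)

# Round trip through the standard-library parser: the application sees
# the original character data again.
print(ET.fromstring("<r>" + wrapped + "</r>").text)
```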
[Definition: XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.] For example, the following is a complete XML document, well-formed but not valid:
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>
and so is this:
<greeting>Hello, world!</greeting>
The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute name-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. [Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.]
The document type declaration MUST appear before the first element in the document.
[22] | prolog | ::= | XMLDecl? Misc* (doctypedecl Misc*)? |
[23] | XMLDecl | ::= | '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>' |
[24] | VersionInfo | ::= | S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"') |
[25] | Eq | ::= | S? '=' S? |
[26] | VersionNum | ::= | '1.' [0-9]+ |
[27] | Misc | ::= | Comment | PI | S |
Even though the VersionNum production matches any version number of the form '1.x', XML 1.0 documents SHOULD NOT specify a version number other than '1.0'.
Note:
When an XML 1.0 processor encounters a document that specifies a 1.x version number other than '1.0', it will process it as a 1.0 document. This means that an XML 1.0 processor will accept 1.x documents provided they do not use any non-1.0 features.
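The XMLDecl production is regular, so it can be sketched as a single regular expression. The Python example below follows productions [23] through [26] and [32]; the encoding-name pattern is taken from the EncName production in 4.3.3 Character Encoding in Entities, which is outside this part of the text, and the names `XML_DECL` and `parse_xml_decl` are hypothetical. It only recognizes a declaration at the very start of a string and is not a substitute for a real parser.

```python
import re

S = r"[ \t\r\n]+"                 # production [3] S
EQ = r"[ \t\r\n]*=[ \t\r\n]*"     # production [25] Eq

XML_DECL = re.compile(
    r"<\?xml"
    # [24] VersionInfo, with [26] VersionNum inside matching quotes
    + S + r"version" + EQ + r"(?P<q1>['\"])(?P<version>1\.[0-9]+)(?P=q1)"
    # EncodingDecl (optional); name pattern assumed from production EncName
    + r"(?:" + S + r"encoding" + EQ
    + r"(?P<q2>['\"])(?P<encoding>[A-Za-z][A-Za-z0-9._-]*)(?P=q2))?"
    # [32] SDDecl (optional)
    + r"(?:" + S + r"standalone" + EQ
    + r"(?P<q3>['\"])(?P<standalone>yes|no)(?P=q3))?"
    + r"[ \t\r\n]*\?>"
)


def parse_xml_decl(text: str):
    """Return the pseudo-attributes of a leading XML declaration, or None."""
    m = XML_DECL.match(text)
    if m is None:
        return None
    return {key: m.group(key) for key in ("version", "encoding", "standalone")}


print(parse_xml_decl("<?xml version=\"1.0\" standalone='yes'?>"))
print(parse_xml_decl('<?xml version="2.0"?>'))
```

The fixed order of version, encoding and standalone in the pattern mirrors the order required by production [23].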
[Definition: The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.]
[Definition: A markup declaration is an element type declaration, an attribute-list declaration, an entity declaration, or a notation declaration.] These declarations may be contained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For further information, see 4 Physical Structures.
[28] | doctypedecl | ::= | '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' | [VC: Root Element Type] [WFC: External Subset] |
[28a] | DeclSep | ::= | PEReference | S | [WFC: PE Between Declarations] |
[28b] | intSubset | ::= | (markupdecl | DeclSep)* |
[29] | markupdecl | ::= | elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment | [VC: Proper Declaration/PE Nesting] [WFC: PEs in Internal Subset] |
Note that it is possible to construct a well-formed document containing a doctypedecl that neither points to an external subset nor contains an internal subset.
The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.
Parameter entity references are recognized anywhere in the DTD (internal and external subsets and external parameter entities), except in literals, processing instructions, comments, and the contents of ignored conditional sections (see 3.4 Conditional Sections). They are also recognized in entity value literals. The use of parameter entities in the internal subset is restricted as described below.
Validity constraint: Root Element Type
The Name in the document type declaration MUST match the element type of the root element.
Validity constraint: Proper Declaration/PE Nesting
Parameter-entity replacement text MUST be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both MUST be contained in the same replacement text.
Well-formedness constraint: PEs in Internal Subset
In the internal DTD subset, parameter-entity references MUST NOT occur within markup declarations; they may occur where markup declarations can occur. (This does not apply to references that occur in external parameter entities or to the external subset.)
Well-formedness constraint: External Subset
The external subset, if any, MUST match the production for extSubset.
Well-formedness constraint: PE Between Declarations
The replacement text of a parameter entity reference in a DeclSep MUST match the production extSubsetDecl.
Like the internal subset, the external subset and any external parameter entities referenced in a DeclSep MUST consist of a series of complete markup declarations of the types allowed by the non-terminal symbol markupdecl, interspersed with white space or parameter-entity references. However, portions of the contents of the external subset or of these external parameter entities may conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset but is allowed in external parameter entities referenced in the internal subset.
[30] | extSubset | ::= | TextDecl? extSubsetDecl |
[31] | extSubsetDecl | ::= | (markupdecl | conditionalSect | DeclSep)* |
The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references are permitted within markup declarations, not only between markup declarations.
An example of an XML document with a document type declaration:
<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>
The system identifier "hello.dtd" gives the address (a URI reference) of a DTD for the document.
The declarations can also be given locally, as in this example:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>
If both the external and internal subsets are used, the internal subset MUST be considered to occur before the external subset. This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.
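The Root Element Type validity constraint in this section is simple enough to check by hand even with a non-validating processor. The sketch below uses Python's `xml.dom.minidom`, which exposes the document type declaration as `doctype`; the function name `root_matches_doctype` is hypothetical, and the check covers this one constraint only, not validity in general.

```python
from xml.dom import minidom


def root_matches_doctype(text: str):
    """Check [VC: Root Element Type] for a well-formed document.

    Returns None when the document has no document type declaration,
    since the constraint then has nothing to compare against.
    """
    doc = minidom.parseString(text)
    if doc.doctype is None:
        return None
    return doc.doctype.name == doc.documentElement.tagName


good = (
    "<!DOCTYPE greeting [ <!ELEMENT greeting (#PCDATA)> ]>"
    "<greeting>Hello, world!</greeting>"
)
# Well-formed, but the root element type differs from the declared Name.
bad = (
    "<!DOCTYPE greeting [ <!ELEMENT greeting (#PCDATA)> ]>"
    "<salute>Hello, world!</salute>"
)
print(root_matches_doctype(good))
print(root_matches_doctype(bad))
```

The second document is still accepted by the parser, which illustrates that a violation of a validity constraint is an error rather than a fatal error.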
Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity or in parameter entities. [Definition: An external markup declaration is defined as a markup declaration occurring in the external subset or in a parameter entity (external or internal, the latter being included because non-validating processors are not required to read them).]
[32] | SDDecl | ::= | S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"')) | [VC: Standalone Document Declaration] |
In a standalone document declaration, the value "yes" indicates that there are no external markup declarations which affect the information passed from the XML processor to the application. The value "no" indicates that there are or may be such external markup declarations. Note that the standalone document declaration only denotes the presence of external declarations; the presence, in a document, of references to external entities, when those entities are internally declared, does not change its standalone status.
If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "no" is assumed.
Any XML document for which standalone="no" holds can be converted algorithmically to a standalone document, which may be desirable for some network delivery applications.
Validity constraint: Standalone Document Declaration
The standalone document declaration MUST have the value "no" if any external markup declarations contain declarations of:
attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or
entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or
attributes with tokenized types, where the attribute appears in the document with a value such that normalization will produce a different value from that which would be produced in the absence of the declaration, or
element types with element content, if white space occurs directly within any instance of those types.
An example XML declaration with a standalone document declaration:
<?xml version="1.0" standalone='yes'?>
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, MUST be declared if it is used. When declared, it MUST be given as an enumerated type whose values are one or both of "default" and "preserve". For example:
<!ATTLIST poem xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>
The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute. This specification does not give meaning to any value of xml:space other than "default" and "preserve". It is an error for other values to be specified; the XML processor MAY report the error or MAY recover by ignoring the attribute specification or by reporting the (erroneous) value to the application. Applications may ignore or reject erroneous values.
The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
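The translation described above can be sketched in a few lines of Python. This is only an illustration of the rule; the function name `normalize_line_breaks` is a hypothetical helper, not something defined by this specification.

```python
import re

def normalize_line_breaks(text: str) -> str:
    """Translate each #xD #xA pair, and each #xD not followed by #xA,
    to a single #xA character."""
    # "\r\n?" matches a carriage return plus an optional following line feed,
    # so the two-character sequence is consumed as one unit.
    return re.sub(r"\r\n?", "\n", text)

sample = "line one\r\nline two\rline three\n"
normalized = normalize_line_breaks(sample)
```

A lone #xA is left untouched, so the function can be applied to input that is already normalized without changing it.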
In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, MUST be declared if it is used. The values of the attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages; in addition, the empty string may be specified.
(Productions 33 through 38 have been removed.)
For example:
<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>
  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang. In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors. Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang.
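The inheritance and override behaviour described above could be resolved by an application roughly as follows. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper `effective_languages` and the sample element names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the predeclared "xml:" prefix in expanded-name form.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(root):
    """Map every element to the language in effect for it, or None when
    no language information is available."""
    result = {}

    def visit(elem, inherited):
        # An element without xml:lang inherits from its parent.
        lang = elem.get(XML_LANG, inherited)
        # The empty value overrides an enclosing specification without
        # naming another language.
        if lang == "":
            lang = None
        result[elem] = lang
        for child in elem:
            visit(child, lang)

    visit(root, None)
    return result

doc = ET.fromstring(
    '<A xml:lang="de"><l>Habe nun, ach! Philosophie,</l>'
    '<B xml:lang=""><note>no language information here</note></B></A>'
)
langs = {elem.tag: lang for elem, lang in effective_languages(doc).items()}
```

Here `langs` records "de" for A and l, and None for B and note, mirroring the A/B case in the paragraph above.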
Note:
Language information may also be provided by external transport protocols (e.g. HTTP or MIME). When available, this information may be used by XML applications, but the more local information provided by xml:lang should be considered to override it.
A simple declaration for xml:lang might take the form
xml:lang CDATA #IMPLIED
but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:
<!ATTLIST poem xml:lang CDATA 'fr'>
<!ATTLIST gloss xml:lang CDATA 'en'>
<!ATTLIST note xml:lang CDATA 'en'>
[Definition: Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications.] Each attribute specification has a name and a value.
[39] element ::= EmptyElemTag
    | STag content ETag   [WFC: Element Type Match] [VC: Element Valid]
This specification does not constrain the application semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.
Well-formedness constraint: Element Type Match
The Name in an element's end-tag MUST match the element type in the start-tag.
Validity constraint: Element Valid
An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:
The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).
The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space, comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.
The declaration matches Mixed, and the content (after replacing any entity references with their replacement text) consists of character data (including CDATA sections), comments, PIs and child elements whose types match names in the content model.
The declaration matches ANY, and the content (after replacing any entity references with their replacement text) consists of character data, CDATA sections, comments, PIs and child elements whose types have been declared.
[Definition: The beginning of every non-empty XML element is marked by a start-tag.]
[40] STag ::= '<' Name (S Attribute)* S? '>'   [WFC: Unique Att Spec]
[41] Attribute ::= Name Eq AttValue   [VC: Attribute Value Type] [WFC: No External Entity References] [WFC: No < in Attribute Values]
The Name in the start- and end-tags gives the element's type. [Definition: The Name-AttValue pairs are referred to as the attribute specifications of the element], [Definition: with the Name in each pair referred to as the attribute name] and [Definition: the content of the AttValue (the text between the ' or " delimiters) as the attribute value.] Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.
Well-formedness constraint: Unique Att Spec
An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag.
Validity constraint: Attribute Value Type
The attribute MUST have been declared; the value MUST be of the type declared for it. (For attribute types, see 3.3 Attribute-List Declarations.)
Well-formedness constraint: No External Entity References
Attribute values MUST NOT contain direct or indirect entity references to external entities.
Well-formedness constraint: No < in Attribute Values
The replacement text of any entity referred to directly or indirectly in an attribute value MUST NOT contain a <.
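As an illustration of how a processor surfaces these well-formedness constraints, the sketch below feeds a few start-tags to the Expat-based parser behind Python's `xml.etree.ElementTree`. The helper name `well_formedness_error` is hypothetical, and the exact wording of the messages is left to the parser.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(text):
    """Return the parser's error message, or None if the text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None

# Violates Unique Att Spec: the attribute name "term" appears twice.
duplicate = well_formedness_error('<termdef term="dog" term="cat"/>')

# Violates No < in Attribute Values: a literal "<" inside the value.
bare_lt = well_formedness_error('<termdef term="a<b"/>')

# Acceptable: the "<" is written as a reference to the predefined entity lt.
escaped = well_formedness_error('<termdef term="a&lt;b"/>')
```

With this parser, `duplicate` and `bare_lt` hold error messages while `escaped` is None.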
An example of a start-tag:
<termdef term="dog">
[Definition: The end of every element that begins with a start-tag MUST be marked by an end-tag containing a name that echoes the element's type as given in the start-tag:]
[42] ETag ::= '</' Name S? '>'
An example of an end-tag:
</termdef>
[Definition: The text between the start-tag and end-tag is called the element's content:]
[43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag. [Definition: An empty-element tag takes a special form:]
[44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'   [WFC: Unique Att Spec]
Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag SHOULD be used, and SHOULD only be used, for elements which are declared EMPTY.
Examples of empty elements:
<IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />
<br></br>
<br/>
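The two representations of an empty element are indistinguishable once parsed. The following sketch, which assumes Python's standard `xml.etree.ElementTree` and a hypothetical helper `is_empty`, shows one way an application might observe that.

```python
import xml.etree.ElementTree as ET

def is_empty(elem) -> bool:
    """True if the element has no content: no character data, no children."""
    return len(elem) == 0 and not elem.text

pair = ET.fromstring("<br></br>")     # start-tag immediately followed by end-tag
single = ET.fromstring("<br/>")       # empty-element tag
spaced = ET.fromstring("<br> </br>")  # contains white space, so not empty
```

Both `pair` and `single` are reported as empty, whereas `spaced` is not, because white space between the tags counts as content.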
The element structure of an XML document may, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content.
Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.
[Definition: An element type declaration takes the form:]
[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'   [VC: Unique Element Type Declaration]
[46] contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
where the Name gives the element type being declared.
Validity constraint: Unique Element Type Declaration
An element type MUST NOT be declared more than once.
Examples of element type declarations:
<!ELEMENT br EMPTY>
<!ELEMENT p (#PCDATA|emph)* >
<!ELEMENT %name.para; %content.para; >
<!ELEMENT container ANY>
[Definition: An element type has element content when elements of that type MUST contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal S).] [Definition: In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear.] The grammar is built on content particles (cps), which consist of names, choice lists of content particles, or sequence lists of content particles:
[47] children ::= (choice | seq) ('?' | '*' | '+')?
[48] cp ::= (Name | choice | seq) ('?' | '*' | '+')?
[49] choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')'   [VC: Proper Group/PE Nesting]
[50] seq ::= '(' S? cp ( S? ',' S? cp )* S? ')'   [VC: Proper Group/PE Nesting]
where each Name is the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar; content particles occurring in a sequence list MUST each appear in the element content in the order given in the list. The optional character following a name or list governs whether the element or the content particles in the list may occur one or more (+), zero or more (*), or zero or one times (?). The absence of such an operator means that the element or content particle MUST appear exactly once. This syntax and meaning are identical to those used in the productions in this specification.
The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model. For more information, see E Deterministic Content Models.
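Because a content model is a regular expression over element type names, matching can be sketched by translating the model into an ordinary regular expression. The Python below is a simplified illustration under several assumptions: the function names are hypothetical, names are recognized by a rough pattern rather than the full Name production, parameter-entity references are not handled, and the determinism requirement mentioned above is not checked.

```python
import re

# Rough tokenizer: a name, or one of the content-model operators.
_TOKEN = re.compile(r"[A-Za-z_:][\w.:-]*|[()|,?*+]")

def model_to_regex(model: str):
    """Translate an element-content model into a compiled regular
    expression over a space-terminated list of child element names."""
    parts = []
    for tok in _TOKEN.findall(model):
        if tok == "(":
            parts.append("(?:")
        elif tok == ",":
            continue  # a sequence is plain concatenation
        elif tok in ")|?*+":
            parts.append(tok)
        else:
            # Each name is followed by one space so that "p" cannot
            # accidentally match the start of "para".
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def children_match(model: str, child_names) -> bool:
    subject = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(subject) is not None

ok = children_match("(front, body, back?)", ["front", "body"])
wrong_order = children_match("(front, body, back?)", ["body", "front"])
```

In this sketch `ok` is True and `wrong_order` is False. A validating processor would additionally reject non-deterministic models, which a backtracking regular-expression engine silently accepts.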
Validity constraint: Proper Group/PE Nesting
Parameter-entity replacement text MUST be properly nested with parenthesized groups. That is to say, if either of the opening or closing parentheses in a choice, seq, or Mixed construct is contained in the replacement text for a parameter entity, both MUST be contained in the same replacement text.
For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text SHOULD contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text SHOULD be a connector (| or ,).
Examples of element-content models:
<!ELEMENT spec (front, body, back?)>
<!ELEMENT div1 (head, (p | list | note)*, div2*)>
<!ELEMENT dictionary-body (%div.mix; | %dict.mix;)*>
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.] In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:
[51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
    | '(' S? '#PCDATA' S? ')'   [VC: Proper Group/PE Nesting] [VC: No Duplicate Types]
where the Names give the types of elements that may appear as children. The keyword #PCDATA derives historically from the term "parsed character data."
Validity constraint: No Duplicate Types
The same name MUST NOT appear more than once in a single mixed-content declaration.
Examples of mixed content declarations:
<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
<!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
<!ELEMENT b (#PCDATA)>
Attributes are used to associate name-value pairs with elements. Attribute specifications MUST NOT appear outside of start-tags and empty-element tags; thus, the productions used to recognize them appear in 3.1 Start-Tags, End-Tags, and Empty-Element Tags. Attribute-list declarations may be used:
To define the set of attributes pertaining to a given element type.
To establish type constraints for these attributes.
To provide default values for attributes.
[Definition: Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:]
[52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'
[53] AttDef ::= S Name S AttType S DefaultDecl
The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor MAY issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.
When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration. For interoperability, an XML processor MAY at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
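The merging rule can be pictured with a small data-structure sketch. The tuple layout and the name `merge_attlists` below are assumptions chosen for illustration; a real processor would build these records while parsing the DTD.

```python
def merge_attlists(declarations):
    """Merge attribute-list declarations in document order.

    `declarations` is an iterable of (element_type, attdefs) pairs, where
    attdefs is a list of (attribute_name, attribute_type, default) tuples.
    """
    merged = {}
    for element_type, attdefs in declarations:
        attrs = merged.setdefault(element_type, {})
        for name, att_type, default in attdefs:
            # setdefault keeps the first definition; later ones are ignored.
            attrs.setdefault(name, (att_type, default))
    return merged

declarations = [
    ("list", [("type", "(bullets|ordered|glossary)", '"ordered"')]),
    ("list", [("type", "CDATA", "#IMPLIED"), ("id", "ID", "#REQUIRED")]),
]
merged = merge_attlists(declarations)
```

After merging, the list element type has both a type and an id attribute, and type keeps the enumerated definition from the first declaration.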
XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types are more constrained. The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3.3 Attribute-Value Normalization.
[54] AttType ::= StringType | TokenizedType | EnumeratedType
[55] StringType ::= 'CDATA'
[56] TokenizedType ::= 'ID'   [VC: ID] [VC: One ID per Element Type] [VC: ID Attribute Default]
    | 'IDREF'   [VC: IDREF]
    | 'IDREFS'   [VC: IDREF]
    | 'ENTITY'   [VC: Entity Name]
    | 'ENTITIES'   [VC: Entity Name]
    | 'NMTOKEN'   [VC: Name Token]
    | 'NMTOKENS'   [VC: Name Token]
Validity constraint: ID
Values of type ID MUST match the Name production. A name MUST NOT appear more than once in an XML document as a value of this type; i.e., ID values MUST uniquely identify the elements which bear them.
Validity constraint: One ID per Element Type
An element type MUST NOT have more than one ID attribute specified.
Validity constraint: ID Attribute Default
An ID attribute MUST have a declared default of #IMPLIED or #REQUIRED.
Validity constraint: IDREF
Values of type IDREF MUST match the Name production, and values of type IDREFS MUST match Names; each Name MUST match the value of an ID attribute on some element in the XML document; i.e. IDREF values MUST match the value of some ID attribute.
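The uniqueness and cross-reference checks above might be sketched as follows. Python's `xml.etree.ElementTree` does not expose declared attribute types, so this sketch assumes the caller already knows which attribute names were declared ID, IDREF and IDREFS; the names `id`, `ref` and `refs` and the helper `check_ids` are purely hypothetical. The sketch does not verify that values match the Name production.

```python
import xml.etree.ElementTree as ET

def check_ids(root, id_attr="id", idref_attrs=("ref",), idrefs_attrs=("refs",)):
    """Return a list of messages for duplicate IDs and dangling IDREFs."""
    errors = []
    seen = set()
    # First pass: every ID value must be unique within the document.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                errors.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Second pass: every IDREF, and every token of an IDREFS value,
    # must match some ID.
    for elem in root.iter():
        refs = [elem.attrib[name] for name in idref_attrs if name in elem.attrib]
        for name in idrefs_attrs:
            refs.extend(elem.get(name, "").split())
        for ref in refs:
            if ref not in seen:
                errors.append(f"IDREF {ref!r} matches no ID")
    return errors

doc = ET.fromstring('<doc><t id="a"/><t id="a"/><x ref="a"/><x refs="a b"/></doc>')
problems = check_ids(doc)
```

Two passes are used because an IDREF may legitimately point forward to an ID that appears later in the document.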
Validity constraint: Entity Name
Values of type ENTITY MUST match the Name production, values of type ENTITIES MUST match Names; each Name MUST match the name of an unparsed entity declared in the DTD.
Validity constraint: Name Token
Values of type NMTOKEN MUST match the Nmtoken production; values of type NMTOKENS MUST match Nmtokens.
[Definition: Enumerated attributes have a list of allowed values in their declaration]. They MUST take one of those values. There are two kinds of enumerated attribute types:
[57] EnumeratedType ::= NotationType | Enumeration
[58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')'   [VC: Notation Attributes] [VC: One Notation Per Element Type] [VC: No Notation on Empty Element] [VC: No Duplicate Tokens]
[59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')'   [VC: Enumeration] [VC: No Duplicate Tokens]
A NOTATION attribute identifies a notation, declared in the DTD with associated system and/or public identifiers, to be used in interpreting the element to which the attribute is attached.
Validity constraint: Notation Attributes
Values of this type MUST match one of the notation names included in the declaration; all notation names in the declaration MUST be declared.
Validity constraint: One Notation Per Element Type
An element type MUST NOT have more than one NOTATION attribute specified.
Validity constraint: No Notation on Empty Element
For compatibility, an attribute of type NOTATION MUST NOT be declared on an element declared EMPTY.
Validity constraint: No Duplicate Tokens
The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, MUST all be distinct.
Validity constraint: Enumeration
Values of this type MUST match one of the Nmtoken tokens in the declaration.
For interoperability, the same Nmtoken SHOULD NOT occur more than once in the enumerated attribute types of a single element type.
An attribute declaration provides information on whether the attribute's presence is REQUIRED, and if not, how an XML processor is to react if a declared attribute is absent in a document.
[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED'
    | (('#FIXED' S)? AttValue)   [VC: Required Attribute] [VC: Attribute Default Value Syntactically Correct] [WFC: No < in Attribute Values] [VC: Fixed Attribute Default] [WFC: No External Entity References]
In an attribute declaration, #REQUIRED means that the attribute MUST always be provided, #IMPLIED that no default value is provided. [Definition: If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute MUST always have the default value. When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it MUST report the attribute with the declared default value to the application.]
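The reporting of declared defaults can be observed with a non-validating parser that reads the internal subset. The sketch below assumes Python's standard `xml.etree.ElementTree` (built on Expat); the helper `list_type` and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE list [
  <!ATTLIST list type (bullets|ordered|glossary) "ordered">
]>
"""

def list_type(instance: str):
    """Parse a document consisting of the DTD above plus `instance`,
    and return the value reported for the root's type attribute."""
    return ET.fromstring(DTD + instance).get("type")

defaulted = list_type("<list><item/></list>")                  # attribute omitted
explicit = list_type('<list type="bullets"><item/></list>')    # attribute given
```

When the attribute is omitted the application sees the declared default, "ordered"; an explicit specification takes precedence.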
Validity constraint: Required Attribute
If the default declaration is the keyword #REQUIRED, then the attribute MUST be specified for all elements of the type in the attribute-list declaration.
Validity constraint: Attribute Default Value Syntactically Correct
The declared default value MUST meet the syntactic constraints of the declared attribute type. That is, the default value of an attribute:
of type IDREF or ENTITY must match the Name production;
of type IDREFS or ENTITIES must match the Names production;
of type NMTOKEN must match the Nmtoken production;
of type NMTOKENS must match the Nmtokens production;
of an enumerated type (either a NOTATION type or an enumeration) must match one of the enumerated values.
Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) will be reported by a validating parser only if an element without a specification for this attribute actually occurs.
Validity constraint: Fixed Attribute Default
If an attribute has a default value declared with the #FIXED keyword, instances of that attribute MUST match the default value.
Examples of attribute-list declarations:
<!ATTLIST termdef id ID #REQUIRED name CDATA #IMPLIED>
<!ATTLIST list type (bullets|ordered|glossary) "ordered">
<!ATTLIST form method CDATA #FIXED "POST">
Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
1. All line breaks MUST have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
2. Begin with a normalized value consisting of the empty string.
3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
   - For a character reference, append the referenced character to the normalized value.
   - For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
   - For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
   - For another character, append the character to the normalized value.
If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.
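The algorithm above can be sketched directly in Python. The function name `normalize_attribute` and the `entities` mapping (entity name to replacement text) are assumptions of this sketch, not part of the specification; undeclared entities and malformed references are not handled.

```python
import re

# Alternatives, in order: hexadecimal character reference, decimal character
# reference, entity reference, any other single character.
_ITEM = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&([^;&\s]+);|(.)", re.DOTALL)

def normalize_attribute(raw: str, entities: dict, cdata: bool = True) -> str:
    out = []

    def step3(text):
        for m in _ITEM.finditer(text):
            hex_ref, dec_ref, name, char = m.groups()
            if hex_ref is not None:
                out.append(chr(int(hex_ref, 16)))  # referenced character, as is
            elif dec_ref is not None:
                out.append(chr(int(dec_ref)))
            elif name is not None:
                step3(entities[name])              # recurse into replacement text
            elif char in "\x20\x0d\x0a\x09":
                out.append(" ")                    # literal white space -> #x20
            else:
                out.append(char)

    # Step 1: line breaks in the input are normalized to #xA first.
    step3(raw.replace("\r\n", "\n").replace("\r", "\n"))
    value = "".join(out)
    if not cdata:
        # Non-CDATA types: trim #x20 at both ends and collapse runs of #x20.
        value = re.sub(" +", " ", value.strip(" "))
    return value

# Replacement texts of three entities declared as single character references.
entities = {"d": "\r", "a": "\n", "da": "\r\n"}
as_cdata = normalize_attribute("&d;&d;A&a;&#x20;&a;B&da;", entities, cdata=True)
as_tokens = normalize_attribute("&d;&d;A&a;&#x20;&a;B&da;", entities, cdata=False)
```

Here `as_cdata` is two spaces, A, three spaces, B, two spaces, and `as_tokens` is "A B". White space reached through an entity reference becomes a space, whereas a direct character reference to #xD or #xA survives unchanged.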
All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA.
It is an error if an attribute value contains a reference to an entity for which no declaration has been read.
Following are examples of attribute normalization. Given the followingdeclarations:
<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">
the attribute specifications in the left column below would be normalized to the character sequences of the middle column if the attribute a is declared NMTOKENS and to those of the right columns if a is declared CDATA.
Attribute specification | a is NMTOKENS | a is CDATA |
---|---|---|
a="xyz" with two literal line breaks between the opening quote and xyz | x y z | #x20 #x20 x y z |
a="&d;&d;A&a;&#x20;&a;B&da;" | A #x20 B | #x20 #x20 A #x20 #x20 #x20 B #x20 #x20 |
a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;" | #xD #xD A #xA #xA B #xD #xA | #xD #xD A #xA #xA B #xD #xA |
Note that the last example is invalid (but well-formed) if a is declared to be of type NMTOKENS.
[Definition: Conditional sections are portions of the document type declaration external subset or of external parameter entities which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.]
[61] conditionalSect ::= includeSect | ignoreSect
[62] includeSect ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'   [VC: Proper Conditional Section/PE Nesting]
[63] ignoreSect ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'   [VC: Proper Conditional Section/PE Nesting]
[64] ignoreSectContents ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65] Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)
Validity constraint: Proper Conditional Section/PE Nesting
If any of the "<![", "[", or "]]>" of a conditional section is contained in the replacement text for a parameter-entity reference, all of them MUST be contained in the same replacement text.
Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space.
If the keyword of the conditional section is INCLUDE, then the contents of the conditional section MUST be processed as part of the DTD. If the keyword of the conditional section is IGNORE, then the contents of the conditional section MUST NOT be processed as part of the DTD. If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections MUST be ignored. The contents of an ignored conditional section MUST be parsed by ignoring all characters after the "[" following the keyword, except conditional section starts "<![" and ends "]]>", until the matching conditional section end is found. Parameter entity references MUST NOT be recognized in this process.
If the keyword of the conditional section is a parameter-entity reference, the parameter entity MUST be replaced by its content before the processor decides whether to include or ignore the conditional section.
An example:
<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >
<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>
[Definition: An XML document may consist of one or many storage units. These are called entities; they all have content and are all (except for the document entity and the external DTD subset) identified by entity name.] Each XML document has one entity called the document entity, which serves as the starting point for the XML processor and may contain the whole document.
Entities may be either parsed or unparsed. [Definition: The contents of a parsed entity are referred to as its replacement text; this text is considered an integral part of the document.]
[Definition: An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.]
Parsed entities are invoked by name using entity references; unparsed entities by name, given in the value of ENTITY or ENTITIES attributes.
[Definition: General entities are entities for use within the document content. In this specification, general entities are sometimes referred to with the unqualified term entity when this leads to no ambiguity.] [Definition: Parameter entities are parsed entities for use within the DTD.] These two types of entities use different forms of reference and are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities.
[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]
[66] CharRef ::= '&#' [0-9]+ ';'
    | '&#x' [0-9a-fA-F]+ ';'   [WFC: Legal Character]
Well-formedness constraint: Legal Character
Characters referred to using character references MUST match the production for Char.
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.
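Decoding a character reference and applying the Legal Character constraint might look like the sketch below. The ranges in `is_xml_char` follow production [2] Char of this specification; the function names are hypothetical, and raising `ValueError` is simply one way an implementation could signal the violation.

```python
import re

_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def is_xml_char(code_point: int) -> bool:
    """True if the code point matches production [2] Char."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )

def decode_char_ref(ref: str) -> str:
    """Return the character named by a decimal or hexadecimal reference."""
    match = _CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a character reference: {ref!r}")
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
    if not is_xml_char(code_point):
        raise ValueError(f"reference to a character that does not match Char: {ref!r}")
    return chr(code_point)

less_than_hex = decode_char_ref("&#x3C;")
less_than_dec = decode_char_ref("&#60;")
```

Both references decode to the same "<" character, while a reference such as `&#0;` is rejected because #x0 does not match Char.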
[Definition: An entity reference refers to the content of a named entity.] [Definition: References to parsed general entities use ampersand (&) and semicolon (;) as delimiters.] [Definition: Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters.]
[67] Reference ::= EntityRef | CharRef
[68] EntityRef ::= '&' Name ';'   [WFC: Entity Declared] [VC: Entity Declared] [WFC: Parsed Entity] [WFC: No Recursion]
[69] PEReference ::= '%' Name ';'   [VC: Entity Declared] [WFC: No Recursion] [WFC: In DTD]
Well-formedness constraint: Entity Declared
In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with "standalone='yes'", for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference MUST match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity MUST precede any reference to it which appears in a default value in an attribute-list declaration.
Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.
Validity constraint: Entity Declared
In a document with an external subset or parameter entity references, if the document is not standalone (either "standalone='no'" is specified or there is no standalone declaration), then the Name given in the entity reference MUST match that in an entity declaration. For interoperability, valid documents SHOULD declare the entities amp, lt, gt, apos, quot, in the form specified in 4.6 Predefined Entities. The declaration of a parameter entity MUST precede any reference to it. Similarly, the declaration of a general entity MUST precede any attribute-list declaration containing a default value with a direct or indirect reference to that general entity.
Well-formedness constraint: Parsed Entity
An entity reference MUST NOT contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.
Well-formedness constraint: No Recursion
A parsed entity MUST NOT contain a recursive reference to itself, either directly or indirectly.
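One way to detect a violation of the No Recursion constraint is a depth-first walk over the references found in each entity's replacement text. The sketch below assumes a plain mapping from general entity names to replacement text; the name `find_recursive_entities` and the sample entities are hypothetical.

```python
import re

# General entity references only; character references start with "#".
_ENTITY_REF = re.compile(r"&([^#;&\s][^;&\s]*);")

def find_recursive_entities(entities: dict) -> list:
    """Return the names of entities whose replacement text refers back to
    the entity itself, directly or indirectly."""

    def reaches(start, current, visited):
        for ref in _ENTITY_REF.findall(entities.get(current, "")):
            if ref == start:
                return True
            if ref not in visited:
                visited.add(ref)
                if reaches(start, ref, visited):
                    return True
        return False

    return sorted(name for name in entities if reaches(name, name, {name}))

entities = {
    "a": "see &b;",
    "b": "then &c;",
    "c": "back to &a;",
    "d": "uses &a; but is not part of the cycle",
    "ok": "plain &amp; simple",
}
recursive = find_recursive_entities(entities)
```

Entities a, b and c are reported because each can reach itself; d merely refers into the cycle and ok refers only to a predefined entity, so neither is listed.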
Well-formedness constraint: In DTD
Parameter-entity references MUST NOT appear outside the DTD.
Examples of character and entity references:
Type <key>less-than</key> (&#x3C;) to save options.
This document was prepared on &docdate; and
is classified &security-level;.
Example of a parameter-entity reference:
<!-- declare the parameter entity "ISOLat2"... -->
<!ENTITY % ISOLat2 SYSTEM "http://www.xml.com/iso/isolat2-xml.entities" >
<!-- ... now reference it. -->
%ISOLat2;
[Definition: Entities are declared thus:]
[70] | EntityDecl | ::= | GEDecl | PEDecl |
[71] | GEDecl | ::= | '<!ENTITY' S Name S EntityDef S? '>' |
[72] | PEDecl | ::= | '<!ENTITY' S '%' S Name S PEDef S? '>' |
[73] | EntityDef | ::= | EntityValue | (ExternalID NDataDecl?) |
[74] | PEDef | ::= | EntityValue | ExternalID |
The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor MAY issue a warning if entities are declared multiple times.
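The "first declaration is binding" rule can be observed with a non-validating processor. The following sketch uses Python's standard xml.etree.ElementTree module (built on Expat); the entity name greeting and the helper expanded_text are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same general entity is declared twice.
DUPLICATE_DOC = """<!DOCTYPE doc [
  <!ENTITY greeting "first">
  <!ENTITY greeting "second">
]>
<doc>&greeting;</doc>"""


def expanded_text(text):
    """Parse the document and return the root element's character data."""
    return ET.fromstring(text).text


if __name__ == "__main__":
    # The later declaration is ignored, so the first value is used.
    print(expanded_text(DUPLICATE_DOC))
```

Because the later declaration is silently ignored, an internal subset can override declarations that would otherwise come from the external subset, since the internal subset is read first.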
[Definition: If the entity definition is an EntityValue, the defined entity is called an internal entity. There is no separate physical storage object, and the content of the entity is given in the declaration.] Note that some processing of entity and character references in the literal entity value may be required to produce the correct replacement text: see 4.5 Construction of Entity Replacement Text.
An internal entity is a parsed entity.
Example of an internal entity declaration:
<!ENTITY Pub-Status "This is a pre-release of the specification.">
[Definition: If the entity is not internal, it is an external entity, declared as follows:]
[75] | ExternalID | ::= | 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral |
[76] | NDataDecl | ::= | S 'NDATA' S Name | [VC: Notation Declared] |
If the NDataDecl is present, this is a general unparsed entity; otherwise it is a parsed entity.
[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 3986]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity. Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.
System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 3986], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it MUST be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it SHOULD trigger escaping. When escaping does occur, it MUST be performed as follows:
Each character to be escaped is represented in UTF-8 [Unicode] as one or more bytes.
The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
The original character is replaced by the resulting character sequence.
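The three steps above can be sketched in a few lines of Python. This is a minimal sketch, not a normative algorithm: the function name escape_system_identifier is an assumption, and the code deliberately leaves existing "%" characters untouched, since "%" is not in the list of characters to be escaped.

```python
# Characters below #x7F that the text above lists as needing escaping,
# in addition to controls, space, and everything above #x7F.
_DELIMS_AND_UNWISE = set('<>"{}|\\^`')


def escape_system_identifier(uri):
    """Escape only the characters listed in the text, via UTF-8 and %HH."""
    out = []
    for ch in uri:
        cp = ord(ch)
        if cp <= 0x20 or cp >= 0x7F or ch in _DELIMS_AND_UNWISE:
            # Step 1: UTF-8 bytes; step 2: %HH per byte; step 3: substitute.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_system_identifier("../grafix/Open Hatch.gif"))
    print(escape_system_identifier("http://example.org/\u00e9dition.xml"))
```

In keeping with the requirement that escaping happen as late as possible, a function like this would normally be applied only at the point where the URI is handed to the component that actually retrieves the resource.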
Note:
In a future edition of this specification, the XML Core Working Group intends to replace the preceding paragraph and list of steps with a normative reference to an upcoming revision of IETF RFC 3987, which will define "Legacy Extended IRIs (LEIRIs)". When this revision is available, it is the intent of the XML Core WG to use it to replace language similar to the above in any future revisions of XML-related specifications under its purview.
[Definition: In addition to a system identifier, an external identifier may include a public identifier.] An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference. If the processor is unable to do so, it MUST use the URI reference specified in the system literal. Before a match is attempted, all strings of white space in the public identifier MUST be normalized to single space characters (#x20), and leading and trailing white space MUST be removed.
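The normalization of public identifiers described above might look as follows. This is an illustrative sketch; the function name normalize_public_id is an assumption, and the white space set used is the one defined by this specification (#x20, #x9, #xD, #xA).

```python
import re

# XML white space: space, tab, carriage return, line feed.
_XML_WHITE_SPACE_RUN = re.compile(r"[\x20\x09\x0D\x0A]+")


def normalize_public_id(pubid):
    """Collapse white space runs to one space and trim both ends."""
    return _XML_WHITE_SPACE_RUN.sub(" ", pubid).strip(" ")


if __name__ == "__main__":
    raw = "  -//Textuality//TEXT  Standard\n open-hatch boilerplate//EN "
    print(normalize_public_id(raw))
```

A catalog or entity resolver would typically apply this normalization to both the identifier found in the document and the identifiers in its own lookup table before comparing them.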
Examples of external entity declarations:
<!ENTITY open-hatch
         SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY open-hatch
         PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
         "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY hatch-pic
         SYSTEM "../grafix/OpenHatch.gif"
         NDATA gif >
External parsed entities SHOULD each begin with a text declaration.
[77] | TextDecl | ::= | '<?xml' VersionInfo? EncodingDecl S? '?>' |
The text declaration MUST be provided literally, not by reference to a parsed entity. The text declaration MUST NOT appear at any position other than the beginning of an external parsed entity. The text declaration in an external parsed entity is not considered part of its replacement text.
The document entity is well-formed if it matches the production labeled document. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. All external parameter entities are well-formed by definition.
Note:
Only parsed entities that are referenced directly or indirectly within the document are required to be well-formed.
[78] | extParsedEnt | ::= | TextDecl? content |
An internal general parsed entity is well-formed if its replacement text matches the production labeled content. All internal parameter entities are well-formed by definition.
A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.
Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.
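One way a processor might use the Byte Order Mark to tell UTF-8 from UTF-16 is sketched below. The function names sniff_bom and decode_entity are hypothetical, the sketch ignores encoding declarations and external (e.g. MIME) information entirely, and it does not try to recognize other encodings whose signatures happen to share a prefix with a UTF-16 Byte Order Mark.

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) based only on a leading Byte Order Mark."""
    if data.startswith(codecs.BOM_UTF8):
        return ("UTF-8", len(codecs.BOM_UTF8))
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return ("UTF-16", 2)
    return (None, 0)


def decode_entity(data):
    """Decode entity bytes, treating the Byte Order Mark as a signature only."""
    encoding, bom_length = sniff_bom(data)
    if encoding == "UTF-16":
        # Python's "utf-16" codec reads the byte order from the mark and drops it.
        return data.decode("utf-16")
    # With or without a UTF-8 signature, fall back to UTF-8.
    return data[bom_length:].decode("utf-8")


if __name__ == "__main__":
    print(sniff_bom(codecs.BOM_UTF8 + b"<doc/>"))
    print(decode_entity("<doc/>".encode("utf-16")))
```

Note that the mark is removed before the text reaches the rest of the processor, matching the statement that it is an encoding signature rather than part of the markup or character data.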
Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:
[80] | EncodingDecl | ::= | S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" ) |
[81] | EncName | ::= | [A-Za-z] ([A-Za-z0-9._] | '-')* | /* Encoding name contains only Latin characters */ |
In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.
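Production [81] translates directly into a regular expression. The sketch below is one possible rendering in Python; the names ENC_NAME and is_enc_name are illustrative assumptions. It checks only the syntax of an encoding name, not whether the name is registered or supported.

```python
import re

# [A-Za-z] ([A-Za-z0-9._] | '-')*  from production [81]
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(name):
    """Return True if the whole string matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("UTF-8", "Shift_JIS", "8859-1", "x-my encoding"):
        print(candidate, is_enc_name(candidate))
```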
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" SHOULD be used for the various encoded forms of JIS X-0208-1997. It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
It is a fatal error for a TextDecl to occur other than at the beginning of an external entity.
It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
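The treatment of ill-formed UTF-8 as a fatal error can be observed with an existing processor. The sketch below uses Python's standard xml.etree.ElementTree module (Expat-based); the helper name parses and the sample byte strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def parses(data):
    """Return True if the byte string is accepted, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # No Byte Order Mark and no encoding declaration: UTF-8 is assumed.
    print(parses("<doc>\u00e9</doc>".encode("utf-8")))  # legal UTF-8
    print(parses(b"<doc>\xff</doc>"))                   # #xFF never occurs in UTF-8
    print(parses(b"<doc>\xc3</doc>"))                   # truncated two-byte sequence
```

The same bytes would be acceptable if the entity carried an encoding declaration naming an encoding in which they are legal, which is why the declaration, or its absence, determines whether an error is reported.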
Examples of text declarations containing encoding declarations:
<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>
The table below summarizes the contexts in which character references, entity references, and invocations of unparsed entities might appear and the REQUIRED behavior of an XML processor in each case. The labels in the leftmost column describe the recognition context:
Reference in Content
as a reference anywhere after the start-tag and before the end-tag of an element; corresponds to the nonterminal content.
Reference in Attribute Value
as a reference within either the value of an attribute in a start-tag, or a default value in an attribute declaration; corresponds to the nonterminal AttValue.
Occurs as Attribute Value
as a Name, not a reference, appearing either as the value of an attribute which has been declared as type ENTITY, or as one of the space-separated tokens in the value of an attribute which has been declared as type ENTITIES.
Reference in EntityValue
as a reference within a parameter or internal entity's literal entity value in the entity's declaration; corresponds to the nonterminal EntityValue.
Reference in DTD
as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral, PubidLiteral, or the contents of an ignored conditional section (see 3.4 Conditional Sections).
Context | Parameter (entity) | Internal General (entity) | External Parsed General (entity) | Unparsed (entity) | Character (reference) |
Reference in Content | Not recognized | Included | Included if validating | Forbidden | Included |
Reference in Attribute Value | Not recognized | Included in literal | Forbidden | Forbidden | Included |
Occurs as Attribute Value | Not recognized | Forbidden | Forbidden | Notify | Not recognized |
Reference in EntityValue | Included in literal | Bypassed | Bypassed | Error | Included |
Reference in DTD | Included as PE | Forbidden | Forbidden | Forbidden | Forbidden |
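For implementers, the table above can be captured as a simple lookup structure. The following Python sketch transcribes the table cell by cell; the names KINDS, TREATMENT, and required_treatment are hypothetical and are not defined by this specification.

```python
# Column order of the table: four entity types, then character references.
KINDS = ("Parameter", "Internal General", "External Parsed General",
         "Unparsed", "Character")

_ROWS = {
    "Reference in Content":
        ("Not recognized", "Included", "Included if validating", "Forbidden", "Included"),
    "Reference in Attribute Value":
        ("Not recognized", "Included in literal", "Forbidden", "Forbidden", "Included"),
    "Occurs as Attribute Value":
        ("Not recognized", "Forbidden", "Forbidden", "Notify", "Not recognized"),
    "Reference in EntityValue":
        ("Included in literal", "Bypassed", "Bypassed", "Error", "Included"),
    "Reference in DTD":
        ("Included as PE", "Forbidden", "Forbidden", "Forbidden", "Forbidden"),
}

# Nested mapping: context -> kind of reference -> required treatment.
TREATMENT = {context: dict(zip(KINDS, row)) for context, row in _ROWS.items()}


def required_treatment(context, kind):
    """Look up the required processor behavior for one cell of the table."""
    return TREATMENT[context][kind]


if __name__ == "__main__":
    print(required_treatment("Reference in Attribute Value", "Internal General"))
    print(required_treatment("Reference in EntityValue", "Unparsed"))
```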
Outside the DTD, the % character has no special significance; thus, what would be parameter entity references in the DTD are not recognized as markup in content. Similarly, the names of unparsed entities are not recognized except when they appear in the value of an appropriately declared attribute.
[Definition: An entity is included when its replacement text is retrieved and processed, in place of the reference itself, as though it were part of the document at the location the reference was recognized.] The replacement text may contain both character data and (except for parameter entities) markup, which MUST be recognized in the usual way. (The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand is not recognized as an entity-reference delimiter.) A character reference is included when the indicated character is processed in place of the reference itself.
When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.
This rule is based on the recognition that the automatic inclusion provided by the SGML and XML entity mechanism, primarily designed to support modularity in authoring, is not necessarily appropriate for other applications, in particular document browsing. Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.
The following are forbidden, and constitute fatal errors:
the appearance of a reference to an unparsed entity, except in the EntityValue in an entity declaration.
the appearance of any character or general-entity reference in the DTD except within an EntityValue or AttValue.
a reference to an external entity in an attribute value.
When an entity reference appears in an attribute value, or a parameter entity reference appears in a literal entity value, its replacement text MUST be processed in place of the reference itself as though it were part of the document at the location the reference was recognized, except that a single or double quote character in the replacement text MUST always be treated as a normal data character and MUST NOT terminate the literal. For example, this is well-formed:
<!ENTITY % YN '"Yes"' >
<!ENTITY WhatHeSaid "He said %YN;" >
while this is not:
<!ENTITY EndAttr "27'" >
<element attribute='a-&EndAttr;>
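The second example can be tried with a non-validating processor. The sketch below uses Python's standard xml.etree.ElementTree module (Expat-based); the wrapper documents and the helper names is_well_formed and attribute_value are assumptions added for illustration. It contrasts the unterminated attribute from the example with a variant in which the literal is closed explicitly.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE element [
  <!ENTITY EndAttr "27'">
]>
"""

# As in the example above: the quote inside the entity does not close the literal.
UNCLOSED_DOC = DTD + "<element attribute='a-&EndAttr;>"

# Hypothetical variant with an explicit closing quote and an empty-element tag.
CLOSED_DOC = DTD + "<element attribute='a-&EndAttr;'/>"


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def attribute_value(text, name):
    """Return the named attribute of the root element after entity expansion."""
    return ET.fromstring(text).get(name)


if __name__ == "__main__":
    print(is_well_formed(UNCLOSED_DOC))
    print(attribute_value(CLOSED_DOC, "attribute"))
```

In the closed variant the apostrophe contributed by the entity ends up as ordinary data inside the attribute value, which is the "included in literal" behavior described above.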
When the name of an unparsed entity appears as a token in the value of an attribute of declared type ENTITY or ENTITIES, a validating processor MUST inform the application of the system and public (if any) identifiers for both the entity and its associated notation.
When a general entity reference appears in the EntityValue in an entity declaration, it MUST be bypassed and left as is.
Just as with external parsed entities, parameter entities need only be included if validating. When a parameter-entity reference is recognized in the DTD and included, its replacement text MUST be enlarged by the attachment of one leading and one following space (#x20) character; the intent is to constrain the replacement text of parameter entities to contain an integral number of grammatical tokens in the DTD. This behavior MUST NOT apply to parameter entity references within entity values; these are described in 4.4.5 Included in Literal.
It is an error for a reference to an unparsed entity to appear in the EntityValue in an entity declaration.
In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding whitespace) if there is one but without any replacement of character references or parameter-entity references.]
The literal entity value as given in an internal entity declaration (EntityValue) may contain character, parameter-entity, and general-entity references. Such references MUST be contained entirely within the literal entity value. The actual replacement text that is included (or included in literal) as described above MUST contain the replacement text of any parameter entities referred to, and MUST contain the character referred to, in place of any character references in the literal entity value; however, general-entity references MUST be left as-is, unexpanded. For example, given the following declarations:
<!ENTITY % pub    "&#xc9;ditions Gallimard" >
<!ENTITY   rights "All rights reserved" >
<!ENTITY   book   "La Peste: Albert Camus,
&#xA9; 1947 %pub;. &rights;" >
then the replacement text for the entity "book" is:
La Peste: Albert Camus,
© 1947 Éditions Gallimard. &rights;
The general-entity reference "&rights;" would be expanded should the reference "&book;" appear in the document's content or an attribute value.
These simple rules may have complex interactions; for a detailed discussion of a difficult example, see D Expansion of Entity and Character References.
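The construction of replacement text for an internal entity can be sketched as a single substitution pass: character references and parameter-entity references are replaced, while general-entity references are left untouched. The Python below is a simplified illustration under stated assumptions; the function name replacement_text is hypothetical, parameter entities are supplied as a dictionary of already-constructed replacement texts, and no error handling for undeclared entities or illegal characters is attempted.

```python
import re

# Matches &#xHEX; or &#DEC; or %name; but deliberately not &name;
_REFERENCE = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|%([^;\s%&]+);")


def replacement_text(literal_value, parameter_entities):
    """Build the replacement text from an internal entity's literal value."""
    def expand(match):
        hex_digits, dec_digits, pe_name = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        return parameter_entities[pe_name]
    # Substituted text is not rescanned, so general-entity references and
    # characters produced by character references stay as they are.
    return _REFERENCE.sub(expand, literal_value)


if __name__ == "__main__":
    pub = replacement_text("&#xc9;ditions Gallimard", {})
    book = replacement_text(
        "La Peste: Albert Camus,\n&#xA9; 1947 %pub;. &rights;", {"pub": pub})
    print(book)
```

Because inserted text is never rescanned within this pass, a literal value such as "&#38;#60;" yields the five characters "&#60;", which is what makes the double escaping used for the predefined entities work.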
[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and MUST be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]
All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:
<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
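That the predefined entities are recognized without any declaration can be checked with a short test. The sketch below uses Python's standard xml.etree.ElementTree module (Expat-based); the helper name text_of and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def text_of(document):
    """Parse the document and return the root element's character data."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    # No DTD at all: the five predefined entities still resolve.
    print(text_of("<doc>&lt;&amp;&gt;&apos;&quot;</doc>"))
    # Numeric character references escape the same delimiters.
    print(text_of("<doc>&#60;&#38;</doc>"))
    # An explicit, doubly escaped declaration of lt gives the same result.
    print(text_of('<!DOCTYPE doc [<!ENTITY lt "&#38;#60;">]><doc>&lt;</doc>'))
```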
[Definition: Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed.]
[Definition: Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.]
[82] | NotationDecl | ::= | '<!NOTATION' S Name S (ExternalID | PublicID) S? '>' | [VC: Unique Notation Name] |
[83] | PublicID | ::= | 'PUBLIC' S PubidLiteral |
Validity constraint: Unique Notation Name
A given Name MUST NOT be declared in more than one notation declaration.
XML processors MUST provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They MAY additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described. (It is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)
[Definition: The document entity serves as the root of the entity tree and a starting-point for an XML processor.] This specification does not specify how the document entity is to be located by an XML processor; unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.
Conforming XML processors fall into two classes: validating and non-validating.
Validating and non-validating processors alike MUST report violations of this specification's well-formedness constraints in the content of the document entity and any other parsed entities that they read.
[Definition: Validating processors MUST, at user option, report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification.] To accomplish this, validating XML processors MUST read and process the entire DTD and all external parsed entities referenced in the document.
Non-validating processors are REQUIRED to check only the document entity, including the entire internal DTD subset, for well-formedness. [Definition: While they are not required to check the document for validity, they are REQUIRED to process all the declarations they read in the internal DTD subset and in any parameter entity that they read, up to the first reference to a parameter entity that they do not read; that is to say, they MUST use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values.] Except when standalone="yes", they MUST NOT process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations; when standalone="yes", processors MUST process these declarations.
Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information. For example, several requirements for uniqueness within the document may not be met, including more than one element with the same id, duplicate declarations of elements or notations with the same name, etc. In these cases the behavior of the parser with respect to reporting such information to the application is undefined.
The behavior of a validating XML processor is highly predictable; it must read every piece of a document and report all well-formedness and validity violations. Less is required of a non-validating processor; it need not read any part of the document other than the document entity. This has two effects that may be important to users of XML processors:
Certain well-formedness errors, specifically those that require reading external entities, may fail to be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as forbidden in 4.4 XML Processor Treatment of Entities and References.
The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities. For example, a non-validating processor may fail to normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities, or in the internal subset after an unread parameter entity reference.
For maximum reliability in interoperating between different XML processors, applications which use non-validating processors SHOULD NOT rely on any behaviors not required of such processors. Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities that are or may be specified in external entities) SHOULD use validating XML processors.
The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
#xN
where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
[a-zA-Z], [#xN-#xN]
matches any Char with a value in the range(s) indicated (inclusive).
[abc], [#xN#xN#xN]
matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
[^a-z], [^#xN-#xN]
matches any Char with a value outside the range indicated.
[^abc], [^#xN#xN#xN]
matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.
"string"
matches a literal string matching that given inside the double quotes.
'string'
matches a literal string matching that given inside the single quotes.
These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
(expression)
expression is treated as a unit and may be combined as described in this list.
A?
matches A or nothing; optional A.
A B
matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B
matches A or B.
A - B
matches any string that matches A but does not match B.
A+
matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A*
matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
Other notations used in the productions are:
/* ... */
comment.
[ wfc: ... ]
well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.
[ vc: ... ]
validity constraint; this identifies by name a constraint on valid documents associated with a production.
Because of changes to productions [4] and [5], the productions in this Appendix are now orphaned and not used anymore in determining name characters. This Appendix may be removed in a future edition of this specification; other specifications that wish to refer to the productions herein should do so by means of a reference to the relevant production(s) in the Fourth Edition of this specification.
Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet), ideographic characters, and combining characters (among others, this class contains most diacritics). Digits and extenders are also distinguished.
[84] | Letter | ::= | BaseChar | Ideographic |
[85] | BaseChar | ::= | [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]| [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E]| [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0]| [#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1]| #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1]| [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC| #x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C]| [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4]| [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5]| [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 | [#x0561-#x0586]| [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A]| [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3]| #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D| [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8]| [#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD]| [#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10]| [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36]| [#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74]| [#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8]| [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD| #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28]| [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D| [#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90]| [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F]| [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9]| [#x0C05-#x0C0C] | [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33]| [#x0C35-#x0C39] | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90]| [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE| [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28]| [#x0D2A-#x0D39] | [#x0D60-#x0D61] | [#x0E01-#x0E2E] | #x0E30| [#x0E32-#x0E33] | 
[#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84| [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97]| [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7| [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3]| #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69]| [#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103]| [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112]| #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150| [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163| #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173]| #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF]| [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB| #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9]| [#x1F00-#x1F15] | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D]| [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D]| [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4]| [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC]| [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B]| #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA]| [#x3105-#x312C] | [#xAC00-#xD7A3] |
[86] | Ideographic | ::= | [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029] |
[87] | CombiningChar | ::= | [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486]| [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF| [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670| [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8]| [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C]| #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983]| #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8]| [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02| #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48]| [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC| [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03]| #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D]| [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8]| [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44]| [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83]| [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6]| [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D]| #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E]| #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD]| [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E| #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95]| #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9| [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099| #x309A |
[88] | Digit | ::= | [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9]| [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF]| [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF]| [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29] |
[89] | Extender | ::= | #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640| #x0E46 | #x0EC6 | #x3005 | [#x3031-#x3035] | [#x309D-#x309E]| [#x30FC-#x30FE] |
The character classes defined here can be derived from the Unicode 2.0 character database as follows:
Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.
Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.
Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.
Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database -- marked by field 5 beginning with a "<") are not allowed.
The following characters are treated as name-start characters rather than name characters, because the property file classifies them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.
Characters #x20DD-#x20E0 are excluded (in accordance with Unicode 2.0, section 5.14).
Character #x00B7 is classified as an extender, because the property list so identifies it.
Character #x0387 is added as a name character, because #x00B7 is its canonical equivalent.
Characters ':' and '_' are allowed as name-start characters.
Characters '-' and '.' are allowed as name characters.
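As an informal illustration of the derivation rules above, the following Python sketch takes the ranges of production [88] (Digit) and checks which of those code points carry the general category Nd. It is an assumption of this sketch that Python's unicodedata module is an acceptable stand-in: that module reflects a much later Unicode version than the 2.0 database the classes were derived from, so the check is indicative only. The names DIGIT_RANGES, digit_code_points and not_nd are hypothetical.

```python
import unicodedata

# Ranges copied from production [88] Digit.
DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9),
    (0x0966, 0x096F), (0x09E6, 0x09EF), (0x0A66, 0x0A6F),
    (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F), (0x0BE7, 0x0BEF),
    (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]


def digit_code_points():
    """Yield every code point matched by production [88]."""
    for low, high in DIGIT_RANGES:
        yield from range(low, high + 1)


# Code points of [88] whose category, in the Unicode version bundled
# with Python (an assumption, not Unicode 2.0), is something other than Nd.
not_nd = [
    hex(cp) for cp in digit_code_points()
    if unicodedata.category(chr(cp)) != "Nd"
]
print(not_nd)
```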
XML is designed to be a subset of SGML, in that every XML document should also be a conforming SGML document. For a detailed comparison of the additional restrictions that XML places on documents beyond those of SGML, see [Clark].
This appendix contains some examples illustrating the sequence of entity- and character-reference recognition and expansion, as specified in 4.4 XML Processor Treatment of Entities and References.
If the DTD contains the declaration
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped numerically (&#38;#38;#38;) or with a general entity (&amp;amp;).</p>" >
then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity "example":
<p>An ampersand (&#38;) may be escaped numerically (&#38;#38;) or with a general entity (&amp;amp;).</p>
A reference in the document to "&example;" will cause the text to be reparsed, at which time the start- and end-tags of the p element will be recognized and the three references will be recognized and expanded, resulting in a p element with the following content (all data, no delimiters or markup):
An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;).
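The two rounds of expansion can be observed with a non-validating parser. The sketch below is only an illustration: it assumes Python's xml.etree.ElementTree (built on Expat), wraps the declaration in a hypothetical doc element, and writes the references in doubly escaped form (&#38;#38;, &#38;#38;#38;, &amp;amp;) so that one level of escaping survives the parsing of the entity declaration itself.

```python
import xml.etree.ElementTree as ET

# The "doc" wrapper element is an assumption made for this illustration.
DOCUMENT = (
    '<!DOCTYPE doc [\n'
    '<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped '
    'numerically (&#38;#38;#38;) or with a general entity (&amp;amp;).</p>" >\n'
    ']>\n'
    '<doc>&example;</doc>'
)

root = ET.fromstring(DOCUMENT)

# The start- and end-tags inside the entity are recognized on reparsing,
# so the expansion shows up as a p child element of doc.
p = root.find("p")
print(p.text)
```

The text of the p element is pure character data: the references that remained after the declaration was parsed are themselves expanded when the entity is included in content.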
A more complex example will illustrate the rules and their effects fully. In the following example, the line numbers are solely for reference.
1 <?xml version='1.0'?>
2 <!DOCTYPE test [
3 <!ELEMENT test (#PCDATA) >
4 <!ENTITY % xx '&#37;zz;'>
5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
6 %xx;
7 ]>
8 <test>This sample shows a &tricky; method.</test>
This produces the following:
in line 4, the reference to character 37 is expanded immediately, and the parameter entity "xx" is stored in the symbol table with the value "%zz;". Since the replacement text is not rescanned, the reference to parameter entity "zz" is not recognized. (And it would be an error if it were, since "zz" is not yet declared.)
in line 5, the character reference "&#60;" is expanded immediately and the parameter entity "zz" is stored with the replacement text "<!ENTITY tricky "error-prone">", which is a well-formed entity declaration.
in line 6, the reference to "xx" is recognized, and the replacement text of "xx" (namely "%zz;") is parsed. The reference to "zz" is recognized in its turn, and its replacement text ("<!ENTITY tricky "error-prone">") is parsed. The general entity "tricky" has now been declared, with the replacement text "error-prone".
in line 8, the reference to the general entity "tricky" is recognized, and it is expanded, so the full content of the test element is the self-describing (and ungrammatical) string This sample shows a error-prone method.
In the following example
<!DOCTYPE foo [ <!ENTITY x "&lt;"> ]> <foo attr="&x;"/>
the replacement text of x is the four characters "&lt;" because references to general entities in entity values are bypassed. The replacement text of lt is a character reference to the less-than character, for example the five characters "&#60;" (see 4.6 Predefined Entities). Since neither of these contains a less-than character the result is well-formed.
If the definition of x had been
<!ENTITY x "&#60;">
then the document would not have been well-formed, because the replacement text of x would be the single character "<" which is not permitted in attribute values (see WFC: No < in Attribute Values).
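The difference between the two definitions of x can be checked with a parser that enforces well-formedness. The following sketch assumes Python's xml.etree.ElementTree (built on Expat); the helper attr_or_error and the two constant names are hypothetical, and the exact wording of the parser's error message is implementation-specific.

```python
import xml.etree.ElementTree as ET

# x is defined with an entity reference: its replacement text is "&lt;".
WELL_FORMED = '<!DOCTYPE foo [ <!ENTITY x "&lt;"> ]> <foo attr="&x;"/>'

# x is defined with a character reference: its replacement text is "<".
NOT_WELL_FORMED = '<!DOCTYPE foo [ <!ENTITY x "&#60;"> ]> <foo attr="&x;"/>'


def attr_or_error(document):
    """Return the value of attr, or a message if the document is rejected."""
    try:
        return ET.fromstring(document).get("attr")
    except ET.ParseError as exc:
        return f"not well-formed: {exc}"


print(attr_or_error(WELL_FORMED))
print(attr_or_error(NOT_WELL_FORMED))
```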
As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.
For example, the content model ((b, c) | (b, d)) is non-deterministic, because given an initial b the XML processor cannot know which b in the model is being matched without looking ahead to see which element follows the b. In this case, the two references to b can be collapsed into a single reference, making the model read (b, (c | d)). An initial b now clearly matches only a single name in the content model. The processor doesn't need to look ahead to see what follows; either c or d would be accepted.
More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.
Algorithms exist which allow many but not all non-deterministic content models to be reduced automatically to equivalent deterministic models; see Brüggemann-Klein 1991 [Brüggemann-Klein].
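The follow-set test described above can be sketched in a few lines of Python. This is an illustrative position-based construction, not the cited algorithm 3.5 verbatim. The tuple representation of content models ("name", "seq", "choice", "opt", "star", "plus") and the function names analyze and is_deterministic are assumptions of the sketch. The set of initial positions is checked in the same way as the follow sets.

```python
from itertools import count


def analyze(model):
    """Return (first, follow, names) for a content model given as tuples."""
    names = {}    # position -> element type name
    follow = {}   # position -> positions that may come next
    counter = count()

    def walk(node):
        # Returns (nullable, first positions, last positions) of node.
        kind = node[0]
        if kind == "name":
            pos = next(counter)
            names[pos] = node[1]
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind == "seq":
            nullable, first, last = True, set(), set()
            for child in node[1:]:
                c_null, c_first, c_last = walk(child)
                for p in last:
                    follow[p] |= c_first
                if nullable:  # every earlier child can be empty
                    first |= c_first
                last = (c_last | last) if c_null else c_last
                nullable = nullable and c_null
            return nullable, first, last
        if kind == "choice":
            nullable, first, last = False, set(), set()
            for child in node[1:]:
                c_null, c_first, c_last = walk(child)
                nullable = nullable or c_null
                first |= c_first
                last |= c_last
            return nullable, first, last
        # Occurrence indicators: "opt" (?), "star" (*), "plus" (+).
        c_null, c_first, c_last = walk(node[1])
        if kind in ("star", "plus"):
            for p in c_last:
                follow[p] |= c_first
        return (c_null or kind in ("opt", "star")), c_first, c_last

    _, first, _ = walk(model)
    return first, follow, names


def is_deterministic(model):
    """True if no initial or follow set repeats an element type name."""
    first, follow, names = analyze(model)
    for candidates in [first, *follow.values()]:
        labels = [names[p] for p in candidates]
        if len(labels) != len(set(labels)):
            return False
    return True


b, c, d = ("name", "b"), ("name", "c"), ("name", "d")
print(is_deterministic(("choice", ("seq", b, c), ("seq", b, d))))
print(is_deterministic(("seq", b, ("choice", c, d))))
```

The two calls correspond to the models ((b, c) | (b, d)) and (b, (c | d)) discussed above.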
The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use, which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information. We will consider these cases in turn.
Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.
With a Byte Order Mark:
00 00 FE FF | UCS-4, big-endian machine (1234 order) |
FF FE 00 00 | UCS-4, little-endian machine (4321 order) |
00 00 FF FE | UCS-4, unusual octet order (2143) |
FE FF 00 00 | UCS-4, unusual octet order (3412) |
FE FF ## ## | UTF-16, big-endian |
FF FE ## ## | UTF-16, little-endian |
EF BB BF | UTF-8 |
Without a Byte Order Mark:
00 00 00 3C | UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies. |
3C 00 00 00 | |
00 00 3C 00 | |
00 3C 00 00 | |
00 3C 00 3F | UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which) |
3C 00 3F 00 | UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which) |
3C 3F 78 6D | UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably |
4C 6F A7 94 | EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use) |
Other | UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind |
Note:
In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.
This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).
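The byte patterns tabulated above translate directly into a prefix lookup. The following Python sketch is one possible way to organize it, not a required implementation: the names SIGNATURES and sniff_encoding_family and the label strings are hypothetical. The UCS-4 patterns are listed before the UTF-16 ones so that FF FE 00 00 and FE FF 00 00 are not mistaken for UTF-16, which is how the sketch honors the rule that two consecutive ## bytes cannot both be 00.

```python
# (prefix, label) pairs, checked in order; labels are informal.
SIGNATURES = [
    # With a Byte Order Mark.
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (1234), BOM"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (4321), BOM"),
    (b"\x00\x00\xff\xfe", "UCS-4 unusual order (2143), BOM"),
    (b"\xfe\xff\x00\x00", "UCS-4 unusual order (3412), BOM"),
    (b"\xfe\xff", "UTF-16 big-endian, BOM"),
    (b"\xff\xfe", "UTF-16 little-endian, BOM"),
    (b"\xef\xbb\xbf", "UTF-8, BOM"),
    # Without a Byte Order Mark.
    (b"\x00\x00\x00\x3c", "32-bit code unit, big-endian (1234)"),
    (b"\x3c\x00\x00\x00", "32-bit code unit, little-endian (4321)"),
    (b"\x00\x00\x3c\x00", "32-bit code unit, unusual order (2143)"),
    (b"\x00\x3c\x00\x00", "32-bit code unit, unusual order (3412)"),
    (b"\x00\x3c\x00\x3f", "16-bit code unit, big-endian"),
    (b"\x3c\x00\x3f\x00", "16-bit code unit, little-endian"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible 7-bit, 8-bit or mixed-width"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_encoding_family(head: bytes) -> str:
    """Classify an entity by its first octets (at most four are needed)."""
    for prefix, label in SIGNATURES:
        if head.startswith(prefix):
            return label
    # UTF-8 without a declaration, or mislabeled, corrupt, fragmentary
    # or wrapped data.
    return "other"


print(sniff_encoding_family(b"<?xml version='1.0'?>"))
print(sniff_encoding_family("<?xml".encode("utf-16-be")))
```

The result names only a family of encodings; the encoding declaration still has to be read to pick the specific member.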
Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.
Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.
Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.
The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 3023] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.
If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.
This specification was prepared and approved for publication by the W3C XML Working Group (WG). WG approval of this specification does not necessarily imply that all WG members voted for its approval. The current and former participants of the XML WG are:
The fifth edition of this specification was prepared by the W3C XML Core Working Group (WG). The participants in the WG at the time of publication of this edition were:
This edition was encoded in a slightly modified version of the XMLspec DTD, v2.10. The XHTML versions were produced with a combination of the xmlspec.xsl, diffspec.xsl, and REC-xml.xsl XSLT stylesheets.
The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version should be used is left to the discretion of the document author or schema designer.
The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0 [Unicode], and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.
The first character of any name should have a Unicode property of ID_Start, or else be '_' #x5F.
Characters other than the first should have a Unicode property of ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, with the exception of "'" #x27 and "’" #x2019.
Characters in names should be expressed using Normalization Form C as defined in [UnicodeNormal].
Ideographic characters which have a canonical decomposition (including those in the ranges [#xF900-#xFAFF] and [#x2F800-#x2FFFD], with 12 exceptions) should not be used in names.
Characters which have a compatibility decomposition (those with a "compatibility formatting tag" in field 5 of the Unicode Character Database -- marked by field 5 beginning with a "<") should not be used in names. This suggestion does not apply to characters which despite their compatibility decompositions are in regular use in their scripts, for example #x0E33 THAI CHARACTER SARA AM or #x0EB3 LAO CHARACTER AM.
Combining characters meant for use with symbols only (including those in the ranges [#x20D0-#x20EF] and [#x1D165-#x1D1AD]) should not be used in names.
The interlinear annotation characters ([#xFFF9-#xFFFB]) should not be used in names.
Variation selector characters should not be used in names.
Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names should not be employed.
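Several of the suggestions above lend themselves to mechanical checking. The sketch below, which is illustrative and covers only some of the suggestions, uses Python's unicodedata module (whose Unicode version depends on the Python release, which is an assumption of this sketch) to flag names that are not in Normalization Form C, that contain characters with a compatibility decomposition, or that fall in the interlinear annotation or symbol-combining ranges quoted above. The function name name_warnings, the EXEMPT set and the wording of the messages are hypothetical.

```python
import unicodedata

# The two characters the text cites as being in regular use in their
# scripts despite their compatibility decompositions.
EXEMPT = {"\u0E33", "\u0EB3"}


def name_warnings(name):
    """Return a list of suggestions that the given name does not follow."""
    warnings = []
    if not unicodedata.is_normalized("NFC", name):
        warnings.append("not in Normalization Form C")
    for ch in name:
        cp = ord(ch)
        # A compatibility decomposition is marked by a leading "<tag>".
        if unicodedata.decomposition(ch).startswith("<") and ch not in EXEMPT:
            warnings.append(f"U+{cp:04X} has a compatibility decomposition")
        if 0xFFF9 <= cp <= 0xFFFB:
            warnings.append(f"U+{cp:04X} is an interlinear annotation character")
        if 0x20D0 <= cp <= 0x20EF or 0x1D165 <= cp <= 0x1D1AD:
            warnings.append(f"U+{cp:04X} is in a symbol-combining range")
    return warnings


print(name_warnings("caf\u00e9"))    # precomposed e with acute
print(name_warnings("cafe\u0301"))   # e followed by combining acute
print(name_warnings("\ufb01le"))     # begins with the "fi" ligature
```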