Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language. By construction, XML documents are conforming SGML documents.
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.
Origin and Goals
XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the Working Group's contact with the W3C.
The design goals for XML are:
XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.1 and construct computer programs to process it.
This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.
Terminology
The terminology used to describe XML documents is defined in the body of this specification. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when EMPHASIZED, are to be interpreted as described in IETF RFC 2119. In addition, the terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:
error: A violation of the rules of this specification; results are undefined. Unless otherwise specified, failure to observe a prescription of this specification indicated by one of the keywords MUST, REQUIRED, MUST NOT, SHALL and SHALL NOT is an error. Conforming software MAY detect and report an error and MAY recover from it.
fatal error: An error which a conforming XML processor MUST detect and report to the application. After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way).
at user option: Conforming software MAY or MUST (depending on the modal verb in the sentence) behave as described; if it does, it MUST provide users a means to enable or disable the behavior described.
validity constraint: A rule which applies to all valid XML documents. Violations of validity constraints are errors; they MUST, at user option, be reported by validating XML processors.
well-formedness constraint: A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are fatal errors.
match: (Of strings or names:) Two strings or names being compared MUST be identical. Characters with multiple possible representations in Unicode (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production. (Of content and content models:) An element matches its declaration when it conforms in the fashion described in the constraint Element Valid.
for compatibility: Marks a sentence describing a feature of XML included solely to ensure that XML remains compatible with SGML.
for interoperability: Marks a sentence describing a non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the WebSGML Adaptations Annex to ISO 8879.
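The string sense of "match" defined in the list above can be illustrated with a minimal Python sketch. The helper name xml_strings_match is hypothetical and not part of this specification; it only shows that comparison is code point by code point, with no normalization and no case folding.

```python
import unicodedata

def xml_strings_match(a: str, b: str) -> bool:
    """Compare code point by code point: no normalization, no case folding."""
    return a == b

precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings render alike but use different code point sequences.
print(xml_strings_match(precomposed, decomposed))
# Names that differ only in case do not match either.
print(xml_strings_match("Name", "name"))
# Only after explicit normalization (done by the author, not the processor)
# do the two spellings become identical.
print(xml_strings_match(unicodedata.normalize("NFC", decomposed), precomposed))
```

This is why documents that mix precomposed and base+diacritic spellings of the same name can fail to match even though they look identical on screen.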
Rationale and list of changes for XML 1.1
The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Third Edition of 2004, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 4.0 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0.
The overall philosophy of names has changed since XML 1.0. Whereas XML 1.0 provided a rigid definition of names, wherein everything that was not permitted was forbidden, XML 1.1 names are designed so that everything that is not forbidden (for a specific reason) is permitted. Since Unicode will continue to grow past version 4.0, further changes to XML can be avoided by allowing almost any character, including those not yet assigned, in names.
In addition, XML 1.0 attempts to adapt to the line-end conventions of various modern operating systems, but discriminates against the conventions used on IBM and IBM-compatible mainframes. As a result, XML documents on mainframes are not plain text files according to the local conventions. XML 1.0 documents generated on mainframes must either violate the local line-end conventions, or employ otherwise unnecessary translation phases before parsing and after generation. Allowing straightforward interoperability is particularly important when data stores are shared between mainframe and non-mainframe systems (as opposed to being copied from one to the other). Therefore XML 1.1 adds NEL (#x85) to the list of line-end characters. For completeness, the Unicode line separator character, #x2028, is also supported.
Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
Finally, XML 1.1 defines a set of constraints called "full normalization" on XML documents, which document creators SHOULD adhere to, and document processors SHOULD verify. Using fully normalized documents ensures that identity comparisons of names, attribute values, and character content can be made correctly by simple binary comparison of Unicode strings.
A new XML version, rather than a set of errata to XML 1.0, is being created because the changes affect the definition of well-formed documents. XML 1.0 processors must continue to reject documents that contain new characters in XML names, new line-end conventions, and references to control characters. The distinction between XML 1.0 and XML 1.1 documents is indicated by the version number information in the XML declaration at the start of each document.
Documents
A data object is an XML document if it is well-formed, as defined in this specification. In addition, the XML document is valid if it meets certain further constraints.
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures MUST nest properly, as described later in this specification.
Well-Formed XML Documents
A textual object is a well-formed XML document if:
Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification.
Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.
There is exactly one element, called the root, or document element, no part of which appears in the content of any other element. For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.
As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.
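The parent and child relationship described above can be made concrete with a short Python sketch. It uses the standard library's xml.etree.ElementTree, which implements XML 1.0 rather than XML 1.1; the sample document and the helper name parent_map are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    """Map every non-root element C to its unique parent P."""
    return {child: parent for parent in root.iter() for child in parent}

# Hypothetical sample document: one root element, properly nested children.
doc = ET.fromstring("<book><chapter><para/></chapter><index/></book>")
parents = parent_map(doc)

for element in doc.iter():
    parent = parents.get(element)
    # Compare against None explicitly: childless elements test as false.
    print(element.tag, "->", parent.tag if parent is not None else "(root)")
```

Every element except the root appears exactly once as a key, which mirrors the statement that each non-root element has exactly one parent.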
Characters
A parsed entity contains text, a sequence of characters, which may represent markup or character data. A character is an atomic unit of text as specified by ISO/IEC 10646. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in the references were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.
Character Range
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
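The two productions above translate directly into range checks on code points. The following Python sketch is illustrative only; the function names is_char and is_restricted_char are assumptions, not names used by this specification.

```python
def is_char(cp: int) -> bool:
    """True if the code point falls in one of the ranges of the Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def is_restricted_char(cp: int) -> bool:
    """True if the code point falls in one of the ranges of RestrictedChar."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )

for cp in (0x0, 0x1, 0x9, 0x85, 0xD800, 0xFFFE, 0x10FFFF):
    print(f"#x{cp:X}", is_char(cp), is_restricted_char(cp))
```

Note that tab (#x9), line feed (#xA), carriage return (#xD) and NEL (#x85) fall inside Char but outside RestrictedChar.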
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later in this specification.
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
This section defines some symbols used widely in the grammar.
S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
White Space
S ::= (#x20 | #x9 | #xD | #xA)+
The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in the section on end-of-line handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters. Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.
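The reservation rule for names can be checked mechanically. The sketch below mirrors the pattern (('X'|'x') ('M'|'m') ('L'|'l')) one character at a time; the function name is_reserved_name is hypothetical, and the sketch does not verify that its argument is otherwise a legal Name.

```python
def is_reserved_name(name: str) -> bool:
    """True if the name begins with a match to (('X'|'x') ('M'|'m') ('L'|'l'))."""
    return (
        len(name) >= 3
        and name[0] in "Xx"
        and name[1] in "Mm"
        and name[2] in "Ll"
    )

for candidate in ("xml:lang", "XmLfoo", "greeting", "exml", "xm"):
    print(candidate, is_reserved_name(candidate))
```

Checking each position against an explicit pair of characters avoids relying on locale or Unicode case mapping, which the pattern itself does not involve.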
The Namespaces in XML Recommendation assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.
An Nmtoken (name token) is any mixture of name characters.
The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. Suggestions on the creation of names are given elsewhere in this specification.
Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.
The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.
The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see the discussion of attribute types).
Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.
Although the EntityValue production allows the definition of a general entity consisting of a single explicit < in the literal (e.g., <!ENTITY mylt "<">), it is strongly advised to avoid this practice since any reference to that entity will cause a well-formedness error.
Character Data and Markup
Text consists of intermingled character data and markup. Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).
All text that is not markup constitutes the character data of the document.
The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup or the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter.
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
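The escaping rules above can be sketched as two small Python helpers. The names escape_text and escape_attribute are hypothetical. As a simplifying assumption, the sketch escapes every right angle bracket, which is stricter than required (only the ">" in "]]>" must be escaped) but always safe.

```python
def escape_text(text: str) -> str:
    """Escape character data for use in element content."""
    # The ampersand must be replaced first, or the "&" introduced by the
    # later replacements would itself be escaped again.
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )

def escape_attribute(value: str) -> str:
    """Escape a value so it can sit inside either kind of quote delimiter."""
    return escape_text(value).replace('"', "&quot;").replace("'", "&apos;")

print(escape_text("a < b && c ]]> d"))
print(escape_attribute("say \"hi\" & 'bye'"))
```

Escaping both quote characters in attribute values means the result is correct whichever delimiter the serializer chooses.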
Character Data
CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
Comments
Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments. Parameter entity references MUST NOT be recognized within comments.
Processing instructions (PIs) are not part of the document's character data, but MUST be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references MUST NOT be recognized within processing instructions.
CDATA Sections
CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>".
Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.
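Because CDATA sections cannot nest and cannot contain "]]>", a serializer that wraps arbitrary text has to split the text wherever that delimiter occurs. The Python sketch below shows one common idiom for doing so; it is an assumption of this example, not a procedure defined by this specification, and the helper name to_cdata is hypothetical. The round-trip check uses the standard library's XML 1.0 parser.

```python
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA sections, splitting at every "]]>" in the text."""
    # Each "]]>" is cut after "]]": the first section closes there and a new
    # section opens before the ">", so the delimiter never appears whole.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

samples = ["<greeting>Hello, world!</greeting>", "a ]]> b & <c>"]
for sample in samples:
    wrapped = to_cdata(sample)
    parsed = ET.fromstring("<r>" + wrapped + "</r>")
    print(wrapped, parsed.text == sample)
```

The sketch assumes the input contains only characters that are legal in XML; CDATA sections offer no way to represent characters outside the Char range.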
An example of a CDATA section, in which "<greeting>" and "</greeting>" are recognized as character data, not markup:
<![CDATA[<greeting>Hello, world!</greeting>]]>
Prolog and Document Type Declaration
XML 1.1 documents MUST begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML 1.1 document, well-formed but not valid:
<?xml version="1.1"?>
<greeting>Hello, world!</greeting>
but the following is an XML 1.0 document because it does not have an XML declaration:
<greeting>Hello, world!</greeting>
The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute name-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.
The document type declaration MUST appear before the first element in the document.
The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.
A markup declaration is an element type declaration, an attribute-list declaration, an entity declaration, or a notation declaration. These declarations may be contained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For further information, see the discussion of physical structures.
Note that it is possible to construct a well-formed document containing a doctypedecl that neither points to an external subset nor contains an internal subset.
The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.
Parameter entity references are recognized anywhere in the DTD (internal and external subsets and external parameter entities), except in literals, processing instructions, comments, and the contents of ignored conditional sections (see the discussion of conditional sections). They are also recognized in entity value literals. The use of parameter entities in the internal subset is restricted as described below.
Root Element Type
The Name in the document type declaration MUST match the element type of the root element.
Proper Declaration/PE Nesting
Parameter-entity replacement text MUST be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both MUST be contained in the same replacement text.
PEs in Internal Subset
In the internal DTD subset, parameter-entity references MUST NOT occur within markup declarations; they may occur where markup declarations can occur. (This does not apply to references that occur in external parameter entities or to the external subset.)
External Subset
The external subset, if any, MUST match the production for extSubset.
PE Between Declarations
The replacement text of a parameter entity reference in a DeclSep MUST match the production extSubsetDecl.
Like the internal subset, the external subset and any external parameter entities referenced in a DeclSep MUST consist of a series of complete markup declarations of the types allowed by the non-terminal symbol markupdecl, interspersed with white space or parameter-entity references. However, portions of the contents of the external subset or of these external parameter entities may conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset but is allowed in external parameter entities referenced in the internal subset.
The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references are permitted within markup declarations, not only between markup declarations.
An example of an XML document with a document type declaration:
<?xml version="1.1"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>
The system identifier "hello.dtd" gives the address (a URI reference) of a DTD for the document.
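The declarations can also be given locally, as in this example:
<?xml version="1.1" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>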
If both the external and internal subsets are used, the internal subset MUST be considered to occur before the external subset. This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.
XML 1.1 processors SHOULD accept XML 1.0 documents as well. If a document is well-formed or valid XML 1.0, and provided it does not contain any control characters in the range [#x7F-#x9F] other than as character escapes, it may be made well-formed or valid XML 1.1 respectively simply by changing the version number.
Standalone Document Declaration
Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity or in parameter entities. An external markup declaration is defined as a markup declaration occurring in the external subset or in a parameter entity (external or internal, the latter being included because non-validating processors are not required to read them).
If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "no" is assumed.
Any XML document for which standalone="no" holds can be converted algorithmically to a standalone document, which may be desirable for some network delivery applications.
attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or
entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or
attributes with tokenized types, where the attribute appears in the document with a value such that normalization will produce a different value from that which would be produced in the absence of the declaration, or
element types with element content, if white space occurs directly within any instance of those types.
An example XML declaration with a standalone document declaration:
<?xml version="1.1" standalone='yes'?>
White Space Handling
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
End-of-Line Handling
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:
the two-character sequence #xD #xA
the two-character sequence #xD #x85
the single character #x85
the single character #x2028
any #xD character that is not immediately followed by #xA or #x85.
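The translation rules in the list above can be expressed as a single ordered substitution. This Python sketch is a minimal illustration that operates on already decoded text; the function name normalize_line_ends is an assumption, and a real processor would apply the equivalent step to each external parsed entity before parsing.

```python
import re

# Two-character sequences are listed before the lone #xD so that the
# alternation consumes "#xD #xA" and "#xD #x85" as single line ends.
_LINE_END = re.compile("\r\n|\r\x85|\r|\x85|\u2028")

def normalize_line_ends(text: str) -> str:
    """Translate every XML 1.1 line-end form to a single #xA character."""
    return _LINE_END.sub("\n", text)

sample = "a\r\nb\rc\r\x85d\x85e\u2028f\ng"
print(repr(normalize_line_ends(sample)))
```

Because the pattern is applied left to right and never rescans its own output, a sequence such as #xD #xD #xA yields two line feeds, one for the lone #xD and one for the #xD #xA pair.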
The characters #x85 and #x2028 cannot be reliably recognized and translated until an entity's encoding declaration (if present) has been read. Therefore, it is a fatal error to use them within the XML declaration or text declaration.
Language information may also be provided by external transport protocols (e.g. HTTP or MIME). When available, this information may be used by XML applications, but the more local information provided by xml:lang should be considered to override it.
but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:
All XML parsed entities (including document entities) SHOULD be fully normalized, as defined in the definitions for character normalization, supplemented by the following definitions of relevant constructs for XML:
The replacement text of all parsed entities
All text matching, in context, one of the following productions:
CData
CharData
content
Name
Nmtoken
However, a document is still well-formed even if it is not fully normalized. XML processors SHOULD provide a user option to verify that the document being processed is in fully normalized form, and report to the application whether it is or not. The option to not verify SHOULD be chosen only when the input text is certified, in the sense defined for character normalization.
The verification of full normalization MUST be carried out as if by first verifying that the entity is in include-normalized form and by then verifying that none of the relevant constructs listed above begins (after character references are expanded) with a composing character. Non-validating processors MUST ignore possible denormalizations that would be caused by inclusion of external entities that they do not read.
The composing characters are all Unicode characters of non-zero combining class, plus a small number of class-zero characters that nevertheless take part as a non-initial character in certain Unicode canonical decompositions. Since these characters are meant to follow base characters, restricting relevant constructs (including content) from beginning with a composing character does not meaningfully diminish the expressiveness of XML.
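A rough approximation of the check described above can be written with Python's unicodedata module. This is a sketch under stated assumptions, not a conforming verifier: it assumes that the underlying Unicode normalization form is NFC, it ignores the small set of class-zero composing characters, and it does not expand character references or entity inclusions. The function names are hypothetical, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata

def starts_with_combining(text: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0

def looks_fully_normalized(construct: str) -> bool:
    """Approximate test: NFC-normalized and not starting with a combining mark."""
    return (
        unicodedata.is_normalized("NFC", construct)
        and not starts_with_combining(construct)
    )

for construct in ("caf\u00e9", "cafe\u0301", "\u0301abc"):
    print(ascii(construct), looks_fully_normalized(construct))
```

The third sample is already in NFC, since the leading combining mark has nothing to compose with, yet it fails the check because a relevant construct may not begin with a composing character.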
If, while verifying full normalization, a processor encounters characters for which it cannot determine the normalization properties (i.e., characters introduced in a version of Unicode later than the one used in the implementation of the processor), then the processor MAY, at user option, ignore any possible denormalizations caused by these characters. The option to ignore those denormalizations SHOULD NOT be chosen by applications when reliability or security are critical.
XML processors MUST NOT transform the input to be in fully normalized form. XML applications that create XML 1.1 output from either XML 1.1 or XML 1.0 input SHOULD ensure that the output is fully normalized; it is not necessary for internal processing forms to be fully normalized.
The purpose of this section is to strongly encourage XML processors to ensure that the creators of XML documents have properly normalized them, so that XML applications can make tests such as identity comparisons of strings without having to worry about the different possible "spellings" of strings which Unicode allows.
When entities are in a non-Unicode encoding, if the processor transcodes them to Unicode, it SHOULD use a normalizing transcoder.
Logical Structures
Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications. Each attribute specification has a name and a value.
Element
element ::= EmptyElemTag | STag content ETag
This specification does not constrain the application semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.
Element Type Match
The Name in an element's end-tag MUST match the element type in the start-tag.
Element Valid
An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:
The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).
The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space, comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.
The declaration matches Mixed, and the content (after replacing any entity references with their replacement text) consists of character data (including CDATA sections), comments, PIs and child elements whose types match names in the content model.
The declaration matches ANY, and the content (after replacing any entity references with their replacement text) consists of character data, CDATA sections, comments, PIs and child elements whose types have been declared.
Start-Tags, End-Tags, and Empty-Element Tags
The beginning of every non-empty XML element is marked by a start-tag.
The Name in the start- and end-tags gives the element's type. The Name-AttValue pairs are referred to as the attribute specifications of the element, with the Name in each pair referred to as the attribute name and the content of the AttValue (the text between the ' or " delimiters) as the attribute value. Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.
Unique Att Spec
An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag.
Attribute Value Type
The attribute MUST have been declared; the value MUST be of the type declared for it. (For attribute types, see Attribute-List Declarations.)
No External Entity References
Attribute values MUST NOT contain direct or indirect entity references to external entities.
No < in Attribute Values
The replacement text of any entity referred to directly or indirectly in an attribute value MUST NOT contain a <.
An example of a start-tag:
<termdef term="dog">
The end of every element that begins with a start-tag MUST be marked by an end-tag containing a name that echoes the element's type as given in the start-tag:
End-tag
ETag ::= '</' Name S? '>'
An example of an end-tag:
</termdef>
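The well-formedness constraints above (Element Type Match, Unique Att Spec, No < in Attribute Values) can be observed with any conforming parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module as a stand-in for an XML processor; the helper name is_well_formed is hypothetical and not part of this specification.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text`, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag names agree: accepted.
print(is_well_formed('<termdef term="dog"></termdef>'))
# End-tag name differs from the start-tag name (Element Type Match): rejected.
print(is_well_formed('<termdef term="dog"></term>'))
# Same attribute name twice in one start-tag (Unique Att Spec): rejected.
print(is_well_formed('<termdef term="dog" term="cat"/>'))
# Literal < inside an attribute value: rejected.
print(is_well_formed('<termdef term="<"/>'))
```

The sketch only reports acceptance or rejection; a real application would normally inspect the error to learn which constraint was violated.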
The text between the start-tag and end-tag is called the element's content:
Content of Elements
content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
An element with no content is said to be empty. The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag. An empty-element tag takes a special form:
Tags for Empty Elements
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag SHOULD be used, and SHOULD only be used, for elements which are declared EMPTY.
Examples of empty elements:
<IMG align="left"
 src="http://www.w3.org/Icons/WWW/w3c_home" />
<br></br>
<br/>
Element Type Declarations
The element structure of an XML document may, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content.
Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.
An element type declaration takes the form:
Element Type Declaration
elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'
contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
where the Name gives the element type being declared.
Unique Element Type Declaration
An element type MUST NOT be declared more than once.
An element type has element content when elements of that type MUST contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal S). In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear. The grammar is built on content particles (cps), which consist of names, choice lists of content particles, or sequence lists of content particles:
where each Name is the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar; content particles occurring in a sequence list MUST each appear in the element content in the order given in the list. The optional character following a name or list governs whether the element or the content particles in the list may occur one or more (+), zero or more (*), or zero or one times (?). The absence of such an operator means that the element or content particle MUST appear exactly once. This syntax and meaning are identical to those used in the productions in this specification.
The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model. For more information, see Deterministic Content Models.
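Because the content-model operators mirror those of regular expressions, the matching rule above can be illustrated by translating a content model into a regular expression over the sequence of child element types. The following is a sketch only: the function names and the sample element types (head, p, list, note, div2, front, body, back) are assumptions made for illustration, the code handles element content only (not #PCDATA), and it does not check the compatibility rule on deterministic content models.

```python
import re


def content_model_to_regex(model):
    """Translate an element-content model such as "(a, (b | c)*, d?)" into a regex.

    Each element type becomes the group "(?:name )", so a sequence of children
    is matched as a space-terminated list of names.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_:][\w.:-]*|[()|,?*+]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # sequence is plain concatenation in a regex
        elif token in ")|?*+":
            parts.append(token)  # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def matches_content_model(model, child_types):
    """Return True if the list of child element types matches the model."""
    sequence = "".join(name + " " for name in child_types)
    return content_model_to_regex(model).fullmatch(sequence) is not None


print(matches_content_model("(head, (p | list | note)*, div2*)",
                            ["head", "p", "list", "div2"]))
print(matches_content_model("(front, body, back?)", ["front", "body", "back", "back"]))
```

A validating processor would typically build an automaton from the content model instead; the regular-expression form is used here only because it makes the correspondence between the two notations visible.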
Proper Group/PE Nesting
Parameter-entity replacement text MUST be properly nested with parenthesized groups. That is to say, if either of the opening or closing parentheses in a choice, seq, or Mixed construct is contained in the replacement text for a parameter entity, both MUST be contained in the same replacement text.
For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text SHOULD contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text SHOULD be a connector (| or ,).
An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:
where the Names give the types of elements that may appear as children. The keyword #PCDATA derives historically from the term "parsed character data."
No Duplicate Types
The same name MUST NOT appear more than once in a single mixed-content declaration.
Examples of mixed content declarations:
<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
<!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
<!ELEMENT b (#PCDATA)>
Attribute-List Declarations
Attributes are used to associate name-value pairs with elements. Attribute specifications MUST NOT appear outside of start-tags and empty-element tags; thus, the productions used to recognize them appear in Start-Tags, End-Tags, and Empty-Element Tags. Attribute-list declarations may be used:
To define the set of attributes pertaining to a given element type.
To establish type constraints for these attributes.
To provide default values for attributes.
Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:
The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor MAY issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.
When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration. For interoperability, an XML processor MAY at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
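The merging rule above (all declarations for an element type are merged, and the first definition of a given attribute is binding) can be sketched as follows. The data layout, a list of (element type, list of (attribute name, definition)) pairs in document order, and the sample names are assumptions made for illustration.

```python
def merge_attlists(declarations):
    """Merge attribute-list declarations; the first definition of an attribute wins."""
    merged = {}
    for element_type, attdefs in declarations:
        attributes = merged.setdefault(element_type, {})
        for name, definition in attdefs:
            # setdefault keeps an earlier definition and ignores later ones.
            attributes.setdefault(name, definition)
    return merged


declarations = [
    ("termdef", [("id", "ID #REQUIRED"), ("name", "CDATA #IMPLIED")]),
    ("termdef", [("name", "NMTOKEN #REQUIRED"), ("lang", "CDATA #IMPLIED")]),
]
print(merge_attlists(declarations))
```

In this example the second definition of name is ignored, while lang, which appears only in the second declaration, is added to the merged set.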
Attribute Types
XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types are more constrained. The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in the rules for attribute-value normalization.
Values of type ID MUST match the Name production. A name MUST NOT appear more than once in an XML document as a value of this type; i.e., ID values MUST uniquely identify the elements which bear them.
One ID per Element Type
An element type MUST NOT have more than one ID attribute specified.
ID Attribute Default
An ID attribute MUST have a declared default of #IMPLIED or #REQUIRED.
IDREF
Values of type IDREF MUST match the Name production, and values of type IDREFS MUST match Names; each Name MUST match the value of an ID attribute on some element in the XML document; i.e. IDREF values MUST match the value of some ID attribute.
Entity Name
Values of type ENTITY MUST match the Name production, values of type ENTITIES MUST match Names; each Name MUST match the name of an unparsed entity declared in the DTD.
Name Token
Values of type NMTOKEN MUST match the Nmtoken production; values of type NMTOKENS MUST match Nmtokens.
Enumerated attributes have a list of allowed values in their declaration. They MUST take one of those values. There are two kinds of enumerated attribute types:
A NOTATION attribute identifies a notation, declared in the DTD with associated system and/or public identifiers, to be used in interpreting the element to which the attribute is attached.
Notation Attributes
Values of this type MUST match one of the notation names included in the declaration; all notation names in the declaration MUST be declared.
One Notation Per Element Type
An element type MUST NOT have more than one NOTATION attribute specified.
No Notation on Empty Element
For compatibility, an attribute of type NOTATION MUST NOT be declared on an element declared EMPTY.
No Duplicate Tokens
The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, MUST all be distinct.
Enumeration
Values of this type MUST match one of the Nmtoken tokens in the declaration.
For interoperability, the same Nmtoken SHOULD NOT occur more than once in the enumerated attribute types of a single element type.
Attribute Defaults
An attribute declaration provides information on whether the attribute's presence is REQUIRED, and if not, how an XML processor is to react if a declared attribute is absent in a document.
In an attribute declaration, #REQUIRED means that the attribute MUST always be provided, #IMPLIED that no default value is provided. If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute MUST always have the default value. When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it MUST report the attribute with the declared default value to the application.
Required Attribute
If the default declaration is the keyword #REQUIRED, then the attribute MUST be specified for all elements of the type in the attribute-list declaration.
Attribute Default Value Syntactically Correct
The declared default value MUST meet the syntactic constraints of the declared attribute type. That is, the default value of an attribute:
of type IDREF or ENTITY must match the Name production;
of type IDREFS or ENTITIES must match the Names production;
of type NMTOKEN must match the Nmtoken production;
of type NMTOKENS must match the Nmtokens production;
of an enumerated type (either a NOTATION type or an enumeration) must match one of the enumerated values.
Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) will be reported by a validating parser only if an element without a specification for this attribute actually occurs.
Fixed Attribute Default
If an attribute has a default value declared with the #FIXED keyword, instances of that attribute MUST match the default value.
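The interaction of #REQUIRED, #IMPLIED, #FIXED and plain default values can be summarized in a short sketch. The function name, the layout of the declarations table and the sample attribute names are assumptions made for illustration; the sketch covers only the default-related rules above, not type checking.

```python
def effective_attributes(specified, declarations):
    """Return (attributes reported to the application, list of validity errors).

    `declarations` maps an attribute name to (keyword, default), where keyword
    is "#REQUIRED", "#IMPLIED", "#FIXED", or None for a plain default value.
    """
    reported = dict(specified)
    errors = []
    for name, (keyword, default) in declarations.items():
        if name in specified:
            # A #FIXED attribute may be specified, but only with the default value.
            if keyword == "#FIXED" and specified[name] != default:
                errors.append("fixed attribute %r must be %r" % (name, default))
        elif keyword == "#REQUIRED":
            errors.append("required attribute %r is missing" % name)
        elif keyword != "#IMPLIED":
            # Plain or #FIXED default: report the declared default value.
            reported[name] = default
    return reported, errors


declarations = {
    "id": ("#REQUIRED", None),
    "lang": ("#IMPLIED", None),
    "method": ("#FIXED", "POST"),
    "type": (None, "text"),
}
print(effective_attributes({"id": "a1"}, declarations))
print(effective_attributes({"method": "GET"}, declarations))
```

An absent #IMPLIED attribute is simply not reported; what the application does in that case is outside the scope of the sketch.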
Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.
All line breaks MUST have been normalized on input to #xA as described in End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
Begin with a normalized value consisting of the empty string.
For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
For a character reference, append the referenced character to the normalized value.
For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
For another character, append the character to the normalized value.
If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.
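The normalization algorithm above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: entity replacement texts are supplied in a plain dictionary, line breaks are assumed to have been normalized already, the five predefined entities are given the replacement texts that make them expand to their usual characters, and the names normalize_attribute_value, PREDEFINED and the sample entities d, a and da are hypothetical.

```python
import re

# Replacement texts for the predefined entities; "lt" and "amp" are themselves
# character references, which the recursive step then expands.
PREDEFINED = {"lt": "&#60;", "gt": ">", "amp": "&#38;", "apos": "'", "quot": '"'}

_TOKEN = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&([^&;\s]+);|(.)", re.DOTALL)


def _scan(text, entities, out, active):
    """Step 3 of the algorithm, applied recursively to entity replacement text."""
    for match in _TOKEN.finditer(text):
        hex_ref, dec_ref, name, char = match.groups()
        if hex_ref is not None:
            out.append(chr(int(hex_ref, 16)))   # character reference
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))       # character reference
        elif name is not None:
            if name in active:
                raise ValueError("recursive entity reference: " + name)
            if name not in entities:
                raise ValueError("undeclared entity: " + name)
            _scan(entities[name], entities, out, active + (name,))
        elif char in "\x20\x0D\x0A\x09":
            out.append(" ")                     # literal white space
        else:
            out.append(char)


def normalize_attribute_value(raw, entities=None, cdata=True):
    table = dict(PREDEFINED)
    table.update(entities or {})
    out = []
    _scan(raw, table, out, ())
    value = "".join(out)
    if not cdata:
        # Non-CDATA types: trim and collapse runs of space (#x20) only.
        value = re.sub(" +", " ", value.strip(" "))
    return value


entities = {"d": "\r", "a": "\n", "da": "\r\n"}
print(repr(normalize_attribute_value("&d;&d;A&a;&#x20;&a;B&da;", entities)))
print(repr(normalize_attribute_value("&d;&d;A&a;&#x20;&a;B&da;", entities, cdata=False)))
print(repr(normalize_attribute_value("&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;", entities)))
```

The three calls show the contrast drawn in the note above: white space reached through an entity reference becomes a space, whereas a character reference to #xD or #xA yields the referenced character itself.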
All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA.
It is an error if an attribute value contains a reference to an entity for which no declaration has been read.
Following are examples of attribute normalization. Given the following declarations:
<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">
Conditional sections are portions of the document type declaration external subset or of external parameter entities which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.
If any of the "<![", "[", or "]]>" of a conditional section is contained in the replacement text for a parameter-entity reference, all of them MUST be contained in the same replacement text.
Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space.
If the keyword of the conditional section is INCLUDE, then the contents of the conditional section MUST be processed as part of the DTD. If the keyword of the conditional section is IGNORE, then the contents of the conditional section MUST NOT be processed as part of the DTD. If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections MUST be ignored. The contents of an ignored conditional section MUST be parsed by ignoring all characters after the "[" following the keyword, except conditional section starts "<![" and ends "]]>", until the matching conditional section end is found. Parameter entity references MUST NOT be recognized in this process.
If the keyword of the conditional section is a parameter-entity reference, the parameter entity MUST be replaced by its content before the processor decides whether to include or ignore the conditional section.
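The rule for skipping an ignored conditional section (count nested "<![" starts and "]]>" ends, and recognize nothing else) can be sketched as a small scanner. The function name and calling convention are assumptions made for illustration: start is the index of the first character after the "[" that follows the keyword.

```python
def end_of_ignored_section(text, start):
    """Return the index just past the "]]>" that closes the ignored section."""
    depth = 1
    i = start
    while i < len(text):
        if text.startswith("<![", i):
            depth += 1          # nested conditional section start
            i += 3
        elif text.startswith("]]>", i):
            depth -= 1          # conditional section end
            i += 3
            if depth == 0:
                return i
        else:
            i += 1              # every other character is ignored
    raise ValueError("unterminated conditional section")


dtd = "<![IGNORE[ <!ELEMENT a ANY> <![INCLUDE[ %pe; ]]> ]]><!ELEMENT b ANY>"
start = dtd.index("[", 3) + 1
print(dtd[end_of_ignored_section(dtd, start):])
```

Note that the parameter-entity reference %pe; inside the ignored section is passed over like any other text, in line with the requirement that such references not be recognized there.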
An example:
<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >
<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>
Physical Structures
An XML document may consist of one or many storage units. These are called entities; they all have content and are all (except for the document entity and the external DTD subset) identified by entity name. Each XML document has one entity called the document entity, which serves as the starting point for the XML processor and may contain the whole document.
Entities may be either parsed or unparsed. The contents of a parsed entity are referred to as its replacement text; this text is considered an integral part of the document.
An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.
Parsed entities are invoked by name using entity references; unparsed entities by name, given in the value of ENTITY or ENTITIES attributes.
General entities are entities for use within the document content. In this specification, general entities are sometimes referred to with the unqualified term entity when this leads to no ambiguity. Parameter entities are parsed entities for use within the DTD. These two types of entities use different forms of reference and are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities.
Character and Entity References
A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.
Character Reference
CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
Legal Character
Characters referred to using character references MUST match the production for Char.
If the character reference begins with &#x, the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with &#, the digits up to the terminating ; provide a decimal representation of the character's code point.
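The two forms of character reference can be decoded with a few lines of code. The following is a minimal sketch; the function name expand_char_refs is hypothetical, and the sketch does not enforce the Legal Character constraint (it relies only on Python rejecting code points beyond #x10FFFF).

```python
import re

_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def expand_char_refs(text):
    """Replace decimal (&#60;) and hexadecimal (&#x3C;) character references."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        return chr(code_point)
    return _CHAR_REF.sub(replace, text)


print(expand_char_refs("(&#x3C;) and (&#60;) name the same character"))
```

A conforming processor would additionally check each code point against the Char production before accepting the reference.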
An entity reference refers to the content of a named entity. References to parsed general entities use ampersand (&) and semicolon (;) as delimiters. Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters.
In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with standalone='yes', for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference MUST match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity MUST precede any reference to it which appears in a default value in an attribute-list declaration.
Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.
Entity Declared
In a document with an external subset or parameter entity references with standalone='no', the Name given in the entity reference MUST match that in an entity declaration. For interoperability, valid documents SHOULD declare the entities amp, lt, gt, apos, quot, in the form specified in Predefined Entities. The declaration of a parameter entity MUST precede any reference to it. Similarly, the declaration of a general entity MUST precede any attribute-list declaration containing a default value with a direct or indirect reference to that general entity.
Parsed Entity
An entity reference MUST NOT contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.
No Recursion
A parsed entity MUST NOT contain a recursive reference to itself, either directly or indirectly.
In DTD
Parameter-entity references MUST NOT appear outside the DTD.
Examples of character and entity references:
Type <key>less-than</key> (&#x3C;) to save options.
This document was prepared on &docdate; and
is classified &security-level;.
The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor MAY issue a warning if entities are declared multiple times.
Internal Entities
If the entity definition is an EntityValue, the defined entity is called an internal entity. There is no separate physical storage object, and the content of the entity is given in the declaration. Note that some processing of entity and character references in the literal entity value may be required to produce the correct replacement text: see Construction of Entity Replacement Text.
An internal entity is a parsed entity.
Example of an internal entity declaration:
<!ENTITY Pub-Status "This is a pre-release of the specification.">
External Entities
If the entity is not internal, it is an external entity, declared as follows:
If the NDataDecl is present, this is a general unparsed entity; otherwise it is a parsed entity.
Notation Declared
The Name MUST match the declared name of a notation.
The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text. It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity. Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.
System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it MUST be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it SHOULD trigger escaping. When escaping does occur, it MUST be performed as follows:
Each character to be escaped is represented in UTF-8 as one or more bytes.
The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
The original character is replaced by the resulting character sequence.
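The three escaping steps above can be sketched directly from the list of characters to be escaped. The function name is hypothetical, and the use of upper-case hexadecimal digits is a choice of this sketch; the percent sign itself is not in the list, so escapes already present in a system identifier are left untouched.

```python
# Code points that must be escaped: controls, DEL, space, delimiters, "unwise".
_MUST_ESCAPE = set(range(0x00, 0x20)) | {
    0x7F, 0x20, 0x3C, 0x3E, 0x22, 0x7B, 0x7D, 0x7C, 0x5C, 0x5E, 0x60,
}


def escape_system_identifier(identifier):
    """Escape the characters listed above, plus everything above #x7F."""
    pieces = []
    for char in identifier:
        code_point = ord(char)
        if code_point in _MUST_ESCAPE or code_point > 0x7F:
            # Step 1: encode as UTF-8. Step 2: write each byte as %HH.
            pieces.append("".join("%%%02X" % byte for byte in char.encode("utf-8")))
        else:
            pieces.append(char)
    # Step 3: the escaped sequence replaces the original character.
    return "".join(pieces)


print(escape_system_identifier("../grafix/Open Hatch.gif"))
print(escape_system_identifier("caf\u00e9.xml"))
```

In keeping with the text above, a function of this kind would be called only at the point where the URI is actually dereferenced, not when relative references are resolved or passed between components.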
In addition to a system identifier, an external identifier may include a public identifier. An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference. If the processor is unable to do so, it MUST use the URI reference specified in the system literal. Before a match is attempted, all strings of white space in the public identifier MUST be normalized to single space characters (#x20), and leading and trailing white space MUST be removed.
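The normalization of public identifiers described above amounts to collapsing white space and trimming the ends. A minimal sketch, with a hypothetical function name, taking white space to mean the four characters #x20, #x9, #xD and #xA:

```python
import re


def normalize_public_id(public_id):
    """Collapse runs of XML white space to one space and trim both ends."""
    return re.sub(r"[\x20\x09\x0D\x0A]+", " ", public_id).strip(" ")


print(normalize_public_id("  -//Textuality//TEXT  Standard\n open-hatch boilerplate//EN "))
```

Two public identifiers that differ only in white space compare equal after this step, which is what makes catalog-style lookups on them reliable.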
Examples of external entity declarations:
<!ENTITY open-hatch
         SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY open-hatch
         PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
         "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY hatch-pic
         SYSTEM "../grafix/OpenHatch.gif"
         NDATA gif >
Parsed Entities
The Text Declaration
External parsed entities SHOULD each begin with a text declaration.
Text Declaration
TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
The text declaration MUST be provided literally, not by reference to a parsed entity. The text declaration MUST NOT appear at any position other than the beginning of an external parsed entity. The text declaration in an external parsed entity is not considered part of its replacement text.
Well-Formed Parsed Entities
The document entity is well-formed if it matches the production labeled document. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. All external parameter entities are well-formed by definition.
Only parsed entities that are referenced directly or indirectly within the document are required to be well-formed.
An internal general parsed entity is well-formed if its replacement text matches the production labeled content. All internal parameter entities are well-formed by definition.
A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.
Character Encoding in Entities
Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms UTF-8 and UTF-16 in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described in ISO/IEC 10646 or Unicode (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
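Using the Byte Order Mark to tell UTF-8 from UTF-16 can be sketched with the constants in Python's standard codecs module. The function name is hypothetical, and the sketch deliberately covers only the two required encodings: it does not examine an encoding declaration, and it would misreport a little-endian 32-bit encoding whose signature begins with the same two bytes as the UTF-16 one.

```python
import codecs


def sniff_byte_order_mark(data):
    """Return "UTF-8", "UTF-16", or None when the entity has no recognized BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16"
    return None


print(sniff_byte_order_mark("\ufeff<doc/>".encode("utf-8")))
print(sniff_byte_order_mark(codecs.BOM_UTF16_LE + "<doc/>".encode("utf-16-le")))
print(sniff_byte_order_mark(b"<doc/>"))
```

A result of None does not by itself indicate an error: a UTF-8 entity may omit the mark.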
Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see The Text Declaration) containing an encoding declaration:
Encoding Declaration
EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Encoding name contains only Latin characters */
In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.
In an encoding declaration, the values UTF-8, UTF-16, ISO-10646-UCS-2, and ISO-10646-UCS-4 SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values ISO-8859-1, ISO-8859-2, ... ISO-8859-n (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values ISO-2022-JP, Shift_JIS, and EUC-JP SHOULD be used for the various encoded forms of JIS X-0208-1997. It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority, other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an x- prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
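Case-insensitive matching of encoding names, with unknown names treated as unsupported, can be illustrated with the codec registry in Python's standard library. This is a sketch under a stated assumption: the registry is Python's own and is not the IANA charset registry, so it accepts some aliases that are not IANA-registered and lacks some names that are. The function name resolve_encoding is hypothetical.

```python
import codecs


def resolve_encoding(enc_name):
    """Return the registry's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(enc_name).name
    except LookupError:
        return None


# The same encoding is found regardless of the case used in the declaration.
print(resolve_encoding("EUC-JP") == resolve_encoding("euc-jp"))
# An unrecognized name is reported as unknown rather than guessed at.
print(resolve_encoding("x-no-such-encoding"))
```

A processor that receives None here would report that it cannot process the entity, as described in the text on fatal errors.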
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
It is a fatal error for a TextDecl to occur other than at the beginning of an external entity.
It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any irregular code unit sequences, as defined in Unicode. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
Examples of text declarations containing encoding declarations:
<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>
Version Information in Entities
Each entity, including the document entity, can be separately declared as XML 1.0 or XML 1.1. The version declaration appearing in the document entity determines the version of the document as a whole. An XML 1.1 document may invoke XML 1.0 external entities, so that otherwise duplicated versions of external entities, particularly DTD external subsets, need not be maintained. However, in such a case the rules of XML 1.1 are applied to the entire document.
If an entity (including the document entity) is not labeled with a version number, it is treated as if labeled as version 1.0.
XML Processor Treatment of Entities and References
The table below summarizes the contexts in which character references, entity references, and invocations of unparsed entities might appear and the REQUIRED behavior of an XML processor in each case. The labels in the leftmost column describe the recognition context:
Reference in Content: as a reference anywhere after the start-tag and before the end-tag of an element; corresponds to the nonterminal content.
Reference in Attribute Value: as a reference within either the value of an attribute in a start-tag, or a default value in an attribute declaration; corresponds to the nonterminal AttValue.
Occurs as Attribute Value: as a Name, not a reference, appearing either as the value of an attribute which has been declared as type ENTITY, or as one of the space-separated tokens in the value of an attribute which has been declared as type ENTITIES.
Reference in Entity Value: as a reference within a parameter or internal entity's literal entity value in the entity's declaration; corresponds to the nonterminal EntityValue.
Reference in DTD: as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral, PubidLiteral, or the contents of an ignored conditional section (see Conditional Sections).
Entity Type
Context                      | Parameter           | Internal General    | External Parsed General | Unparsed  | Character
Reference in Content         | Not recognized      | Included            | Included if validating  | Forbidden | Included
Reference in Attribute Value | Not recognized      | Included in literal | Forbidden               | Forbidden | Included
Occurs as Attribute Value    | Not recognized      | Forbidden           | Forbidden               | Notify    | Not recognized
Reference in EntityValue     | Included in literal | Bypassed            | Bypassed                | Error     | Included
Reference in DTD             | Included as PE      | Forbidden           | Forbidden               | Forbidden | Forbidden
Not Recognized
Outside the DTD, the % character has no special significance; thus, what would be parameter entity references in the DTD are not recognized as markup in content. Similarly, the names of unparsed entities are not recognized except when they appear in the value of an appropriately declared attribute.
Included
An entity is included when its replacement text is retrieved and processed, in place of the reference itself, as though it were part of the document at the location the reference was recognized. The replacement text may contain both character data and (except for parameter entities) markup, which MUST be recognized in the usual way. (The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand is not recognized as an entity-reference delimiter.) A character reference is included when the indicated character is processed in place of the reference itself.
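Inclusion can be observed with a small non-normative sketch. It assumes Python's standard xml.etree.ElementTree module as the XML processor; the element and entity names (memo, company, warn) are made up for the illustration.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE memo [
  <!ENTITY company "AT&amp;T;">
  <!ENTITY warn "<em>draft</em>">
]>
<memo><from>&company;</from><status>&warn;</status></memo>"""

root = ET.fromstring(doc)

# The replacement text "AT&amp;T;" is processed in place of the reference:
# &amp; yields "&", and the following "T;" is plain character data.
company_text = root.find("from").text

# Markup inside the replacement text is recognized in the usual way,
# so the em element becomes a child of status.
status = root.find("status")
emphasis = status.find("em")

print(company_text, emphasis.tag, emphasis.text)
```

The application sees the same tree it would have seen had the replacement text been typed directly at the point of reference.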
Included If Validating
When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.
This rule is based on the recognition that the automatic inclusion provided by the SGML and XML entity mechanism, primarily designed to support modularity in authoring, is not necessarily appropriate for other applications, in particular document browsing. Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.
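One way a non-validating processor can tell the application about an external entity it recognized but did not read is sketched below. The sketch is non-normative and assumes Python's standard xml.parsers.expat module; the handler name on_external_entity, the list skipped, and the file name chap1.xml are illustrative assumptions.

```python
import xml.parsers.expat

doc = """<!DOCTYPE book [
  <!ENTITY chap1 SYSTEM "chap1.xml">
]>
<book>&chap1;</book>"""

# (system identifier, public identifier) of entities recognized but not read.
skipped = []

def on_external_entity(context, base, system_id, public_id):
    # A browser-like application could record the reference here and
    # retrieve the entity only on demand.
    skipped.append((system_id, public_id))
    return 1  # nonzero: continue parsing without reading the entity

parser = xml.parsers.expat.ParserCreate()
parser.ExternalEntityRefHandler = on_external_entity
parser.Parse(doc, True)

print(skipped)
```

The document remains well-formed from the processor's point of view; the application simply learns that part of the content was not retrieved.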
Forbidden
The following are forbidden, and constitute fatal errors:
the appearance of a reference to an unparsed entity, except in the EntityValue in an entity declaration.
the appearance of any character or general-entity reference in the DTD except within an EntityValue or AttValue.
a reference to an external entity in an attribute value.
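Two of the forbidden cases listed above can be exercised with the following non-normative sketch. It assumes Python's standard xml.parsers.expat module as the XML processor; the helper fatal_error and the entity names are illustrative, and the exact wording of the error messages is left to the parser.

```python
import xml.parsers.expat

def fatal_error(doc: str):
    """Parse `doc`; return the parser's error message, or None if accepted."""
    try:
        xml.parsers.expat.ParserCreate().Parse(doc, True)
        return None
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)

dtd = """<!DOCTYPE d [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ENTITY ext SYSTEM "ext.xml">
  <!ENTITY int "text">
]>
"""

# A reference to an unparsed entity in content.
unparsed_in_content = fatal_error(dtd + "<d>&logo;</d>")

# A reference to an external entity in an attribute value.
external_in_attribute = fatal_error(dtd + '<d a="&ext;"/>')

# References to an internal entity are allowed in both places.
internal_ok = fatal_error(dtd + '<d a="&int;">&int;</d>')

print(unparsed_in_content, external_in_attribute, internal_ok)
```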
Included in Literal
When an entity reference appears in an attribute value, or a parameter entity reference appears in a literal entity value, its replacement text MUST be processed in place of the reference itself as though it were part of the document at the location the reference was recognized, except that a single or double quote character in the replacement text MUST always be treated as a normal data character and MUST NOT terminate the literal. For example, this is well-formed:
<!ENTITY % YN '"Yes"' >
<!ENTITY WhatHeSaid "He said %YN;" >
while this is not:
<!ENTITY EndAttr "27'" >
<element attribute='a-&EndAttr;>
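The treatment of quote characters can be checked with a short non-normative sketch, assuming Python's standard xml.etree.ElementTree module as the XML processor. The entity name inches and the element e are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE e [
  <!ENTITY inches '27"'>
]>
<e size="a-&inches;"/>"""

# The double quote inside the replacement text is ordinary data; only the
# quote written literally in the start-tag terminates the attribute value.
size = ET.fromstring(doc).get("size")

print(size)
```

Because the quote in the replacement text never acts as a delimiter, an attribute value cannot be closed from inside an entity.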
Notify
When the name of an unparsed entity appears as a token in the value of an attribute of declared type ENTITY or ENTITIES, a validating processor MUST inform the application of the system and public (if any) identifiers for both the entity and its associated notation.
Bypassed
When a general entity reference appears in the EntityValue in an entity declaration, it MUST be bypassed and left as is.
Included as PE
Just as with external parsed entities, parameter entities need only be included if validating. When a parameter-entity reference is recognized in the DTD and included, its replacement text MUST be enlarged by the attachment of one leading and one following space (#x20) character; the intent is to constrain the replacement text of parameter entities to contain an integral number of grammatical tokens in the DTD. This behavior MUST NOT apply to parameter entity references within entity values; these are described in.
Error
It is an error for a reference to an unparsed entity to appear in the EntityValue in an entity declaration.
Construction of Entity Replacement Text
In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue. For an external entity, the literal entity value is the exact text contained in the entity. For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references. For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding white space) if there is one but without any replacement of character references or parameter-entity references.
The literal entity value as given in an internal entity declaration (EntityValue) may contain character, parameter-entity, and general-entity references. Such references MUST be contained entirely within the literal entity value. The actual replacement text that is included (or included in literal) as described above MUST contain the replacement text of any parameter entities referred to, and MUST contain the character referred to, in place of any character references in the literal entity value; however, general-entity references MUST be left as-is, unexpanded. For example, given the following declarations:
<!ENTITY % pub    "&#xc9;ditions Gallimard" >
<!ENTITY   rights "All rights reserved" >
<!ENTITY   book   "La Peste: Albert Camus,
&#xA9; 1947 %pub;. &rights;" >
then the replacement text for the entity "book" is:
La Peste: Albert Camus,
© 1947 Éditions Gallimard. &rights;
The general-entity reference "&rights;" would be expanded should the reference "&book;" appear in the document's content or an attribute value.
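The difference between the stored replacement text and the fully expanded content can be observed with the following non-normative sketch. It assumes Python's standard xml.parsers.expat module; the handler name on_entity_decl and the collections replacement and content are illustrative. The parameter-entity reference is left out of this variant because parameter-entity references inside markup declarations are permitted only in the external subset, and the sketch uses an internal subset.

```python
import xml.parsers.expat

doc = """<!DOCTYPE d [
  <!ENTITY rights "All rights reserved">
  <!ENTITY book "La Peste: Albert Camus, &#xA9; 1947. &rights;">
]>
<d>&book;</d>"""

replacement = {}  # entity name -> replacement text as stored by the parser
content = []      # character data delivered for the document content

def on_entity_decl(name, is_parameter, value, base,
                   system_id, public_id, notation_name):
    replacement[name] = value

parser = xml.parsers.expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.CharacterDataHandler = content.append
parser.Parse(doc, True)

text = "".join(content)
print(replacement["book"])
print(text)
```

In the stored replacement text the character reference has already been replaced while the general-entity reference is still present; it is expanded only when the reference to book is included in content.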
These simple rules may have complex interactions; for a detailed discussion of a difficult example, see.
Predefined Entities
Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and MUST be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.
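Both escaping mechanisms can be seen in a short non-normative sketch, assuming Python's standard xml.etree.ElementTree module as the XML processor; the element name t is arbitrary.

```python
import xml.etree.ElementTree as ET

# No DTD at all: the five predefined entities are still recognized.
predefined = ET.fromstring("<t>&lt;&amp;&gt;&apos;&quot;</t>").text

# Numeric character references (decimal or hexadecimal) escape the same
# delimiters and are delivered as character data.
numeric = ET.fromstring("<t>&#60;&#38;&#x3C;</t>").text

print(predefined, numeric)
```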
All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:
<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
Notation Declarations
Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed.
Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.
Notation Declarations
NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'
PublicID ::= 'PUBLIC' S PubidLiteral
Unique Notation Name
A given Name MUST NOT be declared in more than one notation declaration.
XML processors MUST provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They MAY additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described. (It is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)
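How a processor might hand notation and unparsed-entity identifiers to the application is sketched below. The sketch is non-normative and assumes Python's standard xml.parsers.expat module, which reports these declarations as they are read rather than when an ENTITY attribute is encountered; the handler names, the public identifier, and gifview.exe are illustrative assumptions.

```python
import xml.parsers.expat

doc = """<!DOCTYPE gallery [
  <!NOTATION gif PUBLIC "-//Example//NOTATION GIF//EN" "gifview.exe">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
  <!ATTLIST gallery cover ENTITY #IMPLIED>
]>
<gallery cover="logo"/>"""

notations = {}  # notation name -> (system identifier, public identifier)
unparsed = {}   # entity name -> (system id, public id, notation name)

def on_notation(name, base, system_id, public_id):
    notations[name] = (system_id, public_id)

def on_entity_decl(name, is_parameter, value, base,
                   system_id, public_id, notation_name):
    # Only unparsed entities carry a notation name.
    if notation_name is not None:
        unparsed[name] = (system_id, public_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity_decl
parser.Parse(doc, True)

print(notations)
print(unparsed)
```

With both tables available, the application can look up the token found in the cover attribute, follow it to its notation, and decide whether a helper application for that notation exists.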
Document Entity
The document entity serves as the root of the entity tree and a starting-point for an XML processor. This specification does not specify how the document entity is to be located by an XML processor; unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.
Conformance
Validating and Non-Validating Processors
Conforming XML processors fall into two classes: validating and non-validating.
Validating and non-validating processors alike MUST report violations of this specification's well-formedness constraints in the content of the document entity and any other parsed entities that they read.
Validating processors MUST, at user option, report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification. To accomplish this, validating XML processors MUST read and process the entire DTD and all external parsed entities referenced in the document.
Non-validating processors are REQUIRED to check only the document entity, including the entire internal DTD subset, for well-formedness. While they are not required to check the document for validity, they are REQUIRED to process all the declarations they read in the internal DTD subset and in any parameter entity that they read, up to the first reference to a parameter entity that they do not read; that is to say, they MUST use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values. Except when standalone="yes", they MUST NOT process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations; when standalone="yes", processors MUST process these declarations.
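The three uses of internal-subset declarations named above can be observed with a non-normative sketch, assuming Python's standard xml.etree.ElementTree module (a non-validating processor) and hypothetical names item, kind, ref, and owner.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE item [
  <!ATTLIST item
     kind CDATA   "note"
     ref  NMTOKEN #IMPLIED>
  <!ENTITY owner "editor">
]>
<item ref="  a1  ">&owner;</item>"""

item = ET.fromstring(doc)

default_value = item.get("kind")  # supplied from the ATTLIST default
normalized = item.get("ref")      # NMTOKEN value with outer spaces removed
included = item.text              # replacement text of the internal entity

print(default_value, normalized, included)
```

None of this requires validation; the processor only has to use the declarations it has already read.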
Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information. For example, several requirements for uniqueness within the document may not be met, including more than one element with the same id, duplicate declarations of elements or notations with the same name, etc. In these cases the behavior of the parser with respect to reporting such information to the application is undefined.
XML 1.1 processors MUST be able to process both XML 1.0 and XML 1.1 documents. Programs which generate XML SHOULD generate XML 1.0, unless one of the specific features of XML 1.1 is required.
Using XML Processors
The behavior of a validating XML processor is highly predictable; it must read every piece of a document and report all well-formedness and validity violations. Less is required of a non-validating processor; it need not read any part of the document other than the document entity. This has two effects that may be important to users of XML processors:
Certain well-formedness errors, specifically those that require reading external entities, may fail to be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as forbidden in.
The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities. For example, a non-validating processor may fail to normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities.
For maximum reliability in interoperating between different XML processors, applications which use non-validating processors SHOULD NOT rely on any behaviors not required of such processors. Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities that are or may be specified in external entities) SHOULD use validating XML processors.
Notation
The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
#xN
where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
[a-zA-Z], [#xN-#xN]
matches any Char with a value in the range(s) indicated (inclusive).
[abc], [#xN#xN#xN]
matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
[^a-z], [^#xN-#xN]
matches any Char with a value outside the range indicated.
[^abc], [^#xN#xN#xN]
matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.
"string"
matches a literal string matching that given inside the double quotes.
'string'
matches a literal string matching that given inside the single quotes.
These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
(expression)
expression is treated as a unit and may be combined as described in this list.
A?
matches A or nothing; optional A.
A B
matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B
matches A or B.
A - B
matches any string that matches A but does not match B.
A+
matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A*
matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
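As a non-normative illustration, several of the constructs described above map directly onto regular expressions. The sketch below uses Python's standard re module and two productions from elsewhere in this specification, S and PITarget. The pattern NAME is a deliberately simplified, ASCII-only stand-in for the Name production and is an assumption of the example, not the real definition.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = re.compile(r"[\x20\x09\x0D\x0A]+")

# Simplified ASCII-only stand-in for Name (assumption for this sketch).
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"

# PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
# "A - B" is expressed with a negative lookahead that excludes B.
PI_TARGET = re.compile(r"(?![Xx][Mm][Ll]\Z)" + NAME)

def matches(pattern, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None

print(matches(S, " \t\r\n"), matches(PI_TARGET, "xml-stylesheet"),
      matches(PI_TARGET, "XmL"))
```

The exclusion applies only to a string that matches B in its entirety, so a longer name that merely begins with the excluded letters still matches.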
Other notations used in the productions are:
/* ... */
comment.
[ wfc: ... ]
well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.
[ vc: ... ]
validity constraint; this identifies by name a constraint on valid documents associated with a production.