W3C

Extensible Markup Language (XML) 1.0 (Fifth Edition)

W3C Recommendation 26 November 2008

Note: On 7 February 2013, this specification was modified in place to replace broken links to RFC4646 and RFC4647.

This version:
http://www.w3.org/TR/2008/REC-xml-20081126/
Latest version:
http://www.w3.org/TR/xml/
Previous versions:
http://www.w3.org/TR/2008/PER-xml-20080205/
http://www.w3.org/TR/2006/REC-xml-20060816/
Editors:
Tim Bray, Textuality and Netscape <tbray@textuality.com>
Jean Paoli, Microsoft <jeanpa@microsoft.com>
C. M. Sperberg-McQueen, W3C <cmsmcq@w3.org>
Eve Maler, Sun Microsystems, Inc. <eve.maler@east.sun.com>
François Yergeau

Please refer to the errata for this document, which may include some normative corrections.

The previous errata for this document are also available.

See also translations.

This document is also available in these non-normative formats: XML and XHTML with color-coded revision indicators.

Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.


Abstract

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web. It is a product of the XML Core Working Group as part of the XML Activity. The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/2003/03/Translations/byTechnology?technology=xml.

This document is a W3C Recommendation. This fifth edition is not a new version of XML. As a convenience to readers, it incorporates the changes dictated by the accumulated errata (available at http://www.w3.org/XML/xml-V10-4e-errata) to the Fourth Edition of XML 1.0, dated 16 August 2006. In particular, erratum [E09] relaxes the restrictions on element and attribute names, thereby providing in XML 1.0 the major end user benefit currently achievable only by using XML 1.1. As a consequence, many possible documents which were not well-formed according to previous editions of this specification are now well-formed, and previously invalid documents using the newly-allowed name characters in, for example, ID attributes, are now valid.

This edition supersedes the previous W3C Recommendation of 16 August 2006.

Please report errors in this document to the public xml-editor@w3.org mail list; public archives are available. For the convenience of readers, an XHTML version with color-coded revision indicators is also provided; this version highlights each change due to an erratum published in the errata list for the previous edition, together with a link to the particular erratum in that list. Most of the errata in the list provide a rationale for the change. The errata list for this fifth edition is available at http://www.w3.org/XML/xml-V10-5e-errata.

An implementation report is available at http://www.w3.org/XML/2008/01/xml10-5e-implementation.html. A Test Suite is maintained to help assessing conformance to this specification.

This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
    1.1 Origin and Goals
    1.2 Terminology
2 Documents
    2.1 Well-Formed XML Documents
    2.2 Characters
    2.3 Common Syntactic Constructs
    2.4 Character Data and Markup
    2.5 Comments
    2.6 Processing Instructions
    2.7 CDATA Sections
    2.8 Prolog and Document Type Declaration
    2.9 Standalone Document Declaration
    2.10 White Space Handling
    2.11 End-of-Line Handling
    2.12 Language Identification
3 Logical Structures
    3.1 Start-Tags, End-Tags, and Empty-Element Tags
    3.2 Element Type Declarations
        3.2.1 Element Content
        3.2.2 Mixed Content
    3.3 Attribute-List Declarations
        3.3.1 Attribute Types
        3.3.2 Attribute Defaults
        3.3.3 Attribute-Value Normalization
    3.4 Conditional Sections
4 Physical Structures
    4.1 Character and Entity References
    4.2 Entity Declarations
        4.2.1 Internal Entities
        4.2.2 External Entities
    4.3 Parsed Entities
        4.3.1 The Text Declaration
        4.3.2 Well-Formed Parsed Entities
        4.3.3 Character Encoding in Entities
    4.4 XML Processor Treatment of Entities and References
        4.4.1 Not Recognized
        4.4.2 Included
        4.4.3 Included If Validating
        4.4.4 Forbidden
        4.4.5 Included in Literal
        4.4.6 Notify
        4.4.7 Bypassed
        4.4.8 Included as PE
        4.4.9 Error
    4.5 Construction of Entity Replacement Text
    4.6 Predefined Entities
    4.7 Notation Declarations
    4.8 Document Entity
5 Conformance
    5.1 Validating and Non-Validating Processors
    5.2 Using XML Processors
6 Notation

Appendices

A References
    A.1 Normative References
    A.2 Other References
B Character Classes
C XML and SGML (Non-Normative)
D Expansion of Entity and Character References (Non-Normative)
E Deterministic Content Models (Non-Normative)
F Autodetection of Character Encodings (Non-Normative)
    F.1 Detection Without External Encoding Information
    F.2 Priorities in the Presence of External Encoding Information
G W3C XML Working Group (Non-Normative)
H W3C XML Core Working Group (Non-Normative)
I Production Notes (Non-Normative)
J Suggestions for XML Names (Non-Normative)


1 Introduction

Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

[Definition: A software module called an XML processor is used to read XML documents and provide access to their content and structure.] [Definition: It is assumed that an XML processor is doing its work on behalf of another module, called the application.] This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

1.1 Origin and Goals

XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the Working Group's contact with the W3C.

The design goals for XML are:

  1. XML shall be straightforwardly usable over the Internet.

  2. XML shall support a wide variety of applications.

  3. XML shall be compatible with SGML.

  4. It shall be easy to write programs which process XML documents.

  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

  6. XML documents should be human-legible and reasonably clear.

  7. The XML design should be prepared quickly.

  8. The design of XML shall be formal and concise.

  9. XML documents shall be easy to create.

  10. Terseness in XML markup is of minimal importance.

This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO/IEC 10646] for characters, Internet BCP 47 [IETF BCP 47] and the Language Subtag Registry [IANA-LANGCODES] for language identification tags), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.

1.2 Terminology

The terminology used to describe XML documents is defined in the body of this specification. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when EMPHASIZED, are to be interpreted as described in [IETF RFC 2119]. In addition, the terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:

error

[Definition: A violation of the rules of this specification; results are undefined. Unless otherwise specified, failure to observe a prescription of this specification indicated by one of the keywords MUST, REQUIRED, MUST NOT, SHALL and SHALL NOT is an error. Conforming software MAY detect and report an error and MAY recover from it.]

fatal error

[Definition: An error which a conforming XML processor MUST detect and report to the application. After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way).]

at user option

[Definition: Conforming software MAY or MUST (depending on the modal verb in the sentence) behave as described; if it does, it MUST provide users a means to enable or disable the behavior described.]

validity constraint

[Definition: A rule which applies to all valid XML documents. Violations of validity constraints are errors; they MUST, at user option, be reported by validating XML processors.]

well-formedness constraint

[Definition: A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are fatal errors.]

match

[Definition: (Of strings or names:) Two strings or names being compared are identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production. (Of content and content models:) An element matches its declaration when it conforms in the fashion described in the constraint [VC: Element Valid].]

for compatibility

[Definition: Marks a sentence describing a feature of XML included solely to ensure that XML remains compatible with SGML.]

for interoperability

[Definition: Marks a sentence describing a non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the WebSGML Adaptations Annex to ISO 8879.]

2 Documents

[Definition: A data object is an XML document if it is well-formed, as defined in this specification. In addition, the XML document is valid if it meets certain further constraints.]

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures MUST nest properly, as described in 4.3.2 Well-Formed Parsed Entities.

2.1 Well-Formed XML Documents

[Definition: A textual object is a well-formed XML document if:]

  1. Taken as a whole, it matches the production labeled document.

  2. It meets all the well-formedness constraints given in this specification.

  3. Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.

Document
[1]   document   ::=   prolog element Misc*

Matching the document production implies that:

  1. It contains one or more elements.

  2. [Definition: There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.] For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.

[Definition: As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.]
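
The single-root and proper-nesting requirements can be observed with an existing parser. The following is a minimal Python sketch, assuming the standard library's xml.etree.ElementTree (a non-validating, expat-based parser) as a stand-in for an XML processor; the helper name is_well_formed is hypothetical, and the underlying parser may differ from this edition in details such as permitted name characters.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a complete document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser reports the fatal error instead of continuing.
        return False
    return True


print(is_well_formed("<a><b>ok</b></a>"))   # properly nested, one root
print(is_well_formed("<a><b>bad</a></b>"))  # start- and end-tags overlap
print(is_well_formed("<a/><b/>"))           # more than one root element
```

The parser raising an exception rather than returning a partial tree mirrors the rule that normal processing stops once a fatal error is detected.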

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]

Character Range
[2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]   /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
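
Production [2] translates directly into a range test on code points. A minimal Python sketch follows; the function name is_xml_char is hypothetical and chosen only for illustration.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point `cp` matches the Char production [2]."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to the start of the surrogates
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, minus FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


print(is_xml_char(ord("A")))   # True
print(is_xml_char(0x0))        # False: NUL is not a legal character
print(is_xml_char(0xD800))     # False: surrogate block
```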

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].

2.3 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.

White Space
[3]   S   ::=   (#x20 | #x9 | #xD | #xA)+

Note:

The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
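
Production [3] names exactly four characters, which is narrower than what many languages treat as white space. A small Python sketch, with the hypothetical helper name is_s, makes the difference visible:

```python
import re

# Only the four characters listed in production [3].
S_PATTERN = re.compile(r"[\x20\x09\x0D\x0A]+")


def is_s(text: str) -> bool:
    """Return True if `text` matches the S production in full."""
    return S_PATTERN.fullmatch(text) is not None


print(is_s(" \t\r\n"))   # True
print(is_s("\u00a0"))    # False: NO-BREAK SPACE is not XML white space
print(is_s(""))          # False: S requires at least one character
```

Python's own str.isspace() accepts characters such as NO-BREAK SPACE, so it is not a substitute for this production.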

An Nmtoken (name token) is any mixture of name characters.

[Definition: A Name is an Nmtoken with a restricted set of initial characters.] Disallowed initial characters for Names include digits, diacritics, the full stop and the hyphen.

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Note:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.

Names and Tokens
[4]   NameStartChar   ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]   NameChar   ::=   NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name   ::=   NameStartChar (NameChar)*
[6]   Names   ::=   Name (#x20 Name)*
[7]   Nmtoken   ::=   (NameChar)+
[8]   Nmtokens   ::=   Nmtoken (#x20 Nmtoken)*

Note:

The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see 3.3.1 Attribute Types).
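
Productions [4] through [7] can be checked with plain range tables. The sketch below is a minimal Python rendering of those productions; all function and variable names are hypothetical, and the ranges are copied from the productions above.

```python
# Ranges for NameStartChar, production [4].
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional ranges allowed by NameChar, production [4a]:
# "-" and ".", digits, MIDDLE DOT, combining marks, and #x203F-#x2040.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch: str) -> bool:
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_name(text: str) -> bool:
    """Production [5]: NameStartChar (NameChar)*."""
    return (bool(text) and is_name_start_char(text[0])
            and all(is_name_char(c) for c in text[1:]))


def is_nmtoken(text: str) -> bool:
    """Production [7]: (NameChar)+."""
    return bool(text) and all(is_name_char(c) for c in text)


print(is_name("greeting"), is_name("1abc"), is_nmtoken("1abc"))
```

A string such as "1abc" is an Nmtoken but not a Name, because a digit is a NameChar but not a NameStartChar.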

Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.

Literals
[9]   EntityValue   ::=   '"' ([^%&"] | PEReference | Reference)* '"'
|  "'" ([^%&'] | PEReference | Reference)* "'"
[10]   AttValue   ::=   '"' ([^<&"] | Reference)* '"'
|  "'" ([^<&'] | Reference)* "'"
[11]   SystemLiteral   ::=   ('"' [^"]* '"') | ("'" [^']* "'")
[12]   PubidLiteral   ::=   '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13]   PubidChar   ::=   #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

Note:

Although the EntityValue production allows the definition of a general entity consisting of a single explicit < in the literal (e.g., <!ENTITY mylt "<">), it is strongly advised to avoid this practice since any reference to that entity will cause a well-formedness error.

2.4 Character Data and Markup

Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]

[Definition: All text that is not markup constitutes the character data of the document.]

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

Character Data
[14]   CharData   ::=   [^<&]* - ([^<&]* ']]>' [^<&]*)
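
The escaping rules of this section can be applied with an existing helper rather than by hand. A minimal Python sketch, assuming the standard library function xml.sax.saxutils.escape, which replaces "&", "<" and ">" and accepts an optional mapping of further replacements:

```python
from xml.sax.saxutils import escape

# "&" and "<" are always escaped; escaping every ">" also covers
# the "]]>" case required for compatibility.
print(escape("a < b && c ]]> d"))
# prints: a &lt; b &amp;&amp; c ]]&gt; d

# Quotes are escaped only when an extra mapping is supplied.
quotes = {"'": "&apos;", '"': "&quot;"}
print(escape("it's \"quoted\"", quotes))
# prints: it&apos;s &quot;quoted&quot;
```

Escaping every ">" is stricter than the specification demands; it is simply the behavior of this particular helper.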

2.5 Comments

[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments.] Parameter entity references MUST NOT be recognized within comments.

Comments
[15]   Comment   ::=   '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

An example of a comment:

<!-- declarations for <head> & <body> -->

Note that the grammar does not allow a comment ending in --->. The following example is not well-formed.

<!-- B+, B, or B--->
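
Both comment examples can be tried against a parser. This is a minimal Python sketch, assuming xml.etree.ElementTree from the standard library; the wrapping doc element and the helper name parses are hypothetical additions needed to make each comment part of a complete document.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if the parser accepts `text` without a fatal error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# "<" and "&" are allowed inside a comment.
print(parses("<doc><!-- declarations for <head> & <body> --></doc>"))

# A comment may not contain "--", so it cannot end in "--->".
print(parses("<doc><!-- B+, B, or B---></doc>"))
```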

2.6 Processing Instructions

[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]

Processing Instructions
[16]   PI   ::=   '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17]   PITarget   ::=   Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

PIs are not part of the document's character data, but MUST be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references MUST NOT be recognized within processing instructions.
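
How a PI reaches the application can be illustrated with a DOM parser. The sketch below is in Python and assumes xml.dom.minidom from the standard library; the target name render, its data, and the helper is_reserved_pi_target are hypothetical.

```python
import re
from xml.dom import minidom

doc = minidom.parseString('<?render mode="fast"?><doc/>')
pi = doc.firstChild  # the PI precedes the document element

# The target and the remaining data are exposed separately.
print(pi.target)  # render
print(pi.data)    # mode="fast"


def is_reserved_pi_target(target: str) -> bool:
    """Return True for the targets excluded by production [17]."""
    return re.fullmatch(r"[Xx][Mm][Ll]", target) is not None


print(is_reserved_pi_target("XmL"))     # True
print(is_reserved_pi_target("render"))  # False
```

The data is handed over as an uninterpreted string; the pseudo-attribute syntax used here is a convention of the hypothetical application, not something the PI production defines.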

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

CDATA Sections
[18]   CDSect   ::=   CDStart CData CDEnd
[19]   CDStart   ::=   '<![CDATA['
[20]   CData   ::=   (Char* - (Char* ']]>' Char*))
[21]   CDEnd   ::=   ']]>'

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

An example of a CDATA section, in which "<greeting>" and "</greeting>" are recognized as character data, not markup:

<![CDATA[<greeting>Hello, world!</greeting>]]>
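
The effect of the CDATA section above can be confirmed by parsing it. A minimal Python sketch, assuming xml.etree.ElementTree; the enclosing doc element is a hypothetical wrapper, since a CDATA section must appear inside element content.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<doc><![CDATA[<greeting>Hello, world!</greeting>]]></doc>"
)

# The tags inside the CDATA section arrive as character data.
print(root.text)   # <greeting>Hello, world!</greeting>
print(len(root))   # 0, no child elements were created
```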

2.8 Prolog and Document Type Declaration

[Definition: XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.] For example, the following is a complete XML document, well-formed but not valid:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

and so is this:

<greeting>Hello, world!</greeting>
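
That both forms are accepted and carry the same content can be checked directly. A short Python sketch, assuming xml.etree.ElementTree from the standard library:

```python
import xml.etree.ElementTree as ET

with_decl = '<?xml version="1.0"?>\n<greeting>Hello, world!</greeting>'
without_decl = "<greeting>Hello, world!</greeting>"

a = ET.fromstring(with_decl)
b = ET.fromstring(without_decl)

# The XML declaration is markup, so it contributes nothing to the content.
print(a.tag == b.tag and a.text == b.text)  # True
```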

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute name-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. [Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.]

The document type declaration MUST appear before the first element in the document.

Prolog
[22]   prolog   ::=   XMLDecl? Misc* (doctypedecl Misc*)?
[23]   XMLDecl   ::=   '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
[24]   VersionInfo   ::=   S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
[25]   Eq   ::=   S? '=' S?
[26]   VersionNum   ::=   '1.' [0-9]+
[27]   Misc   ::=   Comment | PI | S

Even though the VersionNum production matches any version number of the form '1.x', XML 1.0 documents SHOULD NOT specify a version number other than '1.0'.

Note:

When an XML 1.0 processor encounters a document that specifies a 1.x version number other than '1.0', it will process it as a 1.0 document. This means that an XML 1.0 processor will accept 1.x documents provided they do not use any non-1.0 features.
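
Production [26] is small enough to express as a regular expression. The following Python sketch uses the hypothetical helper name is_version_num; it tests only the grammar, not the SHOULD NOT recommendation about values other than '1.0'.

```python
import re

VERSION_NUM = re.compile(r"1\.[0-9]+")


def is_version_num(text: str) -> bool:
    """Return True if `text` matches VersionNum, production [26]."""
    return VERSION_NUM.fullmatch(text) is not None


print(is_version_num("1.0"))   # True
print(is_version_num("1.7"))   # True: grammatical, though discouraged
print(is_version_num("2.0"))   # False
```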

[Definition: The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.]

[Definition: A markup declaration is an element type declaration, an attribute-list declaration, an entity declaration, or a notation declaration.] These declarations may be contained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For further information, see 4 Physical Structures.

Document Type Definition
[28]   doctypedecl   ::=   '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'   [VC: Root Element Type]
[WFC: External Subset]
[28a]   DeclSep   ::=   PEReference | S   [WFC: PE Between Declarations]
[28b]   intSubset   ::=   (markupdecl | DeclSep)*
[29]   markupdecl   ::=   elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment   [VC: Proper Declaration/PE Nesting]
[WFC: PEs in Internal Subset]

Note that it is possible to construct a well-formed document containing a doctypedecl that neither points to an external subset nor contains an internal subset.

The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.

Parameter entity references are recognized anywhere in the DTD (internal and external subsets and external parameter entities), except in literals, processing instructions, comments, and the contents of ignored conditional sections (see 3.4 Conditional Sections). They are also recognized in entity value literals. The use of parameter entities in the internal subset is restricted as described below.

Validity constraint: Root Element Type

The Name in the document type declaration MUST match the element type of the root element.

Validity constraint: Proper Declaration/PE Nesting

Parameter-entity replacement text MUST be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both MUST be contained in the same replacement text.

Well-formedness constraint: PEs in Internal Subset

In the internal DTD subset, parameter-entity references MUST NOT occur within markup declarations; they may occur where markup declarations can occur. (This does not apply to references that occur in external parameter entities or to the external subset.)

Well-formedness constraint: External Subset

The external subset, if any, MUST match the production for extSubset.

Well-formedness constraint: PE Between Declarations

The replacement text of a parameter entity reference in a DeclSep MUST match the production extSubsetDecl.

Like the internal subset, the external subset and any external parameter entities referenced in a DeclSep MUST consist of a series of complete markup declarations of the types allowed by the non-terminal symbol markupdecl, interspersed with white space or parameter-entity references. However, portions of the contents of the external subset or of these external parameter entities may conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset but is allowed in external parameter entities referenced in the internal subset.

External Subset
[30]   extSubset   ::=   TextDecl? extSubsetDecl
[31]   extSubsetDecl   ::=   (markupdecl | conditionalSect | DeclSep)*

The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references are permitted within markup declarations, not only between markup declarations.

An example of an XML document with a document type declaration:

<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>

The system identifier "hello.dtd" gives the address (a URI reference) of a DTD for the document.

The declarations can also be given locally, as in this example:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>

If both the external and internal subsets are used, the internal subset MUST be considered to occur before the external subset. This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.

2.9 Standalone Document Declaration

Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity or in parameter entities. [Definition: An external markup declaration is defined as a markup declaration occurring in the external subset or in a parameter entity (external or internal, the latter being included because non-validating processors are not required to read them).]

Standalone Document Declaration
[32]   SDDecl   ::=   S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))   [VC: Standalone Document Declaration]

In a standalone document declaration, the value "yes" indicates that there are no external markup declarations which affect the information passed from the XML processor to the application. The value "no" indicates that there are or may be such external markup declarations. Note that the standalone document declaration only denotes the presence of external declarations; the presence, in a document, of references to external entities, when those entities are internally declared, does not change its standalone status.

If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "no" is assumed.

Any XML document for which standalone="no" holds can be converted algorithmically to a standalone document, which may be desirable for some network delivery applications.

Validity constraint: Standalone Document Declaration

The standalone document declaration MUST have the value "no" if any external markup declarations contain declarations of:

  • attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or

  • entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or

  • attributes with tokenized types, where the attribute appears in the document with a value such that normalization will produce a different value from that which would be produced in the absence of the declaration, or

  • element types with element content, if white space occurs directly within any instance of those types.

An example XML declaration with a standalone document declaration:

<?xml version="1.0" standalone='yes'?>
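
As an illustration only, the following Python sketch reads the standalone document declaration from a document using the expat bindings in the Python standard library. The helper name read_standalone is hypothetical, and the sketch assumes the parser's XmlDeclHandler callback reports 1 for "yes", 0 for "no", and -1 when the declaration is omitted.

```python
from xml.parsers import expat

def read_standalone(document: str):
    """Return 'yes', 'no', or None when no standalone declaration is present."""
    seen = {"standalone": -1}  # -1 means "not declared"

    def on_xml_decl(version, encoding, standalone):
        # Called once, when the XML declaration is parsed.
        seen["standalone"] = standalone

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)
    return {1: "yes", 0: "no", -1: None}[seen["standalone"]]

print(read_standalone("<?xml version=\"1.0\" standalone='yes'?><greeting/>"))
```

Note that the sketch only reports what the document declares; it does not check the validity constraint above, which requires knowledge of the external markup declarations.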

2.10 White Space Handling

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.

An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.

A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, MUST be declared if it is used. When declared, it MUST be given as an enumerated type whose values are one or both of "default" and "preserve". For example:

<!ATTLIST poem  xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>

The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute. This specification does not give meaning to any value of xml:space other than "default" and "preserve". It is an error for other values to be specified; the XML processor MAY report the error or MAY recover by ignoring the attribute specification or by reporting the (erroneous) value to the application. Applications may ignore or reject erroneous values.

The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
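
The inheritance rule for xml:space can be sketched from an application's point of view as follows. This is a minimal Python illustration, not part of this specification: the function name effective_space is hypothetical, the sketch assumes the ElementTree convention of exposing xml:space under the XML namespace name, and it chooses to ignore erroneous values (one of the recovery options permitted above).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the XML namespace name.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root):
    """Map each element to 'default', 'preserve', or None (no intent signaled)."""
    result = {}

    def walk(elem, inherited):
        declared = elem.get(XML_SPACE)
        # An erroneous value is ignored here, so the inherited intent stays in force.
        current = declared if declared in ("default", "preserve") else inherited
        result[elem] = current
        for child in elem:
            walk(child, current)

    walk(root, None)  # the root starts with no signaled intent
    return result

doc = ET.fromstring(
    '<poem xml:space="preserve"><stanza><line>a</line></stanza>'
    '<note xml:space="default"><p/></note></poem>'
)
for elem, value in effective_space(doc).items():
    print(elem.tag, value)
```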

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
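
The translation described above can be sketched in a few lines of Python. The function name normalize_line_breaks is hypothetical; the order of the two replacements matters, since the two-character sequence must be handled before any remaining lone #xD.

```python
def normalize_line_breaks(text: str) -> str:
    """Translate #xD #xA pairs and lone #xD characters to a single #xA."""
    # Replace the two-character sequence first so it yields one #xA, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_line_breaks("a\r\nb\rc\nd")))
```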

2.12 Language Identification

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, MUST be declared if it is used. The values of the attribute are language identifiers as defined by [IETF BCP 47], Tags for the Identification of Languages; in addition, the empty string may be specified.

(Productions 33 through 38 have been removed.)

For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>
  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>

The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang. In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors. Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang.

Note:

Language information may also be provided by external transport protocols (e.g. HTTP or MIME). When available, this information may be used by XML applications, but the more local information provided by xml:lang should be considered to override it.
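
The scoping rule for xml:lang, including the empty-value override, can be sketched as follows. This Python illustration is not normative: the function name effective_languages is hypothetical, and it assumes ElementTree's convention of exposing xml:lang under the XML namespace name. Language information from transport protocols is not modeled.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(root):
    """Yield (element, language) pairs; language is None when none applies."""
    def walk(elem, inherited):
        declared = elem.get(XML_LANG)
        if declared is None:
            current = inherited   # no override: the enclosing value applies
        elif declared == "":
            current = None        # empty value: no language information
        else:
            current = declared
        yield elem, current
        for child in elem:
            yield from walk(child, current)
    return walk(root, None)

doc = ET.fromstring(
    '<doc xml:lang="en"><p>x</p><sp xml:lang="de"><l>y</l></sp>'
    '<code xml:lang=""><kw/></code></doc>'
)
for elem, lang in effective_languages(doc):
    print(elem.tag, lang)
```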

A simple declaration for xml:lang might take the form

xml:lang CDATA #IMPLIED

but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:

<!ATTLIST poem   xml:lang CDATA 'fr'>
<!ATTLIST gloss  xml:lang CDATA 'en'>
<!ATTLIST note   xml:lang CDATA 'en'>

3 Logical Structures

[Definition: Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications.] Each attribute specification has a name and a value.

Element
[39]   element   ::=   EmptyElemTag
                     | STag content ETag   [WFC: Element Type Match]
                                           [VC: Element Valid]

This specification does not constrain the application semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.

Well-formedness constraint: Element Type Match

The Name in an element's end-tag MUST match the element type in the start-tag.

Validity constraint: Element Valid

An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:

  1. The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).

  2. The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space, comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.

  3. The declaration matches Mixed, and the content (after replacing any entity references with their replacement text) consists of character data (including CDATA sections), comments, PIs and child elements whose types match names in the content model.

  4. The declaration matches ANY, and the content (after replacing any entity references with their replacement text) consists of character data, CDATA sections, comments, PIs and child elements whose types have been declared.

3.1 Start-Tags, End-Tags, and Empty-Element Tags

[Definition: The beginning of every non-empty XML element is marked by a start-tag.]

Start-tag
[40]   STag   ::=   '<' Name (S Attribute)* S? '>'   [WFC: Unique Att Spec]
[41]   Attribute   ::=   Name Eq AttValue   [VC: Attribute Value Type]
                                            [WFC: No External Entity References]
                                            [WFC: No < in Attribute Values]

The Name in the start- and end-tags gives the element's type. [Definition: The Name-AttValue pairs are referred to as the attribute specifications of the element], [Definition: with the Name in each pair referred to as the attribute name] and [Definition: the content of the AttValue (the text between the ' or " delimiters) as the attribute value.] Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.

Well-formedness constraint: Unique Att Spec

An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag.

Validity constraint: Attribute Value Type

The attribute MUST have been declared; the value MUST be of the type declared for it. (For attribute types, see 3.3 Attribute-List Declarations.)

Well-formedness constraint: No External Entity References

Attribute values MUST NOT contain direct or indirect entity references to external entities.

Well-formedness constraint: No < in Attribute Values

The replacement text of any entity referred to directly or indirectly in an attribute value MUST NOT contain a <.
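
As a rough illustration of how these well-formedness constraints surface in practice, the following Python sketch feeds small documents to the ElementTree parser from the Python standard library, which rejects documents that are not well-formed. The helper name is_well_formed is hypothetical, and the behaviour shown is that of one particular non-validating processor, not a requirement of this specification.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True

print(is_well_formed('<termdef term="dog"></termdef>'))  # matching tags
print(is_well_formed('<termdef term="dog"></term>'))     # Element Type Match violated
print(is_well_formed('<a x="1" x="2"/>'))                 # Unique Att Spec violated
print(is_well_formed('<a x="<"/>'))                       # < in an attribute value
```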

An example of a start-tag:

<termdef term="dog">

[Definition: The end of every element that begins with a start-tag MUST be marked by an end-tag containing a name that echoes the element's type as given in the start-tag:]

End-tag
[42]   ETag   ::=   '</' Name S? '>'

An example of an end-tag:

</termdef>

[Definition: The text between the start-tag and end-tag is called the element's content:]

Content of Elements
[43]   content   ::=   CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*

[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag. [Definition: An empty-element tag takes a special form:]

Tags for Empty Elements
[44]   EmptyElemTag   ::=   '<' Name (S Attribute)* S? '/>'   [WFC: Unique Att Spec]

Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag SHOULD be used, and SHOULD only be used, for elements which are declared EMPTY.

Examples of empty elements:

<IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />
<br></br>
<br/>
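
The equivalence of the two representations of an empty element can be observed with a parser. The following Python sketch uses ElementTree; the helper name is_empty and the wrapping doc element are assumptions made for the illustration. Because ElementTree's default parser discards comments and processing instructions, the check below is only an approximation of "no content" as defined above.

```python
import xml.etree.ElementTree as ET

def is_empty(elem) -> bool:
    """True when the parsed element has no character data and no child elements."""
    return elem.text is None and len(elem) == 0

root = ET.fromstring(
    '<doc><IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />'
    '<br></br><br/><p>text</p></doc>'
)
print([(child.tag, is_empty(child)) for child in root])
```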

3.2 Element Type Declarations

The element structure of an XML document may, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content.

Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor MAY issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.

[Definition: An element type declaration takes the form:]

Element Type Declaration
[45]   elementdecl   ::=   '<!ELEMENT' S Name S contentspec S? '>'   [VC: Unique Element Type Declaration]
[46]   contentspec   ::=   'EMPTY' | 'ANY' | Mixed | children

where the Name gives the element type being declared.

Validity constraint: Unique Element Type Declaration

An element type MUST NOT be declared more than once.

Examples of element type declarations:

<!ELEMENT br EMPTY>
<!ELEMENT p (#PCDATA|emph)* >
<!ELEMENT %name.para; %content.para; >
<!ELEMENT container ANY>

3.2.1 Element Content

[Definition: An element type has element content when elements of that type MUST contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal S).] [Definition: In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear.] The grammar is built on content particles (cps), which consist of names, choice lists of content particles, or sequence lists of content particles:

Element-content Models
[47]   children   ::=   (choice | seq) ('?' | '*' | '+')?
[48]   cp   ::=   (Name | choice | seq) ('?' | '*' | '+')?
[49]   choice   ::=   '(' S? cp ( S? '|' S? cp )+ S? ')'   [VC: Proper Group/PE Nesting]
[50]   seq   ::=   '(' S? cp ( S? ',' S? cp )* S? ')'   [VC: Proper Group/PE Nesting]

where each Name is the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar; content particles occurring in a sequence list MUST each appear in the element content in the order given in the list. The optional character following a name or list governs whether the element or the content particles in the list may occur one or more (+), zero or more (*), or zero or one times (?). The absence of such an operator means that the element or content particle MUST appear exactly once. This syntax and meaning are identical to those used in the productions in this specification.

The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model. For more information, see E Deterministic Content Models.

Validity constraint: Proper Group/PE Nesting

Parameter-entity replacement text MUST be properly nested with parenthesized groups. That is to say, if either of the opening or closing parentheses in a choice, seq, or Mixed construct is contained in the replacement text for a parameter entity, both MUST be contained in the same replacement text.

For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text SHOULD contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text SHOULD be a connector (| or ,).

Examples of element-content models:

<!ELEMENT spec (front, body, back?)>
<!ELEMENT div1 (head, (p | list | note)*, div2*)>
<!ELEMENT dictionary-body (%div.mix; | %dict.mix;)*>
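
Because a content model is a regular expression over element types, matching can be sketched by translating the model into a Python regular expression and testing the sequence of child element names against it. This is an illustrative sketch only: the function names are hypothetical, parameter-entity references are assumed to have been expanded already, and the backtracking matcher used here does not check the determinism requirement mentioned above.

```python
import re

# One token per parenthesis, connector, occurrence operator, or element name.
TOKEN = re.compile(r"\s*([(),|?*+]|[^\s(),|?*+]+)")

def content_model_to_regex(model: str):
    """Translate an element-content model into a compiled regular expression."""
    parts = []
    for tok in TOKEN.findall(model):
        if tok == "(":
            parts.append("(?:")
        elif tok == ",":
            continue                      # a sequence is plain concatenation
        elif tok in ")|?*+":
            parts.append(tok)             # same meaning in both notations
        else:
            # Each name is matched together with a trailing space delimiter.
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def matches(model: str, child_names) -> bool:
    """True if the sequence of child element names satisfies the model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None

print(matches("(front, body, back?)", ["front", "body"]))
print(matches("(head, (p | list | note)*, div2*)", ["head", "p", "note", "div2"]))
print(matches("(front, body, back?)", ["body", "front"]))
```

The trailing space after each name keeps a short name such as p from matching the start of a longer name.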

3.2.2 Mixed Content

[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.] In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:

Mixed-content Declaration
[51]   Mixed   ::=   '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
                   | '(' S? '#PCDATA' S? ')'   [VC: Proper Group/PE Nesting]
                                               [VC: No Duplicate Types]

where the Names give the types of elements that may appear as children. The keyword #PCDATA derives historically from the term "parsed character data."

Validity constraint: No Duplicate Types

The same name MUST NOT appear more than once in a single mixed-content declaration.

Examples of mixed content declarations:

<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
<!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
<!ELEMENT b (#PCDATA)>

3.3 Attribute-List Declarations

Attributes are used to associate name-value pairs with elements. Attribute specifications MUST NOT appear outside of start-tags and empty-element tags; thus, the productions used to recognize them appear in 3.1 Start-Tags, End-Tags, and Empty-Element Tags. Attribute-list declarations may be used:

  • To define the set of attributes pertaining to a given element type.

  • To establish type constraints for these attributes.

  • To provide default values for attributes.

[Definition: Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:]

Attribute-list Declaration
[52]   AttlistDecl   ::=   '<!ATTLIST' S Name AttDef* S? '>'
[53]   AttDef   ::=   S Name S AttType S DefaultDecl

The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor MAY issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.

When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration. For interoperability, an XML processor MAY at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
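
The merging rule can be observed with a processor that reads the internal subset. The following Python sketch uses ElementTree, which is built on a non-validating parser; the document and its attribute names are invented for the illustration, and the sketch assumes the parser reports declared defaults among an element's attributes.

```python
import xml.etree.ElementTree as ET

# Two attribute-list declarations for the same element type. The second one
# repeats "type" (ignored: the first declaration is binding) and adds "compact".
DOC = """<!DOCTYPE list [
  <!ATTLIST list type CDATA "ordered">
  <!ATTLIST list type CDATA "bullets"
                 compact CDATA "yes">
]>
<list/>"""

def merged_defaults(document: str) -> dict:
    """Return the attributes reported for the root element."""
    return dict(ET.fromstring(document).attrib)

print(merged_defaults(DOC))
```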

3.3.1 Attribute Types

XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types are more constrained. The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3.3 Attribute-Value Normalization.

Attribute Types
[54]   AttType   ::=   StringType | TokenizedType | EnumeratedType
[55]   StringType   ::=   'CDATA'
[56]   TokenizedType   ::=   'ID'         [VC: ID]
                                          [VC: One ID per Element Type]
                                          [VC: ID Attribute Default]
                           | 'IDREF'      [VC: IDREF]
                           | 'IDREFS'     [VC: IDREF]
                           | 'ENTITY'     [VC: Entity Name]
                           | 'ENTITIES'   [VC: Entity Name]
                           | 'NMTOKEN'    [VC: Name Token]
                           | 'NMTOKENS'   [VC: Name Token]

Validity constraint: ID

Values of type ID MUST match the Name production. A name MUST NOT appear more than once in an XML document as a value of this type; i.e., ID values MUST uniquely identify the elements which bear them.

Validity constraint: One ID per Element Type

An element type MUST NOT have more than one ID attribute specified.

Validity constraint: ID Attribute Default

An ID attribute MUST have a declared default of #IMPLIED or #REQUIRED.

Validity constraint: IDREF

Values of type IDREF MUST match the Name production, and values of type IDREFS MUST match Names; each Name MUST match the value of an ID attribute on some element in the XML document; i.e. IDREF values MUST match the value of some ID attribute.
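
The uniqueness and cross-reference parts of the ID and IDREF constraints can be sketched as a simple two-pass check. This Python illustration is an assumption-laden sketch, not a validator: ElementTree does not expose declared attribute types, so the caller must say which attribute names are of type ID, IDREF and IDREFS (the names id, ref and refs below are hypothetical), and the Name production is not checked.

```python
import xml.etree.ElementTree as ET

def check_ids(root, id_attr="id", idref_attrs=("ref",), idrefs_attrs=("refs",)):
    """Return a list of problems found among ID, IDREF and IDREFS values."""
    problems = []
    seen = set()
    # First pass: every ID value must be unique within the document.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Second pass: every IDREF (and each name in an IDREFS) must match some ID.
    for elem in root.iter():
        names = [elem.get(a) for a in idref_attrs if elem.get(a) is not None]
        for a in idrefs_attrs:
            names.extend(elem.get(a, "").split())
        for name in names:
            if name not in seen:
                problems.append(f"IDREF {name!r} matches no ID")
    return problems

good = ET.fromstring('<doc><item id="a"/><item id="b"/><link ref="a"/><link refs="a b"/></doc>')
bad = ET.fromstring('<doc><item id="a"/><item id="a"/><link ref="c"/></doc>')
print(check_ids(good))
print(check_ids(bad))
```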

Validity constraint: Entity Name

Values of type ENTITY MUST match the Name production, values of type ENTITIES MUST match Names; each Name MUST match the name of an unparsed entity declared in the DTD.

Validity constraint: Name Token

Values of type NMTOKEN MUST match the Nmtoken production; values of type NMTOKENS MUST match Nmtokens.

[Definition: Enumerated attributes have a list of allowed values in their declaration]. They MUST take one of those values. There are two kinds of enumerated attribute types:

Enumerated Attribute Types
[57]   EnumeratedType   ::=   NotationType | Enumeration
[58]   NotationType   ::=   'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')'   [VC: Notation Attributes]
                                                                                [VC: One Notation Per Element Type]
                                                                                [VC: No Notation on Empty Element]
                                                                                [VC: No Duplicate Tokens]
[59]   Enumeration   ::=   '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')'   [VC: Enumeration]
                                                                        [VC: No Duplicate Tokens]

A NOTATION attribute identifies a notation, declared in the DTD with associated system and/or public identifiers, to be used in interpreting the element to which the attribute is attached.

Validity constraint: Notation Attributes

Values of this type MUST match one of the notation names included in the declaration; all notation names in the declaration MUST be declared.

Validity constraint: One Notation Per Element Type

An element type MUST NOT have more than one NOTATION attribute specified.

Validity constraint: No Notation on Empty Element

For compatibility, an attribute of type NOTATION MUST NOT be declared on an element declared EMPTY.

Validity constraint: No Duplicate Tokens

The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, MUST all be distinct.

Validity constraint: Enumeration

Values of this type MUST match one of the Nmtoken tokens in the declaration.

For interoperability, the same Nmtoken SHOULD NOT occur more than once in the enumerated attribute types of a single element type.

3.3.2 Attribute Defaults

An attribute declaration provides information on whether the attribute's presence is REQUIRED, and if not, how an XML processor is to react if a declared attribute is absent in a document.

Attribute Defaults
[60]   DefaultDecl   ::=   '#REQUIRED' | '#IMPLIED'
                         | (('#FIXED' S)? AttValue)   [VC: Required Attribute]
                                                      [VC: Attribute Default Value Syntactically Correct]
                                                      [WFC: No < in Attribute Values]
                                                      [VC: Fixed Attribute Default]
                                                      [WFC: No External Entity References]

In an attribute declaration, #REQUIRED means that the attribute MUST always be provided, #IMPLIED that no default value is provided. [Definition: If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute MUST always have the default value. When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it MUST report the attribute with the declared default value to the application.]

Validity constraint: Required Attribute

If the default declaration is the keyword #REQUIRED, then the attribute MUST be specified for all elements of the type in the attribute-list declaration.

Validity constraint: Attribute Default Value Syntactically Correct

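The declared default value MUST meet the syntactic constraints of the declared attribute type. That is, the default value of an attribute:

  • of type IDREF or ENTITY must match the Name production;

  • of type IDREFS or ENTITIES must match the Names production;

  • of type NMTOKEN must match the Nmtoken production;

  • of type NMTOKENS must match the Nmtokens production;

  • of an enumerated type (either a NOTATION type or an enumeration) must match one of the enumerated values.

Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) will be reported by a validating parser only if an element without a specification for this attribute actually occurs.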

Validity constraint: Fixed Attribute Default

If an attribute has a default value declared with the #FIXED keyword, instances of that attribute MUST match the default value.

Examples of attribute-list declarations:

<!ATTLIST termdef
          id      ID      #REQUIRED
          name    CDATA   #IMPLIED>
<!ATTLIST list
          type    (bullets|ordered|glossary)  "ordered">
<!ATTLIST form
          method  CDATA   #FIXED "POST">
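
To illustrate how declared defaults reach the application, the following Python sketch places the declarations above in an internal subset and lists the attributes that ElementTree reports for each element. The wrapping doc element and the function name reported_attributes are assumptions made for the illustration; since the underlying parser is non-validating, the #REQUIRED and #FIXED validity constraints are not enforced here.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE doc [
  <!ATTLIST termdef
            id      ID      #REQUIRED
            name    CDATA   #IMPLIED>
  <!ATTLIST list
            type    (bullets|ordered|glossary)  "ordered">
  <!ATTLIST form
            method  CDATA   #FIXED "POST">
]>
<doc><termdef id="t1"/><list/><list type="bullets"/><form/></doc>"""

def reported_attributes(document: str):
    """Return (element type, attributes) for each child of the root element."""
    return [(child.tag, dict(child.attrib)) for child in ET.fromstring(document)]

for tag, attrs in reported_attributes(DOC):
    print(tag, attrs)
```

In this sketch the #IMPLIED attribute name contributes nothing when it is absent, while the declared defaults for type and method are supplied by the processor.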

3.3.3 Attribute-Value Normalization

Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.

  1. All line breaks MUST have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.

  2. Begin with a normalized value consisting of the empty string.

  3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:

    • For a character reference, append the referenced character to the normalized value.

    • For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.

    • For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.

    • For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.

All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA.

It is an error if an attribute value contains a reference to an entity for which no declaration has been read.

Following are examples of attribute normalization. Given the following declarations:

<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">

the attribute specifications in the left column below would be normalized to the character sequences of the middle column if the attribute a is declared NMTOKENS and to those of the right column if a is declared CDATA.

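Attribute specification               | a is NMTOKENS               | a is CDATA
a="(line break)(line break)xyz"       | x y z                       | #x20 #x20 x y z
a="&d;&d;A&a;&#x20;&a;B&da;"          | A #x20 B                    | #x20 #x20 A #x20 #x20 #x20 B #x20 #x20
a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;"  | #xD #xD A #xA #xA B #xD #xA | #xD #xD A #xA #xA B #xD #xA

(In the first row, "(line break)" stands for a literal line break in the attribute value.)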

Note that the last example is invalid (but well-formed) if a is declared to be of type NMTOKENS.
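
The normalization algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming implementation: the function name normalize_attribute is hypothetical, the entities argument is assumed to map entity names to their replacement text (so the entity d declared above maps to the single character #xD), line breaks in the input are assumed to have been normalized already, and neither undeclared nor recursive entity references are handled.

```python
import re

# A hexadecimal character reference, a decimal one, or an entity reference.
REF = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&([^&;\s]+);")

def normalize_attribute(raw, entities=None, cdata=True):
    """Apply steps 2 and 3, plus the extra pass for non-CDATA types."""
    entities = entities or {}
    out = []  # step 2: begin with an empty normalized value

    def step3(text):
        pos = 0
        while pos < len(text):
            m = REF.match(text, pos)
            if m is None:
                ch = text[pos]
                # Literal white space becomes #x20; other characters are kept.
                out.append(" " if ch in " \r\n\t" else ch)
                pos += 1
                continue
            hex_digits, dec_digits, name = m.groups()
            if hex_digits is not None:
                out.append(chr(int(hex_digits, 16)))  # referenced character itself
            elif dec_digits is not None:
                out.append(chr(int(dec_digits)))
            else:
                step3(entities[name])  # recurse into the replacement text
            pos = m.end()

    step3(raw)
    value = "".join(out)
    if not cdata:
        # Only #x20 characters are collapsed and trimmed, not #xD, #xA or #x9.
        value = re.sub(" +", " ", value).strip(" ")
    return value

ENTITIES = {"d": "\r", "a": "\n", "da": "\r\n"}
print(repr(normalize_attribute("&d;&d;A&a;&#x20;&a;B&da;", ENTITIES)))
print(repr(normalize_attribute("&d;&d;A&a;&#x20;&a;B&da;", ENTITIES, cdata=False)))
```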

3.4 Conditional Sections

[Definition: Conditional sections are portions of the document type declaration external subset or of external parameter entities which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.]

Conditional Section
[61]   conditionalSect   ::=   includeSect | ignoreSect
[62]   includeSect   ::=   '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'   [VC: Proper Conditional Section/PE Nesting]
[63]   ignoreSect   ::=   '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'   [VC: Proper Conditional Section/PE Nesting]
[64]   ignoreSectContents   ::=   Ignore ('<![' ignoreSectContents ']]>' Ignore)*
[65]   Ignore   ::=   Char* - (Char* ('<![' | ']]>') Char*)

Validity constraint: Proper Conditional Section/PE Nesting

If any of the "<![", "[", or "]]>" of a conditional section is contained in the replacement text for a parameter-entity reference, all of them MUST be contained in the same replacement text.

Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space.

If the keyword of the conditional section is INCLUDE, then the contents of the conditional section MUST be processed as part of the DTD. If the keyword of the conditional section is IGNORE, then the contents of the conditional section MUST NOT be processed as part of the DTD. If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections MUST be ignored. The contents of an ignored conditional section MUST be parsed by ignoring all characters after the "[" following the keyword, except conditional section starts "<![" and ends "]]>", until the matching conditional section end is found. Parameter entity references MUST NOT be recognized in this process.

If the keyword of the conditional section is a parameter-entity reference, the parameter entity MUST be replaced by its content before the processor decides whether to include or ignore the conditional section.

An example:

<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >

<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>

4 Physical Structures

[Definition: An XML document may consist of one or many storage units. These are called entities; they all have content and are all (except for the document entity and the external DTD subset) identified by entity name.] Each XML document has one entity called the document entity, which serves as the starting point for the XML processor and may contain the whole document.

Entities may be either parsed or unparsed. [Definition: The contents of a parsed entity are referred to as its replacement text; this text is considered an integral part of the document.]

[Definition: An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.]

Parsed entities are invoked by name using entity references; unparsed entities by name, given in the value of ENTITY or ENTITIES attributes.

[Definition: General entities are entities for use within the document content. In this specification, general entities are sometimes referred to with the unqualified term entity when this leads to no ambiguity.] [Definition: Parameter entities are parsed entities for use within the DTD.] These two types of entities use different forms of reference and are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities.

4.1 Character and Entity References

[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]

Character Reference
[66]   CharRef   ::=   '&#' [0-9]+ ';'
                     | '&#x' [0-9a-fA-F]+ ';'   [WFC: Legal Character]

Well-formedness constraint: Legal Character

Characters referred to using character references MUST match the production for Char.

If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.
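
Decoding a character reference and checking the Legal Character constraint can be sketched as follows. The function names are hypothetical, and the code point ranges in is_xml_char are an assumption taken from production [2] Char, which is defined elsewhere in this specification.

```python
import re

# Production [66]: a hexadecimal form (lowercase x only) or a decimal form.
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")

def is_xml_char(cp: int) -> bool:
    """Code points assumed to match production [2] Char."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def decode_char_ref(ref: str) -> str:
    """Return the character a reference denotes, or raise ValueError."""
    m = CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a character reference: {ref!r}")
    hex_digits, dec_digits = m.groups()
    cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
    if not is_xml_char(cp):
        raise ValueError(f"reference to an illegal character: {ref!r}")
    return chr(cp)

print(decode_char_ref("&#x3C;"), decode_char_ref("&#60;"))
```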

[Definition: An entity reference refers to the content of a named entity.] [Definition: References to parsed general entities use ampersand (&) and semicolon (;) as delimiters.] [Definition: Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters.]

Entity Reference
[67]   Reference   ::=   EntityRef | CharRef
[68]   EntityRef   ::=   '&' Name ';'   [WFC: Entity Declared]
                                        [VC: Entity Declared]
                                        [WFC: Parsed Entity]
                                        [WFC: No Recursion]
[69]   PEReference   ::=   '%' Name ';'   [VC: Entity Declared]
                                          [WFC: No Recursion]
                                          [WFC: In DTD]

Well-formedness constraint: Entity Declared

In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with "standalone='yes'", for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference MUST match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity MUST precede any reference to it which appears in a default value in an attribute-list declaration.

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.

Validity constraint: Entity Declared

In a document with an external subset or parameter entity references, if the document is not standalone (either "standalone='no'" is specified or there is no standalone declaration), then the Name given in the entity reference MUST match that in an entity declaration. For interoperability, valid documents SHOULD declare the entities amp, lt, gt, apos, quot, in the form specified in 4.6 Predefined Entities. The declaration of a parameter entity MUST precede any reference to it. Similarly, the declaration of a general entity MUST precede any attribute-list declaration containing a default value with a direct or indirect reference to that general entity.

Well-formedness constraint: Parsed Entity

An entity reference MUST NOT contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.

Well-formedness constraint: No Recursion

A parsed entity MUST NOT contain a recursive reference to itself, either directly or indirectly.
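One way to check the No Recursion constraint is a depth-first search over the graph of entity references. The sketch below is a non-normative illustration under simplifying assumptions: entities are given as a hypothetical mapping from name to replacement text, and Name is approximated by an ASCII-only pattern rather than the full Name production.

```python
import re
from typing import Dict, List, Optional

# EntityRef ::= '&' Name ';'  (Name approximated by an ASCII-only pattern)
ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9._:-]*);")


def find_recursion(entities: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of entity names, or None if no entity is recursive."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {name: UNSEEN for name in entities}

    def visit(name: str, path: List[str]) -> Optional[List[str]]:
        state[name] = ACTIVE
        path.append(name)
        for ref in ENTITY_REF.findall(entities[name]):
            if ref not in entities:
                continue  # undeclared or predefined entity: not this check's concern
            if state[ref] == ACTIVE:
                # The reference leads back into the chain being expanded.
                return path[path.index(ref):] + [ref]
            if state[ref] == UNSEEN:
                cycle = visit(ref, path)
                if cycle:
                    return cycle
        path.pop()
        state[name] = DONE
        return None

    for name in entities:
        if state[name] == UNSEEN:
            cycle = visit(name, [])
            if cycle:
                return cycle
    return None
```

For example, `find_recursion({"a": "&b;", "b": "&a;"})` reports the indirect cycle a, b, a, whereas character references such as `&#38;` are ignored because they are not entity references.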

Well-formedness constraint: In DTD

Parameter-entity references MUST NOT appear outside the DTD.

Examples of character and entity references:

Type <key>less-than</key> (&#x3C;) to save options.
This document was prepared on &docdate; and
is classified &security-level;.

Example of a parameter-entity reference:

<!-- declare the parameter entity "ISOLat2"... -->
<!ENTITY % ISOLat2
         SYSTEM "http://www.xml.com/iso/isolat2-xml.entities" >
<!-- ... now reference it. -->
%ISOLat2;

4.2 Entity Declarations

[Definition: Entities are declared thus:]

Entity Declaration
[70]   EntityDecl   ::=   GEDecl | PEDecl
[71]   GEDecl       ::=   '<!ENTITY' S Name S EntityDef S? '>'
[72]   PEDecl       ::=   '<!ENTITY' S '%' S Name S PEDef S? '>'
[73]   EntityDef    ::=   EntityValue | (ExternalID NDataDecl?)
[74]   PEDef        ::=   EntityValue | ExternalID

The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor MAY issue a warning if entities are declared multiple times.

4.2.1 Internal Entities

[Definition: If the entity definition is an EntityValue, the defined entity is called an internal entity. There is no separate physical storage object, and the content of the entity is given in the declaration.] Note that some processing of entity and character references in the literal entity value may be required to produce the correct replacement text: see 4.5 Construction of Entity Replacement Text.

An internal entity is a parsed entity.

Example of an internal entity declaration:

<!ENTITY Pub-Status "This is a pre-release of the specification.">

4.2.2 External Entities

[Definition: If the entity is not internal, it is an external entity, declared as follows:]

External Entity Declaration
[75]   ExternalID   ::=   'SYSTEM' S SystemLiteral
                        | 'PUBLIC' S PubidLiteral S SystemLiteral
[76]   NDataDecl    ::=   S 'NDATA' S Name   [VC: Notation Declared]

If the NDataDecl is present, this is a general unparsed entity; otherwise it is a parsed entity.

Validity constraint: Notation Declared

The Name MUST match the declared name of a notation.

[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 3986]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity. Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.

System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 3986], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it MUST be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it SHOULD trigger escaping. When escaping does occur, it MUST be performed as follows:

  1. Each character to be escaped is represented in UTF-8 [Unicode] as one or more bytes.

  2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).

  3. The original character is replaced by the resulting character sequence.
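The three steps above can be sketched as follows. This is a non-normative illustration; the function name `escape_system_identifier` is hypothetical, and the use of uppercase hexadecimal digits is a choice made here, not something the text prescribes.

```python
# ASCII characters listed in the text as needing escape, besides the controls:
# space, the delimiters < > ", and the "unwise" characters { } | \ ^ `
_ESCAPE_ASCII = set(' <>"{}|\\^`')


def escape_system_identifier(sysid: str) -> str:
    """Escape only the characters enumerated in the text; leave all others alone."""
    out = []
    for ch in sysid:
        cp = ord(ch)
        if cp <= 0x1F or cp >= 0x7F or ch in _ESCAPE_ASCII:
            # Step 1: represent the character in UTF-8.
            # Step 2: convert each resulting byte to %HH.
            # Step 3: the escaped sequence takes the character's place.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Note that "%" is not in the list of characters to be escaped, so a system identifier that already contains escapes passes through this sketch unchanged.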

Note:

In a future edition of this specification, the XML Core Working Group intends to replace the preceding paragraph and list of steps with a normative reference to an upcoming revision of IETF RFC 3987, which will define "Legacy Extended IRIs (LEIRIs)". When this revision is available, it is the intent of the XML Core WG to use it to replace language similar to the above in any future revisions of XML-related specifications under its purview.

[Definition: In addition to a system identifier, an external identifier may include a public identifier.] An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference. If the processor is unable to do so, it MUST use the URI reference specified in the system literal. Before a match is attempted, all strings of white space in the public identifier MUST be normalized to single space characters (#x20), and leading and trailing white space MUST be removed.
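The normalization of a public identifier before matching might look like the sketch below. It is non-normative; the name `normalize_public_id` is hypothetical, and white space is taken to be the characters of production S (space, tab, carriage return, line feed), which is defined outside this section.

```python
import re

# XML white space (production S): #x20, #x9, #xD, #xA
_XML_WS = re.compile(r"[\x20\x09\x0D\x0A]+")


def normalize_public_id(pubid: str) -> str:
    """Collapse each run of white space to one space and trim both ends."""
    return _XML_WS.sub(" ", pubid).strip(" ")
```

With this helper, a public identifier that was wrapped across lines in a DTD compares equal to the single-line form used, for instance, in a catalog.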

Examples of external entity declarations:

<!ENTITY open-hatch
         SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY open-hatch
         PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
         "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY hatch-pic
         SYSTEM "../grafix/OpenHatch.gif"
         NDATA gif >

4.3 Parsed Entities

4.3.1 The Text Declaration

External parsed entities SHOULD each begin with a text declaration.

Text Declaration
[77]   TextDecl   ::=   '<?xml' VersionInfo? EncodingDecl S? '?>'

The text declaration MUST be provided literally, not by reference to a parsed entity. The text declaration MUST NOT appear at any position other than the beginning of an external parsed entity. The text declaration in an external parsed entity is not considered part of its replacement text.

4.3.2 Well-Formed Parsed Entities

The document entity is well-formed if it matches the production labeled document. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. All external parameter entities are well-formed by definition.

Note:

Only parsed entities that are referenced directly or indirectly within the document are required to be well-formed.

Well-Formed External Parsed Entity
[78]   extParsedEnt   ::=   TextDecl? content

An internal general parsed entity is well-formed if its replacement text matches the production labeled content. All internal parameter entities are well-formed by definition.

A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.

4.3.3 Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.

If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.
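A processor can use the Byte Order Mark to tell UTF-8 from UTF-16 by inspecting the first bytes of the entity. The sketch below is a non-normative illustration; `sniff_bom` is a hypothetical name, and the returned strings are Python codec names suitable for decoding the bytes that follow the mark. It deliberately ignores other encodings: for instance, a UTF-32 little-endian signature (FF FE 00 00) would be misreported as UTF-16 here.

```python
import codecs
from typing import Optional, Tuple


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (python_codec_name, bom_length), or (None, 0) if there is no BOM."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", 2
    return None, 0


def strip_bom_and_decode(data: bytes) -> str:
    """Decode an entity, treating a leading BOM as an encoding signature only."""
    codec, skip = sniff_bom(data)
    # Without a BOM (and without other information) this sketch assumes UTF-8.
    return data[skip:].decode(codec or "utf-8")
```

Because the mark is a signature and not part of the markup or character data, `strip_bom_and_decode` drops it before handing the text on.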

Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:

Encoding Declaration
[80]   EncodingDecl   ::=   S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81]   EncName        ::=   [A-Za-z] ([A-Za-z0-9._] | '-')*   /* Encoding name contains only Latin characters */

In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" SHOULD be used for the various encoded forms of JIS X-0208-1997. It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
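The EncName production and the case-insensitive matching of encoding names can be expressed compactly. This is a non-normative sketch; `is_enc_name` and `same_encoding_name` are hypothetical helper names.

```python
import re

# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(name: str) -> bool:
    """True if the string matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None


def same_encoding_name(a: str, b: str) -> bool:
    """Compare two encoding names without regard to case."""
    # EncName contains only ASCII characters, so lower() is a sufficient case fold.
    return a.lower() == b.lower()
```

For example, "Shift_JIS" and "x-my-encoding" match EncName, "8859-1" does not (it starts with a digit), and "utf-8" compares equal to "UTF-8".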

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

It is a fatal error for a TextDecl to occur other than at the beginning of an external entity.

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
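In an implementation, the fatal errors described above typically surface while decoding the entity's bytes. The following non-normative sketch uses Python's strict decoding to illustrate this; `decode_entity` is a hypothetical name, and Python's codec registry is assumed to stand in for the set of encodings a processor supports, which does not correspond exactly to the IANA registry.

```python
def decode_entity(data: bytes, encoding: str = "utf-8") -> str:
    """Decode entity bytes strictly; any failure corresponds to a fatal error."""
    try:
        return data.decode(encoding, errors="strict")
    except LookupError as exc:
        # The processor is unable to process this encoding.
        raise ValueError(f"fatal error: unsupported encoding {encoding!r}") from exc
    except UnicodeDecodeError as exc:
        # The entity contains a byte sequence that is not legal in the encoding.
        raise ValueError(
            f"fatal error: illegal byte sequence at offset {exc.start}"
        ) from exc
```

Strict decoding matters here: a lenient mode that substitutes replacement characters would hide exactly the ill-formed sequences that the text requires a processor to treat as fatal.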

Examples of text declarations containing encoding declarations:

<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>

4.4 XML Processor Treatment of Entities and References

The table below summarizes the contexts in which character references, entity references, and invocations of unparsed entities might appear and the REQUIRED behavior of an XML processor in each case. The labels in the leftmost column describe the recognition context:

Reference in Content

as a reference anywhere after the start-tag and before the end-tag of an element; corresponds to the nonterminal content.

Reference in Attribute Value

as a reference within either the value of an attribute in a start-tag, or a default value in an attribute declaration; corresponds to the nonterminal AttValue.

Occurs as Attribute Value

as a Name, not a reference, appearing either as the value of an attribute which has been declared as type ENTITY, or as one of the space-separated tokens in the value of an attribute which has been declared as type ENTITIES.

Reference in Entity Value

as a reference within a parameter or internal entity's literal entity value in the entity's declaration; corresponds to the nonterminal EntityValue.

Reference in DTD

as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral, PubidLiteral, or the contents of an ignored conditional section (see 3.4 Conditional Sections).

                             | Entity Type                                                                     | Character
                             | Parameter           | Internal General    | External Parsed General | Unparsed  |
Reference in Content         | Not recognized      | Included            | Included if validating  | Forbidden | Included
Reference in Attribute Value | Not recognized      | Included in literal | Forbidden               | Forbidden | Included
Occurs as Attribute Value    | Not recognized      | Forbidden           | Forbidden               | Notify    | Not recognized
Reference in EntityValue     | Included in literal | Bypassed            | Bypassed                | Error     | Included
Reference in DTD             | Included as PE      | Forbidden           | Forbidden               | Forbidden | Forbidden
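For implementers, the table can be read as a two-key lookup from recognition context and entity type to the required treatment. The sketch below merely restates the table as data; the names `ENTITY_KINDS`, `TREATMENT`, and `treatment` are hypothetical and not part of this specification.

```python
# Column order of the table: four entity types, then character references.
ENTITY_KINDS = ("Parameter", "Internal General", "External Parsed General",
                "Unparsed", "Character")

# One tuple per row of the table, in column order.
_ROWS = {
    "Reference in Content":
        ("Not recognized", "Included", "Included if validating", "Forbidden", "Included"),
    "Reference in Attribute Value":
        ("Not recognized", "Included in literal", "Forbidden", "Forbidden", "Included"),
    "Occurs as Attribute Value":
        ("Not recognized", "Forbidden", "Forbidden", "Notify", "Not recognized"),
    "Reference in EntityValue":
        ("Included in literal", "Bypassed", "Bypassed", "Error", "Included"),
    "Reference in DTD":
        ("Included as PE", "Forbidden", "Forbidden", "Forbidden", "Forbidden"),
}

TREATMENT = {ctx: dict(zip(ENTITY_KINDS, row)) for ctx, row in _ROWS.items()}


def treatment(context: str, kind: str) -> str:
    """Look up the required behavior for a reference kind in a given context."""
    return TREATMENT[context][kind]
```

For instance, `treatment("Reference in Attribute Value", "External Parsed General")` gives "Forbidden", matching the third bullet of 4.4.4.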

4.4.1 Not Recognized

Outside the DTD, the % character has no special significance; thus, what would be parameter entity references in the DTD are not recognized as markup in content. Similarly, the names of unparsed entities are not recognized except when they appear in the value of an appropriately declared attribute.

4.4.2 Included

[Definition: An entity is included when its replacement text is retrieved and processed, in place of the reference itself, as though it were part of the document at the location the reference was recognized.] The replacement text may contain both character data and (except for parameter entities) markup, which MUST be recognized in the usual way. (The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand is not recognized as an entity-reference delimiter.) A character reference is included when the indicated character is processed in place of the reference itself.

4.4.3 Included If Validating

When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor MUST include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it MUST inform the application that it recognized, but did not read, the entity.

This rule is based on the recognition that the automatic inclusion provided by the SGML and XML entity mechanism, primarily designed to support modularity in authoring, is not necessarily appropriate for other applications, in particular document browsing. Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.

4.4.4 Forbidden

The following are forbidden, and constitute fatal errors:

  • the appearance of a reference to an unparsed entity, except in the EntityValue in an entity declaration.

  • the appearance of any character or general-entity reference in the DTD except within an EntityValue or AttValue.

  • a reference to an external entity in an attribute value.

4.4.5 Included in Literal

When an entity reference appears in an attribute value, or a parameter entity reference appears in a literal entity value, its replacement text MUST be processed in place of the reference itself as though it were part of the document at the location the reference was recognized, except that a single or double quote character in the replacement text MUST always be treated as a normal data character and MUST NOT terminate the literal. For example, this is well-formed:

<!ENTITY % YN '"Yes"' >
<!ENTITY WhatHeSaid "He said %YN;" >

while this is not:

<!ENTITY EndAttr "27'" >
<element attribute='a-&EndAttr;>

4.4.6 Notify

When the name of an unparsed entity appears as a token in the value of an attribute of declared type ENTITY or ENTITIES, a validating processor MUST inform the application of the system and public (if any) identifiers for both the entity and its associated notation.

4.4.7 Bypassed

When a general entity reference appears in the EntityValue in an entity declaration, it MUST be bypassed and left as is.

4.4.8 Included as PE

Just as with external parsed entities, parameter entities need only be included if validating. When a parameter-entity reference is recognized in the DTD and included, its replacement text MUST be enlarged by the attachment of one leading and one following space (#x20) character; the intent is to constrain the replacement text of parameter entities to contain an integral number of grammatical tokens in the DTD. This behavior MUST NOT apply to parameter entity references within entity values; these are described in 4.4.5 Included in Literal.

4.4.9 Error

It is an error for a reference to an unparsed entity to appear in the EntityValue in an entity declaration.

4.5 Construction of Entity Replacement Text

In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding whitespace) if there is one but without any replacement of character references or parameter-entity references.]

The literal entity value as given in an internal entity declaration (EntityValue) may contain character, parameter-entity, and general-entity references. Such references MUST be contained entirely within the literal entity value. The actual replacement text that is included (or included in literal) as described above MUST contain the replacement text of any parameter entities referred to, and MUST contain the character referred to, in place of any character references in the literal entity value; however, general-entity references MUST be left as-is, unexpanded. For example, given the following declarations:

<!ENTITY % pub    "&#xc9;ditions Gallimard" >
<!ENTITY   rights "All rights reserved" >
<!ENTITY   book   "La Peste: Albert Camus,&#xA9; 1947 %pub;. &rights;" >

then the replacement text for the entity "book" is:

La Peste: Albert Camus,© 1947 Éditions Gallimard. &rights;

The general-entity reference "&rights;" would be expanded should the reference "&book;" appear in the document's content or an attribute value.
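The construction of replacement text for an internal entity can be sketched as a single pass over the literal entity value. This is a non-normative illustration: `replacement_text` is a hypothetical helper, parameter entities are supplied as a mapping from name to their already-constructed replacement text, and the name pattern is a simplification of the Name production. The Legal Character check on character references is omitted for brevity.

```python
import re

# Matches character references and parameter-entity references only.
# General-entity references ('&name;') are intentionally not matched,
# so they are bypassed and remain in the result unexpanded.
_TOKEN = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|%([^\s;%&]+);")


def replacement_text(literal: str, parameter_entities: dict) -> str:
    """Build the replacement text of an internal entity from its literal value."""
    def expand(m):
        hex_digits, dec_digits, pe_name = m.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        # Included in literal: insert the PE's replacement text as-is.
        return parameter_entities[pe_name]

    # re.sub does not rescan inserted text, so included text is not re-expanded.
    return _TOKEN.sub(expand, literal)


pub = replacement_text("&#xc9;ditions Gallimard", {})
book = replacement_text(
    "La Peste: Albert Camus,&#xA9; 1947 %pub;. &rights;", {"pub": pub}
)
```

Here `book` ends with the literal text `&rights;`, which would only be expanded later, when a reference to the entity is included in content or in an attribute value.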

These simple rules may have complex interactions; for a detailed discussion of a difficult example, see D Expansion of Entity and Character References.

4.6 Predefined Entities

[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and MUST be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]

All XML processors MUST recognize these entities whether they are declared or not. For interoperability, valid XML documents SHOULD declare these entities, like any others, before using them. If the entities lt or amp are declared, they MUST be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is REQUIRED for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they MUST be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is OPTIONAL but harmless). For example:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
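The need for double escaping can be observed with an ordinary parser. The non-normative sketch below uses Python's `xml.etree.ElementTree`; the entity names `doubled` and `single` are hypothetical stand-ins chosen so that the parser's built-in handling of lt does not interfere. The replacement text of `doubled` is the character reference `&#60;`, which yields character data when the entity is included, whereas the replacement text of `single` is a bare less-than sign, which is then recognized as markup.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE d [
  <!ENTITY doubled "&#38;#60;">
  <!ENTITY single  "&#60;">
]>
<d>%s</d>"""

# Double escaping: the reference expands to '&#60;', then to a '<' data character.
text = ET.fromstring(DOC % "&doubled;").text

# Single escaping: the replacement text is a raw '<', so inclusion is not well-formed.
try:
    ET.fromstring(DOC % "&single;")
    single_ok = True
except ET.ParseError:
    single_ok = False
```

Under these assumptions, `text` holds the less-than character and `single_ok` ends up false, which is the situation the requirement on lt and amp is designed to avoid.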

4.7 Notation Declarations

[Definition: Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed.]

[Definition: Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.]

Notation Declarations
[82]   NotationDecl   ::=   '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'   [VC: Unique Notation Name]
[83]   PublicID       ::=   'PUBLIC' S PubidLiteral

Validity constraint: Unique Notation Name

A given Name MUST NOT be declared in more than one notation declaration.

XML processors MUST provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They MAY additionally resolve the external identifier into the system identifier, filename, or other information needed to allow the application to call a processor for data in the notation described. (It is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)

4.8 Document Entity

[Definition: The document entity serves as the root of the entity tree and a starting-point for an XML processor.] This specification does not specify how the document entity is to be located by an XML processor; unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.

5 Conformance

5.1 Validating and Non-Validating Processors

Conforming XML processors fall into two classes: validating and non-validating.

Validating and non-validating processors alike MUST report violations of this specification's well-formedness constraints in the content of the document entity and any other parsed entities that they read.

[Definition: Validating processors MUST, at user option, report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification.] To accomplish this, validating XML processors MUST read and process the entire DTD and all external parsed entities referenced in the document.

Non-validating processors are REQUIRED to check only the document entity, including the entire internal DTD subset, for well-formedness. [Definition: While they are not required to check the document for validity, they are REQUIRED to process all the declarations they read in the internal DTD subset and in any parameter entity that they read, up to the first reference to a parameter entity that they do not read; that is to say, they MUST use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values.] Except when standalone="yes", they MUST NOT process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations; when standalone="yes", processors MUST process these declarations.

Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information. For example, several requirements for uniqueness within the document may not be met, including more than one element with the same id, duplicate declarations of elements or notations with the same name, etc. In these cases the behavior of the parser with respect to reporting such information to the application is undefined.

5.2 Using XML Processors

The behavior of a validating XML processor is highly predictable; it must read every piece of a document and report all well-formedness and validity violations. Less is required of a non-validating processor; it need not read any part of the document other than the document entity. This has two effects that may be important to users of XML processors:

For maximum reliability in interoperating between different XML processors, applications which use non-validating processors SHOULD NOT rely on any behaviors not required of such processors. Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities that are or may be specified in external entities) SHOULD use validating XML processors.

6 Notation

The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form

symbol ::= expression

Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

#xN

where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.

[a-zA-Z], [#xN-#xN]

matches any Char with a value in the range(s) indicated (inclusive).

[abc], [#xN#xN#xN]

matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.

[^a-z], [^#xN-#xN]

matches any Char with a value outside the range indicated.

[^abc], [^#xN#xN#xN]

matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.

"string"

matches a literal string matching that given inside the double quotes.

'string'

matches a literal string matching that given inside the single quotes.

These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:

(expression)

expression is treated as a unit and may be combined as described in this list.

A?

matches A or nothing; optional A.

A B

matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).

A | B

matches A or B.

A - B

matches any string that matches A but does not match B.

A+

matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).

A*

matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).

Other notations used in the productions are:

/* ... */

comment.

[ wfc: ... ]

well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.

[ vc: ... ]

validity constraint; this identifies by name a constraint on valid documents associated with a production.

A References

A.1 Normative References

IANA-CHARSETS
(InternetAssigned Numbers Authority)Official Names for Character Sets,ed. Keld Simonsen et al. (See http://www.iana.org/assignments/character-sets.)
IETF RFC 2119
IETF(Internet Engineering Task Force).RFC 2119: Key words for use in RFCs to Indicate Requirement Levels.Scott Bradner, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
IETF BCP 47
IETF (Internet Engineering Task Force).BCP 47, consisting ofRFC 4646: Tags for Identifying Languages, andRFC 4647: Matching of Language Tags,A. Phillips, M. Davis. 2006.
IETF RFC 3986
IETF (Internet Engineering Task Force).RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. T. Berners-Lee, R. Fielding, L. Masinter. 2005. (See http://www.ietf.org/rfc/rfc3986.txt.)
ISO/IEC 10646
ISO (InternationalOrganization for Standardization).ISO/IEC 10646-1:2000. Informationtechnology — Universal Multiple-Octet Coded Character Set (UCS) —Part 1: Architecture and Basic Multilingual Plane andISO/IEC 10646-2:2001.Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 2:Supplementary Planes, as, from time to time, amended, replaced by a new edition orexpanded by the addition of new parts. [Geneva]: International Organization for Standardization.(Seehttp://www.iso.org/iso/home.htm for the latest version.)
ISO/IEC 10646:2000
ISO (InternationalOrganization for Standardization).ISO/IEC 10646-1:2000. Informationtechnology — Universal Multiple-Octet Coded Character Set (UCS) —Part 1: Architecture and Basic Multilingual Plane. [Geneva]: InternationalOrganization for Standardization, 2000.
Unicode
The Unicode Consortium.The UnicodeStandard, Version5.0.0,defined by: The Unicode Standard, Version 5.0 (Boston, MA,Addison-Wesley, 2007. ISBN 0-321-48091-0).
UnicodeNormal
The UnicodeConsortium.Unicode normalization forms. Mark Davis andMartin Durst. 2008. (See http://unicode.org/reports/tr15/.)

A.2 Other References

Aho/Ullman
Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988.
Brüggemann-Klein
Brüggemann-Klein, Anne. Formal Models in Document Processing. Habilitationsschrift. Faculty of Mathematics at the University of Freiburg, 1993. (See ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps.)
Brüggemann-Klein and Wood
Brüggemann-Klein, Anne, and Derick Wood. Deterministic Regular Languages. Universität Freiburg, Institut für Informatik, Bericht 38, Oktober 1991. Extended abstract in A. Finkel, M. Jantzen, Hrsg., STACS 1992, S. 173-184. Springer-Verlag, Berlin 1992. Lecture Notes in Computer Science 577. Full version titled One-Unambiguous Regular Languages in Information and Computation 140 (2): 229-253, February 1998.
Clark
James Clark. Comparison of SGML and XML. (See http://www.w3.org/TR/NOTE-sgml-xml-971215.)
IANA-LANGCODES
(Internet Assigned Numbers Authority) Registry of Language Tags (See http://www.iana.org/assignments/language-subtag-registry.)
IETF RFC 2141
IETF (Internet Engineering Task Force). RFC 2141: URN Syntax, ed. R. Moats. 1997. (See http://www.ietf.org/rfc/rfc2141.txt.)
IETF RFC 3023
IETF (Internet Engineering Task Force). RFC 3023: XML Media Types. eds. M. Murata, S. St.Laurent, D. Kohn. 2001. (See http://www.ietf.org/rfc/rfc3023.txt.)
IETF RFC 2781
IETF (Internet Engineering Task Force). RFC 2781: UTF-16, an encoding of ISO 10646, ed. P. Hoffman, F. Yergeau. 2000. (See http://www.ietf.org/rfc/rfc2781.txt.)
ISO 639
(International Organization for Standardization). ISO 639:1988 (E). Code for the representation of names of languages. [Geneva]: International Organization for Standardization, 1988.
ISO 3166
(International Organization for Standardization). ISO 3166-1:1997 (E). Codes for the representation of names of countries and their subdivisions — Part 1: Country codes [Geneva]: International Organization for Standardization, 1997.
ISO 8879
ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). First edition — 1986-10-15. [Geneva]: International Organization for Standardization, 1986.
ISO/IEC 10744
ISO (International Organization for Standardization). ISO/IEC 10744-1992 (E). Information technology — Hypermedia/Time-based Structuring Language (HyTime). [Geneva]: International Organization for Standardization, 1992. Extended Facilities Annexe. [Geneva]: International Organization for Standardization, 1996.
WEBSGML
ISO (International Organization for Standardization). ISO 8879:1986 TC2. Information technology — Document Description and Processing Languages. [Geneva]: International Organization for Standardization, 1998. (See http://www.sgmlsource.com/8879/n0029.htm.)
XML Names
Tim Bray, Dave Hollander, and Andrew Layman, editors. Namespaces in XML. Textuality, Hewlett-Packard, and Microsoft. World Wide Web Consortium, 1999. (See http://www.w3.org/TR/xml-names/.)

B Character Classes

Because of changes to productions [4] and [5], the productions in this Appendix are now orphaned and not used anymore in determining name characters. This Appendix may be removed in a future edition of this specification; other specifications that wish to refer to the productions herein should do so by means of a reference to the relevant production(s) in the Fourth Edition of this specification.

Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet), ideographic characters, and combining characters (among others, this class contains most diacritics). Digits and extenders are also distinguished.

Characters
[84]   Letter   ::=   BaseChar | Ideographic
[85]   BaseChar   ::=   [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]| [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E]| [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0]| [#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1]| #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1]| [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC| #x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C]| [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4]| [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5]| [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 | [#x0561-#x0586]| [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A]| [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3]| #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D| [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8]| [#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD]| [#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10]| [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36]| [#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74]| [#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8]| [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD| #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28]| [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D| [#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90]| [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F]| [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9]| [#x0C05-#x0C0C] | [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33]| [#x0C35-#x0C39] | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90]| [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE| [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28]| [#x0D2A-#x0D39] | [#x0D60-#x0D61] | [#x0E01-#x0E2E] | #x0E30| [#x0E32-#x0E33] | 
[#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84| [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97]| [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7| [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3]| #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69]| [#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103]| [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112]| #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150| [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163| #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173]| #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF]| [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB| #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9]| [#x1F00-#x1F15] | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D]| [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D]| [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4]| [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC]| [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B]| #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA]| [#x3105-#x312C] | [#xAC00-#xD7A3]
[86]   Ideographic   ::=   [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
[87]   CombiningChar   ::=   [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486]| [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF| [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670| [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8]| [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C]| #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983]| #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8]| [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02| #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48]| [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC| [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03]| #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D]| [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8]| [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44]| [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83]| [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6]| [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D]| #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E]| #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD]| [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E| #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95]| #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9| [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099| #x309A
[88]   Digit   ::=   [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9]| [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF]| [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF]| [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
[89]   Extender   ::=   #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640| #x0E46 | #x0EC6 | #x3005 | [#x3031-#x3035] | [#x309D-#x309E]| [#x30FC-#x30FE]

The character classes defined here can be derived from the Unicode 2.0 character database as follows:

C XML and SGML (Non-Normative)

XML is designed to be a subset of SGML, in that every XML document should also be a conforming SGML document. For a detailed comparison of the additional restrictions that XML places on documents beyond those of SGML, see [Clark].

D Expansion of Entity and Character References (Non-Normative)

This appendix contains some examples illustrating the sequence of entity- and character-reference recognition and expansion, as specified in 4.4 XML Processor Treatment of Entities and References.

If the DTD contains the declaration

<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped
numerically (&#38;#38;#38;) or with a general entity
(&amp;amp;).</p>" >

then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity "example":

<p>An ampersand (&#38;) may be escaped
numerically (&#38;#38;) or with a general entity
(&amp;amp;).</p>

A reference in the document to "&example;" will cause the text to be reparsed, at which time the start- and end-tags of the p element will be recognized and the three references will be recognized and expanded, resulting in a p element with the following content (all data, no delimiters or markup):

An ampersand (&) may be escaped
numerically (&#38;) or with a general entity
(&amp;).
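
The two-stage expansion described above can be observed with an existing parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, whose underlying expat parser expands internal general entities. The wrapping doc element and the variable names are hypothetical and are not part of the example above; the entity value is written on one line so that the resulting content contains spaces rather than line breaks.

```python
import xml.etree.ElementTree as ET

# The entity declaration from the example above, wrapped in a hypothetical
# <doc> element so that "&example;" can be referenced from document content.
DOCUMENT = (
    '<!DOCTYPE doc [\n'
    '<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped '
    'numerically (&#38;#38;#38;) or with a general entity '
    '(&amp;amp;).</p>" >\n'
    ']>\n'
    '<doc>&example;</doc>'
)

root = ET.fromstring(DOCUMENT)
p = root.find("p")   # the p element is recognized only when the entity is reparsed
content = p.text     # character data only, no delimiters or markup
print(content)
```

The printed content should match the fully expanded text shown above: one literal ampersand, then the character sequences "&#38;" and "&amp;" as plain data.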

A more complex example will illustrate the rules and their effects fully. In the following example, the line numbers are solely for reference.

1 <?xml version='1.0'?>
2 <!DOCTYPE test [
3 <!ELEMENT test (#PCDATA) >
4 <!ENTITY % xx '&#37;zz;'>
5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
6 %xx;
7 ]>
8 <test>This sample shows a &tricky; method.</test>

This produces the following:

In the following example

<!DOCTYPE foo [
<!ENTITY x "&lt;">
]>
<foo attr="&x;"/>

the replacement text of x is the four characters "&lt;" because references to general entities in entity values are bypassed. The replacement text of lt is a character reference to the less-than character, for example the five characters "&#60;" (see 4.6 Predefined Entities). Since neither of these contains a less-than character the result is well-formed.

If the definition of x had been

<!ENTITY x "&#60;">

then the document would not have been well-formed, because the replacement text of x would be the single character "<" which is not permitted in attribute values (see WFC: No < in Attribute Values).
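
The difference between the two definitions of x can be checked mechanically. This is a minimal sketch, again assuming Python's standard xml.etree.ElementTree module as the parser; the helper name attr_or_error is hypothetical.

```python
import xml.etree.ElementTree as ET

# First definition: the replacement text of x is the four characters "&lt;".
WELL_FORMED = '<!DOCTYPE foo [ <!ENTITY x "&lt;"> ]> <foo attr="&x;"/>'
# Second definition: the replacement text of x is the single character "<".
NOT_WELL_FORMED = '<!DOCTYPE foo [ <!ENTITY x "&#60;"> ]> <foo attr="&x;"/>'

def attr_or_error(document):
    """Return the value of attr, or None if the parser rejects the document."""
    try:
        return ET.fromstring(document).get("attr")
    except ET.ParseError:
        return None

print(attr_or_error(WELL_FORMED))
print(attr_or_error(NOT_WELL_FORMED))
```

With the first definition the attribute value ends up as a less-than character, reached only through the predefined entity; with the second the parser is expected to report a well-formedness error.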

E Deterministic Content Models (Non-Normative)

As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.

For example, the content model ((b, c) | (b, d)) is non-deterministic, because given an initial b the XML processor cannot know which b in the model is being matched without looking ahead to see which element follows the b. In this case, the two references to b can be collapsed into a single reference, making the model read (b, (c | d)). An initial b now clearly matches only a single name in the content model. The processor doesn't need to look ahead to see what follows; either c or d would be accepted.

More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.
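
The follow-set test can be sketched in a few lines. The code below is an illustration only and is not a transcription of algorithm 3.5; it assumes a hypothetical representation in which a content model is either an element type name (a string) or a tuple whose first item is one of "seq", "alt", "opt", "star" or "plus". Besides the follow sets, it also checks the set of positions that can match the first child, since the same ambiguity can arise there.

```python
from itertools import count

def is_deterministic(model):
    """Return True if no first or follow set repeats an element type name."""
    names = {}    # position -> element type name of that leaf
    follow = {}   # position -> positions that may match the next element
    ids = count()

    def walk(node):
        """Return (nullable, first, last) for node and fill in follow sets."""
        if isinstance(node, str):              # leaf: one position
            pos = next(ids)
            names[pos] = node
            follow[pos] = set()
            return False, {pos}, {pos}
        op, *children = node
        parts = [walk(child) for child in children]
        if op == "alt":
            return (any(n for n, _, _ in parts),
                    set().union(*(f for _, f, _ in parts)),
                    set().union(*(l for _, _, l in parts)))
        if op == "seq":
            nullable, first, last = True, set(), set()
            for k_null, k_first, k_last in parts:
                for pos in last:               # whatever ended so far may be
                    follow[pos] |= k_first     # followed by this child's start
                if nullable:
                    first |= k_first
                last = (last | k_last) if k_null else set(k_last)
                nullable = nullable and k_null
            return nullable, first, last
        if op not in ("opt", "star", "plus"):
            raise ValueError("unknown operator: %r" % (op,))
        k_null, k_first, k_last = parts[0]
        if op in ("star", "plus"):             # repetition loops back
            for pos in k_last:
                follow[pos] |= k_first
        return (k_null if op == "plus" else True), k_first, k_last

    def clash(positions):
        seen = [names[pos] for pos in positions]
        return len(seen) != len(set(seen))

    _, first, _ = walk(model)
    return not clash(first) and not any(clash(f) for f in follow.values())

print(is_deterministic(("alt", ("seq", "b", "c"), ("seq", "b", "d"))))
print(is_deterministic(("seq", "b", ("alt", "c", "d"))))
```

The first call corresponds to ((b, c) | (b, d)), where two positions labeled b compete for the initial element; the second corresponds to the collapsed model (b, (c | d)).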

Algorithms exist which allow many but not all non-deterministic content models to be reduced automatically to equivalent deterministic models; see Brüggemann-Klein 1991 [Brüggemann-Klein].

F Autodetection of Character Encodings (Non-Normative)

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use, which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information. We will consider these cases in turn.

F.1 Detection Without External Encoding Information

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.

With a Byte Order Mark:

00 00 FE FF    UCS-4, big-endian machine (1234 order)
FF FE 00 00    UCS-4, little-endian machine (4321 order)
00 00 FF FE    UCS-4, unusual octet order (2143)
FE FF 00 00    UCS-4, unusual octet order (3412)
FE FF ## ##    UTF-16, big-endian
FF FE ## ##    UTF-16, little-endian
EF BB BF       UTF-8

Without a Byte Order Mark:

00 00 00 3C / 3C 00 00 00 / 00 00 3C 00 / 00 3C 00 00    UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies.
00 3C 00 3F    UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
3C 00 3F 00    UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
3C 3F 78 6D    UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
4C 6F A7 94    EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
Other          UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind

Note:

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).
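
The two tables above translate almost directly into a lookup on the first four octets. The following is a minimal sketch of that first detection step only; the function name sniff_encoding_family and the family labels it returns are hypothetical, and reading the encoding declaration afterwards is left out.

```python
# Octet patterns from the tables above, keyed by the first four octets.
UCS4_BOMS = {
    bytes.fromhex("0000feff"): "UCS-4 (1234)",
    bytes.fromhex("fffe0000"): "UCS-4 (4321)",
    bytes.fromhex("0000fffe"): "UCS-4 (2143)",
    bytes.fromhex("feff0000"): "UCS-4 (3412)",
}
NO_BOM = {
    bytes.fromhex("0000003c"): "32-bit (1234)",
    bytes.fromhex("3c000000"): "32-bit (4321)",
    bytes.fromhex("00003c00"): "32-bit (2143)",
    bytes.fromhex("003c0000"): "32-bit (3412)",
    bytes.fromhex("003c003f"): "16-bit big-endian",
    bytes.fromhex("3c003f00"): "16-bit little-endian",
    bytes.fromhex("3c3f786d"): "ASCII-compatible",
    bytes.fromhex("4c6fa794"): "EBCDIC",
}

def sniff_encoding_family(data: bytes) -> str:
    """Guess the encoding family of an XML entity from its first octets."""
    head = data[:4]
    # The UCS-4 marks are tested first: FF FE 00 00 and FE FF 00 00 would
    # otherwise look like UTF-16, which is why ## ## may not both be 00.
    if head in UCS4_BOMS:
        return UCS4_BOMS[head]
    if head[:2] == b"\xfe\xff":
        return "UTF-16BE"
    if head[:2] == b"\xff\xfe":
        return "UTF-16LE"
    if head[:3] == b"\xef\xbb\xbf":
        return "UTF-8"
    # "Other": UTF-8 without a declaration, or a mislabeled or damaged stream.
    return NO_BOM.get(head, "other")

print(sniff_encoding_family(b"<?xml version='1.0' encoding='ISO-8859-1'?>"))
print(sniff_encoding_family("<?xml version='1.0'?>".encode("utf-16-le")))
```

A processor would use the returned family only to decode enough of the entity to read the encoding declaration, and would then switch to the specific encoding named there.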

Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.

Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.

F.2 Priorities in the Presence of External Encoding Information

The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 3023] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.

  • If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.

G W3C XML Working Group (Non-Normative)

This specification was prepared and approved for publication by the W3C XML Working Group (WG). WG approval of this specification does not necessarily imply that all WG members voted for its approval. The current and former participants of the XML WG are:

H W3C XML Core Working Group (Non-Normative)

The fifth edition of this specification was prepared by the W3C XML Core Working Group (WG). The participants in the WG at the time of publication of this edition were:

I Production Notes (Non-Normative)

This edition was encoded in a slightly modified version of the XMLspec DTD, v2.10. The XHTML versions were produced with a combination of the xmlspec.xsl, diffspec.xsl, and REC-xml.xsl XSLT stylesheets.

J Suggestions for XML Names (Non-Normative)

The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version should be used is left to the discretion of the document author or schema designer.

The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0 [Unicode], and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.

  1. The first character of any name should have a Unicode property of ID_Start, or else be '_' #x5F.

  2. Characters other than the first should have a Unicode property of ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, with the exception of "'" #x27 and "’" #x2019.

  3. Characters in names should be expressed using Normalization Form C as defined in [UnicodeNormal].

  4. Ideographic characters which have a canonical decomposition (including those in the ranges [#xF900-#xFAFF] and [#x2F800-#x2FFFD], with 12 exceptions) should not be used in names.

  5. Characters which have a compatibility decomposition (those with a "compatibility formatting tag" in field 5 of the Unicode Character Database -- marked by field 5 beginning with a "<") should not be used in names. This suggestion does not apply to characters which despite their compatibility decompositions are in regular use in their scripts, for example #x0E33 THAI CHARACTER SARA AM or #x0EB3 LAO CHARACTER AM.

  6. Combining characters meant for use with symbols only (including those in the ranges [#x20D0-#x20EF] and [#x1D165-#x1D1AD]) should not be used in names.

  7. The interlinear annotation characters ([#xFFF9-#xFFFB]) should not be used in names.

  8. Variation selector characters should not be used in names.

  9. Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names should not be employed.
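
Some of these suggestions lend themselves to an automated check. The sketch below is an approximation under stated assumptions, not a complete implementation: it covers only suggestions 1, 2, 3, 5 and 7; it relies on Python's str.isidentifier, which tests the XID_Start and XID_Continue properties rather than ID_Start and ID_Continue; it does not include the extra characters from the UAX #31 table mentioned in suggestion 2; and it uses whichever Unicode version ships with the interpreter. The function name name_suggestion_issues is hypothetical, and the name passed in is assumed to be non-empty.

```python
import unicodedata

# Characters named in suggestion 5 as being in regular use despite having
# compatibility decompositions (THAI CHARACTER SARA AM, LAO CHARACTER AM).
COMPAT_EXCEPTIONS = {"\u0E33", "\u0EB3"}

def name_suggestion_issues(name):
    """Return the numbers of the suggestions (1, 2, 3, 5, 7) a name misses."""
    issues = set()
    first, rest = name[0], name[1:]
    # Suggestion 1, approximated with XID_Start via str.isidentifier.
    if not (first == "_" or first.isidentifier()):
        issues.add(1)
    # Suggestion 2, approximated with XID_Continue; the UAX #31 table of
    # additional characters is not consulted here.
    if any(not ("_" + ch).isidentifier() for ch in rest):
        issues.add(2)
    # Suggestion 3: the name should already be in Normalization Form C.
    if not unicodedata.is_normalized("NFC", name):
        issues.add(3)
    for ch in name:
        # Suggestion 5: a decomposition field beginning with "<" marks a
        # compatibility decomposition.
        if (ch not in COMPAT_EXCEPTIONS
                and unicodedata.decomposition(ch).startswith("<")):
            issues.add(5)
        # Suggestion 7: interlinear annotation characters.
        if "\uFFF9" <= ch <= "\uFFFB":
            issues.add(7)
    return sorted(issues)

print(name_suggestion_issues("_foo"))
print(name_suggestion_issues("1abc"))
print(name_suggestion_issues("cafe\u0301"))
```

Names that pass this check may still fall short of suggestions 4, 6, 8 and 9, which need either further Unicode data or human judgment.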

