Extensible Markup Language (XML) 1.0 (Second Edition)

W3C Recommendation 6 October 2000

This version:: http://www.w3.org/TR/2000/REC-xml-20001006 (XHTML, XML, PDF, XHTML review version with color-coded revision indicators)
Latest version:: http://www.w3.org/TR/REC-xml
Previous versions:: http://www.w3.org/TR/2000/WD-xml-2e-20000814; http://www.w3.org/TR/1998/REC-xml-19980210
Editors:: Tim Bray, Textuality and Netscape <tbray@textuality.com>; Jean Paoli, Microsoft <jeanpa@microsoft.com>; C. M. Sperberg-McQueen, University of Illinois at Chicago and Text Encoding Initiative <cmsmcq@uic.edu>; Eve Maler, Sun Microsystems, Inc. <eve.maler@east.sun.com> - Second Edition

Abstract

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

Status of this Document

This document has been reviewed by W3C Members and other interested parties and has been endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web. It is a product of the W3C XML Activity, details of which can be found at http://www.w3.org/XML. The English version of this specification is the only normative version. However, for translations of this document, see http://www.w3.org/XML/#trans. A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.

This second edition is not a new version of XML (first published 10 February 1998); it merely incorporates the changes dictated by the first-edition errata (available at http://www.w3.org/XML/xml-19980210-errata) as a convenience to readers. The errata list for this second edition is available at http://www.w3.org/XML/xml-V10-2e-errata.

Please report errors in this document to xml-editor@w3.org; archives are available.

Note:

C. M. Sperberg-McQueen's affiliation has changed since the publication of the first edition. He is now at the World Wide Web Consortium, and can be contacted at cmsmcq@w3.org.

1 Introduction

Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

[Definition: A software module called an XML processor is used to read XML documents and provide access to their content and structure.] [Definition: It is assumed that an XML processor is doing its work on behalf of another module, called the application.] This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

1.1 Origin and Goals

XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the WG's contact with the W3C.

The design goals for XML are:

XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.

This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.

2 Documents

[Definition: A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.]

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures must nest properly, as described in 4.3.2 Well-Formed Parsed Entities.

2.1 Well-Formed XML Documents

[Definition: A textual object is a well-formed XML document if:]

Taken as a whole, it matches the production labeled document.
It meets all the well-formedness constraints given in this specification.
Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.

Document

[1] document ::= prolog element Misc*

Matching the document production implies that:

It contains one or more elements.
[Definition: There is exactly one element, called the root, or document element, no part of which appears in the content of any other element.] For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.

[Definition: As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.]

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char. The use of "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]), is discouraged.]

Character Range

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
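
As an illustration only, production [2] can be transcribed into a small range check. The function names `is_xml_char` and `first_illegal_char` are hypothetical helpers introduced for this sketch; they are not defined by this specification.

```python
def is_xml_char(code_point: int) -> bool:
    """Return True if the code point falls in one of the ranges of production [2] Char."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def first_illegal_char(text: str) -> int:
    """Return the index of the first character outside Char, or -1 if there is none."""
    for index, character in enumerate(text):
        if not is_xml_char(ord(character)):
            return index
    return -1
```

A caller might run `first_illegal_char` over decoded text before handing it to a serializer; the check says nothing about how the characters were encoded.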

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

2.3 Common Syntactic Constructs

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.

White Space

[3] S ::= (#x20 | #x9 | #xD | #xA)+

Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full definitions of the specific characters in each class are given in B Character Classes.

[Definition: A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.] Names beginning with the string "xml", or any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Note:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

An Nmtoken (name token) is any mixture of name characters.

Names and Tokens

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
[6] Names ::= Name (S Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (S Nmtoken)*
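
The following sketch approximates productions [5] and [7] with regular expressions. It assumes an ASCII-only stand-in for Letter and Digit and omits CombiningChar and Extender entirely, so it accepts only a subset of the names the grammar allows. The helper names are hypothetical.

```python
import re

# ASCII-only approximations; the full Letter, Digit, CombiningChar and
# Extender classes are defined in appendix B and are not reproduced here.
_NAME_START = r"[A-Za-z_:]"
_NAME_CHAR = r"[A-Za-z0-9._:\-]"

NAME_RE = re.compile(rf"{_NAME_START}{_NAME_CHAR}*")
NMTOKEN_RE = re.compile(rf"{_NAME_CHAR}+")
RESERVED_PREFIX_RE = re.compile(r"[Xx][Mm][Ll]")


def is_name(token: str) -> bool:
    """Approximate test for production [5] Name (ASCII subset only)."""
    return NAME_RE.fullmatch(token) is not None


def is_nmtoken(token: str) -> bool:
    """Approximate test for production [7] Nmtoken (ASCII subset only)."""
    return NMTOKEN_RE.fullmatch(token) is not None


def is_reserved_name(token: str) -> bool:
    """True for names that begin with 'xml' in any letter case."""
    return is_name(token) and RESERVED_PREFIX_RE.match(token) is not None
```

Note the difference the sketch makes visible: "1st" and "-x" are name tokens but not names, because a Name may not start with a digit or a hyphen.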

Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.

Literals

[9]	`EntityValue`	::=	`'"' ([^%&"] | PEReference | Reference)* '"'`
			`| "'" ([^%&'] | PEReference | Reference)* "'"`
[10]	`AttValue`	::=	`'"' ([^<&"] | Reference)* '"'`
			`| "'" ([^<&'] | Reference)* "'"`
[11]	`SystemLiteral`	::=	`('"' [^"]* '"') | ("'" [^']* "'")`
[12]	`PubidLiteral`	::=	`'"' PubidChar* '"' | "'" (PubidChar - "'")* "'"`
[13]	`PubidChar`	::=	`#x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]`

Note:

Although the EntityValue production allows the definition of an entity consisting of a single explicit < in the literal (e.g., <!ENTITY mylt "<">), it is strongly advised to avoid this practice since any reference to that entity will cause a well-formedness error.

2.4 Character Data and Markup

Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]

[Definition: All text that is not markup constitutes the character data of the document.]

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup. In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

Character Data

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
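
The escaping rules of this section can be sketched with the `escape` function from Python's standard `xml.sax.saxutils` module, which replaces `&`, `<`, and `>` and accepts a dictionary of further replacements. The wrapper names `escape_text` and `escape_attr_value` are hypothetical, and always escaping `>` is a simplification: the text above requires it only inside "]]>".

```python
from xml.sax.saxutils import escape


def escape_text(data: str) -> str:
    """Escape character data for use in element content."""
    # escape() rewrites '&' first, then '>' and '<', so existing
    # ampersands are not double-processed by the later replacements.
    return escape(data)


def escape_attr_value(data: str) -> str:
    """Escape character data for use inside either kind of quoted attribute value."""
    return escape(data, {"'": "&apos;", '"': "&quot;"})
```

For instance, `escape_text("]]>")` yields a string that can no longer be mistaken for the end of a CDATA section.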

2.5 Comments

[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor may, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) must not occur within comments.] Parameter entity references are not recognized within comments.

Comments

[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

An example of a comment:

<!-- declarations for <head> & <body> -->

Note that the grammar does not allow a comment ending in --->. The following example is not well-formed.

<!-- B+, B, or B--->
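
The two restrictions on comment text (no "--" inside, and no hyphen immediately before the closing delimiter) can be checked with a short predicate. This is a sketch under the assumption that the caller has already stripped the "<!--" and "-->" delimiters and verified that every character matches Char; `is_valid_comment_body` is a hypothetical name.

```python
def is_valid_comment_body(body: str) -> bool:
    """Check the text between '<!--' and '-->' against the hyphen rules for comments.

    A body is rejected if it contains '--' or if it ends with '-',
    since a trailing hyphen would make the comment end in '--->'.
    """
    return "--" not in body and not body.endswith("-")
```

Applied to the examples above, the body " declarations for <head> & <body> " is accepted and the body " B+, B, or B-" is rejected.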

2.6 Processing Instructions

[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]

Processing Instructions

[16]	`PI`	::=	`'<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'`
[17]	`PITarget`	::=	`Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))`

PIs are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references are not recognized within processing instructions.
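
One way to observe PIs being passed through to an application is Python's standard `xml.dom.minidom`, which exposes each PI as a node with `target` and `data` attributes. The sketch below collects only the PIs that are direct children of the document node; the function name and the "render" target are made up for the example.

```python
from xml.dom import minidom


def top_level_processing_instructions(xml_text: str) -> list:
    """Return (target, data) pairs for PIs that appear outside the document element."""
    document = minidom.parseString(xml_text)
    return [
        (node.target, node.data)
        for node in document.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    ]
```

The XML declaration does not show up in the result: although it looks like a PI, its target "xml" is excluded by production [17].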

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

CDATA Sections

[18]	`CDSect`	::=	`CDStart CData CDEnd`
[19]	`CDStart`	::=	`'<![CDATA['`
[20]	`CData`	::=	`(Char* - (Char* ']]>' Char*))`
[21]	`CDEnd`	::=	`']]>'`

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

An example of a CDATA section, in which "<greeting>" and "</greeting>" are recognized as character data, not markup:

<![CDATA[<greeting>Hello, world!</greeting>]]>
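
As a hedged illustration using Python's standard `xml.etree.ElementTree`, a CDATA section and the equivalent escaped text deliver the same character data to the application. The wrapping `doc` element and the function name `character_data` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

CDATA_FORM = "<doc><![CDATA[<greeting>Hello, world!</greeting>]]></doc>"
ESCAPED_FORM = "<doc>&lt;greeting&gt;Hello, world!&lt;/greeting&gt;</doc>"


def character_data(xml_text: str) -> str:
    """Return the character data directly inside the document element."""
    return ET.fromstring(xml_text).text or ""
```

In both forms the `doc` element has no child elements; the angle brackets arrive as ordinary characters.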

2.8 Prolog and Document Type Declaration

[Definition: XML documents should begin with an XML declaration which specifies the version of XML being used.] For example, the following is a complete XML document, well-formed but not valid:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

and so is this:

<greeting>Hello, world!</greeting>
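
A quick way to confirm that both forms are well-formed is to hand them to a non-validating parser. The sketch below uses Python's standard `xml.etree.ElementTree`; it checks well-formedness only, not validity, and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Text with a missing end-tag, mismatched tags, or a second top-level element is reported as not well-formed.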

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification. It is the intent of the XML working group to give later versions of this specification numbers other than "1.0", but this intent does not indicate a commitment to produce any future versions of XML, nor if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary. Processors may signal an error if they receive documents labeled with versions they do not support.

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. [Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.]

The document type declaration must appear before the first element in the document.

Prolog

[22]	`prolog`	::=	`XMLDecl? Misc* (doctypedecl Misc*)?`
[23]	`XMLDecl`	::=	`'<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'`
[24]	`VersionInfo`	::=	`S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')`
[25]	`Eq`	::=	`S? '=' S?`
[26]	`VersionNum`	::=	`([a-zA-Z0-9_.:] | '-')+`
[27]	`Misc`	::=	`Comment | PI | S`

[Definition: The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.]

[Definition: A markup declaration is an element type declaration, an attribute-list declaration, an entity declaration, or a notation declaration.] These declarations may be contained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For further information, see 4 Physical Structures.

Document Type Definition

[28]	`doctypedecl`	::=	`'<!DOCTYPE' S Name (S ExternalID)? S? ('[' (markupdecl | DeclSep)* ']' S?)? '>'`	[VC: Root Element Type]
				[WFC: External Subset]
[28a]	`DeclSep`	::=	`PEReference | S`	[WFC: PE Between Declarations]
[29]	`markupdecl`	::=	`elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment`	[VC: Proper Declaration/PE Nesting]
				[WFC: PEs in Internal Subset]

Note that it is possible to construct a well-formed document containing a doctypedecl that neither points to an external subset nor contains an internal subset.

The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.

Parameter entity references are recognized anywhere in the DTD (internal and external subsets and external parameter entities), except in literals, processing instructions, comments, and the contents of ignored conditional sections (see 3.4 Conditional Sections). They are also recognized in entity value literals. The use of parameter entities in the internal subset is restricted as described below.

Validity constraint: Root Element Type

The Name in the document type declaration must match the element type of the root element.

Validity constraint: Proper Declaration/PE Nesting

Parameter-entity replacement text must be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both must be contained in the same replacement text.

Well-formedness constraint: PEs in Internal Subset

In the internal DTD subset, parameter-entity references can occur only where markup declarations can occur, not within markup declarations. (This does not apply to references that occur in external parameter entities or to the external subset.)

Well-formedness constraint: External Subset

The external subset, if any, must match the production for extSubset.

Well-formedness constraint: PE Between Declarations

The replacement text of a parameter entity reference in a DeclSep must match the production extSubsetDecl.

2.9 Standalone Document Declaration

Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity or in parameter entities. [Definition: An external markup declaration is defined as a markup declaration occurring in the external subset or in a parameter entity (external or internal, the latter being included because non-validating processors are not required to read them).]

Standalone Document Declaration

[32] SDDecl ::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"')) [VC: Standalone Document Declaration]

In a standalone document declaration, the value "yes" indicates that there are no external markup declarations which affect the information passed from the XML processor to the application. The value "no" indicates that there are or may be such external markup declarations. Note that the standalone document declaration only denotes the presence of external declarations; the presence, in a document, of references to external entities, when those entities are internally declared, does not change its standalone status.

If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "no" is assumed.

Any XML document for which standalone="no" holds can be converted algorithmically to a standalone document, which may be desirable for some network delivery applications.

Validity constraint: Standalone Document Declaration

The standalone document declaration must have the value "no" if any external markup declarations contain declarations of:

attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or
entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or
attributes with values subject to normalization, where the attribute appears in the document with a value which will change as a result of normalization, or
element types with element content, if white space occurs directly within any instance of those types.

An example XML declaration with a standalone document declaration:

<?xml version="1.0" standalone='yes'?>
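
The XML declaration grammar (productions [23] through [26] and [32]) is regular, so it can be sketched as a single regular expression. The encoding part below is an assumption drawn from the encoding declaration described later in 4.3.3 rather than from the productions shown so far, and `parse_xml_declaration` is a hypothetical helper, not a replacement for a real XML processor.

```python
import re

_S = r"[ \t\r\n]+"               # production [3] S
_EQ = r"[ \t\r\n]*=[ \t\r\n]*"   # production [25] Eq

XML_DECL_RE = re.compile(
    r"<\?xml"
    # VersionInfo: the backreference forces matching quote characters.
    + _S + "version" + _EQ + r"(?P<vq>['\"])(?P<version>[A-Za-z0-9_.:\-]+)(?P=vq)"
    # Optional EncodingDecl (assumed shape; defined later in the specification).
    + r"(?:" + _S + "encoding" + _EQ
    + r"(?P<eq>['\"])(?P<encoding>[A-Za-z][A-Za-z0-9._\-]*)(?P=eq))?"
    # Optional SDDecl.
    + r"(?:" + _S + "standalone" + _EQ + r"(?P<sq>['\"])(?P<standalone>yes|no)(?P=sq))?"
    + r"[ \t\r\n]*\?>"
)


def parse_xml_declaration(text: str):
    """Return version, encoding and standalone values, or None if no declaration matches."""
    match = XML_DECL_RE.match(text)
    if match is None:
        return None
    return {key: match.group(key) for key in ("version", "encoding", "standalone")}
```

Parts that are absent from the declaration are reported as `None`, which leaves it to the caller to apply the rule that a missing standalone declaration is treated as "no" when external markup declarations exist.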

2.10 White Space Handling

In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.

An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must also inform the application which of these characters constitute white space appearing in element content.

A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose values are one or both of "default" and "preserve". For example:

<!ATTLIST poem  xml:space (default|preserve) 'preserve'>
<!ATTLIST pre xml:space (preserve) #FIXED 'preserve'>

The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.

The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
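
The normalization described above amounts to two replacements applied in a fixed order. A minimal sketch, with `normalize_line_breaks` as a hypothetical name:

```python
def normalize_line_breaks(text: str) -> str:
    """Translate #xD #xA pairs and lone #xD characters to a single #xA."""
    # Replace the two-character sequence first so that its #xD is not
    # also treated as a lone carriage return by the second replacement.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

The order matters: replacing lone carriage returns first would turn each #xD #xA pair into two line feeds.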

2.12 Language Identification

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 1766], Tags for the Identification of Languages, or its successor on the IETF Standards Track.

Note:

[IETF RFC 1766] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].

(Productions 33 through 38 have been removed.)

For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
  <l>Habe nun, ach! Philosophie,</l>
  <l>Juristerei, und Medizin</l>
  <l>und leider auch Theologie</l>
  <l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>

The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.
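
The inheritance rule can be sketched as a tree walk that carries the nearest enclosing xml:lang value downward. The example relies on Python's standard `xml.etree.ElementTree`, which reports the attribute under the key `{http://www.w3.org/XML/1998/namespace}lang`; that key comes from the parser's namespace handling, not from this specification, and `effective_languages` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# How ElementTree spells the xml:lang attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (tag, language) for the element and its descendants in document order."""
    # An xml:lang on the element itself overrides the inherited value.
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)
```

Elements with no xml:lang on themselves or any ancestor are reported with `None`, reflecting that no language has been declared for them.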

A simple declaration for xml:lang might take the form

xml:lang NMTOKEN #IMPLIED

but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:

<!ATTLIST poem   xml:lang NMTOKEN 'fr'>
<!ATTLIST gloss  xml:lang NMTOKEN 'en'>
<!ATTLIST note   xml:lang NMTOKEN 'en'>

3 Logical Structures

[Definition: Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its "generic identifier" (GI), and may have a set of attribute specifications.] Each attribute specification has a name and a value.

Element

[39] element ::= EmptyElemTag | STag content ETag [WFC: Element Type Match] [VC: Element Valid]

This specification does not constrain the semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.

Well-formedness constraint: Element Type Match

The Name in an element's end-tag must match the element type in the start-tag.

3.1 Start-Tags, End-Tags, and Empty-Element Tags

[Definition: The beginning of every non-empty XML element is marked by a start-tag.]

Start-tag

[40] STag ::= '<' Name (S Attribute)* S? '>' [WFC: Unique Att Spec]
[41] Attribute ::= Name Eq AttValue [VC: Attribute Value Type] [WFC: No External Entity References] [WFC: No < in Attribute Values]

The Name in the start- and end-tags gives the element's type. [Definition: The Name-AttValue pairs are referred to as the attribute specifications of the element], [Definition: with the Name in each pair referred to as the attribute name] and [Definition: the content of the AttValue (the text between the ' or " delimiters) as the attribute value.] Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.

Well-formedness constraint: Unique Att Spec

No attribute name may appear more than once in the same start-tag or empty-element tag.
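
Both points, that attribute order carries no meaning and that a repeated attribute name is a well-formedness error, can be observed with Python's standard `xml.etree.ElementTree`. This is a sketch; the element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET


def attributes_of(start_tag_document: str):
    """Return the attribute specifications as a dict, or None if the text is not well-formed."""
    try:
        return dict(ET.fromstring(start_tag_document).attrib)
    except ET.ParseError:
        # Raised, among other cases, when an attribute name is repeated.
        return None
```

Two documents that differ only in attribute order produce equal dictionaries, while a tag that repeats an attribute name yields `None`.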

Validity constraint: Attribute Value Type

The attribute must have been declared; the value must be of the type declared for it. (For attribute types, see 3.3 Attribute-List Declarations.)

Well-formedness constraint: No External Entity References

Attribute values cannot contain direct or indirect entity references to external entities.

3.2 Element Type Declarations

The element structure of an XML document may, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content.

Element type declarations often constrain which element types can appear as children of the element. At user option, an XML processor may issue a warning when a declaration mentions an element type for which no declaration is provided, but this is not an error.

[Definition: An element type declaration takes the form:]

Element Type Declaration

[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>' [VC: Unique Element Type Declaration]
[46] contentspec ::= 'EMPTY' | 'ANY' | Mixed | children

where the Name gives the element type being declared.

Validity constraint: Unique Element Type Declaration

No element type may be declared more than once.

Examples of element type declarations:

<!ELEMENT br EMPTY>
<!ELEMENT p (#PCDATA|emph)* >
<!ELEMENT %name.para; %content.para; >
<!ELEMENT container ANY>

3.2.1 Element Content

[Definition: An element type has element content when elements of that type must contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal S).] [Definition: In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear.] The grammar is built on content particles (cps), which consist of names, choice lists of content particles, or sequence lists of content particles:

Element-content Models

[47]	`children`	::=	`(choice | seq) ('?' | '*' | '+')?`
[48]	`cp`	::=	`(Name | choice | seq) ('?' | '*' | '+')?`
[49]	`choice`	::=	`'(' S? cp ( S? '|' S? cp )+ S? ')'`	[VC: Proper Group/PE Nesting]
[50]	`seq`	::=	`'(' S? cp ( S? ',' S? cp )* S? ')'`	[VC: Proper Group/PE Nesting]

where each Name is the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar; content particles occurring in a sequence list must each appear in the element content in the order given in the list. The optional character following a name or list governs whether the element or the content particles in the list may occur one or more (+), zero or more (*), or zero or one times (?). The absence of such an operator means that the element or content particle must appear exactly once. This syntax and meaning are identical to those used in the productions in this specification.

The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if an element in the document can match more than one occurrence of an element type in the content model. For more information, see E Deterministic Content Models.

Validity constraint: Proper Group/PE Nesting

Parameter-entity replacement text must be properly nested with parenthesized groups. That is to say, if either of the opening or closing parentheses in a choice, seq, or Mixed construct is contained in the replacement text for a parameter entity, both must be contained in the same replacement text.

For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text should contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,).

Examples of element-content models:

<!ELEMENT spec (front, body, back?)>
<!ELEMENT div1 (head, (p | list | note)*, div2*)>
<!ELEMENT dictionary-body (%div.mix; | %dict.mix;)*>
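As an informal illustration of how element content can be matched against a content model, the following Python sketch translates a model into a regular expression over the sequence of child element types. The function names `content_model_to_regex` and `matches` are hypothetical, names are restricted to an ASCII start character for brevity, and parameter-entity references are assumed to have been expanded already. The sketch does not check the determinism rule.

```python
import re

def content_model_to_regex(model):
    """Translate a model such as "(front, body, back?)" into a regex
    over a string of child names, each followed by one space."""
    parts = []
    # Tokens: names, parentheses, connectors, occurrence operators.
    for tok in re.findall(r"[A-Za-z_:][\w.:-]*|[()|,?*+]", model):
        if tok == ",":
            continue                      # a sequence is plain concatenation
        elif tok == "(":
            parts.append("(?:")
        elif tok in ")|?*+":
            parts.append(tok)             # same meaning in regex syntax
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def matches(model, children):
    """True if the list of child element types satisfies the model."""
    text = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(text) is not None

print(matches("(front, body, back?)", ["front", "body"]))
print(matches("(head, (p | list | note)*, div2*)", ["head", "p", "note", "div2"]))
print(matches("(front, body, back?)", ["body", "front"]))
```

The trailing space after each name keeps a type such as `p` from matching the prefix of a longer name.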

3.2.2 Mixed Content

[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.] In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:

Mixed-content Declaration

[51]	`Mixed`	::=	`'(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'`
			`| '(' S? '#PCDATA' S? ')'`	[VC: Proper Group/PE Nesting]
				[VC: No Duplicate Types]

where the Names give the types of elements that may appear as children. The keyword #PCDATA derives historically from the term "parsed character data."

Validity constraint: No Duplicate Types

The same name must not appear more than once in a single mixed-content declaration.

Examples of mixed content declarations:

<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
<!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
<!ELEMENT b (#PCDATA)>

3.3 Attribute-List Declarations

Attributes are used to associate name-value pairs with elements. Attribute specifications may appear only within start-tags and empty-element tags; thus, the productions used to recognize them appear in 3.1 Start-Tags, End-Tags, and Empty-Element Tags. Attribute-list declarations may be used:

To define the set of attributes pertaining to a given element type.
To establish type constraints for these attributes.
To provide default values for attributes.

[Definition: Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:]

Attribute-list Declaration

[52]	`AttlistDecl`	::=	`'<!ATTLIST' S Name AttDef* S? '>'`
[53]	`AttDef`	::=	`S Name S AttType S DefaultDecl`

The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor may issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.

When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration. For interoperability, an XML processor may at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
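The merging rule can be pictured with a small Python sketch. The tuple layout and the function name `merge_attlists` are assumptions made for illustration only; a real processor would work on parsed declarations rather than strings.

```python
def merge_attlists(declarations):
    """declarations: (element_type, attribute_name, definition) tuples
    in document order. Returns {element_type: {attribute_name: definition}}."""
    merged = {}
    for element_type, attr_name, definition in declarations:
        attrs = merged.setdefault(element_type, {})
        # The first definition of an attribute is binding; later ones are ignored.
        attrs.setdefault(attr_name, definition)
    return merged

decls = [
    ("p", "id", "ID #IMPLIED"),
    ("p", "class", "CDATA #IMPLIED"),
    ("p", "id", "CDATA #REQUIRED"),   # ignored: "id" is already defined for "p"
]
print(merge_attlists(decls))
```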

3.3.1 Attribute Types

XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types have varying lexical and semantic constraints. The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3 Attribute-List Declarations.

Attribute Types

[54]	`AttType`	::=	`StringType | TokenizedType | EnumeratedType`
[55]	`StringType`	::=	`'CDATA'`
[56]	`TokenizedType`	::=	`'ID'`	[VC: ID]
				[VC: One ID per Element Type]
				[VC: ID Attribute Default]
			`| 'IDREF'`	[VC: IDREF]
			`| 'IDREFS'`	[VC: IDREF]
			`| 'ENTITY'`	[VC: Entity Name]
			`| 'ENTITIES'`	[VC: Entity Name]
			`| 'NMTOKEN'`	[VC: Name Token]
			`| 'NMTOKENS'`	[VC: Name Token]

Validity constraint: ID

Values of type ID must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.

Validity constraint: One ID per Element Type

No element type may have more than one ID attribute specified.

Validity constraint: ID Attribute Default

An ID attribute must have a declared default of #IMPLIED or #REQUIRED.

Validity constraint: IDREF

Values of type IDREF must match the Name production, and values of type IDREFS must match Names; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.
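A rough Python sketch of the uniqueness and reference checks follows. It assumes, purely for illustration, that an attribute named `id` has been declared with type ID and an attribute named `ref` with type IDREF; the function name `check_ids` is likewise hypothetical. It does not verify that values match the Name production.

```python
import xml.etree.ElementTree as ET

def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return messages for duplicate ID values and unmatched IDREF values."""
    problems, seen = [], set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID: {value}")
            seen.add(value)
    # Second pass, so that forward references to later IDs are accepted.
    for elem in root.iter():
        value = elem.get(idref_attr)
        if value is not None and value not in seen:
            problems.append(f"IDREF without matching ID: {value}")
    return problems

doc = ET.fromstring('<doc><sec id="a"/><sec id="a"/><link ref="b"/><link ref="a"/></doc>')
print(check_ids(doc))
```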

Validity constraint: Entity Name

Values of type ENTITY must match the Name production, values of type ENTITIES must match Names; each Name must match the name of an unparsed entity declared in the DTD.

Validity constraint: Name Token

Values of type NMTOKEN must match the Nmtoken production; values of type NMTOKENS must match Nmtokens.

[Definition: Enumerated attributes can take one of a list of values provided in the declaration]. There are two kinds of enumerated types:

Enumerated Attribute Types

[57]	`EnumeratedType`	::=	`NotationType | Enumeration`
[58]	`NotationType`	::=	`'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')'`	[VC: Notation Attributes]
				[VC: One Notation Per Element Type]
				[VC: No Notation on Empty Element]
[59]	`Enumeration`	::=	`'(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')'`	[VC: Enumeration]

A NOTATION attribute identifies a notation, declared in the DTD with associated system and/or public identifiers, to be used in interpreting the element to which the attribute is attached.

Validity constraint: Notation Attributes

Values of this type must match one of the notation names included in the declaration; all notation names in the declaration must be declared.

Validity constraint: One Notation Per Element Type

No element type may have more than one NOTATION attribute specified.

Validity constraint: No Notation on Empty Element

For compatibility, an attribute of type NOTATION must not be declared on an element declared EMPTY.

Validity constraint: Enumeration

Values of this type must match one of the Nmtoken tokens in the declaration.

For interoperability, the same Nmtoken should not occur more than once in the enumerated attribute types of a single element type.

3.3.2 Attribute Defaults

An attribute declaration provides information on whether the attribute's presence is required, and if not, how an XML processor should react if a declared attribute is absent in a document.

Attribute Defaults

[60]	`DefaultDecl`	::=	`'#REQUIRED' | '#IMPLIED'`
			`| (('#FIXED' S)? AttValue)`	[VC: Required Attribute]
				[VC: Attribute Default Legal]
				[WFC: No < in Attribute Values]
				[VC: Fixed Attribute Default]

In an attribute declaration, #REQUIRED means that the attribute must always be provided, #IMPLIED that no default value is provided. [Definition: If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute must always have the default value. If a default value is declared, when an XML processor encounters an omitted attribute, it is to behave as though the attribute were present with the declared default value.]

Validity constraint: Required Attribute

If the default declaration is the keyword #REQUIRED, then the attribute must be specified for all elements of the type in the attribute-list declaration.

Validity constraint: Attribute Default Legal

The declared default value must meet the lexical constraints of the declared attribute type.
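The effect of the default declarations on an element's attributes might be sketched as follows. The representation of declarations as `(kind, default)` pairs, the label `"DEFAULT"` for a plain default value, and the function name `apply_defaults` are all assumptions made for this illustration.

```python
def apply_defaults(specified, declarations):
    """specified: {name: value} taken from a start-tag.
    declarations: {name: (kind, default)}, kind being "#REQUIRED",
    "#IMPLIED", "#FIXED", or "DEFAULT" (a plain declared default)."""
    result, errors = dict(specified), []
    for name, (kind, default) in declarations.items():
        if name in specified:
            if kind == "#FIXED" and specified[name] != default:
                errors.append(f"{name}: must equal the fixed value {default!r}")
        elif kind == "#REQUIRED":
            errors.append(f"{name}: required attribute is missing")
        elif kind in ("#FIXED", "DEFAULT"):
            result[name] = default   # behave as though the attribute were present
        # "#IMPLIED": no default value is provided; the attribute stays absent
    return result, errors

decls = {
    "type": ("DEFAULT", "bullets"),
    "id": ("#REQUIRED", None),
    "lang": ("#IMPLIED", None),
    "method": ("#FIXED", "POST"),
}
print(apply_defaults({"id": "x1"}, decls))
```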

3.3.3 Attribute-Value Normalization

Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize the attribute value by applying the algorithm below, or by using some other method such that the value passed to the application is the same as that produced by the algorithm.

1. All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
2. Begin with a normalized value consisting of the empty string.
3. For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
- For a character reference, append the referenced character to the normalized value.
- For an entity reference, recursively apply step 3 of this algorithm to the replacement text of the entity.
- For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
- For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.

All attributes for which no declaration has been read should be treated by a non-validating processor as if declared CDATA.

Following are examples of attribute normalization. Given the following declarations:

<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">

the attribute specifications in the left column below would be normalized to the character sequences of the middle column if the attribute a is declared NMTOKENS and to those of the right columns if a is declared CDATA.

Attribute specification	a is NMTOKENS	a is CDATA
a="xyz" (with two line breaks preceding xyz inside the quotes)	x y z	#x20 #x20 x y z
a="&d;&d;A&a;&a;B&da;"	A #x20 B	#x20 #x20 A #x20 #x20 B #x20 #x20
a="&#xd;&#xd;A&#xa;&#xa;B&#xd;&#xa;"	#xD #xD A #xA #xA B #xD #xA	#xD #xD A #xA #xA B #xD #xA

Note that the last example is invalid (but well-formed) if a is declared to be of type NMTOKENS.
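The normalization algorithm of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming implementation: the names `normalize_attribute` and `_expand` are hypothetical, the `entities` mapping is assumed to hold each entity's replacement text (so the entity d declared as "&#xD;" is stored as a carriage return), and line-break normalization (step 1) is assumed to have happened already.

```python
import re

# One token: hexadecimal character reference, decimal character reference,
# entity reference, or a single character.
_TOKEN = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&([^;&]+);|(.)", re.DOTALL)

def _expand(text, entities, out):
    for hex_ref, dec_ref, name, char in _TOKEN.findall(text):
        if hex_ref:
            out.append(chr(int(hex_ref, 16)))       # referenced character, unchanged
        elif dec_ref:
            out.append(chr(int(dec_ref)))
        elif name:
            _expand(entities[name], entities, out)  # recurse into the replacement text
        elif char in " \r\n\t":
            out.append(" ")                         # literal white space becomes #x20
        else:
            out.append(char)

def normalize_attribute(raw, entities, is_cdata=True):
    out = []
    _expand(raw, entities, out)
    value = "".join(out)
    if not is_cdata:
        # Trim leading and trailing spaces, collapse runs of spaces.
        value = re.sub(" +", " ", value.strip(" "))
    return value

entities = {"d": "\r", "a": "\n", "da": "\r\n"}
print(repr(normalize_attribute("&d;&d;A&a;&a;B&da;", entities)))
print(repr(normalize_attribute("&d;&d;A&a;&a;B&da;", entities, is_cdata=False)))
```

Only space characters (#x20) are trimmed and collapsed in the non-CDATA step, which is why white space introduced by character references survives it.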

Validity constraint: Proper Conditional Section/PE Nesting

If any of the "<![", "[", or "]]>" of a conditional section is contained in the replacement text for a parameter-entity reference, all of them must be contained in the same replacement text.

4 Physical Structures

[Definition: An XML document may consist of one or many storage units. These are called entities; they all have content and are all (except for the document entity and the external DTD subset) identified by entity name.] Each XML document has one entity called the document entity, which serves as the starting point for the XML processor and may contain the whole document.

Entities may be either parsed or unparsed. [Definition: A parsed entity's contents are referred to as its replacement text; this text is considered an integral part of the document.]

[Definition: An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.]

Parsed entities are invoked by name using entity references; unparsed entities by name, given in the value of ENTITY or ENTITIES attributes.

[Definition: General entities are entities for use within the document content. In this specification, general entities are sometimes referred to with the unqualified term entity when this leads to no ambiguity.] [Definition: Parameter entities are parsed entities for use within the DTD.] These two types of entities use different forms of reference and are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities.

4.1 Character and Entity References

[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]

Character Reference

[66]	`CharRef`	::=	`'&#' [0-9]+ ';'`
			`| '&#x' [0-9a-fA-F]+ ';'`	[WFC: Legal Character]

Well-formedness constraint: Legal Character

Characters referred to using character references must match the production for Char.

If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.
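As an informal illustration, a character reference can be decoded and checked against the Legal Character constraint along these lines. The function names `is_xml_char` and `decode_char_ref` are hypothetical; the code-point ranges are those of the Char production defined elsewhere in this specification.

```python
import re

_CHAR_REF = re.compile(r"&#([0-9]+);|&#x([0-9a-fA-F]+);")

def is_xml_char(cp):
    """True if the code point matches the Char production."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

def decode_char_ref(ref):
    m = _CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a character reference: {ref!r}")
    # Group 1 holds decimal digits, group 2 hexadecimal digits.
    cp = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
    if not is_xml_char(cp):
        raise ValueError(f"violates Legal Character: {ref!r}")
    return chr(cp)

print(decode_char_ref("&#x3C;"), decode_char_ref("&#60;"))
```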

[Definition: An entity reference refers to the content of a named entity.] [Definition: References to parsed general entities use ampersand (&) and semicolon (;) as delimiters.] [Definition: Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters.]

Entity Reference

[67]	`Reference`	::=	`EntityRef | CharRef`
[68]	`EntityRef`	::=	`'&' Name ';'`	[WFC: Entity Declared]
				[VC: Entity Declared]
				[WFC: Parsed Entity]
				[WFC: No Recursion]
[69]	`PEReference`	::=	`'%' Name ';'`	[VC: Entity Declared]
				[WFC: No Recursion]
				[WFC: In DTD]

Well-formedness constraint: Entity Declared

In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with "standalone='yes'", for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference must match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity must precede any reference to it which appears in a default value in an attribute-list declaration.

Note that if entities are declared in the external subset or in external parameter entities, a non-validating processor is not obligated to read and process their declarations; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.

Validity constraint: Entity Declared

In a document with an external subset or external parameter entities with "standalone='no'", the Name given in the entity reference must match that in an entity declaration. For interoperability, valid documents should declare the entities amp, lt, gt, apos, quot, in the form specified in 4.6 Predefined Entities. The declaration of a parameter entity must precede any reference to it. Similarly, the declaration of a general entity must precede any attribute-list declaration containing a default value with a direct or indirect reference to that general entity.

Well-formedness constraint: Parsed Entity

An entity reference must not contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.

Well-formedness constraint: No Recursion

A parsed entity must not contain a recursive reference to itself, either directly or indirectly.
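One way to picture the No Recursion constraint is as a cycle check over the graph of entity references. The sketch below handles general entities only and assumes a mapping from entity name to replacement text; the function name `find_recursion` is hypothetical.

```python
import re

# General-entity references; character references (starting with "#") are skipped.
_ENTITY_REF = re.compile(r"&([^#;&\s][^;&\s]*);")

def find_recursion(entities):
    """entities: {name: replacement text}. Returns the names forming a
    reference cycle, or None if no entity refers to itself."""
    def visit(name, path):
        if name in path:
            return path[path.index(name):] + [name]
        for ref in _ENTITY_REF.findall(entities.get(name, "")):
            cycle = visit(ref, path + [name])
            if cycle:
                return cycle
        return None

    for name in entities:
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None

print(find_recursion({"a": "x &b; y", "b": "&c;", "c": "&a;"}))
```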

Well-formedness constraint: In DTD

Parameter-entity references may only appear in the DTD.

Examples of character and entity references:

Type <key>less-than</key> (&#x3C;) to save options.
This document was prepared on &docdate; and
is classified &security-level;.

Example of a parameter-entity reference:

<!-- declare the parameter entity "ISOLat2"... -->
<!ENTITY % ISOLat2
         SYSTEM "http://www.xml.com/iso/isolat2-xml.entities" >
<!-- ... now reference it. -->
%ISOLat2;

4.2 Entity Declarations

[Definition: Entities are declared thus:]

Entity Declaration

[70]	`EntityDecl`	::=	`GEDecl | PEDecl`
[71]	`GEDecl`	::=	`'<!ENTITY' S Name S EntityDef S? '>'`
[72]	`PEDecl`	::=	`'<!ENTITY' S '%' S Name S PEDef S? '>'`
[73]	`EntityDef`	::=	`EntityValue | (ExternalID NDataDecl?)`
[74]	`PEDef`	::=	`EntityValue | ExternalID`

The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

4.2.1 Internal Entities

[Definition: If the entity definition is an EntityValue, the defined entity is called an internal entity. There is no separate physical storage object, and the content of the entity is given in the declaration.] Note that some processing of entity and character references in the literal entity value may be required to produce the correct replacement text: see 4.5 Construction of Internal Entity Replacement Text.

An internal entity is a parsed entity.

Example of an internal entity declaration:

<!ENTITY Pub-Status "This is a pre-release of the specification.">
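To see an internal entity in use, the declaration can be placed in an internal DTD subset and referenced from content. The sketch below uses Python's standard `xml.etree.ElementTree` module; the element type `note` is an arbitrary choice for the illustration.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE note [
  <!ENTITY Pub-Status "This is a pre-release of the specification.">
]>
<note>&Pub-Status;</note>"""

root = ET.fromstring(document)
# The reference is replaced by the entity's replacement text during parsing.
print(root.text)
```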

4.2.2 External Entities

[Definition: If the entity is not internal, it is an external entity, declared as follows:]

External Entity Declaration

[75]	`ExternalID`	::=	`'SYSTEM' S SystemLiteral`
			`| 'PUBLIC' S PubidLiteral S SystemLiteral`
[76]	`NDataDecl`	::=	`S 'NDATA' S Name`	[VC: Notation Declared]

If the NDataDecl is present, this is a general unparsed entity; otherwise it is a parsed entity.

Validity constraint: Notation Declared

The Name must match the declared name of a notation.

[Definition: The SystemLiteral is called the entity's system identifier. It is a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), meant to be dereferenced to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.

URI references require encoding and escaping of certain characters. The disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) characters and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as follows:

Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes.
Any octets corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
The original character is replaced by the resulting character sequence.
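The escaping steps above can be sketched in Python as follows. The set of excluded ASCII characters is written out from RFC 2396 (control characters, space, and the "delims" and "unwise" characters) with #, %, and the square brackets removed, as described in the text; the function name `escape_system_identifier` is hypothetical.

```python
# ASCII characters excluded by RFC 2396, minus "#", "%", "[" and "]".
_EXCLUDED = set('<>"{}|\\^` ') | {chr(c) for c in range(0x20)} | {"\x7f"}

def escape_system_identifier(uri):
    out = []
    for ch in uri:
        if ord(ch) > 0x7F or ch in _EXCLUDED:
            # Convert the character to UTF-8, then escape each byte as %HH.
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)

print(escape_system_identifier("../grafix/Open Hatch.gif"))
```

Characters that are already part of an escape sequence, such as the percent sign in `%20`, are left untouched because the percent sign is not in the disallowed set.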

[Definition: In addition to a system identifier, an external identifier may include a public identifier.] An XML processor attempting to retrieve the entity's content may use the public identifier to try to generate an alternative URI reference. If the processor is unable to do so, it must use the URI reference specified in the system literal. Before a match is attempted, all strings of white space in the public identifier must be normalized to single space characters (#x20), and leading and trailing white space must be removed.
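The white-space normalization of a public identifier is simple enough to show directly. The function name `normalize_public_id` is an assumption made for this sketch.

```python
import re

def normalize_public_id(public_id):
    # Collapse each run of white space (#x20, #x9, #xD, #xA) to one space,
    # then remove leading and trailing white space.
    return re.sub(r"[ \t\r\n]+", " ", public_id).strip(" ")

print(normalize_public_id("  -//Textuality//TEXT Standard\n   open-hatch boilerplate//EN "))
```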

Examples of external entity declarations:

<!ENTITY open-hatch
         SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY open-hatch
         PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
         "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY hatch-pic
         SYSTEM "../grafix/OpenHatch.gif"
         NDATA gif >

4.3 Parsed Entities

4.3.1 The Text Declaration

External parsed entities should each begin with a text declaration.

Text Declaration

[77]	`TextDecl`	::=	`'<?xml' VersionInfo? EncodingDecl S? '?>'`

The text declaration must be provided literally, not by reference to a parsed entity. No text declaration may appear at any position other than the beginning of an external parsed entity. The text declaration in an external parsed entity is not considered part of its replacement text.

4.3.2 Well-Formed Parsed Entities

The document entity is well-formed if it matches the production labeled document. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. All external parameter entities are well-formed by definition.

Well-Formed External Parsed Entity

[78]	`extParsedEnt`	::=	`TextDecl? content`

An internal general parsed entity is well-formed if its replacement text matches the production labeled content. All internal parameter entities are well-formed by definition.

A consequence of well-formedness in entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.

4.3.3 Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors must be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.

Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.

Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:

Encoding Declaration

[80]	`EncodingDecl`	::=	`S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )`
[81]	`EncName`	::=	`[A-Za-z] ([A-Za-z0-9._] | '-')*`	/* Encoding name contains only Latin characters */

In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

It is a fatal error for a TextDecl to occur other than at the beginning of an external entity.

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
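The order in which the sources of encoding information are consulted might be sketched as below. This is a deliberately simplified illustration: the function name `sniff_encoding` and its `external` parameter are hypothetical, the declaration is only looked for in ASCII-compatible encodings, and cases such as UTF-16 without a Byte Order Mark are not handled.

```python
import re

# \x22 and \x27 are the double and single quote characters.
_DECL = re.compile(
    rb"<\?xml[^>]*?\sencoding\s*=\s*[\x22\x27]([A-Za-z][A-Za-z0-9._-]*)[\x22\x27]"
)

def sniff_encoding(data, external=None):
    """Guess the encoding of a parsed entity from its first bytes."""
    if external:                          # e.g. a charset from a MIME header
        return external
    if data.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "UTF-16"                   # Byte Order Mark present
    match = _DECL.match(data)
    if match:                             # encoding declaration present
        return match.group(1).decode("ascii")
    return "UTF-8"                        # neither BOM nor declaration

print(sniff_encoding(b"<?xml encoding='EUC-JP'?><a/>"))
```

A caller would still need to compare the returned name case-insensitively against the encodings it supports and treat an unsupported one as a fatal error.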

Examples of text declarations containing encoding declarations:

<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>

4.4 XML Processor Treatment of Entities and References

The table below summarizes the contexts in which character references, entity references, and invocations of unparsed entities might appear and the required behavior of an XML processor in each case. The labels in the leftmost column describe the recognition context:

Reference in Content

as a reference anywhere after the start-tag and before the end-tag of an element; corresponds to the nonterminal content.

Reference in Attribute Value

as a reference within either the value of an attribute in a start-tag, or a default value in an attribute declaration; corresponds to the nonterminal AttValue.

Occurs as Attribute Value

as a Name, not a reference, appearing either as the value of an attribute which has been declared as type ENTITY, or as one of the space-separated tokens in the value of an attribute which has been declared as type ENTITIES.

Reference in Entity Value

as a reference within a parameter or internal entity's literal entity value in the entity's declaration; corresponds to the nonterminal EntityValue.

Reference in DTD

as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral, PubidLiteral, or the contents of an ignored conditional section (see 3.4 Conditional Sections).

Context	Parameter entity	Internal general entity	External parsed general entity	Unparsed entity	Character reference
Reference in Content	Not recognized	Included	Included if validating	Forbidden	Included
Reference in Attribute Value	Not recognized	Included in literal	Forbidden	Forbidden	Included
Occurs as Attribute Value	Not recognized	Forbidden	Forbidden	Notify	Not recognized
Reference in Entity Value	Included in literal	Bypassed	Bypassed	Forbidden	Included
Reference in DTD	Included as PE	Forbidden	Forbidden	Forbidden	Forbidden
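For implementers, the table lends itself to a direct lookup structure. The sketch below transcribes it into a Python dictionary; the names `ENTITY_TYPES`, `TREATMENT`, and `treatment` are hypothetical.

```python
ENTITY_TYPES = ("Parameter", "Internal General", "External Parsed General",
                "Unparsed", "Character")

# One row per recognition context, one column per entry of ENTITY_TYPES.
TREATMENT = {
    "Reference in Content":
        ("Not recognized", "Included", "Included if validating", "Forbidden", "Included"),
    "Reference in Attribute Value":
        ("Not recognized", "Included in literal", "Forbidden", "Forbidden", "Included"),
    "Occurs as Attribute Value":
        ("Not recognized", "Forbidden", "Forbidden", "Notify", "Not recognized"),
    "Reference in Entity Value":
        ("Included in literal", "Bypassed", "Bypassed", "Forbidden", "Included"),
    "Reference in DTD":
        ("Included as PE", "Forbidden", "Forbidden", "Forbidden", "Forbidden"),
}

def treatment(context, entity_type):
    """Look up the required behavior for a reference in a given context."""
    return TREATMENT[context][ENTITY_TYPES.index(entity_type)]

print(treatment("Occurs as Attribute Value", "Unparsed"))
```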

4.4.1 Not Recognized

Outside the DTD, the % character has no special significance; thus, what would be parameter entity references in the DTD are not recognized as markup in content. Similarly, the names of unparsed entities are not recognized except when they appear in the value of an appropriately declared attribute.

4.4.2 Included

[Definition: An entity is included when its replacement text is retrieved and processed, in place of the reference itself, as though it were part of the document at the location the reference was recognized.] The replacement text may contain both character data and (except for parameter entities) markup, which must be recognized in the usual way. (The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand is not recognized as an entity-reference delimiter.) A character reference is included when the indicated character is processed in place of the reference itself.

4.4.3 Included If Validating

When an XML processor recognizes a reference to a parsed entity, in order to validate the document, the processor must include its replacement text. If the entity is external, and the processor is not attempting to validate the XML document, the processor may, but need not, include the entity's replacement text. If a non-validating processor does not include the replacement text, it must inform the application that it recognized, but did not read, the entity.

This rule is based on the recognition that the automatic inclusion provided by the SGML and XML entity mechanism, primarily designed to support modularity in authoring, is not necessarily appropriate for other applications, in particular document browsing. Browsers, for example, when encountering an external parsed entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.

4.4.4 Forbidden

The following are forbidden, and constitute fatal errors:

the appearance of a reference to an unparsed entity.
the appearance of any character or general-entity reference in the DTD except within an EntityValue or AttValue.
a reference to an external entity in an attribute value.

4.4.5 Included in Literal

When an entity reference appears in an attribute value, or a parameter entity reference appears in a literal entity value, its replacement text is processed in place of the reference itself as though it were part of the document at the location the reference was recognized, except that a single or double quote character in the replacement text is always treated as a normal data character and will not terminate the literal. For example, this is well-formed:

<!--  -->
<!ENTITY % YN '"Yes"' >
<!ENTITY WhatHeSaid "He said %YN;" >

while this is not:

<!ENTITY EndAttr "27'" >
<element attribute='a-&EndAttr;>

4.4.6 Notify

When the name of an unparsed entity appears as a token in the value of an attribute of declared type ENTITY or ENTITIES, a validating processor must inform the application of the system and public (if any) identifiers for both the entity and its associated notation.

4.4.8 Included as PE

Just as with external parsed entities, parameter entities need only be included if validating. When a parameter-entity reference is recognized in the DTD and included, its replacement text is enlarged by the attachment of one leading and one following space (#x20) character; the intent is to constrain the replacement text of parameter entities to contain an integral number of grammatical tokens in the DTD. This behavior does not apply to parameter entity references within entity values; these are described in 4.4.5 Included in Literal.

4.6 Predefined Entities

[Definition: Entity and character references can both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]

All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is unnecessary but harmless). For example:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
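The reason for the double escaping can be traced with a small Python sketch. It models only the expansion of character references, which happens once when the literal entity value is turned into replacement text and once more when a reference to the entity is included; the function name `expand_char_refs` is hypothetical.

```python
import re

def expand_char_refs(text):
    """Replace each character reference once; the result is not rescanned."""
    def repl(m):
        return chr(int(m.group(1), 16) if m.group(1) else int(m.group(2)))
    return re.sub(r"&#x([0-9a-fA-F]+);|&#([0-9]+);", repl, text)

literal = "&#38;#60;"                       # literal entity value declared for lt
replacement = expand_char_refs(literal)     # replacement text kept for the entity
included = expand_char_refs(replacement)    # result of including a reference to lt
print(replacement, included)
```

With a singly escaped declaration the replacement text would be a bare less-than sign, which would be read as the start of markup when the entity is included.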

4.7 Notation Declarations

[Definition: Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed.]

[Definition: Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.]

Notation Declarations

[82]	`NotationDecl`	::=	`'<!NOTATION' S Name S (ExternalID \| PublicID) S? '>'`	[VC: Unique Notation Name]
[83]	`PublicID`	::=	`'PUBLIC' S PubidLiteral`

Validity constraint: Unique Notation Name

Only one notation declaration can declare a given Name.

XML processors must provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They may additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described. (It is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)
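One way a processor might hand notation names and external identifiers to an application is sketched below, using the NotationDeclHandler callback of Python's xml.parsers.expat module. The choice of that parser, the sample notations gif and tex, the public identifier shown, and the function name collect_notations are all assumptions made for illustration. Note that this parser reports every declared notation, not only those that are also referred to.

```python
import xml.parsers.expat

# Two illustrative declarations: one with a system identifier (ExternalID)
# and one with only a public identifier (PublicID, production [83]).
DOCUMENT = """<!DOCTYPE doc [
<!ELEMENT doc EMPTY>
<!NOTATION gif SYSTEM "image/gif">
<!NOTATION tex PUBLIC "-//Example//NOTATION TeX//EN">
]>
<doc/>"""


def collect_notations(document: str = DOCUMENT):
    """Return (name, system identifier, public identifier) per declaration."""
    notations = []

    def on_notation(name, base, system_id, public_id):
        # 'base' is the base URI set on the parser; it is unused here.
        notations.append((name, system_id, public_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.Parse(document, True)
    return notations


if __name__ == "__main__":
    for entry in collect_notations():
        print(entry)
```

Resolving the identifiers into a helper application is left to the caller, in line with the statement that such resolution is optional.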

4.8 Document Entity

[Definition: The document entity serves as the root of the entity tree and a starting-point for an XML processor.] This specification does not specify how the document entity is to be located by an XML processor; unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.

5 Conformance

5.1 Validating and Non-Validating Processors

Conforming XML processors fall into two classes: validating and non-validating.

Validating and non-validating processors alike must report violations of this specification's well-formedness constraints in the content of the document entity and any other parsed entities that they read.

[Definition: Validating processors must, at user option, report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification.] To accomplish this, validating XML processors must read and process the entire DTD and all external parsed entities referenced in the document.

Non-validating processors are required to check only the document entity, including the entire internal DTD subset, for well-formedness. [Definition: While they are not required to check the document for validity, they are required to process all the declarations they read in the internal DTD subset and in any parameter entity that they read, up to the first reference to a parameter entity that they do not read; that is to say, they must use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values.] Except when standalone="yes", they must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations.
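The obligations placed on a non-validating processor can be observed with a small experiment. The sketch below assumes Python's xml.etree.ElementTree (a non-validating parser built on Expat); the attribute name kind, the entity name who, and the function name read_with_internal_subset are hypothetical. It checks that declarations in the internal DTD subset are used to supply a default attribute value and to include the replacement text of an internal entity.

```python
import xml.etree.ElementTree as ET

# The internal subset declares a default attribute value and an internal
# general entity; no external entity or parameter entity is involved.
DOCUMENT = """<!DOCTYPE doc [
<!ATTLIST doc kind CDATA "memo">
<!ENTITY who "World">
]>
<doc>Hello, &who;!</doc>"""


def read_with_internal_subset(document: str = DOCUMENT):
    """Return the defaulted attribute value and the expanded content."""
    root = ET.fromstring(document)
    return root.get("kind"), root.text


if __name__ == "__main__":
    print(read_with_internal_subset())
```

Had the same declarations been placed in an external entity that the processor does not read, neither value would be guaranteed to reach the application.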

5.2 Using XML Processors

The behavior of a validating XML processor is highly predictable; it must read every piece of a document and report all well-formedness and validity violations. Less is required of a non-validating processor; it need not read any part of the document other than the document entity. This has two effects that may be important to users of XML processors:

Certain well-formedness errors, specifically those that require reading external entities, may not be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as forbidden in 4.4 XML Processor Treatment of Entities and References.
The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities. For example, a non-validating processor may not normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities.

For maximum reliability in interoperating between different XML processors, applications which use non-validating processors should not rely on any behaviors not required of such processors. Applications which require facilities such as the use of default attributes or internal entities which are declared in external entities should use validating XML processors.

A References

A.1 Normative References

IANA-CHARSETS: (Internet Assigned Numbers Authority) Official Names for Character Sets, ed. Keld Simonsen et al. See ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.
IETF RFC 1766: IETF (Internet Engineering Task Force). RFC 1766: Tags for the Identification of Languages, ed. H. Alvestrand. 1995. (See http://www.ietf.org/rfc/rfc1766.txt.)
ISO/IEC 10646: ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).
ISO/IEC 10646-2000: ISO (International Organization for Standardization). ISO/IEC 10646-1:2000. Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 2000.
Unicode: The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.
Unicode3: The Unicode Consortium. The Unicode Standard, Version 3.0. Reading, Mass.: Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5.

A.2 Other References

Aho/Ullman: Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988.
Berners-Lee et al.: Berners-Lee, T., R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax and Semantics. 1997. (Work in progress; see updates to RFC 1738.)
Brüggemann-Klein: Brüggemann-Klein, Anne. Formal Models in Document Processing. Habilitationsschrift. Faculty of Mathematics at the University of Freiburg, 1993. (See ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps.)
Brüggemann-Klein and Wood: Brüggemann-Klein, Anne, and Derick Wood. Deterministic Regular Languages. Universität Freiburg, Institut für Informatik, Bericht 38, Oktober 1991. Extended abstract in A. Finkel, M. Jantzen, Hrsg., STACS 1992, S. 173-184. Springer-Verlag, Berlin 1992. Lecture Notes in Computer Science 577. Full version titled One-Unambiguous Regular Languages in Information and Computation 140 (2): 229-253, February 1998.
Clark: James Clark. Comparison of SGML and XML. See http://www.w3.org/TR/NOTE-sgml-xml-971215.
IANA-LANGCODES: (Internet Assigned Numbers Authority) Registry of Language Tags, ed. Keld Simonsen et al. (See http://www.isi.edu/in-notes/iana/assignments/languages/.)
IETF RFC 2141: IETF (Internet Engineering Task Force). RFC 2141: URN Syntax, ed. R. Moats. 1997. (See http://www.ietf.org/rfc/rfc2141.txt.)
IETF RFC 2279: IETF (Internet Engineering Task Force). RFC 2279: UTF-8, a transformation format of ISO 10646, ed. F. Yergeau, 1998. (See http://www.ietf.org/rfc/rfc2279.txt.)
IETF RFC 2376: IETF (Internet Engineering Task Force). RFC 2376: XML Media Types, ed. E. Whitehead, M. Murata. 1998. (See http://www.ietf.org/rfc/rfc2376.txt.)
IETF RFC 2396: IETF (Internet Engineering Task Force). RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. T. Berners-Lee, R. Fielding, L. Masinter. 1998. (See http://www.ietf.org/rfc/rfc2396.txt.)
IETF RFC 2732: IETF (Internet Engineering Task Force). RFC 2732: Format for Literal IPv6 Addresses in URL's. R. Hinden, B. Carpenter, L. Masinter. 1999. (See http://www.ietf.org/rfc/rfc2732.txt.)
IETF RFC 2781: IETF (Internet Engineering Task Force). RFC 2781: UTF-16, an encoding of ISO 10646, ed. P. Hoffman, F. Yergeau. 2000. (See http://www.ietf.org/rfc/rfc2781.txt.)
ISO 639: (International Organization for Standardization). ISO 639:1988 (E). Code for the representation of names of languages. [Geneva]: International Organization for Standardization, 1988.
ISO 3166: (International Organization for Standardization). ISO 3166-1:1997 (E). Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes. [Geneva]: International Organization for Standardization, 1997.
ISO 8879: ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML). First edition -- 1986-10-15. [Geneva]: International Organization for Standardization, 1986.
ISO/IEC 10744: ISO (International Organization for Standardization). ISO/IEC 10744-1992 (E). Information technology -- Hypermedia/Time-based Structuring Language (HyTime). [Geneva]: International Organization for Standardization, 1992. Extended Facilities Annexe. [Geneva]: International Organization for Standardization, 1996.
WEBSGML: ISO (International Organization for Standardization). ISO 8879:1986 TC2. Information technology -- Document Description and Processing Languages. [Geneva]: International Organization for Standardization, 1998. (See http://www.sgmlsource.com/8879rev/n0029.htm.)
XML Names: Tim Bray, Dave Hollander, and Andrew Layman, editors. Namespaces in XML. Textuality, Hewlett-Packard, and Microsoft. World Wide Web Consortium, 1999. (See http://www.w3.org/TR/REC-xml-names/.)

B Character Classes

Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet), ideographic characters, and combining characters (among others, this class contains most diacritics). Digits and extenders are also distinguished.

Characters

[84]	`Letter`	::=	`BaseChar \| Ideographic`
[85]	`BaseChar`	::=	[#x0041-#x005A] \| [#x0061-#x007A] \| [#x00C0-#x00D6] \| [#x00D8-#x00F6]\| [#x00F8-#x00FF] \| [#x0100-#x0131] \| [#x0134-#x013E] \| [#x0141-#x0148]\| [#x014A-#x017E] \| [#x0180-#x01C3] \| [#x01CD-#x01F0] \| [#x01F4-#x01F5]\| [#x01FA-#x0217] \| [#x0250-#x02A8] \| [#x02BB-#x02C1] \| #x0386\| [#x0388-#x038A] \| #x038C \| [#x038E-#x03A1] \| [#x03A3-#x03CE]\| [#x03D0-#x03D6] \| #x03DA \| #x03DC \| #x03DE \| #x03E0\| [#x03E2-#x03F3] \| [#x0401-#x040C] \| [#x040E-#x044F] \| [#x0451-#x045C]\| [#x045E-#x0481] \| [#x0490-#x04C4] \| [#x04C7-#x04C8] \| [#x04CB-#x04CC]\| [#x04D0-#x04EB] \| [#x04EE-#x04F5] \| [#x04F8-#x04F9] \| [#x0531-#x0556]\| #x0559 \| [#x0561-#x0586] \| [#x05D0-#x05EA] \| [#x05F0-#x05F2]\| [#x0621-#x063A] \| [#x0641-#x064A] \| [#x0671-#x06B7] \| [#x06BA-#x06BE]\| [#x06C0-#x06CE] \| [#x06D0-#x06D3] \| #x06D5 \| [#x06E5-#x06E6]\| [#x0905-#x0939] \| #x093D \| [#x0958-#x0961] \| [#x0985-#x098C]\| [#x098F-#x0990] \| [#x0993-#x09A8] \| [#x09AA-#x09B0] \| #x09B2\| [#x09B6-#x09B9] \| [#x09DC-#x09DD] \| [#x09DF-#x09E1] \| [#x09F0-#x09F1]\| [#x0A05-#x0A0A] \| [#x0A0F-#x0A10] \| [#x0A13-#x0A28] \| [#x0A2A-#x0A30]\| [#x0A32-#x0A33] \| [#x0A35-#x0A36] \| [#x0A38-#x0A39] \| [#x0A59-#x0A5C]\| #x0A5E \| [#x0A72-#x0A74] \| [#x0A85-#x0A8B] \| #x0A8D\| [#x0A8F-#x0A91] \| [#x0A93-#x0AA8] \| [#x0AAA-#x0AB0] \| [#x0AB2-#x0AB3]\| [#x0AB5-#x0AB9] \| #x0ABD \| #x0AE0 \| [#x0B05-#x0B0C]\| [#x0B0F-#x0B10] \| [#x0B13-#x0B28] \| [#x0B2A-#x0B30] \| [#x0B32-#x0B33]\| [#x0B36-#x0B39] \| #x0B3D \| [#x0B5C-#x0B5D] \| [#x0B5F-#x0B61]\| [#x0B85-#x0B8A] \| [#x0B8E-#x0B90] \| [#x0B92-#x0B95] \| [#x0B99-#x0B9A]\| #x0B9C \| [#x0B9E-#x0B9F] \| [#x0BA3-#x0BA4] \| [#x0BA8-#x0BAA]\| [#x0BAE-#x0BB5] \| [#x0BB7-#x0BB9] \| [#x0C05-#x0C0C] \| [#x0C0E-#x0C10]\| [#x0C12-#x0C28] \| [#x0C2A-#x0C33] \| [#x0C35-#x0C39] \| [#x0C60-#x0C61]\| [#x0C85-#x0C8C] \| [#x0C8E-#x0C90] \| [#x0C92-#x0CA8] \| [#x0CAA-#x0CB3]\| [#x0CB5-#x0CB9] \| #x0CDE \| [#x0CE0-#x0CE1] \| [#x0D05-#x0D0C]\| 
[#x0D0E-#x0D10] \| [#x0D12-#x0D28] \| [#x0D2A-#x0D39] \| [#x0D60-#x0D61]\| [#x0E01-#x0E2E] \| #x0E30 \| [#x0E32-#x0E33] \| [#x0E40-#x0E45]\| [#x0E81-#x0E82] \| #x0E84 \| [#x0E87-#x0E88] \| #x0E8A\| #x0E8D \| [#x0E94-#x0E97] \| [#x0E99-#x0E9F] \| [#x0EA1-#x0EA3]\| #x0EA5 \| #x0EA7 \| [#x0EAA-#x0EAB] \| [#x0EAD-#x0EAE]\| #x0EB0 \| [#x0EB2-#x0EB3] \| #x0EBD \| [#x0EC0-#x0EC4]\| [#x0F40-#x0F47] \| [#x0F49-#x0F69] \| [#x10A0-#x10C5] \| [#x10D0-#x10F6]\| #x1100 \| [#x1102-#x1103] \| [#x1105-#x1107] \| #x1109\| [#x110B-#x110C] \| [#x110E-#x1112] \| #x113C \| #x113E\| #x1140 \| #x114C \| #x114E \| #x1150 \| [#x1154-#x1155]\| #x1159 \| [#x115F-#x1161] \| #x1163 \| #x1165 \| #x1167\| #x1169 \| [#x116D-#x116E] \| [#x1172-#x1173] \| #x1175\| #x119E \| #x11A8 \| #x11AB \| [#x11AE-#x11AF] \| [#x11B7-#x11B8]\| #x11BA \| [#x11BC-#x11C2] \| #x11EB \| #x11F0 \| #x11F9\| [#x1E00-#x1E9B] \| [#x1EA0-#x1EF9] \| [#x1F00-#x1F15] \| [#x1F18-#x1F1D]\| [#x1F20-#x1F45] \| [#x1F48-#x1F4D] \| [#x1F50-#x1F57] \| #x1F59\| #x1F5B \| #x1F5D \| [#x1F5F-#x1F7D] \| [#x1F80-#x1FB4]\| [#x1FB6-#x1FBC] \| #x1FBE \| [#x1FC2-#x1FC4] \| [#x1FC6-#x1FCC]\| [#x1FD0-#x1FD3] \| [#x1FD6-#x1FDB] \| [#x1FE0-#x1FEC] \| [#x1FF2-#x1FF4]\| [#x1FF6-#x1FFC] \| #x2126 \| [#x212A-#x212B] \| #x212E\| [#x2180-#x2182] \| [#x3041-#x3094] \| [#x30A1-#x30FA] \| [#x3105-#x312C]\| [#xAC00-#xD7A3]
[86]	`Ideographic`	::=	`[#x4E00-#x9FA5] \| #x3007 \| [#x3021-#x3029]`
[87]	`CombiningChar`	::=	[#x0300-#x0345] \| [#x0360-#x0361] \| [#x0483-#x0486] \| [#x0591-#x05A1]\| [#x05A3-#x05B9] \| [#x05BB-#x05BD] \| #x05BF \| [#x05C1-#x05C2]\| #x05C4 \| [#x064B-#x0652] \| #x0670 \| [#x06D6-#x06DC]\| [#x06DD-#x06DF] \| [#x06E0-#x06E4] \| [#x06E7-#x06E8] \| [#x06EA-#x06ED]\| [#x0901-#x0903] \| #x093C \| [#x093E-#x094C] \| #x094D\| [#x0951-#x0954] \| [#x0962-#x0963] \| [#x0981-#x0983] \| #x09BC\| #x09BE \| #x09BF \| [#x09C0-#x09C4] \| [#x09C7-#x09C8]\| [#x09CB-#x09CD] \| #x09D7 \| [#x09E2-#x09E3] \| #x0A02\| #x0A3C \| #x0A3E \| #x0A3F \| [#x0A40-#x0A42] \| [#x0A47-#x0A48]\| [#x0A4B-#x0A4D] \| [#x0A70-#x0A71] \| [#x0A81-#x0A83] \| #x0ABC\| [#x0ABE-#x0AC5] \| [#x0AC7-#x0AC9] \| [#x0ACB-#x0ACD] \| [#x0B01-#x0B03]\| #x0B3C \| [#x0B3E-#x0B43] \| [#x0B47-#x0B48] \| [#x0B4B-#x0B4D]\| [#x0B56-#x0B57] \| [#x0B82-#x0B83] \| [#x0BBE-#x0BC2] \| [#x0BC6-#x0BC8]\| [#x0BCA-#x0BCD] \| #x0BD7 \| [#x0C01-#x0C03] \| [#x0C3E-#x0C44]\| [#x0C46-#x0C48] \| [#x0C4A-#x0C4D] \| [#x0C55-#x0C56] \| [#x0C82-#x0C83]\| [#x0CBE-#x0CC4] \| [#x0CC6-#x0CC8] \| [#x0CCA-#x0CCD] \| [#x0CD5-#x0CD6]\| [#x0D02-#x0D03] \| [#x0D3E-#x0D43] \| [#x0D46-#x0D48] \| [#x0D4A-#x0D4D]\| #x0D57 \| #x0E31 \| [#x0E34-#x0E3A] \| [#x0E47-#x0E4E]\| #x0EB1 \| [#x0EB4-#x0EB9] \| [#x0EBB-#x0EBC] \| [#x0EC8-#x0ECD]\| [#x0F18-#x0F19] \| #x0F35 \| #x0F37 \| #x0F39 \| #x0F3E\| #x0F3F \| [#x0F71-#x0F84] \| [#x0F86-#x0F8B] \| [#x0F90-#x0F95]\| #x0F97 \| [#x0F99-#x0FAD] \| [#x0FB1-#x0FB7] \| #x0FB9\| [#x20D0-#x20DC] \| #x20E1 \| [#x302A-#x302F] \| #x3099\| #x309A
[88]	`Digit`	::=	`[#x0030-#x0039] \| [#x0660-#x0669] \| [#x06F0-#x06F9] \| [#x0966-#x096F]\| [#x09E6-#x09EF] \| [#x0A66-#x0A6F] \| [#x0AE6-#x0AEF] \| [#x0B66-#x0B6F]\| [#x0BE7-#x0BEF] \| [#x0C66-#x0C6F] \| [#x0CE6-#x0CEF] \| [#x0D66-#x0D6F]\| [#x0E50-#x0E59] \| [#x0ED0-#x0ED9] \| [#x0F20-#x0F29]`
[89]	`Extender`	::=	`#x00B7 \| #x02D0 \| #x02D1 \| #x0387 \| #x0640 \| #x0E46\| #x0EC6 \| #x3005 \| [#x3031-#x3035] \| [#x309D-#x309E]\| [#x30FC-#x30FE]`
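Productions such as these translate directly into range tables. The following Python sketch covers only the two shortest classes, Digit [88] and Extender [89], with the ranges transcribed from the productions above; the names DIGIT_RANGES, EXTENDER_RANGES, is_digit, and is_extender are hypothetical, and a full Name check would also need the much longer Letter and CombiningChar tables.

```python
# Inclusive code-point ranges transcribed from production [88] Digit.
DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9), (0x0966, 0x096F),
    (0x09E6, 0x09EF), (0x0A66, 0x0A6F), (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F),
    (0x0BE7, 0x0BEF), (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]

# Inclusive code-point ranges transcribed from production [89] Extender;
# single characters are written as ranges of length one.
EXTENDER_RANGES = [
    (0x00B7, 0x00B7), (0x02D0, 0x02D0), (0x02D1, 0x02D1), (0x0387, 0x0387),
    (0x0640, 0x0640), (0x0E46, 0x0E46), (0x0EC6, 0x0EC6), (0x3005, 0x3005),
    (0x3031, 0x3035), (0x309D, 0x309E), (0x30FC, 0x30FE),
]


def _in_ranges(char: str, ranges) -> bool:
    """Return True if the single character falls inside any listed range."""
    code = ord(char)
    return any(low <= code <= high for low, high in ranges)


def is_digit(char: str) -> bool:
    return _in_ranges(char, DIGIT_RANGES)


def is_extender(char: str) -> bool:
    return _in_ranges(char, EXTENDER_RANGES)
```

A linear scan is adequate for tables this small; a sorted table with binary search would be a natural choice for BaseChar.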

The character classes defined here can be derived from the Unicode 2.0 character database as follows:

Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.
Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.
Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.
Characters which have a font or compatibility decomposition (i.e. those with a "compatibility formatting tag" in field 5 of the database -- marked by field 5 beginning with a "<") are not allowed.
The following characters are treated as name-start characters rather than name characters, because the property file classifies them as Alphabetic: [#x02BB-#x02C1], #x0559, #x06E5, #x06E6.
Characters #x20DD-#x20E0 are excluded (in accordance with Unicode 2.0, section 5.14).
Character #x00B7 is classified as an extender, because the property list so identifies it.
Character #x0387 is added as a name character, because #x00B7 is its canonical equivalent.
Characters ':' and '_' are allowed as name-start characters.
Characters '-' and '.' are allowed as name characters.

C XML and SGML (Non-Normative)

XML is designed to be a subset of SGML, in that every XML document should also be a conforming SGML document. For a detailed comparison of the additional restrictions that XML places on documents beyond those of SGML, see [Clark].

D Expansion of Entity and Character References (Non-Normative)

This appendix contains some examples illustrating the sequence of entity- and character-reference recognition and expansion, as specified in 4.4 XML Processor Treatment of Entities and References.

If the DTD contains the declaration

<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped
numerically (&#38;#38;#38;) or with a general entity
(&amp;amp;).</p>" >

then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity "example":

<p>An ampersand (&#38;) may be escaped
numerically (&#38;#38;) or with a general entity
(&amp;amp;).</p>

A reference in the document to "&example;" will cause the text to be reparsed, at which time the start- and end-tags of the p element will be recognized and the three references will be recognized and expanded, resulting in a p element with the following content (all data, no delimiters or markup):

An ampersand (&) may be escaped
numerically (&#38;) or with a general entity
(&amp;).
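The two-stage expansion just described can be reproduced with an actual parser. The sketch below assumes Python's xml.etree.ElementTree; the wrapping element doc and the function name expanded_paragraph are hypothetical, and the entity value is written on one line with spaces where the declaration above has line breaks.

```python
import xml.etree.ElementTree as ET

# The "example" entity from this appendix, declared in an internal subset
# and referenced once from a wrapper element.
DOCUMENT = (
    '<!DOCTYPE doc [\n'
    '<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped '
    'numerically (&#38;#38;#38;) or with a general entity '
    '(&amp;amp;).</p>" >\n'
    ']>\n'
    '<doc>&example;</doc>'
)


def expanded_paragraph(document: str = DOCUMENT) -> str:
    """Return the character data of the p element produced by &example;."""
    root = ET.fromstring(document)
    return root.find("p").text


if __name__ == "__main__":
    print(expanded_paragraph())
```

The p element exists only because the replacement text is reparsed when the reference is encountered; the stored entity value itself still contains references.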

A more complex example will illustrate the rules and their effectsfully. In the following example, the line numbers are solely for reference.

1 <?xml version='1.0'?>
2 <!DOCTYPE test [
3 <!ELEMENT test (#PCDATA) >
4 <!ENTITY % xx '&#37;zz;'>
5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
6 %xx;
7 ]>
8 <test>This sample shows a &tricky; method.</test>

This produces the following:

in line 4, the reference to character 37 is expanded immediately, and the parameter entity "xx" is stored in the symbol table with the value "%zz;". Since the replacement text is not rescanned, the reference to parameter entity "zz" is not recognized. (And it would be an error if it were, since "zz" is not yet declared.)
in line 5, the character reference "&#60;" is expanded immediately and the parameter entity "zz" is stored with the replacement text "<!ENTITY tricky "error-prone" >", which is a well-formed entity declaration.
in line 6, the reference to "xx" is recognized, and the replacement text of "xx" (namely "%zz;") is parsed. The reference to "zz" is recognized in its turn, and its replacement text ("<!ENTITY tricky "error-prone" >") is parsed. The general entity "tricky" has now been declared, with the replacement text "error-prone".
in line 8, the reference to the general entity "tricky" is recognized, and it is expanded, so the full content of the test element is the self-describing (and ungrammatical) string This sample shows a error-prone method.

E Deterministic Content Models (Non-Normative)

As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.

For example, the content model ((b, c) | (b, d)) is non-deterministic, because given an initial b the XML processor cannot know which b in the model is being matched without looking ahead to see which element follows the b. In this case, the two references to b can be collapsed into a single reference, making the model read (b, (c | d)). An initial b now clearly matches only a single name in the content model. The processor doesn't need to look ahead to see what follows; either c or d would be accepted.

More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.

Algorithms exist which allow many but not all non-deterministic content models to be reduced automatically to equivalent deterministic models; see Brüggemann-Klein 1991 [Brüggemann-Klein].
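The follow-set test described above might be sketched as follows. This is a minimal illustration, not the algorithm from [Aho/Ullman] verbatim: the tuple encoding of content models ("name", "seq", "alt", "opt", "star", "plus") and the function names analyse and is_deterministic are assumptions of this example, and the set of possible first positions is checked as if it were the follow set of a virtual start position.

```python
def analyse(node, labels, follow):
    """Return (nullable, first, last) for a content-model node.

    Leaf positions are numbered in order of appearance; 'labels' records
    the element type name at each position and 'follow' its follow set.
    """
    kind = node[0]
    if kind == "name":
        pos = len(labels)
        labels.append(node[1])
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("opt", "star", "plus"):
        nullable, first, last = analyse(node[1], labels, follow)
        if kind != "opt":
            # Repetition: any last position may be followed by any first.
            for p in last:
                follow[p] |= first
        return nullable or kind != "plus", first, last
    parts = [analyse(child, labels, follow) for child in node[1]]
    if kind == "alt":
        return (any(n for n, _, _ in parts),
                set().union(*(f for _, f, _ in parts)),
                set().union(*(l for _, _, l in parts)))
    # Sequence: link the last positions of the prefix to each new part.
    nullable, first, last = True, set(), set()
    for n, f, l in parts:
        for p in last:
            follow[p] |= f
        if nullable:
            first |= f
        last = (last | l) if n else set(l)
        nullable = nullable and n
    return nullable, first, last


def is_deterministic(model) -> bool:
    """True if no first/follow set holds two positions with the same name."""
    labels, follow = [], {}
    _, first, _ = analyse(model, labels, follow)
    for candidates in [first, *follow.values()]:
        names = [labels[p] for p in candidates]
        if len(names) != len(set(names)):
            return False
    return True


if __name__ == "__main__":
    b, c, d = ("name", "b"), ("name", "c"), ("name", "d")
    print(is_deterministic(("alt", [("seq", [b, c]), ("seq", [b, d])])))
    print(is_deterministic(("seq", [b, ("alt", [c, d])])))
```

For ((b, c) | (b, d)) the two b positions both appear among the possible first positions, which is exactly the ambiguity described in the example; in (b, (c | d)) every set contains distinct names.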

F Autodetection of Character Encodings (Non-Normative)

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases. Also, in many cases other sources of information are available in addition to the XML data stream itself. Two cases may be distinguished, depending on whether the XML entity is presented to the processor without, or with, any accompanying (external) information. We consider the first case first.

F.1 Detection Without External Encoding Information

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.

With a Byte Order Mark:

`00 00 FE FF`	UCS-4, big-endian machine (1234 order)
`FF FE 00 00`	UCS-4, little-endian machine (4321 order)
`00 00 FF FE`	UCS-4, unusual octet order (2143)
`FE FF 00 00`	UCS-4, unusual octet order (3412)
`FE FF ## ##`	UTF-16, big-endian
`FF FE ## ##`	UTF-16, little-endian
`EF BB BF`	UTF-8

Without a Byte Order Mark:

`00 00 00 3C`	UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies.
`3C 00 00 00`
`00 00 3C 00`
`00 3C 00 00`
`00 3C 00 3F`	UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
`3C 00 3F 00`	UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
`3C 3F 78 6D`	UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
`4C 6F A7 94`	EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
Other	UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind

Note:

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).
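The two tables above amount to a lookup on the first four octets of the entity. A minimal Python sketch follows; the function name sniff_encoding_family and the wording of the returned labels are assumptions of this example, and the result identifies only an encoding family, after which the encoding declaration would still have to be read.

```python
def sniff_encoding_family(data: bytes) -> str:
    """Classify the first octets of an entity according to the tables above."""
    head = data[:4]
    # Four-octet signatures are tested first, so that the UCS-4 byte order
    # marks are not mistaken for the shorter UTF-16 marks.
    four_octet = {
        b"\x00\x00\xfe\xff": "UCS-4, big-endian (1234), with BOM",
        b"\xff\xfe\x00\x00": "UCS-4, little-endian (4321), with BOM",
        b"\x00\x00\xff\xfe": "UCS-4, unusual octet order (2143), with BOM",
        b"\xfe\xff\x00\x00": "UCS-4, unusual octet order (3412), with BOM",
        b"\x00\x00\x00\x3c": "32-bit code units, big-endian (1234)",
        b"\x3c\x00\x00\x00": "32-bit code units, little-endian (4321)",
        b"\x00\x00\x3c\x00": "32-bit code units, unusual order (2143)",
        b"\x00\x3c\x00\x00": "32-bit code units, unusual order (3412)",
        b"\x00\x3c\x00\x3f": "16-bit code units, big-endian",
        b"\x3c\x00\x3f\x00": "16-bit code units, little-endian",
        b"\x3c\x3f\x78\x6d": "ASCII-compatible 7-bit, 8-bit, or mixed-width",
        b"\x4c\x6f\xa7\x94": "EBCDIC",
    }
    if head in four_octet:
        return four_octet[head]
    # Remaining byte order marks: FE FF ## ## and FF FE ## ## (the all-zero
    # ## ## cases were already claimed by UCS-4 above), then EF BB BF.
    if head.startswith(b"\xfe\xff"):
        return "UTF-16, big-endian, with BOM"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16, little-endian, with BOM"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8, with BOM"
    # UTF-8 without a declaration, or mislabeled, corrupt, or wrapped data.
    return "other"


if __name__ == "__main__":
    print(sniff_encoding_family(b"<?xml version='1.0'?><doc/>"))
```

A caller would typically use the returned family to choose a provisional decoder, read the encoding declaration with it, and then switch to the declared encoding.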

Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.

Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.

F.2 Priorities in the Presence of External Encoding Information

The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 2376] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.

If an XML entity is in a file, the Byte-Order Mark and encoding declarationare used (if present) to determine the character encoding.

G W3C XML Working Group (Non-Normative)

This specification was prepared and approved for publication by the W3C XML Working Group (WG). WG approval of this specification does not necessarily imply that all WG members voted for its approval. The current and former members of the XML WG are:

Jon Bosak, Sun (Chair)
James Clark (Technical Lead)
Tim Bray, Textuality and Netscape (XML Co-editor)
Jean Paoli, Microsoft (XML Co-editor)
C. M. Sperberg-McQueen, U. of Ill. (XML Co-editor)
Dan Connolly, W3C (W3C Liaison)
Paula Angerstein, Texcel
Steve DeRose, INSO
Dave Hollander, HP
Eliot Kimber, ISOGEN
Eve Maler, ArborText
Tom Magliery, NCSA
Murray Maloney, SoftQuad, Grif SA, Muzmo and Veo Systems
MURATA Makoto (FAMILY Given), Fuji Xerox Information Systems
Joel Nava, Adobe
Conleth O'Connell, Vignette
Peter Sharpe, SoftQuad
John Tigue, DataChannel

H W3C XML Core Group (Non-Normative)

The second edition of this specification was prepared by the W3C XML Core Working Group (WG). The members of the WG at the time of publication of this edition were:

Paula Angerstein, Vignette
Daniel Austin, Ask Jeeves
Tim Boland
Allen Brown, Microsoft
Dan Connolly, W3C (Staff Contact)
John Cowan, Reuters Limited
John Evdemon, XMLSolutions Corporation
Paul Grosso, Arbortext (Co-Chair)
Arnaud Le Hors, IBM (Co-Chair)
Eve Maler, Sun Microsystems (Second Edition Editor)
Jonathan Marsh, Microsoft
MURATA Makoto (FAMILY Given), IBM
Mark Needleman, Data Research Associates
David Orchard, Jamcracker
Lew Shannon, NCR
Richard Tobin, University of Edinburgh
Daniel Veillard, W3C
Dan Vint, Lexica
Norman Walsh, Sun Microsystems
François Yergeau, Alis Technologies (Errata List Editor)
Kongyi Zhou, Oracle

I Production Notes (Non-Normative)

This Second Edition was encoded in the XMLspec DTD (which has documentation available). The HTML versions were produced with a combination of the xmlspec.xsl, diffspec.xsl, and REC-xml-2e.xsl XSLT stylesheets. The PDF version was produced with the html2ps facility and a distiller program.

[61]	`conditionalSect`	::=	`includeSect \| ignoreSect`
[62]	`includeSect`	::=	`'<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'`	[VC: Proper Conditional Section/PE Nesting]
[63]	`ignoreSect`	::=	`'<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'`	[VC: Proper Conditional Section/PE Nesting]
[64]	`ignoreSectContents`	::=	`Ignore ('<![' ignoreSectContents ']]>' Ignore)*`
[65]	`Ignore`	::=	`Char* - (Char* ('<![' \| ']]>') Char*)`
