The Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879].
XML documents are made up of storage units called entities, which contain either text or binary data. Text is made up of characters, some of which form the character data in the document, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, referred to as the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.
Origin and Goals
XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996 and chaired by Jon Bosak of Sun Microsystems with the very active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the WG's contact with the W3C.
The design goals for XML are:
XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
This specification, together with the associated standards, provides all the information necessary to understand XML version &XML.version; and construct computer programs to process it.
This version of the XML specification (&doc.date;) is for &doc.audience;. It &doc.distribution;.
Relationship to Existing Standards
Standards relevant to users and implementors of XML include:
SGML (ISO 8879:1986). By definition, valid XML documents are conformant SGML documents in the sense described in ISO standard 8879. The current draft of this specification presupposes the successful completion of the current work on a technical corrigendum to ISO 8879 now being prepared by ISO/IEC JTC1/SC18/WG8. If the corrigendum is not adopted in the expected form, some clauses of this specification may change, and some recommendations now labeled for interoperability will become requirements labeled for compatibility.
Unicode and ISO/IEC 10646. This specification depends on the international standard ISO/IEC 10646 (with amendments AM 1 through AM 5) and the Unicode Standard, Version 2.0, which define the encodings and meanings of the characters which make up XML text data. All the characters in ISO/IEC 10646 are present, at the same code points, in Unicode.
IETF RFC 1738 and RFC 1808. RFC 1738 and RFC 1808 define the syntax and semantics of Uniform Resource Locators, or URLs.
Terminology
Some terms used with special meaning in this specification are:
may: Conforming data and XML processors are permitted to but need not behave as described.
must: Conforming data and XML processors are required to behave as described; otherwise they are in error.
error: A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may recover from it.
fatal error: An error which conforming software must detect and report to the application. After encountering a fatal error, an XML processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed text from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e. it must not continue to pass character data and information about the document's logical structure to the application in the normal way).
validity constraint: A rule which applies to all valid XML documents. Violations of validity constraints are errors; they must, at user option, be reported by validating XML processors.
well-formedness constraint: A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are fatal errors.
at user option: Conforming software may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.
match: (Of strings or names:) Case-insensitive match: two strings or names being compared match if they are identical after case-folding. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production. (Of content and content models:) The content of a parent element in a document matches the content model for that element if (a) the content model matches the rule for Mixed and the content consists of character data and elements whose names match names in the content model, or if (b) the content model matches the rule for elements, and the sequence of child elements belongs to the language generated by the regular expression in the content model.
case-folding: A process applied to a sequence of characters, in which those identified as non-uppercase (in scripts which have case distinctions) are replaced by their uppercase equivalents, as specified in The Unicode Standard, Version 2.0, section 4.1. Note that Unicode recommends folding to lowercase; for compatibility reasons, XML processors must fold to uppercase. Case-folding, as described here, neither requires nor forbids the normalization of Unicode character sequences into canonical form (e.g. as described in The Unicode Standard, section 5.9).
Case-sensitive string match: two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. At user option, processors may normalize such characters to their canonical form.
for compatibility: A feature of XML included solely to ensure that XML remains compatible with SGML.
for interoperability: A non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the technical corrigendum to ISO 8879 now in the process of preparation by ISO/IEC JTC1/SC18/WG8.
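The two kinds of string match defined above can be illustrated with a minimal Python sketch. It is not part of this specification. It assumes that Python's str.upper() is an acceptable stand-in for the uppercase mapping of The Unicode Standard, Version 2.0 (it follows whichever Unicode version ships with the interpreter), and that Unicode NFC is a reasonable reading of "canonical form"; all function names are hypothetical.

```python
import unicodedata


def case_fold(s: str) -> str:
    """Fold to uppercase, as this draft requires.

    Assumption: str.upper() approximates the Unicode 2.0 mapping.
    """
    return s.upper()


def names_match(a: str, b: str) -> bool:
    """Case-insensitive match: identical after case-folding."""
    return case_fold(a) == case_fold(b)


def exact_match(a: str, b: str, normalize: bool = False) -> bool:
    """Case-sensitive match: the strings must be identical.

    normalize=True models the user option of normalizing both strings
    to a canonical form first (assumption: NFC).
    """
    if normalize:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b


precomposed = "\u00e9"   # e with acute accent, one code point
decomposed = "e\u0301"   # base letter followed by a combining accent

print(names_match("termdef", "TermDef"))                     # True
print(exact_match("termdef", "TermDef"))                     # False
print(exact_match(precomposed, decomposed))                  # False
print(exact_match(precomposed, decomposed, normalize=True))  # True
```

Without normalization the precomposed and base+diacritic forms do not match, which is the default behavior the definition describes.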
Notation
The formal grammar of XML is given using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are defined by a regular expression, or with an initial lowercase letter if a recursive grammar is required for recognition. Literal strings are quoted; unless otherwise noted they are case-insensitive. The distinction between symbols which can and cannot be recognized using simple regular expressions may be used to set the boundary between an implementation's lexical scanner and its parser, but this specification neither constrains the placement of that boundary nor presupposes that all implementations will have one.
Within the expression on the right-hand side of a rule, the meaning of symbols is as shown below.
where N is a hexadecimal integer, the expression represents the character in ISO/IEC 10646 whose canonical (UCS-4) bit string, when interpreted as an unsigned binary number, has the value indicated. The number of leading zeroes in the #xN form is insignificant; the number of leading zeroes in the corresponding bit string is governed by the character encoding in use and is not significant for XML.
represents any character with a value in the range(s) indicated (inclusive).
represents any character with a value outside the range indicated.
represents any character with a value not among the characters given.
represents a literal string matching that given inside the double quotes.
represents a literal string matching that given inside the single quotes.
a followed by b.
a or b but not both.
the set of strings represented by a but not represented by b.
a or nothing; optional a.
one or more occurrences of a.
zero or more occurrences of a.
specifies that in the external DTD subset a parameter entity may occur in the text at the position where a may occur; if so, its replacement text must match S? a S?. If the expression a is governed by a suffix operator, then the suffix operator determines both the maximum number of parameter-entity references allowed and the number of occurrences of a in the replacement text of the parameter entities: %a* means that a must occur zero or more times, and that some of its occurrences may be replaced by references to parameter entities whose replacement text must contain zero or more occurrences of a; it is thus a more compact way of writing %(a*)*. Similarly, %a+ means that a must occur one or more times, and may be replaced by parameter entities with replacement text matching S? (a S?)+. The recognition of parameter entities in the internal subset is much more highly constrained.
expression is treated as a unit, and may carry the % prefix operator, or a suffix operator: ?, *, or +.
comment.
Well-formedness check; this identifies by name a check for well-formedness associated with a production.
Validity check; this identifies by name a check for validity associated with a production.
Common Syntactic Constructs
This section defines some symbols used widely in the grammar.
S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
S ::= (#x20 | #x9 | #xd | #xa)+
Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646.
Char ::= #x9 | #xA | #xD | [#x20-#xFFFD] | [#x00010000-#x7FFFFFFF]
(any ISO/IEC 10646 UCS-4 code, FFFE and FFFF excluded)
Characters are classified for convenience as letters, digits, or other characters. Letters consist of an alphabetic or syllabic base character possibly followed by one or more combining characters, or of an ideographic character. Certain layout and format-control characters defined by ISO/IEC 10646 should be ignored when recognizing identifiers; these are defined by the classes Ignorable and Extender. Full definitions of the specific characters in each class are given in the appendix on character classes.
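The S and Char productions can be checked mechanically. The following Python sketch is illustrative only and transcribes the two productions as written in this draft; the function names are hypothetical. Code points are handled as integers because the upper range of Char extends beyond what a Python string can hold.

```python
def is_legal_code_point(cp: int) -> bool:
    """Return True if cp falls in one of the ranges of the Char production."""
    return (
        cp in (0x9, 0xA, 0xD)              # tab, line feed, carriage return
        or 0x20 <= cp <= 0xFFFD            # excludes FFFE and FFFF
        or 0x00010000 <= cp <= 0x7FFFFFFF  # remaining UCS-4 codes
    )


def is_white_space(s: str) -> bool:
    """Return True if s matches S: one or more of space, tab, CR, LF."""
    return len(s) > 0 and all(ch in " \t\r\n" for ch in s)


print(is_legal_code_point(0x41))    # True
print(is_legal_code_point(0x0))     # False
print(is_legal_code_point(0xFFFE))  # False
print(is_white_space(" \t\r\n"))    # True
print(is_white_space(""))           # False
```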
A Name is a token beginning with a letter or underscore character and continuing with letters, digits, hyphens, underscores, or full stops (together known as name characters). Names beginning with the string XML are reserved for standardization in this or future versions of this specification.
Note: the colon character is also allowed within XML names; it is reserved for experimentation with name spaces and schema scoping. Its meaning is expected to be standardized at some future point, at which point those documents using colon for experimental purposes will need to be updated. (Note: there is no guarantee that any name-space mechanism adopted for XML will in fact use colon as a name-space delimiter.) In practice, this means that authors should not use colon in XML names except as part of name-space experiments, but that implementors should accept colon as a name character.
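A rough Python sketch of the Name rule follows. It is an approximation under stated assumptions, not a definitive implementation: letters and digits are taken from Python's Unicode general categories (L* and Nd) rather than from the character-class appendix, combining characters and the Ignorable and Extender classes are not handled, the colon is accepted only after the first character, and the reserved prefix is compared after folding to uppercase. All function names are hypothetical.

```python
import unicodedata

# Punctuation allowed after the first character; the colon is accepted
# because implementors are advised to treat it as a name character.
NAME_PUNCTUATION = set("-_.:")


def is_letter(ch: str) -> bool:
    # Assumption: general category L* approximates the Letter class.
    return unicodedata.category(ch).startswith("L")


def is_digit(ch: str) -> bool:
    # Assumption: general category Nd approximates the Digit class.
    return unicodedata.category(ch) == "Nd"


def is_name(token: str) -> bool:
    """Check a token against the prose description of Name."""
    if not token:
        return False
    first = token[0]
    if not (is_letter(first) or first == "_"):
        return False
    return all(
        is_letter(ch) or is_digit(ch) or ch in NAME_PUNCTUATION
        for ch in token[1:]
    )


def is_reserved(name: str) -> bool:
    """Names beginning with the string XML are reserved."""
    return name.upper().startswith("XML")


print(is_name("termdef"))        # True
print(is_name("1abc"))           # False
print(is_name("a:b"))            # True
print(is_reserved("XML-SPACE"))  # True
```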
Literal data is any quoted string not containing the quotation mark used as a delimiter for that string; different forms of literal data may or may not contain angle brackets, entity references, and character references. Literals are used for specifying the replacement text of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral); for some purposes, the entire literal can be skipped without scanning for markup within it (SkipLit):
EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"' | "'" ([^%&'] | PEReference | Reference)* "'"
AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
SystemLiteral ::= '"' URLchar* '"' | "'" (URLchar - "'")* "'"
URLchar ::= See RFC 1738 and 1808
PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
PubidChar ::= #x20 | #x9 | #xd | #xa | #x&IDEOSPACE; | [a-zA-Z0-9] | [-'()+,./:=?]
SkipLit ::= ('"' [^"]* '"') | ("'" [^']* "'")
Note that entity references and character references are recognized and processed within EntityValue and AttValue, but not within SystemLiteral.
Documents
A textual object is an XML document if it is either valid or well-formed, as defined in this specification.
Logical and Physical Structure
Each XML document has both a logical and a physical structure.
Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity.
The logical structure contains declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.
The two structures must be synchronous: see section 4.1.
Well-Formed XML Documents
A textual object is said to be a well-formed XML document if, first, it matches the production labeled document, and if for each entity reference which appears in the document, either the entity has been declared in the document type declaration or the entity name is one of: &magicents;.
Matching the document production implies that:
It contains one or more elements.
It meets all the well-formedness constraints (WFCs) given in the grammar.
There is exactly one element, called the root, or document element, for which neither the start-tag nor the end-tag is in the content of any other element. For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest within each other.
As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. Then P is referred to as the parent of C, and C as a child of P.
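The nesting rule and the resulting parent/child relation can be observed with an off-the-shelf parser. The sketch below uses Python's xml.etree.ElementTree, which implements the later XML 1.0 Recommendation rather than this draft, so it illustrates the nesting property only; the helper names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def parent_map(root):
    """Map every non-root element to the one element that is its parent."""
    return {child: parent for parent in root.iter() for child in parent}


def parses(text: str) -> bool:
    """Return False when the parser rejects the text, e.g. for bad nesting."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring("<spec><front/><body><p>Hello</p></body></spec>")
parents = parent_map(root)
p = root.find("body/p")

print(parents[p].tag)             # body
print(root in parents)            # False: the root has no parent
print(parses("<a><b></b></a>"))   # True
print(parses("<a><b></a></b>"))   # False: tags overlap instead of nesting
```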
Characters
The data stored in an XML entity is either text or binary. Binary data has an associated notation, identified by name; beyond a requirement to make available the notation's name and the associated system identifier, XML places no constraints on the contents or use of binary entities. So-called binary data might in fact be textual; its identification as binary means that an XML processor need not parse it in the fashion described by this specification. XML text data is a sequence of characters. A character is an atomic unit of text; valid characters are specified by ISO/IEC 10646. Users may extend the ISO/IEC 10646 character repertoire by exploiting the private use areas.
The mechanism for encoding character values into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UCS-2 encodings of 10646; the mechanisms for signaling which of the two are in use, or for bringing other encodings into play, are discussed later, in the discussion of character encodings.
Regardless of the specific encoding used, any character in the ISO/IEC 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string.
Character Data and Markup
XML text consists of intermingled character data and markup. Markup takes the form of start-tags, end-tags, empty elements, entity references, character references, comments, CDATA sections, document type declarations, and processing instructions.
All text that is not markup constitutes the character data of the document.
The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within comments, processing instructions, or CDATA sections. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;". The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be so represented when it appears in the string "]]>", when that string is not marking the end of a CDATA section.
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup. In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
PCData ::= [^<&]*
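The escaping rules for character data and attribute values can be sketched with the escape function from Python's standard xml.sax.saxutils module. This is an illustration, not a required implementation. escape always rewrites the right angle bracket as well, which is permitted but goes further than strictly required; the two wrapper names are hypothetical.

```python
from xml.sax.saxutils import escape


def escape_text(s: str) -> str:
    """Escape ampersand, left angle bracket and right angle bracket."""
    return escape(s)


def escape_attribute(s: str) -> str:
    """As escape_text, and additionally escape both kinds of quote."""
    return escape(s, {'"': "&quot;", "'": "&apos;"})


print(escape_text("a < b & c"))         # a &lt; b &amp; c
print(escape_text("x]]>y"))             # x]]&gt;y
print(escape_attribute('say "hi"'))     # say &quot;hi&quot;
print(escape_attribute("it's"))         # it&apos;s
```

The ampersand is replaced first, so the ampersands introduced by the other replacements are not escaped a second time.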
Comments
Comments may appear anywhere except in a CDATA section, i.e. within element content, in mixed content, or in a DTD. They must not occur within declarations or tags. They are not part of the document's character data; an XML processor may, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string -- (double-hyphen) must not occur within comments.
Comment ::= '<!&como;' (Char* - (Char* '&comc;' Char*)) '&comc;>'
An example of a comment:
<!&como; declarations for <head> & <body> &comc;>
Processing Instructions
Processing instructions (PIs) allow documents to contain instructions for applications.
PI ::= '<?' Name S (Char* - (Char* &pic; Char*)) &pic;
PIs are not part of the document's character data, but must be passed through to the application. The Name is called the PI target; it is used to identify the application to which the instruction is directed. XML provides an optional mechanism, NOTATION, for formal declaration of such names. PI targets with names beginning with the string "XML" are reserved for standardization in this or future versions of this specification.
CDATA Sections
CDATA sections can occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string <![CDATA[ and end with the string ]]>:
CDSect ::= CDStart CData CDEnd
CDStart ::= '<![CDATA['
CData ::= (Char* - (Char* ']]>' Char*))
CDEnd ::= ']]>'
Within a CDATA section, only the CDEnd string is recognized, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using &lt; and &amp;. CDATA sections cannot nest.
An example of a CDATA section:
<![CDATA[<greeting>Hello, world!</greeting>]]>
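Because only the CDEnd string is recognized inside a CDATA section, text that itself contains "]]>" cannot be wrapped in a single section. A common workaround, sketched below in Python as an illustration rather than a prescribed method, is to split the text across two adjacent sections. The function names and the wrapper element x are hypothetical, and the round trip relies on xml.etree.ElementTree, which implements the later XML 1.0 Recommendation.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA sections, splitting at any embedded ']]>'."""
    # Each ']]>' is cut after ']]', which closes one section; the '>'
    # then opens the next section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def round_trip(text: str) -> str:
    """Parse the wrapped text back and return the recovered character data."""
    return ET.fromstring("<x>" + to_cdata(text) + "</x>").text


print(to_cdata("<greeting>Hello, world!</greeting>"))
# <![CDATA[<greeting>Hello, world!</greeting>]]>
print(round_trip("a ]]> b") == "a ]]> b")   # True
```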
White Space Handling
In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines, denoted by the nonterminal S in this specification) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that must be retained in the delivered version is common, for example in poetry and source code.
An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must distinguish white space in element content from other non-markup characters and signal to the application that white space in element content is not significant.
A special attribute may be inserted in documents to signal an intention that the element to which this attribute applies requires all white space to be treated as significant by applications.
In valid documents, this attribute must be declared as follows, if used:
The value DEFAULT signals that applications' default white-space processing modes are acceptable for this element; the value PRESERVE indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of this element, unless overridden with another instance of the XML-SPACE attribute.
The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
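The way the declared intent applies to contained elements can be sketched as a walk from an element towards the root. The Python below is a minimal illustration under stated assumptions: the Node class and its field names are hypothetical, elements are modeled by hand rather than parsed, and attribute defaults supplied by a declaration are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an element and its XML-SPACE attribute."""
    name: str
    xml_space: Optional[str] = None   # "DEFAULT", "PRESERVE", or not given
    parent: Optional["Node"] = None


def effective_space(node: Optional[Node]) -> Optional[str]:
    """Return the nearest XML-SPACE value on the element or its ancestors.

    None means that no intention has been signaled.
    """
    while node is not None:
        if node.xml_space is not None:
            return node.xml_space
        node = node.parent
    return None


poem = Node("poem")
stanza = Node("stanza", "PRESERVE", parent=poem)
line = Node("line", parent=stanza)
note = Node("note", "DEFAULT", parent=stanza)

print(effective_space(poem))   # None
print(effective_space(line))   # PRESERVE
print(effective_space(note))   # DEFAULT
```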
Prolog and Document Type Declaration
XML documents may, and should, begin with an XML declaration which specifies, among other things, the version of XML being used.
The function of the markup in an XML document is to describe its storage and logical structures, and associate attribute-value pairs with the logical structure. XML provides a mechanism, the document type declaration, to define constraints on that logical structure and to support the use of predefined storage units. An XML document is said to be valid if there is an associated document type declaration and if the document complies with the constraints expressed in it.
The document type declaration must appear before the first start-tag in the document.
document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl ::= &xmlpio; VersionInfo EncodingDecl? RMDecl? S? &pic;
VersionInfo ::= S 'version' Eq ('"&XML.version;"' | "'&XML.version;'")
Misc ::= Comment | PI | S
The identification of the XML version as "1.0" does not indicate a commitment to produce any future versions of XML, nor if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary.
For example, the following is a complete XML document, well-formed but not valid:
Hello, world!]]>
and so is this:
Hello, world!]]>
The XML document type declaration may include a pointer to an external entity containing a subset of the necessary markup declarations, and may also directly include another, internal, subset.
These two subsets make up the document type definition, abbreviated DTD. The DTD, in effect, provides a grammar which defines a class of documents. Properly speaking, the DTD consists of both subsets taken together, but it is a common practice for the bulk of the markup declarations to appear in the external subset, and for this subset, usually contained in a file, to be referred to as "the DTD" for a class of documents.
doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' %markupdecl* ']' S?)? '>'
Constraints: Root Element Type; Non-null DTD
markupdecl ::= (%elementdecl | %AttlistDecl | %EntityDecl | %NotationDecl | %PI | %S | %Comment | InternalPERef)*
InternalPERef ::= PEReference
Constraint: Integral Declarations
Root Element Type: The Name in the document-type declaration must match the element type of the root element.
Non-null DTD: The internal and external subsets of the DTD must not both be empty.
Integral Declarations: A parameter-entity reference recognized in this context must have replacement text consisting of zero or more complete declarations, i.e. matching the production for the non-terminal markupdecl.
The external subset must obey substantially the same grammatical constraints as the internal subset; i.e. it must match the production for the non-terminal symbol markupdecl. In the external subset, however, parameter-entity references can be used to replace constructs prefixed by % in a production of the grammar, and conditional sections may occur. In the internal subset, by contrast, conditional sections may not occur and the only parameter-entity references allowed are those which match the non-terminal InternalPERef within the rule for markupdecl.
extSubset ::= (%markupdecl | %conditionalSect)*
For example:
Hello, world!]]>
The system identifier hello.dtd indicates the location of a DTD for the document.
The declarations can also be given locally, as in this example:]>Hello, world!]]>
If both the external and internal subsets are used, an XML processor must read the internal subset first, then the external subset. This has the effect that entity and attribute declarations in the internal subset take precedence over those in the external subset.
Required Markup Declaration
In some cases, an XML processor can read an XML document and accomplish useful tasks without having first processed the entire DTD. However, certain declarations can substantially affect the actions of an XML processor. It is desirable, therefore, to be able to specify whether a document contains any such declarations. A document author can communicate whether or not DTD processing is necessary using a required markup declaration (abbreviated RMD), which appears as a component of the XML declaration:
RMDecl ::= S 'RMD' Eq "'" ('NONE' | 'INTERNAL' | 'ALL') "'" | S 'RMD' Eq '"' ('NONE' | 'INTERNAL' | 'ALL') '"'
In an RMD, the value NONE indicates that an XML processor can parse the containing document correctly without first reading any part of the DTD. The value INTERNAL indicates that the XML processor must read and process the internal subset of the DTD, if provided, to parse the containing document correctly. The value ALL indicates that the XML processor must read and process the declarations in both the subsets of the DTD, if provided, to parse the containing document correctly.
The RMD must indicate that the entire DTD is required if the external subset contains any declarations of
attributes with default values, if elements to which these attributes apply appear in the document without specifying values for these attributes, or
entities (other than &magicents;), if references to those entities appear in the document, or
element types with element content, if white space occurs directly within any instance of those types.
If such declarations occur in the internal but not the external subset, the RMD must take the value INTERNAL. It is an error to specify INTERNAL if the external subset is required, or to specify NONE if the internal or external subset is required.
If no RMD is provided, an XML processor must behave as though an RMD had been provided with the value ALL.
An example XML declaration with an RMD:
<?XML version="&XML.version;" RMD='INTERNAL'?>
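The rules for choosing and checking an RMD value can be summarized in a few lines of Python. This is a minimal sketch, not a normative algorithm: the two boolean inputs stand for "this subset contains declarations of the kinds listed above that affect the document", and how a processor determines them is outside the sketch. All names are hypothetical.

```python
from typing import Optional

# Ordering of the three values from least to most DTD processing.
RANK = {"NONE": 0, "INTERNAL": 1, "ALL": 2}


def required_rmd(internal_affects: bool, external_affects: bool) -> str:
    """Smallest RMD value that is consistent with the rules above."""
    if external_affects:
        return "ALL"
    if internal_affects:
        return "INTERNAL"
    return "NONE"


def effective_rmd(declared: Optional[str]) -> str:
    """A missing RMD is treated as though ALL had been specified."""
    return "ALL" if declared is None else declared


def rmd_is_error(declared: str, internal_affects: bool,
                 external_affects: bool) -> bool:
    """It is an error for the RMD to understate what must be read."""
    needed = required_rmd(internal_affects, external_affects)
    return RANK[declared] < RANK[needed]


print(required_rmd(True, False))              # INTERNAL
print(effective_rmd(None))                    # ALL
print(rmd_is_error("INTERNAL", False, True))  # True
print(rmd_is_error("ALL", False, False))      # False
```

The text above names only understatement as an error, so the sketch does not flag a value that asks for more processing than is needed.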
Logical Structures
Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name (sometimes called its generic identifier or GI), and may have a set of attributes. Each attribute has a name and a value.
This specification does not constrain the semantics, use, or (beyond syntax) names of the elements and attributes, except that names beginning with the string XML are reserved for standardization in this or future versions of this specification.
Start-Tags, End-Tags, and Empty-Element Tags
The beginning of every non-empty XML element is marked by a start-tag.
STag ::= '<' Name (S Attribute)* S? '>'
Constraint: Unique Att Spec
Attribute ::= Name Eq AttValue
Constraints: Attribute Value Type; No External Entity References
Eq ::= S? '=' S?
The Name in the start- and end-tags gives the element's type. The Name-AttValue pairs are referred to as the attribute specifications of the element, with the Name referred to as the attribute name and the content of the AttValue (the characters between the ' or " delimiters) as the attribute value.
No attribute may appear more than once in the same start-tag or empty-element tag. The attribute must have been declared; the value must be of the type declared for it. (For attribute types, see the discussion of attribute declarations.) Attribute values cannot contain entity references to external entities.
An example of a start-tag:
<termdef term="dog">
The end of every element may (for elements which are not empty, must) be marked by an end-tag containing a name that echoes the element's type as given in the start-tag:
ETag ::= '</' Name S? '>'
An example of an end-tag:
</termdef>
The text between the start-tag and end-tag is called the element's content:
content ::= (element | PCData | Reference | CDSect | PI | Comment)*
Constraint: Content
element ::= EmptyElement | STag content ETag
Constraint: GI Match
Each element type used must be declared. The content of an element instance must match the content model declared for that element type. The Name in an element's end-tag must match that in the start-tag.
If an element is empty, it may be represented either by a start-tag immediately followed by an end-tag, or by an empty-element tag. An empty-element tag takes a special form:
EmptyElement ::= '<' Name (S Attribute)* S? '/>'
Constraint: Unique Att Spec
Empty-element tags may be used for any element which has no content, whether or not they are declared using the keyword EMPTY.
Examples of empty elements:
<IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />
<br></br>
<br/>
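That the two representations of an empty element are equivalent can be observed with an existing parser. The Python sketch below uses xml.etree.ElementTree, which implements the later XML 1.0 Recommendation rather than this draft; the enclosing doc element and the helper name are hypothetical and exist only so that the three examples form one document.

```python
import xml.etree.ElementTree as ET


def is_empty(element) -> bool:
    """True when the element has no child elements and no character data."""
    return len(element) == 0 and not (element.text or "")


doc = ET.fromstring(
    '<doc>'
    '<IMG align="left" src="http://www.w3.org/Icons/WWW/w3c_home" />'
    '<br></br>'
    '<br/>'
    '</doc>'
)

for child in doc:
    print(child.tag, is_empty(child))
# IMG True
# br True
# br True
```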
Element Declarations
The element structure of an XML document may, for validation purposes, be constrained using element and attribute declarations.
An element declaration constrains the element's type and its content.
Element declarations often constrain which element types can appear as children of the element. At user option, an XML processor may issue a warning when a reference is made to an element type for which no declaration is provided, but this is not an error.
An element declaration takes the form:
elementdecl ::= '<!ELEMENT' S %Name S (%S S)? %contentspec S? '>'
Constraint: Unique Element Declaration
contentspec ::= 'EMPTY' | 'ANY' | Mixed | elements
where the Name gives the type of the element.
No element type may be declared more than once.
An element can be declared using a content model, in which case its content can be categorized as element content or mixed content, as explained below.
An element declared using the keyword EMPTY must be empty and may be tagged using an empty-element tag when it appears in the document.
If an element type is declared using the keyword ANY, then there are no validity constraints on its content: it may contain child elements of any type and number, interspersed with character data.
Examples of element declarations:
<!ELEMENT br EMPTY>
<!ELEMENT %name.para; %content.para; >
<!ELEMENT container ANY>
Element Content
An element type may be declared to have element content, which means that elements of that type may only contain other elements (no character data). In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they appear. The grammar is built on content particles (CPs), which consist of names, choice lists of content particles, or sequence lists of content particles:
elements ::= (choice | seq) ('?' | '*' | '+')?
cp ::= (Name | choice | seq) ('?' | '*' | '+')?
cps ::= S? %cp S?
choice ::= '(' S? %ctokplus (S? '|' S? %ctoks)* S? ')'
ctokplus ::= cps ('|' cps)+
ctoks ::= cps ('|' cps)*
seq ::= '(' S? %stoks (S? ',' S? %stoks)* S? ')'
stoks ::= cps (',' cps)*
where each Name gives the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the appropriate location; content particles occurring in a sequence list must each appear in the element content in the order given. The optional character following a name or list governs whether the element or the content particles in the list may occur one or more (+), zero or more (*), or zero or one times (?). The syntax and meaning are identical to those used in the productions in this specification.
The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element name in the content model. For compatibility reasons, it is an error if an element in the document can match more than one occurrence of an element name in the content model. More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error. For more information, see the appendix on deterministic content models.
Examples of element-content models:
<!ELEMENT spec (front, body, back?)>
<!ELEMENT div1 (head, (p | list | note)*, div2*)>
<!ELEMENT head (%head.content; | %head.misc;)*>
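Since a content model is in effect a regular expression over element names, matching can be sketched by translating the model into an ordinary regular expression. The Python below is one possible illustration, not the automaton construction cited above: it does not perform the determinism check, it does not expand parameter-entity references such as %head.content;, it compares names exactly rather than after case-folding, and its function names are hypothetical.

```python
import re

# Simplified ASCII-only pattern for element names (an assumption).
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def model_to_regex(model: str):
    """Translate a content model into a regex over ';'-terminated names."""
    compact = "".join(model.split())          # drop all white space
    body = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", compact)
    # Sequence is plain concatenation in a regex, so commas are dropped;
    # '|', '?', '*', '+' and parentheses keep their meaning.
    return re.compile(body.replace(",", ""))


def content_matches(model: str, children) -> bool:
    """True if the sequence of child element names matches the model."""
    sequence = "".join(name + ";" for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None


print(content_matches("(front, body, back?)", ["front", "body"]))   # True
print(content_matches("(front, body, back?)", ["body", "front"]))   # False
print(content_matches("(head, (p | list | note)*, div2*)",
                      ["head", "p", "note", "div2"]))               # True
```

Terminating every name with a semicolon keeps a short name such as p from matching the beginning of a longer one.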
Mixed Content
An element type may be declared to contain mixed content, that is, text comprising character data optionally interspersed with child elements. In this case, the types of the child elements are constrained, but not their order or their number of occurrences:
Mixed ::= '(' S? %( %'#PCDATA' (S? '|' S? %Mtoks)* ) S? ')*' | '(' S? %('#PCDATA') S? ')'
Mtoks ::= %Name (S? '|' S? %Name)*
where the Names give the types of elements that may appear as children. The same name must not appear more than once in a single mixed-content declaration.
Examples of mixed content declarations:
<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
<!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
<!ELEMENT b (#PCDATA)>
Attribute-List Declarations
Attributes are used to associate name-value pairs with elements. Attributes may appear only within start-tags; thus, the productions used to recognize them appear in the discussion of start-tags. Attribute-list declarations may be used:
To define the set of attributes pertaining to a given element type.
To establish a set of type constraints on these attributes.
To provide default values for attributes.
Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:
AttlistDecl ::= '<!ATTLIST' S %Name S? %AttDef+ S? '>'
AttDef ::= S %Name S %AttType S %Default
The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor may issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.
When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, and at most one attribute definition for a given attribute name. For interoperability, an XML processor may at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition for a given attribute, but this is not an error.
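The merging rule can be sketched as follows. This is an illustration only: the function name merge_attlists and the data layout (a list of element type and attribute-definition pairs in document order) are assumptions, not part of this specification.

```python
def merge_attlists(declarations):
    """Merge attribute-list declarations in document order.

    declarations: iterable of (element_type, [(attr_name, definition), ...]).
    Returns (merged, warnings); merged maps element type to a dict of
    attribute name to the first definition seen for that name.
    """
    merged = {}
    warnings = []
    seen_types = set()
    for element_type, att_defs in declarations:
        if element_type in seen_types:
            # Optional interoperability warning; not an error.
            warnings.append(
                f"more than one attribute-list declaration for {element_type}")
        seen_types.add(element_type)
        attrs = merged.setdefault(element_type, {})
        for name, definition in att_defs:
            if name in attrs:
                # The first declaration is binding; later ones are ignored.
                warnings.append(
                    f"attribute {name} of {element_type} defined more than once")
            else:
                attrs[name] = definition
    return merged, warnings
```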
Attribute Types
XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types have varying lexical and semantic constraints, as noted:
AttType ::= StringType | TokenizedType | EnumeratedType
StringType ::= 'CDATA'
TokenizedType ::= 'ID' [constraint: ID]
| 'IDREF' [constraint: Idref]
| 'IDREFS' [constraint: Idref]
| 'ENTITY' [constraint: Entity Name]
| 'ENTITIES' [constraint: Entity Name]
| 'NMTOKEN' [constraint: Name Token]
| 'NMTOKENS' [constraint: Name Token]
ID: Values of this type must be valid Name symbols. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.
Idref: Values of this type must match the Name (for IDREFS, the Names) production; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match some ID.
Entity Name: Values of this type must match the production for Name (for ENTITIES, Names); each Name must exactly match the name of an external binary general entity declared in the DTD.
Name Token: Values of this type must consist of a string matching the Nmtoken nonterminal (for NMTOKENS, the Nmtokens nonterminal) of the grammar defined in this specification.
The XML processor must normalize attribute values before passing them to the application, as described in the section on attribute-value normalization.
Enumerated attributes can take one of a list of values provided in the declaration; there are two types:
EnumeratedType ::= NotationType | Enumeration
NotationType ::= %'NOTATION' S '(' S? %Ntoks (S? '|' S? %Ntoks)* S? ')' [constraint: Notation Attributes]
Ntoks ::= %Name (S? '|' S? %Name)*
Enumeration ::= '(' S? %Etoks (S? '|' S? %Etoks)* S? ')' [constraint: Enumeration]
Etoks ::= %Nmtoken (S? '|' S? %Nmtoken)*
Notation Attributes: The names in the declaration of NOTATION attributes must be names of declared notations (see the discussion of notations). Values of this type must match one of the notation names included in the declaration.
Enumeration: Values of this type must match one of the Nmtoken tokens in the declaration.
For interoperability, the same Nmtoken should not occur more than once in the enumerated attribute types of a single element type.
Attribute Defaults
An attribute declaration provides information on whether the attribute's presence is required, and if not, how an XML processor should react if a declared attribute is absent in a document:
Default ::= '#REQUIRED' | '#IMPLIED' | ((%'#FIXED' S)? %AttValue) [constraint: Attribute Default Legal]
#REQUIRED means that the document is invalid should the processor encounter a start-tag for the element type in question which specifies no value for this attribute. #IMPLIED means that if the attribute is omitted from an element of this type, the XML processor must inform the application that no value was specified; no constraint is placed on the behavior of the application.
If the attribute is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value. If #FIXED is present, the document is invalid if the attribute is present with a different value from the default. If a default value is declared, when an XML processor encounters an omitted attribute, it is to behave as though the attribute were present with its value being the declared default value.
Attribute Default Legal: The declared default value must meet the constraints of the declared attribute type.
Examples of attribute-list declarations:
<!ATTLIST termdef id ID #REQUIRED name CDATA #IMPLIED>
<!ATTLIST list type (bullets|ordered|glossary) "ordered">
<!ATTLIST form method CDATA #FIXED "POST">
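How a processor might act on the three kinds of default can be sketched as below. The function name apply_defaults, the "DEFAULT" marker for a plain declared default, and the use of None to stand for "no value was specified" are all assumptions made for this illustration.

```python
def apply_defaults(declared, specified):
    """Apply declared attribute defaults to the attributes of one start-tag.

    declared: {name: (kind, default)} where kind is "#REQUIRED", "#IMPLIED",
              "#FIXED", or "DEFAULT" (hypothetical marker for a plain default).
    specified: {name: value} as written in the start-tag.
    Returns (effective, errors); None means "no value was specified".
    """
    effective = dict(specified)
    errors = []
    for name, (kind, default) in declared.items():
        present = name in specified
        if kind == "#REQUIRED":
            if not present:
                errors.append(f"required attribute {name} is missing")
        elif kind == "#IMPLIED":
            if not present:
                effective[name] = None
        else:
            if not present:
                # Behave as though the attribute were present with the default.
                effective[name] = default
            elif kind == "#FIXED" and specified[name] != default:
                errors.append(f"attribute {name} differs from its fixed value")
    return effective, errors
```

With the declarations above, an element list written without a type attribute would be reported with type "ordered", and a form element written with method="GET" would make the document invalid.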
Attribute-Value Normalization
Before the value of an attribute is passed to the application, theXML processor must normalize it as follows:
Line-end characters (or, on some systems, record boundaries) must be replaced by single space (#x20) characters.
Character references and references to internal text entities must be expanded. References to external entities are an error.
If the attribute is not of type CDATA, all strings of white space must be normalized to single space characters (#x20), and leading and trailing white space must be removed.
Values of type ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, or of enumerated or notation types, must be folded to uppercase.
If no DTD is present, attributes should be treated as CDATA.
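The normalization steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: references are taken to have been expanded already, a CR LF pair is treated as one line end, and the type labels "ENUMERATION" and "NOTATION" as well as the function name are hypothetical.

```python
import re

# Types whose values are folded to uppercase. "ENUMERATION" and "NOTATION"
# are hypothetical labels for the enumerated and notation types.
FOLDED_TYPES = {"ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS",
                "ENUMERATION", "NOTATION"}

def normalize_attribute_value(value, att_type="CDATA"):
    """Normalize an attribute value whose references are already expanded.

    att_type defaults to "CDATA", which also covers the no-DTD case.
    """
    # Line ends become single spaces (assumption: CR LF counts as one).
    value = re.sub(r"\r\n|\r|\n", " ", value)
    if att_type != "CDATA":
        # Collapse runs of white space and trim both ends.
        value = re.sub(r"[ \t]+", " ", value).strip(" \t")
    if att_type in FOLDED_TYPES:
        value = value.upper()
    return value
```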
Conditional Sections
Conditional sections are portions of the document type declaration external subset which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.
conditionalSect ::= includeSect | ignoreSect
includeSect ::= '<![' %'INCLUDE' '[' (%markupdecl*)* ']]>'
ignoreSect ::= '<![' %'IGNORE' '[' ignoreSectContents* ']]>'
ignoreSectContents ::= ((SkipLit | Comment | PI) - (Char* ']]>' Char*))
| ('<![' ignoreSectContents* ']]>')
| (Char - (']' | [<'"]))
| ('<!' (Char - ('-' | '[')))
Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space.
If the keyword of the conditional section is INCLUDE, then the conditional section is read and processed in the normal way. If the keyword is IGNORE, then the declarations within the conditional section are ignored; the processor must read the conditional section to detect nested conditional sections and ensure that the end of the outermost (ignored) conditional section is properly detected. If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections are ignored.
If the keyword of the conditional section is a parameter entity reference, the parameter entity is replaced by its value before the processor decides whether to include or ignore the conditional section.
An example:
<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >

<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>
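The decision whether a (possibly nested) conditional section takes effect can be sketched as follows. The function name effective_keyword and the representation of the nesting as a list of keywords, outermost first, are assumptions for illustration; parsing of the sections themselves is not shown.

```python
def effective_keyword(keyword_stack, parameter_entities):
    """Decide whether the innermost conditional section is included.

    keyword_stack: keywords of the enclosing sections, outermost first,
                   each "INCLUDE", "IGNORE", or a reference like "%draft;".
    parameter_entities: {name: value} for declared parameter entities.
    """
    for keyword in keyword_stack:
        if keyword.startswith("%") and keyword.endswith(";"):
            # Replace the parameter-entity reference by its value first.
            keyword = parameter_entities[keyword[1:-1]]
        if keyword == "IGNORE":
            # Everything nested inside an ignored section is ignored too.
            return "IGNORE"
    return "INCLUDE"
```

With the declarations of the example, effective_keyword(["%draft;"], {"draft": "INCLUDE", "final": "IGNORE"}) yields "INCLUDE", so the first declaration of book is the one that is processed.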
Physical Structures
An XML document may consist of one or many virtual storage units. These are called entities; they are identified by name and have content. An entity may be stored in, but need not comprise the whole of, a single physical storage object such as a file or database field. Each XML document has one entity called the document entity, which serves as the starting point for the XML processor (and may contain the whole document).
Entities may be either binary or text. A text entity contains text data which is considered as an integral part of the document. A binary entity contains binary data with an associated notation. Only text entities may be referred to using entity references; only the names of binary entities may be given as the value of ENTITY attributes.
Logical and Physical Structures
The logical and physical structures (elements and entities) in an XML document must be synchronous. Tags and elements must each begin and end in the same entity, but may refer to other entities internally; comments, processing instructions, character references, and entity references must each be contained entirely within a single entity. Entities must each contain an integral number of elements, comments, processing instructions, and references, possibly together with character data not contained within any element in the entity, or else they must contain non-textual data, which by definition contains no elements.
Character and Entity References
A character reference refers to a specific character in the ISO/IEC 10646 character set, e.g. one not directly accessible from available input devices:
CharRef ::= '&#' [0-9]+ ';' | '&hcro;' [0-9a-fA-F]+ ';'
An entity reference refers to the content of a named entity. General entities are text entities for use within the document itself; references to them use ampersand (&) and semicolon (;) as delimiters. In this specification, general entities are sometimes referred to with the unqualified term entity when this leads to no ambiguity. Parameter entities are text entities for use within the DTD, or to control processing of conditional sections; references to them use percent-sign (%) and semicolon (;) as delimiters.
Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';' [constraints: Entity Declared, Text Entity, No Recursion]
PEReference ::= '%' Name ';' [constraints: Entity Declared, Text Entity, No Recursion, In DTD]
Entity Declared: The Name given in the entity reference must exactly match the name given in the declaration of the entity, except that well-formed documents need not declare any of the following entities: &magicents;. In valid documents, these entities must be declared, in the form specified in the section on predefined entities. In the case of parameter entities, the declaration must precede the reference.
Text Entity: An entity reference must not contain the name of a binary entity. Binary entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.
No Recursion: A text or parameter entity must not contain a recursive reference to itself, either directly or indirectly.
In DTD: In the external DTD subset, a parameter-entity reference is recognized only at the locations where the nonterminal PEReference or the special operator % appears in a production of the grammar. In the internal subset, parameter-entity references are recognized only when they match the InternalPERef non-terminal in the production for markupdecl.
Examples of character and entity references:
Type <key>less-than</key> (&hcro;3C;) to save options.
This document was prepared on &docdate; and is classified &security-level;.
Example of a parameter-entity reference:
<!ENTITY % ISOLat2 SYSTEM "http://www.xml.com/iso/isolat2-xml.entities" >
%ISOLat2;
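The rule that an entity must not refer to itself, directly or indirectly, lends itself to a small graph search. The sketch below is one possible check, not a prescribed algorithm: the function name is hypothetical, the name pattern is a simplification of the Name production, and only general-entity references of the form &name; are followed.

```python
import re

# Simplified stand-in for the Name production (an assumption).
_ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")

def find_recursive_entities(entities):
    """Return the names of entities that refer to themselves.

    entities: {name: replacement text}. An entity is reported if following
    its general-entity references leads back to the entity itself.
    """
    graph = {name: set(_ENTITY_REF.findall(text))
             for name, text in entities.items()}
    recursive = set()
    for start in graph:
        stack = list(graph[start])
        visited = set()
        while stack:
            current = stack.pop()
            if current == start:
                recursive.add(start)
                break
            if current in visited:
                continue
            visited.add(current)
            # Undeclared names simply have no outgoing references here.
            stack.extend(graph.get(current, ()))
    return recursive
```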
Entity Declarations
Entities are declared thus:
EntityDecl ::= '<!ENTITY' S %Name S %EntityDef S? '>' /* General entities */
| '<!ENTITY' S '%' S %Name S %EntityDef S? '>' /* Parameter entities */
EntityDef ::= EntityValue | ExternalDef
The Name is that by which the entity is invoked by exact match in an entity reference. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.
Internal Entities
If the entity definition is an EntityValue, the defined entity is called an internal entity. There is no separate physical storage object, and the replacement text of the entity is given in the declaration. Within the EntityValue, parameter-entity references and character references are recognized and expanded immediately. General-entity references within the replacement text are not recognized at the time the entity declaration is parsed, though they may be recognized when the entity itself is referred to.
An internal entity is a text entity.
Example of an internal entity declaration:
<!ENTITY Pub-Status "This is a pre-release of the specification.">
External Entities
If the entity is not internal, it is an external entity, declared as follows:
ExternalDef ::= ExternalID %NDataDecl?
ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral
NDataDecl ::= S %'NDATA' S %Name [constraint: Notation Declared]
If the NDataDecl is present, this is a binary data entity, otherwise a text entity.
Notation Declared: The Name must match the declared name of a notation.
The SystemLiteral that follows the keyword SYSTEM is called the entity's system identifier. It is a URL, which may be used to retrieve the entity. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URLs are relative to the location of the entity or file within which the entity declaration occurs. Relative URLs in entity declarations within the internal DTD subset are thus relative to the location of the document; those in entity declarations in the external subset are relative to the location of the files containing the external subset.
In addition to a system literal, an external identifier may include a public identifier. An XML processor may use the public identifier to try to generate an alternative URL. If the processor is unable to do so, it must use the URL specified in the system literal.
Examples of external entity declarations:
<!ENTITY open-hatch SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY open-hatch PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN" "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY hatch-pic SYSTEM "../grafix/OpenHatch.gif" NDATA gif >
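Resolution of a relative system identifier against the location of the declaring entity can be sketched with a standard URL-joining routine. The function name and the base URL used in the usage note (a document called doc.xml) are hypothetical; the sketch does not cover overrides supplied from outside this specification.

```python
from urllib.parse import urljoin

def resolve_system_identifier(declaring_entity_url, system_literal):
    """Resolve a system literal relative to the entity that declares it.

    declaring_entity_url: location of the entity or file in which the
                          entity declaration occurs.
    system_literal: the URL given after SYSTEM (or after the public
                    identifier), possibly relative.
    """
    return urljoin(declaring_entity_url, system_literal)
```

If the third declaration above occurred in a document at the assumed location http://www.textuality.com/boilerplate/doc.xml, the system identifier ../grafix/OpenHatch.gif would resolve to http://www.textuality.com/grafix/OpenHatch.gif.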
Character Encoding in Entities
Each external text entity in an XML document may use a different encoding for its characters. All XML processors must be able to read entities in either UTF-8 or UCS-2. It is recognized that for some purposes, the use of additional ISO/IEC 10646 planes other than the Basic Multilingual Plane may be required. A facility for handling characters in these planes is therefore a desirable characteristic in XML processors and applications.
Entities encoded in UCS-2 must begin with the Byte Order Mark described by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UCS-2 encoded documents.
Although an XML processor is only required to read entities in the UTF-8 and UCS-2 encodings, it is recognized that many other encodings are in daily use around the world, and it may be advantageous for XML processors to read entities that use these encodings. For this purpose, XML provides an encoding declaration processing instruction, which, if it occurs, must appear at the beginning of a system entity, before any other character data or markup. In the document entity, the encoding declaration is part of the XML declaration; in other entities, it is part of an encoding processing instruction:
EncodingDecl ::= S 'encoding' Eq QEncoding [constraint: Start of System Entity]
EncodingPI ::= &xmlpio; S 'encoding' Eq QEncoding S? &pic;
QEncoding ::= '"' Encoding '"' | "'" Encoding "'"
Encoding ::= LatinName
LatinName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Name containing only Latin characters */
Start of System Entity: An XML encoding declaration may occur at the beginning of a system entity; it must not occur within the body of any entity.
The values UTF-8, UTF-16, ISO-10646-UCS-2, and ISO-10646-UCS-4 should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values ISO-8859-1, ISO-8859-2, ... ISO-8859-9 should be used for the parts of ISO 8859, and the values ISO-2022-JP, Shift_JIS, and EUC-JP should be used for the various encoded forms of JIS X-0208. XML processors may recognize other encodings; it is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority (IANA), other than those just listed, should be referred to using their registered names.
It is an error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration.
An entity which begins with neither a Byte Order Mark nor an encoding declaration must be in the UTF-8 encoding.
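The use of the Byte Order Mark to tell UCS-2 from UTF-8 can be sketched as below. Several assumptions apply: the function name is hypothetical, encoding declarations are not examined, and Python's UTF-16 codecs stand in for UCS-2, with which they coincide for characters in the Basic Multilingual Plane.

```python
def decode_entity(data):
    """Decode the bytes of an entity that carries no encoding declaration.

    A leading Byte Order Mark (#xFEFF) signals UCS-2 and is dropped, since
    it is a signature rather than part of the document. Without it, the
    entity is taken to be UTF-8.
    """
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")  # stand-in for big-endian UCS-2
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")  # stand-in for little-endian UCS-2
    return data.decode("utf-8")
```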
While XML provides mechanisms for distinguishing encodings, it is recognized that in a heterogeneous networked environment, it may be difficult to signal the encoding of an entity reliably. Errors in this area fall into two categories:
failing to read an entity because of inability to recognize its actual encoding, and
reading an entity incorrectly because of an incorrect guess of its proper encoding.
The first class of error is extremely damaging, and given a correct encoding declaration, the second class is extremely unlikely. For these reasons, XML processors should make an effort to use all available information, internal and external, to aid in detecting an entity's correct encoding. Such information may include, but is not limited to:
Information from an HTTP header
A MIME header obtained other than through HTTP
Metadata provided by the native OS file system or by document management software
Analysis of the bit patterns at the front of an entity to determine if the application of any known encoding yields a valid encoding declaration. See the appendix on autodetection of character sets for a fuller description.
If an XML processor encounters an entity with an encoding that it is unable to process, it may inform the application of this fact and may allow the application to request either that the entity should be treated as a binary entity, or that processing should cease.
Examples of encoding declarations:
<?XML ENCODING='UTF-8'?>
<?XML ENCODING='EUC-JP'?>
Document Entity
The document entity serves as the root of the entity tree and a starting-point for an XML processor. This specification does not specify how the document entity is to be located by an XML processor; unlike other entities, the document entity might well appear on an input stream of the processor without any identification at all.
XML Processor Treatment of Entities
XML allows character and general-entity references in two places: the content of elements (content) and attribute values (AttValue). When an XML processor encounters such a reference, or the name of an external binary entity as the value of an ENTITY or ENTITIES attribute, then:
In all cases, the XML processor may inform the application of the reference's occurrence and its identifier (for an entity reference, the name; for a character reference, the character number in decimal, hexadecimal, or binary form).
For both character and entity references, the processor must remove the reference itself from the text data before passing the data to the application.
For character references, the processor must pass the character indicated to the application in place of the reference.
For an external entity, the processor must inform the application of the entity's system identifier and public identifier if any.
If the external entity is binary, the processor must inform the application of the associated notation name, and the notation's associated system and public (if any) identifiers.
For an internal (text) entity, the processor must include the entity; that is, retrieve its replacement text and process it as a part of the document (i.e. as content or AttValue, whichever was being processed when the reference was recognized), passing the result to the application in place of the reference. The replacement text may contain both text and markup, which must be recognized in the usual way, except that the replacement text of entities used to escape markup delimiters (the entities &magicents;) is always treated as data. (The string AT&amp;T expands to AT&T; the remaining ampersand is not recognized as an entity-reference delimiter.)
Since the entity may contain other entity references, an XML processor may have to repeat the inclusion process recursively.
If the entity is an external text entity, then in order to validate the XML document, the processor must include the content of the entity.
If the entity is an external text entity, and the processor is not attempting to validate the XML document, the processor may, but need not, include the entity's content.
This rule is based on the recognition that the automatic inclusion provided by the SGML and XML text entity mechanism, primarily designed to support modularity in authoring, is not necessarily appropriate for other applications, in particular document browsing. Browsers, for example, when encountering an external text entity reference, might choose to provide a visual indication of the entity's presence and retrieve it for display only on demand.
Entity and character references can both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (&magicents;) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized, and must be treated as character data, so the numeric character references &#60; and &#38; may be used to escape < and & when they occur in character data.
XML allows parameter-entity references in a variety of places within the DTD. Parameter-entity references are always expanded immediately upon being recognized, and the DTD must match the relevant rules of the grammar after all parameter-entity references have been expanded. In addition, parameter entities referred to in specific contexts are required to satisfy certain constraints in their replacement text; for example, a parameter entity referred to within the internal DTD subset must match the rule for markupdecl.
Implementors of XML processors need to know the rules for expansion of references in more detail. These rules only come into play when the replacement text for an internal entity itself contains other references.
In the replacement text of an internal entity, parameter-entity references and character references in the replacement text are recognized and resolved when the entity declaration is parsed, before the replacement text is stored in the processor's symbol table. General-entity references in the replacement text are not resolved when the entity declaration is parsed.
In the document, when a general-entity reference is resolved, its replacement text is parsed. Character references encountered in the replacement text are resolved immediately; general-entity references encountered in the replacement text may be resolved or left unresolved, as described above. Character and general-entity references must be contained entirely within the entity's replacement text.
Simple character references do not suffice to escape delimiters within the replacement text of an internal entity: they will be expanded when the entity declaration is parsed, before the replacement text is stored in the symbol table. When the entity itself is referred to, the replacement text will be parsed again, and the delimiters (no longer character references) will be recognized as delimiters. To escape the characters &magicents; in an entity replacement text, use a general-entity reference or a doubly-escaped character reference. See the appendix on expansion of entity references for detailed examples.
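The difference between a simple and a doubly-escaped character reference can be traced with a small simulation of the two parsing stages. This is a sketch under assumptions: only decimal character references are handled, parameter-entity references are left out, and the function names are hypothetical.

```python
import re

_DECIMAL_CHAR_REF = re.compile(r"&#([0-9]+);")

def expand_char_refs_once(text):
    """Expand decimal character references in a single left-to-right pass.

    Characters produced by an expansion are not rescanned in the same pass.
    """
    return _DECIMAL_CHAR_REF.sub(lambda m: chr(int(m.group(1))), text)

def declare(entity_value):
    """Return the replacement text stored when the declaration is parsed."""
    return expand_char_refs_once(entity_value)

# A simple reference is already a bare "<" in the stored replacement text,
# so it would be recognized as a delimiter when the entity is referred to.
simple = declare("&#60;")

# A doubly-escaped reference is stored as "&#60;" and only becomes "<",
# treated as character data, when the entity is referred to.
doubled = declare("&#38;#60;")
```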
Predefined Entities
As mentioned in the discussion of Character Data and Markup, the characters used as markup delimiters by XML may all be escaped using entity references (for the entities &magicents;).
All XML processors must recognize these entities whether they are declared or not. Valid XML documents must declare these entities, like any others, before using them.
If the entities in question are declared, they must be declared as internal entities whose replacement text is the single character being escaped.
Notation Declarations
Notations identify by name the format of external binary entities, or the application to which processing instructions are addressed.
Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute-value specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.
NotationDecl ::= '<!NOTATION' S %Name S %ExternalID S? '>'
XML processors must provide applications with the name and external identifier of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They may additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described. (It is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)
Conformance
Conforming XML processors fall into two classes: validating and non-validating.
Validating and non-validating systems alike must report violations of the well-formedness constraints given in this specification.
Validating processors must report locations in which the document does not comply with the constraints expressed by the declarations in the DTD. They must also report all failures to fulfill the validity constraints given in this specification.