W3C

Character Model for the World Wide Web 1.0: Normalization

W3C Working Draft 1 May 2012

This version:
http://www.w3.org/TR/2012/WD-charmod-norm-20120501/
Latest version:
http://www.w3.org/TR/charmod-norm/
Previous versions:
http://www.w3.org/TR/2005/WD-charmod-norm-20051027/
http://www.w3.org/TR/2004/WD-charmod-norm-20040225/
Editors:
François Yergeau, Invited Expert (and before at Alis Technologies)
Martin J. Dürst, (until Dec 2004 while at W3C)
Richard Ishida, W3C (and before at Xerox)
Addison Phillips, Invited Expert (and before at WebMethods)
Misha Wolf, (until Dec 2002 while at Reuters Ltd.)
Tex Texin, (until Dec 2004 while an Invited Expert, and before at Progress Software)

Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.


Abstract

Based on Character Model for the World Wide Web 1.0: Fundamentals [CharMod], this Architectural Specification provides authors of specifications, software developers, and content developers with a common reference on the use of normalization of text and string identity matching on the Web. The goal of this specification is to improve interoperable text manipulation on the World Wide Web.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This version of this document was published to indicate the Internationalization Core Working Group's intention to substantially alter or replace the recommendations found here with very different recommendations in the near future. Other than this note, this Working Draft is identical to the draft of 2005-10-27.

This is an updated W3C Working Draft of this document. The main difference from previous versions of this document is that it no longer proposes to rely exclusively on Early Uniform Normalization. Comments may be submitted by email to www-international@w3.org (public archive). A list of comments from an earlier last call with their status can be found in the disposition of comments (public version, Members only version).

This document is published as part of the W3C Internationalization Activity by the Internationalization Core Working Group. The Working Group expects to advance this Working Draft to Recommendation Status (see W3C document maturity levels).

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
    1.1 Goals and Scope
    1.2 Background
    1.3 Terminology and Notation
2 Conformance
3 Normalization
    3.1 Motivation
        3.1.1 Why do we need character normalization?
        3.1.2 Early or late normalization
        3.1.3 The choice of Normalization Form
    3.2 Definitions for W3C Text Normalization
        3.2.1 Normalizing Transcoder
        3.2.2 Unicode-normalized text
        3.2.3 Include-normalized text
        3.2.4 Fully-normalized text
        3.2.5 Normalization-sensitive operations
        3.2.6 Text-processing component
        3.2.7 Certified and suspect text
    3.3 Examples
        3.3.1 General examples
        3.3.2 Examples of XML in a Unicode encoding form
        3.3.3 Examples of restrictions on the use of combining characters
    3.4 Responsibility for Normalization
4 String Identity Matching

Appendices

A References
    A.1 Normative References
    A.2 Other References
B Composing Characters (Non-Normative)
C Resources for Normalization (Non-Normative)
D Acknowledgements (Non-Normative)


1 Introduction

1.1 Goals and Scope

The goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, and cultural conventions, in accordance with the W3C goal of universal access. One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.

The main target audience of this specification is W3C specification developers. This specification and parts of it can be referenced from other W3C specifications. It defines conformance criteria for W3C specifications as well as other specifications.

Other audiences of this specification include software developers, content developers, and authors of specifications outside the W3C. Software developers and content developers implement and use W3C specifications. This specification defines some conformance criteria for implementations (software) and content that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in W3C specifications.

The character model described in this specification provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a more international Web.

Topics addressed in this part of the Character Model for the World Wide Web include early uniform normalization, late normalization and string identity matching.

Other parts of the Character Model address the fundamental aspects of the model ([CharMod]) and Internationalized Resource Identifiers (IRI) conventions ([CharIRI]).

Topics as yet not addressed or barely touched include fuzzy matching and language tagging. Some of these topics may be addressed in a future version of this specification.

At the core of the model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [Unicode] and ISO/IEC 10646 [ISO/IEC 10646]. In this document, Unicode is used as a synonym for the Universal Character Set. The model will allow Web documents authored in the world's scripts (and on different platforms) to be exchanged, read, and searched by Web users around the world.

1.2 Background

This section provides some historical background on the topics addressed in this specification.

Starting with Internationalization of the Hypertext Markup Language [RFC 2070], the Web community has recognized the need for a character model for the World Wide Web. The first step towards building this model was the adoption of Unicode as the document character set for HTML.

The choice of Unicode was motivated by the fact that Unicode:

  • is the only universal character repertoire available,

  • provides a way of referencing characters independent of the encoding of the text,

  • is being updated/completed carefully,

  • is widely accepted and implemented by industry.

W3C adopted Unicode as the document character set for HTML in [HTML 4.0]. The same approach was later used for specifications such as XML 1.0 [XML 1.0] and CSS2 [CSS2]. W3C specifications and applications now use Unicode as the common reference character set.

When data transfer on the Web remained mostly unidirectional (from server to browser), and where the main purpose was to render documents, the use of Unicode without specifying additional details was sufficient. However, the Web has grown:

  • Data transfers among servers, proxies, and clients, in all directions, have increased.

  • Non-ASCII characters [ISO/IEC 646] are being used in more and more places.

  • Data transfers between different protocol/format elements (such as element/attribute names, URI components, and textual content) have increased.

  • More and more APIs are defined, not just protocols and formats.

In short, the Web may be seen as a single, very large application (see [Nicol]), rather than as a collection of small independent applications.

While these developments strengthen the requirement that Unicode be the basis of a character model for the Web, they also create the need for additional specifications on the application of Unicode to the Web. Some aspects of Unicode that require additional specification for the Web include:

  • Choice of Unicode encoding forms (UTF-8, UTF-16, UTF-32).

  • Counting characters, measuring string length in the presence of variable-length character encodings and combining characters.

  • Duplicate encodings of characters (e.g. precomposed vs decomposed).

  • Use of control codes for various purposes (e.g. bidirectionality control, symmetric swapping, etc.).

It should be noted that such aspects also exist for legacy encodings (where legacy encoding is taken to mean any character encoding not based on Unicode), and in many cases have been inherited by Unicode in one way or another from such legacy encodings.

The remainder of this specification presents additional requirements to ensure an interoperable character model for the Web, taking into account earlier work (from W3C, ISO and IETF).

The first few chapters of the Unicode Standard [Unicode] provide very useful background reading. The policies adopted by the IETF on the use of character sets on the Internet are documented in [RFC 2277].

For information about the requirements that informed the development of important parts of this specification, see Requirements for String Identity Matching and String Indexing [CharReq].

1.3 Terminology and Notation

For the purpose of this specification, the producer of text data is the sender of the data in the case of protocols, and the tool that produces the data in the case of formats. The recipient of text data is the software module that receives the data.

NOTE: A software module may be both a recipient and a producer.

Unicode code points are denoted as U+hhhh, where "hhhh" is a sequence of at least four, and at most six hexadecimal digits.
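As a small illustration of this notation, the following Python sketch formats a character's code point with at least four and at most six hexadecimal digits. The function name `code_point_label` is hypothetical and is not defined by this specification.

```python
def code_point_label(ch: str) -> str:
    """Return the U+hhhh label for a single character.

    The format pads to four hexadecimal digits; code points above
    U+FFFF naturally take five or six digits.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(ch):04X}"


labels = [code_point_label(c) for c in ("c", "\u00e7", "\u0327", "\U0010ffff")]
```

Here `labels` holds the labels for 'c', the precomposed 'ç', the combining cedilla, and the last code point of the Unicode code space.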

Characters have been used in various examples that will not appear as intended unless you have the appropriate font. Care has been taken to ensure that the examples nevertheless remain understandable.

2 Conformance

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY" and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].

NOTE: RFC 2119 makes it clear that requirements that use SHOULD are not optional and must be complied with unless there are specific reasons not to: "This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."

This specification places conformance criteria on specifications, on software and on Web content. To aid the reader, all conformance criteria are preceded by '[X]' where 'X' is one of 'S' for specifications, 'I' for software implementations, and 'C' for Web content. These markers indicate the relevance of the conformance criteria and allow the reader to quickly locate relevant conformance criteria by searching through this document.

Specifications conform to this document if they:

  1. do not violate any conformance criteria preceded by [S],

  2. document the reason for any deviation from criteria where the imperative is SHOULD, SHOULD NOT, or RECOMMENDED,

  3. make it a conformance requirement for implementations to conform to this document,

  4. make it a conformance requirement for content to conform to this document.

Software conforms to this document if it does not violate any conformance criteria preceded by [I].

Content conforms to this document if it does not violate any conformance criteria preceded by [C].

NOTE: Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications.

Where this specification contains a procedural description, it is to be understood as a way to specify the desired external behavior. Implementations can use other means of achieving the same results, as long as observable behavior is not affected.

3 Normalization

This chapter discusses text normalization for the Web. 3.1 Motivation discusses the need for normalization. 3.2 Definitions for W3C Text Normalization defines the various types of normalization and 3.3 Examples gives supporting examples. 3.4 Responsibility for Normalization assigns responsibilities to various components and situations. The requirements for early uniform normalization are discussed in [CharReq], section 3.

3.1 Motivation

3.1.1 Why do we need character normalization?

Text in computers can be encoded in one of many character encodings. In addition, some character encodings allow multiple representations for the 'same' string, and Web languages have escape mechanisms that introduce even more equivalent representations. For instance, in ISO 8859-1 the letter 'ç' can only be represented as the single character E7 'ç', but in a Unicode encoding it can be represented as the single character U+00E7 'ç' or the sequence U+0063 'c' U+0327 '¸'. In HTML it could be additionally represented as &ccedil; or &#231; or &#xE7; (five equivalent representations in total).
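As a rough illustration (a sketch using Python's standard library, not part of the Character Model), the following shows the two Unicode representations of 'ç' and the three HTML escapes &ccedil;, &#231; and &#xE7; collapsing to a single string once escapes are expanded and the result is brought to NFC. The helper name `canonical` is hypothetical, and `html.unescape` stands in for whatever escape expansion the host language defines.

```python
import html
import unicodedata


def canonical(text: str) -> str:
    """Expand HTML character escapes, then normalize the result to NFC."""
    return unicodedata.normalize("NFC", html.unescape(text))


representations = [
    "\u00e7",     # precomposed U+00E7
    "c\u0327",    # U+0063 followed by U+0327 COMBINING CEDILLA
    "&ccedil;",   # named character reference
    "&#231;",     # decimal numeric character reference
    "&#xE7;",     # hexadecimal numeric character reference
]

# Five distinct inputs, one canonical result.
distinct_inputs = set(representations)
distinct_results = {canonical(r) for r in representations}
```

Comparing the raw inputs finds five different strings, while comparing the canonical forms finds only one.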

There are a number of fundamental operations that are sensitive to these multiple representations: string matching, indexing, searching, sorting, regular expression matching, selection, etc. In particular, the proper functioning of the Web (and of much other software) depends to a large extent on string matching. Examples of string matching abound: parsing element and attribute names in Web documents, matching CSS selectors to the nodes in a document, matching font names in a style sheet to the names known to the operating system, matching URI pieces to the resources in a server, matching strings embedded in an ECMAScript program to strings typed in by a Web form user, matching the parts of an XPath expression (element names, attribute names and values, content, etc.) to what is found in an XML instance, etc.

String matching is usually taken for granted and performed by comparing two strings byte for byte, but the existence on the Web of multiple character representations means that it is actually non-trivial. Binary comparison does not work if the strings are not in the same character encoding (e.g. an EBCDIC style sheet being directly applied to an ASCII document, or a font specification in a Shift_JIS style sheet directly used on a system that maintains font names in UTF-16) or if they are in the same character encoding but show variations allowed for the 'same' string by the use of combining characters or by the constructs of Web languages.

Incorrect string matching can have far reaching consequences, including the creation of security holes. Consider a contract, encoded in XML, for buying goods: each item sold is described in a Stück element; unfortunately, "Stück" is subject to different representations in the character encoding of the contract. Suppose that the contract is viewed and signed by means of a user agent that looks for Stück elements, extracts them (matching on the element name), presents them to the user and adds up their prices. If different instances of the Stück element happen to be represented differently in a particular contract, then the buyer and seller may see (and sign) different contracts if their respective user agents perform string identity matching differently, which is fairly likely in the absence of a well-defined specification for string matching. The absence of a well-defined specification would also mean that there would be no way to resolve the ensuing contractual dispute.

Solving the string matching problem involves normalization, which in a nutshell means bringing the two strings to be compared to a common, canonical encoding prior to performing binary matching. (For additional steps involved in string matching see 4 String Identity Matching.)
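The idea of bringing both strings to a common canonical encoding before binary matching can be sketched as follows. This is only an illustration: the function name `same_string` is hypothetical, NFC is assumed as the canonical form, and the additional steps of string identity matching (such as escape expansion) are left out.

```python
import unicodedata


def same_string(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two encoded strings after transcoding and NFC normalization."""
    ua = unicodedata.normalize("NFC", a.decode(a_encoding))
    ub = unicodedata.normalize("NFC", b.decode(b_encoding))
    return ua == ub


# "Stück" with a precomposed u-umlaut, stored in ISO 8859-1.
latin1_name = "St\u00fcck".encode("iso-8859-1")
# "Stück" as 'u' followed by U+0308 COMBINING DIAERESIS, stored in UTF-8.
utf8_name = "Stu\u0308ck".encode("utf-8")

binary_match = latin1_name == utf8_name
normalized_match = same_string(latin1_name, "iso-8859-1", utf8_name, "utf-8")
```

The byte-for-byte comparison fails even though both byte sequences spell the same element name, while the comparison after transcoding and normalization succeeds.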

There are options in the exact way normalization can be used to achieve correct behavior of normalization-sensitive operations such as string matching. These options lie along two axes: i) when normalization is performed, and ii) what canonical encoding is used. The next subsections discuss these axes.

3.1.2 Early or late normalization

The first axis is a choice of when normalization occurs: early (when strings are created) or late (when strings are compared). The former amounts to establishing a canonical encoding for all data that is transmitted or stored, so that it doesn't need any normalization later, before being used. The latter is the equivalent of mandating 'smart' compare functions, which will take care of any encoding differences.

There are several advantages to early normalization, as follows:

  • Almost all legacy data as well as data created by current software is normalized (if using NFC).

  • The number of Web components that generate or transform text is considerably smaller than the number of components that receive text and need to perform matching or other processes requiring normalized text.

  • Current receiving components (browsers, XML parsers, etc.) implicitly assume early normalization by not performing or verifying normalization themselves. This is a vast legacy.

  • Web components that generate and process text are in a much better position to do normalization than other components; in particular, they may be aware that they deal with a restricted repertoire only, which simplifies the process of normalization.

  • Not all components of the Web that implement functions such as string matching can reasonably be expected to do normalization. This, in particular, applies to very small components and components in the lower layers of the architecture.

  • Forward-compatibility issues can be dealt with more easily: less software needs to be updated, namely only the software that generates newly introduced characters.

  • It is a prerequisite for comparison of encrypted strings (see [CharReq], section 2.7).

Early normalization also has downsides: everyone must play by the same rules, and things break down when a producer of text data doesn't play by the rules. Furthermore, the location of the error (typically at a recipient that assumes proper normalization) is remote from the source (the faulty producer).

When recipients cannot count on early normalization, then some form of late normalization is the only way to ensure proper results of string comparison and other normalization-sensitive operations.

3.1.3 The choice of Normalization Form

The second axis is a choice of canonical encoding. This choice needs to be made only if early normalization is chosen. With late normalization, the canonical encoding would be an internal matter of the smart compare function, which doesn't need any wide agreement or standardization.

Choosing a single canonical encoding ensures that normalization is uniform throughout the Web. Hence the two axes lead us to the name 'early uniform normalization'.

The Unicode Consortium provides four standard normalization forms (see Unicode Normalization Forms [UTR #15]). These forms differ in 1) whether they normalize towards decomposed characters (NFD, NFKD) or precomposed characters (NFC, NFKC) and 2) whether the normalization process erases compatibility distinctions (NFKD, NFKC) or not (NFD, NFC).

For use on the Web, it is important not to lose the so-called compatibility distinctions, which may be important (see [UXML] Chapter 4 for a discussion). The NFKD and NFKC normalization forms are therefore excluded. Among the remaining two forms, NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding) as well as data created by current software is already in this form; NFC also has a slight compactness advantage and a better match to user expectations with respect to the character vs. grapheme issue. This document therefore chooses NFC as the base for Web-related early normalization.
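The difference between the four forms can be seen on a short sample. The sketch below uses Python's `unicodedata.normalize`; the sample string is an assumption chosen to contain one compatibility character (U+FB01 LATIN SMALL LIGATURE FI) and one precomposed character (U+00E9).

```python
import unicodedata

# U+FB01 (fi ligature) + "anc" + U+00E9 (e with acute)
sample = "\ufb01anc\u00e9"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

# NFC and NFD keep the ligature (a compatibility distinction);
# NFKC and NFKD replace it with the two letters 'f' and 'i'.
# NFD and NFKD split the accented letter into 'e' + U+0301;
# NFC and NFKC keep (or restore) the precomposed U+00E9.
lengths = {name: len(value) for name, value in forms.items()}
```

In this sample NFC leaves the text unchanged, which matches the observation that trivially transcoded legacy data is usually already in NFC, while the two K forms lose the ligature.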

NOTE: Roughly speaking, NFC is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character. Text in a Unicode encoding form is said to be in NFC if it doesn't contain any combining sequence that could be replaced and if any remaining combining sequence is in canonical order.

For a list of programming resources related to normalization, see C Resources for Normalization.

3.2 Definitions for W3C Text Normalization

For use on the Web, this document defines Web-related text normalization forms by starting with Unicode Normalization Form C (NFC), and additionally addressing the issues of legacy encodings, character escapes, includes, and character and markup boundaries. Examples illustrating some of these definitions can be found in 3.3 Examples.

3.2.1 Normalizing Transcoder

A normalizing transcoder is a transcoder that converts from a legacy encoding to a Unicode encoding form and ensures that the result is in Unicode Normalization Form C (see 3.2.2 Unicode-normalized text). For most legacy encodings, it is possible to construct a normalizing transcoder (by using any transcoder followed by a normalizer); it is not possible to do so if the encoding's repertoire contains characters not represented in Unicode.
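The construction "any transcoder followed by a normalizer" can be sketched in a few lines. This is a minimal illustration, assuming Python's built-in codecs as the transcoder; the function name `normalizing_transcode` is hypothetical. Windows-1258 is used as the example legacy encoding because its repertoire includes combining marks, so a plain transcoder can emit text that is not in NFC.

```python
import unicodedata


def normalizing_transcode(data: bytes, legacy_encoding: str) -> str:
    """Transcode legacy bytes to Unicode, then normalize the result to NFC."""
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))


# 'a' followed by a combining acute accent, as stored in Windows-1258.
legacy_bytes = "a\u0301".encode("cp1258")

plain = legacy_bytes.decode("cp1258")                      # transcoder only
normalized = normalizing_transcode(legacy_bytes, "cp1258") # transcoder + normalizer
```

The plain transcoder yields the two-character sequence U+0061 U+0301, whereas the normalizing transcoder yields the single precomposed character U+00E1. The result is a Python string of code points; serializing it in a particular Unicode encoding form (for example UTF-8) is left to the caller.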

3.2.2 Unicode-normalized text

Text is, for the purposes of this specification, Unicode-normalized if it is in a Unicode encoding form and is in Unicode Normalization Form C, according to a version of Unicode Standard Annex #15: Unicode Normalization Forms [UTR #15] at least as recent as the oldest version of the Unicode Standard that contains all the characters actually present in the text, but no earlier than version 3.2 [Unicode 3.2].

3.2.3 Include-normalized text

Markup languages, style languages and programming languages often offer facilities for including a piece of text inside another. An include is an instance of a syntactic device specified in a language to include text at the position of the include, replacing the include itself. Examples of includes are entity references in XML, @import rules in CSS and the #include preprocessor statement in C/C++. Character escapes are a special case of includes where the included entity is predetermined by the language.

Text is include-normalized if:

  1. the text is Unicode-normalized and does not contain any character escapes or includes whose expansion would cause the text to become no longer Unicode-normalized; or

  2. the text is in a legacy encoding and, if it were transcoded to a Unicode encoding form by a normalizing transcoder, the resulting text would satisfy clause 1 above.

NOTE: A consequence of this definition is that legacy text (i.e. text in a legacy encoding) is always include-normalized unless i) a normalizing transcoder cannot exist for that encoding (e.g. because the repertoire contains characters not in Unicode) or ii) the text contains character escapes or includes which, once expanded, result in un-normalized text.

NOTE: The specification of include-normalization relies on the syntax for character escapes and includes defined by the (computer) language in use. For plain text (no character escapes or includes) in a Unicode encoding form, include-normalization and Unicode-normalization are equivalent.

3.2.4 Fully-normalized text

Formal languages define constructs, which are identifiable pieces, occurring in instances of the language, such as comments, identifiers, element tags, processing instructions, runs of character data, etc. During the normal processing of include-normalized text, these various constructs may be moved, removed (e.g. removing comments) or merged (e.g. merging all the character data within an element as done by the string() function of XPath), creating opportunities for text to become denormalized. The software performing those operations, or other software down the line that needs to perform normalization-sensitive operations, then has to re-normalize the result, which is a burden. One way to avoid such denormalization is to make sure that the various important constructs never begin with a character such that appending that character to a normalized string can cause the string to become denormalized. A composing character is a character that is one or both of the following:

  1. the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15], or

  2. of non-zero canonical combining class as defined in [Unicode].

Please consult Appendix B Composing Characters for a discussion of composing characters, which are not exactly the same as Unicode combining characters.
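The two clauses of the composing-character definition can be approximated with Python's `unicodedata` module. The sketch below rests on several assumptions that should be read as such: the names `is_composing` and `_second_characters` are hypothetical; a character is treated as composition-excluded when it does not survive NFC normalization, which stands in for the Composition Exclusion Table; Hangul vowel and trailing jamo are added by hand because Hangul syllables decompose algorithmically and have no entry in `unicodedata.decomposition`; and the result depends on the Unicode version shipped with the Python runtime.

```python
import sys
import unicodedata


def _second_characters() -> frozenset:
    """Collect characters that (approximately) satisfy clause 1."""
    found = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        if not mapping or mapping.startswith("<"):
            continue  # no decomposition, or a compatibility mapping
        parts = mapping.split()
        if len(parts) != 2:
            continue  # singleton decomposition: no second character
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # assumption: treated as composition-excluded
        found.add(chr(int(parts[1], 16)))
    # Hangul vowel jamo (U+1161..U+1175) and trailing jamo (U+11A8..U+11C2).
    found.update(chr(cp) for cp in range(0x1161, 0x1176))
    found.update(chr(cp) for cp in range(0x11A8, 0x11C3))
    return frozenset(found)


_SECOND = _second_characters()


def is_composing(ch: str) -> bool:
    """Clause 2 (non-zero combining class) or clause 1 (second character)."""
    return unicodedata.combining(ch) != 0 or ch in _SECOND
```

For example, U+0327 COMBINING CEDILLA qualifies under clause 2, while U+09BE BENGALI VOWEL SIGN AA has combining class zero and qualifies only under clause 1, which shows why composing characters are not the same set as combining characters.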

Text is fully-normalized if:

  1. the text is in a Unicode encoding form, is include-normalized and none of the constructs comprising the text begin with a composing character or a character escape representing a composing character; or

  2. the text is in a legacy encoding and, if it were transcoded to a Unicode encoding form by a normalizing transcoder, the resulting text would satisfy clause 1 above.

NOTE: Full-normalization is specified against the context of a (computer) language (or the absence thereof), which specifies the form of character escapes and includes and the separation into constructs. For plain text (no includes, no constructs, no character escapes) in a Unicode encoding form, full-normalization and Unicode-normalization are equivalent.

Identification of the constructs that should be prohibited from beginning with a composing character (the relevant constructs) is language-dependent. As specified in 3.4 Responsibility for Normalization, it is the responsibility of the specification for a language to specify exactly what constitutes a relevant construct. This may be done by specifying important boundaries, taking into account which operations would benefit the most from being protected against denormalization. The relevant constructs are then defined as the spans of text between the boundaries. At a minimum, for those languages which have these notions, the important boundaries are entity (include) boundaries as well as the boundaries between most markup and character data. Many languages will benefit from defining more boundaries and therefore finer-grained full-normalization constructs.

NOTE: In general, it will be advisable not to include character escapes designed to express arbitrary characters among the relevant constructs; the reason is that including them would prevent the expression of combining sequences using character escapes (e.g. 'q&#x30C;' for q-caron), which is especially important in legacy encodings that lack the desired combining marks.

NOTE: Full-normalization is closed under concatenation: the concatenation of two fully-normalized strings is also fully-normalized. As a result, a side benefit of including entity boundaries in the set of boundaries important for full-normalization is that the state of normalization of a document that includes entities can be assessed without expanding the includes, if the included entities are known to be fully-normalized. If all the entities are known to be include-normalized and not to start with a composing character, then it can be concluded that including the entities would not denormalize the document.
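The value of the closure property is easiest to see by contrast with plain NFC, which is not closed under concatenation. The following Python sketch (helper name `is_nfc` is hypothetical) joins two strings that are each in NFC; the result is in NFC only when the second string does not begin with a composing character.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


head = "suc"
risky_tail = "\u0327on"   # begins with U+0327 COMBINING CEDILLA
safe_tail = "on"          # begins with an ordinary letter

parts_are_nfc = is_nfc(head) and is_nfc(risky_tail) and is_nfc(safe_tail)
risky_join_is_nfc = is_nfc(head + risky_tail)   # 'c' + cedilla should be U+00E7
safe_join_is_nfc = is_nfc(head + safe_tail)
```

Each part passes the NFC check on its own, yet the first concatenation does not, because 'c' followed by the cedilla has a precomposed equivalent. Forbidding a leading composing character in `risky_tail` is exactly what full-normalization adds.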

3.2.5 Normalization-sensitive operations

An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized. These operations are any that involve comparison of characters or character counting, as well as some other operations such as ‘delete first character’ or ‘delete last character’.

EXAMPLE: Consider the string normalisé, where the 'é' may be a single character (in NFC) or two. The following are three examples of normalization-sensitive operations involving this string. Counting the number of characters may yield either 9 or 10, depending on the state of normalization. Deleting the last character may yield either normalis or normalise (no accent). Binary-comparing normalisé to normalisé matches if both are in the same state of normalization, but doesn't match otherwise.
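The three operations in this example can be reproduced directly. The sketch below uses Python strings, where length and slicing operate on code points; the variable names are illustrative only.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "normalis\u00e9")   # 'é' as U+00E9
decomposed = unicodedata.normalize("NFD", composed)         # 'e' + U+0301

# 1. Counting characters
count_composed = len(composed)
count_decomposed = len(decomposed)

# 2. Deleting the last character
trimmed_composed = composed[:-1]
trimmed_decomposed = decomposed[:-1]

# 3. Binary comparison
same_state_match = composed == unicodedata.normalize("NFC", decomposed)
mixed_state_match = composed == decomposed
```

The counts are 9 and 10, the trimmed strings are "normalis" and "normalise", and the comparison succeeds only when both operands are in the same state of normalization.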

EXAMPLE: Examples of operations that are not normalization-sensitive are normalization, and the copying or deletion of an entire document.

3.2.6 Text-processing component

A text-processing component is a component that recognizes data as text. This specification does not specify the boundaries of a text-processing component, which may be as small as one line of code or as large as a complete application. A text-processing component may receive text, produce text, or both.

3.2.7 Certified and suspect text

In the following definitions, the word 'normalized' may stand for either 'include-normalized' or 'fully-normalized', depending on which is most appropriate for the specification or implementation under consideration.

Certified text is text which satisfies at least one of the following conditions:

  1. it has been confirmed through inspection that the text is in normalized form

  2. the source of the text (a text-processing component) is known to produce only normalized text.

Suspect text is text which is not certified.

NOTE: To normalize text, it is in general sufficient to store the last seen character, but in certain cases (a sequence of combining marks) a buffer of theoretically unlimited length is necessary. However, for normalization checking no such buffer is necessary, only a few variables. C Resources for Normalization points to some compact code that shows how to check normalization without an expanding buffer.
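The distinction between certified and suspect text can be sketched as a simple predicate. This is an illustration under stated assumptions, not the compact checking code referred to in the note: the function name `is_certified` and the `producer_is_trusted` flag are hypothetical, 'normalized' is taken to mean Unicode-normalized plain text only (no escapes, includes, or constructs), and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def is_certified(text: str, producer_is_trusted: bool = False) -> bool:
    """Return True if the text counts as certified.

    Condition 2: the producer is known to emit only normalized text.
    Condition 1: inspection confirms the text is in normalized form.
    """
    if producer_is_trusted:
        return True
    return unicodedata.is_normalized("NFC", text)


inspected_ok = is_certified("su\u00e7on")
inspected_suspect = is_certified("suc\u0327on")
trusted_source = is_certified("suc\u0327on", producer_is_trusted=True)
```

Note that the third result reflects trust in the producer rather than the content of the text, which is precisely the situation in which a faulty producer goes undetected.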

3.3 Examples

In some of the following examples, '¸' is used to depict the character U+0327 COMBINING CEDILLA, for the purposes of illustration. Had a real U+0327 been used instead of this spacing (non-combining) variant, some browsers might combine it with a preceding 'c', resulting in a display indistinguishable from a U+00E7 'ç' and a loss of understandability of the examples. In addition, if the sequence c + combining cedilla were present, this document would not be include-normalized.

It is also assumed that the example strings are relevant constructs for the purposes of full-normalization.

3.3.1 General examples

The stringsuçon (U+0073 U+0075 U+00E7 U+006F U+006E) encoded in aUnicode encoding form, is Unicode-normalized, include-normalized and fully-normalized. The samestring encoded in alegacy encoding forwhich there exists a normalizing transcoder would be both include-normalizedand fully-normalized but not Unicode-normalized (since not in a Unicodeencoding form).

In an XML or HTML context, the stringsuçon is also include-normalized, fully-normalized and, if encoded in aUnicode encoding form, Unicode-normalized. Expanding ç yieldssuçon as above, which contains no replaceable combining sequence.

The stringsuc¸on (U+0073 U+0075 U+0063U+0327 U+006F U+006E), where U+0327is theCOMBINING CEDILLA, encoded in a Unicode encoding form, isnot Unicode-normalized (since the combining sequence '' (U+0063U+0327) should appear instead as the precomposed 'ç' (U+00E7)). As aconsequence this string is neither include-normalized (since in a Unicodeencoding form but not Unicode-normalized) nor fully-normalized (since notinclude-normalized). Note however that the stringsub¸on (U+0073 U+0075U+0062 U+0327 U+006F U+006E) in a Unicodeencoding formis Unicode-normalized since there is no precomposed formof 'b' plus cedilla. It is also include-normalized andfully-normalized.

In plain text the string suc&#x327;on is Unicode-normalized, since plain text doesn't recognize that &#x327; represents a character in XML or HTML and considers it just a sequence of non-replaceable characters.

In an XML or HTML context, however, expanding &#x327; yields the string suc¸on (U+0073 U+0075 U+0063 U+0327 U+006F U+006E), which is not Unicode-normalized ('c¸' is replaceable by 'ç'). As a consequence the string is neither include-normalized nor fully-normalized. As another example, if the entity reference &word-end; refers to an entity containing ¸on (U+0327 U+006F U+006E), then the string suc&word-end; is not include-normalized for the same reasons.

In an XML or HTML context, expanding &#x327; in the string sub&#x327;on yields the string sub¸on, which is Unicode-normalized since there is no precomposed character for 'b cedilla' in NFC. This string is therefore also include-normalized. Similarly, the string sub&word-end; (with &word-end; as above) is include-normalized, for the same reasons.

In an XML or HTML context, the strings ¸on (U+0327 U+006F U+006E) and &#x327;on are not fully-normalized, as they begin with a composing character (after expansion of the character escape for the second). However, both are Unicode-normalized (if expressed in a Unicode encoding form) and include-normalized.

The following table consolidates the above examples. Normalized forms are indicated using 'Y'; a hyphen means 'not normalized'.

String        Encoding  Context     Unicode-normalized  Include-normalized  Fully-normalized
suçon         Unicode   Plain text  Y                   Y                   Y
                        XML/HTML    Y                   Y                   Y
              Legacy    Plain text  -                   Y                   Y
                        XML/HTML    -                   Y                   Y
su&#xE7;on    Unicode   Plain text  Y                   Y                   Y
                        XML/HTML    Y                   Y                   Y
              Legacy    Plain text  -                   Y                   Y
                        XML/HTML    -                   Y                   Y
suc¸on        Unicode   Plain text  -                   -                   -
                        XML/HTML    -                   -                   -
suc&#x327;on  Unicode   Plain text  Y                   Y                   Y
                        XML/HTML    Y                   -                   -
              Legacy    Plain text  -                   Y                   Y
                        XML/HTML    -                   -                   -
¸on           Unicode   Plain text  Y                   Y                   -
                        XML/HTML    Y                   Y                   -
&#x327;on     Unicode   Plain text  Y                   Y                   Y
                        XML/HTML    Y                   Y                   -
              Legacy    Plain text  -                   Y                   Y
                        XML/HTML    -                   Y                   -
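
The Unicode-normalized column for the Unicode rows can be reproduced with a short check. The following is a minimal sketch in Python, assuming that "Unicode-normalized" can be approximated by testing a string against NFC with the standard unicodedata module, and that html.unescape is an acceptable stand-in for expanding character escapes in an XML or HTML context. The function names are illustrative and not part of this specification. The sketch does not cover entity includes or construct boundaries, so it does not test include-normalization or full-normalization completely.

```python
import html
import unicodedata


def is_unicode_normalized(text: str) -> bool:
    """Return True if the string is unchanged by NFC (assumed test)."""
    return unicodedata.normalize("NFC", text) == text


def is_normalized_after_expansion(text: str) -> bool:
    """Expand character escapes first, then test against NFC.

    html.unescape is only a stand-in for the escape expansion that an
    XML or HTML processor would perform.
    """
    return is_unicode_normalized(html.unescape(text))


precomposed = "su\u00E7on"      # s u ç o n
decomposed = "suc\u0327on"      # c followed by a real COMBINING CEDILLA
no_precomposed = "sub\u0327on"  # no precomposed 'b with cedilla' exists
escaped = "suc&#x327;on"        # the escape is plain characters until expanded

print(is_unicode_normalized(precomposed))      # True
print(is_unicode_normalized(decomposed))       # False
print(is_unicode_normalized(no_precomposed))   # True
print(is_unicode_normalized(escaped))          # True (plain text context)
print(is_normalized_after_expansion(escaped))  # False (XML/HTML context)
```

The last two lines correspond to the two Unicode rows for suc&#x327;on in the table: the same characters are normalized as plain text but not once the escape is expanded.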

3.3.2 Examples of XML in a Unicode encoding form

Here is another summary table, with more examples but limited to XML in a Unicode encoding form. The following list describes what the entities contain and special character usage. Normalized forms are indicated using 'Y'. There is no precomposed 'b with cedilla' in NFC.

  • "&ccedil;" LATIN SMALL LETTER C WITH CEDILLA

  • "&cedilla;" CEDILLA (combining)

  • "&c;" LATIN SMALL LETTER C

  • "&b;" LATIN SMALL LETTER B

  • "¸" CEDILLA (combining)

  • "/" (immediately before 'on' in last example) COMBINING LONG SOLIDUS OVERLAY

String                  Unicode-normalized  Include-normalized  Fully-normalized
suçon                   Y                   Y                   Y
sub¸on                  Y                   Y                   Y
su&#xE7;on              Y                   Y                   Y
sub&#x327;on            Y                   Y                   Y
sub¸on                  Y                   Y                   Y
su&ccedil;on            Y                   Y                   Y
su<![CDATA[çon]]>       Y                   Y                   Y
su&b;¸on                Y                   Y                   -
sub&cedilla;on          Y                   Y                   -
suc<!--comment-->¸on    Y                   Y                   -
sub<!--comment-->¸on    Y                   Y                   -
suc<em>¸</em>on         Y                   Y                   -
sub<em>¸</em>on         Y                   Y                   -
suc<?proc-instr?>¸on    Y                   Y                   -
sub<?proc-instr?>¸on    Y                   Y                   -
sub<![CDATA[¸on]]>      Y                   Y                   -
su&c;¸on                Y                   -                   -
suc&#x327;on            Y                   -                   -
su&#x63;¸on             Y                   -                   -
suc&cedilla;on          Y                   -                   -
suc<![CDATA[¸on]]>      Y                   -                   -
suc¸on                  -                   -                   -
suç<em>/on</em>         -                   -                   -

NOTE: From the last example in the table above, it follows that it is impossible to produce a normalized XML or HTML document containing the character U+0338 COMBINING LONG SOLIDUS OVERLAY immediately following an element tag, comment, CDATA section or processing instruction, since the U+0338 '/' combines with the '>' (yielding U+226F NOT GREATER-THAN). It is noteworthy that U+0338 COMBINING LONG SOLIDUS OVERLAY also combines with '<', yielding U+226E NOT LESS-THAN. Consequently, U+0338 COMBINING LONG SOLIDUS OVERLAY should remain excluded from the initial character of XML identifiers.
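
The effect described in the note above can be observed directly. The following is a minimal sketch in Python, assuming NFC as implemented by the standard unicodedata module; the variable names are illustrative. It shows that normalizing markup in which U+0338 immediately follows a tag destroys the tag delimiter.

```python
import unicodedata

SOLIDUS_OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

# A real U+0338 placed immediately after the start tag.
markup = "<em>" + SOLIDUS_OVERLAY + "on</em>"
normalized = unicodedata.normalize("NFC", markup)

# The '>' that closed the start tag has been replaced by U+226F.
print(normalized.startswith("<em\u226F"))  # True
print(len(markup), len(normalized))        # 12 11

# The same happens with '<', which composes to U+226E.
print(unicodedata.normalize("NFC", "<" + SOLIDUS_OVERLAY) == "\u226E")  # True
```

After normalization the start tag no longer ends with '>', so the result is no longer the same markup.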

3.3.3 Examples of restrictions on the use of combining characters

Include-normalization and full-normalization create restrictions on the use of combining characters. The following examples discuss various such potential restrictions and how they can be addressed.

Full-normalization prevents the markup of an isolated combining mark, for example for styling it differently from its base character (Benoi<span style='color: blue'>^</span>t, where '^' represents a combining circumflex). However, the equivalent effect can be achieved by assigning a class to the accents in an SVG font or using equivalent technology. View an example using SVG (SVG-enabled browsers only).

Full-normalization prevents the use of entities for expressing composing characters. This limitation can be circumvented by using character escapes or by using entities representing complete combining character sequences. With appropriate entity definitions, instead of A&acute;, write &Aacute; (or better, use 'Á' directly).

3.4 Responsibility for Normalization

This section defines the W3C Text Normalization Model. This model aims to describe the steps and precautions that are necessary to ensure that text processing on the Web is not made incorrect by denormalization of the text (multiple possible representations of "the same text").

Unless otherwise specified, the word 'normalization' in this section may refer to 'include-normalization' or 'full-normalization', depending on which is most appropriate for the specification or implementation under consideration.

Given the definitions and considerations above, specifications, implementations and content have some responsibilities which are listed below. Specifications, implementations and content ought to follow as many of the responsibilities as possible and make sure that this is done in a way that is consistent overall.

  • C300 [C] Text content SHOULD be in fully-normalized form and if not SHOULD at least be in include-normalized form.

  • C301 [S] Specifications of text-based formats and protocols SHOULD, as part of their syntax definition, require that the text be in normalized form.

  • C302 [S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.

  • C303 [I] A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

    EXAMPLE: If the 'z' is deleted from the (normalized) string cz¸ (where '¸' represents a combining cedilla, U+0327), normalization is necessary to turn the denormalized result c¸ into the properly normalized ç. If the software that deletes the 'z' later uses the string in a normalization-sensitive operation, it needs to normalize the string before this operation to ensure correctness; otherwise, normalization may be deferred until the data is exposed. Analogous cases exist for insertion and concatenation (e.g. xf:concat(xf:substring('cz¸', 1, 1), xf:substring('cz¸', 3, 1)) in XQuery [XQuery Operators]).

    NOTE: Software that denormalizes a string such as in the deletion example above does not need to perform a potentially expensive re-normalization of the whole string to ensure that the string is normalized. It is sufficient to go back to the last non-composing character and re-normalize forward to the next non-composing character; if the string was normalized before the denormalizing operation, it will now be re-normalized.

  • C304 [S] Specifications of text-based languages and protocols SHOULD define precisely the construct boundaries necessary to obtain a complete definition of full-normalization. These definitions SHOULD include at least the boundaries between markup and character data as well as entity boundaries (if the language has any include mechanism), SHOULD include any other boundary that may create denormalization when instances of the language are processed, but SHOULD NOT include character escapes designed to express arbitrary characters.

  • C305 [C] Even when authoring in a (formal) language that does not mandate full-normalization, content developers SHOULD avoid composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup.

  • C306 [I] Authoring tool implementations for a (formal) language that does not mandate full-normalization SHOULD either prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.

  • C307 [I] Implementations which transcode text from a legacy encoding to a Unicode encoding form SHOULD use a normalizing transcoder.

    NOTE: Except when an encoding's repertoire contains characters not represented in Unicode, it is always possible to construct a normalizing transcoder by using any transcoder followed by a normalizer.

  • C308 [S] Where operations may produce unnormalized output from normalized text input, specifications of API components (functions/methods) that implement these operations MUST define whether normalization is the responsibility of the caller or the callee. Specifications MAY state that performing normalization is optional for some API components; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. Specifications SHOULD NOT make the implementation of normalization optional.

    EXAMPLE: The concatenation operation may either concatenate sequences of codepoints without normalization at the boundary, or may take normalization into account to avoid producing unnormalized output from normalized input. An API specification must define whether the operation normalizes at the boundary or leaves that responsibility to the application using the API.

  • C309 [S] Specifications that define a mechanism (for example an API or a defining language) for producing textual data objects SHOULD require that the final output of this mechanism be normalized.

    EXAMPLE: XSL Transformations [XSLT] and the DOM Load & Save specification [DOM3 LS] are examples of specifications that define text output and that should specify that this output be in normalized form.

    NOTE: As an optimization, it is perfectly acceptable for a system to define the producer to be the actual producer (e.g. a small device) together with a remote component (e.g. a server serving as a kind of proxy) to which normalization is delegated. In such a case, the communications channel between the device and proxy server is considered to be internal to the system, not part of the Web. Only data normalized by the proxy server is to be exposed to the Web at large, as shown in the illustration below:

    Illustration of a text producer defined as including a proxy.

    A similar case would be that of a Web repository receiving content from a user and noticing that the content is not properly normalized. If the user so requests, it would certainly be proper for the repository to normalize the content on behalf of the user, the repository becoming effectively part of the producer for the duration of that operation.

C310 [S] [I] Specifications and implementations MUST document any deviation from the above requirements.

C311 [S] Specifications MUST document any known security issues related to normalization.
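
The deletion example under C303, the concatenation example under C308 and the normalizing transcoder of C307 can be sketched in a few lines. The following is a minimal illustration in Python, assuming NFC via the standard unicodedata module. The function names are hypothetical and chosen only for this sketch. For simplicity it re-normalizes the whole string after each modification; as the note under C303 points out, re-normalizing only the affected stretch between non-composing characters would be sufficient.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC (assumed normalization form for this sketch)."""
    return unicodedata.normalize("NFC", text)


def delete_at(text: str, index: int) -> str:
    """Delete one code point and behave as if normalization followed."""
    return nfc(text[:index] + text[index + 1:])


def concat_normalized(left: str, right: str) -> str:
    """Concatenate, taking normalization at the boundary into account."""
    return nfc(left + right)


def normalizing_transcode(raw: bytes, legacy_encoding: str) -> str:
    """Any transcoder followed by a normalizer."""
    return nfc(raw.decode(legacy_encoding))


original = "cz\u0327"                # c, z, COMBINING CEDILLA
print(nfc(original) == original)     # True: the input is normalized
print(delete_at(original, 1) == "\u00E7")  # True: result is precomposed ç

# Plain code point concatenation leaves 'c' + U+0327 at the boundary.
print(len("suc" + "\u0327on"))                                # 6
print(concat_normalized("suc", "\u0327on") == "su\u00E7on")  # True

# ISO-8859-1 is used here only as an example of a legacy encoding.
print(normalizing_transcode(b"su\xe7on", "latin-1") == "su\u00E7on")  # True
```

Whether an API exposes something like concat_normalized or leaves the boundary to the caller is exactly the choice that C308 asks specifications to state.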

4 String Identity Matching

One important operation that depends on early normalization is string identity matching [CharReq], which is a subset of the more general problem of string matching. There are various degrees of specificity for string matching, from approximate matching such as regular expressions or phonetic matching, to more specific matches such as case-insensitive or accent-insensitive matching and finally to identity matching. In the Web environment, where multiple character encodings are used to represent strings, including some character encodings which allow multiple representations for the same thing, identity is defined to occur if and only if the compared strings contain no user-identifiable distinctions. This definition is such that strings do not match when they differ in case or accentuation, but do match when they differ only in non-semantically significant ways such as character encoding, use of character escapes (of potentially different kinds), or use of precomposed vs. decomposed character sequences.

To avoid unnecessary conversions and, more importantly, to ensure predictability and correctness, it is necessary for all components of the Web to use the same identity testing mechanism. Conformance to the rule that follows meets this requirement and supports the above definition of identity.

C312 [S] [I] String identity matching MUST be performed as if the following steps were followed:

  1. Early uniform normalization to fully-normalized form, as defined in 3.2.4 Fully-normalized text. In accordance with section 3 Normalization, this step MUST be performed by the producers of the strings to be compared.

  2. Conversion to a common Unicode encoding form, if necessary.

  3. Expansion of all recognized character escapes and includes.

  4. Testing for bit-by-bit identity.

Step 1 ensures 1) that the identity matching process can produce correct results using the next three steps and 2) that a minimum of effort is spent on solving the problem.

NOTE: The expansion of character escapes and includes (step 3 above) is dependent on context, i.e. on which markup or programming language is considered to apply when the string matching operation is performed. Consider a search for the string 'suçon' in an XML document containing su&#xE7;on but not suçon. If the search is performed in a plain text editor, the context is plain text (no markup or programming language applies), the &#xE7; character escape is not recognized, hence not expanded and the search fails. If the search is performed in an XML browser, the context is XML, the character escape (defined by XML) is expanded and the search succeeds.

An intermediate case would be an XML editor that purposefully provides a view of an XML document with entity references left unexpanded. In that case, a search over that pseudo-XML view will deliberately not expand entities: in that particular context, entity references are not considered includes and need not be expanded.
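
Steps 2 to 4 of C312 and the context dependence described in the note can be sketched as follows. This is a minimal illustration in Python under stated assumptions: step 1 is taken to have been performed by the producers, html.unescape stands in for the escape expansion an XML processor would perform, and comparing Python strings stands in for bit-by-bit comparison in a common encoding form. The function name and the "plain"/"xml" context labels are hypothetical.

```python
import html


def prepare(data: bytes, encoding: str, context: str) -> str:
    """Apply steps 2 and 3 to one string.

    context is "plain" or "xml"; escapes are expanded only for "xml".
    """
    text = data.decode(encoding)    # step 2: common Unicode form
    if context == "xml":
        text = html.unescape(text)  # step 3: expand recognized escapes
    return text


# The document and the search string arrive in different encodings.
document = "<p>su&#xE7;on</p>".encode("utf-16")
needle = prepare("su\u00E7on".encode("utf-8"), "utf-8", "plain")

plain_view = prepare(document, "utf-16", "plain")
xml_view = prepare(document, "utf-16", "xml")

# Step 4: identity comparison, here as a substring search.
print(needle in plain_view)  # False: the escape is not recognized
print(needle in xml_view)    # True: the escape is expanded
```

The same bytes therefore match or fail to match depending only on which context is declared to apply, which is the point the note makes about plain text editors and XML browsers.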

C313 [S] [I] Forms of string matching other than identity matching SHOULD be performed as if the following steps were followed:

  1. Steps 1 to 3 for string identity matching.

  2. Matching the strings in a way that is appropriate to the application.

Appropriate methods of matching text outside of string identity matching can include such things as case-insensitive matching, accent-insensitive matching, matching characters against Unicode compatibility forms, expansion of abbreviations, matching of stemmed words, phonetic matching, etc.

EXAMPLE: A user who specifies a search for the string suçon against a Unicode encoded XML document would expect to find string identity matches against the strings su&#xE7;on, su&#231;on and su&ccedil;on (where the entity &ccedil; represents the precomposed character 'ç'). Identity matches should also be found whether the string was encoded as 73 75 C3 A7 6F 6E (in UTF-8) or 0073 0075 00E7 006F 006E (in UTF-16), or any other character encoding that can be transcoded into normalized Unicode characters.

It should never be the case that a match would be attempted against strings such as suc&#x327;on or suc¸on since these are not fully-normalized and should cause the text to be rejected. If, however, matching is done against such strings they should also match since they are canonically equivalent.

Forms of matching other than identity, if supported by the application, would have to be used to produce a match against the following strings: SUÇON (case-insensitive matching), sucon (accent-insensitive matching), suçons (matched stems), suçant (phonetic matching), etc.
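
The example above can be checked mechanically for the escape, encoding, case-insensitive and accent-insensitive cases. The following is a minimal sketch in Python, assuming html.unescape as a stand-in for escape and entity expansion (it recognizes the HTML entity &ccedil;), str.casefold for case-insensitive matching, and removal of characters with a non-zero combining class after NFD for accent-insensitive matching. The function names are illustrative. Stemming and phonetic matching are language-dependent and are not sketched here.

```python
import html
import unicodedata

target = "su\u00E7on"  # suçon with precomposed ç


def fold_case(text: str) -> str:
    """Key for case-insensitive comparison (assumed approach)."""
    return unicodedata.normalize("NFC", text).casefold()


def strip_accents(text: str) -> str:
    """Key for accent-insensitive comparison: decompose, drop marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Identity matches across different kinds of escapes.
for escaped in ("su&#xE7;on", "su&#231;on", "su&ccedil;on"):
    print(html.unescape(escaped) == target)  # True each time

# Identity matches across encoding forms (byte values from the example).
print(bytes.fromhex("73 75 C3 A7 6F 6E").decode("utf-8") == target)
print(bytes.fromhex("0073 0075 00E7 006F 006E").decode("utf-16-be") == target)

# Canonically equivalent but not fully-normalized input also matches
# once expanded and normalized.
expanded = html.unescape("suc&#x327;on")
print(unicodedata.normalize("NFC", expanded) == target)  # True

# Matching other than identity.
print(fold_case("SU\u00C7ON") == fold_case(target))  # True
print(strip_accents(target) == "sucon")              # True
```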

A References

A.1 Normative References

CharMod
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin. Character Model for the World Wide Web 1.0: Fundamentals. W3C Recommendation 15 February 2005. Available at http://www.w3.org/TR/2005/REC-charmod-20050215/. The latest version of CharMod is available at http://www.w3.org/TR/charmod/.
ISO/IEC 10646
ISO/IEC 10646-1:2000, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane and ISO/IEC 10646-2:2001, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2: Supplementary Planes, as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. The latest version of UCS Part 1 and Part 2 is available at http://www.iso.ch.
ISO/IEC 646
ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange. This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646. An electronic copy of ECMA is available at http://www.ecma-international.org/publications/standards/Ecma-006.htm.
RFC 2119
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. IETF RFC 2119, March 1997. Available at http://www.ietf.org/rfc/rfc2119.txt.
RFC 2396
T. Berners-Lee, R. Fielding, L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax. IETF RFC 2396, August 1998. Available at http://www.ietf.org/rfc/rfc2396.txt.
Unicode
The Unicode Consortium. The Unicode Standard, Version 4. ISBN 0-321-18578-1, as updated from time to time by the publication of new versions. The latest version of Unicode and additional information on versions of the standard and of the Unicode Character Database is available at http://www.unicode.org/unicode/standard/versions/.
Unicode 3.2
The Unicode Consortium. The Unicode Standard, Version 3.2.0 is defined by The Unicode Standard, Version 3.0, ISBN 0-201-61633-5, as amended by the Unicode Standard Annex #27: Unicode 3.1 (see http://www.unicode.org/reports/tr27/) and by the Unicode Standard Annex #28: Unicode 3.2 (see http://www.unicode.org/reports/tr28).
UTR #15
Mark Davis, Martin Dürst. Unicode Normalization Forms. Unicode Standard Annex #15, March 2005. Available at http://www.unicode.org/reports/tr15/tr15-25.html. The latest version of Unicode Normalization Forms is available at http://www.unicode.org/reports/tr15/.

A.2 Other References

CharIRI
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin. Character Model for the World Wide Web 1.0: Resource Identifiers. W3C Candidate Recommendation 22 November 2004. Available at http://www.w3.org/TR/2004/CR-charmod-resid-20041122/. The latest version of CharIRI is available at http://www.w3.org/TR/charmod-resid/.
CharReq
Martin J. Dürst. Requirements for String Identity Matching and String Indexing. W3C Working Draft 10 July 1998. Available at http://www.w3.org/TR/1998/WD-charreq-19980710. The latest version of CharReq is available at http://www.w3.org/TR/WD-charreq.
CSS2
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds. Cascading Style Sheets, level 2. W3C Recommendation 12 May 1998. Available at http://www.w3.org/TR/1998/REC-CSS2-19980512/. The latest version of CSS2 is available at http://www.w3.org/TR/REC-CSS2/.
DOM3 LS
Johnny Stenback, Andy Heninger, Eds. Document Object Model (DOM) Level 3 Load and Save Specification. W3C Recommendation 7 April 2004. Available at http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/. The latest version of DOM3 LS is available at http://www.w3.org/TR/DOM-Level-3-LS/.
HTML 4.0
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds. HTML 4.0 Specification. W3C Recommendation 18 December 1997. Available at http://www.w3.org/TR/REC-html40-971218/. The latest version of HTML 4.0 is available at http://www.w3.org/TR/REC-html40/.
ISO/IEC 14651
ISO/IEC 14651:2000. Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. The latest version of ISO/IEC 14651 is available at http://www.iso.ch/.
Nicol
Gavin Nicol. The Multilingual World Wide Web, Chapter 2: The WWW As A Multilingual Application. Available at http://www.mind-to-mind.com/library/papers/multilingual/multilingual-www.html.
RFC 2070
François Yergeau, Gavin Nicol, G. Adams, Martin Dürst. Internationalization of the Hypertext Markup Language. IETF RFC 2070, January 1997. Available at http://www.ietf.org/rfc/rfc2070.txt.
RFC 2277
H. Alvestrand. IETF Policy on Character Sets and Languages. IETF RFC 2277, BCP 18, January 1998. Available at http://www.ietf.org/rfc/rfc2277.txt.
UXML
Martin Dürst and Asmus Freytag. Unicode in XML and other Markup Languages. Unicode Technical Report #20 and W3C Note 13 June 2003. Available at http://www.unicode.org/reports/tr20/tr20-7.html. The latest version of UXML is available at http://www.unicode.org/reports/tr20/.
XML 1.0
Tim Bray, Jean Paoli, C. Michael Sperberg-McQueen, Eve Maler, Eds. Extensible Markup Language (XML) 1.0 (Third Edition). W3C Recommendation 4 February 2004. Available at http://www.w3.org/TR/2004/REC-xml-20040204. The latest version of XML 1.0 is available at http://www.w3.org/TR/REC-xml/.
XQuery Operators
Ashok Malhotra, Jim Melton, Jonathan Robie, Norman Walsh, Eds. XQuery 1.0 and XPath 2.0 Functions and Operators. W3C Working Draft 15 September 2005. Available at http://www.w3.org/TR/2005/WD-xpath-functions-20050915/. The latest version of XQuery Operators is available at http://www.w3.org/TR/xquery-operators/.
XSLT
James Clark, Ed. XSL Transformations (XSLT). W3C Recommendation 16 November 1999. Available at http://www.w3.org/TR/1999/REC-xslt-19991116. The latest version of XSLT is available at http://www.w3.org/TR/xslt.

B Composing Characters (Non-Normative)

As specified in 3.2.4 Fully-normalized text, a composing character is any character that is

  1. the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15], or

  2. of non-zero canonical combining class (as defined in [Unicode]).

These two categories are highly but not exactly overlapping. The first category includes a few class-zero characters that do compose with a previous character in NFC; this is the case for some vowel and length marks in Brahmi-derived scripts, as well as for the modern non-initial conjoining jamos of the Korean Hangul script. The second category includes some combining characters that do not compose in NFC, for the simple reason that there is no precomposed character involving them. They must nevertheless be taken into account as composing characters because their presence may make reordering of combining marks necessary, to maintain normalization under concatenation or deletion. Therefore, composing characters as defined in 3.2.4 Fully-normalized text include all characters of non-zero canonical combining class plus the following (as of Unicode 3.2):

Unicode number  Character  Name

Brahmi-derived scripts
09BE            া          BENGALI VOWEL SIGN AA
09D7            ৗ          BENGALI AU LENGTH MARK
0B3E            ା          ORIYA VOWEL SIGN AA
0B56            ୖ          ORIYA AI LENGTH MARK
0B57            ୗ          ORIYA AU LENGTH MARK
0BBE            ா          TAMIL VOWEL SIGN AA
0BD7            ௗ          TAMIL AU LENGTH MARK
0CC2            ೂ          KANNADA VOWEL SIGN UU
0CD5            ೕ          KANNADA LENGTH MARK
0CD6            ೖ          KANNADA AI LENGTH MARK
0D3E            ാ          MALAYALAM VOWEL SIGN AA
0D57            ൗ          MALAYALAM AU LENGTH MARK
0DCF            ා          SINHALA VOWEL SIGN AELA-PILLA
0DDF            ෟ          SINHALA VOWEL SIGN GAYANUKITTA
102E            ီ          MYANMAR VOWEL SIGN II

Hangul vowels
1161            ᅡ          HANGUL JUNGSEONG A
to
1175            ᅵ          HANGUL JUNGSEONG I

Hangul trailing consonants
11A8            ᆨ          HANGUL JONGSEONG KIYEOK
to
11C2            ᇂ          HANGUL JONGSEONG HIEUH

NOTE: The characters in the second column of the above table may or may not appear, or may appear as blank rectangles, depending on the capabilities of your browser and on the fonts installed in your system.
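
The definition in this appendix translates directly into a membership test. The following is a minimal sketch in Python, assuming the standard unicodedata module for the canonical combining class and taking the class-zero code points from the table above. Note that unicodedata reflects the Unicode version shipped with the Python runtime, not Unicode 3.2, so the set of characters with non-zero combining class may be larger than the one this document was written against. The names are illustrative.

```python
import unicodedata

# Class-zero composing characters listed in the table above.
CLASS_ZERO_COMPOSING = (
    {
        0x09BE, 0x09D7, 0x0B3E, 0x0B56, 0x0B57, 0x0BBE, 0x0BD7, 0x0CC2,
        0x0CD5, 0x0CD6, 0x0D3E, 0x0D57, 0x0DCF, 0x0DDF, 0x102E,
    }
    | set(range(0x1161, 0x1176))  # HANGUL JUNGSEONG A to I
    | set(range(0x11A8, 0x11C3))  # HANGUL JONGSEONG KIYEOK to HIEUH
)


def is_composing_character(ch: str) -> bool:
    """True for non-zero combining class or a listed class-zero character."""
    return unicodedata.combining(ch) != 0 or ord(ch) in CLASS_ZERO_COMPOSING


def starts_with_composing_character(construct: str) -> bool:
    """True if a construct begins with a composing character."""
    return bool(construct) and is_composing_character(construct[0])


print(is_composing_character("\u0327"))  # True: COMBINING CEDILLA
print(is_composing_character("\u1161"))  # True: class zero, but listed
print(is_composing_character("c"))       # False
print(starts_with_composing_character("\u0327on"))    # True
print(starts_with_composing_character("su\u00E7on"))  # False
```

A check of this kind is what an authoring tool following C306 could run at each construct boundary before warning the user.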

C Resources for Normalization (Non-Normative)

The following are freely available programming resources related to normalization:

D Acknowledgements (Non-Normative)

Asmus Freytag and, in early stages, Ian Jacobs provided significant help in the authoring and editing process of this document. The W3C I18N Working Group and Interest Group, as well as others, provided many comments and suggestions.

