This is a W3C Working Draft for use by W3C members and other parties. This document has been subject to extensive review by the Internationalization Working Group. This document may be updated, replaced, or obsoleted by other documents at any time.
This document is being written as the first step towards a character model for W3C specifications, to make sure that the requirements of other W3C Working Groups (and of other interested parties) are understood and can be addressed. This document itself is not intended to proceed to Proposed Recommendation and Recommendation, but will serve as the base for the document that will specify the character model. Comments are very welcome and should be sent to the editor of this Working Draft as soon as possible.
For the current status of the Internationalization Activity, see http://www.w3.org/International/Activity.
This document describes the requirements for some important aspects of the character model for W3C specifications. The two aspects discussed are string identity matching and string indexing. Both aspects are considered to be vital for the seamless interaction of many components of the current and future web architecture.
Appendix: Details about users of the resulting specification
Glossary
References
Since [RFC 2070], [ISO10646]/[Unicode] (hereafter denoted as UCS, Universal Character Set) has served as a common reference for character encoding in W3C specifications (see [HTML 4.0], [XML1.0], and [CSS2]). This choice was motivated by the fact that the UCS:
As long as data transfer on the WWW was primarily unidirectional (from server to browser), and the main purpose was rendering, the direct use of the UCS as a common reference posed no problems.
However, from early on, the WWW included bidirectional data transfer (forms, ...). Recently, purposes other than rendering are becoming more and more important. The WWW has traditionally been seen as a collection of applications exchanging data based on protocols. It can however also be seen as a single, very large application [Nicol]. The second view is becoming more and more important due to the following developments:
In this context, some properties of the UCS become relevant and have to be addressed. It should be noted that such properties also exist in legacy encodings, and in many cases have been inherited by the UCS in one way or another from such legacy encodings. In particular, these properties are:
This means that in order to ensure consistent behavior on the WWW, some additional specifications, based on the UCS, are necessary.
This document is written as part of the work of the I18N WG to provide internationalization guidelines for the authors of W3C specifications. Because of the importance of consistent behavior for the WWW, it should be expected that the resulting guideline components will become mandatory for W3C specifications.
The specification that will be developed based on this document has a very wide range of potential users, which are listed below in three categories. For some of the users listed here, a short description of what they do and how the requirements described in this document are thought to apply to them is given in the Appendix. A need for specifications in the areas addressed by this document has directly been expressed (in particular at the "Query Language Meeting" in April 1998 in Brisbane) by the following W3C Working Groups or specifications:
Within the W3C, it may in addition be useful for:
Outside of the W3C, it may in addition be useful for things such as:
The following sections 2-4 each discuss the requirements for a particular aspect of the WWW character model. Each section in its first subsection briefly describes the problem addressed. The following subsections then discuss the various requirements. Section 2 is devoted to the requirements for string identity matching. Section 3 expands on string identity matching and discusses subrequirements for early uniform normalization, one way to address string identity matching. Section 4 discusses the requirements for string indexing. An appendix gives additional information about some of the users of the specification resulting from this document. A glossary gives additional explanations for some of the terms used in this document.
This document addresses only those parts of the character model that need exact specification and are extremely time-critical. To see exactly which parts are addressed, please see the first subsection of each of the following sections. A more general model, e.g. in the sense of the reference processing model in [RFC 2070], and general guidelines, e.g. similar to those in [RFC 2130] and [RFC 2277] for the work of the IETF, are not discussed here. Nevertheless, something like the reference processing model in [RFC 2070], which requires applications to behave as if they used the UCS, is assumed as a base.
For each problem, this document lists various requirements. Ideally, all requirements would be met equally well, and the degree to which they are being met could be measured equally well. However, some of the requirements take the form of more general design objectives, for which it is difficult to measure the degree to which they have been met. Also, some requirements conflict with each other. Where such conflicts are known, the conflict and a preference (i.e. which requirement has greater weight) are indicated.
String identity matching is a subset of the more general problem of string matching. String matching in general can be done with various degrees of specificity, from very approximate matching, such as regular expressions or phonetic matching for English, to more specific matches such as case-insensitive or accent-insensitive matching. This document deals only with string identity matching. Two strings match as identical if they contain no user-identifiable distinctions. For more details on the meaning of user-identifiable distinctions, see the following explanations as well as subsection 2.3 and subsection 2.4. Any kind of less specific matching is not discussed in this document.
At various places in the WWW infrastructure, strings, and in particular identifiers, are compared for identity. If different places use different definitions of string identity matching, this results in undesired unpredictability. Such comparisons are unproblematic if the expectations of the users and the results of a simple binary comparison coincide, or can be made to coincide. For ASCII, such a coincidence is established and assumed, including some degree of user education, e.g. about the differences between the digit 0 and the uppercase letter O. For the full repertoire of the UCS, however, the aforementioned coincidence between user expectations and binary comparisons is not a priori guaranteed.
In order to ensure consistent behavior on the WWW, a character model for W3C specifications must make sure that the gap between user expectations and internal operation is bridged. A character model for W3C specifications must therefore specify how the problem of string identity matching is handled. The requirements for such a specification are listed in the following subsections. Please note that with the exception of subsection 2.7 and subsection 2.8, the following subsections assume the character processing model of [RFC 2070], i.e. they assume that applications behave as if they used the UCS internally. The section ends with subsection 2.10, which lays out some alternatives and motivates section 3.
In order to fulfill its purpose, a specification of string identity matching must not contain any ambiguities.
While in some cases, the addition of version numbers might help to make the specification unambiguous, carrying version numbers as parameters is in many cases highly undesirable and should therefore be avoided.
Typical examples where a gap between user expectations and internal operation can occur in the UCS are the duplicate encodings defined as canonical equivalences in [Unicode]. As an example, the UCS allows us to encode "ü" either as a single codepoint (U+00FC, LATIN SMALL LETTER U WITH DIAERESIS), or as the codepoint for "u" (U+0075, LATIN SMALL LETTER U) followed by the codepoint U+0308 (COMBINING DIAERESIS). Such equivalences are artifacts of the encoding method(s) chosen for the UCS.
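The duplicate encoding of "ü" described above can be observed directly. The following is a minimal sketch in Python using the standard unicodedata module. The choice of Unicode Normalization Form C ("NFC") is an assumption made purely for illustration; this document does not select a normal form.

```python
import unicodedata

precomposed = "\u00FC"        # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "\u0075\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS

# A simple codepoint-by-codepoint comparison sees two different strings,
# although a user sees the same "ü" in both cases.
binary_equal = precomposed == decomposed

# After bringing both to the same form (NFC is an illustrative assumption),
# the two spellings compare as identical.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(binary_equal, normalized_equal)  # False True
```

The first result shows the gap between user expectation and binary comparison; the second shows that the gap closes once both sides agree on one representation.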
It is expected that the canonical equivalences specified in the Unicode standard will be an excellent starting point for defining the range of things to be identified as duplicate encodings. This will make sure that the experience of the Unicode Technical Committee with respect to character equivalences is fully leveraged. Whether any changes are necessary will have to be examined more closely. If such changes consist only of additions of equivalences, implementations of W3C specifications would collectively conform to conformance clause C9 given in [Unicode, p. 3-2]: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
Additions may include some presentation forms.
Another category where encoding differences are invisible to the user is that of the various control codes. W3C standards mostly deal with structured text (as opposed to plain text). It should therefore in most cases be possible to rely on explicit markup rather than on in-stream control codes.
String identity matching shall not treat as equivalent cases that can clearly be distinguished by a user because the difference may be significant in many cases. Examples are:
These differences can be handled by the (mainly native) users of the characters in question, and can at least be identified by users not familiar with the characters in question. Such similarities are explicitly not considered for string identity matching, because they do not need a coordinated solution for the entirety of the WWW.
Various forms of equivalence testing are needed for operations such as searching and sorting. But such operations will not be based on string identity matching. Also, it is felt that such operations do not need to behave uniformly across the web; that on the contrary, it is beneficial to have competition (e.g. for search engines and their user interfaces), that this has already been taken care of elsewhere (e.g. the work of ISO and Unicode on default and tailorable sorting), and that the requirements of language-dependence and user-configurability are stronger than the needs for consistent behavior.
It is impossible to predict what characters might be added to the UCS in the future. String identity matching should be specified so as to try to minimize the impact of future additions to the UCS on the specification and its implementations.
One category of additions that warrants particular attention, both because it has occurred relatively frequently in the past and because it affects string identity matching directly, is the addition of new precomposed forms for which decomposed equivalents are already available.
Because of the increased integration of the WWW, selecting different ways to solve the string identity matching problem for different components of the WWW would produce a fragmentation of users' and implementors' expectations, and the need for constant attention to minute differences that are rarely visible. Applicability to a broad range of W3C specifications and the widest number of components of the WWW means that a solution has to be feasible for all kinds of different systems, and different subsystems of larger applications, with different resources available. This in particular includes very small systems, and systems that do not have continuous network access.
Many components of the WWW have to work with data without access to the actual characters. This includes all kinds of schemes that make use of encryption techniques as well as schemes where the character encoding is in general left undefined, such as URIs [URI]. For things such as URIs, it should be possible to test two strings for identity even if their character encoding is unknown, given of course that in both cases the same character encoding has been chosen. Also, it should be possible to test two strings for identity if the actual data cannot be accessed directly because it is encrypted. Even in cases where the character encoding is known, and the data is accessible, treating data as opaque is often desirable, because an identity check might occur in an architectural component that has (or the implementors of which have) completely different concerns than internationalization. Examples of such components are firewalls and passwords.
An often cited maxim of Internet engineering is "be liberal in what you accept; be conservative in what you send". The use of the appropriate kind of equivalence at the receiving end easily allows you to "be liberal in what you accept". However, without any kind of indication of the preferred way of encoding or the preferred character variant, there is no way to "be conservative in what you send". This means that potential benefits cannot be realized.
Several upcoming W3C specifications depend on a clear and uniform specification for string identity matching. Therefore, no time should be lost in preparing the string identity matching specification.
For a specification for string identity matching, the following issues have to be addressed:
The arguments for why early normalization may be needed, even if only in some cases, can be listed as follows:
It therefore seems appropriate to address the requirements of early normalization in particular. This is done in the next section.
As discussed in subsection 2.10, there is a high probability that early normalization may become necessary, even if only for some selected cases. Early normalization means that data is normalized as close to its origin, or as close to its conversion to the UCS, as possible. This eliminates duplicate representations and other ambiguities. The actual string identity check can therefore be done without taking such ambiguities into account. In order for this to work, however, early normalization has to be uniform, i.e. all components of the WWW that normalize have to do so in one specific way.
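The division of labor implied by early uniform normalization can be sketched as follows. This is an illustration under stated assumptions, not a proposed design: normalize_early and identical are hypothetical names, and NFC stands in for whatever single normal form would eventually be specified.

```python
import unicodedata

def normalize_early(text: str) -> str:
    """Hypothetical helper: normalize once, as close to the origin as possible.
    NFC is an assumed stand-in for the single, uniform normal form."""
    return unicodedata.normalize("NFC", text)

def identical(a: bytes, b: bytes) -> bool:
    """Downstream identity check: a plain binary comparison that needs
    no knowledge of characters or equivalences."""
    return a == b

# Two producers enter the same word "für" with different codepoint sequences.
raw_a = "f\u00FCr"    # precomposed ü
raw_b = "fu\u0308r"   # u followed by COMBINING DIAERESIS

# Without early normalization, the transmitted bytes differ.
print(identical(raw_a.encode("utf-8"), raw_b.encode("utf-8")))  # False

# With early uniform normalization at both origins, they are the same bytes.
sent_a = normalize_early(raw_a).encode("utf-8")
sent_b = normalize_early(raw_b).encode("utf-8")
print(identical(sent_a, sent_b))  # True
```

Because the comparison operates on bytes only, it could equally be carried out by a component that treats the data as opaque.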
In order for W3C specifications to attribute the responsibility for early uniform normalization to specific components, guidelines on where early uniform normalization should occur must be provided. Ideally, uniform normalization would occur at the time of data creation, e.g. by a keyboard driver. However, W3C specifications do not deal directly with things such as keyboard drivers. This means that more appropriate locations for requiring early uniform normalization have to be defined. As an example, it could be required that text transmitted via certain protocols, or text exposed in certain APIs, is normalized.
It should be noted that text is transmitted on the WWW in many encodings not based on the UCS. In these cases, uniform normalization ideally occurs when data is transcoded (or assumed to be transcoded according to the reference processing model of [RFC 2070]) from legacy encodings (such as [ISO 8859] or [ISO6937]) to the UCS.
Ideally, early uniform normalization will spread out from the WWW to other parts of the information infrastructure. For example, early uniform normalization may only be specified for text actually sent out by a server, but the task of normalization may be transferred from the server to the document provider, and from there further to the editor tool and even to the keyboard driver. Such a transfer is indeed highly desirable in many cases, because to avoid generating unnormalized data is in many cases easier than to normalize such data later.
A wide range of text on the WWW will have to be normalized. This is easier to do if uniform normalization occurs towards the more popular representation than if a not so widely used representation is used as the normal form. It may also provide a bit more time, in that we are just defining what might happen naturally anyway instead of having to fight uphill from day one. Existing standards (such as the canonical ordering behavior for combining characters [Unicode, page 3-9]) should also be considered.
The views of experts on character coding, especially of members of the Unicode Technical Committee and of ISO/IEC JTC1/SC2/WG2 should be sought, with the goal of achieving a broad consensus. This requirement cannot, however, take precedence over all other requirements, especially Requirement 2.9, "The string identity matching specification shall be prepared quickly".
Where choices are available, early uniform normalization should be specified in a way which permits easy and compact implementations. It should however be remembered that the main benefit in terms of implementation simplification is achieved due to the concept of early uniform normalization itself, by relieving a large part of the WWW infrastructure of the need to consider equivalences when making comparisons, and by locating normalization at those places in the WWW architecture where most information on actually occurring codepoint combinations and most internationalization implementation expertise and concern are available.
To help in developing, understanding, implementing, and testing early uniform normalization, reference software shall be developed and provided to the public under W3C copyright. This software will cover all cases, whereas at a given point in the infrastructure (e.g. a transcoder or a keyboard driver), only some cases may have to be taken into account.
To help in developing, understanding, implementing, and testing early uniform normalization, test cases shall be developed and provided to the public under W3C copyright.
On many occasions, in order to access a substring or a character, it is necessary to index characters in a string/sequence/array of characters. Where character indices are exchanged between components of the WWW, there is a need for a uniform definition of string indexing in order to ensure consistent behavior. In the simplest cases, this boils down to questions such as "At which position in a given string is a given character?", "Which character is at a given position in a given string?", and even simpler, "What's the length of a given string?".
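Even the simplest of these questions, the length of a string, has several plausible answers depending on the unit that is counted. The following minimal Python sketch counts the same three-character string in three different units. The sample string and the variable names are illustrative assumptions, and none of the three counts is put forward here as the one a specification should choose.

```python
# "a", "ü" (precomposed), and one character from outside the
# Basic Multilingual Plane (U+10400).
text = "a\u00FC\U00010400"

# Python's str is indexed by codepoint.
length_in_codepoints = len(text)

# In UTF-16, the character outside the BMP takes two 16-bit code units.
length_in_utf16_units = len(text.encode("utf-16-le")) // 2

# In UTF-8, the three characters take 1, 2, and 4 bytes respectively.
length_in_utf8_bytes = len(text.encode("utf-8"))

print(length_in_codepoints, length_in_utf16_units, length_in_utf8_bytes)  # 3 4 7
```

Two components that exchange an index into this string but count in different units would disagree about which character is meant, which is the inconsistency a uniform definition is intended to prevent.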
Note: In many cases, it is highly preferable to use non-numeric ways of identifying substrings. The specification of string indexing for the WWW should not be seen as a general recommendation for the use of string indexing for substring identification. As an example, in the case of translation of a document from one language to another, identification of substrings based on document structure can be expected to be much more stable than identification based on string indexing.
Note: Because of the wide variability of scripts and characters, different operations may be required to work at different levels of aggregation or subdivision. String indexing as discussed in this section is only intended to provide a base for such operations; it cannot address all levels concurrently.
The issue of indexing origin, i.e. whether the first character in a string is indexed as character number 0 or as character number 1, will not be addressed here.
This is the basic functional requirement for indexing. It means that the specification has to be without options.
The basic consistency test is the following:
The requirement is fulfilled if the test is successful for all strings of characters and all combinations of systems.
Tools and programs are supposed to hide most of the indexing values from the end users. However, the fact that direct editing/manipulation was possible was one of the (unexpected) reasons for the success of the WWW. Also, in the complex infrastructure of the WWW, it is impossible to define a clear and strict boundary between what is manipulated by programs and what is seen and manipulated by the users. Therefore, it is highly desirable that something seen as one single character by the user is indeed counted as one character. However, there may be cases where for the same characters, there are differences in the perceptions of users using various languages, or even of users using one and the same language. In this case, an ideal solution is not possible. Preference should be given to a solution which, although not corresponding to user expectations, can be understood by as many users as possible (e.g. treat each character in the Klingon alphabet as occupying two index positions).
This requirement may be in conflict with requirement 4.6 (because user expectations and actual encoding might be different). Because neither requirement is absolute, no indication of relative priorities has been given here.
Because of the variability of what a "character" can mean in different scripts and to different people (for the same script), string indexing should permit the designation of characters at various levels of resolution appropriate for the task at hand. This can in principle be achieved by indexing on the finest granularity possible, or by indexing of subelements. Although subelement indexing might not be defined in the first version of the character model, and might not be implemented everywhere, the necessary precautions for syntax extensibility and fallbacks should be taken care of and defined up-front wherever applicable.
It is impossible to predict what characters might be added to the UCS in the future. String indexing should be specified so as to try to minimize the impact of future additions to the UCS on the specification and its implementations.
One category of additions that warrants particular attention, both because it has occurred relatively frequently in the past and because it may affect string indexing directly, is the addition of new precomposed forms for which decomposed equivalents are already available.
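Why precomposed and decomposed forms matter for indexing can be seen in a short sketch. The Python example below compares the same word in a precomposed and a decomposed representation; the Unicode normalization forms NFC and NFD are used only as a convenient way to obtain the two representations and are an assumption of this example, not a choice made by this document.

```python
import unicodedata

word = "Z\u00FCrich"  # "Zürich" with precomposed ü

composed = unicodedata.normalize("NFC", word)    # ü as one codepoint
decomposed = unicodedata.normalize("NFD", word)  # ü as u + COMBINING DIAERESIS

# The same visible text has different lengths, and the letter "r"
# sits at different index positions (0-based, counted in codepoints).
print(len(composed), composed.index("r"))      # 6 2
print(len(decomposed), decomposed.index("r"))  # 7 3
```

An index computed against one representation therefore points at a different character in the other, so the addition of a new precomposed form can shift indices for text that previously had only a decomposed spelling.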
Indexing into a string of characters is a very frequent operation. Ease of implementation is therefore crucial. If string indexing is based on early uniform normalization, then this may help to make implementation easier.
Several upcoming W3C specifications depend on a clear character model and in particular on clear definitions for string indexing. It is therefore crucial that no time is lost.
This appendix gives some additional details about users of the specification that will result from the requirements in this document. This is intended to give some very short background to readers not familiar with some of the work of the W3C, as well as to make sure that the requirements of these groups are well understood.
Note: The specifications discussed below are still in progress. The summaries are based on the current state, as publicly known. Changes may occur at any time.
This glossary does not provide exact definitions of terms but gives some background on how certain words are used in this document.
Copyright © 1998 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.