WD-charreq-19980710

Requirements for String Identity Matching and String Indexing

World Wide Web Consortium Working Draft 10-July-1998

This version: http://www.w3.org/TR/1998/WD-charreq-19980710
Latest version: Public: http://www.w3.org/TR/WD-charreq; WG-internal: see overview at http://www.w3.org/International/Group/
Previous public version: None
Previous WG-internal version: http://www.w3.org/International/Group/1998/07/WD-charreq-19980708
Editor: Martin J. Dürst (W3C) <duerst@w3.org>

Status of this document

This is a W3C Working Draft for use by W3C members and other parties. This document has been subject to extensive review by the Internationalization Working Group. This document may be updated, replaced, or obsoleted by other documents at any time.

This document is being written as the first step towards a character model for W3C specifications, to make sure that the requirements of other W3C Working Groups (and of other interested parties) are understood and can be addressed. This document itself is not intended to proceed to Proposed Recommendation and Recommendation, but will serve as the base for the document that will specify the character model. Comments are very welcome and should be sent to the editor of this Working Draft as soon as possible.

For the current status of the Internationalization Activity, see http://www.w3.org/International/Activity.

Abstract

This document describes the requirements for some important aspects of the character model for W3C specifications. The two aspects discussed are string identity matching and string indexing. Both aspects are considered to be vital for the seamless interaction of many components of the current and future web architecture.

Appendix: Details about users of the resulting specification
Glossary
References

1. Introduction

1.1 Background

Since [RFC 2070], [ISO 10646]/[Unicode] (hereafter denoted as UCS, Universal Character Set) has served as a common reference for character encoding in W3C specifications (see [HTML 4.0], [XML 1.0], and [CSS2]). This choice was motivated by the fact that the UCS:

is the only universal character repertoire available
covers the widest possible repertoire
provides a way of referencing characters independent of the encoding of a resource
is being updated/completed carefully
is widely accepted and implemented by industry.

As long as data transfer on the WWW was primarily unidirectional (from server to browser), and the main purpose was rendering, the direct use of the UCS as a common reference posed no problems.

However, from early on, the WWW included bidirectional data transfer (forms, ...). Recently, purposes other than rendering have become more and more important. The WWW has traditionally been seen as a collection of applications exchanging data based on protocols. It can, however, also be seen as a single, very large application [Nicol]. The second view is becoming more and more important due to the following developments:

The increase in data transfers among servers, proxies, and clients
The increase in places where non-ASCII characters are allowed
The increase in data transfers between different protocol/format elements (such as element/attribute names, URI components, and textual content)
Definition of specifications for APIs (as opposed to protocol specifications only)

In this context, some properties of the UCS become relevant and have to be addressed. It should be noted that such properties also exist in legacy encodings, and in many cases have been inherited by the UCS in one way or another from such legacy encodings. In particular, these properties are:

Choice of binary encoding forms (UTF-8, UTF-16, UCS-4)
Variable length encodings (e.g. due to the use of combining characters, surrogates,...)
Duplicate encodings (e.g. precomposed vs. decomposed)
Control codes for various purposes (e.g. bidirectionality control, symmetric swapping,...)

This means that in order to ensure consistent behavior on the WWW, some additional specifications, based on the UCS, are necessary.

This document is written as part of the work of the I18N WG to provide internationalization guidelines for the authors of W3C specifications. Because of the importance of consistent behavior for the WWW, it should be expected that the resulting guideline components will become mandatory for W3C specifications.

1.2 Potential users of the resulting specification

The specification that will be developed based on this document has a very wide range of potential users, which are listed below in three categories. For some of the users listed here, a short description of what they do and how the requirements described in this document are thought to apply to them is given in the Appendix. A need for specifications in the areas addressed by this document has been expressed directly (in particular at the "Query Language Meeting" in April 1998 in Brisbane) by the following W3C Working Groups or specifications:

DOM (Document Object Model)
The XML activity, for XPointer
XSL (eXtensible Style Language)
RDF (Resource Description Framework) Model and Syntax

Within the W3C, it may in addition be useful for:

XML element/attribute names
Work on digital signatures
Internationalization of URIs

Outside of the W3C, it may in addition be useful for things such as:

Identifiers in Java
String handling in ECMAScript
Filenames in FTP
Folder names in IMAP
Usenet newsgroup names
Identifiers in ACAP

1.3 Structure of this Document

The following sections 2-4 each discuss the requirements for a particular aspect of the WWW character model. Each section in its first subsection briefly describes the problem addressed. The following subsections then discuss the various requirements. Section 2 is devoted to the requirements for string identity matching. Section 3 expands on string identity matching and discusses subrequirements for early uniform normalization, one way to address string identity matching. Section 4 discusses the requirements for string indexing. An appendix gives additional information about some of the users of the specification resulting from this document. A glossary gives additional explanations for some of the terms used in this document.

1.4 Scope

This document addresses only those parts of the character model that need exact specification and are extremely time-critical. To see exactly which parts are addressed, please see the first subsection of each of the following sections. A more general model, e.g. in the sense of the reference processing model in [RFC 2070], and general guidelines, e.g. similar to those in [RFC 2130] and [RFC 2277] for the work of the IETF, are not discussed here. Nevertheless, something like the reference processing model in [RFC 2070], which requires applications to behave as if they used the UCS, is assumed as a base.

For each problem, this document lists various requirements. Ideally, all requirements would be met equally well, and the degree to which they are being met could be measured equally well. However, some of the requirements take the form of more general design objectives, for which it is difficult to measure the degree to which they have been met. Also, some requirements conflict with each other. Where such conflicts are known, the conflict and a preference (i.e. which requirement has greater weight) are indicated.

2. String Identity Matching

2.1 Problem

String identity matching is a subset of the more general problem of string matching. String matching in general can be done with various degrees of specificity, from very approximate matching such as regular expressions or phonetic matching for English, to more specific matches such as case-insensitive or accent-insensitive matching. This document deals only with string identity matching. Two strings match as identical if they contain no user-identifiable distinctions. For more details on the meaning of user-identifiable distinctions, see the following explanations as well as subsection 2.3 and subsection 2.4. Any kind of less specific matching is not discussed in this document.

At various places in the WWW infrastructure, strings, and in particular identifiers, are compared for identity. If different places use different definitions of string identity matching, this results in undesired unpredictability. Such comparisons are unproblematic if the expectations of the users and the results of a simple binary comparison coincide, or can be made to coincide. For ASCII, such a coincidence is established and assumed, including some degree of user education, e.g. about the differences between the digit 0 and the uppercase letter O. For the full repertoire of the UCS, however, the aforementioned coincidence between user expectations and binary comparisons is not a priori guaranteed.

In order to ensure consistent behavior on the WWW, a character model for W3C specifications must make sure that the gap between user expectations and internal operation is bridged. A character model for W3C specifications must therefore specify how the problem of string identity matching is handled. The requirements for such a specification are listed in the following subsections. Please note that with the exception of subsection 2.7 and subsection 2.8, the following subsections assume the character processing model of [RFC 2070], i.e. they assume that applications behave as if they used the UCS internally. The section ends with subsection 2.10, which lays out some alternatives and motivates section 3.

2.2 The string identity matching specification shall be defined exactly

In order to fulfill its purpose, a specification of string identity matching must not contain any ambiguities.

While in some cases, the addition of version numbers might help to make the specification unambiguous, carrying version numbers as parameters is in many cases highly undesirable and should therefore be avoided.

2.3 The string identity matching specification shall not expose invisible encoding differences to the user

Typical examples where a gap between user expectations and internal operation can occur in the UCS are the duplicate encodings defined as canonical equivalences in [Unicode]. As an example, the UCS allows us to encode "ü" either as a single codepoint (U+00FC, LATIN SMALL LETTER U WITH DIAERESIS), or as the codepoint for "u" (U+0075, LATIN SMALL LETTER U) followed by the codepoint U+0308 (COMBINING DIAERESIS). Such equivalences are artifacts of the encoding method(s) chosen for the UCS.
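The "ü" example can be made concrete with a short sketch. It assumes Python's standard `unicodedata` module and uses Unicode Normalization Form C purely as a stand-in for whatever normal form the eventual specification selects; this document does not define one. The helper name `identical` is hypothetical.

```python
import unicodedata

precomposed = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS


def identical(a: str, b: str) -> bool:
    """Hypothetical identity check: fold canonical equivalents, then compare.

    NFC is an assumption made for illustration only.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain codepoint-by-codepoint comparison sees two different strings.
binary_equal = precomposed == decomposed

# A comparison that takes canonical equivalence into account sees one string.
identity_equal = identical(precomposed, decomposed)

print(binary_equal, identity_equal)
```

The two representations render identically, yet only the second comparison reflects what the user sees.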

It is expected that the canonical equivalences specified in the Unicode standard will be an excellent starting point for defining the range of things to be identified as duplicate encodings. This will make sure that the experience of the Unicode Technical Committee with respect to character equivalences is fully leveraged. Whether any changes are necessary will have to be examined more closely. If such changes consist only of additions of equivalences, implementations of W3C specifications would collectively conform to conformance clause C9 given in [Unicode, p. 3-2]: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct." Additions may include some presentation forms.

Another category where encoding differences are invisible to the user is the various control codes. W3C standards mostly deal with structured text (as opposed to plain text). It should therefore in most cases be possible to rely on explicit markup rather than on in-stream control codes.

2.4 The string identity matching specification shall not treat as equivalent characters that can usually be distinguished by the user

String identity matching shall not treat as equivalent cases that can clearly be distinguished by a user, because the difference may be significant in many cases. Examples are:

Lower-case letters and upper-case letters (e.g. "ü" and "Ü")
Characters with and without diacritics such as accents or vowel marks (e.g. "ü" and "u")
Half-width and full-width presentation variants (Even though one of the variants is clearly only encoded for compatibility, users can distinguish them if necessary. Depending on the individual specification and the protocol/format element concerned, the use of such variants may be discouraged or forbidden.)

These differences can be handled by the (mainly native) users of the characters in question, and can at least be identified by users not familiar with the characters in question. Such similarities are explicitly not considered for string identity matching, because they do not need a coordinated solution for the entirety of the WWW.
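The three examples listed above can be checked against a candidate identity test. The sketch below again assumes Python's `unicodedata` module with Normalization Form C as an illustrative stand-in; the point is only that a check limited to canonical equivalences leaves all three distinctions intact.

```python
import unicodedata


def identical(a: str, b: str) -> bool:
    """Hypothetical identity check based on canonical equivalence only (NFC assumed)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


pairs = {
    "case": ("\u00fc", "\u00dc"),   # "ü" vs. "Ü"
    "diacritic": ("\u00fc", "u"),   # "ü" vs. "u"
    "width": ("A", "\uff21"),       # "A" vs. FULLWIDTH LATIN CAPITAL LETTER A
}

# Each pair is user-distinguishable, so none of them should match as identical.
results = {name: identical(a, b) for name, (a, b) in pairs.items()}
print(results)
```

A normalization that also folded compatibility variants would merge the half-width and full-width forms, which is exactly what this requirement rules out.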

Various forms of equivalence testing are needed for operations such as searching and sorting. But such operations will not be based on string identity matching. Also, it is felt that such operations do not need to behave uniformly across the web; that on the contrary, it is beneficial to have competition (e.g. for search engines and their user interfaces), that this has already been taken care of elsewhere (e.g. the work of ISO and Unicode on default and tailorable sorting), and that the requirements of language-dependence and user-configurability are stronger than the needs for consistent behavior.

2.5 The string identity matching specification shall be forward-compatible

It is impossible to predict what characters might be added to the UCS in the future. String identity matching should be specified so as to try to minimize the impact of future additions to the UCS on the specification and its implementations.

One category of additions that warrants particular attention, both because it has occurred relatively frequently in the past and because it affects string identity matching directly, is the addition of new precomposed forms for which decomposed equivalents are already available.

2.6 The string identity matching specification shall be broadly applicable

Because of the increased integration of the WWW, selecting different ways to solve the string identity matching problem for different components of the WWW would produce a fragmentation of users' and implementors' expectations, and the need for constant attention to minute differences that are rarely visible. Applicability to a broad range of W3C specifications and the widest number of components of the WWW means that a solution has to be feasible for all kinds of different systems, and different subsystems of larger applications, with different resources available. This in particular includes very small systems, and systems that do not have continuous network access.

2.7 The string identity matching specification shall be workable with opaque identifiers and data

Many components of the WWW have to work with data without access to the actual characters. This includes all kinds of schemes that make use of encryption techniques as well as schemes where the character encoding is in general left undefined, such as URIs [URI]. For things such as URIs, it should be possible to test two strings for identity even if their character encoding is unknown, given of course that in both cases the same character encoding has been chosen. Also, it should be possible to test two strings for identity if the actual data cannot be accessed directly because it is encrypted. Even in cases where the character encoding is known, and the data is accessible, treating data as opaque is often desirable, because an identity check might occur in an architectural component that has (or the implementors of which have) completely different concerns than internationalization. Examples of such components are firewalls and passwords.
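The situation of a component that sees only opaque data can be sketched as follows. A SHA-256 digest from Python's `hashlib` stands in for any scheme (encryption, password hashing) that hides the characters; that choice, the sample string, and the use of NFC are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def digest(data: bytes) -> str:
    """Stand-in for an opaque representation: the component sees only this value."""
    return hashlib.sha256(data).hexdigest()


name_a = "Z\u00fcrich"    # precomposed "ü"
name_b = "Zu\u0308rich"   # "u" followed by COMBINING DIAERESIS

# Without normalization at the source, the opaque values differ and the
# comparing component has no way to discover that the strings are equivalent.
unnormalized_match = digest(name_a.encode("utf-8")) == digest(name_b.encode("utf-8"))

# If both producers normalize before the data becomes opaque, a plain binary
# comparison of the opaque values is sufficient.
norm_a = unicodedata.normalize("NFC", name_a).encode("utf-8")
norm_b = unicodedata.normalize("NFC", name_b).encode("utf-8")
normalized_match = digest(norm_a) == digest(norm_b)

print(unnormalized_match, normalized_match)
```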

2.8 The string identity matching specification shall allow you to be conservative in what you send

An often cited maxim of Internet engineering is "be liberal in what you accept; be conservative in what you send". The use of the appropriate kind of equivalence at the receiving end easily allows you to be liberal in what you accept. However, without any kind of indication of the preferred way of encoding or the preferred character variant, there is no way to be conservative in what you send. This means that potential benefits cannot be realized.

2.9 The string identity matching specification shall be prepared quickly

Several upcoming W3C specifications depend on a clear and uniform specification for string identity matching. Therefore, no time should be lost in preparing the string identity matching specification.

2.10 Solutions for string identity matching

For a specification for string identity matching, the following issues have to be addressed:

Which representations to treat as equivalent (and which not)
Which components in the WWW architecture to make responsible for equivalences:
1. Each individual component that performs a string identity check has to take equivalences into account (late normalization)
2. Duplicates and ambiguities are removed as close to their source as possible (early normalization)
Which way to normalize (in the case that early normalization, option 2 above, is needed, even if only in some cases)
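The difference between the two allocations of responsibility can be sketched in a few lines. The function names are hypothetical, and Normalization Form C from Python's `unicodedata` module is assumed only as a placeholder for the normal form still to be chosen.

```python
import unicodedata


def normalize(text: str) -> str:
    """Placeholder normalization; NFC is an assumption, not a choice made by this document."""
    return unicodedata.normalize("NFC", text)


def late_match(a: str, b: str) -> bool:
    """Late normalization: every comparing component folds equivalences itself."""
    return normalize(a) == normalize(b)


def produce(text: str) -> str:
    """Early normalization: the producer normalizes once, close to the source."""
    return normalize(text)


def early_match(a: str, b: str) -> bool:
    """With early normalization, consumers need only a binary comparison."""
    return a == b


decomposed = "re\u0301sume\u0301"   # "e" followed by COMBINING ACUTE ACCENT, twice
precomposed = "r\u00e9sum\u00e9"    # precomposed "é", twice

late_result = late_match(decomposed, precomposed)
early_result = early_match(produce(decomposed), produce(precomposed))
print(late_result, early_result)
```

Both approaches give the same answer here; they differ in which components must carry normalization logic and tables.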

The arguments for why early normalization may be needed, even if only in some cases, can be listed as follows:

It is a prerequisite for "be conservative in what you send"
It is the only solution to deal with opaque data (see subsection 2.7)
Not all parts of the WWW may reasonably be expected to do normalization
There is less need for software updates to address forward-compatibility issues
It may lead to more efficient implementations for string indexing (see subsection 4.6)
With increased component integration, it becomes more and more difficult to hide certain kinds of implementation details

It therefore seems appropriate to address the requirements of early normalization in particular. This is done in the next section.

3. Early uniform normalization

3.1 Problem

As discussed in subsection 2.10, there is a high probability that early normalization may become necessary, even if only for some selected cases. Early normalization means that data is normalized as close to its origin, or as close to its conversion to the UCS, as possible. This eliminates duplicate representations and other ambiguities. The actual string identity check can therefore be done without taking such ambiguities into account. In order for this to work, however, early normalization has to be uniform, i.e. all components of the WWW that normalize have to do so in one specific way.

3.2 The location of early uniform normalization shall be specified

In order for W3C specifications to attribute the responsibility for early uniform normalization to specific components, guidelines on where early uniform normalization should occur must be provided. Ideally, uniform normalization would occur at the time of data creation, e.g. by a keyboard driver. However, W3C specifications do not deal directly with things such as keyboard drivers. This means that more appropriate locations for requiring early uniform normalization have to be defined. As an example, it could be required that text transmitted via certain protocols, or text exposed in certain APIs, is normalized.

It should be noted that text is transmitted on the WWW in many encodings not based on the UCS. In these cases, uniform normalization ideally occurs when data is transcoded (or assumed to be transcoded according to the reference processing model of [RFC 2070]) from legacy encodings (such as [ISO 8859] or [ISO 6937]) to the UCS.
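Normalizing at the point of transcoding might look like the following sketch. It assumes Python's built-in `iso-8859-1` and `utf-8` codecs and NFC as a placeholder normal form; the helper name `transcode_and_normalize` is hypothetical. [ISO 6937] is not covered because Python's standard library ships no codec for it.

```python
import unicodedata


def transcode_and_normalize(raw: bytes, encoding: str) -> str:
    """Decode legacy bytes to the UCS, then normalize in the same step (NFC assumed)."""
    return unicodedata.normalize("NFC", raw.decode(encoding))


# The same word arriving from two sources in two encodings.
latin1_bytes = b"Z\xfcrich"                        # ISO 8859-1, precomposed "ü"
utf8_decomposed = "Zu\u0308rich".encode("utf-8")   # UTF-8, decomposed "ü"

from_latin1 = transcode_and_normalize(latin1_bytes, "iso-8859-1")
from_utf8 = transcode_and_normalize(utf8_decomposed, "utf-8")

# After transcoding with normalization, later components can compare in binary.
print(from_latin1 == from_utf8)
```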

Ideally, early uniform normalization will spread out from the WWW to other parts of the information infrastructure. For example, early uniform normalization may only be specified for text actually sent out by a server, but the task of normalization may be transferred from the server to the document provider, and from there further to the editor tool and even to the keyboard driver. Such a transfer is indeed highly desirable in many cases, because to avoid generating unnormalized data is in many cases easier than to normalize such data later.

3.3 Early uniform normalization shall be based on widespread practice

A wide range of text on the WWW will have to be normalized. This is easier to do if uniform normalization occurs towards the more popular representation than if a not so widely used representation is used as the normal form. It may also provide a bit more time, in that we are just defining what might happen naturally anyway instead of having to fight uphill from day one. Existing standards (such as the canonical ordering behavior for combining characters [Unicode, page 3-9]) should also be considered.

3.4 Early uniform normalization shall be specified in collaboration with the expert communities on character encoding

The views of experts on character coding, especially of members of the Unicode Technical Committee and of ISO/IEC JTC1/SC2/WG2, should be sought, with the goal of achieving a broad consensus. This requirement cannot, however, take precedence over all other requirements, especially Requirement 2.9, "The string identity matching specification shall be prepared quickly".

3.5 Early uniform normalization shall be feasible to implement

Where choices are available, early uniform normalization should be specified in a way which permits easy and compact implementations. It should however be remembered that the main benefit in terms of implementation simplification is achieved due to the concept of early uniform normalization itself, by relieving a large part of the WWW infrastructure of the need to consider equivalences when making comparisons, and by locating normalization at those places in the WWW architecture where most information on actually occurring codepoint combinations and most internationalization implementation expertise and concern are available.

3.6 Reference software for early uniform normalization shall be provided

To help in developing, understanding, implementing, and testing early uniform normalization, reference software shall be developed and provided to the public under W3C copyright. This software will cover all cases, whereas at a given point in the infrastructure (e.g. a transcoder or a keyboard driver), only some cases may have to be taken into account.

3.7 Test cases for early uniform normalization shall be provided

To help in developing, understanding, implementing, and testing early uniform normalization, test cases shall be developed and provided to the public under W3C copyright.

4. String indexing

4.1 Problem Description

On many occasions, in order to access a substring or a character, it is necessary to index characters in a string/sequence/array of characters. Where character indices are exchanged between components of the WWW, there is a need for a uniform definition of string indexing in order to ensure consistent behavior. In the simplest cases, this boils down to questions such as "At which position in a given string is a given character?", "Which character is at a given position in a given string?", and even simpler, "What's the length of a given string?".
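Even the simplest of these questions has several plausible answers, as the following sketch shows. It assumes Python, where `len` on a `str` counts codepoints; the sample string is an illustrative choice that combines a combining character with a codepoint that requires a surrogate pair in UTF-16.

```python
# "u" + COMBINING DIAERESIS + one codepoint outside the Basic Multilingual Plane.
text = "u\u0308\U00010400"

# Length counted in codepoints (what Python's len() reports for str).
code_points = len(text)

# Length counted in UTF-16 code units: the last codepoint needs a surrogate pair.
utf16_units = len(text.encode("utf-16-le")) // 2

# Length counted in UTF-8 bytes: 1 + 2 + 4.
utf8_bytes = len(text.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)
```

A user would likely perceive two characters here, which matches none of the three counts. An index exchanged between components is only meaningful if both sides agree on the unit being counted.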

Note: In many cases, it is highly preferable to use non-numeric ways of identifying substrings. The specification of string indexing for the WWW should not be seen as a general recommendation for the use of string indexing for substring identification. As an example, in the case of translation of a document from one language to another, identification of substrings based on document structure can be expected to be much more stable than identification based on string indexing.

Note: Because of the wide variability of scripts and characters, different operations may be required to work at different levels of aggregation or subdivision. String indexing as discussed in this section is only intended to provide a base for such operations; it cannot address all levels concurrently.

The issue of indexing origin, i.e. whether the first character in a string is indexed as character number 0 or as character number 1, will not be addressed here.

4.2 String indexing shall behave consistently across implementations

This is the basic functional requirement for indexing. It means that the specification has to be without options.

The basic consistency test is the following:

1. On system A, take any string of characters.
2. In that string, identify a substring by using appropriate indices.
3. Transmit the string (potentially undergoing transformations such as transcoding and normalization) to system B.
4. Use the same indices as in step 2 to identify a substring in the received string.
5. If the substring identified is the same as that identified in step 2, then the test is successful.

The requirement is fulfilled if the test is successful for all strings of characters and all combinations of systems.
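The consistency test can be sketched as a small function. The sketch assumes codepoint indexing as provided by Python's `str`, models transmission as a caller-supplied transformation, and judges "the same substring" by canonical equivalence (NFC); all three are assumptions made for illustration, and the name `consistency_test` is hypothetical.

```python
import unicodedata


def same(a: str, b: str) -> bool:
    """Substring identity judged by canonical equivalence (NFC assumed)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def consistency_test(text: str, start: int, end: int, transit) -> bool:
    """Steps 1 to 5: select on system A, transmit, reselect on system B, compare."""
    selected_on_a = text[start:end]       # step 2
    received = transit(text)              # step 3
    selected_on_b = received[start:end]   # step 4
    return same(selected_on_a, selected_on_b)  # step 5


original = "Zu\u0308rich"   # decomposed "ü": 7 codepoints

# Transmission that leaves the string untouched: the indices stay valid.
unchanged = consistency_test(original, 3, 7, lambda s: s)

# Transmission that composes "u" + U+0308 into U+00FC: every later index shifts
# by one, so the same indices now select "ich" instead of "rich".
renormalized = consistency_test(
    original, 3, 7, lambda s: unicodedata.normalize("NFC", s)
)

print(unchanged, renormalized)
```

The second case fails even though the transmitted string is canonically equivalent to the original, which is why indexing and normalization cannot be specified independently of each other.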

4.3 String indexing shall take into account user expectations

Tools and programs are supposed to hide most of the indexing values from the end users. However, the fact that direct editing/manipulation was possible was one of the (unexpected) reasons for the success of the WWW. Also, in the complex infrastructure of the WWW, it is impossible to define a clear and strict boundary between what is manipulated by programs and what is seen and manipulated by the users. Therefore, it is highly desirable that something seen as one single character by the user is indeed counted as one character. However, there may be cases where for the same characters, there are differences in the perceptions of users using various languages, or even of users using one and the same language. In this case, an ideal solution is not possible. Preference should be given to a solution which, although not corresponding to user expectations, can be understood by as many users as possible (e.g. treat each character in the Klingon alphabet as occupying two index positions).

This requirement may be in conflict with requirement 4.6 (because user expectations and actual encoding might be different). Because neither requirement is absolute, no indication of relative priorities has been given here.

4.4 String indexing shall be able to address "characters" at various levels

Because of the variability of what a "character" can mean in different scripts and to different people (for the same script), string indexing should permit the designation of characters at various levels of resolution appropriate for the task at hand. This can in principle be achieved by indexing on the finest granularity possible, or by indexing of subelements. Although subelement indexing might not be defined in the first version of the character model, and might not be implemented everywhere, the necessary precautions for syntax extensibility and fallbacks should be taken care of and defined up-front wherever applicable.

4.5 String indexing shall be forward-compatible

It is impossible to predict what characters might be added to the UCS in the future. String indexing should be specified so as to try to minimize the impact of future additions to the UCS on the specification and its implementations.

One category of additions that warrants particular attention, both because it has occurred relatively frequently in the past and because it may affect string indexing directly, is the addition of new precomposed forms for which decomposed equivalents are already available.

4.6 String indexing shall be feasible to implement

Indexing into a string of characters is a very frequent operation. Ease of implementation is therefore crucial. If string indexing is based on early uniform normalization, then this may help to make implementation easier.

4.7 The string indexing specification shall be prepared quickly

Several upcoming W3C specifications depend on a clear character model and in particular on clear definitions for string indexing. It is therefore crucial that no time is lost.

Appendix: Details about users of the resulting specification

This appendix gives some additional details about users of the specification that will result from the requirements in this document. This is intended to give some very short background to readers not familiar with some of the work of the W3C, as well as to make sure that the requirements of these groups are well understood.

Note: The specifications discussed below are still in progress. The summaries are based on the current state, as publicly known. Changes may occur at any time.

DOM (Document Object Model, see http://www.w3.org/DOM/): A series of API definitions to access and manipulate documents, both document structure and textual content. Currently, APIs for basic functionality for HTML and XML, with bindings to programming languages such as Java, ECMAScript, and C. All string parameters in the APIs are defined as Unicode strings. To assure consistent behavior of programs written in different languages and running on different implementations, uniform normalization and string indexing specifications are necessary.
XLL (eXtensible Linking Language): Linking support for XML. XLL defines the #anchor syntax component of URIs for XML. A syntax for identifying elements in a document tree (e.g. based on element names that can contain arbitrary characters in XML), as well as for identifying portions of text, is defined. For consistent identification of portions of text, either or both of string identity matching and string indexing are necessary.
RDF (Resource Description Framework): A data model and streaming format for metadata, with search engines and inference engines as potential users. Much metadata is textual, and a basic operation is to decide whether two elements of metadata are the same or not. For consistent behavior, string identity matching is necessary.
URIs: Web addresses, with various components; pivot point for much of the WWW. How to encode arbitrary bytes into a restricted set of characters (using %HH escapes) is well defined, but which character encoding to use to encode arbitrary characters into bytes is not defined. In most cases, e.g. in proxies, comparisons are strictly binary. Without some specification for uniform normalization, some characters cannot reliably be used.
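The gap described for URIs can be illustrated with the standard `urllib.parse.quote` function. The sample word and the three encodings of it are illustrative choices; the sketch shows only that the %HH mechanism is well defined while the bytes fed into it are not.

```python
from urllib.parse import quote

name = "Z\u00fcrich"

# The same characters, escaped after encoding to bytes in two different ways.
as_utf8 = quote(name, encoding="utf-8")
as_latin1 = quote(name, encoding="iso-8859-1")

# A canonically equivalent, decomposed spelling in the same encoding.
as_decomposed = quote("Zu\u0308rich", encoding="utf-8")

# A proxy comparing these URI components in binary sees three different values.
distinct_forms = len({as_utf8, as_latin1, as_decomposed})

print(as_utf8, as_latin1, as_decomposed, distinct_forms)
```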

Glossary

This glossary does not provide exact definitions of terms but gives some background on how certain words are used in this document.

Character: Used in a loose sense to denote small units of text, where the exact definition of these units is still open.
Early Normalization: Duplicates and ambiguities are removed as close to their source as possible. This is done by normalizing them to a single representation. Because the normalization is not done by the component that carries out the identity check, normalization has to be done uniformly for all the components of the WWW.
Late Normalization: Each individual component that performs a string identity check has to take equivalences into account. This is usually done by normalizing each string to a preferred representation that eliminates duplicates and ambiguities. Because, with late normalization, normalization is done locally and on the fly, there is no need to specify a web-wide uniform normalization.
String Identity Matching: Exact matching of strings, except for encoding duplicates indistinguishable to the user. See section 2.
String Indexing: Indexing into a string to address a character or a sequence of characters. See section 4.
UCS: Universal Character Set, the character repertoire defined in parallel by [ISO 10646] and [Unicode].
WWW: World-wide Web, the collection of technologies built up starting with HTML, HTTP, and URIs, the corresponding software (servers, browsers,...), and/or the corresponding content.

References

[CSS2]: Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation 12-May-1998, http://www.w3.org/TR/REC-CSS2/.
[ISO 6937]: ISO/IEC 6937:1994, Information technology -- Coded graphic character set for text communication -- Latin alphabet.
[ISO 8859]: ISO/IEC 8859, Information technology -- 8-bit single-byte coded graphic character sets (various parts and publication dates).
[ISO 10646]: ISO/IEC 10646-1:1993, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, and its amendments.
[HTML 4.0]: Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-Dec-1997 (revised on 24-Apr-1998), http://www.w3.org/TR/REC-html40/.
[Nicol]: Gavin Nicol, The Multilingual World Wide Web, Chapter 2: The WWW As A Multilingual Application, http://www.mind-to-mind.com/i18n/multilingual-www.html#ID-2A08F773.
[RFC 2070]: F. Yergeau, G. Nicol, G. Adams, M. Dürst, Internationalization of the Hypertext Markup Language, RFC 2070, January 1997, ftp://ftp.isi.edu/in-notes/rfc2070.txt.
[RFC 2130]: C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin, P. Svanberg, The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996, RFC 2130, April 1997, ftp://ftp.isi.edu/in-notes/rfc2130.txt.
[RFC 2277]: H. Alvestrand, IETF Policy on Character Sets and Languages, RFC 2277 / BCP 18, January 1998, ftp://ftp.isi.edu/in-notes/rfc2277.txt.
[Unicode]: The Unicode Consortium, The Unicode Standard, Version 2.0, Addison-Wesley, Reading, MA, 1996.
[URI]: T. Berners-Lee, R. Fielding, L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax, work in progress, ftp://ftp.ietf.org/internet-drafts/draft-fielding-uri-syntax-03.txt, June 1998.
[XML 1.0]: Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998, http://www.w3.org/TR/REC-xml.
