Unicode® Technical Report #17

Unicode Character Encoding Model

Editors Ken Whistler (ken@unicode.org), Asmus Freytag (asmus@unicode.org)
Date 2022-11-11
This Version https://www.unicode.org/reports/tr17/tr17-9.html
Previous Version https://www.unicode.org/reports/tr17/tr17-7.html
Latest Version https://www.unicode.org/reports/tr17/
Latest Proposed Update https://www.unicode.org/reports/tr17/proposed.html
Revision 9

Summary

This document clarifies a number of the terms used to describe character encodings. It elaborates the Internet Architecture Board (IAB) three-layer “text stream” definitions from RFC 2130 into a four-layer structure more appropriate for explanation of the Unicode Standard.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

  1. The Unicode Character Encoding Model
  2. Abstract Character Repertoire
  3. Coded Character Set (CCS)
  4. Character Encoding Form (CEF)
  5. Character Encoding Scheme (CES)
  6. Character Maps
  7. Transfer Encoding Syntax
  8. Data Types and API Binding
  9. Definitions and Acronyms

1 The Unicode Character Encoding Model

This report describes a model for the structure of character encodings. The Unicode Character Encoding Model places the Unicode Standard in the context of other character encodings of all types, as well as other character encoding models such as the character architecture promoted by the Internet Architecture Board (IAB) for use on the internet [RFC 2130], or the Character Data Representation Architecture [CDRA] defined by IBM for organizing and cataloging its own proprietary array of character encodings. The Unicode Character Encoding Model extends these models to cover all the aspects of the Unicode Standard and ISO/IEC 10646 [10646]. (Common acronyms used in this text are highlighted. For a list, see Section 9 Definitions and Acronyms.)

The four levels of the Unicode Character Encoding Model can be summarized as:

ACR: Abstract Character Repertoire
the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set
a specific mapping from an abstract character repertoire to a set of nonnegative integers, which need not be contiguous
CEF: Character Encoding Form
a specific mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme
a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes
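
The four levels can be traced for a single character with a minimal Python sketch. The function name `encode_levels` is hypothetical, and the choice of UTF-32 as the encoding form and UTF-32BE as the encoding scheme is an assumption made only to keep the illustration short; Python's built-in `ord` stands in for the coded character set.

```python
import struct

def encode_levels(ch: str) -> dict:
    """Trace one abstract character through CCS, CEF and CES (illustrative only)."""
    # CCS: abstract character -> nonnegative integer (code point).
    code_point = ord(ch)
    # CEF: code point -> sequence of code units; in UTF-32 this is a
    # single 32-bit code unit with the same numeric value.
    code_units = [code_point]
    # CES: code units -> serialized bytes; UTF-32BE puts the most
    # significant byte of each code unit first.
    serialized = b"".join(struct.pack(">I", unit) for unit in code_units)
    return {"code_point": code_point, "code_units": code_units, "bytes": serialized}

levels = encode_levels("\u00E9")  # LATIN SMALL LETTER E WITH ACUTE
print(hex(levels["code_point"]), levels["code_units"], levels["bytes"].hex())
```

The abstract character repertoire itself has no representation in the code: it is the set of characters for which `ord` is defined.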

In addition to the four individual levels, there are two other related useful concepts:

CM: Character Map
a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
TES: Transfer Encoding Syntax
a reversible transform of encoded data, which may or may not contain textual data

The IAB model, as defined in Section 3.2 of [RFC 2130], distinguishes three levels: Coded Character Set (CCS), Character Encoding Scheme (CES), and Transfer Encoding Syntax (TES). However, four levels need to be defined to adequately cover the distinctions required for the Unicode character encoding model. One of these, the Abstract Character Repertoire, is implicit in the IAB model. The Unicode model also gives the TES a separate status outside the character encoding model proper, while adding an additional level between the CCS and the CES.

The following concepts are also important for the discussion:

Codespace
the numerical space spanned by the set of integers in a CCS
Code Unit
the minimal bit combination that can represent a unit of encoded text for processing or interchange (D77 in [Unicode]), typically a specified binary width in a computer architecture, such as an 8-bit byte

For other terms, see [Glossary].

The following sections give sample definitions, explanations and examples for each of the four levels, as well as the Character Map, and the Transfer Encoding Syntax. These are followed by a discussion of API Binding issues and a list of acronyms used in this document.

2 Abstract Character Repertoire

A character repertoire is defined as an unordered set of abstract characters to be encoded. The word abstract means that these objects are defined by convention. In many cases a repertoire consists of a familiar alphabet or symbol set.

Repertoires come in two types: fixed and open. In most character encodings, the repertoire is fixed, and often small. Once the repertoire is decided upon, it is never changed. Addition of a new abstract character to a given repertoire creates a new repertoire, which then will be given its own catalogue number, constituting a new object. For the Unicode Standard, on the other hand, the repertoire is inherently open. Because Unicode is intended to be the universal encoding, any abstract character that ever could be encoded is potentially a member of the set to be encoded, whether that character is currently known or not.

For the Unicode Standard, the set of allowable non-negative integers is bounded; however, it is intentionally large enough to leave room for all anticipated additions of abstract characters. Some other character sets use a more limited notion of open repertoires. For example, Microsoft has on occasion extended the repertoire of its Windows character sets by adding a handful of characters to an existing repertoire. This occurred when the EURO SIGN was added to the repertoire for a number of Windows character sets, for example. For suggestions on how to map the unassigned characters of open repertoires, see [CharMapML].

Repertoires are the entities that get CS (“character set”) values in the IBM CDRA architecture.

Examples of Character Repertoires:

2.1 Versioning

The Unicode Standard versions its repertoire by publication of major and minor editions of the standard: 1.0, 1.1, 2.0, 2.1, 3.0, ... The repertoire for each version is defined by the enumeration of abstract characters included in that version.

Repertoire extensions for the Unicode Standard are now strictly additive, even though there were several discontinuities to the earliest versions (1.0 and 1.1) affecting backwards compatibility to them, because of the merger of [Unicode] with [10646]. As of Version 2.0 the Unicode Character Encoding Stability Policies [Stability] guarantee that no character is ever removed from the repertoire.

Note: The Unicode Character Encoding Stability Policies also constrain changes to the standard in other ways. For example, many character properties are subject to consistency constraints, and some properties cannot be changed once they are assigned. Guarantees for the stability of normalization prevent the change or addition of decomposition mappings for existing encoded characters, and also constrain what kinds of characters can be added to the repertoire in future versions.

At times, there may be versions between major and minor versions of the Unicode Standard. While such update versions may amend the text of the Unicode Standard and of the Unicode Character Database [UCD], which defines Character Properties (see also [PropModel]), they do not add to the character repertoire. For more information about versions of the Unicode Standard see Versions of the Unicode Standard.

ISO/IEC 10646 extends its repertoire by a formal amendment process. As each individual amendment containing additional characters is published, it extends the 10646 repertoire. The repertoires of the Unicode Standard and ISO/IEC 10646 are kept in alignment by coordinating the publication of major versions of the Unicode Standard with the publication of a well-defined list of amendments for 10646 or with a major revision and republication of 10646.

2.2 Characters versus Glyphs

The elements of the character repertoire are abstract characters. Abstract characters are defined by their identity, which is not limited to their appearance, but may be defined in part by particular properties or membership in a script. In particular, characters differ from glyphs, which are the particular images representing a character or part of a character. Glyphs for the same character may have very different shapes, as shown in Figure 1 for the letter a.

Figure 1
Character | Sample Glyphs
a | [glyph images not reproduced; surviving labels: Times, Script, Decorative, Fraktur]

Glyphs do not correspond one-to-one with characters. For example, a sequence of “f” followed by “i” may be displayed with a single glyph, called an fi ligature. Notice that the shapes are merged together, and the dot is missing from the “i” as shown in Figure 2.

Figure 2
Character Sequence | Sample Glyph
f i | [fi ligature glyph; image not reproduced]

On the other hand, the same image as the fi ligature could conceivably be achieved by a sequence of two glyphs with the right shapes, as in the hypothetical example shown in Figure 3. The choice of whether to use a single glyph or a sequence of two is determined by the font and the rendering software.

Figure 3
Character Sequence | Possible Glyph Sequence
f i | [left half of fi ligature] [right half of fi ligature] (images not reproduced)

Similarly, an accented character could be represented by a single glyph, or by separate component glyphs positioned appropriately. In addition, any of the accents can also be considered characters in their own right, in which case a sequence of characters can also correspond to different possible glyph representations:

Figure 4
Character Sequence | Possible Glyph Sequences
[images not reproduced: an o with circumflex and acute, as one character and as a sequence of base letter plus accent characters, each shown with alternative renderings as a single glyph or as separately positioned component glyphs]

In non-Latin scripts, the connection between glyphs and characters is at times even less direct. Glyphs may be required to change their shape, position and width depending on the surrounding glyphs. Such glyphs are called contextual forms. For example, the Arabic character heh has the four contextual glyphs shown in Figure 5.

Figure 5
Character | Contextual Glyph Shapes
Arabic Heh | Isolated, Medial, Initial, Final (glyph images not reproduced)

In Arabic and other scripts, text inside fixed margins is justified by elongating the horizontal parts of certain glyphs, rather than by expanding the spaces between words. Ideally this is implemented by changing the shape of the glyph depending on the desired width. On some systems, this stretching is approximated by inserting extra connecting, dash-shaped glyphs called kashidas, as shown in Figure 6. In such a case, a single character may conceivably correspond to a whole sequence of kashidas + glyphs + kashidas.

Figure 6
Character | Sequence of glyphs
[images not reproduced]

In other cases, a single character must correspond to two glyphs, because those two glyphs are positioned around other letters. See the Tamil characters in Figure 7 below. If one of those glyphs forms a ligature with other characters, then a conceptual part of a character corresponds to visual part of a glyph. If a character (or any part of it) corresponds to a glyph (or any part of it), then one says that the character contributes to the glyph.

Figure 7
Character | Split Glyphs
[images not reproduced]

The correspondence between glyphs and characters is generally not one-to-one, and cannot be predicted from the text alone. Whether a particular string of characters is rendered by a particular sequence of glyphs will depend on the sophistication of the host operating system and the font. The ordering of glyphs also does not necessarily correspond to the ordering of the characters. In particular the right-to-left scripts like Arabic and Hebrew give rise to complex reordering. See UAX #9: Unicode Bidirectional Algorithm [Bidi].

2.3 Compatibility and User-perceived Characters

For historical reasons, abstract character repertoires may include many entities not considered appropriate members of an abstract character repertoire. These so-called compatibility characters may include ligature glyphs, contextual form glyphs, glyphs that vary by width, sequences of characters, and adorned glyphs, such as circled numbers. Whether a particular character represents a compatibility character may be debatable, and there is no definitive list. However, they are often characters that would have violated one or more encoding principles underlying the Unicode Standard, but which were encoded to enable lossless mapping of data from non-Unicode character encodings.

As with glyphs, there are not necessarily one-to-one relationships between characters and code points. What an end-user thinks of as a single character (also called a grapheme cluster in the context of Unicode) may in fact be represented by multiple code points; conversely, a single code point may correspond to multiple characters. Here are some examples:

Figure 8
Characters | Code Points | Notes
Arabic Heh | Isolated, Medial, Initial, Final | Arabic contextual form glyphs encoded as compatibility characters in Unicode, also known as presentation forms
fi | fi ligature | Ligature glyph encoded as compatibility character in Unicode and several character sets
Pts | Pts | A single code point representing a sequence of three characters encoded as compatibility character in Unicode and several character sets
KSHA | KA, virama, sha | The Devanagari syllable ksha represented by three code points
g-ring | g, ring above | G-ring represented by two code points

For more information on grapheme cluster boundaries see UAX #29: Unicode Text Segmentation [Boundaries].
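
The mismatch between user-perceived characters and code points can be observed directly. The following is a small sketch using Python's standard `unicodedata` module; the helper name `code_points` is hypothetical, and the specific code point values (U+030A, U+0915, U+094D, U+0937, U+FB01) are supplied here as assumptions, since Figure 8 names the characters but not their code points.

```python
import unicodedata

def code_points(text: str) -> list:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

g_ring = "g\u030A"              # g + COMBINING RING ABOVE
ksha = "\u0915\u094D\u0937"     # DEVANAGARI KA + VIRAMA + SSA
fi_ligature = "\uFB01"          # LATIN SMALL LIGATURE FI

# One user-perceived character, several code points.
print(code_points(g_ring))
print(code_points(ksha))
# One code point whose compatibility decomposition is a two-character sequence.
print(unicodedata.normalize("NFKC", fi_ligature))
```

Counting grapheme clusters properly requires the segmentation rules of UAX #29, which this sketch does not implement.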

2.4 Subsets

Unlike most character repertoires, the synchronized repertoire of Unicode and 10646 is intended to be universal in coverage. Given the complexity of many writing systems, in practice this implies that nearly all implementations will fully support only some subset of the total repertoire, rather than all the characters.

Formal subset mechanisms are occasionally seen in implementations of some Asian character sets where, for example, the distinction between “Level 1 JIS” and “Level 2 JIS” support refers to particular parts of the repertoire of the JIS X 0208 kanji characters to be included in the implementation.

Subsetting is a major formal aspect of ISO/IEC 10646. The standard includes a set of internal catalog numbers for named subsets, and further makes a distinction between subsets that are fixed collections and those that are open collections, defined by a range of code positions. Open collections are extended any time an addition to the repertoire gets encoded in a code position between the range limits defining the collection. When the last of its open code positions is filled, an open collection automatically becomes a fixed collection.

The European Committee for Standardization (CEN) has defined several multilingual European subsets of ISO/IEC 10646-1 (called MES-1, MES-2, MES-3A, and MES-3B). MES-1 and MES-2 have been added as named fixed collections in 10646.

The Unicode Standard specifies neither predefined subsets nor a formal syntax for their definition. It is left to each implementation to define and support the subset of the universal repertoire that it wishes to interpret. Many implementations will use enumerated subsets or subsets implicitly defined by the Script property or by block ranges, where required.

3 Coded Character Set (CCS)

A coded character set is defined to be a mapping from a set of abstract characters to the set of nonnegative integers. This range of integers need not be contiguous. In the Unicode Standard, the concept of the Unicode scalar value (see definition D76, in Chapter 3, "Conformance" of [Unicode]) explicitly defines such a noncontiguous range of integers.
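
The noncontiguous range of Unicode scalar values can be written as a simple predicate. This is a sketch only: the function name `is_unicode_scalar_value` is hypothetical, and the two ranges used (0..D7FF and E000..10FFFF, that is, the codespace without the surrogate code points) are stated here from the Unicode Standard rather than from this report.

```python
def is_unicode_scalar_value(n: int) -> bool:
    """True if n lies in 0..D7FF or E000..10FFFF."""
    # The gap D800..DFFF is reserved for UTF-16 surrogate code units,
    # so the set of scalar values is not one contiguous range.
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF

print(is_unicode_scalar_value(0x0041))    # inside the first range
print(is_unicode_scalar_value(0xD800))    # inside the gap
print(is_unicode_scalar_value(0x110000))  # beyond the codespace
```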

An abstract character is defined to be in a coded character set if the coded character set maps from it to an integer. That integer is the code point to which the abstract character has been assigned. That abstract character is then an encoded character.

Coded character sets are the basic object that both ISO and proprietary character encoding committees produce. They relate a defined repertoire to nonnegative integers, which then can be used unambiguously to refer to particular abstract characters from the repertoire.

A coded character set may also be known as a character encoding, a coded character repertoire, a character set definition, or a code page.

In the IBM CDRA architecture, CP (“code page”) values refer to coded character sets. Note that this use of the term code page is quite precise and limited. It should not be, but generally is, confused with the generic use of code page to refer to character encoding schemes.

Examples of Coded Character Sets:

Name | Repertoire
JIS X 0208 | assigns pairs of integers known as kuten points
ISO/IEC 8859-1 | ASCII plus Latin-1
ISO/IEC 8859-2 | different repertoire than 8859-1, although both use the same codespace
Code Page 037 | same repertoire as 8859-1; different integers assigned to the same characters
Code Page 500 | same repertoire as 8859-1 and Code Page 037; different integers
Windows Code Page 1252 | contains subset of repertoire of 8859-1 at the same integers, but also Windows-specific additions
The Unicode Standard, Version 2.0; ISO/IEC 10646-1:1993 plus amendments 1-7 | exactly the same repertoire and mapping
The Unicode Standard, Version 3.0; ISO/IEC 10646-1:2000 | exactly the same repertoire and mapping
The Unicode Standard, Version 4.0; ISO/IEC 10646:2003 | exactly the same repertoire and mapping
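
The difference between sharing a repertoire and sharing a mapping can be checked with a short sketch. It assumes that Python's codecs named `latin-1`, `cp037`, `cp500` and `cp1252` correspond to ISO/IEC 8859-1, Code Page 037, Code Page 500 and Windows Code Page 1252 respectively; the helper `code_values` is hypothetical.

```python
def code_values(ch: str, charsets=("latin-1", "cp037", "cp500", "cp1252")) -> dict:
    """Map each charset name to the integer it assigns to ch, or None if ch is absent."""
    values = {}
    for name in charsets:
        try:
            encoded = ch.encode(name)
        except UnicodeEncodeError:
            # ch is not in the repertoire of this coded character set.
            values[name] = None
            continue
        # All four charsets are single-byte, so the first byte is the integer.
        values[name] = encoded[0]
    return values

print(code_values("A"))        # same character, different integers in EBCDIC
print(code_values("\u20AC"))   # EURO SIGN: a Windows-specific addition
```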

This document does not attempt to list all versions of the Unicode Standard. See Versions of the Unicode Standard for the complete list of versions and for information on how they match with particular versions and amendments of 10646.

3.1 Character Naming

SC2, the JTC1 subcommittee responsible for character coding, requires the assignment of a unique character name for each abstract character in the repertoire of its coded character sets. This practice is not generally followed in proprietary coded character sets or in the encodings produced by standards committees outside SC2, in which any names provided for characters are often variable and annotative, rather than normative parts of the character encoding.

The main rationale for the SC2 practice of character naming is to provide a mechanism to unambiguously identify abstract characters across different repertoires given different mappings to integers in different coded character sets. Thus LATIN SMALL LETTER A WITH GRAVE would be the same abstract character, even though it occurs in different repertoires and is assigned different integers in different coded character sets.

The IBM CDRA [CDRA], on the other hand, ensures character identity across different coded character sets (or code pages) by assigning a catalogue number known as a GCGID (graphic character global identifier), to every abstract character used in any of the repertoires accounted for by the CDRA. Abstract characters that have the same GCGID in two different coded character sets are by definition the same character. Other vendors have made use of similar internal identifier systems for abstract characters.

The advent of Unicode/10646 has largely rendered such schemes obsolete. The identity of abstract characters in all other coded character sets is increasingly defined by reference to Unicode/10646. Part of the pressure to include every “character” from every existing coded character set into the Unicode Standard results from the desire to get rid of subsidiary mechanisms for tracking bits and pieces that are not part of Unicode, and instead just use the Unicode Standard as the universal catalog of characters.

3.2 Codespaces

The set of nonnegative integers used to map abstract characters defines a related concept of codespace. Traditionally, the outer boundaries for codespaces are closely tied to the encoding forms (see below), because the mappings of abstract characters to nonnegative integers are done with particular encoding forms in mind. Examples of common boundaries for codespaces are 0..7F, 0..FF, 0..FFFF, 0..7FFFFFFF, and 0..FFFFFFFF. The codespace for the Unicode Standard is bounded by 0..10FFFF.

Codespaces can also have elaborate structures, depending on whether the range of integers is contiguous, or whether particular ranges of values are disallowed. Most complications result from considerations of the encoding form for characters. When an encoding form specifies that the integers being encoded are to be serialized as sequences of bytes, there are often constraints placed on the particular values that those bytes may have. Most commonly such constraints disallow byte values corresponding to control functions. In terms of codespace, such constraints on byte values result in multiple non-contiguous ranges of integers that are disallowed for mapping a character repertoire. (See [Lunde] for two-dimensional diagrams of typical codespaces for East Asian coded character sets implementing such constraints.)

Note: In ISO standards the term octet is used for an 8-bit byte. In this document, the term byte is used consistently for an 8-bit byte only.

4 Character Encoding Form (CEF)

A character encoding form is a mapping from the set of integers used in a CCS to the set of sequences of code units. A code unit is an integer occupying a specified binary width in a computer architecture, such as an 8-bit byte or a 32-bit word. The encoding form enables character representation as actual data in a computer. The sequences of code units do not necessarily have the same length.

A character encoding form for a coded character set is defined to be a character encoding form that maps all of the encoded characters for that coded character set.

Note: In many cases, there is only one character encoding form for a given coded character set. In some such cases only the character encoding form has been specified. This leaves the coded character set implicitly defined, based on an implicit relation between the code unit sequences and integers.

When interpreting a sequence of code units, there are three possibilities:

  1. The sequence is ill-formed.
    The sequence is incomplete or otherwise fails to match the specification of the encoding form. For example,
    • 0xA3 is incomplete in CP950. Unless followed by another byte of the right form, it is ill-formed.
    • 0xD800 is incomplete in UTF-16. Unless followed by another 16-bit value of the right form, it is ill-formed.
    • 0xC0 is ill-formed in UTF-8. It cannot be the initial byte (or for that matter, any byte) of a well-formed UTF-8 sequence.
    For details on ill-formed sequences for UTF-8 and UTF-16, see Section 3.9, Unicode Encoding Forms, in [Unicode].
  2. The sequence represents a valid code point, but is unassigned. This sequence may be given an assignment in some future, evolved version of the character encoding. For suggestions on how to handle unassigned characters in mapping, see [CharMapML]. For example,
    • 0xA3 0xBF is unassigned in CP950, as of the year 1999.
    • 0x0EDE is unassigned in Unicode 5.0
  3. The source sequence is assigned: it represents a valid encoded character. There are three variants of this:
    First is a typical assigned character. For example,
    • 0x0EDD is assigned in Unicode 5.0
    The second variant is a user-defined character. For example,
    • 0xE000 is an assigned user-defined character whose semantic interpretation is left to agreement between parties outside of the context of the standard.
    The third type is peculiar to the Unicode Standard: the noncharacter. This is a kind of internal-use user-defined character, not intended for public interchange. For example,
    • 0xFFFF is an assigned noncharacter in Unicode 5.0
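
The first possibility, an ill-formed sequence, can be detected with a strict decoder. The sketch below relies on Python's built-in `utf-8` and `utf-16-le` codecs as stand-ins for the encoding forms; the helper `is_well_formed` and the variable names are hypothetical. It does not distinguish unassigned from assigned code points, which depends on the version of the character encoding in use.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the named Python codec."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# 0xD800 alone is incomplete in UTF-16; followed by 0xDC00 it forms a pair.
lone_high_surrogate = (0xD800).to_bytes(2, "little")
surrogate_pair = lone_high_surrogate + (0xDC00).to_bytes(2, "little")

print(is_well_formed(b"\xC0", "utf-8"))                  # 0xC0 never occurs in UTF-8
print(is_well_formed(lone_high_surrogate, "utf-16-le"))
print(is_well_formed(surrogate_pair, "utf-16-le"))
```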

The encoding form for a CCS may result in either fixed-width or variable-width sequences of code units associated with abstract characters. The encoding form may involve an arbitrary reversible mapping of the integers of the CCS to a set of code unit sequences.

Encoding forms come in various types. Some of them are exclusive to Unicode/10646, whereas others represent general patterns that are repeated over and over for hundreds of coded character sets. Some of the more important examples of encoding forms follow.

Examples of fixed-width encoding forms:

Type | Each character encoded as | Notes
7-bit | a single 7-bit quantity | example: ISO 646
8-bit G0/G1 | a single 8-bit quantity | with constraints on use of C0 and C1 spaces
8-bit | a single 8-bit quantity | with no constraints on use of C1 space
8-bit EBCDIC | a single 8-bit quantity | with the EBCDIC conventions rather than ASCII conventions
16-bit (UCS-2) | a single 16-bit quantity | within a codespace of 0..FFFF
32-bit (UCS-4) | a single 32-bit quantity | within a codespace 0..7FFFFFFF
32-bit (UTF-32) | a single 32-bit quantity | within a codespace of 0..10FFFF
16-bit DBCS process code | a single 16-bit quantity | example: UNIX widechar implementations of Asian CCSes
32-bit DBCS process code | a single 32-bit quantity | example: UNIX widechar implementations of Asian CCSes
DBCS Host | two 8-bit quantities | following IBM host conventions

Examples of variable-width encoding forms:

Name | Characters are encoded as | Notes
UTF-8 | a mix of one to four 8-bit code units | used only with Unicode/10646
UTF-16 | a mix of one to two 16-bit code units | used only with Unicode/10646

The encoding form defines one of the fundamental aspects of an encoding: how many code units there are for each character. The number of code units per character is important to internationalized software. Formerly this was equivalent to how many bytes each character was represented by. With the introduction by Unicode and 10646 of wider code units for UCS-2, UTF-16, UCS-4, and UTF-32, this is generalized to two pieces of information: a specification of the width of the code unit, and the number of code units used to represent each character. The UCS-2 encoding form, which is associated with ISO/IEC 10646 and can only express the subset of characters in the BMP, is a fixed-width encoding form. In contrast, UTF-16 uses either one or two code units and is able to cover the entire codespace of Unicode.
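
As an illustration of the one-or-two code unit behavior of UTF-16, the following sketch computes the code units for a given scalar value. The function name `utf16_code_units` is hypothetical; the surrogate arithmetic follows the UTF-16 definition in the Unicode Standard rather than anything spelled out in this report.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (one or two 16-bit integers) for a scalar value."""
    if not (0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF):
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # BMP: a single code unit, identical to the UCS-2 representation.
        return [code_point]
    # Supplementary planes: split the 20-bit offset across a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

print([hex(u) for u in utf16_code_units(0x0041)])
print([hex(u) for u in utf16_code_units(0x1F600)])
```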

UTF-8 provides a good example. In UTF-8, the fundamental code unit used for representing character data is 8 bits wide (that is, a byte or octet). The width map for UTF-8 is:

0x00..0x7F | 1 byte
0x80..0x7FF | 2 bytes
0x800..0xD7FF, 0xE000..0xFFFF | 3 bytes
0x10000..0x10FFFF | 4 bytes
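
The width map translates directly into code. This is a minimal sketch with a hypothetical function name, `utf8_width`; it can be cross-checked against the length of the byte string produced by Python's own UTF-8 encoder.

```python
def utf8_width(code_point: int) -> int:
    """Return the number of 8-bit code units UTF-8 uses for a scalar value."""
    if 0x00 <= code_point <= 0x7F:
        return 1
    if 0x80 <= code_point <= 0x7FF:
        return 2
    if 0x800 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0xFFFF:
        return 3
    if 0x10000 <= code_point <= 0x10FFFF:
        return 4
    # Surrogate code points and values outside the codespace have no UTF-8 form.
    raise ValueError("not a Unicode scalar value")

for cp in (0x41, 0xE9, 0x65E5, 0x1D11E):
    print(hex(cp), utf8_width(cp), len(chr(cp).encode("utf-8")))
```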

Examples of encoding forms as applied to particular coded character sets:

Name | Encoding forms
JIS X 0208 | generally transformed from the kuten notation to a 16-bit “JIS code” encoding form, for example "nichi", 38 92 (kuten) → 0x467C JIS code
ISO 8859-1 | has the 8-bit G0/G1 encoding form
CP 037 | 8-bit EBCDIC encoding form
CP 500 | 8-bit EBCDIC encoding form
US ASCII | 7-bit encoding form
ISO 646 | 7-bit encoding form
Windows CP 1252 | 8-bit encoding form
Unicode 4.0, 5.0 | UTF-16, UTF-8, or UTF-32 encoding form
Unicode 3.0 | either UTF-16 (default) or UTF-8 encoding form
Unicode 1.1 | either UCS-2 (default) or UTF-8 encoding form
ISO/IEC 10646:2003 | depending on the declared implementation levels, may have UCS-2, UCS-4, UTF-16, or UTF-8
ISO/IEC 10646:2020 | UTF-8, UTF-16, or UTF-32
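
The kuten example in the table can be reproduced with a few lines of arithmetic. The sketch assumes the conventional transformation, in which 0x20 is added to both the row (ku) and the cell (ten) before they are combined into one 16-bit value; the report gives only the resulting value, so the offset and the function name `kuten_to_jis` are assumptions of this example.

```python
def kuten_to_jis(ku: int, ten: int) -> int:
    """Combine a kuten pair (row and cell, each 1..94) into a 16-bit JIS code."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    # Each half is shifted into the range 0x21..0x7E, then the row
    # becomes the high byte and the cell the low byte.
    return ((ku + 0x20) << 8) | (ten + 0x20)

print(hex(kuten_to_jis(38, 92)))  # the "nichi" example from the table
```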

5 Character Encoding Scheme (CES)

A character encoding scheme (CES) is a reversible transformation of sequences of code units to sequences of bytes in one of three ways:

  1. A simple CES uses a mapping of each code unit of a CEF into a unique serialized byte sequence in order.
  2. A compound CES uses two or more simple CESs, plus a mechanism to shift between them. This mechanism includes bytes (for example single shifts, SI/SO, or escape sequences) that are not part of any of the simple CESs, but which are defined by the character encoding architecture and which may require an external registry of particular values (such as for the ISO 2022 escape sequences).

    The nature of a compound CES means there may be different sequences of bytes corresponding to the same sequence of code units. While these sequences are not unique, the original sequence of code units can be recovered unambiguously from any of these.

  3. A compressing CES maps a code unit sequence to a byte sequence while minimizing the length of the byte sequence. Some compressing CESs are designed to produce a unique sequence of bytes for each sequence of code units, so that the compressed byte sequences can be compared for equality or ordered by binary comparison. Other compressing CESs are merely reversible.
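
A compound CES can be observed with Python's `iso2022_jp` codec, used here on the assumption that it is an adequate stand-in for an ISO 2022-based scheme; the wrapper name `encode_iso2022_jp` is hypothetical. The escape sequence in the output is not part of either simple CES; it switches from the ASCII portion to the JIS X 0208 portion of the stream.

```python
def encode_iso2022_jp(text: str) -> bytes:
    """Encode text with Python's 'iso2022_jp' codec, an ISO 2022-based compound CES."""
    return text.encode("iso2022_jp")

# LATIN CAPITAL LETTER A followed by the kanji "nichi" (U+65E5).
encoded = encode_iso2022_jp("A\u65E5")
print(encoded)

# The shift mechanism is reversible: decoding recovers the original text.
print(encoded.decode("iso2022_jp") == "A\u65E5")
```

The two bytes that follow the escape sequence are the JIS code 0x467C for "nichi", serialized high byte first.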

Character encoding schemes are relevant to the issue of cross-platform persistent data involving code units wider than a byte, where byte-swapping may be required to put data into the byte polarity which is used for a particular platform. In particular:

It is important not to confuse a Character Encoding Form (CEF) and a CES.

  1. The CEF maps code points to code units, while the CES transforms sequences of code units to byte sequences. (For a direct mapping from characters to serialized bytes, see Section 6 Character Maps.)
  2. The CES must take into account the byte-order serialization of all code units wider than a byte that are used in the CEF.
  3. Otherwise identical CESs may differ in other aspects, such as the number of user-defined characters allowed. (This applies in particular to the IBM CDRA architecture, which may distinguish host CCSIDs based on whether the set of UDCes is conformably convertible to the corresponding code page or not.)

Some of the Unicode encoding schemes have the same labels as the three Unicode encoding forms. When used without qualification, the terms UTF-8, UTF-16, and UTF-32 are ambiguous between their sense as Unicode encoding forms and as Unicode encoding schemes. This ambiguity is usually innocuous for UTF-8 because the UTF-8 encoding scheme is trivially derived from the byte sequences defined for the UTF-8 encoding form. However, for UTF-16 and UTF-32, the ambiguity is more problematical. As encoding forms, UTF-16 and UTF-32 refer to code units as they are accessed from memory via 16-bit or 32-bit data types; there is no associated byte orientation, and a BOM is never used. (Viewing memory in a debugger or casting wider data types to byte arrays is a byte serialization.)

As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, for example the serialized bytes for streaming data or in files; they may have either byte orientation, and a single BOM may be present at the start of the data. When the usage of the abbreviated designators UTF-16 or UTF-32 might be misinterpreted, and where a distinction between their use as referring to Unicode encoding forms or to Unicode encoding schemes is important, the full terms should be used. For example, use UTF-16 encoding form or UTF-16 encoding scheme. They may also be abbreviated to UTF-16 CEF or UTF-16 CES, respectively.
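
The distinction between the UTF-16 encoding form and the UTF-16 encoding schemes can be made concrete. In this sketch the code units are plain Python integers with no byte orientation, and serialization is a separate step; the function `serialize_utf16` and its `"BE"`/`"LE"` labels are hypothetical names chosen for the example.

```python
import struct

# Encoding form: a sequence of 16-bit code units, here "A" followed by
# U+1D11E as a surrogate pair. No byte order is involved at this level.
code_units = [0x0041, 0xD834, 0xDD1E]

def serialize_utf16(units, byte_order: str) -> bytes:
    """Serialize 16-bit code units to bytes; byte_order is 'BE' or 'LE'."""
    fmt = {"BE": ">H", "LE": "<H"}[byte_order]
    return b"".join(struct.pack(fmt, unit) for unit in units)

# Encoding schemes: the same code units yield different byte sequences.
print(serialize_utf16(code_units, "BE").hex())
print(serialize_utf16(code_units, "LE").hex())
```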

Examples of Unicode Character Encoding Schemes:

Examples of Non-Unicode Character Encoding Schemes:

Examples of compressing Character Encoding Schemes:

5.1Byte Order

Processor architectures differ in the way that multi-byte machine integers are mapped to storage locations.Little Endian architectures put the least significant byte at the lower address, whileBig Endian architectures start with the most significant byte.

This difference does not matter for operations on code units in memory, but the byte order becomes important when code units are serialized to sequences of bytes using a particularCES. In terms of reading a data stream, there are two types of byte order:Sameas orOppositeof the byte order of the processor reading the data. In the former case, no special operation needs to be taken; in the latter case, the data needs to be byte reversed before processing.

In terms of external designation of data streams, three types of byte orders can be distinguished: Big Endian (BE), Little Endian (LE), and default or internally marked.

In Unicode, the character at code point U+FEFF is defined as the byte order mark, while its byte-reversed counterpart is a noncharacter (U+FFFE) in UTF-16, or outside the codespace (0xFFFE0000) for UTF-32. At the head of a data stream, the presence of a byte order mark can therefore be used to unambiguously signal the byte order of the code units.
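A minimal sketch of reading the byte order from the head of a stream is shown below. The function name detect_bom and the returned labels are illustrative assumptions. The sketch also assumes the caller cannot rule out either code unit width, so it tests the 4-byte signatures first; a UTF-16 little-endian stream that begins with a BOM followed by U+0000 would be misread as UTF-32, which is why the code unit width should be known from context where possible.

```python
def detect_bom(data: bytes):
    """Return (label, bom_length) for a leading BOM, or (None, 0) if none is found."""
    # The 4-byte UTF-32 signatures are checked first because FF FE 00 00
    # also starts with the 2-byte UTF-16 little-endian signature FF FE.
    signatures = [
        (b"\x00\x00\xfe\xff", "UTF-32BE"),
        (b"\xff\xfe\x00\x00", "UTF-32LE"),
        (b"\xfe\xff", "UTF-16BE"),
        (b"\xff\xfe", "UTF-16LE"),
    ]
    for bom, label in signatures:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```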

6 Character Maps

The mapping from a sequence of members of an abstract character repertoire to a serialized sequence of bytes is called a Character Map (CM). A simple character map thus implicitly includes a CCS, a CEF, and a CES, mapping from abstract characters to code units to bytes. A compound character map includes a compound CES, and thus includes more than one CCS and CEF. In that case, the abstract character repertoire for the character map is the union of the repertoires covered by the coded character sets involved.
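As a rough illustration, a simple character map can be written as the composition of its three layers. The function names below are hypothetical, and the sketch uses a Python str as a stand-in for a sequence of abstract characters, which is a simplification of the model.

```python
def ccs(characters: str) -> list[int]:
    """Coded character set step: abstract characters to code points."""
    return [ord(ch) for ch in characters]

def cef_utf16(code_points: list[int]) -> list[int]:
    """Encoding form step: code points to 16-bit code units."""
    units = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError("not a Unicode scalar value: %X" % cp)
        if cp < 0x10000:
            units.append(cp)
        else:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))       # lead surrogate
            units.append(0xDC00 + (offset & 0x3FF))     # trail surrogate
    return units

def ces_utf16be(code_units: list[int]) -> bytes:
    """Encoding scheme step: code units to bytes, most significant byte first."""
    return b"".join(unit.to_bytes(2, "big") for unit in code_units)

def character_map_utf16be(characters: str) -> bytes:
    """A simple character map: CCS, then CEF, then CES."""
    return ces_utf16be(cef_utf16(ccs(characters)))
```

Calling character_map_utf16be on a string should give the same bytes as Python's built-in "utf-16-be" codec for well-formed input.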

Unicode Technical Report #22: Character Mapping Markup Language [CharMapML] defines an XML specification for representing the details of Character Maps. The text also contains a detailed discussion of issues in mapping between character sets.

Character Maps are the entities that get IANA charset [Charset] identifiers in the IAB architecture. From the IANA charset point of view it is important that a sequence of encoded characters be unambiguously mapped onto a sequence of bytes by the charset. The charset must be specified in all instances, as in Internet protocols, where textual content is treated as an ordered sequence of bytes, and where the textual content must be reconstructible from that sequence of bytes.

In the IBM CDRA architecture, Character Maps are the entities that get CCSID (coded character set identifier) values. A character map may also be known as a charset, a character set, a code page (broadly construed), or a CHARMAP.

In many cases, the same name is used for both a character map and for a character encoding scheme, such as UTF-16BE. Typically this is done for simple character mappings when such usage is clear from context.

7 Transfer Encoding Syntax (TES)

A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes.

Typically TESs are engineered to transform one byte stream into another, while avoiding particular byte values that would confuse one or more Internet or other transmission/storage protocols. Examples include base64, uuencode, BinHex, and quoted-printable. While data transfer protocols often incorporate data compression to minimize the number of bits to be passed down a communication channel, compression is usually handled outside the TES, for example by protocols such as pkzip, gzip, or winzip.

The Internet Content-Transfer-Encoding tags “7bit” and “8bit” are special cases. These are data width specifications which are relevant to mail protocols and which appear to predate true TESs like quoted-printable. Encountering a “7bit” tag does not imply any actual transform of data; it merely indicates that the charset of the data can be represented in 7 bits, and will pass 7-bit channels; it really indicates the encoding form. In contrast, quoted-printable actually converts various characters (including some ASCII) to forms like “=2D” or “=20”, and should be reversed on receipt to regenerate legible text in the designated character encoding scheme.
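The layering of a TES on top of a character encoding scheme can be sketched with Python's standard quopri and base64 modules. The sample text is an arbitrary assumption chosen so that one character falls outside ASCII; the point is only that the transform operates on bytes and is reversed before the bytes are interpreted in their encoding scheme.

```python
import base64
import quopri

text = "café - menu"
payload = text.encode("utf-8")           # bytes in the UTF-8 encoding scheme

qp = quopri.encodestring(payload)        # quoted-printable: C3 A9 becomes "=C3=A9"
b64 = base64.b64encode(payload)          # base64: every byte is re-encoded

# Reversing the TES yields the original bytes, which are then decoded
# according to the character encoding scheme (UTF-8 here).
restored_qp = quopri.decodestring(qp).decode("utf-8")
restored_b64 = base64.b64decode(b64).decode("utf-8")
```

Both transformed forms contain only ASCII byte values, which is what lets them pass through channels that would otherwise damage the original bytes.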

8 Data Types and API Binding

Programming languages define specific data types for character data, using bytes or multi-byte code units. For example, the char data type in Java or C# always uses 16-bit code units, while the sizes of the char and wchar_t data types in C and C++ are, within quite flexible constraints, implementation defined. In Java or C#, the 16-bit code units are by definition UTF-16 code units, while in C and C++, the binding to a specific character set is again up to the implementation. In Java, strings are an opaque data type, while in C (and at the lowest level also in C++) they are represented as simple arrays of char or wchar_t.

The Java model supports portable programs, but external data in other encoding forms must first be converted to UTF-16. The C/C++ model is intended to support a byte serialized character set using the char data type, while supporting a character set with a single code unit per character with the wchar_t data type. These two character sets do not have to be the same, but the repertoire of the larger set must include the smaller set to allow mapping from one data type into the other. This allows implementations to support UTF-8 as the char data type and UTF-32 as the wchar_t data type, for example. In such use, the char data type corresponds to data that is serialized for storage and interchange, and the wchar_t data type is used for internal processing. There is no guarantee that wchar_t represents characters of a specific character set. However, a standard macro, __STDC_ISO_10646__, can be used by an environment to designate that it supports a specific version of 10646, indicated by year and month.

However, the definition of the term character in the ISO C and C++ standard does not necessarily match the definition of abstract character in this model. Many widely used libraries and operating systems define wchar_t to be UTF-16 code units. Other APIs supporting UTF-16 are often simply defined in terms of arrays of 16-bit unsigned integers, but this makes certain features of the programming language unavailable, such as string literals.
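Because the width of wchar_t is implementation defined, a program can only discover it on the platform where it runs. The following sketch uses Python's ctypes module to report the width of the C wchar_t type. The mapping from width to a likely encoding form is an assumption made for illustration, not something the C standard guarantees.

```python
import ctypes

# Width in bytes of the platform's C wchar_t, as exposed by ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Assumption for illustration only: 2 bytes commonly holds UTF-16 code units
# and 4 bytes commonly holds UTF-32 code units; neither is guaranteed.
likely_form = {2: "UTF-16", 4: "UTF-32"}.get(wchar_size, "unknown")
```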

ISO/IEC TR 19769 extends the model used in ISO C and C++ by recommending the use of two typedefs and a minimal extension to the support for character literals and runtime library. The data types char16_t and char32_t are unsigned integers designed to hold one code unit for UTF-16 or UTF-32 respectively. Like wchar_t they can be used generically for any character set, but predefined macros __STDC_UTF_16__ and __STDC_UTF_32__ can be used to indicate that the data type char16_t or char32_t holds code units that are in the respective Unicode encoding form.

When character data types are passed as arguments in APIs, the byte order of the platform is generally not relevant for code units. The same API can be compiled on platforms with any byte polarity, and will simply expect character data (as for any integral-based data) to be passed to the API in the byte polarity for that platform. However, the size of the data type must correspond to the size of the code unit, or the results can be unpredictable, as when a byte oriented strcpy is used on UTF-16 data which may contain embedded NUL bytes.
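The failure mode mentioned above can be simulated without C. In the sketch below, c_string_length is a hypothetical stand-in for a byte-oriented routine such as strlen, and utf16_string_length is a hypothetical routine whose unit size matches the data. The buffer is assumed to hold UTF-16 code units laid out in little-endian order.

```python
def c_string_length(buffer: bytes) -> int:
    """Mimic a byte-oriented strlen: count bytes before the first 0x00 byte."""
    end = buffer.find(b"\x00")
    return len(buffer) if end < 0 else end

def utf16_string_length(buffer: bytes) -> int:
    """Count 16-bit code units before the first 0x0000 code unit."""
    for index in range(0, len(buffer) - 1, 2):
        if buffer[index:index + 2] == b"\x00\x00":
            return index // 2
    return len(buffer) // 2

# "Hi" as two 16-bit code units plus a 16-bit terminator: 48 00 69 00 00 00
utf16_buffer = "Hi".encode("utf-16-le") + b"\x00\x00"

seen_by_byte_api = c_string_length(utf16_buffer)         # stops at the first 00 byte
seen_by_16bit_api = utf16_string_length(utf16_buffer)    # sees both code units
```

The byte-oriented routine stops inside the first code unit, so a byte-oriented copy would truncate the string after a single byte.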

While there are many API functions that are designed not to care about which character set the code units correspond to (strlen or strcpy for example), many other operations require information about the character and its properties. As a result, portable programs may not be able to use the char or wchar_t data types in C/C++.

8.1 Strings

A string data type is simply a sequence of code units. Thus a Unicode 8-bit string is a sequence of 8-bit Unicode code units; a Unicode 16-bit string is a sequence of 16-bit code units; a Unicode 32-bit string is a sequence of 32-bit code units.
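To make the three string types concrete, the following sketch counts the code units that one sample text occupies in each form. The helper code_unit_count and the sample text are illustrative assumptions; the byte order chosen for the intermediate encoding does not affect the counts.

```python
def code_unit_count(text: str, bits: int) -> int:
    """Number of code units text occupies as a Unicode 8-, 16-, or 32-bit string."""
    codec = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[bits]
    return len(text.encode(codec)) // (bits // 8)

sample = "a\u00e9\u20ac\U0001F600"       # U+0061, U+00E9, U+20AC, U+1F600
counts = {bits: code_unit_count(sample, bits) for bits in (8, 16, 32)}
# counts is {8: 10, 16: 5, 32: 4} for these four code points
```

The same four code points therefore give strings of three different lengths, depending on the code unit size.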

Depending on the programming environment, a Unicode string may or may not also be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, there are many times where a string may be in a transient state that is not well-formed UTF-16. Because strings are such a fundamental component of every program, it can be far more efficient to postpone checking for well-formedness.

However, whenever strings are specified to be in a particular Unicode encoding (even one with the same code unit size), the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well-formed UTF-16.
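A deferred well-formedness check for a Unicode 16-bit string might look like the following sketch. The function name is hypothetical, and the input is assumed to be a list of integers that each fit in 16 bits; the check only looks for isolated surrogates, which is the requirement named above.

```python
def is_well_formed_utf16(units: list[int]) -> bool:
    """Return True if a sequence of 16-bit code units is well-formed UTF-16."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:                     # lead surrogate
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False                             # no trail surrogate follows
            i += 2                                       # skip the complete pair
        elif 0xDC00 <= unit <= 0xDFFF:                   # trail surrogate without a lead
            return False
        else:
            i += 1
    return True
```

A program can run such a check at the boundary where a string is handed to a component that requires well-formed UTF-16, rather than after every intermediate operation.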

9 Definitions and Acronyms

This section briefly defines some of the common acronyms related to character encoding and used in this text. More extensive definitions for some of these terms can be found elsewhere in this document.

ACR     Abstract Character Repertoire
API     Application Programming Interface
ASCII   American Standard Code for Information Interchange
BE      Big-endian (most significant byte first)
BMP     Basic Multilingual Plane, the first 65,536 characters of 10646
BOCU    Byte Ordered Compression for Unicode
CCS     Coded Character Set
CCSID   Coded Character Set Identifier
CDRA    Character Data Representation Architecture from IBM
CEF     Character Encoding Form
CEN     European Committee for Standardization
CES     Character Encoding Scheme
CM      Character Map
CP      Code Page
CS      Character Set
DBCS    Double-Byte Character Set
ECMA    European Computer Manufacturers Association
EBCDIC  Extended Binary Coded Decimal Interchange Code
EUC     Extended Unix Code
GCGID   Graphic Character Global Identifier
IAB     Internet Architecture Board
IANA    Internet Assigned Numbers Authority
IEC     International Electrotechnical Commission
IETF    Internet Engineering Task Force
ISO     International Organization for Standardization
JIS     Japanese Industrial Standard
JTC1    Joint Technical Committee 1 (responsible for ISO/IEC IT Standards)
LE      Little-endian (least significant byte first)
MBCS    Multiple-Byte Character Set (1 to n bytes per code point)
MIME    Multipurpose Internet Mail Extensions
RFC     Request For Comments (term used for an Internet standard)
RCSU    Reuters Compression Scheme for Unicode (precursor to SCSU)
SBCS    Single-Byte Character Set
SCSU    Standard Compression Scheme for Unicode
TES     Transfer Encoding Syntax
UCS     Universal Character Set; Universal Multiple-Octet Coded Character Set: the repertoire and encoding represented by ISO/IEC 10646:2003 and its amendments.
UDC     User-defined Character
UTF     Unicode (or UCS) Transformation Format

References

[10646] ISO/IEC 10646: Universal Multiple-Octet Coded Character Set.
For availability see http://www.iso.org
[Bidi] Unicode Standard Annex #9: Unicode Bidirectional Algorithm
https://www.unicode.org/reports/tr9/
[BOCU] Unicode Technical Note #6: BOCU-1: MIME-Compatible Unicode Compression
https://www.unicode.org/notes/tn6/
[Boundaries] Unicode Standard Annex #29: Unicode Text Segmentation
https://www.unicode.org/reports/tr29/
[CDRA] Character Data Representation Architecture Reference and Registry, IBM Corporation
http://www.ibm.com/software/globalization/cdra/index.jsp
[CharMapML] Unicode Technical Report #22: Character Mapping Markup Language (CharMapML)
https://www.unicode.org/reports/tr22/
[Charset] IANA charset assignments
http://www.iana.org/assignments/character-sets
[Charts] The online code charts can be found at https://www.unicode.org/charts/ An index to character names with links to the corresponding chart is found at https://www.unicode.org/charts/charindex.html
[FAQ] Unicode Frequently Asked Questions
https://www.unicode.org/faq/
For answers to common questions on technical issues.
[Glossary] Unicode Glossary
https://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[Lunde] Lunde, Ken, CJKV Information Processing, O'Reilly, 1999, ISBN 1-565-92224-7
[PropModel] Unicode Technical Report #23: The Unicode Character Property Model
https://www.unicode.org/reports/tr23/
[RFC2130] The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996. C. Weider, et al., April 1997
http://www.ietf.org/rfc/rfc2130.txt
[RFC2277] IETF Policy on Character Sets and Languages, H. Alvestrand, January 1998
http://www.ietf.org/rfc/rfc2277.txt (BCP 18)
[RFC3492] RFC 3492: Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), A. Costello, March 2003
http://www.ietf.org/rfc/rfc3492.txt
[SCSU] Unicode Technical Standard #6: A Standard Compression Scheme for Unicode
https://www.unicode.org/reports/tr6/
[Stability] Unicode Character Encoding Stability Policies
https://www.unicode.org/policies/stability_policy.html
[UCD] Unicode Character Database
https://www.unicode.org/ucd/
For an overview of the Unicode Character Database and a list of its associated files
[Unicode] The Unicode Standard
For the latest version see:
https://www.unicode.org/versions/latest/
For Version 15.0 see: The Unicode Consortium. The Unicode Standard, Version 15.0.0 (Mountain View, CA: The Unicode Consortium, 2022. ISBN 978-1-936213-32-0).
https://www.unicode.org/versions/Unicode15.0.0/
[W3CCharMod] Character Model for the World Wide Web 1.0: Fundamentals
http://www.w3.org/TR/charmod

Acknowledgements

Mark Davis co-authored the original version of this document and provided most of the figures. Thanks to Dr. Julie Allen for extensive copy-editing and many suggestions on how to improve the readability, particularly of section 2. Ivan Panchenko provided a careful copyedit and list of typos to fix for Revision 9.

Modifications

The following summarizes modifications from the previous version of this document.

Revision 9 [KW, AF]

Previous revisions can be accessed with the “Previous Version” link in the header.


Copyright © 2022 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

