Universal Coded Character Set

From Wikipedia, the free encyclopedia
Standard set of characters defined by ISO/IEC 10646

Universal Coded Character Set
Alias(es): UCS, Unicode
Language: International
Standard: ISO/IEC 10646
Encoding formats: UTF-8, UTF-16, GB 18030 (less common: UTF-32, BOCU, SCSU, UTF-7)
Preceded by: ISO/IEC 8859, ISO/IEC 2022, various others

The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard). It is the basis of many character encodings and continues to grow as characters from previously unrepresented writing systems are added.[1]

The UCS has over 1.1 million possible code points available for allocation, but only the first 65,536, which make up the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030, which required software intended for sale in the PRC to move beyond the BMP.[2]
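As a rough illustration of how the codespace divides into planes of 65,536 code points, the sketch below maps a code point to its plane number. The helper names `plane_of` and `in_bmp` are invented for this example and are not part of any standard API.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point of the UCS codespace


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the UCS codespace: {code_point:#x}")
    return code_point // PLANE_SIZE


def in_bmp(code_point: int) -> bool:
    """True when code_point lies in plane 0, the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(plane_of(ord("A")))   # 0
print(plane_of(0x1F600))    # 1
print(in_bmp(0x20000))      # False
```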

The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimise conflicts with other encoding forms.
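One way to see code points that are not assigned to ordinary characters is to query their general category. The sketch below uses Python's standard `unicodedata` module, which reflects the Unicode Character Database bundled with the interpreter rather than the text of ISO/IEC 10646 itself. The four sample code points were chosen for this example.

```python
import unicodedata

# Sample code points picked for illustration.
samples = {
    0x0041: "LATIN CAPITAL LETTER A, an assigned character",
    0xD800: "first code point of the surrogate range",
    0xE000: "first code point of the BMP private-use area",
    0xFFFF: "a noncharacter at the end of the BMP",
}

for cp, note in samples.items():
    # Lu = uppercase letter, Cs = surrogate, Co = private use, Cn = not assigned
    category = unicodedata.category(chr(cp))
    print(f"U+{cp:04X}  {category}  {note}")
```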

The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP, U+D800 to U+DFFF, remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology the high-half zone elements (U+D800 to U+DBFF) are called "high surrogates" and the low-half zone elements (U+DC00 to U+DFFF) are called "low surrogates".
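The pairing can be sketched in a few lines. A code point outside the BMP is reduced to a 20-bit offset, and the two 10-bit halves are added to the start of the high-half range (0xD800) and the low-half range (0xDC00). The function names below are made up for this example. Real software would normally rely on a library codec instead.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into a (high, low) UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) UTF-16 pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

The result agrees with Python's built-in codec: `"\U0001F600".encode("utf-16-be")` yields the bytes D8 3D DE 00.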

Another encoding, UTF-32 (previously named UCS-4), uses four bytes (32 bits in total) to encode a single character of the codespace. UTF-32 thereby gives every code point (as of 2024) a fixed-width binary representation for use in APIs and software applications.
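The fixed width is easy to observe with Python's built-in codecs. The short example below encodes one BMP character and one character outside the BMP. The sample string is arbitrary, and the big-endian variants are used only so that no byte order mark is emitted.

```python
text = "A\U0001F600"                 # one BMP character, one outside the BMP

utf32 = text.encode("utf-32-be")     # big-endian, no byte order mark
utf16 = text.encode("utf-16-be")

print(utf32.hex(" ", 4))             # 00000041 0001f600
print(len(utf32) // 4)               # 2 code units, one per code point
print(len(utf16) // 2)               # 3 code units, the second character needs a pair
```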

History


The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects.

This work happened independently of the development of the Unicode standard, which had been in development since 1987 by Xerox and Apple.

The original ISO 10646 draft differed markedly from the current standard. It defined:

  • 128 groups of
  • 256 planes of
  • 256 rows of
  • 256 cells,

for an apparent total of 2,147,483,648 characters. In practice the standard could code only 679,477,248 characters, because the policy forbade the byte values of the C0 and C1 control codes (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) in any of the four bytes specifying a group, plane, row and cell. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
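Both totals follow from counting the permitted byte values. A group byte has 128 nominal values, of which the 32 C0 values are excluded. Each of the other three bytes has 256 nominal values, of which the 64 C0 and C1 values are excluded. A minimal check, with variable names chosen for this example:

```python
# Byte values forbidden by the draft: C0 (0x00-0x1F) and C1 (0x80-0x9F).
CONTROL_BYTES = set(range(0x00, 0x20)) | set(range(0x80, 0xA0))

# The group byte ranges over 128 values; plane, row and cell over 256 each.
group_values = [b for b in range(0x00, 0x80) if b not in CONTROL_BYTES]
other_values = [b for b in range(0x00, 0x100) if b not in CONTROL_BYTES]

apparent_total = 128 * 256 * 256 * 256
usable_total = len(group_values) * len(other_values) ** 3

print(len(group_values), len(other_values))  # 96 192
print(apparent_total)                        # 2147483648
print(usable_total)                          # 679477248
```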

One could code the characters of this primordial ISO/IEC 10646 standard in one of three ways:

  1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;
  2. UCS-2, two bytes for every character, enabling straightforward encoding of the first plane, 0x20, the Basic Multilingual Plane, which contained the first 36,864 code points; other planes and groups were reached by switching to them with ISO/IEC 2022 escape sequences;
  3. UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, none of which is a control code value).
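As a sketch of the first form, the four-byte UCS-4 location of a character in the 1990 draft can be modelled as a group, plane, row and cell, each checked against the control-code restriction. The functions `is_control` and `draft_ucs4` are hypothetical helpers written for this example from the description above. They do not implement any published edition of the standard. The last two lines also reproduce the 36,864 figure for a single plane (192 usable rows of 192 usable cells).

```python
def is_control(byte: int) -> bool:
    """True for C0 (0x00-0x1F) and C1 (0x80-0x9F) byte values."""
    return byte <= 0x1F or 0x80 <= byte <= 0x9F


def draft_ucs4(group: int, plane: int, row: int, cell: int) -> bytes:
    """Return the four-byte location used by the draft's UCS-4 form."""
    if group > 0x7F:
        raise ValueError("the draft defined only 128 groups")
    location = bytes([group, plane, row, cell])  # rejects values outside 0-255
    if any(is_control(b) for b in location):
        raise ValueError("control-code byte values were forbidden")
    return location


# Latin capital letter A: group 0x20, plane 0x20, row 0x20, cell 0x41.
print(draft_ucs4(0x20, 0x20, 0x20, 0x41).hex())  # 20202041

usable = [b for b in range(256) if not is_control(b)]
print(len(usable) ** 2)  # 36864 code points per plane
```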

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO/IEC 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it.[citation needed] ISO officials realised they could not continue to support the standard in its then-current state and negotiated the unification of their standard with Unicode. Two changes took place: the limitation upon characters (the prohibition of control code values) was lifted, opening those code points for allocation; and the repertoire of the Basic Multilingual Plane was synchronised with that of Unicode.

Meanwhile, the Unicode standard itself changed over time: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 code points from 17 planes by means of the UTF-16 surrogate mechanism. For that reason, ISO/IEC 10646 was limited to as many characters as UTF-16 can encode and no more, that is, a little over a million characters instead of over 679 million. The UCS-4 encoding of ISO/IEC 10646 was incorporated into the Unicode standard, restricted to the UTF-16 range, under the name UTF-32, although it has almost no use outside programs' internal data.
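The figure of 1,112,064 can be reproduced by taking the 17 planes of 65,536 code points each and subtracting the 2,048 surrogate code points (U+D800 to U+DFFF), which are reserved for the pairing mechanism and never stand for characters on their own. The constant names are chosen for this example.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000        # 65,536
SURROGATES = 0xDFFF - 0xD800 + 1       # 2,048 code points reserved for UTF-16 pairs

codespace = PLANES * CODE_POINTS_PER_PLANE
encodable = codespace - SURROGATES

print(codespace)   # 1114112
print(encodable)   # 1112064
```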

Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed variable-width encoding that is backward-compatible with 7-bit ASCII. It came to be called UTF-8,[3] and is currently the most popular UCS encoding.
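The variable-width idea can be shown with a small hand-written encoder. It follows the modern definition of UTF-8, which stops at four bytes and U+10FFFF, not necessarily every detail of the original Plan 9 design. The function name `utf8_encode` is invented for this example, and Python's own `str.encode("utf-8")` is what production code would use.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 (modern four-byte limit)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the UCS codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point < 0x80:                       # 7-bit ASCII: one byte, unchanged
        return bytes([code_point])
    if code_point < 0x800:                      # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # three bytes: rest of the BMP
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # four bytes: planes 1 to 16
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(ord("A")).hex(" "))   # 41
print(utf8_encode(0x20AC).hex(" "))     # e2 82 ac
print(utf8_encode(0x1F600).hex(" "))    # f0 9f 98 80
```

Because every code point below 0x80 maps to the identical single byte, any pure ASCII file is already valid UTF-8.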

Differences from Unicode


ISO/IEC 10646 and Unicode have an identical repertoire and numbers: the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. Unicode has rules and specifications outside the scope of ISO/IEC 10646. ISO/IEC 10646 is a simple character map, an extension of previous standards like ISO/IEC 8859. In contrast, Unicode adds rules for collation, normalisation of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO/IEC 10646; Unicode must be implemented.

To support these rules and algorithms, Unicode adds many properties to each character in the set, such as properties determining a character's default bidirectional class and properties that determine how the character combines with other characters. If the character represents a numeric value, such as the European digit '8' or the vulgar fraction '¼', that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
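Several of these properties can be inspected with Python's standard `unicodedata` module, which exposes the Unicode Character Database bundled with the interpreter. The sample characters below were picked for this example. The values shown are long-standing properties, though the database version depends on the Python release.

```python
import unicodedata

# Numeric value property
print(unicodedata.numeric("8"))              # 8.0
print(unicodedata.numeric("\u00BC"))         # 0.25  (VULGAR FRACTION ONE QUARTER)

# Default bidirectional class
print(unicodedata.bidirectional("8"))        # EN  (European number)
print(unicodedata.bidirectional("\u05D0"))   # R   (HEBREW LETTER ALEF)
print(unicodedata.bidirectional("\u0627"))   # AL  (ARABIC LETTER ALEF)

# Canonical combining class of COMBINING ACUTE ACCENT
print(unicodedata.combining("\u0301"))       # 230

# Normalisation composes 'e' + combining acute into the single character U+00E9
print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9")  # True
```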

Some applications support ISO/IEC 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO/IEC 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.

Citing the Universal Coded Character Set


ISO/IEC 10646, a general, informal citation for the ISO/IEC 10646 family of standards, is acceptable in most prose. Although Unicode is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative reference to the UCS as a publication should cite the year of the edition in the form ISO/IEC 10646:{year}, for example: ISO/IEC 10646:2014.

Relationship with Unicode


Since 1991, the Unicode Consortium and the ISO/IEC have developed The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.


References

  1. Final Committee Draft (2010). ISO/IEC International Standard ISO/IEC 10646 (PDF) (2nd ed.). Switzerland. p. 8.
  2. "Universal Character Set - Acemap". ddescholar.acemap.info. Retrieved 2025-06-09.
  3. Pike, Rob (2003-04-03). "UTF-8 history". Archived from the original on 2016-05-23.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Universal_Coded_Character_Set&oldid=1309118784"