GBK (character encoding)

From Wikipedia, the free encyclopedia
Simplified Chinese character encoding
Guójiā Biāozhǔn Kuòzhǎn (GBK)
Layout of GBK (see below for a larger copy of this diagram)
MIME / IANA: GBK
Alias(es): CP936, MS936, windows-936, csGBK
Languages: Primarily used for Simplified Chinese, but also supports Traditional Chinese, Japanese, English, Russian and (partially) Greek. Web browsers decode GBK as GB 18030, which supports all languages.
Standard: GBK 1.0
Classification: Extended ASCII,[a] variable-width encoding, CJK encoding
Extends: EUC-CN
Preceded by: GB 2312
Succeeded by: GB 18030
  a. Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.

GBK is an extension of the GB 2312 character set for Simplified Chinese characters, used in the People's Republic of China. It includes all unified CJK characters found in GB 13000.1-93, i.e. ISO/IEC 10646:1993, or Unicode 1.1. Since its initial release in 1993, GBK has been extended by Microsoft in Code page 936/1386, which was then extended into GBK 1.0. GBK is also the IANA-registered internet name for the Microsoft mapping,[1] which differs from other implementations primarily by the single-byte euro sign at 0x80.

GB abbreviates Guójiā Biāozhǔn, which means national standard in Chinese, while K stands for Extension (扩展 kuòzhǎn). GBK extended the old standard GB 2312 not only with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB 2312 in 1981. With the arrival of GBK, certain names with characters formerly unrepresentable, like the 镕 (róng) character in former Chinese Premier Zhu Rongji's name, are now representable.[2]
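The Zhu Rongji example can be checked with a short Python sketch. It assumes that the interpreter's built-in `gb2312` and `gbk` codecs follow the respective character sets closely enough for this purpose; the helper name `can_encode` is made up for illustration.

```python
def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


name = "朱镕基"  # Zhu Rongji

# One flag per character of the name, for each of the two codecs.
gb2312_ok = [can_encode(ch, "gb2312") for ch in name]
gbk_ok = [can_encode(ch, "gbk") for ch in name]
```

Under that assumption, only the middle character should fail under `gb2312`, while all three succeed under `gbk`.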

As of October 2022[update], GBK is the third-most popular encoding served from China and territories (after UTF-8 and the subset GB 2312), with 1.9% of web servers serving a page that declares GBK.[3] However, all major web browsers decode GB2312-marked documents as if they were marked GBK, except for Safari and Edge on the label GB_2312.[4] Together, GBK and GB 2312 encodings have a combined 5.5% presence in China and territories.[3] Globally, GBK accounts for less than 0.07% of all web pages and GBK+GB2312 for 0.2%.[5]

History

In 1993, the Unicode 1.1 standard was released, including 20,902 characters used in mainland China, Taiwan, Japan and Korea. Following this, China released GB 13000.1-93, the Guobiao standard equivalent of Unicode 1.1.

The GBK character set was defined in 1993 as an extension of GB 2312-80, while also including the characters of GB 13000.1-93 through the unused codepoints available in GB 2312. Hence GBK is backward compatible with GB 2312. GBK was defined in a normative annex to GB 13000.1-93.[6]

Microsoft implemented GBK in Windows 95 and Windows NT 3.51 as Code Page 936. While GBK was never an official standard, widespread usage of Windows 95 led to GBK becoming the de facto standard. While GBK included all the Chinese characters defined in Unicode 1.1 and GB 13000.1-93, these standards used different code tables. The primary reason for GBK's existence was simply to bridge the gap between GB 2312-80 and GB 13000.1-93.

In 1995, the China National Information Technology Standardization Technical Committee set down the Chinese Internal Code Extension Specification (Chinese: 汉字内码扩展规范 (GBK); pinyin: Hànzì Nèimǎ Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a slight extension of Codepage 936. The newly added 95 characters were not found in GB 13000.1-1993, and were provisionally assigned Unicode PUA code points.[7]: 534

Microsoft later added the euro sign to Code page 936 and assigned the code 0x80 to it. This is not a valid code point in GBK 1.0.

In 2000, the GB 18030-2000 standard was released, superseding yet maintaining compatibility with GBK 1.0. It increased the number of definitions of Chinese characters and extended the number of possible characters through the implementation of four-byte character spaces. The subset of GB 18030 consisting of one-byte and two-byte characters is sometimes also referred to as GBK. Mapping to Unicode has been slightly changed, though, as some characters are now defined in Unicode. In the most up-to-date form of the standard, GB 18030-2005, only 24[8] characters are still mapped to Unicode PUA (see GB 18030 § PUA).

In 2002, GBK was registered as an IANA charset; the registration uses the code page 936 mapping as well as the CP936/MS936 aliases, but refers to the GBK 1.0 specification.[1] W3C's technical recommendation published in 2015[9] defines a GBK encoder as a GB 18030 encoder with a single-byte euro sign and without four-byte sequences (while W3C's GBK decoder specification has no such limitation and decodes as GB 18030, i.e. with the same range of characters as all of Unicode).

Encoding

A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 characters and 33 control codes in this range.

A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–A0 except 7F for some areas and A1–FE for others.
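The lead-byte and trail-byte rules above can be sketched as a small Python splitter. This is an illustrative sketch, not a reference decoder: the function name `split_gbk` is hypothetical, and it checks only the loose byte ranges given above (lead byte 81–FE, trail byte 40–FE except 7F), not whether a pair is actually assigned to a character.

```python
def split_gbk(data):
    """Split a GBK byte string into 1-byte and 2-byte units.

    Raises ValueError on a byte that cannot start a character,
    on a truncated pair, or on an out-of-range trail byte.
    """
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # Single byte, same meaning as in ASCII.
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= lead <= 0xFE:
            if i + 1 >= len(data):
                raise ValueError(f"truncated pair at offset {i}")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0xFE) or trail == 0x7F:
                raise ValueError(f"bad trail byte {trail:#04x} at offset {i + 1}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # 0x80 and 0xFF are never lead bytes in GBK 1.0.
            raise ValueError(f"bad lead byte {lead:#04x} at offset {i}")
    return units


encoded = "GBK编码".encode("gbk")
unit_lengths = [len(u) for u in split_gbk(encoded)]
```

Because a trail byte may fall in the ASCII range (40–7E), a scanner has to walk the data from the start as done here; it cannot classify a byte in isolation.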

More specifically, the following ranges of bytes are defined:

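GBK Encoding Ranges

| Range | Byte 1 | Byte 2 | Code points | Characters in GB 18030 | Characters in GBK 1.0 | Characters in Codepage 936 | Characters in GB 2312 |
|---|---|---|---|---|---|---|---|
| Level GBK/1 | A1–A9 | A1–FE | 846 | 718[7]: 8–10 | 717 | 715 | 682 |
| Level GBK/2 | B0–F7 | A1–FE | 6,768 | 6,763 | 6,763 | 6,763 | 6,763 |
| Level GBK/3 | 81–A0 | 40–FE except 7F | 6,080 | 6,080 | 6,080 | 6,080 | |
| Level GBK/4 | AA–FE | 40–A0 except 7F | 8,160 | 8,160 | 8,160 | 8,080 | |
| Level GBK/5 | A8–A9 | 40–A0 except 7F | 192 | 166 | 166 | 153 | |
| user-defined 1[7] | AA–AF | A1–FE | 564 | | | | |
| user-defined 2 | F8–FE | A1–FE | 658 | | | | |
| user-defined 3 | A1–A7 | 40–A0 except 7F | 672 | | | | |
| total: | | | 23,940 | 21,887 | 21,886 | 21,791 | 7,445 |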
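The byte ranges in the table can be cross-checked by classifying every possible byte pair and counting the results. The following Python sketch does that; the function name `gbk_region` and the region labels are illustrative, and the code reflects only the ranges listed in the table, not which positions actually carry characters.

```python
from collections import Counter


def gbk_region(lead, trail):
    """Return the table's region name for a byte pair, or None if invalid."""
    low = 0x40 <= trail <= 0xA0 and trail != 0x7F   # trail range 40-A0 except 7F
    high = 0xA1 <= trail <= 0xFE                    # trail range A1-FE
    if 0x81 <= lead <= 0xA0 and (low or high):
        return "GBK/3"
    if high:
        if 0xA1 <= lead <= 0xA9:
            return "GBK/1"
        if 0xAA <= lead <= 0xAF:
            return "user-defined 1"
        if 0xB0 <= lead <= 0xF7:
            return "GBK/2"
        if 0xF8 <= lead <= 0xFE:
            return "user-defined 2"
    if low:
        if 0xA1 <= lead <= 0xA7:
            return "user-defined 3"
        if 0xA8 <= lead <= 0xA9:
            return "GBK/5"
        if 0xAA <= lead <= 0xFE:
            return "GBK/4"
    return None


# Count code points per region over all 65,536 byte pairs.
counts = Counter(
    region
    for lead in range(256)
    for trail in range(256)
    if (region := gbk_region(lead, trail)) is not None
)
total_code_points = sum(counts.values())
```

The per-region counts should reproduce the "code points" column, and `total_code_points` its total.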

Layout diagram

In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints, red are for user-defined characters. The uncolored areas are invalid byte combinations.

Relationship to other encodings

The areas indicated in the previous section as GBK/1 and GBK/2, taken by themselves, are simply GB 2312-80 in its usual encoding, GBK/1 being the non-hanzi region and GBK/2 the hanzi region. GB 2312, or more properly the EUC-CN encoding thereof, takes a pair of bytes from the range A1–FE, like any 94² ISO-2022 character set loaded into GR. This corresponds to the lower-right quarter of the illustration above. However, GB 2312 does not assign any code points to the rows located at AA–AF and F8–FE, even though it had staked out the territory. GBK added extensions to these rows. You can see that the two gaps were filled in with user-defined areas.
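As a small illustration of the 94 × 94 layout, the sketch below recovers the row and cell numbers of a GB 2312 character from its EUC-CN bytes by subtracting A0 from each byte. The helper name `euc_cn_row_cell` is hypothetical, and the snippet assumes Python's built-in `gb2312` and `gbk` codecs.

```python
def euc_cn_row_cell(ch):
    """Return the (row, cell) position, each 1-94, of a double-byte GB 2312 character."""
    encoded = ch.encode("gb2312")
    if len(encoded) != 2:
        raise ValueError("expected a double-byte GB 2312 character")
    b1, b2 = encoded
    # In EUC-CN both bytes lie in A1-FE, i.e. row/cell number plus 0xA0.
    return b1 - 0xA0, b2 - 0xA0


first_hanzi = "啊"  # first character of the hanzi region
row, cell = euc_cn_row_cell(first_hanzi)

# A GB 2312 character keeps the same bytes under GBK.
same_bytes = first_hanzi.encode("gb2312") == first_hanzi.encode("gbk")
```

Row 16 corresponds to lead byte B0, the start of the GBK/2 hanzi region in the table.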

More significantly, GBK extended the range of the bytes. Having two-byte characters in the ISO-2022 GR range gives a limit of 94²=8,836 possibilities. Abandoning the ISO-2022 model of strict regions for graphics and control characters, but retaining the feature of low bytes being 1-byte characters and pairs of high bytes denoting a character, you could potentially have 128²=16,384 positions. GBK takes part of that, extending the range from A1–FE (94 choices for each byte) to 81–FE (126 choices) for the first byte and 40–FE (191 choices) for the second byte, for a total of 24,066 positions.
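The position counts in this paragraph follow directly from the sizes of the byte ranges, as the following Python arithmetic sketch shows. The variable names are chosen for illustration only.

```python
# Inclusive byte ranges written as Python ranges.
gr_bytes = range(0xA1, 0xFE + 1)      # ISO-2022 GR graphic bytes
high_bytes = range(0x80, 0xFF + 1)    # every byte with the high bit set
gbk_lead = range(0x81, 0xFE + 1)      # GBK first byte
gbk_trail = range(0x40, 0xFE + 1)     # GBK second byte, 7F still included

iso2022_positions = len(gr_bytes) ** 2            # 94 * 94
high_pair_positions = len(high_bytes) ** 2        # 128 * 128
gbk_positions = len(gbk_lead) * len(gbk_trail)    # 126 * 191

# Dropping trail byte 7F, as the encoding ranges table does.
gbk_positions_without_7f = len(gbk_lead) * (len(gbk_trail) - 1)  # 126 * 190
```

Excluding trail byte 7F gives 126 × 190 = 23,940, which matches the code-point total of the encoding ranges table.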

Microsoft's Code Page 936 is generally thought of as being GBK.[1] However, the 95 PUA characters added in GBK 1.0 are not included in Code Page 936. Code Page 936 also has a single-byte euro sign at 0x80 which GBK 1.0 doesn't have.[10]
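The euro-sign difference can be illustrated with a thin wrapper around a GBK decoder. The sketch below is an assumption-laden illustration rather than Microsoft's actual mapping: `decode_cp936_style` and `strict_gbk_accepts` are hypothetical names, the wrapper changes only the treatment of a standalone 0x80 byte, and everything else is delegated to Python's built-in `gbk` codec.

```python
EURO = "\u20ac"


def decode_cp936_style(data):
    """Decode GBK bytes, additionally mapping a single-byte 0x80 to the euro sign."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x80:
            # Code Page 936 extension: single-byte euro sign.
            out.append(EURO)
            i += 1
        elif b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            # Two-byte character; the built-in codec raises on invalid pairs.
            out.append(data[i:i + 2].decode("gbk"))
            i += 2
    return "".join(out)


def strict_gbk_accepts(data):
    """Return True if Python's built-in gbk codec decodes data without error."""
    try:
        data.decode("gbk")
        return True
    except UnicodeDecodeError:
        return False


sample = b"\x80" + "价格".encode("gbk")
decoded = decode_cp936_style(sample)
```

The byte 0x80 is treated as a euro sign only where a new character starts; as the second byte of a pair it remains an ordinary trail byte.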

GBK's successor, GB 18030-2000, uses the remaining range available to the second byte (30–39) to further expand the number of possibilities while retaining GBK as a subset.
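Since a GBK trail byte is never below 40, a second byte in 30–39 is enough to tell a GB 18030 four-byte sequence apart from a GBK pair. The Python sketch below shows this with the built-in `gb18030` and `gbk` codecs; the helper names are hypothetical, and U+1F600 is simply one convenient character outside GBK.

```python
def starts_four_byte(data):
    """Return True if data begins like a GB 18030 four-byte sequence."""
    return (
        len(data) >= 2
        and 0x81 <= data[0] <= 0xFE
        and 0x30 <= data[1] <= 0x39   # a digit byte can never be a GBK trail byte
    )


def gbk_can_encode(ch):
    """Return True if Python's gbk codec can encode ch."""
    try:
        ch.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


hanzi = "汉"               # present in GB 2312, hence in GBK
emoji = "\U0001F600"       # outside GBK

hanzi_gb18030 = hanzi.encode("gb18030")
emoji_gb18030 = emoji.encode("gb18030")
```

A character already in GBK keeps its two-byte form under GB 18030, while a character outside it takes four bytes.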

References

  1. "Character Sets". Retrieved 3 October 2016.
  2. "Code Page 936 - PRC GBK (XGB)". Microsoft. Archived from the original on 2002-10-01. Conversion map between Codepage 936 and Unicode. Requires manually selecting GB 18030 or GBK in the browser to view it correctly.
  3. "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2022-10-25.
  4. "Encoding: Summarized test results". www.w3.org. Retrieved 2019-11-15.
  5. "Historical trends in the usage statistics of character encodings for websites, October 2022". w3techs.com. Retrieved 2022-10-25.
  6. "18.2: Ideographic Description Characters" (PDF). The Unicode Standard. Version 15.0.0. 2022. p. 763. "The Ideographic Description characters are found in GBK – an extension to GB 2312-80 that added all 20,902 Unicode Version 1.1 ideographs not already in GB 2312-80. GBK is defined as a normative annex of GB 13000.1-93."
  7. Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology – Chinese coded character set.
  8. GB 18030-2005 Standard, pp. 9, 79.
  9. "Encoding Standard # gbk-encoder". W3C. Retrieved 2016-10-02.
  10. Scherer, Markus (4 January 2002). "Re: Fun with GBK & GB2312". Unicode Mail List Archive. Retrieved 4 March 2020.

Retrieved from "https://en.wikipedia.org/w/index.php?title=GBK_(character_encoding)&oldid=1300679317"