Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
The most commonly used EUC codes are variable-length encodings in which a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII) takes one byte, and a character belonging to a 94×94 coded character set (such as GB 2312) is represented in two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code, whereas a single character in EUC-TW can take up to four bytes.
Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes and more, and is generally more portable, with fewer vendor deviations and errors. EUC nevertheless remains in use, especially EUC-KR in South Korea.
The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets that can be represented with a sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8836 (94²) characters, or 830584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete character and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes.
EUC is a family of 8-bit profiles of ISO/IEC 2022, as opposed to 7-bit profiles such as ISO-2022-JP. As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared).[1] If ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from ASCII is that 0x5C (backslash in ASCII) is often used to represent a yen sign in EUC-JP (see below) and a won sign in EUC-KR.
The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows the software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside of the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.[1]
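The kuten arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `kuten_to_euc` is hypothetical, and the decoding step relies on the `euc_jp`, `gb2312` and `euc_kr` codecs shipped with CPython.

```python
def kuten_to_euc(row: int, cell: int) -> bytes:
    """Map a kuten (row, cell) position to its two-byte code set 1 form.

    Hypothetical helper for illustration; not part of any standard library.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # Adding 160 (0xA0) equals adding 32 to reach the 7-bit byte range
    # 0x21-0x7E and then setting the most significant bit (adding 128).
    return bytes([row + 0xA0, cell + 0xA0])


euc = kuten_to_euc(16, 1)  # row 16, cell 1
print(euc.hex())           # b0a1

# The same byte pair names a different character in each national set,
# because EUC fixes only the byte structure, not the character set.
for codec in ("euc_jp", "gb2312", "euc_kr"):
    ch = euc.decode(codec)
    print(f"{codec}: U+{ord(ch):04X}")
```

Both bytes of the result have the high bit set, which is what lets software separate code set 1 from the one-byte ISO 646 code without tracking any shift state.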
The EUC code itself does not make use of the announcement and designation sequences from ISO 2022.[1] However, the code specification is equivalent to the following sequence of four ISO 2022 announcement sequences, with meanings breaking down as follows.[1]
Individual sequence | Hexadecimal | Feature of EUC denoted |
---|---|---|
ESC SP C | 1B 20 43 | ISO-8 (8-bit, G0 in GL, G1 in GR) |
ESC SP Z | 1B 20 5A | G2 accessed using SS2 |
ESC SP [ | 1B 20 5B | G3 accessed using SS3 |
ESC SP \ | 1B 20 5C | Single-shifts invoke over GR |
The ISO-2022-based variable-length encoding described above is sometimes referred to as the EUC packed format, which is the encoding format usually labeled as EUC. However, internal processing of EUC data may make use of a fixed-length transformation format called the EUC complete two-byte format. This represents:[2]
Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.[2] These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.
EUC-JP is registered with the IANA in both formats, the packed format as "EUC-JP" or "csEUCPkdFmtJapanese" and the fixed width format as "csEUCFixWidJapanese".[3] Only the packed format is included in the WHATWG Encoding Standard used by HTML5.[4]
MIME / IANA | GB2312 |
---|---|
Alias(es) | csGB2312, CN-GB[5] |
Language(s) | Simplified Chinese, English, Russian |
Standard | GB 2312 (1980) |
Classification | Extended ASCII, variable-length encoding, CJK encoding, EUC |
Extends | ASCII |
Extensions | 748, GBK, GB 18030, x-mac-chinesesimp |
Transforms / Encodes | GB 2312 |
Succeeded by | GBK, GB 18030 |
EUC-CN[6] is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit ISO 2022 code version,[a] although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET.
An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.
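Because the lead and trail bytes share one range that never overlaps ASCII, an EUC-CN byte string can be split into characters with a single forward pass. The following is a minimal sketch under that assumption; `split_euc_cn` is a hypothetical helper, and Python's `gb2312` codec is used only to produce sample input.

```python
def split_euc_cn(data: bytes) -> list[bytes]:
    """Split EUC-CN bytes into per-character byte sequences.

    Hypothetical helper; it checks byte ranges only and does not verify
    that a two-byte code is actually assigned in GB 2312.
    """
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # One-byte ASCII character.
            chars.append(data[i:i + 1])
            i += 1
        elif (i + 1 < len(data)
              and 0xA1 <= data[i] <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Two-byte GB 2312 character: both bytes in 0xA1-0xFE.
            chars.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"byte at offset {i} does not fit EUC-CN")
    return chars


sample = "A\u554a".encode("gb2312")  # "A" followed by U+554A
print([c.hex() for c in split_euc_cn(sample)])  # ['41', 'b0a1']
```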
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is, therefore, more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
IBM code page 1381 (CCSID 1381) comprises the single-byte code page 1115 (CPGID 1115 as CCSID 1115) and the double-byte code page 1380 (CPGID 1380 as CCSID 1380),[7] which encodes GB 2312 the same way as EUC-CN, but deviates from the EUC structure by extending the lead byte range back to 0x8C, adding 31 IBM-selected characters in 0x8CE0 through 0x8CFE and adding 1880 user-defined characters with lead bytes 0x8D through 0xA0.[8]
IBM code page 1383 (CCSID 1383) comprises the single-byte code page 367 and the double-byte code page 1382 (CPGID 1382 as CCSID 1382),[9] which differs by conforming to the EUC structure, adding the 31 IBM-selected characters in 0xFEE0 through 0xFEFE instead, and including only 1360 user-defined characters, interspersed in the positions not used by GB 2312.[10] The alternative CCSID 5479[11] is used for the pure EUC-CN code page: it uses CCSID 9574 as its double-byte set, which uses CPGID 1382 but excludes the IBM-selected and user-defined characters.[12]
GBK is an extension to GB 2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.
Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.
The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB 18030 is a variable-length encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
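The superset relationship between EUC-CN, GBK and GB 18030 can be observed with the codecs bundled with CPython (`gb2312`, `gbk`, `gb18030`). The sketch below is illustrative only; `try_encode` is a hypothetical helper, and the three sample characters were chosen by the editor rather than taken from the standards.

```python
def try_encode(ch: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot encode ch."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


samples = {
    "in GB 2312": "\u554a",           # encoded as 0xB0A1 in EUC-CN
    "GBK addition": "\u4e02",         # not in GB 2312
    "outside the BMP": "\U0001F600",  # needs GB 18030's four-byte form
}

for label, ch in samples.items():
    forms = {c: try_encode(ch, c) for c in ("gb2312", "gbk", "gb18030")}
    shown = {c: (b.hex() if b is not None else None) for c, b in forms.items()}
    print(f"{label} (U+{ord(ch):04X}): {shown}")
```

A character that exists in GB 2312 keeps the same two bytes in all three codecs. The GBK addition uses a lead byte below 0xA1 and an ASCII-range trail byte, which is why GBK is not a true EUC code.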
Other EUC-CN variants deviating from the EUC mechanism include the Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp).[13] It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE, and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (...) respectively.[6] This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes).
This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant.
Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8.[6] These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1,[6] both extensions are included by GB/T 12345 (the traditional Chinese variant of GB 2312),[14] and both extensions are included by GB 18030 (the successor to GB 2312).[15]
MIME / IANA | EUC-JP |
---|---|
Alias(es) | Unixized JIS (UJIS), csEUCPkdFmtJapanese |
Language(s) | Japanese, English, Russian |
Classification | Extended ISO 646, variable-length encoding, CJK encoding, EUC |
Extends | ASCII or ISO 646:JP |
Transforms / Encodes | JIS X 0208, JIS X 0212, JIS X 0201 |
Succeeded by | EUC-JISx0213 |
Alias(es) | EUC-JISx0213 |
---|---|
Language(s) | Japanese, Ainu, English, Russian |
Standard | JIS X 0213 |
Classification | Extended ASCII, variable-length encoding, CJK encoding, EUC |
Extends | ASCII |
Transforms / Encodes | JIS X 0213, JIS X 0201 (Kana) |
Preceded by | EUC-JP |
EUC-JP is a variable-length encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS.[2] As of September 2022, 0.1% of all web pages use EUC-JP,[16] while 2.6% of websites written in Japanese use it, making it the second-most popular encoding for Japanese[17] (more than Shift JIS, although both are much less used than UTF-8). It is called Code page 954 by IBM.[18][19] Microsoft has two code page numbers for this encoding (51932 and 20932).
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).
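The trail-byte difference from Shift JIS can be made concrete with a small check. This is a sketch using CPython's `euc_jp` and `shift_jis` codecs; the helper name `ascii_bytes_in_multibyte` is hypothetical, and katakana so (U+30BD) is used as a well-known Shift JIS case.

```python
def ascii_bytes_in_multibyte(text: str, codec: str) -> bytes:
    """Collect ASCII-range bytes that occur inside multi-byte characters.

    Hypothetical helper: an empty result means no byte of a multi-byte
    character can be mistaken for an ASCII character.
    """
    hits = bytearray()
    for ch in text:
        encoded = ch.encode(codec)
        if len(encoded) > 1:
            hits.extend(b for b in encoded if b < 0x80)
    return bytes(hits)


so = "\u30bd"  # katakana "so"
print(ascii_bytes_in_multibyte(so, "shift_jis"))  # b'\\' (0x5C as trail byte)
print(ascii_bytes_in_multibyte(so, "euc_jp"))     # b''
```

In Shift JIS the trail byte 0x5C collides with the backslash, so byte-oriented software that scans for path separators or escape characters can split the character; the EUC-JP form avoids this because every byte of a multi-byte character has its high bit set.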
A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213[20] (similarly to Shift_JISx0213, its Shift_JIS-based counterpart).
Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese websites use EUC-JP or Shift_JIS often depends on what OS the author uses.
Characters are encoded as follows:
Vendor extensions to EUC-JP (from, for example, the Open Software Foundation, IBM or NEC) were often allocated within the individual code sets,[25][26] as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).
However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow the packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji.
Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-length (complete two-byte) EUC; however, code set 0 is not required to be left-padded with null bytes (similarly to the packed format).[28] JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212.[28] In the basic "DEC Kanji" encoding, only the first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1.[29]
The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets.[28] It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters.[29]
Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which is a variant of Shift JIS. HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure:[28]
The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with the box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters.[28][29]
KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi,[29] with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and the sequence 0x0A 0x42 switches to double-byte mode.[b] However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space: 0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters,[28] and the remainder are used for corporate-defined characters, including both kanji and non-kanji.[29]
JEF (Japanese-processing Extended Feature)[29] is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching to a double-byte DBCS-Host mode using shifting sequences (where 0x29 switches to single-byte mode and 0x28 switches to double-byte mode).[30] Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP.[28] The lead byte range is extended back to 0x41, with 0x80–0xA0 designated for user definition; lead bytes 0x41–0x7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E) is unused.[28][29] Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji.[29]
![]() EUC-KR code structure | |
MIME / IANA | EUC-KR |
---|---|
Alias(es) | Wansung, IBM-970 |
Language(s) | Korean, English, Russian |
Standard | KS X 2901 (KS C 5861) |
Classification | Extended ISO 646, variable-length encoding, CJK encoding, EUC |
Extends | ASCII or ISO 646:KR |
Extensions | Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949) |
Transforms / Encodes | KS X 1001 |
Succeeded by | Unified Hangul Code (web standards) |
EUC-KR is a variable-length encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601)[31][32] and either ISO 646:KR (KS X 1003, formerly KS C 5636) or ASCII, depending on variant. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it as EUC-KR.
A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
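As a rough illustration of the GL/GR split, the sketch below labels each character of an EUC-KR byte string with its code set and, for KS X 1001 characters, recovers the row and cell by subtracting 0xA0 from each byte. `describe_euc_kr` is a hypothetical helper, and CPython's `euc_kr` codec is assumed only for producing the sample bytes.

```python
def describe_euc_kr(data: bytes) -> list[tuple]:
    """Label each character as ("G0", byte) or ("G1", row, cell).

    Hypothetical helper; it checks byte ranges only and does not verify
    that a (row, cell) position is assigned in KS X 1001.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Code set 0: one byte from KS X 1003 or ASCII.
            out.append(("G0", b))
            i += 1
        elif (i + 1 < len(data)
              and 0xA1 <= b <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Code set 1: two GR bytes; subtract 0xA0 for row and cell.
            out.append(("G1", b - 0xA0, data[i + 1] - 0xA0))
            i += 2
        else:
            raise ValueError(f"byte at offset {i} does not fit EUC-KR")
    return out


sample = "A\uac00".encode("euc_kr")  # "A" followed by U+AC00
print(describe_euc_kr(sample))       # [('G0', 65), ('G1', 16, 1)]
```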
It is usually referred to as Wansung (Korean: 완성; RR: Wanseong; lit. precomposed[33]) in the Republic of Korea. IBM refers to the double-byte component as Code page 971,[34] and to EUC-KR with ASCII as Code page 970.[35][36][37] It is implemented as Code page 20949 ("Korean Wansung")[38][39] and Code page 51949 ("EUC Korean") by Microsoft.[38]
As of March 2025, less than 0.065% of all web pages globally declare using EUC-KR,[40] but 5.0% of South Korean web pages use EUC-KR.[41] Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been very slowly shifting to UTF-8 as it gains popularity, especially on Linux and macOS.
As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.
A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드; Tonghabhyeong Hangeul Kodeu,[42] or 통합 완성형; Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261[43] or 1363[44] by IBM. IBM's code page 949 is a different, unrelated, EUC-KR extension.
Unified Hangul Code extends EUC-KR by using codes that do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR.[45]
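Whether a byte string stays within the plain EUC-KR structure, or uses the Unified Hangul Code extensions, can be tested from byte ranges alone. The sketch below assumes CPython's `cp949` codec as an implementation of Unified Hangul Code; `fits_euc_kr_structure` is a hypothetical helper, and U+B620 is an example, chosen by the editor, of a syllable block absent from KS X 1001.

```python
def fits_euc_kr_structure(data: bytes) -> bool:
    """True if data uses only one-byte GL codes and two-byte GR codes.

    Hypothetical helper; it checks structure, not whether each two-byte
    code is actually assigned in KS X 1001.
    """
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
        elif (i + 1 < len(data)
              and 0xA1 <= data[i] <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True


in_ksx1001 = "\uac00".encode("cp949")  # syllable present in KS X 1001
uhc_only = "\ub620".encode("cp949")    # syllable added by Unified Hangul Code

print(fits_euc_kr_structure(in_ksx1001))  # True
print(fits_euc_kr_structure(uhc_only))    # False
```

The second result is False because the extension codes place at least one byte outside 0xA1–0xFE, which is the sense in which they do not conform to the EUC structure.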
Other encodings incorporating EUC-KR as a subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean),[13] which was used by HangulTalk (MacOS-KH), the Korean localization of the classic Mac OS. It was developed by Elex Computer (일렉스), who were at the time the authorised distributor of Apple Macintosh computers in South Korea.[46][29]
HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within the EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylized dingbats.[29] Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences, to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters.[47]
Apple also uses certain single-byte codes outside of the EUC-KR plane for additional characters: 0x80 for a required space, 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (_) and 0xFF for an ellipsis (...).[47] Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above), some are within the lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84).
Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP.[48] More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code.[49]
Although certain single-byte encodings such as the ISO/IEC 8859 series technically conform to the EUC structure, they are rarely labeled as EUC. However, eucTH is used on Solaris as a label for TIS-620.[50]
EUC-TW is a variable-length encoding that supports ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Variants of Big5 are much more common than EUC-TW, although Big5 only encodes the first two planes of CNS 11643 hanzi, while UTF-8 is becoming more common.
Note that plane 1 of CNS 11643 is encoded twice as code set 1 and a part of code set 2.
10 65 and 10 66) listed by Lunde.[28] Lunde lists the hexadecimal forms for both as 0xA0 0x42, seemingly in error.
ULMBCS_GRP_KO, and is mapped to the "windows-949" ICU codec in the OptGroupByteToCPName array later in the file.