| MIME / IANA | Windows-31J |
|---|---|
| Alias(es) | CP943C |
| Language | Japanese |
| Standard | WHATWG Encoding Standard (as "Shift_JIS")[1] |
| Classification | Extended ASCII,[a] variable-width encoding, CJK encoding |
| Extends | Shift_JIS |
| |
Microsoft Windows code page 932 (abbreviated MS932,[2][3] Windows-932[3] or ambiguously CP932[4]), also called Windows-31J amongst other names (see § Terminology below), is the Microsoft Windows code page for the Japanese language, and is an extended variant of the Shift JIS Japanese character encoding. It contains standard 7-bit ASCII codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so characters use either 8 or 16 bits for encoding.
IBM offers the same extended double-byte codes in its code page 943 (IBM-943 or CP943),[5] which is a combination of the single-byte Code page 897 and the double-byte Code page 941.[6]
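The one-or-two-byte structure can be illustrated with a minimal Python sketch that relies on Python's built-in `cp932` codec. The sample characters and the helper name `encoded_width` are chosen here purely for illustration and are not part of any standard.

```python
def encoded_width(ch: str) -> int:
    """Return how many bytes a single character occupies in Python's cp932 codec."""
    return len(ch.encode("cp932"))


# Illustrative samples: one ASCII letter, one hiragana, one kanji.
samples = ["A", "あ", "漢"]
widths = {ch: encoded_width(ch) for ch in samples}

# For the double-byte samples, check that the first byte has its high bit set.
high_bit_set = {
    ch: bool(ch.encode("cp932")[0] & 0x80) for ch in samples if widths[ch] == 2
}

for ch in samples:
    print(f"U+{ord(ch):04X} -> {widths[ch]} byte(s)")
```

The ASCII letter occupies a single byte, while the two Japanese samples each occupy two bytes whose first byte has the high bit set.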
Windows-31J is the most used Japanese encoding on the web other than UTF-8/Unicode. However, many people and software packages, including Microsoft libraries,[7] declare the Shift JIS encoding for Windows-31J data, although Windows-31J includes some additional characters and maps some of the existing characters to Unicode differently. This has led the WHATWG HTML standard to treat the encoding labels shift_jis and windows-31j interchangeably, and to use the Windows variant for its "Shift_JIS" encoder and decoder.[1]
Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows. However, this is ambiguous, as IBM's code page 932, while also a Shift JIS variant, lacks the NEC and NEC-selected double-byte vendor extensions which are present in Microsoft's variant (although both include the IBM extensions) and preserves the 1978 ordering of JIS X 0208.[5]
IBM's code page 943 (or "IBM-943") includes the same double-byte codes as Windows code page 932.[5] Microsoft's version corresponds closely to the encoding referred to as ibm-943_P15A-2003 (with aliases including CP943C and Windows-932)[3] in International Components for Unicode (ICU). There is also a second ICU encoding named ibm-943_P130-1999,[8] whose single-byte mappings more closely match IBM's code page definitions. (See § Single-byte character differences below for details.)
Windows code page 932 is registered with the IANA as Windows-31J.[9] The "Windows-31J" label is IANA's and not recognized by Microsoft, which has historically used "shift_jis" instead.[7] The W3C/WHATWG encoding standard used by HTML5 treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content"[10] and matches Windows code page 932[1] (including the "formerly proprietary extensions from IBM and NEC").[11]
Windows code page 932 is also called MS_Kanji,[3][12] although IANA treats MS_Kanji as an alias for standard Shift JIS.[9] Python, for example, uses the label MS-Kanji (or cp932) for Windows-932 and the label Shift_JIS (or sjis) for JIS X 0208-defined Shift JIS, without recognising the Windows-31J label.[12]
In Japanese editions of Windows, this code page is referred to as "ANSI", since it is the operating system's default 8-bit encoding, even though ANSI was not involved in its definition.
Windows-31J is often mistaken for standard Shift JIS (as defined in JIS X 0208:1997 Appendix 1): while similar, the distinction is significant for computer programmers wishing to avoid mojibake.
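The Python labelling described above can be checked with the standard `codecs` module. The following is a small sketch; the variable names are illustrative, and the exact set of accepted aliases may vary between Python versions.

```python
import codecs

# Resolve two labels to the canonical codec names Python uses internally.
cp932_name = codecs.lookup("ms-kanji").name  # Microsoft variant
sjis_name = codecs.lookup("sjis").name       # JIS X 0208-defined Shift JIS

print(cp932_name, sjis_name)
```

Because the two labels resolve to different codecs, data declared as "Shift_JIS" but actually produced on Windows may need to be decoded with `cp932` explicitly.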

In addition to the standard JIS X 0201:1997 and JIS X 0208:1997 characters, Windows-31J includes several JIS X 0208 extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)",[9] in addition to setting some encoding space aside for end user definition.[13] This also differs from IBM-932, which does not include the NEC extensions or NEC selection.[5]
The IBM extensions were designed to encode characters from the IBM Japanese DBCS-Host repertoire which were initially absent in JIS X 0208; the 'because' sign ∵ and 'not' sign ¬ were later added to JIS X 0208 itself in 1983, and Microsoft includes them at extension locations as well as their 1983 locations.[14] The NEC extensions also encode the entirety of the IBM repertoire, but in a separate extension within the 94×94 JIS X 0208 grid (in rows 89–92, besides the characters already included in NEC row 13), rather than using Shift JIS codes beyond the JIS X 0208 range; Windows code page 932 includes these 388 characters in both locations.[14] As a result, the 'because' and 'not' signs are encoded three times.
Some of these representations were subsequently used for different characters by JIS X 0213 and Shift JIS-2004. For example, compare row 89 in JIS X 0213 (beginning 硃, 硎, 硏…)[15] to row 89 as used by JIS X 0208 with IBM/NEC extensions (beginning 纊, 褜, 鍈…).[16] Consequently, Shift JIS-2004 is not compatible with Windows-31J.
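The triple encoding of the 'because' sign can be observed with Python's `cp932` codec. In this sketch the three byte sequences are assumptions based on Microsoft's published mapping for the JIS X 0208, NEC row 13 and IBM extension locations; they are not given in the text above.

```python
# Assumed code page 932 byte sequences for U+2235 BECAUSE:
# JIS X 0208 (1983) location, NEC row 13 location, IBM extension location.
BECAUSE_CODES = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]

# All three sequences decode to the same Unicode character.
decoded = [code.decode("cp932") for code in BECAUSE_CODES]

# Encoding can only pick one of them, so decoding followed by encoding
# is not guaranteed to reproduce the original bytes.
round_trip = "\u2235".encode("cp932")

print(decoded, round_trip)
```

One practical consequence is that byte-for-byte comparison of re-encoded text can fail even when the decoded strings are identical.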
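The row 89 conflict can be sketched by decoding the same two bytes with Python's `cp932` and `shift_jis_2004` codecs. The byte sequence 0xED40 used here is an assumption: it is the Shift JIS code computed for row 89, cell 1, and does not appear in the text above.

```python
# Assumed Shift JIS byte sequence for row 89, cell 1.
row89_cell1 = b"\xed\x40"

# Windows code page 932 reads this as the first NEC-selected IBM extension.
as_cp932 = row89_cell1.decode("cp932")

# Shift JIS-2004 assigns the same position to a JIS X 0213 character;
# errors="replace" keeps the sketch from raising if a codec lacks a mapping.
as_sjis2004 = row89_cell1.decode("shift_jis_2004", errors="replace")

print(f"cp932: U+{ord(as_cp932):04X}  shift_jis_2004: U+{ord(as_sjis2004[0]):04X}")
```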
In addition to the above, Microsoft uses different (but visually similar) Unicode mappings than standard Shift JIS for several double-byte punctuation characters. For example, the wave dash is mapped to U+FF5E rather than U+301C,[17] a choice which is followed by ibm-943_P15A-2003[18] but not ibm-943_P130-1999,[19] and the double-byte backslash is also mapped differently.[17]
Windows-932 includes standard 7-bit ASCII mappings for single-byte sequences with the high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS (\, the backslash) and U+007E TILDE (~) respectively,[20][21][17] as they are in ASCII (ISO-646-US). This is likewise done by the W3C/WHATWG encoding standard.[22] By contrast, 0x5C is mapped to U+00A5 YEN SIGN (¥) in ISO-646-JP and consequently JIS X 0201, of which standard Shift JIS is an extension. Correspondingly, Windows-31J avoids duplicate encoding of the backslash by mapping the double byte 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C.[17]
However, 0x5C in Windows-932 is nonetheless considered a Yen sign in certain contexts.[23] For this reason, in many Japanese fonts, U+005C is displayed as a Yen symbol, which would normally be represented as U+00A5, rather than as a backslash per Unicode's suggested rendering. U+00A5 is one-way best-fit mapped onto 0x5C in Windows-932. However, code 0x5C in Windows-932 behaves as a reverse solidus (backslash) in all respects (e.g. in file paths on Windows systems) other than how it is displayed by some fonts,[23] and Microsoft's documentation for Windows-932 displays 0x5C as a backslash.[21] This mapping[20] corresponds to the encoding named "ibm-943_P15A-2003" in International Components for Unicode (ICU),[3] except for minor reordering of a few C0 control characters.
IBM-943, like IBM-932,[5] is a superset of the single-byte Code page 897,[6] which maps 0x5C to the Yen symbol (¥) and 0x7E to the overline (‾);[24] this is followed by the encoding named "ibm-943_P130-1999" in ICU.[8] Code page 897 (and therefore also IBM-943 and IBM-932) also adds single-byte box-drawing characters replacing certain C0 control characters;[24] however, these may still be treated as control characters depending on the context,[25] and are mapped to control characters in ICU.[8]
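The Windows-932 side of these mappings can be verified with a short Python sketch using the built-in `cp932` codec. The variable names are illustrative only.

```python
# Single-byte codes follow ASCII in Windows code page 932.
backslash = b"\x5c".decode("cp932")
tilde = b"\x7e".decode("cp932")

# The double-byte code 0x815F is kept distinct from the single-byte backslash.
fullwidth_backslash = b"\x81\x5f".decode("cp932")

for label, ch in [("0x5C", backslash), ("0x7E", tilde), ("0x815F", fullwidth_backslash)]:
    print(f"{label} -> U+{ord(ch):04X}")
```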
If byte is an ASCII byte or 0x80, return a code point whose value is byte.