
stuff i've learned through creating Shodouka

Japanese text encoding

Japanese character sets

There are a number of Japanese character set standards, all of which are identified by a code starting with "JIS", which stands for "Japanese Industrial Standard". The most widely used Japanese character set is known as JIS X 0208-1990. It includes 6879 characters, among which are the hiragana and katakana syllabaries, 6355 kanji, the Roman, Greek, and Cyrillic alphabets, the numerals, and a number of typographic symbols. The characters are arranged in a 94-by-94 grid, usually addressed by a row number from 33 to 126 and a column number from 33 to 126. In most common discussion, "JIS" when not followed by a particular standard number refers to the JIS X 0208-1990 character set.

Japanese transfer encodings

With JIS X 0208-1990, there are many more distinct characters than can possibly fit in a single byte. So the solution is to use an encoding scheme to send each value as two bytes. Because a lot of communication on the Internet still takes place in ASCII, it is also desirable to encode JIS in such a way that it can be distinguished from ASCII. There are a few different ways to do this.

Please note that while there are three encoding schemes, all of them encode (effectively) the same character set. Be sure you understand the difference between a character set and an encoding scheme before you go on.
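To make the distinction concrete, here is a small sketch that encodes one character three ways. It assumes Python 3 and leans on the codec names `iso2022_jp`, `euc_jp`, and `shift_jis` from Python's standard library; these are not part of the original discussion, just a convenient way to see one character set produce three different byte sequences.

```python
# One JIS X 0208 character, encoded with three different schemes.
text = "日"

encoded = {
    name: text.encode(name)
    for name in ("iso2022_jp", "euc_jp", "shift_jis")
}

for name, data in encoded.items():
    # Print each byte as two hex digits so the schemes can be compared.
    print(f"{name:11}", " ".join(f"{b:02x}" for b in data))

# Decoding each byte string with its own scheme gives the same character back.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

The bytes differ in every case, yet all three decode to the same character: the character set is what is being said, the encoding scheme is how it travels.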

ISO-2022-JP (JIS) encoding

ISO-2022 defines a standard way to send data in multiple character sets when the transmission medium supports 7-bit bytes. This is done by including "escape sequences" in the text; that is, special codes that indicate a switch between character sets. Each escape sequence begins (take a wild guess!) with the "escape" character ($1b). There are many registered escape sequences for different character sets and languages; ISO-2022-JP recognizes a subset of these escape sequences relevant to Japanese.

sequence      hex values       effect
Esc ( B       $1b $28 $42      switch to ASCII
Esc ( J       $1b $28 $4a      switch to JIS Roman (JIS X 0201-1976)

JIS Roman runs from 0 to $7f and is identical to ASCII except for a few minor differences (notably, the backslash at 92 is instead a yen symbol, and the tilde at 126 is replaced by an overbar). For most practical purposes, JIS Roman and ASCII can be considered the same, so both these escape sequences can be treated as a switch to ASCII.

sequence      hex values       effect
Esc $ @       $1b $24 $40      switch to JIS C 6226-1978
Esc $ B       $1b $24 $42      switch to JIS X 0208-1983

Both JIS C 6226-1978 and JIS X 0208-1983 are earlier versions of JIS X 0208-1990. For most practical purposes, both these escape sequences can be treated as a switch to JIS X 0208-1990.

Typically, then, Japanese text appears enclosed by two escape sequences: either Esc $ @ or Esc $ B at the beginning, and either Esc ( B or Esc ( J at the end. The text itself between the escape sequences consists of pairs of plain 7-bit bytes in the printable range from $21 to $7e, simply formed by splitting apart the JIS value into two bytes, also known as "raw JIS". Because the data itself matches the original JIS character numbers, the ISO-2022-JP encoding method is also known as "JIS encoding" (not to be confused with the "JIS character set"!). The figure shows the encoding range for this method, with each pixel corresponding to one possible combination of first byte (j) and second byte (k). The pixel colours describe conversion to another system; read on.
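The escape-sequence bookkeeping described above can be sketched as a small scanner. This is only an illustration under simple assumptions: the names `ESCAPES` and `split_iso2022jp` are made up for this example, only the four escape sequences from the tables are recognised, and text is assumed to start in ASCII.

```python
ESC = 0x1B

# The four escape sequences from the tables, mapped to the mode they select.
ESCAPES = {
    b"\x1b(B": "ascii",  # Esc ( B
    b"\x1b(J": "ascii",  # Esc ( J  (JIS Roman, treated here as ASCII)
    b"\x1b$@": "jis",    # Esc $ @
    b"\x1b$B": "jis",    # Esc $ B
}

def split_iso2022jp(data):
    """Return a list of ("ascii", byte) and ("jis", (j, k)) items."""
    out = []
    mode = "ascii"  # assumption: the stream starts in ASCII
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = bytes(data[i:i + 3])
            if seq not in ESCAPES:
                raise ValueError(f"unrecognised escape sequence at offset {i}")
            mode = ESCAPES[seq]
            i += 3
        elif mode == "jis":
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            # Two plain 7-bit bytes make one JIS character ("raw JIS").
            out.append(("jis", (data[i], data[i + 1])))
            i += 2
        else:
            out.append(("ascii", data[i]))
            i += 1
    return out

sample = b"A\x1b$BF|K\\\x1b(BZ"
items = split_iso2022jp(sample)
```

Here `sample` is an "A", two JIS characters between Esc $ B and Esc ( B, and a "Z"; the scanner returns the two ASCII bytes and the two (j, k) pairs in order.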

EUC-JP encoding

EUC, or Extended Unix Code, takes advantage of mediums that support 8-bit bytes. It's a very simple and straightforward solution: to distinguish Japanese characters from ASCII, simply add 128 to each JIS byte by setting its highest bit.

If j and k are the original JIS values and e and f are the transmitted EUC bytes, then

e = j + 128

f = k + 128

This pushes all the EUC codes up into the top half of the 8-bit range. They land from $a1 to $fe, where they have no chance of getting confused with ASCII codes from 0 to $7f. Nice and easy.
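The two formulas above can be sketched in a few lines. The function names `jis_to_euc` and `euc_to_jis` are made up for this illustration, and the range checks are an added assumption about what counts as valid input.

```python
def jis_to_euc(j, k):
    """Convert a pair of raw JIS bytes to EUC-JP bytes by setting the high bit."""
    if not (0x21 <= j <= 0x7E and 0x21 <= k <= 0x7E):
        raise ValueError("JIS bytes must be in the range $21 to $7e")
    return j | 0x80, k | 0x80  # same as adding 128 to each byte

def euc_to_jis(e, f):
    """Convert a pair of EUC-JP bytes back to raw JIS by clearing the high bit."""
    if not (0xA1 <= e <= 0xFE and 0xA1 <= f <= 0xFE):
        raise ValueError("EUC bytes must be in the range $a1 to $fe")
    return e & 0x7F, f & 0x7F  # same as subtracting 128 from each byte

e, f = jis_to_euc(0x46, 0x7C)
```

For the JIS value $467c this gives $c6 $fc, and `euc_to_jis` takes it straight back.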

Shift-JIS encoding

This encoding method is easily the messiest of the three. It's also known as SJIS or MS Kanji. It was dreamed up by some folks at Microsoft for the Japanese support in Japanese versions of their operating systems and software, and it's very ugly. This method also requires an 8-bit medium, but doesn't behave by keeping everything neatly above the 128 mark. Instead, you are only guaranteed that the first of each pair of bytes is above 128; bets are off for the second.

The JIS values get all rearranged in order to reserve the range $a1 to $df for a set of 63 half-width katakana; to accomplish this, the characters are squashed into half as many columns (values for the first byte) but twice as many rows (values for the second byte). As it turns out, these half-width katakana are rarely used anyway.

The figure shows the encoding ranges for JIS: the first byte will land either from $81 to $9f or from $e0 to $ef, and the second byte will land either from $40 to $7e or from $80 to $fc. The 94 possible values of the first JIS byte pair up evenly into 47 shift-JIS first bytes: 31 from $81 to $9f and 16 from $e0 to $ef.

The colours of the pixels in these three maps illustrate how to perform the necessary contortions to convert to or from shift-JIS. If you look closely at the maps for ISO-2022-JP and EUC, you'll see that the squares (apart from being split between red and blue at 95 and 223) actually have alternating dark and light columns. Each pair gets joined into one long column for shift-JIS, as the colours in this map demonstrate. If j and k are the original JIS values and s and t are the transmitted shift-JIS bytes, then, with the division rounding down,

when j is from 33 to 94 (red/orange), s = (j+1)/2 + 112
when j is from 95 to 126 (blue/turquoise), s = (j+1)/2 + 176
when j is odd (red/blue), t = k + 31, plus one more if k > 95
when j is even (orange/turquoise), t = k + 126

Whew.
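Written out as code, the four rules are less frightening than they look. This is a minimal sketch, not a full converter: the name `jis_to_sjis` is made up for this example, it assumes the split between the two first-byte ranges falls at j = 95, and it uses integer division so that (j+1)/2 rounds down.

```python
def jis_to_sjis(j, k):
    """Convert a pair of raw JIS bytes (j, k) to shift-JIS bytes (s, t)."""
    if not (33 <= j <= 126 and 33 <= k <= 126):
        raise ValueError("JIS bytes must be in the range 33 to 126")

    # First byte: each pair of neighbouring j values shares one s.
    # j from 33 to 94 lands in $81-$9f, j from 95 to 126 lands in $e0-$ef.
    s = (j + 1) // 2 + (112 if j < 95 else 176)

    if j % 2 == 1:
        # Odd j: second byte lands in $40-$7e or $80-$9e (skipping $7f).
        t = k + 31 + (1 if k > 95 else 0)
    else:
        # Even j: second byte lands in $9f-$fc.
        t = k + 126
    return s, t

s, t = jis_to_sjis(0x46, 0x7C)
```

For the JIS value $467c this gives $93 $fa. Note that the second byte can come out as low as $40, which is squarely in ASCII territory; that is the "bets are off" part.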

All Together Now

So what happens when you receive some arbitrary document and you've got to figure out how to interpret it? Have a look at the page about working with three encoding schemes on the WWW.
