Preface

This is my small review of the universe of 8-bit Cyrillic encodings. I didn't try to cover encodings that aren't interesting to me (e.g. Tajik); if you don't find enough information here, go to the links section. In this article, "encoding" is used as an alias for the more standard "character set". I hope this won't be a problem until the next major rewrite.

As of 2016, this material is becoming more and more historic. There are some areas where 8-bit encodings are still in use (for example, I have a FidoNet point and keep reading and writing there), but the majority has already moved to Unicode. I appreciate this process.

All pictures show the top half (0x80-0xFF) only. The range 0x00-0x7F is the same as in us-ascii and iso-8859-1. Most pictures were obtained from Andrew Porokhnyak and Fingertipsoft, thanks to both :)

Encoding groups:

  1. KOI8 group
  2. ALT group
  3. ISO
  4. Microsoft CP1251

KOI group

The KOI8 group was the most widespread one for a long time in the traditional Russian and Ukrainian Internet for historical reasons: it was used in the first localizations of Unix systems. KOI stands for the Russian abbreviation of "Information exchange code". The group currently consists of at least the encodings described below: KOI8-R, KOI8-U, KOI8-RU, KOI8-F and ISO-IR-111.

The KOI group falls into the KOI8 group and the KOI7 group (now historic). KOI7 encodings were used on RSX-11, RT-11 and similar systems. All KOI8 encodings have identical contents in codes 0x00-0x7F (the same as in US-ASCII) and 0xC0-0xFF (32 Russian letters, i.e. the full alphabet without Io/io, in both cases). The order of the Russian letters isn't alphabetical, but bound to the order of the Latin letters with the same or similar pronunciation. Unrelated letters are bound in an almost arbitrary way (Ю(Yu) - @, Я(Ya) - Q, Э(E) - \). Also, capital letters are placed after small ones; this is a compatibility issue with the KOI7 encodings. In the original KOI8, the contents of 0x80-0xBF are either absent altogether (in the 8-bit interpretation) or identical to 0x00-0x3F (in the 7-bit interpretation). Different encodings in the KOI8 group define the contents of the 0x80-0xBF area in very different ways.
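The binding between Russian and Latin letters can be seen by clearing the 8th bit of KOI8 text. The following is a minimal sketch that assumes Python's built-in koi8_r codec as a stand-in for the common 0xC0-0xFF layout; the helper name to_7bit is hypothetical and not part of any standard.

```python
def to_7bit(text: str) -> str:
    """Encode text as KOI8-R, clear the high bit of every byte, read the result as ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


# A Russian word stays roughly readable, with the letter case inverted.
print(to_7bit("Привет"))  # pRIWET

# The "unrelated" letters land on punctuation and Q.
for letter in "юяэ":
    print(letter, "->", to_7bit(letter))

# Small letters come before capital ones (the reverse of ASCII).
lower_a = "\u0430".encode("koi8_r")[0]  # CYRILLIC SMALL LETTER A
upper_a = "\u0410".encode("koi8_r")[0]  # CYRILLIC CAPITAL LETTER A
print(hex(lower_a), hex(upper_a))
```

The inverted case in the output follows from small letters being placed in 0xC0-0xDF, which maps onto the uppercase Latin row 0x40-0x5F once the high bit is dropped.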

The KOI group originates from the Soviet standard GOST 19768-74, which defined three KOI7 variants and one KOI8. Don't confuse it with GOST 19768-87, which defines completely different encodings (see ISO-8859-5 and ALT group below).

A few words about encodings not listed here. DIS-8859-5 is generally known as another name for KOI8-R, but I suppose it defined only the standard letter group and Io/io. One KOI8-C is known to me as a rarely seen mix of KOI8 in 0xC0-0xFF and CP1251 in 0x80-0xBF; I haven't seen any standard for it. There is another KOI8-C which added letters for the old (pre-1918) Russian alphabet, and also for most Cyrillic-based Slavic alphabets; use Google for details. KOI8-RUB is yet another invention to support Ukrainian and Belorussian letters, not popular now. This list isn't complete...

The following encodings: KOI8-K1, KOI8-L2, KOI8-CS2 are not Cyrillic; they were created for the Czech and Slovak languages. The common name "KOI8" was used for them due to socialist camp traditions. They don't fit the requirements listed above for Cyrillic KOI8 encodings.

KOI8-R

KOI8-R is the first IANA-standardized encoding in this group; it is defined in IETF RFC 1489. IANA alias: csKOI8R. It was the only encoding used on the Russian Internet in the mid-1990s and is still widespread (but less and less so, giving way to cp1251 and Unicode). It is suitable for the English and Russian languages. It isn't suitable even for Ukrainian; the KOI8-R developer, Andrey Chernov, wanted pseudographic characters instead of additional letters.

KOI8-R is also known as CP878 in OS/2 and as CP20866 in Windows.

koi8-r.gif

KOI8-U

KOI8-U is a modification of KOI8-R that adds the Ukrainian letters. The first versions of KOI8-U appeared in 1992, and a fairly complete localization package for Unix systems has been known since 1994, but nobody tried to codify it for the rest of the world until 1997, after the appearance of the KOI8-RU draft (see below). The standard source is IETF RFC 2319. Now it is the de facto standard for the Ukrainian Internet.

koi8-u.gif
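The trade-off between pseudographics and Ukrainian letters can be inspected directly. The sketch below assumes that Python's built-in koi8_r and koi8_u codecs follow RFC 1489 and RFC 2319; the helper names decode_byte and encodable are hypothetical.

```python
import unicodedata

# The Ukrainian letters that KOI8-U adds on top of KOI8-R:
# ie, i, yi, ghe with upturn (small, then capital).
UKRAINIAN_EXTRA = "\u0454\u0456\u0457\u0491\u0404\u0406\u0407\u0490"


def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte value with the given codec."""
    return bytes([value]).decode(codec)


def encodable(text: str, codec: str) -> bool:
    """Return True if the codec can represent every character of text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Positions in the upper half where the two codecs disagree.
diff = [b for b in range(0x80, 0x100)
        if decode_byte(b, "koi8_r") != decode_byte(b, "koi8_u")]

for b in diff:
    r, u = decode_byte(b, "koi8_r"), decode_byte(b, "koi8_u")
    print(f"0x{b:02X}: KOI8-R {unicodedata.name(r)} | KOI8-U {unicodedata.name(u)}")

print("KOI8-R can encode the Ukrainian letters:", encodable(UKRAINIAN_EXTRA, "koi8_r"))
print("KOI8-U can encode the Ukrainian letters:", encodable(UKRAINIAN_EXTRA, "koi8_u"))
```

Each Ukrainian letter in KOI8-U replaces a box-drawing character of KOI8-R, so text in one encoding read as the other stays mostly legible.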

KOI8-RU

KOI8-RU was invented as a private initiative of Yuri Demchenko from Kiev Polytechnic Institute to provide a KOI8-R-compatible encoding with the letters of the Slavic ex-USSR Cyrillic alphabets (Ukrainian, Belorussian), with positions borrowed from ISO-IR-111. In 1997 support for this encoding was added to Microsoft Outlook Express. This charset wasn't supported by the Ukrainian Internet community because the uncodified but already used KOI8-U existed; the latter was pushed to the IETF instead. It is not registered at IANA, but it is supported by GNU iconv.

Microsoft defined CP21866 as KOI8-U, but for a long time it really was KOI8-RU. In practice, the difference between them is small enough that they are easily mixed up.

koi8-ru.gif

KOI8-F (KOI8-unified)

KOI8-F is an innovation of Fingertipsoft which contains all letters of the Russian, Ukrainian, Belorussian and Serbian alphabets. It isn't known to IANA or Windows, but it is supported by the newest Perl and used in some IRC networks because it covers letters from all Cyrillic charsets.

Original definition page.

koi8-uni.gif

ISO-IR-111

ISO-IR-111 (aliases: ECMA-Cyrillic, KOI8-E, ECMA-113:1986) is the ECMA- and ISO-standardized Cyrillic encoding of the KOI8 group. (Don't confuse it with ECMA-113:1988, which is effectively ISO-8859-5.) While keeping KOI8-R compatibility in the Russian letters, it defines many additional letters for the Ukrainian, Belorussian and Serbian alphabets. But it doesn't contain the Ukrainian "ghe with upturn" and so has limited value for Ukrainian.

ISO-IR-111 has a problem in the IETF definitions: see the "ISO-IR-111 sore" letter by Michael Sokolov. In short: while the ISO/ECMA definition really is a KOI8 encoding, RFC 1345 has an erroneous definition of a completely different encoding (identical to CP1251 in 0xC0-0xFF). This means a high probability of implementations which erroneously use a different encoding under the name ISO-IR-111 or ECMA-Cyrillic.

iso-ir-111.gif

ALT group

The PC adaptation of the Soviet standard GOST 19768-87 defined new encodings, "main" ("osnovnaya") and "alternative" ("alternativnaya"), in order to provide compatibility for the new generation of computers based on IBM PC clones. The main idea was that the "main" encoding should be used for home-grown programs and the "alternative" one for programs developed outside the USSR. Formally they were created in the following way:

The "main" encoding died very quickly because it was incompatible with the huge flow of programs developed outside the (ex-)USSR, but ISO-8859-5 is based on it. The "alternative" encoding, on the other hand, gave rise to a bunch of encodings compatible with cp437 and so with IBM PC pseudographics. The most used now are cp866 and ruscii.

Picture for the "main" encoding of Soviet IBM PC clones. This picture is somewhat broken because it shows Io/io in 0xF0/0xF1. This differs from the original encoding, which had the same symbols there as cp437: ≡ (U+2261) and ± (U+00B1).

soviet-main.gif

Picture for the "alternative" encoding of Soviet IBM PC clones. This picture is somewhat broken because it shows Io/io in 0xF0/0xF1. This differs from the original encoding, which had the same symbols there as cp437: ≡ (U+2261) and ± (U+00B1).

soviet-alt.gif

CP866

CP866 is Microsoft's invention based on the PC clones' "alternative" encoding. It has some extensions starting at 0xF2: Ukrainian Ie/ie, Ukrainian Ji/ji and Belorussian short U/u. It lacks the Ukrainian "ghe with upturn", which hadn't been officially restored yet at that time, and the Ukrainian/Belorussian I/i, which was supposed to be unnecessary given the Latin I/i.

cp866.gif
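The compatibility of the "alternative" family with cp437 pseudographics can be checked on CP866. This is a minimal sketch that assumes Python's built-in cp866 and cp437 codecs; the variable names are illustrative only.

```python
# The pseudographics block of cp437: shades, box drawing and block elements.
BOX_RANGE = bytes(range(0xB0, 0xE0))

# True when CP866 keeps the whole block byte-for-byte identical to cp437.
same_graphics = BOX_RANGE.decode("cp866") == BOX_RANGE.decode("cp437")
print("0xB0-0xDF identical to cp437:", same_graphics)

# The Russian letters are split around that block.
cyrillic_low = bytes(range(0x80, 0xB0)).decode("cp866")
cyrillic_high = bytes(range(0xE0, 0xF0)).decode("cp866")
print(cyrillic_low)
print(cyrillic_high)
```

Because 0xB0-0xDF is left untouched, DOS programs that draw frames with cp437 characters look the same under this encoding, at the price of splitting the alphabet into two ranges.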

RUSCII

RUSCII (a.k.a. IBM CP1125, a.k.a. x-cp866-u in UUPC/Ache) is the Ukrainian government standard (RST 2018-91) for DOS, based on the common "alternative" encoding but different from cp866 in 0xF2-0xF9. FreeBSD also has console fonts for it (cp866u-*) and a map file (koi8-u2cp866u). It is known to GNU iconv as CP1125.

It seems this encoding is also known as CP866NAV in TeX and Emacs, and as CP866NAV/IBM866NAV/866NAV in newer GNU iconv. It is incompatible with CP866 in the definition of the Ukrainian letters, which caused some mess between the encodings.
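The incompatibility between CP866 and RUSCII can be listed position by position. The sketch below assumes that Python's cp1125 codec (available since Python 3.4) corresponds to RUSCII as described above; the helper name table is hypothetical.

```python
def table(codec: str) -> str:
    """Return the upper half (0x80-0xFF) of a single-byte codec as a string."""
    return bytes(range(0x80, 0x100)).decode(codec)


cp866, cp1125 = table("cp866"), table("cp1125")

# Byte positions where the two code pages disagree.
diff = [0x80 + i for i, (a, b) in enumerate(zip(cp866, cp1125)) if a != b]

for pos in diff:
    print(f"0x{pos:02X}: cp866 {cp866[pos - 0x80]!r}  cp1125 {cp1125[pos - 0x80]!r}")

# The same Ukrainian letter (capital Ie, U+0404) gets different bytes.
ie_cp866 = "\u0404".encode("cp866")
ie_cp1125 = "\u0404".encode("cp1125")
print(ie_cp866, ie_cp1125)
```

A Ukrainian DOS text written in one of the two and displayed in the other therefore shows wrong letters rather than obvious garbage, which is easy to overlook.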

ruscii.gif

ISO-8859-5

ISO-8859-5 is the ISO standard Cyrillic charset. The symbol range 0xB0-0xEF is the same as in the "GOST main" encoding (see ALT group), due to its history. The ranges 0xA0-0xAF and 0xF0-0xFF contain many symbols of different Cyrillic alphabets, including Ukrainian, Belorussian and Serbian. As with ISO-IR-111, it doesn't contain the Ukrainian "ghe with upturn".

Its use on the Internet and elsewhere in practice is very limited; really, it was only a source of pain, because no really widespread systems or classes of systems used it (they used Alt, KOI8-*, cp1251 instead). The only class of systems known to me that used it is big DBMSes (DB/2, Oracle), but administrators systematically patched them to support more traditional encodings. In modern jargon, ISO-8859-5 is an "epic fail". On the other hand, the Cyrillic section of Unicode copies its main part (0xA0-0xFF) to U+0400...U+045F with minor changes.
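The relation between the upper part of ISO-8859-5 and the Unicode Cyrillic block can be verified with a constant offset. This is a sketch assuming Python's built-in iso8859_5 codec; the constant name OFFSET is introduced here for illustration.

```python
# Byte 0xA1 (Io) maps to U+0401, so the assumed constant shift is 0x0360.
OFFSET = 0x0360

# Bytes in 0xA0-0xFF whose code point is NOT simply byte + OFFSET.
exceptions = {}
for byte in range(0xA0, 0x100):
    ch = bytes([byte]).decode("iso8859_5")
    if ord(ch) != byte + OFFSET:
        exceptions[byte] = ch

print(f"shifted range: U+{0xA0 + OFFSET:04X}..U+{0xFF + OFFSET:04X}")
for byte, ch in exceptions.items():
    print(f"0x{byte:02X} -> U+{ord(ch):04X}")
```

The exceptions are the four non-letter positions (no-break space, soft hyphen, numero sign and section sign); every letter follows the shift.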

IANA alias: Cyrillic. See also the dramatic history of the ECMA/ISO charset mutation. I have said it already: GOST 19768-87 defined a totally different encoding from the one in the previous GOST 19768-74. Don't ever confuse them.

Aliases: ISO-IR-144, ISO_8859-5, ISO_8859-5:1988.

IBM name: CP915.

Windows name: CP28595.

iso-8859-5.gif

ISO-IR-153 is a "restricted" variant of ISO-8859-5: it defines only 0xB0-0xEF, 0xA1 (Io) and 0xF1 (io). It is often erroneously named GOST_19768-74.

CP1251

CP1251 was invented by Microsoft and ParaGraph (Moscow) as the Cyrillic encoding for Windows. Legend has it that it was initially produced by applying a cp437 -> iso-8859-1 conversion to an early version of the CP866 encoding, to simplify conversion of DOS documents whose encoding can't be determined. It contains most of the additional symbols for the Ukrainian, Belorussian and Serbian alphabets. Among 8-bit encodings, it is now one of the most popular for Russian and Ukrainian and the most popular for Belorussian and Bulgarian, and it is used de facto in some areas (e.g. the ICQ IM network, video subtitles...).

IANA name: windows-1251

cp1251.gif
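The coverage claims made for the individual encodings can be compared side by side. The sketch below uses only codecs shipped with Python (so KOI8-RU, KOI8-F and ISO-IR-111 are left out), and the letter sets are my own selection of the letters each alphabet needs beyond the Russian ones; the names EXTRA_LETTERS, CODECS and missing are hypothetical.

```python
# Letters each alphabet needs in addition to the basic Russian set
# (written as escapes to avoid confusion with look-alike Latin letters).
EXTRA_LETTERS = {
    "Ukrainian": "\u0490\u0491\u0404\u0454\u0406\u0456\u0407\u0457",
    "Belorussian": "\u040e\u045e\u0406\u0456",
    "Serbian": "\u0402\u0452\u0408\u0458\u0409\u0459\u040a\u045a\u040b\u045b\u040f\u045f",
}
CODECS = ["koi8_r", "koi8_u", "cp866", "cp1125", "iso8859_5", "cp1251"]


def missing(letters: str, codec: str) -> str:
    """Return the letters that the codec cannot encode."""
    out = []
    for ch in letters:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            out.append(ch)
    return "".join(out)


coverage = {
    codec: {lang: missing(letters, codec) for lang, letters in EXTRA_LETTERS.items()}
    for codec in CODECS
}

for codec, langs in coverage.items():
    report = ", ".join(f"{lang}: {gaps or 'ok'}" for lang, gaps in langs.items())
    print(f"{codec:10} {report}")
```

An empty string for a language means the encoding has every extra letter of that alphabet; a non-empty one lists what is absent.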

Links

Other links for this problem:

© 2001-2017 text by Valentin Nechayev

This page may be fully or partially cited provided a link to the original location is given, and may be linked without any limitations. One can also reuse it, except for the pictures, under the GNU Free Documentation License or a Creative Commons License.

