Unicode® Technical Note #26

On the Encoding of Latin, Greek, Cyrillic, and Han

Version	3
Authors	Ken Whistler
Date	March 7, 2023
This Version	https://www.unicode.org/notes/tn26/tn26-3.html
Previous Version	https://www.unicode.org/notes/tn26/tn26-2.html
Latest Version	https://www.unicode.org/notes/tn26/

Summary

This document discusses background information and encoding decisions pertaining to Latin, Greek, Cyrillic, and Han characters in Unicode.

Status

This document is a Unicode Technical Note. Sole responsibility for its contents rests with the author(s). Publication does not imply any endorsement by the Unicode Consortium.

For information on Unicode Technical Notes, including criteria for acceptance, see https://www.unicode.org/notes/.

The Encoding of the Latin, Greek, and Cyrillic Scripts

There are a number of very good reasons why the Latin, Greek, and Cyrillic scripts have been separately encoded, rather than being encoded as a single script.

1. Traditional graphology has always treated them as distinct scripts, while acknowledging that they are, of course, historically related. Mere historic relatedness is insufficient reason to unify scripts, however, as Latin, Greek, and Cyrillic can ultimately all trace their roots back to Phoenician, and Phoenician itself is then related to Aramaic and all its descendants, from Hebrew and Arabic to far-flung outliers like Sogdian, Uighur, and even Mongolian.

2. In the case of Latin and Greek, the distinction has existed since classical times. Cyrillic is more closely related to Greek, and in medieval manuscripts there is a fair amount of overlap in Greek and early Cyrillic writing conventions, but by the time of the development of modern typography, Greek script and Cyrillic script are clearly distinct, and their current manifestations in print usage are very different.

3. Literate users of Latin, Greek, and Cyrillic alphabets do not have cultural conventions of treating each other's alphabets and letters as part of their own writing systems. "It's all Greek to me," is not just a saying, but accurately reflects the common perception by users of any one of those scripts when presented with textual material in one of the other scripts. It's not just that the words are unfamiliar, but that the writing itself is considered alien. Most people will be able to pick out the letters which share common shapes (A, B, etc.), but the high proportion of odd-looking letters that carry no significance to users of the other scripts results in the text as a whole being treated as simply illegible. This, by the way, is one of the operative means by which distinctions in script can be identified, although it is far from being a simple, objective method that works in all instances.

4. Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the very earliest instances of such encodings. Once ASCII and EBCDIC were expanded to start incorporating Greek or Cyrillic letters, all significant instances of such encodings included a basic Latin (ASCII or otherwise) set and a full set of letters for Greek or a full set of letters for Cyrillic. Precedent for the purposes of character encoding was clearly established by those early 8-bit charsets.

5. Following on from point #4, any universal character encoding must distinguish Latin, Greek, and Cyrillic as scripts. If it did not do so, it would have insurmountable interoperability problems dealing with any of the huge amount of legacy data which already distinguished the scripts. Note that multiscript (partially) universal character encodings predating the Unicode Standard all did this. That includes IBM's registry of glyph identifiers, DEC's and Hewlett-Packard's listings of characters and glyphs, Xerox's XCCS character standard, WordPerfect's proprietary character sets, and Microsoft's and Apple's internal systems of character identifications. The library community maintains the same script distinctions in its own data formats: MARC 21 (published by the Library of Congress) and UNIMARC (published by IFLA). Even the East Asian character encodings, as they developed, also distinguished Latin, Greek, and Cyrillic. See, for instance, JIS X 0208 itself, which separately encodes Greek and Cyrillic alphabets from ASCII Latin.

6. The few character encodings that actually attempt to do a unification of Latin and Greek or Cyrillic are very special-purpose and limited in usage, and cannot interoperate well with the vast majority of text processing infrastructure. A good example of this is ETSI's GSM 03.38, which attempts to address the problem of displaying uppercase Greek on a Latin device with a 7-bit character set by unifying all the uppercase Greek letters with their Latin lookalikes and by dispensing with any support for lowercase Greek. Such schemes to unify Greek (or Cyrillic) with Latin have never spread beyond their original, limited-purpose contexts, simply because they cannot handle the requirements for more general-purpose processing.

7. In terms of implementation issues, any attempt at a unification of Latin, Greek, and Cyrillic would wreak havoc with certain required text processes. In particular, a unified encoding of Latin, Greek, and Cyrillic would make casing operations an unholy mess, in effect making all casing operations context sensitive in a way that is now restricted to a few problematical edge cases (Turkish i's, Greek sigmas).
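
The casing point in item 7 can be illustrated with a short Python sketch that uses only the standard library. It is an illustration rather than part of the original argument: it takes three lookalike capital letters, one from each script, and shows that each has its own code point and its own lowercase mapping. The helper name `describe` is hypothetical.

```python
import unicodedata

# Three capital letters that look alike but belong to different scripts.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, character name, code point of the lowercase form)."""
    lower = ch.lower()
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), f"U+{ord(lower):04X}")


for ch in LOOKALIKES:
    print(describe(ch))
```

Because each capital has its own code point, lowercasing needs no knowledge of the surrounding text. Under a hypothetical unified encoding, a single code point for the shared shape would have to lowercase to a, α, or а depending on context.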

The Encoding of the Han Script

Now by way of contrast, consider the issue for the Han script.

1. Graphologically, the Han script ("Chinese characters") has long been considered a single script, adapted for use by neighboring cultures, but not separated into distinct scripts by such usage. Historically very early versions of the Chinese character usages (e.g., the Small Seal script) probably rightly qualify as distinct scripts, but such distinctions are irrelevant to the status of Han synchronically.

2. This identity of the Han script has been perpetuated by the historically more or less continuous cultural preeminence of China in East Asia over the course of millennia, and by the political usage to which successive Chinese empires have put Chinese writing: using the single written form of Chinese as a way of covering many, many distinct Chinese languages in a single Han cultural identity. The imperial spread of Han writing through East Asia mirrors, in many ways, the imperial spread of Latin script in the Western world, where spread of the Latin alphabet from language to language and ethnic group to ethnic group, first by the Roman Empire and much later by the Western European empires, did not result in fractionating the script itself, but rather the widespread usage of the single script, and its elaboration by the addition of new ideographs (for Han) and new letters (for Latin) as new demands were placed upon it. (A similar pattern can be seen in the spread of the Arabic script around the world.)

3. The important non-Han peoples who adapted Chinese writing in their own culture (most notably Korea, Japan, and Vietnam) continued to view the Han characters as Chinese writing, as demonstrated even by the name of the script in each of these countries, being literally "Chinese character". And rather than simply adopting the script at one point and then evolving it off in some independent direction, the typical pattern for each of these cultures was over the centuries to keep adding to the store of Han ideographs they used by continued borrowing of large new sets of them directly from China.

4. The major exception in this developmental pattern, in Japan, actually speaks, to the contrary, to the continued unitary nature of the Han script itself. In Japan, a highly cursive style of writing Chinese characters for Japanese sounds, as opposed to borrowed Chinese vocabulary, a style called manyooshuu, was simplified down into a set of conventional syllabic symbols for Japanese alone. That clearly was the development of a new script, which came to be known as Hiragana, from the Han script. But separately and simultaneously, in Japan, the Han characters themselves ("kanji" in Japanese) continued to be written in the traditional Chinese way.

5. Unlike the case for Latin, Greek, and Cyrillic, there is a longstanding cultural tradition in Japan, China, and Vietnam, of viewing "Chinese characters" as being of shared identity throughout the region. Japanese cannot "read" Chinese (after all, it is a completely different and utterly alien language to Japanese speakers) any more than English speakers can read Tagalog written in the Latin alphabet. But they do recognize that the Chinese characters themselves are shared and in fact can recognize much of the shared vocabulary that was originally borrowed into Japanese from Chinese, in the same way that English speakers recognize much French vocabulary.

6. There is a lot of confusion that results among those not intimately familiar with East Asian languages and writing systems because of the fact that the writing systems for Japan, Korea, China, and Vietnam are completely different, at the same time that they all include, as parts of those writing systems, a single shared Han script. There is no question that the Japanese writing system as a whole is very, very different from the Chinese writing system. But Japanese kanji as a portion of the Japanese writing system constitute the same script as Chinese hanzi functioning as the main part of the Chinese writing system.

7. Another issue that leads to controversy over "Han unification" in East Asia tends to result from considerations of font styles and character variants. The style issue results largely from the fact that Japan has traditionally been an extremely conservative country, going through a long, deliberately isolationist period before the Meiji reformations. As a result, Japan tended to preserve, in its Buddhist and other literary traditions, forms of Chinese that dated all the way back to Tang, Sung, and Ming dynasty material. In the meantime, China itself was busy undergoing vast revolutions and upheavals and changing rulers from one ethnicity to entirely different ones (Mongols ruled China in one dynasty, Manchus in another). During this time, Chinese writing kept innovating, while the forms carried to Japan tended to stay more conservative.

Notwithstanding the systematic variation sometimes seen between more conservative character forms in Japan and typographically distinct forms seen in China, the typical range of variation among glyphs across all the CJK user communities is well within the bounds of typical variation seen within other scripts. This fact, which is noted in the JIS X 0208 standard itself, forms the basis for the principles upon which characters are identified as being the "same character" in Japanese, Chinese, and Korean sources.

8. In the 20th century, we encounter the most extreme form of stylistic innovation in China, when as a result of educational policy after the Communist revolution in the PRC, a deliberate and very widespread process of orthographic simplification and reform was imposed all over China. Those changes were not adopted outside of China in Japan, nor even in Taiwan and Hong Kong. That resulted in a sharp split in Han script usage ("simplified" versus "traditional"). But even that cannot be considered enough to have created a new, distinct Han script. The reason is that even in the PRC, the new, simplified forms were always treated as alternative forms of the traditional characters, often printed alongside them in reference works. There has been a continuous adjustment of writing in China, as more characters get simplified, but some simplifications are abandoned in favor of more traditional forms, and so on. Many Chinese end up, for one reason or another, simply having to learn both the traditional and the simplified forms of characters, and read them as alternative glyphs for the same character, implicitly within the same overall Han script.

9. When it comes to character encoding decisions made in East Asia, it is also clear that Han characters have in almost all cases been considered to constitute a single script, rather than distinct scripts per East Asian country. Japanese standards were early on devoted to encoding those Chinese characters needed for Japanese. And Chinese standards focussed on that subset of Chinese characters needed for Chinese. But later on, the standards on both sides expanded as the Japanese standards added characters from China and the Chinese standards added characters from Japan. In neither case did these additions follow the pattern seen when Greek or Cyrillic were added to early Latin character encodings. Instead, in both instances, it was just a matter of adding X more thousand Han characters into the big tables that already consisted of thousands of Han characters.

10. The effort to "unify" the encoding of Han characters in 10646 and the Unicode Standard has been misunderstood by some as an attempt to intermingle an essentially different Japanese writing system and a Chinese writing system, as if there was some kind of enforced miscegenation underway. But the correct way to interpret what went on was rather simply the avoidance of duplicate encoding of the same Han characters represented in several different East Asian standards. This process was well-understood by the actual national standards participants from Japan, Korea, China, and other countries, who all along have been doing the major work involved in minimizing the amount of duplicate encoding of what all the committee members fully agree is the same character.

The analogy to bring to bear when considering "Han unification" is not a picture of trying to unify a Latin encoding, a Greek encoding, and a Cyrillic encoding based on character shape alone, but instead, unifying an ASCII (Latin) encoding, an EBCDIC (Latin) encoding, the Latin portion of the JIS X 0208 Japanese standard, and the Latin portion of the GB 2312 Chinese standard. There would be no point in encoding the same Latin character four times in Unicode simply because it appeared in ASCII, EBCDIC Code Page 300, JIS X 0208, and GB 2312. The exact same logic was applied to the Han characters in the various East Asian standards when encoding decisions were taken about encoding the Han script.

11. In terms of implementation issues, an encoding approach towards Han characters that did not unify the same characters from Japanese, Chinese, Korean (and other) source standards would need to carry expensive (both in memory cost and in maintenance cost) equivalence tables around merely to make the unification happen on the fly, before text searching or nearly any other text process of interest could be done.

12. For more information about how Han characters from different East Asian legacy source standards were identified as being the same character for encoding purposes in the Unicode Standard, see the detailed discussion in Section 18.1, "Han".
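
The avoidance of duplicate encoding described in points 10 and 11 can be sketched with Python's built-in codecs. This is an illustration under stated assumptions, not the procedure used by the standards committees: it picks one common character, 中 (U+4E2D), and treats Python's `shift_jis`, `gb2312`, `big5`, and `euc_kr` codecs as stand-ins for Japanese, Chinese, and Korean source standards. The helper name `round_trip` is hypothetical.

```python
# Legacy East Asian encodings (Python codec names) that all contain 中.
LEGACY_CODECS = ["shift_jis", "gb2312", "big5", "euc_kr"]

HAN_CHAR = "\u4e2d"  # 中


def round_trip(ch: str, codec: str) -> tuple[bytes, str]:
    """Encode ch in a legacy codec, then decode the bytes back to Unicode."""
    raw = ch.encode(codec)
    return raw, raw.decode(codec)


for codec in LEGACY_CODECS:
    raw, decoded = round_trip(HAN_CHAR, codec)
    # Each legacy standard has its own byte sequence for the character,
    # but every one of them decodes to the same Unicode code point.
    print(f"{codec:10} {raw.hex():6} -> U+{ord(decoded):04X}")

# The Latin analogy from the text: the ASCII byte 0x41 and the EBCDIC
# (cp037) byte 0xC1 both decode to the single code point U+0041.
print(b"\x41".decode("ascii") == b"\xc1".decode("cp037"))
```

Since every source encoding maps to one code point, a search for 中 needs no cross-standard equivalence table, which is the implementation cost that point 11 attributes to a non-unified encoding.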

Ideographic Research Group

Information about the Ideographic Research Group, which has primary responsibility for development of the repertoire of Han ideographic characters for encoding in 10646 (and the Unicode Standard):

The Ideographic Research Group (IRG) is a group reporting to ISO/IEC JTC1/SC2/WG2. It focuses on the development of ideographic characters (Han characters used in China, Japan, Korea and other parts of Asia) in the ISO/IEC 10646 standard. Its mission is to submit ideographic characters for inclusion in the ISO/IEC 10646 standard. The IRG has developed the CJK Unified Ideographs Block and CJK Unified Ideographs Extensions A through H. IRG members have included China, Hong Kong SAR, Macao SAR, Taipei Computer Association, Singapore, Japan, SAT (Saṃgaṇikīkṛtaṃ Taiśotripiṭakaṃ Daizōkyō Text Database Committee), South Korea, North Korea, Vietnam, United Kingdom, and the USA. Representatives from the Unicode Consortium also attend IRG meetings for coordinating the synchronization between the ISO/IEC 10646 standard and the Unicode Standard. See IRG for more details.

Modifications

The following summarizes modifications from the previous version of this document.

Version 3

Style cleanup.
Updated links to https.
Updated reference to Section 18.1 Han in the core specification.
Updated links and information about the IRG.

Version 2

Updated links and information about the IRG.

Version 1

Initial version.

© 2006–2023 Ken Whistler. This publication is protected by copyright, and permission must be obtained from the author and Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use.

Use of this publication is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.
