Unicode 4.1.0
Version 4.1.0 has been superseded by the latest version of the Unicode Standard.
Version 4.1.0 of the Unicode Standard consists of the core specification, The Unicode Standard, Version 4.0, as amended by Unicode 4.0.1 and further amended by this specification, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD). The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.
Version 4.1.0 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 4.1.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/) and by Unicode 4.1.0 (http://www.unicode.org/versions/Unicode4.1.0/).
A complete specification of the contributory files for Unicode 4.1.0 is found on the page Components for Version 4.1.0.
Contents of This Document
Online Edition
Overview
Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
Conformance Changes to the Standard
Other Changes to the Standard
Superseded Sections
Unicode Character Database
Errata Corrected in This Version
Script Additions
Significant Character Additions
Online Edition
The text of The Unicode Standard, Version 4.0, as well as the delta and archival code charts, is available online via the navigation links on this page. These files may not be printed. The Unicode 4.0 Web Bookmarks page has links to all sections of the online text.
Overview
Unicode 4.1.0 is a minor version of the Unicode Standard. It adds 1273 new characters. This document provides information about those additional characters, as well as further clarifications of the text of the standard. It also covers accumulated corrigenda and errata to the text.
There are significant changes to many of the Unicode Standard Annexes which are part of Unicode 4.1.0. Each annex has a modification section listing the changes in that annex.
Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
- Addition of 1273 new characters to the standard, including those to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts. (The exact list of additions can be seen in DerivedAge.txt, in the age=4.1 section.)
- Change in the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB, with the addition of some Han characters. The boundaries of such ranges are sometimes hardcoded in software, in which case the hardcoded value needs to be changed.
- New Unicode Standard Annexes: UAX #31, Identifier and Pattern Syntax and UAX #34, Unicode Named Character Sequences, and significant changes to other Unicode Standard Annexes.
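The range-boundary change noted above can be isolated in one place rather than scattered through code. The following is a minimal sketch, not a reference implementation: the names `CJK_UNIFIED_END` and `is_cjk_unified` are hypothetical, and the block start U+4E00 is an assumption not stated in this document.

```python
# End of the CJK Unified Ideographs range, keyed by Unicode version.
# The two end values come from the text above.
CJK_UNIFIED_END = {
    "4.0.1": 0x9FA5,
    "4.1.0": 0x9FBB,
}

# Assumption: the range starts at U+4E00 (not stated in this document).
CJK_UNIFIED_START = 0x4E00


def is_cjk_unified(cp: int, version: str = "4.1.0") -> bool:
    """Return True if cp falls in the CJK Unified Ideographs range
    as bounded for the given Unicode version."""
    return CJK_UNIFIED_START <= cp <= CJK_UNIFIED_END[version]


if __name__ == "__main__":
    # U+9FA6 is inside the range only from 4.1.0 onward.
    print(is_cjk_unified(0x9FA6, "4.0.1"), is_cjk_unified(0x9FA6, "4.1.0"))
```

Keeping the boundary in a table keyed by version means a later repertoire extension requires a data change only.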
In addition to the repertoire additions, there have been a number of significant changes to the Unicode Character Database files and the properties in them. In particular:
- Three new properties, Grapheme_Cluster_Break, Sentence_Break, and Word_Break, have been added in support of UAX #29, Text Boundaries. Their enumeration can be found in new data files, located in the new "auxiliary" subdirectory of the UCD. Also in the "auxiliary" subdirectory are the test data files and HTML break charts associated with UAX #29.
- The new property Other_ID_Continue has been added to support identifier stability. It is enumerated in PropList.txt and is used in the derivation of other identifier-related properties.
- Two new properties, Pattern_Syntax and Pattern_White_Space, have been added in support of UAX #31, Identifier and Pattern Syntax. Their enumeration can be found in PropList.txt.
- The bidi properties of a few compatibility equivalents of characters whose bidi classes changed for Unicode 4.0.1 have been harmonized.
- The case mapping contexts defined in SpecialCasing.txt have been updated and now override Table 3-13. Context Specification for Casing on p. 89 of The Unicode Standard, Version 4.0. These changes are described below in the section Modifications to Default Case Operations.
- Alphabetic is now a superset of Lowercase and Uppercase for compatibility with POSIX-style character classes.
- A new data file NamedSequences.txt has been added in conjunction with UAX #34, Unicode Named Character Sequences. This data file defines specific names for some significant Unicode character sequences, giving their Unicode Sequence Identifier (USI) values.
- The line breaking properties of Runic, Indic, Mongolian, Tibetan punctuation, and Hangul have been revised to better match their expected behavior. (See UAX #14: Line Breaking Properties.)
The following complete scripts have been added in Unicode 4.1.0:
- New Tai Lue (U+1980..U+19DF)
- Buginese (U+1A00..U+1A1F)
- Glagolitic (U+2C00..U+2C5F)
- Coptic (U+2C80..U+2CFF)
- Tifinagh (U+2D30..U+2D7F)
- Syloti Nagri (U+A800..U+A82F)
- Old Persian (U+103A0..U+103DF)
- Kharoshthi (U+10A00..U+10A5F)
Two scripts have been disunified or reorganized:
- Coptic is now considered a separate script from Greek. This differs from prior documentation in the standard. A new Coptic block has been added, including characters for Old Coptic. It should be noted, however, that the 14 Coptic letters derived from Demotic, which had already been encoded in the Greek and Coptic block, are unchanged, and need to be included in any complete implementation of Coptic.
- The Nuskhuri forms of Khutsuri Georgian have been added in a new Georgian Supplement block (U+2D00..U+2D2F). Those characters are now to be taken as the lowercase pairs of the Asomtavruli Georgian encoded at U+10A0..U+10C5. This introduction of case pairs for Khutsuri is a change from the previous documentation about Georgian in the standard.
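The new Georgian case pairing can be observed through ordinary case conversion. The sketch below assumes a Python runtime whose bundled Unicode data is version 4.1 or later; the helper name `nuskhuri_lowercase` is hypothetical.

```python
# Asomtavruli capitals named in the text above.
ASOMTAVRULI = range(0x10A0, 0x10C5 + 1)
# Georgian Supplement block added in Unicode 4.1.0.
GEORGIAN_SUPPLEMENT = range(0x2D00, 0x2D2F + 1)


def nuskhuri_lowercase(cp: int) -> int:
    """Return the lowercase code point for an Asomtavruli code point,
    as reported by the runtime's own case mapping data."""
    if cp not in ASOMTAVRULI:
        raise ValueError(f"U+{cp:04X} is not in U+10A0..U+10C5")
    return ord(chr(cp).lower())


if __name__ == "__main__":
    for cp in (0x10A0, 0x10C5):
        print(f"U+{cp:04X} -> U+{nuskhuri_lowercase(cp):04X}")
```

Implementations that predate Version 4.1 treat these letters as caseless, so stored text may need re-examination if case-insensitive matching is applied to Khutsuri material.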
Beyond the addition of entire scripts, there have been very significant extensions to the repertoire for the Arabic script and the Ethiopic script. A large number of additional Latin characters have been added as phonetic extensions to support various orthographic conventions for minority languages. There are also significant additions of Greek symbols and punctuation to support specialist representation of ancient Greek materials. Several small sets have been added to CJK Unified Ideographs and to associated blocks.
A few characters have been added to supplement Hebrew, in particular for support of Biblical Hebrew text representation.
U+060B AFGHANI SIGN has been added. While some glyph variants of this character do occur, the form shown in the code charts is that approved by the Ministry of Finance of the Afghanistan government.
U+09CE BENGALI LETTER KHANDA TA has been added. This will necessitate adjustment of Bengali script implementations. In Unicode 4.1, recommendations for the representation of Khanda-Ta in Bengali differ from those documented in Version 4.0.1 and earlier.
Conformance Changes to the Standard
Modifications to Default Case Operations
The following amends Section 3.13, Default Case Operations, on pp. 89-90 of The Unicode Standard, Version 4.0.
Add after D47:
D47a A character C is defined to be case-ignorable if C has the Unicode property Word_Break=MidLetter as defined in Unicode Standard Annex #29, "Text Boundaries"; or the General Category of C is Nonspacing Mark (Mn), Enclosing Mark (Me), Format Control (Cf), Letter Modifier (Lm), or Symbol Modifier (Sk).
D47b A case-ignorable sequence is a sequence of zero or more case-ignorable characters.
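Definitions D47a and D47b translate fairly directly into a predicate. The sketch below is illustrative only. Python's `unicodedata` module exposes General Category but not Word_Break, so `MID_LETTER_SAMPLE` is a small hypothetical stand-in for the Word_Break=MidLetter set and should be replaced by values loaded from the UAX #29 data files in the UCD "auxiliary" subdirectory. Categories also reflect the runtime's Unicode version, not necessarily 4.1.0.

```python
import unicodedata

# Assumption: an illustrative subset standing in for Word_Break=MidLetter.
# Load the real set from the UAX #29 data files in the UCD.
MID_LETTER_SAMPLE = frozenset({"\u0027", "\u00B7", "\u2019", "\u2027"})

# General Category values listed in D47a.
CASE_IGNORABLE_CATEGORIES = frozenset({"Mn", "Me", "Cf", "Lm", "Sk"})


def is_case_ignorable(ch: str, mid_letter=MID_LETTER_SAMPLE) -> bool:
    """D47a: Word_Break=MidLetter, or General Category Mn, Me, Cf, Lm, Sk."""
    return ch in mid_letter or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


def is_case_ignorable_sequence(s: str, mid_letter=MID_LETTER_SAMPLE) -> bool:
    """D47b: zero or more case-ignorable characters (empty string qualifies)."""
    return all(is_case_ignorable(ch, mid_letter) for ch in s)


if __name__ == "__main__":
    print(is_case_ignorable("\u0301"), is_case_ignorable("a"))
```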
Replace Table 3-13, Context Specification, by the following:
A description of each context is followed by the equivalent regular expression(s) describing the context before C and the context after C, or both. The regular expression uses the syntax of Unicode Technical Standard #18, "Unicode Regular Expressions," with one addition: "!" means that the expression does not match. All regular expressions below are case-sensitive.
Table 3-13. Context Specification for Casing (table body not reproduced here; the casing contexts are defined in SpecialCasing.txt)
Clarification of Decomposition Mappings
In order to ensure, as intended, that decomposition mappings for each version of the standard derive from the Unicode Character Database for that version of the standard, the phrase in D18, D20, and D23 reading "according to the decomposition mappings found in the names list of Section 16.1, Character Names List" is changed to "according to the decomposition mappings found in the Unicode Character Database".
Other Changes to the Standard
Change in status of recommendation of SPACE as a base for display of nonspacing marks.
The UTC has decided that U+0020 SPACE is no longer recommended as a suitable base character for display of isolated nonspacing marks. Instead, U+00A0 NO-BREAK SPACE is the preferred base character for this function.
The explanatory text of The Unicode Standard, Version 4.0, page 46, "Spacing Clones of European Diacritical Marks" is updated to read as follows:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself as a mark, rather than using it in its normal way in text applied as an accent to a base letter or in other combinations.
Prior to Version 4.1 of the Unicode Standard, the standard also recommended the use of U+0020 SPACE for display of isolated combining marks. This is no longer recommended because of potential conflicts with the handling of sequences of U+0020 space characters in such contexts as XML.
The Unicode Standard separately encodes clones of many common European diacritical marks, primarily for compatibility with existing character set standards. These cloned accents and diacritics are spacing characters, and can be used to display the mark in isolation, without application to a no-break space. They are cross-referenced to the corresponding combining mark in the names list in Chapter 16, Code Charts. For example, U+02D8 BREVE is cross-referenced to U+0306 COMBINING BREVE. Most of these spacing clones also have compatibility decomposition mappings involving U+0020 SPACE, but implementers should be cautious in making use of those decomposition mappings because of the complications that can result from replacing a spacing character with a space + combining mark sequence.
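The two conventions discussed above can be compared side by side. This is a minimal sketch: `isolated_mark` is a hypothetical helper that applies a combining mark to U+00A0, and the second part inspects the compatibility decomposition of U+02D8 BREVE to show the space + combining mark sequence that the text advises treating with caution.

```python
import unicodedata

NBSP = "\u00A0"  # preferred base for isolated display as of Version 4.1


def isolated_mark(mark: str) -> str:
    """Return a combining mark applied to NO-BREAK SPACE."""
    if unicodedata.category(mark) not in ("Mn", "Me"):
        raise ValueError("expected a nonspacing or enclosing mark")
    return NBSP + mark


if __name__ == "__main__":
    # U+0306 COMBINING BREVE shown in isolation on a no-break space.
    print(repr(isolated_mark("\u0306")))
    # The spacing clone U+02D8 BREVE decomposes (compatibility) to
    # U+0020 SPACE + U+0306, which is the substitution to be careful with.
    print(unicodedata.decomposition("\u02D8"))
    print(repr(unicodedata.normalize("NFKD", "\u02D8")))
```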
See UAX #14: Line Breaking Properties for corresponding changes.
Change in equivalence for NO-BREAK SPACE
The Unicode Standard, Version 4.0, p. 387 states: U+00A0 NO-BREAK SPACE behaves like the following coded character sequence: U+FEFF ZERO WIDTH NO-BREAK SPACE + U+0020 SPACE + U+FEFF ZERO WIDTH NO-BREAK SPACE.
That sentence is stricken from the text of the Unicode Standard, Version 4.1.0, because it is incorrect. The behavior in bidirectional text layout is not identical for these sequences (see UAX #9: The Bidirectional Algorithm). For line breaking, there are differences with respect to a following SPACE character (see UAX #14: Line Breaking Properties). In addition, the use of U+FEFF for word-joining has been deprecated in favor of U+2060 WORD JOINER.
Use of CGJ to Prevent Reordering
The following modifies the section headed Combining Grapheme Joiner in Section 15.2, Layout Controls on page 392 of The Unicode Standard, Version 4.0.
Replace this text on page 392:
U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent characters are to be treated as a unit for the purposes of language-sensitive collation and searching. In language-sensitive collation and searching, the combining grapheme joiner should be ignored unless it specifically occurs within a tailored collation element mapping. Thus it is given a completely ignorable collation element in the default collation table, like NULL (see Unicode Technical Standard #10, "Unicode Collation Algorithm," and also ISO/IEC 14651). However, it can be entered into the tailoring rules for any given language, using the tailoring capabilities of the collation standards.
by the following text:
U+034F COMBINING GRAPHEME JOINER is used to affect the collation of adjacent characters for purposes of language-sensitive collation and searching, and to distinguish sequences that would otherwise be canonically equivalent.
Formally, the combining grapheme joiner is not a format control character, but rather a combining mark. It has the General Category value gc=Mn and the canonical combining class value ccc=0. These property assignments result in the following behavior, which can be useful in certain circumstances.
The presence of a combining grapheme joiner in the midst of a combining character sequence does not interrupt the combining character sequence; a process which is accumulating and processing all the characters of a combining character sequence would include a combining grapheme joiner as part of that sequence. (This differs from the behavior for most format control characters, whose presence would interrupt a combining character sequence.) However, because the combining grapheme joiner has a combining class of 0, canonical reordering will not reorder any adjacent combining marks around a combining grapheme joiner. (See the definition of canonical reordering in Section 3.11, Canonical Reordering Behavior in Unicode 4.0.) In turn, this means that insertion of a combining grapheme joiner between two combining marks will prevent normalization from switching the position of those two combining marks, regardless of their own combining classes.
This side-effect of the character properties of the combining grapheme joiner, together with the fact that the combining grapheme joiner has no visible glyph and no other format effect on neighboring characters, can be taken advantage of in those exceptional circumstances where two alternative orderings of a sequence of combining marks must be distinguished for some processing or rendering purpose and where normalization would otherwise eliminate the distinction between the two sequences.
For example, this is one way to avoid the less-than-optimal assignment of fixed-position combining classes to certain Hebrew accents and marks which do in fact interact typographically and for which accent order distinctions need to be maintained for analytic and text representational purposes. In particular:
<lamed, patah, hiriq, finalmem>
is canonically equivalent to:
<lamed, hiriq, patah, finalmem>
because the canonical combining classes of U+05B4 HEBREW POINT HIRIQ and U+05B7 HEBREW POINT PATAH are distinct. However, if an application wishes to make a distinction between a patah following hiriq and a patah preceding a hiriq, the following sequence would not be canonically equivalent to the first two sequences cited:
<lamed, patah, CGJ, hiriq, finalmem>
The presence of the ccc=0 combining grapheme joiner blocks the reordering of hiriq before patah by canonical reordering. That allows the two sequences to be reliably distinguished, whether for display or for other processing.
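The three Hebrew sequences above can be checked against a normalizer. The following sketch uses Python's standard `unicodedata.normalize`; the variable names are illustrative, and the code points for lamed (U+05DC) and final mem (U+05DD) are supplied here rather than taken from the text.

```python
import unicodedata

LAMED, FINAL_MEM = "\u05DC", "\u05DD"   # base letters (code points assumed)
HIRIQ, PATAH = "\u05B4", "\u05B7"       # points with distinct nonzero ccc
CGJ = "\u034F"                          # COMBINING GRAPHEME JOINER, ccc=0

patah_first = LAMED + PATAH + HIRIQ + FINAL_MEM
hiriq_first = LAMED + HIRIQ + PATAH + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM


def nfd(s: str) -> str:
    """Canonical decomposition, including canonical reordering."""
    return unicodedata.normalize("NFD", s)


if __name__ == "__main__":
    # The first two sequences normalize to the same string.
    print(nfd(patah_first) == nfd(hiriq_first))
    # The CGJ keeps patah ahead of hiriq through normalization.
    print(nfd(with_cgj) == with_cgj)
```

Because the joiner survives normalization, a process that needs to compare the sequences while ignoring the joiner has to strip U+034F explicitly.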
The Unicode Collation Algorithm involves the normalization of Unicode text strings before collation weighting. The combining grapheme joiner is ordinarily ignored in collation key weighting in the UCA, but if, as in this case, it is used to block the reordering of combining marks in a string, its effect can be to invert the order of secondary key weights associated with those combining marks. Because of this, the two strings would have distinct keys, making it possible to treat them distinctly in searching and sorting without having to further tailor either the combining grapheme joiner or the combining marks themselves.
The CGJ can also be used to prevent the formation of contractions in the Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will sort as a 'c' followed by an 'h'. This can also be used in German, for example, to force 'ü' to be sorted as 'u' + umlaut (using <u, CGJ, umlaut>), even where a dictionary sort is being used. This also happens without having to further tailor either the combining grapheme joiner or the sequence.
Of course, sequences of characters which include the combining grapheme joiner may also be given tailored weights. Thus the sequence <c, CGJ, h> could be weighted completely differently from either the contraction "ch" or how "c" and "h" would have sorted without the contraction. However, this application of CGJ is not recommended. For more information on the use of CGJ with sorting, matching, and searching, see UTS #10: Unicode Collation Algorithm, Version 4.1.0.
Meteg
The following clarifying text regarding the control of positioning of the meteg in Hebrew, U+05BD HEBREW POINT METEG, should be added to Section 8.1, Hebrew, p. 194 of The Unicode Standard, Version 4.0.
The basic recommendations for the control of positioning of meteg established in Version 4.1 are as follows:
U+034F COMBINING GRAPHEME JOINER can be used within a vowel-meteg sequence to preserve an ordering distinction under normalization.
So, for instance, to display meteg to the left (after, for a right-to-left script) of the vowel point sheva, U+05B0 HEBREW POINT SHEVA, the following sequence can be used:
<sheva, meteg>
Because these marks are canonically ordered, this sequence is preserved under normalization. Then, to display meteg to the right of the sheva, the following sequence can be used:
<meteg, CGJ, sheva>
A further complication arises for combinations of meteg with hataf vowels. Authors who want to ensure left-position versus medial-position display of meteg with hataf vowels across all font implementations may use joiner characters to distinguish these cases.
Thus, the following encoded representations can be used for different positioning of meteg with a hataf vowel, such as hataf patah:
left-positioned meteg: <hataf patah, ZWNJ, meteg>
medially-positioned meteg: <hataf patah, ZWJ, meteg>
right-positioned meteg: <meteg, CGJ, hataf patah>
In no case is use of ZWNJ, ZWJ, or CGJ required for representation of meteg. These recommendations are simply provided for interoperability in those instances where authors wish to preserve specific positional information regarding the layout of a meteg in text.
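The recommended meteg sequences can be collected into a small table and checked for stability under normalization. This is a sketch under stated assumptions: the dictionary name and the `survives_normalization` helper are hypothetical, and U+05B2 is supplied here as the code point for hataf patah. Whether a font honors the positional distinction is outside what this code can show.

```python
import unicodedata

SHEVA, HATAF_PATAH, METEG = "\u05B0", "\u05B2", "\u05BD"
CGJ, ZWNJ, ZWJ = "\u034F", "\u200C", "\u200D"

# Encoded representations for meteg with hataf patah, per the text above.
METEG_WITH_HATAF_PATAH = {
    "left": HATAF_PATAH + ZWNJ + METEG,
    "medial": HATAF_PATAH + ZWJ + METEG,
    "right": METEG + CGJ + HATAF_PATAH,
}


def survives_normalization(s: str) -> bool:
    """True if s is unchanged by both NFC and NFD."""
    return all(unicodedata.normalize(form, s) == s for form in ("NFC", "NFD"))


if __name__ == "__main__":
    for position, seq in METEG_WITH_HATAF_PATAH.items():
        print(position, survives_normalization(seq))
    # Without CGJ, <meteg, sheva> is reordered to <sheva, meteg>.
    print(survives_normalization(METEG + SHEVA))
```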
Rendering of Thai Combining Marks
Thai tone marks are a type of combining mark displayed above an associated base character; they have a combining class of 107. Other Thai combining marks displayed above, in particular vowels, have a combining class of 0. This assignment of combining classes is insufficient to fully characterize the typographic interaction between those marks.
For the purpose of rendering, the Thai combining marks above (U+0E31, U+0E34..U+0E37, U+0E47..U+0E4E) should be displayed outward from the base character they modify, in the order in which they appear in the text. In particular, a sequence containing <U+0E48 THAI CHARACTER MAI EK, U+0E4D THAI CHARACTER NIKHAHIT> should be displayed with the nikhahit above the mai ek, and a sequence containing <U+0E4D THAI CHARACTER NIKHAHIT, U+0E48 THAI CHARACTER MAI EK> should be displayed with the mai ek above the nikhahit.
This does not preclude input processors from helping the user by pointing out or correcting typing mistakes, perhaps taking into account the language. For example, because the string <mai ek, nikhahit> is not useful for the Thai language and is likely a typing mistake, an input processor could reject it or correct it to <nikhahit, mai ek>.
When the character U+0E33 THAI CHARACTER SARA AM follows one or more tone marks (U+0E48 .. U+0E4B), the nikhahit that is part of the sara am should be displayed below those tone marks. In particular, a sequence containing <U+0E48 THAI CHARACTER MAI EK, U+0E33 THAI CHARACTER SARA AM> should be displayed with the mai ek above the nikhahit.
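The input-side correction mentioned above can be sketched as a simple substitution. This is only an illustration of the one example given in the text; the function name is hypothetical, and a real input processor would likely consider language and a wider set of sequences.

```python
import unicodedata

MAI_EK = "\u0E48"     # tone mark, combining class 107
NIKHAHIT = "\u0E4D"   # mark above, combining class 0


def correct_mai_ek_nikhahit(text: str) -> str:
    """Rewrite <mai ek, nikhahit> as <nikhahit, mai ek>, the correction
    the text suggests an input processor could apply for Thai."""
    return text.replace(MAI_EK + NIKHAHIT, NIKHAHIT + MAI_EK)


if __name__ == "__main__":
    typed = "\u0E01" + MAI_EK + NIKHAHIT  # KO KAI + mai ek + nikhahit
    fixed = correct_mai_ek_nikhahit(typed)
    print([f"U+{ord(c):04X}" for c in fixed])
    # Normalization will not perform this swap: nikhahit has ccc=0,
    # so canonical reordering leaves the typed order untouched.
    print(unicodedata.normalize("NFD", typed) == typed)
```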
Superseded Sections
Unicode Character Database
The complete Unicode Character Database files for this version are available in the 4.1.0 directory. For more detailed information about the changes in the Unicode Character Database, see the file UCD.html in the Unicode Character Database.
Errata Corrected in This Version
Errata corrected in this version are listed by date in a separate table. For corrigenda and errata after the release of Unicode 4.1.0, see the list of current Updates and Errata.
Script Additions
New Tai Lue: U+1980 - U+19DF
The New Tai Lue script, also known as Xishuang Banna Dai, is used mainly in southern China. The script was developed in the twentieth century as an orthographic simplification of the historic Lanna script used to write the Tai Lue language.
New Tai Lue differs from Lanna in that it regularizes the consonant repertoire, simplifies the writing of consonant clusters and syllable-final consonants, and uses only spacing vowel signs, which appear before or after the consonants they modify. By contrast, Lanna uses both spacing vowel signs and nonspacing vowel signs which appear above or below the consonants they modify. All vowel signs in New Tai Lue are considered combining characters and follow their base consonants in the text stream. Where a syllable is composed of a vowel sign to the left and a vowel sign or tone mark on the right of the consonant, a sequence of characters is used, in the order consonant + vowel + tone mark.
A virama or killer character is not used to create conjunct consonants in New Tai Lue, because clusters of consonants do not regularly occur. New Tai Lue has a limited set of final consonants, which are modified with a hook showing that the inherent vowel is killed.
Similar to the Thai and Lao scripts, New Tai Lue consonant letters come in pairs that denote two tonal registers. The tone of a syllable is indicated by the combination of the tonal register of the consonant letter plus a tone mark written at the end of the syllable.
Buginese: U+1A00 - U+1A1F
The Buginese script is used on the island of Sulawesi, mainly in the southwest. A variety of traditional literature has been printed in it. The script is one of the easternmost of the Brahmi scripts and is perhaps related to Javanese. It bears some affinity to Tagalog, and it does not traditionally record final consonants. The Buginese language, an Austronesian language with a rich traditional literature, is one of the foremost languages of Indonesia. The script was previously also used to write the Makassar, Bimanese, and Madurese languages.
Glagolitic: U+2C00 - U+2C5F
Glagolitic, from the Slavic root "glagol" meaning "word", is an alphabet considered to have been devised by St. Cyril in the ninth century CE, for his translation of the Scriptures and liturgical books into Slavonic. Glagolitic was eventually supplanted by the alphabet now known as Cyrillic, which probably arose in late ninth-century Bulgaria. In parts of Croatia where a vernacular liturgy was used, Glagolitic continued in use until modern times; in these areas Glagolitic is still occasionally used as a decorative alphabet. Like Cyrillic, the Glagolitic script is written in linear sequence from left to right with no contextual modification of the letterforms.
Glagolitic is treated as a separate alphabet from Cyrillic because of its historical primacy, and because the letter shapes in the two alphabets are completely dissimilar: the one can in no sense be regarded as a variant of the other. Glagolitic itself exists in two styles, known as round and square. Round Glagolitic is the original style and more geographically widespread; square Glagolitic was used in Croatia from the thirteenth century. The letterforms used in the charts are round Glagolitic.
Coptic: U+2C80 - U+2CFF
Coptic is now considered a separate script from Greek in the Unicode Standard. This differs from prior documentation in the standard, for which Coptic was considered to be a stylistic variant of Greek, to be implemented by a font shift.
Starting with Unicode Version 4.1.0, a separate Coptic script block has been added at U+2C80..U+2CFF. The block contains the common Coptic alphabet, but also contains extensions needed for Old Coptic and dialectal usage of the Coptic script. It also contains Coptic-specific symbols and punctuation.
The long-encoded 14 Coptic letters derived from Demotic, encoded in the range U+03E2..U+03EF in the Greek and Coptic block, are also considered part of the Coptic script, and should be included in any complete implementation of the script.
Any implementations of Coptic predating Unicode Version 4.1.0 should be carefully checked, since use of Greek characters with Coptic-style fonts is no longer recommended for Coptic data.
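A coverage check for Coptic therefore has to span two ranges. The sketch below is a rough illustration using only the ranges stated above; `COPTIC_RANGES` and `is_coptic_code_point` are hypothetical names, and the block range U+2C80..U+2CFF may include code points that are not assigned in a given version.

```python
# Ranges taken from the text above: the 14 Demotic-derived letters in the
# Greek and Coptic block, plus the Coptic block added in Version 4.1.0.
COPTIC_RANGES = (
    (0x03E2, 0x03EF),
    (0x2C80, 0x2CFF),
)


def is_coptic_code_point(cp: int) -> bool:
    """Return True if cp falls in either range used for the Coptic script."""
    return any(lo <= cp <= hi for lo, hi in COPTIC_RANGES)


if __name__ == "__main__":
    # U+03B1 GREEK SMALL LETTER ALPHA is Greek, not Coptic, under 4.1.0.
    for cp in (0x03B1, 0x03E2, 0x2C80):
        print(f"U+{cp:04X}", is_coptic_code_point(cp))
```

A check like this can help flag legacy data that encodes Coptic text with Greek letters and relies on a Coptic-style font.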
Tifinagh: U+2D30 - U+2D7F
The Tifinagh script is used by around 20 million people in Morocco for writing Berber languages including Tarifite, Tamazighe, and Tachelhite. The teaching of Berber, written in Tifinagh, will be generalized and compulsory in Morocco. It is scheduled to be taught in all public schools by 2008. Historically the script has been used in several variant traditions along the Mediterranean coast from Kabylia to Morocco and the Canary Islands, the Constantinois and Aurès regions, as well as in Tunisia.
Syloti Nagri: U+A800 - U+A82F
Syloti Nagri is a lesser-known Brahmi-derived script used for writing the Sylheti language. Sylheti is an Indo-European language spoken by some 5 million speakers in the Barak Valley region of northeast Bangladesh and southeast Assam (India). Sylheti has commonly been regarded as a dialect of Bengali, with which it shares a high proportion of vocabulary. The Syloti Nagri script has 27 consonant letters with an inherent vowel of /o/, and 5 independent vowel letters. There are five dependent vowel signs which are attached to a consonant letter. Included in the encoding are several script-specific punctuation marks.
Old Persian: U+103A0 - U+103DF
Old Persian is found in a number of inscriptions in the Old Persian language dating from the Achaemenid Empire. It is an alphabetic writing system with some syllabic aspects. While the shapes of some Old Persian letters may look similar to signs in Sumero-Akkadian Cuneiform, it is clear that only one of them was borrowed from Sumero-Akkadian Cuneiform. Scholars today agree that the character inventory of Old Persian was newly-invented for the purpose of providing monumental inscriptions of the Achaemenid king, Darius I, by about 525 BCE.
Old Persian is written from left to right. The repertoire contains 36 signs which represent consonants, vowels or sequences of single consonants plus vowels, a set of five numbers, one word divider, and eight ideograms.
Kharoshthi: U+10A00 - U+10A5F
The Kharoshthi script was used historically to write Gāndhārī and Sanskrit as well as various mixed dialects. Kharoshthi is an Indic script of the abugida type. However, unlike other Indic scripts, it is written from right to left. The Kharoshthi script was initially deciphered around the middle of the nineteenth century by James Prinsep and others who worked from short Greek and Kharoshthi inscriptions on the coins of the Indo-Greek and Indo-Scythian kings. The decipherment has been refined over the last 150 years as more material has come to light. Representation of Kharoshthi in the Unicode code charts uses forms based on manuscripts of the first century CE.
Kharoshthi can be implemented using the rules of the Unicode bidirectional algorithm. In Kharoshthi both letters and digits are written from right to left. Rendering requirements for Kharoshthi are similar to those for Devanagari.
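The right-to-left behavior of both letters and digits is visible in their Bidi_Class values. A minimal check, assuming a Python runtime whose Unicode data includes the Kharoshthi block, and using U+10A00 and U+10A40 as sample letter and digit (the choice of samples is illustrative):

```python
import unicodedata

KHAROSHTHI_LETTER_A = "\U00010A00"
KHAROSHTHI_DIGIT_ONE = "\U00010A40"


def is_strong_rtl(ch: str) -> bool:
    """True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


if __name__ == "__main__":
    for ch in (KHAROSHTHI_LETTER_A, KHAROSHTHI_DIGIT_ONE):
        print(unicodedata.name(ch), unicodedata.bidirectional(ch))
```

Unlike European or Arabic-Indic digits, which carry weak bidi classes, these digits are strongly right-to-left, which is what lets numbers run in the same direction as the letters.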
Significant Character Additions
In addition to encodings of entirely new scripts in Unicode Version 4.1.0, there have been other significant additions to the character repertoire. In some instances, these consist of major or minor extensions of existing scripts, and in other instances consist of specialized sets of punctuation, modifier letters or other symbols. These additions are sorted by category and explained in the sections below.
Arabic Supplement: U+0750 - U+077F
Unicode 4.1 adds 30 additional extended Arabic letters mainly for the languages used in Northern and Western Africa, such as Fulfulde, Hausa, Songhoy and Wolof. In the second half of the twentieth century, the use of the Arabic script was actively promoted for these languages. Characters used for other languages are annotated in the character names list. Additional vowel marks used with these languages are found in the main Arabic block.
Ethiopic Extensions: U+1380 - U+139F, U+2D80 - U+2DDF
The Ethiopic script is used for a large number of languages and dialects in Ethiopia, and in some instances has been extended significantly beyond the set of characters used for major languages such as Amharic and Tigré. Unicode Version 4.1.0 adds two blocks of extensions to the Ethiopic script: Ethiopic Supplement U+1380..U+139F and Ethiopic Extended U+2D80..U+2DDF. Those extensions cover such languages as Me'en, Blin, and Sebatbeit, which use many additional characters. Several other characters have been added to the main Ethiopic script block in the range U+1200..U+137F, including one additional Ethiopic punctuation mark, and a combining mark used to indicate gemination.
In the Ethiopic Supplement block there is also a new set of tonal marks. These are used in multiline scored layout, and as for other musical (an)notational systems of this type, require a higher-level protocol to enable proper rendering.
Additions for Biblical Hebrew
Five new Hebrew characters have been added in Unicode 4.1 for special usage in Biblical Hebrew text:
U+05A2 HEBREW ACCENT ATNAH HAFUKH
U+05BA HEBREW POINT HOLAM HASER FOR VAV
U+05C5 HEBREW MARK LOWER DOT
U+05C6 HEBREW PUNCTUATION NUN HAFUKHA
U+05C7 HEBREW POINT QAMATS QATAN
In some older versions of Biblical text, a distinction is made between the accents U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05AA HEBREW ACCENT YERAH BEN YOMO. Many editions from the last few centuries do not retain this distinction, using only yerah ben yomo, but some users in recent decades have begun to re-introduce this distinction. Similarly, a number of publishers of Biblical or other religious texts have introduced a typographic distinction for the vowel point qamats corresponding to two different readings. The original letterform used for one reading is referred to as qamats or qamats gadol; the new letterform for the other reading is qamats qatan. It is important to note that not all users of Biblical Hebrew use atnah hafukh and qamats qatan. If the distinction between accents atnah hafukh and yerah ben yomo is not made, then only U+05AA HEBREW ACCENT YERAH BEN YOMO is used. If the distinction between vowels qamats gadol and qamats qatan is not made, then only U+05B8 HEBREW POINT QAMATS is used. Implementations that support Hebrew accents and vowel points may not necessarily support the special-usage characters U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05C7 HEBREW POINT QAMATS QATAN.
The vowel point holam represents the vowel phoneme /o/. The consonant letter vav represents the consonant phoneme /w/, but in some words is used to represent a vowel, /o/. When the point holam is used on vav, the combination usually represents the vowel /o/, but in a very small number of cases represents the consonant-vowel combination /wo/. A typographic distinction is made between these two in many versions of Biblical text. In most cases, in which vav + holam together represents the vowel /o/, the point holam is centered above the vav and referred to as holam male. In the less frequent cases, in which the vav represents the consonant /w/, some versions show the point holam positioned above left. This is referred to as holam haser. The character U+05BA HEBREW POINT HOLAM HASER FOR VAV is intended for use as holam haser only in those cases where a distinction is needed. When the distinction is made, the character U+05B9 HEBREW POINT HOLAM is used to represent the point holam male on vav. U+05BA HEBREW POINT HOLAM HASER FOR VAV is intended for use only on vav; results of combining this character with other base characters are not defined. Not all users distinguish between the two forms of holam, and not all implementations can be assumed to support U+05BA HEBREW POINT HOLAM HASER FOR VAV.
In the Hebrew Bible, dots are written in various places above or below the base letters that are distinct from the vowel points and accents. These are referred to by scholars as puncta extraordinaria, and there are two kinds. The upper punctum is the more common of the two, and has been encoded since Unicode 2.0 as U+05C4 HEBREW MARK UPPER DOT. The lower punctum is used only in one verse of the Bible, Psalm 27:13, and has been added in Unicode 4.1 as U+05C5 HEBREW MARK LOWER DOT. The puncta generally differ in appearance from dots that occur above letters used to represent numbers; the number dots should be represented using U+0307 COMBINING DOT ABOVE and U+0308 COMBINING DIAERESIS.
The nun hafukha is a special symbol that appears to have been used for scribal annotations, though its exact functions are uncertain. It is used a total of nine times in the Hebrew Bible, although not all versions include it, and there are variations in the exact locations in which it is used. There is also variation in the glyph used: it often has the appearance of a rotated or reversed nun, and is very often called inverted nun; it may also appear similar to a half tet or have some other form.
Bengali Khanda Ta
In Bengali a dead consonant TA shows up as U+09CE BENGALI LETTER KHANDA TA in all contexts except where it is immediately followed by one of the consonants TA, THA, NA, BA, MA, YA, or RA. Khanda-Ta cannot bear a vowel matra or combine with a following consonant to form a conjunctaksara. It can form a conjunctaksara only with a preceding dead consonant RA, with the latter showing up as a REPH placed on the Khanda Ta.
Previous versions of the Unicode Standard recommended that Khanda-Ta be encoded as TA + VIRAMA + ZWJ. Instead, the Khanda-Ta should be used explicitly in new text, but users are cautioned that instances of the old encoding may exist.
Phonetic Extensions: U+1D6C - U+1DBF
Unicode 4.1 adds a significant number of characters used for phonetic transcription and phonetically-based orthographies. The characters in the range U+1D6C - U+1D7F complete the previously existing Phonetic Extensions block. A new Phonetic Extensions Supplement block has also been added, with the range U+1D80 - U+1DBF.
The phonetic extensions for Unicode 4.1 are derived from a wide variety of sources, including many technical orthographies developed by SIL linguists, as well as older historic sources.
Of particular note, all attested phonetic characters showing struckthrough tildes, struckthrough bars, and retroflex or palatal hooks attached to the basic letter have been separately encoded in the blocks for phonetic extensions. Although separate combining marks exist in the Unicode Standard for overstruck diacritics and attached retroflex or palatal hooks, earlier encoded IPA letters such as U+0268 LATIN SMALL LETTER I WITH STROKE or U+026D LATIN SMALL LETTER L WITH RETROFLEX HOOK have never been given decomposition mappings in the standard. For consistency, all newly encoded characters are handled analogously to the existing, more common characters of this type, and are not given decomposition mappings.
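One practical consequence is that normalization does not relate these letters to a base letter plus an overlay mark. The following sketch uses Python's standard `unicodedata` module to illustrate this; the helper name `has_decomposition` is hypothetical, and the results reflect whichever Unicode version the Python runtime ships with.

```python
import unicodedata


def has_decomposition(ch: str) -> bool:
    """True if the character has a decomposition mapping in the UCD."""
    return unicodedata.decomposition(ch) != ""


# Letters with a stroke or hook carry no decomposition mapping ...
stroke_i = "\u0268"        # LATIN SMALL LETTER I WITH STROKE
retroflex_l = "\u026D"     # LATIN SMALL LETTER L WITH RETROFLEX HOOK
# ... unlike an ordinary accented letter such as e with acute.
e_acute = "\u00E9"

# A sequence of i + COMBINING SHORT STROKE OVERLAY therefore does not
# compose to U+0268 under NFC; the two spellings stay distinct.
overlay_sequence = "i\u0335"
composed = unicodedata.normalize("NFC", overlay_sequence)
```

Applications that need to treat such sequences and the atomic letters as equivalent have to supply their own matching rules.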
The Phonetic Extensions Supplement block also contains 37 superscript modifier letters. These complement the much more commonly used superscript modifier letters found in the Spacing Modifier Letters block.
U+1D77 LATIN SMALL LETTER TURNED G and U+1D78 MODIFIER LETTER CYRILLIC EN are used in Caucasian linguistics. U+1D79 LATIN SMALL LETTER INSULAR G is used in older Irish phonetic notation. It is to be distinguished from a mere Gaelic-style glyph for U+0067 LATIN SMALL LETTER G.
U+1D7A LATIN SMALL LETTER TH WITH STRIKETHROUGH is a digraphic notation commonly found in some English-language dictionaries, representing the voiceless (inter)dental fricative, as in thin. While this character is clearly a digraph, the obligatory strikethrough across two letters distinguishes it from a "th" digraph per se, and there is no mechanism involving combining marks which can easily be used to represent it. A common alternative glyphic form for U+1D7A uses a horizontal bar to strike through the two letters, instead of a diagonal stroke.
Modifier Tone Letters: U+A700 - U+A71F
The Modifier Tone Letters block contains modifier letters used in various schemes for marking tones. These supplement the more commonly used tone marks and tone letters found in the Spacing Modifier Letters block (U+02B0 - U+02FF).
The characters in the range U+A700 - U+A707 are corner tone marks used in the transcription of Chinese. They were invented by Bridgman and Wells Williams in the 1830s. They have little current use, but are seen in a number of old Chinese sources.
The tone letters in the range U+A708 - U+A716 complement the basic set of IPA tone letters (U+02E5 - U+02E9), and are also used in the representation of Chinese tones, for the most part. The dotted tone letters are used to represent short ("stopped") tones. The left-stem tone letters are mirror images of the IPA tone letters, and like those tone letters, can be ligated in sequences of two or three tone letters to represent contour tones. Left-stem versus right-stem tone letters are sometimes used contrastively to distinguish between tonemic and tonetic transcription, or to show the effects of tonal sandhi.
Combining Diacritical Marks Supplement: U+1DC0 - U+1DFF
This block is the supplement to the Combining Diacritical Marks block in the range U+0300 - U+036F. It contains lesser-used combining diacritical marks.
U+1DC0 COMBINING DOTTED GRAVE ACCENT and U+1DC1 COMBINING DOTTED ACUTE ACCENT are marks occasionally seen in some Greek texts. They are variant representations of the accent combinations, dialytika varia and dialytika oxia, respectively. They are, however, encoded separately because they cannot be reliably formed by regular stacking rules involving U+0308 COMBINING DIAERESIS and U+0300 COMBINING GRAVE ACCENT or U+0301 COMBINING ACUTE ACCENT.
U+1DC3 COMBINING SUSPENSION MARK is a combining mark specifically used in Glagolitic. It is not to be confused with a combining breve.
Editorial Marks for Biblical Text Annotation
The Greek text of the New Testament exists in a large number of manuscripts with many textual variants. The most widely used critical edition of the New Testament, the Nestle-Aland edition published by the United Bible Societies (UBS), introduced a set of editorial characters which are regularly used in a number of journals and other publications. As a result, these editorial marks have become the recognized method of annotating the New Testament, and have been encoded in Unicode 4.1 in the range U+2E00..U+2E0D.
CJK Additions
Characters have been added to complete roundtrip mapping support for HKSCS and GB 18030. Some of these characters can be found in a new CJK Basic Strokes block (U+31C0..U+31EF), in a new Vertical Forms block (U+FE10..U+FE1F), and as a range extension to CJK Unified Ideographs (U+9FA6..U+9FBB). Other new characters are found in symbol blocks (U+23DA..U+23DB). Parsers and other code may need to adjust for the change of the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB.
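As an illustration of the kind of adjustment involved, the sketch below shows a range test whose upper bound is a named constant rather than a literal scattered through the code. The names are hypothetical, the block start U+4E00 is taken from the code charts rather than from this page, and only the main CJK Unified Ideographs range is covered (not the extension blocks).

```python
CJK_UNIFIED_START = 0x4E00
CJK_UNIFIED_END_4_0 = 0x9FA5   # last assigned ideograph before Unicode 4.1
CJK_UNIFIED_END_4_1 = 0x9FBB   # last assigned ideograph as of Unicode 4.1


def is_cjk_unified_ideograph(ch: str, end: int = CJK_UNIFIED_END_4_1) -> bool:
    """Test membership in the assigned part of the main CJK range.

    The upper bound is a parameter so that the same test can be run
    against the older or the newer end of the range.
    """
    return CJK_UNIFIED_START <= ord(ch) <= end
```

With the older bound, `is_cjk_unified_ideograph("\u9FA6", end=CJK_UNIFIED_END_4_0)` is false, which is the behavior that hard-coded pre-4.1 parsers would exhibit for the newly added characters.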
Characters in the CJK Basic Strokes block are single-stroke components of CJK ideographs. The first characters assigned to this block are 16 HKSCS-2001 characters.
A new collection of 106 CJK compatibility ideographs has been added to support roundtrip mapping to the DPRK standard.
Ancient Greek Additions
Ancient Greek Numbers: U+10140-U+1018F
Many symbols have been added to Unicode 4.1 to enable the complete coverage of Ancient Greek acrophonic numeric representation. This includes all known dialectal variants. In addition, a set of Ancient Greek papyrological numbers has been added.
Ancient Greek Editorial Marks
Ancient Greek scribes generally wrote in continuous uppercase letters without separating letters into words. On occasion the scribe added punctuation to indicate the end of a sentence or a change of speaker, or to separate words. Editorial and punctuation characters appear abundantly in surviving papyri and have been rendered in modern typography when possible, often exhibiting considerable glyphic variation. A number of these editorial marks are encoded in the range U+2E0E..U+2E16.
Ancient Greek Musical Notation: U+1D200 - U+1D24F
Ancient Greek had complete sets of vocal and instrumental notation symbols. These were based on Greek letters, comparable to the modern usage of the Latin letters A through G to refer to notes of the Western musical scale. However, rather than using a sharp and flat notation to indicate semitones, or casing and other diacritics to indicate distinct octaves, the Ancient Greek system extended the basic Greek alphabet by rotating and flipping letterforms in various ways, and by adding a few more symbols not directly based on a letter.
Ancient Greek musical notation had a separate system for vocal notation and for instrumental notation; each has a traditional catalog numbering system used by modern scholars of Ancient Greek. In the Unicode Standard, the two systems are unified against each other and against the basic Greek alphabet, based on shape. Thus, if a note is to be represented for the vocal notation system by a Greek letterform, not rotated or flipped, then the corresponding letter from the Greek alphabet in the Greek and Coptic block should be used instead, using an appropriate font to match the archaic letterforms used in the notational system.
If a symbol is used in both the vocal notation system and the instrumental notation system, its Unicode character name is based on the vocal notation system catalog number. Thus U+1D20D GREEK VOCAL NOTATION SYMBOL-14 has a glyph based on an inverted capital lambda. In the vocal notation system, it represents the first sharp of B, and in the instrumental notation system, it represents the first sharp of d'. Since it is used in both systems, its name is based on its sequence in the vocal notation system, rather than its sequence in the instrumental notation system. The character names list in the Unicode Character Database is fully annotated with the functions of the symbols for each system.
The combining marks encoded in the range U+1D242 - U+1D244 are placed over the vocal or instrumental notation symbols and are used to indicate metrical qualities.
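The naming convention and the combining behavior described above can be inspected with Python's standard `unicodedata` module. This is only a sketch: the helper names are hypothetical, and the property values come from whichever Unicode version the Python runtime provides.

```python
import unicodedata

# Block range as given in the heading above (U+1D200 - U+1D24F).
MUSICAL_BLOCK = range(0x1D200, 0x1D250)


def in_greek_musical_block(ch: str) -> bool:
    """True if the character lies in the Ancient Greek Musical Notation block."""
    return ord(ch) in MUSICAL_BLOCK


def is_metrical_combining_mark(ch: str) -> bool:
    """True for the combining marks in U+1D242 - U+1D244."""
    return 0x1D242 <= ord(ch) <= 0x1D244


# A symbol shared by both systems is named after its vocal catalog number.
shared_symbol = "\U0001D20D"
shared_name = unicodedata.name(shared_symbol)

# The metrical marks are nonspacing combining marks.
mark_category = unicodedata.category("\U0001D242")
```

Because unrotated letterforms are unified with the Greek and Coptic block, a range test like `in_greek_musical_block` identifies only part of a notated passage; ordinary Greek letters may also appear in it.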
Georgian Nuskhuri: U+2D00 - U+2D2F
The Georgian script form Nuskhuri was added in Unicode 4.1. The Georgian script has two related forms. The ecclesiastical form, Khutsuri, has an uppercase, inscriptional form, called Asomtavruli, and a lowercase, cursive, manuscript form called Nuskhuri. The modern, ordinary form, Mkhedruli, is caseless. Prior to Unicode 4.1, secular (Mkhedruli) and ecclesiastical (Khutsuri) styles of Georgian were considered font styles. Both Mkhedruli text and Nuskhuri text were represented using the character range U+10D0..U+10F8. Beginning with Unicode 4.1, Nuskhuri is separately represented using the new Georgian Supplement block, U+2D00..U+2D2F, and the characters in the range U+10D0..U+10F8 should be restricted to use for Mkhedruli text. Case mappings are now provided between the two Khutsuri forms: Asomtavruli and Nuskhuri.
In addition, three Mkhedruli characters which are used in the transcription of some East Caucasian languages were added.
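The case mappings between Asomtavruli and Nuskhuri can be exercised through ordinary case conversion. The sketch below relies on Python's built-in `str.upper` and `str.lower`, restricted to the two Khutsuri letter ranges so that other text is left alone. The function names are hypothetical, and the assigned ranges U+10A0..U+10C5 and U+2D00..U+2D25 are taken from the code charts rather than from this page.

```python
ASOMTAVRULI = range(0x10A0, 0x10C6)  # uppercase Khutsuri letters
NUSKHURI = range(0x2D00, 0x2D26)     # lowercase Khutsuri letters


def asomtavruli_to_nuskhuri(text: str) -> str:
    """Lowercase only the Asomtavruli letters in text."""
    return "".join(
        ch.lower() if ord(ch) in ASOMTAVRULI else ch for ch in text
    )


def nuskhuri_to_asomtavruli(text: str) -> str:
    """Uppercase only the Nuskhuri letters in text."""
    return "".join(
        ch.upper() if ord(ch) in NUSKHURI else ch for ch in text
    )
```

Mkhedruli letters in U+10D0..U+10F8 pass through both functions unchanged, consistent with treating Mkhedruli as caseless here.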