Chapter 12
South and Central Asia-I
Official Scripts of India
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages, another designation is preferred. In Hindi, for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed; in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature.
The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-European languages such as Pali and Sanskrit, and eventually the Hindi, Bangla, and Gujarati languages, though it was also the source for scripts for non-Indo-European languages such as Tibetan, Mongolian, and Lepcha.
The South Indian scripts are also derived from Brahmi and, therefore, share many structural characteristics. These scripts were first used to write Pali and Sanskrit but were later adapted for use in writing non-Indo-European languages—namely, the languages of the Dravidian family of southern India and Sri Lanka. Because of their use for Dravidian languages, the South Indian scripts developed many characteristics that distinguish them from the North Indian scripts. South Indian scripts were also exported to southeast Asia and were the source of scripts such as Tai Tham (Lanna) and Myanmar, as well as the insular scripts of the Philippines and Indonesia.
The shapes of letters in the South Indian scripts took on a quite distinct look from the shapes of letters in the North Indian scripts. Some scholars suggest that this occurred because writing materials such as palm leaves encouraged changes in the way letters were written.
The major official scripts of India proper, including Devanagari, are documented in this chapter. They are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based on the Indian national standard (ISCII) encoding for these scripts.
The first six columns in each script are isomorphic with the ISCII-1988 encoding, except that the last 11 positions (U+0955..U+095F in Devanagari, for example), which are unassigned or undefined in ISCII-1988, are used in the Unicode encoding. The seventh column in each of these scripts, along with the last 11 positions in the sixth column, represent additional character assignments in the Unicode Standard that are matched across some or all of the scripts. For example, positions U+xx66..U+xx6F and U+xxE6..U+xxEF code the Indic script digits for each script. The eighth column for each script is reserved for script-specific additions that do not correspond from one Indic script to the next.
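The parallel layout means that a character's offset within its block carries the same meaning across these scripts. As a minimal sketch (the `BLOCK_BASE` table and the `script_digit` helper are illustrative names, not part of any standard API), the digits of each script can be computed from the block start plus the common offset 0x66:

```python
import unicodedata

# Start of each block that follows the common ISCII-derived layout.
BLOCK_BASE = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}

DIGIT_OFFSET = 0x66  # digits occupy U+xx66..U+xx6F or U+xxE6..U+xxEF


def script_digit(script: str, value: int) -> str:
    """Return the digit character for `value` (0-9) in the given script."""
    if not 0 <= value <= 9:
        raise ValueError("value must be in 0..9")
    return chr(BLOCK_BASE[script] + DIGIT_OFFSET + value)


for name in BLOCK_BASE:
    digits = "".join(script_digit(name, v) for v in range(10))
    # unicodedata.digit() confirms each character really is that digit.
    assert all(unicodedata.digit(d) == v for v, d in enumerate(digits))
    print(f"{name:10} {digits}")
```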
While the arrangement of the encoding for the scripts of India is based on ISCII, this does not imply that the rendering behavior of South Indian scripts in particular is the same as that of Devanagari or other North Indian scripts. Implementations should ensure that adequate attention is given to the actual behavior of those scripts; they should not assume that they work just as Devanagari does. Each block description in this chapter describes the most important aspects of rendering for a particular script as well as unique behaviors it may have.
Many of the character names in this group of scripts represent the same sounds, and common naming conventions are used for the scripts of India.
#12.1 Devanagari
#12.1.1 Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical derivative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts, are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features.
The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate.
#Standards. The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative positions as those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali/Bangla, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar, and other scripts depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order.
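To illustrate the one-to-one mapping between analogous positions, the following sketch shifts each character by the distance between two block starts. The `transpose` function is a hypothetical helper, and the result is only a rough, position-based transliteration: not every position is assigned in every script, and the output is not guaranteed to be linguistically correct.

```python
import unicodedata

DEVANAGARI_BASE = 0x0900
BENGALI_BASE = 0x0980


def transpose(text: str, src_base: int, dst_base: int, block_size: int = 0x80) -> str:
    """Map characters of one block to the analogous positions of another."""
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < block_size:
            candidate = chr(dst_base + offset)
            # Keep the original character if the target position is unassigned.
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        out.append(ch)
    return "".join(out)


# DEVANAGARI LETTER KA (U+0915) maps to BENGALI LETTER KA (U+0995).
print(transpose("\u0915", DEVANAGARI_BASE, BENGALI_BASE))
```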
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII-1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a superset of the ISCII-1991 repertoire. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code points and back to their original encoding without loss of information. The Vedic extension characters defined in IS 13194:1991 Annex G (Extended Character Set for Vedic) are now fully covered by the Unicode Standard, but the conversions between ISCII and Unicode code points in some cases are more complex than for modern texts.
#Encoding Principles. The writing systems that employ Devanagari and other Indic scripts constitute abugidas, a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.
The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devanagari script. These pieces consist of three distinct character types: consonant letters, independent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order. Consonant letters by themselves constitute a CV unit, where the V is an inherent vowel, whose exact phonetic value may vary by writing system. Independent vowels also constitute a CV unit, where the C is considered to be null.
A dependent vowel sign is used to represent a V in CV units where C is not null and V is not the inherent vowel. CV units are not represented by sequences of a consonant followed by virama followed by independent vowel. In some cases, a phonological diphthong (such as Hindi जाओ /jāo/) is actually written as two orthographic CV units, where the second of these units is an independent vowel letter, whose C is considered to be null.
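The three character types can be told apart by code point. The sketch below is a simplification that covers only the core ranges of the Devanagari block (it ignores the extensions elsewhere in the block); `classify` and `describe` are illustrative names. It shows the Hindi example above as one consonant, one dependent vowel sign, and one independent vowel stored in logical order.

```python
def classify(ch: str) -> str:
    """Classify a character using the core Devanagari ranges only."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "independent_vowel"
    if 0x0915 <= cp <= 0x0939:
        return "consonant"
    if 0x093E <= cp <= 0x094C:
        return "dependent_vowel"
    if cp == 0x094D:
        return "virama"
    return "other"


def describe(word: str) -> list:
    """Return (code point, class) pairs in stored (logical) order."""
    return [(f"U+{ord(c):04X}", classify(c)) for c in word]


# JA + VOWEL SIGN AA + independent O: two orthographic CV units.
for code_point, kind in describe("\u091C\u093E\u0913"):
    print(code_point, kind)
```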
#12.1.2 Principles of the Devanagari Script
#Rendering Devanagari Characters. Devanagari characters, like characters from many other scripts, can combine or change shape depending on their context. A character’s appearance is affected by its ordering with respect to other characters, the font used to render the character, and the application or system environment. These variables can cause the appearance of Devanagari characters to differ from their nominal glyphs (used in the code charts).
Additionally, a few Devanagari characters cause a change in the order of the displayed characters. This reordering is not commonly seen in non-Indic scripts and occurs independently of any bidirectional character reordering that might be required.
#Consonant Letters. Each consonant letter represents a single consonantal sound but also has the peculiarity of having an inherent vowel, generally the short vowel /a/ in Devanagari and the other Indic scripts. Thus U+0915 DEVANAGARI LETTER KA represents not just /k/ but also /ka/. In the presence of a dependent vowel, however, the inherent vowel associated with a consonant letter is overridden by the dependent vowel.
Consonant letters may also be rendered as half-forms, which are presentation forms used within an orthographic syllable to depict initial consonants in a consonant cluster. These half-forms do not have an inherent vowel. Their rendered forms in Devanagari often resemble the full consonant but are missing the vertical stem, which marks a syllabic core. The stem glyph is graphically and historically related to the sign denoting the inherent /a/ vowel, as discussed later in this section.
Some Devanagari consonant letters have alternative presentation forms whose choice depends on neighboring consonants. This variability is especially notable for U+0930 DEVANAGARI LETTER RA, which has numerous different forms, both as the initial element and as the final element of a consonant cluster. Only the nominal forms, rather than the contextual alternatives, are depicted in the code charts.
The traditional Sanskrit/Devanagari alphabetic encoding order for consonants follows articulatory phonetic principles, starting with velar consonants and moving forward to bilabial consonants, followed by liquids and then fricatives. ISCII and the Unicode Standard both observe this traditional order.
#Independent Vowel Letters. The independent vowels in Devanagari are letters that stand on their own. The writing system treats independent vowels as orthographic CV syllables in which the consonant is null. The independent vowel letters are used to write syllables that start with a vowel.
#Dependent Vowel Signs (Matras). The dependent vowels serve as the common manner of writing noninherent vowels and are generally referred to as vowel signs, or as matras in Sanskrit. The dependent vowels do not stand alone; rather, they are visibly depicted in combination with a base letterform. A single consonant or a consonant cluster may have a dependent vowel applied to it to indicate the vowel quality of the syllable, when it is different from the inherent vowel. Explicit appearance of a dependent vowel in a syllable overrides the inherent vowel of a single consonant letter.
The greatest variation among different Indic scripts is found in the way that the dependent vowels are applied to base letterforms. Devanagari has a collection of nonspacing dependent vowel signs that may appear above or below a consonant letter, as well as spacing dependent vowel signs that may occur to the right or to the left of a consonant letter or consonant cluster. Other Indic scripts generally have one or more of these forms, but what is a nonspacing mark in one script may be a spacing mark in another. Also, some of the Indic scripts have single dependent vowels that are indicated by two or more glyph components, and those glyph components may surround a consonant letter both to the left and to the right or may occur both above and below it.
In modern usage the Devanagari script has only one character denoting a left-side dependent vowel sign: U+093F DEVANAGARI VOWEL SIGN I. In the historic Prishthamatra orthography, Devanagari also made use of one additional left-side dependent vowel sign: U+094E DEVANAGARI VOWEL SIGN PRISHTHAMATRA E. Other Indic scripts either have no such vowel signs (Telugu and Kannada) or include as many as three of these signs (Bengali/Bangla, Tamil, and Malayalam).
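Because text is stored in logical order, a left-side vowel sign follows its consonant in memory even though it is displayed before it. The following sketch makes that difference visible for the simplest case only (one consonant plus one vowel sign); `visual_order` is a hypothetical helper, and real rendering engines handle whole clusters rather than single pairs.

```python
# The two Devanagari left-side dependent vowel signs named in the text.
LEFT_SIDE_SIGNS = {"\u093F", "\u094E"}  # VOWEL SIGN I, PRISHTHAMATRA E


def visual_order(syllable: str) -> list[str]:
    """Rough display order for a single consonant plus one vowel sign."""
    chars = list(syllable)
    if len(chars) == 2 and chars[1] in LEFT_SIDE_SIGNS:
        # The sign is stored second but drawn to the left of the consonant.
        return [chars[1], chars[0]]
    return chars


stored = "\u0915\u093F"  # KA followed by VOWEL SIGN I
print([f"U+{ord(c):04X}" for c in stored])                # logical order
print([f"U+{ord(c):04X}" for c in visual_order(stored)])  # display order
```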
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. Table 12-1 shows vowel letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ऄ | 0904 | <0905, 0946> |
आ | 0906 | <0905, 093E> |
ई | 0908 | <0930, 094D, 0907> |
ऊ | 090A | <0909, 0941> |
ऍ | 090D | <090F, 0945> |
ऎ | 090E | <090F, 0946> |
ऐ | 0910 | <090F, 0947> |
ऑ | 0911 | <0905, 0949> or <0906, 0945> |
ऒ | 0912 | <0905, 094A> or <0906, 0946> |
ओ | 0913 | <0905, 094B> or <0906, 0947> |
औ | 0914 | <0905, 094C> or <0906, 0948> |
ॲ | 0972 | <0905, 0945> |
ॳ | 0973 | <0905, 093A> |
ॴ | 0974 | <0905, 093B> or <0906, 093A> |
ॵ | 0975 | <0905, 094F> |
ॶ | 0976 | <0905, 0956> |
ॷ | 0977 | <0905, 0957> |
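The sequences in the "Do Not Use" column are not canonically equivalent to the atomic letters, so Unicode normalization does not repair them. A cleanup step has to be written explicitly. The following is a minimal sketch of such a step, with the table transcribed into a dictionary; `DO_NOT_USE` and `use_atomic_vowels` are illustrative names, and a production tool would need to consider context before rewriting text.

```python
import re

# "Do Not Use" sequence -> atomic vowel letter, transcribed from Table 12-1.
DO_NOT_USE = {
    "\u0905\u0946": "\u0904",
    "\u0905\u093E": "\u0906",
    "\u0930\u094D\u0907": "\u0908",
    "\u0909\u0941": "\u090A",
    "\u090F\u0945": "\u090D",
    "\u090F\u0946": "\u090E",
    "\u090F\u0947": "\u0910",
    "\u0905\u0949": "\u0911", "\u0906\u0945": "\u0911",
    "\u0905\u094A": "\u0912", "\u0906\u0946": "\u0912",
    "\u0905\u094B": "\u0913", "\u0906\u0947": "\u0913",
    "\u0905\u094C": "\u0914", "\u0906\u0948": "\u0914",
    "\u0905\u0945": "\u0972",
    "\u0905\u093A": "\u0973",
    "\u0905\u093B": "\u0974", "\u0906\u093A": "\u0974",
    "\u0905\u094F": "\u0975",
    "\u0905\u0956": "\u0976",
    "\u0905\u0957": "\u0977",
}

# Try longer sequences first so that three-character entries are not missed.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, DO_NOT_USE), key=len, reverse=True))
)


def use_atomic_vowels(text: str) -> str:
    """Replace analyzed vowel sequences with the atomic vowel letters."""
    previous = None
    while previous != text:  # repeat: one replacement can enable another
        previous = text
        text = _PATTERN.sub(lambda m: DO_NOT_USE[m.group(0)], text)
    return text


print(use_atomic_vowels("\u0905\u093E") == "\u0906")
```

The loop always terminates because every substitution shortens the string.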
#Virama (Halant). Devanagari employs a sign known in Sanskrit as the virama or vowel omission sign. In Hindi, it is called hal or halant, and that term is used in referring to the virama or to a consonant with its vowel suppressed by the virama. The terms are used interchangeably in this section.
The virama sign, U+094D DEVANAGARI SIGN VIRAMA, nominally serves to cancel (or kill) the inherent vowel of the consonant to which it is applied. When a consonant has lost its inherent vowel by the application of virama, it is known as a dead consonant; in contrast, a live consonant is one that retains its inherent vowel or is written with an explicit dependent vowel sign. In the Unicode Standard, a dead consonant is defined as a sequence consisting of a consonant letter followed by a virama. The default rendering for a dead consonant is to position the virama as a combining mark bound to the consonant letterform.
For example, if Cn denotes the nominal form of consonant C, and Cd denotes the dead consonant form, then a dead consonant is encoded as shown in Figure 12-1.
TAn | + | VIRAMAn | → | TAd |
त | + | ◌् | → | त् |
It could be assumed that a dead consonant may be combined with a vowel letter or sign to represent a CV orthographic syllable. Some non-Unicode implementations have used this approach; however, this is not done in implementations of the Unicode Standard. Instead, a CV orthographic syllable is represented with a (live) consonant followed by a dependent vowel. A dead consonant should not be followed either by an independent vowel letter or by a dependent vowel sign in an attempt to create an alternative representation of a CV orthographic syllable.
#Atomic Representation of Consonant Letters. Consonant letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. In particular, consonant half forms are dead-consonant forms that often resemble a full consonant form minus a vertical stem. This vertical stem is visually similar to the vowel sign denoting /ā/, U+093E DEVANAGARI VOWEL SIGN AA. Table 12-2 shows atomic consonant letters in Devanagari that could be graphically analyzed this way, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ख | 0916 | <0916, 094D, 093E>, <0916, 094D, 200D, 093E> |
ग | 0917 | <0917, 094D, 093E>, <0917, 094D, 200D, 093E> |
घ | 0918 | <0918, 094D, 093E>, <0918, 094D, 200D, 093E> |
च | 091A | <091A, 094D, 093E>, <091A, 094D, 200D, 093E> |
ज | 091C | <091C, 094D, 093E>, <091C, 094D, 200D, 093E> |
झ | 091D | <091D, 094D, 093E>, <091D, 094D, 200D, 093E> |
ञ | 091E | <091E, 094D, 093E>, <091E, 094D, 200D, 093E> |
ण | 0923 | <0923, 094D, 093E>, <0923, 094D, 200D, 093E> |
त | 0924 | <0924, 094D, 093E>, <0924, 094D, 200D, 093E> |
थ | 0925 | <0925, 094D, 093E>, <0925, 094D, 200D, 093E> |
ध | 0927 | <0927, 094D, 093E>, <0927, 094D, 200D, 093E> |
न | 0928 | <0928, 094D, 093E>, <0928, 094D, 200D, 093E> |
ऩ | 0929 | <0929, 094D, 093E>, <0929, 094D, 200D, 093E>, <0928, 093C, 094D, 093E>, <0928, 093C, 094D, 200D, 093E> |
प | 092A | <092A, 094D, 093E>, <092A, 094D, 200D, 093E> |
ब | 092C | <092C, 094D, 093E>, <092C, 094D, 200D, 093E> |
भ | 092D | <092D, 094D, 093E>, <092D, 094D, 200D, 093E> |
म | 092E | <092E, 094D, 093E>, <092E, 094D, 200D, 093E> |
य | 092F | <092F, 094D, 093E>, <092F, 094D, 200D, 093E> |
ल | 0932 | <0932, 094D, 093E>, <0932, 094D, 200D, 093E> |
व | 0935 | <0935, 094D, 093E>, <0935, 094D, 200D, 093E> |
श | 0936 | <0936, 094D, 093E>, <0936, 094D, 200D, 093E> |
ष | 0937 | <0937, 094D, 093E>, <0937, 094D, 200D, 093E> |
स | 0938 | <0938, 094D, 093E>, <0938, 094D, 200D, 093E> |
ख़ | 0959 | <0959, 094D, 093E>, <0959, 094D, 200D, 093E>, <0916, 093C, 094D, 093E>, <0916, 093C, 094D, 200D, 093E> |
ग़ | 095A | <095A, 094D, 093E>, <095A, 094D, 200D, 093E>, <0917, 093C, 094D, 093E>, <0917, 093C, 094D, 200D, 093E> |
ज़ | 095B | <095B, 094D, 093E>, <095B, 094D, 200D, 093E>, <091C, 093C, 094D, 093E>, <091C, 093C, 094D, 200D, 093E> |
य़ | 095F | <095F, 094D, 093E>, <095F, 094D, 200D, 093E>, <092F, 093C, 094D, 093E>, <092F, 093C, 094D, 200D, 093E> |
ॹ | 0979 | <0979, 094D, 093E>, <0979, 094D, 200D, 093E> |
ॺ | 097A | <097A, 094D, 093E>, <097A, 094D, 200D, 093E> |
ॻ | 097B | <097B, 094D, 093E>, <097B, 094D, 200D, 093E> |
ॼ | 097C | <097C, 094D, 093E>, <097C, 094D, 200D, 093E> |
ॾ | 097E | <097E, 094D, 093E>, <097E, 094D, 200D, 093E> |
ॿ | 097F | <097F, 094D, 093E>, <097F, 094D, 200D, 093E> |
Using the atomic consonant letter is the recommended practice. Representing a consonant as a half-form followed by a stem (U+093E) should be avoided.
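The "Do Not Use" column of Table 12-2 follows a regular pattern: the consonant, a virama, an optional ZERO WIDTH JOINER, and VOWEL SIGN AA, repeated for the nukta-decomposed spelling where one exists. The sketch below regenerates those sequences, for example to build a validation list. The `analyzed_sequences` helper is hypothetical, and it is only meaningful for the consonants that the table actually lists.

```python
import unicodedata

VIRAMA, ZWJ, SIGN_AA = 0x094D, 0x200D, 0x093E


def analyzed_sequences(consonant: int) -> list:
    """Return the analyzed sequences that should not be used for a consonant."""
    bases = [(consonant,)]
    # Letters such as U+0929 have a canonical decomposition (base + nukta).
    decomposition = unicodedata.decomposition(chr(consonant))
    if decomposition:
        bases.append(tuple(int(cp, 16) for cp in decomposition.split()))
    result = []
    for base in bases:
        result.append(base + (VIRAMA, SIGN_AA))
        result.append(base + (VIRAMA, ZWJ, SIGN_AA))
    return result


for seq in analyzed_sequences(0x0929):
    print("<" + ", ".join(f"{cp:04X}" for cp in seq) + ">")
```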
#Consonant Conjuncts. The Indic scripts are noted for a large number of consonant conjunct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent letterforms. This abbreviation takes place only in the context of a consonant cluster. An orthographic consonant cluster is defined as a sequence of characters that represents one or more dead consonants (denoted Cd) followed by a normal, live consonant letter (denoted Cl).
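The definition of a consonant cluster translates directly into a pattern: one or more (consonant, virama) pairs followed by a consonant. The following sketch uses a regular expression restricted to the main Devanagari consonant ranges, with an optional nukta after each consonant. The names are illustrative, and joiner characters (ZWJ and ZWNJ) are deliberately not handled.

```python
import re

CONSONANT = "[\u0915-\u0939\u0958-\u095F]"
NUKTA = "\u093C"
VIRAMA = "\u094D"

# One or more dead consonants (Cd) followed by a consonant letter (Cl).
CLUSTER = re.compile(f"(?:{CONSONANT}{NUKTA}?{VIRAMA})+{CONSONANT}{NUKTA}?")


def find_clusters(text: str) -> list:
    """Return every orthographic consonant cluster found in the text."""
    return CLUSTER.findall(text)


# PA + VIRAMA + SA + VIRAMA + YA is a single three-consonant cluster.
print(find_clusters("\u092A\u094D\u0938\u094D\u092F"))
```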
Under normal circumstances, a consonant cluster is depicted with a conjunct glyph if such a glyph is available in the current font. In the absence of a conjunct glyph, the one or more dead consonants that form part of the cluster are depicted using half-form glyphs. In the absence of half-form glyphs, the dead consonants are depicted using the nominal consonant forms combined with visible virama signs (see Figure 12-2).
(1) | GAd | + | DHAl | → | GAh + DHAn |
 | ग् | + | ध | → | ग्ध |
(2) | KAd | + | KAl | → | K.KAn |
 | क् | + | क | → | क्क |
(3) | KAd | + | SSAl | → | K.SSAn |
 | क् | + | ष | → | क्ष |
(4) | RAd | + | KAl | → | KAl + RAsup |
 | र् | + | क | → | र्क |
A number of types of conjunct formations appear in these examples: (1) a half-form of GA in its combination with the full form of DHA; (2) a vertical conjunct K.KA; and (3) a fully ligated conjunct K.SSA, in which the components are no longer distinct. In example (4) in Figure 12-2, the dead consonant RAd is depicted with the nonspacing combining mark RAsup (repha).
A consonant conjunct form can take a virama and so become a dead consonant conjunct form. A dead consonant conjunct form can be followed by another consonant letter, and so form a multi-consonant conjunct. For example, Figure 12-3 illustrates a three-consonant conjunct form, P.S.YA.
PAd | + | SAd | + | YAl | → | P.S.YAn |
प् | + | स् | + | य | → | प्स्य |
A well-designed Indic script font may contain hundreds of conjunct glyphs, but they are not encoded as Unicode characters because they are the result of ligation of distinct letters. Indic script rendering software must be able to map appropriate combinations of characters in context to the appropriate conjunct glyphs in fonts.
A dead consonant conjunct may have an appearance like a half form, because the vertical stem of the last consonant is removed. As a result, a live consonant conjunct could be analyzed visually as consisting of the dead, consonant-conjunct half form plus the vowel sign /ā/. As in the case of consonant letters, the live form should not be represented using a half form followed by U+093E DEVANAGARI VOWEL SIGN AA. Table 12-3 shows some examples of live consonant conjuncts that exhibit this visual pattern, but that should not be represented with fully analyzed sequences. Table 12-3 also shows the sequence of code points that should be used to represent these conjuncts in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
क्च | <0915, 094D, 091A> | <0915, 094D, 091A, 094D, 093E>, <0915, 094D, 091A, 094D, 200D, 093E> |
क्ष | <0915, 094D, 0937> | <0915, 094D, 0937, 094D, 093E>, <0915, 094D, 0937, 094D, 200D, 093E> |
त्त | <0924, 094D, 0924> | <0924, 094D, 0924, 094D, 093E>, <0924, 094D, 0924, 094D, 200D, 093E> |
न्त | <0928, 094D, 0924> | <0928, 094D, 0924, 094D, 093E>, <0928, 094D, 0924, 094D, 200D, 093E> |
Note that these are illustrative examples only. There are many consonant conjuncts that could be visually analyzed in the same way, and the same principle applies to all such cases: these should not be represented as dead conjunct plus vowel sign sequences.
#Explicit Virama (Halant). Normally a virama character serves to create dead consonants that are, in turn, combined with subsequent consonants to form conjuncts. This behavior usually results in a virama sign not being depicted visually. Occasionally, this default behavior is not desired when a dead consonant should be excluded from conjunct formation, in which case the virama sign is visibly rendered. To accomplish this goal, the Unicode Standard adopts the convention of placing the character U+200C ZERO WIDTH NON-JOINER immediately after the encoded dead consonant that is to be excluded from conjunct formation. In this case, the virama sign is always depicted as appropriate for the consonant to which it is attached.
For example, in Figure 12-4, the use of ZERO WIDTH NON-JOINER prevents the default formation of the conjunct form क्ष (K.SSAn).
KAd | + | ZWNJ | + | SSAl | → | KAd + SSAn |
क् | + | ZWNJ | + | ष | → | क्‌ष |
#Explicit Half-Consonants. When a dead consonant participates in forming a conjunct, the dead consonant form is often absorbed into the conjunct form, such that it is no longer distinctly visible. In other contexts, the dead consonant may remain visible as a half-consonant form. In general, a half-consonant form is distinguished from the nominal consonant form by the loss of its inherent vowel stem, a vertical stem appearing to the right side of the consonant form. In other cases, the vertical stem remains but some part of its right-side geometry is missing.
In certain cases, it is desirable to prevent a dead consonant from assuming full conjunct formation yet still not appear with an explicit virama. In these cases, the half-form of the consonant is used. To explicitly encode a half-consonant form, the Unicode Standard adopts the convention of placing the character U+200D ZERO WIDTH JOINER immediately after the encoded dead consonant. The ZERO WIDTH JOINER denotes a nonvisible letter that presents linking or cursive joining behavior on either side (that is, to the previous or following letter). Therefore, in the present context, the ZERO WIDTH JOINER may be considered to present a context to which a preceding dead consonant may join so as to create the half-form of the consonant.
For example, if Ch denotes the half-form glyph of consonant C, then a half-consonant form is represented as shown in Figure 12-5.
KAd | + | ZWJ | + | SSAl | → | KAh + SSAn |
क् | + | ZWJ | + | ष | → | क्‍ष |
In the absence of the ZERO WIDTH JOINER, the sequence in Figure 12-5 would normally produce the full conjunct form क्ष (K.SSAn).
This encoding of half-consonant forms also applies in the absence of a base letterform. That is, this technique may be used to encode independent half-forms, as shown in Figure 12-6.
GAd | + | ZWJ | → | GAh |
ग् | + | ZWJ | → | ग्‍ |
Other Indic scripts have similar half-forms for the initial consonants of a conjunct. Some, such as Oriya, also have similar half-forms for the final consonants; those are represented as shown in Figure 12-7.
NGAn | + | ZWJ | + | VIRAMA | + | KAl | → | NGAl + KAh |
ଙ | + | ZWJ | + | ◌୍ | + | କ | → | ଙ‍୍କ |
In the absence of the ZERO WIDTH JOINER, the sequence in Figure 12-7 would normally produce the full conjunct form ଙ୍କ (NG.KAn).
#Consonant Forms. In summary, each consonant may be encoded such that it denotes a live consonant, a dead consonant that may be absorbed into a conjunct, the half-form of a dead consonant, or a dead consonant with an overt halant that does not get absorbed into a conjunct (see Figure 12-8).
क | + | ◌् | + | ष | → | क्ष | K.SSAn |
क | + | ◌् | + | ZWJ | + | ष | → | क्‍ष | KAh + SSAn |
क | + | ◌् | + | ZWNJ | + | ष | → | क्‌ष | KAd + SSAn |
ଙ | + | ◌୍ | + | କ | → | ଙ୍କ | NG.KAn |
ଙ | + | ZWJ | + | ◌୍ | + | କ | → | ଙ‍୍କ | NGAn + KAh |
ଙ | + | ◌୍ | + | ZWNJ | + | କ | → | ଙ୍‌କ | NGAd + KAn |
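The encodings summarized in Figure 12-8 differ only in whether a joiner is present and where it is placed relative to the virama. As a small sketch (the function names are illustrative, not an established API), each form can be built as follows:

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def conjunct(first: str, second: str, virama: str) -> str:
    """Dead consonant that may be absorbed into a conjunct."""
    return first + virama + second


def half_form(first: str, second: str, virama: str) -> str:
    """Half-form of the first consonant: ZWJ after the dead consonant."""
    return first + virama + ZWJ + second


def explicit_virama(first: str, second: str, virama: str) -> str:
    """Visible virama on the first consonant: ZWNJ after the dead consonant."""
    return first + virama + ZWNJ + second


def final_half_form(first: str, second: str, virama: str) -> str:
    """Half-form of the final consonant (as in Oriya): ZWJ before the virama."""
    return first + ZWJ + virama + second


KA, SSA, DEVANAGARI_VIRAMA = "\u0915", "\u0937", "\u094D"
for build in (conjunct, half_form, explicit_virama):
    encoded = build(KA, SSA, DEVANAGARI_VIRAMA)
    print(build.__name__, [f"U+{ord(c):04X}" for c in encoded])
```

How each sequence is displayed still depends on the glyphs available in the font.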
As the rendering of conjuncts and half-forms depends on the availability of glyphs in the font, the following fallback strategy should be employed:
- If the coded character sequence would normally render with a full conjunct, but such a conjunct is not available, the fallback rendering is to use half-forms. If those are not available, the fallback rendering should use an explicit (visible) virama.
- If the coded character sequence would normally render with a half-form (it contains a ZWJ), but half-forms are not available, the fallback rendering should use an explicit (visible) virama.
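The fallback strategy above can be sketched as a small decision function. The `choose_rendering` name, its string labels, and the two boolean font-capability flags are assumptions made for illustration; a real rendering engine would query the font for specific glyphs rather than take flags.

```python
def choose_rendering(requested: str, has_conjunct: bool, has_half_form: bool) -> str:
    """Pick a rendering for a dead consonant followed by a consonant.

    requested: "conjunct" (no joiner), "half" (ZWJ present),
               or "virama" (ZWNJ present).
    """
    if requested not in ("conjunct", "half", "virama"):
        raise ValueError(f"unknown request: {requested!r}")
    # A full conjunct is used only when asked for and available.
    if requested == "conjunct" and has_conjunct:
        return "conjunct"
    # Half-forms serve a ZWJ request and are the first fallback for conjuncts.
    if requested in ("conjunct", "half") and has_half_form:
        return "half-form"
    # Last resort, and the direct result of a ZWNJ request.
    return "visible virama"


print(choose_rendering("conjunct", has_conjunct=False, has_half_form=True))
```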
#12.1.3 Rendering Devanagari
#Rules for Rendering. This section provides more formal and detailed rules for minimal rendering of Devanagari as part of a plain text sequence. It describes the mapping between Unicode characters and the glyphs in a Devanagari font. It also describes the combining and ordering of those glyphs.
These rules provide minimal requirements for legibly rendering interchanged Devanagari text. As with any script, a more complex procedure can add rendering characteristics, depending on the font and application.
In a font that is capable of rendering Devanagari, the number of glyphs is greater than the number of Devanagari characters.
#Notation. In the next set of rules, the following notation applies:
- Cn: Nominal glyph form of consonant C as it appears in the code charts.
- Cl: A live consonant, depicted identically to Cn.
- Cd: Glyph depicting the dead consonant form of consonant C.
- Ch: Glyph depicting the half-consonant form of consonant C.
- Ln: Nominal glyph form of a conjunct ligature consisting of two or more component consonants. A conjunct ligature composed of two consonants X and Y is also denoted X.Yn.
- RAsup: A nonspacing combining mark glyph form of U+0930 DEVANAGARI LETTER RA positioned above or attached to the upper part of a base glyph form. This form is also known as repha.
- RAsub: A nonspacing combining mark glyph form of U+0930 DEVANAGARI LETTER RA positioned below or attached to the lower part of a base glyph form.
- Vvs: Glyph depicting the dependent vowel sign form of a vowel V.
- VIRAMAn: The nominal glyph form of the nonspacing combining mark depicting U+094D DEVANAGARI SIGN VIRAMA.
A virama character is not always depicted. When it is depicted, it adopts this nonspacing mark form.
#Dead Consonant Rule. The following rule logically precedes the application of any other rule to form a dead consonant. Once formed, a dead consonant may be subject to other rules described next.
#R1 When a consonant Cn precedes a VIRAMAn, it is considered to be a dead consonant Cd. A consonant Cn that does not precede VIRAMAn is considered to be a live consonant Cl.
TAn | + | VIRAMAn | → | TAd |
त | + | ◌् | → | त् |
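Rule R1 amounts to a one-character lookahead. The sketch below labels each consonant in a string as dead or live. It is a simplification: it covers only the main consonant range U+0915..U+0939 and does not account for a nukta standing between the consonant and the virama. The function names are illustrative.

```python
VIRAMA = "\u094D"


def is_consonant(ch: str) -> bool:
    """True for the main Devanagari consonant range only."""
    return "\u0915" <= ch <= "\u0939"


def consonant_states(text: str) -> list:
    """Label each consonant as 'dead' (followed by virama) or 'live'."""
    states = []
    for i, ch in enumerate(text):
        if is_consonant(ch):
            dead = i + 1 < len(text) and text[i + 1] == VIRAMA
            states.append((ch, "dead" if dead else "live"))
    return states


# TA + VIRAMA + RA: TA is dead, RA is live.
print(consonant_states("\u0924\u094D\u0930"))
```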
#Consonant RA Rules. The character U+0930 DEVANAGARI LETTER RA takes one of a number of visual forms depending on its context in a consonant cluster. By default, this letter is depicted with its nominal glyph form (as shown in the code charts). In some contexts, it is depicted using one of two nonspacing glyph forms that combine with a base letterform.
#R2 If the dead consonant RAd precedes a consonant, then it is replaced by the superscript nonspacing mark RAsup, which is positioned so that it applies to the logically subsequent element in the memory representation.
RAd | + | KAl | → | KAl | + | RAsup | Displayed Output | |
र् | + | क | → | क | + | र्◌ | → | र्क |
RA¹d | + | RA²d | → | RA²d | + | RA¹sup | ||
र् | + | र् | → | र् | + | र्◌ | → | र्र् |
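Rule R2 can be pictured as a reordering step on a list of glyph descriptions. In the following sketch each glyph is a hypothetical (name, form) pair, and only the two-element case of R2 is handled: a leading dead RA is removed and re-inserted as a superscript mark after the element that follows it. Longer clusters need the positioning described in R3 and R4, which this sketch does not attempt.

```python
def apply_r2(glyphs: list) -> list:
    """Move a leading dead RA behind the next element as RAsup (R2 only)."""
    out = list(glyphs)
    if len(out) >= 2 and out[0] == ("RA", "dead"):
        out.pop(0)                    # drop the dead RA from the front
        out.insert(1, ("RA", "sup"))  # repha applies to the following element
    return out


# RAd + KAl -> KAl + RAsup
print(apply_r2([("RA", "dead"), ("KA", "live")]))
# RAd + RAd -> RAd + RAsup
print(apply_r2([("RA", "dead"), ("RA", "dead")]))
```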
#R3 If the superscript mark RAsup is to be applied to a dead consonant and that dead consonant is combined with another consonant to form a conjunct ligature, then the mark is positioned so that it applies to the conjunct ligature form as a whole.
RAd | + | JAd | + | NYAl | → | J.NYAn | + | RAsup | Displayed Output | |
र् | + | ज् | + | ञ | → | ज्ञ | + | र्◌ | → | र्ज्ञ |
#R4 If the superscript mark RAsup is to be applied to a dead consonant that is subsequently replaced by its half-consonant form, then the mark is positioned so that it applies to the form that serves as the base of the consonant cluster.
RAd | + | GAd | + | GHAl | → | GAh | + | GHAl | + | RAsup | Displayed Output | |
र् | + | ग् | + | घ | → | ग् | + | घ | + | र्◌ | → | र्ग्घ |
#R5 In conformance with the ISCII standard, the half-consonant form RRAh is represented as eyelash-RA. This form of RA is commonly used in writing Marathi and Newari.
RRAn | + | VIRAMAn | → | RRAh |
ऱ | + | ◌् | → | ऱ् |
#R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant RAd precedes ZERO WIDTH JOINER, then the half-consonant form RAh, depicted as eyelash-RA, is used instead of RAsup.
RAd | + | ZWJ | → | RAh |
र् | + | ZWJ | → | र् |
#R6 Except for the dead consonant RAd, when a dead consonant Cd precedes the live consonant RAl, then Cd is replaced with its nominal form Cn, and RA is replaced by the subscript nonspacing mark RAsub, which is positioned so that it applies to Cn.
TTHAd | + | RAl | → | TTHAn | + | RAsub | Displayed Output | |
ठ् | + | र | → | ठ | + | ◌्र | → | ठ्र |
#R7 For certain consonants, the mark RAsub may graphically combine with the consonant to form a conjunct ligature form. These combinations, such as the one shown here, are further addressed by the ligature rules described shortly.
PHAd | + | RAl | → | PHAn | + | RAsub | Displayed Output | |
फ् | + | र | → | फ | + | ◌्र | → | फ्र |
#R8 If a dead consonant (other than RAd) precedes RAd, then the replacement of RA by RAsub is performed as described above; however, the VIRAMA that formed RAd remains so as to form a dead consonant conjunct form.
TAd | + | RAd | → | TAn | + | RAsub | + | VIRAMAn | → | T.RAd |
त् | + | र् | → | त | + | ◌्र | + | ◌् | → | त्र् |
A dead consonant conjunct form that contains an absorbed RAd may subsequently combine to form a multipart conjunct form.
T.RAd | + | YAl | → | T.R.YAn |
त्र् | + | य | → | त्र्य |
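Rules R2, R5a, R6, and R8 can be read as a small decision procedure over the stored code points. The following Python sketch is illustrative only and is not part of the rules themselves: the function name `ra_forms` and its string labels are hypothetical, only the basic consonants U+0915..U+0939 are considered, and nukta and font-dependent ligature formation are ignored.

```python
RA, VIRAMA, ZWJ = "\u0930", "\u094D", "\u200D"

def is_consonant(ch):
    # Assumption: only the basic consonant range KA..HA is considered.
    return "\u0915" <= ch <= "\u0939"

def ra_forms(text):
    """Label the form each RA in `text` would take under R2, R5a, R6, R8."""
    forms = []
    for i, ch in enumerate(text):
        if ch != RA:
            continue
        # R6/R8: a dead consonant other than RA immediately precedes this RA.
        after_dead_consonant = (
            i >= 2
            and text[i - 1] == VIRAMA
            and is_consonant(text[i - 2])
            and text[i - 2] != RA
        )
        is_dead = text[i + 1:i + 2] == VIRAMA   # this RA is followed by VIRAMA
        following = text[i + 2:i + 3]           # what comes after that VIRAMA
        if after_dead_consonant:
            forms.append("RAsub")      # R6 (live RA) or R8 (dead RA)
        elif is_dead and following == ZWJ:
            forms.append("eyelash")    # R5a
        elif is_dead and following and is_consonant(following):
            forms.append("RAsup")      # R2
        else:
            forms.append("nominal")
    return forms
```

For the R2 example, `ra_forms("\u0930\u094D\u0915")` yields `["RAsup"]`; for the R6 example, `ra_forms("\u0920\u094D\u0930")` yields `["RAsub"]`.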
#Modifier Mark Rules. In addition to vowel signs, three other types of combining marks may be applied to a component of an orthographic syllable or to the syllable as a whole: nukta, bindus, and svaras (such as U+0951 DEVANAGARI STRESS SIGN UDATTA and U+0952 DEVANAGARI STRESS SIGN ANUDATTA).
#R9 The nukta sign, which modifies a consonant form, is placed immediately after the consonant in the memory representation and is attached to that consonant in rendering. If the consonant represents a dead consonant, then NUKTA should precede VIRAMA in the memory representation.
KAn | + | NUKTAn | + | VIRAMAn | → | QAd |
क | + | ◌़ | + | ◌् | → | क़् |
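The order required by rule R9 is also the order that Unicode normalization produces. The short Python check below is a sketch that relies on the canonical combining classes exposed by the standard `unicodedata` module (a property of the Unicode Character Database, not something stated in this section); the variable names are illustrative.

```python
import unicodedata

KA, NUKTA, VIRAMA = "\u0915", "\u093C", "\u094D"

recommended = KA + NUKTA + VIRAMA   # order required by rule R9
swapped = KA + VIRAMA + NUKTA       # the same marks stored the other way round

# Canonical reordering sorts adjacent combining marks by combining class,
# so the mark with the lower class (NUKTA) ends up before VIRAMA.
print(unicodedata.combining(NUKTA), unicodedata.combining(VIRAMA))

# After normalization the swapped sequence matches the recommended one.
print(unicodedata.normalize("NFC", swapped) == recommended)
```

Text that has passed through NFC or NFD therefore already has NUKTA before VIRAMA, whichever order was typed.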
#R10 Other modifying marks, in particular bindus and svaras, apply to the orthographic syllable as a whole and should follow (in the memory representation) all other characters that constitute the syllable. The bindus should follow any vowel signs, and the svaras should come last. A bindu and svara are placed side by side when they coexist on top of an orthographic syllable; the horizontal order may vary according to typographic concerns.
KAn | + | AAvs | + | CANDRABINDUn | ||
क | + | ◌ा | + | ◌ँ | → | काँ |
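The ordering in rule R10 (vowel signs, then bindus, then svaras) can be checked mechanically. The following Python sketch assumes a deliberately small inventory: the vowel signs U+093E..U+094C, the two bindus named in this section, and the two svaras named in rule R10. The helper names `mark_rank` and `trailing_marks_in_order` are hypothetical.

```python
VOWEL_SIGNS = {chr(cp) for cp in range(0x093E, 0x094D)}  # U+093E..U+094C
BINDUS = {"\u0901", "\u0902"}                            # candrabindu, anusvara
SVARAS = {"\u0951", "\u0952"}                            # udatta, anudatta

def mark_rank(ch):
    """Return 0 for a vowel sign, 1 for a bindu, 2 for a svara, else None."""
    if ch in VOWEL_SIGNS:
        return 0
    if ch in BINDUS:
        return 1
    if ch in SVARAS:
        return 2
    return None

def trailing_marks_in_order(syllable):
    """True if vowel signs precede bindus and bindus precede svaras."""
    ranks = [r for r in map(mark_rank, syllable) if r is not None]
    return ranks == sorted(ranks)
```

With these definitions, the sequence <KA, AA, CANDRABINDU> from the example above passes the check, while <KA, CANDRABINDU, AA> does not.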
#Ligature Rules. Subsequent to the application of the rules just described, a set of rules governing ligature formation applies. The precise application of these rules depends on the availability of glyphs in the current font being used to display the text.
#R11 If a dead consonant immediately precedes another dead consonant or a live consonant, then the first dead consonant may join the subsequent element to form a two-part conjunct ligature form.
JAd | + | NYAl | → | J.NYAn | TTAd | + | TTHAl | → | TT.TTHAn | |
ज् | + | ञ | → | ज्ञ | ट् | + | ठ | → | ट्ठ |
#R12 A conjunct ligature form can itself behave as a dead consonant and enter into further, more complex ligatures.
SAd | + | TAd | + | RAn | → | SAd | + | T.RAn | → | S.T.RAn |
स् | + | त् | + | र | → | स् | + | त्र | → | स्त्र |
A conjunct ligature form can also produce a half-form.
K.SSAd | + | YAl | → | K.SSh + YAn |
क्ष् | + | य | → | क्ष्य |
#R13 If a nominal consonant or conjunct ligature form precedes RAsub as a result of the application of rule R6, then the consonant or ligature form may join with RAsub to form a multipart conjunct ligature (see rule R6 for more information).
KAn | + | RAsub | → | K.RAn | PHAn | + | RAsub | → | PH.RAn | |
क | + | ◌्र | → | क्र | फ | + | ◌्र | → | फ्र |
#R14 In some cases, other combining marks will combine with a base consonant, either attaching at a nonstandard location or changing shape. In minimal rendering, there are only two cases: RAl with Uvs or UUvs.
RAl | + | Uvs | → | RUn | RAl | + | UUvs | → | RUUn | |
र | + | ◌ु | → | रु | र | + | ◌ू | → | रू |
#Memory Representation and Rendering Order. The storage of plain text in Devanagari and all other Indic scripts generally follows phonetic order; that is, a CV syllable with a dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in the memory representation. This order is employed by the ISCII standard and corresponds to both the phonetic order and the keying order of textual data (see Figure 12-9).
Character Order | Glyph Order | |||||
KAn | + | Ivs | → | Ivs | + | KAn |
क | + | ◌ि | → | कि |
Because Devanagari and other Indic scripts have some dependent vowels that must be depicted to the left side of their consonant letter, the software that renders the Indic scripts must be able to reorder elements in mapping from the logical (character) store to the presentational (glyph) rendering. For example, if Cn denotes the nominal form of consonant C, and Vvs denotes a left-side dependent vowel sign form of vowel V, then a reordering of glyphs with respect to encoded characters occurs as just shown.
#R15 When the dependent vowel Ivs is used to override the inherent vowel of a syllable, it is always written to the extreme left of the orthographic syllable. If the orthographic syllable contains a consonant cluster, then this vowel is always depicted to the left of that cluster.
TAd | + | RAl | + | Ivs | → | T.RAn | + | Ivs | → | Ivs + T.RAn |
त् | + | र | + | ◌ि | → | त्र | + | ◌ि | → | त्रि |
#R16 The presence of an explicit virama (either caused by a ZWNJ or by the absence of a conjunct in the font) blocks this reordering, and the dependent vowel Ivs is rendered after the rightmost such explicit virama.
TAd | + | ZWNJ | + | RAl | + | Ivs | → | TAd + Ivs + RAl |
त् | + | ZWNJ | + | र | + | ◌ि | → | त्रि |
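Rules R15 and R16 describe a reordering step that can be sketched in a few lines of Python. The sketch below works on code points rather than real glyphs, handles only a syllable that ends in U+093F DEVANAGARI VOWEL SIGN I, and models only explicit viramas caused by ZWNJ (not those caused by a missing conjunct in the font). The function name `glyph_order` is hypothetical.

```python
VIRAMA, VOWEL_SIGN_I, ZWNJ = "\u094D", "\u093F", "\u200C"

def glyph_order(syllable):
    """Return the code points of one orthographic syllable in visual order."""
    if not syllable.endswith(VOWEL_SIGN_I):
        return list(syllable)            # nothing to reorder
    body = syllable[:-1]
    # R16: an explicit virama (VIRAMA followed by ZWNJ) blocks the reordering;
    # the vowel sign is placed after the rightmost such virama.
    cut = body.rfind(VIRAMA + ZWNJ)
    if cut == -1:
        # R15: the vowel sign goes to the extreme left of the syllable.
        return [VOWEL_SIGN_I] + list(body)
    cut += len(VIRAMA + ZWNJ)
    return list(body[:cut]) + [VOWEL_SIGN_I] + list(body[cut:])
```

For <TA, VIRAMA, RA, I> the vowel sign moves to the front of the whole cluster; for <TA, VIRAMA, ZWNJ, RA, I> it is placed between the dead TA and the RA, matching the two examples above.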
#Alternative Forms of Cluster-Initial RA. In addition to reph (rule R2) and eyelash (rule R5a), a cluster-initial RA may also take its nominal form while the following consonant takes a reduced form. This behavior is required by languages that make a morphological distinction between “reph on YA” and “RA with reduced YA”, such as Braj Bhasha. To trigger this behavior, a ZWJ is placed immediately before the virama to request a reduced form of the following consonant, while preventing the formation of reph, as shown in the third example below.
र | + | ◌् | + | य | → | र्य | ||
र | + | ◌् | + | ZWJ | + | य | → | र्य |
र | + | ZWJ | + | ◌् | + | य | → | र्य |
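The three encodings differ only in whether a ZWJ is present and on which side of the virama it sits. The following Python sketch builds each sequence; the function name `ra_cluster` and the style labels are hypothetical, the eyelash sequence follows rule R5a, and the rendered result in practice depends on the font and shaping engine.

```python
RA, YA = "\u0930", "\u092F"
VIRAMA, ZWJ = "\u094D", "\u200D"

def ra_cluster(following, style="reph"):
    """Encode a cluster-initial RA before `following` in one of three styles."""
    if style == "reph":
        return RA + VIRAMA + following            # rule R2
    if style == "eyelash":
        return RA + VIRAMA + ZWJ + following      # rule R5a: ZWJ after virama
    if style == "reduced-following":
        return RA + ZWJ + VIRAMA + following      # ZWJ before virama
    raise ValueError(f"unknown style: {style!r}")

for style in ("reph", "eyelash", "reduced-following"):
    encoded = ra_cluster(YA, style)
    print(style, " ".join(f"U+{ord(ch):04X}" for ch in encoded))
```

Printing the code points rather than the text makes the position of the otherwise invisible ZWJ explicit.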
Similar, special rendering behavior of cluster-initial RA is noted in other scripts of India. See, for example, “Interaction of Repha and Ya-phalaa” in Section 12.2, Bengali (Bangla), “Reph” in Section 12.7, Telugu, and “Consonant Clusters Involving RA” in Section 12.8, Kannada.
#Sample Half-Forms. Table 12-4 shows examples of half-consonant forms that are commonly used with the Devanagari script. These forms are glyphs, not characters. They may be encoded explicitly using ZERO WIDTH JOINER as shown. In normal conjunct formation, they may be used spontaneously to depict a dead consonant in combination with subsequent consonant forms.
#Sample Ligatures. Table 12-5 shows examples of conjunct ligature forms that are commonly used with the Devanagari script. These forms are glyphs, not characters. Not every writing system that employs this script uses all of these forms; in particular, many of these forms are used only in writing Sanskrit texts. Furthermore, individual fonts may provide fewer or more ligature forms than are depicted here.
#Ligature Forms for Ra + Vocalic Liquids. The phonological sequence /r vocalic_r/, expressed with the character sequence <U+0930 ra, U+0943 vocalic_r>, can graphically appear as either of two forms, as shown in the first row of Table 12-6. It may appear as the full independent vowel form of the vocalic_r, with a superscript repha form of the ra (V + RAsup): रृ. Alternatively, it may appear as the full letter form of the ra with the subscript, dependent form of the vocalic_r (RAn + Vvs): रृ. Similarly, the phonological sequences with the other vocalic sounds (rr, l, ll) have two written forms, as shown in Table 12-6.
र | + | ◌ृ | → | रृ | or | रृ |
र | + | ◌ॄ | → | रॄ | or | रॄ |
र | + | ◌ॢ | → | रॢ | or | रॢ |
र | + | ◌ॣ | → | रॣ | or | रॣ |
The graphical forms displayed above with the reph (RAsup) should not be represented by sequences of RA + virama + independent vowel, as such sequences violate the general encoding principles of the script. CV orthographic syllables are not represented by consonant + virama + independent vowel.
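A text-checking tool could flag the discouraged sequences directly. The Python sketch below is one possible approach, with a hypothetical function name; it looks only for RA + virama followed by one of the four independent vocalic letters (U+090B, U+090C, U+0960, U+0961) and reports the offsets where they occur.

```python
import re

# RA, VIRAMA, then an independent vocalic letter:
# U+090B VOCALIC R, U+090C VOCALIC L, U+0960 VOCALIC RR, U+0961 VOCALIC LL
DISCOURAGED = re.compile("\u0930\u094D[\u090B\u090C\u0960\u0961]")

def find_discouraged_reph_vowel(text):
    """Return the start offsets of RA + virama + independent vocalic vowel."""
    return [m.start() for m in DISCOURAGED.finditer(text)]
```

The recommended sequence <U+0930, U+0943> produces no matches, so the check can be run over existing data without false positives from correctly encoded text.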
The practice of writing these phonological sequences as a reph on an independent vocalic liquid letter is also observed in other Indic scripts, such as Gujarati, Oriya, Telugu, Kannada, and Bhaiksuki.
#Sample Half-Ligature Forms. In addition to half-form glyphs of individual consonants, half-forms are used to depict conjunct ligature forms. A sample of such forms is shown in Table 12-7. These forms are glyphs, not characters. They may be encoded explicitly using ZERO WIDTH JOINER as shown. In normal conjunct formation, they may be used spontaneously to depict a conjunct ligature in combination with subsequent consonant forms.
क | + | ◌् | + | ष | + | ◌् | + | ZWJ | → | क्ष् |
ज | + | ◌् | + | ञ | + | ◌् | + | ZWJ | → | ज्ञ् |
त | + | ◌् | + | त | + | ◌् | + | ZWJ | → | त्त् |
त | + | ◌् | + | र | + | ◌् | + | ZWJ | → | त्र् |
श | + | ◌् | + | र | + | ◌् | + | ZWJ | → | श्र् |
#Language-Specific Allographs. In Marathi, Nepali, and some South Indian orthographies, variant glyphs are preferred for certain letters and digits. These include U+091D DEVANAGARI LETTER JHA, U+0932 DEVANAGARI LETTER LA, U+0936 DEVANAGARI LETTER SHA, and the digits five, eight, and nine, as shown in Table 12-8. Marathi also makes use of the “eyelash” form of the letter RA, as discussed in rule R5.
Code Point | Hindi | Marathi | Nepali |
---|---|---|---|
U+091D JHA | झ | झ | झ |
U+0932 LA | ल | ल | ल |
U+0936 SHA | श | श | श |
U+096B FIVE | ५ | ५ | ५ |
U+096E EIGHT | ८ | ८ | ८ |
U+096F NINE | ९ | ९ | ९ |
In addition, various languages written in Devanagari (or sometimes their various orthographic traditions) tend to have different preferences for formation of certain ligatures (see the text on “Sample Ligatures,” earlier in this section). For example, modern Nepali orthographies prefer a smaller number of ligatures than commonly used in Hindi or Marathi.
#Combining Marks. Devanagari and other Indic scripts have a number of combining marks that could be considered diacritic. One class of these marks, known as bindus, is represented by U+0901 DEVANAGARI SIGN CANDRABINDU and U+0902 DEVANAGARI SIGN ANUSVARA. These marks indicate nasalization or final nasal closure of a syllable. U+093C DEVANAGARI SIGN NUKTA is a true diacritic. It is used to extend the basic set of consonant letters by modifying them (with a subscript dot in Devanagari) to create new letters.
U+0951 DEVANAGARI STRESS SIGN UDATTA and U+0952 DEVANAGARI STRESS SIGN ANUDATTA are tone marks used in the representation of Vedic text in Devanagari. These two combining marks may also occur in the representation of Vedic texts written in other scripts, including transliterations in the Latin script. They are given the Indic_Syllabic_Category value of Cantillation_Mark.
U+0953 DEVANAGARI GRAVE ACCENT and U+0954 DEVANAGARI ACUTE ACCENT were originally encoded for Latin transliteration of Sanskrit text. However, such use is now discouraged, and Latin transliterations should simply use the generic combining marks, U+0300 COMBINING GRAVE ACCENT and U+0301 COMBINING ACUTE ACCENT. Because U+0953 and U+0954 are not intended to be used with the Devanagari script, they have no explicit property values for Indic_Positional_Category and Indic_Syllabic_Category.
#12.1.4 Devanagari Digits, Punctuation, and Symbols
#Digits. Each Indic script has a distinct set of digits appropriate to that script. These digits may or may not be used in ordinary text in that script. European digits have displaced the Indic script forms in modern usage in many of the scripts. Some Indic scripts—notably Tamil—lacked a distinct digit for zero in their traditional numerical systems, but adopted a zero based on general Indian practice.
#Punctuation. U+0964 । DEVANAGARI DANDA is similar to a full stop. U+0965 ॥ DEVANAGARI DOUBLE DANDA marks the end of a verse in traditional texts. The term danda is from Sanskrit, and the punctuation mark is generally referred to as a viram instead in Hindi. Although the danda and double danda are encoded in the Devanagari block, the intent is that they be used as common punctuation for all the major scripts of India covered by this chapter. Danda and double danda punctuation marks are not separately encoded for some Indic scripts, such as Gujarati, Gurmukhi, and Oriya. However, analogous punctuation marks for other Brahmi-derived scripts are separately encoded, particularly for scripts used primarily outside of India.
Many modern languages written in the Devanagari script intersperse punctuation derived from the Latin script. Thus U+002C COMMA and U+002E FULL STOP are freely used in writing Hindi, and the danda is usually restricted to more traditional texts. However, the danda may be preserved when such traditional texts are transliterated into the Latin script.
#Other Symbols. U+0970 ॰ DEVANAGARI ABBREVIATION SIGN appears after letters or combinations of letters and marks the sequence as an abbreviation. It is intended specifically for Devanagari script-based abbreviations, such as the Devanagari rupee sign. Other symbols and signs most commonly occurring in Vedic texts are encoded in the Devanagari Extended and Vedic Extensions blocks and are discussed in the text that follows.
The svasti (or well-being) signs often associated with the Hindu, Buddhist, and Jain traditions are encoded in the Tibetan block. See Section 13.4, Tibetan for further information.
#12.1.5 Extensions in the Main Devanagari Block
#Sindhi Letters. The characters U+097B DEVANAGARI LETTER GGA, U+097C DEVANAGARI LETTER JJA, U+097E DEVANAGARI LETTER DDDA, and U+097F DEVANAGARI LETTER BBA are used to write Sindhi implosive consonants. Previous versions of the Unicode Standard recommended representing those characters as a combination of the usual consonants with nukta and anudātta, but those combinations are no longer recommended.
#Konkani. Konkani makes use of additional sounds that can be represented with combinations such as U+091A DEVANAGARI LETTER CA plus U+093C DEVANAGARI SIGN NUKTA and U+091F DEVANAGARI LETTER TTA plus U+0949 DEVANAGARI VOWEL SIGN CANDRA O.
#Kashmiri Letters. There are several letters for use with Kashmiri when written in Devanagari script. Long and short versions of the independent vowel letters are encoded in the range U+0973..U+0977. The corresponding dependent vowel signs are U+093A DEVANAGARI VOWEL SIGN OE, U+093B DEVANAGARI VOWEL SIGN OOE, and U+094F DEVANAGARI VOWEL SIGN AW. The forms of the independent vowels for Kashmiri are constructed by using the glyphs of the matras U+093B DEVANAGARI VOWEL SIGN OOE, U+094F DEVANAGARI VOWEL SIGN AW, U+0956 DEVANAGARI VOWEL SIGN UE, and U+0957 DEVANAGARI VOWEL SIGN UUE as diacritics on U+0905 DEVANAGARI LETTER A. However, for representation of independent vowels in Kashmiri, use the encoded, composite characters in the range U+0973..U+0977 and not the visually equivalent sequences of U+0905 DEVANAGARI LETTER A plus the matras. See Table 12-1. A few of the letters identified as being used for Kashmiri are also used to write the Bihari languages.
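Because the sequences of U+0905 plus a matra are only visually equivalent to the encoded letters, normalization does not unify them, and a cleanup step has to map them explicitly. The Python sketch below illustrates one way to do so. The pairing of each matra with a letter is an assumption made by matching character names (for example, VOWEL SIGN OOE with LETTER OOE); Table 12-1 remains the authoritative list, and the function name is hypothetical.

```python
LETTER_A = "\u0905"

# Assumed pairing, derived from matching character names; check Table 12-1.
ATOMIC_FOR_MATRA = {
    "\u093B": "\u0974",  # VOWEL SIGN OOE -> LETTER OOE
    "\u094F": "\u0975",  # VOWEL SIGN AW  -> LETTER AW
    "\u0956": "\u0976",  # VOWEL SIGN UE  -> LETTER UE
    "\u0957": "\u0977",  # VOWEL SIGN UUE -> LETTER UUE
}

def use_atomic_kashmiri_vowels(text):
    """Replace LETTER A + matra with the corresponding encoded letter."""
    for matra, letter in ATOMIC_FOR_MATRA.items():
        text = text.replace(LETTER_A + matra, letter)
    return text
```

Only a matra that directly follows U+0905 is affected; the same matra after a consonant is left untouched.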
#Bodo, Dogri, and Maithili. The orthographies of the Bodo, Dogri, and Maithili languages of India make use of U+02BC “ ’ ” MODIFIER LETTER APOSTROPHE, either as a tone mark or as a length mark. In Bodo and Dogri, this character functions as a tone mark, called gojau kamaa in Bodo and sur chinha in Dogri. In Dogri, the tone mark occurs after short vowels, including inherent vowels, and indicates a high-falling tone. After Dogri long vowels, a high-falling tone is written instead using U+0939 DEVANAGARI LETTER HA.
In Maithili, U+02BC “ ’ ” MODIFIER LETTER APOSTROPHE is used to indicate the prolongation of a short a and to indicate the truncation of words. This sign is called bikari kaamaa.
Examples illustrating the use of U+02BC “ ’ ” MODIFIER LETTER APOSTROPHE in Bodo, Dogri, and Maithili are shown in Figure 12-10. The Maithili examples show the same sentence, first in full form, and then using U+02BC to show truncation of words.
In both Dogri and Maithili, an avagraha sign, U+093D DEVANAGARI SIGN AVAGRAHA, is used to indicate extra-long vowels. An example of the contrastive use of this avagraha sign is shown for Dogri in Figure 12-11.
#Letters for Bihari Languages. A number of the Devanagari vowel letters have been used to write the Bihari languages Bhojpuri, Magadhi, and Maithili, as listed in Table 12-9.
U+090E ऎ DEVANAGARI LETTER SHORT E |
U+0912 ऒ DEVANAGARI LETTER SHORT O |
U+0946 ◌ॆ DEVANAGARI VOWEL SIGN SHORT E |
U+094A ◌ॊ DEVANAGARI VOWEL SIGN SHORT O |
U+0973 ॳ DEVANAGARI LETTER OE |
U+0974 ॴ DEVANAGARI LETTER OOE |
U+0975 ॵ DEVANAGARI LETTER AW |
U+093A ◌ऺ DEVANAGARI VOWEL SIGN OE |
U+093B ◌ऻ DEVANAGARI VOWEL SIGN OOE |
U+094F ◌ॏ DEVANAGARI VOWEL SIGN AW |
#Letter Short a. The character U+0904 DEVANAGARI LETTER SHORT A is used to denote a short e in the Awadhi language, an Indo-Aryan language spoken in the north Indian state of Uttar Pradesh and southern Nepal. A publisher in Lucknow, Uttar Pradesh also uses it in Hindi translations and Devanagari transliterations of the Kannada, Telugu, Tamil, Malayalam and Kashmiri languages.
#Prishthamatra Orthography. In the historic Prishthamatra orthography, the vowel signs for e, ai, o, and au are represented using U+094E DEVANAGARI VOWEL SIGN PRISHTHAMATRA E (which goes on the left side of the consonant) alone or in combination with one of U+0947 DEVANAGARI VOWEL SIGN E, U+093E DEVANAGARI VOWEL SIGN AA or U+094B DEVANAGARI VOWEL SIGN O. Table 12-10 shows those combinations applied to ka. In the underlying representation of text, U+094E should be first in the sequence of dependent vowel signs after the consonant, and may be followed by U+0947, U+093E or U+094B.
Syllable | Prishthamatra Orthography | Modern Orthography |
---|---|---|
ke | कॎ <0915, 094E> | के <0915, 0947> |
kai | कॎे <0915, 094E, 0947> | कै <0915, 0948> |
ko | कॎा <0915, 094E, 093E> | को <0915, 094B> |
kau | कॎो <0915, 094E, 094B> | कौ <0915, 094C> |
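The correspondences in the table can be applied mechanically when converting text from the Prishthamatra orthography to the modern one. The following Python sketch uses a single regular-expression pass so that one replacement cannot feed into another; the function and variable names are hypothetical, and the mapping is taken directly from the four rows above.

```python
import re

# Key: the vowel sign (if any) that follows U+094E; value: the modern sign.
MODERN = {
    "": "\u0947",        # <094E>       -> VOWEL SIGN E
    "\u0947": "\u0948",  # <094E, 0947> -> VOWEL SIGN AI
    "\u093E": "\u094B",  # <094E, 093E> -> VOWEL SIGN O
    "\u094B": "\u094C",  # <094E, 094B> -> VOWEL SIGN AU
}
PATTERN = re.compile("\u094E([\u0947\u093E\u094B]?)")

def prishthamatra_to_modern(text):
    """Rewrite Prishthamatra vowel-sign sequences in the modern orthography."""
    return PATTERN.sub(lambda m: MODERN[m.group(1)], text)
```

Text that contains no U+094E passes through unchanged, so the function is safe to apply to data that mixes the two orthographies.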
#12.1.6 Devanagari Extended: U+A8E0–U+A8FF
This block of characters is used chiefly for Vedic Sanskrit, although many of the characters are generic and can be used by other Indic scripts. The block includes a set of combining digits, letters, and avagraha which is used as a system of cantillation marks in the early Vedic Sanskrit texts. The Devanagari Extended block also includes nasalization marks (candrabindu), and a number of editorial marks.
The Devanagari Extended block, as well as the Vedic Extensions block and the Devanagari block, include characters that are used to indicate tone in Vedic Sanskrit. Indian linguists describe tone as a feature of vowels, shared by the consonants in the same syllable, or as a feature of syllables. In Vedic, vowels are marked for tone, as are certain non-vocalic characters that are syllabified in Vedic recitation (visarga and anusvāra); the tone marks directly follow the vowel or other character that they modify. Vowels are categorized according to tone as either udātta (high-toned or “acute”), anudātta (low-toned or “non-acute”), svarita (“modulated” or dropping from high to low tone) or ekaśruti (monotone). Some of the symbols used for marking tone indicate different tones in different traditions. Visarga may be marked for all three tones. The tone marks also can indicate other modifications of vocal text, such as vibration, lengthening a vowel, or skipping a tone in a descending scale.
#Cantillation Marks for the Sāmaveda. One of the four major Vedic texts is Sāmaveda. The text is both recited (Sāmaveda-Saṁhitā) and sung (Sāmagāna), and is marked differently for the purposes of each. Cantillation marks are used to indicate length, tone, and other features in the recited text of Sāmaveda, and in the Kauthuma and Rāṇāyanīya traditions of Sāmagāna. These marks are encoded as a series of combining digits, alphabetic characters, and avagraha in the range U+A8E0..U+A8F1. The marks are rendered directly over the base letter. They are represented in text immediately after the syllable they modify.
In certain cases, two marks may occur over a letter: U+A8E3 COMBINING DEVANAGARI DIGIT THREE may be followed by U+A8EC COMBINING DEVANAGARI LETTER KA, for example. Although no use of U+A8E8 COMBINING DEVANAGARI DIGIT EIGHT has been found in the Sāmagāna, it is included to provide a complete set of 0–9 digits. The combining marks encoded for the Sāmaveda do not include characters that may appear as subscripts and superscripts in the Jaiminīya tradition of Sāmagāna, which used interlinear annotation. Interlinear annotation may be rendered using Ruby and may be represented by means of markup or other higher-level protocols.
#Nasalization Marks. The spacing marks in the range U+A8F2..U+A8F7 include the term candrabindu in their names and indicate nasalization. These marks are all aligned with the headline. Note that U+A8F2 DEVANAGARI SIGN SPACING CANDRABINDU is lower than U+0901 DEVANAGARI SIGN CANDRABINDU.
#Editorial Marks. A set of editorial marks is encoded in the range U+A8F8..U+A8FB for use with Devanagari. U+A8F9 DEVANAGARI GAP FILLER signifies an intentional gap that would ordinarily be filled with text. In contrast, U+A8FB DEVANAGARI HEADSTROKE indicates illegible gaps in the original text. The glyph for DEVANAGARI HEADSTROKE should be designed so that it does not connect to the headstroke of the letters beside it, which will make it possible to indicate the number of illegible syllables in a given space. U+A8F8 DEVANAGARI SIGN PUSHPIKA acts as a filler in text, and is commonly flanked by double dandas. U+A8FA DEVANAGARI CARET, a zero-width spacing character, marks the insertion point of omitted text, and is placed at the insertion point between two orthographic syllables. It can also be used to indicate word division.
#12.1.7 Devanagari Extended-A: U+11B00–U+11B5F
#Bhale Mīṇḍu. Characters in the range of U+11B00..U+11B4F represent auspicious signs used in benedictions of Jaina manuscripts and inscriptions in western and central India. They are functionally similar to, but distinct from, siddham signs such as U+A8FC DEVANAGARI SIGN SIDDHAM.
These auspicious signs are typically represented as sequences of up to three characters: a head-mark (U+11B00, U+11B01), followed by an initial or bhale (U+11B02..U+11B06), and a terminal or mīṇḍu (U+11B09, U+0966). The sequence is usually followed by a double danda.
#12.1.8 Vedic Extensions: U+1CD0–U+1CFF
The Vedic Extensions block includes characters that are used in Vedic texts; they may be used with Devanagari, as well as many other Indic scripts. This block includes a set of characters designating tone, grouped by the various Vedic traditions in which they occur. Characters indicating tone marks directly follow the character they modify. Most of these marks indicate the tone of vowels, but three of them specifically indicate the tone of visarga.
A number of marks for nasalization are also included in the block. U+1CD3 VEDIC SIGN NIHSHVASA is a breaking mark which separates sections of Samavedic singing between which a pause is disallowed. The block also contains several Vedic signs for ardhavisarga, jihvamuliya, upadhmaniya and atikrama.
#Tone Marks. The Vedic tone marks are all combining marks. The tone marks are grouped together in the code charts based upon the tradition in which they appear: they are used in the four core texts of the Vedas (Sāmaveda, Yajurveda, Rigveda, and Atharvaveda) and in the prose text on Vedic ritual (Śatapathabrāhmaṇa). The character U+1CD8 VEDIC TONE CANDRA BELOW is also used to identify the short vowels e and o. In this usage, the prescribed order is the Indic syllable (aksara), followed by U+1CD8 VEDIC TONE CANDRA BELOW and the tone mark (svara). When a tone mark is placed below, it appears below the VEDIC TONE CANDRA BELOW.
In addition to the marks encoded in this block, Vedic texts may use other nonspacing marks from the General Diacritics block and other blocks. For example, U+20F0 COMBINING ASTERISK ABOVE would be used to represent a mark of that shape above a Vedic letter.
#Diacritics for the Visarga. A set of combining marks that serve as diacritics for the visarga is encoded in the range U+1CE2..U+1CE8. These marks indicate that the visarga has a particular tone. For example, the combination U+0903 DEVANAGARI SIGN VISARGA plus U+1CE2 VEDIC SIGN VISARGA SVARITA represents a svarita visarga. The upward-shaped diacritic is used for the udātta (high-toned), the downward-shaped diacritic for anudātta (low-toned), and the midline glyph indicates the svarita (modulated tone).
In Vedic manuscripts the tonal mark (that is, the horizontal bar, upward curve and downward curve) appears in colored ink, while the two dots of the visarga appear in black ink. The characters for accents can be represented using separate characters, to make it easier for color information to be maintained by means of markup or other higher-level protocols.
#Nasalization Marks. A set of spacing marks and one combining mark, U+1CED VEDIC SIGN TIRYAK, are encoded in the range U+1CE9..U+1CF1. They describe phonetic distinctions in the articulation of nasals. The gomukha characters from U+1CE9..U+1CEC may be combined with U+0902 DEVANAGARI SIGN ANUSVARA or U+0901 DEVANAGARI SIGN CANDRABINDU. U+1CF1 VEDIC SIGN ANUSVARA UBHAYATO MUKHA may indicate a visarga with a tonal mark as well as a nasal. The three characters, U+1CEE VEDIC SIGN HEXIFORM LONG ANUSVARA, U+1CEF VEDIC SIGN LONG ANUSVARA, and U+1CF0 VEDIC SIGN RTHANG LONG ANUSVARA, are all synonymous and indicate a long anusvāra after a short vowel. U+1CED VEDIC SIGN TIRYAK is the only combining character in this set of nasalization marks. While it appears similar to U+094D DEVANAGARI SIGN VIRAMA, it is used to render glyph variants of nasal marks that occur in manuscripts and printed texts.
#Ardhavisarga. U+1CF2 VEDIC SIGN ARDHAVISARGA is a character that marks either the jihvāmūlīya, a velar fricative occurring only before the unvoiced velar stops ka and kha, or the upadhmānīya, a bilabial fricative occurring only before the unvoiced labial stops pa and pha. Ardhavisarga is a spacing character. It is represented in text in visual order before the consonant it modifies.
#12.2 Bengali (Bangla)
#12.2.1 Bengali: U+0980–U+09FF
The term Bengali is used in the Unicode Standard for the script and character names. However, users of the script in the Indian state of West Bengal and the People’s Republic of Bangladesh prefer Bangla, so the term Bangla is used in this section and elsewhere in this chapter. The Bangla script is used for writing languages such as Bangla, Assamese, Bishnupriya Manipuri, Daphla, Garo, Hallam, Khasi, Mizo, Munda, Naga, Rian, and Santali. Although the Assamese language has been written historically using regional scripts, known generally as “Kamrupi,” its modern writing system is similar to that presently used for Bangla, with the addition of extra characters. The Bangla block supports the modern Assamese orthography. In the Indian state of Assam, the script is called Asamiya or Assamese.
The Bangla script is a North Indian script historically related to Devanagari.
#Virama (Hasant). The Bangla script uses the Unicode virama model to form conjunct consonants. In Bangla, the virama is known as hasant.
#Vowel Letters. Vowel letters of Indic scripts are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. Table 12-11 shows the Bangla vowel letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
আ | 0986 | <0985, 09BE> |
ৠ | 09E0 | <098B, 09C3> |
ৡ | 09E1 | <098C, 09E2> |
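The “Do Not Use” column lends itself to an automated check. The Python sketch below reports each analyzed sequence from the table together with the atomic code point that should replace it; the names `DO_NOT_USE` and `find_analyzed_vowels` are hypothetical, and the data is limited to the three rows shown.

```python
# Analyzed sequence -> atomic vowel letter, from the three table rows.
DO_NOT_USE = {
    "\u0985\u09BE": "\u0986",
    "\u098B\u09C3": "\u09E0",
    "\u098C\u09E2": "\u09E1",
}

def find_analyzed_vowels(text):
    """Return sorted (offset, sequence, atomic) tuples for each hit in text."""
    hits = []
    for seq, atomic in DO_NOT_USE.items():
        start = text.find(seq)
        while start != -1:
            hits.append((start, seq, atomic))
            start = text.find(seq, start + 1)
    return sorted(hits)
```

The function reports rather than rewrites, so that a reviewer can decide how to handle each occurrence. Sequences outside the table, such as <0989, 09BE>, are not flagged.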
There is an exception to this general pattern for the representation of Bangla independent vowel letters, for the Bangla script orthography of Kokborok, a major language of Tripura state in Northeast India. Kokborok has diphthongs which can occur as initial letters. To reflect existing practice, these diphthongs are represented with two character sequences, rather than as atomic characters, as shown in Table 12-12. Rendering systems which support display of the Kokborok orthography need to be aware of these exceptional sequences. The sequence for vowel letter aw uses U+09D7 ◌ৗ BENGALI AU LENGTH MARK, also noted in the following discussion of two-part vowel signs.
For | Use | Description |
---|---|---|
অৗ | <0985, 09D7> | vowel letter aw |
উা | <0989, 09BE> | vowel letter ua |
#Two-Part Vowel Signs. The Bangla script, along with a number of other Indic scripts, makes use of two-part dependent vowel signs. In these dependent vowels (matras) one-half of the vowel is displayed on each side of a consonant letter or cluster: for example, U+09CB ◌ো BENGALI VOWEL SIGN O and U+09CC ◌ৌ BENGALI VOWEL SIGN AU. To provide compatibility with existing implementations of the scripts that use two-part vowel signs, the Unicode Standard explicitly encodes the right half of these vowel signs. For example, U+09D7 ◌ৗ BENGALI AU LENGTH MARK represents the right-half glyph component of U+09CC ◌ৌ BENGALI VOWEL SIGN AU. In Bangla orthography, the au length mark is always used in conjunction with the left part and does not have a meaning on its own.
#Special Characters. U+09F2..U+09F9 are a series of Bangla additions for writing currency and fractions.
#Historic Characters. The characters vocalic rr, vocalic l and vocalic ll, both in their independent and dependent forms (U+098C, U+09C4, U+09E0..U+09E3), are only used to write Sanskrit words in the Bangla script.
#Characters for Assamese. Assamese employs two letters not used for the Bangla language. The Assamese letter ra is represented in Unicode by U+09F0 ৰ BENGALI LETTER RA WITH MIDDLE DIAGONAL, and the Assamese letter wa is represented by U+09F1 ৱ BENGALI LETTER RA WITH LOWER DIAGONAL.
Assamese uses a conjunct character called kssa. Although kssa is often considered a separate letter of the alphabet, it is not separately encoded. The conjunct is represented by the sequence <U+0995 ক BENGALI LETTER KA, U+09CD ◌্ BENGALI SIGN VIRAMA, U+09B7 ষ BENGALI LETTER SSA>. This same sequence is also used to represent the Bangla letter khinya (or khiya).
Assamese uses two additional consonant-vowel ligatures formed with U+09F0 ৰ BENGALI LETTER RA WITH MIDDLE DIAGONAL, which are not used for the Bangla language. These consonant-vowel ligatures are shown in the “ligated” column in Table 12-13.
#Rendering Behavior. Like other Brahmic scripts in the Unicode Standard, Bangla uses the hasant to form conjunct characters. For example, <U+09B8 স BENGALI LETTER SA, U+09CD ◌্ BENGALI SIGN VIRAMA, U+0995 ক BENGALI LETTER KA> yields the conjunct স্ক SKA. For general principles regarding the rendering of the Bangla script, see the rules for rendering in Section 12.1, Devanagari.
#Consonant-Vowel Ligatures. Some Bangla consonant plus vowel combinations have two distinct visual presentations. The first visual presentation is a traditional ligated form, in which the vowel combines with the consonant in a novel way. In the second presentation, the vowel is joined to the consonant but retains its nominal form, and the combination is not considered a ligature. These consonant-vowel combinations are illustrated in Table 12-14.
The ligature forms of these consonant-vowel combinations are traditional. They are used in handwriting and some printing. The “non-ligated” forms are more common; they are used in newspapers and are associated with modern typefaces. However, the traditional ligatures are preferred in some contexts.
No semantic distinctions are made in Bangla text on the basis of the two different presentations of these consonant-vowel combinations. However, some users consider it important that implementations support both forms and that the distinction be representable in plain text. This may be accomplished by usingU+200DZERO WIDTH JOINER andU+200CZERO WIDTH NON-JOINER to influence ligature glyph selection. (See “Cursive Connection and Ligatures” inSection 23.2, Layout Controls.) Joiners are rarely needed in this situation. The rendered appearance will typically be the result of a font choice.
A given font implementation can choose whether to treat the ligature forms of the consonant-vowel combinations as the defaults for rendering. If the non-ligated form is the default, then ZWJ can be inserted to request a ligature, as shown inFigure 12-12.
If the ligated form is the default for a given font implementation, then ZWNJ can be inserted to block a ligature, as shown inFigure 12-13.
#Khiya. The letterক্ষ, known askhiya orkhinya, is often considered as a distinct letter of the Bangla alphabet. However, it is not encoded separately. It is represented by the sequence <U+0995কBENGALI LETTER KA,U+09CD◌্BENGALI SIGN VIRAMA,U+09B7ষBENGALI LETTER SSA>.
#Khanda Ta. In Bangla, a dead consonant ta makes use of a special form, U+09CE ৎ BENGALI LETTER KHANDA TA. This form is used in all contexts except where it is immediately followed by one of the consonants: ta, tha, na, ba, ma, ya, or ra.
Khanda ta cannot bear a vowel matra or combine with a following consonant to form a conjunctaksara. It can form a conjunctaksara only with a preceding dead consonantra, with the latter being displayed with arepha glyph placed on thekhanda ta.
Versions of the Unicode Standard prior to Version 4.1 recommended that khanda ta be represented as the sequence <U+09A4 ত BENGALI LETTER TA, U+09CD ◌্ BENGALI SIGN VIRAMA, U+200D ZERO WIDTH JOINER> in all circumstances. U+09CE ৎ BENGALI LETTER KHANDA TA should instead be used explicitly in newly generated text, but users are cautioned that instances of the older representation may exist.
The Bangla syllable tta illustrates the usage of khanda ta when followed by ta. The syllable tta is normally represented with the sequence <U+09A4 ta, U+09CD hasant, U+09A4 ta>. That sequence will normally be displayed using a single glyph tta ligature, as shown in the first example in Figure 12-14.
It is also possible for the sequence <ta, hasant, ta> to be displayed with a full ta glyph combined with a hasant glyph, followed by another full ta glyph ত্ত. The choice of form actually displayed depends on the display engine, based on the availability of glyphs in the font.
The Unicode Standard also provides an explicit way to show the hasant glyph. To do so, a ZERO WIDTH NON-JOINER is inserted after the hasant. That sequence is always displayed with the explicit hasant, as shown in the second example in Figure 12-14.
When the syllable tta is written with a khanda ta, however, the character U+09CE ৎ BENGALI LETTER KHANDA TA is used and no hasant is required, as khanda ta is already a dead consonant. The rendering of khanda ta is illustrated in the third example in Figure 12-14.
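Because text generated under the pre-4.1 recommendation may still be in circulation, a process that compares or searches Bangla text may want to fold the older representation into the newer one. The following is a minimal sketch of that idea; the function names are illustrative, and a blanket replacement is an assumption that may not suit data where <ta, hasant, ZWJ> was entered for some other purpose.

```python
# Code points named in the text above.
TA = "\u09A4"         # BENGALI LETTER TA
HASANT = "\u09CD"     # BENGALI SIGN VIRAMA
ZWJ = "\u200D"        # ZERO WIDTH JOINER
KHANDA_TA = "\u09CE"  # BENGALI LETTER KHANDA TA

# Representation recommended before Version 4.1.
LEGACY_KHANDA_TA = TA + HASANT + ZWJ


def modernize_khanda_ta(text: str) -> str:
    """Replace every <TA, VIRAMA, ZWJ> sequence with U+09CE."""
    return text.replace(LEGACY_KHANDA_TA, KHANDA_TA)


def same_ignoring_khanda_ta_encoding(a: str, b: str) -> bool:
    """Compare two strings, treating both khanda ta encodings as equal."""
    return modernize_khanda_ta(a) == modernize_khanda_ta(b)
```

The ordinary conjunct sequence <ta, hasant, ta> contains no ZWJ and is left untouched by this folding.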
#Ya-phalaa.Ya-phalaa is a presentation form ofU+09AFযBENGALI LETTER YA. Represented by the sequence <U+09CD◌্BENGALI SIGN VIRAMA,U+09AFযBENGALI LETTER YA>,ya-phalaa has a special form্য. When combined withU+09BE◌াBENGALI VOWEL SIGN AA, it is used for transcribing [æ] as in the “a” in the English word “bat.” Theya-phalaa appears inর্যাশ [ræʃ] “rash,” which provides a minimal pair withরাশ [raʃ] “a whole lot.”
Ya-phalaa can be applied to initial vowels as well:
অ্যা = <0985, 09CD, 09AF, 09BE> (a- hasant ya -aa)
এ্যা = <098F, 09CD, 09AF, 09BE> (e- hasant ya -aa)
If a candrabindu or other combining mark needs to be added in the sequence, it comes at the end of the sequence. For example:
অ্যাঁ = <0985, 09CD, 09AF, 09BE, 0981> (a- hasant ya -aa candrabindu)
Further examples:
অ + ্ +য +◌া →অ্যা
এ + ্ +য +◌া →এ্যা
ত + ্ +য +◌া →ত্যা
#Interaction of Repha and Ya-phalaa. The formation of therepha form is defined inSection 12.1, Devanagari, “Rules for Rendering,” R2. Basically, therepha is formed when ara that has the inherent vowel killed by thehasant begins a syllable. This scenario is shown in the following example:
Theya-phalaa is a post-base form ofya and is formed when theya is the final consonant of a syllable cluster. In this case, the previous consonant retains its base shape and thehasant is combined with the followingya. This scenario is shown in the following example:
An ambiguous situation is encountered when the combination ofra +hasant +ya is encountered:
To resolve the ambiguity with this combination, the Unicode Standard adopts the convention of placing the characterU+200DZERO WIDTH JOINER immediately after thera to obtain theya-phalaa. Therepha form is rendered when no ZWJ is present, as shown in the following example:
When the first character of the cluster is not a ra, the ya-phalaa is the normal rendering of a ya, and a ZWJ is not necessary but can be present. Such a convention would make it possible, for example, for input methods to consistently associate ya-phalaa with the sequence <ZWJ, hasant, ya>.
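The convention for ra + hasant + ya can be expressed as two distinct code point sequences. The sketch below builds both and classifies a cluster by which one it starts with; the constant and function names are hypothetical, and the check looks only at the start of the string rather than performing real syllable segmentation.

```python
from typing import Optional

RA = "\u09B0"      # BENGALI LETTER RA
YA = "\u09AF"      # BENGALI LETTER YA
HASANT = "\u09CD"  # BENGALI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# ZWJ immediately after ra: ra keeps its base shape, ya becomes ya-phalaa.
RA_YAPHALAA = RA + ZWJ + HASANT + YA
# No ZWJ: ra is rendered as a repha on a full ya.
REPHA_YA = RA + HASANT + YA


def requested_form(cluster: str) -> Optional[str]:
    """Name the form a ra + ya cluster asks for, or None for other input."""
    if cluster.startswith(RA_YAPHALAA):
        return "ya-phalaa"
    if cluster.startswith(REPHA_YA):
        return "repha"
    return None
```

Under this convention, the word for "rash" cited earlier would be spelled <ra, ZWJ, hasant, ya, aa, sha>, for which `requested_form` returns `"ya-phalaa"`.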
#Jihvamuliya and Upadhmaniya. In Bangla, the voiceless velar and bilabial fricatives are represented byU+1CF5ᳵVEDIC SIGN JIHVAMULIYA andU+1CF6ᳶVEDIC SIGN UPADHMANIYA, respectively. When the signs appear with a following homorganic voiceless stop consonant, they can be rendered in a font as a stacked ligature without a virama:
The sequences can also be represented linearly by inserting aU+200CZERO WIDTH NON-JOINER after thejihvamuliya orupadhmaniya, but before the following consonant:
Dependent vowel signs can also be added to the stack or linear sequence. Consonant clusters containingU+1CF5VEDIC SIGN JIHVAMULIYA andU+1CF6VEDIC SIGN UPADHMANIYA can occur with more than two consonants, such asẖkra andḫpra.
#Punctuation. Bangla uses punctuation marks shared across many Indic scripts, including the danda and double danda marks. In Bangla these are called the dahri and double dahri. For a description of these common punctuation marks, see Section 12.1, Devanagari.
#Truncation. The orthography of the Bangla language makes use ofU+02BC “ʼ ”MODIFIER LETTER APOSTROPHE to indicate the truncation of words. This sign is calledurdha-comma. Examples illustrating the use ofU+02BCMODIFIER LETTER APOSTROPHE are shown inTable 12-15.
Example | Meaning |
---|---|
W | after, on doing (something) |
X, Y | above |
#12.3 Gurmukhi
#12.3.1 Gurmukhi: U+0A00–U+0A7F
The Gurmukhi script is a North Indian script used to write the Punjabi (or Panjabi) language of the Punjab state of India. Gurmukhi, which literally means “proceeding from the mouth of the Guru,” is attributed to Angad, the second Sikh Guru (1504–1552CE). It is derived from an older script called Landa and is closely related to Devanagari structurally. The script is closely associated with Sikhs and Sikhism, but it is used on an everyday basis in East Punjab. (West Punjab, now in Pakistan, uses the Arabic script.)
#Encoding Principles. The Gurmukhi block is based on ISCII-1988, which makes it parallel to Devanagari. Gurmukhi, however, has a number of peculiarities described here.
The additional consonants (called pairin bindi; literally, “with a dot in the foot,” in Punjabi) are primarily used to differentiate Urdu or Persian loan words. They include U+0A36 ਸ਼ GURMUKHI LETTER SHA and U+0A33 ਲ਼ GURMUKHI LETTER LLA, but do not include U+0A5C ੜ GURMUKHI LETTER RRA, which is genuinely Punjabi. For unification with the other scripts, ISCII-1991 considers rra to be equivalent to dda + nukta, but this decomposition is not considered in Unicode. At the same time, ISCII-1991 does not consider U+0A36 to be equivalent to <0A38, 0A3C>, or U+0A33 to be equivalent to <0A32, 0A3C>.
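The statement that rra has no dda + nukta decomposition in Unicode can be checked against the character database shipped with a programming language. The sketch below uses Python's `unicodedata` module; the helper name `nfd_code_points` is illustrative only. It also shows that Unicode's own data gives U+0A36 and U+0A33 canonical decompositions, independently of the ISCII-1991 treatment described above.

```python
import unicodedata

SHA = "\u0A36"  # GURMUKHI LETTER SHA
LLA = "\u0A33"  # GURMUKHI LETTER LLA
RRA = "\u0A5C"  # GURMUKHI LETTER RRA


def nfd_code_points(ch):
    """Return the NFD form of ch as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


for letter in (SHA, LLA, RRA):
    print(f"U+{ord(letter):04X} -> {nfd_code_points(letter)}")
```

A single-element result means the character is unchanged by canonical decomposition, which is the case for RRA.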
Two different marks can be associated with U+0902 ◌ं DEVANAGARI SIGN ANUSVARA: U+0A02 ◌ਂ GURMUKHI SIGN BINDI and U+0A70 ◌ੰ GURMUKHI TIPPI. Present practice is to use bindi only with the dependent and independent forms of the vowels aa, ii, ee, ai, oo, and au, and with the independent vowels u and uu; tippi is used in the other contexts. Older texts may depart from this requirement. ISCII-1991 uses only one encoding point for both marks.
U+0A71 ◌ੱ GURMUKHI ADDAK is a special sign to indicate that the following consonant is geminate. ISCII-1991 does not have a specific code point for addak and encodes it as a cluster. For example, the word ਪੱਗ pagg, “turban,” can be represented with the sequence <0A2A, 0A71, 0A17> (or <pa, addak, ga>) in Unicode, while in ISCII-1991 it would be <pa, ga, virama, ga>.
U+0A75◌ੵGURMUKHI SIGN YAKASH probably originated as a subjoined form ofU+0A2FਯGURMUKHI LETTER YA. However, because its usage is relatively rare and not entirely predictable, it is encoded as a separate character. Some modern fonts renderyakash with the glyph◌ੵ , which varies from the traditional shape found in the code charts. This character should occur after the consonant to which it attaches and before any vowel sign.
U+0A51◌ੑGURMUKHI SIGN UDAAT occurs in older texts and indicates a high tone. This character should occur after the consonant to which it attaches and before any vowel sign.
#Unusual Usage of Vowel Signs. In older texts, such as theSri Guru Granth Sahib (the Sikh holy book), one can find typographic clusters with a vowel sign attached to a vowel letter, or with two vowel signs attached to a consonant. The most common cases are◌ੁu attached toਓ, as inਓੁਮਾਹਾ and both the vowel signs◌ੋ and◌ੁ attached to a consonant, as inਗੋੁਬਿੰਦgoubinda; this is used to indicate the metrical shortening of /o/ or the lengthening of /u/ depending on the context. Other combinations are attested as well, such asਗ੍ਹਿਾਨghiana, represented by the sequence <U+0A17, U+0A4D, U+0A39, U+0A3F, U+0A3E, U+0A28>.
Because of the combining classes of the characters U+0A4B ◌ੋ GURMUKHI VOWEL SIGN OO and U+0A41 ◌ੁ GURMUKHI VOWEL SIGN U, the sequences <consonant, U+0A4B, U+0A41> and <consonant, U+0A41, U+0A4B> are not canonically equivalent. To avoid ambiguity in representation, the first sequence, with U+0A4B before U+0A41, should be used in such cases. More generally, when a consonant or independent vowel is modified by multiple vowel signs, the sequence of the vowel signs in the underlying representation of the text should be: left, top, bottom, right.
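The following sketch illustrates both points: normalization does not reorder the two vowel signs, because each has canonical combining class 0, and a small helper can put a run of vowel signs into the left, top, bottom, right order. The position table is an assumption based on where each sign is usually drawn relative to its base; it is not data taken from the Unicode Character Database, and the function name is hypothetical.

```python
import unicodedata

GA = "\u0A17"            # GURMUKHI LETTER GA
VOWEL_SIGN_OO = "\u0A4B"
VOWEL_SIGN_U = "\u0A41"

# Assumed visual position of each dependent vowel sign:
# 0 = left, 1 = top, 2 = bottom, 3 = right.
POSITION_RANK = {
    "\u0A3F": 0,                                         # i
    "\u0A47": 1, "\u0A48": 1, "\u0A4B": 1, "\u0A4C": 1,  # ee, ai, oo, au
    "\u0A41": 2, "\u0A42": 2,                            # u, uu
    "\u0A3E": 3, "\u0A40": 3,                            # aa, ii
}


def order_vowel_signs(signs: str) -> str:
    """Sort a run of Gurmukhi vowel signs into left, top, bottom, right."""
    return "".join(sorted(signs, key=POSITION_RANK.__getitem__))


recommended = GA + VOWEL_SIGN_OO + VOWEL_SIGN_U
other = GA + VOWEL_SIGN_U + VOWEL_SIGN_OO

# Both signs have combining class 0, so NFC leaves their order alone
# and the two spellings stay distinct.
print(unicodedata.combining(VOWEL_SIGN_OO), unicodedata.combining(VOWEL_SIGN_U))
print(unicodedata.normalize("NFC", recommended) == unicodedata.normalize("NFC", other))
```

Since normalization cannot reconcile the two spellings, any reordering has to be done explicitly, for example with `GA + order_vowel_signs(VOWEL_SIGN_U + VOWEL_SIGN_OO)`.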
#Unusual Positioning of bindi. Typically, whenU+0A40◌ੀGURMUKHI VOWEL SIGN II andU+0A02◌ਂGURMUKHI SIGN BINDI coexist in an orthographic syllable, thebindi is encoded after and rendered on the right side of thevowel sign ii. In cases where a special left side placement of thebindi must be distinguished in encoding, thebindi can be encoded immediately preceding thevowel sign ii instead.
In particular, this encoding order also applies whenbindi must appear on top ofiri precedingvowel sign ii: <0A72iri, 0A02bindi, 0A40vowel sign ii>. This sequential encoding does not conflict with the “Do Not Use” instruction aboutU+0A08ਈGURMUKHI LETTER II inTable 12-16 because of thebindi inserted in between.
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts.Table 12-16 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ਆ | 0A06 | <0A05, 0A3E> |
ਇ | 0A07 | <0A72, 0A3F> |
ਈ | 0A08 | <0A72, 0A40> |
ਉ | 0A09 | <0A73, 0A41> |
ਊ | 0A0A | <0A73, 0A42> |
ਏ | 0A0F | <0A72, 0A47> |
ਐ | 0A10 | <0A05, 0A48> |
ਓ | 0A13 | <0A73, 0A4B> |
ਔ | 0A14 | <0A05, 0A4C> |
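The "Do Not Use" column of Table 12-16 lends itself to a simple lookup. The sketch below finds such sequences in a string and rewrites them to the atomic letters; the names `DISCOURAGED`, `find_discouraged`, and `to_atomic` are illustrative, and whether automatic rewriting is appropriate depends on the data being processed.

```python
# "Do Not Use" sequence -> atomic vowel letter, transcribed from Table 12-16.
DISCOURAGED = {
    "\u0A05\u0A3E": "\u0A06",  # AA
    "\u0A72\u0A3F": "\u0A07",  # I
    "\u0A72\u0A40": "\u0A08",  # II
    "\u0A73\u0A41": "\u0A09",  # U
    "\u0A73\u0A42": "\u0A0A",  # UU
    "\u0A72\u0A47": "\u0A0F",  # EE
    "\u0A05\u0A48": "\u0A10",  # AI
    "\u0A73\u0A4B": "\u0A13",  # OO
    "\u0A05\u0A4C": "\u0A14",  # AU
}


def find_discouraged(text):
    """Return (index, sequence) for each discouraged two-character sequence."""
    return [
        (i, text[i:i + 2])
        for i in range(len(text) - 1)
        if text[i:i + 2] in DISCOURAGED
    ]


def to_atomic(text):
    """Rewrite every discouraged sequence as its atomic vowel letter."""
    for sequence, letter in DISCOURAGED.items():
        text = text.replace(sequence, letter)
    return text
```

The sequence <0A72 iri, 0A02 bindi, 0A40 vowel sign ii> described earlier is not matched, because the bindi separates iri from the vowel sign.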
#Tones. The Punjabi language is tonal, but the Gurmukhi script does not contain any specific signs to indicate tones. Instead, the voiced aspirates (gha, jha, ddha, dha, bha) and the letter ha combine consonantal and tonal functions.
#Ordering.U+0A73ੳGURMUKHI URA andU+0A72ੲGURMUKHI IRI are the first and third “letters” of the Gurmukhi syllabary, respectively. They are used as bases or bearers for some of the independent vowels, whileU+0A05ਅGURMUKHI LETTER A is both the second “letter” and the base for the remaining independent vowels. As a result, the collation order for Gurmukhi is based on a seven-by-five grid:
- The first row is U+0A73ura, U+0A05a, U+0A72iri, U+0A38sa, U+0A39ha.
- This row is followed by five main rows of consonants, grouped according to the point of articulation, as is traditional in all South and Southeast Asian scripts.
- The semiconsonants follow in the seventh row: U+0A2Fya, U+0A30ra, U+0A32la, U+0A35va, U+0A5Crra.
- The letters withnukta, added later, are presented in a subsequent eighth row if needed.
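The grid described in the list above can be written out as data. In the sketch below, the first and seventh rows come directly from the text; the five middle rows are an assumption, filled in from the usual velar, palatal, retroflex, dental, labial arrangement of the consonants in the Gurmukhi block. The helper names are hypothetical, and this is only a letter-level ordering, not a full collation algorithm.

```python
ROWS = [
    "\u0A73\u0A05\u0A72\u0A38\u0A39",  # ura, a, iri, sa, ha
    "\u0A15\u0A16\u0A17\u0A18\u0A19",  # ka, kha, ga, gha, nga (assumed row)
    "\u0A1A\u0A1B\u0A1C\u0A1D\u0A1E",  # ca, cha, ja, jha, nya (assumed row)
    "\u0A1F\u0A20\u0A21\u0A22\u0A23",  # tta, ttha, dda, ddha, nna (assumed row)
    "\u0A24\u0A25\u0A26\u0A27\u0A28",  # ta, tha, da, dha, na (assumed row)
    "\u0A2A\u0A2B\u0A2C\u0A2D\u0A2E",  # pa, pha, ba, bha, ma (assumed row)
    "\u0A2F\u0A30\u0A32\u0A35\u0A5C",  # ya, ra, la, va, rra
]

# Letter -> position when the grid is read row by row.
GRID_ORDER = {letter: i for i, letter in enumerate("".join(ROWS))}


def grid_position(letter):
    """Return the zero-based (row, column) of a letter in the grid."""
    return divmod(GRID_ORDER[letter], 5)


def sort_letters(letters):
    """Sort single Gurmukhi letters by their position in the grid."""
    return sorted(letters, key=GRID_ORDER.__getitem__)
```

Sorting by code point alone would place U+0A05 a before U+0A73 ura; the grid order reverses that.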
#Rendering Behavior. For general principles regarding the rendering of the Gurmukhi script, see the rules for rendering in Section 12.1, Devanagari. In many aspects, Gurmukhi is simpler than Devanagari. In modern Punjabi, there are no half-consonants, no half-forms, no repha (upper form of U+0930 र DEVANAGARI LETTER RA), and no real ligatures. Rules R2–R5, R11, and R14 do not apply. Conversely, the behavior for subscript RA (rules R6–R8 and R13) applies to U+0A39 ਹ GURMUKHI LETTER HA and U+0A35 ਵ GURMUKHI LETTER VA, which also have subjoined forms, called pairin in Punjabi. The subjoined form for RA is like a knot, while the subjoined HA and VA are written the same as the base form, without the top bar, but are reduced in size. As described in rule R13, they attach at the bottom of the base consonant, and will “push” down any attached vowel sign for U or UU. When U+0A2F ਯ GURMUKHI LETTER YA follows a dead consonant, it assumes a different form called addha in Punjabi, without the leftmost part, and the dead consonant returns to the nominal form, as shown in Table 12-17.
ਮ | + | ◌੍ | + | ਹ | → | ਮ੍ਹ | (mha) | pairin ha |
ਪ | + | ◌੍ | + | ਰ | → | ਪ੍ਰ | (pra) | pairin ra |
ਦ | + | ◌੍ | + | ਵ | → | ਦ੍ਵ | (dva) | pairin va |
ਦ | + | ◌੍ | + | ਯ | → | ਦ੍ਯ | (dya) | addha ya |
Other letters behaved similarly in old inscriptions, as shown inTable 12-18.
ਸ | + | ◌੍ | + | ਗ | → | ਸ੍ਗ | (sga) | pairin ga |
ਸ | + | ◌੍ | + | ਚ | → | ਸ੍ਚ | (sca) | pairin ca |
ਸ | + | ◌੍ | + | ਟ | → | ਸ੍ਟ | (stta) | pairin tta |
ਸ | + | ◌੍ | + | ਠ | → | ਸ੍ਠ | (sttha) | pairin ttha |
ਸ | + | ◌੍ | + | ਤ | → | ਸ੍ਤ | (sta) | pairin ta |
ਸ | + | ◌੍ | + | ਦ | → | ਸ੍ਦ | (sda) | pairin da |
ਸ | + | ◌੍ | + | ਨ | → | ਸ੍ਨ | (sna) | pairin na |
ਸ | + | ◌੍ | + | ਥ | → | ਸ੍ਥ | (stha) | pairin tha |
ਸ | + | ◌੍ | + | ਯ | → | ਸ੍ਯ | (sya) | pairin ya |
ਸ | + | ◌੍ | + | ਥ | → | ਸ੍ਥ | (stha) | addha tha |
ਸ | + | ◌੍ | + | ਮ | → | ਸ੍ਮ | (sma) | addha ma |
Older texts also exhibit another feature that is not found in modern Gurmukhi—namely, the use of a half- or reduced form for the first consonant of a cluster, whereas the modern practice is to represent the second consonant in a half- or reduced form. Joiners can be used to request this older rendering, as shown inTable 12-19. The reduced form of an initialU+0A30ਰGURMUKHI LETTER RA is similar to the Devanagari superscript RA (repha), but this usage is rare, even in older texts.
ਸ | + | ◌੍ | + | ਵ | → | ਸ੍ਵ | (sva) |
ਰ | + | ◌੍ | + | ਵ | → | ਰ੍ਵ | (rva) |
ਸ | + | ◌੍ | + | ZWJ | + | ਵ | → | ਸ੍ਵ | (sva) |
ਰ | + | ◌੍ | + | ZWJ | + | ਵ | → | ਰ੍ਵ | (rva) |
ਸ | + | ◌੍ | + | ZWNJ | + | ਵ | → | ਸ੍ਵ | (sva) |
ਰ | + | ◌੍ | + | ZWNJ | + | ਵ | → | ਰ੍ਵ | (rva) |
A rendering engine for Gurmukhi should make accommodations for the correct positioning of the combining marks (seeSection 5.13, Rendering Nonspacing Marks, and particularlyFigure 5-11). This is important, for example, in the correct centering of the marks above and belowU+0A28ਨGURMUKHI LETTER NA andU+0A20ਠGURMUKHI LETTER TTHA, which are laterally symmetrical. It is also important to avoid collisions between the various upper marks, vowel signs,bindi, and/oraddak.
#Other Symbols. The religious symbol khanda sometimes used in Gurmukhi texts is encoded at U+262C ☬ ADI SHAKTI in the Miscellaneous Symbols block. U+0A74 ੴ GURMUKHI EK ONKAR, which is also a religious symbol, can have different presentation forms, which do not change its meaning. The representative glyph shown in the code charts is a simple form that looks like the digit one, followed by a sign based on ura, along with a long upper tail; other forms may be highly stylized.
#Punctuation. Danda and double danda marks as well as some other unified punctuation used with Gurmukhi are found in the Devanagari block. SeeSection 12.1, Devanagari, for more information. Punjabi also uses Latin punctuation.
#12.4 Gujarati
#12.4.1 Gujarati: U+0A80–U+0AFF
The Gujarati script is a North Indian script closely related to Devanagari. It is most obviously distinguished from Devanagari by not having a horizontal bar for its letterforms, a characteristic of the older Kaithi script to which Gujarati is related. The Gujarati script is used to write the Gujarati language of the Gujarat state in India.
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts.Table 12-20 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
આ | 0A86 | <0A85, 0ABE> |
ઍ | 0A8D | <0A85, 0AC5> |
એ | 0A8F | <0A85, 0AC7> |
ઐ | 0A90 | <0A85, 0AC8> |
ઑ | 0A91 | <0A85, 0AC9> |
ઓ | 0A93 | <0A85, 0ACB> or <0A85, 0ABE, 0AC5> |
ઔ | 0A94 | <0A85, 0ACC> or <0A85, 0ABE, 0AC8> |
ૉ | 0AC9 | <0AC5, 0ABE> |
#Rendering Behavior. For rendering of the Gujarati script, see the rules for rendering inSection 12.1, Devanagari. Like other Brahmic scripts in the Unicode Standard, Gujarati uses the virama to form conjunct characters. The virama is informally calledkhoḍo, which means “lame” in Gujarati. Many conjunct characters, as in Devanagari, lose the vertical stroke; there are also vertical conjuncts.U+0AB0GUJARATI LETTER RA takes special forms when it combines with other consonants, as shown inTable 12-21.
ક | + | ્ | + | ષ | → | ક્ષ | (kṣa) |
જ | + | ્ | + | ઞ | → | જ્ઞ | (jña) |
ત | + | ્ | + | ય | → | ત્ય | (tya) |
ટ | + | ્ | + | ટ | → | ટ્ટ | (ṭṭa) |
ર | + | ્ | + | ક | → | ર્ક | (rka) |
ક | + | ્ | + | ર | → | ક્ર | (kra) |
#Marks for Transliteration of Arabic. The combining marks encoded in the range U+0AFA..U+0AFF are used for the transliteration of the Arabic script into Gujarati. This system of transliteration was devised in the late 19th century, and is used by Ismaili Khoja communities. These marks occur both in manuscripts and in printed materials.
The three forms of nukta encoded in the range U+0AFD..U+0AFF are diacritics, placed above regular Gujarati letters to create new letters corresponding to Arabic letters for non-Gujarati sounds. U+0AFF GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE is used only with U+0A9D GUJARATI LETTER JHA, to transliterate the Arabic zah. U+0AFE GUJARATI SIGN CIRCLE NUKTA ABOVE is used with U+0A9D GUJARATI LETTER JHA to transliterate the Arabic thal and with U+0AB8 GUJARATI LETTER SA to transliterate the Arabic theh. U+0AFD GUJARATI SIGN THREE-DOT NUKTA ABOVE occurs with a number of different Gujarati letters, to transliterate a variety of Arabic letters.
U+0AFAGUJARATI SIGN SUKUN,U+0AFBGUJARATI SIGN SHADDA, andU+0AFCGUJARATI SIGN MADDAH are used to transliterate the Arabicsukun,shadda, andmaddah above, respectively. These marks may be applied to a Gujarati letter which also uses one of the three above-basenukta diacritic marks. In such cases, thenukta occurs first in the combining sequence, followed by thesukun,shadda, ormaddah mark. However, instead of being rendered above thenukta mark on the letter, thesukun,shadda, ormaddah mark is rendered to the left of thenukta mark.
#Punctuation. Words in Gujarati are separated by spaces. Danda and double danda marks as well as some other unified punctuation used with Gujarati are found in the Devanagari block; seeSection 12.1, Devanagari.
#12.5 Oriya (Odia)
#12.5.1 Oriya: U+0B00–U+0B7F
The Oriya script is used to write the Odia language of the Odisha (Orissa) state in India, as well as minority languages such as Khondi and Santali.
Languages and scripts can be referred to in many different ways, and these terms may evolve over time. The Oriya script is an example of this: The preferred Latin transcription used in India for this script has shifted to the spelling Odia (as shown, for example, by changes to the Indian constitution). The Unicode Standard retains the traditional English spelling Oriya in discussion, to minimize the potential for confusion when referring to immutable, standardized character names in the standard, which were assigned long ago.
#Special Characters. U+0B57 ORIYA AU LENGTH MARK is provided as an encoding for the right side of the surrounding vowel U+0B4C ORIYA VOWEL SIGN AU.
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts.Table 12-22 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ଆ | 0B06 | <0B05, 0B3E> |
ଐ | 0B10 | <0B0F, 0B57> |
ଔ | 0B14 | <0B13, 0B57> |
#Rendering Behavior. For rendering of the Oriya script, see the rules for rendering inSection 12.1, Devanagari. Like other Brahmic scripts in the Unicode Standard, Oriya uses the virama to suppress the inherent vowel. Oriya has a visible virama, often being a lengthening of a part of the base consonant:
କ + ୍ →କ୍ (k)
The virama is also used to form conjunct consonants, as shown inTable 12-23.
କ | + | ◌୍ | + | ଷ | → | କ୍ଷ | (kṣa) |
କ | + | ◌୍ | + | ତ | → | କ୍ତ | (kta) |
ତ | + | ◌୍ | + | କ | → | ତ୍କ | (tka) |
ତ | + | ◌୍ | + | ୟ | → | ତ୍ୟ | (tya) |
#Consonant Forms. In the initial position in a cluster, RA is reduced and placed above the following consonant, while it is also reduced in the second position:
ର + ୍ +ପ →ର୍ପ (rpa)
ପ + ୍ +ର →ପ୍ର (pra)
Nasal and stop clusters may be written with conjuncts, or the anusvara may be used:
ଅ +ଙ + ୍ +କ →ଅଙ୍କ (aṅka)
ଅ + ଂ +କ →ଅଂକ (aṁka)
#Vowels. As with other scripts, some dependent vowels are rendered in front of their consonant, some appear after it, and some are placed above or below it. Some are rendered with parts both in front of and after their consonant. A few of the dependent vowels fuse with their consonants.U+0B01ORIYA SIGN CANDRABINDU is used for nasal vowels. SeeTable 12-24.
କ | + | ା | → | କା | (kā) |
କ | + | ି | → | କି | (ki) |
କ | + | ୀ | → | କୀ | (kī) |
କ | + | ୁ | → | କୁ | (ku) |
କ | + | ୂ | → | କୂ | (kū) |
କ | + | ୃ | → | କୃ | (kṛ) |
କ | + | େ | → | କେ | (ke) |
କ | + | ୈ | → | କୈ | (kai) |
କ | + | ୋ | → | କୋ | (ko) |
କ | + | ୌ | → | କୌ | (kau) |
କ | + | ଁ | → | କଁ | (kaṁ) |
An orthography for the Kuvi language makes use of a macron-shaped length mark. It is displayed directly above written forms of the following three vowels to indicate their corresponding long vowels:
- [o] vowel letter a, or inherent vowel implied by consonant letters and conjuncts
- [a] vowel letter or sign aa
- [e] vowel letter or sign e
This length mark is represented in text byU+0B55ORIYA SIGN OVERLINE. It occurs in the text representation directly after the letter or sign it modifies, and after anynukta which is present.
#Oriya VA and WA. These two letters are extensions to the basic Oriya alphabet. Because Sanskritवनvana becomes Oriyaବନbana in orthography and pronunciation, an extended letter U+0B35ଵORIYA LETTER VA was devised by dotting U+0B2CବORIYA LETTER BA for use in academic and technical text. For example, basic Oriya script cannot distinguish Sanskritबवbava fromबबbaba orववvava, but this distinction can be made with the modified version ofba. In some older sources, the glyphଵ is sometimes found forva; in others,ଵ andଵ have been shown, which in a more modern type style would beଵ. The letterva is not in common use today.
In a consonant conjunct, subjoined U+0B2CବORIYA LETTER BA is usually—but not always—pronounced [wa]:
U+0B15 କ ka + U+0B4D ୍ virama + U+0B2C ବ ba → କ୍ବ [kwa]
U+0B2E ମ ma + U+0B4D ୍ virama + U+0B2C ବ ba → ମ୍ବ [mba]
The extended Oriya letter U+0B71ୱORIYA LETTER WA is sometimes used in Perso-Arabic or English loan words for [w]. It appears to have originally been devised as a ligature ofଓo andବba, but because ligatures of independent vowels and consonants are not normally used in Oriya, this letter has been encoded as a single character that does not have a decomposition. It is used initially in words or orthographic syllables to represent the foreign consonant; as a native semivowel,virama + ba is used because that is historically accurate. Glyph variants ofwa areୱ,ୱ, andଓବ.
#Punctuation and Symbols. Danda and double danda marks as well as some other unified punctuation used with Oriya are found in the Devanagari block; seeSection 12.1, Devanagari. The markU+0B70ORIYA ISSHAR is placed before names of persons who are deceased.
The sacred syllableom is formed byU+0B13ORIYA LETTER O andU+0B01ORIYA SIGN CANDRABINDU. Ligation of the two glyphs can be encouraged or discouraged by the use ofU+200DZERO WIDTH JOINER orU+200CZERO WIDTH NON-JOINER between the two characters, as seen inTable 12-25. In the absence of a joiner, both the non-ligated and the ligated forms are acceptable renderings.
ଓ | + | | + | ଁ | → | ଓଁ orଓଁ |
ଓ | + | | + | ଁ | → | ଓଁ |
#Fraction Characters. As for many other scripts of India, Oriya has characters used to denote fractional values. These were more commonly used before the advent of decimal weights, measures, and currencies. Oriya uses six signs: three for quarter values (1/4, 1/2, 3/4) and three for sixteenth values (1/16, 1/8, and 3/16). These are used additively, with quarter values appearing before sixteenths. Thus U+0B73 ORIYA FRACTION ONE HALF followed by U+0B75 ORIYA FRACTION ONE SIXTEENTH represents the value 9/16 (1/2 + 1/16).
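The additive reading of the fraction signs can be sketched in a few lines. The mapping below assumes the six signs occupy U+0B72..U+0B77 in the order one quarter, one half, three quarters, one sixteenth, one eighth, three sixteenths; only U+0B73 and U+0B75 are named in the text above, so the other four assignments should be checked against the code charts. The function name is hypothetical.

```python
from fractions import Fraction

QUARTERS = {
    "\u0B72": Fraction(1, 4),
    "\u0B73": Fraction(1, 2),
    "\u0B74": Fraction(3, 4),
}
SIXTEENTHS = {
    "\u0B75": Fraction(1, 16),
    "\u0B76": Fraction(1, 8),
    "\u0B77": Fraction(3, 16),
}


def fraction_value(signs):
    """Add up a run of Oriya fraction signs, quarter values first."""
    total = Fraction(0)
    seen_sixteenth = False
    for sign in signs:
        if sign in SIXTEENTHS:
            seen_sixteenth = True
            total += SIXTEENTHS[sign]
        elif sign in QUARTERS:
            if seen_sixteenth:
                raise ValueError("quarter sign after a sixteenth sign")
            total += QUARTERS[sign]
        else:
            raise ValueError(f"not an Oriya fraction sign: U+{ord(sign):04X}")
    return total
```

For the half sign followed by the one-sixteenth sign, `fraction_value("\u0B73\u0B75")` adds 1/2 and 1/16 and returns `Fraction(9, 16)`.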
#12.6 Tamil
#12.6.1 Tamil: U+0B80–U+0BFF
The Tamil script is descended from the South Indian branch of Brahmi. It is used to write the Tamil language of the Tamil Nadu state in India as well as minority languages such as Irula, the Dravidian language Badaga, and the Indo-European language Saurashtra. Tamil is also used in Sri Lanka, Singapore, and parts of Malaysia.
The Tamil script has fewer consonants than the other Indic scripts. When representing the “missing” consonants in transcriptions of languages such as Sanskrit or Saurashtra, superscript European digits are often used, soப² =pha,ப³ =ba, andப⁴ =bha. The charactersU+00B2,U+00B3, andU+2074 can be used to preserve this distinction in plain text. The Grantha script is often also used by Tamil speakers to write Sanskrit because Grantha contains these missing consonants.
The Tamil script also avoids the use of conjunct consonant forms, although a few conventional conjuncts are used.
#Virama (Puḷḷi). Because the Tamil encoding in the Unicode Standard is based on ISCII-1988 (Indian Script Code for Information Interchange), it makes use of the abugida model. An abugida treats the basic consonants as containing an inherent vowel, which can be canceled by the use of a visible mark, called a virama in Sanskrit. In most Brahmi-derived scripts, the placement of a virama between two consonants implies the deletion of the inherent vowel of the first consonant and causes a conjoined or subjoined consonant cluster. In those scripts, U+200C ZERO WIDTH NON-JOINER is used to display a visible virama, as shown previously in the Devanagari example in Figure 12-4.
The situation is quite different for Tamil because the script uses very few consonant conjuncts. An orthographic cluster consisting of multiple consonants (represented by <C1, U+0BCD ◌் TAMIL SIGN VIRAMA, C2, …>) is normally displayed with explicit viramas, which are called puḷḷi in Tamil. The puḷḷi is typically rendered as a dot centered above the character. It occasionally appears as a small circle instead of a dot, but this glyph variant should be handled by the font, and not be represented by the similar-appearing U+0B82 ◌ஂ TAMIL SIGN ANUSVARA.
The conjuncts kssa and shrii are traditionally displayed by conjunct ligatures, as illustrated for kssa in Figure 12-15, but nowadays tend to be displayed using an explicit puḷḷi as well.
க +◌் +ஷ →க்ஷkṣa |
To explicitly display a puḷḷi for such sequences, U+200C ZERO WIDTH NON-JOINER can be inserted after the puḷḷi in the sequence of characters.
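As a rough illustration of the ZWNJ insertion, the sketch below rewrites each <ka, puḷḷi, ssa> sequence so that it requests an explicit puḷḷi. The function name and its parameters are hypothetical; how a given font and rendering engine respond to the ZWNJ is outside the scope of this snippet.

```python
KA = "\u0B95"     # TAMIL LETTER KA
SSA = "\u0BB7"    # TAMIL LETTER SSA
PULLI = "\u0BCD"  # TAMIL SIGN VIRAMA
ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER


def with_explicit_pulli(text, first=KA, second=SSA):
    """Insert ZWNJ after the pulli in every <first, pulli, second> sequence."""
    return text.replace(first + PULLI + second, first + PULLI + ZWNJ + second)
```

Applying the function twice has no further effect, because the rewritten sequence no longer matches the search pattern.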
#Rendering of the Tamil Script. The Tamil script is complex and requires special rules for rendering. The following discussion describes the most important features of Tamil rendering behavior. As with any script, a more complex procedure can add rendering characteristics, depending on the font and application.
In a font that is capable of rendering Tamil, the number of glyphs is greater than the number of Tamil characters.
#12.6.2 Tamil Vowels
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts.Table 12-26 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ஆ | 0B86 | <0B85, 0BC2> |
#Independent Versus Dependent Vowels. In the Tamil script, the dependent vowel signs are not equivalent to a sequence of virama + independent vowel. For example:
ன +◌ி ≠ன +◌் +இ |
#Left-Side Vowels. The Tamil vowels U+0BC6 ◌ெ, U+0BC7 ◌ே, and U+0BC8 ◌ை are reordered in front of the consonant to which they are applied. When occurring in a syllable, these vowels are rendered to the left side of their consonant, as shown in Figure 12-16.
க | + | ◌ெ | → | கெ |
க | + | ◌ே | → | கே |
க | + | ◌ை | → | கை |
#Two-Part Vowels. Tamil also has several vowels that consist of elements which flank the consonant to which they are applied. A sequence of two Unicode code points can be used to express equivalent spellings for these vowels, as shown inFigure 12-17.
0BCA◌ொ | ≡ | 0BC6◌ெ +0BBE◌ா |
0BCB◌ோ | ≡ | 0BC7◌ே +0BBE◌ா |
0BCC◌ௌ | ≡ | 0BC6◌ெ +0BD7◌ௗ |
In these examples, the representation on the left, which is a single code point, is the preferred form and the form in common use for Tamil.
In the process of rendering, these two-part vowels are transformed into the two separate glyphs equivalent to those on the right, which are then subject to vowel reordering, as shown inFigure 12-18.
க | + | ◌ொ | → | கொ | ||
க | + | ◌ெ | + | ◌ா | → | கொ |
க | + | ◌ோ | → | கோ | ||
க | + | ◌ே | + | ◌ா | → | கோ |
க | + | ◌ௌ | → | கௌ | ||
க | + | ◌ெ | + | ◌ௗ | → | கௌ |
Even in the case where a two-part vowel occurs with a conjunct consonant or consonant cluster, the left part of the vowel is reordered around the conjunct or cluster, as shown inFigure 12-19.
க +◌் +ஷ +◌ே +◌ா →க்ஷோkṣō |
For either left-side vowels or two-part vowels, the ordering of the elements is unambiguous: the consonant or consonant cluster occurs first in the memory representation, followed by the vowel.
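The equivalences for the two-part vowels are canonical, so they can be observed through normalization, and the reordering step can be mimicked on a single cluster. The sketch below uses Python's `unicodedata` module; `visual_order` is a hypothetical helper that assumes its input is exactly one orthographic cluster, and it only imitates the glyph order a renderer would produce without changing how text should be stored.

```python
import unicodedata

# Single code point -> equivalent two-part spelling.
TWO_PART = {
    "\u0BCA": "\u0BC6\u0BBE",  # vowel sign o
    "\u0BCB": "\u0BC7\u0BBE",  # vowel sign oo
    "\u0BCC": "\u0BC6\u0BD7",  # vowel sign au
}

LEFT_SIDE = {"\u0BC6", "\u0BC7", "\u0BC8"}  # signs drawn left of the consonant


def visual_order(cluster):
    """Split two-part vowels, then move left-side parts before the cluster."""
    decomposed = unicodedata.normalize("NFD", cluster)
    left = [c for c in decomposed if c in LEFT_SIDE]
    rest = [c for c in decomposed if c not in LEFT_SIDE]
    return "".join(left + rest)


for single, pair in TWO_PART.items():
    # NFD splits the vowel; NFC recombines it into the preferred single form.
    print(unicodedata.normalize("NFD", single) == pair,
          unicodedata.normalize("NFC", pair) == single)
```

The memory representation always keeps the consonant or cluster first; only the output of `visual_order` places the left part in front.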
#Confusable Vowels. U+0B94 TAMIL LETTER AU and U+0BCC TAMIL VOWEL SIGN AU are visually indistinguishable from two semantically unrelated sequences, as shown in Figure 12-20. In the decompositions of these two vowel characters, the rightmost part is represented as the character U+0BD7 TAMIL AU LENGTH MARK, which looks exactly like the separate character, U+0BB3 TAMIL LETTER LLA.
0B94ஔ | ≡ | 0B92ஒ +0BD7◌ௗ | ≠ | 0B92ஒ +0BB3ள |
0BCC◌ௌ | ≡ | 0BC6◌ெ +0BD7◌ௗ | ≠ | 0BC6◌ெ +0BB3ள |
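A process that audits Tamil text might flag places where the look-alike sequence with LLA appears. The following is a heuristic sketch only: both flagged sequences also occur in ordinary words, so a hit calls for review rather than automatic replacement. As a further assumption, a LLA that carries its own combining mark (a vowel sign or puḷḷi) is treated as a genuine consonant and skipped. The names are hypothetical.

```python
import unicodedata

LLA = "\u0BB3"  # TAMIL LETTER LLA, looks like U+0BD7 TAMIL AU LENGTH MARK

# Look-alike sequence -> the character it may have been typed in place of.
LOOKALIKES = {
    "\u0B92" + LLA: "\u0B94",  # letter O + LLA vs. TAMIL LETTER AU
    "\u0BC6" + LLA: "\u0BCC",  # vowel sign E + LLA vs. TAMIL VOWEL SIGN AU
}


def find_au_lookalikes(text):
    """Return (index, sequence, candidate) for each suspicious pair."""
    hits = []
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        if pair not in LOOKALIKES:
            continue
        following = text[i + 2:i + 3]
        # A LLA with its own mark is acting as a real consonant.
        if following and unicodedata.category(following).startswith("M"):
            continue
        hits.append((i, pair, LOOKALIKES[pair]))
    return hits
```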
#12.6.3 Tamil Ligatures
A number of ligatures are conventionally used in Tamil. Most ligatures involve the shape taken by a consonant plus vowel sequence. A wide variety of modern Tamil words are written without a conjunct form, with a fully visiblepuḷḷi.
#Ligatures with Vowel i. The vowel signsi◌ி andii◌ீ form ligatures with the consonantttaட as shown in examples 1 and 2 ofFigure 12-21. These vowels often change shape or position slightly so as to join cursively with other consonants, as shown in examples 3 and 4 ofFigure 12-21.
1 | ட | + | ◌ி | → | டி | ṭi |
2 | ட | + | ◌ீ | → | டீ | ṭī |
3 | ல | + | ◌ி | → | லி | li |
4 | ல | + | ◌ீ | → | லீ | lī |
#Ligatures with Vowel u. The vowel signs u ◌ு and uu ◌ூ normally ligate with their consonant, as shown in Table 12-27. In the first column, the basic consonant is shown; the second column illustrates the ligation of that consonant with the u vowel sign; and the third column illustrates the ligation with the uu vowel sign.
x | x +◌ு | x +◌ூ |
---|---|---|
க | கு | கூ |
ங | ஙு | ஙூ |
ச | சு | சூ |
ஞ | ஞு | ஞூ |
ட | டு | டூ |
ண | ணு | ணூ |
த | து | தூ |
ந | நு | நூ |
ன | னு | னூ |
ப | பு | பூ |
ம | மு | மூ |
ய | யு | யூ |
ர | ரு | ரூ |
ற | று | றூ |
ல | லு | லூ |
ள | ளு | ளூ |
ழ | ழு | ழூ |
வ | வு | வூ |
With certain consonants, ஜ, ஷ, ஸ, ஹ, and the conjunct க்ஷ, the vowel signs u ◌ு and uu ◌ூ take a distinct spacing form, as shown in Figure 12-22.
ஜ | + | ◌ு | → | ஜு | ju |
ஜ | + | ◌ூ | → | ஜூ | jū |
#Ligatures with ra. Based on typographical preferences, the consonant ra ர may change shape to ர when it ligates. Such change, if it occurs, will happen only when the ர form of U+0BB0 ர TAMIL LETTER RA would not be confused with the nominal form ா of U+0BBE TAMIL VOWEL SIGN AA (namely, when ர is combined with ◌், ◌ி, or ◌ீ). This change in shape is illustrated in Figure 12-23.
ர | + | ◌் | → | ர் | r |
ர | + | ◌ி | → | ரி | ri |
ர | + | ◌ீ | → | ரீ | rī |
However, various governmental bodies mandate that the basic shape of the consonant ra ர should be used for these ligatures as well, especially in school textbooks. Media and literary publications in Malaysia and Singapore mostly use the unchanged form of ra ர. Sri Lanka, on the other hand, specifies the use of the changed forms shown in Figure 12-23.
#Tamil Ligature shri. Prior to Unicode 4.1, the best mapping to represent the ligature shri was to the sequence <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Unicode 4.1 in 2005 added the character U+0BB6 TAMIL LETTER SHA and as a consequence, the best mapping became <U+0BB6, U+0BCD, U+0BB0, U+0BC0>. Due to slow updates to implementations, both representations are widespread in existing text. Therefore, treating both representations as equivalent sequences is recommended. Figure 12-24 shows the two sequences.
ஸ | + | ◌் | + | ர | + | ◌ீ | → | ஸ்ரீ |
ஶ | + | ◌் | + | ர | + | ◌ீ | → | ஶ்ரீ |
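One simple way to treat the two *shri* representations as equivalent is to fold the older sequence into the newer one before comparing or indexing text. The sketch below assumes that folding toward the U+0BB6 form is acceptable for the application; `fold_shri` and `shri_equal` are illustrative names.

```python
# Pre-4.1 mapping, starting with U+0BB8 TAMIL LETTER SA.
SHRI_WITH_SA = "\u0BB8\u0BCD\u0BB0\u0BC0"
# Mapping since Unicode 4.1, starting with U+0BB6 TAMIL LETTER SHA.
SHRI_WITH_SHA = "\u0BB6\u0BCD\u0BB0\u0BC0"

def fold_shri(text: str) -> str:
    """Rewrite the older shri sequence to the newer one."""
    return text.replace(SHRI_WITH_SA, SHRI_WITH_SHA)

def shri_equal(a: str, b: str) -> bool:
    """Compare two strings while treating both shri sequences as equivalent."""
    return fold_shri(a) == fold_shri(b)

assert shri_equal(SHRI_WITH_SA, SHRI_WITH_SHA)
```

Folding is applied only for comparison here; whether stored text should also be rewritten is a separate decision.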
#Ligatures with aa in Traditional Tamil Orthography. In traditional Tamil orthography, the vowel sign aa ◌ா optionally ligates with ண, ன, or ற, as illustrated in Figure 12-25.
ண | + | ◌ா | → | ணா | ṇā |
ன | + | ◌ா | → | னா | ṉā |
ற | + | ◌ா | → | றா | ṟā |
These ligations also affect the right-hand part of two-part vowels, as shown in Figure 12-26.
ண | + | ◌ொ | → | ணொ | ṇo |
ண | + | ◌ோ | → | ணோ | ṇō |
ன | + | ◌ொ | → | னொ | ṉo |
ன | + | ◌ோ | → | னோ | ṉō |
ற | + | ◌ொ | → | றொ | ṟo |
ற | + | ◌ோ | → | றோ | ṟō |
#Ligatures with ai in Traditional Tamil Orthography. In traditional Tamil orthography, the left-side vowel sign ai ◌ை is also subject to a change in form. It is rendered as ◌ை when it occurs on the left side of ண, ன, ல, or ள, as illustrated in Figure 12-27.
ண | + | ◌ை | → | ணை | ṇai |
ன | + | ◌ை | → | னை | ṉai |
ல | + | ◌ை | → | லை | lai |
ள | + | ◌ை | → | ளை | ḷai |
By contrast, in modern Tamil orthography, this vowel does not change its shape, as shown in Figure 12-28.
ண + ◌ை → ணை ṇai |
#Tamil aytham. The character U+0B83 TAMIL SIGN VISARGA is normally called aytham in Tamil. It is historically related to the visarga in other Indic scripts, but has become an ordinary spacing letter in Tamil. The aytham occurs in native Tamil words, but is frequently used as a modifying prefix before consonants used to represent foreign sounds. In particular, it is used in the spelling of words borrowed into Tamil from English or other languages.
#Punctuation. Danda and double danda marks as well as some other unified punctuation used with Tamil are found in the Devanagari block; see Section 12.1, Devanagari.
#Numbers. Modern Tamil decimal digits are encoded at U+0BE6..U+0BEF. Note that some digits are confusable with letters, as shown in Table 12-28. In some Tamil fonts, the digits for two and eight look exactly like the letters u and a, respectively. In other fonts, as shown here, the shapes for the digits two and eight are adjusted to minimize confusability.
U+0BE7 ௧ TAMIL DIGIT ONE | U+0B95 க TAMIL LETTER KA |
U+0BE8 ௨ TAMIL DIGIT TWO | U+0B89 உ TAMIL LETTER U |
U+0BED ௭ TAMIL DIGIT SEVEN | U+0B8E எ TAMIL LETTER E |
U+0BEE ௮ TAMIL DIGIT EIGHT | U+0B85 அ TAMIL LETTER A |
Tamil also has distinct numerals for ten, one hundred, and one thousand at U+0BF0..U+0BF2 used for historical numbers.
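Although a digit and a letter may look alike, their character properties differ, which lets software tell them apart. The sketch below uses Python's `unicodedata` module; the helper `describe` is an illustrative name, and the values it reports come from whatever version of the Unicode Character Database ships with the interpreter.

```python
import unicodedata

# Digit/letter pairs listed as confusable in the table above.
CONFUSABLE_PAIRS = [
    ("\u0BE7", "\u0B95"),  # DIGIT ONE   / LETTER KA
    ("\u0BE8", "\u0B89"),  # DIGIT TWO   / LETTER U
    ("\u0BED", "\u0B8E"),  # DIGIT SEVEN / LETTER E
    ("\u0BEE", "\u0B85"),  # DIGIT EIGHT / LETTER A
]

def describe(ch: str):
    """Return (general category, numeric value or None) for a character."""
    return unicodedata.category(ch), unicodedata.numeric(ch, None)

for digit, letter in CONFUSABLE_PAIRS:
    assert describe(digit)[0] == "Nd"        # decimal digit
    assert describe(letter) == ("Lo", None)  # letter, no numeric value

# The historical numerals for ten, one hundred, and one thousand.
historic = [unicodedata.numeric(chr(cp)) for cp in range(0x0BF0, 0x0BF3)]
```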
#Use of Nukta. In addition to Tamil, several other languages of southern India, such as Irula, are written using the Tamil script. Some of these languages contain sounds distinct from those normally written for the Tamil language. In such cases, the writing systems of these languages apply diacritic nukta marks to Tamil letters to represent their distinct sounds. For example, Irula uses a double dot nukta below, represented with U+1133C GRANTHA SIGN NUKTA, and Badaga uses a single dot nukta, represented by U+1133B COMBINING BINDU BELOW, for some sounds.
#12.6.4 Tamil Supplement: U+11FC0–U+11FFF
The Tamil Supplement block contains a set of fractions in the range U+11FC0..U+11FD4 used for generic measurement and calculations and for money. The block also includes symbols indicating various forms of measurement, old units of currency, agricultural and clerical signs, and other miscellaneous abbreviations. Most characters in this block are no longer in use, but a few appear in traditional contexts, such as on marriage invitations printed in a traditional format.
#12.6.5 Tamil Named Character Sequences
Tamil is less complex than some of the other Indic scripts, and both conceptually and in processing can be treated as an atomic set of elements: consonants, stand-alone vowels, and syllables. Table 12-29 shows these atomic elements, with the corresponding Unicode characters or sequences. In cases where the atomic elements for Tamil correspond to sequences of Unicode characters, those sequences have been added to the approved list of Unicode named character sequences. See NamedSequences.txt in the Unicode Character Database for details.
In implementations such as natural language processing, where it may be useful to treat such Tamil text elements as single code points for ease of processing, Tamil named character sequences could be mapped to code points in a contiguous segment of the Private Use Area.
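The Private Use Area mapping suggested above could be sketched as follows. Everything here is an assumption made for illustration: the starting code point U+E000, the function names, and the two sample sequences; a real implementation would load the full list from NamedSequences.txt.

```python
import re

PUA_START = 0xE000  # assumption: start of the BMP Private Use Area

def build_pua_table(sequences):
    """Assign one private-use code point to each distinct sequence."""
    ordered = sorted(set(sequences))
    return {seq: chr(PUA_START + i) for i, seq in enumerate(ordered)}

def encode(text, table):
    """Replace each known sequence with its stand-in, longest match first."""
    if not table:
        return text
    alternatives = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in alternatives))
    return pattern.sub(lambda m: table[m.group(0)], text)

def decode(text, table):
    """Restore the original sequences from their private-use stand-ins."""
    reverse = {pua: seq for seq, pua in table.items()}
    return "".join(reverse.get(ch, ch) for ch in text)

table = build_pua_table(["\u0B95\u0BBE", "\u0B95\u0BCD"])  # KAA, K
sample = "\u0B95\u0BBE\u0B95\u0BCD\u0B95"                  # KAA, K, bare KA
packed = encode(sample, table)
assert decode(packed, table) == sample
```

Private-use code points carry no agreed meaning outside the application, so text should be decoded again before it is stored or interchanged.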
In Table 12-29, the first row shows the transliterated representation of the Tamil vowels in abbreviated form, while the first column shows the transliterated representation of the Tamil consonants. Those row and column labels, together with identifying strings such as “TAMIL SYLLABLE” or “TAMIL CONSONANT”, are concatenated to form formal names for these sequences. For example, the sequence shown in the table in the K row and the AA column, with the sequence <0B95, 0BBE>, gets the associated name TAMIL SYLLABLE KAA. The sequence shown in the table in the K row in the first column, with the sequence <0B95, 0BCD>, gets the associated name TAMIL CONSONANT K.
Details on the complete names for each element can be found in NamedSequences.txt.
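The naming rule can be tried out in Python, whose `unicodedata.lookup` function resolves named character sequences as well as character names. The helper `tamil_sequence_name` is a hypothetical illustration of the concatenation described above, and the lookup succeeds only if the interpreter's bundled character database includes the Tamil named sequences.

```python
import unicodedata

def tamil_sequence_name(kind: str, consonant: str, vowel: str = "") -> str:
    """Concatenate the identifying string with the row and column labels."""
    return f"TAMIL {kind} {consonant}{vowel}"

syllable = tamil_sequence_name("SYLLABLE", "K", "AA")
consonant = tamil_sequence_name("CONSONANT", "K")

# lookup() returns the whole sequence as a string of code points.
assert unicodedata.lookup(syllable) == "\u0B95\u0BBE"
assert unicodedata.lookup(consonant) == "\u0B95\u0BCD"
```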
#12.7 Telugu
#12.7.1 Telugu: U+0C00–U+0C7F
The Telugu script is a South Indian script used to write the Telugu language of the Andhra Pradesh state in India as well as minority languages such as Gondi (Adilabad and Koi dialects) and Lambadi. The script is also used in Maharashtra, Odisha (Orissa), Madhya Pradesh, and West Bengal. The Telugu script became distinct by the thirteenth century CE and shares ancestors with the Kannada script.
#Vowels. Telugu vowel letters and vowel signs are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. Table 12-30 shows the letters and signs that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ఓ | 0C13 | <0C12, 0C55> |
ఔ | 0C14 | <0C12, 0C4C> |
ీ | 0C40 | <0C3F, 0C55> |
ే | 0C47 | <0C46, 0C55> |
ో | 0C4B | <0C4A, 0C55> |
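Because most of the sequences in the “Do Not Use” column are not canonically equivalent to the atomic characters, normalization does not repair them. A cleanup pass therefore needs an explicit mapping, as in the sketch below. The table contents come from Table 12-30; the function name `use_atomic_vowels` is illustrative.

```python
import unicodedata

# "Do Not Use" sequence -> atomic code point, from Table 12-30.
TELUGU_ATOMIC = {
    "\u0C12\u0C55": "\u0C13",  # letter OO
    "\u0C12\u0C4C": "\u0C14",  # letter AU
    "\u0C3F\u0C55": "\u0C40",  # vowel sign II
    "\u0C46\u0C55": "\u0C47",  # vowel sign EE
    "\u0C4A\u0C55": "\u0C4B",  # vowel sign OO
}

def use_atomic_vowels(text: str) -> str:
    """Replace analyzed sequences with the atomic characters."""
    for sequence, atomic in TELUGU_ATOMIC.items():
        text = text.replace(sequence, atomic)
    return text

# NFC leaves the analyzed sequence untouched, so it cannot do this job.
assert unicodedata.normalize("NFC", "\u0C12\u0C55") == "\u0C12\u0C55"
assert use_atomic_vowels("\u0C12\u0C55") == "\u0C13"
```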
#Rendering Behavior. Telugu script rendering is similar to that of some other Brahmic scripts in the Unicode Standard, in particular the Kannada script. (See Section 12.8, Kannada.) Many Telugu letters have a v-shaped headstroke, which is a structural mark corresponding to the horizontal bar in Devanagari and the arch in Oriya. When a virama (called virāmamu in Telugu) or certain vowel signs are added to a letter with this headstroke, it is replaced:
U+0C15 క ka + U+0C4D ్ virama → క్ (k)
U+0C15 క ka + U+0C3F ి vowel sign i → కి (ki)
Telugu consonant clusters are most commonly represented by a subscripted, and often transformed, consonant glyph for the second element of the cluster:
U+0C17 గ ga + U+0C4D ్ virama + U+0C17 గ ga → గ్గ (gga)
U+0C15 క ka + U+0C4D ్ virama + U+0C15 క ka → క్క (kka)
U+0C15 క ka + U+0C4D ్ virama + U+0C2F య ya → క్య (kya)
U+0C15 క ka + U+0C4D ్ virama + U+0C37 ష ssa → క్ష (kṣa)
U+200C ZERO WIDTH NON-JOINER is used to prevent U+0C4D TELUGU SIGN VIRAMA from subscripting a following letter:
U+0C15 క ka + U+0C4D ్ virama + U+200C ZWNJ + U+0C15 క ka → క్‌క (k.ka)
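The two encodings differ only in the presence of the non-joiner, which is easy to miss because it is invisible. A minimal sketch that builds both sequences and lists their code points follows; `conjunct`, `explicit_virama`, and `code_points` are hypothetical helper names.

```python
VIRAMA = "\u0C4D"  # TELUGU SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

def conjunct(c1: str, c2: str) -> str:
    """Default cluster: the second consonant is normally subscripted."""
    return c1 + VIRAMA + c2

def explicit_virama(c1: str, c2: str) -> str:
    """Cluster with a visible virama: ZWNJ blocks the subscript form."""
    return c1 + VIRAMA + ZWNJ + c2

def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

KA = "\u0C15"
print(code_points(conjunct(KA, KA)))
print(code_points(explicit_virama(KA, KA)))
```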
#Nakāra-Pollu. A distinct form, ౝ, of a vowelless U+0C28 న TELUGU LETTER NA appears in older Telugu texts, and is known as nakāra-pollu. This form is represented by a separate character, U+0C5D ౝ TELUGU LETTER NAKAARA POLLU. The related form regularly used in modern texts takes an ordinary virama-joined shape న్, as other consonants do, and thus is represented by the sequence <U+0C28 న na, U+0C4D ్ virama>.
Prior to Unicode 14.0, these two distinct forms were treated as glyphic variants of that regular sequence <U+0C28 న na, U+0C4D ్ virama>, handled at the font level.
#Reph. In modern Telugu, U+0C30 TELUGU LETTER RA behaves in the same manner as most other initial consonants in a consonant cluster. That is, the ra appears in its nominal form, and the second consonant takes the C2-conjoining or subscripted form:
U+0C30 ర ra + U+0C4D ్ virama + U+0C2E మ ma → ర్మ (rma)
However, in older texts, U+0C30 TELUGU LETTER RA takes the reduced (or reph) form ర when it appears first in a consonant cluster, and the following consonant maintains its nominal form:
U+0C30 ర ra + U+0C4D ్ virama + U+0C2E మ ma → మర (rma)
U+200D ZERO WIDTH JOINER is placed immediately after the virama to render the reph explicitly in modern texts:
U+0C30 ర ra + U+0C4D ్ virama + U+200D ZWJ + U+0C2E మ ma → మర
To prevent display of a reph, U+200D ZERO WIDTH JOINER is placed after the ra, but preceding the virama:
U+0C30 ర ra + U+200D ZWJ + U+0C4D ్ virama + U+0C2E మ ma → ర‍్మ
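The position of the joiner relative to the virama is what selects the behavior, so it can help to generate these sequences from one place. The following sketch is illustrative only; the function `ra_cluster` and its `reph` parameter are assumptions, not an established API.

```python
RA = "\u0C30"      # TELUGU LETTER RA
VIRAMA = "\u0C4D"  # TELUGU SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

def ra_cluster(c2: str, reph=None) -> str:
    """Encode a cluster that starts with ra.

    reph=None  -> plain sequence; the font and era decide the shape
    reph=True  -> ZWJ after the virama explicitly requests the reph
    reph=False -> ZWJ before the virama prevents the reph
    """
    if reph is True:
        return RA + VIRAMA + ZWJ + c2
    if reph is False:
        return RA + ZWJ + VIRAMA + c2
    return RA + VIRAMA + c2

MA = "\u0C2E"
assert ra_cluster(MA) == "\u0C30\u0C4D\u0C2E"
assert ra_cluster(MA, reph=True) == "\u0C30\u0C4D\u200D\u0C2E"
assert ra_cluster(MA, reph=False) == "\u0C30\u200D\u0C4D\u0C2E"
```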
#Special Characters. U+0C55 TELUGU LENGTH MARK is provided as an encoding for the distinguishing element appearing in certain letters and signs; however, this character is not used in ordinary representation of Telugu texts. See “Vowels” earlier in this section for more information. U+0C56 TELUGU AI LENGTH MARK is provided as an encoding for the second element of the two-part vowel U+0C48 TELUGU VOWEL SIGN AI. The length marks are both nonspacing characters. For a detailed discussion of the use of two-part vowels, see “Two-Part Vowels” in Section 12.6, Tamil.
For scholarly orthographies in which a horizontal line below is used to denote an alternative vowel or consonant for a syllable, U+0952 DEVANAGARI STRESS SIGN ANUDATTA is recommended to represent the line analogously to a svara in an orthographic syllable. For the encoding order of svaras, see R10 of “Rendering Devanagari” in Section 12.1, Devanagari.
#Nukta. U+0C3C TELUGU SIGN NUKTA is a mark placed under letters to indicate additional sounds from Tamil and Perso-Arabic languages. It may display as a large dot or as a ring, and is typically placed low enough to avoid confusion and collision with the differentiating “teardrop” that occurs under many Telugu letters. The representative glyph in the code chart is shown with the ring form to minimize accidental confusability in implementations.
#Fractions. Prior to the adoption of the metric system, Telugu fractions were used as part of the system of measurement. Telugu fractions are quaternary (base-4), and use eight marks, which are conceptually divided into two sets. The first set represents odd-numbered negative powers of four in fractions. The second set represents even-numbered negative powers of four in fractions. Different zeros are used with each set. The zero from the first set is known as haḷḷi, U+0C78 TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR. The zero for the second set is U+0C66 TELUGU DIGIT ZERO.
#Punctuation. Danda and double danda are used primarily in the domain of religious texts to indicate the equivalent of a comma and full stop, respectively. The danda and double danda marks as well as some other unified punctuation used with Telugu are found in the Devanagari block; see Section 12.1, Devanagari.
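The quaternary scheme can be illustrated with a short worked example: 11/16 equals 2/4 + 3/16, so its first base-4 digit (a power of 4⁻¹, odd) is 2 and its second (4⁻², even) is 3. The sketch below maps such digits to the fraction characters at U+0C78..U+0C7E according to their character names. It is an arithmetic illustration under that assumption only and does not model historical writing conventions, such as when zeros were omitted.

```python
from fractions import Fraction

def quaternary_digits(value: Fraction, places: int) -> list[int]:
    """Base-4 digits of a fraction in [0, 1), most significant first."""
    digits = []
    for _ in range(places):
        value *= 4
        digit = int(value)  # integer part is the next base-4 digit
        digits.append(digit)
        value -= digit
    return digits

def telugu_fraction_marks(value, places: int = 2) -> str:
    """Map base-4 digits to the odd-power and even-power mark sets."""
    value = Fraction(value)
    if not 0 <= value < 1:
        raise ValueError("value must be in [0, 1)")
    marks = []
    for power, digit in enumerate(quaternary_digits(value, places), start=1):
        odd = power % 2 == 1
        if digit == 0:
            # Each set has its own zero: U+0C78 (odd) or U+0C66 (even).
            marks.append("\u0C78" if odd else "\u0C66")
        else:
            # U+0C79..U+0C7B are digits 1-3 for odd powers,
            # U+0C7C..U+0C7E are digits 1-3 for even powers.
            marks.append(chr((0x0C78 if odd else 0x0C7B) + digit))
    return "".join(marks)

assert quaternary_digits(Fraction(11, 16), 2) == [2, 3]
assert telugu_fraction_marks(Fraction(11, 16)) == "\u0C7A\u0C7E"
```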
#12.8 Kannada
#12.8.1 Kannada: U+0C80–U+0CFF
The Kannada script is a South Indian script. It is used to write the Kannada (or Kanarese) language of the Karnataka state in India and to write minority languages such as Tulu. The Kannada language is also used in many parts of Tamil Nadu, Kerala, Andhra Pradesh, and Maharashtra. This script is very closely related to the Telugu script both in the shapes of the letters and in the behavior of conjunct consonants. The Kannada script also shares many features common to other Indic scripts. See Section 12.1, Devanagari, for further information.
The Unicode Standard follows the ISCII layout for encoding, which also reflects the traditional Kannada alphabetic order.
#12.8.2 Principles of the Kannada Script
Like Devanagari and related scripts, the Kannada script employs a halant, which is also known as a virama or vowel omission sign, U+0CCD ್ KANNADA SIGN VIRAMA. The halant nominally serves to suppress the inherent vowel of the consonant to which it is applied. The halant functions as a combining character. When a consonant loses its inherent vowel by the application of halant, it is known as a dead consonant. The dead consonants are the presentation forms used to depict the consonants without an inherent vowel. Their rendered forms in Kannada resemble the full consonant with the horn replaced by the halant sign. In contrast, a live consonant is a consonant that retains its inherent vowel or is written with an explicit dependent vowel sign. The dead consonant is defined as a sequence consisting of a consonant letter followed by a halant. The default rendering for a dead consonant is to position the halant as a combining mark bound to the consonant letterform.
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. Table 12-31 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ಊ | 0C8A | <0C89, 0CBE> |
ಔ | 0C94 | <0C92, 0CCC> |
ೠ | 0CE0 | <0C8B, 0CBE> |
#Consonant Conjuncts. Kannada is also noted for a large number of consonant conjunct forms that serve as ligatures of two or more adjacent forms. This use of ligatures takes place in the context of a consonant cluster. A written consonant cluster is defined as a sequence of characters that represent one or more dead consonants followed by a normal live consonant. A separate and unique glyph corresponds to each part of a Kannada consonant conjunct. Most of these glyphs resemble their original consonant forms—many without the implicit vowel sign, wherever applicable.
In Kannada, conjunct formation tends to be graphically regular, using the following pattern:
- The first consonant of the cluster is rendered with the implicit vowel or a different dependent vowel appearing as the terminal element of the cluster.
- The remaining consonants (consonants between the first consonant and the terminal vowel element) appear in conjunct consonant glyph forms in phonetic order. They are generally depicted directly below or to the lower right of the first consonant.
A Kannada script font contains the conjunct glyph components, but they are not encoded as separate Unicode characters because they are simply ligatures. Kannada script rendering software must be able to map appropriate combinations of characters in context to the appropriate conjunct glyphs in fonts.
In a font that is capable of rendering Kannada, the number of glyphs is greater than the number of encoded Kannada characters.
#Special Characters. U+0CD5 ೕ KANNADA LENGTH MARK is provided as an encoding for the right side of the two-part vowel U+0CC7 ೇ KANNADA VOWEL SIGN EE should it be necessary for processing. Likewise, U+0CD6 ೖ KANNADA AI LENGTH MARK is provided as an encoding for the right side of the two-part vowel U+0CC8 ೈ KANNADA VOWEL SIGN AI. The Kannada two-part vowels actually consist of a nonspacing element above the consonant letter and one or more spacing elements to the right of the consonant letter. These two length marks have no independent existence in the Kannada writing system and do not play any part as independent codes in the traditional collation order.
#Kannada Letter LLLA. U+0CDE ೞ KANNADA LETTER FA is actually an archaic Kannada letter that is transliterated in Dravidian scholarship as ẓ, ḻ, or ṛ. This form should have been named “LLLA”, rather than “FA”, so the name in this standard is simply a mistake. A formal name alias KANNADA LETTER LLLA has been added to the Unicode Character Database for this character, to clarify its identity. Collations should treat U+0CDE as following U+0CB3 KANNADA LETTER LLA.
The letter llla has not been actively used in writing the Kannada language since the end of the tenth century. However, the letter does have modern use in writing the closely related Badaga language. Badaga is noteworthy for having distinct retroflexion in its vowel system, and a subjoined form of U+0CDE is often seen in Badaga written documents, to indicate retroflexed pronunciation of the vowel in a syllable. This subjoined form of U+0CDE may occur below consonants, but it also may be subjoined to an independent vowel, to indicate retroflexion of that vowel. In either case, the subjoined form of U+0CDE should be represented by a sequence including U+0CCD KANNADA SIGN VIRAMA. Implementations of the Kannada script need to be aware that these sequences involving independent vowels followed by virama and U+0CDE are valid and required in orthographies for Badaga. Examples of the use of subjoined U+0CDE to indicate retroflexion, both for independent vowel letters and for dependent vowels, are shown in Figure 12-29.
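The name alias and the collation advice can both be illustrated in a few lines of Python. `unicodedata.lookup` accepts formal name aliases, while `unicodedata.name` still returns the original name. The sort key below is a deliberately naive, hypothetical stand-in for real collation: it orders by code point and only moves U+0CDE to sit directly after U+0CB3. The last helper builds the Badaga-style sequence of a base, virama, and U+0CDE.

```python
import unicodedata

LLA = "\u0CB3"     # KANNADA LETTER LLA
LLLA = "\u0CDE"    # named KANNADA LETTER FA, alias KANNADA LETTER LLLA
VIRAMA = "\u0CCD"  # KANNADA SIGN VIRAMA

assert unicodedata.name(LLLA) == "KANNADA LETTER FA"
assert unicodedata.lookup("KANNADA LETTER LLLA") == LLLA

def kannada_sort_key(word: str):
    """Code point order, except that U+0CDE sorts just after U+0CB3."""
    return [(ord(LLA), 1) if ch == LLLA else (ord(ch), 0) for ch in word]

def with_subjoined_llla(base: str) -> str:
    """Base consonant or independent vowel + virama + U+0CDE (Badaga usage)."""
    return base + VIRAMA + LLLA

letters = sorted(["\u0CB5", LLLA, LLA], key=kannada_sort_key)
assert letters == [LLA, LLLA, "\u0CB5"]
```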
#12.8.3 Rendering Kannada
Plain text in Kannada is generally stored in phonetic order; that is, a CV syllable with a dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in the memory representation. This order is employed by the ISCII standard and corresponds to the phonetic and keying order of textual data.
#Explicit Virama (Halant). Normally, a halant character creates dead consonants, which in turn combine with subsequent consonants to form conjuncts. This behavior usually results in a halant sign not being depicted visually. Occasionally, this default behavior is not desired when a dead consonant should be excluded from conjunct formation, in which case the halant sign is visibly rendered. To accomplish this, U+200C ZERO WIDTH NON-JOINER is introduced immediately after the encoded dead consonant that is to be excluded from conjunct formation. See Section 12.1, Devanagari, for examples.
#Vowelless NA. A special form, ೝ, of a vowelless na appears in older Kannada texts, distinct from the usual form of the vowelless na in modern texts: ನ್. The historic form is represented by a separate character, U+0CDD KANNADA LETTER NAKAARA POLLU. This character is named after the analogous Telugu form, nakāra-pollu, because there is no conventional term for this form in Kannada. Prior to Unicode 14.0, these two forms were treated as glyphic variants of <U+0CA8 KANNADA LETTER NA, U+0CCD KANNADA SIGN VIRAMA>, handled at the font level.
#Consonant Clusters Involving RA. Whenever a consonant cluster is formed with U+0CB0 ರ KANNADA LETTER RA as the first component of the consonant cluster, the letter ra is depicted with two different presentation forms: one as the initial element and the other as the final display element of the consonant cluster.
U+0CB0 ರ ra + U+0CCD ್ halant + U+0C95 ಕ ka → ರ್ಕ rka
U+0CB0 ರ ra + U+200D ZWJ + U+0CCD ್ halant + U+0C95 ಕ ka → ರ‍್ಕ rka
U+0C95 ಕ ka + U+0CCD ್ halant + U+0CB0 ರ ra → ಕ್ರ kra
#Jihvamuliya and Upadhmaniya. Voiceless velar and bilabial fricatives in Kannada are represented by U+0CF1 KANNADA SIGN JIHVAMULIYA and U+0CF2 KANNADA SIGN UPADHMANIYA, respectively. When the signs appear with a following homorganic voiceless stop consonant, the combination should be rendered in the font as a stacked ligature, without a virama:
Dependent vowel signs can also be added to the stack:
#Modifier Mark Rules. In addition to the vowel signs, one or more types of combining marks may be applied to a component of a written syllable or the syllable as a whole. If the consonant represents a dead consonant, then the nukta should precede the halant in the memory representation. The nukta is represented by a double-dot mark, U+0CBC ಼ KANNADA SIGN NUKTA. Two such modified consonants are used in the Kannada language: one representing the syllable za and one representing the syllable fa.
#Avagraha Sign. A spacing mark, U+0CBD ಽ KANNADA SIGN AVAGRAHA, is used when rendering Sanskrit texts.
#Punctuation. Danda and double danda marks as well as some other unified punctuation used with this script are found in the Devanagari block; see Section 12.1, Devanagari.
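The nukta-before-halant order matches canonical ordering: the nukta has a lower canonical combining class than the halant, so normalization moves a misplaced nukta in front of the halant. The check below is a small sketch using Python's `unicodedata`; `canonical_order` is an illustrative helper name.

```python
import unicodedata

NUKTA = "\u0CBC"   # KANNADA SIGN NUKTA
HALANT = "\u0CCD"  # KANNADA SIGN VIRAMA
JA = "\u0C9C"      # ja + nukta writes the syllable za

def canonical_order(text: str) -> str:
    """Normalize so that combining marks are in canonical order."""
    return unicodedata.normalize("NFC", text)

# Lower combining class sorts first: nukta (7) before halant (9).
assert unicodedata.combining(NUKTA) < unicodedata.combining(HALANT)

misordered = JA + HALANT + NUKTA
assert canonical_order(misordered) == JA + NUKTA + HALANT
```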
#12.9 Malayalam
#12.9.1 Malayalam: U+0D00–U+0D7F
The Malayalam script is a South Indian script used to write the Malayalam language of the Kerala state. Malayalam is a Dravidian language like Kannada, Tamil, and Telugu. Throughout its history, it has absorbed words from Tamil, Sanskrit, Arabic, and English.
The shapes of Malayalam letters closely resemble those of Tamil. Malayalam, however, has a very full and complex set of conjunct consonant forms.
#Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be analyzed visually as consisting of multiple parts. Table 12-32 shows the letters that can be analyzed, the single code point that should be used to represent them in text, and the sequence of code points resulting from analysis that should not be used.
For | Use | Do Not Use |
---|---|---|
ഈ | 0D08 | <0D07, 0D57> |
ഊ | 0D0A | <0D09, 0D57> |
ഐ | 0D10 | <0D0E, 0D46> |
ഓ | 0D13 | <0D12, 0D3E> |
ഔ | 0D14 | <0D12, 0D57> |
#Two-Part Vowels. The Malayalam script uses several two-part vowel characters. In modern times, the dominant practice is to write the dependent form of the au vowel using only “ൗ”, which is placed on the right side of the consonant it modifies; such texts are represented in Unicode using U+0D57 MALAYALAM AU LENGTH MARK. In the past, this dependent form was written using both “െ” on the left side and “ൗ” on the right side; U+0D4C MALAYALAM VOWEL SIGN AU can be used for documents following this earlier tradition. This historical simplification started much earlier than the orthographic reforms mentioned in the text that follows.
For a detailed discussion of the use of two-part vowels, see “Two-Part Vowels” in Section 12.6, Tamil.
#Historic and Scholarly Characters. U+0D5F MALAYALAM LETTER ARCHAIC II represents an earlier form for the vowel letter ii. Characters for the letters and signs of vocalic rr, vocalic l, and vocalic ll, as well as candrabindu and avagraha, are only used in Sanskrit texts. U+0D54..U+0D56 are rarely used chillu forms, found only in historical materials.
U+0D3B MALAYALAM SIGN VERTICAL BAR VIRAMA and U+0D3C MALAYALAM SIGN CIRCULAR VIRAMA are two specific forms of viramas found in historical materials. They were used to indicate a pure consonant in different orthographies. U+0D00 MALAYALAM SIGN COMBINING ANUSVARA ABOVE is used in certain Prakrit texts, where the ordinary anusvara indicates gemination of the following consonant.
U+0D3A MALAYALAM LETTER TTTA and U+0D29 MALAYALAM LETTER NNNA are used in scholarly orthographies for transcribing the Malayalam language in a phonetically accurate way. They represent the alveolar plosive and nasal, respectively. The letter nnna is parallel to U+0BA9 TAMIL LETTER NNNA.
#Suriyani Malayalam. The Suriyani dialect of Malayalam is written using the Syriac script. It is also called Garshuni (Karshoni) or Syriac Malayalam. This usage requires eleven additional letters encoded in the Syriac Supplement block (U+0860..U+086F) to represent the sounds of Malayalam. The dialect was widely used by the St. Thomas Christians living in Kerala, India, in the 19th century.
#12.9.2 Malayalam Orthographic Reform
In the 1970s and 1980s, Malayalam underwent orthographic reform due to printing difficulties. The treatment of the combining vowel signs u and uu was simplified at this time. These vowel signs had previously been represented using special cluster graphemes where the vowel signs were fused beneath their consonants, but in the reformed orthography they are represented by spacing characters following their consonants. Table 12-33 lists a variety of consonants plus the u or uu vowel sign, yielding a syllable. Each syllable is shown as it would be displayed in the older orthography, contrasted with its display in the reformed orthography.
Syllable | Sequence | Older Orthography | Reformed Orthography |
---|---|---|---|
ku | ക + ു | കു | കു |
gu | ഗ + ു | ഗു | ഗു |
chu | ഛ + ു | ഛു | ഛു |
ju | ജ + ു | ജു | ജു |
ṇu | ണ + ു | ണു | ണു |
tu | ത + ു | തു | തു |
nu | ന + ു | നു | നു |
bhu | ഭ + ു | ഭു | ഭു |
ru | ര + ു | രു | രു |
śu | ശ + ു | ശു | ശു |
hu | ഹ + ു | ഹു | ഹു |
kū | ക + ൂ | കൂ | കൂ |
gū | ഗ + ൂ | ഗൂ | ഗൂ |
chū | ഛ + ൂ | ഛൂ | ഛൂ |
jū | ജ + ൂ | ജൂ | ജൂ |
ṇū | ണ + ൂ | ണൂ | ണൂ |
tū | ത + ൂ | തൂ | തൂ |
nū | ന + ൂ | നൂ | നൂ |
bhū | ഭ + ൂ | ഭൂ | ഭൂ |
rū | ര + ൂ | രൂ | രൂ |
śū | ശ + ൂ | ശൂ | ശൂ |
hū | ഹ + ൂ | ഹൂ | ഹൂ |
#12.9.3 Rendering Malayalam
#Candrakkala. As is the case for many other Brahmi-derived scripts in the Unicode Standard, Malayalam uses a virama character to form consonant conjuncts. The virama sign itself is known as candrakkala in Malayalam. Table 12-34 provides a variety of examples of consonant conjuncts. There are both horizontal and vertical conjuncts, some of which ligate, and some of which are merely juxtaposed.
ക | + | ് | + | ഷ | → | ക്ഷ | (kṣa) |
ക | + | ് | + | ക | → | ക്ക | (kka) |
ജ | + | ് | + | ഞ | → | ജ്ഞ | (jña) |
ട | + | ് | + | ട | → | ട്ട | (ṭṭa) |
പ | + | ് | + | പ | → | പ്പ | (ppa) |
ച | + | ് | + | ഛ | → | ച്ഛ | (ccha) |
ബ | + | ് | + | ബ | → | ബ്ബ | (bba) |
ന | + | ് | + | യ | → | ന്യ | (nya) |
പ | + | ് | + | ര | → | പ്ര | (pra) |
ശ | + | ് | + | വ | → | ശ്വ | (śva) |
When the candrakkala sign is visibly shown in Malayalam, it indicates either the suppression of the preceding vowel or its replacement with a neutral vowel sound. This sound is often called “half-u” or samvruthokaram. In various orthographies this sound is typically spelled with either a vowel sign -u followed by candrakkala or a candrakkala alone. In vernacular orthographies, candrakkala can also be seen on an independent vowel letter or preceding an anusvara. In all cases, the candrakkala sign is represented by the character U+0D4D MALAYALAM SIGN VIRAMA, which follows any vowel sign that may be present and precedes any anusvara that may be present. Implementations need to pay careful attention to correctly shape a Malayalam orthographic syllable when U+0D4D occurs in such locations. Examples are shown in Table 12-35.
പാലു് | /pālə/ milk | 0D2A, 0D3E, 0D32, 0D41, 0D4D |
എ്ന്നാ | /ənnā/ on which day? (vernacular) | 0D0E, 0D4D, 0D28, 0D4D, 0D28, 0D3E |
ഐശീല്ം | /aiśīləm/ than ice (vernacular) | 0D10, 0D36, 0D40, 0D32, 0D4D, 0D02 |
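The ordering rule (candrakkala after any vowel sign and before any anusvara) can be checked mechanically. The sketch below rebuilds the three table entries from their code points and tests them against a simple heuristic pattern. The pattern, the rough vowel-sign range, and the helper names are assumptions for illustration; this is not a full validator of Malayalam orthographic syllables.

```python
import re

VIRAMA = "\u0D4D"    # MALAYALAM SIGN VIRAMA (candrakkala)
ANUSVARA = "\u0D02"  # MALAYALAM SIGN ANUSVARA
VOWEL_SIGNS = "\u0D3E-\u0D4C\u0D57"  # rough range of dependent vowel signs

# Heuristic: candrakkala placed before a vowel sign, or after an anusvara.
MISORDERED = re.compile(f"{VIRAMA}[{VOWEL_SIGNS}]|{ANUSVARA}{VIRAMA}")

def from_hex(codes: str) -> str:
    """Build a string from a comma-separated list of hex code points."""
    return "".join(chr(int(c.strip(), 16)) for c in codes.split(","))

def candrakkala_order_ok(text: str) -> bool:
    """True if no candrakkala is out of order under the heuristic above."""
    return MISORDERED.search(text) is None

rows = [
    "0D2A, 0D3E, 0D32, 0D41, 0D4D",
    "0D0E, 0D4D, 0D28, 0D4D, 0D28, 0D3E",
    "0D10, 0D36, 0D40, 0D32, 0D4D, 0D02",
]
assert all(candrakkala_order_ok(from_hex(row)) for row in rows)
```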
#Explicit Candrakkala. The sequence <C1, virama, ZWNJ, C2>, where C1 and C2 are consonants, may be used to request display with an explicit visible candrakkala, instead of the default conjunct form. See Table 12-36 for an example. This convention is consistent with the use of this sequence in other Indic scripts.
#Requesting Traditional Ligatures. The sequence <C1, ZWJ, virama, C2> may be used to request traditional ligatures, even if the current font defaults to the conjuncts appropriate for the reformed orthography. When such sequences occur, a closed or cursively connected ligature should be displayed, if available. See Table 12-36 for examples. This convention is consistent with the use of this sequence in some other Indic scripts, such as Kannada, Oriya, and Telugu.
#Requesting Open Forms of Conjuncts. The sequence <C1, ZWNJ, virama, C2> may be used to request open ligatures or those used in the reformed orthography, even if the current font defaults to the conjuncts appropriate for the traditional orthography. When such sequences occur, an open or disconnected conjunct form should be displayed, if available. See Table 12-36 for examples. Note that such sequences are defined for Malayalam only, and are left undefined for other Indic scripts.
ക | + | ് | + | ര | → | ക്ര or ക്ര | (kra) | ||
സ | + | ് | + | ക | → | സ്ക or സ്ക | (ska) | ||
ത | + | ് | + | സ | → | ത്സ or ത്സ | (tsa) | ||
ഴ | + | ് | + | വ | → | ഴ്വ or ഴ്വ or ഴ്വ | (ḻva) | ||
യ | + | ് | + | യ | → | യ്യ | (yya) | ||
ക | + | ് | + | ZWNJ | + | ര | → | ക്‌ര | (kra) |
ക | + | ZWJ | + | ് | + | ര | → | ക്ര | (kra) |
സ | + | ZWJ | + | ് | + | ക | → | സ്ക | (ska) |
ത | + | ZWJ | + | ് | + | സ | → | ത്സ | (tsa) |
ഴ | + | ZWJ | + | ് | + | വ | → | ഴ്വ | (ḻva) |
ക | + | ZWNJ | + | ് | + | ര | → | ക്ര | (kra) |
ഴ | + | ZWNJ | + | ് | + | വ | → | ഴ്വ | (ḻva) |
യ | + | ZWNJ | + | ് | + | യ | → | യ‌്യ | (yya) |
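The three request sequences differ only in which joiner is used and where it sits relative to the virama. A minimal sketch that builds each form follows; the function names are hypothetical, and whether a given form is actually displayed depends on the font and rendering engine.

```python
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA (candrakkala)
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

def default_conjunct(c1: str, c2: str) -> str:
    """<C1, virama, C2>: whatever conjunct the font provides by default."""
    return c1 + VIRAMA + c2

def explicit_candrakkala(c1: str, c2: str) -> str:
    """<C1, virama, ZWNJ, C2>: visible candrakkala, no conjunct."""
    return c1 + VIRAMA + ZWNJ + c2

def traditional_ligature(c1: str, c2: str) -> str:
    """<C1, ZWJ, virama, C2>: request a closed, traditional ligature."""
    return c1 + ZWJ + VIRAMA + c2

def open_conjunct(c1: str, c2: str) -> str:
    """<C1, ZWNJ, virama, C2>: request an open, reformed-style form."""
    return c1 + ZWNJ + VIRAMA + c2

KA, RA = "\u0D15", "\u0D30"
forms = [f(KA, RA) for f in (default_conjunct, explicit_candrakkala,
                             traditional_ligature, open_conjunct)]
assert len(set(forms)) == 4  # four distinct encodings of kra
```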
#Anusvara. The anusvara can be seen multiple times after vowels, whether independent letters or dependent vowel signs, as in ഈംംംം <0D08, 0D02, 0D02, 0D02, 0D02>. Vowel signs can also be seen after digits, as in 355ാം <0033, 0035, 0035, 0D3E, 0D02>. More generally, rendering engines should be prepared to handle Malayalam letters (including vowel letters), digits (both European and Malayalam), U+002D HYPHEN-MINUS, U+00A0 NO-BREAK SPACE, and U+25CC DOTTED CIRCLE as base characters for the Malayalam vowel signs, U+0D4D MALAYALAM SIGN VIRAMA, U+0D02 MALAYALAM SIGN ANUSVARA, and U+0D03 MALAYALAM SIGN VISARGA. They should also be prepared to handle multiple combining marks on those bases.
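A shaping engine's first step is to attach each mark to a base, and the list of acceptable bases above is broader than consonants alone. The sketch below groups text into base-plus-marks clusters under that list. It is a simplified illustration with hypothetical helper names, relying on general categories from Python's `unicodedata`, and is not a substitute for a real cluster segmentation algorithm.

```python
import unicodedata

EXTRA_BASES = {"-", "\u00A0", "\u25CC"}  # hyphen-minus, NBSP, dotted circle

def is_mark(ch: str) -> bool:
    """Malayalam vowel signs, virama, anusvara, and visarga."""
    cp = ord(ch)
    in_range = 0x0D3E <= cp <= 0x0D4D or cp in (0x0D02, 0x0D03, 0x0D57)
    return in_range and unicodedata.category(ch).startswith("M")

def is_base(ch: str) -> bool:
    """Malayalam letters, European or Malayalam digits, and the extras."""
    cp = ord(ch)
    if ch in EXTRA_BASES or ch in "0123456789" or 0x0D66 <= cp <= 0x0D6F:
        return True
    return 0x0D00 <= cp <= 0x0D7F and unicodedata.category(ch) == "Lo"

def split_clusters(text: str) -> list[str]:
    """Attach each run of marks to the acceptable base that precedes it."""
    clusters: list[str] = []
    for ch in text:
        if is_mark(ch) and clusters and is_base(clusters[-1][0]):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# The digit 5 carries a vowel sign and an anusvara.
assert split_clusters("355\u0D3E\u0D02") == ["3", "5", "5\u0D3E\u0D02"]
```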
#Dot Reph.U+0D4EMALAYALAM LETTER DOT REPH is used to represent the dead consonant form ofU+0D30MALAYALAM LETTER RA, when it is displayed as a dot or small vertical stroke above the consonant that follows it in logical order. It has the character properties of a letter rather than those of a combining mark, but special behavior is required in implementations. Conceptually,dot reph is analogous to the sequence <ra,virama> which, in many Indic scripts, is rendered as areph mark over the following consonant. This same behavior is expected fordot reph: it should be rendered as a mark over the following consonant. In standard Malayalam, the sequence <ra,virama> would normally occur only within the sequence <ra,virama, ya>, which should be rendered as the nominal form ofra with a conjoining form ofya.
The sequence <ra,virama, ZWJ> is not used to represent thedot reph, because that sequence has considerable preexisting usage to represent thechillu form ofra, prior to the encoding of thechillu form as a distinct character,U+0D7CMALAYALAM LETTER CHILLU RR.
The Malayalamdot reph was in common print usage until 1970, but has fallen into disuse. Words that formerly useddot reph on a consonant are now spelled instead with achillu-rr form preceding the consonant. (See the following discussion ofchillu characters.) Thedot reph form is predominantly used by those who completed elementary education in Malayalam prior to 1970.
#Chillu Forms. The nine characters, U+0D54..U+0D56 and U+0D7A..U+0D7F, encode dead consonants (those without an inherent vowel) known as chillu or cillakṣaram. In Malayalam language text, chillu forms never start a word. Chillu-nn, -n, -rr, -l, and -ll are quite common; chillu-k is relatively rare in contemporary usage; chillu-m, -y, and -lll are found only in historical texts.
For backward-compatibility issues regarding the representation of chillu forms, see the discussion of legacy chillu sequences later in this section.
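As a small illustration of the inventory just described, the sketch below tabulates the nine chillu code points with the usage notes from the text and flags whitespace-separated words that begin with a chillu, which the text says does not occur in Malayalam language text. The names `CHILLU_USAGE`, `is_chillu`, and `words_starting_with_chillu` are hypothetical, and splitting on whitespace is a simplifying assumption rather than real word segmentation.

```python
# Usage notes follow the paragraph above; the assignment of each code point
# to a particular chillu follows the Unicode character names.
CHILLU_USAGE = {
    0x0D54: "historical",  # chillu m
    0x0D55: "historical",  # chillu y
    0x0D56: "historical",  # chillu lll
    0x0D7A: "common",      # chillu nn
    0x0D7B: "common",      # chillu n
    0x0D7C: "common",      # chillu rr
    0x0D7D: "common",      # chillu l
    0x0D7E: "common",      # chillu ll
    0x0D7F: "rare",        # chillu k
}


def is_chillu(ch):
    """True if ch is one of the nine atomically encoded chillu letters."""
    return ord(ch) in CHILLU_USAGE


def words_starting_with_chillu(text):
    """Return the whitespace-separated words that begin with a chillu.

    Such words are unexpected in Malayalam language text, so a non-empty
    result may point to a data or segmentation problem.
    """
    return [word for word in text.split() if is_chillu(word[0])]
```

A check like this is only a heuristic for spotting suspicious data; it says nothing about text in other languages that happens to use the Malayalam script.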
Although chillus are typically written alone, they may graphically behave like ordinary consonant letters. See Table 12-37 for examples of conjuncts involving chillus. The chillu-involving conjuncts are encoded graphically: the graphic component bearing the ligated chillu tail is analyzed as a chillu character, and then stacking or ligating between characters is requested by U+0D4D MALAYALAM SIGN VIRAMA. Dependent signs such as vowel signs and candrakkala can be applied to both stand-alone chillus and chillu-involving conjuncts, just as they are applied to ordinary consonant letters and conjuncts.
Among the examples shown in Table 12-37, only the second conjunct, ൻ്റ /ṉṯa/, is used in modern Malayalam text. See “Special Cases Involving rra” later in this section for how to deal with the contrast between this conjunct and a phonetically related side-by-side form, ൻറ.
Visual | Representation | Reading |
---|---|---|
ൺ്ന | 0D7A chillu nn, 0D4D virama, 0D28 na | /ṇna/ |
ൻ്റ | 0D7B chillu n, 0D4D virama, 0D31 rra | /ṉṯa/ |
ന്ൻ | 0D28 na, 0D4D virama, 0D7B chillu n | /ṉṉ/ |
ൽ്പ | 0D7D chillu l, 0D4D virama, 0D2A pa | /lpa/ |
ൾ്വ | 0D7E chillu ll, 0D4D virama, 0D35 va | /ḷva/ |
U+0D3B MALAYALAM SIGN VERTICAL BAR VIRAMA is not used to form chillus. It only represents a vowel-killing vertical stroke that is identifiable as a separate stroke, either striking through or placed above the modified letter.
#Special Cases Involving rra. There are a number of textual representation and reading issues involving the letter rra. These issues are discussed here and tables of explicit examples are presented.
The letter റ rra is normally read /ṟa/. Repetition of that sound is naturally written by repeating the letter: ററ. Each occurrence can bear a vowel sign.
The same repetition of the letter rra as ററ is also used for /ṯṯa/, which can be unambiguously represented by റ്റ. The sequence of two റ letters fundamentally behaves as a digraph in this instance. The digraph can bear a vowel sign, in which case the digraph as a whole acts graphically as an atom: a left vowel part goes to the left of the digraph and a right vowel part goes to the right of the digraph. Historically, the side-by-side form was used until around 1960, when the stacked form began appearing and supplanted the side-by-side form.
As a consequence, the graphical sequence ററ in text is ambiguous in reading. The reader must generally use the context to understand whether ററ is read /ṟaṟa/ or /ṯṯa/. It is only when a vowel part appears between the two റ that the reading cannot be /ṯṯa/. Note that similar situations are common in many other orthographies. For example, th in English can be a digraph (cathode) or two separate letters (cathouse); gn in French can be a digraph (oignon) or two separate letters (gnome).
The sequence <0D31, 0D31> is rendered as ററ, regardless of the reading of that text. The sequence <0D31, 0D4D, 0D31> is rendered as റ്റ. In both cases, vowel signs are applied to each rendered base, as shown in Table 12-38.
Visual | Representation | Reading | Meaning |
---|---|---|---|
പാററ | 0D2A 0D3E 0D31 0D31 | /pāṯṯa/ | cockroach |
പാറ്റ | 0D2A 0D3E 0D31 0D4D 0D31 | /pāṯṯa/ | cockroach |
മാറെറാലി | 0D2E 0D3E 0D31 0D46 0D31 0D3E 0D32 0D3F | /māṯṯoli/ | echo |
മാറ്റൊലി | 0D2E 0D3E 0D31 0D4D 0D31 0D4A 0D32 0D3F | /māṯṯoli/ | echo |
ബാറററി | 0D2C 0D3E 0D31 0D31 0D31 0D3F | /bāṯṯaṟi/ | battery |
ബാറ്ററി | 0D2C 0D3E 0D31 0D4D 0D31 0D31 0D3F | /bāṯṯaṟi/ | battery |
സൂറററ് | 0D38 0D42 0D31 0D31 0D31 0D4D | /sūṟaṯṯ/ | Surat, a town in Gujarat |
സൂററ്റ് | 0D38 0D42 0D31 0D31 0D4D 0D31 0D4D | /sūṟaṯṯ/ | Surat, a town in Gujarat |
ടെംപററി | 0D1F 0D46 0D02 0D2A 0D31 0D31 0D3F | /ṭempaṟaṟi/ | temporary |
ലെക്ചററോട് | 0D32 0D46 0D15 0D4D 0D1A 0D31 0D31 0D4B 0D1F 0D4D | /lekcaṟaṟōṭ/ | to the lecturer |
A very similar situation exists for the combination of ൻ chillu-n and റ rra. When used side by side, ൻറ can be read either /ṉṟa/ or /ṉṯa/, while stacked ൻ്റ is always read /ṉṯa/.
The sequence <0D7B, 0D31> is rendered as ൻറ, regardless of the reading of that text. The sequence <0D7B, 0D4D, 0D31> is rendered as ൻ്റ. In both cases, vowel signs are applied to each rendered base, as shown in Table 12-39.
Visual | Representation | Reading | Meaning |
---|---|---|---|
ആൻേറാ | 0D06 0D7B 0D47 0D31 0D3E | /āṉṯō/ | a proper name |
ആൻ്റോ | 0D06 0D7B 0D4D 0D31 0D4B | /āṉṯō/ | a proper name |
എൻറോൾ | 0D0E 0D7B 0D31 0D4B 0D7E | /eṉṟōl/ | enroll |
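To make concrete what it means for vowel signs to be applied to each rendered base, the following sketch derives the side-by-side spelling from the stacked spelling for the paired rows of Table 12-38 and Table 12-39. It is an illustration only, not a recommended conversion: the function name `side_by_side_spelling` and the helper `from_hex` are hypothetical, and the sketch assumes that a left vowel part moves onto the first base while a right part stays on the second, splitting the two-part signs U+0D4A, U+0D4B, and U+0D4C along their canonical decompositions. The reverse direction cannot be automated, because the side-by-side forms are ambiguous in reading.

```python
import re

RRA = "\u0D31"

# How each left-side or two-part vowel sign is distributed over the two
# side-by-side bases: (part kept on the first base, part kept on the second).
VOWEL_PARTS = {
    "\u0D46": ("\u0D46", ""),        # sign e: left part only
    "\u0D47": ("\u0D47", ""),        # sign ee: left part only
    "\u0D48": ("\u0D48", ""),        # sign ai: left part only
    "\u0D4A": ("\u0D46", "\u0D3E"),  # sign o  = e  + aa
    "\u0D4B": ("\u0D47", "\u0D3E"),  # sign oo = ee + aa
    "\u0D4C": ("\u0D46", "\u0D57"),  # sign au = e  + au length mark
}

# <rra or chillu-n, virama, rra>, optionally followed by one of the signs above.
STACKED = re.compile("([\u0D31\u0D7B])\u0D4D\u0D31([\u0D46-\u0D48\u0D4A-\u0D4C]?)")


def side_by_side_spelling(text):
    """Rewrite stacked <X, virama, rra> as side-by-side <X, rra>.

    A left vowel part is moved onto the first base; any other vowel sign
    is left in place after the second base.
    """
    def replace(match):
        left, right = VOWEL_PARTS.get(match.group(2), ("", ""))
        return match.group(1) + left + RRA + right
    return STACKED.sub(replace, text)


def from_hex(codes):
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(code, 16)) for code in codes.split())
```

Applied to the stacked row for /māṯṯoli/, <0D2E 0D3E 0D31 0D4D 0D31 0D4A 0D32 0D3F>, the function yields the side-by-side row <0D2E 0D3E 0D31 0D46 0D31 0D3E 0D32 0D3F>: the two-part vowel sign is split so that each part attaches to its own rendered base.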
#Legacy Representations of Conjunct /ṉṯa/. Prior to Unicode 5.1, when <0D7B chillu-n, 0D4D virama, 0D31 rra> became the recommendation for the conjunct ൻ്റ /ṉṯa/, two other representations were already in use: <0D28 na, 0D4D virama, 0D31 rra> and <0D28 na, 0D4D virama, 200D ZWJ, 0D31 rra>. All three representations are widespread because implementations have been slow to adopt the recommended representation.
Implementations should treat <na, virama, rra> in existing text as equivalent to the recommended representation for the conjunct ൻ്റ, <chillu-n, virama, rra>. Newly generated text should only use the recommended representation.
The other legacy representation <na, virama, ZWJ, rra> conflicts with the legacy representation of the side-by-side form ൻറ (see “Legacy Chillu Sequences” later in this section). Therefore, implementations should treat <na, virama, ZWJ, rra> as a representation of the stacked form ൻ്റ only if they know this sequence is not used to represent the side-by-side form ൻറ.
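The two rules above can be sketched as a small normalization routine. This is a minimal illustration under stated assumptions, not a prescribed algorithm: the function name `normalize_nta` and the flag `zwj_means_stacked` are hypothetical, and the flag stands in for whatever out-of-band knowledge an implementation has about how its data uses <na, virama, ZWJ, rra>.

```python
NA = "\u0D28"
RRA = "\u0D31"
VIRAMA = "\u0D4D"
CHILLU_N = "\u0D7B"
ZWJ = "\u200D"

# Recommended representation of the stacked conjunct: <chillu-n, virama, rra>.
RECOMMENDED = CHILLU_N + VIRAMA + RRA


def normalize_nta(text, zwj_means_stacked=False):
    """Map legacy spellings of the stacked conjunct to the recommended one.

    <na, virama, rra> is always mapped. <na, virama, ZWJ, rra> is mapped only
    when the caller asserts (zwj_means_stacked=True) that the data never uses
    that sequence for the side-by-side form; otherwise it is left untouched.
    """
    if zwj_means_stacked:
        text = text.replace(NA + VIRAMA + ZWJ + RRA, RECOMMENDED)
    return text.replace(NA + VIRAMA + RRA, RECOMMENDED)
```

Leaving the ZWJ sequence alone by default is the conservative choice here: a wrong guess would silently turn a side-by-side form into a stacked one.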
#Legacy Chillu Sequences. Prior to Unicode Version 5.1, the representation of text with chillu forms was problematic, and not clearly described in the text of the standard. Because older data will use a different representation for chillu forms, implementations must be prepared to handle both kinds of data. For chillu forms considered in isolation, the following table shows the relationship between their representation in Version 5.0 and earlier, and the recommended representation starting with Version 5.1. Note that only the five chillu forms listed in Table 12-40 were specified in the standard before Version 5.1, and thus were represented in legacy text by <virama, ZWJ> sequences. Other chillu forms in Malayalam are only represented as atomically encoded chillu characters.
Visual | Legacy Representation (5.0) | Preferred Representation |
---|---|---|
ൺ | nna, virama, ZWJ <0D23, 0D4D, 200D> | 0D7A MALAYALAM LETTER CHILLU NN |
ൻ | na, virama, ZWJ <0D28, 0D4D, 200D> | 0D7B MALAYALAM LETTER CHILLU N |
ർ | ra, virama, ZWJ <0D30, 0D4D, 200D> | 0D7C MALAYALAM LETTER CHILLU RR |
ൽ | la, virama, ZWJ <0D32, 0D4D, 200D> | 0D7D MALAYALAM LETTER CHILLU L |
ൾ | lla, virama, ZWJ <0D33, 0D4D, 200D> | 0D7E MALAYALAM LETTER CHILLU LL |
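The correspondence in Table 12-40 lends itself to a table-driven conversion. The sketch below is one possible way to upgrade legacy data, assuming the five mappings in the table are the only ones needed. The names `LEGACY_TO_ATOMIC` and `upgrade_legacy_chillus` and the flag `convert_before_rra` are hypothetical. By default the sketch leaves <na, virama, ZWJ> untouched when rra follows, because that four-character sequence is ambiguous between the stacked and the side-by-side form, as discussed under the legacy representations of /ṉṯa/.

```python
import re

NA = "\u0D28"
RRA = "\u0D31"

# Base consonant of each legacy <consonant, virama, ZWJ> sequence, mapped to
# the atomic chillu character from Table 12-40.
LEGACY_TO_ATOMIC = {
    "\u0D23": "\u0D7A",  # nna -> chillu nn
    "\u0D28": "\u0D7B",  # na  -> chillu n
    "\u0D30": "\u0D7C",  # ra  -> chillu rr
    "\u0D32": "\u0D7D",  # la  -> chillu l
    "\u0D33": "\u0D7E",  # lla -> chillu ll
}

# One of the five consonants, followed by virama and ZWJ.
LEGACY_CHILLU = re.compile("([\u0D23\u0D28\u0D30\u0D32\u0D33])\u0D4D\u200D")


def upgrade_legacy_chillus(text, convert_before_rra=False):
    """Replace legacy chillu sequences with atomic chillu characters.

    Unless convert_before_rra is True, <na, virama, ZWJ> immediately followed
    by rra is kept as is, since its intended rendering is ambiguous.
    """
    def replace(match):
        ambiguous = match.group(1) == NA and text.startswith(RRA, match.end())
        if ambiguous and not convert_before_rra:
            return match.group(0)
        return LEGACY_TO_ATOMIC[match.group(1)]
    return LEGACY_CHILLU.sub(replace, text)
```

Implementations that must also read legacy data at run time, rather than convert it once, would need to apply the same correspondence when comparing or searching text.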
#12.9.4 Malayalam Numbers and Punctuation
#Archaic Numbers. The archaic numbering system for Malayalam included numbers for 10, 100, and 1000, as well as signs for fractions. Many Malayalam-specific fraction signs are encoded in the Malayalam block. Malayalam also made use of the fraction signs for one quarter, one half, and three quarters encoded in the Common Indic Number Forms block.
#Date Mark. The date mark is used only for the day of the month in dates; it is roughly the equivalent of “th” in “June 5th.” While it has been used in modern times, it is not seen as much in contemporary use.
#Punctuation. Danda and double danda marks as well as some other unified punctuation used with Malayalam are found in the Devanagari block; see Section 12.1, Devanagari.