Version | 8 |
Authors | Asmus Freytag, Rick McGowan, and Ken Whistler |
Date | August 13, 2024 |
This Version | https://www.unicode.org/notes/tn27/tn27-8.html |
Previous Version | https://www.unicode.org/notes/tn27/tn27-7.html |
Latest Version | https://www.unicode.org/notes/tn27/ |
This document provides information on many known anomalies in the formal character names in the Unicode Standard.
This document is a Unicode Technical Note. Sole responsibility for its contents rests with the author(s). Publication does not imply any endorsement by the Unicode Consortium.
For information on Unicode Technical Notes, including criteria for acceptance, see https://www.unicode.org/notes/.
In this document we list all Unicode character names with known clerical errors in the spelling of their names at the time of its writing. In addition, we have compiled information on many misnamed characters, misleading character names, and characters with other known problems with their names.
Because the Unicode Standard is a character encoding standard and not the Universal Encyclopedia of Writing Systems and Character Identity, the stability and uniqueness of published character names is far more important than the correctness of the name. The published character names are normative for the purposes of the Unicode Standard and the large number of other IT standards that reference it. These standards require stable identifiers, and character names must therefore be immutable: any change of character names is almost as disruptive of the standards as changing code points for characters would be. Accordingly, the Unicode Consortium has adopted the Name Stability Policy, preventing changes in character names. As a result, errors in character names cannot be corrected. Instead, important character name anomalies are documented with annotations in the Unicode Character Code Charts.
The requirement for a unique and stable character name that can be used as a formal identifier does not mean that the Unicode Standard dictates to anyone what the name of any given letter in their writing system should properly be, whether in English or in any other language. The Unicode Code Charts provide informative aliases for a large number of characters, the names of which are not anomalous or defective. This is because different user communities often use different names for the same character, even in English.
One of the reasons why the Unicode Standard publishes many informative aliases in the Unicode names list is that there often are much better, more communicative names for particular characters, even in English, than the normative names in the data file. For example, U+002F SOLIDUS is more widely known among its American users as slash. Informal aliases are useful in describing a character, but cannot be used as identifiers, because they are not guaranteed to be unique or stable. Users are free to use such aliases and other names, as long as they are not misrepresented as corrections to the standard, but instead used as alternative, more useful names for characters in the standard.
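The difference between a normative name and an informal alias can be sketched in code. The example below assumes Python's standard `unicodedata` module as the lookup mechanism; the note itself does not prescribe any tooling, and the variable names are illustrative only.

```python
import unicodedata

# The normative name is a stable identifier and resolves to one character.
solidus = unicodedata.lookup("SOLIDUS")
print(f"U+{ord(solidus):04X} {unicodedata.name(solidus)}")

# The informal alias "slash" is descriptive only. It is not an identifier,
# so a name lookup is expected to fail with KeyError.
try:
    unicodedata.lookup("SLASH")
    slash_is_identifier = True
except KeyError:
    slash_is_identifier = False
print("SLASH usable as identifier:", slash_is_identifier)
```

Software that wants to accept friendlier names would need its own mapping from informal aliases to code points, kept separate from the normative names.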
For character names that were encoded with misspelled words as part of their name, or that exhibit other serious errors, the Unicode Standard has adopted normative character name aliases. These formal name aliases can be used as an alternative, normative identifier for the character without the need to preserve the original spelling or other error in the character name. While this means that some characters can have more than one identifier, each identifier continues to uniquely refer to a single character. Formal name aliases are documented in the NameAliases.txt file in the Unicode Character Database. Formal name aliases are also documented in the Unicode Code Charts. We have not documented them all here; instead, we merely indicate for which characters formal aliases exist at the time of this writing.
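As a rough illustration of how formal name aliases behave as identifiers, the sketch below parses a few lines laid out in the semicolon-separated NameAliases.txt format (code point; alias; type) and checks them against Python's `unicodedata` module. The embedded `SAMPLE` is a small excerpt chosen for illustration, `parse_aliases` is a hypothetical helper, and the sketch assumes a Python build whose `unicodedata.lookup` recognizes formal aliases (documented since Python 3.3).

```python
import unicodedata

# Illustrative excerpt in the NameAliases.txt layout: code point;alias;type
SAMPLE = """\
# Comment lines start with '#'.
01A2;LATIN CAPITAL LETTER GHA;correction
01A3;LATIN SMALL LETTER GHA;correction
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction
"""


def parse_aliases(text):
    """Hypothetical helper: map each code point to its (alias, type) pairs."""
    aliases = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        aliases.setdefault(int(code, 16), []).append((alias, kind))
    return aliases


aliases = parse_aliases(SAMPLE)
for cp, entries in aliases.items():
    original = unicodedata.name(chr(cp))  # the immutable, possibly flawed name
    for alias, kind in entries:
        # Both identifiers should resolve to the same single character.
        resolved = unicodedata.lookup(alias)
        print(f"U+{cp:04X} {original} -> {alias} ({kind}): {resolved == chr(cp)}")
```

Note that `unicodedata.name` keeps returning the original name; the alias is an additional identifier, not a replacement.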
In some cases, annotations have been added to the names list in the Unicode Standard to document various lesser problems, but to date there has been no full listing of all known problems.
The authors therefore intend this Technical Note to serve as a convenient summary of the information about character name anomalies in the Unicode Standard at the time of its writing. It will be updated from time to time as additional anomalies become known. While the information in this technical note is based on information published in the Unicode Standard, the selection and manner of presentation in this document reflect choices made by its authors; it does not in any way supersede the information in the Unicode Standard.
This section lists character names with known anomalies, including those for which a formal name alias has been defined. It provides further information about some names that have been the objects of discussion or inquiry. As issues are reported, additional entries may be added at any time and without notice. While many of the explanations below are based on annotations in the Unicode code charts, they have been edited or re-stated by the authors.
U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
- Even though this is encoded as a single character, it is not usually considered a single letter.
U+01A2 LATIN CAPITAL LETTER OI
U+01A3 LATIN SMALL LETTER OI
- These should have been called letter GHA. They are neither pronounced 'oi' nor based on the letters 'o' and 'i'. Formal name aliases correcting these to LATIN CAPITAL LETTER GHA and LATIN SMALL LETTER GHA have been defined.
U+01BE LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE
- This is actually based on a ligation of "ts", not an inverted glottal stop.
U+0238 LATIN SMALL LETTER DB DIGRAPH
U+0239 LATIN SMALL LETTER QP DIGRAPH
- These are actually ligatures, rather than digraphs.
U+025B LATIN SMALL LETTER OPEN E
- This is actually a Latin epsilon and should have been so called.
U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
- Actually a closed reversed epsilon (reversed form of U+025B).
U+0285 LATIN SMALL LETTER SQUAT REVERSED ESH
- This is actually a reversed fishhook r with retroflex hook.
U+02C7 CARON
U+030C COMBINING CARON
- The "caron" should have been called hacek and combining hacek. The term "caron" is suspected by some to be an invention of some early standards body, but it has also been claimed by others to have been in use at Linotype before the days of digital typography. Its true origin may be lost in the mists of time.
U+034F COMBINING GRAPHEME JOINER
- The name does not describe the function of this character. Despite its name, it does not join graphemes. For more information, see Section 7.9, Combining Marks, of the Unicode Standard.
U+039B GREEK CAPITAL LETTER LAMDA
U+03BB GREEK SMALL LETTER LAMDA
- The use of the spelling lamda derives from ISO 10646. This does not mean that it is more correct than lambda, merely that the spelling without the 'b' is the one used in the formal character names.
U+04A5 CYRILLIC SMALL LIGATURE EN GHE
U+04B5 CYRILLIC SMALL LIGATURE TE TSE
U+04D5 CYRILLIC SMALL LIGATURE A IE
- Despite their names, these are not decomposable ligatures.
U+0598 HEBREW ACCENT ZARQA
- Perhaps should have been called Hebrew accent tsinnorit. May also be used for zarqa when shown on an accented non-final letter. See Appendix A.
U+05AE HEBREW ACCENT ZINOR
- Should have been called Hebrew accent zarqa (= tsinor). See Appendix A.
U+0670 ARABIC LETTER SUPERSCRIPT ALEF
- Not an Arabic letter, but a vowel sign.
U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
- These would have been better named as ligatures.
U+0709 SYRIAC SUBLINEAR COLON SKEWED
- The direction of the skewing was misinterpreted when the character was named. It should have been "... SKEWED LEFT". A formal name alias correcting this error has been defined.
U+0964 DEVANAGARI DANDA
U+0965 DEVANAGARI DOUBLE DANDA
- Despite the fact that these characters have "DEVANAGARI" in their names, these punctuation marks are intended for common use for the scripts of India.
U+0A01 GURMUKHI SIGN ADAK BINDI
- The spelling of the word Adak with a single 'd' is inconsistent with U+0A71 GURMUKHI ADDAK and should really have had two d's.
U+0B83 TAMIL SIGN VISARGA
- This character is actually the aaytham, and is not used as a visarga in Tamil.
U+0CDE KANNADA LETTER FA
- There is no Kannada letter 'fa'; this character represents the syllable 'llla'. A formal name alias correcting this error has been defined.
U+0E9D LAO LETTER FO TAM
- The name for this character should have been fo sung, but that name is already used for U+0E9F. A formal name alias LAO LETTER FO FON correcting this error has been defined.
U+0E9F LAO LETTER FO SUNG
- The name for this character should have been fo tam, but that name is already used for U+0E9D. A formal name alias LAO LETTER FO FAY correcting this error has been defined.
U+0EA3 LAO LETTER LO LING
- The name for this character should have been lo loot, but that name is already used for U+0EA5. A formal name alias LAO LETTER RO correcting this error has been defined.
U+0EA5 LAO LETTER LO LOOT
- The name for this character should have been lo ling, but that name is already used for U+0EA3. A formal name alias LAO LETTER LO correcting this error has been defined.
U+0F0A TIBETAN MARK BKA- SHOG YIG MGO
- This character is used to indicate that a document is addressed to a superior (the "petition honorific"), but the Tibetan name actually indicates a superior addressing an inferior ("starting flourish for giving a command").
U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
- The tsheg mark is not restricted to intersyllabic usage, and would have been better named Tibetan mark tsheg.
U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
- This character is not a delimiter, but is a non-breaking version of the tsheg mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the shad mark (U+0F0D).
U+0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
- The syllable "BSKA-" does not occur naturally in Tibetan, and is a mistake for "BKA-" (cf. U+0F0A). A formal name alias correcting this error has been defined.
U+11EC HANGUL JONGSEONG IEUNG-KIYEOK
U+11ED HANGUL JONGSEONG IEUNG-SSANGKIYEOK
U+11EE HANGUL JONGSEONG SSANGIEUNG
U+11EF HANGUL JONGSEONG IEUNG-KHIEUKH
- In each case the "IEUNG" part of the names was a misinterpretation of a component that should be named "YESIEUNG". Formal name aliases correcting this error have been defined.
U+156F CANADIAN SYLLABICS TTH
- There is no 'tth' syllable. A better name would have been Canadian Syllabics asterisk.
U+178E KHMER LETTER NNO
- As this character belongs to the first register, its correct transliteration is nna, not NNO.
U+179E KHMER LETTER SSO
- As this character belongs to the first register, its correct transliteration is ssa, not SSO.
U+1BBD SUNDANESE LETTER BHA
- This was a mistaken identification of this letter. It should have been identified as an archaic i. A formal name alias correcting this error has been defined.
U+200B ZERO WIDTH SPACE
- This isn't a "space". It is an invisible character that can be used to provide line break opportunities.
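A small sketch of what this means for software, assuming Python's standard `unicodedata` module and built-in string methods (neither is mentioned in the note): the character's General_Category is a format control rather than a space separator, and whitespace-based splitting does not treat it as a space.

```python
import unicodedata

zwsp = "\u200b"

# General_Category: "Cf" (format control), unlike U+0020 SPACE, which is "Zs".
zwsp_category = unicodedata.category(zwsp)
space_category = unicodedata.category(" ")
print(unicodedata.name(zwsp), zwsp_category, space_category)

# Whitespace-aware string methods do not see it as a space either.
print(zwsp.isspace())
parts = f"line{zwsp}break".split()
print(len(parts))
```

Code that tokenizes on whitespace will therefore keep text on both sides of a ZERO WIDTH SPACE together unless it handles U+200B explicitly.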
U+2113 SCRIPT SMALL L
- Despite its character name, this symbol is derived from a special italicized version of the small letter "L".
U+2118 SCRIPT CAPITAL P
- Should have been called calligraphic small p or Weierstrass elliptic function symbol, which is what it is used for. It is not a capital "P" at all. A formal name alias correcting this to WEIERSTRASS ELLIPTIC FUNCTION has been defined.
U+234A APL FUNCTIONAL SYMBOL DOWN TACK UNDERBAR
U+234E APL FUNCTIONAL SYMBOL DOWN TACK JOT
U+2351 APL FUNCTIONAL SYMBOL UP TACK OVERBAR
U+2355 APL FUNCTIONAL SYMBOL UP TACK JOT
U+2361 APL FUNCTIONAL SYMBOL UP TACK DIAERESIS
- The tack symbols among the APL functional symbol set were originally named according to the Bosworth convention about the sense of up and down when referring to tacks. (That stemmed from an early registration of APL characters pre-dating the Unicode Standard, which was used during the initial encoding of APL functional symbols.) Other tack symbols in the Unicode Standard were named according to the London convention. That resulted in the inconsistency in the naming of tack symbols. APL specifications have subsequently adopted the London convention, so the names of these five symbols no longer match APL usage for up and down.
U+2448 OCR DASH
U+2449 OCR CUSTOMER ACCOUNT NUMBER
- These two symbols were misinterpreted when named. They are magnetic ink character recognition (MICR) symbols based on ISO 1004:1995. Formal name aliases correcting them to MICR ON US SYMBOL and MICR DASH SYMBOL, respectively, have been defined.
U+2629 CROSS OF JERUSALEM
- The cross shown is actually a simple cross potent. The actual cross of Jerusalem is a cross potent with a small crosslet added at each corner.
U+262B FARSI SYMBOL
- This symbol is so named because, as a symbol of Iran, it cannot be encoded in ISO standards.
U+2B7A LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE
U+2B7C RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE HORIZONTAL STROKE
- The names of these two symbols include incorrect descriptions of the orientation of the strokes, as the result of a copy/paste error. Formal name aliases correcting them to LEFTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE and RIGHTWARDS TRIANGLE-HEADED ARROW WITH DOUBLE VERTICAL STROKE, respectively, have been defined.
U+3021 HANGZHOU NUMERAL ONE
U+3022 HANGZHOU NUMERAL TWO
U+3023 HANGZHOU NUMERAL THREE
U+3024 HANGZHOU NUMERAL FOUR
U+3025 HANGZHOU NUMERAL FIVE
U+3026 HANGZHOU NUMERAL SIX
U+3027 HANGZHOU NUMERAL SEVEN
U+3028 HANGZHOU NUMERAL EIGHT
U+3029 HANGZHOU NUMERAL NINE
U+3038 HANGZHOU NUMERAL TEN
U+3039 HANGZHOU NUMERAL TWENTY
U+303A HANGZHOU NUMERAL THIRTY
- The Suzhou numerals (Chinese su1zhou1ma3zi) are special numeric forms used by traders to display the prices of goods. The use of "HANGZHOU" in the names is a misnomer.
U+3036 CIRCLED POSTAL MARK
- This character, despite its appearance, is not used as a postal mark. It should have been named symbol for type B electronics to be consistent with U+2B97 SYMBOL FOR TYPE A ELECTRONICS.
U+327C CIRCLED KOREAN CHARACTER CHAMKO
U+327D CIRCLED KOREAN CHARACTER JUEUI
- An instance of inconsistent transliterations, resulting from unreconciled North/South Korean positions.
U+A015 YI SYLLABLE WU
- This is not a syllable pronounced "wu", but is actually a syllable iteration mark. A formal name alias correcting this to YI SYLLABLE ITERATION MARK has been defined.
U+AA6E MYANMAR LETTER KHAMTI HHA
- This was a typo in the names list for LLA. A formal name alias correcting this error has been defined.
U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E
U+FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F
U+FA11 CJK COMPATIBILITY IDEOGRAPH-FA11
U+FA13 CJK COMPATIBILITY IDEOGRAPH-FA13
U+FA14 CJK COMPATIBILITY IDEOGRAPH-FA14
U+FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F
U+FA21 CJK COMPATIBILITY IDEOGRAPH-FA21
U+FA23 CJK COMPATIBILITY IDEOGRAPH-FA23
U+FA24 CJK COMPATIBILITY IDEOGRAPH-FA24
U+FA27 CJK COMPATIBILITY IDEOGRAPH-FA27
U+FA28 CJK COMPATIBILITY IDEOGRAPH-FA28
U+FA29 CJK COMPATIBILITY IDEOGRAPH-FA29
- These 12 characters are unified CJK ideographs, not compatibility ideographs, despite their names.
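One observable consequence is that these twelve characters carry no decomposition mapping, whereas a true compatibility ideograph does. The following sketch assumes Python's standard `unicodedata` module; `UNIFIED` simply restates the code points listed above, and U+F900 is picked here as a contrasting example that is not part of this entry.

```python
import unicodedata

# The twelve code points listed above.
UNIFIED = [0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
           0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29]

# None of them has a decomposition mapping, so normalization leaves them alone.
undecomposed = [cp for cp in UNIFIED
                if unicodedata.decomposition(chr(cp)) == ""]
print(len(undecomposed), "of", len(UNIFIED), "have no decomposition")

stable = all(unicodedata.normalize("NFKC", chr(cp)) == chr(cp) for cp in UNIFIED)
print("unchanged by NFKC:", stable)

# For contrast: U+F900, a genuine compatibility ideograph, maps to U+8C48.
contrast = unicodedata.decomposition("\uf900")
print("U+F900 decomposes to:", contrast)
```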
U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
- A spelling error: "brakcet" should be "bracket". A formal name alias correcting this error has been defined.
U+FEFF ZERO WIDTH NO-BREAK SPACE
- Byte Order Mark (Naming it ZWNBSP was a mistake from the start.)
U+122D4 CUNEIFORM SIGN SHIR TENU
U+122D5 CUNEIFORM SIGN SHIR OVER SHIR BUR OVER BUR
- The NU11 component in these signs was mistakenly identified as SHIR. Formal aliases have been defined correcting the names to CUNEIFORM SIGN NU11 TENU and CUNEIFORM SIGN NU11 OVER NU11 BUR OVER BUR, respectively.
U+12327 CUNEIFORM SIGN UN GUNU
- Due to a glyph mix-up with U+12326 CUNEIFORM SIGN UN, this character was erroneously described as the gunû of the sign UN; instead UN is the gunû of this sign. A formal alias has been defined correcting the name of this sign to CUNEIFORM SIGN KALAM, according to its value kalam.
U+1680B BAMUM LETTER PHASE-A MAEMBGBIEE
- This was a typo in the names list. A formal name alias correcting this error has been defined.
U+16E56 MEDEFAIDRIN CAPITAL LETTER HP
U+16E57 MEDEFAIDRIN CAPITAL LETTER NY
U+16E76 MEDEFAIDRIN SMALL LETTER HP
U+16E77 MEDEFAIDRIN SMALL LETTER NY
- These were the result of mistransliterations in the original proposal that were not caught on review. The letter "HP" should be "H" and the letter "NY" should be "NG". Formal name aliases correcting this error have been defined.
U+1B001 HIRAGANA LETTER ARCHAIC YE
- This Hiragana character is a member of the larger collection of hentaigana. The preferred name is HENTAIGANA LETTER E-1. A formal name alias noting this preference has been defined (for Unicode 10.0).
U+1D0C5 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
- A spelling error: "fhtora" should be "fthora". A formal name alias correcting this error has been defined.
U+1D300 MONOGRAM FOR EARTH
U+1D301 DIGRAM FOR HEAVENLY EARTH
U+1D302 DIGRAM FOR HUMAN EARTH
U+1D303 DIGRAM FOR EARTHLY HEAVEN
U+1D304 DIGRAM FOR EARTHLY HUMAN
U+1D305 DIGRAM FOR EARTH
- The character names for the monogram and five digrams were incorrectly matched up with the interpretation of the 3 dot line (人 rén = human) and the broken line (地 dì = earth). As a result, all six names are misnomers. They should have been named as follows:
U+1D300 MONOGRAM FOR HUMAN
U+1D301 DIGRAM FOR HEAVENLY HUMAN
U+1D302 DIGRAM FOR EARTHLY HUMAN
U+1D303 DIGRAM FOR HUMANLY HEAVEN
U+1D304 DIGRAM FOR HUMANLY EARTH
U+1D305 DIGRAM FOR HUMANLY HUMAN
U+1D6B2 MATHEMATICAL BOLD CAPITAL LAMDA
U+1D6CC MATHEMATICAL BOLD SMALL LAMDA
U+1D6EC MATHEMATICAL ITALIC CAPITAL LAMDA
U+1D706 MATHEMATICAL ITALIC SMALL LAMDA
U+1D726 MATHEMATICAL BOLD ITALIC CAPITAL LAMDA
U+1D740 MATHEMATICAL BOLD ITALIC SMALL LAMDA
U+1D760 MATHEMATICAL SANS-SERIF BOLD CAPITAL LAMDA
U+1D77A MATHEMATICAL SANS-SERIF BOLD SMALL LAMDA
U+1D79A MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL LAMDA
- For consistency with the naming of U+039B GREEK CAPITAL LETTER LAMDA (see there).
U+1E899 MENDE KIKAKUI SYLLABLE M172 MBOO
U+1E89A MENDE KIKAKUI SYLLABLE M174 MBO
- The transliterated spellings for the intended syllables were inadvertently swapped in the source originally defining these characters. The corrected syllable transliterations have been supplied in formal name aliases for these two characters.
There are two separate cantillation systems in the Hebrew Bible. One is used for Psalms, Proverbs and (most of) Job, (the "poetic" books, hence the "poetic system"), and the other is used everywhere else. The two systems have structural similarities and share some graphemes, but not all. In modern printing the accents have roughly the same shape; old manuscripts actually had them written slightly differently. In the prose system there is an accent called ZARQA, which is postposed (on or to the left of the last letter), and in the poetic system there is one called TSINOR (and also zarqa and vice-versa; each of these has many names) which has the same shape and placement and even an analogous function in the structure of the cantillations. There is another accent, only in the poetic system, called the TSINNORIT (a diminutive of tsinor), which occurs directly above its letter, and is (almost?) never on the last letter of its word. (More modern printing tends to put the zarqa right on top of its letter too, but that's just a printing preference). If you look closely at some old manuscripts, you can tell that tsinnorit has a slightly different shape than zarqa/tsinor.
As encoded in Unicode, there are ZARQA (U+0598) and ZINOR (U+05AE) [sic]. By the usual meanings of those names, those should properly be synonyms, the same accent, but they're not. While the word "zinor" would be mnemonic of "tsinnorit," it's the wrong way around in the character names: ZINOR has the combining class of above-postposed, and ZARQA is encoded to go directly above the letter. So, to encode a zarqa or a tsinor, you need to use ZINOR, and to encode a tsinnorit, you need to use ZARQA.
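The difference in placement can be checked against the canonical combining classes of the two characters. This is a minimal sketch assuming Python's standard `unicodedata` module; the values it reports are those of the Unicode Character Database bundled with the interpreter, where 230 denotes Above and 228 denotes Above_Left (which corresponds to the postposed position in right-to-left Hebrew text).

```python
import unicodedata

zarqa = "\u0598"  # HEBREW ACCENT ZARQA: used to encode a tsinnorit
zinor = "\u05ae"  # HEBREW ACCENT ZINOR: used to encode a zarqa or tsinor

# Canonical combining class: 230 = Above, 228 = Above_Left.
ccc = {unicodedata.name(ch): unicodedata.combining(ch) for ch in (zarqa, zinor)}
for name, value in ccc.items():
    print(f"{name}: ccc={value}")
```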
Thanks to John Hudson, James Kass, KAWABATA Taichi, Ken Lunde, Robin Leroy, Marc Lodewijck, Artur Q.A., Mark Shoulson, and Andrew West for their contributions.
The following summarizes modifications from the previous version of this document.
© 2006–2024 Asmus Freytag, Rick McGowan, Ken Whistler. This publication is protected by copyright, and permission must be obtained from the author and Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use.
Use of this publication is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.