In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar but may have differing meanings. The designation is also applied to sequences of characters sharing these properties.
In 2008, the Unicode Consortium published its Technical Report #36[1] on a range of issues deriving from the visual similarity of characters, both within a single script and between different scripts.
Examples of homoglyphic symbols are (a) the diaeresis and umlaut (both a pair of dots, but with different meanings, although encoded with the same code points); and (b) the hyphen and minus sign (both a short horizontal stroke, but with different meanings, although often encoded with the same code point). Among digits and letters, digit 1 and lowercase l are always encoded separately but are given very similar glyphs in many typefaces, as are digit 0 and capital O. Virtually every homoglyphic pair of characters can be differentiated graphically with clearly distinguishable glyphs and separate code points, but this is not always done. Typefaces that do not emphatically distinguish the one/el and zero/oh homoglyphs are considered unsuitable for writing formulas, URLs, source code, IDs and other text where characters cannot always be differentiated without context. Fonts that distinguish these glyphs, for example by means of a slashed zero, are preferred for those uses.
The term homograph is sometimes misused as a synonym for homoglyph, but in the usual linguistic sense, homographs are words that are spelled the same but have different meanings: a property of words, not characters.
Allographs are typeface design variants that look different but mean the same thing, for example two differently drawn forms of ⟨g⟩ (such as single-storey and double-storey), or a dollar sign with one or two strokes. The term synoglyph has a similar but slightly more abstract meaning: for example, the symbol ⟨£⟩ and the letter ⟨L⟩ (in Lsd) both mean the pound sterling,[2] but only in that context. Allographs and synoglyphs are also known informally as display variants.
In the days of early mechanical typewriters, the diaeresis and the umlaut were typed with the same key (using the "backspace and over-type" technique), which was also used for a double inverted comma. However, the umlaut originated specifically as a pair of short vertical lines, not two dots (see Sütterlin). Incidentally, the two dots above the letter E in Albanian are described as a diaeresis but do not fulfil the function of a diaeresis.[3]
Two common and important sets of homoglyphs in use today are the digit zero and the capital letter O (i.e. 0 and O); and the digit one, the lowercase letter L and the uppercase letter i (i.e. 1, l and I). In the early days of mechanical typewriters there was little or no visual difference between these glyphs, and typists treated them interchangeably as keyboarding shortcuts. In fact, most keyboards did not even have a key for the digit "1", requiring users to type the letter "l" instead, and some also omitted "0". As these same typists transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them and were an occasional source of confusion.
Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and drawing the digit one with prominent serifs. Early computer print-outs went even further and marked the zero with a slash or dot, which led to a new conflict involving the Scandinavian letter "Ø" and the Greek letter Φ (phi). Redesigning typefaces to differentiate these characters has reduced confusion. The degree to which two different characters appear the same to a given observer is called the "visual similarity".[4]
Some type designs conform to the DIN 1450 legibility standard by carefully designing such characters to be easy to distinguish: a slashed zero to distinguish it from capital O; lowercase l with a tail and uppercase I with serifs to distinguish them from the digit 1; distinguishing the numeral 5 from the capital S; etc.[5]
An example of confusion due to near-homoglyphs arose from the use of a ⟨y⟩ to represent a ⟨þ⟩ (thorn). Early English typesetters imported Dutch typesets that did not contain the latter character, so they used the letter ⟨y⟩ instead because (in Blackletter typeface) the two look sufficiently similar.[6] In modern times this has led to such phenomena as Ye olde shoppe, implying incorrectly that the word the was formerly written ye /jiː/ rather than þe. The spelling of the name Menzies (pronounced Mengis and originally spelled Menȝies) arose for the same reason: the letter ⟨z⟩ was substituted for ⟨ȝ⟩ (yogh).
Some other combinations of letters look similar: for instance, rn looks similar to m, cl looks similar to d, and vv looks similar to w.
In certain narrow-spaced fonts (such as Tahoma), placing the letter c next to a letter such as j, l or i will create a homoglyph, such as cj cl ci (g d a).
When some characters are placed next to each other and seen together at a glance, they give the visual impression of another, unrelated character. More precisely, some typographic ligatures can look similar to standalone glyphs. For example, the fi ligature (ﬁ) can look similar to A in some typefaces or fonts. This potential for confusion is sometimes used as an argument against the use of ligatures.[citation needed]
Homoglyphs of all kinds can be detected through a process called 'dual canonicalization'.[4] The first step in this process is to identify homoglyph sets, namely characters appearing the same to a given observer. From here, a single token is specified to represent the homoglyph set. This token is called a canon. The next step is to convert each character in the text to the corresponding canon in a process called canonicalization. If the canons of two runs of text are the same but the original text is different, then a homoglyph exists in the text.
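The canonicalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the homoglyph sets in `HOMOGLYPH_SETS` are a tiny, hand-picked, hypothetical table (a real system would use a far larger one), and the function names `canonicalize` and `is_homoglyph_pair` are invented here for the example.

```python
# Hypothetical homoglyph sets: each key is the canon (the representative token),
# each value lists the characters treated as looking the same.
HOMOGLYPH_SETS = {
    "0": "0O",          # digit zero, capital letter O
    "1": "1lI",         # digit one, lowercase L, uppercase i
    "a": "a\u0430",     # Latin small a, Cyrillic small a
    "o": "o\u043e",     # Latin small o, Cyrillic small o
}

# Invert the table so each character maps directly to its canon.
CANON_OF = {
    ch: canon
    for canon, members in HOMOGLYPH_SETS.items()
    for ch in members
}


def canonicalize(text: str) -> str:
    """Replace every character by its canon; unknown characters are kept."""
    return "".join(CANON_OF.get(ch, ch) for ch in text)


def is_homoglyph_pair(first: str, second: str) -> bool:
    """True when the texts differ but their canonical forms are identical."""
    return first != second and canonicalize(first) == canonicalize(second)


if __name__ == "__main__":
    print(is_homoglyph_pair("paypal", "p\u0430ypal"))  # second string uses a Cyrillic letter
    print(is_homoglyph_pair("paypal", "paypal"))       # identical text, so not a pair
```

Which characters share a set is a judgment about visual similarity for a given observer and typeface, so the table is the part most likely to need adjusting in practice.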
Homoglyph attacks can be mitigated through a combination of user awareness and proactive measures. Users should be educated about the risks of homoglyph attacks and urged to inspect URLs carefully before clicking.[7] Security tools that scan for homoglyph variations in domain names can automate the detection and prevention of potential threats. In addition, strict domain name monitoring and registration policies can help identify and neutralize homoglyph-related risks promptly. Together, these measures strengthen an organization's defenses against homoglyph attacks.
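As a rough illustration of what scanning for homoglyph variations of a domain name might involve, the sketch below generates every variant of a label in which exactly one character is replaced by a look-alike. The `LOOKALIKES` table and the function name `single_substitution_variants` are hypothetical and deliberately minimal; this is not a description of any particular security product.

```python
from typing import Dict, List

# Hypothetical, deliberately tiny look-alike table (assumption for this sketch).
LOOKALIKES: Dict[str, List[str]] = {
    "a": ["\u0430"],        # Cyrillic small letter a
    "e": ["\u0435"],        # Cyrillic small letter ie
    "o": ["\u043e", "0"],   # Cyrillic small letter o, digit zero
    "l": ["1", "I"],        # digit one, uppercase i
}


def single_substitution_variants(label: str) -> List[str]:
    """Return all labels that differ from `label` by one look-alike substitution."""
    variants = []
    for index, ch in enumerate(label):
        for substitute in LOOKALIKES.get(ch, []):
            variants.append(label[:index] + substitute + label[index + 1:])
    return variants


if __name__ == "__main__":
    for variant in single_substitution_variants("example"):
        # Print the escaped form so the substituted character is visible.
        print(variant.encode("unicode_escape").decode("ascii"))
```

A monitoring process could compare newly registered names against such a list. Substituting several characters at once makes the list grow quickly, which is one reason to prefer canonicalizing candidate names over enumerating variants.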
Unicode has code points for many strongly homoglyphic characters, known as "confusables".[1] These present security risks in a variety of situations (addressed in UTR#36)[8] and were called to particular attention in regard to internationalized domain names. In theory at least, one might deliberately spoof a domain name by replacing one character with its homoglyph, thus creating a second domain name, not readily distinguishable from the first, that can be exploited in phishing (see main article IDN homograph attack). In many typefaces, the Greek letter 'Α', the Cyrillic letter 'А' and the Latin letter 'A' are visually identical, as are the Latin letter 'a' and the Cyrillic letter 'а' (the same applies to the Latin letters "aBceHKopTxy" and the Cyrillic letters "аВсеНКорТху"). A domain name can be spoofed simply by substituting one of these forms for another in a separately registered name. There are also many examples of near-homoglyphs within the same script, such as 'í' (with an acute accent) and 'i' (with a tittle); É (E-acute), Ė (E dot above) and È (E-grave); and Í (capital I with an acute accent) and ĺ (lowercase L with acute accent). When discussing this specific security issue, any two sequences of similar characters may be assessed in terms of their potential to be taken as a 'homoglyph pair' or, if the sequences clearly appear to be words, as 'pseudo-homographs' (noting again that these terms may themselves cause confusion in other contexts). In the Chinese language, many simplified Chinese characters are homoglyphs of the corresponding traditional Chinese characters.
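That these look-alike letters are distinct characters can be checked by printing their code points and Unicode names. The short Python sketch below uses the standard `unicodedata` module; the helper name `describe` is invented for this example, and the non-Latin characters are written as escape sequences so they cannot be mistaken for their Latin counterparts.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin, Greek and Cyrillic capitals that look alike in many typefaces.
LOOKALIKE_CAPITALS = ["A", "\u0391", "\u0410"]

# The Latin string from the text and its Cyrillic look-alike, letter for letter.
LATIN = "aBceHKopTxy"
CYRILLIC = "\u0430\u0412\u0441\u0435\u041d\u041a\u043e\u0440\u0422\u0445\u0443"

if __name__ == "__main__":
    for ch in LOOKALIKE_CAPITALS:
        print(describe(ch))
    # The strings may render identically, yet they are not equal.
    print(LATIN == CYRILLIC)
```

Run as a script, this prints `U+0041 LATIN CAPITAL LETTER A`, `U+0391 GREEK CAPITAL LETTER ALPHA` and `U+0410 CYRILLIC CAPITAL LETTER A`, followed by `False` for the string comparison.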
Efforts by TLD registries and Web browser designers aim to minimize the risks of homoglyphic confusion. Commonly, this is achieved by prohibiting names which mix character sets from multiple languages (toys-Я-us.org, using the Cyrillic letter Я, would be invalid, but wíkipedia.org and wikipedia.org still exist as different websites); Canada's .ca registry goes one step further by requiring names which differ only in diacritics to have the same owner and same registrar.[9] The handling of Chinese characters varies: in .org and .info, registration of one variant renders the other unavailable to anyone, while in .biz the traditional and simplified versions of the same name are delivered as a two-domain bundle which both point to the same domain name server.
Relevant documentation can be found both on the developers' Web sites and on an IDN Forum[10] provided by ICANN.
The Cyrillic letter ⟨С⟩ (U+0421 С CYRILLIC CAPITAL LETTER ES) not only looks like Latin ⟨C⟩ (U+0043 C LATIN CAPITAL LETTER C), but also occupies the same button in JCUKEN-QWERTY hybrid layout keyboards. This design nuance can be seen on the C/С button represented in the Keyboard Monument in Yekaterinburg.
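A mixed-script check of the kind described above can be approximated as follows. This is a heuristic sketch, not how any registry or browser actually implements its policy: it assumes that the first word of a character's Unicode name (for example `LATIN` or `CYRILLIC`) is an adequate stand-in for its script, and the function names `scripts_in` and `is_mixed_script` are invented for the example.

```python
import unicodedata
from typing import Set


def scripts_in(label: str) -> Set[str]:
    """Collect a rough script tag for every letter in the label.

    Assumption: the first word of the Unicode character name approximates
    the script. Digits, hyphens and other non-letters are ignored.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when the letters of the label come from more than one script."""
    return len(scripts_in(label)) > 1


if __name__ == "__main__":
    print(is_mixed_script("toys-\u042f-us"))    # Cyrillic YA among Latin letters
    print(is_mixed_script("w\u00edkipedia"))    # i with acute accent is still Latin
```

The second call shows why such a rule does not catch names that differ only in diacritics: the accented letter belongs to the same script, so a separate policy like the .ca registry's is needed for those.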
The types used by Caxton and his contemporaries originated in Holland and Belgium, and did not provide for the continuing use of elements of the Old English alphabet such as thorn <þ>, eth <ð>, and yogh <ʒ>. The substitution of visually similar typographic forms has led to some anomalies which persist to this day in the reprinting of archaic texts and the spelling of regional words. The widely misunderstood 'ye' arose from a printers' habit that originated in Caxton's time, when printers would substitute <y> (often accompanied by a superscript <e>) for the thorn <þ> or the eth <ð>, both of which were used to denote both the voiced and non-voiced sounds, /ð/ and /θ/ (Anderson, D. (1969) The Art of Written Forms. New York: Holt, Rinehart and Winston, p 169).