An Introduction to Multilingual Web Addresses

A Web address is used to point to a resource on the Web, such as a Web page. Recent developments enable you to add non-ASCII characters to Web addresses. This article provides a high level introduction to how this works. It is aimed at content authors and general users who want to understand the basics without too many gory technical details. For simplicity, we will use examples based on HTML and HTTP. We will also address how this works for both the domain name and the remaining path information in a Web address.

Why multilingual Web addresses?

Web addresses are typically expressed using Uniform Resource Identifiers or URIs. The URI syntax defined in RFC 3986 STD 66 (Uniform Resource Identifier (URI): Generic Syntax) essentially restricts Web addresses to a small number of characters: basically, just upper and lower case letters of the English alphabet, European numerals and a small number of symbols.

The original reason for this was to aid in transcription and usability, both in computer systems and in non-computer communications, to avoid clashes with characters used conventionally as delimiters around URIs, and to facilitate entry using those input facilities available to most Internet users.

Users' expectations and use of the Internet have moved on since then, and there is now a growing need to enable use of characters from any language in Web addresses. A Web address in your own language and alphabet is easier to create, memorize, transcribe, interpret, guess, and relate to. It is also important for brand recognition. This, in turn, is better for business, better for finding things, and better for communicating. In short, better for the Web.

Imagine, for example, that all web addresses had to be written in Japanese katakana. How easy would it be for you, if you didn't read Japanese, to recognize the content or owner of the site, or type the address in your browser, or write the URI down on notepaper, etc.?

There have been several developments recently that begin to make this possible.

Basic concepts

We will refer to Web addresses that allow the use of characters from a wide range of scripts as Internationalized Resource Identifiers or IRIs. For IRIs to work, there are four main requirements:

  1. the syntax of the format where IRIs are used (eg. HTML, XML, SVG, etc.) must support the use of non-ASCII characters in Web addresses
  2. the application where IRIs are used (eg. browsers, parsers, etc.) must support the input and use of non-ASCII characters in Web addresses
  3. it must be possible to carry the information in an IRI through the necessary protocol (eg. HTTP, FTP, IMAP, etc.)
  4. it must be possible to successfully match the string of characters in your Web address against the name of the target resource on the file system or registry where it is stored.

Various document formats and specifications already support IRIs. Examples include HTML 4.0, XML 1.0 system identifiers, the XLink href attribute, XML Schema's anyURI datatype, etc. We will also see later that major browsers support the use of IRIs already.

Unfortunately, not so many protocols allow IRIs to pass through unchanged. Typically they require that the address be specified using the ASCII characters defined for URIs. There are, however, well specified ways around this, and we will describe them briefly in this article.

The fourth requirement demands that a string of characters be matched against a target whether or not those characters are represented by the same underlying bytes, ie. whether or not the same character encoding is used at each end. This is dealt with by using UTF-8 as a broker.

We will use the following fictitious Web address in most of the examples on this page:

http://JP納豆.例.jp/dir1/引き割り.html

This is a simple IRI that is composed of three parts.

What it all means. The domain name (JP納豆.例.jp) starts with 'JP' so that in the worked examples we can show what happens to ASCII text within a domain name. The rest of the domain name is read 'natto (a Japanese delicacy made from fermented soya beans) dot rei (meaning example) dot jp (Japanese country code)'. The path reads 'dir1 slash hikiwari (a type of natto) dot html'.

When it comes to dealing with requirements two to four above, there is one solution for the domain name and a different solution for the path. We will explore each of these in turn.
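The three parts of the example can be seen by parsing it with, for instance, Python's standard library (a sketch; note that the parser accepts the raw non-ASCII string and lowercases the host name):

```python
from urllib.parse import urlsplit

# Split the example IRI into its component parts.
iri = 'http://JP納豆.例.jp/dir1/引き割り.html'
parts = urlsplit(iri)

print(parts.scheme)    # the protocol: 'http'
print(parts.hostname)  # the domain name (lowercased): 'jp納豆.例.jp'
print(parts.path)      # the path: '/dir1/引き割り.html'
```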

Handling the domain name

Domain names are allocated and managed by domain name registration organizations spread around the world.

A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs 3490, 3491, 3492 and 3454, and is based on Unicode 3.2. One refers to this using the term Internationalized Domain Name or IDN.

Domain registration

The domain name registrar fixes the list of characters that people can request to be used in their country or top level domains. However, when a person requests a domain name using these characters they are actually allocated the equivalent of the domain name using a representation called punycode.

Punycode is a way of representing Unicode codepoints using only ASCII characters.

High level overview

We give a slightly more detailed worked example in the next section but, in summary, the desired Web address is stored in a document link or typed into the client's address bar using the relevant native characters. When a user clicks on the link or otherwise initiates a request, the user agent (ie. the browser or other client requesting the resource) needs to convert any native script characters in the Web address to punycode representations.

(Of course, if the user agent is unable to do this, it is always possible to express the location in punycode directly, although it is not very user friendly.)

Resolving a domain name

Let's examine the steps in resolving an Internationalized Domain Name from the user to the identification of the resource. (Remember that this looks only at how the domain name is handled. The path information is treated differently and will be described later.)

The user clicks on a hyperlink or enters the IRI in the address bar of a user agent. At this point the IRI contains non-ASCII characters that could be in any character encoding. Here is the domain name that appears in the example above.

JP納豆.例.jp

If the string that represents the domain name is not in Unicode, the user agent converts the string to Unicode. It then performs some normalization functions on the string to eliminate ambiguities that may exist in Unicode encoded text.

Normalization involves such things as converting uppercase characters to lowercase, reducing alternative representations (eg. converting half-width kana to full-width), eliminating prohibited characters (eg. spaces), etc.
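Python's unicodedata module can illustrate the kinds of mappings involved. (The exact tables are specified by nameprep, RFC 3491; case folding and NFKC normalization, shown here, are only an approximation of that process.)

```python
import unicodedata

# Case mapping: uppercase ASCII letters become lowercase.
lowered = 'JP'.casefold()
print(lowered)                # 'jp'

# Compatibility normalization folds alternative representations,
# eg. half-width katakana into the equivalent full-width forms.
halfwidth = 'ﾅｯﾄｳ'            # half-width katakana 'nattou'
fullwidth = unicodedata.normalize('NFKC', halfwidth)
print(fullwidth)              # 'ナットウ'
```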

Next, the user agent converts each of the labels (ie. pieces of text between dots) in the Unicode string to a punycode representation. A special marker ('xn--') is added to the beginning of each label containing non-ASCII characters to show that the label was not originally ASCII. The end result is not very user friendly, but accurately represents the original string of characters while using only the characters that were previously allowed for domain names. Our example now looks like this:

xn--jp-cd2fp15c.xn--fsq.jp

Note how the uppercase ASCII characters JP at the beginning of the domain name are lowercased, but still recognizable. Any existing ASCII characters in a label appear first, followed by a single hyphen and then an ASCII-based representation of any non-ASCII characters.
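Python's built-in codecs can reproduce this conversion (a sketch: the 'idna' codec performs the RFC 3490 ToASCII operation, including the normalization step described above, while the raw 'punycode' codec shows the per-label encoding without the marker):

```python
# The 'idna' codec normalizes each label and applies punycode,
# adding the 'xn--' marker to labels containing non-ASCII characters.
domain = 'JP納豆.例.jp'
ascii_form = domain.encode('idna')
print(ascii_form)                    # b'xn--jp-cd2fp15c.xn--fsq.jp'

# The raw 'punycode' codec shows a single label's encoding: any
# ASCII characters first, a hyphen, then the encoded remainder.
print('jp納豆'.encode('punycode'))   # b'jp-cd2fp15c'
print('例'.encode('punycode'))       # b'fsq'
```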

Next, the punycode is resolved by the domain name server into a numeric IP address (just like any other domain name is resolved).

Finally the user agent sends the request for the page. Since punycode contains no characters outside those normally allowed for protocols such as HTTP, there is no issue with the transmission of the address. This should simply match against a registered domain name.

Note that most top-level country codes, for example the .jp at the end of JP納豆.例.jp, still have to be in Latin characters at the moment. Since 2010, however, IANA has been progressively introducing internationalized country code top level domains, such as .مصر for Egypt and .рф for Russia.

In practice, it makes sense to register two names for your domain: one in your native script, and one using just regular ASCII characters. The latter will be more memorable and easier to type for people who do not read and write your language. For example, you could additionally register a transcription of the Japanese in Latin script, such as the following:

http://JPnatto.rei.jp/

Handling the path

Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based punycode), multi-script path names identify resources located on many kinds of platforms, whose file systems do and will continue to use many different encodings. This makes the path much more difficult to handle than the domain name.

Having dealt with the domain name using punycode, we now need to deal with the path part of an IRI. The IETF Proposed Standard RFC 3987 (Internationalized Resource Identifiers (IRIs)) defines how to deal with this.

The string matching challenge

There is already a mechanism in the URI specification for representing non-ASCII characters in URIs. What you do is represent the underlying bytes using what is referred to as percent-escaping (in the specification, the less common term percent-encoding is used). Thus, in the page you are currently reading, which is encoded in UTF-8, we could represent the filename 引き割り.html from our previous example as shown just after this paragraph. What you are seeing are two-digit hexadecimal numbers, preceded by %. These represent the bytes used to encode the Japanese characters in the string in UTF-8. Each Japanese character is represented by 3 bytes, which are transformed into three percent-escapes.

%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

Apart from the fact that this is not terribly user friendly, there is a bigger issue here. Another person may want to follow the same link from a page that uses the Shift-JIS character encoding, rather than UTF-8. In this case, if we were to use percent-escaping to transform the (same) characters in the address so that they conform to the URI requirements, we would base the escapes on the bytes that represent 引き割り.html in Shift-JIS. There are only two bytes per Japanese character in Shift-JIS, and they are different bytes from those used in UTF-8. So this would yield the totally different sequence of byte escapes shown below.

%88%F8%82%AB%8A%84%82%E8.html

So here we see that, although the URI escape mechanism allows the Japanese address to be specified, the actual result will vary according to the page of origin. How then is it possible to know how to map that onto a sequence of characters that will match the name of the resource as exposed by the system where it resides?
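The divergence can be demonstrated with Python's urllib, which percent-escapes the bytes of whichever character encoding it is told to use (a sketch using only the standard library):

```python
from urllib.parse import quote

filename = '引き割り.html'

# Escaping the UTF-8 bytes: three bytes, hence three escapes,
# per Japanese character.
print(quote(filename, encoding='utf-8'))
# %E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

# Escaping the Shift-JIS bytes of the *same* characters gives a
# completely different, incompatible escape sequence.
print(quote(filename, encoding='shift_jis'))
# %88%F8%82%AB%8A%84%82%E8.html
```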

The chief difficulty here is that there is no encoding-related meta-data associated with the URI strings to indicate what characters they represent. Even if that information were available, the total number of mappings that a server would need to support to convert any incoming string to the appropriate encoding would be extremely high.

Not only that, but the file system on which the resource itself actually resides may expose the file name using a totally different encoding, such as EUC-JP. If so, the underlying byte sequence that represents the file name as the system knows it would be different again. So how are we going to know that these byte sequences all refer to the same resource?

Note that the filename may be stored and exposed in different encodings. Under Windows NT or Windows XP the IIS or Apache 2 server exposes the file name as UTF-8, even though the operating system stores it as UTF-16.

High level overview

The IRI specification uses Unicode as a broker. It specifies that, before conversion to escapes, the IRI should be converted to UTF-8. As for IDNs, if a conversion is required by the protocol, it is the user agent that is responsible for performing that change when a request is made for a resource.

The server must also then recognize the Unicode characters in the incoming web address and map them to the encoding used for the actual resources.

(Remember that we have already dealt with the domain name part of the IRI using IDN. The rules in the IRI specification are typically only applied to the path part of the multilingual Web address.)

It is also possible to apply percent-escaping to the domain name before conversion, but clients often simply convert directly to punycode.

Resolving a path

Let us look at what the client does to send the path part of a web address to an HTTP server. Here is the path part of the earlier example Web address:

/dir1/引き割り.html

When the user clicks on a hyperlink or enters the IRI in the address bar of a user agent, the address may be in any character encoding, but that encoding is usually known.

If the string is input by the user or stored in a non-Unicode encoding, it is converted to Unicode, normalized using Unicode Normalization Form C, and encoded using the UTF-8 encoding.

The user agent then converts the non-ASCII bytes to percent-escapes. Our example now looks like this:

/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

The string is now in URI form, and will be acceptable to protocols such as HTTP. Note how the ASCII characters 'dir1' and '.html' are just passed through without change, since these characters are encoded in the same way in both ASCII and UTF-8.

The user agent sends the request for the page.

When this request hits the server, one of two things needs to happen:

  • if the server exposes the file names in UTF-8, the server simply accesses the resource
  • if the server uses another encoding, the server needs to convert from UTF-8.

Martin Dürst has written an Apache module called mod_fileiri to convert requests from UTF-8 to the encoding of the server.
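A server-side conversion of the kind just described can be sketched as follows: unescape the incoming path back to Unicode, then re-encode it in the file system's encoding. (The choice of EUC-JP here is purely illustrative; the actual target encoding depends on the server's file system.)

```python
from urllib.parse import unquote

# The request path as received, with UTF-8 percent-escapes.
request_path = '%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html'

# Decode the escapes back to the Unicode file name...
name = unquote(request_path, encoding='utf-8')
print(name)   # 引き割り.html

# ...then re-encode it for a file system that uses, say, EUC-JP.
euc_bytes = name.encode('euc_jp')
```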

This covers the basics. There are some additional parts of the specification that deal with finer points, such as how to handle bidirectional text in IRIs, and so on.

A sample HTTP header

Here is the first part of the HTTP header for the page request generated by our example. It shows the host name as an IDN, and the path using percent-escaping where appropriate:

GET /dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html HTTP/1.1
Host: xn--jp-cd2fp15c.xn--fsq.jp
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
…
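The end-to-end conversion that produces such a request — IDN for the host name, NFC normalization plus UTF-8 percent-escaping for the path — can be sketched with Python's standard library (the request line and Host header are composed by hand purely for illustration):

```python
import unicodedata
from urllib.parse import urlsplit, quote

iri = 'http://JP納豆.例.jp/dir1/引き割り.html'
parts = urlsplit(iri)

# Host: nameprep normalization + punycode via the 'idna' codec.
host = parts.hostname.encode('idna').decode('ascii')

# Path: Unicode Normalization Form C, then UTF-8 percent-escapes.
path = quote(unicodedata.normalize('NFC', parts.path), safe='/')

print(f'GET {path} HTTP/1.1')
print(f'Host: {host}')
```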

Does it work?

Domain Name lookup

Numerous domain name authorities already offer registration of internationalized domain names. These include providers for top level country domains such as .cn, .jp and .kr, and global top level domains such as .info, .org and .museum.

Client-side support for IDN appears in recent versions of major browsers, including Internet Explorer 7, Firefox, Mozilla, Netscape, Opera, and Safari. It only works in Internet Explorer 6 if you download a plug-in (Microsoft support pages provide some suggestions). This means that you can use IDNs in href values or the address bar, and the browser will convert the IDN to punycode and look up the host.

You can run a basic check to see whether IDNs work on your system using this simple test.

It has been an issue, until now, that IDN is not natively supported by Internet Explorer, with its huge market share. Although plug-ins are available, not all people will know how to, will want to, or will be able to install them. However, IE7 and its successors, which do support IDN, will, over time, replace most IE6 installs.

Note that, as a simple fallback solution until IDN is widely supported, content authors who want to point to a resource using an IDN could write the link text in native characters, and put a punycode representation in the href attribute. This guarantees that the user would be able to link to the resource, whatever platform they used.

If, for some reason, you wanted to, it is possible to turn off IDN support in IE7, Firefox and Mozilla.

Domain names and phishing

One of the problems associated with IDN support in browsers is that it can facilitate phishing through what are called 'homograph attacks'. Consequently, most browsers that support IDN also put in place some safeguards to protect users from such fraud.

Special thanks to Michael Monaghan and Greg Aaron for their contributions to this section.

The way browsers typically alert the user to a possible homograph attack is to display the URI in the address bar and the status bar using punycode, rather than in the original Unicode characters. Users should therefore always check the address bar after the page has loaded, or the status bar before clicking on a link.

'Homograph attack' refers to mixing characters that look alike visually in a URI in order to deceive someone about which site they are linking to. For example, in some fonts the capital 'I' looks sufficiently like an 'l' that the URI 'www.paypaI.com' seems to be taking you to a PayPal site, whereas it is most probably routing you to a place where someone will try to harvest your personal information.

Internet Explorer 7 shows the address as punycode under certain conditions, for example when the domain name contains characters outside the languages configured in the browser preferences.

Binding the behavior to the list of languages in the browser preferences also means that a language that is not in the standard list supplied by IE will always produce punycode. For example, Amharic in Ethiopic text will be displayed as punycode even if you add am to the browser preferences. (Fortunately, there don't seem to be any registries providing Amharic IDNs at the moment.)

Some fraudulent domain names may still slip through this net. In this case, IE7's normal phishing protection would step in to compare the domain with reported sites. IE7 can also, however, 'apply additional heuristics to determine if the domain name is visually ambiguous'. This is helpful when letters within the same script are visually similar.

In addition to displaying suspect IDNs in the address bar in punycode, IE7 also uses its Information Bar to signal possible danger to the user. It also uses a clickable icon at the end of the address bar to notify you when a URL contains a non-ASCII character. It also displays the address bar in all windows.

Firefox 2.x uses a different approach. It only displays domain names in Unicode for certain whitelisted top level domains. Firefox selects Top Level Domains (TLDs) that have established policies on the domain names they allow to be registered and then relies on the registration process to create safe IDNs. You can find a list of supported TLDs on the Mozilla site. If an IDN is from a TLD that is not on the list, the web address will appear in punycode form in the status and address bars. In some cases the TLD policy statements should include rules about managing visually similar characters within the set of characters allowed.

In addition, IDNs that contain particular characters (e.g. the fraction slash), even within trusted TLDs, are treated suspiciously, and cause the label to be displayed as punycode.

Opera 9.x uses a similar approach to Firefox, though it differs slightly in implementation. Officially, it only displays domain names in Unicode for whitelisted TLDs listed in opera6.ini, which is updated automatically.

For TLDs that are not on the list, Opera says that it allows domain names to use Latin 1 characters, ie. Latin characters with accents that support Western European languages. All other domain names are displayed as punycode.

In reality, tests show that Opera currently displays many characters as Unicode, regardless of whether a TLD is on the whitelist or not. One exception we found is Devanagari script, which is displayed as punycode if the TLD is not on the list.

Opera does, however, also display certain mixtures of scripts as punycode. The testing revealed this is true for combinations of Greek or Cyrillic characters with Latin characters.

Also, Opera's list of illegal characters is slightly longer than the official IDNA list. Some IDNs, while displayed as punycode in other browsers, are entirely illegal in Opera.

Safari 9.x provides a user-editable list of scripts that are allowed to be displayed natively in domain names. If a character appears in a domain name and does not belong to a script in this list, the URI is displayed as punycode.

At the time of writing, the initial whitelist contains Arabic, Armenian, Bopomofo, Canadian_Aboriginal, Devanagari, Deseret, Gujarati, Gurmukhi, Hangul, Han, Hebrew, Hiragana, Katakana_Or_Hiragana, Katakana, Latin, Tamil, Thai, and Yi. Scripts like Cyrillic, Cherokee and Greek are specifically excluded because they contain characters that are easily confused with Latin characters.

If the whitelist is emptied, any non-ASCII character causes the address to be displayed as punycode.

Mozilla 1.7x displays all IDNs as punycode.

Examples. There is a test page you can use to see how your browser displays IDNs in the status bar. See also the page that gathers results for a number of browsers.

Other phishing concerns and registry-level solutions. Some potential aspects of phishing control need to be addressed by the registration authorities, and built into their policies for IDN registration.

Some registration authorities have to carefully consider how to manage equivalent ways of writing the same word. For example, the word 'hindi' can be written in Devanagari as either हिंदी (using an anusvara) or हिन्दी (using a special glyph for NA).

There is a similar issue with the use of simplified vs. traditional characters in the Chinese Han script.

Another issue arises where two characters or combinations of characters within a single script look very similar; for instance, the Tamil letter KA and the Tamil digit one are indistinguishable. In other cases, diacritic marks attached to characters may be difficult to distinguish in small font sizes.

As mentioned earlier, these issues exist even in the Latin (ASCII) character set. For example, the letter O may occasionally be confused with the digit zero (0), and the lower case letter L (l) may be confused with the digit one (1), especially depending upon the font and display size used.

On the other hand, a single registry may also have to deal with similar and potentially confusable characters across different scripts. For example, Tamil and Malayalam are two different Indic scripts that may both be handled by the same registry, and the Tamil letter KA க (U+0B95) is very similar to the Malayalam letter KA (U+0D15). Another example is the implications of registering the label ера (which uses Cyrillic characters only) vs. epa (which uses Latin characters only) for a TLD such as .museum that has to deal with multiple scripts. It could cause significant confusion if more than one applicant was able to register them separately.
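A crude check for this kind of cross-script confusion can be sketched with Python's unicodedata module. (Real detectors use the Unicode script property; taking the first word of each character's name, as here, is only a rough approximation.)

```python
import unicodedata

def scripts_in(label):
    # Take the first word of each character's Unicode name as a
    # rough script indicator (e.g. 'LATIN', 'CYRILLIC', 'GREEK').
    return {unicodedata.name(c).split()[0] for c in label}

# Cyrillic 'ера' and Latin 'epa' look alike but use different codepoints:
print(scripts_in('ера'))   # {'CYRILLIC'}
print(scripts_in('epa'))   # {'LATIN'}

# A label mixing the two scripts could be flagged as suspicious:
mixed = scripts_in('eра')  # Latin 'e' followed by Cyrillic 'ра'
print(len(mixed) > 1)      # True
```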

In some cases these scenarios can be documented as rules that can be picked up and applied by user agents for phishing detection, but they are often best dealt with at the point of registration.

One registry-level approach is to decide which characters (i.e. Unicode points) in a given language will be allowed during registration. These lists are called language tables, and are developed by registries in cooperation with qualified language authorities. For example, the Indian language authority could allow use of the Tamil letter KA (U+0B95) but not the Tamil digit one (U+0BE7) in .in domain names, thereby avoiding a conflict.

Another registry-level approach is to create variant tables and variant registration capabilities. These variant tables show which characters are considered visually confusable across chosen languages or scripts. If a domain name contains such a character, then the version of the domain name containing the alternate character will be automatically reserved for the registrant. For example, if the requested domain name (the “primary domain”) contains the Tamil letter KA (U+0B95), the registry system can generate a variant of the domain name, substituting the Malayalam letter KA (U+0D15) in the Tamil letter KA’s place. All identified variants may be automatically prohibited (from being registered or created) as part of a package associated with the primary registered name.

The Unicode Consortium is also developing a technical report, Unicode Security Considerations, that describes issues relating to IDN spoofing and makes recommendations for addressing them.

Paths

The conversion process for parts of the IRI relating to the path is already supported natively in the latest versions of IE7, Firefox, Opera, Safari and Google Chrome.

It works in Internet Explorer 6 if the option Tools > Internet Options > Advanced > Always send URLs as UTF-8 is turned on. This means that links in HTML, or addresses typed into the browser's address bar, will be correctly converted in those user agents. It doesn't work out of the box for Firefox 2 (although you may obtain results if the IRI and the resource name are in the same encoding), but technically-aware users can turn on an option to support this (set network.standard-url.encode-utf8 to true in about:config).

Whether or not the resource is found on the server, however, is a different question. If the file system is in UTF-8, there should be no problem. If not, and no mechanism is available to convert addresses from UTF-8 to the appropriate encoding, the request will fail.

Files are normally exposed as UTF-8 by servers such as IIS and Apache 2 on Windows and Mac OS X. Unix and Linux users can store filenames in UTF-8, or use the mod_fileiri module mentioned earlier. Version 1 of the Apache server doesn't yet expose filenames as UTF-8.

You can run a basic check to see whether it works for your client and resource using this simple test.

Note that, while the basics may work, there are other somewhat more complicated aspects of IRI support, such as handling of bidirectional text in Arabic or Hebrew, which may need some additional time for full implementation.

Further specification work

There are some improvements needed to the specifications for IDN and IRIs, and these are currently being discussed. For example, there is a need to extend the range of Unicode characters that can be used in domain names to cover later versions of Unicode, and to allow combining characters at the end of labels in right to left scripts.
