An Introduction to Multilingual Web Addresses

A Web address is used to point to a resource on the Web, such as a Web page. Recent developments enable you to add non-ASCII characters to Web addresses. This article provides a high level introduction to how this works. It is aimed at content authors and general users who want to understand the basics without too many gory technical details. For simplicity, we will use examples based on HTML and HTTP. We will also address how this works for both the domain name and the remaining path information in a Web address.

Why multilingual Web addresses?

Web addresses are typically expressed using Uniform Resource Identifiers or URIs. The URI syntax defined in RFC 3986 STD 66 (Uniform Resource Identifier (URI): Generic Syntax) essentially restricts Web addresses to a small number of characters: basically, just upper and lower case letters of the English alphabet, European numerals and a small number of symbols.

The original reason for this was to aid in transcription and usability, both in computer systems and in non-computer communications, to avoid clashes with characters used conventionally as delimiters around URIs, and to facilitate entry using those input facilities available to most Internet users.

Users' expectations and use of the Internet have moved on since then, and there is now a growing need to enable use of characters from any language in Web addresses. A Web address in your own language and alphabet is easier to create, memorize, transcribe, interpret, guess, and relate to. It is also important for brand recognition. This, in turn, is better for business, better for finding things, and better for communicating. In short, better for the Web.

Imagine, for example, that all web addresses had to be written in Japanese katakana. How easy would it be for you, if you didn't read Japanese, to recognize the content or owner of the site, or type the address in your browser, or write the URI down on notepaper, etc.?

There have been several developments recently that begin to make this possible.

Basic concepts

We will refer to Web addresses that allow the use of characters from a wide range of scripts as Internationalized Resource Identifiers or IRIs. For IRIs to work, there are four main requirements:

  1. the syntax of the format where IRIs are used (eg. HTML, XML, SVG, etc.) must support the use of non-ASCII characters in Web addresses
  2. the application where IRIs are used (eg. browsers, parsers, etc.) must support the input and use of non-ASCII characters in Web addresses
  3. it must be possible to carry the information in an IRI through the necessary protocol (eg. HTTP, FTP, IMAP, etc.)
  4. it must be possible to successfully match the string of characters in your Web address against the name of the target resource on the file system or registry where it is stored.

Various document formats and specifications already support IRIs. Examples include HTML 4.0, XML 1.0 system identifiers, the XLink href attribute, XML Schema's anyURI datatype, etc. We will also see later that major browsers support the use of IRIs already.

Unfortunately, not so many protocols allow IRIs to pass through unchanged. Typically they require that the address be specified using the ASCII characters defined for URIs. There are, however, well specified ways around this, and we will describe them briefly in this article.

The fourth requirement demands that a string of characters be matched against a target whether or not those characters are represented by the same underlying bytes, ie. whether or not the same character encoding is used at each end. This is dealt with by using UTF-8 as a broker.

We will use the following fictitious Web address in most of the examples on this page:

http://JP納豆.例.jp/dir1/引き割り.html

This is a simple IRI that is composed of three parts.

What it all means. The domain name (JP納豆.例.jp) starts with 'JP' so that in the worked examples we can show what happens to ASCII text within a domain name. The rest of the domain name is read 'natto (a Japanese delicacy made from fermented soya beans) dot rei (meaning example) dot jp (Japanese country code)'. The path reads 'dir1 slash hikiwari (a type of natto) dot html'.

When it comes to dealing with requirements two to four above, there is one solution for the domain name and a different solution for the path. We will explore each of these in turn.
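The three parts of the example can be seen by parsing it with, for instance, Python's standard library (a sketch; note that the parser accepts the raw non-ASCII string and lowercases the host name):

```python
from urllib.parse import urlsplit

# Split the example IRI into its component parts.
iri = 'http://JP納豆.例.jp/dir1/引き割り.html'
parts = urlsplit(iri)

print(parts.scheme)    # the protocol: 'http'
print(parts.hostname)  # the domain name (lowercased): 'jp納豆.例.jp'
print(parts.path)      # the path: '/dir1/引き割り.html'
```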

Handling the domain name

Domain names are allocated and managed by domain name registration organizations spread around the world.

A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs 3490, 3491, 3492 and 3454, and is based on Unicode 3.2. One refers to this using the term Internationalized Domain Name or IDN.

Domain registration

The domain name registrar fixes the list of characters that people can request to be used in their country or top level domains. However, when a person requests a domain name using these characters they are actually allocated the equivalent of the domain name using a representation called punycode.

Punycode is a way of representing Unicode codepoints using only ASCII characters.

High level overview

We give a slightly more detailed worked example in the next section but, in summary, the desired Web address is stored in a document link or typed into the client's address bar using the relevant native characters. When a user clicks on the link or otherwise initiates a request, the user agent (ie. the browser or other client requesting the resource) needs to convert any native script characters in the Web address to punycode representations.

(Of course, if the user agent is unable to do this, it is always possible to express the location in punycode directly, although it is not very user friendly.)

Resolving a domain name

Let's examine the steps in resolving an Internationalized Domain Name from the user to the identification of the resource. (Remember that this looks only at how the domain name is handled. The path information is treated differently and will be described later.)

The user clicks on a hyperlink or enters the IRI in the address bar of a user agent. At this point the IRI contains non-ASCII characters that could be in any character encoding. Here is the domain name that appears in the example above.

JP納豆.例.jp

If the string that represents the domain name is not in Unicode, the user agent converts the string to Unicode. It then performs some normalization functions on the string to eliminate ambiguities that may exist in Unicode encoded text.

Normalization involves such things as converting uppercase characters to lowercase, reducing alternative representations (eg. converting half-width kana to full-width), eliminating prohibited characters (eg. spaces), etc.
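Python's unicodedata module can illustrate the kinds of mappings involved. (The exact tables are specified by nameprep, RFC 3491; case folding and NFKC normalization, shown here, are only an approximation of that process.)

```python
import unicodedata

# Case mapping: uppercase ASCII letters become lowercase.
lowered = 'JP'.casefold()
print(lowered)                # 'jp'

# Compatibility normalization folds alternative representations,
# eg. half-width katakana into the equivalent full-width forms.
halfwidth = 'ﾅｯﾄｳ'            # half-width katakana 'nattou'
fullwidth = unicodedata.normalize('NFKC', halfwidth)
print(fullwidth)              # 'ナットウ'
```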

Next, the user agent converts each of the labels (ie. pieces of text between dots) in the Unicode string to a punycode representation. A special marker ('xn--') is added to the beginning of each label containing non-ASCII characters to show that the label was not originally ASCII. The end result is not very user friendly, but accurately represents the original string of characters while using only the characters that were previously allowed for domain names. Our example now looks like this:

xn--jp-cd2fp15c.xn--fsq.jp

Note how the uppercase ASCII characters JP at the beginning of the domain name are lowercased, but still recognizable. Any existing ASCII characters in a label appear first, followed by a single hyphen and then an ASCII-based representation of any non-ASCII characters.
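Python's built-in codecs can reproduce this conversion (a sketch: the 'idna' codec performs the RFC 3490 ToASCII operation, including the normalization step described above, while the raw 'punycode' codec shows the per-label encoding without the marker):

```python
# The 'idna' codec normalizes each label and applies punycode,
# adding the 'xn--' marker to labels containing non-ASCII characters.
domain = 'JP納豆.例.jp'
ascii_form = domain.encode('idna')
print(ascii_form)                    # b'xn--jp-cd2fp15c.xn--fsq.jp'

# The raw 'punycode' codec shows a single label's encoding: any
# ASCII characters first, a hyphen, then the encoded remainder.
print('jp納豆'.encode('punycode'))   # b'jp-cd2fp15c'
print('例'.encode('punycode'))       # b'fsq'
```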

Next, the punycode is resolved by the domain name server into a numeric IP address (just like any other domain name is resolved).

Finally the user agent sends the request for the page. Since punycode contains no characters outside those normally allowed for protocols such as HTTP, there is no issue with the transmission of the address. This should simply match against a registered domain name.

Note that most top-level country codes, for example the .jp at the end of JP納豆.例.jp, still have to be in Latin characters at the moment. Since 2010, however, IANA has been progressively introducing internationalized country code top level domains, such as .مصر for Egypt and .рф for Russia.

In practice, it makes sense to register two names for your domain: one in your native script, and one using just regular ASCII characters. The latter will be more memorable and easier to type for people who do not read and write your language. For example, you could additionally register a transcription of the Japanese in Latin script, such as the following:

http://JPnatto.rei.jp/

Handling the path

Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based punycode), multi-script path names identify resources located on many kinds of platforms, whose file systems do and will continue to use many different encodings. This makes the path much more difficult to handle than the domain name.

Having dealt with the domain name using punycode, we now need to deal with the path part of an IRI. The IETF Proposed Standard RFC 3987 (Internationalized Resource Identifiers (IRIs)) defines how to deal with this.

The string matching challenge

There is already a mechanism in the URI specification for representing non-ASCII characters in URIs. What you do is represent the underlying bytes using what is referred to as percent-escaping (in the specification, the less common term percent-encoding is used). Thus, in the page you are currently reading, which is encoded in UTF-8, we could represent the filename 引き割り.html from our previous example as shown just after this paragraph. What you are seeing are two-digit hexadecimal numbers, preceded by %. These represent the bytes used to encode the Japanese characters in the string in UTF-8. Each Japanese character is represented by 3 bytes, which are transformed into three percent-escapes.

%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

Apart from the fact that this is not terribly user friendly, there is a bigger issue here. Another person may want to follow the same link from a page that uses the Shift-JIS character encoding, rather than UTF-8. In this case, if we were to use percent-escaping to transform the (same) characters in the address so that they conform to the URI requirements, we would base the escapes on the bytes that represent 引き割り.html in Shift-JIS. There are only two bytes per Japanese character in Shift-JIS, and they are different bytes from those used in UTF-8. So this would yield the totally different sequence of byte escapes shown below.

%88%F8%82%AB%8A%84%82%E8.html

So here we see that, although the URI escape mechanism allows the Japanese address to be specified, the actual result will vary according to the page of origin. How then is it possible to know how to map that onto a sequence of characters that will match the name of the resource as exposed by the system where it resides?
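The divergence can be demonstrated with Python's urllib, which percent-escapes the bytes of whichever character encoding it is told to use (a sketch using only the standard library):

```python
from urllib.parse import quote

filename = '引き割り.html'

# Escaping the UTF-8 bytes: three bytes, hence three escapes,
# per Japanese character.
print(quote(filename, encoding='utf-8'))
# %E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

# Escaping the Shift-JIS bytes of the *same* characters gives a
# completely different, incompatible escape sequence.
print(quote(filename, encoding='shift_jis'))
# %88%F8%82%AB%8A%84%82%E8.html
```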

The chief difficulty here is that there is no encoding-related meta-data associated with the URI strings to indicate what characters they represent. Even if that information were available, the total number of mappings that a server would need to support to convert any incoming string to the appropriate encoding would be extremely high.

Not only that, but the file system on which the resource itself actually resides may expose the file name using a totally different encoding, such as EUC-JP. If so, the underlying byte sequence that represents the file name as the system knows it would be different again. So how are we going to know that these byte sequences all refer to the same resource?

Note that the filename may be stored and exposed in different encodings. Under Windows NT or Windows XP the IIS or Apache 2 server exposes the file name as UTF-8, even though the operating system stores it as UTF-16.

High level overview

The IRI specification uses Unicode as a broker. It specifies that, before conversion to escapes, the IRI should be converted to UTF-8. As for IDNs, if a conversion is required by the protocol, it is the user agent that is responsible for performing that change when a request is made for a resource.

The server must also then recognize the Unicode characters in the incoming web address and map them to the encoding used for the actual resources.

(Remember that we have already dealt with the domain name part of the IRI using IDN. The rules in the IRI specification are typically only applied to the path part of the multilingual Web address.)

It is also possible to apply percent-escaping to the domain name before conversion, but clients often simply convert directly to punycode.

Resolving a path

Let us look at what the client does to send the path part of a web address to an HTTP server. Here is the path part of the earlier example Web address:

/dir1/引き割り.html

When the user clicks on a hyperlink or enters the IRI in the address bar of a user agent, the address may be in any character encoding, but that encoding is usually known.

If the string is input by the user or stored in a non-Unicode encoding, it is converted to Unicode, normalized using Unicode Normalization Form C, and encoded using the UTF-8 encoding.

The user agent then converts the non-ASCII bytes to percent-escapes. Our example now looks like this:

/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html

The string is now in URI form, and will be acceptable to protocols such as HTTP. Note how the ASCII characters 'dir1' and '.html' are just passed through without change, since these characters are encoded in the same way in both ASCII and UTF-8.

The user agent sends the request for the page.

When this request hits the server, one of two things needs to happen:

  • if the server exposes the file names in UTF-8, the server simply accesses the resource
  • if the server uses another encoding, the server needs to convert from UTF-8.

Martin Dürst has written an Apache module called mod_fileiri to convert requests from UTF-8 to the encoding of the server.
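A server-side conversion of the kind just described can be sketched as follows: unescape the incoming path back to Unicode, then re-encode it in the file system's encoding. (The choice of EUC-JP here is purely illustrative; the actual target encoding depends on the server's file system.)

```python
from urllib.parse import unquote

# The request path as received, with UTF-8 percent-escapes.
request_path = '%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html'

# Decode the escapes back to the Unicode file name...
name = unquote(request_path, encoding='utf-8')
print(name)   # 引き割り.html

# ...then re-encode it for a file system that uses, say, EUC-JP.
euc_bytes = name.encode('euc_jp')
```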

This covers the basics. There are some additional parts of the specification that deal with finer points, such as how to handle bidirectional text in IRIs, and so on.

A sample HTTP header

Here is the first part of the HTTP header for the page request generated by our example. It shows the host name as an IDN, and the path using percent-escaping where appropriate:

GET /dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html HTTP/1.1
Host: xn--jp-cd2fp15c.xn--fsq.jp
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
…
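The end-to-end conversion that produces such a request — IDN for the host name, NFC normalization plus UTF-8 percent-escaping for the path — can be sketched with Python's standard library (the request line and Host header are composed by hand purely for illustration):

```python
import unicodedata
from urllib.parse import urlsplit, quote

iri = 'http://JP納豆.例.jp/dir1/引き割り.html'
parts = urlsplit(iri)

# Host: nameprep normalization + punycode via the 'idna' codec.
host = parts.hostname.encode('idna').decode('ascii')

# Path: Unicode Normalization Form C, then UTF-8 percent-escapes.
path = quote(unicodedata.normalize('NFC', parts.path), safe='/')

print(f'GET {path} HTTP/1.1')
print(f'Host: {host}')
```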

Does it work?

Domain Name lookup

Numerous domain name authorities already offer registration of internationalized domain names. These include providers for top level country domains such as .cn, .jp and .kr, and global top level domains such as .info, .org and .museum.

Client-side support for IDN appears in recent versions of major browsers, including Internet Explorer 7, Firefox, Mozilla, Netscape, Opera, and Safari. It only works in Internet Explorer 6 if you download a plug-in (Microsoft support pages provide some suggestions). This means that you can use IDNs in href values or the address bar, and the browser will convert the IDN to punycode and look up the host.

You can run a basic check to see whether IDNs work on your system using this simple test.

It has been an issue, until now, that IDN is not natively supported by Internet Explorer, with its huge market share. Although plug-ins are available, not all people will know how to, will want to, or will be able to install them. However, IE7 and its successors, which do support IDN, will, over time, replace most IE6 installs.

Note that, as a simple fallback solution until IDN is widely supported, content authors who want to point to a resource using an IDN could write the link text in native characters, and put a punycode representation in the href attribute. This guarantees that the user would be able to link to the resource, whatever platform they used.

If, for some reason, you wanted to, it is possible to turn off IDN support in IE7, Firefox and Mozilla.

Domain names and phishing

One of the problems associated with IDN support in browsers is that it can facilitate phishing through what are called 'homograph attacks'. Consequently, most browsers that support IDN also put in place some safeguards to protect users from such fraud.

Special thanks to Michael Monaghan and Greg Aaron for their contributions to this section.

The way browsers typically alert the user to a possible homograph attack is to display the URI in the address bar and the status bar using punycode, rather than in the original Unicode characters. Users should therefore always check the address bar after the page has loaded, or the status bar before clicking on a link.

'Homograph attack' refers to mixing characters that look alike visually in a URI in order to deceive someone about which site they are linking to. For example, in some fonts the capital 'I' looks sufficiently like an 'l' that the URI 'www.paypaI.com' seems to be taking you to a PayPal site, whereas it is most probably routing you to a place where someone will try to harvest your personal information.

Internet Explorer 7 shows the address as punycode under certain conditions, for example when the domain name contains characters outside the languages configured in the browser preferences.

Binding the behavior to the list of languages in the browser preferences also means that a language that is not in the standard list supplied by IE will always produce punycode. For example, Amharic in Ethiopic text will be displayed as punycode even if you add am to the browser preferences. (Fortunately, there don't seem to be any registries providing Amharic IDNs at the moment.)

Some fraudulent domain names may still slip through this net. In this case, IE7's normal phishing protection would step in to compare the domain with reported sites. IE7 can also, however, 'apply additional heuristics to determine if the domain name is visually ambiguous'. This is helpful when letters within the same script are visually similar.

In addition to displaying suspect IDNs in the address bar in punycode, IE7 also uses its Information Bar to signal possible danger to the user. It also uses a clickable icon at the end of the address bar to notify you when a URL contains a non-ASCII character. It also displays the address bar in all windows.

Firefox 2.x uses a different approach. It only displays domain names in Unicode for certain whitelisted top level domains. Firefox selects Top Level Domains (TLDs) that have established policies on the domain names they allow to be registered and then relies on the registration process to create safe IDNs. You can find a list of supported TLDs on the Mozilla site. If an IDN is from a TLD that is not on the list, the web address will appear in punycode form in the status and address bars. In some cases the TLD policy statements should include rules about managing visually similar characters within the set of characters allowed.

In addition, IDNs that contain particular characters (e.g. the fraction slash), even within trusted TLDs, are treated suspiciously, and cause the label to be displayed as punycode.

Opera 9.x uses a similar approach to Firefox, though it differs slightly in implementation. Officially, it only displays domain names in Unicode for whitelisted TLDs listed in opera6.ini, which is updated automatically.

For TLDs that are not on the list, Opera says that it allows domain names to use Latin 1 characters, ie. Latin characters with accents that support Western European languages. All other domain names are displayed as punycode.

In reality, tests show that Opera currently displays many characters as Unicode, regardless of whether a TLD is on the whitelist or not. One exception we found is Devanagari script, which is displayed as punycode if the TLD is not on the list.

Opera does, however, also display certain mixtures of scripts as punycode. The testing revealed this is true for combinations of Greek or Cyrillic characters with Latin characters.

Also, Opera's list of illegal characters is slightly longer than the official IDNA list. Some IDNs, while displayed as punycode in other browsers, are entirely illegal in Opera.

Safari 9.x provides a user-editable list of scripts that are allowed to be displayed natively in domain names. If a character appears in a domain name and does not belong to a script in this list, the URI is displayed as punycode.

At the time of writing, the initial whitelist contains Arabic, Armenian, Bopomofo, Canadian_Aboriginal, Devanagari, Deseret, Gujarati, Gurmukhi, Hangul, Han, Hebrew, Hiragana, Katakana_Or_Hiragana, Katakana, Latin, Tamil, Thai, and Yi. Scripts like Cyrillic, Cherokee and Greek are specifically excluded because they contain characters that are easily confused with Latin characters.

If the whitelist is emptied, any non-ASCII character causes the address to be displayed as punycode.

Mozilla 1.7x displays all IDNs as punycode.

Examples. There is a test page you can use to see how your browser displays IDNs in the status bar. See also the page that gathers results for a number of browsers.

Other phishing concerns and registry-level solutions. Some potential aspects of phishing control need to be addressed by the registration authorities, and built into their policies for IDN registration.

Some registration authorities have to carefully consider how to manage equivalent ways of writing the same word. For example, the word 'hindi' can be written in Devanagari as either हिंदी (using an anusvara) or हिन्दी (using a special glyph for NA).

There is a similar issue with the use of simplified vs. traditional characters in the Chinese Han script.

Another issue arises where two characters or combinations of characters within a single script look very similar; for instance, the Tamil letter KA and the Tamil digit one are indistinguishable. In other cases, diacritic marks attached to characters may be difficult to distinguish in small font sizes.

As mentioned earlier, these issues exist even in the Latin (ASCII) character set. For example, the letter O may occasionally be confused with the digit zero (0), and the lower case letter L (l) may be confused with the digit one (1), especially depending upon the font and display size used.

On the other hand, a single registry may also have to deal with similar and potentially confusable characters across different scripts. For example, Tamil and Malayalam are two different Indic scripts that may both be handled by the same registry, and the Tamil letter KA க (U+0B95) is very similar to the Malayalam letter KA (U+0D15). Another example is the implications of registering the label ера (which uses Cyrillic characters only) vs. epa (which uses Latin characters only) for a TLD such as .museum that has to deal with multiple scripts. It could cause significant confusion if more than one applicant was able to register them separately.
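A crude check for this kind of cross-script confusion can be sketched with Python's unicodedata module. (Real detectors use the Unicode script property; taking the first word of each character's name, as here, is only a rough approximation.)

```python
import unicodedata

def scripts_in(label):
    # Take the first word of each character's Unicode name as a
    # rough script indicator (e.g. 'LATIN', 'CYRILLIC', 'GREEK').
    return {unicodedata.name(c).split()[0] for c in label}

# Cyrillic 'ера' and Latin 'epa' look alike but use different codepoints:
print(scripts_in('ера'))   # {'CYRILLIC'}
print(scripts_in('epa'))   # {'LATIN'}

# A label mixing the two scripts could be flagged as suspicious:
mixed = scripts_in('eра')  # Latin 'e' followed by Cyrillic 'ра'
print(len(mixed) > 1)      # True
```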

In some cases these scenarios can be documented as rules that can be picked up and applied by user agents for phishing detection, but they are often best dealt with at the point of registration.

One registry-level approach is to decide which characters (i.e. Unicode points) in a given language will be allowed during registration. These lists are called language tables, and are developed by registries in cooperation with qualified language authorities. For example, the Indian language authority could allow use of the Tamil letter KA (U+0B95) but not the Tamil digit one (U+0BE7) in .in domain names, thereby avoiding a conflict.

Another registry-level approach is to create variant tables and variant registration capabilities. These variant tables show which characters are considered visually confusable across chosen languages or scripts. If a domain name contains such a character, then the version of the domain name containing the alternate character will be automatically reserved for the registrant. For example, if the requested domain name (the “primary domain”) contains the Tamil letter KA (U+0B95), the registry system can generate a variant of the domain name, substituting the Malayalam letter KA (U+0D15) in the Tamil letter KA’s place. All identified variants may be automatically prohibited (from being registered or created) as part of a package associated with the primary registered name.

The Unicode Consortium is also developing a technical report, Unicode Security Considerations, that describes issues relating to IDN spoofing and makes recommendations for addressing them.

Paths

The conversion process for parts of the IRI relating to the path is already supported natively in the latest versions of IE7, Firefox, Opera, Safari and Google Chrome.

It works in Internet Explorer 6 if the option Tools > Internet Options > Advanced > Always send URLs as UTF-8 is turned on. This means that links in HTML, or addresses typed into the browser's address bar, will be correctly converted in those user agents. It doesn't work out of the box for Firefox 2 (although you may obtain results if the IRI and the resource name are in the same encoding), but technically-aware users can turn on an option to support this (set network.standard-url.encode-utf8 to true in about:config).

Whether or not the resource is found on the server, however, is a different question. If the file system is in UTF-8, there should be no problem. If not, and no mechanism is available to convert addresses from UTF-8 to the appropriate encoding, the request will fail.

Files are normally exposed as UTF-8 by servers such as IIS and Apache 2 on Windows and Mac OS X. Unix and Linux users can store filenames in UTF-8, or use the mod_fileiri module mentioned earlier. Version 1 of the Apache server doesn't yet expose filenames as UTF-8.

You can run a basic check to see whether it works for your client and resource using this simple test.

Note that, while the basics may work, there are other somewhat more complicated aspects of IRI support, such as handling of bidirectional text in Arabic or Hebrew, which may need some additional time for full implementation.

Further specification work

There are some improvements needed to the specifications for IDN and IRIs, and these are currently being discussed. For example, there is a need to extend the range of Unicode characters that can be used in domain names to cover later versions of Unicode, and to allow combining characters at the end of labels in right to left scripts.
