
Unicode® Technical Standard #46

Unicode IDNA Compatibility Processing

Version: 15.1.0
Editors: Mark Davis (markdavis@google.com), Michel Suignard (michel@suignard.com)
Date: 2023-09-05
This Version: https://www.unicode.org/reports/tr46/tr46-31.html
Previous Version: https://www.unicode.org/reports/tr46/tr46-29.html
Latest Version: https://www.unicode.org/reports/tr46/
Latest Proposed Update: https://www.unicode.org/reports/tr46/proposed.html
Revision: 31

Summary

Client software, such as browsers and emailers, faced a difficult transition from the version of international domain names approved in 2003 (IDNA2003), to the revision approved in 2010 (IDNA2008). The specification in this document has been providing a mechanism that minimizes the impact of this transition for client software, allowing client software to access domains that are valid under either system.

The specification provides two main features: One is a comprehensive mapping to support current user expectations for casing and other variants of domain names. Such a mapping is allowed by IDNA2008. The second is a compatibility mechanism that supports the existing domain names that were allowed under IDNA2003. This second feature was intended to improve client behavior during the transition period.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].


1 Introduction

One of the great strengths of domain names is universality. The URL http://Apple.com goes to Apple's website from anywhere in the world, using any browser. The email address markdavis@google.com can be used to send email to an editor of this specification from anywhere in the world, using any emailer.

Initially, domain names were restricted to ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one could only use Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick those Greek letters that would correspond to the desired English letters. One would have to guess at the spelling of particular words, because there are not exact matches between scripts.

Most of the world’s population faced this situation until recently, because their languages use non-ASCII characters. A system was introduced in 2003 for internationalized domain names (IDN). This system is called Internationalizing Domain Names for Applications, or IDNA2003 for short. This mechanism supports IDNs by means of a client software transformation into a format known as Punycode. A revision of IDNA was approved in 2010 (IDNA2008). This revision has a number of incompatibilities with IDNA2003.

The incompatibilities forced implementers of client software, such as browsers and emailers, to face difficult choices during the transition period as registries shifted from IDNA2003 to IDNA2008. This document specifies a mechanism that has minimized the impact of this transition for client software, allowing client software to access domains that are valid under either system.

The specification provides two main features. The first is a comprehensive mapping to support current user expectations for casing and other variants of domain names. Such a mapping is allowed by IDNA2008. The second feature is a compatibility mechanism that supports the existing domain names that were allowed under IDNA2003. This second feature is intended to improve client behavior during the transition period. This specification contains both normative and informative material. Only the conformance clauses and the text that they directly or indirectly reference are considered normative.

1.1 IDNA2003

The series of RFCs collectively known as IDNA2003 [IDNA2003] allows domain names to contain non-ASCII Unicode characters, which includes not only the characters needed for Latin-script languages other than English (such as Å, Ħ, or Þ), but also different scripts, such as Greek, Cyrillic, Tamil, or Korean. An internationalized domain name such as Bücher.de can then be used in an "internationalized" URL, called an IRI, such as http://Bücher.de#titel.

The IDNA mechanism for allowing non-ASCII Unicode characters in domain names involves applying the following steps to each label in the domain name that contains Unicode characters:

  1. Transforming (mapping) a Unicode string to remove case and other variant differences.
  2. Checking the resulting string for validity, according to certain rules.
  3. Transforming the Unicode characters into a DNS-compatible ASCII string using a specialized encoding called Punycode [RFC3492].

For example, typing the IRI http://Bücher.de into the address bar of any modern browser goes to a corresponding site, even though the "ü" is not an ASCII character. This works because the IDN in that IRI resolves to the Punycode string which is actually stored by the DNS for that site. Similarly, when a browser interprets a web page containing a link such as <a href="http://Bücher.de">, the appropriate site is reached. (In this document, phrases such as "a browser interprets" refer to domain names parsed out of IRIs entered in an address bar as well as to those contained in links internal to HTML text.)

In the case of IDN Bücher.de, the Punycode value actually used for the domain names on the wire is xn--bcher-kva.de. The Punycode version is also typically transformed back into Unicode form for display. The resulting display string will be a string which has already been mapped according to the IDNA2003 rules. This example results in a display string for the IRI that has been casefolded to lowercase:

http://Bücher.de → http://xn--bcher-kva.de → http://bücher.de
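
The round trip shown above can be reproduced with Python's built-in "idna" codec, which implements the IDNA2003 ToASCII and ToUnicode operations of [RFC3490]. This is a minimal sketch of IDNA2003 behavior only; it is not an implementation of the processing defined later in this document, and the variable names are illustrative.

```python
# Python's standard "idna" codec follows IDNA2003 (RFC 3490 with Nameprep).
original = "Bücher.de"

# ToASCII: map (case-fold), validate, then Punycode-encode each non-ASCII label.
on_the_wire = original.encode("idna").decode("ascii")

# ToUnicode: convert the Punycode form back into Unicode for display.
display = on_the_wire.encode("ascii").decode("idna")

print(original, "->", on_the_wire, "->", display)
```

The display string is the mapped form, so the capital "B" typed by the user does not survive the round trip.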

A major limitation of IDNA2003 is its restriction to the repertoire of characters in Unicode 3.2, which means that some modern languages are excluded or not fully supported. Furthermore, within the constraints of IDNA2003, there is no simple way to extend the repertoire. IDNA2003 also does not make it clear to users of registries exactly which string they are registering for a domain name (between Bücher.de and bücher.de, for example).

1.2 IDNA2008

In early 2010, a new version of IDNA was approved. Like IDNA2003, this version consists of a collection of RFCs and is called IDNA2008 [IDNA2008]. IDNA2008 is intended to solve the major problems in IDNA2003. It extends the valid repertoire of characters in domain names, and establishes an automatic process for updating to future versions of the Unicode Standard. Furthermore, it defines the concept of a valid domain name clearly, so that registrants understand exactly what domain name string is being registered.

Processing in IDNA2008 is identical to IDNA2003 for many common domain names. Both IDNA2003 and IDNA2008 transform a Unicode domain name in an IRI (like http://öbb.at) to the Punycode version (like http://xn--bb-eka.at). However, IDNA2008 does not maintain strict backward compatibility with IDNA2003. The main differences are:

For more details, see Section 7, IDNA Comparison.

1.3 Transition Considerations

The differences between IDNA2008 and IDNA2003 may cause interoperability and security problems. They affect extremely common characters, such as all uppercase characters, all halfwidth or fullwidth characters (commonly used in Japan, China, and Korea), and certain other characters like the German eszett (U+00DF ß LATIN SMALL LETTER SHARP S) and Greek final sigma (U+03C2 ς GREEK SMALL LETTER FINAL SIGMA).

1.3.1 Mapping

IDNA2003 requires a mapping phase, which maps ÖBB.at to öbb.at, for example. Mapping typically involves mapping uppercase characters to their lowercase pairs, but it also involves other types of mappings between equivalent characters, such as mapping halfwidth katakana characters to normal katakana characters in Japanese. The mapping phase in IDNA2003 was included to match the case insensitivity of ASCII domain names. Users are accustomed to having both CNN.com and cnn.com work identically. They expect domain names with accents to have the same casing behavior, so that ÖBB.at is the same as öbb.at. There are variations similar to case differences in other scripts. The IDNA2003 mapping is based on data specified in the Unicode Standard, Version 3.2; this mapping was later formalized as the Unicode property [NFKC_Casefold].

Note that case-folding generates a stable form of a string that erases functional case-differences. It is not the same as lowercasing. In particular, the lowercase Cherokee characters added in Unicode Version 8.0 are case-folded to their uppercase counterparts.
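
The difference between case-folding and lowercasing can be observed with Python's str.casefold and str.lower. This is a small sketch, assuming a Python build whose Unicode data is at Version 8.0 or later (Python 3.5 and up). Note that str.casefold applies Unicode full case folding, which is related to but not the same as the NFKC_Casefold property mentioned above.

```python
sharp_s = "\u00DF"         # ß LATIN SMALL LETTER SHARP S
cherokee_upper = "\u13A0"  # Ꭰ CHEROKEE LETTER A
cherokee_lower = "\uAB70"  # ꭰ CHEROKEE SMALL LETTER A (added in Unicode 8.0)

# Lowercasing leaves ß alone; case-folding expands it to "ss".
sharp_s_lowered = sharp_s.lower()
sharp_s_folded = sharp_s.casefold()

# For Cherokee, lowercasing and case-folding go in opposite directions:
# the uppercase letter lowercases to the small letter, but the small
# letter case-folds back to the uppercase letter.
cherokee_lowered = cherokee_upper.lower()
cherokee_folded = cherokee_lower.casefold()

print(repr(sharp_s_lowered), repr(sharp_s_folded))
print(repr(cherokee_lowered), repr(cherokee_folded))
```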

IDNA2008 does not require a mapping phase, but does permit one (called "Local Mapping" or "Custom Mapping"). For more information on the permitted mappings, see the Protocol document of [IDNA2008], Section 4.2, Permitted Character and Label Validation and Section 5.2, Conversion to Unicode.

The UTS #46 specification defines a mapping consistent with the normative requirements of the IDNA2008 protocol, and which is as compatible as possible with IDNA2003. For client software, this provides behavior that is the most consistent with user expectations about the handling of domain names with existing data: namely, that domain names will map consistently both on clients supporting IDNA2003 and on clients supporting IDNA2008 with the UTS #46 mapping.

1.3.2 Deviations

There are a few situations where the use of IDNA2008 without compatibility mapping will result in the resolution of IDNs to different IP addresses than under IDNA2003, unless the registry or registrant takes special action. This affects a very small number of characters, but because these characters are very common in particular languages, a significant number of domain names in those languages are affected. This set of characters is referred to as "Deviations" and is shown in Table 1, Deviation Characters, illustrated in the context of IRIs.

Table 1. Deviation Characters

| Char | Example | IDNA2003 Result | IDNA2008 Result |
|---|---|---|---|
| ß (U+00DF) | href="http://faß.de" | http://fass.de / http://fass.de | http://faß.de / http://xn--fa-hia.de |
| ς (U+03C2) | href="http://βόλος.com" | http://βόλοσ.com / http://xn--nxasmq6b.com | http://βόλος.com / http://xn--nxasmm1c.com |
| ZWJ (U+200D) | href="http://ශ්‍රී.com" | http://ශ්රී.com / http://xn--10cl1a0b.com | http://ශ්‍රී.com / http://xn--10cl1a0b660p.com |
| ZWNJ (U+200C) | href="http://نامه‌ای.com" | http://نامهای.com / http://xn--mgba3gch31f.com | http://نامه‌ای.com / http://xn--mgba3gch31f060k.com |
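
The two result columns of Table 1 can be sketched in a few lines of Python. The dictionary below hard-codes the IDNA2003-style treatment of the four Deviation characters as shown in the table; the names DEVIATION_MAPPINGS, map_deviations, and to_a_label are hypothetical helpers introduced for illustration, and the sketch skips all other mapping and validation.

```python
# IDNA2003-style treatment of the four Deviation characters (see Table 1).
DEVIATION_MAPPINGS = {
    "\u00DF": "ss",      # ß is mapped to "ss"
    "\u03C2": "\u03C3",  # ς (final sigma) is mapped to σ
    "\u200D": "",        # ZWJ is removed
    "\u200C": "",        # ZWNJ is removed
}

def map_deviations(label: str, transitional: bool) -> str:
    """Apply the deviation mappings only when transitional is True."""
    if not transitional:
        return label
    return "".join(DEVIATION_MAPPINGS.get(ch, ch) for ch in label)

def to_a_label(label: str) -> str:
    """Punycode-encode a label that contains non-ASCII characters."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

for label in ("fa\u00DF", "\u03B2\u03CC\u03BB\u03BF\u03C2"):
    old_style = to_a_label(map_deviations(label, transitional=True))
    new_style = to_a_label(map_deviations(label, transitional=False))
    print(label, "->", old_style, "vs", new_style)
```

The same input label therefore yields two different strings on the wire, which is why the two interpretations can resolve to different addresses.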

For more information on the rationale for the occurrence of these Deviations in IDNA2008, see the [IDN FAQ].

The differences in interpretation of Deviation characters result in potential for security exploits. Consider a scenario involving http://www.sparkasse-gießen.de, a German IRI containing an IDN for "Gießen Savings and Loan".

  1. Alice's browser supports IDNA2003. Under those rules, http://www.sparkasse-gießen.de is mapped to http://www.sparkasse-giessen.de, which leads to a site with the IP address 01.23.45.67.
  2. She visits her friend Bob, and checks her bank statement on his browser. His browser supports IDNA2008. Under those rules, http://www.sparkasse-gießen.de is also valid, but converts to a different Punycode domain name in http://www.xn--sparkasse-gieen-2ib.de. This can lead to a different site with the IP address 101.123.145.167, a spoof site.

Alice ends up at the phishing site, supplies her bank password, and her money is stolen. While the .DE registrar (DENIC) might have a policy about bundling all of the variants of ß together (so that they all have the same owner), it is not required of registries. It is unlikely that all registries will have and enforce such a bundling policy in all such cases.

There are two Deviations of particular concern. IDNA2008 allows the joiner characters (ZWJ and ZWNJ) in labels. By contrast, these are removed by the mapping in IDNA2003. When used in the intended contexts in particular scripts, the joiner characters produce a noticeable change in displayed text. However, when used between any other characters in those scripts, or in any other scripts, they are invisible. For example, when used between the Latin characters "a" and "b" there is no visible difference: the sequence "a<ZWJ>b" looks just like "ab".

Because of the visual confusability introduced by the joiner characters, IDNA2008 provides a special category for them called CONTEXTJ, and only permits CONTEXTJ characters in limited contexts: certain sequences of Arabic or Indic characters. However, applications that perform IDNA2008 lookup are not required to check for these contexts, so overall security is dependent on registries having correct implementations. Moreover, the IDNA2008 context restrictions do not catch most cases where distinct domain names have visually confusable appearances because of ZWJ and ZWNJ.
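
The "a<ZWJ>b" case can be made concrete with a short sketch. The helper find_joiners is a hypothetical name; it only reports where joiner characters occur and does not implement the CONTEXTJ rules of IDNA2008.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def find_joiners(label: str) -> list:
    """Return (index, code point) pairs for every joiner in the label."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(label) if ch in (ZWJ, ZWNJ)]

plain = "ab"
spoof = "a" + ZWJ + "b"

# The two strings typically render identically but are different labels.
print(plain == spoof, len(plain), len(spoof))
print(find_joiners(spoof))
```

A client could use a check of this kind to flag labels that contain joiners outside the scripts where they are expected.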

2 Unicode IDNA Compatibility Processing

To satisfy user expectations for mapping, and provide maximal compatibility with IDNA2003, this document specifies a mapping for use with IDNA2008. In addition, to transition more smoothly to IDNA2008, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.2 and later. It also incorporates the repertoire extensions provided by IDNA2008.

Where the transitional processing is not needed, UTS #46 can be used purely as a preprocessing (local mapping) for IDNA2008 by claiming conformance specifically to Conformance Clause C3.

By using this Compatibility Processing, a domain name such as ÖBB.at will be mapped to the valid domain name öbb.at, thus matching user expectation for case behavior in domain names. For transitional use, the Compatibility Processing also allows domain names containing symbols and punctuation that were valid in IDNA2003, such as √.com (which has an associated web page). Such domain names containing symbols will gradually disappear as registries shift to IDNA2008.

Implementations may also restrict or flag (in a UI) domain names that include symbols and punctuation. For more information, see Unicode Technical Report #36, Unicode Security Considerations [UTR36].

Using the Unicode IDNA Compatibility Processing to transform an IDN into a form suitable for DNS lookup is similar to the tactic of "try IDNA2008 then try IDNA2003". However, this approach avoids a potentially problematic dual lookup. It allows browsers and other clients, such as search engines, to have a single processing step, without the burden of maintaining two different implementations and multiple tables. It accounts for a number of edge cases that would cause problems, and provides a stable definition with predictable results.

The Unicode IDNA Compatibility Processing also provides alternate mappings for the Deviation characters. This facilitates the transition from IDNA2003 to IDNA2008. It is up to the registries to decide how to handle the transition, for example, by either bundling or blocking the Deviation characters that they support. In practice, for the deviation characters, the transition is complete. All major implementations have switched to nontransitional processing of the four deviation characters.

The term "registries" includes far more than top-level registries, such as for .de or .com. For example, .blogspot.com has more domain names registered than most top-level registries. There may be different policies in place for a registry and any of its subregistries. Thus millions of registries need to be considered in a transition strategy, not just hundreds.

In lookup software, transitions may be fine-grained: for example, it may be possible to transition to IDNA2008 rules regarding Deviations for .subdomain.com at a given point but not for .com, or vice versa. If .tld bundles or blocks the Deviation characters, then clients could transition Deviations for .tld, but not for (say) .subdomain.tld. Moreover, client software with a UI, such as the address bar in a browser, could provide more options for the transition. A full discussion of such transition strategies is outside of the scope of this document.

During the interim, authors of documents, such as HTML documents, can unambiguously refer to the IDNA2008 interpretation of characters by explicitly using the Punycode form of the domain name label.

There are two slightly different compatibility mechanisms for domain names during a transition and afterward. UTS #46 therefore specifies two specific types of processing: Transitional Processing (Conformance Clause C1) and Nontransitional Processing (Conformance Clause C2). The only difference between them is the handling of the four Deviation characters.

Summarized briefly, UTS #46 builds upon IDNA2008 in three areas:

For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the [DemoIDN]. For more detail on the differences, see Section 7, IDNA Comparison. UTS #46 does not change any of the terms defined in IDNA2008, such as A-Label or U-Label.

Neither the Unicode IDNA Compatibility Processing nor IDNA2008 address security problems associated with confusables (the so-called "paypal.com" problem). IDNA2008 disallows certain symbols and punctuation characters that can be used for spoofing, such as spoofs of the slash character ("/"). However, these are an extremely small fraction of the confusable characters used for spoofing. Moreover, confusable characters themselves account for a small proportion of phishing problems: most are cases like "secure-wellsfargo.com". For more information, see [Bortzmeyer] and the [IDN FAQ]. It is strongly recommended that Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39] be consulted for information on dealing with confusables, both for client software and registries. In particular, [UTS39] provides information that can be used to drastically reduce the number of confusables when dealing with international domain names, much beyond what IDNA2008 does. See also the [DemoConf].

2.1 Display of Internationalized Domain Names

IDNA2003 applications customarily display the processed string to the user. This improves security by reducing the opportunity for visual confusability. Thus, for example, the URL http://googIe.com (with a capital I in place of the L) is revealed as http://googie.com.
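
The effect of displaying the processed string can be approximated as follows. This sketch uses Python's str.casefold as a stand-in for the IDNA mapping; for ASCII letters the result is the same, but a real implementation would apply the mapping defined by the IDNA tables.

```python
typed = "http://googIe.com"  # capital I in place of the lowercase l

# Case-folding turns the capital I into a lowercase i, exposing the spoof.
shown = typed.casefold()

print(shown)
```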

2.2 Registries

This specification is primarily targeted at applications doing lookup of IDNs. There is, however, one strong recommendation for registries: do not allow the registration of labels that are invalid according to Nontransitional Processing, and do use bundling or blocking for labels containing confusable characters.

These tactics can be described as follows:

Note: Some implementations outside Unicode use different terminology for these strategies. In particular, in the ICANN Root Zone Label Generation Rules [RZLGR5], the term allocatable variant of X is used for labels that can be bundled with X, and the term blocked variant is used for a mutually exclusive label.

The label that is actually registered and inserted into a registry has always been processed. For example, xn--bcher-kva corresponds to bücher. However, it may be useful for a registry to also ask for "unprocessed" labels, such as Bücher, as part of the registration process, so that they are aware of the registrant's intent. However, such unprocessed labels must be handled carefully:

2.3 Notation

Sets of code points are defined using properties and the syntax of Unicode Technical Standard #18, Unicode Regular Expressions [UTS18]. For example, the set of combining marks is represented by the syntax \p{gc=M}. Additionally, the "+" indicates the addition of elements to a set, for clarity.

In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:

  1. U+002E ( . ) FULL STOP
  2. U+FF0E ( ． ) FULLWIDTH FULL STOP
  3. U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
  4. U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
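
The definition of a label can be expressed as a split on the four label-separators listed above. This is a minimal sketch; the function name split_labels is an assumption made for illustration.

```python
import re

# The four label-separators: FULL STOP, FULLWIDTH FULL STOP,
# IDEOGRAPHIC FULL STOP, and HALFWIDTH IDEOGRAPHIC FULL STOP.
_LABEL_SEPARATORS = re.compile(r"[.\uFF0E\u3002\uFF61]")

def split_labels(domain_name: str) -> list:
    """Split a domain name into labels at any label-separator."""
    return _LABEL_SEPARATORS.split(domain_name)

print(split_labels("www\u3002example\uFF0Ecom"))
```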

Many people use the terms "domain names" and "hostnames" interchangeably. This document follows [RFC3490] in use of the term "domain name".

A Bidi domain name is a domain name containing at least one character with Bidi_Class R, AL, or AN. See [IDNA2008] RFC 5893, Section 1.4.
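
The Bidi domain name test can be sketched with the standard unicodedata module, whose bidirectional function returns the Bidi_Class of a character. The function name is_bidi_domain_name is hypothetical, and the sketch assumes the domain name is already in Unicode form (any Punycode labels would need to be converted first).

```python
import unicodedata

BIDI_CLASSES = {"R", "AL", "AN"}

def is_bidi_domain_name(domain_name: str) -> bool:
    """True if any character has Bidi_Class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in BIDI_CLASSES for ch in domain_name)

print(is_bidi_domain_name("\u0646\u0627\u0645\u0647.com"))  # Arabic letters (AL)
print(is_bidi_domain_name("example.com"))
```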

3 Conformance

The requirements for conformance on implementations of the Unicode IDNA Compatibility Processing algorithm are stated in the following clauses. An implementation can claim conformance to any or all of these clauses independently.

C1 (deprecated). Given a version of Unicode and a Unicode String, a conformant implementation of Transitional Processing shall replicate the results given by applying the Transitional Processing algorithm specified by Section 4, Processing.

C2. Given a version of Unicode and a Unicode String, a conformant implementation of Nontransitional Processing shall replicate the results given by applying the Nontransitional Processing algorithm specified by Section 4, Processing.

C3. Given a version of Unicode and a Unicode String, a conformant implementation of Preprocessing for IDNA2008 shall replicate the results specified by Section 4.4, Preprocessing for IDNA2008.

These specifications are logical ones, designed to be straightforward to describe. An actual implementation is free to use different methods as long as the result is the same as that specified by the logical algorithm.

Any conformant implementation may also have tighter validity criteria than those imposed by Section 4.1, Validity Criteria. For example, an application could disallow or warn of domain name labels with certain characteristics, such as:

For more information, see Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39].

3.1 STD3 Rules

IDNA2003 provides for a flag, UseSTD3ASCIIRules, that allows for implementations to choose whether or not to abide by the rules in [STD3]. These rules exclude ASCII characters outside the set consisting of A-Z, a-z, 0-9, and U+002D ( - ) HYPHEN-MINUS. For example, some browsers also allow characters such as U+005F ( _ ) LOW LINE (underbar) in domain names, and thus use UseSTD3ASCIIRules=false, plus their own validity checks for the other ASCII characters.

While UseSTD3ASCIIRules=true is strongly recommended, Section 5, IDNA Mapping Table provides data to allow implementations to support UseSTD3ASCIIRules=false for compatibility with IDNA2003 implementations where necessary. The mapping table does this by providing the Status values and Mapping values for both UseSTD3ASCIIRules=true and UseSTD3ASCIIRules=false. Implementations that use UseSTD3ASCIIRules=false will need to apply their own validation to the mapped values as indicated in Section 4.1, Validity Criteria.

4 Processing

The input to Unicode IDNA Compatibility Processing is a prospective domain_name string expressed in Unicode, and a choice of Transitional or Nontransitional Processing. The domain name consists of a sequence of labels with dot separators, such as "Bücher.de". For more information about the composition of a URL, see Section 3.5 of [STD13].

Main Processing Steps

The following steps, performed in order, successively alter the input domain_name string and then output it as a converted Unicode string, plus a flag to indicate whether there was an error. Even if an error occurs, the conversion of the string is performed as much as is possible.

Input

Processing
  1. Map. For each code point in the domain_name string, look up the Status value in Section 5, IDNA Mapping Table, and take the following actions:
    • disallowed: Leave the code point unchanged in the string. Note: The Convert/Validate step below checks for disallowed characters, after mapping and normalization.
    • ignored: Remove the code point from the string. This is equivalent to mapping the code point to an empty string.
    • mapped: If Transitional_Processing (deprecated) and the code point is U+1E9E capital sharp s (ẞ), then replace the code point in the string by “ss”. Otherwise, replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
    • deviation:
      • If Transitional_Processing (deprecated), replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
      • Otherwise, leave the code point unchanged in the string.
    • valid: Leave the code point unchanged in the string.
  2. Normalize. Normalize the domain_name string to Unicode Normalization Form C.
  3. Break. Break the string into labels at U+002E ( . ) FULL STOP.
  4. Convert/Validate. For each label in the domain_name string:
    • If the label starts with “xn--”:
      1. If the label contains any non-ASCII code point (i.e., a code point greater than U+007F), record that there was an error, and continue with the next label.
      2. Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails and if not IgnoreInvalidPunycode, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.
      3. Verify that the label meets the validity criteria in Section 4.1, Validity Criteria for Nontransitional Processing. If any of the validity criteria are not satisfied, record that there was an error.
    • If the label does not start with “xn--”:
      • Verify that the label meets the validity criteria in Section 4.1, Validity Criteria for the input Processing choice (Transitional or Nontransitional). If any of the validity criteria are not satisfied, record that there was an error.
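
The four steps above can be sketched as follows. This is an illustrative outline, not a conformant implementation: TOY_TABLE and toy_status are hypothetical stand-ins for the IDNA Mapping Table of Section 5 (anything not listed is treated as valid or lowercased, which is not true of the real table), validation checks Status values only, and IgnoreInvalidPunycode is assumed to be false.

```python
import unicodedata

# Hypothetical toy subset standing in for the Section 5 mapping table.
TOY_TABLE = {
    "\u00AD": ("ignored", ""),      # SOFT HYPHEN
    "\u00DF": ("deviation", "ss"),  # ß
    "\u3002": ("mapped", "."),      # IDEOGRAPHIC FULL STOP
    " ": ("disallowed", None),
}

def toy_status(ch):
    """Return (status, mapping) for a code point; a crude approximation."""
    if ch in TOY_TABLE:
        return TOY_TABLE[ch]
    if ch != ch.lower():
        return ("mapped", ch.lower())
    return ("valid", None)

def label_is_valid(label, transitional):
    """Status-only check; the other criteria of Section 4.1 are omitted."""
    allowed = {"valid"} if transitional else {"valid", "deviation"}
    return all(toy_status(ch)[0] in allowed for ch in label)

def process(domain_name, transitional=False):
    """Return (converted string, error flag)."""
    error = False
    mapped = []
    for ch in domain_name:                                  # Step 1: Map
        status, mapping = toy_status(ch)
        if status == "ignored":
            continue
        if status == "mapped":
            special = transitional and ch == "\u1E9E"       # capital sharp s
            mapped.append("ss" if special else mapping)
        elif status == "deviation" and transitional:
            mapped.append(mapping)
        else:                                               # valid, disallowed, or kept deviation
            mapped.append(ch)
    text = unicodedata.normalize("NFC", "".join(mapped))    # Step 2: Normalize
    labels = text.split(".")                                # Step 3: Break
    for i, label in enumerate(labels):                      # Step 4: Convert/Validate
        if label.startswith("xn--"):
            if not label.isascii():
                error = True
                continue
            try:
                label = label[4:].encode("ascii").decode("punycode")
            except UnicodeError:
                error = True
                continue
            labels[i] = label
            if not label_is_valid(label, transitional=False):
                error = True
        elif not label_is_valid(label, transitional):
            error = True
    return ".".join(labels), error

print(process("Bücher.de"))
print(process("xn--bcher-kva.de"))
print(process("faß.de", transitional=True))
```

Note that the string is still converted as far as possible when an error is recorded, so callers need to inspect the flag rather than the returned string.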

Any input domain_name string that does not record an error has been successfully processed according to this specification. Conversely, if an input domain_name string causes an error, then the processing of the input domain_name string fails. Determining what to do with error input is up to the caller, and not in the scope of this document. The processing is idempotent: reapplying the processing to the output will make no further changes. For examples, see Table 2, Examples of Transitional Processing.

Implementations may make further modifications to the resulting Unicode string when showing it to the user. For example, it is recommended that disallowed characters be replaced by a U+FFFD to make them visible to the user. Similarly, labels that fail processing during step 4 may be marked by the insertion of a U+FFFD or other visual device.

With either Transitional or Nontransitional Processing, sources already in Punycode are validated without mapping. In particular, Punycode containing Deviation characters, such as href="xn--fu-hia.de" (for fuß.de) is not remapped. This provides a mechanism allowing explicit use of Deviation characters even during a transition period.

4.1 Validity Criteria

Each of the following criteria must be satisfied for a non-empty label:

  1. The label must be in Unicode Normalization Form NFC.
  2. If CheckHyphens, the label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions.
  3. If CheckHyphens, the label must neither begin nor end with a U+002D HYPHEN-MINUS character.
  4. If not CheckHyphens, the label must not begin with “xn--”.
  5. The label must not contain a U+002E ( . ) FULL STOP.
  6. The label must not begin with a combining mark, that is: General_Category=Mark.
  7. Each code point in the label must only have certain Status values according to Section 5, IDNA Mapping Table:
    1. For Transitional Processing (deprecated), each value must be valid.
    2. For Nontransitional Processing, each value must be either valid or deviation.
  8. If CheckJoiners, the label must satisfy the ContextJ rules from Appendix A, in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008].
  9. If CheckBidi, and if the domain name is a Bidi domain name, then the label must satisfy all six of the numbered conditions in [IDNA2008] RFC 5893, Section 2.
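
The first six criteria need only the label itself and the standard unicodedata module, so they can be sketched directly. The function check_label is a hypothetical helper that returns the numbers of the criteria that fail; criteria 7 to 9 are omitted because they require the mapping table, the ContextJ rules, and the Bidi rules.

```python
import unicodedata

def check_label(label: str, check_hyphens: bool = True) -> list:
    """Return the numbers of criteria 1-6 that a non-empty label fails."""
    failures = []
    if unicodedata.normalize("NFC", label) != label:
        failures.append(1)                      # not in NFC
    if check_hyphens:
        if label[2:4] == "--":
            failures.append(2)                  # hyphens in positions 3 and 4
        if label.startswith("-") or label.endswith("-"):
            failures.append(3)                  # leading or trailing hyphen
    elif label.startswith("xn--"):
        failures.append(4)                      # reserved prefix
    if "." in label:
        failures.append(5)                      # contains FULL STOP
    if label and unicodedata.category(label[0]).startswith("M"):
        failures.append(6)                      # begins with a combining mark
    return failures

print(check_label("bücher"))
print(check_label("ab--cd"))
print(check_label("\u0301abc"))
```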

The first 6 criteria are from [IDNA2008], except for the fourth criterion. Criterion #2 in particular is meant to allow for future label extensions beyond just xn--, such as for future versions of IDNA. Some implementations appear to consider such extensions unlikely, and allow labels such as "r3---sn-apo3qvuoxuxbt-j5pe".

Any particular application may have tighter validity criteria, as discussed in Section 3, Conformance.

4.1.1 UseSTD3ASCIIRules

If UseSTD3ASCIIRules=false, then the validity tests for ASCII characters are not provided by the table Status values, but are implementation-dependent. For example, if an implementation allows the characters [\u002Da-zA-Z0-9] and also the underbar ( _ ), then it needs to use the table values for UseSTD3ASCIIRules=false, and test for any other ASCII characters as part of its validity criteria. These ASCII characters may have resulted from a mapping: for example, a U+005F ( _ ) LOW LINE (underbar) may have originally been a U+FF3F ( ＿ ) FULLWIDTH LOW LINE.
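
An implementation-specific ASCII check of the kind described above might look like the following sketch. The allowed set (letters, digits, hyphen-minus, and underbar) is taken from the example in the text; the name disallowed_ascii is hypothetical, and the check is meant to run on the mapped string, alongside the table-driven checks for non-ASCII characters.

```python
import re

# ASCII characters this hypothetical implementation accepts:
# [\u002Da-zA-Z0-9] plus the underbar.
_ALLOWED_ASCII = re.compile(r"[-a-zA-Z0-9_]")

def disallowed_ascii(label: str) -> list:
    """Return the ASCII characters in the label that are not allowed."""
    return [ch for ch in label if ch.isascii() and not _ALLOWED_ASCII.fullmatch(ch)]

print(disallowed_ascii("my_host"))
print(disallowed_ascii("a!b"))
```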

There are currently no non-ASCII characters with the Status value disallowed_STD3_valid.

4.1.2 Right-to-Left Scripts

In addition, the label should meet the requirements for right-to-left characters specified in the Right-to-Left Scripts document of [IDNA2008], and for the CONTEXTJ requirements in the Protocol document of [IDNA2008]. It is strongly recommended that Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39] be consulted for information on dealing with confusables, and for characters that should be excluded from identifiers. Note that the recommended exclusions are a superset of those in [IDNA2008].

4.2 ToASCII

The operation corresponding to ToASCII of [RFC3490] is defined by the following steps:

Input

Processing

  1. To the input domain_name, apply the Processing Steps in Section 4, Processing, using the input boolean flags Transitional_Processing, CheckHyphens, CheckBidi, CheckJoiners, and UseSTD3ASCIIRules. This may record an error.
  2. Break the result into labels at U+002E FULL STOP.
  3. Convert each label with non-ASCII characters into Punycode [RFC3492], and prefix by “xn--”. This may record an error.
  4. If the VerifyDnsLength flag is true, then verify DNS length restrictions. This may record an error. For more information, see [STD13] and [STD3].
    1. The length of the domain name, excluding the root label and its dot, is from 1 to 253.
    2. The length of each label is from 1 to 63.
      • Note: Technically, a complete domain name ends with an empty label for the DNS root (see [STD13] [RFC1034] section 3). This empty label, and the trailing dot, is almost always omitted.
      • When VerifyDnsLength is false, the empty root label is passed through.
      • When VerifyDnsLength is true, the empty root label is disallowed. This corresponds to the syntax in [RFC1034] section 3.5 Preferred name syntax which also defines the label length restrictions.
  5. If an error was recorded in steps 1-4, then the operation has failed and a failure value is returned. No DNS lookup should be done.
  6. Otherwise join the labels using U+002E FULL STOP as a separator, and return the result.
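
Steps 2 through 6 can be sketched as follows, assuming step 1 has already been applied without error and its output is passed in as the hypothetical parameter processed. The function name and the use of None as the failure value are choices made for this illustration.

```python
def to_ascii_after_processing(processed: str, verify_dns_length: bool = True):
    """Steps 2-6 of ToASCII; returns None as the failure value."""
    labels = processed.split(".")                       # Step 2: break into labels
    encoded = []
    for label in labels:                                # Step 3: Punycode non-ASCII labels
        if label.isascii():
            encoded.append(label)
            continue
        try:
            encoded.append("xn--" + label.encode("punycode").decode("ascii"))
        except UnicodeError:
            return None                                 # Step 5: error recorded
    if verify_dns_length:                               # Step 4: DNS length restrictions
        name = ".".join(encoded)
        if not 1 <= len(name) <= 253:
            return None
        # An empty root label (trailing dot) fails the 1..63 check here.
        if any(not 1 <= len(label) <= 63 for label in encoded):
            return None
    return ".".join(encoded)                            # Step 6: join and return

print(to_ascii_after_processing("bücher.de"))
print(to_ascii_after_processing("example.com."))
print(to_ascii_after_processing("example.com.", verify_dns_length=False))
```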

Implementations are advised to apply additional tests to these labels, such as those described in Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39], and take appropriate actions. For example, a label with mixed scripts or confusables may be called out in the UI. Note that the use of Punycode to signal problems may be counter-productive, as described in [UTR36].

4.3 ToUnicode

The operation corresponding to ToUnicode of [RFC3490] is defined by the following steps:

Input

Processing

  1. To the input domain_name, apply the Processing Steps in Section 4, Processing, using the input boolean flags Transitional_Processing, CheckHyphens, CheckBidi, CheckJoiners, and UseSTD3ASCIIRules. This may record an error.
  2. Like [RFC3490], this will always produce a converted Unicode string. Unlike ToASCII of [RFC3490], this always signals whether or not there was an error.
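The contract in step 2 (a string is always produced, and an error is always signaled separately) can be illustrated with a small Python sketch. It is deliberately partial and is only an assumption about how one might structure the return value: it covers just the Punycode conversion of "xn--" labels and the NFC check on the decoded result, and it omits the mapping step and all other validity criteria of Section 4. The function name `to_unicode_sketch` is hypothetical.

```python
import unicodedata

def to_unicode_sketch(domain_name):
    """Return (unicode_string, had_error); a string is always produced."""
    had_error = False
    out = []
    for label in domain_name.split("."):
        if label.startswith("xn--"):
            try:
                decoded = label[4:].encode("ascii").decode("punycode")
            except UnicodeError:
                # Invalid Punycode: keep the label unchanged, record the error.
                had_error = True
                out.append(label)
                continue
            # Punycode is not mapped, but it is validated: it must be NFC.
            if unicodedata.normalize("NFC", decoded) != decoded:
                had_error = True
            label = decoded
        out.append(label)
    return ".".join(out), had_error
```

For the inputs `xn--tda.com`, `xn--u-ccb.com`, and `xn--0.pt` from Table 2, the sketch reports no error for the first and an error for the other two, while still returning a string in every case.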

Implementations are advised to apply additional tests to these labels, such as those described in Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39], and take appropriate actions. For example, a label with mixed scripts or confusables may be called out in the UI. Note that the use of Punycode to signal problems may be counter-productive, as described in [UTR36].

4.4 Preprocessing for IDNA2008

The table specified in Section 5, IDNA Mapping Table may also be used for a pure preprocessing step for IDNA2008, mapping a Unicode string for input directly to the algorithm specified in IDNA2008.

Preprocessing for IDNA2008 is specified as follows:

Apply the Section 4.3, ToUnicode processing to the Unicode string.

Note that this preprocessing allows some characters that are invalid according to IDNA2008. However, the IDNA2008 processing will catch those characters. For example, a Unicode string containing a character listed as DISALLOWED in IDNA2008, such as U+2665 (♥) BLACK HEART SUIT, will pass the preprocessing step without an error, but subsequent application of the IDNA2008 processing will fail with an error, indicating that the string is not a valid IDN according to IDNA2008.

4.5 Implementation Notes

A number of optimizations can be applied to the Unicode IDNA Compatibility Processing. These optimizations can improve performance, reduce table size, make use of existing NFKC transform mechanisms, and so on. For example:

Note that the input domain_name string for the Unicode IDNA Compatibility Processing must have had all escaped Unicode code points converted to Unicode code points. For example, U+5341 ( 十 ) CJK UNIFIED IDEOGRAPH-5341 could have been escaped as any of the following:

Examples are shown in Table 2, Examples of Processing:

Table 2. Examples of Processing

Input | Map | Normalize | Convert | Validate | Comment
Bloß.de | bloss.de | = | n/a | ok | Transitional (deprecated): maps uppercase and sharp s
Bloß.de | bloß.de | = | n/a | ok | Nontransitional: maps uppercase
BLOẞ.de | bloß.de | = | n/a | ok | Maps uppercase
xn--blo-7ka.de | = | = | bloß.de | ok | Punycode is not mapped, so ß never changes (whether transitional or not).
u¨.com | = | ü.com | n/a | ok | Normalize changes u + umlaut to ü
xn--tda.com | = | = | ü.com | ok | Punycode xn--tda changes to ü
xn--u-ccb.com | = | = | u¨.com | error | Punycode is not mapped, but is validated. Because u + umlaut is not NFC, it fails.
a⒈com | error | error | error | error | The character "⒈" is disallowed, because it would produce a dot when mapped.
xn--a-ecp.ru | xn--a-ecp.ru | = | a⒈.ru | error | Punycode xn--a-ecp = a⒈, which fails validation.
xn--0.pt | xn--0.pt | = | error | error | Punycode xn--0 is invalid.
日本語。JP | 日本語.jp | = | n/a | ok | Fullwidth characters are remapped, including 。
☕.us | = | = | n/a | ok | Post-Unicode 3.2 characters are allowed.

5 IDNA Mapping Table

For each code point in Unicode, the IDNA Mapping Table provides one of the following Status values:

If this Status value is mapped, disallowed_STD3_mapped or deviation, the table also supplies a mapping value for that code point.

A table is provided for each version of Unicode starting with Unicode 5.1, in versioned directories under [IDNA-Table]. Each table for a version of the Unicode Standard will always be backward compatible with previous versions of the table: only characters with the Status value disallowed may change in Status or Mapping value, with the following exception:

Unicode 15.1 also changed the Status of three conditionally-disallowed characters, which is not an exception:

Unlike the IDNA2008 table, this table is designed to be applied to the entire domain name, not just to individual labels. That design provides for the IDNA2003 handling of label separators. In particular, the table is constructed to forbid problematic characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP, whose decompositions contain a "dot".
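The dot produced by such a decomposition can be observed with Python's standard `unicodedata` module. The helper name below is hypothetical, and the check only illustrates why the table forbids these characters; it is not how the table itself is derived.

```python
import unicodedata

def decomposition_contains_dot(ch):
    """True if a character other than '.' yields U+002E FULL STOP under NFKC."""
    return ch != "." and "." in unicodedata.normalize("NFKC", ch)

# U+2488 DIGIT ONE FULL STOP decomposes to the two characters "1" and ".".
print(decomposition_contains_dot("\u2488"))
print(decomposition_contains_dot("a"))
```

If such a character were simply mapped, a single label could silently turn into two labels, which is the problem the table avoids by disallowing it.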

The Unicode IDNA Compatibility Processing is based on the Unicode character mapping property [NFKC_Casefold]. Section 6, Mapping Table Derivation describes the derivation of these tables. Like derived properties in the Unicode Character Database, the description of the derivation is informative. Only the data in IDNA Mapping Table is normative for the application of this specification.

The files use a semicolon-delimited format similar to those in the Unicode Character Database [UAX44]. The field values are listed in Table 2b, Data File Fields:

Table 2b. Data File Fields

Num | Field | Description
0 | Code point(s) | Hex value or range of values.
1 | Status | valid, ignored, mapped, deviation, disallowed, disallowed_STD3_valid, or disallowed_STD3_mapped
2 | Mapping | Hex value(s). Only present if the Status is ignored, mapped, deviation, or disallowed_STD3_mapped.
3 | IDNA2008 Status | There are two values: NV8 and XV8. NV8 is only present if the Status is valid but the character is excluded by IDNA2008 from all domain names for all versions of Unicode. XV8 is present when the character is excluded by IDNA2008 for the current version of Unicode. These are not normative values.

Example:

0000..002C    ; disallowed                    # NULL..COMMA
002D          ; valid                         # HYPHEN-MINUS
...
0041          ; mapped     ; 0061             # LATIN CAPITAL LETTER A
...
00A1..00A7    ; valid      ;      ; NV8       # INVERTED EXCLAMATION MARK..SECTION SIGN
00AD          ; ignored                       # SOFT HYPHEN
...
00DF          ; deviation  ; 0073 0073        # LATIN SMALL LETTER SHARP S
...
19DA          ; valid      ;      ; XV8       # 5.2 NEW TAI LUE THAM DIGIT ONE
...
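A line in this format can be read with a few lines of Python. The following is a minimal sketch based only on the fields in Table 2b; the function name `parse_mapping_line` and the dictionary keys are hypothetical, and lines without a semicolon (such as the "..." elisions in the example above, which are not part of the file format) are skipped.

```python
def parse_mapping_line(line):
    """Parse one mapping-table line into a dict, or None if it has no data."""
    data = line.split("#", 1)[0].strip()      # drop the trailing comment
    if not data or ";" not in data:
        return None
    fields = [field.strip() for field in data.split(";")]

    # Field 0: a single hex code point or a range "XXXX..YYYY".
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start

    # Field 2: zero or more hex code points forming the mapping value.
    mapping_field = fields[2] if len(fields) > 2 else ""
    mapping = "".join(chr(int(cp, 16)) for cp in mapping_field.split())

    return {
        "range": (start, end),
        "status": fields[1],                  # Field 1
        "mapping": mapping,
        "idna2008_status": fields[3] if len(fields) > 3 and fields[3] else None,
    }
```

For instance, the line for U+00DF above parses to the status `deviation` with the mapping value `ss`.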

6 Mapping Table Derivation

The following describes the derivation of the mapping table. This description has nothing to do with the actual mapping of labels in Section 4, Processing. Instead, this section describes the derivation of the table in Section 5, IDNA Mapping Table. That table is then normatively used for mapping in Section 4, Processing.

The derivation is described as a series of steps. Step 1 defines a base mapping; Steps 2, 3, and 4 define three sets of characters. Step 5 will modify the base mapping or the sets of characters as needed to maintain backward compatibility. The mapping and sets are all used in Step 6 to produce the mapping and Status values for the table. Step 7 removes characters whose mappings contain characters that are not valid. Each numbered step may have substeps: for example, Step 1 consists of Steps 1.1 through 1.2.

The computation is done twice, once with UseSTD3ASCIIRules=true, and once with UseSTD3ASCIIRules=false. Code points that are disallowed with UseSTD3ASCIIRules=true, but valid or mapped with UseSTD3ASCIIRules=false, are given the special Status values disallowed_STD3_valid and disallowed_STD3_mapped.
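The combination of the two computations can be sketched as a small Python function. The function name is hypothetical, and the fallback branch is an assumption: the text only specifies the case where the two runs disagree in the way described, so for every other case the sketch simply returns the Status from the UseSTD3ASCIIRules=true run.

```python
def combine_std3_status(status_when_true, status_when_false):
    """Merge the Status values from the two runs into one table Status."""
    if status_when_true == "disallowed" and status_when_false in ("valid", "mapped"):
        # Disallowed only because of the STD3 ASCII rules.
        return "disallowed_STD3_" + status_when_false
    # Assumption: otherwise the two runs agree, so either value can be used.
    return status_when_true
```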

If a Unicode property changes in a future version in a way that would affect backward compatibility, a corresponding clause will be added to Step 5 to maintain compatibility. For more information on compatibility, see Section 5, IDNA Mapping Table.

Step 1: Define a base mapping

This step specifies a base mapping, which is a mapping from each Unicode code point to sequences of zero or more code points. The value resulting from mapping a particular code point C is called the base mapping value of C. The base mapping value for C may be identical to C.

  1. Map the following exceptional characters:
    a. Map label separator characters to U+002E ( . ) FULL STOP:
      • U+FF0E ( ． ) FULLWIDTH FULL STOP
      • U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
      • U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
    b. Map all Bidi_Control characters to themselves
    c. Map U+1E9E (ẞ) LATIN CAPITAL LETTER SHARP S to U+00DF (ß) LATIN SMALL LETTER SHARP S
  2. Map each other character to its NFKC_Casefold value [NFKC_Casefold].

Unicode 6.3 adds Bidi_Control characters that were not present in Unicode 3.2. To preserve the intent of IDNA2003 in disallowing Bidi_Control characters rather than just ignoring them, Step 1.1.b was added. This step causes Step 6.3 to disallow all Bidi_Control characters.

Step 1.1.b only affects 5 new characters added in Unicode 6.3. It would also impact any new Bidi_Control characters in future versions of the standard.

Step 1.1.c (added in Unicode 15.1) maps the capital sharp s (ẞ) to the small sharp s (ß) rather than to ss because all major implementations have adopted nontransitional processing, which does not map ß to ss as in NFKC_Casefold.
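The order of the exceptions in Step 1 can be sketched in Python. Several parts are assumptions made for illustration: the function names are hypothetical, the Bidi_Control set is hard-coded here and should be checked against the Unicode Character Database, and `approx_nfkc_casefold` is only a rough stand-in built from the standard library (it does not remove default ignorable code points, so it is not the real NFKC_Casefold property).

```python
import unicodedata

LABEL_SEPARATORS = {"\uFF0E", "\u3002", "\uFF61"}
# Assumed copy of the Bidi_Control set; verify against the UCD before use.
BIDI_CONTROL = {"\u061C", "\u200E", "\u200F",
                *map(chr, range(0x202A, 0x202F)),
                *map(chr, range(0x2066, 0x206A))}

def approx_nfkc_casefold(ch):
    """Rough stand-in for NFKC_Casefold; NOT the real property."""
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded)

def base_mapping(ch, nfkc_casefold=approx_nfkc_casefold):
    """Return the base mapping value of a single character (Step 1)."""
    if ch in LABEL_SEPARATORS:      # Step 1.1.a: label separators map to "."
        return "."
    if ch in BIDI_CONTROL:          # Step 1.1.b: Bidi_Control maps to itself
        return ch
    if ch == "\u1E9E":              # Step 1.1.c: capital sharp s maps to ß
        return "\u00DF"
    return nfkc_casefold(ch)        # Step 1.2: everything else
```

The exception in Step 1.1.c is visible here: `approx_nfkc_casefold("\u1E9E")` gives `ss`, whereas `base_mapping("\u1E9E")` gives `ß`.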

Step 2: Specify the base valid set

The base valid set is defined by the sequential list of additions and subtractions in Table 3, Base Valid Set. This definition is based on the principles of IDNA2003. When applied to the repertoire of Unicode 3.2 characters, this produces a set which is closely aligned with IDNA2003.

Table 3. Base Valid Set

Formal Set Notation | Description
\P{Changes_When_NFKC_Casefolded} | Start with characters that are equal to their [NFKC_Casefold] value. This criterion excludes uppercase letters, for example, as well as characters that are unstable under NFKC normalization, and default ignorable code points. Note that according to Perl/Java syntax, \P means the inverse of \p, so these are the characters that do not change when individually mapped according to [NFKC_Casefold].
+ \u00DF | Add LATIN SMALL LETTER SHARP S (ß).
- \p{c} - \p{z} | Remove Unassigned, Controls, Private Use, Format, Surrogate, and Whitespace.
- \p{Block=Ideographic_Description_Characters} | Remove ideographic description characters.
- \u31EF | Remove IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION. This is an ideographic description character that was added in Unicode 15.1 outside the now-filled Ideographic_Description_Characters block.
- \p{ascii} + [\u002Da-zA-Z0-9] | If UseSTD3ASCIIRules = True: Remove disallowed ASCII; '-' is valid.
+ \p{ascii} - [\u002E] | If UseSTD3ASCIIRules = False: Add all ASCII except for "."

Step 3: Specify the base exclusion set

Form the base exclusion set in the following way:

  1. Start with the empty set.
  2. Add each code point C such that:
    1. According to IDNA2003, C is neither prohibited nor unassigned nor a label separator (that is, it is either valid or mapped), and
    2. According to IDNA2003, C has a different mapping than C's base mapping value specified in Step 1.
  3. Add each code point C such that:
    1. According to IDNA2003, C is prohibited, and
    2. either C is in the base valid set, or every code point in C's base mapping value is in the base valid set.

For example, for Unicode 5.2 and 6.0, the base exclusion set consists of the list that follows. The subheads (like "Case Changes") are informational, and do not represent the principle for excluding the characters listed under them.

Characters that have a different mapping in IDNA2003 (Step 3.2 above)

Characters that are disallowed in IDNA2003 (Step 3.3 above)

Step 4: Specify the deviation set

This is the set of characters that deviate between IDNA2003 and IDNA2008.

Step 5: Specify changes for backward compatibility

This set is currently empty. Adjustments to the above sets or base mapping will be made in this section if the steps would cause an already existing character to change Status or mapping under a future version of Unicode, so that backward compatibility is maintained.

Step 6: Produce the initial Status and Mapping values

For each code point:

  1. If the code point is in the deviation set
    • the Status is deviation and the mapping value is the base mapping value for that code point.
  2. Otherwise, if the code point is in the base exclusion set or is unassigned
    • the Status is disallowed and there is no mapping value in the table.
  3. Otherwise, if the code point is not a label separator and some code point in its base mapping value is not in the base valid set
    • the Status is disallowed and there is no mapping value in the table.
  4. Otherwise, if the base mapping value is an empty string
    • the Status is ignored and there is no mapping value in the table.
  5. Otherwise, if the base mapping value is the same as the code point
    • the Status is valid and there is no mapping value in the table.
  6. Otherwise,
    • the Status is mapped and the mapping value is the base mapping value for that code point.
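The cascade in Step 6 translates directly into an ordered series of tests. The Python sketch below is illustrative only: the function name and parameter names are hypothetical, code points are represented as one-character strings, and the sets and base mapping are supplied by the caller (the example uses tiny toy data, not the real sets from Steps 1 through 5).

```python
def initial_status(cp, base_map, deviation_set, base_exclusion_set,
                   unassigned, base_valid_set, label_separators):
    """Return (status, mapping) for one code point, testing Step 6 in order."""
    value = base_map[cp]
    if cp in deviation_set:                                   # Step 6.1
        return "deviation", value
    if cp in base_exclusion_set or cp in unassigned:          # Step 6.2
        return "disallowed", None
    if cp not in label_separators and any(
            c not in base_valid_set for c in value):          # Step 6.3
        return "disallowed", None
    if value == "":                                           # Step 6.4
        return "ignored", None
    if value == cp:                                           # Step 6.5
        return "valid", None
    return "mapped", value                                    # Step 6.6

# Toy data for illustration only.
toy_map = {"A": "a", "a": "a", "\u00AD": "", "\u00DF": "ss",
           "\u2488": "1.", "\u3002": "."}
toy_valid = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u00DF"}

def classify(cp):
    return initial_status(cp, toy_map, {"\u00DF"}, set(), set(),
                          toy_valid, {"\u3002"})
```

With the toy data, `classify("\u2488")` is disallowed by the Step 6.3 test because its base mapping value contains a full stop, while the label separator U+3002 skips that test and ends up mapped to ".".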

Step 7: Produce the final Status and Mapping values

After processing all code points in previous steps:

  1. Iterate through the set of characters with a Status of mapped. Make disallowed any character whose mapping value is not wholly in the union of the valid set and the deviation set.
  2. Recursively apply these actions until there are no more Status changes.
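Step 7 amounts to iterating to a fixed point. A minimal Python sketch follows, assuming the table from Step 6 is held as a dictionary from one-character strings to `(status, mapping)` pairs; the function name and that representation are hypothetical.

```python
def finalize_statuses(table):
    """Apply Step 7 to a dict of code point -> (status, mapping)."""
    table = dict(table)                 # work on a copy
    changed = True
    while changed:                      # Step 7.2: repeat until stable
        changed = False
        allowed = {cp for cp, (status, _) in table.items()
                   if status in ("valid", "deviation")}
        for cp, (status, mapping) in table.items():
            # Step 7.1: a mapped character whose mapping value is not
            # wholly valid or deviation becomes disallowed.
            if status == "mapped" and any(c not in allowed for c in mapping):
                table[cp] = ("disallowed", None)
                changed = True
    return table
```

The loop reassigns values for existing keys only, so the dictionary can be updated safely while it is being iterated.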

For example, for Unicode 15.1, the set of characters set to disallowed in Step 7 consists of the following:

Note: Characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP are disallowed by Step 6.3.

Note: In Unicode versions 15.0 and earlier, with UseSTD3ASCIIRules = True three additional characters were disallowed in this step: U+2260 (≠), U+226E (≮), and U+226F (≯). This was based on their canonical decompositions (NFD) containing characters that are not valid under that setting; that test was unnecessary for IDNA processing.

7 IDNA Comparison

Table 4, IDNA Comparisons for Unicode 11.0 illustrates the differences between the three specifications in terms of valid character repertoire for Unicode 11.0. It omits the ASCII-repertoire code points, all code points unassigned in Unicode 11.0, as well as control characters, private-use characters, and surrogate code points. It also includes label separators that are valid or mapped. The table separates the Unicode 3.2 characters from those encoded later, because they have a special status in IDNA2003. It also separates buckets where UTS #46 and IDNA2008 behave the same from those where they behave differently.

Each row in the table defines a bucket of code points that share a pattern of behavior across the three specifications. The columns provide the following information:

Table 4. IDNA Comparisons for Unicode 11.0

Row | Count | IDNA2003 | UTS46 | IDNA2008 | Comment and Examples
Unicode 3.2 (IDNA2003 = UTS46 = IDNA2008)
a | 86,676 | Valid | Valid | Valid | Valid in all three. U+00E0 ( à ) LATIN SMALL LETTER A WITH GRAVE
b | 431 | Disallowed | Disallowed | Disallowed | Disallowed in all three. U+FF01 ( ！ ) FULLWIDTH EXCLAMATION MARK
Unicode 3.2 (IDNA2003 ≠ UTS46 = IDNA2008)
c | 48 | Valid | Disallowed | Disallowed | Mappings changed after Unicode 3.2. U+2132 ( Ⅎ ) TURNED CAPITAL F
d | 8 | Mapped | Disallowed | Disallowed | Mappings changed after Unicode 3.2. U+2F868 ( 㛼 ) CJK Compatibility Ideographs
Unicode 3.2 (IDNA2003 = UTS46 ≠ IDNA2008)
e | 4,640 | Mapped / Ignored | Mapped / Ignored | Disallowed | Case and compatibility variants, default ignorables. U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE
f | 3,254 | Valid | Valid | Disallowed | Punctuation, symbols, ... U+2665 ( ♥ ) BLACK HEART SUIT
g | 4 | Mapped / Ignored | Display: Valid; Lookup: Mapped / Ignored | Valid | Deviations: U+200C ZERO WIDTH NON-JOINER, U+200D ZERO WIDTH JOINER, U+00DF ( ß ) LATIN SMALL LETTER SHARP S, U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
Unicode 4.0 to Unicode 11.0 (UTS46 = IDNA2008)
h | 36,045 | Lookup Valid | Valid | Valid* | U+0221 ( ȡ ) LATIN SMALL LETTER D WITH CURL
i | 141 | Lookup Valid | Disallowed | Disallowed | U+0602 ( ؂ ) ARABIC FOOTNOTE MARKER
Unicode 4.0 to Unicode 11.0 (UTS46 ≠ IDNA2008)
j | 4,757 | Lookup Valid | Valid | Disallowed | U+2615 ( ☕ ) HOT BEVERAGE
k | 1,275 | Lookup Valid | Mapped / Ignored | Disallowed | U+023A ( Ⱥ ) LATIN CAPITAL LETTER A WITH STROKE

The table only includes counts up to Unicode 11.0. A detailed online listing of differences is found at [DemoIDNChars] and [DemoIDN]. The implications for confusability can be seen at [DemoConf].

7.1 Implications for Implementers

Table 4, IDNA Comparisons for Unicode 11.0 can also be used to categorize implications for implementers.

If any characters are Mapped/Ignored in any specification (Rows (d), (e), (k)), then in the other specifications they are either Mapped/Ignored in precisely the same way, or they are Disallowed. This prevents domain names from being mapped differently on different browsers: either the characters map to the same result, or they do not work. Row (k) is unproblematic in this regard, assuming that registries follow one of the specifications, because characters like U+023A ( Ⱥ ) will not be valid in registered labels.

Note: The transition is complete in practice for the four problematic Deviations in Row (g). All major implementations treat them as Valid in UTS46, just like in IDNA2008.

This presumes that IDNA2008 implementations do not use custom, incompatible mappings: that is, that they do not take advantage of the fact that arbitrary mappings are allowed in IDNA2008, and choose a mapping that is incompatible with IDNA2003 or UTS #46. This pertains to any of Rows (e), (f), (j), (k). If custom mappings were used by any significant client base, it would result in serious problems for security and interoperability. For more information, see the [IDN-FAQ].

With the exception of the above issues, implementation is straightforward:

8 Conformance Testing

A conformance testing file (IdnaTestV2.txt) is provided for each version of Unicode starting with Unicode 6.0, in versioned directories under [IDNA-Table]. It only provides test cases for UseSTD3ASCIIRules=true.

8.1 Format

The test file is UTF-8, with certain characters escaped using the \uXXXX or \x{XXXX} convention for readability. The details are in the header of the test file.

8.2 Testing Conformance

To test for conformance to UTS #46, an implementation will perform the toUnicode, toAsciiN, and toAsciiT operations on the source string, then verify the resulting strings and relevant Status values. The details are in the header of the test file.

Implementations may be more strict than the default settings for UTS46. In particular, an implementation conformant to IDNA2008 would disallow the input for lines marked with NV8. Implementations need only record that there is an error: they need not reproduce the precise Status codes (after removing any ignored Status values).

8.3 Migration

The test format and file name changed in Version 11.0 so that it could express a variety of different combinations of input options that people needed. The new format allows the testing implementation to test for precisely the results of its combination of supported flags, by filtering out Status codes that correspond to an unsupported input flag. The value XV8 was also removed, since it was not very useful in practice.

The following illustrate the differences between the old and new format. The set of examples is not exhaustive, but shows how there is more information available for the same examples.

Old-format sample lines:

T;  Faß.de;     faß.de;     fass.de
N;  Faß.de;     faß.de;     xn--fa-hia.de
B;  Bücher.de;  bücher.de;  xn--bcher-kva.de
B;  à\u05D0;    [B5 B6];    [B5 B6]
B;  a。。b;      [A4_2];     [A4_2]

New-format sample lines:

Faß.de;     faß.de;     [];       xn--fa-hia.de;     ;  fass.de;
Bücher.de;  bücher.de;  [];       xn--bcher-kva.de;  ;  ;
à\u05D0;    àא;         [B5 B6];  xn--0ca24w;        ;  ;
a。。b;      a..b;       [A4_2];   a..b;              ;  ;
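A new-format line can be read with a short Python sketch. The column meanings used here (source, toUnicode, toUnicodeStatus, toAsciiN, toAsciiNStatus, toAsciiT, toAsciiTStatus) and the rule that a blank field inherits from an earlier column are assumptions taken to match the sample lines above; the header of the test file is the authority and should be checked. The function names and dictionary keys are hypothetical.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x\{([0-9A-Fa-f]+)\}")

def unescape(text):
    """Replace \\uXXXX and \\x{XXXX} escapes with the characters they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

def parse_status(field, default):
    """Parse '[B5 B6]' into a set of codes; a blank field means `default`."""
    if not field:
        return default
    return set(field.strip("[]").replace(",", " ").split())

def parse_test_line(line):
    """Parse one new-format test line, or return None if it has no data."""
    data = line.split("#", 1)[0]
    if not data.strip():
        return None
    f = [unescape(part.strip()) for part in data.split(";")]
    f += [""] * (7 - len(f))                       # pad missing columns
    to_unicode = f[1] or f[0]                      # blank: same as source
    unicode_status = parse_status(f[2], set())     # blank: no errors
    ascii_n = f[3] or to_unicode                   # blank: same as toUnicode
    n_status = parse_status(f[4], unicode_status)  # blank: inherit
    ascii_t = f[5] or ascii_n                      # blank: same as toAsciiN
    t_status = parse_status(f[6], n_status)        # blank: inherit
    return {"source": f[0], "toUnicode": to_unicode,
            "toUnicodeStatus": unicode_status,
            "toAsciiN": ascii_n, "toAsciiNStatus": n_status,
            "toAsciiT": ascii_t, "toAsciiTStatus": t_status}
```

An implementation that does not support a given input flag could then filter the parsed status sets before comparing them with its own results.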

9 IDNA Derived Property

To facilitate comparison between versions of the Unicode Character Database and to highlight the implications for the addition of new characters and changes of character properties, the Unicode Technical Committee has prepared a collection of IDNA Derived Property data files. These data files are permanently posted at [IDNA-Derived].

For each version of the Unicode Standard starting with Unicode 6.1.0, the value of the enumerated IDNA2008_Category property is calculated and listed explicitly in a separate data file. This property matches the "IDNA Derived Property" as defined in RFC 5892 (see [IDNA2008]). The explicit listing is provided as a convenience for implementers. It is the result of performing the exact calculations defined in RFC 5892 concurrent with the release of each version of the Unicode Character Database.

RFC 5892 gives a list of code points for which the derivation is overridden by exceptional values. All known exceptions are applied when a data file is created, but exceptions added in future updates of the IDNA protocol are not applied retroactively.

The format of these IDNA Derived Property data files is modeled closely on that specified in Appendix B.1 of RFC 5892, except that the comment section of each line is not truncated at column 72. For example, excerpted from RFC 5892:

007B..00B6  ; DISALLOWED  # LEFT CURLY BRACKET..PILCROW SIGN
00B7        ; CONTEXTO    # MIDDLE DOT
00B8..00DE  ; DISALLOWED  # CEDILLA..LATIN CAPITAL LETTER THORN
00DF..00F6  ; PVALID      # LATIN SMALL LETTER SHARP S..LATIN SMALL LETT

Compare the same ranges excerpted from the data files:

007B..00B6  ; DISALLOWED  # LEFT CURLY BRACKET..PILCROW SIGN
00B7        ; CONTEXTO    # MIDDLE DOT
00B8..00DE  ; DISALLOWED  # CEDILLA..LATIN CAPITAL LETTER THORN
00DF..00F6  ; PVALID      # LATIN SMALL LETTER SHARP S..LATIN SMALL LETTER O WITH DIAERESIS

This close match in format is designed to simplify scripted comparison between these IDNA Derived Property data files posted at unicode.org and other existing calculated listings based on RFC 5892 that have been posted at IANA or elsewhere.
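A scripted comparison of this kind might look like the following Python sketch. It assumes only the line layout shown in the excerpts above (code point or range, semicolon, value, optional "#" comment); the function names are hypothetical, and expanding every range into individual code points is a simplification chosen for clarity rather than efficiency.

```python
def parse_derived(lines):
    """Expand 'XXXX..YYYY ; VALUE # comment' lines into {code_point: value}."""
    values = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # comments are ignored
        if not data:
            continue
        cps, value = (part.strip() for part in data.split(";", 1))
        first, _, last = cps.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            values[cp] = value
    return values

def diff_listings(a_lines, b_lines):
    """Return {code_point: (value_in_a, value_in_b)} where the two differ."""
    a, b = parse_derived(a_lines), parse_derived(b_lines)
    return {cp: (a.get(cp), b.get(cp))
            for cp in sorted(a.keys() | b.keys())
            if a.get(cp) != b.get(cp)}
```

Because comments are dropped before comparison, the truncated RFC 5892 excerpt and the untruncated data file excerpt above compare as identical.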

Acknowledgments

Mark Davis and Michel Suignard authored the bulk of the text of this document, under direction from the Unicode Technical Committee. For their contributions of ideas or text to this specification, the editors thank Julie Allen, Matitiahu Allouche, Peter Constable, Craig Cummings, Martin Dürst, Peter Edberg, Asmus Freytag, Deborah Goldsmith, Laurentiu Iancu, Gervase Markham, Simon Montagu, Lisa Moore, Eric Muller, Simon Sapin, Murray Sargent, Markus Scherer, Jungshik Shin, Shawn Steele, Erik van der Poel, Chris Weber, and Ken Whistler. The specification builds upon [IDNA2008], developed in the IETF Idna-update working group, especially contributions from Matitiahu Allouche, Harald Alvestrand, Vint Cerf, Martin J. Dürst, Lisa Dusseault, Patrik Fältström, Paul Hoffman, Cary Karp, John Klensin, and Peter Resnick, and also upon [IDNA2003], authored by Marc Blanchet, Adam Costello, Patrik Fältström, and Paul Hoffman.

References

[Bortzmeyer] http://www.bortzmeyer.org/idn-et-phishing.html

The most interesting studies cited there (originally from Mike Beltzner of Mozilla) are:
[DemoConf] https://util.unicode.org/UnicodeJsps/confusables.jsp
[DemoIDN] https://util.unicode.org/UnicodeJsps/idna.jsp
[DemoIDNChars] https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&abb=on&g=uts46+idna+idna2008
[IDNA2003] The IDNA2003 specification is defined by a cluster of IETF RFCs:
[IDNA2008] The IDNA2008 specification is defined by a cluster of IETF RFCs: There is also an informative document:
[IDNA-Derived] https://www.unicode.org/Public/idna2008derived
[IDNA-Table] https://www.unicode.org/Public/idna
[IDN-FAQ] https://www.unicode.org/faq/idn.html
[NFKC_Casefold] The Unicode property specified in [UAX44], and defined by the data in DerivedNormalizationProps.txt (search for "NFKC_Casefold").
[RFC1034] P. Mockapetris, "Domain names - concepts and facilities", RFC 1034, November 1987.
https://www.rfc-editor.org/info/rfc1034
[RFC3454] P. Hoffman, M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002.
https://www.rfc-editor.org/info/rfc3454
[RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.
https://www.rfc-editor.org/info/rfc3490
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.
https://www.rfc-editor.org/info/rfc3491
[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003.
https://www.rfc-editor.org/info/rfc3492
[RZLGR5]Integration Panel,"Root Zone Label Generation Rules — LGR-5", 22 May 2022.
https://www.icann.org/sites/default/files/lgr/rz-lgr-5-overview-26may22-en.pdf
[SafeBrowsing] http://code.google.com/apis/safebrowsing/
[Stability] Unicode Consortium Stability Policies
https://www.unicode.org/policies/stability_policy.html
[STD3] Braden, R., "Requirements for Internet Hosts -- Communication Layers", STD 3, RFC 1122, and "Requirements for Internet Hosts -- Application and Support", STD 3, RFC 1123, October 1989.
https://www.rfc-editor.org/info/std3
[STD13] Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034 and "Domain names - implementation and specification", STD 13, RFC 1035, November 1987.
https://www.rfc-editor.org/info/std13
[UAX44] UAX #44: Unicode Character Database
https://www.unicode.org/reports/tr44/
[Unicode] The Unicode Standard
For the latest version, see:
https://www.unicode.org/versions/latest/
[UTR36] UTR #36: Unicode Security Considerations
https://www.unicode.org/reports/tr36/
[UTS18] UTS #18: Unicode Regular Expressions
https://www.unicode.org/reports/tr18/
[UTS39] UTS #39: Unicode Security Mechanisms
https://www.unicode.org/reports/tr39/

Modifications

The following summarizes modifications from the previous published version of this document.

Revision 31

Modifications for previous versions are listed in those respective versions.


© 2023 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

