Unicode® Technical Standard #46

Unicode IDNA Compatibility Processing

Version	15.1.0
Editors	Mark Davis (markdavis@google.com), Michel Suignard (michel@suignard.com)
Date	2023-09-05
This Version	https://www.unicode.org/reports/tr46/tr46-31.html
Previous Version	https://www.unicode.org/reports/tr46/tr46-29.html
Latest Version	https://www.unicode.org/reports/tr46/
Latest Proposed Update	https://www.unicode.org/reports/tr46/proposed.html
Revision	31

Summary

Client software, such as browsers and emailers, faced a difficult transition from the version of international domain names approved in 2003 (IDNA2003), to the revision approved in 2010 (IDNA2008). The specification in this document has been providing a mechanism that minimizes the impact of this transition for client software, allowing client software to access domains that are valid under either system.

The specification provides two main features: One is a comprehensive mapping to support current user expectations for casing and other variants of domain names. Such a mapping is allowed by IDNA2008. The second is a compatibility mechanism that supports the existing domain names that were allowed under IDNA2003. This second feature was intended to improve client behavior during the transition period.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Introduction

One of the great strengths of domain names is universality. The URL http://Apple.com goes to Apple's website from anywhere in the world, using any browser. The email address markdavis@google.com can be used to send email to an editor of this specification from anywhere in the world, using any emailer.

Initially, domain names were restricted to ASCII characters. This was a significant burden on people using other characters. Suppose, for example, that the domain name system had been invented by Greeks, and one could only use Greek characters in URLs. Rather than apple.com, one would have to write something like αππλε.κομ. An English speaker would not only have to be acquainted with Greek characters, but would also have to pick those Greek letters that would correspond to the desired English letters. One would have to guess at the spelling of particular words, because there are not exact matches between scripts.

Most of the world’s population faced this situation until recently, because their languages use non-ASCII characters. A system was introduced in 2003 for internationalized domain names (IDN). This system is called Internationalizing Domain Names for Applications, or IDNA2003 for short. This mechanism supports IDNs by means of a client software transformation into a format known as Punycode. A revision of IDNA was approved in 2010 (IDNA2008). This revision has a number of incompatibilities with IDNA2003.

The incompatibilities forced implementers of client software, such as browsers and emailers, to face difficult choices during the transition period as registries shifted from IDNA2003 to IDNA2008. This document specifies a mechanism that has minimized the impact of this transition for client software, allowing client software to access domains that are valid under either system.

The specification provides two main features. The first is a comprehensive mapping to support current user expectations for casing and other variants of domain names. Such a mapping is allowed by IDNA2008. The second feature is a compatibility mechanism that supports the existing domain names that were allowed under IDNA2003. This second feature is intended to improve client behavior during the transition period. This specification contains both normative and informative material. Only the conformance clauses and the text that they directly or indirectly reference are considered normative.

1.1 IDNA2003

The series of RFCs collectively known as IDNA2003 [IDNA2003] allows domain names to contain non-ASCII Unicode characters, which includes not only the characters needed for Latin-script languages other than English (such as Å, Ħ, or Þ), but also different scripts, such as Greek, Cyrillic, Tamil, or Korean. An internationalized domain name such as Bücher.de can then be used in an "internationalized" URL, called an IRI, such as http://Bücher.de#titel.

The IDNA mechanism for allowing non-ASCII Unicode characters in domain names involves applying the following steps to each label in the domain name that contains Unicode characters:

Transforming (mapping) a Unicode string to remove case and other variant differences.
Checking the resulting string for validity, according to certain rules.
Transforming the Unicode characters into a DNS-compatible ASCII string using a specialized encoding called Punycode [RFC3492].

For example, typing the IRI http://Bücher.de into the address bar of any modern browser goes to a corresponding site, even though the "ü" is not an ASCII character. This works because the IDN in that IRI resolves to the Punycode string which is actually stored by the DNS for that site. Similarly, when a browser interprets a web page containing a link such as <a href="http://Bücher.de">, the appropriate site is reached. (In this document, phrases such as "a browser interprets" refer to domain names parsed out of IRIs entered in an address bar as well as to those contained in links internal to HTML text.)

In the case of IDN Bücher.de, the Punycode value actually used for the domain names on the wire is xn--bcher-kva.de. The Punycode version is also typically transformed back into Unicode form for display. The resulting display string will be a string which has already been mapped according to the IDNA2003 rules. This example results in a display string for the IRI that has been casefolded to lowercase:

http://Bücher.de → http://xn--bcher-kva.de → http://bücher.de
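The round trip above can be reproduced with Python's built-in "idna" codec, which implements the IDNA2003 rules of RFC 3490 rather than UTS #46. This is only a minimal sketch for illustration; the variable names are arbitrary.

```python
# Python's standard "idna" codec implements IDNA2003 (RFC 3490), not UTS #46.
unicode_name = "B\u00fccher.de"  # Bücher.de

# ToASCII: map (case fold), validate, then Punycode-encode non-ASCII labels.
wire_form = unicode_name.encode("idna").decode("ascii")

# ToUnicode: decode the Punycode labels back into Unicode for display.
display_form = wire_form.encode("ascii").decode("idna")

print(wire_form)  # xn--bcher-kva.de
# display_form is now the lowercase string bücher.de
```

The display form differs from the input only in case, which matches the casefolded display string described above.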

A major limitation of IDNA2003 is its restriction to the repertoire of characters in Unicode 3.2, which means that some modern languages are excluded or not fully supported. Furthermore, within the constraints of IDNA2003, there is no simple way to extend the repertoire. IDNA2003 also does not make it clear to users of registries exactly which string they are registering for a domain name (between Bücher.de and bücher.de, for example).

1.2 IDNA2008

In early 2010, a new version of IDNA was approved. Like IDNA2003, this version consists of a collection of RFCs and is called IDNA2008 [IDNA2008]. IDNA2008 is intended to solve the major problems in IDNA2003. It extends the valid repertoire of characters in domain names, and establishes an automatic process for updating to future versions of the Unicode Standard. Furthermore, it defines the concept of a valid domain name clearly, so that registrants understand exactly what domain name string is being registered.

Processing in IDNA2008 is identical to IDNA2003 for many common domain names. Both IDNA2003 and IDNA2008 transform a Unicode domain name in an IRI (like http://öbb.at) to the Punycode version (like http://xn--bb-eka.at). However, IDNA2008 does not maintain strict backward compatibility with IDNA2003. The main differences are:

Additions. Some IDNs are invalid in IDNA2003, but valid in IDNA2008.
Subtractions. Some IDNs are valid in IDNA2003, but invalid in IDNA2008.
Deviations. Some IDNs are valid in both, but resolve to different destinations.

For more details, see Section 7, IDNA Comparison.

1.3 Transition Considerations

The differences between IDNA2008 and IDNA2003 may cause interoperability and security problems. They affect extremely common characters, such as all uppercase characters, all halfwidth or fullwidth characters (commonly used in Japan, China, and Korea), and certain other characters like the German eszett (U+00DF ß LATIN SMALL LETTER SHARP S) and Greek final sigma (U+03C2 ς GREEK SMALL LETTER FINAL SIGMA).

1.3.1 Mapping

IDNA2003 requires a mapping phase, which maps ÖBB.at to öbb.at, for example. Mapping typically involves mapping uppercase characters to their lowercase pairs, but it also involves other types of mappings between equivalent characters, such as mapping halfwidth katakana characters to normal katakana characters in Japanese. The mapping phase in IDNA2003 was included to match the case insensitivity of ASCII domain names. Users are accustomed to having both CNN.com and cnn.com work identically. They expect domain names with accents to have the same casing behavior, so that ÖBB.at is the same as öbb.at. There are variations similar to case differences in other scripts. The IDNA2003 mapping is based on data specified in the Unicode Standard, Version 3.2; this mapping was later formalized as the Unicode property [NFKC_Casefold].

Note that case-folding generates a stable form of a string that erases functional case-differences. It is not the same as lowercasing. In particular, the lowercase Cherokee characters added in Unicode Version 8.0 are case-folded to their uppercase counterparts.

IDNA2008 does not require a mapping phase, but does permit one (called "Local Mapping" or "Custom Mapping"). For more information on the permitted mappings, see the Protocol document of [IDNA2008], Section 4.2, Permitted Character and Label Validation and Section 5.2, Conversion to Unicode.

The UTS #46 specification defines a mapping consistent with the normative requirements of the IDNA2008 protocol, and which is as compatible as possible with IDNA2003. For client software, this provides behavior that is the most consistent with user expectations about the handling of domain names with existing data: namely, that domain names will map consistently both on clients supporting IDNA2003 and on clients supporting IDNA2008 with the UTS #46 mapping.

1.3.2 Deviations

There are a few situations where the use of IDNA2008 without compatibility mapping will result in the resolution of IDNs to different IP addresses than under IDNA2003, unless the registry or registrant takes special action. This affects a very small number of characters, but because these characters are very common in particular languages, a significant number of domain names in those languages are affected. This set of characters is referred to as "Deviations" and is shown in Table 1, Deviation Characters, illustrated in the context of IRIs.

Table 1. Deviation Characters

Char	Example	IDNA2003 Result	IDNA2008 Result
ß `00DF`	href="http://faß.de"	http://fass.de → http://fass.de	http://faß.de → http://xn--fa-hia.de
ς `03C2`	href="http://βόλος.com"	http://βόλοσ.com → http://xn--nxasmq6b.com	http://βόλος.com → http://xn--nxasmm1c.com
ZWJ `200D`	href="http://ශ්‍රී.com"	http://ශ්රී.com→ http://xn--10cl1a0b.com	http://ශ්‍රී.com→ http://xn--10cl1a0b660p.com
ZWNJ `200C`	href="http://نامه‌ای.com"	http://نامهای.com→ http://xn--mgba3gch31f.com	http://نامه‌ای.com→ http://xn--mgba3gch31f060k.com

For more information on the rationale for the occurrence of these Deviations in IDNA2008, see the [IDN FAQ].
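The first row of Table 1 can be illustrated with a short sketch. The IDNA2003 result comes from Python's built-in "idna" codec (an IDNA2003 implementation). The IDNA2008-style result is only approximated here by Punycode-encoding the unmapped label, with no validation; the helper name punycode_label is hypothetical.

```python
# IDNA2003 view: the built-in "idna" codec maps ß to "ss" before lookup.
idna2003 = "fa\u00df.de".encode("idna").decode("ascii")

def punycode_label(label: str) -> str:
    """Hypothetical helper: encode one label without any mapping or checks."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

# IDNA2008 view (approximation): ß is kept, so the label needs Punycode.
idna2008 = ".".join(punycode_label(part) for part in "fa\u00df.de".split("."))

print(idna2003)  # fass.de
print(idna2008)  # xn--fa-hia.de
```

The same input therefore produces two different names on the wire, which is what makes a character a Deviation.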

The differences in interpretation of Deviation characters result in potential for security exploits. Consider a scenario involving http://www.sparkasse-gießen.de, a German IRI containing an IDN for "Gießen Savings and Loan".

Alice's browser supports IDNA2003. Under those rules, http://www.sparkasse-gießen.de is mapped to http://www.sparkasse-giessen.de, which leads to a site with the IP address 01.23.45.67.
She visits her friend Bob, and checks her bank statement on his browser. His browser supports IDNA2008. Under those rules, http://www.sparkasse-gießen.de is also valid, but converts to a different Punycode domain name in http://www.xn--sparkasse-gieen-2ib.de. This can lead to a different site with the IP address 101.123.145.167, a spoof site.

Alice ends up at the phishing site, supplies her bank password, and her money is stolen. While the .DE registrar (DENIC) might have a policy about bundling all of the variants of ß together (so that they all have the same owner), it is not required of registries. It is unlikely that all registries will have and enforce such a bundling policy in all such cases.

There are two Deviations of particular concern. IDNA2008 allows the joiner characters (ZWJ and ZWNJ) in labels. By contrast, these are removed by the mapping in IDNA2003. When used in the intended contexts in particular scripts, the joiner characters produce a noticeable change in displayed text. However, when used between any other characters in those scripts, or in any other scripts, they are invisible. For example, when used between the Latin characters "a" and "b" there is no visible difference: the sequence "a<ZWJ>b" looks just like "ab".

Because of the visual confusability introduced by the joiner characters, IDNA2008 provides a special category for them called CONTEXTJ, and only permits CONTEXTJ characters in limited contexts: certain sequences of Arabic or Indic characters. However, applications that perform IDNA2008 lookup are not required to check for these contexts, so overall security is dependent on registries having correct implementations. Moreover, the IDNA2008 context restrictions do not catch most cases where distinct domain names have visually confusable appearances because of ZWJ and ZWNJ.
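The invisibility of a joiner between Latin letters can be shown with a small sketch. It uses Python's built-in "idna" codec for the IDNA2003 behavior and a raw Punycode encoding to show what a lookup that skips the CONTEXTJ check would send; a registry or client that does enforce the context rules would reject this label.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER
with_joiner = "a" + ZWJ + "b"

# The two strings render alike but are different code point sequences.
same_string = (with_joiner == "ab")

# IDNA2003 (built-in codec): the joiner is removed by the mapping.
idna2003 = (with_joiner + ".com").encode("idna").decode("ascii")

# Without a CONTEXTJ check, the joiner survives into a distinct A-label.
unchecked = "xn--" + with_joiner.encode("punycode").decode("ascii") + ".com"

print(same_string)  # False
print(idna2003)     # ab.com
print(unchecked != idna2003)  # True
```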

2 Unicode IDNA Compatibility Processing

To satisfy user expectations for mapping, and provide maximal compatibility with IDNA2003, this document specifies a mapping for use with IDNA2008. In addition, to transition more smoothly to IDNA2008, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.2 and later. It also incorporates the repertoire extensions provided by IDNA2008.

Where the transitional processing is not needed, UTS #46 can be used purely as a preprocessing (local mapping) for IDNA2008 by claiming conformance specifically to Conformance Clause C3.

By using this Compatibility Processing, a domain name such as ÖBB.at will be mapped to the valid domain name öbb.at, thus matching user expectation for case behavior in domain names. For transitional use, the Compatibility Processing also allows domain names containing symbols and punctuation that were valid in IDNA2003, such as √.com (which has an associated web page). Such domain names containing symbols will gradually disappear as registries shift to IDNA2008.

Implementations may also restrict or flag (in a UI) domain names that include symbols and punctuation. For more information, see Unicode Technical Report #36, Unicode Security Considerations [UTR36].

Using the Unicode IDNA Compatibility Processing to transform an IDN into a form suitable for DNS lookup is similar to the tactic of "try IDNA2008 then try IDNA2003". However, this approach avoids a potentially problematic dual lookup. It allows browsers and other clients, such as search engines, to have a single processing step, without the burden of maintaining two different implementations and multiple tables. It accounts for a number of edge cases that would cause problems, and provides a stable definition with predictable results.

The Unicode IDNA Compatibility Processing also provides alternate mappings for the Deviation characters. This facilitates the transition from IDNA2003 to IDNA2008. It is up to the registries to decide how to handle the transition, for example, by either bundling or blocking the Deviation characters that they support. In practice, for the deviation characters, the transition is complete. All major implementations have switched to nontransitional processing of the four deviation characters.

The term "registries" includes far more than top-level registries, such as for .de or .com. For example, .blogspot.com has more domain names registered than most top-level registries. There may be different policies in place for a registry and any of its subregistries. Thus millions of registries need to be considered in a transition strategy, not just hundreds.

In lookup software, transitions may be fine-grained: for example, it may be possible to transition to IDNA2008 rules regarding Deviations for .subdomain.com at a given point but not for .com, or vice versa. If .tld bundles or blocks the Deviation characters, then clients could transition Deviations for .tld, but not for (say) .subdomain.tld. Moreover, client software with a UI, such as the address bar in a browser, could provide more options for the transition. A full discussion of such transition strategies is outside of the scope of this document.

During the interim, authors of documents, such as HTML documents, can unambiguously refer to the IDNA2008 interpretation of characters by explicitly using the Punycode form of the domain name label.

There are two slightly different compatibility mechanisms for domain names during a transition and afterward. UTS #46 therefore specifies two specific types of processing: Transitional Processing (Conformance Clause C1) and Nontransitional Processing (Conformance Clause C2). The only difference between them is the handling of the four Deviation characters.

Summarized briefly, UTS #46 builds upon IDNA2008 in three areas:

Mapping. The UTS #46 mapping is used to maintain maximal compatibility and meet user expectations. It is conformant to IDNA2008, which allows for mapping input.
Symbols and Punctuation. UTS #46 supports processing of symbols and punctuation during the transition period. The transition will be smooth: as registries move to IDNA2008 the DNS lookups of IDNs with symbols will simply be refused. At that point, in practice, there is full compatibility with IDNA2008.
Deviations (deprecated). UTS #46 provides two ways of handling these to support a transition. Transitional Processing (deprecated) had been recommended to be used immediately before a DNS lookup in the circumstances where the registry does not guarantee a strategy of bundling or blocking. Nontransitional Processing, which is fully compatible with IDNA2008, should be used in all cases.

For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the [DemoIDN]. For more detail on the differences, see Section 7, IDNA Comparison. UTS #46 does not change any of the terms defined in IDNA2008, such as A-Label or U-Label.

Neither the Unicode IDNA Compatibility Processing nor IDNA2008 address security problems associated with confusables (the so-called "paypal.com" problem). IDNA2008 disallows certain symbols and punctuation characters that can be used for spoofing, such as spoofs of the slash character ("/"). However, these are an extremely small fraction of the confusable characters used for spoofing. Moreover, confusable characters themselves account for a small proportion of phishing problems: most are cases like "secure-wellsfargo.com". For more information, see [Bortzmeyer] and the [IDN FAQ]. It is strongly recommended that Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39] be consulted for information on dealing with confusables, both for client software and registries. In particular, [UTS39] provides information that can be used to drastically reduce the number of confusables when dealing with international domain names, much beyond what IDNA2008 does. See also the [DemoConf].

2.1 Display of Internationalized Domain Names

IDNA2003 applications customarily display the processed string to the user. This improves security by reducing the opportunity for visual confusability. Thus, for example, the URL http://googIe.com (with a capital I in place of the L) is revealed as http://googie.com.

2.2 Registries

This specification is primarily targeted at applications doing lookup of IDNs. There is, however, one strong recommendation for registries: do not allow the registration of labels that are invalid according to Nontransitional Processing, and do use bundling or blocking for labels containing confusable characters.

These tactics can be described as follows:

Bundling: If two or more labels are different, but confusable, and more than one is registered, the registrant for each must be the same.
Blocking: If two or more labels are different, but confusable, allow the registration of only one, and block the others. Registries that do not allow any Deviation characters at all count as blocking.

Note: Some implementations outside Unicode use different terminology for these strategies. In particular, in the ICANN Root Zone Label Generation Rules [RZLGR5], the term allocatable variant of X is used for labels that can be bundled with X, and the term blocked variant is used for a mutually exclusive label.

The label that is actually registered and inserted into a registry has always been processed. For example, xn--bcher-kva corresponds to bücher. However, it may be useful for a registry to also ask for "unprocessed" labels, such as Bücher, as part of the registration process, so that they are aware of the registrant's intent. Such unprocessed labels must be handled carefully:

Storing the unprocessed label as the sequence of characters that the registrant really wanted to apply for.
Processing the unprocessed label, and displaying the processed label to the registrant for confirmation.
Proceeding with the regular registration process using only the processed label.

2.3 Notation

Sets of code points are defined using properties and the syntax of Unicode Technical Standard #18, Unicode Regular Expressions [UTS18]. For example, the set of combining marks is represented by the syntax \p{gc=M}. Additionally, the "+" indicates the addition of elements to a set, for clarity.

In this document, a label is a substring of a domain name. That substring is bounded on both sides by either the start or the end of the string, or any of the following characters, called label-separators:

U+002E ( . ) FULL STOP
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
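As a minimal sketch of this definition, a domain name can be split into labels at any of the four label-separators with a regular expression. The function name split_labels is an assumption made for illustration.

```python
import re

# The four label-separators listed above, as a character class.
LABEL_SEPARATORS = re.compile("[\u002E\uFF0E\u3002\uFF61]")

def split_labels(domain_name: str) -> list[str]:
    """Split a prospective domain name into labels (illustrative helper)."""
    return LABEL_SEPARATORS.split(domain_name)

# U+3002 and U+FF0E act as separators just like U+002E.
print(split_labels("www\u3002example\uFF0Ecom"))  # ['www', 'example', 'com']
```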

Many people use the terms "domain names" and "hostnames" interchangeably. This document follows [RFC3490] in use of the term "domain name".

A Bidi domain name is a domain name containing at least one character with Bidi_Class R, AL, or AN. See [IDNA2008] RFC 5893, Section 1.4.
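This definition maps directly onto the Bidi_Class data exposed by Python's unicodedata module. The sketch below assumes the name is already in Unicode form (any Punycode labels decoded), and its results depend on the Unicode version bundled with the Python runtime; the function name is hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL", "AN"}

def is_bidi_domain_name(domain_name: str) -> bool:
    """True if any character has Bidi_Class R, AL, or AN (illustrative)."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in domain_name)

print(is_bidi_domain_name("example.com"))  # False
print(is_bidi_domain_name("\u05d0.com"))   # True (HEBREW LETTER ALEF, class R)
```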

3 Conformance

The requirements for conformance on implementations of the Unicode IDNA Compatibility Processing algorithm are stated in the following clauses. An implementation can claim conformance to any or all of these clauses independently.

C1 (deprecated). Given a version of Unicode and a Unicode String, a conformant implementation of Transitional Processing shall replicate the results given by applying the Transitional Processing algorithm specified by Section 4, Processing.

C2. Given a version of Unicode and a Unicode String, a conformant implementation of Nontransitional Processing shall replicate the results given by applying the Nontransitional Processing algorithm specified by Section 4, Processing.

C3. Given a version of Unicode and a Unicode String, a conformant implementation of Preprocessing for IDNA2008 shall replicate the results specified by Section 4.4, Preprocessing for IDNA2008.

These specifications are logical ones, designed to be straightforward to describe. An actual implementation is free to use different methods as long as the result is the same as that specified by the logical algorithm.

Any conformant implementation may also have tighter validity criteria than those imposed by Section 4.1, Validity Criteria. For example, an application could disallow or warn of domain name labels with certain characteristics, such as:

labels with certain combinations of scripts (Safari)
labels with characters outside of the user's specified languages (IE)
labels with certain confusable characters (Firefox)
labels that are detected by the Google Safe Browsing API [SafeBrowsing]
labels that do not meet the validity requirements of IDNA2008
labels produced by toUnicode that would not meet the label validity requirements if toASCII were performed.
labels containing characters which are not contained in the General Security Profile for Identifiers from Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39]
labels that do not satisfy Restriction Level 4, Moderately Restrictive from Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39]

For more information, see Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39].

3.1 STD3 Rules

IDNA2003 provides for a flag, UseSTD3ASCIIRules, that allows for implementations to choose whether or not to abide by the rules in [STD3]. These rules exclude ASCII characters outside the set consisting of A-Z, a-z, 0-9, and U+002D ( - ) HYPHEN-MINUS. For example, some browsers also allow characters such as U+005F ( _ ) LOW LINE (underbar) in domain names, and thus use UseSTD3ASCIIRules=false, plus their own validity checks for the other ASCII characters.

While UseSTD3ASCIIRules=true is strongly recommended, Section 5, IDNA Mapping Table provides data to allow implementations to support UseSTD3ASCIIRules=false for compatibility with IDNA2003 implementations where necessary. The mapping table does this by providing the Status values and Mapping values for both UseSTD3ASCIIRules=true and UseSTD3ASCIIRules=false. Implementations that use UseSTD3ASCIIRules=false will need to apply their own validation to the mapped values as indicated in Section 4.1, Validity Criteria.
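The STD3 character restriction can be sketched as a check on the ASCII characters of a label. This is an illustration only: the function name is hypothetical, and non-ASCII characters are deliberately left to the mapping table and the other validity criteria.

```python
import string

# ASCII characters permitted by the STD3 rules: A-Z, a-z, 0-9, HYPHEN-MINUS.
STD3_ALLOWED = set(string.ascii_letters + string.digits + "-")

def std3_violations(label: str) -> list[str]:
    """Return the ASCII characters in label that the STD3 rules exclude."""
    return [ch for ch in label if ord(ch) < 0x80 and ch not in STD3_ALLOWED]

print(std3_violations("my_host"))  # ['_']
print(std3_violations("host-1"))   # []
```

An implementation running with UseSTD3ASCIIRules=false that also accepts the underbar would run a check of this kind with "_" added to the allowed set.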

4 Processing

The input to Unicode IDNA Compatibility Processing is a prospective domain_name string expressed in Unicode, and a choice of Transitional or Nontransitional Processing. The domain name consists of a sequence of labels with dot separators, such as "Bücher.de". For more information about the composition of a URL, see Section 3.5 of [STD13].

Main Processing Steps

The following steps, performed in order, successively alter the input domain_name string and then output it as a converted Unicode string, plus a flag to indicate whether there was an error. Even if an error occurs, the conversion of the string is performed as much as is possible.

Input

A prospective domain_name expressed as a sequence of Unicode code points
A boolean flag: UseSTD3ASCIIRules
A boolean flag: CheckHyphens
A boolean flag: CheckBidi
A boolean flag: CheckJoiners
A boolean flag: Transitional_Processing (deprecated)
A boolean flag: IgnoreInvalidPunycode

Processing

1. Map. For each code point in the domain_name string, look up the Status value in Section 5, IDNA Mapping Table, and take the following actions:
   - disallowed: Leave the code point unchanged in the string. Note: The Convert/Validate step below checks for disallowed characters, after mapping and normalization.
   - ignored: Remove the code point from the string. This is equivalent to mapping the code point to an empty string.
   - mapped: If Transitional_Processing (deprecated) and the code point is U+1E9E capital sharp s (ẞ), then replace the code point in the string by “ss”. Otherwise, replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
   - deviation:
     - If Transitional_Processing (deprecated), replace the code point in the string by the value for the mapping in Section 5, IDNA Mapping Table.
     - Otherwise, leave the code point unchanged in the string.
   - valid: Leave the code point unchanged in the string.
2. Normalize. Normalize the domain_name string to Unicode Normalization Form C.
3. Break. Break the string into labels at U+002E ( . ) FULL STOP.
4. Convert/Validate. For each label in the domain_name string:
   - If the label starts with “xn--”:
     1. If the label contains any non-ASCII code point (i.e., a code point greater than U+007F), record that there was an error, and continue with the next label.
     2. Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails and if not IgnoreInvalidPunycode, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.
     3. Verify that the label meets the validity criteria in Section 4.1, Validity Criteria for Nontransitional Processing. If any of the validity criteria are not satisfied, record that there was an error.
   - If the label does not start with “xn--”:
     - Verify that the label meets the validity criteria in Section 4.1, Validity Criteria for the input Processing choice (Transitional or Nontransitional). If any of the validity criteria are not satisfied, record that there was an error.

Any input domain_name string that does not record an error has been successfully processed according to this specification. Conversely, if an input domain_name string causes an error, then the processing of the input domain_name string fails. Determining what to do with error input is up to the caller, and not in the scope of this document. The processing is idempotent: reapplying the processing to the output will make no further changes. For examples, see Table 2, Examples of Transitional Processing.
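The Map, Normalize, and Break steps can be sketched as follows. The table here is a tiny illustrative excerpt, not the real IDNA Mapping Table of Section 5: its entries, the fallback that treats every unlisted character as valid, and the names TOY_TABLE, lookup, and map_normalize_break are all assumptions made for the example.

```python
import unicodedata

# Illustrative excerpt only. Real Status and Mapping values come from the
# IDNA Mapping Table (Section 5); these entries are assumptions for the demo.
TOY_TABLE = {
    "\u00ad": ("ignored", ""),          # SOFT HYPHEN
    "\u00df": ("deviation", "ss"),      # ß
    "\u03c2": ("deviation", "\u03c3"),  # ς, transitional mapping is σ
    "\u200c": ("deviation", ""),        # ZWNJ
    "\u200d": ("deviation", ""),        # ZWJ
    "\u1e9e": ("mapped", "\u00df"),     # ẞ, mapped to ß
}

def lookup(ch: str) -> tuple[str, str]:
    if ch in TOY_TABLE:
        return TOY_TABLE[ch]
    if "A" <= ch <= "Z":
        return ("mapped", ch.lower())
    return ("valid", ch)  # simplification: all other characters pass through

def map_normalize_break(domain_name: str, transitional: bool = False) -> list[str]:
    out = []
    for ch in domain_name:
        status, mapping = lookup(ch)
        if status == "ignored":
            continue  # removed from the string
        if status == "mapped":
            # Special case for U+1E9E under (deprecated) Transitional Processing.
            out.append("ss" if transitional and ch == "\u1e9e" else mapping)
        elif status == "deviation":
            out.append(mapping if transitional else ch)
        else:  # valid or disallowed: left unchanged
            out.append(ch)
    normalized = unicodedata.normalize("NFC", "".join(out))
    return normalized.split(".")

print(map_normalize_break("Fa\u00df.DE", transitional=True))  # ['fass', 'de']
```

With the default nontransitional setting, the same input keeps the ß and yields the labels faß and de.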

Implementations may make further modifications to the resulting Unicode string when showing it to the user. For example, it is recommended that disallowed characters be replaced by a U+FFFD to make them visible to the user. Similarly, labels that fail processing during step 4 may be marked by the insertion of a U+FFFD or other visual device.

With either Transitional or Nontransitional Processing, sources already in Punycode are validated without mapping. In particular, Punycode containing Deviation characters, such as href="xn--fu-hia.de" (for fuß.de) is not remapped. This provides a mechanism allowing explicit use of Deviation characters even during a transition period.
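The handling of a label that is already in Punycode can be sketched with Python's built-in "punycode" codec. The sketch behaves as if IgnoreInvalidPunycode were false and omits the subsequent validity checks; the function name convert_label and its (label, error) return shape are assumptions.

```python
def convert_label(label: str) -> tuple[str, bool]:
    """Return (label, error_recorded) for one label; validation is omitted."""
    if not label.startswith("xn--"):
        return label, False
    if not label.isascii():
        # A non-ASCII code point after the prefix is an error.
        return label, True
    try:
        # Decode only; the result is not passed through the mapping again.
        return label[4:].encode("ascii").decode("punycode"), False
    except UnicodeError:
        return label, True

decoded, error = convert_label("xn--fu-hia")
print(decoded == "fu\u00df", error)  # True False
```

Because the decoded label is validated but not remapped, the ß survives even where a transitional mapping would have turned a literal ß into "ss".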

4.1 Validity Criteria

Each of the following criteria must be satisfied for a non-empty label:

1. The label must be in Unicode Normalization Form NFC.
2. If CheckHyphens, the label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions.
3. If CheckHyphens, the label must neither begin nor end with a U+002D HYPHEN-MINUS character.
4. If not CheckHyphens, the label must not begin with “xn--”.
5. The label must not contain a U+002E ( . ) FULL STOP.
6. The label must not begin with a combining mark, that is: General_Category=Mark.
7. Each code point in the label must only have certain Status values according to Section 5, IDNA Mapping Table:
   1. For Transitional Processing (deprecated), each value must be valid.
   2. For Nontransitional Processing, each value must be either valid or deviation.
8. If CheckJoiners, the label must satisfy the ContextJ rules from Appendix A, in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008].
9. If CheckBidi, and if the domain name is a Bidi domain name, then the label must satisfy all six of the numbered conditions in [IDNA2008] RFC 5893, Section 2.

The first 6 criteria are from [IDNA2008], except for the fourth criterion. Criterion #2 in particular is meant to allow for future label extensions beyond just xn--, such as for future versions of IDNA. Some implementations appear to consider such extensions unlikely, and allow labels such as "r3---sn-apo3qvuoxuxbt-j5pe".

Any particular application may have tighter validity criteria, as discussed in Section 3, Conformance.
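The criteria that do not need the mapping table or the IDNA2008 context rules (normalization, hyphens, full stop, leading combining mark) can be sketched with the standard library alone. The function name and error strings are illustrative, and the Status, CheckJoiners, and CheckBidi criteria are deliberately left out.

```python
import unicodedata

def basic_validity_errors(label: str, check_hyphens: bool = True) -> list[str]:
    """Check a non-empty label against the table-independent criteria."""
    errors = []
    if unicodedata.normalize("NFC", label) != label:
        errors.append("not in NFC")
    if check_hyphens:
        if label[2:4] == "--":
            errors.append("hyphen in both third and fourth positions")
        if label.startswith("-") or label.endswith("-"):
            errors.append("begins or ends with hyphen")
    elif label.startswith("xn--"):
        errors.append("begins with xn--")
    if "." in label:
        errors.append("contains full stop")
    if label and unicodedata.category(label[0]).startswith("M"):
        errors.append("begins with combining mark")
    return errors

# The label quoted above fails only when CheckHyphens is set.
print(basic_validity_errors("r3---sn-apo3qvuoxuxbt-j5pe"))
print(basic_validity_errors("r3---sn-apo3qvuoxuxbt-j5pe", check_hyphens=False))
```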

4.1.1 UseSTD3ASCIIRules

If UseSTD3ASCIIRules=false, then the validity tests for ASCII characters are not provided by the table Status values, but are implementation-dependent. For example, if an implementation allows the characters [\u002Da-zA-Z0-9] and also the underbar ( _ ), then it needs to use the table values for UseSTD3ASCIIRules=false, and test for any other ASCII characters as part of its validity criteria. These ASCII characters may have resulted from a mapping: for example, a U+005F ( _ ) LOW LINE (underbar) may have originally been a U+FF3F ( ＿ ) FULLWIDTH LOW LINE.

There are currently no non-ASCII characters with the Status value disallowed_STD3_valid.

4.1.2Right-to-LeftScripts

In addition, the label should meet the requirements for right-to-leftcharacters specified in the Right-to-Left Scripts document of [IDNA2008], and for the CONTEXTJ requirements inthe Protocol document of [IDNA2008]. It isstrongly recommended thatUnicode Technical Report #36,Unicode Security Considerations [UTR36] andUnicodeTechnical Standard #39, Unicode Security Mechanisms[UTS39] be consulted for information on dealingwith confusables, and for characters that should be excluded fromidentifiers. Note that the recommended exclusions are a superset ofthose in [IDNA2008].

4.2ToASCII

The operation corresponding to ToASCII of [RFC3490]is defined by the following steps:

Input

A prospective domain_name expressed as a sequence of Unicode code points
A boolean flag: CheckHyphens
A boolean flag: CheckBidi
A boolean flag: CheckJoiners
A boolean flag: UseSTD3ASCIIRules
A boolean flag: Transitional_Processing (deprecated)
A boolean flag: VerifyDnsLength
A boolean flag: IgnoreInvalidPunycode

Processing

To the input domain_name, apply the Processing Steps in Section 4, Processing, using the input boolean flags Transitional_Processing, CheckHyphens, CheckBidi, CheckJoiners, and UseSTD3ASCIIRules. This may record an error.
Break the result into labels at U+002E FULL STOP.
Convert each label with non-ASCII characters into Punycode [RFC3492], and prefix by “xn--”. This may record an error.
If the VerifyDnsLength flag is true, then verify DNS length restrictions. This may record an error. For more information, see [STD13] and [STD3].
1. The length of the domain name, excluding the root label and its dot, is from 1 to 253.
2. The length of each label is from 1 to 63.
  - Note: Technically, a complete domain name ends with an empty label for the DNS root (see [STD13] [RFC1034] section 3). This empty label, and the trailing dot, is almost always omitted.
  - When VerifyDnsLength is false, the empty root label is passed through.
  - When VerifyDnsLength is true, the empty root label is disallowed. This corresponds to the syntax in [RFC1034] section 3.5 Preferred name syntax which also defines the label length restrictions.
If an error was recorded in steps 1-4, then the operation has failed and a failure value is returned. No DNS lookup should be done.
Otherwise join the labels using U+002E FULL STOP as a separator, and return the result.
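
Steps 2 through 6 above can be sketched in Python using the standard library's punycode codec. This is a minimal sketch under stated assumptions: the function name to_ascii_tail is hypothetical, the input is assumed to have already gone through Step 1 (mapping, normalization, and validation), and the "failure value" is represented here by raising ValueError.

```python
def to_ascii_tail(processed: str, verify_dns_length: bool = True) -> str:
    """Apply ToASCII Steps 2-6 to a domain name already processed by Step 1."""
    # Step 2: break into labels at U+002E FULL STOP.
    labels = processed.split(".")

    # Step 3: Punycode-encode labels containing non-ASCII and prefix "xn--".
    encoded = []
    for label in labels:
        if label.isascii():
            encoded.append(label)
        else:
            encoded.append("xn--" + label.encode("punycode").decode("ascii"))

    # Step 4: optional DNS length checks. An empty label (including an
    # empty root label after a trailing dot) fails the 1..63 check.
    if verify_dns_length:
        name = ".".join(encoded)
        if not 1 <= len(name) <= 253:
            raise ValueError("domain name length not in 1..253")
        for label in encoded:
            if not 1 <= len(label) <= 63:
                raise ValueError("label length not in 1..63")

    # Steps 5-6: no error was recorded, so join and return.
    return ".".join(encoded)
```

For example, to_ascii_tail("faß.de") yields "xn--fa-hia.de", and "a.b." is returned unchanged only when verify_dns_length is false.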

Implementations are advised to apply additional tests to these labels, such as those described in Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39], and take appropriate actions. For example, a label with mixed scripts or confusables may be called out in the UI. Note that the use of Punycode to signal problems may be counter-productive, as described in [UTR36].

4.3 ToUnicode

The operation corresponding to ToUnicode of [RFC3490] is defined by the following steps:

Input

A prospective domain_name expressed as a sequence of Unicode code points
A boolean flag: CheckHyphens
A boolean flag: CheckBidi
A boolean flag: CheckJoiners
A boolean flag: UseSTD3ASCIIRules
A boolean flag: Transitional_Processing (deprecated)
A boolean flag: IgnoreInvalidPunycode

Processing

To the input domain_name, apply the Processing Steps in Section 4, Processing, using the input boolean flags Transitional_Processing, CheckHyphens, CheckBidi, CheckJoiners, and UseSTD3ASCIIRules. This may record an error.
Like [RFC3490], this will always produce a converted Unicode string. Unlike ToASCII of [RFC3490], this always signals whether or not there was an error.

Implementations are advised to apply additional tests to these labels, such as those described in Unicode Technical Report #36, Unicode Security Considerations [UTR36] and Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39], and take appropriate actions. For example, a label with mixed scripts or confusables may be called out in the UI. Note that the use of Punycode to signal problems may be counter-productive, as described in [UTR36].

4.4 Preprocessing for IDNA2008

The table specified in Section 5, IDNA Mapping Table may also be used for a pure preprocessing step for IDNA2008, mapping a Unicode string for input directly to the algorithm specified in IDNA2008.

Preprocessing for IDNA2008 is specified as follows:

Apply the Section 4.3, ToUnicode processing to the Unicode string.

Note that this preprocessing allows some characters that are invalid according to IDNA2008. However, the IDNA2008 processing will catch those characters. For example, a Unicode string containing a character listed as DISALLOWED in IDNA2008, such as U+2665 (♥) BLACK HEART SUIT, will pass the preprocessing step without an error, but subsequent application of the IDNA2008 processing will fail with an error, indicating that the string is not a valid IDN according to IDNA2008.

4.5 Implementation Notes

A number of optimizations can be applied to the Unicode IDNA Compatibility Processing. These optimizations can improve performance, reduce table size, make use of existing NFKC transform mechanisms, and so on. For example:

There is an NFC check in Section 4.1, Validity Criteria. However, it only needs to be applied to labels that were converted from Punycode into Unicode in Step 3.
A simple way to do much of the validity checking in Section 4.1, Validity Criteria is to reapply Steps 1 and 2, and verify that the result does not change.
Because the four label separators are all mapped to U+002E ( . ) FULL STOP by Step 1, the parsing of labels in Steps 3 and 4 only needs to detect U+002E ( . ) FULL STOP, and not the other label separators defined in IDNA [RFC3490].

Note that the input domain_name string for the Unicode IDNA Compatibility Processing must have had all escaped Unicode code points converted to Unicode code points. For example, U+5341 ( 十 ) CJK UNIFIED IDEOGRAPH-5341 could have been escaped as any of the following:

&#x5341; an HTML numeric character reference (NCR)
\u5341 a JavaScript escape
%E5%8D%81 a URI/IRI %-escape
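
For illustration, each of the three escaped forms can be converted back to U+5341 with the Python standard library before the string is handed to the processing. Using the JSON parser to decode the \uXXXX form is an assumption of this sketch: it stands in for a JavaScript string parser, with which it agrees for a simple escape like this one.

```python
import html
import json
from urllib.parse import unquote

# HTML numeric character reference
from_ncr = html.unescape("&#x5341;")

# \uXXXX escape, decoded here via a JSON string literal
from_js = json.loads('"\\u5341"')

# URI/IRI percent-escape of the UTF-8 bytes E5 8D 81
from_percent = unquote("%E5%8D%81")

# All three yield the single code point U+5341.
assert from_ncr == from_js == from_percent == "\u5341"
```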

Examples are shown in Table 2, Examples of Processing:

Table 2. Examples of Processing

Input	Map	Normalize	Convert	Validate	Comment
Bloß.de	bloss.de	=	n/a	ok	Transitional (deprecated): maps uppercase and sharp s
Bloß.de	bloß.de	=	n/a	ok	Nontransitional: maps uppercase
BLOẞ.de	bloß.de	=	n/a	ok	Maps uppercase
xn--blo-7ka.de	=	=	bloß.de	ok	Punycode is not mapped, so ß never changes (whether transitional or not).
u¨.com	=	ü.com	n/a	ok	Normalize changes u + umlaut to ü
xn--tda.com	=	=	ü.com	ok	Punycode xn--tda changes to ü
xn--u-ccb.com	=	=	u¨.com	*error*	Punycode is not mapped, but is validated. Because u + umlaut is not NFC, it fails.
a⒈com	*error*	*error*	*error*	*error*	The character "⒈" is disallowed, because it would produce a dot when mapped.
xn--a-ecp.ru	xn--a-ecp.ru	=	a⒈.ru	*error*	Punycode xn--a-ecp = a⒈, which fails validation.
xn--0.pt	xn--0.pt	=	*error*	*error*	Punycode xn--0 is invalid.
日本語。ＪＰ	日本語.jp	=	n/a	ok	Fullwidth characters are remapped, including 。
☕.us	=	=	n/a	ok	Post-Unicode 3.2 characters are allowed.
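
The Convert column for the Punycode rows of Table 2 can be reproduced with a short Python sketch. The function name convert_label is hypothetical; it uses the standard library's punycode codec and only adds the NFC check, so it does not perform the rest of the validation (for example, it would not reject xn--a-ecp).

```python
import unicodedata


def convert_label(label: str) -> tuple[str, bool]:
    """Decode a label that begins with "xn--"; return (label, error_flag).

    Labels without the prefix are returned unchanged. The error flag is set
    if the Punycode is invalid or the decoded label is not in NFC.
    """
    if not label.startswith("xn--"):
        return label, False
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Invalid Punycode: keep the label as-is and flag the error.
        return label, True
    not_nfc = unicodedata.normalize("NFC", decoded) != decoded
    return decoded, not_nfc
```

Here xn--tda decodes to ü without error, while xn--u-ccb decodes to u followed by a combining diaeresis, which is flagged because it is not NFC.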

5 IDNA Mapping Table

For each code point in Unicode, the IDNA Mapping Table provides one of the following Status values:

valid: the code point is valid, and not modified.
ignored: the code point is removed: this is equivalent to mapping the code point to an empty string.
mapped: the code point is replaced in the string by the value for the mapping.
deviation: the code point is either mapped or valid, depending on whether the processing is transitional or not.
disallowed: the code point is not allowed.
- disallowed_STD3_valid: the status is disallowed if UseSTD3ASCIIRules=true (the normal case); implementations that allow UseSTD3ASCIIRules=false would treat the code point as valid.
- disallowed_STD3_mapped: the status is disallowed if UseSTD3ASCIIRules=true (the normal case); implementations that allow UseSTD3ASCIIRules=false would treat the code point as mapped.

If this Status value is mapped, disallowed_STD3_mapped or deviation, the table also supplies a mapping value for that code point.
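
To make the Status values concrete, the following Python sketch applies them to a string. The TABLE dictionary is a tiny hand-written excerpt for illustration only, not the real data file; the name map_string, the choice to treat unlisted code points as disallowed, and the choice to leave disallowed code points in place while setting an error flag are all assumptions of this sketch.

```python
# Hypothetical excerpt: code point -> (Status, mapping value or None)
TABLE = {
    0x002D: ("valid", None),                  # HYPHEN-MINUS
    0x0042: ("mapped", "b"),                  # LATIN CAPITAL LETTER B
    0x005F: ("disallowed_STD3_valid", None),  # LOW LINE
    0x006C: ("valid", None),                  # LATIN SMALL LETTER L
    0x006F: ("valid", None),                  # LATIN SMALL LETTER O
    0x00AD: ("ignored", None),                # SOFT HYPHEN
    0x00DF: ("deviation", "ss"),              # LATIN SMALL LETTER SHARP S
}
STD3_PREFIX = "disallowed_STD3_"


def map_string(text, transitional=False, use_std3=True):
    """Return (mapped_text, error_flag) according to the Status values."""
    out, error = [], False
    for ch in text:
        status, mapping = TABLE.get(ord(ch), ("disallowed", None))
        if status.startswith(STD3_PREFIX):
            # Resolve the conditional statuses using UseSTD3ASCIIRules.
            status = "disallowed" if use_std3 else status[len(STD3_PREFIX):]
        if status == "ignored":
            continue
        if status == "mapped" or (status == "deviation" and transitional):
            out.append(mapping)
        else:
            if status == "disallowed":
                error = True
            out.append(ch)  # valid, nontransitional deviation, or disallowed
    return "".join(out), error
```

With this excerpt, "Bloß" maps to "bloß" by default and to "bloss" when transitional=True, matching the first two rows of Table 2.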

A table is provided for each version of Unicode starting with Unicode 5.1, in versioned directories under [IDNA-Table]. Each table for a version of the Unicode Standard will always be backward compatible with previous versions of the table: only characters with the Status value disallowed may change in Status or Mapping value, with the following exception:

As part of the deprecation of transitional processing, the following exceptional change has been made in Unicode 15.1:
- Before Unicode 15.1, U+1E9E capital sharp s (ẞ) was unconditionally mapped to “ss”, consistent with transitional processing which maps U+00DF small sharp s (ß) also to “ss”.
- Since Unicode 15.1, when using nontransitional processing, capital sharp s is mapped to small sharp s, which is treated as valid under nontransitional processing. This is the new Mapping value in the table.
  When using transitional processing (deprecated), U+1E9E capital sharp s (ẞ) continues to be mapped to “ss”, just like the deviation mapping for U+00DF small sharp s (ß). This is handled during processing.

Unicode 15.1 also changed the Status of three conditionally-disallowed characters, which is not an exception:

Before Unicode 15.1, U+2260 (≠), U+226E (≮), and U+226F (≯) are disallowed_STD3_valid.
Since Unicode 15.1, U+2260 (≠), U+226E (≮), and U+226F (≯) are valid.
- If UseSTD3ASCIIRules=true, this is equivalent to a permissible change from disallowed to valid.
- If UseSTD3ASCIIRules=false, this is effectively no change at all.

Unlike the IDNA2008 table, this table is designed to be applied to the entire domain name, not just to individual labels. That design provides for the IDNA2003 handling of label separators. In particular, the table is constructed to forbid problematic characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP, whose decompositions contain a "dot".

The Unicode IDNA Compatibility Processing is based on the Unicode character mapping property [NFKC_Casefold]. Section 6, Mapping Table Derivation describes the derivation of these tables. Like derived properties in the Unicode Character Database, the description of the derivation is informative. Only the data in IDNA Mapping Table is normative for the application of this specification.

The files use a semicolon-delimited format similar to those in the Unicode Character Database [UAX44]. The field values are listed in Table 2b, Data File Fields:

Table 2b. Data File Fields

Num	Field	Description
0	Code point(s)	Hex value or range of values.
1	Status	valid, ignored, mapped, deviation, disallowed, disallowed_STD3_valid, or disallowed_STD3_mapped
2	Mapping	Hex value(s). Only present if the Status is ignored, mapped, deviation, or disallowed_STD3_mapped.
3	IDNA2008 Status	There are two values: NV8 and XV8. NV8 is only present if the Status is valid but the character is excluded by IDNA2008 from all domain names for all versions of Unicode. XV8 is present when the character is excluded by IDNA2008 for the current version of Unicode. These are not normative values.

Example:

0000..002C    ; disallowed                    #  NULL..COMMA
002D          ; valid                         #  HYPHEN-MINUS
...
0041          ; mapped       ; 0061           #  LATIN CAPITAL LETTER A
...
00A1..00A7    ; valid        ;      ; NV8     #  INVERTED EXCLAMATION MARK..SECTION SIGN
00AD          ; ignored                       #  SOFT HYPHEN
...
00DF          ; deviation    ; 0073 0073      #  LATIN SMALL LETTER SHARP S
...
19DA          ; valid        ;      ; XV8     # 5.2  NEW TAI LUE THAM DIGIT ONE
...
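
A data line in this format can be parsed with a few lines of Python. This is a minimal sketch: the function name parse_line and the shape of the returned tuple are hypothetical, and the "..." lines in the example above are elisions in this document rather than part of the file format, so they are not handled.

```python
def parse_line(line: str):
    """Parse one data line into (start, end, status, mapping, idna2008_status).

    Returns None for blank or comment-only lines. The mapping is returned
    as a string of code points (empty if the field is absent).
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]

    # Field 0: a single hex code point or a range "XXXX..YYYY".
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start

    status = fields[1]                     # Field 1: Status
    mapping = ""
    if len(fields) > 2 and fields[2]:      # Field 2: Mapping (hex values)
        mapping = "".join(chr(int(h, 16)) for h in fields[2].split())
    idna2008 = fields[3] if len(fields) > 3 else ""  # Field 3: NV8 / XV8
    return start, end, status, mapping, idna2008
```

For the U+00DF line in the example, this returns the deviation Status together with the mapping "ss".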

6 Mapping Table Derivation

The following describes the derivation of the mapping table. This description has nothing to do with the actual mapping of labels in Section 4, Processing. Instead, this section describes the derivation of the table in Section 5, IDNA Mapping Table. That table is then normatively used for mapping in Section 4, Processing.

The derivation is described as a series of steps. Step 1 defines a base mapping; Steps 2, 3, and 4 define three sets of characters. Step 5 will modify the base mapping or the sets of characters as needed to maintain backward compatibility. The mapping and sets are all used in Step 6 to produce the mapping and Status values for the table. Step 7 removes characters whose mappings contain characters that are not valid. Each numbered step may have substeps: for example, Step 1 consists of Steps 1.1 through 1.2.

The computation is done twice, once with UseSTD3ASCIIRules=true, and once with UseSTD3ASCIIRules=false. Code points that are disallowed with UseSTD3ASCIIRules=true, but valid or mapped with UseSTD3ASCIIRules=false, are given the special Status values disallowed_STD3_valid and disallowed_STD3_mapped.

If a Unicode property changes in a future version in a way that would affect backward compatibility, a corresponding clause will be added to Step 5 to maintain compatibility. For more information on compatibility, see Section 5, IDNA Mapping Table.

Step 1: Define a base mapping

This step specifies a base mapping, which is a mapping from each Unicode code point to sequences of zero or more code points. The value resulting from mapping a particular code point C is called the base mapping value of C. The base mapping value for C may be identical to C.

Map the following exceptional characters:
1. Map label separator characters to U+002E ( . ) FULL STOP:
  - U+FF0E ( ． ) FULLWIDTH FULL STOP
  - U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
  - U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
2. Map all Bidi_Control characters to themselves
3. Map U+1E9E (ẞ) LATIN CAPITAL LETTER SHARP S to U+00DF (ß) LATIN SMALL LETTER SHARP S
Map each other character to its NFKC_Casefold value [NFKC_Casefold].

Unicode 6.3 adds Bidi_Control characters that were not present in Unicode 3.2. To preserve the intent of IDNA2003 in disallowing Bidi_Control characters rather than just ignoring them, Step 1.1.b was added. This step causes Step 6.3 to disallow all Bidi_Control characters.

Step 1.1.b only affects 5 new characters added in Unicode 6.3. It would also impact any new Bidi_Control characters in future versions of the standard.

Step 1.1.c (added in Unicode 15.1) maps the capital sharp s (ẞ) to the small sharp s (ß) rather than to ss because all major implementations have adopted nontransitional processing, which does not map ß to ss as in NFKC_Casefold.

Step 2: Specify the base valid set

The base valid set is defined by the sequential list of additions and subtractions in Table 3, Base Valid Set. This definition is based on the principles of IDNA2003. When applied to the repertoire of Unicode 3.2 characters, this produces a set which is closely aligned with IDNA2003.

Table 3. Base Valid Set

Formal Set Notation	Description
`\P{Changes_When_NFKC_Casefolded}`	Start with characters that are equal to their [NFKC_Casefold] value. This criterion excludes uppercase letters, for example, as well as characters that are unstable under NFKC normalization, and default ignorable code points. Note that according to Perl/Java syntax, \P means the inverse of \p, so these are the characters that do not change when individually mapped according to [NFKC_Casefold].
`+ \u00DF`	Add LATIN SMALL LETTER SHARP S (ß).
`- \p{c} - \p{z}`	Remove Unassigned, Controls, Private Use, Format, Surrogate, and Whitespace.
`- \p{Block=Ideographic_Description_Characters}`	Remove ideographic description characters.
`- \u31EF`	Remove IDEOGRAPHIC DESCRIPTION CHARACTER SUBTRACTION. This is an ideographic description character that was added in Unicode 15.1 outside the now-filled Ideographic_Description_Characters block.
`- \p{ascii} + [\u002Da-zA-Z0-9]`	If UseSTD3ASCIIRules = True: Remove disallowed ASCII; '-' is valid.
`+ \p{ascii} - [\u002E]`	If UseSTD3ASCIIRules = False: Add all ASCII except for "."

Step 3: Specify the base exclusion set

Form the base exclusion set in the following way:

Start with the empty set.
Add each code point C such that:
1. According to IDNA2003, C is neither prohibited nor unassigned nor a label separator (that is, it is either valid or mapped), and
2. According to IDNA2003, C has a different mapping than C's base mapping value specified in Step 1.
Add each code point C such that:
1. According to IDNA2003, C is prohibited, and
2. either C is in the base valid set, or every code point in C's base mapping value is in the base valid set.

For example, for Unicode 5.2 and 6.0, the base exclusion set consists of the list that follows. The subheads (like "Case Changes") are informational, and do not represent the principle for excluding the characters listed under them.

Characters that have a different mapping in IDNA2003 (Step 3.2 above)

Case Changes
- U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
- U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN…U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
- U+2132 ( Ⅎ ) TURNED CAPITAL F
- U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED
Normalization Changes (CJK Compatibility Characters)
- U+2F868, U+2F874, U+2F91F, U+2F95F, U+2F9BF
Default Ignorable Changes
- U+3164 HANGUL FILLER
- U+FFA0 HALFWIDTH HANGUL FILLER
- U+115F HANGUL CHOSEONG FILLER
- U+1160 HANGUL JUNGSEONG FILLER
- U+17B4 KHMER VOWEL INHERENT AQ
- U+17B5 KHMER VOWEL INHERENT AA
- U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN

Characters that are disallowed in IDNA2003 (Step 3.3 above)

Bidi_Control characters
- U+200E LEFT-TO-RIGHT MARK..U+200F RIGHT-TO-LEFT MARK
- U+202A LEFT-TO-RIGHT EMBEDDING..U+202E RIGHT-TO-LEFT OVERRIDE
Invisible operators
- U+2061 FUNCTION APPLICATION..U+2063 INVISIBLE SEPARATOR
Replacement characters
- U+FFFC OBJECT REPLACEMENT CHARACTER
- U+FFFD ( � ) REPLACEMENT CHARACTER
Musical symbols
- U+1D173 MUSICAL SYMBOL BEGIN BEAM..U+1D17A MUSICAL SYMBOL END PHRASE
Format characters (deprecated)
- U+206A INHIBIT SYMMETRIC SWAPPING..U+206F NOMINAL DIGIT SHAPES
Tags (deprecated)
- U+E0001 LANGUAGE TAG
- U+E007F CANCEL TAG
Other tags
- U+E0020 TAG SPACE..U+E007E TAG TILDE

Step 4: Specify the deviation set

This is the set of characters that deviate between IDNA2003 and IDNA2008.

U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA

Step 5: Specify changes for backward compatibility

This set is currently empty. Adjustments to the above sets or base mapping will be made in this section if the steps would cause an already existing character to change Status or mapping under a future version of Unicode, so that backward compatibility is maintained.

Step 6: Produce the initial Status and Mapping values

For each code point:

If the code point is in the deviation set
- the Status is deviation and the mapping value is the base mapping value for that code point.
Otherwise, if the code point is in the base exclusion set or is unassigned
- the Status is disallowed and there is no mapping value in the table.
Otherwise, if the code point is not a label separator and some code point in its base mapping value is not in the base valid set
- the Status is disallowed and there is no mapping value in the table.
Otherwise, if the base mapping value is an empty string
- the Status is ignored and there is no mapping value in the table.
Otherwise, if the base mapping value is the same as the code point
- the Status is valid and there is no mapping value in the table.
Otherwise,
- the Status is mapped and the mapping value is the base mapping value for that code point.
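
The decision cascade of Step 6 can be written out as a Python sketch. The function name initial_status, its parameters, and the TOY data are hypothetical; the toy sets are hand-picked for illustration and are not derived from the Unicode Character Database.

```python
def initial_status(cp, base_mapping, base_valid, base_exclusion,
                   deviation_set, unassigned, label_separators):
    """Return (Status, mapping or None) for one code point, following Step 6."""
    mapping = base_mapping[cp]  # list of code points, possibly empty
    if cp in deviation_set:
        return "deviation", mapping
    if cp in base_exclusion or cp in unassigned:
        return "disallowed", None
    if cp not in label_separators and any(c not in base_valid for c in mapping):
        return "disallowed", None
    if not mapping:
        return "ignored", None
    if mapping == [cp]:
        return "valid", None
    return "mapped", mapping


# Toy inputs, assumed for illustration only.
TOY = dict(
    base_mapping={
        0x41: [0x61],          # A -> a
        0x61: [0x61],          # a -> a
        0xAD: [],              # SOFT HYPHEN -> empty
        0xDF: [0x73, 0x73],    # sharp s -> ss
        0x200E: [0x200E],      # LEFT-TO-RIGHT MARK -> itself
        0x2488: [0x31, 0x2E],  # DIGIT ONE FULL STOP -> "1."
        0x3002: [0x2E],        # IDEOGRAPHIC FULL STOP -> "."
    },
    base_valid={0x31, 0x61, 0x73, 0xDF},  # note: "." is not in the base valid set
    base_exclusion={0x200E},
    deviation_set={0xDF},
    unassigned=set(),
    label_separators={0x3002},
)
```

With these toy inputs, U+2488 comes out disallowed by the third test because its base mapping value contains a full stop, while the label separator U+3002 is exempt from that test and comes out mapped.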

Step 7: Produce the final Status and Mapping values

After processing all code points in previous steps:

Iterate through the set of characters with a Status of mapped. Any whose mapping values are not wholly in the union of the valid set and the deviation set, make disallowed.
Recursively apply these actions until there are no more Status changes.

For example, for Unicode 15.1, the set of characters set to disallowed in Step 7 consists of the following:

U+FE12 ( ︒ ) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP

Note: Characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP are disallowed by Step 6.3.
Note: In Unicode versions 15.0 and earlier, with UseSTD3ASCIIRules = True three additional characters were disallowed in this step: U+2260 (≠), U+226E (≮), and U+226F (≯). This was based on their canonical decompositions (NFD) containing characters that are not valid under that setting; that test was unnecessary for IDNA processing.
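
The iteration in Step 7 can be sketched as a loop that runs until nothing changes. The function name finalize and the toy table are hypothetical; in particular, the assumption that U+FE12 initially maps to U+3002, which is itself only mapped, is an illustrative reading of why that character ends up disallowed, not data taken from the table.

```python
def finalize(table):
    """Apply Step 7 to a dict of code point -> (Status, mapping list or None).

    Mapped code points whose mapping is not wholly valid or deviation
    become disallowed. The table is modified in place and returned.
    """
    changed = True
    while changed:
        changed = False
        allowed = {cp for cp, (status, _) in table.items()
                   if status in ("valid", "deviation")}
        for cp, (status, mapping) in list(table.items()):
            if status == "mapped" and any(c not in allowed for c in mapping):
                table[cp] = ("disallowed", None)
                changed = True
    return table


# Toy table, assumed for illustration only.
toy = {
    0x002E: ("valid", None),        # FULL STOP
    0x0041: ("mapped", [0x61]),     # A -> a
    0x0061: ("valid", None),        # a
    0x3002: ("mapped", [0x2E]),     # IDEOGRAPHIC FULL STOP -> "."
    0xFE12: ("mapped", [0x3002]),   # maps to a code point that is only mapped
}
result = finalize(dict(toy))
```

In the toy table, only U+FE12 changes: its mapping value U+3002 is mapped rather than valid, so U+FE12 becomes disallowed, while U+3002 and U+0041 keep their mapped Status.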

7 IDNA Comparison

Table 4, IDNA Comparisons for Unicode 11.0 illustrates the differences between the three specifications in terms of valid character repertoire for Unicode 11.0. It omits the ASCII-repertoire code points, all code points unassigned in Unicode 11.0, as well as control characters, private-use characters, and surrogate code points. It also includes label separators that are valid or mapped. The table separates the Unicode 3.2 characters from those encoded later, because they have a special status in IDNA2003. It also separates buckets where UTS #46 and IDNA2008 behave the same from those where they behave differently.

Each row in the table defines a bucket of code points that share a pattern of behavior across the three specifications. The columns provide the following information:

The column titled Count shows the number of characters in each bucket.
The columns titled IDNA2003, UTS46, and IDNA2008 show the status of the characters in each bucket for the respective specifications.
- Deviations are modified in Transitional Processing (deprecated), but not modified in Nontransitional Processing; see Section 4, Processing.
- IDNA2003 allows unassigned code points in lookup but not registration. These are in the section of the table under "Unicode 4.0 to Unicode 11.0", and marked as LookupValid.
- IDNA2008 uses several subcategories that are grouped together here for comparison. Characters marked as Valid are those that are CONTEXTJ, CONTEXTO, and PVALID in IDNA2008*. Other characters are marked as Disallowed.
  *This list of Valid characters for Unicode 4.0 and beyond is calculated as the union of characters with values CONTEXTJ, CONTEXTO, and PVALID under any version of Unicode from Version 5.2 or later. The union of valid characters over versions of Unicode is used for comparison because IDNA2008 does not guarantee backward compatibility over different versions of Unicode.
The column titled Comment and Examples describes the correlation between the specifications and provides illustrative characters.

Table 4. IDNA Comparisons for Unicode 11.0

	Count	IDNA2003	UTS46	IDNA2008	Comment and Examples
Unicode 3.2 (IDNA2003 = UTS46 = IDNA2008)
a	86,676	Valid	Valid	Valid	Valid in all three U+00E0 ( à ) LATIN SMALL LETTER A WITH GRAVE
b	431	Disallowed	Disallowed	Disallowed	Disallowed in all three U+FF01 ( ！ ) FULLWIDTH EXCLAMATION MARK
Unicode 3.2 (IDNA2003 ≠ UTS46 = IDNA2008)
c	48	Valid	Disallowed	Disallowed	Mappings changed after Unicode 3.2 U+2132 ( Ⅎ ) TURNED CAPITAL F
d	8	Mapped	Disallowed	Disallowed	Mappings changed after Unicode 3.2 U+2F868 ( 㛼 ) CJK Compatibility Ideographs
Unicode 3.2 (IDNA2003 = UTS46 ≠ IDNA2008)
e	4,640	Mapped / Ignored	Mapped / Ignored	Disallowed	Case and compatibility variants, default ignorables U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE
f	3,254	Valid	Valid	Disallowed	Punctuation, symbols, ... U+2665 ( ♥ ) BLACK HEART SUIT
g	4	Mapped / Ignored	Display: Valid Lookup: Mapped / Ignored	Valid	Deviations U+200C ZERO WIDTH NON-JOINER U+200D ZERO WIDTH JOINER U+00DF ( ß ) LATIN SMALL LETTER SHARP S U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
Unicode 4.0 to Unicode 11.0 (UTS46 = IDNA2008)
h	36,045	LookupValid	Valid	Valid*	U+0221 ( ȡ ) LATIN SMALL LETTER D WITH CURL
i	141	LookupValid	Disallowed	Disallowed	U+0602 ( ؂ ) ARABIC FOOTNOTE MARKER
Unicode 4.0 to Unicode 11.0 (UTS46 ≠ IDNA2008)
j	4,757	LookupValid	Valid	Disallowed	U+2615 ( ☕ ) HOT BEVERAGE
k	1,275	LookupValid	Mapped / Ignored	Disallowed	U+023A ( Ⱥ ) LATIN CAPITAL LETTER A WITH STROKE

The table only includes counts up to Unicode 11.0. A detailed online listing of differences is found at [DemoIDNChars] and [DemoIDN]. The implications for confusability can be seen at [DemoConf].

7.1 Implications for Implementers

Table 4, IDNA Comparisons for Unicode 11.0 can also be used to categorize implications for implementers.

If any characters are Mapped/Ignored in any specification (Rows (d), (e), (k)), then in the other specifications they are either Mapped/Ignored in precisely the same way, or they are Disallowed. This prevents domain names from being mapped differently on different browsers: either the characters map to the same result, or they do not work. Row (k) is unproblematic in this regard, assuming that registries follow one of the specifications, because characters like U+023A ( Ⱥ ) will not be valid in registered labels.

Note: The transition is complete in practice for the four problematic Deviations in Row (g). All major implementations treat them as Valid in UTS46, just like in IDNA2008.

This presumes that IDNA2008 implementations do not use custom, incompatible mappings: that is, that they do not take advantage of the fact that arbitrary mappings are allowed in IDNA2008, and choose a mapping that is incompatible with IDNA2003 or UTS #46. This pertains to any of Rows (e), (f), (j), (k). If custom mappings were used by any significant client base, it would result in serious problems for security and interoperability. For more information, see the [IDN-FAQ].

With the exception of the above issues, implementation is straightforward:

Rows (a) and (b) are unproblematic. All three specifications behave identically.
Rows (c) and (d) are unproblematic. They contain characters that are allowed under IDNA2003, but are disallowed in UTS #46 because their mappings would be different after Unicode 3.2, based on the Unicode Standard mappings. This treatment also matches IDNA2008. Those mappings were stabilized some time ago, so mappings will not change in the future; see [Stability]. Fortunately, in-depth analysis of Web content indicates these characters are quite rare: their presence in domain names in web pages cannot be distinguished from noise (unlike the Deviation characters in Row (g)).
Rows (e) and (k) are unproblematic. Ideally, implementations will map these characters in IDNA2008, producing precisely the same results as in UTS #46, and the same results for Unicode 3.2 characters as IDNA2003.
Rows (f) and (j) are symbols and punctuation that are disallowed in IDNA2008, but allowed transitionally in UTS #46. Row (j) contains post-Unicode 3.2 characters that are handled in UTS #46 according to IDNA2003 principles. These symbols and punctuation will transition smoothly as registries discontinue support for them.
Rows (h) and (i) are unproblematic. The characters have the same status in IDNA2008 and UTS #46.

8 Conformance Testing

A conformance testing file (IdnaTestV2.txt) is provided for each version of Unicode starting with Unicode 6.0, in versioned directories under [IDNA-Table]. It only provides test cases for UseSTD3ASCIIRules=true.

8.1 Format

The test file is UTF-8, with certain characters escaped using the \uXXXX or \x{XXXX} convention for readability. The details are in the header of the test file.

8.2 Testing Conformance

To test for conformance to UTS #46, an implementation will perform the toUnicode, toAsciiN, and toAsciiT operations on the source string, then verify the resulting strings and relevant Status values. The details are in the header of the test file.

Implementations may be more strict than the default settings for UTS46. In particular, an implementation conformant to IDNA2008 would disallow the input for lines marked with NV8. Implementations need only record that there is an error: they need not reproduce the precise Status codes (after removing any ignored Status values).

8.3 Migration

The test format and file name changed in Version 11.0 so that it could express a variety of different combinations of input options that people needed. The new format allows the testing implementation to test for precisely the results of its combination of supported flags, by filtering out Status codes that correspond to an unsupported input flag. The value XV8 was also removed, since it was not very useful in practice.

The following illustrate the differences between the old and new format. The set of examples is not exhaustive, but shows how there is more information available for the same examples.

Old-format sample lines:

T;  Faß.de;     faß.de;     fass.de
N;  Faß.de;     faß.de;     xn--fa-hia.de
B;  Bücher.de;  bücher.de;  xn--bcher-kva.de
B;  à\u05D0;    [B5 B6];    [B5 B6]
B;  a。。b;      [A4_2];     [A4_2]

New-format sample lines:

Faß.de;     faß.de;     [];       xn--fa-hia.de;     ;  fass.de;
Bücher.de;  bücher.de;  [];       xn--bcher-kva.de;  ;  ;
à\u05D0;    àא;         [B5 B6];  xn--0ca24w;        ;  ;
a。。b;      a..b;       [A4_2];   a..b;              ;  ;
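
A new-format line can be split into its fields with a short Python sketch. The field names in FIELDS are assumptions based on the operation names in Section 8.2 and the layout of the sample lines; the header of the test file is the authoritative description, including how blank fields are to be interpreted and how escapes such as \u05D0 are to be decoded, neither of which is handled here.

```python
# Assumed field order, inferred from the sample lines above.
FIELDS = ("source", "toUnicode", "toUnicodeStatus",
          "toAsciiN", "toAsciiNStatus", "toAsciiT", "toAsciiTStatus")


def parse_test_line(line: str) -> dict:
    """Split one test line into a dict of raw (still escaped) field values."""
    data = line.split("#", 1)[0]  # drop any trailing comment
    parts = [part.strip() for part in data.split(";")]
    parts += [""] * (len(FIELDS) - len(parts))  # pad missing fields
    return dict(zip(FIELDS, parts))


def status_codes(field: str) -> list:
    """Turn a status field such as "[B5 B6]" into a list of codes."""
    return field.strip("[]").replace(",", " ").split()
```

A test harness that does not support a given input flag could then drop the corresponding codes from the list returned by status_codes before comparing results.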

9 IDNA Derived Property

To facilitate comparison between versions of the Unicode Character Database and to highlight the implications for the addition of new characters and changes of character properties, the Unicode Technical Committee has prepared a collection of IDNA Derived Property data files. These data files are permanently posted at [IDNA-Derived].

For each version of the Unicode Standard starting with Unicode 6.1.0, the value of the enumerated IDNA2008_Category property is calculated and listed explicitly in a separate data file. This property matches the "IDNA Derived Property" as defined in RFC 5892 (see [IDNA2008]). The explicit listing is provided as a convenience for implementers. It is the result of performing the exact calculations defined in RFC 5892 concurrent with the release of each version of the Unicode Character Database.

RFC 5892 gives a list of code points for which the derivation is overridden by exceptional values. All known exceptions are applied when a data file is created, but exceptions added in future updates of the IDNA protocol are not applied retroactively.

The format of these IDNA Derived Property data files is modeled closely on that specified in Appendix B.1 of RFC 5892, except that the comment section of each line is not truncated at column 72. For example, excerpted from RFC 5892:

007B..00B6  ; DISALLOWED  # LEFT CURLY BRACKET..PILCROW SIGN
00B7        ; CONTEXTO    # MIDDLE DOT
00B8..00DE  ; DISALLOWED  # CEDILLA..LATIN CAPITAL LETTER THORN
00DF..00F6  ; PVALID      # LATIN SMALL LETTER SHARP S..LATIN SMALL LETT

Compare the same ranges excerpted from the data files:

007B..00B6  ; DISALLOWED  # LEFT CURLY BRACKET..PILCROW SIGN
00B7        ; CONTEXTO    # MIDDLE DOT
00B8..00DE  ; DISALLOWED  # CEDILLA..LATIN CAPITAL LETTER THORN
00DF..00F6  ; PVALID      # LATIN SMALL LETTER SHARP S..LATIN SMALL LETTER O WITH DIAERESIS

This close match in format is designed to simplify scripted comparison between these IDNA Derived Property data files posted at unicode.org and other existing calculated listings based on RFC 5892 that have been posted at IANA or elsewhere.

Acknowledgments

Mark Davis and Michel Suignard authored the bulk of the text of this document, under direction from the Unicode Technical Committee. For their contributions of ideas or text to this specification, the editors thank Julie Allen, Matitiahu Allouche, Peter Constable, Craig Cummings, Martin Dürst, Peter Edberg, Asmus Freytag, Deborah Goldsmith, Laurentiu Iancu, Gervase Markham, Simon Montagu, Lisa Moore, Eric Muller, Simon Sapin, Murray Sargent, Markus Scherer, Jungshik Shin, Shawn Steele, Erik van der Poel, Chris Weber, and Ken Whistler. The specification builds upon [IDNA2008], developed in the IETF Idna-update working group, especially contributions from Matitiahu Allouche, Harald Alvestrand, Vint Cerf, Martin J. Dürst, Lisa Dusseault, Patrik Fältström, Paul Hoffman, Cary Karp, John Klensin, and Peter Resnick, and also upon [IDNA2003], authored by Marc Blanchet, Adam Costello, Patrik Fältström, and Paul Hoffman.

References

[Bortzmeyer]	http://www.bortzmeyer.org/idn-et-phishing.html The most interesting studies cited there (originally from Mike Beltzner of Mozilla) are: Decision Strategies and Susceptibility to Phishing by Downs, Holbrook & Cranor; Why Phishing Works by Dhamija, Tygar & Hearst; Do Security Toolbars Actually Prevent Phishing Attacks by Wu, Miller & Garfinkel; Phishing Tips and Techniques by Gutmann.
[DemoConf]	https://util.unicode.org/UnicodeJsps/confusables.jsp
[DemoIDN]	https://util.unicode.org/UnicodeJsps/idna.jsp
[DemoIDNChars]	https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&abb=on&g=uts46+idna+idna2008
[IDNA2003]	The IDNA2003 specification is defined by a cluster of IETF RFCs: IDNA [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], Stringprep [RFC3454].
[IDNA2008]	The IDNA2008 specification is defined by a cluster of IETF RFCs: Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework https://www.rfc-editor.org/info/rfc5890; Internationalized Domain Names in Applications (IDNA) Protocol https://www.rfc-editor.org/info/rfc5891; The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) https://www.rfc-editor.org/info/rfc5892; Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) https://www.rfc-editor.org/info/rfc5893. There is also an informative document: Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale https://www.rfc-editor.org/info/rfc5894
[IDNA-Derived]	https://www.unicode.org/Public/idna2008derived
[IDNA-Table]	https://www.unicode.org/Public/idna
[IDN-FAQ]	https://www.unicode.org/faq/idn.html
[NFKC_Casefold]	The Unicode property specified in [UAX44], and defined by the data in DerivedNormalizationProps.txt (search for "NFKC_Casefold").
[RFC1034]	P. Mockapetris, "Domain names - concepts and facilities", RFC 1034, November 1987. https://www.rfc-editor.org/info/rfc1034
[RFC3454]	P. Hoffman, M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002. https://www.rfc-editor.org/info/rfc3454
[RFC3490]	Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. https://www.rfc-editor.org/info/rfc3490
[RFC3491]	Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003. https://www.rfc-editor.org/info/rfc3491
[RFC3492]	Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. https://www.rfc-editor.org/info/rfc3492
[RZLGR5]	Integration Panel, "Root Zone Label Generation Rules — LGR-5", 22 May 2022. https://www.icann.org/sites/default/files/lgr/rz-lgr-5-overview-26may22-en.pdf
[SafeBrowsing]	http://code.google.com/apis/safebrowsing/
[Stability]	Unicode Consortium Stability Policies https://www.unicode.org/policies/stability_policy.html
[STD3]	Braden, R., "Requirements for Internet Hosts -- Communication Layers", STD 3, RFC 1122, and "Requirements for Internet Hosts -- Application and Support", STD 3, RFC 1123, October 1989. https://www.rfc-editor.org/info/std3
[STD13]	Mockapetris, P., "Domain names - concepts and facilities", STD 13, RFC 1034 and "Domain names - implementation and specification", STD 13, RFC 1035, November 1987. https://www.rfc-editor.org/info/std13
[UAX44]	UAX #44: Unicode Character Database https://www.unicode.org/reports/tr44/
[Unicode]	The Unicode Standard. For the latest version, see: https://www.unicode.org/versions/latest/
[UTR36]	UTR #36: Unicode Security Considerations https://www.unicode.org/reports/tr36/
[UTS18]	UTS #18: Unicode Regular Expressions https://www.unicode.org/reports/tr18/
[UTS39]	UTS #39: Unicode Security Mechanisms https://www.unicode.org/reports/tr39/

Modifications

The following summarizes modifications from the previous published version of this document.

Revision 31

Reissued for Unicode 15.1.
The transitional processing of Deviation characters is now deprecated. All major implementations use nontransitional processing. This is reflected in several changes throughout this document.
Changed Section 6 Mapping Table Derivation to map U+1E9E capital sharp s to U+00DF small sharp s (instead of to ss) by adding a base mapping for U+1E9E, adding U+00DF to the base valid set, and allowing deviation characters in mappings.
Changed Section 6 Step 7: Produce the final Status and Mapping values to no longer check for NFD validity. In the mapping table, this changes U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.
Changed Section 4 Processing step 1. Map to no longer record an error for disallowed characters. Checking for disallowed characters both before and after normalization yielded inconsistent results for unnormalized vs. normalized text, which became visible with the change in the preceding item (e.g., ≠ vs. equals sign plus combining overlay).
Changed Section 4 Processing step 1. Map to conditionally map U+1E9E capital sharp s to “ss” if Transitional_Processing (now deprecated).
Changed Section 4.1 Validity Criteria to apply only to non-empty labels.
Added an additional condition in 4.1 Validity Criteria to disallow labels such as xn--xn---epa., which do not round-trip.
Added a flag IgnoreInvalidPunycode, which makes labels such as xn--a.com valid. This allows for an all-ASCII fast-path following common industry practice.
Bug fix in IdnaTestV2.txt: When a label is neither an LTR label nor an RTL label, then it fails B1 but no other Bidi tests.
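
As an informal illustration of the label xn--xn---epa cited in the list above, the following Python sketch strips the ACE prefix and decodes the remainder with Python's standard "punycode" codec. It shows only that the decoded label itself begins with "xn--"; it does not implement the Validity Criteria of Section 4.1, and the variable names are chosen for this example.

```python
# Label cited in the modification list (without the trailing dot).
label = "xn--xn---epa"

# Remove the "xn--" ACE prefix, then decode the rest as Punycode
# using the codec that ships with the Python standard library.
decoded = label[len("xn--"):].encode("ascii").decode("punycode")

# The decoded label is "xn--" followed by U+00E9, so it again starts
# with the ACE prefix even though it contains a non-ASCII character.
starts_with_ace_prefix = decoded.startswith("xn--")
print(ascii(decoded), starts_with_ace_prefix)
```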

Modifications for previous versions are listed in those respective versions.

© 2023 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
