Unicode® Technical Standard #39

Unicode Security Mechanisms

Version	11.0.0
Editors	Mark Davis (markdavis@google.com), Michel Suignard (michel@suignard.com)
Date	2018-05-22
This Version	http://www.unicode.org/reports/tr39/tr39-17.html
Previous Version	http://www.unicode.org/reports/tr39/tr39-15.html
Latest Version	http://www.unicode.org/reports/tr39/
Latest Proposed Update	http://www.unicode.org/reports/tr39/proposed.html
Revision	17

Summary

Because Unicode contains such a large number of characters and incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document specifies mechanisms that can be used to detect possible security problems.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Introduction

Unicode Technical Report #36, "Unicode Security Considerations" [UTR36] provides guidelines for detecting and avoiding security problems connected with the use of Unicode. This document specifies mechanisms that are used in that document, and can be used elsewhere. Readers should be familiar with [UTR36] before continuing. See also the Unicode FAQ on Security Issues [FAQSec].

2 Conformance

An implementation claiming conformance to this specification must do so in conformance to the following clauses:

C1 An implementation claiming to implement the General Profile for Identifiers shall do so in accordance with the specifications in Section 3.1, General Security Profile for Identifiers.

Alternatively, it shall declare that it uses a modification, and provide a precise list of characters that are added to or removed from the profile.

C1.1 An implementation claiming to implement the IDN Security Profiles for Identifiers shall do so in accordance with the specifications in Section 3.2, IDN Security Profiles for Identifiers.

Alternatively, it shall declare that it uses a modification, and provide a precise list of characters that are added to or removed from the profile.

C1.2 An implementation claiming to implement the Email Security Profiles for Identifiers shall do so in accordance with the specifications in Section 3.3, Email Security Profiles for Identifiers.

Alternatively, it shall declare that it uses a modification, and provide a precise list of characters that are added to or removed from the profile.

C2 An implementation claiming to implement any of the following confusable-detection functions must do so in accordance with the specifications in Section 4, Confusable Detection.

X and Y are single-script confusables
X and Y are mixed-script confusables
X and Y are whole-script confusables
X has whole-script confusables in set of scripts S

Alternatively, it shall declare that it uses a modification, and provide a precise list of character mappings that are added to or removed from the provided ones.

C3 An implementation claiming to detect mixed scripts must do so in accordance with the specifications in Section 5.1, Mixed-Script Detection.

Alternatively, it shall declare that it uses a modification, and provide a precise specification of the differences in behavior.

C4 An implementation claiming to detect Restriction Levels must do so in accordance with the specifications in Section 5.2, Restriction-Level Detection.

Alternatively, it shall declare that it uses a modification, and provide a precise specification of the differences in behavior.

C5 An implementation claiming to detect mixed numbers must do so in accordance with the specifications in Section 5.3, Mixed-Number Detection.

Alternatively, it shall declare that it uses a modification, and provide a precise specification of the differences in behavior.

3 Identifier Characters

Identifiers are special-purpose strings used for identification: strings that are deliberately limited to particular repertoires for that purpose. Exclusion of characters from identifiers does not affect the general use of those characters, such as within documents. Unicode Standard Annex #31, "Identifier and Pattern Syntax" [UAX31] provides a recommended method of determining which strings should qualify as identifiers. The UAX #31 specification extends the common practice of defining identifiers in terms of letters and numbers to the Unicode repertoire.

That specification also permits other protocols to use that method as a base, and to define a profile that adds or removes characters. For example, identifiers for specific programming languages typically add some characters like "$", and remove others like "-" (because of its use as minus), while IDNA removes "_" (among others); see Unicode Technical Standard #46, "Unicode IDNA Compatibility Processing" [UTS46], as well as [IDNA2003] and [IDNA2008].

This document provides for additional identifier profiles for environments where security is an issue. These are profiles of the extended identifiers based on properties and specifications of the Unicode Standard [Unicode], including:

The XID_Start and XID_Continue properties defined in the Unicode Character Database (see [DCore])
The toCasefold(X) operation defined in Chapter 3, Conformance of [Unicode]
The NFKC and NFKD normalizations defined in Chapter 3, Conformance of [Unicode]

The data files used in defining these profiles follow the UCD File Format, which has a semicolon-delimited list of data fields associated with given characters, with each field referenced by number. For more details, see [UCDFormat].

3.1 General Security Profile for Identifiers

The files under [idmod] provide data for a profile of identifiers in environments where security is at issue. The files contain a set of characters recommended to be restricted from use. They also contain a small set of characters that are recommended as additions to the list of characters defined by the XID_Start and XID_Continue properties, because they may be used in identifiers in a broader context than programming identifiers.

The Restricted characters are characters not in common use, and they can be blocked to further reduce the possibilities for visual confusion. They include the following:

characters not in modern use
characters only used in specialized fields, such as liturgical characters, phonetic letters, and mathematical letter-like symbols
characters in limited use by very small communities

The principle has been to be more conservative initially, allowing for the set to be modified in the future as requirements for characters are refined. For information on handling modifications over time, see Section 2.9.1, Backward Compatibility in Unicode Technical Report #36, "Unicode Security Considerations" [UTR36].

An implementation following the General Security Profile does not permit Restricted characters, unless it documents the additional characters that it does allow. Common candidates for such additions include characters for scripts listed in Table 7, Limited Use Scripts of [UAX31]. However, characters from these scripts have not been a priority for examination for confusables or to determine specialized, non-modern, or uncommon-use characters.

Canonical equivalence is applied when testing candidate identifiers for inclusion of Allowed characters. For example, suppose the candidate string is the sequence

<u,combining-diaeresis>

The target string would be Allowed in either of the following two situations:

u is Allowed and ¨ is Allowed, or
ü is Allowed
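
The test above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name and the `allowed` parameter (a set of single-character strings) are hypothetical, and checking only the NFD and NFC forms is a simplification of full canonical equivalence that happens to cover the two situations listed.

```python
import unicodedata

def is_allowed_sequence(candidate: str, allowed: set) -> bool:
    """Return True if the NFD or NFC form of candidate uses only Allowed characters.

    `allowed` is a hypothetical stand-in for the Status=Allowed set.
    """
    forms = (
        unicodedata.normalize("NFD", candidate),  # <u, combining diaeresis>
        unicodedata.normalize("NFC", candidate),  # precomposed u-diaeresis
    )
    return any(all(ch in allowed for ch in form) for form in forms)

candidate = "u\u0308"  # u followed by U+0308 COMBINING DIAERESIS
decomposed_ok = is_allowed_sequence(candidate, {"u", "\u0308"})
precomposed_ok = is_allowed_sequence(candidate, {"\u00fc"})
neither_ok = is_allowed_sequence(candidate, {"u"})
```

Here the first two calls accept the candidate and the third rejects it, because neither form consists entirely of characters in the supplied set.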

For details of the format for the [idmod] files, see Section 7, Data Files.

Table 1. Identifier Status and Type

Status	Type	Description
Restricted	Not_Character	Unassigned characters, private use characters, surrogates, most control characters
	Deprecated	Characters with the Unicode property Deprecated=Yes
	Default_Ignorable	Characters with the Unicode property Default_Ignorable_Code_Point=Yes
	Not_NFKC	Characters that cannot occur in strings normalized to NFKC.
	Not_XID	Other characters that do not qualify as default Unicode identifiers; that is, they do not have the Unicode property XID_Continue=True.
	Exclusion	Characters from Table 4, Candidate Characters for Exclusion from Identifiers from [UAX31]
	Obsolete	Characters that are no longer in modern use.
	Technical	Specialized usage: technical, liturgical, etc.
	Uncommon_Use	Characters whose status is uncertain, or that are not commonly used in modern text.
	Limited_Use	Characters from scripts that are in limited use: Table 7, Limited Use Scripts in [UAX31].
Allowed	Inclusion	Exceptional allowed characters, including Table 3, Candidate Characters for Inclusion in Identifiers in [UAX31], and some characters for IDNA2008, except for those characters that are Restricted above.
Allowed	Recommended	Table 5, Recommended Scripts in [UAX31], except for those characters that are Restricted above.

For stability considerations, see Migrating Persistent Data.

The distinctions among the Type values are not strict; if there are multiple Types for restricting a character, only one is given. The important characteristic is the Status: whether or not the character is Restricted. As more information is gathered about characters, this data may change in successive versions. That can cause either the Status or Type to change for a particular character. Thus users of this data should be prepared for changes in successive versions, such as by having a grandfathering policy in place for previously supported characters or registrations. Both Status and Type values are to be compared case-insensitively and ignoring hyphens and underbars.
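
The comparison rule for Status and Type values can be illustrated with a short Python sketch. The helper names are hypothetical; the only behavior taken from this document is that case, hyphens, and underbars are ignored.

```python
def normalize_value(value: str) -> str:
    """Drop hyphens and underbars, then fold case, before comparing."""
    return value.replace("-", "").replace("_", "").casefold()

def same_value(a: str, b: str) -> bool:
    """Loose comparison of two Status or Type values."""
    return normalize_value(a) == normalize_value(b)

matches = same_value("Uncommon_Use", "uncommon-use")
differs = same_value("Allowed", "Restricted")
```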

Restricted characters should be treated with caution in registration, and disallowed unless there is good reason to allow them in the environment in question. However, the set of Status=Allowed characters is not typically used as is by implementations. Instead, it is applied as a filter to the set of characters C that are supported by the identifier syntax, generating a new set C′. Typically there are also particular characters or classes of characters from C that are retained as Exception characters.

C′ = (C ∩ {Status=Allowed}) ∪ Exceptions

The implementation may simply restrict use of new identifiers to C′, or may apply some other strategy. For example, there might be an appeal process for registrations of ids that contain characters outside of C′ (but still inside of C), or in user interfaces for lookup of identifiers, warnings of some kind may be appropriate. For more information, see [UTR36].

The Exception characters would be implementation-specific. For example, a particular implementation might extend the default Unicode identifier syntax by adding Exception characters with the Unicode property XID_Continue=False, such as “$”, “-”, and “.”. Those characters are specific to that identifier syntax, and would be retained even though they are not in the Status=Allowed set. Some implementations may also wish to add some [CLDR] exemplar characters for particular supported languages that have unusual characters.
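
The construction of C′ is a plain set operation. The following Python sketch uses toy sets chosen only for illustration (they are not taken from the [idmod] data), and the function name is hypothetical.

```python
def permitted_characters(syntax_chars, allowed_status, exceptions):
    """Compute C' = (C intersect {Status=Allowed}) union Exceptions."""
    return (set(syntax_chars) & set(allowed_status)) | set(exceptions)

# Toy data for illustration only.
C = {"a", "b", "\u00b7", "\u017f", "$"}  # characters the identifier syntax supports
ALLOWED = {"a", "b", "\u00b7"}           # assumed Status=Allowed subset
EXCEPTIONS = {"$"}                       # implementation-specific Exception characters

c_prime = permitted_characters(C, ALLOWED, EXCEPTIONS)
```

In this toy setup "$" survives because it is an Exception character, while U+017F is dropped because it is in neither the assumed Allowed set nor the Exceptions.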

The Type=Inclusion characters already contain some characters that are not letters or numbers, but that are used within words in some languages. For example, it is recommended that U+00B7 (·) MIDDLE DOT be allowed in identifiers, because it is required for Catalan.

The implementation may also apply other restrictions discussed in this document, such as checking for confusable characters or doing mixed-script detection.

3.1.1 Joining Controls

The determination of whether ZWJ and ZWNJ are allowed or restricted depends on the context, as described in Section 2.3, Layout and Format Control Characters of [UAX31]. At a minimum, implementations should test for the conditions A1, A2, and B listed in that section of [UAX31].

More advanced implementations may use script-specific information for more detailed testing. In particular, they can:

1. Disallow joining controls in sequences that meet the conditions of A1, A2, and B, where in common fonts the resulting appearance of the sequence is normally not distinct from appearance in the same sequences with the joining controls removed.

2. Allow joining controls in sequences that don't meet the conditions of A1, A2, and B (such as the following), where in common fonts the resulting appearance of the sequence is normally distinct from the appearance in the same sequences with the joining controls removed.

/$L ZWNJ $V $L/
/$L ZWJ $V $L/

The notation is from [UAX31].

3.2 IDN Security Profiles for Identifiers

Version 1 of this document defined operations and data that apply to [IDNA2003], which has been superseded by [IDNA2008] and Unicode Technical Standard #46, "Unicode IDNA Compatibility Processing" [UTS46]. The identifier modification data can be applied to whichever specification of IDNA is being used. For more information, see the [IDNFAQ].

However, implementations can claim conformance to other features of this document as applied to domain names, such as Restriction Levels.

3.3 Email Security Profiles for Identifiers

The SMTP Extension for Internationalized Email provides for specifications of internationalized email addresses [EAI]. However, it does not provide for testing those addresses for security issues. This section provides an email security profile that may be used for that. It can be applied for different purposes, such as:

When an email address is registered, flag anything that does not meet the profile:
- Either forbid the registration, or
- Allow for an appeals process.
When an email address is detected in linkification of plaintext:
- Do not linkify if the identifier does not meet the profile.
When an email address is displayed in incoming email:
- Flag it as suspicious with a wavy underline, if it does not meet the profile.
- Filter characters from the quoted-string-part to prevent display problems.

This profile does not exclude characters from EAI. Instead, it provides a profile that can be used for registration, linkification, and notification. The goal is to flag "structurally unsound" and “unexpectedly garbagy” addresses.

An email address is formed from three main parts. (There are more elements of an email address, but these are the ones for which Unicode security is important.) For example:

"Joey" <joe31834@gmail.com>
The domain-part is "gmail.com"
The local-part is "joe31834"
The quoted-string-part is "Joey"

To meet the requirements of the Email Security Profiles for Identifiers section of this specification, an identifier must satisfy the following conditions for the specified <restriction level>.

Domain-Part

The domain-part of an email address must satisfy Section 3.2, IDN Security Profiles for Identifiers, and satisfy the conformance clauses of [UTS46].

Local-Part

The local-part of an email address must satisfy all the following conditions:

It must be in NFKC format
It must have level = <restriction level> or less, from Restriction_Level_Detection
It must not have mixed number systems according to Mixed_Number_Detection
It must satisfy dot-atom-text from RFC 5322 §3.2.3, where atext is extended as follows:

Where C ≤ U+007F, C is defined as in §3.2.3. (That is, C ∈ [!#-'*+\-/-9=?A-Z\^-~]. This list copies what is already in §3.2.3, and follows HTML5 for ASCII.)
Where C > U+007F, both of the following conditions are true:
C has IdentifierStatus=Allowed from General_Security_Profile
If C is the first character, it must be XID_Start from Default_Identifier_Syntax in [UAX31]

Note that in RFC 5322 §3.2.3:

dot-atom-text = 1*atext *("." 1*atext)

That is, dots can also occur in the local-part, but not leading, trailing, or two in a row. In more conventional regex syntax, this would be:

dot-atom-text = atext+ ("." atext+)*
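
The ASCII portion of this grammar can be tried out with Python's `re` module. This is a sketch covering only the C ≤ U+007F case; the non-ASCII extension requires IdentifierStatus and XID_Start data that the Python standard library does not provide. The names `ATEXT_ASCII` and `is_ascii_dot_atom_text` are hypothetical.

```python
import re
import string

# atext for C <= U+007F, as listed in RFC 5322 section 3.2.3.
ATEXT_ASCII = string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~"

# Build the character class with re.escape so that "-", "^" and similar
# characters are treated literally inside the brackets.
_ATEXT = "[" + re.escape(ATEXT_ASCII) + "]"
DOT_ATOM_TEXT = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")

def is_ascii_dot_atom_text(local_part: str) -> bool:
    """True if local_part is dot-atom-text using only ASCII atext characters."""
    return DOT_ATOM_TEXT.fullmatch(local_part) is not None

plain = is_ascii_dot_atom_text("joe31834")
dotted = is_ascii_dot_atom_text("first.last")
doubled = is_ascii_dot_atom_text("a..b")
```

Using `fullmatch` rather than `match` ensures that a trailing dot or other leftover characters cause rejection.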

Note that bidirectional controls and other format characters are specifically disallowed in the local-part, according to the above.

Quoted-String-Part

The quoted-string-part of an email address must satisfy the following conditions:

It must be in NFC.
It must not contain any stateful bidirectional format characters.
- That is, no [:bidicontrol:] except for the LRM, RLM, and ALM, since the bidirectional controls could influence the ordering of characters outside the quotes.
It must not contain more than four nonspacing marks in a row, and no sequence of two of the same nonspacing marks.
It may contain mixed scripts, symbols (including emoji), and so on.
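
The first three conditions can be checked with the Python standard library, as in the following sketch. Two points are assumptions rather than statements of this specification: the stateful bidirectional format characters are taken to be U+202A..U+202E and U+2066..U+2069 (the Bidi_Control characters other than LRM, RLM, and ALM), and "two of the same nonspacing marks" is read as two identical marks in immediate succession. The function name is hypothetical.

```python
import unicodedata

# Bidi_Control code points other than LRM (U+200E), RLM (U+200F), ALM (U+061C).
STATEFUL_BIDI = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def check_quoted_string_part(text: str) -> bool:
    """Sketch of the quoted-string-part checks; returns True if all pass."""
    if unicodedata.normalize("NFC", text) != text:
        return False  # not in NFC
    run = 0          # length of the current run of nonspacing marks
    previous = None  # previous character, used to spot a repeated mark
    for ch in text:
        if ord(ch) in STATEFUL_BIDI:
            return False
        if unicodedata.category(ch) == "Mn":
            run += 1
            if run > 4 or ch == previous:
                return False
        else:
            run = 0
        previous = ch
    return True

ok_plain = check_quoted_string_part("Joey")
bad_override = check_quoted_string_part("Joe\u202ey")   # contains RLO
bad_repeat = check_quoted_string_part("x\u0301\u0301")  # same mark twice
```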

Other Issues

The restrictions above are insufficient to prevent bidirectional reordering that could intermix the quoted-string-part with the local-part or the domain-part in display. To prevent that, implementations could use bidirectional isolates (or equivalent) around each of these parts in display.
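
One way to apply isolates in plain text is to wrap each part in U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE, as in this Python sketch. The function names and the exact display layout are hypothetical; markup-based environments would more likely use their own isolation mechanism.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(part: str) -> str:
    """Wrap one address part so its contents cannot reorder neighboring text."""
    return f"{FSI}{part}{PDI}"

def format_address_for_display(quoted: str, local: str, domain: str) -> str:
    """Assemble a display string with each part individually isolated."""
    return f'"{isolate(quoted)}" <{isolate(local)}@{isolate(domain)}>'

display = format_address_for_display("Joey", "joe31834", "gmail.com")
```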

Implementations may also want to use other checks, such as for confusability, or services such as Safe Browsing.

A serious practical issue is that clients do not know what the identity rules are for any particular email server: that is, when two email addresses are considered equivalent. For example, are mark@macchiato.com and Mark@macchiato.com treated the same by the server? Unfortunately, there is no way to query a server to see what identity rules it follows. One of the techniques used to deal with this problem is having whitelists of email providers indicating which of them are case-insensitive, dot-insensitive, or both.

4 Confusable Detection

The data in [confusables] provide a mechanism for determining when two strings are visually confusable. The data in these files may be refined and extended over time. For information on handling modifications over time, see Section 2.9.1, Backward Compatibility in Unicode Technical Report #36, "Unicode Security Considerations" [UTR36] and the Migration section of this document.

Collection of data for detecting gatekeeper-confusable strings is not currently a goal for the confusable detection mechanism in this document. For more information, see Section 2, Visual Security Issues in [UTR36].

The data provides a mapping from source characters to their prototypes. A prototype should be thought of as a sequence of one or more classes of symbols, where each class has an exemplar character. For example, the character U+0153 (œ), LATIN SMALL LIGATURE OE, has a prototype consisting of two symbol classes: the one with exemplar character U+006F (o), and the one with exemplar character U+0065 (e). If an input character does not have a prototype explicitly defined in the data file, the prototype is assumed to consist of the class of symbols with the input character as the exemplar character.

For an input string X, define skeleton(X) to be the following transformation on the string:

Convert X to NFD format, as described in [UAX15].
Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
Reapply NFD.

The strings X and Y are defined to be confusable if and only if skeleton(X) = skeleton(Y). This is abbreviated as X ≅ Y.
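
The three steps can be sketched in Python as follows. The `PROTOTYPES` table is a tiny hand-written excerpt based on mappings mentioned in this document, not the real [confusables] data, so the sketch recognizes only a handful of confusables. The function names are hypothetical.

```python
import unicodedata

# Tiny illustrative excerpt; a real implementation loads the full data file.
PROTOTYPES = {
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
    "\u0441": "c",   # CYRILLIC SMALL LETTER ES
    "\u2ca5": "c",   # COPTIC SMALL LETTER SIMA
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE (two exemplar characters)
}

def skeleton(text: str, prototypes=PROTOTYPES) -> str:
    """NFD, map each character to its prototype, then NFD again."""
    nfd = unicodedata.normalize("NFD", text)
    # Characters without an explicit prototype map to themselves.
    mapped = "".join(prototypes.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)

def confusable(x: str, y: str) -> bool:
    """X and Y are confusable if and only if their skeletons are equal."""
    return skeleton(x) == skeleton(y)

spoof = confusable("paypal", "p\u0430yp\u0430l")  # Cyrillic a in second string
ligature = skeleton("\u0153")
```

As noted for skeleton strings in general, the return value is only an intermediate comparison key and should not be displayed or stored.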

This mechanism imposes transitivity on the data, so if X ≅ Y and Y ≅ Z, then X ≅ Z. It is possible to provide a more sophisticated confusable detection, by providing a metric between given characters, indicating their "closeness." However, that is computationally much more expensive, and requires more sophisticated data, so at this point in time the simpler mechanism has been chosen. That means that in some cases the test may be overly inclusive.

Note: The strings skeleton(X) and skeleton(Y) are not intended for display, storage or transmission. They should be thought of as an intermediate processing form, similar to a hashcode. The exemplar characters are not guaranteed to be identifier characters.

Definitions

Confusables are divided into three classes: single-script confusables, mixed-script confusables, and whole-script confusables, defined below. All confusables are either a single-script confusable or a mixed-script confusable, but not both. All whole-script confusables are also mixed-script confusables.

The definitions of these three classes of confusables depend on the definitions of resolved script set and single-script, which are provided in Section 5.1, Mixed-Script Detection.

X and Y are single-script confusables if and only if they are confusable, and their resolved script sets have at least one element in common.

Examples: “ǉeto” and “ljeto” in Latin (the Croatian word for “summer”), where the first word uses only four codepoints, the first of which is U+01C9 (ǉ) LATIN SMALL LETTER LJ.

X and Y are mixed-script confusables if and only if they are confusable but their resolved script sets have no elements in common.

Examples: "paypal" and "pаypаl", where the second word has the character U+0430 ( а ) CYRILLIC SMALL LETTER A.

X and Y are whole-script confusables if and only if they are mixed-script confusables, and each of them is a single-script string.

Example: "scope" in Latin and "ѕсоре" in Cyrillic.

As noted in Section 5, the resolved script set ignores characters with Script_Extensions {Common} and {Inherited} and augments characters with CJK scripts with their respective writing systems. Characters with the Script_Extensions property values COMMON or INHERITED are ignored when testing for differences in script.

Data File Format

Each line in the data file has the following format: Field 1 is the source, Field 2 is the target, and Field 3 is obsolete, always containing the letters “MA” for backwards compatibility. For example:

0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #
2CA5 ; 0063 ; MA # ( ⲥ → c ) COPTIC SMALL LETTER SIMA → LATIN SMALL LETTER C # →ϲ→

Everything after the # is a comment and is purely informative. An asterisk after the comment indicates that the character is not an XID character [UAX31]. The comments provide the character names.
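
A line in this format can be parsed with a few string operations. The Python sketch below is illustrative only; the function name is hypothetical, and it assumes that source and target fields are space-separated hexadecimal code points, as in the examples above.

```python
def parse_confusables_line(line: str):
    """Return (source, target) strings for a data line, or None for other lines."""
    # Drop a leading byte order mark (if any) and everything after the first "#".
    data = line.lstrip("\ufeff").split("#", 1)[0].strip()
    if not data:
        return None  # blank line or comment-only line
    fields = [field.strip() for field in data.split(";")]
    # Field 1 is the source, Field 2 the target; Field 3 ("MA") is obsolete.
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return source, target

example = "0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C #"
parsed = parse_confusables_line(example)
```

Collecting the returned pairs into a dictionary gives the prototype table used when computing skeleton(X).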

Implementations that use the confusable data do not have to recursively apply the mappings, because the transforms are idempotent. That is,

skeleton(skeleton(X)) = skeleton(X)

If the data was derived via transitivity, there is an extra comment at the end. For instance, in the above example the derivation was:

ⲥ (U+2CA5 COPTIC SMALL LETTER SIMA)
→ ϲ (U+03F2 GREEK LUNATE SIGMA SYMBOL)
→ c (U+0063 LATIN SMALL LETTER C)

To reduce security risks, it is advised that identifiers use casefolded forms, thus eliminating uppercase variants where possible.

The data may change between versions. Even where the data is the same, the order of lines in the files may change between versions. For more information, see Migration.

Note: due to production problems, versions before 7.0 did not maintain idempotency in all cases. For more information, see Migration.

4.1 Whole-Script Confusables

For some applications, it is useful to determine if a given input string has any whole-script confusable. For example, the identifier "ѕсоре" using Cyrillic characters would pass the single-script test described in Section 5.2, Restriction-Level Detection, even though it is likely to be a spoof attempt.

It is possible to determine whether a single-script string X has a whole-script confusable:

Consider Q, the set of all strings that are confusable with X.
Remove all strings from Q whose resolved script set intersects with the resolved script set of X.
If Q is nonempty and contains any single-script string, return TRUE.
Otherwise, return FALSE.

The logical description above can be used for a reference implementation for testing, but is not particularly efficient. A production implementation can be optimized as long as it produces the same results.

Note that the confusables data include a large number of mappings between Latin and Cyrillic text. For this reason, the above algorithm is likely to flag a large number of legitimate strings written in Latin or Cyrillic as potential whole-script confusables. To effectively use whole-script confusables, it is often useful to determine both whether a string has a whole-script confusable, and which scripts those whole-script confusables have.

This information can be used, for example, to distinguish between reasonable versus suspect whole-script confusables. Consider the Latin-script domain-name label “circle”. It would be appropriate to have that in the domain name “circle.com”. It would also be appropriate to have the Cyrillic confusable “сігсӀе” in the Cyrillic domain name “сігсӀе.рф”. However, a browser may want to alert the user to possible spoofs if the Cyrillic “сігсӀе” is used with .com or the Latin “circle” is used with .рф.

The process of determining suspect usage of whole-script confusables is more complicated than simply looking at the scripts of the labels in a domain name. For example, it can be perfectly legitimate to have scripts in a SLD (second level domain) not be the same as scripts in a TLD (top-level domain), such as:

Cyrillic labels in a domain name with a TLD of .ru or .рф
Chinese labels in a domain name with a TLD of .com.au or .com
Cyrillic labels that aren’t confusable with Latin with a TLD of .com.au or .com

The following high-level algorithm can be used to determine all scripts that contain a whole-script confusable with a string X:

Consider Q, the set of all strings confusable with X.
Remove all strings from Q whose resolved script set is ∅ or ALL (that is, keep only single-script strings plus those with characters only in Common).
Take the union of the resolved script sets of all strings remaining in Q.

As usual, this algorithm is intended only as a definition; implementations should use an optimized routine that produces the same result.

4.2 Mixed-Script Confusables

To determine the existence of a mixed-script confusable, a similar process could be used:

Consider Q, the set of all strings that are confusable with X.
Remove all strings from Q whose resolved script set intersects with the resolved script set of X.
If Q is nonempty, return TRUE.
Otherwise, return FALSE.

Note that due to the number of mappings provided by the confusables data, the above algorithm is likely to flag a large number of legitimate strings as potential mixed-script confusables.

5 Detection Mechanisms

5.1 Mixed-Script Detection

The Unicode Standard supplies information that can be used for determining the script of characters and detecting mixed-script text. The determination of script is according to the UAX #24, Unicode Script Property [UAX24], using data from the Unicode Character Database [UCD].

Define a character's augmented script set to be a character's Script_Extensions with the following two modifications.

Entries for the writing systems containing multiple scripts — Hanb (Han with Bopomofo), Jpan (Japanese), and Kore (Korean) — are added according to the following rules.
1. If Script_Extensions contains Hani (Han), add Hanb, Jpan, and Kore.
2. If Script_Extensions contains Hira (Hiragana), add Jpan.
3. If Script_Extensions contains Kata (Katakana), add Jpan.
4. If Script_Extensions contains Hang (Hangul), add Kore.
5. If Script_Extensions contains Bopo (Bopomofo), add Hanb.
Sets containing Zyyy (Common) or Zinh (Inherited) are treated as ALL, the set of all script values.

The Script_Extensions data is from the Unicode Character Database [UCD]. For more information on the Script_Extensions property and Jpan, Kore, and Hanb, see UAX #24, Unicode Script Property [UAX24].

Define the resolved script set for a string to be the intersection of the augmented script sets over all characters in the string.

A string is defined to be mixed-script if its resolved script set is empty and defined to be single-script if its resolved script set is nonempty.
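
The definitions above can be sketched in Python. Because the Python standard library does not expose the Script_Extensions property, this sketch takes each character's Script_Extensions as input (a set of four-letter script codes) rather than looking it up. Representing ALL as `None` and the function names are choices made for this illustration only.

```python
from typing import FrozenSet, Iterable, Optional

ScriptSet = Optional[FrozenSet[str]]  # None stands for ALL

def augmented_script_set(script_extensions: Iterable[str]) -> ScriptSet:
    """Apply the two modifications to one character's Script_Extensions."""
    scx = set(script_extensions)
    if scx & {"Zyyy", "Zinh"}:
        return None  # Common or Inherited: treated as ALL
    if "Hani" in scx:
        scx |= {"Hanb", "Jpan", "Kore"}
    if "Hira" in scx or "Kata" in scx:
        scx.add("Jpan")
    if "Hang" in scx:
        scx.add("Kore")
    if "Bopo" in scx:
        scx.add("Hanb")
    return frozenset(scx)

def resolved_script_set(per_char_scx: Iterable[Iterable[str]]) -> ScriptSet:
    """Intersect the augmented script sets of all characters in a string."""
    result: ScriptSet = None  # start from ALL
    for scx in per_char_scx:
        augmented = augmented_script_set(scx)
        if augmented is None:
            continue  # intersecting with ALL changes nothing
        result = augmented if result is None else result & augmented
    return result

def is_single_script(per_char_scx: Iterable[Iterable[str]]) -> bool:
    """Single-script means the resolved script set is nonempty (ALL counts)."""
    resolved = resolved_script_set(per_char_scx)
    return resolved is None or len(resolved) > 0

# U+3006 has Script_Extensions {Hani, Hira, Kata}; U+5207 has {Hani}.
cjk = resolved_script_set([{"Hani", "Hira", "Kata"}, {"Hani"}])
# A Cyrillic letter followed by a Latin letter.
mixed = is_single_script([{"Cyrl"}, {"Latn"}])
```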

Note that the term “single-script string” may be confusing. It means that there is at least one script in the resolved script set, not that there is only one. For example, the string “〆切” is single-script, because it has four scripts {Hani, Hanb, Jpan, Kore} in its resolved script set.

As well as providing an API to detect whether a string has mixed scripts, it is also useful to offer an API that returns those scripts. Look at the examples below.

Table 1a. Mixed Script Examples

String	Code Point	Script_Extensions	Augmented Script Sets	Resolved Script Set	Single-Script?
Circle	U+0043 U+0069 U+0072 U+0063 U+006C U+0065	{Latn} {Latn} {Latn} {Latn} {Latn} {Latn}	{Latn} {Latn} {Latn} {Latn} {Latn} {Latn}	{Latn}	Yes
СігсӀе	U+0421 U+0456 U+0433 U+0441 U+04C0 U+0435	{Cyrl} {Cyrl} {Cyrl} {Cyrl} {Cyrl} {Cyrl}	{Cyrl} {Cyrl} {Cyrl} {Cyrl} {Cyrl} {Cyrl}	{Cyrl}	Yes
Сirсlе	U+0421 U+0069 U+0072 U+0441 U+006C U+0435	{Cyrl} {Latn} {Latn} {Cyrl} {Latn} {Cyrl}	{Cyrl} {Latn} {Latn} {Cyrl} {Latn} {Cyrl}	∅	No
Circ1e	U+0043 U+0069 U+0072 U+0063 U+0031 U+0065	{Latn} {Latn} {Latn} {Latn} {Zyyy} {Latn}	{Latn} {Latn} {Latn} {Latn} ALL {Latn}	{Latn}	Yes
C𝗂𝗋𝖼𝗅𝖾	U+0043 U+1D5C2 U+1D5CB U+1D5BC U+1D5C5 U+1D5BE	{Latn} {Zyyy} {Zyyy} {Zyyy} {Zyyy} {Zyyy}	{Latn} ALL ALL ALL ALL ALL	{Latn}	Yes
𝖢𝗂𝗋𝖼𝗅𝖾	U+1D5A2 U+1D5C2 U+1D5CB U+1D5BC U+1D5C5 U+1D5BE	{Zyyy} {Zyyy} {Zyyy} {Zyyy} {Zyyy} {Zyyy}	ALL ALL ALL ALL ALL ALL	ALL	Yes
〆切	U+3006 U+5207	{Hani, Hira, Kata} {Hani}	{Hani, Hira, Kata, Hanb, Jpan, Kore} {Hani, Hanb, Jpan, Kore}	{Hani, Hanb, Jpan, Kore}	Yes
ねガ	U+306D U+30AC	{Hira} {Kata}	{Hira, Jpan} {Kata, Jpan}	{Jpan}	Yes

A set of scripts is defined to cover a string if the intersection of that set with the augmented script sets of all characters in the string is nonempty; in other words, if every character in the string shares at least one script with the cover set. For example, {Latn, Cyrl} covers "Сirсlе", the third example in Table 1a.

A cover set is defined to be minimal if there is no smaller cover set. For example, {Hira, Hani} covers "〆切", the seventh example in Table 1a, but it is not minimal, since {Hani} also covers the string, and {Hani} is smaller than {Hira, Hani}. Note that minimal cover sets are not unique: a string may have different minimal cover sets.

Typically an API that returns the scripts in a string will return one of the minimal cover sets.

For computational efficiency, a set of script sets (SOSS) can be computed, where the augmented script sets for each character in the string map to one entry in the SOSS. For example, { {Latn}, {Cyrl} } would be the SOSS for "Сirсlе". A set of scripts that covers the SOSS also covers the input string. Likewise, the intersection of all entries of the SOSS will be the input string's resolved script set.
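
The SOSS and the cover test can be sketched in Python as follows. The sketch takes augmented script sets as input and represents ALL as `None`. It also assumes that ALL entries can be left out of the SOSS, since they affect neither coverage nor the intersection; that omission is a choice made here, not a statement of this specification. All names are hypothetical.

```python
ALL = None  # stand-in for the set of all script values

def build_soss(augmented_sets):
    """Collapse per-character augmented script sets into a set of script sets."""
    return {frozenset(s) for s in augmented_sets if s is not ALL}

def covers(scripts, soss) -> bool:
    """True if every SOSS entry shares at least one script with `scripts`."""
    return all(set(scripts) & entry for entry in soss)

def resolved_from_soss(soss):
    """Intersection of all SOSS entries; ALL if the SOSS is empty."""
    if not soss:
        return ALL
    return frozenset.intersection(*soss)

# Third example of Table 1a: alternating Cyrillic and Latin letters.
soss = build_soss([{"Cyrl"}, {"Latn"}, {"Latn"}, {"Cyrl"}, {"Latn"}, {"Cyrl"}])
both_cover = covers({"Latn", "Cyrl"}, soss)
latin_only = covers({"Latn"}, soss)
resolved = resolved_from_soss(soss)
```

Six characters collapse to two SOSS entries here, which is where the computational saving comes from on longer strings.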

5.2 Restriction-Level Detection

Restriction Levels 1-5 are defined here for use in implementations. These place restrictions on the use of identifiers according to the appropriate Identifier Profile as specified in Section 3, Identifier Characters. The lists of Recommended scripts are taken from Table 5, Recommended Scripts of [UAX31]. For more information on the use of Restriction Levels, see Section 2.9, Restriction Levels and Alerts in [UTR36].

For each of the Restriction Levels 1-6, the identifier must be well-formed according to whatever general syntactic constraints are in force, such as the Default Identifier Syntax in [UAX31].

In addition, an application may provide an Identifier Profile such as the General Security Profile for Identifiers, which restricts the allowed characters further. For each of the Restriction Levels 1-5, characters in the string must also be in the Identifier Profile. Where there is no such Identifier Profile, Levels 5 and 6 are identical.

1. ASCII-Only
- All characters in the string are in the ASCII range.
2. Single Script
- The string qualifies as ASCII-Only, or
- The string is single-script, according to the definition in Section 5.1.
3. Highly Restrictive
- The string qualifies as Single Script, or
- The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
  - Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
  - Latin + Han + Bopomofo; or equivalently: Latn + Hanb
  - Latin + Han + Hangul; or equivalently: Latn + Kore
4. Moderately Restrictive
- The string qualifies as Highly Restrictive, or
- The string is covered by Latin and any one other Recommended script, except Cyrillic, Greek
5. Minimally Restrictive
- There are no restrictions on the set of scripts that cover the string.
- The only restrictions are the identifier well-formedness criteria and Identifier Profile, allowing arbitrary mixtures of scripts such as Ωmega, Teχ, HλLF-LIFE, Toys-Я-Us.
6. Unrestricted
- There are no restrictions on the script coverage of the string.
- The only restrictions are the criteria on identifier well-formedness. Characters may be outside of the Identifier Profile.
- This level is primarily for use in detection APIs, providing a return value indicating that the string does not match any of the levels 1-5.

Note that in all levels except ASCII-Only, any character having Script_Extensions {Common} or {Inherited} is allowed in the identifier, as long as that character meets the Identifier Profile requirements.

These levels can be detected by reusing some of the mechanisms of Section 5.1. For a given input string, the Restriction Level is determined by the following logical process:

1. If the string contains any characters outside of the Identifier Profile, return Unrestricted.
2. If no character in the string is above 0x7F, return ASCII-Only.
3. Compute the string's SOSS according to Section 5.1.
4. If the SOSS is empty or the intersection of all entries in the SOSS is nonempty, return Single Script.
5. Remove all the entries from the SOSS that contain Latin.
6. If any of the following sets cover SOSS, return Highly Restrictive.
  - {Kore}
  - {Hanb}
  - {Jpan}
7. If the intersection of all entries in the SOSS contains any single Recommended script except Cyrillic or Greek, return Moderately Restrictive.
8. Otherwise, return Minimally Restrictive.

The actual implementation of this algorithm can be optimized; as usual, the specification only depends on the results.
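The logical process above can be sketched in Java (9 or later) roughly as follows. Several parts are assumptions made for illustration: the JDK's `Character.UnicodeScript` stands in for Script_Extensions, the Identifier Profile is a caller-supplied predicate, `RECOMMENDED_EXCERPT` is an abbreviated list rather than the full Table 5 of [UAX31], and all class and method names are hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

final class RestrictionLevelSketch {
    enum Level { ASCII_ONLY, SINGLE_SCRIPT, HIGHLY_RESTRICTIVE,
                 MODERATELY_RESTRICTIVE, MINIMALLY_RESTRICTIVE, UNRESTRICTED }

    // Assumption: abbreviated excerpt; real code would load Table 5 of [UAX31].
    static final Set<String> RECOMMENDED_EXCERPT = Set.of(
        "LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW", "DEVANAGARI",
        "THAI", "HAN", "HIRAGANA", "KATAKANA", "HANGUL", "BOPOMOFO");

    // Augmented script set: Han, kana, Hangul and Bopomofo also imply Hanb/Jpan/Kore.
    static Set<String> augment(UnicodeScript script) {
        switch (script) {
            case HAN:      return Set.of("HAN", "Hanb", "Jpan", "Kore");
            case HIRAGANA: return Set.of("HIRAGANA", "Jpan");
            case KATAKANA: return Set.of("KATAKANA", "Jpan");
            case HANGUL:   return Set.of("HANGUL", "Kore");
            case BOPOMOFO: return Set.of("BOPOMOFO", "Hanb");
            default:       return Set.of(script.name());
        }
    }

    static Set<Set<String>> soss(String s) {
        Set<Set<String>> out = new HashSet<>();
        s.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // Common and Inherited characters are allowed at every level above ASCII-Only.
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                out.add(augment(script));
            }
        });
        return out;
    }

    static Set<String> intersection(Set<Set<String>> soss) {
        Iterator<Set<String>> it = soss.iterator();
        if (!it.hasNext()) return new HashSet<>();
        Set<String> acc = new HashSet<>(it.next());
        while (it.hasNext()) acc.retainAll(it.next());
        return acc;
    }

    static Level detect(String s, IntPredicate inProfile) {
        if (!s.codePoints().allMatch(inProfile)) return Level.UNRESTRICTED;      // step 1
        if (s.codePoints().allMatch(cp -> cp <= 0x7F)) return Level.ASCII_ONLY;   // step 2
        Set<Set<String>> soss = soss(s);                                          // step 3
        if (soss.isEmpty() || !intersection(soss).isEmpty()) {
            return Level.SINGLE_SCRIPT;                                           // step 4
        }
        soss.removeIf(entry -> entry.contains("LATIN"));                          // step 5
        for (String cjk : List.of("Kore", "Hanb", "Jpan")) {                      // step 6
            // A one-element set covers the SOSS when every entry contains that element.
            if (soss.stream().allMatch(entry -> entry.contains(cjk))) {
                return Level.HIGHLY_RESTRICTIVE;
            }
        }
        Set<String> rest = intersection(soss);                                    // step 7
        rest.removeAll(Set.of("CYRILLIC", "GREEK"));
        rest.retainAll(RECOMMENDED_EXCERPT);
        return rest.isEmpty() ? Level.MINIMALLY_RESTRICTIVE : Level.MODERATELY_RESTRICTIVE;
    }
}
```

With a profile that accepts every character, this sketch classifies a Latin plus Han string as Highly Restrictive, Latin plus Hebrew as Moderately Restrictive, and Latin plus Cyrillic as Minimally Restrictive.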

5.3 Mixed-Number Detection

There are three different types of numbers in Unicode. Only numbers with General_Category = Decimal_Number (Nd) should be allowed in identifiers. However, characters from different decimal number systems can be easily confused. For example, U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO can be confused with U+06F0 ( ۰ ) EXTENDED ARABIC-INDIC DIGIT ZERO, and U+09EA ( ৪ ) BENGALI DIGIT FOUR can be confused with U+0038 ( 8 ) DIGIT EIGHT.

For a given input string which does not contain non-decimal numbers, the logical process of detecting mixed numbers is the following:

For each character in the string:

Find the decimal number value for that character, if any.
Map the value to the unique zero character for that number system.

If there is more than one such zero character, then the string contains multiple decimal number systems.

The actual implementation of this algorithm can be optimized; as usual, the specification only depends on the results. The following Java sample using [ICU] shows how this can be done:

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UCharacterCategory;
    import com.ibm.icu.text.UnicodeSet;

    public UnicodeSet getNumberRepresentatives(String identifier) {
        int cp;
        UnicodeSet numerics = new UnicodeSet();
        // Advance by the width of the code point just read (1 or 2 chars).
        for (int i = 0; i < identifier.length(); i += Character.charCount(cp)) {
            cp = Character.codePointAt(identifier, i);
            // Store a representative character for each kind of decimal digit
            switch (UCharacter.getType(cp)) {
            case UCharacterCategory.DECIMAL_DIGIT_NUMBER:
                // Just store the zero character as a representative for comparison.
                // Unicode guarantees it is cp - value.
                numerics.add(cp - UCharacter.getNumericValue(cp));
                break;
            case UCharacterCategory.OTHER_NUMBER:
            case UCharacterCategory.LETTER_NUMBER:
                throw new IllegalArgumentException("Should not be in identifiers.");
            }
        }
        return numerics;
    }
    ...
    // More than one representative zero means mixed decimal number systems.
    UnicodeSet numerics = getNumberRepresentatives(identifier);
    if (numerics.size() > 1) reject(identifier, numerics);

5.4 Optional Detection

There are additional enhancements that may be useful in spoof detection. This includes such mechanisms as marking strings as "mixed script" where they contain simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database [UCD], or detecting sequences of the same nonspacing mark.

Other enhancements useful in spoof detection include the following:

Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository [CLDR].
Check for unlikely sequences of combining marks:
1. Forbid sequences of the same nonspacing mark.
2. Forbid sequences of more than 4 nonspacing marks (gc=Mn or gc=Me).
3. Forbid sequences of base character + nonspacing mark that look the same as or confusingly similar to the base character alone (because the nonspacing mark overlays a portion of the base character). An example is U+0069 LATIN SMALL LETTER I + U+0307 COMBINING DOT ABOVE.
Mark Chinese strings as “mixed script” if they contain both characters that are only used with simplified Chinese writing systems and characters that are only used with traditional Chinese writing systems. A great many characters can be used in both simplified and traditional writing systems, and the categorization of characters may also change over time.
1. The criterion can only be applied if the language of the string is known to be Chinese. So, for example, the string “写真だけの結婚式” is Japanese, and should not be marked as mixed script because of a mixture of S and T characters.
Add support for detecting two distinct sequences that have identical representations. The current data files only handle cases where a single code point is confusable with another code point or sequence. They do not handle cases like shri, as below.

The characters U+0BB6 TAMIL LETTER SHA and U+0BB8 TAMIL LETTER SA are normally quite distinct. However, they can both be used in the representation of the Tamil word shri. On some very common platforms, the following sequences result in exactly the same visual appearance:

U+0BB6	U+0BCD	U+0BB0	U+0BC0
SHA	VIRAMA	RA	II
ஶ	்	ர	◌ீ	= ஶ்ரீ

U+0BB8	U+0BCD	U+0BB0	U+0BC0
SA	VIRAMA	RA	II
ஸ	்	ர	◌ீ	= ஸ்ரீ
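The first two combining-mark checks listed earlier in this section (no repeated nonspacing mark, no run of more than 4 marks) can be sketched with the JDK alone. This is a minimal illustration under stated assumptions: "sequences of the same nonspacing mark" is read as the same mark appearing twice in a row, the string is examined as given (a caller may wish to normalize first), and the names are hypothetical.

```java
final class MarkSequenceSketch {
    /** True for gc=Mn or gc=Me, the categories named in check 2. */
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK || type == Character.ENCLOSING_MARK;
    }

    /** Returns false if a mark is immediately repeated or a run exceeds 4 marks. */
    static boolean hasPlausibleMarkSequences(String s) {
        int run = 0;        // length of the current run of marks
        int previous = -1;  // previous code point, or -1 at the start
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isMark(cp)) {
                run++;
                if (cp == previous || run > 4) {
                    return false;
                }
            } else {
                run = 0;
            }
            previous = cp;
        }
        return true;
    }
}
```

The third check (marks that visually merge into their base character) needs per-character data and is not attempted here.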

6 Development Process

As discussed in Unicode Technical Report #36, "Unicode Security Considerations" [UTR36], confusability among characters cannot be an exact science. There are many factors that make confusability a matter of degree:

Shapes of characters vary greatly among fonts used to represent them. The Unicode Standard uses representative glyphs in the code charts, but font designers are free to create their own glyphs. Because fonts can easily be created using an arbitrary glyph to represent any Unicode code point, character confusability with arbitrary fonts can never be avoided. For example, one could design a font where the ‘a’ looks like a ‘b’, ‘c’ like a ‘d’, and so on.
Writing systems using contextual shaping (such as Arabic, and many South Asian systems) introduce even more variation in text rendering. Characters do not really have an abstract shape in isolation and are only rendered as part of a cluster of characters making words, expressions, and sentences. It is a fairly common occurrence to find the same visual text representation corresponding to very different logical words that can only be recognized by context, if at all.
Font style variants such as italics may introduce a confusability which does not exist in another style. For example, in the Cyrillic script, the U+0442 ( т ) CYRILLIC SMALL LETTER TE looks like a small caps Latin ‘T’ in normal style, while it looks like a small Latin ‘m’ in italic style.

In-script confusability is extremely user-dependent. For example, in the Latin script, characters with accents or appendices may look similar to the unadorned characters for some users, especially if they are not familiar with their meaning in a particular language. However, most users will have at least a minimum understanding of the range of characters in their own script, and there are separate mechanisms available to deal with other scripts, as discussed in [UTR36].

As described elsewhere, there are cases where the confusable data may be different than expected. Sometimes this is because two characters or two strings may only be confusable in some fonts. In other cases, it is because of transitivity. For example, the dotless and dotted I are considered equivalent (ı ↔ i), because they look the same when accents such as an acute are applied to each. However, for practical implementation usage, transitivity is sufficiently important that some oddities are accepted.

The data may be enhanced in future versions of this specification. For information on handling changes in data over time, see Section 2.9.1, Backward Compatibility of [UTR36].

6.1 Confusables Data Collection

The confusability data was created by collecting a number of prospective confusables, examining those confusables according to a set of common fonts, and processing the result for transitive closure.

The primary goal is to include characters that would be Status=Allowed as in Table 1, Identifier Status and Type. Other characters, such as NFKC variants, are not a primary focus for data collection. However, such variants may certainly be included in the data, and may be submitted using the online forms at [Feedback].

The prospective confusables were gathered from a number of sources. Erik van der Poel contributed a list derived from running a program over a large number of fonts to catch characters that shared identical glyphs within a font, and Mark Davis did the same more recently for fonts on Windows and the Macintosh. Volunteers from Google, IBM, Microsoft and other companies gathered other lists of characters. These included native speakers for languages with different writing systems. The Unicode compatibility mappings were also used as a source. The process of gathering visual confusables is ongoing: the Unicode Consortium welcomes submission of additional mappings. The complex scripts of South and Southeast Asia need special attention. The focus is on characters that can be in the Recommended profile for identifiers, because they are of most concern.

The fonts used to assess the confusables included those used by the major operating systems in user interfaces. In addition, the representative glyphs used in the Unicode Standard were also considered. Fonts used for the user interface in operating systems are an important source, because they are the ones that will usually be seen by users in circumstances where confusability is important, such as when using IRIs (Internationalized Resource Identifiers) and their sub-elements (such as domain names). These fonts have a number of other relevant characteristics:

They rarely change in updates to operating systems and applications; changes brought by system upgrades tend to be gradual to avoid usability disruption.
Because user interface elements need to be legible at low screen resolution (implying a low number of pixels per EM), fonts used in these contexts tend to be designed in sans-serif style, which has the tendency to increase the possibility of confusables. There are, however, some languages such as Chinese where a serif style is in common use.
Strict bounding box requirements create even more constraints for scripts which use relatively large ascenders and descenders. This also limits space allocated for accent or tone marks, and can also create more opportunities for confusability.

Pairs of prospective confusables were removed if they were always visually distinct at common sizes, both within and across fonts. The data was then closed under transitivity, so that if X≅Y and Y≅Z, then X≅Z. In addition, the data was closed under substring operations, so that if X≅Y then AXB≅AYB. It was then processed to produce the in-script and cross-script data, so that a single data table can be used to map an input string to a resulting skeleton.

A skeleton is intended only for internal use for testing confusability of strings; the resulting text is not suitable for display to users, because it will appear to be a hodgepodge of different scripts. In particular, the result of mapping an identifier will not necessarily be an identifier. Thus the confusability mappings can be used to test whether two identifiers are confusable (if their skeletons are the same), but should definitely not be used as a "normalization" of identifiers.
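As an illustration of comparing skeletons, the following Java sketch follows the NFD, map, NFD shape of the skeleton operation defined in Section 4. The mapping table is a tiny hand-picked stand-in, not the contents of confusables.txt, and the class and method names are hypothetical; a real implementation would load the full data file.

```java
import java.text.Normalizer;
import java.util.Map;

final class SkeletonSketch {
    // Assumption: three illustrative Cyrillic-to-Latin entries only.
    static final Map<Integer, String> PROTOTYPES = Map.of(
        0x0421, "C",   // CYRILLIC CAPITAL LETTER ES
        0x0441, "c",   // CYRILLIC SMALL LETTER ES
        0x0435, "e");  // CYRILLIC SMALL LETTER IE

    /** Internal comparison key only; never display or store it as the identifier. */
    static String skeleton(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder mapped = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            String prototype = PROTOTYPES.get(cp);
            if (prototype != null) {
                mapped.append(prototype);
            } else {
                mapped.appendCodePoint(cp);
            }
        });
        return Normalizer.normalize(mapped, Normalizer.Form.NFD);
    }

    /** Two strings are treated as confusable when their skeletons are equal. */
    static boolean confusable(String a, String b) {
        return skeleton(a).equals(skeleton(b));
    }
}
```

With this toy table, the mixed-script "Сirсlе" and the all-Latin "Circle" compare as confusable, while the original strings are left untouched.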

6.2 Identifier Modification Data Collection

The idmod data is gathered in the following way. The basic assignments are derived based on UCD character properties, information in [UAX31], and a curated list of exceptions based on information from various sources, including the core specification of the Unicode Standard, annotations in the code charts, information regarding CLDR exemplar characters, and external feedback.

The first condition that matches in the order of the items from top to bottom in Table 1, Identifier Status and Type is used, with a few exceptions:

When a character is in Table 3, Candidate Characters for Inclusion in Identifiers in [UAX31], then it is given the Type Inclusion, regardless of other properties.
When the Script_Extensions property value for a character contains multiple Script property values, the Script used for the derivation is the first in the following list:
1. Table 5, Recommended Scripts
2. Table 7, Limited Use Scripts
3. Table 4, Candidate Characters for Exclusion from Identifiers
  - Table 4 also has some conditions that are not dependent on script; those conditions are applied regardless of Script_Extensions property value.

The script information in Table 4, Table 5, and Table 7 is in machine-readable form in CLDR, as scriptMetadata.txt.

7 Data Files

The following files provide data used to implement the recommendations in this document. The data may be refined in future versions of this specification. For more information, see Section 2.9.1, Backward Compatibility of [UTR36].

The Unicode Consortium welcomes feedback on additional confusables or identifier restrictions. There are online forms at [Feedback] where you can suggest additional characters or corrections.

The files are in http://www.unicode.org/Public/security/. The directories there contain data files associated with a given version. The directory for this version is:

http://www.unicode.org/Public/security/11.0.0

The data files for the latest approved version are also in the directory:

http://www.unicode.org/Public/security/latest

The format for IdentifierStatus.txt follows the normal conventions for UCD data files, and is described in the header of that file. All characters not listed in the file default to IdentifierStatus=Restricted. Thus the file only lists characters with IdentifierStatus=Allowed. For example:

002D..002E ; Allowed # 1.1 HYPHEN-MINUS..FULL STOP

The format for IdentifierType.txt follows the normal conventions for UCD data files, and is described in the header of that file. The value is a set whose elements are delimited by spaces. This format is identical to that used for ScriptExtensions.txt. This differs from prior versions which only listed the strongest reason for exclusion. This new convention allows the values to be used for more nuanced filtering. For example, if an implementation wants to allow an Exclusion script, it could still exclude Obsolete and Deprecated characters in that script. All characters not listed in the file default to IdentifierType=Recommended. For example:

2460..24EA ; Technical Not_XID Not_NFKC # 1.1 CIRCLED DIGIT ONE..CIRCLED DIGIT ZERO
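Both example lines share the same shape: a code point or range, a semicolon, one or more space-delimited values, and a trailing `#` comment. A minimal Java parser for that shape might look as follows. It is a sketch based only on the two examples shown here, not on the full UCD file conventions, and the class name and fields are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class IdentifierDataLineSketch {
    final int start;            // first code point of the range
    final int end;              // last code point of the range (inclusive)
    final List<String> values;  // one Status value, or one or more Type values

    IdentifierDataLineSketch(int start, int end, List<String> values) {
        this.start = start;
        this.end = end;
        this.values = values;
    }

    /** Parses one data line; returns null for blank or comment-only lines. */
    static IdentifierDataLineSketch parse(String line) {
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return null;
        }
        String[] fields = data.split(";");
        String[] range = fields[0].trim().split("\\.\\.");
        int start = Integer.parseInt(range[0], 16);
        // A single code point has no ".." part, so the range is one code point wide.
        int end = range.length > 1 ? Integer.parseInt(range[1], 16) : start;
        List<String> values = Arrays.asList(fields[1].trim().split("\\s+"));
        return new IdentifierDataLineSketch(start, end, values);
    }
}
```

Keeping every Type value in the list, rather than only one, is what allows the more nuanced filtering described above, such as admitting an Exclusion script while still rejecting Obsolete or Deprecated characters.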

Table 2. Data File List

Reference	File Name(s)	Contents
[idmod]	IdentifierStatus.txt IdentifierType.txt	Identifier Type and Status: Provides the list of additions and restrictions recommended for building a profile of identifiers for environments where security is at issue.
[confusables]	confusables.txt	Visually Confusable Characters: Provides a mapping for visual confusables for use in detecting possible security problems. The usage of the file is described in Section 4, Confusable Detection.
[confusablesSummary]	confusablesSummary.txt	A summary view of the confusables: Groups each set of confusables together, listing them first on a line starting with #, then individually with names and code points. See Section 4, Confusable Detection.
[intentional]	intentional.txt	Intentional Confusable Mappings: A selection of characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design.

Migration

Beginning with version 6.3.0, the version numbering of this document has been changed to indicate the version of the UCD that the data is based on. For versions up to and including 6.3.0, the following table shows the correspondence between the versions of this document and UCD versions that they were based on.

Table 3. Version Correspondence

Version	Release Date	Data File Directory	UCD Version	UCD Date
Version 1	2006-08-15	/Public/security/revision-02/	5.1.0	2008-04
draft only	2006-08-11	/Public/security/revision-03/	n/a	n/a
Version 2	2010-08-05	/Public/security/revision-04/	6.0.0	2010-10
Version 3	2012-07-23	/Public/security/revision-05/	6.1.0	2012-01
6.3.0	2013-11-11	/Public/security/6.3.0/	6.3.0	2013-09

If an update version of this standard is required between the associated UCD versions, the version numbering will include an update number in the 3rd field. For example, if a version of this document and its associated data is needed between UCD 6.3.0 and UCD 7.0.0, then a version 6.3.1 could be used.

Migrating Persistent Data

Implementations must migrate their persistent data stores (such as database indexes) whenever those implementations update to use the data files from a new version of this specification.

Stability is never guaranteed between versions, although it is maintained where feasible. In particular, an updated version of confusable mapping data may use a mapping for a particular character that is different from the mapping used for that character in an earlier version. Thus there may be cases where X → Y in Version N, and X → Z in Version N+1, where Z may or may not have mapped to Y in Version N. Even in cases where the logical data has not changed between versions, the order of lines in the data files may have been changed.

The Identifier Status does not have stability guarantees (such as “Once a character is Allowed, it will not become Restricted in future versions”), because the data is changing over time as we find out more about character usage. Certain of the Type values, such as Not_XID, are backward compatible but most may change as new data becomes available. The identifier data may also not appear to be completely consistent when just viewed from the perspective of script and general category. For example, it may well be that one character out of a set of nonspacing marks in a script is Restricted, while others are not. But that can be just a reflection of the fact that that character is obsolete and the others are not.

For identifier lookup, the data is aimed more at flagging possibly questionable characters, thus serving as one factor (among perhaps many, like using the "Safe Browsing" service) in determining whether the user should be notified in some way. For registration, flagged characters can result in a "soft no", that is, require the user to appeal a denial with more information.

For dealing with characters whose status changes to Restricted, implementations can use a grandfathering mechanism to maintain backwards compatibility.

Implementations should therefore have a strategy for migrating their persistent data stores (such as database indexes) that use any of the confusable mapping data or other data files.

Version 10.0 Migration

As of Unicode 10.0, Type=Aspirational is now empty; for more information, see [UAX31].

Version 9.0 Migration

There is an important data format change between versions 8.0 and 9.0. In particular, the xidmodifications.txt file from Version 8.0 has been split into two files for Version 9.0: IdentifierStatus.txt and IdentifierType.txt.

Version 9.0	Version 8.0
Field 1 of IdentifierStatus.txt	Field 1 of xidmodifications.txt
Field 1 of IdentifierType.txt	Field 2 of xidmodifications.txt

Multiple values are listed in field 1 of IdentifierType.txt. To convert to the old format of xidmodifications.txt, use the last value of that field. For example, the following values would correspond:

File	Field	Content
IdentifierType.txt	1	`180A ; Limited_Use Exclusion Not_XID`
xidmodifications.txt	2	`180A ; Restricted ; Not_XID`
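The "use the last value" rule can be sketched in Java as below. The helper name is hypothetical, and the sketch covers only the Type field; deriving the accompanying Status value for the old format is not shown.

```java
final class LegacyTypeSketch {
    /** Returns the last space-delimited value of an IdentifierType.txt field. */
    static String legacyType(String typeField) {
        String[] values = typeField.trim().split("\\s+");
        return values[values.length - 1];
    }
}
```

For the row shown above, the field "Limited_Use Exclusion Not_XID" reduces to the single old-format value Not_XID.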

Version 8.0 Migration

In Version 8.0, the following changes were made to the Identifier Status and Type:

Changed to the standard UCD formatting. For example, limited-use → Limited_Use.
- Usually this was simply changing the case and hyphen, but not-chars changed to Not_Character.
Aligned the Identifier Type better with UAX 31 and Unicode properties
- historic
  - → Exclusion, where from Table 4, Candidate Characters for Exclusion from Identifiers,
  - → Obsolete, otherwise
- limited-use
  - → Limited_Use, where from Table 7, Limited Use Scripts,
  - → Aspirational, where from Table 6, Aspirational Use Scripts (later incorporated into Limited_Use in Version 10.0)
  - → Uncommon_Use, otherwise
- obsolete
  - → Deprecated, where matching the Unicode property

Version 7.0 Migration

Due to production problems, versions of the confusable mapping tables before 7.0 did not maintain idempotency in all cases, so updating to version 8.0 is strongly advised.

Anyone using the skeleton mappings needs to rebuild any persistent uses of skeletons, such as in database indexes.

The SL, SA, and ML mappings in 7.0 were significantly changed to address the idempotency problem. However, the tables SL, SA, and ML were still problematic, and discouraged from use in 7.0. They were thus removed from version 8.0.

All of the data necessary for an implementation to recreate the removed tables is available in the remaining data (MA) plus the Unicode Character Database properties (script, casing, etc.). Such a recreation would examine each of the equivalence classes from the MA data, and filter out instances that did not fit the constraints (of script or casing). For the target character, it would choose the most neutral character, typically a symbol. However, the reasons for deprecating them still stand, so it is not recommended that implementations recreate them.

Note also that as the Script_Extensions data is made more complete, it may cause characters in the whole-script confusables data file to no longer match. For more information, see Section 4, Confusable Detection.

Acknowledgments

Mark Davis and Michel Suignard authored the bulk of the text, under direction from the Unicode Technical Committee. Steven Loomis and other people on the ICU team were very helpful in developing the original proposal for this technical report. Shane Carr analyzed the algorithms and supplied the source text for the rewrite of Sections 4 and 5 in version 10.

Thanks also to the following people for their feedback or contributions to this document or earlier versions of it, or to the source data for confusables or idmod: Julie Allen, Andrew Arnold, Vernon Cole, David Corbett (special thanks for the many contributions), Douglas Davidson, Rob Dawson, Alex Dejarnatt, Chris Fynn, Martin Dürst, Asmus Freytag, Deborah Goldsmith, Paul Hoffman, Denis Jacquerye, Cibu Johny, Patrick L. Jones, Peter Karlsson, Mike Kaplinskiy, Gervase Markham, Eric Muller, David Patterson, Erik van der Poel, Roozbeh Pournader, Michael van Riper, Marcos Sanz, Alexander Savenkov, Dominikus Scherkl, Manuel Strehl, Chris Weber, Ken Whistler, and Waïl Yahyaoui. Thanks to Peter Peng for his assistance with font confusables.

References

[CLDR]	Unicode Locales Project (Unicode Common Locale Data Repository) http://www.unicode.org/cldr/
[DCore]	Derived Core Properties http://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
[DemoConf]	http://unicode.org/cldr/utility/confusables.jsp
[DemoIDN]	http://unicode.org/cldr/utility/idna.jsp
[DemoIDNChars]	http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&abb=on&uts46+idna+idna2008
[FAQSec]	Unicode FAQ on Security Issues http://www.unicode.org/faq/security.html
[ICANN]	ICANN Documents: Internationalized Domain Names http://www.icann.org/en/topics/idn/ The IDN Variant Issues Project http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf Maximal Starting Repertoire Version 2 (MSR-2) https://www.icann.org/news/announcement-2-2015-04-27-en
[ICU]	International Components for Unicode http://site.icu-project.org/
[IDNA2003]	The IDNA2003 specification is defined by a cluster of IETF RFCs: IDNA [RFC3490] Nameprep [RFC3491] Punycode [RFC3492] Stringprep [RFC3454].
[IDNA2008]	The IDNA2008 specification is defined by a cluster of IETF RFCs: Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework http://tools.ietf.org/html/rfc5890 Internationalized Domain Names in Applications (IDNA) Protocol http://tools.ietf.org/html/rfc5891 The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) http://tools.ietf.org/html/rfc5892 Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) http://tools.ietf.org/html/rfc5893 There are also informative documents: Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale http://tools.ietf.org/html/rfc5894 The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) - Unicode 6.0 http://tools.ietf.org/html/rfc6452
[IDN-FAQ]	http://www.unicode.org/faq/idn.html
[EAI]	https://tools.ietf.org/html/rfc6531
[Feedback]	To suggest additions or changes to confusables or identifier restriction data, please see: http://unicode.org/reports/tr39/suggestions.html For issues in the text, please see: Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[RFC3454]	Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002. http://ietf.org/rfc/rfc3454.txt
[RFC3490]	Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003. http://ietf.org/rfc/rfc3490.txt
[RFC3491]	Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003. http://ietf.org/rfc/rfc3491.txt
[RFC3492]	Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003. http://ietf.org/rfc/rfc3492.txt
[Security-FAQ]	http://www.unicode.org/faq/security.html
[UCD]	Unicode Character Database. http://www.unicode.org/ucd/ For an overview of the Unicode Character Database and a list of its associated files.
[UCDFormat]	UCD File Format http://www.unicode.org/reports/tr44/#Format_Conventions
[UAX15]	UAX #15: Unicode Normalization Forms http://www.unicode.org/reports/tr15/
[UAX24]	UAX #24: Unicode Script Property http://www.unicode.org/reports/tr24/
[UAX29]	UAX #29: Unicode Text Segmentation http://www.unicode.org/reports/tr29/
[UAX31]	UAX #31: Unicode Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/
[Unicode]	The Unicode Standard For the latest version, see: http://www.unicode.org/versions/latest/
[UTR36]	UTR #36: Unicode Security Considerations http://www.unicode.org/reports/tr36/
[UTS18]	UTS #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/
[UTS39]	UTS #39: Unicode Security Mechanisms http://www.unicode.org/reports/tr39/
[UTS46]	Unicode IDNA Compatibility Processing http://www.unicode.org/reports/tr46/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous published version of this document.

Revision 17

Reissued for Unicode 11.0.
Section 3.1.1, Joining Controls
- Added new section describing the handling of Joining Controls.
Section 5.4, Optional Detection
- Added additional suggestions for nonspacing marks.

Revision 16 being a proposed update, only changes between revisions 15 and 17 are noted here.

Modifications for previous versions are listed in those respective versions.

© 2018 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

