[Unicode]  Technical Reports
 

Unicode Technical Standard #35

Unicode Locale Data Markup Language (LDML)

Version: 48.1
Editors: Mark Davis ([email protected]) and other CLDR committee members
Date: 2025-12-15
This Version: https://www.unicode.org/reports/tr35/tr35-77/tr35.html
Previous Version: https://www.unicode.org/reports/tr35/tr35-76/tr35.html
Latest Version: https://www.unicode.org/reports/tr35/
Corrigenda: https://cldr.unicode.org/index/corrigenda
Latest Proposed Update: https://www.unicode.org/reports/tr35/proposed.html
Namespace: https://www.unicode.org/cldr/
DTDs: https://www.unicode.org/cldr/dtd/48/
Change History: Modifications

Summary

This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For more information see About Unicode Technical Reports and the Specifications FAQ. Unicode Technical Reports are governed by the Unicode Terms of Use.

Parts

The LDML specification is divided into the following parts:

Contents of Part 1, Core

Introduction

Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.

The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.

But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [Comparisons].)

Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.

This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.

For more information, see the Common Locale Data Repository project page [LocaleProject].

As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.

Conformance

There are many ways to use the Unicode LDML specification and the CLDR data. The Unicode Consortium does not restrict the ways in which the format or data are used. However, an implementation may also claim conformance to the LDML specification and/or to CLDR data, as follows:

UAX35-C1. An implementation that claims conformance to this specification shall:

  1. Identify the sections of the specification that it conforms to.
    • For example, an implementation might claim conformance to all LDML features except for transforms and segments.
    • The names of sections may change for clarity, so the associated links should be included in any reference; links into LDML will remain stable.
  2. Interpret the relevant elements and attributes of LDML data in accordance with the descriptions in those sections.
    • For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to the Date Field Symbol Table.
  3. Declare which types of CLDR data it uses.
    • For example, an implementation might declare that it only uses language names, and those with a draft status of contributed or approved.
  4. Declare when it overrides CLDR data, or uses alt data.
    • For example, for //ldml/numbers/symbols/group an implementation could use alt="official" data.

An implementation may also make a general claim of conformance to the LDML specification and/or CLDR data. Such a claim is understood to claim conformance to all portions of this specification that are relevant to the operations performed by the implementation, except for those specifically declared as exceptions. For example, if an implementation making a general claim of conformance performs date formatting, and does not declare date formatting as an exception, it is understood to be claiming conformance to date formatting as described in the section listed below.

UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:

  1. Specify whether Unicode locale extensions are allowed.
  2. Specify the canonical form used for identifiers in terms of casing and field separator characters.

External specifications may also reference particular components of Unicode locale or language identifiers, such as:

> Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.

NOTE: UAX35-C2 is replaced by the following generalization.

The following lists the high-level sections with structures and/or processing algorithms. Conformance to a particular section may reference and require conformance to another section.

Unicode Locale Identifiers

Sections | Topics
Unicode Locale Identifier | identifier syntax, interpretation, and validity
Annex C. LocaleId Canonicalization | canonicalize
CLDR to BCP 47, BCP 47 to CLDR | convert
Language Identifier Field Definitions | interpretation and validity of -u key-value pairs
Locale Display Name Algorithm | locale display names

Unicode Locale Inheritance and Matching

Sections | Topics
Locale Inheritance and Matching | locale inheritance
Likely Subtags | likely subtags
Language Matching | locale matching

Units of Measurement

Sections | Topics
Unit Identifiers | unit identifier syntax, interpretation, and validity
Unit Identifier Normalization | identifier normalization
Unit Conversion | unit conversion
Unit Preferences | evaluation of user preferences
Unit Identifier Uniqueness | converting units into BCP47 format
Compound Units | unit display names

Number Formatting

Sections | Topics
Number Format Patterns | number format patterns, syntax and interpretation
Compact Number Formats | compact number formats
Rule-Based Number Formatting | spell-out number formatting

Date Formatting

Sections | Topics
Elements availableFormats, appendItems | date formatting, patterns
Date Format Patterns | date format patterns and symbols
Using Time Zone Names | timezone forms, fallback and parsing

Collation

Sections | Topics
Root Collation | Root collation syntax and structure
Collation Tailorings | Rule syntax and interpretation for language-specific ordering

Grammar

Sections | Topics
Grammatical Features | noun classes (except for plurals)
Language Plural Rules | plural and ordinal category rules, ranges

Miscellaneous

Sections | Topics
Unicode Sets | Unicode set syntax and interpretation
String Range | string-range syntax and interpretation
Transforms | transform identifier and rule syntax and interpretation
Segmentations | segmentation customizations
Synthesizing Sequence Names | constructing derived emoji names
Formatting Process | person name formatting
Part 7: Keyboards | keyboard structure and interpretation
Conformance (Message Format) | message formatting

Customization

Conformant implementations cannot modify CLDR structures, such as the syntax or interpretation of locale identifiers. There are usually mechanisms for implementations to customize these to a certain extent, using what are known as private-use codes. For example, an implementation could use the private-use language code qfz to mean a language that was not covered by BCP 47, or use a private-use extension in a Unicode locale identifier, or use a private-use unit such as xxx-smoot-per-second.

An implementation may also use a deprecated code instead of the corresponding preferred code. The most frequent case of this is an implementation whose earlier versions predated BCP 47 and used iw for Hebrew, rather than the BCP 47 (and CLDR) code he. When this is done, the CLDR data needs to be modified in appropriate places, not just in some file names. For example, the languageAlias data requires modification, from:

<languageAlias type="iw" replacement="he" reason="deprecated"/> <!-- Hebrew -->

to

<languageAlias type="he" replacement="iw" reason="deprecated"/> <!-- Hebrew -->

Minimized locale identifiers are also not required. For example, an implementation could consistently expand locale identifiers to include regions, such as expanding en to en_DE or de to de-AT.

Implementations may customize CLDR data, as long as they declare that they are doing so. This may include:

Omitting data

An implementation may dispense with locale data for locales that it does not support; or, for locales it does support, dispense with data that is at CoverageLevel=Comprehensive; or dispense with particular sorts of data, such as annotations for emoji.

Adding data

An implementation could add data for a locale that CLDR does not yet support, or add higher-coverage data for a locale than what CLDR has.

Overriding data

CLDR has a mechanism for overriding data using the alt mechanism. At build time, an implementation could override the default value by using an alt value. For example, take the following data:

<territory type="HK">Sonderverwaltungsregion Hongkong</territory>
<territory type="HK" alt="short">Hongkong</territory>

An implementation could, at build time, substitute the short value for the regular value, getting "Hongkong". It could instead support both values at runtime, using display option settings to pick between the regular value and the short value.
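
The build-time substitution described above can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration: the function name apply_alt and the wrapping <territories> element are assumptions made for the example, not part of any CLDR tooling.

```python
import xml.etree.ElementTree as ET

# Sample data from the text, wrapped in a parent element for parsing.
SAMPLE = """<territories>
  <territory type="HK">Sonderverwaltungsregion Hongkong</territory>
  <territory type="HK" alt="short">Hongkong</territory>
</territories>"""


def apply_alt(parent, tag, alt):
    """Return {type: text}, preferring values marked with the given alt."""
    values = {}
    for element in parent.findall(tag):
        key = element.get("type")
        element_alt = element.get("alt")
        if element_alt is None:
            # Regular value: keep it unless an alt value was already chosen.
            values.setdefault(key, element.text)
        elif element_alt == alt:
            # Requested alt value: overrides the regular value.
            values[key] = element.text
    return values


territories = ET.fromstring(SAMPLE)
print(apply_alt(territories, "territory", "short"))  # {'HK': 'Hongkong'}
```

Supporting both values at runtime would instead keep the regular and alt values side by side and select between them from a display option.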

Implementations can override the data in other ways as well, such as changing the spelling of a particular value.

Testing

The files in testData can be used to test conformance. Brief instructions for use are supplied in _readme.txt files in the different directories and/or in the headers of the files in question. For example, the following is from a sample header:

# Format:
# <source locale identifier>;<expected canonicalized locale identifier>
#
# The data lines are divided into 4 sets:
#   explicit:    a short list of explicit test cases.
#   fromAliases: test cases generated from the alias data.
#   decanonicalized: test cases generated by reversing the normalization process.
#   withIrrelevants: test cases generated from the others by adding irrelevant fields where possible,
#                           to ensure that the canonicalization implementation is not sensitive to irrelevant fields. These include:
#     Language: aaa
#     Script:   Adlm
#     Region:   AC
#     Variant:  fonipa

If an implementation overrides CLDR data, then various lines in the relevant test files may need to be modified correspondingly, or skipped.
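
As a rough sketch of how such a file might be consumed, the following Python reads data lines in the `<source>;<expected>` format shown in the sample header. It assumes that lines starting with "#" and blank lines carry no test data. The names parse_test_lines, failures, and the canonicalize callback are hypothetical, and the skip parameter illustrates one way to handle lines affected by overridden data.

```python
def parse_test_lines(lines):
    """Yield (source, expected) pairs from 'source;expected' data lines."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        source, _, expected = line.partition(";")
        yield source.strip(), expected.strip()


def failures(lines, canonicalize, skip=frozenset()):
    """Return (source, expected, actual) for each mismatching test case.

    `skip` lists source identifiers to ignore, for example cases affected
    by data that the implementation overrides.
    """
    result = []
    for source, expected in parse_test_lines(lines):
        if source in skip:
            continue
        actual = canonicalize(source)
        if actual != expected:
            result.append((source, expected, actual))
    return result
```

A caller would pass the lines of a test file together with its own canonicalization function, and treat a non-empty result as a conformance failure.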

EBNF

The EBNF syntax used in LDML is a variant of the Extended Backus-Naur Form (EBNF) notation used in W3C XML Notation. The main differences are:

  1. Bounded repetition following Perl regex syntax is allowed, such as digit{3} for 3 digits, digit{3,5} for 3 to 5 digits, and digit{3,} for 3 or more digits.
  2. Whitespace inside bracketed enumerations and ranges is ignored.
    • e.g., [A-Z a-z] is the same as [A-Za-z]
  3. A backslash may be used to escape a following "x"-prefixed hexadecimal code point or the immediately following character.
    • e.g., \x20 is the same as #x20 and [\&\-] is the same as [#x26#x2D]
  4. Constraints (well-formedness or validity) may use separate notes, and/or the W3C notations:
    • [ wfc: ... ]
    • [ vc: ... ]

In the text, this is sometimes referred to as "EBNF (Perl-based)".

What is a Locale?

Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.

The first issue is basic: what is a locale? In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries (regions), and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.

Locale data is not cast in stone: the data used on someone's machine may generally reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.

Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.

In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see Language and Locale IDs.

We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is called a resource tag.

Unicode Language and Locale Identifiers

Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.

The BCP 47 extensions (-u- and -t-) are described in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension.

Unicode Language Identifier

A Unicode language identifier has the following structure (provided in EBNF (Perl-based)). The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links listed under each production.

EBNF, with Validity / Comments under each production

unicode_language_id
  = "root"
  | (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag)
    (sep unicode_region_subtag)?
    (sep unicode_variant_subtag)* ;
  Comments: "root" is treated as a special unicode_language_subtag

unicode_language_subtag
  = alpha{2,3} | alpha{5,8} ;
  Validity: validity, latest-data

unicode_script_subtag
  = alpha{4} ;
  Validity: validity, latest-data

unicode_region_subtag
  = (alpha{2} | digit{3}) ;
  Validity: validity, latest-data

unicode_variant_subtag
  = (alphanum{5,8}
  | digit alphanum{3}) ;
  Validity: validity, latest-data

sep
  = [-_] ;

digit
  = [0-9] ;

alpha
  = [A-Z a-z] ;

alphanum
  = [0-9 A-Z a-z] ;

The following is an additional well-formedness constraint:

  1. [ wfc: The sequence of variant subtags must not have any duplicates (eg, de-1996-fonipa-1996 is not syntactically well-formed). ]

The semantics of the various subtags is explained in Language Identifier Field Definitions; there are also direct links from unicode_language_subtag, etc. While theoretically the unicode_language_subtag may have more than 3 letters through the IANA registration process, in practice that has not occurred. The unicode_language_subtag "und" may be omitted when there is a unicode_script_subtag; for that reason unicode_language_subtag values with 4 letters are not permitted. However, such unicode_language_id values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see BCP 47 Language Tag to Unicode BCP 47 Locale Identifier.

For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers.
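
The grammar and the duplicate-variant constraint above can be approximated with a regular expression. The following Python is a sketch of a well-formedness check only (it says nothing about validity); the function name is_well_formed_language_id and the constant names are assumptions made for illustration.

```python
import re

ALPHA = "[A-Za-z]"
ALPHANUM = "[0-9A-Za-z]"
SEP = "[-_]"
LANGUAGE = f"(?:{ALPHA}{{2,3}}|{ALPHA}{{5,8}})"
SCRIPT = f"{ALPHA}{{4}}"
REGION = f"(?:{ALPHA}{{2}}|[0-9]{{3}})"
VARIANT = f"(?:{ALPHANUM}{{5,8}}|[0-9]{ALPHANUM}{{3}})"

# Mirrors the unicode_language_id production, alternative by alternative.
LANGUAGE_ID = re.compile(
    f"(?:root|(?:{LANGUAGE}(?:{SEP}{SCRIPT})?|{SCRIPT})"
    f"(?:{SEP}{REGION})?"
    f"(?P<variants>(?:{SEP}{VARIANT})*))"
)


def is_well_formed_language_id(text):
    """Check syntax plus the 'no duplicate variants' constraint."""
    match = LANGUAGE_ID.fullmatch(text)
    if match is None:
        return False
    # Subtags are case-insensitive, so compare variants in lowercase.
    variants = re.split(SEP, (match.group("variants") or "").lower())[1:]
    return len(variants) == len(set(variants))


print(is_well_formed_language_id("es-419"))               # True
print(is_well_formed_language_id("de-1996-fonipa-1996"))  # False
```

Because the subtag lengths and character classes do not overlap after the first position, the expression needs no lookahead to tell script, region, and variant subtags apart.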

Unicode Locale Identifier

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in Unicode BCP 47 U Extension and Unicode BCP 47 T Extension. Other extensions and private use extensions are supported for pass-through. The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links listed under each production.

EBNF, with Validity / Comments under each production

unicode_locale_id
  = unicode_language_id
    extensions*
    pu_extensions? ;

extensions
  = unicode_locale_extensions
  | transformed_extensions
  | other_extensions ;

unicode_locale_extensions
  = sep [uU]
    ((sep keyword)+
    | (sep uattribute)+ (sep ufield)*) ;

transformed_extensions
  = sep [tT]
    ((sep tlang (sep tfield)*)
    | (sep tfield)+) ;

pu_extensions
  = sep [xX]
    (sep alphanum{1,8})+ ;

other_extensions
  = sep [alphanum-[tTuUxX]]
    (sep alphanum{2,8})+ ;

ufield (also known as keyword)
  = ukey (sep uvalue)? ;

ukey (also known as key)
  = alphanum alpha ;
  Validity: validity, latest-data
  Comments: Note that this is narrower than in [RFC6067], so that it is disjoint with tkey.

uvalue (also known as type)
  = alphanum{3,8}
    (sep alphanum{3,8})* ;
  Validity: validity, latest-data

uattribute (also known as attribute)
  = alphanum{3,8} ;

unicode_subdivision_id
  = unicode_region_subtag unicode_subdivision_suffix ;
  Validity: validity, latest-data

unicode_subdivision_suffix
  = alphanum{1,4} ;

unicode_measure_unit
  = alphanum{3,8}
    (sep alphanum{3,8})* ;
  Validity: validity, latest-data

tlang
  = unicode_language_subtag
    (sep unicode_script_subtag)?
    (sep unicode_region_subtag)?
    (sep unicode_variant_subtag)* ;
  Comments: same as in unicode_language_id

tfield
  = tkey tvalue ;
  Validity: validity, latest-data

tkey
  = alpha digit ;

tvalue
  = alphanum{3,8}
    (sep alphanum{3,8})+ ;

The following are additional well-formedness constraints:

  1. [ wfc: There cannot be more than one extension with the same singleton. For example, en-u-ca-buddhist-u-cf-standard is ill-formed.]
  2. [ wfc: There cannot be more than one ukey or tkey. For example, en-u-ca-buddhist-ca-islamic is ill-formed. ]
  3. [ wfc: The sequence of variant subtags in a tlang must not have any duplicates. ]
  4. [ wfc: The private use extension (-x-) must come after all other extensions. ]
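
Constraints 1 and 2 above can be checked with a single pass over the subtags. The Python below is a sketch under the assumption that the identifier already matches the grammar; the function name extension_wfc_errors and the message strings are illustrative only. It treats everything after -x- as private use and relies on the grammar's subtag shapes: inside -u- only a ukey is two characters long, and inside -t- only a tkey is a letter followed by a digit.

```python
import re


def extension_wfc_errors(locale_id):
    """Return messages for violated constraints (empty list if none found)."""
    subtags = re.split("[-_]", locale_id.lower())
    errors = []
    seen_singletons = set()
    seen_keys = set()
    current = None  # singleton of the extension being scanned, if any
    for subtag in subtags[1:]:
        if current == "x":
            break  # everything after -x- is private use; nothing to check
        if len(subtag) == 1:
            if subtag in seen_singletons:
                errors.append(f"duplicate singleton: {subtag}")
            seen_singletons.add(subtag)
            current = subtag
            continue
        is_ukey = current == "u" and len(subtag) == 2
        is_tkey = current == "t" and re.fullmatch("[a-z][0-9]", subtag) is not None
        if is_ukey or is_tkey:
            if subtag in seen_keys:
                errors.append(f"duplicate key: {subtag}")
            seen_keys.add(subtag)
    return errors


print(extension_wfc_errors("en-u-ca-buddhist-u-cf-standard"))  # ['duplicate singleton: u']
print(extension_wfc_errors("en-u-ca-buddhist-ca-islamic"))     # ['duplicate key: ca']
```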

For historical reasons, this is called a Unicode locale identifier. However, it also functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see Language and Locale IDs.

As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.

As for terminology, the term code may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the base language code. For example, the base language code for "en-US" (American English) is "en" (English). The type may also be referred to as a value or key-value.

All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [BCP47], especially when a Unicode locale identifier is used for locale data exchange in software protocols.

The identifiers can vary in case and in the separator characters. The "-" and "_" separators are treated as equivalent, although "-" is preferred.

A Unicode BCP 47 locale identifier (unicode_bcp47_locale_id) is a unicode_locale_id that meets the following additional constraints:

A well-formed Unicode BCP 47 locale identifier is always a well-formed BCP 47 language tag. The reverse, however, is not guaranteed; a BCP 47 language tag that contains an extlang subtag, an irregular subtag, or an initial 'x' subtag would not be a well-formed Unicode BCP 47 locale identifier. For details see BCP 47 Conformance. However, any BCP 47 language tag can easily be converted to a Unicode BCP 47 locale identifier as specified in BCP 47 Language Tag Conversion.

A Unicode CLDR locale identifier (unicode_cldr_locale_id) is a unicode_locale_id that meets the following additional constraints:

Note: The current version of CLDR data uses Unicode CLDR locale identifiers for backward compatibility. This might be changed in future CLDR releases.

Canonical Unicode Locale Identifiers

A unicode_locale_id has canonical syntax when:

For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes "foo" and "bar" in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification.
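
The example above shows three changes to the -u- extension: attributes are sorted, keywords are sorted by key, and the value "true" is dropped. The Python below sketches only those three operations, as inferred from that example. It assumes a single -u- extension that is not preceded by -x-, leaves the language identifier untouched, and does not cover the other conditions of canonical syntax. The function name canonicalize_u_extension is hypothetical.

```python
def canonicalize_u_extension(locale_id):
    """Reorder one -u- extension: sorted attributes, then keywords by key."""
    subtags = locale_id.replace("_", "-").split("-")
    lowered = [s.lower() for s in subtags]
    if "u" not in lowered:
        return "-".join(subtags)
    start = lowered.index("u")
    # The extension ends at the next singleton (one-character subtag).
    end = next(
        (i for i in range(start + 1, len(lowered)) if len(lowered[i]) == 1),
        len(lowered),
    )
    attributes = []
    keywords = {}  # key -> list of value subtags
    key = None
    for subtag in lowered[start + 1:end]:
        if len(subtag) == 2:      # a ukey is always two characters
            key = subtag
            keywords[key] = []
        elif key is None:         # before the first key: an attribute
            attributes.append(subtag)
        else:                     # part of the current key's value
            keywords[key].append(subtag)
    parts = sorted(attributes)
    for key in sorted(keywords):
        parts.append(key)
        if keywords[key] != ["true"]:  # the value "true" is omitted
            parts.extend(keywords[key])
    return "-".join(subtags[:start] + ["u"] + parts + subtags[end:])


print(canonicalize_u_extension("en-u-foo-bar-nu-thai-ca-buddhist-kk-true"))
# en-u-bar-foo-ca-buddhist-kk-nu-thai
```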

NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in Section 4.1 of BCP 47. Here are the considerations that led to that decision:

A unicode_locale_id is in canonical form when it has canonical syntax and contains no aliased subtags. A unicode_locale_id can be transformed into canonical form according to Annex C. LocaleId Canonicalization.

A unicode_locale_id is maximal when the unicode_language_id and tlang (if any) have been transformed by the Add Likely Subtags operation in Likely Subtags, excluding "und".

Example: the maximal form of ja-Kana-t-it is ja-Kana-JP-t-it-latn-it

Note that the latn and final it don't use any uppercase characters, since they are not inside unicode_language_id.

Two unicode_locale_ids are equivalent when their maximal canonical forms are identical.

Example: "IW-HEBR-u-ms-imperial" ~ "he-u-ms-uksystem"

The equivalence relationship may change over time, such as when subtags are deprecated or likely subtag mappings change. For example, if two countries were to merge, then various subtags would become deprecated. These kinds of changes are generally very infrequent.

BCP 47 Conformance

Unicode language and locale identifiers inherit the design and the repertoire of subtags from [BCP47] Language Tags. There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR:

There are thus two subtypes of Unicode locale identifiers, as defined above.

These can both be easily converted to and from BCP 47 language tags as described below.

BCP 47 Language Tag Conversion

The different identifiers can be converted to one another as described in this section.

A valid [BCP47] language tag can be converted to a valid Unicode BCP 47 locale identifier according to Annex C. LocaleId Canonicalization.

The result is a Unicode BCP 47 locale identifier, in canonical form. It is both a BCP 47 language tag and a Unicode locale identifier. Because the process maps from all BCP 47 language tags into a subset of BCP 47 language tags, the format changes are not reversible, much as a lowercase transformation of the string “McGowan” is not reversible.

Table: BCP 47 Language Tag to Unicode BCP 47 Locale Identifier Examples

BCP 47 language tag | Unicode BCP 47 locale identifier | Comments
en-US | en-US | no changes
iw-FX | he-FR | BCP 47 canonicalization
cmn-TW | zh-TW | language alias
zh-cmn-TW | zh-TW | BCP 47 canonicalization, then language alias
sr-CS | sr-RS | territory alias
sh | sr-Latn | multiple replacement subtags
sh-Cyrl | sr-Cyrl | no replacement with multiple replacement subtags
hy-SU | hy-AM | multiple territory values: <territoryAlias type="SU" replacement="RU AM AZ BY EE GE KZ KG LV LT MD TJ TM UA UZ" …/>
i-enochian | und-x-i-enochian | prefix any legacy language tags (marked as “Type: grandfathered” in BCP 47) with "und-x-"
x-abc | und-x-abc | prefix with "und-", so that there is always a base language subtag

Unicode Locale Identifier: CLDR to BCP 47

A Unicode CLDR locale identifier can be converted to a valid [BCP47] language tag (which is also a Unicode BCP 47 locale identifier) by performing the following transformation.

  1. Replace the "_" separators with "-"
  2. Replace the special language identifier "root" with the BCP 47 primary language tag "und"
  3. Add an initial "und" primary language subtag if the first subtag is a script.
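
The three steps above translate directly into a short function. The Python below is a sketch assuming a well-formed Unicode CLDR locale identifier as input; the name cldr_to_bcp47 is an illustrative choice. Because "root" is itself four letters long, it has to be tested before the script check.

```python
import re


def cldr_to_bcp47(cldr_id):
    """Convert a Unicode CLDR locale identifier to a BCP 47 language tag."""
    subtags = cldr_id.replace("_", "-").split("-")  # step 1: separators
    if subtags[0].lower() == "root":                # step 2: root -> und
        subtags[0] = "und"
    elif re.fullmatch("[A-Za-z]{4}", subtags[0]):   # step 3: leading script
        subtags.insert(0, "und")
    return "-".join(subtags)


print(cldr_to_bcp47("root_u_cu_usd"))  # und-u-cu-usd
print(cldr_to_bcp47("Latn_DE"))        # und-Latn-DE
```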

Examples:

Unicode CLDR locale identifier | BCP 47 language tag | Comments
en_US | en-US | change separator
de_DE_u_co_phonebk | de-DE-u-co-phonebk | change separator
root | und | change to "und"
root_u_cu_usd | und-u-cu-usd | change to "und"
Latn_DE | und-Latn-DE | add "und"

Unicode Locale Identifier: BCP 47 to CLDR

A Unicode BCP 47 locale identifier can be transformed into a Unicode CLDR locale identifier by performing the following transformation.

  1. the separator is changed to "_"
  2. the primary language subtag "und" is replaced with "root" if no script, region, or variant subtags are present.
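
The reverse direction can be sketched the same way. The Python below assumes a well-formed Unicode BCP 47 locale identifier as input, and the name bcp47_to_cldr is illustrative. It relies on the fact that an extension always begins with a one-character singleton, so "und" is followed by no script, region, or variant exactly when it stands alone or is followed by a singleton. The sketch leaves letter case unchanged.

```python
def bcp47_to_cldr(tag):
    """Convert a Unicode BCP 47 locale identifier to a CLDR locale identifier."""
    subtags = tag.split("-")
    # "und" becomes "root" only when nothing but extensions follows it.
    bare = len(subtags) == 1 or len(subtags[1]) == 1
    if subtags[0].lower() == "und" and bare:
        subtags[0] = "root"
    return "_".join(subtags)


print(bcp47_to_cldr("und-US"))        # und_US
print(bcp47_to_cldr("und-u-cu-usd"))  # root_u_cu_usd
```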

Examples:

BCP 47 language tag | Unicode CLDR locale identifier | Comments
en-US | en_US | changes separator
und | root | changes to "root", because no script, region, or variant tag is present
und-US | und_US | no change to "und", because a region subtag is present
und-u-cu-USD | root_u_cu_usd | changes to "root", because no script, region, or variant tag is present

Truncation

BCP 47 requires that implementations allow for language tags of at least 35 characters, in Section 4.1.1. To allow for use of extensions, CLDR extends that minimum to 255 for Unicode locale identifiers. Theoretically, a language tag could be far longer, due to the possibility of a large number of variants and extensions. In practice, the typical size of a locale or language identifier will be much smaller, so implementations can optimize for smaller sizes, as long as there is an escape mechanism allowing for up to 255.

Language Identifier Field Definitions

Unicode language and locale identifier field values are provided in the following table. Note that some private-use BCP 47 field values are given specific meanings in CLDR. While field values are based on [BCP47] subtag values, their validity status in CLDR is specified by means of machine-readable files in the common/validity/ subdirectory, such as language.xml. For the format of those files and more information, see Validity Data.

unicode_language_subtag (also known as a Unicode base language code)

Subtags in the language.xml file (see Validity Data). These are based on [BCP47] subtag values marked as Type: language

ISO 639-3 introduces the notion of "macrolanguages", where certain ISO 639-1 or ISO 639-2 codes are given broad semantics, and additional codes are given for the narrower semantics. For backwards compatibility, Unicode language identifiers retain use of the narrower semantics for these codes. For example:

For | Use | Not
Standard Chinese (Mandarin) | zh | cmn
Standard Arabic | ar | arb
Standard Malay | ms | zsm
Standard Swahili | sw | swh
Standard Uzbek | uz | uzn
Standard Konkani | kok | gom
Northern Kurdish | ku | kmr

If a language subtag matches the type attribute of a languageAlias element, then the replacement value is used instead. For example, because "swh" occurs in <languageAlias type="swh" replacement="sw" />, "sw" must be used instead of "swh". Thus Unicode language identifiers use "ar-EG" for Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW" for Mandarin Chinese (Taiwan), not "cmn-TW".
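
A minimal sketch of this replacement in Python follows. The alias table is hand-copied from the Use/Not table above purely for illustration; an actual implementation would presumably read the languageAlias elements from the CLDR data. The names LANGUAGE_ALIASES and replace_language_alias are hypothetical, and aliases whose replacement carries more than one subtag (such as sh to sr-Latn) are not handled here.

```python
import re

# Illustrative subset only, taken from the Use/Not table in the text.
LANGUAGE_ALIASES = {
    "cmn": "zh",
    "arb": "ar",
    "zsm": "ms",
    "swh": "sw",
    "uzn": "uz",
    "gom": "kok",
    "kmr": "ku",
}


def replace_language_alias(language_id, aliases=LANGUAGE_ALIASES):
    """Replace an aliased language subtag, keeping the rest of the identifier."""
    # Split off the first subtag; the captured separator is kept in the list.
    parts = re.split("([-_])", language_id, maxsplit=1)
    parts[0] = aliases.get(parts[0].lower(), parts[0])
    return "".join(parts)


print(replace_language_alias("arb-EG"))  # ar-EG
print(replace_language_alias("cmn-TW"))  # zh-TW
```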

The private use codes listed as excluded in Private Use Codes will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

The CLDR provides data for normalizing language/locale codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US"; see the Aliases Chart.

The following are special language subtags:

Subtag | Name | Comment
mis | Uncoded languages | The content is in a language that doesn't yet have an ISO 639 code.
mul | Multiple languages | The content contains more than one language or text that is simultaneously in multiple languages (such as brand names).
zxx | No linguistic content | The content is not in any particular languages (such as images, symbols, etc.)

unicode_script_subtag (also known as a Unicode script code)

Subtags in the script.xml file (see Validity Data). These are based on [BCP47] subtag values marked as Type: script

In most cases the script is not necessary, since the language is only customarily written in a single script. Examples of cases where it is used are:

| Subtag | Description |
| --- | --- |
| az_Arab | Azerbaijani in Arabic script |
| az_Cyrl | Azerbaijani in Cyrillic script |
| az_Latn | Azerbaijani in Latin script |
| zh_Hans | Chinese, in simplified script (=zh, zh-Hans, zh-CN, zh-Hans-CN) |
| zh_Hant | Chinese, in traditional script |

Unicode identifiers give specific semantics to certain Unicode Script values. For more information, see also [UAX24]:

| Code | Name | Description |
| --- | --- | --- |
| Qaag | Zawgyi | Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration. |
| Qaai | Inherited | deprecated: the canonicalized form is Zinh |
| Zinh | Inherited | |
| Zsye | Emoji Style | Prefer emoji style for characters that have both text and emoji styles available. |
| Zsym | Text Style | Prefer text style for characters that have both text and emoji styles available. |
| Zxxx | Unwritten | Indicates spoken or otherwise unwritten content. For example, see the samples below. |
| Zyyy | Common | |
| Zzzz | Unknown | |

Samples for Zxxx:

| Sample(s) | Description |
| --- | --- |
| uz | either written or spoken content |
| uz-Latn or uz-Arab | written-only content (particular script) |
| uz-Zyyy | written-only content (unspecified script) |
| uz-Zxxx | spoken-only content |
| uz-Latn, uz-Zxxx | both specific written and spoken content (using a language list) |

The private use subtags listed as excluded in Private Use Codes will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

unicode_region_subtag (also known as a Unicode region code, or a Unicode territory code)

Subtags in the region.xml file (see Validity Data). These are based on [BCP47] subtag values marked as Type: region

Unicode identifiers give specific semantics to the following subtags. (The alpha2 codes are used as Unicode region subtags. The alpha3 and numeric codes are derived according to Numeric Codes and listed here for additional documentation.)

| alpha2 | alpha3 | num | Name | Comment | ISO 3166-1 status |
| --- | --- | --- | --- | --- | --- |
| QO | QOO | 961 | Outlying Oceania | countries in Oceania [009] that do not have a subcontinent. | private use |
| QU | QUU | 967 | European Union | deprecated: the canonicalized form is EU | private use |
| UK | - | - | United Kingdom | deprecated: the canonicalized form is GB | exceptionally reserved |
| XA | XAA | 973 | Pseudo-Accents | special code indicating derived testing locale with English + added accents and lengthened | private use |
| XB | XBB | 974 | Pseudo-Bidi | special code indicating derived testing locale with forced RTL English | private use |
| XK | XKK | 983 | Kosovo | industry practice | private use |
| ZZ | ZZZ | 999 | Unknown or Invalid Territory | used in APIs or as replacement for invalid code | private use |

The private use subtags listed as excluded in Private Use Codes will normally never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications. However, LDML may follow widespread industry practice in the use of some of these codes, such as for XK.

The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".

Special Codes:

unicode_variant_subtag (also known as a Unicode language variant code)

Subtags in the variant.xml file (see Validity Data). These are based on [BCP47] subtag values marked as Type: variant. The sequence of variant tags must not have any duplicates: thus de-1996-fonipa-1996 is invalid, while de-1996-fonipa and de-fonipa-1996 are both valid.
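The no-duplicates rule for variants can be checked with a few lines of code. The following is a minimal sketch, assuming the variant subtags have already been extracted from the identifier into a list; the function name is hypothetical, and the comparison is case-insensitive on the assumption that subtag casing carries no meaning.

```python
def has_duplicate_variants(variants):
    """Return True if any variant subtag occurs more than once."""
    seen = set()
    for variant in variants:
        key = variant.lower()  # compare case-insensitively
        if key in seen:
            return True
        seen.add(key)
    return False


if __name__ == "__main__":
    print(has_duplicate_variants(["1996", "fonipa", "1996"]))  # de-1996-fonipa-1996
    print(has_duplicate_variants(["1996", "fonipa"]))          # de-1996-fonipa
```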

CLDR provides data for normalizing variant codes. For handling of the "POSIX" variant, see Legacy Variants.

Examples:

en, fr_BE, zh-Hant-HK

Deprecated codes—such as QU above—are valid, but strongly discouraged.

A locale that only has a language subtag (and optionally a script subtag) is called a language locale; one with both language and territory subtag is called a territory locale (or country locale).

Special Codes

Unknown or Invalid Identifiers

The following identifiers are used to indicate an unknown or invalid code in Unicode language and locale identifiers. For Unicode identifiers, the region code uses a private use ISO 3166 code, and Time Zone code uses an additional code; the others are defined by the relevant standards. When these codes are used in APIs connected with Unicode identifiers, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.

| Code Type | Value | Description in Referenced Standards |
| --- | --- | --- |
| Language | und | Undetermined language, also used for “root” |
| Script | Zzzz | Code for uncoded script, Unknown [UAX24] |
| Region | ZZ | Unknown or Invalid Territory |
| Currency | XXX | The codes assigned for transactions where no currency is involved |
| Time Zone | unk | Unknown or Invalid Time Zone |
| Subdivision | <region>zzzz | Unknown or Invalid Subdivision |

When only the script or region are known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.

Numeric Codes

For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the codes 900-999 (total: 100), and AAA, QMA-QZZ, XAA-XZZ, and ZZZ (total: 1092). Unicode identifiers supply a standard mapping to these: for the numeric codes, it uses the top of the numeric private use range; for the 3-letter codes it doubles the final letter. These are the resulting mappings for all of the private use region codes:

| Region | UN/ISO Numeric | ISO 3-Letter |
| --- | --- | --- |
| AA | 958 | AAA |
| QM..QZ | 959..972 | QMM..QZZ |
| XA..XZ | 973..998 | XAA..XZZ |
| ZZ | 999 | ZZZ |
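The mapping in this table is regular enough to compute. The sketch below is one possible way to derive the numeric and 3-letter codes for a private use region code: it doubles the final letter and offsets into the ranges shown above. The function name is hypothetical.

```python
from typing import Tuple


def private_use_region_codes(alpha2: str) -> Tuple[int, str]:
    """Return (numeric, alpha3) for a private use two-letter region code."""
    code = alpha2.upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a two-letter region code: {alpha2!r}")
    first, second = code[0], code[1]
    if code == "AA":
        numeric = 958
    elif first == "Q" and "M" <= second <= "Z":
        numeric = 959 + ord(second) - ord("M")   # QM..QZ -> 959..972
    elif first == "X":
        numeric = 973 + ord(second) - ord("A")   # XA..XZ -> 973..998
    elif code == "ZZ":
        numeric = 999
    else:
        raise ValueError(f"not a private use region code: {alpha2!r}")
    return numeric, code + second  # alpha3 doubles the final letter


if __name__ == "__main__":
    for region in ("QO", "XK", "ZZ"):
        print(region, private_use_region_codes(region))
```

The results for QO, QU, XA, XB, XK, and ZZ agree with the alpha3 and num columns listed for those codes in the region subtag table.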

For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):

| Script | Numeric |
| --- | --- |
| Qaaa..Qabx | 900..949 |

Private Use Codes

Private use codes fall into three groups.

Table: Private Use Codes in CLDR

| category | status | codes |
| --- | --- | --- |
| base language | defined | none |
| | reserved | qaa..qfy |
| | excluded | qfz..qtz |
| script | defined | Qaai (obsolete), Qaag |
| | reserved | Qaaa..Qaaf Qaah Qaaj..Qaap |
| | excluded | Qaaq..Qabx |
| region | defined | QO, QU, UK, XA, XB, XK, ZZ |
| | reserved | AA QM..QN QP..QT QV..QZ |
| | excluded | XC..XJ, XL..XZ |
| timezone | defined | IANA: Etc/Unknown; bcp47: as listed in bcp47/timezone.xml |
| | reserved | bcp47: all non-5 letter codes not starting with x |
| | excluded | bcp47: all non-5 letter codes starting with x |

See also Unknown or Invalid Identifiers.

Special Script Codes

Certain valid script codes require special handling. These are the codes in Script Codes with the words "variant" or "alias" within parentheses, excluding Zsye. The Compound codes include characters in multiple scripts; the Visual variants are distinct in appearance, but otherwise encompass a single script; and the Subsets exclude certain characters from a script. The Equivalents for Subsets are not as well defined, so the "Equivalents" are marked as approximate.

| Variant | Script | Equivalent |
| --- | --- | --- |
| Compound | Jpan | ≡ Hani ∪ Hira ∪ Kana |
| | Hrkt | ≡ Hira ∪ Kana |
| | Kore | ≡ Hani ∪ Hang |
| | Hanb | ≡ Hani ∪ Bopo |
| | Hntl | ≡ Hant ∪ Latn |
| Visual | Aran | ≡ Arab (Nastaliq variant) |
| | Cyrs | ≡ Cyrl (Old Church Slavonic variant) |
| | Latf | ≡ Latn (Fraktur variant) |
| | Latg | ≡ Latn (Gaelic variant) |
| | Syrn | ≡ Syrc (Eastern variant) |
| | Syre | ≡ Syrc (Estrangelo variant) |
| | Syrj | ≡ Syrc (Western variant) |
| Subset | Jamo | ≃ Hang − LVT - LV |
| | Hans | ≃ Hani − Traditional-only |
| | Hant | ≃ Hani − Simplified-only |

The special codes most frequently used are in the locale identifiers zh-Hans, zh-Hant, ja-Jpan, and ko-Kore: the first two are Subsets, and the last two are Compounds. These are used, for example, in Likely Subtags in LDML.

The Equivalent values in the Subset variants are only approximate, and the variants are also visual variants. Thus Hans is a request for:

Visual variant script codes (that are not Subset variants) can be used in a locale identifier to request a particular rendering. For example, ar_Aran could be used to request that ar_Arab data be used, but with a Nastaliq-style font. However, the few variant script codes represent only a very small fraction of the different script variants in use. Moreover, this feature is not widely supported, and may give unexpected results when not supported. For example, an implementation might not recognize Aran in uz-Aran at all, and return results for uz-Latn.

Some of the special codes are used in other specifications, such as in Mixed_Script_Detection.

Unicode BCP 47 U Extension

[BCP47] Language Tags provides a mechanism for extending language tags for use in various applications by extension subtags. Each extension subtag is identified by a single alphanumeric character subtag assigned by IANA.

The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [RFC6067] and extension 't' for transformed content [RFC6497]. The Unicode BCP 47 extension data defines the complete list of valid subtags.

These subtags are all in lowercase (that is the canonical casing for these subtags); however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions consist of two to eight alphanumeric characters and meet the rule extension in [BCP47].

The -u- Extension. The syntax of 'u' extension subtags is defined by the rule unicode_locale_extensions in Unicode locale identifier, except that the subtag separator sep must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag.

A 'u' extension may contain multiple attributes or ufields as defined in Unicode locale identifier. The canonical syntax is defined in Canonical Unicode Locale Identifiers.

See also Unicode Extensions for BCP 47 on the CLDR site.

Key And Type Definitions

The following chart contains a set of U extension key values that are currently available, with a description or sampling of the U extension type values. Each category is associated with an XML file in the bcp47 directory.

For the complete list of valid keys and types defined for Unicode locale extensions, see U Extension Data Files. For information on the process for adding a new key/type, see [LocaleProject].

Most type values are represented by a single subtag in the current version of CLDR. There are exceptions, such as types used for key "ca" (calendar) and "kr" (collation reordering). If the type is not included, then the type value "true" is assumed. Note that the default for a key with a possible "true" value is often "false", but may not always be. Note also that "true"/"True" is not a valid script code, since the ISO 15924 Registration Authority has exceptionally reserved it, which means that it will not be assigned for any purpose.

Note that canonicalization does not change invalid locales to valid locales. For example, und-u-ka-true canonicalizes to und-u-ka, but:

  1. "und-u-ka-true" — is invalid, since "true" is not a valid value for ka
  2. "und-u-ka" — is invalid, since the value "true" is assumed whenever there is no value, and "true" is not a valid value for ka
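To make the implied "true" value concrete, here is a minimal sketch of extracting key/type pairs from a 'u' extension. It is not a conformant parser: it assumes that keys are exactly two characters long, that longer subtags following a key belong to that key's type, that subtags before the first key are attributes, and that the next single-character subtag ends the extension. It does no validity checking and does not handle a "u" that appears after a private use "x" singleton. The function name is hypothetical.

```python
def parse_u_extension(tag: str) -> dict:
    """Return {key: type} for the 'u' extension of a language tag.

    A key with no type subtags is given the type value "true".
    """
    subtags = tag.lower().replace("_", "-").split("-")
    try:
        start = subtags.index("u", 1) + 1  # skip the language subtag
    except ValueError:
        return {}  # no 'u' extension present

    collected = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:        # next singleton ends the 'u' extension
            break
        if len(subtag) == 2:        # assumption: two-character subtag is a key
            key = subtag
            collected[key] = []
        elif key is not None:       # type subtag belonging to the current key
            collected[key].append(subtag)
        # subtags before the first key are attributes; ignored in this sketch

    return {k: "-".join(v) if v else "true" for k, v in collected.items()}


if __name__ == "__main__":
    print(parse_u_extension("en-u-ca-islamic-civil-kn"))
    print(parse_u_extension("und-u-ka"))
```

Note that the multi-subtag type "islamic-civil" is kept together, while "kn" and "ka", which have no type subtags, come back as "true". Whether "true" is a valid value for a given key is a separate validity check.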

The BCP 47 form for keys and types is the canonical form, and recommended. Other aliases are included for backwards compatibility.

Table: Key/Type Definitions

Columns: key (old key name) | key description | example type (old type name) | type description
A Unicode Calendar Identifier defines a type of calendar. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="ca" in bcp47/calendar.xml.
This selects calendar-specific data within a locale used for formatting and parsing, such as date/time symbols and patterns; it also selects supplemental calendarData used for calendrical calculations. The value can affect the computation of the first day of the week: see First Day Overrides.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| ca (calendar) | Calendar algorithm (For information on the calendar algorithms associated with the data used with these, see [Calendars].) | buddhist | Thai Buddhist calendar (same as Gregorian except for the year) |
| | | chinese | Traditional Chinese calendar |
| | | gregory | Gregorian calendar |
| | | islamic | Islamic calendar |
| | | islamic-civil | Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - civil epoch) |
| | | islamic-umalqura | Islamic calendar, Umm al-Qura |

Note: Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura".
A Unicode Currency Format Identifier defines a style for currency formatting. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="cf" in bcp47/currency.xml.
This selects the specific type of currency formatting pattern within a locale.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| cf | Currency Format style | standard | Negative numbers use the minusSign symbol (the default). |
| | | account | Negative numbers use parentheses or equivalent. |
A Unicode Collation Identifier defines a type of collation (sort order). Well-formed values match uvalue. The valid values are those name attribute values in the type elements of bcp47/collation.xml.
For information on each collation setting parameter, from ka to vt, see Setting Options.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| co (collation) | Collation type | standard | The default ordering for each language. For root it is based on the [DUCET] (Default Unicode Collation Element Table): see Root Collation. Each other locale is based on that, except for appropriate modifications to certain characters for that language. |
| | | search | A special collation type dedicated for string search: it is not used to determine the relative order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between ‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa. A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric” search as described in the [UCA] section Asymmetric Search). The search collator in root supplies matching rules that are appropriate for most languages (and which are different than the root collation behavior); language-specific search collators may be provided to override the matching rules for a given language as necessary. |

Other ufields provide additional choices for certain locales; they only have effect in certain locales.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| co (collation), continued | | phonetic | Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use. |
| | | pinyin | Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese) |
| | | searchjl | Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search as described in the [UCA] section Asymmetric Search and obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant. |
A Unicode Currency Identifier defines a type of currency. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="cu" in bcp47/currency.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| cu (currency) | Currency type | ISO 4217 code, plus others in common use | Well-formed codes are of the form [A-Za-z]{3}, with the canonical format being [A-Z]{3}. The valid codes are ones that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. Supplemental Currency Data provides the list of countries (regions) and time periods associated with each currency code. It also supplies the default number of decimals. The XXX code is given a broader interpretation than in ISO 4217, as Unknown or Invalid Currency. |

A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). Well-formed values match uvalue. The valid values consist of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
This affects break iteration regardless of locale.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| dx | Dictionary break script exclusions | unicode_script_subtag values | See the rules below. |

  • One or more items of type SCRIPT_CODE (as usual, separated by hyphens), which are valid unicode_script_subtag values.
  • Each of the values for the DX key must be a short script property value in the UCD, or one of the compound script values like jpan. The compound script values are expanded when interpreted, e.g., -dx-jpan = -dx-hani-hira-kana
  • The values may be in any order, e.g., -dx-thai-hani = -dx-hani-thai. However, the canonical order for the bcp47 subtag is alphabetical, e.g., dx-hani-thai
  • Dictionary-based break iterators will ignore each character whose Script_Extension value set intersects with the DX value set.
  • The code Zyyy (Common) can be specified to exclude all scripts, if and only if it is the only SCRIPT_CODE value specified. If it is not the only script code, Zyyy has the normal meaning: excluding Script_Extension=Common.
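The ordering and expansion rules for dx values can be illustrated with a small sketch. The compound expansions below are taken from the Special Script Codes table (for example, Jpan ≡ Hani ∪ Hira ∪ Kana); the function names and the choice to work on lowercase subtags are assumptions made for this illustration, and the special handling of Zyyy is not modeled.

```python
# Compound script codes and their components, per the Special Script Codes table.
COMPOUND_SCRIPTS = {
    "jpan": {"hani", "hira", "kana"},
    "hrkt": {"hira", "kana"},
    "kore": {"hani", "hang"},
    "hanb": {"hani", "bopo"},
}


def canonical_dx(values):
    """Return the dx keyword with its script codes in alphabetical order."""
    return "dx-" + "-".join(sorted(value.lower() for value in values))


def expand_dx(values):
    """Return the set of script codes after expanding compound codes."""
    expanded = set()
    for value in values:
        code = value.lower()
        expanded |= COMPOUND_SCRIPTS.get(code, {code})
    return expanded


if __name__ == "__main__":
    print(canonical_dx(["thai", "hani"]))
    print(sorted(expand_dx(["jpan"])))
```

Keeping canonical ordering separate from expansion reflects the text: the canonical form sorts the subtags as written, while expansion only happens when the value is interpreted.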
A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji">. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| em | Emoji presentation style | emoji | Use an emoji presentation for emoji characters if possible. |
| | | text | Use a text presentation for emoji characters if possible. |
| | | default | Use the default presentation for emoji characters as specified in UTR #51 Presentation Style. |
A Unicode First Day Identifier defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data for the region (see Part 4 Dates, Week Data). Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="fw" in bcp47/calendar.xml. The value can affect the computation of the first day of the week: see First Day Overrides.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| fw | First day of week | sun | Sunday |
| | | mon | Monday |
| | | sat | Saturday |
A Unicode Hour Cycle Identifier defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data for the region (see Part 4 Dates, Time Data). Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="hc" in bcp47/calendar.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| hc | Hour cycle | h12 | Hour system using 1–12; corresponds to 'h' in patterns |
| | | h23 | Hour system using 0–23; corresponds to 'H' in patterns |
| | | h11 | Hour system using 0–11; corresponds to 'K' in patterns |
| | | h24 | Hour system using 1–24; corresponds to 'k' in patterns |
| | | c12 | Technical Preview: Locale-preferred 12-hour cycle; resolves to either h11 or h12. First, select the supplemental time data hours element according to the Region Override; if that keyword is not present, use the Unicode Region Subtag; if neither is present, or if the region does not have an entry in supplemental time data, use the region 001. Then, iterate through the allowed list of hour cycle symbols in preference order. If there is an entry with symbol K before an entry with symbol h, use h11; otherwise, use h12. |
| | | c24 | Technical Preview: Locale-preferred 24-hour cycle; resolves to either h23 or h24. First, select the supplemental time data hours element as above. Then, iterate through the allowed list of hour cycle symbols in preference order. If there is an entry with symbol k before an entry with symbol H, use h24; otherwise, use h23. |
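The second half of the c12 and c24 resolution (after the hours element has been selected) can be sketched as below. The function names are hypothetical, and the sketch makes two assumptions that the text does not spell out: only entries that are exactly K, h, k, or H are considered (entries such as hb or hB are skipped), and when only one of the two symbols is present it counts as coming first. The input lists are illustrative, not quoted CLDR data.

```python
def _first_of(allowed, symbols):
    """Return the first entry of `allowed` that is one of `symbols`, else None."""
    for entry in allowed:
        if entry in symbols:
            return entry
    return None


def resolve_c12(allowed):
    """Resolve c12 to h11 if K comes before h in the allowed list, else h12."""
    return "h11" if _first_of(allowed, ("K", "h")) == "K" else "h12"


def resolve_c24(allowed):
    """Resolve c24 to h24 if k comes before H in the allowed list, else h23."""
    return "h24" if _first_of(allowed, ("k", "H")) == "k" else "h23"


if __name__ == "__main__":
    print(resolve_c12(["H", "K", "h"]))         # K precedes h
    print(resolve_c12(["H", "h", "hb", "hB"]))  # no K entry
    print(resolve_c24(["H", "h"]))              # no k entry
```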
A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| lb | Line break style | strict | CSS level 3 line-break=strict, e.g. treat CJ as NS |
| | | normal | CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh |
| | | loose | CSS level 3 line-break=loose |
A Unicode Line Break Word Identifier defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option. Specifying "lw" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "keepall"). Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| lw | Line break word handling | normal | CSS level 3 word-break=normal, normal script/language behavior for midword breaks |
| | | breakall | CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting |
| | | keepall | CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks |
| | | phrase | Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline |
A Unicode Measurement System Identifier defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data for the region (see Part 2 General, Measurement System Data). Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="ms" in bcp47/measure.xml. The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| ms | Measurement system | metric | Metric System |
| | | ussystem | US System of measurement: feet, pints, etc.; pints are 16oz |
| | | uksystem | UK System of measurement: feet, pints, etc.; pints are 20oz |
A Measurement Unit Preference Override defines an override for measurement unit preference. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="mu" in bcp47/measure.xml. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| mu | Measurement unit override | celsius | Celsius as temperature unit |
| | | kelvin | Kelvin as temperature unit |
| | | fahrenhe | Fahrenheit as temperature unit |
A Unicode Number System Identifier defines a type of number system. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of bcp47/number.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| nu (numbers) | Numbering system | Unicode script subtag | Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".) For more information, see Numbering Systems. |
| | | arabext | Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits) |
| | | armnlow | Armenian lowercase numerals |
| | | roman | Roman numerals |
| | | romanlow | Roman lowercase numerals |
| | | tamldec | Modern Tamil decimal digits |
A Region Override specifies an alternate region to use for obtaining certain region-specific default values (those specified by the <rgScope> element), instead of using the region specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the unicode_language_subtag).

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| rg | Region Override | uszzzz | The valid values are a unicode_subdivision_id of type “unknown” or “regular”; this consists of a unicode_region_subtag for a regular region (not a macroregion), suffixed either by “zzzz” (case is not significant) to designate the region as a whole, or by a unicode_subdivision_suffix to provide more specificity. For example, “en-GB-u-rg-uszzzz” represents a locale for British English but with region-specific defaults set to US for items such as default currency, default calendar and week data, default time cycle, and default measurement system and unit preferences. The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. The value can affect the computation of the first day of the week: see First Day Overrides. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences. |
A Unicode Subdivision Identifier defines a regional subdivision used for locales. Well-formed values match uvalue. The valid values are based on the subdivisionContainment element as described in Section 3.6.5 Subdivision Codes.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| sd | Regional Subdivision | gbsct | A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix. For example, gbsct is “gb”+“sct” (where sct represents the subdivision code for Scotland). Thus “en-GB-u-sd-gbsct” represents the language variant “English as used in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca” represent “English as used in California”. See 3.6.5 Subdivision Codes. The value can affect the computation of the first day of the week: see First Day Overrides. |
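Both the rg and sd values are region codes with a suffix attached, so they can be split the same way. The sketch below assumes that the region part is either two letters or three digits and that the suffix is one to four alphanumeric characters; that grammar is an assumption stated here for illustration and is not spelled out in this section. The function name is hypothetical, and no check is made that the region or subdivision is actually valid.

```python
import re

# Assumed shape: region (2 letters or 3 digits) followed by a 1-4 character suffix.
_SUBDIVISION_ID = re.compile(r"([a-z]{2}|[0-9]{3})([a-z0-9]{1,4})")


def split_subdivision_id(value: str):
    """Split a subdivision id such as 'gbsct' into (region, suffix)."""
    match = _SUBDIVISION_ID.fullmatch(value.lower())
    if match is None:
        raise ValueError(f"not a well-formed subdivision id: {value!r}")
    region, suffix = match.groups()
    return region.upper(), suffix


if __name__ == "__main__":
    print(split_subdivision_id("gbsct"))   # sd value for Scotland
    print(split_subdivision_id("uszzzz"))  # rg value for the US as a whole
```

A suffix of "zzzz" designates the region as a whole, so a caller handling rg can treat that suffix as "no subdivision".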
A Unicode Sentence Break Suppressions Identifier defines a set of data to be used for suppressing certain sentence breaks that would otherwise be found by UAX #29 rules. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of key name="ss" in bcp47/segmentation.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| ss | Sentence break suppressions | none | Don’t use sentence break suppressions data (the default). |
| | | standard | Use sentence break suppressions data of type "standard" |
A Unicode Timezone Identifier defines a timezone. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of bcp47/timezone.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| tz (timezone) | Time zone | Unicode short time zone IDs | Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the common/bcp47/timezone.xml file, plus a few extra values. For more information, see Time Zone Identifiers. CLDR provides data for normalizing timezone codes. |

A Unicode Variant Identifier defines a special variant used for locales. Well-formed values match uvalue. The valid values are those name attribute values in the type elements of bcp47/variant.xml.

| key (old key name) | key description | example type (old type name) | type description |
| --- | --- | --- | --- |
| va | Common variant type | posix | POSIX style locale variant. For handling of the "POSIX" variant, see Legacy Variants. |

For more information on the allowed keys and types, see the specific elements below, and U Extension Data Files.

Additional keys or types might be added in future versions. Implementations of LDML should be robust to handle any syntactically valid key or type values.

Numbering System Data

LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file bcp47/number.xml. For example, for the latest version of the data see bcp47/number.xml.

Details about those numbering systems are defined in supplemental/numberingSystems.xml. For example, for the latest version of the data see supplemental/numberingSystems.xml.

LDML makes certain stability guarantees on this data:

  1. Like other BCP 47 identifiers, once a numeric identifier is added to bcp47/number.xml or numberingSystems.xml, it will never be removed from either of those files.
  2. If an identifier has type="numeric" in numberingSystems.xml, then
    1. It is a decimal, positional numbering system with an attribute digits=X, where X is a string with the 10 digits in order used by the numbering system.
    2. The values of the type and digits will never change.
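Because a numeric numbering system is decimal and positional, its digits string is all that is needed to render a non-negative integer. The following is a minimal sketch under that reading; the function name is hypothetical, and the Arabic-Indic digits U+0660..U+0669 are used as sample data on the assumption that they correspond to the digits string of the "arab" numbering system.

```python
def format_with_digits(value: int, digits: str) -> str:
    """Render a non-negative integer using a 10-character digits string."""
    if len(digits) != 10:
        raise ValueError("digits must contain exactly 10 characters")
    if value < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    # Map each ASCII digit of the decimal representation to the target digit.
    return "".join(digits[int(ch)] for ch in str(value))


# Sample digits string: Arabic-Indic digits U+0660..U+0669, in order.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

if __name__ == "__main__":
    print(format_with_digits(2025, ARABIC_INDIC_DIGITS))
    print(format_with_digits(48, "0123456789"))
```

This covers digit substitution only; grouping, signs, and decimal separators are part of number formatting and are outside the scope of the sketch.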

Time Zone Identifiers

LDML inherits time zone IDs from the tz database [Olson]. Because these IDs from the tz database do not satisfy the BCP 47 language subtag syntax requirements, CLDR defines short identifiers for use in the Unicode locale extension. The short identifiers are defined in the file common/bcp47/timezone.xml.

The short identifiers use UN/LOCODE [LOCODE] (excluding a space character) codes where possible. For example, the short identifier for "America/Los_Angeles" is "uslax" (the LOCODE for Los Angeles, US is "US LAX"). Identifiers of length not equal to 5 are used where there is no corresponding UN/LOCODE, such as "usnavajo" for "America/Shiprock", or "utcw01" for "Etc/GMT+1", so that they do not overlap with future UN/LOCODE.

The first two letters of a length 5 short identifier double as the time zone's associated region, unless the time zone has an explicit region attribute that overrides this. Short identifiers of length not equal to 5 are not associated with a region, unless the time zone has an explicit region attribute. Short identifiers are stable, meaning that they will not change no matter what changes happen in the base standard. For example, the short identifier for "America/Curacao" is still "ancur", with a region="CW" override, instead of the current LOCODE "cwcur".
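The region rule for short identifiers can be written down directly. This is a minimal sketch: the function name and signature are hypothetical, and the explicit region attribute is passed in as an optional argument rather than read from timezone.xml.

```python
from typing import Optional


def time_zone_region(short_id: str, region: Optional[str] = None) -> Optional[str]:
    """Return the region associated with a short time zone identifier.

    `region` stands for the explicit region attribute, when one is present.
    """
    if region is not None:
        return region.upper()        # an explicit attribute always wins
    if len(short_id) == 5:
        return short_id[:2].upper()  # first two letters double as the region
    return None                      # other lengths carry no region


if __name__ == "__main__":
    print(time_zone_region("uslax"))        # from the first two letters
    print(time_zone_region("ancur", "CW"))  # explicit override
    print(time_zone_region("usnavajo"))     # length is not 5
```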

There is a special code "unk" for an Unknown or Invalid time zone. This can be expressed in the tz database style ID "Etc/Unknown", although it is not defined in the tz database.

Stability of Time Zone Identifiers

Although the short time zone identifiers are guaranteed to be stable, the preferred IDs in the tz database (such as those found in the zone.tab file) might change from time to time. For example, "Asia/Calcutta" was replaced with "Asia/Kolkata" and moved to the backward file in the tz database. Because CLDR contains locale data using a time zone ID from the tz database as the key, stability of the IDs is critical.

To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule is applied to the alias attribute in the <type> element for "tz": the first "long" ID is the CLDR canonical "long" time zone ID. In addition, the iana attribute specifies the preferred ID in the tz database if it differs from the CLDR canonical "long" ID.

For example:

<type name="inccu" description="Kolkata, India" alias="Asia/Calcutta Asia/Kolkata" iana="Asia/Kolkata"/>

The <type> element above defines the short time zone ID "inccu" (for use in the Unicode locale extension), the corresponding CLDR canonical "long" ID "Asia/Calcutta", and an alias "Asia/Kolkata". In the tz database, the preferred ID for this time zone is "Asia/Kolkata".
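The alias and iana rules can be applied mechanically to a <type> element. The sketch below parses the example element with Python's standard xml.etree.ElementTree module; the function name and the shape of the returned dictionary are hypothetical, and falling back to the canonical ID when iana is absent is this sketch's reading of "if it differs".

```python
import xml.etree.ElementTree as ET

# The example <type> element from the text.
TYPE_XML = (
    '<type name="inccu" description="Kolkata, India" '
    'alias="Asia/Calcutta Asia/Kolkata" iana="Asia/Kolkata"/>'
)


def describe_time_zone_type(xml_text: str) -> dict:
    """Extract the short ID, canonical long ID, aliases, and tz database ID."""
    element = ET.fromstring(xml_text)
    long_ids = element.get("alias", "").split()
    canonical = long_ids[0] if long_ids else None  # first long ID is canonical
    return {
        "short_id": element.get("name"),
        "canonical_long_id": canonical,
        "other_aliases": long_ids[1:],
        # iana is only present when it differs from the canonical long ID
        "iana_preferred": element.get("iana", canonical),
    }


if __name__ == "__main__":
    for field, value in describe_time_zone_type(TYPE_XML).items():
        print(field, "=", value)
```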

Links in the tz database

Not all TZDB links are in CLDR aliases.CLDR purposefully does not exactly match the Link structure in the TZDB.

  1. The links are maintained in the TZDB, and it would duplicate information that could fall out of sync (especially because the TZDB can be updated many times in a single month).
  2. The TZDB went though a change a few years ago where it dropped the mappings to countries (regions), whereas CLDR still maintains that distinction.
  3. Because there are several different timezones that all link together, that would make for a single long alias being an alias for several different short aliases.

CLDR doesn't alias across country boundaries because countries are useful for timezone selection. Even if, for example, Serbia and Croatia share the same rules, CLDR maintains the difference so that the user can either pick "Serbia time" or "Croatia time". The Croat is not forced to pick "Serbia time" (Europe/Belgrade), nor the Serb forced to pick "Croatia time" (Europe/Zagreb).

U Extension Data Files

The 'u' extension data is stored in multiple XML files located under the common/bcp47 directory in CLDR. Each file contains the locale extension key/type values and their backward compatibility mappings appropriate for a particular domain. For example, common/bcp47/collation.xml contains key/type values for collation, including optional collation parameters and valid type values for each key.

The 't' extension data is stored in common/bcp47/transform.xml.

<!ELEMENT keyword ( key* )>
<!ELEMENT key ( type* )>
<!ATTLIST key extension NMTOKEN #IMPLIED>
<!ATTLIST key name NMTOKEN #REQUIRED>
<!ATTLIST key description CDATA #IMPLIED>
<!ATTLIST key deprecated ( true | false ) "false">
<!ATTLIST key preferred NMTOKEN #IMPLIED>
<!ATTLIST key alias NMTOKEN #IMPLIED>
<!ATTLIST key valueType (single | multiple | incremental | any) #IMPLIED >
<!ATTLIST key since CDATA #IMPLIED>
<!ELEMENT type EMPTY>
<!ATTLIST type name NMTOKEN #REQUIRED>
<!ATTLIST type description CDATA #IMPLIED>
<!ATTLIST type deprecated ( true | false ) "false">
<!ATTLIST type preferred NMTOKEN #IMPLIED>
<!ATTLIST type alias CDATA #IMPLIED>
<!ATTLIST type since CDATA #IMPLIED>
<!ATTLIST type iana CDATA #IMPLIED >
<!ELEMENT attribute EMPTY>
<!ATTLIST attribute name NMTOKEN #REQUIRED>
<!ATTLIST attribute description CDATA #IMPLIED>
<!ATTLIST attribute deprecated ( true | false ) "false">
<!ATTLIST attribute preferred NMTOKEN #IMPLIED>
<!ATTLIST attribute since CDATA #IMPLIED>

The extension attribute in the <key> element specifies the BCP 47 language tag extension type. The default value of the extension attribute is "u" (Unicode locale extension). The <type> element is only applicable to the enclosing <key>.

In the Unicode locale extension 'u' and 't' data files, the common attributes for the <key>, <type> and <attribute> elements are as follows:

name

The key or type name used by the Unicode locale extension with the 'u' extension syntax or the 't' extension syntax. When alias below is absent, this name can also be used with the old style "@key=type" syntax.

Most type names are literal type names, which match exactly the same value. All of these have at least one lowercase letter, such as "buddhist". There are a small number of indirect type names, such as "RG_KEY_VALUE". These have no lowercase letters. The interpretation of each one is listed below.

CODEPOINTS

The type name "CODEPOINTS" is reserved for a variable representing Unicode code point(s). The syntax is:

EBNF
codepoints= codepoint (sep codepoint)?
codepoint= [0-9 A-F a-f]{4,6}

In addition, no codepoint may exceed 10FFFF. For example, "00A0", "300b", "10D40C" and "00C1-00E1" are valid, but "A0", "U060C" and "110000" are not.
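A validator for this syntax might look like the sketch below. Two assumptions are made: sep is taken to be a hyphen or an underscore, and any number of code points is accepted because the surrounding text sets no formal limit, whereas the grammar as printed allows at most two. The function name is hypothetical.

```python
import re

_CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")


def is_valid_codepoints(value: str) -> bool:
    """Check a CODEPOINTS value: hex code points of 4 to 6 digits, none above 10FFFF."""
    for part in re.split(r"[-_]", value):
        if not _CODEPOINT.fullmatch(part):
            return False           # wrong length or a non-hex character
        if int(part, 16) > 0x10FFFF:
            return False           # beyond the Unicode code space
    return True


for sample in ["00A0", "300b", "10D40C", "00C1-00E1", "A0", "U060C", "110000"]:
    print(sample, is_valid_codepoints(sample))
```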

In the current version of CLDR, the type "CODEPOINTS" is only used for the deprecated locale extension key "vt" (variableTop). The subtags forming the type for "vt" represent an arbitrary string of characters. There is no formal limit on the number of characters, although practically anything above 1 will be rare, and anything longer than 4 might be useless. Repetition is allowed, for example, 0061-0061 ("aa") is a valid type value for "vt", since the sequence may be a collating element. Order is vital: 0061-0062 ("ab") is different from 0062-0061 ("ba"). Note that for variableTop any character sequence must be a contraction which yields exactly one primary weight.

For example,

en-u-vt-00A4 : this indicates English, with any characters sorting at or below "¤" (at a primary level) considered Variable.

By default in UCA, variable characters are ignored in sorting at a primary, secondary, and tertiary level. But in CLDR, they are not ignorable by default. For more information, see Collation: Setting Options.

REORDER_CODE

The type name "REORDER_CODE" is reserved for reordering block names (e.g. "latn", "digit" and "others") defined in the Root Collation. The type "REORDER_CODE" is used for locale extension key "kr" (colReorder). The value of type for "kr" is represented by one or more reordering block names such as "latn-digit". For more information, see Collation: Collation Reordering.

RG_KEY_VALUE

The type name "RG_KEY_VALUE" is reserved for region codes in the format required by the "rg" key; this is a subdivision code with idStatus='unknown' or 'regular' from the idValidity data in common/validity/subdivision.xml.

SCRIPT_CODE

The type name "SCRIPT_CODE" is reserved for unicode_script_subtag values (e.g. "thai", "laoo"). The type "SCRIPT_CODE" is used for locale extension key "dx". The value of type for "dx" is represented by one or more SCRIPT_CODEs, such as "thai-laoo".

SUBDIVISION_CODE

The type name "SUBDIVISION_CODE" is reserved for subdivision codes in the format required by the "sd" key; this is a subdivision code from the idValidity data in common/validity/subdivision.xml, excluding those with idStatus='unknown'. Codes with idStatus='deprecated' should not be generated, and those with idStatus='private_use' are only to be used with prior agreement.

PRIVATE_USE

The type name "PRIVATE_USE" is reserved for private use types. A valid type value is composed of one or more subtags separated by hyphens and each subtag consists of three to eight ASCII alphanumeric characters. In the current version of CLDR, "PRIVATE_USE" is only used for transform extension "x0".

valueType

The valueType attribute indicates how many subtags are valid for a given key:

Value | Description
single | Either exactly one type value, or no type value (but only if the value of "true" would be valid). This is the default if no valueType attribute is present.
incremental | Multiple type values are allowed, but only if a prefix is also present, and the sequence is explicitly listed. Each successive type value indicates a refinement of its prefix. For example:
<key name="ca" description="Calendar algorithm key" valueType="incremental">
<type name="islamic" description="Islamic calendar"/>
<type name="islamic-umalqura" description="Islamic calendar, Umm al-Qura"/>
Thus ca-islamic-umalqura is valid. However, ca-gregory-japanese is not valid, because "gregory-japanese" is not listed as a type.
multiple | Multiple type values are allowed, but each may only occur once. For example:
<key name="kr" description="Collation reorder codes" valueType="multiple">
<type name="REORDER_CODE" …/>
any | Any number of type values are allowed, with none of the above restrictions. For example:
<key extension="t" name="x0" description="Private use transform type key." valueType="any">
<type name="PRIVATE_USE" …/>
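One possible reading of the four valueType rules is sketched below. It is not a reference implementation: the function name is hypothetical, and the listed set is assumed to hold the literal type names of the key (or an already expanded set of values when the key uses an indirect type name such as REORDER_CODE).

```python
from typing import Sequence, Set


def types_are_valid(value_type: str, listed: Set[str], subtags: Sequence[str]) -> bool:
    """Check the type subtags given for one key against that key's valueType."""
    subtags = list(subtags)
    if value_type == "single":
        if not subtags:
            # No type value is allowed only when "true" would be valid.
            return "true" in listed
        return len(subtags) == 1 and subtags[0] in listed
    if value_type == "incremental":
        # Every prefix, including the full sequence, must be listed explicitly.
        return bool(subtags) and all(
            "-".join(subtags[:n]) in listed for n in range(1, len(subtags) + 1)
        )
    if value_type == "multiple":
        # Several values are fine, but none may repeat.
        return (
            bool(subtags)
            and len(set(subtags)) == len(subtags)
            and all(s in listed for s in subtags)
        )
    if value_type == "any":
        return all(s in listed for s in subtags)
    raise ValueError(f"unknown valueType: {value_type}")


calendars = {"gregory", "japanese", "islamic", "islamic-umalqura"}
print(types_are_valid("incremental", calendars, ["islamic", "umalqura"]))
print(types_are_valid("incremental", calendars, ["gregory", "japanese"]))
```

The two calls mirror the ca-islamic-umalqura and ca-gregory-japanese cases from the table.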

description

The description of the key, type or attribute element. There is also some informative text about certain keys and types in the Key And Type Definitions.

deprecated

The deprecation status of the key, type or attribute element. The value "true" indicates the element is deprecated and no longer used in the version of CLDR. The default value is "false".

preferred

The preferred value of the deprecated key, type or attribute element. When a key, type or attribute element is deprecated, this attribute is used for specifying a new canonical form if available.

alias (Not applicable to <attribute>)

The BCP 47 form is the canonical form, and recommended. Other aliases are included only for backwards compatibility.

Example:

<type name="phonebk" alias="phonebook" description="Phonebook style ordering (such as in German)"/>

The preferred term, and the only one to be used in BCP 47, is the name: in this example, "phonebk".

The alias is a key or type name used by Unicode locale extensions with the old "@key=type" syntax. The attribute value for type may contain multiple names delimited by ASCII space characters. Of those aliases, the first name is the preferred value.

since

The version of CLDR in which this key or type was introduced. Absence of this attribute value implies the key or type was available in CLDR 1.7.2.

Note: There are no values defined for the locale extension attribute in the current CLDR release.

For example,

<key name="co" alias="collation" description="Collation type key">
  <type name="pinyin" description="Pinyin ordering for Latin and for CJK characters (used in Chinese)"/>
</key>
<key name="ka" alias="colAlternate" description="Collation parameter key for alternate handling">
  <type name="noignore" alias="non-ignorable" description="Variable collation elements are not reset to ignorable"/>
  <type name="shifted" description="Variable collation elements are reset to zero at levels one through three"/>
</key>
<key name="tz" alias="timezone">
  ...
  <type name="aumel" alias="Australia/Melbourne Australia/Victoria" description="Melbourne, Australia"/>
  <type name="aumqi" alias="Antarctica/Macquarie" description="Macquarie Island Station, Macquarie Island" since="1.8.1"/>
  ...
</key>

The data above indicates:

It is strongly recommended that all API methods accept all possible aliases for keywords and types, but generate the canonical form. For example, "ar-u-ca-islamicc" would be equivalent to "ar-u-ca-islamic-civil" on input, but the latter should be output. The one exception is where an alias would only be well-formed with the old syntax, such as "gregorian" (for "gregory").
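The "accept every alias, emit the canonical form" recommendation could be implemented with lookup tables built from the data files. The sketch below uses a small inline sample instead of the real files; the function names and the case-insensitive matching are assumptions.

```python
import xml.etree.ElementTree as ET

# A trimmed sample in the shape of the bcp47 data files (not the real data).
SAMPLE = """<keyword>
  <key name="co" alias="collation">
    <type name="phonebk" alias="phonebook"/>
    <type name="pinyin"/>
  </key>
  <key name="ka" alias="colAlternate">
    <type name="noignore" alias="non-ignorable"/>
    <type name="shifted"/>
  </key>
</keyword>"""


def build_canonical_maps(xml_text):
    """Map every spelling (name or alias) of keys and types to the canonical name."""
    keys, types = {}, {}
    for key in ET.fromstring(xml_text).iter("key"):
        name = key.get("name")
        for spelling in [name] + key.get("alias", "").split():
            keys[spelling.lower()] = name
        for t in key.iter("type"):
            for spelling in [t.get("name")] + t.get("alias", "").split():
                types[(name, spelling.lower())] = t.get("name")
    return keys, types


def canonicalize(keys, types, key, type_):
    """Accept a key/type pair in any known spelling and return the BCP 47 form."""
    canonical_key = keys[key.lower()]
    return canonical_key, types[(canonical_key, type_.lower())]


keys, types = build_canonical_maps(SAMPLE)
print(canonicalize(keys, types, "collation", "phonebook"))
print(canonicalize(keys, types, "colAlternate", "non-ignorable"))
```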

In the Unicode locale extension 'u' data files, the <type> element has the optional attribute below:

iana

This attribute is used by tz types for specifying the preferred zone ID in the IANA time zone database.

Subdivision Codes

The subdivision codes designate a subdivision of a country or region. They are called various names, such as a state in the United States, or a province in Canada. The codes in CLDR are based on ISO 3166-2 subdivision codes. The ISO codes have a region code followed by a hyphen, then a suffix consisting of 1..3 ASCII letters or digits.

The CLDR codes are designed to work in a unicode_locale_id (BCP 47), and are thus all lowercase, with no hyphen. For example, the following are valid, and mean “English as used in California, USA”.

CLDR has additional subdivision codes. These may start with a 3-digit region code or use a suffix of 4 ASCII letters or digits, so they will not collide with the ISO codes. Subdivision codes for unknown values are the region code plus "zzzz", such as "uszzzz" for an unknown subdivision of the US. Other codes may be added for stability.
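The relationship between the ISO form and the CLDR form is mechanical, as the following sketch shows. The helper names are hypothetical, and validity still has to be checked against the CLDR data.

```python
def iso_to_cldr_subdivision(iso_code: str) -> str:
    """Turn an ISO 3166-2 code such as "US-CA" into the CLDR form "usca"."""
    region, _, suffix = iso_code.partition("-")
    return (region + suffix).lower()


def unknown_subdivision(region: str) -> str:
    """Build the code for an unknown subdivision of a region, e.g. "uszzzz"."""
    return region.lower() + "zzzz"


print(iso_to_cldr_subdivision("US-CA"))   # usca
print(iso_to_cldr_subdivision("GB-SCT"))  # gbsct
print(unknown_subdivision("US"))          # uszzzz
```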

ISO 3166-2 subdivision codes may change over time, because each country may change the codes it supplies to ISO. Typically this does not cause problems for use in locale identifiers. Newly added ISO 3166-2 codes are added to CLDR in each release. If an ISO 3166-2 code is removed, it remains valid in CLDR, though marked as deprecated. If an ISO 3166-2 code is replaced by a new code, an alias is added to CLDR that maps the old code to the new code.

In some unusual cases, countries have been known to reuse codes, giving a code a very different meaning from what it had. Needless to say, this is ill-advised: consider what would happen if the code US-TX (currently Texas) were swapped with the code US-TN (currently Tennessee). If an ISO 3166-2 code is reused, CLDR can solve the problem for the purpose of locale identification by defining new equivalent codes using 4-character suffixes. These codes will never collide with the ISO 3166-2 codes, because ISO 3166-2 limits the suffix length of its codes to 3 characters.

In late 2025, the CLDR Technical Committee became aware of radical reuse of ISO 3166-2 subdivision codes: In November 2020 almost all subdivisions of Iran were renumbered. For example, IR-25, which used to represent the Yazd province, now represents the Qom province. This means that the locale identifier fa-u-sd-ir25, which used to mean Yazdi Persian, now means Qomi Persian. In order to provide stable identifiers, CLDR is planning to add 4-character suffixes for provinces of Iran in version 49.

Note that:

Validity

A unicode_subdivision_id is only valid when it is present in the subdivision.xml file as described in Validity Data. The data is in a compressed form, and thus needs to be expanded before such a test is made.

Examples:

If a unicode_locale_id contains both a unicode_region_subtag and a unicode_subdivision_id, it is only valid if the unicode_subdivision_id starts with the unicode_region_subtag (case-insensitively).

It is recommended that a unicode_locale_id contain a unicode_region_subtag if it contains a unicode_subdivision_id and the region would not be added by adding likely subtags. That produces better behavior if the unicode_subdivision_id is ignored by an implementation or if the language tag is truncated.
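The region consistency rule can be checked with a simplified scan of the identifier, as sketched below. The function names are hypothetical, the parsing is deliberately rough (it assumes a well-formed identifier whose region subtag is two letters or three digits), and the test covers only this one rule, not full validity.

```python
import re


def region_and_subdivision(locale_id: str):
    """Extract the region subtag and the value of the "sd" key, if present."""
    subtags = re.split(r"[-_]", locale_id.lower())
    region, subdivision = None, None
    i = 1
    # Scan the language identifier part, which ends at the first singleton.
    while i < len(subtags) and len(subtags[i]) > 1:
        if re.fullmatch(r"[a-z]{2}|[0-9]{3}", subtags[i]):
            region = subtags[i]
        i += 1
    if "u" in subtags[i:]:
        start = subtags.index("u", i) + 1
        for j in range(start, len(subtags) - 1):
            if len(subtags[j]) == 1:
                break                      # the next extension begins here
            if subtags[j] == "sd":
                subdivision = subtags[j + 1]
                break
    return region, subdivision


def subdivision_matches_region(locale_id: str) -> bool:
    """True unless both subtags are present and the subdivision has a different region."""
    region, subdivision = region_and_subdivision(locale_id)
    if region is None or subdivision is None:
        return True
    return subdivision.startswith(region)


print(subdivision_matches_region("en-US-u-sd-usca"))
print(subdivision_matches_region("en-CA-u-sd-usca"))
```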

Examples:

In version 28.0, the subdivisions in the validity files used the ISO format, uppercase with a hyphen separating two components, instead of the BCP 47 format.

Unicode BCP 47 T Extension

The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [RFC6067] and extension 't' for transformed content [RFC6497]. The Unicode BCP 47 extension data defines the complete list of valid subtags. While the title of the RFC is “Transformed Content”, the abstract makes it clear that the scope is broader than the term "transformed" might indicate to a casual reader: “including content that has been transliterated, transcribed, or translated, or in some other way influenced by the source. It also provides for additional information used for identification.”

The -t- Extension. The syntax of 't' extension subtags is defined by the rule transformed_extensions in Unicode locale identifier, except that the subtag separator sep must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag. For information about the registration process, meaning, and usage of the 't' extension, see [RFC6497].

These subtags are all in lowercase (that is the canonical casing for these subtags); however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions are alphanumeric characters in length of two to eight that meet the rule extension in [BCP47].

The following keys are defined for the -t- extension. Well-formed values match tvalue.

Keys | Description | Valid Values in latest release
m0 | Transform extension mechanism: to reference an authority or rules for a type of transformation | transform.xml
s0, d0 | Transform source/destination: for non-languages/scripts, such as fullwidth-halfwidth conversion. | transform-destination.xml
i0 | Input Method Engine transform: Used to indicate an input method transformation, such as one used by a client-side input method. The first subfield in a sequence would typically be a 'platform' or vendor designation. | transform_ime.xml
k0 | Keyboard transform: Used to indicate a keyboard transformation, such as one used by a client-side virtual keyboard. The first subfield in a sequence would typically be a 'platform' designation, representing the platform that the keyboard is intended for. The keyboard might or might not correspond to a keyboard mapping shipped by the vendor for the platform. One or more subsequent fields may occur, but are only added where needed to distinguish from others. | transform_keyboard.xml
t0 | Machine Translation: Used to indicate content that has been machine translated, or a request for a particular type of machine translation of content. The first subfield in a sequence would typically be a 'platform' or vendor designation. | transform_mt.xml
h0 | Hybrid Locale Identifiers: h0 with the value 'hybrid' indicates that the -t- value is a language that is mixed into the main language tag to form a hybrid. For more information, and examples, see Hybrid Locale Identifiers. | transform_hybrid.xml
x0 | Private use transform | transform_private_use.xml
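To illustrate how the keys in this table appear in a tag, the sketch below splits a -t- extension into its source language part and its key/value fields. It rests on assumptions not stated in this section: a field key is taken to be one letter followed by one digit, any other subtag before the first key belongs to the source language, and the first single-letter "t" subtag starts the extension. The function name is hypothetical and no validation against the data files is done.

```python
import re


def parse_t_extension(tag: str):
    """Return (source language, {key: value}) for the -t- extension, or None."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return None
    source, fields, key = [], {}, None
    for subtag in subtags[subtags.index("t") + 1:]:
        if len(subtag) == 1:
            break                          # a new extension singleton starts
        if re.fullmatch(r"[a-z][0-9]", subtag):
            key = subtag                   # field separator such as m0 or h0
            fields[key] = []
        elif key is None:
            source.append(subtag)          # still inside the source language
        else:
            fields[key].append(subtag)
    return "-".join(source), {k: "-".join(v) for k, v in fields.items()}


print(parse_t_extension("und-Cyrl-t-und-latn-m0-ungegn-2007"))
print(parse_t_extension("hi-t-en-h0-hybrid"))
```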

T Extension Data Files

The overall structure of the data files is similar to the U Extension, with the following exceptions.

In the transformed content 't' data file, the name attribute in a <key> element defines a valid field separator subtag. The name attribute in an enclosed <type> element defines a valid field subtag for the field separator subtag. For example:

<key extension="t" name="m0" description="Transform extension mechanism">
    <type name="ungegn" description="United Nations Group of Experts on Geographical Names" since="21"/>
</key>

The data above indicates:

The attributes are:

name

The name of the mechanism, limited to 3-8 characters (or sequences of them). Any indirect type names are listed in 3.6.4 U Extension Data Files.

description

A description of the name, with all and only that information necessary to distinguish one name from others with which it might be confused. Descriptions are not intended to provide general background information.

since

Indicates the first version of CLDR where the name appears. (Required for new items.)

alias

Alternative name, not limited in number of characters. Aliases are intended for compatibility, not to provide all possible alternate names or designations. (Optional)

For information about the registration process, meaning, and usage of the 't' extension, see [RFC6497].

Compatibility with Older Identifiers

LDML versions before 1.7.2 used a slightly different syntax for variant subtags and locale extensions. Implementations of LDML may provide backward-compatible identifier support as described in the following sections.

Old Locale Extension Syntax

The LDML 1.7 and older specifications used a different syntax for representing Unicode locale extensions. The previous definition of Unicode locale extensions had the following structure:

EBNF
old_unicode_locale_extensions= "@" old_key "=" old_type
(";" old_key "=" old_type)*

The new specification mandates keys to be two alphanumeric characters and types to be three to eight alphanumeric characters. As a result, new codes were assigned to all existing keys and some types. For example, a new key "co" replaced the previous key "collation", and a new type "phonebk" replaced the previous type "phonebook". However, the existing collation type "big5han" already satisfied the new requirement, so no new type code was assigned to the type. All new keys and types introduced after LDML 1.7 satisfy the new requirement, so they do not have aliases dedicated to the old syntax, except time zone types. The conversion between old types and new types can be done regardless of key, with one known exception (old type "traditional" is mapped to new type "trad" for collation and "traditio" for numbering system), and this relationship will be maintained in future versions unless otherwise noted.

The new specification introduced a new field, attribute, in addition to key/type pairs in the Unicode locale extension. When it is necessary to map a new Unicode locale identifier with an attribute field to a well-formed old locale identifier, a special key name attribute is used, with the value of the entire attribute subtags in the new identifier. For example, the new identifier ja-u-xxx-yyy-ca-japanese is mapped to the old identifier ja@attribute=xxx-yyy;calendar=japanese.
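A conversion from the new syntax to the old one, including the special attribute key, might be sketched as follows. The two lookup tables are small hypothetical excerpts of the alias data, the function name is an assumption, and the sketch ignores other extensions as well as keys given without a type.

```python
# Hypothetical excerpts: new key -> old key, and (new key, new type) -> old type.
KEY_TO_OLD = {"ca": "calendar", "co": "collation"}
TYPE_TO_OLD = {("co", "phonebk"): "phonebook", ("ca", "gregory"): "gregorian"}


def new_to_old(locale_id: str) -> str:
    """Convert an identifier with a -u- extension to the old "@key=type" syntax."""
    subtags = locale_id.split("-")
    if "u" not in subtags:
        return "_".join(subtags)
    u = subtags.index("u")
    base, rest = "_".join(subtags[:u]), subtags[u + 1:]
    attributes = []
    while rest and len(rest[0]) > 2:       # attributes precede the first key
        attributes.append(rest.pop(0))
    keywords = []
    if attributes:
        keywords.append("attribute=" + "-".join(attributes))
    while rest:
        key, values = rest.pop(0), []
        while rest and len(rest[0]) > 2:   # type subtags are 3 to 8 characters
            values.append(rest.pop(0))
        type_ = "-".join(values)
        keywords.append(KEY_TO_OLD.get(key, key) + "=" + TYPE_TO_OLD.get((key, type_), type_))
    return base + "@" + ";".join(keywords)


print(new_to_old("ja-u-xxx-yyy-ca-japanese"))
```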

The chart below shows some example mappings between the new syntax and the old syntax.

Table: Locale Extension Mappings
Old (LDML 1.7 or older) | New
de_DE@collation=phonebook | de_DE_u_co_phonebk
zh_Hant_TW@collation=big5han | zh_Hant_TW_u_co_big5han
th_TH@calendar=gregorian;numbers=thai | th_TH_u_ca_gregory_nu_thai
en_US_POSIX@timezone=America/Los_Angeles | en_US_u_tz_uslax_va_posix
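The mappings in the table can be reproduced with a small converter. In this sketch the alias tables contain only the entries needed for the four rows and are hypothetical excerpts of the real data; the function name is an assumption, and keys are emitted in alphabetical order to match the rows above.

```python
# Hypothetical excerpts of the key and type alias data.
KEY_ALIASES = {"collation": "co", "calendar": "ca", "numbers": "nu", "timezone": "tz"}
TYPE_ALIASES = {"phonebook": "phonebk", "gregorian": "gregory", "America/Los_Angeles": "uslax"}


def old_to_new(old_id: str) -> str:
    """Convert "lang_REGION[_POSIX]@key=type;..." to the new extension syntax."""
    base, _, keywords = old_id.partition("@")
    subtags = base.split("_")
    pairs = {}
    if subtags[-1] == "POSIX":
        subtags.pop()                      # the legacy variant becomes va-posix
        pairs["va"] = "posix"
    for item in filter(None, keywords.split(";")):
        key, _, type_ = item.partition("=")
        pairs[KEY_ALIASES.get(key, key)] = TYPE_ALIASES.get(type_, type_)
    if not pairs:
        return "_".join(subtags)
    extension = [part for key in sorted(pairs) for part in (key, pairs[key])]
    return "_".join(subtags + ["u"] + extension)


for old in ["de_DE@collation=phonebook", "en_US_POSIX@timezone=America/Los_Angeles"]:
    print(old, "->", old_to_new(old))
```

A complete converter would also need the key-dependent handling of the old type "traditional", which this sketch leaves out.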

Where the old API is supplied the BCP 47 language code, or vice versa, the recommendation is to:

  1. Have all methods that take the old syntax also take the new syntax, interpreted correctly. For example, "zh-TW-u-co-pinyin" and "zh_TW@collation=pinyin" would both be interpreted as meaning the same.
  2. Have all methods (both for old and new syntax) accept all possible aliases for keywords and types. For example, "ar-u-ca-islamicc" would be equivalent to "ar-u-ca-islamic-civil".
    • The one exception is where an alias would only be well-formed with the old syntax, such as "gregorian" (for "gregory").
  3. Where an API cannot successfully accept the alternate syntax, throw an exception (or otherwise indicate an error) so that people can detect that they are using the wrong method (or wrong input).
  4. Provide a method that tests a purported locale ID string to determine its status:
    1. well-formed - syntactically correct
    2. valid - well-formed and only uses registered language subtags, extensions, keywords, types...
    3. canonical - valid and no deprecated codes or structure.

Legacy Variants

The old LDML specification allowed codes other than registered [BCP47] variant subtags to be used in Unicode language and locale identifiers for representing variations of locale data. Unicode locale identifiers including such variant codes can be converted to the new [BCP47]-compatible identifiers as described below:

Table: Legacy Variant Mappings
Variant Code | Description
AALAND | Åland, variant of "sv" Swedish used in Finland. Use sv_AX to indicate this.
BOKMAL | Bokmål, variant of "no" Norwegian. Use primary language subtag "nb" to indicate this.
NYNORSK | Nynorsk, variant of "no" Norwegian. Use primary language subtag "nn" to indicate this.
POSIX | POSIX variation of locale data. Use Unicode locale extension -u-va-posix to indicate this.
POLYTONI | Polytonic, variant of "el" Greek. Use [BCP47] variant subtag polyton to indicate this.
SAAHO | The Saaho variant of Afar. Use primary language subtag "ssy" to indicate this.

When converting to old syntax, the Unicode locale extension "-u-va-posix" should be converted to the "POSIX" variant, not to old extension syntax like "@va=posix". This is an exception: The other mappings above should not be reversed.

Examples:

👉 Note that the mapping between en_US_POSIX and en-US-u-va-posix is a conversion process, not a canonicalization process.
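The legacy variant table can be turned into a one-way converter along the following lines. This is a sketch under simplifying assumptions: the input has the form language_REGION_VARIANT (or language_VARIANT), carries no "@" keywords, and the function name is hypothetical.

```python
# Variants that are expressed by replacing the primary language subtag.
LANGUAGE_REPLACEMENTS = {"BOKMAL": "nb", "NYNORSK": "nn", "SAAHO": "ssy"}


def convert_legacy_variant(old_id: str) -> str:
    """Convert an identifier ending in a legacy variant to a BCP 47 style form."""
    subtags = old_id.split("_")
    variant = subtags[-1] if len(subtags) > 1 else ""
    head = subtags[:-1]
    if variant == "POSIX":
        return "-".join(head + ["u", "va", "posix"])
    if variant in LANGUAGE_REPLACEMENTS:
        return "-".join([LANGUAGE_REPLACEMENTS[variant]] + head[1:])
    if variant == "AALAND":
        return "-".join([head[0], "AX"])   # the region becomes AX
    if variant == "POLYTONI":
        return "-".join(head + ["polyton"])
    return "-".join(subtags)               # no legacy variant to convert


for old in ["en_US_POSIX", "no_NO_NYNORSK", "sv_FI_AALAND", "el_GR_POLYTONI"]:
    print(old, "->", convert_legacy_variant(old))
```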

Relation to OpenI18n

The locale id format generally follows the description in the OpenI18N Locale Naming Guideline [NamingGuideline], with some enhancements. The main differences from those guidelines are that the locale id:

  1. does not include a charset (since the data in LDML format always provides a representation of all Unicode characters. The repository is stored in UTF-8, although that can be transcoded to other encodings as well.)
  2. adds the ability to have a variant, as in Java
  3. adds the ability to discriminate the written language by script (or script variant).
  4. is a superset of [BCP47] codes.

Transmitting Locale Information

In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing the so-called JIT localization is made up of two parts:

  1. Store and transmit neutral-format data wherever possible.
    • Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely) called binary data, even though it actually could be represented in many different ways, including a textual representation such as in XML.
    • Such data should use accepted standards where possible, such as for currency codes.
    • Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
  2. Localize that data as "close" to the end-user as possible.

There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.

Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original translated message text is available (which it may not be).

Moreover, the closer we are to the end user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then the formatting can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end user, the less we need to ship all of the user's preferences around to all the places that localization could possibly need to be done.

Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are appropriate for doing the localization. Thus information such as a locale code or time zone needs to be communicated between different components.

Message Formatting and Exceptions

Windows (FormatMessage, String.Format), Java (MessageFormat) and ICU (MessageFormat, umsg) all provide methods of formatting variables (dates, times, etc.) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues.

There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.

More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (for example, datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT localization.

In addition, exceptions are often caught at a higher level; they do not end up being displayed to any end-user at all. By avoiding the localization at the throw site, one avoids the cost of doing formatting when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.
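The pattern described in this section, an exception that carries a language-neutral message ID plus its arguments and is formatted only where it is displayed, can be sketched as follows. The class, the message ID and the two catalogs are hypothetical; a production system would use a real message formatting library rather than str.format.

```python
class LocalizableError(Exception):
    """Carries a language-neutral message ID plus the raw arguments."""

    def __init__(self, message_id, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


# Hypothetical message catalogs, keyed by locale and then by message ID.
CATALOGS = {
    "en": {"disk.full": "The disk {volume} is full."},
    "fr": {"disk.full": "Le disque {volume} est plein."},
}


def save_report(volume):
    # Deep inside a component: no locale is known here, and none is needed.
    raise LocalizableError("disk.full", volume=volume)


def render(error, locale):
    # Close to the end user: pick the catalog and format only now.
    return CATALOGS[locale][error.message_id].format(**error.arguments)


try:
    save_report("D:")
except LocalizableError as err:
    print(render(err, "fr"))
```

If the exception is caught and handled without ever being shown, render is never called and no formatting cost is paid.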

Unicode Language and Locale IDs

People have very slippery notions of what distinguishes a language code versus a locale code. The problem is that both are somewhat nebulous concepts.

In practice, many people use [BCP47] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [BCP47] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives a [BCP47] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" versus "_" (for example, zh-TW for language code, zh_TW for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "_" as equivalent when interpreting either one on input.
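Treating the two separators as equivalent on input takes very little code. The helper names below are hypothetical, and the sketch only normalizes separators and case before comparing; it does not canonicalize the identifier.

```python
def normalize_separators(code: str) -> str:
    """Fold "_" into "-" and ignore case so both styles compare equal."""
    return code.replace("_", "-").lower()


def same_identifier(a: str, b: str) -> bool:
    return normalize_separators(a) == normalize_separators(b)


print(same_identifier("zh-TW", "zh_TW"))  # True
print(same_identifier("zh-TW", "zh-CN"))  # False
```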

Another reason for the conflation of these codes is that very little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really does not make much sense. If people see the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format (differences which can be important), but those are different in kind than other language differences between regions.

As far as we are concerned, as a completely practical matter, two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this is not the principle used in [ISO639], which has the fairly unproductive notion (for data interchange) that only spoken language matters (it is also not completely consistent about this, however).

[BCP47] can express a difference if the use of written languages happens to correspond to region boundaries expressed as [ISO3166] region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [ISO3166] codes. These written languages include simplified and traditional Chinese (both used in Hong Kong S.A.R.); Serbian in Latin script; Azerbaijani in Arabic script, and so on.

Notice also that currency codes are different than currency localizations. The currency localizations should largely be in the language-based resource bundles, not in the territory-based resource bundles. Thus, the resource bundle en contains the localized mappings in English for a range of different currency codes: USD → US$, RUR → Rub, AUD → $A and so on. Of course, some currency symbols are used for more than one currency, and in such cases specializations appear in the territory-based bundles. Continuing the example, en_US would have USD → $, while en_AU would have AUD → $. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's territory to guess at the currency. For some informal discussion of this, see JIT Localization.)
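The split between language-based and territory-based bundles amounts to a lookup with fallback, sketched below with the mappings from the paragraph above. The bundle layout and the function name are assumptions, and the truncation fallback is a simplification of real locale inheritance.

```python
# Bundles holding only the mappings mentioned in the text.
BUNDLES = {
    "en": {"USD": "US$", "RUR": "Rub", "AUD": "$A"},
    "en_US": {"USD": "$"},
    "en_AU": {"AUD": "$"},
}


def currency_symbol(locale: str, code: str) -> str:
    """Look up a symbol in the territory bundle first, then in the language bundle."""
    while True:
        symbol = BUNDLES.get(locale, {}).get(code)
        if symbol is not None:
            return symbol
        if "_" not in locale:
            return code                    # fall back to the bare currency code
        locale = locale.rsplit("_", 1)[0]  # en_US -> en


print(currency_symbol("en_US", "USD"))  # $
print(currency_symbol("en_US", "AUD"))  # $A
print(currency_symbol("en_AU", "USD"))  # US$
```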

Written Language

Criteria for what makes a written language should be purely pragmatic; what would copy-editors say? If one gave them text like the following, they would respond that it is far from acceptable English for publication, and ask for it to be redone:

  A. "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

So one would change it to either B or C below, depending on which orthographic variant of English was the target for the publication:

  B. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
  C. "Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."

Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first versus last name sorting in the list, but clearly the first list was not acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for example, is not.

Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing unfamiliar date or number formats on the user as well.

Hybrid Locale Identifiers

Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. These are commonly referred to with portmanteau words such as Franglais, Spanglish or Denglish. Hybrid locales do not refer to text that simply contains two languages: a book of parallel text containing English and French, such as the following, is not Franglais:

On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg…

Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock, revint précipitamment vers sa petite maison située au numéro 19 de Königstrasse, l’une des plus anciennes rues du vieux quartier de Hambourg…

While text in a document can be tagged as partly in one language and partly in another, that is not the same as having a hybrid locale. There is a difference between having a Spanglish document, and a Spanish document that has some passages quoted in English. Fine-grained tagging doesn't handle grammatical combinations like Tanglish “Enna matteru?” (What’s the matter?), which is neither standard Tamil nor standard English. More importantly, it doesn’t work for the very common use case for a unicode_locale_id: locale selection.

To communicate requests for localized content and internationalization services, locales are used. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc.). To allow an application to support Spanglish or Hinglish locale selection, unicode_locale_ids can represent hybrid locales using the T Extension key-value 'h0-hybrid'. (For more information on the T extension, see Unicode BCP 47 T Extension.)

However, if users typically expect their language in a non-default script to contain a significant amount of text due to lexical borrowing, then the -t- and hybrid subtags may be omitted. For example, because Romanized Hindi typically contains a significant amount of English text, ‘hi-Latn’ can be used instead of ‘hi-Latn-t-en-h0-hybrid’. This tends to work better in implementations that don't yet handle the -t- extension.

Examples:

Locale ID | Base script | Hybrid name | Description
--- | --- | --- | ---
hi-t-en-h0-hybrid | Deva | Hinglish | Hindi-English hybrid where the script is Devanagari*
hi-Latn-t-en-h0-hybrid | Latin | Hinglish | Hindi-English hybrid where the script is Latin*
hi-Latn | Latin | Hinglish | Hindi written in Latin script; in practice usually a hybrid with English
ta-t-en-h0-hybrid | Tamil | Tanglish | Tamil-English hybrid where the script is Tamil*
... | | |
en-t-hi-h0-hybrid | Latin | Hinglish | English-Hindi hybrid where the script is Latin*
en-t-zh-h0-hybrid | Latin | Chinglish | English-Chinese hybrid where the script is Latin*
... | | |

* When used as a request for international services (such as date formatting), the request is for everything to be in the base script if possible. When used to tag arbitrary content on a coarse level, the expectation is that it be the predominant script — that is, there may be certain passages or phrases that are in the other script but are not tagged on a fine-grained level.

Note: The unicode_language_id should be the language used as the ‘scaffold’: for the fallback locale for internationalization services, typically used for more of the core vocabulary/structure in the content. Thus where Hindi is the scaffold, Hinglish should be represented as hi-t-en-h0-hybrid (when written in Devanagari script) or hi-Latn-t-en-h0-hybrid (when written in Latin characters). Where English is the scaffold, Hinglish should be represented as en-t-hi-h0-hybrid (or possibly en-Deva-t-hi-h0-hybrid).

The value of -t- is a full unicode_language_id, and can contain a subtag for the region where it is important to include it, as in the following. The value can also include the script, although that is not normally included: the only instance where it should be is where the content of the source text varies by script. So because zh-Hant has different vocabulary and expressions, it could make sense to have en-t-zh-hant to make that distinction.

Note: The default script for the language is computed without reference to the hybrid subtags. Thus the default script for 'ru' is “Cyrl”, no matter what the source is in the -t- tag.

Locale ID | Base script | Hybrid name | Description
--- | --- | --- | ---
ru-t-en-h0-hybrid | Cyrillic | Runglish | Russian with an admixture of American English
ru-t-en-gb-h0-hybrid | Cyrillic | Runglish | Russian with an admixture of British English
ru-Latn-t-en-gb-h0-hybrid | Latin | Runglish | Russian with an admixture of British English
en-t-zh-h0-hybrid | Latin | Chinglish | American English with an admixture of Chinese (Simplified Mandarin Chinese)
en-t-zh-hant-h0-hybrid | Latin | Chinglish | American English with an admixture of Chinese (Traditional Mandarin Chinese)

Should there ever be strong need for hybrids of more than two languages or for other purposes such as hybrid languages as the source of translated content, additional structure could be added.
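As an illustration of the identifiers in the tables above, the following is a minimal Python sketch that splits a locale ID into its base part, the -t- source language, and a flag for 'h0-hybrid'. The function name `parse_hybrid` and the returned dictionary layout are hypothetical, and the parsing is deliberately simplified: it assumes a well-formed ID and that T extension field keys consist of one letter followed by one digit.

```python
import re

def parse_hybrid(locale_id):
    """Split a locale ID into base, -t- source language, and hybrid flag.

    Simplified sketch: no validation, and other extensions are ignored.
    """
    subtags = locale_id.split("-")
    lowered = [s.lower() for s in subtags]
    if "t" not in lowered:
        return {"base": locale_id, "source": None, "hybrid": False}
    t_index = lowered.index("t")
    base = "-".join(subtags[:t_index])
    rest = lowered[t_index + 1:]

    # Subtags before the first field key (letter + digit) form the source language.
    source = []
    while rest and len(rest[0]) > 1 and not re.fullmatch(r"[a-z][0-9]", rest[0]):
        source.append(rest.pop(0))

    # Remaining subtags are key/value fields, up to the next singleton.
    fields, key = {}, None
    for subtag in rest:
        if len(subtag) == 1:
            break
        if re.fullmatch(r"[a-z][0-9]", subtag):
            key = subtag
            fields[key] = []
        elif key is not None:
            fields[key].append(subtag)

    return {
        "base": base,
        "source": "-".join(source) or None,
        "hybrid": fields.get("h0") == ["hybrid"],
    }

print(parse_hybrid("hi-Latn-t-en-h0-hybrid"))
print(parse_hybrid("ru-t-en-gb-h0-hybrid"))
```

A plain ID such as hi-Latn yields a false hybrid flag, even though in practice such content is usually a hybrid with English.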

Validity Data

<!ELEMENT idValidity (id*) >
<!ELEMENT id ( #PCDATA ) >
<!ATTLIST id type NMTOKEN #REQUIRED >
<!ATTLIST id idStatus NMTOKEN #REQUIRED >

The directory common/validity contains machine-readable data for validating the language, region, script, and variant subtags, as well as currency, subdivisions and measure units. Each file contains a number of subtags with the following idStatus values:

The list of subtags for each idStatus uses a compact format as a space-delimited list of StringRanges, as defined in Section String Range. The separator for each StringRange is a "~".
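The following Python sketch shows how such a space-delimited list might be expanded. The expansion rule itself is defined in the String Range section, outside this excerpt; the sketch assumes that the string after "~" replaces the same number of trailing code points of the string before it, and that each position ranges independently. The function names are hypothetical.

```python
from itertools import product

def expand_string_range(item, sep="~"):
    """Expand one StringRange such as 'aa~c' into a list of strings.

    Assumption: the end string overlays the tail of the start string,
    and each code point position is iterated independently.
    """
    if sep not in item:
        return [item]
    start, end = item.split(sep)
    if not 0 < len(end) <= len(start):
        raise ValueError(f"invalid range: {item!r}")
    cut = len(start) - len(end)
    prefix, tail = start[:cut], start[cut:]
    ranges = []
    for low, high in zip(tail, end):
        if ord(low) > ord(high):
            raise ValueError(f"descending range: {item!r}")
        ranges.append([chr(c) for c in range(ord(low), ord(high) + 1)])
    return [prefix + "".join(combo) for combo in product(*ranges)]

def expand_id_list(text):
    """Expand the whitespace-delimited content of one <id> element."""
    result = []
    for item in text.split():
        result.extend(expand_string_range(item))
    return result

print(expand_id_list("aa~c ae~g und"))
```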

Each measure unit is a sequence of subtags, such as “angle-arc-minute”. The first subtag provides a general “category” of the unit.

In version 28.0, the subdivisions in the validity files used the ISO format, uppercase with a hyphen separating two components, instead of the BCP 47 format.

Locale Inheritance and Matching

The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is based on the [DUCET] (see Root Collation). Since English language collation has the same ordering as the root locale, the 'en' locale data does not need to supply any collation data, nor do the 'en_US', 'en_GB' or any of the various other locales that use English.

Given a particular locale id "en_US_someVariant", the default search chain for a particular resource is the following.

en_US_someVariant → en_US → en → root

The inheritance is often not simple truncation, as will be seen later in this section.

The default search chain is slightly different for multiple variants. In that case, the inheritance chain covers all combinations of variants, with the largest number of variants first, and otherwise in alphabetical order. For example, where the requested locale ID is en_GB_fonipa_scouse, the inheritance chain is as follows:

en_GB_fonipa_scouse
en_GB_scouse_fonipa // extra step, only needed if not canonical
en_GB_fonipa
en_GB_scouse // extra step
en_GB
en

If the data for the implementation performing the inheritance doesn't require canonical locale identifiers, then extra locale IDs need to be inserted in the chain. That is indicated in the example above, marked with "only needed if not canonical". These would include all combinations of variants that are not in canonical order, inserted in alphabetical order. Note that the order of multiple variants in canonical locale identifiers is alphabetical, as per 5. Canonicalizing Syntax in Annex C. LocaleId Canonicalization.
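The variant portion of this chain can be sketched in Python as follows. The function `variant_chain` is hypothetical; it covers only the steps that vary the variants and ends at the variant-free base, after which ordinary inheritance continues.

```python
from itertools import combinations, permutations

def variant_chain(base, variants, include_noncanonical=False):
    """List the lookup steps for a base locale with several variants.

    Steps with more variants come first; within one size, steps are in
    alphabetical order. Non-canonical orderings are optional extras.
    """
    ordered = sorted(variants)
    pick = permutations if include_noncanonical else combinations
    chain = []
    for size in range(len(ordered), 0, -1):
        for combo in sorted(pick(ordered, size)):
            chain.append("_".join((base,) + combo))
    chain.append(base)
    return chain

for step in variant_chain("en_GB", ["scouse", "fonipa"], include_noncanonical=True):
    print(step)
```

With `include_noncanonical=False` the en_GB_scouse_fonipa step is dropped, which corresponds to data that only uses canonical identifiers.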

If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.

Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data, while "en_IE" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance. At the script or region level, the "primary" child locale will be empty, since its parent will contain all of the appropriate resources for it. For more information see CLDR Information: Default Content.

Certain data items depend only on the region specified in a locale id (by a unicode_region_subtag or an “rg” Region Override key), and are obtained from supplemental data rather than through locale resources. For example:

(For more information on the specific items handled this way, see Territory-Based Preferences.) These items will be correct for the specified region regardless of whether a locale bundle actually exists with the same combination of language and region as in the locale id. For example, suppose data is requested for the locale id "fr_US" and there is no bundle for that combination. Data obtained via locale inheritance, such as currency patterns and currency symbols, will be obtained from the parent locale "fr". However, currency amounts would be formatted by default using US dollars, just displayed in the manner governed by the locale "fr". When a locale id does not specify a region, the region-specific items such as those above are obtained from the likely region for the locale (obtained via Likely Subtags).

For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see Inheritance vs Related Information.

Lookup

If a language has more than one script in customary modern use, then the CLDR file structure in common/main follows the following model:

lang
lang_script
lang_script_region
lang_region (aliases to lang_script_region based on likely subtags)

Bundle vs Item Lookup

There are actually two different kinds of inheritance fallback: resource bundle lookup and resource item lookup. For the former, a process is looking to find the first, best resource bundle it can; for the latter, it is fallback within bundles on individual items, like the translated name for the region "CN" in Breton.

These are closely related, but distinct, processes. They are illustrated in the table Lookup Differences, where "key" stands for zero or more key/type pairs. Logically speaking, when looking up an item for a given locale, you first do a resource bundle lookup to find the best bundle for the locale, then you do an inherited item lookup starting with that resource bundle.

The table Lookup Differences uses the naïve resource bundle lookup for illustration. More sophisticated systems will get far better results for resource bundle lookup if they use the algorithm described in Language Matching. That algorithm takes into account both the user’s desired locale(s) and the application’s supported locales, in order to get the best match.

If the naïve resource bundle lookup is used, the desired locale needs to be canonicalized using 4.3 Likely Subtags and the supplemental alias information, so that locales that CLDR considers identical are treated as such. Thus eng-Latn-GB should be mapped to en-GB, and cmn-TW mapped to zh-Hant-TW.

The initial bundle accessed during resource bundle lookup should not contain a script subtag unless, according to likely subtags, the script is required to disambiguate the locale. For example, zh-Hant-TW should start lookup at zh-TW (since zh-TW implies Hant), and de-Latn-LI should start at de-LI (since de implies Latn and de-LI does not have its own entry in likely subtags).

For the purposes of CLDR, everything with the <ldml> dtd is treated logically as if it is one resource bundle, even if the implementation separates data into separate physical resource bundles. For example, suppose that there is a main XML file for Nama (naq), but there are no <unit> elements for it because the units are all inherited from root. If the <unit> elements are separated into a separate data tree for modularity in the implementation, the Nama <unit> resource bundle would be empty. However, for purposes of resource-bundle lookup the resource bundle lookup still stops at naq.xml.
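The script rule above can be sketched in Python as follows. The `LIKELY` table is a tiny hand-picked excerpt used as a stand-in for the real likely subtags data, the lookup tries only language_region and then language (the full algorithm has more steps), and the function names are hypothetical.

```python
# Hypothetical excerpt; real implementations load the full likely subtags data.
LIKELY = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "de": "de_Latn_DE",
    "sr": "sr_Cyrl_RS",
}

def likely_script(language, region):
    """Return the script implied by language_region, else by language alone."""
    for key in (f"{language}_{region}", language):
        if key in LIKELY:
            return LIKELY[key].split("_")[1]
    return None

def lookup_start(language, script, region):
    """Drop the script subtag when likely subtags already imply it."""
    if script == likely_script(language, region):
        return f"{language}-{region}"
    return f"{language}-{script}-{region}"

print(lookup_start("zh", "Hant", "TW"))  # script implied by zh_TW
print(lookup_start("de", "Latn", "LI"))  # script implied by de
print(lookup_start("sr", "Latn", "RS"))  # script needed to disambiguate
```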

Table: Lookup Differences

Lookup Type | Example | Comments
--- | --- | ---
Resource bundle lookup | se-FI → se → default‑locale* → root | * The default-locale may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting the chain, resulting in: se-FI → se → fi → en-GB → en → root
Inherited item lookup | se-FI+key → se+key → root_alias*+key → root+key | * If there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. This can happen multiple times: se-FI+key → se+key → root_alias*+key → se-FI+key2 → se+key2 → root_alias*+key2 → root+key2

Both the resource bundle inheritance and the inherited item inheritance use the parentLocale data, where available, instead of simple truncation.
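The inherited item lookup, including the restart caused by a root alias, can be sketched in Python as follows. This is a simplification under stated assumptions: locale data is a plain dictionary keyed by locale and then by key, aliases only change the key (not the locale), and the parent is found by simple truncation rather than through parentLocale data. All names are hypothetical.

```python
def chain_without_root(locale):
    """Truncation chain such as se-FI, se (root is handled separately)."""
    chain = []
    while locale:
        chain.append(locale)
        locale = locale.rsplit("-", 1)[0] if "-" in locale else ""
    return chain

def lookup_item(data, root_aliases, locale, key, max_restarts=8):
    """Walk locale+key up to root; on a root alias, retry with the new key."""
    for _ in range(max_restarts):
        for candidate in chain_without_root(locale):
            if key in data.get(candidate, {}):
                return data[candidate][key]
        if key in root_aliases:
            key = root_aliases[key]  # restart from the requested locale
            continue
        return data.get("root", {}).get(key)
    raise RuntimeError("too many alias restarts")

data = {
    "se": {"months/gregorian": "se Gregorian months"},
    "root": {"months/gregorian": "root Gregorian months"},
}
root_aliases = {"months/buddhist": "months/gregorian"}
print(lookup_item(data, root_aliases, "se-FI", "months/buddhist"))
```

Note that the default locale never appears in this walk, in line with the distinction between bundle lookup and item lookup.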

The fallback is a bit different for these two cases; internal aliases and keys are not involved in the bundle lookup, and the default locale is not involved in the item lookup. If the default-locale were used in the resource-item lookup, then strange results would occur. For example, suppose that the default locale is Swedish, and there is a Nama locale but no specific inherited item for collation. If the default-locale were used in resource-item lookup, it would produce odd and unexpected results for Nama sorting.

The default locale is not even always used in resource bundle inheritance. For the following services, the fallback is always directly to the root locale rather than through default locale.

Thus if there is no Akan locale, for example, asking for a collation for Akan should produce the root collation, not the Swedish collation.

The inherited item lookup must remain stable, because the resources are built with a certain fallback in mind; changing the core fallback order can render the bundle structure incoherent.

Resource bundle lookup, on the other hand, is more flexible; changes in the view of the "best" match between the input request and the output bundle are better tolerated when they represent overall improvements for users. For more information, see A.1 Element fallback.

Where the LDML inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.

For a more complete description of how inheritance applies to data, and the use of keywords, see Inheritance.

The locale data does not contain general character properties that are derived from the Unicode Character Database [UAX44]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.

Warning: If a locale has a different script than its parent (for example, sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.

Empty Override: There is one special value reserved in LDML to indicate that a child locale is to have no value for a path, even if the parent locale has a value for that path. That value is "∅∅∅". For example, if there is no phrase for "two days ago" in a language, that can be indicated with:

<field type="day">
  <relative type="-2">∅∅∅</relative>
</field>
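A minimal Python sketch of how a resolver might honor the empty override is shown below. The data layout (a dictionary per locale, keyed by path) and the function name are assumptions for illustration only.

```python
NO_VALUE = "\u2205\u2205\u2205"  # the reserved value ∅∅∅

def resolve(path, locale_chain, data):
    """Return the first value found along the chain, or None.

    An explicit ∅∅∅ stops the search, so the parent value is not inherited.
    """
    for locale in locale_chain:
        value = data.get(locale, {}).get(path)
        if value == NO_VALUE:
            return None
        if value is not None:
            return value
    return None

# Hypothetical data: the child locale "xx_YY" has no phrase for "two days ago".
path = 'field[@type="day"]/relative[@type="-2"]'
data = {"xx": {path: "parent phrase"}, "xx_YY": {path: NO_VALUE}}
print(resolve(path, ["xx_YY", "xx", "root"], data))  # None
print(resolve(path, ["xx", "root"], data))           # parent phrase
```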

Lateral Inheritance

Lateral Inheritance is where resources are inherited from within the same locale, before inheriting from the parent. This is used for the following element@attribute instances:

Element@Attribute | Source | Context
--- | --- | ---
currency@pattern | currencyFormat | numberSystem = defaultNumberingSystem, unless otherwise specified*
 | | currencyFormatLength type=none, unless otherwise specified
 | | currencyFormat type="standard", unless otherwise specified
currency@decimal | symbols@decimal | numberSystem = defaultNumberingSystem, unless otherwise specified
currency@group | symbols@group | numberSystem = defaultNumberingSystem, unless otherwise specified

* The "unless otherwise specified" clause is for when an API or other context indicates a different choice, such as currencyFormat type="accounting".

For example, with /currency [@type="CVE"], the decimal symbol for almost all locales is the value from symbols/decimal, but for pt_CV it is explicitly <decimal>$</decimal>.

The following attributes use lateral inheritance for all elements with the DTD root = ldml, except where otherwise noted. The process is applied recursively.

Attribute | Fallback | Exception Elements
--- | --- | ---
alt | no alt attribute | none
case | "nominative" → ∅ | caseMinimalPairs
gender | default_gender(locale) → ∅ | genderMinimalPairs
count | plural_rules(locale, x) → "other" → ∅ | minDays, pluralMinimalPairs
ordinal | plural_rules(locale, x) → "other" → ∅ | ordinalMinimalPairs

The gender fallback is to neuter if the locale has a neuter gender, otherwise masculine. This may be extended in the future if necessary. See also Part 2, Grammatical Features.

For example, if there is no value for a path, and that path has a [@count="x"] attribute and value, then:

  1. If "x" is numeric, the path falls back to the path with [@count=«the plural rules category for x for that locale»], within the same locale.
    1. For example, [@count="0"] for English falls back to [@count="other"], while for French it falls back to [@count="one"].
  2. If "x" is anything but "other", it falls back to a path [@count="other"], within the same locale.
  3. If "x" is "other", it falls back to the path that is completely missing the count item, within the same locale.
  4. If there is no value for that path in the same locale, the same process is used for the original path in the parent locale.
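The steps above can be sketched in Python as follows. The `plural_category` function is a deliberately tiny stand-in for real plural rules (it only distinguishes "one" and "other" for small integers in English-like and French-like locales), and the data layout and function names are hypothetical.

```python
def plural_category(locale, n):
    """Toy plural rules for small integers; not the real CLDR rules."""
    language = locale.replace("_", "-").split("-")[0]
    if language == "fr":
        return "one" if n in (0, 1) else "other"
    return "one" if n == 1 else "other"

def lateral_count_chain(locale, x):
    """Count values tried within one locale; None means no count attribute."""
    steps = [x]
    if x.isdigit():                       # step 1: numeric -> plural category
        x = plural_category(locale, int(x))
        steps.append(x)
    if x != "other":                      # step 2: anything else -> "other"
        steps.append("other")
    steps.append(None)                    # step 3: "other" -> no count
    return steps

def lookup_count(data, locale_chain, base_path, x):
    """Step 4: repeat the lateral chain for the original x in each parent."""
    for locale in locale_chain:
        for count in lateral_count_chain(locale, x):
            path = base_path if count is None else f'{base_path}[@count="{count}"]'
            if path in data.get(locale, {}):
                return data[locale][path]
    return None

print(lateral_count_chain("en", "0"))
print(lateral_count_chain("fr", "0"))
```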

A path may have multiple attributes with lateral inheritance. In such a case, all of the combinations are tried, and in the order supplied above. For example (this is an extreme case):

/compoundUnitPattern1[@count="few"][@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@count="few"][@gender="feminine"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"] →
/compoundUnitPattern1[@count="few"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@case="nominative"] →
/compoundUnitPattern1[@count="few"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"] →
/compoundUnitPattern1[@count="other"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@case="nominative"] →
/compoundUnitPattern1[@count="other"] →
/compoundUnitPattern1[@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@gender="feminine"] →
/compoundUnitPattern1[@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@gender="neuter"] →
/compoundUnitPattern1[@case="accusative"] →
/compoundUnitPattern1[@case="nominative"] →
/compoundUnitPattern1
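The chain above has a regular structure: count varies slowest and case varies fastest. A short Python sketch that reproduces it is given below. The function names are hypothetical, the default gender is assumed to be neuter as in the example, and numeric count values are left out for brevity.

```python
from itertools import product

def fallbacks(value, default):
    """Value, then its default (if different), then None for 'attribute absent'."""
    steps = [value]
    if default != value:
        steps.append(default)
    steps.append(None)
    return steps

def lateral_paths(base, count, gender, case, default_gender="neuter"):
    """Enumerate every combination, with the last attribute varying fastest."""
    names = ["count", "gender", "case"]
    axes = [
        fallbacks(count, "other"),
        fallbacks(gender, default_gender),
        fallbacks(case, "nominative"),
    ]
    paths = []
    for combo in product(*axes):
        attrs = "".join(
            f'[@{name}="{value}"]'
            for name, value in zip(names, combo)
            if value is not None
        )
        paths.append(base + attrs)
    return paths

paths = lateral_paths("/compoundUnitPattern1", "few", "feminine", "accusative")
print(len(paths))
print(" →\n".join(paths[:4]))
```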

Examples:

Table: Count Fallback: normal

Locale | Path
--- | ---
fr-CA | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
fr-CA | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
fr | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
fr | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]
root | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]
root | //ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]

Note that there may also be an alias in root that changes the path and starts again from the requested locale, such as:

<unitLength type="narrow">
   <alias source="locale" path="../unitLength[@type='short']"/>
</unitLength>

Table: Count Fallback: currency

Locale | Path
--- | ---
fr-CA | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
fr-CA | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
fr-CA | //ldml/numbers/currencies/currency[@type="CAD"]/displayName
fr | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
fr | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
fr | //ldml/numbers/currencies/currency[@type="CAD"]/displayName
root | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]
root | //ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]
root | //ldml/numbers/currencies/currency[@type="CAD"]/displayName

Inheritance Marker

There is a special Inheritance Marker used in the main repository, which has the value ↑↑↑. For example:

    <language type="ab">↑↑↑</language>

It is created during data submission to record that the inherited value has been verified for the current locale and path. For example, the above was used in de_CH to indicate that the following was not only correct for de, but also for de_CH.

    <language type="ab">Abchasisch</language>

It is not needed or used in the released data, because conformant implementations produce the inherited value whether the element is present with a value of ↑↑↑, or is completely absent.
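The equivalence between a ↑↑↑ value and an absent element can be sketched in Python as follows. The data layout and function name are hypothetical; the de_CH and de values are the ones from the example above.

```python
INHERIT = "\u2191\u2191\u2191"  # the Inheritance Marker ↑↑↑

def resolve(path, locale_chain, data):
    """Return the first real value; a marker is skipped like a missing entry."""
    for locale in locale_chain:
        value = data.get(locale, {}).get(path)
        if value is None or value == INHERIT:
            continue
        return value
    return None

path = 'language[@type="ab"]'
with_marker = {"de_CH": {path: INHERIT}, "de": {path: "Abchasisch"}}
without_marker = {"de_CH": {}, "de": {path: "Abchasisch"}}

chain = ["de_CH", "de", "root"]
print(resolve(path, chain, with_marker))     # Abchasisch
print(resolve(path, chain, without_marker))  # Abchasisch
```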

Parent Locales

<!ELEMENT parentLocales ( parentLocale* ) >
<!ATTLIST parentLocales component NMTOKENS #IMPLIED >
<!ELEMENT parentLocale EMPTY >
<!ATTLIST parentLocale parent NMTOKEN #REQUIRED >
<!ATTLIST parentLocale localeRules NMTOKENS #IMPLIED >
<!ATTLIST parentLocale locales NMTOKENS #REQUIRED >

When the component does not occur, that is referred to as the ‘main’ component. Otherwise the component value typically corresponds to elements and their children, such as ‘collations’ or ‘plurals’. There may be more than one component value (space separated): in that case the information applies to all the components listed.

The basic inheritance model for locales of the form lang_script_region_variant1_…variantN is to truncate from the end. That is, remove the _u and _t extensions, then remove the last _ and following tag, then restore the extensions.

For example:

sr_Cyrl_ME → sr_Cyrl → sr
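The truncation step can be sketched in Python as follows. The function name is hypothetical, the sketch assumes underscore-separated IDs, and it returns "root" once only the language subtag is left (dropping any extensions at that point, which is a simplification).

```python
import re

def truncation_parent(locale_id):
    """Remove extensions, drop the last subtag, then restore the extensions."""
    match = re.search(r"_[ut]_", locale_id, flags=re.IGNORECASE)
    if match:
        base, extensions = locale_id[:match.start()], locale_id[match.start():]
    else:
        base, extensions = locale_id, ""
    if "_" not in base:
        return "root"
    return base.rsplit("_", 1)[0] + extensions

locale = "sr_Cyrl_ME_u_ca_gregory"
while locale != "root":
    print(locale)
    locale = truncation_parent(locale)
```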

In some cases, the normal truncation inheritance does not function well. For example, if the truncation algorithm changes script, then a mixture of child and parent textual data is a mishmash of different scripts.

Thus there are two cases where the truncation inheritance needs to be overridden:

  1. When the parent locale would have a different script, and text would be mixed.
  2. In certain exceptional circumstances where the 'truncation' parent needs to be adjusted.

The parentLocale element is used to override the normal inheritance when accessing CLDR data.

For case 1, there is a special attribute and value, localeRules="nonlikelyScript", which specifies all locales of the form lang_script, wherever the script is not the likely script for lang. For migration, the previous short list of locales (a subset of the nonlikelyScript locales) is retained, but those locales are slated for removal in the future. For example, ru_Latn is not included in the short list but is included (programmatically) in the rule.

<parentLocale parent="root" localeRules="nonlikelyScript" locales="az_Arab az_Cyrl bal_Latn … yue_Hans zh_Hant"/>

The localeRules attribute is used for the main component, for example. It is not used for components where text is not mixed, such as the collations component or the plurals component.

For case 2, the children and parent share the same primary language, but the region is changed. For example:

<parentLocale parent="es_419" locales="es_AR es_BO … es_UY es_VE"/>

There are certain components that require addenda to the common parent fallback rules. For a locale like zh_Hant in the example above, the parentLocale element would dictate the parent as root when referring to main locale data, but for collation data, the parent locale should still be zh, even though the parentLocale element is present for that locale. To address this, components can have their own fallback rules that inherit from the common rules and add additional parents that supplement or override the common rules:

<parentLocales component="segmentations">
  <parentLocale parent="zh" locales="zh_Hant"/>
</parentLocales>

Note: When components were first introduced, the component-specific parent locales were merged with the main parent locales. This was determined to be an error, and the component-specific parent locales are now not merged, but instead are treated as stand-alone.

Since parentLocale information is not localizable on a per locale basis, the parentLocale information is contained in CLDR’s supplemental data.

When a parentLocale element is used to override normal inheritance, the following guidelines apply in most cases:

  1. If X is the parentLocale of Y, then either X is the root locale, or X has the same base language code as Y. For example, the parent of en cannot be fr, and the parent of en_YY cannot be fr or fr_XX.
  2. If X is the parentLocale of Y, Y must not be a base language locale. For example, the parent of en cannot be en_XX.

There may be specific exceptions to these for certain closely-related languages or language-script combinations, for example:

There are certain invariants that must always be true:

  1. The parent must either be the root locale or have the same script as the child. This rule applies to component=main.
  2. There must never be cycles, such as: X parent of Y ... parent of X.
  3. Following the inheritance path, using parentLocale where available and otherwise truncating the locale, must always lead eventually to the root locale.
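Invariants 2 and 3 can be checked mechanically. The Python sketch below follows parentLocale entries where available and truncates otherwise, raising an error on a cycle. The function names are hypothetical, the two parentLocale entries are taken from the examples in this section, and extensions are ignored.

```python
def truncate(locale):
    """Plain truncation parent for underscore-separated IDs."""
    return locale.rsplit("_", 1)[0] if "_" in locale else "root"

def chain_to_root(locale, parent_locales):
    """Follow parentLocale data (else truncation) until root is reached."""
    chain, seen = [locale], {locale}
    while chain[-1] != "root":
        current = chain[-1]
        parent = parent_locales.get(current) or truncate(current)
        if parent in seen:
            raise ValueError(f"cycle in parent locales at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain

parent_locales = {"es_AR": "es_419", "zh_Hant": "root"}
print(chain_to_root("es_AR", parent_locales))
print(chain_to_root("zh_Hant_TW", parent_locales))
```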

Region-Priority Inheritance

Certain data may be more appropriate to store with the region as the primary key instead of language. This is often needed for regional user preferences, such as week info, calendar system, and measurement system. All resources matched by an entry in <rgScope> should use this type of inheritance.

The default search chain for region-priority inheritance removes the language subtag before the region subtag, as follows:

en_US_someVariant → en_US → US → 001

Equivalently as BCP-47:

en-US-variant → en-US → und-US → und

Before running region-priority inheritance, the locale should be normalized as follows:

  1. If the locale contains the -u-rg Unicode BCP-47 locale extension, the region subtag should be set to the -u-rg region. For example, en-US-u-rg-gbzzzz should normalize to en-GB when running region-priority inheritance.
  2. If, after performing step 1, the locale is missing the region subtag (language or language_script), the region subtag should be filled in from likely subtags data. For example, en should become en-US before running region-priority inheritance.
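The normalization steps and the search chain can be sketched together in Python. This is a rough illustration under several assumptions: the `LIKELY_REGION` table is a hypothetical two-entry stand-in for likely subtags data, only the -u- extension is recognized, and IDs with a script subtag are not handled.

```python
import re

# Hypothetical excerpt; real implementations use the likely subtags data.
LIKELY_REGION = {"en": "US", "fr": "FR"}

def region_priority_chain(tag):
    """Return the BCP-47 style search chain for region-priority inheritance."""
    main, _, extension = tag.lower().replace("_", "-").partition("-u-")
    subtags = main.split("-")
    language, rest = subtags[0], subtags[1:]
    region = None
    if rest and re.fullmatch(r"[a-z]{2}|[0-9]{3}", rest[0]):
        region = rest.pop(0).upper()
    variants = rest

    # Step 1: an rg override replaces the region subtag.
    override = re.search(r"(?:^|-)rg-([a-z]{2}|[0-9]{3})zzzz", extension)
    if override:
        region = override.group(1).upper()
    # Step 2: fill in a missing region from likely subtags.
    if region is None:
        region = LIKELY_REGION[language]

    base = f"{language}-{region}"
    chain = ["-".join([base] + variants[:n]) for n in range(len(variants), 0, -1)]
    chain += [base, f"und-{region}", "und"]  # language is removed before region
    return chain

print(" → ".join(region_priority_chain("en-US-variant")))
print(" → ".join(region_priority_chain("en-US-u-rg-gbzzzz")))
```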

Note that region-priority inheritance does not currently make use of parent locales or territory containment, but it may in the future.

Inheritance and Validity

The following describes in more detail how to determine the exact inheritance of elements, and the validity of a given element in LDML.

Definitions

Ordered elements are those whose sequence in the XML file is important; that is, changing the order of those elements can make a difference in the interpretation of the data. These are marked with the @ORDERED annotation in the dtd file. For example, consider the following in ldmlSupplemental.dtd:

<!ELEMENT languageMatch EMPTY >    <!--@ORDERED-->

In the file languageInfo.xml, we find the following.

<languageMatch desired="ja_Hira" supported="ja_Jpan" distance="5" oneway="true"/>
…
<!-- default script mismatch distance -->
<languageMatch desired="*_*" supported="*_*" distance="50"/> <!-- *; * ⇒ *; * -->

The ordering among the languageMatch items is important, because the *_* must only be matched after all the explicit scripts have been.
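A toy Python sketch of why the order matters is given below. It is not the Language Matching algorithm: it only applies a first-match-wins scan over the two rules quoted above, with "*" treated as a wildcard for one field, and it ignores the oneway attribute. All names are hypothetical.

```python
RULES = [
    ("ja_Hira", "ja_Jpan", 5),
    ("*_*", "*_*", 50),  # default script mismatch distance
]

def matches(pattern, value):
    """Field-by-field comparison where '*' matches any single field."""
    pattern_fields, value_fields = pattern.split("_"), value.split("_")
    return len(pattern_fields) == len(value_fields) and all(
        p in ("*", v) for p, v in zip(pattern_fields, value_fields)
    )

def first_match_distance(rules, desired, supported):
    """Return the distance of the first rule that matches, scanning in order."""
    for rule_desired, rule_supported, distance in rules:
        if matches(rule_desired, desired) and matches(rule_supported, supported):
            return distance
    return None

print(first_match_distance(RULES, "ja_Hira", "ja_Jpan"))        # explicit rule wins
print(first_match_distance(RULES[::-1], "ja_Hira", "ja_Jpan"))  # wildcard shadows it
```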

The ordered elements also block inheritance in files governed by ldml.dtd. That is, because the elements are ordered, there is no way to tell where an inherited element from a parent locale would be in that sequence.

Attributes that serve to distinguish multiple elements at the same level are called distinguishing attributes. For example, the type attribute distinguishes different elements in lists of translations, such as:

<language type="aa">Afar</language>
<language type="ab">Abkhazian</language>

Distinguishing attributes affect inheritance; two elements with different distinguishing attributes are treated as different for purposes of inheritance. For more information, see Valid Attribute Values. Other attributes are called value attributes. Value attributes do not affect inheritance, and elements with value attributes may not have child elements (see XML Format).

Non-distinguishing attributes are identified by DTD Annotations such as @VALUE.

For any element in an XML file, an element chain is a resolved [XPath] leading from the root to an element, with attributes on each element in alphabetical order. So in, say, https://github.com/unicode-org/cldr/blob/main/common/main/el.xml we may have:

<ldml>
    <identity>
        <version number="1.1" />
        <language type="el" />
    </identity>
    <localeDisplayNames>
        <languages>
            <language type="ar">Αραβικά</language>
...

Which gives the following element chains (among others):

An element chain A is an extension of an element chain B if B is equivalent to an initial portion of A. For example, #2 below is an extension of #1. (Equivalent, depending on the tree, may not be "identical to". See below for an example.)

  1. //ldml/localeDisplayNames
  2. //ldml/localeDisplayNames/languages/language[@type="ar"]
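
As an illustration, element chains can be derived mechanically from a parsed file. The sketch below uses Python's standard xml.etree.ElementTree; the helper names element_chains and is_extension are hypothetical, the sample is a closed-off version of the el.xml fragment shown earlier, and the extension test is simplified to a literal prefix comparison (it does not model equivalence, which ignores non-distinguishing attributes).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ldml>
    <identity>
        <version number="1.1" />
        <language type="el" />
    </identity>
    <localeDisplayNames>
        <languages>
            <language type="ar">Αραβικά</language>
        </languages>
    </localeDisplayNames>
</ldml>"""


def element_chains(root):
    """Yield (chain, text) for each end node, with attributes in alphabetical order."""
    def step(elem):
        attrs = "".join(f'[@{k}="{v}"]' for k, v in sorted(elem.attrib.items()))
        return elem.tag + attrs

    def walk(elem, prefix):
        chain = f"{prefix}/{step(elem)}"
        children = list(elem)
        if not children:
            yield chain, (elem.text or "").strip()
        for child in children:
            yield from walk(child, chain)

    yield from walk(root, "/")


def is_extension(a, b):
    """True if chain b is an initial portion of chain a (literal comparison only)."""
    return a == b or a.startswith(b + "/")


for chain, text in element_chains(ET.fromstring(SAMPLE)):
    print(chain, repr(text))
```

Empty elements such as version yield the empty string as their data.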

An LDML file can be thought of as an ordered list of element pairs: <element chain, data>, where the element chains are all the chains for the end-nodes. (This works because of restrictions on the structure of LDML, including that it does not allow mixed content.) The ordering is the ordering that the element chains are found in the file, and thus determined by the DTD.

For example, some of those pairs would be the following. Notice that the first has the null string as element contents.

Note: There are two exceptions to this:

  1. Ordered elements are treated as a single end node.
  2. In terms of computing inheritance, the element pair consists of the element chain plus all distinguishing attributes; the value consists of the value (if any) plus any nondistinguishing attributes.

Thus instead of the element pair being #1 below, it is #2:

  1. <//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00'],"">
  2. <//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart,[@day='sun'][@time='00:00']>

Two LDML element chains are equivalent when they would be identical if all attributes and their values were removed, except for distinguishing attributes. Thus the following are equivalent:

For any locale ID, a locale chain is an ordered list starting with the root and leading down to the ID. For example:

<root, de, de_DE, de_DE_xxx>

Resolved Data File

To produce a fully resolved locale data file from CLDR for a locale ID L, you start with L, and successively add unique items from the parent locales until you get up to root. More formally, this can be expressed as the following procedure.

  1. Let Result be initially L.
  2. For each Li in the locale chain for L, starting at L and going up to root:
    1. Let Temp be a copy of the pairs in the LDML file for Li
    2. Replace each alias in Temp by the resolved list of pairs it points to.
      1. The resolved list of pairs is obtained by recursively applying this procedure.
      2. That alias now blocks any inheritance from the parent. (See Common Elements for an example.)
    3. For each element pair P in Temp:
      1. If P is not an ordered element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.
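
The procedure can be sketched as follows. This is a simplified illustration under stated assumptions: each file is modeled as a dictionary from element chain to value, alias resolution (step 2.2) is omitted, equivalence is reduced to identical chain strings, and the names locale_chain, resolve, and the sample data are hypothetical.

```python
ROOT = "root"


def locale_chain(locale_id, parents=None):
    """Return the chain from locale_id up to root, most specific first."""
    parents = parents or {}
    chain = [locale_id]
    while chain[-1] != ROOT:
        current = chain[-1]
        if current in parents:            # explicit parent, e.g. en_AU -> en_001
            chain.append(parents[current])
        elif "_" in current:              # otherwise truncate the last subtag
            chain.append(current.rsplit("_", 1)[0])
        else:
            chain.append(ROOT)
    return chain


def resolve(locale_id, files, ordered=frozenset(), parents=None):
    """Merge element pairs from locale_id up to root; ordered elements are not inherited."""
    result = {}
    for depth, locale in enumerate(locale_chain(locale_id, parents)):
        for chain, value in files.get(locale, {}).items():
            if depth > 0 and chain in ordered:
                continue                  # ordered elements block inheritance
            result.setdefault(chain, value)  # keep the more specific value
    return result


# Hypothetical data for illustration.
FILES = {
    "root": {"//ldml/a": "root-a", "//ldml/b": "root-b", "//ldml/o": "root-o"},
    "de": {"//ldml/a": "de-a"},
    "de_DE": {"//ldml/c": "de_DE-c"},
}
print(resolve("de_DE", FILES, ordered={"//ldml/o"}))
```

Because values are only added when the chain is not yet present, data from more specific locales always wins over data from their parents.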

Notes:

Valid Data

The attribute draft="x" in LDML means that the data has not been approved by the subcommittee. (For more information, see Process). However, some data that is not explicitly marked as draft may be implicitly draft, either because it inherits it from a parent, or from an enclosing element.

Example 2. Suppose that new locale data is added for af (Afrikaans). To indicate that all of the data is unconfirmed, the attribute can be added to the top level.

<ldml version="1.1" draft="unconfirmed">
    <identity>
        <version number="1.1" />
        <language type="af" />
    </identity>
    <characters>...</characters>
    <localeDisplayNames>...</localeDisplayNames>
</ldml>

Any data can be added to that file, and the status will all be draft="unconfirmed". Once an item is vetted (whether it is inherited or explicitly in the file), its status can be changed to approved. This can be done either by leaving draft="unconfirmed" on the enclosing element and marking the child with draft="approved", such as:

<ldml version="1.1" draft="unconfirmed">
    <identity>
        <version number="1.1" />
        <language type="af" />
    </identity>
    <characters draft="approved">...</characters>
    <localeDisplayNames>...</localeDisplayNames>
    <dates />
    <numbers />
    <collations />
</ldml>

However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in Canonical Form. If an LDML file does have draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file.

More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Sections 1, 2, and 4 are simply formalizations of what is in LDML already. Item 3 adds the new element.

Checking for Draft Status

  1. Parent Locale Inheritance
    1. Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
    2. Produce the fully resolved data file D' for D.
    3. In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
    4. If there is no such E', return true
    5. If E' is not equivalent to E, truncate E' to the length of E.
  2. Enclosing Element Inheritance
    1. Walk through the elements in E', from back to front.
      1. If you ever encounter draft=x, return x
    2. If L' = L, return false
  3. Missing File Inheritance
    1. Otherwise, walk again through the elements in E', from back to front.
      1. If you encounter a validSubLocales attribute (deprecated):
        1. If L is in the attribute value, return false
        2. Otherwise return true
  4. Otherwise
    1. Return true

The validSubLocales in the most specific (farthest from root file) locale file "wins" through the full resolution step (data from more specific files replacing data from less specific ones).

Keyword and Default Resolution

When accessing data based on keywords, the following process is used. Consider the following example:

Here are the searches for various combinations.

| User Input | Lookup in Locale | For | Comment |
|---|---|---|---|
| de_CH (no keyword) | de_CH | default collation type | finds "B" |
| | de_CH | collation type=B | not found |
| | de | collation type=B | found |
| de (no keyword) | de | default collation type | not found |
| | root | default collation type | finds "standard" |
| | de | collation type=standard | not found |
| | root | collation type=standard | found |
| de_u_co_A | de | collation type=A | found |
| de_u_co_standard | de | collation type=standard | not found |
| | root | collation type=standard | found |
| de_u_co_foobar | de | collation type=foobar | not found |
| | root | collation type=foobar | not found, starts looking for default |
| | de | default collation type | not found |
| | root | default collation type | finds "standard" |
| | de | collation type=standard | not found |
| | root | collation type=standard | found |
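
The lookups traced for de and de_CH can be reproduced with a short sketch. The collation data below is inferred from those rows and is an assumption (root defines "standard" and makes it the default, de defines "A" and "B", and de_CH only sets its default to "B"); the function names are hypothetical.

```python
# Collation data inferred from the de / de_CH lookups above (assumption).
COLLATIONS = {
    "root": {"default": "standard", "types": {"standard"}},
    "de": {"default": None, "types": {"A", "B"}},
    "de_CH": {"default": "B", "types": set()},
}


def parent_chain(locale):
    """Return [locale, ..., 'root'] using simple truncation."""
    chain = [locale]
    while chain[-1] != "root":
        head = chain[-1]
        chain.append(head.rsplit("_", 1)[0] if "_" in head else "root")
    return chain


def find_type(locale, collation_type):
    """Return the first locale in the chain that defines collation_type, or None."""
    for candidate in parent_chain(locale):
        if collation_type in COLLATIONS.get(candidate, {}).get("types", ()):
            return candidate
    return None


def find_default(locale):
    """Return the first default collation type found while walking up the chain."""
    for candidate in parent_chain(locale):
        default = COLLATIONS.get(candidate, {}).get("default")
        if default:
            return default
    return None


def resolve_collation(locale, requested=None):
    """Return (collation type used, locale where its data was found)."""
    if requested is not None:
        found_in = find_type(locale, requested)
        if found_in is not None:
            return requested, found_in
    # No keyword, or the requested type does not exist: fall back to the default.
    default = find_default(locale)
    return default, find_type(locale, default)


print(resolve_collation("de_CH"))
print(resolve_collation("de", "foobar"))
```

An unknown keyword value such as foobar is searched for all the way to root before the default lookup starts.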

Examples of "search" collator lookup; 'de' has a language-specific version, but 'en' does not:

| User Input | Lookup in Locale | For | Comment |
|---|---|---|---|
| de_CH_u_co_search | de_CH | collation type=search | not found |
| | de | collation type=search | found |
| en_US_u_co_search | en_US | collation type=search | not found |
| | en | collation type=search | not found |
| | root | collation type=search | found |

Examples of lookup for Chinese collation types. Note:

| User Input | Lookup in Locale | For | Comment |
|---|---|---|---|
| zh_Hant (no keyword) | zh_Hant | default collation type | finds "stroke" |
| | zh_Hant | collation type=stroke | not found |
| | zh | collation type=stroke | found |
| zh_Hant_HK_u_co_pinyin | zh_Hant_HK | collation type=pinyin | not found |
| | zh_Hant | collation type=pinyin | not found |
| | zh | collation type=pinyin | found |
| zh (no keyword) | zh | default collation type | finds "pinyin" |
| | zh | collation type=pinyin | found |

Note: It is an invariant that the default in root for a given element must always be a value that exists in root. So you cannot have the following in root:

<someElements>
    <default type='a'/>
    <someElement type='b'>...</someElement>
    <someElement type='c'>...</someElement>
    <!-- no 'a' -->
</someElements>

For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'.

Inheritance vs Related Information

There are related types of data and processing that are easy to confuse:

Inheritance
    Part of the internal mechanism used by CLDR to organize and manage locale data. This is used to share common resources, ease maintenance, and provide the best fallback behavior in the absence of data. Should not be used for locale matching or likely subtags.
    Example: parent(en_AU) ⇒ en_001; parent(en_001) ⇒ en; parent(en) ⇒ root
    Data: supplementalData.xml <parentLocale>
    Spec: Section 4.2 Inheritance and Validity

DefaultContent
    Part of the internal mechanism used by CLDR to manage locale data. A particular sublocale is designated the defaultContent for a parent, so that the parent exhibits consistent behavior. Should not be used for locale matching or likely subtags.
    Example: addLikelySubtags(sr-ME) ⇒ sr-Latn-ME, minimize(de-Latn-DE) ⇒ de
    Data: supplementalMetadata.xml <defaultContent>
    Spec: Part 6: Section 9.3 Default Content

LikelySubtags
    Provides the most likely full subtags (script and region) in the absence of other information. A core component of LocaleMatching.
    Example: addLikelySubtags(zh) ⇒ zh-Hans-CN
             addLikelySubtags(zh-TW) ⇒ zh-Hant-TW
             addLikelySubtags(zh-Hant) ⇒ zh-Hant-TW
             minimize(zh-Hans-CN, favorRegion|favorScript) ⇒ zh
             minimize(zh-Hant-TW, favorRegion) ⇒ zh-TW
             minimize(zh-Hant-TW, favorScript) ⇒ zh-Hant
    Data: likelySubtags.xml <likelySubtags>
    Spec: Section 4.3 Likely Subtags

LocaleMatching
    Provides the best match for the user’s language(s) among an application’s supported languages.
    Example: bestLocale(userLangs=<en, fr>, appLangs=<fr-CA, ru>) ⇒ fr-CA
    Data: languageInfo.xml <languageMatching>
    Spec: Section 4.4 Language Matching

Likely Subtags

<!ELEMENT likelySubtag EMPTY >
<!ATTLIST likelySubtag from NMTOKEN #REQUIRED>
<!ATTLIST likelySubtag to NMTOKEN #REQUIRED>

There are a number of situations where it is useful to be able to find the most likely language, script, or region. For example, given the language "zh" and the region "TW", what is the most likely script? Given the script "Thai" what is the most likely language or region? Given the region TW, what is the most likely language and script?

Conversely, given a locale, it is useful to find out which fields (language, script, or region) may be superfluous, in the sense that they contain the likely tags. For example, "en_Latn" can be simplified down to "en" since "Latn" is the likely script for "en"; "ja_Jpan_JP" can be simplified down to "ja".

The likelySubtag supplemental data provides default information for computing these values. This data is based on the default content data, the population data, and the suppress-script data in [BCP47]. It is heuristically derived, and may change over time.

For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see Inheritance vs Related Information.

To look up data in the table, see if a locale matches one of the from attribute values. If so, fetch the corresponding to attribute value. For example, the Chinese data looks like the following:

<likelySubtag from="zh" to="zh_Hans_CN" />
<likelySubtag from="zh_HK" to="zh_Hant_HK" />
<likelySubtag from="zh_Hani" to="zh_Hani_CN" />
<likelySubtag from="zh_Hant" to="zh_Hant_TW" />
<likelySubtag from="zh_MO" to="zh_Hant_MO" />
<likelySubtag from="zh_TW" to="zh_Hant_TW" />

So looking up "zh_TW" returns "zh_Hant_TW", while looking up "zh" returns "zh_Hans_CN".

In more detail, the data is designed to be used in the following operations. Like other CLDR operations, these operations can also be used with language tags having [BCP47] syntax, with the appropriate changes to the data.

An implementation may choose to exclude language tags with the language subtag "und" from the following operation. In such a case, only the canonicalization is done. An implementation can declare that it is doing the exclusion, or can take a parameter that controls whether or not to do it.

Add Likely Subtags: Given a source locale X, return a locale Y where the empty subtags have been filled in by the most likely subtags. This is written as X ⇒ Y ("X maximizes to Y").

A subtag is called empty if it is a missing script or region subtag, or it is a base language subtag with the value "und". In the description below, a subscript on a subtag x indicates which tag it is from: xₛ is in the source, xₘ is in a match, and xᵣ is in the final result.

This operation is performed in the following way.

  1. Canonicalize.
    1. Canonicalize the locale ID, according to LocaleID Canonicalization.
      • Some implementations still use three obsolete language subtags: iw, in, and yi. The likely subtags data currently supports those implementations by providing elements that handle them, with the deprecated code on both sides: <likelySubtag from="iw" to="iw_Hebr_IL"/>. Such implementations may refrain from replacing those deprecated tags while canonicalizing.
    2. Remove the script code 'Zzzz' and the region code 'ZZ' if they occur.
    3. Get the components of the cleaned-up source tag (languageₛ, scriptₛ, and regionₛ), plus any variants and extensions.
    4. If the language is not 'und' and the other two components are not empty, return the language tag composed of languageₛ_scriptₛ_regionₛ + variants + extensions.
  2. Lookup. Look up each of the following in order, and stop on the first match:
    1. languageₛ_scriptₛ_regionₛ
    2. languageₛ_scriptₛ
    3. languageₛ_regionₛ
    4. languageₛ
  3. Return
    1. If there is no match, signal an error and stop.
    2. Otherwise there is a match = languageₘ_scriptₘ_regionₘ
    3. Let xᵣ = xₛ if xₛ is neither empty nor 'und', and xₘ otherwise.
    4. Return the language tag composed of languageᵣ_scriptᵣ_regionᵣ + variants + extensions.
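
A minimal Python sketch of this operation is shown below. It is illustrative only: it skips LocaleID canonicalization (step 1.1), ignores variants and extensions, and signals an error by raising LookupError. The table contains the six Chinese entries quoted earlier plus one assumed und_TW entry; the names split_tag and add_likely_subtags are hypothetical.

```python
LIKELY = {
    "zh": "zh_Hans_CN", "zh_HK": "zh_Hant_HK", "zh_Hani": "zh_Hani_CN",
    "zh_Hant": "zh_Hant_TW", "zh_MO": "zh_Hant_MO", "zh_TW": "zh_Hant_TW",
    "und_TW": "zh_Hant_TW",  # assumed entry, added for illustration
}


def split_tag(tag):
    """Split 'lang[_Script][_REGION]' into (language, script, region); '' means empty."""
    parts = tag.split("_")
    script = region = ""
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part
    return parts[0], script, region


def add_likely_subtags(tag, likely=LIKELY):
    language, script, region = split_tag(tag)
    # Step 1.2: drop the "unknown" codes.
    script = "" if script == "Zzzz" else script
    region = "" if region == "ZZ" else region
    # Step 1.4: nothing to fill in.
    if language != "und" and script and region:
        return "_".join((language, script, region))
    # Step 2: look up progressively less specific keys, stopping at the first match.
    candidates = [(language, script, region), (language, script, ""),
                  (language, "", region), (language, "", "")]
    for fields in candidates:
        match = likely.get("_".join(f for f in fields if f))
        if match is not None:
            break
    else:
        raise LookupError(f"no likely subtags for {tag!r}")
    # Step 3: keep non-empty source fields, fill the rest from the match.
    m_language, m_script, m_region = split_tag(match)
    return "_".join((language if language != "und" else m_language,
                     script or m_script, region or m_region))


print(add_likely_subtags("zh_TW"))
print(add_likely_subtags("zh_Zzzz_SG"))
```

The candidate list may contain repeated keys when fields are empty; as the text notes, an optimized implementation would skip those.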

Signalling an error can be done in various ways, depending on the most consistent approach for APIs in the module. For example:

  1. raise an exception
  2. return an error value (such as null)
  3. return the input (with missing fields)
  4. return the input, but "Zzzz", and/or "ZZ" substituted for empty fields.
  5. return "und"

One by-product of this algorithm is that an element such as <likelySubtag from="fr_IR" to="en_Arab"/> would be misleading: the 'fr' can never be replaced by 'en'. The only subtags that can be replaced are deprecated ones, empty, und, Zzzz, and ZZ.

The lookup can be optimized. For example, if any of the tags in Step 2 are the same as previous ones in that list, they do not need to be tested.

Example 1:

To find the most likely language for a country, or language for a script, use "und" as the language subtag. For example, looking up "und_TW" returns zh_Hant_TW.

A general goal of the algorithm is that any non-empty field present in the 'from' field is also present in the 'to' field, so a non-empty input field will not change in the "Add Likely Subtags" operation. That is, when X ⇒ Y, and X' results from replacing an empty subtag in X by the corresponding subtag in Y, then X' ⇒ Y. For example, if und_AF ⇒ fa_Arab_AF, then:

There are a few exceptions to this goal:

Remove Likely Subtags: Given a locale, remove any fields that Add Likely Subtags would add.

The reverse operation removes fields that could be added by the first operation.

  1. First get max = AddLikelySubtags(inputLocale).
  2. If an error is signaled in AddLikelySubtags, signal that same error and stop.
  3. Remove the variants and extensions from max.
  4. Get the components of the max (languageₘₐₓ, scriptₘₐₓ, regionₘₐₓ).
  5. Then for trial in {languageₘₐₓ, languageₘₐₓ_regionₘₐₓ, languageₘₐₓ_scriptₘₐₓ}
    • If AddLikelySubtags(trial) = max, then return trial + variants + extensions.
  6. If there is no match, return max + variants + extensions.
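
The steps above can be sketched in Python as follows. The sketch is self-contained and therefore carries its own compact maximize helper, which handles only tags of the form lang[_Script][_REGION] with a known language; variants, extensions, and "und" are not handled. The favor_script parameter (a hypothetical name) swaps the order of the last two trials so that the script is kept in preference to the region.

```python
LIKELY = {
    "zh": "zh_Hans_CN", "zh_HK": "zh_Hant_HK", "zh_Hani": "zh_Hani_CN",
    "zh_Hant": "zh_Hant_TW", "zh_MO": "zh_Hant_MO", "zh_TW": "zh_Hant_TW",
}


def join(*fields):
    """Join the non-empty fields with underscores."""
    return "_".join(f for f in fields if f)


def maximize(tag):
    """Compact Add Likely Subtags for simple tags (see the assumptions above)."""
    parts = tag.split("_")
    language = parts[0]
    script = next((p for p in parts[1:] if len(p) == 4), "")
    region = next((p for p in parts[1:] if len(p) in (2, 3)), "")
    if script and region:
        return join(language, script, region)
    for key in (join(language, script, region), join(language, script),
                join(language, region), language):
        if key in LIKELY:
            _, m_script, m_region = LIKELY[key].split("_")
            return join(language, script or m_script, region or m_region)
    raise LookupError(f"no likely subtags for {tag!r}")


def remove_likely_subtags(tag, favor_script=False):
    max_tag = maximize(tag)                      # steps 1-2: errors propagate
    language, script, region = max_tag.split("_")
    trials = [language, join(language, region), join(language, script)]
    if favor_script:                             # variant: try language_script first
        trials = [language, join(language, script), join(language, region)]
    for trial in trials:
        try:
            if maximize(trial) == max_tag:
                return trial
        except LookupError:
            continue                             # a failed trial is simply not a match
    return max_tag


print(remove_likely_subtags("zh_Hant_TW"))
print(remove_likely_subtags("zh_Hant_TW", favor_script=True))
```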

Example:

Remove Likely Subtags, favoring script: Given a locale, remove any fields that Add Likely Subtags would add, but favor script over region.

A variant of this favors the script over the region, thus using {language, language_script, language_region} in step 5 above. This variant is much less commonly used, and only when the script relationship is more significant to users. Here is the difference:

Example:

Language Matching

<!ELEMENT languageMatching ( languageMatches* ) >
<!ELEMENT languageMatches ( paradigmLocales*, matchVariable*, languageMatch* ) >
<!ATTLIST languageMatches type NMTOKEN #REQUIRED >
<!ELEMENT languageMatch EMPTY >
<!ATTLIST languageMatch desired CDATA #REQUIRED >
<!ATTLIST languageMatch supported CDATA #REQUIRED >
<!ATTLIST languageMatch percent NMTOKEN #REQUIRED >
<!ATTLIST languageMatch distance NMTOKEN #IMPLIED >
<!ATTLIST languageMatch oneway ( true | false ) #IMPLIED >
<!ELEMENT paradigmLocales EMPTY >
<!ATTLIST paradigmLocales locales NMTOKENS #REQUIRED >

Implementers are often faced with the issue of how to match the user's requested languages with their product's supported languages. For example, suppose that a product supports {ja-JP, de, zh-TW}. If the user understands written American English, German, French, Swiss German, and Italian, then de would be the best match; if the user understands only Chinese (zh), then zh-TW would be the best match.

The standard truncation-fallback algorithm does not work well when faced with the complexities of natural language. The language matching data is designed to fill that gap. Stated in those terms, language matching can have the effect of a more complex fallback, such as:

sr-Cyrl-RS
sr-Cyrl
sr-Latn-RS
sr-Latn
sr
hr-Latn
hr

Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and Italian basic, ideally an implementation would allow me to select {gsw, de, fr} as my preferred list of languages, skipping Italian because my comprehension is not good enough for arbitrary content.

Language Matching can also be used to get fallback data elements. In many cases, there may not be full data for a particular locale. For example, for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle, but it does not contain a translation for the key "CN" (for the country China). It is best to return "chine", rather than falling back to the value in a default language such as Russian and getting "Китай". The language matching data can be used to get the closest fallback locales (of those supported) to a given language.

For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see Inheritance vs Related Information.

When such fallback is used for inherited item lookup, the normal order of inheritance is used, except that before using any data from root, the data for the fallback locales would be used if available. Language matching does not interact with the fallback of resources within the locale-parent chain. For example, suppose that we are looking for the value for a particular path P in nb-NO. In the absence of aliases, normally the following lookup is used.

nb-NO
nb
root

That is, we first look in nb-NO. If there is no value for P there, then we look in nb. If there is no value for P there, we return the value for P in root (or a code value, if there is nothing there). Remember that if there is an alias element along this path, then the lookup may restart with a different path in nb-NO (or another locale).

However, suppose that nb-NO has the fallback values [nn da sv en], derived from language matching. In that case, an implementation may progressively look up each of the listed locales, with the appropriate substitutions, returning the first value that is not found in root. This follows roughly the following pseudocode:

value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;

The locales in the fallback list are not used recursively. For example, for the lookup of a path in nb-NO, if fr were a fallback value for da, it would not matter for the above process. Only the original language matters.
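
The fallback lookup described above can be written as a small runnable sketch. The DATA table, the path names, and the function names are hypothetical; inheritance is modeled as plain truncation, and aliases are not handled.

```python
# Hypothetical resource data: locale -> {path: value}.
DATA = {
    "root": {"P": "root-P", "Q": "root-Q", "R": "root-R"},
    "nb": {"Q": "nb-Q"},
    "da": {"P": "da-P"},
}


def lookup(path, locale):
    """Return (value, location_found) following nb-NO -> nb -> root style inheritance."""
    current = locale
    while True:
        if path in DATA.get(current, {}):
            return DATA[current][path], current
        if current == "root":
            return None, "root"
        current = current.rsplit("-", 1)[0] if "-" in current else "root"


def lookup_with_fallback(path, locale, fallbacks):
    """Try the locale, then each fallback language with the same remaining subtags."""
    _, _, rest = locale.partition("-")
    candidates = [locale] + [f"{fb}-{rest}" if rest else fb for fb in fallbacks]
    value, found = None, "root"
    for candidate in candidates:
        value, found = lookup(path, candidate)
        if found != "root":
            return value, found
    return value, found  # every candidate ended in root: use root's value


print(lookup_with_fallback("P", "nb-NO", ["nn", "da", "sv", "en"]))
print(lookup_with_fallback("R", "nb-NO", ["nn", "da", "sv", "en"]))
```

Only the fallback list of the original locale is consulted; the fallbacks of da or sv are never followed.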

The language matching data is intended to be used according to the following algorithm. This is a logical description, and can be optimized for production in many ways. In this algorithm, the languageMatching data is interpreted as an ordered list.

Distances between a given pair of subtags can be larger or smaller than the typical distances. For example, the distance between en and en-GB can be greater than that between en-GB and en-IE. In some cases, language and/or script differences can be as small as the typical region difference. (Example: sr-Latn vs. sr-Cyrl).

The distances resulting from the table are not linear, but are rather chosen to produce expected results. So a distance of 10 is not necessarily twice as "bad" as a distance of 5. Implementations may want to have a mode where script distances should swamp language distances. The tables are built such that this can be accomplished by multiplying the language distance by 0.25.

The language matching algorithm takes a list of a user’s desired languages, and a list of the application’s supported languages.

To find the matching distance MD between any two languages, perform the following steps.

  1. Maximize each language using Likely Subtags.
    • und is a special case: see below.
  2. Set the match-distance MD to 0
  3. For each subtag in {language, script, region}
    1. If respective subtags in each language tag are identical, remove the subtag from each (logically) and continue.
    2. Traverse the languageMatching data until a match is found.
      • * matches any field.
      • If the oneway flag is false, then the match is symmetric; otherwise only match one direction.
      • For region matching, use the mechanisms in Enhanced Language Matching.
    3. Add the distance attribute value to MD.
      • This used to be a percent attribute value, which was 100 - the distance attribute value.
    4. Remove the subtag from each (logically)
  4. Return MD
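
A simplified sketch of the distance computation follows. Several assumptions are made: both tags are already maximized and hyphen-separated, the rules are a tiny hypothetical table whose wildcard distances are 100 minus the percent values of the defaults quoted in this section, the nn/nb rule mirrors the 96% language match in the worked example, match variables and paradigm locales are not modeled, and the subtags are processed from region back to language, following the truncation order of the worked example.

```python
# (desired, supported, distance, oneway); an illustrative table, not CLDR data.
RULES = [
    ("nn", "nb", 4, False),
    ("es-*-ES", "es-*-ES", 0, False),
    ("es-*-ES", "es-*-*", 7, False),
    ("*", "*", 99, False),
    ("*-*", "*-*", 80, False),
    ("*-*-*", "*-*-*", 4, False),
]


def pattern_matches(pattern, fields):
    """A pattern matches when it has as many fields and each is '*' or equal."""
    parts = pattern.split("-")
    return len(parts) == len(fields) and all(p in ("*", f) for p, f in zip(parts, fields))


def match_distance(desired, supported, rules=RULES):
    d, s = desired.split("-"), supported.split("-")
    if len(d) != 3 or len(s) != 3:
        raise ValueError("expected maximized language-script-region tags")
    total = 0
    for size in (3, 2, 1):                 # region, then script, then language
        if d[size - 1] == s[size - 1]:
            continue                       # identical subtags add no distance
        d_fields, s_fields = d[:size], s[:size]
        for rule_desired, rule_supported, distance, oneway in rules:
            forward = (pattern_matches(rule_desired, d_fields)
                       and pattern_matches(rule_supported, s_fields))
            backward = (not oneway
                        and pattern_matches(rule_desired, s_fields)
                        and pattern_matches(rule_supported, d_fields))
            if forward or backward:
                total += distance          # first matching rule wins
                break
        else:
            raise LookupError("no rule matched; the table needs wildcard defaults")
    return total


print(match_distance("nn-Latn-DE", "nb-Latn-FR"))
print(match_distance("es-Latn-ES", "es-Latn-MX"))
```

Because the first matching rule wins, specific rules have to precede the wildcard defaults that have the same number of fields.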

It is typically useful to set the discount factor F between successive elements of the desired languages list to be slightly greater than the default region difference. That avoids the following problem:

Supported languages: "de, fr, ja"

User's desired languages: "de-AT, fr"

This user would expect to get "de", not "fr". In practice, when users select a list of preferred languages, they do not include all the regional variants ahead of their second base language. Although the list of desired languages does not really tell us the priority ranking among the user's languages, the fall-off between the user's languages is normally substantially greater than that between regional variants. But unless the discount factor F is greater than the distance between de-AT and de-DE, the user’s second-choice language would be returned.
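
The effect of the discount factor can be seen in a toy example. The distance function below is a stand-in (0 for identical tags, 4 for a region-only difference, 99 for different languages) and is an assumption made for illustration, as are the function names; a real matcher would use the match distance computed from the languageMatching data.

```python
def toy_distance(desired, supported):
    """Stand-in distance: 0 if identical, 4 if only the region differs, else 99."""
    if desired == supported:
        return 0
    if desired.split("-")[0] == supported.split("-")[0]:
        return 4
    return 99


def best_match(desired_list, supported_list, discount, distance=toy_distance):
    """Pick the supported language with the lowest distance plus rank penalty."""
    best_score, best_supported = None, None
    for rank, desired in enumerate(desired_list):
        for supported in supported_list:
            score = distance(desired, supported) + rank * discount
            if best_score is None or score < best_score:
                best_score, best_supported = score, supported
    return best_supported


supported = ["de", "fr", "ja"]
desired = ["de-AT", "fr"]
print(best_match(desired, supported, discount=0))  # exact match on the second choice wins
print(best_match(desired, supported, discount=5))  # first choice wins once the penalty exceeds 4
```

With no discount, the exact match for the second-choice language beats the regional mismatch for the first choice; a discount slightly above the region distance reverses that.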

The base language subtag "und" is a special case. Suppose we have the following situation:

Part of this is because 'und' has a special function in BCP 47; it stands in for 'no supplied base language'. To prevent this from happening, if the desired base language is und, the language matcher should not apply likely subtags to it.

Examples:

For example, suppose that nn-DE and nb-FR are being compared. They are first maximized to nn-Latn-DE and nb-Latn-FR, respectively. The list is searched. The first match is with "*-*-*", for a match of 96%. The languages are truncated to nn-Latn and nb-Latn, then to nn and nb. The first match is also for a value of 96%, so the result is 92%.

Note that language matching is orthogonal to how closely two languages are related linguistically. For example, Breton is more closely related to Welsh than to French, but French is the better match (because it is more likely that a Breton reader will understand French than Welsh). This also illustrates that the matches are often asymmetric: it is not likely that a French reader will understand Breton.

The "*" acts as a wild card, as shown in the following example:

<languageMatch desired="es-*-ES" supported="es-*-ES" percent="100" />
<!-- Latin American Spanishes are closer to each other. Approximate by having es-ES be further from everything else. -->
<languageMatch desired="es-*-ES" supported="es-*-*" percent="93" />
<languageMatch desired="*" supported="*" percent="1" />
<!-- [Default value - must be at end!] Normally there is no comprehension of different languages. -->
<languageMatch desired="*-*" supported="*-*" percent="20" />
<!-- [Default value - must be at end!] Normally there is little comprehension of different scripts. -->
<languageMatch desired="*-*-*" supported="*-*-*" percent="96" />
<!-- [Default value - must be at end!] Normally there are small differences across regions. -->

When the language+region is not matched, and there is otherwise no reason to pick among the supported regions for that language, then some measure of geographic "closeness" can be used. The results may be more understandable by users. Looking for en-SK, for example, should fall back to something within Europe (e.g. en-GB) in preference to something far away and unrelated (e.g. en-SG). Such a closeness metric does not need to be exact; a small amount of data can be used to give an approximate distance between any two regions. However, any such data must be used carefully; although Hong Kong is closer to India than to the UK, it is unlikely that en-IN would be a better match to en-HK than en-GB would.

Enhanced Language Matching

The enhanced format for language matching adds structure to enable better matching of languages. It is distinguished by having a suffix "_new" on the type, as in the example below. The extended structure allows matching to take into account broad similarities that would give better results. For example, for English the regions that are or inherit from US (AS|GU|MH|MP|PR|UM|VI|US) form a “cluster”. Each region in that cluster should be closer to each other than to any other region. And a region outside the cluster should be closer to another region outside that cluster than to one inside. We get this issue with the “world languages” like English, Spanish, Portuguese, Arabic, etc.

Example:

<languageMatches type="written_new">
    <paradigmLocales locales="en en-GB es es-419 pt-BR pt-PT" />
    <matchVariable value="AS+GU+MH+MP+PR+UM+US+VI" />
    <matchVariable value="HK+MO" />
    <matchVariable value="019" />
    <matchVariable value="MA+DZ+TN+LY+MR+EH" />
    <languageMatch desired="no" supported="nb" distance="1" /><!-- no ⇒ nb -->
    …
    <languageMatch desired="ar_*_$maghreb" supported="ar_*_$maghreb" distance="4" />
    <!-- ar; *; $maghreb ⇒ ar; *; $maghreb -->
    <languageMatch desired="ar_*_$!maghreb" supported="ar_*_$!maghreb" distance="4" />
    <!-- ar; *; $!maghreb ⇒ ar; *; $!maghreb -->
    …

The matchVariable allows for a rule to match to multiple regions, as illustrated by $maghreb. The syntax is simple: it allows for + for union and - for set difference, but no precedence. So A+B-A+D is interpreted as (((A+B)-A)+D), not as (A+B)-(A+D). The variable id has a value of the form [$][a-zA-Z0-9]+. If $X is defined, then $!X automatically means all those regions that are not in $X.

When the set is interpreted, then macroregions are (logically) transformed into a list of their contents, so “053+GB” → “AU+GB+NF+NZ”. This is done recursively, so 009 → “053+054+057+061+QO” → “AU+NF+NZ+FJ+NC+PG+SB+VU...”. Note that we use 019 for all of the Americas in the variables above, because en-US should be in the same cluster as es-419 and its contents.
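
The left-to-right evaluation of a matchVariable value, together with the recursive expansion of region groups, can be sketched as follows. The CONTAINMENT table holds only the 053 entry implied by the “053+GB” example; the function names and the explicit "universe" argument for the $!X complement are assumptions made for illustration.

```python
import re

# Minimal containment excerpt taken from the "053+GB" example; real data is far larger.
CONTAINMENT = {"053": ["AU", "NF", "NZ"]}


def expand(code, containment=CONTAINMENT):
    """Recursively replace a containing region by the regions it contains."""
    if code not in containment:
        return {code}
    regions = set()
    for child in containment[code]:
        regions |= expand(child, containment)
    return regions


def evaluate(expression, containment=CONTAINMENT):
    """Evaluate '+' (union) and '-' (set difference) strictly left to right."""
    result, operator = set(), "+"
    for token in re.findall(r"[+-]|[A-Za-z0-9]+", expression):
        if token in ("+", "-"):
            operator = token
        elif operator == "+":
            result = result | expand(token, containment)
        else:
            result = result - expand(token, containment)
    return result


def complement(regions, universe):
    """Model $!X: every region in the universe that is not in $X."""
    return set(universe) - set(regions)


print(sorted(evaluate("A+B-A+D")))
print(sorted(evaluate("053+GB")))
```

The first call yields B and D, matching the (((A+B)-A)+D) reading rather than (A+B)-(A+D), which would leave only B.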

In the rules, the percent value (100..0) is replaced by a distance value, which is the inverse (0..100).

These new variables and rules divide up the world into clusters, where items in the same clusters (for specific languages) get the normal regional difference, and items in different clusters get different weights.

Each cluster can have one or more associated paradigmLocales. These are locales that are preferred within a cluster. So when matching desired=[en-SA] against [en-GU en en-IN en-GB], the value en-GB is returned. Both of {en-GU en} are in a different cluster. While {en-IN en-GB} are in the same cluster, and the same distance from en-SA, the preference is given to en-GB because it is in the paradigm locales. It would be possible to express this in rules, but using this mechanism handles these very common cases without bulking up the tables.

The paradigmLocales also allow matching to macroregions. For example, desired=[es-419] should match to {es-MX} more closely than to {es}, and vice versa: {es-MX} should match more closely to {es-419} than to {es}. But es-MX should match more closely to es-419 than to any of the other es-419 sublocales. In general, in the absence of other distance data, there is a ‘paradigm’ in each cluster that the others should match more closely to: en(-US), en-GB, es(-ES), es-419, ru(-RU)...

XML Format

There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.

For example, the language-dependent data for Japanese in CLDR is present in the following files:

Data for cased languages such as French are in files like:

The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it were in a single file. These files have the <ldml> root element and use ldml.dtd. The file name must match the identity element. For example, the <ldml> file pa_Arab_PK.xml must contain the following elements:

<ldml>
    <identity>
    …
        <language type="pa" />
        <script type="Arab" />
        <territory type="PK" />
    </identity>
    …

Supplemental data can have different root elements, currently: ldmlBCP47, supplementalData, keyboard, and platform. Keyboard and platform files are considered distinct. The ldmlBCP47 files and supplementalData files that have the same root are all logically part of the same file; they are simply split into separate files for convenience. Implementations may split the files in different ways, also for their convenience. The files in /properties are also supplemental data files, but are structured like UCD properties.

For example, supplemental data relating to Japan or the Japanese writing are in:

Like the <ldml> files, the keyboard file names must match internal data: in particular, the locale attribute on the keyboard element must have a value that corresponds to the file name, such as <keyboard locale="af-t-k0-android"> for the file af-t-k0-android.xml.

The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the ldml.dtd file; however, the DTD does not describe all the constraints on the structure.

To start with, the root element is <ldml>, with the following DTD entry:

<!ELEMENT ldml (identity,(alias|(fallback*,localeDisplayNames?,layout?,contextTransforms?,characters?,delimiters?,measurement?,dates?,numbers?,units?,listPatterns?,collations?,posix?,segmentations?,rbnf?,annotations?,metadata?,references?,special*)))>

The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged. In most cases, an alternate structure is provided for expressing the information. There is only one exception: newer DTDs cannot be used with version 1.1 files, without some modification.

In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.

There are two kinds of elements in LDML: rule elements and structure elements.

For structure elements, there are restrictions to allow for effective inheritance and processing:

  1. There is no "mixed" content: if an element has textual content, then it cannot contain any elements.
  2. The [XPath] leading to the content is unique; no two different pieces of textual content have the same [XPath].
  3. An element that has value attributes MUST NOT also have child elements.

To illustrate these restrictions, consider the following chunk of XML:

<!-- Not correct LDML -->
<unit type="duration-day"
      displayName="days"> <!-- #3: @VALUE attribute AND children -->
  {0} per day <!-- #1: Mixed content -->
  <unitPattern>{0} day</unitPattern>  <!-- #2 same XPath /unit[@type="duration-day"]/unitPattern -->
  <unitPattern>{0} days</unitPattern> <!-- #2 same XPath /unit[@type="duration-day"]/unitPattern -->
</unit>

LDML is actually structured as below (from en.xml):

<unit type="duration-day">  <!-- OK: "type" is distinguishing -->
  <displayName>days</displayName>
  <unitPattern count="one">{0} day</unitPattern> <!-- "count" is distinguishing -->
  <unitPattern count="other">{0} days</unitPattern>
  <perUnitPattern>{0} per day</perUnitPattern> <!-- mixed content in an element -->
</unit>

Rule elements do not have these restrictions, but also do not inherit, except as an entire block. Items which are ordered have the DTD Annotation @ORDERED. See DTD Annotations and Inheritance and Validity. For more technical details, see Updating-DTDs.

Note that the data in examples given below is purely illustrative, and does not match any particular language. For a more detailed example of this format, see [Example]. There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.

In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is annotated as @ORDERED, or has a distinguishing attribute, it can only occur once as a subelement of a given element. Thus, for example, the following is illegal even though allowed by the DTD:

<languages>
    <language type="aa">...</language>
    <language type="aa">..</language>

There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an alt attribute).
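The uniqueness rule can be illustrated with a small check. This is only a sketch under a simplifying assumption: every attribute is treated as distinguishing, and the @ORDERED exemption is ignored, so it is stricter than real LDML in some cases and looser in others. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_children(parent: ET.Element) -> list:
    """Return the (tag, attributes) keys that occur more than once under parent."""
    # Assumption: all attributes are distinguishing; @ORDERED elements are not exempted.
    keys = Counter(
        (child.tag, tuple(sorted(child.attrib.items()))) for child in parent
    )
    return [key for key, count in keys.items() if count > 1]

# Two <language type="aa"> siblings: not allowed.
illegal = ET.fromstring(
    '<languages><language type="aa">A</language>'
    '<language type="aa">B</language></languages>'
)
# The second sibling carries an extra distinguishing attribute: allowed.
legal = ET.fromstring(
    '<languages><language type="aa">A</language>'
    '<language type="aa" alt="variant">B</language></languages>'
)

print(duplicate_children(illegal))
print(duplicate_children(legal))
```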

In general, LDML data should be in NFC format. Normalization forms are defined by [UAX15]. However, certain elements may need to contain characters that are not in NFC, including exemplars, transforms, segmentations, and p/s/t/i/pc/sc/tc/ic rules in collation. These elements must not be normalized (either to NFC or NFD), or their meaning may be changed. Thus LDML documents must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).
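For element values that are expected to be in NFC, the two conditions above can be tested as follows. This is a minimal sketch that uses Python's standard unicodedata module (is_normalized requires Python 3.8 or later); the function name is hypothetical, and the check must not be applied to the exempted elements such as exemplars, transforms, segmentations, or collation rules.

```python
import unicodedata

def nfc_issues(value: str) -> list:
    """List problems with an element value that is expected to be in NFC."""
    issues = []
    if value.startswith("\u0338"):
        issues.append("starts with U+0338 COMBINING LONG SOLIDUS OVERLAY")
    if not unicodedata.is_normalized("NFC", value):
        issues.append("not in NFC")
    return issues

print(nfc_issues("\u00e9"))   # precomposed e with acute accent
print(nfc_issues("e\u0301"))  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
```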

Lists, such as singleCountries, are space-delimited. That means that they are separated by one or more XML whitespace characters:

Common Elements

At any level in any element, two special elements are allowed.

Element special

This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute xmlns, which specifies the XML namespace of the special data. For example, the following used the version 1.0 POSIX special element.

<!DOCTYPE ldml SYSTEM "https://www.unicode.org/cldr/dtd/1.0/ldml.dtd" [
    <!ENTITY % posix SYSTEM "https://www.unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
%posix;
]>
<ldml>
...
    <special xmlns:posix="https://www.opengroup.org/regproducts/xu.htm">
        <!-- old abbreviations for pre-GUI days -->
        <posix:messages>
            <posix:yesstr>Yes</posix:yesstr>
            <posix:nostr>No</posix:nostr>
            <posix:yesexpr>^[Yy].*</posix:yesexpr>
            <posix:noexpr>^[Nn].*</posix:noexpr>
        </posix:messages>
    </special>
</ldml>

Sample Special Elements

The elements in this section are not part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be stored in the Common Locale Repository. They may change or be removed in future versions of this document, and are present here more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.)

The above examples are old versions: consult the documentation for the specific application to see which should be used.

These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
    <!ENTITY % openOffice SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd">
%icu;
%openOffice;
]>

Thus to include just the ICU DTD, one uses:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
%icu;
]>

Note: A previous version of this document contained a special element for ISO TR 14652 compatibility data. That element has been withdrawn, pending further investigation, since 14652 is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard, despite repeated effort". See the ballot comments on 14652 Comments for details on the 14652 defects. For example, most of these patterns make little provision for substantial changes in format when elements are empty, so are not particularly useful in practice. Compare, for example, the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.

Note: While the CLDR specification guarantees backwards compatibility, the definition of specials is up to other organizations. Any assurance of backwards compatibility is up to those organizations.

A number of the elements above can have extra information for openoffice.org, such as the following example:

<special xmlns:openOffice="https://www.openoffice.org">
    <openOffice:search>
        <openOffice:searchOptions>
            <openOffice:transliterationModules>IGNORE_CASE</openOffice:transliterationModules>
        </openOffice:searchOptions>
    </openOffice:search>
</special>

Element alias

<!ELEMENT alias (special*) >
<!ATTLIST alias source NMTOKEN #REQUIRED >
<!ATTLIST alias path CDATA #IMPLIED>

The contents of any element in root can be replaced by an alias, which points to the path where the data can be found.

Aliases will only ever appear in root with the form //ldml/.../alias[@source="locale"][@path="..."].

Consider the following example in root:

<calendar type="gregorian">
    <months>
        <default choice="format" />
        <monthContext type="format">
            <default choice="wide" />
            <monthWidth type="abbreviated">
                <alias source="locale" path="../monthWidth[@type='wide']"/>
            </monthWidth>

If the locale "de_DE" is being accessed for a month name for format/abbreviated, then a resource bundle at "de_DE" will be searched for a resource element at that path. If not found there, then the resource bundle at "de" will be searched, and so on. When the alias is found in root, then the search is restarted, but searching for format/wide element instead of format/abbreviated.

If the path attribute is present, then its value is an [XPath] that points to a different node in the tree. That XPath is not relative to the location of the alias element itself, but rather to the location of the element that contains the alias element, as seen in the example above. For example:

<alias source="locale" path="../monthWidth[@type='wide']"/>

The default value if the path is not present is the same position in the tree. All of the attributes in the [XPath] must be distinguishing elements. For more details, see Inheritance and Validity.

There is a special value for the source attribute, the constant source="locale". This special value is equivalent to the locale being resolved. For example, consider the following example, where locale data for 'de' is being resolved:

Table: Inheritance with source="locale"

| Root | de | Resolved |
|---|---|---|
| <x> <a>1</a> <b>2</b> <c>3</c> </x> | <x> <a>11</a> <b>12</b> <d>14</d> </x> | <x> <a>11</a> <b>12</b> <c>3</c> <d>14</d> </x> |
| <y> <alias source="locale" path="../x"/> </y> | <y> <b>22</b> <e>25</e> </y> | <y> <a>11</a> <b>22</b> <c>3</c> <d>14</d> <e>25</e> </y> |

The first row shows the inheritance within the <x> element, whereby <c> is inherited from root. The second shows the inheritance within the <y> element, whereby <a>, <c>, and <d> are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.

For more details on data resolution, see Inheritance and Validity.

Aliases must be resolved recursively. An alias may point to another path that results in another alias being found, and so on. For example, looking up Thai buddhist abbreviated months for the locale xx-YY may result in the following chain of aliases being followed:

../../calendar[@type="buddhist"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]

xx-YY → xx → root // finds alias that changes path to:

../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]

xx-YY → xx → root // finds alias that changes path to:

../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="wide"]

xx-YY → xx // finds value here

It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and lateral inheritance) can be followed indefinitely without terminating.
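The recursive lookup and the prohibition on circular chains can be sketched as follows. This is an illustrative model only: paths are simplified to plain strings rather than relative XPaths, lateral inheritance is omitted, and the Alias class, the resolve function, the parent map, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alias:
    path: str  # target path, assumed already converted from the relative XPath

def resolve(data: dict, parents: dict, locale: str, path: str):
    """Look up path for locale, following parent locales and aliases."""
    seen = set()
    while True:
        if path in seen:
            raise ValueError(f"circular alias chain at {path!r}")
        seen.add(path)
        current, found = locale, None
        while current is not None:          # walk locale, parent, ..., root
            found = data.get(current, {}).get(path)
            if found is not None:
                break
            current = parents[current]
        if isinstance(found, Alias):
            path = found.path               # restart from the requested locale
            continue
        return found                        # a value, or None if nothing was found

PARENTS = {"xx_YY": "xx", "xx": "root", "root": None}
DATA = {
    "root": {
        "buddhist/format/abbreviated": Alias("gregorian/format/abbreviated"),
        "gregorian/format/abbreviated": Alias("gregorian/format/wide"),
    },
    "xx": {"gregorian/format/wide": "wide month names from xx"},
    "xx_YY": {},
}

print(resolve(DATA, PARENTS, "xx_YY", "buddhist/format/abbreviated"))
```

Because the requested locale stays fixed during a lookup, remembering the visited paths is enough to detect a cycle in this simplified model.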

Element displayName

Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.

<numberFormat>
    <displayName>Prozentformat</displayName>
    ...
</numberFormat>

Where present, the display names must be unique; that is, two distinct codes would not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [Data Formats].

Escaping Characters

Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content. The escaping syntax is only defined on a few types of elements, such as in collation or exemplar sets, and uses the appropriate syntax for that type.

The element<cp>, which was formerly used for this purpose, has been deprecated.

Common Attributes

Attribute type

The attribute type is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element. For example:

<ldml>
    ...
    <currencies>
        <currency>...</currency>
        <currency type="preEuro">...</currency>
    </currencies>
</ldml>

Attribute draft

If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary draft value), as per the following:

For more information on precisely how these values are computed for any given release, see Data Submission and Vetting Process on the CLDR website.

The draft attribute should only occur on "leaf" elements, and is deprecated elsewhere. For a more formal description of how elements are inherited, and what their draft status is, see Inheritance and Validity.

Attribute alt

This attribute labels an alternative value for an element. The value is a descriptor that indicates what kind of alternative it is, and takes one of the following

proposed should only be present if the draft status is not approved. It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed saying that it should be "Settembro". The new data can be entered in, but marked as alt="proposed" until it is vetted.

...
<month type="9">Settembru</month>
<month type="9" draft="unconfirmed" alt="proposed">Settembro</month>
<month type="10">...

Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:

...<month type="9" draft="unconfirmed" alt="proposed2">Settembre</month>...

The values for variantname at this time include "variant", "list", "email", "www", "short", and "secondary".

For a more complete description of how draft applies to data, see Inheritance and Validity.

Attribute references

The value of this attribute is a token representing a reference for the information in the element, including standards that it may conform to. The token is defined by a reference element within <references>. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)

Example:

<territory type="UM" references="R222">USAs yttre öar</territory>

The reference element may be inherited. Thus, for example, R222 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.

<... allow="verbatim" ...> (deprecated)

This attribute was originally intended for use in marking display names whose capitalization differed from what was indicated by the now-deprecated<inText> element (perhaps, for example, because the names included a proper noun). It was never supported in the dtd and is not needed for use with the new<contextTransforms> element.

Common Structures

Date and Date Ranges

When attributes specify date ranges, it is usually done with the attributes from and to. The from attribute specifies the starting point, and the to attribute specifies the end point. The deprecated time attribute was formerly used to specify time with the deprecated weekEndStart and weekEndEnd elements, which were themselves inherently from or to.

The data format is a restricted ISO 8601 format, restricted to the fields year, month, day, hour, minute, and second in that order, with "-" used as a separator between date fields, a space used as the separator between the date and the time fields, and : used as a separator between the time fields. If the minute or minute and second are absent, they are interpreted as zero. If the hour is also missing, then it is interpreted based on whether the attribute is from or to.

That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00. Thus when the hour is missing, the from and to are interpreted inclusively: the range includes all of the day mentioned.

For example, the following are equivalent:

<usesMetazone from="1991-10-27" to="2006-04-02" .../>
<usesMetazone from="1991-10-27 00:00:00" to="2006-04-02 24:00:00" .../>
<usesMetazone from="1991-10-26 24:00:00" to="2006-04-03 00:00:00" .../>
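The equivalence of these three forms can be verified with a short sketch. It assumes that the date part always carries year, month, and day, that a missing time means 00:00:00 for from and 24:00:00 for to, and that no time zone handling is needed; the function name parse_bound is hypothetical.

```python
from datetime import datetime, timedelta

def parse_bound(value: str, is_to: bool) -> datetime:
    """Parse a from/to attribute value in the restricted date format."""
    date_part, _, time_part = value.partition(" ")
    year, month, day = (int(x) for x in date_part.split("-"))
    if time_part:
        fields = [int(x) for x in time_part.split(":")]
        fields += [0] * (3 - len(fields))   # absent minute/second count as zero
    else:
        # Missing hour: 'from' starts the day, 'to' extends to the end of it.
        fields = [24, 0, 0] if is_to else [0, 0, 0]
    hour, minute, second = fields
    # datetime has no hour 24, so the time of day is added as an offset.
    return datetime(year, month, day) + timedelta(
        hours=hour, minutes=minute, seconds=second
    )

ranges = [
    (parse_bound("1991-10-27", False), parse_bound("2006-04-02", True)),
    (parse_bound("1991-10-27 00:00:00", False), parse_bound("2006-04-02 24:00:00", True)),
    (parse_bound("1991-10-26 24:00:00", False), parse_bound("2006-04-03 00:00:00", True)),
]
print(ranges[0] == ranges[1] == ranges[2])
```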

If the from attribute is missing, it is assumed to be as far backwards in time as there is data for; if the to attribute is missing, then it is from this point onwards, with no known end point.

The dates and times are specified in local time, unless otherwise noted. (In particular, the metazone values are in UTC (also known as GMT).)

Text Directionality

The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded.

For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.

Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.

Unicode Sets

Some attribute values or element contents use UnicodeSet notation. A UnicodeSet represents a finite set of Unicode code points and strings, and is defined by lists of code points and strings, Unicode property sets, and set operators, with square brackets for groupings. In this context, a code point means a string consisting of exactly one code point.

A UnicodeSet implements the semantics in UTS #18: Unicode Regular Expressions [UTS18] Levels 1 & 2 that are relevant to determining sets of characters. Note however that it may deviate from the syntax provided in [UTS18]. In particular, Section RL2.6 Wildcards in Property Values is not supported. However, that feature can be supported in clients such as ICU by implementing a “hook” as is done in the online UnicodeSet utilities.

A UnicodeSet may be cited in specifications outside of the domain of LDML. In such a case, that specification may specify a subset or superset of the syntax provided here.

UnicodeSet syntax

Each symbol is followed by its expression and, where given, examples.

unicodeSet
    = prop
    | '[' '^'? s '-'? s seq* [\$ \-]? s ']'
    | var
    Examples: \p{x=y}, [abc], $myset

seq
    = unicodeSet (s [\&\-] s unicodeSet)* s
    | range s
    Examples: [abc]-[cde], a

range
    = element ('-' element)?
    Examples: a, a-c, {abc}, a-{z}
    Note: in ranges, elements must resolve to exactly one code point.

element
    = char | string | var
    Examples: %, b, {hello}, {}, \x{61 62}

prop
    = '\' [pP] '{' propName ([≠=] s pValuePerl+)? '}'
    | '[:' '^'? propName ([≠=] s pValuePosix+)? ':]'
    Examples: \p{x=y}, [:x=y:]

propName
    = s [A-Za-z0-9] [A-Za-z0-9_\x20]* s
    Examples: General_Category, General Category

pValuePerl
    = [^\}]
    | '\' quoted
    Examples: Lm, \n, \}

pValuePosix
    = [^:]
    | '\' quoted
    Examples: Lm, \n, \:

string
    = '{' (s charInString)* s '}'
    Examples: {hello}

char
    = [^ \^ \& \- \[ \] \\ \{ \$ [:Pat_WS:]]
    | '\' quoted
    Examples: a, b, c, \n, \{, \$

charInString
    = [^ \\ \} [:Pat_WS:]]
    | '\' quoted
    Examples: a, b, c, \n, {, $

quoted
    = 'u' (hex{4} | bracketedHex)
    | 'x' (hex{2} | bracketedHex)
    | 'U00' ('0' hex{5} | '10' hex{4})
    | 'N{' charName '}'
    | [[\u0000-\U0010FFFF]-[uxUN]]
    Examples: n, U0000FFFE, {, $, ]
    Note: lengths are exact

charName
    = s [A-Za-z0-9] [-A-Za-z0-9_\x20]* s
    Examples: TIBETAN LETTER -A

bracketedHex
    = '{' s hexCodePoint (sRequired hexCodePoint)* s '}'
    Examples: {61 2019 62}, {61}

hexCodePoint
    = hex{1,5} | '10' hex{4}

hex
    = [0-9A-Fa-f]

var
    = '$' [:XID_Start:] [:XID_Continue:]*
    Examples: $a, $elt5 (optional support)

s
    = [:Pattern_White_Space:]*
    (optional whitespace)

sRequired
    = [:Pattern_White_Space:]+
    (required whitespace)

The following are additional well-formedness and validity constraints:

  1. [ wfc: Ranges (X-Y) are only well-formed in the case that elements X and Y resolve to single code points. That is, [a-b] and [{a}-{b}] are well-formed because single-codepoint-strings are equivalent to that code point, while [a-{bz}] and [{ax}-{bz}] are ill-formed. ]
  2. [ vc: Property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [UAX44]. ]

Note also that:

  1. Escapes that use multiple code points are equivalent to their flattened representation, i.e., \x{61 62} is equivalent to \x{61}\x{62}. These can also occur in strings, so [{\x{ 061 62 0063}}] is equivalent to [{abc}].
  2. If […] starts with [:, then it begins a prop, and must also terminate with :]. Thus [:di:] is a valid property expression, [di:] is a 3 code-point set, and [:di] raises an error.
  3. Whitespace is significant when initiating/terminating a POSIX property expression, so [ :] is syntactically valid and equivalent to [\:].

The syntax characters are listed in the table below:

| Char | Hex | Name | Usage |
|---|---|---|---|
| $ | U+0024 | DOLLAR SIGN | Equivalent to \uFFFF when followed by ']', initiator for variable identifiers otherwise |
| & | U+0026 | AMPERSAND | Intersecting UnicodeSets |
| - | U+002D | HYPHEN-MINUS | Ranges of characters; also set difference. |
| : | U+003A | COLON | POSIX-style property syntax |
| [ | U+005B | LEFT SQUARE BRACKET | Grouping; POSIX property syntax |
| ] | U+005D | RIGHT SQUARE BRACKET | Grouping; POSIX property syntax |
| \ | U+005C | REVERSE SOLIDUS | Escaping |
| ^ | U+005E | CIRCUMFLEX ACCENT | POSIX negation syntax |
| { | U+007B | LEFT CURLY BRACKET | Strings in set; Perl property syntax |
| } | U+007D | RIGHT CURLY BRACKET | Strings in set; Perl property syntax |
|  | U+0020 U+0009..U+000D U+0085 U+200E U+200F U+2028 U+2029 | ASCII whitespace, LRM, RLM, LINE/PARAGRAPH SEPARATOR | Ignored except when escaped |

Note that some syntax characters only have a special meaning in a certain context. In particular:

Syntax Special Case Examples

The following table gives examples, including common sources of confusion concerning the UnicodeSet syntax:

| Expression | Contained Elements | Syntax Errors |
|---|---|---|
| [^a] | All Unicode code points except 'a' | [ ^a], [a^] |
| [\^a] | 'a' and '^' | |
| [:L:] | All code points with Unicode property 'General_Category' equal to 'Letter' | [:L], [:] |
| [ :] | ':' | |
| [L:] | 'L' and ':' | |
| [-] | '-' | |
| [ - ] | '-' | |
| [a-], [-a] | 'a' and '-' | |
| [a -b] | All code points between 'a' and 'b' (inclusive) | |
| [[a-b] -[b]], [[a]-[b]-[c]] | 'a' | [a-b-c] |
| [^ - ] | All Unicode code points except '-' | [ ^ - ] |
| [$], [ $ ] | U+FFFF | |
| [$a] | The value of the variable '$a' | [$ a], [$und] |
| [$a$] | U+FFFF and the value of the variable '$a' | |
| [a$] | 'a' and U+FFFF | |
| [}] | '}' | [{] |
| [{}] | the empty string, '' | |
| [{}}] | '}' and the empty string, '' | |
| [{{}] | '{' | |
| [{$var}] | the string '$var' | |
| [{[a-z}], [{ [ a - z}] | the string '[a-z' | |
| [\x{10FFFF 1}] | U+10FFFF and U+1 | [\x{10FFFF1}] |
| [\x{61}-d] | 'a', 'b', 'c', and 'd' | [\x{61 63}-d], [\x{61 63}-\x{62 64}] |

Note: the above assumes that variables are supported, $a is defined as a full UnicodeSet, a string, or a char, and $und is not defined at all.

Lists of Code Points

Lists are a sequence of strings that may include ranges, which are indicated by a '-' between two code points, as in "a-z". The sequence start-end specifies the range of all code points from the start to end, inclusive, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity, as [a c d-f m] means the same as [acd-fm].

A string with multiple code points is represented in a list by being surrounded by curly braces, such as in [a-z {ch}]. It can be used with the range notation, with the restriction that each string contains exactly one code point. Thus [{ab}-{c}], [{ax}-{bz}], and [{ab}-c] are invalid. A string consisting of a single code point is equivalent to that code point, that is, [{a}-c] is valid and equivalent to [a b c].
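The list rules above can be sketched with a deliberately small expander. It handles only the inside of a simple list (single code points, ranges, and brace-delimited strings) and assumes no backslash escapes, nested sets, properties, '^', '$', or variables. The names expand_list and RANGE are hypothetical and do not correspond to any ICU API.

```python
RANGE = object()  # marker for an unescaped '-'

def expand_list(body: str) -> set:
    """Expand the inside of a simple list such as 'a c d-f m' or 'a-z {ch}'."""
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():                      # whitespace is ignored
            i += 1
        elif ch == "{":                       # brace-delimited string
            end = body.index("}", i)
            tokens.append("".join(body[i + 1:end].split()))
            i = end + 1
        else:
            tokens.append(RANGE if ch == "-" else ch)
            i += 1

    result = set()
    j = 0
    while j < len(tokens):
        is_range = (
            tokens[j] is not RANGE
            and j + 2 < len(tokens)
            and tokens[j + 1] is RANGE
            and tokens[j + 2] is not RANGE
        )
        if is_range:
            start, end = tokens[j], tokens[j + 2]
            if len(start) != 1 or len(end) != 1:
                raise ValueError("range endpoints must be single code points")
            result.update(chr(cp) for cp in range(ord(start), ord(end) + 1))
            j += 3
        else:
            # A '-' that cannot form a range is kept as a literal.
            result.add("-" if tokens[j] is RANGE else tokens[j])
            j += 1
    return result

print(sorted(expand_list("a c d-f m")))
print(expand_list("a c d-f m") == expand_list("acd-fm"))
```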

Backslash Escapes

Certain backslashed code point sequences can be used to quote code points:

| Sequence | Code point |
|---|---|
| \x{h...h}, \u{h...h} | list of 1-6 hex digits ([0-9A-Fa-f]), separated by spaces |
| \xhh | 2 hex digits |
| \uhhhh | Exactly 4 hex digits |
| \Uhhhhhhhh | Exactly 8 hex digits |
| \a | U+0007 (BEL / ALERT) |
| \b | U+0008 (BACKSPACE) |
| \t | U+0009 (TAB / CHARACTER TABULATION) |
| \n | U+000A (LINE FEED) |
| \v | U+000B (LINE TABULATION) |
| \f | U+000C (FORM FEED) |
| \r | U+000D (CARRIAGE RETURN) |
| \\ | U+005C (BACKSLASH / REVERSE SOLIDUS) |
| \N{name} | The Unicode code point named "name". |
| \p{…}, \P{…} | Unicode property (see below) |

Anything else following a backslash is mapped to itself, except the property syntax described below, or in an environment where it is defined to have some special meaning.

Any code point formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \x, \u and \U escapes create literal code points. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary code points in an ASCII source file, and any resulting code points are not tagged as literals.)
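A decoder for the escape sequences in the table above might look like the following sketch. It only converts escapes to code points: it does not track which code points are literals, it does not treat \p{…} or \P{…} as properties, and it does no error reporting beyond what Python raises. The function name unescape and the SIMPLE table are hypothetical.

```python
import unicodedata

# Single-letter escapes from the table above.
SIMPLE = {"a": "\a", "b": "\b", "t": "\t", "n": "\n",
          "v": "\v", "f": "\f", "r": "\r", "\\": "\\"}

def unescape(text: str) -> str:
    """Decode the backslash escapes of the table into code points."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\" or i + 1 >= len(text):
            out.append(ch)
            i += 1
            continue
        kind = text[i + 1]
        i += 2
        if kind in "xu" and text[i:i + 1] == "{":
            # \x{h...h} or \u{h...h}: space-separated list of hex code points
            end = text.index("}", i)
            out.extend(chr(int(h, 16)) for h in text[i + 1:end].split())
            i = end + 1
        elif kind in "xuU":
            width = {"x": 2, "u": 4, "U": 8}[kind]
            out.append(chr(int(text[i:i + width], 16)))
            i += width
        elif kind == "N" and text[i:i + 1] == "{":
            end = text.index("}", i)
            out.append(unicodedata.lookup(text[i + 1:end]))
            i = end + 1
        elif kind in SIMPLE:
            out.append(SIMPLE[kind])
        else:
            out.append(kind)  # anything else maps to itself
    return "".join(out)

print(unescape(r"\x{61 62}") == unescape(r"\x{61}\x{62}"))
```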

Unicode property sets are defined as described in UTS #18: Unicode Regular Expressions [UTS18], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [ICUUnicodeSet].

Unicode Properties

Briefly, Unicode property sets are specified by any Unicode property and a value of that property, such as [:General_Category=Letter:] for Unicode letters or \p{uppercase} for the set of upper case letters in Unicode. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [UAX44]. The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of "=<value>". For example, you can match letters by using the POSIX-style syntax:

[:General_Category=Letter:]

or by using the Perl-style syntax

\p{General_Category=Letter}.

Property names and values are case-insensitive, and whitespace, "-", and "_" are ignored. The property name can be omitted for the General_Category and Script properties, but is required for other properties. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Thus [:Letter:] is equivalent to [:General_Category=Letter:], and [:Wh-ite-s pa_ce:] is equivalent to [:Whitespace=true:].
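The loose matching of property names and values can be sketched as a key function that two spellings share exactly when they are considered the same. The name loose_key is hypothetical; a real implementation would then look the key up in tables derived from PropertyAliases.txt and PropertyValueAliases.txt.

```python
import re

def loose_key(name: str) -> str:
    """Drop whitespace, '-' and '_' and ignore case, as described above."""
    return re.sub(r"[\s\-_]", "", name).lower()

print(loose_key("Wh-ite-s pa_ce"))
print(loose_key("General_Category") == loose_key("general category"))
```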

The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative" version, which is a property that excludes all code points of a given kind. For example, [:^Letter:] matches all code points that are not [:Letter:].

| | Positive | Negative |
|---|---|---|
| POSIX-style Syntax | [:type=value:] | [:^type=value:] |
| Perl-style Syntax | \p{type=value} | \P{type=value} |

Boolean Operations

The low-level lists or properties then can be freely combined with the normal set operations (union, inverse, difference, and intersection):

The binary operators '&', '-', and the implicit union have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equal to [[[:letter:]-[a-z]]-[\u0100-\u01FF]]. Another example is the set [[ace][bdf] - [abc][def]], which is not the empty set, but instead equal to [[[[ace] [bdf]] - [abc]] [def]], which equals [[[abcdef] - [abc]] [def]], which equals [[def] [def]], which equals [def].

One caution: the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of upper case letters except for 'A', enclose the 'A' in brackets: [[:Lu:]-[A]].
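The left-to-right, equal-precedence evaluation can be reproduced with ordinary Python sets. This sketch models only the evaluation order for already-parsed operands, with '|' standing in for the implicit union; the evaluate function is hypothetical.

```python
from functools import reduce

def evaluate(first, *steps):
    """Fold (operator, set) steps left to right with equal precedence."""
    ops = {"|": set.union, "&": set.intersection, "-": set.difference}
    return reduce(lambda acc, step: ops[step[0]](acc, step[1]), steps, set(first))

# [[ace][bdf] - [abc][def]] read left to right
left_to_right = evaluate("ace", ("|", set("bdf")), ("-", set("abc")), ("|", set("def")))

# The tempting but incorrect reading: (ace ∪ bdf) minus (abc ∪ def)
grouped = (set("ace") | set("bdf")) - (set("abc") | set("def"))

print(sorted(left_to_right))
print(sorted(grouped))
```

The first result is the set [def]; the second reading yields the empty set, which is exactly the misinterpretation the text warns against.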

Variables in UnicodeSets

Support for variable identifiers (var) is optional. They are used in certain contexts such as in Transforms. When they are used, they are defined as follows:

UnicodeSets may contain variables ($my_char, $the_set, ...) in place of full UnicodeSets and strings/characters. If variable support is enabled, variables must be defined (out-of-scope for UnicodeSets). In particular, referring to undefined variables is an error.

Not all variable maps are valid for a given expression in UnicodeSet syntax. For instance, consider [$a-$b]; this may be a range of characters if both $a and $b are characters, or a difference of sets if they are both sets; but given the map { a => '0', b => [:L:] }, it is invalid.

Note: In particular, the variable map is needed not just to compute the actual set of characters and strings represented by the UnicodeSet, but also to parse the UnicodeSet syntax: if $a and $b were unknown, the parsing of [$a-$b] would be ambiguous.

Variables are replaced by value, that is, [a $minus z] with a variable map { minus => '-' } is equivalent to [-az], not [a-z] (i.e., cardinality of 3 instead of 26). The full var nonterminal is replaced, i.e., the variable name together with the prefixed $.

The variable syntax implements UAX31-R1-2 with XID_Start and XID_Continue. For more information, see [UAX31]. Variables are equivalent normalized identifiers with Normalization Form C, implementing UAX31-R4. Furthermore, variables are case-sensitive.

Notes:

  1. The 'type' of a variable value is not specified syntactically. Thus [$a-$b] can resolve whether $a and $b are chars/strings (e.g., $a=δ, $b=θ) or full UnicodeSets (e.g., $a=\p{script=greek}, $b=\p{general_category=letter}). The only restriction is that the result be syntactic; thus ($a=w, $b=xy) would raise an error.
  2. Variable substitution is currently disallowed inside of property expressions. Thus \p{gc=$blah} raises an error.
  3. '$' when followed by ']' is interpreted as \uFFFF, and is used to match before the start of a string or after the end. Thus [ab$] matches the string "xaby" in the locations (marked with '()'): "()xaby", "x(a)by", "xa(b)y", "xaby()".
  4. If an unescaped '$' is neither followed by a character of type [:XID_Start:] nor a ']', it is a syntax error.

Backwards compatibility: In prior versions of this document, the character $ was a valid element of the char nonterminal with the special meaning of \uFFFF. In current versions, the $ character may only appear by itself at the end of a UnicodeSet, e.g., [a-z$], where it keeps that interpretation. Allowing $ to appear in any other location is only allowed as the prefix for variables. The previous behavior of allowing $ in the char nonterminal is considered obsolete and must be avoided by new implementations.

UnicodeSet Examples

The following table summarizes the syntax that can be used.

| Example | Description |
|---|---|
| [a] | The set containing 'a' alone |
| [a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order. Thus it is the same as [\u0061-\u007A]. |
| [^a-z] | The set containing all code points but 'a' through 'z'. Thus it is the same as [\u0000-\u0060 \u007B-\x{10FFFF}]. |
| [[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
| [[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
| [[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
| [a {ab} {ac}] | The code point 'a' and the multi-code point strings "ab" and "ac" |
| [x\u{61 2019 62}y] | Equivalent to [x\u0061\u2019\u0062y] (= [xa’by]) |
| [:Lu:] | The set of code points with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode upper case letters. The long form for this is [:General_Category=Uppercase_Letter:]. |
| [:L:] | The set of code points belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:General_Category=Letter:]. |

String Range

A String Range is a compact format for specifying a list of strings.

Syntax:

Xsep Y

The separator and the format of strings X, Y may vary depending on the domain. For example,

Validity:

A string range X sep Y is valid iff len(X) ≥ len(Y) > 0, where len(X) is the length of X in code points.

There may be additional, domain-specific requirements for validity of the expansion of the string range.

Interpretation:

  1. Break X into P and S, where len(S) = len(Y)
    • Note that P will be an empty string if the lengths of X and Y are equal.
  2. Form the combinations of all P+(s₀..y₀)+(s₁..y₁)+...(sₙ..yₙ)
    • s₀ is the first code point in S, etc.

Examples:

ab-ad ⇒ ab ac ad
ab-d ⇒ ab ac ad
ab-cd ⇒ ab ac ad bb bc bd cb cc cd
👦🏻-👦🏿 ⇒ 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿
👦🏻-🏿 ⇒ 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿
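The interpretation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name expand_string_range is hypothetical, the separator is assumed to have been split off by the caller, and each position is assumed to form an ascending code point range (the text above does not say what happens otherwise).

```python
def expand_string_range(x: str, y: str) -> list[str]:
    """Expand the string range "X sep Y" into the list of strings it denotes."""
    xs, ys = list(x), list(y)  # a Python str iterates by code point
    if not (len(xs) >= len(ys) > 0):
        raise ValueError("invalid string range: need len(X) >= len(Y) > 0")

    # Step 1: break X into prefix P and suffix S, where len(S) == len(Y).
    split = len(xs) - len(ys)
    prefix, suffix = "".join(xs[:split]), xs[split:]

    # Step 2: form all combinations P + (s0..y0) + (s1..y1) + ...
    results = [prefix]
    for start, end in zip(suffix, ys):
        # Assumption: start <= end; a descending pair yields no strings here.
        results = [
            r + chr(cp)
            for r in results
            for cp in range(ord(start), ord(end) + 1)
        ]
    return results


print(" ".join(expand_string_range("ab", "cd")))
```

Because P is carried over unchanged, the short form "ab-d" and the long form "ab-ad" expand to the same list.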

Identity Elements

<!ELEMENT identity (alias | (version, generation?, language, script?, territory?, variant?, special*) ) >

The identity element contains information identifying the target locale for this data, and general information about the version of this data.

<version number="$Revision: 1.227 $">

The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and the last. For example:

<version number="1.1">Various notes and changes in version 1.1</version>

This is not to be confused with the version attribute on the ldml element, which tracks the DTD version.

<generation date="$Date: 2007/07/17 23:41:16 $" />

The generation element is now deprecated. It was used to contain the last modified date for the data. This could be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).

<language type="en" />

The language code is the primary part of the specification of the locale id, with values as described above.

<script type="Latn" />

The script code may be used in the identification of written languages, with values described above.

<territory type="US" />

The territory code is a common part of the specification of the locale id, with values as described above.

<variant type="NYNORSK" />

The variant code is the tertiary part of the specification of the locale id, with values as described above.

When combined according to the rules described in Unicode Language and Locale Identifiers, the language element, along with any of the optional script, territory, and variant elements, must identify a known, stable locale identifier. Otherwise, it is an error.
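As an informal illustration, the following Python sketch reads an identity element and joins its subelements into an identifier. The function name locale_id_from_identity and the sample XML are hypothetical, and the '_' separator is an assumption; the sketch does not check that the result is a known, stable locale identifier, which would require validity data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, modeled on the element examples above.
IDENTITY_XML = """
<identity>
    <version number="1.2"/>
    <language type="en"/>
    <script type="Latn"/>
    <territory type="US"/>
</identity>
"""


def locale_id_from_identity(identity: ET.Element) -> str:
    """Join language, script, territory, and variant into one identifier."""
    if identity.find("language") is None:
        raise ValueError("identity must contain a language element")
    parts = []
    for tag in ("language", "script", "territory", "variant"):
        child = identity.find(tag)
        if child is not None:
            parts.append(child.get("type"))
    # Assumption: '_' as the separator between the parts.
    return "_".join(parts)


print(locale_id_from_identity(ET.fromstring(IDENTITY_XML)))
```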

Valid Attribute Values

The DTD Annotations are used to determine whether elements, attributes, or attribute values are valid (or deprecated).

Canonical Form

The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files.

Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element foo:

<foo>
    <pattern>
    <somethingElse>
</foo>

It can never require the reverse order in a different element bar.

<bar>
    <somethingElse>
    <pattern>
</bar>

Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency:

<!ELEMENT currency (alias | (pattern*, displayName?, symbol?, pattern*, decimal?, group?, special*)) >

XML files can have a wide variation in textual form, while representing precisely the same data. Putting the LDML files in the repository into a canonical form allows us to use the simple diff tools used widely (and in CVS) to detect differences when vetting changes, without those tools being confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily.

Content

  1. All start elements are on their own line, indented by depth tabs.

  2. All end elements (except for leaf nodes) are on their own line, indented by depth tabs.

  3. Any leaf node with empty content is in the form <foo/>.

  4. There are no blank lines except within comments or content.

  5. Spaces are used within a start element. There are no extra spaces within elements.

    • <version number="1.2"/>, not <version number = "1.2" />
    • </identity>, not </identity >
  6. All attribute values use double quote ("), not single (').

  7. There are no CDATA sections, and no escapes except those absolutely required.

    • no &apos; since it is not necessary
    • no '&#x61;', it would be just 'a'
  8. All attributes with defaulted values are suppressed.

  9. The draft and alt="proposed.*" attributes are only on leaf elements.

  10. The tzids are canonicalized in the following way:

    • All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical.
    • After that point, the first time a tzid is introduced, that is the canonical form.

    That is, new IDs are added, but existing ones keep the original form. The TZ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. For example, when America/Argentina/Catamarca was introduced as the new name for the previous America/Catamarca, a link was added in the backward file.

    Link America/Argentina/Catamarca America/Catamarca

Example:

<ldml draft="unconfirmed" >
    <identity>
        <version number="1.2" />
        <language type="en" />
        <territory type="AS" />
    </identity>
    <numbers>
        <currencyFormats>
            <currencyFormatLength>
                <currencyFormat>
                    <pattern>¤#,##0.00;(¤#,##0.00)</pattern>
                </currencyFormat>
            </currencyFormatLength>
        </currencyFormats>
    </numbers>
</ldml>

Ordering

An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs. For the latter, compare the first pair in each (in sorted order by attribute pair). If not identical, go to the second pair, and so on.

Elements and attributes are ordered according to their order in the respective DTDs. Attribute value comparison is a bit more complicated, and may depend on the attribute and type. This is currently done with specific ordering tables.

Any future additions to the DTD must be structured so as to allow compatibility with this ordering. See also Valid Attribute Values.

Comments

  1. Comments are of the form <!-- stuff -->.
  2. They are logically attached to a node. There are 4 kinds:
    1. Inline comments always appear after a leaf node, on the same line at the end. These are a single line.
    2. Preblock comments always precede the attachment node, and are indented on the same level.
    3. Postblock comments always follow the attachment node, and are indented on the same level.
    4. Final comment, after </ldml>
  3. Multiline comments (except the final comment) have each line after the first indented to one deeper level.

Examples:

<eraAbbr>
    <era type="0">BC</era> <!-- might add alternate BDE in the future -->
...
<timeZoneNames>
    <!-- Note: zones that do not use daylight time need further work -->
    <zone type="America/Los_Angeles">
    ...
    <!-- Note: the following is known to be sparse,
            and needs to be improved in the future -->
    <zone type="Asia/Jerusalem">

DTD Annotations

The information in a standard DTD is insufficient for use in CLDR. To make up for that, DTD annotations are added. These are of the form

<!--@...-->

and are included below the !ELEMENT or !ATTLIST line that they apply to. The current annotations are:

| Type | Description |
| --- | --- |
| <!--@VALUE--> | The attribute is not distinguishing, and is treated like an element value |
| <!--@METADATA--> | The attribute is a “comment” on the data, like the draft status. It is not typically used in implementations. |
| <!--@ALLOWS_UESC--> | The attribute value can be escaped using the \u notation. Does not require this notation to be used. |
| <!--@ORDERED--> | The element is ordered, and does not inherit. |
| <!--@DEPRECATED--> | The element or attribute is deprecated, and should not be used. |
| <!--@DEPRECATED: attribute-value1, attribute-value2--> | The attribute values are deprecated, and should not be used. Spaces between tokens are not significant. |
| <!--@TECHPREVIEW--> | The element is a technical preview of a feature and may be changed or removed at any time. |
| <!--@MATCH:{attribute value constraint}--> | Requires the attribute value to match the constraint. |
| <!--@CDATA--> | The element content is wrapped as a CDATA element. |

Because they are intended for internal use in CLDR tooling, the {attribute value constraints} are described in DTD Attribute Value Constraints.
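For readers who want to pick these annotations out of a DTD, the following Python sketch parses a single annotation comment into a name and an optional argument. It is only an illustration: the function name parse_annotation and the regular expression are assumptions based on the table above, not the parser used by the CLDR tooling.

```python
import re

# Assumed shape, inferred from the table above: <!--@NAME--> or <!--@NAME: argument-->
ANNOTATION = re.compile(r"<!--@([A-Z_]+)(?::\s*(.*?))?\s*-->")


def parse_annotation(comment: str):
    """Return (name, argument) for a DTD annotation, or None for other comments."""
    m = ANNOTATION.fullmatch(comment.strip())
    if m is None:
        return None
    name, arg = m.group(1), m.group(2)
    if name == "DEPRECATED" and arg:
        # Spaces between tokens are not significant, so strip each value.
        return name, [value.strip() for value in arg.split(",")]
    return name, arg


print(parse_annotation("<!--@DEPRECATED: attribute-value1, attribute-value2-->"))
```

Ordinary comments that do not start with "<!--@" are returned as None, so the same function can be applied to every comment that follows an !ELEMENT or !ATTLIST line.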

Property Data

Some data in CLDR does not use an XML format, but rather a semicolon-delimited format derived from that of the Unicode Character Database. That is because the data is more likely to be parsed by implementations that already parse UCD data. Those files are present in the common/properties directory.

Each file has a header that explains the format and usage of the data.

Script Metadata

scriptMetadata.txt

This file provides general information about scripts that may be useful to implementations processing text. The information is the best currently available, and may change between versions of CLDR. The format is similar to Unicode Character Database property file, and is documented in the header of the data file.

Extended Pictographic

ExtendedPictographic.txt

This file was used to define the ExtendedPictographic data used for “future-proofing” emoji behavior, especially in segmentation. As of Emoji version 11.0, the set of Extended_Pictographic is incorporated into the emoji data files found at unicode.org/Public/emoji/.

Labels.txt

labels.txt

This file provides general information about associations of labels to characters that may be useful to implementations of character-picking applications. The information is the best currently available, and may change between versions of CLDR. The format is similar to Unicode Character Database property file, and is documented in the header of the data file.

Initially, the contents are focused on emoji, but may be expanded in the future to other types of characters. Note that a character may have multiple labels.

Segmentation Tests

CLDR provides a tailoring to the Grapheme Cluster Break (gcb) algorithm to avoid splitting Indic aksaras. The corresponding test files for that are located in common/properties/segments/, along with a readme.txt that provides more details. There are also specific test files for the supported Indic scripts in the unittest directory.

Issues in Formatting and Parsing

Lenient Parsing

Motivation

User input is frequently messy. Attempting to parse it by matching it exactly against a pattern is likely to be unsuccessful, even when the meaning of the input is clear to a human being. For example, for a date pattern of "MM/dd/yy", the input "June 1, 2006" will fail.

The goal of lenient parsing is to accept user input whenever it is possible to decipher what the user intended. Doing so requires using patterns as data to guide the parsing process, rather than an exact template that must be matched. This informative section suggests some heuristics that may be useful for lenient parsing of dates, times, and numbers.

Loose Matching

Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:

Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching.
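The number example above can be reproduced with a small Python sketch. The list of transform steps is not present in this copy of the text, so the three steps used here (case folding, removing white space, and dropping a "." that is not between two digits) are assumptions chosen only because they reproduce the quoted example; the function name loose_normalize is hypothetical.

```python
import re


def loose_normalize(text: str) -> str:
    """Apply an assumed loose-matching transform to input text or a field element."""
    # Assumed step: ignore case differences.
    folded = text.casefold()
    # Assumed step: ignore white space.
    no_space = "".join(ch for ch in folded if not ch.isspace())
    # Assumed step: drop "." unless it sits between two digits,
    # where it looks like part of a decimal number.
    return re.sub(r"(?<!\d)\.|\.(?!\d)", "", no_space)


print(loose_normalize(" - NA f. 1,000.00"))
```

The same function is applied to both sides of the comparison, so the currency sign "NA f." and the input text are reduced to the same form before matching.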

Handling Invalid Patterns

Processes sometimes encounter invalid number or date patterns, such as a number pattern with “¤¤¤¤¤” (valid pattern character but invalid length in current CLDR), a date pattern with “nn” (invalid pattern character in current CLDR), or a date pattern with “MMMMMM” (invalid length in current CLDR). The recommended behavior for handling such an invalid pattern field is:

Data Size Reduction

Software implementations may have constrained memory requirements. The following outlines some techniques for filtering out CLDR data for a particular implementation. The exact filtering would depend on the particular requirements of the implementation in question, of course.

Locale data can be sliced to exclude data not needed by a particular implementation. This can be vertical slicing: excluding a locale and all the locales inheriting from it, or horizontal slicing: excluding particular types of data from all locales. For example:

Of course, both of these techniques can be applied.

Vertical Slicing

The choice of locales to include depends very much upon particular implementations. Some information that might be useful for determining the choice is found in the Supplemental Territory Information, which provides information on the use of languages in different countries/regions. (For a human-readable chart, see Territory-Language Information.)

It is important to note that if a particular locale is in a vertical slice, then all of its parents should be as well, because of inheritance. This is not a factor if the data is fully resolved, as in the JSON format data.

Slicing can also remove related supplemental data. For example, the likely subtags data includes a large number of languages that may not be of interest for all implementations. Where an implementation only includes (say) the CLDR locales at Basic coverage in Unicode CLDR - Coverage Levels (and locales inheriting from them), the likely subtag data that doesn’t match can be filtered out.
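The requirement that a vertical slice include all parents of each selected locale can be sketched as a closure computation. The Python below is a simplified illustration: the names parent_of and close_over_parents are hypothetical, the default parent is assumed to be found by truncating the last '_'-separated subtag, and the explicit_parents mapping stands in for any explicit parent overrides, which the caller must supply.

```python
from typing import Dict, Iterable, Optional, Set


def parent_of(locale: str, explicit_parents: Dict[str, str]) -> Optional[str]:
    """Return the parent of a locale, or None for root."""
    if locale == "root":
        return None
    if locale in explicit_parents:
        return explicit_parents[locale]
    # Assumption: otherwise truncate the last subtag, ending at root.
    head, sep, _ = locale.rpartition("_")
    return head if sep else "root"


def close_over_parents(selected: Iterable[str],
                       explicit_parents: Dict[str, str]) -> Set[str]:
    """Add every ancestor of the selected locales to the slice."""
    result: Set[str] = set()
    for locale in selected:
        current: Optional[str] = locale
        while current is not None and current not in result:
            result.add(current)
            current = parent_of(current, explicit_parents)
    return result


print(sorted(close_over_parents({"en_AU"}, {"en_AU": "en_001"})))
```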

Horizontal Slicing

The main reason to perform horizontal slicing is when a particular feature is not used, so the implementation wants to remove the data required for powering that feature. For example, if an application isn't performing date formatting, it can remove all date formatting data (transitively). It must take care to retain data used by other features: in the previous example, the number formatting data where currencies are being formatted.

Locales may also have data on a field-by-field basis that is reasonable to filter out. For example, locales that meet the Modern level of coverage typically also include some data at a Comprehensive level. That data is not typically needed for most implementations, and can typically be filtered out. For example, in CLDR version 43, 58% of the script names (//ldml/localeDisplayNames/scripts/script[@type="*"]) are at the Comprehensive level; in fact, ~20% of all values for the Modern level locales are at the Comprehensive level.

The easiest way to do that is to use the CLDR Java tooling (the cldr-code package) to filter the data before generating the implementation's data format. This gives the implementation direct access to the CoverageLevel code, which can determine the coverage level for a given locale and path. Once the data is transformed, such as to the JSON format, the CoverageLevel code is no longer accessible. For example, here is a code snippet:

private static final SupplementalDataInfo SUPPLEMENTAL_DATA_INFO = CLDRConfig.getInstance().getSupplementalDataInfo();
...
    Level pathLevel = SUPPLEMENTAL_DATA_INFO.getCoverageLevel(path, locale);
    if (minimumPathCoverage.compareTo(pathLevel) >= 0) {
        include(path);
    }

Similarly, the subdivision translations represent a large body of data that may not be needed for many implementations.


Annex A Deprecated Structure

The DTD Annotations are used to determine whether DTD items such as elements, attributes, or attribute values are deprecated.

Though such deprecated items are still valid LDML, they are strongly discouraged, and are no longer used in CLDR.

The CLDR DTD Deltas chart shows which DTD items have been deprecated in which version of CLDR.

The remainder of this section describes selected cases of deprecated structure, and what (if any) should be used instead.

A.1 Element fallback

Implementations should instead use the information in Language Matching for doing language fallback.

A.2 BCP 47 Keyword Mapping

Instead use the mechanisms described in U Extension Data Files.

A.3 Choice Patterns

Instead use count attributes.

A.4 Element default

Instead use replacement structure, for example:

A.5 Deprecated Common Attributes

A.5.1 Attribute standard

Instead, use a reference element with the attribute standard="true".

A.5.2 Attribute draft in non-leaf elements

The draft attribute is deprecated except in leaf elements (elements that do not have any subelements).

A.6 Element base

Instead use the collation <import> element.

A.7 Element rules

Instead use the basic collation syntax with the <cr> element.

A.8 Deprecated subelements of <dates>

A.9 Deprecated subelements of <calendars>

A.10 Deprecated subelements of <timeZoneNames>

A.11 Deprecated subelements of <zone> and <metazone>

A.12 Renamed attribute values for <contextTransformUsage> element

The <contextTransformUsage> element was introduced in CLDR 21. The values for its type attribute are documented in <contextTransformUsage> type attribute values. In CLDR 25, some of these values were renamed from their previous values for improved clarity:

A.13 Deprecated subelements of <segmentations>

A.14 Element cp

The cp element was used in certain elements to escape characters that cannot be represented in XML, even with NCRs. This mechanism was replaced by specialized syntax:

| Code Point | XML Example |
| --- | --- |
| U+0000 | <cp hex="0"> |

A.15 Attribute validSubLocales

Instead of using validSubLocales, it is recommended to simply add empty files to specify which sublocales are valid. This convention is used throughout the CLDR.

A.16 Elements postalCodeData, postCodeRegex

Instead please see other services that are kept up to date, such as https://github.com/google/libaddressinput

A.17 Element telephoneCodeData

The element <telephoneCodeData> and its subelements have been deprecated and the data removed.

A.18 Deprecated attribute of supplemental languageData/language

For the supplemental <languageData> subelement <language>, the territory attribute has been deprecated and associated data removed. A better source for such information is the more detailed data in Supplemental Territory Information.


Annex B Links to Other Parts

The LDML specification is split into several parts by topic, with one HTML document per part. The following tables provide redirects for links to specific topics. Please update your links and bookmarks.

Part 1 Links: Core (this document): No redirects needed.

Table: Part 2 Links: General (display names & transforms, etc.)

| Old section | Section in new part |
| --- | --- |
| 5.4 Display Name Elements | 1 Display Name Elements |
| 5.5 Layout Elements | 2 Layout Elements |
| 5.6 Character Elements | 3 Character Elements |
| 5.6.1 Exemplar Syntax | 3.1 Exemplar Syntax |
| 5.6.2 Restrictions | 3.1 Exemplar Syntax |
| 5.6.3 Mapping | 3.2 Mapping |
| 5.6.4 Index Labels | 3.3 Index Labels |
| 5.6.5 Ellipsis | 3.4 Ellipsis |
| 5.6.6 More Information | 3.5 More Information |
| 5.7 Delimiter Elements | 4 Delimiter Elements |
| C.6 Measurement System Data | 5 Measurement System Data |
| 5.8 Measurement Elements (deprecated) | 5.1 Measurement Elements (deprecated) |
| 5.11 Unit Elements | 6 Unit Elements |
| 5.12 POSIX Elements | 7 POSIX Elements |
| 5.13 Reference Element | 8 Reference Element |
| 5.15 Segmentations | 9 Segmentations |
| 5.15.1 Segmentation Inheritance | 9.1 Segmentation Inheritance |
| 5.16 Transforms | 10 Transforms |
| N Transform Rules | 10.3 Transform Rules Syntax |
| 5.18 List Patterns | 11 List Patterns |
| C.20 Gender of Lists | 11.1 Gender of Lists |
| 5.19 ContextTransform Elements | 12 ContextTransform Elements |

Table: Part 3 Links: Numbers (number & currency formatting)

| Old section | Section in new part |
| --- | --- |
| C.13 Numbering Systems | 1 Numbering Systems |
| 5.10 Number Elements | 2 Number Elements |
| 5.10.1 Number Symbols | 2.3 Number Symbols |
| G Number Format Patterns | 3 Number Format Patterns |
| 5.10.2 Currencies | 4 Currencies |
| C.1 Supplemental Currency Data | 4.1 Supplemental Currency Data |
| C.11 Language Plural Rules | 5 Language Plural Rules |
| 5.17 Rule-Based Number Formatting | 6 Rule-Based Number Formatting |

Table: Part 4 Links: Dates (date, time, time zone formatting)

| Old section | Section in new part |
| --- | --- |
| 5.9 Date Elements | 1 Overview: Dates Element, Supplemental Date and Calendar Information |
| 5.9.1 Calendar Elements | 2 Calendar Elements |
| Elements months, days, quarters, eras | 2.1 Elements months, days, quarters, eras |
| Elements monthPatterns, cyclicNameSets | 2.2 Elements monthPatterns, cyclicNameSets |
| Element dayPeriods | 2.3 Element dayPeriods |
| Element dateFormats | 2.4 Element dateFormats |
| Element timeFormats | 2.5 Element timeFormats |
| Element dateTimeFormats | 2.6 Element dateTimeFormats |
| 5.9.2 Calendar Fields | 3 Calendar Fields |
| 5.9.3 Time Zone Names | 5 Time Zone Names |
| C.5 Supplemental Calendar Data | 4 Supplemental Calendar Data |
| C.7 Supplemental Time Zone Data | 6 Supplemental Time Zone Data |
| C.15 Calendar Preference Data | 4.2 Calendar Preference Data |
| C.17 DayPeriod Rules | 4.5 Day Period Rules |
| Appendix F: Date Format Patterns | 8 Date Format Patterns |
| Date Field Symbol Table | Date Field Symbol Table |
| F.1 Localized Pattern Characters (deprecated) | 8.1 Localized Pattern Characters (deprecated) |
| Appendix J: Time Zone Display Names | 7 Using Time Zone Names |
| fallbackFormat: | fallbackFormat: |
| O.4 Parsing Dates and Times | 9 Parsing Dates and Times |

Table: Part 5 Links: Collation (sorting, searching, grouping)

| Old section | Section in new part |
| --- | --- |
| 5.14 Collation Elements | 3 Collation Tailorings |
| 5.14.1 Version | 3.1 Version |
| 5.14.2 Collation Element | 3.2 Collation Element |
| 5.14.3 Setting Options | 3.3 Setting Options |
| Table Collation Settings | Table Collation Settings |
| 5.14.4 Collation Rule Syntax | 3.4 Collation Rule Syntax |
| 5.14.5 Orderings | 3.5 Orderings |
| 5.14.6 Contractions | 3.6 Contractions |
| 5.14.7 Expansions | 3.7 Expansions |
| 5.14.8 Context Before | 3.8 Context Before |
| 5.14.9 Placing Characters Before Others | 3.9 Placing Characters Before Others |
| 5.14.10 Logical Reset Positions | 3.10 Logical Reset Positions |
| 5.14.11 Special-Purpose Commands | 3.11 Special-Purpose Commands |
| 5.14.12 Collation Reordering | 3.12 Collation Reordering |
| 5.14.13 Case Parameters | 3.13 Case Parameters |
| Definition: UncasedExceptions | removed: see 3.13 Case Parameters |
| Definition: LowerExceptions | removed: see 3.13 Case Parameters |
| Definition: UpperExceptions | removed: see 3.13 Case Parameters |
| 5.14.14 Visibility | 3.14 Visibility |

Table: Part 6 Links: Supplemental (supplemental data)

| Old section | Section in new part |
| --- | --- |
| C Supplemental Data | Introduction Supplemental Data |
| C.2 Supplemental Territory Containment | 1.1 Supplemental Territory Containment |
| C.4 Supplemental Territory Information | 1.2 Supplemental Territory Information |
| C.3 Supplemental Language Data | 2 Supplemental Language Data |
| C.9 Supplemental Code Mapping | 4 Supplemental Code Mapping |
| C.12 Telephone Code Data | 5 Telephone Code Data |
| C.14 Postal Code Validation | 6 Postal Code Validation |
| C.8 Supplemental Character Fallback Data | 7 Supplemental Character Fallback Data |
| M Coverage Levels | 8 Coverage Levels |
| 5.20 Metadata Elements | 10 Locale Metadata Element |
| P Supplemental Metadata | 9 Supplemental Metadata |
| P.1 Supplemental Alias Information | 9.1 Supplemental Alias Information |
| P.2 Supplemental Deprecated Information | 9.2 Supplemental Deprecated Information |
| P.3 Default Content | 9.3 Default Content |

Table: Part 7 Links: Keyboards (keyboard mappings)

Part 7 has been extensively rewritten. The prior link anchors within this file are no longer valid.


Annex C. LocaleId Canonicalization

The languageAlias, scriptAlias, territoryAlias, and variantAlias elements are used as rules to transform an input source localeId. The first step is to transform the languageId portion of the localeId.

Note: in the following discussion, the separator '-' is used. That is also used in examples of XML alias data, even though for compatibility reasons that alias data actually uses '_' as a separator. The processing can also be applied to syntax while maintaining the separator '_', mutatis mutandis. CLDR also uses “territory” and “region” interchangeably.

Also note that the discussion of canonicalization assumes BCP 47 input data. If input data is a CLDR or ICU locale ID such as en_US_POSIX, a conversion step must be done prior to canonicalization. See §3.8.2 Legacy Variants.

LocaleId Definitions

1. Multimap interpretation

Interpret each languageId as a multimap from a fieldId (language, script, region, variants) to a sorted set of field values.

Examples:

| Source | Language | Script | Region | Variants |
| --- | --- | --- | --- | --- |
| en-GB | {en} | {} | {GB} | {} |
| und-GB | {} | {} | {GB} | {} |
| ja-Latn-YU-hepburn-heploc | {ja} | {Latn} | {YU} | {hepburn, heploc} |

2. Alias elements

For the languageAlias elements, the type and replacements are languageIds.

For the script-, territory- (aka region), and variant- Alias elements, the type and replacements are interpreted as a languageId, after prefixing with “und-”. Thus

<territoryAlias type="AN" replacement="CW SX BQ" reason="deprecated" />

is interpreted as:

<territoryAlias type="und-AN" replacement="und-CW und-SX und-BQ" reason="deprecated" />

Note that for the case of territoryAlias, there may be multiple replacement values separated by spaces in the text (such as replacement="und-CW und-SX und-BQ"); other rules only ever have a single replacement value.

3. Matches

A rule matches a source if and only if, for every field, the source field ⊇ the type field.

Examples:

source="ja-heploc-hepburn" and type="und-hepburn"

{ja} ⊇ {} : success, und = {}
{hepburn, heploc} ⊇ {hepburn} : success

so the rule matches the source. (Note that the order of variants is immaterial to matching.)

source="ja-hepburn" and type="und-hepburn-heploc"

{ja} ⊇ {} : success, und = {}
{hepburn} ⊉ {hepburn, heploc} : failure

so the rule does not match the source.
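The multimap interpretation and the matching test can be sketched together in Python. This is a simplified illustration under stated assumptions: the names to_multimap and rule_matches are hypothetical, subtags are classified only by length and character class, and extensions and private-use subtags are not handled.

```python
FIELDS = ("language", "script", "region", "variants")


def to_multimap(language_id: str) -> dict:
    """Interpret a languageId as a map from field name to a set of values."""
    fields = {name: set() for name in FIELDS}
    subtags = language_id.lower().replace("_", "-").split("-")
    first, rest = subtags[0], subtags[1:]
    if first != "und":  # "und" contributes the empty set
        fields["language"].add(first)
    for i, tag in enumerate(rest):
        is_region = (len(tag) == 2 and tag.isalpha()) or (len(tag) == 3 and tag.isdigit())
        if i == 0 and len(tag) == 4 and tag.isalpha():
            fields["script"].add(tag.title())
        elif is_region and not fields["region"] and not fields["variants"]:
            fields["region"].add(tag.upper())
        else:
            fields["variants"].add(tag)
    return fields


def rule_matches(source: str, rule_type: str) -> bool:
    """A rule matches iff every source field is a superset of the type field."""
    src, typ = to_multimap(source), to_multimap(rule_type)
    return all(src[name] >= typ[name] for name in FIELDS)


print(rule_matches("ja-heploc-hepburn", "und-hepburn"))
print(rule_matches("ja-hepburn", "und-hepburn-heploc"))
```

Using sets for the field values makes the order of variants immaterial, as the first example requires.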

4. Replacement

A matching rule can be used to transform the source fields as follows

Example:

source="ja-Latn-fonipa-hepburn-heploc"

rule = <languageAlias type="und-hepburn-heploc" replacement="und-alalc97">

result="ja-Latn-alalc97-fonipa"

(note that CLDR canonical order of variants is alphabetical)
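The example above is consistent with a per-field set operation, which the following Python sketch illustrates. The formula is inferred from the example only and is an assumption: for each field that the rule type names, the type values are removed from the source and the replacement values are added. Fields that the type does not name are left untouched here, and the Territory Exception is not covered. The names apply_rule and format_id are hypothetical.

```python
FIELDS = ("language", "script", "region", "variants")


def apply_rule(source: dict, rule_type: dict, replacement: dict) -> dict:
    """Assumed rule: field = (source.field - type.field) | replacement.field."""
    result = {}
    for name in FIELDS:
        values = set(source.get(name, ()))
        type_values = set(rule_type.get(name, ()))
        if type_values:  # only fields named by the rule type are changed
            values = (values - type_values) | set(replacement.get(name, ()))
        result[name] = values
    return result


def format_id(fields: dict) -> str:
    """Join the fields back into a languageId, with variants in alphabetical order."""
    parts = sorted(fields["language"]) or ["und"]
    for name in ("script", "region", "variants"):
        parts += sorted(fields[name])
    return "-".join(parts)


source = {"language": {"ja"}, "script": {"Latn"},
          "variants": {"fonipa", "hepburn", "heploc"}}
rule_type = {"variants": {"hepburn", "heploc"}}
replacement = {"variants": {"alalc97"}}
print(format_id(apply_rule(source, rule_type, replacement)))
```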

Territory Exception

If the field = territory, and the replacement.field has more than one value, then look up the most likely territory for the base language code (and script, if there is one). If that likely territory is in the list of replacements, use it. Otherwise, use the first territory in the list.

5. Canonicalizing Syntax

To canonicalize the syntax ofsource:

Preprocessing

The data from supplementalMetadata is (logically) preprocessed as follows.

  1. Load the rules from supplementalMetadata.xml, replacing '_' by '-', and adding “und-” as described in Definition 2. Alias Elements.
  2. Capture all languageAlias rules where the type is an invalid languageId into a set of BCP47 LegacyRules. Example:
    1. <languageAlias type="i-mingo" replacement="see-x-i-mingo" reason="legacy" />
  3. Discard all rules where the type is an invalid languageId. Examples are
    1. <languageAlias type="i-mingo" replacement="see-x-i-mingo" reason="legacy" />
    2. <territoryAlias type="und-AAA" replacement="und-AA" reason="overlong" />
  4. Change the type and replacement values in the remaining rules into multimap rules, as per Definition 1. Multimap Interpretation.
    1. Note that the “und” value disappears.
  5. Order the set of rules using the following comparison logic:
    1. For each rule, count the number of items in each field value set (L, S, R, V) and sum the four counts. If two rules have differing sums, order the rule with the greater sum before the rule with the smaller sum.
      • For example:
      • {V={hepburn,heploc}} is tied with
      • {L={en}, R={GB}} (because both have 2 total field value items) and both precede
      • {R={CA}} (which has 1).
    2. For rule pairs that are not differentiated by the previous step, consider the value set for each field in the order L, then S, then R, then V. If one rule has a non-empty value set for that field and the other rule does not, then order the rule with the non-empty value set for that field before the other rule and disregard all later fields. Otherwise, consider the next field.
      • For example:
      • {L={zh}, S={Hant}, R={CN}} is tied with
      • {L={en}, S={Latn}, R={GB}} (because both have non-empty sets for L, S, and R but not for V), and both precede
      • {L={zh}, S={Hans}, V={pinyin}} (because it lacks values for R), which precedes
      • {L={en}, R={GB}, V={scouse}} (because it lacks values for S), which precedes
      • {V={fonipa,hepburn,heploc}} (because it lacks values for L), which is tied with
      • {V={hepburn,heploc,simple}} (because both have non-empty sets for V but not for L, S, or R).
    3. For rule pairs that are not differentiated by the previous step, consider the value set for each field in the order L, then S, then R, then V as a sequence of subtags. If those lists for the same field of two rules differ, then consider the first position of difference in the two lists and order the rules by code-point order of the field value at that position and disregard all later fields. Otherwise, consider the next field.
      • For example:
      • {L={ja}, V={hepburn, heploc}} precedes
      • {L={zh}, V={1996, pinyin}} (because it has a different field value set for L and "ja" precedes "zh" at the first position of difference), which precedes
      • {L={zh}, V={hepburn, heploc}} (because it has the same field value set for L and a different field value set for V in which "1996" precedes "hepburn" at the first position of difference), which precedes
      • {L={zh}, V={hepburn, simple}} (because it has the same field value set for L and a different field value set for V in which "heploc" precedes "simple" at the first position of difference).
  6. The result is the set of Alias Rules.

So using the examples above, we get the following order:

| languageId | 5.1 total field value set item count | 5.2 non-empty field value set | 5.3 field value set items |
| --- | --- | --- | --- |
| {L={en}, S={Latn}, R={GB}} | 3 | n/a | n/a |
| {L={zh}, S={Hant}, R={CN}} | 3 | match (L, S, R) | in L, “en” before “zh” |
| {L={zh}, S={Hans}, V={pinyin}} | 3 | (L, S, R, …) before (L, S, V) | |
| {L={en}, R={GB}, V={scouse}} | 3 | (L, S, …) before (L, R, …) | |
| {L={ja}, V={hepburn,heploc}} | 3 | (L, R, …) before (L, V) | |
| {L={zh}, V={1996,pinyin}} | 3 | match (L, V) | in L, “ja” before “zh” |
| {L={zh}, V={hepburn,heploc}} | 3 | match (L, V) | in V, “1996” before “hepburn” |
| {L={zh}, V={hepburn,simple}} | 3 | match (L, V) | in V, “heploc” before “simple” |
| {V={fonipa,hepburn,heploc}} | 3 | (L, …) before (V) | |
| {V={hepburn,heploc,simple}} | 3 | match (V) | in V, “fonipa” before “hepburn” |
| {L={en}, R={GB}} | 2 | | |
| {V={hepburn,heploc}} | 2 | (L, …) before (V) | |
| {R={CA}} | 1 | | |
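The three comparison steps of the rule ordering can be expressed as a single sort key, as the Python sketch below suggests. It is an illustration rather than the CLDR implementation: rules are represented as plain dictionaries from field letter to a set of values, and the names rule_sort_key and EXAMPLE_RULES are hypothetical. Python compares strings by code point, which is what step 5.3 asks for.

```python
FIELDS = ("L", "S", "R", "V")


def rule_sort_key(rule: dict):
    """Build a key that orders rules by steps 5.1, 5.2, and 5.3."""
    values = tuple(tuple(sorted(rule.get(f, ()))) for f in FIELDS)
    total = sum(len(v) for v in values)
    # 5.1: greater total item count first.
    # 5.2: in field order, a non-empty set (False) sorts before an empty one (True).
    # 5.3: in field order, compare the sorted values by code point.
    return (-total, tuple(len(v) == 0 for v in values), values)


# The example rules, listed in the order shown in the table above.
EXAMPLE_RULES = [
    {"L": {"en"}, "S": {"Latn"}, "R": {"GB"}},
    {"L": {"zh"}, "S": {"Hant"}, "R": {"CN"}},
    {"L": {"zh"}, "S": {"Hans"}, "V": {"pinyin"}},
    {"L": {"en"}, "R": {"GB"}, "V": {"scouse"}},
    {"L": {"ja"}, "V": {"hepburn", "heploc"}},
    {"L": {"zh"}, "V": {"1996", "pinyin"}},
    {"L": {"zh"}, "V": {"hepburn", "heploc"}},
    {"L": {"zh"}, "V": {"hepburn", "simple"}},
    {"V": {"fonipa", "hepburn", "heploc"}},
    {"V": {"hepburn", "heploc", "simple"}},
    {"L": {"en"}, "R": {"GB"}},
    {"V": {"hepburn", "heploc"}},
    {"R": {"CA"}},
]

shuffled = list(reversed(EXAMPLE_RULES))
print(sorted(shuffled, key=rule_sort_key) == EXAMPLE_RULES)
```

Because tuple comparison stops at the first differing element, the "disregard all later fields" wording of steps 5.2 and 5.3 falls out of the key structure without extra code.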

Processing LanguageIds

To canonicalize a given source:

  1. Canonicalize the syntax of source as per Definition 5. Canonicalizing Syntax.
  2. Where the source could be an arbitrary BCP 47 language tag, first process as follows:
    1. If the source is identical to one of the types in the BCP47 LegacyRules, replace the entire source by the replacement value.
    2. Else if there is an extlang subtag, then apply Step 3 of BCP 47 Section 4.5 to remove the extlang subtag (possibly adjusting the language subtag).
      1. Don’t apply any of the other canonicalization steps in that section, however.
    3. Else if the first subtag is "x", prefix by "und-".
    4. Note: there are currently no valid 4-letter primary language subtags. While it is extremely unlikely that BCP 47 would ever register them, if so then languageAlias mappings will be supplied for them, mapping to defined CLDR language subtags (from the idStatus="reserved" set).
  3. Find the first matching rule in Alias Rules (from Preprocessing).
    1. If there are none, return source.
  4. Transform source according to that rule.
  5. Loop (go to step 3).

Processing LocaleIds

The canonicalization of localeIds is done by first canonicalizing the languageId portion, then handling extensions in the following way:

  1. Replace any tlang languageId value by its canonicalization.
  2. Use the bcp47 data to replace keys, types, tfields, and tvalues by their canonical forms. See U Extension Data Files and T Extension Data Files. The matches are in the alias attribute value, while the canonical replacement is in the name attribute value. For example:
    1. Because of the following bcp47 data: <key name="ms"…>…<type name="uksystem" … alias="imperial" … />…</key>
    2. We get the following transformation: en-u-ms-imperial ⇒ en-u-ms-uksystem
  3. Replace any unicode_subdivision_id that is a subdivision alias by its replacement value in the same way, using subdivisionAlias data. This applies, for example, to the values for the 'sd' and 'rg' keys. However, where the replacement value is a two-letter region code, also append zzzz so that the result is syntactically correct. For example:
    1. Because of the following bcp47 data: <subdivisionAlias type="fi01" replacement="AX"…
    2. We get the following transformation: en-u-rg-fi01 ⇒ en-u-rg-axzzzz
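
The two worked examples above can be reproduced with a small Python sketch. It assumes the -u- extension has already been parsed into a key-to-type dictionary, and it hard-codes only the two alias entries quoted in the examples; the table names and the function name are hypothetical. Replacement of the keys themselves, tfields, and tvalues is omitted.

```python
# Alias data copied from the two examples in the text; a real implementation
# would load these tables from the bcp47 and subdivisionAlias data.
TYPE_ALIASES = {("ms", "imperial"): "uksystem"}
SUBDIVISION_ALIASES = {"fi01": "AX"}
SUBDIVISION_KEYS = {"sd", "rg"}


def canonicalize_keywords(keywords):
    """Return a copy of the parsed -u- keywords with aliases replaced."""
    result = {}
    for key, value in keywords.items():
        # Step 2: replace an aliased type by its canonical name.
        value = TYPE_ALIASES.get((key, value), value)
        # Step 3: replace subdivision aliases for the 'sd' and 'rg' keys.
        if key in SUBDIVISION_KEYS and value in SUBDIVISION_ALIASES:
            replacement = SUBDIVISION_ALIASES[value].lower()
            # A two-letter region code gets "zzzz" appended so that the
            # result is still a syntactically valid subdivision id.
            if len(replacement) == 2 and replacement.isalpha():
                replacement += "zzzz"
            value = replacement
        result[key] = value
    return result


print(canonicalize_keywords({"ms": "imperial"}))  # {'ms': 'uksystem'}
print(canonicalize_keywords({"rg": "fi01"}))      # {'rg': 'axzzzz'}
```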

Optimizations

The above algorithm is a logical statement of the process, but would obviously not be directly suited to production code. Production-level code can use many optimizations for efficiency while achieving the same result. For example, the Alias Rules can be further preprocessed to avoid indefinite looping, instead doing a rule lookup once per subtag. As another example, the small number of Territory Exceptions can be preprocessed to avoid the likely subtags processing.
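
One possible form of such preprocessing, sketched in Python, is to resolve alias chains ahead of time so that each subtag needs a single table lookup at run time. The sketch assumes a simplified one-subtag-to-one-subtag alias table; the subtags shown are placeholders rather than real CLDR data, and `flatten_aliases` is a hypothetical helper, not part of any CLDR or ICU API.

```python
def flatten_aliases(aliases):
    """Resolve alias chains so that every key maps directly to its final value."""
    flat = {}
    for start in aliases:
        seen = {start}
        current = aliases[start]
        # Follow the chain until reaching a subtag that has no further alias.
        while current in aliases:
            if current in seen:
                raise ValueError(f"alias cycle involving {current!r}")
            seen.add(current)
            current = aliases[current]
        flat[start] = current
    return flat


# Placeholder subtags: "aaa" is an alias for "bbb", which is an alias for "ccc".
table = flatten_aliases({"aaa": "bbb", "bbb": "ccc"})
print(table)  # {'aaa': 'ccc', 'bbb': 'ccc'}
```

With the flattened table, canonicalizing a subtag becomes `table.get(subtag, subtag)`, and cycles in the data are detected once at build time instead of at every lookup.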


References

Ancillary Information
To properly localize, parse, and format data requires ancillary information, which is not expressed in Locale Data Markup Language. Some of the formats for values used in Locale Data Markup Language are constructed according to external specifications. The sources for this data and/or formats include the following:
[Bugs] CLDR Bug Reporting form
https://cldr.unicode.org/index/bug-reports
[Charts] The online code charts can be found at https://www.unicode.org/charts/
An index to character names with links to the corresponding chart is found at https://www.unicode.org/charts/charindex.html
[DUCET] The Default Unicode Collation Element Table (DUCET)
For the base-level collation, of which all the collation tables in this document are tailorings.
https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
[FAQ] Unicode Frequently Asked Questions
https://www.unicode.org/faq/
For answers to common questions on technical issues.
[FCD] As defined in UTN #5 Canonical Equivalences in Applications
https://www.unicode.org/notes/tn5/
[Glossary] Unicode Glossary
https://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[JavaChoice] Java ChoiceFormat
https://docs.oracle.com/javase/7/docs/api/java/text/ChoiceFormat.html
[Olson] The TZID Database (aka Olson timezone database)
Time zone and daylight savings information.
https://www.iana.org/time-zones
For archived data, see
ftp://ftp.iana.org/tz/releases/
[Reports] Unicode Technical Reports
https://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode] The Unicode Consortium, The Unicode Standard, Version 13.0.0
(Mountain View, CA: The Unicode Consortium, 2020. ISBN 978-1-936213-26-9)
https://www.unicode.org/versions/Unicode13.0.0/
[Versions] Versions of the Unicode Standard
https://www.unicode.org/versions/
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
[XPath] https://www.w3.org/TR/xpath/
Other Standards
Various standards define codes that are used as keys or values in Locale Data Markup Language. These include:
[BCP47] https://www.rfc-editor.org/rfc/bcp/bcp47.txt
The Registry
https://www.iana.org/assignments/language-subtag-registry
[ISO639] ISO Language Codes
https://www.loc.gov/standards/iso639-2/
Actual List
https://www.loc.gov/standards/iso639-2/langcodes.html
[ISO1000] ISO 1000: SI units and recommendations for the use of their multiples and of certain other units, International Organization for Standardization, 1992.
https://www.iso.org/iso/catalogue_detail?csnumber=5448
[ISO3166] ISO Region Codes
https://www.iso.org/iso-3166-country-codes.html
Actual List
https://www.iso.org/obp/ui/#search
[ISO4217] ISO Currency Codes
https://www.iso.org/iso-4217-currency-codes.html
(Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency information available.)
[ISO8601] ISO Date and Time Format
https://www.iso.org/iso-8601-date-and-time-format.html
[ISO15924] ISO Script Codes
https://www.unicode.org/iso15924/index.html
Actual List
https://www.unicode.org/iso15924/codelists.html
[LOCODE] United Nations Code for Trade and Transport Locations, commonly known as "UN/LOCODE"
https://unece.org/trade/uncefact/unlocode
Download at: https://unece.org/trade/cefact/UNLOCODE-Download
[RFC6067] BCP 47 Extension U
https://www.ietf.org/rfc/rfc6067.txt
[RFC6497] BCP 47 Extension T - Transformed Content
https://www.ietf.org/rfc/rfc6497.txt
[UNM49] UN M.49: UN Statistics Division
Country or area & region codes
https://unstats.un.org/unsd/methods/m49/m49.htm
Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings
https://unstats.un.org/unsd/methods/m49/m49regin.htm
[XML Schema] W3C XML Schema
https://www.w3.org/XML/Schema
General
The following are general references from the text:
[ByType] CLDR Comparison Charts
https://cldr.unicode.org/index/charts
[Calendars] Calendrical Calculations: The Millennium Edition by Edward M. Reingold, Nachum Dershowitz; Cambridge University Press; Book and CD-ROM edition (July 1, 2001); ISBN: 0521777526. Note that the algorithms given in this book are copyrighted.
[Comparisons] Comparisons between locale data from different sources
https://www.unicode.org/cldr/charts/latest/by_type/index.html
[CurrencyInfo] UNECE Currency Data
https://www.iso.org/iso-4217-currency-codes.html
[DataFormats] CLDR Translation Guidelines
https://cldr.unicode.org/translation
[Example] A sample in Locale Data Markup Language
https://www.unicode.org/cldr/dtd/1.1/ldml-example.xml
[ICUCollation] ICU rule syntax
https://unicode-org.github.io/icu/userguide/collation/customization/
[ICUTransforms] Transforms
https://unicode-org.github.io/icu/userguide/transforms/
Transforms Demo
https://icu4c-demos.unicode.org/icu-bin/translit
[ICUUnicodeSet] ICU UnicodeSet
https://unicode-org.github.io/icu/userguide/strings/unicodeset.html
API
https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.html
[ITUE164] International Telecommunication Union: List Of ITU Recommendation E.164 Assigned Country Codes
available at https://www.itu.int/opb/publications.aspx?parent=T-SP&view=T-SP2
[LocaleExplorer] ICU Locale Explorer
https://icu4c-demos.unicode.org/icu-bin/locexp
[LocaleProject] Common Locale Data Repository Project
https://cldr.unicode.org
[NamingGuideline] OpenI18N Locale Naming Guideline
formerly at https://www.openi18n.org/docs/text/LocNameGuide-V10.txt
[RBNF] Rule-Based Number Format
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1RuleBasedNumberFormat.html
[RBBI] Rule-Based Break Iterator
https://unicode-org.github.io/icu/userguide/boundaryanalysis/
[UCAChart] Collation Chart
https://www.unicode.org/charts/collation/
[UTCInfo] NIST Time and Frequency Division Home Page
https://www.nist.gov/pml/time-and-frequency-division
U.S. Naval Observatory: What is Universal Time?
https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/Definitions-of-Systems-of-Time/
[WindowsCulture] Windows Culture Info (with mappings from [BCP47]-style codes to LCIDs)
https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo?view=net-6.0

Acknowledgments

This section is now in a separate part, Acknowledgments.

Modifications

This section is now in a separate part, Modifications.


© 2001–2025 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.

Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.

