Copyright © 2020 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document provides definitions and best practices related to the identification of the natural language of content in document formats, specifications, and implementations on the Web. It describes how language tags are used to indicate a user's locale preferences which, in turn, are used to process, format, and display information to the user.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This is an updated Public Working Draft of "Language Tags and Locale Identifiers for the World Wide Web". The Working Group expects this to become a Working Group Note.
If you wish to make comments regarding this document, please raise a GitHub issue. You may also send email to the list www-international@w3.org (subscribe, archives) as mentioned below. Please include [ltli] at the start of your email's subject. To make it easier to track comments, please raise separate issues or send separate emails for each comment. All comments are welcome.
This document was published by the Internationalization Working Group as a Working Draft.
GitHub Issues are preferred for discussion of this specification.
Publication as a Working Draft does not imply endorsement by the W3C Membership.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 1 August 2017 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 15 September 2020 W3C Process Document.
Language tags and locales are some of the fundamental building blocks of internationalization (i18n) of the Web. In this document you will find definitions for much of the basic terminology related to this aspect of internationalization.
This document also provides terminology and best practices needed by specification authors for the identification of natural language values in document formats or protocols and which are recommended by the Internationalization (I18N) Working Group. These (and many other) best practices, along with links to supporting materials, can also be found in the Internationalization Best Practices for Spec Developers [INTERNATIONAL-SPECS]. In addition to the best practices found here, additional best practices relating to language metadata on the Web can be found in [STRING-META].
In this document [RFC2119] keywords have their usual meaning. Best practices and definitions are set off from the remainder of the text with special formatting.
Best practices appear with a different background color and decoration like this.
Definitions appear with a different background color and decoration like this.
Gaps or recommendations for future work appear with a different background color and decoration like this.
Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [BCP47]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.
Many of the core standards for the Web include support for language tags; these include the xml:lang attribute in [XML10], the lang and hreflang attributes in [HTML], the language property in [XSL10], the :lang pseudo-class in CSS [CSS3-SELECTORS], and many others, including SVG, TTML, SSML, etc.
Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.
There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [BCP47]. "BCP" nomenclature refers to the current set of IETF RFCs that form the "best current practice".
Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [BCP47] language tag. These language tags consist of one or more subtags.
Specifications for the Web that require language identification MUST refer to [BCP47].
Specifications SHOULD NOT refer to specific component RFCs of [BCP47].
[BCP47] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [RFC5646], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [RFC4647], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.
Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.
While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [RFC4646], referring to the BCP will not incur additional compliance risk to most implementations.
Specifications MUST NOT reference obsolete versions of [BCP47], such as [RFC1766] or [RFC3066].
Specifications that need to preserve compatibility with obsolete versions of [BCP47] MUST reference the production obs-language-tag in [BCP47].
Beginning with [RFC4646], [BCP47] defined a more complex, machine-readable syntax for language tags. This syntax is stable and is not expected to change in the foreseeable future. Some specifications might desire or require compatibility with the older language tag grammar found in previous versions of BCP47 (specifically [RFC1766] and [RFC3066]). This grammar was more permissive and is described in [BCP47] as the ABNF production obs-language-tag. [RFC4646], which introduced the current grammar for language tags, was replaced by [RFC5646] as part of the current [BCP47].
Applications that provide language information as part of URIs (e.g. in the realm of RDF) SHOULD use [BCP47].
Currently, URIs expressing language information often use values from parts of ISO 639. This leads to situations in which there are ambiguities about what the proper value should be, e.g. for German de from ISO 639-1 or ger from ISO 639-2. By using BCP 47 and its language subtag registry, such ambiguities can be avoided, e.g. for German, the registry contains only de.
Subtag. A sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [BCP47], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
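The subtag definition above can be illustrated with a minimal Python sketch. The function name split_subtags is hypothetical, and the check covers only the general shape described in the definition (ASCII letters or digits, at most eight characters); it does not apply the additional per-position length restrictions found in [BCP47].

```python
import re

# General subtag shape from the definition: 1 to 8 ASCII letters or digits.
SUBTAG_SHAPE = re.compile(r"[A-Za-z0-9]{1,8}")


def split_subtags(tag: str) -> list[str]:
    """Split a language tag on hyphen-minus and check each piece's shape.

    Hypothetical helper: it does not check the position-specific rules
    that [BCP47] applies to language, script, region, and other subtags.
    """
    subtags = tag.split("-")
    for subtag in subtags:
        if not SUBTAG_SHAPE.fullmatch(subtag):
            raise ValueError(f"not a subtag: {subtag!r}")
    return subtags


print(split_subtags("de-Latn-DE-1996"))  # ['de', 'Latn', 'DE', '1996']
```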
Selecting content or behavior based on the language tag requires a few additional concepts defined by [BCP47] (in [RFC4647]). In this document, we adopt the following terminology taken directly from [BCP47]:
IANA Language Subtag Registry. A machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry)
Specifications SHOULD NOT reference [BCP47]'s underlying standards that contribute to the IANA Language Subtag Registry, such as ISO639, ISO15924, ISO3166, or UN M.49.
Some standards might directly consume one of [BCP47]'s contributory standards, in which case a reference is wholly appropriate. However, in most cases, the purpose of the reference is to specify a valid list of codes and their meanings. [BCP47]'s subtag registry is stabilized and resolves ambiguity in a number of useful ways and so should be the preferred source for this type of reference.
[BCP47] defines two different levels of conformance. See classes of conformance in [BCP47] for specifics. For language tags, the levels of conformance correspond to the type of checking that an implementation applies to language tag values.
Well-formed language tag. A language tag that follows the grammar defined in [BCP47]. That is, it is structurally correct, consisting of ASCII letter and digit subtags of the prescribed length, separated by hyphens.
Valid language tag. A language tag that is well-formed and which also conforms to the additional conformance requirements in [BCP47], notably that each of the subtags appears in the IANA Language Subtag Registry.
Specifications SHOULD require that language tags be well-formed.
Specifications MAY require that language tags be valid.
Specifications SHOULD require that content authors use valid language tags.
Note that this is stricter than what is recommended for implementations.
Content validators SHOULD check if content uses valid language tags where feasible.
Checking if a tag is valid requires access to or a copy of the registry plus additional runtime logic. While content authors are advised to choose, generate, and exchange only valid values, language tag matching and other common language tag operations are designed so that validity checking is not needed. Features or functions that need to understand the specific semantic content of subtags are the main reason that a specification would normatively require valid tags as part of the protocol or document format.
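The difference between the two conformance levels can be sketched in Python as follows. This is an illustration under loud assumptions: SIMPLE_TAG covers only tags of the form language[-script][-region], not the full [BCP47] grammar, and SAMPLE_REGISTRY is a tiny hand-picked stand-in for the IANA Language Subtag Registry. All names are hypothetical.

```python
import re

# ASSUMPTION: a deliberately simplified shape, language[-script][-region].
# The real grammar also allows variants, extensions, private use, and more.
SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

# ASSUMPTION: a tiny stand-in for the registry, stored in lowercase.
SAMPLE_REGISTRY = {
    "language": {"de", "en", "fr", "zh"},
    "script": {"latn", "hans", "hant"},
    "region": {"de", "us", "cn", "419"},
}


def is_well_formed(tag: str) -> bool:
    """Structure only: no registry access is needed."""
    return SIMPLE_TAG.fullmatch(tag) is not None


def is_valid(tag: str) -> bool:
    """Structure plus a lookup of every subtag in the (sample) registry."""
    match = SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return False
    return all(
        value is None or value.lower() in SAMPLE_REGISTRY[kind]
        for kind, value in match.groupdict().items()
    )
```

With this sample data, is_well_formed("xx-Wxyz-YY") is true while is_valid("xx-Wxyz-YY") is false, which shows why validity checking needs registry data while well-formedness checking does not.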
Language tag extension or extension. A system of additional [BCP47] subtags introduced by a single letter or digit subtag registered with IANA and permitting additional types of language identification.
Specifications MAY reference registered extensions to [BCP47] as necessary.
In particular, [RFC6067] defines the BCP 47 Extension U, also known as "Unicode Locales". This extension to [BCP47] provides additional subtag sequences for selecting specific locale variations.
Specifications SHOULD NOT restrict the length of language tags or permit or encourage the removal of extensions.
Language range. A string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".
Language priority list. A collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [RFC2616] Accept-Language [RFC3282] header is an example of one kind of language priority list.
Basic language range. A language range consisting of a sequence of subtags separated by hyphens. That is, it is identical in appearance to a language tag.
Extended language range. A language range consisting of a sequence of hyphen-separated subtags. In an extended language range, a subtag can either be a valid subtag or the wildcard subtag *, which matches any value.
Basic versus extended language range and language priority list
The string de-de is a basic language range. It matches, for example, the language tag de-DE-1996, but not the language tag de-Deva.
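The basic language range example above can be reproduced with a short Python sketch of the basic filtering comparison described in [RFC4647]: a range matches a tag when the two are equal ignoring case, or when the range is a prefix of the tag that ends at a subtag boundary. The function name is hypothetical.

```python
def basic_range_matches(language_range: str, tag: str) -> bool:
    """Case-insensitive prefix match that respects subtag boundaries."""
    range_lower = language_range.lower()
    tag_lower = tag.lower()
    if range_lower == "*":
        # The bare wildcard range matches any tag.
        return True
    return tag_lower == range_lower or tag_lower.startswith(range_lower + "-")


print(basic_range_matches("de-de", "de-DE-1996"))  # True
print(basic_range_matches("de-de", "de-Deva"))     # False
```

The trailing hyphen in the prefix test is what prevents de-de from matching de-Deva: the comparison has to stop at a whole subtag.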
The string de-*-DE is an extended language range. It matches all of the following tags:
de-DE
de-DE-x-goethe
de-Latn-DE-1996
"en; fr; zh-Hant" is a language priority list. It would be read as "English before French before Chinese as written in the Traditional script". Note that the syntax shown is only an example, since it depends on the protocol, application, or implementation that uses the list.
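The extended language range example above (de-*-DE) can be checked with a sketch of the extended filtering comparison from [RFC4647]. The function name is hypothetical and the code is an illustration of that algorithm's steps, not a reference implementation.

```python
def extended_range_matches(language_range: str, tag: str) -> bool:
    """Compare an extended language range to a tag, subtag by subtag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must match unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # a wildcard consumes nothing by itself
        elif t >= len(tag_subtags):
            return False                # range still has subtags, tag does not
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton such as x
        else:
            t += 1                      # skip a non-matching tag subtag
    return True


for candidate in ("de-DE", "de-DE-x-goethe", "de-Latn-DE-1996"):
    print(candidate, extended_range_matches("de-*-DE", candidate))  # all True
```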
Some language priority lists, such as the Accept-Language [RFC3282] header mentioned earlier, provide "weights" for values appearing in the list. Such weighting cannot be depended on for anything other than ordering the list.
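As an illustration of using weights only for ordering, the following Python sketch turns an Accept-Language style value into an ordered language priority list. The function name is hypothetical. Two choices here are assumptions of this sketch rather than requirements stated in this document: ranges with equal weight keep their original order, and ranges with a weight of zero are dropped.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language ranges in the header, most preferred first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        weight = 1.0  # a range without a q parameter gets the default weight
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    weight = float(value)
                except ValueError:
                    weight = 0.0  # ASSUMPTION: unparseable weights are dropped
        if weight > 0:
            # Sort by descending weight, then by original position.
            weighted.append((-weight, position, language_range))
    weighted.sort()
    return [language_range for _, _, language_range in weighted]


print(parse_accept_language("fr;q=0.5, en, zh-Hant;q=0.8"))
# ['en', 'zh-Hant', 'fr']
```

The numeric weights are discarded once the list is ordered, which matches the point above that they carry no meaning beyond ordering.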
Specifications that define language tag matching or language negotiation MUST specify whether language ranges used are a basic language range or an extended language range.
Specifications that define language tag matching MUST specify whether the results of a matching operation contain a single result (lookup as defined in [RFC4647]), or a possibly-empty (zero or more) set of results (filtering as defined in [RFC4647]).
Specifications that define language tag matching MUST specify the matching algorithms available and the selection mechanism.
For example, JavaScript internationalization [ECMA-402] and [CLDR] provide a "best fit" algorithm which can be tailored by implementers.
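The distinction between lookup and filtering can be sketched in Python as below. Both functions are hypothetical illustrations modeled on the [RFC4647] descriptions: lookup progressively truncates each range and returns exactly one tag (or a default), while filtering returns every matching tag and may return none. Neither is the tailorable "best fit" algorithm mentioned above.

```python
from typing import Optional


def lookup(priority_list: list[str], available: list[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the single best match, truncating ranges from the end."""
    by_lowercase = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lowercase:
                return by_lowercase[candidate]
            subtags.pop()
            # A single-character subtag left at the end is removed as well.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


def filter_tags(priority_list: list[str], available: list[str]) -> list[str]:
    """Return every available tag matched by any basic language range."""
    matches = []
    for language_range in priority_list:
        prefix = language_range.lower()
        for tag in available:
            lowered = tag.lower()
            matched = (prefix == "*" or lowered == prefix
                       or lowered.startswith(prefix + "-"))
            if matched and tag not in matches:
                matches.append(tag)
    return matches


AVAILABLE = ["en", "de", "de-CH", "fr-CA"]
print(lookup(["de-CH-1996", "fr"], AVAILABLE))  # de-CH
print(filter_tags(["de"], AVAILABLE))           # ['de', 'de-CH']
```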
This section defines basic terminology related to internationalization and localization.
Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.
Language tags can also be used to identify international preferences associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, an identifier for these preferences is usually called a locale. The extensions to [BCP47] that define Unicode locales [CLDR] provide the basis for internationalization APIs on the Web; notably, the JavaScript language [ECMASCRIPT] uses Unicode locales as the basis for the APIs found in [ECMA-402].
International Preferences. A user's particular set of language and formatting preferences and associated cultural conventions. Software can use these preferences to correctly process or present information exchanged with that user.
Many kinds of international preference may be offered on the Web in order for content or a service to be considered usable and acceptable by users around the world. Some of these preferences might include:
Internationalization. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated i18n because there are eighteen letters between the "I" and the "N" in the English word.
Localization. The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as l10n because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized.
Locale. An identifier (such as a language tag) for a set of international preferences. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.
Locale-aware (or Enabled). A system that can respond to changes in the locale with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.
Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.
Historically, locales were associated with and specific to the programming language or operating environment of the user. These application-specific identifiers often could be inferred from or converted into language tags. Some examples of locale models include Java's java.util.Locale, POSIX (with identifiers such as de_CH@utf8), Oracle databases (AMERICAN_AMERICA.AL32UTF8), or Microsoft's LCIDs (which used numeric codes such as 0x0409). The relationship between several of these models, the underlying standards such as ISO639 or ISO3166, and early language tags (such as [RFC1766]) was entirely intentional. Implementations often mapped (and continue to map) language tags from an existing protocol, such as HTTP's Accept-Language header, to proprietary or platform-specific locale models.
Since the adoption of the current [BCP47] identifier syntax, a number of locale models have adopted BCP47 directly or provided adaptation or mappings between proprietary models and language tags. Notably, the development and adoption of the open-source repository of locale data known as [CLDR] has led to wider general adoption of language tags as locale identifiers.
Common Locale Data Repository (or [CLDR]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.
Unicode Locale Identifier or Unicode Locale. A language tag that follows the additional rules and restrictions on subtag choice defined in UTR#35 [LDML]. Any valid Unicode locale identifier is also a valid [BCP47] language tag, but a few valid language tags are not also valid Unicode locale identifiers.
Canonical Unicode locale identifier. A well-formed language tag resulting from the application of the Unicode locale identifier canonicalization rules found in [LDML] (see Section 3). This process converts any valid [BCP47] language tag into a valid Unicode locale identifier. For example, deprecated subtags or irregular grandfathered tags are replaced with their preferred value from the IANA language subtag registry.
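One step of canonicalization, the replacement of deprecated subtags and grandfathered tags by their preferred values, can be sketched in Python. This is only a fragment of the process described in [LDML]. The function name and the two tables are hypothetical; the tables are a small hand-copied excerpt of registry mappings, whereas a real implementation would load the registry or CLDR data.

```python
# ASSUMPTION: a hand-copied excerpt, not the full set of registry mappings.
SAMPLE_GRANDFATHERED = {
    "i-klingon": "tlh",
    "art-lojban": "jbo",
    "zh-min-nan": "nan",
}
SAMPLE_DEPRECATED_LANGUAGES = {
    "iw": "he",
    "in": "id",
    "ji": "yi",
}


def replace_deprecated(tag: str) -> str:
    """Replace a grandfathered tag or deprecated primary language subtag."""
    lowered = tag.lower()
    if lowered in SAMPLE_GRANDFATHERED:
        # Grandfathered tags are replaced as a whole.
        return SAMPLE_GRANDFATHERED[lowered]
    subtags = tag.split("-")
    preferred = SAMPLE_DEPRECATED_LANGUAGES.get(subtags[0].lower())
    if preferred is not None:
        subtags[0] = preferred
    return "-".join(subtags)


print(replace_deprecated("iw-IL"))      # he-IL
print(replace_deprecated("i-klingon"))  # tlh
```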
[CLDR] defines and maintains two language tag extensions ([RFC6067] and [RFC6497]) that are related to Unicode locale identifiers. These extensions allow a language tag to express some international preference variations that go beyond linguistic or regional variation or to select formatting behavior or content when there are multiple options or user preferences within a given locale. Unicode locale identifiers are not required to include these extensions: they are only used when the locale being identified requires additional tailoring provided by one of these extensions. [CLDR] also applies specific interpretation of certain subtags when used as a locale identifier. See Section 3.2 of [LDML] for details.
The Unicode locale language tag extension [RFC6067] uses the -u- subtag, and provides subtags for selecting different locale-based formats and behaviors. See Section 3.6 of [LDML] for details.
The transformed content language tag extension [RFC6497], which uses the -t- subtag, provides subtags for text transformations, such as transliteration between scripts. See Section 3.7 of [LDML] for details.
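To show how the -u- extension carries its settings, here is a simplified Python sketch that extracts the keyword and type pairs from a language tag. The function name is hypothetical. The sketch assumes a well-formed tag, treats every two-character subtag inside the extension as a key, and ignores any attributes that appear before the first key.

```python
def unicode_extension_keywords(tag: str) -> dict[str, str]:
    """Return the -u- extension keywords of a tag, e.g. {'nu': 'deva'}."""
    collected: dict[str, list[str]] = {}
    in_extension = False
    key = None
    for subtag in tag.lower().split("-")[1:]:
        if len(subtag) == 1:
            # A singleton starts a new extension (or private use with x).
            if subtag == "x" or in_extension:
                break
            in_extension = subtag == "u"
            continue
        if not in_extension:
            continue
        if len(subtag) == 2:
            key = subtag                 # two-character subtags are keys
            collected[key] = []
        elif key is not None:
            collected[key].append(subtag)  # longer subtags are type values
    return {name: "-".join(values) for name, values in collected.items()}


print(unicode_extension_keywords("hi-IN-u-nu-deva"))   # {'nu': 'deva'}
print(unicode_extension_keywords("th-u-ca-buddhist"))  # {'ca': 'buddhist'}
```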
Unicode Locales increasingly form the basis for internationalization on the Web, particularly as part of the Intl locale framework [ECMA-402] in JavaScript [ECMASCRIPT].
Content authors SHOULD choose language tags that are canonical Unicode locale identifiers.
The additional content restrictions and normalization steps found in Section 3 of [LDML] provide for better interoperability and consistency than that afforded by [BCP47] directly.
Implementations SHOULD only emit language tags that are canonical Unicode locale identifiers and SHOULD normalize language tags that they consume using the rules for producing canonical tags.
As above, the additional content restrictions and normalization steps found in Section 3 of [LDML] provide for better interoperability and consistency than that afforded by [BCP47] directly. This best practice should not be interpreted as meaning that implementations need to support, generate, process, or understand either of [CLDR]'s extensions.
Content authors SHOULD NOT include language tag extensions in a language tag unless the specific application requires the additional tailoring.
It is important to remember that every Unicode locale identifier is also a well-formed [BCP47] language tag. Unicode locale identifiers do not require the use of either of [CLDR]'s language tag extensions.
Some international and cultural preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.
Here are a few selected examples of Unicode Locale identifiers and the variations associated with them.
In the first example, the value 123456789.5678 is formatted using the locale rules represented by the various language tags. Notice how the u extension and its nu keyword are used to select between Latin and Devanagari digit shapes in the Hindi-as-used-in-India (hi-IN) locale and between Latin and Arabic script digit shapes in the Arabic (ar) locale.
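Variation Type | Value | Locale | Formatted Value |
---|---|---|---|
Numbering System | 123456789.5678 | en-US | 123,456,789.5678 |
Numbering System | 123456789.5678 | de | 123.456.789,5678 |
Numbering System | 123456789.5678 | hi-IN-u-nu-latn | 12,34,56,789.5678 |
Numbering System | 123456789.5678 | hi-IN-u-nu-deva | १२,३४,५६,७८९.५६७८ |
Numbering System | 123456789.5678 | ar-u-nu-latn | 123,456,789.5678 |
Numbering System | 123456789.5678 | ar-u-nu-arab | ١٢٣٬٤٥٦٬٧٨٩٫٥٦٧٨ |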
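The digit-shape part of the numbering system selection shown in the table can be sketched in Python. The table name and function are hypothetical, and the sketch covers digit substitution only: it does not compute grouping, and it does not change the separators, which the table shows also differ for the arab numbering system.

```python
# Digit sets keyed by numbering system identifier, built from code points.
DIGITS_BY_NUMBERING_SYSTEM = {
    "latn": "0123456789",
    "deva": "".join(chr(0x0966 + i) for i in range(10)),  # Devanagari digits
    "arab": "".join(chr(0x0660 + i) for i in range(10)),  # Arabic-Indic digits
}


def apply_numbering_system(formatted: str, numbering_system: str) -> str:
    """Swap Latin digits in an already-formatted number for another set."""
    target = DIGITS_BY_NUMBERING_SYSTEM[numbering_system]
    mapping = str.maketrans(DIGITS_BY_NUMBERING_SYSTEM["latn"], target)
    return formatted.translate(mapping)


# Grouping is taken as given (here, the hi-IN grouping from the table).
print(apply_numbering_system("12,34,56,789.5678", "deva"))
```

A production formatter would take the grouping pattern, separators, and digits from locale data such as [CLDR] rather than from a hand-written table.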
In the second example, the date value corresponding to 11 July 2020 on the Gregorian calendar is formatted using various different locales. Here, for example, the language tag for Thai (th) is extended to select between the Gregorian (-u-ca-gregory) and Thai Buddhist (-u-ca-buddhist) calendar systems. Other examples show the Japanese Imperial calendar and one type of Islamic calendar. Notice in the last example that the calendar is not restricted to a specific locale: here we show the Islamic calendar system in an English locale.
Variation Type | Value | Locale | Formatted Value |
---|---|---|---|
Calendar | 2020-07-11T12:00:00Z | th-u-ca-gregory | 11 ก.ค. 2020 |
Calendar | 2020-07-11T12:00:00Z | th-u-ca-buddhist | 11 ก.ค. 2563 |
Calendar | 2020-07-11T12:00:00Z | ja-u-ca-japanese | 令和2年7月11日 |
Calendar | 2020-07-11T12:00:00Z | ar-u-ca-islamic | ٢٠ ذو القعدة ١٤٤١ هـ |
Calendar | 2020-07-11T12:00:00Z | en-u-ca-islamic | Dhuʻl-Q. 20, 1441 AH |
Non-linguistic Field. Any element of a data structure not intended for the storage or interchange of natural language textual data. This includes non-string data types, such as booleans, numbers, dates, and so forth. It also includes strings, such as program or protocol internal identifiers. This document uses the term field as a shorthand for this concept.
Specifications for document formats or protocols usually define the exchange, processing, or display of various data values or data structures. The Web primarily relies on text files for the serialization and exchange of data: even raw bytes are usually transmitted using a string serialization such as base64. Thus non-linguistic fields on the Web are also normally made up of strings. The important distinction here is that non-linguistic fields are generally interpreted by or meant for consumption by the underlying application, rather than by a user.
Locale-neutral. A non-linguistic field is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale-aware way.
Many specifications use a serialization scheme, such as those provided by [XMLSCHEMA11-2] or [JSON-LD], to provide a locale-neutral encoding of non-linguistic fields in document formats or protocols.
A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. For example, many of the ISO8601 date/time value serializations are linked to the Gregorian calendar, but the format, field order, separators, and visual appearance are not specifically suitable to any locale (they are intended to be machine readable) and, as shown in the example above, the value can be converted for display into any calendar or locale.
Suppose your application needs to collect and store some value in a field. The system can use a locale-neutral format for storing and exchanging the value. For instance, schema languages such as [XMLSCHEMA11-2] or data formats such as [JSON] provide ready made types for this purpose. When the user is entering or editing the value, however, the user expects to interact with a more human friendly format. For example, if your application needed to input a user's birth date and the value they were trying to enter were 2020-01-31:
The input field might look like this in HTML:
<input type="date" id="birthDate" value="2020-01-31" lang=… >
The lang attribute here should control the display and formatting of the value, including the expected input pattern. Note that this guidance is at odds with what browsers did at the time this document was published.
Value | Language Tag | Display | Input Format Pattern |
---|---|---|---|
2020-01-31 | en-GB | 31/01/2020 | dd/MM/yyyy |
2020-01-31 | en-US | 01/31/2020 | MM/dd/yyyy |
2020-01-31 | fr-FR | 31-01-2020 | dd-MM-yyyy |
2020-01-31 | zh-Hans-CN | 2020-01-31 | yyyy-MM-dd |
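The round trip between the locale-neutral stored value and the localized display can be sketched in Python. The DISPLAY_FORMATS table is hypothetical: it simply restates the four patterns from the table above in strftime notation, where a real system would take its patterns from locale data such as [CLDR].

```python
from datetime import date, datetime

# ASSUMPTION: hand-written patterns mirroring the table above.
DISPLAY_FORMATS = {
    "en-GB": "%d/%m/%Y",
    "en-US": "%m/%d/%Y",
    "fr-FR": "%d-%m-%Y",
    "zh-Hans-CN": "%Y-%m-%d",
}


def display_date(stored_value: str, language_tag: str) -> str:
    """Format a locale-neutral ISO 8601 date for the given language tag."""
    value = date.fromisoformat(stored_value)
    return value.strftime(DISPLAY_FORMATS[language_tag])


def parse_user_input(text: str, language_tag: str) -> str:
    """Convert localized user input back to the locale-neutral form."""
    parsed = datetime.strptime(text, DISPLAY_FORMATS[language_tag])
    return parsed.date().isoformat()


print(display_date("2020-01-31", "en-GB"))      # 31/01/2020
print(parse_user_input("01/31/2020", "en-US"))  # 2020-01-31
```

Only the display and input layers depend on the language tag; the value that is stored or exchanged stays in the same form for every user.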
Language negotiation. The process of matching a user's international preferences to available locales, localized resources, content, or processing.
Locale fallback. The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.
A user's preferences are usually expressed as a locale or prioritized list of locales. When negotiating the language, the system follows some sort of algorithm to get the best matching content or functionality from the available resources. In many cases the language negotiation algorithm uses locale fallback.
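Locale fallback for individual resources can be sketched in Python as follows. The function names, the bundle layout, and the choice of en as the final default are all assumptions made for illustration. The chain is built by simple truncation, which suits tags made of language, script, and region subtags; tags with extensions would need the singleton handling that lookup in [RFC4647] describes.

```python
def fallback_chain(locale: str) -> list[str]:
    """List a locale followed by its progressively more general parents."""
    subtags = locale.split("-")
    return ["-".join(subtags[:count]) for count in range(len(subtags), 0, -1)]


def resolve(key: str, locale: str, bundles: dict[str, dict[str, str]],
            default_locale: str = "en") -> str:
    """Find a resource by walking from specific to general bundles."""
    for candidate in fallback_chain(locale) + [default_locale]:
        bundle = bundles.get(candidate, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)


# Hypothetical resource bundles: en-GB overrides only what differs from en.
BUNDLES = {
    "en": {"greeting": "Hello", "colour_label": "Color"},
    "en-GB": {"colour_label": "Colour"},
}

print(fallback_chain("zh-Hant-TW"))                # ['zh-Hant-TW', 'zh-Hant', 'zh']
print(resolve("greeting", "en-GB", BUNDLES))       # Hello
print(resolve("colour_label", "en-GB", BUNDLES))   # Colour
```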
Specifications that present fields in a document format SHOULD require that data is formatted according to the language of the surrounding content.
When non-linguistic fields are presented to the user as part of a document or application, the document or application forms the "context" where the data is being viewed. Content authors or application developers need a way to make the fields seem like a natural part of the experience and need a way to control the presentation. This is indicated by the language tag of the context in which the content appears: usually enabled implementations interpret the tag as a locale in order to accomplish this. Using the runtime locale or localization of the user-agent as the locale for presenting non-linguistic fields should only be a last resort.
Specifications that present forms or receive input of non-linguistic fields in a document format or application SHOULD require that the values be presented to the user localized in the format of the language of the content or markup immediately surrounding the value.
Specifications that present, exchange, or allow the input of non-linguistic fields MUST use a locale-neutral format for storage and interchange.
Implementations SHOULD present non-linguistic fields in a document format or application using a format consistent with the language of the surrounding content and are encouraged to provide controls which are localized to the same locale for input or editing.
Users expect form fields and other data inputs to use a presentation for non-linguistic fields that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.
The Internationalization WG has additional best practices and other references, such as articles on language tag choice. These include:
Changes to this document following the Working Draft of 2015-04-23 are available via the GitHub commit log. This document was significantly restructured since that revision. Notably:
The following changes were made since the revision of 2006-06-20.
The following log records changes that have been made to this document since the publication in April 2006.
The informative introductory section has been rewritten thoroughly, including the description of the scope of the document, of application scenarios and of the separation of locale versus natural language.
Terms which rely on [BCP47] are no longer defined here, but only reference these documents. In addition, examples for these terms were created.
The requirements for language and locale values have been taken out of the conformance section and are now placed in the body of the document.
A revision log has been created.
The Internationalization Working Group would like to acknowledge the following contributors to this specification: