IETF language tag

From Wikipedia, the free encyclopedia

Code to identify human languages

An IETF BCP 47 language tag is a standardized code that is used to identify human languages on the Internet.^[1] The tag structure has been standardized by the Internet Engineering Task Force (IETF)^[1] in Best Current Practice (BCP) 47;^[1] the subtags are maintained by the IANA Language Subtag Registry.^[2]^[3]^[4]

To distinguish language variants for countries, regions, or writing systems (scripts), IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49. For example, the tag en stands for English; es-419 for Latin American Spanish; rm-sursilv for Romansh Sursilvan; sr-Cyrl for Serbian written in Cyrillic script; nan-Hant-TW for Min Nan Chinese using traditional Han characters, as spoken in Taiwan; yue-Hant-HK for Cantonese using traditional Han characters, as spoken in Hong Kong; and gsw-u-sd-chzh for Zürich German.

These tags are used by computing standards such as HTTP,^[5]^: §8.5.1 HTML,^[6] XML^[7] and PNG.^[8]

History

IETF language tags were first defined in RFC 1766,^[9] edited by Harald Tveit Alvestrand and published in March 1995. The tags used ISO 639 two-letter language codes and ISO 3166 two-letter country codes, and allowed registration of whole tags that included variant or script subtags of three to eight letters.

In January 2001, this was updated by RFC 3066,^[10] which added the use of ISO 639-2 three-letter codes, permitted subtags with digits, and adopted the concept of language ranges from HTTP/1.1 to help with matching of language tags.

The next revision of the specification came in September 2006 with the publication of RFC 4646^[11] (the main part of the specification), edited by Addison Phillips and Mark Davis, and RFC 4647^[12] (which deals with matching behaviour). RFC 4646 introduced a more structured format for language tags, added the use of ISO 15924 four-letter script codes and UN M.49 three-digit geographical region codes, and replaced the old registry of tags with a new registry of subtags. The small number of previously defined tags that did not conform to the new structure were grandfathered in order to maintain compatibility with RFC 3066.

The current version of the specification, RFC 5646,^[13] was published in September 2009. The main purpose of this revision was to incorporate three-letter codes from ISO 639-3 and 639-5 into the Language Subtag Registry, in order to increase the interoperability between ISO 639 and BCP 47.^[14]

Syntax of language tags

Each language tag is composed of one or more "subtags" separated by hyphens (-). Each subtag is composed of basic Latin letters or digits only.

With the exceptions of private-use language tags beginning with an x- prefix and grandfathered language tags (including those starting with an i- prefix and those previously registered in the old Language Tag Registry), subtags occur in the following order:

A single primary language subtag based on a two-letter language code from ISO 639-1 (2002) or a three-letter code from ISO 639-2 (1998), ISO 639-3 (2007) or ISO 639-5 (2008), or registered through the BCP 47 process and composed of five to eight letters;
Up to three optional extended language subtags composed of three letters each, separated by hyphens; (There is currently no extended language subtag registered in the Language Subtag Registry without an equivalent and preferred primary language subtag. This component of language tags is preserved for backwards compatibility and to allow for future parts of ISO 639.)
An optional script subtag, based on a four-letter script code from ISO 15924 (usually written in Title Case);
An optional region subtag based on a two-letter country code from ISO 3166-1 alpha-2 (usually written in upper case), or a three-digit code from UN M.49 for geographical regions;
Optional variant subtags, separated by hyphens, each composed of five to eight letters or digits, or of four characters starting with a digit; (Variant subtags are registered with IANA and not associated with any external standard.)
Optional extension subtags, separated by hyphens, each composed of a single character, with the exception of the letter x, and a hyphen followed by one or more subtags of two to eight characters each, separated by hyphens;
An optional private-use subtag, composed of the letter x and a hyphen followed by subtags of one to eight characters each, separated by hyphens.
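
The ordering above can be sketched as a regular expression. The following Python example is a simplified illustration, not a full validator: the names LANGTAG and parse_language_tag are made up for this sketch, grandfathered tags are deliberately not handled, the pattern checks only the shape of each subtag (not whether it is registered), and variant lengths follow the RFC 5646 grammar, which allows letters and digits.

```python
import re

# Simplified pattern for regular tags; grandfathered tags are NOT covered.
LANGTAG = re.compile(
    r"""
    (?:
        (?P<language>[a-z]{2,3}|[a-z]{5,8})                      # primary language
        (?P<extlang>(?:-[a-z]{3}){0,3})                          # up to three extended language subtags
        (?:-(?P<script>[a-z]{4}))?                               # script (ISO 15924)
        (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                      # region (ISO 3166-1 or UN M.49)
        (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)   # variants
        (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)     # extensions (singleton is not x)
        (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?               # trailing private use
    |
        (?P<privateonly>x(?:-[a-z0-9]{1,8})+)                    # tag that is entirely private use
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_language_tag(tag):
    """Split a tag into its components, or return None if the shape is wrong."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    if m.group("privateonly"):
        return {"privateuse": m.group("privateonly")}
    return {
        "language": m.group("language"),
        "extlang": [s for s in m.group("extlang").split("-") if s],
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": [s for s in m.group("variants").split("-") if s],
        # Extensions are kept as one raw string, e.g. "u-sd-chzh".
        "extensions": m.group("extensions").lstrip("-") or None,
        "privateuse": m.group("privateuse"),
    }


print(parse_language_tag("nan-Hant-TW"))
print(parse_language_tag("gsw-u-sd-chzh"))
```

A production implementation would additionally look each subtag up in the Language Subtag Registry, since a well-formed tag is not necessarily a valid one.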

Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards.
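
As an illustration of the recommended casing, the following Python sketch applies the positional rule given in RFC 5646 (section 2.1.1), which reproduces the registry's capitalization without consulting the registry: everything is lowercase, except that two-letter subtags become uppercase and four-letter subtags become title case when they are neither the first subtag nor located after a singleton. The function name format_case is invented for this example, and the input is assumed to be a well-formed tag.

```python
def format_case(tag):
    """Return the tag in the conventional BCP 47 capitalization."""
    formatted = []
    seen_singleton = False  # True once an extension or private-use singleton appears
    for index, subtag in enumerate(tag.split("-")):
        subtag = subtag.lower()
        if index > 0 and not seen_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()        # region, e.g. "TW"
            elif len(subtag) == 4:
                subtag = subtag.capitalize()   # script, e.g. "Hant"
        if len(subtag) == 1:
            seen_singleton = True              # everything after stays lowercase
        formatted.append(subtag)
    return "-".join(formatted)


print(format_case("ZH-hant-tw"))                     # zh-Hant-TW
print(format_case("HE-il-U-CA-HEBREW-TZ-JERUSLM"))   # he-IL-u-ca-hebrew-tz-jeruslm
```

Because tags are not case-sensitive, this formatting is purely cosmetic; comparisons between tags should still be done case-insensitively.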

Script and region subtags should be omitted when they add no distinguishing information to a language tag. For example, es is preferred over es-Latn, as Spanish is fully expected to be written in the Latin script; ja is preferred over ja-JP, as Japanese as used in Japan does not differ markedly from Japanese as used elsewhere.

Not all linguistic regions can be represented with a valid region subtag: the subnational regional dialects of a primary language are registered as variant subtags. For example, the valencia variant subtag for the Valencian variant of Catalan is registered in the Language Subtag Registry with the prefix ca. As this dialect is spoken almost exclusively in Spain, the region subtag ES can normally be omitted.

Furthermore, there are script subtags that do not refer to traditional scripts such as Latin, or even to scripts at all; these usually begin with a Z. For example, Zsye refers to emojis, Zmth to mathematical notation, Zxxx to unwritten documents and Zyyy to undetermined scripts.

IETF language tags have been used as locale identifiers in many applications. It may be necessary for these applications to establish their own strategy for defining, encoding and matching locales if the strategy described in RFC 4647 is not adequate.

The use, interpretation and matching of IETF language tags are currently defined in RFC 5646 and RFC 4647. The Language Subtag Registry lists all currently valid public subtags. Private-use subtags are not included in the Registry as they are implementation-dependent and subject to private agreements between third parties using them. These private agreements are out of scope of BCP 47.

List of common primary language subtags

The following is a list of some of the more commonly used primary language subtags. The list represents only a small subset (less than 2 percent) of primary language subtags; for full information, the Language Subtag Registry should be consulted directly.

Common languages and their IETF subtags^[15]
English name	Native name	Subtag
Afrikaans	Afrikaans	af
Amharic	አማርኛ	am
Arabic	العربية	ar
Mapudungun	Mapudungun	arn
Moroccan Arabic	الدارجة المغربية	ary
Assamese	অসমীয়া	as
Azerbaijani	Azərbaycan	az
Bashkir	Башҡорт	ba
Belarusian	беларуская	be
Bulgarian	български	bg
Bengali	বাংলা	bn
Tibetan	བོད་ཡིག	bo
Breton	brezhoneg	br
Bosnian	bosanski/босански	bs
Catalan	català	ca
Central Kurdish	کوردیی ناوەندی	ckb
Corsican	Corsu	co
Czech	čeština	cs
Welsh	Cymraeg	cy
Danish	dansk	da
German	Deutsch	de
Lower Sorbian	dolnoserbšćina	dsb
Divehi	ދިވެހިބަސް	dv
Greek	Ελληνικά	el
English	English	en
Spanish	español	es
Estonian	eesti	et
Basque	euskara	eu
Persian	فارسى	fa
Finnish	suomi	fi
Filipino	Filipino	fil
Faroese	føroyskt	fo
French	français	fr
Frisian	Frysk	fy
Irish	Gaeilge	ga
Scottish Gaelic	Gàidhlig	gd
Gilbertese	Taetae ni Kiribati	gil
Galician	galego	gl
Swiss German	Schweizerdeutsch	gsw
Gujarati	ગુજરાતી	gu
Hausa	Hausa	ha
Hebrew	עברית	he
Hindi	हिंदी	hi
Croatian	hrvatski	hr
Upper Sorbian	hornjoserbšćina	hsb
Hungarian	magyar	hu
Armenian	Հայերեն	hy
Indonesian	Bahasa Indonesia	id
Igbo	Igbo	ig
Yi	ꆈꌠꁱꂷ	ii
Icelandic	íslenska	is
Italian	italiano	it
Inuktitut	Inuktitut/ ᐃᓄᒃᑎᑐᑦ (ᑲᓇᑕ)	iu
Japanese	日本語	ja
Georgian	ქართული	ka
Kazakh	Қазақша	kk
Greenlandic	kalaallisut	kl
Khmer	ខ្មែរ	km
Kannada	ಕನ್ನಡ	kn
Korean	한국어	ko
Konkani	कोंकणी	kok
Kurdish	Kurdî کوردی	ku
Kyrgyz	Кыргыз	ky
Luxembourgish	Lëtzebuergesch	lb
Lao	ລາວ	lo
Lithuanian	lietuvių	lt
Latvian	latviešu	lv
Maori	Reo Māori	mi
Macedonian	македонски јазик	mk
Malayalam	മലയാളം	ml
Mongolian	Монгол хэл/ ᠮᠤᠨᠭᠭᠤᠯ ᠬᠡᠯᠡ	mn
Mohawk	Kanien'kéha	moh
Marathi	मराठी	mr
Malay	Bahasa Malaysia	ms
Maltese	Malti	mt
Burmese	မြန်မာဘာသာ	my
Norwegian (Bokmål)	norsk (bokmål)	nb
Nepali	नेपाली (नेपाल)	ne
Dutch	Nederlands	nl
Norwegian (Nynorsk)	norsk (nynorsk)	nn
Norwegian	norsk	no
Occitan	occitan	oc
Odia	ଓଡ଼ିଆ	or
Papiamento	Papiamentu	pap
Punjabi	ਪੰਜਾਬੀ پنجابی	pa
Polish	polski	pl
Dari	درى	prs
Pashto	پښتو	ps
Portuguese	português	pt
K'iche	K'iche	quc
Quechua	runasimi	qu
Romansh	Rumantsch	rm
Romanian	română	ro
Russian	русский	ru
Kinyarwanda	Kinyarwanda	rw
Sanskrit	संस्कृत	sa
Yakut	саха	sah
Sami (Northern)	davvisámegiella	se
Sinhala	සිංහල	si
Slovak	slovenčina	sk
Slovenian	slovenščina	sl
Sami (Southern)	åarjelsaemiengiele	sma
Sami (Lule)	julevusámegiella	smj
Sami (Inari)	sämikielâ	smn
Sami (Skolt)	sääʹmǩiõll	sms
Albanian	shqip	sq
Serbian	srpski/српски	sr
Sesotho	Sesotho	st
Swedish	svenska	sv
Kiswahili	Kiswahili	sw
Syriac	ܣܘܪܝܝܐ	syc
Tamil	தமிழ்	ta
Telugu	తెలుగు	te
Tajik	Тоҷикӣ	tg
Thai	ไทย	th
Turkmen	türkmençe	tk
Tswana	Setswana	tn
Turkish	Türkçe	tr
Tatar	Татарча	tt
Tamazight	Tamazight	tzm
Uyghur	ئۇيغۇرچە	ug
Ukrainian	українська	uk
Urdu	اُردو	ur
Uzbek	Uzbek/Ўзбек	uz
Vietnamese	Tiếng Việt	vi
Wolof	Wolof	wo
Xhosa	isiXhosa	xh
Yoruba	Yoruba	yo
Chinese	中文	zh
Zulu	isiZulu	zu

Relation to other standards

Although some types of subtags are derived from ISO or UN core standards, they do not follow these standards absolutely, as this could lead to the meaning of language tags changing over time. In particular, a subtag derived from a code assigned by ISO 639, ISO 15924, ISO 3166, or UN M.49 remains a valid (though deprecated) subtag even if the code is withdrawn from the corresponding core standard. If the standard later assigns a new meaning to the withdrawn code, the corresponding subtag will still retain its old meaning.

This stability was introduced in RFC 4646.

ISO 639-3 and ISO 639-1

RFC 4646^[11] defined the concept of an "extended language subtag" (sometimes referred to as extlang), although no such subtags were registered at that time.^[16]^{[failed verification]}^[17]^{[failed verification]}

RFC 5645^[18] and RFC 5646^[13] added primary language subtags corresponding to ISO 639-3 codes for all languages that did not already exist in the Registry. In addition, codes for languages encompassed by certain macrolanguages were registered as extended language subtags. Sign languages were also registered as extlangs, with the prefix sgn. These languages may be represented either with the subtag for the encompassed language alone (cmn for Mandarin) or with a language-extlang combination (zh-cmn). The first option is preferred for most purposes. The second option is called "extlang form" and is new in RFC 5646.

Whole tags that were registered prior to RFC 4646 and are now classified as "grandfathered" or "redundant" (depending on whether they fit the new syntax) are deprecated in favor of the corresponding ISO 639-3–based language subtag, if one exists. To list a few examples, nan is preferred over zh-min-nan for Min Nan Chinese; hak is preferred over i-hak and zh-hakka for Hakka Chinese; and ase is preferred over sgn-US for American Sign Language.
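
The replacement of deprecated whole tags can be sketched as a case-insensitive lookup. In the following Python example, the table contains only the mappings named in the paragraph above; it is a hand-written excerpt for illustration, not a copy of the Language Subtag Registry, and the names PREFERRED_TAGS and prefer_modern_tag are hypothetical.

```python
# Deprecated whole tag (lowercased) -> preferred ISO 639-3 based subtag.
# Only the examples from the text; a real implementation would load
# these mappings from the Language Subtag Registry.
PREFERRED_TAGS = {
    "zh-min-nan": "nan",  # Min Nan Chinese
    "i-hak": "hak",       # Hakka Chinese
    "zh-hakka": "hak",    # Hakka Chinese
    "sgn-us": "ase",      # American Sign Language
}


def prefer_modern_tag(tag):
    """Return the preferred tag if the whole tag is a known deprecated one."""
    return PREFERRED_TAGS.get(tag.lower(), tag)


print(prefer_modern_tag("sgn-US"))  # ase
print(prefer_modern_tag("en-GB"))   # en-GB (unchanged)
```

The lookup key is lowercased because language tags are not case-sensitive; tags that are not in the table are returned unchanged.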

Windows Vista and later versions of Microsoft Windows have RFC 4646 support.^[19]

ISO 639-5 and ISO 639-1/2

ISO 639-5 defines language collections with alpha-3 codes in a different way than they were initially encoded in ISO 639-2 (including one code already present in ISO 639-1, Bihari, coded inclusively as bh in ISO 639-1 and bih in ISO 639-2). Specifically, the language collections are now all defined in ISO 639-5 as inclusive, rather than some of them being defined exclusively. This means that language collections have a broader scope than before, and in some cases may encompass languages that were already encoded separately within ISO 639-2.

For example, the ISO 639-2 code afa was previously associated with the name "Afro-Asiatic (Other)", excluding languages such as Arabic that already had their own code. In ISO 639-5, this collection is named "Afro-Asiatic languages" and includes all such languages. ISO 639-2 changed the exclusive names in 2009 to match the inclusive ISO 639-5 names.^[20]

To avoid breaking implementations that may still depend on the older (exclusive) definition of these collections, ISO 639-5 defines a grouping type attribute for all collections that were already encoded in ISO 639-2 (such grouping type is not defined for the new collections added only in ISO 639-5).

BCP 47 defines a "Scope" property to identify subtags for language collections. However, it does not define any given collection as inclusive or exclusive, and does not use the ISO 639-5 grouping type attribute, although the description fields in the Language Subtag Registry for these subtags match the ISO 639-5 (inclusive) names. As a consequence, BCP 47 language tags that include a primary language subtag for a collection may be ambiguous as to whether the collection is intended to be inclusive or exclusive.

ISO 639-5 does not define precisely which languages are members of these collections; only the hierarchical classification of collections is defined, using the inclusive definition of these collections. Because of this, RFC 5646 does not recommend the use of subtags for language collections for most applications, although they are still preferred over subtags whose meaning is even less specific, such as "Multiple languages" and "Undetermined".

In contrast, the classification of individual languages within their macrolanguage is standardized, in both ISO 639-3 and the Language Subtag Registry.

ISO 15924, ISO/IEC 10646 and Unicode

Script subtags were first added to the Language Subtag Registry when RFC 4646^[11] was published, from the list of codes defined in ISO 15924. They are encoded in the language tag after primary and extended language subtags, but before other types of subtag, including region and variant subtags.

Some primary language subtags are defined with a property named "Suppress-Script" which indicates the cases where a single script can usually be assumed by default for the language, even if it can be written with another script. When this is the case, it is preferable to omit the script subtag, to improve the likelihood of successful matching. A different script subtag can still be appended to make the distinction when necessary. For example, yi is preferred over yi-Hebr in most contexts, because the Hebrew script subtag is assumed for the Yiddish language.
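
A minimal Python sketch of this rule follows. The SUPPRESS_SCRIPT table is a two-entry, hand-written excerpt based on the examples in this article (a real implementation would read the "Suppress-Script" fields from the Language Subtag Registry), the function name drop_default_script is hypothetical, and the sketch assumes the script subtag, if present, directly follows the primary language subtag (no extended language subtag).

```python
# Primary language subtag -> script that is assumed by default.
# Illustrative excerpt only; not the full registry data.
SUPPRESS_SCRIPT = {
    "yi": "Hebr",  # Yiddish is assumed to be written in Hebrew script
    "es": "Latn",  # Spanish is assumed to be written in Latin script
}


def drop_default_script(tag):
    """Remove the script subtag when it matches the language's default script."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        default = SUPPRESS_SCRIPT.get(subtags[0].lower())
        # Compare case-insensitively, since tags are not case-sensitive.
        if default is not None and subtags[1].lower() == default.lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_default_script("yi-Hebr"))     # yi
print(drop_default_script("es-Latn-MX"))  # es-MX
print(drop_default_script("yi-Latn"))     # yi-Latn (non-default script is kept)
```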

As another example, zh-Hans-SG may be considered equivalent to zh-Hans, because the region code is probably not significant; the written form of Chinese used in Singapore uses the same simplified Chinese characters as in other countries where Chinese is written. However, the script subtag is maintained because it is significant.

ISO 15924 includes some codes for script variants (for example, Hans and Hant for simplified and traditional forms of Chinese characters) that are unified within Unicode and ISO/IEC 10646. These script variants are most often encoded for bibliographic purposes, but are not always significant from a linguistic point of view (for example, Latf and Latg script codes for the Fraktur and Gaelic variants of the Latin script, which are mostly encoded with regular Latin letters in Unicode and ISO/IEC 10646). They may occasionally be useful in language tags to expose orthographic or semantic differences, with different analysis of letters, diacritics, and digraphs/trigraphs as default grapheme clusters, or differences in letter casing rules.

ISO 3166-1 and UN M.49

Two-letter region subtags are based on codes assigned, or "exceptionally reserved", in ISO 3166-1. If the ISO 3166 Maintenance Agency were to reassign a code that had previously been assigned to a different country, the existing BCP 47 subtag corresponding to that code would retain its meaning, and a new region subtag based on UN M.49 would be registered for the new country. UN M.49 is also the source for numeric region subtags for geographical regions, such as 005 for South America. The UN M.49 codes for economic regions are not allowed.

Region subtags are used to specify the variety of a language "as used in" a particular region. They are appropriate when the variety is regional in nature, and can be captured adequately by identifying the countries involved, as when distinguishing British English (en-GB) from American English (en-US). When the difference is one of script or script variety, as for simplified versus traditional Chinese characters, it should be expressed with a script subtag instead of a region subtag; in this example, zh-Hans and zh-Hant should be used instead of zh-CN/zh-SG/zh-MY and zh-TW/zh-HK/zh-MO.

When a distinct language subtag exists for a language that could be considered a regional variety, it is often preferable to use the more specific subtag instead of a language-region combination. For example, ar-DZ (Arabic as used in Algeria) may be better expressed as arq for Algerian Spoken Arabic.

Adherence to core standards

Disagreements about language identification may extend to BCP 47 and to the core standards that inform it. For example, some speakers of Punjabi believe that the ISO 639-3 distinction between [pan] "Panjabi" and [pnb] "Western Panjabi" is spurious (i.e. they feel the two are the same language); that sub-varieties of the Arabic script should be encoded separately in ISO 15924 (as, for example, the Fraktur and Gaelic styles of the Latin script are); and that BCP 47 should reflect these views or overrule the core standards with regard to them.

BCP 47 delegates this type of judgment to the core standards, and does not attempt to overrule or supersede them. Variant subtags and (theoretically) primary language subtags may be registered individually, but not in a way that contradicts the core standards.^[21]

Extensions

Extension subtags (not to be confused with extended language subtags) allow additional information to be attached to a language tag that does not necessarily serve to identify a language. One use for extensions is to encode locale information, such as calendar and currency.

Extension subtags are composed of multiple hyphen-separated character strings, starting with a single character (other than x), called a singleton. Each extension is described in its own IETF RFC, which identifies a Registration Authority to manage the data for that extension. IANA is responsible for allocating singletons.

Two extensions have been assigned as of January 2014.

Extension T (Transformed Content)

Extension T allows a language tag to include information on how the tagged data was transliterated, transcribed, or otherwise transformed. For example, the tag en-t-ja could be used for content in English that was translated from the original Japanese (ja being the language subtag for Japanese). Additional substrings could indicate that the translation was done mechanically, or in accordance with a published standard.

Extension T is described in the informational RFC 6497,^[22] published in February 2012. The Registration Authority is the Unicode Consortium.

Extension U (Unicode Locale)

Extension U allows a wide variety of locale attributes found in the Common Locale Data Repository (CLDR) to be embedded in language tags. These attributes include country subdivisions, calendar and time zone data, collation order, currency, number system, and keyboard identification.

Some examples include:

gsw-u-sd-chzh represents Swiss German as used in the Canton of Zurich.
ar-u-nu-latn represents Arabic-language content using Basic Latin digits (0 through 9) instead of Arabic-script digits (٠ through ٩).
he-IL-u-ca-hebrew-tz-jeruslm represents Hebrew as spoken in Israel, using the traditional Hebrew calendar, and in the "Asia/Jerusalem" time zone as identified in the tz database.
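
To show how such tags decompose, the following Python sketch extracts the key/value pairs of the U extension. It assumes a well-formed tag and relies on the convention that, inside the U extension, two-character subtags are keys and the longer subtags that follow are the value of the preceding key; attributes that precede the first key are ignored. The function name unicode_locale_keywords is invented for this example.

```python
def unicode_locale_keywords(tag):
    """Return the keywords of the "u" extension as a dict, e.g. {"ca": "hebrew"}."""
    keywords = {}
    in_u_extension = False
    current_key = None
    for subtag in tag.lower().split("-"):
        if len(subtag) == 1:
            if subtag == "x":
                break  # everything after "x" is private use
            # Any other singleton starts a new extension.
            in_u_extension = (subtag == "u")
            current_key = None
        elif in_u_extension:
            if len(subtag) == 2:
                current_key = subtag            # a new key such as "ca" or "tz"
                keywords[current_key] = []
            elif current_key is not None:
                keywords[current_key].append(subtag)  # part of the key's value
    return {key: "-".join(parts) for key, parts in keywords.items()}


print(unicode_locale_keywords("gsw-u-sd-chzh"))                 # {'sd': 'chzh'}
print(unicode_locale_keywords("he-IL-u-ca-hebrew-tz-jeruslm"))  # {'ca': 'hebrew', 'tz': 'jeruslm'}
```

The two-letter region subtag IL in the last example is not mistaken for a key, because keys are only recognized after the u singleton has been seen.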

Extension U is described in the informational RFC 6067,^[23] published in December 2010. The Registration Authority is the Unicode Consortium.

References

[1] Phillips, Addison; Davis, Mark (September 2009). "Information on BCP 47 » RFC Editor".
[2] "Language Subtag Registry". iana.org. Internet Assigned Numbers Authority. Retrieved 2018-12-05.
[3] "Language Tag Extensions Registry". iana.org. Internet Assigned Numbers Authority. Retrieved 2018-12-06.
[4] "IANA — Protocol Registries". iana.org. Retrieved 28 July 2015.
[5] R. Fielding; M. Nottingham; J. Reschke, eds. (June 2022). HTTP Semantics. Internet Engineering Task Force. doi:10.17487/RFC9110. ISSN 2070-1721. STD 97. RFC 9110. Internet Standard 97. Obsoletes RFC 2818, 7230, 7231, 7232, 7233, 7235, 7538, 7615 and 7694. Updates RFC 3864.
[6] "Language information and text direction". w3.org. Retrieved 28 July 2015.
[7] "Extensible Markup Language (XML) 1.0 (Fifth Edition)". w3.org. Retrieved 28 July 2015.
[8] "Portable Network Graphics (PNG) Specification (Second Edition)". w3.org. Retrieved 28 July 2015.
[9] H. Alvestrand (March 1995). Tags for the Identification of Languages. Network Working Group. doi:10.17487/RFC1766. RFC 1766. Obsolete. Obsoleted by RFC 3066 and 3282.
[10] H. Alvestrand (January 2001). Tags for the Identification of Languages. Network Working Group. doi:10.17487/RFC3066. BCP 47. RFC 3066. Obsolete, was BCP 47. Obsoleted by RFC 4646 and 4647.
[11] A. Phillips; M. Davis, eds. (September 2006). Tags for Identifying Languages. Network Working Group. doi:10.17487/RFC4646. BCP 47. RFC 4646. Obsolete, was BCP 47. Obsoleted by RFC 5646. Obsoletes RFC 3066.
[12] A. Phillips; M. Davis, eds. (September 2006). Matching of Language Tags. Network Working Group. doi:10.17487/RFC4647. BCP 47. RFC 4647. Best Current Practice 47. Obsoletes RFC 3066.
[13] Phillips, A.; Davis, M., eds. (September 2009). Tags for Identifying Languages. IETF Network Working Group. doi:10.17487/RFC5646. BCP 47. RFC 5646. Best Current Practice 47. Obsoletes RFC 4646.
[14] Language Tag Registry Update charter. Archived 2007-02-10 at the Wayback Machine.
[15] "Letter Codes of Cultures – List". Archived from the original on 2022-08-07. Retrieved 2022-01-08.
[16] Addison Phillips, Mark Davis (2008). "Tags for Identifying Languages (old draft for the revision of RFC 4646, now obsolete and may disappear soon)". IETF WG LTRU. Retrieved 2008-06-23.
[17] Doug Ewell (2008). "Update to the Language Subtag Registry (old draft for the revision of RFC 4645, now obsolete and may disappear soon)" (1MB). IETF WG LTRU. Retrieved 2008-06-23.
[18] D. Ewell, ed. (September 2009). Update to the Language Subtag Registry. IETF Network Working Group. doi:10.17487/RFC5645. RFC 5645. Informational.
[19] "GetGeoInfoA function (winnls.h) – Win32 apps".
[20] "ISO 639-2 Language Code List – Codes for the representation of names of languages (Library of Congress)". loc.gov. Retrieved 28 July 2015.
[21] Ewell, Doug (2022-08-12). "Re: [Ietf-languages] Punjabi language code fix recommendations". Retrieved 2022-08-12.
[22] M. Davis; A. Phillips; Y. Umaoka; C. Falk (February 2012). BCP 47 Extension T - Transformed Content. Internet Engineering Task Force. doi:10.17487/RFC6497. ISSN 2070-1721. RFC 6497. Informational.
[23] M. Davis; A. Phillips; Y. Umaoka (December 2010). BCP 47 Extension U. Internet Engineering Task Force (IETF). doi:10.17487/RFC6067. ISSN 2070-1721. RFC 6067. Informational.

External links

BCP 47 Language Tags – current specification
- Contains two RFCs published separately at different dates, but concatenated in a single document:
  1. RFC 4647 – "Matching of Language Tags,"
  2. RFC 5646 – "Tags for Identifying Languages,"
- It also references the related informational RFC 5645, which complements the previous informational RFC 4645, as well as other individual registration forms published separately by others for each language added or modified in the Registry between these BCP 47 revisions.
Language Subtag Registry – maintained by IANA
Language Subtag Registry Search – find subtags and view entries in the Registry
"Language tags in HTML and XML" – from the W3C
"Language Tags" – from the IETF Language Tag Registry Update working group