Character encodings in HTML

From Wikipedia, the free encyclopedia

Use of encoding systems for international characters in HTML

For a list of character entity references, see List of XML and HTML character entity references.

For fixing links within Wikipedia, see Help:Percent-encoding § Fixing Links with Unsupported Characters.

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

Specifying the document's character encoding

There are two general ways to specify which character encoding is used in the document.

First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:^[1]

Content-Type: text/html; charset=utf-8

This method gives the HTTP server a convenient way to alter a document's encoding according to content negotiation; certain HTTP server software can do it, for example Apache with the module mod_charset_lite.^[2]
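
As a minimal sketch of how a client might read the charset parameter from such a header, the Python standard library's email.message.Message can parse the value. The helper name charset_from_content_type is hypothetical and is used here only for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter lower-cased,
    # or None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))
```

A client that receives no charset parameter has to fall back on the in-document declaration or on detection.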

Second, a declaration can be included within the document itself.

For HTML it is possible to include this information inside the head element near the top of the document:^[3]

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 also allows the following syntax to mean exactly the same:^[3]

<meta charset="utf-8">

XHTML documents have a third option: to express the character encoding via an XML declaration, as follows:^[4]

<?xml version="1.0" encoding="utf-8"?>

With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an ASCII extension then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as UTF-16BE and UTF-16LE, a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
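
The reason an ASCII-compatible declaration can be read before the encoding is known can be illustrated with a byte-level scan. The following is a simplified sketch, not the prescan defined by the HTML standard; the names META_CHARSET and prescan_meta_charset and the regular expression are assumptions made for this example.

```python
import re

# Matches charset=... inside a <meta> tag, working on raw bytes.
# This is a deliberate simplification; the standard's prescan handles
# many more edge cases (comments, attribute parsing, and so on).
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def prescan_meta_charset(raw, limit=1024):
    """Look for a charset declaration in the first `limit` bytes."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


print(prescan_meta_charset(b'<!DOCTYPE html><head><meta charset="utf-8"></head>'))
```

Because the scan compares ASCII byte values, it finds nothing in a UTF-16 document, where every ASCII character is padded with a zero byte.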

Encoding detection algorithm

As of HTML5 the recommended charset is UTF-8.^[3] An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:

Explicit user instruction
An explicit meta tag within the first 1024 bytes of the document
A byte order mark (BOM) within the first three bytes of the document
The HTTP Content-Type or other transport layer information
Analysis of the document bytes looking for specific sequences or ranges of byte values,^[5] and other tentative detection mechanisms.
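
How several of these inputs might be combined can be sketched as follows. This is not the algorithm from the specification: the precedence used here (BOM, then HTTP charset, then meta declaration, then a default), the function name sniff_encoding, and the windows-1252 fallback are all simplifying assumptions of this example.

```python
import codecs

# Byte order marks and the encodings they signal.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
)


def sniff_encoding(raw, http_charset=None, meta_charset=None,
                   default="windows-1252"):
    """Pick an encoding label from the available hints (simplified)."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    # No BOM: fall back to the declared labels, then to the assumed default.
    return http_charset or meta_charset or default


print(sniff_encoding(b"\xef\xbb\xbf<html>", http_charset="iso-8859-1"))
```

A real implementation would also honour explicit user instruction and could analyse byte patterns before settling on a default.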

When the encoding is identified incorrectly, characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean (CJK) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually as well.

It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
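
The efficiency point can be checked with a short measurement. The sample markup below is invented for this example; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
# 17 ASCII markup characters plus 5 hiragana characters.
markup = '<p lang="ja">こんにちは</p>'

sizes = {
    encoding: len(markup.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

for encoding, size in sizes.items():
    print(f"{encoding}: {size} bytes")
```

Even though each kana character takes three bytes in UTF-8 and only two in UTF-16, the surrounding ASCII markup makes the UTF-8 form the smallest overall.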

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
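
What a reader on a different platform might see can be reproduced by decoding the same bytes with two encodings. This is only an illustration; the choice of windows-1252 as the reader's assumed encoding is an assumption of this example.

```python
text = "café"
raw = text.encode("utf-8")  # bytes as saved by the page's creator

# A reader whose software assumes windows-1252 sees mojibake,
# while one that assumes UTF-8 sees the intended text.
wrong = raw.decode("windows-1252")
right = raw.decode("utf-8")

print(wrong)
print(right)
```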

Permitted encodings

The WHATWG Encoding Standard, referenced by recent HTML standards (the current WHATWG HTML Living Standard, as well as the formerly competing W3C HTML 5.0 and 5.1) specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.^[6]^[7]^[8] The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use UTF-8 exclusively.^[9]

Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard:^[8]

^Also specified for TIS-620, ISO-8859-11 and related labels.^[9]
^Also specified for ASCII, ISO-8859-1 and related labels.^[9]
^Also specified for ISO-8859-9 and related labels.^[9]
^Specified with 0xA3A0 as a duplicate encoding of the ideographic space (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).^[10]^[11] Also, specified with 0x80 accepted as an alternative encoding of the euro sign (U+20AC; see Windows-936).^[12] Otherwise, follows the mappings from the 2005 standard.^[11]
^Hong Kong Supplementary Character Set variant,^[13] although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.^[14]
^The specification includes IBM and NEC extensions,^[15] and is more precisely Windows-31J.^[13]
^The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. Half-width kana is converted to fullwidth by the encoder,^[16] but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.^[17] Shift Out and Shift In (0x0E and 0x0F) are excluded entirely to prevent attacks.^[17]^[18]
^Actually Unified Hangul Code (Windows-949), which is a superset which covers the entire Hangul Syllables block.^[13]^[19]
^Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.^[20]
^For compatibility with deployed content, also specified for the plain UTF-16 label,^[21] although a byte order mark (BOM), if present, takes priority over any label.^[22] Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.^[20]
^Maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a Private Use Area range), such that the low 8 bits of the code point always match the original byte.^[23]
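
The byte-to-code-point mapping described in the last note above (the encoding the Encoding Standard calls x-user-defined) is simple enough to sketch directly. The function name decode_x_user_defined is hypothetical, and this is an illustration of the mapping rather than a conforming decoder.

```python
def decode_x_user_defined(raw):
    """Map 0x00-0x7F to U+0000-U+007F and 0x80-0xFF to U+F780-U+F7FF."""
    return "".join(
        chr(byte) if byte < 0x80 else chr(0xF780 + (byte - 0x80))
        for byte in raw
    )


decoded = decode_x_user_defined(b"A\x80\xff")
# The low 8 bits of each code point reproduce the original byte.
recovered = bytes(ord(ch) & 0xFF for ch in decoded)
print([f"U+{ord(ch):04X}" for ch in decoded], recovered)
```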

The following additional encodings are listed in the Encoding Standard, and support for them is therefore also required:^[9]

^Uses the same encoder and decoder as ISO-8859-8, but is not subject to the visual-order behaviour which is used for documents labelled as ISO-8859-8.^[24]
^Titled KOI8-U and specified for both KOI8-U and KOI8-RU labels;^[9] follows KOI8-RU in positions 0xAE and 0xBE (i.e. includes Ў/ў)^[25]^[26] but KOI8-U in positions 0x93–9F.^[25]
^Also specified for GB2312 and related labels. Handled the same as GB 18030 for decoding purposes.^[27] For encoding purposes, labelling as GBK (or GB 2312) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.^[10]
^The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. JIS X 0212 is included for decoding only.^[28]

The following encodings are listed as explicit examples of forbidden encodings:^[8]

The standard also defines a "replacement" decoder, which maps all content labelled as certain encodings to the replacement character (�), refusing to process it at all. This is intended to prevent attacks (e.g. cross-site scripting) which may exploit a difference between the client and server in what encodings are supported in order to mask malicious content.^[29] Although the same security concern applies to ISO-2022-JP and UTF-16, which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them since they are comparatively more frequently used in deployed content.^[30] The following encodings receive this treatment:^[31]

Character references

Main articles: List of XML and HTML character entity references and Numeric character reference

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

HTML character references

A numeric character reference in HTML refers to a character by its Universal Character Set/Unicode code point, and uses the format

&#nnnn;

&#xhhhh;

where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
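
As a small illustration, both forms can be generated from a character's code point and decoded again with Python's standard html module. The helper name numeric_refs is hypothetical.

```python
import html


def numeric_refs(ch):
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(ch)
    # Lowercase x, uppercase hex digits, following the usual style.
    return f"&#{code_point};", f"&#x{code_point:X};"


decimal_ref, hex_ref = numeric_refs("λ")
print(decimal_ref, hex_ref)

# Leading zeros and mixed-case hex digits decode to the same character.
print(html.unescape("&#0955;"), html.unescape("&#x03bB;"))
```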

Not all web browsers or email clients used by receivers of HTML documents, or text editors used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters it cannot render.

For codes from 0 to 127, the original 7-bit ASCII standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using character entity names. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.

Character entity references can also have the format &name; where name is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably did not include XML's &apos; (') entity prior to HTML5. For a list of all named HTML character entity references along with the versions in which they were introduced, see List of XML and HTML character entity references.
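
Named references can be explored with Python's standard html and html.entities modules, shown here as an illustration rather than as part of any HTML specification. The lookups below also show that entity names are case-sensitive.

```python
import html
from html.entities import name2codepoint

# "lambda" and "Lambda" are distinct entity names.
lower = html.unescape("&lambda;")
upper = html.unescape("&Lambda;")
print(lower, upper)

# The name-to-code-point table behind the lookup.
print(f"U+{name2codepoint['lambda']:04X}", f"U+{name2codepoint['Lambda']:04X}")

# The markup-delimiting characters and their predefined references.
delimiters = html.unescape("&lt; &gt; &quot; &amp;")
print(delimiters)
```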

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native Unicode encoding like UTF-8 is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as cross-site scripting. If HTML attributes are left unquoted, certain characters, most importantly whitespace, such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
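
A minimal sketch of escaping untrusted text before placing it in markup, using html.escape from the Python standard library. The function render_greeting and its template are invented for this example; note that the attribute value is quoted, so only the quote characters and markup delimiters need escaping.

```python
import html


def render_greeting(name):
    """Insert untrusted text into a quoted attribute and into element content."""
    safe = html.escape(name)  # escapes &, <, > and, by default, both quote characters
    return f'<p title="{safe}">Hello, {safe}!</p>'


print(render_greeting('<script>alert("x")</script>'))
```

Real applications usually rely on a templating system that escapes by context; this sketch covers only the two contexts shown.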

XML character references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:^[32]

Entity reference	Character	Name	Code point
`&amp;`	&	ampersand	U+0026
`&lt;`	<	less-than sign	U+003C
`&gt;`	>	greater-than sign	U+003E
`&quot;`	"	quotation mark	U+0022
`&apos;`	'	apostrophe	U+0027

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
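
These rules can be observed with the XML parser in Python's standard library (xml.etree.ElementTree, which is backed by Expat). The helper is_well_formed and the sample documents are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five predefined references are always available.
print(is_well_formed("<a>&amp;&lt;&gt;&quot;&apos;</a>"))
# An HTML entity name is an error unless it has been declared.
print(is_well_formed("<a>&eacute;</a>"))
# Declaring the entity in an internal DTD subset makes it usable.
declared = '<!DOCTYPE a [<!ENTITY eacute "&#233;">]><a>&eacute;</a>'
print(ET.fromstring(declared).text)
# The x of a hexadecimal reference must be lowercase.
print(is_well_formed("<a>&#xE9;</a>"), is_well_formed("<a>&#XE9;</a>"))
```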