Charset detection

From Wikipedia, the free encyclopedia
Process of determining content's charset

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable[1] and is only used when specific metadata, such as an HTTP Content-Type: header, is either not available or is assumed to be untrustworthy.
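As an illustration of consulting metadata before guessing, the following minimal Python sketch extracts the charset parameter from a Content-Type value using the standard library's email.message.Message. The helper name declared_charset is hypothetical and not part of any standard; a caller would fall back to heuristic detection only when it returns None.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only.
    """
    if not content_type:
        return None  # header missing: no declared charset to rely on
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value; returns None if absent.
    return msg.get_content_charset()


label = declared_charset("text/html; charset=ISO-8859-1")
```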

This algorithm usually involves statistical analysis of byte patterns;[2] such statistical analysis can also be used to perform language detection.[2] This process is not foolproof because it depends on statistical data.[1]

In general, incorrect charset detection leads to mojibake, because character bytes are interpreted as belonging to one character set (the incorrectly detected one) when they actually belong to a completely different one.[3][4]

One of the few cases where charset detection works reliably is detecting UTF-8.[5] This is due to the large percentage of invalid byte sequences in UTF-8,[note 1] so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test.[5] However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 text is in some other encoding. For example, websites in UTF-8 containing the name of the German city München may display "MÃ¼nchen", due to the code deciding that the encoding was ISO-8859-1 or Windows-1252 before (or without) even testing to see if it was UTF-8.
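A minimal Python sketch of running the UTF-8 validity test first is shown here. The function name guess_encoding and the choice of Windows-1252 as the fallback are assumptions made for illustration; a real detector would apply further heuristics when the UTF-8 test fails.

```python
def guess_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Try strict UTF-8 first; only then fall back to a legacy guess.

    Hypothetical helper: the fallback encoding is an assumption.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


utf8_bytes = "München".encode("utf-8")            # ü becomes bytes C3 BC
legacy_bytes = "München".encode("windows-1252")   # ü becomes byte FC

# Skipping the UTF-8 test and assuming Windows-1252 produces mojibake.
mojibake = utf8_bytes.decode("windows-1252")
```

Here guess_encoding(utf8_bytes) returns "utf-8", guess_encoding(legacy_bytes) returns the fallback because the lone byte FC is not valid UTF-8, and mojibake holds the string "MÃ¼nchen".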

UTF-16 is fairly reliable to detect due to the high number of newlines (U+000A) and spaces (U+0020) that should be found when dividing the data into 16-bit words, and large numbers of NUL bytes all at even or odd locations. Common characters must be checked for; relying only on a test that the text is valid UTF-16 fails: the Windows operating system would misdetect the phrase "Bush hid the facts" (without a newline) in ASCII as Chinese UTF-16LE, since all the byte pairs matched assigned Unicode characters in UTF-16LE.
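The two signals described above (NUL bytes clustered at one parity, and the presence of common characters such as spaces or newlines) can be sketched in Python as follows. The function name guess_utf16 and the 0.4 threshold are assumptions chosen for illustration, not values taken from any real detector.

```python
from typing import Optional


def guess_utf16(data: bytes, threshold: float = 0.4) -> Optional[str]:
    """Guess UTF-16 byte order from NUL placement plus common characters.

    Hypothetical heuristic; the threshold is an arbitrary assumption.
    """
    if len(data) < 2 or len(data) % 2:
        return None
    units = len(data) // 2
    even_nuls = data[0::2].count(0)  # high bytes if big-endian
    odd_nuls = data[1::2].count(0)   # high bytes if little-endian
    if odd_nuls > even_nuls and odd_nuls / units >= threshold:
        candidate = "utf-16-le"
    elif even_nuls > odd_nuls and even_nuls / units >= threshold:
        candidate = "utf-16-be"
    else:
        return None
    try:
        text = data.decode(candidate)
    except UnicodeDecodeError:
        return None
    # Validity alone is not enough: require a space or a newline.
    return candidate if (" " in text or "\n" in text) else None


ascii_bytes = b"Bush hid the facts"
# A validity-only test accepts the ASCII bytes as nine UTF-16LE characters.
misread = ascii_bytes.decode("utf-16-le")
```

With this sketch, guess_utf16(ascii_bytes) returns None because the ASCII bytes contain no NUL bytes, even though decoding them as UTF-16LE succeeds and yields CJK characters.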

Charset detection is particularly unreliable in Europe, in an environment of mixed ISO-8859 encodings. These are closely related eight-bit encodings that share an overlap in their lower half with ASCII, and all arrangements of bytes are valid. There is no technical way to tell these encodings apart, and recognizing them relies on identifying language features, such as letter frequencies or spellings.
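The following Python sketch illustrates why a validity test cannot separate these encodings: the same bytes decode without error under each of several ISO-8859 codecs and merely produce different text. The four codecs listed are an arbitrary selection for illustration, and decode_all is a hypothetical helper.

```python
from typing import Dict

# Arbitrary illustrative selection of ISO-8859 parts.
CODECS = ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15")


def decode_all(data: bytes) -> Dict[str, str]:
    """Decode the same bytes under every codec in CODECS."""
    return {name: data.decode(name) for name in CODECS}


# Every possible byte value decodes successfully under all four codecs.
every_byte = decode_all(bytes(range(256)))

# The single byte A4 is a currency sign in ISO-8859-1 but a euro sign
# in ISO-8859-15; no byte-level test can say which was intended.
currency = decode_all(b"\xa4")
```

Only knowledge of the expected language, for example that a euro sign is more plausible than a generic currency sign in a given text, can break the tie.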

Due to the unreliability of heuristic detection, it is better to properly label datasets with the correct encoding (see Specifying the document's character encoding). Even though UTF-8 and UTF-16 are easy to detect, some systems require UTF encodings to explicitly label the document with a prefixed byte order mark (BOM).
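Checking for a BOM can be sketched in Python with the constants from the standard codecs module. The function name encoding_from_bom is hypothetical. The UTF-32 entries are an addition not discussed above; they are checked first only because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
import codecs
from typing import Optional

# Longer BOMs first so UTF-32LE (FF FE 00 00) is not taken for UTF-16LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


labelled = encoding_from_bom(codecs.BOM_UTF8 + "München".encode("utf-8"))
unlabelled = encoding_from_bom("München".encode("utf-8"))
```

In this sketch labelled is "utf-8" and unlabelled is None, in which case a caller would fall back to declared metadata or heuristic detection.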

Notes

  1. ^ In a random byte string, a byte with the high bit set has only a 1/15 chance of starting a valid UTF-8 code point. Odds are even lower in actual text, which is not random but tends to contain isolated bytes with the high bit set, which are always invalid in UTF-8.

References

  1. ^ a b "PHP: mb_detect_encoding - Manual". www.php.net. Retrieved 2024-11-12.
  2. ^ a b Kim, Seung-Ho; Park, Jongsoo (2007). Automatic Detection of Character Encoding and Language (PDF) (Thesis). Stanford University.
  3. ^ King, Ritchie (2012). "Will unicode soon be the universal code? [The Data]". IEEE Spectrum. 49 (7): 60. doi:10.1109/MSPEC.2012.6221090.
  4. ^ Chen, Raymond (2019-07-01). "A program to detect mojibake that results from a UTF-8-encoded file being misinterpreted as code page 1252". The Old New Thing. Retrieved 2025-07-07.
  5. ^ a b "A composite approach to language/encoding detection". www-archive.mozilla.org. Retrieved 2024-11-12.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Charset_detection&oldid=1307737433"