Popularity of text encodings

From Wikipedia, the free encyclopedia

A number of text encodings have historically been used for storing text on the World Wide Web, though by now UTF-8 is dominant, with all languages at 95% use or higher by some estimates. The same encodings, and historically many more, are used in local files and databases. Measuring the prevalence of each is not possible for privacy reasons (local files, for example, are not web accessible), but rather accurate estimates are available for public web sites, and those statistics may or may not accurately reflect use in local files. Attempts at measuring encoding popularity may use counts of (web) documents, or counts weighted by actual use or visibility of those documents.

The decision to use any one encoding may depend on the language used for the documents, the locale that is the source of the document, or the purpose of the document. Text may be ambiguous as to what encoding it is in; for instance, pure ASCII text is valid ASCII, ISO-8859-1, CP1252 and UTF-8. Tags may indicate a document encoding, but an incorrect tag may be silently corrected by display software (for instance, the HTML specification says that the tag for ISO-8859-1 should be treated as CP1252), so counts of tags may not be accurate.
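
The ambiguity described above can be reproduced with a minimal Python sketch that uses only the standard codecs. The sample byte strings are illustrative assumptions, not data from any survey.

```python
# Pure ASCII bytes decode to the same text under all four encodings,
# so the bytes alone cannot reveal which encoding was intended.
ascii_bytes = b"Hello, world"
decoded = {
    enc: ascii_bytes.decode(enc)
    for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}
all_same = len(set(decoded.values())) == 1

# Byte 0x93 is one place where ISO-8859-1 and CP1252 disagree:
# a C1 control character in ISO-8859-1, a left double quotation
# mark (U+201C) in CP1252.
as_latin1 = b"\x93".decode("iso-8859-1")
as_cp1252 = b"\x93".decode("cp1252")
```

A page tagged ISO-8859-1 that contains such bytes was almost certainly written as CP1252, which is one reason a tag count and a count of actual encodings can differ.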

Popularity on the World Wide Web

Use of the main encodings on the web from 2001 to 2012 as recorded by Google,^[1] with UTF-8 overtaking all others in 2008 and exceeding 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that only contain ASCII characters, regardless of the declared header.

Declared character set for the 10 million most popular websites from 2010 to 2021

UTF-8 has been the most common encoding for the World Wide Web since 2008.^[2] As of November 2025, UTF-8 is used by 98.8% of surveyed web sites (and 99.4% of the top 1,000 pages); the next most popular encoding, ISO-8859-1, is used by 1.0% (and only 12 of the top 1,000 pages).^[3] Although many pages only use ASCII characters to display content, very few websites now declare their encoding to be ASCII rather than UTF-8.^[4]

All countries (and over 97% of all tracked languages) have at least 96% use of the UTF-8 encoding on the web. See below for the major alternative encodings:

The second-most popular encoding varies by locale and is typically more efficient for the associated language. One such encoding is the Chinese GB 18030 standard, which is a full Unicode Transformation Format; even so, 96.4% of websites in China and its territories use UTF-8,^[5]^[6]^[7] with GB 18030 (effectively^[8]) the next most popular encoding. Big5 is another popular non-UTF encoding, meant for traditional Chinese characters (though GB 18030, being a full UTF, works for those too). It is the next most popular encoding in Taiwan, where UTF-8 is at 97.1%, and is also the second-most used in Hong Kong, where, as elsewhere, UTF-8 is even more dominant at 98.4%.^[9] The single-byte Windows-1251 is twice as efficient for the Cyrillic script, yet 96.7% of Russian websites use UTF-8^[10] (Greek and Hebrew encodings are likewise twice as efficient, and UTF-8 has over 99% use for those languages).^[11]^[12] Korean, Chinese and Japanese language websites also have relatively high non-UTF-8 use compared to most other countries. Japanese UTF-8 use is at 98.9%; the rest use the legacy EUC-JP and/or Shift JIS (actually decoded as its superset Windows-31J) encodings, which are used about equally.^[13]^[14] South Korea has 96.0% UTF-8 use, with the rest of websites mainly using EUC-KR, which is more efficient for Korean text.
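
The efficiency claims above can be checked with a short Python sketch that compares encoded sizes. The helper name `encoded_sizes` and the sample strings are assumptions chosen for illustration; only standard-library codecs are used.

```python
def encoded_sizes(text, legacy_encoding):
    """Return (legacy_bytes, utf8_bytes) for the given text."""
    return len(text.encode(legacy_encoding)), len(text.encode("utf-8"))

# Six Cyrillic letters: one byte each in Windows-1251, two in UTF-8.
cyrillic = encoded_sizes("Привет", "cp1251")

# Three Hangul syllables: two bytes each in EUC-KR, three in UTF-8.
korean = encoded_sizes("한국어", "euc-kr")

# Two Han characters: two bytes each in Big5, three in UTF-8.
chinese = encoded_sizes("中文", "big5")

# GB 18030 covers all of Unicode, so even a character outside the
# Basic Multilingual Plane survives a round trip.
emoji = "\U0001F600"
roundtrip_ok = emoji.encode("gb18030").decode("gb18030") == emoji
```

The savings apply to the text itself. Markup, scripts and URLs on a web page are mostly ASCII, which takes one byte per character in UTF-8 as well, so the saving on a whole page is smaller than these ratios suggest.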

Popularity for local text files

Local storage on computers has considerably more use of "legacy" single-byte encodings than on the web. Attempts to update to UTF-8 have been blocked by editors that do not display or write UTF-8 unless the first character in a file is a byte order mark, making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output. UTF-16 files are also fairly common on Windows, but not on other systems.^[15]^[16]
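
The byte order mark handling described above (ignore it on input, add it on output) can be sketched in Python with the standard `utf-8-sig` codec. The sample text is an arbitrary assumption.

```python
import codecs

payload = "naïve".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload  # bytes EF BB BF, then the text

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present and accepts its absence.
stripped = with_bom.decode("utf-8-sig")
no_bom = payload.decode("utf-8-sig")

# Encoding with "utf-8-sig" writes the BOM first.
written = "naïve".encode("utf-8-sig")
```

Software that reads with the plain codec sees an extra invisible character at the start of the file, which is the interoperability problem the paragraph refers to.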

Popularity internally in software

In the memory of a computer program, usage of UTF-16 is very common, particularly in Windows but also in cross-platform languages and libraries such as JavaScript, Python, and Qt. Compatibility with the Windows API is a major reason for this. Non-Windows libraries written in the early days of Unicode also tend to use UTF-16, such as International Components for Unicode.^[17]

At one time it was believed by many (and is still believed today by some) that having fixed-size code units offers computational advantages, which led many systems, in particular Windows, to use the fixed-size UCS-2 with two bytes per character. This is false: strings are almost never randomly accessed, and sequential access is the same speed in both variable- and fixed-size encodings. In addition, even UCS-2 was not "fixed size" if combining characters are considered, and when Unicode exceeded 65536 code points it had to be replaced with the non-fixed-size UTF-16 anyway.
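
Both caveats about "fixed size" can be seen in a minimal Python sketch. The helper `utf16_code_units` is a hypothetical name introduced here; it counts 16-bit code units by encoding to UTF-16 without a byte order mark.

```python
def utf16_code_units(text):
    """Count the 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2

# 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single
# character but is two code points, hence two code units even in UCS-2.
combining = "e\u0301"

# U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it.
emoji = "\U0001F600"

units_plain = utf16_code_units("A")
units_combining = utf16_code_units(combining)
units_emoji = utf16_code_units(emoji)
```

In neither case does the index of a code unit correspond to the index of a user-visible character, so a two-byte unit does not provide constant-time access to characters.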

Recently it has become clear that the overhead of translating from and to UTF-8 on input and output, and of dealing with potential encoding errors in the input UTF-8, overwhelms any benefits UTF-16 could offer. So newer software systems are starting to use UTF-8. The default string primitive in newer programming languages, such as Go,^[18] Julia, Rust and Swift 5,^[19] assumes UTF-8 encoding. PyPy also uses UTF-8 for its strings,^[20] and Python is looking into storing all strings in UTF-8.^[21] Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface.^[22]
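
As an illustration of the encoding errors mentioned above, the following Python sketch shows three standard ways a decoder can treat malformed UTF-8 input. The input bytes are an assumption made up for the example, and the sketch is not a description of how any particular system named above handles such input.

```python
raw = b"abc\xffdef"  # the byte 0xFF never occurs in valid UTF-8

# Strict decoding (the default) rejects the input outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the bad byte, losing information.
replaced = raw.decode("utf-8", errors="replace")

# "surrogateescape" maps the bad byte to a lone surrogate so that the
# original bytes can be restored exactly on output.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

A program that keeps text as UTF-8 bytes internally can pass such input through unchanged, whereas one that converts to another internal form has to choose one of these strategies at the boundary.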

References

^ Davis, Mark (2012-02-03). "Unicode over 60 percent of the web". Official Google Blog. Archived from the original on 2018-08-09. Retrieved 2020-07-24.
^ Davis, Mark (2008-05-05). "Moving to Unicode 5.1". Official Google Blog. Retrieved 2023-03-13.
^ "Usage Survey of Character Encodings broken down by Ranking". W3Techs. November 2025. Retrieved 2025-11-05.
^ "Usage statistics and market share of ASCII for websites". W3Techs. November 2024. Retrieved 2024-11-20.
^ "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2025-09-12.
^ "Distribution of Character Encodings among websites that use .cn". w3techs.com. Retrieved 2021-11-01.
^ "Distribution of Character Encodings among websites that use Chinese". w3techs.com. Retrieved 2021-11-01.
^ The Chinese standard GB 2312, together with its extension GBK (both are interpreted by web browsers as GB 18030, which supports the same letters as UTF-8).
^ "Distribution of Character Encodings among websites that use Taiwan". w3techs.com. Retrieved 2025-11-05.
^ "Distribution of Character Encodings among websites that use .ru". w3techs.com. Retrieved 2025-10-09.
^ "Distribution of Character Encodings among websites that use Greek". w3techs.com. Retrieved 2024-01-01.
^ "Distribution of Character Encodings among websites that use Hebrew". w3techs.com. Retrieved 2024-02-02.
^ "Historical trends in the usage of character encodings". Retrieved 2024-07-03.
^ "UTF-8 Usage Statistics". BuiltWith. Retrieved 2011-03-28.
^ "Charset". Android Developers. Retrieved 2021-01-02. Android note: The Android platform default is always UTF-8.
^ Galloway, Matt (9 October 2012). "Character encoding for iOS developers. Or UTF-8 what now?". www.galloway.me.uk. Retrieved 2021-01-02. in reality, you usually just assume UTF-8 since that is by far the most common encoding.
^ "ICU Documentation: UTF-8".
^ "The Go Programming Language Specification". Retrieved 2021-02-10.
^ Tsai, Michael J. "Michael Tsai - Blog - UTF-8 String in Swift 5". Retrieved 2021-03-15.
^ Mattip (2019-03-24). "PyPy Status Blog: PyPy v7.1 released; now uses utf-8 internally for unicode strings". PyPy Status Blog. Retrieved 2020-11-21.
^ "PEP 623 -- Remove wstr from Unicode". Python.org. Retrieved 2020-11-21. Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy.
^ "Use the Windows UTF-8 code page". UWP applications. docs.microsoft.com. Retrieved 2020-06-06.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Popularity_of_text_encodings&oldid=1320488461"
