
Unicode

From Wikipedia, the free encyclopedia

Character encoding standard

Unicode
Logo of the Unicode Consortium
Alias(es)	Universal Coded Character Set (UCS); ISO/IEC 10646
Language(s)	168 scripts (list)
Standard	Unicode Standard
Encoding formats	UTF-8, UTF-16, GB18030; UTF-32, BOCU, SCSU, UTF-EBCDIC (uncommon); UTF-7, UTF-1 (obsolete)
Preceded by	ISO/IEC 8859, among others

This article contains uncommon Unicode characters. Without proper rendering support, you may see question marks, boxes, or other symbols.

Unicode (also known as The Unicode Standard and TUS^[1]^[2]) is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0^[A] defines 154,998 characters and 168 scripts^[3] used in various ordinary, literary, academic, and technical contexts.

Unicode has largely supplanted the earlier patchwork of incompatible character sets used within different locales and on different computer architectures. The entire repertoire of these sets, plus many additional characters, was merged into the single Unicode set. Unicode is used to encode the vast majority of text on the Internet, including most web pages, and relevant Unicode support has become a common consideration in contemporary software development. Unicode is ultimately capable of encoding more than 1.1 million characters.
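The figure of more than 1.1 million can be checked with a little arithmetic. The sketch below assumes the usual layout of 17 planes of 65,536 code points each (a detail not spelled out in the paragraph above) and subtracts the 2,048 surrogate code points, which are reserved for UTF-16 and cannot encode characters on their own.

```python
import sys

PLANES = 17                  # assumption: planes 0 through 16
POINTS_PER_PLANE = 0x10000   # 65,536 code points per plane
SURROGATES = 2048            # U+D800..U+DFFF, reserved for UTF-16

codespace = PLANES * POINTS_PER_PLANE   # every code point, U+0000..U+10FFFF
scalar_values = codespace - SURROGATES  # code points usable for characters

print(codespace)             # 1114112
print(scalar_values)         # 1112064
print(hex(sys.maxunicode))   # 0x10ffff, the highest code point Python accepts
```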

The Unicode character repertoire is synchronized with ISO/IEC 10646, each being code-for-code identical with one another. However, The Unicode Standard is more than just a repertoire within which characters are assigned. To aid developers and designers, the standard also provides charts and reference data, as well as annexes explaining concepts germane to various scripts, providing guidance for their implementation. Topics covered by these annexes include character normalization, character composition and decomposition, collation, and directionality.^[4]
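As an illustration of two of these topics, the following sketch uses Python's standard `unicodedata` module to show composition, decomposition, and directionality. The sample characters are chosen here for illustration only and do not come from the annexes themselves.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings look the same when rendered but differ code point by code point.
print(composed == decomposed)   # False

# Normalization maps both spellings to a single agreed form.
nfc = unicodedata.normalize("NFC", decomposed)   # composed form
nfd = unicodedata.normalize("NFD", composed)     # decomposed form
print(nfc == composed)     # True
print(nfd == decomposed)   # True

# Directionality: a Latin letter is left-to-right, a Hebrew letter right-to-left.
latin_dir = unicodedata.bidirectional("a")
hebrew_dir = unicodedata.bidirectional("\u05d0")
print(latin_dir, hebrew_dir)   # L R
```

Comparing text only after normalizing both sides to the same form avoids treating visually identical strings as different.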

Unicode encodes 3,790 emoji, whose continued development is conducted by the Consortium as part of the standard.^[5] The widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan.^[citation needed]

Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the standard's abstracted codes for characters into sequences of bytes. The Unicode Standard itself defines three encodings: UTF-8, UTF-16,^[a] and UTF-32, though several others exist. UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII.
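To make the distinction between code points and bytes concrete, the sketch below encodes three sample characters with Python's built-in codecs. The characters are illustrative choices, and the big-endian codec variants are used only so that no byte order mark appears in the output.

```python
# U+0041 (A), U+20AC (EURO SIGN), U+1F600 (an emoji outside the BMP)
text = "A\u20ac\U0001f600"
codecs = ("utf-8", "utf-16-be", "utf-32-be")

# Map each code point label to its byte sequence (as hex) in each encoding.
encoded = {
    f"U+{ord(ch):04X}": tuple(ch.encode(name).hex() for name in codecs)
    for ch in text
}

for label, forms in encoded.items():
    print(label, *forms)
# U+0041 41 0041 00000041
# U+20AC e282ac 20ac 000020ac
# U+1F600 f09f9880 d83dde00 0001f600

# ASCII compatibility: an ASCII character has the same single byte in UTF-8.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
print(ascii_compatible)   # True
```

UTF-8 uses one to four bytes per code point, UTF-16 uses two or four (a surrogate pair for U+1F600), and UTF-32 always uses four.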

Version	Date	Publication (book, text)	UCS edition	Scripts	Characters^[b]	Details
1.0.0^[22]	October 1991	ISBN 0-201-56788-1 (vol. 1)	—	24	7129	Initial scripts covered: Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek and Coptic, Gujarati, Gurmukhi, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Odia, Tamil, Telugu, Thai, and Tibetan
1.0.1^[23]	June 1992	ISBN 0-201-60845-6 (vol. 2)	—	25	28327 (+21204, −6)	The initial 20,902 CJK Unified Ideographs
1.1^[24]	June 1993	—	ISO/IEC 10646-1:1993^[c]	24	34168 (+5963, −9)	33 reclassified as control characters. 4,306 Hangul syllables, Tibetan removed
2.0^[25]	July 1996	ISBN 0-201-48345-9		25	38885 (+11373, −6656)	Original set of Hangul syllables removed, new set of 11,172 Hangul syllables added at new location, Tibetan added back in a new location and with a different character repertoire, Surrogate character mechanism defined, Plane 15 and Plane 16 private use area allocated
2.1^[26]	May 1998	—		25	38887 (+2)	U+20AC € EURO SIGN, U+FFFC OBJECT REPLACEMENT CHARACTER^[26]
3.0^[27]	September 1999	ISBN 0-201-61633-5	ISO/IEC 10646-1:2000	38	49194 (+10307)	Cherokee, Geʽez, Khmer, Mongolian, Burmese, Ogham, runes, Sinhala, Syriac, Thaana, Canadian Aboriginal syllabics, and Yi Syllables, Braille patterns
3.1^[28]	March 2001	—	ISO/IEC 10646-1:2000^[d] ISO/IEC 10646-2:2001	41	94140 (+44946)	Deseret, Gothic and Old Italic, sets of symbols for Western and Byzantine music, 42,711 additional CJK Unified Ideographs
3.2^[29]	March 2002	—	ISO/IEC 10646-1:2000^[d] ISO/IEC 10646-2:2001	45	95156 (+1016)	Philippine scripts (Buhid, Hanunoo, Tagalog, and Tagbanwa)
4.0^[30]	April 2003	ISBN 0-321-18578-1	ISO/IEC 10646:2003^[e]	52	96382 (+1226)	Cypriot syllabary, Limbu, Linear B, Osmanya, Shavian, Tai Le, and Ugaritic, Hexagram symbols
4.1^[31]	March 2005	—		59	97655 (+1273)	Buginese, Glagolitic, Kharosthi, New Tai Lue, Old Persian, Sylheti Nagri, and Tifinagh, Coptic disunified from Greek, ancient Greek numbers and musical symbols, first named character sequences were introduced.^[32]
5.0^[33]	July 2006	ISBN 0-321-48091-0		64	99024 (+1369)	Balinese, cuneiform, N'Ko, ʼPhags-pa, Phoenician^[34]
5.1^[35]	April 2008	—		75	100648 (+1624)	Carian, Cham, Kayah Li, Lepcha, Lycian, Lydian, Ol Chiki, Rejang, Saurashtra, Sundanese, and Vai, sets of symbols for the Phaistos Disc, Mahjong tiles, Domino tiles, additions to Burmese, Scribal abbreviations, U+1E9E ẞ LATIN CAPITAL LETTER SHARP S
5.2^[36]	October 2009	ISBN 978-1-936213-00-9		90	107296 (+6648)	Avestan, Bamum, Gardiner's sign list of Egyptian hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham and Tai Viet, additional CJK Unified Ideographs, Jamo for Old Hangul, Vedic Sanskrit
6.0^[37]	October 2010	ISBN 978-1-936213-01-6	ISO/IEC 10646:2010^[f]	93	109384 (+2088)	Batak, Brahmi, Mandaic, playing card symbols, transport and map symbols, alchemical symbols, emoticons and emoji,^[38] additional CJK Unified Ideographs
6.1^[39]	January 2012	ISBN 978-1-936213-02-3	ISO/IEC 10646:2012^[g]	100	110116 (+732)	Chakma, Meroitic cursive, Meroitic hieroglyphs, Miao, Sharada, Sora Sompeng, and Takri
6.2^[40]	September 2012	ISBN 978-1-936213-07-8			110117 (+1)	U+20BA ₺ TURKISH LIRA SIGN
6.3^[41]	September 2013	ISBN 978-1-936213-08-5			110122 (+5)	5 bidirectional formatting characters
7.0^[42]	June 2014	ISBN 978-1-936213-09-2		123	112956 (+2834)	Bassa Vah, Caucasian Albanian, Duployan, Elbasan, Grantha, Khojki, Khudawadi, Linear A, Mahajani, Manichaean, Mende Kikakui, Modi, Mro, Nabataean, Old North Arabian, Old Permic, Pahawh Hmong, Palmyrene, Pau Cin Hau, Psalter Pahlavi, Siddham, Tirhuta, Warang Citi, and dingbats
8.0^[43]	June 2015	ISBN 978-1-936213-10-8	ISO/IEC 10646:2014^[h]	129	120672 (+7716)	Ahom, Anatolian hieroglyphs, Hatran, Multani, Old Hungarian, SignWriting, additional CJK Unified Ideographs, lowercase letters for Cherokee, 5 emoji skin tone modifiers
9.0^[46]	June 2016	ISBN 978-1-936213-13-9	ISO/IEC 10646:2014^[h]	135	128172 (+7500)	Adlam, Bhaiksuki, Marchen, Newa, Osage, Tangut, 72 emoji^[47]
10.0^[48]	June 2017	ISBN 978-1-936213-16-0	ISO/IEC 10646:2017^[i]	139	136690 (+8518)	Zanabazar Square, Soyombo, Masaram Gondi, Nüshu, hentaigana, 7,494 CJK Unified Ideographs, 56 emoji, U+20BF ₿ BITCOIN SIGN
11.0^[49]	June 2018	ISBN 978-1-936213-19-1		146	137374 (+684)	Dogra, Georgian Mtavruli capital letters, Gunjala Gondi, Hanifi Rohingya, Indic Siyaq Numbers, Makasar, Medefaidrin, Old Sogdian and Sogdian, Maya numerals, 5 CJK Unified Ideographs, symbols for xiangqi and star ratings, 145 emoji
12.0^[50]	March 2019	ISBN 978-1-936213-22-1		150	137928 (+554)	Elymaic, Nandinagari, Nyiakeng Puachue Hmong, Wancho, Miao script, hiragana and katakana small letters, Tamil historic fractions and symbols, Lao letters for Pali, Latin letters for Egyptological and Ugaritic transliteration, hieroglyph format controls, 61 emoji
12.1^[51]	May 2019	ISBN 978-1-936213-25-2		150	137929 (+1)	U+32FF ㋿ SQUARE ERA NAME REIWA
13.0^[52]	March 2020	ISBN 978-1-936213-26-9	ISO/IEC 10646:2020^[53]	154	143859 (+5930)	Chorasmian, Dhives Akuru, Khitan small script, Yezidi, 4,969 CJK ideographs, Arabic script additions used to write Hausa, Wolof, and other African languages, additions used to write Hindko and Punjabi in Pakistan, Bopomofo additions used for Cantonese, Creative Commons license symbols, graphic characters for compatibility with teletext and home computer systems, 55 emoji
14.0^[54]	September 2021	ISBN 978-1-936213-29-0		159	144697 (+838)	Toto, Cypro-Minoan, Vithkuqi, Old Uyghur, Tangsa, extended IPA, Arabic script additions for use in languages across Africa and in Iran, Pakistan, Malaysia, Indonesia, Java, and Bosnia, additions for honorifics and Quranic use, additions to support languages in North America, the Philippines, India, and Mongolia, U+20C0 ⃀ SOM SIGN, Znamenny musical notation, 37 emoji
15.0^[55]	September 2022	ISBN 978-1-936213-32-0		161	149186 (+4489)	Kawi and Mundari, 20 emoji, 4,192 CJK ideographs, control characters for Egyptian hieroglyphs
15.1^[56]	September 2023	ISBN 978-1-936213-33-7		161	149813 (+627)	Additional CJK ideographs
16.0^[57]	September 2024	ISBN 978-1-936213-34-4		168	154998 (+5185)	Garay, Gurung Khema, Kirat Rai, Ol Onal, Sunuwar, Todhri, Tulu-Tigalari

General Category (Unicode Character Property)^[a]
Value	Category Major, minor	Basic type^[b]	Character assigned^[b]	Count^[c] (as of 16.0)	Remarks

L, Letter; LC, Cased Letter (Lu, Ll, and Lt only)^[d]
Lu	Letter, uppercase	Graphic	Character	1,858
Ll	Letter, lowercase	Graphic	Character	2,258
Lt	Letter, titlecase	Graphic	Character	31	Ligatures or digraphs containing an uppercase followed by a lowercase part (e.g., ǅ, ǈ, ǋ, and ǲ)
Lm	Letter, modifier	Graphic	Character	404	A modifier letter
Lo	Letter, other	Graphic	Character	136,477	An ideograph or a letter in a unicase alphabet
M, Mark
Mn	Mark, nonspacing	Graphic	Character	2,020
Mc	Mark, spacing combining	Graphic	Character	468
Me	Mark, enclosing	Graphic	Character	13
N, Number
Nd	Number, decimal digit	Graphic	Character	760	All these, and only these, have Numeric Type = De^[e]
Nl	Number, letter	Graphic	Character	236	Numerals composed of letters or letterlike symbols (e.g., Roman numerals)
No	Number, other	Graphic	Character	915	E.g., vulgar fractions, superscript and subscript digits, vigesimal digits
P, Punctuation
Pc	Punctuation, connector	Graphic	Character	10	Includes spacing underscore characters such as "_", and other spacing tie characters. Unlike other punctuation characters, these may be classified as "word" characters by regular expression libraries.^[f]
Pd	Punctuation, dash	Graphic	Character	27	Includes several hyphen characters
Ps	Punctuation, open	Graphic	Character	79	Opening bracket characters
Pe	Punctuation, close	Graphic	Character	77	Closing bracket characters
Pi	Punctuation, initial quote	Graphic	Character	12	Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage
Pf	Punctuation, final quote	Graphic	Character	10	Closing quotation mark. May behave like Ps or Pe depending on usage
Po	Punctuation, other	Graphic	Character	640
S, Symbol
Sm	Symbol, math	Graphic	Character	950	Mathematical symbols (e.g., +, −, =, ×, ÷, √, ∊, ≠). Does not include parentheses and brackets, which are in categories Ps and Pe. Also does not include !, *, -, or /, which despite frequent use as mathematical operators, are primarily considered to be "punctuation".
Sc	Symbol, currency	Graphic	Character	63	Currency symbols
Sk	Symbol, modifier	Graphic	Character	125
So	Symbol, other	Graphic	Character	7,376
Z, Separator
Zs	Separator, space	Graphic	Character	17	Includes the space, but not TAB, CR, or LF, which are Cc
Zl	Separator, line	Format	Character	1	Only U+2028 LINE SEPARATOR (LSEP)
Zp	Separator, paragraph	Format	Character	1	Only U+2029 PARAGRAPH SEPARATOR (PSEP)
C, Other
Cc	Other, control	Control	Character	65 (will never change)^[e]	No name,^[g] <control>
Cf	Other, format	Format	Character	170	Includes the soft hyphen, joining control characters (ZWNJ and ZWJ), control characters to support bidirectional text, and language tag characters
Cs	Other, surrogate	Surrogate	Not (only used in UTF-16)	2,048 (will never change)^[e]	No name,^[g] <surrogate>
Co	Other, private use	Private-use	Character (but no interpretation specified)	137,468 total (will never change)^[e] (6,400 in BMP, 131,068 in Planes 15–16)	No name,^[g] <private-use>
Cn	Other, not assigned	Noncharacter	Not	66 (will not change unless the range of Unicode code points is expanded)^[e]	No name,^[g] <noncharacter>
Cn	Other, not assigned	Reserved	Not	819,467	No name,^[g] <reserved>
^[a] "Table 4-4: General Category". The Unicode Standard. Unicode Consortium. September 2024.
^[b] "Table 2-3: Types of code points". The Unicode Standard. Unicode Consortium. September 2024.
^[c] "DerivedGeneralCategory.txt". The Unicode Consortium. 2024-04-30.
^[d] "5.7.1 General Category Values". UTR #44: Unicode Character Database. Unicode Consortium. 2024-08-27.
^[e] Unicode Character Encoding Stability Policies: Property Value Stability. Stability policy: Some gc groups will never change. gc=Nd corresponds with Numeric Type=De (decimal).
^[f] "Annex C: Compatibility Properties (§ word)". Unicode Regular Expressions. Version 23. Unicode Consortium. 2022-02-08. Unicode Technical Standard #18.
^[g] "Table 4-9: Construction of Code Point Labels". The Unicode Standard. Unicode Consortium. September 2024. A Code Point Label may be used to identify a nameless code point. E.g. <control-hhhh>, <control-0088>. The Name remains blank, which can prevent inadvertently replacing, in documentation, a Control Name with a true Control code. Unicode also uses <not a character> for <noncharacter>.
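The General Category values in the table can be looked up programmatically. The sketch below uses `unicodedata.category` from Python's standard library. The sample characters are picked here to cover several rows of the table, and the results depend on the Unicode version bundled with the Python interpreter, which may be older than 16.0.

```python
import unicodedata

# Sample characters (illustrative choices) paired with a short description.
samples = [
    ("A", "Latin capital letter"),
    ("a", "Latin small letter"),
    ("\u01c5", "titlecase digraph"),
    ("5", "decimal digit"),
    ("\u2167", "Roman numeral eight"),
    ("\u00bd", "vulgar fraction one half"),
    ("_", "underscore"),
    ("-", "hyphen-minus"),
    ("(", "opening bracket"),
    ("!", "exclamation mark"),
    ("+", "plus sign"),
    ("$", "dollar sign"),
    (" ", "space"),
    ("\t", "tab"),
    ("\u2028", "line separator"),
    ("\u00ad", "soft hyphen"),
    ("\ue000", "private-use code point"),
]

# Map each character to its two-letter General Category value.
categories = {ch: unicodedata.category(ch) for ch, _ in samples}

for ch, description in samples:
    print(f"U+{ord(ch):04X} {categories[ch]} {description}")
```

As the remarks in the table note, `!` comes back as Po (punctuation) rather than Sm, while `+` is Sm.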

Row	Cells	Range(s)
00	20–7E	Basic Latin (00–7F)
00	A0–FF	Latin-1 Supplement (80–FF)
01	00–13, 14–15, 16–2B, 2C–2D, 2E–4D, 4E–4F, 50–7E, 7F	Latin Extended-A (00–7F)
01	8F, 92, B7, DE–EF, FA–FF	Latin Extended-B (80–FF ...)
02	18–1B, 1E–1F	Latin Extended-B (... 00–4F)
02	59, 7C, 92	IPA Extensions (50–AF)
02	BB–BD, C6, C7, C9, D6, D8–DB, DC, DD, DF, EE	Spacing Modifier Letters (B0–FF)
03	74–75, 7A, 7E, 84–8A, 8C, 8E–A1, A3–CE, D7, DA–E1	Greek (70–FF)
04	00–5F, 90–91, 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9	Cyrillic (00–FF)
1E	02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, 80–85, 9B, F2–F3	Latin Extended Additional (00–FF)
1F	00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE	Greek Extended (00–FF)
20	13–14, 15, 17, 18–19, 1A–1B, 1C–1D, 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E, 44, 4A	General Punctuation (00–6F)
20	7F, 82	Superscripts and Subscripts (70–9F)
20	A3–A4, A7, AC, AF	Currency Symbols (A0–CF)
21	05, 13, 16, 22, 26, 2E	Letterlike Symbols (00–4F)
21	5B–5E	Number Forms (50–8F)
21	90–93, 94–95, A8	Arrows (90–FF)
22	00, 02, 03, 06, 08–09, 0F, 11–12, 15, 19–1A, 1E–1F, 27–28, 29, 2A, 2B, 48, 59, 60–61, 64–65, 82–83, 95, 97	Mathematical Operators (00–FF)
23	02, 0A, 20–21, 29–2A	Miscellaneous Technical (00–FF)
25	00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C	Box Drawing (00–7F)
25	80, 84, 88, 8C, 90–93	Block Elements (80–9F)
25	A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6	Geometric Shapes (A0–FF)
26	3A–3C, 40, 42, 60, 63, 65–66, 6A, 6B	Miscellaneous Symbols (00–FF)
F0	(01–02)	Private Use Area (00–FF ...)
FB	01–02	Alphabetic Presentation Forms (00–4F)
FF	FD	Specials
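One way to read the table is to treat the row as the high byte and each cell as the low byte of a code point in the Basic Multilingual Plane. That reading is an assumption inferred from the block ranges in the third column, not something the table states. The helper `expand_cells` below is a hypothetical name introduced for this sketch, and it handles only plain cells and `lo–hi` ranges, not the parenthesized Private Use Area entry.

```python
def expand_cells(row: str, cells: str) -> list[int]:
    """Expand a hex row and a comma-separated cell list into code points."""
    base = int(row, 16) << 8   # assumption: the row is the high byte
    points = []
    for part in cells.split(","):
        # Accept both an en dash and an ASCII hyphen as the range separator.
        lo, _, hi = part.strip().replace("–", "-").partition("-")
        first, last = int(lo, 16), int(hi or lo, 16)
        points.extend(range(base + first, base + last + 1))
    return points


basic_latin = expand_cells("00", "20–7E")
currency = expand_cells("20", "A3–A4, A7, AC, AF")

print(len(basic_latin))                        # 95
print([f"U+{cp:04X}" for cp in currency])      # ['U+20A3', 'U+20A4', 'U+20A7', 'U+20AC', 'U+20AF']
print(chr(currency[3]))                        # €
```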

