Unicode input

From Wikipedia, the free encyclopedia

Input characters using their Unicode code points

The KCharSelect character mapping tool shown displaying a subset of the Unicode Mathematical Operators

Unicode input is a method of entering specific characters that are not directly available on a physical keyboard. Characters can be entered by selecting them from a display, by typing a certain sequence or "chord" of keys on a physical keyboard, or by drawing the symbol by hand on a touch-sensitive screen. In contrast to ASCII's 128-code-point character set (which it contains), Unicode encodes well over one hundred thousand characters from almost all of the world's written languages, as well as many other signs and symbols.^[1]

A comprehensive Unicode input system must provide for a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout, which defines keys and their combinations only for a limited number of characters appropriate for a certain locale. Nevertheless, the latter technique (particularly if an "extended" setting is used) is often adequate to satisfy the needs of most users, most of the time.

Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310, which are "®" and "𝌐" respectively. Characters in the Basic Multilingual Plane (BMP), containing modern scripts – including many Chinese and Japanese characters – and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, emojis, playing cards and many CJK characters) have 5-digit codes.
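The notation can be illustrated with a short Python sketch. The helper names `format_code_point` and `in_bmp` are chosen here for illustration only and do not come from any standard API.

```python
def format_code_point(ch: str) -> str:
    """Return the conventional U+ notation for a single character."""
    # At least four hexadecimal digits, zero-padded; longer values
    # simply use five or six digits.
    return f"U+{ord(ch):04X}"


def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


print(format_code_point("\u00AE"), in_bmp("\u00AE"))          # U+00AE True
print(format_code_point("\U0001D310"), in_bmp("\U0001D310"))  # U+1D310 False
```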

Grapheme availability

An application can display a character only if it can access a computer font which contains a glyph for that character.^[2] Fonts usually have incomplete Unicode coverage; most contain only the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. Which fonts are used for fallback and the thoroughness of Unicode coverage vary by software and operating system; some software will search for a suitable glyph in all of the installed fonts, while other software searches only within certain fonts.

If an application does not have access to a glyph for a required code point in the specified font,^[a] the character should be shown as the font's .notdef glyph ⟨􏿮⟩.^[3] This often appears as an empty box, ☐ (nicknamed "tofu" based on the shape), a box with an X in it, ☒, a diamond with a question mark, �, or a box with a question mark in it, ⍰.

Techniques

Main article: Input method

Extended keyboard mapping

Further information: QWERTY § Multilingual variants, and AZERTY § Dead keys

Most operating systems support an "extended keyboard mapping" system setting, a facility that increases the repertoire of characters – typically precomposed characters, typographic symbols or punctuation marks, and other specialist graphemes – beyond those enabled by the keyboard and keyboard mapping provided by default with the computer. The best-known techniques include:

Alternate graphic (AltGr or right-Alt key), which gives a third and fourth meaning to every key;
Compose key (sometimes called multi key), a key on a computer keyboard that indicates that the following (usually two or more) keystrokes trigger the insertion of an alternate character;^[4]
A dead key, a normal key that is repurposed to affect the behaviour of the next key pressed. It is typically used to attach a specific diacritic to a base letter;^[5]

or indeed combinations of these. The precise methods depend on the type of physical keyboard and (especially) the mapping used with it.
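The effect of a dead key can be imitated in a few lines of Python. This is only an illustration of attaching a diacritic to a base letter, not how any operating system implements its keyboard mappings. The `DEAD_KEYS` table and the `apply_dead_key` function are hypothetical; the only library call used is the standard `unicodedata.normalize`.

```python
import unicodedata

# Hypothetical dead-key table: the key pressed first, mapped to the
# combining mark it stands for.
DEAD_KEYS = {
    "´": "\u0301",  # COMBINING ACUTE ACCENT
    "`": "\u0300",  # COMBINING GRAVE ACCENT
    "^": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
    "~": "\u0303",  # COMBINING TILDE
    "¨": "\u0308",  # COMBINING DIAERESIS
}


def apply_dead_key(dead_key: str, base: str) -> str:
    """Combine the next key pressed (base) with the pending diacritic."""
    mark = DEAD_KEYS[dead_key]
    # NFC normalization yields the precomposed character when one exists;
    # otherwise the base letter and combining mark stay as two code points.
    return unicodedata.normalize("NFC", base + mark)


print(apply_dead_key("´", "e"))  # é
print(apply_dead_key("~", "n"))  # ñ
```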

Selection from a screen

Many systems provide a way to select Unicode characters visually. ISO/IEC 14755 refers to this as a screen-selection entry method.^[6]

Microsoft Windows provides a Unicode version of the Character Map program, which has been included in consumer editions since Windows XP. It is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block.^[7] Starting with Windows 10, Windows also contains a so-called "emoji keyboard", which can be opened by holding down the Windows key and pressing the period or semicolon key. The emoji keyboard allows emojis as well as other symbols to be entered.^[8]

More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters). On most Linux desktop environments, equivalent tools – such as gucharmap (GNOME) or kcharselect (KDE) – are available.^[9]
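The name search that such tools offer can be approximated with Python's standard `unicodedata` module. The following is a minimal sketch: the function name `search_block` is an assumption made for this example, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def search_block(term: str, first: int, last: int) -> list[str]:
    """List characters in the range first..last whose name contains term."""
    term = term.upper()
    hits = []
    for cp in range(first, last + 1):
        # unicodedata.name() returns the default ("") for unnamed code points.
        name = unicodedata.name(chr(cp), "")
        if term in name:
            hits.append(f"U+{cp:04X} {chr(cp)} {name}")
    return hits


# Restrict the search to the Currency Symbols block (U+20A0 to U+20CF).
for line in search_block("euro sign", 0x20A0, 0x20CF):
    print(line)  # U+20AC € EURO SIGN
```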

Generally these tools let the user "copy" the selected characters into the clipboard, and then paste them into the document, rather than pretending to directly type them.

It is often practical to just find the desired character on the web or in another document, and copy and paste it from there.

Decimal input (Alt codes)

Some programs running in Microsoft Windows, including recent versions of Word and Notepad, can produce characters from their Unicode code points expressed in decimal and entered on the numeric keypad with the Alt key held down.^[10] For example, the Euro sign € has 20AC as its hexadecimal code point,^[11] which is 8364 in decimal, so Alt+8364 will produce the symbol.
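The hexadecimal-to-decimal conversion in this example is easy to check in Python. The helper `alt_code_for` is a hypothetical name used only for this sketch; it merely formats the decimal code point and does not interact with Windows.

```python
def alt_code_for(ch: str) -> str:
    """Return the decimal Alt code that corresponds to a character."""
    return f"Alt+{ord(ch)}"


print(int("20AC", 16))    # 8364
print(alt_code_for("€"))  # Alt+8364
```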

Decimal code points in the range 160–255 must be entered with a leading zero (so that the Windows code page is chosen) and furthermore the Windows code page CP1252 must be used.^[b] For example, Alt+0247 yields a ÷, corresponding to its code point, but the character produced by Alt+247 depends on the OEM code page, such as Code page 437, and may yield a ≈. Also, Alt+0128 through Alt+0159 yield the characters assigned in rows 8 and 9 of the CP1252 layout, rather than the C1 control codes that are assigned to those numbers in Unicode.
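The difference between the two code pages can be reproduced with Python's built-in `cp1252` and `cp437` codecs. The function below is a rough model, not a description of how Windows processes Alt codes: it assumes that CP1252 is the Windows code page and that Code page 437 is the OEM code page, and it handles only the values 128 to 255. The name `alt_code_char` is made up for this sketch.

```python
def alt_code_char(digits: str) -> str:
    """Model the character produced by Alt+<digits> for values 128-255.

    Assumption: a leading zero selects CP1252, otherwise CP437 is used.
    """
    value = int(digits)
    if not 128 <= value <= 255:
        raise ValueError("this sketch only models codes 128 to 255")
    codec = "cp1252" if digits.startswith("0") else "cp437"
    # A few CP1252 bytes (for example 0x81) are unassigned in Python's
    # codec and raise UnicodeDecodeError.
    return bytes([value]).decode(codec)


print(alt_code_char("0247"))  # ÷
print(alt_code_char("247"))   # ≈
print(alt_code_char("0128"))  # €
```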

In programs which were not designed to handle Alt codes over 255, the character retrieved usually corresponds to the remainder when the number is divided by 256.
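As a worked example of this wrap-around, the following sketch computes the value such a program would act on. The function name `legacy_alt_value` is hypothetical; which character the resulting number maps to still depends on the code page in use.

```python
def legacy_alt_value(code: int) -> int:
    """Value used by a program that keeps only the remainder modulo 256."""
    return code % 256


# 8364 (the Euro sign's code point) is 32 * 256 + 172.
print(legacy_alt_value(8364))  # 172
```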

The text editor Vim allows characters to be specified by two-character mnemonics referred to as digraphs.^[12] The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, as decimal 9881 is equal to hexadecimal 2699, dig Gr 9881 associates "Gr" with U+2699 ⚙ GEAR.
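The decimal value needed for such a definition can be derived from the character itself. This small sketch builds the command text in the same form as the example above; `vim_digraph_command` is a hypothetical helper, and the string it returns is meant to be typed into Vim rather than executed by Python.

```python
import unicodedata


def vim_digraph_command(mnemonic: str, ch: str) -> str:
    """Build a digraph definition for a two-character mnemonic."""
    if len(mnemonic) != 2:
        raise ValueError("a digraph mnemonic has exactly two characters")
    # Vim expects the code point in decimal.
    return f"dig {mnemonic} {ord(ch)}"


gear = "\u2699"
print(unicodedata.name(gear), hex(ord(gear)))  # GEAR 0x2699
print(vim_digraph_command("Gr", gear))         # dig Gr 9881
```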

See below for use of decimal code points in HTML.

Hexadecimal input

Clause 5.1 of ISO/IEC 14755 describes a Basic method whereby a beginning sequence is followed by the hex number representation of the code point and the ending sequence. Most modern systems have some method to emulate this, sometimes limited to four digits (thus only the Basic Multilingual Plane).

In Microsoft Windows

Hexadecimal Unicode input can be enabled by adding a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assigning the value data 1 to it. Users will need to log off and back in after editing the registry for this input method to start working. (In versions earlier than Windows Vista, users needed to reboot for it to start working.) Unicode characters can then be entered by holding down Alt, typing + on the numeric keypad followed by the hexadecimal code, and then releasing Alt.^[2] This may not work for 5-digit hexadecimal codes like U+1F937. Some versions of Windows may require the digits 0–9 to be typed on the numeric keypad or require NumLock to be on.

In some applications (Word, Notepad and LibreOffice programs), Alt+X will replace the hexadecimal number to the left of the cursor with the matching Unicode character. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or by the letters a–f, as they may be treated as part of the code to be converted. For example, entering af1 followed by Alt+X (or Alt+C if using a French version) will produce U+0AF1 ૱ GUJARATI RUPEE SIGN, but entering a0000f1 followed by Alt+X will produce "añ" (U+0061 a LATIN SMALL LETTER A followed by U+00F1 ñ LATIN SMALL LETTER N WITH TILDE).
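The two examples above are consistent with a simple rule: take at most six hexadecimal digits immediately before the cursor and convert them. The sketch below implements only that simplified rule as an assumption. The actual logic in Word, Notepad or LibreOffice may differ in edge cases, and the function name `alt_x` is invented for illustration.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def alt_x(text: str) -> str:
    """Replace the hex number at the end of text with its character.

    Simplified model: at most six trailing hex digits are consumed.
    """
    run = 0
    while run < len(text) and run < 6 and text[-1 - run] in HEX_DIGITS:
        run += 1
    if run == 0:
        return text  # nothing to convert
    code_point = int(text[-run:], 16)
    if code_point > 0x10FFFF:
        return text  # beyond the Unicode range; leave the text unchanged
    return text[:-run] + chr(code_point)


print(alt_x("af1"))      # ૱ (U+0AF1)
print(alt_x("a0000f1"))  # añ
```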

This facility enables Unicode characters to be entered in other applications: one can create a desired character in Notepad, for example, and then cut and paste it wherever desired.

In macOS

Hex input of Unicode must be enabled. In Mac OS 8.5 and later, one can choose the Unicode Hex Input keyboard layout; in OS X (10.10) Yosemite, this can be added in Keyboard → Input Sources.

Holding down ⌥ Option, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the ⌥ Option key.^[13] Characters outside the Basic Multilingual Plane (BMP) exceed the four-digit limit of the Unicode Hex Input mechanism but can be entered by using surrogate pairs: hold down the ⌥ Option key while entering the first surrogate, the +, and the second surrogate, then release the ⌥ Option key.
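The two surrogates for a given code point follow from the standard UTF-16 arithmetic, sketched here in Python. The function name `surrogate_pair` is chosen for this example; U+1F937 is the 5-digit code point mentioned earlier in this article.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Return the UTF-16 (high, low) surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogate pairs")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F937)
print(f"{high:04X} {low:04X}")  # D83E DD37
```

The result can be cross-checked against Python's own UTF-16 encoder: `"\U0001F937".encode("utf-16-be").hex()` gives `d83edd37`.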

In X11 and Wayland (Linux and other Unix variants including ChromeOS)

In many applications one or both of the following methods work to directly input Unicode characters:

Holding Ctrl+⇧ Shift and typing u followed by the hex digits, then releasing Ctrl+⇧ Shift.
Entering Ctrl+⇧ Shift+u, releasing, then typing the hex digits and pressing ↵ Enter (or Space or even, on some systems, pressing and releasing ⇧ Shift or Ctrl).^[14]

This is supported by GTK and Qt applications, and possibly others. In ChromeOS, this is an operating system function.^[14]

In platform-independent applications

In Emacs, Ctrl+x 8 Return invokes the insert-char command, which accepts either a hexadecimal code point or a Unicode character name.
In LibreOffice 5.1 onwards, the Alt+X method described above for Windows works.
In Opera versions that use the Presto layout engine (that is, up to and including version 12.xx), entering the hexadecimal number of the desired symbol or character and then pressing Ctrl+⇧ Shift+x (alternative shortcut Meta+⇧ Shift++x on macOS) converts it into the character.
In the Vim editor, in insert mode, the user first types Ctrl+V u (for code points up to 4 hex digits long; Ctrl+V ⇧ Shift+U for longer ones), then types the hexadecimal number of the desired symbol or character, which is then converted into the symbol. (On Microsoft Windows, Ctrl+Q may be required instead of Ctrl+V.^[15])
In AutoCAD, a character is entered as \U+ followed by its hexadecimal code point (for example \U+2C72), or with one of three shortcuts: %%c, %%d, %%p.

HTML

Main article: Numeric character reference

In HTML and XML, character codes to be rendered as characters are prefixed by an ampersand and number sign (&#), and are followed by a semicolon (;). The code point can be given either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may be represented by a named entity.

This works in many pieces of software that accept HTML markup, such as Thunderbird and Wikipedia editing.
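Both forms of numeric character reference can be generated and decoded with a few lines of Python. The helper `numeric_references` is a hypothetical name; `html.unescape` is part of the Python standard library and also resolves named entities such as `&euro;`.

```python
import html


def numeric_references(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


dec_ref, hex_ref = numeric_references("€")
print(dec_ref, hex_ref)                          # &#8364; &#x20AC;
print(html.unescape("&#8364; &#x20AC; &euro;"))  # € € €
```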

Notes

^[a] Or in fallback font(s), if any.
^[b] CP1252 is the default in North and South America including the Caribbean islands, Western Europe, Central and Southern Africa, Australia, New Zealand, and the (former) European colonies and possessions in Oceania.

References

^[1] Lafontaine, Sylvain (February 17, 2012). "Unicode vs ASCII difference and benefits". MSDN. Archived from the original on 21 January 2022. Retrieved 28 February 2014.
^[2] Andrew Marcuse, "How to enter Unicode characters in Microsoft Windows". Access date: September 13, 2012.
^[3] "Recommendations for OpenType Fonts". Microsoft. "Glyph 0 must be assigned to a .notdef glyph. The .notdef glyph is very important for providing the user feedback that a glyph is not found in the font. This glyph should not be left without an outline as the user will only see what looks like a space if a glyph is missing and not be aware of the active font's limitation."
^[4] "Linux Keyboard Text Symbols: Compose-Key Shortcuts". FSymbols. 2013-07-24. Retrieved 2015-07-07.
^[5] "Dead Key | Definition of Dead Key by Merriam-Webster". Merriam-webster.com. Retrieved 2017-05-01.
^[6] "ISO/IEC 14755:1997 Information technology -- Input methods to enter characters from the repertoire of ISO/IEC 10646 with a keyboard or other input device". ISO. Retrieved 2017-10-14.
^[7] "How to Use Special Characters in Windows Documents". support.microsoft.com. Jul 31, 2019. Retrieved 2020-10-17.
^[8] "Windows 10 Tip: Get started with the emoji keyboard shortcut". blogs.windows.com. Feb 5, 2018. Retrieved 2024-06-04.
^[9] Peck, Akkana (2009-11-25). "Mastering Characters Sets in Linux (Weird Characters, part 2)". LinuxPlanet. Archived from the original on 2010-11-26. Retrieved 2018-12-05.
^[10] "Insert ASCII or Unicode character codes in Word - Microsoft Support". support.microsoft.com. Retrieved 2025-02-08.
^[11] "Currency symbols". Unicode Consortium. Retrieved 2025-02-08.
^[12] "ii.com: Vim, Unicode, and Digraphs". www.ii.com. Retrieved 2025-02-08.
^[13] "Typing special and accented characters". Archived 2008-03-09 at the Wayback Machine.
^[14] Jack Busch (April 20, 2018). "Type Special Characters with a Chromebook (Accents, Symbols, Em Dashes)". groovypost.com. Retrieved February 28, 2020.
^[15] Vim documentation: gui_w32.
