UTF-16
UTF-16 is a character encoding standard for Unicode. It encodes each Unicode code point using either one or two code units. Each code unit is a 16-bit value.
Code points whose values are less than 2^16 (0x10000) are encoded as a single code unit that is numerically equal to the code point's value. These code points make up the Basic Multilingual Plane (BMP), which includes the most common characters, such as Latin, Greek, Cyrillic, and many East Asian characters.
For example, the Latin character "A" is assigned the code point U+0041 in Unicode, and this is represented in UTF-16 as the single code unit 0x0041.
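Code points whose values are greater than or equal to 2^16 (0x10000) are encoded using a pair of code units, which is called a surrogate pair. The values used for surrogate pairs (0xD800 to 0xDFFF) are reserved and are never assigned to Unicode characters, which avoids ambiguity between a surrogate and a single-unit BMP character.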
For example, the emoji character "🦊" (Fox Face) is assigned the code point U+1F98A in Unicode, and this is represented in UTF-16 as the surrogate pair d83e dd8a.
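The arithmetic behind the two examples above can be sketched as follows. The helper name `toUtf16CodeUnits` is hypothetical (it is not a built-in), and the sketch assumes its input is a valid code point between 0 and 0x10FFFF that does not itself fall in the surrogate range.

```js
// Hypothetical helper (not a built-in): returns the UTF-16 code units for one code point.
// Assumes codePoint is valid: 0..0x10FFFF, excluding the surrogate range 0xD800..0xDFFF.
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    // BMP: the single code unit is numerically equal to the code point.
    return [codePoint];
  }
  const offset = codePoint - 0x10000; // a 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits go into the high surrogate
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits go into the low surrogate
  return [high, low];
}

// Format code units as space-separated hexadecimal.
const toHex = (units) => units.map((unit) => unit.toString(16)).join(" ");

console.log(toHex(toUtf16CodeUnits(0x41))); // 41
console.log(toHex(toUtf16CodeUnits(0x1f98a))); // d83e dd8a
```

In practice, `String.fromCodePoint()` performs this conversion when building a string; the sketch only makes the bit manipulation visible.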
UTF-16 in JavaScript
Strings in JavaScript are represented using UTF-16, and many String APIs operate on code units, not code points. For example, String.length returns 2 for a string containing a single Unicode character which is not in the BMP:
```js
const string = "🦊"; // U+1F98A
console.log(string.length); // 2
```
The String.charCodeAt() method returns the code unit at the given index, and the String.codePointAt() method returns the code point at the given index:
```js
const string = "🦊"; // U+1F98A
console.log(string.charCodeAt(0).toString(16)); // d83e
console.log(string.charCodeAt(1).toString(16)); // dd8a
console.log(string.codePointAt(0).toString(16)); // 1f98a
```
See "UTF-16 characters, Unicode code points, and grapheme clusters" to learn more about working with UTF-16 strings in JavaScript.
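Because `length` and index-based methods count code units, walking a string by code point usually relies on the string iterator instead. The following is a minimal sketch; the variable names and the sample string are illustrative only.

```js
const text = "A🦊"; // one BMP character followed by one non-BMP character

// Code-unit view: length counts 16-bit units, so the fox contributes 2.
console.log(text.length); // 3

// Code-point view: the string iterator keeps each surrogate pair together.
const codePoints = Array.from(text, (char) => char.codePointAt(0).toString(16));
console.log(codePoints); // [ '41', '1f98a' ]

// codePointAt() takes a code-unit index; pointing it at the second half
// of a pair returns that lone low surrogate rather than the full code point.
console.log(text.codePointAt(2).toString(16)); // dd8a
```

Note that a single code point is still not always what a reader perceives as one character, since some characters are built from several code points (grapheme clusters).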
UTF-16 and UTF-8
UTF-8 is an alternative encoding for Unicode, which uses one to four bytes for each Unicode code point. UTF-8 is a much more common encoding for documents on the Web than UTF-16.
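To compare the storage cost of the two encodings, the sketch below uses the standard `TextEncoder` API, which always produces UTF-8, and derives the UTF-16 size from the number of code units. The helper name `byteSizes` is hypothetical, and the UTF-16 figure assumes no byte order mark is written.

```js
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: byte counts for a string in each encoding.
function byteSizes(str) {
  return {
    utf8: encoder.encode(str).length,
    utf16: str.length * 2, // two bytes per 16-bit code unit, no byte order mark
  };
}

console.log(byteSizes("A")); // { utf8: 1, utf16: 2 }
console.log(byteSizes("€")); // { utf8: 3, utf16: 2 }
console.log(byteSizes("🦊")); // { utf8: 4, utf16: 4 }
```

ASCII text is half the size in UTF-8, while some BMP characters, such as "€", take fewer bytes in UTF-16.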
UTF-16 and UCS-2
UCS-2 is an obsolete encoding for Unicode. It is the same as UTF-16, except that it does not support surrogate pairs, so it cannot encode code points outside the BMP.
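As an illustration of that limitation, the following sketch checks whether every code point in a string lies inside the BMP, and so could be stored one code unit per character. The function name `fitsInUcs2` is hypothetical, and the check deliberately ignores lone surrogates, which it treats like any other 16-bit value.

```js
// Hypothetical helper: true if every code point fits in a single 16-bit code unit.
// Lone surrogates are not rejected here; they pass as ordinary 16-bit values.
function fitsInUcs2(str) {
  for (const char of str) {
    // The string iterator yields whole code points, so pairs arrive joined.
    if (char.codePointAt(0) > 0xffff) {
      return false;
    }
  }
  return true;
}

console.log(fitsInUcs2("Aλж")); // true
console.log(fitsInUcs2("A🦊")); // false
```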