This chapter explains the improved support for Unicode that ECMAScript 6 brings. For a general introduction to Unicode, read Chap. “Unicode and JavaScript” in “Speaking JavaScript”.
There are three areas in which ECMAScript 6 has improved support for Unicode:
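- Unicode code point escapes: \u{···}
- Strings: String.prototype.codePointAt() reads code point values, and String.fromCodePoint() creates a string from code point values.
- Regular expressions: the new flag /u (plus the boolean property unicode) improves handling of surrogate pairs.

Additionally, ES6 is based on Unicode version 5.1.0, whereas ES5 is based on Unicode version 3.0.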
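The three areas listed above can be tried out together on a single character outside the Basic Multilingual Plane. The following is a minimal sketch, not taken from the chapter; the variable names are illustrative, and U+1F600 is just one convenient example of a code point that needs a surrogate pair.

```javascript
// U+1F600 lies outside the BMP, so UTF-16 encodes it
// as two code units (a surrogate pair).
const smiley = '\u{1F600}';

// .length counts UTF-16 code units, not code points.
const codeUnitCount = smiley.length; // 2

// Spreading iterates over the string, which proceeds code point by code point.
const codePointCount = [...smiley].length; // 1

// Read a code point value and create a string from it again.
const value = smiley.codePointAt(0); // 0x1F600
const roundTrip = String.fromCodePoint(value);

// Without /u, the dot matches one code unit; with /u, one code point.
const dotWithoutU = /^.$/.test(smiley); // false
const dotWithU = /^.$/u.test(smiley); // true

// The boolean property that reflects the flag.
const hasUnicodeFlag = /^.$/u.unicode; // true
```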
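There are three parameterized escape sequences for representing characters in JavaScript:

- Hex escape (exactly two hexadecimal digits): \xHH
- Unicode escape (exactly four hexadecimal digits): \uHHHH
- Unicode code point escape (one or more hexadecimal digits): \u{···}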
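As a quick illustration, the three escape forms \xHH, \uHHHH and \u{···} can all denote the same character. The sketch below uses the letter z (U+007A) as an arbitrary example and then shows what only the code point escape can do in one step; the variable names are made up for this example.

```javascript
// Three ways of writing the character "z" (U+007A).
const viaHexEscape = '\x7A';
const viaUnicodeEscape = '\u007A';
const viaCodePointEscape = '\u{7A}';
const allAreZ = [viaHexEscape, viaUnicodeEscape, viaCodePointEscape]
  .every((s) => s === 'z'); // true

// Only \u{···} can name a code point above U+FFFF directly.
// With \uHHHH, the surrogate pair has to be spelled out by hand.
const direct = '\u{1F600}';
const viaSurrogates = '\uD83D\uDE00';
const sameString = direct === viaSurrogates; // true
```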
The escape sequences can be used in the following locations:
| | \uHHHH | \u{···} | \xHH |
|---|---|---|---|
| Identifiers | ✔ | ✔ | |
| String literals | ✔ | ✔ | ✔ |
| Template literals | ✔ | ✔ | ✔ |
| Regular expression literals | ✔ | Only with flag /u | ✔ |
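Two cells of the table above are easy to spot-check at runtime: the empty cell for \xHH in identifiers and the “Only with flag /u” cell. The sketch below is one possible way of doing so. The helper parses() is a hypothetical name introduced here, and it relies on new Function(), which compiles source text without running it but may be blocked in environments with a strict content security policy.

```javascript
// Hypothetical helper: does the snippet parse as a function body?
function parses(source) {
  try {
    new Function(source); // compiles the source, does not execute it
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

// Identifiers: \uHHHH and \u{···} are accepted, \xHH is not.
const identifierChecks = {
  unicodeEscape: parses('var \\u0061bc = 1;'),
  codePointEscape: parses('var \\u{61}bc = 1;'),
  hexEscape: parses('var \\x61bc = 1;'),
};

// Regular expression literals: \u{61} means "a" only with flag /u.
const regexChecks = {
  withFlagU: /^\u{61}$/u.test('a'),
  withoutFlagU: /^\u{61}$/.test('a'),
};
```

Under these assumptions, identifierChecks.hexEscape and regexChecks.withoutFlagU should be false, and the other three entries true.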
Identifiers:
- \uHHHH becomes a single code point.
- \u{···} becomes a single code point.

      > const hello = 123;
      > hell\u{6F}
      123

String literals:

- \xHH contributes a UTF-16 code unit.
- \uHHHH contributes a UTF-16 code unit.
- \u{···} contributes the UTF-16 encoding of its code point (one or two UTF-16 code units).

Template literals:

    > `hell\u{6F}` // cooked
    'hello'
    > String.raw`hell\u{6F}` // raw
    'hell\\u{6F}'

Regular expressions:

- \u{···} is only allowed if the flag /u is set, because otherwise \u{3} is interpreted as three times the character u:

      > /^\u{3}$/.test('uuu')
      true

Various information:
The spec distinguishes between BMP patterns (flag /u not set) and Unicode patterns (flag /u set). Sect. “Pattern Semantics” explains that they are handled differently and how.
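The practical effect of that distinction can be observed without reading the spec. The following sketch (variable names are illustrative) matches the same input against BMP patterns and Unicode patterns; the input is a single code point that UTF-16 encodes as a surrogate pair.

```javascript
// One code point (U+1F600), two UTF-16 code units.
const pair = '\uD83D\uDE00';

// BMP pattern: matching works on code units, so a lone lead
// surrogate is found inside the pair.
const loneSurrogateBmp = /\uD83D/.test(pair); // true

// Unicode pattern: matching works on code points, so the pair is
// one unit and the lone lead surrogate is not found.
const loneSurrogateUnicode = /\uD83D/u.test(pair); // false

// A negated class matches exactly one "character": a code unit
// in a BMP pattern, a code point in a Unicode pattern.
const negatedClassBmp = /^[^x]$/.test(pair); // false
const negatedClassUnicode = /^[^x]$/u.test(pair); // true
```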
As a reminder, here is how grammar rules are parameterized in the spec:
- If a grammar rule R has the subscript [U], then there are two versions of it: R and R_U.
- Parts of the rule can pass on the subscript via [?U].
- If a part of a rule has the prefix [+U], it only exists if the subscript [U] is present.
- If a part of a rule has the prefix [~U], it only exists if the subscript [U] is not present.

You can see this parameterization in action in Sect. “Patterns”, where the subscript [U] creates separate grammars for BMP patterns and Unicode patterns:
- IdentityEscape: In BMP patterns, many characters can be prefixed with a backslash and are then interpreted as themselves (for example, if \u is not followed by four hexadecimal digits, it is interpreted as u). In Unicode patterns that only works for the following characters (which frees up \u for Unicode code point escapes): ^ $ \ . * + ? ( ) [ ] { } |
- RegExpUnicodeEscapeSequence: "\u{" HexDigits "}" is only allowed in Unicode patterns. In those patterns, lead and trail surrogates are also grouped to help with UTF-16 decoding.

Sect. “CharacterEscape” explains how various escape sequences are translated to characters (roughly: either code units or code points).
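The two grammar differences described above have visible consequences, which the sketch below tries to demonstrate. It is an illustration under the assumption of a standard ES6 engine; the variable names are invented for this example. new RegExp() is used for the invalid pattern so that the error is raised at runtime instead of when the file is parsed.

```javascript
// BMP pattern: \u without four hex digits is interpreted as "u".
const identityEscapeBmp = /^\u$/.test('u'); // true

// Unicode pattern: the same pattern text is rejected.
let unicodePatternError = null;
try {
  new RegExp('^\\u$', 'u');
} catch (e) {
  unicodePatternError = e;
}
const rejectedWithFlagU = unicodePatternError instanceof SyntaxError; // true

// Escaping syntax characters still works in Unicode patterns.
const escapedBraces = /^\{\}$/u.test('{}'); // true

// Code point escapes are available in Unicode patterns.
const codePointEscape = /^\u{1F600}$/u.test('\uD83D\uDE00'); // true

// Lead and trail surrogate escapes are grouped in Unicode patterns,
// so the quantifier applies to the whole pair. In a BMP pattern,
// it applies to the trail surrogate only.
const twice = '\uD83D\uDE00\uD83D\uDE00';
const quantifierUnicode = /^\uD83D\uDE00{2}$/u.test(twice); // true
const quantifierBmp = /^\uD83D\uDE00{2}$/.test(twice); // false
```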
“JavaScript has a Unicode problem” (by Mathias Bynens) explains new Unicode features in ES6.