The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.
The use of multiple lexical goals ensures that there are no lexical ambiguities that would affect automatic semicolon insertion. For example, there are no syntactic grammar contexts where both a leading division or division-assignment, and a leading RegularExpressionLiteral are permitted. This is not affected by semicolon insertion (see 12.10); in examples such as the following:
a = b
/hi/g.exec(c).map(d);
where the first non-whitespace, non-comment code point after a LineTerminator is U+002F (SOLIDUS) and the syntactic context allows division or division-assignment, no semicolon is inserted at the LineTerminator. That is, the above example is interpreted in the same way as:
a = b / hi / g.exec(c).map(d);
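The interpretation described above can be observed directly. The following is a minimal sketch; the bindings for b, hi, g, c, and d are hypothetical values chosen only so that the arithmetic is easy to check.

```javascript
// Hypothetical bindings so that the expression from the example evaluates.
const b = 12;
const hi = 3;
const c = null;
const d = null;
const g = { exec: (_c) => ({ map: (_d) => 2 }) };

let a;
// No semicolon is inserted after `b`: the next line continues the expression,
// so this is parsed as a = b / hi / g.exec(c).map(d), not as a RegExp literal.
a = b
/hi/g.exec(c).map(d);

console.log(a); // 12 / 3 / 2
```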
The Unicode format-control characters (i.e., the characters in category “Cf” in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages).
It is useful to allow format-control characters in source text to facilitate editing and display. All format control characters may be used within comments, and within string literals, template literals, and regular expression literals.
U+FEFF (ZERO WIDTH NO-BREAK SPACE) is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <ZWNBSP> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. In ECMAScript source text <ZWNBSP> code points are treated as white space characters (see 12.2) outside of comments, string literals, template literals, and regular expression literals.
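A small sketch of this treatment follows. It uses indirect eval only as a convenient way to hand source text containing U+FEFF to the host's parser; the variable names are illustrative.

```javascript
const ZWNBSP = "\uFEFF";

// Between tokens, U+FEFF acts as white space, so this source is parsed as `1 + 2`.
const sum = (0, eval)("1" + ZWNBSP + "+" + ZWNBSP + "2");

// Inside a string literal it is an ordinary, significant code point.
const text = (0, eval)('"a' + ZWNBSP + 'b"');

console.log(sum, text.length);
```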
12.2 White Space
White space code points are used to improve source text readability and to separate tokens (indivisible lexical units) from each other, but are otherwise insignificant. White space code points may occur between any two tokens and at the start or end of input. White space code points may occur within a StringLiteral, a RegularExpressionLiteral, a Template, or a TemplateSubstitutionTail where they are considered significant code points forming part of a literal value. They may also occur within a Comment, but cannot appear within any other kind of token.
The ECMAScript white space code points are listed in Table 33.
Table 33: White Space Code Points
Code Points | Name | Abbreviation
U+0009 | CHARACTER TABULATION | <TAB>
U+000B | LINE TABULATION | <VT>
U+000C | FORM FEED (FF) | <FF>
U+FEFF | ZERO WIDTH NO-BREAK SPACE | <ZWNBSP>
any code point in general category “Space_Separator” | | <USP>
Note 1
U+0020 (SPACE) and U+00A0 (NO-BREAK SPACE) code points are part of <USP>.
Note 2
Other than for the code points listed in Table 33, ECMAScript WhiteSpace intentionally excludes all code points that have the Unicode “White_Space” property but which are not classified in general category “Space_Separator” (“Zs”).
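The two notes can be checked with a short sketch. The helper parses is hypothetical; it uses the Function constructor only to ask the host's parser whether a function body is accepted. U+0085 (NEXT LINE) is used as an example of a code point that has the Unicode “White_Space” property but is not in general category “Zs”.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// U+00A0 NO-BREAK SPACE and U+3000 IDEOGRAPHIC SPACE are in category Zs (<USP>).
const noBreakSpace = parses("return 1\u00A0+\u00A02;");
const ideographicSpace = parses("return 1\u3000+\u30002;");

// U+0085 NEXT LINE has the White_Space property but is not in Zs,
// so it is neither white space nor a line terminator in source text.
const nextLine = parses("return 1\u0085+ 2;");

console.log(noBreakSpace, ideographicSpace, nextLine);
```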
Like white space code points, line terminator code points are used to improve source text readability and to separate tokens (indivisible lexical units) from each other. However, unlike white space code points, line terminators have some influence over the behaviour of the syntactic grammar. In general, line terminators may occur between any two tokens, but there are a few places where they are forbidden by the syntactic grammar. Line terminators also affect the process of automatic semicolon insertion (12.10). A line terminator cannot occur within any token except a StringLiteral, Template, or TemplateSubstitutionTail. <LF> and <CR> line terminators cannot occur within a StringLiteral token except as part of a LineContinuation.
Line terminators are included in the set of white space code points that are matched by the \s class in regular expressions.
The ECMAScript line terminator code points are listed in Table 34.
Table 34: Line Terminator Code Points
Code Point | Unicode Name | Abbreviation
U+000A | LINE FEED (LF) | <LF>
U+000D | CARRIAGE RETURN (CR) | <CR>
U+2028 | LINE SEPARATOR | <LS>
U+2029 | PARAGRAPH SEPARATOR | <PS>
Only the Unicode code points in Table 34 are treated as line terminators. Other new line or line breaking Unicode code points are not treated as line terminators but are treated as white space if they meet the requirements listed in Table 33. The sequence <CR><LF> is commonly used as a line terminator. It should be considered a single SourceCharacter for the purpose of reporting line numbers.
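As an illustration of the line-number guidance, here is a minimal sketch of how a tool might count lines. The function countLines is a hypothetical helper, not an operation defined by this specification.

```javascript
// Hypothetical helper: counts lines using only the four line terminator
// code points, treating the sequence <CR><LF> as a single terminator.
function countLines(source) {
  // "\r\n" is listed first so the pair is consumed as one match.
  const terminators = source.match(/\r\n|[\n\r\u2028\u2029]/g) || [];
  return terminators.length + 1;
}

// Five terminators: <CR><LF>, <LF>, <LS>, <PS>, <CR>.
const lines = countLines("a\r\nb\nc\u2028d\u2029e\rf");

// U+0085 NEXT LINE is a line-breaking code point elsewhere,
// but it is not a line terminator here.
const nextLineOnly = countLines("a\u0085b");

// All four line terminators are matched by \s in regular expressions.
const matchedByS = ["\n", "\r", "\u2028", "\u2029"].every((ch) => /\s/.test(ch));

console.log(lines, nextLineOnly, matchedByS);
```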
Comments can be either single or multi-line. Multi-line comments cannot nest.
Because a single-line comment can contain any Unicode code point except a LineTerminator code point, and because of the general rule that a token is always as long as possible, a single-line comment always consists of all code points from the // marker to the end of the line. However, the LineTerminator at the end of the line is not considered to be part of the single-line comment; it is recognized separately by the lexical grammar and becomes part of the stream of input elements for the syntactic grammar. This point is very important, because it implies that the presence or absence of single-line comments does not affect the process of automatic semicolon insertion (see 12.10).
Comments behave like white space and are discarded except that, if a MultiLineComment contains a line terminator code point, then the entire comment is considered to be a LineTerminator for purposes of parsing by the syntactic grammar.
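The effect of a line terminator inside a MultiLineComment can be seen in combination with the return statement, whose operand must start on the same line (see 12.10). The function names below are illustrative.

```javascript
function sameLine() {
  // The comment contains no line terminator, so it behaves like white space.
  return /* no line terminator inside */ 42;
}

function acrossLines() {
  // The comment contains a line terminator, so the whole comment counts as a
  // LineTerminator and a semicolon is inserted directly after `return`.
  return /* this comment contains
            a line terminator */ 42;
}

function afterSingleLine() {
  // The single-line comment does not swallow the LineTerminator that follows it.
  return 42 // trailing comment
}

console.log(sameLine(), acrossLines(), afterSingleLine());
```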
IdentifierName and ReservedWord are tokens that are interpreted according to the Default Identifier Syntax given in Unicode Standard Annex #31, Identifier and Pattern Syntax, with some small modifications. ReservedWord is an enumerated subset of IdentifierName. The syntactic grammar defines Identifier as an IdentifierName that is not a ReservedWord. The Unicode identifier grammar is based on character properties specified by the Unicode Standard. The Unicode code points in the specified categories in the latest version of the Unicode Standard must be treated as in those categories by all conforming ECMAScript implementations. ECMAScript implementations may recognize identifier code points defined in later editions of the Unicode Standard.
Note 1
This standard specifies specific code point additions: U+0024 (DOLLAR SIGN) and U+005F (LOW LINE) are permitted anywhere in an IdentifierName.
The sets of code points with Unicode properties “ID_Start” and “ID_Continue” include, respectively, the code points with Unicode properties “Other_ID_Start” and “Other_ID_Continue”.
Two IdentifierNames that are canonically equivalent according to the Unicode Standard are not equal unless, after replacement of each UnicodeEscapeSequence, they are represented by the exact same sequence of code points.
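These identifier rules can be illustrated with a short sketch. The binding and property names are arbitrary examples.

```javascript
// U+0024 and U+005F are permitted anywhere in an IdentifierName.
const $_$ = 1;

// A UnicodeEscapeSequence denotes the same identifier as the literal code point.
const \u0061bc = 2;
const sameBinding = abc === 2;

// U+00E9 and the sequence U+0065 U+0301 are canonically equivalent, yet they
// name distinct properties because their code point sequences differ.
const names = { \u00E9: "precomposed", e\u0301: "decomposed" };
const distinct = Object.keys(names).length === 2;

console.log($_$, sameBinding, distinct);
```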
The syntax-directed operation IdentifierCodePoints takes no arguments and returns a List of code points. It is defined piecewise over the following productions:
Return the code point whose numeric value is the MV of CodePoint.
12.7.2 Keywords and Reserved Words
A keyword is a token that matches IdentifierName, but also has a syntactic use; that is, it appears literally, in a fixed width font, in some syntactic production. The keywords of ECMAScript include if, while, async, await, and many others.
A reserved word is an IdentifierName that cannot be used as an identifier. Many keywords are reserved words, but some are not, and some are reserved only in certain contexts. if and while are reserved words. await is reserved only inside async functions and modules. async is not reserved; it can be used as a variable name or statement label without restriction.
This specification uses a combination of grammatical productions and early error rules to specify which names are valid identifiers and which are reserved words. All tokens in the ReservedWord list below, except for await and yield, are unconditionally reserved. Exceptions for await and yield are specified in 13.1, using parameterized syntactic productions. Lastly, several early error rules restrict the set of valid identifiers. See 13.1.1, 14.3.1.1, 14.7.5.1, and 15.7.1. In summary, there are five categories of identifier names:
Those that are always allowed as identifiers, and are not keywords, such as Math, window, toString, and _;
Those that are never allowed as identifiers, namely the ReservedWords listed below except await and yield;
Those that are contextually allowed as identifiers, namely await and yield;
Those that are contextually disallowed as identifiers, in strict mode code: let, static, implements, interface, package, private, protected, and public;
Those that are always allowed as identifiers, but also appear as keywords within certain syntactic productions, at places where Identifier is not allowed: as, async, from, get, meta, of, set, and target.
The term conditional keyword, or contextual keyword, is sometimes used to refer to the keywords that fall in the last three categories, and thus can be used as identifiers in some contexts and as keywords in others.
Per 5.1.5, keywords in the grammar match literal sequences of specific SourceCharacter elements. A code point in a keyword cannot be expressed by a \ UnicodeEscapeSequence.
enum is not currently used as a keyword in this specification. It is a future reserved word, set aside for use as a keyword in future language extensions.
Similarly, implements, interface, package, private, protected, and public are future reserved words in strict mode code.
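The categories above can be probed with a small sketch. The helper parses is hypothetical; it relies on the Function constructor, whose body is parsed as non-strict, non-async function code unless the body itself says otherwise.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const results = {
  // Always allowed as identifiers, although they also appear as keywords.
  asyncName: parses("var async = 1, of = 2, get = 3, set = 4;"),
  // Contextually allowed: fine in plain function code, reserved in async functions.
  awaitPlain: parses("var await = 1;"),
  awaitAsync: parses("async function f() { var await = 1; }"),
  // Contextually disallowed: reserved only in strict mode code.
  letSloppy: parses("var let = 1;"),
  letStrict: parses("'use strict'; var let = 1;"),
  // Never allowed: unconditionally reserved words, including enum.
  ifName: parses("var if = 1;"),
  enumName: parses("var enum = 1;"),
  // A keyword cannot be spelled with a UnicodeEscapeSequence (\u0069 is "i").
  escapedIf: parses("\\u0069f (true) {}"),
};

console.log(results);
```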
A string literal is 0 or more Unicode code points enclosed in single or double quotes. Unicode code points may also be represented by an escape sequence. All code points may appear literally in a string literal except for the closing quote code points, U+005C (REVERSE SOLIDUS), U+000D (CARRIAGE RETURN), and U+000A (LINE FEED). Any code points may appear in the form of an escape sequence. String literals evaluate to ECMAScript String values. When generating these String values Unicode code points are UTF-16 encoded as defined in 11.1.1. Code points belonging to the Basic Multilingual Plane are encoded as a single code unit element of the string. All other code points are encoded as two code unit elements of the string.
<LF> and <CR> cannot appear in a string literal, except as part of a LineContinuation to produce the empty code points sequence. The proper way to include either in the String value of a string literal is to use an escape sequence such as \n or \u000A.
It is possible for string literals to precede a Use Strict Directive that places the enclosing code in strict mode, and implementations must take care to enforce the above rules for such literals. For example, the following source text contains a Syntax Error:
function invalid() { "\7"; "use strict"; }
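The string literal rules in this section can be exercised with a brief sketch. The helper parses is hypothetical, and the legacy octal escape \7 is used here as an assumed example of an escape that strict mode code rejects.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// A LineContinuation (backslash followed by a line terminator) adds no code points.
const joined = "abc\
def";

// A code point outside the Basic Multilingual Plane becomes two code units.
const astral = "\u{1F600}";

// A raw <LF> inside a string literal is rejected; the escape sequence is accepted.
const rawLineFeed = parses('var s = "a\nb";');
const escapedLineFeed = parses('var s = "a\\nb";');

// A string literal that precedes the Use Strict Directive is still checked.
const octalBeforeDirective = parses('function invalid() { "\\7"; "use strict"; }');

console.log(joined, astral.length, rawLineFeed, escapedLineFeed, octalBeforeDirective);
```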
A string literal stands for a value of the String type. SV produces String values for string literals through recursive application on the various parts of the string literal. As part of this process, some Unicode code points within the string literal are interpreted as having a mathematical value, as described below or in 12.9.3.
A regular expression literal is an input element that is converted to a RegExp object (see 22.2) each time the literal is evaluated. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical. A RegExp object may also be created at runtime by new RegExp or calling the RegExp constructor as a function (see 22.2.4).
The productions below describe the syntax for a regular expression literal and are used by the input element scanner to find the end of the regular expression literal. The source text comprising the RegularExpressionBody and the RegularExpressionFlags are subsequently parsed again using the more stringent ECMAScript Regular Expression grammar (22.2.1).
An implementation may extend the ECMAScript Regular Expression grammar defined in 22.2.1, but it must not extend the RegularExpressionBody and RegularExpressionFlags productions defined below or the productions used by these productions.
Regular expression literals may not be empty; instead of representing an empty regular expression literal, the code unit sequence // starts a single-line comment. To specify an empty regular expression, use: /(?:)/.
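A short sketch of these properties of regular expression literals follows; the function name makeLiteral is illustrative.

```javascript
// Each evaluation of a literal produces a new RegExp object.
function makeLiteral() {
  return /a/;
}
const freshEachTime = makeLiteral() !== makeLiteral();

// Two literals with identical contents never compare as ===.
const identicalLiterals = /a/ === /a/;

// The empty regular expression is written /(?:)/, because // starts a comment.
const empty = /(?:)/;
const matchesEmptyString = empty.test("");

// The RegExp constructor reports the same source for an empty pattern.
const emptySource = new RegExp("").source;

console.log(freshEachTime, identicalLiterals, matchesEmptyString, emptySource);
```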
12.9.5.1 Static Semantics: BodyText
The syntax-directed operation BodyText takes no arguments and returns source text. It is defined piecewise over the following productions:
The syntax-directed operation TV takes no arguments and returns a String or undefined. A template literal component is interpreted by TV as a value of the String type. TV is used to construct the indexed components of a template object (colloquially, the template values). In TV, escape sequences are replaced by the UTF-16 code unit(s) of the Unicode code point represented by the escape sequence.
The syntax-directed operation TRV takes no arguments and returns a String. A template literal component is interpreted by TRV as a value of the String type. TRV is used to construct the raw components of a template object (colloquially, the template raw values). TRV is similar to TV with the difference being that in TRV, escape sequences are interpreted as they appear in the literal.
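The difference between the template values and the template raw values is visible to a tag function. The tag function inspect below is a hypothetical helper that returns both components of the first template segment.

```javascript
// Hypothetical tag function exposing the cooked (TV) and raw (TRV) components.
function inspect(strings) {
  return { cooked: strings[0], raw: strings.raw[0] };
}

// \n is replaced by U+000A in the cooked value and kept as written in the raw value.
const newline = inspect`a\nb`;

// An escape that cannot be interpreted yields undefined as the cooked value
// in a tagged template, while the raw value is still available.
const invalid = inspect`\unicode`;

console.log(newline.cooked.length, newline.raw.length, invalid.cooked, invalid.raw);
```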
Most ECMAScript statements and declarations must be terminated with a semicolon. Such semicolons may always appear explicitly in the source text. For convenience, however, such semicolons may be omitted from the source text in certain situations. These situations are described by saying that semicolons are automatically inserted into the source code token stream in those situations.
12.10.1 Rules of Automatic Semicolon Insertion
In the following rules, “token” means the actual recognized lexical token determined using the current lexical goal symbol as described in clause 12.
There are three basic rules of semicolon insertion:
When, as the source text is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true:
The offending token is separated from the previous token by at least one LineTerminator.
The offending token is }.
The previous token is ) and the inserted semicolon would then be parsed as the terminating semicolon of a do-while statement (14.7.2).
When, as the source text is parsed from left to right, the end of the input stream of tokens is encountered and the parser is unable to parse the input token stream as a single instance of the goal nonterminal, then a semicolon is automatically inserted at the end of the input stream.
When, as the source text is parsed from left to right, a token is encountered that is allowed by some production of the grammar, but the production is a restricted production and the token would be the first token for a terminal or nonterminal immediately following the annotation “[no LineTerminator here]” within the restricted production (and therefore such a token is called a restricted token), and the restricted token is separated from the previous token by at least one LineTerminator, then a semicolon is automatically inserted before the restricted token.
However, there is an additional overriding condition on the preceding rules: a semicolon is never inserted automatically if the semicolon would then be parsed as an empty statement or if that semicolon would become one of the two semicolons in the header of a for statement (see 14.7.4).
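The rules and the overriding condition can be probed with a parser check. This is a sketch under the assumption that the Function constructor is an acceptable stand-in for "is this accepted as a function body"; the helper parses and the sample sources are illustrative.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const checks = {
  // First rule: a LineTerminator separates the offending token from the previous one.
  lineTerminator: parses("var a = 1\nvar b = 2"),
  // The same tokens on one line satisfy none of the conditions.
  sameLine: parses("var a = 1 var b = 2"),
  // First rule: the offending token is }.
  closingBrace: parses("{ var a = 1 }"),
  // First rule: the semicolon terminates a do-while statement.
  doWhile: parses("do {} while (false) var a = 1"),
  // Second rule: the end of the input is reached.
  endOfInput: parses("var a = 1"),
  // Overriding condition: never one of the two semicolons in a for header.
  forHeader: parses("for (var i = 0\ni < 3\ni++) {}"),
  // Overriding condition: never a semicolon that would be an empty statement.
  emptyStatement: parses("if (true)\nelse {}"),
};

console.log(checks);
```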
Note
The following are the only restricted productions in the grammar:
The practical effect of these restricted productions is as follows:
When a ++ or -- token is encountered where the parser would treat it as a postfix operator, and at least one LineTerminator occurred between the preceding token and the ++ or -- token, then a semicolon is automatically inserted before the ++ or -- token.
When a continue, break, return, throw, or yield token is encountered and a LineTerminator is encountered before the next token, a semicolon is automatically inserted after the continue, break, return, throw, or yield token.
When arrow function parameter(s) are followed by a LineTerminator before a => token, a semicolon is automatically inserted and the punctuator causes a syntax error.
When an async token is followed by a LineTerminator before a function or IdentifierName or ( token, a semicolon is automatically inserted and the async token is not treated as part of the same expression or class element as the following tokens.
When an async token is followed by a LineTerminator before a * token, a semicolon is automatically inserted and the punctuator causes a syntax error.
The resulting practical advice to ECMAScript programmers is:
A postfix ++ or -- operator should be on the same line as its operand.
An Expression in a return or throw statement or an AssignmentExpression in a yield expression should start on the same line as the return, throw, or yield token.
A LabelIdentifier in a break or continue statement should be on the same line as the break or continue token.
The end of an arrow function's parameter(s) and its => should be on the same line.
The async token preceding an asynchronous function or method should be on the same line as the immediately following token.
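The practical effects listed above can be observed at run time. The following sketch uses illustrative names; the helper parses and the use of the Function constructor are assumptions made only to test source text that would otherwise be a syntax error in this file.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// return: the operand on the next line is not part of the return statement.
function returnsNothing() {
  return
  1 + 2;
}

// ++ on a new line is a prefix operator applied to the following operand.
let b = 1;
let c = 10;
let a = b
++c;

// => on a new line after the parameters is a syntax error.
const arrowSplit = parses("var f = (x)\n=> x;");

// async on its own line is a separate statement, so f is an ordinary function.
const f = new Function("var async;\nasync\nfunction f() {}\nreturn f;")();
const fIsPlainFunction = f.constructor === Function;

console.log(returnsNothing(), a, b, c, arrowSplit, fIsPlainFunction);
```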
12.10.2 Examples of Automatic Semicolon Insertion
This section is non-normative.
The source
{ 1 2 } 3
is not a valid sentence in the ECMAScript grammar, even with the automatic semicolon insertion rules. In contrast, the source
{ 1
2 } 3
is also not a valid ECMAScript sentence, but is transformed by automatic semicolon insertion into the following:
{ 1
;2 ;} 3;
which is a valid ECMAScript sentence.
The source
for (a; b
)
is not a valid ECMAScript sentence and is not altered by automatic semicolon insertion because the semicolon is needed for the header of a for statement. Automatic semicolon insertion never inserts one of the two semicolons in the header of a for statement.
The source
return
a + b
is transformed by automatic semicolon insertion into the following:
return;
a + b;
Note 1
The expression a + b is not treated as a value to be returned by the return statement, because a LineTerminator separates it from the token return.
The source
a = b
++c
is transformed by automatic semicolon insertion into the following:
a = b;
++c;
Note 2
The token ++ is not treated as a postfix operator applying to the variable b, because a LineTerminator occurs between b and ++.
The source
if (a > b)
else c = d
is not a valid ECMAScript sentence and is not altered by automatic semicolon insertion before the else token, even though no production of the grammar applies at that point, because an automatically inserted semicolon would then be parsed as an empty statement.
The source
a = b + c
(d + e).print()
is not transformed by automatic semicolon insertion, because the parenthesized expression that begins the second line can be interpreted as an argument list for a function call:
a = b + c(d + e).print()
In the circumstance that an assignment statement must begin with a left parenthesis, it is a good idea for the programmer to provide an explicit semicolon at the end of the preceding statement rather than to rely on automatic semicolon insertion.
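The examples in this section can be checked against a host parser. The sketch below assumes the Function constructor as a stand-in for parsing a function body; the helper parses and the bindings for a, b, c, d, and e are hypothetical.

```javascript
// Hypothetical helper: true if `source` is accepted as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const results = {
  bracesOneLine: parses("{ 1 2 } 3"),
  bracesTwoLines: parses("{ 1\n2 } 3"),
  forHeader: parses("for (a; b\n)"),
  ifElse: parses("if (a > b)\nelse c = d"),
};

// The last example: the second line is an argument list, so c is called.
const calls = [];
const b = 1;
const d = 2;
const e = 3;
const c = (arg) => {
  calls.push(arg);
  return { print: () => 10 };
};
let a;
a = b + c
(d + e).print()

console.log(results, a, calls);
```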
12.10.3 Interesting Cases of Automatic Semicolon Insertion
This section is non-normative.
ECMAScript programs can be written in a style with very few semicolons by relying on automatic semicolon insertion. As described above, semicolons are not inserted at every newline, and automatic semicolon insertion can depend on multiple tokens across line terminators.
As new syntactic features are added to ECMAScript, additional grammar productions could be added that cause lines relying on automatic semicolon insertion preceding them to change grammar productions when parsed.
For the purposes of this section, a case of automatic semicolon insertion is considered interesting if it is a place where a semicolon may or may not be inserted, depending on the source text which precedes it. The rest of this section describes a number of interesting cases of automatic semicolon insertion in this version of ECMAScript.
12.10.3.1 Interesting Cases of Automatic Semicolon Insertion in Statement Lists
In a StatementList, many StatementListItems end in semicolons, which may be omitted using automatic semicolon insertion. As a consequence of the rules above, at the end of a line ending an expression, a semicolon is required if the following line begins with any of the following:
An opening parenthesis ((). Without a semicolon, the two lines together are treated as a CallExpression.
An opening square bracket ([). Without a semicolon, the two lines together are treated as property access, rather than an ArrayLiteral or ArrayAssignmentPattern.
A template literal (`). Without a semicolon, the two lines together are interpreted as a tagged Template (13.3.11), with the previous expression as the MemberExpression.
Unary + or -. Without a semicolon, the two lines together are interpreted as a usage of the corresponding binary operator.
A RegExp literal. Without a semicolon, the two lines together may be parsed instead as the / MultiplicativeOperator, for example if the RegExp has flags.
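Each of the five cases listed above can be reproduced in a few lines. In this sketch every second line is deliberately left without a preceding semicolon; all names are illustrative.

```javascript
const log = [];
const record = (x) => { log.push(x); return record; };

// "(": the two lines form a call, so record is invoked with 1.
const first = record
(1)

// "[": the two lines form a property access, not a new array literal.
const list = [10, 20, 30];
const second = list
[1]

// Unary -: the two lines form a binary subtraction.
const third = 5
-2

// Template literal: the two lines form a tagged template.
const shout = (strings) => strings[0].toUpperCase();
const fourth = shout
`hello`

// RegExp literal with flags: the two lines form 8 / 2 / g.
const g = 2;
const fifth = 8
/2/g

console.log(first === record, log, second, third, fourth, fifth);
```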
12.10.3.2 Cases of Automatic Semicolon Insertion and “[no LineTerminator here]”
This section is non-normative.
ECMAScript contains grammar productions which include “[no LineTerminator here]”. These productions are sometimes a means to have optional operands in the grammar. Introducing a LineTerminator in these locations would change the grammar production of a source text by using the grammar production without the optional operand.
The rest of this section describes a number of productions using “[no LineTerminator here]” in this version of ECMAScript.
12.10.3.2.1 List of Grammar Productions with Optional Operands and “[no LineTerminator here]”