23. New regular expression features


This chapter explains new regular expression features in ECMAScript 6. It helps if you are familiar with ES5 regular expression features and Unicode. If necessary, consult the following two chapters of “Speaking JavaScript”: “Regular Expressions” and “Unicode and JavaScript”.



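23.1 Overview

The following regular expression features are new in ECMAScript 6:

  1. The new flag /y (sticky) anchors each match of a regular expression to the end of the previous match.
  2. The new flag /u (unicode) handles surrogate pairs (such as \uD83D\uDE80) as code points and lets you use Unicode code point escapes (such as \u{1F680}) in regular expressions.
  3. The new data property flags gives you access to the flags of a regular expression, just like source already gives you access to the pattern in ES5.
  4. You can use the constructor RegExp() to make a copy of a regular expression.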

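23.2 New flag /y (sticky)

The new flag /y changes two things while matching a regular expression re against a string:

  1. Anchored to re.lastIndex: The match must start at re.lastIndex (the index after the previous match). This behavior is similar to the ^ anchor, but with that anchor, matches must always start at index 0.
  2. Match repeatedly: If a match was found, re.lastIndex is set to the index after the match. This behavior is similar to the /g flag. Like /g, /y is normally used to match multiple times.

The main use case for this matching behavior is tokenizing, where you want each match to immediately follow its predecessor. An example of tokenizing via a sticky regular expression and exec() is given later.

Let’s look at how various regular expression operations react to the /y flag. The following tables give an overview. I’ll provide more details afterwards.

Methods of regular expressions (re is the regular expression that a method is invoked on):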

              Flags  Start matching  Anchored to    Result if match  No match  re.lastIndex
    exec()    –      0               –              Match object     null      unchanged
              /g     re.lastIndex    –              Match object     null      index after match
              /y     re.lastIndex    re.lastIndex   Match object     null      index after match
              /gy    re.lastIndex    re.lastIndex   Match object     null      index after match
    test()    (Any)  (like exec())   (like exec())  true             false     (like exec())

Methods of strings (str is the string that a method is invoked on, r is the regular expression parameter):

               Flags    Start matching            Anchored to        Result if match                     No match  r.lastIndex
    search()   –, /g    0                         –                  Index of match                      -1        unchanged
               /y, /gy  0                         0                  Index of match                      -1        unchanged
    match()    –        0                         –                  Match object                        null      unchanged
               /y       r.lastIndex               r.lastIndex        Match object                        null      index after match
               /g       After prev. match (loop)  –                  Array with matches                  null      0
               /gy      After prev. match (loop)  After prev. match  Array with matches                  null      0
    split()    (Any)    After prev. match (loop)  –                  Array with strings between matches  [str]     unchanged
    replace()  –        0                         –                  First match replaced                No repl.  unchanged
               /y       r.lastIndex               r.lastIndex        First match replaced                No repl.  index after match
               /g       After prev. match (loop)  –                  All matches replaced                No repl.  0
               /gy      After prev. match (loop)  After prev. match  All matches replaced                No repl.  0

The rows for split() and replace() follow the final ES6 specification: split() matches with its own sticky copy of r and tries it at every index, so the flags /g and /y of r make no difference; replace() resets r.lastIndex to 0 only if /g is set.

23.2.1 RegExp.prototype.exec(str)

If /g is not set, matching always starts at the beginning, but skips ahead until a match is found. REGEX.lastIndex is not changed.

    const REGEX = /a/;

    REGEX.lastIndex = 7; // ignored
    const match = REGEX.exec('xaxa');
    console.log(match.index); // 1
    console.log(REGEX.lastIndex); // 7 (unchanged)

If /g is set, matching starts at REGEX.lastIndex and skips ahead until a match is found. REGEX.lastIndex is set to the position after the match. That means that you receive all matches if you loop until exec() returns null.

    const REGEX = /a/g;

    REGEX.lastIndex = 2;
    const match = REGEX.exec('xaxa');
    console.log(match.index); // 3
    console.log(REGEX.lastIndex); // 4 (updated)

    // No match at index 4 or later
    console.log(REGEX.exec('xaxa')); // null

If only /y is set, matching starts at REGEX.lastIndex and is anchored to that position (no skipping ahead until a match is found). REGEX.lastIndex is updated similarly to when /g is set.

    const REGEX = /a/y;

    // No match at index 2
    REGEX.lastIndex = 2;
    console.log(REGEX.exec('xaxa')); // null

    // Match at index 3
    REGEX.lastIndex = 3;
    const match = REGEX.exec('xaxa');
    console.log(match.index); // 3
    console.log(REGEX.lastIndex); // 4

Setting both /y and /g is the same as only setting /y.
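The equivalence of /y and /gy for exec() can be sketched as follows. The helper collect() and the input string are made up for this illustration; the helper simply calls exec() until it returns null and records where each match started.

```javascript
// Hypothetical helper (not from this chapter): call exec() until it fails
// and record the index of each match.
function collect(regex, str) {
    const indices = [];
    regex.lastIndex = 0;
    let match;
    while ((match = regex.exec(str)) !== null) {
        indices.push(match.index);
    }
    return indices;
}

// With /g, exec() skips ahead over the 'x'.
console.log(collect(/a/g, 'aaxa')); // [0, 1, 3]

// With /y and with /gy, each match must start where the previous one ended.
console.log(collect(/a/y, 'aaxa')); // [0, 1]
console.log(collect(/a/gy, 'aaxa')); // [0, 1]
```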

23.2.2 RegExp.prototype.test(str)

test() works the same as exec(), but it returns true or false (instead of a match object or null) when matching succeeds or fails:

    const REGEX = /a/y;

    REGEX.lastIndex = 2;
    console.log(REGEX.test('xaxa')); // false

    REGEX.lastIndex = 3;
    console.log(REGEX.test('xaxa')); // true
    console.log(REGEX.lastIndex); // 4

23.2.3 String.prototype.search(regex)

search() ignores the flag /g and lastIndex (which is not changed, either). Starting at the beginning of the string, it looks for the first match and returns its index (or -1 if there was no match):

    const REGEX = /a/;

    REGEX.lastIndex = 2; // ignored
    console.log('xaxa'.search(REGEX)); // 1

If you set the flag /y, lastIndex is still ignored, but the regular expression is now anchored to index 0.

    const REGEX = /a/y;

    REGEX.lastIndex = 1; // ignored
    console.log('xaxa'.search(REGEX)); // -1 (no match)

23.2.4 String.prototype.match(regex)

match() has two modes:

  1. If /g is not set, it works like exec().
  2. If /g is set, it returns an Array with the string parts that matched, or null.

If the flag /g is not set, match() captures groups like exec():

    {
        const REGEX = /a/;

        REGEX.lastIndex = 7; // ignored
        console.log('xaxa'.match(REGEX).index); // 1
        console.log(REGEX.lastIndex); // 7 (unchanged)
    }
    {
        const REGEX = /a/y;

        REGEX.lastIndex = 2;
        console.log('xaxa'.match(REGEX)); // null

        REGEX.lastIndex = 3;
        console.log('xaxa'.match(REGEX).index); // 3
        console.log(REGEX.lastIndex); // 4
    }

If only the flag /g is set, then match() returns all matching substrings in an Array (or null). Matching always starts at position 0.

    const REGEX = /a|b/g;

    REGEX.lastIndex = 7;
    console.log('xaxb'.match(REGEX)); // ['a', 'b']
    console.log(REGEX.lastIndex); // 0

If you additionally set the flag /y, then matching is still performed repeatedly, while anchoring the regular expression to the index after the previous match (or 0).

    const REGEX = /a|b/gy;

    REGEX.lastIndex = 0; // ignored
    console.log('xab'.match(REGEX)); // null

    REGEX.lastIndex = 1; // ignored
    console.log('xab'.match(REGEX)); // null

    console.log('ab'.match(REGEX)); // ['a', 'b']
    console.log('axb'.match(REGEX)); // ['a']

23.2.5 String.prototype.split(separator, limit)

The complete details of split() are explained in “Speaking JavaScript”.

For ES6, it is interesting to see what happens if you use the flag /y. In the final ES6 specification, split() does not match with separator itself: it makes a copy that is always sticky and tries that copy at every index of the string. Therefore, it makes no difference whether /y is set on separator, and a separator can appear anywhere in the string:

    > 'x##'.split(/#/y)
    [ 'x', '', '' ]
    > 'x##'.split(/#/)
    [ 'x', '', '' ]
    > '##x'.split(/#/y)
    [ '', '', 'x' ]
    > '#x#'.split(/#/y)
    [ '', 'x', '' ]
    > '##'.split(/#/y)
    [ '', '', '' ]

As usual, you can use groups to put parts of the separators into the result array:

    > '##'.split(/(#)/y)
    [ '', '#', '', '#', '' ]

23.2.6 String.prototype.replace(search, replacement)

Without the flag /g, replace() only replaces the first match:

    const REGEX = /a/;

    // One match
    console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'

If only /y is set, you also get at most one match, but that match is anchored to lastIndex: in the final ES6 specification, replace() resets lastIndex to 0 only if /g is set. As with exec(), lastIndex is set to the index after a match and reset to 0 if there is no match.

    const REGEX = /a/y;

    // lastIndex is 0: anchored to beginning of string, no match
    console.log('xaxa'.replace(REGEX, '-')); // 'xaxa'
    console.log(REGEX.lastIndex); // 0

    // Anchored to lastIndex, one match
    REGEX.lastIndex = 1;
    console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'
    console.log(REGEX.lastIndex); // 2 (updated)

With /g set, replace() replaces all matches:

    const REGEX = /a/g;

    // Multiple matches
    console.log('xaxa'.replace(REGEX, '-')); // 'x-x-'

With /gy set, replace() replaces all matches, but each match is anchored to the end of the previous match:

    const REGEX = /a/gy;

    // Multiple matches
    console.log('aaxa'.replace(REGEX, '-')); // '--xa'

The parameter replacement can also be a function; consult “Speaking JavaScript” for details.
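As a small illustration of a replacement function combined with /gy, the following sketch doubles each digit in the leading run of digits. The input string and the variable names are invented for this example.

```javascript
// Replacement function: receives the matched text, returns its replacement.
const double = (digit) => String(Number(digit) * 2);

// /gy: every match must directly follow the previous one, so replacing
// stops at the first non-digit.
const leadingOnly = '12a3'.replace(/[0-9]/gy, double);
console.log(leadingOnly); // '24a3'

// /g: matching skips ahead, so the digit after 'a' is replaced, too.
const everywhere = '12a3'.replace(/[0-9]/g, double);
console.log(everywhere); // '24a6'
```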

23.2.7 Example: using sticky matching for tokenizing

The main use case for sticky matching is tokenizing, turning a text into a sequence of tokens. One important trait of tokenizing is that tokens are fragments of the text and that there must be no gaps between them. Therefore, sticky matching is perfect here.

    function tokenize(TOKEN_REGEX, str) {
        const result = [];
        let match;
        while (match = TOKEN_REGEX.exec(str)) {
            result.push(match[1]);
        }
        return result;
    }

    const TOKEN_GY = /\s*(\+|[0-9]+)\s*/gy;
    const TOKEN_G = /\s*(\+|[0-9]+)\s*/g;

In a legal sequence of tokens, sticky matching and non-sticky matching produce the same output:

    > tokenize(TOKEN_GY, '3 + 4')
    [ '3', '+', '4' ]
    > tokenize(TOKEN_G, '3 + 4')
    [ '3', '+', '4' ]

If, however, there is non-token text in the string, then sticky matching stops tokenizing, while non-sticky matching skips the non-token text:

    > tokenize(TOKEN_GY, '3x + 4')
    [ '3' ]
    > tokenize(TOKEN_G, '3x + 4')
    [ '3', '+', '4' ]

The behavior of sticky matching during tokenizing helps with error handling.
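One way this could look in practice is sketched below. The function tokenizeStrict() is a hypothetical variant of tokenize() (it is not part of this chapter): because a sticky match fails exactly where the non-token text begins, the function can report that position.

```javascript
// Same token syntax as above, sticky only.
const TOKEN = /\s*(\+|[0-9]+)\s*/y;

// Hypothetical strict tokenizer: throws instead of silently stopping.
function tokenizeStrict(str) {
    const result = [];
    TOKEN.lastIndex = 0;
    while (TOKEN.lastIndex < str.length) {
        // Remember where this attempt starts: a failed sticky match
        // resets lastIndex to 0.
        const pos = TOKEN.lastIndex;
        const match = TOKEN.exec(str);
        if (match === null) {
            throw new SyntaxError('Unexpected text at index ' + pos);
        }
        result.push(match[1]);
    }
    return result;
}

console.log(tokenizeStrict('3 + 4')); // ['3', '+', '4']

try {
    tokenizeStrict('3x + 4');
} catch (e) {
    console.log(e.message); // 'Unexpected text at index 1'
}
```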

23.2.8 Example: manually implementing sticky matching

If you wanted to manually implement sticky matching, you’d do it as follows. The function execSticky() works like RegExp.prototype.exec() in sticky mode.

    function execSticky(regex, str) {
        // Anchor the regex to the beginning of the string. The non-capturing
        // group keeps the anchor in front of every alternative (e.g. a|b).
        const matchSource = '^(?:' + regex.source + ')';
        // Ensure that instance property `lastIndex` is updated
        let matchFlags = regex.flags; // ES6 feature!
        if (!regex.global) {
            matchFlags = matchFlags + 'g';
        }
        const matchRegex = new RegExp(matchSource, matchFlags);

        // Ensure we start matching `str` at `regex.lastIndex`
        const matchOffset = regex.lastIndex;
        const matchStr = str.slice(matchOffset);
        const match = matchRegex.exec(matchStr);

        // Like exec() in sticky mode: reset `lastIndex` if there is no match
        if (match === null) {
            regex.lastIndex = 0;
            return null;
        }

        // Translate indices from `matchStr` to `str`
        regex.lastIndex = matchRegex.lastIndex + matchOffset;
        match.index = match.index + matchOffset;
        match.input = str;
        return match;
    }

23.3 New flag /u (unicode)

The flag /u switches on a special Unicode mode for a regular expression. That mode has two features:

  1. You can use Unicode code point escape sequences such as \u{1F42A} for specifying characters via code points. Normal Unicode escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the basic multilingual plane).
  2. “Characters” in the regular expression pattern and the string are code points (not UTF-16 code units). Code units are converted into code points.
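Feature 1 can be illustrated with a short sketch. The patterns below are made up for this illustration; they show that \u{...} is only treated as a code point escape if /u is set.

```javascript
// With /u, \u{1F42A} denotes a single code point (here written as a
// surrogate pair in the string).
const withFlag = /^\u{1F42A}$/u.test('\uD83D\uDC2A');
console.log(withFlag); // true

// Without /u, \u{3} is not an escape: \u matches the letter 'u' and {3}
// is a quantifier, so the pattern means "three times the letter u".
const withoutFlag = /^\u{3}$/.test('uuu');
console.log(withoutFlag); // true

// With /u, the same pattern matches the code point U+0003.
const controlChar = /^\u{3}$/u.test('\u0003');
console.log(controlChar); // true
```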

A section in the chapter on Unicode has more information on escape sequences. I’ll explain the consequences of feature 2 next. Instead of Unicode code point escapes (e.g., \u{1F680}), I’m using two UTF-16 code units (e.g., \uD83D\uDE80). That makes it clear that surrogate pairs are grouped in Unicode mode, and it works in both Unicode mode and non-Unicode mode.

    > '\u{1F680}' === '\uD83D\uDE80' // code point vs. surrogate pair
    true

23.3.1 Consequence: lone surrogates in the regular expression only match lone surrogates

In non-Unicode mode, a lone surrogate in a regular expression is even found inside (surrogate pairs encoding) code points:

    > /\uD83D/.test('\uD83D\uDC2A')
    true

In Unicode mode, surrogate pairs become atomic units and lone surrogates are not found “inside” them:

    > /\uD83D/u.test('\uD83D\uDC2A')
    false

Actual lone surrogates are still found:

    > /\uD83D/u.test('\uD83D \uD83D\uDC2A')
    true
    > /\uD83D/u.test('\uD83D\uDC2A \uD83D')
    true

23.3.2 Consequence: you can put code points in character classes

In Unicode mode, you can put code points into character classes and they won’t be interpreted as two characters anymore.

    > /^[\uD83D\uDC2A]$/u.test('\uD83D\uDC2A')
    true
    > /^[\uD83D\uDC2A]$/.test('\uD83D\uDC2A')
    false

    > /^[\uD83D\uDC2A]$/u.test('\uD83D')
    false
    > /^[\uD83D\uDC2A]$/.test('\uD83D')
    true

23.3.3 Consequence: the dot operator (.) matches code points, not code units

In Unicode mode, the dot operator matches code points (one or two code units). In non-Unicode mode, it matches single code units. For example:

    > '\uD83D\uDE80'.match(/./gu).length
    1
    > '\uD83D\uDE80'.match(/./g).length
    2
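The same idea can be used to split a string into code points. The helper toCodePoints() below is a hypothetical sketch; it uses [\s\S] instead of the dot, because the dot does not match line terminators.

```javascript
// Hypothetical helper: one array element per code point.
function toCodePoints(str) {
    // match() returns null if there is no match (empty string).
    return str.match(/[\s\S]/gu) || [];
}

const str = 'a\uD83D\uDE80b';
console.log(str.length); // 4 (code units)
console.log(toCodePoints(str).length); // 3 (code points)
```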

23.3.4 Consequence: quantifiers apply to code points, not code units

In Unicode mode, quantifiers apply to code points (one or two code units). In non-Unicode mode, they apply to single code units. For example:

    > /\uD83D\uDE80{2}/u.test('\uD83D\uDE80\uD83D\uDE80')
    true

    > /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uD83D\uDE80')
    false
    > /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uDE80')
    true

23.4 New data property flags

In ECMAScript 6, regular expressions have the following data properties:

  1. The pattern: source
  2. The flags: flags
  3. Individual flags: global, ignoreCase, multiline, sticky, unicode
  4. Other: lastIndex

As an aside, lastIndex is the only instance property now; all other data properties are implemented via internal instance properties and getters such as get RegExp.prototype.global.

The property source (which already existed in ES5) contains the regular expression pattern as a string:

    > /abc/ig.source
    'abc'

The property flags is new; it contains the flags as a string, with one character per flag:

    > /abc/ig.flags
    'gi'

You can’t change the flags of an existing regular expression (ignoreCase etc. have always been immutable), but flags allows you to make a copy where the flags are changed:

    function copyWithIgnoreCase(re) {
        return new RegExp(re.source,
            re.flags.includes('i') ? re.flags : re.flags + 'i');
    }

The next section explains another way to make modified copies of regular expressions.

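23.5 RegExp() can be used as a copy constructor

In ES6 there are two variants of the constructor RegExp() (the second one is new):

  1. new RegExp(pattern : string, flags = ''): A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
  2. new RegExp(regex : RegExp, flags = regex.flags): regex is cloned. If flags is provided, then it determines the flags of the clone.

The following interaction demonstrates the latter variant:

    > new RegExp(/abc/ig).flags
    'gi'
    > new RegExp(/abc/ig, 'i').flags // change flags
    'i'

Therefore, the RegExp constructor gives us another way to change flags:

    function copyWithIgnoreCase(re) {
        return new RegExp(re,
            re.flags.includes('i') ? re.flags : re.flags + 'i');
    }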
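The same approach also works for removing a flag. The helper copyWithoutFlag() below is a hypothetical counterpart to copyWithIgnoreCase(), sketched under the assumption that flag is a single flag character.

```javascript
// Hypothetical helper: clone `re`, dropping one flag if it is set.
function copyWithoutFlag(re, flag) {
    return new RegExp(re, re.flags.replace(flag, ''));
}

const original = /abc/gi;
const copy = copyWithoutFlag(original, 'g');

console.log(copy.source); // 'abc'
console.log(copy.flags); // 'i'
console.log(original.flags); // 'gi' (the original is not modified)
```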

23.5.1 Example: an iterable version of exec()

The following function execAll() is an iterable version of exec() that fixes several issues with using exec() to retrieve all matches of a regular expression:

    function* execAll(regex, str) {
        // Make sure flag /g is set and regex.lastIndex isn’t changed
        const localCopy = copyAndEnsureFlag(regex, 'g');
        let match;
        while (match = localCopy.exec(str)) {
            yield match;
        }
    }
    function copyAndEnsureFlag(re, flag) {
        return new RegExp(re,
            re.flags.includes(flag) ? re.flags : re.flags + flag);
    }

Using execAll():

    const str = '"fee" "fi" "fo" "fum"';
    const regex = /"([^"]*)"/;

    // Access capture of group #1 via destructuring
    for (const [, group1] of execAll(regex, str)) {
        console.log(group1);
    }
    // Output:
    // fee
    // fi
    // fo
    // fum

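23.6 String methods that delegate to regular expression methods

The following string methods now delegate some of their work to regular expression methods:

  1. String.prototype.match calls RegExp.prototype[Symbol.match].
  2. String.prototype.replace calls RegExp.prototype[Symbol.replace].
  3. String.prototype.search calls RegExp.prototype[Symbol.search].
  4. String.prototype.split calls RegExp.prototype[Symbol.split].

For more information, consult Sect. “String methods that delegate regular expression work to their parameters” in the chapter on strings.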
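One consequence of this delegation is that the parameter does not have to be a regular expression. As a minimal sketch, the object firstDigit below (a made-up example) can be passed to search() because it has a method whose key is the well-known symbol Symbol.search.

```javascript
// Hypothetical non-RegExp "pattern": finds the index of the first digit.
const firstDigit = {
    [Symbol.search](str) {
        for (let i = 0; i < str.length; i++) {
            if (str[i] >= '0' && str[i] <= '9') {
                return i;
            }
        }
        return -1;
    }
};

// search() hands the work over to firstDigit[Symbol.search]().
console.log('abc7'.search(firstDigit)); // 3
console.log('abc'.search(firstDigit)); // -1
```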

Further reading

If you want to know in more detail how the regular expression flag /u works, I recommend the article “Unicode-aware regular expressions in ECMAScript 6” by Mathias Bynens.
