Localization library | |||||||||||||||||||||||||
Regular expressions library(C++11) | |||||||||||||||||||||||||
Formatting library(C++20) | |||||||||||||||||||||||||
Null-terminated sequence utilities | |||||||||||||||||||||||||
Byte strings | |||||||||||||||||||||||||
Multibyte strings | |||||||||||||||||||||||||
Wide strings | |||||||||||||||||||||||||
Primitive numeric conversions | |||||||||||||||||||||||||
| |||||||||||||||||||||||||
Text encoding identifications | |||||||||||||||||||||||||
|
Classes | ||||
(C++11) | ||||
(C++11) | ||||
(C++11) | ||||
Algorithms | ||||
(C++11) | ||||
(C++11) | ||||
(C++11) | ||||
Iterators | ||||
(C++11) | ||||
(C++11) | ||||
Exceptions | ||||
(C++11) | ||||
Traits | ||||
(C++11) | ||||
Constants | ||||
(C++11) | ||||
(C++11) | ||||
(C++11) | ||||
Regex Grammar | ||||
Modified ECMAScript-262 (C++11) |
This page describes the regular expression grammar that is used whenstd::basic_regex is constructed withsyntax_option_type
set toECMAScript
(the default). Seesyntax_option_type
for the other supported regular expression grammars.
TheECMAScript
3 regular expression grammar in C++ isECMA-262 grammar with modifications marked with(C++ only) below.
Contents |
Themodified regular expression grammar is mostly ECMAScript RegExp grammar with a POSIX-type expansion on locales underClassAtom. Some clarifications on equality checks and number parsing is made. For many of the examples here, you can try this equivalent in your browser console:
function match(s, re){return s.match(new RegExp(re));}
The "normative references" in the standard specifies ECMAScript 3. We link to the ECMAScript 5.1 spec here because it is a version with only minor changes from ECMAScript 3, and it also has an HTML version. See theMDN Guide on JavaScript RegExp for an overview on the dialect features.
A regular expression pattern is a sequence of one or moreAlternatives, separated by the disjunction operator|
(in other words, the disjunction operator has the lowest precedence).
Pattern ::
Disjunction ::
|
DisjunctionThe pattern first tries to skip theDisjunction and match the leftAlternative followed by the rest of the regular expression (after the Disjunction).
If it fails, it tries to skip the leftAlternative and match the rightDisjunction (followed by the rest of the regular expression).
If the leftAlternative, the rightDisjunction, and the remainder of the regular expression all have choice points, all choices in the remainder of the expression are tried before moving on to the next choice in the leftAlternative. If choices in the leftAlternative are exhausted, the rightDisjunction is tried instead of the leftAlternative.
Any capturing parentheses inside a skippedAlternative produce empty submatches.
#include <cstddef>#include <iostream>#include <regex>#include <string> void show_matches(conststd::string& in,conststd::string& re){std::smatch m;std::regex_search(in, m,std::regex(re));if(!m.empty()){std::cout<<"input=["<< in<<"], regex=["<< re<<"]\n ""prefix=["<< m.prefix()<<"]\n smatch: ";for(std::size_t n=0; n< m.size();++n)std::cout<<"m["<< n<<"]=["<< m[n]<<"] ";std::cout<<"\n suffix=["<< m.suffix()<<"]\n";}elsestd::cout<<"input=["<< in<<"], regex=["<< re<<"]: NO MATCH\n";} int main(){ show_matches("abcdef","abc|def"); show_matches("abc","ab|abc");// left Alternative matched first // Match of the input against the left Alternative (a) followed// by the remained of the regex (c|bc) succeeds, which results// in m[1]="a" and m[4]="bc".// The skipped Alternatives (ab) and (c) leave their submatches// m[3] and m[5] empty. show_matches("abc","((a)|(ab))((c)|(bc))");}
Output:
input=[abcdef], regex=[abc|def] prefix=[] smatch: m[0]=[abc] suffix=[def]input=[abc], regex=[ab|abc] prefix=[] smatch: m[0]=[ab] suffix=[c]input=[abc], regex=[((a)|(ab))((c)|(bc))] prefix=[] smatch: m[0]=[abc] m[1]=[a] m[2]=[a] m[3]=[] m[4]=[bc] m[5]=[] m[6]=[bc] suffix=[]
EachAlternative is either empty or is a sequence ofTerms (with no separators between theTerms)
Alternative ::
EmptyAlternative always matches and does not consume any input.
ConsecutiveTerms try to simultaneously match consecutive portions of the input.
If the leftAlternative, the rightTerm, and the remainder of the regular expression all have choice points, all choices in the remained of the expression are tried before moving on to the next choice in the rightTerm, and all choices in the rightTerm are tried before moving on to the next choice in the leftAlternative.
#include <cstddef>#include <iostream>#include <regex>#include <string> void show_matches(conststd::string& in,conststd::string& re){std::smatch m;std::regex_search(in, m,std::regex(re));if(!m.empty()){std::cout<<"input=["<< in<<"], regex=["<< re<<"]\n ""prefix=["<< m.prefix()<<"]\n smatch: ";for(std::size_t n=0; n< m.size();++n)std::cout<<"m["<< n<<"]=["<< m[n]<<"] ";std::cout<<"\n suffix=["<< m.suffix()<<"]\n";}elsestd::cout<<"input=["<< in<<"], regex=["<< re<<"]: NO MATCH\n";} int main(){ show_matches("abcdef","");// empty regex is a single empty Alternative show_matches("abc","abc|");// left Alternative matched first show_matches("abc","|abc");// left Alternative matched first, leaving abc unmatched}
Output:
input=[abcdef], regex=[] prefix=[] smatch: m[0]=[] suffix=[abcdef]input=[abc], regex=[abc|] prefix=[] smatch: m[0]=[abc] suffix=[]input=[abc], regex=[|abc] prefix=[] smatch: m[0]=[] suffix=[abc]
Term ::
EachQuantifier is either agreedy quantifier (which consists of just oneQuantifierPrefix) or anon-greedy quantifier (which consists of oneQuantifierPrefix followed by the question mark?
).
Quantifier ::
?
EachQuantifierPrefix determines two numbers: the minimum number of repetitions and the maximum number of repetitions, as follows:
QuantifierPrefix | Minimum | Maximum |
---|---|---|
* | zero | infinity |
+ | one | infinity |
? | zero | one |
{ DecimalDigits} | value of DecimalDigits | value of DecimalDigits |
{ DecimalDigits, } | value of DecimalDigits | infinity |
{ DecimalDigits, DecimalDigits} | value of DecimalDigits before the comma | value of DecimalDigits after the comma |
The values of the individualDecimalDigits are obtained by callingstd::regex_traits::value(C++ only) on each of the digits.
AnAtom followed by aQuantifier is repeated the number of times specified by theQuantifier. AQuantifier can benon-greedy, in which case theAtom pattern is repeated as few times as possible while still matching the remainder of the regular expression, or it can begreedy, in which case theAtom pattern is repeated as many times as possible while still matching the remainder of the regular expression.
TheAtom pattern is what is repeated, not the input that it matches, so different repetitions of theAtom can match different input substrings.
If theAtom and the remainder of the regular expression all have choice points, theAtom is first matched as many (or as few, ifnon-greedy) times as possible. All choices in the remainder of the regular expression are tried before moving on to the next choice in the last repetition ofAtom. All choices in the last (nth) repetition ofAtom are tried before moving on to the next choice in the next-to-last (n–1)st repetition ofAtom; at which point it may turn out that more or fewer repetitions ofAtom are now possible; these are exhausted (again, starting with either as few or as many as possible) before moving on to the next choice in the (n-1)st repetition ofAtom and so on.
TheAtom's captures are cleared each time it is repeated (see the"(z)((a+)?(b+)?(c))*" example below)
#include <cstddef>#include <iostream>#include <regex>#include <string> void show_matches(conststd::string& in,conststd::string& re){std::smatch m;std::regex_search(in, m,std::regex(re));if(!m.empty()){std::cout<<"input=["<< in<<"], regex=["<< re<<"]\n ""prefix=["<< m.prefix()<<"]\n smatch: ";for(std::size_t n=0; n< m.size();++n)std::cout<<"m["<< n<<"]=["<< m[n]<<"] ";std::cout<<"\n suffix=["<< m.suffix()<<"]\n";}elsestd::cout<<"input=["<< in<<"], regex=["<< re<<"]: NO MATCH\n";} int main(){// greedy match, repeats [a-z] 4 times show_matches("abcdefghi","a[a-z]{2,4}");// non-greedy match, repeats [a-z] 2 times show_matches("abcdefghi","a[a-z]{2,4}?"); // Choice point ordering for quantifiers results in a match// with two repetitions, first matching the substring "aa",// second matching the substring "ba", leaving "ac" not matched// ("ba" appears in the capture clause m[1]) show_matches("aabaac","(aa|aabaac|ba|b|c)*"); // Choice point ordering for quantifiers makes this regex// calculate the greatest common divisor between 10 and 15// (the answer is 5, and it populates m[1] with "aaaaa") show_matches("aaaaaaaaaa,aaaaaaaaaaaaaaa","^(a+)\\1*,\\1+$"); // the substring "bbb" does not appear in the capture clause m[4]// because it is cleared when the second repetition of the atom// (a+)?(b+)?(c) is matching the substring "ac"// NOTE: gcc gets this wrong - it does not correctly clear the// matches[4] capture group as required by ECMA-262 21.2.2.5.1,// and thus incorrectly captures "bbb" for that group. show_matches("zaacbbbcac","(z)((a+)?(b+)?(c))*");}
Output:
input=[abcdefghi], regex=[a[a-z]{2,4}] prefix=[] smatch: m[0]=[abcde] suffix=[fghi]input=[abcdefghi], regex=[a[a-z]{2,4}?] prefix=[] smatch: m[0]=[abc] suffix=[defghi]input=[aabaac], regex=[(aa|aabaac|ba|b|c)*] prefix=[] smatch: m[0]=[aaba] m[1]=[ba] suffix=[ac]input=[aaaaaaaaaa,aaaaaaaaaaaaaaa], regex=[^(a+)\1*,\1+$] prefix=[] smatch: m[0]=[aaaaaaaaaa,aaaaaaaaaaaaaaa] m[1]=[aaaaa] suffix=[]input=[zaacbbbcac], regex=[(z)((a+)?(b+)?(c))*] prefix=[] smatch: m[0]=[zaacbbbcac] m[1]=[z] m[2]=[ac] m[3]=[a] m[4]=[] m[5]=[c] suffix=[]
Assertions match conditions, rather than substrings of the input string. They never consume any characters from the input. EachAssertion is one of the following
Assertion ::
^
$
\
b
\
B
(
?
=
Disjunction)
(
?
!
Disjunction)
The assertion^
(beginning of line) matches
The assertion$
(end of line) matches
In the two assertions above and in the Atom.
below,LineTerminator is one of the following four characters:U+000A
(\n
or line feed),U+000D
(\r
or carriage return),U+2028
(line separator), orU+2029
(paragraph separator)
The assertion\b
(word boundary) matches
The assertion\B
(negative word boundary) matches everything EXCEPT the following
The assertion(
?
=
Disjunction)
(zero-width positive lookahead) matches ifDisjunction would match the input at the current position
The assertion(
?
!
Disjunction)
(zero-width negative lookahead) matches ifDisjunction would NOT match the input at the current position.
For both Lookahead assertions, when matching theDisjunction, the position is not advanced before matching the remainder of the regular expression. Also, ifDisjunction can match at the current position in several ways, only the first one is tried.
ECMAScript forbids backtracking into the lookahead Disjunctions, which affects the behavior of backreferences into a positive lookahead from the remainder of the regular expression (see example below). Backreferences into the negative lookahead from the rest of the regular expression are always undefined (since the lookahead Disjunction must fail to proceed).
Note: Lookahead assertions may be used to create logical AND between multiple regular expressions (see example below).
#include <cstddef>#include <iostream>#include <regex>#include <string> void show_matches(conststd::string& in,conststd::string& re){std::smatch m;std::regex_search(in, m,std::regex(re));if(!m.empty()){std::cout<<"input=["<< in<<"], regex=["<< re<<"]\n ""prefix=["<< m.prefix()<<"]\n smatch: ";for(std::size_t n=0; n< m.size();++n)std::cout<<"m["<< n<<"]=["<< m[n]<<"] ";std::cout<<"\n suffix=["<< m.suffix()<<"]\n";}elsestd::cout<<"input=["<< in<<"], regex=["<< re<<"]: NO MATCH\n";} int main(){// matches the a at the end of input show_matches("aaa","a$"); // matches the o at the end of the first word show_matches("moo goo gai pan","o\\b"); // the lookahead matches the empty string immediately after the first b// this populates m[1] with "aaa" although m[0] is empty show_matches("baaabac","(?=(a+))"); // because backtracking into lookaheads is prohibited,// this matches aba rather than aaaba show_matches("baaabac","(?=(a+))a*b\\1"); // logical AND via lookahead: this password matches IF it contains// at least one lowercase letter// AND at least one uppercase letter// AND at least one punctuation character// AND be at least 6 characters long show_matches("abcdef","(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}"); show_matches("aB,def","(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}");}
Output:
input=[aaa], regex=[a$] prefix=[aa] smatch: m[0]=[a] suffix=[]input=[moo goo gai pan], regex=[o\b] prefix=[mo] smatch: m[0]=[o] suffix=[ goo gai pan]input=[baaabac], regex=[(?=(a+))] prefix=[b] smatch: m[0]=[] m[1]=[aaa] suffix=[aaabac]input=[baaabac], regex=[(?=(a+))a*b\1] prefix=[baa] smatch: m[0]=[aba] m[1]=[a] suffix=[c]input=[abcdef], regex=[(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}]: NO MATCHinput=[aB,def], regex=[(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}] prefix=[] smatch: m[0]=[aB,def] suffix=[]
AnAtom can be one of the following:
Atom ::
.
\
AtomEscape(
Disjunction)
(
?
:
Disjunction)
whereAtomEscape ::
Different kinds of atoms evaluate differently.
TheAtom(
Disjunction)
is a marked subexpression: it executes theDisjunction and stores the copy of the input substring that was consumed byDisjunction in the submatch array at the index that corresponds to the number of times the left open parenthesis(
of marked subexpressions has been encountered in the entire regular expression at this point.
Besides being returned in thestd::match_results, the captured submatches are accessible as backreferences (\1
,\2
, ...) and can be referenced in regular expressions. Note thatstd::regex_replace uses$
instead of\
for backreferences ($1
,$2
, ...) in the same manner asString.prototype.replace
(ECMA-262, part 15.5.4.11).
TheAtom(
?
:
Disjunction)
(non-marking subexpression) simply evaluates theDisjunction and does not store its results in the submatch. This is a purely lexical grouping.
This section is incomplete Reason: no example |
DecimalEscape ::
If\
is followed by a decimal numberN
whose first digit is not0
, then the escape sequence is considered to be abackreference. The valueN
is obtained by callingstd::regex_traits::value(C++ only) on each of the digits and combining their results using base-10 arithmetic. It is an error ifN
is greater than the total number of left capturing parentheses in the entire regular expression.
When a backreference\N
appears as anAtom, it matches the same substring as what is currently stored in the N'th element of the submatch array.
The decimal escape\0
is NOT a backreference: it is a character escape that represents theNUL character. It cannot be followed by a decimal digit.
As above, note thatstd::regex_replace uses$
instead of\
for backreferences ($1
,$2
, ...).
This section is incomplete Reason: no example |
TheAtom.
matches and consumes any one character from the input string except forLineTerminator (U+000D
,U+000A
,U+2029
, orU+2028
)
TheAtomPatternCharacter, wherePatternCharacter is anySourceCharacter EXCEPT the characters^ $ \ . * + ? ( ) [ ] { } |
, matches and consumes one character from the input if it is equal to thisPatternCharacter.
The equality for this and all other single character matches is defined as follows:
EachAtom that consists of the escape character\
followed byCharacterEscape as well as the special DecimalEscape\0
, matches and consumes one character from the input if it is equal to the character represented by theCharacterEscape. The following character escape sequences are recognized:
CharacterEscape ::
c
ControlLetterHere,ControlEscape is one of the following five characters:f n r t v
ControlEscape | Code Unit | Name |
---|---|---|
f | U+000C | form feed |
n | U+000A | new line |
r | U+000D | carriage return |
t | U+0009 | horizontal tab |
v | U+000B | vertical tab |
ControlLetter is any lowercase or uppercase ASCII letters and this character escape matches the character whose code unit equals the remainder of dividing the value of the code unit ofControlLetter by32. For example,\cD
and\cd
both match code unitU+0004
(EOT) because 'D' isU+0044
and0x44%32==4, and 'd' isU+0064
and0x64%32==4.
HexEscapeSequence is the letterx
followed by exactly twoHexDigits (whereHexDigit is one of0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
). This character escape matches the character whose code unit equals the numeric value of the two-digit hexadecimal number.
UnicodeEscapeSequence is the letteru
followed by exactly fourHexDigits. This character escape matches the character whose code unit equals the numeric value of this four-digit hexadecimal number. If the value does not fit in thisstd::basic_regex'sCharT
,std::regex_error is thrown(C++ only).
IdentityEscape can be any non-alphanumeric character: for example, another backslash. It matches the character as-is.
#include <cstddef>#include <iostream>#include <regex>#include <string> void show_matches(conststd::wstring& in,conststd::wstring& re){std::wsmatch m;std::regex_search(in, m,std::wregex(re));if(!m.empty()){std::wcout<< L"input=["<< in<< L"], regex=["<< re<< L"]\n " L"prefix=["<< m.prefix()<< L"]\n wsmatch: ";for(std::size_t n=0; n< m.size();++n)std::wcout<< L"m["<< n<< L"]=["<< m[n]<< L"] ";std::wcout<< L"\n suffix=["<< m.suffix()<< L"]\n";}elsestd::wcout<< L"input=["<< in<<"], regex=["<< re<< L"]: NO MATCH\n";} int main(){// Most escapes are similar to C++, save for metacharacters. You will have to// double-escape or use raw strings on the slashes though. show_matches(L"C++\\", LR"(C\+\+\\)"); // Escape sequences and NUL.std::wstring s(L"ab\xff\0cd",5); show_matches(s, L"(\\0|\\u00ff)"); // No matching for non-BMP Unicode is defined, because ECMAScript uses UTF-16// atoms. Whether this emoji banana matches can be platform dependent:// These need to be wide-strings! show_matches(L"\U0001f34c", L"[\\u0000-\\ufffe]+");}
Possible output:
input=[C++\], regex=[C\+\+\\] prefix=[] wsmatch: m[0]=[C++\] suffix=[]input=[ab?c], regex=[(\0{{!}}\u00ff)] prefix=[ab] wsmatch: m[0]=[?] m[1]=[?] suffix=[c]input=[?], regex=[[\u0000-\ufffe]+]: NO MATCH
An Atom can represent a character class, that is, it will match and consume one character if it belongs to one of the predefined groups of characters.
A character class can be introduced through a character class escape:
Atom ::
\
CharacterClassEscapeor directly
Atom ::
The character class escapes are shorthands for some of the common characters classes, as follows:
CharacterClassEscape | ClassName expression(C++ only) | Meaning |
---|---|---|
d | [[:digit:]] | digits |
D | [^[:digit:]] | non-digits |
s | [[:space:]] | whitespace characters |
S | [^[:space:]] | non-whitespace characters |
w | [_[:alnum:]] | alphanumeric characters and the character_ |
W | [^_[:alnum:]] | characters other than alphanumeric or_ |
ACharacterClass is a bracket-enclosed sequence ofClassRanges, optionally beginning with the negation operator^
. If it begins with^
, thisAtom matches any character that is NOT in the set of characters represented by the union of allClassRanges. Otherwise, thisAtom matches any character that IS in the set of the characters represented by the union of allClassRanges.
CharacterClass ::
[
[
lookahead ∉ {^
}]ClassRanges]
[
^
ClassRanges]
ClassRanges ::
NonemptyClassRanges ::
If non-empty class range has the formClassAtom -ClassAtom
, it matches any character from a range defined as follows:(C++ only)
The firstClassAtom must match a single collating elementc1
and the secondClassAtom must match a single collating elementc2
. To test if the input characterc
is matched by this range, the following steps are taken:
c
is matched ifc1 <= c && c <= c2
c
,c1
, andc2
) are passedstd::regex_traits::translate_nocasec
,c1
, andc2
) are passedstd::regex_traits::translatec
is matched iftransformed c1 <= transformed c && transformed c <= transformed c2
The character-
is treated literally if it is
NonemptyClassRangesNoDash ::
ClassAtom ::
-
ClassAtomNoDash ::
\ or ] or -
\
ClassEscapeEachClassAtomNoDash represents a single character -- eitherSourceCharacter as-is or escaped as follows:
ClassEscape ::
b
The specialClassEscape\b
produces a character set that matches the code unit U+0008 (backspace). Outside ofCharacterClass, it is the word-boundaryAssertion.
The use of\B
and the use of any backreference (DecimalEscape other than zero) inside aCharacterClass is an error.
The characters-
and]
may need to be escaped in some situations in order to be treated as atoms. Other characters that have special meaning outside ofCharacterClass, such as*
or?
, do not need to be escaped.
This section is incomplete Reason: no example |
These character classes are an extension to the ECMAScript grammar, and are equivalent to character classes found in the POSIX regular expressions.
ClassAtomExClass(C++ only) ::
[:
ClassName:]
Represents all characters that are members of the named character classClassName. The name is valid only ifstd::regex_traits::lookup_classname returns non-zero for this name. As described instd::regex_traits::lookup_classname, the following names are guaranteed to be recognized:alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, d, s, w
. Additional names may be provided by system-supplied locales (such asjdigit
orjkanji
in Japanese) or implemented as a user-defined extension.
ClassAtomCollatingElement(C++ only) ::
[.
ClassName.]
Represents the named collating element, which may represent a single character or a sequence of characters that collates as a single unit under the imbued locale, such as[.tilde.]
or[.ch.]
in Czech. The name is valid only ifstd::regex_traits::lookup_collatename is not an empty string.
When usingstd::regex_constants::collate, collating elements can always be used as ends points of a range (e.g.[[.dz.]-g]
in Hungarian).
ClassAtomEquivalence(C++ only) ::
[=
ClassName=]
Represents all characters that are members of the same equivalence class as the named collating element, that is, all characters whose primary collation key is the same as that for collating elementClassName. The name is valid only ifstd::regex_traits::lookup_collatename for that name is not an empty string and if the value returned bystd::regex_traits::transform_primary for the result of the call tostd::regex_traits::lookup_collatename is not an empty string.
A primary sort key is one that ignores case, accentuation, or locale-specific tailorings; so for example[[=a=]]
matches any of the characters:a, À, Á, Â, Ã, Ä, Å, A, à, á, â, ã, ä and å.
ClassName(C++ only) ::
ClassNameCharacter(C++ only) ::
. = :
This section is incomplete Reason: no example |