Commit 2a0af7f

Allow complemented character class escapes within regex brackets.

The complement-class escapes \D, \S, \W are now allowed within bracket expressions. There is no semantic difficulty with doing that, but the rather hokey macro-expansion-based implementation previously used here couldn't cope.

Also, invent "word" as an allowed character class name, thus "\w" is now equivalent to "[[:word:]]" outside brackets, or "[:word:]" within brackets. POSIX allows such implementation-specific extensions, and the same name is used in e.g. bash.

One surprising compatibility issue this raises is that constructs such as "[\w-_]" are now disallowed, as our documentation has always said they should be: character classes can't be endpoints of a range. Previously, because \w was just a macro for "[:alnum:]_", such a construct was read as "[[:alnum:]_-_]", so it was accepted so long as the character after "-" was numerically greater than or equal to "_".

Some implementation cleanup along the way:

* Remove the lexnest() hack, and in consequence clean up wordchrs() to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors involved.

* Get rid of useless-as-far-as-I-can-see calls of element() on single-character character element names in brackpart(). element() always maps these to the character itself, and things would be quite broken if it didn't --- should "[a]" match something different than "a" does? Besides, the shortcut path in brackpart() wasn't doing this anyway, making it even more inconsistent.

Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
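
The compatibility note about "[\w-_]" can be reproduced outside the regex engine. The following is a minimal sketch, not the removed PostgreSQL code: `old_expand_w()` and `range_is_valid()` are hypothetical helpers that imitate the textual substitution and the range-endpoint comparison described in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The text that, per the commit message, \w used to stand for in brackets. */
static const char OLD_BRACKET_W[] = "[:alnum:]_";

/*
 * Copy src to dst, replacing each "\w" with OLD_BRACKET_W.
 * Returns 0 on success, -1 if dst (of size dstlen) is too small.
 * Hypothetical helper for illustration only.
 */
int
old_expand_w(const char *src, char *dst, size_t dstlen)
{
    size_t used = 0;
    size_t explen = strlen(OLD_BRACKET_W);

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return -1;

    while (*src != '\0')
    {
        if (src[0] == '\\' && src[1] == 'w')
        {
            /* need room for the expansion plus the trailing NUL */
            if (used + explen >= dstlen)
                return -1;
            memcpy(dst + used, OLD_BRACKET_W, explen);
            used += explen;
            src += 2;
        }
        else
        {
            if (used + 1 >= dstlen)
                return -1;
            dst[used++] = *src++;
        }
    }
    dst[used] = '\0';
    return 0;
}

/* A range lo-hi is acceptable only if hi is not numerically below lo. */
int
range_is_valid(unsigned char lo, unsigned char hi)
{
    return lo <= hi;
}
```

With this sketch, "[\w-_]" expands to "[[:alnum:]_-_]", where "_-_" is a valid one-character range. That is why the old implementation accepted it, and why an endpoint that sorts below "_" (such as "A" in ASCII) was rejected.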

1 parent 6b40d9b, commit 2a0af7f

File tree

10 files changed, +672 -271 lines changed

doc/src/sgml
- func.sgml
src
- backend/regex
- include/regex
  - regguts.h
- test/modules/test_regex
  - expected
    - test_regex.out
  - sql
    - test_regex.sql

`doc/src/sgml/func.sgml`

Lines changed: 12 additions & 13 deletions

Original file line number	Diff line number	Diff line change
`@@ -6097,6 +6097,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;`
`6097`	`6097`	`non-ASCII characters to belong to any of these classes.)`
`6098`	`6098`	`In addition to these standard character`
`6099`	`6099`	`classes, <productname>PostgreSQL</productname> defines`
	`6100`	`+ the <literal>word</literal> character class, which is the same as`
	`6101`	`+ <literal>alnum</literal> plus the underscore (<literal>_</literal>)`
	`6102`	`+ character, and`
`6100`	`6103`	`the <literal>ascii</literal> character class, which contains exactly`
`6101`	`6104`	`the 7-bit ASCII set.`
`6102`	`6105`	`</para>`
`@@ -6108,9 +6111,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;`
`6108`	`6111`	`matching empty strings at the beginning`
`6109`	`6112`	`and end of a word respectively. A word is defined as a sequence`
`6110`	`6113`	`of word characters that is neither preceded nor followed by word`
`6111`		`- characters. A word character is an <literal>alnum</literal> character (as`
`6112`		`- defined by the <acronym>POSIX</acronym> character class described above)`
`6113`		`- or an underscore. This is an extension, compatible with but not`
	`6114`	`+ characters. A word character is any character belonging to the`
	`6115`	`+ <literal>word</literal> character class, that is, any letter, digit,`
	`6116`	`+ or underscore. This is an extension, compatible with but not`
`6114`	`6117`	`specified by <acronym>POSIX</acronym> 1003.2, and should be used with`
`6115`	`6118`	`caution in software intended to be portable to other systems.`
`6116`	`6119`	`The constraint escapes described below are usually preferable; they`
`@@ -6330,8 +6333,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;`
`6330`	`6333`
`6331`	`6334`	`<row>`
`6332`	`6335`	`<entry> <literal>\w</literal> </entry>`
`6333`		`- <entry> <literal>[[:alnum:]_]</literal>`
`6334`		`- (note underscore is included) </entry>`
	`6336`	`+ <entry> <literal>[[:word:]]</literal> </entry>`
`6335`	`6337`	`</row>`
`6336`	`6338`
`6337`	`6339`	`<row>`
`@@ -6346,21 +6348,18 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;`
`6346`	`6348`
`6347`	`6349`	`<row>`
`6348`	`6350`	`<entry> <literal>\W</literal> </entry>`
`6349`		`- <entry> <literal>[^[:alnum:]_]</literal>`
`6350`		`- (note underscore is included) </entry>`
	`6351`	`+ <entry> <literal>[^[:word:]]</literal> </entry>`
`6351`	`6352`	`</row>`
`6352`	`6353`	`</tbody>`
`6353`	`6354`	`</tgroup>`
`6354`	`6355`	`</table>`
`6355`	`6356`
`6356`	`6357`	`<para>`
`6357`		`-Within bracket expressions, <literal>\d</literal>, <literal>\s</literal>,`
`6358`		`-and <literal>\w</literal> lose their outer brackets,`
`6359`		`-and <literal>\D</literal>, <literal>\S</literal>, and <literal>\W</literal> are illegal.`
`6360`		`-(So, for example, <literal>[a-c\d]</literal> is equivalent to`
	`6358`	`+The class-shorthand escapes also work within bracket expressions,`
	`6359`	`+although the definitions shown above are not quite syntactically`
	`6360`	`+valid in that context.`
	`6361`	`+For example, <literal>[a-c\d]</literal> is equivalent to`
`6361`	`6362`	`<literal>[a-c[:digit:]]</literal>.`
`6362`		`- Also, <literal>[a-c\D]</literal>, which is equivalent to`
`6363`		`- <literal>[a-c^[:digit:]]</literal>, is illegal.)`
`6364`	`6363`	`</para>`
`6365`	`6364`
`6366`	`6365`	`<table id="posix-constraint-escapes-table">`
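
The documented semantics can be illustrated without the regex engine. In the sketch below, `is_word_char()` follows the description of the `word` class as `alnum` plus underscore, and `matches_0_3_or_nondigit()` spells out what a bracket expression mixing a range with a complemented class means, using `[0-3\D]` as an example chosen here (it does not appear in the patch). Both function names are hypothetical, and the sketch assumes ASCII input classified by `<ctype.h>` in the C locale, whereas the real classes are locale-dependent.

```c
#include <assert.h>
#include <ctype.h>

/* "word" class as documented: any alnum character, or underscore. */
int
is_word_char(int c)
{
    assert(c >= 0 && c <= 127);     /* sketch handles ASCII only */
    return isalnum(c) != 0 || c == '_';
}

/*
 * Membership test equivalent to the bracket expression [0-3\D]:
 * the union of the range 0-3 and the complement of the digit class.
 */
int
matches_0_3_or_nondigit(int c)
{
    assert(c >= 0 && c <= 127);     /* sketch handles ASCII only */
    return (c >= '0' && c <= '3') || isdigit(c) == 0;
}
```

The point of the second function is that a complemented class inside brackets is just one more member of a union, which is why the commit message says there is no semantic difficulty in allowing it.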

`src/backend/regex/re_syntax.n`

Lines changed: 4 additions & 9 deletions

Original file line number	Diff line number	Diff line change
`@@ -519,15 +519,10 @@ character classes:`
`519`	`519`	`(note underscore)`
`520`	`520`	`.RE`
`521`	`521`	`.PP`
`522`		-Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
`523`		-and `\fB\ew\fR'\&
`524`		`-lose their outer brackets,`
`525`		-and `\fB\eD\fR', `\fB\eS\fR',
`526`		-and `\fB\eW\fR'\&
`527`		`-are illegal.`
`528`		`-.VS 8.2`
`529`		`-(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.`
`530`		`-Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)`
	`522`	`+The class-shorthand escapes also work within bracket expressions,`
	`523`	`+although the definitions shown above are not quite syntactically`
	`524`	`+valid in that context.`
	`525`	`+For example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.`
`531`	`526`	`.VE 8.2`
`532`	`527`	`.PP`
`533`	`528`	`A constraint escape (AREs only) is a constraint,`

`src/backend/regex/regc_color.c`

Lines changed: 30 additions & 4 deletions

Original file line number	Diff line number	Diff line change
`@@ -936,7 +936,16 @@ okcolors(struct nfa *nfa,`
`936`	`936`	`}`
`937`	`937`	`else if (cd->nschrs == 0 && cd->nuchrs == 0)`
`938`	`938`	`{`
`939`		`-/* parent empty, its arcs change color to subcolor */`
	`939`	`+/*`
	`940`	`+ * Parent is now empty, so just change all its arcs to the`
	`941`	`+ * subcolor, then free the parent.`
	`942`	`+ *`
	`943`	`+ * It is not obvious that simply relabeling the arcs like this is`
	`944`	`+ * OK; it appears to risk creating duplicate arcs. We are`
	`945`	`+ * basically relying on the assumption that processing of a`
	`946`	`+ * bracket expression can't create arcs of both a color and its`
	`947`	`+ * subcolor between the bracket's endpoints.`
	`948`	`+ */`
`940`	`949`	`cd->sub = NOSUB;`
`941`	`950`	`scd = &cm->cd[sco];`
`942`	`951`	`assert(scd->nschrs > 0 || scd->nuchrs > 0);`
`@@ -1062,17 +1071,34 @@ colorcomplement(struct nfa *nfa,`
`1062`	`1071`	`struct colordesc *cd;`
`1063`	`1072`	`struct colordesc *end = CDEND(cm);`
`1064`	`1073`	`color co;`
	`1074`	`+struct arc *a;`
`1065`	`1075`
`1066`	`1076`	`assert(of != from);`
`1067`	`1077`
`1068`	`1078`	`/* A RAINBOW arc matches all colors, making the complement empty */`
`1069`	`1079`	`if (findarc(of, PLAIN, RAINBOW) != NULL)`
`1070`	`1080`	`return;`
`1071`	`1081`
	`1082`	`+/* Otherwise, transiently mark the colors that appear in of's out-arcs */`
	`1083`	`+for (a = of->outs; a != NULL; a = a->outchain)`
	`1084`	`+{`
	`1085`	`+if (a->type == PLAIN)`
	`1086`	`+{`
	`1087`	`+assert(a->co >= 0);`
	`1088`	`+cd = &cm->cd[a->co];`
	`1089`	`+assert(!UNUSEDCOLOR(cd));`
	`1090`	`+cd->flags |= COLMARK;`
	`1091`	`+}`
	`1092`	`+}`
	`1093`	`+`
	`1094`	`+/* Scan colors, clear transient marks, add arcs for unmarked colors */`
`1072`	`1095`	`for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)`
`1073`		`-if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))`
`1074`		`-if (findarc(of, PLAIN, co) == NULL)`
`1075`		`-newarc(nfa, type, co, from, to);`
	`1096`	`+{`
	`1097`	`+if (cd->flags & COLMARK)`
	`1098`	`+cd->flags &= ~COLMARK;`
	`1099`	`+else if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))`
	`1100`	`+newarc(nfa, type, co, from, to);`
	`1101`	`+}`
`1076`	`1102`	`}`
`1077`	`1103`
`1078`	`1104`
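
The mark-then-sweep idea in the colorcomplement() hunk above can be sketched in isolation. This is a simplified standalone illustration under stated assumptions, not the PostgreSQL code: `struct toy_color`, the `TOY_UNUSED` and `TOY_MARK` flags, and `complement_colors()` are hypothetical names, and the out-arcs of a state are reduced to a plain array of color numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch (not the regguts.h definitions). */
#define TOY_UNUSED 01   /* color slot is not in use */
#define TOY_MARK   02   /* transient mark, always cleared before returning */

struct toy_color
{
    int flags;
};

/*
 * Write into out[] every in-use color that does not appear in used[],
 * and return how many were written.  The cost is O(nused + ncolors)
 * rather than O(nused * ncolors), because membership is tested through
 * a mark bit instead of a search per color.
 */
size_t
complement_colors(struct toy_color *colors, size_t ncolors,
                  const int *used, size_t nused, int *out)
{
    size_t i;
    size_t n = 0;

    /* Pass 1: transiently mark the colors that are already present. */
    for (i = 0; i < nused; i++)
    {
        assert(used[i] >= 0 && (size_t) used[i] < ncolors);
        colors[used[i]].flags |= TOY_MARK;
    }

    /* Pass 2: clear the marks, emitting unmarked colors that are in use. */
    for (i = 0; i < ncolors; i++)
    {
        if (colors[i].flags & TOY_MARK)
            colors[i].flags &= ~TOY_MARK;
        else if (!(colors[i].flags & TOY_UNUSED))
            out[n++] = (int) i;
    }
    return n;
}
```

The old code instead called findarc() once per color, which is where the quadratic behavior mentioned in the commit message came from. The sketch keeps the property that every mark set in the first pass is cleared in the second.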

`src/backend/regex/regc_lex.c`

Lines changed: 16 additions & 150 deletions

Original file line number	Diff line number	Diff line change
`@@ -193,83 +193,6 @@ prefixes(struct vars *v)`
`193`	`193`	`}`
`194`	`194`	`}`
`195`	`195`
`196`		`-/*`
`197`		`- * lexnest - "call a subroutine", interpolating string at the lexical level`
`198`		`- *`
`199`		`- * Note, this is not a very general facility. There are a number of`
`200`		`- * implicit assumptions about what sorts of strings can be subroutines.`
`201`		`- */`
`202`		`-static void`
`203`		`-lexnest(struct vars *v,`
`204`		`-const chr *beginp, /* start of interpolation */`
`205`		`-const chr *endp) /* one past end of interpolation */`
`206`		`-{`
`207`		`-assert(v->savenow == NULL); /* only one level of nesting */`
`208`		`-v->savenow = v->now;`
`209`		`-v->savestop = v->stop;`
`210`		`-v->now = beginp;`
`211`		`-v->stop = endp;`
`212`		`-}`
`213`		`-`
`214`		`-/*`
`215`		`- * string constants to interpolate as expansions of things like \d`
`216`		`- */`
`217`		`-static const chr backd[] = { /* \d */`
`218`		`-CHR('['), CHR('['), CHR(':'),`
`219`		`-CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),`
`220`		`-CHR(':'), CHR(']'), CHR(']')`
`221`		`-};`
`222`		`-static const chr backD[] = { /* \D */`
`223`		`-CHR('['), CHR('^'), CHR('['), CHR(':'),`
`224`		`-CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),`
`225`		`-CHR(':'), CHR(']'), CHR(']')`
`226`		`-};`
`227`		`-static const chr brbackd[] = { /* \d within brackets */`
`228`		`-CHR('['), CHR(':'),`
`229`		`-CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),`
`230`		`-CHR(':'), CHR(']')`
`231`		`-};`
`232`		`-static const chr backs[] = { /* \s */`
`233`		`-CHR('['), CHR('['), CHR(':'),`
`234`		`-CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),`
`235`		`-CHR(':'), CHR(']'), CHR(']')`
`236`		`-};`
`237`		`-static const chr backS[] = { /* \S */`
`238`		`-CHR('['), CHR('^'), CHR('['), CHR(':'),`
`239`		`-CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),`
`240`		`-CHR(':'), CHR(']'), CHR(']')`
`241`		`-};`
`242`		`-static const chr brbacks[] = { /* \s within brackets */`
`243`		`-CHR('['), CHR(':'),`
`244`		`-CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),`
`245`		`-CHR(':'), CHR(']')`
`246`		`-};`
`247`		`-static const chr backw[] = { /* \w */`
`248`		`-CHR('['), CHR('['), CHR(':'),`
`249`		`-CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),`
`250`		`-CHR(':'), CHR(']'), CHR('_'), CHR(']')`
`251`		`-};`
`252`		`-static const chr backW[] = { /* \W */`
`253`		`-CHR('['), CHR('^'), CHR('['), CHR(':'),`
`254`		`-CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),`
`255`		`-CHR(':'), CHR(']'), CHR('_'), CHR(']')`
`256`		`-};`
`257`		`-static const chr brbackw[] = { /* \w within brackets */`
`258`		`-CHR('['), CHR(':'),`
`259`		`-CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),`
`260`		`-CHR(':'), CHR(']'), CHR('_')`
`261`		`-};`
`262`		`-`
`263`		`-/*`
`264`		`- * lexword - interpolate a bracket expression for word characters`
`265`		`- * Possibly ought to inquire whether there is a "word" character class.`
`266`		`- */`
`267`		`-static void`
`268`		`-lexword(struct vars *v)`
`269`		`-{`
`270`		`-lexnest(v, backw, ENDOF(backw));`
`271`		`-}`
`272`		`-`
`273`	`196`	`/*`
`274`	`197`	`* next - get next token`
`275`	`198`	`*/`
`@@ -292,14 +215,6 @@ next(struct vars *v)`
`292`	`215`	`RETV(SBEGIN, 0); /* same as \A */`
`293`	`216`	`}`
`294`	`217`
`295`		`-/* if we're nested and we've hit end, return to outer level */`
`296`		`-if (v->savenow != NULL && ATEOS())`
`297`		`-{`
`298`		`-v->now = v->savenow;`
`299`		`-v->stop = v->savestop;`
`300`		`-v->savenow = v->savestop = NULL;`
`301`		`-}`
`302`		`-`
`303`	`218`	`/* skip white space etc. if appropriate (not in literal or []) */`
`304`	`219`	`if (v->cflags & REG_EXPANDED)`
`305`	`220`	`switch (v->lexcon)`
`@@ -420,32 +335,15 @@ next(struct vars *v)`
`420`	`335`	`NOTE(REG_UNONPOSIX);`
`421`	`336`	`if (ATEOS())`
`422`	`337`	`FAILW(REG_EESCAPE);`
`423`		`-(DISCARD) lexescape(v);`
	`338`	`+if (!lexescape(v))`
	`339`	`+return 0;`
`424`	`340`	`switch (v->nexttype)`
`425`	`341`	`{ /* not all escapes okay here */`
`426`	`342`	`case PLAIN:`
	`343`	`+case CCLASSS:`
	`344`	`+case CCLASSC:`
`427`	`345`	`return 1;`
`428`	`346`	`break;`
`429`		`-case CCLASS:`
`430`		`-switch (v->nextvalue)`
`431`		`-{`
`432`		`-case 'd':`
`433`		`-lexnest(v, brbackd, ENDOF(brbackd));`
`434`		`-break;`
`435`		`-case 's':`
`436`		`-lexnest(v, brbacks, ENDOF(brbacks));`
`437`		`-break;`
`438`		`-case 'w':`
`439`		`-lexnest(v, brbackw, ENDOF(brbackw));`
`440`		`-break;`
`441`		`-default:`
`442`		`-FAILW(REG_EESCAPE);`
`443`		`-break;`
`444`		`-}`
`445`		`-/* lexnest done, back up and try again */`
`446`		`-v->nexttype = v->lasttype;`
`447`		`-return next(v);`
`448`		`-break;`
`449`	`347`	`}`
`450`	`348`	`/* not one of the acceptable escapes */`
`451`	`349`	`FAILW(REG_EESCAPE);`
`@@ -691,49 +589,17 @@ next(struct vars *v)`
`691`	`589`	`}`
`692`	`590`	`RETV(PLAIN, *v->now++);`
`693`	`591`	`}`
`694`		`-(DISCARD) lexescape(v);`
`695`		`-if (ISERR())`
`696`		`-FAILW(REG_EESCAPE);`
`697`		`-if (v->nexttype == CCLASS)`
`698`		`-{ /* fudge at lexical level */`
`699`		`-switch (v->nextvalue)`
`700`		`-{`
`701`		`-case 'd':`
`702`		`-lexnest(v, backd, ENDOF(backd));`
`703`		`-break;`
`704`		`-case 'D':`
`705`		`-lexnest(v, backD, ENDOF(backD));`
`706`		`-break;`
`707`		`-case 's':`
`708`		`-lexnest(v, backs, ENDOF(backs));`
`709`		`-break;`
`710`		`-case 'S':`
`711`		`-lexnest(v, backS, ENDOF(backS));`
`712`		`-break;`
`713`		`-case 'w':`
`714`		`-lexnest(v, backw, ENDOF(backw));`
`715`		`-break;`
`716`		`-case 'W':`
`717`		`-lexnest(v, backW, ENDOF(backW));`
`718`		`-break;`
`719`		`-default:`
`720`		`-assert(NOTREACHED);`
`721`		`-FAILW(REG_ASSERT);`
`722`		`-break;`
`723`		`-}`
`724`		`-/* lexnest done, back up and try again */`
`725`		`-v->nexttype = v->lasttype;`
`726`		`-return next(v);`
`727`		`-}`
`728`		`-/* otherwise, lexescape has already done the work */`
`729`		`-return !ISERR();`
	`592`	`+return lexescape(v);`
`730`	`593`	`}`
`731`	`594`
`732`	`595`	`/*`
`733`	`596`	`* lexescape - parse an ARE backslash escape (backslash already eaten)`
`734`		`- * Note slightly nonstandard use of the CCLASS type code.`
	`597`	`+ *`
	`598`	`+ * This is used for ARE backslashes both normally and inside bracket`
	`599`	`+ * expressions. In the latter case, not all escape types are allowed,`
	`600`	`+ * but the caller must reject unwanted ones after we return.`
`735`	`601`	`*/`
`736`		`-static int /* not actually used, but convenient for RETV */`
	`602`	`+static int`
`737`	`603`	`lexescape(struct vars *v)`
`738`	`604`	`{`
`739`	`605`	`chr c;`
`@@ -775,11 +641,11 @@ lexescape(struct vars *v)`
`775`	`641`	`break;`
`776`	`642`	`case CHR('d'):`
`777`	`643`	`NOTE(REG_ULOCALE);`
`778`		`-RETV(CCLASS, 'd');`
	`644`	`+RETV(CCLASSS, CC_DIGIT);`
`779`	`645`	`break;`
`780`	`646`	`case CHR('D'):`
`781`	`647`	`NOTE(REG_ULOCALE);`
`782`		`-RETV(CCLASS, 'D');`
	`648`	`+RETV(CCLASSC, CC_DIGIT);`
`783`	`649`	`break;`
`784`	`650`	`case CHR('e'):`
`785`	`651`	`NOTE(REG_UUNPORT);`
`@@ -802,11 +668,11 @@ lexescape(struct vars *v)`
`802`	`668`	`break;`
`803`	`669`	`case CHR('s'):`
`804`	`670`	`NOTE(REG_ULOCALE);`
`805`		`-RETV(CCLASS, 's');`
	`671`	`+RETV(CCLASSS, CC_SPACE);`
`806`	`672`	`break;`
`807`	`673`	`case CHR('S'):`
`808`	`674`	`NOTE(REG_ULOCALE);`
`809`		`-RETV(CCLASS, 'S');`
	`675`	`+RETV(CCLASSC, CC_SPACE);`
`810`	`676`	`break;`
`811`	`677`	`case CHR('t'):`
`812`	`678`	`RETV(PLAIN, CHR('\t'));`
`@@ -828,11 +694,11 @@ lexescape(struct vars *v)`
`828`	`694`	`break;`
`829`	`695`	`case CHR('w'):`
`830`	`696`	`NOTE(REG_ULOCALE);`
`831`		`-RETV(CCLASS, 'w');`
	`697`	`+RETV(CCLASSS, CC_WORD);`
`832`	`698`	`break;`
`833`	`699`	`case CHR('W'):`
`834`	`700`	`NOTE(REG_ULOCALE);`
`835`		`-RETV(CCLASS, 'W');`
	`701`	`+RETV(CCLASSC, CC_WORD);`
`836`	`702`	`break;`
`837`	`703`	`case CHR('x'):`
`838`	`704`	`NOTE(REG_UUNPORT);`
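
After these lexescape() changes, a class-shorthand escape is reported as a token type (class taken as-is versus complemented) plus a class code, instead of being expanded into text and re-lexed. The sketch below shows that shape in isolation. All names in it are hypothetical stand-ins: the real CCLASSS and CCLASSC tokens and the CC_* codes are defined in files not shown in this excerpt, and `TOY_OTHER` merely represents any escape type that the bracket-context caller rejects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types standing in for PLAIN, CCLASSS, CCLASSC. */
enum toy_tok { TOY_PLAIN, TOY_CCLASSS, TOY_CCLASSC, TOY_OTHER };

/* Hypothetical class codes standing in for CC_DIGIT, CC_SPACE, CC_WORD. */
enum toy_cls { TOY_DIGIT, TOY_SPACE, TOY_WORD };

struct toy_token
{
    enum toy_tok type;
    int          value;
};

/*
 * Classify the letter following a backslash.  Lower case yields the
 * class itself, upper case its complement.  Returns 1 and fills *t for
 * a class-shorthand escape, or 0 for anything else.
 */
int
toy_lex_class_escape(char c, struct toy_token *t)
{
    assert(t != NULL);
    switch (c)
    {
        case 'd': t->type = TOY_CCLASSS; t->value = TOY_DIGIT; return 1;
        case 'D': t->type = TOY_CCLASSC; t->value = TOY_DIGIT; return 1;
        case 's': t->type = TOY_CCLASSS; t->value = TOY_SPACE; return 1;
        case 'S': t->type = TOY_CCLASSC; t->value = TOY_SPACE; return 1;
        case 'w': t->type = TOY_CCLASSS; t->value = TOY_WORD;  return 1;
        case 'W': t->type = TOY_CCLASSC; t->value = TOY_WORD;  return 1;
        default:  return 0;
    }
}

/* Caller-side filter for bracket context, as in the next() hunk above. */
int
toy_ok_in_bracket(enum toy_tok type)
{
    return type == TOY_PLAIN || type == TOY_CCLASSS || type == TOY_CCLASSC;
}
```

Keeping the complement flag in the token type lets the bracket-context caller accept \D, \S, and \W with the same two case labels that handle \d, \s, and \w, which matches the added `case CCLASSS:` and `case CCLASSC:` lines in the diff.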
