Commit 2a0af7f

Allow complemented character class escapes within regex brackets.

The complement-class escapes \D, \S, \W are now allowed within
bracket expressions.  There is no semantic difficulty with doing
that, but the rather hokey macro-expansion-based implementation
previously used here couldn't cope.

Also, invent "word" as an allowed character class name, thus "\w"
is now equivalent to "[[:word:]]" outside brackets, or "[:word:]"
within brackets.  POSIX allows such implementation-specific
extensions, and the same name is used in e.g. bash.

One surprising compatibility issue this raises is that constructs
such as "[\w-_]" are now disallowed, as our documentation has always
said they should be: character classes can't be endpoints of a range.
Previously, because \w was just a macro for "[:alnum:]_", such a
construct was read as "[[:alnum:]_-_]", so it was accepted so long as
the character after "-" was numerically greater than or equal to "_".

Some implementation cleanup along the way:

* Remove the lexnest() hack, and in consequence clean up wordchrs()
to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors
involved.

* Get rid of useless-as-far-as-I-can-see calls of element()
on single-character character element names in brackpart().
element() always maps these to the character itself, and things
would be quite broken if it didn't --- should "[a]" match something
different than "a" does?  Besides, the shortcut path in brackpart()
wasn't doing this anyway, making it even more inconsistent.

Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us

1 parent 6b40d9b, commit 2a0af7f
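
The compatibility note about "[\w-_]" in the commit message can be illustrated with a toy model of the old textual expansion. This is only a sketch: `expand_w_in_brackets` and `range_is_valid` are hypothetical helpers written for this illustration, not functions from the PostgreSQL regex code, and the range check assumes an ASCII-compatible encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the OLD behavior described in the commit message: inside a
 * bracket expression, "\w" was replaced textually by "[:alnum:]_".
 * Hypothetical helper, not PostgreSQL code.  Returns 0 on success, -1 if
 * the output buffer is too small.
 */
static int
expand_w_in_brackets(const char *in, char *out, size_t outsz)
{
	static const char macro[] = "[:alnum:]_";
	size_t		o = 0;

	assert(in != NULL && out != NULL);
	if (outsz == 0)
		return -1;

	for (size_t i = 0; in[i] != '\0'; i++)
	{
		if (in[i] == '\\' && in[i + 1] == 'w')
		{
			size_t		n = sizeof(macro) - 1;

			if (o + n >= outsz)
				return -1;
			memcpy(out + o, macro, n);
			o += n;
			i++;				/* skip the 'w' */
		}
		else
		{
			if (o + 1 >= outsz)
				return -1;
			out[o++] = in[i];
		}
	}
	out[o] = '\0';
	return 0;
}

/* A range "lo-hi" is acceptable only if hi is numerically >= lo. */
static int
range_is_valid(unsigned char lo, unsigned char hi)
{
	return hi >= lo;
}
```

Under this model, "[\w-_]" expands to "[[:alnum:]_-_]", where the trailing "_-_" is read as a range starting at the underscore. `range_is_valid('_', 'a')` holds in ASCII while `range_is_valid('_', 'A')` does not, which matches the commit message's remark that acceptance depended on the character after "-".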

10 files changed: +672 -271 lines changed

‎doc/src/sgml/func.sgml

Lines changed: 12 additions & 13 deletions
@@ -6097,6 +6097,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     non-ASCII characters to belong to any of these classes.)
     In addition to these standard character
     classes, <productname>PostgreSQL</productname> defines
+    the <literal>word</literal> character class, which is the same as
+    <literal>alnum</literal> plus the underscore (<literal>_</literal>)
+    character, and
     the <literal>ascii</literal> character class, which contains exactly
     the 7-bit ASCII set.
    </para>
@@ -6108,9 +6111,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     matching empty strings at the beginning
     and end of a word respectively.  A word is defined as a sequence
     of word characters that is neither preceded nor followed by word
-    characters.  A word character is an <literal>alnum</literal> character (as
-    defined by the <acronym>POSIX</acronym> character class described above)
-    or an underscore.  This is an extension, compatible with but not
+    characters.  A word character is any character belonging to the
+    <literal>word</literal> character class, that is, any letter, digit,
+    or underscore.  This is an extension, compatible with but not
     specified by <acronym>POSIX</acronym> 1003.2, and should be used with
     caution in software intended to be portable to other systems.
     The constraint escapes described below are usually preferable; they
@@ -6330,8 +6333,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;

        <row>
        <entry> <literal>\w</literal> </entry>
-       <entry> <literal>[[:alnum:]_]</literal>
-       (note underscore is included) </entry>
+       <entry> <literal>[[:word:]]</literal> </entry>
        </row>

        <row>
@@ -6346,21 +6348,18 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;

        <row>
        <entry> <literal>\W</literal> </entry>
-       <entry> <literal>[^[:alnum:]_]</literal>
-       (note underscore is included) </entry>
+       <entry> <literal>[^[:word:]]</literal> </entry>
        </row>
       </tbody>
      </tgroup>
     </table>

    <para>
-    Within bracket expressions, <literal>\d</literal>, <literal>\s</literal>,
-    and <literal>\w</literal> lose their outer brackets,
-    and <literal>\D</literal>, <literal>\S</literal>, and <literal>\W</literal> are illegal.
-    (So, for example, <literal>[a-c\d]</literal> is equivalent to
+    The class-shorthand escapes also work within bracket expressions,
+    although the definitions shown above are not quite syntactically
+    valid in that context.
+    For example, <literal>[a-c\d]</literal> is equivalent to
     <literal>[a-c[:digit:]]</literal>.
-    Also, <literal>[a-c\D]</literal>, which is equivalent to
-    <literal>[a-c^[:digit:]]</literal>, is illegal.)
    </para>

    <table id="posix-constraint-escapes-table">
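
The documentation change above defines the word class as alnum plus underscore, and a word as a run of word characters neither preceded nor followed by one. A minimal sketch of both definitions follows. It assumes the C locale's `isalnum()`, whereas the real regex code is locale-aware, and `is_word_char` and `count_words` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Membership test for the "word" class: alnum plus underscore. */
static int
is_word_char(int c)
{
	return c == '_' || isalnum((unsigned char) c);
}

/*
 * Count words: maximal runs of word characters, i.e. runs that are
 * neither preceded nor followed by a word character.
 */
static size_t
count_words(const char *s)
{
	size_t		n = 0;
	int			in_word = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
	{
		int			w = is_word_char(*s);

		if (w && !in_word)
			n++;				/* a new word starts here */
		in_word = w;
	}
	return n;
}
```

With these definitions, "quick_brown" counts as a single word because the underscore is itself a word character, while "a-b" counts as two.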

‎src/backend/regex/re_syntax.n

Lines changed: 4 additions & 9 deletions
@@ -519,15 +519,10 @@ character classes:
 (note underscore)
 .RE
 .PP
-Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
-and `\fB\ew\fR'\&
-lose their outer brackets,
-and `\fB\eD\fR', `\fB\eS\fR',
-and `\fB\eW\fR'\&
-are illegal.
-.VS 8.2
-(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
-Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
+The class-shorthand escapes also work within bracket expressions,
+although the definitions shown above are not quite syntactically
+valid in that context.
+For example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
 .VE 8.2
 .PP
 A constraint escape (AREs only) is a constraint,
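
The revised wording says the class-shorthand escapes work inside bracket expressions, and after this commit that includes the complemented ones. As a rough sketch of the intended semantics, a bracket expression matches the union of its elements, so `[a-c\d]` is "a to c, or a digit" and `[a-c\D]` is "a to c, or anything that is not a digit". The function names below are hypothetical, and `isdigit()` in the C locale stands in for the regex library's own class lookup.

```c
#include <assert.h>				/* used by the checks in the usage note */
#include <ctype.h>

/* Semantics of [a-c\d]: a range unioned with the digit class. */
static int
matches_a_c_or_digit(unsigned char c)
{
	return (c >= 'a' && c <= 'c') || isdigit(c);
}

/* Semantics of [a-c\D]: the same range unioned with the complement of digit. */
static int
matches_a_c_or_nondigit(unsigned char c)
{
	return (c >= 'a' && c <= 'c') || !isdigit(c);
}
```

For instance, `assert(matches_a_c_or_nondigit('z'))` holds while `matches_a_c_or_nondigit('5')` is false. In this particular example the `a-c` range adds nothing, since those letters are already non-digits.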

‎src/backend/regex/regc_color.c

Lines changed: 30 additions & 4 deletions
@@ -936,7 +936,16 @@ okcolors(struct nfa *nfa,
 		}
 		else if (cd->nschrs == 0 && cd->nuchrs == 0)
 		{
-			/* parent empty, its arcs change color to subcolor */
+			/*
+			 * Parent is now empty, so just change all its arcs to the
+			 * subcolor, then free the parent.
+			 *
+			 * It is not obvious that simply relabeling the arcs like this is
+			 * OK; it appears to risk creating duplicate arcs.  We are
+			 * basically relying on the assumption that processing of a
+			 * bracket expression can't create arcs of both a color and its
+			 * subcolor between the bracket's endpoints.
+			 */
 			cd->sub = NOSUB;
 			scd = &cm->cd[sco];
 			assert(scd->nschrs > 0 || scd->nuchrs > 0);
@@ -1062,17 +1071,34 @@ colorcomplement(struct nfa *nfa,
 	struct colordesc *cd;
 	struct colordesc *end = CDEND(cm);
 	color		co;
+	struct arc *a;

 	assert(of != from);

 	/* A RAINBOW arc matches all colors, making the complement empty */
 	if (findarc(of, PLAIN, RAINBOW) != NULL)
 		return;

+	/* Otherwise, transiently mark the colors that appear in of's out-arcs */
+	for (a = of->outs; a != NULL; a = a->outchain)
+	{
+		if (a->type == PLAIN)
+		{
+			assert(a->co >= 0);
+			cd = &cm->cd[a->co];
+			assert(!UNUSEDCOLOR(cd));
+			cd->flags |= COLMARK;
+		}
+	}
+
+	/* Scan colors, clear transient marks, add arcs for unmarked colors */
 	for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
-		if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
-			if (findarc(of, PLAIN, co) == NULL)
-				newarc(nfa, type, co, from, to);
+	{
+		if (cd->flags & COLMARK)
+			cd->flags &= ~COLMARK;
+		else if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
+			newarc(nfa, type, co, from, to);
+	}
 }


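
The colorcomplement() change replaces a per-color findarc() search with a mark pass over the out-arcs followed by a single sweep over the colors. The standalone sketch below shows that mark-and-sweep idea on plain arrays. It is a toy model under stated assumptions: `toy_complement` and the `TOY_*` flags are invented for this illustration and only loosely mirror COLMARK, PSEUDO and UNUSEDCOLOR from the diff.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNUSED	0x1		/* slot not in use (stand-in for UNUSEDCOLOR) */
#define TOY_PSEUDO	0x2		/* pseudocolor, never complemented */
#define TOY_MARK	0x4		/* transient mark (stand-in for COLMARK) */

/*
 * Write to out[] every usable color that does NOT appear in arc_colors[].
 * Runs in O(narcs + ncolors) rather than O(narcs * ncolors).
 * All marks are cleared again before returning.  Returns the count.
 * The caller must provide room for ncolors entries in out[].
 */
static size_t
toy_complement(const int *arc_colors, size_t narcs,
			   unsigned *flags, size_t ncolors, int *out)
{
	size_t		nout = 0;

	/* Pass 1: mark each color that has an out-arc */
	for (size_t i = 0; i < narcs; i++)
	{
		assert(arc_colors[i] >= 0 && (size_t) arc_colors[i] < ncolors);
		flags[arc_colors[i]] |= TOY_MARK;
	}

	/* Pass 2: clear marks, emit the unmarked usable colors */
	for (size_t co = 0; co < ncolors; co++)
	{
		if (flags[co] & TOY_MARK)
			flags[co] &= ~(unsigned) TOY_MARK;
		else if (!(flags[co] & (TOY_UNUSED | TOY_PSEUDO)))
			out[nout++] = (int) co;
	}
	return nout;
}
```

Each arc and each color is visited once, and the flags array is left exactly as it was found, which is what makes the mark "transient" in the sense of the comment in the diff.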

‎src/backend/regex/regc_lex.c

Lines changed: 16 additions & 150 deletions
@@ -193,83 +193,6 @@ prefixes(struct vars *v)
 	}
 }

-/*
- * lexnest - "call a subroutine", interpolating string at the lexical level
- *
- * Note, this is not a very general facility.  There are a number of
- * implicit assumptions about what sorts of strings can be subroutines.
- */
-static void
-lexnest(struct vars *v,
-		const chr *beginp,		/* start of interpolation */
-		const chr *endp)		/* one past end of interpolation */
-{
-	assert(v->savenow == NULL); /* only one level of nesting */
-	v->savenow = v->now;
-	v->savestop = v->stop;
-	v->now = beginp;
-	v->stop = endp;
-}
-
-/*
- * string constants to interpolate as expansions of things like \d
- */
-static const chr backd[] = {	/* \d */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr backD[] = {	/* \D */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr brbackd[] = {	/* \d within brackets */
-	CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']')
-};
-static const chr backs[] = {	/* \s */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr backS[] = {	/* \S */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr brbacks[] = {	/* \s within brackets */
-	CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']')
-};
-static const chr backw[] = {	/* \w */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_'), CHR(']')
-};
-static const chr backW[] = {	/* \W */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_'), CHR(']')
-};
-static const chr brbackw[] = {	/* \w within brackets */
-	CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_')
-};
-
-/*
- * lexword - interpolate a bracket expression for word characters
- * Possibly ought to inquire whether there is a "word" character class.
- */
-static void
-lexword(struct vars *v)
-{
-	lexnest(v, backw, ENDOF(backw));
-}
-
 /*
  * next - get next token
  */
@@ -292,14 +215,6 @@ next(struct vars *v)
 			RETV(SBEGIN, 0);	/* same as \A */
 	}

-	/* if we're nested and we've hit end, return to outer level */
-	if (v->savenow != NULL && ATEOS())
-	{
-		v->now = v->savenow;
-		v->stop = v->savestop;
-		v->savenow = v->savestop = NULL;
-	}
-
 	/* skip white space etc. if appropriate (not in literal or []) */
 	if (v->cflags & REG_EXPANDED)
 		switch (v->lexcon)
@@ -420,32 +335,15 @@ next(struct vars *v)
 				NOTE(REG_UNONPOSIX);
 				if (ATEOS())
 					FAILW(REG_EESCAPE);
-				(DISCARD) lexescape(v);
+				if (!lexescape(v))
+					return 0;
 				switch (v->nexttype)
 				{				/* not all escapes okay here */
 					case PLAIN:
+					case CCLASSS:
+					case CCLASSC:
 						return 1;
 						break;
-					case CCLASS:
-						switch (v->nextvalue)
-						{
-							case 'd':
-								lexnest(v, brbackd, ENDOF(brbackd));
-								break;
-							case 's':
-								lexnest(v, brbacks, ENDOF(brbacks));
-								break;
-							case 'w':
-								lexnest(v, brbackw, ENDOF(brbackw));
-								break;
-							default:
-								FAILW(REG_EESCAPE);
-								break;
-						}
-						/* lexnest done, back up and try again */
-						v->nexttype = v->lasttype;
-						return next(v);
-						break;
 				}
 				/* not one of the acceptable escapes */
 				FAILW(REG_EESCAPE);
@@ -691,49 +589,17 @@ next(struct vars *v)
 		}
 		RETV(PLAIN, *v->now++);
 	}
-	(DISCARD) lexescape(v);
-	if (ISERR())
-		FAILW(REG_EESCAPE);
-	if (v->nexttype == CCLASS)
-	{							/* fudge at lexical level */
-		switch (v->nextvalue)
-		{
-			case 'd':
-				lexnest(v, backd, ENDOF(backd));
-				break;
-			case 'D':
-				lexnest(v, backD, ENDOF(backD));
-				break;
-			case 's':
-				lexnest(v, backs, ENDOF(backs));
-				break;
-			case 'S':
-				lexnest(v, backS, ENDOF(backS));
-				break;
-			case 'w':
-				lexnest(v, backw, ENDOF(backw));
-				break;
-			case 'W':
-				lexnest(v, backW, ENDOF(backW));
-				break;
-			default:
-				assert(NOTREACHED);
-				FAILW(REG_ASSERT);
-				break;
-		}
-		/* lexnest done, back up and try again */
-		v->nexttype = v->lasttype;
-		return next(v);
-	}
-	/* otherwise, lexescape has already done the work */
-	return !ISERR();
+	return lexescape(v);
 }

 /*
  * lexescape - parse an ARE backslash escape (backslash already eaten)
- * Note slightly nonstandard use of the CCLASS type code.
+ *
+ * This is used for ARE backslashes both normally and inside bracket
+ * expressions.  In the latter case, not all escape types are allowed,
+ * but the caller must reject unwanted ones after we return.
  */
-static int						/* not actually used, but convenient for RETV */
+static int
 lexescape(struct vars *v)
 {
 	chr			c;
@@ -775,11 +641,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('d'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'd');
+			RETV(CCLASSS, CC_DIGIT);
 			break;
 		case CHR('D'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'D');
+			RETV(CCLASSC, CC_DIGIT);
 			break;
 		case CHR('e'):
 			NOTE(REG_UUNPORT);
@@ -802,11 +668,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('s'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 's');
+			RETV(CCLASSS, CC_SPACE);
 			break;
 		case CHR('S'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'S');
+			RETV(CCLASSC, CC_SPACE);
 			break;
 		case CHR('t'):
 			RETV(PLAIN, CHR('\t'));
@@ -828,11 +694,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('w'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'w');
+			RETV(CCLASSS, CC_WORD);
 			break;
 		case CHR('W'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'W');
+			RETV(CCLASSC, CC_WORD);
 			break;
 		case CHR('x'):
 			NOTE(REG_UUNPORT);
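
Taken together, the regc_lex.c hunks appear to replace the single CCLASS token (whose value was the escape letter) with two token types, CCLASSS for a class and CCLASSC for its complement, each carrying a class code such as CC_DIGIT. The bracket-context caller then accepts PLAIN, CCLASSS and CCLASSC and rejects everything else. A toy sketch of that split follows. All `toy_` and `TOY_` names are hypothetical, the enum values are made up, and every escape other than the handful shown is lumped into `TOY_OTHER` for brevity.

```c
#include <assert.h>				/* used by the checks in the usage note */

/* Toy token types and class codes; names echo the diff, values are made up */
enum toy_type { TOY_PLAIN, TOY_CCLASSS, TOY_CCLASSC, TOY_OTHER };
enum toy_class { TOY_CC_NONE, TOY_CC_DIGIT, TOY_CC_SPACE, TOY_CC_WORD };

struct toy_token
{
	enum toy_type type;
	enum toy_class cls;
};

/* Classify the character that follows a backslash. */
static struct toy_token
toy_lexescape(char c)
{
	struct toy_token t = {TOY_OTHER, TOY_CC_NONE};

	switch (c)
	{
		case 'd':
		case 'D':
			t.cls = TOY_CC_DIGIT;
			break;
		case 's':
		case 'S':
			t.cls = TOY_CC_SPACE;
			break;
		case 'w':
		case 'W':
			t.cls = TOY_CC_WORD;
			break;
		case 't':
			t.type = TOY_PLAIN;	/* \t is just a plain tab character */
			return t;
		default:
			return t;			/* some other escape, not modeled here */
	}
	/* lower case selects the class itself, upper case its complement */
	t.type = (c >= 'a' && c <= 'z') ? TOY_CCLASSS : TOY_CCLASSC;
	return t;
}

/* Inside brackets: accept plain characters and both kinds of class escape. */
static int
toy_ok_in_brackets(struct toy_token t)
{
	return t.type == TOY_PLAIN ||
		t.type == TOY_CCLASSS ||
		t.type == TOY_CCLASSC;
}
```

For example, `assert(toy_ok_in_brackets(toy_lexescape('W')))` holds, whereas a constraint escape such as `\y` falls into `TOY_OTHER` and is rejected, loosely corresponding to the `FAILW(REG_EESCAPE)` path in the diff. Carrying the class code in the token means no text has to be re-lexed, which is what allowed the lexnest() machinery to be removed.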
