Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit5223f96

Browse files
committed
Fix regex back-references that are directly quantified with *.
The syntax "\n*", that is a backref with a * quantifier directly appliedto it, has never worked correctly in Spencer's library. This has been anopen bug in the Tcl bug tracker since 2005:https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894The core of the problem is in parseqatom(), which first changes "\n*" to"\n+|" and then applies repeat() to the NFA representing the backref atom.repeat() thinks that any arc leading into its "rp" argument is part of thesub-NFA to be repeated. Unfortunately, since parseqatom() already createdthe arc that was intended to represent the empty bypass around "\n+", thisarc gets moved too, so that it now leads into the state loop created byrepeat(). Thus, what was supposed to be an "empty" bypass gets turned intosomething that represents zero or more repetitions of the NFA representingthe backref atom. In the original example, in place of^([bc])\1*$we now have something that acts like^([bc])(\1+|[bc]*)$At runtime, the branch involving the actual backref fails, as it's supposedto, but then the other branch succeeds anyway.We could no doubt fix this by some rearrangement of the operations inparseqatom(), but that code is plenty ugly already, and what's more thewhole business of converting "x*" to "x+|" probably needs to go away to fixanother problem I'll mention in a moment. Instead, this patch suppressesthe *-conversion when the target is a simple backref atom, leaving the caseof m == 0 to be handled at runtime. This makes the patch in regcomp.c aone-liner, at the cost of having to tweak cbrdissect() a little. In theevent I went a bit further than that and rewrote cbrdissect() to check allthe string-length-related conditions before it starts comparing characters.It seems a bit stupid to possibly iterate through many copies of ann-character backreference, only to fail at the end because the targetstring's length isn't a multiple of n --- we could have found that outbefore starting. The existing coding could only be a win if integerdivision is hugely expensive compared to character comparison, but I don'tknow of any modern machine where that might be true.This does not fix all the problems with quantified back-references. Inparticular, the code is still broken for back-references that appear withina larger expression that is quantified (so that direct insertion of thequantification limits into the BACKREF node doesn't apply). I think fixingthat will take some major surgery on the NFA code, specifically introducingan explicit iteration node type instead of trying to transform iterationinto concatenation of modified regexps.Back-patch to all supported branches. In HEAD, also add a regression testcase for this. (It may seem a bit silly to create a regression test filefor just one test case; but I'm expecting that we will soon import a wholebunch of regex regression tests from Tcl, so might as well create theinfrastructure now.)
1 parente00f68e commit5223f96

File tree

6 files changed

+103
-30
lines changed

6 files changed

+103
-30
lines changed

‎src/backend/regex/regcomp.c

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1088,8 +1088,12 @@ parseqatom(struct vars * v,
10881088
NOERR();
10891089
}
10901090

1091-
/* it's quantifier time; first, turn x{0,...} into x{1,...}|empty */
1092-
if (m==0)
1091+
/*
1092+
* It's quantifier time. If the atom is just a BACKREF, we'll let it deal
1093+
* with quantifiers internally. Otherwise, the first step is to turn
1094+
* x{0,...} into x{1,...}|empty
1095+
*/
1096+
if (m==0&&atomtype!=BACKREF)
10931097
{
10941098
EMPTYARC(s2,atom->end);/* the bypass */
10951099
assert(PREF(qprefer)!=0);

‎src/backend/regex/regexec.c

Lines changed: 46 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -720,7 +720,7 @@ cdissect(struct vars * v,
720720
case'|':/* alternation */
721721
assert(t->left!=NULL);
722722
returncaltdissect(v,t,begin,end);
723-
case'b':/* backref -- shouldn't be calling us! */
723+
case'b':/* backreference */
724724
assert(t->left==NULL&&t->right==NULL);
725725
returncbrdissect(v,t,begin,end);
726726
case'.':/* concatenation */
@@ -962,12 +962,12 @@ cbrdissect(struct vars * v,
962962
chr*begin,/* beginning of relevant substring */
963963
chr*end)/* end of same */
964964
{
965-
inti;
966965
intn=t->subno;
967-
size_tlen;
968-
chr*paren;
966+
size_tnumreps;
967+
size_ttlen;
968+
size_tbrlen;
969+
chr*brstring;
969970
chr*p;
970-
chr*stop;
971971
intmin=t->min;
972972
intmax=t->max;
973973

@@ -978,46 +978,65 @@ cbrdissect(struct vars * v,
978978

979979
MDEBUG(("cbackref n%d %d{%d-%d}\n",t->retry,n,min,max));
980980

981+
/* get the backreferenced string */
981982
if (v->pmatch[n].rm_so==-1)
982983
returnREG_NOMATCH;
983-
paren=v->start+v->pmatch[n].rm_so;
984-
len=v->pmatch[n].rm_eo-v->pmatch[n].rm_so;
984+
brstring=v->start+v->pmatch[n].rm_so;
985+
brlen=v->pmatch[n].rm_eo-v->pmatch[n].rm_so;
985986

986987
/* no room to maneuver -- retries are pointless */
987988
if (v->mem[t->retry])
988989
returnREG_NOMATCH;
989990
v->mem[t->retry]=1;
990991

991-
/* special-case zero-length string */
992-
if (len==0)
992+
/* special cases for zero-length strings */
993+
if (brlen==0)
994+
{
995+
/*
996+
* matches only if target is zero length, but any number of
997+
* repetitions can be considered to be present
998+
*/
999+
if (begin==end&&min <=max)
1000+
{
1001+
MDEBUG(("cbackref matched trivially\n"));
1002+
returnREG_OKAY;
1003+
}
1004+
returnREG_NOMATCH;
1005+
}
1006+
if (begin==end)
9931007
{
994-
if (begin==end)
1008+
/* matches only if zero repetitions are okay */
1009+
if (min==0)
1010+
{
1011+
MDEBUG(("cbackref matched trivially\n"));
9951012
returnREG_OKAY;
1013+
}
9961014
returnREG_NOMATCH;
9971015
}
9981016

999-
/* and too-short string */
1000-
assert(end >=begin);
1001-
if ((size_t) (end-begin)<len)
1017+
/*
1018+
* check target length to see if it could possibly be an allowed number of
1019+
* repetitions of brstring
1020+
*/
1021+
assert(end>begin);
1022+
tlen=end-begin;
1023+
if (tlen %brlen!=0)
1024+
returnREG_NOMATCH;
1025+
numreps=tlen /brlen;
1026+
if (numreps<min|| (numreps>max&&max!=INFINITY))
10021027
returnREG_NOMATCH;
1003-
stop=end-len;
10041028

1005-
/*count occurrences */
1006-
i=0;
1007-
for (p=begin;p <=stop&& (i<max||max==INFINITY);p+=len)
1029+
/*okay, compare the actual string contents */
1030+
p=begin;
1031+
while (numreps-->0)
10081032
{
1009-
if ((*v->g->compare) (paren,p,len)!=0)
1010-
break;
1011-
i++;
1033+
if ((*v->g->compare) (brstring,p,brlen)!=0)
1034+
returnREG_NOMATCH;
1035+
p+=brlen;
10121036
}
1013-
MDEBUG(("cbackref found %d\n",i));
10141037

1015-
/* and sort it out */
1016-
if (p!=end)/* didn't consume all of it */
1017-
returnREG_NOMATCH;
1018-
if (min <=i&& (i <=max||max==INFINITY))
1019-
returnREG_OKAY;
1020-
returnREG_NOMATCH;/* out of range */
1038+
MDEBUG(("cbackref matched\n"));
1039+
returnREG_OKAY;
10211040
}
10221041

10231042
/*

‎src/test/regress/expected/regex.out

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
--
2+
-- Regular expression tests
3+
--
4+
-- Don't want to have to double backslashes in regexes
5+
set standard_conforming_strings = on;
6+
-- Test simple quantified backrefs
7+
select 'bbbbb' ~ '^([bc])\1*$' as t;
8+
t
9+
---
10+
t
11+
(1 row)
12+
13+
select 'ccc' ~ '^([bc])\1*$' as t;
14+
t
15+
---
16+
t
17+
(1 row)
18+
19+
select 'xxx' ~ '^([bc])\1*$' as f;
20+
f
21+
---
22+
f
23+
(1 row)
24+
25+
select 'bbc' ~ '^([bc])\1*$' as f;
26+
f
27+
---
28+
f
29+
(1 row)
30+
31+
select 'b' ~ '^([bc])\1*$' as t;
32+
t
33+
---
34+
t
35+
(1 row)
36+

‎src/test/regress/parallel_schedule

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ test: point lseg box path polygon circle date time timetz timestamp timestamptz
3030
# geometry depends on point, lseg, box, path, polygon and circle
3131
# horology depends on interval, timetz, timestamp, timestamptz, reltime and abstime
3232
# ----------
33-
test: geometry horology oidjoins type_sanity opr_sanity
33+
test: geometry horologyregexoidjoins type_sanity opr_sanity
3434

3535
# ----------
3636
# These four each depend on the previous one

‎src/test/regress/serial_schedule

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,7 @@ test: tstypes
4242
test: comments
4343
test: geometry
4444
test: horology
45+
test: regex
4546
test: oidjoins
4647
test: type_sanity
4748
test: opr_sanity

‎src/test/regress/sql/regex.sql

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
--
2+
-- Regular expression tests
3+
--
4+
5+
-- Don't want to have to double backslashes in regexes
6+
set standard_conforming_strings=on;
7+
8+
-- Test simple quantified backrefs
9+
select'bbbbb' ~'^([bc])\1*$'as t;
10+
select'ccc' ~'^([bc])\1*$'as t;
11+
select'xxx' ~'^([bc])\1*$'as f;
12+
select'bbc' ~'^([bc])\1*$'as f;
13+
select'b' ~'^([bc])\1*$'as t;

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp