Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit4aea704

Browse files
committed
Fix semantics of regular expression back-references.
POSIX defines the behavior of back-references thus: The back-reference expression '\n' shall match the same (possibly empty) string of characters as was matched by a subexpression enclosed between "\(" and "\)" preceding the '\n'.As far as I can see, the back-reference is supposed to consider onlythe data characters matched by the referenced subexpression. However,because our engine copies the NFA constructed from the referencedsubexpression, it effectively enforces any constraints therein, too.As an example, '(^.)\1' ought to match 'xx', or any other stringstarting with two occurrences of the same character; but in our codeit does not, and indeed can't match anything, because the '^' anchorconstraint is included in the backref's copied NFA. If POSIX intendedthat, you'd think they'd mention it. Perl for one doesn't act thatway, so it's hard to conclude that this isn't a bug.Fix by modifying the backref's NFA immediately after it's copied fromthe reference, replacing all constraint arcs by EMPTY arcs so that theconstraints are treated as automatically satisfied. This still allowsus to enforce matching rules that depend only on the data characters;for example, in '(^\d+).*\1' the NFA matching step will still knowthat the backref can only match strings of digits.Perhaps surprisingly, this change does not affect the results of anyof a rather large corpus of real-world regexes. Nonetheless, I wouldnot consider back-patching it, since it's a clear compatibility break.Patch by me, reviewed by Joel JacobsonDiscussion:https://postgr.es/m/661609.1614560029@sss.pgh.pa.us
1 parentc5530d8 commit4aea704

File tree

5 files changed

+107
-0
lines changed

5 files changed

+107
-0
lines changed

‎doc/src/sgml/func.sgml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6166,6 +6166,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
61666166
The subexpression must entirely precede the back reference in the RE.
61676167
Subexpressions are numbered in the order of their leading parentheses.
61686168
Non-capturing parentheses do not define subexpressions.
6169+
The back reference considers only the string characters matched by the
6170+
referenced subexpression, not any constraints contained in it. For
6171+
example, <literal>(^\d)\1</literal> will match <literal>22</literal>.
61696172
</para>
61706173

61716174
<table id="posix-character-entry-escapes-table">

‎src/backend/regex/regc_nfa.c

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1382,6 +1382,77 @@ duptraverse(struct nfa *nfa,
13821382
}
13831383
}
13841384

1385+
/*
1386+
* removeconstraints - remove any constraints in an NFA
1387+
*
1388+
* Constraint arcs are replaced by empty arcs, essentially treating all
1389+
* constraints as automatically satisfied.
1390+
*/
1391+
staticvoid
1392+
removeconstraints(structnfa*nfa,
1393+
structstate*start,/* process subNFA starting here */
1394+
structstate*stop)/* and stopping here */
1395+
{
1396+
if (start==stop)
1397+
return;
1398+
1399+
stop->tmp=stop;
1400+
removetraverse(nfa,start);
1401+
/* done, except for clearing out the tmp pointers */
1402+
1403+
stop->tmp=NULL;
1404+
cleartraverse(nfa,start);
1405+
}
1406+
1407+
/*
1408+
* removetraverse - recursive heart of removeconstraints
1409+
*/
1410+
staticvoid
1411+
removetraverse(structnfa*nfa,
1412+
structstate*s)
1413+
{
1414+
structarc*a;
1415+
structarc*oa;
1416+
1417+
/* Since this is recursive, it could be driven to stack overflow */
1418+
if (STACK_TOO_DEEP(nfa->v->re))
1419+
{
1420+
NERR(REG_ETOOBIG);
1421+
return;
1422+
}
1423+
1424+
if (s->tmp!=NULL)
1425+
return;/* already done */
1426+
1427+
s->tmp=s;
1428+
for (a=s->outs;a!=NULL&& !NISERR();a=oa)
1429+
{
1430+
removetraverse(nfa,a->to);
1431+
if (NISERR())
1432+
break;
1433+
oa=a->outchain;
1434+
switch (a->type)
1435+
{
1436+
casePLAIN:
1437+
caseEMPTY:
1438+
/* nothing to do */
1439+
break;
1440+
caseAHEAD:
1441+
caseBEHIND:
1442+
case'^':
1443+
case'$':
1444+
caseLACON:
1445+
/* replace it */
1446+
newarc(nfa,EMPTY,0,s,a->to);
1447+
freearc(nfa,a);
1448+
break;
1449+
default:
1450+
NERR(REG_ASSERT);
1451+
break;
1452+
}
1453+
}
1454+
}
1455+
13851456
/*
13861457
* cleartraverse - recursive cleanup for algorithms that leave tmp ptrs set
13871458
*/

‎src/backend/regex/regcomp.c

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -150,6 +150,8 @@ static void delsub(struct nfa *, struct state *, struct state *);
150150
staticvoiddeltraverse(structnfa*,structstate*,structstate*);
151151
staticvoiddupnfa(structnfa*,structstate*,structstate*,structstate*,structstate*);
152152
staticvoidduptraverse(structnfa*,structstate*,structstate*);
153+
staticvoidremoveconstraints(structnfa*,structstate*,structstate*);
154+
staticvoidremovetraverse(structnfa*,structstate*);
153155
staticvoidcleartraverse(structnfa*,structstate*);
154156
staticstructstate*single_color_transition(structstate*,structstate*);
155157
staticvoidspecialcolors(structnfa*);
@@ -1182,6 +1184,10 @@ parseqatom(struct vars *v,
11821184
dupnfa(v->nfa,v->subs[subno]->begin,v->subs[subno]->end,
11831185
atom->begin,atom->end);
11841186
NOERR();
1187+
1188+
/* The backref node's NFA should not enforce any constraints */
1189+
removeconstraints(v->nfa,atom->begin,atom->end);
1190+
NOERR();
11851191
}
11861192

11871193
/*

‎src/test/modules/test_regex/expected/test_regex.out

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2636,6 +2636,28 @@ select * from test_regex('^(.+)( \1)+$', 'abc abc abd', 'RP');
26362636
{2,REG_UBACKREF,REG_UNONPOSIX}
26372637
(1 row)
26382638

2639+
-- back reference only matches the string, not any constraints
2640+
select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP');
2641+
test_regex
2642+
--------------------------------------------
2643+
{1,REG_UBACKREF,REG_UNONPOSIX,REG_ULOCALE}
2644+
{"abc abc abc",abc}
2645+
(2 rows)
2646+
2647+
select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP');
2648+
test_regex
2649+
--------------------------------------------
2650+
{1,REG_UBACKREF,REG_UNONPOSIX,REG_ULOCALE}
2651+
{"abc abc",abc}
2652+
(2 rows)
2653+
2654+
select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');
2655+
test_regex
2656+
------------------------------------------------------------
2657+
{1,REG_UBACKREF,REG_ULOOKAROUND,REG_UNONPOSIX,REG_ULOCALE}
2658+
{"abc abc",abc}
2659+
(2 rows)
2660+
26392661
-- doing 15 "octal escapes vs back references"
26402662
-- # initial zero is always octal
26412663
-- expectMatch15.1 MP"a\\010b""a\bb""a\bb"

‎src/test/modules/test_regex/sql/test_regex.sql

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -770,6 +770,11 @@ select * from test_regex('^(.+)( \1)+$', 'abc abd abc', 'RP');
770770
-- expectNomatch14.29 RP{^(.+)( \1)+$}{abc abc abd}
771771
select*from test_regex('^(.+)(\1)+$','abc abc abd','RP');
772772

773+
-- back reference only matches the string, not any constraints
774+
select*from test_regex('(^\w+).*\1','abc abc abc','LRP');
775+
select*from test_regex('(^\w+\M).*\1','abc abcd abd','LRP');
776+
select*from test_regex('(\w+(?= )).*\1','abc abcd abd','HLRP');
777+
773778
-- doing 15 "octal escapes vs back references"
774779

775780
-- # initial zero is always octal

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp