Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commitafdfcd3

Browse files
committed
Miscellaneous cleanup of regular-expression compiler.
Revert our previous addition of "all" flags to copyins() and copyouts();they're no longer needed, and were never anything but an unsightly hack.Improve a couple of infelicities in the REG_DEBUG code for dumpingthe NFA data structure, including adding code to count the totalnumber of states and arcs.Add a couple of missed error checks.Add some more documentation in the README file, and some regression testsillustrating cases that exceeded the state-count limit and/or tookunreasonable amounts of time before this set of patches.Back-patch to all supported branches.
1 parent538b3b8 commitafdfcd3

File tree

5 files changed

+153
-54
lines changed

5 files changed

+153
-54
lines changed

‎src/backend/regex/README

Lines changed: 84 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -76,11 +76,10 @@ relates to what you'll see in the code. Here's what really happens:
7676
of states approximately proportional to the length of the regexp.
7777

7878
* The NFA is then optimized into a "compact NFA" representation, which is
79-
basically the same data but without fields that are not going to be needed
80-
at runtime. We do a little bit of cleanup too, such as removing
81-
unreachable states that might be created as a result of the rather naive
82-
transformation done by initial parsing. The cNFA representation is what
83-
is passed from regcomp to regexec.
79+
basically the same idea but without fields that are not going to be needed
80+
at runtime. It is simplified too: the compact format only allows "plain"
81+
and "LACON" arc types. The cNFA representation is what is passed from
82+
regcomp to regexec.
8483

8584
* Unlike traditional NFA-based regex engines, we do not execute directly
8685
from the NFA representation, as that would require backtracking and so be
@@ -139,12 +138,13 @@ a possible division of the input string that allows its two child nodes to
139138
each match their part of the string (and although this specific case can
140139
only succeed when the division is at the middle, the code does not know
141140
that, nor would it be true in general). However, we can first run the DFA
142-
and quickly reject any input that doesn't contain two a's and some number
143-
of b's and c's. If the DFA doesn't match, there is no need to recurse to
144-
the two child nodes for each possible string division point. In many
145-
cases, this prefiltering makes the search run much faster than a pure NFA
146-
engine could do. It is this behavior that justifies using the phrase
147-
"hybrid DFA/NFA engine" to describe Spencer's library.
141+
and quickly reject any input that doesn't start with an "a" and contain
142+
one more "a" plus some number of b's and c's. If the DFA doesn't match,
143+
there is no need to recurse to the two child nodes for each possible
144+
string division point. In many cases, this prefiltering makes the search
145+
run much faster than a pure NFA engine could do. It is this behavior that
146+
justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
147+
library.
148148

149149

150150
Colors and colormapping
@@ -296,3 +296,76 @@ character classes are somehow processed "symbolically" without making a
296296
full expansion of their contents at parse time. This would mean that we'd
297297
have to be ready to call iswalpha() at runtime, but if that only happens
298298
for high-code-value characters, it shouldn't be a big performance hit.
299+
300+
301+
Detailed semantics of an NFA
302+
----------------------------
303+
304+
When trying to read dumped-out NFAs, it's helpful to know these facts:
305+
306+
State 0 (additionally marked with "@" in dumpnfa's output) is always the
307+
goal state, and state 1 (additionally marked with ">") is the start state.
308+
(The code refers to these as the post state and pre state respectively.)
309+
310+
The possible arc types are:
311+
312+
PLAIN arcs, which specify matching of any character of a given "color"
313+
(see above). These are dumped as "[color_number]->to_state".
314+
315+
EMPTY arcs, which specify a no-op transition to another state. These
316+
are dumped as "->to_state".
317+
318+
AHEAD constraints, which represent a "next character must be of this
319+
color" constraint. AHEAD differs from a PLAIN arc in that the input
320+
character is not consumed when crossing the arc. These are dumped as
321+
">color_number>->to_state".
322+
323+
BEHIND constraints, which represent a "previous character must be of
324+
this color" constraint, which likewise consumes no input. These are
325+
dumped as "<color_number<->to_state".
326+
327+
'^' arcs, which specify a beginning-of-input constraint. These are
328+
dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
329+
beginning-of-line constraints respectively.
330+
331+
'$' arcs, which specify an end-of-input constraint. These are dumped
332+
as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
333+
constraints respectively.
334+
335+
LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
336+
i.e. the input starting at this point must match (or not match) a
337+
given sub-RE, but the matching input is not consumed. These are
338+
dumped as ":subtree_number:->to_state".
339+
340+
If you see anything else (especially any question marks) in the display of
341+
an arc, it's dumpnfa() trying to tell you that there's something fishy
342+
about the arc; see the source code.
343+
344+
The regex executor can only handle PLAIN and LACON transitions. The regex
345+
optimize() function is responsible for transforming the parser's output
346+
to get rid of all the other arc types. In particular, ^ and $ arcs that
347+
are not dropped as impossible will always end up adjacent to the pre or
348+
post state respectively, and then will be converted into PLAIN arcs that
349+
mention the special "colors" for BOS, BOL, EOS, or EOL.
350+
351+
To decide whether a thus-transformed NFA matches a given substring of the
352+
input string, the executor essentially follows these rules:
353+
1. Start the NFA "looking at" the character *before* the given substring,
354+
or if the substring is at the start of the input, prepend an imaginary BOS
355+
character instead.
356+
2. Run the NFA until it has consumed the character *after* the given
357+
substring, or an imaginary following EOS character if the substring is at
358+
the end of the input.
359+
3. If the NFA is (or can be) in the goal state at this point, it matches.
360+
361+
So one can mentally execute an untransformed NFA by taking ^ and $ as
362+
ordinary constraints that match at start and end of input; but plain
363+
arcs out of the start state should be taken as matches for the character
364+
before the target substring, and similarly, plain arcs leading to the
365+
post state are matches for the character after the target substring.
366+
This definition is necessary to support regexes that begin or end with
367+
constraints such as \m and \M, which imply requirements on the adjacent
368+
character if any. NFAs for simple unanchored patterns will usually have
369+
pre-state outarcs for all possible character colors as well as BOS and
370+
BOL, and post-state inarcs for all possible character colors as well as
371+
EOS and EOL, so that the executor's behavior will work.

‎src/backend/regex/regc_nfa.c

Lines changed: 17 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -823,14 +823,11 @@ moveins(struct nfa * nfa,
823823

824824
/*
825825
* copyins - copy in arcs of a state to another state
826-
*
827-
* Either all arcs, or only non-empty ones as determined by all value.
828826
*/
829827
staticvoid
830828
copyins(structnfa*nfa,
831829
structstate*oldState,
832-
structstate*newState,
833-
intall)
830+
structstate*newState)
834831
{
835832
assert(oldState!=newState);
836833

@@ -840,8 +837,7 @@ copyins(struct nfa * nfa,
840837
structarc*a;
841838

842839
for (a=oldState->ins;a!=NULL;a=a->inchain)
843-
if (all||a->type!=EMPTY)
844-
cparc(nfa,a,a->from,newState);
840+
cparc(nfa,a,a->from,newState);
845841
}
846842
else
847843
{
@@ -873,12 +869,6 @@ copyins(struct nfa * nfa,
873869
{
874870
structarc*a=oa;
875871

876-
if (!all&&a->type==EMPTY)
877-
{
878-
oa=oa->inchain;
879-
continue;
880-
}
881-
882872
switch (sortins_cmp(&oa,&na))
883873
{
884874
case-1:
@@ -904,12 +894,6 @@ copyins(struct nfa * nfa,
904894
/* newState does not have anything matching oa */
905895
structarc*a=oa;
906896

907-
if (!all&&a->type==EMPTY)
908-
{
909-
oa=oa->inchain;
910-
continue;
911-
}
912-
913897
oa=oa->inchain;
914898
createarc(nfa,a->type,a->co,a->from,newState);
915899
}
@@ -1107,14 +1091,11 @@ moveouts(struct nfa * nfa,
11071091

11081092
/*
11091093
* copyouts - copy out arcs of a state to another state
1110-
*
1111-
* Either all arcs, or only non-empty ones as determined by all value.
11121094
*/
11131095
staticvoid
11141096
copyouts(structnfa*nfa,
11151097
structstate*oldState,
1116-
structstate*newState,
1117-
intall)
1098+
structstate*newState)
11181099
{
11191100
assert(oldState!=newState);
11201101

@@ -1124,8 +1105,7 @@ copyouts(struct nfa * nfa,
11241105
structarc*a;
11251106

11261107
for (a=oldState->outs;a!=NULL;a=a->outchain)
1127-
if (all||a->type!=EMPTY)
1128-
cparc(nfa,a,newState,a->to);
1108+
cparc(nfa,a,newState,a->to);
11291109
}
11301110
else
11311111
{
@@ -1157,12 +1137,6 @@ copyouts(struct nfa * nfa,
11571137
{
11581138
structarc*a=oa;
11591139

1160-
if (!all&&a->type==EMPTY)
1161-
{
1162-
oa=oa->outchain;
1163-
continue;
1164-
}
1165-
11661140
switch (sortouts_cmp(&oa,&na))
11671141
{
11681142
case-1:
@@ -1188,12 +1162,6 @@ copyouts(struct nfa * nfa,
11881162
/* newState does not have anything matching oa */
11891163
structarc*a=oa;
11901164

1191-
if (!all&&a->type==EMPTY)
1192-
{
1193-
oa=oa->outchain;
1194-
continue;
1195-
}
1196-
11971165
oa=oa->outchain;
11981166
createarc(nfa,a->type,a->co,newState,a->to);
11991167
}
@@ -1452,6 +1420,10 @@ optimize(struct nfa * nfa,
14521420
fprintf(f,"\nfinal cleanup:\n");
14531421
#endif
14541422
cleanup(nfa);/* final tidying */
1423+
#ifdefREG_DEBUG
1424+
if (verbose)
1425+
dumpnfa(nfa,f);
1426+
#endif
14551427
returnanalyze(nfa);/* and analysis */
14561428
}
14571429

@@ -1568,7 +1540,7 @@ pull(struct nfa * nfa,
15681540
s=newstate(nfa);
15691541
if (NISERR())
15701542
return0;
1571-
copyins(nfa,from,s,1);/* duplicate inarcs */
1543+
copyins(nfa,from,s);/* duplicate inarcs */
15721544
cparc(nfa,con,s,to);/* move constraint arc */
15731545
freearc(nfa,con);
15741546
if (NISERR())
@@ -1735,7 +1707,7 @@ push(struct nfa * nfa,
17351707
s=newstate(nfa);
17361708
if (NISERR())
17371709
return0;
1738-
copyouts(nfa,to,s,1);/* duplicate outarcs */
1710+
copyouts(nfa,to,s);/* duplicate outarcs */
17391711
cparc(nfa,con,from,s);/* move constraint arc */
17401712
freearc(nfa,con);
17411713
if (NISERR())
@@ -2952,6 +2924,8 @@ dumpnfa(struct nfa * nfa,
29522924
{
29532925
#ifdefREG_DEBUG
29542926
structstate*s;
2927+
intnstates=0;
2928+
intnarcs=0;
29552929

29562930
fprintf(f,"pre %d, post %d",nfa->pre->no,nfa->post->no);
29572931
if (nfa->bos[0]!=COLORLESS)
@@ -2964,7 +2938,12 @@ dumpnfa(struct nfa * nfa,
29642938
fprintf(f,", eol [%ld]", (long)nfa->eos[1]);
29652939
fprintf(f,"\n");
29662940
for (s=nfa->states;s!=NULL;s=s->next)
2941+
{
29672942
dumpstate(s,f);
2943+
nstates++;
2944+
narcs+=s->nouts;
2945+
}
2946+
fprintf(f,"total of %d states, %d arcs\n",nstates,narcs);
29682947
if (nfa->parent==NULL)
29692948
dumpcolors(nfa->cm,f);
29702949
fflush(f);

‎src/backend/regex/regcomp.c

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -136,10 +136,10 @@ static intsortins_cmp(const void *, const void *);
136136
staticvoidsortouts(structnfa*,structstate*);
137137
staticintsortouts_cmp(constvoid*,constvoid*);
138138
staticvoidmoveins(structnfa*,structstate*,structstate*);
139-
staticvoidcopyins(structnfa*,structstate*,structstate*,int);
139+
staticvoidcopyins(structnfa*,structstate*,structstate*);
140140
staticvoidmergeins(structnfa*,structstate*,structarc**,int);
141141
staticvoidmoveouts(structnfa*,structstate*,structstate*);
142-
staticvoidcopyouts(structnfa*,structstate*,structstate*,int);
142+
staticvoidcopyouts(structnfa*,structstate*,structstate*);
143143
staticvoidcloneouts(structnfa*,structstate*,structstate*,structstate*,int);
144144
staticvoiddelsub(structnfa*,structstate*,structstate*);
145145
staticvoiddeltraverse(structnfa*,structstate*,structstate*);
@@ -181,7 +181,6 @@ static void dumpnfa(struct nfa *, FILE *);
181181
#ifdefREG_DEBUG
182182
staticvoiddumpstate(structstate*,FILE*);
183183
staticvoiddumparcs(structstate*,FILE*);
184-
staticintdumprarcs(structarc*,structstate*,FILE*,int);
185184
staticvoiddumparc(structarc*,structstate*,FILE*);
186185
staticvoiddumpcnfa(structcnfa*,FILE*);
187186
staticvoiddumpcstate(int,structcnfa*,FILE*);
@@ -614,7 +613,9 @@ makesearch(struct vars * v,
614613
for (s=slist;s!=NULL;s=s2)
615614
{
616615
s2=newstate(nfa);
617-
copyouts(nfa,s,s2,1);
616+
NOERR();
617+
copyouts(nfa,s,s2);
618+
NOERR();
618619
for (a=s->ins;a!=NULL;a=b)
619620
{
620621
b=a->inchain;
@@ -2014,7 +2015,7 @@ dump(regex_t *re,
20142015
dumpcolors(&g->cmap,f);
20152016
if (!NULLCNFA(g->search))
20162017
{
2017-
printf("\nsearch:\n");
2018+
fprintf(f,"\nsearch:\n");
20182019
dumpcnfa(&g->search,f);
20192020
}
20202021
for (i=1;i<g->nlacons;i++)

‎src/test/regress/expected/regex.out

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -229,6 +229,41 @@ select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
229229
t
230230
(1 row)
231231

232+
-- These cases used to give too-many-states failures
233+
select 'x' ~ 'abcd(\m)+xyz';
234+
?column?
235+
----------
236+
f
237+
(1 row)
238+
239+
select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
240+
?column?
241+
----------
242+
f
243+
(1 row)
244+
245+
select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
246+
?column?
247+
----------
248+
f
249+
(1 row)
250+
251+
select 'x' ~ 'xyz(\Y\Y)+';
252+
?column?
253+
----------
254+
f
255+
(1 row)
256+
257+
select 'x' ~ 'x|(?:\M)+';
258+
?column?
259+
----------
260+
t
261+
(1 row)
262+
263+
-- This generates O(N) states but O(N^2) arcs, so it causes problems
264+
-- if arc count is not constrained
265+
select 'x' ~ repeat('x*y*z*', 1000);
266+
ERROR: invalid regular expression: regular expression is too complex
232267
-- Test backref in combination with non-greedy quantifier
233268
-- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
234269
select 'Programmer' ~ '(\w).*?\1' as t;

‎src/test/regress/sql/regex.sql

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,17 @@ select 'dd x' ~ '(^(?!aa)(?!bb)(?!cc))+';
5555
select'a' ~'((((((a)*)*)*)*)*)*';
5656
select'a' ~'((((((a+|)+|)+|)+|)+|)+|)';
5757

58+
-- These cases used to give too-many-states failures
59+
select'x' ~'abcd(\m)+xyz';
60+
select'a' ~'^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
61+
select'x' ~'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
62+
select'x' ~'xyz(\Y\Y)+';
63+
select'x' ~'x|(?:\M)+';
64+
65+
-- This generates O(N) states but O(N^2) arcs, so it causes problems
66+
-- if arc count is not constrained
67+
select'x' ~ repeat('x*y*z*',1000);
68+
5869
-- Test backref in combination with non-greedy quantifier
5970
-- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
6071
select'Programmer' ~'(\w).*?\1'as t;

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp