Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commitcebc1d3

Browse files
committed
Fix regex engine to suppress useless concatenation sub-REs.
The comment for parsebranch() claims that it avoids generatingunnecessary concatenation nodes in the "subre" tree, but it missedsome significant cases. Once we've decided that a given atom is"messy" and can't be bundled with the preceding atom(s) of thecurrent regex branch, parseqatom() always generated two new concatnodes, one to concat the messy atom to what follows it in the branch,and an upper node to concatenate the preceding part of the branchto that one. But one or both of these could be unnecessary, if themessy atom is the first, last, or only one in the branch. Improvethe code to suppress such useless concat nodes, along with theno-op child nodes representing empty chunks of a branch.Reducing the number of subre tree nodes offers significant savingsnot only at execution but during compilation, because each subre nodehas its own NFA that has to be separately optimized. (Maybe somedaywe'll figure out how to share the optimization work across multipletree nodes, but it doesn't look easy.) Eliminating upper tree nodesis especially useful because they tend to have larger NFAs.This is part of a patch series that in total reduces the regex engine'sruntime by about a factor of four on a large corpus of real-world regexes.Patch by me, reviewed by Joel JacobsonDiscussion:https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
1 parent824bf71 commitcebc1d3

File tree

1 file changed

+76
-7
lines changed

1 file changed

+76
-7
lines changed

‎src/backend/regex/regcomp.c

Lines changed: 76 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1123,14 +1123,24 @@ parseqatom(struct vars *v,
11231123
t->left=atom;
11241124
atomp=&t->left;
11251125

1126-
/* here we should recurse... but we must postpone that to the end */
1126+
/*
1127+
* Here we should recurse to fill t->right ... but we must postpone that
1128+
* to the end.
1129+
*/
11271130

1128-
/* split top into prefix and remaining */
1131+
/*
1132+
* Convert top node to a concatenation of the prefix (top->left, covering
1133+
* whatever we parsed previously) and remaining (t). Note that the prefix
1134+
* could be empty, in which case this concatenation node is unnecessary.
1135+
* To keep things simple, we operate in a general way for now, and get rid
1136+
* of unnecessary subres below.
1137+
*/
11291138
assert(top->op=='='&&top->left==NULL&&top->right==NULL);
11301139
top->left=subre(v,'=',top->flags,top->begin,lp);
11311140
NOERR();
11321141
top->op='.';
11331142
top->right=t;
1143+
/* top->flags will get updated later */
11341144

11351145
/* if it's a backref, now is the time to replicate the subNFA */
11361146
if (atomtype==BACKREF)
@@ -1220,16 +1230,75 @@ parseqatom(struct vars *v,
12201230
/* and finally, look after that postponed recursion */
12211231
t=top->right;
12221232
if (!(SEE('|')||SEE(stopper)||SEE(EOS)))
1233+
{
1234+
/* parse all the rest of the branch, and insert in t->right */
12231235
t->right=parsebranch(v,stopper,type,s2,rp,1);
1236+
NOERR();
1237+
assert(SEE('|')||SEE(stopper)||SEE(EOS));
1238+
1239+
/* here's the promised update of the flags */
1240+
t->flags |=COMBINE(t->flags,t->right->flags);
1241+
top->flags |=COMBINE(top->flags,t->flags);
1242+
1243+
/*
1244+
* At this point both top and t are concatenation (op == '.') subres,
1245+
* and we have top->left = prefix of branch, top->right = t, t->left =
1246+
* messy atom (with quantification superstructure if needed), t->right
1247+
* = rest of branch.
1248+
*
1249+
* If the messy atom was the first thing in the branch, then top->left
1250+
* is vacuous and we can get rid of one level of concatenation. Since
1251+
* the caller is holding a pointer to the top node, we can't remove
1252+
* that node; but we're allowed to change its properties.
1253+
*/
1254+
assert(top->left->op=='=');
1255+
if (top->left->begin==top->left->end)
1256+
{
1257+
assert(!MESSY(top->left->flags));
1258+
freesubre(v,top->left);
1259+
top->left=t->left;
1260+
top->right=t->right;
1261+
t->left=t->right=NULL;
1262+
freesubre(v,t);
1263+
}
1264+
}
12241265
else
12251266
{
1267+
/*
1268+
* There's nothing left in the branch, so we don't need the second
1269+
* concatenation node 't'. Just link s2 straight to rp.
1270+
*/
12261271
EMPTYARC(s2,rp);
1227-
t->right=subre(v,'=',0,s2,rp);
1272+
top->right=t->left;
1273+
top->flags |=COMBINE(top->flags,top->right->flags);
1274+
t->left=t->right=NULL;
1275+
freesubre(v,t);
1276+
1277+
/*
1278+
* Again, it could be that top->left is vacuous (if the messy atom was
1279+
* in fact the only thing in the branch). In that case we need no
1280+
* concatenation at all; just replace top with top->right.
1281+
*/
1282+
assert(top->left->op=='=');
1283+
if (top->left->begin==top->left->end)
1284+
{
1285+
assert(!MESSY(top->left->flags));
1286+
freesubre(v,top->left);
1287+
t=top->right;
1288+
top->op=t->op;
1289+
top->flags=t->flags;
1290+
top->id=t->id;
1291+
top->subno=t->subno;
1292+
top->min=t->min;
1293+
top->max=t->max;
1294+
top->left=t->left;
1295+
top->right=t->right;
1296+
top->begin=t->begin;
1297+
top->end=t->end;
1298+
t->left=t->right=NULL;
1299+
freesubre(v,t);
1300+
}
12281301
}
1229-
NOERR();
1230-
assert(SEE('|')||SEE(stopper)||SEE(EOS));
1231-
t->flags |=COMBINE(t->flags,t->right->flags);
1232-
top->flags |=COMBINE(top->flags,t->flags);
12331302
}
12341303

12351304
/*

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp