Commit a53e0ea

In the Snowball dictionary, don't try to stem excessively-long words.
If the input word exceeds 1000 bytes, don't pass it to the stemmer; just return it as-is after case folding. Such an input is surely not a word in any human language, so whatever the stemmer might do to it would be pretty dubious in the first place. Adding this restriction protects us against a known recursion-to-stack-overflow problem in the Turkish stemmer, and it seems like good insurance against any other safety or performance issues that may exist in the Snowball stemmers. (I note, for example, that they contain no CHECK_FOR_INTERRUPTS calls, so we really don't want them running for a long time.) The threshold of 1000 bytes is arbitrary.

An alternative definition could have been to treat such words as stopwords, but that seems like a bigger break from the old behavior.

Per report from Egor Chindyaskin and Alexander Lakhin. Thanks to Olly Betts for the recommendation to fix it this way.

Discussion: https://postgr.es/m/1661334672.728714027@f473.i.mail.ru
1 parent 68bfe36, commit a53e0ea

1 file changed: +17 -1 lines changed

src/backend/snowball/dict_snowball.c

Lines changed: 17 additions & 1 deletion
@@ -257,8 +257,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
 	char	   *txt = lowerstr_with_len(in, len);
 	TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);
 
-	if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+	/*
+	 * Do not pass strings exceeding 1000 bytes to the stemmer, as they're
+	 * surely not words in any human language.  This restriction avoids
+	 * wasting cycles on stuff like base64-encoded data, and it protects us
+	 * against possible inefficiency or misbehavior in the stemmer.  (For
+	 * example, the Turkish stemmer has an indefinite recursion, so it can
+	 * crash on long-enough strings.)  However, Snowball dictionaries are
+	 * defined to recognize all strings, so we can't reject the string as an
+	 * unknown word.
+	 */
+	if (len > 1000)
+	{
+		/* return the lexeme lowercased, but otherwise unmodified */
+		res->lexeme = txt;
+	}
+	else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
 	{
+		/* empty or stopword, so report as stopword */
 		pfree(txt);
 	}
 	else
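
The three-way branch in this hunk can be tried outside the backend with a minimal standalone sketch. It is not the PostgreSQL code. `toy_lexize`, `toy_stem`, `toy_is_stopword` and `MAX_STEM_LEN` are hypothetical names invented for illustration, the "stemmer" only strips one trailing `s`, and plain `malloc`/`free` stand in for `palloc0`/`pfree`. Only the order of the checks is meant to mirror the patch: overlong input is case-folded and returned unstemmed, empty input and stopwords yield no lexeme, and everything else is stemmed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant mirroring the arbitrary 1000-byte threshold. */
#define MAX_STEM_LEN 1000

/* Hypothetical stand-in for a real stemmer: strips one trailing 's'. */
static void toy_stem(char *word)
{
    size_t n = strlen(word);

    if (n > 1 && word[n - 1] == 's')
        word[n - 1] = '\0';
}

/* Hypothetical stand-in for searchstoplist(): one hard-coded stopword. */
static int toy_is_stopword(const char *word)
{
    return strcmp(word, "the") == 0;
}

/*
 * Returns a malloc'd lexeme that the caller must free, or NULL when the
 * input is empty or a stopword. Aborts on allocation failure so that NULL
 * always means "no lexeme".
 */
static char *toy_lexize(const char *in, size_t len)
{
    assert(in != NULL);

    char *txt = malloc(len + 1);
    if (txt == NULL)
        abort();

    /* Case-fold first, as the patched function does before any check. */
    for (size_t i = 0; i < len; i++)
        txt[i] = (char) tolower((unsigned char) in[i]);
    txt[len] = '\0';

    if (len > MAX_STEM_LEN)
    {
        /* Overlong input: return it lowercased, never call the stemmer. */
        return txt;
    }
    else if (*txt == '\0' || toy_is_stopword(txt))
    {
        /* Empty or stopword: report no lexeme. */
        free(txt);
        return NULL;
    }

    toy_stem(txt);
    return txt;
}
```

Because the comparison is `len > MAX_STEM_LEN`, an input of exactly 1000 bytes still goes through the stemmer; only 1001 bytes and up take the pass-through branch.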
