Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit9556aa0

Browse files
committed
Use single-byte Boyer-Moore-Horspool search even with multibyte encodings.
The old implementation first converted the input strings to arrays ofwchars, and performed the conversion on those. However, the conversion isexpensive, and for a large input string, consumes a lot of memory.Allocating the large arrays also meant that these functions could not beused on strings larger 1 GB / pg_encoding_max_length() (256 MB for UTF-8).Avoid the conversion, and instead use the single-byte algorithm even withmultibyte encodings. That can get fooled, if there is a matching bytesequence in the middle of a multi-byte character, so to eliminate falsepositives like that, we verify any matches by walking the string characterby character with pg_mblen(). Also, if the caller needs the position ofthe match, as a character-offset, we also need to walk the string to countthe characters.Performance testing shows that walking the whole string with pg_mblen() issomewhat slower than converting the whole string to wchars. It's stilloften a win, though, because we don't need to do it if there is no match,and even when there is, we only need to walk up to the point where thematch is, not the whole string. Even in the worst case, there would beroom for optimization: Much of the CPU time in the current loop withpg_mblen() is function call overhead, and could be improved by inliningpg_mblen() and/or the encoding-specific mblen() functions. But I didn'tattempt to do that as part of this patch.Most of the callers of text_position_setup/next functions were actuallynot interested in the position of the match, counted in characters. Tocater for them, refactor the text_position_next() interface into twoparts: searching for the next match (text_position_next()), and returningthe current match's position as a pointer (text_position_get_match_ptr())or as a character offset (text_position_get_match_pos()). Getting thepointer to the match is a more convenient API for many callers, and withUTF-8, it allows skipping the character-walking step altogether, becauseUTF-8 can't have false matches even when treated like raw byte strings.Reviewed-by: John NaylorDiscussion:https://www.postgresql.org/message-id/3173d989-bc1c-fc8a-3b69-f24246f73876%40iki.fi
1 parenta5be6e9 commit9556aa0

File tree

1 file changed

+257
-226
lines changed

1 file changed

+257
-226
lines changed

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp