forked frompostgres/postgres
- Notifications
You must be signed in to change notification settings - Fork6
Commit9556aa0
committed
Use single-byte Boyer-Moore-Horspool search even with multibyte encodings.
The old implementation first converted the input strings to arrays ofwchars, and performed the conversion on those. However, the conversion isexpensive, and for a large input string, consumes a lot of memory.Allocating the large arrays also meant that these functions could not beused on strings larger 1 GB / pg_encoding_max_length() (256 MB for UTF-8).Avoid the conversion, and instead use the single-byte algorithm even withmultibyte encodings. That can get fooled, if there is a matching bytesequence in the middle of a multi-byte character, so to eliminate falsepositives like that, we verify any matches by walking the string characterby character with pg_mblen(). Also, if the caller needs the position ofthe match, as a character-offset, we also need to walk the string to countthe characters.Performance testing shows that walking the whole string with pg_mblen() issomewhat slower than converting the whole string to wchars. It's stilloften a win, though, because we don't need to do it if there is no match,and even when there is, we only need to walk up to the point where thematch is, not the whole string. Even in the worst case, there would beroom for optimization: Much of the CPU time in the current loop withpg_mblen() is function call overhead, and could be improved by inliningpg_mblen() and/or the encoding-specific mblen() functions. But I didn'tattempt to do that as part of this patch.Most of the callers of text_position_setup/next functions were actuallynot interested in the position of the match, counted in characters. Tocater for them, refactor the text_position_next() interface into twoparts: searching for the next match (text_position_next()), and returningthe current match's position as a pointer (text_position_get_match_ptr())or as a character offset (text_position_get_match_pos()). Getting thepointer to the match is a more convenient API for many callers, and withUTF-8, it allows skipping the character-walking step altogether, becauseUTF-8 can't have false matches even when treated like raw byte strings.Reviewed-by: John NaylorDiscussion:https://www.postgresql.org/message-id/3173d989-bc1c-fc8a-3b69-f24246f73876%40iki.fi1 parenta5be6e9 commit9556aa0
1 file changed
+257
-226
lines changed0 commit comments
Comments
(0)