Make locale-dependent regex character classes work for large char codes.
Previously, we failed to recognize Unicode characters above U+7FF as being members of locale-dependent character classes such as [[:alpha:]]. (Actually, the same problem occurs for large pg_wchar values in any multibyte encoding, but UTF8 is the only case people have actually complained about.) It's impractical to get Spencer's original code to handle character classes or ranges containing many thousands of characters, because it insists on considering each member character individually at regex compile time, whether or not the character will ever be of interest at run time.

To fix, choose a cutoff point MAX_SIMPLE_CHR below which we process characters individually as before, and deal with entire ranges or classes as single entities above that. We can actually make things cheaper than before for chars below the cutoff, because the color map can now be a simple linear array for those chars, rather than the multilevel tree structure Spencer designed. It's more expensive than before for chars above the cutoff, because we must do a binary search in a list of high chars and char ranges used in the regex pattern, plus call iswalpha() and friends for each locale-dependent character class used in the pattern. However, multibyte encodings are normally designed to give smaller codes to popular characters, so that we can expect that the slow path will be taken relatively infrequently. In any case, the speed penalty appears minor except when we have to apply iswalpha() etc. to high character codes at runtime --- and the previous coding gave wrong answers for those cases, so whether it was faster is moot.

Tom Lane, reviewed by Heikki Linnakangas

Discussion: <15563.1471913698@sss.pgh.pa.us>
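
To illustrate the two-tier lookup described above, here is a minimal sketch in C of a color map with a linear array for chars up to MAX_SIMPLE_CHR and a binary search over a sorted range list for higher chars. The type and field names (cmap, hichr_range, getcolor, and the cutoff value shown) are illustrative assumptions, not the actual identifiers or layout used in the PostgreSQL regex code, and the locale-class checks (iswalpha() etc.) applied to high chars at runtime are omitted:

    typedef unsigned int chr;       /* character code, e.g. a pg_wchar */
    typedef short color;            /* color (equivalence-class) number */

    #define MAX_SIMPLE_CHR 0x7FF    /* illustrative cutoff value */

    typedef struct
    {
        chr     low;                /* first chr covered by this range */
        chr     high;               /* last chr covered by this range */
        color   co;                 /* color assigned to the whole range */
    } hichr_range;

    typedef struct
    {
        color       simple[MAX_SIMPLE_CHR + 1]; /* one slot per low chr */
        hichr_range *ranges;        /* sorted, non-overlapping high ranges */
        int         nranges;
        color       defcolor;       /* color for chrs matching no range */
    } cmap;

    /* Look up the color of chr c in color map cm. */
    static color
    getcolor(const cmap *cm, chr c)
    {
        int     lo,
                hi;

        if (c <= MAX_SIMPLE_CHR)
            return cm->simple[c];   /* fast path: direct array index */

        /* slow path: binary search the sorted list of high ranges */
        lo = 0;
        hi = cm->nranges - 1;
        while (lo <= hi)
        {
            int         mid = (lo + hi) / 2;
            const hichr_range *r = &cm->ranges[mid];

            if (c < r->low)
                hi = mid - 1;
            else if (c > r->high)
                lo = mid + 1;
            else
                return r->co;
        }
        return cm->defcolor;
    }

The point of the sketch is the cost asymmetry noted in the message: below the cutoff a lookup is a single array index, while above it the cost grows only logarithmically in the number of high chars and ranges actually appearing in the pattern.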