Quick Links

fixing tsearch locale support

Lists:	pgsql-hackers

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	fixing tsearch locale support
Date:	2024-12-02 10:57:05
Message-ID:	653f3b84-fc87-45a7-9a0c-bfb4fcab3e7d@eisentraut.org
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

Infamously, the tsearch locale support in
src/backend/tsearch/ts_locale.c still depends on libc environment
variable locale settings and is not caught up with pg_locale_t,
collations, ICU, and all that newer stuff. This is used in the tsearch
facilities themselves, but also in other modules such as ltree, pg_trgm,
and unaccent.

Several of the functions are wrappers around <ctype.h> functions, like

int
t_isalpha(const char *ptr)
{
int clen = pg_mblen(ptr);
wchar_t character[WC_BUF_LEN];
pg_locale_t mylocale = 0; /* TODO */

if (clen == 1 || database_ctype_is_c)
return isalpha(TOUCHAR(ptr));

char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

return iswalpha((wint_t) character[0]);
}

So this has multibyte and encoding awareness, but does not observe
locale provider or collation settings.

As an easy start toward fixing this, I think several of these functions
we don't even need.

t_isdigit() and t_isspace() are just used to parse various configuration
and data files, and surely we don't need support for encoding-dependent
multibyte support for parsing ASCII digits and ASCII spaces. At least,
I didn't find any indications in the documentation of these file formats
that they are supposed to support that kind of thing. So these can be
replaced by the normal isdigit() and isspace().

There is one call to t_isprint(), which is similarly used only to parse
some flags in a configuration file. From the surrounding code you can
deduce that it's only called on single-byte characters, so it can
similarly be replaced by plain issprint().

Note, pg_trgm has some compile-time options with macros such as
KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the
non-default variant is not supported by any test cases. So as part of
this undertaking, I'm going to remove the non-default variants if they
are in the way of cleanup.

Attachment	Content-Type	Size
0001-Remove-t_isdigit.patch	text/plain	4.8 KB
0002-Remove-t_isspace.patch	text/plain	16.1 KB
0003-Remove-t_isprint.patch	text/plain	2.0 KB

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-09 10:11:57
Message-ID:	f30299bf-ad8e-4125-bf80-e0a8663991b6@eisentraut.org
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

I have expanded this patch set. The first three patches are the same as
before. I have added a new patch that gets rid of lowerstr() from
ts_locale.c and replaces it with the standard str_tolower() that
everyone else is using.

lowerstr() and lowerstr_with_len() in ts_locale.c do the same thing as
str_tolower(), except that the former don't use the common locale
provider framework but instead use the global libc locale settings.

This patch replaces uses of lowerstr*() with str_tolower(...,
DEFAULT_COLLATION_OID). For instances that use a libc locale globally,
this will result in exactly the same behavior. For instances that use
other locale providers, you now get consistent behavior and are no
longer dependent on the libc locale settings.

Most uses of these functions are for processing dictionary and
configuration files. In those cases, using the default collation seems
appropriate. At least we don't have a more specific collation
available. But the code in contrib/pg_trgm should really depend on the
collation of the columns being processed. This is not done here, this
can be done in a separate patch.

(You can probably construct some edge cases where this change would
create some locale-related upgrade incompatibility, for example if
before you used a combination of ICU and a differently-behaving libc
locale. We can document this in the release notes, but I don't think
there is anything more we can do about this.)

After these patches, the only problematic things left in ts_locale.{c,h} are

extern int t_isalpha(const char *ptr);
extern int t_isalnum(const char *ptr);

My current assessment is that these are best addressed after patch [0],
because we need locale-provider-aware character classification functions.

[0]:
https://www.postgresql.org/message-id/flat/2830211e1b6e6a2e26d845780b03e125281ea17b(dot)camel(at)j-davis(dot)com

On 02.12.24 11:57, Peter Eisentraut wrote:
> Infamously, the tsearch locale support in src/backend/tsearch/
> ts_locale.c still depends on libc environment variable locale settings
> and is not caught up with pg_locale_t, collations, ICU, and all that
> newer stuff. This is used in the tsearch facilities themselves, but
> also in other modules such as ltree, pg_trgm, and unaccent.
>
> Several of the functions are wrappers around <ctype.h> functions, like
>
> int
> t_isalpha(const char *ptr)
> {
>     int         clen = pg_mblen(ptr);
>     wchar_t     character[WC_BUF_LEN];
>     pg_locale_t mylocale = 0;   /* TODO */
>
>     if (clen == 1 || database_ctype_is_c)
>         return isalpha(TOUCHAR(ptr));
>
>     char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
>
>     return iswalpha((wint_t) character[0]);
> }
>
> So this has multibyte and encoding awareness, but does not observe
> locale provider or collation settings.
>
> As an easy start toward fixing this, I think several of these functions
> we don't even need.
>
> t_isdigit() and t_isspace() are just used to parse various configuration
> and data files, and surely we don't need support for encoding-dependent
> multibyte support for parsing ASCII digits and ASCII spaces. At least,
> I didn't find any indications in the documentation of these file formats
> that they are supposed to support that kind of thing. So these can be
> replaced by the normal isdigit() and isspace().
>
> There is one call to t_isprint(), which is similarly used only to parse
> some flags in a configuration file. From the surrounding code you can
> deduce that it's only called on single-byte characters, so it can
> similarly be replaced by plain issprint().
>
> Note, pg_trgm has some compile-time options with macros such as
> KEEPONLYALNUM and IGNORECASE. AFAICT, these are not documented, and the
> non-default variant is not supported by any test cases. So as part of
> this undertaking, I'm going to remove the non-default variants if they
> are in the way of cleanup.

Attachment	Content-Type	Size
v2-0001-Remove-t_isdigit.patch	text/plain	4.8 KB
v2-0002-Remove-t_isspace.patch	text/plain	16.1 KB
v2-0003-Remove-t_isprint.patch	text/plain	2.0 KB
v2-0004-Remove-ts_locale.c-s-lowerstr.patch	text/plain	14.9 KB

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-12 18:14:21
Message-ID:	c7835d44c9388b2776ec6ca48d54daffc872377a.camel@j-davis.com
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
> t_isdigit() and t_isspace() are just used to parse various
> configuration
> and data files, and surely we don't need support for encoding-
> dependent
> multibyte support for parsing ASCII digits and ASCII spaces.
> ... So these can
> be
> replaced by the normal isdigit() and isspace().

That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?

There was also some discussion about forcing LC_COLLATE and LC_CTYPE to
C, now that the default collation doesn't depend on them any more (cf.
option 1):

https://www.postgresql.org/message-id/CA+hUKGL82jG2PdgfQtwWG+_51TQ--6M9XNa3rtt7ub+S3Pmfsw@mail.gmail.com

If we do that, then it would be fine to use isdigit/isspace.

Regards,
Jeff Davis

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-12 18:20:14
Message-ID:	190f5bbbbefe8accddf685e227365d9f7dd3e282.camel@j-davis.com
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On Mon, 2024-12-09 at 11:11 +0100, Peter Eisentraut wrote:
> I have expanded this patch set. The first three patches are the same
> as
> before. I have added a new patch that gets rid of lowerstr() from
> ts_locale.c and replaces it with the standard str_tolower() that
> everyone else is using.

+1 to the patch series.

Note: I posted a patch series to support case folding:

https://commitfest.postgresql.org/51/5436/

and we may want to use case folding for some of these purposes
internally. But this is not a blocker.

There is some kind of compatibility risk with these changes, so we will
need a release note. And we should try to get all the changes in one
release to avoid repeated small breakages.

Regards,
Jeff Davis

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-13 06:16:22
Message-ID:	563dac99-5848-4126-ba7c-752820369da8@eisentraut.org
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On 12.12.24 19:14, Jeff Davis wrote:
> On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
>> t_isdigit() and t_isspace() are just used to parse various
>> configuration
>> and data files, and surely we don't need support for encoding-
>> dependent
>> multibyte support for parsing ASCII digits and ASCII spaces.
>> ... So these can
>> be
>> replaced by the normal isdigit() and isspace().
>
> That would still call libc, and still depend on LC_CTYPE. Should we use
> pure ASCII variants?

isdigit() and isspace() in particular are widely used throughout the
backend code without such concerns. I think the assumption is that this
is not a problem in practice: For multibyte encodings, these functions
would only be able to process the ASCII subset, and the character
classification of that should be consistent across all locales. For
single-byte encodings, among the encodings that PostgreSQL supports, I
don't think any of them actually provide non-ASCII digits or space
characters.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-13 17:07:54
Message-ID:	aa8d8a056997297fdb60a82db01c280c9e71f4ba.camel@j-davis.com
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On Fri, 2024-12-13 at 07:16 +0100, Peter Eisentraut wrote:
> isdigit() and isspace() in particular are widely used throughout the
> backend code without such concerns. I think the assumption is that
> this
> is not a problem in practice: For multibyte encodings, these
> functions
> would only be able to process the ASCII subset, and the character
> classification of that should be consistent across all locales. For
> single-byte encodings, among the encodings that PostgreSQL supports,
> I
> don't think any of them actually provide non-ASCII digits or space
> characters.

OK, that's fine with me for this patch series.

Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

Regards,
Jeff Davis

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-17 15:25:22
Message-ID:	a0243556-4e10-4bd3-b3f4-f4236dc83596@proxel.se
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On 12/13/24 6:07 PM, Jeff Davis wrote:
> OK, that's fine with me for this patch series.
>
> Eventually though, I think we should have built-in versions of these
> ASCII functions. Even if there's no actual problem, it would more
> clearly indicate that we only care about ASCII at that particular call
> site, and eliminate questions about what libc might do on some platform
> for some encoding/locale combination. It would also make it easier to
> search for locale-sensitive functions in the codebase.

+1 I had exactly the same idea.

Andreas

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-17 18:27:07
Message-ID:	4bbdfe2d-f6c6-491e-90fb-4c5604321b9a@eisentraut.org
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On 17.12.24 16:25, Andreas Karlsson wrote:
> On 12/13/24 6:07 PM, Jeff Davis wrote:
>> OK, that's fine with me for this patch series.
>>
>> Eventually though, I think we should have built-in versions of these
>> ASCII functions. Even if there's no actual problem, it would more
>> clearly indicate that we only care about ASCII at that particular call
>> site, and eliminate questions about what libc might do on some platform
>> for some encoding/locale combination. It would also make it easier to
>> search for locale-sensitive functions in the codebase.
>
> +1 I had exactly the same idea.

Yes, I think that could make sense.

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: fixing tsearch locale support
Date:	2024-12-17 18:29:16
Message-ID:	e5313839-32af-4856-a0e2-77b143e5f8c5@eisentraut.org
Views:	Whole Thread \|Raw Message \|Download mbox \|Resend email
Lists:	pgsql-hackers

On 12.12.24 19:20, Jeff Davis wrote:
> On Mon, 2024-12-09 at 11:11 +0100, Peter Eisentraut wrote:
>> I have expanded this patch set. The first three patches are the same
>> as
>> before. I have added a new patch that gets rid of lowerstr() from
>> ts_locale.c and replaces it with the standard str_tolower() that
>> everyone else is using.
>
> +1 to the patch series.
>
> Note: I posted a patch series to support case folding:
>
>https://commitfest.postgresql.org/51/5436/
>
> and we may want to use case folding for some of these purposes
> internally. But this is not a blocker.
>
> There is some kind of compatibility risk with these changes, so we will
> need a release note. And we should try to get all the changes in one
> release to avoid repeated small breakages.

I have committed this and made a note on the open items wiki page about
making a release note or similar.

I'll close this commitfest entry now and will come back with a new patch
series for t_isalpha/t_isalnum when the locale-provider-aware character
classification functions are available.

Movatterモバイル変換

Quick Links

fixing tsearch locale support