Commit d3d0983

Support PG_UNICODE_FAST locale in the builtin collation provider.

The PG_UNICODE_FAST locale uses code point sort order (fast, memcmp-based) combined with Unicode character semantics. The character semantics are based on Unicode full case mapping.

Full case mapping can map a single codepoint to multiple codepoints, such as "ß" uppercasing to "SS". Additionally, it handles context-sensitive mappings like the "final sigma", and it uses titlecase mappings such as "Dž" when titlecasing (rather than plain uppercase mappings).

Importantly, the uppercasing of "ß" as "SS" is specifically mentioned by the SQL standard. In Postgres, UCS_BASIC uses plain ASCII semantics for case mapping and pattern matching, so if we changed it to use the PG_UNICODE_FAST locale, it would offer better compliance with the standard. For now, though, do not change the behavior of UCS_BASIC.

Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite

1 parent 286a365 commit d3d0983
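
The "code point sort order (fast, memcmp-based)" mentioned in the commit message rests on a property of UTF-8: comparing well-formed UTF-8 strings byte by byte, as unsigned bytes, gives the same result as comparing their code points. The following is a minimal sketch of such a comparison, assuming valid UTF-8 input. The name `codepoint_order_cmp` is hypothetical and this is not PostgreSQL's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: compare two UTF-8 strings of
 * known byte length in code point order. For well-formed UTF-8, unsigned
 * byte order matches code point order, so memcmp() over the common prefix
 * plus a length tie-break is sufficient.
 */
int
codepoint_order_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t minlen = (alen < blen) ? alen : blen;
    int    cmp;

    assert(a != NULL && b != NULL);

    cmp = memcmp(a, b, minlen);
    if (cmp != 0)
        return (cmp < 0) ? -1 : 1;

    /* Equal common prefix: the shorter string sorts first. */
    if (alen == blen)
        return 0;
    return (alen < blen) ? -1 : 1;
}
```

With this ordering "Z" (U+005A) sorts before "a" (U+0061), and "z" sorts before "ß" (U+00DF), which differs from natural language order but needs no collation tables.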

13 files changed, +283 -16 lines changed


‎doc/src/sgml/charset.sgml

Lines changed: 27 additions & 2 deletions
@@ -377,8 +377,9 @@ initdb --locale-provider=icu --icu-locale=en
 <listitem>
 <para>
 The <literal>builtin</literal> provider uses built-in operations. Only
-the <literal>C</literal> and <literal>C.UTF-8</literal> locales are
-supported for this provider.
+the <literal>C</literal>, <literal>C.UTF-8</literal>, and
+<literal>PG_UNICODE_FAST</literal> locales are supported for this
+provider.
 </para>
 <para>
 The <literal>C</literal> locale behavior is identical to the
@@ -392,6 +393,13 @@ initdb --locale-provider=icu --icu-locale=en
 regular expression character classes are based on the "POSIX
 Compatible" semantics, and the case mapping is the "simple" variant.
 </para>
+<para>
+The <literal>PG_UNICODE_FAST</literal> locale is available only when
+the database encoding is <literal>UTF-8</literal>, and the behavior is
+based on Unicode. The collation uses the code point values only. The
+regular expression character classes are based on the "Standard"
+semantics, and the case mapping is the "full" variant.
+</para>
 </listitem>
 </varlistentry>
 
@@ -886,6 +894,23 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 </listitem>
 </varlistentry>
 
+<varlistentry>
+<term><literal>pg_unicode_fast</literal></term>
+<listitem>
+<para>
+This collation sorts by Unicode code point values rather than natural
+language order. For the functions <function>lower</function>,
+<function>initcap</function>, and <function>upper</function> it uses
+Unicode full case mapping. For pattern matching (including regular
+expressions), it uses the Standard variant of Unicode <ulink
+url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
+Properties</ulink>. Behavior is efficient and stable within a
+<productname>Postgres</productname> major version. It is only
+available for encoding <literal>UTF8</literal>.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry>
 <term><literal>pg_c_utf8</literal></term>
 <listitem>

‎doc/src/sgml/ref/create_collation.sgml

Lines changed: 2 additions & 1 deletion
@@ -99,7 +99,8 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 <para>
 If <replaceable>provider</replaceable> is <literal>builtin</literal>,
 then <replaceable>locale</replaceable> must be specified and set to
-either <literal>C</literal> or <literal>C.UTF-8</literal>.
+either <literal>C</literal>, <literal>C.UTF-8</literal> or
+<literal>PG_UNICODE_FAST</literal>.
 </para>
 </listitem>
 </varlistentry>

‎doc/src/sgml/ref/create_database.sgml

Lines changed: 4 additions & 2 deletions
@@ -168,7 +168,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
 If <xref linkend="create-database-locale-provider"/> is
 <literal>builtin</literal>, then <replaceable>locale</replaceable> or
 <replaceable>builtin_locale</replaceable> must be specified and set to
-either <literal>C</literal> or <literal>C.UTF-8</literal>.
+either <literal>C</literal>, <literal>C.UTF-8</literal>, or
+<literal>PG_UNICODE_FAST</literal>.
 </para>
 <tip>
 <para>
@@ -233,7 +234,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
 </para>
 <para>
 The locales available for the <literal>builtin</literal> provider are
-<literal>C</literal> and <literal>C.UTF-8</literal>.
+<literal>C</literal>, <literal>C.UTF-8</literal> and
+<literal>PG_UNICODE_FAST</literal>.
 </para>
 </listitem>
 </varlistentry>

‎doc/src/sgml/ref/initdb.sgml

Lines changed: 2 additions & 2 deletions
@@ -295,8 +295,8 @@ PostgreSQL documentation
 <para>
 If <option>--locale-provider</option> is <literal>builtin</literal>,
 <option>--locale</option> or <option>--builtin-locale</option> must be
-specified and set to <literal>C</literal> or
-<literal>C.UTF-8</literal>.
+specified and set to <literal>C</literal>, <literal>C.UTF-8</literal>
+or <literal>PG_UNICODE_FAST</literal>.
 </para>
 </listitem>
 </varlistentry>

‎src/backend/regex/regc_pg_locale.c

Lines changed: 3 additions & 3 deletions
@@ -307,7 +307,7 @@ pg_wc_isdigit(pg_wchar c)
 			return (c <= (pg_wchar) 127 &&
 					(pg_char_properties[c] & PG_ISDIGIT));
 		case PG_REGEX_STRATEGY_BUILTIN:
-			return pg_u_isdigit(c, true);
+			return pg_u_isdigit(c, !pg_regex_locale->info.builtin.casemap_full);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
@@ -361,7 +361,7 @@ pg_wc_isalnum(pg_wchar c)
 			return (c <= (pg_wchar) 127 &&
 					(pg_char_properties[c] & PG_ISALNUM));
 		case PG_REGEX_STRATEGY_BUILTIN:
-			return pg_u_isalnum(c, true);
+			return pg_u_isalnum(c, !pg_regex_locale->info.builtin.casemap_full);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
@@ -505,7 +505,7 @@ pg_wc_ispunct(pg_wchar c)
 			return (c <= (pg_wchar) 127 &&
 					(pg_char_properties[c] & PG_ISPUNCT));
 		case PG_REGEX_STRATEGY_BUILTIN:
-			return pg_u_ispunct(c, true);
+			return pg_u_ispunct(c, !pg_regex_locale->info.builtin.casemap_full);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
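
The three changed calls replace a hard-coded `true` with `!casemap_full`. Judging from the documentation changes in this commit, that second argument appears to select the "POSIX Compatible" variant when true and the "Standard" variant when false, so `PG_UNICODE_FAST` gets the Standard classes. The ASCII-only toy below illustrates how such a flag can change what counts as punctuation. `toy_ispunct` and its two tables are hypothetical and are not the PostgreSQL implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* ASCII characters whose Unicode general category is P* (punctuation). */
static const char ascii_punctuation[] = "!\"#%&'()*,-./:;?@[\\]_{}";

/* ASCII characters whose Unicode general category is S* (symbols). */
static const char ascii_symbols[] = "$+<=>^`|~";

/*
 * Hypothetical toy classifier for non-NUL ASCII characters.
 *   posix = true  : punctuation or symbol  (POSIX Compatible variant)
 *   posix = false : punctuation only       (Standard variant)
 */
bool
toy_ispunct(int c, bool posix)
{
    assert(c > 0 && c < 128);   /* the toy covers non-NUL ASCII only */

    if (strchr(ascii_punctuation, c) != NULL)
        return true;
    return posix && strchr(ascii_symbols, c) != NULL;
}
```

Under this sketch `toy_ispunct('=', false)` is false while `toy_ispunct('=', true)` is true, which mirrors the regression test comment "symbols are not punctuation" for `PG_UNICODE_FAST`.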

‎src/backend/utils/adt/pg_locale.c

Lines changed: 6 additions & 1 deletion
@@ -1590,8 +1590,11 @@ builtin_locale_encoding(const char *locale)
 {
 	if (strcmp(locale, "C") == 0)
 		return -1;
-	if (strcmp(locale, "C.UTF-8") == 0)
+	else if (strcmp(locale, "C.UTF-8") == 0)
 		return PG_UTF8;
+	else if (strcmp(locale, "PG_UNICODE_FAST") == 0)
+		return PG_UTF8;
+
 
 	ereport(ERROR,
 			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -1616,6 +1619,8 @@ builtin_validate_locale(int encoding, const char *locale)
 		canonical_name = "C";
 	else if (strcmp(locale, "C.UTF-8") == 0 || strcmp(locale, "C.UTF8") == 0)
 		canonical_name = "C.UTF-8";
+	else if (strcmp(locale, "PG_UNICODE_FAST") == 0)
+		canonical_name = "PG_UNICODE_FAST";
 
 	if (!canonical_name)
 		ereport(ERROR,
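
The two hunks above extend a name-to-canonical-name mapping and a name-to-required-encoding mapping. A standalone sketch of the same two lookups follows, with error reporting replaced by a NULL return. The `toy_` names and the `toy_encoding` enum are hypothetical stand-ins for PostgreSQL's own functions and encoding IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for PostgreSQL's encoding IDs. */
enum toy_encoding { TOY_ANY_ENCODING = -1, TOY_UTF8 = 6 };

/*
 * Hypothetical helper: map a user-supplied builtin locale name to its
 * canonical spelling, or return NULL if the name is not recognized.
 */
const char *
toy_canonical_builtin_locale(const char *name)
{
    assert(name != NULL);

    if (strcmp(name, "C") == 0)
        return "C";
    if (strcmp(name, "C.UTF-8") == 0 || strcmp(name, "C.UTF8") == 0)
        return "C.UTF-8";
    if (strcmp(name, "PG_UNICODE_FAST") == 0)
        return "PG_UNICODE_FAST";
    return NULL;
}

/*
 * Hypothetical helper: encoding required by a canonical builtin locale.
 * "C" works with any encoding; the other two require UTF-8.
 */
int
toy_builtin_locale_encoding(const char *canonical)
{
    assert(canonical != NULL);

    return (strcmp(canonical, "C") == 0) ? TOY_ANY_ENCODING : TOY_UTF8;
}
```

Note that only `C.UTF-8` has an accepted alternative spelling (`C.UTF8`) in the diff; `PG_UNICODE_FAST` is matched exactly, so a name such as `unicode` is rejected.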

‎src/backend/utils/adt/pg_locale_builtin.c

Lines changed: 9 additions & 3 deletions
@@ -78,7 +78,8 @@ size_t
 strlower_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
 				 pg_locale_t locale)
 {
-	return unicode_strlower(dest, destsize, src, srclen, false);
+	return unicode_strlower(dest, destsize, src, srclen,
+							locale->info.builtin.casemap_full);
 }
 
 size_t
@@ -93,15 +94,17 @@ strtitle_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
 		.prev_alnum = false,
 	};
 
-	return unicode_strtitle(dest, destsize, src, srclen, false,
+	return unicode_strtitle(dest, destsize, src, srclen,
+							locale->info.builtin.casemap_full,
 							initcap_wbnext, &wbstate);
 }
 
 size_t
 strupper_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
 				 pg_locale_t locale)
 {
-	return unicode_strupper(dest, destsize, src, srclen, false);
+	return unicode_strupper(dest, destsize, src, srclen,
+							locale->info.builtin.casemap_full);
 }
 
 pg_locale_t
@@ -142,6 +145,7 @@ create_pg_locale_builtin(Oid collid, MemoryContext context)
 	result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
 
 	result->info.builtin.locale = MemoryContextStrdup(context, locstr);
+	result->info.builtin.casemap_full = (strcmp(locstr, "PG_UNICODE_FAST") == 0);
 	result->provider = COLLPROVIDER_BUILTIN;
 	result->deterministic = true;
 	result->collate_is_c = true;
@@ -164,6 +168,8 @@ get_collation_actual_version_builtin(const char *collcollate)
 		return "1";
 	else if (strcmp(collcollate, "C.UTF-8") == 0)
 		return "1";
+	else if (strcmp(collcollate, "PG_UNICODE_FAST") == 0)
+		return "1";
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
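
The hunks above thread a single boolean, `casemap_full`, into the case-mapping calls. To illustrate what such a flag changes, here is a deliberately tiny sketch that uppercases only ASCII letters and "ß" (U+00DF). `toy_strupper` is hypothetical; it is not `unicode_strupper` and makes no claim about that function's internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical toy: uppercase a NUL-terminated UTF-8 string, handling only
 * ASCII letters and U+00DF (bytes C3 9F). With full = true, U+00DF becomes
 * "SS"; with full = false it is copied unchanged, since simple case mapping
 * maps one code point to exactly one code point.
 *
 * Returns the number of bytes the complete result needs (without the NUL).
 * At most dstsize - 1 bytes are stored, followed by a terminator.
 */
size_t
toy_strupper(char *dst, size_t dstsize, const char *src, bool full)
{
    size_t needed = 0;      /* bytes the complete result requires */
    size_t written = 0;     /* bytes actually stored in dst */
    size_t i = 0;

    assert(src != NULL && (dst != NULL || dstsize == 0));

    while (src[i] != '\0')
    {
        char   out[2];
        size_t outlen = 1;
        size_t inlen = 1;

        if ((unsigned char) src[i] == 0xC3 &&
            (unsigned char) src[i + 1] == 0x9F)
        {
            /* U+00DF: "SS" under full mapping, unchanged under simple. */
            inlen = outlen = 2;
            out[0] = full ? 'S' : src[i];
            out[1] = full ? 'S' : src[i + 1];
        }
        else if (src[i] >= 'a' && src[i] <= 'z')
            out[0] = (char) (src[i] - 'a' + 'A');
        else
            out[0] = src[i];

        /* Store the piece only if it fits with room for the terminator. */
        if (needed + outlen < dstsize)
        {
            memcpy(dst + needed, out, outlen);
            written = needed + outlen;
        }
        needed += outlen;
        i += inlen;
    }

    if (dstsize > 0)
        dst[written] = '\0';
    return needed;
}
```

For "straße" both modes need 7 bytes: the full mapping yields "STRASSE" and the simple mapping yields "STRAßE". The byte count stays the same because "ß" occupies two bytes in UTF-8, although the character count grows by one under the full mapping.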

‎src/bin/initdb/initdb.c

Lines changed: 5 additions & 1 deletion
@@ -2489,6 +2489,8 @@ setlocales(void)
 		else if (strcmp(datlocale, "C.UTF-8") == 0 ||
 				 strcmp(datlocale, "C.UTF8") == 0)
 			canonname = "C.UTF-8";
+		else if (strcmp(datlocale, "PG_UNICODE_FAST") == 0)
+			canonname = "PG_UNICODE_FAST";
 		else
 			pg_fatal("invalid locale name \"%s\" for builtin provider",
 					 datlocale);
@@ -2782,7 +2784,9 @@ setup_locale_encoding(void)
 
 	if (locale_provider == COLLPROVIDER_BUILTIN)
 	{
-		if (strcmp(datlocale, "C.UTF-8") == 0 && encodingid != PG_UTF8)
+		if ((strcmp(datlocale, "C.UTF-8") == 0 ||
+			 strcmp(datlocale, "PG_UNICODE_FAST") == 0) &&
+			encodingid != PG_UTF8)
 			pg_fatal("builtin provider locale \"%s\" requires encoding \"%s\"",
 					 datlocale, "UTF-8");
 	}

‎src/include/catalog/catversion.h

Lines changed: 1 addition & 1 deletion
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202501162
+#define CATALOG_VERSION_NO	202501171
 
 #endif

‎src/include/catalog/pg_collation.dat

Lines changed: 3 additions & 0 deletions
@@ -33,5 +33,8 @@
   descr => 'sorts by Unicode code point; Unicode and POSIX character semantics',
   collname => 'pg_c_utf8', collprovider => 'b', collencoding => '6',
   colllocale => 'C.UTF-8', collversion => '1' },
+{ oid => '9535', descr => 'sorts by Unicode code point; Unicode character semantics',
+  collname => 'pg_unicode_fast', collprovider => 'b', collencoding => '6',
+  colllocale => 'PG_UNICODE_FAST', collversion => '1' },
 
 ]

‎src/include/utils/pg_locale.h

Lines changed: 1 addition & 0 deletions
@@ -108,6 +108,7 @@ struct pg_locale_struct
 		struct
 		{
 			const char *locale;
+			bool		casemap_full;
 		}			builtin;
 		locale_t	lt;
 #ifdef USE_ICU

‎src/test/regress/expected/collate.utf8.out

Lines changed: 160 additions & 0 deletions
@@ -160,3 +160,163 @@ SELECT 'δ' ~* '[Γ-Λ]' COLLATE PG_C_UTF8; -- same as above with cases reversed
 t
 (1 row)
 
+--
+-- Test PG_UNICODE_FAST
+--
+CREATE COLLATION regress_pg_unicode_fast (
+  provider = builtin, locale = 'unicode'); -- fails
+ERROR: invalid locale name "unicode" for builtin provider
+CREATE COLLATION regress_pg_unicode_fast (
+  provider = builtin, locale = 'PG_UNICODE_FAST');
+CREATE TABLE test_pg_unicode_fast (
+  t TEXT COLLATE PG_UNICODE_FAST
+);
+INSERT INTO test_pg_unicode_fast VALUES
+  ('abc DEF 123abc'),
+  ('ábc sßs ßss DÉF'),
+  ('DŽxxDŽ džxxDž Džxxdž'),
+  ('ȺȺȺ'),
+  ('ⱥⱥⱥ'),
+  ('ⱥȺ');
+SELECT
+  t, lower(t), initcap(t), upper(t),
+  length(convert_to(t, 'UTF8')) AS t_bytes,
+  length(convert_to(lower(t), 'UTF8')) AS lower_t_bytes,
+  length(convert_to(initcap(t), 'UTF8')) AS initcap_t_bytes,
+  length(convert_to(upper(t), 'UTF8')) AS upper_t_bytes
+  FROM test_pg_unicode_fast;
+t | lower | initcap | upper | t_bytes | lower_t_bytes | initcap_t_bytes | upper_t_bytes
+-----------------+-----------------+------------------+-------------------+---------+---------------+-----------------+---------------
+abc DEF 123abc | abc def 123abc | Abc Def 123abc | ABC DEF 123ABC | 14 | 14 | 14 | 14
+ábc sßs ßss DÉF | ábc sßs ßss déf | Ábc Sßs Ssss Déf | ÁBC SSSS SSSS DÉF | 19 | 19 | 19 | 19
+DŽxxDŽ džxxDž Džxxdž | džxxdž džxxdž džxxdž | Džxxdž Džxxdž Džxxdž | DŽXXDŽ DŽXXDŽ DŽXXDŽ | 20 | 20 | 20 | 20
+ȺȺȺ | ⱥⱥⱥ | Ⱥⱥⱥ | ȺȺȺ | 6 | 9 | 8 | 6
+ⱥⱥⱥ | ⱥⱥⱥ | Ⱥⱥⱥ | ȺȺȺ | 9 | 9 | 8 | 6
+ⱥȺ | ⱥⱥ | Ⱥⱥ | ȺȺ | 5 | 6 | 5 | 4
+(6 rows)
+
+DROP TABLE test_pg_unicode_fast;
+-- test Final_Sigma
+SELECT lower('ΑΣ' COLLATE PG_UNICODE_FAST); -- 0391 03A3
+lower
+-------
+ας
+(1 row)
+
+SELECT lower('ΑΣ0' COLLATE PG_UNICODE_FAST); -- 0391 03A3 0030
+lower
+-------
+ας0
+(1 row)
+
+SELECT lower('ἈΣ̓' COLLATE PG_UNICODE_FAST); -- 0391 0343 03A3 0343
+lower
+-------
+ἀς̓
+(1 row)
+
+SELECT lower('ᾼΣͅ' COLLATE PG_UNICODE_FAST); -- 0391 0345 03A3 0345
+lower
+-------
+ᾳςͅ
+(1 row)
+
+-- test !Final_Sigma
+SELECT lower('Σ' COLLATE PG_UNICODE_FAST); -- 03A3
+lower
+-------
+σ
+(1 row)
+
+SELECT lower('0Σ' COLLATE PG_UNICODE_FAST); -- 0030 03A3
+lower
+-------
+0σ
+(1 row)
+
+SELECT lower('ΑΣΑ' COLLATE PG_UNICODE_FAST); -- 0391 03A3 0391
+lower
+-------
+ασα
+(1 row)
+
+SELECT lower('ἈΣ̓Α' COLLATE PG_UNICODE_FAST); -- 0391 0343 03A3 0343 0391
+lower
+-------
+ἀσ̓α
+(1 row)
+
+SELECT lower('ᾼΣͅΑ' COLLATE PG_UNICODE_FAST); -- 0391 0345 03A3 0345 0391
+lower
+-------
+ᾳσͅα
+(1 row)
+
+-- properties
+SELECT 'xyz' ~ '[[:alnum:]]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT 'xyz' !~ '[[:upper:]]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT '@' !~ '[[:alnum:]]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT '=' !~ '[[:punct:]]' COLLATE PG_UNICODE_FAST; -- symbols are not punctuation
+?column?
+----------
+t
+(1 row)
+
+SELECT 'a8a' ~ '[[:digit:]]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT '൧' ~ '\d' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+-- case mapping
+SELECT 'xYz' ~* 'XyZ' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT 'xAb' ~* '[W-Y]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT 'xAb' !~* '[c-d]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT 'Δ' ~* '[γ-λ]' COLLATE PG_UNICODE_FAST;
+?column?
+----------
+t
+(1 row)
+
+SELECT 'δ' ~* '[Γ-Λ]' COLLATE PG_UNICODE_FAST; -- same as above with cases reversed
+?column?
+----------
+t
+(1 row)
+
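
The Final_Sigma tests above exercise a context-sensitive rule: "Σ" lowercases to "ς" when a cased letter precedes it and no cased letter follows it, with case-ignorable characters skipped on both sides; otherwise it lowercases to "σ". The sketch below works on an array of code points and uses two deliberately narrow, hypothetical predicates (basic Latin and basic Greek letters as "cased", U+0300 to U+036F as "case-ignorable"). It is an illustration under those assumptions, not the PostgreSQL or Unicode reference implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy predicate: basic Latin and basic Greek letters count as cased. */
static bool
toy_is_cased(uint32_t cp)
{
    return (cp >= 0x0041u && cp <= 0x005Au) ||
           (cp >= 0x0061u && cp <= 0x007Au) ||
           (cp >= 0x0391u && cp <= 0x03A9u && cp != 0x03A2u) ||
           (cp >= 0x03B1u && cp <= 0x03C9u);
}

/* Toy predicate: combining diacritical marks are case-ignorable. */
static bool
toy_is_case_ignorable(uint32_t cp)
{
    return cp >= 0x0300u && cp <= 0x036Fu;
}

/*
 * Hypothetical helper: lowercase the U+03A3 at position i of s[0..len).
 * Returns U+03C2 (final sigma) when the Final_Sigma condition holds,
 * otherwise U+03C3.
 */
uint32_t
toy_lower_sigma(const uint32_t *s, size_t len, size_t i)
{
    bool   preceded = false;
    size_t j;

    assert(s != NULL && i < len && s[i] == 0x03A3u);

    /* Look backwards, skipping case-ignorable characters. */
    for (j = i; j > 0; j--)
    {
        if (toy_is_case_ignorable(s[j - 1]))
            continue;
        preceded = toy_is_cased(s[j - 1]);
        break;
    }
    if (!preceded)
        return 0x03C3u;

    /* Look forwards, skipping case-ignorable characters. */
    for (j = i + 1; j < len; j++)
    {
        if (toy_is_case_ignorable(s[j]))
            continue;
        if (toy_is_cased(s[j]))
            return 0x03C3u;     /* a cased letter follows: not final */
        break;
    }
    return 0x03C2u;
}
```

Applied to the code point sequences listed in the test comments, the sketch gives the final form for `0391 03A3` and `0391 03A3 0030`, and the non-final form for `03A3`, `0030 03A3`, and `0391 03A3 0391`.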
