Commit 5e1963f

Collations with nondeterministic comparison

This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
  this new flag.

- Many executor nodes and auxiliary code are extended to track
  collations. Previously, this code would just throw away collation
  information, because the eventually-called user-defined functions
  didn't use it since they only cared about equality, which didn't
  need collation information.

- String data type functions that do equality comparisons and hashing
  are changed to take the (non-)deterministic flag into account. For
  comparison, this just means skipping various shortcuts and tie
  breakers that use byte-wise comparison. For hashing, we first need
  to convert the input string to a canonical "sort key" using the ICU
  analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com

1 parent 2ab6d28, commit 5e1963f

69 files changed, +2087 -239 lines changed


‎contrib/bloom/bloom.h

Lines changed: 1 addition & 0 deletions
@@ -137,6 +137,7 @@ typedef struct BloomMetaPageData
 typedef struct BloomState
 {
     FmgrInfo    hashFn[INDEX_MAX_KEYS];
+    Oid         collations[INDEX_MAX_KEYS];
     BloomOptions opts;          /* copy of options on index's metapage */
     int32       nColumns;

‎contrib/bloom/blutils.c

Lines changed: 2 additions & 1 deletion
@@ -163,6 +163,7 @@ initBloomState(BloomState *state, Relation index)
         fmgr_info_copy(&(state->hashFn[i]),
                        index_getprocinfo(index, i + 1, BLOOM_HASH_PROC),
                        CurrentMemoryContext);
+        state->collations[i] = index->rd_indcollation[i];
     }

     /* Initialize amcache if needed with options from metapage */
@@ -267,7 +268,7 @@ signValue(BloomState *state, BloomSignatureWord *sign, Datum value, int attno)
      * different columns will be mapped into different bits because of step
      * above
      */
-    hashVal = DatumGetInt32(FunctionCall1(&state->hashFn[attno], value));
+    hashVal = DatumGetInt32(FunctionCall1Coll(&state->hashFn[attno], state->collations[attno], value));
     mySrand(hashVal ^ myRand());

     for (j = 0; j < state->opts.bitSize[attno]; j++)

‎doc/src/sgml/catalogs.sgml

Lines changed: 7 additions & 0 deletions
@@ -2077,6 +2077,13 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        default, <literal>c</literal> = libc, <literal>i</literal> = icu</entry>
      </row>

+     <row>
+      <entry><structfield>collisdeterministic</structfield></entry>
+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>Is the collation deterministic?</entry>
+     </row>
+
      <row>
       <entry><structfield>collencoding</structfield></entry>
       <entry><type>int4</type></entry>

‎doc/src/sgml/charset.sgml

Lines changed: 56 additions & 5 deletions
@@ -847,11 +847,13 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');

     <para>
      Note that while this system allows creating collations that <quote>ignore
-     case</quote> or <quote>ignore accents</quote> or similar (using
-     the <literal>ks</literal> key), PostgreSQL does not at the moment allow
-     such collations to act in a truly case- or accent-insensitive manner. Any
-     strings that compare equal according to the collation but are not
-     byte-wise equal will be sorted according to their byte values.
+     case</quote> or <quote>ignore accents</quote> or similar (using the
+     <literal>ks</literal> key), in order for such collations to act in a
+     truly case- or accent-insensitive manner, they also need to be declared as not
+     <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
+     see <xref linkend="collation-nondeterministic"/>.
+     Otherwise, any strings that compare equal according to the collation but
+     are not byte-wise equal will be sorted according to their byte values.
     </para>

     <note>
@@ -883,6 +885,55 @@ CREATE COLLATION french FROM "fr-x-icu";
      </para>
     </sect4>
    </sect3>
+
+   <sect3 id="collation-nondeterministic">
+    <title>Nondeterministic Collations</title>
+
+    <para>
+     A collation is either <firstterm>deterministic</firstterm> or
+     <firstterm>nondeterministic</firstterm>. A deterministic collation uses
+     deterministic comparisons, which means that it considers strings to be
+     equal only if they consist of the same byte sequence. Nondeterministic
+     comparison may determine strings to be equal even if they consist of
+     different bytes. Typical situations include case-insensitive comparison,
+     accent-insensitive comparison, as well as comparison of strings in
+     different Unicode normal forms. It is up to the collation provider to
+     actually implement such insensitive comparisons; the deterministic flag
+     only determines whether ties are to be broken using bytewise comparison.
+     See also <ulink url="https://unicode.org/reports/tr10">Unicode Technical
+     Standard 10</ulink> for more information on the terminology.
+    </para>
+
+    <para>
+     To create a nondeterministic collation, specify the property
+     <literal>deterministic = false</literal> to <command>CREATE
+     COLLATION</command>, for example:
+<programlisting>
+CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
+</programlisting>
+     This example would use the standard Unicode collation in a
+     nondeterministic way. In particular, this would allow strings in
+     different normal forms to be compared correctly. More interesting
+     examples make use of the ICU customization facilities explained above.
+     For example:
+<programlisting>
+CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
+CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
+</programlisting>
+    </para>
+
+    <para>
+     All standard and predefined collations are deterministic, all
+     user-defined collations are deterministic by default. While
+     nondeterministic collations give a more <quote>correct</quote> behavior,
+     especially when considering the full power of Unicode and its many
+     special cases, they also have some drawbacks. Foremost, their use leads
+     to a performance penalty. Also, certain operations are not possible with
+     nondeterministic collations, such as pattern matching operations.
+     Therefore, they should be used only in cases where they are specifically
+     wanted.
+    </para>
+   </sect3>
   </sect2>
  </sect1>
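
To make the normal-form case from the new documentation section concrete, here is a minimal C sketch (not part of the commit). It shows that the precomposed and decomposed spellings of "é" are different UTF-8 byte sequences, which is why a deterministic collation, with its byte-wise notion of equality, treats them as unequal. The helper name bytewise_equal and the two constants are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (NFC), in UTF-8. */
static const char E_ACUTE_NFC[] = "\xC3\xA9";

/* "é" as "e" (U+0065) plus combining acute U+0301 (NFD), in UTF-8. */
static const char E_ACUTE_NFD[] = "e\xCC\x81";

/*
 * Hypothetical helper: equality as a deterministic collation sees it,
 * that is, same length and same bytes.
 */
static bool
bytewise_equal(const char *a, const char *b)
{
    size_t      alen;
    size_t      blen;

    assert(a != NULL && b != NULL);
    alen = strlen(a);
    blen = strlen(b);
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Calling bytewise_equal(E_ACUTE_NFC, E_ACUTE_NFD) returns false even though both strings render as the same character; a nondeterministic ICU collation is what allows the two to compare as equal.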

‎doc/src/sgml/citext.sgml

Lines changed: 21 additions & 0 deletions
@@ -14,6 +14,16 @@
   exactly like <type>text</type>.
  </para>

+ <tip>
+  <para>
+   Consider using <firstterm>nondeterministic collations</firstterm> (see
+   <xref linkend="collation-nondeterministic"/>) instead of this module. They
+   can be used for case-insensitive comparisons, accent-insensitive
+   comparisons, and other combinations, and they handle more Unicode special
+   cases correctly.
+  </para>
+ </tip>
+
  <sect2>
   <title>Rationale</title>

@@ -246,6 +256,17 @@ SELECT * FROM users WHERE nick = 'Larry';
       will be invoked instead.
      </para>
     </listitem>
+
+    <listitem>
+     <para>
+      The approach of lower-casing strings for comparison does not handle some
+      Unicode special cases correctly, for example when one upper-case letter
+      has two lower-case letter equivalents. Unicode distinguishes between
+      <firstterm>case mapping</firstterm> and <firstterm>case
+      folding</firstterm> for this reason. Use nondeterministic collations
+      instead of <type>citext</type> to handle that correctly.
+     </para>
+    </listitem>
    </itemizedlist>
   </sect2>
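
The difference between case mapping and case folding mentioned in the new list item can be illustrated with Greek sigma, where the capital letter Σ has two lower-case forms, σ and the word-final ς. The following C sketch is not from the commit; the function names to_lower_cp, fold_cp, and equal_after are hypothetical, and the tables cover only ASCII and sigma rather than full Unicode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAPITAL_SIGMA 0x03A3u   /* GREEK CAPITAL LETTER SIGMA */
#define SMALL_SIGMA   0x03C3u   /* GREEK SMALL LETTER SIGMA */
#define FINAL_SIGMA   0x03C2u   /* GREEK SMALL LETTER FINAL SIGMA */

/* Simple lower-case mapping, restricted to ASCII and capital sigma. */
static uint32_t
to_lower_cp(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    if (cp == CAPITAL_SIGMA)
        return SMALL_SIGMA;
    return cp;
}

/* Case folding: like lower-casing, but final sigma also folds to sigma. */
static uint32_t
fold_cp(uint32_t cp)
{
    if (cp == FINAL_SIGMA)
        return SMALL_SIGMA;
    return to_lower_cp(cp);
}

typedef uint32_t (*map_fn) (uint32_t);

/* Compare two equal-length code point arrays after applying map to each. */
static bool
equal_after(map_fn map, const uint32_t *a, const uint32_t *b, size_t len)
{
    assert(map != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < len; i++)
    {
        if (map(a[i]) != map(b[i]))
            return false;
    }
    return true;
}
```

Comparing Σ with ς through to_lower_cp reports a mismatch, because lower-casing leaves ς unchanged, while fold_cp maps both to σ and reports a match. The lower-casing approach used by citext corresponds to the first behavior.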

‎doc/src/sgml/func.sgml

Lines changed: 6 additions & 0 deletions
@@ -4065,6 +4065,12 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
     </para>
    </caution>

+   <para>
+    The pattern matching operators of all three kinds do not support
+    nondeterministic collations. If required, apply a different collation to
+    the expression to work around this limitation.
+   </para>
+
   <sect2 id="functions-like">
    <title><function>LIKE</function></title>

‎doc/src/sgml/ref/create_collation.sgml

Lines changed: 22 additions & 0 deletions
@@ -23,6 +23,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
     [ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
     [ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
     [ PROVIDER = <replaceable>provider</replaceable>, ]
+    [ DETERMINISTIC = <replaceable>boolean</replaceable>, ]
     [ VERSION = <replaceable>version</replaceable> ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
@@ -124,6 +125,27 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
      </listitem>
     </varlistentry>

+    <varlistentry>
+     <term><literal>DETERMINISTIC</literal></term>
+
+     <listitem>
+      <para>
+       Specifies whether the collation should use deterministic comparisons.
+       The default is true. A deterministic comparison considers strings that
+       are not byte-wise equal to be unequal even if they are considered
+       logically equal by the comparison. PostgreSQL breaks ties using a
+       byte-wise comparison. Comparison that is not deterministic can make the
+       collation be, say, case- or accent-insensitive. For that, you need to
+       choose an appropriate <literal>LC_COLLATE</literal> setting
+       <emphasis>and</emphasis> set the collation to not deterministic here.
+      </para>
+
+      <para>
+       Nondeterministic collations are only supported with the ICU provider.
+      </para>
+     </listitem>
+    </varlistentry>
+
     <varlistentry>
      <term><replaceable>version</replaceable></term>
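
The tie-breaking rule described for DETERMINISTIC can be sketched in C as follows. This is an illustration under stated assumptions, not the PostgreSQL implementation: ci_compare is a hypothetical stand-in for the collation provider (ASCII case-insensitive only, where the real code would call ICU), and collation_compare is a hypothetical wrapper that applies the byte-wise tie-break only when the collation is deterministic.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for the provider's comparison: ASCII case-insensitive only. */
static int
ci_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;; a++, b++)
    {
        int         ca = tolower((unsigned char) *a);
        int         cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == 0)
            return 0;           /* both strings ended without a difference */
    }
}

/*
 * Compare under the collation; if the provider reports a tie and the
 * collation is deterministic, break the tie with a byte-wise comparison.
 */
static int
collation_compare(const char *a, const char *b, bool deterministic)
{
    int         result = ci_compare(a, b);

    if (result == 0 && deterministic)
        result = strcmp(a, b);
    return result;
}
```

With deterministic set to false, collation_compare("Larry", "larry", false) returns 0, so the two strings are equal. With deterministic set to true, the same insensitive provider comparison is followed by strcmp, so the strings are unequal and only their relative sort order reflects the collation.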

‎src/backend/access/hash/hashfunc.c

Lines changed: 89 additions & 11 deletions
@@ -27,8 +27,10 @@
 #include "postgres.h"

 #include "access/hash.h"
+#include "catalog/pg_collation.h"
 #include "utils/builtins.h"
 #include "utils/hashutils.h"
+#include "utils/pg_locale.h"

 /*
  * Datatype-specific hash functions.
@@ -243,15 +245,51 @@ Datum
 hashtext(PG_FUNCTION_ARGS)
 {
     text       *key = PG_GETARG_TEXT_PP(0);
+    Oid         collid = PG_GET_COLLATION();
+    pg_locale_t mylocale = 0;
     Datum       result;

-    /*
-     * Note: this is currently identical in behavior to hashvarlena, but keep
-     * it as a separate function in case we someday want to do something
-     * different in non-C locales. (See also hashbpchar, if so.)
-     */
-    result = hash_any((unsigned char *) VARDATA_ANY(key),
-                      VARSIZE_ANY_EXHDR(key));
+    if (!collid)
+        ereport(ERROR,
+                (errcode(ERRCODE_INDETERMINATE_COLLATION),
+                 errmsg("could not determine which collation to use for string hashing"),
+                 errhint("Use the COLLATE clause to set the collation explicitly.")));
+
+    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
+        mylocale = pg_newlocale_from_collation(collid);
+
+    if (!mylocale || mylocale->deterministic)
+    {
+        result = hash_any((unsigned char *) VARDATA_ANY(key),
+                          VARSIZE_ANY_EXHDR(key));
+    }
+    else
+    {
+#ifdef USE_ICU
+        if (mylocale->provider == COLLPROVIDER_ICU)
+        {
+            int32_t     ulen = -1;
+            UChar      *uchar = NULL;
+            Size        bsize;
+            uint8_t    *buf;
+
+            ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
+
+            bsize = ucol_getSortKey(mylocale->info.icu.ucol,
+                                    uchar, ulen, NULL, 0);
+            buf = palloc(bsize);
+            ucol_getSortKey(mylocale->info.icu.ucol,
+                            uchar, ulen, buf, bsize);
+
+            result = hash_any(buf, bsize);
+
+            pfree(buf);
+        }
+        else
+#endif
+            /* shouldn't happen */
+            elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+    }

     /* Avoid leaking memory for toasted inputs */
     PG_FREE_IF_COPY(key, 0);
@@ -263,12 +301,52 @@ Datum
 hashtextextended(PG_FUNCTION_ARGS)
 {
     text       *key = PG_GETARG_TEXT_PP(0);
+    Oid         collid = PG_GET_COLLATION();
+    pg_locale_t mylocale = 0;
     Datum       result;

-    /* Same approach as hashtext */
-    result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
-                               VARSIZE_ANY_EXHDR(key),
-                               PG_GETARG_INT64(1));
+    if (!collid)
+        ereport(ERROR,
+                (errcode(ERRCODE_INDETERMINATE_COLLATION),
+                 errmsg("could not determine which collation to use for string hashing"),
+                 errhint("Use the COLLATE clause to set the collation explicitly.")));
+
+    if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
+        mylocale = pg_newlocale_from_collation(collid);
+
+    if (!mylocale || mylocale->deterministic)
+    {
+        result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
+                                   VARSIZE_ANY_EXHDR(key),
+                                   PG_GETARG_INT64(1));
+    }
+    else
+    {
+#ifdef USE_ICU
+        if (mylocale->provider == COLLPROVIDER_ICU)
+        {
+            int32_t     ulen = -1;
+            UChar      *uchar = NULL;
+            Size        bsize;
+            uint8_t    *buf;
+
+            ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
+
+            bsize = ucol_getSortKey(mylocale->info.icu.ucol,
+                                    uchar, ulen, NULL, 0);
+            buf = palloc(bsize);
+            ucol_getSortKey(mylocale->info.icu.ucol,
+                            uchar, ulen, buf, bsize);
+
+            result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+
+            pfree(buf);
+        }
+        else
+#endif
+            /* shouldn't happen */
+            elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+    }

     PG_FREE_IF_COPY(key, 0);
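
The hashing change above follows one idea: strings that compare equal under the collation must hash to the same value, so a nondeterministic collation hashes a canonical key instead of the raw bytes. The C sketch below illustrates that idea without ICU and is not the PostgreSQL code. As assumptions for the example, hash_bytes (FNV-1a) stands in for hash_any(), and the hypothetical fold_key (ASCII lower-casing) stands in for ucol_getSortKey(), including its convention of being called once with no buffer to learn the required size and once more to fill the buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a, used here only as a simple stand-in for hash_any(). */
static uint32_t
hash_bytes(const unsigned char *data, size_t len)
{
    uint32_t    h = 2166136261u;

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Hypothetical stand-in for ucol_getSortKey(): writes an ASCII lower-cased
 * copy of src into buf when buf is large enough, and always returns the
 * number of bytes the key requires.
 */
static size_t
fold_key(const char *src, size_t len, unsigned char *buf, size_t bufsize)
{
    if (buf != NULL && bufsize >= len)
    {
        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char) tolower((unsigned char) src[i]);
    }
    return len;
}

/* Hash raw bytes if deterministic, otherwise hash the canonical key. */
static uint32_t
hash_text(const char *s, bool deterministic)
{
    size_t      len;
    size_t      ksize;
    unsigned char *key;
    uint32_t    result;

    assert(s != NULL);
    len = strlen(s);

    if (deterministic)
        return hash_bytes((const unsigned char *) s, len);

    ksize = fold_key(s, len, NULL, 0);      /* first call: size only */
    key = malloc(ksize > 0 ? ksize : 1);
    if (key == NULL)
        abort();
    fold_key(s, len, key, ksize);           /* second call: fill buffer */
    result = hash_bytes(key, ksize);
    free(key);
    return result;
}
```

In this sketch hash_text("abc", false) and hash_text("abC", false) return the same value, which is what hash joins and hash indexes need when the collation treats the two strings as equal. With the deterministic flag set, the raw bytes are hashed directly and the extra allocation is avoided, matching the fast path kept in the diff.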

‎src/backend/access/spgist/spgtextproc.c

Lines changed: 2 additions & 1 deletion
@@ -630,7 +630,8 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
          * query (prefix) string, so we don't need to check it again.
          */
         res = (level >= queryLen) ||
-            DatumGetBool(DirectFunctionCall2(text_starts_with,
+            DatumGetBool(DirectFunctionCall2Coll(text_starts_with,
+                                                 PG_GET_COLLATION(),
                                              out->leafValue,
                                              PointerGetDatum(query)));

‎src/backend/catalog/pg_collation.c

Lines changed: 2 additions & 0 deletions
@@ -46,6 +46,7 @@ Oid
 CollationCreate(const char *collname, Oid collnamespace,
                 Oid collowner,
                 char collprovider,
+                bool collisdeterministic,
                 int32 collencoding,
                 const char *collcollate, const char *collctype,
                 const char *collversion,
@@ -160,6 +161,7 @@ CollationCreate(const char *collname, Oid collnamespace,
     values[Anum_pg_collation_collnamespace - 1] = ObjectIdGetDatum(collnamespace);
     values[Anum_pg_collation_collowner - 1] = ObjectIdGetDatum(collowner);
     values[Anum_pg_collation_collprovider - 1] = CharGetDatum(collprovider);
+    values[Anum_pg_collation_collisdeterministic - 1] = BoolGetDatum(collisdeterministic);
     values[Anum_pg_collation_collencoding - 1] = Int32GetDatum(collencoding);
     namestrcpy(&name_collate, collcollate);
     values[Anum_pg_collation_collcollate - 1] = NameGetDatum(&name_collate);
