Commit 2991ac5

Add SQL functions for Unicode normalization
This adds SQL expressions NORMALIZE() and IS NORMALIZED to convert and
check Unicode normal forms, per SQL standard.

To support fast IS NORMALIZED tests, we pull in a new data file
DerivedNormalizationProps.txt from Unicode and build a lookup table
from that, using techniques similar to ones already used for other
Unicode data. make update-unicode will keep it up to date. We only
build and use these tables for the NFC and NFKC forms, because they
are too big for NFD and NFKD and the improvement is not significant
enough there.

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
1 parent 070c3d3, commit 2991ac5

20 files changed: +6764 -7 lines changed

‎doc/src/sgml/charset.sgml

Lines changed: 10 additions & 0 deletions
@@ -934,6 +934,16 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      such as pattern matching operations. Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
+
+    <tip>
+     <para>
+      To deal with text in different Unicode normalization forms, it is also
+      an option to use the functions/expressions
+      <function>normalize</function> and <literal>is normalized</literal> to
+      preprocess or check the strings, instead of using nondeterministic
+      collations. There are different trade-offs for each approach.
+     </para>
+    </tip>
    </sect3>
   </sect2>
  </sect1>

‎doc/src/sgml/func.sgml

Lines changed: 48 additions & 0 deletions
@@ -1560,6 +1560,30 @@
        <entry><literal>Value: 42</literal></entry>
       </row>
 
+      <row>
+       <entry>
+        <indexterm>
+         <primary>normalized</primary>
+        </indexterm>
+        <indexterm>
+         <primary>Unicode normalization</primary>
+        </indexterm>
+        <literal><parameter>string</parameter> is <optional>not</optional> <optional><parameter>form</parameter></optional> normalized</literal>
+       </entry>
+       <entry><type>boolean</type></entry>
+       <entry>
+        Checks whether the string is in the specified Unicode normalization
+        form. The optional parameter specifies the form:
+        <literal>NFC</literal> (default), <literal>NFD</literal>,
+        <literal>NFKC</literal>, <literal>NFKD</literal>. This expression can
+        only be used if the server encoding is <literal>UTF8</literal>. Note
+        that checking for normalization using this expression is often faster
+        than normalizing possibly already normalized strings.
+       </entry>
+       <entry><literal>U&amp;'\0061\0308bc' IS NFD NORMALIZED</literal></entry>
+       <entry><literal>true</literal></entry>
+      </row>
+
       <row>
        <entry>
         <indexterm>
@@ -1610,6 +1634,30 @@
        <entry><literal>tom</literal></entry>
       </row>
 
+      <row>
+       <entry>
+        <indexterm>
+         <primary>normalize</primary>
+        </indexterm>
+        <indexterm>
+         <primary>Unicode normalization</primary>
+        </indexterm>
+        <literal><function>normalize(<parameter>string</parameter> <type>text</type>
+        <optional>, <parameter>form</parameter> </optional>)</function></literal>
+       </entry>
+       <entry><type>text</type></entry>
+       <entry>
+        Converts the string in the first argument to the specified Unicode
+        normalization form. The optional second argument specifies the form
+        as an identifier: <literal>NFC</literal> (default),
+        <literal>NFD</literal>, <literal>NFKC</literal>,
+        <literal>NFKD</literal>. This function can only be used if the server
+        encoding is <literal>UTF8</literal>.
+       </entry>
+       <entry><literal>normalize(U&amp;'\0061\0308bc', NFC)</literal></entry>
+       <entry><literal>U&amp;'\00E4bc'</literal></entry>
+      </row>
+
       <row>
        <entry>
         <indexterm>

‎src/backend/catalog/sql_features.txt

Lines changed: 1 addition & 1 deletion
@@ -257,7 +257,7 @@ F386	Set identity column generation clause			YES
 F391	Long identifiers			YES
 F392	Unicode escapes in identifiers			YES
 F393	Unicode escapes in literals			YES
-F394	Optional normal form specification			NO
+F394	Optional normal form specification			YES
 F401	Extended joined table			YES
 F401	Extended joined table	01	NATURAL JOIN	YES
 F401	Extended joined table	02	FULL OUTER JOIN	YES

‎src/backend/catalog/system_views.sql

Lines changed: 15 additions & 0 deletions
@@ -1400,6 +1400,21 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+-- default normalization form is NFC, per SQL standard
+CREATE OR REPLACE FUNCTION
+  "normalize"(text, text DEFAULT 'NFC')
+RETURNS text
+LANGUAGE internal
+STRICT IMMUTABLE PARALLEL SAFE
+AS 'unicode_normalize_func';
+
+CREATE OR REPLACE FUNCTION
+  is_normalized(text, text DEFAULT 'NFC')
+RETURNS boolean
+LANGUAGE internal
+STRICT IMMUTABLE PARALLEL SAFE
+AS 'unicode_is_normalized';
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather

‎src/backend/parser/gram.y

Lines changed: 40 additions & 1 deletion
@@ -444,6 +444,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>    substr_list trim_list
 %type <list>    opt_interval interval_second
 %type <node>    overlay_placing substr_from substr_for
+%type <str>     unicode_normal_form
 
 %type <boolean> opt_instead
 %type <boolean> opt_unique opt_concurrently opt_verbose opt_full
@@ -664,7 +665,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
     MAPPING MATCH MATERIALIZED MAXVALUE METHOD MINUTE_P MINVALUE MODE MONTH_P MOVE
 
-    NAME_P NAMES NATIONAL NATURAL NCHAR NEW NEXT NO NONE
+    NAME_P NAMES NATIONAL NATURAL NCHAR NEW NEXT NFC NFD NFKC NFKD NO NONE
+    NORMALIZE NORMALIZED
     NOT NOTHING NOTIFY NOTNULL NOWAIT NULL_P NULLIF
     NULLS_P NUMERIC
 
@@ -13491,6 +13493,22 @@ a_expr: c_expr { $$ = $1; }
                                    list_make1($1), @2),
                                      @2);
                 }
+            | a_expr IS NORMALIZED                              %prec IS
+                {
+                    $$ = (Node *) makeFuncCall(SystemFuncName("is_normalized"), list_make1($1), @2);
+                }
+            | a_expr IS unicode_normal_form NORMALIZED          %prec IS
+                {
+                    $$ = (Node *) makeFuncCall(SystemFuncName("is_normalized"), list_make2($1, makeStringConst($3, @3)), @2);
+                }
+            | a_expr IS NOT NORMALIZED                          %prec IS
+                {
+                    $$ = makeNotExpr((Node *) makeFuncCall(SystemFuncName("is_normalized"), list_make1($1), @2), @2);
+                }
+            | a_expr IS NOT unicode_normal_form NORMALIZED      %prec IS
+                {
+                    $$ = makeNotExpr((Node *) makeFuncCall(SystemFuncName("is_normalized"), list_make2($1, makeStringConst($4, @4)), @2), @2);
+                }
             | DEFAULT
                 {
                     /*
@@ -13934,6 +13952,14 @@ func_expr_common_subexpr:
                 {
                     $$ = (Node *) makeFuncCall(SystemFuncName("date_part"), $3, @1);
                 }
+            | NORMALIZE '(' a_expr ')'
+                {
+                    $$ = (Node *) makeFuncCall(SystemFuncName("normalize"), list_make1($3), @1);
+                }
+            | NORMALIZE '(' a_expr ',' unicode_normal_form ')'
+                {
+                    $$ = (Node *) makeFuncCall(SystemFuncName("normalize"), list_make2($3, makeStringConst($5, @5)), @1);
+                }
             | OVERLAY '(' overlay_list ')'
                 {
                     /* overlay(A PLACING B FROM C FOR D) is converted to
@@ -14569,6 +14595,13 @@ extract_arg:
             | Sconst                                { $$ = $1; }
         ;
 
+unicode_normal_form:
+            NFC                                     { $$ = "nfc"; }
+            | NFD                                   { $$ = "nfd"; }
+            | NFKC                                  { $$ = "nfkc"; }
+            | NFKD                                  { $$ = "nfkd"; }
+        ;
+
 /* OVERLAY() arguments
  * SQL99 defines the OVERLAY() function:
  * o overlay(text placing text from int for int)
@@ -15315,7 +15348,12 @@ unreserved_keyword:
             | NAMES
             | NEW
             | NEXT
+            | NFC
+            | NFD
+            | NFKC
+            | NFKD
             | NO
+            | NORMALIZED
             | NOTHING
             | NOTIFY
             | NOWAIT
@@ -15494,6 +15532,7 @@ col_name_keyword:
             | NATIONAL
             | NCHAR
             | NONE
+            | NORMALIZE
             | NULLIF
             | NUMERIC
             | OUT_P
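
The unicode_normal_form production hands the form to the SQL functions as a lowercase string ("nfc", "nfd", ...), while the function defaults in system_views.sql spell it 'NFC', so the C side has to match names without regard to case. The following is a minimal standalone sketch of that lookup, not the code from this commit: the `NormForm` enum and `norm_form_from_string` are hypothetical names, POSIX `strcasecmp` stands in for PostgreSQL's `pg_strcasecmp`, and an invalid name returns a sentinel instead of raising an error.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>            /* POSIX strcasecmp; the patch uses pg_strcasecmp */

/* Hypothetical enum for this sketch; the real type is declared in common/unicode_norm.h. */
typedef enum
{
    FORM_INVALID = -1,
    FORM_NFC,
    FORM_NFD,
    FORM_NFKC,
    FORM_NFKD
} NormForm;

/* Map a form name to the enum, ignoring case; FORM_INVALID if the name is unknown. */
static NormForm
norm_form_from_string(const char *formstr)
{
    static const struct
    {
        const char *name;
        NormForm    form;
    } forms[] = {
        {"NFC", FORM_NFC},
        {"NFD", FORM_NFD},
        {"NFKC", FORM_NFKC},
        {"NFKD", FORM_NFKD}
    };

    assert(formstr != NULL);

    for (size_t i = 0; i < sizeof(forms) / sizeof(forms[0]); i++)
    {
        if (strcasecmp(formstr, forms[i].name) == 0)
            return forms[i].form;
    }
    return FORM_INVALID;
}
```

With this shape, the grammar's "nfkd" and a user-supplied 'NFKD' resolve to the same value, and anything else falls through to the invalid case, where the server-side function reports an error.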

‎src/backend/utils/adt/varlena.c

Lines changed: 150 additions & 0 deletions
@@ -22,6 +22,7 @@
 #include "catalog/pg_type.h"
 #include "common/hashfn.h"
 #include "common/int.h"
+#include "common/unicode_norm.h"
 #include "lib/hyperloglog.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
@@ -5976,3 +5977,152 @@ rest_of_char_same(const char *s1, const char *s2, int len)
 #include "levenshtein.c"
 #define LEVENSHTEIN_LESS_EQUAL
 #include "levenshtein.c"
+
+
+/*
+ * Unicode support
+ */
+
+static UnicodeNormalizationForm
+unicode_norm_form_from_string(const char *formstr)
+{
+    UnicodeNormalizationForm form = -1;
+
+    /*
+     * Might as well check this while we're here.
+     */
+    if (GetDatabaseEncoding() != PG_UTF8)
+        ereport(ERROR,
+                (errcode(ERRCODE_SYNTAX_ERROR),
+                 errmsg("Unicode normalization can only be performed if server encoding is UTF8")));
+
+    if (pg_strcasecmp(formstr, "NFC") == 0)
+        form = UNICODE_NFC;
+    else if (pg_strcasecmp(formstr, "NFD") == 0)
+        form = UNICODE_NFD;
+    else if (pg_strcasecmp(formstr, "NFKC") == 0)
+        form = UNICODE_NFKC;
+    else if (pg_strcasecmp(formstr, "NFKD") == 0)
+        form = UNICODE_NFKD;
+    else
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("invalid normalization form: %s", formstr)));
+
+    return form;
+}
+
+Datum
+unicode_normalize_func(PG_FUNCTION_ARGS)
+{
+    text       *input = PG_GETARG_TEXT_PP(0);
+    char       *formstr = text_to_cstring(PG_GETARG_TEXT_PP(1));
+    UnicodeNormalizationForm form;
+    int         size;
+    pg_wchar   *input_chars;
+    pg_wchar   *output_chars;
+    unsigned char *p;
+    text       *result;
+    int         i;
+
+    form = unicode_norm_form_from_string(formstr);
+
+    /* convert to pg_wchar */
+    size = pg_mbstrlen_with_len(VARDATA_ANY(input), VARSIZE_ANY_EXHDR(input));
+    input_chars = palloc((size + 1) * sizeof(pg_wchar));
+    p = (unsigned char *) VARDATA_ANY(input);
+    for (i = 0; i < size; i++)
+    {
+        input_chars[i] = utf8_to_unicode(p);
+        p += pg_utf_mblen(p);
+    }
+    input_chars[i] = (pg_wchar) '\0';
+    Assert((char *) p == VARDATA_ANY(input) + VARSIZE_ANY_EXHDR(input));
+
+    /* action */
+    output_chars = unicode_normalize(form, input_chars);
+
+    /* convert back to UTF-8 string */
+    size = 0;
+    for (pg_wchar *wp = output_chars; *wp; wp++)
+    {
+        unsigned char buf[4];
+
+        unicode_to_utf8(*wp, buf);
+        size += pg_utf_mblen(buf);
+    }
+
+    result = palloc(size + VARHDRSZ);
+    SET_VARSIZE(result, size + VARHDRSZ);
+
+    p = (unsigned char *) VARDATA_ANY(result);
+    for (pg_wchar *wp = output_chars; *wp; wp++)
+    {
+        unicode_to_utf8(*wp, p);
+        p += pg_utf_mblen(p);
+    }
+    Assert((char *) p == (char *) result + size + VARHDRSZ);
+
+    PG_RETURN_TEXT_P(result);
+}
+
+/*
+ * Check whether the string is in the specified Unicode normalization form.
+ *
+ * This is done by convering the string to the specified normal form and then
+ * comparing that to the original string. To speed that up, we also apply the
+ * "quick check" algorithm specified in UAX #15, which can give a yes or no
+ * answer for many strings by just scanning the string once.
+ *
+ * This function should generally be optimized for the case where the string
+ * is in fact normalized. In that case, we'll end up looking at the entire
+ * string, so it's probably not worth doing any incremental conversion etc.
+ */
+Datum
+unicode_is_normalized(PG_FUNCTION_ARGS)
+{
+    text       *input = PG_GETARG_TEXT_PP(0);
+    char       *formstr = text_to_cstring(PG_GETARG_TEXT_PP(1));
+    UnicodeNormalizationForm form;
+    int         size;
+    pg_wchar   *input_chars;
+    pg_wchar   *output_chars;
+    unsigned char *p;
+    int         i;
+    UnicodeNormalizationQC quickcheck;
+    int         output_size;
+    bool        result;
+
+    form = unicode_norm_form_from_string(formstr);
+
+    /* convert to pg_wchar */
+    size = pg_mbstrlen_with_len(VARDATA_ANY(input), VARSIZE_ANY_EXHDR(input));
+    input_chars = palloc((size + 1) * sizeof(pg_wchar));
+    p = (unsigned char *) VARDATA_ANY(input);
+    for (i = 0; i < size; i++)
+    {
+        input_chars[i] = utf8_to_unicode(p);
+        p += pg_utf_mblen(p);
+    }
+    input_chars[i] = (pg_wchar) '\0';
+    Assert((char *) p == VARDATA_ANY(input) + VARSIZE_ANY_EXHDR(input));
+
+    /* quick check (see UAX #15) */
+    quickcheck = unicode_is_normalized_quickcheck(form, input_chars);
+    if (quickcheck == UNICODE_NORM_QC_YES)
+        PG_RETURN_BOOL(true);
+    else if (quickcheck == UNICODE_NORM_QC_NO)
+        PG_RETURN_BOOL(false);
+
+    /* normalize and compare with original */
+    output_chars = unicode_normalize(form, input_chars);
+
+    output_size = 0;
+    for (pg_wchar *wp = output_chars; *wp; wp++)
+        output_size++;
+
+    result = (size == output_size) &&
+        (memcmp(input_chars, output_chars, size * sizeof(pg_wchar)) == 0);
+
+    PG_RETURN_BOOL(result);
+}
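
Both functions above first turn the UTF-8 input into an array of code points, and unicode_is_normalized ends by comparing two such arrays by length and content. The sketch below shows those two steps in isolation, assuming valid UTF-8 input. It is not the PostgreSQL implementation: `utf8_seq_len`, `utf8_decode_one`, `utf8_to_codepoints` and `codepoints_equal` are hypothetical helpers standing in for `pg_utf_mblen`, `utf8_to_unicode` and the final `memcmp` test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte length of a UTF-8 sequence, judged from its first byte (valid input assumed). */
static size_t
utf8_seq_len(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Decode the single UTF-8 sequence starting at p into a code point. */
static uint32_t
utf8_decode_one(const unsigned char *p)
{
    size_t      len = utf8_seq_len(p[0]);
    uint32_t    cp;

    if (len == 1)
        return p[0];

    /* keep only the payload bits of the lead byte, then append 6 bits per trailer */
    cp = p[0] & (0xFF >> (len + 1));
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (p[i] & 0x3F);
    return cp;
}

/*
 * Decode nbytes of UTF-8 into out[], which must have room for nbytes entries.
 * Returns the number of code points written.
 */
static size_t
utf8_to_codepoints(const char *s, size_t nbytes, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *) s;
    const unsigned char *end = p + nbytes;
    size_t      n = 0;

    while (p < end)
    {
        out[n++] = utf8_decode_one(p);
        p += utf8_seq_len(p[0]);
    }
    assert(p == end);           /* holds for well-formed input */
    return n;
}

/* Equal only if both arrays have the same length and the same code points. */
static int
codepoints_equal(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    return na == nb && memcmp(a, b, na * sizeof(uint32_t)) == 0;
}
```

Applied to the documentation's example, the decomposed string U+0061 U+0308 "bc" decodes to four code points and the composed string U+00E4 "bc" to three, so the two compare as different even though they render identically. That difference is what IS NORMALIZED detects after normalizing one side.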

‎src/common/unicode/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -3,5 +3,6 @@
 
 # Downloaded files
 /CompositionExclusions.txt
+/DerivedNormalizationProps.txt
 /NormalizationTest.txt
 /UnicodeData.txt

‎src/common/unicode/Makefile

Lines changed: 6 additions & 3 deletions
@@ -18,14 +18,14 @@ LIBS += $(PTHREAD_LIBS)
 # By default, do nothing.
 all:
 
-update-unicode: unicode_norm_table.h unicode_combining_table.h
+update-unicode: unicode_norm_table.h unicode_combining_table.h unicode_normprops_table.h
 	$(MAKE) normalization-check
-	mv unicode_norm_table.h unicode_combining_table.h ../../../src/include/common/
+	mv $^ ../../../src/include/common/
 
 # These files are part of the Unicode Character Database. Download
 # them on demand. The dependency on Makefile.global is for
 # UNICODE_VERSION.
-UnicodeData.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
+UnicodeData.txt DerivedNormalizationProps.txt CompositionExclusions.txt NormalizationTest.txt: $(top_builddir)/src/Makefile.global
 	$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)
 
 # Generation of conversion tables used for string normalization with
@@ -36,6 +36,9 @@ unicode_norm_table.h: generate-unicode_norm_table.pl UnicodeData.txt Composition
 unicode_combining_table.h: generate-unicode_combining_table.pl UnicodeData.txt
 	$(PERL) $^ >$@
 
+unicode_normprops_table.h: generate-unicode_normprops_table.pl DerivedNormalizationProps.txt
+	$(PERL) $^ >$@
+
 # Test suite
 normalization-check: norm_test
 	./norm_test
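
The new unicode_normprops_table.h rule generates the lookup table that the commit message describes for fast IS NORMALIZED tests. The generated header is not shown in this part of the diff, so the following is only a sketch of how such a table might be consulted. The struct layout, the names `NormProp`, `qc_lookup` and `quick_check`, and the five-entry excerpt are assumptions for illustration (the entries are believed to agree with the NFC_QC values in DerivedNormalizationProps.txt but should be treated as sample data). The scan is also simplified: it omits the canonical combining class ordering check that the full UAX #15 quick check performs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>             /* bsearch */

typedef enum
{
    QC_NO = 0,
    QC_YES = 1,
    QC_MAYBE = -1
} QuickCheck;

/* Hypothetical table entry: one code point and its quick-check property. */
typedef struct
{
    uint32_t    codepoint;
    QuickCheck  qc;
} NormProp;

/*
 * Illustrative excerpt, sorted by code point. A generated table would list
 * every code point whose property is not YES; anything absent counts as YES.
 */
static const NormProp nfc_qc_table[] = {
    {0x0300, QC_MAYBE},
    {0x0301, QC_MAYBE},
    {0x0308, QC_MAYBE},
    {0x0340, QC_NO},
    {0x0341, QC_NO}
};

/* bsearch comparator: key is a code point, element is a table entry. */
static int
cmp_normprop(const void *key, const void *elem)
{
    uint32_t    cp = *(const uint32_t *) key;
    const NormProp *p = (const NormProp *) elem;

    if (cp < p->codepoint)
        return -1;
    if (cp > p->codepoint)
        return 1;
    return 0;
}

/* Look up one code point; code points not in the table are QC_YES. */
static QuickCheck
qc_lookup(uint32_t cp)
{
    const NormProp *found = bsearch(&cp, nfc_qc_table,
                                    sizeof(nfc_qc_table) / sizeof(nfc_qc_table[0]),
                                    sizeof(NormProp), cmp_normprop);

    return found ? found->qc : QC_YES;
}

/*
 * Simplified scan over a zero-terminated code point array: any NO decides
 * immediately, any MAYBE downgrades the answer, otherwise the result is YES.
 * The combining-class ordering check from UAX #15 is deliberately left out.
 */
static QuickCheck
quick_check(const uint32_t *s)
{
    QuickCheck  result = QC_YES;

    assert(s != NULL);

    for (; *s; s++)
    {
        QuickCheck  qc = qc_lookup(*s);

        if (qc == QC_NO)
            return QC_NO;
        if (qc == QC_MAYBE)
            result = QC_MAYBE;
    }
    return result;
}
```

A YES or NO result settles the question in a single pass. Only a MAYBE result forces the slower path of normalizing the string and comparing it with the original, which is the fallback unicode_is_normalized takes in varlena.c.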
