NotificationsYou must be signed in to change notification settings
Fork6
Star31

Commita02b37f

committed

Additional unicode primitive functions.

Introduce unicode_version(), icu_unicode_version(), andunicode_assigned().The latter requires introducing a new lookup table for the UnicodeGeneral Category, which is generated along with the other Unicodelookup tables.Discussion:https://postgr.es/m/CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.comReviewed-by: Peter Eisentraut

1 parent7021d3b commita02b37fCopy full SHA for a02b37f

File tree

18 files changed

+4924

-22

lines changed

doc/src/sgml
- func.sgml
src
- backend/utils/adt
  - varlena.c
- common
- include
  - catalog
    - pg_proc.dat
  - common
- test
  - icu/t
    - 010_database.pl
  - regress
    - expected
      - unicode.out
    - sql
      - unicode.sql

18 files changed

+4924

-22

lines changed

`‎doc/src/sgml/func.sgml‎`

Lines changed: 90 additions & 19 deletions

Original file line number	Diff line number	Diff line change
`@@ -2859,6 +2859,22 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>`
`2859`	`2859`	`</para></entry>`
`2860`	`2860`	`</row>`
`2861`	`2861`
	`2862`	`+ <row>`
	`2863`	`+ <entry role="func_table_entry"><para role="func_signature">`
	`2864`	`+ <indexterm>`
	`2865`	`+ <primary>unicode_assigned</primary>`
	`2866`	`+ </indexterm>`
	`2867`	`+ <function>unicode_assigned</function> ( <type>text</type> )`
	`2868`	`+ <returnvalue>text</returnvalue>`
	`2869`	`+ </para>`
	`2870`	`+ <para>`
	`2871`	`+ Returns <literal>true</literal> if all characters in the string are`
	`2872`	`+ assigned Unicode codepoints; <literal>false</literal> otherwise. This`
	`2873`	`+ function can only be used when the server encoding is`
	`2874`	`+ <literal>UTF8</literal>.`
	`2875`	`+ </para></entry>`
	`2876`	`+ </row>`
	`2877`	`+`
`2862`	`2878`	`<row>`
`2863`	`2879`	`<entry role="func_table_entry"><para role="func_signature">`
`2864`	`2880`	`<indexterm>`
`@@ -23427,25 +23443,6 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);`
`23427`	`23443`	`This is equivalent to <function>current_user</function>.`
`23428`	`23444`	`</para></entry>`
`23429`	`23445`	`</row>`
`23430`		`-`
`23431`		`- <row>`
`23432`		`- <entry role="func_table_entry"><para role="func_signature">`
`23433`		`- <indexterm>`
`23434`		`- <primary>version</primary>`
`23435`		`- </indexterm>`
`23436`		`- <function>version</function> ()`
`23437`		`- <returnvalue>text</returnvalue>`
`23438`		`- </para>`
`23439`		`- <para>`
`23440`		`- Returns a string describing the <productname>PostgreSQL</productname>`
`23441`		`- server's version. You can also get this information from`
`23442`		`- <xref linkend="guc-server-version"/>, or for a machine-readable`
`23443`		`- version use <xref linkend="guc-server-version-num"/>. Software`
`23444`		`- developers should use <varname>server_version_num</varname> (available`
`23445`		`- since 8.2) or <xref linkend="libpq-PQserverVersion"/> instead of`
`23446`		`- parsing the text version.`
`23447`		`- </para></entry>`
`23448`		`- </row>`
`23449`	`23446`	`</tbody>`
`23450`	`23447`	`</tgroup>`
`23451`	`23448`	`</table>`
`@@ -26332,6 +26329,80 @@ SELECT collation for ('foo' COLLATE "de_DE");`
`26332`	`26329`
`26333`	`26330`	`</sect2>`
`26334`	`26331`
	`26332`	`+ <sect2 id="functions-info-version">`
	`26333`	`+ <title>Version Information Functions</title>`
	`26334`	`+`
	`26335`	`+ <para>`
	`26336`	`+ The functions shown in <xref linkend="functions-version"/>`
	`26337`	`+ print version information.`
	`26338`	`+ </para>`
	`26339`	`+`
	`26340`	`+ <table id="functions-version">`
	`26341`	`+ <title>Version Information Functions</title>`
	`26342`	`+ <tgroup cols="1">`
	`26343`	`+ <thead>`
	`26344`	`+ <row>`
	`26345`	`+ <entry role="func_table_entry"><para role="func_signature">`
	`26346`	`+ Function`
	`26347`	`+ </para>`
	`26348`	`+ <para>`
	`26349`	`+ Description`
	`26350`	`+ </para></entry>`
	`26351`	`+ </row>`
	`26352`	`+ </thead>`
	`26353`	`+`
	`26354`	`+ <tbody>`
	`26355`	`+ <row>`
	`26356`	`+ <entry role="func_table_entry"><para role="func_signature">`
	`26357`	`+ <indexterm>`
	`26358`	`+ <primary>version</primary>`
	`26359`	`+ </indexterm>`
	`26360`	`+ <function>version</function> ()`
	`26361`	`+ <returnvalue>text</returnvalue>`
	`26362`	`+ </para>`
	`26363`	`+ <para>`
	`26364`	`+ Returns a string describing the <productname>PostgreSQL</productname>`
	`26365`	`+ server's version. You can also get this information from`
	`26366`	`+ <xref linkend="guc-server-version"/>, or for a machine-readable`
	`26367`	`+ version use <xref linkend="guc-server-version-num"/>. Software`
	`26368`	`+ developers should use <varname>server_version_num</varname> (available`
	`26369`	`+ since 8.2) or <xref linkend="libpq-PQserverVersion"/> instead of`
	`26370`	`+ parsing the text version.`
	`26371`	`+ </para></entry>`
	`26372`	`+ </row>`
	`26373`	`+`
	`26374`	`+ <row>`
	`26375`	`+ <entry role="func_table_entry"><para role="func_signature">`
	`26376`	`+ <indexterm>`
	`26377`	`+ <primary>unicode_version</primary>`
	`26378`	`+ </indexterm>`
	`26379`	`+ <function>unicode_version</function> ()`
	`26380`	`+ <returnvalue>text</returnvalue>`
	`26381`	`+ </para>`
	`26382`	`+ <para>`
	`26383`	`+ Returns a string representing the version of Unicode used by`
	`26384`	`+ <productname>PostgreSQL</productname>.`
	`26385`	`+ </para></entry>`
	`26386`	`+ </row>`
	`26387`	`+ <row>`
	`26388`	`+ <entry role="func_table_entry"><para role="func_signature">`
	`26389`	`+ <indexterm>`
	`26390`	`+ <primary>icu_unicode_version</primary>`
	`26391`	`+ </indexterm>`
	`26392`	`+ <function>icu_unicode_version</function> ()`
	`26393`	`+ <returnvalue>text</returnvalue>`
	`26394`	`+ </para>`
	`26395`	`+ <para>`
	`26396`	`+ Returns a string representing the version of Unicode used by ICU, if`
	`26397`	`+ the server was built with ICU support; otherwise returns`
	`26398`	`+ <literal>NULL</literal> </para></entry>`
	`26399`	`+ </row>`
	`26400`	`+ </tbody>`
	`26401`	`+ </tgroup>`
	`26402`	`+ </table>`
	`26403`	`+`
	`26404`	`+ </sect2>`
	`26405`	`+`
`26335`	`26406`	`</sect1>`
`26336`	`26407`
`26337`	`26408`	`<sect1 id="functions-admin">`

`‎src/backend/utils/adt/varlena.c‎`

Lines changed: 61 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -23,7 +23,9 @@`
`23`	`23`	`#include"catalog/pg_type.h"`
`24`	`24`	`#include"common/hashfn.h"`
`25`	`25`	`#include"common/int.h"`
	`26`	`+#include"common/unicode_category.h"`
`26`	`27`	`#include"common/unicode_norm.h"`
	`28`	`+#include"common/unicode_version.h"`
`27`	`29`	`#include"funcapi.h"`
`28`	`30`	`#include"lib/hyperloglog.h"`
`29`	`31`	`#include"libpq/pqformat.h"`
`@@ -6237,6 +6239,65 @@ unicode_norm_form_from_string(const char *formstr)`
`6237`	`6239`	`returnform;`
`6238`	`6240`	`}`
`6239`	`6241`
	`6242`	`+/*`
	`6243`	`+ * Returns version of Unicode used by Postgres in "major.minor" format (the`
	`6244`	`+ * same format as the Unicode version reported by ICU). The third component`
	`6245`	`+ * ("update version") never involves additions to the character repertiore and`
	`6246`	`+ * is unimportant for most purposes.`
	`6247`	`+ *`
	`6248`	`+ * See: https://unicode.org/versions/`
	`6249`	`+ */`
	`6250`	`+Datum`
	`6251`	`+unicode_version(PG_FUNCTION_ARGS)`
	`6252`	`+{`
	`6253`	`+PG_RETURN_TEXT_P(cstring_to_text(PG_UNICODE_VERSION));`
	`6254`	`+}`
	`6255`	`+`
	`6256`	`+/*`
	`6257`	`+ * Returns version of Unicode used by ICU, if enabled; otherwise NULL.`
	`6258`	`+ */`
	`6259`	`+Datum`
	`6260`	`+icu_unicode_version(PG_FUNCTION_ARGS)`
	`6261`	`+{`
	`6262`	`+#ifdefUSE_ICU`
	`6263`	`+PG_RETURN_TEXT_P(cstring_to_text(U_UNICODE_VERSION));`
	`6264`	`+#else`
	`6265`	`+PG_RETURN_NULL();`
	`6266`	`+#endif`
	`6267`	`+}`
	`6268`	`+`
	`6269`	`+/*`
	`6270`	`+ * Check whether the string contains only assigned Unicode code`
	`6271`	`+ * points. Requires that the database encoding is UTF-8.`
	`6272`	`+ */`
	`6273`	`+Datum`
	`6274`	`+unicode_assigned(PG_FUNCTION_ARGS)`
	`6275`	`+{`
	`6276`	`+text*input=PG_GETARG_TEXT_PP(0);`
	`6277`	`+unsignedchar*p;`
	`6278`	`+intsize;`
	`6279`	`+`
	`6280`	`+if (GetDatabaseEncoding()!=PG_UTF8)`
	`6281`	`+ereport(ERROR,`
	`6282`	`+(errmsg("Unicode categorization can only be performed if server encoding is UTF8")));`
	`6283`	`+`
	`6284`	`+/* convert to pg_wchar */`
	`6285`	`+size=pg_mbstrlen_with_len(VARDATA_ANY(input),VARSIZE_ANY_EXHDR(input));`
	`6286`	`+p= (unsignedchar*)VARDATA_ANY(input);`
	`6287`	`+for (inti=0;i<size;i++)`
	`6288`	`+{`
	`6289`	`+pg_wcharuchar=utf8_to_unicode(p);`
	`6290`	`+intcategory=unicode_category(uchar);`
	`6291`	`+`
	`6292`	`+if (category==PG_U_UNASSIGNED)`
	`6293`	`+PG_RETURN_BOOL(false);`
	`6294`	`+`
	`6295`	`+p+=pg_utf_mblen(p);`
	`6296`	`+}`
	`6297`	`+`
	`6298`	`+PG_RETURN_BOOL(true);`
	`6299`	`+}`
	`6300`	`+`
`6240`	`6301`	`Datum`
`6241`	`6302`	`unicode_normalize_func(PG_FUNCTION_ARGS)`
`6242`	`6303`	`{`

`‎src/common/Makefile‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -78,6 +78,7 @@ OBJS_COMMON = \`
`78`	`78`	`scram-common.o\`
`79`	`79`	`string.o\`
`80`	`80`	`stringinfo.o\`
	`81`	`+unicode_category.o\`
`81`	`82`	`unicode_norm.o\`
`82`	`83`	`username.o\`
`83`	`84`	`wait_error.o\`

`‎src/common/meson.build‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -30,6 +30,7 @@ common_sources = files(`
`30`	`30`	`'scram-common.c',`
`31`	`31`	`'string.c',`
`32`	`32`	`'stringinfo.c',`
	`33`	`+'unicode_category.c',`
`33`	`34`	`'unicode_norm.c',`
`34`	`35`	`'username.c',`
`35`	`36`	`'wait_error.c',`

`‎src/common/unicode/Makefile‎`

Lines changed: 17 additions & 2 deletions

Original file line number	Diff line number	Diff line change
`@@ -15,11 +15,15 @@ include $(top_builddir)/src/Makefile.global`
`15`	`15`	`overrideCPPFLAGS := -DFRONTEND -I.$(CPPFLAGS)`
`16`	`16`	`LIBS +=$(PTHREAD_LIBS)`
`17`	`17`
	`18`	`+LDFLAGS_INTERNAL +=$(ICU_LIBS)`
	`19`	`+CPPFLAGS +=$(ICU_CFLAGS)`
	`20`	`+`
`18`	`21`	`# By default, do nothing.`
`19`	`22`	`all:`
`20`	`23`
`21`		`-update-unicode: unicode_norm_table.h unicode_nonspacing_table.h unicode_east_asian_fw_table.h unicode_normprops_table.h unicode_norm_hashfunc.h`
	`24`	`+update-unicode:unicode_category_table.hunicode_norm_table.h unicode_nonspacing_table.h unicode_east_asian_fw_table.h unicode_normprops_table.h unicode_norm_hashfunc.h unicode_version.h`
`22`	`25`	`mv$^$(top_srcdir)/src/include/common/`
	`26`	`+$(MAKE) category-check`
`23`	`27`	`$(MAKE) normalization-check`
`24`	`28`
`25`	`29`	`# These files are part of the Unicode Character Database. Download`
`@@ -28,6 +32,12 @@ update-unicode: unicode_norm_table.h unicode_nonspacing_table.h unicode_east_asi`
`28`	`32`	`UnicodeData.txtEastAsianWidth.txtDerivedNormalizationProps.txtCompositionExclusions.txtNormalizationTest.txt:$(top_builddir)/src/Makefile.global`
`29`	`33`	`$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)`
`30`	`34`
	`35`	`+unicode_version.h: generate-unicode_version.pl`
	`36`	`+$(PERL)$< --version$(UNICODE_VERSION)`
	`37`	`+`
	`38`	`+unicode_category_table.h: generate-unicode_category_table.pl UnicodeData.txt`
	`39`	`+$(PERL)$<`
	`40`	`+`
`31`	`41`	`# Generation of conversion tables used for string normalization with`
`32`	`42`	`# UTF-8 strings.`
`33`	`43`	`unicode_norm_hashfunc.h: unicode_norm_table.h`
`@@ -45,9 +55,14 @@ unicode_normprops_table.h: generate-unicode_normprops_table.pl DerivedNormalizat`
`45`	`55`	`$(PERL)$^>$@`
`46`	`56`
`47`	`57`	`# Test suite`
	`58`	`+category-check: category_test`
	`59`	`+./category_test`
	`60`	`+`
`48`	`61`	`normalization-check: norm_test`
`49`	`62`	`./norm_test`
`50`	`63`
	`64`	`+category_test: category_test.o ../unicode_category.o \| submake-common`
	`65`	`+`
`51`	`66`	`norm_test: norm_test.o ../unicode_norm.o \| submake-common`
`52`	`67`
`53`	`68`	`norm_test.o: norm_test_table.h`
`@@ -64,7 +79,7 @@ norm_test_table.h: generate-norm_test_table.pl NormalizationTest.txt`
`64`	`79`
`65`	`80`
`66`	`81`	`clean:`
`67`		`-rm -f$(OBJS) norm_test norm_test.o`
	`82`	`+rm -f$(OBJS)category_test category_test.onorm_test norm_test.o`
`68`	`83`
`69`	`84`	`distclean: clean`
`70`	`85`	`rm -f UnicodeData.txt EastAsianWidth.txt CompositionExclusions.txt NormalizationTest.txt norm_test_table.h unicode_norm_table.h`

`‎src/common/unicode/category_test.c‎`

Lines changed: 108 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,108 @@`
	`1`	`+/*-------------------------------------------------------------------------`
	`2`	`+ * category_test.c`
	`3`	`+ *Program to test Unicode general category functions.`
	`4`	`+ *`
	`5`	`+ * Portions Copyright (c) 2017-2023, PostgreSQL Global Development Group`
	`6`	`+ *`
	`7`	`+ * IDENTIFICATION`
	`8`	`+ * src/common/unicode/category_test.c`
	`9`	`+ *`
	`10`	`+ *-------------------------------------------------------------------------`
	`11`	`+ */`
	`12`	`+#include"postgres_fe.h"`
	`13`	`+`
	`14`	`+#include<stdio.h>`
	`15`	`+#include<stdlib.h>`
	`16`	`+#include<string.h>`
	`17`	`+`
	`18`	`+#ifdefUSE_ICU`
	`19`	`+#include<unicode/uchar.h>`
	`20`	`+#endif`
	`21`	`+#include"common/unicode_category.h"`
	`22`	`+#include"common/unicode_version.h"`
	`23`	`+`
	`24`	`+/*`
	`25`	`+ * Parse version into integer for easy comparison.`
	`26`	`+ */`
	`27`	`+#ifdefUSE_ICU`
	`28`	`+staticint`
	`29`	`+parse_unicode_version(constchar*version)`
	`30`	`+{`
	`31`	`+intn,`
	`32`	`+major,`
	`33`	`+minor;`
	`34`	`+`
	`35`	`+n=sscanf(version,"%d.%d",&major,&minor);`
	`36`	`+`
	`37`	`+Assert(n==2);`
	`38`	`+Assert(minor<100);`
	`39`	`+`
	`40`	`+returnmajor*100+minor;`
	`41`	`+}`
	`42`	`+#endif`
	`43`	`+`
	`44`	`+/*`
	`45`	`+ * Exhaustively test that the Unicode category for each codepoint matches that`
	`46`	`+ * returned by ICU.`
	`47`	`+ */`
	`48`	`+int`
	`49`	`+main(intargc,char**argv)`
	`50`	`+{`
	`51`	`+#ifdefUSE_ICU`
	`52`	`+intpg_unicode_version=parse_unicode_version(PG_UNICODE_VERSION);`
	`53`	`+inticu_unicode_version=parse_unicode_version(U_UNICODE_VERSION);`
	`54`	`+intpg_skipped_codepoints=0;`
	`55`	`+inticu_skipped_codepoints=0;`
	`56`	`+`
	`57`	`+printf("Postgres Unicode Version:\t%s\n",PG_UNICODE_VERSION);`
	`58`	`+printf("ICU Unicode Version:\t\t%s\n",U_UNICODE_VERSION);`
	`59`	`+`
	`60`	`+for (UChar32code=0;code <=0x10ffff;code++)`
	`61`	`+{`
	`62`	`+uint8_tpg_category=unicode_category(code);`
	`63`	`+uint8_ticu_category=u_charType(code);`
	`64`	`+`
	`65`	`+if (pg_category!=icu_category)`
	`66`	`+{`
	`67`	`+/*`
	`68`	`+ * A version mismatch means that some assigned codepoints in the`
	`69`	`+ * newer version may be unassigned in the older version. That's`
	`70`	`+ * OK, though the test will not cover those codepoints marked`
	`71`	`+ * unassigned in the older version (that is, it will no longer be`
	`72`	`+ * an exhaustive test).`
	`73`	`+ */`
	`74`	`+if (pg_category==PG_U_UNASSIGNED&&`
	`75`	`+pg_unicode_version<icu_unicode_version)`
	`76`	`+pg_skipped_codepoints++;`
	`77`	`+elseif (icu_category==PG_U_UNASSIGNED&&`
	`78`	`+icu_unicode_version<pg_unicode_version)`
	`79`	`+icu_skipped_codepoints++;`
	`80`	`+else`
	`81`	`+{`
	`82`	`+printf("FAILURE for codepoint %06x\n",code);`
	`83`	`+printf("Postgres category:%02d %s %s\n",pg_category,`
	`84`	`+unicode_category_abbrev(pg_category),`
	`85`	`+unicode_category_string(pg_category));`
	`86`	`+printf("ICU category:%02d %s %s\n",icu_category,`
	`87`	`+unicode_category_abbrev(icu_category),`
	`88`	`+unicode_category_string(icu_category));`
	`89`	`+printf("\n");`
	`90`	`+exit(1);`
	`91`	`+}`
	`92`	`+}`
	`93`	`+}`
	`94`	`+`
	`95`	`+if (pg_skipped_codepoints>0)`
	`96`	`+printf("Skipped %d codepoints unassigned in Postgres due to Unicode version mismatch.\n",`
	`97`	`+pg_skipped_codepoints);`
	`98`	`+if (icu_skipped_codepoints>0)`
	`99`	`+printf("Skipped %d codepoints unassigned in ICU due to Unicode version mismatch.\n",`
	`100`	`+icu_skipped_codepoints);`
	`101`	`+`
	`102`	`+printf("category_test: All tests successful!\n");`
	`103`	`+exit(0);`
	`104`	`+#else`
	`105`	`+printf("ICU support required for test; skipping.\n");`
	`106`	`+exit(0);`
	`107`	`+#endif`
	`108`	`+}`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commita02b37f

File tree

18 files changed

18 files changed

`‎doc/src/sgml/func.sgml‎`

`‎src/backend/utils/adt/varlena.c‎`

`‎src/common/Makefile‎`

`‎src/common/meson.build‎`

`‎src/common/unicode/Makefile‎`

`‎src/common/unicode/category_test.c‎`

0 commit comments