Commit ad49994

Add Unicode property tables.

Provide functions to test for Unicode properties, such as Alphabetic or Cased. These functions use tables derived from Unicode data files, similar to the tables for Unicode normalization or general category, and those tables can be updated with the 'update-unicode' build target.

Use Unicode properties to provide functions to test for regex character classes, like 'punct' or 'alnum'.

Infrastructure in preparation for a builtin collation provider, and may also be useful for other callers.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Verite, Peter Eisentraut, Jeremy Schneider
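As an illustration of how a table-driven property test can work, here is a minimal sketch that binary-searches a sorted array of code point ranges. The type `demo_range`, the table `demo_hex_digit`, and the functions `demo_in_ranges` and `demo_prop_hex_digit` are hypothetical names invented for this example; they are not the structures or functions generated by this commit, and the real lookup may be organized differently. The ranges listed are those of the Unicode Hex_Digit property.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type; not the struct used by the generated header. */
typedef struct
{
	uint32_t	first;			/* first code point in the range, inclusive */
	uint32_t	last;			/* last code point in the range, inclusive */
} demo_range;

/* Hex_Digit ranges, sorted by code point and non-overlapping. */
static const demo_range demo_hex_digit[] = {
	{0x0030, 0x0039},			/* 0-9 */
	{0x0041, 0x0046},			/* A-F */
	{0x0061, 0x0066},			/* a-f */
	{0xFF10, 0xFF19},			/* fullwidth 0-9 */
	{0xFF21, 0xFF26},			/* fullwidth A-F */
	{0xFF41, 0xFF46}			/* fullwidth a-f */
};

/* Binary search for the range containing code, if any. */
static bool
demo_in_ranges(uint32_t code, const demo_range *tbl, size_t n)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (code < tbl[mid].first)
			hi = mid;
		else if (code > tbl[mid].last)
			lo = mid + 1;
		else
			return true;
	}
	return false;
}

/* Hypothetical property test built on the table above. */
static bool
demo_prop_hex_digit(uint32_t code)
{
	return demo_in_ranges(code, demo_hex_digit,
						  sizeof(demo_hex_digit) / sizeof(demo_hex_digit[0]));
}
```

With this layout each lookup costs O(log n) comparisons, and regenerating the table for a new Unicode version only changes data, not code.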

1 parent 2ed8f9a, commit ad49994

File tree

8 files changed, +4604 -102 lines changed

src
- common
  - unicode_category.c
  - unicode
- include/common
  - unicode_category.h
  - unicode_category_table.h

`‎src/common/unicode/Makefile‎`

Lines changed: 3 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -29,13 +29,13 @@ update-unicode: unicode_category_table.h unicode_east_asian_fw_table.h unicode_n`
`29`	`29`	`# These files are part of the Unicode Character Database. Download`
`30`	`30`	`# them on demand. The dependency on Makefile.global is for`
`31`	`31`	`# UNICODE_VERSION.`
`32`		`-CompositionExclusions.txt DerivedNormalizationProps.txt EastAsianWidth.txt NormalizationTest.txt UnicodeData.txt: $(top_builddir)/src/Makefile.global`
	`32`	`+CompositionExclusions.txt DerivedCoreProperties.txt DerivedNormalizationProps.txt EastAsianWidth.txt NormalizationTest.txt PropList.txt UnicodeData.txt: $(top_builddir)/src/Makefile.global`
`33`	`33`	`$(DOWNLOAD) https://www.unicode.org/Public/$(UNICODE_VERSION)/ucd/$(@F)`
`34`	`34`
`35`	`35`	`unicode_version.h: generate-unicode_version.pl`
`36`	`36`	`$(PERL) $< --version $(UNICODE_VERSION)`
`37`	`37`
`38`		`-unicode_category_table.h: generate-unicode_category_table.pl UnicodeData.txt`
	`38`	`+unicode_category_table.h: generate-unicode_category_table.pl DerivedCoreProperties.txt PropList.txt UnicodeData.txt`
`39`	`39`	`$(PERL)$<`
`40`	`40`
`41`	`41`	`# Generation of conversion tables used for string normalization with`
`@@ -82,4 +82,4 @@ clean:`
`82`	`82`	`rm -f $(OBJS) category_test category_test.o norm_test norm_test.o`
`83`	`83`
`84`	`84`	`distclean: clean`
`85`		`-rm -f CompositionExclusions.txt DerivedNormalizationProps.txt EastAsianWidth.txt NormalizationTest.txt UnicodeData.txt norm_test_table.h unicode_category_table.h unicode_norm_table.h`
	`85`	`+rm -f CompositionExclusions.txt DerivedCoreProperties.txt DerivedNormalizationProps.txt EastAsianWidth.txt NormalizationTest.txt PropList.txt UnicodeData.txt norm_test_table.h unicode_category_table.h unicode_norm_table.h`

`‎src/common/unicode/README‎`

Lines changed: 35 additions & 10 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,22 +1,35 @@`
`1`		`-This directory contains tools to generate the tables in`
`2`		`-src/include/common/unicode_norm.h, used for Unicode normalization. The`
`3`		`-generated .h file is included in the source tree, so these are normally not`
`4`		`-needed to build PostgreSQL, only if you need to re-generate the .h file`
`5`		`-from the Unicode data files for some reason, e.g. to update to a new version`
`6`		`-of Unicode.`
	`1`	`+This directory contains tools to download new Unicode data files and`
	`2`	`+generate static tables. These tables are used to normalize or`
	`3`	`+determine various properties of Unicode data.`
`7`	`4`
`8`		`-Generating unicode_norm_table.h`
`9`		`--------------------------------`
	`5`	`+The generated header files are copied to src/include/common/, and`
	`6`	`+included in the source tree, so these tools are not normally required`
	`7`	`+to build PostgreSQL.`
`10`	`8`
`11`		`-Run`
	`9`	`+Update Unicode Version`
	`10`	`+----------------------`
	`11`	`+`
	`12`	`+Edit src/Makefile.global.in and src/common/unicode/meson.build`
	`13`	`+to update the UNICODE_VERSION.`
	`14`	`+`
	`15`	`+Then, generate the new header files with:`
`12`	`16`
`13`	`17`	`make update-unicode`
`14`	`18`
`15`		`-from the top level of the source tree and commit the result.`
	`19`	`+or if using meson:`
	`20`	`+`
	`21`	`+ ninja update-unicode`
	`22`	`+`
	`23`	`+from the top level of the source tree. Examine the result to make sure`
	`24`	`+the changes look reasonable (that is, that the diff size and scope is`
	`25`	`+comparable to the Unicode changes since the last update), and then`
	`26`	`+commit it.`
`16`	`27`
`17`	`28`	`Tests`
`18`	`29`	`-----`
`19`	`30`
	`31`	`+Normalization tests:`
	`32`	`+`
`20`	`33`	`The Unicode consortium publishes a comprehensive test suite for the`
`21`	`34`	`normalization algorithm, in a file called NormalizationTest.txt. This`
`22`	`35`	`directory also contains a perl script and some C code, to run our`
`@@ -26,3 +39,15 @@ To download NormalizationTest.txt and run the tests:`
`26`	`39`	`make normalization-check`
`27`	`40`
`28`	`41`	`This is also run as part of the update-unicode target.`
	`42`	`+`
	`43`	`+Category & Property tests:`
	`44`	`+`
	`45`	`+The file category_test.c exhaustively compares the category and`
	`46`	`+properties of each code point as determined by the generated tables`
	`47`	`+with the category and properties as reported by ICU. For this test to`
	`48`	`+be effective, the version of the Unicode data files must be similar to`
	`49`	`+the version of Unicode on which ICU is based, so attempt to match the`
	`50`	`+versions as closely as possible. A mismatched Unicode will skip over`
	`51`	`+codepoints that are assigned in one version and not the other, and may`
	`52`	`+falsely report failures. This test is run as a part of the`
	`53`	`+update-unicode target.`
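The version-mismatch handling described in the README text above can be condensed into one decision function. This is a sketch only: `demo_action` and `demo_classify` are hypothetical names, and the two boolean parameters stand in for the test's checks against `PG_U_UNASSIGNED`. It mirrors the two skip conditions in the updated category_test.c loop.

```c
#include <stdbool.h>

/* Hypothetical result type for illustration only. */
typedef enum
{
	DEMO_COMPARE,				/* compare category, properties and classes */
	DEMO_SKIP_PG_UNASSIGNED,	/* unassigned in the older Postgres tables */
	DEMO_SKIP_ICU_UNASSIGNED	/* unassigned in the older ICU data */
} demo_action;

/*
 * Decide whether a code point can be compared. A code point is skipped only
 * when it is unassigned on the side with the older Unicode version and
 * assigned on the other side.
 */
static demo_action
demo_classify(bool pg_assigned, bool icu_assigned,
			  int pg_version, int icu_version)
{
	if (!pg_assigned && icu_assigned && pg_version < icu_version)
		return DEMO_SKIP_PG_UNASSIGNED;
	if (!icu_assigned && pg_assigned && icu_version < pg_version)
		return DEMO_SKIP_ICU_UNASSIGNED;
	return DEMO_COMPARE;
}
```

When the versions are equal, neither skip branch can fire, so any disagreement about assignment is compared and reported as a failure rather than skipped.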

`‎src/common/unicode/category_test.c‎`

Lines changed: 179 additions & 43 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,6 +1,6 @@`
`1`	`1`	`/*-------------------------------------------------------------------------`
`2`	`2`	`* category_test.c`
`3`		`- * Program to test Unicode general category functions.`
	`3`	`+ * Program to test Unicode general category and character properties.`
`4`	`4`	`*`
`5`	`5`	`* Portions Copyright (c) 2017-2024, PostgreSQL Global Development Group`
`6`	`6`	`*`
`@@ -14,17 +14,23 @@`
`14`	`14`	`#include <stdio.h>`
`15`	`15`	`#include <stdlib.h>`
`16`	`16`	`#include <string.h>`
	`17`	`+#include <wctype.h>`
`17`	`18`
`18`	`19`	`#ifdef USE_ICU`
`19`	`20`	`#include <unicode/uchar.h>`
`20`	`21`	`#endif`
	`22`	`+`
`21`	`23`	`#include "common/unicode_category.h"`
`22`	`24`	`#include "common/unicode_version.h"`
`23`	`25`
	`26`	`+static int pg_unicode_version = 0;`
	`27`	`+#ifdef USE_ICU`
	`28`	`+static int icu_unicode_version = 0;`
	`29`	`+#endif`
	`30`	`+`
`24`	`31`	`/*`
`25`	`32`	`* Parse version into integer for easy comparison.`
`26`	`33`	`*/`
`27`		`-#ifdef USE_ICU`
`28`	`34`	`static int`
`29`	`35`	`parse_unicode_version(const char *version)`
`30`	`36`	`{`
`@@ -39,57 +45,175 @@ parse_unicode_version(const char *version)`
`39`	`45`
`40`	`46`	`return major * 100 + minor;`
`41`	`47`	`}`
`42`		`-#endif`
`43`	`48`
	`49`	`+#ifdef USE_ICU`
`44`	`50`	`/*`
`45`		`- * Exhaustively test that the Unicode category for each codepoint matches that`
`46`		`- * returned by ICU.`
	`51`	`+ * Test Postgres Unicode tables by comparing with ICU. Test the General`
	`52`	`+ * Category, as well as the properties Alphabetic, Lowercase, Uppercase,`
	`53`	`+ * White_Space, and Hex_Digit.`
`47`	`54`	`*/`
`48`		`-int`
`49`		`-main(int argc, char **argv)`
	`55`	`+static void`
	`56`	`+icu_test()`
`50`	`57`	`{`
`51`		`-#ifdef USE_ICU`
`52`		`-int pg_unicode_version = parse_unicode_version(PG_UNICODE_VERSION);`
`53`		`-int icu_unicode_version = parse_unicode_version(U_UNICODE_VERSION);`
	`58`	`+int successful = 0;`
`54`	`59`	`int pg_skipped_codepoints = 0;`
`55`	`60`	`int icu_skipped_codepoints = 0;`
`56`	`61`
`57`		`-printf("category_test: Postgres Unicode version:\t%s\n", PG_UNICODE_VERSION);`
`58`		`-printf("category_test: ICU Unicode version:\t\t%s\n", U_UNICODE_VERSION);`
`59`		`-`
`60`		`-for (UChar32 code = 0; code <= 0x10ffff; code++)`
	`62`	`+for (pg_wchar code = 0; code <= 0x10ffff; code++)`
`61`	`63`	`{`
`62`	`64`	`uint8_t pg_category = unicode_category(code);`
`63`	`65`	`uint8_t icu_category = u_charType(code);`
`64`	`66`
	`67`	`+/* Property tests */`
	`68`	`+bool prop_alphabetic = pg_u_prop_alphabetic(code);`
	`69`	`+bool prop_lowercase = pg_u_prop_lowercase(code);`
	`70`	`+bool prop_uppercase = pg_u_prop_uppercase(code);`
	`71`	`+bool prop_cased = pg_u_prop_cased(code);`
	`72`	`+bool prop_case_ignorable = pg_u_prop_case_ignorable(code);`
	`73`	`+bool prop_white_space = pg_u_prop_white_space(code);`
	`74`	`+bool prop_hex_digit = pg_u_prop_hex_digit(code);`
	`75`	`+bool prop_join_control = pg_u_prop_join_control(code);`
	`76`	`+`
	`77`	`+bool icu_prop_alphabetic = u_hasBinaryProperty(`
	`78`	`+code, UCHAR_ALPHABETIC);`
	`79`	`+bool icu_prop_lowercase = u_hasBinaryProperty(`
	`80`	`+code, UCHAR_LOWERCASE);`
	`81`	`+bool icu_prop_uppercase = u_hasBinaryProperty(`
	`82`	`+code, UCHAR_UPPERCASE);`
	`83`	`+bool icu_prop_cased = u_hasBinaryProperty(`
	`84`	`+code, UCHAR_CASED);`
	`85`	`+bool icu_prop_case_ignorable = u_hasBinaryProperty(`
	`86`	`+code, UCHAR_CASE_IGNORABLE);`
	`87`	`+bool icu_prop_white_space = u_hasBinaryProperty(`
	`88`	`+code, UCHAR_WHITE_SPACE);`
	`89`	`+bool icu_prop_hex_digit = u_hasBinaryProperty(`
	`90`	`+code, UCHAR_HEX_DIGIT);`
	`91`	`+bool icu_prop_join_control = u_hasBinaryProperty(`
	`92`	`+code, UCHAR_JOIN_CONTROL);`
	`93`	`+`
	`94`	`+/*`
	`95`	`+ * Compare with ICU for character classes using:`
	`96`	`+ *`
	`97`	`+ * https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details`
	`98`	`+ *`
	`99`	`+ * which describes how to use ICU to test for membership in regex`
	`100`	`+ * character classes.`
	`101`	`+ *`
	`102`	`+ * NB: the document suggests testing for some properties such as`
	`103`	`+ * UCHAR_POSIX_ALNUM, but that doesn't mean that we're testing for the`
	`104`	`+ * "POSIX Compatible" character classes.`
	`105`	`+ */`
	`106`	`+bool isalpha = pg_u_isalpha(code);`
	`107`	`+bool islower = pg_u_islower(code);`
	`108`	`+bool isupper = pg_u_isupper(code);`
	`109`	`+bool ispunct = pg_u_ispunct(code, false);`
	`110`	`+bool isdigit = pg_u_isdigit(code, false);`
	`111`	`+bool isxdigit = pg_u_isxdigit(code, false);`
	`112`	`+bool isalnum = pg_u_isalnum(code, false);`
	`113`	`+bool isspace = pg_u_isspace(code);`
	`114`	`+bool isblank = pg_u_isblank(code);`
	`115`	`+bool iscntrl = pg_u_iscntrl(code);`
	`116`	`+bool isgraph = pg_u_isgraph(code);`
	`117`	`+bool isprint = pg_u_isprint(code);`
	`118`	`+`
	`119`	`+bool icu_isalpha = u_isUAlphabetic(code);`
	`120`	`+bool icu_islower = u_isULowercase(code);`
	`121`	`+bool icu_isupper = u_isUUppercase(code);`
	`122`	`+bool icu_ispunct = u_ispunct(code);`
	`123`	`+bool icu_isdigit = u_isdigit(code);`
	`124`	`+bool icu_isxdigit = u_hasBinaryProperty(code,`
	`125`	`+UCHAR_POSIX_XDIGIT);`
	`126`	`+bool icu_isalnum = u_hasBinaryProperty(code,`
	`127`	`+UCHAR_POSIX_ALNUM);`
	`128`	`+bool icu_isspace = u_isUWhiteSpace(code);`
	`129`	`+bool icu_isblank = u_isblank(code);`
	`130`	`+bool icu_iscntrl = icu_category == PG_U_CONTROL;`
	`131`	`+bool icu_isgraph = u_hasBinaryProperty(code,`
	`132`	`+UCHAR_POSIX_GRAPH);`
	`133`	`+bool icu_isprint = u_hasBinaryProperty(code,`
	`134`	`+UCHAR_POSIX_PRINT);`
	`135`	`+`
	`136`	`+/*`
	`137`	`+ * A version mismatch means that some assigned codepoints in the newer`
	`138`	`+ * version may be unassigned in the older version. That's OK, though`
	`139`	`+ * the test will not cover those codepoints marked unassigned in the`
	`140`	`+ * older version (that is, it will no longer be an exhaustive test).`
	`141`	`+ */`
	`142`	`+if (pg_category == PG_U_UNASSIGNED &&`
	`143`	`+icu_category != PG_U_UNASSIGNED &&`
	`144`	`+pg_unicode_version < icu_unicode_version)`
	`145`	`+{`
	`146`	`+pg_skipped_codepoints++;`
	`147`	`+continue;`
	`148`	`+}`
	`149`	`+`
	`150`	`+if (icu_category == PG_U_UNASSIGNED &&`
	`151`	`+pg_category != PG_U_UNASSIGNED &&`
	`152`	`+icu_unicode_version < pg_unicode_version)`
	`153`	`+{`
	`154`	`+icu_skipped_codepoints++;`
	`155`	`+continue;`
	`156`	`+}`
	`157`	`+`
`65`	`158`	`if (pg_category != icu_category)`
`66`	`159`	`{`
`67`		`-/*`
`68`		`- * A version mismatch means that some assigned codepoints in the`
`69`		`- * newer version may be unassigned in the older version. That's`
`70`		`- * OK, though the test will not cover those codepoints marked`
`71`		`- * unassigned in the older version (that is, it will no longer be`
`72`		`- * an exhaustive test).`
`73`		`- */`
`74`		`-if (pg_category == PG_U_UNASSIGNED &&`
`75`		`-pg_unicode_version < icu_unicode_version)`
`76`		`-pg_skipped_codepoints++;`
`77`		`-else if (icu_category == PG_U_UNASSIGNED &&`
`78`		`-icu_unicode_version < pg_unicode_version)`
`79`		`-icu_skipped_codepoints++;`
`80`		`-else`
`81`		`-{`
`82`		`-printf("category_test: FAILURE for codepoint 0x%06x\n", code);`
`83`		`-printf("category_test: Postgres category:%02d %s %s\n", pg_category,`
`84`		`-unicode_category_abbrev(pg_category),`
`85`		`-unicode_category_string(pg_category));`
`86`		`-printf("category_test: ICU category:%02d %s %s\n", icu_category,`
`87`		`-unicode_category_abbrev(icu_category),`
`88`		`-unicode_category_string(icu_category));`
`89`		`-printf("\n");`
`90`		`-exit(1);`
`91`		`-}`
	`160`	`+printf("category_test: FAILURE for codepoint 0x%06x\n", code);`
	`161`	`+printf("category_test: Postgres category:%02d %s %s\n", pg_category,`
	`162`	`+unicode_category_abbrev(pg_category),`
	`163`	`+unicode_category_string(pg_category));`
	`164`	`+printf("category_test: ICU category:%02d %s %s\n", icu_category,`
	`165`	`+unicode_category_abbrev(icu_category),`
	`166`	`+unicode_category_string(icu_category));`
	`167`	`+printf("\n");`
	`168`	`+exit(1);`
	`169`	`+}`
	`170`	`+`
	`171`	`+if (prop_alphabetic != icu_prop_alphabetic ||`
	`172`	`+prop_lowercase != icu_prop_lowercase ||`
	`173`	`+prop_uppercase != icu_prop_uppercase ||`
	`174`	`+prop_cased != icu_prop_cased ||`
	`175`	`+prop_case_ignorable != icu_prop_case_ignorable ||`
	`176`	`+prop_white_space != icu_prop_white_space ||`
	`177`	`+prop_hex_digit != icu_prop_hex_digit ||`
	`178`	`+prop_join_control != icu_prop_join_control)`
	`179`	`+{`
	`180`	`+printf("category_test: FAILURE for codepoint 0x%06x\n", code);`
	`181`	`+printf("category_test: Postgres property alphabetic/lowercase/uppercase/cased/case_ignorable/white_space/hex_digit/join_control: %d/%d/%d/%d/%d/%d/%d/%d\n",`
	`182`	`+prop_alphabetic, prop_lowercase, prop_uppercase,`
	`183`	`+prop_cased, prop_case_ignorable,`
	`184`	`+prop_white_space, prop_hex_digit, prop_join_control);`
	`185`	`+printf("category_test: ICU property alphabetic/lowercase/uppercase/cased/case_ignorable/white_space/hex_digit/join_control: %d/%d/%d/%d/%d/%d/%d/%d\n",`
	`186`	`+icu_prop_alphabetic, icu_prop_lowercase, icu_prop_uppercase,`
	`187`	`+icu_prop_cased, icu_prop_case_ignorable,`
	`188`	`+icu_prop_white_space, icu_prop_hex_digit, icu_prop_join_control);`
	`189`	`+printf("\n");`
	`190`	`+exit(1);`
`92`	`191`	`}`
	`192`	`+`
	`193`	`+if (isalpha != icu_isalpha ||`
	`194`	`+islower != icu_islower ||`
	`195`	`+isupper != icu_isupper ||`
	`196`	`+ispunct != icu_ispunct ||`
	`197`	`+isdigit != icu_isdigit ||`
	`198`	`+isxdigit != icu_isxdigit ||`
	`199`	`+isalnum != icu_isalnum ||`
	`200`	`+isspace != icu_isspace ||`
	`201`	`+isblank != icu_isblank ||`
	`202`	`+iscntrl != icu_iscntrl ||`
	`203`	`+isgraph != icu_isgraph ||`
	`204`	`+isprint != icu_isprint)`
	`205`	`+{`
	`206`	`+printf("category_test: FAILURE for codepoint 0x%06x\n", code);`
	`207`	`+printf("category_test: Postgres class alpha/lower/upper/punct/digit/xdigit/alnum/space/blank/cntrl/graph/print: %d/%d/%d/%d/%d/%d/%d/%d/%d/%d/%d/%d\n",`
	`208`	`+isalpha, islower, isupper, ispunct, isdigit, isxdigit, isalnum, isspace, isblank, iscntrl, isgraph, isprint);`
	`209`	`+printf("category_test: ICU class alpha/lower/upper/punct/digit/xdigit/alnum/space/blank/cntrl/graph/print: %d/%d/%d/%d/%d/%d/%d/%d/%d/%d/%d/%d\n",`
	`210`	`+icu_isalpha, icu_islower, icu_isupper, icu_ispunct, icu_isdigit, icu_isxdigit, icu_isalnum, icu_isspace, icu_isblank, icu_iscntrl, icu_isgraph, icu_isprint);`
	`211`	`+printf("\n");`
	`212`	`+exit(1);`
	`213`	`+}`
	`214`	`+`
	`215`	`+if (pg_category != PG_U_UNASSIGNED)`
	`216`	`+successful++;`
`93`	`217`	`}`
`94`	`218`
`95`	`219`	`if (pg_skipped_codepoints > 0)`
`@@ -99,10 +223,22 @@ main(int argc, char **argv)`
`99`	`223`	`printf("category_test: skipped %d codepoints unassigned in ICU due to Unicode version mismatch\n",`
`100`	`224`	`icu_skipped_codepoints);`
`101`	`225`
`102`		`-printf("category_test: success\n");`
`103`		`-exit(0);`
	`226`	`+printf("category_test: ICU test: %d codepoints successful\n", successful);`
	`227`	`+}`
	`228`	`+#endif`
	`229`	`+`
	`230`	`+int`
	`231`	`+main(int argc, char **argv)`
	`232`	`+{`
	`233`	`+pg_unicode_version = parse_unicode_version(PG_UNICODE_VERSION);`
	`234`	`+printf("category_test: Postgres Unicode version:\t%s\n", PG_UNICODE_VERSION);`
	`235`	`+`
	`236`	`+#ifdef USE_ICU`
	`237`	`+icu_unicode_version = parse_unicode_version(U_UNICODE_VERSION);`
	`238`	`+printf("category_test: ICU Unicode version:\t\t%s\n", U_UNICODE_VERSION);`
	`239`	`+`
	`240`	`+icu_test();`
`104`	`241`	`#else`
`105`		`-printf("category_test: ICU support required for test; skipping\n");`
`106`		`-exit(0);`
	`242`	`+printf("category_test: ICU not available; skipping\n");`
`107`	`243`	`#endif`
`108`	`244`	`}`
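The diff shows only the signature of `parse_unicode_version` and its final `return major * 100 + minor;`; the body lies outside the displayed hunks. The sketch below is one way such a function could be written, not the code from the commit. The name `demo_parse_unicode_version`, the use of `sscanf`, and the `-1` error return are assumptions made for illustration.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for parse_unicode_version(): turn a version string
 * such as "15.1.0" into a single integer (major * 100 + minor) so that two
 * versions can be compared with < and >. Assumes the minor number is below
 * 100. Returns -1 if no major version can be parsed.
 */
static int
demo_parse_unicode_version(const char *version)
{
	int			major = 0;
	int			minor = 0;
	int			n = sscanf(version, "%d.%d", &major, &minor);

	if (n < 1)
		return -1;				/* no leading integer found */
	if (n == 1)
		minor = 0;				/* "15" is treated as "15.0" */

	return major * 100 + minor;
}
```

Encoding the version as one integer keeps the skip conditions in the comparison loop to a plain `<` between `pg_unicode_version` and `icu_unicode_version`.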
