postgrespro/postgres (Public)

forked from postgres/postgres


Commit 4b7d9e5

michaelpq authored and pull[bot] committed

unaccent: Add support for quoted translated characters

As reported in bug #18057, the extension unaccent removes in its rule file whitespace characters that are intentionally specified when building unaccent.rules from UnicodeData.txt, causing an incorrect translation for some characters like numeric symbols. This is caused by the fact that all whitespaces before and after the origin and target characters are all discarded (this limitation is documented).

This commit makes possible the use of quotes around target characters, so as whitespaces can be considered part of target characters. Some target characters use a double quote, these require an extra double quote.

The documentation is updated to show how to use quoted areas, generate_unaccent_rules.py is updated to generate unaccent.rules and a couple of tests are added for numeric symbols. While working on this patch, I have implemented a fake rule file to test the parsing logic implemented, which is not included here as it would just consume extra cycles in the tests, and it requires the manipulation of an installation tree to be able to work correctly.

As this requires a change of format in unaccent.rules, this cannot be backpatched, unfortunately. The idea to use double quotes as escaped characters comes from Tom Lane.

Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org
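The failure mode described above can be illustrated with a minimal Python sketch. It is not the C code from unaccent.c: `parse_rule_pre_patch` and `translate` are hypothetical helpers, `str.split()` only approximates the whitespace handling, and the per-character lookup stands in for the trie used by the extension.

```python
def parse_rule_pre_patch(line):
    """Approximate the pre-patch parsing: whitespace around src and trg is dropped."""
    fields = line.split()  # any run of whitespace separates the fields
    src = fields[0]
    trg = fields[1] if len(fields) > 1 else ""
    return src, trg


def translate(text, rules):
    """Replace every character that has a rule; leave the others untouched."""
    return "".join(rules.get(ch, ch) for ch in text)


# Rule meant to map U+00BD to " 1/2": source, a tab, then a space and 1/2.
src, trg = parse_rule_pre_patch("\u00bd\t 1/2")
rules = {src: trg}

print(repr(trg))                    # prints '1/2' (the leading space is lost)
print(translate("1\u00bd", rules))  # prints 11/2 instead of the intended 1 1/2
```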

1 parent 5086199, commit 4b7d9e5

File tree

6 files changed: +166 -38 lines changed

contrib/unaccent
- expected
  - unaccent.out
- generate_unaccent_rules.py
- sql
  - unaccent.sql
- unaccent.c
- unaccent.rules
doc/src/sgml
- unaccent.sgml


`contrib/unaccent/expected/unaccent.out`

Lines changed: 36 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -51,6 +51,18 @@ SELECT unaccent('℗'); -- sound recording copyright`
`51`	`51`	`(P)`
`52`	`52`	`(1 row)`
`53`	`53`
	`54`	`+SELECT unaccent('1½'); -- math expression with whitespace`
	`55`	`+ unaccent`
	`56`	`+----------`
	`57`	`+ 1 1/2`
	`58`	`+(1 row)`
	`59`	`+`
	`60`	`+SELECT unaccent('〝'); -- quote`
	`61`	`+ unaccent`
	`62`	`+----------`
	`63`	`+ "`
	`64`	`+(1 row)`
	`65`	`+`
`54`	`66`	`SELECT unaccent('unaccent', 'foobar');`
`55`	`67`	`unaccent`
`56`	`68`	`----------`
`@@ -93,6 +105,18 @@ SELECT unaccent('unaccent', '℗');`
`93`	`105`	`(P)`
`94`	`106`	`(1 row)`
`95`	`107`
	`108`	`+SELECT unaccent('unaccent', '1½');`
	`109`	`+ unaccent`
	`110`	`+----------`
	`111`	`+ 1 1/2`
	`112`	`+(1 row)`
	`113`	`+`
	`114`	`+SELECT unaccent('unaccent', '〝');`
	`115`	`+ unaccent`
	`116`	`+----------`
	`117`	`+ "`
	`118`	`+(1 row)`
	`119`	`+`
`96`	`120`	`SELECT ts_lexize('unaccent', 'foobar');`
`97`	`121`	`ts_lexize`
`98`	`122`	`-----------`
`@@ -135,6 +159,18 @@ SELECT ts_lexize('unaccent', '℗');`
`135`	`159`	`{(P)}`
`136`	`160`	`(1 row)`
`137`	`161`
	`162`	`+SELECT ts_lexize('unaccent', '1½');`
	`163`	`+ ts_lexize`
	`164`	`+-----------`
	`165`	`+ {"1 1/2"}`
	`166`	`+(1 row)`
	`167`	`+`
	`168`	`+SELECT ts_lexize('unaccent', '〝');`
	`169`	`+ ts_lexize`
	`170`	`+-----------`
	`171`	`+ {"\""}`
	`172`	`+(1 row)`
	`173`	`+`
`138`	`174`	`-- Controversial case. Black-Letter Capital H (U+210C) is translated by`
`139`	`175`	`-- Latin-ASCII.xml as 'x', but it should be 'H'.`
`140`	`176`	`SELECT unaccent('ℌ');`

`contrib/unaccent/generate_unaccent_rules.py`

Lines changed: 4 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -58,6 +58,10 @@`
`58`	`58`
`59`	`59`	` def print_record(codepoint, letter):`
`60`	`60`	`     if letter:`
	`61`	`+        # If the letter has whitespace or double quotes, escape double`
	`62`	`+        # quotes and apply more quotes around it.`
	`63`	`+        if (' ' in letter) or ('"' in letter):`
	`64`	`+            letter = '"' + letter.replace('"', '""') + '"'`
`61`	`65`	`         output = chr(codepoint) + "\t" + letter`
`62`	`66`	`     else:`
`63`	`67`	`         output = chr(codepoint)`
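The quoting rule added to `print_record` can be tried in isolation. The sketch below copies the condition and the escaping from the hunk above; `quote_target` and `format_rule` are hypothetical names introduced for illustration and do not exist in generate_unaccent_rules.py.

```python
def quote_target(letter):
    """Quote a translation target when it holds a space or a double quote."""
    if (' ' in letter) or ('"' in letter):
        # Double every embedded quote, then wrap the whole target in quotes.
        return '"' + letter.replace('"', '""') + '"'
    return letter


def format_rule(codepoint, letter):
    """Build one rule line: source character, a tab, then the target."""
    if letter:
        return chr(codepoint) + "\t" + quote_target(letter)
    return chr(codepoint)


print(repr(format_rule(0x00BD, " 1/2")))  # prints '½\t" 1/2"'
print(repr(format_rule(0x201C, '"')))     # prints '“\t""""'
print(repr(format_rule(0x00C0, "A")))     # prints 'À\tA'
```

A target that is a lone double quote becomes four quotes: one opening quote, the escaped pair, and one closing quote, which matches the `""""` entries in the regenerated unaccent.rules.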

`contrib/unaccent/sql/unaccent.sql`

Lines changed: 6 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -20,6 +20,8 @@ SELECT unaccent('˃˖˗˜');`
`20`	`20`	`SELECT unaccent('À'); -- Remove combining diacritical 0x0300`
`21`	`21`	`SELECT unaccent('℃℉'); -- degree signs`
`22`	`22`	`SELECT unaccent('℗'); -- sound recording copyright`
	`23`	`+SELECT unaccent('1½'); -- math expression with whitespace`
	`24`	`+SELECT unaccent('〝'); -- quote`
`23`	`25`
`24`	`26`	`SELECT unaccent('unaccent', 'foobar');`
`25`	`27`	`SELECT unaccent('unaccent', 'ёлка');`
`@@ -28,6 +30,8 @@ SELECT unaccent('unaccent', '˃˖˗˜');`
`28`	`30`	`SELECT unaccent('unaccent', 'À');`
`29`	`31`	`SELECT unaccent('unaccent', '℃℉');`
`30`	`32`	`SELECT unaccent('unaccent', '℗');`
	`33`	`+SELECT unaccent('unaccent', '1½');`
	`34`	`+SELECT unaccent('unaccent', '〝');`
`31`	`35`
`32`	`36`	`SELECT ts_lexize('unaccent', 'foobar');`
`33`	`37`	`SELECT ts_lexize('unaccent', 'ёлка');`
`@@ -36,6 +40,8 @@ SELECT ts_lexize('unaccent', '˃˖˗˜');`
`36`	`40`	`SELECT ts_lexize('unaccent', 'À');`
`37`	`41`	`SELECT ts_lexize('unaccent', '℃℉');`
`38`	`42`	`SELECT ts_lexize('unaccent', '℗');`
	`43`	`+SELECT ts_lexize('unaccent', '1½');`
	`44`	`+SELECT ts_lexize('unaccent', '〝');`
`39`	`45`
`40`	`46`	`-- Controversial case. Black-Letter Capital H (U+210C) is translated by`
`41`	`47`	`-- Latin-ASCII.xml as 'x', but it should be 'H'.`

`contrib/unaccent/unaccent.c`

Lines changed: 76 additions & 10 deletions

Original file line number	Diff line number	Diff line change
`@@ -127,24 +127,30 @@ initTrie(const char *filename)`
`127`	`127`	`  * src and trg are sequences of one or more non-whitespace`
`128`	`128`	`  * characters, separated by whitespace. Whitespace at start`
`129`	`129`	`  * or end of line is ignored. If trg is omitted, an empty`
`130`		`- * string is used as the replacement.`
	`130`	`+ * string is used as the replacement. trg can be optionally`
	`131`	`+ * quoted, in which case whitespaces are included in it.`
`131`	`132`	`  *`
`132`	`133`	`  * We use a simple state machine, with states`
`133`	`134`	`  *  0  initial (before src)`
`134`	`135`	`  *  1  in src`
`135`	`136`	`  *  2  in whitespace after src`
`136`		`- *  3  in trg`
`137`		`- *  4  in whitespace after trg`
`138`		`- * -1  syntax error detected`
	`137`	`+ *  3  in trg (non-quoted)`
	`138`	`+ *  4  in trg (quoted)`
	`139`	`+ *  5  in whitespace after trg`
	`140`	`+ * -1  syntax error detected (two strings)`
	`141`	`+ * -2  syntax error detected (unfinished quoted string)`
`139`	`142`	`  *----------`
`140`	`143`	`  */`
`141`	`144`	` int state;`
`142`	`145`	` char *ptr;`
`143`	`146`	` char *src = NULL;`
`144`	`147`	` char *trg = NULL;`
	`148`	`+char *trgstore = NULL;`
`145`	`149`	` int ptrlen;`
`146`	`150`	` int srclen = 0;`
`147`	`151`	` int trglen = 0;`
	`152`	`+int trgstorelen = 0;`
	`153`	`+bool trgquoted = false;`
`148`	`154`
`149`	`155`	` state = 0;`
`150`	`156`	` for (ptr = line; *ptr; ptr += ptrlen)`
`@@ -156,8 +162,10 @@ initTrie(const char *filename)`
`156`	`162`	`     if (state == 1)`
`157`	`163`	`         state = 2;`
`158`	`164`	`     else if (state == 3)`
`159`		`-        state = 4;`
`160`		`-    continue;`
	`165`	`+        state = 5;`
	`166`	`+    /* whitespaces are OK in quoted area */`
	`167`	`+    if (state != 4)`
	`168`	`+        continue;`
`161`	`169`	` }`
`162`	`170`	` switch (state)`
`163`	`171`	` {`
`@@ -173,13 +181,40 @@ initTrie(const char *filename)`
`173`	`181`	`         break;`
`174`	`182`	`     case 2:`
`175`	`183`	`         /* start of trg */`
	`184`	`+        if (*ptr == '"')`
	`185`	`+        {`
	`186`	`+            trgquoted = true;`
	`187`	`+            state = 4;`
	`188`	`+        }`
	`189`	`+        else`
	`190`	`+            state = 3;`
	`191`	`+`
`176`	`192`	`         trg = ptr;`
`177`	`193`	`         trglen = ptrlen;`
`178`		`-        state = 3;`
`179`	`194`	`         break;`
`180`	`195`	`     case 3:`
`181`		`-        /* continue trg */`
	`196`	`+        /* continue non-quoted trg */`
	`197`	`+        trglen += ptrlen;`
	`198`	`+        break;`
	`199`	`+    case 4:`
	`200`	`+        /* continue quoted trg */`
`182`	`201`	`         trglen += ptrlen;`
	`202`	`+`
	`203`	`+        /*`
	`204`	`+         * If this is a quote, consider it as the end of`
	`205`	`+         * trg except if the follow-up character is itself`
	`206`	`+         * a quote.`
	`207`	`+         */`
	`208`	`+        if (*ptr == '"')`
	`209`	`+        {`
	`210`	`+            if (*(ptr + 1) == '"')`
	`211`	`+            {`
	`212`	`+                ptr++;`
	`213`	`+                trglen += 1;`
	`214`	`+            }`
	`215`	`+            else`
	`216`	`+                state = 5;`
	`217`	`+        }`
`183`	`218`	`         break;`
`184`	`219`	`     default:`
`185`	`220`	`         /* bogus line format */`
`@@ -195,15 +230,46 @@ initTrie(const char *filename)`
`195`	`230`	`         trglen = 0;`
`196`	`231`	`     }`
`197`	`232`
	`233`	`+    /* If still in a quoted area, fallback to an error */`
	`234`	`+    if (state == 4)`
	`235`	`+        state = -2;`
	`236`	`+`
	`237`	`+    /* If trg was quoted, remove its quotes and unescape it */`
	`238`	`+    if (trgquoted && state > 0)`
	`239`	`+    {`
	`240`	`+        /* Ignore first and end quotes */`
	`241`	`+        trgstore = palloc0(sizeof(char) * trglen - 2);`
	`242`	`+        trgstorelen = 0;`
	`243`	`+        for (int i = 1; i < trglen - 1; i++)`
	`244`	`+        {`
	`245`	`+            trgstore[trgstorelen] = trg[i];`
	`246`	`+            trgstorelen++;`
	`247`	`+            /* skip second double quotes */`
	`248`	`+            if (trg[i] == '"' && trg[i + 1] == '"')`
	`249`	`+                i++;`
	`250`	`+        }`
	`251`	`+    }`
	`252`	`+    else`
	`253`	`+    {`
	`254`	`+        trgstore = palloc0(sizeof(char) * trglen);`
	`255`	`+        trgstorelen = trglen;`
	`256`	`+        memcpy(trgstore, trg, trgstorelen);`
	`257`	`+    }`
	`258`	`+`
`198`	`259`	`     if (state > 0)`
`199`	`260`	`         rootTrie = placeChar(rootTrie,`
`200`	`261`	`                              (unsigned char *) src, srclen,`
`201`		`-                             trg, trglen);`
`202`		`-    else if (state < 0)`
	`262`	`+                             trgstore, trgstorelen);`
	`263`	`+    else if (state == -1)`
`203`	`264`	`         ereport(WARNING,`
`204`	`265`	`                 (errcode(ERRCODE_CONFIG_FILE_ERROR),`
`205`	`266`	`                  errmsg("invalid syntax: more than two strings in unaccent rule")));`
	`267`	`+    else if (state == -2)`
	`268`	`+        ereport(WARNING,`
	`269`	`+                (errcode(ERRCODE_CONFIG_FILE_ERROR),`
	`270`	`+                 errmsg("invalid syntax: unfinished quoted string in unaccent rule")));`
`206`	`271`
	`272`	`+    pfree(trgstore);`
`207`	`273`	`     pfree(line);`
`208`	`274`	` }`
`209`	`275`	` skip = false;`
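To follow the updated state machine without building the extension, the parsing logic can be ported to a short Python sketch. This is an illustration under assumptions, not the extension's code: `parse_rule` is a hypothetical name, it walks Python characters rather than multibyte sequences measured with `pg_mblen`, it uses `str.isspace()` in place of `t_isspace`, and it raises `ValueError` where the C code sets state -1 or -2 and later reports a WARNING.

```python
def parse_rule(line):
    """Parse one rule line into (src, trg); return None for a blank line."""
    # States: 0 initial, 1 in src, 2 whitespace after src,
    #         3 in unquoted trg, 4 in quoted trg, 5 whitespace after trg.
    state = 0
    src = ""
    trg = ""  # raw target, still including any surrounding quotes
    quoted = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch.isspace():
            if state == 1:
                state = 2
            elif state == 3:
                state = 5
            if state != 4:  # whitespace is kept only inside a quoted target
                i += 1
                continue
        if state == 0:
            src, state = ch, 1
        elif state == 1:
            src += ch
        elif state == 2:
            quoted = ch == '"'
            state = 4 if quoted else 3
            trg = ch
        elif state == 3:
            trg += ch
        elif state == 4:
            trg += ch
            if ch == '"':
                if line[i + 1:i + 2] == '"':
                    trg += '"'  # doubled quote: consume both, stay in trg
                    i += 1
                else:
                    state = 5   # closing quote
        else:
            raise ValueError("more than two strings in unaccent rule")
        i += 1
    if state == 4:
        raise ValueError("unfinished quoted string in unaccent rule")
    if state == 0:
        return None
    if quoted:
        # Drop the surrounding quotes and collapse each escaped pair.
        trg = trg[1:-1].replace('""', '"')
    return src, trg


print(parse_rule('\u00bd\t" 1/2"'))  # prints ('½', ' 1/2')
print(parse_rule('\u201c\t""""'))    # prints ('“', '"')
print(parse_rule('\u00c0\tA'))       # prints ('À', 'A')
```

Keeping the raw quoted text first and unescaping it in a second step mirrors the patch, which records `trg` and `trglen` during the scan and only then copies the cleaned value into `trgstore`.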

`contrib/unaccent/unaccent.rules`

Lines changed: 28 additions & 28 deletions

Original file line number	Diff line number	Diff line change
`@@ -5,9 +5,9 @@`
`5`	`5`	`® (R)`
`6`	`6`	`± +/-`
`7`	`7`	`» >>`
`8`		`-¼  1/4`
`9`		`-½  1/2`
`10`		`-¾  3/4`
	`8`	`+¼ " 1/4"`
	`9`	`+½ " 1/2"`
	`10`	`+¾ " 3/4"`
`11`	`11`	`¿ ?`
`12`	`12`	`À A`
`13`	`13`	`Á A`
`@@ -403,7 +403,7 @@`
`403`	`403`	`ʪ ls`
`404`	`404`	`ʫ lz`
`405`	`405`	`ʹ '`
`406`		`-ʺ "`
	`406`	`+ʺ """"`
`407`	`407`	`ʻ '`
`408`	`408`	`ʼ '`
`409`	`409`	`ʽ '`
`@@ -1058,15 +1058,15 @@`
`1058`	`1058`	`’ '`
`1059`	`1059`	`‚ ,`
`1060`	`1060`	`‛ '`
`1061`		`-“ "`
`1062`		`-” "`
	`1061`	`+“ """"`
	`1062`	`+” """"`
`1063`	`1063`	`„ ,,`
`1064`		`-‟ "`
	`1064`	`+‟ """"`
`1065`	`1065`	`․ .`
`1066`	`1066`	`‥ ..`
`1067`	`1067`	`… ...`
`1068`	`1068`	`′ '`
`1069`		`-″ "`
	`1069`	`+″ """"`
`1070`	`1070`	`‹ <`
`1071`	`1071`	`› >`
`1072`	`1072`	`‼ !!`
`@@ -1134,22 +1134,22 @@`
`1134`	`1134`	`ⅇ e`
`1135`	`1135`	`ⅈ i`
`1136`	`1136`	`ⅉ j`
`1137`		`-⅐  1/7`
`1138`		`-⅑  1/9`
`1139`		`-⅒  1/10`
`1140`		`-⅓  1/3`
`1141`		`-⅔  2/3`
`1142`		`-⅕  1/5`
`1143`		`-⅖  2/5`
`1144`		`-⅗  3/5`
`1145`		`-⅘  4/5`
`1146`		`-⅙  1/6`
`1147`		`-⅚  5/6`
`1148`		`-⅛  1/8`
`1149`		`-⅜  3/8`
`1150`		`-⅝  5/8`
`1151`		`-⅞  7/8`
`1152`		`-⅟  1/`
	`1137`	`+⅐ " 1/7"`
	`1138`	`+⅑ " 1/9"`
	`1139`	`+⅒ " 1/10"`
	`1140`	`+⅓ " 1/3"`
	`1141`	`+⅔ " 2/3"`
	`1142`	`+⅕ " 1/5"`
	`1143`	`+⅖ " 2/5"`
	`1144`	`+⅗ " 3/5"`
	`1145`	`+⅘ " 4/5"`
	`1146`	`+⅙ " 1/6"`
	`1147`	`+⅚ " 5/6"`
	`1148`	`+⅛ " 1/8"`
	`1149`	`+⅜ " 3/8"`
	`1150`	`+⅝ " 5/8"`
	`1151`	`+⅞ " 7/8"`
	`1152`	`+⅟ " 1/"`
`1153`	`1153`	`Ⅰ I`
`1154`	`1154`	`Ⅱ II`
`1155`	`1155`	`Ⅲ III`
`@@ -1182,7 +1182,7 @@`
`1182`	`1182`	`ⅽ c`
`1183`	`1183`	`ⅾ d`
`1184`	`1184`	`ⅿ m`
`1185`		`-↉  0/3`
	`1185`	`+↉ " 0/3"`
`1186`	`1186`	`− -`
`1187`	`1187`	`∕ /`
`1188`	`1188`	`∖ \`
`@@ -1296,8 +1296,8 @@`
`1296`	`1296`	`〙 ]`
`1297`	`1297`	`〚 [`
`1298`	`1298`	`〛 ]`
`1299`		`-〝 "`
`1300`		`-〞 "`
	`1299`	`+〝 """"`
	`1300`	`+〞 """"`
`1301`	`1301`	`㍱ hPa`
`1302`	`1302`	`㍲ da`
`1303`	`1303`	`㍳ AU`
`@@ -1512,7 +1512,7 @@`
`1512`	`1512`	`﹪ %`
`1513`	`1513`	`﹫ @`
`1514`	`1514`	`！ !`
`1515`		`-＂ "`
	`1515`	`+＂ """"`
`1516`	`1516`	`＃ #`
`1517`	`1517`	`＄ $`
`1518`	`1518`	`％ %`

`doc/src/sgml/unaccent.sgml`

Lines changed: 16 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -84,6 +84,22 @@`
`84`	`84`	`</para>`
`85`	`85`	`</listitem>`
`86`	`86`
	`87`	`+ <listitem>`
	`88`	`+ <para>`
	`89`	`+ Some characters, like numeric symbols, may require whitespaces in their`
	`90`	`+ translation rule. It is possible to use double quotes around the translated`
	`91`	`+ characters in this case. A double quote needs to be escaped with a second`
	`92`	`+ double quote when including one in the translated character. For example:`
	`93`	`+<programlisting>`
	`94`	`+¼ " 1/4"`
	`95`	`+½ " 1/2"`
	`96`	`+¾ " 3/4"`
	`97`	`+“ """"`
	`98`	`+” """"`
	`99`	`+</programlisting>`
	`100`	`+ </para>`
	`101`	`+ </listitem>`
	`102`	`+`
`87`	`103`	`<listitem>`
`88`	`104`	`<para>`
`89`	`105`	`As with other <productname>PostgreSQL</productname> text search configuration files,`
