Commit 4b7d9e5

michaelpq authored and pull[bot] committed

unaccent: Add support for quoted translated characters

As reported in bug #18057, the extension unaccent removes in its rule
file whitespace characters that are intentionally specified when
building unaccent.rules from UnicodeData.txt, causing an incorrect
translation for some characters like numeric symbols. This is caused by
the fact that all whitespaces before and after the origin and target
characters are all discarded (this limitation is documented).

This commit makes possible the use of quotes around target characters,
so as whitespaces can be considered part of target characters. Some
target characters use a double quote, these require an extra double
quote.

The documentation is updated to show how to use quoted areas,
generate_unaccent_rules.py is updated to generate unaccent.rules and a
couple of tests are added for numeric symbols. While working on this
patch, I have implemented a fake rule file to test the parsing logic
implemented, which is not included here as it would just consume extra
cycles in the tests, and it requires the manipulation of an installation
tree to be able to work correctly.

As this requires a change of format in unaccent.rules, this cannot be
backpatched, unfortunately. The idea to use double quotes as escaped
characters comes from Tom Lane.

Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org

1 parent 5086199, commit 4b7d9e5
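
The whitespace problem described in the commit message can be reproduced with a small sketch. This is not the extension's C code: `parse_rule_old` and `apply_rules` are hypothetical names introduced here for illustration. The first one assumes, as the message states, that whitespace around source and target is discarded; the second does a naive per-character lookup rather than the trie lookup the extension performs.

```python
# Hypothetical stand-ins for illustration; not part of the unaccent extension.

def parse_rule_old(line):
    """Split a rule line on any whitespace, discarding it (pre-patch behavior)."""
    fields = line.split()
    if not fields:
        return None
    if len(fields) > 2:
        raise ValueError("more than two strings in unaccent rule")
    src = fields[0]
    trg = fields[1] if len(fields) == 2 else ""
    return src, trg


def apply_rules(text, rules):
    """Naive per-character replacement (an assumption; the real code uses a trie)."""
    return "".join(rules.get(ch, ch) for ch in text)


# Rule line in the old format: source, a tab, then the target " 1/2".
src, trg = parse_rule_old("\u00bd\t 1/2")
rules = {src: trg}

print(repr(trg))                       # '1/2'  (the leading space is gone)
print(apply_rules("1\u00bd", rules))   # 11/2   (instead of "1 1/2")
```

With the leading space lost, "1½" collapses into "11/2", which is the incorrect translation for numeric symbols that the commit addresses.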

6 files changed, +166 -38 lines changed

‎contrib/unaccent/expected/unaccent.out

Lines changed: 36 additions & 0 deletions
@@ -51,6 +51,18 @@ SELECT unaccent('℗'); -- sound recording copyright
  (P)
 (1 row)
 
+SELECT unaccent('1½'); -- math expression with whitespace
+ unaccent
+----------
+ 1 1/2
+(1 row)
+
+SELECT unaccent('〝'); -- quote
+ unaccent
+----------
+ "
+(1 row)
+
 SELECT unaccent('unaccent', 'foobar');
  unaccent
 ----------
@@ -93,6 +105,18 @@ SELECT unaccent('unaccent', '℗');
  (P)
 (1 row)
 
+SELECT unaccent('unaccent', '1½');
+ unaccent
+----------
+ 1 1/2
+(1 row)
+
+SELECT unaccent('unaccent', '〝');
+ unaccent
+----------
+ "
+(1 row)
+
 SELECT ts_lexize('unaccent', 'foobar');
  ts_lexize
 -----------
@@ -135,6 +159,18 @@ SELECT ts_lexize('unaccent', '℗');
  {(P)}
 (1 row)
 
+SELECT ts_lexize('unaccent', '1½');
+ ts_lexize
+-----------
+ {"1 1/2"}
+(1 row)
+
+SELECT ts_lexize('unaccent', '〝');
+ ts_lexize
+-----------
+ {"\""}
+(1 row)
+
 -- Controversial case. Black-Letter Capital H (U+210C) is translated by
 -- Latin-ASCII.xml as 'x', but it should be 'H'.
 SELECT unaccent('ℌ');

‎contrib/unaccent/generate_unaccent_rules.py

Lines changed: 4 additions & 0 deletions
@@ -58,6 +58,10 @@
 
 def print_record(codepoint, letter):
     if letter:
+        # If the letter has whitespace or double quotes, escape double
+        # quotes and apply more quotes around it.
+        if (' ' in letter) or ('"' in letter):
+            letter = '"' + letter.replace('"', '""') + '"'
         output = chr(codepoint) + "\t" + letter
     else:
         output = chr(codepoint)
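
The quoting rule added to `print_record` can be tried in isolation. The sketch below copies the two-line rule from the hunk above into a helper and pairs it with an inverse; `quote_target` and `unquote_target` are hypothetical names that do not exist in generate_unaccent_rules.py, and the inverse is only an approximation of what the C parser does when it reads the file back.

```python
def quote_target(letter):
    """Quote a target string using the same rule as the patched print_record."""
    if (' ' in letter) or ('"' in letter):
        return '"' + letter.replace('"', '""') + '"'
    return letter


def unquote_target(field):
    """Hypothetical inverse: strip the outer quotes, collapse doubled quotes."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


# Targets taken from the rules touched by this commit, plus two unquoted ones.
for target in [' 1/4', '"', 'A', ',,']:
    quoted = quote_target(target)
    print(repr(target), '->', quoted)
    # Quoting followed by unquoting should give back the original target.
    assert unquote_target(quoted) == target
```

A target consisting of a single double quote becomes four double quotes: two outer quotes around one escaped quote, which matches the `""""` entries in the regenerated unaccent.rules.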

‎contrib/unaccent/sql/unaccent.sql

Lines changed: 6 additions & 0 deletions
@@ -20,6 +20,8 @@ SELECT unaccent('˃˖˗˜');
 SELECT unaccent(''); -- Remove combining diacritical 0x0300
 SELECT unaccent('℃℉'); -- degree signs
 SELECT unaccent('℗'); -- sound recording copyright
+SELECT unaccent('1½'); -- math expression with whitespace
+SELECT unaccent('〝'); -- quote
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
@@ -28,6 +30,8 @@ SELECT unaccent('unaccent', '˃˖˗˜');
 SELECT unaccent('unaccent', '');
 SELECT unaccent('unaccent', '℃℉');
 SELECT unaccent('unaccent', '℗');
+SELECT unaccent('unaccent', '1½');
+SELECT unaccent('unaccent', '〝');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
@@ -36,6 +40,8 @@ SELECT ts_lexize('unaccent', '˃˖˗˜');
 SELECT ts_lexize('unaccent', '');
 SELECT ts_lexize('unaccent', '℃℉');
 SELECT ts_lexize('unaccent', '℗');
+SELECT ts_lexize('unaccent', '1½');
+SELECT ts_lexize('unaccent', '〝');
 
 -- Controversial case. Black-Letter Capital H (U+210C) is translated by
 -- Latin-ASCII.xml as 'x', but it should be 'H'.

‎contrib/unaccent/unaccent.c

Lines changed: 76 additions & 10 deletions
@@ -127,24 +127,30 @@ initTrie(const char *filename)
 				 * src and trg are sequences of one or more non-whitespace
 				 * characters, separated by whitespace. Whitespace at start
 				 * or end of line is ignored. If trg is omitted, an empty
-				 * string is used as the replacement.
+				 * string is used as the replacement. trg can be optionally
+				 * quoted, in which case whitespaces are included in it.
 				 *
 				 * We use a simple state machine, with states
 				 *	0	initial (before src)
 				 *	1	in src
 				 *	2	in whitespace after src
-				 *	3	in trg
-				 *	4	in whitespace after trg
-				 *	-1	syntax error detected
+				 *	3	in trg (non-quoted)
+				 *	4	in trg (quoted)
+				 *	5	in whitespace after trg
+				 *	-1	syntax error detected (two strings)
+				 *	-2	syntax error detected (unfinished quoted string)
 				 *----------
 				 */
 				int			state;
 				char	   *ptr;
 				char	   *src = NULL;
 				char	   *trg = NULL;
+				char	   *trgstore = NULL;
 				int			ptrlen;
 				int			srclen = 0;
 				int			trglen = 0;
+				int			trgstorelen = 0;
+				bool		trgquoted = false;
 
 				state = 0;
 				for (ptr = line; *ptr; ptr += ptrlen)
@@ -156,8 +162,10 @@ initTrie(const char *filename)
 						if (state == 1)
 							state = 2;
 						else if (state == 3)
-							state = 4;
-						continue;
+							state = 5;
+						/* whitespaces are OK in quoted area */
+						if (state != 4)
+							continue;
 					}
 					switch (state)
 					{
@@ -173,13 +181,40 @@ initTrie(const char *filename)
 							break;
 						case 2:
 							/* start of trg */
+							if (*ptr == '"')
+							{
+								trgquoted = true;
+								state = 4;
+							}
+							else
+								state = 3;
+
 							trg = ptr;
 							trglen = ptrlen;
-							state = 3;
 							break;
 						case 3:
-							/* continue trg */
+							/* continue non-quoted trg */
+							trglen += ptrlen;
+							break;
+						case 4:
+							/* continue quoted trg */
 							trglen += ptrlen;
+
+							/*
+							 * If this is a quote, consider it as the end of
+							 * trg except if the follow-up character is itself
+							 * a quote.
+							 */
+							if (*ptr == '"')
+							{
+								if (*(ptr + 1) == '"')
+								{
+									ptr++;
+									trglen += 1;
+								}
+								else
+									state = 5;
+							}
 							break;
 						default:
 							/* bogus line format */
@@ -195,15 +230,46 @@ initTrie(const char *filename)
 					trglen = 0;
 				}
 
+				/* If still in a quoted area, fallback to an error */
+				if (state == 4)
+					state = -2;
+
+				/* If trg was quoted, remove its quotes and unescape it */
+				if (trgquoted && state > 0)
+				{
+					/* Ignore first and end quotes */
+					trgstore = palloc0(sizeof(char *) * trglen - 2);
+					trgstorelen = 0;
+					for (int i = 1; i < trglen - 1; i++)
+					{
+						trgstore[trgstorelen] = trg[i];
+						trgstorelen++;
+						/* skip second double quotes */
+						if (trg[i] == '"' && trg[i + 1] == '"')
+							i++;
+					}
+				}
+				else
+				{
+					trgstore = palloc0(sizeof(char *) * trglen);
+					trgstorelen = trglen;
+					memcpy(trgstore, trg, trgstorelen);
+				}
+
 				if (state > 0)
 					rootTrie = placeChar(rootTrie,
 										 (unsigned char *) src, srclen,
-										 trg, trglen);
-				else if (state < 0)
+										 trgstore, trgstorelen);
+				else if (state == -1)
 					ereport(WARNING,
 							(errcode(ERRCODE_CONFIG_FILE_ERROR),
 							 errmsg("invalid syntax: more than two strings in unaccent rule")));
+				else if (state == -2)
+					ereport(WARNING,
+							(errcode(ERRCODE_CONFIG_FILE_ERROR),
+							 errmsg("invalid syntax: unfinished quoted string in unaccent rule")));
 
+				pfree(trgstore);
 				pfree(line);
 			}
 			skip = false;
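
The state machine in this hunk is easier to follow when it can be run on individual rule lines. Below is a rough Python transcription, offered as a sketch rather than a faithful port: `parse_rule` is a hypothetical name, it walks Python characters instead of multibyte sequences measured with `pg_mblen`, and it raises `ValueError` where the C code emits a WARNING and skips the line.

```python
def parse_rule(line):
    """Parse one rule line into (src, trg); None for a blank line.

    States follow the patched comment: 0 before src, 1 in src,
    2 whitespace after src, 3 unquoted trg, 4 quoted trg,
    5 whitespace after trg.
    """
    state = 0
    src = ""
    trg = ""            # raw target, still including any quotes
    quoted = False
    i = 0
    while i < len(line):
        ch = line[i]
        i += 1
        if ch.isspace():
            if state == 1:
                state = 2
            elif state == 3:
                state = 5
            if state != 4:      # whitespace is only kept inside quotes
                continue
        if state == 0:
            src, state = ch, 1
        elif state == 1:
            src += ch
        elif state == 2:
            quoted = ch == '"'
            state = 4 if quoted else 3
            trg = ch
        elif state == 3:
            trg += ch
        elif state == 4:
            trg += ch
            if ch == '"':
                if i < len(line) and line[i] == '"':
                    trg += '"'  # escaped quote: consume the second one
                    i += 1
                else:
                    state = 5   # closing quote
        else:
            raise ValueError("more than two strings in unaccent rule")
    if state == 4:
        raise ValueError("unfinished quoted string in unaccent rule")
    if state == 0:
        return None
    if quoted:
        # Drop the outer quotes and unescape doubled quotes.
        trg = trg[1:-1].replace('""', '"')
    return src, trg


print(parse_rule('\u00bd\t" 1/2"'))   # quoted target keeps its leading space
print(parse_rule('\u201c\t""""'))     # four quotes decode to one double quote
print(parse_rule('\u00c0\tA'))        # unquoted targets behave as before
```

Inside a quoted target every literal quote is doubled, so collapsing `""` to `"` after stripping the outer pair gives the same result as the character-by-character copy loop in the C code.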

‎contrib/unaccent/unaccent.rules

Lines changed: 28 additions & 28 deletions
@@ -5,9 +5,9 @@
 ®	(R)
 ±	+/-
 »	>>
-¼	 1/4
-½	 1/2
-¾	 3/4
+¼	" 1/4"
+½	" 1/2"
+¾	" 3/4"
 ¿	?
 À	A
 Á	A
@@ -403,7 +403,7 @@
 ʪ	ls
 ʫ	lz
 ʹ	'
-ʺ	"
+ʺ	""""
 ʻ	'
 ʼ	'
 ʽ	'
@@ -1058,15 +1058,15 @@
 ’	'
 ‚	,
 ‛	'
-“	"
-”	"
+“	""""
+”	""""
 „	,,
-‟	"
+‟	""""
 ․	.
 ‥	..
 …	...
 ′	'
-″	"
+″	""""
 ‹	<
 ›	>
 ‼	!!
@@ -1134,22 +1134,22 @@
 ⅇ	e
 ⅈ	i
 ⅉ	j
-⅐	 1/7
-⅑	 1/9
-⅒	 1/10
-⅓	 1/3
-⅔	 2/3
-⅕	 1/5
-⅖	 2/5
-⅗	 3/5
-⅘	 4/5
-⅙	 1/6
-⅚	 5/6
-⅛	 1/8
-⅜	 3/8
-⅝	 5/8
-⅞	 7/8
-⅟	 1/
+⅐	" 1/7"
+⅑	" 1/9"
+⅒	" 1/10"
+⅓	" 1/3"
+⅔	" 2/3"
+⅕	" 1/5"
+⅖	" 2/5"
+⅗	" 3/5"
+⅘	" 4/5"
+⅙	" 1/6"
+⅚	" 5/6"
+⅛	" 1/8"
+⅜	" 3/8"
+⅝	" 5/8"
+⅞	" 7/8"
+⅟	" 1/"
 Ⅰ	I
 Ⅱ	II
 Ⅲ	III
@@ -1182,7 +1182,7 @@
 ⅽ	c
 ⅾ	d
 ⅿ	m
-↉	 0/3
+↉	" 0/3"
 −	-
 ∕	/
 ∖	\
@@ -1296,8 +1296,8 @@
 〙	]
 〚	[
 〛	]
-〝	"
-〞	"
+〝	""""
+〞	""""
 ㍱	hPa
 ㍲	da
 ㍳	AU
@@ -1512,7 +1512,7 @@
 ﹪	%
 ﹫	@
 ！	!
-＂	"
+＂	""""
 ＃	#
 ＄	$
 ％	%

‎doc/src/sgml/unaccent.sgml

Lines changed: 16 additions & 0 deletions
@@ -84,6 +84,22 @@
      </para>
     </listitem>
 
+    <listitem>
+     <para>
+      Some characters, like numeric symbols, may require whitespaces in their
+      translation rule. It is possible to use double quotes around the translated
+      characters in this case. A double quote needs to be escaped with a second
+      double quote when including one in the translated character. For example:
+<programlisting>
+&frac14; " 1/4"
+&frac12; " 1/2"
+&frac34; " 3/4"
+&ldquo; """"
+&rdquo; """"
+</programlisting>
+     </para>
+    </listitem>
+
     <listitem>
      <para>
       As with other <productname>PostgreSQL</productname> text search configuration files,
