Commit 97c40ce

Allow empty replacement strings in contrib/unaccent.
This is useful in languages where diacritic signs are represented as
separate characters; it's also one step towards letting unaccent be used
for arbitrary substring substitutions.

In passing, improve the user documentation for unaccent, which was sadly
vague about some important details.

Mohammad Alhashash, reviewed by Abhijit Menon-Sen

1 parent 5586327, commit 97c40ce

2 files changed: +54 -11 lines changed


‎contrib/unaccent/unaccent.c

Lines changed: 23 additions & 6 deletions
@@ -104,11 +104,21 @@ initTrie(char *filename)

             while ((line = tsearch_readline(&trst)) != NULL)
             {
-                /*
-                 * The format of each line must be "src trg" where src and trg
-                 * are sequences of one or more non-whitespace characters,
-                 * separated by whitespace.  Whitespace at start or end of
-                 * line is ignored.
+                /*----------
+                 * The format of each line must be "src" or "src trg", where
+                 * src and trg are sequences of one or more non-whitespace
+                 * characters, separated by whitespace.  Whitespace at start
+                 * or end of line is ignored.  If trg is omitted, an empty
+                 * string is used as the replacement.
+                 *
+                 * We use a simple state machine, with states
+                 *    0     initial (before src)
+                 *    1     in src
+                 *    2     in whitespace after src
+                 *    3     in trg
+                 *    4     in whitespace after trg
+                 *    -1    syntax error detected (line will be ignored)
+                 *----------
                  */
                 int         state;
                 char       *ptr;
@@ -160,7 +170,14 @@ initTrie(char *filename)
                     }
                 }

-                if (state >= 3)
+                if (state == 1 || state == 2)
+                {
+                    /* trg was omitted, so use "" */
+                    trg = "";
+                    trglen = 0;
+                }
+
+                if (state > 0)
                     rootTrie = placeChar(rootTrie,
                                          (unsigned char *) src, srclen,
                                          trg, trglen);
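
The state machine described in the new comment can be sketched as a standalone function. This is a simplified illustration, not the code in unaccent.c: `parse_rule` and `is_blank` are hypothetical names, the loop walks bytes and recognizes only ASCII whitespace (the real loop works on multibyte characters in the database encoding), and the result is handed back through out-parameters instead of being passed to `placeChar`.

```c
#include <stddef.h>

/* Hypothetical helper: treats only ASCII blanks as whitespace. */
static int is_blank(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * Hypothetical stand-in for the per-line parsing in initTrie().
 * States follow the comment in the diff:
 *   0 initial, 1 in src, 2 whitespace after src,
 *   3 in trg, 4 whitespace after trg, -1 syntax error.
 * Returns 1 if the line yields a rule, 0 if it should be ignored.
 * When trg is omitted, *trg is "" and *trglen is 0.
 */
static int parse_rule(const char *line,
                      const char **src, size_t *srclen,
                      const char **trg, size_t *trglen)
{
    int         state = 0;
    const char *ptr;

    *src = *trg = NULL;
    *srclen = *trglen = 0;

    for (ptr = line; *ptr != '\0'; ptr++)
    {
        if (is_blank(*ptr))
        {
            if (state == 1)
                state = 2;          /* src just ended */
            else if (state == 3)
                state = 4;          /* trg just ended */
            continue;
        }

        switch (state)
        {
            case 0:                 /* first byte of src */
                *src = ptr;
                *srclen = 1;
                state = 1;
                break;
            case 1:                 /* still inside src */
                (*srclen)++;
                break;
            case 2:                 /* first byte of trg */
                *trg = ptr;
                *trglen = 1;
                state = 3;
                break;
            case 3:                 /* still inside trg */
                (*trglen)++;
                break;
            default:                /* a third token, or already an error */
                state = -1;
                break;
        }
    }

    if (state == 1 || state == 2)
    {
        /* trg was omitted, so use "" */
        *trg = "";
        *trglen = 0;
    }

    return state > 0;
}
```

Under these assumptions, a line such as `"  a  b  "` yields src `a` and trg `b`, a line containing only `abc` yields src `abc` with an empty replacement, and both a blank line and a line with three tokens are ignored. The old `state >= 3` test would have dropped the `abc` case, which is the behavior the commit changes.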

‎doc/src/sgml/unaccent.sgml

Lines changed: 31 additions & 5 deletions
@@ -45,9 +45,9 @@
    <itemizedlist>
     <listitem>
      <para>
-      Each line represents a pair, consisting of a character with accent
-      followed by a character without accent. The first is translated into
-      the second. For example,
+      Each line represents one translation rule, consisting of a character with
+      accent followed by a character without accent. The first is translated
+      into the second. For example,
 <programlisting>
 &Agrave; A
 &Aacute; A
@@ -57,6 +57,27 @@
 &Aring; A
 &AElig; A
 </programlisting>
+      The two characters must be separated by whitespace, and any leading or
+      trailing whitespace on a line is ignored.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      Alternatively, if only one character is given on a line, instances of
+      that character are deleted; this is useful in languages where accents
+      are represented by separate characters.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      As with other <productname>PostgreSQL</> text search configuration files,
+      the rules file must be stored in UTF-8 encoding. The data is
+      automatically translated into the current database's encoding when
+      loaded. Any lines containing untranslatable characters are silently
+      ignored, so that rules files can contain rules that are not applicable in
+      the current encoding.
      </para>
     </listitem>
    </itemizedlist>
@@ -132,8 +153,8 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')

   <para>
    The <function>unaccent()</> function removes accents (diacritic signs) from
-   a given string. Basically, it's a wrapper around the
-   <filename>unaccent</> dictionary, but it can be used outside normal
+   a given string. Basically, it's a wrapper around
+   <filename>unaccent</>-type dictionaries, but it can be used outside normal
    text search contexts.
   </para>

@@ -145,6 +166,11 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
 unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>, </optional> <replaceable class="PARAMETER">string</replaceable>) returns <type>text</type>
 </synopsis>

+  <para>
+   If the <replaceable class="PARAMETER">dictionary</replaceable> argument is
+   omitted, <literal>unaccent</> is assumed.
+  </para>
+
   <para>
    For example:
 <programlisting>
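
To illustrate what the documented deletion rule means for a string, here is a minimal sketch in C. It is not how unaccent.c works (the module stores rules in a trie built by `placeChar`); the `Rule` struct and `apply_rules` function are hypothetical, and the lookup is a plain linear scan over a small table. The example rules map the precomposed `é` (UTF-8 bytes C3 A9) to `e` and map the combining acute accent U+0301 (UTF-8 bytes CC 81) to the empty string, which deletes it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record: src is replaced by trg; trg may be "". */
typedef struct
{
    const char *src;
    const char *trg;
} Rule;

/*
 * Hypothetical helper: copies 'in' to 'out', replacing each occurrence
 * of a rule's src with its trg.  The first matching rule wins.  Output
 * is truncated if it would not fit in 'outsize' bytes (including the
 * terminating NUL).  Returns the number of bytes written, excluding
 * the NUL.
 */
static size_t apply_rules(const Rule *rules, size_t nrules,
                          const char *in, char *out, size_t outsize)
{
    size_t o = 0;

    if (outsize == 0)
        return 0;

    while (*in != '\0')
    {
        const char *rep = NULL;
        size_t      replen = 1;     /* default: copy one input byte */
        size_t      adv = 1;
        size_t      i;

        for (i = 0; i < nrules; i++)
        {
            size_t n = strlen(rules[i].src);

            if (n > 0 && strncmp(in, rules[i].src, n) == 0)
            {
                rep = rules[i].trg;
                replen = strlen(rep);   /* 0 for a deletion rule */
                adv = n;
                break;
            }
        }

        if (rep == NULL)
            rep = in;               /* no rule matched: keep the byte */

        if (o + replen >= outsize)
            break;                  /* no room left: truncate */

        memcpy(out + o, rep, replen);
        o += replen;
        in += adv;
    }

    out[o] = '\0';
    return o;
}
```

With the two rules above, both `"caf\xC3\xA9"` (precomposed) and `"cafe\xCC\x81"` (base letter plus combining mark) come out as `cafe`: the first through an ordinary two-column rule, the second through a one-column deletion rule of the kind this commit enables.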
