
Commit 97c40ce

Allow empty replacement strings in contrib/unaccent.

This is useful in languages where diacritic signs are represented as separate characters; it's also one step towards letting unaccent be used for arbitrary substring substitutions.

In passing, improve the user documentation for unaccent, which was sadly vague about some important details.

Mohammad Alhashash, reviewed by Abhijit Menon-Sen

1 parent 5586327, commit 97c40ce

File tree

2 files changed, +54 -11 lines changed

contrib/unaccent
- unaccent.c
doc/src/sgml
- unaccent.sgml

`contrib/unaccent/unaccent.c`

Lines changed: 23 additions & 6 deletions

Original file line number	Diff line number	Diff line change
`@@ -104,11 +104,21 @@ initTrie(char *filename)`
`104`	`104`
`105`	`105`	`while ((line = tsearch_readline(&trst)) != NULL)`
`106`	`106`	`{`
`107`		`-/*`
`108`		`- * The format of each line must be "src trg" where src and trg`
`109`		`- * are sequences of one or more non-whitespace characters,`
`110`		`- * separated by whitespace. Whitespace at start or end of`
`111`		`- * line is ignored.`
	`107`	`+/*----------`
	`108`	`+ * The format of each line must be "src" or "src trg", where`
	`109`	`+ * src and trg are sequences of one or more non-whitespace`
	`110`	`+ * characters, separated by whitespace. Whitespace at start`
	`111`	`+ * or end of line is ignored. If trg is omitted, an empty`
	`112`	`+ * string is used as the replacement.`
	`113`	`+ *`
	`114`	`+ * We use a simple state machine, with states`
	`115`	`+ *  0   initial (before src)`
	`116`	`+ *  1   in src`
	`117`	`+ *  2   in whitespace after src`
	`118`	`+ *  3   in trg`
	`119`	`+ *  4   in whitespace after trg`
	`120`	`+ *  -1  syntax error detected (line will be ignored)`
	`121`	`+ *----------`
`112`	`122`	`*/`
`113`	`123`	`int state;`
`114`	`124`	`char *ptr;`
`@@ -160,7 +170,14 @@ initTrie(char *filename)`
`160`	`170`	`}`
`161`	`171`	`}`
`162`	`172`
`163`		`-if (state >= 3)`
	`173`	`+if (state == 1 || state == 2)`
	`174`	`+{`
	`175`	`+/* trg was omitted, so use "" */`
	`176`	`+trg = "";`
	`177`	`+trglen = 0;`
	`178`	`+}`
	`179`	`+`
	`180`	`+if (state > 0)`
`164`	`181`	`rootTrie = placeChar(rootTrie,`
`165`	`182`	`(unsigned char *) src, srclen,`
`166`	`183`	`trg, trglen);`
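
The state machine described in the new comment can be sketched as a small standalone function. This is only an illustration reconstructed from the comment and the visible hunks, not the code in `unaccent.c`: the `Rule` struct and `parse_rule_line` name are hypothetical, and it classifies whitespace one byte at a time with `isspace()` in the default C locale, whereas the real loop works with multibyte characters in the database encoding.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of unaccent.c. */
typedef struct
{
    const char *src;    /* start of src within the line */
    size_t      srclen; /* length of src in bytes */
    const char *trg;    /* start of trg, or "" if omitted */
    size_t      trglen; /* length of trg in bytes, 0 if omitted */
} Rule;

/*
 * Parse one rules-file line of the form "src" or "src trg".
 * Returns 1 and fills *rule for a valid line, 0 for a blank line
 * or a syntax error (such lines are ignored).
 */
int
parse_rule_line(const char *line, Rule *rule)
{
    int         state = 0;  /* 0 = initial (before src) */
    const char *ptr;

    assert(line != NULL && rule != NULL);
    rule->src = rule->trg = "";
    rule->srclen = rule->trglen = 0;

    for (ptr = line; *ptr != '\0'; ptr++)
    {
        /* Whitespace is skipped, but it ends src or trg. */
        if (isspace((unsigned char) *ptr))
        {
            if (state == 1)
                state = 2;  /* in whitespace after src */
            else if (state == 3)
                state = 4;  /* in whitespace after trg */
            continue;
        }

        switch (state)
        {
            case 0:         /* start of src */
                rule->src = ptr;
                rule->srclen = 1;
                state = 1;
                break;
            case 1:         /* continue src */
                rule->srclen++;
                break;
            case 2:         /* start of trg */
                rule->trg = ptr;
                rule->trglen = 1;
                state = 3;
                break;
            case 3:         /* continue trg */
                rule->trglen++;
                break;
            default:        /* text after trg: syntax error */
                state = -1;
                break;
        }
    }

    if (state == 1 || state == 2)
    {
        /* trg was omitted, so use "" */
        rule->trg = "";
        rule->trglen = 0;
    }

    return state > 0;
}
```

With this sketch, a line holding only `src` (optionally followed by whitespace) ends in state 1 or 2 and yields an empty replacement, while a line with a third token ends in state -1 and is rejected, which mirrors the `state > 0` test in the hunk above.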

`doc/src/sgml/unaccent.sgml`

Lines changed: 31 additions & 5 deletions

Original file line number	Diff line number	Diff line change
`@@ -45,9 +45,9 @@`
`45`	`45`	`<itemizedlist>`
`46`	`46`	`<listitem>`
`47`	`47`	`<para>`
`48`		`- Each line represents a pair, consisting of a character with accent`
`49`		`- followed by a character without accent. The first is translated into`
`50`		`- the second. For example,`
	`48`	`+ Each line represents one translation rule, consisting of a character with`
	`49`	`+ accent followed by a character without accent. The first is translated`
	`50`	`+ into the second. For example,`
`51`	`51`	`<programlisting>`
`52`	`52`	`À A`
`53`	`53`	`Á A`
`@@ -57,6 +57,27 @@`
`57`	`57`	`Å A`
`58`	`58`	`Æ A`
`59`	`59`	`</programlisting>`
	`60`	`+ The two characters must be separated by whitespace, and any leading or`
	`61`	`+ trailing whitespace on a line is ignored.`
	`62`	`+ </para>`
	`63`	`+ </listitem>`
	`64`	`+`
	`65`	`+ <listitem>`
	`66`	`+ <para>`
	`67`	`+ Alternatively, if only one character is given on a line, instances of`
	`68`	`+ that character are deleted; this is useful in languages where accents`
	`69`	`+ are represented by separate characters.`
	`70`	`+ </para>`
	`71`	`+ </listitem>`
	`72`	`+`
	`73`	`+ <listitem>`
	`74`	`+ <para>`
	`75`	`+ As with other <productname>PostgreSQL</> text search configuration files,`
	`76`	`+ the rules file must be stored in UTF-8 encoding. The data is`
	`77`	`+ automatically translated into the current database's encoding when`
	`78`	`+ loaded. Any lines containing untranslatable characters are silently`
	`79`	`+ ignored, so that rules files can contain rules that are not applicable in`
	`80`	`+ the current encoding.`
`60`	`81`	`</para>`
`61`	`82`	`</listitem>`
`62`	`83`	`</itemizedlist>`
`@@ -132,8 +153,8 @@ mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels')`
`132`	`153`
`133`	`154`	`<para>`
`134`	`155`	`The <function>unaccent()</> function removes accents (diacritic signs) from`
`135`		`- a given string. Basically, it's a wrapper around the`
`136`		`- <filename>unaccent</> dictionary, but it can be used outside normal`
	`156`	`+ a given string. Basically, it's a wrapper around`
	`157`	`+ <filename>unaccent</>-type dictionaries, but it can be used outside normal`
`137`	`158`	`text search contexts.`
`138`	`159`	`</para>`
`139`	`160`
`@@ -145,6 +166,11 @@ mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels')`
`145`	`166`	`unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>, </optional> <replaceable class="PARAMETER">string</replaceable>) returns <type>text</type>`
`146`	`167`	`</synopsis>`
`147`	`168`
	`169`	`+ <para>`
	`170`	`+ If the <replaceable class="PARAMETER">dictionary</replaceable> argument is`
	`171`	`+ omitted, <literal>unaccent</> is assumed.`
	`172`	`+ </para>`
	`173`	`+`
`148`	`174`	`<para>`
`149`	`175`	`For example:`
`150`	`176`	`<programlisting>`
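
To illustrate the effect of the two rule forms documented in the hunks above (a rule with a replacement, and a rule with only one character that deletes it), here is a minimal sketch. The `DemoRule` type and `apply_rules` function are hypothetical names made up for this example, and the linear scan over rules is an assumption chosen for brevity; it is not how the dictionary stores or looks up its rules. It assumes UTF-8 input, so a rule can only match at the start of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule type for this sketch: trg == "" means "delete src". */
typedef struct
{
    const char *src;
    const char *trg;
} DemoRule;

/*
 * Copy "in" to "out", replacing each occurrence of a rule's src with its
 * trg. Returns the number of bytes written (excluding the terminator),
 * or (size_t) -1 if "out" is too small.
 */
size_t
apply_rules(const DemoRule *rules, size_t nrules,
            const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    assert(rules != NULL && in != NULL && out != NULL && outsize > 0);

    while (*in != '\0')
    {
        const char *rep = NULL;
        size_t      replen = 1; /* default: copy one input byte */
        size_t      adv = 1;
        size_t      i;

        for (i = 0; i < nrules; i++)
        {
            size_t n = strlen(rules[i].src);

            if (n > 0 && strncmp(in, rules[i].src, n) == 0)
            {
                rep = rules[i].trg;
                replen = strlen(rep);   /* 0 for a deletion rule */
                adv = n;
                break;
            }
        }

        if (rep == NULL)
            rep = in;           /* no rule matched: copy the byte as-is */

        if (used + replen >= outsize)
            return (size_t) -1; /* keep room for the terminator */

        memcpy(out + used, rep, replen);
        used += replen;
        in += adv;
    }

    out[used] = '\0';
    return used;
}
```

For instance, with one rule mapping `À` to `A` and one rule listing only the combining acute accent (U+0301), an input containing `e` followed by U+0301 keeps the `e` and drops the accent, since the deletion rule contributes zero bytes to the output.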
