Commit e1c1d54

Update unaccent rules with release 34 of CLDR for Latin-ASCII.xml

This has required an update of the python script generating the rules, as its format has changed in release 29. This release has also added new punctuation and symbols, and a new set of rules has been generated to include them. The way to find newest versions of Latin-ASCII gets also more clearly documented.

Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org

1 parent c64d0cd, commit e1c1d54

File tree

4 files changed: +54 -3 lines changed

contrib/unaccent
- expected
  - unaccent.out
- generate_unaccent_rules.py
- sql
  - unaccent.sql
- unaccent.rules

`‎contrib/unaccent/expected/unaccent.out‎`

Lines changed: 18 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -25,6 +25,12 @@ SELECT unaccent('ЁЖИК');`
`25`	`25`	`ЕЖИК`
`26`	`26`	`(1 row)`
`27`	`27`
	`28`	`+SELECT unaccent('˃˖˗˜');`
	`29`	`+ unaccent`
	`30`	`+----------`
	`31`	`+ >+-~`
	`32`	`+(1 row)`
	`33`	`+`
`28`	`34`	`SELECT unaccent('unaccent', 'foobar');`
`29`	`35`	`unaccent`
`30`	`36`	`----------`
`@@ -43,6 +49,12 @@ SELECT unaccent('unaccent', 'ЁЖИК');`
`43`	`49`	`ЕЖИК`
`44`	`50`	`(1 row)`
`45`	`51`
	`52`	`+SELECT unaccent('unaccent', '˃˖˗˜');`
	`53`	`+ unaccent`
	`54`	`+----------`
	`55`	`+ >+-~`
	`56`	`+(1 row)`
	`57`	`+`
`46`	`58`	`SELECT ts_lexize('unaccent', 'foobar');`
`47`	`59`	`ts_lexize`
`48`	`60`	`-----------`
`@@ -61,3 +73,9 @@ SELECT ts_lexize('unaccent', 'ЁЖИК');`
`61`	`73`	`{ЕЖИК}`
`62`	`74`	`(1 row)`
`63`	`75`
	`76`	`+SELECT ts_lexize('unaccent', '˃˖˗˜');`
	`77`	`+ ts_lexize`
	`78`	`+-----------`
	`79`	`+ {>+-~}`
	`80`	`+(1 row)`
	`81`	`+`

`‎contrib/unaccent/generate_unaccent_rules.py‎`

Lines changed: 18 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -20,8 +20,13 @@`
`20`	`20`	`# option is enabled, the XML file of this transliterator [2] -- given as a`
`21`	`21`	`# command line argument -- will be parsed and used.`
`22`	`22`	`#`
	`23`	`+# Ideally you should use the latest release for each data set. For`
	`24`	`+# Latin-ASCII.xml, the latest data sets released can be browsed directly`
	`25`	`+# via [3]. Note that this script is compatible with at least release 29.`
	`26`	`+#`
`23`	`27`	`# [1] http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt`
`24`		`-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml`
	`28`	`+# [2] http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml`
	`29`	`+# [3] https://unicode.org/cldr/trac/browser/tags`
`25`	`30`
`26`	`31`	`# BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped`
`27`	`32`	`# The approach is to be Python3 compatible with Python2 "backports".`
`@@ -140,8 +145,18 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):`
`140`	`145`	`transliterationTree = ET.parse(latinAsciiFilePath)`
`141`	`146`	`transliterationTreeRoot = transliterationTree.getroot()`
`142`	`147`
`143`		`-for rule in transliterationTreeRoot.findall("./transforms/transform/tRule"):`
`144`		`-matches = rulePattern.search(rule.text)`
	`148`	`+# Fetch all the transliteration rules. Since release 29 of Latin-ASCII.xml`
	`149`	`+# all the transliteration rules are located in a single tRule block with`
	`150`	`+# all rules separated into separate lines.`
	`151`	`+blockRules = transliterationTreeRoot.findall("./transforms/transform/tRule")`
	`152`	`+assert(len(blockRules) == 1)`
	`153`	`+`
	`154`	`+# Split the block of rules into one element per line.`
	`155`	`+rules = blockRules[0].text.splitlines()`
	`156`	`+`
	`157`	`+# And finish the processing of each individual rule.`
	`158`	`+for rule in rules:`
	`159`	`+matches = rulePattern.search(rule)`
`145`	`160`
`146`	`161`	`# The regular expression capture four groups corresponding`
`147`	`162`	`# to the characters.`
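
The parsing change above can be illustrated with a minimal, self-contained sketch. It is not the script's actual code: the `SAMPLE_XML` string is a hypothetical miniature of a release-29-style file (a single `tRule` element holding one rule per line), and `RULE_PATTERN` and `parse_rules` are simplified stand-ins invented for this example. The real script's regular expression captures four groups and is not reproduced here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of a Latin-ASCII.xml in the release 29+ layout:
# one tRule element whose text holds one rule per line.
SAMPLE_XML = """<supplementalData>
  <transforms>
    <transform source="Latin" target="ASCII" direction="forward">
      <tRule>
# a comment line, which the pattern below does not match
˃ → '>' ;
˖ → '+' ;
˗ → '-' ;
˜ → '~' ;
      </tRule>
    </transform>
  </transforms>
</supplementalData>
"""

# Simplified stand-in for the script's rulePattern (assumption): a single
# source character, an arrow, and a quoted replacement.
RULE_PATTERN = re.compile(r"^\s*(\S) → '(.+)' ;\s*$")


def parse_rules(xml_text):
    """Return (source, target) pairs found in the single tRule block."""
    root = ET.fromstring(xml_text)
    block_rules = root.findall("./transforms/transform/tRule")
    # Mirrors the assert in the diff: exactly one block is expected.
    if len(block_rules) != 1:
        raise ValueError("expected one tRule block, found %d" % len(block_rules))

    pairs = []
    # Split the block into lines and keep only the lines that look like rules.
    for line in block_rules[0].text.splitlines():
        match = RULE_PATTERN.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    for source, target in parse_rules(SAMPLE_XML):
        print(source, target)
```

The point of the sketch is the shape of the loop: before release 29 each rule was its own `tRule` element, so the script iterated over elements and read `rule.text`; with a single block, the text has to be split into lines first and each line matched as a plain string.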

`‎contrib/unaccent/sql/unaccent.sql‎`

Lines changed: 3 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -8,11 +8,14 @@ SET client_encoding TO 'UTF8';`
`8`	`8`	`SELECT unaccent('foobar');`
`9`	`9`	`SELECT unaccent('ёлка');`
`10`	`10`	`SELECT unaccent('ЁЖИК');`
	`11`	`+SELECT unaccent('˃˖˗˜');`
`11`	`12`
`12`	`13`	`SELECT unaccent('unaccent','foobar');`
`13`	`14`	`SELECT unaccent('unaccent','ёлка');`
`14`	`15`	`SELECT unaccent('unaccent','ЁЖИК');`
	`16`	`+SELECT unaccent('unaccent','˃˖˗˜');`
`15`	`17`
`16`	`18`	`SELECT ts_lexize('unaccent','foobar');`
`17`	`19`	`SELECT ts_lexize('unaccent','ёлка');`
`18`	`20`	`SELECT ts_lexize('unaccent','ЁЖИК');`
	`21`	`+SELECT ts_lexize('unaccent','˃˖˗˜');`

`‎contrib/unaccent/unaccent.rules‎`

Lines changed: 15 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -399,6 +399,21 @@`
`399`	`399`	`ʦts`
`400`	`400`	`ʪls`
`401`	`401`	`ʫlz`
	`402`	`+ʹ'`
	`403`	`+ʺ"`
	`404`	`+ʻ'`
	`405`	`+ʼ'`
	`406`	`+ʽ'`
	`407`	`+˂<`
	`408`	`+˃>`
	`409`	`+˄^`
	`410`	`+ˆ^`
	`411`	`+ˈ'`
	`412`	``+ˋ` ``
	`413`	`+ː:`
	`414`	`+˖+`
	`415`	`+˗-`
	`416`	`+˜~`
`402`	`417`	`ΆΑ`
`403`	`418`	`ΈΕ`
`404`	`419`	`ΉΗ`
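
To connect the new rules with the regression tests, here is a rough sketch of how four of the added entries turn `˃˖˗˜` into `>+-~`. It assumes each rule is a source/target pair separated by whitespace on one line, and it only handles single-character sources. `RULES_TEXT`, `load_rules`, and `apply_rules` are names made up for this illustration and do not reflect how the unaccent dictionary is implemented in C.

```python
# Four of the rules added by this commit, written as "source<TAB>target".
# Assumption: one whitespace-separated pair per line.
RULES_TEXT = "˃\t>\n˖\t+\n˗\t-\n˜\t~\n"


def load_rules(text):
    """Build a source -> target mapping from rule lines."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            rules[parts[0]] = parts[1]
    return rules


def apply_rules(value, rules):
    """Replace each character that has a rule; leave the others untouched."""
    return "".join(rules.get(ch, ch) for ch in value)


if __name__ == "__main__":
    print(apply_rules("˃˖˗˜", load_rules(RULES_TEXT)))  # prints >+-~
```

Characters without a rule pass through unchanged, which is why `foobar` comes back as-is in the existing tests.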
