Commit e1c1d54

Update unaccent rules with release 34 of CLDR for Latin-ASCII.xml

This has required an update of the python script generating the rules,
as its format has changed in release 29. This release has also added
new punctuation and symbols, and a new set of rules has been generated
to include them. The way to find newest versions of Latin-ASCII gets
also more clearly documented.

Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org

1 parent c64d0cd, commit e1c1d54

4 files changed, 54 insertions(+), 3 deletions(-)

‎contrib/unaccent/expected/unaccent.out

Lines changed: 18 additions & 0 deletions
@@ -25,6 +25,12 @@ SELECT unaccent('ЁЖИК');
  ЕЖИК
 (1 row)
 
+SELECT unaccent('˃˖˗˜');
+ unaccent
+----------
+ >+-~
+(1 row)
+
 SELECT unaccent('unaccent', 'foobar');
  unaccent
 ----------
@@ -43,6 +49,12 @@ SELECT unaccent('unaccent', 'ЁЖИК');
  ЕЖИК
 (1 row)
 
+SELECT unaccent('unaccent', '˃˖˗˜');
+ unaccent
+----------
+ >+-~
+(1 row)
+
 SELECT ts_lexize('unaccent', 'foobar');
  ts_lexize
 -----------
@@ -61,3 +73,9 @@ SELECT ts_lexize('unaccent', 'ЁЖИК');
  {ЕЖИК}
 (1 row)
 
+SELECT ts_lexize('unaccent', '˃˖˗˜');
+ ts_lexize
+-----------
+ {>+-~}
+(1 row)
+

‎contrib/unaccent/generate_unaccent_rules.py

Lines changed: 18 additions & 3 deletions
@@ -20,8 +20,13 @@
 # option is enabled, the XML file of this transliterator [2] -- given as a
 # command line argument -- will be parsed and used.
 #
+# Ideally you should use the latest release for each data set. For
+# Latin-ASCII.xml, the latest data sets released can be browsed directly
+# via [3]. Note that this script is compatible with at least release 29.
+#
 # [1] http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
-# [2] http://unicode.org/cldr/trac/export/12304/tags/release-28/common/transforms/Latin-ASCII.xml
+# [2] http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
+# [3] https://unicode.org/cldr/trac/browser/tags
 
 # BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 # The approach is to be Python3 compatible with Python2 "backports".
@@ -140,8 +145,18 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
     transliterationTree = ET.parse(latinAsciiFilePath)
     transliterationTreeRoot = transliterationTree.getroot()
 
-    for rule in transliterationTreeRoot.findall("./transforms/transform/tRule"):
-        matches = rulePattern.search(rule.text)
+    # Fetch all the transliteration rules. Since release 29 of Latin-ASCII.xml
+    # all the transliteration rules are located in a single tRule block with
+    # all rules separated into separate lines.
+    blockRules = transliterationTreeRoot.findall("./transforms/transform/tRule")
+    assert(len(blockRules) == 1)
+
+    # Split the block of rules into one element per line.
+    rules = blockRules[0].text.splitlines()
+
+    # And finish the processing of each individual rule.
+    for rule in rules:
+        matches = rulePattern.search(rule)
 
         # The regular expression capture four groups corresponding
         # to the characters.
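
The parsing change in this hunk can be tried in isolation with a minimal sketch. It assumes the release 29+ layout described in the new comment (one `tRule` element holding one rule per line). The XML sample, the `SIMPLE_RULE` pattern, and the `parse_rules` helper are hypothetical simplifications for illustration; they are not the script's actual `rulePattern` or data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for Latin-ASCII.xml in the
# release 29+ layout: a single tRule element with one rule per line.
SAMPLE_XML = """<supplementalData>
  <transforms>
    <transform source="Latin" target="ASCII" direction="forward">
      <tRule>
# lines that do not look like a rule are skipped
Æ → AE ;
ß → ss ;
      </tRule>
    </transform>
  </transforms>
</supplementalData>"""

# Simplified pattern (an assumption, not the script's rulePattern):
# "<source> → <target> ;" with optional surrounding whitespace.
SIMPLE_RULE = re.compile(r"^\s*(\S+)\s*→\s*(\S+)\s*;")


def parse_rules(xml_text):
    """Return a {source: target} dict from a single-tRule document."""
    root = ET.fromstring(xml_text)
    blocks = root.findall("./transforms/transform/tRule")
    # Mirror the assertion in the commit: exactly one block is expected.
    if len(blocks) != 1:
        raise ValueError("expected exactly one tRule block, found %d" % len(blocks))

    mapping = {}
    # Split the block into one candidate rule per line, as the commit does.
    for line in blocks[0].text.splitlines():
        match = SIMPLE_RULE.match(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


if __name__ == "__main__":
    print(parse_rules(SAMPLE_XML))
```

Raising `ValueError` instead of using `assert` is a choice made for this sketch only, so the check still runs when Python is started with `-O`.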

‎contrib/unaccent/sql/unaccent.sql

Lines changed: 3 additions & 0 deletions
@@ -8,11 +8,14 @@ SET client_encoding TO 'UTF8';
 SELECT unaccent('foobar');
 SELECT unaccent('ёлка');
 SELECT unaccent('ЁЖИК');
+SELECT unaccent('˃˖˗˜');
 
 SELECT unaccent('unaccent', 'foobar');
 SELECT unaccent('unaccent', 'ёлка');
 SELECT unaccent('unaccent', 'ЁЖИК');
+SELECT unaccent('unaccent', '˃˖˗˜');
 
 SELECT ts_lexize('unaccent', 'foobar');
 SELECT ts_lexize('unaccent', 'ёлка');
 SELECT ts_lexize('unaccent', 'ЁЖИК');
+SELECT ts_lexize('unaccent', '˃˖˗˜');

‎contrib/unaccent/unaccent.rules

Lines changed: 15 additions & 0 deletions
@@ -399,6 +399,21 @@
 ʦ	ts
 ʪ	ls
 ʫ	lz
+ʹ	'
+ʺ	"
+ʻ	'
+ʼ	'
+ʽ	'
+˂	<
+˃	>
+˄	^
+ˆ	^
+ˈ	'
+ˋ	`
+ː	:
+˖	+
+˗	-
+˜	~
 Ά	Α
 Έ	Ε
 Ή	Η
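
To see how the newly added rules produce the expected output `>+-~` in the regression tests, here is a small Python sketch that applies a whitespace-separated rules table character by character. It only illustrates the mapping and makes no claim about how the unaccent extension works internally. `RULES_TEXT`, `load_rules`, and `apply_rules` are hypothetical names, and the sketch assumes every source entry is a single character.

```python
# Four of the rules added by this commit, written as "source<TAB>target".
RULES_TEXT = (
    "\u02C3\t>\n"   # MODIFIER LETTER RIGHT ARROWHEAD
    "\u02D6\t+\n"   # MODIFIER LETTER PLUS SIGN
    "\u02D7\t-\n"   # MODIFIER LETTER MINUS SIGN
    "\u02DC\t~\n"   # SMALL TILDE
)


def load_rules(text):
    """Parse 'source target' lines into a dict, skipping malformed lines."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            rules[parts[0]] = parts[1]
    return rules


def apply_rules(rules, value):
    """Replace each character that has a rule; keep all others unchanged."""
    return "".join(rules.get(ch, ch) for ch in value)


if __name__ == "__main__":
    rules = load_rules(RULES_TEXT)
    print(apply_rules(rules, "\u02C3\u02D6\u02D7\u02DC"))
```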
