Commit 1850184
Add simple codepoint redirections to unaccent.rules.
Previously we searched for code points where the Unicode data file
listed an equivalent combining character sequence that added accents.
Some codepoints redirect to a single other codepoint, instead of doing
any combining. We can follow those references recursively to get the
answer.

Per bug report #18362, which reported missing Ancient Greek characters.
Specifically, precomposed characters with oxia (from the polytonic
accent system used for old Greek) just point to precomposed characters
with tonos (from the monotonic accent system for modern Greek), and we
have to follow the extra hop to find out that they are composed with
an acute accent.

Besides those, the new rule also:

* pulls in a lot of 'Mathematical Alphanumeric Symbols', which are
  copies of the Latin and Greek alphabets and numbers rendered in
  different typefaces, and

* corrects a single mathematical letter that previously came from the
  CLDR transliteration file, but the new rule extracts from the main
  Unicode database file, where clearly the latter is right and the
  former is wrong (reported to CLDR).

Reported-by: Cees van Zeeland <cees.van.zeeland@freedom.nl>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18362-be6d0cfe122b6354%40postgresql.org
1 parent 1eff827, commit 1850184
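
The extra hop described in the commit message can be observed with Python's standard `unicodedata` module. This is only an illustrative sketch: the commit's script parses the Unicode data file itself, and the helper names `decomposition_ids` and `first_undecomposed` are made up for this example.

```python
import unicodedata


def decomposition_ids(ch):
    """Return the code points in ch's decomposition mapping, without any <tag>."""
    fields = unicodedata.decomposition(ch).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]


def first_undecomposed(ch):
    """Follow the first element of each mapping until nothing is left to follow."""
    ids = decomposition_ids(ch)
    while ids:
        ch = chr(ids[0])
        ids = decomposition_ids(ch)
    return ch


alpha_oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
alpha_tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
fraktur_h = "\u210c"    # BLACK-LETTER CAPITAL H, the letter in the regression test

for ch in (alpha_oxia, alpha_tonos, fraktur_h):
    hops = [f"U+{i:04X}" for i in decomposition_ids(ch)]
    print(f"U+{ord(ch):04X} -> {hops} -> base {first_undecomposed(ch)!r}")
```

The oxia form maps to a single codepoint (the tonos form), and only the tonos form maps to a base letter plus a combining acute accent, which is why a search limited to multi-element mappings missed it.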

3 files changed: +1025 -9 lines changed


‎contrib/unaccent/expected/unaccent.out

Lines changed: 1 addition & 1 deletion
@@ -176,6 +176,6 @@ SELECT ts_lexize('unaccent', '〝');
 SELECT unaccent('ℌ');
  unaccent 
 ----------
- x
+ H
 (1 row)
 

‎contrib/unaccent/generate_unaccent_rules.py

Lines changed: 12 additions & 7 deletions
@@ -104,10 +104,11 @@ def is_letter_with_marks(codepoint, table):
     """Returns true for letters combined with one or more marks."""
     # See https://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
 
-    # Letter may have no combining characters, in which case it has
-    # no marks.
-    if len(codepoint.combining_ids) == 1:
-        return False
+    # Some codepoints redirect directly to another, instead of doing any
+    # "combining"... but sometimes they redirect to a codepoint that doesn't
+    # exist, so ignore those.
+    if len(codepoint.combining_ids) == 1 and codepoint.combining_ids[0] in table:
+        return is_letter_with_marks(table[codepoint.combining_ids[0]], table)
 
     # A letter without diacritical marks has none of them.
     if any(is_mark(table[i]) for i in codepoint.combining_ids[1:]) is False:
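
To see the new redirect rule in isolation, here is a simplified, self-contained sketch. The `Codepoint` dataclass, the toy `table`, and the cut-down helpers are assumptions made for illustration; they only mirror the attribute names visible in the diff and are not the script's real definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Codepoint:
    """Hypothetical stand-in for the script's per-codepoint record."""
    id: int
    general_category: str
    combining_ids: list = field(default_factory=list)


def is_mark(cp):
    return cp.general_category.startswith("M")


def is_plain_letter(cp):
    return cp.general_category.startswith("L") and not cp.combining_ids


def is_letter_with_marks(cp, table):
    # A lone redirect to a known codepoint: the answer is whatever the target says.
    if len(cp.combining_ids) == 1 and cp.combining_ids[0] in table:
        return is_letter_with_marks(table[cp.combining_ids[0]], table)
    # The elements after the first must include at least one mark.
    if not any(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    return is_plain_letter(base) or is_letter_with_marks(base, table)


table = {cp.id: cp for cp in [
    Codepoint(0x03B1, "Ll"),                    # alpha
    Codepoint(0x0301, "Mn"),                    # combining acute accent
    Codepoint(0x03AC, "Ll", [0x03B1, 0x0301]),  # alpha with tonos
    Codepoint(0x1F71, "Ll", [0x03AC]),          # alpha with oxia, redirects to tonos
    Codepoint(0xF0000, "Lo", [0xF0001]),        # made-up redirect to a missing codepoint
]}

for cp_id in (0x1F71, 0x03AC, 0x03B1, 0xF0000):
    print(f"U+{cp_id:04X}: {is_letter_with_marks(table[cp_id], table)}")
```

In this toy table the oxia form is accepted only because the recursion reaches the tonos form, while the made-up redirect to a missing codepoint falls through to `False` instead of raising `KeyError`.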
@@ -148,8 +149,7 @@ def get_plain_letter(codepoint, table):
 
 def is_ligature(codepoint, table):
     """Return true for letters combined with letters."""
-    return all(is_letter(table[i], table) for i in codepoint.combining_ids)
-
+    return all(i in table and is_letter(table[i], table) for i in codepoint.combining_ids)
 
 def get_plain_letters(codepoint, table):
     """Return a list of plain letters from a ligature."""
@@ -200,6 +200,11 @@ def parse_cldr_latin_ascii_transliterator(latinAsciiFilePath):
             # the parser of unaccent only accepts non-whitespace characters
             # for "src" and "trg" (see unaccent.c)
             if not src.isspace() and not trg.isspace():
+                if src == "\u210c":
+                    # This mapping seems to be in error, and causes a collision
+                    # by disagreeing with the main Unicode database file:
+                    # https://unicode-org.atlassian.net/browse/CLDR-17656
+                    continue
                 charactersSet.add((ord(src), trg))
 
     return charactersSet
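
The collision mentioned in the comment arises because rules are collected as `(codepoint, target)` pairs from two sources, so one source character can end up with two different targets. The following sketch shows one way such disagreements could be spotted; `find_collisions` is a hypothetical helper, not part of the script, and the sample pairs are taken from the regression output change (`x` before, `H` after).

```python
def find_collisions(pairs):
    """Map each source codepoint to its targets, keeping only the disputed ones."""
    targets = {}
    for src, trg in pairs:
        targets.setdefault(src, set()).add(trg)
    return {src: trgs for src, trgs in targets.items() if len(trgs) > 1}


pairs = {
    (0x210C, "x"),  # the CLDR mapping that the commit now skips
    (0x210C, "H"),  # what the main Unicode database file yields
    (0x00E9, "e"),  # an unrelated, undisputed rule
}
print(find_collisions(pairs))
```

Skipping the CLDR entry for U+210C leaves a single target for that character, so the set no longer contains conflicting rules for it.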
@@ -251,7 +256,7 @@ def main(args):
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
         if codepoint.general_category.startswith('L') and \
-           len(codepoint.combining_ids) > 1:
+           len(codepoint.combining_ids) > 0:
             if is_letter_with_marks(codepoint, table):
                 charactersSet.add((codepoint.id,
                                    chr(get_plain_letter(codepoint, table).id)))
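
The effect of loosening the threshold from `> 1` to `> 0` can be approximated with `unicodedata`. This is a rough sketch under the assumption that the script's `combining_ids` corresponds to the decomposition mapping with any `<tag>` removed; `mapping_length`, `is_candidate`, and the `minimum` parameter are hypothetical names.

```python
import unicodedata


def mapping_length(ch):
    """Number of code points in ch's decomposition mapping, ignoring any <tag>."""
    fields = unicodedata.decomposition(ch).split()
    return len([f for f in fields if not f.startswith("<")])


def is_candidate(ch, minimum):
    """Mimic the loop's filter: a letter whose mapping has more than `minimum` entries."""
    return unicodedata.category(ch).startswith("L") and mapping_length(ch) > minimum


samples = {
    "alpha with tonos": "\u03ac",
    "alpha with oxia": "\u1f71",
    "mathematical bold A": "\U0001d400",
    "plain a": "a",
}
for name, ch in samples.items():
    print(f"{name}: old filter={is_candidate(ch, 1)}, new filter={is_candidate(ch, 0)}")
```

Under this approximation, single-hop characters such as the oxia forms and the mathematical letters pass only the new filter, while plain letters with no mapping are still skipped.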
