Commit ec0a69e

Extend the default rules file for contrib/unaccent with Vietnamese letters.

Improve generate_unaccent_rules.py to handle composed characters whose base
is another composed character rather than a plain letter. The net effect
of this is to add a bunch of multi-accented Vietnamese characters to
unaccent.rules.

Original complaint from Kha Nguyen, diagnosis of the script's shortcoming
by Thomas Munro.

Dang Minh Huong and Michael Paquier

Discussion: https://postgr.es/m/CALo3sF6EC8cy1F2JUz=GRf5h4LMUJTaG3qpdoiLrNbWEXL-tRg@mail.gmail.com

1 parent 2b74303, commit ec0a69e
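
The "composed character whose base is another composed character" case from the commit message can be seen with Python's standard unicodedata module. The following is a minimal sketch and not part of the commit; base_chain is a hypothetical helper name introduced here for illustration.

```python
import unicodedata


def base_chain(ch):
    """Follow the first element of each canonical decomposition until a
    character with no canonical decomposition is reached."""
    chain = [ch]
    while True:
        decomp = unicodedata.decomposition(chain[-1])
        # Stop when there is no decomposition, or when it is a compatibility
        # decomposition (those start with a tag such as "<compat>").
        if not decomp or decomp.startswith("<"):
            return chain
        first = decomp.split()[0]
        chain.append(chr(int(first, 16)))


# U+1EA5 (a with circumflex and acute) decomposes to U+00E2 plus U+0301,
# and U+00E2 in turn decomposes to "a" plus U+0302.
print(base_chain("\u1ea5"))  # ['ấ', 'â', 'a']
```

A lookup that only inspects the first decomposition step sees U+00E2 as the base, which is not a plain letter. That appears to be the situation the commit message describes.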

2 files changed, 145 additions and 8 deletions


contrib/unaccent/generate_unaccent_rules.py

Lines changed: 31 additions & 8 deletions
@@ -48,24 +48,47 @@ def is_mark(codepoint):
     return codepoint.general_category in ("Mn", "Me", "Mc")
 
 def is_letter_with_marks(codepoint, table):
-    """Returns true for plain letters combined with one or more marks."""
+    """Returns true for letters combined with one or more marks."""
     # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
-    return len(codepoint.combining_ids) > 1 and \
-            is_plain_letter(table[codepoint.combining_ids[0]]) and \
-            all(is_mark(table[i]) for i in codepoint.combining_ids[1:])
+
+    # Letter may have no combining characters, in which case it has
+    # no marks.
+    if len(codepoint.combining_ids) == 1:
+        return False
+
+    # A letter without diacritical marks has none of them.
+    if any(is_mark(table[i]) for i in codepoint.combining_ids[1:]) is False:
+        return False
+
+    # Check if the base letter of this letter has marks.
+    codepoint_base = codepoint.combining_ids[0]
+    if (is_plain_letter(table[codepoint_base]) is False and \
+        is_letter_with_marks(table[codepoint_base], table) is False):
+        return False
+
+    return True
 
 def is_letter(codepoint, table):
     """Return true for letter with or without diacritical marks."""
     return is_plain_letter(codepoint) or is_letter_with_marks(codepoint, table)
 
 def get_plain_letter(codepoint, table):
-    """Return the base codepoint without marks."""
+    """Return the base codepoint without marks. If this codepoint has more
+    than one combining character, do a recursive lookup on the table to
+    find out its plain base letter."""
     if is_letter_with_marks(codepoint, table):
-        return table[codepoint.combining_ids[0]]
+        if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+            return get_plain_letter(table[codepoint.combining_ids[0]], table)
+        elif is_plain_letter(table[codepoint.combining_ids[0]]):
+            return table[codepoint.combining_ids[0]]
+
+        # Should not come here
+        assert(False)
     elif is_plain_letter(codepoint):
         return codepoint
-    else:
-        raise "mu"
+
+    # Should not come here
+    assert(False)
 
 def is_ligature(codepoint, table):
     """Return true for letters combined with letters."""

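The recursive base-letter lookup introduced by this hunk can be tried in isolation. The following is a condensed sketch, not the committed code: the Codepoint record, the five-entry table, and the body of is_plain_letter (which lies outside this hunk) are assumptions made here for illustration, and the mark check uses all() with an explicit ValueError instead of the script's any() and assert(False).

```python
from collections import namedtuple

# Hypothetical stand-in for the records the script builds from UnicodeData.txt.
Codepoint = namedtuple("Codepoint", "id general_category combining_ids")


def is_plain_letter(cp):
    # Assumption: "plain" means an unaccented ASCII letter.
    return ord("a") <= cp.id <= ord("z") or ord("A") <= cp.id <= ord("Z")


def is_mark(cp):
    return cp.general_category in ("Mn", "Me", "Mc")


def is_letter_with_marks(cp, table):
    ids = cp.combining_ids
    if len(ids) <= 1:
        return False
    if not all(is_mark(table[i]) for i in ids[1:]):
        return False
    base = table[ids[0]]
    # The base may itself be a letter with marks, hence the recursion.
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(cp, table):
    if is_plain_letter(cp):
        return cp
    if is_letter_with_marks(cp, table):
        return get_plain_letter(table[cp.combining_ids[0]], table)
    raise ValueError("not a letter: U+%04X" % cp.id)


# Tiny hand-built table: U+1EA5 is based on U+00E2, which is based on "a".
table = {
    0x0061: Codepoint(0x0061, "Ll", []),
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),
    0x0301: Codepoint(0x0301, "Mn", []),
    0x0302: Codepoint(0x0302, "Mn", []),
}

print(chr(get_plain_letter(table[0x1EA5], table).id))  # a
```

With the pre-commit logic, U+1EA5 would be rejected because its immediate base U+00E2 is not a plain letter. Recursing on the base is what lets the doubly accented characters through.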
contrib/unaccent/unaccent.rules

Lines changed: 114 additions & 0 deletions
@@ -254,6 +254,18 @@
 ǒ	o
 Ǔ	U
 ǔ	u
+Ǖ	U
+ǖ	u
+Ǘ	U
+ǘ	u
+Ǚ	U
+ǚ	u
+Ǜ	U
+ǜ	u
+Ǟ	A
+ǟ	a
+Ǡ	A
+ǡ	a
 Ǥ	G
 ǥ	g
 Ǧ	G
@@ -262,6 +274,8 @@
 ǩ	k
 Ǫ	O
 ǫ	o
+Ǭ	O
+ǭ	o
 ǰ	j
 Ǳ	DZ
 ǲ	Dz
@@ -270,6 +284,8 @@
 ǵ	g
 Ǹ	N
 ǹ	n
+Ǻ	A
+ǻ	a
 Ȁ	A
 ȁ	a
 Ȃ	A
@@ -307,8 +323,14 @@
 ȧ	a
 Ȩ	E
 ȩ	e
+Ȫ	O
+ȫ	o
+Ȭ	O
+ȭ	o
 Ȯ	O
 ȯ	o
+Ȱ	O
+ȱ	o
 Ȳ	Y
 ȳ	y
 ȴ	l
@@ -441,6 +463,8 @@
 ḅ	b
 Ḇ	B
 ḇ	b
+Ḉ	C
+ḉ	c
 Ḋ	D
 ḋ	d
 Ḍ	D
@@ -451,10 +475,16 @@
 ḑ	d
 Ḓ	D
 ḓ	d
+Ḕ	E
+ḕ	e
+Ḗ	E
+ḗ	e
 Ḙ	E
 ḙ	e
 Ḛ	E
 ḛ	e
+Ḝ	E
+ḝ	e
 Ḟ	F
 ḟ	f
 Ḡ	G
@@ -471,6 +501,8 @@
 ḫ	h
 Ḭ	I
 ḭ	i
+Ḯ	I
+ḯ	i
 Ḱ	K
 ḱ	k
 Ḳ	K
@@ -479,6 +511,8 @@
 ḵ	k
 Ḷ	L
 ḷ	l
+Ḹ	L
+ḹ	l
 Ḻ	L
 ḻ	l
 Ḽ	L
@@ -497,6 +531,14 @@
 ṉ	n
 Ṋ	N
 ṋ	n
+Ṍ	O
+ṍ	o
+Ṏ	O
+ṏ	o
+Ṑ	O
+ṑ	o
+Ṓ	O
+ṓ	o
 Ṕ	P
 ṕ	p
 Ṗ	P
@@ -505,12 +547,20 @@
 ṙ	r
 Ṛ	R
 ṛ	r
+Ṝ	R
+ṝ	r
 Ṟ	R
 ṟ	r
 Ṡ	S
 ṡ	s
 Ṣ	S
 ṣ	s
+Ṥ	S
+ṥ	s
+Ṧ	S
+ṧ	s
+Ṩ	S
+ṩ	s
 Ṫ	T
 ṫ	t
 Ṭ	T
@@ -525,6 +575,10 @@
 ṵ	u
 Ṷ	U
 ṷ	u
+Ṹ	U
+ṹ	u
+Ṻ	U
+ṻ	u
 Ṽ	V
 ṽ	v
 Ṿ	V
@@ -563,12 +617,42 @@
 ạ	a
 Ả	A
 ả	a
+Ấ	A
+ấ	a
+Ầ	A
+ầ	a
+Ẩ	A
+ẩ	a
+Ẫ	A
+ẫ	a
+Ậ	A
+ậ	a
+Ắ	A
+ắ	a
+Ằ	A
+ằ	a
+Ẳ	A
+ẳ	a
+Ẵ	A
+ẵ	a
+Ặ	A
+ặ	a
 Ẹ	E
 ẹ	e
 Ẻ	E
 ẻ	e
 Ẽ	E
 ẽ	e
+Ế	E
+ế	e
+Ề	E
+ề	e
+Ể	E
+ể	e
+Ễ	E
+ễ	e
+Ệ	E
+ệ	e
 Ỉ	I
 ỉ	i
 Ị	I
@@ -577,10 +661,40 @@
 ọ	o
 Ỏ	O
 ỏ	o
+Ố	O
+ố	o
+Ồ	O
+ồ	o
+Ổ	O
+ổ	o
+Ỗ	O
+ỗ	o
+Ộ	O
+ộ	o
+Ớ	O
+ớ	o
+Ờ	O
+ờ	o
+Ở	O
+ở	o
+Ỡ	O
+ỡ	o
+Ợ	O
+ợ	o
 Ụ	U
 ụ	u
 Ủ	U
 ủ	u
+Ứ	U
+ứ	u
+Ừ	U
+ừ	u
+Ử	U
+ử	u
+Ữ	U
+ữ	u
+Ự	U
+ự	u
 Ỳ	Y
 ỳ	y
 Ỵ	Y
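
Each line of unaccent.rules pairs a source character with its replacement, separated by whitespace. As a rough illustration of what the new Vietnamese entries do, here is a sketch that loads a few of them and applies them one character at a time. It is not how contrib/unaccent is implemented; parse_rules and apply_rules are hypothetical names, and the sketch only handles two-field lines and precomposed input.

```python
def parse_rules(text):
    """Build a {source: replacement} dict from two-field rule lines."""
    rules = {}
    for line in text.splitlines():
        fields = line.split()
        # Only the plain "source replacement" shape is handled in this sketch.
        if len(fields) == 2:
            src, dst = fields
            rules[src] = dst
    return rules


def apply_rules(s, rules):
    """Replace every character that has a rule and keep the rest."""
    return "".join(rules.get(ch, ch) for ch in s)


# Four of the entries added by this commit, written with escapes so the
# precomposed code points are unambiguous:
# U+1EA4 -> A, U+1EA5 -> a, U+1EC7 -> e, U+1EEF -> u
sample = "\u1ea4\tA\n\u1ea5\ta\n\u1ec7\te\n\u1eef\tu\n"
rules = parse_rules(sample)

print(apply_rules("Vi\u1ec7t", rules))  # Viet
```

Before this commit the rules file had no entry for characters such as U+1EC7, so under this model they would pass through unchanged.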
