Commit cfa6cd2

Generate GB18030 mappings from the Unicode Consortium's UCM file
Previously we built the .map files for GB18030 (version 2000) from an XML file. The 2022 version for this encoding is only available as a Unicode Character Mapping (UCM) file, so as preparatory refactoring switch to this format as the source for building version 2000.

As we do with most input files for the conversion mappings, download the file on demand. In order to generate the same mappings we have now, we must download from a previous upstream commit, rather than the head since the latter contains a correction not present in our current .map files.

The XML file is still used by EUC_CN, so we cannot delete it from our repository. GB18030 is a superset of EUC_CN, so it may be possible to build EUC_CN from the same UCM file, but that is left for future work.

Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
1 parent e56a601, commit cfa6cd2

3 files changed: +29 -11 lines changed

‎src/backend/utils/mb/Unicode/Makefile‎

Lines changed: 4 additions & 1 deletion

@@ -54,7 +54,7 @@ $(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb-18030-2000.xml))
 $(eval $(call map_rule,euc_kr,UCS_to_EUC_KR.pl,KSX1001.TXT))
 $(eval $(call map_rule,euc_tw,UCS_to_EUC_TW.pl,CNS11643.TXT))
 $(eval $(call map_rule,sjis,UCS_to_SJIS.pl,CP932.TXT))
-$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.ucm))
 $(eval $(call map_rule,big5,UCS_to_BIG5.pl,CP950.TXT BIG5.TXT CP950.TXT))
 $(eval $(call map_rule,euc_jis_2004,UCS_to_EUC_JIS_2004.pl,euc-jis-2004-std.txt))
 $(eval $(call map_rule,shift_jis_2004,UCS_to_SHIFT_JIS_2004.pl,sjis-0213-2004-std.txt))
@@ -78,6 +78,9 @@ euc-jis-2004-std.txt sjis-0213-2004-std.txt:
 gb-18030-2000.xml windows-949-2000.xml:
 	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/master/charset/data/xml/$(@F)
 
+gb-18030-2000.ucm:
+	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/$(@F)
+
 GB2312.TXT:
 	$(DOWNLOAD) 'http://trac.greenstone.org/browser/trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT?rev=1842&format=txt'
 

‎src/backend/utils/mb/Unicode/UCS_to_GB18030.pl‎

Lines changed: 19 additions & 9 deletions

@@ -5,13 +5,14 @@
 # src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
 #
 # Generate UTF-8 <--> GB18030 code conversion tables from
-# "gb-18030-2000.xml", obtained from
-# http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/
+# "gb-18030-2000.ucm", obtained from
+# https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
 #
 # The lines we care about in the source file look like
-# <a u="009A" b="81 30 83 36"/>
-# where the "u" field is the Unicode code point in hex,
-# and the "b" field is the hex byte sequence for GB18030
+# <UXXXX> \xYY[\xYY...] |n
+# where XXXX is the Unicode code point in hex,
+# and the \xYY... is the hex byte sequence for GB18030,
+# and n is a flag indicating the type of mapping.
 
 use strict;
 use warnings FATAL => 'all';
@@ -22,17 +23,26 @@
 
 # Read the input
 
-my $in_file = "gb-18030-2000.xml";
+my $in_file = "gb-18030-2000.ucm";
 
 open(my $in, '<', $in_file) || die("cannot open $in_file");
 
 my @mapping;
 
 while (<$in>)
 {
-	next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/);
-	my ($u, $c) = ($1, $2);
-	$c =~ s/ //g;
+	# Mappings may have been removed by commenting out
+	next if /^#/;
+
+	next if !/^<U([0-9A-Fa-f]+)>\s+
+		((?:\\x[0-9A-Fa-f]{2})+)\s+
+		\|(\d+)/x;
+	my ($u, $c, $flag) = ($1, $2, $3);
+	$c =~ s/\\x//g;
+
+	# We only want round-trip mappings
+	next if ($flag ne '0');
+
 	my $ucs = hex($u);
 	my $code = hex($c);
 	if ($code >= 0x80 && $ucs >= 0x0080)
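
For readers who want to try the new line filter outside the build, here is a minimal sketch of the same parsing steps in Python rather than the commit's Perl. It reuses the regular expression from the diff; the names `UCM_LINE`, `parse_ucm_line`, and `sample` are hypothetical, and the sample lines are illustrative (the first data line reuses the U+009A example from the old header comment, and the `|1` line is made up to exercise the flag check). It stops where the hunk stops, so it does not reproduce the later `$code >= 0x80 && $ucs >= 0x0080` filtering.

```python
import re

# Same pattern as the Perl regex in the diff: <UXXXX> \xYY[\xYY...] |n
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d+)"
)


def parse_ucm_line(line):
    """Return (ucs, code) for a round-trip mapping line, else None."""
    # Mappings may have been removed by commenting out
    if line.startswith("#"):
        return None
    m = UCM_LINE.match(line)
    if m is None:
        return None
    u, c, flag = m.groups()
    # Keep only round-trip mappings (flag 0), as the Perl script does
    if flag != "0":
        return None
    # Strip the \x prefixes and read the bytes as one hex number
    return int(u, 16), int(c.replace("\\x", ""), 16)


# Illustrative input lines, not an excerpt of gb-18030-2000.ucm
sample = [
    "# <U009A> \\x81\\x30\\x83\\x36 |0",  # commented out: skipped
    "<U009A> \\x81\\x30\\x83\\x36 |0",    # round-trip: kept
    "<U00A4> \\xA1\\xE8 |0",              # round-trip: kept
    "<U1234> \\x80 |1",                   # hypothetical non-round-trip: skipped
    "CHARMAP",                            # not a mapping line: skipped
]
mappings = [p for p in map(parse_ucm_line, sample) if p is not None]
```

Reading the concatenated bytes as a single hex number mirrors the `hex($c)` call in the script, so a four-byte sequence becomes one 32-bit value.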

‎src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c‎

Lines changed: 6 additions & 1 deletion

@@ -124,7 +124,12 @@ utf8word_to_unicode(uint32 c)
 /*
  * Perform mapping of GB18030 ranges to UTF8
  *
- * The ranges we need to convert are specified in gb-18030-2000.xml.
+ * General description, and the range we need to convert for U+10000 and up:
+ * https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html
+ *
+ * Ranges up to U+FFFF:
+ * https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt
+ *
  * All are ranges of 4-byte GB18030 codes.
  */
 static uint32
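
The comment above refers to ranges of 4-byte GB18030 codes that are converted arithmetically instead of through the generated tables. As a rough illustration of how such a range works, the Python sketch below linearizes a 4-byte code and maps the supplementary-plane range, which begins at 0x90308130 for U+10000. This is not the C function from utf8_and_gb18030.c (the hunk cuts off before its body); the names `gb_linear` and `gb18030_supplementary_to_ucs` are hypothetical.

```python
def gb_linear(code):
    """Linear index of a 4-byte GB18030 code given as a 32-bit integer.

    Bytes 1 and 3 run over 0x81..0xFE (126 values) and bytes 2 and 4
    over 0x30..0x39 (10 values), so a code is a mixed-radix number.
    """
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError(f"not a 4-byte GB18030 code: {code:#010x}")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# First 4-byte code of the supplementary-plane range (maps to U+10000)
SUPPLEMENTARY_START = 0x90308130


def gb18030_supplementary_to_ucs(code):
    """Map a 4-byte code in the U+10000..U+10FFFF range to its code point."""
    ucs = 0x10000 + gb_linear(code) - gb_linear(SUPPLEMENTARY_START)
    if not 0x10000 <= ucs <= 0x10FFFF:
        raise ValueError(f"code outside the supplementary range: {code:#010x}")
    return ucs
```

Because the range is linear, only its endpoints have to be stored, which is why these codes do not need individual entries in the .map files.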
