forked from postgres/postgres
Commit 8d3e090
Extend GB18030 encoding conversion to cover full Unicode range.
Our previous code for GB18030 <-> UTF8 conversion only covered Unicode codepoints up to U+FFFF, but the actual spec defines conversions for all codepoints up to U+10FFFF. That would be rather impractical as a lookup table, but fortunately there is a simple algorithmic conversion between the additional code points and the equivalent GB18030 byte patterns. Make use of the just-added callback facility in LocalToUtf/UtfToLocal to perform the additional conversions.

Having created the infrastructure to do that, we can use the same code to map certain linearly-related subranges of the Unicode space below U+FFFF, allowing removal of the corresponding lookup table entries. This more than halves the lookup table size, which is a substantial savings; utf8_and_gb18030.so drops from nearly a megabyte to about half that.

In support of doing that, replace ISO10646-GB18030.TXT with the data file gb-18030-2000.xml (retrieved from http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ ) in which these subranges have been deleted from the simple lookup entries.

Per bug #12845 from Arjen Nienhuis. The conversion code added here is based on his proposed patch, though I whacked it around rather heavily.

1 parent 92edba2, commit 8d3e090
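For readers unfamiliar with the algorithmic mapping mentioned in the commit message, the following is a minimal sketch of the idea, not the code from this commit. It assumes the standard GB18030 four-byte layout (bytes in 0x81..0xFE, 0x30..0x39, 0x81..0xFE, 0x30..0x39) and the fact that U+10000 corresponds to the byte sequence 90 30 81 30. The function and macro names are hypothetical, and the sketch works on plain code point values, whereas the PostgreSQL callbacks deal with UTF8 byte sequences.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; names are hypothetical, not from PostgreSQL. */
#define GB_SUPP_FIRST_LINEAR 189000u    /* linear index of 0x90308130 */
#define UCS_SUPP_FIRST       0x10000u
#define UCS_SUPP_LAST        0x10FFFFu

/*
 * Turn a four-byte GB18030 code (packed big-endian into a uint32) into a
 * linear index.  The caller must already have validated the byte ranges.
 */
static uint32_t
gb_linear(uint32_t gb)
{
    uint32_t b1 = (gb >> 24) & 0xFF;
    uint32_t b2 = (gb >> 16) & 0xFF;
    uint32_t b3 = (gb >> 8) & 0xFF;
    uint32_t b4 = gb & 0xFF;

    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10
        + (b4 - 0x30);
}

/* Inverse of gb_linear: rebuild the packed four-byte code. */
static uint32_t
gb_unlinear(uint32_t lin)
{
    uint32_t b1, b2, b3, b4;

    assert(lin < 126u * 10u * 126u * 10u);
    b4 = 0x30 + lin % 10;
    lin /= 10;
    b3 = 0x81 + lin % 126;
    lin /= 126;
    b2 = 0x30 + lin % 10;
    lin /= 10;
    b1 = 0x81 + lin;
    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/* Code point above U+FFFF -> GB18030 four-byte code, or 0 if out of range. */
static uint32_t
ucs_to_gb18030_supp(uint32_t ucs)
{
    if (ucs < UCS_SUPP_FIRST || ucs > UCS_SUPP_LAST)
        return 0;
    return gb_unlinear(GB_SUPP_FIRST_LINEAR + (ucs - UCS_SUPP_FIRST));
}

/* GB18030 four-byte code -> code point above U+FFFF, or 0 if not mapped. */
static uint32_t
gb18030_to_ucs_supp(uint32_t gb)
{
    uint32_t b1 = (gb >> 24) & 0xFF;
    uint32_t b2 = (gb >> 16) & 0xFF;
    uint32_t b3 = (gb >> 8) & 0xFF;
    uint32_t b4 = gb & 0xFF;
    uint32_t lin;

    /* Reject anything that is not a well-formed four-byte sequence. */
    if (b1 < 0x81 || b1 > 0xFE || b2 < 0x30 || b2 > 0x39 ||
        b3 < 0x81 || b3 > 0xFE || b4 < 0x30 || b4 > 0x39)
        return 0;

    lin = gb_linear(gb);
    if (lin < GB_SUPP_FIRST_LINEAR ||
        lin - GB_SUPP_FIRST_LINEAR > UCS_SUPP_LAST - UCS_SUPP_FIRST)
        return 0;
    return UCS_SUPP_FIRST + (lin - GB_SUPP_FIRST_LINEAR);
}
```

Under these assumptions, ucs_to_gb18030_supp(0x10000) yields 0x90308130 and ucs_to_gb18030_supp(0x10FFFF) yields 0xE3329A35. The same linear-index helpers could in principle serve a linearly-related subrange below U+FFFF by substituting that subrange's starting code point and starting linear index.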
File tree
7 files changed, +31111 -128805 lines changed
- src/backend/utils/mb
  - Unicode
  - conversion_procs/utf8_and_gb18030