Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32k
gh-113274: fix EUC-JP decoding of FULLWIDTH TILDE#113275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
Modules/cjkcodecs/mappings_jp.h Outdated
@@ -591,7 +591,7 @@ __jisx0208_decmap+6950,33,38},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0}, | |||
}; | |||
static const ucs2_t __jisx0212_decmap[6179] = { | |||
728,711,184,729,733,175,731,730,126,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U, | |||
728,711,184,729,733,175,731,730,65374,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
// AUTO-GENERATED FILE FROM genmap_japanese.py: DO NOT EDIT |
Please take a look at the comment on the first line.
I 've not taken a look at the issue deeply yet, but if you want to change the mapping file, you should modify the generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I totally missed that. I will do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
- Python generates the data using this as its source of truth:https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT.
encoding_rs
: also generated, but fromhttps://encoding.spec.whatwg.org/index-jis0212.txticonv
: seems to be maintained indenpendently, and we can actually seen when they replaced TILDE with FULLWIDTH TILDE in their changelog:https://git.savannah.gnu.org/gitweb/?p=libiconv.git;a=blob;f=ChangeLog;h=0e2f5f00bc6aff84932cd92bd09a7c14f802c44d;hb=refs/heads/master#l4729
I couldn't figure out how Vim and Firefox proceed (although Vim does fallback toiconv
for non-internally-supported encodings). Since I do not expect to get Unicode to update the document of an obsolete standard that they only used to adapt, I see two options:
- add a hack in
genmap_japanese.py
around line 90 after the call toloadmap(jisx0212file)
to reassignjisx0212decmap[34][55] = ord('~')
(it does fix__jisx0212_decmap
as expected and changes nothing else), and comment that properly - switch to WhatWG as the source of truth
1 can be done pretty quickly and with minimal risk of breaking anything. 2 has the advantage that it might help discover other issues in mappings, but is obviously much more involved.
If 1 seems reasonable to you, I will update this PR accordingly. If 2, I will close it and document this in the ticket for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I have updated this MR with the hack ingenmap_japanese.py
and regeneratedmappings_jp.h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I will take a look at which will be the better, If we decide to solve this issue, option 2 will be the better way.
But I need to check the current status related to JIS0212 and side effect.
Also need to consider stability of whatwg spec.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
I have made the requested changes; please review again. |
@qsantos I will take a look the PR by next week. Enjoy your Christmas. |
corona10 commentedDec 24, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Okay, I like this PR, it will guarantee the round trip
But as I said, I will take a look at this PR by next week. |
Thanks, I was not sure if I had properly triggered the bot. Enjoy your Christmas as well! |
corona10 left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Okay, here is my temporal conclusion for this change.
(If we decide to adopt the change!)
- Let's use uncode.org data as same as now, whatwg is great but not stable. We are not a browser, I am not sure that we can follow every change without side effect.
- Let's vendor the modified JIS0212.TXT intohttps://github.com/python/cpython/tree/main/Tools/unicode/python-mappings and store the diff file tohttps://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff for tracking changes.
But I need to listen to the opinion of@ezio-melotti, who is a more experienced unicode expert than me.
@ezio-melotti What is your take on this? |
Should we consider this dead? |
Uh oh!
There was an error while loading.Please reload this page.
This PRcloses#113274. This is done by changing the UCS-2 codepoint 126 (~, TILDE) to 65374 (~, FULLWIDTH TILDE) in the
__jisx0212_decmap
reference table. Since no character is added or removed, no other changes are needed.Bug report
Bug description:
Python decodes the bytes 0x8FA2A7 as ~ (TILDE) in EUC-JP.
Thisreference document is ambiguous in that it shows a simple ~ (TILDE), but most other software (iconv, Vim, Firefox, Rust's encoding_rs) interpret this as ~ (FULLWIDTH TILDE). Note that EUC-JP already includes US-ASCII, and so:
CPython versions tested on:
3.11, CPython main branch
Operating systems tested on:
Linux