Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-113274: fix EUC-JP decoding of FULLWIDTH TILDE#113275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
qsantos wants to merge2 commits intopython:main
base:main
Choose a base branch
Loading
fromqsantos:fix-issue-113274

Conversation

qsantos
Copy link

@qsantosqsantos commentedDec 19, 2023
edited by bedevere-appbot
Loading

This PRcloses#113274. This is done by changing the UCS-2 codepoint 126 (~, TILDE) to 65374 (~, FULLWIDTH TILDE) in the__jisx0212_decmap reference table. Since no character is added or removed, no other changes are needed.

Bug report

Bug description:

Python decodes the bytes 0x8FA2A7 as ~ (TILDE) in EUC-JP.

assertb'\x8f\xa2\xb7'.decode('euc_jp')=='~'

Thisreference document is ambiguous in that it shows a simple ~ (TILDE), but most other software (iconv, Vim, Firefox, Rust's encoding_rs) interpret this as ~ (FULLWIDTH TILDE). Note that EUC-JP already includes US-ASCII, and so:

assert'~'.encode('euc-jp')==b'~'

CPython versions tested on:

3.11, CPython main branch

Operating systems tested on:

Linux

@@ -591,7 +591,7 @@ __jisx0208_decmap+6950,33,38},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},{0,0,0},
};

static const ucs2_t __jisx0212_decmap[6179] = {
728,711,184,729,733,175,731,730,126,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
728,711,184,729,733,175,731,730,65374,900,901,U,U,U,U,U,U,U,U,161,166,191,U,U,U,
Copy link
Member

@corona10corona10Dec 19, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

// AUTO-GENERATED FILE FROM genmap_japanese.py: DO NOT EDIT

Please take a look at the comment on the first line.
I 've not taken a look at the issue deeply yet, but if you want to change the mapping file, you should modify the generator.

serhiy-storchaka and ezio-melotti reacted with thumbs up emoji
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I totally missed that. I will do that.

Copy link
Author

@qsantosqsantosDec 19, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I couldn't figure out how Vim and Firefox proceed (although Vim does fallback toiconv for non-internally-supported encodings). Since I do not expect to get Unicode to update the document of an obsolete standard that they only used to adapt, I see two options:

  1. add a hack ingenmap_japanese.py around line 90 after the call toloadmap(jisx0212file) to reassignjisx0212decmap[34][55] = ord('~') (it does fix__jisx0212_decmap as expected and changes nothing else), and comment that properly
  2. switch to WhatWG as the source of truth

1 can be done pretty quickly and with minimal risk of breaking anything. 2 has the advantage that it might help discover other issues in mappings, but is obviously much more involved.

If 1 seems reasonable to you, I will update this PR accordingly. If 2, I will close it and document this in the ticket for later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I have updated this MR with the hack ingenmap_japanese.py and regeneratedmappings_jp.h.

Copy link
Member

@corona10corona10Dec 24, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I will take a look at which will be the better, If we decide to solve this issue, option 2 will be the better way.
But I need to check the current status related to JIS0212 and side effect.
Also need to consider stability of whatwg spec.

@bedevere-app
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phraseI have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@qsantos
Copy link
Author

I have made the requested changes; please review again.

@corona10
Copy link
Member

@qsantos I will take a look the PR by next week. Enjoy your Christmas.

@corona10corona10 self-assigned thisDec 24, 2023
@corona10
Copy link
Member

corona10 commentedDec 24, 2023
edited
Loading

Okay, I like this PR, it will guarantee the round trip

a = b'\x8f\xa2\xb7'assert a.decode('euc_jp').encode('euc_jp') == a

But as I said, I will take a look at this PR by next week.

@qsantos
Copy link
Author

@qsantos I will take a look the PR by next week. Enjoy your Christmas.

Thanks, I was not sure if I had properly triggered the bot. Enjoy your Christmas as well!

Copy link
Member

@corona10corona10 left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Okay, here is my temporal conclusion for this change.
(If we decide to adopt the change!)

  1. Let's use uncode.org data as same as now, whatwg is great but not stable. We are not a browser, I am not sure that we can follow every change without side effect.
  2. Let's vendor the modified JIS0212.TXT intohttps://github.com/python/cpython/tree/main/Tools/unicode/python-mappings and store the diff file tohttps://github.com/python/cpython/tree/main/Tools/unicode/python-mappings/diff for tracking changes.

But I need to listen to the opinion of@ezio-melotti, who is a more experienced unicode expert than me.

qsantos reacted with thumbs up emoji
@qsantos
Copy link
Author

@ezio-melotti What is your take on this?

@qsantos
Copy link
Author

Should we consider this dead?

@pythonpython deleted a commentApr 7, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@corona10corona10corona10 requested changes

Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

Python decodes EUC-JP 8FA2A7 as TILDE instead of FULLWIDTH TILDE
3 participants
@qsantos@corona10@ezio-melotti

[8]ページ先頭

©2009-2025 Movatter.jp