Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

bpo-43950: handle wide unicode characters in tracebacks#28150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
pablogsal merged 1 commit intopython:mainfromcolnotab:bpo-43950-emojis
Oct 26, 2023

Conversation

isidentical
Copy link
Member

@isidenticalisidentical commentedSep 4, 2021
edited
Loading

Do not merge yet, only for discussion.

This PR adds support for the existing traceback machinery to work with wide unicode characters when dumping to the terminal. It usesunicodedata.east_asian_width to classify individual unicode characters.
image

https://bugs.python.org/issue43950

@isidentical
Copy link
MemberAuthor

CC:@pablogsal@ammaraskar

@pablogsal
Copy link
Member

Someone should try this on Windows with "Courier New" as Terry mentioned in the issue. It will be good to check with a bunch of fonts to see what we are up against.

isidentical reacted with eyes emoji

@isidentical
Copy link
MemberAuthor

isidentical commentedSep 4, 2021
edited
Loading

some notesOne thing I noticed while doing the migration was that, C tokenizer doesn't match with the Python one regarding how to treat unicode symbols with multiple code points;
tokenizer:  python  3xREGULAR    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'")  3xTINY H    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 5), line="'ʰʰʰ'")  3xEAST ASIAN    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 5), line="'该该该'")  3xEMOJIs    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 5), line="'🐍🐍🐍'")tokenizer:  c  3xREGULAR    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'\n")  3xTINY H    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 8), line="'ʰʰʰ'\n")  3xEAST ASIAN    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 11), line="'该该该'\n")  3xEMOJIs    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 14), line="'🐍🐍🐍'\n")

Which then directly mirrored back to AST offets, which makes stuff a bit trickier at the traceback level. We could just patch this on the traceback level though I am not really sure that 'hack' is the proper way to go. We might need to end up a similiar solution to our_PyPegen_byte_offset_to_character_offset in the AST logic too, but I am not too certain about how.

Seems to work now with my last commit
image

I simply applied the same functionality at the end of the specialization function, and casted the newly retrieved AST offsets back to the regular ones. There are a lot of code that needs to be cleaned away in the final version, though for the prototype at least it works (for now).

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actionsgithub-actionsbot added the staleStale PR or inactive for long period of time. labelOct 12, 2021
@github-actionsgithub-actionsbot removed the staleStale PR or inactive for long period of time. labelAug 12, 2022
@isidentical
Copy link
MemberAuthor

Wow, this is an old PR but@pablogsal@ammaraskar are we interested in reviving it? I think I can rebase and get it ready for 3.12.

@pablogsal
Copy link
Member

pablogsal commentedNov 2, 2022
edited
Loading

I am. I was precisely thinking on reviving it last week so you managed to read my mind! 😁

@isidentical
Copy link
MemberAuthor

Hahahaha, perfection! I'll try to get it up to date over this week, and will ping you again for the review 💯

@pablogsal
Copy link
Member

Hahahaha, perfection

giphy

@ammaraskar
Copy link
Member

Sounds good, happy to re-review when you update :)

Sorry for the slow follow-ups around PEP657 stuff, I've been a little busy and inactive recently :(

@isidentical
Copy link
MemberAuthor

isidentical commentedNov 13, 2022
edited
Loading

I've pushed the initial revision where we handle everything using theunicodedata.east_asian_width (double width onW andF labels, single on rest) as per thediscussion here. Before merging this we also have to decide:

  • Do we want to give users a way to opt-in/opt-out from this (considering the nature, I'd say opt-out, since I don't think anybody is going to look why behaves like this and turn this feature on)?

  • Whether to do some sort of terminal support detection (non-deterministic tracebacks depending on nonPYTHON* env variable configuration doesn't sound really nice but no strong opinions if we want to make this opt-In by default and add some sort of scanning)

  • Is this applicable when showing the traceback in web? A lot of web frameworks usetraceback.* APIs to format tracebacks and then embed them inside an HTML page on error in debug mode. Should we offer a flag informat_* APIs to allow them switch without changing it in the global Python exectuon level.

  • Is this a feature or a fix? I am leaning towards a feature since this essentially changes how tracebacks are printed for certain cases and people might depend on the behavior (there are a couple of codebases where tracebacks are also part of the tests themselves in a stringized form)

I guess also looking at what other compilers are doing by default might also help us gain some insight (I recall@pablogsal mentioning rustc; maybe it might worth a shot to check out what they do to decide when to show carets).

@isidenticalisidentical marked this pull request as ready for reviewNovember 13, 2022 01:24
@isidentical
Copy link
MemberAuthor

CC:@cfbolz (would also love to hear your feedback on the unicode related parts)

@cfbolz
Copy link
Contributor

I'll take a look at the code!

"Amusingly" the width doesn't line up in my browser's font:

image

(Looks fantastic in my editor and my terminal though)

Personal opinions on some of your questions:

  • imo it's a bug. for people with wide characters the position info was just confusing so far, fixing that is a bug fix.
  • maybe we want a way to opt out (I have no clear opinion on that) but the new behaviour should be the default

@cfbolz
Copy link
Contributor

cfbolz commentedNov 14, 2022
edited
Loading

The code looks reasonable to me.

I've been thinking about it a bit more, and it would be certainly more annoying to implement, but I am wondering whether it wouldn't be an option to use unicode chars 0x3000 (IDEOGRAPHIC SPACE) and 0xFF3E (FULLWIDTH CIRCUMFLEX ACCENT) to do the spaces/underlines under wide chars. Because even if in a font the width of two ascii spaces is not the same as a fullwidth char, the font should at least be consistent with itself and have the fullwidth space be the same width. Example:

Here are the chars:

说明说明📗a      a fullwidth space^^^^^a fullwidth circumflex          a ascii whitespace^^^^^^^^^^a ascii circumflex

Screenshot in my Firefox:

image

(in my terminal all the 'a' line up, so it doesn't matter there).

@cfbolz
Copy link
Contributor

(Unfortunately it already breaks down in Chrome on my laptop, where the book emoji is even wider)

@pablogsal
Copy link
Member

@isidentical let's push this forward. Could you rebase the PR?

@pablogsalpablogsalenabled auto-merge (squash)October 26, 2023 06:41
@cfbolz
Copy link
Contributor

yay, i'm excited for this to land :-)

@pablogsalpablogsal merged commit78e6d72 intopython:mainOct 26, 2023
pablogsal added a commit to pablogsal/cpython that referenced this pull requestOct 26, 2023
…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
@bedevere-app
Copy link

GH-111345 is a backport of this pull request to the3.11 branch.

pablogsal added a commit to pablogsal/cpython that referenced this pull requestOct 26, 2023
…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
@bedevere-app
Copy link

GH-111346 is a backport of this pull request to the3.12 branch.

pablogsal pushed a commit to pablogsal/cpython that referenced this pull requestOct 26, 2023
…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
pablogsal pushed a commit to pablogsal/cpython that referenced this pull requestOct 26, 2023
…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
pablogsal pushed a commit to pablogsal/cpython that referenced this pull requestOct 26, 2023
…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
@bedevere-app
Copy link

GH-111373 is a backport of this pull request to the3.11 branch.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@ammaraskarammaraskarammaraskar left review comments

@pablogsalpablogsalpablogsal approved these changes

@lysnikolaoulysnikolaouAwaiting requested review from lysnikolaou

@iritkatrieliritkatrielAwaiting requested review from iritkatriel

Assignees
No one assigned
Labels
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

7 participants
@isidentical@pablogsal@ammaraskar@cfbolz@the-knights-who-say-ni@ezio-melotti@bedevere-bot

[8]ページ先頭

©2009-2025 Movatter.jp