This PR adds support for the existing traceback machinery to work with wide unicode characters when dumping to the terminal. It usesunicodedata.east_asian_width to classify individual unicode characters.

https://bugs.python.org/issue43950

isidentical added the DO-NOT-MERGE label

Sep 4, 2021

the-knights-who-say-ni added the CLA signed label

Sep 4, 2021

bedevere-bot added the awaiting core review label

Sep 4, 2021

Copy link

MemberAuthor

isidentical commentedSep 4, 2021

CC:@pablogsal @ammaraskar

isidentical added the skip news label

Sep 4, 2021

Copy link

Member

pablogsal commentedSep 4, 2021

Someone should try this on Windows with "Courier New" as Terry mentioned in the issue. It will be good to check with a bunch of fonts to see what we are up against.

ammaraskar reviewed

Sep 4, 2021

View reviewed changes

Parser/pegen.c OutdatedShow resolvedHide resolved

Copy link

MemberAuthor

isidentical commentedSep 4, 2021•
edited
Loading

some notes

One thing I noticed while doing the migration was that, C tokenizer doesn't match with the Python one regarding how to treat unicode symbols with multiple code points;

tokenizer:  python  3xREGULAR    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'")  3xTINY H    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 5), line="'ʰʰʰ'")  3xEAST ASIAN    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 5), line="'该该该'")  3xEMOJIs    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 5), line="'🐍🐍🐍'")tokenizer:  c  3xREGULAR    TokenInfo(type=3 (STRING), string="'XXX'", start=(1, 0), end=(1, 5), line="'XXX'\n")  3xTINY H    TokenInfo(type=3 (STRING), string="'ʰʰʰ'", start=(1, 0), end=(1, 8), line="'ʰʰʰ'\n")  3xEAST ASIAN    TokenInfo(type=3 (STRING), string="'该该该'", start=(1, 0), end=(1, 11), line="'该该该'\n")  3xEMOJIs    TokenInfo(type=3 (STRING), string="'🐍🐍🐍'", start=(1, 0), end=(1, 14), line="'🐍🐍🐍'\n")

Which then directly mirrored back to AST offets, which makes stuff a bit trickier at the traceback level. We could just patch this on the traceback level though I am not really sure that 'hack' is the proper way to go. We might need to end up a similiar solution to our_PyPegen_byte_offset_to_character_offset in the AST logic too, but I am not too certain about how.

Seems to work now with my last commit

I simply applied the same functionality at the end of the specialization function, and casted the newly retrieved AST offsets back to the regular ones. There are a lot of code that needs to be cleaned away in the final version, though for the prototype at least it works (for now).

isidentical force-pushed thebpo-43950-emojis branch fromf4b35ac to6ed3a9dCompare

September 8, 2021 21:47

Copy link

github-actionsbot commentedOct 12, 2021

This PR is stale because it has been open for 30 days with no activity.

github-actionsbot added the staleStale PR or inactive for long period of time. label

Oct 12, 2021

pablogsal mentioned this pull request

Apr 13, 2022

Include column offsets for bytecode instructions#88116

Closed

ezio-melotti removed the CLA signed label

Jul 13, 2022

github-actionsbot removed the staleStale PR or inactive for long period of time. label

Aug 12, 2022

Copy link

MemberAuthor

isidentical commentedNov 2, 2022

Wow, this is an old PR but@pablogsal @ammaraskar are we interested in reviving it? I think I can rebase and get it ready for 3.12.

Copy link

Member

pablogsal commentedNov 2, 2022•
edited
Loading

I am. I was precisely thinking on reviving it last week so you managed to read my mind! 😁

Copy link

MemberAuthor

isidentical commentedNov 2, 2022

Hahahaha, perfection! I'll try to get it up to date over this week, and will ping you again for the review 💯

Copy link

Member

pablogsal commentedNov 2, 2022

Hahahaha, perfection

Copy link

Member

ammaraskar commentedNov 2, 2022

Sounds good, happy to re-review when you update :)

Sorry for the slow follow-ups around PEP657 stuff, I've been a little busy and inactive recently :(

isidentical force-pushed thebpo-43950-emojis branch from6ed3a9d to8102a9dCompare

November 13, 2022 01:14

Copy link

MemberAuthor

isidentical commentedNov 13, 2022•
edited
Loading

I've pushed the initial revision where we handle everything using theunicodedata.east_asian_width (double width onW andF labels, single on rest) as per thediscussion here. Before merging this we also have to decide:

Do we want to give users a way to opt-in/opt-out from this (considering the nature, I'd say opt-out, since I don't think anybody is going to look why behaves like this and turn this feature on)?
Whether to do some sort of terminal support detection (non-deterministic tracebacks depending on nonPYTHON* env variable configuration doesn't sound really nice but no strong opinions if we want to make this opt-In by default and add some sort of scanning)
Is this applicable when showing the traceback in web? A lot of web frameworks usetraceback.* APIs to format tracebacks and then embed them inside an HTML page on error in debug mode. Should we offer a flag informat_* APIs to allow them switch without changing it in the global Python exectuon level.
Is this a feature or a fix? I am leaning towards a feature since this essentially changes how tracebacks are printed for certain cases and people might depend on the behavior (there are a couple of codebases where tracebacks are also part of the tests themselves in a stringized form)

I guess also looking at what other compilers are doing by default might also help us gain some insight (I recall@pablogsal mentioning rustc; maybe it might worth a shot to check out what they do to decide when to show carets).

isidentical marked this pull request as ready for review

November 13, 2022 01:24

isidentical requested review fromlysnikolaou andiritkatriel ascode owners

November 13, 2022 01:24

Copy link

MemberAuthor

isidentical commentedNov 13, 2022

CC:@cfbolz (would also love to hear your feedback on the unicode related parts)

Copy link

Contributor

cfbolz commentedNov 14, 2022

I'll take a look at the code!

"Amusingly" the width doesn't line up in my browser's font:

(Looks fantastic in my editor and my terminal though)

Personal opinions on some of your questions:

imo it's a bug. for people with wide characters the position info was just confusing so far, fixing that is a bug fix.
maybe we want a way to opt out (I have no clear opinion on that) but the new behaviour should be the default

Copy link

Contributor

cfbolz commentedNov 14, 2022•
edited
Loading

The code looks reasonable to me.

I've been thinking about it a bit more, and it would be certainly more annoying to implement, but I am wondering whether it wouldn't be an option to use unicode chars 0x3000 (IDEOGRAPHIC SPACE) and 0xFF3E (FULLWIDTH CIRCUMFLEX ACCENT) to do the spaces/underlines under wide chars. Because even if in a font the width of two ascii spaces is not the same as a fullwidth char, the font should at least be consistent with itself and have the fullwidth space be the same width. Example:

Here are the chars:

说明说明📗a 　　　　　a fullwidth space＾＾＾＾＾a fullwidth circumflex          a ascii whitespace^^^^^^^^^^a ascii circumflex

Screenshot in my Firefox:

(in my terminal all the 'a' line up, so it doesn't matter there).

Copy link

Contributor

cfbolz commentedNov 14, 2022

(Unfortunately it already breaks down in Chrome on my laptop, where the book emoji is even wider)

Copy link

Member

pablogsal commentedMay 25, 2023

@isidentical let's push this forward. Could you rebase the PR?

pythongh-88116: Handle wide unicode characters in tracebacks

5ec6af6

pablogsal force-pushed thebpo-43950-emojis branch from8102a9d to5ec6af6Compare

October 26, 2023 06:35

pablogsal approved these changes

Oct 26, 2023

View reviewed changes

pablogsal removed the DO-NOT-MERGE label

Oct 26, 2023

pablogsalenabled auto-merge (squash)

October 26, 2023 06:41

Copy link

Contributor

cfbolz commentedOct 26, 2023

yay, i'm excited for this to land :-)

pablogsal merged commit78e6d72 intopython:main

Oct 26, 2023

bedevere-appbot removed the awaiting core review label

Oct 26, 2023

pablogsal added a commit to pablogsal/cpython that referenced this pull request

Oct 26, 2023

[3.11] bpo-43950: handle wide unicode characters in tracebacks (pytho…

d24d3f9

…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

Copy link

bedevere-appbot commentedOct 26, 2023

GH-111345 is a backport of this pull request to the3.11 branch.

pablogsal added a commit to pablogsal/cpython that referenced this pull request

Oct 26, 2023

[3.11] bpo-43950: handle wide unicode characters in tracebacks (pytho…

d7cebb3

…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

Copy link

bedevere-appbot commentedOct 26, 2023

GH-111346 is a backport of this pull request to the3.12 branch.

pablogsal pushed a commit to pablogsal/cpython that referenced this pull request

Oct 26, 2023

[3.12] bpo-43950: handle wide unicode characters in tracebacks (pytho…

d5ee672

…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

pablogsal pushed a commit to pablogsal/cpython that referenced this pull request

Oct 26, 2023

[3.12] bpo-43950: handle wide unicode characters in tracebacks (pytho…

a5018f1

…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>

pablogsal pushed a commit to pablogsal/cpython that referenced this pull request

Oct 26, 2023

[3.12] bpo-43950: handle wide unicode characters in tracebacks (pytho…

fcc4944

…nGH-28150)(cherry picked from commit78e6d72)Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit that referenced this pull request

Oct 26, 2023

[3.12]bpo-43950: handle wide unicode characters in tracebacks (GH-28150

c81ebf5