gh-63161: Fix tokenize detect_encoding() for non-ASCII coding #139235

Closed

vstinner wants to merge 5 commits into python:main from vstinner:nonascii_coding

Conversation

Member

vstinner commented Sep 22, 2025 • edited by bedevere-appbot

Issue: Non-UTF8 encoding line #63161

pythongh-63161: Fix tokenize detect_encoding() for non-ASCII coding

fb7b944

vstinner requested review from lysnikolaou and pablogsal as code owners

September 22, 2025 13:06

vstinner added the needs backport to 3.13 (bugs and security fixes) and needs backport to 3.14 (bugs and security fixes) labels

Sep 22, 2025

bedevere-appbot added the awaiting core review label

Sep 22, 2025

bedevere-appbot mentioned this pull request

Sep 22, 2025

Non-UTF8 encoding line #63161

Closed

Add NEWS entry

c535b65

serhiy-storchaka reviewed

Sep 22, 2025

Member

serhiy-storchaka left a comment

Please add a test for the coding cookie on the second line (with a non-ASCII first line).

Also add a test with a specified ASCII encoding but non-ASCII content that can still be decoded as UTF-8, e.g. '#coding=ascii €'.encode('utf-8'), and the corresponding two-line case.
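For readers who want to try the inputs described in this review, here is a minimal sketch built on the public `tokenize.detect_encoding()` function. The helper name `probe` and the sample byte strings are illustrative assumptions, not code from this PR. The first three cases are long-standing documented behavior. The review cases are only printed, because their outcome may differ depending on whether the running interpreter includes a fix for gh-63161.

```python
import io
import tokenize


def probe(source: bytes):
    """Return the detected encoding name, or the SyntaxError that was raised."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    except SyntaxError as exc:
        return exc
    return encoding


# Stable cases: the cookie may sit on line 1 or line 2; no cookie means UTF-8.
cookie_first = probe(b"# -*- coding: latin-1 -*-\nx = 1\n")                 # 'iso-8859-1'
cookie_second = probe(b"#!/usr/bin/env python\n# coding: latin-1\nx = 1\n")  # 'iso-8859-1'
no_cookie = probe(b"x = 1\n")                                               # 'utf-8'

# Cases from the review. No expected result is stated here on purpose:
# the behavior is what this PR discussion is about and may vary by version.
review_cases = {
    "ascii cookie, UTF-8 data, one line": "#coding=ascii €".encode("utf-8"),
    "ascii cookie, UTF-8 data, line 2": "#!/usr/bin/python\n#coding=ascii €".encode("utf-8"),
    "non-ASCII line 1, latin-1 cookie on line 2": b"# \xe9\n# coding: latin-1\n",
}
results = {label: probe(source) for label, source in review_cases.items()}
for label, outcome in results.items():
    print(f"{label} -> {outcome!r}")
```

Returning the exception instead of re-raising it keeps all cases in one table, which makes it easy to compare the output across Python versions.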

Add tests

5723fc5

Member, Author

vstinner commented Sep 22, 2025

@serhiy-storchaka: I added more tests, please review the updated PR. Is this what you wanted?

serhiy-storchaka reviewed

Sep 22, 2025

Member

serhiy-storchaka left a comment

Thank you for the update. In the two-line cases, please use non-ASCII data in the first line, before the coding cookie. Test that the tokenizer uses the correct encoding to decode comments in the first lines.

It may already be tested elsewhere, but I would also add tests for non-ASCII data in the first and in the second comment lines when no coding cookie is present (so UTF-8 should be used), for both valid and invalid UTF-8.
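The no-cookie cases can be sketched as follows. This is an illustration under stated assumptions, not the tests added in this PR: the helper `decode_source` and the sample inputs are hypothetical. It decodes the source with whatever encoding `tokenize.detect_encoding()` reports and returns either the text or the exception, without asserting which exception type a given Python version raises for invalid UTF-8.

```python
import io
import tokenize


def decode_source(source: bytes):
    """Decode source using the detected encoding; return the text or the error."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
        return source.decode(encoding)
    except (SyntaxError, UnicodeDecodeError) as exc:
        return exc


# Valid UTF-8 in a comment on line 1 or line 2, no coding cookie:
# the implicit encoding is UTF-8, so decoding succeeds.
valid_first = decode_source("# é\nx = 1\n".encode("utf-8"))
valid_second = decode_source("#!/usr/bin/python\n# é\nx = 1\n".encode("utf-8"))

# The same layouts with a lone 0xE9 byte, which is not valid UTF-8:
# the file does not match the implicit encoding and should be rejected.
invalid_first = decode_source(b"# \xe9\nx = 1\n")
invalid_second = decode_source(b"#!/usr/bin/python\n# \xe9\nx = 1\n")

for name, outcome in [
    ("valid, line 1", valid_first),
    ("valid, line 2", valid_second),
    ("invalid, line 1", invalid_first),
    ("invalid, line 2", invalid_second),
]:
    print(f"{name}: {outcome!r}")
```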

I expect that the tokenizer correctly decodes files that match the explicit or implicit encoding, and rejects files that do not match. And the interpreter should work the same.
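One way to check that the tokenizer and the interpreter agree is to run the same bytes through both paths. The sketch below is an assumption about how such a parity check could look, not a test from CPython's suite. It relies on `compile()` accepting `bytes` and honoring the coding cookie itself.

```python
import io
import tokenize

# Source bytes encoded as Latin-1, with a matching coding cookie.
source = "# coding: latin-1\ns = 'é'\n".encode("latin-1")

# Path 1: tokenize detects the encoding, then the bytes are decoded by hand.
encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
text = source.decode(encoding)
tokenizer_view = text.splitlines()[1]

# Path 2: the compiler reads the cookie from the raw bytes on its own.
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
interpreter_view = namespace["s"]

# Both paths should see the same character for the 0xE9 byte.
print(encoding, repr(tokenizer_view), repr(interpreter_view))
```

The same comparison could be repeated for inputs that do not match their declared or implicit encoding, where both paths would be expected to fail.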

vstinner added 2 commits

September 23, 2025 16:23

Add more tests

911dc3a

Test comments with no coding marker

e36d860

Member, Author

vstinner commented Sep 23, 2025

Ok, I added more tests. Please review the updated PR.

Member, Author

vstinner commented Oct 7, 2025

@serhiy-storchaka: It seems like you're working on the same area these days and you have a more advanced fix. I can abandon this PR, no?

Member

serhiy-storchaka commented Oct 8, 2025

Agreed. Sorry, but I already had tests for the core interpreter and a model of how it should work. I only needed to beat on the code until it started to pass the tests.

vstinner closed this

Oct 8, 2025

vstinner deleted the nonascii_coding branch

October 8, 2025 09:59

Labels

awaiting core review
needs backport to 3.13 (bugs and security fixes)
needs backport to 3.14 (bugs and security fixes)
