gh-63161: Fix tokenize detect_encoding() for non-ASCII coding #139235


Closed
vstinner wants to merge 5 commits into python:main from vstinner:nonascii_coding

Conversation

@vstinner (Member) commented Sep 22, 2025

@serhiy-storchaka (Member) left a comment

Please add a test for the codec cookie on the second line (with a non-ASCII first line).

Also add a test with a declared ASCII encoding but non-ASCII content that can still be decoded as UTF-8. E.g. '#coding=ascii €'.encode('utf-8'), and the corresponding two-line case.
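
The inputs requested here can be sketched with the standard library's tokenize.detect_encoding(). This is a minimal illustration, not the PR's test code: the helpers probe() and decodes_as() and the sample byte strings are hypothetical. Whether detect_encoding() accepts or rejects the ASCII-cookie inputs may depend on the Python version, so the sketch only reports the outcome instead of assuming it.

```python
import io
import tokenize


def probe(source: bytes) -> str:
    """Return the encoding detect_encoding() reports, or 'SyntaxError' if it rejects the input."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    except SyntaxError:
        return "SyntaxError"
    return encoding


def decodes_as(data: bytes, encoding: str) -> bool:
    """Check whether the raw bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Codec cookie on the second line, non-ASCII (UTF-8 encoded) comment on the first.
cookie_on_second_line = "# café\n# coding: latin-1\n".encode("utf-8")

# Cookie declares ASCII, but the content is non-ASCII and still valid UTF-8.
ascii_one_line = "#coding=ascii €".encode("utf-8")
ascii_two_lines = "#!/usr/bin/python €\n#coding=ascii\n".encode("utf-8")

for name, data in [
    ("cookie_on_second_line", cookie_on_second_line),
    ("ascii_one_line", ascii_one_line),
    ("ascii_two_lines", ascii_two_lines),
]:
    print(name, "->", probe(data),
          "| utf-8 ok:", decodes_as(data, "utf-8"),
          "| ascii ok:", decodes_as(data, "ascii"))
```

The two ASCII-cookie inputs are the interesting ones: the bytes are valid UTF-8 but not valid ASCII, so the declared encoding and the actual content disagree.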

@vstinner (Member, Author) commented

@serhiy-storchaka: I added more tests, please review the updated PR. Is this what you wanted?

@serhiy-storchaka (Member) left a comment

Thank you for the update. In the two-line cases, please use non-ASCII data in the first line, before the codec cookie. Test that the tokenizer uses the correct encoding to decode comments in the first lines.

This may already be tested elsewhere, but I would also add tests for non-ASCII data in the first and second comment lines when no codec cookie is present (so UTF-8 should be used), for both valid and invalid UTF-8.

I expect the tokenizer to correctly decode files that match the explicit or implicit encoding and to reject files that do not. The interpreter should behave the same way.
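
The no-cookie cases can be sketched as follows. This is a hedged illustration under stated assumptions, not the tests from either PR: the helper names detect() and is_rejected() and the sample inputs are hypothetical. With no codec cookie the implicit encoding is UTF-8, so valid UTF-8 comments should be accepted and stray non-UTF-8 bytes rejected.

```python
import io
import tokenize


def detect(source: bytes) -> str:
    """Return the encoding name reported by tokenize.detect_encoding()."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


def is_rejected(source: bytes) -> bool:
    """True if detect_encoding() raises SyntaxError for this input."""
    try:
        detect(source)
    except SyntaxError:
        return True
    return False


# Valid UTF-8, no cookie: non-ASCII comment in the first or the second line.
valid_first = "# café\nx = 1\n".encode("utf-8")
valid_second = "#!/usr/bin/python\n# café\n".encode("utf-8")

# Invalid UTF-8, no cookie: a lone 0xE9 byte (Latin-1 'é') in a comment.
invalid_first = b"# caf\xe9\nx = 1\n"
invalid_second = b"#!/usr/bin/python\n# caf\xe9\n"

for name, data in [
    ("valid_first", valid_first),
    ("valid_second", valid_second),
    ("invalid_first", invalid_first),
    ("invalid_second", invalid_second),
]:
    outcome = "rejected" if is_rejected(data) else detect(data)
    print(name, "->", outcome)
```

detect_encoding() only inspects the first two lines, so invalid bytes further down a file would have to be caught by a full tokenize pass or by the interpreter itself.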

@vstinner (Member, Author) commented

Ok, I added more tests. Please review the updated PR.

@vstinner (Member, Author) commented

@serhiy-storchaka: It seems like you're working on the same area these days and you have a more advanced fix. I can abandon this PR, no?

@serhiy-storchaka (Member) commented

Agreed. Sorry, but I already had tests for the core interpreter and a model of how it should work. I only needed to beat on the code until it started passing the tests.

@vstinner deleted the nonascii_coding branch October 8, 2025 09:59

Reviewers

@serhiy-storchaka left review comments

@pablogsal: awaiting requested review (code owner)

@lysnikolaou: awaiting requested review (code owner)

Assignees

No one assigned

Labels

awaiting core review, needs backport to 3.13 (bugs and security fixes), needs backport to 3.14 (bugs and security fixes)

2 participants

@vstinner @serhiy-storchaka
