Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork33.3k
gh-63161: Fix tokenize detect_encoding() for non-ASCII coding#139235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Please add a test for the coded cookie on the second line (and non-ascii first line).
Also add a test with specified ASCII encoding, but non-ASCII content that can still be decoded as UTF-8. E.g.'#coding=ascii €'.encode('utf-8') and corresponding for two lines.
vstinner commentedSep 22, 2025
@serhiy-storchaka: I added more tests, please review the updated PR. Is it what you wanted? |
serhiy-storchaka left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thank you for update. In two-line cases please use non-ASCII data in the first line, before the codec cookie. Test that the tokenizer uses correct encoding to decode comments in first lines.
It may be already tested elsewhere, but I would also add tests for non-ASCII data in the first and in the second comment lines, when no codec cookie is present (so UTF-8 should be used). For valid and invalid UTF-8.
I expect that the tokenizer correctly decodes files that match the explicit or implicit encoding, and reject files that do not match. And the interpreter should work the same.
vstinner commentedSep 23, 2025
Ok, I added more tests. Please review the updated PR. |
vstinner commentedOct 7, 2025
@serhiy-storchaka: It seems like you're working on the same area these days and you have more advanced fix. I can abandon this PR, no? |
serhiy-storchaka commentedOct 8, 2025
Agree. Sorry, but I already had tests for the core interpreter and the model of how it should work. I only needed to beat the code until it started to pass the tests. |
Uh oh!
There was an error while loading.Please reload this page.