gh-102856: Python tokenizer implementation for PEP 701 #104323
Conversation
sunmy2019 commented May 9, 2023
A thought: should this be aligned with the C tokenizer? If so, we can add tests to compare the Python tokenizer and the internal C tokenizer.
mgmacias95 commented May 9, 2023
It should be aligned with the C tokenizer, but there are some tokens that differ. For example, the C tokenizer returns LBRACE and RBRACE ({ and }) tokens, while the Python one just returns an OP token. Matching tests sound like a good idea to make sure they are both aligned. I can add them :).
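As a minimal sketch of that difference (using only the public tokenize module, nothing from this PR): the pure-Python tokenizer reports braces with the generic OP type, while the exact types the C tokenizer works with (LBRACE/RBRACE) are only visible through exact_type.

```python
import io
import tokenize

source = "d = {1: 2}\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # tok.type is the generic type (OP for the braces); tok.exact_type is the
    # fine-grained one (LBRACE/RBRACE), which matches the C tokenizer's view.
    print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))
```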
lysnikolaou commented May 11, 2023
Not sure about matching tests. There are many, often very slight, differences between the implementations of the C tokenizer and the Python tokenize module, and that's something we've been okay with for a long time. I wonder whether writing those tests is unnecessary effort.
lysnikolaou left a comment
Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.
However, I think that the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this was something that has been explicitly allowed in the PEP and also somewhat "advertised", I feel that most people would expect the tokenize module to support that as well, especially since we're putting in the work to support the new tokens.
Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?
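For concreteness, this is the kind of input the concern is about: PEP 701 (Python 3.12+) allows a replacement field to reuse the enclosing quote, which a purely regex-based f-string matcher cannot recognise. The token names in the trailing comment are a rough expectation, not verbatim output:

```python
import io
import tokenize

# Valid under PEP 701 (Python 3.12+); a SyntaxError on earlier versions.
source = 'f"{"nested"}"\n'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# Roughly: FSTRING_START 'f"', OP '{', STRING '"nested"', OP '}',
# FSTRING_END '"', followed by NEWLINE and ENDMARKER.
```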
pablogsal commented May 12, 2023 • edited
Sorry for the lack of context, let me explain the plan:
I don't think this is possible as both tokenizers are incompatible. One of them emits tokens that the other does not, and they also emit different tokens for the same input (the Python tokenizer emits generic OP tokens where the C tokenizer emits exact ones such as LBRACE and RBRACE). What do you think?
sunmy2019 commented May 13, 2023
It makes sense. I see the point here. I agree.
lysnikolaou commented May 15, 2023
The plan makes sense! Thanks for the thorough explanation @pablogsal!
pablogsal commented May 19, 2023
Ok, we are switching directions. Turns out the handling I described was even more difficult because we need to also factor in code that knows how to handle escaping. So we had an idea: what if we could reuse the C tokenizer, as it already knows how to handle this? The problem, as I said before, is that the C tokenizer emits tokens in a different fashion and it doesn't even bother with some others (like COMMENT and NL).
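As a purely conceptual sketch of that adaptation layer (not the implementation in this PR), a thin wrapper could coarsen the exact operator types the lower-level tokenizer reports back to the generic OP type that tokenize.py has historically emitted; re-synthesising the tokens it skips (COMMENT, NL) would need similar post-processing:

```python
import token

def coarsen_operators(tokens):
    """Map exact operator types (LBRACE, RBRACE, ...) back to generic OP.

    `tokens` is assumed to be an iterable of tokenize.TokenInfo tuples,
    i.e. whatever the lower-level tokenizer yields once wrapped.
    """
    for tok in tokens:
        # token.EXACT_TOKEN_TYPES maps operator strings ('{', '}', ...) to
        # their exact token types; if the token carries one, replace it.
        if token.EXACT_TOKEN_TYPES.get(tok.string) == tok.type:
            yield tok._replace(type=token.OP)
        else:
            yield tok
```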
bedevere-bot commented May 19, 2023
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit f1a5090 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
isidentical commented May 19, 2023 • edited
Quick question @pablogsal: do we still maintain the same behaviour as before? Stuff like the unnecessary tokens referenced in lines 204 to 207 in d78c3bc.
pablogsal commented May 19, 2023
I think we still do (the tests pass and we spent a huge amount of time making sure they do with all the files, which was non-trivial). I may still be missing something though.
self.assertEqual(tokens, expected_tokens,
                 "bytes not decoded with encoding")

def test__tokenize_does_not_decode_with_encoding_none(self):
This is being removed because it was testing the _tokenize implementation, which doesn't exist anymore and is not public.
bedevere-bot commented May 20, 2023
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7fb58b0 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
sunmy2019 commented May 20, 2023
+1 for this. Just like I have mentioned here: pablogsal#67 (comment)
pablogsal left a comment
LGTM! Great job! 🚀
AlexWaygood commented May 21, 2023
Looks like this change broke IDLE:
jayaddison commented May 22, 2023
It's possible that the tokenizer changes here introduced some parsing-related test failures in Sphinx: sphinx-doc/sphinx#11436 - we've begun looking into the cause. (It may turn out to be a problem to resolve on the Sphinx side; I'm not yet sure whether it suggests a bug in CPython, but thought it'd be worthwhile to mention.)
pablogsal commented May 22, 2023
Would it be possible to give us a self-contained reproducer? With that we may be able to indicate if this is an expected change or a bug.
pablogsal commented May 22, 2023
Also, could you please open an issue for this once you have your reproducer?
jayaddison commented May 22, 2023
Yep, sure thing @pablogsal (on both counts: a repro case and a bug report to go along with it). I'll send those along when available (or will similarly confirm if it turns out to be a non-issue).
jayaddison commented May 22, 2023
Thanks to @mgmacias95's explanation, I believe that the updated tokenizer is working as expected. There is some code in Sphinx to handle what I'd consider a quirk of the previous tokenizer, and that code isn't compatible with the updated (improved, I think) representation of DEDENT tokens. (Apologies for the distraction, and thank you for the help.)
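For anyone hitting a similar issue, a quick way to inspect how a given Python version represents DEDENT tokens (the detail the Sphinx code was sensitive to) is to tokenize a small nested block and print the INDENT/DEDENT positions; the trailing DEDENTs are where representations have differed:

```python
import io
import tokenize

source = "def f():\n    if True:\n        pass\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT):
        # The number, order, and start positions of the trailing DEDENT
        # tokens are the representation details that can vary by version.
        print(tokenize.tok_name[tok.type], tok.start)
```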