gh-102856: Python tokenizer implementation for PEP 701 #104323
Conversation
sunmy2019 commented May 9, 2023
A thought: should this be aligned with the C tokenizer? If so, we can add tests to compare the Python tokenizer and the internal C tokenizer.
mgmacias95 commented May 9, 2023
It should be aligned with the C tokenizer, but there are some tokens that differ. For example, the C tokenizer returns LBRACE and RBRACE ({ and }) tokens, while the Python one just returns an OP token. Matching tests sound like a good idea to make sure they are both aligned. I can add them :).
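As a minimal sketch of that difference (using only the public tokenize module, nothing from this PR): the pure-Python tokenizer reports braces with the generic OP type, while the exact types the C tokenizer works with (LBRACE/RBRACE) are only visible through exact_type.

```python
import io
import tokenize

source = "d = {1: 2}\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # tok.type is the generic type (OP for the braces); tok.exact_type is the
    # fine-grained one (LBRACE/RBRACE), which matches the C tokenizer's view.
    print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))
```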
lysnikolaou commented May 11, 2023
Not sure about matching tests. There are many, often very slight, differences between the implementations of the C tokenizer and the Python tokenize module, and that's something we've been okay with for a long time. I wonder whether writing those tests is unnecessary effort.
lysnikolaou left a comment
Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.
However, I think that the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this was something that has been explicitly allowed in the PEP and also somewhat "advertised", I feel that most people would expect the tokenize module to support that as well, especially since we're putting in the work to support the new tokens.
Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?
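For concreteness, this is the kind of input the concern is about: PEP 701 (Python 3.12+) allows a replacement field to reuse the enclosing quote, which a purely regex-based f-string matcher cannot recognise. The token names in the trailing comment are a rough expectation, not verbatim output:

```python
import io
import tokenize

# Valid under PEP 701 (Python 3.12+); a SyntaxError on earlier versions.
source = 'f"{"nested"}"\n'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# Roughly: FSTRING_START 'f"', OP '{', STRING '"nested"', OP '}',
# FSTRING_END '"', followed by NEWLINE and ENDMARKER.
```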
pablogsal commented May 12, 2023 • edited
Sorry for the lack of context, let me explain the plan:
I don't think this is possible as both tokenizers are incompatible. One of them emits tokens that the other does not, and they also emit different tokens for the same input (the Python tokenizer emits generic OP tokens where the C tokenizer emits exact ones such as LBRACE and RBRACE). What do you think?
sunmy2019 commented May 13, 2023
It makes sense. I see the point here. I agree.
lysnikolaou commented May 15, 2023
The plan makes sense! Thanks for the thorough explanation @pablogsal!
pablogsal commented May 19, 2023
Ok, we are switching directions. Turns out the handling I described was even more difficult because we need to also factor in code that knows how to handle escaping. So we had an idea: what if we could reuse the C tokenizer, as it already knows how to handle this? The problem, as I said before, is that the C tokenizer emits tokens in a different fashion and it doesn't even bother with some others (like COMMENT and NL).
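As a purely conceptual sketch of that adaptation layer (not the implementation in this PR), a thin wrapper could coarsen the exact operator types the lower-level tokenizer reports back to the generic OP type that tokenize.py has historically emitted; re-synthesising the tokens it skips (COMMENT, NL) would need similar post-processing:

```python
import token

def coarsen_operators(tokens):
    """Map exact operator types (LBRACE, RBRACE, ...) back to generic OP.

    `tokens` is assumed to be an iterable of tokenize.TokenInfo tuples,
    i.e. whatever the lower-level tokenizer yields once wrapped.
    """
    for tok in tokens:
        # token.EXACT_TOKEN_TYPES maps operator strings ('{', '}', ...) to
        # their exact token types; if the token carries one, replace it.
        if token.EXACT_TOKEN_TYPES.get(tok.string) == tok.type:
            yield tok._replace(type=token.OP)
        else:
            yield tok
```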
bedevere-bot commented May 19, 2023
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit f1a5090 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
isidentical commented May 19, 2023 • edited
Quick question @pablogsal: do we still maintain the same behaviour as before? Stuff like the unnecessary tokens referenced in lines 204 to 207 in d78c3bc.
pablogsal commented May 19, 2023
I think we still do (the tests pass and we spent a huge amount of time making sure they do with all the files, which was non-trivial). I may still be missing something though.
self.assertEqual(tokens, expected_tokens,
                 "bytes not decoded with encoding")

def test__tokenize_does_not_decode_with_encoding_none(self):
This is being removed because it was testing the _tokenize implementation, which doesn't exist anymore and is not public.
bedevere-bot commented May 20, 2023
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7fb58b0 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
sunmy2019 commented May 20, 2023
+1 for this. Just like I have mentioned here: pablogsal#67 (comment)
pablogsal left a comment
LGTM! Great job! 🚀
AlexWaygood commented May 21, 2023
Looks like this change broke IDLE:
jayaddison commented May 22, 2023
It's possible that the tokenizer changes here introduced some parsing-related test failures in Sphinx: sphinx-doc/sphinx#11436 - we've begun looking into the cause. (It may turn out to be a problem to resolve on the Sphinx side; I'm not yet sure whether it suggests a bug in CPython, but thought it'd be worthwhile to mention.)
pablogsal commented May 22, 2023
Would it be possible to give us a self-contained reproducer? With that we may be able to indicate if this is an expected change or a bug.
pablogsal commented May 22, 2023
Also, could you please open an issue for this once you have your reproducer?
jayaddison commented May 22, 2023
Yep, sure thing @pablogsal (on both counts: a repro case and a bug report to go along with it). I'll send those along when available (or will similarly confirm if it turns out to be a non-issue).
jayaddison commented May 22, 2023
Thanks to @mgmacias95's explanation, I believe that the updated tokenizer is working as expected. There is some code in Sphinx to handle what I'd consider a quirk of the previous tokenizer, and that code isn't compatible with the updated (improved, I think) representation of DEDENT tokens. (Apologies for the distraction, and thank you for the help.)
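For anyone hitting a similar issue, a quick way to inspect how a given Python version represents DEDENT tokens (the detail the Sphinx code was sensitive to) is to tokenize a small nested block and print the INDENT/DEDENT positions; the trailing DEDENTs are where representations have differed:

```python
import io
import tokenize

source = "def f():\n    if True:\n        pass\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT):
        # The number, order, and start positions of the trailing DEDENT
        # tokens are the representation details that can vary by version.
        print(tokenize.tok_name[tok.type], tok.start)
```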