
gh-104400: pygettext: use an AST parser instead of a tokenizer #104402


Merged

serhiy-storchaka merged 40 commits into python:main from tomasr8:better-pygettext

Feb 11, 2025


Conversation

Member

tomasr8 commented May 11, 2023 (edited)

This PR replaces the token-based message extraction with one that uses the AST parser instead.
See the issue or the forum discussion for more info.

This change fixes some issues just by virtue of using AST instead of working directly with tokens:

docstrings with leading blank lines are extracted correctly
docstrings like """Hello, {}!""".format('world') are no longer extracted
docstrings are cleaned with inspect.cleandoc() via ast.get_docstring()
This is now correctly extracted:

def test(x=_('param')): pass
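
The points above can be illustrated with a minimal sketch. This is not the code from the PR: `extract_msgids` and the sample source are hypothetical, and the sketch only handles plain `_('literal')` calls and docstrings.

```python
import ast

SOURCE = '''
def greet(name=_('param')):
    """
    Hello, world!
    """

def not_a_docstring():
    """Hello, {}!""".format('world')
'''

def extract_msgids(source):
    """Collect docstrings and string arguments of _() calls (illustrative only)."""
    tree = ast.parse(source)
    msgids = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.get_docstring() only accepts a bare string literal as the
            # first statement and cleans it with inspect.cleandoc() by default.
            doc = ast.get_docstring(node)
            if doc is not None:
                msgids.append(doc)
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id == '_'
              and node.args
              and isinstance(node.args[0], ast.Constant)
              and isinstance(node.args[0].value, str)):
            # Calls are found wherever they occur, including default arguments.
            msgids.append(node.args[0].value)
    return msgids

found = extract_msgids(SOURCE)
```

Here `found` contains the cleaned docstring of `greet` and the default-argument message `'param'`, while the `.format()` expression in `not_a_docstring` is skipped because it is a call, not a string literal.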

I added a CLI argument --charset (same as in pybabel and --from-code in xgettext) to force a file encoding, e.g. --charset=utf-8 will open the source files with utf-8 encoding. This is useful because currently we are relying on the system default, which is error-prone. For example on Windows, open() in my locale defaults to cp1250, which mangles utf-8 files and vice versa (with some UnicodeDecodeErrors in between).

Let's do this in a separate PR in order to keep the diff as small as possible here (this is also not needed when running python with -Xutf8).
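
The encoding problem described above can be reproduced without Windows. The following is only a sketch: `read_source` is a hypothetical helper, not part of pygettext, and cp1250 stands in for whatever the locale default happens to be.

```python
import os
import tempfile

def read_source(path, encoding='utf-8'):
    """Read a source file with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as fp:
        return fp.read()

text = "_('café')\n"

# UTF-8 bytes decoded with a legacy code page silently turn into mojibake.
mangled = text.encode('utf-8').decode('cp1250')

# The reverse direction fails loudly instead.
try:
    text.encode('cp1250').decode('utf-8')
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True

# Passing the encoding explicitly makes the result independent of the locale.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.py')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(text)
    roundtrip = read_source(path)
```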

This PR has lots more tests to make sure we don't regress on anything. The tests now compare the script output to a .po file rather than just comparing the msgids (basically snapshot tests). This ensures that we also catch issues with formatting, line locations or anything else.
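
As a rough illustration of the snapshot idea, the sketch below compares a complete rendered output against an expected string. `render_po` and `assert_matches_snapshot` are hypothetical stand-ins, not the helpers used in the CPython test suite, and the output format is heavily simplified.

```python
import difflib

def render_po(messages):
    """Hypothetical stand-in for the extractor: msgid -> list of (file, line)."""
    lines = []
    for msgid, locations in sorted(messages.items()):
        for filename, lineno in locations:
            lines.append(f'#: {filename}:{lineno}')
        lines.append(f'msgid "{msgid}"')
        lines.append('msgstr ""')
        lines.append('')
    return '\n'.join(lines)

def assert_matches_snapshot(actual, expected):
    """Fail with a unified diff if the full output differs from the snapshot."""
    if actual != expected:
        diff = '\n'.join(difflib.unified_diff(
            expected.splitlines(), actual.splitlines(),
            'snapshot.po', 'actual output', lineterm=''))
        raise AssertionError('output differs from snapshot:\n' + diff)

# In a real test this would be read from a .po file stored next to the test data.
EXPECTED = '''#: example.py:1
msgid "Hello"
msgstr ""
'''

assert_matches_snapshot(render_po({'Hello': [('example.py', 1)]}), EXPECTED)
```

Because the whole output is compared, a wrong line number or a formatting change fails the test even when the set of msgids is unchanged.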

@warsaw if you feel like having a look (or anyone else ;))

Issue: pygettext: use an AST parser instead of a tokenizer #104400

tomasr8 added 2 commits

May 11, 2023 00:52

Move test_i18n into a separate folder

ce99920

Switch to AST-based message extraction

4234c0b

bedevere-bot added the awaiting review label

May 11, 2023

bedevere-bot mentioned this pull request

May 11, 2023

pygettext: use an AST parser instead of a tokenizer #104400

Closed

tomasr8 added 2 commits

May 11, 2023 23:32

Add news entry

d431951

Merge branch 'main' into better-pygettext

5721857

arhadthedev reviewed

Jun 12, 2023

Tools/i18n/pygettext.py

tomasr8 added 5 commits

June 13, 2023 20:25

Fix comment

42277e8

Merge branch 'main' into better-pygettext

03f698f

Merge branch 'main' into better-pygettext

705b608

Merge branch 'main' into better-pygettext

a16274f

Merge branch 'main' into better-pygettext

f291862

Member, Author

tomasr8 commented Aug 3, 2023

@ambv This is what I talked to you about at EuroPython. If you have time I'd be very happy if you could have a look :)

The TL;DR is pygettext has a couple of bugs which stem from it using a tokenizer-based extraction (and overall the code needs modernizing). I fix those bugs in this PR by switching to a parser. Otherwise I try to keep the functionality as close as possible. I also added lots more tests which compare the entire output and not just the messages as it was previously.

There are also lots of features missing in pygettext: handling ngettext, pgettext and others, format flags, etc.
Once this is done, I will submit patches for those missing features as well. I didn't want to put everything in one giant PR as it's already pretty big.

Thank you!

Member

AA-Turner commented Aug 8, 2023

@tomasr8 would it be possible to break this PR up into several chunks / stages? You may have more luck with reviewers & progress -- I'm happy to help if wanted.

Member, Author

tomasr8 commented Aug 8, 2023

> @tomasr8 would it be possible to break this PR up into several chunks / stages? You may have more luck with reviewers & progress -- I'm happy to help if wanted.

I can definitely give it a try! I think it'll be difficult to separate the actual change from tokens to AST, since that's kind of an all-or-nothing change but I could start with improving the tests first in a separate PR. That should be an added value regardless of whether the rest gets merged or not. I'll see if I can get a separate PR for the tests in the coming days.

Any help/review is greatly appreciated of course! :)

tomasr8 mentioned this pull request

Aug 20, 2023

gh-104400: Add more tests to pygettext #108173

Merged

Member, Author

tomasr8 commented Aug 20, 2023

@AA-Turner I opened a separate PR just adding extra tests, if you wanna have a look ;)

Contributor

Wulian233 commented Oct 19, 2024

There was a recent issue in support of f-strings, #113604, that mentioned this. AST makes this feature easier to add.

https://github.com/python/cpython/pull/108173/files already contains the required tests. Can you remove them in this PR, so that the diff will be smaller and easier to review?

Member, Author

tomasr8 commented Oct 19, 2024

> There was a recent issue in support of f-strings, #113604, that mentioned this. AST makes this feature easier to add.
> https://github.com/python/cpython/pull/108173/files already contains the required tests. Can you remove them in this PR, so that the diff will be smaller and easier to review?

Don't waste your time reviewing this PR just yet, we should get the tests merged before moving on with this :) Actually, it's been a while since I opened the tests PR, it might need updating first.

erlend-aasland marked this pull request as draft

October 21, 2024 09:55

bedevere-appbot removed the awaiting review label

Oct 21, 2024

serhiy-storchaka self-requested a review

October 28, 2024 08:56

tomasr8 mentioned this pull request

Nov 17, 2024

gh-126700: pygettext: Support more gettext functions #126912

Merged

tomasr8 added 7 commits

November 29, 2024 21:06

Merge remote-tracking branch 'upstream/main' into better-pygettext

ca4cd02

Fix conflicts

22c44b4

Remove unrelated changes

5625422

Use match-case

a80d92e

Reorder methods

7fe3df5

Test f-strings

a6b1d54

Improve error messages

46eba7a

bedevere-appbot added awaiting change review and removed awaiting changes labels

Feb 5, 2025

bedevere-appbot commented Feb 5, 2025

Thanks for making the requested changes!

@AA-Turner: please review the changes made to this pull request.

bedevere-appbot requested a review from AA-Turner

February 5, 2025 22:08

AA-Turner reviewed

Feb 6, 2025

Member

AA-Turner left a comment

I'm not an amazing fan of reusing the same visitor for every file. There's not massive performance benefits, and it means we have tricks with resetting the filename, etc.

However, if you'd prefer to keep the current design, could we add a walkabout or initiate_visit (or similar) method that takes the node tree and filename, to conceptually keep the two together during tree traversal?

Lib/test/test_tools/i18n_data/messages.py

Tools/i18n/pygettext.py Outdated

    -    if ttype == tokenize.NAME and tstring in ('class', 'def'):
    -        self.__state = self.__ignorenext
    +class GettextVisitor(ast.NodeVisitor):
    +    def __init__(self, options, filename=None):

Member

AA-Turner, Feb 6, 2025

Suggested change

    -    def __init__(self, options, filename=None):
    +    def __init__(self, options):

Tools/i18n/pygettext.py Outdated

Tools/i18n/pygettext.py Outdated

    def __init__(self, options, filename=None):
        super().__init__()
        self.options = options
        self.filename = filename

Member

AA-Turner, Feb 6, 2025

The filename argument is never used

Suggested change

    -    self.filename = filename
    +    self.filename: str = ''

Tools/i18n/pygettext.py

serhiy-storchaka reviewed

Feb 6, 2025

Member

serhiy-storchaka left a comment

Great! It is now much simpler. And it works more correctly!

There are still some issues with var-positional arguments. Since we cannot determine the argument position in such a case, the call should be rejected.
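
To see why such calls are ambiguous, consider a minimal sketch of the check. `has_var_positional` and `parse_call` are hypothetical helper names; how the PR actually reports or skips these calls is not shown here.

```python
import ast

def parse_call(source):
    """Parse a single expression and return its ast.Call node."""
    node = ast.parse(source, mode='eval').body
    if not isinstance(node, ast.Call):
        raise TypeError('expected a call expression')
    return node

def has_var_positional(call):
    """Return True if the call unpacks an iterable into positional arguments."""
    return any(isinstance(arg, ast.Starred) for arg in call.args)

# With *args in the call, the index of the msgid argument cannot be
# determined statically, so an extractor cannot know which literal to pick.
ambiguous = has_var_positional(parse_call("pgettext(*ctx, 'message')"))
plain = has_var_positional(parse_call("pgettext('context', 'message')"))
```

In this sketch `ambiguous` is true and `plain` is false, so only the first call would be rejected.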

Tools/i18n/pygettext.py Outdated

Tools/i18n/pygettext.py

Lib/test/test_tools/i18n_data/messages.py

tomasr8 added 7 commits

February 6, 2025 22:42

Add a comment

84e2d24

Remove walrus

006e4a8

Remove redundant function

d684780

Add visit_file to GettextVisitor

084405f

Simplify reading files

1ad8d76

Reject calls with var-positional arguments

29ec497

Use more specific visit functions

621cf01

Member, Author

tomasr8 commented Feb 6, 2025

> I'm not an amazing fan of reusing the same visitor for every file. There's not massive performance benefits, and it means we have tricks with resetting the filename, etc.

I didn't really do this for performance but so that I don't need to pass the messages dictionary. If I create a new visitor for each file, I still need to pass the messages to it. I thought that passing the filename was simpler, so I did it this way, but if you prefer it the other way, just tell me and I'll change it 🙂 For now, I added GettextVisitor.visit_file(module_tree, filename) which sets the filename before extracting the messages.