NotificationsYou must be signed in to change notification settings
Fork32.3k
Star67.8k

gh-133546: Make`re.Match` a well-rounded`Sequence` type#133549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Closed

vberlier wants to merge13 commits intopython:mainfromvberlier:gh-133546

Closed

gh-133546: Make`re.Match` a well-rounded`Sequence` type#133549

vberlier wants to merge13 commits intopython:mainfromvberlier:gh-133546

Conversation

Copy link

Contributor

vberlier commentedMay 7, 2025•
edited by github-actionsbot
Loading

Issue:Makere.Match a well-roundedSequence type #133546

📚 Documentation preview 📚:https://cpython-previews--133549.org.readthedocs.build/

bedevere-appbot mentioned this pull request

May 7, 2025

Makere.Match a well-roundedSequence type#133546

Open

bedevere-appbot added the awaiting review label

May 7, 2025

Makere.Match a well-roundedSequence type

74480a7

vberlier force-pushed thegh-133546 branch fromac95ba2 to74480a7Compare

May 7, 2025 03:02

JelleZijlstra reviewed

May 7, 2025

View reviewed changes

Doc/library/re.rst

		@@ -1378,6 +1378,27 @@ when there is no match, you can test whether there was a match with a simple
		if match:
		process(match)

		Match objects are proper :class:`~collections.abc.Sequence` types. You can access

Copy link

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is not true with this PR, Sequence has a number of other requirements (e.g. anindex andcount method).

Copy link

ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice catch, thanks! I addedindex andcount by adapting the implementation oftuple.index andtuple.count. I also updated the unit test to cover allSequence mixin methods.

vberlier force-pushed thegh-133546 branch from01288ce to497c42aCompare

May 7, 2025 05:23

Implement missingindex andcount methods

a3de846

vberlier force-pushed thegh-133546 branch from497c42a toa3de846Compare

May 8, 2025 11:08

ZeroIntensity reviewed

May 11, 2025

View reviewed changes

Modules/_sre/sre.c OutdatedShow resolvedHide resolved

Modules/_sre/sre.cShow resolvedHide resolved

Modules/_sre/sre.c OutdatedShow resolvedHide resolved

Modules/_sre/sre.cShow resolvedHide resolved

vberlier added6 commits

May 11, 2025 20:15

Don't check for PyErr_Occurred

603b1d1

Add missing braces

70b73e4

Fix link to ValueError

f51ef45

Rewritematch_subscript to support negative indexing and slicing

f218828

Update ACKS

5272141

Add news entry

d0aa6fa

picnixz reviewed

May 15, 2025

View reviewed changes

Copy link

Member

picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Test coverage is sometimes insufficient and sometimes redundant with existing ones.

Doc/library/re.rst OutdatedShow resolvedHide resolved

Doc/library/re.rstShow resolvedHide resolved

Lib/test/test_re.py OutdatedShow resolvedHide resolved

Lib/test/test_re.pyShow resolvedHide resolved

Lib/test/test_re.py OutdatedShow resolvedHide resolved

picnixz reviewed

May 15, 2025

View reviewed changes

Copy link

Member

picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Some additional feedback. For the test failures, let's add someassertWarns. I don't think we should make some additional checks in C for the type of the operands (the issue is when you compare the input (which is a bytes) with the matched item (which are strings), hence a BytesWarning). So, instead, let's just expect the warning.

By the way, could you avoid using@picnixz in your commits as I'm getting pinged? TiA.

Lib/test/test_re.py OutdatedShow resolvedHide resolved

Modules/_sre/sre.c OutdatedShow resolvedHide resolved

Address feedback from@picnixz

5f67be0

vberlier force-pushed thegh-133546 branch fromccdbbbb to5f67be0Compare

May 15, 2025 23:57

Copy link

ContributorAuthor

vberlier commentedMay 16, 2025

Oh, sorry for the ping. Is it okay if I just leave out the cases with bytes as input? I can't seem to trigger the warning locally, even with./python -Werror -m test.

vberlier added3 commits

May 16, 2025 02:21

Move slicing test cases into its own test function

17feaa6

Fix error checking

fe709f8

Use else if

4095b52

vberlier requested a review frompicnixz

May 20, 2025 17:29

picnixz reviewed

Jun 2, 2025

View reviewed changes

Doc/library/re.rst OutdatedShow resolvedHide resolved

Modules/_sre/sre.c OutdatedShow resolvedHide resolved

Modules/_sre/sre.cShow resolvedHide resolved

Modules/_sre/sre.c OutdatedShow resolvedHide resolved

Lib/test/test_re.py OutdatedShow resolvedHide resolved

Lib/test/test_re.pyShow resolvedHide resolved

Address next round of comments

51d918d

Copy link

Member

AA-Turner commentedJun 2, 2025

I'm not sure consensus exists for this on either the issue or the linked Discourse thread...

Copy link

ContributorAuthor

vberlier commentedJun 3, 2025

It's a bit quiet, but I guess there's not much to talk about. Serhiy Storchaka brought up a valuable point about the fact that back in the day there was some debate over the possible semantics oflen() and unpacking. But with the__getitem__ implementation that was introduced 10 years ago, there's actually not much wiggle room for creative interpretations of what it would mean forre.Match to be a properSequence type.

This PR is nothing more than painting by numbers. The API was just waiting for someone to spend a couple of hours to fill in the blanks. So it's not particularly exciting, and I doubt the topic will attract a lot of attention on Discourse, but personally I think this is the kind of polish and attention to detail that makes Python feel like home. And it looks like some other people and a core dev are on board with it too. That's enough for me to get started. Plus, working code can help ground discussions.

That said, I'm still new to CPython's contributing process, so let me know if I should've done something differently.

vberlier requested a review frompicnixz

June 11, 2025 12:39

picnixz reviewed

Jun 22, 2025

View reviewed changes

Copy link

Member

picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

After staring at this PR once again, I think we don't need the full-fledged sequence API but we could have some of it:

It's useful to havecase support.
It's useful to have deconstruction via_, a, b = match
It may be useful to havelen as otherwise we need to extract the groups and dolen(m.groups()).
I can't find use cases of.index. While it could give back the index of the group that matches what we pass, it could be more useful if we had the name of the group when possible. IOW,.index is a bit ambiguous as we can access groups by name (if they exist) or by index.

So, maybe for a first iteration, we can drop.index and.count? they don't seem that useful to me. We wouldn't have a real sequence, but at least we would have a more useful interface.

Finally, the docs for.group() (which is meant to be equivalent to__getitem__) say:

If a group number is negative or larger than the number of groups defined in the pattern, anIndexError exception is raised

IOW, we don't support negative groups but we support negative access and this is inconsistent with the sequence contract I think.

What I find useful though is the possibility to usematch m: and deconstruction.

Copy link

ContributorAuthor

vberlier commentedJun 23, 2025•
edited
Loading

I agree I don't think that.index() and.count() bring a lot of value on their own. The main motivation is primarily pattern-matching and sequence unpacking.

But in generic contexts, a lot of code out there will assert that the object is a properSequence before making use of those features, either through type annotations or dynamically withisinstance. The problem is that this also technically encodes a requirement for mixin methods that are not needed.

I often wonder if Python could have originally introducedcollections.abc with an alternative toSequence that doesn't advertise as many capabilities, basically justCollection + __getitem__. But it's a bit too late now, users rely onSequence everywhere, and being a good citizen of the data model means abiding by the rules and in some cases filling out pointless methods. In pure Python, the fact that inheriting fromSequence generates the mixin methods automatically tends to hide the fact that most code usingcollections.abc at API boundaries is prone to overly strict interface requirements. In C the problem becomes much more noticeable because the necessary boilerplate is no longer invisible, you have to implement it manually.

As you're suggesting, the alternative could be to only addlen() andPy_TPFLAGS_SEQUENCE, which should be enough for sequence unpacking and pattern-matching, without actually makingre.Match a realSequence type. But we have to consider that from a user's perspective, having to get familiar with a bespoke subset of theSequence protocol will incur more cognitive overhead than if the contract is fulfilled in its entirety, even though some parts may end up being less applicable. Here providing fullSequence compatibility is not actually much work so I think it's worth it.

Note that for.index() since it's there for compatibility for users supplying are.Match object to a piece of code that accepts aSequence, we shouldn't diverge from the expected semantics.

Regarding__getitem__ and.group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option..group() needs to keep raising theIndexError otherwise existing code may break. But__getitem__ needs to handle negative indices to fulfill the requirements forSequence. I'll update the docs to explain the inconsistency.

Copy link

Member

picnixz commentedJun 23, 2025

Regardinggetitem and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option. .group() needs to keep raising the IndexError otherwise existing code may break. Butgetitem needs to handle negative indices to fulfill the requirements for Sequence. I'll update the docs to explain the inconsistency.

In this case, I think it's better not change it and abandon the proposal. Isn't it possible to just implement the destructuring & match support without the sequence? I still don't think it's useful to have more methods than necessary.

Note that for .index() since it's there for compatibility for users supplying a re.Match object to a piece of code that accepts a Sequence, we shouldn't diverge from the expected semantics.

The question we should ask ourselves is rather: why would they use match as a regular sequence? it's not really a sequence IMO. thecurrent expected semantics are "not a sequence" or am I wrong?

I would hold this until some other core dev chimes in to agree with me that we shouldn't add useless methods, reject what I said, or propose something else. I don't know if it's possible to only support destructing and/or matches.

Copy link

ContributorAuthor

vberlier commentedJun 25, 2025

Isn't it possible to just implement the destructuring & match support without the sequence?

Yes, addinglen() andPy_TPFLAGS_SEQUENCE should be enough to support sequence unpacking and pattern-matching, without actually makingre.Match a realSequence type.

why would they use match as a regular sequence? it's not really a sequence IMO. the current expected semantics are "not a sequence" or am I wrong?

It's been teetering on the edge of being a sequence for a decade with the introduction of the__getitem__ implementation. And we both agree that making it behave even more like a sequence by adding unpacking and pattern-matching would be beneficial. The rest falls out of Python's natural inclination towards duck-typing: the morere.Match will look like a sequence, the more people will want to use it like a sequence. So the most straightforward way to legitimize this improvement is to finally make the API coherent with the standard Python data model and implement all the remaining bits of theSequence protocol.

When it comes to mixin methods, they're a pain point that's recurrent throughout all Python code. Even objects likesys.version_info are forced to have dubious-looking.index() and.count() methods, because people use named tuples asSequence all the time in spite ofSequence making overly generous interface promises.

And even then, while I don't see.index() and.count() being used often onre.Match objects directly, they still make perfect sense. The intrinsic piece of data held withinre.Match is a non-emptytuple[str | None, ...] listing the substrings extracted when matching the pattern (it just happens to not actually be stored as a regulartuple internally to make the allocation of the substrings lazy).

So at its core it's a perfectly legitimateSequence type. Granted, the historicalre.Match API doesn't do a great job at making this transparent. But as long as there's a way to fix it with backward-compatibility in mind, I think we should strive to close the gap with other languages, for example like in JavaScript where a match object is a fully-fledged builtinArray:

>letm="abc".match(/(a)(b)(c)/)>console.log(minstanceofArray)true>m.concat(['hello'])['abc','a','b','c','hello']

Regarding__getitem__ and.group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option
In this case, I think it's better not change it and abandon the proposal

This is a trade-off, we add new features and tackle some long overdue API polish, but this comes at the cost of a slight inconsistency between__getitem__ and the old.group() accessor. Here it's going to come down to personal preference and individual sensibilities. If the inconsistency is a deal-breaker for the majority, we can still do as you suggested and only make the necessary changes to support unpacking and pattern-matching. This wouldn't require updating__getitem__ to conform to theSequence protocol and therefore wouldn't introduce an inconsistency with.group(). But some people may point out the missed opportunity for cleaning things up.

Copy link

ContributorAuthor

vberlier commentedJun 25, 2025

Actually,m[-1] now returning the last matched substring instead of raisingIndexError is technically not backward-compatible. And this change would be required for fullSequence compatibility. So maybe there.Match API is beyond saving already. We'd need to survey the ecosystem. If there are existing usages that depend on__getitem__ raisingIndexError for negative values, we might have to fall back to ad-hoc support for unpacking and pattern-matching regardless.

Copy link

Member

AA-Turner commentedJun 25, 2025

all observable behaviors of your system / will be depended on by somebody

I think with the current opposition already noted and the compatability break here, better to close this PR and consider a new approach of just adding pattern matching support. Before you spend time working on an implementation, though, please check in the issue or on Discourse that there is support for this idea.

AA-Turner closed this

Jun 25, 2025

Labels

awaiting review

5 participants

Movatterモバイル変換

Uh oh!

gh-133546: Makere.Match a well-roundedSequence type#133549

gh-133546: Makere.Match a well-roundedSequence type#133549

Uh oh!

Conversation

vberlier commentedMay 7, 2025• edited by github-actionsbotLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

JelleZijlstraMay 7, 2025

Choose a reason for hiding this comment

Uh oh!

vberlierMay 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

picnixz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

picnixz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vberlier commentedMay 16, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AA-Turner commentedJun 2, 2025

Uh oh!

vberlier commentedJun 3, 2025

Uh oh!

picnixz left a comment

Choose a reason for hiding this comment

Uh oh!

vberlier commentedJun 23, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

picnixz commentedJun 23, 2025

Uh oh!

vberlier commentedJun 25, 2025

Uh oh!

vberlier commentedJun 25, 2025

Uh oh!

AA-Turner commentedJun 25, 2025

Uh oh!

Uh oh!

gh-133546: Make`re.Match` a well-rounded`Sequence` type#133549

gh-133546: Make`re.Match` a well-rounded`Sequence` type#133549

vberlier commentedMay 7, 2025•
edited by github-actionsbot
Loading

vberlier commentedJun 23, 2025•
edited
Loading