Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-133546: Makere.Match a well-roundedSequence type#133549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Closed
vberlier wants to merge13 commits intopython:mainfromvberlier:gh-133546

Conversation

vberlier
Copy link
Contributor

@vberliervberlier commentedMay 7, 2025
edited by github-actionsbot
Loading

@@ -1378,6 +1378,27 @@ when there is no match, you can test whether there was a match with a simple
if match:
process(match)

Match objects are proper :class:`~collections.abc.Sequence` types. You can access
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is not true with this PR, Sequence has a number of other requirements (e.g. anindex andcount method).

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice catch, thanks! I addedindex andcount by adapting the implementation oftuple.index andtuple.count. I also updated the unit test to cover allSequence mixin methods.

Copy link
Member

@picnixzpicnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Test coverage is sometimes insufficient and sometimes redundant with existing ones.

Copy link
Member

@picnixzpicnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Some additional feedback. For the test failures, let's add someassertWarns. I don't think we should make some additional checks in C for the type of the operands (the issue is when you compare the input (which is a bytes) with the matched item (which are strings), hence a BytesWarning). So, instead, let's just expect the warning.

By the way, could you avoid using@picnixz in your commits as I'm getting pinged? TiA.

@vberlier
Copy link
ContributorAuthor

Oh, sorry for the ping. Is it okay if I just leave out the cases with bytes as input? I can't seem to trigger the warning locally, even with./python -Werror -m test.

@vberliervberlier requested a review frompicnixzMay 20, 2025 17:29
@AA-Turner
Copy link
Member

I'm not sure consensus exists for this on either the issue or the linked Discourse thread...

A

@vberlier
Copy link
ContributorAuthor

It's a bit quiet, but I guess there's not much to talk about. Serhiy Storchaka brought up a valuable point about the fact that back in the day there was some debate over the possible semantics oflen() and unpacking. But with the__getitem__ implementation that was introduced 10 years ago, there's actually not much wiggle room for creative interpretations of what it would mean forre.Match to be a properSequence type.

This PR is nothing more than painting by numbers. The API was just waiting for someone to spend a couple of hours to fill in the blanks. So it's not particularly exciting, and I doubt the topic will attract a lot of attention on Discourse, but personally I think this is the kind of polish and attention to detail that makes Python feel like home. And it looks like some other people and a core dev are on board with it too. That's enough for me to get started. Plus, working code can help ground discussions.

That said, I'm still new to CPython's contributing process, so let me know if I should've done something differently.

@vberliervberlier requested a review frompicnixzJune 11, 2025 12:39
Copy link
Member

@picnixzpicnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

After staring at this PR once again, I think we don't need the full-fledged sequence API but we could have some of it:

  • It's useful to havecase support.
  • It's useful to have deconstruction via_, a, b = match
  • It may be useful to havelen as otherwise we need to extract the groups and dolen(m.groups()).
  • I can't find use cases of.index. While it could give back the index of the group that matches what we pass, it could be more useful if we had the name of the group when possible. IOW,.index is a bit ambiguous as we can access groups by name (if they exist) or by index.

So, maybe for a first iteration, we can drop.index and.count? they don't seem that useful to me. We wouldn't have a real sequence, but at least we would have a more useful interface.

Finally, the docs for.group() (which is meant to be equivalent to__getitem__) say:

If a group number is negative or larger than the number of groups defined in the pattern, anIndexError exception is raised

IOW, we don't support negative groups but we support negative access and this is inconsistent with the sequence contract I think.


What I find useful though is the possibility to usematch m: and deconstruction.

@vberlier
Copy link
ContributorAuthor

vberlier commentedJun 23, 2025
edited
Loading

I agree I don't think that.index() and.count() bring a lot of value on their own. The main motivation is primarily pattern-matching and sequence unpacking.

But in generic contexts, a lot of code out there will assert that the object is a properSequence before making use of those features, either through type annotations or dynamically withisinstance. The problem is that this also technically encodes a requirement for mixin methods that are not needed.

I often wonder if Python could have originally introducedcollections.abc with an alternative toSequence that doesn't advertise as many capabilities, basically justCollection + __getitem__. But it's a bit too late now, users rely onSequence everywhere, and being a good citizen of the data model means abiding by the rules and in some cases filling out pointless methods. In pure Python, the fact that inheriting fromSequence generates the mixin methods automatically tends to hide the fact that most code usingcollections.abc at API boundaries is prone to overly strict interface requirements. In C the problem becomes much more noticeable because the necessary boilerplate is no longer invisible, you have to implement it manually.

As you're suggesting, the alternative could be to only addlen() andPy_TPFLAGS_SEQUENCE, which should be enough for sequence unpacking and pattern-matching, without actually makingre.Match a realSequence type. But we have to consider that from a user's perspective, having to get familiar with a bespoke subset of theSequence protocol will incur more cognitive overhead than if the contract is fulfilled in its entirety, even though some parts may end up being less applicable. Here providing fullSequence compatibility is not actually much work so I think it's worth it.

Note that for.index() since it's there for compatibility for users supplying are.Match object to a piece of code that accepts aSequence, we shouldn't diverge from the expected semantics.

Regarding__getitem__ and.group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option..group() needs to keep raising theIndexError otherwise existing code may break. But__getitem__ needs to handle negative indices to fulfill the requirements forSequence. I'll update the docs to explain the inconsistency.

@picnixz
Copy link
Member

Regardinggetitem and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option. .group() needs to keep raising the IndexError otherwise existing code may break. Butgetitem needs to handle negative indices to fulfill the requirements for Sequence. I'll update the docs to explain the inconsistency.

In this case, I think it's better not change it and abandon the proposal. Isn't it possible to just implement the destructuring & match support without the sequence? I still don't think it's useful to have more methods than necessary.

Note that for .index() since it's there for compatibility for users supplying a re.Match object to a piece of code that accepts a Sequence, we shouldn't diverge from the expected semantics.

The question we should ask ourselves is rather: why would they use match as a regular sequence? it's not really a sequence IMO. thecurrent expected semantics are "not a sequence" or am I wrong?

I would hold this until some other core dev chimes in to agree with me that we shouldn't add useless methods, reject what I said, or propose something else. I don't know if it's possible to only support destructing and/or matches.

@vberlier
Copy link
ContributorAuthor

Isn't it possible to just implement the destructuring & match support without the sequence?

Yes, addinglen() andPy_TPFLAGS_SEQUENCE should be enough to support sequence unpacking and pattern-matching, without actually makingre.Match a realSequence type.

why would they use match as a regular sequence? it's not really a sequence IMO. the current expected semantics are "not a sequence" or am I wrong?

It's been teetering on the edge of being a sequence for a decade with the introduction of the__getitem__ implementation. And we both agree that making it behave even more like a sequence by adding unpacking and pattern-matching would be beneficial. The rest falls out of Python's natural inclination towards duck-typing: the morere.Match will look like a sequence, the more people will want to use it like a sequence. So the most straightforward way to legitimize this improvement is to finally make the API coherent with the standard Python data model and implement all the remaining bits of theSequence protocol.

When it comes to mixin methods, they're a pain point that's recurrent throughout all Python code. Even objects likesys.version_info are forced to have dubious-looking.index() and.count() methods, because people use named tuples asSequence all the time in spite ofSequence making overly generous interface promises.

And even then, while I don't see.index() and.count() being used often onre.Match objects directly, they still make perfect sense. The intrinsic piece of data held withinre.Match is a non-emptytuple[str | None, ...] listing the substrings extracted when matching the pattern (it just happens to not actually be stored as a regulartuple internally to make the allocation of the substrings lazy).

So at its core it's a perfectly legitimateSequence type. Granted, the historicalre.Match API doesn't do a great job at making this transparent. But as long as there's a way to fix it with backward-compatibility in mind, I think we should strive to close the gap with other languages, for example like in JavaScript where a match object is a fully-fledged builtinArray:

>letm="abc".match(/(a)(b)(c)/)>console.log(minstanceofArray)true>m.concat(['hello'])['abc','a','b','c','hello']

Regarding__getitem__ and.group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option

In this case, I think it's better not change it and abandon the proposal

This is a trade-off, we add new features and tackle some long overdue API polish, but this comes at the cost of a slight inconsistency between__getitem__ and the old.group() accessor. Here it's going to come down to personal preference and individual sensibilities. If the inconsistency is a deal-breaker for the majority, we can still do as you suggested and only make the necessary changes to support unpacking and pattern-matching. This wouldn't require updating__getitem__ to conform to theSequence protocol and therefore wouldn't introduce an inconsistency with.group(). But some people may point out the missed opportunity for cleaning things up.

@vberlier
Copy link
ContributorAuthor

Actually,m[-1] now returning the last matched substring instead of raisingIndexError is technically not backward-compatible. And this change would be required for fullSequence compatibility. So maybe there.Match API is beyond saving already. We'd need to survey the ecosystem. If there are existing usages that depend on__getitem__ raisingIndexError for negative values, we might have to fall back to ad-hoc support for unpacking and pattern-matching regardless.

@AA-Turner
Copy link
Member

all observable behaviors of your system / will be depended on by somebody

I think with the current opposition already noted and the compatability break here, better to close this PR and consider a new approach of just adding pattern matching support. Before you spend time working on an implementation, though, please check in the issue or on Discourse that there is support for this idea.

A

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@JelleZijlstraJelleZijlstraJelleZijlstra left review comments

@picnixzpicnixzpicnixz left review comments

@ZeroIntensityZeroIntensityZeroIntensity left review comments

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

5 participants
@vberlier@AA-Turner@picnixz@JelleZijlstra@ZeroIntensity

[8]ページ先頭

©2009-2025 Movatter.jp