Improvements in regular expression doc#114357

Open

adorilson wants to merge 34 commits into python:main from adorilson:re_improvements

Conversation

Contributor

adorilson commented Jan 20, 2024 (edited by github-actions bot)

📚 Documentation preview 📚: https://cpython-previews--114357.org.readthedocs.build/

adorilson and others added 6 commits

January 20, 2024 20:23

Doc: Fix the array.fromfile method doc

817b3f3

The check about the f argument type was removed in this commit: python@2c94aa5

Thanks to Pedro Arthur Duarte (pedroarthur.jedi at gmail.com) for the help with this bug.

pythongh-106320: Remove private _PyInterpreterState functions (python#106335)

6b53456

Remove private _PyThreadState and _PyInterpreterState C API functions: move them to the internal C API (pycore_pystate.h and pycore_interp.h). Don't export most of these functions anymore, but still export functions used by tests.

Remove _PyThreadState_Prealloc() and _PyThreadState_Init() from the C API, but keep it in the stable API.

[Doc] Divide RE Syntax in subsections

1b4d152

[DOC] Add crasis surrounding some RE-matched words

6ad009c

[DOC] Make clearer what will be matched with a RE

94f765f

Doc: minor change

292672b

bedevere-app bot added the awaiting review, docs (Documentation in the Doc dir), and skip news labels

Jan 20, 2024

adorilson added 6 commits

February 3, 2024 22:43

Merge branch 'python:main' into re_improvements

65b4278

Merge branch 'python:main' into re_improvements

fe7389a

Merge branch 'python:main' into re_improvements

8394cd3

Doc: Put PatternError's attributes inside a table instead of regular paragraph

e2023e0

Doc: Fix PatternError's attributes

cdaa9ae

Doc: fix lint issue

bb98dad

terryjreedy requested changes

Feb 25, 2024

Member

terryjreedy left a comment

This PR does 3 things.

1. Add headers. I have thought to propose the same. Please add 1 more at 320, something like

   .. _re_extension_notation:

   Extension notation
   ^^^^^^^^^^^^^^^^^^

   CHANGE

2. Add double backticks, either new or extending single backticks. The existing text always put backticks on REs and sometimes on text matched. The PR makes that (nearly, 2 exceptions noted) always on matches. Defensible since this seems the majority of existing cases. CHANGE
3. Add 'only' in several places. I am not sure these are needed, but I see existing similar uses.

@serhiy-storchaka I want to finish this RE doc change. Any additional comments from you?

Doc/library/re.rst Outdated

Comment on lines 124 to 125

		only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
		- matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
		+ matches 'foo2' normally, but ``'foo1'`` in :const:`MULTILINE` mode; searching

Member

To be consistent with other additions, 'foo' above and 'foo2' here should be backticked. But see review summary.

Contributor, Author

Done.

bedevere-app bot removed the awaiting review label

Feb 25, 2024

bedevere-app bot commented Feb 25, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

bedevere-app bot added the awaiting changes label

Feb 25, 2024

Member

serhiy-storchaka commented Feb 25, 2024

I am not sure that there is a need for these changes.

1. New headers and anchors. This is perhaps harmless, but there are some problems in the text (see "re: documentation claim that special characters lose their special meaning inside […] seems wrong", #106482) which requires more serious rewriting, so some parts of the text can be moved and headers and anchors can change.
2. I used double backquotes to highlight regular expressions. Single quotes are used for strings; they are not fragments of Python code, they are just strings in quotes. If we use the same style in both cases, it will be more difficult to distinguish REs from strings. Maybe you have a better solution?
3. I have no opinion about "only"; I leave the decision to native English speakers.

adorilson added 2 commits

February 25, 2024 18:32

Merge branch 'main' into re_improvements

22ffed7

Merge branch 'python:main' into re_improvements

6a1e74e

adorilson marked this pull request as draft

September 25, 2024 09:45

bedevere-app bot removed the awaiting changes label

Sep 25, 2024

adorilson added 4 commits

September 25, 2024 17:43

Doc: Add extension notation header

6b357af

Doc: Add some more backticks

8f7356d

Merge branch 'python:main' into re_improvements

6ed5109

Doc: Fix malformed hyperlink target

9c17aa8

Contributor, Author

adorilson commented Sep 26, 2024

Hi, @terryjreedy. Thank you for your review and comments.

Items 1 and 2 are done.

Concern 3: the idea is to make the use of re.ASCII more explicit.

Without re.ASCII, [^0-9] is matched, but something more can be matched too, i.e.:

>>> import re
>>> re.findall(r'\d+', '567abc123٠١٢٣٤٥٦٧٨٩')
['567', '123٠١٢٣٤٥٦٧٨٩']

However, with re.ASCII only (and just only) [^0-9] is matched, i.e.:

>>> import re
>>> re.findall(r'\d+', '567abc123٠١٢٣٤٥٦٧٨٩', re.ASCII)
['567', '123']

Contributor, Author

adorilson commented Sep 26, 2024

> requires more serious rewriting

This can start by adding more in-line examples, as is in progress for strings (#119445).

bedevere-app bot commented Sep 26, 2024

Thanks for making the requested changes!

@terryjreedy: please review the changes made to this pull request.

bedevere-app bot requested a review from terryjreedy

September 26, 2024 08:58

adorilson added 5 commits

September 26, 2024 09:58

Merge branch 'main' into re_improvements

acb2e38

Merge branch 'python:main' into re_improvements

4d3b8dd

Merge branch 'main' into re_improvements

643070c

Docs: add a 'also' for $ special character and RE examples reference labels

17baf98

Docs: add some RE raw string notation references

4e12f7c

vadmium reviewed

Oct 16, 2024

Member

vadmium left a comment

I appreciate the subheadings to divide up the syntax section.

I agree with adding “only” to clarify when ASCII mode matches less than previously described. But the cases with complemented sets don’t have this problem, and I think adding “only” to them only hurts.

> \D:
> Matches any character which is not a decimal digit. This is the opposite of \d.
> Matches only [^0-9] if the ASCII flag is used.

reads as “Matches only the universe except zero to nine”?

Doc/library/re.rst Outdated

		@@ -514,6 +529,9 @@ The special characters are:

		.. _re-special-sequences:

		Special sequences
		^^^^^^^^^^^^^^^^^

Member

vadmium commented Oct 16, 2024

Can we call them all escape sequences? Differentiates better from the multi-character “special character” sequences above.

Contributor, Author

adorilson commented Oct 20, 2024

Are there other subtypes of special characters? What about subsections?

Member

vadmium commented Oct 25, 2024

I suggest changing the next heading to something like “String literal escapes”, and changing this heading from “Special sequences” to “Escape sequences”.

These are the types of special characters I can think of for REs:

- The single-character metacharacters: $, *, [, ], \, etc., as listed in the how-to: https://cpython-previews--114357.org.readthedocs.build/en/114357/howto/regex.html#matching-characters
- Multicharacter syntax built with the metacharacters, like *?, {m,n} and the bracketed extension notation (?. . .)
- “Special sequences”, a.k.a. escape sequences, which begin with a backslash. These could be subdivided into:
  - Non-alphanumeric, for escaping metacharacters and other syntax: \$, \*, \\, \', \", etc.
  - Group references: \1–\99
  - Alphanumeric sequences that specify locations to match, or categories of characters: \A, \b, \d, etc.
  - String literal escapes: \n, \\, \N{. . .}, \0–\777, etc. Excludes \b and \<newline>.
- Characters only special in “verbose” expressions: whitespace and #
- Additional backslash sequence for re.sub templates: \g<. . .>
- Special characters inside square-bracketed classes/sets [. . .], especially -, ^, ], \b, and reserved [, &&, etc.
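To make the categories above a little more concrete, here is a minimal sketch with one small case per category. The patterns, sample strings, and variable names are illustrative assumptions and are not taken from the PR or from re.rst.

```python
import re

# Single-character metacharacter: $ anchors the match at the end of the string.
anchored = re.findall(r'foo$', 'foo foo')

# Multicharacter syntax: the non-greedy *? stops at the first '>'.
non_greedy = re.search(r'<.*?>', '<a><b>').group()

# Escape sequences: an escaped metacharacter, a group reference,
# and a character category.
escaped = re.findall(r'\$', 'cost: $5')
doubled = re.search(r'(\w)\1', 'hello').group()
digits = re.findall(r'\d', 'a1b2')

# Verbose mode: unescaped whitespace and # comments in the pattern are ignored.
verbose = re.compile(r"""
    \d+   # integer part
    \.    # literal dot
    \d+   # fractional part
""", re.VERBOSE)
is_decimal = verbose.fullmatch('3.14') is not None

# \g<...> is a backslash sequence used in re.sub() replacement templates.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')

# Inside a set, $ and * are ordinary characters.
in_set = re.findall(r'[$*]', 'a$b*c')

print(anchored, non_greedy, escaped, doubled, digits)
print(is_decimal, swapped, in_set)
```

The sketch only shows that each category behaves differently; it does not settle which heading names the documentation should use.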

Doc/library/re.rst Outdated (resolved)

Doc/library/re.rst

		matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
		matches
		only ``'foo'``. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
		matches ``'foo2'`` normally, but also ``'foo1'`` in :const:`MULTILINE` mode; searching

Member

vadmium commented Oct 16, 2024

I thought the original was easier to read, with the full string being searched given in a different font from the substrings that are found

Contributor, Author

adorilson commented Oct 20, 2024

Firstly, it was inconsistent with the "(In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched 'in single quotes'.)"

However, you highlighted that 'strings to be matched' is different from 'the matches'. On the other hand, both are literal strings, and this is a common pattern around all docs.

I would like some more opinions here.
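For readers following the formatting debate, the behaviour that the disputed sentence describes can be checked directly. This is a small sketch using the same patterns and strings as the documentation text; the variable names are assumptions added for illustration.

```python
import re

# 'foo' matches inside 'foobar', but 'foo$' must end at the end of the string.
plain = re.search(r'foo', 'foobar') is not None
anchored = re.search(r'foo$', 'foobar') is not None

text = 'foo1\nfoo2\n'

# By default, $ matches at the end of the string and just before a
# trailing newline, so only 'foo2' is found.
default_matches = re.findall(r'foo.$', text)

# In MULTILINE mode, $ also matches before every newline.
multiline_matches = re.findall(r'foo.$', text, re.MULTILINE)

print(plain, anchored)
print(default_matches)
print(multiline_matches)
```

Here the searched string is `'foo1\nfoo2\n'` and the results are the substrings `'foo1'` and `'foo2'`. The comment above asks whether the documentation should show these two kinds of string in different fonts.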

Doc/library/re.rst Outdated

		- many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
		- by any number of 'b's.
		+ many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
		+ ``'a'`` followed by any number of ``'b'`` s.

Member

vadmium commented Oct 16, 2024

The -s signifying plural has become disconnected from the b

Contributor, Author

adorilson commented Oct 20, 2024

Waiting for a decision about: #114357 (comment)

Member

vadmium commented Oct 25, 2024

Not that I think it is great formatting, but looking at the markup under *+ you might join the s on with

Suggested change

	- ``'a'`` followed by any number of ``'b'`` s.
	+ ``'a'`` followed by any number of ``'b'``\s.

Contributor, Author

adorilson commented Mar 15, 2025

Done

Doc/library/re.rst Outdated


		.. index:: single: + (plus); in regular expressions

		``+``
		Causes the resulting RE to match 1 or more repetitions of the preceding RE.
		- ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
		- match just 'a'.
		+ ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'`` s; it

Member

vadmium commented Oct 16, 2024

-s disconnected again

Contributor, Author

adorilson commented Mar 15, 2025

Done
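The two documentation entries touched in these threads (`*` and `+`) describe behaviour that is easy to verify. A minimal sketch follows; the candidate strings and variable names are assumptions chosen for illustration.

```python
import re

candidates = ['a', 'ab', 'abbb', 'b']

# ab* : 'a' followed by zero or more 'b's.
star = [s for s in candidates if re.fullmatch(r'ab*', s)]

# ab+ : 'a' followed by one or more 'b's; it does not match just 'a'.
plus = [s for s in candidates if re.fullmatch(r'ab+', s)]

print(star)
print(plus)
```

`re.fullmatch` is used so that each candidate must match in its entirety, which is the sense in which the documentation says the RE "will match" a string.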

adorilson added 2 commits

October 20, 2024 21:20

Merge branch 'python:main' into re_improvements

a09a187

Revert "[DOC] Make clearer what will be matched with a RE"

625a5cf

This reverts commit 94f765f.

Contributor, Author

adorilson commented Oct 20, 2024

> I agree with adding “only” to clarify when ASCII mode matches less than previously described. But the cases with complemented sets don’t have this problem, and I think adding “only” to them only hurts.

I've reverted this change.

> \D:
> Matches any character which is not a decimal digit. This is the opposite of \d.
> Matches only [^0-9] if the ASCII flag is used.
>
> reads as “Matches only the universe except zero to nine”?

Hmm... It is curious. Even without the “only”, this can sound weird, mainly if you don't read the first paragraph.

Maybe this would be "Matches all the ASCII universe except zero to nine."

adorilson added 2 commits

October 20, 2024 23:08

Doc: Put some subheadings at Special Character section

12ecb3a

Doc: Fix raw string notation reference

f576282

Contributor, Author

adorilson commented Oct 20, 2024

I have made the requested changes; please review again

bedevere-app bot commented Oct 20, 2024

Thanks for making the requested changes!

@terryjreedy: please review the changes made to this pull request.

Contributor, Author

adorilson commented Oct 20, 2024

I have made the requested changes; please review again

bedevere-app bot commented Oct 20, 2024

Thanks for making the requested changes!

@terryjreedy: please review the changes made to this pull request.

adorilson requested a review from vadmium

October 20, 2024 22:46

Member

vadmium commented Oct 25, 2024

> \D:
> Matches any character which is not a decimal digit. This is the opposite of \d.
>
> Matches [^0-9] if the ASCII flag is used.

> Hmm... It is curious. Even without the “only”, this can sound weird, mainly if you don't read the first paragraph.
> Maybe this would be "Matches all the ASCII universe except zero to nine."

I don’t understand what is weird. It matches all Unicode characters, not just ASCII, except for ASCII zero to nine. The only suggestion I can think of is saying “Equivalent to [^0-9]” rather than “Matches”. Maybe that is clearer to you? (Although the equivalency doesn’t quite work when \D is already inside a square-bracket character class/set.)
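The parenthetical caveat can be illustrated with a short sketch. It assumes a sample string containing ARABIC-INDIC DIGIT THREE (U+0663), chosen here only for illustration. On its own, `\D` under `re.ASCII` behaves like `[^0-9]`. Inside a set such as `[\D5]` it still contributes "anything but ASCII 0 to 9", but that set cannot be spelled by pasting `[^0-9]` between the brackets, because `^` only negates at the very start of a set.

```python
import re

# 'a', ASCII digit 5, ARABIC-INDIC DIGIT THREE, ASCII digit 7
text = 'a5' + '\u0663' + '7'

# Standalone, \D with re.ASCII and the explicit set agree.
standalone = re.findall(r'\D', text, re.ASCII)
explicit = re.findall(r'[^0-9]', text, re.ASCII)

# Inside a set, \D combines with other members: "non-ASCII-digit or '5'".
combined = re.findall(r'[\D5]', text, re.ASCII)

# Without re.ASCII, U+0663 counts as a decimal digit, so \D skips it.
unicode_mode = re.findall(r'\D', text)

print(standalone == explicit)
print(combined)
print(unicode_mode)
```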

adorilson added 2 commits

October 28, 2024 09:26

Merge branch 'python:main' into re_improvements

337e4b4

Doc: Include "Python's" to a link text in RE module

0e0e082

Contributor, Author

adorilson commented Oct 28, 2024

> \D: Matches any character which is not a decimal digit. This is the opposite of \d.
> Matches [^0-9] if the ASCII flag is used.
>
> Hmm... It is curious. Even without the “only”, this can sound weird, mainly if you don't read the first paragraph.
> Maybe this would be "Matches all the ASCII universe except zero to nine."
>
> I don’t understand what is weird. It matches all Unicode characters, not just ASCII, except for ASCII zero to nine. The only suggestion I can think of is saying “Equivalent to [^0-9]” rather than “Matches”. Maybe that is clearer to you? (Although the equivalency doesn’t quite work when \D is already inside a square-bracket character class/set.)

Oh, my goodness.

I had a misconception about how re.ASCII works. I thought it was like a filter: "keep only the ASCII characters, and afterwards match against the RE". So, necessarily, with the ASCII flag, the matches would contain only ASCII characters. But that is not true, especially with a negated set of characters ([^ re ]).

In this case, we might need to improve the re.ASCII definition to avoid this misconception.
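The misconception described above can be checked with a short experiment. This is a sketch under the assumption of a sample string that mixes ASCII digits, ASCII letters, and Arabic-Indic digits (U+0660 to U+0662). With `re.ASCII`, `\d` narrows to ASCII digits, while `\D` widens to cover every other character, including non-ASCII ones. The flag changes what the escapes mean; it does not filter the input.

```python
import re

text = '567abc' + '\u0660\u0661\u0662'

# \d: Unicode decimal digits by default, only [0-9] with re.ASCII.
d_unicode = re.findall(r'\d+', text)
d_ascii = re.findall(r'\d+', text, re.ASCII)

# \D: the complement of \d in each mode. With re.ASCII the Arabic-Indic
# digits are no longer digits, so \D matches them.
nd_unicode = re.findall(r'\D+', text)
nd_ascii = re.findall(r'\D+', text, re.ASCII)

print(d_unicode, d_ascii)
print(nd_unicode, nd_ascii)
```

With the flag set, the `\D+` match contains non-ASCII characters, which is exactly the case the filter model cannot explain.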

Contributor, Author

adorilson commented Oct 28, 2024

> The only suggestion I can think of is saying “Equivalent to [^0-9]” rather than “Matches”. Maybe that is clearer to you?

Yeah. With the correct understanding of how ASCII works, it sounds better.

Can I change this in all occurrences like that?

Doc: Add some backticks in re.IGNORECASE section

f094a90

Contributor

willingc commented Nov 1, 2024

Thank you @adorilson for your patience and effort on this PR.

@terryjreedy I'm going through and triaging a bunch of docs PRs. If you have time, please review this one again. Thanks.

Merge branch 'main' into re_improvements

fd24e0f

willingc added the skip issue label

Nov 2, 2024

adorilson added 3 commits

November 21, 2024 22:50

Merge branch 'main' into re_improvements

a8c44e1

Doc: rename some heading in RE

f970235

Doc: Connect some s in RE

8d52469

Labels: awaiting change review, docs (Documentation in the Doc dir), skip issue, skip news

6 participants
