NotificationsYou must be signed in to change notification settings
Fork32.3k
Star67.9k

gh-76960: Fix urljoin() and urldefrag() for URIs with empty components#123273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

serhiy-storchaka merged 2 commits intopython:mainfromserhiy-storchaka:urllib-urljoin2

Aug 31, 2024

Merged

gh-76960: Fix urljoin() and urldefrag() for URIs with empty components#123273

serhiy-storchaka merged 2 commits intopython:mainfromserhiy-storchaka:urllib-urljoin2

Aug 31, 2024

Conversation

Copy link

Member

serhiy-storchaka commentedAug 23, 2024•
edited by bedevere-appbot
Loading

urljoin() with relative reference "?" sets empty query and removes fragment.
Preserve empty components (authority, params, query, fragment) in urljoin().
Preserve empty components (authority, params, query) in urldefrag().

Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().

Issue:urljoining an empty query string doesn't clear query string #76960

pythongh-76960: Fix urljoin() and urldefrag() for URIs with empty com…

57867a6

…ponents* urljoin() with relative reference "?" sets empty query and removes fragment.* Preserve empty components (authority, params, query, fragment) in urljoin().* Preserve empty components (authority, params, query) in urldefrag().Also refactor the code and get rid of double _coerce_args() and_coerce_result() calls in urljoin(), urldefrag(), urlparse() andurlunparse().

serhiy-storchaka requested a review fromorsenthil

August 23, 2024 20:24

bedevere-appbot mentioned this pull request

Aug 23, 2024

urljoining an empty query string doesn't clear query string#76960

Closed

bedevere-appbot added the awaiting core review label

Aug 23, 2024

serhiy-storchaka commented

Aug 23, 2024

View reviewed changes

Lib/test/test_urlparse.py

Comment on lines -529 to -530

		self.checkJoin(SIMPLE_BASE, 'http:g','http://a/b/c/g')
		self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d')

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Duplicated below.

Lib/test/test_urlparse.py

Comment on lines -551 to -552

		self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d')
		self.checkJoin(SIMPLE_BASE, 'http:?y','http://a/b/c/d?y')

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Duplicated in new tests below.

Lib/urllib/parse.py

		v = SplitResult(scheme or '', netloc or '', url, query or '', fragment or '')
		return _coerce_result(v)

		def _urlsplit(url, scheme=None, allow_fragments=True):

Copy link

MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This refactoring is also a preparation for#67041. These private functions distinguish undefined components (None) from empty components (empty string), but public functions remove this difference. With new options added for#67041 this will be optional.

Lib/urllib/parse.py

Comment on lines +592 to +593

		if fragment is None:
		fragment = bfragment

Copy link

MemberAuthor

serhiy-storchakaAug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This contradicts RFC 3986 according to which the target fragment is always set to the relative reference fragment. But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.

Copy link

Member

orsenthilAug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.

Just to be explicit, this means we are keeping the existing behavior when a relative fragment is given an empty string (or None). Is that right?

Copy link

MemberAuthor

serhiy-storchakaAug 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Practically, this only affects 4 cases: an empty string,'//'. and URIs like'http:' andhttp:// where the scheme is the same as in the base URI.

urljoin('http:/a/b/c/d;p?q#f', '') currently returns the base URI'http:/a/b/c/d;p?q#f'. The same in Ruby and Go. According to the RFC 3986 pseudocode it should drop fragment. According to previous RFCs it should return the base URI. I think it is important to keep the current behavior in this case, so there should be some error in RFC 3986.

urljoin('http:/a/b/c/d;p?q#f', '//') currently returns the base URI without fragment'http:/a/b/c/d;p?q', which is consistent with RFC 3986 for non-strict parser. Ruby and Go do not distinguish between'//' and'', so they preserve fragment. The current PR preserves it too.

urljoin('http:/a/b/c/d;p?q#f', 'http:') currently returns the base URI without fragment'http:/a/b/c/d;p?q', which is consistent with RFC 3986 for non-strict parser. But Ruby and Go return'http: which is consistent with RFC 3986 for strict parser. With this PR it will return the base URI unchanged, this is a change from the current behavior and is not consistent with RFC 3986 unless we suppose that RFC 3986 was wrong in this place.

But maybe RFC 3986 is wrong in other place. Previous RFCs contained special statement for empty relative URI. If it is missed in RFC 3986 by mistake, then we can only add special case for empty relative URI and preserve the current behavior forurljoin('http:/a/b/c/d;p?q#f', 'http:') which is consistent with the pseudocode in RFC 3986.

We can only add a special case for empty relative URI (it is already here for performance reasons), with possible addition of'//'. Then the behavior for'http:' and'http://' will be consistent with 3986 and match the current behavior (but Ruby and Go go other way).

Lib/urllib/parse.py


		if scheme is None:
		scheme = bscheme
		if scheme != bscheme or scheme not in uses_relative:
		return _coerce_result(url)
		if scheme in uses_netloc:
		if netloc:

Copy link

MemberAuthor

serhiy-storchakaAug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

According to RFC 3986, this should benetlog is not None (undefined). But this gives results too different from Ruby and Go (they may be wrong) and from the current Python code. I leave this for a separate issue.

Copy link

Member

orsenthilAug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Being consistent with Ruby, Go, and Java (and usually with browser/curl) somes seems to be more beneficial, and going with that is expected by the programmer.

Copy link

MemberAuthor

serhiy-storchaka commentedAug 23, 2024

@orsenthil, please take a look at this PR.

orsenthil reviewed

Aug 30, 2024

View reviewed changes

Misc/NEWS.d/next/Library/2024-08-23-22-01-30.gh-issue-76960.vsANPu.rst Outdated

		Fix :func:`urllib.parse.urljoin` and :func:`urllib.parse.urldefrag` for URIs
		containing empty components. For example, :func:`!urljoin()` with relative
		reference "?" now sets empty query and removes fragment. Empty components are
		preserved in the result.

Copy link

Member

orsenthilAug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What is meant by " Empty components are preserved in the result." in this news entry?

Copy link

Member

orsenthilAug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Got it what is meant in the ticket summary.

Preserve empty components (authority, params, query, fragment) in urljoin().Preserve empty components (authority, params, query) in urldefrag().

Copy link

MemberAuthor

serhiy-storchakaAug 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I guess that I should be more explicit (even if a little repetitive) here.

orsenthil reviewed

Aug 30, 2024

View reviewed changes

Lib/test/test_urlparse.py

		self.checkJoin(RFC1808_BASE, '?', 'http://a/b/c/d;p?', relroundtrip=False)
		self.checkJoin(RFC1808_BASE, '?#z', 'http://a/b/c/d;p?#z', relroundtrip=False)
		self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
		self.checkJoin(RFC1808_BASE, ';', 'http://a/b/c/;')

Copy link

Member

orsenthilAug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Making a note, using parameter; removes the final portion of the path in the base url/d in this case.

Copy link

Member

orsenthilAug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

And this as per the rfc3986 requirement as given in the5.4. Reference Resolution Examples and these tests handle them properly for the empty? cases correctly.

Copy link

MemberAuthor

serhiy-storchakaAug 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

From RFC3986's point of view,; is not a delimiter, it is a normal character. So/b/c/d;p is the old path,d;p is its last part, and it will be replaced with relative path;.

Other example:

>>> urljoin('http:/a/b/c/d;?#f','#z')'http:///a/b/c/d#z'

Currently it removes; and? which correspond to emptyparams andquery. But for RFC3986 they should be preserved. It is not only a matter of representation of empty component, params is not a thing in RFC3986. Removing trailing; from path/b/c/d; is a bug.

orsenthil approved these changes

Aug 30, 2024

View reviewed changes

Copy link

Member

orsenthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM.

Thank you@serhiy-storchaka for bringing this long pending and desirable change into urllib.parse module in a relativelysafe way.

bedevere-appbot added awaiting merge and removed awaiting core review labels

Aug 30, 2024

Update NEWS.

a65a5f0

Copy link

MemberAuthor

serhiy-storchaka commentedAug 31, 2024

Thank you for your review@orsenthil.

During writing responses to your comments I reconsider my approach to empty URI problem, tried different ways, but have not found good reasons to prefer one way or another for such corner cases like//,http: andhttp://. The current approach at least looks more consistent for all 4 cases, so I leave it. This is not the last PR for these functions, we will need to add more options for consistency with Ruby and Go, and maybe change some defaults in future.

serhiy-storchaka merged commitfc897fc intopython:main

Aug 31, 2024

34 checks passed

bedevere-appbot removed the awaiting merge label

Aug 31, 2024

serhiy-storchaka deleted the urllib-urljoin2 branch

August 31, 2024 09:42

WinChua mentioned this pull request

Sep 20, 2024

fix DocumentURI error for gopls@v0.16.1ycm-core/ycmd#1762

Closed

Copy link

WinChua commentedSep 20, 2024

This commin change the behavior of urljoin. Consider this codeurljoin('file:', '/home'), before this commit it produced'file:///home', while it produces'file:/home' now, which may break some code