Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-76960: Fix urljoin() and urldefrag() for URIs with empty components#123273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchakaserhiy-storchaka commentedAug 23, 2024
edited by bedevere-appbot
Loading

  • urljoin() with relative reference "?" sets empty query and removes fragment.
  • Preserve empty components (authority, params, query, fragment) in urljoin().
  • Preserve empty components (authority, params, query) in urldefrag().

Also refactor the code and get rid of double _coerce_args() and _coerce_result() calls in urljoin(), urldefrag(), urlparse() and urlunparse().

mattwang44 and orsenthil reacted with heart emoji
…ponents* urljoin() with relative reference "?" sets empty query and removes fragment.* Preserve empty components (authority, params, query, fragment) in urljoin().* Preserve empty components (authority, params, query) in urldefrag().Also refactor the code and get rid of double _coerce_args() and_coerce_result() calls in urljoin(), urldefrag(), urlparse() andurlunparse().
Comment on lines -529 to -530
self.checkJoin(SIMPLE_BASE, 'http:g','http://a/b/c/g')
self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d')
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Duplicated below.

Comment on lines -551 to -552
self.checkJoin(SIMPLE_BASE, 'http:','http://a/b/c/d')
self.checkJoin(SIMPLE_BASE, 'http:?y','http://a/b/c/d?y')
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Duplicated in new tests below.

v = SplitResult(scheme or '', netloc or '', url, query or '', fragment or '')
return _coerce_result(v)

def _urlsplit(url, scheme=None, allow_fragments=True):
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This refactoring is also a preparation for#67041. These private functions distinguish undefined components (None) from empty components (empty string), but public functions remove this difference. With new options added for#67041 this will be optional.

orsenthil reacted with thumbs up emoji
Comment on lines +592 to +593
if fragment is None:
fragment = bfragment
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This contradicts RFC 3986 according to which the target fragment is always set to the relative reference fragment. But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.

orsenthil reacted with thumbs up emoji
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

But this means that an empty string (which has no defined fragment) should remove fragment. I think this is an error in RFC 3986.

Just to be explicit, this means we are keeping the existing behavior when a relative fragment is given an empty string (or None). Is that right?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Practically, this only affects 4 cases: an empty string,'//'. and URIs like'http:' andhttp:// where the scheme is the same as in the base URI.

urljoin('http:/a/b/c/d;p?q#f', '') currently returns the base URI'http:/a/b/c/d;p?q#f'. The same in Ruby and Go. According to the RFC 3986 pseudocode it should drop fragment. According to previous RFCs it should return the base URI. I think it is important to keep the current behavior in this case, so there should be some error in RFC 3986.

urljoin('http:/a/b/c/d;p?q#f', '//') currently returns the base URI without fragment'http:/a/b/c/d;p?q', which is consistent with RFC 3986 for non-strict parser. Ruby and Go do not distinguish between'//' and'', so they preserve fragment. The current PR preserves it too.

urljoin('http:/a/b/c/d;p?q#f', 'http:') currently returns the base URI without fragment'http:/a/b/c/d;p?q', which is consistent with RFC 3986 for non-strict parser. But Ruby and Go return'http: which is consistent with RFC 3986 for strict parser. With this PR it will return the base URI unchanged, this is a change from the current behavior and is not consistent with RFC 3986 unless we suppose that RFC 3986 was wrong in this place.

But maybe RFC 3986 is wrong in other place. Previous RFCs contained special statement for empty relative URI. If it is missed in RFC 3986 by mistake, then we can only add special case for empty relative URI and preserve the current behavior forurljoin('http:/a/b/c/d;p?q#f', 'http:') which is consistent with the pseudocode in RFC 3986.

We can only add a special case for empty relative URI (it is already here for performance reasons), with possible addition of'//'. Then the behavior for'http:' and'http://' will be consistent with 3986 and match the current behavior (but Ruby and Go go other way).


if scheme is None:
scheme = bscheme
if scheme != bscheme or scheme not in uses_relative:
return _coerce_result(url)
if scheme in uses_netloc:
if netloc:
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

According to RFC 3986, this should benetlog is not None (undefined). But this gives results too different from Ruby and Go (they may be wrong) and from the current Python code. I leave this for a separate issue.

orsenthil reacted with thumbs up emoji
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Being consistent with Ruby, Go, and Java (and usually with browser/curl) somes seems to be more beneficial, and going with that is expected by the programmer.

@serhiy-storchaka
Copy link
MemberAuthor

@orsenthil, please take a look at this PR.

orsenthil reacted with thumbs up emoji

Fix :func:`urllib.parse.urljoin` and :func:`urllib.parse.urldefrag` for URIs
containing empty components. For example, :func:`!urljoin()` with relative
reference "?" now sets empty query and removes fragment. Empty components are
preserved in the result.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What is meant by " Empty components are preserved in the result." in this news entry?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Got it what is meant in the ticket summary.

Preserve empty components (authority, params, query, fragment) in urljoin().Preserve empty components (authority, params, query) in urldefrag().

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I guess that I should be more explicit (even if a little repetitive) here.

self.checkJoin(RFC1808_BASE, '?', 'http://a/b/c/d;p?', relroundtrip=False)
self.checkJoin(RFC1808_BASE, '?#z', 'http://a/b/c/d;p?#z', relroundtrip=False)
self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
self.checkJoin(RFC1808_BASE, ';', 'http://a/b/c/;')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Making a note, using parameter; removes the final portion of the path in the base url/d in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

And this as per the rfc3986 requirement as given in the5.4. Reference Resolution Examples and these tests handle them properly for the empty? cases correctly.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

From RFC3986's point of view,; is not a delimiter, it is a normal character. So/b/c/d;p is the old path,d;p is its last part, and it will be replaced with relative path;.

Other example:

>>> urljoin('http:/a/b/c/d;?#f','#z')'http:///a/b/c/d#z'

Currently it removes; and? which correspond to emptyparams andquery. But for RFC3986 they should be preserved. It is not only a matter of representation of empty component, params is not a thing in RFC3986. Removing trailing; from path/b/c/d; is a bug.

Copy link
Member

@orsenthilorsenthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM.

Thank you@serhiy-storchaka for bringing this long pending and desirable change into urllib.parse module in a relativelysafe way.

@serhiy-storchaka
Copy link
MemberAuthor

Thank you for your review@orsenthil.

During writing responses to your comments I reconsider my approach to empty URI problem, tried different ways, but have not found good reasons to prefer one way or another for such corner cases like//,http: andhttp://. The current approach at least looks more consistent for all 4 cases, so I leave it. This is not the last PR for these functions, we will need to add more options for consistency with Ruby and Go, and maybe change some defaults in future.

@serhiy-storchakaserhiy-storchaka merged commitfc897fc intopython:mainAug 31, 2024
34 checks passed
@serhiy-storchakaserhiy-storchaka deleted the urllib-urljoin2 branchAugust 31, 2024 09:42
@WinChua
Copy link

This commin change the behavior of urljoin. Consider this codeurljoin('file:', '/home'), before this commit it produced'file:///home', while it produces'file:/home' now, which may break some code

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@orsenthilorsenthilorsenthil approved these changes

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

3 participants
@serhiy-storchaka@WinChua@orsenthil

[8]ページ先頭

©2009-2025 Movatter.jp