Commit bb97e80

pythongh-102153: Start stripping C0 control and space chars in `urlsplit` (pythonGH-102508)

`urllib.parse.urlsplit` has already been respecting the WHATWG spec a bit pythonGH-25595.

This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/GH-url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329).

Backported from Python 3.12

1 parent e7ecd65, commit bb97e80

4 files changed: +145 -3 lines changed

‎Doc/library/urllib.parse.rst

Lines changed: 69 additions & 2 deletions
@@ -126,6 +126,28 @@ or on combining URL components into a URL string.
    ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
    decomposed before parsing, no error will be raised.

+   As is the case with all named tuples, the subclass has a few additional methods
+   and attributes that are particularly useful. One such method is :meth:`_replace`.
+   The :meth:`_replace` method will return a new ParseResult object replacing specified
+   fields with new values.
+
+   .. doctest::
+      :options: +NORMALIZE_WHITESPACE
+
+      >>> from urllib.parse import urlparse
+      >>> u = urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
+      >>> u
+      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
+                  params='', query='', fragment='')
+      >>> u._replace(scheme='http')
+      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
+                  params='', query='', fragment='')
+
+   .. warning::
+
+      :func:`urlparse` does not perform validation. See :ref:`URL parsing
+      security <url-parsing-security>` for details.
+
    .. versionchanged:: 3.2
       Added IPv6 URL parsing capabilities.

@@ -288,8 +310,14 @@ or on combining URL components into a URL string.
    ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
    decomposed before parsing, no error will be raised.

-   Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
-   ``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
+   Following some of the `WHATWG spec`_ that updates RFC 3986, leading C0
+   control and space characters are stripped from the URL. ``\n``,
+   ``\r`` and tab ``\t`` characters are removed from the URL at any position.
+
+   .. warning::
+
+      :func:`urlsplit` does not perform validation. See :ref:`URL parsing
+      security <url-parsing-security>` for details.

    .. versionchanged:: 3.6
       Out-of-range port numbers now raise :exc:`ValueError`, instead of
@@ -302,6 +330,9 @@ or on combining URL components into a URL string.
    .. versionchanged:: 3.6.14
       ASCII newline and tab characters are stripped from the URL.

+   .. versionchanged:: 3.11.4
+      Leading WHATWG C0 control and space characters are stripped from the URL.
+
 .. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser

 .. function:: urlunsplit(parts)
@@ -371,6 +402,42 @@ or on combining URL components into a URL string.
    .. versionchanged:: 3.2
       Result is a structured object rather than a simple 2-tuple.

+.. function:: unwrap(url)
+
+   Extract the url from a wrapped URL (that is, a string formatted as
+   ``<URL:scheme://host/path>``, ``<scheme://host/path>``, ``URL:scheme://host/path``
+   or ``scheme://host/path``). If *url* is not a wrapped URL, it is returned
+   without changes.
+
+.. _url-parsing-security:
+
+URL parsing security
+--------------------
+
+The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation** of
+inputs. They may not raise errors on inputs that other applications consider
+invalid. They may also succeed on some inputs that might not be considered
+URLs elsewhere. Their purpose is for practical functionality rather than
+purity.
+
+Instead of raising an exception on unusual input, they may instead return some
+component parts as empty strings. Or components may contain more than perhaps
+they should.
+
+We recommend that users of these APIs where the values may be used anywhere
+with security implications code defensively. Do some verification within your
+code before trusting a returned component part. Does that ``scheme`` make
+sense? Is that a sensible ``path``? Is there anything strange about that
+``hostname``? etc.
+
+What constitutes a URL is not universally well defined. Different applications
+have different needs and desired constraints. For instance the living `WHATWG
+spec`_ describes what user facing web clients such as a web browser require.
+While :rfc:`3986` is more general. These functions incorporate some aspects of
+both, but cannot be claimed compliant with either. The APIs and existing user
+code with expectations on specific behaviors predate both standards leading us
+to be very cautious about making API behavior changes.
+
 .. _parsing-ascii-encoded-bytes:

 Parsing ASCII Encoded Bytes
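
Reviewer note: the new "URL parsing security" section asks callers to verify components before trusting them. The following is a minimal sketch of what such a check might look like. The helper name `is_acceptable_url` and the `ALLOWED_SCHEMES` policy are hypothetical and not part of this commit; the rule of rejecting any C0 control or space character outright is likewise an assumption about one possible application policy, stricter than what `urlsplit` itself does.

```python
from urllib.parse import urlsplit

# Hypothetical application policy: only these schemes are trusted.
ALLOWED_SCHEMES = {"http", "https"}


def is_acceptable_url(url):
    """Return True only if `url` passes a few defensive checks.

    This is an illustrative policy, not the behavior of urllib.parse.
    """
    # Reject C0 control characters and spaces (code points 0x00-0x20)
    # anywhere in the input instead of relying on urlsplit to strip them.
    if any(ord(ch) <= 0x20 for ch in url):
        return False
    try:
        parts = urlsplit(url)
        # Accessing .port raises ValueError for non-numeric or
        # out-of-range ports, so it is evaluated inside the try block.
        parts.port
    except ValueError:
        return False
    # Does the scheme make sense for this application?
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    # Is there a hostname at all?
    if not parts.hostname:
        return False
    return True
```

A caller would run this before using `urlsplit(url).hostname` for anything security sensitive, such as an allow-list or block-list lookup.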

‎Lib/test/test_urlparse.py

Lines changed: 60 additions & 1 deletion
@@ -660,14 +660,73 @@ def test_urlsplit_remove_unsafe_bytes(self):
             self.assertEqual(p.scheme, "https")
             self.assertEqual(p.geturl(), "https://www.python.org/")

+    def test_urlsplit_strip_url(self):
+        noise = bytes(range(0, 0x20 + 1))
+        base_url = "http://User:Pass@www.python.org:080/doc/?query=yes#frag"
+
+        url = noise.decode("utf-8") + base_url
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, "http")
+        self.assertEqual(p.netloc, "User:Pass@www.python.org:080")
+        self.assertEqual(p.path, "/doc/")
+        self.assertEqual(p.query, "query=yes")
+        self.assertEqual(p.fragment, "frag")
+        self.assertEqual(p.username, "User")
+        self.assertEqual(p.password, "Pass")
+        self.assertEqual(p.hostname, "www.python.org")
+        self.assertEqual(p.port, 80)
+        self.assertEqual(p.geturl(), base_url)
+
+        url = noise + base_url.encode("utf-8")
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, b"http")
+        self.assertEqual(p.netloc, b"User:Pass@www.python.org:080")
+        self.assertEqual(p.path, b"/doc/")
+        self.assertEqual(p.query, b"query=yes")
+        self.assertEqual(p.fragment, b"frag")
+        self.assertEqual(p.username, b"User")
+        self.assertEqual(p.password, b"Pass")
+        self.assertEqual(p.hostname, b"www.python.org")
+        self.assertEqual(p.port, 80)
+        self.assertEqual(p.geturl(), base_url.encode("utf-8"))
+
+        # Test that trailing space is preserved as some applications rely on
+        # this within query strings.
+        query_spaces_url = "https://www.python.org:88/doc/?query= "
+        p = urllib.parse.urlsplit(noise.decode("utf-8") + query_spaces_url)
+        self.assertEqual(p.scheme, "https")
+        self.assertEqual(p.netloc, "www.python.org:88")
+        self.assertEqual(p.path, "/doc/")
+        self.assertEqual(p.query, "query= ")
+        self.assertEqual(p.port, 88)
+        self.assertEqual(p.geturl(), query_spaces_url)
+
+        p = urllib.parse.urlsplit("www.pypi.org ")
+        # That "hostname" gets considered a "path" due to the
+        # trailing space and our existing logic... YUCK...
+        # and re-assembles via geturl aka unurlsplit into the original.
+        # django.core.validators.URLValidator (at least through v3.2) relies on
+        # this, for better or worse, to catch it in a ValidationError via its
+        # regular expressions.
+        # Here we test the basic round trip concept of such a trailing space.
+        self.assertEqual(urllib.parse.urlunsplit(p), "www.pypi.org ")
+
+        # with scheme as cache-key
+        url = "//www.python.org/"
+        scheme = noise.decode("utf-8") + "https" + noise.decode("utf-8")
+        for _ in range(2):
+            p = urllib.parse.urlsplit(url, scheme=scheme)
+            self.assertEqual(p.scheme, "https")
+            self.assertEqual(p.geturl(), "https://www.python.org/")
+
     def test_attributes_bad_port(self):
         """Check handling of invalid ports."""
         for bytes in (False, True):
             for parse in (urllib.parse.urlsplit, urllib.parse.urlparse):
                 for port in ("foo", "1.5", "-1", "0x10"):
                     with self.subTest(bytes=bytes, parse=parse, port=port):
                         netloc = "www.example.net:" + port
-                        url = "http://" + netloc
+                        url = "http://" + netloc + "/"
                         if bytes:
                             netloc = netloc.encode("ascii")
                             url = url.encode("ascii")

‎Lib/urllib/parse.py

Lines changed: 13 additions & 0 deletions
@@ -25,6 +25,10 @@
 scenarios for parsing, and for backward compatibility purposes, some
 parsing quirks from older RFCs are retained. The testcases in
 test_urlparse.py provides a good indicator of parsing behavior.
+
+The WHATWG URL Parser spec should also be considered. We are not compliant with
+it either due to existing user code API behavior expectations (Hyrum's Law).
+It serves as a useful guide when making changes.
 """

 import re
@@ -76,6 +80,10 @@
                 '0123456789'
                 '+-.')

+# Leading and trailing C0 control and space to be stripped per WHATWG spec.
+# == "".join([chr(i) for i in range(0, 0x20 + 1)])
+_WHATWG_C0_CONTROL_OR_SPACE = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f '
+
 # Unsafe bytes to be removed per WHATWG spec
 _UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']

@@ -426,6 +434,11 @@ def urlsplit(url, scheme='', allow_fragments=True):
     url, scheme, _coerce_result = _coerce_args(url, scheme)
     url = _remove_unsafe_bytes_from_url(url)
     scheme = _remove_unsafe_bytes_from_url(scheme)
+    # Only lstrip url as some applications rely on preserving trailing space.
+    # (https://url.spec.whatwg.org/#concept-basic-url-parser would strip both)
+    url = url.lstrip(_WHATWG_C0_CONTROL_OR_SPACE)
+    scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)
+
     allow_fragments = bool(allow_fragments)
     key = url, scheme, allow_fragments, type(url), type(scheme)
     cached = _parse_cache.get(key, None)
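
Reviewer note: the asymmetry in this hunk (`lstrip` for the URL, `strip` for the scheme) is easy to miss. The sketch below restates the two sanitizing steps as a standalone function so the effect can be seen without depending on which Python version is installed. The function name `sanitize_like_patch` and the module-level names are hypothetical stand-ins for the private helpers in `Lib/urllib/parse.py`, and the sketch handles `str` input only.

```python
# Code points 0x00 through 0x20: the C0 controls plus space.
C0_CONTROL_OR_SPACE = "".join(chr(i) for i in range(0, 0x20 + 1))

# Characters removed from any position, mirroring the existing
# "unsafe bytes" step that runs before the new stripping.
UNSAFE_CHARS = ("\t", "\r", "\n")


def sanitize_like_patch(url, scheme=""):
    """Approximate the sanitizing that urlsplit applies after this patch."""
    # Step 1 (pre-existing): drop tab, CR and LF wherever they occur.
    for ch in UNSAFE_CHARS:
        url = url.replace(ch, "")
        scheme = scheme.replace(ch, "")
    # Step 2 (new): strip only LEADING C0 control/space from the URL,
    # because some callers rely on trailing spaces being preserved.
    url = url.lstrip(C0_CONTROL_OR_SPACE)
    # The default scheme is stripped on both ends.
    scheme = scheme.strip(C0_CONTROL_OR_SPACE)
    return url, scheme
```

With this sketch, a URL such as `" \x00https://www.python.org/doc/?query= "` loses its leading noise but keeps the trailing space in the query, which matches what `test_urlsplit_strip_url` asserts for the real function.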
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+:func:`urllib.parse.urlsplit` now strips leading C0 control and space
+characters following the specification for URLs defined by WHATWG in
+response to CVE-2023-24329. Patch by Illia Volochii.
