
gh-69426: only unescape properly terminated character entities in attribute values#95215


Merged

serhiy-storchaka merged 8 commits into python:main from sissbruecker:gh-69426-htmlparser-attribute-entities

May 7, 2025


Conversation

sissbruecker (Contributor) commented Jul 24, 2022 (edited by bedevere-bot)

Fixes HTMLParser to only unescape named character references in attribute values if they are properly terminated.

According to the HTML5 spec, a named character reference in an attribute value that is not terminated by a semicolon should only be processed if it is not followed by an ASCII alphanumeric or an equals sign. So the following references should be unescaped:

&cent
&cent foo
&cent-foo

While the following should not:

&center
&cent=

This change adds unescaping logic specific to attribute values that should cover these cases.
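For context, the plain `html.unescape` function treats all five examples above the same way, because it falls back to the longest entity name that prefixes the text after the ampersand. The snippet below is only an illustration using the standard library; the `examples` and `results` names are made up for this sketch.

```python
from html import unescape

# The five attribute-value fragments listed in the description above.
examples = ['&cent', '&cent foo', '&cent-foo', '&center', '&cent=']

# html.unescape has no attribute-specific rule, so every example
# gets its leading "&cent" replaced by the cent sign.
results = {value: unescape(value) for value in examples}

for value, converted in results.items():
    print(f'{value!r:12} -> {converted!r}')
# '&center' becomes '¢er' and '&cent=' becomes '¢=',
# which is the behaviour this change avoids for attribute values.
```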

Fixes: #69426

Issue: HTMLParser handle_starttag replaces entity references in attribute value even without semicolon #69426

gh-69426: only unescape properly terminated character entities in attribute values (71a89f9)

sissbruecker requested a review from ezio-melotti as a code owner

July 24, 2022 19:14

ghost commented Jul 24, 2022 (edited by ghost)

All commit authors signed the Contributor License Agreement.

bedevere-bot added the awaiting review label

Jul 24, 2022

sissbruecker commented Jul 24, 2022

Lib/html/parser.py Outdated

		@@ -57,6 +58,26 @@
		# </ and the tag name, so maybe this should be fixed
		endtagfind = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

		# Character reference processing logic specific to attribute values
		# See: https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
		attr_charref = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')

sissbruecker (Contributor, Author) commented Jul 24, 2022

This partially duplicates an existing regex, but I was not able to reuse the existing one for this purpose.

ezio-melotti (Member) commented Jan 14, 2023

Since the issue only seems to affect named character references, is there a reason to include numeric charrefs too in this regex?

sissbruecker (Contributor, Author) commented Jan 14, 2023

Yes, the new _unescape_attrvalue is effectively a wrapper for html.unescape that only delegates to html.unescape if the attribute-specific conditions are met. Since we still want to unescape numeric and hex char refs in attributes, we need to include them in the regex.

serhiy-storchaka (Member) commented May 6, 2025

It would be better to move this immediately after the definition of entityref and charref. If we change one regexp, we will not forget to change the other.

sissbruecker (Contributor, Author) commented May 6, 2025

Done

Lib/test/test_htmlparser.py Outdated

Comment on lines 350 to 360

		expected = [('starttag', 'a', [('href', 'foo"zar')]),
		expected = [('starttag', 'a', [('href', 'foo "zar')]),
		('data', 'a"z'), ('endtag', 'a')]
		for charref in charrefs:
		self._run_check('<a href="foo{0}zar">a{0}z</a>'.format(charref),
		self._run_check('<a href="foo{0}zar">a{0}z</a>'.format(charref),
		expected, collector=collector())
		# check charrefs at the beginning/end of the text/attributes
		# check charrefs at the beginning/end of the text
		expected = [('data', '"'),
		('starttag', 'a', [('x', '"'), ('y', '"X'), ('z', 'X"')]),
		('starttag', 'a', []),
		('data', '"'), ('endtag', 'a'), ('data', '"')]
		for charref in charrefs:
		self._run_check('{0}<a x="{0}" y="{0}X" z="X{0}">'
		self._run_check('{0}<a>'

sissbruecker (Contributor, Author) commented Jul 24, 2022

Changed the existing tests to remove flawed assumptions about how the unescaping in attribute values should work.

ezio-melotti (Member) commented Jan 14, 2023

It might be better to remove all attribute-related checks from this test and move them to the next one.

sissbruecker (Contributor, Author) commented Jan 14, 2023

Done

Lib/test/test_htmlparser.py Outdated

Comment on lines 418 to 423

		# do unescape char refs at beginning and end of text attributes
		charrefs = ['&quot;', '&#34;', '&#x22;', '&quot', '&#34', '&#x22']
		expected = [('starttag', 'a', [('x', '"'), ('y', '"-X'), ('z', 'X-"')]), ('endtag', 'a')]
		for charref in charrefs:
		self._run_check('<a x="{0}" y="{0}-X" z="X-{0}"></a>'.format(charref),
		expected, collector=collector())

sissbruecker (Contributor, Author) commented Jul 24, 2022

Extracted this from test_convert_charrefs

fix typo

bebae0a

sissbruecker mentioned this pull request

Jan 6, 2023

Unwanted modification of special URLs on import sissbruecker/linkding#291

Open

sissbruecker (Contributor, Author) commented Jan 6, 2023

@ezio-melotti I see you are marked as code owner. Would there be any interest in moving ahead with this?

ezio-melotti reviewed Jan 14, 2023

ezio-melotti (Member) left a comment

Thanks for the PR!
I left a few inline comments, but if you prefer I could also make the suggested changes myself and push them to your branch.


Lib/html/parser.py Outdated

		    return ref

		def unescape_attrvalue(s):
		    return attr_charref.sub(replace_attr_charref, s)

ezio-melotti (Member) commented Jan 14, 2023

Both functions should be private, with their names prefixed by an _.

sissbruecker (Contributor, Author) commented Jan 14, 2023

Done

Lib/html/parser.py Outdated

		def replace_attr_charref(match):
		    ref = match.group(0)
		    # Numeric / hex char refs must always be unescaped
		    if ref[1] == '#':

ezio-melotti (Member) commented Jan 14, 2023

Suggested change

	- if ref[1] == '#':
	+ if ref.startswith('&#'):

I think this is clearer.

sissbruecker (Contributor, Author) commented Jan 14, 2023

Done

Lib/html/parser.py Outdated

		        return unescape(ref)
		    # Named character / entity references must only be unescaped
		    # if they are an exact match, and they are not followed by an equals sign
		    terminates_with_equals = ref[-1:] == '='

ezio-melotti (Member) commented Jan 14, 2023

Suggested change

	- terminates_with_equals = ref[-1:] == '='
	+ terminates_with_equals = ref.endswith('=')

sissbruecker (Contributor, Author) commented Jan 14, 2023

Done


Lib/test/test_htmlparser.py Outdated

		expected = [('starttag', 'a',
		[('href', 'https://example.com?foo&cent=123')]),
		('endtag', 'a')]
		self._run_check('<a href="https://example.com?foo&cent=123"></a>', expected, collector=collector())

ezio-melotti (Member) commented Jan 14, 2023

If possible, it would be better to match the style of the previous test, creating different lists of charrefs (e.g. valid, invalid, named, numeric, etc.) and add them in different places in the attribute (beginning, end, before an alnum/space/semicolon/equal).

Also try to keep the lines shorter than 80 chars (you can remove the initial part of the URLs, since they are not necessary).

sissbruecker (Contributor, Author) commented Jan 14, 2023

Thanks. Looking at it, combining multiple cases in a single attribute is indeed hard to read. I restructured the test to have two scenarios:

terminated entity, numeric and hex char refs
unterminated entity char refs

Both include cases for start, middle, end, as well as followed by alphanumeric, non-alphanumeric and equals sign. I hope it's a bit clearer now.

Also updated formatting to respect the 80 char limit.
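The first scenario could look roughly like the sketch below. It is not the test from the PR: `AttrCollector`, `parse_attrs`, and the attribute names in `template` are hypothetical helpers written for illustration, using only the public `html.parser.HTMLParser` API.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records the attributes of every start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def parse_attrs(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


# Terminated named, numeric and hex references for the double quote.
terminated = ['&quot;', '&#34;', '&#x22;']

# One attribute per position: start, middle, end, then followed by a
# non-alphanumeric character and by an equals sign.
template = ('<a start="{0}X" middle="X{0}X" end="X{0}" '
            'dash="{0}-" eq="{0}=">')

parsed = {ref: parse_attrs(template.format(ref)) for ref in terminated}
for ref, attrs in parsed.items():
    print(ref, attrs)
```

Terminated references are unescaped in every position, so all three entries yield the same attribute list. For unterminated references such as `&quot` the result depends on whether the interpreter includes this fix, so the sketch leaves that scenario out.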

sissbruecker added 3 commits

January 14, 2023 19:16

Address review comments in parser.py

a7af750

Extract attribute tests from test_convert_charrefs

f915b19

Refactor attribute unescape tests

6c65830

bedevere-bot mentioned this pull request

Jan 14, 2023

HTMLParser handle_starttag replaces entity references in attribute value even without semicolon #69426

Closed

sissbruecker requested a review from ezio-melotti

January 14, 2023 19:31

sissbruecker (Contributor, Author) commented Feb 6, 2023

Thanks for taking the time to review, @ezio-melotti. I have addressed all comments. Could you please take another look when you find some time?

kurtqq commented Jun 15, 2023

Ping @ezio-melotti on this one; it would be nice to get it fixed.

Merge branch 'main' into gh-69426-htmlparser-attribute-entities

e8263ae

serhiy-storchaka self-requested a review

May 6, 2025 19:17

serhiy-storchaka reviewed

May 6, 2025


sissbruecker added 2 commits

May 6, 2025 22:33

address review comments

ec1341b

fix docs class reference

fb77f97

serhiy-storchaka approved these changes May 7, 2025

serhiy-storchaka (Member) left a comment

LGTM. 👍

Thank you for your contribution, @sissbruecker.

Lib/html/parser.py

		@@ -23,6 +24,7 @@

		entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

serhiy-storchaka (Member) commented May 7, 2025

I wonder why there are . and - symbols in the name here? It may not be related to this issue.
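A quick way to explore this question is to check what the quoted `entityref` pattern accepts and whether any name in the standard `html.entities.html5` table contains those characters. This is only an exploratory sketch; the sample string and the `names_with_punct` variable are made up for illustration.

```python
import re
from html.entities import html5

# Pattern as quoted in the diff above.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# The character class allows '.' and '-' inside the captured name.
m = entityref.match('&foo.bar-baz;')
print(m.group(1))  # foo.bar-baz

# List any known HTML5 entity names that actually use '.' or '-'.
names_with_punct = [name for name in html5 if '.' in name or '-' in name]
print(names_with_punct)
```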

bedevere-app (bot) added awaiting merge and removed awaiting review labels

May 7, 2025

serhiy-storchaka enabled auto-merge (squash)

May 7, 2025 06:23

serhiy-storchaka added needs backport to 3.13 (bugs and security fixes) and needs backport to 3.14 (bugs and security fixes) labels

May 7, 2025

serhiy-storchaka merged commit 77b14a6 into python:main

May 7, 2025

47 checks passed

miss-islington-app (bot) commented May 7, 2025

Thanks @sissbruecker for the PR, and @serhiy-storchaka for merging it 🌮🎉. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

bedevere-app (bot) removed the awaiting merge label

May 7, 2025

miss-islington-app (bot) commented May 7, 2025

Sorry @sissbruecker and @serhiy-storchaka, I had trouble checking out the 3.14 backport branch.
Please retry by removing and re-adding the "needs backport to 3.14" label.
Alternatively, you can backport using cherry_picker on the command line.

cherry_picker 77b14a6d58e527f915966446eb0866652a46feb5 3.14

miss-islington-app (bot) assigned serhiy-storchaka

May 7, 2025

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request

May 7, 2025

gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215) (318884d)

According to the HTML5 spec, named character references in attribute values should only be processed if they are not followed by an ASCII alphanumeric, or an equals sign.

(cherry picked from commit 77b14a6)

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>

https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

bedevere-app (bot) commented May 7, 2025

GH-133586 is a backport of this pull request to the 3.13 branch.

bedevere-app (bot) removed the needs backport to 3.13 (bugs and security fixes) label

May 7, 2025

serhiy-storchaka added needs backport to 3.14 (bugs and security fixes) and removed needs backport to 3.14 (bugs and security fixes) labels

May 8, 2025

miss-islington-app (bot) commented May 8, 2025

Thanks @sissbruecker for the PR, and @serhiy-storchaka for merging it 🌮🎉. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request

May 8, 2025

gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215) (9e56103)

According to the HTML5 spec, named character references in attribute values should only be processed if they are not followed by an ASCII alphanumeric, or an equals sign.

(cherry picked from commit 77b14a6)

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>

https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

bedevere-app (bot) commented May 8, 2025

GH-133704 is a backport of this pull request to the 3.14 branch.

bedevere-app (bot) removed the needs backport to 3.14 (bugs and security fixes) label

May 8, 2025

serhiy-storchaka pushed a commit that referenced this pull request

May 9, 2025

[3.14] gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215) (GH-133704) (3937c78)

According to the HTML5 spec, named character references in attribute values should only be processed if they are not followed by an ASCII alphanumeric, or an equals sign.

(cherry picked from commit 77b14a6)

https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>

serhiy-storchaka pushed a commit that referenced this pull request

May 9, 2025

[3.13] gh-69426: HTMLParser: only unescape properly terminated character entities in attribute values (GH-95215) (GH-133586) (3e55441)

According to the HTML5 spec, named character references in attribute values should only be processed if they are not followed by an ASCII alphanumeric, or an equals sign.

(cherry picked from commit 77b14a6)

https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>


5 participants
