Commit 89818a5

miss-islington and serhiy-storchaka authored

[3.14] gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser (GH-137837) (GH-140841)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"

(cherry picked from commit a17c57e)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

1 parent d0c78a4, commit 89818a5

File tree

4 files changed, +163 -114 lines changed

Doc/library
- html.parser.rst
Lib
- html
  - parser.py
- test
  - test_htmlparser.py
Misc/NEWS.d/next/Security
- 2025-08-15-23-08-44.gh-issue-137836.b55rhh.rst


`Doc/library/html.parser.rst`

Lines changed: 20 additions & 13 deletions

Original file line number	Diff line number	Diff line change
`@@ -15,14 +15,18 @@`
`15`	`15`	This module defines a class :class:`HTMLParser` which serves as the basis for
`16`	`16`	`parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.`
`17`	`17`
`18`		`-.. class:: HTMLParser(*, convert_charrefs=True)`
	`18`	`+.. class:: HTMLParser(*, convert_charrefs=True, scripting=False)`
`19`	`19`
`20`	`20`	`Create a parser instance able to parse invalid markup.`
`21`	`21`
`22`		- If convert_charrefs is ``True`` (the default), all character
`23`		- references (except the ones in ``script``/``style`` elements) are
	`22`	`+ If convert_charrefs is true (the default), all character`
	`23`	+ references (except the ones in elements like ``script`` and ``style``) are
`24`	`24`	`automatically converted to the corresponding Unicode characters.`
`25`	`25`
	`26`	+ If scripting is false (the default), the content of the ``noscript``
	`27`	`+ element is parsed normally; if it's true, it's returned as is without`
	`28`	`+ being parsed.`
	`29`	`+`
`26`	`30`	An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
`27`	`31`	`when start tags, end tags, text, comments, and other markup elements are`
`28`	`32`	encountered. The user should subclass :class:`.HTMLParser` and override its
`@@ -37,6 +41,9 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.`
`37`	`41`	`.. versionchanged:: 3.5`
`38`	`42`	The default value for argument convert_charrefs is now ``True``.
`39`	`43`
	`44`	`+ .. versionchanged:: 3.14.1`
	`45`	`+ Added the scripting parameter.`
	`46`	`+`
`40`	`47`
`41`	`48`	`Example HTML Parser Application`
`42`	`49`	`-------------------------------`
@@ -161,24 +168,24 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
`161`	`168`	`.. method:: HTMLParser.handle_data(data)`
`162`	`169`
`163`	`170`	`This method is called to process arbitrary data (e.g. text nodes and the`
`164`		- content of ``<script>...</script>`` and ``<style>...</style>``).
	`171`	+ content of elements like ``script`` and ``style``).
`165`	`172`
`166`	`173`
`167`	`174`	`.. method:: HTMLParser.handle_entityref(name)`
`168`	`175`
`169`	`176`	`This method is called to process a named character reference of the form`
`170`	`177`	``&name;`` (e.g. ``&gt;``), where name is a general entity reference
`171`		- (e.g. ``'gt'``). This method is never called if convert_charrefs is
`172`		-``True``.
	`178`	+ (e.g. ``'gt'``).
	`179`	`+This method is only called if convert_charrefs is false.`
`173`	`180`
`174`	`181`
`175`	`182`	`.. method:: HTMLParser.handle_charref(name)`
`176`	`183`
`177`	`184`	`This method is called to process decimal and hexadecimal numeric character`
`178`	`185`	references of the form :samp:`&#{NNN};` and :samp:`&#x{NNN};`. For example, the decimal
`179`	`186`	equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
`180`		- in this case the method will receive ``'62'`` or ``'x3E'``. This method
`181`		- is never called if convert_charrefs is ``True``.
	`187`	+ in this case the method will receive ``'62'`` or ``'x3E'``.
	`188`	`+This method is only called if convert_charrefs is false.`
`182`	`189`
`183`	`190`
`184`	`191`	`.. method:: HTMLParser.handle_comment(data)`
`@@ -292,8 +299,8 @@ Parsing an element with a few attributes and a title:`
`292`	`299`	`Data : Python`
`293`	`300`	`End tag : h1`
`294`	`301`
`295`		-The content of ``script`` and ``style`` elements is returned as is, without
`296`		`-further parsing:`
	`302`	+The content of elements like ``script`` and ``style`` is returned as is,
	`303`	`+without further parsing:`
`297`	`304`
`298`	`305`	`.. doctest::`
`299`	`306`
`@@ -304,10 +311,10 @@ further parsing:`
`304`	`311`	`End tag : style`
`305`	`312`
`306`	`313`	`>>> parser.feed('<script type="text/javascript">'`
`307`		`- ... 'alert("<strong>hello!</strong>");</script>')`
	`314`	`+ ... 'alert("<strong>hello! ☺</strong>");</script>')`
`308`	`315`	`Start tag: script`
`309`	`316`	`attr: ('type', 'text/javascript')`
`310`		`-Data : alert("<strong>hello!</strong>");`
	`317`	`+Data : alert("<strong>hello! ☺</strong>");`
`311`	`318`	`End tag : script`
`312`	`319`
`313`	`320`	`Parsing comments:`
@@ -336,7 +343,7 @@ correct char (note: these 3 references are all equivalent to ``'>'``):
`336`	`343`
`337`	`344`	Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
`338`	`345`	:meth:`~HTMLParser.handle_data` might be called more than once
`339`		-(unless convert_charrefs is set to ``True``):
	`346`	`+if convert_charrefs is false:`
`340`	`347`
`341`	`348`	`.. doctest::`
`342`	`349`
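
To illustrate the documented behavior, here is a minimal sketch of a subclass that records handler calls. `EventCollector` and `collect` are hypothetical names used only for this illustration; they are not part of the patch. The `scripting` keyword is only accepted by interpreters that include this change, so the sketch guards that call instead of assuming it works.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records handler calls as (kind, value) tuples."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def collect(markup, **kwargs):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventCollector(**kwargs)
    parser.feed(markup)
    parser.close()
    return parser.events


# script content is returned as is: the <strong> markup arrives as data.
script_events = collect('<script>alert("<strong>hi</strong>");</script>')

# With convert_charrefs false, references go to the dedicated handlers.
charref_events = collect("&#62;&#x3E;&gt;", convert_charrefs=False)

# The scripting parameter exists only where this change is present;
# older interpreters raise TypeError for the unknown keyword.
try:
    noscript_events = collect("<noscript><b>x</b></noscript>", scripting=True)
except TypeError:
    noscript_events = None
```

On an interpreter with the change applied, `noscript_events` should report `<b>x</b>` as data rather than as tags, per the documentation text in this diff; on older interpreters it stays `None`.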

`Lib/html/parser.py`

Lines changed: 18 additions & 6 deletions

Original file line number	Diff line number	Diff line change
`@@ -127,17 +127,25 @@ class HTMLParser(_markupbase.ParserBase):`
`127`	`127`	`argument.`
`128`	`128`	`"""`
`129`	`129`
`130`		`-CDATA_CONTENT_ELEMENTS = ("script", "style")`
	`130`	`+# See the HTML5 specs section "13.4 Parsing HTML fragments".`
	`131`	`+# https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments`
	`132`	`+# CDATA_CONTENT_ELEMENTS are parsed in RAWTEXT mode`
	`133`	`+CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")`
`131`	`134`	`RCDATA_CONTENT_ELEMENTS = ("textarea", "title")`
`132`	`135`
`133`		`-def __init__(self, *, convert_charrefs=True):`
	`136`	`+def __init__(self, *, convert_charrefs=True, scripting=False):`
`134`	`137`	`"""Initialize and reset this instance.`
`135`	`138`
`136`		`- If convert_charrefs is True (the default), all character references`
	`139`	`+ If convert_charrefs is true (the default), all character references`
`137`	`140`	`are automatically converted to the corresponding Unicode characters.`
	`141`	`+`
	`142`	`+ If scripting is false (the default), the content of the`
	`143`	+ ``noscript`` element is parsed normally; if it's true,
	`144`	`+ it's returned as is without being parsed.`
`138`	`145`	`"""`
`139`	`146`	`super().__init__()`
`140`	`147`	`self.convert_charrefs = convert_charrefs`
	`148`	`+self.scripting = scripting`
`141`	`149`	`self.reset()`
`142`	`150`
`143`	`151`	`def reset(self):`
`@@ -172,7 +180,9 @@ def get_starttag_text(self):`
`172`	`180`	`def set_cdata_mode(self, elem, *, escapable=False):`
`173`	`181`	`self.cdata_elem = elem.lower()`
`174`	`182`	`self._escapable = escapable`
`175`		`-if escapable and not self.convert_charrefs:`
	`183`	`+if self.cdata_elem == 'plaintext':`
	`184`	`+self.interesting = re.compile(r'\z')`
	`185`	`+elif escapable and not self.convert_charrefs:`
`176`	`186`	`self.interesting = re.compile(r'&|</%s(?=[\t\n\r\f />])' % self.cdata_elem,`
`177`	`187`	`re.IGNORECASE|re.ASCII)`
`178`	`188`	`else:`
`@@ -444,8 +454,10 @@ def parse_starttag(self, i):`
`444`	`454`	`self.handle_startendtag(tag, attrs)`
`445`	`455`	`else:`
`446`	`456`	`self.handle_starttag(tag, attrs)`
`447`		`-if tag in self.CDATA_CONTENT_ELEMENTS:`
`448`		`-self.set_cdata_mode(tag)`
	`457`	`+if (tag in self.CDATA_CONTENT_ELEMENTS or`
	`458`	`+ (self.scripting and tag == "noscript") or`
	`459`	`+tag == "plaintext"):`
	`460`	`+self.set_cdata_mode(tag, escapable=False)`
`449`	`461`	`elif tag in self.RCDATA_CONTENT_ELEMENTS:`
`450`	`462`	`self.set_cdata_mode(tag, escapable=True)`
`451`	`463`	`return endpos`
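
The dispatch in `parse_starttag` and the `plaintext` branch of `set_cdata_mode` can be summarized in a standalone sketch. `content_mode` and `end_pattern` are hypothetical helpers written for this note, not functions from `Lib/html/parser.py`. The sketch uses `r'\Z'` instead of the patch's `r'\z'`, on the assumption that both denote the same end-of-string anchor and that `\Z` is accepted by older interpreters. The non-escapable pattern is likewise an assumption, derived from the escapable one in the diff by dropping the `&` alternative.

```python
import re

# Tuples copied from the diff above.
CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
RCDATA_CONTENT_ELEMENTS = ("textarea", "title")


def content_mode(tag, scripting=False):
    """Hypothetical helper: name the text mode a start tag switches to."""
    tag = tag.lower()
    if tag == "plaintext":
        return "PLAINTEXT"
    if tag in CDATA_CONTENT_ELEMENTS or (scripting and tag == "noscript"):
        return "RAWTEXT"
    if tag in RCDATA_CONTENT_ELEMENTS:
        return "RCDATA"
    return "DATA"


def end_pattern(tag):
    """Hypothetical helper: regex locating where the raw content of tag stops."""
    tag = tag.lower()
    if tag == "plaintext":
        # Assumption: r'\Z' stands in for the patch's r'\z'. It matches only
        # at the end of the input, so no end tag ever closes the element.
        return re.compile(r"\Z")
    # Assumption: the non-escapable form of the pattern shown in the diff.
    return re.compile(r"</%s(?=[\t\n\r\f />])" % re.escape(tag),
                      re.IGNORECASE | re.ASCII)


text = "a</plaintext>b"
plaintext_stop = end_pattern("plaintext").search(text).start()
xmp_stop = end_pattern("xmp").search("x</XMP>").start()
```

Here `plaintext_stop` equals `len(text)`, which is why everything after a `plaintext` start tag is treated as data, while `xmp_stop` points at the case-insensitive closing tag.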
