Commit 89818a5

[3.14] gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser (GH-137837) (GH-140841)

* the "plaintext" element
* the RAWTEXT elements "xmp", "iframe", "noembed" and "noframes"
* optionally RAWTEXT (if scripting=True) element "noscript"

(cherry picked from commit a17c57e)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

1 parent d0c78a4, commit 89818a5

4 files changed: +163 -114 lines changed


Doc/library/html.parser.rst

Lines changed: 20 additions & 13 deletions
@@ -15,14 +15,18 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
 
-.. class:: HTMLParser(*, convert_charrefs=True)
+.. class:: HTMLParser(*, convert_charrefs=True, scripting=False)
 
    Create a parser instance able to parse invalid markup.
 
-   If *convert_charrefs* is ``True`` (the default), all character
-   references (except the ones in ``script``/``style`` elements) are
+   If *convert_charrefs* is true (the default), all character
+   references (except the ones in elements like ``script`` and ``style``) are
    automatically converted to the corresponding Unicode characters.
 
+   If *scripting* is false (the default), the content of the ``noscript``
+   element is parsed normally; if it's true, it's returned as is without
+   being parsed.
+
    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
    when start tags, end tags, text, comments, and other markup elements are
    encountered. The user should subclass :class:`.HTMLParser` and override its
@@ -37,6 +41,9 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
    .. versionchanged:: 3.5
       The default value for argument *convert_charrefs* is now ``True``.
 
+   .. versionchanged:: 3.14.1
+      Added the *scripting* parameter.
+
 
 Example HTML Parser Application
 -------------------------------
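
The *scripting* parameter added above only exists on interpreters that include this change (3.14.1 according to the ``versionchanged`` note). The sketch below shows one way calling code might feature-detect it rather than assume it. `EventCollector`, `supports_scripting` and `parse_events` are hypothetical names introduced for illustration; they are not part of `html.parser`.

```python
import inspect
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser callbacks as (kind, value) tuples."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def supports_scripting():
    """Return True if this interpreter's HTMLParser accepts *scripting*."""
    return "scripting" in inspect.signature(HTMLParser.__init__).parameters


def parse_events(markup, scripting=False):
    # Pass *scripting* only where it exists; older versions would raise TypeError.
    kwargs = {"scripting": scripting} if supports_scripting() else {}
    parser = EventCollector(**kwargs)
    parser.feed(markup)
    parser.close()
    return parser.events


# With scripting disabled (the default), <noscript> content is parsed as markup.
events = parse_events("<noscript><b>hi</b></noscript>", scripting=False)
print(events)
```

On an interpreter that supports the parameter, calling `parse_events(..., scripting=True)` should, per the documentation in this diff, deliver the `<b>hi</b>` text through `handle_data` instead of producing a start-tag event for `b`.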
@@ -161,24 +168,24 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
 .. method:: HTMLParser.handle_data(data)
 
    This method is called to process arbitrary data (e.g. text nodes and the
-   content of ``<script>...</script>`` and ``<style>...</style>``).
+   content of elements like ``script`` and ``style``).
 
 
 .. method:: HTMLParser.handle_entityref(name)
 
    This method is called to process a named character reference of the form
    ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
-   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
-   ``True``.
+   (e.g. ``'gt'``).
+   This method is only called if *convert_charrefs* is false.
 
 
 .. method:: HTMLParser.handle_charref(name)
 
    This method is called to process decimal and hexadecimal numeric character
    references of the form :samp:`&#{NNN};` and :samp:`&#x{NNN};`. For example, the decimal
    equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
-   in this case the method will receive ``'62'`` or ``'x3E'``. This method
-   is never called if *convert_charrefs* is ``True``.
+   in this case the method will receive ``'62'`` or ``'x3E'``.
+   This method is only called if *convert_charrefs* is false.
 
 
 .. method:: HTMLParser.handle_comment(data)
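
To illustrate the two reference handlers touched by this hunk, here is a minimal sketch of a subclass created with `convert_charrefs=False`, the only mode in which the revised text says `handle_entityref` and `handle_charref` are called. `RefLogger` and the sample input are made up for this example.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical subclass that logs text and character-reference callbacks."""

    def __init__(self):
        # Character references are only reported separately in this mode.
        super().__init__(convert_charrefs=False)
        self.log = []

    def handle_data(self, data):
        self.log.append(("data", data))

    def handle_entityref(self, name):
        # Receives the bare name, e.g. 'gt' for '&gt;'.
        self.log.append(("entityref", name))

    def handle_charref(self, name):
        # Receives e.g. '62' for '&#62;' and 'x3E' for '&#x3E;'.
        self.log.append(("charref", name))


parser = RefLogger()
parser.feed("a &gt; b &#62; c &#x3E; d")
parser.close()

for kind, value in parser.log:
    print(kind, repr(value))
```

With the default `convert_charrefs=True`, the same input would reach `handle_data` with the references already replaced by `>`, and neither reference handler would run.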
@@ -292,8 +299,8 @@ Parsing an element with a few attributes and a title:
    Data     : Python
    End tag  : h1
 
-The content of ``script`` and ``style`` elements is returned as is, without
-further parsing:
+The content of elements like ``script`` and ``style`` is returned as is,
+without further parsing:
 
 .. doctest::
 
@@ -304,10 +311,10 @@ further parsing:
    End tag  : style
 
    >>> parser.feed('<script type="text/javascript">'
-   ...             'alert("<strong>hello!</strong>");</script>')
+   ...             'alert("<strong>hello! &#9786;</strong>");</script>')
    Start tag: script
         attr: ('type', 'text/javascript')
-   Data     : alert("<strong>hello!</strong>");
+   Data     : alert("<strong>hello! &#9786;</strong>");
    End tag  : script
 
 Parsing comments:
@@ -336,7 +343,7 @@ correct char (note: these 3 references are all equivalent to ``'>'``):
 
 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
 :meth:`~HTMLParser.handle_data` might be called more than once
-(unless *convert_charrefs* is set to ``True``):
+if *convert_charrefs* is false:
 
 .. doctest::
 
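
The chunked-feeding behaviour described in this hunk can be reproduced with a small sketch. It assumes a parser created with `convert_charrefs=False`, the case the revised sentence refers to; `DataCollector` and the chunk list are illustrative only.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical subclass that stores every handle_data() argument."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = DataCollector()
# The markup is split at arbitrary points, including inside the tags.
for piece in ["<sp", "an>buff", "ered ", "text</s", "pan>"]:
    parser.feed(piece)
parser.close()

print(parser.chunks)            # text arrives in several pieces
print("".join(parser.chunks))   # joined back into a single text node
```

Code that needs whole text nodes in this mode has to join the pieces itself, as the last line does.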

Lib/html/parser.py

Lines changed: 18 additions & 6 deletions
@@ -127,17 +127,25 @@ class HTMLParser(_markupbase.ParserBase):
     argument.
     """
 
-    CDATA_CONTENT_ELEMENTS = ("script", "style")
+    # See the HTML5 specs section "13.4 Parsing HTML fragments".
+    # https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
+    # CDATA_CONTENT_ELEMENTS are parsed in RAWTEXT mode
+    CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
     RCDATA_CONTENT_ELEMENTS = ("textarea", "title")
 
-    def __init__(self, *, convert_charrefs=True):
+    def __init__(self, *, convert_charrefs=True, scripting=False):
         """Initialize and reset this instance.
 
-        If convert_charrefs is True (the default), all character references
+        If convert_charrefs is true (the default), all character references
         are automatically converted to the corresponding Unicode characters.
+
+        If *scripting* is false (the default), the content of the
+        ``noscript`` element is parsed normally; if it's true,
+        it's returned as is without being parsed.
         """
         super().__init__()
         self.convert_charrefs = convert_charrefs
+        self.scripting = scripting
         self.reset()
 
     def reset(self):
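
Because the RAWTEXT element list is a class attribute, code that depends on it can inspect the running interpreter instead of hard-coding the tuple from this diff. The following is a sketch; `is_rawtext` is a hypothetical helper, and the `getattr` fallback is an assumption for interpreters that predate `RCDATA_CONTENT_ELEMENTS`.

```python
from html.parser import HTMLParser

# CDATA_CONTENT_ELEMENTS is the class attribute extended in this hunk.
rawtext = tuple(HTMLParser.CDATA_CONTENT_ELEMENTS)

# RCDATA_CONTENT_ELEMENTS may be absent on older interpreters, so fall back to ().
rcdata = tuple(getattr(HTMLParser, "RCDATA_CONTENT_ELEMENTS", ()))


def is_rawtext(tag):
    """Hypothetical helper: is *tag* treated as RAWTEXT by this interpreter?"""
    return tag.lower() in rawtext


print("RAWTEXT elements:", rawtext)
print("RCDATA elements: ", rcdata)
print("xmp handled as RAWTEXT here:", is_rawtext("xmp"))
```

The last line prints `True` only on interpreters that include this change, which makes it a cheap runtime check for the new behaviour.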
@@ -172,7 +180,9 @@ def get_starttag_text(self):
     def set_cdata_mode(self, elem, *, escapable=False):
         self.cdata_elem = elem.lower()
         self._escapable = escapable
-        if escapable and not self.convert_charrefs:
+        if self.cdata_elem == 'plaintext':
+            self.interesting = re.compile(r'\z')
+        elif escapable and not self.convert_charrefs:
             self.interesting = re.compile(r'&|</%s(?=[\t\n\r\f />])' % self.cdata_elem,
                                           re.IGNORECASE|re.ASCII)
         else:
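
The `plaintext` branch above swaps the usual end-tag search for a pattern that can only match at the very end of the input, so no closing tag ever ends the element. The sketch below shows that effect using `\Z`, the long-standing spelling of the end-of-string anchor in `re`. It assumes `\z` behaves identically on interpreters that accept it, which to my knowledge means 3.14 and later. The end-tag pattern is modelled on the context lines of the hunk, and the sample string is made up.

```python
import re

# \Z matches only at the end of the string; the diff uses the \z spelling,
# which is assumed here to be equivalent where it is supported.
end_of_input = re.compile(r'\Z')

# An end-tag pattern for comparison, modelled on the hunk's context lines.
end_tag = re.compile(r'</%s(?=[\t\n\r\f />])' % 'plaintext',
                     re.IGNORECASE | re.ASCII)

sample = 'raw <b>text</b> </plaintext> still text'

tag_match = end_tag.search(sample)
eoi_match = end_of_input.search(sample)

# The end-tag pattern stops at "</plaintext>"; the end-of-input pattern
# only matches after the last character.
print(tag_match.start(), eoi_match.start(), len(sample))
```

With the end-of-input pattern as the "interesting" marker, everything after `<plaintext>` is plain data, including what looks like its own closing tag.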
@@ -444,8 +454,10 @@ def parse_starttag(self, i):
             self.handle_startendtag(tag, attrs)
         else:
             self.handle_starttag(tag, attrs)
-            if tag in self.CDATA_CONTENT_ELEMENTS:
-                self.set_cdata_mode(tag)
+            if (tag in self.CDATA_CONTENT_ELEMENTS or
+                (self.scripting and tag == "noscript") or
+                tag == "plaintext"):
+                self.set_cdata_mode(tag, escapable=False)
             elif tag in self.RCDATA_CONTENT_ELEMENTS:
                 self.set_cdata_mode(tag, escapable=True)
         return endpos
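
The branch order in this hunk can be restated as a small standalone function. This is a sketch that mirrors the diff rather than the parser's real code path: `content_mode` and the string labels it returns are hypothetical, and the two tuples are copied from the new class attributes shown earlier in the commit.

```python
# Copied from the new class attributes in this commit.
RAWTEXT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
RCDATA_ELEMENTS = ("textarea", "title")


def content_mode(tag, scripting=False):
    """Hypothetical helper: name the mode used for the content of *tag*."""
    tag = tag.lower()
    if tag == "plaintext":
        # Never terminated: the rest of the input is plain data.
        return "plaintext"
    if tag in RAWTEXT_ELEMENTS or (scripting and tag == "noscript"):
        # Content returned as is; character references are not converted.
        return "rawtext"
    if tag in RCDATA_ELEMENTS:
        # Tags are not parsed, but character references may be converted.
        return "rcdata"
    return "normal"


for name in ("script", "noscript", "plaintext", "title", "p"):
    print(name, content_mode(name), content_mode(name, scripting=True))
```

Only `noscript` changes mode between the two columns, which matches the commit message's description of it as "optionally RAWTEXT".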
