html.parser — Simple HTML and XHTML parser

Source code: Lib/html/parser.py


This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

class html.parser.HTMLParser(*, convert_charrefs=True)

Create a parser instance able to parse invalid markup.

If convert_charrefs is True (the default), all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters.

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.

This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.

Changed in version 3.4: convert_charrefs keyword argument added.

Changed in version 3.5: The default value for argument convert_charrefs is now True.

Example HTML Parser Application

As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered:

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Encountered a start tag:", tag)

        def handle_endtag(self, tag):
            print("Encountered an end tag :", tag)

        def handle_data(self, data):
            print("Encountered some data  :", data)

    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')

The output will then be:

    Encountered a start tag: html
    Encountered a start tag: head
    Encountered a start tag: title
    Encountered some data  : Test
    Encountered an end tag : title
    Encountered an end tag : head
    Encountered a start tag: body
    Encountered a start tag: h1
    Encountered some data  : Parse me!
    Encountered an end tag : h1
    Encountered an end tag : body
    Encountered an end tag : html

HTMLParser Methods

HTMLParser instances have the following methods:

HTMLParser.feed(data)

Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data must be str.

HTMLParser.close()

Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
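As a minimal sketch of redefining close(), the hypothetical TagCounter subclass below (the class and its counts and closed attributes are illustrative names, not part of the module) calls the base class method first and then records that the input has ended:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical subclass that counts start tags by name."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.closed = False

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def close(self):
        # Always let the base class flush its buffered data first.
        super().close()
        # Extra end-of-input processing goes after the base call.
        self.closed = True


counter = TagCounter()
counter.feed('<ul><li>one</li><li>two</li></ul>')
counter.close()
print(counter.counts)
```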

HTMLParser.reset()

Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

HTMLParser.getpos()

Return current line number and offset.
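A short sketch of how getpos() might be used inside a handler to report where each start tag begins. The PositionRecorder class and its positions attribute are illustrative names only. Line numbers start at 1 and offsets at 0.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that remembers where each start tag began."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple.
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<p>\n  <b>x</b></p>')
recorder.close()
for tag, (line, offset) in recorder.positions:
    print(f"<{tag}> starts at line {line}, offset {offset}")
```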

HTMLParser.get_starttag_text()

Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML “as deployed” or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).

The following methods are called when data or markup elements are encountered and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):
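The sketch below illustrates the difference between the normalized tag name passed to the handler and the original text returned by get_starttag_text(). RawTagRecorder and raw_tags are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() keeps case and spacing.
        self.raw_tags.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A  HREF="https://www.cwi.nl/">CWI</A>')
recorder.close()
print(recorder.raw_tags)
```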

HTMLParser.handle_starttag(tag, attrs)

This method is called to handle the start tag of an element (e.g. <div id="main">).

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag’s <> brackets. The name is converted to lower case, quotes around the value are removed, and character and entity references are replaced.

For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'https://www.cwi.nl/')]).

All entity references from html.entities are replaced in the attribute values.
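Because attrs is a list of (name, value) pairs, it can be turned into a dict when attribute order and duplicates do not matter. The following sketch collects link targets. LinkCollector and links are illustrative names, and the sample markup is made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Attribute names arrive lower-cased, values already unescaped.
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="https://www.cwi.nl/">CWI</A>'
               '<a href="search?q=1&amp;page=2">next</a>'
               '<a name="top">no link</a>')
collector.close()
print(collector.links)
```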

HTMLParser.handle_endtag(tag)

This method is called to handle the end tag of an element (e.g.</div>).

The tag argument is the name of the tag converted to lower case.

HTMLParser.handle_startendtag(tag, attrs)

Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
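A minimal sketch of a subclass that needs this lexical information: it overrides handle_startendtag() so that XHTML-style empty tags are recorded as a single event instead of a start/end pair. EmptyTagRecorder and events are hypothetical names.

```python
from html.parser import HTMLParser


class EmptyTagRecorder(HTMLParser):
    """Hypothetical subclass that distinguishes <tag/> from <tag>."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding replaces the default start-then-end behavior.
        self.events.append(('startend', tag))


recorder = EmptyTagRecorder()
recorder.feed('<p><img src="logo.png" /><br></p>')
recorder.close()
print(recorder.events)
```

Note that a plain <br> without the trailing slash is still reported through handle_starttag().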

HTMLParser.handle_data(data)

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).
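Since script and style content also arrives through handle_data(), a subclass that only wants visible text has to filter it out itself. One possible approach, sketched here with the hypothetical TextExtractor class, is to track whether the parser is currently inside such an element:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical subclass that collects text outside script/style."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # handle_data() may be called several times per text node,
        # so the pieces are joined only when the text is requested.
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        return ''.join(self.parts)


extractor = TextExtractor()
extractor.feed('<p>Hello <b>world</b></p><script>var x = "<i>";</script>')
extractor.close()
print(extractor.text())
```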

HTMLParser.handle_entityref(name)

This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt'). This method is never called if convert_charrefs is True.

HTMLParser.handle_charref(name)

This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.

HTMLParser.handle_comment(data)

This method is called when a comment is encountered (e.g. <!--comment-->).

For example, the comment <!-- comment --> will cause this method to be called with the argument ' comment '.

The content of Internet Explorer conditional comments (condcoms) will also be sent to this method, so, for <!--[if IE 9]>IE9-specific content<![endif]-->, this method will receive '[if IE 9]>IE9-specific content<![endif]'.

HTMLParser.handle_decl(decl)

This method is called to handle an HTML doctype declaration (e.g. <!DOCTYPE html>).

The decl parameter will be the entire contents of the declaration inside the <!...> markup (e.g. 'DOCTYPE html').

HTMLParser.handle_pi(data)

Method called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.

Note

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
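The difference between the two forms can be seen with a small sketch. PIRecorder and instructions are hypothetical names, and the xml-stylesheet instruction is made-up sample input.

```python
from html.parser import HTMLParser


class PIRecorder(HTMLParser):
    """Hypothetical subclass that stores processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


recorder = PIRecorder()
recorder.feed("<?proc color='red'>")                    # SGML style
recorder.feed('<?xml-stylesheet href="style.css"?>')    # XHTML style
recorder.close()
for data in recorder.instructions:
    # A caller that wants XHTML semantics can drop the trailing '?'.
    print(repr(data), '->', repr(data.removesuffix('?')))
```

str.removesuffix() requires Python 3.9 or later; on older versions, data.rstrip('?') is a rough substitute.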

HTMLParser.unknown_decl(data)

This method is called when an unrecognized declaration is read by the parser.

The data parameter will be the entire contents of the declaration inside the <![...]> markup. Overriding this method in a derived class is sometimes useful. The base class implementation does nothing.

Examples

The following class implements a parser that will be used to illustrate more examples:

    from html.parser import HTMLParser
    from html.entities import name2codepoint

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Start tag:", tag)
            for attr in attrs:
                print("     attr:", attr)

        def handle_endtag(self, tag):
            print("End tag  :", tag)

        def handle_data(self, data):
            print("Data     :", data)

        def handle_comment(self, data):
            print("Comment  :", data)

        def handle_entityref(self, name):
            c = chr(name2codepoint[name])
            print("Named ent:", c)

        def handle_charref(self, name):
            if name.startswith('x'):
                c = chr(int(name[1:], 16))
            else:
                c = chr(int(name))
            print("Num ent  :", c)

        def handle_decl(self, data):
            print("Decl     :", data)

    parser = MyHTMLParser()

Parsing a doctype:

    >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    ...             '"http://www.w3.org/TR/html4/strict.dtd">')
    Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
    Start tag: img
         attr: ('src', 'python-logo.png')
         attr: ('alt', 'The Python logo')
    >>>
    >>> parser.feed('<h1>Python</h1>')
    Start tag: h1
    Data     : Python
    End tag  : h1

The content of script and style elements is returned as is, without further parsing:

    >>> parser.feed('<style type="text/css">#python { color: green }</style>')
    Start tag: style
         attr: ('type', 'text/css')
    Data     : #python { color: green }
    End tag  : style

    >>> parser.feed('<script type="text/javascript">'
    ...             'alert("<strong>hello!</strong>");</script>')
    Start tag: script
         attr: ('type', 'text/javascript')
    Data     : alert("<strong>hello!</strong>");
    End tag  : script

Parsing comments:

    >>> parser.feed('<!--a comment-->'
    ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
    Comment  : a comment
    Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the correct char (note: these 3 references are all equivalent to '>'):

    >>> parser = MyHTMLParser()
    >>> parser.feed('&gt;&#62;&#x3E;')
    Data     : >>>

    >>> parser = MyHTMLParser(convert_charrefs=False)
    >>> parser.feed('&gt;&#62;&#x3E;')
    Named ent: >
    Num ent  : >
    Num ent  : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more than once (unless convert_charrefs is set to True):

    >>> for chunk in ['<sp', 'an>buff', 'ered', ' text</s', 'pan>']:
    ...     parser.feed(chunk)
    ...
    Start tag: span
    Data     : buff
    Data     : ered
    Data     :  text
    End tag  : span
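When a caller needs each text node as a single string regardless of how the input was chunked, one option is to accumulate the pieces and join them at the next tag boundary. The sketch below assumes that approach. CoalescingParser, texts, and the private helper names are hypothetical.

```python
from html.parser import HTMLParser


class CoalescingParser(HTMLParser):
    """Hypothetical subclass that joins split text into whole nodes."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._pending_text = []

    def _flush_text(self):
        if self._pending_text:
            self.texts.append(''.join(self._pending_text))
            self._pending_text.clear()

    def handle_data(self, data):
        # May be called once per chunk; just remember the piece.
        self._pending_text.append(data)

    def handle_starttag(self, tag, attrs):
        self._flush_text()

    def handle_endtag(self, tag):
        self._flush_text()

    def close(self):
        super().close()
        self._flush_text()  # emit any text left at end of input


parser = CoalescingParser()
for chunk in ['<sp', 'an>buff', 'ered', ' text</s', 'pan>']:
    parser.feed(chunk)
parser.close()
print(parser.texts)
```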

Parsing invalid HTML (e.g. unquoted attributes) also works:

    >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
    Start tag: p
    Start tag: a
         attr: ('class', 'link')
         attr: ('href', '#main')
    Data     : tag soup
    End tag  : p
    End tag  : a
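Since the parser reports end tags as written and does not check them against start tags, a subclass that cares about nesting has to track it itself. The following is a rough sketch using a stack of open tags. NestingChecker, open_tags, and problems are hypothetical names, and the recovery rule (pop back to the matching start tag) is an assumption made for this example.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical subclass that reports end tags closing the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []   # (end tag seen, tag that was expected)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        expected = self.open_tags[-1] if self.open_tags else None
        if tag != expected:
            self.problems.append((tag, expected))
        if tag in self.open_tags:
            # Pop everything up to and including the matching start tag.
            while self.open_tags.pop() != tag:
                pass


checker = NestingChecker()
checker.feed('<p><a class=link href=#main>tag soup</p ></a>')
checker.close()
print(checker.problems)
```

This sketch treats every start tag as needing an end tag, so void elements such as <br> written without a trailing slash would be reported as unclosed; a fuller version would need to special-case them.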