| Python Library Reference |
This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Unlike the parser in htmllib, this parser is not based on the SGML parser in sgmllib.
An HTMLParser instance is fed HTML data and calls handler functions when tags begin and end. The HTMLParser class is meant to be overridden by the user to provide a desired behavior.
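As a minimal sketch of the subclassing pattern described above, the hypothetical TagLogger class below records each start and end tag it is told about. The class name and its events list are illustrative and not part of the module. The try/except import assumes that on Python 3 the same class is available as html.parser.HTMLParser.

```python
try:
    from HTMLParser import HTMLParser      # module name used in this reference
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class TagLogger(HTMLParser):
    """Hypothetical subclass that records tag events in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag.
        self.events.append(("end", tag))


parser = TagLogger()
parser.feed("<html><body><p>Hello</p></body></html>")
parser.close()
print(parser.events)
```

Collecting events in a list rather than printing inside the handlers keeps the handlers short and makes the result easy to inspect afterwards.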
Unlike the parser in htmllib, this parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.
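The behavior above can be illustrated with a small sketch that feeds markup in which a p element is never explicitly closed. The UnbalancedDemo class and its events list are hypothetical names chosen for this example, and the fallback import assumes the Python 3 module html.parser.

```python
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class UnbalancedDemo(HTMLParser):
    """Hypothetical subclass that records which tag handlers fire."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


demo = UnbalancedDemo()
# The <p> element is closed only implicitly, by the closing </div>.
demo.feed("<div><p>text</div>")
demo.close()
print(demo.events)
# No ("end", "p") event appears: only the end tag present in the input is reported.
```

A subclass that needs balanced tags would have to keep its own stack of open elements; the parser itself does not track nesting.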
HTMLParser instances have the following methods:
The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name will be translated to lower case and double quotes and backslashes in the value have been interpreted. For instance, for the tag <A HREF="http://www.cwi.nl/">, this method would be called as "handle_starttag('a', [('href', 'http://www.cwi.nl/')])".
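The call described above can be reproduced with a short sketch. AttrCollector and its calls list are hypothetical names used only for illustration, and the fallback import assumes the Python 3 module html.parser.

```python
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class AttrCollector(HTMLParser):
    """Hypothetical subclass that stores the arguments of each start-tag call."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.calls = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lower-cased.
        self.calls.append((tag, attrs))


collector = AttrCollector()
collector.feed('<A HREF="http://www.cwi.nl/">')
collector.close()
print(collector.calls)
# [('a', [('href', 'http://www.cwi.nl/')])]
```

Because attrs is a list of pairs rather than a dictionary, repeated attribute names are preserved in order; dict(attrs) is a common shortcut when duplicates do not matter.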
<a .../>). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
'text'. It is intended to be overridden by a derived class; the base class implementation does nothing.
<!...> markup. It is intended to be overridden by a derived class; the base class implementation does nothing.
<?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.
Note: The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing "?" will cause the "?" to be included in data.
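The handlers discussed above can be exercised together in one hedged sketch. EventRecorder and its events list are hypothetical names, the input string is made up for illustration, and the fallback import assumes the Python 3 module html.parser. The example shows an XHTML-style empty tag falling through to the start-tag and end-tag handlers, a comment, and the treatment of a trailing "?" in a processing instruction.

```python
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser     # assumption: Python 3 location


class EventRecorder(HTMLParser):
    """Hypothetical subclass recording several kinds of parser events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))


recorder = EventRecorder()
# handle_startendtag() is not overridden, so <br/> reaches the
# start-tag and end-tag handlers through the default implementation.
recorder.feed("<br/><!--text--><?proc color='red'><?xml version=\"1.0\"?>")
recorder.close()
for event in recorder.events:
    print(event)
# ('start', 'br')
# ('end', 'br')
# ('comment', 'text')
# ('pi', "proc color='red'")
# ('pi', 'xml version="1.0"?')
```

A subclass that consumes XHTML input may want to strip a trailing "?" inside its own handle_pi() override, since the parser passes it through unchanged.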