
Python Library Reference


13.1 `HTMLParser` -- Simple HTML and XHTML parser

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Unlike the parser in htmllib, this parser is not based on the SGML parser in sgmllib.

class HTMLParser()

The HTMLParser class is instantiated without arguments.

An HTMLParser instance is fed HTML data and calls handler functions when tags begin and end. The HTMLParser class is meant to be overridden by the user to provide a desired behavior.

Unlike the parser in htmllib, this parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.
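
The subclassing pattern and the lack of tag matching described above can be sketched roughly as follows. The class name EventRecorder and its events list are hypothetical, and the fallback import assumes a newer Python in which the module is named html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class EventRecorder(HTMLParser):
    """Hypothetical subclass that records each handler call."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

parser = EventRecorder()
parser.feed("<p><b>text</p>")   # the <b> element is never closed explicitly
parser.close()

# No ("end", "b") event is recorded: closing <p> does not implicitly close <b>.
events = parser.events
```

A caller that needs balanced tags would have to keep its own stack of open elements inside the handlers.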

HTMLParser instances have the following methods:

reset(): Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

feed(data): Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called.

close(): Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
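
As a minimal sketch of the buffering behaviour of feed() and of a redefined close(), the example below splits one start tag across two calls. StartTagCollector, tags and closed are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class StartTagCollector(HTMLParser):
    """Hypothetical subclass that collects start tags and notes end of input."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []
        self.closed = False

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

    def close(self):
        HTMLParser.close(self)   # always call the base class method
        self.closed = True       # extra end-of-input processing goes here

parser = StartTagCollector()
parser.feed('<a hre')                       # incomplete tag: buffered, no handler call
seen_after_first_chunk = list(parser.tags)
parser.feed('f="x">')                       # completes the buffered tag
seen_after_second_chunk = list(parser.tags)
parser.close()
```

This makes the class usable with data that arrives in arbitrary chunks, for example when reading from a socket.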

getpos(): Return current line number and offset.
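
One possible use of getpos() is to record where each start tag was found. In the sketch below, PositionLogger and positions are hypothetical names, and the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class PositionLogger(HTMLParser):
    """Hypothetical subclass that stores the position of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair for the current tag.
        self.positions.append((tag, self.getpos()))

parser = PositionLogger()
parser.feed("<html>\n  <body>")
parser.close()

positions = parser.positions
line_of_body = positions[1][1][0]   # <body> starts on the second line
```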

get_starttag_text(): Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).
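
To illustrate the point about preserving whitespace, the following sketch stores the raw text of each start tag as it is encountered. RawTagLogger and raw are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class RawTagLogger(HTMLParser):
    """Hypothetical subclass that keeps the original text of every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; the raw text keeps case and spacing.
        self.raw.append(self.get_starttag_text())

parser = RawTagLogger()
parser.feed('<A   HREF="x"   CLASS=y>')
parser.close()

raw = parser.raw
```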

handle_starttag(tag, attrs): This method is called to handle the start of a tag. It is intended to be overridden by a derived class; the base class implementation does nothing.
The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name will be translated to lower case, and double quotes and backslashes in the value have been interpreted. For instance, for the tag <A HREF="http://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'http://www.cwi.nl/')]).
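
The call described above can be reproduced with a small subclass that collects link targets. LinkCollector and links are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href value of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive in lower case, whatever the input used.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<A HREF="http://www.cwi.nl/">CWI</A>')
parser.close()

links = parser.links
```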

handle_startendtag(tag, attrs): Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<a .../>). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
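
A subclass that wants to know which tags were written in the empty form, while keeping the default start/end calls, might look like the sketch below. EmptyTagTracker, empty_tags and calls are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class EmptyTagTracker(HTMLParser):
    """Hypothetical subclass that notes XHTML-style empty tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.empty_tags = []
        self.calls = []

    def handle_startendtag(self, tag, attrs):
        self.empty_tags.append(tag)
        # Delegate to the default, which calls handle_starttag() then handle_endtag().
        HTMLParser.handle_startendtag(self, tag, attrs)

    def handle_starttag(self, tag, attrs):
        self.calls.append(("start", tag))

    def handle_endtag(self, tag):
        self.calls.append(("end", tag))

parser = EmptyTagTracker()
parser.feed("<p>line one<br/>line two</p>")
parser.close()

empty_tags = parser.empty_tags
calls = parser.calls
```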

handle_endtag(tag): This method is called to handle the end tag of an element. It is intended to be overridden by a derived class; the base class implementation does nothing. The tag argument is the name of the tag converted to lower case.

handle_data(data): This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.

handle_charref(name): This method is called to process a character reference of the form "&#ref;". It is intended to be overridden by a derived class; the base class implementation does nothing.

handle_entityref(name): This method is called to process a general entity reference of the form "&name;" where name is a general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
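
The two reference handlers can be sketched together as below. ReferenceLogger and refs are hypothetical names. One loud assumption: newer Python versions of this class decode references before calling handle_data() unless the convert_charrefs attribute is false, so the sketch switches it off; the class documented here has no such switch and the assignment would simply be ignored.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class ReferenceLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Assumption: only newer versions read this attribute; it keeps them
        # from converting references themselves instead of calling the handlers.
        self.convert_charrefs = False
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("charref", name))      # text between "&#" and ";"

    def handle_entityref(self, name):
        self.refs.append(("entityref", name))    # text between "&" and ";"

parser = ReferenceLogger()
parser.feed("a &amp; b &#62; c")
parser.close()

refs = parser.refs
```

Translating the recorded names into characters is left to the subclass, since the base class handlers do nothing.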

handle_comment(data): This method is called when a comment is encountered. The data argument is a string containing the text between the "<!--" and "-->" delimiters, but not the delimiters themselves. For example, the comment "<!--text-->" will cause this method to be called with the argument 'text'. It is intended to be overridden by a derived class; the base class implementation does nothing.
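
A short sketch of a comment handler follows. CommentCollector and comments are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class CommentCollector(HTMLParser):
    """Hypothetical subclass that gathers the text of every comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data holds only the comment body, without the delimiters.
        self.comments.append(data)

parser = CommentCollector()
parser.feed("<p>visible</p><!--text-->")
parser.close()

comments = parser.comments
```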

handle_decl(decl): Method called when an SGML declaration is read by the parser. The decl parameter will be the entire contents of the declaration inside the <!...> markup. It is intended to be overridden by a derived class; the base class implementation does nothing.
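
As an illustration, a document type declaration can be captured as sketched below. DeclCollector and decls are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class DeclCollector(HTMLParser):
    """Hypothetical subclass that stores each declaration it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.decls.append(decl)

parser = DeclCollector()
parser.feed("<!DOCTYPE html><html></html>")
parser.close()

decls = parser.decls
```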

handle_pi(data): Method called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.
Note: The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing "?" will cause the "?" to be included in data.
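
The difference between the SGML and XHTML forms mentioned in the note can be sketched as follows. PICollector, pis and strip_xhtml_marker are hypothetical names; the fallback import assumes a newer Python where the module is html.parser.

```python
try:
    from HTMLParser import HTMLParser   # module name used in this section
except ImportError:
    from html.parser import HTMLParser  # assumption: Python 3 location

class PICollector(HTMLParser):
    """Hypothetical subclass that stores each processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pis = []

    def handle_pi(self, data):
        self.pis.append(data)

def strip_xhtml_marker(data):
    """Hypothetical helper: drop the trailing "?" left by an XHTML-style PI."""
    if data.endswith("?"):
        return data[:-1]
    return data

parser = PICollector()
parser.feed("<?proc color='red'>")    # SGML form
parser.feed("<?proc color='red'?>")   # XHTML form: the "?" ends up in data
parser.close()

pis = parser.pis
normalized = [strip_xhtml_marker(p) for p in pis]
```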

Subsections

13.1.1 Example HTML Parser Application


Release 2.2.3, documentation updated on 30 May 2003.

See About this document... for information on suggesting changes.
