19.1.2. email.parser: Parsing email messages
Source code: Lib/email/parser.py
Message object structures can be created in one of two ways: they can be created from whole cloth by instantiating Message objects and stringing them together via attach() and set_payload() calls, or they can be created by parsing a flat text representation of the email message.
The email package provides a standard parser that understands most email document structures, including MIME documents. You can pass the parser a string or a file object, and the parser will return to you the root Message instance of the object structure. For simple, non-MIME messages the payload of this root object will likely be a string containing the text of the message. For MIME messages, the root object will return True from its is_multipart() method, and the subparts can be accessed via the get_payload() and walk() methods.
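As a minimal sketch of the behaviour just described, the following parses a small two-part MIME message from a string and inspects the resulting tree. The addresses, the boundary string and the variable names are illustrative assumptions, not part of the email package.

```python
import email

# A hypothetical two-part MIME message; addresses and boundary are made up.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    "Content-Type: text/plain\n"
    "\n"
    "First part.\n"
    "--BOUNDARY\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Second part.</p>\n"
    "--BOUNDARY--\n"
)

msg = email.message_from_string(raw)

# The root of a MIME multipart message is a container object.
print(msg.is_multipart())

# get_payload() on a container returns the list of subparts.
subparts = msg.get_payload()
print(len(subparts))

# walk() visits the root first, then each subpart in depth-first order.
content_types = [part.get_content_type() for part in msg.walk()]
print(content_types)
```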
There are actually two parser interfaces available for use, the classic Parser API and the incremental FeedParser API. The classic Parser API is fine if you have the entire text of the message in memory as a string, or if the entire message lives in a file on the file system. FeedParser is more appropriate for when you’re reading the message from a stream which might block waiting for more input (e.g. reading an email message from a socket). The FeedParser can consume and parse the message incrementally, and only returns the root object when you close the parser [1].
Note that the parser can be extended in limited ways, and of course you can implement your own parser completely from scratch. There is no magical connection between the email package’s bundled parser and the Message class, so your custom parser can create message object trees any way it finds necessary.
19.1.2.1. FeedParser API
The FeedParser, imported from the email.feedparser module, provides an API that is conducive to incremental parsing of email messages, such as would be necessary when reading the text of an email message from a source that can block (e.g. a socket). The FeedParser can of course be used to parse an email message fully contained in a string or a file, but the classic Parser API may be more convenient for such use cases. The semantics and results of the two parser APIs are identical.
The FeedParser’s API is simple; you create an instance, feed it a bunch of text until there’s no more to feed it, then close the parser to retrieve the root message object. The FeedParser is extremely accurate when parsing standards-compliant messages, and it does a very good job of parsing non-compliant messages, providing information about how a message was deemed broken. It will populate a message object’s defects attribute with a list of any problems it found in a message. See the email.errors module for the list of defects that it can find.
Here is the API for the FeedParser:
- class email.parser.FeedParser(_factory=email.message.Message, *, policy=policy.compat32)
  Create a FeedParser instance. Optional _factory is a no-argument callable that will be called whenever a new message object is needed. It defaults to the email.message.Message class.

  If policy is specified (it must be an instance of a policy class) use the rules it specifies to update the representation of the message. If policy is not set, use the compat32 policy, which maintains backward compatibility with the Python 3.2 version of the email package. For more information see the policy documentation.

  Changed in version 3.3: Added the policy keyword.

  feed(data)
    Feed the FeedParser some more data. data should be a string containing one or more lines. The lines can be partial and the FeedParser will stitch such partial lines together properly. The lines in the string can have any of the three common line endings, carriage return, newline, or carriage return and newline (they can even be mixed).

  close()
    Closing a FeedParser completes the parsing of all previously fed data, and returns the root message object. It is undefined what happens if you feed more data to a closed FeedParser.
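The feed-then-close cycle can be sketched as follows. The chunk boundaries are chosen arbitrarily to show that partial lines and mixed line endings are stitched together; the message text and variable names are assumptions made for illustration.

```python
from email.parser import FeedParser

# Hypothetical chunks, split mid-line on purpose and mixing "\r\n" with "\n".
chunks = [
    "Subject: Incre",
    "mental\r\nTo: bob@exam",
    "ple.com\r\n\r\nBody line 1\nBody ",
    "line 2\n",
]

parser = FeedParser()
for chunk in chunks:
    # Each call may end in the middle of a line; the parser buffers the rest.
    parser.feed(chunk)

# The root message object is only available once the parser is closed.
msg = parser.close()

print(msg["Subject"])
print(msg["To"])
print(repr(msg.get_payload()))
```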
- class email.parser.BytesFeedParser(_factory=email.message.Message)
  Works exactly like FeedParser except that the input to the feed() method must be bytes and not string.

  New in version 3.2.
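A sketch of the bytes variant, assuming the data arrives in fixed-size reads from a binary stream. Here io.BytesIO stands in for a socket, and the 16-byte read size and message contents are arbitrary illustrative choices.

```python
import io
from email.parser import BytesFeedParser

# io.BytesIO stands in for a socket or other binary stream (an assumption).
stream = io.BytesIO(
    b"From: alice@example.com\r\n"
    b"Subject: Bytes\r\n"
    b"\r\n"
    b"Hello over the wire.\r\n"
)

parser = BytesFeedParser()
while True:
    chunk = stream.read(16)  # small reads to mimic data arriving piecemeal
    if not chunk:
        break
    parser.feed(chunk)       # feed() takes bytes here, not str

msg = parser.close()
print(msg["Subject"])
print(msg.get_payload().strip())
```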
19.1.2.2. Parser class API
The Parser class, imported from the email.parser module, provides an API that can be used to parse a message when the complete contents of the message are available in a string or file. The email.parser module also provides header-only parsers, called HeaderParser and BytesHeaderParser, which can be used if you’re only interested in the headers of the message. HeaderParser and BytesHeaderParser can be much faster in these situations, since they do not attempt to parse the message body, instead setting the payload to the raw body as a string. They have the same API as the Parser and BytesParser classes.
New in version 3.3: The BytesHeaderParser class.
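To illustrate the header-only parsers, the sketch below runs HeaderParser over a multipart message. The message text is a made-up example. Because the body is not parsed, the payload stays a raw string even though the Content-Type header declares a multipart message.

```python
from email.parser import HeaderParser

# A hypothetical multipart message; only its headers will be parsed.
raw = (
    "Subject: Quarterly report\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attachment.\n"
    "--XYZ--\n"
)

msg = HeaderParser().parsestr(raw)

subject = msg["Subject"]
body = msg.get_payload()

print(subject)
# The body was not split into subparts; it is kept as one raw string.
print(msg.is_multipart())
print(type(body).__name__)
```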
- class email.parser.Parser(_class=email.message.Message, *, policy=policy.compat32)
  The constructor for the Parser class takes an optional argument _class. This must be a callable factory (such as a function or a class), and it is used whenever a sub-message object needs to be created. It defaults to Message (see email.message). The factory will be called without arguments.

  If policy is specified (it must be an instance of a policy class) use the rules it specifies to update the representation of the message. If policy is not set, use the compat32 policy, which maintains backward compatibility with the Python 3.2 version of the email package. For more information see the policy documentation.

  Changed in version 3.3: Removed the strict argument that was deprecated in 2.4. Added the policy keyword.

  The other public Parser methods are:

  parse(fp, headersonly=False)
    Read all the data from the file-like object fp, parse the resulting text, and return the root message object. fp must support both the readline() and the read() methods on file-like objects.

    The text contained in fp must be formatted as a block of RFC 2822 style headers and header continuation lines, optionally preceded by an envelope header. The header block is terminated either by the end of the data or by a blank line. Following the header block is the body of the message (which may contain MIME-encoded subparts).

    Optional headersonly is a flag specifying whether to stop parsing after reading the headers or not. The default is False, meaning it parses the entire contents of the file.
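The input layout that parse() expects can be sketched with an in-memory text file. io.StringIO stands in for a real file here, and the envelope line, addresses and subject are invented for illustration.

```python
import io
from email.parser import Parser

# io.StringIO stands in for an open text file (an assumption for this sketch).
fp = io.StringIO(
    "From alice@example.com Sat Jan  1 00:00:00 2000\n"  # envelope header
    "From: alice@example.com\n"
    "Subject: A long subject that\n"
    " continues on the next line\n"                      # continuation line
    "\n"                                                 # blank line ends headers
    "Body text.\n"
)

msg = Parser().parse(fp)

envelope = msg.get_unixfrom()   # the optional envelope header, if present
subject = msg["Subject"]        # includes the continuation line
body = msg.get_payload()

# Rewind and parse again, this time stopping after the headers.
fp.seek(0)
headers_only = Parser().parse(fp, headersonly=True)

print(envelope)
print(headers_only["From"])
```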
- class email.parser.BytesParser(_class=email.message.Message, *, policy=policy.compat32)
  This class is exactly parallel to Parser, but handles bytes input. The _class argument is interpreted in the same way as for the Parser constructor.

  If policy is specified (it must be an instance of a policy class) use the rules it specifies to update the representation of the message. If policy is not set, use the compat32 policy, which maintains backward compatibility with the Python 3.2 version of the email package. For more information see the policy documentation.

  Changed in version 3.3: Removed the strict argument. Added the policy keyword.

  parse(fp, headersonly=False)
    Read all the data from the binary file-like object fp, parse the resulting bytes, and return the message object. fp must support both the readline() and the read() methods on file-like objects.

    The bytes contained in fp must be formatted as a block of RFC 2822 style headers and header continuation lines, optionally preceded by an envelope header. The header block is terminated either by the end of the data or by a blank line. Following the header block is the body of the message (which may contain MIME-encoded subparts, including subparts with a Content-Transfer-Encoding of 8bit).

    Optional headersonly is a flag specifying whether to stop parsing after reading the headers or not. The default is False, meaning it parses the entire contents of the file.

  parsebytes(text, headersonly=False)
    Similar to the parse() method, except it takes a bytes-like object instead of a file-like object. Calling this method is equivalent to wrapping text in a BytesIO instance first and calling parse().

    Optional headersonly is as with the parse() method.

  New in version 3.2.
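As a rough illustration of bytes input with an 8bit body, the sketch below parses a small UTF-8 message with parsebytes() and recovers the original bytes through get_payload(decode=True). The message content is a made-up example.

```python
from email.parser import BytesParser

# A hypothetical message whose body contains a non-ASCII character sent as 8bit.
raw = (
    b"Subject: Menu\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "caf\u00e9\r\n".encode("utf-8")
)

msg = BytesParser().parsebytes(raw)

# decode=True returns the body as bytes, which can then be decoded
# with the charset declared in the Content-Type header.
body_bytes = msg.get_payload(decode=True)
text = body_bytes.decode(msg.get_content_charset())

print(text.strip())
```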
Since creating a message object structure from a string or a file object is such a common task, four functions are provided as a convenience. They are available in the top-level email package namespace.
email.message_from_string(s, _class=email.message.Message, *, policy=policy.compat32)
  Return a message object structure from a string. This is exactly equivalent to Parser().parsestr(s). _class and policy are interpreted as with the Parser class constructor.

  Changed in version 3.3: Removed the strict argument. Added the policy keyword.
email.message_from_bytes(s, _class=email.message.Message, *, policy=policy.compat32)
  Return a message object structure from a bytes-like object. This is exactly equivalent to BytesParser().parsebytes(s). Optional _class and policy are interpreted as with the Parser class constructor.

  New in version 3.2.

  Changed in version 3.3: Removed the strict argument. Added the policy keyword.
email.message_from_file(fp, _class=email.message.Message, *, policy=policy.compat32)
  Return a message object structure tree from an open file object. This is exactly equivalent to Parser().parse(fp). _class and policy are interpreted as with the Parser class constructor.

  Changed in version 3.3: Removed the strict argument. Added the policy keyword.
email.message_from_binary_file(fp, _class=email.message.Message, *, policy=policy.compat32)
  Return a message object structure tree from an open binary file object. This is exactly equivalent to BytesParser().parse(fp). _class and policy are interpreted as with the Parser class constructor.

  New in version 3.2.

  Changed in version 3.3: Removed the strict argument. Added the policy keyword.
Here’s an example of how you might use this at an interactive Python prompt:
>>> import email
>>> msg = email.message_from_string(myString)
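A slightly fuller sketch of the convenience functions, assuming a trivial one-header message invented for illustration. It compares message_from_string() with the equivalent Parser().parsestr() call, then parses the same content from bytes and from a binary file-like object (io.BytesIO stands in for an open binary file).

```python
import email
import io
from email.parser import Parser

text = "Subject: Greetings\n\nHello, world.\n"

# The convenience function and the explicit parser call build the same tree.
via_function = email.message_from_string(text)
via_parser = Parser().parsestr(text)
same_text = via_function.as_string() == via_parser.as_string()

# The bytes-oriented helpers accept the encoded form of the same message.
data = text.encode("ascii")
from_bytes = email.message_from_bytes(data)
from_file = email.message_from_binary_file(io.BytesIO(data))

print(same_text)
print(from_bytes["Subject"])
print(repr(from_file.get_payload()))
```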
19.1.2.3. Additional notes
Here are some notes on the parsing semantics:
- Most non-multipart type messages are parsed as a single message object with a string payload. These objects will return False for is_multipart(). Their get_payload() method will return a string object.
- All multipart type messages will be parsed as a container message object with a list of sub-message objects for their payload. The outer container message will return True for is_multipart() and its get_payload() method will return the list of Message subparts.
- Most messages with a content type of message/* (e.g. message/delivery-status and message/rfc822) will also be parsed as a container object containing a list payload of length 1. Their is_multipart() method will return True. The single element in the list payload will be a sub-message object.
- Some non-standards compliant messages may not be internally consistent about their multipart-edness. Such messages may have a Content-Type header of type multipart, but their is_multipart() method may return False. If such messages were parsed with the FeedParser, they will have an instance of the MultipartInvariantViolationDefect class in their defects attribute list. See email.errors for details.
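The notes above can be sketched with three small messages, all invented for illustration: a plain message, a message/rfc822 wrapper, and a message that claims to be multipart but declares no boundary. The last case is one assumed way to trigger the inconsistency described in the final note.

```python
import email
from email import errors

# 1. A non-multipart message: string payload, is_multipart() is False.
plain = email.message_from_string("Subject: plain\n\njust text\n")

# 2. A message/rfc822 wrapper: a container with a list payload of length 1.
wrapped = email.message_from_string(
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: inner\n"
    "\n"
    "inner body\n"
)
inner = wrapped.get_payload()[0]

# 3. Claims to be multipart but has no boundary parameter, so the body
#    cannot be split into subparts and defects are recorded instead.
broken = email.message_from_string(
    "Content-Type: multipart/mixed\n"
    "\n"
    "no boundary parameter was declared\n"
)
has_invariant_defect = any(
    isinstance(d, errors.MultipartInvariantViolationDefect)
    for d in broken.defects
)

print(plain.is_multipart(), wrapped.is_multipart(), broken.is_multipart())
print(inner["Subject"])
print([type(d).__name__ for d in broken.defects])
```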
Footnotes
| [1] | As of email package version 3.0, introduced in Python 2.4, the classic Parser was re-implemented in terms of the FeedParser, so the semantics and results are identical between the two parsers. |
