Description
Bug report
Bug description:
Method: email.message_from_binary_file()
It seems we've found an issue with the email.parser module when parsing a raw binary MIME message whose preamble contains UTF-8 encoded data.
When calling the .as_string() method on the returned message, the resulting str contains lone surrogate code points that cannot be encoded as UTF-8 ("surrogates not allowed").
The problem does not occur when using the email.message_from_string() method.
I've made a somewhat minimal example (from the MIME RFC) exposing the issue.
Here "préamble" is decoded as "pr\udcc3\udca9amble".
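A shortened reproduction may make the symptom easier to see. The sketch below is not the reporter's script: the message text and the boundary "b" are made up for illustration, and it uses email.message_from_bytes() on the assumption that it decodes the same way as message_from_binary_file(). It checks that the parsed preamble carries the two surrogates, that re-encoding the as_string() output fails, and that as_bytes() appears to keep the original UTF-8 bytes, which could serve as a workaround.

```python
import email
import email.policy

# Hypothetical, shortened message: a multipart body with a UTF-8 preamble.
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="b"\n'
    '\n'
    'This is the préamble.\n'
    '--b\n'
    'Content-Type: text/plain; charset=us-ascii\n'
    '\n'
    'Plain ASCII part.\n'
    '--b--\n'
).encode('utf-8')

msg = email.message_from_bytes(RAW, policy=email.policy.default)

# The two UTF-8 bytes of "é" (0xC3 0xA9) show up as lone surrogates.
has_surrogates = '\udcc3\udca9' in msg.preamble

# Re-encoding the str form fails on those surrogates.
try:
    msg.as_string().encode('utf-8')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The bytes form appears to keep the original UTF-8 sequence.
preamble_intact = 'préamble'.encode('utf-8') in msg.as_bytes()

print(has_surrogates, encode_failed, preamble_intact)
```

If the output is needed as text rather than bytes, decoding the as_bytes() result with the message's real encoding is one option, assuming that encoding is known.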
import io
import email
import email.policy

CONTENTS = """
From: Nathaniel Borenstein <nsb@bellcore.com>
To: Ned Freed <ned@innosoft.com>
Subject: Sample message
MIME-Version: 1.0
Content-type: multipart/mixed; boundary="i-am-boundary"

This is the préamble. It is to be ignored, though it
is a handy place for mail composers to include an
explanatory note to non-MIME compliant readers.

--i-am-boundary
Content-type: text/plain; charset=us-ascii

This is explicitly typed plain ASCII text.
It DOES end with a linebreak.

--i-am-boundary
Content-type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

This should be correctly encapsulated: Un petit café ?

--i-am-boundary--

This is the epilogue. It is also to be ignored.
""".lstrip()

CONTENTS_BYTES = io.BytesIO(CONTENTS.encode())

# Does not have an impact on the result
POLICY = email.policy.default.clone(utf8=True)


def show_message(msg):
    # Parts are correctly decoded in all cases
    for i, part in enumerate(msg.iter_parts(), 1):
        print(f'MIME PART {i}:')
        as_string = part.as_string()
        as_bytes = as_string.encode()
        print(as_string)

    as_string = msg.as_string()
    # When source was bytes, the unicode result of as_string is incorrect
    as_bytes = as_string.encode()
    print(as_string)


msg_from_binary = email.message_from_binary_file(CONTENTS_BYTES, policy=POLICY)
show_message(msg_from_binary)
# UnicodeEncodeError: 'utf-8' codec can't encode characters in position 192-193: surrogates not allowed

# Using the unicode representation is OK
# msg_from_string = email.message_from_string(CONTENTS, policy=POLICY)
# show_message(msg_from_string)
I've also tried adding a charset and Content-Transfer-Encoding: 8bit to the top-level headers, with the same result (I do not know whether this is actually valid).
Looking at the current code, it seems that BytesParser always uses the ASCII encoding with errors='surrogateescape'.
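To illustrate why that decoding step would produce exactly the surrogates reported above, here is a small sketch using only the built-in codecs, with no email code involved. It assumes nothing beyond the standard behaviour of the surrogateescape error handler: each non-ASCII byte 0xNN maps to the lone surrogate U+DCNN, and the mapping can be reversed to recover the original bytes.

```python
# UTF-8 encodes "é" as the two bytes 0xC3 0xA9.
raw = 'préamble'.encode('utf-8')

# Decoding as ASCII with surrogateescape turns each non-ASCII byte
# into a lone surrogate (0xC3 -> U+DCC3, 0xA9 -> U+DCA9).
decoded = raw.decode('ascii', errors='surrogateescape')

# Lone surrogates are not valid in UTF-8, so a plain encode fails.
try:
    decoded.encode('utf-8')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc.reason

# Reversing the escape recovers the original bytes, which can then be
# decoded with the encoding they were really written in.
restored = decoded.encode('ascii', errors='surrogateescape').decode('utf-8')

print(ascii(decoded), encode_error, restored)
```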
CPython versions tested on:
3.11, CPython main branch
Operating systems tested on:
Linux