Description
Bug report
Bug description:
In the attached minimal Python example, email_raw_1 survives a round trip from a UTF-8 byte string to an EmailMessage object and back to a string, while email_raw_2 does not:
Traceback (most recent call last):
  File "//surrogate_issue.py", line 29, in <module>
    print(message_2)
  …
  File "/usr/local/lib/python3.12/email/_encoded_words.py", line 224, in encode
    bstring = string.encode(charset)
              ^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-2: surrogates not allowed
Oddly, the only difference between the two inputs is one additional digit (the trailing 0 of the number) in email_raw_2.
The email is malformed; however, it is taken from an actual mail at https://wilson.bronger.org/5105.txt. Malformed or not, my other email machinery can deal with it, so I think Python should handle such real-world specimens on a best-effort basis rather than exiting with an exception.
#!/bin/python
import email, email.policy

email_raw_1 = """Content-Type: multipart/mixed; boundary="==="
--===
Content-Type: message/plain
 您0123456789012.3456789
--===--
""".encode()

email_raw_2 = """Content-Type: multipart/mixed; boundary="==="
--===
Content-Type: message/plain
 您0123456789012.34567890
--===--
""".encode()

message_1 = email.message_from_bytes(email_raw_1, policy=email.policy.SMTPUTF8)
message_2 = email.message_from_bytes(email_raw_2, policy=email.policy.SMTPUTF8)

print(message_1)
print(message_2)
CPython versions tested on:
3.12
Operating systems tested on:
Linux
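Until the email package copes with such input, a caller can approximate the best-effort behaviour asked for above. The following is only a minimal sketch of a workaround, not a fix: the helper name render_best_effort is hypothetical, and falling back to a lossy decode of the original bytes is an assumption about what is acceptable to the caller.

```python
import email
import email.policy


def render_best_effort(raw: bytes) -> str:
    """Return the message as text without raising on undecodable headers.

    Hypothetical helper: tries the normal round trip first and, if
    serialization fails with UnicodeEncodeError (as in the traceback
    above), falls back to decoding the original bytes.
    """
    message = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
    try:
        return str(message)
    except UnicodeEncodeError:
        # Assumption: a lossy view of the raw input is better than a crash.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(render_best_effort(b"Subject: hello\n\nbody text\n"))
```

The fallback bypasses the parsed message entirely, so any changes made to the EmailMessage object before serialization would be lost on that path.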