Issue21315

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65514

classification

Title:	email._header_value_parser does not recognise in-line encoding changes
Type:	behavior	Stage:	resolved
Components:	email	Versions:	Python 3.8, Python 3.7

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, maxking, miss-islington, r.david.murray, valhallasw
Priority:	normal	Keywords:	patch

Created on 2014-04-20 15:58 by valhallasw, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
000359.raw	valhallasw, 2014-04-20 15:58	Example bugzilla e-mail
unstructured_ew_without_whitespace.diff	valhallasw, 2014-04-20 15:58	Unit test & possible fix

Pull Requests
URL	Status	Linked
PR 13425	merged	maxking, 2019-05-19 17:57
PR 13846	merged	miss-islington, 2019-06-05 16:56
PR 15655	merged	epicfaace, 2019-09-03 04:38

Messages (12)
msg216908 -(view)	Author: Merlijn van Deen (valhallasw)*	Date: 2014-04-20 15:58
Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. This makes email.headerregistry.UnstructuredHeader (and email._header_value_parser in the background) not recognise the structure.

>>> import email.headerregistry, pprint
>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
            'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])}

versus

>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew: non-ascii bug tést;\trussian text: АБВГҐД',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('АБВГҐД')])])}

I have attached the raw e-mail.

Judging by the code, this is supposed to work (while raising a Defect -- "missing whitespace before encoded word"), but the code splits by whitespace:

    tok, *remainder = _wsp_splitter(value, 1)

which swallows the encoded section in one go. In a second attachment, I added a patch which 1) adds a test case for this and 2) implements a solution, but the solution is unfortunately not in the style of the rest of the module.

In the meantime, I've chosen a monkey-patching approach to work around the issue:

    import email._header_value_parser, email.headerregistry

    def get_unstructured(value):
        value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")
        return email._header_value_parser.get_unstructured(value)

    email.headerregistry.UnstructuredHeader.value_parser = staticmethod(get_unstructured)
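The "swallowing" described above can be reproduced without touching the email package. The sketch below is only an illustration: `wsp_split` is a hypothetical stand-in for the private `_wsp_splitter` helper, assumed here to split on runs of spaces or tabs while keeping the separator.

```python
import re

# Hypothetical stand-in for the private _wsp_splitter helper (an assumption):
# split on runs of space/tab and keep the separator in the result.
wsp_split = re.compile(r'([ \t]+)').split

# Header fragment without whitespace before the encoded word.
unspaced = 'New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian'
tok_unspaced, *rest_unspaced = wsp_split(unspaced, 1)

# Same fragment with a space before the encoded word.
spaced = 'New: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian'
tok_spaced, *rest_spaced = wsp_split(spaced, 1)

print(repr(tok_unspaced))  # 'New:' and the encoded word arrive as one token
print(repr(tok_spaced))    # only 'New:'
```

In the unspaced case the first token still contains the `=?...?=` sequence, so a parser that only checks whether a token starts with `=?` never sees an encoded word.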
msg238956 -(view)	Author: Mark Lawrence (BreamoreBoy)*	Date: 2015-03-22 23:43
Could someone formally review the patch, please? It's only three additional lines of code and a new test.
msg342739 -(view)	Author: Abhilash Raj (maxking)*	Date: 2019-05-17 17:06
According to RFC 2047, section 5 (1):

> However, an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

So, it seems like splitting on whitespace is the right thing to do (see MUST).

While your solution works for your case, where the charset and cte are utf-8 and q respectively (not the general case of arbitrary charsets and ctes), it seems like a hack to get around the fact that the header is non-conformant to the RFC.

IMO manipulating the original header (value.replace in your patch) isn't something we should do, but @r.david.murray would be the right person to answer how we handle non-conformant messages.
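To make the quoted rule concrete, here is a minimal sketch that reports encoded words lacking leading whitespace. The function name and the regex are assumptions made for illustration, not part of the email package, and only the leading side of the separation rule is checked.

```python
import re

# Hypothetical pattern (an assumption): an encoded word that is directly
# preceded by a character other than space or tab.
_UNSEPARATED_EW = re.compile(r'(?<=[^ \t])=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')

def find_unseparated_encoded_words(value):
    """Return the start offsets of encoded words with no whitespace before them."""
    return [m.start() for m in _UNSEPARATED_EW.finditer(value)]

print(find_unseparated_encoded_words('New:=?UTF-8?Q?=20a?='))   # offset of '=?'
print(find_unseparated_encoded_words('New: =?UTF-8?Q?=20a?='))  # empty list
```

An encoded word at the very start of the value is not reported, since nothing precedes it.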
msg342745 -(view)	Author: R. David Murray (r.david.murray)*	Date: 2019-05-17 18:06
A cleaner/safer solution here would be:

    tok, remainder = _wsp_splitter(value, 1)
    if _rfc2047_matcher(tok):
        tok, remainder = value.partition('=?')

where _rfc2047_matcher would be a regex that matches a correctly formatted encoded word. There is a regex for that in the header.py module, though for this application we don't need the groups it has.

Abhilash, I'm not sure why you say the proposed solution only works for utf-8 and 'q'?
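The idea in this comment can be sketched as a standalone function. This is not the code that was eventually merged; `_wsp_split`, `_rfc2047_matcher` and `split_first_token` are hypothetical names defined here, and the sketch cuts at the regex match offset rather than using `partition('=?')`, so that a stray `=?` earlier in the token does not cause a wrong split.

```python
import re

# Hypothetical helpers (assumptions, defined here for illustration only).
_wsp_split = re.compile(r'([ \t]+)').split
_rfc2047_matcher = re.compile(r'=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')

def split_first_token(value):
    """Split off the first token, cutting before an embedded encoded word."""
    tok, *remainder = _wsp_split(value, 1)
    rest = ''.join(remainder)
    match = _rfc2047_matcher.search(tok)
    if match and match.start() > 0:
        # The token has text glued in front of an encoded word: give the
        # encoded word (and anything after it) back to the remainder.
        rest = tok[match.start():] + rest
        tok = tok[:match.start()]
    return tok, rest

print(split_first_token('New:=?UTF-8?Q?=20x?=;\trussian'))
```

After the split, the remainder starts with `=?`, so the next parsing step can treat it as an encoded word and record the missing-whitespace defect.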
msg342750 -(view)	Author: Abhilash Raj (maxking)*	Date: 2019-05-17 18:21
The solution replaces the RFC 2047 chrome for utf-8 and q to make sure there is a space before the ew; it wouldn't replace anything for any other charset/cte pair.

    value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")

Isn't that correct?
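The limitation is easy to check with plain string operations. In the sketch below, `insert_space_workaround` is a hypothetical wrapper around the `replace` call from the monkey patch, and the ISO-8859-1/base64 header is a made-up example, not taken from the attached e-mail.

```python
import base64

def insert_space_workaround(value):
    # Same substitution as in the monkey patch: UTF-8 + Q only.
    return value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")

utf8_q = 'New:=?UTF-8?Q?=20t=C3=A9st?='

# Made-up example of the same text with a different charset/cte pair.
payload = base64.b64encode(' tést'.encode('latin-1')).decode('ascii')
latin1_b = 'New:=?ISO-8859-1?B?' + payload + '?='

print(insert_space_workaround(utf8_q))    # a space is inserted before '=?'
print(insert_space_workaround(latin1_b))  # returned unchanged
```

The substitution also fails for a lowercase `=?utf-8?q?` prefix or for an encoded word that does not begin with `=20`.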
msg342755 -(view)	Author: R. David Murray (r.david.murray)*	Date: 2019-05-17 18:53
I don't see that line of code in unstructured_ew_without_whitespace.diff.

Oh, you are referring to his monkey patch. Yes, that is not a suitable solution for anyone but him, and I don't think he meant to imply otherwise :)
msg342756 -(view)	Author: Abhilash Raj (maxking)*	Date: 2019-05-17 18:55
Ah, I wrongly assumed the patch had the same thing.

Sorry about that.
msg343070 -(view)	Author: Abhilash Raj (maxking)*	Date: 2019-05-21 16:20
Created a Pull Request for this.

https://github.com/python/cpython/pull/13425
msg343271 -(view)	Author: Abhilash Raj (maxking)*	Date: 2019-05-23 02:07
I have made the requested changes on the PR.
msg344750 -(view)	Author: Barry A. Warsaw (barry)*	Date: 2019-06-05 16:56
New changeset 66c4f3f38b867d8329b28c032bb907fd1a2f22d2 by Barry Warsaw (Abhilash Raj) in branch 'master':
bpo-21315: Fix parsing of encoded words with missing leading ws. (#13425)
https://github.com/python/cpython/commit/66c4f3f38b867d8329b28c032bb907fd1a2f22d2
msg344841 -(view)	Author: Barry A. Warsaw (barry)*	Date: 2019-06-06 17:08
New changeset dc20fc4311dece19488299a7cd11317ffbe4d3c3 by Barry Warsaw (Miss Islington (bot)) in branch '3.7':
bpo-21315: Fix parsing of encoded words with missing leading ws. (GH-13425) (#13846)
https://github.com/python/cpython/commit/dc20fc4311dece19488299a7cd11317ffbe4d3c3
msg351091 -(view)	Author: miss-islington (miss-islington)	Date: 2019-09-03 17:08
New changeset 59e8fba7189d0e86d428a1125744afb8b0f40b5d by Miss Islington (bot) (Ashwin Ramaswami) in branch '3.8':
[3.8] bpo-21315: Fix parsing of encoded words with missing leading ws (GH-13425) (GH-15655)
https://github.com/python/cpython/commit/59e8fba7189d0e86d428a1125744afb8b0f40b5d

History
Date	User	Action	Args
2022-04-11 14:58:02	admin	set	github: 65514
2019-09-03 17:08:44	miss-islington	set	nosy: +miss-islington messages: +msg351091
2019-09-03 04:38:59	epicfaace	set	pull_requests: +pull_request15322
2019-08-17 03:03:39	maxking	set	status: open -> closed stage: patch review -> resolved versions: + Python 3.7, Python 3.8, - Python 3.3, Python 3.4, Python 3.5
2019-06-06 17:08:46	barry	set	messages: +msg344841
2019-06-05 16:56:45	miss-islington	set	pull_requests: +pull_request13723
2019-06-05 16:56:38	barry	set	messages: +msg344750
2019-05-23 02:07:41	maxking	set	messages: +msg343271
2019-05-21 16:20:41	maxking	set	messages: +msg343070
2019-05-19 17:57:39	maxking	set	stage: patch review pull_requests: +pull_request13335
2019-05-17 18:55:50	maxking	set	messages: +msg342756
2019-05-17 18:53:30	r.david.murray	set	messages: +msg342755
2019-05-17 18:21:48	maxking	set	messages: +msg342750
2019-05-17 18:06:45	r.david.murray	set	messages: +msg342745
2019-05-17 17:06:32	maxking	set	nosy: +maxking messages: +msg342739
2019-03-15 22:17:08	BreamoreBoy	set	nosy: -BreamoreBoy
2015-03-22 23:43:39	BreamoreBoy	set	nosy: +BreamoreBoy messages: +msg238956
2014-04-20 15:58:51	valhallasw	set	files: +unstructured_ew_without_whitespace.diff keywords: +patch type: behavior
2014-04-20 15:58:18	valhallasw	create