
This issue trackerhas been migrated toGitHub, and is currentlyread-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.
Created on2011-05-17 23:10 byvstinner, last changed2022-04-11 14:57 byadmin. This issue is nowclosed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| cjk_no_reset.patch | vstinner,2011-05-17 23:10 | review | ||
| Messages (10) | |||
|---|---|---|---|
| msg136194 -(view) | Author: STINNER Victor (vstinner)*![]() | Date: 2011-05-17 23:10 | |
Stateful CJK codecs reset the codec at each call to encode() producing a valid but overlong output:>>> import codecs>>> encoder = codecs.getincrementalencoder('hz')()>>> encoder.encode('\u804a') + encoder.encode('\u804a')b'~{AD~}~{AD~}'>>> '\u804a\u804a'.encode('hz')b'~{ADAD~}'Multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp).Attached patch fixes this issue. I don't like how I added the tests, these tests may be moved somewhere else, but HZ codec doesn't have tests today (I opened issue#12057 for that), and ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). We should maybe also add tests specific to ISO 2022 first?I hesitate to reset the codec on .encode(text, final=True), but UTF-8-SIG or UTF-16 don't reset the codec if final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0). On a seek to another position, it calls encoder.setstate(0).See also issues#12016 and#12057. | |||
| msg136534 -(view) | Author: Marc-Andre Lemburg (lemburg)*![]() | Date: 2011-05-22 15:02 | |
I think it's better to use a StringIO instance for the tests.Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this. | |||
| msg136727 -(view) | Author: STINNER Victor (vstinner)*![]() | Date: 2011-05-24 09:55 | |
> I think it's better to use a StringIO instance for the tests.For which test excatly? An encoder produces bytes, I don't the relation with StringIO. | |||
| msg136730 -(view) | Author: Marc-Andre Lemburg (lemburg)*![]() | Date: 2011-05-24 10:17 | |
STINNER Victor wrote:> > STINNER Victor <victor.stinner@haypocalc.com> added the comment:> >> I think it's better to use a StringIO instance for the tests.> > For which test excatly? An encoder produces bytes, I don't the relation with StringIO.Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO. | |||
| msg136777 -(view) | Author: Martin (gz)* | Date: 2011-05-24 18:13 | |
Does Victor Stinner have a psychic link with Armin Rigo? :)https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py"""# My theory is that they are not widely used on CPython either, because# I found two bugs just by looking at their .c source: they always call# encreset() after a piece of data, even though I think it's wrong ---# it should be called only once at the end; and mbiencoder_reset() calls# decreset() instead of encreset()."""The answer to Armin's theory is that they're bugs but not ones users are likely to notice? | |||
| msg136780 -(view) | Author: STINNER Victor (vstinner)*![]() | Date: 2011-05-24 19:04 | |
Le mardi 24 mai 2011 à 18:13 +0000, Martin a écrit :> Martin <gzlist@googlemail.com> added the comment:> > Does Victor Stinner have a psychic link with Armin Rigo? :)> >https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py> > """> # My theory is that they are not widely used on CPython either, because> # I found two bugs just by looking at their .c sourceSorry, I only found one bug, and while testing HZ, not while reading thesource code.> ... and mbiencoder_reset() calls decreset() instead of encreset()This is a new bug that you should be fixed. Armin did not reported thebug upstream (in this bug tracker)?> The answer to Armin's theory is that they're bugs but not ones users are likely to notice?Ok, I will apply my fix. | |||
| msg136788 -(view) | Author: Armin Rigo (arigo)*![]() | Date: 2011-05-24 20:08 | |
Hi :-) I did not report the two issues I found so far because I didn't finish the PyPy implementation of CJK yet, and I'm very new to anything related to codecs; additionally I didn't check Python 3.x, as I was just following the 2.7 sources. Can someone confirm that the two bugs I suspect are really bugs? And should I open another report to help tracking the 2nd bug? | |||
| msg136791 -(view) | Author: Roundup Robot (python-dev)![]() | Date: 2011-05-24 20:24 | |
New changesetbd17396895fb by Victor Stinner in branch '3.1':Issue#12100: Don't reset incremental encoders of CJK codecs at each call tohttp://hg.python.org/cpython/rev/bd17396895fbNew changeset7f2ab2f95a04 by Victor Stinner in branch '3.2':(Merge 3.1) Issue#12100: Don't reset incremental encoders of CJK codecs athttp://hg.python.org/cpython/rev/7f2ab2f95a04New changesetcb9867dab15e by Victor Stinner in branch 'default':(Merge 3.2) Issue#12100: Don't reset incremental encoders of CJK codecs athttp://hg.python.org/cpython/rev/cb9867dab15e | |||
| msg136792 -(view) | Author: Roundup Robot (python-dev)![]() | Date: 2011-05-24 20:29 | |
New changesete789b4cda872 by Victor Stinner in branch '2.7':Issue#12100: Don't reset incremental encoders of CJK codecs at each call tohttp://hg.python.org/cpython/rev/e789b4cda872 | |||
| msg136910 -(view) | Author: STINNER Victor (vstinner)*![]() | Date: 2011-05-25 22:01 | |
The initial problem (reset() at each call to .encode()) is fixed in Python 2.7, 3.1, 3.2 and 3.3.I opened a new issue,#12171, for the second problem noticed by Armin (decreset vs encreset). | |||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:17 | admin | set | github: 56309 |
| 2012-02-01 01:25:06 | kennyluck | set | nosy: +kennyluck |
| 2011-05-25 22:01:10 | vstinner | set | status: open -> closed resolution: fixed messages: +msg136910 |
| 2011-05-24 20:29:21 | python-dev | set | messages: +msg136792 |
| 2011-05-24 20:24:36 | python-dev | set | nosy: +python-dev messages: +msg136791 |
| 2011-05-24 20:08:14 | arigo | set | messages: +msg136788 |
| 2011-05-24 19:04:14 | vstinner | set | messages: +msg136780 |
| 2011-05-24 18:13:35 | gz | set | nosy: +arigo,gz messages: +msg136777 |
| 2011-05-24 10:17:15 | lemburg | set | messages: +msg136730 title: Incremental encoders of CJK codecs reset the codec at each call to encode() -> Incremental encoders of CJK codecs reset the codec ateach call to encode() |
| 2011-05-24 09:55:35 | vstinner | set | messages: +msg136727 |
| 2011-05-22 15:02:13 | lemburg | set | messages: +msg136534 |
| 2011-05-22 13:44:17 | vstinner | set | nosy: +loewis |
| 2011-05-21 00:33:45 | vstinner | link | issue12057 dependencies |
| 2011-05-17 23:10:05 | vstinner | create | |