- Notifications
You must be signed in to change notification settings - Fork302
Do not directly use isolated surrogates in unicode literals#150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
…orms besides Jython
hoppipolla-critic-bot commentedMay 3, 2014
Critic review:https://critic.hoppipolla.co.uk/r/1443 This is an external review system which you may optionally use for the code review of your pull request. In order to help critic track your changes, please donot make in-place history rewrites (e.g. via |
jimbaker commentedMay 3, 2014
I should mention that my branch of pip has the same change proposed here in its vendor lib inclusion of html5lib |
gsnedders commentedMay 5, 2014
Really I'd rather Jython just changed so that is matched the Python documentation and CPython and PyPy and had its unicode type be 16-/32-bit code-units. :) But… Trying to run the tests with that branch of Jython on your branch of html5lib leads to a bunch of errors. Notably, Otherwise there's a bunch of failures due to lack of an euc_jp codec. The docs say:
…which I take to mean as Python, as a language, comes with them. Hence that's seems to be another bug in Jython. |
jimbaker commentedMay 5, 2014
Guido van Rossum summarized the issue of UTF-16 support in Jython well, which we used as guidance when Jython 2.5 was under development:
Re euc_jp codec, this is a known bug with respect to a lack of CJK codecs (http://bugs.jython.org/issue1066). Hopefully there has been some more recent work on the corresponding patch. However, I did just re-run using $~/jythondev/jython-socket-reboot/dist/bin/nosetests --verbose...----------------------------------------------------------------------Ran 1038 testsin 16.498sOK Maybe I need to specify additional options? |
gsnedders commentedMay 5, 2014
Did you init/update the git submodule? See the readme. You should be looking at tens of thousands of tests. FWIW, while the Py2 language reference uses the ambiguous phrasing of "Unicode ordinal" in defining the unicode type (what on earth a "Unicode ordinal" is is a very good question — it's not defined anywhere!), the Py3 language reference states that the str type is a sequence of Unicode codepoints (which would therefore mean lone surrogates are allowed). But this is rehashing what I said in the Jython bug and probably not the place to debate Python semantics. |
jimbaker commentedMay 6, 2014
@gsnedders Got it, I was able to run 16729 tests, so we are now on the same page. Of these I'm seeing 7 errors. 6 are definitely euc_jp; 1 is failing in the middle of a loop, which I haven't investigated yet. Maybe this is euc_jp too? Since we would like to fixhttp://bugs.jython.org/issue1066 and support CJK codecs, I'm going to put this PR on hold on our side until that CJK support is complete, which I would be expect in the beta 4. This timing seems to be sound given the work by Yuji Yamano and the fact that this is now getting some attention. Because we will also will be integrating ensurepip during beta 4, there may be an interval where ensurepip will refer to a Jython-specific fork of pip (so beyond the current practice of referring people tohttps://github.com/jimbaker/pip to try things out), but this should be short in duration. |
gsnedders commentedMay 6, 2014
The other error is another case of lone-surrogates, I can tell you that much. |
jimbaker commentedMay 6, 2014
@gsnedders re the other case: will take a look. Thanks again for your help! I have updatedhttp://bugs.jython.org/issue1066 to reflect this CJK codec dependency. |
gsnedders commentedMay 17, 2014
When you next take a look at this, mind adding |
agronholm commentedMay 30, 2014
So where are we with this right now? |
jimbaker commentedJun 14, 2014
The CJK work is now complete in Jython trunk: Jython now supports being able to wrap java.nio.charset codec support with the standard Python codec API. This includes the tested euc_jp codec in this test suite. There's one remaining problem with isolated surrogates with the tests corresponding to unicodeCharsProblematic.test (a data file used to construct distinct unit tests, a rather clever approach); some sort of workaround will be required. I will update the branch with this test workaround accordingly, along with .gitignore. |
HTMLUnicodeInputStream objects from unicode strings that containisolated surrogates. Such tests are not meaningful on Jython whichdoes not allow for invalid unicode strings to be decoded in the firstplace.
jimbaker commentedJun 16, 2014
All tests pass with python 2.7, python 3.4, and jython 2.7 trunk using html5lib-tests test data. With this latest change, Jython will now only run 17353 of the current 17357 tests, effectively excluding the following test cases in unicodeCharsProblematic.test: However, it does run the test case in that data file using I did notice that the JSON parsing of namedentities.test is rather slow. Something to fixed separately, by using the Jackson package for the json module implementation. |
jimbaker commentedJun 16, 2014
I should also mention thathttps://bitbucket.org/jimbaker/jython-socket-reboot has been deleted, now that it has been merged intohttps://bitbucket.org/jython/jython (or alternatively,http://hg.python.org/jython) |
dstufft commentedJun 16, 2014
For what it's worth, if this gets merged and released within the next month or so It'll make it into pip 1.6. Not sure what the release schedule looks like for html5lib though :) |
jimbaker commentedJul 2, 2014
@gsnedders Any updates/comments on this pull request? We are really hoping that pip 1.6 will be able to support Jython 2.7, as well as other users of html5lib-python |
gsnedders commentedJul 10, 2014
Apologies for being a bit slow (I've been all over the place travelling pretty continuously, and I'm now on holiday till EuroPython); there's a couple of comments on Critic now, see the first comment above. @dstufft: the html5lib release schedule is typically someone going, "oh, hey, I need this bugfix that's on master, could you make a release?", and me responding with a release. :) (In theory, I plan on making releases every so often once anything worthwhile happens, but other people often find obscure bugfixes worthwhile, so it gets released when they ask.) |
jimbaker commentedJul 11, 2014
@gsnedders Looking forward to seeing you at EuroPython! I'll respond to the Critic review, thanks! |
jgraham commentedJul 11, 2014
I added some more comments on critic. |
tseaver commentedNov 29, 2014
Hmm, test failures are due to trailing whitespace. |
gsnedders commentedNov 29, 2014
#168 is essentially the same as this with the trailing whitespace fixed. |
tseaver commentedNov 29, 2014
#168 failsall its tests, not just the flake8 ones. |
gsnedders commentedNov 29, 2014
That odd givennonameentername@c4755d3 is the only extra commit on the branch… Literally the whole diff is just that one commit! So, um… Somehow submodule not being updated properly on Travis? Because that looks like it's run with a more recent html5lib-tests revision… |
gsnedders commentedNov 29, 2014
Oh, Travis CI seems to run the implicit merge-commit in html5lib-python@refs/pull/168/merge. Hence given it's based on master (which currently fails tests because#174 isn't fixed yet) and run more recently it fails. |
gsnedders commentedNov 29, 2014
(That's annoying because it means that when Travis CI runs a given PR it tests different things; if I were to make it retest#150 it would have all the same failures, but with flake8 finding nothing, le sigh.) |
tseaver commentedFeb 19, 2015
I was hoping that the recentjython 2.7b4 release somehow worked around this problem, but it is still borken. |
tseaver commentedApr 25, 2015
Hmm, I don't know of any resolution here. Are you in the process of merging#168? |
gsnedders commentedApr 25, 2015
Just closing this because#168 is it's replacement rather than keeping both open. |
gsnedders commentedApr 25, 2015
(We'll see if anything happens soon. Maybe.) |
Jython does not support isolated surrogates in unicode, including in unicode literals. This has been reported in#2 This bug is critical for Jython due to the fact that html5lib is a vendor lib for pip, and this is blocking pip from running on Jython.
For platforms besides Jython, this pull request allows for these surrogates to be constructed in literals, but through an additional step of indirection. For Jython itself, Jython's normal decode of literals will ensure that such invalid unicode strings cannot be constructed from any source.
To run this on Jython:
Note that in the dev build, you will find executables in dist/bin, such as dist/bin/jython or dist/bin/pip
The jython-socket-reboot branch is nearly complete for merging into Jython; it is a major component of Jython 2.7.0 beta 3. (I'm a core dev of Jython.)