gh-126004: fix positions handling in codecs.xmlcharrefreplace_errors #127675
Conversation
/* object is guaranteed to be "ready" */
Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
if (ch < 10) {
    ressize += 2 + 1 + 1;
An alternative to end = start + PY_SSIZE_T_MAX / (2 + 7 + 1); would be to set digits = 1; and, after the series of ifs, check that ressize += (2 + digits + 1); does not overflow. If it does overflow, return a MemoryError. Something like:

if (ressize > PY_SSIZE_T_MAX - (2 + digits + 1)) {
    return PyErr_NoMemory();
}
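For reference, a minimal sketch (not the PR's actual diff, with the variable names start, end and ressize assumed from the discussion) of the clamping that the quoted line performs before the sizing loop; it caps end so that even the worst-case replacement of 2 + 7 + 1 characters per input character cannot overflow Py_ssize_t:

/* Sketch only: cap the number of characters processed so that the
 * sizing loop below cannot overflow Py_ssize_t, even if every
 * character needs the maximal "&#NNNNNNN;" replacement. */
if (end - start > PY_SSIZE_T_MAX / (2 + 7 + 1)) {
    end = start + PY_SSIZE_T_MAX / (2 + 7 + 1);
}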
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think that's what PyCodec_NameReplaceErrors does, but I don't know how performance would be affected. Hitting PY_SSIZE_T_MAX / (2 + 7 + 1) means that we're handling something quite large, so doing the check on ressize at each loop iteration might slow down the handler a bit.
Now, it took me a while to convince myself that it won't slow down the handler by much. Namely, I don't think it would dramatically slow it down because if our characters are > 10^3, we already do the < 10, < 100, < 1000 and < 10000 checks (this last one is needed to know that the character is above 10^3 but not above 10^4). So instead of 4 checks we have 5, which is not that annoying.
Why can we assume that we will have at least 2 checks and not fewer? Well... everything < 100 is actually ASCII, so unless someone is using a special codec for which those characters are not supported, or for artificially created exceptions that indicate their start/end positions incorrectly, we're likely to have at least 2 checks in the loop (namely < 10 and < 100) because the bad characters are likely to be something outside the ASCII range.
This would look like:
Py_ssize_t ressize = 0;
for (Py_ssize_t i = start; i < end; ++i) {
    // The number of characters that each character 'ch' contributes
    // in the result is 2 + k + 1, where k = min{t >= 1 | 10^t > ch},
    // and will be formatted as "&#" + DIGITS + ";". Since the Unicode
    // range is below 10^7, each "block" requires at most 2 + 7 + 1
    // characters.
    int replsize;
    /* object is guaranteed to be "ready" */
    Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
    if (ch < 10) {
        replsize = 2 + 1 + 1;
    }
    else if (ch < 100) {
        replsize = 2 + 2 + 1;
    }
    else if (ch < 1000) {
        replsize = 2 + 3 + 1;
    }
    else if (ch < 10000) {
        replsize = 2 + 4 + 1;
    }
    else if (ch < 100000) {
        replsize = 2 + 5 + 1;
    }
    else if (ch < 1000000) {
        replsize = 2 + 6 + 1;
    }
    else {
        assert(ch < 10000000);
        replsize = 2 + 7 + 1;
    }
    if (ressize > PY_SSIZE_T_MAX - replsize) {
        Py_DECREF(obj);
        return PyErr_NoMemory();
    }
    ressize += replsize;
But I'm not very fond of this. I think it's still nicer to have the check outside the loop.
Can we decide in a follow-up PR whether those special checks need to be kept (for instance, we have similar checks for PyCodec_NameReplaceErrors and PyCodec_BackslashReplaceErrors)? For the 'namereplace' handler, we simply break out of the sizing loop:
for (i = start, ressize = 0; i < end; ++i) {
    /* object is guaranteed to be "ready" */
    c = PyUnicode_READ_CHAR(object, i);
    if (ucnhash_capi->getname(c, buffer, sizeof(buffer), 1)) {
        replsize = 1 + 1 + 1 + (int)strlen(buffer) + 1;
    }
    else if (c >= 0x10000) {
        replsize = 1 + 1 + 8;
    }
    else if (c >= 0x100) {
        replsize = 1 + 1 + 4;
    }
    else
        replsize = 1 + 1 + 2;
    if (ressize > PY_SSIZE_T_MAX - replsize)
        break;
    ressize += replsize;
}
and just don't care anymore :') (the reason is that we cannot determine in advance how much space it would take unless we call getname first...)
LGTM.
I'm fine with keeping the code as it is, or changing how the PY_SSIZE_T_MAX limit is handled in a separate PR.
Strange, the Lint job reports an unrelated issue:

diff --git a/Lib/test/test_asyncio/test_subprocess.py b/Lib/test/test_asyncio/test_subprocess.py
index a053db8..341e3e9 100644
--- a/Lib/test/test_asyncio/test_subprocess.py
+++ b/Lib/test/test_asyncio/test_subprocess.py
@@ -905,7 +905,7 @@ def setUp(self):
         # Force the use of the threaded child watcher
         unix_events.can_use_pidfd = mock.Mock(return_value=False)
         super().setUp()
-
+
     def tearDown(self):
         unix_events.can_use_pidfd = self._original_can_use_pidfd
         return super().tearDown()
It might be because I merged main and main itself is failing?
I'll merge this one and the other PR you approved tomorrow (I've reverted the bad commit).
Merged commit 70dcc84 into python:main
start and end values in codecs error handlers #126004
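For readers coming from the linked issue, here is a hedged sketch (not the merged diff) of the kind of positions handling the PR is about: a C error handler fetching the object and the start/end positions from a UnicodeEncodeError through the public C API and keeping them within the bounds of the object before sizing the replacement. The handler name and the clamping rules are illustrative assumptions only.

#include <Python.h>

/* Hypothetical handler skeleton -- the clamping below is an assumption
 * made for illustration, not the code merged in this PR. */
static PyObject *
sketch_xmlcharref_handler(PyObject *Py_UNUSED(module), PyObject *exc)
{
    PyObject *obj = PyUnicodeEncodeError_GetObject(exc);    /* new reference */
    Py_ssize_t start, end;
    if (obj == NULL
        || PyUnicodeEncodeError_GetStart(exc, &start) < 0
        || PyUnicodeEncodeError_GetEnd(exc, &end) < 0)
    {
        Py_XDECREF(obj);
        return NULL;
    }
    Py_ssize_t len = PyUnicode_GET_LENGTH(obj);
    if (start < 0) {
        start = 0;          /* assumption: ignore negative positions */
    }
    if (end > len) {
        end = len;          /* assumption: clip end to the object length */
    }
    if (end < start) {
        end = start;        /* assumption: treat an inverted range as empty */
    }
    /* ... a real handler would size and build the "&#...;" replacement
     * for obj[start:end] here, as in the loops quoted above ... */
    Py_DECREF(obj);
    /* Return an empty replacement and resume encoding at `end`. */
    return Py_BuildValue("(sn)", "", end);
}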