Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-126004: fix positions handling incodecs.xmlcharrefreplace_errors#127675

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged

Conversation

@picnixz
Copy link
Member

@picnixzpicnixz commentedDec 6, 2024
edited by bedevere-appbot
Loading

@picnixzpicnixz requested a review fromencukouJanuary 3, 2025 13:42
/* object is guaranteed to be "ready" */
Py_UCS4ch=PyUnicode_READ_CHAR(obj,i);
if (ch<10) {
ressize+=2+1+1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

An alternative toend = start + PY_SSIZE_T_MAX / (2 + 7 + 1); would be to setdigits = 1; and after the serie of ifs, check thatressize += (2 + digits + 1); doesn't overflow. If it does overflow, return MemoryError. Something like:

if (resize > PY_SSIZE_T_MAX - (2 + digits + 1)) {    return PyErr_MemoryError();}

Copy link
MemberAuthor

@picnixzpicnixzJan 22, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think that's whatPyCodec_NameReplaceErrors does but I don't know how performances would be affected. HittingPY_SSIZE_T_MAX / (2 + 7 + 1) means that we're handling something that is quite large. So doing the check onresize at each loop iteration might slow down the handler a bit.

Now, it took me a while to convince myself that it won't be slowing down the handlersby much. Namely, I don't think it would dramatically slow it down because if our characters are > 10^3, we would do< 10,< 100,< 1000 and< 10000 checks (this last one is needed to know that it's > 10^3 but not > 10^4) already. So instead of 4 we have 5 checks which is not that annoying.

Why can we assume that we will have at least 2 checks and not less? Well... everything < 100 is actually ASCII, so and unless someone is using a special codec for which those characters are not supported or for artificially created exceptions that indicate their start/end positions incorrectly, we're likely to have at least 2 checks in the loop (namely < 10 and < 100) because the bad characters are likely do be something outside the ASCII range.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This would look like:

Py_ssize_tressize=0;for (Py_ssize_ti=start;i<end;++i) {// The number of characters that each character 'ch' contributes// in the result is 2 + k + 1, where k = min{t >= 1 | 10^t > ch}// and will be formatted as "&#" + DIGITS + ";". Since the Unicode// range is below 10^7, each "block" requires at most 2 + 7 + 1// characters.intreplsize;/* object is guaranteed to be "ready" */Py_UCS4ch=PyUnicode_READ_CHAR(obj,i);if (ch<10) {replsize=2+1+1;        }elseif (ch<100) {replsize=2+2+1;        }elseif (ch<1000) {replsize=2+3+1;        }elseif (ch<10000) {replsize=2+4+1;        }elseif (ch<100000) {replsize=2+5+1;        }elseif (ch<1000000) {replsize=2+6+1;        }else {assert(ch<10000000);replsize=2+7+1;        }if (ressize>PY_SSIZE_T_MAX-replsize) {Py_DECREF(obj);returnPyErr_NoMemory();        }ressize+=replsize;

But I'm not very fond of this. I think it's still nicer to have the check outside the loop.

Copy link
MemberAuthor

@picnixzpicnixzJan 22, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we decide in a follow-up PR whether those special checks (for instance, we have similar checks forPyCodec_NameReplaceErrors andPyCodec_BackslashReplaceErrors) need to be kept. For the 'namereplace' handler, we purely break actually:

for (i=start,ressize=0;i<end;++i) {/* object is guaranteed to be "ready" */c=PyUnicode_READ_CHAR(object,i);if (ucnhash_capi->getname(c,buffer,sizeof(buffer),1)) {replsize=1+1+1+(int)strlen(buffer)+1;            }elseif (c >=0x10000) {replsize=1+1+8;            }elseif (c >=0x100) {replsize=1+1+4;            }elsereplsize=1+1+2;if (ressize>PY_SSIZE_T_MAX-replsize)break;ressize+=replsize;        }

and just don't care anymore :') (the reason is that cannot determine in advance how much it would take unless we callgetname before...)

Copy link
Member

@vstinnervstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM.

I'm fine with keeping the code as it is, or changing how PY_SSIZE_T_MAX limit is handled in a separated PR.

@vstinner
Copy link
Member

Strange, the Lint job reports an unrelated issue:

diff --git a/Lib/test/test_asyncio/test_subprocess.py b/Lib/test/test_asyncio/test_subprocess.pyindex a053db8..341e3e9 100644--- a/Lib/test/test_asyncio/test_subprocess.py+++ b/Lib/test/test_asyncio/test_subprocess.py@@ -905,7 +905,7 @@ def setUp(self):             # Force the use of the threaded child watcher             unix_events.can_use_pidfd = mock.Mock(return_value=False)             super().setUp()-+         def tearDown(self):             unix_events.can_use_pidfd = self._original_can_use_pidfd             return super().tearDown()

@picnixz
Copy link
MemberAuthor

Strange, the Lint job reports an unrelated issue:

It might be because I merged main and that main is failing?

@picnixz
Copy link
MemberAuthor

@picnixz

This comment was marked as off-topic.

@picnixz
Copy link
MemberAuthor

I'll merge this one and the other PR you approved tomorrow (I've reverted the bad commit).

@picnixzpicnixz merged commit70dcc84 intopython:mainJan 23, 2025
41 checks passed
@picnixzpicnixz deleted the fix/codecs/xmlcharrefreplace-errors-126004 branchJanuary 23, 2025 10:42
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@vstinnervstinnervstinner approved these changes

@encukouencukouAwaiting requested review from encukou

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

3 participants

@picnixz@vstinner@encukou

[8]ページ先頭

©2009-2025 Movatter.jp