gh-126004: fix positions handling in codecs.xmlcharrefreplace_errors #127675
Conversation
/* object is guaranteed to be "ready" */
Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
if (ch < 10) {
    ressize += 2 + 1 + 1;
An alternative to end = start + PY_SSIZE_T_MAX / (2 + 7 + 1); would be to set digits = 1; and, after the series of ifs, check that ressize += (2 + digits + 1); does not overflow. If it does overflow, return a MemoryError. Something like:

if (ressize > PY_SSIZE_T_MAX - (2 + digits + 1)) {
    return PyErr_NoMemory();
}
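For reference, a minimal sketch (not the PR's actual diff, with the variable names start, end and ressize assumed from the discussion) of the clamping that the quoted line performs before the sizing loop; it caps end so that even the worst-case replacement of 2 + 7 + 1 characters per input character cannot overflow Py_ssize_t:

/* Sketch only: cap the number of characters processed so that the
 * sizing loop below cannot overflow Py_ssize_t, even if every
 * character needs the maximal "&#NNNNNNN;" replacement. */
if (end - start > PY_SSIZE_T_MAX / (2 + 7 + 1)) {
    end = start + PY_SSIZE_T_MAX / (2 + 7 + 1);
}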
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think that's what PyCodec_NameReplaceErrors does, but I don't know how performance would be affected. Hitting PY_SSIZE_T_MAX / (2 + 7 + 1) means that we're handling something quite large, so doing the check on ressize at each loop iteration might slow down the handler a bit.
Now, it took me a while to convince myself that it won't slow down the handler by much. Namely, I don't think it would dramatically slow it down because if our characters are > 10^3, we already do the < 10, < 100, < 1000 and < 10000 checks (this last one is needed to know that the character is above 10^3 but not above 10^4). So instead of 4 checks we have 5, which is not that annoying.
Why can we assume that we will have at least 2 checks and not fewer? Well... everything < 100 is actually ASCII, so unless someone is using a special codec for which those characters are not supported, or for artificially created exceptions that indicate their start/end positions incorrectly, we're likely to have at least 2 checks in the loop (namely < 10 and < 100) because the bad characters are likely to be something outside the ASCII range.
This would look like:
Py_ssize_t ressize = 0;
for (Py_ssize_t i = start; i < end; ++i) {
    // The number of characters that each character 'ch' contributes
    // in the result is 2 + k + 1, where k = min{t >= 1 | 10^t > ch},
    // and will be formatted as "&#" + DIGITS + ";". Since the Unicode
    // range is below 10^7, each "block" requires at most 2 + 7 + 1
    // characters.
    int replsize;
    /* object is guaranteed to be "ready" */
    Py_UCS4 ch = PyUnicode_READ_CHAR(obj, i);
    if (ch < 10) {
        replsize = 2 + 1 + 1;
    }
    else if (ch < 100) {
        replsize = 2 + 2 + 1;
    }
    else if (ch < 1000) {
        replsize = 2 + 3 + 1;
    }
    else if (ch < 10000) {
        replsize = 2 + 4 + 1;
    }
    else if (ch < 100000) {
        replsize = 2 + 5 + 1;
    }
    else if (ch < 1000000) {
        replsize = 2 + 6 + 1;
    }
    else {
        assert(ch < 10000000);
        replsize = 2 + 7 + 1;
    }
    if (ressize > PY_SSIZE_T_MAX - replsize) {
        Py_DECREF(obj);
        return PyErr_NoMemory();
    }
    ressize += replsize;
But I'm not very fond of this. I think it's still nicer to have the check outside the loop.
Can we decide in a follow-up PR whether those special checks need to be kept (for instance, we have similar checks for PyCodec_NameReplaceErrors and PyCodec_BackslashReplaceErrors)? For the 'namereplace' handler, we simply break out of the sizing loop:
for (i = start, ressize = 0; i < end; ++i) {
    /* object is guaranteed to be "ready" */
    c = PyUnicode_READ_CHAR(object, i);
    if (ucnhash_capi->getname(c, buffer, sizeof(buffer), 1)) {
        replsize = 1 + 1 + 1 + (int)strlen(buffer) + 1;
    }
    else if (c >= 0x10000) {
        replsize = 1 + 1 + 8;
    }
    else if (c >= 0x100) {
        replsize = 1 + 1 + 4;
    }
    else
        replsize = 1 + 1 + 2;
    if (ressize > PY_SSIZE_T_MAX - replsize)
        break;
    ressize += replsize;
}
and just don't care anymore :') (the reason is that we cannot determine in advance how much space it would take unless we call getname first...)
LGTM.
I'm fine with keeping the code as it is, or changing how the PY_SSIZE_T_MAX limit is handled in a separate PR.
Strange, the Lint job reports an unrelated issue:

diff --git a/Lib/test/test_asyncio/test_subprocess.py b/Lib/test/test_asyncio/test_subprocess.py
index a053db8..341e3e9 100644
--- a/Lib/test/test_asyncio/test_subprocess.py
+++ b/Lib/test/test_asyncio/test_subprocess.py
@@ -905,7 +905,7 @@ def setUp(self):
         # Force the use of the threaded child watcher
         unix_events.can_use_pidfd = mock.Mock(return_value=False)
         super().setUp()
-
+
     def tearDown(self):
         unix_events.can_use_pidfd = self._original_can_use_pidfd
         return super().tearDown()
It might be because I merged main and main itself is failing?
I'll merge this one and the other PR you approved tomorrow (I've reverted the bad commit).
Merged commit 70dcc84 into python:main
start and end values in codecs error handlers #126004
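For readers coming from the linked issue, here is a hedged sketch (not the merged diff) of the kind of positions handling the PR is about: a C error handler fetching the object and the start/end positions from a UnicodeEncodeError through the public C API and keeping them within the bounds of the object before sizing the replacement. The handler name and the clamping rules are illustrative assumptions only.

#include <Python.h>

/* Hypothetical handler skeleton -- the clamping below is an assumption
 * made for illustration, not the code merged in this PR. */
static PyObject *
sketch_xmlcharref_handler(PyObject *Py_UNUSED(module), PyObject *exc)
{
    PyObject *obj = PyUnicodeEncodeError_GetObject(exc);    /* new reference */
    Py_ssize_t start, end;
    if (obj == NULL
        || PyUnicodeEncodeError_GetStart(exc, &start) < 0
        || PyUnicodeEncodeError_GetEnd(exc, &end) < 0)
    {
        Py_XDECREF(obj);
        return NULL;
    }
    Py_ssize_t len = PyUnicode_GET_LENGTH(obj);
    if (start < 0) {
        start = 0;          /* assumption: ignore negative positions */
    }
    if (end > len) {
        end = len;          /* assumption: clip end to the object length */
    }
    if (end < start) {
        end = start;        /* assumption: treat an inverted range as empty */
    }
    /* ... a real handler would size and build the "&#...;" replacement
     * for obj[start:end] here, as in the loops quoted above ... */
    Py_DECREF(obj);
    /* Return an empty replacement and resume encoding at `end`. */
    return Py_BuildValue("(sn)", "", end);
}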