Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-111545: Add Py_HashDouble() function#112449

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Closed
vstinner wants to merge6 commits intopython:mainfromvstinner:hash_double3

Conversation

vstinner
Copy link
Member

@vstinnervstinner commentedNov 27, 2023
edited by github-actionsbot
Loading

  • Add again the private _PyHASH_NAN constant.
  • Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.

📚 Documentation preview 📚:https://cpython-previews--112449.org.readthedocs.build/

@vstinner
Copy link
MemberAuthor

@encukou@mdickinson@serhiy-storchaka: Would you mind to review this change?

It's a new API for PyHash_Double() function: other APIs were discussed in PR#112095.

@vstinnervstinner changed the titlegh-111545: Add PyHash_Double() functiongh-111545: Add Py_HashDouble() functionNov 30, 2023
@vstinner
Copy link
MemberAuthor

I renamed the function fromPyHash_Double() toPy_HashDouble() following#112095 (comment) discussion.

Copy link
Member

@mdickinsonmdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The API looks fine to me. I'm wondering whether we want to consider switching the meaning of the return value, so that0 is used for non-NaNs and1 for NaNs; I want to think of the return value as essentially anis_nan check, and from that perspective the current API has an extra negation to get my brain around.

But either way is probably fine (so long as it's documented).

@vstinner
Copy link
MemberAuthor

I rebased the PR on the main branch.

I'm wondering whether we want to consider switching the meaning of the return value, so that 0 is used for non-NaNs and 1 for NaNs

Ok, I modified the API to return 1 if the float is not-a-number (NaN), and return 0 otherwise.

@encukou
Copy link
Member

encukou commentedDec 4, 2023
edited
Loading

In similar API, 1 means that you're getting a meaningful*result, and 0 means the result isNULL:

IMO, that's a good convention to establish (here, with0 rather thanNULL).
(The current devguide uses“lesser” and “greater”, which always seemed rather opaque to me, but “you're getting more data” does sounds like the “greater” result.)

@mdickinson
Copy link
Member

IMO, that's a good convention to establish

Fair enough. That convention does feel somewhat in conflict with the common "zero-return = success", "non-zero-return = failure" API that we have elsewhere, but that's a wider issue.

@mdickinson
Copy link
Member

but “you're getting more data” does sounds like the “greater” result

Agreed.

@encukou
Copy link
Member

the common "zero-return = success", "non-zero-return = failure" API

If I'm reading the room correctly, we're reserving only-1 for errors -- but only the “hard” errors where an exception is set.
In the case of exceptions not being set, “success” and “failure” tend to mean different things to different people :)

@vstinner
Copy link
MemberAuthor

I reverted my change to move back to the previous API:

  • Return 0 if the argument is not-a-number (NaN)
  • Return 1 otherwise.

@mdickinson@encukou: Would you mind to review/approve formally the PR?

encukou
encukou previously approved these changesDec 4, 2023
Copy link
Member

@encukouencukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks good, thanks!

Other WG members might disagree, but I don't see anything too controversial any more.

@vstinner
Copy link
MemberAuthor

@encukou:

Other WG members might disagree, but I don't see anything too controversial any more.

I createdcapi-workgroup/decisions#2 for the C API Working Group.

@vstinner
Copy link
MemberAuthor

I merged my Py_HashPointer() PR. I rebased this PR on top on it.

@encukouencukou dismissed theirstale reviewDecember 6, 2023 14:53

The PR was rebased.

@vstinner
Copy link
MemberAuthor

I will update the PR to address the reviews, once the API will be approved by the C API Working Group.

* Add again the private _PyHASH_NAN constant.* Add tests: Modules/_testcapi/hash.c and  Lib/test/test_capi/test_hash.py.
Copy link
Member

@encukouencukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM, thank you!

@vstinner
Copy link
MemberAuthor

I updated What's New in Python 3.13 to suggest a recipe replacing usage of the private_Py_HashDouble() function:

* Add :c:func:`Py_HashDouble` function to hash a C double number. Existing code  using the private ``_Py_HashDouble()`` function can be updated to::        Py_hash_t        hash_double(PyObject *obj, double value)        {            Py_hash_t hash;            if (Py_HashDouble(value, &hash) == 0) {                hash = Py_HashPointer(obj);            }            return hash;        }  (Contributed by Victor Stinner in :gh:`111545`.)

@serhiy-storchaka
Copy link
Member

I would prefer a simpler API that does not treat NaN as special at all and leave this to the user:

        Py_hash_t        hash_double(PyObject *obj, double value)        {            if (Py_IS_NAN(value)) {                return Py_HashPointer(obj);            }            else {                return Py_HashDouble(value);            }        }

Even if it is slightly slower. But it requires less mental effort to understand and use the API.

mdickinson reacted with thumbs up emoji

@encukou
Copy link
Member

Py_hash_t hash_double(PyObject *obj, double value)

What should an user pass in asobj?

@vstinner
Copy link
MemberAuthor

What should an user pass in as obj?

@serhiy-storchaka proposes the API:Py_hash_t Py_HashDouble(double value) which never fails. It returns 0 for NaN.

I would be fine with this API. It was more or less proposed earlier. But it was said that having to callPy_IS_NAN() might have an impact on performance. I supposed thatPy_IS_NAN() is really cheap.

In terms of API,Py_hash_t Py_HashDouble(double value) is simpler thanint Py_HashDouble(double value, Py_hash_t *result). There is no need to have to think ifresult isNULL or not, and there is no need to check for error.

Py_hash_t hash_double(PyObject *obj, double value) is only the example that I added to What's New in Python 3.13 doc to migrate from the private_Py_HashDouble() function.

@vstinner
Copy link
MemberAuthor

I wrote PR#113115 which implementsPy_hash_t Py_HashDouble(double value) API so you can look at code to compare with this PR implementing the APIint Py_HashDouble(double value, Py_hash_t *result).

@vstinner
Copy link
MemberAuthor

@serhiy-storchaka@mdickinson: So which API do you prefer?

  • Py_hash_t Py_HashDouble(double value)
  • int Py_HashDouble(double value, Py_hash_t *result)

@serhiy-storchaka
Copy link
Member

I prefer your initial interfacePy_hash_t Py_HashDouble(double value), but maybe without any special handling for NaN. It should be handled externally by user.

Unless there is really large performance advantage in using the second variant. I do not know how large should it be to justify more complex interface. 10% is not large.

@vstinner
Copy link
MemberAuthor

I prefer your initial interface Py_hash_t Py_HashDouble(double value)

Ok.

but maybe without any special handling for NaN. It should be handled externally by user.

If prefer to have a deterministic behavior and always return the same hash value (0) if value is NaN. There are legit use cases to treat NaN as hash value 0. Seemy comment.

@mdickinson
Copy link
Member

I prefer your initial interfacePy_hash_t Py_HashDouble(double value), but maybe without any special handling for NaN. It should be handled externally by user.

Ditto. I don't see a need for anything more complicated than this, and it feels wrong to me to allow minor performance differences to drive API design.

If prefer to have a deterministic behavior and always return the same hash value (0) if value is NaN. There are legit use cases to treat NaN as hash value 0.

This sounds good to me.

vstinner reacted with thumbs up emoji

@vstinner
Copy link
MemberAuthor

Ditto. I don't see a need for anything more complicated than this, and it feels wrong to me to allow minor performance differences to drive API design.

Ok. I proposed to change C API Working Group decision on this API:capi-workgroup/decisions#2 (comment)

@serhiy-storchaka
Copy link
Member

If prefer to have a deterministic behavior and always return the same hash value (0) if value is NaN.

It creates a vulnerability. CPython itself never calls this API for a NaN. If a third-party code calls it for a NaN value, it will allow to easily create non-equal objects with the same hash.

@encukou
Copy link
Member

The simpler API,Py_hash_t Py_HashDouble(double value), has a footgun: if you forget to do the NaN check, all your NaNs will hash the same.
IMO, this is a case of expert blindness: it you know floats and Python's float-hashing behaviour, the requiredPy_IS_NAN check is second nature. But IMO, we shouldn't be designing the API for experts.
So I'd present slightly different choices:

  • int Py_HashDouble(double value, Py_hash_t *result), as in this PR, which makes it more obvious that different kinds of results are possible.
  • Py_hash_t Py_HashNonNaNDouble(double value), with an uglier name, where the necessary check is obvious from that name.

@vstinner
Copy link
MemberAuthor

The simpler API, Py_hash_t Py_HashDouble(double value), has a footgun: if you forget to do the NaN check, all your NaNs will hash the same.

Can this concern be resolved with better documentation? Explain in which case treating all NaN "as equal" can be an issue, or suggest a solution such as@serhiy-storchaka's recipe#112449 (comment) ?

I'm not convinced that using the same hash value (0) for all NaN numbers is a big deal, since Python was doing that until 3.9. The hash value is just an optimization to avoid the slower comparison operator, it's more about performance than about correctness.

$ python3.9>>> nan1=float("inf")*0>>> nan2=float("inf")*0>>> nan2 is nan1False# Same hash value>>> hash(nan1), hash(nan2)(0, 0)# dict works "as expected">>> d={nan1: 1, nan2: 2}; d{nan: 1, nan: 2}>>> d[nan1]1>>> d[nan2]2# where the magic happens, NaN is not equal to NaN>>> nan2 == nan1False

@encukou
Copy link
Member

Can this concern be resolved with better documentation?

IMO, no. When doing a review, you don't check the docs of all functions you see. Just the surprising ones.

@serhiy-storchaka
Copy link
Member

It is a big deal. Try to create a dict with 10000 or 100000 of NaNs.

d= {float('nan'):iforiinrange(100000)}

@vstinner
Copy link
MemberAuthor

It is a big deal. Try to create a dict with 10000 or 100000 of NaNs.

Aha, I see: it's way faster in Python 3.10 where hash(nan) is not always 0.

# hash(nan) = 0$ python3.9 -m timeit '{float("nan"): i for i in range(10_000)}'1 loop, best of 5: 1.12 sec per loop# hash(nan) = id(obj)$ python3.10 -m timeit '{float("nan"): i for i in range(10_000)}'100 loops, best of 5: 2.7 msec per loop

But the proposed API is forPy_HashDouble() which only takes a C double. It's up to the caller to use something else to calculate a different hash value for NaN. The question is how we can guide users to such solution.

For me,Py_HashDouble() is like a primitive used to build a hash function.

@vstinner
Copy link
MemberAuthor

@encukou:

IMO, no. When doing a review, you don't check the docs of all functions you see. Just the surprising ones.

I'm talking aboutPy_HashDouble() documentation. I'm asking if we can write doc in a way that users avoid the hash collision issue if their numbers can be NaN.

@encukou
Copy link
Member

I'm talking about Py_HashDouble() documentation. I'm asking if we can write doc in a way that users avoid the hash collision issue if their numbers can be NaN.

I'm also talking about the documentation. Users won't read the docs. If the function has surprising behaviour, adding a hint that you should look at the docs goes a long way.

vstinner reacted with thumbs up emoji

@vstinner
Copy link
MemberAuthor

I created PR#112095 more than 1 month ago. I spent time to run benchmark, implement different APIs, try to collect feedback on each API, and discuss in length advantages and disadvantages of each API. Sadly, we failed to reach a consensus on the API. Nowanother API is being discussed. The API looks simple to me, I didn't expect to spend more than one month on a single function.

I need to take a break from that topic. I don't have the energy to dig into these discussions. I prefer to close the PR for now.

@vstinnervstinner deleted the hash_double3 branchDecember 20, 2023 11:03
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@encukouencukouencukou approved these changes

@iritkatrieliritkatrieliritkatriel left review comments

@mdickinsonmdickinsonmdickinson approved these changes

@tirantiranAwaiting requested review from tirantiran is a code owner

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

5 participants
@vstinner@encukou@mdickinson@serhiy-storchaka@iritkatriel

[8]ページ先頭

©2009-2025 Movatter.jp