
feat: add Python bindings #108

Draft · winstxnhdw wants to merge 9 commits into ogxd:main from winstxnhdw:main
Conversation

@winstxnhdw (Author) commented Nov 14, 2024 (edited)

Summary

This PR adds Python bindings with type hints for `gxhash32`, `gxhash64` and `gxhash128`. Only `cargo` is required. Currently, it's able to hash a 5 GB file in 0.7s.

Closes #97 ("Can we get an FFI for Python?").

Demo

pip install"gxhash @ git+https://git@github.com/winstxnhdw/gxhash.git#subdirectory=py-gxhash"
fromgxhashimport (gxhash32,gxhash32_async,gxhash64,gxhash64_async,gxhash128,gxhash128_async,)asyncdefmain():seed=1234file=TemporaryFile()file.write(bytes([42]*1000))file.seek(0)hash32=gxhash32(file,seed)file.seek(0)hash32_async=awaitgxhash32_async(file,seed)file.seek(0)hash64=gxhash64(file,seed)file.seek(0)hash64_async=awaitgxhash64_async(file,seed)file.seek(0)hash128=gxhash128(file,seed)file.seek(0)hash128_async=awaitgxhash128_async(file,seed)print(f"{hash32=}")print(f"{hash32_async=}")print(f"{hash64=}")print(f"{hash64_async=}")print(f"{hash128=}")print(f"{hash128_async=}")asserthash32==hash32_asyncasserthash64==hash64_asyncasserthash128==hash128_async

Todo

  • Upload wheels to PyPI
  • CI for compiling wheels
  • Drop the GIL
  • Async support

@winstxnhdw (Author) commented Nov 14, 2024 (edited)

Should we consider creating a GIL-less counterpart for each function, like `gxhash32_nogil`? Although dropping the GIL is effectively just flipping a boolean, our single-threaded users may see it as unnecessary performance overhead.
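
For concreteness, the two shapes under discussion might look like this (both `gxhash32_nogil` and the `nogil` flag are hypothetical; neither exists in this PR):

```python
# Hypothetical API shapes, for discussion only:

# Option A: a dedicated GIL-less counterpart per function.
# hash32 = gxhash32_nogil(file, seed)

# Option B: an opt-in flag on the existing functions.
# hash32 = gxhash32(file, seed, nogil=True)
```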

But more importantly, is `gxhash` even thread-safe?

@winstxnhdw (Author) commented Nov 14, 2024 (edited)

Perhaps we should also consider moving the `py-gxhash` directory into `ffi/python` and the existing C FFI to `ffi/c`.

@ogxd (Owner) commented

Hello @winstxnhdw,

I tried installing py-gxhash via pip using your command, but I get this error:

```
ERROR: Could not find a version that satisfies the requirement version-subpkg (unavailable) (from versions: none)
ERROR: No matching distribution found for version-subpkg (unavailable)
```

Am I missing something? I am using Python 3.13.

> But more importantly, is `gxhash` even thread-safe?

Gxhash has no shared or global state. Any state is internal to the method scope (and thus thread-local). The input data is passed by reference, so it must not be mutated while gxhash is reading from it, or the produced hash will be undefined, but it won't crash or hang. You can indeed drop the GIL if you wish.
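
To illustrate the point, a minimal sketch of hashing from multiple threads, assuming the `gxhash64(file, seed)` signature from the demo above and a build that releases the GIL while hashing:

```python
from concurrent.futures import ThreadPoolExecutor
from tempfile import TemporaryFile

from gxhash import gxhash64

def hash_copy(seed: int) -> int:
    # Each worker hashes its own file handle, so nothing is mutated
    # while gxhash reads from it.
    with TemporaryFile() as file:
        file.write(bytes([42] * 1000))
        file.seek(0)
        return gxhash64(file, seed)

# Safe because gxhash keeps no shared or global state; the threads
# only run in parallel if the GIL is released during hashing.
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(hash_copy, [1234] * 4))

assert all(digest == digests[0] for digest in digests)
```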


@winstxnhdw (Author) commented Nov 14, 2024 (edited)

> Am I missing something? I am using Python 3.13.

Looks like your shell is not escaping the URL correctly. I have updated the installation instructions. This one should work:

```sh
pip install "gxhash @ git+https://git@github.com/winstxnhdw/gxhash.git#subdirectory=py-gxhash"
```

winstxnhdw force-pushed the main branch 8 times, most recently from f24a5d3 to f5e453a on November 15, 2024.
@winstxnhdw (Author) commented Nov 15, 2024 (edited)

I noticed that the docs use the specific phrase "arbitrary stream of bytes", but this is not technically correct, is it? None of the `gxhash*` functions can accept a stream.

@winstxnhdw (Author) commented Nov 15, 2024 (edited)

The current async function is more expensive than running it synchronously, even for 5 GB files. Using nogil + Python threads is more than 20x faster.

Perhaps, instead of spawning a tokio thread, we should pass a Python thread to Rust.

Also, it's really interesting that repeated calls are orders of magnitude faster than the first call. I guess this is the cache being populated and then hit. With so much going on in Python, I always assumed that the cache would be evicted quickly.

@ogxd (Owner) commented Nov 15, 2024 (edited)

> Looks like your shell is not escaping the URL correctly. I have updated the installation instructions. This one should work:

Indeed, it almost did! Thank you :).
One remaining issue is that the hybrid flag does not build on ARM (I am on a MacBook M1) because this feature is reserved for x86 + nightly Rust. I changed the behavior to ignore the flag in such cases instead of failing to build. I also used the no_std crate since `Hasher` and other stuff is unnecessary here. Feel free to copy or cherry-pick.

> The current async function is more expensive than running it synchronously, even for 5 GB files. Using nogil + Python threads is more than 20x faster.
>
> Perhaps, instead of spawning a tokio thread, we should pass a Python thread to Rust.

I don't think it makes a lot of sense to expose an async version of gxhash. Async/await is usually reserved for network-bound or sometimes even disk-bound operations, so that a thread is not frozen while waiting for the response. Here, gxhash is purely CPU-bound. Suppose in a given context the hashing time is substantial enough that you think one may want to execute it asynchronously. In that case, it also means there is a substantial amount of CPU work delegated to a threadpool thread (from tokio or Python), which is not something you want (it may cause threadpool starvation). In such a case, what you want is to delegate the work to a dedicated background thread, and you wouldn't use async/await (instead, spawn a thread and send/consume work via a queue, for instance).

> I noticed that the docs use the specific phrase "arbitrary stream of bytes", but this is not technically correct, is it? None of the `gxhash*` functions can accept a stream.

It is correct; however, this documentation is on the `Hasher` trait implementation, which is a special Rust construct to build a hash piece by piece. The mention of "arbitrary stream of bytes" is actually a copy-paste from the official documentation of that trait: https://doc.rust-lang.org/std/hash/trait.Hasher.html

> This PR adds Python bindings with type hints for `gxhash32`, `gxhash64` and `gxhash128`. Only `cargo` is required. Currently, it's able to hash a 5 GB file in 0.7s.

It's really awesome! I was surprised to see how little it would require locally in order to make this work, congrats! At this point I am wondering whether the future of this belongs in a subfolder (like you did) or a dedicated repository, the reasons being:

  • We also have implementations or bindings for other languages:
    • There is a full C# implementation.
    • I have pieces of a C implementation somewhere, and I have received feedback from people willing to work on this.
    • We are about to have Python bindings thanks to your work.
  • Having a dedicated repository for the Python bindings allows us to make a README dedicated to Python users (with less Rust stuff), and we also keep the Rust implementation repository free from Python stuff, which most Rust users may not be looking for.
  • It also simplifies CI.
  • It's easier regarding ownership (you can become one of the maintainers of gxhash-py if you wish).

If you like this second option, please let me know. That would imply transferring this repo to a "gxhash" organization, renaming it gxhash-rs, then transferring gxhash-csharp and finally gxhash-py.

@winstxnhdw (Author) commented Nov 15, 2024 (edited)

> One remaining issue is that the hybrid flag does not build on ARM (I am on a MacBook M1) because this feature is reserved for x86 + nightly Rust. I changed the behavior to ignore the flag in such cases instead of failing to build. I also used the no_std crate since `Hasher` and other stuff is unnecessary here.

Ah, right. I'll be sure to handle that.

> I don't think it makes a lot of sense to expose an async version of gxhash. Async/await is usually reserved for network-bound or sometimes even disk-bound operations, so that a thread is not frozen while waiting for the response. Here, gxhash is purely CPU-bound. Suppose in a given context the hashing time is substantial enough that you think one may want to execute it asynchronously. In that case, it also means there is a substantial amount of CPU work delegated to a threadpool thread (from tokio or Python), which is not something you want (it may cause threadpool starvation). In such a case, what you want is to delegate the work to a dedicated background thread, and you wouldn't use async/await (instead, spawn a thread and send/consume work via a queue, for instance).

The reason for exposing an async version is that I'd imagine `gxhash` being used a lot in asynchronous contexts, more specifically in web servers. Running a synchronous function, regardless of whether it has dropped the GIL, within an async route handler will block the event loop from handling concurrent requests. If it takes 0.5s for a request to hash a file, that's 0.5s during which the server cannot respond to any other requests. I have heard other developers mention that even 100 microseconds of blocking is unacceptable. From how I see it, there's no difference between how I would handle a synchronous IO-bound task and a CPU-bound task: as long as neither saturates CPU or memory usage, I would still run it within a separate thread, which would still be susceptible to thread pool starvation.

I am not exactly a software-design expert on this topic, so if you still think it doesn't make sense, let's discuss this further. There are very few resources on this topic, especially in the context of Python, and I am always looking to broaden my understanding.
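
As a concrete sketch of the pattern argued for here (signatures assumed from the demo above, and only beneficial if gxhash releases the GIL), offloading the hash keeps the event loop responsive:

```python
import asyncio
from tempfile import TemporaryFile

from gxhash import gxhash64

async def handle_upload() -> int:
    with TemporaryFile() as file:
        file.write(bytes([42] * 1000))
        file.seek(0)
        # Run the CPU-bound hash on a worker thread so the event loop
        # can keep serving other requests in the meantime.
        return await asyncio.to_thread(gxhash64, file, 1234)

print(asyncio.run(handle_upload()))
```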

> It is correct; however, this documentation is on the `Hasher` trait implementation, which is a special Rust construct to build a hash piece by piece. The mention of "arbitrary stream of bytes" is actually a copy-paste from the official documentation of that trait: https://doc.rust-lang.org/std/hash/trait.Hasher.html

You also used the same description here. Also, do you have plans to implement larger-than-memory hashing? After confirming that gxhash-py is stable, I am thinking of adding two more variants: a streaming variant for larger-than-memory inputs, and a parallel variant that'll use `rayon` to hash chunks in parallel. The downside is that the parallel and streaming variants will not emit the same hash as the other variants given the same input. Not sure if you think this would be poor DX.

> If you like this second option, please let me know. That would imply transferring this repo to a "gxhash" organization, renaming it gxhash-rs, then transferring gxhash-csharp and finally gxhash-py.

I don't mind this, but personally, I think there's some beauty and credibility in having a monolith of a library with all the bindings in one place, similar to polars. It's a shame that you already made gxhash-csharp, since that was what I was going to propose next: using csbindgen, which can generate a Rust binding for .NET and Unity without runtime marshalling. I have no strong preference about how you do this, as long as it is clearly within the gxhash ecosystem, so either in an organisation or within this repository.

@ogxd (Owner) commented Nov 15, 2024 (edited)

> The reason for exposing an async version is that I'd imagine `gxhash` being used a lot in asynchronous contexts, more specifically in web servers. Running a synchronous function, regardless of whether it has dropped the GIL, within an async route handler will block the event loop from handling concurrent requests. If it takes 0.5s for a request to hash a file, that's 0.5s during which the server cannot respond to any other requests. I have heard other developers mention that even 100 microseconds of blocking is unacceptable. From how I see it, there's no difference between how I would handle a synchronous IO-bound task and a CPU-bound task: as long as neither saturates CPU or memory usage, I would still run it within a separate thread, which would still be susceptible to thread pool starvation.
>
> I am not exactly a software-design expert on this topic, so if you still think it doesn't make sense, let's discuss this further. There are very few resources on this topic, especially in the context of Python, and I am always looking to broaden my understanding.

Let's take your example:
We have a backend service handling a request, and at some point, for some reason, while processing that request it must hash some huge data held in memory (let's assume it's not on disk for now). This operation is purely CPU-bound.
Now imagine these four options:

  • The hash is computed synchronously. This means it will effectively block the event loop thread for that duration. We agree it's not ideal.
  • The hash is computed asynchronously via async/await. In Python, by default, this means the hash computation coroutine will be queued and scheduled at some point by the event loop, but still on the event loop thread. So it brings no improvement.
  • The hash is computed asynchronously via async/await, but this time you use a ThreadPoolExecutor to delegate this work to a pool of threads (called workers in this jargon). Here you're no longer blocking the event loop! However, you might have another issue: your operation will block a worker dedicated to the processing of async/await operations for the same duration. This is called thread pool starvation (not to be confused with thread starvation!).
  • The hash is computed asynchronously, but not using async/await. Instead, the work is simply delegated to a background thread dedicated to processing such heavy operations. This is the way to go (see the sketch after this list).
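
A minimal sketch of that last option, assuming the `gxhash64(file, seed)` signature from the demo: a single dedicated thread consumes hash jobs from a queue, so neither the event loop nor the async worker pool is ever blocked by hashing:

```python
import queue
import threading
from concurrent.futures import Future
from tempfile import TemporaryFile

from gxhash import gxhash64

jobs: queue.Queue = queue.Queue()

def hash_worker() -> None:
    # Dedicated background thread that absorbs all heavy hashing work.
    while True:
        future, file, seed = jobs.get()
        future.set_result(gxhash64(file, seed))
        jobs.task_done()

threading.Thread(target=hash_worker, daemon=True).start()

file = TemporaryFile()
file.write(bytes([42] * 1000))
file.seek(0)

future: Future = Future()
jobs.put((future, file, 1234))
# A sync caller blocks here; an async caller would await a wrapped future.
print(future.result())
```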

As a staff backend engineer I do have some experience with this kind of topic, but don't just take my word for it; there are resources online on this subject. Here are some:

> To eliminate ThreadPool starvation, ThreadPool threads need to remain unblocked so that they're available to handle incoming work items.

> Running such operations inside an async def endpoint will block the event loop; and hence, any further client requests will get blocked until the blocking operation is completed.

Now, if you read from the disk, or compute a hash block by block, then it's another story. I'd likely use async/await to read from the disk or to wait for incoming request bytes, and thus hash on the go. In that situation, each chunk hash operation isn't going to block as much as in our first example, and thus it might be acceptable to do it synchronously.

> You also used the same description here.

Oh indeed! We must change that 👍

> Also, do you have plans to implement larger-than-memory hashing?

In Rust, that's one of the purposes of the `Hasher` trait: you build a `Hasher`, you hash piece by piece, and then you "finalize" your hash. I'm not sure how easily this structure could be mapped to a Python class using this bindgen method, however.

@winstxnhdw (Author) commented Nov 17, 2024 (edited)

> The hash is computed asynchronously via async/await, but this time you use a ThreadPoolExecutor to delegate this work to a pool of threads (called workers in this jargon). Here you're no longer blocking the event loop! However, you might have another issue: your operation will block a worker dedicated to the processing of async/await operations for the same duration. This is called thread pool starvation (not to be confused with thread starvation!).

Thank you for taking the time to write something up. I am still confused: can only CPU-bound functions cause thread pool starvation? In Python, not all IO-bound functions have an async variant. For example, `open` is used to load a file from disk synchronously:

withopen("model.bin","rb")asf:file_bytes=f.read()

Performing such an operation would block the event loop, and one way to avoid that would be to run `open` in another thread. Would this not suffer from thread pool starvation as well? If this were the only way to avoid blocking the event loop, would you recommend pushing this to a task queue?
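
For example, the blocking read above can be pushed to the default executor; this unblocks the event loop but, as noted, still occupies a pool worker for the duration of the read:

```python
import asyncio

def read_model() -> bytes:
    # The same synchronous, blocking read as above.
    with open("model.bin", "rb") as f:
        return f.read()

async def main() -> None:
    loop = asyncio.get_running_loop()
    # The read happens on a thread pool worker instead of the event
    # loop thread, so other coroutines keep running meanwhile.
    file_bytes = await loop.run_in_executor(None, read_model)
    print(len(file_bytes))

asyncio.run(main())
```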

In any case, it would be best to remove the `async` variants because they require a blocking and really expensive copy step to convert `&[u8]` to `Vec<u8>`.

Also, polars has a CPU-bound API that they made async similarly to how I did it:
https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect_async.html
https://github.com/pola-rs/polars/blob/54218e7e35e3defd4b0801e820c56eea6b91e525/crates/polars-python/src/lazyframe/general.rs#L595

> Oh indeed! We must change that 👍

I will wait for you to change the description and then align my docstrings with yours. Also, I am wondering if you plan to push this line to main?

> In Rust, that's one of the purposes of the `Hasher` trait: you build a `Hasher`, you hash piece by piece, and then you "finalize" your hash. I'm not sure how easily this structure could be mapped to a Python class using this bindgen method, however.

I think Python generators would work well here.
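
A hedged sketch of how that might look; `GxHasher`, `update`, and `finish` are hypothetical names mirroring Rust's `Hasher` trait, not part of this PR:

```python
from typing import Iterator

from gxhash import GxHasher  # hypothetical class, does not exist yet

def chunks(path: str, size: int = 1 << 20) -> Iterator[bytes]:
    # Generator yielding the file one chunk at a time, so the whole
    # input never has to fit in memory.
    with open(path, "rb") as f:
        while chunk := f.read(size):
            yield chunk

hasher = GxHasher(seed=1234)  # hypothetical constructor
for chunk in chunks("model.bin"):
    hasher.update(chunk)      # hash piece by piece
print(hasher.finish())        # finalize, as with Rust's Hasher
```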

@mqudsi commented

> Now, if you read from the disk, or compute a hash block by block, then it's another story. I'd likely use async/await to read from the disk or to wait for incoming request bytes, and thus hash on the go. In that situation, each chunk hash operation isn't going to block as much as in our first example, and thus it might be acceptable to do it synchronously.

I would humbly disagree, or at least say that it would only be acceptable with some caveats. This approach would improve p9x latency by blocking the event loop for a shorter amount of time per call, but the total time the event loop remains "unavailable" to handle events would remain the same (actually, it would increase because of the function call and FFI overhead, combined with the decreased gxhash-side performance of hashing smaller chunks at a time), thereby reducing total throughput.

In my opinion, though, this isn't something the library itself should necessarily manage (though I am not a Pythonista, and I am talking from the general perspective taken in async Rust or async C#, both of which I am considerably more familiar with), only because of the number of factors and permutations that have to be taken into account, all of which the downstream caller integrating gxhash into their project would know more about.

The subtleties here lie greatly in the specific application. If you are hashing content that fits comfortably in memory without causing GC issues, or if the final content is going to be stored in memory regardless (i.e. is not being streamed to disk or network), then it might make more sense (given how fast gxhash itself is) to buffer the entirety of the content and then make one threadpool-backed call to gxhash that doesn't block the event loop, runs at max gxhash performance, and minimizes the function call/FFI overhead. But if you are working on streaming content that won't be stored (contiguously!) in memory thereafter, then you're obviously going to have to call gxhash on some smaller-sized slice at a time. In that case, whether this should be done on a threadpool thread or directly from the async routine is a much more complicated question that involves knowledge of the size of the input (i.e. if you are hashing 16 bytes at a time, the threadpool overhead, context switches, cache misses, etc. are going to exceed the benefits of not blocking the event loop), while if you're processing decently-sized chunks at a time, it might make sense to spin up a threadpool task to handle the CPU-intensive work.
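
One way to encode that tradeoff is a size threshold below which offloading costs more than it saves; a sketch with an arbitrary, unmeasured cutoff and the `gxhash64(file, seed)` signature from the demo:

```python
import asyncio

from gxhash import gxhash64

# Illustrative only: the right cutoff has to be measured per workload.
OFFLOAD_THRESHOLD = 64 * 1024

async def hash_maybe_offloaded(file, size: int, seed: int) -> int:
    if size < OFFLOAD_THRESHOLD:
        # Tiny input: one FFI call is cheaper than a thread hop.
        return gxhash64(file, seed)
    # Large input: pay the threadpool overhead to keep the loop free.
    return await asyncio.to_thread(gxhash64, file, seed)
```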

But I don't know what the accepted level of "handholding" is in the Python modules community (or even if Python devs are interested in optimizations at this level to begin with, though what I wrote above assumes at least some do).

@winstxnhdw (Author) commented Nov 19, 2024 (edited)

> But I don't know what the accepted level of "handholding" is in the Python modules community (or even if Python devs are interested in optimizations at this level to begin with, though what I wrote above assumes at least some do).

I have always believed that good API design should include sane defaults while also giving the user the option to build their own specialised solution from scratch, and FWIW, I care deeply about such optimisations. I also think providing the user with an `async` variant is better DX, but I can also see that it may mislead users into believing that the `async` variant is the most optimal for their use case.

> In that case, whether this should be done on a threadpool thread or directly from the async routine is a much more complicated question that involves knowledge of the size of the input (i.e. if you are hashing 16 bytes at a time, the threadpool overhead, context switches, cache misses, etc. are going to exceed the benefits of not blocking the event loop)

I am not sure what could be much worse than blocking the event loop. When the event loop is blocked, no one can load your web page, your Prometheus service will no longer be able to scrape for metrics, and your monitoring service sends an alert to everyone that your service is down. Of course, the other option is for users to push it to a task queue, but on a sufficiently large cluster with a known number of users, is there really a need to force the user to pay the development complexity and overhead cost of a task queue?

> In that situation, each chunk hash operation isn't going to block as much as in our first example, and thus it might be acceptable to do it synchronously.

Also, in Python, all synchronous functions are blocking. This means that hashing every chunk synchronously would still block the event loop until all chunks have been completely hashed. Of course, this is not the case if you are streaming the bytes in from the web or a socket.
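
For the chunked case, the loop has to yield explicitly between chunks to avoid exactly that; a sketch with a placeholder synchronous `hash_chunk` (any incremental hashing step would slot in here):

```python
import asyncio
from typing import Callable, Iterable, List

async def hash_chunks_cooperatively(
    chunks: Iterable[bytes],
    hash_chunk: Callable[[bytes], int],
) -> List[int]:
    results = []
    for chunk in chunks:
        # Each call still blocks, but only for one chunk at a time.
        results.append(hash_chunk(chunk))
        # Hand control back to the event loop between chunks so other
        # coroutines get a chance to run.
        await asyncio.sleep(0)
    return results
```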

@mqudsi commented Nov 19, 2024 (edited)

> I am not sure what could be much worse than blocking the event loop.

The implication is that you are still blocking the event loop while it does the tasks I mentioned, e.g. one heavily optimized FFI call to gxhash to hash a 4-byte input might actually block the event loop less than the bookkeeping that same event loop thread has to do (synchronously) to send work to, and retrieve the result from, an off-thread worker for such a low-latency operation. And if you expand the scope past microbenchmarks, you need to take into account whether or not the "other thread" will cohabit a core that's already servicing another event loop for your application, the thrashing of multiple threads' L1 cache lines, etc.

But again, this is only generally speaking for typical async event loops, without possessing arcane knowledge of the minutiae of Python scheduler overhead, CPU core selection, etc., as that is quite outside my wheelhouse.

@winstxnhdw (Author) commented

In the context of Python, once you cross the FFI boundary and drop the GIL, everything from then on is effectively non-blocking, unless you mean that the hashing operation saturates the CPU.

