ggml-hexagon: Add lightweight atomic synchronization support to htp_ops_context for inter-task coordination #18113
Open
ngdxzy wants to merge 1 commit into ggml-org:master from ngdxzy:atomic_sync
+7 −0
Background:
The current ggml-hexagon backend uses a worker pool to launch user-defined tasks such as quantization and matrix multiplication. These worker threads are pre-created and execute independently, and the framework currently provides no synchronization primitives that can be safely used inside user task callbacks.
As a result:
What this PR proposes
This PR explores adding a minimal atomic synchronization mechanism to the existing framework by introducing a shared atomic variable in htp_ops_context. This mechanism enables basic coordination (such as “all quant jobs finished”) while preserving the current worker pool design and execution model.
With this minor change, together with previous work (the thread id is provided to the worker function), we can program the NPU almost like a SIMT architecture.
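To make the "all quant jobs finished" pattern concrete, here is a minimal host-side sketch using C11 atomics and pthreads. It is not the code in this PR: demo_ops_context, demo_task, demo_worker and demo_run are hypothetical names, the struct only stands in for htp_ops_context, and the "quantization" is a plain cast. Each worker quantizes its own slice, bumps a shared counter, waits until the counter reaches the worker count, and only then reads slices written by other workers.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_N_WORKERS 4
#define DEMO_N_ELEMS   64

/* Hypothetical stand-in for htp_ops_context; only the shared atomic
 * counter mirrors the idea proposed in this PR. */
struct demo_ops_context {
    atomic_uint quant_done;               /* workers that finished stage 1 */
    float       src[DEMO_N_ELEMS];
    int8_t      quant[DEMO_N_ELEMS];
    int32_t     partial[DEMO_N_WORKERS];
};

struct demo_task {
    struct demo_ops_context *ctx;
    unsigned                 ith;         /* worker (thread) id */
};

static void *demo_worker(void *arg) {
    struct demo_task        *t     = arg;
    struct demo_ops_context *ctx   = t->ctx;
    const unsigned           chunk = DEMO_N_ELEMS / DEMO_N_WORKERS;

    /* Stage 1: "quantize" this worker's contiguous slice. */
    for (unsigned i = t->ith * chunk; i < (t->ith + 1) * chunk; i++) {
        ctx->quant[i] = (int8_t) ctx->src[i];
    }

    /* Publish the slice: release pairs with the acquire load below. */
    atomic_fetch_add_explicit(&ctx->quant_done, 1, memory_order_release);

    /* Wait until every worker has published its slice. */
    while (atomic_load_explicit(&ctx->quant_done, memory_order_acquire) < DEMO_N_WORKERS) {
        sched_yield();
    }

    /* Stage 2: strided read across slices written by all workers. */
    int32_t sum = 0;
    for (unsigned i = t->ith; i < DEMO_N_ELEMS; i += DEMO_N_WORKERS) {
        sum += ctx->quant[i];
    }
    ctx->partial[t->ith] = sum;
    return NULL;
}

/* Runs both stages on DEMO_N_WORKERS threads and returns the total.
 * Error handling for pthread_create is omitted for brevity. */
static int32_t demo_run(void) {
    struct demo_ops_context ctx;
    struct demo_task        tasks[DEMO_N_WORKERS];
    pthread_t               threads[DEMO_N_WORKERS];

    atomic_init(&ctx.quant_done, 0);
    for (unsigned i = 0; i < DEMO_N_ELEMS; i++) {
        ctx.src[i] = (float) (i % 8);
    }
    for (unsigned i = 0; i < DEMO_N_WORKERS; i++) {
        tasks[i].ctx = &ctx;
        tasks[i].ith = i;
        pthread_create(&threads[i], NULL, demo_worker, &tasks[i]);
    }

    int32_t total = 0;
    for (unsigned i = 0; i < DEMO_N_WORKERS; i++) {
        pthread_join(threads[i], NULL);
        total += ctx.partial[i];
    }
    return total;
}
```

The counter is single-use here. Reusing it across several stages would need a reset or a generation count, which this sketch leaves out.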
Motivation
In the current design, multi-precision matrix multiplication requires the entire quantized src1 tensor to be stored in VTCM. This imposes a hard limit on the problem size that can be handled by the MM kernel.
Since src1 typically corresponds to the hidden states in an LLM, this effectively constrains the maximum context length that can be executed on the NPU.
If the proposed atomic synchronization mechanism is accepted, it would enable more flexible execution patterns and staged processing, allowing VTCM to be used more efficiently. This opens the door to follow-up work that reduces VTCM pressure and relaxes the current context-length limitations without major changes to the existing framework.
Request for Feedback
I would appreciate feedback on:
htp_ops_context is acceptable
If this approach is considered acceptable, I will follow up with a separate commit to remove the concept-demonstration logic currently added in matmul-ops.c, leaving only the minimal infrastructure changes required to support synchronization.