ggml-hexagon: Add lightweight atomic synchronization support to htp_ops_context for inter-task coordination #18113


Open

ngdxzy wants to merge 1 commit into ggml-org:master from ngdxzy:atomic_sync

Conversation

@ngdxzy commented Dec 16, 2025 (edited)

Background

The current ggml-hexagon backend uses a worker pool to launch user-defined tasks such as quantization and matrix multiplication. These worker threads are pre-created and execute independently, and the framework currently provides no synchronization primitives that can be safely used inside user task callbacks.

As a result:

  1. User callbacks cannot coordinate or exchange state with one another
  2. Future optimizations that require staged execution, pipelining, or shared intermediate state are difficult to implement
  3. Data sharing between worker tasks is currently not possible

What this PR proposes

This PR explores adding a minimal atomic synchronization mechanism to the existing framework by introducing a shared atomic variable in htp_ops_context. This mechanism enables basic coordination (such as “all quant jobs finished”) while preserving the current worker pool design and execution model.

With this minor change, together with previous work (a thread id is now provided to the worker function), we can program the NPU almost like a SIMT architecture.
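
As a rough illustration of the kind of coordination this enables, the sketch below implements an “all quant jobs finished” barrier inside a worker callback using a single shared atomic counter. The struct layout, field names (sync_counter, n_quant_jobs), and callback signature are hypothetical and not taken from the actual ggml-hexagon code; C11 atomics are used only as a stand-in for whatever primitive the backend ultimately exposes on the HTP side.

```c
// Illustrative sketch only: names and signatures are hypothetical,
// not the real ggml-hexagon API.
#include <stdatomic.h>
#include <stdint.h>

struct htp_ops_context_sketch {
    // ... existing fields ...
    atomic_uint sync_counter;   // shared atomic added to the context
    uint32_t    n_quant_jobs;   // total quantization jobs for this op
};

// Worker callback: each thread quantizes its slice of src1, bumps the
// counter, then waits until every quant job is done before starting
// the next stage, without tearing down the worker pool.
static void worker_task_sketch(struct htp_ops_context_sketch * ctx, uint32_t tid) {
    // ... quantize the chunk assigned to thread `tid` ...

    atomic_fetch_add_explicit(&ctx->sync_counter, 1, memory_order_release);

    // spin until the counter shows that all quant jobs have finished
    while (atomic_load_explicit(&ctx->sync_counter, memory_order_acquire) < ctx->n_quant_jobs) {
        /* spin */
    }

    // ... all quant jobs finished: staged matmul work can begin here ...
}
```

The point of the sketch is that a single shared counter, combined with the per-thread id already provided to the worker function, is enough for staged execution: the same worker pool can run the quantization stage and the matmul stage back to back, with the barrier separating them.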

Motivation

In the current design, multi-precision matrix multiplication requires the entire quantized src1 tensor to be stored in VTCM. This imposes a hard limit on the problem size that can be handled by the MM kernel.

Since src1 typically corresponds to the hidden states in an LLM, this effectively constrains the maximum context length that can be executed on the NPU.

If the proposed atomic synchronization mechanism is accepted, it would enable more flexible execution patterns and staged processing, allowing VTCM to be used more efficiently. This opens the door to follow-up work that reduces VTCM pressure and relaxes the current context-length limitations without major changes to the existing framework.

Request for Feedback

I would appreciate feedback on:

  1. Whether exposing a shared atomic in htp_ops_context is acceptable
  2. Whether this aligns with the intended direction of the worker pool design
  3. Suggestions for alternative lightweight synchronization mechanisms

If this approach is considered acceptable, I will follow up with a separate commit to remove the concept-demonstration logic currently added in matmul-ops.c, leaving only the minimal infrastructure changes required to support synchronization.


Reviewers

@max-krasnyansky (code owner): awaiting requested review

@lhez (code owner): awaiting requested review

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

ggml: changes relating to the ggml tensor library for machine learning

Projects

None yet

Milestone

No milestone


1 participant: @ngdxzy
