Following the discussion regarding therefactor idea here, I have re-evaluated the approach. Upon reviewing the current implementations for activation, unary, and binary functions, I observed that the existing code heavily relies on L2 prefetching while under-utilizing the VTCM and DMA.

While L2 prefetching offers some benefits, it is limited by two main factors:

The L2 cache is significantly smaller than the VTCM.
The L2 cache is shared between L1 program instructions and the L1 data cache.

This PR optimizes the code (using GELU as the initial implementation) by shifting the workload to heavily utilize the VTCM and DMA, thereby freeing up the L2 cache for L1 instruction and data cache

Optimization Strategy

Instead of relying on L2 prefetching, this implementation employs aDMA ping-pong buffering approach:

Fetch: DMA moves data from DDR to VTCM (buffer A).
Compute: DSP processes data in VTCM (buffer B) while DMA fetches the next chunk into buffer A.
Write back :DMA moves processed data from VTCM back to DDR.

This allows for overlapping computation and memory transfer, resulting in significantly higher throughput.

Performance Benchmarks

The performance improvements are significant. Below is a comparison between the existing implementation and the new DMA/VTCM approach:

Dimensions	Old Implementation (L2 Prefetch)	New Implementation (DMA/VTCM)
4096 x 4096	~5000 µs	~3800 µs
4096 x 4304	~5300 µs	~3700 µs

NOTE: I used the GELU as an example, but this approach can easily extend to other operations.

Unaligned Load Resolution:
This approach inherently solves the unaligned load issues encountered previously. Since data is fetched from DDR via DMA, the DMA engine ensures that data is stored into aligned addresses within the VTCM, even if the source data in DDR is unaligned.

@max-krasnyansky

joeldushouyu added3 commits

December 17, 2025 16:07

feat: working gelu with src0 put on vtcm

8f1e0e0

feat: gelu ping-pong for both in and out

8fd0573

fix: fixu compile error

e875809

Copy link

ContributorAuthor

joeldushouyu commentedDec 17, 2025

Obviously, I need to do more cleanup for this pr. But if you think this approach is good, I can probably apply the approach to all activation, unary(mul, add, sub).. and similar functions? I think this should bring some improvement for all models, especially for long context

joeldushouyu changed the title~~Hexagon gelu optimization~~ggml-hexagon: gelu optimization

Dec 17, 2025

loci-dev mentioned this pull request

Dec 17, 2025

UPSTREAM PR #18151: ggml-hexagon: gelu optimizationauroralabs-loci/llama.cpp#609

Open

github-actionsbot added the ggmlchanges relating to the ggml tensor library for machine learning label

Dec 18, 2025

chraac reviewed

Dec 18, 2025

View reviewed changes

ggml/src/ggml-hexagon/htp/main.c Outdated

		ctx->n_threads=n_hvx;
		for (inti=0;i<ctx->n_threads;i++) {
		ctx->dma[i]=dma_queue_create(HTP_SPAD_SRC0_NROWS*2);
		ctx->dma[i]=dma_queue_create(ctx->vtcm_size);//NOTE: for now, whole vtcm size

Copy link

Contributor

chraacDec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nit: looks thecapacity here is for the max dma queue depth, which in mum_mat case is theHTP_SPAD_SRC0_NROWS * 2 for each thread. Its not the vtcm size for each thread I think...

Copy link

ContributorAuthor

joeldushouyuDec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes. But with this pr, each thread could theoretically now fetch up to the whole vtcm size. Thus, I set it to the size of VTCM instead

Copy link

Contributor

chraacDec 18, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thought theoretically its not size in bytes, but the count of row that gonna fetch to vtcm tho....

Copy link

ContributorAuthor

joeldushouyuDec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Actually, as I take another look, it is not even insize of byte orcount of row. It seems to me that the maximum size of the DMA-linked list that it can support. Maybe I should set to like total_thread*(4*2) 4 stands for number of input/output and 2 for pingpong

Copy link

Collaborator

max-krasnyanskyDec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's a number of DMA descriptors, aka a number of DMA transactions that can be enqueued at a time.
The DMA engine is per thread i.e each HW/SW thread literally has its own DMA HW state.
So there is no need to include the number if threads in the sizing.
Also the size of the queue doesn't affect the performance it's just an array of preallocated descriptors.
So we can just set it to something reasonable like 64 (ie 64 outstanding transactions per thread), add an error message for full queue for the debug builds, and that should be good enough.

Copy link

ContributorAuthor

joeldushouyuDec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

seef3e9f6a

break: distinguish dma ddr->vtcm and vtcm->ddr operation

54f7bbd

Copy link

Collaborator

max-krasnyansky commentedDec 18, 2025

Good timing.
We just had an internal discussion on the L2-prefetch vs DMA/VTCM and the conclusion is that using DMA/VTCM should better in most scenarios. And yes, it'd naturally fix most of not all alignment issues.
I didn't get a chance to read through the changes yet but overall it's step in the right direction.

Copy link

ContributorAuthor

joeldushouyu commentedDec 18, 2025

Good timing. We just had an internal discussion on the L2-prefetch vs DMA/VTCM and the conclusion is that using DMA/VTCM should better in most scenarios. And yes, it'd naturally fix most of not all alignment issues. I didn't get a chance to read through the changes yet but overall it's step in the right direction.

Thanks for confirming!

Should I go ahead and apply the same optimization on all other functions, or wait until you have a thorough look at it and fix the code format/style before applying it to other functions?

fix: fix dma queue size

f3e9f6a

Labels

ggml

changes relating to the ggml tensor library for machine learning

Movatterモバイル変換

ggml-hexagon: gelu optimization#18151

Are you sure you want to change the base?

ggml-hexagon: gelu optimization#18151

Conversation

joeldushouyu commentedDec 17, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Optimization Strategy

Performance Benchmarks

Uh oh!

joeldushouyu commentedDec 17, 2025

Uh oh!

chraacDec 18, 2025

Choose a reason for hiding this comment

Uh oh!

joeldushouyuDec 18, 2025

Choose a reason for hiding this comment

Uh oh!

chraacDec 18, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joeldushouyuDec 18, 2025

Choose a reason for hiding this comment

Uh oh!

max-krasnyanskyDec 18, 2025

Choose a reason for hiding this comment

Uh oh!

joeldushouyuDec 18, 2025

Choose a reason for hiding this comment

Uh oh!

max-krasnyansky commentedDec 18, 2025

Uh oh!

joeldushouyu commentedDec 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joeldushouyu commentedDec 17, 2025•
edited
Loading

chraacDec 18, 2025•
edited
Loading