In #2929, we added RISC-V intrinsics for the dot product functions in GGML. This PR improves those existing dot product functions in ggml.c and adds new RISC-V vector (RVV) intrinsics for the k_quants and row-quantize (Q8_0 and Q8_1) functions. With this, llama.cpp fully supports running GGUF models on a RISC-V vector processor.

In the future, this will enable GGML and llama.cpp to run efficiently on RISC-V hardware with vector support, and it opens a way to compare performance against other vector extensions such as Intel AVX and Arm Neon.

Update: I got access to a RISC-V vector board with 8 cores and 4 GB of RAM. The vector version is 6-7 times faster than the scalar version on the same board.

Running a llama.cpp AI model on RVV 1.0 vs. RISC-V scalar

RISC-V vector intrinsics support is added for the following K_quants functions, for both the QKK = 256 and QKK = 64 block sizes:

    ggml_vec_dot_q2_K_q8_K
    ggml_vec_dot_q3_K_q8_K
    ggml_vec_dot_q4_K_q8_K
    ggml_vec_dot_q5_K_q8_K
    ggml_vec_dot_q6_K_q8_K

RVV intrinsics are also added for the following Q8 row-quantize functions:

    quantize_row_q8_0
    quantize_row_q8_1
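For readers unfamiliar with what a Q8 row-quantize function computes, here is a minimal scalar sketch of Q8_0-style quantization: each block of 32 floats is scaled by its largest absolute value so that it fits in int8. The names `block_q8_0_sketch` and `quantize_row_q8_0_sketch` are hypothetical, and the scale is kept as a plain `float` for simplicity (ggml stores it in half precision), so this illustrates the idea rather than the code in this PR.

```c
#include <stdint.h>

#define QK8_0 32  /* elements per block in the Q8_0 format */

/* Simplified block layout (assumption: float scale instead of fp16). */
typedef struct {
    float  d;          /* per-block scale */
    int8_t qs[QK8_0];  /* quantized values */
} block_q8_0_sketch;

/* Quantize k floats (k must be a multiple of QK8_0) into k / QK8_0 blocks. */
void quantize_row_q8_0_sketch(const float *x, block_q8_0_sketch *y, int k) {
    const int nb = k / QK8_0;
    for (int i = 0; i < nb; i++) {
        /* Find the largest magnitude in the block. */
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; j++) {
            const float v  = x[i * QK8_0 + j];
            const float av = v < 0.0f ? -v : v;
            if (av > amax) amax = av;
        }
        /* Scale so that amax maps to 127; guard against an all-zero block. */
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK8_0; j++) {
            const float s = x[i * QK8_0 + j] * id;
            /* Round half away from zero, then narrow to int8. */
            y[i].qs[j] = (int8_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
        }
    }
}
```

A vectorized version would do the max-reduction and the scaling across many lanes at once, which is the part the RVV intrinsics target.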

The following dot product functions have also been optimized by using a fractional LMUL (1/2) instead of LMUL = 1. I am a little skeptical of this change: it works correctly, but I have noticed some decrease in inference accuracy, which could be a problem with my system or weights. I still prefer to keep it, since it uses far fewer vector registers after the product:

    ggml_vec_dot_q4_0_q8_0
    ggml_vec_dot_q4_1_q8_1
    ggml_vec_dot_q5_0_q8_0
    ggml_vec_dot_q5_1_q8_1
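The register argument for fractional LMUL can be checked with a little arithmetic. In RVV, the maximum number of elements per operation is VLMAX = (VLEN / SEW) * LMUL, and a widening operation produces a result with twice the LMUL of its inputs. The helper names below are hypothetical and exist only to work through those two formulas; they are not part of ggml.

```c
#include <stddef.h>

/* VLMAX = (VLEN / SEW) * LMUL, with LMUL passed as the fraction num/den. */
size_t rvv_vlmax(size_t vlen_bits, size_t sew_bits,
                 size_t lmul_num, size_t lmul_den) {
    return (vlen_bits / sew_bits) * lmul_num / lmul_den;
}

/* Whole vector registers occupied by the result of a widening operation,
 * whose effective LMUL is 2 * LMUL (rounded up to at least one register). */
size_t rvv_widened_regs(size_t lmul_num, size_t lmul_den) {
    return (2 * lmul_num + lmul_den - 1) / lmul_den;
}
```

With VLEN = 256 and 8-bit elements, `rvv_vlmax(256, 8, 1, 2)` gives 16 lanes, which matches the 16 bytes handled per step for a 32-element block, while `rvv_vlmax(256, 8, 1, 1)` gives 32. After a widening multiply, `rvv_widened_regs(1, 2)` is 1 register versus 2 for `rvv_widened_regs(1, 1)`. Note that at VLEN = 128 the same LMUL = 1/2 setting yields only 8 lanes.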

Finally, the vector initialization in Q5 from a temporary array is replaced by the vid_v intrinsic.

[Compilation]
Ubuntu: 22.10
riscv-toolchain: 2023.07.05 riscv64 linux glibc

To compile for RISC-V, run:

$   make   llama-cli                   # For RISC-V CPU

$   make clean
$   make   RISCV_CROSS_COMPILE=1       # For Cross Compilation only

[Directly on RISC-V CPU]

$   ./llama-cli -m ./path/to/model.gguf -p "Anything" -n 50

[QEMU]

$   qemu-riscv64 -L /path/to/sysroot/  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m ./path/to/model.gguf -p "Anything" -n 50

Note: Running under the QEMU emulator can be very slow and may take 2-5 minutes per token.

Any feedback is welcome. If you have suggestions or improvements, especially for the fractional LMUL change, please share.

Thanks!

Tameem-10xE added 2 commits

October 2, 2023 13:46

Added RVV intrinsics support for Q8 quantize row and also improved the existing dot product function for risc-v.

dd13a1b

The RVV intrinsics is added for the following quantize row functions
   quantize_row_q8_0
   quantize_row_q8_1

The following dot product functions have also been optimized by using LMUL = 1/2 instead of LMUL = 1
   ggml_vec_dot_q4_0_q8_0
   ggml_vec_dot_q4_1_q8_1
   ggml_vec_dot_q5_0_q8_0
   ggml_vec_dot_q5_1_q8_1

And vector initialization in Q5 by temporary array is also replaced by the vid intrinsics

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

Added RVV intrinsics support for k_quants

f6883a7

This adds RISC-V Vector intrinsics support for the following K_quants functions for both QKK = 256 and QKK = 64
   ggml_vec_dot_q2_K_q8_K
   ggml_vec_dot_q3_K_q8_K
   ggml_vec_dot_q4_K_q8_K
   ggml_vec_dot_q5_K_q8_K
   ggml_vec_dot_q6_K_q8_K

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

Tameem-10xE force-pushed the llama-rvv branch from 388a59a to f6883a7

October 3, 2023 12:17

ggerganov approved these changes

Oct 3, 2023


ggerganov merged commit 79f34ab into ggml-org:master

Oct 3, 2023

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request

Oct 5, 2023

Merge branch 'master' of github.com:ggerganov/llama.cpp into grammar-example

94daebe

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggml-org#3299)
  readme : add project status link
  ggml : fix build after ggml-org#3329
  llm : add Refact model (ggml-org#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggml-org#3468)
  finetune : readme fix typo (ggml-org#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggml-org#3453)
  main : consistent prefix/suffix coloring (ggml-org#3425)
  llama : fix session saving/loading (ggml-org#3400)
  llama : expose model's rope_freq_scale in the API (ggml-org#3418)
  metal : alibi for arbitrary number of heads (ggml-org#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggml-org#3273)
  Work on the BPE tokenizer (ggml-org#3252)
  convert : fix vocab size when not defined in hparams (ggml-org#3421)
  cmake : increase minimum version for add_link_options (ggml-org#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggml-org#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggml-org#3408)
  gguf : general usability improvements (ggml-org#3409)
  cmake : make CUDA flags more similar to the Makefile (ggml-org#3420)
  finetune : fix ggml-org#3404 (ggml-org#3437)
  ...

yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request

Oct 7, 2023

ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggml-org#3453)

b6d62ea

* Added RVV intrinsics support for Q8 quantize row and also improved the existing dot product function for risc-v.

The RVV intrinsics is added for the following quantize row functions
   quantize_row_q8_0
   quantize_row_q8_1

The following dot product functions have also been optimized by using LMUL = 1/2 instead of LMUL = 1
   ggml_vec_dot_q4_0_q8_0
   ggml_vec_dot_q4_1_q8_1
   ggml_vec_dot_q5_0_q8_0
   ggml_vec_dot_q5_1_q8_1

And vector initialization in Q5 by temporary array is also replaced by the vid intrinsics

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

* Added RVV intrinsics support for k_quants

This adds RISC-V Vector intrinsics support for the following K_quants functions for both QKK = 256 and QKK = 64
   ggml_vec_dot_q2_K_q8_K
   ggml_vec_dot_q3_K_q8_K
   ggml_vec_dot_q4_K_q8_K
   ggml_vec_dot_q5_K_q8_K
   ggml_vec_dot_q6_K_q8_K

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

---------

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

Tameem-10xE deleted the llama-rvv branch

October 10, 2023 09:03

Tameem-10xE mentioned this pull request

Feb 8, 2024

[ERROR] Futex facility returned an unexpected error code riscv-software-src/riscv-isa-sim#1443

Closed

Tameem-10xE mentioned this pull request

Mar 4, 2024

[GGML] Added RISC-V Vector Intrinsics Support #2929

Merged

grigohas commented Jul 10, 2024

Hello, I am doing what you suggested and I have results. I have two questions. When I want to run it without the vector processor in QEMU, what command do I have to run? Also, how can I check that the two runs are different and that the one with the vector processor is working as I wanted? Sorry, I am new to this.

Tameem-10xE (Contributor, Author) commented Jul 10, 2024

Hi, to run the scalar (CPU-only) version, provide the path to the RISC-V toolchain and then use QEMU:

make llama-cli CC="riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d" CXX="riscv64-unknown-linux-gnu-g++ -march=rv64gc -mabi=lp64d"

qemu-riscv64 -L /path/to/sysroot/  -cpu rv64 ./llama-cli -m ./path/to/model.gguf -p "Anything" -n 100

You can set the seed to get the same results, e.g. llama-cli -s (some seed number) ...

More details: RVV article
Also, this PR is old and many things have changed, such as main -> llama-cli.

Thank you

grigohas commented Jul 10, 2024

Yes, I read this article, but when I run the make command you provided, I get a "march=native" error. From what I found in the Makefile, I have to set RISCV_CROSS_COMPILE=1 RISCV=1.

Tameem-10xE commented Jul 10, 2024 (edited)

Sorry, yes. I just noticed that the Makefile has been reordered and RISCV=1 is required in the current version.

Tameem-10xE commented Jul 10, 2024 (edited)

After line 432 in the Makefile, replace the vector flags with the scalar ones, i.e.:

MK_CFLAGS += -march=rv64gc -mabi=lp64d
MK_CXXFLAGS += -march=rv64gc -mabi=lp64d

then build with the command below and run under QEMU as before:

make llama-cli RISCV=1 CC="riscv64-unknown-linux-gnu-gcc" CXX="riscv64-unknown-linux-gnu-g++"

grigohas commented Jul 10, 2024

Okay, one last question. I use the same seed and I have results both with and without vector, but the only difference in the log output is the reported time: with vector it is 2-2.5x longer than without. Is that correct?

Tameem-10xE commented Jul 10, 2024

Yes, on QEMU the vector emulation is much slower. I do not know the actual reason; it could be that QEMU has to emulate the vector unit in addition to the scalar core, or an issue with parallel processing, and the log also uses real (wall-clock) time for the comparison. This should not be the case on an actual RISC-V vector board.

grigohas commented Sep 11, 2024

Hello again, I am running llama with the vector extension on gem5, but since there is nothing in the log to check whether the vector extension is enabled, how do I know?

Tameem-10xE commented Sep 11, 2024

Hi, I've submitted a PR (#9442) that prints RISCV_VECT=1 on the terminal if the vector processor is found. I also slightly changed the Makefile so that it no longer requires a flag for RISC-V vector boards; only RISCV_CROSS_COMPILE=1 is needed for the emulator (i.e. QEMU).

The following is the output from the RISC-V BPI-F3 board with vector support:
...

...

...

...

grigohas commented Nov 26, 2024

Hello, I have a question: why is the load time lower when RVV is enabled? Load time is the time taken to load the model, right? How does RVV affect it?

Tameem-10xE commented Nov 27, 2024 (edited)

Hi, that was due to quantization happening while the weights are loaded into memory. Not all weights are quantized ahead of time; some are quantized during loading (see the function quantize_row_q8_0_reference in the GGML quants.c file, or quantize_row_q4_K). Enabling auto-vectorization can also affect the load time.

grigohas commented Nov 28, 2024

Okay, but why does quantization affect the loading time? Can you explain what changes when RVV is enabled?

Tameem-10xE commented Nov 28, 2024

During the model's initialization phase, there could be several reasons, such as compiler auto-vectorization of memory operations, recalculation, decompression, and data alignment. However, during benchmarking, I noticed that the biggest change came from the quantize_row function, since it performs some computations before the weights are fully loaded into memory.

grigohas commented Nov 28, 2024

Okay, so when RVV is enabled, what happens with the quantize_row function that makes the load time lower? Also, if the model is already quantized, is there any change?

grigohas commented Nov 29, 2024 (edited)

When you enable RVV, does the quantize_row function you mentioned, the one performing computations, not execute?
@Tameem-10xE

Tameem-10xE commented Nov 30, 2024 (edited)

Sorry, I last worked on this project a year ago and did not go deeper into how it works beyond what I could quantify. I might be mistaken or unclear about which functions affect the load time (also, many things have changed). I think you should ask in the GitHub discussions or on Discord, if they have one. I also noticed a reduction in load time on x86 with vector instructions (~5 times), and I think the most probable cause is compiler auto-vectorization. If not, they may be able to provide the exact reason.

grigohas commented Dec 2, 2024 (edited)

Okay, thank you very much. Are the weights quantized differently in scalar mode than with RVV on? I mean, does it make sense to compare scalar results against RVV results, or do they follow a different loading procedure?

Tameem-10xE commented Dec 2, 2024

Welcome!
No, the weights are quantized the same way for scalar and RVV (the llama-quantize script is independent of this). I think it does not matter much, since loading is a one-time process. The more important metric for comparison is inference time, i.e. how many tokens are generated with RVV on compared with the scalar build.

grigohas commented Jan 2, 2025

Hello again, I am simulating a RISC-V environment in gem5 and running llama-cli with a llama 4b model. When I change the RVV vlen and elen to more than 256 bits and 64 bits, the generated text does not make sense. Does the llama implementation only work on RVV with vlen=256 and elen=64?

Tameem-10xE commented Jan 2, 2025 (edited)

Hi, what output did you get?
Some functions use a dynamic vl, i.e. (size_t vl = __riscv_vsetvl_e8m1(qk/2);).
However, to avoid duplicating the instructions that load qh and the smaller arrays, I limited the vl size to 256; a smaller VL should work without any issues.
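To illustrate what a dynamic vl buys, here is a minimal scalar sketch of a strip-mined loop, where each iteration asks how many elements it may process and advances by that amount. `emu_vsetvl` is a hypothetical stand-in that simply returns min(requested, VLMAX), and `dot_i8_stripmined` is an illustrative name; neither comes from ggml, and the real instruction is allowed more freedom in choosing vl than this emulation shows.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for vsetvl: grant at most vlmax elements (vlmax > 0). */
static size_t emu_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Dot product of n int8 values, processed in chunks of at most vlmax
 * elements. The result must not depend on vlmax. */
int32_t dot_i8_stripmined(const int8_t *a, const int8_t *b,
                          size_t n, size_t vlmax) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ) {
        const size_t vl = emu_vsetvl(n - i, vlmax);  /* elements this pass */
        for (size_t j = 0; j < vl; j++) {
            sum += (int32_t) a[i + j] * (int32_t) b[i + j];
        }
        i += vl;  /* advance by what was actually processed */
    }
    return sum;
}
```

A loop written this way gives the same answer for any register width, whereas code that hard-codes a vl or assumes a particular number of lanes per register can silently read or combine the wrong elements when VLEN changes.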

grigohas commented Jan 2, 2025

For RVV with vlen=512 and elen=64 I got this

Tameem-10xE commented Jan 2, 2025

..., this could be a bug. I may have missed something, or the second half of the register could be interfering with the output, causing undefined or junk text. Sorry, due to other tasks I will not be able to look into this for now, but meanwhile you can file an issue or ask about it in the RISC-V intrinsics repo.

grigohas commented Jan 5, 2025

Now that we build llama.cpp with cmake instead of make, what file do I have to change to build it for RISC-V without the vector extension? I changed -march in the Makefile, but it was built with the vector extension again.

Tameem-10xE commented Jan 6, 2025

You can use the flag -DGGML_RVV=OFF, i.e.:

mkdir build
cd build
cmake -DGGML_RVV=OFF [... other flags ...]

grigohas commented Jan 22, 2025

> ..., this could be a bug. I may have missed something, or the second half of the register could be interfering with the output, causing undefined or junk text. Sorry, due to other tasks I will not be able to look into this for now, but meanwhile you can file an issue or ask about it in the RISC-V intrinsics repo.

Hello again. What do you think I can change to fix this bug, so that the executable runs with a higher vlen and gives correct results?

Tameem-10xE commented Jan 22, 2025

Usually, this could be due to the vl variable, the outer or inner for loop (for (int i = 0; i < nb; i++)) not completing or missing edge elements, or the vsetvli instruction. Also, I am unsure about the gem5 simulation, which could behave differently from QEMU or the RV board I tested on (with 256-bit VLEN). Sorry, I have this in mind and will try to fix it next week.

grigohas commented Feb 3, 2025

Any news?

Tameem-10xE commented Feb 3, 2025

Really sorry for the late response. I was inactive last week.
I only have access to 256-bit RV hardware, so my only option was to check on an emulator (QEMU), and it works correctly for VLEN 512 with Q4_K. I also went through the code but was unable to find any issue. It may be an issue with gem5, but I can't say for sure.

grigohas commented Feb 3, 2025 (edited)

Can you explain how you run it on QEMU? I just ran llama-cli with vlen=512 and got this generated text with a q4_k model:
"" Anythingï¿½stagemblá»¥ shelterirasichenForKeyronessageHinttexttÃ¥k Weltkrieg ""

Tameem-10xE commented Feb 3, 2025

Sorry, I plugged in the wrong weights above (I need to change my naming convention...).

I identified the issue in the function void ggml_vec_dot_q4_K_q8_K in the ggml-cpu-quant.c file. I am unable to test this right now, but try changing the value of size_t vl=8 in this function to either 16 or 32. (Only for 4-bit q4_k.)

grigohas commented Feb 3, 2025 (edited)

I changed vl to 16 and 32 and it still does not work correctly. The same happens for other models too. I tried a q2_k model with vlen=512 and the generated text is not correct.

xctan mentioned this pull request

Mar 23, 2025

ggml : riscv: add 128-bit RVV support #12530

Merged


Added RISC-V Vector Support for K-Quants and improved the existing intrinsics #3453


Tameem-10xE commented Oct 3, 2023 (edited)
