SciSharp/LLamaSharp


LLamaSharp is a cross-platform library to run 🦙LLaMA/LLaVA models (and others) on your local device. Based on llama.cpp, inference with LLamaSharp is efficient on both CPU and GPU. With its higher-level APIs and RAG support, LLamaSharp makes it convenient to deploy LLMs (Large Language Models) in your application.

Please star the repo to show your support for this project!🤗


Table of Contents

📖 Documentation

📌 Console Demo

LLaMA · LLaVA

🔗Integrations & Examples

There are integrations with the following libraries, making it easier to develop your app. The integrations for semantic-kernel and kernel-memory are developed in the LLamaSharp repository, while the others are developed in their own repositories.

  • semantic-kernel: an SDK that integrates LLMs like OpenAI, Azure OpenAI, and Hugging Face.
  • kernel-memory: a multi-modal AI service specialized in the efficient indexing of datasets through custom continuous data hybrid pipelines, with support for RAG (Retrieval Augmented Generation), synthetic memory, prompt engineering, and custom semantic memory processing.
  • BotSharp: an open-source machine learning framework for building AI bot platforms.
  • Langchain: a framework for developing applications powered by language models.
  • MaIN.NET: a simple approach to orchestrating agents and chats across different LLM providers.

The following examples show how to build apps with LLamaSharp.

LLamaSharp-Integrations

🚀Get started

Installation

To achieve high performance, LLamaSharp interacts with native libraries compiled from C++; these are called backends. We provide backend packages for Windows, Linux and Mac with CPU, CUDA, Metal and Vulkan. You don't need to compile any C++ code; just install the backend packages.

If no published backend matches your device, please open an issue to let us know. If compiling C++ code is not difficult for you, you can also follow this guide to compile a backend yourself and run LLamaSharp with it.

  1. Install the LLamaSharp package from NuGet:
PM> Install-Package LLamaSharp
  2. Install one or more of these backends, or use a self-compiled backend.

  3. (optional) For the Microsoft semantic-kernel integration, install the LLamaSharp.semantic-kernel package.

  4. (optional) To enable RAG support, install the LLamaSharp.kernel-memory package (this package currently only supports net6.0 or higher), which is based on the Microsoft kernel-memory integration.
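The steps above can also be done with the dotnet CLI. A minimal sketch, assuming a CPU-only setup; the backend package name here (LLamaSharp.Backend.Cpu) is one example, and you should pick the backend package that matches your hardware instead:

```shell
# Add LLamaSharp to an existing .NET project.
dotnet add package LLamaSharp

# Add one backend package matching your hardware
# (example: the CPU backend).
dotnet add package LLamaSharp.Backend.Cpu
```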

Model preparation

There are two popular formats for LLM model files: the PyTorch format (.pth) and the Hugging Face format (.bin). LLamaSharp uses a GGUF format file, which can be converted from these two formats. To get a GGUF file, there are two options:

  1. Search for the model name plus 'gguf' on Hugging Face; you will find lots of model files that have already been converted to GGUF format. Please take note of their publishing time, because some older ones may only work with older versions of LLamaSharp.

  2. Convert the PyTorch or Hugging Face format to GGUF format yourself. Please follow the instructions from this part of the llama.cpp README to convert them with the Python scripts.

Generally, we recommend downloading models with quantization rather than fp16, because it significantly reduces the required memory size while only slightly impacting the generation quality.
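As a sketch of what such a conversion plus quantization run can look like: the script and tool names below follow a recent llama.cpp layout and have changed across versions, so check the README of the llama.cpp commit you are actually using. The paths and output file names are placeholders.

```shell
# Convert a Hugging Face model directory to a fp16 GGUF file
# (convert_hf_to_gguf.py ships in the llama.cpp repository).
python convert_hf_to_gguf.py ./path/to/hf-model --outfile model-f16.gguf

# Quantize to reduce memory usage with only a small quality impact
# (llama-quantize is built when you compile llama.cpp).
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```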

Example of LLaMA chat session

Here is a simple example of chatting with a bot based on an LLM in LLamaSharp. Please replace the model path with your own.

```cs
using LLama;
using LLama.Common;
using LLama.Sampling;

string modelPath = @"<Your Model Path>"; // change it to your own model path.

var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024, // The longest length of chat as memory.
    GpuLayerCount = 5 // How many layers to offload to GPU. Please adjust it according to your GPU memory.
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Add chat histories as prompt to tell AI how to act.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.");
chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

ChatSession session = new(executor, chatHistory);

InferenceParams inferenceParams = new InferenceParams()
{
    MaxTokens = 256, // No more than 256 tokens should appear in the answer. Remove it if the antiprompt is enough for control.
    AntiPrompts = new List<string> { "User:" }, // Stop generation once antiprompts appear.
    SamplingPipeline = new DefaultSamplingPipeline(),
};

Console.ForegroundColor = ConsoleColor.Yellow;
Console.Write("The chat session has started.\nUser: ");
Console.ForegroundColor = ConsoleColor.Green;
string userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    // Generate the response as a stream of tokens.
    await foreach (
        var text
        in session.ChatAsync(
            new ChatHistory.Message(AuthorRole.User, userInput),
            inferenceParams))
    {
        Console.ForegroundColor = ConsoleColor.White;
        Console.Write(text);
    }
    Console.ForegroundColor = ConsoleColor.Green;
    userInput = Console.ReadLine() ?? "";
}
```

For more examples, please refer toLLamaSharp.Examples.

💡FAQ

Why is my GPU not used when I have installed CUDA?

  1. If you are using backend packages, please make sure you have installed the CUDA backend package which matches the CUDA version installed on your system.
  2. Add the following line to the very beginning of your code. The log will show which native library file is loaded. If the CPU library is loaded, please try to compile the native library yourself and open an issue about it. If the CUDA library is loaded, please check that GpuLayerCount > 0 when loading the model weights.

```cs
NativeLibraryConfig.All.WithLogCallback(delegate (LLamaLogLevel level, string message)
{
    Console.Write($"{level}: {message}");
});
```

Why is the inference so slow?

Firstly, due to the large size of LLM models, generating output takes more time than with other kinds of models, especially when you are using models with more than 30B parameters.

To see whether it's a LLamaSharp performance issue, please follow the two tips below.

  1. If you are using CUDA, Metal or Vulkan, please set GpuLayerCount as large as possible.
  2. If it's still slower than you expect, please try to run the same model with the same settings in the llama.cpp examples. If llama.cpp significantly outperforms LLamaSharp, it's likely a LLamaSharp bug; please report it to us.

Why does the program crash before any output is generated?

Generally, there are two possible cases for this problem:

  1. The native library (backend) you are using is not compatible with your LLamaSharp version. If you compiled the native library yourself, please make sure you have checked out llama.cpp at the commit corresponding to your LLamaSharp version, which can be found in the table at the bottom of this README.
  2. The model file you are using is not compatible with the backend. If you are using a GGUF file downloaded from Hugging Face, please check its publishing time.

Why is my model generating output infinitely?

Please set an anti-prompt or a max token count when executing the inference.
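Both controls live on InferenceParams, as used in the chat example above. A minimal sketch (the anti-prompt string and token limit are example values to adapt to your prompt format):

```cs
using LLama.Common;

var inferenceParams = new InferenceParams()
{
    MaxTokens = 256, // Hard cap: stop after at most 256 generated tokens.
    AntiPrompts = new List<string> { "User:" } // Stop as soon as the model emits this string.
};
```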

🙌Contributing

All contributions are welcome! There's a TODO list in the LLamaSharp Dev Project and you can pick an interesting one to start with. Please read the contributing guide for more information.

You can also do one of the following to help us make LLamaSharp better:

  • Submit a feature request.
  • Star and share LLamaSharp to let others know about it.
  • Write a blog or demo about LLamaSharp.
  • Help to develop Web API and UI integration.
  • Just open an issue about the problem you've found!

Join the community

Join our chat on Discord (please contact Rinne to join the dev channel if you want to be a contributor).

Join the QQ group.

Star history

Star History Chart

Contributor wall of fame

LLamaSharp Contributors

Map of LLamaSharp and llama.cpp versions

If you want to compile llama.cpp yourself, you must use the exact commit ID listed for each version.

| LLamaSharp | Verified Model Resources | llama.cpp commit id |
| --- | --- | --- |
| v0.2.0 | This version is not recommended to use. | - |
| v0.2.1 | WizardLM, Vicuna (filenames with "old") | - |
| v0.2.2, v0.2.3 | WizardLM, Vicuna (filenames without "old") | 63d2046 |
| v0.3.0, v0.4.0 | LLamaSharpSamples v0.3.0, WizardLM | 7e4ea5b |
| v0.4.1-preview | Open llama 3b, Open Buddy | aacdbd4 |
| v0.4.2-preview | Llama2 7B (GGML) | 3323112 |
| v0.5.1 | Llama2 7B (GGUF) | 6b73ef1 |
| v0.6.0 | | cb33f43 |
| v0.7.0, v0.8.0 | Thespis-13B, LLaMA2-7B | 207b519 |
| v0.8.1 | | e937066 |
| v0.9.0, v0.9.1 | Mixtral-8x7B | 9fb13f9 |
| v0.10.0 | Phi2 | d71ac90 |
| v0.11.1, v0.11.2 | LLaVA-v1.5, Phi2 | 3ab8b3a |
| v0.12.0 | LLama3 | a743d76 |
| v0.13.0 | | 1debe72 |
| v0.14.0 | Gemma2 | 36864569 |
| v0.15.0 | LLama3.1 | 345c8c0c |
| v0.16.0 | | 11b84eb4 |
| v0.17.0 | | c35e586e |
| v0.18.0 | | c35e586e |
| v0.19.0 | | 958367bf |
| v0.20.0 | | 0827b2c1 |
| v0.21.0 | DeepSeek R1 | 5783575c |
| v0.22.0, v0.23.0 | Gemma3 | be7c3034 |
| v0.24.0 | Qwen3 | ceda28ef |
| v0.25.0 | | 11dd5a44eb180e1d69fac24d3852b5222d66fb7f |

License

This project is licensed under the terms of the MIT license.

