
New llama-run #17554


Open

ericcurtin wants to merge 1 commit into ggml-org:master from ericcurtin:llama-server-chat

Conversation

@ericcurtin (Collaborator) commented Nov 27, 2025 (edited):

  • Added readline.cpp include
  • Created run_chat_mode():
    • Initializes readline with command history
    • Maintains conversation history
    • Applies chat templates to format messages
    • Submits completion tasks to the server queue
    • Displays assistant responses interactively
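The loop described above can be sketched in isolation. This is a minimal, self-contained approximation and not the PR's actual code: the real implementation uses readline.cpp for input, the server's chat-template machinery, and the server task queue, whereas here `apply_template` and the placeholder reply are illustrative stand-ins:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct chat_msg {
    std::string role;
    std::string content;
};

// Stand-in for the server's chat-template formatting: flattens the
// conversation history into a single prompt string.
static std::string apply_template(const std::vector<chat_msg> & history) {
    std::ostringstream out;
    for (const auto & msg : history) {
        out << "<|" << msg.role << "|>" << msg.content << "\n";
    }
    out << "<|assistant|>";
    return out.str();
}

// Minimal chat loop: read a line, extend the history, format the prompt,
// produce a reply, print it, and remember it as the assistant turn.
static void run_chat_mode(std::istream & in, std::ostream & out) {
    std::vector<chat_msg> history;
    std::string line;

    out << "> ";
    while (std::getline(in, line)) {  // readline + command history in the real tool
        if (line == "/exit") {
            break;
        }
        history.push_back({"user", line});
        const std::string prompt = apply_template(history);

        // In the PR this prompt is submitted as a completion task to the
        // server queue; here a placeholder reply stands in for the model.
        const std::string reply = "(completion for prompt of " +
                                  std::to_string(prompt.size()) + " chars)";
        out << reply << "\n> ";
        history.push_back({"assistant", reply});
    }
}
```

Calling `run_chat_mode(std::cin, std::cout)` gives the interactive `> ` prompt loop; taking the streams as parameters just keeps the sketch testable.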

@ericcurtin force-pushed the llama-server-chat branch 3 times, most recently from 94a98cf to 500a90c on November 27, 2025 at 17:37
@ericcurtin (Collaborator, Author):

@angt @ggerganov @CISC @ngxson PTAL

@ngxson (Collaborator):

Considering these points:

  1. We already have llama-chat, llama-run, and llama-cli, all providing a chat experience. Adding one more may introduce more stress for maintainers (especially for llama-server, where the code is already quite complex).
  2. llama-server, as the name suggests, should launch a server; otherwise, we should drop the server from its name.
  3. Console-related code should not be inside server.cpp. Instead, it should be in a dedicated module that is decoupled from the server (maybe in the form of a dedicated binary).

So, I think it's better to repurpose one of the 3 binaries mentioned in my first point instead.

@ericcurtin (Collaborator, Author) commented Nov 27, 2025 (edited):

  1. I plan to delete llama-run and linenoise.cpp after this, so the net result would be one less binary and fewer lines of code.

  2. Might not be a terrible idea (dropping -server from its name); even if llama and llama-server could be an identical binary, llama-server is the main binary that gets the most usage.

@ericcurtin force-pushed the llama-server-chat branch 2 times, most recently from f839cfb to 5b0c817 on November 27, 2025 at 18:44
@ericcurtin (Collaborator, Author):

Point 1 has been addressed.

@ngxson (Collaborator):

> llama-server is the main binary that gets the most usage.

llama-server gets the most usage for users who want an inference server. Those who want a CLI will use llama-cli, as the name suggests. Correct naming makes things very intuitive for end users without having to dive into the documentation.

I think what's better to do at this point is to improve the usability of llama-cli by reusing server code inside it. For the first iteration, we can build the server code as a static target and use it in both llama-server and llama-cli. I already had a quick demo here, but obviously we can now do much better, as the server's code base has been broken down into smaller parts, making it easier to #include inside other binaries.

@ericcurtin (Collaborator, Author):

> llama-server is the main binary that gets the most usage.
>
> llama-server gets the most usage for users who want an inference server. Those who want a CLI will use llama-cli, as the name suggests. Correct naming makes things very intuitive for end users without having to dive into the documentation.
>
> I think what's better to do at this point is to improve the usability of llama-cli by reusing server code inside it. For the first iteration, we can build the server code as a static target and use it in both llama-server and llama-cli. I already had a quick demo here, but obviously we can now do much better, as the server's code base has been broken down into smaller parts, making it easier to #include inside other binaries.

I’m on board with the idea, but can we push ahead with this PR and plan the refactor for a later stage?

To be honest, I think you’re the best person to handle the refactoring. You know exactly how you want the architecture to look, and it’s hard for others to guess those specifics. If someone else tries, we risk getting stuck in a loop of corrections, so it’s probably more efficient if you drive that part.

@ngxson (Collaborator) commented Nov 27, 2025 (edited):

> To be honest, I think you’re the best person to handle the refactoring. You know exactly how you want the architecture to look, and it’s hard for others to guess those specifics. If someone else tries, we risk getting stuck in a loop of corrections, so it’s probably more efficient if you drive that part.

For the most part, the server code has already been refactored to be reusable in another tool/example. I'm not sure what kind of refactoring you're talking about.

@ericcurtin (Collaborator, Author):

> To be honest, I think you’re the best person to handle the refactoring. You know exactly how you want the architecture to look, and it’s hard for others to guess those specifics. If someone else tries, we risk getting stuck in a loop of corrections, so it’s probably more efficient if you drive that part.
>
> For the most part, the server code has already been refactored to be reusable in another tool/example. I'm not sure what kind of refactoring you're talking about.

Moving the relevant code to llama-cli, llama-run, etc., or some other place.

@ngxson (Collaborator) commented Nov 27, 2025 (edited):

I think your server-chat.cpp introduced in this PR is already a standalone binary. I think what can be useful is:

  1. Move the code from server-chat.cpp to run.cpp
  2. Modify the CMakeLists of tools/run so that it also compiles the server source code into the binary

This way, you simply make llama-run become llama-server --chat.

In the future, we can make the server a static (or maybe shared) lib target and reuse it in both server and run.
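Step 2 might look roughly like the following sketch for tools/run/CMakeLists.txt. The target name, the source file names, and the relative path to the server sources are assumptions for illustration; the actual layout in the repository may differ:

```cmake
# Sketch: compile the server sources directly into the llama-run binary
# until a dedicated static/shared server library target exists.
set(TARGET llama-run)

add_executable(${TARGET}
    run.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/../server/server.cpp   # assumed relative path
)

target_include_directories(${TARGET} PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../server)
target_link_libraries(${TARGET} PRIVATE common llama)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
```

A later iteration could replace the direct source inclusion with a library target (e.g. `add_library(... STATIC ...)`) that both llama-server and llama-run link against, in line with the static-lib idea above.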

@ericcurtin force-pushed the llama-server-chat branch 5 times, most recently from 0e3f326 to f7a7775 on November 27, 2025 at 22:11

@ericcurtin changed the title from "New --chat option for llama-server" to "New llama-run" on Nov 27, 2025

@ericcurtin force-pushed the llama-server-chat branch 2 times, most recently from f347cb1 to 2619c11 on November 27, 2025 at 22:29
@ericcurtin (Collaborator, Author):

People should give this a shot, the UX is quite neat:

```
$ build/bin/llama-run -dr gemma3
> What is llama.cpp
llama.cpp is a C++ port of the LLaMA (Large Language Model Meta AI) inference engine. It's designed to run LLaMA models on commodity hardware, such as CPUs and laptops, without requiring a powerful GPU.

Here's a breakdown of key aspects:

*   **Purpose:** To make LLaMA models accessible to a wider audience by enabling them to run locally on less powerful machines.
*   **Technology:** Written in C++, leveraging optimizations for CPU architectures. It uses techniques like quantization (reducing the precision of the model's weights) to significantly reduce memory usage and improve performance.
*   **Key Features:**
    *   **CPU-Focused:** Optimized for performance on CPUs, not GPUs.
    *   **Quantization Support:** Supports various quantization methods (e.g., 4-bit, 8-bit) to reduce model size and memory requirements, enabling running on devices with limited RAM.
    *   **Easy to Build and Use:** Relatively straightforward to compile and run, with a simple command-line interface.
    *   **Cross-Platform:** Works on Windows, macOS, and Linux.
    *   **Active Community:** Has a vibrant community contributing to improvements and supporting users.
*   **How it works:** llama.cpp utilizes techniques like memory mapping and efficient data structures to minimize memory footprint and maximize processing speed. It also supports various parallelization strategies to utilize multiple CPU cores.
*   **Popularity:** It's gained significant popularity due to its ability to run LLaMA models locally, allowing users to experiment with and utilize these models without needing expensive hardware.

**Where to learn more:**

*   **GitHub Repository:** https://github.com/ggerganov/llama.cpp
*   **Wiki:** https://github.com/ggerganov/llama.cpp/wiki

Do you want me to elaborate on a specific aspect of llama.cpp, such as:

*   Quantization methods?
*   How to build it?
*   How to use it with a specific model?
*   Its performance compared to other inference engines?
*   Its use for specific applications (e.g., chatbots)?
> Send a message
```

@ngxson (Collaborator):

Seems to be a better direction now. I'll see if it's worth splitting some changes specific to server.cpp into a dedicated PR, for visibility.

One question though: why do we need to split run into run.cpp and run-chat.cpp? I think the whole code can be just one single run.cpp.

@ngxson (Collaborator):

Btw, maybe as a TODO: we can now add multimodal support to the CLI too.

@ericcurtin force-pushed the llama-server-chat branch 3 times, most recently from 2619c11 to f9ae221 on November 28, 2025 at 12:26
@ericcurtin (Collaborator, Author):

> Seems to be a better direction now. I'll see if it's worth splitting some changes specific to server.cpp into a dedicated PR, for visibility.
>
> One question though: why do we need to split run into run.cpp and run-chat.cpp? I think the whole code can be just one single run.cpp.

I've received mixed feedback in the past regarding granular file separation versus consolidation, so I'm unsure of the preferred direction here. run-chat.cpp and run.cpp seem like a reasonable split between chat-focused activities and the other code required to get the tool running.

Commit message:

- Added readline.cpp include
- Created run_chat_mode():
  - Initializes readline with command history
  - Maintains conversation history
  - Applies chat templates to format messages
  - Submits completion tasks to the server queue
  - Displays assistant responses interactively

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
@ericcurtin (Collaborator, Author):

If somebody could restart the failing builds, I'd appreciate it. I don't have any maintainer access anymore, so I'm as limited as a random contributor.

@ngxson (Collaborator) commented Nov 28, 2025 (edited):

The failed builds are not related to the server, so we can ignore them anyway. Besides, I usually open a mirror PR on my fork to skip the long CI queue on the main repo; you can give that a try.


Reviewers

@ggerganov: awaiting requested review (code owner)

@ngxson: awaiting requested review (code owner)

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Projects

None yet

Milestone

No milestone


2 participants

@ericcurtin, @ngxson
