from browsing the web:
- when the model doesn't fit in GPU RAM, you lose roughly 10x the GPU performance, making an APU with a lot of RAM competitive with a big RTX 3090
- hardware:
  - the model must fit inside RAM or you lose roughly 10x the performance
  - RAM is even more crucial for reasoning models, which need to keep a long context in memory
  - a big, useful model seems to require between 16 and 32 GB of RAM (see the sizing sketch after this list), so:
    - the safe bet for a new PC is an APU + NPU with 64 GB of RAM
    - with a discrete GPU, a ratio of CPU RAM = 4x GPU RAM still seems nice, so 64 GB of CPU RAM plus optionally a 16 GB GPU
  - unified RAM makes a big APU with high bandwidth the "cheap" competitive option in 2025
  - RAM bandwidth is important: a 256-bit bus looks like the minimum, 384-bit is lovely
  - a 1 TB disk may be too small
- evolution pressures:
  - Microsoft Copilot ==> pushes NPU and RAM
  - electricity ==> an APU seems roughly 10x more efficient than a GPU
  - a Mac mini-style PC is the low-cost choice, especially with a hardware renewal strategy every 2 years
  - token cost is trending down roughly 20x per year:
    - so there is no reason to sink too much money into hardware; renew more often, but check the minimums
  - Apple M4 is the reference: we can only hope for competition from Qualcomm, AMD, and Nvidia
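As a rough sanity check of the 16-32 GB figure and the bandwidth point, here is a back-of-the-envelope Python sketch; the parameter counts, the ~4.5 bits/weight (Q4-style) quantization, and the bandwidth numbers are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope sizing: assume the weights dominate the footprint and
# decoding is memory-bandwidth-bound (every generated token streams the weights once).

def model_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Approximate RAM to hold the weights, plus ~20% for KV cache and buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

def decode_tokens_per_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Memory-bound estimate: tokens/s ~= RAM bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_gb

if __name__ == "__main__":
    for name, params in [("7B Q4", 7), ("32B Q4", 32), ("70B Q4", 70)]:
        gb = model_ram_gb(params, bits_per_weight=4.5)
        # ~100 GB/s for a 128-bit DDR5 bus, ~250 GB/s for a 256-bit unified-memory APU
        print(f"{name}: ~{gb:.0f} GB RAM, "
              f"~{decode_tokens_per_s(gb, 100):.1f} tok/s at 100 GB/s, "
              f"~{decode_tokens_per_s(gb, 250):.1f} tok/s at 250 GB/s")
```

With these assumptions a 32B model at ~4-bit lands a bit over 20 GB, which is why 64 GB of system RAM and a 256-bit (or wider) memory bus look like the comfortable minimum.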
asking an AI (Gemini Flash 2):
- software:
  - llama.cpp: ggml-org/llama.cpp, the greatest, but initially oriented toward Mac / ARM
  - Ollama: https://ollama.ai
  - LM Studio: https://lmstudio.ai/
  - Torch / TensorFlow / JAX? not really aimed at CPU
  - Intel IPEX-LLM for optimizing on CPU (works below Torch?)
  - ... Docker models are looking like a trend
*** gemini flash 2: 2025-03-16
It's great you're looking into running LLMs locally! Here's a breakdown of how to obtain llama.cpp, Ollama, and LM Studio:
1. llama.cpp:
   - GitHub:
     - The primary source is the llama.cpp GitHub repository: https://github.com/ggml-org/llama.cpp
     - From there, you have a few options:
       - Building from source: This offers the most flexibility but requires technical knowledge. You'll need to compile the code for your operating system.
       - Downloading pre-built binaries: On the GitHub Releases page, you can find pre-built binaries. Be sure to download the correct version for your CPU architecture.
   - Key points:
     - llama.cpp is a C++ library, so it's primarily command-line driven.
     - It's highly optimized for CPU inference, especially with quantized models.
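As a quick illustration, llama.cpp can also be driven from Python via the llama-cpp-python bindings (a third-party wrapper, not part of llama.cpp itself); a minimal sketch, where the model path, context size, and thread count are placeholders to adjust:

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
# Assumes a quantized GGUF model has already been downloaded; the path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # placeholder path to a quantized model
    n_ctx=4096,      # context window; larger contexts grow the KV cache and need more RAM
    n_threads=8,     # CPU threads; tune to your core count
)

out = llm("Q: Why does the model need to fit in RAM? A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```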
2. Ollama:
   - Official Website:
     - The easiest way to get Ollama is from its official website: https://ollama.ai
     - The website provides straightforward installation instructions for macOS, Linux, and Windows.
     - Installation often involves running a simple installation script.
   - Key points:
     - Ollama simplifies the process of running LLMs locally.
     - It handles model management, making it user-friendly.
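Once installed, Ollama also exposes a local REST API on port 11434; a minimal sketch, assuming a model has already been pulled (the `llama3` tag is just an example):

```python
# Minimal sketch against Ollama's local REST API (default port 11434).
# Assumes `ollama pull llama3` has already been run; the model tag is an example.
import json
import urllib.request

payload = {
    "model": "llama3",                       # example model tag
    "prompt": "Why is RAM bandwidth important for local LLMs?",
    "stream": False,                         # return one JSON object instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```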
3. LM Studio:
   - Official Website:
     - You can download LM Studio from its official website: https://lmstudio.ai/
     - The website provides installers for various operating systems.
   - Key points:
     - LM Studio offers a graphical user interface (GUI), making it very accessible.
     - It allows you to easily download and run LLMs from Hugging Face.
     - It uses llama.cpp in the background.
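LM Studio can also start a local server that mimics the OpenAI chat API (by default on port 1234); a minimal sketch, assuming the server is running with a model loaded:

```python
# Minimal sketch against LM Studio's local OpenAI-compatible server (default port 1234).
# Assumes the "Local Server" has been started in LM Studio and a model is loaded.
import json
import urllib.request

payload = {
    "model": "local-model",                  # LM Studio serves whichever model is loaded
    "messages": [{"role": "user", "content": "Summarize why unified RAM helps APUs."}],
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```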
General Tips:
- When downloading or installing, always prioritize official sources to ensure safety.
- Pay attention to system requirements, especially RAM and storage space.
- If you're new to LLMs, Ollama or LM Studio might be easier starting points.
I hope this helps!
simple comparisons: