QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.
QuantLLM is a Python library designed for efficient model quantization using the GGUF (GGML Universal Format) method. It provides a robust framework for converting and deploying large language models with minimal memory footprint and optimal performance. Key capabilities include:
- Memory-efficient GGUF quantization with multiple precision options (2-bit to 8-bit)
- Chunk-based processing for handling large models (see the sketch after the feature table below)
- Comprehensive benchmarking tools
- Detailed progress tracking with memory statistics
- Easy model export and deployment
Feature | Description |
---|---|
✅ Multiple GGUF Types | Support for various GGUF quantization types (Q2_K to Q8_0) with different precision-size tradeoffs |
✅ Memory Optimization | Chunk-based processing and CPU offloading for efficient handling of large models |
✅ Progress Tracking | Detailed layer-wise progress with memory statistics and ETA |
✅ Benchmarking Tools | Comprehensive benchmarking suite for performance evaluation |
✅ Hardware Optimization | Automatic device selection and memory management |
✅ Easy Deployment | Simple conversion to GGUF format for deployment |
✅ Flexible Configuration | Customizable quantization parameters and processing options |
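The chunk-based pattern named above is, in outline, a standard one: move one layer at a time onto the accelerator, quantize it, and offload it back so peak VRAM stays near a single layer's footprint. Below is a minimal sketch of that pattern in PyTorch; it is illustrative only, not QuantLLM's actual internals, and `quantize_layer_` is a hypothetical stand-in for any in-place per-layer quantizer.

```python
# Illustrative chunk-based pattern: process one layer at a time so peak GPU
# memory stays near a single layer's footprint. Not QuantLLM internals;
# quantize_layer_ is a hypothetical stand-in for any in-place quantizer.
import torch

def quantize_chunked(model, quantize_layer_, device="cuda"):
    for layer in model.modules():
        if isinstance(layer, torch.nn.Linear):
            layer.to(device)          # move just this layer to the GPU
            quantize_layer_(layer)    # quantize its weights in place
            layer.to("cpu")           # offload back to CPU, freeing VRAM
            torch.cuda.empty_cache()  # release cached blocks between chunks
    return model
```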
Basic installation:

```bash
pip install quantllm
```

With GGUF support (recommended):

```bash
pip install quantllm[gguf]
```
```python
from quantllm import QuantLLM
from transformers import AutoTokenizer

# Load tokenizer and prepare calibration data
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_text = ["Example text for calibration."] * 10
calibration_data = tokenizer(calibration_text, return_tensors="pt", padding=True)["input_ids"]

# Quantize model
quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
    model_name_or_path=model_name,
    bits=4,                        # Quantization bits (2-8)
    group_size=32,                 # Group size for quantization
    quant_type="Q4_K_M",           # GGUF quantization type
    calibration_data=calibration_data,
    benchmark=True,                # Run benchmarks
    benchmark_input_shape=(1, 32)
)

# Save and convert to GGUF
QuantLLM.save_quantized_model(model=quantized_model, output_path="quantized_model")
QuantLLM.convert_to_gguf(model=quantized_model, output_path="model.gguf")
```
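The exported file should be usable from any GGUF-compatible runtime. As a hedged example (llama-cpp-python is a separate package, not a QuantLLM dependency; install it with `pip install llama-cpp-python`), loading and prompting the converted model might look like:

```python
# Run the exported GGUF model with llama-cpp-python, a separate
# GGUF-compatible runtime (not part of QuantLLM).
from llama_cpp import Llama

llm = Llama(model_path="model.gguf")            # load the quantized weights
output = llm("Example prompt:", max_tokens=64)  # generate a short completion
print(output["choices"][0]["text"])
```

Whether a given architecture loads depends on the runtime's GGUF support for that architecture, not on QuantLLM.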
For detailed usage examples and API documentation, please refer to our documentation.
Minimum requirements:

- CPU: 4+ cores
- RAM: 16GB+
- Storage: 10GB+ free space
- Python: 3.10+

Recommended requirements:

- CPU: 8+ cores
- RAM: 32GB+
- GPU: NVIDIA GPU with 8GB+ VRAM
- CUDA: 11.7+
- Storage: 20GB+ free space
Type | Bits | Description | Use Case |
---|---|---|---|
Q2_K | 2 | Extreme compression | Size-critical deployment |
Q3_K_S | 3 | Small size | Limited storage |
Q4_K_M | 4 | Balanced quality | General use |
Q5_K_M | 5 | Higher quality | Quality-sensitive tasks |
Q8_0 | 8 | Best quality | Accuracy-critical tasks |
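As a back-of-the-envelope view of the size side of this tradeoff, quantized weight storage scales roughly linearly with bit width. The sketch below estimates weight sizes for a hypothetical 7B-parameter model; it ignores per-block metadata (scales, zero points), which adds a few percent to real GGUF files, so the figures are illustrative rather than measured.

```python
# Rough size estimate: (bits per weight / 8) bytes, times parameter count.
# Ignores per-block metadata, so real GGUF files run a few percent larger.
params = 7_000_000_000  # e.g. a hypothetical 7B-parameter model

for name, bits in [("Q2_K", 2), ("Q3_K_S", 3), ("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8)]:
    size_gb = params * bits / 8 / 1024**3
    print(f"{name}: ~{size_gb:.1f} GB")
```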
QuantLLM | Python | PyTorch | Transformers | CUDA |
---|---|---|---|---|
1.2.0 | ≥3.10 | ≥2.0.0 | ≥4.30.0 | ≥11.7 |
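To check an existing environment against this matrix, a quick version probe is enough (it reports versions only; it does not validate the QuantLLM installation itself):

```python
# Quick environment probe against the compatibility matrix above.
import torch
import transformers

print("PyTorch:", torch.__version__)              # want >= 2.0.0
print("Transformers:", transformers.__version__)  # want >= 4.30.0
print("CUDA:", torch.version.cuda)                # want >= 11.7 (None on CPU-only builds)
```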
Roadmap:

- Support for more GGUF model architectures
- Enhanced benchmarking capabilities
- Multi-GPU processing support
- Advanced memory optimization techniques
- Integration with more deployment platforms
- Custom quantization kernels
We welcome contributions! Please see our CONTRIBUTE.md for guidelines and setup instructions.
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements:

- llama.cpp for the GGUF format
- Hugging Face for the Transformers library
- CTransformers for GGUF support
Support:

- GitHub Issues: Create an issue
- Documentation: Read the docs
- Discord: Join our community
- Email: support@quantllm.ai