QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.
QuantLLM is a Python library designed for efficient model quantization using the GGUF (GGML Universal Format) method. It provides a robust framework for converting and deploying large language models with minimal memory footprint and optimal performance. Key capabilities include:
- Memory-efficient GGUF quantization with multiple precision options (2-bit to 8-bit)
- Chunk-based processing for handling large models (see the sketch after the feature table below)
- Comprehensive benchmarking tools
- Detailed progress tracking with memory statistics
- Easy model export and deployment
Feature | Description |
---|---|
✅ Multiple GGUF Types | Support for various GGUF quantization types (Q2_K to Q8_0) with different precision-size tradeoffs |
✅ Memory Optimization | Chunk-based processing and CPU offloading for efficient handling of large models |
✅ Progress Tracking | Detailed layer-wise progress with memory statistics and ETA |
✅ Benchmarking Tools | Comprehensive benchmarking suite for performance evaluation |
✅ Hardware Optimization | Automatic device selection and memory management |
✅ Easy Deployment | Simple conversion to GGUF format for deployment |
✅ Flexible Configuration | Customizable quantization parameters and processing options |
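The chunk-based pattern named above is, in outline, a standard one: move one layer at a time onto the accelerator, quantize it, and offload it back so peak VRAM stays near a single layer's footprint. Below is a minimal sketch of that pattern in PyTorch; it is illustrative only, not QuantLLM's actual internals, and `quantize_layer_` is a hypothetical stand-in for any in-place per-layer quantizer.

```python
# Illustrative chunk-based pattern: process one layer at a time so peak GPU
# memory stays near a single layer's footprint. Not QuantLLM internals;
# quantize_layer_ is a hypothetical stand-in for any in-place quantizer.
import torch

def quantize_chunked(model, quantize_layer_, device="cuda"):
    for layer in model.modules():
        if isinstance(layer, torch.nn.Linear):
            layer.to(device)          # move just this layer to the GPU
            quantize_layer_(layer)    # quantize its weights in place
            layer.to("cpu")           # offload back to CPU, freeing VRAM
            torch.cuda.empty_cache()  # release cached blocks between chunks
    return model
```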
Basic installation:

```bash
pip install quantllm
```

With GGUF support (recommended):

```bash
pip install quantllm[gguf]
```
```python
from quantllm import QuantLLM
from transformers import AutoTokenizer

# Load tokenizer and prepare calibration data
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_text = ["Example text for calibration."] * 10
calibration_data = tokenizer(calibration_text, return_tensors="pt", padding=True)["input_ids"]

# Quantize model
quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
    model_name_or_path=model_name,
    bits=4,                        # Quantization bits (2-8)
    group_size=32,                 # Group size for quantization
    quant_type="Q4_K_M",           # GGUF quantization type
    calibration_data=calibration_data,
    benchmark=True,                # Run benchmarks
    benchmark_input_shape=(1, 32)
)

# Save and convert to GGUF
QuantLLM.save_quantized_model(model=quantized_model, output_path="quantized_model")
QuantLLM.convert_to_gguf(model=quantized_model, output_path="model.gguf")
```
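The exported file should be usable from any GGUF-compatible runtime. As a hedged example (llama-cpp-python is a separate package, not a QuantLLM dependency; install it with `pip install llama-cpp-python`), loading and prompting the converted model might look like:

```python
# Run the exported GGUF model with llama-cpp-python, a separate
# GGUF-compatible runtime (not part of QuantLLM).
from llama_cpp import Llama

llm = Llama(model_path="model.gguf")            # load the quantized weights
output = llm("Example prompt:", max_tokens=64)  # generate a short completion
print(output["choices"][0]["text"])
```

Whether a given architecture loads depends on the runtime's GGUF support for that architecture, not on QuantLLM.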
For detailed usage examples and API documentation, please refer to our documentation.
Minimum requirements:

- CPU: 4+ cores
- RAM: 16GB+
- Storage: 10GB+ free space
- Python: 3.10+

Recommended requirements:

- CPU: 8+ cores
- RAM: 32GB+
- GPU: NVIDIA GPU with 8GB+ VRAM
- CUDA: 11.7+
- Storage: 20GB+ free space
Type | Bits | Description | Use Case |
---|---|---|---|
Q2_K | 2 | Extreme compression | Size-critical deployment |
Q3_K_S | 3 | Small size | Limited storage |
Q4_K_M | 4 | Balanced quality | General use |
Q5_K_M | 5 | Higher quality | Quality-sensitive tasks |
Q8_0 | 8 | Best quality | Accuracy-critical tasks |
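As a back-of-the-envelope view of the size side of this tradeoff, quantized weight storage scales roughly linearly with bit width. The sketch below estimates weight sizes for a hypothetical 7B-parameter model; it ignores per-block metadata (scales, zero points), which adds a few percent to real GGUF files, so the figures are illustrative rather than measured.

```python
# Rough size estimate: (bits per weight / 8) bytes, times parameter count.
# Ignores per-block metadata, so real GGUF files run a few percent larger.
params = 7_000_000_000  # e.g. a hypothetical 7B-parameter model

for name, bits in [("Q2_K", 2), ("Q3_K_S", 3), ("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8)]:
    size_gb = params * bits / 8 / 1024**3
    print(f"{name}: ~{size_gb:.1f} GB")
```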
QuantLLM | Python | PyTorch | Transformers | CUDA |
---|---|---|---|---|
1.2.0 | ≥3.10 | ≥2.0.0 | ≥4.30.0 | ≥11.7 |
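To check an existing environment against this matrix, a quick version probe is enough (it reports versions only; it does not validate the QuantLLM installation itself):

```python
# Quick environment probe against the compatibility matrix above.
import torch
import transformers

print("PyTorch:", torch.__version__)              # want >= 2.0.0
print("Transformers:", transformers.__version__)  # want >= 4.30.0
print("CUDA:", torch.version.cuda)                # want >= 11.7 (None on CPU-only builds)
```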
Roadmap:

- Support for more GGUF model architectures
- Enhanced benchmarking capabilities
- Multi-GPU processing support
- Advanced memory optimization techniques
- Integration with more deployment platforms
- Custom quantization kernels
We welcome contributions! Please see our CONTRIBUTE.md for guidelines and setup instructions.
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements:

- llama.cpp for the GGUF format
- Hugging Face for the Transformers library
- CTransformers for GGUF support
Support:

- GitHub Issues: Create an issue
- Documentation: Read the docs
- Discord: Join our community
- Email: support@quantllm.ai