Summary: Existing support for load_in_fp8=True performs an offline quantization when loading the initial model. This is no longer necessary as of vllm==0.12.0 (after vllm-project/vllm#23014), where we can quantize the model on-the-fly when we load it:

	llm = LLM(
	    ...
	    hf_overrides={
	        "quantization_config_dict_str": json.dumps(torchao_config),
	    },
	)
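A minimal sketch of how those keyword arguments might be assembled. The helper name `build_fp8_llm_kwargs` and the placeholder config contents are assumptions for illustration only; the `hf_overrides` key `quantization_config_dict_str` and the `json.dumps` call are the parts taken from the description above. The actual torchao config is whatever the PR constructs.

```python
import json


def build_fp8_llm_kwargs(torchao_config: dict) -> dict:
    """Hypothetical helper: build extra kwargs for vLLM's LLM(...) constructor.

    Serializes the torchao config to a JSON string and places it under
    hf_overrides, mirroring the snippet in the PR description.
    """
    return {
        "hf_overrides": {
            "quantization_config_dict_str": json.dumps(torchao_config),
        },
    }


# Placeholder config, NOT the real torchao schema.
example_config = {"quant_type": "placeholder"}
llm_kwargs = build_fp8_llm_kwargs(example_config)

# The result would then be splatted into the constructor, e.g.
#   llm = LLM(model=..., **llm_kwargs)
```

Keeping the config as a JSON string inside `hf_overrides` means the value survives as a plain string until it is parsed on the vLLM side.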

Note: Needs unslothai/unsloth-zoo#380

Test Plan:
https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423

gemini-code-assist bot (Contributor) commented Dec 11, 2025

Summary of Changes

Hello @andrewor14, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the FP8 quantization logic within the Unsloth framework to leverage new capabilities in vLLM (version 0.12.0 and above). Instead of performing an offline quantization step, models can now be quantized on-the-fly when loaded, simplifying the workflow and improving efficiency. The changes involve updating model loading functions to accept a load_in_fp8 parameter and integrating the new on-the-fly quantization mechanism.

Highlights

On-the-fly FP8 Quantization: Enables direct FP8 quantization during model loading for vLLM versions 0.12.0 and newer, eliminating the need for prior offline quantization.
Streamlined Model Loading: Simplifies the process of loading FP8 quantized models by integrating the quantization step directly into the LLM constructor.
Parameter Updates: Introduces a load_in_fp8 parameter to the from_pretrained methods in llama.py and vision.py, and updates loader_utils.py to handle the new quantization flow.


Footnotes

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double check it and use code with caution.

andrewor14 added a commit to andrewor14/unsloth-zoo that referenced this pull request

Dec 11, 2025

FP8: Load model on-the-fly in vLLM

686d4fd

Supports unslothai/unsloth#3717

andrewor14 mentioned this pull request

Dec 11, 2025

FP8: Load model on-the-fly in vLLM (unslothai/unsloth-zoo#380)

Open

gemini-code-assist bot (Contributor) reviewed Dec 11, 2025

Code Review

This pull request refactors the FP8 model loading to leverage vLLM's on-the-fly quantization for versions 0.12.0 and newer. This is a solid improvement, as it eliminates the need for offline quantization and the creation of a temporary model. The implementation across llama.py, loader.py, loader_utils.py, and vision.py is well-executed, correctly checking the vLLM version to conditionally skip the offline process. My feedback consists of a minor stylistic suggestion to improve the readability of boolean checks, making them more idiomatic to Python.

unsloth/models/llama.py

		)

		fp8_mode = None
		if load_in_fp8 != False:

gemini-code-assist bot (Contributor), Dec 11, 2025

For boolean checks, it's more idiomatic in Python to use the truthiness of the value directly rather than comparing with False. The load_in_fp8 parameter can be True, False, or a string like 'block'. Both True and non-empty strings are truthy, while False is falsy. Using if load_in_fp8: is more concise and readable, and achieves the same result.

Suggested change

	- if load_in_fp8 != False:
	+ if load_in_fp8:
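As a quick check of the claim that the two forms "achieve the same result", the sketch below compares them over the values the parameter is documented to take (True, False, "block"). The function names `old_check` and `new_check` are hypothetical labels for the two spellings, not names from the codebase. It also lists two values outside that set where the forms diverge, which only matters if such values can ever reach this code.

```python
def old_check(load_in_fp8) -> bool:
    # Original spelling: anything that is not equal to False passes.
    return load_in_fp8 != False  # noqa: E712


def new_check(load_in_fp8) -> bool:
    # Suggested spelling: relies on Python truthiness.
    return bool(load_in_fp8)


# For the documented values the two checks agree.
for value in (True, False, "block"):
    assert old_check(value) == new_check(value)

# Outside the documented set they can differ: None and "" are not equal
# to False, yet both are falsy.
differing = [v for v in (None, "") if old_check(v) != new_check(v)]
```

So the rewrite is behavior-preserving as long as the argument is validated to be one of the three documented values before the check.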

Datta0 (Collaborator), Dec 16, 2025

Yeah agree with gemini here :)

andrewor14 (Contributor, Author), Dec 16, 2025

sure I can change it, I just had it this way because I saw that's how Daniel wrote it in a few existing places

unsloth/models/loader_utils.py

		lower_model_name = model_name.lower()

		assert load_in_fp8 in (True, False, "block")
		if load_in_fp8 != False:

gemini-code-assist bot (Contributor), Dec 11, 2025

Similar to my other comments, this check can be made more Pythonic. Instead of if load_in_fp8 != False:, you can use if load_in_fp8:. This leverages Python's truthiness evaluation and is generally considered better style for readability and conciseness.

Suggested change

	- if load_in_fp8 != False:
	+ if load_in_fp8:

unsloth/models/vision.py

		)

		fp8_mode = None
		if load_in_fp8 != False:

gemini-code-assist bot (Contributor), Dec 11, 2025

This check, if load_in_fp8 != False:, can be simplified to if load_in_fp8:. This is the more idiomatic and preferred way to check for truthiness in Python, improving code readability.

Suggested change

	- if load_in_fp8 != False:
	+ if load_in_fp8:

FP8: Load model on-the-fly in vLLM

f7ee75e


andrewor14 force-pushed the true-on-the-fly-fp8 branch from 32cb1d7 to f7ee75e

December 11, 2025 22:15

chatgpt-codex-connector bot reviewed Dec 11, 2025

💡 Codex Review

Here are some automated review suggestions for this pull request.


unsloth/models/loader.py

Comment on lines 241 to 246

		load_in_4bit,
		load_in_8bit,
		load_in_16bit,
		use_exact_model_name,
		)
		model_name = _offline_quantize_to_fp8(model_name, fp8_mode)
		else:

chatgpt-codex-connector bot, Dec 11, 2025

Enforce FP8/4bit mutual exclusion for vLLM >=0.12

When load_in_fp8 is true, get_model_name now returns the original name as soon as vLLM ≥ 0.12.0 (loader_utils.py lines 110-118), so the new_model_name is None branch here is never taken and _get_fp8_mode_and_check_settings no longer runs. With the default load_in_4bit=True, the code now proceeds to fast inference with both load_in_fp8 and use_bitsandbytes=load_in_4bit set, even though _get_fp8_mode_and_check_settings used to reject FP8 together with 4/8/16-bit loads. This yields conflicting quantization paths (fp8 on-the-fly plus bitsandbytes 4bit) and is likely to fail at runtime for users who simply enable load_in_fp8 without also disabling 4bit.
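One way to picture the guard this comment asks for is a check that runs regardless of vLLM version. The sketch below is an assumption-laden illustration: `check_fp8_exclusive` is a hypothetical name, and it is not the body of `_get_fp8_mode_and_check_settings`, whose actual logic is not shown in this thread. It only encodes the rule described above, that FP8 should be rejected together with 4/8/16-bit loads.

```python
def check_fp8_exclusive(
    load_in_fp8,
    load_in_4bit: bool = False,
    load_in_8bit: bool = False,
    load_in_16bit: bool = False,
) -> None:
    """Hypothetical guard: raise if FP8 is combined with another load mode."""
    if not load_in_fp8:
        return  # Nothing to enforce when FP8 is disabled.

    # Collect the names of every conflicting flag that is switched on.
    conflicts = [
        name
        for name, enabled in (
            ("load_in_4bit", load_in_4bit),
            ("load_in_8bit", load_in_8bit),
            ("load_in_16bit", load_in_16bit),
        )
        if enabled
    ]
    if conflicts:
        raise ValueError(
            "load_in_fp8 cannot be combined with: " + ", ".join(conflicts)
        )
```

Called before the version-dependent branch, a check of this shape would fail fast for `load_in_fp8=True` with the default `load_in_4bit=True`, rather than letting both quantization paths reach vLLM.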


[pre-commit.ci] auto fixes from pre-commit.com hooks

4451af9

For more information, see https://pre-commit.ci

danielhanchen (Contributor) commented Dec 12, 2025

@andrewor14 Oh thanks - would this be backwards compatible?

andrewor14 (Contributor, Author) commented Dec 12, 2025

> Oh thanks - would this be backwards compatible?

Yeah, this only affects vllm >= 0.12.0. Behavior is the same as before for older versions. Just tested on 0.12.0 and 0.11.1
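The version gate described in this reply can be sketched as below. This is not the PR's actual check: `release_tuple` and `supports_on_the_fly_fp8` are hypothetical names, and the parsing is deliberately simple (it looks only at the leading numeric release segment, so a pre-release such as "0.12.0rc1" is treated like "0.12.0").

```python
import re

# Minimum vLLM release with on-the-fly FP8 loading, per the PR description.
MIN_ON_THE_FLY = (0, 12, 0)


def release_tuple(version: str) -> tuple:
    """Parse the leading 'X.Y' or 'X.Y.Z' of a version string into ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def supports_on_the_fly_fp8(vllm_version: str) -> bool:
    """True when the given vLLM version is at least 0.12.0."""
    return release_tuple(vllm_version) >= MIN_ON_THE_FLY
```

In practice the version string could come from `importlib.metadata.version("vllm")`, and `packaging.version.Version` would be a more robust comparison if that dependency is available. For the two versions mentioned as tested, this gate selects the new path for 0.12.0 and the old offline path for 0.11.1.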


FP8: Load model on-the-fly in vLLM #3717

andrewor14 commented Dec 11, 2025 (edited)
