Moved vllm fq export code to separate files #612
Conversation
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
kinjalpatel27 commented Nov 26, 2025
@coderabbitai review

✅ Actions performed: Review triggered.
Walkthrough

This pull request refactors vLLM fakequant export functionality by splitting the monolithic `vllm_fakequant.py` plugin into dedicated HuggingFace and Megatron modules.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
modelopt/torch/export/plugins/vllm_fq_hf.py (1)
28-43: Consider documenting the in-place model mutation.

The function modifies the model in place by removing quantizer attributes. Users may not expect this side effect. Consider adding a note in the docstring:

```diff
 """Exports the torch model weights and amax values separately.

 This function:
 1. Extracts amax values for calibration
 2. Deletes all quantizer parameters from state dict to store only weights in original dtype
 3. Saves the model weights

+Note:
+    This function modifies the model in place by removing quantizer attributes.
+    The model should not be used for inference after calling this function.
+
 Args:
     model: The quantized model to export
     export_dir: Directory to save the amax values
 """
```

tests/gpu/torch/export/test_vllm_fq_hf_export.py (1)
91-93: Minor: Consider a more explicit assertion message for the key comparison.

While `amax_dict.keys() == amax_state_dict.keys()` works correctly, the assertion message could be more specific about which keys are present vs. expected, to aid debugging. Consider this diff for clearer failure messages:

```diff
-assert amax_dict.keys() == amax_state_dict.keys(), (
-    "amax keys mismatch between before and after export"
-)
+assert set(amax_dict.keys()) == set(amax_state_dict.keys()), (
+    f"amax keys mismatch: exported has {len(amax_dict)} keys, "
+    f"expected {len(amax_state_dict)} keys. "
+    f"Missing: {set(amax_state_dict.keys()) - set(amax_dict.keys())}, "
+    f"Extra: {set(amax_dict.keys()) - set(amax_state_dict.keys())}"
+)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- examples/vllm_serve/README.md (1 hunk)
- modelopt/torch/export/__init__.py (1 hunk)
- modelopt/torch/export/plugins/__init__.py (1 hunk)
- modelopt/torch/export/plugins/vllm_fakequant.py (0 hunks)
- modelopt/torch/export/plugins/vllm_fq_hf.py (1 hunk)
- modelopt/torch/export/plugins/vllm_fq_megatron.py (1 hunk)
- modelopt/torch/export/unified_export_hf.py (1 hunk)
- modelopt/torch/export/unified_export_megatron.py (8 hunks)
- tests/gpu/torch/export/test_vllm_fq_hf_export.py (1 hunk)
- tests/gpu/torch/export/test_vllm_fq_megatron_export.py (2 hunks)
💤 Files with no reviewable changes (1)
- modelopt/torch/export/plugins/vllm_fakequant.py
🧰 Additional context used
🧬 Code graph analysis (5)
tests/gpu/torch/export/test_vllm_fq_hf_export.py (2)
- tests/_test_utils/torch/transformers_models.py (1): create_tiny_llama_dir (121-135)
- modelopt/torch/export/plugins/vllm_fq_hf.py (1): export_hf_vllm_fq_checkpoint (28-62)

tests/gpu/torch/export/test_vllm_fq_megatron_export.py (1)
- modelopt/torch/export/plugins/vllm_fq_megatron.py (1): export_mcore_gpt_to_hf_vllm_fq (84-112)

modelopt/torch/export/plugins/vllm_fq_hf.py (4)
- modelopt/torch/export/layer_utils.py (1): is_quantlinear (346-348)
- modelopt/torch/quantization/utils.py (1): get_quantizer_state_dict (492-502)
- modelopt/torch/export/plugins/vllm_fq_megatron.py (1): save_pretrained (71-78)
- modelopt/torch/export/unified_export_megatron.py (2): save_pretrained (263-464), state_dict (467-471)

modelopt/torch/export/unified_export_megatron.py (3)
- modelopt/torch/export/plugins/vllm_fq_megatron.py (1): _get_quantization_format (80-81)
- modelopt/torch/export/quant_utils.py (5): get_weight_block_size (412-429), get_weight_scaling_factor (251-293), get_weight_scaling_factor_2 (296-320), get_activation_scaling_factor (234-248), get_quantization_format (432-533)
- modelopt/torch/export/model_config.py (2): weight (145-150), bias (153-163)

modelopt/torch/export/plugins/vllm_fq_megatron.py (1)
- modelopt/torch/export/unified_export_megatron.py (4): GPTModelExporter (115-1210), state_dict (467-471), save_pretrained (263-464), _get_quantization_format (563-564)
🪛 GitHub Actions: Code Quality
modelopt/torch/export/plugins/vllm_fq_megatron.py
[error] 71-71: E501 Line too long (129 > 120). Ruff: Line too long in def signature.
[error] 75-75: E501 Line too long (141 > 120). Ruff: Line too long in multi-line string or statement.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
🔇 Additional comments (16)
modelopt/torch/export/unified_export_megatron.py (4)
507-564: Well-structured internal APIs for extension.

The new `_get_quantized_state` and `_get_quantization_format` methods provide clean extension points that allow subclasses (like `VllmFqGPTModelExporter` in `vllm_fq_megatron.py`) to override quantization behavior. The implementation correctly extracts weights, scales, and quantization format in a reusable manner.
285-286: LGTM - quantization format determination refactored correctly.

The quantization format is now obtained through the internal `_get_quantization_format` method, enabling subclasses to override this behavior for specialized export paths.
333-333: Condition simplified appropriately.

The condition now checks `quantization is not None` directly, which is cleaner than relying on a separate export flag. This aligns with the removal of the `export_vllm_fq_weights_qstate` parameter.
594-594: Consistent usage of internal method.

All mapping functions now use `self._get_quantized_state` instead of the removed external function, maintaining consistency throughout the class.

modelopt/torch/export/__init__.py (1)
22-22: Plugin exports now accessible at the package level.

This addition exposes the new vLLM fakequant export functions (`export_hf_vllm_fq_checkpoint`, `export_mcore_gpt_to_hf_vllm_fq`) at the `modelopt.torch.export` namespace level, providing a convenient import path for consumers.

modelopt/torch/export/plugins/__init__.py (1)
24-27: Appropriate import guards for new plugin modules.

The `vllm_fq_hf` module is imported directly since it only depends on torch, while `vllm_fq_megatron` is correctly guarded with `import_plugin` to handle its megatron dependencies gracefully. This follows the established pattern used for `megatron_importer`.

examples/vllm_serve/README.md (1)
60-63: Documentation updated to reflect new export APIs.

The README correctly directs users to the new dedicated export functions for HuggingFace and MCore models, aligning with the refactored code structure.
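The `import_plugin` guard praised in the `plugins/__init__.py` comment above can be illustrated with this standalone sketch. The context-manager shape is an assumption about the helper, not its actual implementation:

```python
from contextlib import contextmanager


@contextmanager
def import_plugin(plugin_name: str, verbose: bool = False):
    """Sketch of an optional-import guard: imports attempted inside the
    block are skipped silently when an optional dependency is missing,
    instead of crashing package import."""
    try:
        yield
    except ImportError:
        if verbose:
            print(f"Skipping plugin '{plugin_name}': dependency not installed")


# Guarded import: the plugin's symbols are only exported when the
# optional dependency (here, megatron) is actually installed.
with import_plugin("megatron"):
    import megatron  # noqa: F401  # absent in most environments; silently skipped
```

This keeps `import modelopt.torch.export` working for HuggingFace-only users who never install the Megatron stack.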
modelopt/torch/export/plugins/vllm_fq_hf.py (2)
47-51: Amax extraction logic is correct.

The dictionary comprehension properly extracts `_amax` values from the quantizer state dict, cloning and moving them to CPU for serialization.
60-62: Based on my verification, the `save_modelopt_state` parameter is fully supported and properly implemented:

Parameter Definition: The parameter is defined in the ModelOpt patch for HuggingFace models in `modelopt/torch/opt/plugins/huggingface.py` (line 99), where it's extracted via `kwargs.pop("save_modelopt_state", True)`.

Implementation: The patch correctly handles the parameter by:
- Extracting it from kwargs with a default value of `True`
- Passing the remaining kwargs to the original `save_pretrained` method
- Conditionally saving ModelOpt state based on the parameter value

Usage Pattern: The test file (`test_vllm_fq_hf_export.py`) confirms the expected usage:
- Models are quantized via `mtq.quantize()`, which applies the patches
- Then `export_hf_vllm_fq_checkpoint()` is called on the patched model
- This is consistent with other export functions in the codebase

Correctness: The use of `save_modelopt_state=False` on line 62 is appropriate because the quantizers have been removed from the model (lines 56-58), so there is no ModelOpt state to save.

The parameter is fully supported and the code is correct.
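The `kwargs.pop` patch pattern verified above can be sketched as a generic wrapper. The wrapper and the `save_state_fn` hook are hypothetical names for illustration; the real patch lives in `modelopt/torch/opt/plugins/huggingface.py`:

```python
import functools


def patch_save_pretrained(original_save_pretrained, save_state_fn):
    """Sketch: wrap save_pretrained so callers can pass an extra
    save_modelopt_state flag without the original method ever seeing it."""

    @functools.wraps(original_save_pretrained)
    def wrapper(self, save_directory, **kwargs):
        # Pop the custom flag before delegating; the default of True
        # matches the behavior described above.
        save_modelopt_state = kwargs.pop("save_modelopt_state", True)
        result = original_save_pretrained(self, save_directory, **kwargs)
        if save_modelopt_state:
            # Persist ModelOpt metadata alongside the weights.
            save_state_fn(self, save_directory)
        return result

    return wrapper
```

Popping the flag is the key design choice: the wrapped `save_pretrained` receives only the kwargs it understands, so the patch stays compatible with any transformers version.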
tests/gpu/torch/export/test_vllm_fq_hf_export.py (1)
26-93: LGTM! Comprehensive test coverage.

The test properly verifies:
- Model quantization with FP8
- Export to vLLM fakequant format
- File existence checks (quant_amax.pth should exist, hf_quant_config.json should not)
- Weight preservation across export/reload
- Amax value extraction and storage
tests/gpu/torch/export/test_vllm_fq_megatron_export.py (2)
25-25: LGTM! Import updated correctly.

The import now uses the new dedicated vLLM fakequant export function for Megatron.
93-98: LGTM! Export call updated correctly.

The test now calls `export_mcore_gpt_to_hf_vllm_fq` with the appropriate parameters, properly replacing the old conditional export path.

modelopt/torch/export/plugins/vllm_fq_megatron.py (3)
84-112: LGTM! Export function properly integrates with the exporter class.

The `export_mcore_gpt_to_hf_vllm_fq` function correctly instantiates `VllmFqGPTModelExporter` and delegates to its `save_pretrained` method. The parameter forwarding is complete and the docstring is clear.
29-67: Verify distributed gather correctness; key collision handling is unclear but likely safe in practice, and barrier placement is correct.

The code correctly uses `torch.distributed.barrier()` after gather operations to ensure synchronization. However, three points warrant clarification:

1. Key collision handling: The `dict.update()` silently overwrites duplicate keys. While amax keys likely include model parameter names (reducing collision risk across ranks in typical distributed setups), this could cause silent data loss if the model sharding strategy results in overlapping parameter ownership. Consider adding validation or merging logic that detects and handles collisions explicitly.
2. Barrier timing is correct: The barrier is placed after both the gather operation and the `torch.save()` call on rank 0, ensuring non-rank-0 processes wait for the file write to complete before proceeding.
3. File write completeness: The barrier provides the necessary synchronization for downstream operations that depend on the saved file, assuming the file system ensures durability by the barrier point.

Recommend adding either (a) an assertion or log statement to detect whether merged keys differ significantly from individual rank keys, or (b) explicit handling if overlapping amax entries should be averaged or combined rather than overwritten.

modelopt/torch/export/unified_export_hf.py (1)

582-582: The claim about `export_vllm_fq_weights_qstate` parameter removal is incorrect.

The parameter `export_vllm_fq_weights_qstate` never existed in the `export_hf_checkpoint` API. The actual architecture implements vLLM fakequant functionality through a separate dedicated function, `export_hf_vllm_fq_checkpoint` (in `modelopt/torch/export/plugins/vllm_fq_hf.py`), not as a parameter flag. Users requiring vLLM fakequant export call `export_hf_vllm_fq_checkpoint()` directly rather than passing a parameter to `export_hf_checkpoint()`. This is not a breaking API change.

Likely an incorrect or invalid review comment.
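The collision check recommended for the distributed gather could be sketched like this. The merge itself needs no distributed context, so this pure-Python sketch assumes the per-rank amax dicts have already been gathered (e.g. via `torch.distributed.gather_object`); the function name is hypothetical:

```python
def merge_rank_amax_dicts(per_rank_dicts):
    """Merge amax dicts gathered from all ranks, raising on key
    collisions instead of silently overwriting via dict.update().

    Assumes each rank owns a disjoint set of parameters; any overlap
    signals a sharding bug and is surfaced loudly.
    """
    merged = {}
    for rank, rank_dict in enumerate(per_rank_dicts):
        overlap = merged.keys() & rank_dict.keys()
        if overlap:
            raise ValueError(
                f"rank {rank} re-supplied amax keys already gathered: {sorted(overlap)}"
            )
        merged.update(rank_dict)
    return merged
```

Raising (rather than averaging) is the conservative choice when overlapping ownership is unexpected; if overlaps were legitimate, the branch could instead combine entries explicitly.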
codecov bot commented Nov 26, 2025 • edited
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #612    +/-   ##
==========================================
- Coverage   74.80%   74.64%   -0.17%
==========================================
  Files         183      183
  Lines       18626    18542      -84
==========================================
- Hits        13933    13840      -93
- Misses       4693     4702       +9
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
263b2b7 into main
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
What does this PR do?
Type of change: Bug fix
Overview:
Moved vLLM fakequant checkpoint export code to separate files: `modelopt/torch/export/plugins/vllm_fq_hf.py` (HuggingFace) and `modelopt/torch/export/plugins/vllm_fq_megatron.py` (Megatron Core).
Usage
Refer to README.md
Testing
tests/gpu/torch/export/test_vllm_fq_hf_export.py and tests/gpu/torch/export/test_vllm_fq_megatron_export.py

Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Refactor