GPTQ Lite implementation #555
base: main
Conversation
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
codecov bot commented Nov 13, 2025 (edited)

Codecov Report ❌ Patch coverage is
Coverage Diff (main vs. #555):
- Coverage: 74.36% → 73.92% (-0.45%)
- Files: 182 (unchanged)
- Lines: 18216 → 18395 (+179)
- Hits: 13547 → 13598 (+51)
- Misses: 4669 → 4797 (+128)
View full report in Codecov by Sentry.
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
    gt=0.0,
    le=1.0,
    title="Percentage damping factor.",
    description="The percentage of average Hessian diagonal used for damping.",
If you have a reference from the original paper about what these are, could you also share the link?
batch_size = input.shape[0]
# Incremental averaging: scale down old hessian
hessian *= n_samples / (n_samples + batch_size)
What's the dtype of hessian? Do we need to upcast to fp32 for this division?
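For illustration, a minimal sketch of the incremental update above with an explicit fp32 upcast. The function name and rescaling follow the hunk; the 2·XᵀX accumulation is the standard GPTQ Hessian and is an assumption, not necessarily the PR's exact code.

```python
import torch

def update_hessian(inp: torch.Tensor, hessian: torch.Tensor, n_samples: int):
    batch_size = inp.shape[0]
    # Flatten to (tokens, in_features); accumulate in fp32 so repeated
    # rescaling and addition do not lose precision in fp16/bf16.
    x = inp.reshape(-1, inp.shape[-1]).to(torch.float32)
    hessian = hessian.to(torch.float32)
    hessian *= n_samples / (n_samples + batch_size)              # scale down old estimate
    hessian += (2.0 / (n_samples + batch_size)) * (x.t() @ x)    # add new batch's contribution
    return hessian, n_samples + batch_size
```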
hessian, n_samples = update_hessian(input[0], state["hessian"], state["n_samples"])
hessian_state[module.name] = {"hessian": hessian, "n_samples": n_samples}
torch.cuda.empty_cache()
gc.collect()
Do we have to do gc.collect() here? It's going to be very slow.
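One possible restructuring, as a sketch with hypothetical helper names: keep the cheap per-module torch.cuda.empty_cache() but run the expensive Python GC once per calibration pass instead of once per hook call.

```python
import gc
import torch

def on_module_calibrated(module, hessian, n_samples, hessian_state):
    hessian_state[module.name] = {"hessian": hessian, "n_samples": n_samples}
    torch.cuda.empty_cache()   # releases cached GPU blocks; relatively cheap

def on_calibration_pass_done():
    gc.collect()               # a full GC walk is slow; do it once, not per module
```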
# Phase 1: Collect statistics for quantizers
enable_stats_collection(model)
max_calibrate(model, forward_loop)
Do you need forward_loop here? Is this for weight amax calibration only?
state = hessian_state[module.name]
hessian = state["hessian"].to(module.weight.device)
blockwise_weight_update(module, hessian, block_size, percdamp)
torch.cuda.empty_cache()
Maybe you can del the hessian after applying blockwise_weight_update?
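A sketch of this suggestion, reusing the names from the hunk above:

```python
state = hessian_state[module.name]
hessian = state["hessian"].to(module.weight.device)
blockwise_weight_update(module, hessian, block_size, percdamp)
# Drop the device copy (and the cached CPU state, if it is no longer needed)
# before empty_cache() so the large Hessian tensor can actually be released.
del hessian
del hessian_state[module.name]
torch.cuda.empty_cache()
```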
hessian_state_path: str | None = ModeloptField(
    default=None,
    title="Path to the Hessian state file.",
    description="The path to the Hessian state file.",
Maybe state that if the path exists, we load the Hessians from the path instead of re-computing them.
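A sketch of the suggested behavior; `collect_hessian_state` below is a hypothetical helper standing in for the PR's Hessian collection pass.

```python
import os
import torch

def get_hessian_state(model, forward_loop, hessian_state_path: str | None = None):
    if hessian_state_path is not None and os.path.exists(hessian_state_path):
        return torch.load(hessian_state_path)                    # reuse saved Hessians
    hessian_state = collect_hessian_state(model, forward_loop)   # hypothetical
    if hessian_state_path is not None:
        torch.save(hessian_state, hessian_state_path)            # cache for the next run
    return hessian_state
```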
GPTQ lite does not perform sequential quantization of layers. This means that the updated
activations are not used to process the next layer.
Can you estimate how much effort would be needed to add this constraint? I am wondering if we can have a quick test to see what the accuracy impact is.
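For scoping the effort, a rough sketch of what sequential (layer-by-layer) processing would look like; `blocks`, `collect_hessians_for_block`, and `quantized_linears` are hypothetical helpers. The point is that each block's Hessians come from activations produced by already-updated blocks, so quantization error propagates the way it does at inference time.

```python
for block in blocks:                                    # e.g. decoder layers, in order
    hessians = collect_hessians_for_block(block, cached_inputs)
    for module in quantized_linears(block):
        blockwise_weight_update(module, hessians[module.name], block_size, percdamp)
    # Re-run the block with updated weights to produce inputs for the next block
    # (real transformer blocks may return tuples; this is simplified).
    cached_inputs = [block(x) for x in cached_inputs]
```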
block_size: int | None = ModeloptField(
    default=128,
    title="Block size for GPTQ weight update.",
    description="The block size for GPTQ weight update.",
)
This should be a multiple of the block_size used in quantization. We should explain that in the description as well.
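A possible guard, as a sketch: keep each GPTQ update block aligned with the weight quantizer's scale groups. `quant_block_size` is an illustrative value, not taken from the PR.

```python
quant_block_size = 16  # e.g. per-16-element block scales in the weight quantizer
assert block_size % quant_block_size == 0, (
    f"GPTQ block_size ({block_size}) must be a multiple of the quantization "
    f"block size ({quant_block_size})"
)
```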
    gt=0.0,
    le=1.0,
    title="Percentage damping factor.",
    description="The percentage of average Hessian diagonal used for damping.",
Could you also add some instructions here so users know what the impact of increasing/decreasing this parameter is?
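As background for such instructions, a sketch of how a percentage damping factor is typically applied in GPTQ, consistent with the field description above but not necessarily the PR's exact code. A larger percdamp makes the Hessian inverse more stable but the update less faithful to the measured curvature; a smaller percdamp tracks the data more closely but risks an ill-conditioned matrix.

```python
import torch

def dampen_hessian(hessian: torch.Tensor, percdamp: float) -> torch.Tensor:
    # Add a fraction of the mean Hessian diagonal to the diagonal before inversion.
    damp = percdamp * torch.mean(torch.diag(hessian))
    idx = torch.arange(hessian.shape[0], device=hessian.device)
    hessian[idx, idx] += damp
    return hessian
```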
tensor_mapping = {}
for name, module in model.named_modules():
    if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
        in_features = module.weight.shape[1]
Can we use module.weight.shape[-1] instead, in case of a 3D weight?
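A minimal sketch of the suggested change, assuming a possible 3D (e.g. experts, out_features, in_features) weight layout:

```python
# The last dimension is in_features for both 2D (out, in) and 3D (experts, out, in)
# weights; module.weight.shape[1] would pick out_features in the 3D case.
in_features = module.weight.shape[-1]
```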
for name, module in model.named_modules():
    if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
        module.input_quantizer.reset_amax()
        module.output_quantizer.reset_amax()
Do you know how much accuracy is impacted if we don't recalibrate the input quantizer?
What does this PR do?
Type of change: New feature
Overview: Adds support for the GPTQ algorithm. This PR implements a modified version of the official GPTQ algorithm ("GPTQ lite"); a key difference is that layers are not quantized sequentially, so updated activations are not used when processing the next layer.
Usage
Modify the "algorithm" field in quant_cfg to "gptq_lite".
Note: Does not currently work with AWQ
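An illustrative usage sketch; the base config and the calibration loader are placeholders, and only the "algorithm" override is what this PR adds.

```python
import copy
import modelopt.torch.quantization as mtq

quant_cfg = copy.deepcopy(mtq.INT4_BLOCKWISE_WEIGHT_ONLY_CFG)  # any non-AWQ weight quant config
quant_cfg["algorithm"] = "gptq_lite"

def forward_loop(model):
    for batch in calib_dataloader:   # your calibration data
        model(batch)

model = mtq.quantize(model, quant_cfg, forward_loop)
```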
Testing
Before your PR is "Ready for review"
Additional Information