Add configs to run int4 inference #37
base: main
Conversation
stas00 left a comment • edited
Amazing work adding int4 support, Reza!
  mp_size=world_size,
  base_dir=repo_root,
  dtype=getattr(torch, infer_dtype),
+ quantization_bits=8 if args.dtype == 'int8' else 4,
what happens with --dtype float16?
probably best to set this in kwargs only if a quantization dtype is provided
The quantization bits should not be used when running in half-precision. But I agree we can set it in the kwargs only for quantized inference mode.
these demos are already used by many users, so let's make them nice and clean configuration-wise, so it's clear to the reader which bits should be enabled and when.
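For illustration, a minimal sketch of that suggestion, assuming the surrounding script's variables (`args`, `world_size`, `repo_root`, `infer_dtype`, `model`) and the kwarg names from the diff above; exact argument names may differ between DeepSpeed versions:

```python
import torch
import deepspeed

# Only pass quantization_bits to init_inference() when a quantized dtype was
# actually requested; fp16/bf16 runs never see the kwarg.
kwargs = dict(
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
)
if args.dtype in ("int8", "int4"):
    # 8 bits for int8, 4 bits for int4
    kwargs["quantization_bits"] = 8 if args.dtype == "int8" else 4

model = deepspeed.init_inference(model, **kwargs)
```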
  input_sentences *= math.ceil(args.batch_size / len(input_sentences))
- generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False)
+ generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True)
this is already a very different type of change.
If int4 requires do_sample=True, then again, let's change it only if it's --dtype int4
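A hypothetical sketch of that suggestion (illustrative only; as discussed further down, sampling turned out not to be needed after all):

```python
# Keep greedy decoding by default and only enable sampling for the int4 path,
# if int4 really does require it.
generate_kwargs = dict(
    max_new_tokens=num_tokens,
    do_sample=(args.dtype == "int4"),
)
```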
Sure, I will double-check do_sample=False again to see if the generated text makes sense. If not, I will set it to True for int4.
I just checked with do_sample=False and I see the text is produced in the same way as for FP16 and INT8:

in = DeepSpeed is a machine learning framework
out = DeepSpeed is a machine learning framework for deep learning. It is a Python library, and it is also a framework. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and

So, I am gonna turn it off for now.
stas00 commented Nov 18, 2022 • edited
Also should probably assert … I can take care of that.
RezaYazdaniAminabadi commented Nov 18, 2022
Sounds good to me. Thanks @stas00
  model_name = args.name
- infer_dtype = args.dtype
+ infer_dtype = args.dtype if args.dtype != 'int4' else 'int8'
would it make for a more user-friendly API to
- keep the dtype intact
- drop quantization_bits
- let deepspeed.init_inference derive the number of bits from dtype?

not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information in dtype and quantization_bits twice
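A hypothetical sketch of deriving the bit-width from the dtype string; the mapping and helper below are illustrative, not an existing DeepSpeed API, and as noted in the next comment torch itself has no int4 dtype, so the low-bit cases would have to stay as plain strings:

```python
# Map the CLI dtype string to a quantization bit-width; None means no quantization.
DTYPE_TO_BITS = {"int8": 8, "int4": 4}

def bits_from_dtype(dtype_str: str):
    """Return the quantization bit-width for a dtype string, or None for fp16/bf16."""
    return DTYPE_TO_BITS.get(dtype_str)

assert bits_from_dtype("int4") == 4
assert bits_from_dtype("float16") is None
```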
oh, wait, torch.int4 still doesn't exist, does it?
let's find the feature request.
still not implemented: pytorch/pytorch#74627
so that's why you had to do the odd workarounds, right?
I guess we can drop it once it's implemented, @stas00?
For now, this might be the best way to do it.
it's pointless to wait, since they won't have int3 and int12
> would it make for a more user-friendly API to
> - keep the dtype intact
> - drop quantization_bits
> - let deepspeed.init_inference derive the number of bits from dtype?
>
> not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information in dtype and quantization_bits twice
@stas00 and @RezaYazdaniAminabadi - just clarifying that we have introduced a new DeepSpeedInferenceConfig that can be passed to init_inference. We are keeping it backwards compatible, but if we are okay to make changes to this file, I would advocate for writing a config dictionary for DeepSpeed and passing that to init_inference instead of the various kwargs. Please see here for an example: https://gist.github.com/awan-10/6e3d5c756be3a876522e860c6bbf702d#file-bloom-ds-inference-py-L173
Also, see the docs for the new config: https://deepspeed.readthedocs.io/en/latest/inference-init.html
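A minimal sketch of the config-dict style described above, assuming the surrounding script's `model` and `world_size`; the key names follow the linked docs but may differ between DeepSpeed versions, so treat this as an illustration rather than a drop-in snippet:

```python
import torch
import deepspeed

# Build a single config dictionary instead of scattering kwargs.
ds_config = {
    "dtype": torch.float16,                      # or torch.int8 for quantized inference
    "tensor_parallel": {"tp_size": world_size},  # tensor-parallel degree
    "replace_with_kernel_inject": True,          # use the fused inference kernels
}

model = deepspeed.init_inference(model, config=ds_config)
```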
That definitely works.
@awan-10, may I suggest you make the inference config accept dict_or_path just like zero does? it might be easier for some users to write out a separate file.
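A small sketch of what a dict_or_path-style argument could look like; the helper below is hypothetical (this is a suggested feature, not an existing DeepSpeed API):

```python
import json

def load_inference_config(config_dict_or_path):
    """Hypothetical helper: accept either an in-memory dict or a path to a JSON
    file, mirroring how ZeRO configs can be supplied."""
    if isinstance(config_dict_or_path, dict):
        return config_dict_or_path
    with open(config_dict_or_path) as f:
        return json.load(f)
```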
@stas00 - thanks for the suggestion. Created an issue so we can track it: deepspeedai/DeepSpeed#2532. Mike and I will work on it.
Thank you very much, @awan-10
stas00 commented Nov 19, 2022 • edited
OK, I think I understand the limitations of pytorch and it'll only get worse when you try … I propose we break the currently proposed API and draw a better one. I propose to have only 2 user-configurable args related to how deepspeed-inference operates …

Now the API is simple, unambiguous and future proof (as in …). For back-compat … What do you think, Reza?
mayank31398 commented Nov 19, 2022
Huh?
RezaYazdaniAminabadi commented Nov 19, 2022
Hi @stas00,
RezaYazdaniAminabadi commented Nov 19, 2022
In that case, we …
In this case, we can simply pass the bits to the DeepSpeed-inference config:
stas00 commented Nov 19, 2022
may I suggest that the just added … why not have a flat structure of simple key=value pairs, and once you have the info on your side you can re-arrange it to any nesting level you want.
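An illustrative sketch of the flat key=value idea; every key name below is made up for the example, and the receiving side would be free to re-nest the values however it likes:

```python
# The caller supplies a flat set of simple key=value pairs...
flat = {"dtype": "int8", "quant_bits": 8, "tp_size": 8}

# ...and the library re-arranges them internally into whatever nesting it needs.
nested = {
    "dtype": flat["dtype"],
    "quant": {"bits": flat["quant_bits"]},
    "tensor_parallel": {"tp_size": flat["tp_size"]},
}
```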
RezaYazdaniAminabadi commented Nov 19, 2022
I agree, let me work on that and fix it.
awan-10 commented Nov 19, 2022 • edited by stas00
@RezaYazdaniAminabadi -- please see my comment above: #37 (comment)
RezaYazdaniAminabadi commented Nov 19, 2022
thanks @awan-10. Please go ahead and push your changes.
Update bloom-ds-inference.py
Add some minor config changes to support int4 inference through DeepSpeed-Inference.
The Int4 support will be added to DeepSpeed through this PR.
cc: @stas00