This repository was archived by the owner on Oct 9, 2024. It is now read-only.

Add configs to run int4 inference #37

Open

RezaYazdaniAminabadi wants to merge 5 commits into huggingface:main from RezaYazdaniAminabadi:int4-bloom

Conversation

@RezaYazdaniAminabadi (Author):

Add some minor config changes to support int4 inference through DeepSpeed-Inference.

The int4 support will be added to DeepSpeed through this PR.

cc: @stas00

@stas00 (Contributor) left a comment (edited):

Amazing work adding int4 support, Reza!

mp_size=world_size,
base_dir=repo_root,
dtype=getattr(torch, infer_dtype),
quantization_bits=8 if args.dtype == 'int8' else 4,
Contributor:

what happens with --dtype float16?

probably best to set this in kwargs only if a quantization dtype is provided

Reply:

The quantization bits should not be used when running in half-precision. But I agree we can do it in the kwargs, and only for quantized inference mode.
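A minimal sketch of that suggestion, reusing the kwargs from the diff above (world_size, repo_root, infer_dtype, args and model are defined by the surrounding script; this is not the PR's final code):

```python
import torch
import deepspeed

# Sketch: only pass the quantization bits when a quantized dtype was requested.
kwargs = dict(
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
)
if args.dtype in ('int8', 'int4'):
    kwargs['quantization_bits'] = 8 if args.dtype == 'int8' else 4
model = deepspeed.init_inference(model, **kwargs)
```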

@stas00 (Contributor) — Nov 18, 2022 (edited):

these demos are already used by many users, so let's make them nice and clean configuration-wise, so it's clear to the reader which bits should be enabled and when.

  input_sentences *= math.ceil(args.batch_size / len(input_sentences))

- generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False)
+ generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True)
Contributor:

this is already a very different type of change.

If int4 requires do_sample=True, then again, let's change it only if it's --dtype int4
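A one-line sketch of that suggestion (hypothetical; the follow-up below shows it turned out to be unnecessary):

```python
# Hypothetical: sample only when int4 is requested, stay greedy otherwise.
generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=(args.dtype == 'int4'))
```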

Reply:

Sure, I will double-check do_sample=False again to see if the generated text makes sense. If not, I will set it to true for int4.


I just checked with do_sample=False and I see the text is produced in the same way as for FP16 and INT8:

in  = "DeepSpeed is a machine learning framework"
out = "DeepSpeed is a machine learning framework for deep learning. It is a Python library, and it is also a framework. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and"

So, I am gonna turn it off for now.

@stas00 (Contributor) commented — Nov 18, 2022 (edited):

Also, we should probably assert if int4 is attempted to be used without deepspeed>=xyz, once the DS PR is merged... We could tentatively set it to the next deepspeed version, perhaps with an XXX note, so that the script can still be used against ds@master.

I can take care of that.
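A sketch of such a guard, with the minimum version left as a placeholder since the DS PR had not yet landed in a release (args comes from the surrounding script):

```python
# Hypothetical version guard; MIN_DS_VERSION is a placeholder to be filled
# in once the DeepSpeed int4 PR ships in a release.
from packaging import version
import deepspeed

MIN_DS_VERSION = "0.0.0"  # XXX: set to the first release with int4 support

if args.dtype == 'int4' and version.parse(deepspeed.__version__) < version.parse(MIN_DS_VERSION):
    raise ValueError(f"--dtype int4 requires deepspeed>={MIN_DS_VERSION} (or ds@master)")
```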


@RezaYazdaniAminabadi (Author):

> Also, we should probably assert if int4 is attempted to be used without deepspeed>=xyz, once the DS PR is merged... We could tentatively set it to the next deepspeed version, perhaps with an XXX note, so that the script can still be used against ds@master.
>
> I can take care of that.

Sounds good to me. Thanks @stas00


  model_name = args.name
- infer_dtype = args.dtype
+ infer_dtype = args.dtype if args.dtype != 'int4' else 'int8'
@stas00 (Contributor) — Nov 18, 2022 (edited):

would it make for a more user-friendly API to

  1. keep the dtype intact
  2. drop quantization_bits
  3. let deepspeed.init_inference derive the number of bits from dtype?

not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information in both dtype and quantization_bits

Contributor:

oh, wait, torch.int4 still doesn't exist, does it?

let's find the feature request.

Contributor:

still not implemented: pytorch/pytorch#74627

so that's why you had to do the odd workarounds, right?

Collaborator:

I guess we can drop it once it's implemented, @stas00?
For now, this might be the best way to do it.

Contributor:

it's pointless to wait, since they won't have int3 and int12

@awan-10:

> would it make for a more user-friendly API to
>
>   1. keep the dtype intact
>   2. drop quantization_bits
>   3. let deepspeed.init_inference derive the number of bits from dtype?
>
> not only is the currently suggested override confusing, I fail to see what purpose is served by carrying the same information in both dtype and quantization_bits

@stas00 and @RezaYazdaniAminabadi - just clarifying that we have introduced a new DeepSpeedInferenceConfig that can be passed to init_inference. We are keeping it backwards compatible, but if we are okay making changes to this file, I would advocate writing a config dictionary for DeepSpeed and passing that to init_inference instead of the various kwargs. Please see here for an example: https://gist.github.com/awan-10/6e3d5c756be3a876522e860c6bbf702d#file-bloom-ds-inference-py-L173

Also, see the docs for the new config: https://deepspeed.readthedocs.io/en/latest/inference-init.html
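A rough sketch of that style, with illustrative field names (the linked gist and the inference-init docs are the authoritative references, not this snippet):

```python
import torch
import deepspeed

# Illustrative config dict passed to init_inference instead of loose kwargs;
# exact field names and schema are per the DeepSpeedInferenceConfig docs.
ds_config = {
    "tensor_parallel": {"tp_size": world_size},
    "dtype": torch.float16,
    "quant": {"enabled": True},
}
model = deepspeed.init_inference(model, config=ds_config)
```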

@stas00 (Contributor):

That definitely works.

@awan-10, may I suggest you make the inference config accept dict_or_path just like zero does? It might be easier for some users to write out a separate file.
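The idea in code might look something like this hypothetical helper, mirroring how ZeRO configs can be given either inline or as a path to a JSON file:

```python
import json

def _resolve_inference_config(dict_or_path):
    # Hypothetical: accept an in-memory dict or a path to a JSON config file.
    if isinstance(dict_or_path, dict):
        return dict_or_path
    with open(dict_or_path) as f:
        return json.load(f)
```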

@awan-10:

@stas00 - thanks for the suggestion. Created an issue so we can track it: deepspeedai/DeepSpeed#2532. Mike and I will work on it.

Contributor:

Thank you very much, @awan-10

@stas00 (Contributor) commented — Nov 19, 2022 (edited):

OK, I think I understand the limitations of pytorch, and it'll only get worse when you try int3, etc., even if int4 is supported.
https://github.com/huggingface/transformers-bloom-inference/pull/37/files#r1026981222

I propose we break the currently proposed API and draw up a better one.

I propose having only 2 user-configurable args related to how deepspeed-inference operates:

  1. dtype is the dtype of the original model - so only fp32, fp16 or bf16, never intX (i.e. we drop int8)
  2. quantization_bits: [None, 12, 8, 4, 3]

Now the API is simple, unambiguous and future-proof (as in int12 or int3, or Mixture of Precisions support).

For back-compat, deepspeed.init_inference can simply set quantization_bits=8 if dtype==torch.int8 is passed. So the API will be unbroken.

What do you think, Reza?
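Concretely, the proposed call might look like this (a sketch of the proposal above, not an existing DeepSpeed signature; model comes from the surrounding script):

```python
import torch
import deepspeed

# dtype describes the original model only; quantization is a separate knob.
model = deepspeed.init_inference(
    model,
    dtype=torch.float16,   # fp32/fp16/bf16 only, never intX
    quantization_bits=4,   # None, 12, 8, 4 or 3
)
```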

@mayank31398 (Collaborator):

Huh? Int4?
I will surely test this branch and let you know.
Thanks a lot for this :)

@RezaYazdaniAminabadi (Author):

> ...is simple, unambiguous and future-proof...

Hi @stas00,
I agree with what you said, and we are going down the same route, as you can see from my last commit here.
Thanks for the good suggestion :)
Best,
Reza


@RezaYazdaniAminabadi (Author):

> > ...is simple, unambiguous and future-proof...
>
> Hi @stas00, I agree with what you said, and we are going down the same route, as you can see from my last commit here. Thanks for the good suggestion :) Best, Reza

In this case, we can simply pass the bits to the DeepSpeed-inference config: kwargs['quant']['weight']['num_bits'] = quantization_bits

@stas00 (Contributor):

may I suggest that the just added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?

why not have a flat structure of simple key=value pairs, and once you get the info on your side you can re-arrange it to any nesting level you want.
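For illustration, a hypothetical library-side helper that re-nests a flat kwarg into the structure mentioned above:

```python
def _nest_quant_kwargs(quantization_bits):
    # Hypothetical: rebuild the nested config shape from a flat key=value pair.
    return {'quant': {'weight': {'num_bits': quantization_bits}}}
```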

@RezaYazdaniAminabadi (Author):

> may I suggest that the just added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?
>
> why not have a flat structure of simple key=value pairs, and once you get the info on your side you can re-arrange it to any nesting level you want.

I agree, let me work on that and fix it.


@awan-10 commented — Nov 19, 2022 (edited by stas00):

> > may I suggest that the just added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?
> > why not have a flat structure of simple key=value pairs, and once you get the info on your side you can re-arrange it to any nesting level you want.
>
> I agree, let me work on that and fix it.

@RezaYazdaniAminabadi -- please see my comment above: #37 (comment)

@RezaYazdaniAminabadi (Author):

> > may I suggest that the just added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?
> > why not have a flat structure of simple key=value pairs, and once you get the info on your side you can re-arrange it to any nesting level you want.
> >
> > I agree, let me work on that and fix it.
>
> @RezaYazdaniAminabadi -- please see my comment above: #37 (comment)

thanks @awan-10. Please go ahead and push your changes.

mayank31398 force-pushed the main branch 10 times, most recently from abfc97f to 9d48dbf on January 24, 2023 11:37
mayank31398 force-pushed the main branch 2 times, most recently from 655179f to 114b912 on March 3, 2023 16:33

Reviewers

@awan-10 left review comments
@stas00 requested changes
@mayank31398 left review comments

Assignees: no one assigned
Labels: none yet
Projects: none yet
Milestone: no milestone

4 participants: @RezaYazdaniAminabadi, @stas00, @mayank31398, @awan-10
