Various script cleanups/fixes + convert merges and special token handling #2842


Merged

Conversation

@KerfuffleV2 (Contributor) commented Aug 27, 2023 (edited by ggerganov)

Changes

  • Fix incorrect number of args to permute methods in convert.py.
  • Misc fixes/cleanups for various scripts.
  • Fix/add type hints where possible.
  • Set the gguf Python package as having type info.
  • Handle merges and special tokens in convert.py.
  • --vocab-only in convert.py can work without the actual model tensors being available or in the correct format (i.e. git LFS links). (But guessed params can no longer be used in vocab-only mode.)
  • gguf.py tensor name mapping refactored. This also allows simpler tensor skipping.
  • The falcon converter accepts --vocab-only.

(Not a 100% complete list of changes.)

Status

Completed (unless someone speaks up).

ref: #2820

@KerfuffleV2 added the script (Script related) label Aug 27, 2023
@KerfuffleV2 changed the title from "Feat scripts improvements" to "Various script cleanups/fixes + convert merge and special token handling" Aug 27, 2023
@KerfuffleV2 changed the title from "Various script cleanups/fixes + convert merge and special token handling" to "Various script cleanups/fixes + convert merges and special token handling" Aug 27, 2023
@KerfuffleV2 (author) commented Aug 27, 2023 (edited):

@ggerganov Anything you hate about these changes so far?

Also: https://github.com/ggerganov/llama.cpp/blob/9cc112929c5a798c93d6a34abe53b751a478aa52/gguf-py/gguf/gguf.py#L165-L169

I don't fully understand what this means, but if you explain a bit more I can probably do it. Do you just mean you want get_tensor_name_map to return Dict[str, Tuple[str, MODEL_TENSOR]] instead of Dict[str, str]?

Think I figured it out and this approach is an improvement. Reduces duplicated code a fair bit.
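
A minimal sketch of that return-shape change (the enum stub and mapping entries below are illustrative, not the actual tables from gguf.py):

```python
from enum import IntEnum, auto
from typing import Dict, Tuple

class MODEL_TENSOR(IntEnum):  # stand-in for the enum in gguf.py
    TOKEN_EMBD = auto()
    OUTPUT = auto()

# Each source tensor name maps to both the GGUF name and the tensor kind,
# so callers no longer need a second lookup table.
def get_tensor_name_map() -> Dict[str, Tuple[str, MODEL_TENSOR]]:
    return {
        "transformer.wte": ("token_embd", MODEL_TENSOR.TOKEN_EMBD),
        "lm_head": ("output", MODEL_TENSOR.OUTPUT),
    }
```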

@KerfuffleV2 marked this pull request as ready for review August 27, 2023 23:22
@KerfuffleV2 (author):

These changes shut up the type warnings but I'm not 100% sure they're correct. The alternative would be to leave it List[str] and then convert the token text to a string. I assume it's already UTF-8 bytes so there probably isn't a functional difference.

Collaborator:

The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.

@KerfuffleV2 (author):

The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.

Right. I meant my change fixes the type warning, but the part I wasn't sure about was whether text actually is supposed to be a bytearray there, and what is supposed to get submitted to gguf to write out the tokens. The type annotation for the add_token_list method is also just List and doesn't specify what the element is supposed to be.

Contributor:

The decision was made to accept several types as input to the token list, depending on the type of the tokenizer output (spm vs bpe):
https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/gguf-py/gguf/gguf.py#L373-L375

@KerfuffleV2 (author) commented Aug 28, 2023 (edited):

The decision was made to accept several types as input to the token list

So we'd want

```python
def add_token_list(self, tokens: Union[List[str], List[bytes], List[bytearray]]):
```

Correct? Or would you want to allow the items to be non-homogeneous, like List[Union[str, bytes, bytearray]]?

Contributor:

What are the differences between the Python types str, bytes and bytearray? If they all resolve to the same thing when written to disk, then any of them could be used.

Contributor:

Make it as simple as it can be, as long as there is no difference in how the tokens are written to disk.

Collaborator:

str is unicode text which will be encoded as UTF-8 before being written to disk. bytes and bytearray are binary data. Those two are subclasses of ByteString, but we can't use that because it also allows memoryview. The least repetitive way to write this would be to use a TypeVar:

```python
StrOrBytes = TypeVar('StrOrBytes', str, bytes, bytearray)
# ...
def add_token_list(self, tokens: Sequence[StrOrBytes]):
    # ...
def add_token_merges(self, merges: Sequence[StrOrBytes]):
    # ...
```

@KerfuffleV2 (author):

Should merges actually support all of those?

@KerfuffleV2 (author):

If there's no tokenizer.json, we just generate a model with no vocabulary, without a warning or anything? Is that behavior deliberate?

It's the same for most of the conversion scripts. If it's not actually supposed to work that way (and I don't understand why it would) I'd change that to bail out immediately if it doesn't exist.
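
A minimal sketch of that bail-out check (the dir_model variable and message are assumptions for illustration):

```python
import sys
from pathlib import Path

dir_model = Path(sys.argv[1])
tokenizer_json = dir_model / 'tokenizer.json'
if not tokenizer_json.is_file():
    # Fail fast instead of silently writing a model with no vocabulary.
    print(f'Error: missing {tokenizer_json}', file=sys.stderr)
    sys.exit(1)
```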

@ggerganov (Member):

Ok, let's change. In general, the idea of these example convert scripts is to be as simple as possible and to work mostly in the "happy scenario" where all your data is correctly placed. But some basic error handling would be useful.

Btw, for the falcon script, it would be useful to have a --vocab-only option since we can add extra tokenizer tests if it is available. Can be done in a separate PR though.

@KerfuffleV2 (author):

In general, the idea of these example convert scripts is to be as simple as possible and to work mostly in the "happy scenario" where all your data is correctly placed.

Is there any plan to roll the functionality into the normal convert script? I was kind of thinking that the more common stuff I can refactor, the easier doing something like that would be.

Ideally convert would also get refactored into something more modular than a gigantic monolithic script. (Probably not something I'd include in this pull since it's already pretty large and complicated.)

@ggerganov (Member) left a review:

Nice work, thank you!

@klosax (Contributor):

I guess this could be used in the simpler conversion scripts (with the import moved to the top?):
https://github.com/ggerganov/llama.cpp/blob/f55538c3ccba9b926846ef862fa830cea08c433e/convert.py#L338-L339
instead of
https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/convert-falcon-hf-to-gguf.py#L16-L36


@klosax (Contributor):

tokens are used by both the spm and bpe tokenizers.

toktypes are used only by the spm tokenizer.
scores are used only by the spm tokenizer.

merges are used only by the bpe tokenizer.

Should we have one class for extracting the spm parts and another for the bpe parts?

The special token mapping (bos, eos, ...) is used by both spm and bpe.
The class SpecialVocab should not include the bpe merges, only the special token mapping?

@KerfuffleV2 (author):

The class SpecialVocab should not include the bpe merges, only the special token mapping?

How about passing it a flag or something to tell it to leave that part out? Maybe it should also be renamed to something more general than SpecialVocab (any ideas?).

I was kind of thinking of making it a general class for the non-main vocab stuff. I also wanted to make it so you could access the loaded JSON stuff like tokenizer_config.json, because there are some places where that gets loaded twice.

@KerfuffleV2 (author):

I've implemented some of the suggested changes. I was pretty aggressive about resolving conversations just to try to keep track of the stuff I dealt with. If anyone doesn't like my approach to implementing the suggestions, please don't hesitate to re-open those resolved conversations.


@klosax (Contributor):

I was kind of thinking of making it a general class for the non-main vocab stuff.

The merges are as important for the function of the bpe tokenizer as the scores are for the spm tokenizer: about 33% higher perplexity without them in both tokenizers.

IMO the script should determine the tokenizer type and call the tokenizer-specific class that will extract only what is needed for that tokenizer to function properly. The special token mapping could be separate.

@cebtenzzre (Collaborator):

With --check-untyped-defs, mypy doesn't like that lazy_rebuild_tensor_v2 and rebuild_from_type_v2 are not marked as static methods. We should revert the change made by #1327 and use lazy_rebuild_tensor_v2.__func__ and rebuild_from_type_v2.__func__ in CLASSES to make sure we aren't using unbound methods.


@cebtenzzre (Collaborator):

There are still some missing type parameters according to mypy --disallow-any-generics:

```
convert.py:450: error: Missing type parameters for generic type "ndarray"  [type-arg]
    def bf16_to_fp32(bf16_arr: np.ndarray) -> NDArray:
convert.py:755: error: Missing type parameters for generic type "Callable"  [type-arg]
    ...n], Out], iterable: Iterable[In], concurrency: int, max_workers: Optional[int] = None, factory: Callable = ThreadPool...
convert-lora-to-ggml.py:63: error: Missing type parameters for generic type "dtype"  [type-arg]
        self, name: str, shape: Sequence[int], data_type: np.dtype
```

@klosax (Contributor):

The Aquila model uses the LLaMA architecture and the BPE tokenizer. It should be a good test target for BPE in the convert.py script:
https://huggingface.co/BAAI/Aquila-7B

@KerfuffleV2 (author) commented Aug 28, 2023 (edited):

@cebtenzzre

We should revert the change made by #1327 and use lazy_rebuild_tensor_v2.__func__ and rebuild_from_type_v2.__func__ in CLASSES to make sure we aren't using unbound methods.

edit2: Using getattr seems okay as a workaround (see the sketch below).

```python
('torch._tensor', '_rebuild_from_type_v2'): rebuild_from_type_v2.__func__,
('torch._utils', '_rebuild_tensor_v2'): lazy_rebuild_tensor_v2.__func__,
```

This runs but mypy hates it:

```
convert.py:678: error: "Callable[[Any, Any, Any, Any], Any]" has no attribute "__func__"  [attr-defined]
convert.py:679: error: "Callable[[Any, Any, Any, Any, Any, Any, Any], LazyTensor]" has no attribute "__func__"  [attr-defined]
```

(I didn't forget to uncomment the decorators.)

edit: python/mypy#11211 and python/mypy#14123

Open since 2021, so I wouldn't hold my breath. What do you suggest as a workaround?
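
A minimal sketch of the getattr workaround from edit2 (the class and method here are simplified stand-ins for the ones in convert.py):

```python
class LazyUnpickler:  # simplified stand-in
    @staticmethod
    def lazy_rebuild_tensor_v2(*args):
        ...

    # Inside the class body the name refers to the staticmethod object,
    # whose __func__ is the wrapped function. getattr hides the access
    # from mypy, so it no longer reports [attr-defined] here.
    CLASSES = {
        ('torch._utils', '_rebuild_tensor_v2'): getattr(lazy_rebuild_tensor_v2, '__func__'),
    }
```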


edit:

```
convert.py:755: error: Missing type parameters for generic type "Callable"  [type-arg]
```

I'm not really sure what this one should be. I guess the simplest way to deal with it is just to make it a boolean for using the ThreadPool executor or not, instead of being able to pass in the executor.

@cebtenzzre (Collaborator) commented Aug 28, 2023 (edited):

I'm not really sure what this one should be.

It would be nice if Type[Executor] worked here, but the base class doesn't support max_workers. Strictly speaking, it's a Type[Union[ThreadPoolExecutor, ProcessPoolExecutor]], but that's a bit verbose.

I agree that using a boolean would be the simplest solution.
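
A minimal sketch of the boolean approach (the signature loosely follows convert.py's bounded_parallel_map; the parameter name use_processpool_executor is an assumption for illustration):

```python
from collections import deque
from concurrent.futures import Future, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Deque, Iterable, Iterator, Optional, TypeVar

In = TypeVar('In')
Out = TypeVar('Out')

def bounded_parallel_map(func: Callable[[In], Out], iterable: Iterable[In],
                         concurrency: int, max_workers: Optional[int] = None,
                         use_processpool_executor: bool = False) -> Iterator[Out]:
    # A bool picks the executor class, sidestepping the verbose
    # Type[Union[ThreadPoolExecutor, ProcessPoolExecutor]] annotation.
    cls = ProcessPoolExecutor if use_processpool_executor else ThreadPoolExecutor
    with cls(max_workers=max_workers) as executor:
        futures: Deque['Future[Out]'] = deque()
        for item in iterable:
            if len(futures) >= concurrency:
                yield futures.popleft().result()
            futures.append(executor.submit(func, item))
        while futures:
            yield futures.popleft().result()
```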


@KerfuffleV2 (author):

ce00528 to fix #2865 (comment)


@klosax (Contributor):

Some naming issues for future model support: all references to BPE should be changed to gpt2, since both the SentencePiece tokenizer and the GPT2 tokenizer are in fact BPE (Byte-Pair Encoding) tokenizers. The tokenizer the original LLaMA model uses is a special version of the original SentencePiece tokenizer, so it should be referenced as llama or llama-spm. The Replit models use another version of the SentencePiece tokenizer, and when support for the Replit model is added it should be referenced as replit or replit-spm.

@cebtenzzre (Collaborator) commented Aug 29, 2023 (edited):

mypy with --disallow-any-generics and --check-untyped-defs still has one problem with gguf.py:

I'm not sure what's up with that; the first argument seems to match SupportsRead[bytes] and the second argument seems to match SupportsWrite[bytes]. You can tack # type: ignore[misc] onto that line and I'll look into filing a mypy issue.
edit: mypy seems to be happy if you replace fout: BinaryIO with fout: BufferedWriter (io.BufferedWriter).
This issue is currently tracked at python/mypy#15031.
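
A minimal sketch of that annotation swap (the function name and body are placeholders, not the actual gguf.py code):

```python
import shutil
from io import BufferedReader, BufferedWriter

# With the output typed as io.BufferedWriter (instead of typing.BinaryIO),
# mypy can resolve copyfileobj's AnyStr type variable to bytes across
# both file arguments.
def copy_tensor_data(fin: BufferedReader, fout: BufferedWriter) -> None:
    shutil.copyfileobj(fin, fout)
```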

It looks like mypy wants two more type annotations:

```
convert-llama-ggmlv3-to-gguf.py:90: error: Incompatible types in assignment (expression has type "tuple[Any, ...]", variable has type "tuple[()]")  [assignment]
            self.dims = struct.unpack(f'<{n_dims}I', data[offset:offset + (4 * n_dims)])
convert-llama-ggmlv3-to-gguf.py:122: error: Need type annotation for "tensors" (hint: "tensors: List[<type>] = ...")  [var-annotated]
            tensors = []
```

@KerfuffleV2 (author):

@klosax

Some naming issues for future model support: all references to BPE should be changed to gpt2, since both the SentencePiece tokenizer and the GPT2 tokenizer are in fact BPE (Byte-Pair Encoding) tokenizers.

I think I'll leave that to a different pull unless it's unanimous that this should happen and it's clear what to do. Renaming BpeVocab to Gpt2Vocab (I assume) and changing the command-line argument to gpt2 wouldn't be an issue. I don't know enough about this to make the decision myself, though.

(I also kind of just want to get this merged since it touches so much stuff. Pretty much any change to other scripts is going to create conflicts that I have to mess around with.)


@cebtenzzre

edit: mypy seems to be happy if you replace fout: BinaryIO with fout: BufferedWriter (io.BufferedWriter).

Are you referring to the shutil.copyfileobj thing? If so, the issue I saw was referring to argument 1, not argument 2 (fout):

convert-llama-ggmlv3-to-gguf.py:90: error: Incompatible types in assignment

convert-llama-ggmlv3-to-gguf.py just doesn't have type annotations at all. I saw those warnings, but I don't think it really makes sense to add just one or two annotations in a script that has none. (Also, that might be opening a can of worms.)

If I get a chance, I'll look at that in a different pull.


@cebtenzzre (Collaborator) commented Aug 29, 2023 (edited):

Are you referring to the shutil.copyfileobj thing? If so, the issue I saw was referring to argument 1, not argument 2 (fout)

It's talking about type argument one: the single TypeVar that it's trying to resolve between the two file arguments. For SupportsRead[AnyStr] and SupportsWrite[AnyStr], it needs to resolve AnyStr to bytes.


@KerfuffleV2 (author):

It's talking about type argument one: the single TypeVar that it's trying to resolve between the two file arguments.

This is why they call you Massive Brain Cebtenzzre.


@ggerganov merged commit dc07dc4 into ggml-org:master Aug 30, 2023
@akawrykow (Contributor):

@KerfuffleV2 I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release?

As it is right now, e.g. the convert-falcon-hf-to-gguf.py script expectedly fails with:

```
Traceback (most recent call last):
  File ".\convert-falcon-hf-to-gguf.py", line 169, in <module>
    special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
AttributeError: module 'gguf' has no attribute 'SpecialVocab'
```

(because the conversion script doesn't use gguf from this repo; rather, it imports the package)

Can we publish that update ASAP? I saw @monatis added a workflow for this in #2896.

@cebtenzzre (Collaborator):

because the conversion script doesn't use gguf from this repo

FWIW, there is nothing stopping you from cd'ing to gguf-py and running pip install -e . there. That would use the gguf from this repo, and -e will make sure it automatically updates when you git pull.


@monatis (Collaborator):

@akawrykow I already made a manual release, and you can upgrade your gguf package with pip install --upgrade gguf. But you are advised to install it in editable mode as @cebtenzzre pointed out. With that said, any future change should be released with the workflow I described in gguf-py/README.md from now on.


@akawrykow (Contributor):

@KerfuffleV2 @monatis after upgrading, I get:

```
Traceback (most recent call last):
  File ".\convert-falcon-hf-to-gguf.py", line 4, in <module>
    import gguf
  File "C:\Python38\lib\site-packages\gguf\__init__.py", line 1, in <module>
    from .gguf import *
  File "C:\Python38\lib\site-packages\gguf\gguf.py", line 425, in <module>
    class GGUFWriter:
  File "C:\Python38\lib\site-packages\gguf\gguf.py", line 435, in GGUFWriter
    temp_file: Optional[tempfile.SpooledTemporaryFile[bytes]] = None
TypeError: 'type' object is not subscriptable
```

any ideas?

@akawrykow (Contributor) commented Aug 30, 2023 (edited):

Seems like my python version is outdated (currently on 3.8) and I guess this requires 3.9 and up?

edit: working now. Thanks

@cebtenzzre (Collaborator):

Python 3.8 isn't EOL yet. Should we add from __future__ import annotations to files with annotations so Python doesn't try to interpret them at runtime?
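
A minimal sketch of how that import avoids the crash above (class and attribute taken from the traceback; the rest is illustrative):

```python
from __future__ import annotations  # must precede all other imports

import tempfile
from typing import Optional

class GGUFWriter:
    # With the __future__ import, this annotation stays an unevaluated
    # string, so subscripting SpooledTemporaryFile no longer raises
    # TypeError on Python 3.8.
    temp_file: Optional[tempfile.SpooledTemporaryFile[bytes]] = None
```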

@monatis (Collaborator):

Does this fix the error?

```python
temp_file: Optional[tempfile.SpooledTemporaryFile] = None
```

@monatis (Collaborator):

We should definitely support Python 3.8. I'll check the type hints tomorrow and make sure that it works with 3.8.

@cebtenzzre (Collaborator) commented Aug 30, 2023 (edited):

Does this fix the error?

mypy won't like that, because SpooledTemporaryFile is generic in the type stubs. As an alternative to the __future__ import we could do this:

```python
temp_file: 'Optional[tempfile.SpooledTemporaryFile[bytes]]' = None
```

edit: another problem is that typing.TypeAlias does not exist in Python 3.8. We should make the import conditional on TYPE_CHECKING.
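
A minimal sketch of that conditional import (the alias name below is hypothetical, not one from gguf.py):

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so Python 3.8 never tries
    # to import the 3.10+ TypeAlias name at runtime.
    from typing import TypeAlias

# The quoted annotation likewise stays unevaluated at runtime.
GGMLValue: 'TypeAlias' = Union[int, float, str]
```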

@KerfuffleV2 (author) commented Aug 31, 2023 (edited):

I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release?

The current behavior is pretty unintuitive, in my opinion. I really feel like if I'm running these scripts from the repo directory, they should use the gguf module from the repo without having to mess around with virtual environments and stuff. Probably a lot of people also don't know about the pip install -e trick cebtenzzre mentioned (I didn't).

What if we did something like

```python
import os
import sys
from pathlib import Path

if os.environ.get('NO_LOCAL_GGUF') is None and Path('gguf-py').is_dir():
    sys.path.insert(1, str(Path('gguf-py') / 'gguf'))
import gguf
```

in the scripts that use gguf? That way it'll find the local one if running from the repo root, and it can also be disabled if necessary.

edit: #2927

@KerfuffleV2 mentioned this pull request Aug 31, 2023
@cebtenzzre (Collaborator) left a review:

Just some things I noticed while doing some stuff with GGUF today.

```python
MODEL_TENSOR_NAMES: Dict[MODEL_ARCH, Dict[MODEL_TENSOR, str]] = {
```
@cebtenzzre (Collaborator):

Using a dict of dicts is confusing here because the values for a given key are always the same - this is guaranteed by the spec IIUC. Maybe a dict of lists, with a global dict for key->name lookup, would make more sense?
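
A minimal sketch of that alternative (the enum stubs and entries are illustrative, not the actual tables):

```python
from enum import IntEnum, auto
from typing import Dict, List

class MODEL_ARCH(IntEnum):
    LLAMA = auto()

class MODEL_TENSOR(IntEnum):
    TOKEN_EMBD = auto()
    OUTPUT = auto()

# One global key -> name lookup, since a given tensor kind always has the
# same GGUF name, plus a per-arch list of which tensors the model uses.
TENSOR_NAMES: Dict[MODEL_TENSOR, str] = {
    MODEL_TENSOR.TOKEN_EMBD: "token_embd",
    MODEL_TENSOR.OUTPUT: "output",
}

MODEL_TENSORS: Dict[MODEL_ARCH, List[MODEL_TENSOR]] = {
    MODEL_ARCH.LLAMA: [MODEL_TENSOR.TOKEN_EMBD, MODEL_TENSOR.OUTPUT],
}
```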

@KerfuffleV2 (author):

Sorry, I meant to get back to you on these but it slipped my mind. Also, since I didn't design this stuff I'm not completely sure what to say.

For this one, I agree it's weird. It seems like the only thing you could do with this compared to a non-architecture specific list is check if a type of tensor is in the model. I'm not sure when that would ever be useful.

I only added the type annotation with this pull, and generally I was mostly focused on cleaning up some obvious stuff and adding annotations rather than looking at the design/logic of it.

Fixing it would be a breaking change though.

```python
class TensorNameMap:
    mappings_cfg: Dict[MODEL_TENSOR, Tuple[str, ...]] = {
```
@cebtenzzre (Collaborator):

Since we go through the trouble of listing out the tensors used per arch, why do we ignore the arch in this table?

@KerfuffleV2 (author):

My impression is that it's just there so you can throw in any tensor name and get the GGUF version out. That's why I argued for #3095 as well. This behavior I can actually see a use for, since general tools could use it without having to actually worry about the model type.

```python
special_token_types: Tuple[str, ...] = tuple(('bos', 'eos', 'unk', 'sep', 'pad'))
special_token_ids: Dict[str, int] = {}

def __init__(self, path: Path, load_merges: bool = False, special_token_types: Optional[Tuple[str, ...]] = None):
```
@cebtenzzre (Collaborator):

Requiring a pathlib.Path in the API is not user-friendly. Generalizing it to os.PathLike[str] would be better.

@KerfuffleV2 (author):

Makes sense to me.
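
A minimal sketch of the generalized signature (the class context is abbreviated; only the path parameter changes):

```python
import os
from pathlib import Path
from typing import Union

class SpecialVocab:
    # Accepting any path-like object (str, Path, ...) instead of
    # requiring pathlib.Path makes the API friendlier to callers.
    def __init__(self, path: Union[str, 'os.PathLike[str]'], load_merges: bool = False):
        self.path = Path(path)  # normalize internally
```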

@cebtenzzre (Collaborator):

@KerfuffleV2 What do you think about the comments I've made above?


@KerfuffleV2 deleted the feat-scripts-improvements branch November 17, 2023 03:13

Reviewers

@ggerganov approved these changes
@monatis awaiting requested review
@cebtenzzre left review comments
@klosax approved these changes

Labels

script (Script related)

7 participants

@KerfuffleV2, @klosax, @cebtenzzre, @ggerganov, @slaren, @akawrykow, @monatis
