Various script cleanups/fixes + convert merges and special token handling #2842


Merged

Conversation

@KerfuffleV2 (Contributor) commented Aug 27, 2023 (edited by ggerganov)

Changes

  • Fix incorrect number of args to permute methods in convert.py.
  • Misc fixes/cleanups for various scripts.
  • Fix/add type hints where possible.
  • Set the gguf Python package as having type info.
  • Handle merges and special tokens in convert.py.
  • --vocab-only in convert.py can work without the actual model tensors being available or in the correct format (i.e. git LFS links). (But guessed params can no longer be used in vocab-only mode.)
  • gguf.py tensor name mapping refactored. This also allows simpler tensor skipping.
  • The falcon converter accepts --vocab-only.

(Not a 100% complete list of changes.)

Status

Completed (unless someone speaks up).

ref: #2820

@KerfuffleV2 added the script (Script related) label Aug 27, 2023
@KerfuffleV2 changed the title from "Feat scripts improvements" to "Various script cleanups/fixes + convert merge and special token handling" Aug 27, 2023
@KerfuffleV2 changed the title from "Various script cleanups/fixes + convert merge and special token handling" to "Various script cleanups/fixes + convert merges and special token handling" Aug 27, 2023
@KerfuffleV2 (author) commented Aug 27, 2023 (edited):

@ggerganov Anything you hate about these changes so far?

Also: https://github.com/ggerganov/llama.cpp/blob/9cc112929c5a798c93d6a34abe53b751a478aa52/gguf-py/gguf/gguf.py#L165-L169

I don't fully understand what this means, but if you explain a bit more I can probably do it. Do you just mean you want get_tensor_name_map to return Dict[str, Tuple[str, MODEL_TENSOR]] instead of Dict[str, str]?

Think I figured it out and this approach is an improvement. Reduces duplicated code a fair bit.
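
A minimal sketch of that return-shape change (the enum stub and mapping entries below are illustrative, not the actual tables from gguf.py):

```python
from enum import IntEnum, auto
from typing import Dict, Tuple

class MODEL_TENSOR(IntEnum):  # stand-in for the enum in gguf.py
    TOKEN_EMBD = auto()
    OUTPUT = auto()

# Each source tensor name maps to both the GGUF name and the tensor kind,
# so callers no longer need a second lookup table.
def get_tensor_name_map() -> Dict[str, Tuple[str, MODEL_TENSOR]]:
    return {
        "transformer.wte": ("token_embd", MODEL_TENSOR.TOKEN_EMBD),
        "lm_head": ("output", MODEL_TENSOR.OUTPUT),
    }
```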

@KerfuffleV2 marked this pull request as ready for review August 27, 2023 23:22
@KerfuffleV2 (author):

These changes shut up the type warnings but I'm not 100% sure they're correct. The alternative would be to leave it List[str] and then convert the token text to a string. I assume it's already UTF-8 bytes so there probably isn't a functional difference.

Collaborator:

The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.

@KerfuffleV2 (author):

The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.

Right. I meant my change fixes the type warning, but the part I wasn't sure about was whether text actually is supposed to be a bytearray there, and what is supposed to get submitted to gguf to write out the tokens. The type annotation for the add_token_list method is also just List and doesn't specify what the element is supposed to be.

Contributor:

The decision was made to accept several types as input to the token list, depending on the type of the tokenizer output (spm vs bpe):
https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/gguf-py/gguf/gguf.py#L373-L375

@KerfuffleV2 (author) commented Aug 28, 2023 (edited):

The decision was made to accept several types as input to the token list

So we'd want

```python
def add_token_list(self, tokens: Union[List[str], List[bytes], List[bytearray]]):
```

Correct? Or would you want to allow the items to be non-homogeneous, like List[Union[str, bytes, bytearray]]?

Contributor:

What are the differences between the Python types str, bytes and bytearray? If they all resolve to the same thing when written to disk, then any of them could be used.

Contributor:

Make it as simple as it can be, as long as there is no difference in how the tokens are written to disk.

Collaborator:

str is unicode text which will be encoded as UTF-8 before being written to disk. bytes and bytearray are binary data. Those two are subclasses of ByteString, but we can't use that because it also allows memoryview. The least repetitive way to write this would be to use a TypeVar:

```python
StrOrBytes = TypeVar('StrOrBytes', str, bytes, bytearray)
# ...
def add_token_list(self, tokens: Sequence[StrOrBytes]):
    # ...
def add_token_merges(self, merges: Sequence[StrOrBytes]):
    # ...
```

@KerfuffleV2 (author):

Should merges actually support all of those?

@KerfuffleV2 (author):

If there's no tokenizer.json, we just generate a model with no vocabulary, without a warning or anything? Is that behavior deliberate?

It's the same for most of the conversion scripts. If it's not actually supposed to work that way (and I don't understand why it would) I'd change that to bail out immediately if it doesn't exist.
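
A minimal sketch of that bail-out check (the dir_model variable and message are assumptions for illustration):

```python
import sys
from pathlib import Path

dir_model = Path(sys.argv[1])
tokenizer_json = dir_model / 'tokenizer.json'
if not tokenizer_json.is_file():
    # Fail fast instead of silently writing a model with no vocabulary.
    print(f'Error: missing {tokenizer_json}', file=sys.stderr)
    sys.exit(1)
```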

@ggerganov (Member):

Ok, let's change. In general, the idea of these example convert scripts is to be as simple as possible and to work mostly in the "happy scenario" where all your data is correctly placed. But some basic error handling would be useful.

Btw, for the falcon script, it would be useful to have a --vocab-only option since we can add extra tokenizer tests if it is available. Can be done in a separate PR though.

@KerfuffleV2 (author):

In general, the idea of these example convert scripts is to be as simple as possible and to work mostly in the "happy scenario" where all your data is correctly placed.

Is there any plan to roll the functionality into the normal convert script? I was kind of thinking that the more common stuff I can refactor, the easier doing something like that would be.

Ideally convert would also get refactored into something more modular than a gigantic monolithic script. (Probably not something I'd include in this pull since it's already pretty large and complicated.)

@ggerganov (Member) left a review:

Nice work, thank you!

@klosax (Contributor):

I guess this could be used in the simpler conversion scripts (with the import moved to the top?):
https://github.com/ggerganov/llama.cpp/blob/f55538c3ccba9b926846ef862fa830cea08c433e/convert.py#L338-L339
instead of
https://github.com/ggerganov/llama.cpp/blob/dd0dc366dab10e8df28d3924e7f313b5c695e908/convert-falcon-hf-to-gguf.py#L16-L36


@klosax (Contributor):

tokens are used by both the spm and bpe tokenizers.

toktypes are used only by the spm tokenizer.
scores are used only by the spm tokenizer.

merges are used only by the bpe tokenizer.

Should we have one class for extracting the spm parts and another for the bpe parts?

The special token mapping (bos, eos, ...) is used by both spm and bpe.
The class SpecialVocab should not include the bpe merges, only the special token mapping?

@KerfuffleV2 (author):

The class SpecialVocab should not include the bpe merges, only the special token mapping?

How about passing it a flag or something to tell it to leave that part out? Maybe it should also be renamed to something more general than SpecialVocab (any ideas?).

I was kind of thinking of making it a general class for the non-main vocab stuff. I also wanted to make it so you could access the loaded JSON stuff like tokenizer_config.json, because there are some places where that gets loaded twice.

@KerfuffleV2 (author):

I've implemented some of the suggested changes. I was pretty aggressive about resolving conversations just to try to keep track of the stuff I dealt with. If anyone doesn't like my approach to implementing the suggestions, please don't hesitate to re-open those resolved conversations.


@klosax (Contributor):

I was kind of thinking of making it a general class for the non-main vocab stuff.

The merges are as important for the function of the bpe tokenizer as the scores are for the spm tokenizer: about 33% higher perplexity without them in both tokenizers.

IMO the script should determine the tokenizer type and call the tokenizer-specific class that will extract only what is needed for that tokenizer to function properly. The special token mapping could be separate.

@cebtenzzre (Collaborator):

With --check-untyped-defs, mypy doesn't like that lazy_rebuild_tensor_v2 and rebuild_from_type_v2 are not marked as static methods. We should revert the change made by #1327 and use lazy_rebuild_tensor_v2.__func__ and rebuild_from_type_v2.__func__ in CLASSES to make sure we aren't using unbound methods.


@cebtenzzre (Collaborator):

There are still some missing type parameters according to mypy --disallow-any-generics:

```
convert.py:450: error: Missing type parameters for generic type "ndarray"  [type-arg]
    def bf16_to_fp32(bf16_arr: np.ndarray) -> NDArray:
convert.py:755: error: Missing type parameters for generic type "Callable"  [type-arg]
    ...n], Out], iterable: Iterable[In], concurrency: int, max_workers: Optional[int] = None, factory: Callable = ThreadPool...
convert-lora-to-ggml.py:63: error: Missing type parameters for generic type "dtype"  [type-arg]
        self, name: str, shape: Sequence[int], data_type: np.dtype
```

@klosax (Contributor):

The Aquila model uses the LLaMA architecture and the BPE tokenizer. It should be a good test target for BPE in the convert.py script:
https://huggingface.co/BAAI/Aquila-7B

@KerfuffleV2 (author) commented Aug 28, 2023 (edited):

@cebtenzzre

We should revert the change made by #1327 and use lazy_rebuild_tensor_v2.__func__ and rebuild_from_type_v2.__func__ in CLASSES to make sure we aren't using unbound methods.

edit2: Using getattr seems okay as a workaround (see the sketch below).

```python
('torch._tensor', '_rebuild_from_type_v2'): rebuild_from_type_v2.__func__,
('torch._utils', '_rebuild_tensor_v2'): lazy_rebuild_tensor_v2.__func__,
```

This runs but mypy hates it:

```
convert.py:678: error: "Callable[[Any, Any, Any, Any], Any]" has no attribute "__func__"  [attr-defined]
convert.py:679: error: "Callable[[Any, Any, Any, Any, Any, Any, Any], LazyTensor]" has no attribute "__func__"  [attr-defined]
```

(I didn't forget to uncomment the decorators.)

edit: python/mypy#11211 and python/mypy#14123

Open since 2021, so I wouldn't hold my breath. What do you suggest as a workaround?
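
A minimal sketch of the getattr workaround from edit2 (the class and method here are simplified stand-ins for the ones in convert.py):

```python
class LazyUnpickler:  # simplified stand-in
    @staticmethod
    def lazy_rebuild_tensor_v2(*args):
        ...

    # Inside the class body the name refers to the staticmethod object,
    # whose __func__ is the wrapped function. getattr hides the access
    # from mypy, so it no longer reports [attr-defined] here.
    CLASSES = {
        ('torch._utils', '_rebuild_tensor_v2'): getattr(lazy_rebuild_tensor_v2, '__func__'),
    }
```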


edit:

```
convert.py:755: error: Missing type parameters for generic type "Callable"  [type-arg]
```

I'm not really sure what this one should be. I guess the simplest way to deal with it is just to make it a boolean for using the ThreadPool executor or not, instead of being able to pass in the executor.

@cebtenzzre (Collaborator) commented Aug 28, 2023 (edited):

I'm not really sure what this one should be.

It would be nice if Type[Executor] worked here, but the base class doesn't support max_workers. Strictly speaking, it's a Type[Union[ThreadPoolExecutor, ProcessPoolExecutor]], but that's a bit verbose.

I agree that using a boolean would be the simplest solution.
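
A minimal sketch of the boolean approach (the signature loosely follows convert.py's bounded_parallel_map; the parameter name use_processpool_executor is an assumption for illustration):

```python
from collections import deque
from concurrent.futures import Future, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Deque, Iterable, Iterator, Optional, TypeVar

In = TypeVar('In')
Out = TypeVar('Out')

def bounded_parallel_map(func: Callable[[In], Out], iterable: Iterable[In],
                         concurrency: int, max_workers: Optional[int] = None,
                         use_processpool_executor: bool = False) -> Iterator[Out]:
    # A bool picks the executor class, sidestepping the verbose
    # Type[Union[ThreadPoolExecutor, ProcessPoolExecutor]] annotation.
    cls = ProcessPoolExecutor if use_processpool_executor else ThreadPoolExecutor
    with cls(max_workers=max_workers) as executor:
        futures: Deque['Future[Out]'] = deque()
        for item in iterable:
            if len(futures) >= concurrency:
                yield futures.popleft().result()
            futures.append(executor.submit(func, item))
        while futures:
            yield futures.popleft().result()
```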


@KerfuffleV2 (author):

ce00528 to fix #2865 (comment)


@klosax (Contributor):

Some naming issues for future model support: all references to BPE should be changed to gpt2, since both the SentencePiece tokenizer and the GPT2 tokenizer are in fact BPE (Byte-Pair Encoding) tokenizers. The tokenizer the original LLaMA model uses is a special version of the original SentencePiece tokenizer, so it should be referenced as llama or llama-spm. The Replit models use another version of the SentencePiece tokenizer, and when support for the Replit model is added it should be referenced as replit or replit-spm.

@cebtenzzre (Collaborator) commented Aug 29, 2023 (edited):

mypy with --disallow-any-generics and --check-untyped-defs still has one problem with gguf.py:

I'm not sure what's up with that; the first argument seems to match SupportsRead[bytes] and the second argument seems to match SupportsWrite[bytes]. You can tack # type: ignore[misc] onto that line and I'll look into filing a mypy issue.
edit: mypy seems to be happy if you replace fout: BinaryIO with fout: BufferedWriter (io.BufferedWriter).
This issue is currently tracked at python/mypy#15031.
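
A minimal sketch of that annotation swap (the function name and body are placeholders, not the actual gguf.py code):

```python
import shutil
from io import BufferedReader, BufferedWriter

# With the output typed as io.BufferedWriter (instead of typing.BinaryIO),
# mypy can resolve copyfileobj's AnyStr type variable to bytes across
# both file arguments.
def copy_tensor_data(fin: BufferedReader, fout: BufferedWriter) -> None:
    shutil.copyfileobj(fin, fout)
```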

It looks like mypy wants two more type annotations:

```
convert-llama-ggmlv3-to-gguf.py:90: error: Incompatible types in assignment (expression has type "tuple[Any, ...]", variable has type "tuple[()]")  [assignment]
            self.dims = struct.unpack(f'<{n_dims}I', data[offset:offset + (4 * n_dims)])
convert-llama-ggmlv3-to-gguf.py:122: error: Need type annotation for "tensors" (hint: "tensors: List[<type>] = ...")  [var-annotated]
            tensors = []
```

@KerfuffleV2 (author):

@klosax

Some naming issues for future model support: all references to BPE should be changed to gpt2, since both the SentencePiece tokenizer and the GPT2 tokenizer are in fact BPE (Byte-Pair Encoding) tokenizers.

I think I'll leave that to a different pull unless it's unanimous that this should happen and it's clear what to do. Renaming BpeVocab to Gpt2Vocab (I assume) and changing the command-line argument to gpt2 wouldn't be an issue. I don't know enough about this to make the decision myself, though.

(I also kind of just want to get this merged since it touches so much stuff. Pretty much any change to other scripts is going to create conflicts that I have to mess around with.)


@cebtenzzre

edit: mypy seems to be happy if you replace fout: BinaryIO with fout: BufferedWriter (io.BufferedWriter).

Are you referring to the shutil.copyfileobj thing? If so, the issue I saw was referring to argument 1, not argument 2 (fout):

convert-llama-ggmlv3-to-gguf.py:90: error: Incompatible types in assignment

convert-llama-ggmlv3-to-gguf.py just doesn't have type annotations at all. I saw those warnings, but I don't think it really makes sense to add just one or two annotations in a script that has none. (Also, that might be opening a can of worms.)

If I get a chance, I'll look at that in a different pull.


@cebtenzzre (Collaborator) commented Aug 29, 2023 (edited):

Are you referring to the shutil.copyfileobj thing? If so, the issue I saw was referring to argument 1, not argument 2 (fout)

It's talking about type argument one: the single TypeVar that it's trying to resolve between the two file arguments. For SupportsRead[AnyStr] and SupportsWrite[AnyStr], it needs to resolve AnyStr to bytes.


@KerfuffleV2 (author):

It's talking about type argument one: the single TypeVar that it's trying to resolve between the two file arguments.

This is why they call you Massive Brain Cebtenzzre.


@ggerganov merged commit dc07dc4 into ggml-org:master Aug 30, 2023
@akawrykow (Contributor):

@KerfuffleV2 I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release?

As it is right now, e.g. the convert-falcon-hf-to-gguf.py script expectedly fails with:

```
Traceback (most recent call last):
  File ".\convert-falcon-hf-to-gguf.py", line 169, in <module>
    special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
AttributeError: module 'gguf' has no attribute 'SpecialVocab'
```

(because the conversion script doesn't use gguf from this repo; rather, it imports the package)

Can we publish that update ASAP? I saw @monatis added a workflow for this in #2896.

@cebtenzzre (Collaborator):

because the conversion script doesn't use gguf from this repo

FWIW, there is nothing stopping you from cd'ing to gguf-py and running pip install -e . there. That would use the gguf from this repo, and -e will make sure it automatically updates when you git pull.


@monatis (Collaborator):

@akawrykow I already made a manual release, and you can upgrade your gguf package with pip install --upgrade gguf. But you are advised to install it in editable mode as @cebtenzzre pointed out. With that said, any future change should be released with the workflow I described in gguf-py/README.md from now on.


@akawrykow (Contributor):

@KerfuffleV2 @monatis after upgrading, I get:

```
Traceback (most recent call last):
  File ".\convert-falcon-hf-to-gguf.py", line 4, in <module>
    import gguf
  File "C:\Python38\lib\site-packages\gguf\__init__.py", line 1, in <module>
    from .gguf import *
  File "C:\Python38\lib\site-packages\gguf\gguf.py", line 425, in <module>
    class GGUFWriter:
  File "C:\Python38\lib\site-packages\gguf\gguf.py", line 435, in GGUFWriter
    temp_file: Optional[tempfile.SpooledTemporaryFile[bytes]] = None
TypeError: 'type' object is not subscriptable
```

any ideas?

@akawrykow (Contributor) commented Aug 30, 2023 (edited):

Seems like my python version is outdated (currently on 3.8) and I guess this requires 3.9 and up?

edit: working now. Thanks

@cebtenzzre (Collaborator):

Python 3.8 isn't EOL yet. Should we add from __future__ import annotations to files with annotations so Python doesn't try to interpret them at runtime?
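
A minimal sketch of how that import avoids the crash above (class and attribute taken from the traceback; the rest is illustrative):

```python
from __future__ import annotations  # must precede all other imports

import tempfile
from typing import Optional

class GGUFWriter:
    # With the __future__ import, this annotation stays an unevaluated
    # string, so subscripting SpooledTemporaryFile no longer raises
    # TypeError on Python 3.8.
    temp_file: Optional[tempfile.SpooledTemporaryFile[bytes]] = None
```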

@monatis (Collaborator):

Does this fix the error?

```python
temp_file: Optional[tempfile.SpooledTemporaryFile] = None
```

@monatis (Collaborator):

We should definitely support Python 3.8. I'll check the type hints tomorrow and make sure that it works with 3.8.

@cebtenzzre (Collaborator) commented Aug 30, 2023 (edited):

Does this fix the error?

mypy won't like that, because SpooledTemporaryFile is generic in the type stubs. As an alternative to the __future__ import we could do this:

```python
temp_file: 'Optional[tempfile.SpooledTemporaryFile[bytes]]' = None
```

edit: another problem is that typing.TypeAlias does not exist in Python 3.8. We should make the import conditional on TYPE_CHECKING.
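
A minimal sketch of that conditional import (the alias name below is hypothetical, not one from gguf.py):

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so Python 3.8 never tries
    # to import the 3.10+ TypeAlias name at runtime.
    from typing import TypeAlias

# The quoted annotation likewise stays unevaluated at runtime.
GGMLValue: 'TypeAlias' = Union[int, float, str]
```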

@KerfuffleV2 (author) commented Aug 31, 2023 (edited):

I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release?

The current behavior is pretty unintuitive, in my opinion. I really feel like if I'm running these scripts from the repo directory, they should use the gguf module from the repo without having to mess around with virtual environments and stuff. Probably a lot of people also don't know about the pip install -e trick cebtenzzre mentioned (I didn't).

What if we did something like

```python
import os
import sys
from pathlib import Path

if os.environ.get('NO_LOCAL_GGUF') is None and Path('gguf-py').is_dir():
    sys.path.insert(1, str(Path('gguf-py') / 'gguf'))
import gguf
```

in the scripts that use gguf? That way it'll find the local one if running from the repo root, and it can also be disabled if necessary.

edit: #2927

@KerfuffleV2 mentioned this pull request Aug 31, 2023
@cebtenzzre (Collaborator) left a review:

Just some things I noticed while doing some stuff with GGUF today.

```python
MODEL_TENSOR_NAMES: Dict[MODEL_ARCH, Dict[MODEL_TENSOR, str]] = {
```
@cebtenzzre (Collaborator):

Using a dict of dicts is confusing here because the values for a given key are always the same - this is guaranteed by the spec IIUC. Maybe a dict of lists, with a global dict for key->name lookup, would make more sense?
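
A minimal sketch of that alternative (the enum stubs and entries are illustrative, not the actual tables):

```python
from enum import IntEnum, auto
from typing import Dict, List

class MODEL_ARCH(IntEnum):
    LLAMA = auto()

class MODEL_TENSOR(IntEnum):
    TOKEN_EMBD = auto()
    OUTPUT = auto()

# One global key -> name lookup, since a given tensor kind always has the
# same GGUF name, plus a per-arch list of which tensors the model uses.
TENSOR_NAMES: Dict[MODEL_TENSOR, str] = {
    MODEL_TENSOR.TOKEN_EMBD: "token_embd",
    MODEL_TENSOR.OUTPUT: "output",
}

MODEL_TENSORS: Dict[MODEL_ARCH, List[MODEL_TENSOR]] = {
    MODEL_ARCH.LLAMA: [MODEL_TENSOR.TOKEN_EMBD, MODEL_TENSOR.OUTPUT],
}
```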

@KerfuffleV2 (author):

Sorry, I meant to get back to you on these but it slipped my mind. Also, since I didn't design this stuff I'm not completely sure what to say.

For this one, I agree it's weird. It seems like the only thing you could do with this compared to a non-architecture specific list is check if a type of tensor is in the model. I'm not sure when that would ever be useful.

I only added the type annotation with this pull, and generally I was mostly focused on cleaning up some obvious stuff and adding annotations rather than looking at the design/logic of it.

Fixing it would be a breaking change though.

```python
class TensorNameMap:
    mappings_cfg: Dict[MODEL_TENSOR, Tuple[str, ...]] = {
```
@cebtenzzre (Collaborator):

Since we go through the trouble of listing out the tensors used per arch, why do we ignore the arch in this table?

@KerfuffleV2 (author):

My impression is that it's just there so you can throw in any tensor name and get the GGUF version out. That's why I argued for #3095 as well. This behavior I can actually see a use for, since general tools could use it without having to actually worry about the model type.

```python
special_token_types: Tuple[str, ...] = tuple(('bos', 'eos', 'unk', 'sep', 'pad'))
special_token_ids: Dict[str, int] = {}

def __init__(self, path: Path, load_merges: bool = False, special_token_types: Optional[Tuple[str, ...]] = None):
```
@cebtenzzre (Collaborator):

Requiring a pathlib.Path in the API is not user-friendly. Generalizing it to os.PathLike[str] would be better.

@KerfuffleV2 (author):

Makes sense to me.
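
A minimal sketch of the generalized signature (the class context is abbreviated; only the path parameter changes):

```python
import os
from pathlib import Path
from typing import Union

class SpecialVocab:
    # Accepting any path-like object (str, Path, ...) instead of
    # requiring pathlib.Path makes the API friendlier to callers.
    def __init__(self, path: Union[str, 'os.PathLike[str]'], load_merges: bool = False):
        self.path = Path(path)  # normalize internally
```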

@cebtenzzre (Collaborator):

@KerfuffleV2 What do you think about the comments I've made above?


@KerfuffleV2 deleted the feat-scripts-improvements branch November 17, 2023 03:13

Reviewers

@ggerganov approved these changes
@monatis awaiting requested review
@cebtenzzre left review comments
@klosax approved these changes

Labels

script (Script related)

7 participants

@KerfuffleV2, @klosax, @cebtenzzre, @ggerganov, @slaren, @akawrykow, @monatis
