Hi, I found that the tokenizer behavior differs from Python transformers when I use a Phi-3 model.
swift-transformers

```swift
func testTokenizer() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
    let inputIds = tokenizer(" Hi")
    print(inputIds) // output: [1, 6324]
}
```
Python transformers

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed")
input_ids = tokenizer.encode(" Hi")
print(input_ids)  # output: [1, 29871, 6324]
```
Python transformers prepends `29871` (`▁`) before `6324`. This seems to be done by the normalizer. I debugged the issue and found that the normalizer is ignored when `legacy` is `false`
at `swift-transformers/Sources/Tokenizers/Tokenizer.swift`, lines 341 to 344 in fc65432:

```swift
if !isLegacy {
    configDictionary.removeValue(forKey: "normalizer")
    configDictionary["pre_tokenizer"] = ["type": "Metaspace", "replacement": sentencePieceUnderline, "add_prefix_space": true, "prepend_scheme": "first"]
}
```
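For intuition, here is a minimal, self-contained sketch of how the two code paths can diverge on the input `" Hi"`. This is not the real tokenizer: the vocabulary, ids (taken from the outputs above), and matching logic are all toy simplifications. The legacy/normalizer path prepends a space and rewrites every space as `▁`, while a Metaspace-style pre-tokenizer with `prepend_scheme: "first"` only prepends `▁` when the text does not already start with one, so the standalone `▁` token (`29871`) is lost:

```python
# Toy sketch of the divergence. Ids come from the issue's outputs; the
# normalization and greedy matching are simplified stand-ins for what
# SentencePiece-style tokenizers actually do.
UNDERLINE = "\u2581"  # "▁", the SentencePiece space marker

TOY_VOCAB = {"<s>": 1, UNDERLINE: 29871, UNDERLINE + "Hi": 6324}

def normalize(text: str) -> str:
    """Legacy path: prepend a space, then replace every space with ▁."""
    return (" " + text).replace(" ", UNDERLINE)

def metaspace_first(text: str) -> str:
    """Non-legacy path sketch: replace spaces with ▁, then prepend ▁
    only if the text does not already start with it."""
    text = text.replace(" ", UNDERLINE)
    return text if text.startswith(UNDERLINE) else UNDERLINE + text

def greedy_tokenize(text: str) -> list[int]:
    """Greedy longest-match over the toy vocab, with a BOS token."""
    ids = [TOY_VOCAB["<s>"]]
    while text:
        for end in range(len(text), 0, -1):  # try longest prefix first
            piece = text[:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                text = text[end:]
                break
        else:
            raise ValueError(f"no vocab match for {text!r}")
    return ids

print(greedy_tokenize(normalize(" Hi")))        # [1, 29871, 6324]
print(greedy_tokenize(metaspace_first(" Hi")))  # [1, 6324]
```

In the toy model, `normalize(" Hi")` yields `"▁▁Hi"`, which greedy matching splits into `▁` + `▁Hi`, reproducing the extra `29871`; `metaspace_first(" Hi")` collapses straight to `"▁Hi"`, matching the swift-transformers output.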