Description
Hi. I ran into an issue trying to use `UnigramTokenizer` with the XLMRobertaTokenizer vocabulary.
Reproduce with the following in `FactoryTests.swift`, after adding the relevant entries to `knownTokenizers`, e.g. `class XLMRobertaTokenizer: UnigramTokenizer {}`, etc.
    func testE5() async throws {
        let tokenizer = try await AutoTokenizer.from(pretrained: "intfloat/multilingual-e5-small", hubApi: hubApi)
        let inputIds = tokenizer("query: how much protein should a female eat")
        print(tokenizer.decode(tokens: inputIds))
        XCTAssertEqual(inputIds, [0, 41, 1294, 12, 3642, 5045, 21308, 5608, 10, 117776, 73203, 2])
    }
This results in the following error:
Swift/NativeDictionary.swift:770: Fatal error: Duplicate values for key: 'َّ'
Patching `UnigramTokenizer.swift:66` with the following gets the test passing:
    var tmp = [String: Int]()
    vocab.map { $0.token }.enumerated().forEach { (v, k) in tmp[k] = v }
    tokensToIds = tmp
This patch does not address the root cause and will obviously cause some vocabulary entries to be lost. From visual inspection, it seems that a number of entries in what looks like Thai script are affected by this issue.
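To see exactly which entries the patch would drop, something like the following could be run against the vocabulary. This is only a sketch: `tokens` is a hypothetical stand-in for `vocab.map { $0.token }`, filled here with two sample strings that Swift considers equal, not with real entries from the vocabulary file.

```swift
// Hypothetical stand-in for `vocab.map { $0.token }`.
// "\u{00E9}" and "e\u{0301}" are different scalar sequences that Swift treats as the same String.
let tokens: [String] = ["a", "\u{00E9}", "b", "e\u{0301}"]

// Group token ids by their String value, using the same equality Dictionary keys use.
let grouped = Dictionary(grouping: tokens.indices, by: { tokens[$0] })

// Any group with more than one id is a set of entries that collide as dictionary keys.
let collisions = grouped.filter { $0.value.count > 1 }

for (token, ids) in collisions {
    // Print the raw scalars of each colliding entry so the difference is visible.
    let scalarDumps: [String] = ids.map { id in
        tokens[id].unicodeScalars
            .map { scalar in "U+" + String(scalar.value, radix: 16, uppercase: true) }
            .joined(separator: " ")
    }
    print("'\(token)' is shared by ids \(ids): \(scalarDumps)")
}
```

With the patch above, only the last id in each colliding group survives in `tokensToIds`; the others can no longer be produced by encoding.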
I don't know enough about Swift strings to determine whether this is a bug in `swift-transformers` or a problem with the vocabulary file.
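One plausible explanation, which I have not confirmed against the vocabulary file: Swift compares `String` values by Unicode canonical equivalence, so two entries with different bytes can still be the same dictionary key. The sketch below shows that behaviour with a well-known pair of strings, plus a byte-keyed dictionary as one possible way to keep such entries apart. The names are illustrative and not taken from `swift-transformers`.

```swift
let precomposed = "\u{00E9}"   // "é" as a single scalar
let decomposed = "e\u{0301}"   // "e" followed by a combining acute accent

// String equality (and hashing) uses canonical equivalence, so these are one key.
let sameString = (precomposed == decomposed)

// The underlying UTF-8 bytes are nevertheless different.
let sameBytes = (Array(precomposed.utf8) == Array(decomposed.utf8))

print("equal as String: \(sameString), equal as bytes: \(sameBytes)")

// Illustrative alternative: key the lookup table on raw UTF-8 bytes
// so canonically equivalent tokens keep separate ids.
var idsByBytes = [[UInt8]: Int]()
for (id, token) in [precomposed, decomposed].enumerated() {
    idsByBytes[Array(token.utf8)] = id
}
print("distinct byte-keyed entries: \(idsByBytes.count)")
```

If the duplicates in the vocabulary turn out to be of this kind, the vocabulary file itself would be consistent and the collision would come from using `String` as the dictionary key.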
Thanks.