
Edge tokenization issues: Unicode parsing #116

Open

@pcuenca
Regarding the [missing tokens in the parsed vocabulary](https://github.com/huggingface/swift-transformers/pull/113#issuecomment-2267520368), this is my documentation after tracking down one of the issues.

First, we parse the JSON file (`tokenizer.json`) using `JSONSerialization.jsonObject`. This reads the data as Foundation objects, parsing tokens from the vocab dictionary as `NSString` instances. This is a good thing: Swift `String`s cannot be used as keys in the vocab dictionary, because `String` equality only considers the Unicode canonical representation. Parsing the JSON and casting to `[String: Int]` would collapse multiple entries into one.
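A minimal illustration of why `String` keys are unsuitable here (the tokens below are hypothetical, not from the Gemma vocab): two strings with different scalar sequences can be canonically equivalent, so a `[String: Int]` dictionary merges them, while `NSString` keeps them apart.

```swift
import Foundation

// Hypothetical tokens that differ at the scalar level but are
// canonically equivalent.
let precomposed = "\u{00E9}"   // "é" as a single scalar (U+00E9)
let decomposed = "e\u{0301}"   // "e" followed by a combining acute accent

// Swift String equality uses canonical equivalence, so these compare equal…
assert(precomposed == decomposed)

// …while NSString compares UTF-16 code units literally, keeping them distinct.
assert(!(precomposed as NSString).isEqual(decomposed as NSString))

// Consequently, a [String: Int] vocab collapses the two keys into one entry.
var vocab: [String: Int] = [:]
vocab[precomposed] = 0
vocab[decomposed] = 1
assert(vocab.count == 1)
```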

However, I found that `JSONSerialization` fails to correctly parse some strings. Consider the following test case:

```swift
func testArrayParsingWithBOMPrefix() {
    // The second one starts with a BOM prefix
    let items = ["a", "\u{feff}a"]

    // Neither Strings nor NSStrings are equal
    XCTAssertNotEqual(items[0], items[1])
    XCTAssertNotEqual(items[0] as NSString, items[1] as NSString)

    // JSONDecoder works
    let jsonData = try! JSONSerialization.data(withJSONObject: items, options: [])
    let decoder = JSONDecoder()
    let decoded = try! decoder.decode([String].self, from: jsonData)
    XCTAssertEqual(decoded, items)

    // JSONSerialization seems to ignore the BOM.
    // The decoded array contains two items, but they are the same NSString.
    let ns_decoded = try! JSONSerialization.jsonObject(with: jsonData, options: []) as! NSArray
    XCTAssertEqual(ns_decoded.count, items.count)                               // passes
    XCTAssertNotEqual(ns_decoded[0] as! NSString, ns_decoded[1] as! NSString)   // fails
    XCTAssertEqual(ns_decoded as! [String], items)                              // fails

    // Compare unicodeScalars
    func scalars(_ string: String) -> [UInt32] {
        string.unicodeScalars.map { $0.value }
    }
    for (decoded, expected) in zip(ns_decoded, items) {
        let decodedScalars = scalars(decoded as! String)
        let expectedScalars = scalars(expected)
        XCTAssertEqual(decodedScalars, expectedScalars)         // first passes, second fails
    }
}
```

There are two strings in the test array. The second one starts with a BOM prefix. The prefix is ignored when parsing the two `NSString`s, as confirmed by looking at the Unicode scalars in the debugger. Unfortunately, the Gemma vocab contains some duplicate entries with/without a BOM prefix, so reading them into a dictionary skips some entries.
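The dictionary effect can be sketched at the smallest scale (hypothetical keys; the exact behavior depends on the Foundation version): if `JSONSerialization` drops the leading BOM, two distinct JSON keys decode to equal `NSString`s and only one entry survives.

```swift
import Foundation

// Minimal sketch of the duplicate-key collision. The keys are hypothetical;
// "\uFEFF" is written as a JSON escape so the raw bytes contain no real BOM.
let jsonText = #"{"a": 1, "\uFEFFa": 2}"#
let data = jsonText.data(using: .utf8)!
let parsed = try! JSONSerialization.jsonObject(with: data, options: []) as! NSDictionary

// On affected Foundation versions the leading BOM is stripped, both keys
// become the same NSString, and one of the two entries is lost.
print(parsed.count)  // expected 2; reportedly 1 when the bug triggers
```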

Interestingly, all the tests pass if the BOM character is in the middle of the string. Replacing the test items with these works fine:

```swift
        // If the non-breaking space is inside the String, all tests pass
//        let items = ["ab", "a\u{feff}b"]
```

I suspect this is used for parsing, and the stream is incorrectly assumed to start with a BOM even though the character is in the middle of the actual JSON data.

Also interestingly, `JSONDecoder` works and can decode the two distinct `String` instances in the array. We are not using `JSONDecoder` in this project because:

  • The structure of the JSON files to be parsed is quite open and flexible; I don't think it would be straightforward to write a decodable structure that represents it. Instead, we use dynamic member lookup to navigate the contents.
  • We can't use `String` instances for vocab keys, as mentioned above.
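As a sketch of the `NSString`-keyed approach described above (the function name and shape are my own, not the project's API), the vocab can be kept as `[NSString: Int]` so canonically equivalent tokens stay distinct:

```swift
import Foundation

// Hypothetical sketch: build the vocab with NSString keys so that tokens
// which are canonically equivalent (but byte-distinct) do not collide.
func parseVocab(_ data: Data) throws -> [NSString: Int] {
    let json = try JSONSerialization.jsonObject(with: data, options: [])
    guard let dict = json as? NSDictionary else { return [:] }
    var vocab: [NSString: Int] = [:]
    for case let (token as NSString, id as Int) in dict {
        vocab[token] = id
    }
    return vocab
}

// Usage with a small inline vocabulary:
let data = #"{"hello": 0, "world": 1}"#.data(using: .utf8)!
let vocab = try! parseVocab(data)
assert(vocab.count == 2)
assert(vocab["hello" as NSString] == 0)
```

This sidesteps `String`'s canonical equality, but of course does not fix the BOM-stripping issue itself, which happens earlier, inside `JSONSerialization`.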

I'm not sure how to deal with this.

Originally posted by @pcuenca in #113 (comment)
