Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Experimental: tokenizers with and without templates#168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
pcuenca wants to merge5 commits intomain
base:main
Choose a base branch
Loading
fromhub-tokenizers-templates

Conversation

pcuenca
Copy link
Member

@pcuencapcuenca commentedJan 30, 2025
edited
Loading

This builds on top of#166 by@greenrazer.

The goal is to be able to opt-in to the chat template feature, which carries the Jinja dependency, which in turn depends on swift-collections and whatnot. It works in its current state, but it's ugly – I found way more cross-module quirks than I was expecting. I wanted to make it as easy as possible for consumers and allow to use either version (with or without templates) from their SPM manifests.

Opinions, corrections, and alternative ideas are encouraged and most welcome!


How to use:

  • If you want to use the core Tokenizers functionality, without the chat templates:

Add the following dependency to your target, as usual (exceptTokenizers is now a product, no need to use the fullTransformers lib). This applies to projects such asWhisperKit.

            dependencies: [                .product(name: "Tokenizers", package: "swift-transformers"),            ],
  • To opt-in to using chat-templates:

This applies to projects such asmlx-swift-examples.

            dependencies: [                .product(name: "Tokenizers", package: "swift-transformers"),+               .product(name: "TokenizersTemplates", package: "swift-transformers"),            ],

That's it, you don't need to importTokenizersTemplates or do anything other than just declaring it as a dependency.


Known issues:

  • I haven't looked at tests yet, they probably won't compile.
  • I know there are conflicts because of last night's tools PR. Let's agree on the general direction before addressing them.

FL33TW00D reacted with rocket emoji
try applyChatTemplate(messages: messages, chatTemplate: .literal(chatTemplate), addGenerationPrompt: true, truncation: false, maxLength: nil, tools: nil)
}
}

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

See comment inTokenizersTemplates

]

open class PreTrainedTokenizerWithTemplates : PreTrainedTokenizer {
// I don't know why these need to be here. They are implemented in the protocol, **and** in the superclass.
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, if these overrides don't exist, the linker can't find the implementations.

import Foundation
import Hub

@_exported import TokenizersCore
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is a good chunk of the magic. The newTokenizers implementation is just this wrapper file, which exposes the importedTokenizersCore types.

Comment on lines +6 to +9
#if canImport(TokenizersTemplates)
import TokenizersTemplates
public typealias PreTrainedTokenizer = PreTrainedTokenizerWithTemplates
#endif
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So ifTokenizersTemplates is available (because users have declared it as a dependency), we override the definition so the factory below uses the subclass.

}

// See https://github.com/xenova/transformers.js/blob/1a9964fb09b8f54fcbeac46dc6aae8d76795809d/src/tokenizers.js#L3203 for these exceptions
class LlamaPreTrainedTokenizer: PreTrainedTokenizer {
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This could be moved back toTokenizersCore, but then we'd need atypealias here as well (and a subclass).

@pcuenca
Copy link
MemberAuthor

cc@greenrazer@FL33TW00D@Vaibhavs10 for opinions and feedback.

@greenrazer
Copy link
Collaborator

I like this solution. however, I cannot get thecanImport approach inTokenizersWrapper to work for the tests due to how Swift Package Manager handles module dependencies and conditional imports within the same package. All the tests work great with almost no modification (besides changingTokenizers toTokenizersCore) except for theChatTemplatesTests.swift.

I've tried a few different options:

  • The best option: Just addingTokenizersTemplates as a dependency to the test, similar to how you would for an external package. This doesn’t work because it doesn’t trigger recompilation for the test.
  • AddingswiftSettings: [.define("USE_TEMPLATES")] to the test and then adding another build condition toTokenizersWrapper. Same issue as above—it doesn’t trigger recompilation for the test.
  • Creating two targets pointing to the sameTokenizersWrapper source file, with one includingUSE_TEMPLATES. This doesn’t work because you can't have two targets referencing the same source file.
  • Adding.target(name: "TokenizersTemplates", condition: .when(platforms: nil)) as a dependency to hopefully run only during testing. This doesn’t work in newer Swift versions.
  • AddingswiftSettings: [.define("ENABLE_TEMPLATES", .when(configuration: .debug))] to theTokenizers target. This almost works, butTokenizersTemplates must then be a dependency ofTokenizers, which defeats the whole purpose.

Since users must explicitly importTokenizersTemplates anyway, I think the best approach is to create two wrappers and two targets. That way, users only need to import eitherTokenizersTemplates orTokenizers, depending on their needs.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers
No reviews
Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

2 participants
@pcuenca@greenrazer

[8]ページ先頭

©2009-2025 Movatter.jp