
mlc-ai/tokenizers-cpp

Universal cross-platform tokenizers binding to HuggingFace and sentencepiece

This project provides a cross-platform C++ tokenizer binding library that can be universally deployed. It wraps and binds the HuggingFace tokenizers library and sentencepiece, and provides a minimum common interface in C++.

The main goal of the project is to enable tokenizer deployment for language model applications to native platforms with minimum dependencies, and to remove some of the barriers of cross-language bindings. This project is developed in part with, and used in, MLC LLM. We have tested the following platforms:

  • iOS
  • Android
  • Windows
  • Linux
  • Web browser

Getting Started

The easiest way is to add this project as a submodule and then include it via add_subdirectory in your CMake project. You also need to turn on C++17 support.

  • First, you need to make sure you have Rust installed.
  • If you are cross-compiling, make sure you install the necessary target in Rust. For example, run rustup target add aarch64-apple-ios to install the iOS target.
  • You can then link the library.

See the example folder for an example CMake project.
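Putting the steps above together, a minimal CMakeLists.txt sketch might look like the following. The target name myapp, the source file main.cc, and the submodule path 3rdparty/tokenizers-cpp are placeholders, not names from this project:

```cmake
cmake_minimum_required(VERSION 3.18)
project(myapp CXX)

# tokenizers-cpp requires C++17 support to be turned on.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Assumes the repo was added as a git submodule under 3rdparty/.
add_subdirectory(3rdparty/tokenizers-cpp)

add_executable(myapp main.cc)
# Linking tokenizers_cpp also pulls in the other generated libraries.
target_link_libraries(myapp PRIVATE tokenizers_cpp)
```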

Example Code

```cpp
// - dist/tokenizer.json
void HuggingFaceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.json");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility on how these blobs can be read.
  auto tok = Tokenizer::FromBlobJSON(blob);

  std::string prompt = "What is the capital of Canada?";
  // call Encode to turn prompt into token ids
  std::vector<int> ids = tok->Encode(prompt);
  // call Decode to turn ids into string
  std::string decoded_prompt = tok->Decode(ids);
}

// - dist/tokenizer.model
void SentencePieceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.model");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility on how these blobs can be read.
  auto tok = Tokenizer::FromBlobSentencePiece(blob);

  std::string prompt = "What is the capital of Canada?";
  // call Encode to turn prompt into token ids
  std::vector<int> ids = tok->Encode(prompt);
  // call Decode to turn ids into string
  std::string decoded_prompt = tok->Decode(ids);
}
```
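The examples above rely on a LoadBytesFromFile helper. The example folder ships its own version; the sketch below is one plausible implementation, not this project's API, and simply reads a whole file into a std::string so it can be handed to the factory functions:

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read the entire file at `path` into a std::string, in binary mode so the
// bytes of tokenizer.json / tokenizer.model blobs are preserved as-is.
std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (fs.fail()) {
    throw std::runtime_error("Cannot open " + path);
  }
  std::ostringstream os;
  os << fs.rdbuf();
  return os.str();
}
```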

Extra Details

Currently, the project generates three static libraries:

  • libtokenizers_c.a: the C binding to the tokenizers Rust library
  • libsentencepiece.a: the sentencepiece static library
  • libtokenizers_cpp.a: the C++ binding implementation

If you are using an IDE, you can likely first use CMake to generate these libraries and add them to your development environment. If you are using CMake, target_link_libraries(yourlib tokenizers_cpp) will automatically link in the other two libraries. You can also check out MLC LLM as an example of a complete LLM chat application integration.

JavaScript Support

We use emscripten to expose tokenizers-cpp to WASM and JavaScript. Check out web for more details.

Acknowledgements

This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. It is based on the sentencepiece and tokenizers libraries.
