# mlc-ai/tokenizers-cpp

Universal cross-platform tokenizers binding to HF and sentencepiece.
This project provides a cross-platform C++ tokenizer binding library that can be universally deployed. It wraps and binds the HuggingFace tokenizers library and sentencepiece, and provides a minimal common interface in C++.

The main goal of the project is to enable tokenizer deployment for language model applications on native platforms with minimal dependencies, and to remove some of the barriers of cross-language bindings. This project is developed in part with, and used in, MLC LLM. We have tested the following platforms:
- iOS
- Android
- Windows
- Linux
- Web browser
## Getting Started

The easiest way is to add this project as a submodule and then include it via `add_subdirectory` in your CMake project. You also need to turn on C++17 support.
- First, make sure you have Rust installed.
- If you are cross-compiling, make sure you install the necessary targets in Rust. For example, run `rustup target add aarch64-apple-ios` to install the iOS target.
- You can then link the library.

See the example folder for an example CMake project.
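The steps above can be sketched as a minimal `CMakeLists.txt`. The submodule path `3rdparty/tokenizers-cpp` and the target name `my_app` are illustrative assumptions, not taken from the example project:

```cmake
cmake_minimum_required(VERSION 3.18)
project(my_app CXX)

# The library requires C++17.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Assumes this repository was added as a submodule at 3rdparty/tokenizers-cpp.
add_subdirectory(3rdparty/tokenizers-cpp tokenizers EXCLUDE_FROM_ALL)

add_executable(my_app main.cc)

# Linking tokenizers_cpp pulls in the other static libraries transitively.
target_link_libraries(my_app PRIVATE tokenizers_cpp)
```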
## Example Code

```cpp
#include <string>
#include <vector>

#include <tokenizers_cpp.h>

using tokenizers::Tokenizer;

// Assume the tokenizer files live under dist/:
// - dist/tokenizer.json
// - dist/tokenizer.model
// LoadBytesFromFile is a small helper that reads a file into a std::string.

void HuggingFaceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.json");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility on how these blobs can be read.
  auto tok = Tokenizer::FromBlobJSON(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}

void SentencePieceTokenizerExample() {
  // Read blob from file.
  auto blob = LoadBytesFromFile("dist/tokenizer.model");
  // Note: all the current factory APIs take an in-memory blob as input.
  // This gives some flexibility on how these blobs can be read.
  auto tok = Tokenizer::FromBlobSentencePiece(blob);
  std::string prompt = "What is the capital of Canada?";
  // Call Encode to turn the prompt into token ids.
  std::vector<int> ids = tok->Encode(prompt);
  // Call Decode to turn the ids back into a string.
  std::string decoded_prompt = tok->Decode(ids);
}
```
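The examples above rely on a `LoadBytesFromFile` helper, which is not part of the tokenizer API itself; a minimal sketch that reads a whole file into an in-memory string:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Read the entire contents of a file into a std::string.
// Exits with an error message if the file cannot be opened.
std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (fs.fail()) {
    std::cerr << "Cannot open " << path << std::endl;
    std::exit(1);
  }
  std::string data((std::istreambuf_iterator<char>(fs)),
                   std::istreambuf_iterator<char>());
  return data;
}
```

Reading the blob yourself (rather than passing a path to the library) is what lets the factory APIs stay file-system agnostic, e.g. on platforms where assets are bundled rather than on disk.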
## Available Static Libraries

Currently, the project generates three static libraries:

- `libtokenizers_c.a`: the C binding to the tokenizers Rust library
- `libsentencepiece.a`: the sentencepiece static library
- `libtokenizers_cpp.a`: the C++ binding implementation
If you are using an IDE, you can likely first use CMake to generate these libraries and add them to your development environment. If you are using CMake, `target_link_libraries(yourlib tokenizers_cpp)` will automatically link in the other two libraries. You can also check out MLC LLM as an example of a complete LLM chat application integration.
## Wasm Support

We use emscripten to expose tokenizers-cpp to Wasm and JavaScript. Check out the web folder for more details.
## Acknowledgements

This project is only possible thanks to the shoulders of the open-source ecosystems that we stand on. It is based on sentencepiece and the tokenizers library.