Tokenizer supports multiple encodings, compatible with .NET Standard 2.0 #218
base:dev
…e encoding: r50k_base, p50k_base, cl100k_base; supports the Encode and Decode methods.
```C#
// Three alternative ways to construct a tokenizer (use one):
Tokenizer tokenizer = new Tokenizer("cl100k_base");
// Tokenizer tokenizer = new Tokenizer().FromModelName("gpt-3.5-turbo-0301");
// Tokenizer tokenizer = new Tokenizer().FromModel(Models.Model.TextDavinciV3);

string str = @"床前明月光,疑是地上霜,举头望明月,低头思故乡。";

// Encode the text into token ids.
int[] res = tokenizer.Encode(str);
// res = [ 11795 232 25580 31958 9953 6708 231 3922 163 244 239 21043 30590 17905 52597 250 3922 3574 122 65455 4916 249 31958 9953 3922 8687 236 65455 91763 8067 227 18259 94 1811]

// Decode the token ids back into the original text.
string str2 = tokenizer.Decode(res);
// str2 = "床前明月光,疑是地上霜,举头望明月,低头思故乡。"
```
Hey @Frogley, I haven't forgotten about your PR. I am just trying to understand how the tokenizer works and comparing it against your PR, which is taking up a lot of time. I apologize for the delay. :/
Great. To be honest, my understanding of the tokenizer's core algorithm is somewhat vague; I didn't fully grasp it. Basically, my PR is a translation of tiktoken/lib.rs from Rust into C#, with some simplifications. After the translation was complete, I ran a few test cases and the results were consistent. But I didn't do any extensive testing or comparison. I hope my work can be of help to you.
Tokenizer supports multiple encodings: r50k_base, p50k_base, cl100k_base; supports the Encode and Decode methods.
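For readers following the discussion of the core algorithm: the heart of byte-pair encoding, as done in tiktoken/lib.rs, is a loop that repeatedly merges the adjacent pair with the lowest rank in the vocabulary. The sketch below is only an illustration of that idea, not the code in this PR. The class `BpeSketch`, the method `BytePairEncode`, and the toy rank table are hypothetical, and string keys stand in for the byte sequences a real vocabulary would use.

```C#
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the greedy merge loop; not the PR's implementation.
public static class BpeSketch
{
    // Encodes one piece of bytes by repeatedly merging the adjacent pair
    // whose concatenation has the lowest rank in the vocabulary.
    // Assumption: every single byte in the piece is present in `ranks`.
    public static List<int> BytePairEncode(byte[] piece, Dictionary<string, int> ranks)
    {
        // Start with one part per byte; each byte maps 1:1 to a char.
        var parts = piece.Select(b => ((char)b).ToString()).ToList();

        while (parts.Count > 1)
        {
            int bestIndex = -1;
            int bestRank = int.MaxValue;

            // Find the leftmost adjacent pair with the lowest rank.
            for (int i = 0; i < parts.Count - 1; i++)
            {
                if (ranks.TryGetValue(parts[i] + parts[i + 1], out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0) break; // no mergeable pair is left

            // Merge the chosen pair into a single part.
            parts[bestIndex] = parts[bestIndex] + parts[bestIndex + 1];
            parts.RemoveAt(bestIndex + 1);
        }

        // Map each remaining part to its token id (its rank).
        return parts.Select(p => ranks[p]).ToList();
    }
}
```

With a toy table `{"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}`, encoding the bytes of `"abcab"` merges `a b` into `ab` twice, then `ab c` into `abc`, and returns `[4, 3]`. A real encoder first splits the input with a regular expression and runs this loop on each piece separately.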