Tokenizer supports multiple encodings, compatible with .NET Standard 2.0 #218

Open
Frogley wants to merge 4 commits into betalgo:dev from Frogley:feature/Tokenizer-multiple-encodings-DotnetStandard2.0

Conversation

Frogley

The tokenizer now supports multiple encodings (r50k_base, p50k_base, cl100k_base) and provides both Encode and Decode methods.

```C#
// Three alternative ways to construct a tokenizer; use one of them.
Tokenizer tokenizer = new Tokenizer("cl100k_base");
// Tokenizer tokenizer = new Tokenizer().FromModelName("gpt-3.5-turbo-0301");
// Tokenizer tokenizer = new Tokenizer().FromModel(Models.Model.TextDavinciV3);

string str = @"床前明月光,疑是地上霜,举头望明月,低头思故乡。";

int[] res = tokenizer.Encode(str);
// res = [ 11795 232 25580 31958 9953 6708 231 3922 163 244 239 21043 30590 17905 52597 250 3922 3574 122 65455 4916 249 31958 9953 3922 8687 236 65455 91763 8067 227 18259 94 1811]

string str2 = tokenizer.Decode(res);
// str2 = "床前明月光,疑是地上霜,举头望明月,低头思故乡。"
```

kayhantolga and others added 4 commits March 20, 2023 18:32
Frogley changed the base branch from master to dev on April 7, 2023 04:52
kayhantolga (Member)

Hey @Frogley, I haven't forgotten about your PR. I am trying to understand how the tokenizer works and comparing that against your PR, which is taking a lot of time. I apologize for the delay. :/

Frogley (Author)

Great. To be honest, my understanding of the tokenizer's core algorithm is somewhat vague; I didn't fully grasp it. Basically, my PR is a translation of tiktoken/lib.rs from Rust into C#, with some simplifications. After the translation was complete, I ran a few test cases and the results were consistent, but I didn't do any extensive testing or comparison. I hope my work can be of help to you.
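
For anyone following the thread, here is a minimal sketch of the greedy lowest-rank merge that byte-pair-encoding tokenizers such as tiktoken are built around. It is not the code in this PR and not a translation of lib.rs: `BpeSketch`, `ToyRanks`, and the five-entry vocabulary are hypothetical names and data invented for illustration. It also works on characters rather than bytes and skips the regex pre-splitting step that real encodings apply first.

```C#
using System;
using System.Collections.Generic;
using System.Linq;

public static class BpeSketch
{
    // Hypothetical toy vocabulary: token text -> rank (also used as the token id).
    // A lower rank means the pair is merged earlier. Not a real tiktoken table.
    public static readonly Dictionary<string, int> ToyRanks = new Dictionary<string, int>
    {
        ["a"] = 0, ["b"] = 1, ["c"] = 2,
        ["ab"] = 3, ["abc"] = 4,
    };

    public static List<int> Encode(string piece, IReadOnlyDictionary<string, int> ranks)
    {
        // Start with one part per character (real implementations start from bytes).
        List<string> parts = piece.Select(c => c.ToString()).ToList();

        while (parts.Count > 1)
        {
            // Find the adjacent pair whose concatenation has the lowest rank.
            int bestIndex = -1;
            int bestRank = int.MaxValue;
            for (int i = 0; i < parts.Count - 1; i++)
            {
                if (ranks.TryGetValue(parts[i] + parts[i + 1], out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            // No adjacent pair is in the vocabulary: merging is finished.
            if (bestIndex < 0) break;

            // Merge the winning pair into a single part.
            parts[bestIndex] = parts[bestIndex] + parts[bestIndex + 1];
            parts.RemoveAt(bestIndex + 1);
        }

        // Map each remaining part to its id; throws if a single character is not in the table.
        return parts.Select(p => ranks[p]).ToList();
    }

    public static string Decode(IEnumerable<int> ids, IReadOnlyDictionary<string, int> ranks)
    {
        // Invert the table and concatenate the token texts.
        Dictionary<int, string> byId = ranks.ToDictionary(kv => kv.Value, kv => kv.Key);
        return string.Concat(ids.Select(id => byId[id]));
    }
}
```

With the toy table, `BpeSketch.Encode("abcab", BpeSketch.ToyRanks)` yields `[4, 3]` (the parts `abc` and `ab`), and `BpeSketch.Decode` turns those ids back into `abcab`. The same round-trip check, run against reference token ids, is one way to compare a port with the original on more inputs.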

kayhantolga added this to the 8.0.4 milestone on Apr 10, 2024
kayhantolga removed this from the 8.4.3 milestone on Oct 10, 2024