Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Comments

remove tf dependency from tokenizer.py#3196

Open
aireenmei wants to merge 1 commit intomainfrom
aireen/tokenizer_no_tf
Open

remove tf dependency from tokenizer.py#3196
aireenmei wants to merge 1 commit intomainfrom
aireen/tokenizer_no_tf

Conversation

@aireenmei
Copy link
Collaborator

@aireenmeiaireenmei commentedFeb 19, 2026
edited
Loading

Description

This is part of the bigger plan of removing tensorflow dependency in MaxText

  • combine SentencePieceTokenizerGrain and SentencePieceTokenizer under the name SentencePieceTokenizer
  • Uses the sentencepiece library to load sentencepiece tokenizer instead of the tensorflow_text library
  • replace tf.io.gfile with Google Storage client for gs:// path
  • move TokenizeOp to input_pipeline_utils.py, in the future will put functions using tf to a separate input_pipeline_utils_tf.py to make installing tf optional

Tests

CI test

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add thegemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained inour documentation.

@codecov
Copy link

codecovbot commentedFeb 19, 2026
edited
Loading

Codecov Report

❌ Patch coverage is35.13514% with24 lines in your changes missing coverage. Please review.

Files with missing linesPatch %Lines
src/maxtext/utils/gcs_utils.py15.38%11 Missing⚠️
src/maxtext/input_pipeline/input_pipeline_utils.py36.36%7 Missing⚠️
src/maxtext/input_pipeline/tokenizer.py45.45%5 Missing and 1 partial⚠️

📢 Thoughts on this report?Let us know!

@github-actions
Copy link

🤖 Hi@aireenmei, I've received your request, and I'm working on it now! You can track my progressin the logs for more details.

Copy link

@github-actionsgithub-actionsbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

## 📋 Review Summary

This PR successfully removes thetensorflow_text dependency by consolidatingSentencePieceTokenizerGrain andSentencePieceTokenizer using the nativesentencepiece library. The implementation correctly handles GCS loading and transitions the tokenization operation to a unifiedtf.py_function approach withininput_pipeline_utils.py.

🔍 General Feedback

  • Simplification: Merging the two classes and moving the file logic togcs_utils.py significantly cleans up the tokenization architecture.
  • Testing: The unit tests have been appropriately updated to accommodate the return type change (fromtf.Tensor tolist).
  • Completeness: The removal oftokenizer imports intfds_data_processing_c4_mlperf.py looks clean and accurate.

Copy link
Collaborator

@hengtaoguohengtaoguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM!

elif tokenizer_type == "huggingface":
return HFTokenizer(tokenizer_path, add_bos, add_eos, hf_access_token)
elif tokenizer_type == "sentencepiece":
if dataset_type == "tfds":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice clean up

def test_tokenize(self):
text = "This is a test"
self.assertTrue(np.array_equal(self.source_tokenizer.encode(text).numpy(), self.test_tokenizer.encode(text).numpy()))
self.assertTrue(np.array_equal(self.source_tokenizer.encode(text), self.test_tokenizer.encode(text)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just curious—did the encode() method previously return a type other than a numpy array? Why it's not necessary any more?

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@github-actionsgithub-actions[bot]github-actions[bot] left review comments

@hengtaoguohengtaoguohengtaoguo approved these changes

@gobbleturkgobbleturkAwaiting requested review from gobbleturkgobbleturk is a code owner

@khatwanimohitkhatwanimohitAwaiting requested review from khatwanimohitkhatwanimohit is a code owner

@bvandermoonbvandermoonAwaiting requested review from bvandermoonbvandermoon is a code owner

@vipannallavipannallaAwaiting requested review from vipannallavipannalla is a code owner

@RissyRanRissyRanAwaiting requested review from RissyRanRissyRan is a code owner

@richjames0richjames0Awaiting requested review from richjames0richjames0 is a code owner

@gagikagagikaAwaiting requested review from gagikagagika is a code owner

@shralexshralexAwaiting requested review from shralexshralex is a code owner

@SurbhiJainUSCSurbhiJainUSCAwaiting requested review from SurbhiJainUSCSurbhiJainUSC is a code owner

@A9ishaA9ishaAwaiting requested review from A9ishaA9isha is a code owner

@NuojChengNuojChengAwaiting requested review from NuojChengNuojCheng is a code owner

@jiangjy1982jiangjy1982Awaiting requested review from jiangjy1982jiangjy1982 is a code owner

@suexu1025suexu1025Awaiting requested review from suexu1025suexu1025 is a code owner

@NicoGrandeNicoGrandeAwaiting requested review from NicoGrandeNicoGrande is a code owner

@jesselu-googlejesselu-googleAwaiting requested review from jesselu-googlejesselu-google is a code owner

At least 2 approving reviews are required to merge this pull request.

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@aireenmei@hengtaoguo

[8]ページ先頭

©2009-2026 Movatter.jp