codecov bot commented Feb 19, 2026 (edited)

Codecov Report ❌ Patch coverage is …
🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This PR successfully removes the `tensorflow_text` dependency by consolidating `SentencePieceTokenizerGrain` and `SentencePieceTokenizer` using the native `sentencepiece` library. The implementation correctly handles GCS loading and transitions the tokenization operation to a unified `tf.py_function` approach within `input_pipeline_utils.py`.
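For readers skimming the change, here is a minimal sketch of what the consolidated approach could look like. The class name, the `tokenize` wrapper, and the assumption of a local model path are illustrative; the actual MaxText class, `gcs_utils` helper, and argument names may differ.

```python
import sentencepiece as spm
import tensorflow as tf


class SentencePieceTokenizer:
  """Illustrative unified tokenizer backed by the native sentencepiece library."""

  def __init__(self, model_path, add_bos=False, add_eos=False):
    # In the real code the model may live on GCS and be fetched via a
    # gcs_utils helper first; here we assume model_path is a local file.
    self.sp = spm.SentencePieceProcessor()
    self.sp.Load(model_path)
    self.add_bos = add_bos
    self.add_eos = add_eos

  def encode(self, text):
    # Returns a plain Python list of token ids instead of a tf.Tensor.
    ids = self.sp.EncodeAsIds(text)
    if self.add_bos:
      ids = [self.sp.bos_id()] + ids
    if self.add_eos:
      ids = ids + [self.sp.eos_id()]
    return ids


def tokenize(tokenizer, text):
  """Hypothetical tf.data-side wrapper: runs the Python tokenizer via tf.py_function."""
  return tf.py_function(
      func=lambda t: tf.constant(tokenizer.encode(t.numpy().decode("utf-8")), dtype=tf.int32),
      inp=[text],
      Tout=tf.int32,
  )
```

Because `encode()` now yields a Python list, the same tokenizer can serve both the Grain and tfds pipelines, with the TensorFlow-specific wrapping pushed into the input pipeline.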
🔍 General Feedback
- Simplification: Merging the two classes and moving the file logic to `gcs_utils.py` significantly cleans up the tokenization architecture.
- Testing: The unit tests have been appropriately updated to accommodate the return type change (from `tf.Tensor` to `list`).
- Completeness: The removal of `tokenizer` imports in `tfds_data_processing_c4_mlperf.py` looks clean and accurate.
hengtaoguo left a comment
LGTM!
```python
elif tokenizer_type == "huggingface":
  return HFTokenizer(tokenizer_path, add_bos, add_eos, hf_access_token)
elif tokenizer_type == "sentencepiece":
  if dataset_type == "tfds":
```
Nice clean up
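As a rough sketch of the dispatch this hunk belongs to: the actual code additionally branches on `dataset_type` as shown above, and the function name, the final `else` branch, and the constructor signatures below are assumptions, not the exact MaxText code.

```python
def get_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos,
                  hf_access_token=None, dataset_type="tfds"):
  # Hypothetical factory mirroring the branch structure visible in the diff.
  if tokenizer_type == "huggingface":
    return HFTokenizer(tokenizer_path, add_bos, add_eos, hf_access_token)
  elif tokenizer_type == "sentencepiece":
    # With tensorflow_text gone, the tfds and grain pipelines can share one
    # sentencepiece-backed tokenizer; dataset_type mainly affects how the
    # encode call is wired into the pipeline (e.g. tf.py_function for tfds).
    return SentencePieceTokenizer(tokenizer_path, add_bos, add_eos)
  else:
    raise ValueError(f"Unsupported tokenizer_type: {tokenizer_type}")
```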
```diff
 def test_tokenize(self):
   text = "This is a test"
-  self.assertTrue(np.array_equal(self.source_tokenizer.encode(text).numpy(), self.test_tokenizer.encode(text).numpy()))
+  self.assertTrue(np.array_equal(self.source_tokenizer.encode(text), self.test_tokenizer.encode(text)))
```
Just curious: did the `encode()` method previously return a type other than a numpy array? Why is the `.numpy()` call no longer necessary?
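For illustration, a small sketch of the type difference in question, assuming the old path went through `tensorflow_text` (returning a `tf.Tensor`) and the new one calls the native `sentencepiece` API directly; the model path below is hypothetical.

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/tmp/tokenizer.model")  # hypothetical local model file

# Previously (tensorflow_text): encode() produced a tf.Tensor, so the test
# had to call .numpy() before comparing.
# Now (native sentencepiece): EncodeAsIds returns a plain Python list, which
# np.array_equal can compare directly, so .numpy() is no longer needed.
ids = sp.EncodeAsIds("This is a test")
print(type(ids))                 # <class 'list'>
print(np.array_equal(ids, ids))  # True
```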
Description
This is part of the bigger plan to remove the TensorFlow dependency from MaxText.
Tests
CI test
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.