codecov bot commented Feb 19, 2026 (edited)

Codecov Report ❌ Patch coverage is …
🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This PR successfully removes the `tensorflow_text` dependency by consolidating `SentencePieceTokenizerGrain` and `SentencePieceTokenizer` using the native `sentencepiece` library. The implementation correctly handles GCS loading and transitions the tokenization operation to a unified `tf.py_function` approach within `input_pipeline_utils.py`.
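For readers skimming the change, here is a minimal sketch of what the consolidated approach could look like. The class name, the `tokenize` wrapper, and the assumption of a local model path are illustrative; the actual MaxText class, `gcs_utils` helper, and argument names may differ.

```python
import sentencepiece as spm
import tensorflow as tf


class SentencePieceTokenizer:
  """Illustrative unified tokenizer backed by the native sentencepiece library."""

  def __init__(self, model_path, add_bos=False, add_eos=False):
    # In the real code the model may live on GCS and be fetched via a
    # gcs_utils helper first; here we assume model_path is a local file.
    self.sp = spm.SentencePieceProcessor()
    self.sp.Load(model_path)
    self.add_bos = add_bos
    self.add_eos = add_eos

  def encode(self, text):
    # Returns a plain Python list of token ids instead of a tf.Tensor.
    ids = self.sp.EncodeAsIds(text)
    if self.add_bos:
      ids = [self.sp.bos_id()] + ids
    if self.add_eos:
      ids = ids + [self.sp.eos_id()]
    return ids


def tokenize(tokenizer, text):
  """Hypothetical tf.data-side wrapper: runs the Python tokenizer via tf.py_function."""
  return tf.py_function(
      func=lambda t: tf.constant(tokenizer.encode(t.numpy().decode("utf-8")), dtype=tf.int32),
      inp=[text],
      Tout=tf.int32,
  )
```

Because `encode()` now yields a Python list, the same tokenizer can serve both the Grain and tfds pipelines, with the TensorFlow-specific wrapping pushed into the input pipeline.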
🔍 General Feedback
- Simplification: Merging the two classes and moving the file logic to `gcs_utils.py` significantly cleans up the tokenization architecture.
- Testing: The unit tests have been appropriately updated to accommodate the return type change (from `tf.Tensor` to `list`).
- Completeness: The removal of `tokenizer` imports in `tfds_data_processing_c4_mlperf.py` looks clean and accurate.
hengtaoguo left a comment
LGTM!
```python
elif tokenizer_type == "huggingface":
  return HFTokenizer(tokenizer_path, add_bos, add_eos, hf_access_token)
elif tokenizer_type == "sentencepiece":
  if dataset_type == "tfds":
```
Nice clean up
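As a rough sketch of the dispatch this hunk belongs to: the actual code additionally branches on `dataset_type` as shown above, and the function name, the final `else` branch, and the constructor signatures below are assumptions, not the exact MaxText code.

```python
def get_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos,
                  hf_access_token=None, dataset_type="tfds"):
  # Hypothetical factory mirroring the branch structure visible in the diff.
  if tokenizer_type == "huggingface":
    return HFTokenizer(tokenizer_path, add_bos, add_eos, hf_access_token)
  elif tokenizer_type == "sentencepiece":
    # With tensorflow_text gone, the tfds and grain pipelines can share one
    # sentencepiece-backed tokenizer; dataset_type mainly affects how the
    # encode call is wired into the pipeline (e.g. tf.py_function for tfds).
    return SentencePieceTokenizer(tokenizer_path, add_bos, add_eos)
  else:
    raise ValueError(f"Unsupported tokenizer_type: {tokenizer_type}")
```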
```diff
 def test_tokenize(self):
   text = "This is a test"
-  self.assertTrue(np.array_equal(self.source_tokenizer.encode(text).numpy(), self.test_tokenizer.encode(text).numpy()))
+  self.assertTrue(np.array_equal(self.source_tokenizer.encode(text), self.test_tokenizer.encode(text)))
```
Just curious: did the `encode()` method previously return a type other than a numpy array? Why is the `.numpy()` call no longer necessary?
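For illustration, a small sketch of the type difference in question, assuming the old path went through `tensorflow_text` (returning a `tf.Tensor`) and the new one calls the native `sentencepiece` API directly; the model path below is hypothetical.

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/tmp/tokenizer.model")  # hypothetical local model file

# Previously (tensorflow_text): encode() produced a tf.Tensor, so the test
# had to call .numpy() before comparing.
# Now (native sentencepiece): EncodeAsIds returns a plain Python list, which
# np.array_equal can compare directly, so .numpy() is no longer needed.
ids = sp.EncodeAsIds("This is a test")
print(type(ids))                 # <class 'list'>
print(np.array_equal(ids, ids))  # True
```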
Description
This is part of the bigger plan to remove the TensorFlow dependency from MaxText.
Tests
CI test
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.