Introducing Parameter Sharding and Torch backend for Tensor Parallelism #21724

Open

buildwithsuhana wants to merge 8 commits into keras-team:master from buildwithsuhana:Tensor_parallel_keras_3

Conversation

@buildwithsuhana (Contributor) commented Oct 7, 2025 (edited)

This pull request introduces a foundational framework for Tensor Parallelism in Keras (parameter_sharding.py), enabling the training of large-scale models by sharding their parameters across multiple devices. This is a significant step towards supporting advanced distributed training strategies directly within the Keras ecosystem.

The core of this contribution is a new, backend-agnostic parameter sharding framework and the necessary distributed communication primitives for the PyTorch backend.

Key Changes

PyTorch Distributed Backend
A new distributed_backend.py module has been added for the PyTorch backend.

It implements the essential collective communication operations (all_reduce, all_gather, broadcast, scatter) using the torch.distributed package.

It also provides helper functions for gradient computation (compute_gradients) and device management, aligning its interface with the other Keras backends. A minimal sketch of these collectives is shown below.
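
As an illustration of what such a module can look like, here is a minimal sketch of the four collectives built on torch.distributed. The function names mirror the list above, but the actual signatures in distributed_backend.py may differ.

```python
# Minimal sketch of torch.distributed-backed collectives; the real
# distributed_backend.py in this PR may use different signatures.
import torch
import torch.distributed as dist


def all_reduce(tensor):
    """Sum `tensor` across all ranks, in place, and return it."""
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor


def all_gather(tensor):
    """Gather `tensor` from every rank and concatenate along axis 0."""
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)
    return torch.cat(gathered, dim=0)


def broadcast(tensor, src=0):
    """Copy `tensor` from rank `src` to every other rank, in place."""
    dist.broadcast(tensor, src=src)
    return tensor


def scatter(output, scatter_list=None, src=0):
    """Fill `output` with this rank's shard of `scatter_list` from rank `src`."""
    dist.scatter(output, scatter_list=scatter_list, src=src)
    return output
```

All of these assume the process group has already been initialized (e.g. via dist.init_process_group) before any collective is called.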

Parameter Sharding Framework
Introduces a powerful parameter sharding API under keras/src/distribution/tensor_parallel/.

ParameterShardingStrategy: A new class that manages the logic for splitting model weights based on user-defined rules specified in a ConfigKeras object.

ShardedWeight: A wrapper class for sharded keras.Variable objects, allowing them to be seamlessly integrated into the model.

make_parameter_sharded_model: A factory function that takes a standard Keras model and returns a sharded version, automatically handling the weight splitting and model wrapping. The wrapped ParameterShardedModel injects communication ops (e.g., all-reduce) into the forward pass to ensure correct computations.

Example usage: https://colab.research.google.com/drive/1UAINIcstDuO0aeA9lxCF5LaIj5ne5X5z?resourcekey=0-pPF4COO19KRoqS5cpWNILA&usp=sharing
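
Based on the API named above, a hypothetical end-to-end usage might look like the following. The import paths, the ConfigKeras rule schema, and the make_parameter_sharded_model signature are assumptions for illustration, not the PR's exact interface; the linked Colab is the authoritative example.

```python
# Hypothetical usage sketch; import paths, the ConfigKeras rule schema, and
# the make_parameter_sharded_model signature are assumed, not taken verbatim
# from the PR.
import keras
from keras.src.distribution.tensor_parallel.parameter_sharding import (
    make_parameter_sharded_model,
)
from keras.src.distribution.tensor_parallel.config import ConfigKeras

# A plain Keras model whose Dense kernels we want to shard.
model = keras.Sequential(
    [
        keras.layers.Input(shape=(1024,)),
        keras.layers.Dense(4096, activation="relu", name="up_proj"),
        keras.layers.Dense(1024, name="down_proj"),
    ]
)

# User-defined sharding rules (assumed schema): split up_proj's kernel along
# its output axis and down_proj's kernel along its input axis.
config = ConfigKeras(
    rules={
        "up_proj.*kernel": {"axis": 1},
        "down_proj.*kernel": {"axis": 0},
    }
)

# Wrap the model; the returned ParameterShardedModel holds ShardedWeight
# variables and injects an all-reduce into the forward pass.
sharded_model = make_parameter_sharded_model(
    model, config, rank=0, world_size=2
)
```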

This is the 3rd (out of 4) PR for AutoSharding Keras.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @buildwithsuhana, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Keras's capabilities for large-scale model training by introducing a robust framework for Tensor Parallelism. It provides a new PyTorch-specific distributed backend with collective communication primitives and a flexible parameter sharding mechanism. This allows Keras models to efficiently distribute their parameters across multiple devices, paving the way for more advanced distributed training strategies within the Keras ecosystem.

Highlights

  • PyTorch Distributed Backend: A new distributed_backend.py module is introduced for PyTorch, providing essential collective communication operations like all_reduce, all_gather, broadcast, and scatter, along with gradient computation and device management utilities.
  • Parameter Sharding Framework: A foundational framework for parameter sharding is added under keras/src/distribution/tensor_parallel/. This includes ParameterShardingStrategy for managing weight splitting, ShardedWeight for wrapping sharded variables, and make_parameter_sharded_model for creating sharded Keras models.
  • Backend Integration: The new PyTorch distributed backend is integrated into the Keras backend system, and a placeholder distributed_backend = None is added for other backends (TensorFlow, JAX, NumPy, OpenVINO) to maintain consistency. A sketch of how callers might guard on this placeholder follows the list.
  • Comprehensive Testing: New unit tests are added for both the PyTorch distributed backend functions and the parameter sharding framework, ensuring correctness and reliability of the new features.
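
For the placeholder mentioned in the third bullet, backend-agnostic code presumably needs to tolerate distributed_backend being None on backends without collective support. A minimal sketch of such a guard, assuming the attribute described above:

```python
# Sketch of guarding on the optional distributed backend; assumes the
# `distributed_backend` placeholder described in the highlights.
from keras.src import backend


def maybe_all_reduce(tensor):
    """All-reduce across devices when supported; otherwise return unchanged."""
    dist = getattr(backend, "distributed_backend", None)
    if dist is None:
        # Backends that only expose the placeholder fall back to a no-op.
        return tensor
    return dist.all_reduce(tensor)
```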

@gemini-code-assist[bot] left a comment
Code Review

This pull request introduces a foundational framework for Tensor Parallelism in Keras and adds a Torch backend for distributed communication. The changes are substantial and add significant new capabilities. My review focuses on the correctness, generality, and test coverage of this new framework. I've identified some critical issues, such as backend-specific implementations in what should be a backend-agnostic framework, and tests that don't cover the new Torch implementation. There are also opportunities to improve code quality by removing hardcoded logic and reducing code duplication. Addressing these points will help ensure the new Tensor Parallelism framework is robust and maintainable.

@codecov-commenter commented Oct 15, 2025 (edited)

Codecov Report

❌ Patch coverage is 53.51812% with 218 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.40%. Comparing base (26d7166) to head (6b2efcc).
⚠️ Report is 26 commits behind head on master.

Files with missing lines                                    Patch %   Lines
keras/src/backend/torch/distribution_lib.py                 30.87%    95 Missing and 8 partials ⚠️
...distribution/tensor_parallel/parameter_sharding.py       71.29%    46 Missing and 14 partials ⚠️
keras/src/backend/torch/distributed_backend.py              35.41%    26 Missing and 5 partials ⚠️
keras/src/backend/jax/distributed_backend.py                29.41%    12 Missing ⚠️
.../src/distribution/tensor_parallel/tensor_layout.py       69.23%    10 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21724      +/-   ##
==========================================
- Coverage   82.58%   82.40%   -0.19%
==========================================
  Files         572      577       +5
  Lines       58187    59004     +817
  Branches     9116     9243     +127
==========================================
+ Hits        48055    48621     +566
- Misses       7808     8029     +221
- Partials     2324     2354      +30
Flag                Coverage Δ
keras               82.20% <53.51%> (-0.19%) ⬇️
keras-jax           63.08% <47.12%> (-0.31%) ⬇️
keras-numpy         57.26% <17.91%> (-0.45%) ⬇️
keras-openvino      35.07% <17.91%> (+0.80%) ⬆️
keras-tensorflow    63.60% <17.91%> (-0.52%) ⬇️
keras-torch         63.19% <23.45%> (-0.52%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@buildwithsuhana marked this pull request as ready for review October 15, 2025 20:29

Reviewers

@gemini-code-assist[bot] left review comments (1 more reviewer; approvals may not affect merge requirements)

Assignees

@gbaned

Projects

None yet

Milestone

No milestone

3 participants

@buildwithsuhana, @codecov-commenter, @gbaned
