[c10d][fr] Split cuda and non-cuda fr logic into two cpp file #154929
pytorch-bot bot commented Jun 2, 2025 • edited
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154929
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending
As of commit 6674381 with merge base 0d0058d (UNSTABLE: a job is marked as unstable, possibly due to flakiness on trunk).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
kwen2501 left a comment
LGTM
fduwjj commented Jun 3, 2025
@pytorchbot merge
pytorchmergebot commented Jun 3, 2025
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
fduwjj commented Jun 3, 2025
@fduwjj has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This is a first quick prototype of the FR integration for gloo. A few feature gaps remain:
- Input/Output numels for each collective
- Whether to use c10::Event, or where to use it.
- Where to dump the FR traces. (The dump API is provided in this PR.)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601)
Pull Request resolved: #152614
Approved by: https://github.com/d4l3k
ghstack dependencies: #154929
…h#154929)

While integrating FR with gloo, I found that putting all the logic inside one cpp file guarded by both build macros does not work with the current linkage setup in the Bazel file. If we put the cpp file in libtorch_cpu, the CUDA-side build fails; if we put it in both, the linker complains: ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter. To fix this, we move the common logic into another header file and use different cpp files for CPU and CUDA, so that FR can be used in both cases.

Pull Request resolved: pytorch#154929
Approved by: https://github.com/kwen2501
Stack from ghstack (oldest at bottom):
While integrating FR (Flight Recorder) with gloo, I found that putting all the logic inside one cpp file guarded by both build macros does not work with the current linkage setup in the Bazel file. If we put the cpp file in libtorch_cpu, the CUDA-side build fails; if we put it in both, the linker complains: ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter. To fix this, we move the common logic into another header file and use different cpp files for CPU and CUDA, so that FR can be used in both cases.
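
The header-plus-two-cpp layout described above can be illustrated with a small sketch. This is not the code from this PR: `Recorder`, `CpuEvent`, and `FakeCudaEvent` are hypothetical names, and the three "files" are collapsed into one block so it compiles on its own. It assumes the shared logic can be expressed as a template, so that each library instantiates only its own variant instead of both compiling the same non-template definitions.

```cpp
// Sketch only: all names are hypothetical and are not the PyTorch API.
#include <cstddef>
#include <string>
#include <vector>

// ---- would live in the shared header ----
// Template (and inline) definitions may be included by several translation
// units without triggering duplicate-symbol errors at link time.
template <typename EventT>
class Recorder {
 public:
  // Stores one entry tagged with the backend name and returns its index.
  std::size_t record(const std::string& op) {
    entries_.push_back(EventT::tag() + ":" + op);
    return entries_.size() - 1;
  }

  const std::vector<std::string>& entries() const { return entries_; }

 private:
  std::vector<std::string> entries_;
};

// ---- would live in the CPU-only .cpp (built into the CPU library) ----
struct CpuEvent {
  static std::string tag() { return "cpu"; }
};
template class Recorder<CpuEvent>;  // explicit instantiation for CPU

// ---- would live in the CUDA-only .cpp (built into the CUDA library) ----
// A stand-in type; a real build would use an actual CUDA event wrapper here.
struct FakeCudaEvent {
  static std::string tag() { return "cuda"; }
};
template class Recorder<FakeCudaEvent>;  // explicit instantiation for CUDA
```

With this shape, a CPU-only caller would use `Recorder<CpuEvent>` and a CUDA caller `Recorder<FakeCudaEvent>`; neither library needs to compile the other's source file.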
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k
Differential Revision: D75877197