Emulating multiple devices with a single GPU #8630


Hello,

I have a single GPU, but I would like to spawn multiple replicas on that single GPU and train a model with DDP. Of course, each replica would have to use a smaller batch size in order to fit in memory. (For my use case, I am not interested in having a single replica with a large batch size).

I tried to pass --gpus "0,0" to the Lightning Trainer, and it managed to spawn two processes on the same GPU:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2

But in the end it crashed with RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage.
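For concreteness, a minimal sketch of the setup that produces this crash; `MyModule` and `train_loader` are placeholders (not code from this thread), and the PL 1.x argument spellings are assumed:

```python
import pytorch_lightning as pl

# Listing the same physical device twice asks Lightning to start two DDP
# processes that both use GPU 0. With the default NCCL backend this fails,
# because NCCL does not allow one CUDA device to back two ranks of the same
# communicator.
trainer = pl.Trainer(
    gpus="0,0",          # same GPU index repeated -> two replicas on one card
    accelerator="ddp",   # PL 1.x spelling; newer releases use strategy="ddp"
    max_epochs=1,
)

# MyModule and train_loader stand in for any LightningModule / DataLoader.
trainer.fit(MyModule(), train_loader)  # fails with the NCCL "invalid usage" error above
```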

Please, is there any way to split a single GPU into multiple replicas with Lightning?
Thanks!

P.S.: Ray has really nice support for fractional GPUs: https://docs.ray.io/en/master/using-ray-with-gpus.html#fractional-gpus. I've never used them with Lightning, but maybe it could be a workaround?
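As a rough illustration of Ray's fractional-GPU scheduling (plain Ray tasks, no Lightning integration); the task body is a placeholder for real training code:

```python
import os
import ray

ray.init(num_gpus=1)  # one physical GPU advertised to the Ray scheduler

@ray.remote(num_gpus=0.5)  # each task reserves half of the GPU
def train_replica(replica_id: int) -> str:
    # Placeholder body: a real replica would build and train its model here.
    # Ray restricts the task to its assigned GPU via CUDA_VISIBLE_DEVICES.
    return f"replica {replica_id} sees GPU(s): {os.environ.get('CUDA_VISIBLE_DEVICES')}"

# Two half-GPU tasks fit on the single card and run concurrently.
print(ray.get([train_replica.remote(i) for i in range(2)]))
```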

Answered by @tholop, Aug 3, 2021

For reference: it seems to be possible when the backend is gloo instead of nccl. See the discussion here: #8630 (reply in thread).

Replies: 2 comments, 11 replies


Hmm interesting use case.

AFAIU it is not possible, at least with torch.distributed. When using GPUs, both the gloo and nccl backends use https://github.com/NVIDIA/nccl under the hood, which does not support the semantics you described:

From https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html:

Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs.

It could probably be done if you write custom gradient-syncing logic that moves gradients to RAM before syncing and syncs them with a gloo process group.
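A rough sketch of that idea (illustrative only, not Lightning's internals), assuming each process has already called dist.init_process_group("gloo", ...) and finished its backward pass:

```python
import torch
import torch.distributed as dist

def allreduce_grads_via_cpu(model: torch.nn.Module) -> None:
    """Average gradients across ranks through a gloo (CPU) process group.

    Sketch only: call this after loss.backward() in every process, before
    optimizer.step(). Assumes the gloo process group is already initialized.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        grad_cpu = param.grad.detach().cpu()          # move to RAM for gloo
        dist.all_reduce(grad_cpu, op=dist.ReduceOp.SUM)
        param.grad.copy_(grad_cpu / world_size)       # copy_ moves it back to the GPU
```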

8 replies
@awaelchli:

@ananthsub yes, potentially in our parsing, or alternatively in the plugin, which already gets the list of devices.

@yifuwang:

@justusschock sorry, I'm not very familiar with how MPI works with GPUs in torch.distributed. However, if it relies on NCCL under the hood, I'd guess it still wouldn't work.

@tholop:

@yifuwang I tried a bit more, and it actually worked with gloo. After adding PL_TORCH_DISTRIBUTED_BACKEND=gloo, I was able to run Lightning training successfully with 2 replicas on a single GPU.
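For reference, a minimal self-contained sketch of the configuration reported to work here. The toy module and random data are illustrative only, and the PL 1.x spellings gpus= and accelerator="ddp" are assumed (newer releases use strategy="ddp"); the environment variable can equally be exported in the shell before launching:

```python
import os
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Tell Lightning to initialize torch.distributed with gloo instead of nccl.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

def main():
    dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=16)
    trainer = pl.Trainer(
        gpus="0,0",          # two replicas on the same physical GPU
        accelerator="ddp",   # PL 1.x spelling; newer releases use strategy="ddp"
        max_epochs=1,
    )
    trainer.fit(ToyModule(), loader)

if __name__ == "__main__":
    main()
```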

@dmarx:

@tholop did you experience the same kind of speedup you'd expect if you were training on multiple separate physical devices?

PS: I haven't experimented yet, but I suspect you might be able to apply the fractional-GPU capability in Ray to achieve something like this.

@tholop:

@dmarx I did experience a speedup, but not as good as having separate physical devices. I didn't benchmark thoroughly though.

I totally agree regarding Ray's fractional GPUs! I mentioned it in the original issue as a possible workaround, but it might require a bit more work than just passing a string to Lightning.

@tholop:

For reference: it seems to be possible when the backend is gloo instead of nccl. See the discussion here: #8630 (reply in thread).

3 replies
@ksasso1028:

PyTorch Lightning complains about using the same device ID in the current version; any workaround? @tholop I'm certainly interested in this to get more steps vs. batches.

@haleyso:

Hi @ksasso1028, I'm running into the same thing with PyTorch Lightning right now. Did you happen to find a workaround?

Update: I just commented out the check-unique-IDs call in device_parser.py ... so far it's working OK.
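For anyone who prefers not to edit the installed package, a monkey-patch along these lines might achieve the same thing. The module path and the `_check_unique` helper name are assumptions about the 1.x code base, so verify them against your installed version:

```python
# Hypothetical workaround: disable the duplicate-device-ID check before building
# the Trainer. `_check_unique` is assumed to exist in this module (true for some
# PyTorch Lightning 1.x releases); check your installed version before relying on it.
import pytorch_lightning.utilities.device_parser as device_parser

device_parser._check_unique = lambda device_ids: None  # no-op the uniqueness check
```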

@ashok-arora:

Does it work with a multi-node setup too, or is it single-node multi-GPU only?

Answer selected by @tholop