Emulating multiple devices with a single GPU #8630


Hello,

I have a single GPU, but I would like to spawn multiple replicas on that single GPU and train a model with DDP. Of course, each replica would have to use a smaller batch size in order to fit in memory. (For my use case, I am not interested in having a single replica with a large batch size).

I tried to pass --gpus "0,0" to the Lightning Trainer, and it managed to spawn two processes on the same GPU:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2

But in the end it crashed with RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage.
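For concreteness, a minimal sketch of the setup that produces this crash; `MyModule` and `train_loader` are placeholders (not code from this thread), and the PL 1.x argument spellings are assumed:

```python
import pytorch_lightning as pl

# Listing the same physical device twice asks Lightning to start two DDP
# processes that both use GPU 0. With the default NCCL backend this fails,
# because NCCL does not allow one CUDA device to back two ranks of the same
# communicator.
trainer = pl.Trainer(
    gpus="0,0",          # same GPU index repeated -> two replicas on one card
    accelerator="ddp",   # PL 1.x spelling; newer releases use strategy="ddp"
    max_epochs=1,
)

# MyModule and train_loader stand in for any LightningModule / DataLoader.
trainer.fit(MyModule(), train_loader)  # fails with the NCCL "invalid usage" error above
```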

Please, is there any way to split a single GPU into multiple replicas with Lightning?
Thanks!

P.S.: Ray has really nice support for fractional GPUs: https://docs.ray.io/en/master/using-ray-with-gpus.html#fractional-gpus. I've never used them with Lightning, but maybe it could be a workaround?
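As a rough illustration of Ray's fractional-GPU scheduling (plain Ray tasks, no Lightning integration); the task body is a placeholder for real training code:

```python
import os
import ray

ray.init(num_gpus=1)  # one physical GPU advertised to the Ray scheduler

@ray.remote(num_gpus=0.5)  # each task reserves half of the GPU
def train_replica(replica_id: int) -> str:
    # Placeholder body: a real replica would build and train its model here.
    # Ray restricts the task to its assigned GPU via CUDA_VISIBLE_DEVICES.
    return f"replica {replica_id} sees GPU(s): {os.environ.get('CUDA_VISIBLE_DEVICES')}"

# Two half-GPU tasks fit on the single card and run concurrently.
print(ray.get([train_replica.remote(i) for i in range(2)]))
```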

Answered by @tholop, Aug 3, 2021

For reference: it seems to be possible when the backend is gloo instead of nccl. See the discussion here: #8630 (reply in thread).

Replies: 2 comments, 11 replies


Hmm interesting use case.

AFAIU it is not possible, at least with torch.distributed. When using GPUs, both the gloo and nccl backends use https://github.com/NVIDIA/nccl under the hood, which does not support the semantics you described:

From https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html:

Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs.

It could probably be done if you write custom gradient-syncing logic that moves gradients to RAM before syncing and syncs them with a gloo process group.
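A rough sketch of that idea (illustrative only, not Lightning's internals), assuming each process has already called dist.init_process_group("gloo", ...) and finished its backward pass:

```python
import torch
import torch.distributed as dist

def allreduce_grads_via_cpu(model: torch.nn.Module) -> None:
    """Average gradients across ranks through a gloo (CPU) process group.

    Sketch only: call this after loss.backward() in every process, before
    optimizer.step(). Assumes the gloo process group is already initialized.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        grad_cpu = param.grad.detach().cpu()          # move to RAM for gloo
        dist.all_reduce(grad_cpu, op=dist.ReduceOp.SUM)
        param.grad.copy_(grad_cpu / world_size)       # copy_ moves it back to the GPU
```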

8 replies
@awaelchli:

@ananthsub yes, potentially in our parsing, or alternatively in the plugin, which already gets the list of devices.

@yifuwang:

@justusschock sorry, I'm not very familiar with how MPI works with GPUs in torch.distributed. However, if it relies on NCCL under the hood, I'd guess it still wouldn't work.

@tholop:

@yifuwang I tried a bit more, and it actually worked with gloo. After adding PL_TORCH_DISTRIBUTED_BACKEND=gloo, I was able to run Lightning training successfully with 2 replicas on a single GPU.
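For reference, a minimal self-contained sketch of the configuration reported to work here. The toy module and random data are illustrative only, and the PL 1.x spellings gpus= and accelerator="ddp" are assumed (newer releases use strategy="ddp"); the environment variable can equally be exported in the shell before launching:

```python
import os
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Tell Lightning to initialize torch.distributed with gloo instead of nccl.
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

def main():
    dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=16)
    trainer = pl.Trainer(
        gpus="0,0",          # two replicas on the same physical GPU
        accelerator="ddp",   # PL 1.x spelling; newer releases use strategy="ddp"
        max_epochs=1,
    )
    trainer.fit(ToyModule(), loader)

if __name__ == "__main__":
    main()
```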

@dmarx:

@tholop did you experience the same kind of speedup you'd expect if you were training on multiple separate physical devices?

PS: I haven't experimented yet, but I suspect you might be able to apply the fractional-GPU capability in Ray to achieve something like this.

@tholop:

@dmarx I did experience a speedup, but not as good as having separate physical devices. I didn't benchmark thoroughly though.

I totally agree regarding Ray's fractional GPUs! I mentioned it in the original issue as a possible workaround, but it might require a bit more work than just passing a string to Lightning.

@tholop:

For reference: it seems to be possible when the backend is gloo instead of nccl. See the discussion here: #8630 (reply in thread).

3 replies
@ksasso1028:

PyTorch Lightning complains about using the same device ID in the current version; any workaround? @tholop I'm certainly interested in this to get more steps vs. batches.

@haleyso:

Hi @ksasso1028, I'm running into the same thing with PyTorch Lightning right now. Did you happen to find a workaround?

Update: I just commented out the check-unique-IDs call in device_parser.py ... so far it's working OK.
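For anyone who prefers not to edit the installed package, a monkey-patch along these lines might achieve the same thing. The module path and the `_check_unique` helper name are assumptions about the 1.x code base, so verify them against your installed version:

```python
# Hypothetical workaround: disable the duplicate-device-ID check before building
# the Trainer. `_check_unique` is assumed to exist in this module (true for some
# PyTorch Lightning 1.x releases); check your installed version before relying on it.
import pytorch_lightning.utilities.device_parser as device_parser

device_parser._check_unique = lambda device_ids: None  # no-op the uniqueness check
```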

@ashok-arora:

Does it work with a multi-node setup too, or is it single-node multi-GPU only?

Answer selected by @tholop