Description
**Is your feature request related to a problem? Please describe.**
This (trivial to implement) feature gives a 1..10% speedup by reducing H2D copy overhead.
It could be implemented on the user side, but that blows up the user code, and the performance gain suffers when done there.
**Describe the solution you'd like**
Initialize/create the DataLoader with new optional parameters: `multiplier` + `device`
(a possible alternative name would be `batch_bundle`).
This should be 100% backward compatible, because `multiplier < 2` (default = 1) is intended to keep the original behaviour.
Reason for speedup: CUDA H2D transfers show a startup delay (the first byte costs as much as many megabytes later). Also, the overall number of multithreaded timm worker threads needed decreases (which has a cost too). Proposed signature:

```python
mdl = MultiEpochsDataLoader(
    data,
    batch_size=32,   # int
    multiplier=1,    # int, new: number of batches bundled into one H2D copy
    device=None,     # torch.device, new: target device for the copy
)
```
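The fixed per-transfer cost is easy to verify; a rough timing sketch (my own illustration, shapes and iteration counts are just examples):

```python
import torch

device = torch.device("cuda")
small = torch.empty(16, 1, 28, 28).pin_memory()      # one MNIST-sized batch
big   = torch.empty(16 * 8, 1, 28, 28).pin_memory()  # 8 batches bundled

def time_h2d(t: torch.Tensor, iters: int = 100) -> float:
    """Average milliseconds per host-to-device copy of tensor t."""
    t.to(device)  # warm-up (context init, allocator)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        t.to(device, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print("1 batch  :", time_h2d(small), "ms")
print("8 batches:", time_h2d(big), "ms")  # typically far less than 8x the single-batch time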
My implementation on the user side:

```python
data = datasets.MNIST(data_dir, train=True, download=True, transform=transform)
data_loader: DataLoader = MultiEpochsDataLoader(data, batch_size=batch_size * multiplier)

import torch as tt

img_type: tt.dtype = tt.get_default_dtype()
batch_bundle: int = multiplier

if batch_bundle <= 1:  # default: off
    with tt.no_grad():
        for data, target in data_loader:
            data, target = data.to(device, img_type, non_blocking=True), \
                           target.to(device, non_blocking=True)
            output: tt.Tensor = model(data)
            loss1 += cost_func(output, target) * len(target)
            # test_loss += F.nll_loss(output, target, reduction="sum")
            pred: tt.Tensor = output.max(1, keepdim=True)[1]
            corr1 += pred.eq(target.view_as(pred)).sum()
else:
    data = target = data1 = target1 = None
    bs: int = None
    with tt.no_grad():
        for data0, target0 in data_loader:
            # data, target = data1, target1  # optional (H2D in background, triggered earlier)
            data, target = data0.to(device, img_type, non_blocking=True), \
                           target0.to(device, non_blocking=True)
            if bs is None:  # once
                bs0 = target0.numel()
                assert not (bs0 % batch_bundle), "odd/bad modulo"
                bs = bs0 // batch_bundle
                assert bs > 0, "empty batch"
            for x, y in zip(data.split(bs), target.split(bs)):  # split back down to the intended batch_size
                output: tt.Tensor = model(x)
                loss1 += cost_func(output, y) * len(y)
                pred: tt.Tensor = output.max(1, keepdim=True)[1]
                corr1 += pred.eq(y.view_as(pred)).sum()
```
I would like to bring the user code back down to:

```python
for data, target in data_loader:   # data + target already on device and split
    output: tt.Tensor = model(data)
```
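For illustration only, here is a minimal sketch of how such a loader could behave internally; this `DeviceBundleLoader` wrapper and all of its names are hypothetical, not an existing torch/timm API:

```python
import torch
from torch.utils.data import DataLoader

class DeviceBundleLoader:
    """Load bundles of `multiplier` batches with a single H2D copy,
    then yield the individual batches already resident on `device`."""

    def __init__(self, dataset, batch_size: int = 32, multiplier: int = 1,
                 device: torch.device = None, **kwargs):
        self.batch_size = batch_size
        self.multiplier = max(1, multiplier)
        self.device = device
        self.loader = DataLoader(dataset,
                                 batch_size=batch_size * self.multiplier,
                                 pin_memory=device is not None,
                                 **kwargs)

    def __iter__(self):
        for data, target in self.loader:
            if self.device is not None:
                data = data.to(self.device, non_blocking=True)
                target = target.to(self.device, non_blocking=True)
            if self.multiplier <= 1:
                yield data, target
            else:
                # split the bundle back into batches of the intended size
                yield from zip(data.split(self.batch_size),
                               target.split(self.batch_size))
```

With a wrapper like this, the eval loop reduces to the two-line form above and the per-batch `.to(device)` calls disappear from user code.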
For multi-GPU usage, it would be beneficial if `device` could be a tuple/list, like `device=(dev1, dev2)`.
Then `data` and `target` would also be tuples of the same length: `data=(data1, data2)`, `target=(target1, target2)`.
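As a usage illustration of that requested (not yet existing) behaviour:

```python
# hypothetical: device accepts a tuple, so each yielded element is a per-device tuple
loader = MultiEpochsDataLoader(data_set, batch_size=16, multiplier=8,
                               device=(torch.device("cuda:0"), torch.device("cuda:1")))
for (data1, data2), (target1, target2) in loader:
    out1 = model1(data1)  # replica on cuda:0
    out2 = model2(data2)  # replica on cuda:1
```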
**Describe alternatives you've considered**
There are two alternative ways, both not as fast as this feature request.
The user-side implementation shown above is one alternative, but it blows up the user code significantly.
Doing nothing at all still works, but is slower.
**Additional context**
The performance gain differs depending on the system (OS + GPU driver), batch size, model complexity, and multiplier.
On Ubuntu 24.04 with an RTX 4070 Ti and a tiny MNIST-10 benchmark, I see a 5% gain with `multiplier=8, batch_size=16`.
PS: Even `multiplier=1, device='cuda'` should be beneficial, as the H2D transfer is triggered before the user gets the batch.
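A minimal sketch of that prefetch idea (one-batch look-ahead so the copy overlaps compute; the helper name is mine, and truly asynchronous copies assume the DataLoader uses pinned host memory, e.g. `pin_memory=True`):

```python
import torch

def prefetch_to_device(loader, device):
    """Yield (data, target) already on `device`; the H2D copy of the next
    batch is started on a side stream while the current batch is consumed."""
    stream = torch.cuda.Stream()
    it = iter(loader)

    def _preload():
        try:
            data, target = next(it)
        except StopIteration:
            return None
        with torch.cuda.stream(stream):  # issue the copy on the side stream
            return (data.to(device, non_blocking=True),
                    target.to(device, non_blocking=True))

    nxt = _preload()
    while nxt is not None:
        torch.cuda.current_stream().wait_stream(stream)  # copy must finish before use
        data, target = nxt
        data.record_stream(torch.cuda.current_stream())
        target.record_stream(torch.cuda.current_stream())
        nxt = _preload()  # kick off the next copy in the background
        yield data, target
```

Usage would simply be `for data, target in prefetch_to_device(data_loader, device): ...`.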
PS: I use small batch sizes (16..32), as that works against overfitting in my case.