Description
**Is your feature request related to a problem? Please describe.**
This (trivial to implement) feature gives a 1..10% speedup by reducing H2D copy overhead.
It could be implemented on the user side, but that blows up the user code, and the performance gain suffers when done there.
**Describe the solution you'd like**
Initialize/create the DataLoader with new optional parameters: `multiplier` + `device`
(a possible alternative name would be `batch_bundle`).
This should be 100% backward compatible, because `multiplier < 2` (default = 1) is intended to keep the original behaviour.
Reason for speedup: CUDA H2D transfers show a startup delay (the first byte costs as much as many megabytes later). Also, the overall number of multithreaded timm worker threads needed decreases (which has a cost too). Proposed signature:

```python
mdl = MultiEpochsDataLoader(
    data,
    batch_size=32,   # int
    multiplier=1,    # int, new: number of batches bundled into one H2D copy
    device=None,     # torch.device, new: target device for the copy
)
```
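The fixed per-transfer cost is easy to verify; a rough timing sketch (my own illustration, shapes and iteration counts are just examples):

```python
import torch

device = torch.device("cuda")
small = torch.empty(16, 1, 28, 28).pin_memory()      # one MNIST-sized batch
big   = torch.empty(16 * 8, 1, 28, 28).pin_memory()  # 8 batches bundled

def time_h2d(t: torch.Tensor, iters: int = 100) -> float:
    """Average milliseconds per host-to-device copy of tensor t."""
    t.to(device)  # warm-up (context init, allocator)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        t.to(device, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print("1 batch  :", time_h2d(small), "ms")
print("8 batches:", time_h2d(big), "ms")  # typically far less than 8x the single-batch time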
My implementation on the user side:

```python
data = datasets.MNIST(data_dir, train=True, download=True, transform=transform)
data_loader: DataLoader = MultiEpochsDataLoader(data, batch_size=batch_size * multiplier)

import torch as tt

img_type: tt.dtype = tt.get_default_dtype()
batch_bundle: int = multiplier

if batch_bundle <= 1:  # default: off
    with tt.no_grad():
        for data, target in data_loader:
            data, target = data.to(device, img_type, non_blocking=True), \
                           target.to(device, non_blocking=True)
            output: tt.Tensor = model(data)
            loss1 += cost_func(output, target) * len(target)
            # test_loss += F.nll_loss(output, target, reduction="sum")
            pred: tt.Tensor = output.max(1, keepdim=True)[1]
            corr1 += pred.eq(target.view_as(pred)).sum()
else:
    data = target = data1 = target1 = None
    bs: int = None
    with tt.no_grad():
        for data0, target0 in data_loader:
            # data, target = data1, target1  # optional (H2D in background, triggered earlier)
            data, target = data0.to(device, img_type, non_blocking=True), \
                           target0.to(device, non_blocking=True)
            if bs is None:  # once
                bs0 = target0.numel()
                assert not (bs0 % batch_bundle), "odd/bad modulo"
                bs = bs0 // batch_bundle
                assert bs > 0, "empty batch"
            for x, y in zip(data.split(bs), target.split(bs)):  # split back down to the intended batch_size
                output: tt.Tensor = model(x)
                loss1 += cost_func(output, y) * len(y)
                pred: tt.Tensor = output.max(1, keepdim=True)[1]
                corr1 += pred.eq(y.view_as(pred)).sum()
```
I would like to bring the user code back down to:

```python
for data, target in data_loader:   # data + target already on device and split
    output: tt.Tensor = model(data)
```
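For illustration only, here is a minimal sketch of how such a loader could behave internally; this `DeviceBundleLoader` wrapper and all of its names are hypothetical, not an existing torch/timm API:

```python
import torch
from torch.utils.data import DataLoader

class DeviceBundleLoader:
    """Load bundles of `multiplier` batches with a single H2D copy,
    then yield the individual batches already resident on `device`."""

    def __init__(self, dataset, batch_size: int = 32, multiplier: int = 1,
                 device: torch.device = None, **kwargs):
        self.batch_size = batch_size
        self.multiplier = max(1, multiplier)
        self.device = device
        self.loader = DataLoader(dataset,
                                 batch_size=batch_size * self.multiplier,
                                 pin_memory=device is not None,
                                 **kwargs)

    def __iter__(self):
        for data, target in self.loader:
            if self.device is not None:
                data = data.to(self.device, non_blocking=True)
                target = target.to(self.device, non_blocking=True)
            if self.multiplier <= 1:
                yield data, target
            else:
                # split the bundle back into batches of the intended size
                yield from zip(data.split(self.batch_size),
                               target.split(self.batch_size))
```

With a wrapper like this, the eval loop reduces to the two-line form above and the per-batch `.to(device)` calls disappear from user code.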
For multi-GPU usage, it would be beneficial if `device` could be a tuple/list, like `device=(dev1, dev2)`.
Then `data` and `target` would also be tuples of the same length: `data=(data1, data2)`, `target=(target1, target2)`.
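As a usage illustration of that requested (not yet existing) behaviour:

```python
# hypothetical: device accepts a tuple, so each yielded element is a per-device tuple
loader = MultiEpochsDataLoader(data_set, batch_size=16, multiplier=8,
                               device=(torch.device("cuda:0"), torch.device("cuda:1")))
for (data1, data2), (target1, target2) in loader:
    out1 = model1(data1)  # replica on cuda:0
    out2 = model2(data2)  # replica on cuda:1
```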
**Describe alternatives you've considered**
There are two alternative ways, both not as fast as this feature request.
The user-side implementation shown above is one alternative, but it blows up the user code significantly.
Doing nothing at all still works, but is slower.
**Additional context**
The performance gain differs depending on the system (OS + GPU driver), batch size, model complexity, and multiplier.
On Ubuntu 24.04 with an RTX 4070 Ti and a tiny MNIST-10 benchmark, I see a 5% gain with `multiplier=8, batch_size=16`.
PS: Even `multiplier=1, device='cuda'` should be beneficial, as the H2D transfer is triggered before the user gets the batch.
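A minimal sketch of that prefetch idea (one-batch look-ahead so the copy overlaps compute; the helper name is mine, and truly asynchronous copies assume the DataLoader uses pinned host memory, e.g. `pin_memory=True`):

```python
import torch

def prefetch_to_device(loader, device):
    """Yield (data, target) already on `device`; the H2D copy of the next
    batch is started on a side stream while the current batch is consumed."""
    stream = torch.cuda.Stream()
    it = iter(loader)

    def _preload():
        try:
            data, target = next(it)
        except StopIteration:
            return None
        with torch.cuda.stream(stream):  # issue the copy on the side stream
            return (data.to(device, non_blocking=True),
                    target.to(device, non_blocking=True))

    nxt = _preload()
    while nxt is not None:
        torch.cuda.current_stream().wait_stream(stream)  # copy must finish before use
        data, target = nxt
        data.record_stream(torch.cuda.current_stream())
        target.record_stream(torch.cuda.current_stream())
        nxt = _preload()  # kick off the next copy in the background
        yield data, target
```

Usage would simply be `for data, target in prefetch_to_device(data_loader, device): ...`.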
PS: I use small batch sizes (16..32), as that works against overfitting in my case.