[FEATURE] Faster TIMM by combined background H2D copies #2548

Open
Labels
enhancement (New feature or request)

Description

@alex77g2

Is your feature request related to a problem? Please describe.
This (trivial to implement) feature gives a 1..10% speedup by reducing H2D-copy overhead.
It could be implemented on the user side, but that bloats user code, and the performance gain would likely suffer (if done on the user side).

Describe the solution you'd like
Initialize/create the DataLoader with new optional parameters: multiplier + device (a possible alternative name is batch_bundle).
This should be 100% backward compatible, because multiplier < 2 (default = 1) is intended to keep the original behaviour.
Reason for the speedup: CUDA H2D transfers show a startup delay (the first byte costs as much as many megabytes later); see the micro-benchmark sketch below.
The overall number of multithreaded timm loader threads also decreases (which comes at a cost too).
mdl = MultiEpochsDataLoader( data, batch_size: int=32, multiplier: int=1, device: torch.device=None )
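A rough micro-benchmark sketch to illustrate the startup delay (the tensor shapes, the 8x bundling factor, and the helper name h2d_time are placeholders, not part of the request) could look like this, comparing several small pinned H2D copies against one bundled copy of the same total size:

    import time
    import torch

    device = torch.device("cuda")
    # 8 small pinned batches vs. one bundled batch of the same total size
    small = [torch.empty(16, 3, 224, 224).pin_memory() for _ in range(8)]
    bundled = [torch.empty(8 * 16, 3, 224, 224).pin_memory()]

    def h2d_time(tensors):
        # time only the H2D copies, with explicit synchronization around them
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = [t.to(device, non_blocking=True) for t in tensors]
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    print("8 x 16 images :", h2d_time(small) * 1e3, "ms")
    print("1 x 128 images:", h2d_time(bundled) * 1e3, "ms")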

My implementation on the user side:
    data = datasets.MNIST(data_dir, train=True, download=True, transform=transform)
    data_loader: DataLoader = MultiEpochsDataLoader(data, batch_size=batch_size * multiplier)

    import torch as tt
    img_type: tt.dtype = tt.get_default_dtype()
    batch_bundle: int = multiplier
    if batch_bundle <= 1:  # default: off
        with tt.no_grad():
            for data, target in data_loader:
                data, target = data.to(device, img_type, non_blocking=True), \
                    target.to(device, non_blocking=True)
                output: tt.Tensor = model(data)
                loss1 += cost_func(output, target) * len(target)
                # test_loss += F.nll_loss(output, target, reduction="sum")
                pred: tt.Tensor = output.max(1, keepdim=True)[1]
                corr1 += pred.eq(target.view_as(pred)).sum()
    else:
        data = target = data1 = target1 = None
        bs: int = None
        with tt.no_grad():
            for data0, target0 in data_loader:
                # data, target = data1, target1  # optional (H2D in background, triggered earlier)
                data, target = data0.to(device, img_type, non_blocking=True), \
                    target0.to(device, non_blocking=True)
                if bs is None:  # once
                    bs0 = target0.numel()
                    assert not (bs0 % batch_bundle), "odd/bad modulo"
                    bs = bs0 // batch_bundle  # e.g. // 4
                    assert bs > 0, "empty batch"
                    # continue
                for x, y in zip(data.split(bs), target.split(bs)):  # split back down to the intended batch_size
                    output: tt.Tensor = model(x)
                    loss1 += cost_func(output, y) * len(y)
                    pred: tt.Tensor = output.max(1, keepdim=True)[1]
                    corr1 += pred.eq(y.view_as(pred)).sum()

I would like to bring the user code back down to:

    for data, target in data_loader:
        output: tt.Tensor = model(data)  # data + target already on device and split

For multi-GPU usage, it would be beneficial if device could be a tuple/list like device=(dev1, dev2).
Then data and target would also be tuples of the same length: data=(data1, data2), target=(target1, target2).
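A minimal single-device sketch of what such a loader could do internally (the class name DeviceBundleLoader and its parameters are placeholders, not an existing timm API) might look like this:

    import torch

    class DeviceBundleLoader:
        # Placeholder sketch: wraps a loader that yields bundles of
        # batch_size * multiplier samples, issues one non-blocking H2D copy per
        # bundle, and yields split, device-resident sub-batches.
        def __init__(self, loader, multiplier: int = 1, device: torch.device = None,
                     img_type: torch.dtype = None):
            self.loader = loader
            self.multiplier = max(1, multiplier)
            self.device = device
            self.img_type = img_type if img_type is not None else torch.get_default_dtype()

        def __iter__(self):
            for data, target in self.loader:
                if self.device is not None:
                    # one combined H2D copy for the whole bundle
                    data = data.to(self.device, self.img_type, non_blocking=True)
                    target = target.to(self.device, non_blocking=True)
                if self.multiplier <= 1:
                    yield data, target
                    continue
                # assumes 1-D class targets, so numel == bundle size
                bs = target.numel() // self.multiplier  # intended batch_size
                yield from zip(data.split(bs), target.split(bs))

With this, the loop above reduces to the two lines shown, and a multi-GPU variant would split the bundle across a tuple of devices at the same point.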

Describe alternatives you've considered
There are 2 alternative ways, both not as fast as this feature request.
The implementation shown above is one alternative, but it bloats the user code significantly.
Doing nothing at all still works, but is slower.

Additional context
Depending on the system (OS + GPU driver), batch size, model complexity, and multiplier, the performance gain differs.
On Ubuntu 24.04 with an RTX 4070 Ti and a tiny MNIST-10 benchmark, I see a 5% gain with multiplier=8, batch_size=16.
PS: Even multiplier=1, device='cuda' should be beneficial, as the H2D transfer has already been triggered before the user gets the batches.
PS: I use small batch sizes (16..32), as that works well against overfitting in my case.
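A minimal sketch of that prefetch idea for the multiplier=1 case (the function name prefetched is a placeholder; it assumes the DataLoader uses pin_memory=True so non-blocking copies can overlap with compute) could look like:

    import torch

    def prefetched(loader, device):
        # Placeholder sketch: start the H2D copy of batch i+1 on a side stream
        # while batch i is still being processed on the default stream.
        stream = torch.cuda.Stream()
        prev = None
        for data, target in loader:
            with torch.cuda.stream(stream):
                data = data.to(device, non_blocking=True)
                target = target.to(device, non_blocking=True)
            if prev is not None:
                yield prev  # the copy of the current pair keeps running in the background
            # make the default stream wait for the copy before the new pair is used
            torch.cuda.current_stream().wait_stream(stream)
            prev = (data, target)
        if prev is not None:
            yield prev

A real implementation would also need to handle pinned-buffer reuse carefully, but the sketch shows where the overlap comes from.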


