Main train.py does not properly instantiate the ImageFolder dataset when --train-split, --val-split, and --num-classes are specified. This only seems to happen when the data-dir contains folders other than data/train and data/val; in my case the extra folder was data/test. See the diagnostic sketch below.
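A quick way to see the mismatch, assuming the folder reader builds its class mapping from whatever directory it is rooted at (my guess at the mechanism, not confirmed against the timm source), is to compare the subfolders at each level. Paths below match my layout:

```python
# Hypothetical diagnostic, pure stdlib: compare what class discovery would see
# at the data-dir root vs. inside the split folder.
import os

root = 'data'
print(sorted(e.name for e in os.scandir(root) if e.is_dir()))
# -> ['test', 'train', 'val']  (split folders, not class folders)
print(len([e for e in os.scandir(os.path.join(root, 'train')) if e.is_dir()]))
# -> 200 actual class folders
```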
run 1 (original)
CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200
Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Traceback (most recent call last):
File "/pai/train.py", line 1235, in
main()
File "/pai/train.py", line 888, in main
train_metrics = train_one_epoch(
^^^^^^^^^^^^^^^^
File "/pai/train.py", line 1083, in train_one_epoch
loss = _forward()
^^^^^^^^^^
File "/pai/train.py", line 1051, in _forward
loss = loss_fn(output, target)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pai/timm/loss/cross_entropy.py", line 22, in forward
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [70,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [84,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [23,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [100,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [104,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [116,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [119,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
---- making edit
nano timm/data/readers/reader_factory.py
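I won't reproduce the exact patch here, but the general idea behind it is to root each split's reader at the split directory itself, so that sibling folders like data/test never enter the class mapping. A hypothetical sketch of that idea using plain torchvision rather than timm's reader:

```python
# Hypothetical sketch of the fix idea (not the actual reader_factory.py patch):
# rooting each split at its own directory means class discovery only ever sees
# the 200 class folders, never the sibling data/test split.
from torchvision.datasets import ImageFolder

train_ds = ImageFolder('data/train')
val_ds = ImageFolder('data/val')
assert len(train_ds.classes) == 200
assert train_ds.class_to_idx == val_ds.class_to_idx  # mappings must agree
```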
---- run 2
CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200
Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Train: 0 [ 0/390 ( 0%)] Loss: 5.36 (5.36) Time: 1.552s, 164.95/s (1.552s, 164.95/s) LR: 1.000e-05 Data: 0.517 (0.517)
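The first logged loss is also a decent sanity check that the labels are now in range: a freshly initialised 200-way classifier should start near the uniform-prediction cross-entropy.

```python
# Expected initial loss for 200 classes under a near-uniform softmax.
import math
print(math.log(200))  # 5.2983..., consistent with the observed 5.36
```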