updated reader_factory to correct extra folders #2448


Open

RorryB wants to merge 1 commit into huggingface:main from RorryB:main

Conversation

RorryB:

The main train.py does not properly instantiate the ImageFolder dataset when --train-split, --val-split, and --num-classes are specified. This appears to happen only when the data dir contains more folders than just data/train and data/val; in my case the extra folder was data/test.
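To make the failure mode concrete: if the folder reader ends up walking data/ instead of data/train/, every split's subfolders become classes, so targets can land well past --num-classes. A minimal stand-in for an ImageFolder-style scan (build_class_to_idx is my own illustration, not timm's code):

```python
import os

def build_class_to_idx(root):
    # ImageFolder-style scan: every leaf directory containing files is a class.
    classes = sorted(
        os.path.relpath(dirpath, root)
        for dirpath, dirnames, filenames in os.walk(root)
        if filenames and not dirnames
    )
    return {c: i for i, c in enumerate(classes)}

# Recreate the layout from the issue: train/val/test, 200 classes each.
for split in ('train', 'val', 'test'):
    for cls in range(200):
        d = os.path.join('data', split, f'class_{cls:03d}')
        os.makedirs(d, exist_ok=True)
        open(os.path.join(d, 'img.jpg'), 'w').close()

print(len(build_class_to_idx('data/train')))  # 200 -> targets in [0, 199], OK
print(len(build_class_to_idx('data')))        # 600 -> targets up to 599, out of range
```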

---- run 1 (original)
CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200
Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Traceback (most recent call last):
File "/pai/train.py", line 1235, in
main()
File "/pai/train.py", line 888, in main
train_metrics = train_one_epoch(
^^^^^^^^^^^^^^^^
File "/pai/train.py", line 1083, in train_one_epoch
loss = _forward()
^^^^^^^^^^
File "/pai/train.py", line 1051, in _forward
loss = loss_fn(output, target)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pai/timm/loss/cross_entropy.py", line 22, in forward
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [70,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [84,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [23,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [100,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [104,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [116,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [119,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [55,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
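The device-side assert is just an out-of-range label reaching gather. A quick way to confirm that before suspecting the model is to run the same gather on CPU, where it fails with a readable error (a sketch; the sizes mirror this run's 200 outputs):

```python
import torch

logprobs = torch.randn(4, 200).log_softmax(dim=-1)  # 200-way classifier output
target = torch.tensor([3, 57, 450, 12])             # 450 >= 200 -> invalid label
print(target.max().item() < logprobs.shape[-1])     # False: labels out of range

# The same gather that asserts on CUDA raises a plain RuntimeError on CPU:
loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
```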

---- making edit
nano timm/data/readers/reader_factory.py
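The diff itself isn't shown here, but the shape of the guard it needs is clear: make the folder reader scan only the requested split directory, so sibling folders like data/test never contribute classes. A hypothetical sketch (resolve_folder_root is my naming, not the actual patch):

```python
import os

def resolve_folder_root(root, split):
    # split may be a plain name ('train') or a path ('data/train/'),
    # as passed via --train-split / --val-split.
    candidate = os.path.join(root, split)
    if os.path.isdir(candidate):
        return candidate  # e.g. root='data', split='train'
    if os.path.isdir(split):
        return split      # e.g. split='data/train/' passed verbatim
    return root           # fall back: scan root itself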
---- run 2

CUDA_VISIBLE_DEVICES=0 python train.py --data-dir data --model seresnet34 --sched cosine --epochs 150 --warmup-epochs 5 --lr 0.4 --reprob 0.5 --remode pixel --batch-size 256 --amp -j 4 --train-split data/train/ --val-split data/val/ --num-classes 200
Training with a single process on 1 device (cuda).
Model seresnet34 created, param count:21550016
Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
crop_mode: center
Created SGD (sgd) optimizer: lr: 0.4, momentum: 0.9, dampening: 0, weight_decay: 2e-05, nesterov: True, maximize: False, foreach: None, differentiable: False, fused: None
Using native Torch AMP. Training in mixed precision.
Scheduled epochs: 150 (epochs + cooldown_epochs). Warmup within epochs when warmup_prefix=False. LR stepped per epoch.
Train: 0 [ 0/390 ( 0%)] Loss: 5.36 (5.36) Time: 1.552s, 164.95/s (1.552s, 164.95/s) LR: 1.000e-05 Data: 0.517 (0.517)
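Run 2 trains normally, and the initial loss (5.36, close to ln 200 ≈ 5.3) is what a 200-way classifier should show at a uniform start. A quick sanity check on the dataset itself (a sketch; the create_dataset signature and reader.class_to_idx attribute are per current timm, adjust if your version differs):

```python
from timm.data import create_dataset

ds = create_dataset('', root='data', split='data/train/')
print(len(ds.reader.class_to_idx))  # expect 200 after the fix
```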

