Known Issues
We will release fixes for the following issues shortly:
25.09 Known Issues
Automodel
Knowledge distillation validation has a known issue. Set --step_scheduler.val_every_steps=9223372036854775807 to bypass it.
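The override value above equals the largest signed 64-bit integer, so the validation interval is effectively never reached. The following minimal sketch checks that arithmetic and composes the override string. The helper name `never_validate_override` is hypothetical, and it is an assumption that the value was chosen for this reason.

```python
# Largest signed 64-bit integer; as a validation interval it is
# effectively never reached during a training run.
INT64_MAX = 2**63 - 1


def never_validate_override() -> str:
    """Build the command-line override quoted in the known issue (hypothetical helper)."""
    return f"--step_scheduler.val_every_steps={INT64_MAX}"


print(never_validate_override())
```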
Megatron Bridge
Pretraining DeepSeek in subchannel FP8 precision is not working. Pretraining DeepSeek with current scaling FP8 is a workaround, but MTP loss does not converge.
25.07 Known Issues
DeepSeek model pretraining has a memory spike at the end of training, after the validation loop and checkpoint saving. The memory spike is linked to the cross-entropy layer. This may lead to an NCCL error at the end of training.
When fine-tuning with CP > 1, you might need to set calculate_per_token_loss = True, depending on the dataset you choose. Note that this results in a slightly different loss than before, but both settings lead to model convergence.
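As a general illustration of why the reported loss can differ slightly, the sketch below compares averaging per-sequence means with dividing the summed loss by the total token count. It is not Megatron Core's implementation. The function names and the numbers are assumptions chosen for the example.

```python
from typing import List


def mean_of_sequence_means(per_token_losses: List[List[float]]) -> float:
    """Average each sequence first, then average those per-sequence means."""
    means = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(means) / len(means)


def per_token_mean(per_token_losses: List[List[float]]) -> float:
    """Divide the summed loss by the total number of tokens."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# Two sequences of different length (illustrative numbers only).
losses = [[1.0, 3.0], [2.0, 2.0, 2.0, 6.0]]
print(mean_of_sequence_means(losses))
print(per_token_mean(losses))
```

The two averages coincide when all sequences have the same number of tokens and diverge when lengths differ, which is one reason the effect depends on the dataset.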
TensorRT-LLM must be installed to run the ONNX export tutorial for LLM embedding models in the Finetuning Llama 3.2 Model into Embedding Model tutorial. Use the following instructions for installing TensorRT-LLM: NVIDIA-NeMo/Export-Deploy.
Exporting with ONNX requires transformers v4.51. By default, the container comes with v4.53. Consider downgrading transformers by running uv pip install transformers==4.51.0. For use cases outside the container, the command is pip install transformers==4.51.0.
Distributed checkpoint saving fails for Nemotron-h 47B and 56B on GB200. No issues observed on H100 or B200.
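Before attempting the ONNX export, it can help to confirm which transformers version is installed. The sketch below uses only the standard library. The helper names are hypothetical, and treating any 4.51.x release as acceptable is an assumption based on the "v4.51" requirement stated above.

```python
from importlib import metadata
from typing import Optional

REQUIRED_PREFIX = (4, 51)  # ONNX export is documented to need transformers v4.51


def matches_required(version: str) -> bool:
    """Return True if `version` is a 4.51.x release."""
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        return False
    return major_minor == REQUIRED_PREFIX


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


version = installed_transformers_version()
if version is None:
    print("transformers is not installed")
elif matches_required(version):
    print(f"transformers {version} is suitable for ONNX export")
else:
    print(f"transformers {version} found; consider installing 4.51.0")
```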
25.04.02 and 25.04.01 Known Issues
Tensor-Parallel Communication Overlap: Functional errors may occur with specific tensor-parallel communication overlap configurations. This includes:
AllGather+GEMM overlap.
Ring-exchange algorithm when aggregate=True.
LayerNorm Bias Accuracy: Training models using LayerNorm with bias (e.g., StarCoder2) might exhibit accuracy issues.
A fix is available in TransformerEngine commit 1569.
This fix is not yet included in the current NeMo release container.
Workaround: Manually mount or pip install the latest TransformerEngine version in your container.
Large Model Checkpoint NaN Errors (T5 11B, StarCoder2 7B): Loading trained checkpoints for fine-tuning T5 (11B) and StarCoder2 (7B) models may result in NaN values.
This is suspected to be a checkpoint saving/loading error.
A potential fix is in Mcore PR 48cc46f.
This fix is currently under testing.
MXFP8 Memory Usage: MXFP8 is currently using more memory than expected. A fix is in progress.
FP8 in Automodel Workflow:
Using FP8 in the Automodel workflow requires manually setting use_linear_ce_loss to False. Alternatively, upgrade NeMo to commit 64f0fa.
FP8 support for Mixture of Experts (MoE) models is planned for a future release.
HF Export for Llama-3_3-Nemotron-Super-49B-v1: Hugging Face export is not currently supported for the Llama-3_3-Nemotron-Super-49B-v1 model.
25.04.00 Known Issues
Llama 4 accuracy may degrade slightly due to an issue with the order of sigmoid application in the expert routing logic. This has been fixed in the following Megatron Core commit: NVIDIA/Megatron-LM. However, the fix is not yet included in the current NeMo release container. To apply the fix, manually mount the updated Megatron Core source when building or running your container.
Resuming from local checkpoints using the get_global_step_from_global_checkpoint_path utility function may run into problems with auto-inserted metrics in the path. This is fixed in NVIDIA/NeMo#13012. However, the fix is not yet included in the current NeMo release container.
Tensor-parallel communication overlap with the following configurations may have functional errors: AllGather+GEMM overlap, ring-exchange algorithm with aggregate=True.
In the scripts/vlm/automodel.py script, the gbs argument is a string instead of an integer. Additionally, this script needs to be run via torchrun for devices > 1.
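The string-versus-integer behavior described for gbs is what argparse produces when an argument is declared without a type. The parser below is a hypothetical illustration, not the script's actual code. The argument name `--gbs_raw` is invented for the comparison.

```python
import argparse

# Hypothetical parser for illustration; not the script's actual code.
parser = argparse.ArgumentParser()
parser.add_argument("--gbs_raw", default="8")      # no type given: value stays a str
parser.add_argument("--gbs", type=int, default=8)  # type=int: value is parsed to an int

args = parser.parse_args(["--gbs_raw", "16", "--gbs", "16"])
print(type(args.gbs_raw).__name__, type(args.gbs).__name__)
```

Arithmetic or comparisons against a string-typed batch size can fail or behave unexpectedly, so converting the value with int() before use is one possible local workaround.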
There might be accuracy issues when training models that use LayerNorm with bias (e.g., StarCoder2). This issue has been addressed in the following TransformerEngine commit: NVIDIA/TransformerEngine#files. However, the fix is not yet included in the current NeMo release container. To apply the fix, manually mount or pip install the latest version of TransformerEngine in your container.
Large configurations of T5 (11B) and StarCoder2 (7B) produce NaN values when a trained checkpoint is loaded for fine-tuning. We suspect a checkpoint saving/loading error, which a recent Mcore PR (NVIDIA/Megatron-LM) is expected to fix. This fix is currently being tested.
MXFP8 currently uses more memory than expected. A fix is in progress.
FP8 in the Automodel workflow requires setting use_linear_ce_loss to False manually, or upgrading NeMo to commit 64f0fa. FP8 support for MoE models is scheduled for a future release.
Hugging Face export is not supported for Llama-3_3-Nemotron-Super-49B-v1.
25.02 Known Issues
Automodel
This is primarily a functional release; performance improvements are planned for future versions.
For large models (e.g., > 40B) trained with FSDP2, checkpoint saving can take longer than expected.
Support for long sequences is currently limited, especially for large models > 30B.
Models with external dependencies may fail to run if those dependencies are unavailable (e.g., a missing package leading to a failed import).
A small percentage of models available via AutoModelForCausalLM may only support inference, and have training capabilities explicitly disabled.
Support for FSDP2 with mixed weights models (e.g. FP8 + BF16) is scheduled for future releases.
Support for Context Parallelism with sequence packing + padding between sequences is currently broken (see issue #12174). Use 24.12 or upgrade to TE 2.0+ for working support. This will be fixed in a future version.
MoE-based models show training instability. Continue to use 24.12 for MoE training until 25.02 is patched with the fix for MoE.
In 24.12, NeMo switched from pytorch_lightning to lightning.pytorch. If you have custom code that imports pytorch_lightning, you should replace the import with lightning.pytorch. Failing to do so will result in an error that looks like this:
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/model_helpers.py", line 42, in is_overridden
    raise ValueError("Expected a parent")
ValueError: Expected a parent
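For custom code with many files, the import replacement described above can be applied mechanically. The following sketch rewrites module references in a source string. The helper name `migrate_imports` is hypothetical, and a plain text substitution like this is an assumption about what is sufficient, so review the result before committing it.

```python
import re

# Matches the module name only when it stands alone as an identifier,
# so names such as `my_pytorch_lightning_utils` are left untouched.
_OLD_MODULE = re.compile(r"(?<![\w.])pytorch_lightning(?!\w)")


def migrate_imports(source: str) -> str:
    """Rewrite `pytorch_lightning` references to `lightning.pytorch`."""
    return _OLD_MODULE.sub("lightning.pytorch", source)


old = (
    "import pytorch_lightning as pl\n"
    "from pytorch_lightning.callbacks import Callback\n"
)
print(migrate_imports(old))
```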
Similarly, when using a 24.12 container or later, if running evaluations using the LM Evaluation Harness, be sure to upgrade the version of LM Evaluation Harness to include this commit. This can be done by following these install instructions. Failing to do so will result in an error that looks like this:
ValueError: You selected an invalid strategy name: `strategy=<nemo.collections.nlp.parts.nlp_overrides.NLPDDPStrategy object at 0x1554480d2410>`. It must be either a string or an instance of `pytorch_lightning.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai
Restoring the model context for NeMo 2.0 checkpoints produced using the NeMo 24.09 container fails when building the OptimizerConfig class from the megatron.core.optimizer.optimizer_config module, as the overlap_grad_reduce and overlap_param_gather parameters were moved from the config API in Megatron Core. The update_io_context.py script drops unknown parameters from the checkpoint context to make it compatible with the latest container.
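The idea of dropping unknown parameters can be sketched with a plain dataclass. This is not the update_io_context.py script, whose logic is not shown here. `ExampleOptimizerConfig` and `drop_unknown_params` are hypothetical stand-ins used only to illustrate the filtering step.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ExampleOptimizerConfig:
    """Hypothetical stand-in for a config class; not Megatron Core's OptimizerConfig."""
    lr: float = 1e-4
    weight_decay: float = 0.01


def drop_unknown_params(config_cls: type, saved: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys that are still fields of `config_cls`."""
    known = {f.name for f in fields(config_cls)}
    return {k: v for k, v in saved.items() if k in known}


# A saved context that still carries parameters the class no longer accepts.
saved_context = {"lr": 3e-4, "overlap_grad_reduce": True, "overlap_param_gather": True}
config = ExampleOptimizerConfig(**drop_unknown_params(ExampleOptimizerConfig, saved_context))
print(config)
```

Passing `saved_context` to the constructor without filtering would raise a TypeError for the unexpected keyword arguments, which mirrors the kind of failure described above.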
Griffin’s (NeMo 1.0) full fine-tuning has checkpoint loading issues; the state dicts are not matching between the provided checkpoint and the initialized model. Please use the 24.07 container if this model is needed.
NeMo_Forced_Aligner_Tutorial.ipynb has an AttributeError. Use the 24.09 container if this notebook is needed.
The Pretrain Gemma 2 27b recipe needs at least 2 nodes, but the recipe currently sets the default number of nodes to 1.
The Megatron Core Distributed Optimizer currently lacks memory capacity optimization, resulting in higher model state memory usage at small data parallel sizes. We will include this optimization in the next patch.
The overlap of the data-parallel parameter AllGather with optimizer.step (overlap_param_gather_with_optimizer=true) does not work with distributed checkpointing. Support for distributed checkpointing will be available in the next public release.
Support for converting models from NeMo 2.0 to 1.0 is not yet available. This support will be needed to align models until NeMo Aligner natively supports 2.0.
Transformer Engine changed the way metadata is stored in checkpoints after v1.10, which can cause checkpoint incompatibilities when using a Transformer Engine version later than v1.10 to load a checkpoint trained with an earlier version. Errors of this form look similar to the following:
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan
    raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.")
RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24.
or
File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/common.py", line 118, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard .../model.decoder.layers.self_attention.core_attention._extra_state/shard_0_4.pt not found
To work around this issue, use model.dist_ckpt_load_strictness=log_all when working with Transformer Engine v1.10 or higher.
You can find the Transformer Engine versions present in each NeMo container on the Software Component Versions page.
For data preparation of GPT models, use your own dataset or an online dataset legally approved by your organization.
A race condition in the NeMo experiment manager can occur when multiple processes or threads attempt to access and modify shared resources simultaneously, leading to unpredictable behavior or errors.
The Mistral and Mixtral tokenizers require a Hugging Face login.
Exporting Gemma, Starcoder, and Falcon 7B models to TRT-LLM only works with a single GPU. Additionally, if you attempt to export with multiple GPUs, no descriptive error message is shown.
The following notebooks have functional issues and will be fixed in the next release:
ASR_with_NeMo.ipynb
ASR_with_Subword_Tokenization.ipynb
AudioTranslationSample.ipynb
Megatron_Synthetic_Tabular_Data_Generation.ipynb
SpellMapper_English_ASR_Customization.ipynb
FastPitch_ChineseTTS_Training.ipynb
NeVA Tutorial.ipynb
Export
Exporting Llama70B to vLLM causes an out-of-memory issue. The root cause is still under investigation.
Export to vLLM does not support LoRA and P-tuning; however, LoRA support will be added in the next release.
In-framework (PyTorch level) deployment with 8 GPUs is encountering an error; more time is needed to understand the cause.
The query script under scripts/deploy/nlp/query.py returns the error An error occurred: ‘output_generation_logits’ in the 24.12 container. It will be fixed in the next container release.
Multimodal
LITA (Language-Independent Tokenization Algorithm) tutorial issue: The data preparation part in tutorials/multimodal/LITA_Tutorial.ipynb requires you to manually download the youmakeup dataset instead of using the provided script.
Add the argument exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True to the NeVA notebook pretraining procedure to ensure an end-to-end workflow.
ASR
Timestamp misalignment occurs in FastConformer ASR models when using the ASR decoder for diarization. Related issue: #8438.