Long Context Recipes#

Long context model training enhances the capability of large language models to handle long context inputs. Longer sequence lengths are beneficial for many NLP tasks, such as document-level summarization, long document classification, and long document question answering. NeMo Framework provides recipes to train long context (16k and 64k sequence length) models such as Llama 3, Mixtral, and Nemotron. Context Parallelism (CP) is the key technique NeMo Framework uses to train long context models: it partitions activations along the sequence dimension across GPUs, which makes training and inference with longer sequences feasible. CP enables NeMo Framework to support 16k and 64k sequence lengths in its current recipes, with the potential to scale to even longer sequences, such as 1 million tokens.
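Conceptually, CP gives each GPU only a contiguous slice of the sequence, so activation memory per GPU shrinks with the CP size. The following self-contained Python sketch illustrates the basic sharding idea only; it is not NeMo code, and the real implementation additionally interleaves chunks to balance causal-attention load across ranks:

```python
# Illustrative sketch of context-parallel sequence sharding (not NeMo code).
# Each CP rank keeps one contiguous slice of the token sequence, so the
# per-GPU activation footprint shrinks linearly with the CP size.

def shard_sequence(tokens, cp_size):
    """Split a token list into cp_size equal contiguous chunks, one per rank."""
    assert len(tokens) % cp_size == 0, "sequence length must divide evenly"
    chunk = len(tokens) // cp_size
    return [tokens[i * chunk:(i + 1) * chunk] for i in range(cp_size)]

# A 16k-token sequence split across 4 CP ranks: 4096 tokens per GPU.
seq = list(range(16384))
shards = shard_sequence(seq, cp_size=4)
print([len(s) for s in shards])  # [4096, 4096, 4096, 4096]
```

Doubling the CP size halves the tokens held per rank, which is why the same hardware that trains an 8k model can be scaled to 16k or 64k sequences by raising the CP degree.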

Provided Long Context Recipes in NeMo#

NeMo 2.0 provides tested recipes to train long context models. These recipes are available in the NeMo LLM recipes directory. At present, all available recipes are for pretraining, as there are no fine-tuning use cases yet.

The following tables show the sequence lengths with corresponding recipes available in NeMo for the Llama 3, Mixtral, and Nemotron models.

Llama 3#

Sequence Length | 8B  | 70B
16k             | Yes | Yes
64k             | Yes | Yes

Mixtral#

Sequence Length | 8x7B
16k             | Yes
64k             | Yes

Nemotron 4#

Sequence Length | 15B | 22B
16k             | Yes | Yes
64k             | Yes | Yes

Use Long Context Recipes#

Note

The long context recipes currently support only the pretraining task, since no suitable long context fine-tuning dataset exists yet.

There are two ways to access the long context recipes: the NeMo-Run API and the NeMo-Run CLI. Both wrap the same underlying code, but the NeMo-Run CLI is designed for ease of use.

Use with the NeMo-Run API#

Note

For detailed information about NeMo-Run, please refer to Quickstart with NeMo-Run. Below is a concise version that focuses on the usage of long context recipes in NeMo 2.0.

NeMo 2.0 includes an example that demonstrates the existing pretraining recipes. For instance, you can use the llama3_8b_16k recipe as follows:

python pretraining.py --slurm --recipe llama3_8b_16k
  • --slurm is required for non-local Slurm job submission. Additionally, you must configure the Slurm account settings in the example Python file.

  • The model type and model size can be substituted with other recipes available in NeMo 2.0 (e.g. mixtral with model size 8x7b_16k).

Customize Parameters#

To use customized parameters, you can easily modify the example. For instance, you can add pretrain.trainer.max_steps=2000 before the experiment execution part to override the maximum number of training steps for this recipe.
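To illustrate how such a dotted override maps onto a nested recipe configuration, here is a minimal self-contained sketch. The Config class and apply_override helper are toy stand-ins for illustration, not the NeMo-Run implementation:

```python
# Toy sketch (not NeMo-Run code): how a dotted override string such as
# "trainer.max_steps=2000" can be applied to a nested config object.

class Config:
    """Minimal attribute container standing in for a recipe config."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

def apply_override(root, override):
    """Set a nested attribute from an 'a.b.c=value' style string."""
    path, value = override.split("=", 1)
    *parents, leaf = path.split(".")
    obj = root
    for name in parents:
        obj = getattr(obj, name)          # walk down to the parent object
    setattr(obj, leaf, int(value) if value.isdigit() else value)

# Placeholder default; the real recipe's default max_steps differs.
recipe = Config(trainer=Config(max_steps=300000))
apply_override(recipe, "trainer.max_steps=2000")
print(recipe.trainer.max_steps)  # 2000
```

In the NeMo-Run API example, the equivalent effect is achieved by editing the recipe object in the Python script before the experiment is executed.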

Use with the NeMo-Run CLI#

Note

The NeMo-Run CLI currently works only in a local environment. To use it with a Slurm cluster, please refer to the NeMo-Run example.

You can use these recipes via the NeMo-Run CLI. For installation instructions, refer to the NeMo-Run CLI README.

For example:

nemo llm pretrain --factory llama3_8b_16k
  • llama3_8b_16k can be replaced by other recipes included in NeMo 2.0 (e.g. mixtral_8x7b_16k).

Note

When launching recipes with multiple processes (i.e., on multiple GPUs), add the -y option to the command to skip user confirmation prompts. For example: nemo llm pretrain --factory llama3_8b -y.

Customize Parameters#

You can override any parameter in the recipe:

nemo llm pretrain --factory llama3_8b_16k trainer.max_steps=2000

To continue training from a checkpoint, add the resume.resume_from_path="to/some/path" option to the command. You can also specify other options, such as the sequence length, in the same manner. For example:

nemo llm pretrain --factory llama3_8b_16k resume.resume_from_path="to/some/path" model.config.seq_length=2048 data.seq_length=2048
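Note that the model and data sequence lengths should be kept in sync, as in the command above where both are set to 2048. A small hypothetical helper, not part of NeMo, sketches how one might sanity-check a set of overrides for consistency before launching a run:

```python
# Hypothetical pre-flight check (not a NeMo utility): verify that every
# *.seq_length override in a run configuration carries the same value.

def check_seq_length_overrides(overrides):
    """Collect *.seq_length overrides and verify they all agree."""
    values = {k: v for k, v in overrides.items() if k.endswith("seq_length")}
    lengths = set(values.values())
    if len(lengths) > 1:
        raise ValueError(f"mismatched seq_length overrides: {values}")
    return lengths.pop() if lengths else None

overrides = {"model.config.seq_length": 2048, "data.seq_length": 2048}
print(check_seq_length_overrides(overrides))  # 2048
```

A mismatch between the two (for example, a model configured for 2048 tokens fed batches padded to 4096) would surface only at runtime, so checking it up front is cheap insurance.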

For more details about running recipes, see the GitHub pre-train README.

Add a New Recipe#

See the GitHub add-recipe README for instructions on how to add a new recipe.

Train a Long Context Model Pipeline#

Typically, training a long context model involves two steps: pretraining from scratch, then extending the context length of the resulting model one or more times through continued pretraining. In NeMo 2.0, this process can be accomplished with either the NeMo-Run API or the NeMo-Run CLI, using the following steps.
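The staged extension can be sketched as data: each stage resumes from the previous stage's checkpoint directory. The recipe names below match those in this document, but the directory layout and the command assembly are illustrative placeholders, not NeMo-Run API calls; whether overrides may be appended to the launch command depends on how the example script parses its arguments:

```python
# Illustrative sketch of the two-step long context pipeline: pretrain at a
# short sequence length, then extend context in stages. Directory names are
# placeholders; the commands are assembled strings, not NeMo-Run calls.

stages = [
    {"recipe": "llama3_8b",     "seq_length": 2048,  "log_dir": "ckpts/stage0"},
    {"recipe": "llama3_8b_16k", "seq_length": 16384, "log_dir": "ckpts/stage1"},
    {"recipe": "llama3_8b_64k", "seq_length": 65536, "log_dir": "ckpts/stage2"},
]

def build_commands(stages):
    """Chain each stage's resume path to the previous stage's log dir."""
    commands = []
    prev_dir = None
    for stage in stages:
        cmd = ["python", "pretraining.py", "--slurm", "--recipe", stage["recipe"]]
        if prev_dir is not None:
            # Every stage after the first resumes from the prior checkpoint.
            cmd.append(f'pretrain.resume.resume_from_path="{prev_dir}"')
        cmd.append(f'pretrain.log.log_dir="{stage["log_dir"]}"')
        commands.append(" ".join(cmd))
        prev_dir = stage["log_dir"]
    return commands

for c in build_commands(stages):
    print(c)
```

The key invariant is the chaining: stage N's resume_from_path points at stage N-1's log_dir, and only the first stage trains from scratch.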

Use with the NeMo-Run API#

Pretrain the Llama 3 Model from Scratch#

Here is an example of pretraining the Llama 3 model from scratch with a sequence length of 2048 using this script. The default sequence length for the llama3_8b recipe is 8192, so you first need to override the seq_length used in llama3_8b with:

pretrain.model.config.seq_length=2048 pretrain.data.seq_length=2048

Then you can launch the job with:

python pretraining.py --slurm --recipe llama3_8b

After running the above command, you will have a pretrained model checkpoint with a sequence length of 2048 in your directory.

Extend the Context Length of a Pretrained Checkpoint#

To extend the context length of the model, you can continue training from the checkpoint. For example, you can use a sequence length of 16k:

python pretraining.py --slurm --recipe llama3_8b_16k

Next, you can continue training from the 16k checkpoint to 64k:

python pretraining.py --slurm --recipe llama3_8b_64k

For saving and loading from different directories, please add:

pretrain.log.log_dir="your/second/log/ckpt/dir/here" pretrain.resume.resume_from_path="your/first/log/ckpt/dir/here"

For other context lengths, you need to add corresponding recipes with suitable parallelism configurations, or override the existing recipes. Refer to the GitHub add-recipe README for details.

Use with the NeMo-Run CLI#

Note

The NeMo-Run CLI currently works only in a local environment. To work with a Slurm cluster, please see the NeMo-Run example.

Pretrain the Llama 3 Model from Scratch#

Here is an example of pretraining the Llama 3 model from scratch with a sequence length of 2048 using the NeMo-Run CLI. The default sequence length for the llama3_8b recipe is 8192.

nemo llm pretrain --factory llama3_8b model.config.seq_length=2048 data.seq_length=2048 log.log_dir="your/first/log/ckpt/dir/here"

After running the above command, you will have a pretrained model checkpoint with a sequence length of 2048 in your log directory.

Extend the Context Length of a Pretrained Checkpoint#

To extend the context length of the model, you can continue training from the checkpoint. For example, you can use a sequence length of 16k:

nemo llm pretrain --factory llama3_8b_16k log.log_dir="your/second/log/ckpt/dir/here" resume.resume_from_path="your/first/log/ckpt/dir/here"

Next, you can continue training from the 16k checkpoint to 64k:

nemo llm pretrain --factory llama3_8b_64k log.log_dir="your/third/log/ckpt/dir/here" resume.resume_from_path="your/second/log/ckpt/dir/here"

For other context lengths, you need to add corresponding recipes with suitable parallelism configurations, or override the existing recipes. Refer to the GitHub add-recipe README for details.