Sequence Packing#
This section explains how to use the sequence packing training technique with Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT).
Sequence Packing for SFT/PEFT#
Overview#
When fine-tuning a large language model or vision language model, whether using SFT or PEFT methods, GPU under-utilization often occurs due to an inefficient input data structure. This inefficiency arises because many fine-tuning datasets have a skewed distribution of sequence lengths, with many short sequences and a few long ones, following Zipf's Law. Since transformer models require fixed-length inputs, shorter sequences must be padded with many padding tokens. This leads to two main inefficiencies:
Computation performed on the pad tokens is eventually masked out, resulting in wasted GPU computation.
The micro batch size is often limited by the micro batches that contain the longest sequences, so most other micro batches have under-utilized GPU memory.
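To see how severe the waste can be, here is a minimal sketch with illustrative numbers (not from the NeMo codebase) estimating the fraction of computation spent on padding when a skewed length distribution is padded to a fixed length:

```python
# Illustrative only: estimate the fraction of padding tokens when batching
# a skewed length distribution to a fixed maximum length.
seq_lens = [12, 8, 15, 9, 2048, 10, 7, 11]  # hypothetical skewed dataset
max_len = max(seq_lens)                     # fixed input length = 2048
total_slots = max_len * len(seq_lens)       # tokens the GPU computes on
real_tokens = sum(seq_lens)                 # tokens that actually matter
pad_fraction = 1 - real_tokens / total_slots
print(f"{pad_fraction:.1%} of computed tokens are padding")
```

With one 2048-token outlier among short sequences, roughly 87% of the computed tokens in this toy batch are padding.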
Sequence packing is a training technique where multiple training sequences (examples) are concatenated into one long sequence (pack). This technique greatly reduces the number of padding tokens, allowing more meaningful tokens to be processed in each micro batch. As a result, it maximizes both GPU compute and GPU memory utilization.
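As a rough illustration of the idea (the packing algorithm NeMo actually uses may differ), a first-fit-decreasing bin-packing sketch:

```python
# Illustrative first-fit-decreasing packing of sequences into fixed-size packs.
# This is a sketch of the concept, not NeMo's actual packing implementation.
def pack_sequences(seq_lens, pack_size):
    packs = []  # each pack is a list of sequence lengths
    for length in sorted(seq_lens, reverse=True):
        for pack in packs:
            if sum(pack) + length <= pack_size:
                pack.append(length)  # fits in an existing pack
                break
        else:
            packs.append([length])   # open a new pack
    return packs

packs = pack_sequences([30, 12, 8, 25, 5, 18, 2], pack_size=32)
```

Each pack ends up nearly full, so far fewer slots are wasted on padding than when every sequence is padded to the maximum length.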
While sequences for pretraining can be concatenated naively without carefully minding the sequence boundaries, this is often not the case for supervised and instruction fine-tuning. This is because the data quality is much higher in fine-tuning workloads, so each input sequence should be treated individually.
The conventional solution is to build a custom attention mask (specifically, a block-triangular mask) to mask out attention values between sequences. However, this increases the complexity of attention from \(\sum_i {s_i}^2\) to \(\Big({\sum_i {s_i}}\Big)^2\), where \(s_i\) is the length of the \(i\)th subsequence. In practice, the conventional solution puts a limit on the packed sequence size. Instead, NeMo provides a highly optimized version of sequence packing which makes use of variable-length attention kernels in FlashAttention and TransformerEngine. Instead of providing a custom attention mask, information about sequence boundaries is passed in with the cu_seq_lens variable (short for cumulative sequence lengths) [1]. With this approach, attention values between sequences are never calculated, so the complexity of attention remains at \(\sum_i {s_i}^2\). This allows the packed sequence size to increase to arbitrary lengths without affecting the memory complexity, so that GPU memory can be fully utilized.
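The cu_seq_lens convention can be sketched as follows; this is a minimal illustration of the cumulative-offset format consumed by variable-length attention kernels, not NeMo internals:

```python
# Illustrative: build cumulative sequence offsets for three subsequences of
# lengths [5, 3, 7] packed into one sequence of 15 tokens.
seq_lens = [5, 3, 7]
cu_seq_lens = [0]
for length in seq_lens:
    cu_seq_lens.append(cu_seq_lens[-1] + length)
# cu_seq_lens == [0, 5, 8, 15]: tokens 0-4, 5-7, and 8-14 belong to separate
# sequences, and the varlen kernel never computes attention across these
# boundaries, so the cost stays sum(s_i ** 2) rather than (sum(s_i)) ** 2.
```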
All things considered, NeMo's implementation of sequence packing provides [2]:
Up to 10X performance improvement in terms of FLOPs
Up to 6X performance improvement in terms of training time
No impact on model convergence
Run SFT/PEFT with Packed Sequences in LLM#
Prepare the Dataset#
In NeMo 2.0, the packed dataset is automatically prepared before training, eliminating the need for any additional steps.
Train with Predefined Fine-Tune Recipes#
The quickest way to start fine-tuning a model with packed sequences is to use the NeMo-Run recipes. Simply set packed_sequence=True in the recipe function. The following is an example using the Llama 3 8B model.
```python
from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_8b_finetuning",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    packed_sequence=True,
)
```
Migrate Your Own Recipe#
If you already have a fine-tuning recipe without packed sequences, modify your recipe as follows to start training with packed sequences.
1. Add packed_sequence_specs=PackedSequenceSpecs(...) in your data module. For example:

```python
data = llm.DollyDataModule(
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=8,
    packed_sequence_specs=PackedSequenceSpecs(
        packed_sequence_size=2048,
        tokenizer_model_name="a_unique_tokenizer_name",
    ),
)
```

See here for an explanation of the allowed fields in PackedSequenceSpecs.

2. Adjust the batch sizes.

- Micro batch size must be set to 1. This constraint arises because samples in a micro batch are no longer stacked; they are now concatenated during the data preparation step. Consequently, micro batch size becomes irrelevant when using packed sequences, so we require users to set it to 1 to acknowledge this fact. To improve GPU memory utilization, you can increase the packed_sequence_size, achieving the same effect as increasing the micro batch size.
- Global batch size must be adjusted to maintain the training recipe. Since each pack now contains multiple sequences, the global batch size needs to be reduced by the average number of sequences per pack n, where n = num_sequences_in_dataset / num_packs (equivalently, n = packed_sequence_size / average_seq_len). This ensures that each gradient iteration sees, on average, the same number of tokens. The value of n is printed out during the data preparation step. You may need to run training once, obtain the value of n from the logs, then run your training script again with the updated global batch size. Alternatively, you can start with the smallest global batch size possible such that data_parallel_size = num_gpus (i.e., no gradient accumulation), and tune your global batch size gradually.
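The batch-size arithmetic above can be sketched with hypothetical numbers:

```python
# Illustrative arithmetic with hypothetical values: adjust the global batch
# size when switching to packed sequences so each gradient step sees roughly
# the same number of tokens as before.
average_seq_len = 512          # hypothetical average sequence length
packed_sequence_size = 2048
old_global_batch_size = 32

n = packed_sequence_size / average_seq_len        # avg sequences per pack
new_global_batch_size = int(old_global_batch_size / n)
# old: 32 sequences * 512 tokens  == new: 8 packs * 2048 tokens per step
```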
Now you are all set to fine-tune your model with much-improved throughput!
Run SFT/PEFT with Packed Sequences in NeVA#
In NeMo 2.0, you no longer need to pre-process datasets for sequence packing before use. Instead, packing is performed on the fly. Depending on the dataset type you are using, the implementation will vary slightly. Check out our example fine-tuning script.
```python
data = vlm.NevaMockDataModule(
    seq_length=2048,
    global_batch_size=128,
    micro_batch_size=4,
    tokenizer=None,
    image_processor=None,
    num_workers=4,
    packed_sequence=True,
)
```
Set packed_sequence to True to use packed sequences in NevaMockDataModule. If enabled, the entire micro batch (with micro_batch_size randomly generated samples) will be packed into one sequence with the THD layout [1]. The mock dataset is intended for quick tests and benchmarks.
```python
data_config = vlm.ImageDataConfig(
    image_folder="/path/to/image_folder",
    conv_template="v1",
)
data = vlm.PreloadedDataModule(
    paths="/path/to/data_json_file",
    data_config=data_config,
    seq_length=2048,
    decoder_seq_length=None,
    global_batch_size=128,
    micro_batch_size=4,
    tokenizer=None,
    image_processor=None,
    num_workers=4,
    packed_sequence=True,
    num_image_embeddings_per_tile=576,
)
```
Set packed_sequence to True to enable packed sequences in PreloadedDataModule. Similar to the mock module, if packing is enabled, the entire micro batch will be packed into one sequence with the THD layout [1]. Note that no additional selection algorithm is applied: since samples in each micro batch are randomly selected from the entire dataset, the behavior is equivalent to randomly selecting micro_batch_size samples. With sequence packing enabled, padding sequences to the same length within each micro batch is no longer necessary, potentially improving processing speed.
You can increase the micro_batch_size when enabling packed_sequence. As long as the global batch size (global_batch_size) remains consistent, convergence behavior should be similar. Note that when pipeline parallelism (PP) is enabled, packed sequences will still be truncated to seq_length to facilitate PP communication.
```python
config = vlm.MultiModalSampleConfig(
    image_token=vlm.ImageToken(token_str="<image>", token_id=-200),
    ignore_place_holder=-100,
    conversation_template_config=LLaVATemplateConfig(),
)
data = vlm.EnergonMultiModalDataModule(
    path="/path/to/energon_data",
    tokenizer=tokenizer,
    image_processor=image_processor,
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=32,
    num_workers=0,
    multimodal_sample_config=config,
    task_encoder=MultiModalTaskEncoder(
        tokenizer=tokenizer,
        image_processor=image_processor,
        multimodal_sample_config=config,
        packed_sequence=True,
        packed_sequence_size=8192,
        num_image_embeddings_per_tile=576,
    ),
    packing_buffer_size=200 if packed_sequence else None,
)
```
To use the Energon dataset, follow the instructions here to process your dataset into the Energon format. Set packed_sequence=True and specify packing_buffer_size to enable sequence packing. The Energon dataset uses on-the-fly packing, where each worker reads packing_buffer_size samples and packs them into sequences of size packed_sequence_size. Refer to the Energon user guide for more details on packing.
When using this dataset, set the micro batch size (micro_batch_size) to 1 and adjust the global batch size (global_batch_size) by dividing it by the average number of sequences per pack (n).
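Conceptually, the on-the-fly packing performed by each worker can be sketched as follows. The helper names and greedy strategy below are hypothetical; Energon's actual packing algorithm may differ.

```python
# Illustrative sketch: each worker buffers `buffer_size` samples from the
# stream, then greedily packs them into sequences of at most
# `packed_sequence_size` tokens. Not Energon's actual implementation.
def pack_on_the_fly(sample_lengths, buffer_size, packed_sequence_size):
    buffer = []
    for length in sample_lengths:
        buffer.append(length)
        if len(buffer) == buffer_size:
            yield from _pack_buffer(buffer, packed_sequence_size)
            buffer = []
    if buffer:  # flush any remaining samples
        yield from _pack_buffer(buffer, packed_sequence_size)

def _pack_buffer(lengths, pack_size):
    pack, total = [], 0
    for length in lengths:
        if pack and total + length > pack_size:
            yield pack           # current pack is full; start a new one
            pack, total = [], 0
        pack.append(length)
        total += length
    if pack:
        yield pack

packs = list(pack_on_the_fly([10, 20, 15, 5, 30, 8],
                             buffer_size=3, packed_sequence_size=32))
```

A larger buffer gives the packer more candidates per batch of samples and therefore tighter packs, at the cost of more worker memory.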
[1] For details, please see the TransformerEngine documentation here.
[2] Experiments were performed on Llama 7B with the Dolly dataset. Actual performance improvement depends on the dataset and model.