Supported PEFT Methods#
NeMo 2.0#
NeMo 2.0 supports the following PEFT tuning methods:
LoRA: Low-Rank Adaptation of Large Language Models
LoRA makes fine-tuning efficient by representing weight updates with two low-rank decomposition matrices. The original model weights remain frozen, while the low-rank decomposition matrices are updated to adapt to the new data, keeping the number of trainable parameters low. In contrast with adapters, the original model weights and adapted weights can be combined during inference, avoiding any architectural change or additional latency in the model at inference time.
In NeMo, you can customize the adapter bottleneck dimension and the target modules to apply LoRA. LoRA can be applied to any linear layer. In a transformer model, this includes 1) the Q, K, V attention projections, 2) the attention output projection layer, and 3) either or both of the two transformer MLP layers. For QKV, NeMo’s attention implementation fuses QKV into a single projection, so our LoRA implementation learns a single low-rank projection for the combined QKV.
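The mechanics of the frozen-weight-plus-low-rank-update design can be sketched with plain NumPy (a minimal illustration, not NeMo's implementation; the sizes and the `alpha / rank` scaling follow the original LoRA paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 8, 16  # illustrative sizes, not NeMo defaults

W = rng.standard_normal((d_out, d_in))        # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # zero-init: adapter starts as a no-op

def lora_forward(x):
    # Frozen path plus the scaled low-rank update B @ A.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d_in))
# With B = 0, the adapted output equals the frozen output.
assert np.allclose(lora_forward(x), x @ W.T)

# After training, the update can be merged into W, so inference
# needs no extra layers and incurs no additional latency.
W_merged = W + (alpha / rank) * (B @ A)
assert np.allclose(lora_forward(x), x @ W_merged.T)
```

Merging `B @ A` into `W` is what distinguishes LoRA from adapter modules, which remain separate layers at inference time.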
DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA decomposes the pre-trained weight into magnitude and direction. It learns a separate magnitude parameter while employing LoRA for directional updates, efficiently minimizing the number of trainable parameters. DoRA enhances both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA has been shown to consistently outperform LoRA on various downstream tasks.
In NeMo, DoRA leverages the same adapter structure as LoRA. NeMo adds support for Tensor Parallelism and Pipeline Parallelism for DoRA, enabling DoRA to be scaled to larger model variants.
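The magnitude/direction decomposition can be illustrated with a small NumPy sketch (an illustration of the idea from the DoRA paper, not NeMo's implementation; the magnitude is one learned scalar per weight column, and the direction is the column-normalized LoRA-updated weight):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 32, 64, 4  # illustrative sizes

W0 = rng.standard_normal((d_out, d_in))          # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01     # LoRA factors for the directional update
B = np.zeros((d_out, rank))
m = np.linalg.norm(W0, axis=0, keepdims=True)    # trainable magnitude, init to column norms

def dora_weight():
    # Direction: LoRA-updated weight, normalized per column;
    # magnitude: the separately learned vector m.
    V = W0 + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

# At initialization (B = 0), DoRA reproduces the frozen weight exactly.
assert np.allclose(dora_weight(), W0)
```

Because the final weight can be materialized once after training, DoRA, like LoRA, adds no inference overhead.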
NeMo 1.0 (Legacy)#
QLoRA: Efficient Finetuning of Quantized LLMs
Similar to LoRA, QLoRA keeps the original model weights frozen while introducing low-rank adapters for customization. However, QLoRA goes a step further by quantizing the frozen linear weights with a custom 4-bit data type called Normal Float 4 (NF4). The adapters are identical to those of LoRA and are kept in BF16.
Compared to LoRA, QLoRA is up to 60% more memory-efficient, allowing for fine-tuning large models with fewer or smaller GPUs and/or a higher batch size. QLoRA is able to achieve the same accuracy, although a different convergence recipe may be required. However, the drawback is that QLoRA training is slower than LoRA by 50% to 200%.
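To show where the memory savings come from, here is a simplified block-wise 4-bit quantization sketch in NumPy. Note this is not the actual NF4 data type, which places its 16 levels at quantiles of a normal distribution; this sketch uses a uniform symmetric grid purely to illustrate the mechanics of quantizing frozen weights with per-block scales:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(w, block_size=64):
    # Per-block absmax scale (kept in higher precision), 4-bit signed levels [-7, 7].
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    q = np.round(blocks / absmax * 7).astype(np.int8)
    return q, absmax

def dequantize_4bit(q, absmax):
    return (q.astype(np.float32) / 7 * absmax).reshape(-1)

w = rng.standard_normal(4096).astype(np.float32)
q, absmax = quantize_4bit(w)
w_hat = dequantize_4bit(q, absmax)

# 16 levels per block is coarse, but storage drops from 16/32 bits
# to roughly 4 bits per weight (plus one scale per block).
assert q.min() >= -7 and q.max() <= 7
assert np.abs(w - w_hat).max() <= np.abs(w).max() / 7
```

During QLoRA training, these quantized frozen weights are dequantized on the fly for the forward pass, which is the main source of the slowdown relative to LoRA.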
For more details, please visit the NeMo QLoRA Guide.
P-Tuning: GPT Understands, Too
P-Tuning is an example of the prompt learning family of methods, in which trainable virtual tokens are inserted into the model input prompt to induce it to perform a task. Virtual tokens (also called “continuous” or “soft” tokens) are embeddings that have no concrete mapping to strings or characters within the model’s vocabulary. They are simply 1D vectors that match the dimensionality of the real tokens which make up the model’s vocabulary.
In P-Tuning, an intermediate MLP model is used to generate virtual token embeddings. We refer to this intermediate model as our prompt_encoder. The prompt encoder parameters are randomly initialized at the start of p-tuning. All base model parameters are frozen, and only the prompt encoder weights are updated at each training step. In NeMo, you can customize the number of virtual tokens, as well as the embedding and MLP bottleneck dimensions.
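The prompt encoder idea can be sketched in NumPy (a minimal illustration under assumed shapes, not NeMo's prompt_encoder implementation: a small MLP maps randomly initialized seed embeddings to virtual-token embeddings that are prepended to the real input embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
num_virtual_tokens, hidden, bottleneck = 10, 64, 32  # illustrative sizes

# Only these parameters are trained; the base model stays frozen.
seed = rng.standard_normal((num_virtual_tokens, hidden))  # randomly initialized
W1 = rng.standard_normal((hidden, bottleneck)) * 0.02
W2 = rng.standard_normal((bottleneck, hidden)) * 0.02

def prompt_encoder():
    # MLP with a bottleneck and non-linearity produces the virtual-token embeddings.
    h = np.tanh(seed @ W1)
    return h @ W2

def build_input(token_embeddings):
    # Prepend virtual tokens to the (frozen) real-token embeddings.
    return np.concatenate([prompt_encoder(), token_embeddings], axis=0)

real = rng.standard_normal((5, hidden))  # embeddings of 5 real prompt tokens
full = build_input(real)
assert full.shape == (num_virtual_tokens + 5, hidden)
```

The base model then processes `full` exactly as if the virtual tokens were ordinary input tokens; gradients flow only into the prompt encoder and the seed embeddings.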
Adapters (Canonical): Parameter-Efficient Transfer Learning for NLP
Adapter tuning (the Houlsby setup) is one of the first PEFT methods applied to NLP. It is more efficient than full fine-tuning because the base model weights are frozen, while only a small number of adapter module weights are updated. In this method, two linear layers with a bottleneck and a non-linear activation are inserted into each transformer layer via a residual connection. In each case, the output linear layer is initialized to 0 to ensure that an untrained adapter does not affect the normal forward pass of the transformer layer.
In NeMo, you can customize the adapter bottleneck dimension, the adapter dropout amount, as well as the type and position of the normalization layer.
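A minimal NumPy sketch of one adapter module (illustrative shapes, not NeMo's implementation; normalization and dropout are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 64, 8  # illustrative sizes

W_down = rng.standard_normal((hidden, bottleneck)) * 0.02  # down-projection
W_up = np.zeros((bottleneck, hidden))                      # output layer initialized to 0

def adapter(h):
    # Bottleneck + non-linearity, added back via a residual connection.
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.standard_normal((3, hidden))
# With W_up = 0, an untrained adapter is an identity map on the residual stream,
# so inserting it does not perturb the pre-trained model's forward pass.
assert np.allclose(adapter(h), h)
```

Unlike LoRA, the adapter remains a separate module at inference time, which is why it adds a small amount of latency.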
IA3: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
IA3 makes fine-tuning efficient by rescaling activations with learned vectors. The rescaling layers are injected in the attention (for key and value) and feedforward modules of the base model. Similar to other PEFT methods, only the rescaling vectors are updated during fine-tuning to adapt to the new data, so the number of updated parameters is low. However, since rescaling vectors are much smaller than low-rank matrices (LoRA) and bottleneck layers (Adapters), IA3 cuts down the number of trainable parameters further by an order of magnitude. The learned rescaling vectors can also be merged with the base weights, leading to no architectural change and no additional latency at inference time.
There is no hyperparameter to tune for the IA3 adapter.
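The rescaling operation can be sketched in NumPy (an illustration with assumed shapes, not NeMo's implementation; one learned vector each for keys, values, and the feedforward hidden activation, all initialized to ones so the model starts unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_k, d_ff = 4, 64, 256  # illustrative sizes

# The only trainable parameters: three rescaling vectors, initialized to ones.
l_k = np.ones(d_k)
l_v = np.ones(d_k)
l_ff = np.ones(d_ff)

K = rng.standard_normal((seq, d_k))        # key activations
V = rng.standard_normal((seq, d_k))        # value activations
ff_hidden = rng.standard_normal((seq, d_ff))  # feedforward hidden activations

# IA3 simply rescales activations elementwise.
K_scaled, V_scaled, ff_scaled = K * l_k, V * l_v, ff_hidden * l_ff

# Trainable parameters: 2 * d_k + d_ff = 384 scalars in this sketch,
# far fewer than a pair of low-rank matrices; at init the model is unchanged.
assert np.allclose(K_scaled, K) and np.allclose(ff_scaled, ff_hidden)
```

Because each vector multiplies the output of a fixed linear layer, the trained vectors can be folded into that layer's weights, which is why IA3 adds no inference latency.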