OpenMOSE/RWKV-LM-RLHF

A Reinforcement Learning toolkit for RWKV (v6, v7, ARWKV): distillation, SFT, RLHF (DPO, ORPO), infinite context training, and alignment. Exploring the possibilities for deeper fine-tuning of RWKV.

Supported architectures:
- RWKV v6 "Finch" (full feature support)
- RWKV v7 "Goose" (full feature support)
- RWKV v7 "Goose" + Mixture of LoRA Experts (SFT Support)
- ARWKV (MHLA + SwiGLU -> RWKV v7 TimeMix + SwiGLU) (SFT support)
RLHF Training with ORPO (Odds Ratio Preference Optimization), v6/v7:

A key advantage of ORPO is that it performs SFT and alignment simultaneously. By adjusting the orpo_alpha value, you can control the balance between the SFT and alignment objectives.

- Uses odds ratios instead of probability ratios to measure policy changes
- Can often achieve better performance with fewer training steps
- Designed to mitigate common RLHF issues such as reward hacking and overoptimization
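For intuition, here is a minimal PyTorch sketch of an ORPO-style objective. The function name, arguments, and the exact way orpo_alpha weights the two terms are assumptions for illustration; the repository's implementation may differ.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, sft_nll, orpo_alpha=0.1):
    """ORPO-style loss: an SFT term plus an odds-ratio preference term.

    chosen_logps / rejected_logps: mean per-token log-probabilities of the
    chosen and rejected responses under the policy (shape: [batch]).
    sft_nll: standard next-token NLL on the chosen response.
    """
    # log odds(y|x) = log p - log(1 - p), computed stably from log p
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio term: push the odds of the chosen response above the rejected one
    preference_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

    # orpo_alpha trades off plain SFT against the preference (alignment) term
    return sft_nll + orpo_alpha * preference_loss
```

Setting orpo_alpha to 0 reduces this to plain SFT, while larger values emphasise the alignment term.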
RLHF Training with DPO (Direct Preference Optimization), v6/v7:

Direct Preference Optimization (DPO) is a training method for large language models that aligns model outputs with human preferences more directly than traditional RLHF pipelines.

- Directly optimizes for human preferences without a separate reward model
- Simplifies the training pipeline compared to other preference-learning methods
- More stable training than RL-based methods
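For reference, a minimal PyTorch sketch of the standard DPO objective is shown below. Sequence-level log-probabilities are assumed to be precomputed; the names and the beta default are illustrative, not the repository's code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: widen the margin between chosen and rejected
    responses, measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Here beta controls how strongly the policy is penalised for drifting away from the reference model.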
Infinite Context Compression Distillation Training, v6/v7:

This approach trains a student model using compressed logits pre-acquired from a larger model (for example, a 14B parameter model) as soft labels, while simultaneously using the dataset itself as hard labels.

- Soft-label learning can significantly reduce the risk of overfitting.
- It allows specific tasks to be transferred efficiently to smaller models.
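As a rough illustration of how soft and hard labels can be combined, here is a minimal PyTorch sketch assuming the teacher's logits were compressed to a per-token top-k list. The function name, tensor shapes, and the kd_weight/temperature parameters are assumptions, not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, topk_values, topk_indices, hard_labels,
                      temperature=1.0, kd_weight=0.5):
    """Soft-label KD on pre-stored top-k teacher logits plus hard-label CE.

    student_logits: [batch, seq, vocab]
    topk_values / topk_indices: compressed teacher logits ([batch, seq, k])
    hard_labels: token ids from the dataset ([batch, seq])
    """
    # Gather the student's logits at the teacher's top-k vocabulary positions
    student_topk = torch.gather(student_logits, dim=-1, index=topk_indices)

    # Soft labels: KL divergence restricted (renormalised) to the top-k entries
    teacher_probs = F.softmax(topk_values / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_topk / temperature, dim=-1)
    kd = F.kl_div(student_logprobs, teacher_probs,
                  reduction="batchmean") * (temperature ** 2)

    # Hard labels: ordinary cross-entropy against the dataset tokens
    ce = F.cross_entropy(student_logits.flatten(0, 1), hard_labels.flatten())

    return kd_weight * kd + (1.0 - kd_weight) * ce
```

Because the soft labels spread probability over several plausible tokens, the student is less prone to overfitting the hard labels alone.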
Infinite Context Masked SFT with Smoothing Loss, v6/v7/ARWKV:

- Incorporating label smoothing into the loss calculation helps probability mass transfer to the target distribution.
- Dynamic masking keeps the loss computation efficient during multi-batch processing.
- On an RTX 4090, a 14B parameter model can be trained with a 65k context.
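As a sketch of the idea, here is a masked, label-smoothed cross-entropy in PyTorch; the function name and arguments are illustrative and not taken from the repository.

```python
import torch
import torch.nn.functional as F

def masked_smoothed_ce(logits, targets, loss_mask, smoothing=0.1):
    """Label-smoothed cross-entropy where masked positions (e.g. prompt or
    padding tokens) contribute nothing to the loss.

    logits: [batch, seq, vocab]; targets / loss_mask: [batch, seq]
    """
    per_token = F.cross_entropy(
        logits.flatten(0, 1),      # [batch*seq, vocab]
        targets.flatten(),         # [batch*seq]
        label_smoothing=smoothing,
        reduction="none",
    )
    mask = loss_mask.flatten().to(per_token.dtype)
    # Average only over the unmasked target tokens
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

In this sketch, prompt and padding positions carry a mask value of 0, so only response tokens contribute to the gradient.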
- Bone (Block Affine Transformation):
  - A new training method proposed by @Jl-er
  - No complex initialization is required, and it achieves fast convergence and a better fit to the data.
- DoRA (Weight-Decomposed Low-Rank Adaptation) NEW!:
  - DoRA decomposes the pre-trained weight into two components, magnitude and direction, and fine-tunes both, employing LoRA for the directional update to keep the number of trainable parameters small (a minimal sketch follows this list).
  - Training speed comparable to LoRA
- LoRA:
  - You all already know this, right? :)
- Quantization:
  - FP8: fastest; native FP8 matmul
  - FP6: via TorchAO; matmul slightly faster than bitsandbytes
  - Int8: 8-bit quantization with bitsandbytes; 16-bit matmul
  - NF4: 4-bit quantization with bitsandbytes; 16-bit matmul (a rough NF4 example follows this list)
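Below is a minimal, self-contained sketch of the DoRA decomposition referenced above. The class name, rank default, and initialisation details are assumptions for illustration, not the repository's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Illustrative DoRA-style linear layer: the frozen weight is split into
    magnitude and direction, LoRA updates the direction, and only the
    magnitude vector plus the LoRA factors are trained (bias omitted)."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach(), requires_grad=False)
        out_f, in_f = self.weight.shape
        # LoRA factors producing the directional update (B @ A has W's shape)
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))
        # Trainable magnitude, initialised to the column-wise norms of W
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=0, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.weight + self.lora_B @ self.lora_A    # updated direction
        v = v / v.norm(p=2, dim=0, keepdim=True)       # column-normalise
        return F.linear(x, self.magnitude * v)         # rescale and apply
```

Only magnitude, lora_A, and lora_B receive gradients, which keeps the trainable parameter count close to plain LoRA and the training speed comparable.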
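And here is a rough sketch of the NF4 mode's concept: quantize the frozen weight once, then dequantize to 16-bit just before each matmul. It uses the public bitsandbytes quantize_4bit/dequantize_4bit helpers; the shapes and the forward function are hypothetical.

```python
import torch
import bitsandbytes.functional as bnbF

# Hypothetical frozen weight stored in NF4; the matmul itself stays in bf16.
weight = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# Quantize once at load time; only the packed tensor + quant state are kept.
q_weight, quant_state = bnbF.quantize_4bit(weight, quant_type="nf4")

def forward(x: torch.Tensor) -> torch.Tensor:
    # Dequantize back to bf16 on the fly, then run a normal 16-bit matmul.
    w = bnbF.dequantize_4bit(q_weight, quant_state).to(torch.bfloat16)
    return x @ w.t()
```

The weight is stored at roughly 4 bits per parameter, while the matmul still runs in 16-bit precision, matching the "NF4 / 16-bit matmul" mode above.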
- What is next?
The rank can be set per layer; see the layer_profile documentation.
Requirements:

- CPU RAM >= 32GB (128GB recommended for stress-free operation)
- CUDA or ROCm GPU (NVIDIA RTX 3090/4090, AMD MI100)
- CUDA 12.4+ or ROCm 6.1.2+
- Python 3.12+
- PyTorch 2.5+
- Bitsandbytes (on MI100 with ROCm 6.1.3, it must be built from source)
- In some cases, `conda install libstdcxx -c conda-forge --override-channels` is needed to build the CUDA kernel
Documentation:

- Odds Ratio Preference Optimization document
- Direct Preference Optimization document
- Distillation document
- SFT document
- Layer Profile
- Iterative DPO, ORPO -> suspended
- Reinforcement++ -> working :)
- bnb 8-bit optimizer -> done
- TorchAO FP6 matmul -> testing
- Realtime Training (with Reverse State Distillation)
Acknowledgements:

- RWKV-LM @BlinkDL
- RWKV-LM-RLHF-DPO @Triang-jyed-driung
- RWKV-PEFT @Jl-er
- Flash-Linear-Attention @fla-org
- SmerkyG
- Orpo @xfactlab
- All RWKV Community Members :)
License: Apache 2.0, same as RWKV-LM.

© 2025 OpenMOSE